### Beginning Text Preparation

In order to perform text analysis, there are a few R commands you should have up your sleeve. Some of the commands help get you set up and locate all of the files in your corpora. Other commands can be used throughout the programming process to check on your algorithm and make sure everything looks the way you think it should. Learning the following commands will give you a brief introduction to R while also setting you up with a solid toolkit to begin programming.


### Loading files from GitHub to Carbonate

Since we will be using Carbonate as the example for our file paths and other elements in the scripts and Jupyter notebooks, you may want to save these files to Carbonate to make following along easier. To do this you will need a Carbonate account. Indiana University students, faculty, staff, and sponsored affiliates can request one; steps on how to do so can be found [here](https://kb.iu.edu/d/aolp). Once you have your account, you'll want to be able to access the [Research Desktop (RED)](https://kb.iu.edu/d/apum), which you reach through the [ThinLinc client](https://kb.iu.edu/d/aput). Once you can access RED, acquire the [Cyber DH Box Text-Analysis](https://iu.box.com/v/Text-Analysis) repository and save it to Carbonate via Research Desktop, or download it to Carbonate from our [GitHub Repository](https://github.com/cyberdh/Text-Analysis). Whichever option you use, it will download as a .zip file, so you will need to double click the .zip file to extract the repository.

The nice thing about RED is that it comes with a built-in way to access your Box account. You can download the repository to Box and then use the Text-Analysis notebooks and scripts on both Carbonate and your own computer, without needing an SFTP client or some other means of moving files back and forth.

To use Box on RED, go to Applications > Storage > Box setup and follow the instructions. You can also get help [here](https://kb.iu.edu/d/apxv#storage).

Now that you have Carbonate, RED, and Box up and running, let's make sure you can run the notebooks and scripts.

### Using Jupyter Notebooks and R Studio
These R Notebooks are actually .ipynb files, which means they run in Jupyter Notebook. Luckily, on RED, there is a Jupyter Notebook icon and the R kernel needed to run .ipynb notebooks in R is already installed. Currently RED is running R version 3.3.1. 

Look in the top left corner of RED and go to Applications > Analytics > Jupyter Notebook and double click Jupyter Notebook. When Jupyter Notebook opens (which might take a minute), you should see a list of files and folders on the left-hand side. One of these should say Text-Analysis if you successfully downloaded and unzipped the repository. Then go to Text-Analysis > R > RNotebooks and choose the notebook you wish to use. It should open in Jupyter Notebook.

To run the R scripts (same code as the notebooks, but with very little explanation) you will need to use R Studio. The really nice thing about RED is that it comes with R Studio pre-installed as well. Go to Applications > Analytics > R Studio (which should be at the bottom of the list). Double click R Studio and it should start right up. Now go to File > Open File > Text-Analysis > R > Scripts and choose the R script you wish to use.

### Let's get started

First, we need to set our working directory with setwd(). The working directory is the folder R starts from when it looks for files, so any relative path in our code is read relative to it. In our case we have multiple folders that we may want to use at some point, but they are all contained in our Text-Analysis folder, so we will set that as our working directory.

In [1]:
setwd("~/Text-Analysis")
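
If you are not sure the working directory was set correctly, a quick check can help. The sketch below assumes the repository was unzipped to ~/Text-Analysis as described above; the variable names project_dir and current_dir are only illustrative.

```r
# Hypothetical variable name; the path assumes the repository was unzipped
# into your home directory, as in the setwd() call above.
project_dir <- "~/Text-Analysis"

if (dir.exists(project_dir)) {
  setwd(project_dir)   # switch to the project folder
} else {
  message("Folder not found: ", path.expand(project_dir))
}

current_dir <- getwd()   # the folder R is currently working in
print(current_dir)
print(list.files())      # files and folders R can see from here
```

If the folder is missing, the message shows the full path R tried, which is usually enough to spot a typo or a repository that was extracted somewhere else.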

### Include necessary packages for notebook 

R's extensibility comes in large part from packages. Packages are groups of functions, data, and algorithms that allow users to easily carry out processes without reinventing the wheel. Some packages are included in the basic installation of R; others, created by R users, are available for download. Make sure to have the following packages installed before beginning so that they can be accessed while running the scripts.

In R Studio, packages can be installed by navigating to Tools in the menu bar > Install Packages. Or in the bottom right panel click on the "packages" tab and then click on "install."
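
Packages can also be installed from code, which is handy in a Jupyter notebook where the R Studio menus are not available. The following is a minimal sketch that checks which of the three packages used in this notebook are missing; the variable names are illustrative, and the install line is left commented out because it needs an internet connection and write access to your R library.

```r
# The three packages this notebook loads.
needed <- c("tm", "NLP", "ggplot2")

# Names of every package already installed in your R library.
installed <- rownames(installed.packages())

# Which of the needed packages are missing?
to_install <- setdiff(needed, installed)
print(to_install)

# Uncomment the next line to install the missing packages
# (requires an internet connection).
# if (length(to_install) > 0) install.packages(to_install)
```

If to_install prints as character(0), everything is already in place and nothing needs to be installed.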

### Load Packages

In order to access the packages you have installed within the environment in which you are currently working, you must load them at the beginning of your script. To load packages, use the library() command (see code below).

The three packages listed are often used for text analysis in R:

- **tm:** A framework for text mining applications within R.

- **NLP:** Basic classes and methods for Natural Language Processing.

- **ggplot2:** A system for 'declaratively' creating graphics,
    based on "The Grammar of Graphics". You provide the data, tell 'ggplot2'
    how to map variables to aesthetics, what graphical primitives to use,
    and it takes care of the details.

In [2]:
library(tm)
library(NLP)
library(ggplot2)

Loading required package: NLP

Attaching package: 'ggplot2'

The following object is masked from 'package:NLP':

    annotate



To perform topic modeling, you may load lsa or lda. For data visualization, a popular package besides ggplot2 is dendextend to create a cluster dendrogram. You can peruse the various contributed packages [here](https://cran.r-project.org/web/packages/available_packages_by_name.html). The tutorials included in the R Toolkit will instruct you on which packages to install and load.

### Load data 

Now you are ready to start looking at your data! First, you must load it into your environment. The scan() function will do this for you. To load a single text, scan() needs three arguments:

- The first argument is the filename (or the path, if the file resides in a different directory than your working directory).
- The second argument, what, is set to "character" so that the text is read in as a character vector.
- The third argument, sep, is set to "\n" (a backslash followed by n), which is how a line break is written in R, so each line of the file becomes one element of the vector.

Putting everything together, this line reads in the first episode of Star Trek: The Next Generation, which is 102.txt, and separates the text into a character vector with one element per line.

In [3]:
text <- scan("data/StarTrekNextGenClean/season1/102.txt", what="character", sep="\n")

The final, crucial aspect of this line is the assignment. "<-" assigns the result of the expression on the right side of the arrow to the variable named on the left side. Some programming languages use "=" instead of the arrow. R also accepts "=", but using the arrow is considered best practice.
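
To see what scan() and the assignment do without needing the Star Trek files, here is a small self-contained sketch. It writes a made-up three-line file to a temporary location and reads it back with the same arguments; the file contents and variable names are purely illustrative.

```r
# Write a small made-up file to a temporary location (illustrative data only).
demo_file <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_file)

# Same arguments as above; quiet = TRUE only suppresses the "Read 3 items" message.
demo_text <- scan(demo_file, what = "character", sep = "\n", quiet = TRUE)
print(demo_text)

# "=" also assigns at the top level and gives the same result as "<-".
demo_text_eq = scan(demo_file, what = "character", sep = "\n", quiet = TRUE)
print(identical(demo_text, demo_text_eq))
```

Each line of the file ends up as its own element, so demo_text[2] holds the second line.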

Here, we have named that variable "text" since it holds the text with which we are working. However, you can name this variable whatever you would like. This line will give the exact same result, although it is best to name the variable in relation to what it holds:

In [4]:
potatoes <- scan("data/StarTrekNextGenClean/season1/102.txt", what="character", sep="\n")

Now that we have the text saved as a variable, we can reuse that variable simply by calling "text" instead of the entire scan line again. The next few commands will use "text" to explore the data.


### R Objects

R is distinct from other programming languages in that it handles objects a little differently. Throughout text analysis, you will need to massage your text and textual data by changing it into various kinds of objects which make things easier. Check out [this tutorial](https://www.nealgroothuis.name/introduction-to-data-types-and-objects-in-r/) by Neal Groothuis, which explains the various types of data objects in simple terms. Since some kinds of objects prohibit certain actions, a simple way to check the type of object you are currently working with is class(). For example, the code below shows that the class of our Star Trek "text" is "character", meaning it is a character vector, which is what we would expect since that is what we specified while loading it in.

In [5]:
class(text)
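
For comparison, the short sketch below calls class() on a few other kinds of objects. The vectors are made-up examples, not part of the corpus.

```r
words  <- c("make", "it", "so")   # character vector
counts <- c(3L, 1L, 2L)           # integer vector (the L suffix marks integers)
ratios <- c(0.5, 0.25)            # numeric (double) vector
mixed  <- list(words, counts)     # list holding two different vectors

print(class(words))    # "character"
print(class(counts))   # "integer"
print(class(ratios))   # "numeric"
print(class(mixed))    # "list"

# is.character() answers the yes/no version of the same question.
print(is.character(words))
```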

### Data Inspection

Just as you may want to verify the type of the object with which you are working, you may want to view it from other angles to make sure the data is formed as you expect.

The length function shows the number of elements within an object. If you find that the length is zero, you may have to go back and reload the data, or check to make sure your algorithm is working correctly.

In [6]:
length(text)

We can also look at individual elements. Let's see what the first line of our text is...

In [7]:
text[1]

Or perhaps you would like to see the first few elements:


In [8]:
head(text)

Or the last few elements:

In [9]:
tail(text)

There are many more ways to inspect parts of data (check out CRAN), but these quick checks are helpful while manipulating the data and debugging the inevitable issues you will encounter while developing your script.
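
A few variations on these checks are sketched below using a made-up character vector, so the block runs on its own. By default head() and tail() return six elements; the n argument changes that.

```r
# Illustrative stand-in for a text that was read in line by line.
sample_text <- c("line one", "line two", "line three", "line four",
                 "line five", "line six", "line seven", "line eight")

print(head(sample_text, n = 3))   # first three elements
print(tail(sample_text, n = 2))   # last two elements
print(sample_text[2:4])           # elements two through four
print(nchar(sample_text[1]))      # number of characters in the first element
str(sample_text)                  # compact summary: type, length, first values
```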

### Explore

The above commands are a few tips and tricks to get you started with R. Just as packages make R extensible, the R user community provides great resources for learners. The [CRAN FAQ](https://cran.r-project.org/) and the [CRAN Manual](https://cran.r-project.org/doc/manuals/R-intro.pdf) answer quite a few questions about R and its uses.

Googling the issue, function, or object name along with "r" will return helpful resources. If a PDF from cran.r-project.org appears, it will contain extensive documentation and examples for that function, along with related resources. Similarly, any result from r-bloggers.com will most likely be helpful. For any other issues, Stack Overflow is a good place to find answers to common questions as well as to ask your own.

The rest of the IU tutorials explain some methods for textual analysis using R. If you are ready to dive in, click on one to begin!