# Co-Occurrence of Words in an Individual Shakespeare Play

This co-occurrence script finds the words that appear near a word of interest in an individual Shakespeare play. You choose a word and a distance, where distance is the number of words before and after the chosen word (so if you choose 10, the script looks at the 10 words before and the 10 words after each occurrence of the word of interest). For every word found within that distance, the script reports how many times it appears, its furthest distance from the chosen word (within the specified limit), its shortest distance from the chosen word, and its average distance.
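
To make those four statistics concrete, here is a minimal base R sketch on a made-up sentence. The `toy_words` vector, the focus word, and the `window` of 2 are illustrative assumptions; the script in this notebook uses tidytext and fuzzyjoin rather than this base R approach.

```r
# Toy example (hypothetical data): which words fall within 2 positions of "love"?
toy_words <- c("my", "love", "is", "deep", "the", "more", "i", "love", "thee")
window <- 2

focus_positions <- which(toy_words == "love")   # positions 2 and 8

# Pair every focus position with every word position, then keep the close pairs
pairs <- expand.grid(focus = focus_positions, position = seq_along(toy_words))
pairs$distance <- abs(pairs$focus - pairs$position)
pairs <- pairs[pairs$distance <= window, ]
pairs$word <- toy_words[pairs$position]

# Count, furthest, nearest, and mean distance for each nearby word
toy_summary <- aggregate(
  distance ~ word, data = pairs,
  FUN = function(d) c(number = length(d), maximum = max(d),
                      minimum = min(d), average = mean(d))
)
print(toy_summary)
```

Note that the focus word itself shows up in the result with a distance of 0, just as `love` does in the Hamlet results at the end of this notebook.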

#### Global Parameters

You will need to set up a [Karst account](https://kb.iu.edu/d/bezu#account) first. Once you have your Karst account, go to [rstudio.iu.edu](https://rstudio.iu.edu/auth-sign-in) and log in using your IU username and passphrase. Next, set the working directory by pointing to the location on Karst where you have stored the files. Below, we have saved the folder "Text-Analysis" as a "Project" in RStudio on the Karst supercomputer at Indiana University. It contains the R scripts, texts, notebooks, and results. If you have forked and cloned the GitHub repository (see [textPrep.Rmd](textPrep.Rmd) for directions), simply point to where you have saved the folder. If you saved it to your personal Karst folder, the path will most likely look very similar to the example below. Karst is a Unix server, so the home directory is represented by a ~ and the path will look like "~/Text-Analysis/" (with the quotes). Alternatively, if you are on a PC, you will need to use an absolute path such as "C:/Users/XXX" (with the quotes again).

In RStudio, click Session in the menu bar > Set Working Directory > Choose Directory, then select the Text-Analysis directory in which you are working. This sets your working directory in the console pane, but make sure to copy the path into the source pane above so that the directory stays constant if you close this script and reopen it later. Make sure you click on the blue cube with an "R" in the center to set your working directory to your Text-Analysis project path.

HINT: Your working directory is the folder from which you will be pulling your texts.

In [1]:
setwd("~/Text-Analysis")
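
If you are unsure whether the working directory is set correctly, a quick check like the sketch below can help. It only assumes the two relative file paths that later cells in this notebook read; if either comes back `FALSE`, the working directory is probably not pointing at your Text-Analysis folder.

```r
# Show the directory R is currently working in
print(getwd())

# Relative paths read by later cells in this notebook
needed_files <- c("data/shakespeareFolger/Hamlet.txt",
                  "data/earlyModernStopword.txt")

# TRUE means the file can be found from the current working directory
found <- file.exists(needed_files)
print(data.frame(file = needed_files, found = found))
```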

#### Include necessary packages for notebook

R's extensibility comes in large part from packages. Packages are groups of functions, data, and algorithms that allow users to easily carry out processes without reinventing the wheel. Some packages are included in the basic installation of R; others, created by R users, are available for download. Make sure to have the following packages installed before beginning so that they can be accessed while running the scripts.

In RStudio, packages can be installed by navigating to Tools in the menu bar > Install Packages. Alternatively, click on the "Packages" tab in the bottom right panel and then click "Install."

The following packages are used within the co-occurrence script:

tidytext - Text mining for word processing and sentiment analysis using 'dplyr', 'ggplot2', and other tidy tools.

dplyr - A tool for working with data-frame-like objects, both in memory and out of memory.

fuzzyjoin - Join tables together based not on whether columns match exactly, but whether they are similar by some comparison. Implementations include string distance and regular expression matching.

tm - this package provides tools (functions) for performing various types of text mining. In this script, we will use tm to perform text cleaning in order to have uniform data for analysis. Check out [this link](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) for the documentation!
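
As an alternative to the menus, the sketch below checks from the console which of the four packages are already installed. The variable names are illustrative, and the `install.packages()` call is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Packages this notebook loads in the next cell
needed_pkgs <- c("tidytext", "dplyr", "fuzzyjoin", "tm")

# requireNamespace() returns FALSE (rather than stopping) if a package is missing
installed <- vapply(needed_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed_pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment the next line to install them from CRAN
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}
```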

In [2]:
library(tidytext)
library(dplyr)
library(fuzzyjoin)
library(tm)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: NLP


#### Create a corpus

In [3]:
corpus <- scan("data/shakespeareFolger/Hamlet.txt", what="character", sep="\n")
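
To see what `scan()` returns with these arguments, the sketch below writes two made-up lines to a temporary file and reads them back with the same call shape. The file and its contents are placeholders, not part of the Folger data.

```r
# Write two made-up lines to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("Who's there?", "Nay, answer me."), tmp)

# Same call shape as the cell above: one character string per line of the file
toy_corpus <- scan(tmp, what = "character", sep = "\n")

print(length(toy_corpus))
print(toy_corpus)
```

The result is a plain character vector with one element per line, which is the form the cleaning functions below operate on.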

#### Create early modern stopword list

In [4]:
myStopWords <- scan("data/earlyModernStopword.txt", what="character", sep="\n")

#### Clean corpus

Clean the corpus by:
- Lower casing every word (so 'Love' and 'love' are counted as the same word).
- Removing punctuation.
- Stripping whitespace (this means collapsing gaps larger than a single space).
- Removing stopwords.

In [5]:
mycorpus <- tolower(corpus)
mycorpus <- removePunctuation(mycorpus)
mycorpus <- stripWhitespace(mycorpus)
mycorpus <- removeWords(mycorpus, myStopWords)
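
The sketch below runs the same four steps on a single made-up line so that the effect of each is visible. The `toy_stops` vector is a tiny stand-in, not the contents of earlyModernStopword.txt.

```r
library(tm)

# Made-up line and a tiny stand-in stopword list (hypothetical values)
toy_line  <- "Doubt thou the   stars are fire; but never doubt I LOVE."
toy_stops <- c("thou", "the", "are", "but", "i")

cleaned <- tolower(toy_line)                 # "Doubt" and "LOVE" become lower case
cleaned <- removePunctuation(cleaned)        # drops ";" and "."
cleaned <- stripWhitespace(cleaned)          # collapses runs of spaces to one
cleaned <- removeWords(cleaned, toy_stops)   # deletes whole-word matches only

# Split on whitespace to see the surviving words
toy_tokens <- strsplit(trimws(cleaned), "\\s+")[[1]]
print(toy_tokens)
```

The order matters here: `removeWords()` matches case-sensitively, so in this sketch the capital "I" is only caught by the lower-case entry in `toy_stops` because `tolower()` ran first.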

#### Tokenize the corpus

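Tokenize the corpus into a data frame where each row is one word. Then add a position column, and lastly remove standard English stopwords (a different list from our early modern stopword list). Because positions are assigned before this last filter, the English stopwords that are dropped still count toward the distance between the remaining words, whereas the early modern stopwords removed during cleaning do not.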

In [6]:
all_words <- tibble(text = mycorpus) %>% 
  unnest_tokens(word, text) %>%
  mutate(position = row_number()) %>%
  filter(!word %in% tm::stopwords("en"))
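
The following sketch applies the same pipeline to two made-up lines to show how positions behave once stopwords are filtered out. The `toy_text` values are assumptions chosen so that a few words fall in tm's English stopword list.

```r
library(dplyr)
library(tidytext)

# Made-up text; "the", "of", and "and" are in tm's English stopword list
toy_text <- tibble(text = c("the sweetest thing", "of love and fear"))

toy_tokens <- toy_text %>%
  unnest_tokens(word, text) %>%            # one row per word
  mutate(position = row_number()) %>%      # number the words before filtering
  filter(!word %in% tm::stopwords("en"))   # drop stopwords, keep original positions

print(toy_tokens)
```

The surviving words keep their original position numbers, so the gaps left by the filtered stopwords are preserved: "love" and "fear" remain two positions apart even though the "and" between them is gone.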

#### Input your parameters

Search through all the words of your corpus for your word of choice and mark the position of each occurrence. Then choose how far out from each occurrence you wish to look (`max_dist`), and compute the distance from the focus word to every word within that range.

In [7]:
#Choose your word of interest
nearby_words <- all_words %>%
  filter(word == "love") %>%
  #To consider more than one word at a time, replace the filter() line above with:
  #filter(word %in% c("father", "good")) %>%
  select(focus_term = word, focus_position = position) %>%
  #Choose the distance you wish to analyze
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 10) %>%
  mutate(distance = abs(focus_position - position))
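
To illustrate what `difference_inner_join()` is doing, here is a small sketch on a hypothetical seven-word table with `max_dist = 1`. Each occurrence of the focus word is paired with every row of the word table whose position differs from it by at most `max_dist`.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical word table: "love" occurs at positions 2 and 6
toy_words <- tibble(
  word     = c("sweet", "love", "grows", "cold", "true", "love", "lasts"),
  position = 1:7
)

toy_nearby <- toy_words %>%
  filter(word == "love") %>%
  select(focus_term = word, focus_position = position) %>%
  # keep pairs where |focus_position - position| <= 1
  difference_inner_join(toy_words, by = c(focus_position = "position"), max_dist = 1) %>%
  mutate(distance = abs(focus_position - position))

print(toy_nearby)
```

Each focus occurrence contributes one row per neighbor, including a row for itself at distance 0. The word "cold" at position 4 is two positions from either occurrence, so it is left out.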


#### Compute and display your results

In [8]:
words_summarized <- nearby_words %>%
  group_by(word) %>%
  #When looking at more than one focus word, replace the group_by() line above with:
  #group_by(focus_term, word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(desc(number))
write.csv(words_summarized, file = "ChooseAnyNameYouWant.csv")
print(words_summarized)

# A tibble: 731 x 5
    word number maximum_distance minimum_distance average_distance
   <chr>  <int>            <dbl>            <dbl>            <dbl>
 1  love    105               10                0         2.076190
 2  lord     27               10                1         5.148148
 3  fear     17               10                1         5.352941
 4  time     16                9                2         5.312500
 5 great     14                9                1         4.428571
 6  know     13               10                1         4.384615
 7 think     12               10                1         5.666667
 8  dear     10               10                1         5.300000
 9   let      9               10                1         6.222222
10   man      9               10                1         4.111111
# ... with 721 more rows
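
In the output above, `love` has a minimum distance of 0 because every occurrence of the focus word is also matched with itself. If those self-matches are not wanted, one option is to filter them out before summarizing. The sketch below shows this on a hypothetical stand-in for `nearby_words`; the words and distances are made up.

```r
library(dplyr)

# Hypothetical stand-in for nearby_words: one row per (focus occurrence, nearby word)
toy_nearby <- tibble(
  word     = c("love", "lord", "fear", "love", "lord", "dear"),
  distance = c(0, 3, 7, 0, 5, 1)
)

toy_summarized <- toy_nearby %>%
  filter(distance > 0) %>%   # optional: drop the focus word matched with itself
  group_by(word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(desc(number))

print(toy_summarized)
```

With the filter in place, a focus word only appears in the summary when another occurrence of it falls within the window, so its count reflects genuine repetition rather than self-matches.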


### Much of this code was derived from David Robinson's work on Stack Overflow; he helped create the tidytext and fuzzyjoin packages in R.