We'll often want this magic command at the start of our notebooks.
It makes plots appear directly in the notebook, below the cell that creates them. We might as well get used to it.

In [None]:
%matplotlib inline

# Some text basics

These lines read a text file. The first line creates a file object that points to the file. The second line reads in the contents of that file and assigns it to a variable named `genesis_raw`.

In [None]:
myfile = open('corpora/genesis.txt')
genesis_raw = myfile.read()
myfile.close()  # release the file once its contents are in memory

`genesis_raw` is now a string containing every character in Genesis.
Let's see how many characters it contains:

In [None]:
len(genesis_raw)

We can display the first 100 characters:

In [None]:
genesis_raw[:100]
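
As a minimal sketch, the same read-then-inspect pattern can be written with a `with` block, which closes the file automatically. The file name `sample.txt` and its one-sentence contents are made up here so the example runs on its own; they are not part of the corpus.

```python
import os
import tempfile

# Hypothetical stand-in for corpora/genesis.txt so the example is self-contained.
sample_text = "In the beginning God created the heaven and the earth."
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w", encoding="utf-8") as out:
    out.write(sample_text)

# The with-block closes the file as soon as the indented code finishes.
with open(sample_path, encoding="utf-8") as infile:
    sample_raw = infile.read()

n_chars = len(sample_raw)    # number of characters, spaces included
first_16 = sample_raw[:16]   # slice: the characters at positions 0 through 15
print(n_chars)
print(first_16)
```

A slice like `[:16]` never raises an error if the string is shorter than requested; it simply returns what is there.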

We will often want to break up a long string into separate words. This is called **tokenizing**.

To do this, we will use a library called **nltk** (the Natural Language Toolkit).

In [None]:
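import nltk
# word_tokenize relies on NLTK's tokenizer data. If this raises a LookupError,
# the error message names the nltk.download(...) call that installs it.
genesis_tokenized = nltk.word_tokenize(genesis_raw)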

In [None]:
len(genesis_tokenized)

In [None]:
genesis_tokenized[:10]
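
To see why a tokenizer is more useful than simply splitting on spaces, here is a rough sketch using only the standard library. The regular expression below is a simplified stand-in chosen for illustration; it is not how `nltk.word_tokenize` works internally, and it handles far fewer cases.

```python
import re

sentence = "And God said, Let there be light: and there was light."

# Splitting on whitespace leaves punctuation glued to words ("said,", "light:").
by_whitespace = sentence.split()

# A rough regex tokenizer: runs of word characters, or single punctuation marks.
# This only approximates what a real tokenizer does.
by_regex = re.findall(r"\w+|[^\w\s]", sentence)

print(by_whitespace[:4])
print(by_regex[:5])
```

With the whitespace split, `"light:"`, `"light."` and `"light"` would all count as different words, which distorts any word counts we compute later.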

Using nltk, we can create a frequency distribution object (`FreqDist`) that keeps track of how many times each word occurs. It can also make a plot for us.

In [None]:
fdist = nltk.FreqDist(genesis_tokenized)
fdist.most_common(15)

In [None]:
fdist.plot(25)
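
`nltk.FreqDist` is a subclass of Python's built-in `collections.Counter`, so the counting behavior can be sketched without nltk. The short token list below is typed in by hand for illustration.

```python
from collections import Counter

# A hand-typed token list standing in for genesis_tokenized.
tokens = ["the", "earth", "was", "without", "form", ",", "and", "void", ";",
          "and", "darkness", "was", "upon", "the", "face", "of", "the", "deep", "."]

counts = Counter(tokens)             # maps each distinct token to its count
top_three = counts.most_common(3)    # list of (token, count) pairs, largest first
missing = counts["light"]            # tokens never seen count as 0, no KeyError

print(top_three)
print(missing)
```

Note that punctuation marks are counted as tokens too, which is one reason they tend to show up near the top of a frequency list.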

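We can also use nltk to create a **dispersion plot** and a **concordance**. A dispersion plot shows where in the text each word occurs; a concordance lists each occurrence of a word together with its surrounding context.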

In [None]:
from nltk.draw import dispersion_plot
dispersion_plot(genesis_tokenized, ["Adam", "Noah"])

In [None]:
from nltk.text import ConcordanceIndex
ci = ConcordanceIndex(genesis_tokenized)
ci.print_concordance("Adam", width=80, lines=10)
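
Both tools rest on the same idea: finding the positions at which a word occurs in the token list. The following is a rough sketch of that idea in plain Python, not nltk's actual implementation. The sample tokens are split by hand, and the `window` size is an arbitrary choice.

```python
# A short, hand-tokenized sample (spaces already separate the punctuation).
tokens = ("And Adam called his wife's name Eve ; because she was the mother "
          "of all living . Unto Adam also and to his wife did the LORD God "
          "make coats of skins").split()

target = "Adam"

# Positions where the target occurs: this is what a dispersion plot marks.
offsets = [i for i, tok in enumerate(tokens) if tok == target]

# A few tokens of context on each side: the core of a concordance line.
window = 3
contexts = []
for i in offsets:
    left = " ".join(tokens[max(0, i - window):i])
    right = " ".join(tokens[i + 1:i + 1 + window])
    contexts.append(f"{left} [{tokens[i]}] {right}")

print(offsets)
for line in contexts:
    print(line)
```

The comparison here is case-sensitive, so `"adam"` would not match. Lowercasing the tokens first is one way to change that.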

# Working with a stop list

Often we will want to remove some common but not very informative words from a corpus. A list of such words is called a **stop list**. Let's read one in from a file and split it into a Python list, one word per line.

In [None]:
f = open("lists/stop-words_english_5_plus.txt")
stop_list = f.read().split("\n")
f.close()  # release the file once it has been read
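
One detail worth knowing about `split("\n")`: if the file ends with a newline, the resulting list ends with an empty string. Here is a small sketch of that behavior and of one way to clean it up. The three-word `stop_text` is made up and assumes one word per line.

```python
# Hypothetical file contents; the real stop list lives in a file on disk.
stop_text = "the\nand\nof\n"

# The trailing newline produces a final empty string.
as_list = stop_text.split("\n")

# splitlines() avoids that, and a set makes "is this word in here?" checks fast.
as_set = {line.strip() for line in stop_text.splitlines() if line.strip()}

print(as_list)
print(sorted(as_set))
```

The stray empty string is harmless for filtering tokens, since no token is empty, but a set is noticeably faster than a list when checking many thousands of words.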

Now let's remove these words from our list of tokens. We lowercase each token before comparing it, so that "The" and "the" are treated as the same word.

In [None]:
words_pruned = []
for w in genesis_tokenized:
    if w.lower() not in stop_list:
        words_pruned.append(w.lower())
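
The same filtering loop can also be written as a list comprehension. This is a sketch with a tiny made-up stop list; the extra `isalpha()` step is an optional addition, not something the loop above does.

```python
tokens = ["And", "God", "said", ",", "Let", "there", "be", "light"]
stop_words = {"and", "there", "be", "let"}   # made-up, tiny stop list

# Keep the lowercased token only if it is not a stop word.
pruned = [w.lower() for w in tokens if w.lower() not in stop_words]

# Optional extra step: also drop tokens that are not purely alphabetic.
words_only = [w for w in pruned if w.isalpha()]

print(pruned)
print(words_only)
```

Punctuation survives the stop-word filter unless the stop list happens to include it, which is why the second step can be useful before counting.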

In [None]:
fdist_pruned = nltk.FreqDist(words_pruned)
fdist_pruned.most_common(15)

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 7))  # a wider figure leaves room for the word labels
fdist_pruned.plot(25)        # plot the 25 most frequent remaining words