We'll often want this magic command at the start of our notebooks.
It makes plots show up inline, right below the cell that creates them. We might as well get used to it.

In [None]:
%matplotlib inline

# Some text basics

These lines read a text file. The first line creates a file object that points to the file. The second line reads in the contents of that file and assigns it to a variable named `genesis_raw`.

In [None]:
myfile = open('corpora/genesis.txt')
genesis_raw = myfile.read()
myfile.close()  # release the file handle once the text is in memory
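As a side note, here is a minimal sketch of the same read using a `with` block, which closes the file for us when the block ends. The temporary file, the names `sample_path` and `sample_raw`, and the UTF-8 encoding are assumptions made so the sketch runs anywhere; they are not part of the original notebook.

```python
import os
import tempfile

# Hypothetical stand-in for 'corpora/genesis.txt' so the sketch runs anywhere.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w", encoding="utf-8") as out:
    out.write("In the beginning God created the heaven and the earth.")

# The with-block closes the file automatically when the block ends.
# encoding="utf-8" is an assumption; use whatever encoding the real file has.
with open(sample_path, encoding="utf-8") as infile:
    sample_raw = infile.read()

print(len(sample_raw))   # number of characters read
print(sample_raw[:16])   # first 16 characters
```

With the real corpus, we would pass `'corpora/genesis.txt'` to the second `open` call and skip the part that writes the sample file.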

`genesis_raw` is now a string containing every character in Genesis.
Let's see how many characters that is:

In [None]:
len(genesis_raw)

We can display the first 100 characters:

In [None]:
genesis_raw[:100]

In [None]:
import nltk

genesis_tokenized = nltk.word_tokenize(genesis_raw)

In [None]:
len(genesis_tokenized)

In [None]:
genesis_tokenized[:10]

In [None]:
fdist = nltk.FreqDist(genesis_tokenized)
fdist.most_common(25)

In [None]:
fdist.plot(25)
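To see what a frequency distribution holds, here is a small sketch using `collections.Counter` from the standard library. `nltk.FreqDist` is a subclass of `Counter`, so `most_common` works the same way on both. The toy token list is made up for illustration.

```python
from collections import Counter

# Toy token list standing in for genesis_tokenized (made up for illustration).
tokens = ["the", "earth", "and", "the", "heaven", "and", "the", "light"]

# Count how many times each token occurs.
counts = Counter(tokens)

print(counts.most_common(2))  # the two most frequent tokens with their counts
print(counts["the"])          # count for a single token
```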

### Dispersion plot with nltk

Another library that overlaps with TextBlob is [nltk](https://www.nltk.org/). It does some things that TextBlob doesn't do, but it is just a tad more complicated. Here we'll use it to make a dispersion plot and a concordance.

In [None]:
from nltk.draw import dispersion_plot
dispersion_plot(genesis_tokenized, ["Adam", "Noah"])

In [None]:
from nltk.text import ConcordanceIndex
ci = ConcordanceIndex(genesis_tokenized)
ci.print_concordance("Adam", width=80, lines=25)
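To make the idea concrete, here is a rough pure-Python sketch of what these two tools compute. A dispersion plot draws the positions where a word occurs, and a concordance shows each occurrence with a few words of context. This is a simplified illustration, not how nltk implements them; the function name `simple_concordance`, the `window` parameter, and the toy tokens are all made up for this example.

```python
def simple_concordance(tokens, word, window=3):
    """Return each occurrence of `word` with up to `window` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Toy token list standing in for genesis_tokenized (made up for illustration).
tokens = "And Adam called his wife Eve and Adam knew his wife".split()

# Positions of the word: this is the information a dispersion plot draws.
offsets = [i for i, tok in enumerate(tokens) if tok == "Adam"]
print(offsets)

# Each occurrence with two tokens of context on either side.
for line in simple_concordance(tokens, "Adam", window=2):
    print(line)
```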

## Working with a stop list

Often we will want to remove some common but not so helpful words from a corpus. These common-but-not-helpful words are called "stop words", and a list of them is a "stop list". I put some in the "lists" folder.

Let's read one in and split it into a list.

In [None]:
f = open("lists/stop-words_english_5_en.txt")
stop_list = f.read().split("\n")
f.close()  # release the file handle once the list is in memory

Now let's remove these words from our list of tokens and lowercase everything that is left.

In [None]:
words_pruned = []
for w in genesis_tokenized:
    if w.lower() not in stop_list:
        words_pruned.append(w.lower())

In [None]:
fdist_pruned = nltk.FreqDist(words_pruned)
fdist_pruned.most_common(25)
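The filtering loop above can also be sketched as a list comprehension. Using a set for the stop words is a design choice assumed here: membership tests on a set stay fast even when the stop list is long. The toy tokens and the name `stop_words` are made up for illustration.

```python
# Toy data standing in for genesis_tokenized and stop_list (made up for illustration).
tokens = ["In", "the", "beginning", "God", "created", "the", "heaven"]
stop_words = {"in", "the", "and", "of"}  # a set gives fast membership tests

# Keep the lowercased form of every token that is not a stop word.
pruned = [w.lower() for w in tokens if w.lower() not in stop_words]
print(pruned)
```

With the real data we could build the set with `set(stop_list)` once the stop list is complete.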

Let's also get rid of punctuation marks, single letters and digits, and a few quote and dash tokens that the tokenizer produces.

In [None]:
stop_list += list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~’')
stop_list += list("abcdefghijklmnopqrstuvwxyz0123456789")
stop_list += ["''", "``", "--"]
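As an alternative to typing these characters out by hand, here is a sketch that builds the same extra entries from constants in Python's standard `string` module. The name `extra_stops` is made up for this example; the curly apostrophe is added by hand because `string.punctuation` only covers ASCII.

```python
import string

# ASCII punctuation plus the curly apostrophe, one list entry per character.
extra_stops = list(string.punctuation + "’")

# Single lowercase letters and digits.
extra_stops += list(string.ascii_lowercase + string.digits)

# Multi-character quote and dash tokens.
extra_stops += ["''", "``", "--"]

print(len(extra_stops))
```

We could then extend the stop list with `stop_list += extra_stops`.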

In [None]:
words_pruned = []
for w in genesis_tokenized:
    if w.lower() not in stop_list:
        words_pruned.append(w.lower())

In [None]:
fdist_pruned = nltk.FreqDist(words_pruned)
fdist_pruned.most_common(25)

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 10))
fdist_pruned.plot(25)