# Exploring the NLTK book collection and some basic statistical analysis

This notebook is based on the exercises described in [NLTK Book, Chapter 1: Language Processing and Python](https://www.nltk.org/book/ch01.html)
<footer style="text-align:right;font-size:.8em;">Source: Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O’Reilly Media Inc. http://nltk.org/book</footer>

* [NLTK documentation](https://www.nltk.org/)
* [NLTK API Documentation](https://www.nltk.org/api/nltk.html)
* [PyCharm](https://www.jetbrains.com/pycharm/download/#section=mac) An Integrated Development Environment (IDE) for working with Python, with a greater focus on writing Python code.

## Take a look at the Book Collection

Let's start by taking a look at the downloader and all the tools/packages available.
Then let's download the "book" collection.

***Note:*** **This interface will look significantly different in the browser notebook and a local installation. If you are running the notebook on your computer a separate download GUI may load.**

In [None]:
import nltk
nltk.download()


In [None]:
from nltk.book import *

Notice the nice message about the `texts()` and `sents()` functions. Let's try those.

In [None]:
# a list of the texts
texts()

In [None]:
# the first sentence of each text
sents()

In [None]:
# look at individual text info.
text1

We can see this is a special NLTK data type. This is significant because it will have its own unique methods and functions, which we will explore.

In [None]:
type(text1)

### Search text

#### concordance

`concordance` is one of the special methods available on the `Text` objects loaded from the book module.
It prints every occurrence of a word together with the context in which the word appears. This is also known as KWIC (Key Word in Context).

In [None]:
text1.concordance('alone')

**Note: The default view is limited to 25 lines, but you can show more or fewer by passing in an integer for the number of lines:** `lines=n` 
**Similarly, we can set the width of each line (in characters):** `width=n`

In [None]:
text1.concordance('death', width=40, lines=10)
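
To make the idea of a KWIC view concrete, here is a minimal pure-Python sketch of what a concordance does. It is not NLTK's implementation: the `kwic` function, its `window` parameter, and the toy token list are all hypothetical, and the window is counted in tokens rather than characters.

```python
def kwic(tokens, word, window=3):
    """Return one context line per case-insensitive match of `word` in `tokens`."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            # up to `window` tokens on either side of the match
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Hypothetical toy text, already split into tokens
toy_tokens = "Death is not the end ; death is only a door".split()
for line in kwic(toy_tokens, "death"):
    print(line)
```

Both "Death" and "death" are matched because the comparison lowercases each token first.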

In [None]:
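# Search multiple texts
# (named selected_texts so we do not overwrite the texts() function imported from nltk.book)
selected_texts = [text1, text2, text5]

for t in selected_texts:
    print(t)
    t.concordance('red', lines=10)
    print('\n')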

Find words that are used similarly with the `similar` method. It shows the words that appear in contexts similar to those of the given word (as determined by NLTK).

In [None]:
# similar usages of 'love' in Moby Dick
print(text1)
text1.similar('love')

In [None]:
# similar usages to 'love' in Sense and Sensibility
print(text2)
text2.similar('love')

Next, look at the contexts that similar words share. `common_contexts` shows the contexts in which all of the given words appear. Try it on pairs of words taken from the output of `similar`.

In [None]:
text2.common_contexts(["love","affection"])

In [None]:
text1.common_contexts(['love','sea'])

### dispersion
A dispersion plot shows where a word occurs across the entire text.

Note: matplotlib is a Python library for plotting and visualizing data. We will look at it more later.

In [None]:
%matplotlib inline
text1.dispersion_plot(['blue','red','green','yellow','gold'])

In [None]:
# take a closer look at an area of the text
text1[150000:210000].count('gold')

The word "gold" appears 16 times in this region of the text. 

In [None]:
text1.concordance('gold')

## Other analysis tools

Find the length of a text, counting words and punctuation, i.e. all tokens.

In [None]:
len(text3)

This is distinct from a word list of all the words that appear in the text (i.e. not including repeated words).

In [None]:
sorted(set(text3))

In [None]:
len(set(text3))

We can see that Genesis is made up of 2789 unique tokens.

We can see how ***"lexically rich"*** a text is by dividing the number of distinct words by the total number of words in the text. This gives us the percentage of the text's tokens that are distinct. In other words, does the text use the same words repeatedly, or does it have a rich vocabulary and use many different words throughout?

This can be a little misleading when comparing texts of different sizes. There is a finite vocabulary to introduce into a text, while a text can be arbitrarily long. A text that is 1,000 words long and uses only 100 words in its vocabulary would have the same percentage (10%) as a text that is 100,000 words long and uses a 10,000 word vocabulary. Can we really say that they both have the same lexical diversity? To make comparisons about lexical diversity, the texts need to be of similar lengths. There are other mathematical ways to normalize this value as well, and this is where we would like the input of a data scientist.

In [None]:
# This will show that the number of distinct words is just 6% of the total text. 
(len(set(text3))/len(text3)) * 100
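
One simple way to compare texts of different lengths is to measure lexical diversity over fixed-size windows and average the results. The sketch below is only one possible normalization, not a recommendation from the NLTK book; `lex_div_pct`, `windowed_lex_div`, the `window` size, and the toy texts are hypothetical.

```python
def lex_div_pct(tokens):
    """Distinct tokens as a percentage of all tokens."""
    return len(set(tokens)) / len(tokens) * 100


def windowed_lex_div(tokens, window=1000):
    """Average lexical diversity over consecutive, non-overlapping windows."""
    scores = [
        lex_div_pct(tokens[start:start + window])
        for start in range(0, len(tokens) - window + 1, window)
    ]
    if not scores:
        raise ValueError("text is shorter than one window")
    return sum(scores) / len(scores)


# Two hypothetical texts with the same 4-word vocabulary but different lengths
short_text = ["a", "b", "c", "d"] * 5     # 20 tokens
long_text = ["a", "b", "c", "d"] * 500    # 2000 tokens

# The raw percentage drops as the text gets longer ...
print(lex_div_pct(short_text), lex_div_pct(long_text))
# ... while the windowed score is the same for both texts
print(windowed_lex_div(short_text, window=20), windowed_lex_div(long_text, window=20))
```

Any leftover tokens that do not fill a whole window are ignored, which keeps every window the same size.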

#### looking at particular words
We can take a closer look at particular words in a text.

In [None]:
text5

How often does a word appear in a text? Keep in mind we have not done anything to the text, so searching for 'lol' is not the same as searching for 'LOL' in this function.

In [None]:
text5.count('lol')

In [None]:
# This will show us that 'lol' is used in 1.5% of the text.
text5.count('lol')/len(text5) *100

But we missed some LOLs... (lol)

In [None]:

text5.count('LOL')
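
Rather than counting each spelling separately, we can lowercase every token before comparing. This is a small sketch on a hypothetical toy list; `count_ignore_case` is an illustrative helper, not an NLTK function.

```python
def count_ignore_case(tokens, word):
    """Count tokens equal to `word`, ignoring upper/lower case."""
    target = word.lower()
    return sum(1 for tok in tokens if tok.lower() == target)


# Hypothetical chat tokens
toy_chat = ["lol", "LOL", "hi", "Lol", "lol", "ok"]

print(toy_chat.count("lol"))               # exact matches only
print(count_ignore_case(toy_chat, "lol"))  # every case variant
```

The same helper should work on `text5`, since it only needs to loop over the tokens.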

We can save ourselves some time by turning these into reusable functions.

In [None]:
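def lex_div(text):
    # percentage of tokens in the text that are distinct
    return (len(set(text)) / len(text)) * 100


def wrd_freq(text, word):
    # percentage of tokens in the text that are the given word
    return text.count(word) / len(text) * 100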

In [None]:
lex_div(text5)

In [None]:
wrd_freq(text5,'lol')

In [None]:
# create a function to find and print all the locations in the text of a particular word.
def find(word, text):
    for x, token in enumerate(text):
        if token == word:
            print(word + ': ' + str(x))



In [None]:
find('Ishmael',text1)

In [None]:
text1.concordance('Ishmael')

Now we have a way to locate the concordances in the text. The first concordance should correspond to the first index position given in our function. 

*Note: we may need to do some fine tuning for other words. Our function is case-sensitive, while the concordance function ignores capitalization. Our function would not show us "ishmael" if it appeared in the text, but the concordance would.*
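
One possible refinement is a variant that returns the positions as a list and can optionally ignore case. This is only a sketch: `find_positions` and its `ignore_case` flag are hypothetical names, and the token list is a toy example.

```python
def find_positions(word, tokens, ignore_case=True):
    """Return the index of every token that matches `word`."""
    if ignore_case:
        target = word.lower()
        return [i for i, tok in enumerate(tokens) if tok.lower() == target]
    return [i for i, tok in enumerate(tokens) if tok == word]


# Hypothetical toy tokens
toy_tokens = ["Call", "me", "Ishmael", ".", "ishmael", "?"]

print(find_positions("Ishmael", toy_tokens))                     # both spellings
print(find_positions("Ishmael", toy_tokens, ignore_case=False))  # exact match only
```

Returning a list instead of printing makes the positions reusable, for example to slice the text around each match.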

In [None]:
print(" ".join(text1[4712:4750]))

#### putting words into lists.

In [None]:
# the first sentence of Moby Dick
sent = text1[4712:4716]
sent

In [None]:
text1[1]

In [None]:
# we can apply our function for lexical diversity to a list as well.
# Every word will be unique, so we should get 100%
lex_div(sent)

In [None]:
# Remember a few sentences are predefined with the book collection we imported. sent1-sent9
sent1, sent2, sent3

In [None]:
type(sent2)

Create a list and try using some of the functions we used earlier on it.

In [None]:
sen1 = ['It','is','only','a','scratch','!']

In [None]:
vocab = set(sen1)
vocab

In [None]:
# make vocab a list
list(vocab)

In [None]:
len(sen1)

In [None]:
sen1.count('scratch')

We can use Python's `+` operator to concatenate lists as well.

In [None]:
sen2 = ['I','did','not'] + ['vote','for','you']
sen2

I forgot the period. `append` will add an item to the end of a list.

In [None]:
# add a period to the end of the list
sen2.append('.')

In [None]:
sen2

##### List index
Every item in a list has an index number associated with it. The index starts at 0, so to find the first item in a list we refer to index position `0`.

In [None]:
sen1[0]

We can also do the reverse and get the index position of a word in the list.

In [None]:
sen1.index('scratch')

In [None]:
sen1[4]

To get more than one item from the list we can use : notation to indicate a span of the list.

In [None]:
text1[4712:4716]

In [None]:
# to start from the beginning we do not need to refer to the index
text1[:8]

In [None]:
# or the end
text1[260800:]

In [None]:
# The entire text
text1[:]

Finally we can recreate the original string (sentence).

In [None]:
string1 = " ".join(sen1)
string1

Note that strings also support some of the functions we have been using on lists.

In [None]:
len(string1)

In [None]:
string1[0]

In [None]:
string1[:6]

## Bringing in some Statistics and Math

### Frequency Distribution

We can use this method to help determine the importance of words. The idea being that words used more frequently will be more important. `FreqDist` will create an index of all the words used in the text (samples) and record how many times each word, or sample, appears in the text. 

In [None]:
from nltk.probability import FreqDist
fq_dis1 = FreqDist(text1)
print(fq_dis1)
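
To see what "samples" and "outcomes" mean, here is a small stand-in built on `collections.Counter` from the standard library (in NLTK 3, `FreqDist` is a subclass of `Counter`, so the two behave similarly for these operations). The toy token list is hypothetical.

```python
from collections import Counter

# Hypothetical toy text, already split into tokens
toy_tokens = "the whale , the sea , and the ship".split()
counts = Counter(toy_tokens)

n_samples = len(counts)            # distinct tokens ("samples")
n_outcomes = sum(counts.values())  # all tokens ("outcomes")

print(n_samples, n_outcomes)
print(counts.most_common(2))        # the two most frequent tokens
print(counts["the"] / n_outcomes)   # relative frequency, comparable to FreqDist.freq
```

The number of outcomes is simply the length of the token list, which is why it should match `len()` of the text.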

That is a large collection: 19317 samples. Looking at all of it is daunting, but do we really need to see all of it?

Note: the number of outcomes should equal the number of tokens in the text, `len(text1)`. Let's check.


In [None]:
len(text1)

In [None]:
# this will show us the words used most frequently. We can choose the range.
# so to see the top 50 we use:
fq_dis1.most_common(50)

In [None]:
# find the frequency of a particular word:
fq_dis1['Ishmael']

In [None]:
# see the frequency score of a word. 
fq_dis1.freq('Ishmael')

We can plot this two ways. The first way we see the individual values of each word. In the second we see how the word counts accumulate towards the total word count.

In [None]:
fq_dis1.plot(50)

In [None]:
# plot the frequency distribution:
fq_dis1.plot(50, cumulative=True)
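
The cumulative plot is a running total of the counts, taken from the most frequent word downward. A minimal sketch of that running total, using only the standard library and a hypothetical toy token list:

```python
from collections import Counter
from itertools import accumulate

# Hypothetical toy text, already split into tokens
toy_tokens = "the whale , the sea , and the ship".split()

ranked = Counter(toy_tokens).most_common()  # (word, count), most frequent first
running_totals = list(accumulate(count for _, count in ranked))

for (word, count), total in zip(ranked, running_totals):
    print(word, count, total)
```

The last running total equals the number of tokens, which is why a cumulative curve over all samples climbs to the full word count.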

A lot of these words are not helpful in understanding the text. "whale" might be the only telling word in our frequency distribution. You can also look at the **hapaxes** or the words that only appear once to see if these are more insightful.

In [None]:
fq_dis1.hapaxes()

The rare words don't offer much insight either, in this case. It looks like we will need to be more selective of what words in the text we want to consider.

Try removing stop words from the text as we saw in the earlier notebook and then measure the frequency distribution.

In [None]:
from nltk.corpus import stopwords
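
As a starting point, the sketch below filters a toy token list before counting. It assumes a small hand-written stop list (`toy_stopwords`) so that it runs without any downloaded data; in the notebook you would more likely build the set from `stopwords.words('english')` and apply the same filter to `text1`.

```python
from collections import Counter

# Hypothetical, hand-written stop list used only for this illustration
toy_stopwords = {"the", "a", "of", "and", "is", "in"}

# Hypothetical toy text, already split into tokens
toy_tokens = "The whale is the king of the sea , and the sea is wide .".split()

# Keep alphabetic tokens that are not stop words, lowercased for counting
content = [
    tok.lower()
    for tok in toy_tokens
    if tok.isalpha() and tok.lower() not in toy_stopwords
]

print(content)
print(Counter(content).most_common(3))
```

Dropping punctuation with `isalpha()` is a separate choice from dropping stop words; both are applied here because neither tells us much about the text.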

### long words

Can we glean anything from looking at the longer words in the text? In Python this is written `[word for word in words if len(word) > n]`, where `n` is the word length we want to examine.

In [None]:
# The list comprehension returns a list, so the result will contain duplicates if they exist.
# Wrapping the list in a set removes the duplicates.

words_10 = set([word for word in text1 if len(word) > 10])
sorted(words_10)

Interesting, but still not very insightful. However, with other texts this might not be the case. Look at other texts and see if the word length reveals anything.

In [None]:
words_15 = set([word for word in text5 if len(word) > 15])
sorted(words_15)

In [None]:
words_10_text4 = set([word for word in text4 if len(word) > 10])
sorted(words_10_text4)

Now we can combine these features (frequency distribution and word length) to get another look into the text. Let's try the chat text and find words that are longer than 8 characters and are found more than 5 times.

In [None]:
fq_dis5 = FreqDist(text5)
sorted(word for word in set(text5) if len(word) > 8 and fq_dis5[word] > 5)


Which word was used the most in the chat text?

In [None]:
# warning: this will be disappointing
fq_dis5.max()

In [None]:
fq_dis5.plot(50, cumulative=True)

## Bigrams and collocation
A **collocation** is a sequence of words that occur together more often than we would expect by chance. We also want to focus on collocations that carry meaning, e.g. 'dark chocolate' and not 'the chocolate'.

A **bigram** is a pair of consecutive words, e.g. "hot tub".

NLTK has a function for bigrams. The `bigrams` function takes a Python list and yields its consecutive word pairs. We can then store the bigrams in a list.

In [None]:
from nltk import bigrams
bgram = list(bigrams(['more', 'is', 'said', 'than', 'done']))

In [None]:
bgram
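
For comparison, the same pairs can be built in plain Python by zipping a list with itself shifted by one position. `pairwise_bigrams` is a hypothetical helper name used only for this sketch.

```python
def pairwise_bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


pairs = pairwise_bigrams(['more', 'is', 'said', 'than', 'done'])
print(pairs)
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```

A list of n tokens always yields n - 1 bigrams, because the last token has no successor.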

In [None]:
#we can get all of the bigrams for a text:
list(set(bigrams(text1)))

This is where collocation is useful. The `collocations` method will essentially give us a list of bigrams that are statistically relevant because they contain rare words that appear with other words at a higher frequency than is statistically expected. It starts from the same bigrams we created earlier, but it is a method of the `Text` objects in the book module.

In [None]:
text1.collocations()

In [None]:
# reminder: text8 are the Personals
text8.collocations()
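
To illustrate the idea of "more often than expected", the sketch below scores bigrams with pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. PMI is a simpler measure swapped in for illustration; it is not necessarily the scoring that `collocations()` uses. The `pmi_scores` function, its `min_count` cutoff, and the toy text are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score each adjacent word pair by pointwise mutual information (PMI)."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (w1, w2), count in pair_counts.items():
        if count < min_count:
            continue  # pairs seen only once give unstable scores
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        # log of observed pair probability over the probability expected by chance
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy text, already split into tokens
toy_tokens = ("the sperm whale saw the ship and the sperm whale left "
              "the ship and the crew").split()

ranked = pmi_scores(toy_tokens)
for pair, score in ranked:
    print(pair, round(score, 2))
```

In this toy text, pairs made of less frequent words (such as "sperm whale") score higher than pairs that include the very common word "the", even though all of them occur the same number of times.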

### Other ways of Counting

We can also count the words based on word length.

In [None]:
#first get a list that contains all the words' length.
ls = [len(word) for word in text1]
print(ls)

In [None]:
#now get the freq distribution of that list:
fq_dist_wordlength = FreqDist(ls)

In [None]:
fq_dist_wordlength

In [None]:
# look at the word length with the highest frequency
fq_dist_wordlength.max()

In [None]:
fq_dist_wordlength[3]

In [None]:
# What are the 10 most common word lengths?
fq_dist_wordlength.most_common(10)

In [None]:
# what is the percentage of frequency of 3 letter words?
fq_dist_wordlength.freq(3) * 100

In [None]:
# get the total number of samples
# again this should be the same as the word count of the text.
fq_dist_wordlength.N() == len(text1)

In [None]:
fq_dist_wordlength.plot(cumulative=True)

### some quick observations



#### additional Word (string data types) operators 
`s.startswith(t)`	test if s starts with t

`s.endswith(t)`	test if s ends with t

`t in s`	test if t is a substring of s

`s.islower()`	test if s contains cased characters and all are lowercase

`s.isupper()`	test if s contains cased characters and all are uppercase

`s.isalpha()`	test if s is non-empty and all characters in s are alphabetic

`s.isalnum()`	test if s is non-empty and all characters in s are alphanumeric

`s.isdigit()`	test if s is non-empty and all characters in s are digits

`s.istitle()`	test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals)

In [None]:
#quick examples:
word = "hello2"
word.startswith('h')

In [None]:
word.endswith('e')

In [None]:
word.endswith('lo2')

In [None]:
word2 = 'hell'
word2 in word

In [None]:
word.islower()

In [None]:
word.isalpha()

In [None]:
word.isalnum()

In [None]:
num='839'
num.isalnum()

In [None]:
num.isalpha()

In [None]:
word3 = "The Greatest Book Ever"
word3.istitle()

In [None]:
word3.isdigit()

In [None]:
num.isdigit()

In [None]:
# find hyphenated words with specific text
sorted(w for w in set(text7) if '-' in w and 'down' in w)

In [None]:
# search for variations of a words:
sorted(w for w in set(text1) if w.startswith('run') or w.startswith('ran'))


In [None]:
# find all upper case words:
[word for word in text1 if word.isupper()]