# NLTK (Natural Language Toolkit) and working with unstructured (text) data

This notebook will introduce you to the Natural Language Toolkit, better known as NLTK (https://www.nltk.org/). 
NLTK is a leading platform for building Python programs to work with human language data.
This type of work is called Natural Language Processing (NLP) and is a complex field.
We can only cover the basics here, but that should give you a start if you want to work with text data.

**Warning** NLTK can have a steep learning curve. The NLTK project provides introductory materials, and other resources that you may find useful are listed below.

 * NLTK Book
   * http://www.nltk.org/book/
 * PythonProgramming tutorial series (with videos, quizzes, and many components)
   * https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/
 * Reasonable walkthroughs
   * https://likegeeks.com/nlp-tutorial-using-python-nltk/
   * https://www.guru99.com/stemming-lemmatization-python-nltk.html
 * If you want another advanced package, look at gensim (topic modeling)
   * https://radimrehurek.com/gensim/

A general NLP pipeline can be found in chapter 3 of the NLTK book. See image below for an example using HTML webpages.
<img src="http://www.nltk.org/images/pipeline1.png"/>

In this notebook we set up a 3-stage pipeline:
1. open and read data
2. clean up data
3. analyze data

## 0. Get the dataset and install NLTK
First, we need to download our dataset and install the NLTK package.
We will be using the book "Frankenstein; Or, The Modern Prometheus" by Mary Wollstonecraft Shelley from Project Gutenberg.
When this notebook was written, it was the most downloaded book of the previous week by a female author (and by any author!).

https://www.gutenberg.org/ebooks/84

In [None]:
# Download the dataset - only do this once
!wget -O frankenstein.txt https://www.gutenberg.org/files/84/84-0.txt 

In [None]:
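# Install NLTK and download the data packages used below - only do this once
# !pip install nltk  # Uncomment if NLTK is not installed yet
import nltk
nltk.download('punkt')  # Tokenizer models needed by word_tokenize and sent_tokenize
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
# If NLTK later reports a missing resource, download the name given in the error message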

## 1. Open and read the data

Look at chapter 3 of the NLTK book for how to get text data including HTML, files, etc.
http://www.nltk.org/book/ch03.html


In [None]:
# Open up, read, and close our frankenstein.txt file
# (the with statement closes the file for us; the Gutenberg file is UTF-8 encoded)
with open("frankenstein.txt", encoding="utf-8") as fh:
    raw_text = fh.read()

## 2. Clean up the data

Now we want to take our raw text and clean it up so that we are able to work with it.
In this example we will clean up the data and only work with Chapters 1-4.
Usually you do not want to process an entire text file (for example, this file has unrelated text at the top). Cutting the text down gives us practice in cleaning the data, and it speeds up processing because there are fewer sentences to work with as we go.

### Split the raw text by line and cut text to Chapters 1-4

In [None]:
# Create a list of text entries. One entry per line.
lines = raw_text.split("\n")

# Print the line at index 100 (the 101st line of the book, because indexing starts at 0)
print(lines[100])


In [None]:
# Slice out the lines that cover Chapters 1-4
# Hint: I just looked at the line numbers in a text editor for simplicity.
chapter_lines = lines[652:1516]

print(chapter_lines)
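If you would rather not hard-code line numbers, the boundaries can also be looked up by searching for the chapter headings. The following is only a sketch: it assumes each heading sits on its own line as `Chapter N`, and the helper names `heading_positions` and `slice_between` and the `sample_lines` list are made up for illustration (they are not part of NLTK or of the real file).

```python
def heading_positions(lines, heading):
    """Return every index whose stripped line equals `heading`."""
    return [i for i, line in enumerate(lines) if line.strip() == heading]


def slice_between(lines, start_heading, end_heading):
    """Return the lines from the last `start_heading` up to the next `end_heading`."""
    # Use the last match so that a table-of-contents entry is skipped.
    start = heading_positions(lines, start_heading)[-1]
    # Take the first end heading that comes after the start.
    end = next(i for i in heading_positions(lines, end_heading) if i > start)
    return lines[start:end]


# Hand-made stand-in for the real `lines` list.
sample_lines = [
    "CONTENTS",
    " Chapter 1",
    " Chapter 5",
    "",
    "Chapter 1",
    "First chapter text.",
    "Chapter 2",
    "Second chapter text.",
    "Chapter 5",
    "Fifth chapter text.",
]

sample_chapter_lines = slice_between(sample_lines, "Chapter 1", "Chapter 5")
print(sample_chapter_lines)
```

With the real data you would call `slice_between(lines, "Chapter 1", "Chapter 5")`, then check the result by eye, since headings in other files may be formatted differently.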

Notice that chapter_lines is not a string of text. It is a list of lines. So we will convert it back to a string of text using join.
We will join the lines together using a simple space " ".

In [None]:
# Join our chapter_lines list into a single string of text called chapters
# See example in Chapter 4 : http://www.nltk.org/book/ch04.html

chapters = " ".join(chapter_lines)

# Check it out
print(chapters)

### Tokenize words
Tokenization takes raw text and divides it up into tokens. There are a variety of ways of 'tokenizing' text. Two of the most common are words and sentences, but you can also do it by phrases and other characteristics.
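To see why there is more than one way to tokenize, here is a small plain-Python sketch that compares two simple rules on a made-up sentence. It is not what NLTK does internally; it only shows that the definition of a "token" is a choice you make.

```python
import re

# Made-up sentence for illustration.
sample = "The monster spoke; Victor fled."

# Rule 1: split on whitespace. Punctuation stays glued to the words.
whitespace_tokens = sample.split()

# Rule 2: runs of word characters and single punctuation marks become separate tokens.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
print(regex_tokens)
```

The first rule gives tokens such as `spoke;`, while the second separates `spoke` from `;`. Which one you want depends on what you plan to count or compare later.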

In [None]:
from nltk.tokenize import word_tokenize

word_tokens = word_tokenize(chapters)

# Check it out. It is now a list of words.
print(word_tokens)

### Tokenize sentences

In [None]:
from nltk.tokenize import sent_tokenize

sent_tokens = sent_tokenize(chapters)

# Check it out. It is now a list of sentences.
print(sent_tokens)
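Why use a library for sentences at all? A quick sketch with a naive rule (a sentence ends at `.`, `!` or `?` followed by whitespace) shows the problem. The sentence below is made up, and the regular expression is only a stand-in for a hand-written splitter.

```python
import re

# Made-up text containing an abbreviation.
sample = "Mrs. Saville lives in England. Her brother writes to her often."

# Naive rule: split after '.', '!' or '?' when whitespace follows.
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)

print(naive_sentences)
```

The naive rule cuts the text after `Mrs.`, producing three pieces instead of two sentences. Trained sentence tokenizers such as the one behind `sent_tokenize` are designed to cope with many abbreviations like this, although it is still worth spot-checking their output.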

### Manual tokenization
Sometimes you have to tokenize your text manually. Here we will tokenize by chapters.
This kind of custom splitting is not always supported by a library, so you have to write your own Python code.

In [None]:
# Start splitting by chapters
# I leave it as an exercise for you to work through the code.
# Notice that the Chapter 1 text is whatever comes after the "Chapter 1" split and before the "Chapter 2" split.

ch0s = chapters.split("Chapter 1")
ch1s = ch0s[1].split("Chapter 2")
ch2s = ch1s[1].split("Chapter 3")
ch3s = ch2s[1].split("Chapter 4")

# Now we have text for each chapter if we want!
ch1 = ch1s[0]
ch2 = ch2s[0]
ch3 = ch3s[0]
ch4 = ch3s[1]  # Everything after the "Chapter 4" split

# Make a list of chapter tokens if we wish.
chapter_tokens = [ch1, ch2, ch3, ch4]
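The chained splits above work, but they grow by one line per chapter. As an alternative sketch, a single regular expression can split on every heading at once. This assumes headings look like `Chapter` followed by a number; `sample_chapters` is a made-up stand-in for the real `chapters` string.

```python
import re

# Made-up stand-in for the real `chapters` string.
sample_chapters = (
    "Chapter 1 First chapter text. "
    "Chapter 2 Second chapter text. "
    "Chapter 3 Third chapter text."
)

# Splitting with a capture group keeps the chapter numbers in the result:
# [text before first heading, '1', text of chapter 1, '2', text of chapter 2, ...]
parts = re.split(r"Chapter (\d+)\s*", sample_chapters)

# Pair each chapter number with the text that follows it.
chapter_text = {
    int(number): body.strip()
    for number, body in zip(parts[1::2], parts[2::2])
}

print(chapter_text[2])
```

One caveat: the pattern also matches the words "Chapter 3" if they appear inside a sentence, so check the number of chapters you get back.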

### Removing "Stop Words"
Stop words are commonly occurring words that are rarely the focus of NLP.
Let's look at the most common words and then see if we can remove many of the ones that 'get in our way' such as: the, of, was, ...

In [None]:
# Look at the most common words using the FreqDist class.
# See chapter 1: https://www.nltk.org/book/ch01.html

from nltk.probability import FreqDist


fdist = FreqDist(word_tokens)

# Let's see frequency distribution of the 50 most common words in our text
print(fdist.most_common(50))

In [None]:
# Side note: You can look at frequency of a single word such as genius
print(fdist['genius'])

In [None]:
# We can also plot the top 20 words and their counts
fdist.plot(20)
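Raw token counts are often dominated by punctuation, and "The" and "the" are counted separately. Here is a small sketch of one way to normalize tokens before counting. It uses `collections.Counter` from the standard library (the class that `FreqDist` is built on), and `sample_tokens` is a made-up stand-in for `word_tokens`.

```python
from collections import Counter

# Made-up stand-in for the real `word_tokens` list.
sample_tokens = ["The", "creature", ",", "the", "night", ".", "The", "creature", "!"]

# Keep alphabetic tokens only and lowercase them so "The" and "the" count together.
normalized = [t.lower() for t in sample_tokens if t.isalpha()]

counts = Counter(normalized)
print(counts.most_common(2))  # [('the', 3), ('creature', 2)]
```

The same `normalized` list could be passed to `FreqDist` if you want to keep using its `plot` method.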

In [None]:
# Now let's remove these common stop words
from nltk.corpus import stopwords 

stop_words = set(stopwords.words('english')) 

# print(stop_words) # Uncomment if you want to check it out.

# Create a clean_words list
clean_words = []
# Loop through each word, w, in word_tokens
for w in word_tokens:
    # If the word is not in the stop_words set, add it to our clean_words list using append
    if w not in stop_words:
        clean_words.append(w)

# Check it out
print(clean_words)

### You try it
Now plot the top 20 most frequently used words of our cleaned up dataset. Look at the code examples above for guidance.

**Important note** You can create your own list of additional stop words to remove. This is usually necessary when you work in a specific area. I have applied NLP to a variety of areas and we always needed to create a customized list of words that occurred frequently in our data, but did not provide any useful insights. So they were continually added to our evolving list of words to remove.
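Here is a minimal sketch of extending a stop word list with your own words. Everything in it is illustrative: `base_stop_words` is a tiny stand-in for `stopwords.words('english')`, and the entries in `custom_stop_words` are hypothetical examples, not a recommended list.

```python
# Tiny stand-in for stopwords.words('english'); the real list is much longer.
base_stop_words = {"the", "of", "was", "and", "i", "my"}

# Hypothetical domain-specific words that add no insight for our analysis.
custom_stop_words = {"chapter", "said"}

# Combine both sets into one lookup set.
all_stop_words = base_stop_words | custom_stop_words

# Made-up stand-in for the real `word_tokens` list.
sample_tokens = ["Chapter", "1", "The", "genius", "of", "my", "father", "was", "said", "."]

# Drop numbers and punctuation, and compare in lowercase so "The" is removed too.
clean_sample = [
    t for t in sample_tokens
    if t.isalpha() and t.lower() not in all_stop_words
]

print(clean_sample)  # ['genius', 'father']
```

Comparing with `t.lower()` matters because NLTK's English stop words are all lowercase, so a capitalized "The" would otherwise slip through.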

### Lemmatization and Stemming
Text contains many variants of the same word (plurals, tenses, and so on). For example, bicycle and bicycles should both be converted to the base word bicycle.
We will try to distil words down to their base word. There are two approaches to doing this: lemmatization and stemming.
There are multiple stemmers and lemmatizers; here is one of each.

Look here for an example: https://www.guru99.com/stemming-lemmatization-python-nltk.html


In [None]:
from nltk.stem.porter import PorterStemmer

stemmer  = PorterStemmer()

stem_words = []
for w in clean_words:
    stem_w = stemmer.stem(w)
    stem_words.append(stem_w)

# Check it out
print(stem_words)

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lem_words = []
for w in clean_words:
    lem_w = lemmatizer.lemmatize(w)
    lem_words.append(lem_w)
    
# Check it out
print(lem_words)

### Look at the difference
Take a look at the differences between stemming and lemmatizing words. One important difference that you cannot see is that of speed. Lemmatizing words is more complex, so it is slower. If you are working with a lot of text (such as social media), sometimes you cannot lemmatize all of the text because it is simply too slow. Lemmatizing is generally considered superior, but there are caveats to that, which we will not get into today. :)
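Scanning two long printed lists is hard, so one option is to print only the words where the stem and the lemma disagree. The sketch below uses short hand-written lists as stand-ins for `clean_words`, `stem_words` and `lem_words`; the values are what I would expect the Porter stemmer and the WordNet lemmatizer to return for these words, but treat them as illustrative.

```python
# Hand-written stand-ins for clean_words, stem_words and lem_words.
sample_words = ["bicycles", "studies", "feet", "night"]
sample_stems = ["bicycl", "studi", "feet", "night"]
sample_lemmas = ["bicycle", "study", "foot", "night"]

# The three lists are parallel, so zip lines up each word with its stem and lemma.
differences = [
    (word, stem, lemma)
    for word, stem, lemma in zip(sample_words, sample_stems, sample_lemmas)
    if stem != lemma
]

for word, stem, lemma in differences:
    print(f"{word:10} stem={stem:10} lemma={lemma}")
```

Notice that stems are not always real words (`bicycl`, `studi`), while lemmas are dictionary forms. With the real lists, replace the three sample lists with `clean_words`, `stem_words` and `lem_words`.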

## 3. Analyze the data

### Tagging Part of Speech (POS)
We can tag our sentences and identify Part of Speech. 


### You try it!
Go to Chapter 7 of the NLTK book here: http://www.nltk.org/book/ch07.html#fig-ie-architecture
Look at the code example for pos_tag and use the pos_tag function to create a new list of words with their POS tags.
I would suggest using the lem_words list we just created above.
Save the new pos list under a useful name such as pos_words.

In [None]:
# POS tagging (you try it)



In [None]:
# Run this statement to see the kind of output you just created: a list of words paired with their parts of speech.
nltk.pos_tag(["Hello","world","this","is","my","statement"])

#### ?! What is all this? Time for some reading
Here is where NLP gets a little complex. You have to dive into grammar and syntax. Scan the first couple of sections of Chapter 8 to get a sense. If text analytics could be an area of interest for you, then take some time to look over this chapter. Sometimes you need to dive into the mechanics of language for your problem. Other times you can avoid this entirely.
http://www.nltk.org/book/ch08.html

### Collocation
Chapter 2 covers bigrams and how we can look at the frequency distribution of words. Read more here: http://www.nltk.org/book/ch02.html#chap-corpora

Example code can be found in the collocations howto guide here:
https://www.nltk.org/howto/collocations.html

In [None]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

# Build a bigram finder using our word tokens
finder = BigramCollocationFinder.from_words(word_tokens)

# Get the top 10 most collocated words (those that are highly collocated).
finder.nbest(bigram_measures.pmi, 10)

In [None]:
# Let's remove collocated bigrams that occur fewer than 5 times
finder.apply_freq_filter(5)

# Now get the top 10 collocations among the more commonly used bigrams
finder.nbest(bigram_measures.pmi, 10)
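Why does the frequency filter change the result so much? PMI (pointwise mutual information) compares how often two words appear together with how often you would expect them to appear together by chance. The plain-Python sketch below computes that ratio by hand on a made-up sentence. I believe NLTK's `pmi` measure follows the same formula, but treat this as an illustration and not as NLTK's exact implementation.

```python
import math
from collections import Counter

# Made-up token list for illustration.
sample_tokens = "the old man saw the old house and the sea serpent".split()

word_counts = Counter(sample_tokens)
bigram_counts = Counter(zip(sample_tokens, sample_tokens[1:]))
total = len(sample_tokens)


def pmi(w1, w2):
    """log2 of how much more often (w1, w2) occurs than chance would predict."""
    joint = bigram_counts[(w1, w2)]
    return math.log2(joint * total / (word_counts[w1] * word_counts[w2]))


print(round(pmi("the", "old"), 3))      # pair seen twice
print(round(pmi("sea", "serpent"), 3))  # pair seen once
```

The pair that occurs only once gets the higher score, because both of its words are rare. That is why very rare bigrams tend to fill the top of an unfiltered PMI ranking, and why applying a frequency filter first usually gives more meaningful collocations.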