# Feature Engineering and Bigrams Example

This notebook contains examples of NLP feature engineering, specifically bigrams: combinations of words used as features or dimensions.

**Bigrams** are the pairs of adjacent words in a corpus. Among the pairs that appear at least a minimum number of times (usually five), some hold more information than others.

For example, the sentence 'I live in Santa Ana.' has the bigrams: (I, live); (live, in); (in, Santa); and (Santa, Ana). The bigram (Santa, Ana) will carry more information than (in, Santa) because it conveys the name of a place rather than a random pairing of words.

**N-grams** generalize this idea to groups of n consecutive words or characters.
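
As a minimal sketch of this idea, an n-gram list can be built by sliding a window of size n over any sequence, whether a list of words or a string of characters. The helper name `make_ngrams` is hypothetical and is not part of NLTK.

```python
def make_ngrams(sequence, n):
    """Return every run of n consecutive items in sequence as a tuple."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

# Word bigrams for the example sentence (final period dropped by hand)
words = "I live in Santa Ana".split()
word_bigrams = make_ngrams(words, 2)
print(word_bigrams)

# Character trigrams for a single word
char_trigrams = make_ngrams("Santa", 3)
print(char_trigrams)
```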

You can rank any of these groups with a metric known as the **Pointwise Mutual Information (PMI) score**, a statistical measure of dependency between the items within a group. The higher the score, the more information the group conveys.
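
For a bigram (x, y), PMI is commonly defined as log2(p(x, y) / (p(x) p(y))): it compares how often the pair occurs with how often it would occur if the two words were independent. The sketch below computes it on a made-up toy corpus. The function name `pmi` and the corpus are illustrative assumptions, and using the token count as the only normalizer is a simplification, so exact values may differ from those of a library implementation.

```python
import math
from collections import Counter

# Made-up toy corpus, already lowercased and stripped of punctuation
tokens = ("we live in santa ana they work in santa ana "
          "he stays in town in summer").split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)

def pmi(first, second):
    """PMI of an observed bigram; raises ValueError if the pair never occurs."""
    joint = bigram_counts[(first, second)]
    return math.log2(joint * n_tokens / (word_counts[first] * word_counts[second]))

santa_ana_pmi = pmi("santa", "ana")
in_santa_pmi = pmi("in", "santa")
print(santa_ana_pmi, in_santa_pmi)
```

Both pairs occur twice in this corpus, but "in" also appears in other contexts, so (in, santa) scores lower than (santa, ana).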

*This notebook is adapted from the Learn.co lesson titled "Corpus Statistics - Lab"*

In [None]:
import nltk
from nltk.corpus import gutenberg, stopwords
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk import FreqDist
from nltk import word_tokenize
import string
import re

In [None]:
file_ids = gutenberg.fileids()
file_ids

# Part 1: Clean The Text

In [None]:
# Load Moby Dick by its file ID rather than by its position in the list
moby_dick_text = gutenberg.raw('melville-moby_dick.txt')

In [None]:
#Preview where the main text begins, then drop everything before it (the preface)
print(moby_dick_text[21945:22400])
moby_dick_text = moby_dick_text[21945:]

In [None]:
#Tokenize text, treating a word that contains an apostrophe (e.g. "don't") as a single token

pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
moby_dick_tokens_raw = nltk.regexp_tokenize(moby_dick_text, pattern)
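
To see what this pattern keeps, here is a small check on a made-up sentence using the standard library's `re.findall`. The sample string is an assumption for illustration only. A run of letters, optionally followed by an apostrophe and lowercase letters, forms one token, so contractions and possessives stay intact while punctuation and digits are dropped.

```python
import re

# Same pattern as the tokenizer cell above
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"

# Made-up sample sentence, not taken from the corpus
sample = "Ahab's ship, the Pequod, didn't return in 1851!"
sample_tokens = re.findall(pattern, sample)
print(sample_tokens)
```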

In [None]:
#Make all words lowercase
moby_dick_tokens = [word.lower() for word in moby_dick_tokens_raw]

# Part 2: Exploring Text

## Frequency and normalized frequency distributions

In [None]:
#Create a Frequency Distribution to view what words are in the text
moby_dick_freqdist = FreqDist(moby_dick_tokens)
moby_dick_freqdist.most_common(25)

In [None]:
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)

In [None]:
#Remove all stopwords from the text (a set makes each membership check fast)
stopwords_set = set(stopwords_list)
moby_dick_words_stopped = [word for word in moby_dick_tokens if word not in stopwords_set]

In [None]:
moby_dick_freqdist = FreqDist(moby_dick_words_stopped)
moby_dick_freqdist.most_common(25)

In [None]:
#The total number of unique words in Moby Dick after stopwords are removed is:
len(moby_dick_freqdist)

In [None]:
#Normalize the frequency by dividing each word's frequency by the total number
#of words in the corpus

#Obtain a total word count in the corpus
total_word_count = sum(moby_dick_freqdist.values())

#View the top 25 words by normalized frequency
moby_dick_top_25 = moby_dick_freqdist.most_common(25)
print("Word\t\t\tNormalized Frequency")
for word in moby_dick_top_25:
    normalized_frequency = word[1] / total_word_count
    print("{} \t\t\t {:.4}".format(word[0], normalized_frequency))


## Bigrams

In [None]:
# Create bigrams (groups of two tokens)

# Instantiate NLTK's bigram association measures (the scoring functions)
bigram_measures = nltk.collocations.BigramAssocMeasures()

# Build a bigram finder from the stopword-free Moby Dick tokens
moby_dick_finder = BigramCollocationFinder.from_words(moby_dick_words_stopped)

# Score the bigrams
moby_dick_scored = moby_dick_finder.score_ngrams(bigram_measures.raw_freq)

In [None]:
# The top twenty five scored bigrams
# Note the normalized frequencies
moby_dick_scored[:25]

## Pointwise Mutual Information Score (PMI)

In [None]:
# Create PMI finder for bigrams
moby_dick_pmi_finder = BigramCollocationFinder.from_words(moby_dick_words_stopped)

# Set the minimum number of appearances a bigram must have to count
moby_dick_pmi_finder.apply_freq_filter(5)

# Calculate PMI for the remaining bigrams with five or more appearances within the text
moby_dick_pmi_scored = moby_dick_pmi_finder.score_ngrams(bigram_measures.pmi)

In [None]:
#Count the bigrams that passed the filter; if there are too many, raise the threshold

len(moby_dick_pmi_scored)

In [None]:
# Raise the minimum number of appearances a bigram must have to count
moby_dick_pmi_finder.apply_freq_filter(20)

# Calculate PMI for the remaining bigrams with twenty or more appearances within the text
moby_dick_pmi_scored = moby_dick_pmi_finder.score_ngrams(bigram_measures.pmi)

In [None]:
len(moby_dick_pmi_scored)

In [None]:
moby_dick_pmi_scored