# Feature Engineering and Bigrams Example

This notebook contains examples of NLP feature engineering, specifically bigrams: combinations of words used as features or dimensions.

**Bigrams** are the pairs of adjacent words in the corpus. Some pairs hold more information than others, and a pair is usually only considered if it appears at least a minimum number of times (commonly five).

For example, the sentence 'I live in Santa Ana.' has the bigrams: (I, live); (live, in); (in, Santa); and (Santa, Ana). The bigram (Santa, Ana) will carry more information than (in, Santa) because it conveys the name of a place rather than a random pairing of words.
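
The pairing in the example above can be sketched in plain Python. This is a minimal illustration that assumes naive whitespace tokenization; it is not the tokenizer used later in the notebook.

```python
# Tokenize the example sentence naively: drop the final period, split on spaces.
sentence = "I live in Santa Ana."
tokens = sentence.rstrip(".").split()

# Pair each token with its successor to form adjacent-word bigrams.
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
```

A sentence with n tokens yields n - 1 bigrams, which is why the five-word sentence produces four pairs.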

**N-grams** generalize this idea to groups of n consecutive words or characters.

You can rank any of these groups with a metric known as the **Pointwise Mutual Information (PMI) score**, a statistical measure of dependency between the items within a group. The higher the score, the more often the items occur together relative to what their individual frequencies alone would predict.

*This notebook is adapted from the Learn.co lesson titled "Corpus Statistics - Lab"*
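
For reference, the usual textbook definition of PMI for a pair of words $x$ and $y$ is shown below. The base-2 logarithm is an assumption here, chosen to match the scale of the scores that appear later in this notebook.

$$
\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
$$

Here $P(x, y)$ is the probability of seeing the pair together and $P(x)$, $P(y)$ are the probabilities of each word on its own. A score near 0 means the pair occurs about as often as it would if the two words were independent.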

In [1]:
import nltk
from nltk.corpus import gutenberg, stopwords
from nltk.collocations import BigramCollocationFinder
from nltk import FreqDist
from nltk import word_tokenize
import string
import re

In [2]:
file_ids = gutenberg.fileids()
file_ids

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

# Part 1: Clean The Text

In [3]:
# Load Moby Dick (file_ids[-6] is 'melville-moby_dick.txt')
moby_dick_text = gutenberg.raw(file_ids[-6])

In [4]:
# Remove the preface: preview where Chapter 1 begins, then keep the text from that offset on
print(moby_dick_text[21945:22400])
moby_dick_text = moby_dick_text[21945:]

CHAPTER 1

Loomings.


Call me Ishmael.  Some years ago--never mind how long
precisely--having little or no money in my purse, and nothing
particular to interest me on shore, I thought I would sail about a
little and see the watery part of the world.  It is a way I have of
driving off the spleen and regulating the circulation.  Whenever I
find myself growing grim about the mouth; whenever it is a damp,
drizzly November in my soul; whenever 


In [5]:
# Tokenize the text, treating a word that contains an apostrophe as a single token

pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
moby_dick_tokens_raw = nltk.regexp_tokenize(moby_dick_text, pattern)

In [6]:
#Make all words lowercase
moby_dick_tokens = [word.lower() for word in moby_dick_tokens_raw]
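
To see what the tokenization pattern keeps and discards, here is a small sketch on a made-up sample string. It uses the standard library's `re.findall` as a stand-in for `nltk.regexp_tokenize` (an assumption for portability), and drops the outer capture group, which does not change what is matched.

```python
import re

# Same idea as the notebook's pattern: a run of letters, optionally followed
# by an apostrophe and more lowercase letters (e.g. "whale's", "d'ye").
pattern = r"[a-zA-Z]+(?:'[a-z]+)?"

sample = "Call me Ishmael. The whale's head--d'ye see it?"  # hypothetical sample

# re.findall returns every non-overlapping match, scanning left to right.
raw_tokens = re.findall(pattern, sample)

# Lowercase so that "The" and "the" count as the same word.
tokens = [token.lower() for token in raw_tokens]
print(tokens)
```

Punctuation and digits never match the pattern, so they are dropped at this stage rather than in a separate cleaning step.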

# Part 2: Exploring Text

## Frequency and normalized frequency distributions

In [7]:
#Create a Frequency Distribution to view what words are in the text
moby_dick_freqdist = FreqDist(moby_dick_tokens)
moby_dick_freqdist.most_common(25)

[('the', 14175),
 ('of', 6469),
 ('and', 6325),
 ('a', 4628),
 ('to', 4539),
 ('in', 4077),
 ('that', 2953),
 ('his', 2495),
 ('it', 2395),
 ('i', 1982),
 ('but', 1805),
 ('he', 1760),
 ('as', 1720),
 ('with', 1692),
 ('is', 1688),
 ('was', 1627),
 ('for', 1593),
 ('all', 1514),
 ('this', 1382),
 ('at', 1304),
 ('by', 1175),
 ('not', 1141),
 ('from', 1072),
 ('him', 1058),
 ('so', 1053)]

In [8]:
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)

In [9]:
#Remove all stopwords from the text
moby_dick_words_stopped = [word for word in moby_dick_tokens if word not in stopwords_list]
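
The filtering step above can be tried on a handful of tokens. This sketch assumes a tiny, hypothetical stopword collection rather than NLTK's English list, and stores it in a `set`, which may be worth considering for long texts because membership checks on a set do not scan every element.

```python
import string

# Hypothetical miniature stopword collection; NLTK's English list is much longer.
stopwords_set = {"the", "of", "and", "a", "to", "in"} | set(string.punctuation)

tokens = ["the", "whale", "and", "the", "sea", "of", "ahab"]  # toy data

# Keep only the tokens that are not stopwords.
content_words = [token for token in tokens if token not in stopwords_set]
print(content_words)
```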

In [10]:
moby_dick_freqdist = FreqDist(moby_dick_words_stopped)
moby_dick_freqdist.most_common(25)

[('whale', 1030),
 ('one', 896),
 ('like', 639),
 ('upon', 561),
 ('man', 474),
 ('old', 446),
 ('sea', 436),
 ('ahab', 436),
 ('ship', 429),
 ('ye', 426),
 ('would', 424),
 ('though', 382),
 ('yet', 345),
 ('head', 336),
 ('time', 332),
 ('long', 330),
 ('still', 312),
 ('captain', 306),
 ('said', 299),
 ('two', 293),
 ('great', 292),
 ('boat', 292),
 ('seemed', 282),
 ('must', 281),
 ('white', 280)]

In [11]:
# Number of unique words in Moby Dick after stopwords are removed
len(moby_dick_freqdist)

16939

In [12]:
# Normalize the frequency by dividing each word's count by the total number
# of words that remain in the corpus after stopword removal

# Obtain the total word count (stopwords excluded)
total_word_count = sum(moby_dick_freqdist.values())

# View the top 25 words by normalized frequency
moby_dick_top_25 = moby_dick_freqdist.most_common(25)
print("Word\t\t\tNormalized Frequency")
for word, count in moby_dick_top_25:
    normalized_frequency = count / total_word_count
    print("{} \t\t\t {:.4}".format(word, normalized_frequency))


Word			Normalized Frequency
whale 			 0.009445
one 			 0.008216
like 			 0.00586
upon 			 0.005144
man 			 0.004347
old 			 0.00409
sea 			 0.003998
ahab 			 0.003998
ship 			 0.003934
ye 			 0.003906
would 			 0.003888
though 			 0.003503
yet 			 0.003164
head 			 0.003081
time 			 0.003044
long 			 0.003026
still 			 0.002861
captain 			 0.002806
said 			 0.002742
two 			 0.002687
great 			 0.002678
boat 			 0.002678
seemed 			 0.002586
must 			 0.002577
white 			 0.002568
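
The same normalization can be reproduced without NLTK. The sketch below uses `collections.Counter` on toy data (not the Moby Dick tokens) and pads each word to a fixed width so that the columns line up regardless of word length.

```python
from collections import Counter

tokens = ["whale", "sea", "whale", "ahab", "whale", "sea", "ship", "ahab"]  # toy data

counts = Counter(tokens)
total = sum(counts.values())

# Divide each count by the corpus size to get a relative frequency.
normalized = {word: count / total for word, count in counts.most_common()}

for word, freq in normalized.items():
    # Left-align the word in a 10-character column, then print 4 decimals.
    print(f"{word:<10}{freq:.4f}")
```

Because every count is divided by the same total, the normalized frequencies over the whole vocabulary sum to 1.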


## Bigrams

In [13]:
# Create bigrams (groups of two tokens)

# Instantiate NLTK's bigram association measures (the available scoring functions)
bigram_measures = nltk.collocations.BigramAssocMeasures()

# Build a collocation finder from the stopword-filtered Moby Dick tokens
moby_dick_finder = BigramCollocationFinder.from_words(moby_dick_words_stopped)

# Score each bigram by its raw (normalized) frequency
moby_dick_scored = moby_dick_finder.score_ngrams(bigram_measures.raw_freq)
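
As a rough picture of what raw-frequency scoring amounts to, the sketch below counts adjacent pairs in a toy token list and divides each count by the total number of tokens. This is an illustration of the idea under that assumption, not NLTK's implementation.

```python
from collections import Counter

tokens = ["sperm", "whale", "white", "whale", "sperm", "whale", "ship"]  # toy data

# Count each adjacent pair of tokens.
bigram_counts = Counter(zip(tokens, tokens[1:]))
total_tokens = len(tokens)

# Score = pair count / total tokens; sort by score (descending), then alphabetically.
scored = sorted(
    ((bigram, count / total_tokens) for bigram, count in bigram_counts.items()),
    key=lambda item: (-item[1], item[0]),
)
print(scored[0])
```

Raw frequency favors pairs built from common words, which is the motivation for also scoring bigrams with PMI.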

In [14]:
# The top twenty five scored bigrams
# Note the normalized frequencies
moby_dick_scored[:25]

[(('sperm', 'whale'), 0.0013113250802384228),
 (('white', 'whale'), 0.0008619899128839982),
 (('moby', 'dick'), 0.0007427785419532324),
 (('captain', 'ahab'), 0.0005868867491976157),
 (('old', 'man'), 0.0005685465382851903),
 (('mast', 'head'), 0.0004309949564419991),
 (('right', 'whale'), 0.00040348464007336086),
 (('mast', 'heads'), 0.0003301237964236589),
 (('cried', 'ahab'), 0.0003026134800550206),
 (('sperm', "whale's"), 0.0003026134800550206),
 (('aye', 'aye'), 0.00029344337459880787),
 (('quarter', 'deck'), 0.0002842732691425951),
 (('captain', 'peleg'), 0.0002751031636863824),
 (('whale', 'ship'), 0.0002751031636863824),
 (('look', 'ye'), 0.0002567629527739569),
 (('mr', 'starbuck'), 0.0002567629527739569),
 (('one', 'hand'), 0.00024759284731774416),
 (('let', 'us'), 0.00022925263640531865),
 (('whale', 'boat'), 0.00022925263640531865),
 (('cried', 'stubb'), 0.00021091242549289317),
 (('one', 'side'), 0.00021091242549289317),
 (('every', 'one'), 0.00020174232003668043),
 (('nev

## Pointwise Mutual Information Score (PMI)

In [15]:
# Create PMI finder for bigrams
moby_dick_pmi_finder = BigramCollocationFinder.from_words(moby_dick_words_stopped)

# Set the minimum number of appearances a bigram must have to be counted
moby_dick_pmi_finder.apply_freq_filter(5)

# Calculate PMI for the remaining bigrams with five or more appearances within the text
moby_dick_pmi_scored = moby_dick_pmi_finder.score_ngrams(bigram_measures.pmi)
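
The filter-then-score steps can be sketched by hand on a toy corpus. The corpus, the `min_freq` threshold of 2, and the helper name `pmi` are all assumptions made for illustration; the score is computed from counts as log2 of (pair count × total tokens) / (count of first word × count of second word).

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration, not the Moby Dick tokens).
tokens = "moby dick saw the whale moby dick hit the whale the ship the sea".split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
total = len(tokens)
min_freq = 2  # hypothetical threshold; the notebook uses 5 and then 20


def pmi(bigram):
    """Return log2 of the observed pair count over the count expected by chance."""
    first, second = bigram
    return math.log2(
        bigram_counts[bigram] * total / (word_counts[first] * word_counts[second])
    )


# Keep only bigrams seen at least min_freq times, then rank by PMI.
pmi_scored = sorted(
    ((bigram, pmi(bigram)) for bigram, count in bigram_counts.items()
     if count >= min_freq),
    key=lambda item: (-item[1], item[0]),
)
for bigram, score in pmi_scored:
    print(bigram, round(score, 3))
```

Both surviving bigrams occur twice, yet ('moby', 'dick') outranks ('the', 'whale') because "the" also appears in many other contexts, which lowers the pair's score.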

In [16]:
# Too many bigrams remain; apply a stricter filter

len(moby_dick_pmi_scored)

566

In [17]:
# Raise the minimum number of appearances a bigram must have to be counted
moby_dick_pmi_finder.apply_freq_filter(20)

# Calculate PMI for the remaining bigrams with 20 or more appearances within the text
moby_dick_pmi_scored = moby_dick_pmi_finder.score_ngrams(bigram_measures.pmi)

In [18]:
len(moby_dick_pmi_scored)

26

In [19]:
moby_dick_pmi_scored

[(('moby', 'dick'), 10.359590813068548),
 (('mast', 'heads'), 8.540967999333212),
 (('quarter', 'deck'), 8.519527859009505),
 (('mr', 'starbuck'), 8.135514397943794),
 (("d'ye", 'see'), 8.032704998385219),
 (('thou', 'art'), 7.78084951664443),
 (('captain', 'peleg'), 7.234385877826657),
 (('never', 'mind'), 7.206797850980234),
 (('aye', 'aye'), 7.1823814338670005),
 (('sperm', "whale's"), 6.989490718803568),
 (('mast', 'head'), 6.942705362927478),
 (('let', 'us'), 6.522304098768705),
 (('cried', 'stubb'), 6.1178816505439695),
 (('sperm', 'whale'), 6.00342990976144),
 (("whale's", 'head'), 5.827739648806954),
 (('cried', 'ahab'), 5.7347156337227645),
 (('captain', 'ahab'), 5.709058076945897),
 (('white', 'whale'), 5.151507457077564),
 (('look', 'ye'), 5.141939629062065),
 (('old', 'man'), 4.999183405983786),
 (('right', 'whale'), 4.98596089614583),
 (('one', 'hand'), 3.9679173068552167),
 (('one', 'side'), 3.764336751231667),
 (('every', 'one'), 3.5729257497781326),
 (('whale', 'boat'),