# Feature Engineering and Bigrams Example

This notebook contains examples of NLP feature engineering, specifically bigrams: combinations of words used as features or dimensions.

**Bigrams** are the pairs of adjacent words in a corpus. Among pairs that appear at least a minimum number of times (usually five), some hold more information than others.

For example, the sentence 'I live in Santa Ana.' has the bigrams: (I, live); (live, in); (in, Santa); and (Santa, Ana). The bigram (Santa, Ana) will carry more information than (in, Santa) because it conveys the name of a place rather than a random pairing of words.
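
As a minimal sketch of the example above, the bigrams can be produced in plain Python by pairing each word with its successor. Splitting on whitespace and stripping the final period are simplifying assumptions made for this one sentence, not a general tokenizer.

```python
# Split the example sentence into words, dropping the final period.
sentence = "I live in Santa Ana."
words = sentence.rstrip(".").split()

# Pair each word with the word that follows it.
bigrams = list(zip(words, words[1:]))
print(bigrams)
# [('I', 'live'), ('live', 'in'), ('in', 'Santa'), ('Santa', 'Ana')]
```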

**N-grams** generalize this idea to groups of n consecutive words or characters.
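
The same sliding-window idea works for any n and for characters as well as words. The helper below is a hypothetical illustration (its name `ngrams` is chosen here, not taken from the notebook).

```python
def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = ["I", "live", "in", "Santa", "Ana"]
word_trigrams = ngrams(words, 3)   # groups of three words
char_bigrams = ngrams("Ana", 2)    # groups of two characters

print(word_trigrams)
# [('I', 'live', 'in'), ('live', 'in', 'Santa'), ('in', 'Santa', 'Ana')]
print(char_bigrams)
# [('A', 'n'), ('n', 'a')]
```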

You can rank any of these groups with a metric known as **pointwise mutual information (PMI)**, a statistical measure of how strongly the items within a group depend on each other: it compares how often the items occur together with how often they would if they were independent. The higher the score, the more information the group conveys.
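
For a pair of words, the score is commonly written as PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). The sketch below estimates those probabilities from raw counts on a made-up toy sentence. The function name `pmi`, the toy text, and the choice to divide both word and pair counts by the total number of tokens are assumptions for illustration, and the function expects a pair that occurs at least once.

```python
import math
from collections import Counter

def pmi(pair, tokens):
    """PMI of an adjacent word pair, estimated from raw counts."""
    n = len(tokens)
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    x, y = pair
    # Assumption: the same denominator n is used for words and pairs.
    p_xy = bigram_counts[pair] / n
    p_x = unigram_counts[x] / n
    p_y = unigram_counts[y] / n
    return math.log2(p_xy / (p_x * p_y))

toy_tokens = "i live in santa ana and work in town in a shop near santa ana".split()

print(round(pmi(("santa", "ana"), toy_tokens), 2))  # 2.91
print(round(pmi(("in", "santa"), toy_tokens), 2))   # 1.32
```

In this toy text "santa" is always followed by "ana", while "in" precedes several different words, so (santa, ana) scores higher than (in, santa).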

In [1]:
import nltk
from nltk.corpus import gutenberg, stopwords
from nltk.collocations import *
from nltk import FreqDist
from nltk import word_tokenize
import string
import re

In [2]:
file_ids = gutenberg.fileids()
file_ids

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [3]:
# Load Moby Dick by file ID rather than by position in the list
moby_dick_text = gutenberg.raw('melville-moby_dick.txt')

In [4]:
# Preview the start of Chapter 1, then drop the front matter before it
print(moby_dick_text[21945:22400])
moby_dick_text = moby_dick_text[21945:]

CHAPTER 1

Loomings.


Call me Ishmael.  Some years ago--never mind how long
precisely--having little or no money in my purse, and nothing
particular to interest me on shore, I thought I would sail about a
little and see the watery part of the world.  It is a way I have of
driving off the spleen and regulating the circulation.  Whenever I
find myself growing grim about the mouth; whenever it is a damp,
drizzly November in my soul; whenever 


In [5]:
# Tokenize the text, keeping words with apostrophes (e.g. "don't") as single tokens

pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
moby_dick_tokens_raw = nltk.regexp_tokenize(moby_dick_text, pattern)
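
To see what this pattern keeps and drops, it can be tried on a short string. The sketch below uses the standard library's `re.findall` as a stand-in for `nltk.regexp_tokenize`, on the assumption that both return the same matches for this pattern; the sample sentence is made up for the check.

```python
import re

# Same pattern as in the cell above: letters, optionally followed by 'xx.
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
sample = "Call me Ishmael. Don't mind--never mind how long."

sample_tokens = re.findall(pattern, sample)
print(sample_tokens)
# ['Call', 'me', 'Ishmael', "Don't", 'mind', 'never', 'mind', 'how', 'long']
```

Punctuation and the double hyphen are discarded, while the contraction stays in one piece.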

In [6]:
moby_dick_tokens = [word.lower() for word in moby_dick_tokens_raw]

In [7]:
moby_dick_freqdist = FreqDist(moby_dick_tokens)
moby_dick_freqdist.most_common(25)

[('the', 14175),
 ('of', 6469),
 ('and', 6325),
 ('a', 4628),
 ('to', 4539),
 ('in', 4077),
 ('that', 2953),
 ('his', 2495),
 ('it', 2395),
 ('i', 1982),
 ('but', 1805),
 ('he', 1760),
 ('as', 1720),
 ('with', 1692),
 ('is', 1688),
 ('was', 1627),
 ('for', 1593),
 ('all', 1514),
 ('this', 1382),
 ('at', 1304),
 ('by', 1175),
 ('not', 1141),
 ('from', 1072),
 ('him', 1058),
 ('so', 1053)]
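
`FreqDist` is a subclass of `collections.Counter`, so the counting step can be sketched with the standard library alone. The token list below is a made-up example, not data from the novel.

```python
from collections import Counter

# Hypothetical lower-cased tokens standing in for moby_dick_tokens.
sample_tokens = "the whale and the sea and the ship".split()

sample_counts = Counter(sample_tokens)
top_two = sample_counts.most_common(2)
print(top_two)
# [('the', 3), ('and', 2)]
```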