# NLP (Natural Language Processing)

#### TL; DR
To develop a deeper intuition with NLP or vectorization of words to do sentimental analysis

#### Packages for NLP
NLTK

### Reference

Sentdex [video1](https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/)

### Key Terms

- **Tokenizing**: grouping of text (2 types of separators: sentences and words)
- **Corporas**: body of text with same subject/theme
- **Lexicon**: words & their meanings
- **Stop Words**: "fluff" meaningless words that are typically removed
- **Stemming**: typically referred to as the process of removing the end of words that connote a different tense
- **Lemmatizing**: gets the root of the words in contrast to stemming
- **Tagging**: labeling words as nouns, verbs, adjectives, etc...
- **Chunking**: phrases of words that contain a noun surrounded by a verb, adverb that are related

*[Regular Expressions](https://pythonprogramming.net/regular-expressions-regex-tutorial-python-3/): own language/symbols 

- **Chinking**: a chink is a chunk that is removed ofrom a chunk

In [1]:
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

import matplotlib.pyplot as plt
%matplotlib inline

### Grouping Sentences & Words

In [2]:
example_text = "Hello Mr Smith, how are you doing today? The weather is great, \
                and Python is awesome. The sky is pinkish-blue. \
                You shouldn't eat cardboard."

In [3]:
print(sent_tokenize(example_text))

['Hello Mr Smith, how are you doing today?', 'The weather is great,                 and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]


In [4]:
print(word_tokenize(example_text))

['Hello', 'Mr', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']


In [5]:
for i in word_tokenize(example_text):
    print(i)

Hello
Mr
Smith
,
how
are
you
doing
today
?
The
weather
is
great
,
and
Python
is
awesome
.
The
sky
is
pinkish-blue
.
You
should
n't
eat
cardboard
.


### Stopwords

In [6]:
example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
print(stop_words)

set([u'all', u'just', u"don't", u'being', u'over', u'both', u'through', u'yourselves', u'its', u'before', u'o', u'don', u'hadn', u'herself', u'll', u'had', u'should', u'to', u'only', u'won', u'under', u'ours', u'has', u"should've", u"haven't", u'do', u'them', u'his', u'very', u"you've", u'they', u'not', u'during', u'now', u'him', u'nor', u"wasn't", u'd', u'did', u'didn', u'this', u'she', u'each', u'further', u"won't", u'where', u"mustn't", u"isn't", u'few', u'because', u"you'd", u'doing', u'some', u'hasn', u"hasn't", u'are', u'our', u'ourselves', u'out', u'what', u'for', u"needn't", u'below', u're', u'does', u"shouldn't", u'above', u'between', u'mustn', u't', u'be', u'we', u'who', u"mightn't", u"doesn't", u'were', u'here', u'shouldn', u'hers', u"aren't", u'by', u'on', u'about', u'couldn', u'of', u"wouldn't", u'against', u's', u'isn', u'or', u'own', u'into', u'yourself', u'down', u"hadn't", u'mightn', u"couldn't", u'wasn', u'your', u"you're", u'from', u'her', u'their', u'aren', u"it's",

In [7]:
word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if not w in stop_words]

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


In [8]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words = ["python", "pythoner", "pythoning",
                "pythoned", "pythonly"]

In [9]:
[ps.stem(w) for w in example_words]

['python', u'python', u'python', u'python', u'pythonli']

In [10]:
for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli


In [11]:
new_text = "It is important to by very pythonly while you are pythoning with python. \
All pythoners have pythoned poorly at least once."


In [12]:
words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w))

It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.


### Tagging

In [13]:
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

In [14]:
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [15]:
def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
            for subtree in chunked.subtress(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

    except Exception as e:
        print(str(e))

In [16]:
process_content()

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (Chunk ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP January/NNP)
  31/CD
  ,/,
  2006/CD
  (Chunk THE/NNP PRESIDENT/NNP)
  :/:
  (Chunk Thank/NNP)
  you/PRP
  all/DT
  ./.)
'Tree' object has no attribute 'subtress'


### Chinking

In [17]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)

            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
#             chunked.draw()

    except Exception as e:
        print(str(e))

In [56]:
#process_content()

### Name Entity Recognition

In [19]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            print(namedEnt)
#             namedEnt.draw()
    except Exception as e:
        print(str(e))

In [57]:
#process_content()

### Lemmatizing

In [21]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

cat
cactus
goose
rock
python
good
best
run
run


### Corpora

In [27]:
print(nltk.__file__)

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/__init__.pyc


In [26]:
from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer
from nltk.corpus import gutenberg

# sample text
sample = gutenberg.raw("bible-kjv.txt")

tok = sent_tokenize(sample)

for x in range(5):
    print(tok[x])


[The King James Bible]

The Old Testament of the King James Bible

The First Book of Moses:  Called Genesis


1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon
the face of the deep.
And the Spirit of God moved upon the face of the
waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light
from the darkness.


### Wordnet/ Similarity

In [29]:
from nltk.corpus import wordnet

In [30]:
syns = wordnet.synsets("program")

In [31]:
print(syns[0].name())

plan.n.01


In [32]:
print(syns[0].definition())

a series of steps to be carried out or goals to be accomplished


In [33]:
print(syns[0].examples())

[u'they drew up a six-step plan', u'they discussed plans for a new bond issue']


In [36]:
synonyms = []
antonyms = []

for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        print("l:", l)
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
            
# print(set(synonyms))
# print('\n')
# print(set(antonyms))

('l:', Lemma('good.n.01.good'))
('l:', Lemma('good.n.02.good'))
('l:', Lemma('good.n.02.goodness'))
('l:', Lemma('good.n.03.good'))
('l:', Lemma('good.n.03.goodness'))
('l:', Lemma('commodity.n.01.commodity'))
('l:', Lemma('commodity.n.01.trade_good'))
('l:', Lemma('commodity.n.01.good'))
('l:', Lemma('good.a.01.good'))
('l:', Lemma('full.s.06.full'))
('l:', Lemma('full.s.06.good'))
('l:', Lemma('good.a.03.good'))
('l:', Lemma('estimable.s.02.estimable'))
('l:', Lemma('estimable.s.02.good'))
('l:', Lemma('estimable.s.02.honorable'))
('l:', Lemma('estimable.s.02.respectable'))
('l:', Lemma('beneficial.s.01.beneficial'))
('l:', Lemma('beneficial.s.01.good'))
('l:', Lemma('good.s.06.good'))
('l:', Lemma('good.s.07.good'))
('l:', Lemma('good.s.07.just'))
('l:', Lemma('good.s.07.upright'))
('l:', Lemma('adept.s.01.adept'))
('l:', Lemma('adept.s.01.expert'))
('l:', Lemma('adept.s.01.good'))
('l:', Lemma('adept.s.01.practiced'))
('l:', Lemma('adept.s.01.proficient'))
('l:', Lemma('adept.s.01.

In [39]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("boat.n.01")

print(w1.wup_similarity(w2))

0.909090909091


In [40]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("car.n.01")

print(w1.wup_similarity(w2))

0.695652173913


In [44]:
w1 = wordnet.synset("cactus.n.01")
w2 = wordnet.synset("cat.n.01")

print(w1.wup_similarity(w2))

0.5


### Text Classification

In [45]:
import random
from nltk.corpus import movie_reviews

In [49]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

In [51]:
print(documents[1])

([u'alchemy', u'is', u'steeped', u'in', u'shades', u'of', u'blue', u'.', u'kieslowski', u"'", u's', u'blue', u',', u'that', u'is', u'.', u'with', u'its', u'examination', u'of', u'death', u',', u'isolation', u',', u'character', u'restoration', u',', u'and', u'recovery', u'from', u'loss', u',', u'suzanne', u'myers', u"'", u'new', u'independent', u'film', u'echoes', u'the', u'polish', u'director', u"'", u's', u'internationally', u'-', u'acclaimed', u'1993', u'release', u'.', u'language', u'aside', u',', u'the', u'principal', u'difference', u'between', u'the', u'films', u'is', u'that', u',', u'while', u'kieslowski', u'took', u'great', u'pains', u'to', u'draw', u'us', u'into', u'the', u'main', u'character', u"'", u's', u'world', u',', u'alchemy', u'keeps', u'its', u'viewers', u'at', u'arm', u"'", u's', u'length', u'.', u'as', u'a', u'result', u',', u'while', u'we', u"'", u're', u'able', u'to', u'appreciate', u'the', u'film', u"'", u's', u'intellectual', u'tapestry', u',', u'it', u'is', u'em

In [53]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))

[(u',', 77717), (u'the', 76529), (u'.', 65876), (u'a', 38106), (u'and', 35576), (u'of', 34123), (u'to', 31937), (u"'", 30585), (u'is', 25195), (u'in', 21822), (u's', 18513), (u'"', 17612), (u'it', 16107), (u'that', 15924), (u'-', 15595)]
253
35


In [54]:
print(all_words["stupid"])
print(all_words['awesome'])

253
35


### Converting Words to Features w/ NLTK