# NLP (Natural Language Processing)

#### TL;DR
Develop a deeper intuition for NLP and the vectorization of words in order to do sentiment analysis.

#### Packages for NLP
NLTK

### Reference

Sentdex [video1](https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/)

### Key Terms

- **Tokenizing**: splitting text into units; the two common separators are sentences and words
- **Corpora**: bodies of text that share a subject or theme (singular: corpus)
- **Lexicon**: words and their meanings
- **Stop Words**: common filler words (e.g. "the", "is") that carry little meaning and are typically removed
- **Stemming**: stripping word endings, such as those that mark tense, to reduce a word to a stem that need not be a real word
- **Lemmatizing**: reducing a word to its dictionary root (lemma), which, in contrast to a stem, is a real word
- **Tagging**: labeling words as nouns, verbs, adjectives, etc.
- **Chunking**: grouping tagged words into phrases, e.g. a noun together with the related verbs and adverbs around it
- **Chinking**: removing a span from a chunk; a chink is the part that is removed from a chunk

Chunk and chink patterns are written with [Regular Expressions](https://pythonprogramming.net/regular-expressions-regex-tutorial-python-3/), which have their own language of symbols.

In [1]:
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

import matplotlib.pyplot as plt
%matplotlib inline

### Grouping Sentences & Words

In [2]:
example_text = "Hello Mr Smith, how are you doing today? The weather is great, \
                and Python is awesome. The sky is pinkish-blue. \
                You shouldn't eat cardboard."

In [3]:
print(sent_tokenize(example_text))

['Hello Mr Smith, how are you doing today?', 'The weather is great,                 and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]


In [4]:
print(word_tokenize(example_text))

['Hello', 'Mr', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']


In [5]:
for i in word_tokenize(example_text):
    print(i)

Hello
Mr
Smith
,
how
are
you
doing
today
?
The
weather
is
great
,
and
Python
is
awesome
.
The
sky
is
pinkish-blue
.
You
should
n't
eat
cardboard
.


### Stopwords

In [6]:
example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
print(stop_words)

{'it', 'ourselves', 'each', 'have', 'your', 'down', 'mightn', 'over', 'all', 'this', 'up', 'be', 'hadn', 'while', "she's", "hasn't", 'was', 'needn', 'should', 'why', "it's", 'above', 'most', 'so', 'if', "aren't", 'own', 'for', 'did', 'in', "that'll", "won't", 'by', 'here', 'mustn', 'where', 'about', 'just', "don't", 'then', 'will', 'into', 'further', 'don', "you'd", 'his', 'more', 'too', 'herself', 'after', 'them', 'between', 'you', "wouldn't", 'yourself', 've', 'with', "haven't", 'a', 'having', 'doesn', 'on', 'that', 'hasn', 'any', 'only', 'she', "weren't", 'at', "you've", 'through', 'off', 'themselves', 'other', 'very', 'd', 'y', 'yours', 'itself', "wasn't", "needn't", 'because', 'he', 'now', "you're", 'or', 'couldn', 'whom', 'hers', 'does', 'they', 'an', 'haven', 'isn', 'can', 'the', 'those', 'him', 'am', 'few', 'same', 'o', "doesn't", 'to', 'such', 'won', 'until', 'as', 'ours', 'of', 'i', 'there', 'do', 'her', 'my', "should've", 'its', 'll', 'didn', 'ain', 'under', 'what', "couldn'

In [7]:
word_tokens = word_tokenize(example_sent)

# The comparison is case-sensitive, so capitalized tokens such as 'This' are kept
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
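'This' survives the filter above because the stop-word list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows; `STOP_WORDS` is a small hand-picked stand-in (an assumption for illustration, not the NLTK list) and `remove_stop_words` is a hypothetical helper name.

```python
# Hypothetical, hand-picked subset of English stop words (not the full NLTK list).
STOP_WORDS = {"this", "is", "a", "off", "the"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off',
          'the', 'stop', 'words', 'filtration', '.']
filtered = remove_stop_words(tokens)
print(filtered)
```

Lowercasing only for the comparison keeps the original casing of the tokens that remain.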


### Stemming

In [8]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words = ["python", "pythoner", "pythoning",
                "pythoned", "pythonly"]

In [9]:
[ps.stem(w) for w in example_words]

['python', 'python', 'python', 'python', 'pythonli']

In [10]:
for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli


In [11]:
new_text = "It is important to be very pythonly while you are pythoning with python. \
All pythoners have pythoned poorly at least once."


In [12]:
words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w))

It
is
import
to
be
veri
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc
.


### Tagging

In [13]:
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

In [14]:
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [15]:
def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

    except Exception as e:
        print(str(e))
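To build intuition for what a tag pattern such as `{<RB.?>*<VB.?>*<NNP>+<NN>?}` selects, here is a pure-Python sketch that applies an ordinary regular expression to a string of POS tags. It is a simplified stand-in, not NLTK's implementation; `chunk_by_tag_pattern` is a hypothetical helper and the tagged sentence is hand-written rather than produced by `nltk.pos_tag`.

```python
import re


def chunk_by_tag_pattern(tagged, pattern):
    """Return the word groups whose POS-tag sequence matches `pattern`."""
    # Encode the tag sequence as one string, e.g. "<RB><VBD><NNP>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in re.finditer(pattern, tag_string):
        # Every "<" opens one tag, so counting them maps offsets to token indices.
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


# Same shape as the chunk grammar above: adverbs, verbs, proper nouns, optional noun.
PATTERN = r"(?:<RB.?>)*(?:<VB.?>)*(?:<NNP>)+(?:<NN>)?"

# Hand-tagged example sentence (tags are assumptions, not tagger output).
tagged = [("Yesterday", "NN"), ("quickly", "RB"), ("visited", "VBD"),
          ("New", "NNP"), ("York", "NNP"), ("City", "NNP"),
          ("hall", "NN"), ("and", "CC"), ("Boston", "NNP")]

chunks = chunk_by_tag_pattern(tagged, PATTERN)
print(chunks)
```

Because the pattern requires at least one `NNP`, words that are not attached to a proper noun are left outside every chunk.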

In [16]:
#process_content()

### Chinking

In [17]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)

            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
#             chunked.draw()

    except Exception as e:
        print(str(e))
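The grammar above first chunks everything (`{<.*>+}`) and then chinks out runs of verbs, prepositions, determiners and "to". A rough pure-Python sketch of the same idea, assuming a hand-tagged sentence; `chink` is a hypothetical helper, not an NLTK function.

```python
import re


def chink(tagged, chink_tags=r"VB.?|IN|DT|TO"):
    """Split a tagged sentence into chunks by removing tokens with chink tags."""
    chunks, current = [], []
    for word, tag in tagged:
        if re.fullmatch(chink_tags, tag):
            # A chinked token closes the chunk that was being built.
            if current:
                chunks.append(current)
                current = []
        else:
            current.append(word)
    if current:
        chunks.append(current)
    return chunks


# Hand-tagged example sentence (tags are assumptions, not tagger output).
tagged = [("The", "DT"), ("president", "NN"), ("spoke", "VBD"), ("to", "TO"),
          ("Congress", "NNP"), ("in", "IN"), ("January", "NNP"), (".", ".")]

chinked = chink(tagged)
print(chinked)
```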

In [18]:
#process_content()

### Named Entity Recognition

In [19]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            print(namedEnt)
#             namedEnt.draw()
    except Exception as e:
        print(str(e))

In [20]:
#process_content()

### Lemmatizing

In [21]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

cat
cactus
goose
rock
python
good
best
run
run
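The difference between stemming and lemmatizing can be sketched with toy rules: a stemmer strips suffixes by rule, while a lemmatizer looks up known word forms. The suffix list and lookup table below are hypothetical and far smaller than what `PorterStemmer` and `WordNetLemmatizer` use.

```python
# Hypothetical toy rules and lookup table, for illustration only.
SUFFIXES = ("ing", "ed", "ly", "s")
LEMMA_TABLE = {"geese": "goose", "cacti": "cactus", "cats": "cat"}


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a table of known forms; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["cats", "geese", "cacti"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

Irregular plurals such as "geese" pass through the rule-based stemmer unchanged, which is the case where a dictionary lookup helps.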


### Corpora

In [22]:
print(nltk.__file__)

/Users/marktblack/anaconda3/lib/python3.7/site-packages/nltk/__init__.py


In [23]:
from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer
from nltk.corpus import gutenberg

# sample text
sample = gutenberg.raw("bible-kjv.txt")

tok = sent_tokenize(sample)

for x in range(5):
    print(tok[x])


[The King James Bible]

The Old Testament of the King James Bible

The First Book of Moses:  Called Genesis


1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon
the face of the deep.
And the Spirit of God moved upon the face of the
waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light
from the darkness.


### WordNet / Similarity

In [24]:
from nltk.corpus import wordnet

In [25]:
syns = wordnet.synsets("program")

In [26]:
print(syns[0].name())

plan.n.01


In [27]:
print(syns[0].definition())

a series of steps to be carried out or goals to be accomplished


In [28]:
print(syns[0].examples())

['they drew up a six-step plan', 'they discussed plans for a new bond issue']


In [29]:
synonyms = []
antonyms = []

for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        print("l:", l)
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
            
# print(set(synonyms))
# print('\n')
# print(set(antonyms))

l: Lemma('good.n.01.good')
l: Lemma('good.n.02.good')
l: Lemma('good.n.02.goodness')
l: Lemma('good.n.03.good')
l: Lemma('good.n.03.goodness')
l: Lemma('commodity.n.01.commodity')
l: Lemma('commodity.n.01.trade_good')
l: Lemma('commodity.n.01.good')
l: Lemma('good.a.01.good')
l: Lemma('full.s.06.full')
l: Lemma('full.s.06.good')
l: Lemma('good.a.03.good')
l: Lemma('estimable.s.02.estimable')
l: Lemma('estimable.s.02.good')
l: Lemma('estimable.s.02.honorable')
l: Lemma('estimable.s.02.respectable')
l: Lemma('beneficial.s.01.beneficial')
l: Lemma('beneficial.s.01.good')
l: Lemma('good.s.06.good')
l: Lemma('good.s.07.good')
l: Lemma('good.s.07.just')
l: Lemma('good.s.07.upright')
l: Lemma('adept.s.01.adept')
l: Lemma('adept.s.01.expert')
l: Lemma('adept.s.01.good')
l: Lemma('adept.s.01.practiced')
l: Lemma('adept.s.01.proficient')
l: Lemma('adept.s.01.skillful')
l: Lemma('adept.s.01.skilful')
l: Lemma('good.s.09.good')
l: Lemma('dear.s.02.dear')
l: Lemma('dear.s.02.good')
l: Lemma('dear.s

In [30]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("boat.n.01")

print(w1.wup_similarity(w2))

0.9090909090909091


In [31]:
w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("car.n.01")

print(w1.wup_similarity(w2))

0.6956521739130435


In [32]:
w1 = wordnet.synset("cactus.n.01")
w2 = wordnet.synset("cat.n.01")

print(w1.wup_similarity(w2))

0.5
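`wup_similarity` is the Wu-Palmer measure: twice the depth of the deepest ancestor shared by both concepts, divided by the sum of their depths. The sketch below computes it on a tiny hand-made taxonomy (an assumption for illustration), so its numbers differ from the WordNet scores above.

```python
# Hypothetical toy taxonomy: child -> parent ("entity" is the root).
PARENT = {
    "vehicle": "entity",
    "vessel": "vehicle",
    "ship": "vessel",
    "boat": "vessel",
    "car": "vehicle",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # The first shared node on b's path to the root is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup("ship", "boat"))
print(wup("ship", "car"))
```

Concepts that share a deep ancestor (ship and boat under vessel) score higher than concepts that only meet near the root.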


### Text Classification

In [33]:
import random
from nltk.corpus import movie_reviews

In [34]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

In [35]:
# print(documents[1])

In [36]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]


In [37]:
print(all_words["stupid"])
print(all_words['awesome'])

253
35


### Converting Words to Features w/ NLTK

In [38]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

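# NOTE: in NLTK 3, keys() follows insertion order, so this takes the first 3000
# distinct words seen rather than the 3000 most frequent (see all_words.most_common).
word_features = list(all_words.keys())[:3000]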
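In NLTK 3, `FreqDist` subclasses `collections.Counter`, so slicing its keys selects words in first-seen order rather than by frequency. A small sketch with a plain `Counter` and a made-up token list shows the difference:

```python
from collections import Counter

# Made-up tokens; "rare" appears first but only once.
tokens = ["rare", "the", "the", "movie", "the", "movie"]
counts = Counter(tokens)

first_two_keys = list(counts.keys())[:2]          # first-seen (insertion) order
top_two = [w for w, _ in counts.most_common(2)]   # highest counts first

print(first_two_keys)
print(top_two)
```

If the goal is a vocabulary of the most frequent words, `most_common(n)` is the call that selects by frequency.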

In [39]:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
        
    return features
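To see what `find_features` returns, here is the same logic on a three-word vocabulary. `toy_vocab` and `toy_find_features` are hypothetical names used so the sketch does not shadow the notebook's own variables.

```python
# Hypothetical three-word vocabulary standing in for word_features.
toy_vocab = ["great", "boring", "plot"]


def toy_find_features(document, vocab=toy_vocab):
    """Map each vocabulary word to True/False depending on whether the document has it."""
    words = set(document)
    return {w: (w in words) for w in vocab}


features = toy_find_features(["the", "plot", "was", "great"])
print(features)
```

Each review becomes a fixed-length dictionary of booleans, which is the input shape the NLTK classifiers expect.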

In [40]:
# print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
featuresets = [(find_features(rev), category) for (rev, category) in documents]

## Naive Bayes Classifier

#### High Level Overview of the algorithm

Two main ideas:
1. The "naive" assumption that features are independent of each other given the class
2. Bayes' formula:

$$P(A|B)=\frac{P(A)\,P(B|A)}{P(B)}$$

or 

$$\text{posterior}=\frac{\text{prior}\times\text{likelihood}}{\text{evidence}}$$
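A worked example of the formula with the independence assumption: the numerator for each class is the prior multiplied by one likelihood per observed word, and the evidence is the sum of the numerators. All probabilities below are made up for illustration.

```python
# Hypothetical probabilities, chosen only to make the arithmetic easy to follow.
prior = {"pos": 0.5, "neg": 0.5}
likelihood = {
    "pos": {"sucks": 0.01, "lame": 0.02},   # P(word present | pos)
    "neg": {"sucks": 0.10, "lame": 0.08},   # P(word present | neg)
}


def posterior(observed_words):
    """Return P(class | observed words) for every class."""
    scores = {}
    for label in prior:
        score = prior[label]
        # The "naive" step: multiply per-word likelihoods as if independent.
        for word in observed_words:
            score *= likelihood[label][word]
        scores[label] = score
    evidence = sum(scores.values())
    return {label: score / evidence for label, score in scores.items()}


result = posterior(["sucks", "lame"])
print(result)
```

With these numbers the negative class gets 0.004 / (0.004 + 0.0001), roughly 0.976, so two weakly negative words together give a confident prediction.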

In [41]:
train_end = 1900

training_set = featuresets[:train_end]
testing_set = featuresets[train_end:]

In [42]:
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Performance:\n")
print(nltk.classify.accuracy(classifier, testing_set)*100)
classifier.show_most_informative_features(15)

Naive Bayes Performance:

78.0
Most Informative Features
              schumacher = True              neg : pos    =     11.7 : 1.0
                  regard = True              pos : neg    =     11.0 : 1.0
                   sucks = True              neg : pos    =     10.6 : 1.0
                  annual = True              pos : neg    =      9.6 : 1.0
                bothered = True              neg : pos    =      8.4 : 1.0
                  turkey = True              neg : pos    =      8.2 : 1.0
           unimaginative = True              neg : pos    =      7.7 : 1.0
                 idiotic = True              neg : pos    =      7.2 : 1.0
                    mena = True              neg : pos    =      7.0 : 1.0
             silverstone = True              neg : pos    =      7.0 : 1.0
                  shoddy = True              neg : pos    =      7.0 : 1.0
                  suvari = True              neg : pos    =      7.0 : 1.0
                    lame = True            

In [43]:
import pickle

# Save the trained classifier (uncomment to overwrite the pickle on disk)
# with open("./data/naivebayes.pickle", "wb") as save_classifier:
#     pickle.dump(classifier, save_classifier)

# Load a previously saved classifier
with open("./data/naivebayes.pickle", "rb") as classifier_f:
    classifier = pickle.load(classifier_f)

In [44]:
print(nltk.classify.accuracy(classifier, testing_set)*100)
classifier.show_most_informative_features(10)

70.0
Most Informative Features
               insulting = True              neg : pos    =     10.2 : 1.0
                    sans = True              neg : pos    =      9.0 : 1.0
            refreshingly = True              pos : neg    =      8.3 : 1.0
              mediocrity = True              neg : pos    =      7.0 : 1.0
             bruckheimer = True              neg : pos    =      6.3 : 1.0
                   wires = True              neg : pos    =      6.3 : 1.0
               dismissed = True              pos : neg    =      6.3 : 1.0
             overwhelmed = True              pos : neg    =      6.3 : 1.0
                  wasted = True              neg : pos    =      6.0 : 1.0
                flawless = True              pos : neg    =      5.9 : 1.0


## Scikit-Learn

In [45]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB

#### Multinomial NB

In [46]:
mnb_classifier = SklearnClassifier(MultinomialNB())
mnb_classifier.train(training_set)
print('MNB Classifier accuracy percent:',
     (nltk.classify.accuracy(mnb_classifier, testing_set))*100)

MNB Classifier accuracy percent: 77.0


#### Bernoulli NB

In [47]:
bnb_classifier = SklearnClassifier(BernoulliNB())
bnb_classifier.train(training_set)
print('BNB Classifier accuracy percent:',
     (nltk.classify.accuracy(bnb_classifier, testing_set))*100)

BNB Classifier accuracy percent: 78.0


## Logistic Regression, SGDClassifier, SVM

In [48]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

In [49]:
LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", 
      (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)



LogisticRegression_classifier accuracy percent: 79.0


In [50]:
SGDC_classifier = SklearnClassifier(SGDClassifier())
SGDC_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", 
      (nltk.classify.accuracy(SGDC_classifier, testing_set))*100)

SGDClassifier_classifier accuracy percent: 79.0




In [51]:
SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(training_set)
print("SVC_classifier accuracy percent:", 
      (nltk.classify.accuracy(SVC_classifier, testing_set))*100)



SVC_classifier accuracy percent: 78.0


In [52]:
LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", 
      (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

LinearSVC_classifier accuracy percent: 78.0


In [53]:
NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", 
      (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)

NuSVC_classifier accuracy percent: 82.0


### Combining Algorithms to Create a Voting System

In [54]:
from nltk.classify import ClassifierI
from statistics import mode

In [55]:
class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf
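The voting logic can be tried without training anything by using stub classifiers. `FixedClassifier` and `vote` are hypothetical names for this sketch; only `statistics.mode` comes from the code above.

```python
from statistics import mode


class FixedClassifier:
    """Hypothetical stand-in that always predicts the same label."""

    def __init__(self, label):
        self.label = label

    def classify(self, features):
        return self.label


def vote(classifiers, features):
    """Return the majority label and the share of classifiers that chose it."""
    votes = [c.classify(features) for c in classifiers]
    winner = mode(votes)
    return winner, votes.count(winner) / len(votes)


winner, confidence = vote(
    [FixedClassifier("pos"), FixedClassifier("neg"), FixedClassifier("pos")], {}
)
print(winner, confidence)
```

An odd number of classifiers avoids ties between two labels: on Python versions before 3.8, `mode` raises `StatisticsError` when there is no single most common value, and from 3.8 on it returns the first one encountered.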

In [56]:
voted_classifier = VoteClassifier(classifier,
                                  NuSVC_classifier,
                                  LinearSVC_classifier,
                                  SGDC_classifier,
                                  mnb_classifier,
                                  bnb_classifier,
                                  LogisticRegression_classifier)

print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier, testing_set))*100)

# Classify the first six test reviews and report the share of agreeing votes
for features, _ in testing_set[:6]:
    print("Classification:", voted_classifier.classify(features), "Confidence %:", voted_classifier.confidence(features)*100)

In [57]:
with open("./data/positive.txt", "r", encoding='latin-1') as f:
    short_pos = f.read()
with open("./data/negative.txt", "r", encoding='latin-1') as f:
    short_neg = f.read()

In [58]:
documents = []

for r in short_pos.split('\n'):
    documents.append((r, "pos"))
    
for r in short_neg.split('\n'):
    documents.append((r, "neg"))
    
all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())
    
for w in short_neg_words:
    all_words.append(w.lower())

In [59]:
all_words = nltk.FreqDist(all_words)

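# NOTE: in NLTK 3, keys() follows insertion order, so this takes the first 5000
# distinct words seen rather than the 5000 most frequent (see all_words.most_common).
word_features = list(all_words.keys())[:5000]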

def find_features(document):
    words = set(word_tokenize(document))  # a set gives O(1) membership tests
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

In [60]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]

random.shuffle(featuresets)

In [61]:
training_set = featuresets[:10000]
testing_set = featuresets[10000:]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDC_classifier = SklearnClassifier(SGDClassifier())
SGDC_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)


voted_classifier = VoteClassifier(
                                  NuSVC_classifier,
                                  LinearSVC_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)

print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier, testing_set))*100)

## Sentiment Module

In [62]:
import sentiment_mod as s

10664


In [63]:
# positive review
print(s.sentiment("This movie was awesome! The acting was great, plot was wonderful, and there were pythons...so yea!"))

('pos', 1.0)


In [64]:
# negative review
print(s.sentiment("This movie was utter junk. There were absolutely 0 pythons. I don't see what the point was at all. Horrible movie, 0/10"))

('neg', 1.0)


In [75]:
# lukewarm review
print(s.sentiment("I've seen better"))


('neg', 1.0)
