# Simple applications of NLTK

In the following notebook, we illustrate some applications of the NLTK (Natural Language Toolkit) library for NLP. For an overview of important NLP terminology, see [here](https://www.kdnuggets.com/2017/02/natural-language-processing-key-terms-explained.html). For a nice overview of applications of NLTK in Python, consider this [link](https://likegeeks.com/nlp-tutorial-using-python-nltk/). For a broader perspective on other Python libraries for NLP, go [here](https://medium.com/@srimanikantapalakollu/top-5-natural-language-processing-python-libraries-for-data-scientist-32463d36feae).

### The points we focus on here are:

- Decomposition of a text into tokens (sentences and words)
- Stemming (stripping word endings so that related word forms map to a common stem)
- POS tagging (part-of-speech tagging, to identify the type of word: noun, verb, adjective etc.)
- Relevance of words/sentiment analysis

For this, we use data originally retrieved from [here](https://www.kaggle.com/snap/amazon-fine-food-reviews). A processed version of it is provided in the repository. For a gentle (and more detailed) introduction to the first part (tokenizing), see also [here](https://github.com/andreaspts/ML_NLP_Analyses/blob/master/simple_exploration_into_regular_expressions.ipynb). For the last part, we use the information gained before to train a (multinomial) naive Bayes model. (The latter was explored [here](https://github.com/andreaspts/ML_Naive_Bayes_for_Classification).)

In [1]:
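#import relevant packages
#note: the tokenizer, tagger and WordNet resources must be downloaded once with nltk.download() before the cells below will run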
import nltk

## 1) Tokenizing

In [2]:
#tokenization via a trained machine learning model

text = "He went into a supermarket in St. Louis. There, he bought wine for 10$."

sentences = nltk.sent_tokenize(text)

print(sentences)

['He went into a supermarket in St. Louis.', 'There, he bought wine for 10$.']


Notice that the tokenizer (a machine learning model) is smart enough to split at the full stop that ends a sentence and not at the period after "St"!
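To see why a trained model helps here, consider a naive rule that splits after every full stop followed by whitespace. The sketch below uses only Python's `re` module; the function name `naive_sent_split` is made up for this illustration and is not part of NLTK.

```python
import re

def naive_sent_split(text):
    # split wherever a full stop is followed by whitespace (no knowledge of abbreviations)
    return re.split(r"(?<=\.)\s+", text)

text = "He went into a supermarket in St. Louis. There, he bought wine for 10$."

for part in naive_sent_split(text):
    print(part)
```

The naive rule yields three pieces because it also breaks after "St.", whereas `nltk.sent_tokenize` returns the two sentences shown above.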

In [3]:
#splitting each sentence into words is done via a regular-expression-based tokenizer
for sentence in sentences:
    print(nltk.word_tokenize(sentence))

['He', 'went', 'into', 'a', 'supermarket', 'in', 'St.', 'Louis', '.']
['There', ',', 'he', 'bought', 'wine', 'for', '10', '$', '.']


In [4]:
#number of tokens in the last sentence (the loop variable still holds it)
len(nltk.word_tokenize(sentence))

9
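As a rough illustration of regular-expression-based word splitting, the following sketch separates runs of word characters from single punctuation symbols. It is a simplification written for this notebook (the name `simple_word_tokenize` is hypothetical), not the rule set that `nltk.word_tokenize` actually applies.

```python
import re

def simple_word_tokenize(sentence):
    # a token is either a run of word characters or a single non-space symbol
    return re.findall(r"\w+|[^\w\s]", sentence)

print(simple_word_tokenize("There, he bought wine for 10$."))
print(simple_word_tokenize("He went into a supermarket in St. Louis."))
```

For the second sentence of the example, this simple pattern gives the same tokens as above. For the first one, it splits "St." into "St" and ".", which NLTK keeps together as a single token.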

## 2) POS-Tagging

POS tagging can be done in various ways (i.e. with different machine learning methods) using the NLTK library. An overview with a performance comparison can be found here: https://natemccoy.github.io/2016/10/27/evaluatingnltktaggerstutorial.html.

The following discussion is based on the perceptron tagger.

In [5]:
#assign a part-of-speech tag to each token
for sentence in sentences:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))

[('He', 'PRP'), ('went', 'VBD'), ('into', 'IN'), ('a', 'DT'), ('supermarket', 'NN'), ('in', 'IN'), ('St.', 'NNP'), ('Louis', 'NNP'), ('.', '.')]
[('There', 'EX'), (',', ','), ('he', 'PRP'), ('bought', 'VBD'), ('wine', 'NN'), ('for', 'IN'), ('10', 'CD'), ('$', '$'), ('.', '.')]


The tags assigned to the tokens are Penn Treebank tags.
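For readers unfamiliar with the tag set, the sketch below spells out the tags that occur in the output above. The glossary `TAG_MEANINGS` and the helper `describe` are written by hand for this illustration and cover only these few tags.

```python
# short glossary for the Penn Treebank tags that occur in the output above
TAG_MEANINGS = {
    "PRP": "personal pronoun",
    "VBD": "verb, past tense",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "EX": "existential 'there'",
    "CD": "cardinal number",
}

# first tagged sentence, copied from the output above
tagged = [("He", "PRP"), ("went", "VBD"), ("into", "IN"), ("a", "DT"),
          ("supermarket", "NN"), ("in", "IN"), ("St.", "NNP"), ("Louis", "NNP"), (".", ".")]

def describe(tagged_tokens):
    # punctuation and symbols are tagged with themselves, so fall back to a generic label
    return [(word, TAG_MEANINGS.get(tag, "punctuation or symbol")) for word, tag in tagged_tokens]

for word, meaning in describe(tagged):
    print(f"{word:12s} {meaning}")
```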

In [6]:
#format each tagged token as word/TAG and join the tokens into one string per sentence

for sentence in sentences:
    tagged_words = nltk.pos_tag(nltk.word_tokenize(sentence))
    
    final_sentence = []
    for tagged_word in tagged_words:
        final_sentence.append(tagged_word[0] + "/" + tagged_word[1])
        
    #print(final_sentence)
    print(" ".join(final_sentence))

He/PRP went/VBD into/IN a/DT supermarket/NN in/IN St./NNP Louis/NNP ./.
There/EX ,/, he/PRP bought/VBD wine/NN for/IN 10/CD $/$ ./.


These tagged sentences could be further processed using machine learning methods.
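As one possible first step of such further processing, the sketch below reads a `word/TAG` string back into pairs and counts the tags. The helper name `parse_tagged` is an assumption made for this example; it splits at the last slash so that tokens like `./.` are handled.

```python
from collections import Counter

# first tagged sentence, copied from the output above
tagged_sentence = "He/PRP went/VBD into/IN a/DT supermarket/NN in/IN St./NNP Louis/NNP ./."

def parse_tagged(line):
    # split on whitespace, then split each item at its last slash into (word, tag)
    return [tuple(item.rsplit("/", 1)) for item in line.split()]

pairs = parse_tagged(tagged_sentence)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs[:3])
print(tag_counts)
```

Tag counts of this kind could serve as simple numeric features for a downstream model.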

## 3) Stemming

An overview of stemming algorithms for various European languages can be found here: https://snowballstem.org/algorithms/.

In [7]:
#importing relevant packages
from nltk.stem import SnowballStemmer

In [8]:
s = SnowballStemmer("english")

In [9]:
s.stem("cars")

'car'

In [10]:
s.stem("quickly")

'quick'

In [11]:
s.stem("followed")

'follow'
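To make the idea of suffix stripping concrete, here is a deliberately crude toy stemmer. It is not the Snowball algorithm, which applies a much richer set of rules; the suffix list, the minimum stem length and the name `toy_stem` are assumptions chosen for this sketch.

```python
SUFFIXES = ("ing", "ly", "ed", "s")  # checked in this order
MIN_STEM_LENGTH = 2                  # never strip a word down to fewer characters

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # strip the first matching suffix, provided enough of the word remains
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

for w in ["cars", "quickly", "followed", "was", "went"]:
    print(w, "->", toy_stem(w))
```

The toy rules reproduce the three examples above, but they also turn "was" into "wa" and leave the irregular form "went" untouched, which hints at why real stemmers need more careful rules.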

## 4) Lemmatizing

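Lemmatizers work in a similar fashion to stemmers, but instead of stripping endings they look up the dictionary form (lemma) of a word, here via WordNet. This also handles irregular forms (for example, "is" becomes "be"), at the cost of computational time.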

In [12]:
#import relevant packages
from nltk.stem.wordnet import WordNetLemmatizer

l = WordNetLemmatizer()

In [13]:
l.lemmatize("going", "v") #"v" tells the lemmatizer to treat the word as a verb

'go'
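Conceptually, a lemmatizer behaves like a dictionary lookup that depends on both the word and its part of speech. The sketch below mimics this with a tiny hand-written table; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical and far simpler than what WordNet provides.

```python
# tiny hand-written lookup table, keyed by (word, part of speech); purely illustrative
TOY_LEMMAS = {
    ("going", "v"): "go",
    ("went", "v"): "go",
    ("is", "v"): "be",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    # fall back to the unchanged word when the table has no entry
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("went", "v"))
print(toy_lemmatize("going", "v"))
print(toy_lemmatize("going"))
```

The last call shows why the part of speech matters: without the verb hint, the toy lookup treats "going" as a noun and returns it unchanged.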

In [14]:
sentence = "He is going to the supermarket down the street."

words_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

In [15]:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    elif treebank_tag.startswith("V"):
        return wordnet.VERB
    elif treebank_tag.startswith("N"):
        return wordnet.NOUN
    elif treebank_tag.startswith("R"):
        return wordnet.ADV
    else:
        return wordnet.NOUN #if we can't identify the type, we return that it is a noun (crude)
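The mapping above looks only at the first letter of the Treebank tag. A compact way to check its behaviour without loading WordNet is sketched below; it writes WordNet's one-letter part-of-speech codes out as plain strings, and the name `treebank_to_wordnet` is made up for this example.

```python
# WordNet's one-letter POS codes, written out as plain strings for this sketch
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

PREFIX_TO_POS = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}

def treebank_to_wordnet(treebank_tag):
    # look at the first letter only; anything unknown falls back to NOUN
    return PREFIX_TO_POS.get(treebank_tag[:1], NOUN)

for tag in ["JJ", "JJR", "VBD", "VBG", "NNP", "RB", "PRP", "."]:
    print(tag, "->", treebank_to_wordnet(tag))
```

Tags such as `PRP` or punctuation have no WordNet counterpart and end up as nouns, which is the crude fallback mentioned in the comment above.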

In [16]:
for word in words_tagged:
    
    print(l.lemmatize(word[0], get_wordnet_pos(word[1])))

He
be
go
to
the
supermarket
down
the
street
.


## 5) Application: Combination of NLTK and machine learning

In [17]:
#import relevant packages

import pandas as pd

In [41]:
#define data frame

df = pd.read_csv("./Reviews_10000.csv.bz2")

df.head()

| | Id | ProductId | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [42]:
df.shape

(10000, 8)

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
Id                        10000 non-null int64
ProductId                 10000 non-null object
HelpfulnessNumerator      10000 non-null int64
HelpfulnessDenominator    10000 non-null int64
Score                     10000 non-null int64
Time                      10000 non-null int64
Summary                   10000 non-null object
Text                      10000 non-null object
dtypes: int64(5), object(3)
memory usage: 625.1+ KB


In [44]:
#select the text column containing the reviews
#(append .sample(n) to work on a random subset while experimenting)
texts = df["Text"]

In [45]:
#print(texts)

In [46]:
#reduce each review to a string of its adjectives

texts_transformed = []

for review in texts:
    #split the review into sentences
    sentences = nltk.sent_tokenize(review)
    adjectives = []
    for sentence in sentences:
        #tokenize the sentence into words
        words = nltk.word_tokenize(sentence)
        #proceed with POS tagging
        words_tagged = nltk.pos_tag(words)
        for word, tag in words_tagged:
            #keep base-form adjectives only (tag "JJ"); comparatives ("JJR") and superlatives ("JJS") are dropped
            if tag == "JJ":
                adjectives.append(word)
    texts_transformed.append(" ".join(adjectives))
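The filter above compares the tag with `"JJ"` exactly. The sketch below contrasts this with a prefix test that would also keep comparative and superlative adjectives. The tagged tokens are assigned by hand for illustration (they are not tagger output), and both function names are hypothetical.

```python
# hand-assigned tags for illustration only, not the output of a tagger
tagged_review = [
    ("The", "DT"), ("best", "JJS"), ("coffee", "NN"), ("is", "VBZ"),
    ("stronger", "JJR"), ("and", "CC"), ("fresh", "JJ"), (".", "."),
]

def adjectives_exact(tagged_tokens):
    # same test as in the cell above: base-form adjectives only
    return [word for word, tag in tagged_tokens if tag == "JJ"]

def adjectives_all(tagged_tokens):
    # variant: also keep comparatives (JJR) and superlatives (JJS)
    return [word for word, tag in tagged_tokens if tag.startswith("JJ")]

print(" ".join(adjectives_exact(tagged_review)))
print(" ".join(adjectives_all(tagged_review)))
```

Whether words like "best" should be included is a modelling choice; the results reported below were obtained with the exact `"JJ"` test.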

In [47]:
len(texts_transformed)

10000

In [48]:
#print(texts_transformed)

In [49]:
#importing relevant packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [113]:
#define variables for training

X = texts_transformed
#label a review as positive (True) if its score is at least 3
Y = (df["Score"] >= 3)
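The labelling rule can be tried out on the five scores visible in `df.head()`. The sketch also computes a majority-class baseline, i.e. the accuracy of always predicting the most common label, which is a useful reference when judging a classifier. The helper `majority_baseline` is an assumption of this example and not part of scikit-learn.

```python
from collections import Counter

# the five scores shown in df.head() above
scores = [5, 1, 4, 2, 5]

# same rule as in the cell above: a score of at least 3 counts as positive
labels = [score >= 3 for score in scores]

def majority_baseline(labels):
    # accuracy of a classifier that always predicts the most common label
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

print(labels)
print(majority_baseline(labels))
```

Applied to the full label column, e.g. `majority_baseline(list(Y))`, this gives the value that a trained model should exceed.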

In [114]:
#train/test splitting
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 0, test_size = 0.2)

In [115]:
#words and their occurrences have to be transformed into integer counts
cv = CountVectorizer(max_features = 100) # keep only the max_features most frequent words
cv.fit(X_train)

X_train = cv.transform(X_train)
X_test = cv.transform(X_test)

In [116]:
X_train.shape

(8000, 100)
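What the vectorizer does can be approximated in a few lines of plain Python: count all words, keep the most frequent ones as vocabulary, and turn each document into a vector of counts. This is a simplified sketch under the assumption of lowercasing and tokens of at least two word characters; the function names are made up, and details such as tie-breaking may differ from `CountVectorizer`.

```python
import re
from collections import Counter

def tokenize(doc):
    # lowercase and keep tokens of at least two word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def build_vocabulary(docs, max_features):
    # count words over the whole corpus and keep the most frequent ones, sorted alphabetically
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return sorted(word for word, _ in counts.most_common(max_features))

def vectorize(doc, vocabulary):
    # one count per vocabulary word; words outside the vocabulary are ignored
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]

docs = ["good great good", "bad", "great good tasty", "bad bad good"]
vocabulary = build_vocabulary(docs, max_features=3)

print(vocabulary)
print([vectorize(doc, vocabulary) for doc in docs])
```

In this toy corpus, "tasty" is the least frequent word and is dropped, just as all but the 100 most frequent adjectives are dropped above.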

In [117]:
#define Naive Bayes ML model: MultinomialNB

model = MultinomialNB()
model.fit(X_train, Y_train)

print("Training score: " + str(model.score(X_train, Y_train)))
print("Test score: "  + str(model.score(X_test, Y_test)))

Training score: 0.854125
Test score: 0.8455
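For intuition about what the model computes, here is a minimal multinomial naive Bayes written from scratch: per class, it estimates a prior and smoothed word probabilities, and it predicts the class with the highest log-score. It is a teaching sketch on made-up count vectors (vocabulary "bad", "good", "great"), not scikit-learn's implementation, and the function names are hypothetical.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    # X: list of count vectors, y: list of class labels, alpha: additive (Laplace) smoothing
    log_prior, feature_log_prob = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # smoothed total count of each word within class c
        totals = [sum(column) + alpha for column in zip(*rows)]
        denominator = sum(totals)
        feature_log_prob[c] = [math.log(t / denominator) for t in totals]
    return log_prior, feature_log_prob

def predict(x, log_prior, feature_log_prob):
    # score(c) = log P(c) + sum over words of count * log P(word | c)
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, feature_log_prob[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# made-up counts for the words ["bad", "good", "great"]
X = [[0, 2, 1], [1, 0, 0], [0, 1, 1], [2, 1, 0]]
y = [True, False, True, False]

log_prior, feature_log_prob = fit_multinomial_nb(X, y)
print(predict([0, 1, 1], log_prior, feature_log_prob))
print(predict([2, 0, 0], log_prior, feature_log_prob))
```

The per-class log-probabilities computed here play the same role as the model parameters inspected in the following cells.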


In [118]:
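#feature names in column order; the model coefficients below follow the same order
#(get_feature_names() was deprecated in scikit-learn 1.0 and removed in later releases; cv.get_feature_names_out() is the replacement)
cv.get_feature_names()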

['able',
 'added',
 'amazing',
 'artificial',
 'available',
 'awesome',
 'bad',
 'big',
 'bitter',
 'black',
 'bold',
 'br',
 'brown',
 'cheap',
 'cold',
 'dark',
 'delicious',
 'different',
 'difficult',
 'disappointed',
 'dry',
 'due',
 'easy',
 'excellent',
 'expensive',
 'extra',
 'fantastic',
 'fat',
 'favorite',
 'few',
 'fine',
 'first',
 'flavored',
 'flavorful',
 'free',
 'french',
 'fresh',
 'full',
 'glad',
 'gluten',
 'good',
 'great',
 'green',
 'happy',
 'hard',
 'healthy',
 'high',
 'hot',
 'huge',
 'large',
 'last',
 'light',
 'little',
 'local',
 'long',
 'low',
 'many',
 'mild',
 'much',
 'natural',
 'new',
 'next',
 'nice',
 'old',
 'only',
 'organic',
 'original',
 'other',
 'own',
 'perfect',
 'picky',
 'pleased',
 'quick',
 'real',
 'regular',
 'rich',
 'right',
 'same',
 'second',
 'several',
 'similar',
 'single',
 'small',
 'smooth',
 'soft',
 'sour',
 'special',
 'strong',
 'such',
 'super',
 'sure',
 'sweet',
 'tasty',
 'top',
 'weak',
 'white',
 'whole',
 'w

In [119]:
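#the fitted parameters of the MultinomialNB model: log-probability of each feature given the positive class
#(coef_ was deprecated for naive Bayes models in scikit-learn 0.24 and removed in later releases; model.feature_log_prob_[1] holds the same values)
model.coef_[0]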

array([-5.02042506, -5.6974861 , -5.51339306, -5.50458243, -4.91290471,
       -5.64566103, -4.60242621, -4.72735213, -5.06463115, -5.3135272 ,
       -5.23014559, -4.76014196, -5.80996408, -6.09764615, -5.76344407,
       -5.25034829, -4.00441129, -4.29798151, -5.88407205, -6.25179683,
       -5.46165739, -5.68690399, -4.23689381, -4.5204691 , -4.75598395,
       -5.01503421, -5.59638998, -5.7189923 , -4.09017811, -4.21009562,
       -5.25717426, -4.03622312, -5.36558356, -5.32079995, -4.02817478,
       -5.56795205, -4.40013922, -4.78545977, -5.3135272 , -5.10498244,
       -2.44719408, -2.71892163, -4.89844663, -4.53367062, -4.57435419,
       -4.43564591, -4.4538835 , -3.97358033, -5.84633173, -5.0932864 ,
       -4.69560344, -4.9275749 , -3.38011721, -4.54368739, -4.97292272,
       -4.76431733, -3.988877  , -5.77487276, -3.84719013, -4.50420858,
       -4.83360911, -5.15318455, -4.03420495, -4.40013922, -4.46312856,
       -4.55042143, -5.7189923 , -3.12926876, -4.71931996, -4.35

In [120]:
#we assemble these together to get a list of tuples
adj = list(zip(model.coef_[0], cv.get_feature_names()))

#sort this list according to coefficients in the tuples
adj = sorted(adj)

In [121]:
for i in adj:
    print(i)

(-6.251796833976684, 'disappointed')
(-6.097646154149426, 'cheap')
(-5.884072053851367, 'difficult')
(-5.884072053851367, 'sour')
(-5.858754245867077, 'worth')
(-5.846331725868519, 'huge')
(-5.834061633276705, 'top')
(-5.809964081697645, 'brown')
(-5.798129624050642, 'special')
(-5.774872761886375, 'mild')
(-5.7634440660627515, 'cold')
(-5.7634440660627515, 'picky')
(-5.7634440660627515, 'wrong')
(-5.718992303491918, 'fat')
(-5.718992303491918, 'original')
(-5.6974860982709545, 'added')
(-5.686903988940418, 'due')
(-5.676432689073122, 'super')
(-5.645661030406369, 'awesome')
(-5.645661030406369, 'weak')
(-5.635610694552867, 'similar')
(-5.596389981399586, 'fantastic')
(-5.567952046079053, 'french')
(-5.540300514748542, 'pleased')
(-5.5133930618286175, 'amazing')
(-5.504582432146463, 'artificial')
(-5.504582432146463, 'second')
(-5.487190689434594, 'quick')
(-5.487190689434594, 'single')
(-5.461657387429429, 'dry')
(-5.365583557339807, 'flavored')
(-5.335506102102529, 'such')
(-5.320799

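The coefficients are the log-probabilities of each adjective given the positive class (score of at least 3), so a more negative number means that the adjective is rarer in positive reviews. Adjectives with a negative connotation such as 'disappointed', 'cheap' and 'difficult' indeed receive the most negative numbers. Note, however, that infrequent positive adjectives such as 'awesome' and 'fantastic' also rank low: the coefficients measure how frequent a word is within one class, not the contrast between the two classes.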
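Log-probabilities for a single class are strongly influenced by how frequent a word is overall. One way to contrast the two classes directly is to take the difference of the per-class log-probabilities, sketched here with made-up probabilities for three words. The numbers and the name `log_ratios` are assumptions for illustration, not results from the data above.

```python
import math

vocabulary = ["bad", "good", "great"]

# made-up per-class word probabilities P(word | class); each list sums to 1
p_given_negative = [4 / 7, 2 / 7, 1 / 7]
p_given_positive = [1 / 8, 4 / 8, 3 / 8]

def log_ratios(p_pos, p_neg):
    # log P(word | positive) - log P(word | negative); above 0 means more typical of positive reviews
    return [math.log(p) - math.log(q) for p, q in zip(p_pos, p_neg)]

ranked = sorted(zip(log_ratios(p_given_positive, p_given_negative), vocabulary))

for ratio, word in ranked:
    print(f"{word:6s} {ratio:+.3f}")
```

With a fitted scikit-learn model, the analogous quantity would presumably be `model.feature_log_prob_[1] - model.feature_log_prob_[0]`, assuming that `model.classes_` is ordered as `[False, True]`.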