# NLP in Action
> My journey to learn NLP

- toc: true
- branch: master
- badges: false
- comments: false
- categories: [fastpages, jupyter]

## System tools

### Install a pip package in the current Jupyter kernel

In [None]:
# import sys
# !{sys.executable} -m pip install vaderSentiment

## NLP Tools

### Regular Expressions

In [61]:
import re

#### basics of regular expressions

- [] denotes a character class
- r'[\s]' is equivalent to r'[ \t\n\r\x0b\x0c]': it matches any whitespace character (space, tab, newline, carriage return, vertical tab, form feed)
- r'[a-z]' matches any lowercase letter
- r'[0-9]' matches any digit
- r'[_a-zA-Z]' matches an underscore or any letter of the English alphabet
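
To see these character classes in action, here is a small sketch using `re.findall` on a made-up string. The sample text and variable names are only illustrative.

```python
import re

sample = "Rosa_7 said:\thello\nworld 42"  # hypothetical sample text

# [\s] matches one whitespace character at a time (space, tab, newline, ...)
whitespace = re.findall(r'[\s]', sample)

# [a-z] matches one lowercase ASCII letter at a time
lowercase = re.findall(r'[a-z]', sample)

# [0-9] matches one digit at a time
digits = re.findall(r'[0-9]', sample)

# [_a-zA-Z]+ matches runs of underscores and letters
words = re.findall(r'[_a-zA-Z]+', sample)

print(whitespace)
print(digits)
print(words)
```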

Match any sentence that begins with hi, hello, or hey, followed by optional spaces and a word

In [18]:
r = "(hi|hello|hey)[ ]*([a-z]*)"

In [45]:
print(re.match(r, 'Hello Rosa', flags=re.IGNORECASE))
print(re.match(r, "hi ho, hi ho, it's off to work ...", flags=re.IGNORECASE))
print(re.match(r, "hey, what's up", flags=re.IGNORECASE))

<re.Match object; span=(0, 10), match='Hello Rosa'>
<re.Match object; span=(0, 5), match='hi ho'>
<re.Match object; span=(0, 3), match='hey'>


Example with a complex pattern

In [22]:
r = r"[^a-z]*([y]o|[h']?ello|ok|hey|(good[ ])?(morn[gin']{0,3}|"\
     r"afternoon|even[gin']{0,3}))[\s,;:]{1,3}([a-z]{1,20})"

re_greeting = re.compile(r, flags=re.IGNORECASE)

In [46]:
print(re_greeting.match('Hello Rosa'))
print(re_greeting.match('Hello Rosa').groups())
print(re_greeting.match("Good morning Rosa"))
print(re_greeting.match("Good Manning Rosa"))
print(re_greeting.match('Good evening Rosa Parks').groups())
print(re_greeting.match("Good Morn'n Rosa"))
print(re_greeting.match("yo Rosa"))

<re.Match object; span=(0, 10), match='Hello Rosa'>
('Hello', None, None, 'Rosa')
<re.Match object; span=(0, 17), match='Good morning Rosa'>
None
('Good evening', 'Good ', 'evening', 'Rosa')
<re.Match object; span=(0, 16), match="Good Morn'n Rosa">
<re.Match object; span=(0, 7), match='yo Rosa'>
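
A pattern this dense is easier to read when split across lines. The sketch below rewrites the same expression with `re.VERBOSE`, which ignores whitespace outside character classes and allows inline comments. The name `greeting_pattern` is just a label chosen for this example.

```python
import re

greeting_pattern = re.compile(r"""
    [^a-z]*                 # skip any leading non-letters
    (                       # group 1: the whole greeting
        [y]o | [h']?ello | ok | hey
      | (good[ ])?          # group 2: optional "good "
        (morn[gin']{0,3} | afternoon | even[gin']{0,3})  # group 3: time of day
    )
    [\s,;:]{1,3}            # one to three separator characters
    ([a-z]{1,20})           # group 4: the name being greeted
""", flags=re.IGNORECASE | re.VERBOSE)

print(greeting_pattern.match('Good evening Rosa Parks').groups())
print(greeting_pattern.match('Good Manning Rosa'))
```

The last group is always the greeted name, which is why the chat bot in the next section reads `match.groups()[-1]`.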


### Simple Chat Bot

Enter a greeting such as 'good morning rosa' or 'hello rose'

In [33]:
my_names = {'rosa', 'rose', 'chatty', 'chatbot', 'bot', 'chatterbot'}
curt_names = {'hal', 'you', 'u'}
greeter_name = ''  # we don't know yet who is talking to the bot
match = re_greeting.match(input())

if match:
    # the last capture group holds the name being greeted
    at_name = match.groups()[-1].lower()
    if at_name in curt_names:
        print("Good one.")
    elif at_name in my_names:
        print("Hi {}, How are you?".format(greeter_name))

 hello rose


Hi , How are you?


### Word Permutations

All permutations of length n=3 of the words in 'Good morning Rosa!'

In [34]:
from itertools import permutations
print([" ".join(combo) for combo in permutations("Good morning Rosa!".split(), 3)])

['Good morning Rosa!', 'Good Rosa! morning', 'morning Good Rosa!', 'morning Rosa! Good', 'Rosa! Good morning', 'Rosa! morning Good']


### Count Words

Count words with `Counter`, which returns a dict-like object mapping each token to its count

In [36]:
from collections import Counter
print(Counter("Guten Morgen Rosa".split()))
print(Counter("Good morning morning , Rosa!".split()))

Counter({'Guten': 1, 'Morgen': 1, 'Rosa': 1})
Counter({'morning': 2, 'Good': 1, ',': 1, 'Rosa!': 1})
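
Since `Counter` is a `dict` subclass, it also offers helpers such as `most_common`. The sketch below assumes we lowercase the tokens and strip surrounding punctuation before counting; that normalization is a simplification added for this example, not something the cell above does.

```python
from collections import Counter
import string

text = "Good morning morning , Rosa!"

# Lowercase each token and strip leading/trailing punctuation
tokens = [tok.strip(string.punctuation).lower() for tok in text.split()]
# Drop tokens that were pure punctuation and are now empty
tokens = [tok for tok in tokens if tok]

counts = Counter(tokens)
print(counts.most_common(1))              # the single most frequent token
print(counts['rosa'], counts['missing'])  # absent keys count as 0
```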


## Word Tokenization

### Tokens

#### split into tokens

In [39]:
sentence = "Thomas Jefferson began building Monticello at age of 26."
sentence.split()
str.split(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'age',
 'of',
 '26.']

Split a sentence into tokens, build a sorted vocabulary, and convert each token to a one-hot vector

In [44]:
import numpy as np
token_sequence = str.split(sentence)
vocab = sorted(set(token_sequence))
num_tokens = len(token_sequence)
vocab_size = len(vocab)
onehot_vectors = np.zeros((num_tokens, vocab_size), int)

for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1

print(', '.join(vocab))
print('*********************')
print(onehot_vectors)

26., Jefferson, Monticello, Thomas, age, at, began, building, of
*********************
[[0 0 0 1 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1 0]
 [0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0]]
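
Calling `vocab.index(word)` scans the vocabulary for every token, which gets slow for large vocabularies. A possible alternative, sketched here with plain Python lists instead of NumPy, is to build a word-to-column dictionary once. The names `word_to_col`, `onehot_rows` and `decoded` are introduced only for this example.

```python
sentence = "Thomas Jefferson began building Monticello at age of 26."
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

# Map each word to its column once, instead of calling vocab.index() per token
word_to_col = {word: col for col, word in enumerate(vocab)}

onehot_rows = []
for word in token_sequence:
    row = [0] * len(vocab)
    row[word_to_col[word]] = 1
    onehot_rows.append(row)

# Decoding: the position of the 1 in each row recovers the original token
decoded = [vocab[row.index(1)] for row in onehot_rows]
print(decoded == token_sequence)
```

The round trip back to the original tokens shows that one-hot vectors lose no information about the sequence.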


Build a DataFrame from the matrix of one-hot vectors

In [47]:
import pandas as pd
print(pd.DataFrame(onehot_vectors, columns=vocab))

   26.  Jefferson  Monticello  Thomas  age  at  began  building  of
0    0          0           0       1    0   0      0         0   0
1    0          1           0       0    0   0      0         0   0
2    0          0           0       0    0   0      1         0   0
3    0          0           0       0    0   0      0         1   0
4    0          0           1       0    0   0      0         0   0
5    0          0           0       0    0   1      0         0   0
6    0          0           0       0    1   0      0         0   0
7    0          0           0       0    0   0      0         0   1
8    1          0           0       0    0   0      0         0   0


Construct a binary bag-of-words (word presence) vector for each sentence

In [60]:
sentences = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentences += """Construction was done mostly by local masons and carpenters.\n"""
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

print('text to analyze:', sentences)
print('*********************************************************')

# 1. construct a dict of dicts: one {token: 1} dict per sentence
corpus = {}
for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())

print('first dict:', corpus['sent0'])
print('*********************************************************')
# 2. convert the dict of dicts to a DataFrame (one row per sentence)
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
print(df[df.columns[:10]])

print('**********************************************************')
# 3. a bitwise AND of two rows keeps only the words present in both sentences
shared = [(k, v) for (k, v) in (df.loc['sent0'] & df.loc['sent3']).items() if v]
print('word shared by two sentences:', shared)

text to analyze: Thomas Jefferson began building Monticello at the age of 26.
Construction was done mostly by local masons and carpenters.
He moved into the South Pavilion in 1770.
Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.
*********************************************************
first dict: {'Thomas': 1, 'Jefferson': 1, 'began': 1, 'building': 1, 'Monticello': 1, 'at': 1, 'the': 1, 'age': 1, 'of': 1, '26.': 1}
*********************************************************
       Thomas  Jefferson  began  building  Monticello  at  the  age  of  26.
sent0       1          1      1         1           1   1    1    1   1    1
sent1       0          0      0         0           0   0    0    0   0    0
sent2       0          0      0         0           0   0    1    0   0    0
sent3       0          0      0         0           1   0    0    0   0    0
**********************************************************
word shared by two sentences: [('Monticello', 1)]
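
The overlap between two sentences can also be checked without pandas. This sketch uses a set intersection and, as an equivalent count, a dot product of binary word-presence vectors. The variable names are chosen for this example.

```python
sent0 = "Thomas Jefferson began building Monticello at the age of 26."
sent3 = "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."

words0 = set(sent0.split())
words3 = set(sent3.split())

# Set intersection gives the shared tokens directly
shared = words0 & words3

# Equivalent count via a dot product of binary bag-of-words vectors
vocab = sorted(words0 | words3)
vec0 = [int(w in words0) for w in vocab]
vec3 = [int(w in words3) for w in vocab]
overlap = sum(a * b for a, b in zip(vec0, vec3))

print(shared, overlap)
```

Note that "Jefferson" and "Jefferson's" count as different tokens with whitespace splitting, which is one motivation for the better tokenizers that follow.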

Split into tokens using [regular expressions](#basics-of-regular-expressions)

In [63]:
import re
sentence = """Thomas Jefferson began building Monticello at the age of 26."""
tokens = re.split(r'[-\s.,;!?]+', sentence)
print('list of tokens:', tokens)
print('***********************************************')
# a precompiled pattern is faster and easier to reuse; the capture group also keeps the separators
pattern = re.compile(r"([-\s.,;!?])+")
tokens = pattern.split(sentence)
print('last 10 tokens:', tokens[-10:])
# [' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']

list of tokens: ['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '']
***********************************************
last 10 tokens: [' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']


#### ignoring punctuation

In [64]:
print('removing punctuation:', [x for x in tokens if x not in '- \t\n.,;!?'])

removing punctuation: ['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26']


#### ignoring whitespace with a tokenizer

In [65]:
from nltk.tokenize import RegexpTokenizer
# \$ matches a literal dollar sign; a bare $ would be an end-of-string anchor
tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\S+')
print(tokenizer.tokenize(sentence))

['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '.']
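
In a regular expression an unescaped `$` is an end-of-string anchor, so the alternative `$[0-9.]+` can never match a price; a literal dollar sign has to be written `\$`. The sketch below uses `re.findall` from the standard library as a stand-in for the tokenizer, on a hypothetical sentence, to show the difference.

```python
import re

text = "It costs $26.50, or so."  # hypothetical example sentence

# Unescaped `$`: the middle alternative never matches, so the price is
# picked up by \S+ together with the comma that follows it.
unescaped = re.findall(r'\w+|$[0-9.]+|\S+', text)

# Escaped `\$`: the price becomes its own token and the comma is separate.
escaped = re.findall(r'\w+|\$[0-9.]+|\S+', text)

print(unescaped)
print(escaped)
```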


#### managing contractions with a tokenizer

In [66]:
from nltk.tokenize import TreebankWordTokenizer
sentence = """Monticello wasn't designated as UNESCO World Heritage Site until 1987."""
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(sentence))

['Monticello', 'was', "n't", 'designated', 'as', 'UNESCO', 'World', 'Heritage', 'Site', 'until', '1987', '.']


#### tokenize informal conversation

In [68]:
from nltk.tokenize.casual import casual_tokenize
message = """RT @TJMonticello Best day everrrrrrr at Monticello. Awesommmmmmeeeeeeee day :*)"""
print('tokens:', casual_tokenize(message))

print('***************************************************')
print('best approach to tokens:', casual_tokenize(message, reduce_len=True, strip_handles=True))

tokens: ['RT', '@TJMonticello', 'Best', 'day', 'everrrrrrr', 'at', 'Monticello', '.', 'Awesommmmmmeeeeeeee', 'day', ':*)']
***************************************************
best approach to tokens: ['RT', 'Best', 'day', 'everrr', 'at', 'Monticello', '.', 'Awesommmeee', 'day', ':*)']


### n-grams

In [74]:
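from nltk.util import ngrams
# rebuild the clean token list here: `sentence` was reassigned above and `tokens` still holds separators
sentence = "Thomas Jefferson began building Monticello at the age of 26."
tokens = [x for x in re.split(r'[-\s.,;!?]+', sentence) if x]
print('list of tuples:', list(ngrams(tokens, 2)))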
print('*****************************************************')
print('list of triplets:', list(ngrams(tokens, 3)))
print('*****************************************************')
print('list of 2-grams', [' '.join(x) for x in list(ngrams(tokens, 2))])

list of tuples: [('Thomas', 'Jefferson'), ('Jefferson', 'began'), ('began', 'building'), ('building', 'Monticello'), ('Monticello', 'at'), ('at', 'the'), ('the', 'age'), ('age', 'of'), ('of', '26')]
*****************************************************
list of triplets: [('Thomas', 'Jefferson', 'began'), ('Jefferson', 'began', 'building'), ('began', 'building', 'Monticello'), ('building', 'Monticello', 'at'), ('Monticello', 'at', 'the'), ('at', 'the', 'age'), ('the', 'age', 'of'), ('age', 'of', '26')]
*****************************************************
list of 2-grams ['Thomas Jefferson', 'Jefferson began', 'began building', 'building Monticello', 'Monticello at', 'at the', 'the age', 'age of', 'of 26']
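
For intuition, n-grams can be produced in plain Python by zipping staggered slices of the token list. `make_ngrams` is a hypothetical helper written for this note; it is not part of NLTK.

```python
def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token list."""
    # zip over n staggered views of the list; zip stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['Thomas', 'Jefferson', 'began', 'building', 'Monticello']
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
print([' '.join(gram) for gram in trigrams])
```

A list of k tokens yields k - n + 1 n-grams, so five tokens give four bigrams and three trigrams.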


### Stopwords

In [79]:
# be careful when removing stop words: whether it helps depends on the application
stop_words = ['a', 'an', 'the', 'on', 'of', 'off', 'this', 'is']
tokens = ['the', 'house', 'is', 'on', 'fire']
tokens_without_stopwords = [x for x in tokens if x not in stop_words]
print('tokens without stopwords:', tokens_without_stopwords)

# canonical stopwords
import nltk
#nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
# print('size of nltk stopwords:', len(stop_words))
print('first seven words', stop_words[:7])
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours']
print('stopwords with 1 character in nltk:', [sw for sw in stop_words if len(sw) == 1])
# ['i', 'a', 's', 't', 'd', 'm', 'o', 'y']

# stopwords comparison between sklearn and NLTK
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words
nltk_stop_words = nltk.corpus.stopwords.words('english')
print('nltk stopwords:', len(nltk_stop_words))
print('sklearn stopwords:', len(sklearn_stop_words))

# nltk_stop_words is a list, so convert it to a set before using set operations
#print(len(set(nltk_stop_words).union(sklearn_stop_words)))
#print(len(set(nltk_stop_words).intersection(sklearn_stop_words)))

tokens without stopwords: ['house', 'fire']
first seven words ['i', 'me', 'my', 'myself', 'we', 'our', 'ours']
stopwords with 1 character in nltk: ['i', 'a', 's', 't', 'd', 'm', 'o', 'y']
nltk stopwords: 179
sklearn stopwords: 318


### Normalizing capitalization

In [81]:
tokens = ['House', 'Visitor', 'Center']
print([x.lower() for x in tokens])

['house', 'visitor', 'center']


### Stemming

#### stemmer with regular expressions

In [82]:
def stem(phrase):
    # keep a trailing "ss" intact, otherwise drop one final "s"; then strip apostrophes
    return ' '.join(
        re.findall(r'^(.*ss|.*?)(s)?$', word)[0][0].strip("'")
        for word in phrase.lower().split()
    )

print(stem('houses'))
print(stem("Doctor House's calls"))
# doctor house call

house
doctor house call
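
A one-line regex stemmer is handy but crude. The sketch below repeats the same `stem` function so it runs on its own, then tries a few extra words, chosen for this example, to show where it works and where it over-stems.

```python
import re

def stem(phrase):
    # Same regex as above: keep a trailing "ss", otherwise drop one final "s"
    return ' '.join(
        re.findall(r'^(.*ss|.*?)(s)?$', word)[0][0].strip("'")
        for word in phrase.lower().split()
    )

print(stem('glass'))    # the "ss" ending is preserved
print(stem('bus'))      # over-stemmed: the final "s" is not a plural here
print(stem('running'))  # suffixes other than "s" are left untouched
```

Cases like these are why a rule-based stemmer such as Porter's is usually preferred.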


#### complete stemmer

In [83]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
' '.join([stemmer.stem(w).strip("'") for w in "dish washer's wash dishes".split()])
# dish washer wash dish

'dish washer wash dish'

### Lemmatization

In [87]:
from nltk.stem import WordNetLemmatizer
#nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# if pos isn't specified, the lemmatizer assumes the word is a noun
print(lemmatizer.lemmatize('better'))
print(lemmatizer.lemmatize('better', pos='a'))
print(lemmatizer.lemmatize('good', pos='a'))
print(lemmatizer.lemmatize('goods', pos='a'))
print(lemmatizer.lemmatize('goods', pos='n'))
print(lemmatizer.lemmatize('goodness', pos='n'))
print(lemmatizer.lemmatize('best', pos='a'))

better
good
good
goods
good
goodness
best


### Sentiment analysis

#### vader model for sentiment analysis

In [103]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sa = SentimentIntensityAnalyzer()
text1 = "Python is very readable and it is great for NLP."
text2 = "Python is not a bad choice for most applications."

print('lexicon words with spaces', [(tok, score) for tok, score in sa.lexicon.items() if " " in tok])
print('*******************************************')
print('dict sentiment for text1:', sa.polarity_scores(text=text1))
print('dict sentiment for text2:', sa.polarity_scores(text=text2))

lexicon words with spaces [("( '}{' )", 1.6), ("can't stand", -2.0), ('fed up', -1.8), ('screwed up', -1.5)]
*******************************************
dict sentiment for text1: {'neg': 0.0, 'neu': 0.687, 'pos': 0.313, 'compound': 0.6249}
dict sentiment for text2: {'neg': 0.0, 'neu': 0.737, 'pos': 0.263, 'compound': 0.431}


#### sentiment score for a given corpus

In [104]:
corpus = ["Absolutely perfect! Love it! :-) :-) :-)",
          "Horrible! Completely useless. :(",
          "It was Ok. Some good and some bad things"]

for doc in corpus:
    scores = sa.polarity_scores(doc)
    print('{:+}: {}'.format(scores['compound'], doc))

+0.9428: Absolutely perfect! Love it! :-) :-) :-)
-0.8768: Horrible! Completely useless. :(
-0.1531: It was Ok. Some good and some bad things
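
The compound score can be turned into a coarse label with a threshold. The sketch below assumes the ±0.05 cut-off that is commonly quoted for VADER; treat both the threshold and the helper name `label_from_compound` as assumptions to tune for your own data.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Compound scores copied from the output above
scores = [0.9428, -0.8768, -0.1531]
labels = [label_from_compound(s) for s in scores]
print(labels)
```

With this cut-off the mixed review ("It was Ok. Some good and some bad things") is labeled negative rather than neutral, which hints at the limits of a lexicon-based score.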


#### nlpia movie dataset (nlpia could not be installed)

In [106]:
# naive bayes model
#from nlpia.data.loaders import get_data
#movies = get_data('hutto_movies')
#movies.head().round(2)
#     sentiment                                            text
# id
# 1        2.27  The Rock is destined to be the 21st Century...
# 2        3.53  The gorgeously elaborate continuation of ''...
# 3       -0.60                     Effective but too tepid ...
# 4        1.47  If you sometimes like to go to the movies t...
# 5        1.73  Emerges as something rare, an issue movie t...
#movies.describe().round(2)
#        sentiment
# count   10605.00
# mean        0.00
# min        -3.88
# max         3.94

#### A complete path for predicting sentiment analysis

Because nlpia, which ships the movie dataset, could not be installed, load the reviews from a local file and generate placeholder sentiment scores at random

In [111]:
import pandas as pd
import numpy as np

# read one review per line; the context manager closes the file afterwards
with open('movie_data/full_train.txt', 'r') as f:
    reviews_train = [line.strip() for line in f]
movies = pd.DataFrame(reviews_train, columns=['review'])

# placeholder labels: random scores in [-4, 4], one per review
movies['sentiment'] = np.random.uniform(-4, 4, len(movies)).round(2)
# sample the data
movies = movies.sample(n=1000).reset_index(drop=True)

Convert the 'review' column to bags of words

In [121]:
import pandas as pd
from nltk.tokenize import casual_tokenize
from collections import Counter
pd.set_option('display.width', 75)
bags_of_words = []

# tokenize each row, append into a list of dicts and convert to dataframe
for text in movies.review.to_list():
    bags_of_words.append(Counter(casual_tokenize(text)))
df_bows = pd.DataFrame.from_records(bags_of_words)
df_bows = df_bows.fillna(0).astype(int)

print('bows shape:', df_bows.shape)
print('**************************************************')
print('first review:', movies.loc[0].review)
print('**************************************************')
print('tokens of the first review that appear in the first 5 rows of the dataset', 
      df_bows.head()[list(bags_of_words[0].keys())])

bows shape: (1000, 22298)
**************************************************
first review: I'll give writer/director William Gove credit for finding someone to finance this ill-conceived "thriller." A good argument for not wasting money subscribing to HBO, let alone buying DVDs based on cover art and blurbs. A pedestrian Dennis Hopper and a game Richard Grieco add nothing significant to their resumes, although the art direction is not half bad. The dialogue will leave you grimacing with wonder at its conceit; this is storytelling at its worst. No tension, no suspense, no dread, no fear, no empathy, no catharsis, no nothing. A few attractive and often nude females spice up the boredom, but this is definitely a film best seen as a trailer. I feel sorry for the guy who greenlighted this thing. Good for late-night, zoned-out viewing only. You have been warned.
**************************************************
tokens of the first review that appear in the first 5 rows of the dataset    I'l

Predict the sentiment and compute accuracy metrics

In [144]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# fit the model and convert continuous to boolean label
nb = nb.fit(df_bows, movies.sentiment > 0)

# sentiment ranges from -4 to 4, so rescale the predicted probability (0 to 1) to that range to compare it with the "ground truth"
movies['predicted_sentiment'] = nb.predict_proba(df_bows)[:,1] * 8 - 4  
movies['error'] = (movies.predicted_sentiment - movies.sentiment).abs()

# metrics
print('MAE:', movies.error.mean()) 

# support columns
movies['sentiment_ispositive'] = (movies.sentiment > 0).astype(int)
movies['predicted_ispositive'] = (movies.predicted_sentiment > 0).astype(int)
movies['''sentiment predicted_sentiment sentiment_ispositive predicted_ispositive'''.split()].head(8)

#     sentiment  predicted_sentiment  sentiment_ispositive  predicted_ispositive
# id
# 1    2.266667                   4                    1                    1
# 2    3.533333                   4                    1                    1
# 3   -0.600000                  -4                    0                    0
# 4    1.466667                   4                    1                    1
# 5    1.733333                   4                    1                    1
# 6    2.533333                   4                    1                    1
# 7    2.466667                   4                    1                    1
# 8    1.266667                  -4                    1                    0

# fraction of reviews whose predicted polarity matches the label (measured on the training data itself)
(movies.predicted_ispositive == movies.sentiment_ispositive).sum() / len(movies)

MAE: 2.0512326689917173


0.993
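
The rescaling and the two metrics above can be worked through by hand on a few numbers. The probabilities and true scores below are made up for illustration, and plain Python lists stand in for the DataFrame columns.

```python
# Hypothetical predicted probabilities of the positive class and true scores
probabilities = [0.0, 0.5, 1.0, 0.75]
true_sentiment = [-3.0, 1.0, 2.5, 2.0]

# Rescale [0, 1] to [-4, 4]: p * 8 - 4
predicted = [p * 8 - 4 for p in probabilities]

# Mean absolute error between predicted and true scores
errors = [abs(p - t) for p, t in zip(predicted, true_sentiment)]
mae = sum(errors) / len(errors)

# Polarity accuracy: does the sign of the prediction match the sign of the label?
matches = [(p > 0) == (t > 0) for p, t in zip(predicted, true_sentiment)]
accuracy = sum(matches) / len(matches)

print(predicted, mae, accuracy)
```

A probability of exactly 0.5 maps to a score of 0, which counts as not positive here, so borderline predictions fall on the negative side.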
