# NLP in Action I
> My journey to learn NLP

- toc: true
- branch: master
- badges: false
- comments: false
- categories: [fastpages, jupyter]

## System tools

### Install a pip package in the current Jupyter kernel

In [2]:
import sys
# !{sys.executable} -m pip install vaderSentiment

## Text Tools

### Regular Expressions

In [61]:
import re

#### basics of regular expressions

- [] indicates a character class: it matches exactly one character from the set inside the brackets
- r'[\s]' is equivalent to r'[ \t\n\r\x0b\x0c]': it matches a space, tab, newline, carriage return, vertical tab or form feed
- r'[a-z]' matches any lowercase ASCII letter
- r'[0-9]' matches any digit
- r'[_a-zA-Z]' matches an underscore or any letter of the English alphabet
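
A quick way to check the character classes listed above is to run them through re.findall. This is a minimal sketch; the sample string is made up for illustration.

```python
import re

sample = "Rosa_1 said:\tHi 42 times\n"  # hypothetical sample text

# [\s] matches one whitespace character (space, tab, newline, ...)
whitespace = re.findall(r'[\s]', sample)
# [0-9] matches one digit
digits = re.findall(r'[0-9]', sample)
# [_a-zA-Z]+ matches runs of underscores and ASCII letters
words = re.findall(r'[_a-zA-Z]+', sample)

print(whitespace)  # [' ', '\t', ' ', ' ', '\n']
print(digits)      # ['1', '4', '2']
print(words)       # ['Rosa_', 'said', 'Hi', 'times']
```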

Match any string that begins with hi, hello or hey, optionally followed by spaces and a word

In [18]:
r = "(hi|hello|hey)[ ]*([a-z]*)"

In [45]:
print(re.match(r, 'Hello Rosa', flags=re.IGNORECASE))
print(re.match(r, "hi ho, hi ho, it's off to work ...", flags=re.IGNORECASE))
print(re.match(r, "hey, what's up", flags=re.IGNORECASE))

<re.Match object; span=(0, 10), match='Hello Rosa'>
<re.Match object; span=(0, 5), match='hi ho'>
<re.Match object; span=(0, 3), match='hey'>


Example with a complex pattern

In [22]:
r = r"[^a-z]*([y]o|[h']?ello|ok|hey|(good[ ])?(morn[gin']{0,3}|"\
     r"afternoon|even[gin']{0,3}))[\s,;:]{1,3}([a-z]{1,20})"

re_greeting = re.compile(r, flags=re.IGNORECASE)

In [46]:
print(re_greeting.match('Hello Rosa'))
print(re_greeting.match('Hello Rosa').groups())
print(re_greeting.match("Good morning Rosa"))
print(re_greeting.match("Good Manning Rosa"))
print(re_greeting.match('Good evening Rosa Parks').groups())
print(re_greeting.match("Good Morn'n Rosa"))
print(re_greeting.match("yo Rosa"))

<re.Match object; span=(0, 10), match='Hello Rosa'>
('Hello', None, None, 'Rosa')
<re.Match object; span=(0, 17), match='Good morning Rosa'>
None
('Good evening', 'Good ', 'evening', 'Rosa')
<re.Match object; span=(0, 16), match="Good Morn'n Rosa">
<re.Match object; span=(0, 7), match='yo Rosa'>


### Simple Chat Bot

Enter a greeting such as 'good morning rosa' or 'hello rose'

In [33]:
my_names = set(['rosa', 'rose', 'chatty', 'chatbot', 'bot', 'chatterbot'])
curt_names = set(['hal', 'you', 'u'])
greeter_name = ''
match = re_greeting.match(input())

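if match:
    # the last capture group is the name being addressed; normalize its case once
    at_name = match.groups()[-1].lower()
    if at_name in curt_names:
        print("Good one.")

    elif at_name in my_names:
        # greeter_name is still empty here, so this prints "Hi , How are you?"
        print("Hi {}, How are you?".format(greeter_name))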

 hello rose


Hi , How are you?


### Word Permutations

All permutations of length n=3 of the words in 'Good morning Rosa!'

In [34]:
from itertools import permutations
print([" ".join(combo) for combo in permutations("Good morning Rosa!".split(), 3)])

['Good morning Rosa!', 'Good Rosa! morning', 'morning Good Rosa!', 'morning Rosa! Good', 'Rosa! Good morning', 'Rosa! morning Good']


### Count Words

Count words with Counter, a dict subclass that maps each token to its number of occurrences

In [36]:
from collections import Counter
print(Counter("Guten Morgen Rosa".split()))
print(Counter("Good morning morning , Rosa!".split()))

Counter({'Guten': 1, 'Morgen': 1, 'Rosa': 1})
Counter({'morning': 2, 'Good': 1, ',': 1, 'Rosa!': 1})


## Word Tokenization

### Tokens

#### split into tokens

In [39]:
sentence = "Thomas Jefferson began building Monticello at age of 26."
sentence.split()
str.split(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'age',
 'of',
 '26.']

Split a sentence into tokens, build a sorted vocabulary and encode each token as a one-hot vector

In [44]:
import numpy as np
token_sequence = str.split(sentence)
vocab = sorted(set(token_sequence))
num_tokens = len(token_sequence)
vocab_size = len(vocab)
onehot_vectors = np.zeros((num_tokens, vocab_size), int)

for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1

print(', '.join(vocab))
print('*********************')
print(onehot_vectors)

26., Jefferson, Monticello, Thomas, age, at, began, building, of
*********************
[[0 0 0 1 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1 0]
 [0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0]]
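
The call vocab.index(word) rescans the vocabulary for every token. Below is a minimal sketch of the same one-hot construction with a dictionary lookup instead, written with plain lists so it runs without NumPy. The helper name `one_hot_rows` is hypothetical.

```python
def one_hot_rows(token_sequence):
    """Return (vocab, rows), where rows[i] is the one-hot vector of token i."""
    vocab = sorted(set(token_sequence))
    index_of = {word: j for j, word in enumerate(vocab)}  # constant-time lookups
    rows = []
    for word in token_sequence:
        row = [0] * len(vocab)
        row[index_of[word]] = 1
        rows.append(row)
    return vocab, rows

sentence = "Thomas Jefferson began building Monticello at age of 26."
vocab, rows = one_hot_rows(sentence.split())
print(vocab)
print(rows[0])  # [0, 0, 0, 1, 0, 0, 0, 0, 0]
```

Each row contains a single 1, in the column of the token at that position.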


Build a DataFrame from the matrix of one-hot vectors, with the vocabulary as column names

In [47]:
import pandas as pd
print(pd.DataFrame(onehot_vectors, columns=vocab))

   26.  Jefferson  Monticello  Thomas  age  at  began  building  of
0    0          0           0       1    0   0      0         0   0
1    0          1           0       0    0   0      0         0   0
2    0          0           0       0    0   0      1         0   0
3    0          0           0       0    0   0      0         1   0
4    0          0           1       0    0   0      0         0   0
5    0          0           0       0    0   1      0         0   0
6    0          0           0       0    1   0      0         0   0
7    0          0           0       0    0   0      0         0   1
8    1          0           0       0    0   0      0         0   0


Construct a binary bag-of-words vector for each sentence (1 if the token occurs in the sentence, 0 otherwise)

In [60]:
sentences = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentences += """Construction was done mostly by local masons and carpenters.\n"""
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

print('text to analyze:', sentences)
print('*********************************************************')

#1. construct a dict of dicts
corpus = {}
for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())

print('first dict:', corpus['sent0'])
print('*********************************************************')
#2. convert dict to dataframe: rows are sentences, columns are tokens
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
print(df[df.columns[:10]])

print('**********************************************************')
# element-wise AND of two rows keeps only the tokens present in both sentences
shared = [(k, v) for (k, v) in (df.loc['sent0'] & df.loc['sent3']).items() if v]
print('word shared by two sentences:', shared)

text to analyze: Thomas Jefferson began building Monticello at the age of 26.
Construction was done mostly by local masons and carpenters.
He moved into the South Pavilion in 1770.
Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.
*********************************************************
first dict: {'Thomas': 1, 'Jefferson': 1, 'began': 1, 'building': 1, 'Monticello': 1, 'at': 1, 'the': 1, 'age': 1, 'of': 1, '26.': 1}
*********************************************************
       Thomas  Jefferson  began  building  Monticello  at  the  age  of  26.
sent0       1          1      1         1           1   1    1    1   1    1
sent1       0          0      0         0           0   0    0    0   0    0
sent2       0          0      0         0           0   0    1    0   0    0
sent3       0          0      0         0           1   0    0    0   0    0
**********************************************************
word shared by two sentences: [('Monticello', 1)]

Split into tokens using regular expressions (see "basics of regular expressions" above)

In [63]:
import re
sentence = """Thomas Jefferson began building Monticello at the age of 26."""
tokens = re.split(r'[-\s.,;!?]+', sentence)
print('list of tokens:', tokens)
print('***********************************************')
# compiling the pattern once is faster when it is reused;
# the capturing group keeps the last character of each separator run in the result
pattern = re.compile(r"([-\s.,;!?])+")
tokens = pattern.split(sentence)
print('last 10 tokens:', tokens[-10:])  
# [' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']

list of tokens: ['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '']
***********************************************
last 10 tokens: [' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']


#### ignoring punctuation

In [64]:
# drop the empty string and the single-character separators kept by the capturing group
print('removing punctuation:', [x for x in tokens if x and x not in '- \t\n.,;!?'])

removing punctuation: ['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26']


#### ignoring whitespace with a tokenizer

In [65]:
from nltk.tokenize import RegexpTokenizer
# \$ matches a literal dollar sign; an unescaped $ would be the end-of-string anchor
tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\S+')
print(tokenizer.tokenize(sentence))

['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '.']
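
To see what the price alternative of that pattern is for, here is a sketch that runs the same alternation through re.findall from the standard library, with the dollar sign escaped so it matches a literal '$'. The sample sentence is made up, and using re.findall as a stand-in for RegexpTokenizer is an assumption made to keep the example free of NLTK.

```python
import re

# word characters, or a dollar amount, or any other run of non-whitespace
pattern = re.compile(r'\w+|\$[0-9.]+|\S+')

text = "Tickets cost $29.95, cash only."  # hypothetical sample sentence
tokens = pattern.findall(text)
print(tokens)
# ['Tickets', 'cost', '$29.95', ',', 'cash', 'only', '.']
```

With the dollar sign escaped, the price is kept as one token and the comma after it becomes a token of its own.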


#### managing contractions with a tokenizer

In [66]:
from nltk.tokenize import TreebankWordTokenizer
sentence = """Monticello wasn't designated as UNESCO World Heritage Site until 1987."""
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize(sentence))

['Monticello', 'was', "n't", 'designated', 'as', 'UNESCO', 'World', 'Heritage', 'Site', 'until', '1987', '.']


#### tokenize informal conversation

In [68]:
from nltk.tokenize.casual import casual_tokenize
message = """RT @TJMonticello Best day everrrrrrr at Monticello. Awesommmmmmeeeeeeee day :*)"""
print('tokens:', casual_tokenize(message))

print('***************************************************')
print('best approach to tokens:', casual_tokenize(message, reduce_len=True, strip_handles=True))

tokens: ['RT', '@TJMonticello', 'Best', 'day', 'everrrrrrr', 'at', 'Monticello', '.', 'Awesommmmmmeeeeeeee', 'day', ':*)']
***************************************************
best approach to tokens: ['RT', 'Best', 'day', 'everrr', 'at', 'Monticello', '.', 'Awesommmeee', 'day', ':*)']


### n-grams

In [74]:
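from nltk.util import ngrams
# rebuild the punctuation-free token list of the Jefferson sentence
tokens = [x for x in re.split(r'[-\s.,;!?]+', "Thomas Jefferson began building Monticello at the age of 26.") if x]
print('list of tuples:', list(ngrams(tokens, 2)))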
print('*****************************************************')
print('list of triplets:', list(ngrams(tokens, 3)))
print('*****************************************************')
print('list of 2-grams', [' '.join(x) for x in list(ngrams(tokens, 2))])

list of tuples: [('Thomas', 'Jefferson'), ('Jefferson', 'began'), ('began', 'building'), ('building', 'Monticello'), ('Monticello', 'at'), ('at', 'the'), ('the', 'age'), ('age', 'of'), ('of', '26')]
*****************************************************
list of triplets: [('Thomas', 'Jefferson', 'began'), ('Jefferson', 'began', 'building'), ('began', 'building', 'Monticello'), ('building', 'Monticello', 'at'), ('Monticello', 'at', 'the'), ('at', 'the', 'age'), ('the', 'age', 'of'), ('age', 'of', '26')]
*****************************************************
list of 2-grams ['Thomas Jefferson', 'Jefferson began', 'began building', 'building Monticello', 'Monticello at', 'at the', 'the age', 'age of', 'of 26']
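
The same n-grams can be produced without NLTK by zipping shifted copies of the token list. A minimal sketch follows; `ngrams_zip` is a hypothetical helper name, not part of any library.

```python
def ngrams_zip(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] zipped position by position
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['Thomas', 'Jefferson', 'began', 'building', 'Monticello']
bigrams = ngrams_zip(tokens, 2)
trigrams = ngrams_zip(tokens, 3)
print([' '.join(gram) for gram in bigrams])
# ['Thomas Jefferson', 'Jefferson began', 'began building', 'building Monticello']
```

zip stops at the shortest input, so a list of k tokens yields k - n + 1 n-grams.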


### Stopwords

In [79]:
# remove stopwords with care: whether they carry meaning depends on the application
stop_words = ['a', 'an', 'the', 'on', 'of', 'off', 'this', 'is']
tokens = ['the', 'house', 'is', 'on', 'fire']
tokens_without_stopwords = [x for x in tokens if x not in stop_words]
print('tokens without stopwords:', tokens_without_stopwords)

# canonical stopwords
import nltk
#nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
# print('size of nltk stopwords:', len(stop_words))
print('first seven words', stop_words[:7])
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours']
print('stopwords with 1 character in nltk:', [sw for sw in stop_words if len(sw) == 1])
# ['i', 'a', 's', 't', 'd', 'm', 'o', 'y']

# stopwords comparison between sklearn and NLTK
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words
nltk_stop_words = nltk.corpus.stopwords.words('english')
print('nltk stopwords:', len(nltk_stop_words))
print('sklearn stopwords:', len(sklearn_stop_words))

# nltk_stop_words is a list, so convert it to a set before union/intersection
#print(len(set(nltk_stop_words).union(sklearn_stop_words)))
#print(len(set(nltk_stop_words).intersection(sklearn_stop_words)))

tokens without stopwords: ['house', 'fire']
first seven words ['i', 'me', 'my', 'myself', 'we', 'our', 'ours']
stopwords with 1 character in nltk: ['i', 'a', 's', 't', 'd', 'm', 'o', 'y']
nltk stopwords: 179
sklearn stopwords: 318
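
The commented-out union and intersection calls need sets, because stopwords.words() returns a list. A sketch of the comparison with two tiny made-up lists standing in for the real ones (the names `list_a` and `set_b` are hypothetical):

```python
list_a = ['a', 'an', 'the', 'i', 'me']       # stand-in for the NLTK list
set_b = frozenset(['a', 'the', 'of', 'on'])  # stand-in for the sklearn set

set_a = set(list_a)
union_size = len(set_a | set_b)
shared_size = len(set_a & set_b)
only_in_a = sorted(set_a - set_b)

print(union_size)   # 7
print(shared_size)  # 2
print(only_in_a)    # ['an', 'i', 'me']
```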


### Normalizing capitalization

In [81]:
tokens = ['House', 'Visitor', 'Center']
print([x.lower() for x in tokens])

['house', 'visitor', 'center']


### Stemming

#### stemmer with regular expressions

In [82]:
def stem(phrase):
    # keep a trailing 'ss', otherwise strip one final 's', then drop apostrophes
    return ' '.join([re.findall('^(.*ss|.*?)(s)?$', word)[0][0].strip("'")
                     for word in phrase.lower().split()])

print(stem('houses'))
print(stem("Doctor House's calls"))
# doctor house call

house
doctor house call


#### complete stemmer

In [83]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
' '.join([stemmer.stem(w).strip("'") for w in "dish washer's wash dishes".split()])
# dish washer wash dish

'dish washer wash dish'

### Lemmatization

In [87]:
from nltk.stem import WordNetLemmatizer
#nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# if pos isn't specified, the lemmatizer assumes the word is a noun
print(lemmatizer.lemmatize('better'))
print(lemmatizer.lemmatize('better', pos='a'))
print(lemmatizer.lemmatize('good', pos='a'))
print(lemmatizer.lemmatize('goods', pos='a'))
print(lemmatizer.lemmatize('goods', pos='n'))
print(lemmatizer.lemmatize('goodness', pos='n'))
print(lemmatizer.lemmatize('best', pos='a'))

better
good
good
goods
good
goodness
best


### Sentiment analysis

#### vader model for sentiment analysis

In [103]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sa = SentimentIntensityAnalyzer()
text1 = "Python is very readable and it is great for NLP."
text2 = "Python is not a bad choice for most applications."

print('lexicon words with spaces', [(tok, score) for tok, score in sa.lexicon.items() if " " in tok])
print('*******************************************')
print('dict sentiment for text1:', sa.polarity_scores(text=text1))
print('dict sentiment for text2:', sa.polarity_scores(text=text2))

lexicon words with spaces [("( '}{' )", 1.6), ("can't stand", -2.0), ('fed up', -1.8), ('screwed up', -1.5)]
*******************************************
dict sentiment for text1: {'neg': 0.0, 'neu': 0.687, 'pos': 0.313, 'compound': 0.6249}
dict sentiment for text2: {'neg': 0.0, 'neu': 0.737, 'pos': 0.263, 'compound': 0.431}


#### sentiment score for a given corpus

In [104]:
corpus = ["Absolutely perfect! Love it! :-) :-) :-)",
          "Horrible! Completely useless. :(",
          "It was Ok. Some good and some bad things"]

for doc in corpus:
    scores = sa.polarity_scores(doc)
    print('{:+}: {}'.format(scores['compound'], doc))

+0.9428: Absolutely perfect! Love it! :-) :-) :-)
-0.8768: Horrible! Completely useless. :(
-0.1531: It was Ok. Some good and some bad things


#### naive model for sentiment analysis

The book loads the movie data through nlpia, which I couldn't install, so that code is commented out

In [167]:
# naive bayes model
#from nlpia.data.loaders import get_data
#movies = get_data('hutto_movies')
#movies.head().round(2)

Load the movie reviews from a local file instead. The labelled data ships with nlpia, so the sentiment scores here are random placeholders

In [163]:
import pandas as pd
import numpy as np

# read one review per line; the with-block closes the file afterwards
with open('movie_data/full_train.txt', 'r') as f:
    reviews_train = [line.strip() for line in f]
movies = pd.DataFrame(reviews_train, columns=['review'])

# placeholder labels: one uniform random score in [-4, 4] per review
movies['sentiment'] = np.random.uniform(-4, 4, len(movies)).round(2)
# sample the data
movies = movies.sample(n=1000).reset_index(drop=True)
movies

                                                review  sentiment
0    Woman (Miriam Hopkins as Virginia) chases Man ...       3.49
1    The combination of reading the Novella and vie...      -3.81
2    When my now college age daughter was in presch...       2.21
3    Absolutely awful movie. Utter waste of time.<b...      -0.94
4    I disagree with previous comment about this mo...       0.14
..                                                 ...        ...
995  It's rare that I feel a need to write a review...      -0.86
996  Regardless of what personal opinion one may ha...       0.90
997  I have recently seen this movie due to Jake's ...       0.18
998  There is absolutely no plot in this movie ...n...      -2.02


Sentiment scores range from -4 to 4

In [166]:
movies.describe()

       sentiment
count     1000.0
mean    -0.01573
std     2.362015
min        -3.99
25%       -2.055
50%       -0.145
75%         2.18
max          4.0


Convert the 'review' column to bags of words, one Counter per review

In [154]:
import pandas as pd
from nltk.tokenize import casual_tokenize
from collections import Counter
pd.set_option('display.width', 75)
bags_of_words = []

# tokenize each row, append into a list of dicts and convert to dataframe
for text in movies.review.to_list():
    bags_of_words.append(Counter(casual_tokenize(text)))
df_bows = pd.DataFrame.from_records(bags_of_words)
df_bows = df_bows.fillna(0).astype(int)

print('bows shape:', df_bows.shape)
print('**************************************************')
print('first review:') 
print(movies.loc[0].review)
print('**************************************************')
print('108 tokens of the first review, with a lexicon of 22298 tokens') 
df_bows.head()[list(bags_of_words[0].keys())]

bows shape: (1000, 22298)
**************************************************
first review:
I'll give writer/director William Gove credit for finding someone to finance this ill-conceived "thriller." A good argument for not wasting money subscribing to HBO, let alone buying DVDs based on cover art and blurbs. A pedestrian Dennis Hopper and a game Richard Grieco add nothing significant to their resumes, although the art direction is not half bad. The dialogue will leave you grimacing with wonder at its conceit; this is storytelling at its worst. No tension, no suspense, no dread, no fear, no empathy, no catharsis, no nothing. A few attractive and often nude females spice up the boredom, but this is definitely a film best seen as a trailer. I feel sorry for the guy who greenlighted this thing. Good for late-night, zoned-out viewing only. You have been warned.
**************************************************
108 tokens of the first review, with a lexicon of 22298 tokens


   I'll  give  writer   /  director  William  Gove  credit  for  finding  ...  thing  Good  late-night  zoned-out  viewing  only  You  have  been  warned
0     1     1       1   1         1        1     1       1    4        1  ...      1     1           1          1        1     1    1     1     1       1
1     0     0       0  12         0        0     0       0    2        0  ...      0     0           0          0        0     0    0     2     0       0
2     0     0       0   0         0        0     0       0    1        0  ...      1     0           0          0        0     1    0     1     0       0
3     0     0       0   6         1        0     0       0    3        0  ...      0     0           0          0        0     0    0     0     0       0
4     0     0       0   4         0        0     0       0    2        0  ...      0     0           0          0        0     2    0     0     0       0


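Predict the sentiment and compute the error metrics. Both are measured on the same reviews the model was fitted on, and the labels are random, so the high accuracy below reflects memorization of the training set rather than predictive power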

In [147]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# fit the model and convert continuous to boolean label
nb = nb.fit(df_bows, movies.sentiment > 0)

# sentiment values range from -4 to 4, so the predicted probability (0 to 1) is rescaled
# to that range to compare it with the "ground truth" sentiment.
# predict_proba returns an n x 2 ndarray; column 1 is the probability of the positive class
movies['predicted_sentiment'] = nb.predict_proba(df_bows)[:,1] * 8 - 4  
movies['error'] = (movies.predicted_sentiment - movies.sentiment).abs()

# metrics
print('MAE:', movies.error.mean()) 

# support columns
movies['sentiment_ispositive'] = (movies.sentiment > 0).astype(int)
movies['predicted_ispositive'] = (movies.predicted_sentiment > 0).astype(int)
movies['''sentiment predicted_sentiment sentiment_ispositive predicted_ispositive'''.split()].head(8)

# prediction over positives
print('positives accuracy:', (movies.predicted_ispositive == 
                             movies.sentiment_ispositive).sum() / len(movies))

MAE: 2.0512326689917173
positives accuracy: 0.993


## Math with words

### Term Frequency

In [158]:
from nltk.tokenize import TreebankWordTokenizer
from collections import Counter
sentence = """The faster Harry got to the store, the faster Harry, the faster, would get home."""
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sentence.lower())
bags_of_words = Counter(tokens)
bags_of_words.most_common(4)  # top 4 bag of words
bags_of_words_df = pd.DataFrame.from_records([bags_of_words])  # put the tokens on columns, give the form of many sentences
# term frequency = count of the token / total number of tokens in the document
tf = (bags_of_words_df / sum(bags_of_words.values())).round(2)
tf

    the  faster  harry   got    to  store     ,  would   get  home     .
0  0.21    0.16   0.11  0.05  0.05   0.05  0.16   0.05  0.05  0.05  0.05


### Bag Of Words

#### Counting

In [160]:
from collections import Counter
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

# can't install nlpia, so the text is pasted here instead of: from nlpia.data.loaders import kite_text

kite_text = """A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react against the air 
to create lift and drag. A kite consists of wings, tethers, and anchors. Kites often have a bridle to guide the face of 
the kite at the correct angle so the wind can lift it. A kite’s wing also may be so designed so a bridle is not needed;
when kiting a sailplane for launch, the tether meets the wing at a single point. A kite may have fixed or moving anchors. 
Untraditionally in technical kiting, a kite consists of tether-set-coupled wing sets; even in technical kiting, though, 
a wing in the system is still often called the kite.
The lift that sustains the kite in flight is generated when air flows around the kite’s surface, producing low pressure 
above and high pressure below the wings. The interaction with the wind also generates horizontal drag along the direction 
of the wind. The resultant force vector from the lift and drag force components is opposed by the tension of one or more 
of the lines or tethers to which the kite is attached. The anchor point of the kite line may be static or moving (such 
as the towing of a kite by a running person, boat, free-falling anchors as in paragliders and fugitive parakites or 
vehicle).
The same principles of fluid flow apply in liquids and kites are also used under water.
A hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting surface is called a kytoon.
Kites have a long and varied history and many different types are flown individually and at festivals worldwide. Kites 
may be flown for recreation, art or other practical uses. Sport kites can be flown in aerial ballet, sometimes as part 
of a competition. Power kites are multi-line steerable kites designed to generate large forces which can be used to 
power activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a new trend snow kiting. 
Even Man-lifting kites have been made.
"""
tokens = tokenizer.tokenize(kite_text.lower())
token_counts = Counter(tokens)

In [168]:
#collapse
token_counts

Counter({'a': 20,
         'kite': 14,
         'is': 7,
         'traditionally': 1,
         'tethered': 2,
         'heavier-than-air': 1,
         'craft': 2,
         'with': 2,
         'wing': 5,
         'surfaces': 1,
         'that': 2,
         'react': 1,
         'against': 1,
         'the': 26,
         'air': 2,
         'to': 5,
         'create': 1,
         'lift': 4,
         'and': 10,
         'drag.': 1,
         'consists': 2,
         'of': 10,
         'wings': 1,
         ',': 14,
         'tethers': 2,
         'anchors.': 2,
         'kites': 8,
         'often': 2,
         'have': 4,
         'bridle': 2,
         'guide': 1,
         'face': 1,
         'at': 3,
         'correct': 1,
         'angle': 1,
         'so': 3,
         'wind': 2,
         'can': 3,
         'it.': 1,
         'kite’s': 2,
         'also': 3,
         'may': 4,
         'be': 5,
         'designed': 2,
         'not': 1,
         'needed': 1,
         ';': 2,
         'when':

#### Removing stopwords

In [169]:
import nltk
# nltk.download('stopwords', quiet=True)
stopwords = nltk.corpus.stopwords.words('english')
tokens = [x for x in tokens if x not in stopwords]
kite_counts = Counter(tokens)

In [170]:
#collapse
kite_counts

Counter({'kite': 14,
         'traditionally': 1,
         'tethered': 2,
         'heavier-than-air': 1,
         'craft': 2,
         'wing': 5,
         'surfaces': 1,
         'react': 1,
         'air': 2,
         'create': 1,
         'lift': 4,
         'drag.': 1,
         'consists': 2,
         'wings': 1,
         ',': 14,
         'tethers': 2,
         'anchors.': 2,
         'kites': 8,
         'often': 2,
         'bridle': 2,
         'guide': 1,
         'face': 1,
         'correct': 1,
         'angle': 1,
         'wind': 2,
         'it.': 1,
         'kite’s': 2,
         'also': 3,
         'may': 4,
         'designed': 2,
         'needed': 1,
         ';': 2,
         'kiting': 3,
         'sailplane': 1,
         'launch': 1,
         'tether': 1,
         'meets': 1,
         'single': 1,
         'point.': 1,
         'fixed': 1,
         'moving': 2,
         'untraditionally': 1,
         'technical': 2,
         'tether-set-coupled': 1,
         'sets'

#### Vectorizing the text

In [176]:
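document_vector = []
# normalize by the number of tokens left after stopword removal
doc_length = sum(kite_counts.values())
for key, value in kite_counts.most_common():
    document_vector.append(value / doc_length)
print('Printing the first 10 values of the list')
document_vector[:10]

Printing the first 10 values of the list


[1.75, 1.75, 1.0, 0.625, 0.5, 0.5, 0.375, 0.375, 0.375, 0.25]

The recorded values exceed 1 because the cell ran with a stale doc_length of 8 (14 / 8 = 1.75). Normalized by the number of kite tokens, every term frequency is below 1.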

#### Lexicon

Build a lexicon from a new corpus

In [182]:
# corpus
docs = ["The faster Harry got to the score, the faster and faster Harry would get home."]
docs.append("Harry is hairy and faster than Jill.")
docs.append("Jill is not as hairy as Harry.")

doc_tokens = []
for doc in docs:
    doc_tokens += [sorted(tokenizer.tokenize(doc.lower()))]
print('number of tokens of the first term in the list:', len(doc_tokens[0]))

all_doc_tokens = sum(doc_tokens, [])
print('number of total tokens:', len(all_doc_tokens))

lexicon = sorted(set(all_doc_tokens))
print('**************************************************')
print('size of the lexicon:', len(lexicon))
print('lexicon:', lexicon)

number of tokens of the first term in the list: 17
number of total tokens: 33
**************************************************
size of the lexicon: 18
lexicon: [',', '.', 'and', 'as', 'faster', 'get', 'got', 'hairy', 'harry', 'home', 'is', 'jill', 'not', 'score', 'than', 'the', 'to', 'would']


Vectorize the sentences as a list of OrderedDicts, one per document

In [184]:
from collections import OrderedDict
zero_vector = OrderedDict((token, 0) for token in lexicon)

import copy
doc_vectors = []
for doc in docs:
    vec = copy.copy(zero_vector)
    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)
    for key, value in token_counts.items():
        vec[key] = value / len(lexicon)
    doc_vectors.append(vec)

print('Printing the OrderedDict that corresponds to the third sentence:')
doc_vectors[2]

Printing the OrderedDict that corresponds to the third sentence:


OrderedDict([(',', 0),
             ('.', 0.05555555555555555),
             ('and', 0),
             ('as', 0.1111111111111111),
             ('faster', 0),
             ('get', 0),
             ('got', 0),
             ('hairy', 0.05555555555555555),
             ('harry', 0.05555555555555555),
             ('home', 0),
             ('is', 0.05555555555555555),
             ('jill', 0.05555555555555555),
             ('not', 0.05555555555555555),
             ('score', 0),
             ('than', 0),
             ('the', 0),
             ('to', 0),
             ('would', 0)])

#### CosineSimilarity

In [185]:
import math
def cosine_sim(vec1, vec2):
    """Return the cosine similarity of two dicts that list the same keys in the same order."""
    # convert the dictionaries to lists so the values can be paired by position
    vec1 = list(vec1.values())
    vec2 = list(vec2.values())

    dot_prod = 0
    for i, v in enumerate(vec1):
        dot_prod += v * vec2[i]

    mag_1 = math.sqrt(sum([x**2 for x in vec1]))
    mag_2 = math.sqrt(sum([x**2 for x in vec2]))
    # raises ZeroDivisionError if either vector is all zeros
    return dot_prod / (mag_1 * mag_2)
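
The cell above only defines the function. Here is a short usage sketch on three made-up count vectors over a four-word lexicon. The function is repeated so the block runs on its own, and the zero-vector guard is an addition that is not in the original.

```python
import math

def cosine_sim(vec1, vec2):
    """Cosine similarity of two dicts that list the same keys in the same order."""
    values1 = list(vec1.values())
    values2 = list(vec2.values())
    dot_prod = sum(a * b for a, b in zip(values1, values2))
    mag_1 = math.sqrt(sum(a * a for a in values1))
    mag_2 = math.sqrt(sum(b * b for b in values2))
    if mag_1 == 0 or mag_2 == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot_prod / (mag_1 * mag_2)

# hypothetical count vectors
doc_a = {'faster': 1, 'hairy': 1, 'harry': 1, 'jill': 1}
doc_b = {'faster': 0, 'hairy': 1, 'harry': 1, 'jill': 1}
doc_c = {'faster': 1, 'hairy': 0, 'harry': 0, 'jill': 0}

print(round(cosine_sim(doc_a, doc_a), 4))  # 1.0
print(round(cosine_sim(doc_a, doc_b), 4))  # 0.866
print(round(cosine_sim(doc_b, doc_c), 4))  # 0.0
```

Identical vectors score 1, vectors with no shared terms score 0, and partial overlap falls in between.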

### TF-IDF

#### 1st approach TF-IDF, the harder version

In [195]:
kite_history = """Kites were invented in China, where materials ideal for kite building were readily 
available: silk fabric for sail material; fine, high-tensile-strength silk for flying line; and resilient
bamboo for a strong, lightweight framework.
The kite has been claimed as the invention of the 5th-century BC Chinese philosophers Mozi (also Mo Di) 
and Lu Ban (also Gongshu Ban). By 549 AD paper kites were certainly being flown, as it was recorded that 
in that year a paper kite was used as a message for a rescue mission. Ancient and medieval Chinese sources
describe kites being used for measuring distances, testing the wind, lifting men, signaling, and 
communication for military operations. The earliest known Chinese kites were flat (not bowed) and 
often rectangular. Later, tailless kites incorporated a stabilizing bowline. Kites were decorated with 
mythological motifs and legendary figures; some were fitted with strings and whistles to make musical 
sounds while flying. From China, kites were introduced to Cambodia, Thailand, India, Japan, Korea and 
the western world.
After its introduction into India, the kite further evolved into the fighter kite, known as the patang 
in India, where thousands are flown every year on festivals such as Makar Sankranti.
Kites were known throughout Polynesia, as far as New Zealand, with the assumption being that the 
knowledge diffused from China along with the people. Anthropomorphic kites made from cloth and wood 
were used in religious ceremonies to send prayers to the gods. Polynesian kite traditions are used by 
anthropologists get an idea of early “primitive” Asian traditions that are believed to have at one time 
existed in Asia."""

Tokenize both texts and compare the term frequency of single words across them

In [198]:
intro_tokens = tokenizer.tokenize(kite_text.lower())
history_tokens = tokenizer.tokenize(kite_history.lower())
intro_total = len(intro_tokens)
history_total = len(history_tokens)
print('tokens in kite text:', intro_total)
print('tokens in kite history:', history_total)
print('')

print("This section is to show what zipf's law is")
# compute the TF of one word in two documents
intro_tf = {}
history_tf = {}
intro_counts = Counter(intro_tokens)
intro_tf['kite'] = intro_counts['kite'] / intro_total
history_counts = Counter(history_tokens)
history_tf['kite'] = history_counts['kite'] / history_total

# "kite" is about twice as frequent in the intro document as in the history document,
# so raw TF suggests the intro is the more relevant text for "kite"
print('Term Frequency of word "kite" in intro document is: {:.4f}'.format(intro_tf['kite']))
print('Term Frequency of word "kite" in history document is: {:.4f}'.format(history_tf['kite']))

# "and" has a similar TF in both documents although it says nothing about either topic,
# which is why TF alone is not enough and IDF is introduced next
intro_tf['and'] = intro_counts['and'] / intro_total
history_tf['and'] = history_counts['and'] / history_total
print('Term Frequency of word "and" in intro document is: {:.4f}'.format(intro_tf['and']))
print('Term Frequency of word "and" in history document is: {:.4f}'.format(history_tf['and']))
print('')

tokens in kite text: 361
tokens in kite history: 295

This section is to show what zipf's law is
Term Frequency of word "kite" in intro document is: 0.0388
Term Frequency of word "kite" in history document is: 0.0203
Term Frequency of word "and" in intro document is: 0.0277
Term Frequency of word "and" in history document is: 0.0305



calculating the TF-IDF of only 3 words: "and", "kite" and "china"

In [197]:
num_docs_containing_and = 0
for doc in [intro_tokens, history_tokens]:
    if 'and' in doc:
        num_docs_containing_and += 1

num_docs_containing_kite = 0
for doc in [intro_tokens, history_tokens]:
    if 'kite' in doc:
        num_docs_containing_kite += 1

num_docs_containing_china = 0
for doc in [intro_tokens, history_tokens]:
    if 'china' in doc:
        num_docs_containing_china += 1

intro_tf['china'] = intro_counts['china'] / intro_total
history_tf['china'] = history_counts['china'] / history_total
num_docs = 2
intro_idf = {}
history_idf = {}
intro_idf['and'] = num_docs / num_docs_containing_and
history_idf['and'] = num_docs / num_docs_containing_and
intro_idf['kite'] = num_docs / num_docs_containing_kite
history_idf['kite'] = num_docs / num_docs_containing_kite
intro_idf['china'] = num_docs / num_docs_containing_china
history_idf['china'] = num_docs / num_docs_containing_china

intro_tfidf = {}
intro_tfidf['and'] = intro_tf['and'] * intro_idf['and']
intro_tfidf['kite'] = intro_tf['kite'] * intro_idf['kite']
intro_tfidf['china'] = intro_tf['china'] * intro_idf['china']
history_tfidf = {}
history_tfidf['and'] = history_tf['and'] * history_idf['and']
history_tfidf['kite'] = history_tf['kite'] * history_idf['kite']
history_tfidf['china'] = history_tf['china'] * history_idf['china']
print('TF-IDF of the words in document intro is: ', intro_tfidf)
print('TF-IDF of the words in document history is: ', history_tfidf)

TF-IDF of the words in document intro is:  {'and': 0.027700831024930747, 'kite': 0.038781163434903045, 'china': 0.0}
TF-IDF of the words in document history is:  {'and': 0.030508474576271188, 'kite': 0.020338983050847456, 'china': 0.020338983050847456}
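
The per-word bookkeeping above can be folded into one helper. The following is a minimal sketch, assuming the same un-logged IDF used in this notebook (number of documents divided by the number of documents containing the word). The function name `tfidf` and the toy token lists are made up for illustration; they are not the kite texts.

```python
from collections import Counter


def tfidf(word, doc_tokens, all_docs_tokens):
    """TF-IDF of `word` in `doc_tokens`, using the plain ratio N / df as IDF."""
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    # document frequency: how many tokenized documents contain the word
    df = sum(1 for tokens in all_docs_tokens if word in tokens)
    if df == 0:
        return 0.0  # word absent from the whole corpus: avoid dividing by zero
    return tf * len(all_docs_tokens) / df


# Hypothetical toy corpus (already tokenized and lowercased)
doc_a = ['kites', 'fly', 'and', 'kites', 'dive']
doc_b = ['china', 'and', 'kites', 'and', 'silk']
corpus = [doc_a, doc_b]

for word in ['and', 'kites', 'china']:
    print(word, tfidf(word, doc_a, corpus), tfidf(word, doc_b, corpus))
```

A word that occurs in only one of the two documents ('china' in this toy corpus) gets its TF doubled, while a word present in both documents keeps its raw TF.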


#### 2nd approach TF-IDF with similarities.

Converting each document of the corpus into a TF-IDF vector: an OrderedDict with one entry per lexicon token.

In [208]:
import copy
print('corpus:', docs)
print('')
document_tfidf_vectors = []
for doc in docs:
    vec = copy.copy(zero_vector)
    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)

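    for key, value in token_counts.items():
        # document frequency: number of documents that contain the token
        # NOTE: `key in _doc` is a substring test on the raw, non-lowercased text,
        # so 'harry' never matches 'Harry' and ends up with idf = 0 (see the output)
        docs_containing_key = 0
        for _doc in docs:
            if key in _doc:
                docs_containing_key += 1
        # tf is normalized by the lexicon size, not by the document length
        tf = value / len(lexicon)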
        if docs_containing_key:
            idf = len(docs) / docs_containing_key
        else:
            idf = 0
        vec[key] = tf * idf
    document_tfidf_vectors.append(vec)
document_tfidf_vectors

corpus: ['The faster Harry got to the score, the faster and faster Harry would get home.', 'Harry is hairy and faster than Jill.', 'Jill is not as hairy as Harry.']



[OrderedDict([(',', 0.16666666666666666),
              ('.', 0.05555555555555555),
              ('and', 0.08333333333333333),
              ('as', 0),
              ('faster', 0.25),
              ('get', 0.16666666666666666),
              ('got', 0.16666666666666666),
              ('hairy', 0),
              ('harry', 0.0),
              ('home', 0.16666666666666666),
              ('is', 0),
              ('jill', 0),
              ('not', 0),
              ('score', 0.16666666666666666),
              ('than', 0),
              ('the', 0.5),
              ('to', 0.16666666666666666),
              ('would', 0.16666666666666666)]),
 OrderedDict([(',', 0),
              ('.', 0.05555555555555555),
              ('and', 0.08333333333333333),
              ('as', 0),
              ('faster', 0.08333333333333333),
              ('get', 0),
              ('got', 0),
              ('hairy', 0.08333333333333333),
              ('harry', 0.0),
              ('home', 0),
              ('i

In [210]:
query = "How long does it take to get to the store?"
query_vec = copy.copy(zero_vector)

# normalize the query: lowercase it, then tokenize
tokens = tokenizer.tokenize(query.lower())
# Counter mapping each query token to its count
token_counts = Counter(tokens)

documents = docs
for key, value in token_counts.items():
    docs_containing_key = 0
    for _doc in documents:
        if key in _doc.lower():
            docs_containing_key += 1
    # Avoiding divide-by-zero-error
    if docs_containing_key == 0:
        continue
    tf = value / len(tokens)
    idf = len(documents) / docs_containing_key
    query_vec[key] = tf * idf

print('TF-IDF of query(list version):',query_vec)

TF-IDF of query(list version): OrderedDict([(',', 0), ('.', 0), ('and', 0), ('as', 0), ('faster', 0), ('get', 0.2727272727272727), ('got', 0), ('hairy', 0), ('harry', 0), ('home', 0), ('is', 0), ('jill', 0), ('not', 0), ('score', 0), ('than', 0), ('the', 0.2727272727272727), ('to', 0.5454545454545454), ('would', 0)])


Computing the cosine similarity between the query and each document

In [204]:
print('similarities between query and 1st, 2nd and 3rd document:')
print(cosine_sim(query_vec, document_tfidf_vectors[0]))
print(cosine_sim(query_vec, document_tfidf_vectors[1]))
print(cosine_sim(query_vec, document_tfidf_vectors[2]))

similarities between query and 1st, 2nd and 3rd document:
0.5677922680888605
0.0
0.0
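
The three calls above can be generalized to rank every document against the query. This is a minimal sketch, assuming vectors stored as dictionaries that share the same keys (as copies of `zero_vector` do). `cosine` and `rank_documents` are hypothetical helpers written for this example, not the `cosine_sim` function used above, and the toy vectors are invented.

```python
import math


def cosine(vec1, vec2):
    """Cosine similarity of two dict vectors; 0.0 if either has zero length."""
    dot = sum(vec1[k] * vec2.get(k, 0) for k in vec1)
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, similarity) pairs, most similar document first."""
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors over a three-word lexicon
toy_query = {'get': 1.0, 'the': 1.0, 'jill': 0.0}
toy_docs = [
    {'get': 0.0, 'the': 0.0, 'jill': 2.0},
    {'get': 1.0, 'the': 2.0, 'jill': 0.0},
]
ranking = rank_documents(toy_query, toy_docs)
print(ranking)
```

Returning the index together with the score keeps the link back to the original document list after sorting.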


#### 3rd Approach - The best version with scikit-learn.

In [218]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = docs
vectorizer = TfidfVectorizer(min_df=1)
model = vectorizer.fit_transform(corpus)
print('corpus ndarray output:')
print('message: need to include the similarities (like 2nd approach) between '
      + 'TF-IDF vectors: query and corpus!')
# toarray() returns a plain ndarray, matching the output shown below
model.toarray().round(2)

corpus ndarray output:
message: need to include the similarities (like 2nd approach) between TF-IDF vectors: query and corpus!


array([[0.16, 0.  , 0.48, 0.21, 0.21, 0.  , 0.25, 0.21, 0.  , 0.  , 0.  ,
        0.21, 0.  , 0.64, 0.21, 0.21],
       [0.37, 0.  , 0.37, 0.  , 0.  , 0.37, 0.29, 0.  , 0.37, 0.37, 0.  ,
        0.  , 0.49, 0.  , 0.  , 0.  ],
       [0.  , 0.75, 0.  , 0.  , 0.  , 0.29, 0.22, 0.  , 0.29, 0.29, 0.38,
        0.  , 0.  , 0.  , 0.  , 0.  ]])
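
These numbers differ from the 2nd approach because `TfidfVectorizer` weights terms differently. With its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) it uses the raw count as TF and `ln((1 + n) / (1 + df)) + 1` as IDF, then scales each row to unit length. Its default tokenizer also lowercases the text and drops punctuation and one-character tokens. The sketch below redoes that arithmetic by hand for the third document. `sklearn_like_tfidf` is a hypothetical helper, and the regular expression only approximates the library's default token pattern.

```python
import math
import re
from collections import Counter

docs = [
    'The faster Harry got to the score, the faster and faster Harry would get home.',
    'Harry is hairy and faster than Jill.',
    'Jill is not as hairy as Harry.',
]


def tokenize(text):
    # lowercase, then keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())


def sklearn_like_tfidf(doc, corpus):
    """Smoothed-IDF, L2-normalized TF-IDF weights for one document."""
    tokenized = [tokenize(d) for d in corpus]
    n_docs = len(corpus)
    weights = {}
    for term, count in Counter(tokenize(doc)).items():
        df = sum(1 for tokens in tokenized if term in tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights[term] = count * idf
    # scale the whole vector to unit Euclidean length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


third = sklearn_like_tfidf(docs[2], docs)
print({term: round(w, 2) for term, w in sorted(third.items())})
```

The rounded weights reproduce the nonzero entries of the third row of the array: 0.75 for 'as', 0.29 for 'hairy', 'is' and 'jill', 0.22 for 'harry' and 0.38 for 'not'.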

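
The message printed in the cell above notes that the query similarities are still missing from this approach. One possible way to add them is sketched here, assuming the query should be projected with the already fitted vectorizer: `TfidfVectorizer.transform` maps the query onto the corpus vocabulary, and `sklearn.metrics.pairwise.cosine_similarity` compares it with every document row. The variable names are illustrative; the corpus and the query are the ones used in the 2nd approach.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    'The faster Harry got to the score, the faster and faster Harry would get home.',
    'Harry is hairy and faster than Jill.',
    'Jill is not as hairy as Harry.',
]
query = 'How long does it take to get to the store?'

vectorizer = TfidfVectorizer(min_df=1)
doc_matrix = vectorizer.fit_transform(docs)    # one TF-IDF row per document
query_matrix = vectorizer.transform([query])   # reuse the fitted vocabulary and IDF

# cosine_similarity returns a (1, n_docs) array; flatten it to one score per document
similarities = cosine_similarity(query_matrix, doc_matrix).ravel()
for i, score in enumerate(similarities):
    print('document {}: {:.4f}'.format(i, score))
```

Only the first document shares vocabulary with the query ('to', 'get', 'the'), so the other two scores should be zero, which agrees with the ranking from the 2nd approach.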
## Finding Meanings in Words

### Spam classifier

#### get the data

In [12]:
import pandas as pd
from nlpia.data.loaders import get_data
pd.options.display.width = 120
sms = get_data('sms-spam')

index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms = pd.DataFrame(sms.values, columns=sms.columns, index=index)
sms['spam'] = sms.spam.astype(int)
print('number of messages', len(sms))
sms.spam.sum()  # number of spam messages (not echoed: only a cell's last expression is displayed)

sms.head(6)


number of messages 4837


| | spam | text |
|---|---|---|
| sms0 | 0 | Go until jurong point, crazy.. Available only ... |
| sms1 | 0 | Ok lar... Joking wif u oni... |
| sms2! | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| sms3 | 0 | U dun say so early hor... U c already then say... |
| sms4 | 0 | Nah I don't think he goes to usf, he lives aro... |
| sms5! | 1 | FreeMsg Hey there darling it's been 3 week's n... |
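
The index built above appends one `!` to the label of every spam message, because `'!' * j` is an empty string when the flag is 0 and `'!'` when it is 1. A minimal sketch with a hypothetical list of flags standing in for the real `sms.spam` column:

```python
# Hypothetical spam flags for three messages (0 = not spam, 1 = spam)
flags = [0, 0, 1]

# '!' * 0 == '' and '!' * 1 == '!', so only spam rows get the marker
index = ['sms{}{}'.format(i, '!' * j) for i, j in zip(range(len(flags)), flags)]
print(index)  # ['sms0', 'sms1', 'sms2!']
```

This makes spam rows easy to spot in any later table that shows the index.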


#### TF-IDF transformation