## My N-Gram Model
Create a smart keyboard that predicts the next word based on the previous word(s).  

Uses the scikit-learn CountVectorizer object to convert text to frequency counts.  
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [666]:
#meta 9/3/2018
#prev: based on Bigram model with training wheels 
#work with 3 small corpora in predictNextWord_Bigram_TrainingWheels.ipynb

#prev: using a small sample of 3 data sources: twitter, blogs, and news
#prev: use regex to find all n-grams with prev word == prev_token.

#prev: 2-gram model with training wheels
#    figured out 2-gram model first, according to N-gram model instructions.
#    used regex to find all n-grams with prev word == prev_token.
#    wrote function after "training wheels"

#here: build 3-gram model with training wheels
#    need to compute a subset of rows:
#    for bigrams, only 1 row where first word = prev_token
#    for trigrams, 1 row where two first words = prev_token
#    for n-grams, 1 row where n-1 first words (before last) = prev_token

#surprisingly similar to bigram model.
#note: no longer need unigrams, pass n-1 instead of 1 to CountVectorizer => benefits smaller dtm!

#next: build 3-gram model with no training wheels in predictNextWord.ipynb


In [667]:
import time
import numpy as np
import pandas as pd
import re #for regex and pattern matching
import matplotlib.pyplot as plt #for drawing plots
%matplotlib inline

#NLP libraries
from sklearn.feature_extraction.text import CountVectorizer

#not used
#from collections import Counter #for document-term counting


In [668]:
### start clock
start_time=time.time()

### 0. Load Data
Data source is 3 files

In [669]:
#get rid of punctuation
#  re explanation: 
#  refer to https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
#  replaces not (^) word characters or spaces with the empty string. 
#  Be careful though, the \w matches underscore too usually for example
# originally was
#   words_news = re.sub(r'[^\w\s]','',open('sampleData/en_US.news_small.txt').read().lower())
#get rid of numbers too
text_twitter_in = re.sub(r'[^\w\s]|[\d]','',open('sampleData/en_US.twitter_small.txt').read().lower())
text_news_in = re.sub(r'[^\w\s]|[\d]','',open('sampleData/en_US.news_small.txt').read().lower())
text_blogs_in = re.sub(r'[^\w\s]|[\d]','',open('sampleData/en_US.blogs_small.txt').read().lower())
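To see what the cleaning pattern keeps and drops, here is a minimal sketch on a made-up string. The helper name `clean_text` and the sample text are assumptions for illustration only; the pattern is the one used in the cell above.

```python
import re

def clean_text(raw: str) -> str:
    """Lowercase, then drop punctuation and digits (same pattern as the cell above)."""
    return re.sub(r'[^\w\s]|[\d]', '', raw.lower())

# Made-up sample: contains punctuation, digits, an underscore and a newline.
sample = "RT @user_1: It's 5 o'clock!\nSee you @ 10?"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that the underscore survives because `\w` matches it, newlines survive because `\s` matches them, and removing a digit that sat between two spaces leaves a double space behind.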



In [670]:
#preview - notice \n markers of each new line
text_twitter_in


'how are you btw thanks for the rt you gonna be in dc anytime soon love to see you been way way too long\nwhen you meet someone special youll know your heart will beat more rapidly and youll smile for no reason\ntheyve decided its more fun if i dont\nso tired d played lazer tag  ran a lot d ughh going to sleep like in  minutes \nwords from a complete stranger made my birthday even better \nfirst cubs game ever wrigley field is gorgeous this is perfect go cubs go\ni no i get another day off from skool due to the wonderful snow  and this wakes me updamn thing\nim coo jus at work hella tired r u ever in cali\nthe new sundrop commercial hehe love at first sight\nwe need to reconnect this week\ni always wonder how the guys on the auctions shows learned to talk so fast all i hear is djsosnekspqnslanskam\ndammnnnnn what a catch\nsuch a great picture the green shirt totally brings out your eyes\ndesk put together room all set up oh boy oh boy\nim doing it\nbeauty brainstorming in the alchemy o

In [671]:
#capture beg and end of line with special delimiters <s> </s> 
text_twitter = re.sub(r'\n',' </s> </s> <s> <s> ',text_twitter_in)
text_twitter = ' '.join((' <s> <s> ', text_twitter,' </s> </s> '))

text_news = re.sub(r'\n',' </s> </s> <s> <s> ', text_news_in)
text_news = ' '.join((' <s> <s> ', text_news,' </s> </s> '))

text_blogs = re.sub(r'\n',' </s> </s> <s> <s> ', text_blogs_in)
text_blogs = ' '.join((' <s> <s> ', text_blogs,' </s> </s> '))
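The three near-identical blocks above could be folded into one helper. This is only a sketch: `add_sentence_markers` is a hypothetical name, and it assumes an n-gram model wants n-1 start and n-1 end markers per line (two of each for trigrams, as in the cell above). Unlike the `re.sub` version, it does not produce the extra leading and trailing spaces.

```python
def add_sentence_markers(text: str, n: int = 3) -> str:
    """Wrap every line of text in n-1 <s> markers and n-1 </s> markers."""
    start = ' '.join(['<s>'] * (n - 1))
    end = ' '.join(['</s>'] * (n - 1))
    # One marked-up chunk per input line, joined back into a single string.
    return ' '.join(f'{start} {line} {end}' for line in text.split('\n'))

marked = add_sentence_markers('how are you\nim doing it')
print(marked)
```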

In [672]:
#preview
text_twitter #class string
#text_news
#text_blogs

' <s> <s>  how are you btw thanks for the rt you gonna be in dc anytime soon love to see you been way way too long </s> </s> <s> <s> when you meet someone special youll know your heart will beat more rapidly and youll smile for no reason </s> </s> <s> <s> theyve decided its more fun if i dont </s> </s> <s> <s> so tired d played lazer tag  ran a lot d ughh going to sleep like in  minutes  </s> </s> <s> <s> words from a complete stranger made my birthday even better  </s> </s> <s> <s> first cubs game ever wrigley field is gorgeous this is perfect go cubs go </s> </s> <s> <s> i no i get another day off from skool due to the wonderful snow  and this wakes me updamn thing </s> </s> <s> <s> im coo jus at work hella tired r u ever in cali </s> </s> <s> <s> the new sundrop commercial hehe love at first sight </s> </s> <s> <s> we need to reconnect this week </s> </s> <s> <s> i always wonder how the guys on the auctions shows learned to talk so fast all i hear is djsosnekspqnslanskam </s> </s> <

In [673]:
#combine all words
text_all = text_twitter + ' ' + text_news + ' ' + text_blogs
text_all2 = []
text_all2.append(text_all)


In [674]:
#validate counts - len() of a string counts characters, not words
print ('Twitter chars', len(text_twitter))
print ('News chars', len(text_news))
print ('Blogs chars', len(text_blogs))
print ('All chars', len(text_all2[0]))

text_all[-100:]

Twitter chars 208842
News chars 8477
Blogs chars 22447
All chars 239768


'ommon prayer he became bishop of london and worked to improve the conditions of the poor  </s> </s> '

### Continue with Trigrams

In [675]:
#need [not unigrams,] bigrams and trigrams
n=3
vectorizer = CountVectorizer(token_pattern=r'(?u)[\<[\/]*]?\b\w+\b[\>*]?', ngram_range=(n-1, n))
#dtm with 1 document
dtm = vectorizer.fit_transform(text_all2) #class 'scipy.sparse.csr.csr_matrix'
print ('DTM type ', type(dtm))

vocab_list = vectorizer.get_feature_names() #class list; removed in scikit-learn 1.2, where list(vectorizer.get_feature_names_out()) is the equivalent
vocab_total=len(vocab_list)
print('Successful vocab - with trigrams: ', vocab_total)

#preview vocab and verify trigrams
vocab_list[-30:] #class list

DTM type  <class 'scipy.sparse.csr.csr_matrix'>
Successful vocab - with trigrams:  64104


['zig ziglar',
 'zig ziglar </s>',
 'ziglar </s>',
 'ziglar </s> </s>',
 'zija are',
 'zija are going',
 'zirconias </s>',
 'zirconias </s> </s>',
 'zither a',
 'zither a euphonium',
 'zkabob and',
 'zkabob and brighten',
 'zodhopefully its',
 'zodhopefully its true',
 'zombie magazine',
 'zombie magazine with',
 'zombies </s>',
 'zombies </s> </s>',
 'zone </s>',
 'zone </s> </s>',
 'zone before',
 'zone before the',
 'zone block',
 'zone block get',
 'zoom in',
 'zoom in and',
 'zooming is',
 'zooming is overused',
 'zutara and',
 'zutara and ect']
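The custom `token_pattern` matters because CountVectorizer's default pattern, `(?u)\b\w\w+\b`, would drop the `<s>` and `</s>` markers as well as one-letter words such as 'i' and 'a'. The sketch below mimics the tokenizing and n-gram steps in plain Python. `TOKEN_RE` is a simplified, hypothetical stand-in for the pattern above (not tested against the notebook's data), and `ngrams` is an illustrative helper.

```python
import re

# Hypothetical, simplified pattern: a literal <s> or </s> marker, or a plain word.
TOKEN_RE = re.compile(r'</?s>|\b\w+\b')

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by single spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = TOKEN_RE.findall('<s> <s> i am sam </s> </s>')
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(tokens)
print(trigrams)
```

With `ngram_range=(n-1, n)` the vocabulary is the union of these two lists, which is why bigrams and trigrams are interleaved in the output above.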

In [676]:
dtm

<1x64104 sparse matrix of type '<class 'numpy.int64'>'
	with 64104 stored elements in Compressed Sparse Row format>

In [677]:
#query dtm - only works with 1 row;
#if multiple rows, there's no instance of '</s> <s>'
print ('DTM type ', type(dtm))
ngram_value = '</s> <s>'
#ngram_value = 'am sam'
ngram_idx = vocab_list.index(ngram_value)
print ('Query dtm: how many times an n-gram occurs in the text')
dtm[0,ngram_idx]

DTM type  <class 'scipy.sparse.csr.csr_matrix'>
Query dtm: how many times an n-gram occurs in the text


2607

#### From sparse matrix into NumPy array  
NumPy arrays support a greater variety of operations than lists

In [678]:
#convert from current format, sparse matrix, into a normal numpy array 
print ('DTM type before: ', type(dtm))
dtm = dtm.toarray()
print ('DTM type after', type(dtm))
dtm[0]

DTM type before:  <class 'scipy.sparse.csr.csr_matrix'>
DTM type after <class 'numpy.ndarray'>


array([2608, 2607, 2607, ...,    1,    1,    1], dtype=int64)

In [679]:
#convert python list storing vocab into numpy array
vocab = np.array(vocab_list)
vocab[:10]

array(['</s> </s>', '</s> </s> <s>', '</s> <s>', '</s> <s> <s>',
       '<s> <s>', '<s> <s> a', '<s> <s> abcnews', '<s> <s> about',
       '<s> <s> act', '<s> <s> actorfest'], dtype='<U44')

In [680]:
#query dtm
ngram_idx = list(vocab).index(ngram_value)
dtm[0,ngram_idx]

2607

#### Using NumPy indexing is more natural

In [681]:
dtm[0,vocab == ngram_value]

array([2607], dtype=int64)
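The two lookup styles return different things, which explains why later cells print values like `[0.4]` rather than `0.4`. A small sketch with made-up counts (the array contents are assumptions, not the notebook's data):

```python
import numpy as np

vocab = np.array(['</s> <s>', 'your eyes', 'your eyes say'])
counts = np.array([[3, 5, 2]])  # shape (1, 3), like a one-document dtm

# Index-based lookup: a plain scalar, but list.index raises ValueError if the term is missing.
idx = list(vocab).index('your eyes')
by_index = counts[0, idx]

# Boolean-mask lookup: always an array, with one element per matching term.
by_mask = counts[0, vocab == 'your eyes']
missing = counts[0, vocab == 'no such term']  # empty array instead of an exception

print(by_index, by_mask, missing)
```

The mask version never raises on a missing term, so the caller has to check for an empty result instead.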

#### Print frequency counts (aka dtm)

In [682]:
dtm[0,:]

array([2608, 2607, 2607, ...,    1,    1,    1], dtype=int64)

In [683]:
#print dtm frequency counts
df = pd.DataFrame(dtm,columns = vocab)
df

Unnamed: 0,</s> </s>,</s> </s> <s>,</s> <s>,</s> <s> <s>,<s> <s>,<s> <s> a,<s> <s> abcnews,<s> <s> about,<s> <s> act,<s> <s> actorfest,...,zone before,zone before the,zone block,zone block get,zoom in,zoom in and,zooming is,zooming is overused,zutara and,zutara and ect
0,2608,2607,2607,2607,2608,14,1,3,1,1,...,1,1,1,1,1,1,1,1,1,1


### Calculate some trigram probabilities from this corpus - manually
I didn't think this was possible yet; I assumed I had to build a bigram-to-trigram matrix first.  

Wrong.  The counts already hold everything needed.  This is probably not efficient, but it is sufficient.  Efficiency can wait until the statistics are worked out.
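The quantity computed in the next cells is the relative-frequency estimate P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2). Here is a self-contained sketch on a toy corpus, using `collections.Counter` instead of the dtm; the helper names and the corpus are assumptions for illustration.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def trigram_prob(trigram, counts):
    """count(w1 w2 w3) / count(w1 w2); assumes the bigram w1 w2 was seen at least once."""
    context = ' '.join(trigram.split()[:2])
    return counts[trigram] / counts[context]

tokens = '<s> <s> i am sam </s> </s> <s> <s> i am tired </s> </s>'.split()
counts = count_ngrams(tokens, 2) + count_ngrams(tokens, 3)

p = trigram_prob('i am sam', counts)
print(p)
```

Here 'i am' occurs twice and 'i am sam' once, so the estimate is 0.5. An unseen context would cause a division by zero, which this sketch does not handle.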


In [684]:
#P(I|<s>) = ?


In [685]:
#calculate some trigram probabilities - 
#P(fairytale|your own) = ? 

n_gram_of = 'your own fairytale'
n_gram_given = 'your own'
p_ngram = dtm[0,list(vocab).index(n_gram_of)] / dtm[0,list(vocab).index(n_gram_given)]

p_ngram

0.2

In [686]:
#P(will|your heart) = ?

n_gram_of = 'your heart will'
n_gram_given = 'your heart'
p_ngram = dtm[0,list(vocab).index(n_gram_of)] / dtm[0,list(vocab).index(n_gram_given)]

p_ngram

0.25

In [687]:
#P(reason|for no)  = ?

n_gram_of = 'for no reason'
n_gram_given = 'for no'
p_ngram = dtm[0,list(vocab).index(n_gram_of)] / dtm[0,list(vocab).index(n_gram_given)]

p_ngram 

1.0

In [688]:
#P(<s>|</s> <s>)  = ?

n_gram_of = '</s> <s> <s>'
n_gram_given = '</s> <s>'
p_ngram = dtm[0,list(vocab).index(n_gram_of)] / dtm[0,list(vocab).index(n_gram_given)]

p_ngram 

1.0

### Calculate specific trigram probabilities - Loop vs Vectorized
Don't need to build a matrix of trigrams to bigram.  
Only need to compute a subset of rows:  
 - for trigrams, only 1 row where first 2 words = prev token 

In [689]:
prev_token = 'your eyes'
#Q: given 'your eyes', what are all trigrams and their probabilities?
#P([trigrams]|your eyes)  = [p1,p2,..., pn] 

#work with all vocab
#print(vocab)

#--check if vocab contains a bigram token and if yes, where
print ('vocab contains a bigram token? ', prev_token in vocab) #True
#vocab.index(prev_token) #works with list, doesn't work with numpy array
list(vocab).index(prev_token) #works with numpy array



vocab contains a bigram token?  True


63648

#### Find n-grams (trigrams n=3)
Use regex to find all n-grams with prev word(s) == prev_token.

In [690]:
#build regex
my_regex = r'\b' + prev_token + r'\b \w+?$'
print (my_regex)

\byour eyes\b \w+?$


In [691]:
#--check if vocab contains n-grams with prev word = prev_token 'your eyes' (includes trigrams starting with prev_token)
for term in vocab:
    #print (term, prev_token)
    #print (term == prev_token)
    if re.search(my_regex, term):
        print (term)


your eyes say
your eyes tonight
your eyes where
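Two properties of the regex above are worth knowing. The final `\w+?$` cannot match `</s>`, so a trigram like 'your eyes </s>' is never listed, and the leading `\b` prevents any match when prev_token itself starts with `<s>`. The sketch below shows one possible alternative, anchored at the start and built with `re.escape`. `continuation_regex` is a hypothetical helper, and whether `</s>` should count as a candidate is a design choice left open here.

```python
import re

def continuation_regex(prev_token: str):
    """Match exactly '<prev_token> <one more token>', markers included."""
    return re.compile(r'^' + re.escape(prev_token) + r' \S+$')

vocab_list = ['close your eyes', 'your eyes', 'your eyes </s>',
              'your eyes say', 'your eyes tonight']
pattern = continuation_regex('your eyes')
matches = [term for term in vocab_list if pattern.search(term)]
print(matches)

# The notebook-style pattern finds nothing for a context made of markers.
old_style = re.compile(r'\b' + '<s> <s>' + r'\b \w+?$')
print(old_style.search('<s> <s> about'))                      # no match
print(continuation_regex('<s> <s>').search('<s> <s> about'))  # match
```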


#### Vectorized
no looping  

In [692]:
r = re.compile(my_regex)
eligible_ngrams_list = list(filter(r.search, vocab_list)) # filter keeps only the terms the regex matches
print(eligible_ngrams_list)

['your eyes say', 'your eyes tonight', 'your eyes where']
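`filter` avoids an explicit `for` statement but still calls the regex once per term in Python. If a NumPy-level version is wanted, one option is the string routines in `np.char`, sketched below on a made-up vocabulary. It uses a prefix test plus a token count instead of a regex, so it is an alternative technique rather than a drop-in replacement.

```python
import numpy as np

vocab = np.array(['close your eyes', 'your eyes', 'your eyes say',
                  'your eyes tonight', 'your eyesight is'])
prev_token = 'your eyes'
n = 3

# Term must start with prev_token followed by a space ...
starts = np.char.startswith(vocab, prev_token + ' ')
# ... and contain exactly n space-separated tokens.
n_tokens = np.char.count(vocab, ' ') + 1
mask = starts & (n_tokens == n)

eligible = vocab[mask]
print(eligible)
```

The trailing space in the prefix keeps 'your eyesight is' out of the result.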


#### How to read it
Given previous words 'your eyes', get all 'eligible' n-grams:  

Correct for bigrams: 
- Given previous word 'your', get all bigrams starting with prev_token.

Correct for trigrams:
- Given previous two words 'your eyes', get trigrams starting with prev_token, i.e. prev_token is everything before the last word.  

Later for 3+ grams: 
- Given previous n-1 words, get n-grams where prev_token is everything before the last word.  


Once all 'eligible' n-grams are found, compile a matrix with counts of the relevant frequencies.

#### Compute prob of all trigrams 

Looping

In [693]:
prev_ngram_count = dtm[0,vocab==prev_token]
print('Previous n-gram ', prev_token, prev_ngram_count)

for ngram in eligible_ngrams_list:
    print(ngram)
    #print(np.where(vocab==ngram, 1, -1))
    
    #index based
    ngram_idx = vocab_list.index(ngram)
    print('index ', ngram_idx)
    ngram_count2  = dtm[0,ngram_idx]
    print('count', ngram_count2)

    #less code - not index based
    #ngram_count  = dtm[0,vocab == ngram]
    #print('count', ngram_count)
    
    #compute prob of each trigram | prev_token
    ngram_prob = ngram_count2 / prev_ngram_count
    print ('relative prob ', ngram_prob)

#eligible_idx = vocab_list.index([eligible_ngrams_list]) #raises ValueError - correct behavior: index() looks up a single term, not a list
    

Previous n-gram  your eyes [5]
your eyes say
index  63650
count 2
relative prob  [0.4]
your eyes tonight
index  63651
count 1
relative prob  [0.2]
your eyes where
index  63652
count 1
relative prob  [0.2]


#### Vectorized 

Given prev_token, compute probability of all trigrams with first words == prev_token

In [694]:
#reminders: prev_token = 'your eyes'
#  get a vector (np.array) where value == term
a = np.where(vocab==prev_token,1,0)
a.sum() #given our vocab can only be 1 (found) or 0 (not found)

#  get index of a n-gram
prev_token_idx = vocab_list.index(prev_token)
prev_token_idx

#  get all values from list where values match regex pattern
eligible_ngrams_list = list(filter(r.search, vocab_list))
eligible_ngrams_list
#note: this list gives us all trigrams we're interested in to build a 3-gram model

['your eyes say', 'your eyes tonight', 'your eyes where']

### Next: get index of each trigram, compute relative probs and pick the trigram with the highest prob

In [695]:
# get index of each trigram
eligible_ngrams_idx = [i for i, w in enumerate(vocab_list) if re.search(my_regex,w)]
eligible_ngrams_idx


[63650, 63651, 63652]

In [696]:
# compute relative trigram probs 
#[print(i) for i, w in enumerate(vocab_list) if re.search(my_regex,w)]
eligible_ngrams_probs = [dtm[0,i]/prev_ngram_count for i, w in enumerate(vocab_list) if re.search(my_regex,w)]
eligible_ngrams_probs


[array([0.4]), array([0.2]), array([0.2])]
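The list comprehension above yields a list of one-element arrays because `prev_ngram_count` is itself a one-element array. A sketch of the same computation done in one NumPy expression follows; the counts are copied from the loop output above, while the small vocabulary around them and the prefix-based mask are assumptions.

```python
import numpy as np

vocab = np.array(['your eyes', 'your eyes say', 'your eyes tonight',
                  'your eyes where', 'zoom in'])
dtm = np.array([[5, 2, 1, 1, 1]])
prev_token = 'your eyes'

# Take element [0] so the denominator is a scalar, not a 1-element array.
prev_count = dtm[0, vocab == prev_token][0]

# Select every term that continues prev_token and divide all counts at once.
mask = np.char.startswith(vocab, prev_token + ' ')
probs = dtm[0, mask] / prev_count
print(probs)
```

The result is a flat float array, which `np.argmax` handles directly.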

In [697]:
# pick the trigram with the highest prob - only one (if more than one, pick the first one)
best_ngram_idx = np.argmax(eligible_ngrams_probs)
print("Best ngram index: ", best_ngram_idx)
best_ngram_value = eligible_ngrams_list[best_ngram_idx]
print("Best ngram value: ",best_ngram_value)
next_word = best_ngram_value.split()[n-1]
print("Predict next word: ", next_word)

Best ngram index:  0
Best ngram value:  your eyes say
Predict next word:  say


In [698]:
# [optional] - if want to consider more than one next predictions, and array has equal probs
# pick the trigram(s) with the highest prob - if more than one
best_ngrams_value = max(eligible_ngrams_probs)

#best_ngrams = np.isin(model_probs, best_ngrams_value)
#best_ngrams_idx = np.where(best_ngrams, 1, -1)
#best_ngrams_idx
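For the optional tie case, one way to collect every n-gram that shares the top probability is `np.flatnonzero` on an equality mask. This is a sketch with made-up probabilities chosen to contain a tie; for probabilities that come out of floating-point arithmetic, `np.isclose` may be the safer comparison.

```python
import numpy as np

eligible = ['your eyes say', 'your eyes tonight', 'your eyes where']
probs = np.array([0.2, 0.4, 0.4])  # made-up values with a tie at the top

# Indices of all entries equal to the maximum, in vocabulary order.
best = np.flatnonzero(probs == probs.max())
candidates = [eligible[i].split()[-1] for i in best]
print(candidates)
```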

In [699]:
#Predict next word - wrap into a function
#must have prev declared variables:
#vocab
#vocab_list
#dtm

def predictNextWord(prev_term):
    #reminders: prev_term = 'your eyes'
    
    # get a vector (np.array) where value == prev_term
    prev_term_exists = np.where(vocab==prev_term,1,0)
    ##print("exists? ", prev_term_exists.sum())
    
    # if prev_term exists, predict next word
    if prev_term_exists.sum(): #given our vocab can only be 1 (found) or 0 (not found)

        ## get index of a n-gram
        ##prev_term_idx = vocab_list.index(prev_term)
        # get count of prev_term
        prev_term_count = dtm[0,vocab==prev_term]
        
        #build regex to find all n-grams with prev word == prev_term.
        this_regex = r'\b' + prev_term + r'\b \w+?$'
        ##print (this_regex)
        r = re.compile(this_regex)

        #  get all values from list where values match regex pattern
        eligible_ngrams_list = list(filter(r.search, vocab_list))
        print("Eligible ngram list: ", eligible_ngrams_list)
        
        # get index of each trigram
        eligible_ngrams_idx = [i for i, w in enumerate(vocab_list) if re.search(this_regex,w)]
        print("Eligible ngram index: ", eligible_ngrams_idx)
        
        # compute relative trigram probs 
        eligible_ngrams_probs = [dtm[0,i]/prev_term_count for i, w in enumerate(vocab_list) if re.search(this_regex,w)]
        print("Eligible ngram probs: ", eligible_ngrams_probs)
        
        # pick the trigram with the highest prob - only one (if more than one, pick the first one)
        best_ngram_idx = np.argmax(eligible_ngrams_probs)
        print("Best ngram index: ", best_ngram_idx)
        best_ngram_value = eligible_ngrams_list[best_ngram_idx]
        print("Best ngram value: ",best_ngram_value)
        next_word = best_ngram_value.split()[n-1]
        print("Predict next word: ", next_word)
        
        return next_word
    else:
        #if prev_term doesn't exist, deal with it later
        print ("No such term found")


### Test the function - Trigram only (with training wheels)
Error handling still to add:  
the case where the bigram exists but no trigram starts with it (np.argmax then fails on an empty list)

In [700]:
#check if such word exists
test_ngram ='how about'
print ('Term exists? ', test_ngram in vocab_list)

predictNextWord(test_ngram)

Term exists?  True
Eligible ngram list:  ['how about data']
Eligible ngram index:  [26136]
Eligible ngram probs:  [array([1.])]
Best ngram index:  0
Best ngram value:  how about data
Predict next word:  data


'data'
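One possible shape for the missing error handling is sketched below. `predict_next_word_safe` is a hypothetical variant, not the notebook's function: it takes the vocabulary and counts as arguments, returns None in both failure cases, and runs on a made-up vocabulary.

```python
import re
import numpy as np

def predict_next_word_safe(prev_term, vocab_list, counts):
    """Return the most likely next word after prev_term, or None if there is none."""
    if prev_term not in vocab_list:
        return None  # context never seen
    pattern = re.compile(r'^' + re.escape(prev_term) + r' \w+$')
    idx = [i for i, term in enumerate(vocab_list) if pattern.search(term)]
    if not idx:
        return None  # context seen, but never followed by a word
    prev_count = counts[vocab_list.index(prev_term)]
    probs = counts[idx] / prev_count
    best = idx[int(np.argmax(probs))]  # first index wins on ties
    return vocab_list[best].split()[-1]

# Made-up vocabulary and counts for illustration.
vocab_list = ['how about', 'how about data', 'so long', 'so long </s>',
              'your eyes', 'your eyes say', 'your eyes tonight']
counts = np.array([1, 1, 2, 2, 3, 2, 1])

print(predict_next_word_safe('your eyes', vocab_list, counts))
print(predict_next_word_safe('so long', vocab_list, counts))
print(predict_next_word_safe('no such', vocab_list, counts))
```

Here 'so long' exists as a bigram but is only ever followed by `</s>`, which is the case that breaks `np.argmax` on an empty list.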

#### Rough notes
discard later

In [701]:
#--check if vocab contains tokens starting/ending with prev_token and if yes, where
#no loop, check if vocab contains trigram token(s) starting/ending with prev_token 
#refer to https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.char.html
#https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.core.defchararray.startswith.html#numpy.core.defchararray.startswith
np.core.defchararray.endswith(vocab, prev_token) #returns boolean array

#np.where(re.search(my_regex, vocab[:]),1,-1)
np.where(np.core.defchararray.startswith(vocab,prev_token),1,-1)


array([-1, -1, -1, ..., -1, -1, -1])

In [702]:
print('Found n-grams ending with: ', prev_token)
#ngrams_idx = np.where(np.core.defchararray.endswith(vocab, ' ' + prev_token))
ngrams_idx = np.where(np.core.defchararray.endswith(vocab, prev_token ))
ngrams_idx[0].tolist()
list(vocab[ngrams_idx])

Found n-grams ending with:  your eyes


['behind your eyes',
 'but your eyes',
 'close your eyes',
 'out your eyes',
 'your eyes']

#### Pausing here...
Pausing after the 3-gram model, which is pretty much the same as the 2-gram model.

#### How to read it
Given previous words 'your eyes', get probabilities of all trigrams starting with 'your eyes'.  Max probability wins.

Next: generalize to n-grams
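As a rough idea of what the generalization could look like, the sketch below maps each (n-1)-word context to a counter of the words that follow it. It deliberately swaps the dtm for a dictionary of `collections.Counter` objects, so it is an alternative data structure and not the approach used in this notebook; `build_model` and `predict` are hypothetical names.

```python
from collections import Counter, defaultdict

def build_model(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = ' '.join(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def predict(model, context):
    """Most frequent follower of context, or None for an unseen context."""
    followers = model.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

tokens = 'close your eyes and open your eyes and smile'.split()
bigram_model = build_model(tokens, 2)
trigram_model = build_model(tokens, 3)

print(predict(bigram_model, 'your'))
print(predict(trigram_model, 'your eyes'))
print(predict(trigram_model, 'no such'))
```

Since the maximum of count/context_count over a fixed context is the maximum of the raw counts, picking the most common follower gives the same winner as the max-probability rule.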

In [703]:
#calc relative frequency of relevant tokens: count / number of distinct n-grams (vocab_total), not a conditional probability
dtm[0, ngrams_idx[0].tolist()]
ngrams_prob = dtm[0, ngrams_idx[0].tolist()]/vocab_total

print("N-grams and their probabilities: ")
print(vocab[ngrams_idx])
ngrams_prob.tolist()


N-grams and their probabilities: 
['behind your eyes' 'but your eyes' 'close your eyes' 'out your eyes'
 'your eyes']


[1.559965056782728e-05,
 1.559965056782728e-05,
 3.119930113565456e-05,
 1.559965056782728e-05,
 7.799825283913641e-05]

### Xtra

#### Word Counts with CountVectorizer
https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

In [704]:
#$xtra - code snippet
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
(1, 8)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]


In [705]:
#$xtra - regex example with a hard coded word
#--check if vocab contains bigrams starting with prev token 'own' 
for term in vocab:
    #print (term, prev_token)
    #print (term == prev_token)
    if re.search(r'^own \w+?$', term):
        print (term)
        
#--check if vocab contains n-grams with prev token 'own' except bigrams
print ('\n')
for term in vocab:
    #print (term, prev_token)
    #print (term == prev_token)
    if re.search(r' own \w+?$', term):
        #print (term.find(prev_token))
        print (term)

print ('\n--Together')        
#--(combine the two above) check if vocab contains n-grams with prev token 'own' including bigrams
for term in vocab:
    #print (term, prev_token)
    #print (term == prev_token)
    if re.search(r'\bown\b \w+?$', term):
        #print (term.find(prev_token))
        print (term)

own and
own cat
own clothes
own content
own danni
own entertainment
own environment
own fairytale
own innermost
own instances
own of
own party
own perfume
own redbox
own supplyon
own your


are own your
emorys own instances
her own and
my own cat
my own clothes
my own supplyon
ones own perfume
still own redbox
their own content
their own entertainment
verry own danni
you own of
your own environment
your own fairytale
your own innermost
your own party

--Together
are own your
emorys own instances
her own and
my own cat
my own clothes
my own supplyon
ones own perfume
own and
own cat
own clothes
own content
own danni
own entertainment
own environment
own fairytale
own innermost
own instances
own of
own party
own perfume
own redbox
own supplyon
own your
still own redbox
their own content
their own entertainment
verry own danni
you own of
your own environment
your own fairytale
your own innermost
your own party


In [706]:
#$xtra - vectorized lookup: regex example with a hard coded word; works with list.  gave up on making it work with np.array
r = re.compile(r'\bown\b \w+?$')
newlist = list(filter(r.search, vocab_list)) # filter keeps only the terms the regex matches
print(newlist)

['are own your', 'emorys own instances', 'her own and', 'my own cat', 'my own clothes', 'my own supplyon', 'ones own perfume', 'own and', 'own cat', 'own clothes', 'own content', 'own danni', 'own entertainment', 'own environment', 'own fairytale', 'own innermost', 'own instances', 'own of', 'own party', 'own perfume', 'own redbox', 'own supplyon', 'own your', 'still own redbox', 'their own content', 'their own entertainment', 'verry own danni', 'you own of', 'your own environment', 'your own fairytale', 'your own innermost', 'your own party']
