## My N-Gram Model
Create a smart keyboard that predicts the next word based on the previous word(s).  

Uses scikit-learn's CountVectorizer object to convert text to frequency counts.  
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [126]:
#meta 8/12/2018
#prev: based on preliminary work with a tiny text corpus in Ngram_exercises.ipynb

#here: using a small sample of 2 data sources: twitter and news
#    got unigrams and bigrams
#    no need to pre-compute the matrix unigram x unigram which would give us all probs of a next unigram, given prev word in a bigram
#    only calculating prob vector given a start token

#next: expand to trigrams

In [127]:
import time
import numpy as np
import pandas as pd
import re #for regex and pattern matching
import matplotlib.pyplot as plt #for drawing plots
%matplotlib inline

#NLP libraries
from sklearn.feature_extraction.text import CountVectorizer

#not used
#from collections import Counter #for document-term counting


In [128]:
### start clock
start_time=time.time()

### 0. Load Data
Data source is 2 files: small samples of Twitter and news text

In [129]:
#get rid of punctuation
#  re explanation: 
#  refer to https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
#  replaces anything that is not (^) a word character or whitespace with the empty string. 
#  Be careful though: \w also matches the underscore, so underscores are kept
# originally was
#   words_news = re.sub(r'[^\w\s]','',open('sampleData/en_US.news_small.txt').read().lower())
#get rid of numbers too
text_twitter_in = re.sub(r'[^\w\s]|[\d]','',open('sampleData/en_US.twitter_small.txt').read().lower())
text_news_in = re.sub(r'[^\w\s]|[\d]','',open('sampleData/en_US.news_small.txt').read().lower())
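
To see what the substitution above does, here is a minimal sketch on a made-up string (the sample text is an assumption, not a line from the data files). It shows that punctuation and digits are dropped while the underscore survives, because `\w` matches it.

```python
import re

# Hypothetical sample line, not taken from the data files
sample = "Hello, World! It's 2018_ok."

# Same pattern as in the cell above: drop anything that is not a word
# character or whitespace, and also drop digits
cleaned = re.sub(r'[^\w\s]|[\d]', '', sample.lower())
print(cleaned)  # hello world its _ok
```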


In [130]:
#preview - notice \n markers of each new line
#text_news_in


In [131]:
#capture beginning and end of line with special delimiters <s> </s> 
text_twitter = re.sub(r'\n','</s> <s> ',text_twitter_in)
text_twitter = ' '.join((' <s>', text_twitter,'</s>'))

text_news = re.sub(r'\n',' </s> <s> ', text_news_in)
text_news = ' '.join(('<s>', text_news,'</s>'))
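
A small sketch of the same delimiter idea on a toy input (the two-line string is an assumption standing in for a cleaned file). The `strip()` call is an addition of this sketch: it avoids an empty trailing sentence when the text ends with a newline.

```python
import re

# Hypothetical two-line input standing in for a cleaned text file
raw = "first line\nsecond line\n"

# Each newline closes one sentence and opens the next one
body = re.sub(r'\n', ' </s> <s> ', raw.strip())

# Wrap the whole text so the first and last sentences are delimited too
marked = ' '.join(('<s>', body, '</s>'))
print(marked)  # <s> first line </s> <s> second line </s>
```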

In [132]:
#preview
text_twitter #class string
text_news

'<s> he wasnt home alone apparently </s> <s> the st louis plant had to close it would die of old age workers had been making cars there since the onset of mass automotive production in the s </s> <s> wsus plans quickly became a hot topic on local online sites though most people applauded plans for the new biomedical center many deplored the potential loss of the building </s> <s> the alaimo group of mount holly was up for a contract last fall to evaluate and suggest improvements to trenton water works but campaign finance records released this week show the two employees donated a total of  to the political action committee pac partners for progress in early june partners for progress reported it gave more than  in both direct and inkind contributions to mayor tony mack in the two weeks leading up to his victory in the mayoral runoff election june  </s> <s> and when its often difficult to predict a laws impact legislators should think twice before carrying any bill is it absolutely nec

In [133]:
#combine all words
text_all = text_twitter + ' ' + text_news
text_all2 = []
text_all2.append(text_all)


In [134]:
#validate counts - len() of a string counts characters, not words
print ('Twitter chars', len(text_twitter))
print ('News chars', len(text_news))
print ('All chars', len(text_all))

text_all

Twitter chars 1692
News chars 1901
All chars 3594


' <s> how are you btw thanks for the rt you gonna be in dc anytime soon love to see you been way way too long</s> <s> when you meet someone special youll know your heart will beat more rapidly and youll smile for no reason</s> <s> theyve decided its more fun if i dont</s> <s> so tired d played lazer tag  ran a lot d ughh going to sleep like in  minutes </s> <s> words from a complete stranger made my birthday even better </s> <s> first cubs game ever wrigley field is gorgeous this is perfect go cubs go</s> <s> i no i get another day off from skool due to the wonderful snow  and this wakes me updamn thing</s> <s> im coo jus at work hella tired r u ever in cali</s> <s> the new sundrop commercial hehe love at first sight</s> <s> we need to reconnect this week</s> <s> i always wonder how the guys on the auctions shows learned to talk so fast all i hear is djsosnekspqnslanskam</s> <s> dammnnnnn what a catch</s> <s> such a great picture the green shirt totally brings out your eyes</s> <s> des

### Start with Unigrams

In [135]:
#get frequency counts
vectorizer = CountVectorizer(token_pattern=r'(?u)[\<[\/]*]?\b\w+\b[\>*]?', ngram_range=(1, 1))
#print(type(vectorizer))
print('Successful regex for this vocab - unigrams with index')

vectorizer.fit(text_all2)
print(vectorizer.vocabulary_)

unigrams = list(vectorizer.vocabulary_)
unigrams


Successful regex for this vocab - unigrams with index
{'<s>': 1, 'how': 157, 'are': 26, 'you': 385, 'btw': 53, 'thanks': 322, 'for': 126, 'the': 324, 'rt': 278, 'gonna': 136, 'be': 35, 'in': 165, 'dc': 85, 'anytime': 23, 'soon': 305, 'love': 199, 'to': 335, 'see': 285, 'been': 39, 'way': 360, 'too': 339, 'long': 193, '</s>': 0, 'when': 368, 'meet': 211, 'someone': 304, 'special': 306, 'youll': 386, 'know': 180, 'your': 387, 'heart': 148, 'will': 370, 'beat': 36, 'more': 215, 'rapidly': 266, 'and': 19, 'smile': 300, 'no': 228, 'reason': 267, 'theyve': 327, 'decided': 86, 'its': 173, 'fun': 130, 'if': 160, 'i': 158, 'dont': 99, 'so': 302, 'tired': 334, 'd': 82, 'played': 251, 'lazer': 183, 'tag': 319, 'ran': 265, 'a': 2, 'lot': 197, 'ughh': 349, 'going': 135, 'sleep': 299, 'like': 189, 'minutes': 213, 'words': 375, 'from': 129, 'complete': 74, 'stranger': 311, 'made': 201, 'my': 222, 'birthday': 46, 'even': 111, 'better': 42, 'first': 124, 'cubs': 81, 'game': 131, 'ever': 113, 'wrigley':

['<s>',
 'how',
 'are',
 'you',
 'btw',
 'thanks',
 'for',
 'the',
 'rt',
 'gonna',
 'be',
 'in',
 'dc',
 'anytime',
 'soon',
 'love',
 'to',
 'see',
 'been',
 'way',
 'too',
 'long',
 '</s>',
 'when',
 'meet',
 'someone',
 'special',
 'youll',
 'know',
 'your',
 'heart',
 'will',
 'beat',
 'more',
 'rapidly',
 'and',
 'smile',
 'no',
 'reason',
 'theyve',
 'decided',
 'its',
 'fun',
 'if',
 'i',
 'dont',
 'so',
 'tired',
 'd',
 'played',
 'lazer',
 'tag',
 'ran',
 'a',
 'lot',
 'ughh',
 'going',
 'sleep',
 'like',
 'minutes',
 'words',
 'from',
 'complete',
 'stranger',
 'made',
 'my',
 'birthday',
 'even',
 'better',
 'first',
 'cubs',
 'game',
 'ever',
 'wrigley',
 'field',
 'is',
 'gorgeous',
 'this',
 'perfect',
 'go',
 'get',
 'another',
 'day',
 'off',
 'skool',
 'due',
 'wonderful',
 'snow',
 'wakes',
 'me',
 'updamn',
 'thing',
 'im',
 'coo',
 'jus',
 'at',
 'work',
 'hella',
 'r',
 'u',
 'cali',
 'new',
 'sundrop',
 'commercial',
 'hehe',
 'sight',
 'we',
 'need',
 'reconnect
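
Why a custom `token_pattern` is needed at all: scikit-learn's documented default pattern keeps only runs of two or more word characters, so it would drop the `<s>` and `</s>` markers and one-letter words such as `i`. The sketch below uses plain `re.findall` to compare the default with a simplified marker-aware pattern. The simplified pattern is an assumption for illustration, not the pattern used in the cell above.

```python
import re

default_pattern = r'(?u)\b\w\w+\b'      # scikit-learn's documented default
marker_pattern = r'(?u)</?s>|\b\w+\b'   # assumed simplification for illustration

# Toy text, including a marker glued to the previous word
sample = "long</s> <s> i am </s>"

default_tokens = re.findall(default_pattern, sample)
marker_tokens = re.findall(marker_pattern, sample)

print(default_tokens)  # ['long', 'am']
print(marker_tokens)   # ['long', '</s>', '<s>', 'i', 'am', '</s>']
```

Note that the marker is still recognized when there is no space in front of it.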

### Continue with Bigrams
Bigram probabilities can be laid out as a unigram x unigram matrix, where each cell holds the probability of the column word following the row word.  Filling it needs frequency counts for bigrams as well as unigrams.
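
A minimal sketch of why unigram and bigram counts are enough, using `collections.Counter` as a stand-in for CountVectorizer. The toy sentences are an assumption and are not part of the corpus.

```python
from collections import Counter

# Toy corpus with sentence markers (made up for illustration)
tokens = "<s> i am sam </s> <s> sam i am </s>".split()

unigram_counts = Counter(tokens)
# Pair each token with its successor to form bigrams
bigram_counts = Counter(zip(tokens, tokens[1:]))

# P(am | i) = count('i am') / count('i')
p_am_given_i = bigram_counts[('i', 'am')] / unigram_counts['i']
# P(i | <s>) = count('<s> i') / count('<s>')
p_i_given_start = bigram_counts[('<s>', 'i')] / unigram_counts['<s>']

print(p_am_given_i, p_i_given_start)  # 1.0 0.5
```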

In [137]:
#need unigrams and bigrams 
vectorizer = CountVectorizer(token_pattern=r'(?u)[\<[\/]*]?\b\w+\b[\>*]?', ngram_range=(1, 2))
#dtm with 1 document
dtm = vectorizer.fit_transform(text_all2) #class 'scipy.sparse.csr.csr_matrix'
print ('DTM type ', type(dtm))

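#note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out(), which returns an ndarray
vocab = vectorizer.get_feature_names() #class list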
vocab_total=len(vocab)
print('Successful vocab - with bigrams: ', vocab_total)

vocab #class list

DTM type  <class 'scipy.sparse.csr.csr_matrix'>
Successful vocab - with bigrams:  999


['</s>',
 '</s> <s>',
 '<s>',
 '<s> and',
 '<s> beauty',
 '<s> but',
 '<s> charlevoix',
 '<s> dammnnnnn',
 '<s> desk',
 '<s> first',
 '<s> ford',
 '<s> he',
 '<s> how',
 '<s> i',
 '<s> im',
 '<s> its',
 '<s> looking',
 '<s> more',
 '<s> packing',
 '<s> rt',
 '<s> so',
 '<s> such',
 '<s> the',
 '<s> there',
 '<s> theyve',
 '<s> tommorows',
 '<s> watch',
 '<s> we',
 '<s> when',
 '<s> words',
 '<s> wsus',
 'a',
 'a catch',
 'a certain',
 'a complete',
 'a conservative',
 'a contract',
 'a few',
 'a great',
 'a hot',
 'a laws',
 'a long',
 'a lot',
 'a movie',
 'a new',
 'a quick',
 'a separate',
 'a total',
 'absolutely',
 'absolutely necessary',
 'accessing',
 'accessing inappropriate',
 'according',
 'according to',
 'accountability',
 'accountability saying',
 'action',
 'action committee',
 'again',
 'again in',
 'age',
 'age workers',
 'ago',
 'ago when',
 'alaimo',
 'alaimo group',
 'alchemy',
 'alchemy office',
 'all',
 'all i',
 'all set',
 'alone',
 'alone apparently',
 'always',

In [138]:
dtm

<1x999 sparse matrix of type '<class 'numpy.int64'>'
	with 999 stored elements in Compressed Sparse Row format>

In [139]:
#query dtm - only works with 1 row;
#if multiple rows, there's no instance of '</s> <s>'
print ('DTM type ', type(dtm))
ngram_value = '</s> <s>'
#ngram_value = 'am sam'
ngram_idx = list(vocab).index(ngram_value)
print ('Query dtm: how many times an n-gram occurs in the text')
dtm[0,ngram_idx]

DTM type  <class 'scipy.sparse.csr.csr_matrix'>
Query dtm: how many times an n-gram occurs in the text


32
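
The reason for keeping all text in a single document: n-grams are built within each document, so a bigram that spans two sentences can only be counted when both sentences sit in the same document. A rough sketch with a hypothetical `bigram_counts` helper based on `collections.Counter` (not CountVectorizer itself):

```python
from collections import Counter

def bigram_counts(docs):
    """Count bigrams within each document; pairs never span two documents."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Toy lines, made up for illustration
lines = ["<s> hello there </s>", "<s> good morning </s>"]

per_line = bigram_counts(lines)               # each line is its own document
combined = bigram_counts([' '.join(lines)])   # one document holding all lines

print(per_line[('</s>', '<s>')], combined[('</s>', '<s>')])  # 0 1
```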

#### From sparse matrix into NumPy array  
NumPy arrays support a greater variety of operations than a list

In [140]:
#convert from current format, sparse matrix, into a normal numpy array 
print ('DTM type before: ', type(dtm))
dtm = dtm.toarray()
print ('DTM type after', type(dtm))
dtm

DTM type before:  <class 'scipy.sparse.csr.csr_matrix'>
DTM type after <class 'numpy.ndarray'>


array([[33, 32, 33,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  2,  1,
         1,  1,  1,  1,  1,  1,  4,  1,  1,  1,  1,  1,  1,  1,  1, 17,
         1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  2,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  8,  1,  1,  1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  4,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  2,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  3,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  1, 

In [141]:
#convert python list storing vocab into numpy array
vocab = np.array(vocab)
vocab

array(['</s>', '</s> <s>', '<s>', '<s> and', '<s> beauty', '<s> but',
       '<s> charlevoix', '<s> dammnnnnn', '<s> desk', '<s> first',
       '<s> ford', '<s> he', '<s> how', '<s> i', '<s> im', '<s> its',
       '<s> looking', '<s> more', '<s> packing', '<s> rt', '<s> so',
       '<s> such', '<s> the', '<s> there', '<s> theyve', '<s> tommorows',
       '<s> watch', '<s> we', '<s> when', '<s> words', '<s> wsus', 'a',
       'a catch', 'a certain', 'a complete', 'a conservative',
       'a contract', 'a few', 'a great', 'a hot', 'a laws', 'a long',
       'a lot', 'a movie', 'a new', 'a quick', 'a separate', 'a total',
       'absolutely', 'absolutely necessary', 'accessing',
       'accessing inappropriate', 'according', 'according to',
       'accountability', 'accountability saying', 'action',
       'action committee', 'again', 'again in', 'age', 'age workers',
       'ago', 'ago when', 'alaimo', 'alaimo group', 'alchemy',
       'alchemy office', 'all', 'all i', 'all set', 'alone'

In [142]:
#query dtm
ngram_idx = list(vocab).index(ngram_value)
dtm[0,ngram_idx]

32

#### Using NumPy indexing is more natural

In [143]:
dtm[0,vocab == ngram_value]

array([32], dtype=int64)
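
A side-by-side sketch of the two lookup styles on toy data (the vocabulary and counts below are made-up values, not the notebook's `dtm`). One difference worth knowing: `list.index` raises `ValueError` for a missing n-gram, while the boolean mask simply returns an empty array.

```python
import numpy as np

# Toy vocabulary and counts (made-up values for illustration)
vocab = np.array(['</s>', '</s> <s>', '<s>', '<s> i', 'i'])
dtm = np.array([[3, 2, 3, 1, 2]])

ngram_value = '</s> <s>'

# List-style lookup: find the position first, then index with it
idx = list(vocab).index(ngram_value)
by_index = dtm[0, idx]                    # a scalar

# Boolean mask: compare the whole vocab array at once
by_mask = dtm[0, vocab == ngram_value]    # a 1-element array

# A missing n-gram gives an empty array instead of an exception
missing = dtm[0, vocab == 'not there']

print(by_index, by_mask, missing.size)
```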

#### Print frequency counts (aka dtm)

In [170]:
#print dtm frequency counts
df = pd.DataFrame(dtm,columns = vocab)
df

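| | </s> | </s> <s> | <s> | <s> and | <s> beauty | <s> but | <s> charlevoix | <s> dammnnnnn | <s> desk | <s> first | ... | you btw | you gonna | you meet | youll | youll know | youll smile | your | your eyes | your heart | your mailbox |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | 32 | 33 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 2 | 1 | 1 | 3 | 1 | 1 | 1 |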


### Calculate some bigram probabilities from this corpus - manually
At first I didn't think this was possible yet and assumed I had to build a unigram-to-bigram matrix first.  

Wrong.  The counts in dtm are all that's needed at this point.  Looking them up this way is probably not efficient, but it is sufficient.  Worry about efficiency after the statistics are figured out.
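
The estimate behind these manual calculations is the usual ratio of counts (maximum likelihood, no smoothing). As a worked example, the counts in dtm above give 33 for `<s>` and 2 for `<s> i`:

```latex
P(w_2 \mid w_1) = \frac{C(w_1\, w_2)}{C(w_1)}
\qquad\text{e.g.}\qquad
P(\text{i} \mid \langle s \rangle) = \frac{C(\langle s \rangle\ \text{i})}{C(\langle s \rangle)} = \frac{2}{33} \approx 0.0606
```

Here $C(\cdot)$ is the frequency count of a unigram or bigram in the corpus.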


In [147]:
#P(I|<s>) = ?


In [145]:
#calculate some bigram probabilities - 
#P(I|<s>) = ? 

n_gram_of = '<s> i'
n_gram_given = '<s>'
p_bigram = dtm[0,list(vocab).index(n_gram_of)] / dtm[0,list(vocab).index(n_gram_given)]

p_bigram

0.06060606060606061

In [149]:
#P(get|I) = ?

n_gram_of = 'i get'
n_gram_given = 'i'
p_bigram = dtm[0,list(vocab).index(n_gram_of)] / dtm[0,list(vocab).index(n_gram_given)]

p_bigram

0.16666666666666666

In [155]:
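#P(no|i)  = ?
#the denominator must be the count of the given word 'i';
#dividing by the count of 'no' instead gives 0.5, which is not P(no|i)

n_gram_of = 'i no'
n_gram_given = 'i'
p_bigram = dtm[0,list(vocab).index(n_gram_of)] / dtm[0,list(vocab).index(n_gram_given)]

p_bigram 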

In [156]:
#P(<s>|</s>)  = ?

n_gram_of = '</s> <s>'
n_gram_given = '</s>'
p_bigram = dtm[0,list(vocab).index(n_gram_of)] / dtm[0,list(vocab).index(n_gram_given)]

p_bigram 

0.9696969696969697

### Calculate specific bigram probabilities - vectorized
No need to build the full unigram x unigram matrix of bigram probabilities.  
For a given start token, only one row of that matrix has to be computed.

In [174]:
#compute prob of all bigrams 
start_token = 'your'
#Q: given 'your', what are all bigrams and their probabilites?
#P([bigrams]|your)  = [p1,p2,..., pn] 

#work with all vocab
#print(vocab)

#--check if vocab contains a unigram token and if yes, where
print (start_token in vocab) #True
#list(vocab).index(start_token)

#--check if vocab contains bigram token(s) starting with start_token
#for term in vocab:
#    print (term.startswith(start_token))

#--check if vocab contains tokens starting with 'your' and if yes, where
#no loop, check if vocab contains bigram token(s) starting with start_token
#refer to https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.char.html
#https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.core.defchararray.startswith.html#numpy.core.defchararray.startswith
#np.char is the public alias of np.core.defchararray
np.char.startswith(vocab, start_token) #returns boolean array

print('Found bigrams with: ', start_token)
#trailing space keeps bigrams only: drops the unigram itself and longer words sharing the prefix
bigrams_idx = np.where(np.char.startswith(vocab, start_token + ' '))
bigrams_idx[0].tolist()
bigrams_idx[0].tolist()

True
Found bigrams with:  your


[996, 997, 998]

In [175]:
#$actodo not used - keep for now
#vocab[np.where(np.core.defchararray.startswith(vocab, start_token + ' '))]
#np.where(np.core.defchararray.startswith(vocab, start_token + ' '),0,-1)


In [176]:
#calc prob of relevant tokens: count(bigram) / count(start_token)
#dividing by vocab_total would give the share of the vocabulary, not P(bigram|start_token)
start_count = dtm[0, vocab == start_token][0]
bigrams_prob = dtm[0, bigrams_idx[0].tolist()]/start_count

print("Bigrams and their probabilities: ")
print(vocab[bigrams_idx])
bigrams_prob.tolist()


Bigrams and their probabilities: 
['your eyes' 'your heart' 'your mailbox']


[0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
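
The steps in this section can be wrapped in one function that computes a single row of conditional probabilities on demand. This is a sketch under assumptions: `next_word_probs` is a hypothetical helper name, the toy `vocab` and `dtm` are made-up values, and the vocabulary is assumed to hold only unigrams and bigrams.

```python
import numpy as np

def next_word_probs(dtm, vocab, start_token):
    """Return (bigrams, probs) for all bigrams that begin with start_token."""
    # Trailing space: match 'your eyes' but not 'your' itself or 'yours ...'
    mask = np.char.startswith(vocab, start_token + ' ')
    start_count = dtm[0, vocab == start_token]
    if start_count.size == 0:
        # Unseen start token: nothing to predict from
        return vocab[mask], np.array([])
    # Conditional probability: count(bigram) / count(start_token)
    return vocab[mask], dtm[0, mask] / start_count[0]

# Toy data (made-up counts for illustration)
vocab = np.array(['you', 'you btw', 'your', 'your eyes', 'your heart'])
dtm = np.array([[1, 1, 2, 1, 1]])

bigrams, probs = next_word_probs(dtm, vocab, 'your')
print(bigrams.tolist(), probs.tolist())
```

Sorting `probs` in descending order would then give the keyboard's ranked suggestions for the next word.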

### Xtra

#### Word Counts with CountVectorizer
https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

In [None]:
#$xtra - code snippet
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())