## My N-Gram Model
Create a smart keyboard that predicts the next word based on the previous word(s).  

Uses the scikit-learn CountVectorizer object to convert text to frequency counts.  
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [94]:
#meta 9/2/2018
#prev: based on preliminary work with 3 small corpora in predictNextWord_BigramModel.ipynb
#    2-gram model with training wheels
#    figured out 2-gram model first, according to N-gram model instructions.
#    used regex to find all n-grams with prev word == prev_token.


#->  for bigrams, only 1 row where first word = prev_token
#    for trigrams, 1+ rows where second word = prev_token
#    for n-grams, 1+ rows where word before last = prev_token

#here: 2-gram model, no training wheels
#    get data, write function, use function

In [95]:
import time
import numpy as np
import pandas as pd
import re #for regex and pattern matching
import matplotlib.pyplot as plt #for drawing plots
%matplotlib inline

#NLP libraries
from sklearn.feature_extraction.text import CountVectorizer

#not used
#from collections import Counter #for document-term counting


In [96]:
### start clock
start_time=time.time()

### 0. Load Data
The data source is 3 small text files (Twitter, news, blogs).

In [97]:
#strip punctuation
#  regex explanation:
#  refer to https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
#  [^\w\s] matches any character that is not a word character or whitespace; each match is replaced with the empty string.
#  Be careful: \w also matches the underscore, so underscores are kept.
# originally was
#   words_news = re.sub(r'[^\w\s]','',open('sampleData/en_US.news_small.txt').read().lower())
#strip digits too ([\d])
text_twitter_in = re.sub(r'[^\w\s]|[\d]','',open('sampleData/en_US.twitter_small.txt').read().lower())
text_news_in = re.sub(r'[^\w\s]|[\d]','',open('sampleData/en_US.news_small.txt').read().lower())
text_blogs_in = re.sub(r'[^\w\s]|[\d]','',open('sampleData/en_US.blogs_small.txt').read().lower())

### how long did it take to read in text?
end_time=time.time()
print('Loaded data in (minutes): ', (end_time - start_time)/ 60 )


Loaded data in (minutes):  0.0002685666084289551
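
The cleaning regex can be tried on a single string before running it on whole files. This is a minimal sketch; the helper name `strip_punct_and_digits` and the sample sentence are made up for illustration and do not appear in the notebook.

```python
import re

def strip_punct_and_digits(text):
    # Lowercase, then drop every character that is neither a word character
    # nor whitespace ([^\w\s]) and also drop digits (\d).
    return re.sub(r"[^\w\s]|\d", "", text.lower())

sample = "He wasn't home, alone... in the 1920s! snake_case stays."
cleaned = strip_punct_and_digits(sample)
print(cleaned)
```

Note that removing digits leaves a stray `s` from `1920s`, the same effect visible in the news preview (`in the s`), and that the underscore survives because `\w` matches it.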


In [98]:
#preview - notice \n markers of each new line
text_news_in


'he wasnt home alone apparently\nthe st louis plant had to close it would die of old age workers had been making cars there since the onset of mass automotive production in the s\nwsus plans quickly became a hot topic on local online sites though most people applauded plans for the new biomedical center many deplored the potential loss of the building\nthe alaimo group of mount holly was up for a contract last fall to evaluate and suggest improvements to trenton water works but campaign finance records released this week show the two employees donated a total of  to the political action committee pac partners for progress in early june partners for progress reported it gave more than  in both direct and inkind contributions to mayor tony mack in the two weeks leading up to his victory in the mayoral runoff election june \nand when its often difficult to predict a laws impact legislators should think twice before carrying any bill is it absolutely necessary is it an issue serious enough

In [99]:
#mark the beginning and end of each line with special delimiters <s> </s> 
text_twitter = re.sub(r'\n','</s> <s> ',text_twitter_in)
text_twitter = ' '.join((' <s>', text_twitter,'</s>'))

text_news = re.sub(r'\n',' </s> <s> ', text_news_in)
text_news = ' '.join(('<s>', text_news,'</s>'))

text_blogs = re.sub(r'\n',' </s> <s> ', text_blogs_in)
text_blogs = ' '.join(('<s>', text_blogs,'</s>'))
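
To see what the delimiter step produces, here is a small sketch on a two-line string. The sample text is an assumption chosen for illustration; the replacement and join mirror the news and blogs lines above.

```python
import re

# Two short "lines" separated by a newline (illustrative sample, not from the corpus files).
raw = "he wasnt home\nthe plant closed"

# Each newline becomes an end marker followed by a start marker,
# then the whole text is wrapped in one leading <s> and one trailing </s>.
marked = " ".join(("<s>", re.sub(r"\n", " </s> <s> ", raw), "</s>"))
print(marked)
```

Every original line ends up enclosed in its own `<s> ... </s>` pair, so sentence boundaries survive as ordinary tokens.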

In [100]:
#preview
#text_twitter #class string
text_news
#text_blogs


'<s> he wasnt home alone apparently </s> <s> the st louis plant had to close it would die of old age workers had been making cars there since the onset of mass automotive production in the s </s> <s> wsus plans quickly became a hot topic on local online sites though most people applauded plans for the new biomedical center many deplored the potential loss of the building </s> <s> the alaimo group of mount holly was up for a contract last fall to evaluate and suggest improvements to trenton water works but campaign finance records released this week show the two employees donated a total of  to the political action committee pac partners for progress in early june partners for progress reported it gave more than  in both direct and inkind contributions to mayor tony mack in the two weeks leading up to his victory in the mayoral runoff election june  </s> <s> and when its often difficult to predict a laws impact legislators should think twice before carrying any bill is it absolutely nec

In [101]:
#combine all words
text_all = text_twitter + ' ' + text_news + ' ' + text_blogs
text_all2 = []
text_all2.append(text_all)


In [102]:
#validate counts (len() of a string counts characters, not words)
print ('Twitter chars', len(text_twitter))
print ('News chars', len(text_news))
print ('Blogs chars', len(text_blogs))
print ('All chars', len(text_all2[0]))

text_all[-100:]

Twitter chars 1692
News chars 1901
Blogs chars 7649
All chars 11244


't than anything but after staring at it for a while and all of us cheering he started to dig in </s>'

### 1.  From Text to Tokens to Frequency Counts (bigrams)
Convert a collection of text documents to a matrix of token counts.  
Generate frequency counts for bigrams.
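
As a rough mental model of the counts produced in this step, the same tallies can be built in plain Python with `collections.Counter`. This is a stand-in sketch, not how CountVectorizer works internally, and the toy text is an assumption for illustration.

```python
from collections import Counter

# Toy text (illustrative), already wrapped in <s> ... </s> markers.
toy_text = "<s> i am sam </s> <s> sam i am </s>"
tokens = toy_text.split()

# Unigram counts: one entry per distinct token.
unigram_counts = Counter(tokens)

# Bigram counts: join each pair of adjacent tokens with a space,
# which is also how bigrams appear as vocabulary entries.
bigram_counts = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))

print(unigram_counts["sam"])
print(bigram_counts["i am"])
print(bigram_counts["</s> <s>"])
```

Because the text is treated as one long token stream, a bigram such as `</s> <s>` that spans a line boundary is counted too, just as in the document-term matrix queried further down.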

In [103]:
#need unigrams and bigrams 
n=2
vectorizer = CountVectorizer(token_pattern=r'(?u)[\<[\/]*]?\b\w+\b[\>*]?', ngram_range=(1, n))
#dtm with 1 document
dtm = vectorizer.fit_transform(text_all2) #class 'scipy.sparse.csr.csr_matrix'
print ('DTM type ', type(dtm))

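#note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use list(vectorizer.get_feature_names_out())
vocab_list = vectorizer.get_feature_names() #class list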
print('Successful vocab - with bigrams: ', len(vocab_list))

#preview vocab and verify bigrams
vocab_list[-20:] #class list

DTM type  <class 'scipy.sparse.csr.csr_matrix'>
Successful vocab - with bigrams:  2794


['you mr',
 'you need',
 'you out',
 'you produce',
 'you that',
 'you to',
 'you who',
 'you will',
 'youll',
 'youll know',
 'youll smile',
 'your',
 'your die',
 'your eyes',
 'your graduation',
 'your heart',
 'your mailbox',
 'your own',
 'yourself',
 'yourself about']

In [104]:
dtm

<1x2794 sparse matrix of type '<class 'numpy.int64'>'
	with 2794 stored elements in Compressed Sparse Row format>

In [105]:
#query dtm 
print ('DTM type ', type(dtm))
ngram_value = '</s> <s>' #'am sam'
ngram_idx = vocab_list.index(ngram_value)
print ('Query dtm: how many times an n-gram occurs in the text')
dtm[0,ngram_idx]

DTM type  <class 'scipy.sparse.csr.csr_matrix'>
Query dtm: how many times an n-gram occurs in the text


55

### 1a. From sparse matrix into NumPy array  
NumPy arrays support a wide variety of operations.

In [106]:
#convert from current format, sparse matrix, into a normal numpy array 
print ('DTM type before: ', type(dtm))
dtm = dtm.toarray()
print ('DTM type after', type(dtm))
dtm[0]

DTM type before:  <class 'scipy.sparse.csr.csr_matrix'>
DTM type after <class 'numpy.ndarray'>


array([56, 55, 56, ...,  3,  1,  1], dtype=int64)

In [107]:
#convert python list storing vocab into numpy array
vocab = np.array(vocab_list)
vocab[:10]

array(['</s>', '</s> <s>', '<s>', '<s> a', '<s> although', '<s> and',
       '<s> april', '<s> beauty', '<s> but', '<s> chad'], dtype='<U25')

In [108]:
#query dtm
ngram_idx = list(vocab).index(ngram_value)
dtm[0,ngram_idx]

55

#### Using NumPy indexing is more natural

In [109]:
dtm[0,vocab == ngram_value]

array([55], dtype=int64)

#### Print frequency counts (aka dtm)

In [110]:
dtm[0,:]

array([56, 55, 56, ...,  3,  1,  1], dtype=int64)

In [111]:
#friendly print dtm frequency counts
df = pd.DataFrame(dtm,columns = vocab)
df

Unnamed: 0,</s>,</s> <s>,<s>,<s> a,<s> although,<s> and,<s> april,<s> beauty,<s> but,<s> chad,...,youll smile,your,your die,your eyes,your graduation,your heart,your mailbox,your own,yourself,yourself about
0,56,55,56,1,1,1,1,1,1,1,...,1,8,1,1,1,1,1,3,1,1


### 2. Create Function to Predict Next Word
Calculate the probabilities of specific bigrams (using vectorized operations and list comprehensions).  

We don't need to build a full matrix of bigram-to-unigram probabilities.  
Only a subset of rows has to be computed:  
 - for bigrams, only the 1 row where the first word = prev token 
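
The quantity computed for that one row can be written as a relative frequency. The notation is introduced here only for illustration: $C(\cdot)$ is a count read from `dtm`, $w_{i-1}$ is the previous word, and $w_i$ is a candidate next word.

```latex
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})},
\qquad
\hat{w}_i = \arg\max_{w} \; P(w \mid w_{i-1})
```

The predicted word $\hat{w}_i$ is the candidate with the largest ratio. Since the denominator is the same for every candidate, this is also the candidate with the largest bigram count.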

In [112]:
#Predict next word - wrap into a function
#requires these previously declared variables:
#vocab
#vocab_list
#dtm
#n

def predictNextWord(prev_term):
    #reminders: prev_term = 'your'
    
    # get a vector (np.array) where value == prev_term
    prev_term_exists = np.where(vocab==prev_term,1,0)
    ##print("exists? ", prev_term_exists.sum()) #given our vocab can only be 1 (found) or 0 (not found)
    
    # if prev_term exists, predict next word
    if prev_term_exists.sum(): 

        ## get index of a n-gram
        ##prev_term_idx = vocab_list.index(prev_term)
        # get count of prev_term
        prev_term_count = dtm[0,vocab==prev_term]
        
        #build regex to find all n-grams with prev word == prev_term.
        #re.escape keeps regex metacharacters in prev_term from being read as pattern syntax
        this_regex = r'\b' + re.escape(prev_term) + r'\b \w+?$'
        ##print (this_regex)
        r = re.compile(this_regex)

        #  get all values from list where values match regex pattern
        eligible_ngrams_list = list(filter(r.search, vocab_list))
        print("Eligible ngram list: ", eligible_ngrams_list)
        
        # get index of each bigram
        eligible_ngrams_idx = [i for i, w in enumerate(vocab_list) if re.search(this_regex,w)]
        print("Eligible ngram index: ", eligible_ngrams_idx)
        
        # compute relative bigram probs 
        eligible_ngrams_probs = [dtm[0,i]/prev_term_count for i, w in enumerate(vocab_list) if re.search(this_regex,w)]
        print("Eligible ngram probs: ", eligible_ngrams_probs)
        
        # pick the bigram with the highest prob - only one (if more than one, pick the first one)
        best_ngram_idx = np.argmax(eligible_ngrams_probs)
        print("Best ngram index: ", best_ngram_idx)
        best_ngram_value = eligible_ngrams_list[best_ngram_idx]
        print("Best ngram value: ",best_ngram_value)
        next_word = best_ngram_value.split()[n-1]
        print("Predict next word: ", next_word)
        
        return next_word
    else:
        #if prev_term doesn't exist, deal with it later
        print ("No such term found")


### Test the function - Bigram only (with training wheels)
To do (error handling):  
handle the case where the unigram exists but no bigram starts with it.
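
One possible way to cover the case where a unigram exists but no bigram follows it is to return `None` instead of calling an arg-max on an empty list. The sketch below uses plain Python counts rather than `dtm`. The names `build_counts` and `predict_next_word_safe` and the toy text are hypothetical and not part of the notebook.

```python
from collections import Counter

def build_counts(text):
    """Count unigrams and bigrams in whitespace-tokenized text (hypothetical helper)."""
    tokens = text.split()
    counts = Counter(tokens)
    counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts

def predict_next_word_safe(prev_term, counts):
    """Return the most frequent word following prev_term, or None if there is none."""
    if counts[prev_term] == 0:
        return None  # prev_term never occurs in the text
    prefix = prev_term + " "
    # Keep bigrams that start with prev_term; skip the end-of-line marker,
    # which the regex in predictNextWord also never matches as a next word.
    followers = {
        ngram.split()[1]: count
        for ngram, count in counts.items()
        if ngram.startswith(prefix) and not ngram.endswith("</s>")
    }
    if not followers:
        return None  # unigram exists, but no usable bigram starts with it
    return max(followers, key=followers.get)

toy_counts = build_counts("<s> i am sam </s> <s> sam i am </s> <s> i am happy </s>")
print(predict_next_word_safe("i", toy_counts))
print(predict_next_word_safe("happy", toy_counts))
print(predict_next_word_safe("zebra", toy_counts))
```

Returning `None` for both failure cases keeps the caller simple. A fallback such as the most frequent unigram would be another option.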

In [113]:
#check if such word exists
test_word ='with'
print ('Term exists? ', test_word in vocab_list)

predictNextWord(test_word)

Term exists?  True
Eligible ngram list:  ['with and', 'with but', 'with circle', 'with graduation', 'with great', 'with making', 'with my', 'with no', 'with not', 'with plots', 'with promise', 'with tes', 'with the', 'with your']
Eligible ngram index:  [2691, 2692, 2693, 2694, 2695, 2696, 2697, 2698, 2699, 2700, 2701, 2702, 2703, 2704]
Eligible ngram probs:  [array([0.05555556]), array([0.05555556]), array([0.05555556]), array([0.05555556]), array([0.05555556]), array([0.05555556]), array([0.05555556]), array([0.11111111]), array([0.05555556]), array([0.05555556]), array([0.05555556]), array([0.05555556]), array([0.22222222]), array([0.05555556])]
Best ngram index:  12
Best ngram value:  with the
Predict next word:  the


'the'

#### How to read it
Given the previous word (e.g. 'your'), get all 'eligible' n-grams:  

Correct for bigrams: 
- Given the previous word 'your', get all bigrams starting with prev_token.

Correct for trigrams:
- Given the previous two words 'word_x your', get trigrams with prev_token as the word before last.  

Later for 3+ grams: 
- Given the previous n-1 words, get n-grams with prev_token as the word before last.  


Once we have found all 'eligible' n-grams, compile a matrix with the counts of the relevant frequencies.
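
The generalization to longer contexts might look like the sketch below. It is written in plain Python and assumes the whole context of n-1 words must match, not only the word before last. The names `ngram_counts` and `followers` and the toy tokens are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples of tokens) in a token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def followers(context, tokens):
    """Tally the words that follow the given context (a tuple of n-1 words)."""
    n = len(context) + 1
    tally = Counter()
    for ngram, count in ngram_counts(tokens, n).items():
        # An n-gram is eligible when everything except its last word equals the context.
        if ngram[:-1] == tuple(context):
            tally[ngram[-1]] += count
    return tally

toy_tokens = "<s> i am sam </s> <s> sam i am </s> <s> i am happy </s>".split()
print(followers(("<s>", "i"), toy_tokens))  # trigram context
print(followers(("am",), toy_tokens))       # bigram context
```

The resulting tally plays the role of the matrix of relevant counts: dividing each entry by the count of the context gives the relative probabilities.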