## My N-Gram Model
Create a smart keyboard that predicts the next word based on the previous word(s).  

Uses the scikit-learn CountVectorizer object to convert text to frequency counts.  
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [46]:
#meta 9/2/2018
#prev: based on preliminary work with 3 small corpora in predictNextWord_BigramModel.ipynb
#    2-gram model with training wheels
#    figured out 2-gram model first, according to N-gram model instructions.
#    used regex to find all n-grams with prev word == prev_token.


#->  for bigrams, only 1 row where first word = prev_token
#    for trigrams, 1+ rows where second word = prev_token
#    for n-grams, 1+ rows where word before last = prev_token

#prev: 2-gram model, no training wheels
#    get data, write function, use function

#here: testing performance of 2-gram model with bigger text
#    all twitter text (about 175M characters), not enough memory (only 4GB) for news and blogs
#    e2e takes about 35 min, function takes about 3 min to run 
#    obviously unacceptable performance
#    see Learning Notes at the end for a summary and obvious ideas for improving performance

In [2]:
import time
import numpy as np
import pandas as pd
import re #for regex and pattern matching
import matplotlib.pyplot as plt #for drawing plots
%matplotlib inline

#NLP libraries
from sklearn.feature_extraction.text import CountVectorizer

#not used
#from collections import Counter #for document-term counting


In [3]:
### start clock
start_time=time.time()

### 0. Load Data
Data source is 3 files

In [4]:
#get rid of punctuation
#  re explanation: 
#  refer to https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
#  replaces characters that are not (^) word characters or whitespace with the empty string. 
#  Be careful though: \w also matches the underscore, so underscores are kept
# originally was
#   words_news = re.sub(r'[^\w\s]','',open('sampleData/en_US.news_small.txt').read().lower())
#get rid of numbers too
text_twitter_in = re.sub(r'[^\w\s]|[\d]','',open('myData/en_US.twitter.txt').read().lower())
##text_news_in = re.sub(r'[^\w\s]|[\d]','',open('myData/en_US.news.txt').read().lower())
##text_blogs_in = re.sub(r'[^\w\s]|[\d]','',open('myData/en_US.blogs.txt').read().lower())

### how long did it take to read in text?
end_time=time.time()
print('Loaded data in ', (end_time - start_time)/ 60 )


Loaded data in  0.3536584814389547
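
To see what the cleaning regex keeps and drops, here is a small sketch on a made-up string (the sample text is an assumption, not taken from the corpus). It removes punctuation and digits but leaves underscores in place, because `\w` matches `_`.

```python
import re

# made-up sample line, for illustration only
sample = "Hello, World! It's 2018 __wow__ :)"

# same pattern as in the cell above: drop anything that is not a word
# character or whitespace, and drop digits as well
cleaned = re.sub(r'[^\w\s]|[\d]', '', sample.lower())

# repr makes the leftover whitespace visible
print(repr(cleaned))
```

Removed characters leave their surrounding spaces behind, so the cleaned text can contain runs of spaces; splitting on whitespace collapses them.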


In [5]:
#preview - notice \n markers of each new line
#text_news_in


In [6]:
#capture beginning and end of each line with special delimiters <s> </s> 
text_twitter = re.sub(r'\n','</s> <s> ',text_twitter_in)
text_twitter = ' '.join((' <s>', text_twitter,'</s>'))

"""text_news = re.sub(r'\n',' </s> <s> ', text_news_in)
text_news = ' '.join(('<s>', text_news,'</s>'))

text_blogs = re.sub(r'\n',' </s> <s> ', text_blogs_in)
text_blogs = ' '.join(('<s>', text_blogs,'</s>'))
"""


In [19]:
#preview
#text_twitter #class string
#text_news
#text_blogs
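
The sentence markers added above can be checked on a tiny made-up sample. This sketch assumes one sentence per line and uses the spaced replacement string `' </s> <s> '` from the commented-out news/blogs lines, so that a marker never touches the preceding word.

```python
import re

# made-up two-line sample; each line is treated as one sentence
sample = "i am sam\nsam i am"

# every newline closes the current sentence and opens the next one
marked = re.sub(r'\n', ' </s> <s> ', sample)

# wrap the whole text so the first and last sentences are delimited too
marked = ' '.join(('<s>', marked, '</s>'))
print(marked)
```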


In [25]:
#combine all words
text_all = text_twitter ## + ' ' + text_news + ' ' + text_blogs
text_all2 = []
text_all2.append(text_all)

In [26]:
### how long did it take to process text?
end_time=time.time()
print('Processed text in ', (end_time - start_time)/ 60 )

Processed text in  5.365266752243042


In [27]:
#validate counts (len of a string counts characters, not words)
print ('Twitter characters', len(text_twitter))
##print ('News characters', len(text_news))
##print ('Blogs characters', len(text_blogs))
print ('All characters', len(text_all2[0]))

text_all[-100:]

Twitter characters 174263060
All characters 174263060


'our woman happy attention affection treat her like a queen and sex her like a pornstar</s> <s>  </s>'

### 1.  From Text to Tokens to Frequency Counts (bigrams)
Convert a collection of text documents to a matrix of token counts.  
Generate frequency counts for bigrams.
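
As a rough illustration of what this step produces, the sketch below tokenizes a made-up sample with the same `token_pattern` used in this notebook and counts unigrams and bigrams with the standard library. It assumes that CountVectorizer applies the pattern with `findall` and joins n-gram tokens with a single space; it mimics the counts and is not a replacement for the vectorizer.

```python
import re
from collections import Counter

# token pattern copied from the CountVectorizer call in this notebook
token_pattern = re.compile(r'(?u)[\<[\/]*]?\b\w+\b[\>*]?')

# made-up sample text with sentence markers
sample = '<s> i am sam </s> <s> sam i am </s>'

# the sentence markers survive as single tokens
tokens = token_pattern.findall(sample)

# unigram and bigram counts; bigrams are joined by a single space
unigram_counts = Counter(tokens)
bigram_counts = Counter(' '.join(pair) for pair in zip(tokens, tokens[1:]))

print(tokens)
print(bigram_counts['i am'], bigram_counts['</s> <s>'])
```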

In [28]:
#need unigrams and bigrams 
n=2
vectorizer = CountVectorizer(token_pattern=r'(?u)[\<[\/]*]?\b\w+\b[\>*]?', ngram_range=(1, n))
#dtm with 1 document
dtm = vectorizer.fit_transform(text_all2) #class 'scipy.sparse.csr.csr_matrix'
print ('DTM type ', type(dtm))

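#note: get_feature_names() was removed in scikit-learn 1.2; there use list(vectorizer.get_feature_names_out())
vocab_list = vectorizer.get_feature_names() #class list
print('Successful vocab - with bigrams: ', len(vocab_list))

#preview vocab and verify bigrams
vocab_list[-20:] #class list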

DTM type  <class 'scipy.sparse.csr.csr_matrix'>
Successful vocab - with bigrams:  5960862


['ａｐｅ in',
 'ｈｅｂbut',
 'ｈｅｂbut its',
 'ｏ',
 'ｏ </s>',
 'ｏｏ',
 'ｏｏ </s>',
 'ｒｔif',
 'ｒｔif youre',
 'ｰノ',
 'ｰノ </s>',
 'ｿﾛﾗｲﾌﾞlivebar',
 'ｿﾛﾗｲﾌﾞlivebar uncle',
 'ﾉ',
 'ﾉ now',
 'ﾉ ﾉ',
 'ﾉﾉ',
 'ﾉﾉ </s>',
 '𝛑',
 '𝛑 day']

In [29]:
dtm

<1x5960862 sparse matrix of type '<class 'numpy.int64'>'
	with 5960862 stored elements in Compressed Sparse Row format>

In [30]:
#query dtm 
print ('DTM type ', type(dtm))
ngram_value = '</s> <s>' #'am sam'
ngram_idx = vocab_list.index(ngram_value)
print ('Query dtm: how many times an n-gram occurs in the text')
dtm[0,ngram_idx]

DTM type  <class 'scipy.sparse.csr.csr_matrix'>
Query dtm: how many times an n-gram occurs in the text


2360148
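
`vocab_list.index(...)` scans the list from the start on every query, which gets slow with millions of entries. A possible alternative, sketched here with a made-up miniature vocab and counts, is to build a term-to-index dictionary once and look terms up by key. A fitted CountVectorizer also exposes such a mapping as its `vocabulary_` attribute.

```python
# made-up miniature vocab and counts standing in for vocab_list and dtm[0]
vocab_list = ['</s>', '</s> <s>', '<s>', 'am', 'am sam', 'i', 'i am', 'sam']
counts = [2, 1, 2, 2, 1, 2, 2, 2]

# build the term -> column index mapping once; each lookup is then a
# dictionary access instead of a linear scan with vocab_list.index(...)
vocab_index = {term: i for i, term in enumerate(vocab_list)}

ngram_value = '</s> <s>'
ngram_idx = vocab_index[ngram_value]
print(ngram_idx, counts[ngram_idx])
```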

### 1a. From sparse matrix into NumPy array  
NumPy arrays support a great variety of operations.

In [31]:
#convert from current format, sparse matrix, into a normal numpy array 
print ('DTM type before: ', type(dtm))
dtm = dtm.toarray()
print ('DTM type after', type(dtm))
dtm[0]

DTM type before:  <class 'scipy.sparse.csr.csr_matrix'>
DTM type after <class 'numpy.ndarray'>


array([2360149, 2360148, 2360149, ...,       1,       1,       1],
      dtype=int64)

In [32]:
#convert python list storing vocab into numpy array
vocab = np.array(vocab_list)
vocab[:10]

array(['</s>', '</s> <s>', '<s>', '<s> </s>', '<s> _', '<s> __',
       '<s> ___', '<s> ____', '<s> _____', '<s> ______'], dtype='<U131')

In [39]:
### how long did it take to build?
end_time=time.time()
print('Built dtm and vocab in ', (end_time - start_time)/ 60 )

Built dtm and vocab in  34.95712695121765


In [33]:
#query dtm
ngram_idx = list(vocab).index(ngram_value)
dtm[0,ngram_idx]

2360148

#### Using NumPy indexing is more natural

In [34]:
dtm[0,vocab == ngram_value]

array([2360148], dtype=int64)

#### Print frequency counts (aka dtm)

In [35]:
dtm[0,:]

array([2360149, 2360148, 2360149, ...,       1,       1,       1],
      dtype=int64)

In [36]:
#friendly print dtm frequency counts
df = pd.DataFrame(dtm,columns = vocab)
df

Unnamed: 0,</s>,</s> <s>,<s>,<s> </s>,<s> _,<s> __,<s> ___,<s> ____,<s> _____,<s> ______,...,ｰノ </s>,ｿﾛﾗｲﾌﾞlivebar,ｿﾛﾗｲﾌﾞlivebar uncle,ﾉ,ﾉ now,ﾉ ﾉ,ﾉﾉ,ﾉﾉ </s>,𝛑,𝛑 day
0,2360149,2360148,2360149,1,200,58,42,22,12,10,...,1,1,1,2,1,1,1,1,1,1


### 2. Create Function to Predict Next Word
Calculate specific bigram probabilities (vectorized operations and list comprehensions).  

There is no need to build a full matrix of bigrams by unigrams.  
Only a subset of rows needs to be computed:  
 - for bigrams, only the 1 row where the first word = prev_token 
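
The probability behind that single row is the relative frequency P(next | prev) = count(prev next) / count(prev). A minimal sketch, assuming a made-up table of counts and a hypothetical helper `bigram_probs` (neither comes from this notebook):

```python
from collections import Counter

# made-up counts standing in for the dtm: unigrams and bigrams in one table
counts = Counter({
    'your': 4, 'your favorite': 2, 'your own': 1, 'your song': 1,
    'favorite': 2, 'favorite song': 2,
})

def bigram_probs(prev_token, counts):
    """Return {next_word: P(next_word | prev_token)} from raw counts."""
    prev_count = counts[prev_token]
    if prev_count == 0:
        return {}  # prev_token was never seen
    prefix = prev_token + ' '
    return {
        ngram[len(prefix):]: c / prev_count
        for ngram, c in counts.items()
        # keep only bigrams: exactly one word after the prefix
        if ngram.startswith(prefix) and ' ' not in ngram[len(prefix):]
    }

probs = bigram_probs('your', counts)
print(probs)
```

Only the entries that start with the previous token are touched, which is the "one row" idea described above.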

In [43]:
#Predict next word - wrap into a function
#must have prev declared variables:
#vocab
#vocab_list
#dtm

def predictNextWord(prev_term):
    #reminders: prev_term = 'your'

    # boolean mask over vocab: True where the unigram equals prev_term
    # (vocab entries are unique, so at most one match)
    prev_term_mask = (vocab == prev_term)

    # if prev_term doesn't exist, deal with it later
    if not prev_term_mask.any():
        print("No such term found")
        return None

    # get count of prev_term as a scalar
    prev_term_count = dtm[0, prev_term_mask][0]

    # build regex to find all bigrams whose first word == prev_term
    # re.escape guards against regex metacharacters in prev_term
    r = re.compile(r'^' + re.escape(prev_term) + r' \w+$')

    # single pass over the vocab: index of each eligible bigram
    eligible_ngrams_idx = [i for i, w in enumerate(vocab_list) if r.search(w)]

    # prev_term exists as a unigram but starts no bigram
    # (np.argmax would raise ValueError on an empty sequence)
    if not eligible_ngrams_idx:
        print("No bigrams found for term")
        return None

    # compute relative bigram probs: count(prev_term next) / count(prev_term)
    eligible_ngrams_probs = dtm[0, eligible_ngrams_idx] / prev_term_count

    # pick the bigram with the highest prob (on ties, np.argmax picks the first one)
    best_ngram_idx = np.argmax(eligible_ngrams_probs)
    print("Best ngram index: ", best_ngram_idx)
    best_ngram_value = vocab_list[eligible_ngrams_idx[best_ngram_idx]]
    print("Best ngram value: ", best_ngram_value)
    next_word = best_ngram_value.split()[-1]
    print("Predict next word: ", next_word)

    return next_word


### Test the function - Bigram only (with training wheels)
Error handling case to cover:  
the unigram exists but no bigram starts with it
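
The case above (a known unigram with no following bigram) matters because `np.argmax` raises a `ValueError` on an empty sequence. A minimal sketch of a guard, using a hypothetical helper `pick_next_word` that takes candidate counts as a plain dict:

```python
def pick_next_word(candidate_counts):
    """candidate_counts: {next_word: count}. Return the best word, or None if empty."""
    if not candidate_counts:
        return None  # nothing follows this word in the training text
    # on ties, max returns the first key in insertion order
    return max(candidate_counts, key=candidate_counts.get)

print(pick_next_word({'song': 2, 'color': 1}))
print(pick_next_word({}))
```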

In [45]:
start_time = time.time()
#check if such word exists
test_word ='favorite'
##print ('Term exists? ', test_word in vocab_list)

predictNextWord(test_word)

### how long did it take to run the function?
end_time=time.time()
print('Function predicted next word in ', (end_time - start_time)/ 60 )

Best ngram index:  2504
Best ngram value:  favorite song
Predict next word:  song
Function predicted next word in  2.144909930229187


#### How to read it
Given the previous word 'your', get all 'eligible' n-grams:  

Correct for bigrams: 
- Given the previous word 'your', get all bigrams starting with prev_token.

Correct for trigrams:
- Given the previous two words 'word_x your', get trigrams with prev_token as the word before last.  

Later for 3+ grams: 
- Given the previous n-1 words, get n-grams with prev_token as the word before last.  


Once all 'eligible' n-grams are found, compile a matrix with the counts of the relevant frequencies.
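
The same selection rule can be written once for any n. This is a sketch under assumptions: n-grams are stored as tuples of tokens rather than space-joined strings, and the helper names `ngram_counts` and `eligible_ngrams` are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams as tuples of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def eligible_ngrams(counts, context):
    """Return {ngram: count} for n-grams whose first n-1 tokens equal context."""
    context = tuple(context)
    return {ng: c for ng, c in counts.items() if ng[:-1] == context}

# made-up sample text with sentence markers
tokens = '<s> i am sam </s> <s> sam i am </s>'.split()

trigrams = ngram_counts(tokens, 3)
print(eligible_ngrams(trigrams, ['i', 'am']))
```

For bigrams the context is a single word, so the rule reduces to "all bigrams starting with prev_token".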

### Learning Notes
Performance issues:  
Twitter characters 174263060  
Successful vocab - with bigrams:  5960862  
Lines (counts of `</s>`, `</s> <s>`, `<s>`): 2360149, 2360148, 2360149  

Took 35 min to run e2e.  
Takes about 2.5 min to predict the next word.

Obvious issues to resolve:
- Underscores
- Unicode characters
- Use Counter instead of CountVectorizer
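
One way the Counter idea could look: index the followers of each word once, so a prediction becomes a dictionary lookup instead of a regex scan over the whole vocab. This is a sketch, not a measured improvement; the function names and sample lines are made up, and the input is assumed to be already cleaned and lowercased.

```python
from collections import Counter, defaultdict

def build_bigram_index(lines):
    """Map each word to a Counter of the words that follow it."""
    following = defaultdict(Counter)
    for line in lines:
        tokens = ['<s>'] + line.split() + ['</s>']
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following

def predict_next_word(following, prev_term):
    """Return the most frequent follower of prev_term, or None if unseen."""
    followers = following.get(prev_term)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# made-up, already-cleaned lines for illustration
lines = ['my favorite song', 'your favorite song', 'my favorite color']
index = build_bigram_index(lines)
print(predict_next_word(index, 'favorite'))
```

Processing the text line by line also avoids holding one giant string and a dense count array in memory at the same time.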



