In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# show all outputs of a cell (e.g. when df.head() and df.tail() are in the same cell)
# default is 'last_expr'

In [3]:
import pickle
import numpy as np
import pandas as pd
import sklearn.model_selection
# pickle files should be opened in binary mode; the with-block closes the file for us
with open('/Users/Work/Desktop/Work/Projects/Springboard/Final_Files/reviews_pickle.pkl','rb') as o:
    df=pickle.load(o)
df=pd.DataFrame(df)

# Bag-of-Words Model

**Natural language processing (NLP)** is the process of transforming natural language into machine-usable data.<br>
A Bag-of-Words model consists of three steps: tokenization, transformation, and vectorization. <br>

**Tokenization** is the splitting of the text into parts, either by words, characters, pairs, groups of words, etc.  <br>

n-grams are groups of n consecutive words in a split.<br>
Unigrams, bigrams, and trigrams are one-word, two-word, and three-word groups, respectively. <br>
Unigrams are also called tokens. <br>
The corpus is the set of texts used for analysis.<br>
Start with unigrams and expand to n-grams, if necessary.

**Transformation** is the process of preparing tokens for the model.<br>  Examples include *stemming*, which strips words of their suffixes so that _jumping, jump, jumped_, etc. fall in the same category, and converting all letters to lowercase.<br>
 
After tokenizing and transforming, make a **dictionary** of unique words (when dealing with just unigrams) with keys as words and values as occurrences.  Usually not all words are used to build the model, just the top-N words with the most number of occurrences.

**Vectorization** is the process of creating a matrix of the dictionary keys (words/tokens/n-grams/features) as columns and observations as rows.  For a given row, each cell is the number of times that dictionary key appeared in that observation's text.  <br>
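The three steps above can be sketched by hand on a toy corpus. This is a minimal illustration using only the standard library; the corpus, the `tokenize` helper, and the choice of N=4 are made up for the example and are not part of this project's pipeline.

```python
import re
from collections import Counter

# Toy corpus (made up for illustration).
corpus = [
    "The coffee was great!",
    "Great service, great coffee.",
    "The line was long",
]

def tokenize(text):
    # Transformation + tokenization: lowercase, then keep runs of
    # two or more word characters (unigrams).
    return re.findall(r"\b\w\w+\b", text.lower())

# Dictionary: word -> total number of occurrences across the corpus.
occurrences = Counter(token for doc in corpus for token in tokenize(doc))

# Keep only the top-N most frequent words as features (N=4 is arbitrary here).
top_n = 4
features = sorted(word for word, _ in occurrences.most_common(top_n))

# Vectorization: one row per document, one column per feature,
# each cell holding how often that word appears in that document.
matrix = [[tokenize(doc).count(word) for word in features] for doc in corpus]

print(features)
for row in matrix:
    print(row)
```

Most cells are zero even in this tiny example, which is the sparsity that grows quickly once the vocabulary reaches thousands of words.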

**Stop words** are unimportant, filler words like "the", "is", etc that typically have high counts.  Most NLP libraries include pre-built stop word lists.  If your project has its own specific stop words that are not in these lists, select a stop-word threshold, such that any words that appear in over x% of the documents are excluded.  x is typically 90%. <br>
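A document-frequency threshold of this kind can be sketched as follows. The corpus and helper names are hypothetical, and the code only illustrates the idea of dropping words that appear in more than x% of documents; it is not how any particular library implements it.

```python
import re

# Toy corpus (made up for illustration).
corpus = [
    "the latte was hot",
    "the barista was friendly",
    "the wifi was slow",
    "the muffin tasted stale",
]

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

# Document frequency: in how many documents does each word appear at least once?
doc_freq = {}
for doc in corpus:
    for word in set(tokenize(doc)):
        doc_freq[word] = doc_freq.get(word, 0) + 1

threshold = 0.90                     # exclude words found in over 90% of documents
n_docs = float(len(corpus))
corpus_stop_words = sorted(
    word for word, df in doc_freq.items() if df / n_docs > threshold
)
print(corpus_stop_words)
```

Here "the" appears in every document and is excluded, while "was" appears in 75% of documents and is kept.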

Bag-of-words features (the transformed n-grams we selected that are dictionary keys) are called **sparse** because they have a lot of zero values across observations.  Only a small number of text features are typically found in a given text. <br>

Recommended algorithms include those that can handle sparse data (Naive Bayes) or those that can handle many low-significance features (Random Forests).

# Exploring the Bag-of-Words Model

In [69]:
Y=df.stars
f= lambda x: 1 if x>=3 else 0
#f= lambda x: 2 if x>=4 else 1 if x>=3 else 0 #if we wanted 3 categories
Y=Y.map(f) #apply the binary mapping, as in the later cells
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split\
(df.text, Y, test_size=0.30, random_state = 5)

In [39]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english', lowercase=True,token_pattern='(?u)\\b\\w\\w+\\b') 
vectorizer.fit(X_train)
X_train_count=vectorizer.transform(X_train)
pd.DataFrame(X_train_count.toarray(),columns=vectorizer.get_feature_names()).head() #create df of term frequencies

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

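| | 00 | 000 | 001 | 00am | 00pm | 01 | 01pm | 02 | 02pm | 03 | ... | étage | étoile | étudiants | étudier | été | évitez | être | ö_ö | über | überholt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |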


**fit()** learns the vocabulary dictionary of all tokens in the documents.  We fit on the train split only.  The vectorizer keeps the learned vocabulary as internal state, so there is no need to store the result of fit() in a separate object: we can just call vectorizer.fit(), then vectorizer.transform(). <br>
**transform()** converts documents to a document-term matrix (count matrix) using only the vocabulary learned with fit(), which comes from the train split alone.  Words in the test split that never appeared in the train split are ignored. <br>
**fit_transform()** performs fit() and then transform() in one call. <br>
**stop_words** - the default is None, meaning no words will be excluded <br>
**lowercase** - the default is True, so all letters are converted to lowercase before tokenizing <br>
**token_pattern='(?u)\\b\\w\\w+\\b'** - this is the default value.  It keeps tokens of two or more word characters and treats punctuation as a separator, so 'hi!' is the same as 'hi'

The **CountVectorizer()** object's **fit_transform()** method learns the dictionary of words, and then uses the words in that dictionary to create a matrix of counts for all words in the column/list/array of data that is passed to it.  The argument **stop_words**='english' removes all words in the built-in stop word list.
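To make the split between fit() and transform() concrete, here is a small pure-Python stand-in. `TinyCountVectorizer` is a hypothetical class written for this illustration, not scikit-learn's implementation; it only mimics the idea that the vocabulary is learned once from the train split and then reused.

```python
import re

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

class TinyCountVectorizer(object):
    """Hypothetical stand-in for illustration; not scikit-learn's code."""

    def fit(self, docs):
        # Learn the vocabulary: each distinct token gets a column index
        # (assigned in alphabetical order in this sketch).
        tokens = sorted(set(t for doc in docs for t in tokenize(doc)))
        self.vocabulary_ = dict((t, i) for i, t in enumerate(tokens))
        return self

    def transform(self, docs):
        # Count only tokens that are already in the learned vocabulary.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for t in tokenize(doc):
                if t in self.vocabulary_:
                    row[self.vocabulary_[t]] += 1
            rows.append(row)
        return rows

train = ["good coffee", "bad coffee, bad service"]
test = ["good tea, good service"]

vec = TinyCountVectorizer().fit(train)     # vocabulary comes from train only
train_counts = vec.transform(train)
test_counts = vec.transform(test)          # 'tea' was never seen, so it is dropped

print(sorted(vec.vocabulary_.items()))
print(train_counts)
print(test_counts)
```

Note that the values stored in the vocabulary are column positions in the count matrix; the counts themselves live in the matrix rows.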

In [40]:
X_train_count.shape # rows are observations, columns are tokens

(5935, 13572)

Once the CountVectorizer() object has been fitted to the data, it holds a dictionary with words as keys and column indices in the count matrix as values (not counts).  This dictionary is stored in the **vocabulary\_** attribute of the fitted CountVectorizer().

In [41]:
vectorizer.vocabulary_.items()[:5] #items() creates a list of tuples from the dictionary

[(u'fawn', 4621),
 (u'raining', 9554),
 (u'writings', 13424),
 (u'gag', 5149),
 (u'hendertucky', 5792)]

In the fitted vocabulary, 'hi' and 'hello' map to column indices 5815 and 5778, respectively.  These are positions in the count matrix, not occurrence counts.

In [18]:
vectorizer.vocabulary_.get('hi') 
vectorizer.vocabulary_.get('hello')

5815

5778

The built-in stop-word list, as shown below, does not include 'hi' or 'hello'.  (_True_ indicates that the word could not be found in the list.)  These words happen to be in this dataset, and I would like to add them to the stop-word list (to explore how this process would work).

In [42]:
from sklearn.feature_extraction import text
stop_list=list(set(text.ENGLISH_STOP_WORDS))
print stop_list[:5] 

['all', 'show', 'anyway', 'fifty', 'four']


In [43]:
try:
    (stop_list.index('hi')) 
except ValueError:
    True
try:
    (stop_list.index('hello')) 
except ValueError:
    True

True

True

In [44]:
stop_list=text.ENGLISH_STOP_WORDS.union(['hi','hello'])

It is common practice to set a stop-word threshold of .90 so that all words that appear in over 90% of the documents are excluded from the dictionary. 

In [45]:
vectorizer = CountVectorizer(stop_words=stop_list,max_df=.90) #default value of max_df is 1.0
X_train_count = vectorizer.fit_transform(X_train)
vectorizer.vocabulary_.get('hi') is None #'hi' is no longer in the vocabulary
vectorizer.vocabulary_.get('hello') is None #'hello' is no longer in the vocabulary

True

True

# Building the Final Bag-of-Words Model

In [230]:
Y=df.stars
f= lambda x: 1 if x>=3 else 0
#f= lambda x: 2 if x>=4 else 1 if x>=3 else 0 #if we wanted 3 categories
Y=Y.map(f)
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split\
(df.text, Y, test_size=0.30, random_state = 5)

In [231]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text
stop_list=list(set(text.ENGLISH_STOP_WORDS))
vectorizer = CountVectorizer(stop_words=stop_list,max_df=.90) #default value of max_df is 1.0.  It's common to do .9.
X_train_count = vectorizer.fit_transform(X_train)
X_test_count = vectorizer.transform(X_test)

In [232]:
from sklearn.naive_bayes import MultinomialNB
import sklearn.model_selection
from sklearn import metrics
clf = MultinomialNB().fit(X_train_count, Y_train) 

# The TF-IDF model

The TF-IDF model reduces the impact of tokens that occur very frequently in a training corpus and are hence less informative, which the Bag-of-Words model does not do.  It takes the matrix of counts from the Bag-of-Words model and transforms it into a matrix of TF-IDFs.  It does this by scaling down terms that appear in many documents and scaling up terms that appear in few.  Words that had to be excluded in Bag-of-Words (stop words) receive very low weights in this model, so this model accounts for stop words without excluding them.  We can still choose to exclude them in this model though.  <br>

TF-IDF = tf x idf <br>
tf is term frequency <br>
idf is inverse document frequency <br>

The formula is the number of times the word appears in a document, scaled by how infrequent the word is across the documents of the corpus: the fewer documents it appears in, the greater the scalar.
If a rare word appears 1 time in a document, it will have a much higher tf-idf than an extremely common word that appears 1 time in a document.

# Using sklearn.feature_extraction.text.TfidfTransformer

For *TfidfTransformer*, the default formula is tf-idf = (tf x idf), where tf is the count of the term in the document.  With the default **norm='l2'**, each document's row of tf-idf values is then scaled to unit length. <br>
By default, **sublinear_tf=False**. When True, tf is replaced with 1+log(tf), meaning sublinear tf scaling is used.  This addresses the unreasonable assumption that 20 occurrences of a term in a document carry twenty times the significance of a single occurrence, which is what simply using a count implies.  Taking the log of the count is one remedy for this assumption.   <br>
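A quick numeric check of what sublinear scaling does to the "20 occurrences" case. This is a small sketch with a hypothetical helper name, assuming the natural log.

```python
import math

def sublinear_tf(count):
    # 1 + log(tf) for positive counts; a zero count stays zero.
    return 1.0 + math.log(count) if count > 0 else 0.0

raw_ratio = 20 / 1.0                                # raw counts: 20x the weight
scaled_ratio = sublinear_tf(20) / sublinear_tf(1)   # roughly 4x after scaling

print("raw: %.1f, sublinear: %.2f" % (raw_ratio, scaled_ratio))
```

A term that appears 20 times ends up weighted about four times as heavily as one that appears once, rather than twenty times.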


By default, **smooth_idf=True**, which adds 1 to both the numerator and the denominator inside the log term: <br>
**idf = log[(1+n)/(1+df)] + 1** <br>
**=log((1+docs)/(1+docs with term)) + 1** <br>

n is the total number of documents (observations) <br>
df is the document frequency, the number of documents that contain the term t <br>

The 1 within the log term prevents division by zero when a term appears in none of the documents (df=0).  The 1 outside the log term keeps terms that appear in all documents from being ignored entirely.  Without the 1 added, idf=log(n/n)=0 and then (tf)(idf) = (tf)(0) = 0.  With the 1, idf=0+1=1 and then (tf)(1) = tf.

When **smooth_idf=False**, the 1 is removed from within the log term<br>
**idf = log(n/df) +1** <br>
**=log(docs / docs with term) + 1** <br>
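The two idf formulas above can be written out directly. This is a sketch with hypothetical function names and a made-up corpus size of 1000 documents, using the natural log.

```python
import math

def idf_smooth(n, df):
    # smooth_idf=True: idf = log((1 + n) / (1 + df)) + 1
    return math.log((1.0 + n) / (1.0 + df)) + 1.0

def idf_plain(n, df):
    # smooth_idf=False: idf = log(n / df) + 1 (undefined when df == 0)
    return math.log(float(n) / df) + 1.0

n = 1000  # hypothetical number of documents

everywhere = idf_smooth(n, n)   # term in every document: log(1) + 1 = 1
common = idf_smooth(n, 900)     # term in 90% of documents
rare = idf_smooth(n, 5)         # term in only 5 documents
unseen = idf_smooth(n, 0)       # df = 0 is still finite thanks to smoothing

# With tf = 1 in both cases, the rare word gets the larger tf-idf.
tfidf_common = 1 * common
tfidf_rare = 1 * rare
print("common: %.3f, rare: %.3f" % (tfidf_common, tfidf_rare))
```

Calling `idf_plain(n, 0)` raises a ZeroDivisionError, which is the case the smoothing term guards against.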

In [224]:
Y=df.stars
f= lambda x: 1 if x>=3 else 0
#f= lambda x: 2 if x>=4 else 1 if x>=3 else 0 #if we wanted 3 categories
Y=Y.map(f)
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split\
(df.text, Y, test_size=0.30, random_state = 5)

In [225]:
#Building the count matrices
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english',max_df=.90) 
X_train_count = vectorizer.fit_transform(X_train) 
X_test_count = vectorizer.transform(X_test) 

In [226]:
#Building the tf-idf matrices
transformer=TfidfTransformer(smooth_idf=True, sublinear_tf=False) #transformer.idf_==None returns true at this point
X_train_tfidf=transformer.fit_transform(X_train_count) #fit() and transform().  transformer.idf_ now returns idfs
X_test_tfidf=transformer.transform(X_test_count)

**transformer.fit(X_train_count)** learns the idf vector from the train split, where the idf vector holds the weights applied to the term frequencies in the tf-idf model.
A fitted TfidfTransformer has an internal state: after transformer.fit(X\_train\_count), the idfs computed from the count matrix are stored on that same transformer object, and its idf\_ attribute returns them.  Calling TfidfTransformer() again would create a new, unfitted object. <br>
**transformer.idf\_** returns the idfs learned by fitting the transformer on the train split<br>

In [81]:
transformer.idf_

array([ 5.75696515,  8.5901785 ,  8.9956436 , ...,  8.9956436 ,
        8.9956436 ,  8.9956436 ])
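As a sanity check, the largest values in this array can be reproduced from the smoothed idf formula, assuming the natural log and taking n=5935 from the shape of the training count matrix reported earlier. The helper name is hypothetical.

```python
import math

n_train = 5935  # rows of X_train_count reported earlier: (5935, 13572)

def idf_smooth(n, df):
    # idf = log((1 + n) / (1 + df)) + 1
    return math.log((1.0 + n) / (1.0 + df)) + 1.0

idf_df1 = idf_smooth(n_train, 1)   # term found in a single training document
idf_df2 = idf_smooth(n_train, 2)   # term found in two training documents
print("%.7f %.7f" % (idf_df1, idf_df2))
```

These agree with the 8.9956436 and 8.5901785 entries shown above, which suggests those columns belong to terms that occur in one and two training documents respectively.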

In [227]:
X_train_tfidf.toarray()[0] #array of tf-idf's for the first document
#contains an array of TF-IDFs for each document, where each array has the TF-IDF for each term

array([ 0.,  0.,  0., ...,  0.,  0.,  0.])

In [228]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
clf = MultinomialNB().fit(X_train_tfidf, Y_train)
Y_pred=clf.predict(X_test_tfidf)
from sklearn.metrics import accuracy_score, precision_score, recall_score,  f1_score
print 'Accuracy score: %r' %round((accuracy_score(Y_test, Y_pred)),3)
print 'Precision score: %r' %round((precision_score(Y_test, Y_pred)),3)
print 'Sensitivity score: %r' %round((recall_score(Y_test, Y_pred)),3)
print 'F1 score: %r' %round((f1_score(Y_test, Y_pred)),3)

Accuracy score: 0.783
Precision score: 0.762
Sensitivity score: 0.985
F1 score: 0.859


# Using sklearn.feature_extraction.text.TfidfVectorizer

Converts a collection of raw documents into a matrix of TF-IDF features.  Combines CountVectorizer and TfidfTransformer into a single model, so that you can use the original data as an input (unlike TfidfTransformer, which requires the matrix of counts that is outputted from CountVectorizer()).

In [234]:
Y=df.stars
f= lambda x: 1 if x>=3 else 0
#f= lambda x: 2 if x>=4 else 1 if x>=3 else 0 #if we wanted 3 categories
Y=Y.map(f)
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split\
(df.text, Y, test_size=0.30, random_state = 5)

In [235]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=.95, smooth_idf=True,sublinear_tf=False,stop_words='english')
X_train_tfidf=vectorizer.fit_transform(X_train)
X_test_tfidf=vectorizer.transform(X_test)

In [236]:
clf = MultinomialNB().fit(X_train_tfidf, Y_train)
Y_pred=clf.predict(X_test_tfidf)
from sklearn.metrics import accuracy_score, precision_score, recall_score,  f1_score
print 'Accuracy score: %r' %round((accuracy_score(Y_test, Y_pred)),3)
print 'Precision score: %r' %round((precision_score(Y_test, Y_pred)),3)
print 'Sensitivity score: %r' %round((recall_score(Y_test, Y_pred)),3)
print 'F1 score: %r' %round((f1_score(Y_test, Y_pred)),3)

Accuracy score: 0.783
Precision score: 0.762
Sensitivity score: 0.985
F1 score: 0.859


# Exploring the Naive Bayes Model by starting with Bayes Theorem

Bayes Theorem allows us to calculate the probability of an event occurring, given that another event has occurred.  This unknown probability is called the **posterior probability**.  It is calculated from probabilities that we already know and that do not reflect the fact that the other event has occurred, called **prior probabilities**.  These are often referred to simply as priors and posteriors.<br>  

Another definition: The probability of an event when we have knowledge of conditions that might be related to the event.  https://en.wikipedia.org/wiki/Bayes'_theorem


In [306]:
'''Bayes Theorem formula - my favorite formula (more formulas shown below):
                      P(evidence|outcome) * P(outcome)
P(outcome|evidence) = ________________________________
                                P(evidence)
'''   ;

Calculating the probability that a sample is actually positive, given a positive prediction for that sample. <br>
P = Positive Sample (Y=1)<br>
N = Negative Sample (Y=0) <br>
TP = True Positive = positive prediction is correct (Y_pred=1 and Y=1)<br>
Pos = Positive Prediction (Y_pred=1)<br>
Sensitivity is the proportion of positives that are predicted as positive.  <br>
Specificity is the proportion of negatives that are predicted as negative.  <br>
1-Specificity is the proportion of negatives that are predicted as positive. <br>
<font size=+1>$Posterior\_Probability = \frac{Conditional\_Probability \times Prior\_Probability}{Evidence(=Prior\_Probability)} $ <br>                                       
<font size=+1>$P(P | Pos) = \frac{P(Pos|P) P(P)}{P(Pos)} $
<br>
$P(Y=1 | Y\_pred=1) = \frac{P(Y\_pred=1|Y=1) P(Y=1)}{P(Y\_pred=1)} $ <br>
<br>$= \frac{P(TP)P(P)}{P(TP)P(P) + P(FP) P(N)} $<br>
<br>$= \frac{Sensitivity \cdot P(P)}{Sensitivity \cdot P(P) + (1 - Specificity) P(N)} $<br>


<font size=+1>$P(Pos)$<br>
$= P(Pos|P)P(P)+P(Pos|N)P(N)$<br>
$= P(TP)P(P)+P(FP)P(N)$<br>
$= Sensitivity \cdot P(P) + (1 - Specificity) P(N) $ <br>

$P(P)+P(N)=1$

Bayes Theorem says that the likelihood of an actual positive given a positive prediction equals: <br> (the likelihood of a positive prediction, given an actual positive) x (the likelihood of an actual positive) / (the likelihood of a positive prediction) <br>

P(Pos|P) is the probability of a positive prediction given the sample is actually positive.  <br>This is what the formulas above write as P(TP); it equals the sensitivity.  Likewise, P(FP) stands for P(Pos|N), which equals 1 - Specificity.
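The formula above translates into a few lines of code. This is a minimal sketch; the function name and the example numbers (90% sensitivity, 80% specificity, 30% positives) are hypothetical and chosen only to show the arithmetic.

```python
def posterior_positive(sensitivity, specificity, prior_positive):
    """P(actual positive | predicted positive) via Bayes Theorem."""
    prior_negative = 1.0 - prior_positive            # P(P) + P(N) = 1
    # P(Pos) = Sensitivity * P(P) + (1 - Specificity) * P(N)
    evidence = sensitivity * prior_positive + (1.0 - specificity) * prior_negative
    return sensitivity * prior_positive / evidence

# Hypothetical classifier: 0.27 / (0.27 + 0.14)
example = posterior_positive(0.90, 0.80, 0.30)
print("P(P | Pos) = %.3f" % example)
```

Even with 90% sensitivity, roughly a third of the positive predictions in this made-up case are wrong, because negatives are more common than positives and 20% of them are predicted positive.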

# Bayes Theorem applied to the Starbucks text review TF-IDF model results

In [259]:
import itertools
predictions=pd.DataFrame(zip(Y_pred,Y_test),columns=['Y-hat','Y'])
total=len(Y_test)
pos=sum(Y_pred)
p=sum(Y_test)
n=len(list(itertools.ifilter(lambda x: x==0, Y_test)))
pred_for_p=list(itertools.compress(Y_pred,Y_test)) #values of Y_pred only where Y_test is true (=1)
pred_for_n=list(itertools.compress(Y_pred,Y_test==False)) #values of Y_pred only where Y_test is 0
tp=sum(pred_for_p) #actual positives predicted as 1
fp=len(list(itertools.ifilter(lambda x: x==1, pred_for_n))) #actual negatives predicted as 1
fn=len(list(itertools.ifilter(lambda x: x==0, pred_for_p))) #actual positives predicted as 0
tn=len(list(itertools.ifilter(lambda x: x==0, pred_for_n))) #actual negatives predicted as 0
sensitivity=tp/float(p)
specificity=tn/float(n) #proportion of negatives predicted as negative
prob_p=p/float(total)
prob_n=n/float(total)
prob_pos=sensitivity*prob_p+(1-specificity)*prob_n
posterior=sensitivity*prob_p/prob_pos
print 'Test split statistics'
print 'The posterior probability of Y=1, given Y-hat=1:      %r' %(round(posterior,3))
print 'The prior probability of Y-hat=1:                     %r' %(round(prob_pos,3))
print 'Accuracy (same as above, but calculated differently)  %r' %(round((tn+tp)/float(total),3))
print 'The prior probability of Y=1:                         %r' %(round(prob_p,3))
print 'Sensitivity                                           %r' %(round(sensitivity,3))
print 'Specificity                                           %r' %(round(specificity,3))

Test split statistics
The posterior probability of Y=1, given Y-hat=1:      0.762
The prior probability of Y-hat=1:                     0.871
Accuracy (same as above, but calculated differently)  0.783
The prior probability of Y=1:                         0.674
Sensitivity                                           0.985
Specificity                                           0.365


In [260]:
tn,fp,fn,tp

(303, 527, 26, 1688)

In [257]:
sklearn.metrics.confusion_matrix(Y_test,Y_pred)

array([[ 303,  527],
       [  26, 1688]])
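As a cross-check, the same quantities can be computed from the four confusion-matrix counts alone. The sketch below copies the counts from the output above (rows are actual classes, columns are predicted classes) and compares the Bayes Theorem route with the direct one.

```python
# Counts copied from the confusion matrix above.
tn, fp, fn, tp = 303, 527, 26, 1688
total = float(tn + fp + fn + tp)

sensitivity = tp / float(tp + fn)     # P(Pos | P)
specificity = tn / float(tn + fp)     # P(Neg | N)
prob_p = (tp + fn) / total            # P(P)
prob_n = (tn + fp) / total            # P(N)

# Bayes Theorem route
prob_pos = sensitivity * prob_p + (1 - specificity) * prob_n
posterior = sensitivity * prob_p / prob_pos

# Direct route: share of positive predictions that are correct (precision)
precision = tp / float(tp + fp)

print("posterior: %.3f, precision: %.3f" % (posterior, precision))
```

Both routes give the same number, which is the precision score of 0.762 reported for this model: P(Y=1 | Y-hat=1) and precision are the same quantity.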

In [268]:
#np.where(predictions.Y==0,'yes',0)
#C = np.where(cond, A, B)
#defines C to be equal to A where cond is True, and B where cond is False.\

# Naive Bayes Model

If we want to predict the probability of a categorical target taking on a certain value, given the values of multiple features, then we can use the Naive Bayes model.  This model is built using Bayes Theorem, with the added assumption that features are **conditionally independent** given the target.  It does not assume that features are (unconditionally) **independent**, and independence of features does not necessarily imply conditional independence of features.  This conditional independence assumption, a simplification, is why we call this model a "Naive" Bayes model.  It makes the formula simpler to calculate, which helps because there are usually multiple pieces of evidence.  While this assumption often DOES NOT hold, the model is still known for performing well, sometimes better than more sophisticated models!   <br>

1. This model estimates prior and conditional probabilities from the training data: P(target y=k) and P(feature j=c | target y=k).  
2. It calculates the probability of the target taking value k, given the evidence of the observation (certain feature values) by using Bayes Theorem and assuming features are conditionally independent.  
3. It selects the target value that produces the maximum conditional probability.  

Thus, for a given observation, this model predicts the target value that has the maximum conditional probability of occurring, given the evidence (feature values) for that observation. 
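The three steps can be sketched from scratch on a toy dataset with categorical features. Everything here (the data, feature names, and helper functions) is made up for illustration, and no smoothing is applied, so this is not equivalent to scikit-learn's MultinomialNB.

```python
from collections import Counter, defaultdict

# Toy training data (made up): each row is (features, target).
train = [
    ({"word": "great", "length": "short"}, 1),
    ({"word": "great", "length": "long"}, 1),
    ({"word": "awful", "length": "long"}, 0),
    ({"word": "great", "length": "long"}, 0),
]

# Step 1: estimate P(y=k) and P(feature j=c | y=k) from counts.
class_counts = Counter(y for _, y in train)
feature_counts = defaultdict(Counter)   # (feature name, class) -> value counts
for features, y in train:
    for name, value in features.items():
        feature_counts[(name, y)][value] += 1

def prior(k):
    return class_counts[k] / float(len(train))

def conditional(name, value, k):
    return feature_counts[(name, k)][value] / float(class_counts[k])

def predict(features):
    # Step 2: P(x1|y)...P(xj|y)P(y) for each class. The shared denominator
    # P(x1,...,xj) does not change which class wins, so it is skipped.
    scores = {}
    for k in class_counts:
        score = prior(k)
        for name, value in features.items():
            score *= conditional(name, value, k)
        scores[k] = score
    # Step 3: pick the class with the largest score.
    return max(scores, key=scores.get), scores

label, scores = predict({"word": "great", "length": "short"})
print(label)
print(scores)
```

Class 0 scores exactly zero here because "short" never occurs with it in the training data; additive smoothing is the usual remedy for such zero counts.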

Thus,
<font size=+2>$P(y|  x1,x2,...,xj) = \frac{P(x1|y)P(x2|y)...P(xj|y)P(y)}{P(x1,x2,...,xj)}$ <font size=+0><br>
where y is the categorical target taking a particular value and x1,...,xj are the features, assumed conditionally independent given y.

https://en.wikipedia.org/wiki/Naive_Bayes_classifier <br>
https://en.wikipedia.org/wiki/Conditional_independence<br>
http://stackoverflow.com/questions/10059594/a-simple-explanation-of-naive-bayes-classification<br>
http://www.saedsayad.com/naive_bayesian.htm<br>

The **chain rule** in probability theory allows joint probabilities to be calculated using conditional probabilities:<br>  P(x1,x2,x3)=P(x3|x2,x1)P(x2,x1) <br>
=P(x3|x2,x1)P(x2|x1)P(x1) <br>

The **conditional independence** assumption is that the knowledge/occurrence of x1 (x2) does not affect the likelihood of x2 (x1), given that Y has occurred.   Conditional independence does not imply independence.  Independence does not imply conditional independence.  (source: https://en.wikipedia.org/wiki/Conditional_independence)<br>
P(x2|x1,Y) = P(x2|Y) <br>
Equivalently and invoking the chain rule, <br>
P(x2,x1|Y) = P(x2|Y)P(x1|Y)
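A small numeric example shows that conditional independence does not imply independence. The setup is hypothetical: Y is a fair coin, and given Y the two binary features are independent, each equal to 1 with probability 0.9 when Y=1 and 0.1 when Y=0.

```python
# Hypothetical probabilities, chosen only for illustration.
p_y = {0: 0.5, 1: 0.5}
p_x_given_y = {0: 0.1, 1: 0.9}    # P(x=1 | Y=y), the same for x1 and x2

# Conditional independence holds by construction:
# P(x1=1, x2=1 | Y=y) = P(x1=1 | Y=y) * P(x2=1 | Y=y)
joint_given_y = dict((y, p_x_given_y[y] ** 2) for y in p_y)

# Marginals, obtained by summing over Y.
p_x1 = sum(p_x_given_y[y] * p_y[y] for y in p_y)         # P(x1=1) = P(x2=1)
p_x1_x2 = sum(joint_given_y[y] * p_y[y] for y in p_y)    # P(x1=1, x2=1)

# Independence would require P(x1=1, x2=1) == P(x1=1) * P(x2=1).
independent = abs(p_x1_x2 - p_x1 * p_x1) < 1e-12
print("P(x1,x2) = %.2f, P(x1)P(x2) = %.2f" % (p_x1_x2, p_x1 * p_x1))
```

The joint probability is 0.41 while the product of the marginals is 0.25: once Y is unknown, seeing x1=1 makes x2=1 more likely, because both point to Y=1.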

In estimating the likelihood of the evidence (feature values) for an observation, given outcome Y, <br>
P(x1,x2|Y)=P(x1,x2,Y)/P(Y) <br>
=P(x2|x1,Y)P(x1|Y)P(Y)/P(Y) by the chain rule (notice that the P(Y)'s cancel out) <br>
=P(x2|Y)P(x1|Y) by the conditional independence assumption <br>


In [304]:
''' In other words:
P(Outcome|Multiple Evidence) = 
P(Evidence1|Outcome) * P(Evidence2|Outcome) * ... * P(EvidenceN|Outcome) * P(Outcome)
_________________________________________________________________________________________________

                                 P(Multiple Evidence)
''';

The multinomial Naive Bayes classifier is suitable for discrete (0,1,2,3) features and thus is great for word counts, though fractional counts, such as for TF-IDF may also work.

# Udacity's Machine Learning Engineer - Naive_Bayes_tutorial Problem

**source**: <br>
https://github.com/udacity/machine-learning/blob/master/projects/practice_projects/naive_bayes_tutorial/Naive_Bayes_tutorial.ipynb <br>
**forum post where I mention error:**
https://discussions.udacity.com/t/error-in-notation-naive-bayes-tutorial-ipynb-practice-project/230754

**The question: what is the probability that Jill says the words freedom and immigration?** <br>
The real question: what is the probability that the words freedom and immigration are said in a speech, given that the candidate is Jill?

Probability that Jill Stein says 'freedom': 0.1 ---------> P(F|J)<br>
Probability that Jill Stein says 'immigration': 0.1 -----> P(I|J)<br>
Probability that Jill Stein says 'environment': 0.8 -----> P(E|J)<br>
Probability that Gary Johnson says 'freedom': 0.7 -------> P(F|G)<br>
Probability that Gary Johnson says 'immigration': 0.2 ---> P(I|G)<br>
Probability that Gary Johnson says 'environment': 0.1 ---> P(E|G) <br>
P(G) = .5 <br>
P(J) = .5 <br>

P(J|F&I) = P(F&I&J) / P(F&I) <br>

**As for the numerator:** <br>
By the chain rule, a joint probability can be expressed using conditional probabilities: <br>
= P(F|I&J)P(I&J) <br>
= P(F|I&J)P(I|J)P(J) <br>
Due to the conditional independence assumption of features, P(F|I&J)=P(F|J): <br>
P(F&I&J)=P(F|J)P(I|J)P(J) <br>

**As for the denominator:** <br>
By the law of total probability (the speaker is either Jill or Gary): <br>
P(F&I)=P(F&I&J)+P(F&I&G)<br>
By the chain rule: <br>
=P(F|I&J)P(I&J)+P(F|I&G)P(I&G)<br>
=P(F|I&J)P(I|J)P(J)+P(F|I&G)P(I|G)P(G)<br>
Due to the conditional independence assumption of features: <br>
P(F&I)=P(F|J)P(I|J)P(J)+P(F|G)P(I|G)P(G)

In [280]:
numerator=.1*.1*.5
denominator=.1*.1*.5+.7*.2*.5
numerator/denominator

0.06666666666666668
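The same calculation can be generalized to both candidates and any list of words. This is a sketch using the probabilities from the problem statement above; the `posteriors` helper is a hypothetical name.

```python
# Probabilities from the problem statement above.
priors = {"Jill": 0.5, "Gary": 0.5}
p_word = {
    "Jill": {"freedom": 0.1, "immigration": 0.1, "environment": 0.8},
    "Gary": {"freedom": 0.7, "immigration": 0.2, "environment": 0.1},
}

def posteriors(words):
    # Numerator for each candidate: P(w1|c) * P(w2|c) * ... * P(c)
    joint = {}
    for candidate, prior in priors.items():
        score = prior
        for w in words:
            score *= p_word[candidate][w]
        joint[candidate] = score
    # Denominator: sum of the numerators over all candidates.
    evidence = sum(joint.values())
    return dict((c, s / evidence) for c, s in joint.items())

result = posteriors(["freedom", "immigration"])
print("Jill: %.4f, Gary: %.4f" % (result["Jill"], result["Gary"]))
```

The two posteriors sum to 1, so Gary's posterior here is the complement of Jill's.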