# Naive Bayes

<img src="http://blog-assets.bigfishgames.com/uploads/2012/08/Choose-your-own-adventure.jpg"/>

Great, now that we've learned the basics of Naive Bayes, it's up to you to choose what path to take. Once you pull the repo, you should have a file called "07_insult.csv". This is a csv from a [Kaggle Insult Competition](https://www.kaggle.com/c/detecting-insults-in-social-commentary), where each row contains a comment and has been classified as an Insult or Not.


**Using this data, you have the following choices:**

* Implement your own insult/not-insult classifier using Naive Bayes. Think of this as the Spam example, but instead of Spam/Ham, the classes are now Insult/NotInsult. (Of course, this generalizes to more than two classes.) It may be easier to calculate probabilities by creating two "documents" (i.e. word arrays), one of all the insults and one of all the non-insults. Building your own version will help you appreciate what's going on under the hood of the sklearn implementations. (You can also use the Pandas aggregation features, like I did.)

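    Rather than multiplying probabilities as in $\prod{P(w|c)}P(c)$, take the log and compute $\sum{\log(P(w|c))} + \log(P(c))$. This works because log(x^a * y^b * z^c) == log(x^a) + log(y^b) + log(z^c) == a*log(x) + b*log(y) + c*log(z), and it avoids numerical underflow from multiplying many small probabilities.

    Remember to apply Laplace smoothing when estimating $P(w|c)$, so that a word never seen in a class does not get a probability of zero.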


* Learn and use SKLearn's MultinomialNB() implementation of Naive Bayes. You can also try BernoulliNB(). The multinomial model (which we learned in class) accounts for multiple occurrences of the same word in a document. The Bernoulli model only records whether a word is present in a document, not how often it occurs. Unlike the multinomial model, the Bernoulli model also explicitly penalizes the absence of a word. Multinomial models are generally better when you have many features. Bernoulli models are generally better with fewer, more 'predictive' features, but suffer when these features are noisy.

    CountVectorizer may prove helpful here (`from sklearn.feature_extraction import text; count_vectorizer = text.CountVectorizer()`). CountVectorizer transforms a corpus (set of documents) into a matrix of token counts, where each entry is the number of times a word occurs in a document.
    

* If you have used SKLearn or rolled your own Naive Bayes model, what words proved to be most related to an "insult"? i.e. If you were to choose ten words to insult someone, what ten words should they be? (Hint: Think about word frequencies)
    

* All of the above
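
For the first choice, the log-sum and Laplace smoothing steps described above can be sketched roughly as follows. This is only a minimal illustration, not a reference solution: the toy `comments` and `labels` lists are made up to stand in for the `Comment` and `Insult` columns, and the helper names (`tokenize`, `train`, `log_likelihood`, `predict`) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy data standing in for the Comment and Insult columns.
comments = [
    "you are an idiot",
    "you idiot go away",
    "thanks for the great post",
    "i really like this point",
]
labels = [1, 1, 0, 0]  # 1 = insult, 0 = not insult


def tokenize(text):
    # Deliberately simple: lowercase and split on whitespace.
    return text.lower().split()


def train(docs, doc_labels):
    # Count documents per class and word occurrences per class.
    class_doc_counts = Counter(doc_labels)
    word_counts = {c: Counter() for c in class_doc_counts}
    for doc, c in zip(docs, doc_labels):
        word_counts[c].update(tokenize(doc))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    # log P(c) for every class c
    log_prior = {c: math.log(n / len(doc_labels)) for c, n in class_doc_counts.items()}
    return log_prior, word_counts, vocab


def log_likelihood(word, c, word_counts, vocab, alpha=1.0):
    # Laplace smoothing: add alpha to every word count so that
    # log P(w|c) is defined even for words unseen in class c.
    total = sum(word_counts[c].values())
    return math.log((word_counts[c][word] + alpha) / (total + alpha * len(vocab)))


def predict(text, log_prior, word_counts, vocab):
    scores = {}
    for c in log_prior:
        # Sum of logs instead of a product of probabilities.
        score = log_prior[c]
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen during training
                score += log_likelihood(word, c, word_counts, vocab)
        scores[c] = score
    return max(scores, key=scores.get), scores


log_prior, word_counts, vocab = train(comments, labels)
prediction, scores = predict("you idiot", log_prior, word_counts, vocab)
print(prediction, scores)
```

On the real data you would pass `data.Comment` and `data.Insult` to `train` instead of the toy lists, and you would probably want a more careful tokenizer than a plain whitespace split.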


**Possible improvements once you have 1 or 2:**
* You may notice from looking at the term frequency matrix (#documents x #terms) that some of these terms seem useless. Perhaps you can preprocess the matrix by dropping columns that have fewer than x total counts across all documents.
* We've created a set of 1-gram features, i.e. each word itself is a feature. But in reality, we might think that sequences of words are more useful. Rather than looking at "not" and "terrible" by themselves, having a feature of "not terrible" is much more useful. This would be a 2-gram or bi-gram. This concept extends to n-grams, which are sequences of words of length n. When creating a CountVectorizer, use the option ngram_range=(1,xx). This creates a set of features for all 1-grams (single words), 2-grams (two-word sequences), up to xx-grams (xx-word sequences). **Don't increase this too high!  n-grams for higher n require a lot more space and computation!**
* Naturally, there are a lot of words that don't really contribute to any meaning. "of","and","it", etc. These are called stop words. We can remove these through the stop_words="english" option when creating a CountVectorizer.
* Similarly, words have a lot of variations. "Talks", "Talked", "Talk" are all essentially the same word, but would be considered different by the tokenizer. Stemming removes this redundancy by truncating words to a common base form. Try using the [NLTK Stemmer](http://www.nltk.org/howto/stem.html)
* Lastly, instead of using raw word frequencies in the MultinomialNB, try using tf-idf weights. [TF-IDF Weighting in SKLearn](http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting). tf-idf (term frequency - inverse document frequency) is similar to word counts, but down-weights words that show up in many documents.



**Additional Helpful Links**

[Count Vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

[High Level Overview of Multinomial Naive Bayes](https://web.stanford.edu/class/cs124/lec/naivebayes.pdf)

[Multinomial vs Bernoulli Naive Bayes](http://blog.datumbox.com/machine-learning-tutorial-the-naive-bayes-text-classifier/)

[Improving Naive Bayes](https://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv('07_insult.csv')

In [3]:
# Your Code here

In [4]:
data.head()

Unnamed: 0,Insult,Comment
0,1,"""You fuck your dad."""
1,0,"""i really don't understand your point.\xa0 It ..."
2,0,"""A\\xc2\\xa0majority of Canadians can and has ..."
3,0,"""listen if you dont wanna get married to a man..."
4,0,"""C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd..."


# Sample SKLearn Implementation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data = pd.read_csv('07_insult.csv') #Read the data

In [None]:
from sklearn.feature_extraction import text #Package

In [None]:
count_vectorizer = text.CountVectorizer() #Create a CountVectorizer object to help convert the documents into a term matrix

In [None]:
count_vectorizer.fit(data.Comment) #Fit the initial data. i.e. Learn all the Vocabulary

In [None]:
X_csr = count_vectorizer.transform(data.Comment) #The resulting term matrix is stored as a sparse (CSR) matrix

In [None]:
count_vectorizer.get_feature_names_out() # Get the names of the n-grams (on scikit-learn < 1.0, use get_feature_names())

In [None]:
X_dense = X_csr.toarray() # CountVectorizer returns a sparse matrix, but you can get the full dense array as well

# If we want the column names as well, build a pandas DataFrame (on scikit-learn < 1.0, use get_feature_names())
X_df = pd.DataFrame( X_dense,columns=count_vectorizer.get_feature_names_out() ) 
y = data.Insult

In [None]:
from sklearn import naive_bayes

In [None]:
nb = naive_bayes.MultinomialNB() #Default alpha=1.0. This is equivalent to using the Laplace smoothing we mentioned earlier

In [None]:
nb.fit(X_csr,y) # fitting. i.e. Calculate P(c) for all c. Calculate P(w|c) for all w and c.
#nb.fit(X_dense,y) # Same as above
#nb.fit(X_df,y)    # Same as above

In [None]:
nb.score(X_csr,y) # Great! Except we fit and tested on the same dataset

In [None]:
from sklearn.model_selection import cross_val_score # sklearn.cross_validation was removed in scikit-learn 0.20

In [None]:
# Use cross-validation to get a more accurate measure of generalization error
# If you want to pass in X_df (Pandas DataFrame), convert it to a NumPy array first with X_df.values
# In the version used here, passing the DataFrame directly for X did not seem to work
cvScore = cross_val_score(nb,X_csr,y,cv=10) 
cvScore = cross_val_score(nb,X_df.values,y,cv=10) # Same as above, using the dense array

In [None]:
cvScore.mean()