## CHAPTER 3: Naive Bayes

Naive Bayes is a supervised learning method (you know the labels, unlike k-means), and it is typically used for document classification.

The core premise is Bayes' theorem. For a fuller refresher, see a probability theory reference; the basic premise is:

P(A ∩ B) = P(B ∩ A)

P(A | B) * P(B) = P(B | A) * P(A)

P(A | B) = P(B | A) * P(A) / P(B)
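
As a quick numeric check of the last formula, here is a minimal sketch with made-up probabilities. The values 0.5, 0.2 and 0.05 are assumptions chosen only for illustration; they do not come from the Mandrill data.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
p_a = 0.5                # P(A): prior probability of event A
p_b_given_a = 0.2        # P(B | A)
p_b_given_not_a = 0.05   # P(B | not A)

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(round(p_a_given_b, 3))
```

Observing B moves the probability of A from the prior of 0.5 up to 0.8, because B is four times more likely under A than under not A.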

In the book Data Smart, MailChimp launches a new product called Mandrill, which shares its name with a real animal. Our goal is to classify tweets as either being about the MailChimp product or about something else (a band, a video game, etc.).

In Naive Bayes, we take each document (in this case a tweet) and turn it into a vector of words.

Then we want to calculate

p(app | word 1, word 2, word 3, ...) = p(word 1, word 2, word 3, ... | app) * p(app) / p(word 1, word 2, word 3, ...)

and

p(other | word 1, word 2, word 3, ...) = p(word 1, word 2, word 3, ... | other) * p(other) / p(word 1, word 2, word 3, ...)

If p(app | word 1, word 2, ...) > p(other | word 1, word 2, ...), we say the tweet is about the app. Notice that both probabilities
share the same denominator, so we can drop it: what matters is which value is higher,
not the actual values.

So the comparison becomes (∝ means "proportional to")

p(app | word 1, word 2, word 3) ∝ p(word 1, word 2, word 3 | app) * p(app)

p(other | word 1, word 2, word 3) ∝ p(word 1, word 2, word 3 | other) * p(other)

But these joint conditional probability distributions are not easy to calculate.

So you assume that all the words are independent given the class (a naive assumption, hence Naive Bayes). This isn't true in practice, but it often still works. The same simplification is applied to both calculations, so although it prevents you from calculating accurate probabilities, all you care about is which one is higher, and in practice the independence assumption usually does not change that ordering.

So now you have

p(app | word 1, word 2) ∝ p(word 1, word 2 | app) * p(app) = p(word 1 | app) * p(word 2 | app) * p(app)

p(other | word 1, word 2) ∝ p(word 1, word 2 | other) * p(other) = p(word 1 | other) * p(word 2 | other) * p(other)
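
To make the factorization concrete, here is a small sketch that scores a two-word tweet under each class. The per-word probabilities, the priors, and the helper name `naive_score` are all hypothetical, picked only for illustration.

```python
import math

# Hypothetical per-word conditional probabilities (not from the Mandrill data).
p_word_given_app = {"email": 0.05, "api": 0.03}
p_word_given_other = {"email": 0.01, "api": 0.002}
p_app, p_other = 0.5, 0.5  # assumed priors

def naive_score(words, p_word_given_class, prior):
    """Unnormalized posterior: prior times the product of per-word likelihoods."""
    return prior * math.prod(p_word_given_class[w] for w in words)

tweet = ["email", "api"]
score_app = naive_score(tweet, p_word_given_app, p_app)
score_other = naive_score(tweet, p_word_given_other, p_other)

# Only the ordering matters, so the shared denominator is never computed.
label = "app" if score_app > score_other else "other"
print(label)
```

Neither score is a true probability (they do not sum to 1), but the larger one still identifies the more likely class.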


So where do you get the prior probabilities p(app) and p(other)? You can try to estimate them
from data.

If you can't, and just assume 50/50, notice that they drop out too, since they would be the same in both expressions.

So it really comes down to comparing

p(word 1 | app) * p(word2 | app) 

vs

p(word 1 | other) * p(word 2 | other)


So ultimately it comes down to how you calculate p(word 1 | app).

You have the labels, so you can count all the words in the app tweets and calculate what fraction of them are word 1.
That fraction is your probability estimate.
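
A minimal sketch of that counting step, using two made-up tweets in place of the labelled app tweets (the tweets and variable names here are assumptions for illustration):

```python
from collections import Counter

# Two toy "app" tweets standing in for the labelled training data.
toy_app_tweets = ["mandrill email api", "send email with mandrill"]

# Flatten the tweets into one list of words and count each word.
words = [word for tweet in toy_app_tweets for word in tweet.split()]
counts = Counter(words)
total_words = sum(counts.values())

# p(word | app) is the share of all app words that are this word.
p_email_given_app = counts["email"] / total_words
print(counts["email"], total_words)
```

Here "email" appears 2 times out of 7 words, so the estimate for p(email | app) is 2/7.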

## A couple of tricky points

1. Dealing with rare words. After you have trained the model, what if you see a rare word that never appeared in a class's training tweets? You would get p(rare word | app) = 0, so when you multiply everything together you get 0. One fix is to add 1 and pretend you have seen the word once, but that is unfair to words actually seen once, so you add 1 to those too, which is unfair to words seen twice, and so on. So you increment the count of every word by 1, which roughly preserves the relative probabilities and avoids the zero cases. This is called additive smoothing.

2. Dealing with floating-point underflow. Most word probabilities are very small, so when you multiply many of them together the result can become too small for the computer to represent and gets rounded to 0. Instead, take the natural log of each probability and add the logs.
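
Both points can be sketched in a few lines. The counts below are hypothetical, and `log_prob` is an illustrative helper that treats an unseen word as if it had been seen once, dividing by the total of the smoothed counts.

```python
import math
from collections import Counter

# Hypothetical training counts for one class.
counts = Counter({"email": 3, "api": 1})

# Additive smoothing: add 1 to every seen word; unseen words count as 1.
smoothed = {word: count + 1 for word, count in counts.items()}
total = sum(smoothed.values())

def log_prob(word):
    """Log of the smoothed probability; never log(0), even for unseen words."""
    return math.log(smoothed.get(word, 1) / total)

# Underflow: multiplying 100 probabilities of 1e-5 collapses to exactly 0.0 ...
product = 1.0
for _ in range(100):
    product *= 1e-5

# ... while the sum of their logs is an ordinary, finite number.
log_sum = sum(math.log(1e-5) for _ in range(100))

print(product, log_sum)
```

Because the log is monotonic, comparing sums of logs gives the same winner as comparing the products would, without the risk of both sides rounding to zero.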

In [1]:
# Bring in the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
excel_file_obj = pd.ExcelFile("../raw_data/Mandrill.xlsx")
# Tweets about the app
app_tweets_df = pd.read_excel(excel_file_obj, "AboutMandrillApp")
# All Other tweets
other_tweets_df = pd.read_excel(excel_file_obj, "AboutOther")

In [3]:
app_tweets_df.head()

Unnamed: 0,Tweet,Lower,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7
0,[blog] Using Nullmailer and Mandrill for your ...,[blog] using nullmailer and mandrill for your ...,[blog] using nullmailer and mandrill for your ...,[blog] using nullmailer and mandrill for your ...,[blog] using nullmailer and mandrill for your ...,[blog] using nullmailer and mandrill for your ...,[blog] using nullmailer and mandrill for your ...,[blog] using nullmailer and mandrill for your ...
1,[blog] Using Postfix and free Mandrill email s...,[blog] using postfix and free mandrill email s...,[blog] using postfix and free mandrill email s...,[blog] using postfix and free mandrill email s...,[blog] using postfix and free mandrill email s...,[blog] using postfix and free mandrill email s...,[blog] using postfix and free mandrill email s...,[blog] using postfix and free mandrill email s...
2,@aalbertson There are several reasons emails g...,@aalbertson there are several reasons emails g...,@aalbertson there are several reasons emails g...,@aalbertson there are several reasons emails g...,@aalbertson there are several reasons emails g...,@aalbertson there are several reasons emails g...,@aalbertson there are several reasons emails g...,@aalbertson there are several reasons emails g...
3,@adrienneleigh I just switched it over to Mand...,@adrienneleigh i just switched it over to mand...,@adrienneleigh i just switched it over to mand...,@adrienneleigh i just switched it over to mand...,@adrienneleigh i just switched it over to mand...,@adrienneleigh i just switched it over to mand...,@adrienneleigh i just switched it over to mand...,@adrienneleigh i just switched it over to mand...
4,@ankeshk +1 to @mailchimp We use MailChimp for...,@ankeshk +1 to @mailchimp we use mailchimp for...,@ankeshk +1 to @mailchimp we use mailchimp for...,@ankeshk +1 to @mailchimp we use mailchimp for...,@ankeshk +1 to @mailchimp we use mailchimp for...,@ankeshk +1 to @mailchimp we use mailchimp for...,@ankeshk +1 to @mailchimp we use mailchimp for...,@ankeshk +1 to @mailchimp we use mailchimp for...


In [4]:
other_tweets_df.head()

Unnamed: 0,Tweet,Lower,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7
0,¿En donde esta su remontada Mandrill?,¿en donde esta su remontada mandrill?,¿en donde esta su remontada mandrill?,¿en donde esta su remontada mandrill?,¿en donde esta su remontada mandrill,¿en donde esta su remontada mandrill,¿en donde esta su remontada mandrill,¿en donde esta su remontada mandrill
1,".@Katie_PhD Alternate, 'reproachful mandrill' ...",".@katie_phd alternate, 'reproachful mandrill' ...",".@katie_phd alternate, 'reproachful mandrill' ...",".@katie_phd alternate, 'reproachful mandrill' ...",".@katie_phd alternate, 'reproachful mandrill' ...",".@katie_phd alternate, 'reproachful mandrill' ...",".@katie_phd alternate, 'reproachful mandrill' ...",.@katie_phd alternate 'reproachful mandrill' ...
2,".@theophani can i get ""drill"" in there? it wou...",".@theophani can i get ""drill"" in there? it wou...",".@theophani can i get ""drill"" in there? it wou...",".@theophani can i get ""drill"" in there? it wou...",".@theophani can i get ""drill"" in there it wou...",".@theophani can i get ""drill"" in there it wou...",".@theophani can i get ""drill"" in there it wou...",".@theophani can i get ""drill"" in there it wou..."
3,“@ChrisJBoyland: Baby Mandrill Paignton Zoo 29...,“@chrisjboyland: baby mandrill paignton zoo 29...,“@chrisjboyland: baby mandrill paignton zoo 29...,“@chrisjboyland baby mandrill paignton zoo 29t...,“@chrisjboyland baby mandrill paignton zoo 29t...,“@chrisjboyland baby mandrill paignton zoo 29t...,“@chrisjboyland baby mandrill paignton zoo 29t...,“@chrisjboyland baby mandrill paignton zoo 29t...
4,“@MISSMYA #NameAnAmazingBand MANDRILL!” Mint C...,“@missmya #nameanamazingband mandrill!” mint c...,“@missmya #nameanamazingband mandrill!” mint c...,“@missmya #nameanamazingband mandrill!” mint c...,“@missmya #nameanamazingband mandrill!” mint c...,“@missmya #nameanamazingband mandrill ” mint c...,“@missmya #nameanamazingband mandrill ” mint c...,“@missmya #nameanamazingband mandrill ” mint c...


In [5]:
# Going to do the text cleaning myself
app_tweets = app_tweets_df.loc[:, "Tweet"].tolist()
other_tweets = other_tweets_df.loc[:, "Tweet"].tolist()

In [6]:
def process_tweets(list_of_tweets):
    """Follow the same pre-processing as the book:
       1. Lowercase
       2. Replace a period followed by a space with a space
       3. Replace a colon followed by a space with a space
       4. Replace ?, !, ; and , with a space
       """
    final_tweets = [tweet.lower() for tweet in list_of_tweets]
    symbols = [". ", ": ", "?", "!", ";", ","]
    for symbol in symbols:
        final_tweets = [tweet.replace(symbol, " ") for tweet in final_tweets]
    return final_tweets

In [7]:
cleaned_app_tweets = process_tweets(app_tweets)
cleaned_other_tweets = process_tweets(other_tweets)

In [8]:
def split_list_of_tweets_into_list_of_words(list_of_tweets):
    return [word for tweet in list_of_tweets for word in tweet.split(" ")]
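
One detail worth knowing about `split(" ")`: because the cleaning step replaces symbols with spaces, a tweet can end up with consecutive spaces, and splitting on a single space then produces empty-string tokens. A small sketch with a made-up string (not from the data set):

```python
# A toy string cleaned the same way as above: symbols become spaces.
raw = "great, thanks!"
cleaned = raw.replace(",", " ").replace("!", " ")

# Splitting on a single space keeps empty strings between adjacent spaces.
tokens_single_space = cleaned.split(" ")

# Splitting on any run of whitespace drops them.
tokens_whitespace = cleaned.split()

print(tokens_single_space)
print(tokens_whitespace)
```

In this notebook the empty tokens do not end up in the training counts, because tokens of 3 or fewer characters are filtered out later.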

In [9]:
cleaned_app_words = split_list_of_tweets_into_list_of_words(cleaned_app_tweets)
cleaned_other_words = split_list_of_tweets_into_list_of_words(cleaned_other_tweets)

In [10]:
cleaned_app_words[0:5]

['[blog]', 'using', 'nullmailer', 'and', 'mandrill']

In [11]:
cleaned_other_words[0:5]

['¿en', 'donde', 'esta', 'su', 'remontada']

I'm going to use Counter from collections, which takes a list of words and returns a dictionary-like object that gives me the count of each word.

I'm going to create a class that keeps track of my word counts and the total number of words.

In [12]:
from collections import Counter
class WordProbabilityCounter:
    
    def __init__(self, list_of_words):
        self.word_counter = Counter(list_of_words)
        self.num_words = self.get_total_words()
    
    def get_total_words(self):
        return sum(self.word_counter.values())
    
    def filter_out_tokens_less_than_3(self):
        """In book, he gets rid of tokens (words) that are 3 or less characters"""
        self.word_counter = {k:v for k, v in self.word_counter.items() if len(k) > 3}
        self.num_words = self.get_total_words()
    
    def additive_smoothing(self):
        """Adjust everything so if you have to add 1 for rare words everything is relatively
           the same
        """
        self.word_counter = {k: v + 1 for k, v in self.word_counter.items()}
        self.num_words = self.get_total_words()
    
    def fit_model(self):
        self.filter_out_tokens_less_than_3()
        self.additive_smoothing()
    
    def get_probability_for_word(self, word):
        """Given a word, return the natural log of its probability in order
           to prevent floating-point underflow. Words not in the counter get a count of 1.
        """
        word_count = self.word_counter.get(word, 1)
        return np.log(word_count / self.num_words)
    
    def get_probability_for_tweet(self, tweet):
        """Sum the per-word log probabilities (the log of their product)."""
        return sum(self.get_probability_for_word(word) for word in tweet.split(" "))

In [13]:
cleaned_words_model = WordProbabilityCounter(cleaned_app_words)
cleaned_words_model.fit_model()

In [14]:
other_words_model = WordProbabilityCounter(cleaned_other_words)
other_words_model.fit_model()

### Now it's time to test the model

In [15]:
test_tweets = pd.read_excel(excel_file_obj, "TestTweets")
test_tweets = test_tweets.loc[:, ["Number", "Class", "Tweet"]]

In [16]:
test_tweets.head(30)

Unnamed: 0,Number,Class,Tweet
0,1,APP,Just love @mandrillapp transactional email ser...
1,2,APP,@rossdeane Mind submitting a request at http:/...
2,3,APP,@veroapp Any chance you'll be adding Mandrill ...
3,4,APP,@Elie__ @camj59 jparle de relai SMTP!1 million...
4,5,APP,"would like to send emails for welcome, passwor..."
5,6,APP,"From Coworker about using Mandrill: ""I would ..."
6,7,APP,@mandrill Realised I did that about 5 seconds ...
7,8,APP,Holy shit. It’s here. http://www.mandrill.com/
8,9,APP,Our new subscriber profile page: activity time...
9,10,APP,@mandrillapp increases scalability ( http://bi...


### Now going to use my models to calculate log probabilities based on the word frequencies in each class (other or app), then assign the class with the highest probability

In [17]:
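# app_prob and other_prob hold summed log probabilities, so they are negative.
# Note: the raw "Tweet" text is scored here; process_tweets is not applied to the test tweets.
test_tweets.loc[:, "app_prob"] = test_tweets.loc[:, "Tweet"].apply(cleaned_words_model.get_probability_for_tweet)
test_tweets.loc[:, "other_prob"] = test_tweets.loc[:, "Tweet"].apply(other_words_model.get_probability_for_tweet)
test_tweets.loc[:, "prediction"] = np.where(test_tweets["app_prob"] >= test_tweets["other_prob"],
                                            "APP",
                                            "OTHER")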

In [18]:
test_tweets.head()

Unnamed: 0,Number,Class,Tweet,app_prob,other_prob,prediction
0,1,APP,Just love @mandrillapp transactional email ser...,-85.900442,-97.181465,APP
1,2,APP,@rossdeane Mind submitting a request at http:/...,-128.256084,-140.781974,APP
2,3,APP,@veroapp Any chance you'll be adding Mandrill ...,-73.491794,-75.440103,APP
3,4,APP,@Elie__ @camj59 jparle de relai SMTP!1 million...,-183.036787,-192.742443,APP
4,5,APP,"would like to send emails for welcome, passwor...",-139.353995,-147.009004,APP


### Assessing Accuracy

In [19]:
total = test_tweets.shape[0]
wrong_guesses = np.sum(test_tweets["Class"] != test_tweets["prediction"])
print("% Inaccurate:", wrong_guesses / total)
print("Num Inaccurate:", wrong_guesses)

% Inaccurate: 0.05
Num Inaccurate: 1


### This lines up with what the book got, which shows the power of this model. We could also have done this with pre-built modules

In [20]:
# Start from the top
app_tweets_df.loc[:, "class"] = 1
app_final_df = app_tweets_df.loc[:, ["Tweet", "class"]]

other_tweets_df.loc[:, "class"] = 0
other_final_df = other_tweets_df.loc[:, ["Tweet", "class"]]

In [21]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
training_df = pd.concat([app_final_df, other_final_df])

In [22]:
training_df["class"].value_counts()

1    150
0    150
Name: class, dtype: int64

In [23]:
training_df.head()

Unnamed: 0,Tweet,class
0,[blog] Using Nullmailer and Mandrill for your ...,1
1,[blog] Using Postfix and free Mandrill email s...,1
2,@aalbertson There are several reasons emails g...,1
3,@adrienneleigh I just switched it over to Mand...,1
4,@ankeshk +1 to @mailchimp We use MailChimp for...,1


Use a CountVectorizer to do the cleaning work for us

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(strip_accents="ascii", 
                     lowercase=True,
                     analyzer="word"
                    )

X_train = cv.fit_transform(training_df.loc[:, "Tweet"])
X_test = cv.transform(test_tweets.loc[:, "Tweet"])
y_train = training_df.loc[:, "class"]

In [25]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB(fit_prior=False) # just assume uniform rather than calculate priors
naive_bayes.fit(X_train, y_train)
predictions = naive_bayes.predict(X_test)

In [26]:
# Scoring
from sklearn.metrics import accuracy_score
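# Convert the APP/OTHER labels to 1/0 so they match the training labels
y_test = (test_tweets.loc[:, "Class"] == "APP").astype(int)
print("Accuracy", accuracy_score(y_test, predictions))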

Accuracy 0.9


So here we see that the built-in classifier missed one more (2 out of 20 compared to 1 out of 20). I'm not surprised: the vectorizer tokenizes the text slightly differently from my own cleaning, so it's not unreasonable that the two are off by one. It's nice to see the comparison with off-the-shelf tooling, though this notebook also shows how easy it is to code it from scratch.
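
To get a feel for how the two tokenizations can differ, here is a rough sketch on a made-up string. The regular expression is, as far as I know, the default `token_pattern` of `CountVectorizer`; treat that as an assumption and check the scikit-learn documentation for your version. Only the standard library is used, so this mimics the vectorizer rather than calling it.

```python
import re

text = "[blog] Using Mandrill, a new API!"  # toy example, not from the data set

# Hand-rolled approach from this notebook (without the symbol cleaning):
# lowercase, then split on single spaces. Punctuation stays attached.
split_tokens = text.lower().split(" ")

# Assumed CountVectorizer default: runs of 2+ word characters between boundaries.
regex_tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())

print(split_tokens)
print(regex_tokens)
```

The regex drops the one-letter token "a" and strips the brackets and punctuation, so the two approaches build different vocabularies from the same tweets, which is enough to flip a borderline prediction.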