# Naive Bayes classifier practice

## (predict if a tweet is about the Mandrill app or not)

All our data is in the naive_bayes_data folder. Let's load up the training examples into two lists of tweets:

In [1]:
import os
app_tweets   = open("naive_bayes_data/training_tweets_app.txt", encoding='utf-8').read().splitlines()
other_tweets = open("naive_bayes_data/training_tweets_other.txt", encoding='utf-8').read().splitlines()

### Let's check if our tweets loaded up successfully.

Feel free to explore the data set by changing the index that we're looking at.

In [2]:
print(app_tweets[0])

﻿[blog] Using Nullmailer and Mandrill for your Ubuntu Linux server outboud mail:  http://bit.ly/ZjHOk7  #plone


In [3]:
print(other_tweets[0])

﻿¿En donde esta su remontada Mandrill?


### Let's normalise our tweets, by:
* Converting them to lower case
* Replacing a dot "." or a colon ":" that is followed by a space " " with just " ", because we don't want to split "... google.com ..." into two words, but we do want to split "... Google. Microsoft ..."
* Replacing the punctuation marks ",", "?", "!" and ";" with " "

After this normalisation, we can treat our tweets as sequences of lowercase words separated by spaces " ".

In [4]:
def normalise_tweet(tweet):
    # Lower-case the tweet so that "Mandrill" and "mandrill" become the same word.
    tweet = tweet.lower()
    # Drop "." and ":" only when a space follows, so "google.com" stays intact.
    for mark in (". ", ": "):
        tweet = tweet.replace(mark, " ")
    # Replace the remaining separators with a space wherever they occur.
    for mark in (",", "?", "!", ";"):
        tweet = tweet.replace(mark, " ")
    return tweet
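
One edge case is worth knowing: a dot or colon at the very end of a tweet is not followed by a space, so the replacements described above leave it attached to the last word ("mail." stays "mail."). The following is only a sketch of a regex-based variant that also handles that case. The name `normalise_tweet_regex` is hypothetical and the function is not used anywhere else in this notebook.

```python
import re

def normalise_tweet_regex(tweet):
    # Hypothetical variant of the normalisation step, for illustration only.
    tweet = tweet.lower()
    # Drop "." or ":" only when followed by whitespace or the end of the string,
    # so "google.com" and "http://..." stay intact.
    tweet = re.sub(r"[.:](?=\s|$)", " ", tweet)
    # Replace the remaining separators unconditionally.
    return re.sub(r"[,?!;]", " ", tweet)

print(normalise_tweet_regex("I use Google. Not google.com, ok?").split())
print(normalise_tweet_regex("Check mail.").split())
```

Because the bags of words are later built with `split()`, the extra spaces this variant leaves behind do not matter.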

In [5]:
app_tweets_normalised   = [normalise_tweet(tweet) for tweet in app_tweets]
other_tweets_normalised = [normalise_tweet(tweet) for tweet in other_tweets]

Again, let's check that our code works on a tweet:

In [6]:
print(app_tweets[0])
print(app_tweets_normalised[0])

﻿[blog] Using Nullmailer and Mandrill for your Ubuntu Linux server outboud mail:  http://bit.ly/ZjHOk7  #plone
﻿[blog] using nullmailer and mandrill for your ubuntu linux server outboud mail  http://bit.ly/zjhok7  #plone


### Let's split our tweets into the so-called bags of words

To do that, we split each tweet at whitespace and remove duplicated words, so that we get each word in a tweet just once.

In [7]:
app_bags_of_words   = [set(tweet.split()) for tweet in app_tweets_normalised]
other_bags_of_words = [set(tweet.split()) for tweet in other_tweets_normalised]

In [8]:
print(app_bags_of_words[0])
print(other_bags_of_words[0])

{'server', '\ufeff[blog]', 'nullmailer', 'your', 'outboud', 'using', 'mandrill', '#plone', 'linux', 'ubuntu', 'and', 'http://bit.ly/zjhok7', 'mail', 'for'}
{'remontada', 'mandrill', 'esta', 'donde', 'su', '\ufeff¿en'}
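
The `\ufeff` prefix on `'\ufeff[blog]'` and `'\ufeff¿en'` is a Unicode byte order mark (BOM) from the start of each file. It makes the first word of each file a different token from the same word without the mark. As a small sketch, assuming the files begin with a UTF-8 BOM as this output suggests, Python's `utf-8-sig` codec drops the mark when decoding. The byte string below is a made-up stand-in for a file's contents.

```python
# Bytes of a hypothetical file that starts with a UTF-8 byte order mark.
raw = b"\xef\xbb\xbf[blog] Using Mandrill"

with_bom = raw.decode("utf-8")         # keeps the BOM as '\ufeff'
without_bom = raw.decode("utf-8-sig")  # strips a leading BOM if present

print(repr(with_bom.split()[0]))
print(repr(without_bom.split()[0]))
```

Passing `encoding='utf-8-sig'` to `open()` has the same effect when reading a file. The outputs shown in this notebook were produced with `'utf-8'`, so they still contain the mark.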


With this, the pre-processing step is done. Now we have to build the actual naive Bayes model.

### Count how many app/other tweets every word appears in

To do this, we'll use a dictionary with key:word -> value:(# app tweets containing word, # other tweets containing word). Words of three characters or fewer are skipped.

In [9]:
word_to_tweet_counts = {}
for bag in app_bags_of_words:
    for word in bag:
        if len(word) > 3:
            if word in word_to_tweet_counts:
                (app_count, other_count) = word_to_tweet_counts[word]
                app_count += 1
                word_to_tweet_counts[word] = (app_count,other_count)
            else:
                word_to_tweet_counts[word] = (1,0) # we have only seen the word in an app tweet
for bag in other_bags_of_words:
    for word in bag:
        if len(word) > 3:
            if word in word_to_tweet_counts:
                (app_count, other_count) = word_to_tweet_counts[word]
                other_count += 1
                word_to_tweet_counts[word] = (app_count,other_count)
            else:
                word_to_tweet_counts[word] = (0,1) # we have only seen the word in an other tweet
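
The same word -> (app count, other count) table can also be built with `collections.Counter`. This is only a sketch on tiny made-up bags: the names `app_bags`, `other_bags` and `count_tweets_per_word` are hypothetical and are not part of the notebook.

```python
from collections import Counter

# Tiny hypothetical bags of words standing in for the real training data.
app_bags = [{"mandrill", "email", "api"}, {"mandrill", "smtp"}]
other_bags = [{"mandrill", "monkey"}]

def count_tweets_per_word(bags, min_length=4):
    # Each bag is a set, so a word is counted at most once per tweet.
    # Words shorter than min_length are skipped (same as len(word) > 3).
    return Counter(word for bag in bags for word in bag if len(word) >= min_length)

app_counts = count_tweets_per_word(app_bags)
other_counts = count_tweets_per_word(other_bags)

# Merge into the word -> (app_count, other_count) layout; a Counter
# returns 0 for words it has never seen.
counts = {word: (app_counts[word], other_counts[word])
          for word in app_counts.keys() | other_counts.keys()}
print(counts["mandrill"])
```

The design choice here is to let `Counter` handle the "first time we see this word" case, which removes the explicit if/else branches.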

Let's check that the dictionary is populated correctly by printing the counts for a common word, such as "mandrill".

Feel free to print other words to check the result, such as "email".

In [10]:
print(word_to_tweet_counts["mandrill"])

(90, 89)


In [11]:
print(word_to_tweet_counts["email"])

(26, 0)


Now, in our Bayes model, we need to store probabilities instead of counts: how likely each word is to appear in an app/other tweet.

P(word|app) = (# app tweets containing the word)/(# app tweets)

P(word|other) = (# other tweets containing the word)/(# other tweets)

Take care when dividing: is 4/5=0, or is 4/5=0.8?
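
As a quick check of the question above, the snippet below shows both kinds of division. It assumes Python 3, which the `print()` calls in this notebook suggest. In Python 2, `/` between two integers floors the result instead. The variable names are illustrative.

```python
app_count, num_app_tweets = 4, 5

true_division = app_count / num_app_tweets    # float result in Python 3
floor_division = app_count // num_app_tweets  # integer result, rounded down

print(true_division)
print(floor_division)
```

Using floats for the counts, as the next cell does with `0.0` and `1.0`, sidesteps the issue in either Python version.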

<span style="color:red">The book calculates the probabilities from the total number of unique words in each dataset, whereas here they were calculated from the total number of bags in each dataset. I have fixed this.
</span>

<span style="color:red">Another problem with the cell below was that it did not account for additive smoothing, so I added it.
</span>

In [12]:
import math

num_apps = 0.0
num_others = 0.0

for word in word_to_tweet_counts:
    (app_count,other_count) = word_to_tweet_counts[word]
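    # Note: "x += x + 1.0" sets x to 2*x + 1, i.e. it doubles the raw count
    # before adding the smoothing constant; the outputs below reflect this.
    app_count += app_count + 1.0
    other_count += other_count + 1.0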
    if app_count > 0.0 : num_apps += app_count
    if other_count > 0.0 : num_others += other_count
    word_to_tweet_counts[word] = (app_count,other_count)

word_to_tweet_probs = {}
for word in word_to_tweet_counts:
    (app_count,other_count) = word_to_tweet_counts[word]
    if app_count != 0 and other_count != 0:
        (app_prob,other_prob) = (math.log(app_count/num_apps), math.log(other_count/num_others))
    elif app_count == 0:
        (app_prob,other_prob) = (math.log(1.0/num_apps), math.log(other_count/num_others))
    else:
        (app_prob,other_prob) = (math.log(app_count/num_apps),  math.log(1.0/num_others))
    word_to_tweet_probs[word] = (app_prob, other_prob)
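
For comparison, here is a minimal sketch of textbook add-one (Laplace) smoothing on made-up counts. It is not the notebook's exact computation: the cell above uses `count += count + 1.0`, which evaluates to 2*count + 1, whereas add-one smoothing adds exactly one to each raw count. The names `raw_counts`, `smoothed_log_probs` and `alpha` are hypothetical.

```python
import math

# Hypothetical raw counts: word -> (app_count, other_count).
raw_counts = {"email": (26, 0), "photo": (1, 0), "banana": (0, 3)}

def smoothed_log_probs(raw_counts, alpha=1.0):
    # Add alpha to every count so that no word has zero probability.
    smoothed = {w: (a + alpha, o + alpha) for w, (a, o) in raw_counts.items()}
    # Normalise by the total smoothed count in each class.
    total_app = sum(a for a, _ in smoothed.values())
    total_other = sum(o for _, o in smoothed.values())
    return {w: (math.log(a / total_app), math.log(o / total_other))
            for w, (a, o) in smoothed.items()}

log_probs = smoothed_log_probs(raw_counts)
print(log_probs["email"])

# For contrast: "x += x + 1.0" turns a raw count of 1 into 3.0, not 2.0.
x = 1
x += x + 1.0
print(x)
```

With smoothing applied to every word, no count is ever zero, so the zero-count branches are only a safety net.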

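Let's check that the probabilities look reasonable for a word such as "photo". Note that num_apps and num_others hold the total smoothed word counts for each class, not the number of tweets: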

In [13]:
print("App tweets: ", num_apps)
print("Other tweets: ", num_others)
print("Counts for the word: ", word_to_tweet_counts["photo"])
print("Probabilities for the word: ", word_to_tweet_probs["photo"])

App tweets:  4676.0
Other tweets:  3960.0
Counts for the word:  (3.0, 1.0)
Probabilities for the word:  (-7.35158603392385, -8.283999304248526)


## Let's try to classify our test tweets now!

First, let us pre-process the test tweets in the same fashion as we did with the training ones:

In [14]:
test_tweets = open("naive_bayes_data/test_tweets.txt", encoding='utf-8').read().splitlines()
test_tweets_normalised = [normalise_tweet(tweet) for tweet in test_tweets]
test_bags_of_words = [set(tweet.split()) for tweet in test_tweets_normalised]

<span style="color:red">The problem with the cell below was that it did not account for additive smoothing, so I added it.
</span>

In [15]:
predictions = []

for test_bag in test_bags_of_words:
    total_app_prob, total_other_prob = 0.0, 0.0
    for word in test_bag:
        if word in word_to_tweet_probs:
            (app_prob, other_prob) = word_to_tweet_probs[word]
        else:
            app_prob, other_prob = math.log(1.0/num_apps), math.log(1.0/num_others)
        total_app_prob += app_prob
        total_other_prob += other_prob
    if total_app_prob > total_other_prob:
        predictions.append("APP")
    else:
        predictions.append("OTHER")
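
The loop above adds log-probabilities instead of multiplying raw probabilities. The sketch below, on made-up numbers, shows why: a product of many small probabilities can underflow to zero, while the sum of their logarithms stays finite. Tweets are short, so underflow is unlikely here, but the same code would hit it on longer documents.

```python
import math

# 200 hypothetical word probabilities, each small but non-zero.
word_probs = [1e-5] * 200

product = 1.0
for p in word_probs:
    product *= p  # eventually underflows to 0.0 in floating point

# Summing logs keeps the same ordering between classes without underflow.
log_sum = sum(math.log(p) for p in word_probs)

print(product)
print(log_sum)
```

Comparing the two sums directly, without adding a log prior for each class, amounts to assuming that app and other tweets are equally likely beforehand.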

Now, let's check our predictions!

In [16]:
test_tweet_answers = open("naive_bayes_data/test_tweets_answers.txt").read().splitlines()

for tweet, prediction, answer in zip(test_tweets, predictions, test_tweet_answers):
    print(tweet)
    print("Our prediction is:", prediction)
    print("True class is    :", answer)
    print("\n")

﻿Just love @mandrillapp transactional email service - http://mandrill.com Sorry @SendGrid and @mailjet #timetomoveon
Our prediction is: APP
True class is    : APP


@rossdeane Mind submitting a request at http://help.mandrill.com with account details if you haven't already? Glad to take a look!
Our prediction is: APP
True class is    : APP


@veroapp Any chance you'll be adding Mandrill support to Vero?
Our prediction is: APP
True class is    : APP


@Elie__ @camj59 jparle de relai SMTP!1 million de mail chez mandrill / mois comparé à 1 million sur lite sendgrid y a pas photo avec mailjet
Our prediction is: APP
True class is    : APP


would like to send emails for welcome, password resets, payment notifications, etc. what should i use? was looking at mailgun/mandrill
Our prediction is: APP
True class is    : APP


From Coworker about using Mandrill:  "I would entrust email handling to a Pokemon".
Our prediction is: APP
True class is    : APP


@mandrill Realised I did that about 5 sec