# Naive Bayes

* Train a Naive Bayes model on a sentiment analysis task
* Test the model on held-out tweets
* Compute ratios of positive words to negative words
* Perform error analysis

<a name='0'></a>
## Importing Functions and Data

In [3]:
import pdb
from nltk.corpus import stopwords, twitter_samples
import numpy as np
import pandas as pd
import nltk
import string
from nltk.tokenize import TweetTokenizer
from os import getcwd

nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/user/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

def process_tweet(tweet):
    # Tokenize the tweet into words
    words = word_tokenize(tweet)
    
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word.lower() not in stop_words]
    
    # Apply stemming
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in filtered_words]
    
    # Remove punctuation
    no_punct_words = [word for word in stemmed_words if word not in string.punctuation]
    
    return no_punct_words

# Example usage
tweet = "This is an example tweet with some stop words and various word forms."
processed_words = process_tweet(tweet)
print(processed_words)


['exampl', 'tweet', 'stop', 'word', 'variou', 'word', 'form']


In [8]:
# get the sets of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# avoid assumptions about the length of all_positive_tweets
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

<a name='1'></a>
## 1 - Process the Data

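`process_tweet`, defined above, does this for you: it tokenizes the tweet, removes stop words, stems each remaining word, and drops punctuation tokens.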

In [9]:
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"

# print cleaned tweet
print(process_tweet(custom_tweet))

['rt', 'twitter', 'chapagain', 'hello', 'great', 'day', 'good', 'morn', 'http', '//chapagain.com.np']


In [10]:
# count_tweets

def count_tweets(result, tweets, ys):
    '''
    Input:
        result: a dictionary that will be used to map each pair to its frequency
        tweets: a list of tweets
        ys: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        result: a dictionary mapping each pair to its frequency
    '''
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            # define the key, which is the word and label tuple
            pair = (word, y)
            
            # if the key exists in the dictionary, increment the count
            if pair in result:
                result[pair] += 1

            # else, if the key is new, add it to the dictionary and set the count to 1
            else:
                result[pair] = 1

    return result

In [11]:
# Testing your function

result = {}
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
count_tweets(result, tweets, ys)

{('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
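
The explicit `if`/`else` counting in `count_tweets` can also be expressed with `collections.Counter`. The sketch below is only an illustration of that idea: `count_pairs` and `simple_tokens` are hypothetical names, and `simple_tokens` is a whitespace tokenizer standing in for `process_tweet`, so stop words such as `i` and `am` are still counted here.

```python
from collections import Counter


def simple_tokens(tweet):
    # Hypothetical stand-in for process_tweet: lowercase and split on whitespace.
    return tweet.lower().split()


def count_pairs(tweets, ys, tokenize=simple_tokens):
    """Count (word, label) pairs with a Counter instead of manual if/else."""
    counts = Counter()
    for y, tweet in zip(ys, tweets):
        counts.update((word, y) for word in tokenize(tweet))
    return counts


pairs = count_pairs(['i am happy', 'i am sad', 'i am sad'], [1, 0, 0])
print(pairs[('sad', 0)], pairs[('happy', 1)], pairs[('happy', 0)])
```

A `Counter` returns 0 for keys it has never seen, so looking up an unseen pair such as `('happy', 0)` needs no special handling.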

<a name='2'></a>
## 2 - Training Naive Bayes

Naive Bayes is an algorithm that can be used for sentiment analysis. It is fast to train and fast at prediction time.

$P(D_{pos})$ is the probability that the document is positive.
$P(D_{neg})$ is the probability that the document is negative.


#### Prior and Logprior

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative.  In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".

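The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$. Taking the log of this ratio gives the log prior, where $D_{pos}$ and $D_{neg}$ are the numbers of positive and negative documents:

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log(D_{pos}) - \log(D_{neg})\tag{3}$$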
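
As a quick numeric illustration of the prior ratio and its logarithm, the sketch below uses made-up document counts (`log_prior`, `d_pos`, and `d_neg` are hypothetical names, and the skewed counts are not taken from the dataset). A balanced set gives a log prior of 0, while an unbalanced set shifts it away from zero.

```python
import math


def log_prior(d_pos, d_neg):
    """Return log(d_pos / d_neg), computed as a difference of logs."""
    return math.log(d_pos) - math.log(d_neg)


balanced = log_prior(4000, 4000)   # same number of positive and negative documents
skewed = log_prior(3000, 1000)     # three times as many positive documents

print(balanced)
print(round(skewed, 4))
```

With equal class sizes the prior contributes nothing to the final score, so the per-word log likelihoods alone decide the prediction.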


#### Positive and Negative Probability of a Word
- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all documents (for all tweets), respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.


$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

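The "+1" in the numerator is additive smoothing: it keeps a word that never appears in one class from getting a probability of zero for that class. The $V$ in the denominator compensates for the added counts so that the probabilities over the vocabulary still sum to 1.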

#### Log likelihood

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$
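
To make equations (4) to (6) concrete, here is a small worked example on a toy frequency dictionary. The counts and the names `toy_freqs` and `toy_loglikelihood` are made up for illustration; only the formulas come from the text above.

```python
import math

# Toy counts keyed by (word, label); the values are made up for illustration.
toy_freqs = {('happi', 1): 3, ('day', 1): 1, ('day', 0): 1, ('sad', 0): 2}

vocab = {word for word, _ in toy_freqs}
V = len(vocab)                                                        # 3 unique words
N_pos = sum(c for (_, label), c in toy_freqs.items() if label == 1)  # 4 positive words
N_neg = sum(c for (_, label), c in toy_freqs.items() if label == 0)  # 3 negative words

toy_loglikelihood = {}
for word in sorted(vocab):
    # Equations (4) and (5): smoothed per-class word probabilities
    p_w_pos = (toy_freqs.get((word, 1), 0) + 1) / (N_pos + V)
    p_w_neg = (toy_freqs.get((word, 0), 0) + 1) / (N_neg + V)
    # Equation (6): log of the ratio
    toy_loglikelihood[word] = math.log(p_w_pos / p_w_neg)
    print(f'{word}: {toy_loglikelihood[word]:+.3f}')
```

Even though `happi` never occurs with label 0, smoothing gives it a nonzero negative probability of 1/6, so the log likelihood stays finite. Words seen mostly in positive tweets end up with a positive value and words seen mostly in negative tweets with a negative one.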

##### Create `freqs` dictionary


In [12]:
# Build the freqs dictionary for later uses
freqs = count_tweets({}, train_x, train_y)

<a name='ex-2'></a>
### Exercise 2 - train_naive_bayes
Given a `freqs` dictionary, `train_x` (a list of tweets), and `train_y` (a list of labels for each tweet), implement a Naive Bayes classifier.

- Calculate $V$
- Calculate $freq_{pos}$ and $freq_{neg}$
- Calculate $N_{pos}$, and $N_{neg}$
- Calculate $D$, $D_{pos}$, $D_{neg}$
- Calculate the logprior
- Calculate log likelihood


In [13]:
def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels corresponding to the tweets (0,1)
    Output:
        logprior: the log prior. (equation 3 above)
        loglikelihood: the log likelihood of your Naive Bayes equation. (equation 6 above)
    '''
    loglikelihood = {}
    logprior = 0


    # Calculate V, the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)

    # Calculate N_pos and N_neg, the total word counts in each class
    N_pos = N_neg = 0
    for pair in freqs.keys():
        if pair[1] > 0:
            N_pos += freqs[pair]
        else:
            N_neg += freqs[pair]
    
    # Calculate D, the number of documents
    D = len(train_y)

    # Calculate D_pos, the number of positive documents
    D_pos = sum(train_y)

    # Calculate D_neg, the number of negative documents
    D_neg = D - D_pos

    # Calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)
    
    # For each word in the vocabulary...
    for word in vocab:
        # Get the positive and negative frequency of the word
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)

        # Calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)

        # Calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)


    return logprior, loglikelihood


In [14]:
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(len(loglikelihood))

0.0
16310


<a name='3'></a>
## 3 - Testing

Now that we have the `logprior` and `loglikelihood`, we can test the Naive Bayes function by making predictions on some tweets!

In [15]:
# naive_bayes_predict
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of the log likelihoods of each word in the tweet (if found in the dictionary) + logprior (a number)
    '''
    ### START CODE HERE ###
    # Process the tweet to get a list of words
    word_l = process_tweet(tweet)

    # Initialize probability to zero
    p = 0.0

    # Add the logprior
    p += logprior

    for word in word_l:
        # Check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # Add the log likelihood of that word to the probability
            p += loglikelihood[word]

    ### END CODE HERE ###

    return p

In [16]:
my_tweet = 'She smiled.'
p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print('The expected output is', p)

The expected output is 1.4568612465665989


In [17]:
# Experiment with your own tweet.
my_tweet = 'He laughed.'
p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print('The expected output is', p)

The expected output is -0.4056672935496634


In [19]:
#test_naive_bayes

def test_naive_bayes(test_x, test_y, logprior, loglikelihood, naive_bayes_predict=naive_bayes_predict):
    """
    Input:
        test_x: A list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
    accuracy = 0  # return this properly

    y_hats = []
    for tweet in test_x:
        # If the prediction is > 0
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            # The predicted class is 1
            y_hat_i = 1
        else:
            # Otherwise, the predicted class is 0
            y_hat_i = 0

        # Append the predicted class to the list y_hats
        y_hats.append(y_hat_i)

    # Error is the average of the absolute values of the differences between y_hats and test_y
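    # (convert both to arrays so the subtraction also works when test_y is a plain list)
    error = np.mean(np.abs(np.asarray(y_hats) - np.asarray(test_y)))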

    # Accuracy is 1 minus the error
    accuracy = 1 - error

    return accuracy


In [20]:
print("Naive Bayes accuracy = %0.4f" %
      (test_naive_bayes(test_x, test_y, logprior, loglikelihood)))

Naive Bayes accuracy = 0.7640
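
For 0/1 labels, one minus the mean absolute error is the same as the fraction of predictions that match the true labels. The sketch below shows that equivalent formulation in plain Python on made-up predictions; `accuracy` is a hypothetical helper, not part of the notebook.

```python
def accuracy(y_hats, ys):
    """Fraction of predictions that equal the true labels."""
    if len(y_hats) != len(ys):
        raise ValueError('y_hats and ys must have the same length')
    correct = sum(1 for y_hat, y in zip(y_hats, ys) if y_hat == y)
    return correct / len(ys)


toy_predictions = [1, 0, 1, 1]   # made-up predictions
toy_labels = [1, 0, 0, 1]        # made-up true labels

toy_acc = accuracy(toy_predictions, toy_labels)
# Same value via the "1 - mean absolute error" route used in test_naive_bayes
toy_acc_via_error = 1 - sum(abs(a - b) for a, b in zip(toy_predictions, toy_labels)) / len(toy_labels)
print(toy_acc, toy_acc_via_error)
```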


In [21]:
# Run this cell to test your function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    p = naive_bayes_predict(tweet, logprior, loglikelihood)
    print(f'{tweet} -> {p:.2f}')

I am happy -> 2.08
I am bad -> -1.25
this movie should have been great. -> 2.00
great -> 2.07
great great -> 4.14
great great great -> 6.21
great great great great -> 8.28
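
Each extra `great` in the outputs above adds the same amount (about 2.07) because the score is a plain sum of per-word log likelihoods plus the log prior. The sketch below reproduces that behavior with a made-up log likelihood table; `score` and `toy_ll` are hypothetical names, and words missing from the table contribute nothing.

```python
def score(words, logprior, loglikelihood):
    """Sum the log prior and the log likelihood of every known word."""
    return logprior + sum(loglikelihood.get(w, 0.0) for w in words)


toy_ll = {'great': 2.0, 'bad': -1.5}   # made-up values

one = score(['great'], 0.0, toy_ll)
three = score(['great'] * 3, 0.0, toy_ll)
mixed = score(['great', 'bad', 'unseen'], 0.0, toy_ll)
print(one, three, mixed)
```

Because every word is scored independently of its neighbors, the model has no notion of word order or context, which is the "naive" assumption behind the method.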


<a name='4'></a>
## 4 - Filter words by Ratio 

Implement `get_ratio`.

- Given the `freqs` dictionary and a particular word, use `lookup(freqs, word, 1)` to get the positive count of the word.
- Similarly, use `lookup(freqs, word, 0)` to get the negative count of that word.
- Calculate the ratio of positive to negative counts as `(positive + 1) / (negative + 1)`. Adding 1 to each count avoids a division by zero when a word has no negative occurrences.

In [22]:
def lookup(freqs, word, label):
    '''
    Input:
        freqs: dictionary containing word frequency counts
        word: the word you want to look up
        label: the class label (0 for negative, 1 for positive)
    Output:
        count: the frequency count of the specified word and label
    '''
    key = (word, label)
    if key in freqs:
        count = freqs[key]
    else:
        count = 0
    return count


In [23]:
# get_ratio

def get_ratio(freqs, word):
    '''
    Input:
        freqs: dictionary mapping (word, label) pairs to frequency counts
        word: the word to look up
    Output: a dictionary with keys 'positive', 'negative', and 'ratio'.
        Example: {'positive': 9, 'negative': 19, 'ratio': 0.5}
    '''
    pos_neg_ratio = {'positive': 0, 'negative': 0, 'ratio': 0.0}
    ### START CODE HERE ###
    # use lookup() to find positive counts for the word (denoted by the integer 1)
    pos_neg_ratio['positive'] = lookup(freqs, word, 1)

    # use lookup() to find negative counts for the word (denoted by integer 0)
    pos_neg_ratio['negative'] = lookup(freqs, word, 0)

    # calculate the smoothed ratio of positive to negative counts; adding 1 to
    # both counts avoids division by zero when the negative count is 0
    pos_neg_ratio['ratio'] = (pos_neg_ratio['positive'] + 1) / (pos_neg_ratio['negative'] + 1)
    ### END CODE HERE ###
    return pos_neg_ratio


In [24]:
get_ratio(freqs, 'happi')

{'positive': 162, 'negative': 18, 'ratio': 8.578947368421053}
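
The ratio reported for `happi` is not 162 / 18 = 9 because `get_ratio` adds 1 to both counts. The short check below redoes that arithmetic with the counts shown in the output above; the variable names are illustrative only.

```python
positive, negative = 162, 18              # counts reported for 'happi' above

smoothed = (positive + 1) / (negative + 1)   # what get_ratio computes
raw = positive / negative                    # unsmoothed ratio, for comparison
print(smoothed, raw)

# With a negative count of 0 the raw ratio would divide by zero,
# while the smoothed ratio is still defined.
zero_negative = (23 + 1) / (0 + 1)
print(zero_negative)
```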

In [25]:
# get_words_by_threshold

def get_words_by_threshold(freqs, label, threshold, get_ratio=get_ratio):
    '''
    Input:
        freqs: dictionary of words
        label: 1 for positive, 0 for negative
        threshold: ratio that will be used as the cutoff for including a word in the returned dictionary
    Output:
        word_list: dictionary containing the word and information on its positive count, negative count, and ratio of positive to negative counts.
        example of a key value pair:
        {'happi':
            {'positive': 9, 'negative': 19, 'ratio': 0.5}
        }
    '''
    word_list = {}

    for key in freqs.keys():
        word, _ = key

        # get the positive/negative ratio for a word
        pos_neg_ratio = get_ratio(freqs, word)

        # if the label is 1 and the ratio is greater than or equal to the threshold...
        if label == 1 and pos_neg_ratio['ratio'] >= threshold:
        
            # Add the pos_neg_ratio to the dictionary
            word_list[word] = pos_neg_ratio

        # If the label is 0 and the pos_neg_ratio is less than or equal to the threshold...
        elif label == 0 and pos_neg_ratio['ratio'] <= threshold:
        
            # Add the pos_neg_ratio to the dictionary
            word_list[word] = pos_neg_ratio

        # otherwise, do not include this word in the list (do nothing)

    return word_list


In [27]:
# Test your function; find positive words at or above a threshold
get_words_by_threshold(freqs, label=1, threshold=15)

{'followfriday': {'positive': 23, 'negative': 0, 'ratio': 24.0},
 'bhaktisbant': {'positive': 16, 'negative': 0, 'ratio': 17.0},
 'flipkartfashionfriday': {'positive': 16, 'negative': 0, 'ratio': 17.0},
 'p': {'positive': 107, 'negative': 1, 'ratio': 54.0},
 'influenc': {'positive': 16, 'negative': 0, 'ratio': 17.0},
 'jnlazt': {'positive': 62, 'negative': 0, 'ratio': 63.0},
 '//t.co/rcvcyyo0iq': {'positive': 62, 'negative': 0, 'ratio': 63.0},
 'youth': {'positive': 15, 'negative': 0, 'ratio': 16.0},
 'tolajobjob': {'positive': 14, 'negative': 0, 'ratio': 15.0},
 'bam': {'positive': 44, 'negative': 0, 'ratio': 45.0},
 'barsandmelodi': {'positive': 44, 'negative': 0, 'ratio': 45.0},
 '969horan696': {'positive': 44, 'negative': 0, 'ratio': 45.0},
 'warsaw': {'positive': 44, 'negative': 0, 'ratio': 45.0},
 'stat': {'positive': 51, 'negative': 0, 'ratio': 52.0},
 'impastel': {'positive': 17, 'negative': 0, 'ratio': 18.0},
 'blog': {'positive': 27, 'negative': 0, 'ratio': 28.0},
 'fback': {
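
The call above returns strongly positive words. To pull out strongly negative words instead, pass `label=0` with a threshold below 1. The sketch below shows both directions on a made-up frequency dictionary; `ratio` and `words_by_threshold` are simplified, hypothetical stand-ins that return only the ratio for each word rather than the full dictionary produced by `get_words_by_threshold`.

```python
def ratio(freqs, word):
    # Same smoothed ratio as get_ratio: (pos + 1) / (neg + 1).
    return (freqs.get((word, 1), 0) + 1) / (freqs.get((word, 0), 0) + 1)


def words_by_threshold(freqs, label, threshold):
    """Keep words at or above the threshold (label 1) or at or below it (label 0)."""
    words = {word for word, _ in freqs}
    if label == 1:
        return {w: ratio(freqs, w) for w in words if ratio(freqs, w) >= threshold}
    return {w: ratio(freqs, w) for w in words if ratio(freqs, w) <= threshold}


# Made-up counts keyed by (word, label)
toy = {('joy', 1): 9, ('sad', 0): 19, ('day', 1): 4, ('day', 0): 4}

negative_words = words_by_threshold(toy, label=0, threshold=0.1)
positive_words = words_by_threshold(toy, label=1, threshold=5)
print(negative_words, positive_words)
```

A word such as `day`, which is equally common in both classes, has a ratio of 1 and is filtered out in both directions.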

<a name='5'></a>
## 5 - Error Analysis



In [28]:
# Some error analysis done for you
print('Truth Predicted Tweet')
for x, y in zip(test_x, test_y):
    y_hat = naive_bayes_predict(x, logprior, loglikelihood)
    if y != (np.sign(y_hat) > 0):
        print('%d\t%0.2f\t%s' % (y, np.sign(y_hat) > 0, ' '.join(
            process_tweet(x)).encode('ascii', 'ignore')))

Truth Predicted Tweet
1	0.00	b'bro u wan cut hair anot ur hair long liao bo sinc ord liao take easi lor treat save leav longer bro lol sibei xialan'
1	0.00	b"humayag 'stuck centr right clown right joker left ... orgasticpot ahmedshahe ahmedsaeedgahaa"
1	0.00	b'tim_a_robert pinter_quot work'
1	0.00	b'sasarichardson stefbystef_ frgt10_anthem hahahahahahahahahahahahahaha dy liter front like'
1	0.00	b"'s awak"
1	0.00	b'charlesjonesss f'
1	0.00	b"scooterblue1962 thank ye let 's hope work miss"
1	0.00	b"awkward moment name 'akarshan end stay 'singl foreveralon"
1	0.00	b'v4violetta highfiv probabl ahead sinc less artsi verbal'
1	0.00	b'dat rp tho thank much guy celebr one month partnership ty madmorphtv raid'
1	0.00	b'seniorspazz tehsmiley bore everyth'


1	0.00	b'jarednotsubway iluvmariah bravotv truli later move know queen bee upward bound movingonup'
1	0.00	b'caballeroserena actual need stop tweet drive'
1	0.00	b"well 's littlemix presal ticket bought thank ticketmasteruk wonder take book ... thing parent"
1	0.00	b'betcha dumb butt'
1	0.00	b'kik qualky808 kik kikmenow milf like4lik bore summer sexysaturday http //t.co/8r2nrl31ic'
1	0.00	b"madison420ivi `` '' wish kid"
1	0.00	b'markbreech sure would good thing 4 bottom dare 2 say 2 miss b im gon na stubborn mouth soap nothavingit p'
1	0.00	b"saharjojo10 ye n't car"
1	0.00	b'thewhitespik like especi klee one'
1	0.00	b'catargiu yeah kinda feel like warm butter'
1	0.00	b'madpilot'
1	0.00	b'whenev sister see cri text ask im okay aw someon care'
1	0.00	b"sisiphomphoza ca n't wait see"
1	0.00	b'shadypenguinn take care'
1	0.00	b'waitlesscompani aledeleonmoreno glyon'
1	0.00	b"andyherren know dumb think johnni rock u seen utub video 's mad respect"
1	0.00	b'imtoxic21 sorri loss hope goe well'

<a name='6'></a>
## 6 - Prediction

In [29]:
# Test with your own tweet - feel free to modify `my_tweet`
my_tweet = 'I am happy because I am learning :)'

p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print(p)

2.5705368957188433
