# Sentiment Analysis Using Naive Bayes

It will using Naive Bayes for sentiment analysis on tweets. Given a tweet, you will decide if it has a positive sentiment or a negative one. specifically it will:
- Train a naive bayes model on a sentiment analysis task.
- Test using your model
- Compute ratios of positive words to negative words.
- Do some error anaysis
- Predict on your own tweet.

## Table of Content

- Importing Functions and Data
- Process the data.
- Train your Model using Naive Bayes
- Test your naive bayes
- Filter words by Ratio of Positive to Negative Counts
- Error Analysis
- Predict with your own Tweet

## Importing Functions and Data

In [1]:
from utils import process_tweet, lookup
import pdb
from nltk.corpus import stopwords, twitter_samples
import numpy as np
import pandas as pd
import nltk
import string
from nltk.tokenize import TweetTokenizer
from os import getcwd

nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Dhruv\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dhruv\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

In [4]:
#Select the set of positive and negative tweets
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

In [5]:
# Test set : 20%
# Train set : 80%
#There are total 5000 positive and 5000 negative tweets
test_pos = positive_tweets[4000:]
train_pos = positive_tweets[:4000]
test_neg = negative_tweets[4000:]
train_neg = negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

In [6]:
# avoid assumptions about the length of all_positive_tweets
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

## Process the Data

In [7]:
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"

# print cleaned tweet
print(process_tweet(custom_tweet))

['hello', 'great', 'day', ':)', 'good', 'morn']


In [10]:
# Count Tweets
def count_tweets(result, tweets, ys):
    '''
    Input:
        result: a dictionary that will be used to map each pair to its frequency
        tweets: a list of tweets
        ys: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        result: a dictionary mapping each pair to its frequency
    '''
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            #define the key, which is the word and label tuple
            pair = (word, y)
            if pair in result:
                result[pair] += 1
            else:
                result[pair] = 1
    return result

## Train Model Using Naive Bayes

In [11]:
# Build the freqs dictionary for later uses
freqs = count_tweets({}, train_x, train_y)

In [16]:
# train_naive_bayes
def train_naive_bayes(freqs, train_x,train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels correponding to the tweets (0,1)
    Output:
        logprior: the log prior. (equation 3 above)
        loglikehood: the log likelihood of you Naive bayes equation. (equation 6 above)
    '''
    loglikehood = {}
    logprior = 0
    
    #calculate the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)
    
    #calculate N_pos, N_neg, V_pos, V_neg
    N_pos = N_neg = 0
    for pair in freqs.keys():
        if pair[1] > 0:
            N_pos += 1
        else:
            N_neg += 1
    
    D = len(train_y)
    D_pos = (len(list(filter(lambda x:x > 0,train_y))))
    D_neg = (len(list(filter(lambda x:x <=0,train_y))))
    
    logprior = np.log(D_pos) - np.log(D_neg)
    
    for word in vocab:
        # get the positive and negative frequency of the word
        freq_pos = lookup(freqs,word,1)
        freq_neg = lookup(freqs,word,0)

        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        
        loglikehood[word] = np.log(p_w_pos/p_w_neg)
        
    return logprior, loglikehood


In [44]:
logprior, loglikehood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(len(loglikehood))

0.0
9162


## Test Naive Bayes

In [45]:
def naive_bayes_predict(tweet, logprior, loglikehood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikehood: a dictionary of words mapping to numbers
    Output:
        p: the sum of all the logliklihoods of each word in the tweet (if found in the dictionary) + logprior (a number)

    '''
    word_l = process_tweet(tweet)
    
    p = 0
    p += logprior
    
    for word in word_l:
        if word in loglikehood:
            p += loglikehood[word]
            
    return p

In [46]:
def test_naive_bayes(test_x, test_y, logprior, loglikehood, naive_bayes_predict = naive_bayes_predict):
    """
    Input:
        test_x: A list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikehood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
    acc = 0
    y_h = []
    
    for tweet in test_x:
        if naive_bayes_predict(tweet, logprior, loglikehood) > 0:
            y_h_i = 1
        else:
            y_h_i = 0
        y_h.append(y_h_i)
        
    error = np.mean(np.absolute(y_h-test_y))
    
    acc = 1 - error
    
    return acc

In [47]:
print("Naive Bayes accuracy = %0.4f" %
      (test_naive_bayes(test_x, test_y, logprior, loglikehood)))

Naive Bayes accuracy = 0.9955


In [48]:
# Feel free to check the sentiment of your own tweet below
my_tweet = 'you are bad :('
naive_bayes_predict(my_tweet, logprior, loglikehood)

-8.839647736872516

## Filter words by Ratio of Positive to Negative Counts

In [49]:
def get_ratio(freqs, word):
    '''
    Input:
        freqs: dictionary containing the words

    Output: a dictionary with keys 'positive', 'negative', and 'ratio'.
        Example: {'positive': 10, 'negative': 20, 'ratio': 0.5}
    '''
    pos_neg_ratio = {'positive':0, 'negative':0, 'ratio': 0.0}
    
    pos_neg_ratio['positive'] = lookup(freqs, word,1)
    
    pos_neg_ratio['negative'] = lookup(freqs, word, 0)
    
    pos_neg_ratio['ratio'] = (pos_neg_ratio['positive']+1) / (pos_neg_ratio['negative']+1)
    
    return pos_neg_ratio

In [50]:
get_ratio(freqs, 'bad')

{'positive': 14, 'negative': 54, 'ratio': 0.2727272727272727}

In [51]:
def get_words_by_threshold(freqs, label, threshold, get_ratio=get_ratio):
    '''
    Input:
        freqs: dictionary of words
        label: 1 for positive, 0 for negative
        threshold: ratio that will be used as the cutoff for including a word in the returned dictionary
    Output:
        word_list: dictionary containing the word and information on its positive count, negative count, and ratio of positive to negative counts.
        example of a key value pair:
        {'happi':
            {'positive': 10, 'negative': 20, 'ratio': 0.5}
        }
    '''
    word_list = {}
    
    for key in freqs.keys():
        word,_ = key
        
        pos_neg_ratio = get_ratio(freqs, word)
        
        if label == 1 and pos_neg_ratio['ratio'] >= threshold:
            
            word_list[word] = pos_neg_ratio
            
        elif label == 0 and pos_neg_ratio['ratio'] <= threshold:
            
            word_list[word] = pos_neg_ratio
            
    return word_list

## Error Analysis

In [52]:
print('Truth Predicted Tweet')
for x, y in zip(test_x, test_y):
    y_hat = naive_bayes_predict(x, logprior, loglikehood)
    if y != (np.sign(y_hat) > 0):
        print('%d\t%0.2f\t%s' % (y, np.sign(y_hat) > 0, ' '.join(
            process_tweet(x)).encode('ascii', 'ignore')))

Truth Predicted Tweet
1	0.00	b'truli later move know queen bee upward bound movingonup'
1	0.00	b'new report talk burn calori cold work harder warm feel better weather :p'
1	0.00	b'harri niall 94 harri born ik stupid wanna chang :d'
1	0.00	b'park get sunlight'
1	0.00	b'uff itna miss karhi thi ap :p'
0	1.00	b'hello info possibl interest jonatha close join beti :( great'
0	1.00	b'u prob fun david'
0	1.00	b'pat jay'
0	1.00	b'sr financi analyst expedia inc bellevu wa financ expediajob job job hire'


## Predict with your own Tweet

In [54]:
my_tweet = input()

p = naive_bayes_predict(my_tweet, logprior, loglikehood)
print(p)

Hey buddy, I just learned naive bayes.
2.55893356774379


# Thank You