**Sentiment Analyzer based on Naive Bayes Classifier**

Here we build a multinomial Naive Bayes classifier from scratch to label airline tweets as positive, negative, or neutral.

In [0]:
!pip install -U -q PyDrive

In [61]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [62]:
path_of_data = '/content/drive/"My Drive"/"Sentiment Analysis"/'

!ls {path_of_data}
# Can you see the names of the files or directories? If yes, proceed...

Tweets.csv


In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [0]:
path = '/content/drive/My Drive/Sentiment Analysis/'

In [0]:
data = pd.read_csv(path+"Tweets.csv")

In [66]:
data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,@USAirways Is there a phone line to call into ...
1,positive,@united Bag was finally delivered and intact. ...
2,positive,@usairways Thanks to Kevin and team at F38ish ...
3,negative,"@AmericanAir Yes, talked to them. FLL says is ..."
4,negative,@VirginAmerica and it's a really big bad thing...


In [0]:
data.columns = ['labels', 'column_having_text_of_tweet']

In [68]:
data.shape

(14640, 2)

In [69]:
data['labels'].value_counts(normalize=True)

negative    0.626913
neutral     0.211680
positive    0.161407
Name: labels, dtype: float64

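This is the class distribution of the labels: roughly 63% negative, 21% neutral, and 16% positive.

The train and test sets should keep roughly the same distribution. Enforcing it exactly, class by class, is called stratified sampling. The split below instead shuffles the whole dataset at random, which preserves the proportions only approximately; the checks after the split confirm that it does so here.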

**Train-Test Split**

In [70]:
# Randomize the dataset
data_randomized = data.sample(frac=1, random_state=1)

# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)  # use 80% of the data for training

# Training/Test split
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(11712, 2)
(2928, 2)


In [71]:
round(training_set['labels'].value_counts(normalize=True), 2)

negative    0.63
neutral     0.21
positive    0.16
Name: labels, dtype: float64

In [72]:
round(test_set['labels'].value_counts(normalize=True), 2)

negative    0.63
neutral     0.21
positive    0.16
Name: labels, dtype: float64
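
The random shuffle above happens to reproduce the class proportions, but nothing guarantees it. A stratified split enforces them by splitting each class separately. Below is a minimal sketch in plain Python; the function name `stratified_split` and the toy rows are assumptions made for illustration and are not part of this notebook.

```python
import random
from collections import Counter, defaultdict

def stratified_split(rows, train_frac=0.8, seed=1):
    """Split (label, text) rows so each label keeps the same train/test ratio."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[0]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):  # sorted so the result is reproducible
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy data with a 60/20/20 class mix (hypothetical, not the Tweets.csv data)
toy_rows = ([("negative", f"tweet {i}") for i in range(60)]
            + [("neutral", f"tweet {i}") for i in range(60, 80)]
            + [("positive", f"tweet {i}") for i in range(80, 100)])

train_rows, test_rows = stratified_split(toy_rows)
train_mix = Counter(label for label, _ in train_rows)
test_mix = Counter(label for label, _ in test_rows)
print(train_mix, test_mix)
```

The returned lists are grouped by class, so a final shuffle may be wanted before training. In pandas, `DataFrameGroupBy.sample` offers per-group sampling that could serve the same purpose.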

Next, we clean the training tweets: replace non-word characters with spaces, lowercase the text, and split each tweet into words.

**Data Cleaning on Training part:**

In [73]:
training_set.head()

Unnamed: 0,labels,column_having_text_of_tweet
0,neutral,@united 618 was flight out of Houston
1,negative,.@united I think not. I'm not flying you again...
2,positive,@SouthwestAir Have had a companion pass for a ...
3,negative,@AmericanAir we have 3 more passengers with me...
4,positive,@SouthwestAir I managed to get sorted out over...


In [74]:
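# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
training_set['column_having_text_of_tweet'] = training_set['column_having_text_of_tweet'].str.replace(r'\W', ' ', regex=True)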
training_set['column_having_text_of_tweet'] = training_set['column_having_text_of_tweet'].str.lower()
training_set.head()

Unnamed: 0,labels,column_having_text_of_tweet
0,neutral,united 618 was flight out of houston
1,negative,united i think not i m not flying you again...
2,positive,southwestair have had a companion pass for a ...
3,negative,americanair we have 3 more passengers with me...
4,positive,southwestair i managed to get sorted out over...
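
To see what the cleaning steps do to a single string, here is a small sketch using Python's `re` module. The helper name `clean_tweet` is hypothetical. Note that `\W` matches anything that is not a letter, digit, or underscore, so digits and underscores survive, while an apostrophe splits "I'm" into "i" and "m".

```python
import re

def clean_tweet(text):
    """Replace non-word characters with spaces, lowercase, and split into tokens."""
    return re.sub(r"\W", " ", text).lower().split()

tokens = clean_tweet(".@united I think not. I'm not flying you again")
print(tokens)
```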


In [0]:
training_set['column_having_text_of_tweet'] = training_set['column_having_text_of_tweet'].str.split()

In [0]:
vocabulary = []
for tweet in training_set['column_having_text_of_tweet']:
    for word in tweet:
        vocabulary.append(word)
        

In [0]:
vocabulary = list(set(vocabulary)) #to obtain only the unique elements

In [78]:
len(vocabulary)

13264

There are 13264 unique words in the vocabulary.

In [0]:
word_counts_per_tweet = {unique_word: [0] * len(training_set['column_having_text_of_tweet']) for unique_word in vocabulary}

for index, tweet in enumerate(training_set['column_having_text_of_tweet']):
    for word in tweet:
        word_counts_per_tweet[word][index] += 1

In [0]:
word_counts = pd.DataFrame(word_counts_per_tweet) #word_counts is a dataframe

In [81]:
word_counts.head()

Unnamed: 0,incurring,purpose,sittin,keambleam,missedupgrades,caren,16mont,epicfail,trained,2uaicfjrms,tim,tore,reuse,warrants,reiterate,mfssh2uhue,clearing,term,complimenting,orlandosentinel,o1u96xc3bo,forgot,jvstatus,cleveland,omaha,interested,zl4bvexmcj,seating,use,rates,7t1rdrcre6,thin,becomes,nogood,statement,medical,t5mrj5yw6i,channels,queue,gettin,...,southwestfail,odds,enforcing,livethelegend,private,threatening,deter,itsaaronchriz,users,matters,future,tiredofthis,verbiage,hn,64kn6geep8,delacy,willie,complaints,staring,belligerent,after2,ifeeldumb,lauderdale,baitandswitch,3hours,kick,street,ua1469,custserv,ujfs9zi6kd,seeing,kosher,neighbors,theft,sympathetic,slight,bankruptcies,3659,777,deactivate
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [82]:
word_counts.shape

(11712, 13264)

The shape confirms that the training set contains 11712 tweets and that the vocabulary holds 13264 unique words, so the DataFrame is 11712 x 13264: one row per tweet and one column per word.
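
Most entries of such a matrix are zero. As an alternative sketch (not what this notebook does), the per-class word totals that Naive Bayes needs can be accumulated directly with `collections.Counter`, without building a dense table. The toy tweets and the names `class_word_counts` and `n_words_per_class` are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical tokenized tweets as (label, tokens) pairs
toy_tweets = [
    ("positive", ["great", "flight", "great", "crew"]),
    ("negative", ["late", "flight"]),
    ("negative", ["lost", "bag", "late"]),
]

# One Counter per class: word -> number of occurrences in that class
class_word_counts = defaultdict(Counter)
for label, tokens in toy_tweets:
    class_word_counts[label].update(tokens)

# Total number of words per class and the shared vocabulary
n_words_per_class = {label: sum(counts.values())
                     for label, counts in class_word_counts.items()}
vocabulary = sorted({word for _, tokens in toy_tweets for word in tokens})

print(n_words_per_class, len(vocabulary))
```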

We now concatenate the training set with these counts, so that each row starts with the sentiment label and the tokenized tweet, followed by the per-word counts.

In [83]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,labels,column_having_text_of_tweet,incurring,purpose,sittin,keambleam,missedupgrades,caren,16mont,epicfail,trained,2uaicfjrms,tim,tore,reuse,warrants,reiterate,mfssh2uhue,clearing,term,complimenting,orlandosentinel,o1u96xc3bo,forgot,jvstatus,cleveland,omaha,interested,zl4bvexmcj,seating,use,rates,7t1rdrcre6,thin,becomes,nogood,statement,medical,t5mrj5yw6i,channels,...,southwestfail,odds,enforcing,livethelegend,private,threatening,deter,itsaaronchriz,users,matters,future,tiredofthis,verbiage,hn,64kn6geep8,delacy,willie,complaints,staring,belligerent,after2,ifeeldumb,lauderdale,baitandswitch,3hours,kick,street,ua1469,custserv,ujfs9zi6kd,seeing,kosher,neighbors,theft,sympathetic,slight,bankruptcies,3659,777,deactivate
0,neutral,"[united, 618, was, flight, out, of, houston]",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,negative,"[united, i, think, not, i, m, not, flying, you...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,positive,"[southwestair, have, had, a, companion, pass, ...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,negative,"[americanair, we, have, 3, more, passengers, w...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,positive,"[southwestair, i, managed, to, get, sorted, ou...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


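To classify a new tweet, the Naive Bayes algorithm needs to estimate three probabilities, one per class: P(positive | tweet), P(negative | tweet), and P(neutral | tweet). The tweet is assigned to the class with the highest value.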
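
For reference, the quantities computed in the following cells fit together as shown here. The symbols are chosen to mirror the variable names used in the code (`n_positive`, `n_vocabulary`, `alpha`, and so on); $c$ stands for one of the three classes and $w_1, \dots, w_n$ for the words of the tweet.

$$
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w \mid c) = \frac{N_{w \mid c} + \alpha}{N_{c} + \alpha \, N_{\text{vocabulary}}}
$$

Here $N_{w \mid c}$ is the number of times word $w$ occurs in tweets of class $c$, $N_{c}$ is the total number of words in tweets of class $c$, $N_{\text{vocabulary}}$ is the number of unique words, and $\alpha$ is the Laplace smoothing constant.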

In [0]:
###### Isolating tweets of different sentiments first
positive_tweets = training_set_clean[training_set_clean['labels'] == 'positive']
negative_tweets = training_set_clean[training_set_clean['labels'] == 'negative']
neutral_tweets = training_set_clean[training_set_clean['labels'] == 'neutral']

In [0]:
##### P(Positive), P(Negative) and P(Neutral)
p_positive = len(positive_tweets) / len(training_set_clean)
p_negative = len(negative_tweets) / len(training_set_clean)
p_neutral = len(neutral_tweets) / len(training_set_clean)

In [86]:
p_positive, p_negative, p_neutral

(0.16205601092896174, 0.6257684426229508, 0.21217554644808742)

In [0]:
###### N_Positive
n_words_per_positive_tweets = positive_tweets['column_having_text_of_tweet'].apply(len)
n_positive = n_words_per_positive_tweets.sum()

In [0]:
###### N_Negative
n_words_per_negative_tweets = negative_tweets['column_having_text_of_tweet'].apply(len)
n_negative = n_words_per_negative_tweets.sum()

In [0]:
###### N_Neutral
n_words_per_neutral_tweets = neutral_tweets['column_having_text_of_tweet'].apply(len)
n_neutral = n_words_per_neutral_tweets.sum()

In [0]:
###### N_Vocabulary
n_vocabulary = len(vocabulary)

In [0]:
###### Laplace smoothing
alpha = 1

In [92]:
n_positive, n_negative, n_neutral, n_vocabulary

(27457, 149510, 38079, 13264)

The values computed above act as constants in the Naive Bayes equations: they depend only on the training set, not on the tweet being classified.

In [0]:
# Initiate parameters
parameters_positive = {unique_word:0 for unique_word in vocabulary}
parameters_negative = {unique_word:0 for unique_word in vocabulary}
parameters_neutral = {unique_word:0 for unique_word in vocabulary}

Now, we will make calculations for these parameters.

In [0]:
# Calculate parameters
for word in vocabulary:
  
  n_word_given_positive = int(positive_tweets[word].sum())   # positive tweets already defined in a cell above
  p_word_given_positive = (n_word_given_positive + alpha) / (n_positive + alpha*n_vocabulary)
  parameters_positive[word] = p_word_given_positive

In [0]:

for word in vocabulary:
  n_word_given_negative = negative_tweets[word].sum()   # negative tweets already defined in a cell above
  p_word_given_negative = (n_word_given_negative + alpha) / (n_negative + alpha*n_vocabulary)
  parameters_negative[word] = p_word_given_negative

  n_word_given_neutral = neutral_tweets[word].sum()   # neutral tweets already defined in a cell above
  p_word_given_neutral = (n_word_given_neutral + alpha) / (n_neutral + alpha*n_vocabulary)
  parameters_neutral[word] = p_word_given_neutral

The parameters for all three classes are now calculated.
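
As a worked example of the smoothing formula, the sketch below plugs in the positive-class constants reported earlier (`n_positive = 27457`, `n_vocabulary = 13264`). The function name `smoothed_probability` and the count of 10 occurrences are hypothetical. The point is that a word never seen in positive tweets still gets a small non-zero probability.

```python
def smoothed_probability(word_count, n_class_words, n_vocabulary, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (n_class_words + alpha * n_vocabulary)

n_positive, n_vocabulary = 27457, 13264

# A word that never occurs in positive tweets: 1 / 40721
p_unseen = smoothed_probability(0, n_positive, n_vocabulary)

# A word that occurs 10 times (hypothetical count): 11 / 40721
p_seen_10 = smoothed_probability(10, n_positive, n_vocabulary)

print(p_unseen, p_seen_10)
```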

Let's try the classifier now!

In [0]:
import re
def classify(a_tweet):
    '''
    a_tweet: a string
    '''
    
    a_tweet = re.sub(r'\W', ' ', a_tweet)  # raw string avoids an invalid-escape warning
    a_tweet = a_tweet.lower().split()
    
    p_positive_given_a_tweet = p_positive
    p_negative_given_a_tweet = p_negative
    p_neutral_given_a_tweet = p_neutral

    for word in a_tweet:
        if word in parameters_positive:
            p_positive_given_a_tweet *= parameters_positive[word]
            
        if word in parameters_negative:
            p_negative_given_a_tweet *= parameters_negative[word]
            
        if word in parameters_neutral:
            p_neutral_given_a_tweet *= parameters_neutral[word]
    
    
    total_p = p_positive_given_a_tweet + p_negative_given_a_tweet + p_neutral_given_a_tweet

    p_positive_given_a_tweet = p_positive_given_a_tweet/total_p
    p_negative_given_a_tweet = p_negative_given_a_tweet/total_p
    p_neutral_given_a_tweet = p_neutral_given_a_tweet/total_p

    
    # Pick the class with the highest normalized probability
    # (ties go to the first class listed)
    class_probs = {
        'positive': p_positive_given_a_tweet,
        'negative': p_negative_given_a_tweet,
        'neutral': p_neutral_given_a_tweet,
    }
    return max(class_probs, key=class_probs.get)
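
Multiplying many small probabilities can underflow to zero for long inputs, in which case `total_p` would be zero and the normalization would fail. Tweets are short, so this is unlikely here, but a common remedy is to add logarithms instead of multiplying. The sketch below illustrates the idea; `classify_log` and the toy parameters are assumptions for illustration, not values learned from Tweets.csv.

```python
import math

def classify_log(tokens, priors, params):
    """Return the label with the highest log-posterior score."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for word in tokens:
            if word in params[label]:  # skip out-of-vocabulary words
                score += math.log(params[label][word])
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy parameters
toy_priors = {"positive": 0.5, "negative": 0.5}
toy_params = {
    "positive": {"love": 0.6, "hate": 0.1, "you": 0.3},
    "negative": {"love": 0.1, "hate": 0.6, "you": 0.3},
}

# Why logs help: a plain product of many small factors underflows to 0.0
product = 1.0
for _ in range(100):
    product *= 1e-5

label = classify_log(["love", "you"], toy_priors, toy_params)
print(label, product)
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest probability, so the predicted label is unchanged.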

In [120]:
senti_ =classify('I love you very very much')
senti_

'positive'

In [121]:
senti_ = classify('I hate you very very much')
senti_

'negative'

In [123]:
senti_ = classify('Please follow back')
senti_

'neutral'

Let's see the results on the test set now!

In [0]:
test_sentiments = []
for tweet in test_set['column_having_text_of_tweet']:
  senti = classify(tweet)
  test_sentiments.append(senti)

In [0]:
# Original sentiments (labels) on the test set:

original_sentiments = list(test_set['labels'])
test_set_len = len(original_sentiments)

In [0]:
#Accuracy:

indicators = [1 for i, j in zip(test_sentiments, original_sentiments) if i == j]

In [133]:
print("Accuracy on test set: "+str(round(len(indicators)/test_set_len*100))+"%")

Accuracy on test set: 77%
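
Accuracy is easier to judge next to a baseline. Since about 63% of the tweets are negative, a model that always predicts "negative" would already score roughly 63%, so 77% should be read against that figure. The sketch below shows the comparison on toy labels; the lists and function names are hypothetical.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def majority_baseline(actual):
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(actual).most_common(1)[0][1] / len(actual)

# Toy labels with a 60/20/20 class mix (hypothetical)
toy_actual = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2
toy_predicted = ["negative"] * 6 + ["negative", "neutral"] + ["negative", "positive"]

model_accuracy = accuracy(toy_actual, toy_predicted)
baseline_accuracy = majority_baseline(toy_actual)
print(model_accuracy, baseline_accuracy)
```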


**Confusion matrix for Positive class**

In [0]:
from operator import itemgetter

In [0]:
#indices where 'positive' sentiment is the label
index_positive_GT = [i for i in range(test_set_len) if original_sentiments[i] == 'positive']

In [141]:

items_at_those_indices = list(itemgetter(*index_positive_GT)(test_sentiments))
TP = items_at_those_indices.count('positive')
print(TP)
FN = items_at_those_indices.count('negative') + items_at_those_indices.count('neutral')
print(FN)

239
226


In [0]:
index_negative_GT = [i for i in range(test_set_len) if original_sentiments[i] != 'positive']

In [146]:

# index_negative_GT, not index_positive_GT: FP and TN are counted over
# the tweets whose true label is not 'positive'
items_at_those_indices = list(itemgetter(*index_negative_GT)(test_sentiments))
FP = items_at_those_indices.count('positive')
print(FP)
TN = items_at_those_indices.count('negative') + items_at_those_indices.count('neutral')
print(TN)


**Confusion matrix for every class:**

In [0]:
classes = list(set(data['labels'])) #classes or sentiments of tweets in the dataset

In [0]:
def Obtain_Confusion_Matrix(class_):

  index_class_GT = [i for i in range(test_set_len) if original_sentiments[i] == class_]
  #indices in the list of original labels where the class holds
  items_at_those_indices = list(itemgetter(*index_class_GT)(test_sentiments))
  #predictions at those items in the list of predictions
  TP = items_at_those_indices.count(class_) #True Positives of that class
  #print("True Positives: "+str(TP))
  FN = len(items_at_those_indices) - TP #If not TP, they are then False Negatives
  #print("False Negatives: "+str(FN))

  actual_P = [FN, TP]

  index_not_class_GT = [i for i in range(test_set_len) if original_sentiments[i] != class_]
  #indices in the list of original labels where the class does not hold
  items_at_those_indices = list(itemgetter(*index_not_class_GT)(test_sentiments))
  #predictions at those items in the list of predictions
  FP = items_at_those_indices.count(class_) #False Positives of that class
  #print("False Positives: "+str(FP))
  TN = len(items_at_those_indices) - FP #If not FP, they are then True Negatives
  #print("True Negatives: "+str(TN))

  actual_N = [TN, FP]
  
  conf_matrix = np.array([actual_N, actual_P])
  
  return conf_matrix

In [203]:
cf = {}
for class_ in classes:
  print("Obtaining the confusion matrix for the class: "+class_)
  cf[class_] = Obtain_Confusion_Matrix(class_)

Obtaining the confusion matrix for the class: negative
Obtaining the confusion matrix for the class: neutral
Obtaining the confusion matrix for the class: positive
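
The one-vs-rest matrices above can also be derived from a single multi-class confusion matrix. The following sketch counts (actual, predicted) pairs with `collections.Counter` and then collapses one class into the same `[[TN, FP], [FN, TP]]` layout used above. The function names and toy labels are assumptions. This approach also sidesteps a quirk of `itemgetter(*indices)`, which returns a single item rather than a tuple when given exactly one index and raises a `TypeError` when given none.

```python
from collections import Counter

def confusion_counts(actual, predicted, labels):
    """Rows are true labels, columns are predicted labels."""
    pair_counts = Counter(zip(actual, predicted))
    return [[pair_counts[(a, p)] for p in labels] for a in labels]

def one_vs_rest(matrix, k):
    """Collapse class k of a multi-class matrix into [[TN, FP], [FN, TP]]."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # true class k, predicted otherwise
    fp = sum(row[k] for row in matrix) - tp   # predicted k, true class differs
    tn = total - tp - fn - fp
    return [[tn, fp], [fn, tp]]

# Toy labels (hypothetical)
labels = ["negative", "neutral", "positive"]
toy_actual = ["negative", "negative", "neutral", "positive", "positive", "neutral"]
toy_predicted = ["negative", "neutral", "neutral", "negative", "positive", "neutral"]

full_matrix = confusion_counts(toy_actual, toy_predicted, labels)
positive_matrix = one_vs_rest(full_matrix, labels.index("positive"))
print(full_matrix, positive_matrix)
```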


In [192]:
# For every class, compute the performance metrics from that class's confusion matrix

Accu_list = []
Rec_list = []
Prec_list = []
F1_list = []

for class_ in classes:
  matrix = cf[class_]

  print("Sentiment : "+class_)
  Accu = (matrix[0][0] + matrix[1][1]) / sum(sum(matrix))
  Accu_list.append(Accu)
  print("Accuracy: "+str(round(Accu, 2)))

  Rec = matrix[1][1] / sum(matrix[1])
  Rec_list.append(Rec)
  print("Recall: "+str(round(Rec, 2)))

  Prec = matrix[1][1] / sum(matrix.T[1])
  Prec_list.append(Prec)
  print("Precision: "+str(round(Prec, 2)))

  F1 = 2*(Prec*Rec)/(Prec+Rec)
  F1_list.append(F1)
  print("F1 Score: "+str(round(F1, 2)))
  print("\n")


Sentiment : negative
Accuracy: 0.8
Recall: 0.97
Precision: 0.77
F1 Score: 0.86


Sentiment : neutral
Accuracy: 0.84
Recall: 0.38
Precision: 0.74
F1 Score: 0.51


Sentiment : positive
Accuracy: 0.9
Recall: 0.51
Precision: 0.82
F1 Score: 0.63
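
As a sanity check on these numbers, the sketch below recomputes two of the positive-class metrics. Recall uses the counts printed earlier (TP = 239, FN = 226), and F1 uses the rounded precision and recall reported above, so it is only an approximation of the exact value. The helper names are hypothetical.

```python
def precision(tp, fp):
    """Share of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn)

def f1_score(prec, rec):
    """Harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec)

# Positive class: TP and FN as printed in the confusion-matrix cell
positive_recall = recall(tp=239, fn=226)

# F1 from the rounded precision (0.82) and recall (0.51) reported above
positive_f1 = f1_score(0.82, 0.51)

print(round(positive_recall, 2), round(positive_f1, 2))
```

The low recall next to the high precision says that the classifier rarely mislabels other tweets as positive but misses about half of the truly positive ones.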




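Now, we calculate the macro-average of each performance metric above. Note that the macro accuracy averages the one-vs-rest accuracies of the three classes, so it is not the same quantity as the overall test accuracy of 77% reported earlier.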

In [202]:
Macro_F1 = sum(F1_list)/3
Macro_Accu = sum(Accu_list)/3
Macro_Rec = sum(Rec_list)/3
Macro_Prec = sum(Prec_list)/3

print("Macro Accuracy: "+str(round(Macro_Accu, 2)*100)+"%")
print("Macro Recall: "+str(round(Macro_Rec, 2)*100)+"%")
print("Macro Precision: "+str(round(Macro_Prec, 2)*100)+"%")
print("Macro F1 score: "+str(round(Macro_F1, 2)*100)+"%")

Macro Accuracy: 85.0%
Macro Recall: 62.0%
Macro Precision: 78.0%
Macro F1 score: 67.0%
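
The macro-averages can be cross-checked from the per-class values printed earlier. The sketch below uses those rounded values, so small deviations from the exact results are possible in general, and formats the output with an f-string percent specifier instead of `round(x, 2)*100`, which can occasionally show floating-point artifacts. The dictionary name `per_class` is an assumption.

```python
from statistics import mean

# Rounded per-class metrics as reported in the output above
per_class = {
    "negative": {"accuracy": 0.80, "recall": 0.97, "precision": 0.77, "f1": 0.86},
    "neutral":  {"accuracy": 0.84, "recall": 0.38, "precision": 0.74, "f1": 0.51},
    "positive": {"accuracy": 0.90, "recall": 0.51, "precision": 0.82, "f1": 0.63},
}

# Macro-average: unweighted mean over classes, regardless of class size
macro = {
    metric: mean(scores[metric] for scores in per_class.values())
    for metric in ["accuracy", "recall", "precision", "f1"]
}

for metric, value in macro.items():
    print(f"Macro {metric}: {value:.0%}")
```

Because every class counts equally, the weak recall on the neutral and positive classes pulls the macro recall well below the recall on the dominant negative class.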


THE END



