# Multinomial Naive Bayes

In this notebook, we perform a `Multinomial Naive Bayes` Classification on the `US Airline Tweet` Dataset, using sentiment analysis.

**Reference**
* [Multinomial NB](https://towardsdatascience.com/multinomial-naive-bayes-classifier-for-text-analysis-python-8dd6825ece67)

In [1]:
import collections

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

## 1. Import the Dataset

In [2]:
df = pd.read_csv("airline_tweet_processed.csv")
df.head()

Unnamed: 0,airline_sentiment,airline,text
0,1,Virgin America,said
1,2,Virgin America,plus added commercial experience tacky
2,1,Virgin America,today must mean need take another trip
3,0,Virgin America,really aggressive blast obnoxious entertainmen...
4,0,Virgin America,really big bad thing


**Fields**

* `airline_sentiment` : Sentiment
                           `0` - `Negative`
                           `1` - `Neutral`
                           `2` - `Positive`
* `airline` : Name of the Airline
* `text` : The words in the pre-processed tweet

## 2. Perform test train split

We allocate about 20% of the data for testing and the remaining will be used to train the model.
<br>
The input variable is the processed `text`, and the output variable is `airline_sentiment`.

In [3]:
output_classes = ['Negative', 'Neutral', 'Positive']
num_classes = len(df['airline_sentiment'].unique())
num_tweets = len(df)

In [4]:
train_percentage = 0.8
num_train = int(train_percentage * num_tweets)
num_test = num_tweets - num_train

In [5]:
print("Train set size : ", num_train)
print("Test set size : ", num_test)

Train set size :  11712
Test set size :  2928


In [6]:
# Shuffle your dataset 

shuffle_df = df.sample(frac=1)

In [7]:
train_df = shuffle_df[:num_train]
test_df = shuffle_df[num_train:]

In [8]:
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)

In [9]:
print(f"The training set has {num_train} sets of values.")
print(f"The testing set has {num_test} sets of values.")

The training set has 11712 sets of values.
The testing set has 2928 sets of values.


## 3. Class Distribution

![Class Probability image](./img/class_probability.png)

In [10]:
# Find the number of tweets of each class

probability_class = np.array([train_df[train_df['airline_sentiment'] == i]['airline_sentiment'].count() for i in range(num_classes)])

In [11]:
# Divide the values by the total number of tweets to get the probability of each class

probability_class = probability_class / num_train

In [12]:
# Convert this into a dictionary for better access

probability_class = {
    i : probability_class[i] for i in range(num_classes)
}

In [13]:
# Display the Class Probabilities

print("Class probabilities : \n")
for i in range(num_classes) :
    print(output_classes[i], " : ", probability_class[i])

Class probabilities : 

Negative  :  0.6239754098360656
Neutral  :  0.21174863387978143
Positive  :  0.164275956284153


## 4. Probability Distribution over Vocabulary

### 4.1 Prepare the Vocabulary

In [14]:
# Initialize a set to store all the words

vocabulary = set()

In [15]:
# Function to extract the vocabulary

def extractVocabulary(tweet) :
    for word in str(tweet).split(" ") :
        vocabulary.add(word)

In [16]:
# Find all the unique words

_ = train_df['text'].apply(extractVocabulary)

In [17]:
# Convert the vocabulary into a list

vocabulary = list(vocabulary)

In [18]:
vocabulary_count = len(vocabulary)

In [19]:
# Save this vocabulary

vocabulary_df = pd.DataFrame(columns=['index', 'word'])
vocabulary_df['word'] = vocabulary
vocabulary_df['index'] = [i for i in range(len(vocabulary))]
vocabulary_df.head()
vocabulary_df.to_csv('vocabulary_mapping.csv', index=False)

### 4.2 Form the Word Distribution Dataframe

In [20]:
word_distribution_df = pd.DataFrame(columns=['tweet_idx', 'word_idx', 'count', 'class_idx'])

In [21]:
i = 0
def extractWordDistribution(row) :
    global word_distribution_df, i
    
    tweet = row['text']
    temp_words = str(tweet).split(" ")
    temp_word_count = collections.Counter(temp_words)
    temp_word_count_arr = []
    temp_word_idx_arr = []
    for temp_word, temp_count in temp_word_count.items() :
        temp_word_idx_arr.append(int(vocabulary_df[vocabulary_df['word'] == temp_word]['index']))
        temp_word_count_arr.append(temp_count)
    
    # Concatenate the rows into the dataset
    temp_df = pd.DataFrame({
        'tweet_idx' : [i]*len(temp_word_count_arr),
        'word_idx' : temp_word_idx_arr,
        'count' : temp_word_count_arr,
        'class_idx' : [row['airline_sentiment']]*len(temp_word_count_arr)
    })
    word_distribution_df = pd.concat([
        word_distribution_df,
        temp_df
    ],ignore_index=True)
    
    i += 1
    if i % 1000 == 0 :
        print(i)

In [22]:
_ = train_df.apply(extractWordDistribution, axis=1)

1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000


In [23]:
# Save this distribution

word_distribution_df.to_csv("word_distribution.csv", index=False)

### 4.3 Probability of each word per class

For class `j` and word `i`, the average is given by:
<br>
![Word Class Probability Formula](./img/word_class_probability.png)

In [24]:
# Smoothing

alpha = 0.001

In [25]:
#Calculate probability of each word based on class

pb_ij = word_distribution_df.groupby(['class_idx','word_idx'])
pb_j = word_distribution_df.groupby(['class_idx'])
Pr =  (pb_ij['count'].sum() + alpha) / (pb_j['count'].sum() + vocabulary_count)

In [26]:
#Unstack series

Pr = Pr.unstack()

In [27]:
Pr

word_idx,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,9076,9077,9078,9079,9080,9081,9082,9083,9084,9085,9086,9087,9088,9089,9090,9091,9092,9093,9094,9095,9096,9097,9098,9099,9100,9101,9102,9103,9104,9105,9106,9107,9108,9109,9110,9111,9112,9113,9114,9115
class_idx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
0,3.8e-05,2.5e-05,,,1.3e-05,1.3e-05,0.000929,1.3e-05,,,1.3e-05,1.3e-05,,0.005047,,3.8e-05,,1.3e-05,0.000213,,,1.3e-05,2.5e-05,0.000477,,0.000565,3.8e-05,1.3e-05,2.5e-05,1.3e-05,5e-05,3.8e-05,2.5e-05,1.3e-05,1.3e-05,2.5e-05,1.3e-05,1.3e-05,,,...,1.3e-05,3.8e-05,0.000314,0.000515,2.5e-05,2.5e-05,1.3e-05,2.5e-05,,2.5e-05,,0.000113,5e-05,,3.8e-05,1.3e-05,,1.3e-05,1.3e-05,,5e-05,3.8e-05,,,1.3e-05,0.000138,,1.3e-05,5e-05,2.5e-05,1.3e-05,2.5e-05,3.8e-05,1.3e-05,,,1.3e-05,,0.000113,1.3e-05
1,0.000228,,3.8e-05,,,,0.000304,,,3.8e-05,,,,0.002777,3.8e-05,3.8e-05,3.8e-05,3.8e-05,3.8e-05,3.8e-05,,3.8e-05,7.6e-05,3.8e-05,,0.000685,,,,,3.8e-05,,3.8e-05,3.8e-05,,,,,3.8e-05,,...,,0.00019,0.000152,0.000342,,,,,7.6e-05,,3.8e-05,7.6e-05,0.00019,3.8e-05,,,,,,,,3.8e-05,,3.8e-05,,0.000495,,,,7.6e-05,3.8e-05,,,3.8e-05,3.8e-05,3.8e-05,,,0.000114,
2,4.4e-05,,4.4e-05,4.4e-05,,,0.000611,,4.4e-05,4.4e-05,,,4.4e-05,0.002923,,,4.4e-05,,4.4e-05,8.7e-05,4.4e-05,,4.4e-05,,4.4e-05,0.000305,,,,,8.7e-05,,4.4e-05,4.4e-05,,,,,4.4e-05,4.4e-05,...,,4.4e-05,0.000131,0.000305,,,,,8.7e-05,4.4e-05,,0.000131,8.7e-05,,,,0.000131,4.4e-05,,4.4e-05,,,4.4e-05,,,0.000262,4.4e-05,,,,,,,,,,,4.4e-05,,


In [28]:
#Replace NaN or columns with 0 as word count with a/(count+|V|+1)

for c in range(1,num_classes):
    Pr.loc[c,:] = Pr.loc[c,:].fillna(alpha/(pb_j['count'].sum()[c] + vocabulary_count))

In [29]:
#Convert to dictionary for better access

Pr_dict = Pr.to_dict()

## 5. Multinomial Naive Bayes

![MNB image](./img/mnb.png)

In [48]:
def MultinomialNaiveBayes(data) :
    '''
    Multinomial Naive Bayes classifier
    :param data [Pandas Dataframe]: Dataframe of data
    :return predict [list]: Predicted class ID
    '''
    
    #Using dictionaries for greater speed
    df_dict = data.to_dict()
    new_dict = {}
    predictions = []
    
    # new_dict = {docIdx : {wordIdx: count},....}
    for idx in range(len(df_dict['tweet_idx'])):
        tweetIdx = df_dict['tweet_idx'][idx]
        wordIdx = df_dict['word_idx'][idx]
        count = df_dict['count'][idx]
        try: 
            new_dict[tweetIdx][wordIdx] = count
        except:
            new_dict[df_dict['tweet_idx'][idx]] = {}
            new_dict[tweetIdx][wordIdx] = count
        
    # Calculating the scores for each tweet
    for tweetIdx in new_dict.keys():
        score_dict = {}
        # Creating a probability row for each class
        for classIdx in range(1,num_classes):
            score_dict[classIdx] = 1
            # For each word:
            for wordIdx in new_dict[tweetIdx]:
                try:
                    # Use frequency smoothing
                    # log(1+f)*log(Pr(i|j))
                    probability=Pr_dict[wordIdx][classIdx]         
                    power = np.log(1+ new_dict[tweetIdx][wordIdx])     
                    score_dict[classIdx]+=power*np.log(probability)
                except:
                    # Missing V will have log(1+0)*log(a/num_classes)=0 
                    score_dict[classIdx] += 0
            # Multiply final with probability of the class
            score_dict[classIdx] +=  np.log(probability_class[classIdx])
    
        #Get class with max probabilty for the given docIdx 
        max_score = max(score_dict, key=score_dict.get)
        predictions.append(max_score)
        
    return predictions

## 6. Make Predictions of the train Dataset

In [49]:
Y_train_pred = MultinomialNaiveBayes(word_distribution_df)
Y_train = train_df['airline_sentiment'].tolist()

In [54]:
# Save the train predictions

np.save('y_train_predictions.npy', np.array(Y_train_pred))

In [55]:
# Load the saved predictions

Y_train_pred = list(np.load('y_train_predictions.npy'))

In [50]:
# Calculate the Training Error

error = 0

for (i, j) in zip(Y_train_pred, Y_train) :
    if i != j :
        error += 1

In [51]:
train_error_rate = error * 100 / num_train
print("Training Error : ", train_error_rate, "%")

Training Error :  64.79678961748634 %


In [52]:
train_accuracy = 100 - train_error_rate
print("Training Accuracy : ", train_accuracy, "%")

Training Accuracy :  35.20321038251366 %


## 7. Test the model on Unseen data

In [38]:
# Form the Vocabulary from test set

# Initialize a set to store all the words
test_vocabulary = set()

# Function to extract the vocabulary
def extractTestVocabulary(tweet) :
    for word in str(tweet).split(" ") :
        test_vocabulary.add(word)

# Find all the unique words
_ = test_df['text'].apply(extractTestVocabulary)

# Convert the vocabulary into a list
test_vocabulary = list(test_vocabulary)

# Find the number of words in the vocabulary => |V_test|
test_vocabulary_count = len(test_vocabulary)

# Convert it into a dataframe
test_vocabulary_df = pd.DataFrame(columns=['index', 'word'])
test_vocabulary_df['word'] = test_vocabulary
test_vocabulary_df['index'] = [i for i in range(test_vocabulary_count)]

In [39]:
# Word Distribution

# Initialize a dataframe to store these details
test_word_distribution_df = pd.DataFrame(columns=['tweet_idx', 'word_idx', 'count', 'class_idx'])

i = 0
def extractTestWordDistribution(row) :
    global test_word_distribution_df, i
    # Extract the count of words
    tweet = row['text']
    temp_words = str(tweet).split(" ")
    temp_word_count = collections.Counter(temp_words)
    temp_word_count_arr = []
    temp_word_idx_arr = []
    for temp_word, temp_count in temp_word_count.items() :
        temp_word_idx_arr.append(int(test_vocabulary_df[test_vocabulary_df['word'] == temp_word]['index']))
        temp_word_count_arr.append(temp_count)
    # Concatenate the rows into the dataset
    temp_df = pd.DataFrame({
        'tweet_idx' : [i]*len(temp_word_count_arr),
        'word_idx' : temp_word_idx_arr,
        'count' : temp_word_count_arr,
        'class_idx' : [row['airline_sentiment']]*len(temp_word_count_arr)
    })
    test_word_distribution_df = pd.concat([
        test_vocabulary_df,
        temp_df
    ],ignore_index=True)
    # Increment the index
    i += 1

_ = test_df.apply(extractTestWordDistribution, axis=1)

In [53]:
# Make the predictions and use to to calculate the error rate
Y_test_pred = MultinomialNaiveBayes(test_word_distribution_df)
Y_test = test_df['airline_sentiment'].tolist()

# Calculate the Training Error
error = 0
for (i, j) in zip(Y_test_pred, Y_test_pred) :
    if i != j :
        error += 1

test_error_rate = error * 100 / num_test
print("Testing Error : ", test_error_rate, "%")

test_accuracy = 100 - test_error_rate
print("Testing Accuracy : ", test_accuracy, "%")

Testing Error :  0.0 %
Testing Accuracy :  100.0 %
