In [1]:
import numpy
import pandas as pd

from sklearn.model_selection import train_test_split

from EngineFiles.MachineLearning import NaiveBayesModel as NBM

In [2]:
# Load the cleaned tweets and drop rows where the tweet text is missing
df = pd.read_csv('Data/indonesia_Tweet/clean_tweets.csv')
df.dropna(subset=['Tweet'], inplace=True)

In [3]:
x_train, x_test, y_train, y_test = NBM.splitSet(df, 0.2, 42)

In [4]:
freqs = NBM.countTweets(x_train, y_train)

##### Calculate the logprior
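- The logprior is $\log(D_{pos}) - \log(D_{neg})$, where $D_{pos}$ and $D_{neg}$ are the numbers of positive and negative tweets in the training set.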
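
As a rough illustration of the logprior formula, the sketch below counts the two classes and takes the difference of their logs. It assumes labels are encoded as 1 (positive) and 0 (negative); `log_prior` and `toy_labels` are hypothetical names, not part of `NBM`.

```python
import math

def log_prior(labels):
    """Return log(D_pos) - log(D_neg) for a sequence of 0/1 labels.

    Assumption: 1 marks a positive tweet and 0 a negative tweet.
    """
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = sum(1 for y in labels if y == 0)
    return math.log(d_pos) - math.log(d_neg)

# Hypothetical toy labels: 2 positive, 3 negative
toy_labels = [1, 0, 0, 1, 0]
print(log_prior(toy_labels))  # negative, because negatives outnumber positives
```

A balanced label set gives exactly zero, and the sign flips when positives outnumber negatives.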

##### Calculate log likelihood
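- For each word in the vocabulary, the `naiveBayesTrain()` function uses the `lookup()` function to get that word's positive frequency, $freq_{pos}$, and negative frequency, $freq_{neg}$.
- It then computes the probability of the word in the positive class, $P(W_{pos})$, and in the negative class, $P(W_{neg})$, using equations 4 and 5. Here $N_{pos}$ and $N_{neg}$ are the total word counts in each class and $V$ is the number of unique words in the vocabulary; adding 1 to each frequency (Laplace smoothing) keeps the probabilities non-zero for words that never appear in one class.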

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

**Note:** The log likelihoods are stored in a dictionary. The key is the word and the value is the log likelihood of that word.

- We can then compute the log likelihood of each word:

$$ loglikelihood = \log \left( \frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6} $$
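
The steps above can be sketched in a few lines of Python. This is only an illustration of equations 4 to 6, not the actual `naiveBayesTrain()` code: it assumes the frequency table is a dictionary keyed by `(word, label)` pairs with label 1 for positive and 0 for negative, and `log_likelihoods` and `toy_freqs` are hypothetical names.

```python
import math

def log_likelihoods(freqs):
    """Compute log(P(W_pos) / P(W_neg)) for every word, with Laplace smoothing.

    Assumption: freqs maps (word, label) -> count, label 1 = positive, 0 = negative.
    """
    vocab = {word for word, _ in freqs}
    v = len(vocab)  # V: number of unique words
    n_pos = sum(c for (_, label), c in freqs.items() if label == 1)  # N_pos
    n_neg = sum(c for (_, label), c in freqs.items() if label == 0)  # N_neg

    result = {}
    for word in vocab:
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)
        p_w_pos = (freq_pos + 1) / (n_pos + v)  # equation 4
        p_w_neg = (freq_neg + 1) / (n_neg + v)  # equation 5
        result[word] = math.log(p_w_pos / p_w_neg)  # equation 6
    return result

# Hypothetical toy counts, for illustration only
toy_freqs = {("bagus", 1): 3, ("bagus", 0): 1, ("buruk", 0): 4}
for word, value in sorted(log_likelihoods(toy_freqs).items()):
    print(word, round(value, 4))
```

Words seen mostly in positive tweets get a positive value and words seen mostly in negative tweets get a negative one; the smoothing term keeps the ratio finite for words missing from one class.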

In [5]:
logprior, loglikelihood = NBM.naiveBayesTrain(freqs, x_train, y_train)
print(logprior, len(loglikelihood))

-0.15142484267089884 12327


## Accuracy test of the Naive Bayes model

Now that we have the logprior and the loglikelihood, we can test the Naive Bayes model by making predictions on some tweets.

The prediction function takes in the tweet, the logprior, and the loglikelihood dictionary. For each tweet, it sums the loglikelihoods of the words in the tweet and adds the logprior to that sum to get the predicted sentiment. The result $p$ is a log-odds score rather than a probability: a value above zero favors the positive class and a value below zero favors the negative class.
$$ p = logprior + \sum_{i=1}^{N} loglikelihood_i $$
where $N$ is the number of words in the tweet.

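A logprior of 0.0 would mean the training set is perfectly balanced, so adding it to the summed log likelihood would change nothing. Whenever the data is not perfectly balanced, the logprior is non-zero. Here it is about $-0.151$, which indicates that the training set contains somewhat more negative tweets than positive ones.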
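
The scoring rule can be sketched as follows. This is a minimal illustration, not the `NBM` implementation: it assumes the tweet is already tokenized into a list of words and that words missing from the dictionary contribute nothing to the sum. `predict_score` and the toy values are hypothetical.

```python
def predict_score(tokens, logprior, loglikelihood):
    """Return logprior plus the sum of the log likelihoods of the known words.

    Assumption: unseen words are skipped (they add 0.0 to the score).
    """
    score = logprior
    for word in tokens:
        score += loglikelihood.get(word, 0.0)
    return score

# Hypothetical values, for illustration only
toy_logprior = -0.15
toy_loglikelihood = {"bagus": 1.0, "buruk": -1.2}

score = predict_score(["film", "bagus"], toy_logprior, toy_loglikelihood)
label = 1 if score > 0 else 0  # 1 = positive, 0 = negative
print(score, label)
```

Comparing the score against zero gives the predicted class, so a slightly negative logprior only changes the outcome for tweets whose words carry almost no sentiment signal.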

In [6]:
acc, y_pred = NBM.naiveBayesAccuracy(x_test, y_test, logprior, loglikelihood)
print(f"Naive Bayes accuracy = {acc:.4f}")

Naive Bayes accuracy = 0.8337


In [7]:
from sklearn.metrics import accuracy_score, classification_report, f1_score
print(f1_score(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8075040783034257
0.8337442761535752
              precision    recall  f1-score   support

           0       0.81      0.90      0.85      1527
           1       0.87      0.75      0.81      1312

    accuracy                           0.83      2839
   macro avg       0.84      0.83      0.83      2839
weighted avg       0.84      0.83      0.83      2839

