<h1>Multinomial Naive Bayes</h1><br>
Sentiment Analysis<br>
Good sentiment analysis datasets are very hard to find, so any weak results below are as much the dataset's fault as the model's

Importing

In [157]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import spacy

Chat_dataset=pd.read_csv('..//Datasets/chat_dataset.csv')
Chat_dataset.head()

Unnamed: 0,message,sentiment
0,I really enjoyed the movie,positive
1,The food was terrible,negative
2,I'm not sure how I feel about this,neutral
3,The service was excellent,positive
4,I had a bad experience,negative


Counting the categorical values (how many messages are there per class?)

In [158]:
Chat_dataset.sentiment.value_counts()

sentiment
neutral     259
positive    178
negative    147
Name: count, dtype: int64

The size of the dataset. Note that .size counts cells (rows × columns), so 1168 means 584 rows × 2 columns

In [159]:
Chat_dataset.size

1168

<h1>Preprocessing</h1>

There are many preprocessing tools, such as: <br>
<b>Lemmatization</b>: Transforms a word into its lemma (its dictionary form); for instance, the word 'running' becomes 'run'. The problem with this is lost context and personality, and the sentence itself can become weird. Imagine training a model on a sentence such as "i am run to the store", or something even worse if you preprocess your dataset further.<br>
<b>Stemming</b>: Strips affixes (mostly suffixes) from a word to reduce it to a stem, which need not be a real word. <br>
<b>Stop word removal</b>: Removes words such as 'the', 'and', 'of', 'a', 'an'. <br>
<b>Part of speech tagging</b>: This is an interesting one: it assigns every word its part of speech (noun, verb, adjective...). It can help.<br>
<b>Named entity recognition</b>: Another good one, used to identify names of people, places and organizations.<br>
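
To make two of these steps concrete, here is a minimal pure-Python sketch of stop word removal and a very crude suffix stemmer. The `STOP_WORDS` set holds only the five examples listed above, and `SUFFIXES`, `remove_stop_words` and `naive_stem` are hypothetical names made up for this illustration; a real project would rely on spaCy or NLTK instead.

```python
# Toy illustration only, not a substitute for a real NLP library.
STOP_WORDS = {"the", "and", "of", "a", "an"}  # the five examples from the list above
SUFFIXES = ("ing", "ed", "s")                  # hypothetical, deliberately tiny rule set


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = "I am running to the store".split()
without_stops = remove_stop_words(tokens)
stemmed = [naive_stem(t) for t in without_stops]

print(without_stops)  # ['I', 'am', 'running', 'to', 'store']
print(stemmed)        # ['I', 'am', 'runn', 'to', 'store']
```

The stem 'runn' is not a word, which is the usual trade-off with stemming: it is cheap, but unlike lemmatization it does not promise dictionary forms.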

Data cleaning. 


In [160]:

# Function to remove links and special characters, and to collapse extra whitespace.
def clean_text(text):
    # remove links by replacing them with an empty string
    text=re.sub(r'http\S+|www\S+','',text)
    # keep only letters (including accented ones) and whitespace;
    # this also drops digits and apostrophes, so "I'm" becomes "Im"
    text=re.sub(r'[^a-zA-ZÀ-ú\s]','',text)
    # split on whitespace and rejoin with single spaces
    tokens=text.split()
    cleaned_text=' '.join(tokens)
    return cleaned_text


# Function to add Part Of Speech Tags (pos_tags) 
nlp=spacy.load('en_core_web_sm')
def possifier(text_df):
    pos_tags_list = []
    for text in text_df:
        doc = nlp(text)
        pos_tags = [(token.pos_) for token in doc]
        pos_tags_list.append(pos_tags)
    return pos_tags_list

# Function to count the number of tags
def process_pos_tags(tags):
    pos_counts = {}
    for tag in tags:
        if tag in pos_counts:
            pos_counts[tag] += 1
        else:
            pos_counts[tag] = 1
    return pos_counts
# .apply(function) is used in the next cell to run this function on every row of a column
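
As a side note, the tag-counting loop does the same job as `collections.Counter` from the standard library. The sketch below checks that on a hand-written list of tags (the tags are example values, not spaCy output); `process_pos_tags` is rewritten here with `dict.get` only to keep the block self-contained.

```python
from collections import Counter


def process_pos_tags(tags):
    """Manual tally with the same result as the notebook function."""
    pos_counts = {}
    for tag in tags:
        pos_counts[tag] = pos_counts.get(tag, 0) + 1
    return pos_counts


tags = ["PRON", "VERB", "DET", "ADJ", "NOUN", "PRON"]  # hand-written example tags
manual = process_pos_tags(tags)
with_counter = dict(Counter(tags))

print(manual)                  # {'PRON': 2, 'VERB': 1, 'DET': 1, 'ADJ': 1, 'NOUN': 1}
print(manual == with_counter)  # True
```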



In [161]:
# Using the clean_text function to ... clean the text 
Chat_dataset['message']=[clean_text(text) for text in Chat_dataset['message']]

# Adding extra columns 
pos_count=possifier(Chat_dataset['message'])
Chat_dataset['pos_count']=pos_count

Chat_dataset['pos_counts_per_row'] = Chat_dataset['pos_count'].apply(process_pos_tags)
Chat_dataset.pos_counts_per_row.head()


Chat_dataset.head()


Unnamed: 0,message,sentiment,pos_count,pos_counts_per_row
0,I really enjoyed the movie,positive,"[PRON, ADV, VERB, DET, NOUN]","{'PRON': 1, 'ADV': 1, 'VERB': 1, 'DET': 1, 'NO..."
1,The food was terrible,negative,"[DET, NOUN, AUX, ADJ]","{'DET': 1, 'NOUN': 1, 'AUX': 1, 'ADJ': 1}"
2,Im not sure how I feel about this,neutral,"[PRON, VERB, PART, ADJ, SCONJ, PRON, VERB, ADP...","{'PRON': 3, 'VERB': 2, 'PART': 1, 'ADJ': 1, 'S..."
3,The service was excellent,positive,"[DET, NOUN, AUX, ADJ]","{'DET': 1, 'NOUN': 1, 'AUX': 1, 'ADJ': 1}"
4,I had a bad experience,negative,"[PRON, VERB, DET, ADJ, NOUN]","{'PRON': 1, 'VERB': 1, 'DET': 1, 'ADJ': 1, 'NO..."
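
Row 2 above now reads 'Im' instead of "I'm": the cleaning regex keeps only letters and whitespace, so apostrophes, digits and punctuation all disappear. A small self-contained sketch of the same cleaning steps on a few made-up sentences (the sample strings and the example.com link are assumptions for illustration):

```python
import re


def clean_text(text):
    """Same steps as the notebook's clean_text: drop links, drop non-letters, normalize spaces."""
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = re.sub(r'[^a-zA-ZÀ-ú\s]', '', text)
    return ' '.join(text.split())


samples = [
    "I'm not sure how I feel about this",
    "Don't miss it: https://example.com/deal!!",
    "Rated 10/10, café was great",
]
cleaned = [clean_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Contractions such as "don't" collapse into 'dont', which is worth keeping in mind because negations carry a lot of the sentiment signal.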


The model

In [162]:
# Features: only the cleaned message text is vectorized
# (the POS columns added above are not used by the model)
X=Chat_dataset['message']

# Target is the sentiment so y=sentiment
y=Chat_dataset['sentiment']

# Bag-of-words counts; note the vocabulary is learned from all rows before the split
cv=CountVectorizer()
X=cv.fit_transform(X)

X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.3,random_state=0)

# Two models: one fitted on the training split, one fitted on every row
# The split model
mnb=MultinomialNB()
mnb.fit(X_train,y_train)

# The full model (it also sees the rows in X_test during training)
# Defaults are alpha=1, fit_prior=True
mnb_full=MultinomialNB(alpha=0.5, fit_prior=False)
mnb_full.fit(X,y)
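
To show roughly what the alpha and fit_prior settings mean, here is a minimal pure-Python sketch of multinomial Naive Bayes over word counts. It is not scikit-learn's implementation: `tokenize`, `fit` and `predict` are hypothetical helpers, the four training sentences are rows 0, 1, 3 and 4 from the head of the dataset, and the tokenizer is a plain lowercase split (CountVectorizer's default tokenizer additionally drops one-letter words).

```python
import math
from collections import Counter, defaultdict

train = [
    ("I really enjoyed the movie", "positive"),
    ("The food was terrible", "negative"),
    ("The service was excellent", "positive"),
    ("I had a bad experience", "negative"),
]


def tokenize(text):
    """Simplified tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def fit(samples, alpha=1.0):
    """Collect per-class word counts, class counts and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab, alpha


def predict(model, text, fit_prior=True):
    """Return the best label and the log-score of every class."""
    word_counts, class_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # fit_prior=False replaces the observed class frequencies with a uniform prior
        prior = class_counts[label] / n_docs if fit_prior else 1 / len(class_counts)
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for token in tokenize(text):
            if token not in vocab:
                continue  # words never seen in training are skipped
            # additive smoothing: alpha keeps unseen (word, class) pairs above zero
            likelihood = (word_counts[label][token] + alpha) / (total + alpha * len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get), scores


model = fit(train, alpha=1.0)
label, scores = predict(model, "the movie was excellent")
print(label)  # positive
```

A smaller alpha trusts the observed counts more, and a uniform prior stops the majority class (neutral in this dataset) from getting a head start.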

<h1>How good is the model?</h1>

In [163]:
# predicting 
# split model
pred_mnb_split=mnb.predict(X_test)
# full model
pred_mnb_full=mnb_full.predict(X_test)

# Fit the encoder once on all labels so every array shares the same mapping
# (classes are sorted alphabetically: negative=0, neutral=1, positive=2)
le=LabelEncoder()
le.fit(y)
y_test_encoded=le.transform(y_test)
pred_mnb_split_encoded=le.transform(pred_mnb_split)
pred_mnb_full_encoded=le.transform(pred_mnb_full)

# root mean squared error (np.sqrt of the MSE)
mse_mnb_split=np.sqrt(mean_squared_error(y_test_encoded,pred_mnb_split_encoded))
mse_mnb_full=np.sqrt(mean_squared_error(y_test_encoded,pred_mnb_full_encoded))

# mean absolute error
mae_mnb_split= mean_absolute_error(y_test_encoded,pred_mnb_split_encoded)
mae_mnb_full= mean_absolute_error(y_test_encoded,pred_mnb_full_encoded)

# accuracy 
acc_mnb_split=accuracy_score(y_test,pred_mnb_split)
acc_mnb_full=accuracy_score(y_test,pred_mnb_full)

# Printing results (the error values are square roots of the MSE, so they are labelled as RMSE)
print(f'Accuracy of the split model: {acc_mnb_split}')
print(f'Accuracy of the full model: {acc_mnb_full}')
print(f'Root Mean Squared Error of the split model: {mse_mnb_split}')
print(f'Root Mean Squared Error of the full model: {mse_mnb_full}')
print(f'Mean Absolute Error of the split model: {mae_mnb_split}')
print(f'Mean Absolute Error of the full model: {mae_mnb_full}')

Accuracy of the split model: 0.8181818181818182
Accuracy of the full model: 0.9715909090909091
Root Mean Squared Error of the split model: 0.648424664440323
Root Mean Squared Error of the full model: 0.25
Mean Absolute Error of the split model: 0.26136363636363635
Mean Absolute Error of the full model: 0.03977272727272727
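
One caveat when reading these numbers: mnb_full was fitted on all rows, including the ones in X_test, so it is scored on data it has already seen. The toy sketch below shows why that inflates accuracy. It uses a hypothetical lookup-table "model" and made-up messages whose labels simply alternate, so there is nothing to learn beyond memorizing.

```python
# Hypothetical data: 200 unique messages with alternating labels.
rows = [
    (f"message {i}", "positive" if i % 2 == 0 else "negative")
    for i in range(200)
]
train, test = rows[:140], rows[140:]


def fit_memorizer(samples):
    """A 'model' that only memorizes text -> label pairs."""
    return dict(samples)


def accuracy(model, samples, default="positive"):
    """Share of samples whose memorized (or default) label matches the true one."""
    hits = sum(model.get(text, default) == label for text, label in samples)
    return hits / len(samples)


split_model = fit_memorizer(train)  # never saw the test rows
full_model = fit_memorizer(rows)    # saw every row, test rows included

print(accuracy(split_model, test))  # 0.5
print(accuracy(full_model, test))   # 1.0
```

The second score says nothing about new messages, which is why only a model that never saw the test rows gives an honest estimate.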


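The full model scores higher, but it was trained on every row, including the test rows it is scored on here, so its numbers are inflated and the comparison is not a fair one <br>
I planned to use only the full one for simplicity, though the predictor function below actually calls mnb, the split model<br>
Let's see how it does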

In [164]:
# Predictor function
def mnb_predictor(text):
    # Put the input text through the same preprocessing as the training data
    text_cleaned = clean_text(text)
    text_transformed = cv.transform([text_cleaned])

    mnb_pred = mnb.predict(text_transformed)
    # Creating a DataFrame for better visualization
    result_df = pd.DataFrame({'Text': [text],
                              'MNB Prediction': mnb_pred
                              })
    
    return result_df

# Function that applies the predictor function to each row
# Each row in our text files is separated by a newline
# So if the format is different, modifying this function might be required
def process_text_lines(text):
    frames = []
    with open(text, 'r') as file:
        for line in file:
            line = line.strip()
            if line:
                frames.append(mnb_predictor(line))
    # Concatenate once at the end instead of growing a DataFrame inside the loop
    if not frames:
        return pd.DataFrame(columns=['Text', 'MNB Prediction'])
    return pd.concat(frames, ignore_index=True)


How well does the model identify positive sentiment?

In [165]:
text = '..//text_positive.txt'

predictions_df=process_text_lines(text)
prediction_counts = predictions_df['MNB Prediction'].value_counts()
print(prediction_counts)

MNB Prediction
positive    45
neutral      5
negative     1
Name: count, dtype: int64


How about negative sentiment?

In [166]:
text_n='..//text_negative.txt'
predictions_df=process_text_lines(text_n)
prediction_counts = predictions_df['MNB Prediction'].value_counts()
print(prediction_counts)

MNB Prediction
positive    26
negative    19
neutral      5
Name: count, dtype: int64


Neutral sentiment? This one is way trickier, even for humans.

In [167]:
text_ne='..//text_neutral.txt'
predictions_df=process_text_lines(text_ne)
prediction_counts = predictions_df['MNB Prediction'].value_counts()
print(prediction_counts)

MNB Prediction
positive    14
neutral     13
negative     3
Name: count, dtype: int64


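It seems that both the dataset and the model have reached their limit and won't improve any further.<br>
Positive sentiment is identified best, with 45 out of 51 correct, against 19 out of 50 for negative and 13 out of 30 for neutral; the model clearly leans towards predicting positive <br>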
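
The three runs can be put side by side as per-class recall. The counts in this small sketch are copied from the outputs above, and it assumes that each text file contains messages of a single sentiment only.

```python
# Rows: true sentiment of the file; inner dict: what the model predicted.
results = {
    "positive": {"positive": 45, "neutral": 5, "negative": 1},
    "negative": {"positive": 26, "negative": 19, "neutral": 5},
    "neutral": {"positive": 14, "neutral": 13, "negative": 3},
}

recall = {}
for true_label, predicted in results.items():
    total = sum(predicted.values())
    recall[true_label] = predicted[true_label] / total
    print(f"{true_label}: {predicted[true_label]}/{total} = {recall[true_label]:.2f}")

# Share of 'positive' predictions that were actually positive
predicted_positive = sum(p["positive"] for p in results.values())
precision_positive = results["positive"]["positive"] / predicted_positive
print(f"positive precision: 45/{predicted_positive} = {precision_positive:.2f}")
```

This prints recalls of 0.88, 0.38 and 0.43, and a positive precision of 0.53: the high positive recall comes partly from the model labelling most things as positive.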