# This portfolio project is part of the AI/ML Engineering career path provided by Codecademy and aims to practically apply end-to-end machine learning (ML) workflow steps . It particularly aims to utilize Naive bayes (using bayes theorem) in classifying sentiment.
-------------------

## Project goals:
    > Applying the famous probability theorem `bayes theorem` through the `Naive Bayes` model in order to calculate the probabilty of a text falling into specific sentiment classes.
    > Showcase a general end-to-end machine learning workflow.
    > Build pipelines for effieciency, reproducability, and generalizabilty of the ML workflow. 

----
## General Workflow (Iterative):

1. **Extract, Transform, Load (ETL):** Import and prepare the dataset (e.g., from CSV or web source).
2. **Basic Preprocessing:** Handle null values, remove duplicates, encode labels.
3. **Train-Test-Validation Split:** Separate the data for training, tuning, and final evaluation.
4. **Exploratory Data Analysis (EDA):** Analyze label distribution, text lengths, frequent tokens, and other insights.
5. **Feature Engineering:** Convert raw text into features using methods like `CountVectorizer`, including potential bigrams or stopword removal.
6. **Model Selection:** Choose a suitable model (e.g., `MultinomialNB`) and pipeline approach.
7. **Model Evaluation:** Evaluate performance using metrics like accuracy, precision, recall, F1-score, and confusion matrix.
8. **Hyperparameter Tuning:** Adjust parameters such as n-gram range, `alpha` (NB smoothing), or vocabulary size.
9. **Model Validation:** Ensure model generalizes well using validation data.
10. **ML Pipeline Building:** Package preprocessing and model steps into a `Pipeline` for consistent inference (and future deployment).

---
## Since we are dealing with **text classification**, NLP preprocessing steps are required (lowercasing, punctuation removal, etc.), and features will be extracted using `CountVectorizer`. I will begin by performing cleaning, then Basic EDA (e.g. Checking the most and least repeating words or phrases), and proceed to modeling, evaluation, and tuning using a Naive Bayes classifier.

## The dataset will include the following:
**target**: the polarity of the tweet (0 = negative, 4 = positive)

**ids**: The id of the tweet (2087)

**date**: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

**flag**: The query (lyx). If there is no query, then this value is NO_QUERY.

**user**: the user that tweeted (robotickilldozr)

**text**: the text of the tweet (Lyx is cool)

Note: The dataset can viewed here: [Dataset](https://www.kaggle.com/datasets/kazanova/sentiment140)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv("./training.1600000.processed.noemoticon.csv", encoding='latin-1', header=None)
df.columns = ['target', 'ids', 'date', 'flag', 'user', 'text']
df = df[['text', 'target']]  # Will include the text and the sentiment of that text
df['target'] = df['target'].map({0: 0, 4: 1}) #Mapping the sentiments into 0, and 1 
df = df.sample(n=100000, random_state=42).reset_index(drop=True) # Will sample 20000 from the original 1.6 million rows

## Getting to know the dataset

In [3]:
# A look into the dataframe
df.head()

Unnamed: 0,text,target
0,@chrishasboobs AHHH I HOPE YOUR OK!!!,0
1,"@misstoriblack cool , i have no tweet apps fo...",0
2,@TiannaChaos i know just family drama. its la...,0
3,School email won't open and I have geography ...,0
4,upper airways problem,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    100000 non-null  object
 1   target  100000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.5+ MB


In [5]:
# The target vaariables in place
df.target.unique()

array([0, 1], dtype=int64)

In [6]:
df.target.value_counts() # Very similar class distribution - No to Little class imbalance

target
1    50057
0    49943
Name: count, dtype: int64

## Preprocessing steps

In [7]:
# Any null fields?
df.isna().any()

text      False
target    False
dtype: bool

In [8]:
# Are there duplicated rows?
df.duplicated().any()

True

In [9]:
# Dropping duplicates
df = df.drop_duplicates()

In [10]:
df.target.value_counts()

target
1    49954
0    49771
Name: count, dtype: int64

### Splitting the dataset before proceeding with text preprocessing (85% Training set, 15% Testing set)

In [11]:
x = df['text']
y = df['target']

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x, y , test_size = 0.15, random_state = 42)

## Basic Text Processing and EDA

    > As the dataset contains punctuations, mentions  targeting other users, and special characters they will be removed. Additionally, all words will be lowercased to avoid missing up vocabulary needed (e.g. "data!" will not be considered the same as "data") 
Note: You can view this article for many techniques: [Article](https://www.analyticsvidhya.com/blog/2022/01/text-cleaning-methods-in-nlp/)

In [13]:
import re
import string
from sklearn.preprocessing import FunctionTransformer 
from Text_Preprocessing import remove_mentions, remove_punctuations, lower_case_remove_whitespace, remove_non_ascii ,punctuations, mention_pattern

### Some unit tests for the text preprocessing functions built in the Text_Preprocessing.py file will be conducted below

In [14]:
!python -m unittest -v Unit_Tests

test_lower_case_remove_whitespace (Unit_Tests.test_text_preprocessing.test_lower_case_remove_whitespace) ... ok
test_remove_mentions (Unit_Tests.test_text_preprocessing.test_remove_mentions) ... ok
test_remove_non_ascii (Unit_Tests.test_text_preprocessing.test_remove_non_ascii) ... ok
test_remove_punctuations (Unit_Tests.test_text_preprocessing.test_remove_punctuations) ... ok

----------------------------------------------------------------------
Ran 4 tests in 0.002s

OK


In [15]:
from sklearn.pipeline import Pipeline

In [16]:
# Wrapping them in the FunctiopnTransformer in order to use them within pipelines
mention_remover = FunctionTransformer(remove_mentions, validate=False)
punctuation_remover = FunctionTransformer(remove_punctuations, validate=False)
lowercaser = FunctionTransformer(lower_case_remove_whitespace, validate=False)
ascii_cleaner = FunctionTransformer(remove_non_ascii, validate=False)

# Building the text preprocessing pipeline
text_preprocessing_pipeline = Pipeline([('remove_mentions', mention_remover),
                                        ('remove_punctuations', punctuation_remover),
                                        ('lowercase_strip', lowercaser),
                                       ('remove_ascii_characters', ascii_cleaner)])

In [17]:
x_train_preprocessed = text_preprocessing_pipeline.fit_transform(x_train) # An exmaple dataframe to showcase the difference against the original first 5 rows

In [18]:
x_train_preprocessed = pd.DataFrame(x_train_preprocessed, columns = ["text"])

In [19]:
x_train_preprocessed.head()

Unnamed: 0,text
66729,ps i miss you
16589,oops i mean later on since its after 12 already
53665,my so knows more about basketball than i do
23198,silicone sucks we must always support ori bro ...
15765,the rest of us understand our real blessings l...


### We will show the most commonly used single words and 2 word phrases alongside the least ones as part of a simple `textual EDA`. 

### `CountVectorizer()` which is part of the preprocessing module of sklearn will be used to make a dictionary of words and their respective counts. The CountVectorizer() and many other techniques could fall into the `feature engineering` aspect as we provied the model with that dictionary (vocabulary) in order to train and then be evaluated and tuned. 

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

In [21]:
# Building the pipelines needed
word_pipeline = Pipeline([('text_preprocess', text_preprocessing_pipeline),
                          ('vectorizer', CountVectorizer())])

two_words_pipeline = Pipeline([('text_preprocess', text_preprocessing_pipeline),
                          ('vectorizer', CountVectorizer(ngram_range = (2,2)))])

In [22]:
word_pipeline.fit(x_train)
two_words_pipeline.fit(x_train)

In [23]:
# Showcasing the dictionary 
w_vectorizer = word_pipeline.named_steps["vectorizer"]
w_vectorizer.vocabulary_

{'ps': 44756,
 'miss': 37173,
 'you': 63798,
 'oops': 41056,
 'mean': 36164,
 'later': 33143,
 'on': 40842,
 'since': 51319,
 'its': 30630,
 'after': 2656,
 '12': 244,
 'already': 3383,
 'my': 38445,
 'so': 52169,
 'knows': 32513,
 'more': 37717,
 'about': 2133,
 'basketball': 5963,
 'than': 55805,
 'do': 15353,
 'silicone': 51255,
 'sucks': 54116,
 'we': 61077,
 'must': 38380,
 'always': 3445,
 'support': 54495,
 'ori': 41248,
 'bro': 8439,
 'wah': 60602,
 'mine': 37014,
 'lasts': 33116,
 'for': 19860,
 'year': 63475,
 'nia': 39273,
 'the': 55918,
 'rest': 47906,
 'of': 40442,
 'us': 59625,
 'understand': 59142,
 'our': 41376,
 'real': 46985,
 'blessings': 7273,
 'lie': 33709,
 'in': 29686,
 'friends': 20317,
 'family': 18433,
 'and': 3740,
 'love': 34587,
 'listening': 33989,
 'to': 56881,
 'shinees': 50701,
 '2nd': 929,
 'mini': 37040,
 'album': 3112,
 'quotromeoquot': 46148,
 'noooooo': 39762,
 'poor': 43799,
 'cows': 12707,
 'come': 11720,
 'droops': 16009,
 'amp': 3604,
 'drink':

In [24]:
# Showcasing the dictionary 
tw_vectorizer = two_words_pipeline.named_steps["vectorizer"]
tw_vectorizer.vocabulary_

{'ps miss': 261476,
 'miss you': 207052,
 'oops mean': 242380,
 'mean later': 202647,
 'later on': 182898,
 'on since': 239747,
 'since its': 288707,
 'its after': 170620,
 'after 12': 9054,
 '12 already': 783,
 'my so': 218112,
 'so knows': 293091,
 'knows more': 180389,
 'more about': 209881,
 'about basketball': 6051,
 'basketball than': 37497,
 'than do': 314213,
 'silicone sucks': 288261,
 'sucks we': 306575,
 'we must': 364282,
 'must always': 214490,
 'always support': 14973,
 'support ori': 308521,
 'ori bro': 243890,
 'bro wah': 52515,
 'wah mine': 358352,
 'mine lasts': 205962,
 'lasts for': 182542,
 'for year': 114542,
 'year nia': 383290,
 'the rest': 322786,
 'rest of': 271482,
 'of us': 234818,
 'us understand': 354463,
 'understand our': 350993,
 'our real': 245017,
 'real blessings': 267400,
 'blessings lie': 47228,
 'lie in': 185873,
 'in friends': 161049,
 'friends family': 116992,
 'family and': 103244,
 'and love': 20793,
 'listening to': 188768,
 'to shinees': 3382

### 1- The the top 10 words of each vectorizer
### 2- The most and least used word(s)

In [25]:
get_top_n = lambda pipeline, vectorized_data, n: sorted(zip(pipeline.named_steps['vectorizer'].get_feature_names_out(), 
                                                     vectorized_data.sum(axis=0).A1), 
                                                     key=lambda x: x[1], 
                                                      reverse=True)[:n]

get_all_data_sorted = lambda pipeline, vectorized_data: sorted(zip(pipeline.named_steps['vectorizer'].get_feature_names_out(), 
                                                     vectorized_data.sum(axis=0).A1), 
                                                     key=lambda x: x[1], 
                                                      reverse=True)

In [26]:
word_vocabulary = word_pipeline.fit_transform(x_train)
two_word_vocabulary = two_words_pipeline.fit_transform(x_train)

# Get top 10 words
top_words = get_top_n(word_pipeline, word_vocabulary, 10)
top_bigrams = get_top_n(two_words_pipeline, two_word_vocabulary, 10)

print(top_words)
print(top_bigrams)

[('to', 29539), ('the', 27486), ('my', 16672), ('and', 15757), ('you', 14386), ('is', 12499), ('it', 12322), ('for', 11519), ('in', 11481), ('of', 9583)]
[('in the', 2391), ('going to', 2264), ('for the', 1810), ('on the', 1622), ('to be', 1581), ('to the', 1533), ('have to', 1523), ('to go', 1398), ('of the', 1306), ('to get', 1246)]


In [27]:
# Most used (can also be seen above) and the least used words and bigrams
all_words = get_all_data_sorted(word_pipeline, word_vocabulary)
all_bigrams = get_all_data_sorted(two_words_pipeline, two_word_vocabulary)

print(f"Most used word: {all_words[0]}, Least used word: {all_words[-1]}")
print(f"Most used bigram: {all_bigrams[0]}, Least used bigram: {all_bigrams[-1]}")


Most used word: ('to', 29539), Least used word: ('zzzzzzzzzzzzzzzz', 1)
Most used bigram: ('in the', 2391), Least used bigram: ('zzzzzzzzzzzzzzzz if', 1)


## Naive Bayes model full pipeline , fitting, evaluation, and tuning

### The full pipeline will conclude with a Naive Bayes classifier, which will be fitted and evaluated using classification metrics such as accuracy (suitable here due to the absence of class imbalance). It will then be tuned via its hyperparameters using GridSearchCV, which performs cross-validation by splitting the data into folds so that each fold is used for validation exactly once.

In [28]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

In [29]:
pipeline = Pipeline([('text_preprocess', text_preprocessing_pipeline),
                          ('vectorizer', CountVectorizer()),
                          ('classifier', MultinomialNB())])

In [30]:
pipeline.fit(x_train, y_train)

In [31]:
#Training acuuracy score
training_accuracy = pipeline.score(x_train, y_train)
print(f"Training accuracy score: {training_accuracy}")

Training accuracy score: 0.8522874737512682


In [32]:
#Testing acuuracy score
testing_accuracy = pipeline.score(x_test, y_test)
print(f"Testing accuracy score: {testing_accuracy}")

Testing accuracy score: 0.761681930610335


In [33]:
#Looking the classification reports for both training and testing
y_pred_training = pipeline.predict(x_train)
report = classification_report(y_pred_training, y_train)
print(report)

              precision    recall  f1-score   support

           0       0.88      0.83      0.86     44624
           1       0.83      0.87      0.85     40142

    accuracy                           0.85     84766
   macro avg       0.85      0.85      0.85     84766
weighted avg       0.85      0.85      0.85     84766



In [34]:
y_pred_testing = pipeline.predict(x_test)
report = classification_report(y_pred_testing, y_test)
print(report)

              precision    recall  f1-score   support

           0       0.79      0.74      0.77      7957
           1       0.73      0.78      0.75      7002

    accuracy                           0.76     14959
   macro avg       0.76      0.76      0.76     14959
weighted avg       0.76      0.76      0.76     14959



### The model achieves an accuracy of 85% on the training set and 76% on the test set. The classification report shows a consistent drop in precision, recall, and F1-score for both classes when evaluated on unseen data. These differences indicate potential overfitting, where the model performs better on the training data than on new data. Further hyperparmatrer tuning may help improve generalization.

In [35]:
# Hyperparamter tuning on the pipeline
search_space = {"vectorizer__ngram_range": [(1,1), (2,2), (3,3)],
                "classifier__alpha": [0.1, 1.0, 2.0, 3.0],
                "classifier__fit_prior": [True, False]}

In [36]:
grid = GridSearchCV(pipeline, param_grid = search_space, cv = 20, scoring = 'accuracy')

In [37]:
grid.fit(x_train, y_train)

In [38]:
best_score = grid.best_score_
best_params = grid.best_params_

print(f"Best Score: {best_score}, Best Parameters: {best_params}")

Best Score: 0.7664156435873055, Best Parameters: {'classifier__alpha': 2.0, 'classifier__fit_prior': True, 'vectorizer__ngram_range': (1, 1)}


In [39]:
best_pipeline = grid.best_estimator_

In [40]:
#Training acuuracy score
training_accuracy = best_pipeline.score(x_train, y_train)
print(f"Training accuracy score: {training_accuracy}")

Training accuracy score: 0.8362668994644079


In [41]:
#Testing acuuracy score
testing_accuracy = best_pipeline.score(x_test, y_test)
print(f"Testing accuracy score: {testing_accuracy}")

Testing accuracy score: 0.7634200147068654


In [46]:
# Testing some random sentences (0: Negative , 1: Positive)
text = pd.Series("This game is absolute garbage, do not try it is not worth your time!")
prediction = pipeline.predict(text)
prediction

array([1], dtype=int64)

In [47]:
text = pd.Series("This game is an absolute beauty, must try and it is worth your time!")
prediction = pipeline.predict(text)
prediction

array([1], dtype=int64)

In [48]:
text = pd.Series("Trash game and it is absolute garbage, do not try it is not worth your time. I hate it !")
prediction = pipeline.predict(text)
prediction

array([0], dtype=int64)

## Project Summary

This project involves building a sentiment analysis model using the Naive Bayes algorithm to classify text reviews of games as either positive or negative. The workflow follows a typical machine learning pipeline from data ingestion to model evaluation and tuning.

---

### 1. Data Ingestion and Preprocessing

* **Dataset** was imported from a CSV file.
* Performed basic inspection to understand its structure and contents.
* Checked missing values and removed duplicate entries to clean the dataset.

---

### 2. Exploratory Data Analysis (EDA)

* Analyzed the most frequently used unigrams and bigrams in the text data.
* Identified most and least used words and bigrams.

---

### 3. Text Preprocessing

* Implemented functions to:

  * Convert text to lowercase
  * Remove mentions (e.g., `@username`)
  * Remove special characters
* Each function was unit tested for correctness.
* Combined these steps into a preprocessing pipeline using `FunctionTransformer`.

---

### 4. Data Splitting

* Split the dataset into training and testing sets.
* Used a fixed `random_state` to ensure reproducibility.

---

### 5. Model Building and Initial Evaluation

* Created a Scikit-learn pipeline with:

  * A `CountVectorizer` for feature extraction
  * A `MultinomialNB` classifier for training
* **Initial results** before hyperparameter tuning:

  * Training Accuracy: 85.23%
  * Testing Accuracy: 76.17%

---

### 6. Hyperparameter Tuning

* Conducted grid search over the following parameters:

  * `vectorizer__ngram_range`: (1,1), (2,2), (3,3)
  * `classifier__alpha`: \[0.1, 1.0, 2.0, 3.0, 4.0]
  * `classifier__fit_prior`: \[True, False]
* **Best parameters found**:

  * `alpha`: 2.0
  * `fit_prior`: True
  * `ngram_range`: (1,1)

---

### 7. Final Model Performance

* **After tuning**, results remained consistent:

  * Training Accuracy: 83.63%
  * Testing Accuracy: 76.34%
* Minor overfitting was observed, but within a reasonable margin considering the simplicity of the model.

---

### 8. User Input Consideration

* For prediction, user inputs need to be wrapped in a `pandas.Series` object to match the input format expected by the pipeline.
* This approach is appropriate and efficient for backend integration in web applications.

---

### Notes

* **Potential for Improvement**:
  The current preprocessing pipeline expects `pandas.Series` input, which is fine for experimentation but suboptimal for production use. Refactoring the text preprocessing functions to handle raw string inputs would make the model more efficient and scalable in web or app backends.

* **Model Performance & Overfitting Observation**:
  Even after hyperparameter tuning, the training accuracy (\~83.6%) remained noticeably higher than the testing accuracy (\~76.3%). This suggests some degree of overfitting may still be present. The gap could be due to:

  * **Dataset complexity**, including subtle linguistic patterns that the model may overfit.
  * **Model simplicity**: Naive Bayes assumes conditional independence, which may not fully capture contextual relationships in text.
  * **Limited expressiveness of features**: While n-grams help, they may not capture enough semantics to generalize well.

Future improvements could involve richer features (like TF-IDF or embeddings) or trying other advanced architecture like transformers for better generalization.
