# Day-24: Naive Bayes Classifier

Today we return to our roots in probability and use a classic result, Bayes' theorem, to build a classifier. Naive Bayes is a probabilistic machine learning algorithm used for a wide variety of classification tasks. It makes a good baseline model because it is fast to train and scales well to large, high-dimensional data such as text. Its "naive" assumption is that all features are conditionally independent of one another given the class. We'll explore why this assumption, while often false, works so well in practice.
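
To make the "naive" assumption concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes on word counts. The toy vocabulary, the four count vectors, and the smoothing value `alpha = 1.0` are assumptions made up for illustration, and the helper names `fit_nb` and `predict_nb` are hypothetical. Each class is scored as its log prior plus the sum of per-word log likelihoods, which is exactly where the independence assumption enters.

```python
import numpy as np

# Hypothetical toy data: rows are documents, columns are word counts.
vocab = ["card", "charge", "loan", "payment"]
X = np.array([
    [2, 1, 0, 0],  # credit_card
    [1, 2, 0, 1],  # credit_card
    [0, 0, 2, 1],  # mortgages_and_loans
    [0, 0, 1, 2],  # mortgages_and_loans
])
y = np.array([0, 0, 1, 1])
classes = ["credit_card", "mortgages_and_loans"]

alpha = 1.0  # Laplace smoothing so unseen words never get zero probability


def fit_nb(X, y, n_classes, alpha):
    """Estimate log P(class) and log P(word | class) from count data."""
    log_prior = np.zeros(n_classes)
    log_likelihood = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        Xc = X[y == c]
        log_prior[c] = np.log(len(Xc) / len(X))
        word_counts = Xc.sum(axis=0) + alpha
        log_likelihood[c] = np.log(word_counts / word_counts.sum())
    return log_prior, log_likelihood


def predict_nb(x, log_prior, log_likelihood):
    """Score each class as log prior + sum of count * log P(word | class)."""
    scores = log_prior + log_likelihood @ x
    return int(np.argmax(scores))


log_prior, log_likelihood = fit_nb(X, y, n_classes=2, alpha=alpha)
new_doc = np.array([1, 1, 0, 0])  # a document containing "card" and "charge"
pred = classes[predict_nb(new_doc, log_prior, log_likelihood)]
print(pred)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long documents.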

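## Topics Covered:

- Loading and cleaning the consumer complaints dataset
- Restricting the data to the five most frequent product categories
- Building a TF-IDF + `MultinomialNB` pipeline
- Evaluating with accuracy, a confusion matrix, and a classification report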

In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import matplotlib.pyplot as plt

Dataset (Kaggle): https://www.kaggle.com/datasets/shashwatwork/consume-complaints-dataset-fo-nlp

In [16]:
df = pd.read_csv('complaints_processed.csv')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162421 entries, 0 to 162420
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  162421 non-null  int64 
 1   product     162421 non-null  object
 2   narrative   162411 non-null  object
dtypes: int64(1), object(2)
memory usage: 3.7+ MB


In [17]:
df.head()

   Unnamed: 0           product                                          narrative
0           0       credit_card  purchase order day shipping amount receive pro...
1           1       credit_card  forwarded message date tue subject please inve...
2           2    retail_banking  forwarded message cc sent friday pdt subject f...
3           3  credit_reporting  payment history missing credit report speciali...
4           4  credit_reporting  payment history missing credit report made mis...


In [28]:
# Drop rows with a missing narrative or product (10 narratives are null above)
df.dropna(subset=['narrative', 'product'], inplace=True)

In [30]:
# Make sure every narrative is a Unicode string before vectorizing
df['narrative'] = df['narrative'].astype('U')

In [38]:
# We'll take the top 5 most frequent product categories
top_products = df['product'].value_counts().nlargest(5).index
df_subset = df[df['product'].isin(top_products)]

In [43]:
top_products

Index(['credit_reporting', 'debt_collection', 'mortgages_and_loans',
       'credit_card', 'retail_banking'],
      dtype='object', name='product')

In [39]:
# train-test split
X = df_subset['narrative']
y = df_subset['product']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
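
The split above is purely random, and the classes here are imbalanced. One option worth considering is `stratify=y`, which asks `train_test_split` to keep the class proportions the same in both splits. The sketch below uses a small made-up DataFrame (the `toy` data and its labels `a` and `b` are hypothetical); the results reported in this notebook were produced without stratification.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 rows of class "a", 20 rows of class "b".
toy = pd.DataFrame({
    "narrative": [f"text {i}" for i in range(100)],
    "product": ["a"] * 80 + ["b"] * 20,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["narrative"],
    toy["product"],
    test_size=0.2,
    random_state=42,
    stratify=toy["product"],  # preserve the 80/20 class ratio in both splits
)

print(y_te.value_counts().to_dict())  # {'a': 16, 'b': 4}
```

With stratification, the 20-row test set holds exactly 16 `a` rows and 4 `b` rows, so minority classes are not under-represented by chance.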

In [40]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', decode_error='ignore', encoding='utf-8')),
    ('clf', MultinomialNB()),
])
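
To see what the two pipeline steps do, here is a small sketch on a made-up corpus. The four sentences, their labels, and the name `toy_pipeline` are assumptions for illustration only. `TfidfVectorizer` turns each text into a sparse row of weighted term frequencies, and `MultinomialNB` then fits class-conditional term distributions on those rows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus, two documents per class.
docs = [
    "credit card charged twice late fee",
    "card annual fee charged without notice",
    "mortgage loan payment escrow increase",
    "loan servicer misapplied mortgage payment",
]
labels = [
    "credit_card",
    "credit_card",
    "mortgages_and_loans",
    "mortgages_and_loans",
]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])
toy_pipeline.fit(docs, labels)

# One row per document, one column per vocabulary term kept by the vectorizer.
features = toy_pipeline.named_steps["tfidf"].transform(docs)
print(features.shape)

# New text goes through the same vectorizer before it is classified.
prediction = toy_pipeline.predict(["unexpected fee on my card"])
print(prediction[0])
```

Wrapping both steps in a `Pipeline` means the vocabulary and IDF weights are learned from the training data only and then reused unchanged at prediction time.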

In [41]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

In [42]:
# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.8139
Confusion Matrix:
[[ 1831   998    30    95   178]
 [  103 17589   245   329    17]
 [   78  2276  2061   189    11]
 [   28   693    53  2966    30]
 [  222   361    11    99  1990]]

Classification Report:
                     precision    recall  f1-score   support

        credit_card       0.81      0.58      0.68      3132
   credit_reporting       0.80      0.96      0.88     18283
    debt_collection       0.86      0.45      0.59      4615
mortgages_and_loans       0.81      0.79      0.80      3770
     retail_banking       0.89      0.74      0.81      2683

           accuracy                           0.81     32483
          macro avg       0.83      0.70      0.75     32483
       weighted avg       0.82      0.81      0.80     32483
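
The confusion matrix is easier to read once rows and columns are tied back to class names. scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, in sorted label order, and the row sums here match the support column of the report. As a quick check, the sketch below recomputes precision, recall, and accuracy from the matrix printed above (the variable names are illustrative).

```python
import numpy as np

# Labels in sorted order, matching the rows/columns of the printed matrix.
labels = ["credit_card", "credit_reporting", "debt_collection",
          "mortgages_and_loans", "retail_banking"]

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([
    [1831,   998,   30,   95,  178],
    [ 103, 17589,  245,  329,   17],
    [  78,  2276, 2061,  189,   11],
    [  28,   693,   53, 2966,   30],
    [ 222,   361,   11,   99, 1990],
])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all rows truly in the class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all rows predicted as the class
accuracy = np.trace(cm) / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:<20} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")

# Share of true debt_collection complaints predicted as credit_reporting.
debt_as_reporting = cm[2, 1] / cm[2].sum()
print(f"debt_collection -> credit_reporting: {debt_as_reporting:.2%}")
```

The numbers reproduce the classification report, and the last line shows why `debt_collection` recall is only 0.45: roughly 49% of those complaints (2,276 of 4,615) are predicted as the much larger `credit_reporting` class.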

