# Naive Bayes Project

Welcome to your Naive Bayes Machine Learning Project! Follow along with the notebook and the instructions below. We will analyze the well-known SMS Spam Collection data set from Kaggle.

## The Data
We will be using the famous [SMS Spam Collection](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset). 

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains a single set of 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.

## Import Libraries
Let's import some libraries to get started!

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Read the Data

Let's start by reading the spam.csv file into a pandas DataFrame and running a few basic checks on it.

In [3]:
df = pd.read_csv('spam.csv', encoding='latin-1')

In [4]:
df.sample(5)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
5020,ham,:-( sad puppy noise,,,
4577,spam,Urgent! call 09066350750 from your landline. Y...,,,
1360,ham,Yo dude guess who just got arrested the other day,,,
4440,ham,I'm going 2 orchard now laready me reaching so...,,,
3187,spam,This is the 2nd time we have tried 2 contact u...,,,


## Data Cleaning

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [6]:
# drop last 3 cols
df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'],inplace=True)

### Renaming the Features v1 as target and v2 as text

In [None]:
# renaming the cols
df.rename(columns={'v1':'target','v2':'text'},inplace=True)
df.sample(5)

## Encoding the Target

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

In [None]:
df['target'] = encoder.fit_transform(df['target'])

In [None]:
print(encoder.classes_) # 0->ham and 1-> spam
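
`LabelEncoder` sorts the distinct labels and replaces each one with its position in that sorted order, which is why `ham` becomes 0 and `spam` becomes 1. The dependency-free sketch below mimics that behaviour on a few made-up labels; `encode_labels` is a hypothetical helper for illustration, not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each label to its index in the sorted list of distinct labels."""
    classes = sorted(set(labels))                      # e.g. ['ham', 'spam']
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]

# Made-up sample labels, for illustration only
classes, encoded = encode_labels(["ham", "spam", "ham", "ham", "spam"])
print(classes)
print(encoded)
```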

### Check for missing values

In [None]:
# missing values
df.isnull().sum()

### Check for duplicate values and remove them

In [None]:
df.duplicated().sum()

In [None]:
df = df.drop_duplicates(keep='first')
df.duplicated().sum()

In [None]:
df['target'].value_counts()

### EDA

**Use a countplot to check the distribution of the target.**

In [None]:
sns.countplot(x='target', data=df, hue='target')

### Now we have to do some Natural Language Processing on the Dataset

In [None]:
!pip install nltk

In [None]:
import nltk

In [None]:
nltk.download('punkt')

This downloads the Punkt model, a pre-trained data file that NLTK uses to split text into sentences and words (tokenization).

Without it, `nltk.word_tokenize` and `nltk.sent_tokenize` raise a `LookupError`. Recent NLTK releases may ask for the `punkt_tab` resource instead; if the error message names it, run `nltk.download('punkt_tab')` as well.

In [None]:
df['num_characters'] = df['text'].apply(len)

In [None]:
df.head()

In [None]:
# num of words
df['num_words'] = df['text'].apply(lambda x:len(nltk.word_tokenize(x)))

It creates a new column called **`num_words`** that contains the **total count of words** for each message in the `text` column.

**Example:**
* **Input (`x`):** `"Hi, how are you?"`
* **Tokenized:** `['Hi', ',', 'how', 'are', 'you', '?']`
* **Result (`num_words`):** `6`
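
Note that punctuation marks count as tokens, so `num_words` is really a token count. The sketch below reproduces the example with a regular expression from the standard library. It is only a rough stand-in for `nltk.word_tokenize` (which handles contractions and many other cases differently), and `rough_word_tokenize` is a hypothetical name.

```python
import re

def rough_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = rough_word_tokenize("Hi, how are you?")
print(tokens)       # the words plus ',' and '?'
print(len(tokens))  # punctuation is included in the count
```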

In [None]:
df['num_sentences'] = df['text'].apply(lambda x:len(nltk.sent_tokenize(x)))

It creates a new column called **`num_sentences`** that contains the **total count of sentences** for each message.

**Example:**
* **Input (`x`):** `"I am fine. How are you?"`
* **Tokenized:** `['I am fine.', 'How are you?']`
* **Result (`num_sentences`):** `2`
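
As a rough illustration of the idea, the sketch below splits on whitespace that follows `.`, `!` or `?`. This is an assumption-laden simplification, not the Punkt algorithm: it would wrongly split after abbreviations such as "Dr.", which is exactly the kind of case the pre-trained model is there to handle. `rough_sent_tokenize` is a hypothetical name.

```python
import re

def rough_sent_tokenize(text):
    """Split after sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = rough_sent_tokenize("I am fine. How are you?")
print(sentences)
print(len(sentences))
```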

In [None]:
df[['num_characters','num_words','num_sentences']].describe()

In [None]:
# ham
df[df['target'] == 0][['num_characters','num_words','num_sentences']].describe()

In [None]:
#spam
df[df['target'] == 1][['num_characters','num_words','num_sentences']].describe()

**Make a histplot for different targets based on number of characters**

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(df[df['target'] == 0]['num_characters']) #ham
sns.histplot(df[df['target'] == 1]['num_characters'],color='red') #spam

**Make a histplot for different targets based on number of words**

In [None]:
plt.figure(figsize=(12,6))
sns.histplot(df[df['target'] == 0]['num_words'])
sns.histplot(df[df['target'] == 1]['num_words'],color='red')

## Data Preprocessing
- Lower case
- Tokenization
- Removing special characters
- Removing stop words and punctuation
- Stemming

## Text Preprocessing Steps

| Step | Definition | Importance (Why?) |
| :--- | :--- | :--- |
| **Lower Casing** | Converts all text to **lowercase** (e.g., "The" $\rightarrow$ "the"). | Ensures the model treats variations like "Apple" and "apple" as the **same word**, reducing vocabulary size. |
| **Tokenization** | Breaks the text into its smallest meaningful units (words, numbers, punctuation). | A **prerequisite** for every later step; it converts a raw string into a list of items that can be filtered, counted, and processed. |
| **Removing Special Characters** | Eliminates non-alphanumeric symbols (e.g., `@`, `#`, `&`). | Reduces **noise** and keeps the model focused only on characters relevant to language. |
| **Removing Stop Words & Punctuation** | Removes common, high-frequency words (e.g., "a," "the," "is") and standard punctuation. | Drastically reduces the number of features and forces the model to learn from **meaningful, distinguishing words** (like 'spam' keywords). |
| **Stemming** | Reduces a word to its base or root form (e.g., "running," "runs" $\rightarrow$ "run"). | Reduces the feature space by treating all grammatical variations of a word as a single feature, helping the model **generalize** better. |

***

### Overall Importance: Model Effectiveness

These steps are vital because they convert raw, messy human language into a clean, consistent, and numerical format that machine learning algorithms can process effectively. They ensure the model is **efficient**, **consistent**, and focuses only on the most **distinguishing information**.
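
To make the effect of each step concrete, here is a dependency-free toy version of the pipeline. Everything in it is an assumption made for illustration: the stop-word set is a tiny hand-picked list, whitespace splitting stands in for NLTK tokenization, and `crude_stem` is a naive suffix stripper rather than the Porter stemmer used in this notebook.

```python
import string

# Tiny hand-picked stop-word list (illustrative, not NLTK's list)
STOP_WORDS = {"a", "the", "is", "are", "to", "and", "you"}

def crude_stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    lowered = text.lower()                                   # lower casing
    tokens = lowered.split()                                 # naive tokenization
    tokens = [t.strip(string.punctuation) for t in tokens]   # trim punctuation
    tokens = [t for t in tokens if t.isalnum()]              # drop special characters
    tokens = [t for t in tokens if t not in STOP_WORDS]      # drop stop words
    return [crude_stem(t) for t in tokens]                   # stemming

result = preprocess("You are WINNING the prizes, claimed 2 times!")
print(result)
```

Stems such as `winn` need not be dictionary words; what matters is that variants of the same word collapse to the same feature.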

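In [None]:
import string

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')  # the stop-word lists are a separate NLTK resource

ps = PorterStemmer()
# Build the lookup sets once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

def transform_text(text):
    # Lower-case and tokenize
    tokens = nltk.word_tokenize(text.lower())

    # Keep only purely alphanumeric tokens (drops special characters)
    tokens = [t for t in tokens if t.isalnum()]

    # Remove stop words and punctuation
    tokens = [t for t in tokens if t not in stop_words and t not in punctuation]

    # Stem each token and join back into a single string
    return " ".join(ps.stem(t) for t in tokens)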

In [None]:
transform_text("I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.")

In [None]:
df['transformed_text'] = df['text'].apply(transform_text)

In [None]:
!pip install wordcloud

In [None]:
from wordcloud import WordCloud
wc = WordCloud(width=500,height=500,min_font_size=10,background_color='white')

spam_wc = wc.generate(df[df['target'] == 1]['transformed_text'].str.cat(sep=" "))
plt.figure(figsize=(15,6))
plt.imshow(spam_wc)

In [None]:
ham_wc = wc.generate(df[df['target'] == 0]['transformed_text'].str.cat(sep=" "))
plt.figure(figsize=(15,6))
plt.imshow(ham_wc)

# Model Building

## What is Vectorization?
Vectorization (specifically **text vectorization**) is the process of converting unstructured text (a word, a sentence, or a document) into a sequence of numbers, known as a **vector**, that machine learning models can work with.

For example, with the vocabulary `["cats", "dogs", "i", "love"]`, the sentence "I love cats" becomes the vector `[1, 0, 1, 1]`, where each position holds the number of times the corresponding vocabulary word appears in the sentence.
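
The idea can be sketched in a few lines of plain Python. The three documents are made up for illustration, and `count_vector` is a hypothetical helper. It keeps every whitespace-separated token, whereas scikit-learn's `CountVectorizer` by default ignores one-character tokens such as "i".

```python
from collections import Counter

# Made-up toy corpus
docs = ["i love cats", "i love dogs", "cats love cats"]

# The vocabulary is the sorted set of all words seen in the corpus
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [count_vector(doc) for doc in docs]
print(vocabulary)
for doc, vec in zip(docs, vectors):
    print(doc, "->", vec)
```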

### Types:

1.  **CountVectorizer (`cv`):** Converts text into a vector by simply counting the frequency of each word in a document.
    * *Example:* If the word "free" appears 5 times, its corresponding number in the vector is 5.
2.  **TfidfVectorizer (`tfidf`):** Converts text into a vector based on **Term Frequency-Inverse Document Frequency (TF-IDF)**. This is a more sophisticated method that weighs word count by how rare or important the word is across *all* documents.
    * *Goal:* Give high scores to words that appear often in a *specific* document but rarely in the entire dataset (e.g., "lottery" in a spam message).
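
A minimal sketch of the TF-IDF weighting, on a made-up corpus of already-tokenized messages. It uses the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`, which is the default variant described in scikit-learn's documentation, but it leaves out the L2 normalization that `TfidfVectorizer` applies by default, so the numbers will not match the library's output exactly.

```python
import math

# Made-up, already-tokenized messages
docs = [
    ["free", "lottery", "call"],
    ["call", "me", "later"],
    ["call", "home", "now"],
]

def idf(term):
    """Smoothed inverse document frequency over the toy corpus."""
    df = sum(term in doc for doc in docs)   # number of documents containing term
    return math.log((1 + len(docs)) / (1 + df)) + 1

def tfidf(doc):
    """Raw term count times idf, without normalization."""
    return {term: doc.count(term) * idf(term) for term in set(doc)}

weights = tfidf(docs[0])
for term, weight in sorted(weights.items()):
    print(term, round(weight, 3))
```

"call" appears in every message and gets the lowest weight, while "lottery" and "free" appear in only one message and are weighted higher.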

---


In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
cv = CountVectorizer()
tfidf = TfidfVectorizer(max_features=3000)

In [None]:
X = tfidf.fit_transform(df['transformed_text']).toarray()
y = df['target'].values

**Train Test Split**

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

# Model Training, Prediction, and Evaluation

In [None]:
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB
from sklearn.metrics import accuracy_score,confusion_matrix,precision_score, classification_report

In [None]:
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()
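
The three classes differ in the likelihood they assume for each feature: `GaussianNB` models features as normally distributed, `MultinomialNB` models counts (or count-like weights), and `BernoulliNB` models binary presence or absence. As a rough illustration of the multinomial variant, here is a from-scratch sketch with add-one (Laplace) smoothing on a made-up corpus. The function names and data are hypothetical, and this is not scikit-learn's implementation.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = sorted({w for d in docs for w in d})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((counts[w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in doc if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Made-up training data: 1 = spam, 0 = ham
train_docs = [
    ["free", "prize", "call"],
    ["free", "cash", "now"],
    ["see", "you", "later"],
    ["call", "me", "later"],
]
train_labels = [1, 1, 0, 0]

model = train_multinomial_nb(train_docs, train_labels)
pred_spam = predict(["free", "cash", "prize"], *model)
pred_ham = predict(["see", "you", "later"], *model)
print(pred_spam, pred_ham)
```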

In [None]:
gnb.fit(X_train,y_train)
y_pred1 = gnb.predict(X_test)
print(accuracy_score(y_test,y_pred1))
print(confusion_matrix(y_test,y_pred1))
print(precision_score(y_test,y_pred1))
print(classification_report(y_test,y_pred1))

In [None]:
mnb.fit(X_train,y_train)
y_pred2 = mnb.predict(X_test)
print(accuracy_score(y_test,y_pred2))
print(confusion_matrix(y_test,y_pred2))
print(precision_score(y_test,y_pred2))
print(classification_report(y_test,y_pred2))

In [None]:
bnb.fit(X_train,y_train)
y_pred3 = bnb.predict(X_test)
print(accuracy_score(y_test,y_pred3))
print(confusion_matrix(y_test,y_pred3))
print(precision_score(y_test,y_pred3))
print(classification_report(y_test,y_pred3))
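
To read the printed numbers, it helps to see how accuracy and precision follow from the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]` for binary labels. The sketch below recomputes both metrics by hand on made-up labels (not results from this notebook); `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = spam."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1   # ham flagged as spam
        elif t == 1 and p == 0:
            fn += 1   # spam that slipped through
        else:
            tp += 1
    return tn, fp, fn, tp

# Made-up labels for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true_demo, y_pred_demo)
accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
print([[tn, fp], [fn, tp]])
print(accuracy, precision)
```

Precision is worth watching in a spam filter because every false positive is a legitimate message that gets flagged as spam.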