# Spam Detection Using Multinomial Naive Bayes

## Introduction

In this project, we aim to classify SMS messages as either "ham" (not spam) or "spam" using a Multinomial Naive Bayes classifier. This is a common approach in natural language processing (NLP) tasks for text classification.

## Dataset

The dataset used for this project is the "spam.csv" file, which contains two columns:
- `label`: The classification of the message (ham or spam)
- `message`: The content of the SMS message

### Data Preprocessing

1. **Loading the Data**: The data is loaded into a Pandas DataFrame with `latin-1` encoding. Columns that contain any missing values are dropped, leaving only `label` and `message`.
2. **Text Cleaning**: 
   - Punctuation is removed.
   - Stopwords (common words that do not contribute to the meaning) are filtered out.
3. **Encoding Labels**: The labels are mapped to numerical values: ham as 0 and spam as 1.

### Text Processing Function

A function `text_process` is defined to:
- Remove punctuation
- Remove stopwords
- Return cleaned text as a single string
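
As a rough illustration of these three steps, the sketch below uses `str.translate` to strip punctuation in one pass instead of filtering character by character. The name `clean_text` and the tiny `DEMO_STOPWORDS` set are hypothetical stand-ins; the notebook itself uses NLTK's English stopword list plus a few SMS abbreviations.

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration only.
DEMO_STOPWORDS = {"u", "ur", "i", "to", "he", "the", "a", "in"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(message, stopwords=DEMO_STOPWORDS):
    """Remove punctuation and stopwords, returning one space-joined string."""
    no_punct = message.translate(_PUNCT_TABLE)
    return " ".join(w for w in no_punct.split() if w.lower() not in stopwords)

cleaned = clean_text("Ok lar... Joking wif u oni...")
print(cleaned)
```

Stopword matching is done on the lowercased token, but the original casing is kept in the returned string.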

## Splitting the Data

The dataset is split into training and testing sets using a 75-25 ratio (`test_size=0.25`), giving 4,179 training messages and 1,393 test messages.

## Feature Extraction

The text data is transformed into a document-term matrix using `CountVectorizer`, which converts the text into a format suitable for machine learning algorithms.

## Model Training

### Naive Bayes Implementation

1. **Prior Probability Calculation**: Calculate the prior probabilities of each class.
2. **Conditional Probability Calculation**: For each class, compute the conditional probabilities of each feature (word) given the class.
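
The two estimates can be written out as follows. The notation is introduced here for illustration: $N_c$ is the number of training messages in class $c$, $N$ the total number of training messages, $\mathrm{count}(w, c)$ the number of occurrences of word $w$ across messages of class $c$, $V$ the vocabulary, and $\alpha$ the smoothing parameter.

$$
\hat{P}(c) = \frac{N_c}{N},
\qquad
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + \alpha}{\sum_{w' \in V} \mathrm{count}(w', c) + \alpha\,|V|}
$$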

### Prediction Function

A function `predict` is defined to compute the posterior probabilities and classify the messages.
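
In log space, the decision rule this kind of function evaluates can be sketched as below. Here $x_w$ denotes the count of word $w$ in the message, $\hat{P}(c)$ the class prior, and $\hat{P}(w \mid c)$ the per-class word probability; the symbols are chosen for illustration.

$$
\hat{y}(x) = \arg\max_{c} \left[ \log \hat{P}(c) + \sum_{w \in V} x_w \log \hat{P}(w \mid c) \right]
$$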

## Model Evaluation

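Accuracy is computed on the training set and on the test set. In the test-set cells the priors and word probabilities are re-estimated from the test data itself, so the reported test figures are in-sample rather than held-out accuracy.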

## Hyperparameter Tuning

The model is re-evaluated for values of the additive smoothing parameter (alpha) from 0.0 to 1.0 in steps of 0.2. Smoothing adds alpha to every word count so that a word never seen in a class does not receive zero probability.
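
A small numeric sketch of what alpha does, using made-up counts over a four-word vocabulary (the helper name `smoothed_probs` is hypothetical):

```python
import numpy as np

# Hypothetical per-class word counts over a 4-word vocabulary (not from the dataset).
counts = np.array([3, 0, 1, 0], dtype=float)

def smoothed_probs(word_counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocabulary size)."""
    vocab_size = word_counts.shape[0]
    return (word_counts + alpha) / (word_counts.sum() + alpha * vocab_size)

p_unsmoothed = smoothed_probs(counts, alpha=0.0)  # unseen words get probability 0
p_smoothed = smoothed_probs(counts, alpha=1.0)    # unseen words get a small positive probability

print(p_unsmoothed)
print(p_smoothed)
```

With alpha = 0 a single unseen word drives the whole class likelihood to zero; with alpha = 1 the same word only lowers it.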

## Using Scikit-Learn

For comparison, `MultinomialNB` from scikit-learn is trained on a fresh split of the same data (`random_state=0`, default 75-25 ratio) and scored on the held-out test set for the same alpha values.

### Summary of Results

Accuracy is reported for alpha values of 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0. With scikit-learn, held-out test accuracy ranges from about 97.9% (alpha = 0.0) to about 98.6% (alpha = 0.4).

## Conclusion

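This project shows that a Multinomial Naive Bayes classifier over simple word counts is an effective baseline for SMS spam detection, and that a small amount of smoothing improves held-out accuracy over no smoothing.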

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# read file into pandas using a relative path

df = pd.read_csv("./spam.csv", encoding='latin-1')
df.dropna(how="any", inplace=True, axis=1)
df.columns = ['label', 'message']

df.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
import string
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns the cleaned text as a single space-joined string
    """
    STOPWORDS = stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure']
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    
    # Now just remove any stopwords
    return ' '.join([word for word in nopunc.split() if word.lower() not in STOPWORDS])

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Arshini/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
df['message'] = df.message.apply(text_process)
df.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go jurong point crazy Available bugis n great ... |
| 1 | ham | Ok lar Joking wif oni |
| 2 | spam | Free entry wkly comp win FA Cup final tkts 21s... |
| 3 | ham | dun say early hor c already say |
| 4 | ham | Nah think goes usf lives around though |


In [5]:
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

|   | label | message |
|---|-------|---------|
| 0 | 0 | Go jurong point crazy Available bugis n great ... |
| 1 | 0 | Ok lar Joking wif oni |
| 2 | 1 | Free entry wkly comp win FA Cup final tkts 21s... |
| 3 | 0 | dun say early hor c already say |
| 4 | 0 | Nah think goes usf lives around though |


In [6]:
# split X and y into training and testing sets 
from sklearn.model_selection import train_test_split

X = df.message
y = df.label

print(f'X: {X.shape}')
print(f'y: {y.shape}')
print()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

print(f'X_train: {X_train.shape}')
print(f'y_train: {y_train.shape}')
print()

print(f'X_test: {X_test.shape}')
print(f'y_test: {y_test.shape}')
print()

X: (5572,)
y: (5572,)

X_train: (4179,)
y_train: (4179,)

X_test: (1393,)
y_test: (1393,)



In [7]:
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'Please call me... PLEASE!']

In [8]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
simple_train = vect.fit_transform(simple_train)

vect.get_feature_names_out()

array(['cab', 'call', 'me', 'please', 'tonight', 'you'], dtype=object)

In [10]:
# convert sparse matrix to a dense matrix
simple_train.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])

In [11]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train.toarray(), columns=vect.get_feature_names_out())

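|   | cab | call | me | please | tonight | you |
|---|-----|------|----|--------|---------|-----|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 2 | 0 | 0 |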


In [12]:
simple_test = ["please don't call me"]
# transform with the vocabulary fitted on simple_train; words outside it ("don") are ignored
simple_test_dtm = vect.transform(simple_test)
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names_out())

|   | cab | call | me | please | tonight | you |
|---|-----|------|----|--------|---------|-----|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |


In [13]:
from sklearn.feature_extraction.text import CountVectorizer

# fit the vocabulary on the training messages and build the document-term matrix
vect = CountVectorizer()
X_train = vect.fit_transform(X_train)
count_matrix = pd.DataFrame(X_train.toarray(), columns=vect.get_feature_names_out())
print(count_matrix.shape)

(4179, 7996)


In [14]:
num_samples, num_features = X_train.shape
classes = np.unique(y_train)
num_classes = len(classes)

# class priors: fraction of training messages in each class
priors = np.zeros(num_classes)
for i, c in enumerate(classes):
    priors[i] = np.mean(y_train == c)

# per-class word probabilities with additive smoothing (alpha = 0 means no smoothing)
alpha = 0.0
probability = np.zeros((num_classes, num_features))
for i, c in enumerate(classes):
    prob = X_train[y_train == c]
    probability[i, :] = (np.sum(prob, axis=0) + alpha) / (np.sum(prob) + alpha * num_features)

def predict(X_test, priors, probability):
    """Return the class with the highest log-posterior for each row of X_test.

    Relies on the global `classes` array defined above.
    """
    num_samples = X_test.shape[0]
    y_pred = np.zeros(num_samples)
    for i in range(num_samples):
        posterior = np.zeros(len(classes))
        for j in range(len(classes)):
            # log P(c) + sum over words of log(P(w|c) ** count); 0 ** 0 evaluates to 1,
            # so words absent from the message contribute nothing
            posterior[j] = np.log(priors[j]) + np.sum(np.log(probability[j, :] ** X_test[i, :]))
        y_pred[i] = classes[np.argmax(posterior)]
    return y_pred

y_pred = predict(X_train.toarray(), priors, probability)

# Calculate the accuracy of the model on the training set
accuracy = np.sum(y_pred == y_train) / len(y_train)
print('Accuracy:', accuracy*100, '%')


Accuracy: 99.71284996410624 %
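
A note on the log-likelihood term in `predict`: when every word probability is strictly positive (alpha > 0), `sum(log(p ** x))` equals the dot product `x @ log(p)`. The dot-product form avoids underflow in `p ** x` for long messages and extends to a whole matrix as `X @ np.log(P).T`. The sketch below checks the equivalence on made-up numbers; it does not apply when alpha = 0, because `0 * log(0)` evaluates to `nan`.

```python
import numpy as np

# Hypothetical smoothed word probabilities for one class and one message's word counts.
word_probs = np.array([0.5, 0.125, 0.25, 0.125])
word_counts = np.array([2, 0, 1, 0])

# Form used in the cell above: raise probabilities to the counts, then take logs.
loglik_power = np.sum(np.log(word_probs ** word_counts))

# Equivalent form when every probability is positive: counts times log-probabilities.
loglik_dot = word_counts @ np.log(word_probs)

print(loglik_power, loglik_dot)
```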


In [15]:
# transform the test messages with the vocabulary fitted on the training set
X_test = vect.transform(X_test)
# examine the vocabulary and document-term matrix together
pd.DataFrame(X_test.toarray(), columns=vect.get_feature_names_out())

Unnamed: 0,008704050406,0121,01223585236,01223585334,0125698789,020603,02070836089,02072069400,02073162414,02085076972,...,åòits,åômorrow,åôrents,ìll,ìï,ìïll,ûªve,ûïharry,ûò,ûówell
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1388,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1389,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1390,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1391,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# NOTE: the priors and word probabilities below are re-estimated from the test set,
# so the resulting accuracy is an in-sample figure rather than held-out performance
num_samples, num_features = X_test.shape
classes = np.unique(y_test)
num_classes = len(classes)

priors = np.zeros(num_classes)
for i, c in enumerate(classes):
    priors[i] = np.mean(y_test == c)

alpha = 0.0
probability = np.zeros((num_classes, num_features))
for i, c in enumerate(classes):
    prob = X_test[y_test == c]
    probability[i, :] = (np.sum(prob, axis=0) + alpha) / (np.sum(prob) + alpha * num_features)

y_pred = predict(X_test.toarray(), priors, probability)

# Calculate the accuracy of the model
accuracy = np.sum(y_pred == y_test) / len(y_test)
print('Accuracy:', accuracy * 100, '%')


Accuracy: 99.28212491026561 %
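
The cell above estimates its parameters from the test rows. For a held-out figure, the parameters would be estimated from the training rows only and then applied unchanged to the test rows. The sketch below shows that pattern on made-up count matrices; `fit_nb` and `predict_nb` are hypothetical helpers, not functions from the notebook, and alpha = 1.0 is an arbitrary choice.

```python
import numpy as np

def fit_nb(X, y, alpha=1.0):
    """Estimate log-priors and smoothed log word probabilities from (X, y)."""
    classes = np.unique(y)
    log_priors = np.log(np.array([np.mean(y == c) for c in classes]))
    # Adding alpha to every count makes each row sum to (total + alpha * vocab size).
    counts = np.vstack([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_probs = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_priors, log_probs

def predict_nb(X, classes, log_priors, log_probs):
    """Pick the class with the highest log-posterior for each row of X."""
    scores = X @ log_probs.T + log_priors
    return classes[np.argmax(scores, axis=1)]

# Hypothetical toy count matrices over a 3-word vocabulary (not from spam.csv).
X_tr = np.array([[2, 0, 0], [1, 1, 0], [0, 0, 3], [0, 1, 2]])
y_tr = np.array([0, 0, 1, 1])
X_te = np.array([[3, 0, 0], [0, 0, 2]])
y_te = np.array([0, 1])

params = fit_nb(X_tr, y_tr, alpha=1.0)  # parameters come from training rows only
test_accuracy = np.mean(predict_nb(X_te, *params) == y_te)
print(test_accuracy)
```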


In [17]:
def fit_naive_bayes(X, y, alpha):
    """Estimate class priors and smoothed per-class word probabilities from (X, y)."""
    num_samples, num_features = X.shape
    classes = np.unique(y)
    num_classes = len(classes)

    priors = np.zeros(num_classes)
    for i, c in enumerate(classes):
        priors[i] = np.mean(y == c)
    probability = np.zeros((num_classes, num_features))
    for i, c in enumerate(classes):
        prob = X[y == c]
        probability[i, :] = (np.sum(prob, axis=0) + alpha) / (np.sum(prob) + alpha * num_features)
    return probability, priors

print('For Train Data')
accuracies = []
for i in range(0, 11, 2):
    prob, prior = fit_naive_bayes(X_train.toarray(), y_train, i/10)
    y_pred = predict(X_train.toarray(), prior, prob)
    accuracy = np.sum(y_pred == y_train) / len(y_train)
    accuracies.append(accuracy)
    print('accuracy:', accuracy*100, '%; ', 'alpha:', i/10)

print('For Test Data')

# NOTE: parameters are re-estimated from the test set here, so these are in-sample figures
for i in range(0, 11, 2):
    prob, prior = fit_naive_bayes(X_test.toarray(), y_test, i/10)
    y_pred = predict(X_test.toarray(), prior, prob)
    accuracy = np.sum(y_pred == y_test) / len(y_test)
    print('accuracy:', accuracy*100, '%;', 'alpha:', i/10)

For Train Data


accuracy: 99.71284996410624 %;  alpha: 0.0
accuracy: 99.59320411581717 %;  alpha: 0.2
accuracy: 99.49748743718592 %;  alpha: 0.4
accuracy: 99.52141660684374 %;  alpha: 0.6
accuracy: 99.4496290978703 %;  alpha: 0.8
accuracy: 99.4256999282125 %;  alpha: 1.0
For Test Data
accuracy: 99.28212491026561 %; alpha: 0.0
accuracy: 99.56927494615937 %; alpha: 0.2
accuracy: 99.4256999282125 %; alpha: 0.4
accuracy: 99.21033740129216 %; alpha: 0.6
accuracy: 99.13854989231874 %; alpha: 0.8
accuracy: 99.06676238334529 %; alpha: 1.0


In [18]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], random_state=0)
vectorizer = CountVectorizer()
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

accuracies = []
for i in range(0, 11,2):    
    sk = MultinomialNB(alpha=i/10)
    sk.fit(X_train_vect, y_train)
    y_pred = sk.predict(X_test_vect)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print('accuracy:', accuracy*100, '%; ', 'alpha:', i/10)

accuracy: 97.91816223977028 %;  alpha: 0.0
accuracy: 98.49246231155779 %;  alpha: 0.2
accuracy: 98.63603732950466 %;  alpha: 0.4
accuracy: 98.56424982053123 %;  alpha: 0.6
accuracy: 98.34888729361091 %;  alpha: 0.8
accuracy: 98.34888729361091 %;  alpha: 1.0
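
The vectorizer and classifier can also be bundled with scikit-learn's `make_pipeline`, so the vocabulary is always fitted on training text and reused for new messages. This is a sketch on made-up messages rather than `spam.csv`; `alpha=0.4` is used only because it scored highest in the run above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages (not from spam.csv): 0 = ham, 1 = spam.
train_messages = [
    "win a free prize now",
    "free entry win cash",
    "see you at lunch tomorrow",
    "are we still meeting tonight",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vocabulary and the classifier on training text only,
# then reuses that vocabulary when transforming new messages.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=0.4))
model.fit(train_messages, train_labels)

predictions = model.predict(["free cash prize", "meeting at lunch"])
print(predictions)
```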




In [1]:
import nbformat

def extract_code_from_ipynb(ipynb_file, output_file):
    with open(ipynb_file, 'r', encoding='utf-8') as file:
        notebook = nbformat.read(file, as_version=4)
        
    code_cells = [cell['source'] for cell in notebook['cells'] if cell['cell_type'] == 'code']
    
    with open(output_file, 'w', encoding='utf-8') as file:
        # write each code cell followed by a blank line
        for code in code_cells:
            file.write(code)
            file.write('\n\n')

# Replace 'notebook.ipynb' and 'output.py' with your file names
extract_code_from_ipynb('MultinomialNB.ipynb', 'output.py')