<a href="https://colab.research.google.com/github/atharv-d21/spam_mail_detection/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
import pandas as pd

In [6]:
data = pd.read_csv("https://raw.githubusercontent.com/atharv-d21/spam_mail_detection/refs/heads/main/data/spam_ham_dataset.csv", encoding='latin-1')
data

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
...,...,...,...,...
5166,1518,ham,Subject: put the 10 on the ft\r\nthe transport...,0
5167,404,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0
5168,2933,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0
5169,1409,ham,Subject: industrial worksheets for august 2000...,0


In [7]:
data.columns

Index(['Unnamed: 0', 'label', 'text', 'label_num'], dtype='object')

In [8]:
data['Index'] = data['Unnamed: 0']
data.drop(columns=['Unnamed: 0'], inplace=True)
data

,label,text,label_num,Index
0,ham,Subject: enron methanol ; meter # : 988291\r\n...,0,605
1,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0,2349
2,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0,3624
3,spam,"Subject: photoshop , windows , office . cheap ...",1,4685
4,ham,Subject: re : indian springs\r\nthis deal is t...,0,2030
...,...,...,...,...
5166,ham,Subject: put the 10 on the ft\r\nthe transport...,0,1518
5167,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0,404
5168,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0,2933
5169,ham,Subject: industrial worksheets for august 2000...,0,1409


In [10]:
data['text']

,text
0,Subject: enron methanol ; meter # : 988291\r\n...
1,"Subject: hpl nom for january 9 , 2001\r\n( see..."
2,"Subject: neon retreat\r\nho ho ho , we ' re ar..."
3,"Subject: photoshop , windows , office . cheap ..."
4,Subject: re : indian springs\r\nthis deal is t...
...,...
5166,Subject: put the 10 on the ft\r\nthe transport...
5167,Subject: 3 / 4 / 2000 and following noms\r\nhp...
5168,Subject: calpine daily gas nomination\r\n>\r\n...
5169,Subject: industrial worksheets for august 2000...


## Text preprocessing

### Subtask:
Clean the text data by removing special characters, converting to lowercase, and removing stop words.


**Reasoning**:
Define a function to clean the text data by removing special characters, converting to lowercase, and removing stop words, then apply this function to the 'text' column.



In [11]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text) # Remove special characters and punctuation
    text = text.lower() # Convert to lowercase
    text = ' '.join([word for word in text.split() if word not in stop_words]) # Remove stop words
    return text

data['cleaned_text'] = data['text'].apply(clean_text)
display(data.head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


,label,text,label_num,Index,cleaned_text
0,ham,Subject: enron methanol ; meter # : 988291\r\n...,0,605,subject enron methanol meter 988291 follow not...
1,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0,2349,subject hpl nom january 9 2001 see attached fi...
2,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0,3624,subject neon retreat ho ho ho around wonderful...
3,spam,"Subject: photoshop , windows , office . cheap ...",1,4685,subject photoshop windows office cheap main tr...
4,ham,Subject: re : indian springs\r\nthis deal is t...,0,2030,subject indian springs deal book teco pvr reve...
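
To make the effect of `clean_text` concrete, here is a minimal sketch that applies the same three steps to a single string. It assumes a small hand-picked stop-word set (`DEMO_STOP_WORDS`, hypothetical) in place of NLTK's English list so that it runs without a download; `clean_text_demo` and the sample string are illustrative and only modelled on the dataset.

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list
DEMO_STOP_WORDS = {"we", "re", "the", "to", "for", "is"}

def clean_text_demo(text, stop_words=DEMO_STOP_WORDS):
    # Delete every character that is not a letter, digit, or whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Lowercase before the stop-word check so "We" and "we" are treated alike
    text = text.lower()
    # split() also collapses \r\n and repeated spaces into single separators
    return " ".join(word for word in text.split() if word not in stop_words)

# Illustrative sample in the style of the emails shown above
sample = "Subject: neon retreat\r\nho ho ho , we ' re around"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Because punctuation is deleted rather than replaced by a space, a word such as "don't" would become "dont". The emails in this dataset are already tokenized with spaces around punctuation, so that behaviour rarely matters here, but it may on other corpora.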


## Feature extraction

### Subtask:
Convert the cleaned text data into numerical features using techniques like TF-IDF or Count Vectorization.


**Reasoning**:
Import the necessary vectorizer and transform the cleaned text data into numerical features using TF-IDF.



In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
tfidf_matrix = tfidf_vectorizer.fit_transform(data['cleaned_text'])

print("TF-IDF matrix shape:", tfidf_matrix.shape)

TF-IDF matrix shape: (5171, 5000)
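
The sketch below illustrates, in plain Python, what `ngram_range=(1, 2)` and TF-IDF weighting compute. It follows the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The helper names (`ngrams`, `tfidf`) and the three-document corpus are hypothetical, and the sketch does not implement `max_features`, which in scikit-learn keeps only the most frequent terms across the corpus.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def tfidf(docs):
    """Smoothed, L2-normalized TF-IDF over unigrams and bigrams."""
    term_counts = [Counter(ngrams(doc.split())) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for counts in term_counts for term in counts)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for counts in term_counts:
        weights = {t: tf * idf[t] for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows, idf

# Hypothetical mini-corpus in the style of the cleaned emails
docs = [
    "subject cheap software offer",
    "subject gas nomination attached",
    "subject cheap gas deal",
]
rows, idf = tfidf(docs)

# Terms of the first document, highest weight first
print(sorted(rows[0], key=rows[0].get, reverse=True))
```

A term that occurs in every document, such as `subject` here, gets the minimum IDF of 1.0 and therefore the lowest weight in each row, which is why the leading "subject" token left in `cleaned_text` carries little signal.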


## Model selection

### Subtask:
Choose a suitable classification model for spam detection (e.g., Naive Bayes, SVM, Logistic Regression).


**Reasoning**:
Import the necessary classification models from scikit-learn and explain the reasoning for choosing them for this task.



In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Reasoning for model selection:
# 1. Multinomial Naive Bayes: A fast, standard baseline for text classification.
#    It is designed for count features but usually works with fractional TF-IDF weights as well.
#    It assumes features are conditionally independent given the class, a rough but often workable approximation for text.
# 2. Logistic Regression: A linear model that is often effective for binary classification problems like spam detection.
#    It models the probability that a text is spam, and its coefficients can be read as per-term weights.
# 3. Support Vector Machine (SVM): Finds the maximum-margin boundary between the two classes.
#    With a linear kernel it is well suited to high-dimensional, sparse data like TF-IDF features.
#    Non-linear kernels can capture more complex feature interactions; only the linear kernel is used here.

print("Classification models imported: Multinomial Naive Bayes, Logistic Regression, Support Vector Machine")

Classification models imported: Multinomial Naive Bayes, Logistic Regression, Support Vector Machine
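
For reference, the textbook decision rules behind the three models can be written as follows, where $x$ is the TF-IDF vector of an email, $x_j$ its $j$-th feature, $P(c)$ the class prior, and $w$, $b$, $\theta_{cj}$ learned parameters. This is a sketch of the standard forms only; scikit-learn's implementations add details such as smoothing and regularization.

$$
\hat{y}_{\text{NB}} = \arg\max_{c \in \{\text{ham},\, \text{spam}\}} \Big[ \log P(c) + \sum_j x_j \log \theta_{cj} \Big]
$$

$$
P_{\text{LR}}(\text{spam} \mid x) = \frac{1}{1 + e^{-(w^\top x + b)}}
$$

$$
\hat{y}_{\text{SVM}} = \operatorname{sign}(w^\top x + b)
$$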


## Model training

### Subtask:
Train the selected models on the preprocessed data.


**Reasoning**:
Split the data into training and testing sets and then train the selected models.



In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, data['label_num'], test_size=0.2, random_state=42)

mnb = MultinomialNB()
lr = LogisticRegression()
svc = SVC(kernel='linear')

mnb.fit(X_train, y_train)
lr.fit(X_train, y_train)
svc.fit(X_train, y_train)

print("Models trained successfully.")

Models trained successfully.
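
One caveat about the cell above: the TF-IDF vectorizer was fitted on all 5,171 emails before the split, so vocabulary and IDF statistics from the test emails feed into the training features. A minimal sketch of one way to avoid this, assuming a scikit-learn pipeline is acceptable, is to split the raw text first and let the pipeline fit the vectorizer on the training portion only. The toy corpus, `texts`, `labels`, and `pipeline` are hypothetical stand-ins for `data['cleaned_text']` and `data['label_num']`; `stratify` is an added assumption that keeps the spam/ham ratio the same in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the cleaned emails and their labels
texts = [
    "cheap software offer buy now", "win free prize click now",
    "cheap pills limited offer", "free money click here now",
    "meeting schedule gas nomination", "see attached file nomination",
    "deal book meter volume", "daily gas nomination attached",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text first, preserving the class ratio in both parts
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer inside the pipeline only ever sees the training texts in fit()
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
    SVC(kernel="linear"),
)
pipeline.fit(texts_train, y_train)

predictions = pipeline.predict(texts_test)
print(predictions)
```

With this arrangement the test emails are transformed using the vocabulary learned from the training emails, which is closer to how the model would see unseen mail.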


## Model evaluation

### Subtask:
Evaluate the performance of the trained models using appropriate metrics (e.g., accuracy, precision, recall, F1-score).


**Reasoning**:
Import necessary metrics and evaluate the performance of each trained model on the test data, then print the results.



In [15]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Map display names to the fitted models so each one is scored the same way
models = {
    "Multinomial Naive Bayes": mnb,
    "Logistic Regression": lr,
    "Support Vector Machine": svc,
}

for name, model in models.items():
    # Predict on the held-out test set
    pred = model.predict(X_test)

    # Precision, recall, and F1 are reported for the positive class, label 1 (spam)
    print(f"--- {name} Performance ---")
    print(f"Accuracy: {accuracy_score(y_test, pred):.4f}")
    print(f"Precision: {precision_score(y_test, pred):.4f}")
    print(f"Recall: {recall_score(y_test, pred):.4f}")
    print(f"F1-score: {f1_score(y_test, pred):.4f}")
    print("-" * 40)

--- Multinomial Naive Bayes Performance ---
Accuracy: 0.9459
Precision: 0.8762
Recall: 0.9420
F1-score: 0.9079
----------------------------------------
--- Logistic Regression Performance ---
Accuracy: 0.9836
Precision: 0.9694
Recall: 0.9727
F1-score: 0.9710
----------------------------------------
--- Support Vector Machine Performance ---
Accuracy: 0.9894
Precision: 0.9796
Recall: 0.9829
F1-score: 0.9813
----------------------------------------
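
As a quick sanity check on how these metrics relate, the sketch below recomputes each F1-score from the reported precision and recall using the harmonic-mean formula, and shows how precision and recall follow from raw counts. The function names and the counts `tp=90, fp=10, fn=30` are hypothetical; the precision, recall, and F1 values are copied from the output above. Small differences in the last digit are expected because the inputs are already rounded to four decimals.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (spam) class from raw counts."""
    precision = tp / (tp + fp)  # share of predicted spam that really is spam
    recall = tp / (tp + fn)     # share of actual spam that was caught
    return precision, recall

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, only to illustrate the definitions
p_demo, r_demo = precision_recall(tp=90, fp=10, fn=30)
f1_demo = f1_from(p_demo, r_demo)

# (precision, recall, reported F1) as printed in the evaluation output
reported = {
    "Multinomial Naive Bayes": (0.8762, 0.9420, 0.9079),
    "Logistic Regression": (0.9694, 0.9727, 0.9710),
    "Support Vector Machine": (0.9796, 0.9829, 0.9813),
}

for name, (precision, recall, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_from(precision, recall):.4f}, reported = {f1:.4f}")
```

Naive Bayes shows the largest gap between precision and recall: it catches most spam but flags more legitimate mail than the other two models, which pulls its F1 down.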


## Summary:

### Data Analysis Key Findings

*   The text data was successfully cleaned by removing special characters, converting to lowercase, and removing stop words, resulting in a `cleaned_text` column.
*   The cleaned text data was transformed into a TF-IDF matrix with a shape of (5171, 5000), using `max_features=5000` and `ngram_range=(1, 2)`.
*   Three classification models (Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine) were selected and trained on the data.
*   Model evaluation on the test set showed the following performance:
    *   **Multinomial Naive Bayes:** Accuracy: 0.9459, Precision: 0.8762, Recall: 0.9420, F1-score: 0.9079
    *   **Logistic Regression:** Accuracy: 0.9836, Precision: 0.9694, Recall: 0.9727, F1-score: 0.9710
    *   **Support Vector Machine:** Accuracy: 0.9894, Precision: 0.9796, Recall: 0.9829, F1-score: 0.9813
*   The Support Vector Machine model achieved the highest performance across all evaluated metrics.

### Insights or Next Steps

*   The SVM is the strongest of the three models on this test split, but its lead over Logistic Regression is small (about 0.6 percentage points of accuracy and 1 point of F1), and all scores come from a single 80/20 split.
*   Further steps could involve hyperparameter tuning for the SVM and Logistic Regression models to potentially improve performance, or exploring other feature extraction techniques or models.
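
The hyperparameter tuning mentioned above could be sketched as a cross-validated grid search over the SVM's regularization strength `C`. This is a minimal illustration, not a tuned setup: the toy corpus, the step names `tfidf` and `svc`, the candidate values for `C`, and the choice of `cv=2` with F1 scoring are all assumptions made so the example runs on a handful of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the cleaned emails and their labels
texts = [
    "cheap software offer buy now", "cheap pills offer click now",
    "free prize click now", "free money offer buy now",
    "gas nomination attached file", "daily gas nomination meeting",
    "deal book meter volume", "meter volume deal attached",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Vectorizer and classifier are tuned together so each fold refits the vocabulary
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svc", SVC(kernel="linear")),
])

# Candidate values are illustrative; "svc__C" addresses C on the "svc" step
param_grid = {"svc__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")
search.fit(texts, labels)

print(search.best_params_)
print(f"Mean cross-validated F1: {search.best_score_:.4f}")
```

On the full dataset a larger `cv` (for example 5) and a wider grid would be more informative, and the same pattern applies to `LogisticRegression` by swapping the final pipeline step.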
