<a href="https://colab.research.google.com/github/alixa2003/Arch-Internship-Tasks/blob/main/Email_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Email Spam Classification**

**Objective**

The objective of this task was to build a machine learning model capable of classifying emails as spam or not spam (ham). The task focused on understanding the complete machine learning pipeline, including data preprocessing, feature extraction, model training, evaluation, and deployment using a simple user interface.

##**Importing Libraries**

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

import joblib
import gradio as gr

##**Loading The Dataset**

In [2]:
df = pd.read_csv("/content/email.csv", encoding="latin1")
df.head(5)

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


**Dataset Description**

The dataset used for this task was obtained from Kaggle (Spam Email Classification Dataset).

It contains email messages along with their corresponding labels:

* ham – not spam

* spam – spam email

The dataset was first inspected for missing values and unnecessary columns. Only the relevant columns, `Category` (the label) and `Message` (the email text), were retained. Labels were then converted into numerical form (ham to 0, spam to 1) to make them suitable for machine learning models.

In [3]:
df.shape

(5573, 2)

In [4]:
df.columns

Index(['Category', 'Message'], dtype='object')

In [5]:
df.isnull().sum()

Category    0
Message     0


In [6]:
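# Encode labels as numbers; any value other than 'ham' or 'spam' becomes NaN
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})
# Drop rows whose label could not be mapped
df.dropna(subset=['Category'], inplace=True)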

In [7]:
print(df.head())
print(df['Category'].value_counts())

   Category                                            Message
0       0.0  Go until jurong point, crazy.. Available only ...
1       0.0                      Ok lar... Joking wif u oni...
2       1.0  Free entry in 2 a wkly comp to win FA Cup fina...
3       0.0  U dun say so early hor... U c already then say...
4       0.0  Nah I don't think he goes to usf, he lives aro...
Category
0.0    4825
1.0     747
Name: count, dtype: int64
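
The shape reported earlier is 5573 rows, while the counts above add up to 5572. That is consistent with one row carrying a label other than `ham` or `spam`, which `map` turns into NaN and `dropna` removes. It also explains why the labels print as `0.0` and `1.0`. A minimal sketch on a made-up frame (the rows and the name `toy` are illustrative, not taken from the dataset); the final integer cast is an optional step the notebook does not perform:

```python
import pandas as pd

# Hypothetical toy data; the third label is deliberately invalid.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "not-a-label"],
    "Message": ["see you soon", "win a free prize", "stray row"],
})

# Labels outside the mapping become NaN, which forces a float column.
toy["Category"] = toy["Category"].map({"ham": 0, "spam": 1})
n_unmapped = int(toy["Category"].isna().sum())
print(n_unmapped)  # 1

# Dropping the unmapped row leaves only valid labels.
toy = toy.dropna(subset=["Category"])

# Optional: cast back to integers so labels print as 0/1 instead of 0.0/1.0.
toy = toy.assign(Category=toy["Category"].astype(int))
print(toy["Category"].tolist())  # [0, 1]
```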


##**Splitting the Dataset**

In [8]:
X = df['Message']
y = df['Category']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
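
Spam accounts for only 747 of the 5572 labelled messages, so a purely random split can shift the spam share between the training and test sets. One common option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels; applying it to the real data would change the reported numbers.

```python
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 16 ham (0) and 4 spam (1).
messages = [f"message {i}" for i in range(20)]
labels = [0] * 16 + [1] * 4

# stratify=labels keeps the 80/20 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

print(sum(y_tr), len(y_tr))  # spam count and size of the training part
print(sum(y_te), len(y_te))  # spam count and size of the test part
```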

##**Count Vectorizer And TF-IDF**

In [10]:
count_vectorizer = CountVectorizer(stop_words='english')
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

In [11]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

Text data cannot be directly used by machine learning models, so preprocessing was required. Two different text vectorization techniques were applied to compare their performance:

**Count Vectorizer**

This method converts text into numerical form by counting the frequency of each word in the email. It captures how often spam-related words appear.

**TF-IDF Vectorizer**

This method weights each word's frequency within an email by how rare the word is across all emails, so words that appear in almost every email receive less weight.
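
To make the down-weighting concrete, here is a sketch on made-up messages. With scikit-learn's default settings, the inverse document frequency of a word is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the word. Each row is then scaled to unit length. The messages below are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages: "free" occurs in every one, "prize" in only one.
docs = ["free prize", "free call", "free offer"]

vec = TfidfVectorizer()
weights = vec.fit_transform(docs)
vocab = vec.vocabulary_  # maps each word to its column index

# Within the first message both words occur once, yet the rarer word
# ("prize") receives the larger weight.
w_free = weights[0, vocab["free"]]
w_prize = weights[0, vocab["prize"]]
print(w_prize > w_free)  # True
```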

The dataset was split into training and testing sets before vectorization to avoid data leakage: each vectorizer was fitted on the training set only and then applied, unchanged, to the test set.
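
The effect of fitting on the training data only can be sketched with a made-up split. The vocabulary comes from the training messages, so words that occur only in the test message are ignored instead of influencing the features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up split, for illustration only.
train_docs = ["win a free prize", "see you at lunch"]
test_docs = ["free crypto giveaway"]

vec = CountVectorizer()
X_tr = vec.fit_transform(train_docs)  # vocabulary is learned here only
X_te = vec.transform(test_docs)       # reuses that vocabulary unchanged

# Test columns match the training columns exactly.
print(X_tr.shape[1] == X_te.shape[1])  # True
# Words never seen in training are not added to the vocabulary.
print("crypto" in vec.vocabulary_)     # False
# Only "free" is counted for the test message.
print(int(X_te.sum()))                 # 1
```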

##**Multinomial Naive Bayes And Logistic Regression**

In [12]:
nb_count = MultinomialNB()
nb_count.fit(X_train_count, y_train)

In [13]:
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train)

In [14]:
lr_count = LogisticRegression()
lr_count.fit(X_train_count, y_train)

In [15]:
lr_tfidf = LogisticRegression()
lr_tfidf.fit(X_train_tfidf, y_train)

**Model Training**

Two machine learning models were trained using both vectorization techniques:

* Multinomial Naive Bayes

* Logistic Regression

This resulted in four model combinations:

* Naive Bayes + Count Vectorizer

* Naive Bayes + TF-IDF

* Logistic Regression + Count Vectorizer

* Logistic Regression + TF-IDF
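
The four combinations can also be produced in a loop rather than in four separate cells. This is a sketch on made-up training data; the dictionary names and the `fitted` container are illustrative and not part of the notebook.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Made-up training data, for illustration only (1 = spam, 0 = ham).
train_docs = ["win free prize now", "free cash win",
              "see you at lunch", "meeting at noon tomorrow"]
train_labels = [1, 1, 0, 0]

# Classes are stored (not instances) so every combination gets fresh objects.
vectorizers = {"Count Vectorizer": CountVectorizer, "TF-IDF": TfidfVectorizer}
classifiers = {"Naive Bayes": MultinomialNB,
               "Logistic Regression": LogisticRegression}

fitted = {}
for (clf_name, clf_cls), (vec_name, vec_cls) in product(
        classifiers.items(), vectorizers.items()):
    vec = vec_cls(stop_words="english")
    clf = clf_cls()
    clf.fit(vec.fit_transform(train_docs), train_labels)
    # Keep each vectorizer with its model; they must be used together.
    fitted[f"{clf_name} + {vec_name}"] = (vec, clf)

for name in fitted:
    print(name)
```

Storing the vectorizer next to its model avoids accidentally scoring a TF-IDF model on count features or the reverse.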

In [16]:
def evaluate_model(model, X_test, y_test, model_name):
    """Print accuracy, classification report, and confusion matrix for a fitted model."""
    # Predict labels for the vectorized test set
    y_pred = model.predict(X_test)

    print(f"\n===== {model_name} =====")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred))
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


**Model Evaluation**

Each model was evaluated using:

* Accuracy

* Precision, Recall, and F1-score (Classification Report)

* Confusion Matrix

Special attention was given to recall for spam emails, because this task treats a missed spam email as more harmful than a normal email incorrectly flagged as spam.
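
To show how these metrics relate to the confusion matrix, here is a sketch on made-up labels. It derives spam recall and precision by hand and checks them against scikit-learn's `recall_score` and `precision_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall: share of actual spam that was caught.
manual_recall = tp / (tp + fn)
# Precision: share of spam predictions that were correct.
manual_precision = tp / (tp + fp)

print(manual_recall, recall_score(y_true, y_pred))        # 0.75 0.75
print(manual_precision, precision_score(y_true, y_pred))  # 0.75 0.75
```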

In [17]:
evaluate_model(nb_count, X_test_count, y_test, "Naive Bayes + Count Vectorizer")
evaluate_model(nb_tfidf, X_test_tfidf, y_test, "Naive Bayes + TF-IDF")

evaluate_model(lr_count, X_test_count, y_test, "Logistic Regression + Count Vectorizer")
evaluate_model(lr_tfidf, X_test_tfidf, y_test, "Logistic Regression + TF-IDF")



===== Naive Bayes + Count Vectorizer =====
Accuracy: 0.9874439461883409

Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      1.00      0.99       966
         1.0       0.97      0.93      0.95       149

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115


Confusion Matrix:
 [[962   4]
 [ 10 139]]

===== Naive Bayes + TF-IDF =====
Accuracy: 0.9766816143497757

Classification Report:
               precision    recall  f1-score   support

         0.0       0.97      1.00      0.99       966
         1.0       1.00      0.83      0.90       149

    accuracy                           0.98      1115
   macro avg       0.99      0.91      0.95      1115
weighted avg       0.98      0.98      0.98      1115


Confusion Matrix:
 [[966   0]
 [ 26 123]]

===== Logistic Regression + Count Vectorizer =====
Accuracy: 0.9856502242

Among all tested combinations, the Multinomial Naive Bayes model using Count Vectorizer achieved the highest accuracy (98.74%) and the best recall for spam emails (0.93, catching 139 of the 149 spam emails in the test set). This suggests that simple word frequency features are highly effective for spam detection in this dataset. TF-IDF based models, while precise, showed lower recall: Naive Bayes with TF-IDF flagged no ham email as spam but missed 26 of the 149 spam emails (recall 0.83), making it less suitable for this task.
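
The figures for the two Naive Bayes variants can be re-derived from the confusion matrices printed in the evaluation output. The helper name `spam_metrics` in this short check is illustrative.

```python
import numpy as np

# Confusion matrices copied from the evaluation output
# (rows: true ham, true spam; columns: predicted ham, predicted spam).
matrices = {
    "Naive Bayes + Count Vectorizer": np.array([[962, 4], [10, 139]]),
    "Naive Bayes + TF-IDF": np.array([[966, 0], [26, 123]]),
}

def spam_metrics(cm):
    """Return (accuracy, spam precision, spam recall) for a 2x2 matrix."""
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tn + tp) / cm.sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

for name, cm in matrices.items():
    acc, prec, rec = spam_metrics(cm)
    print(f"{name}: accuracy={acc:.4f} precision={prec:.2f} recall={rec:.2f}")
# Naive Bayes + Count Vectorizer: accuracy=0.9874 precision=0.97 recall=0.93
# Naive Bayes + TF-IDF: accuracy=0.9767 precision=1.00 recall=0.83
```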

##**Saving The Model**

In [18]:
joblib.dump(nb_count, "spam_model.pkl")
joblib.dump(count_vectorizer, "vectorizer.pkl")

['vectorizer.pkl']

In [19]:
model = joblib.load("spam_model.pkl")
vectorizer = joblib.load("vectorizer.pkl")
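
The notebook saves the model and the vectorizer as two files, which must always be loaded and used together. An alternative, not used here, is to wrap both in a scikit-learn `Pipeline` and save a single file. The sketch below uses made-up training data; the file name `spam_pipeline.pkl` and the variable names are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data, for illustration only (1 = spam, 0 = ham).
train_docs = ["win free prize now", "free cash win",
              "see you at lunch", "meeting at noon tomorrow"]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes raw text and classifies it in one object.
spam_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("classifier", MultinomialNB()),
])
spam_pipeline.fit(train_docs, train_labels)

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "spam_pipeline.pkl")
joblib.dump(spam_pipeline, path)

# The reloaded pipeline accepts raw strings directly.
restored = joblib.load(path)
prediction = restored.predict(["free prize inside"])[0]
print(prediction)  # 1
```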

##**Model Deployment**

The best-performing model (Naive Bayes with Count Vectorizer) was saved and integrated into a Gradio-based user interface. The interface allows users to enter any email text and instantly receive a prediction indicating whether the email is spam or not spam.

In [20]:
def predict_spam(email_text):
    email_vector = vectorizer.transform([email_text])
    prediction = model.predict(email_vector)[0]

    if prediction == 1:
        return "🚨 Spam Email"
    else:
        return "✅ Not Spam (Ham)"
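
The function above returns only a label. A possible variation, shown here as a sketch, also reports the model's estimated class probabilities through `predict_proba`. To keep the example self-contained, `demo_vectorizer` and `demo_model` are stand-ins fitted on made-up data, not the saved `vectorizer` and `model`. The function name is hypothetical as well.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stand-ins for the loaded `vectorizer` and `model`, fitted on made-up data
# so the sketch runs on its own (1 = spam, 0 = ham).
train_docs = ["win free prize now", "free cash win",
              "see you at lunch", "meeting at noon tomorrow"]
demo_vectorizer = CountVectorizer(stop_words="english").fit(train_docs)
demo_model = MultinomialNB().fit(
    demo_vectorizer.transform(train_docs), [1, 1, 0, 0]
)

def predict_spam_with_confidence(email_text):
    """Return the estimated probability of each class for one email."""
    email_vector = demo_vectorizer.transform([email_text])
    # Columns of predict_proba follow the order of demo_model.classes_.
    probabilities = demo_model.predict_proba(email_vector)[0]
    names = {0: "Not Spam (Ham)", 1: "Spam"}
    return {names[int(c)]: float(p)
            for c, p in zip(demo_model.classes_, probabilities)}

result = predict_spam_with_confidence("free prize inside")
print(result)
```

A dictionary of this shape could feed a label-style output in Gradio, which accepts label-to-confidence mappings, although the notebook's interface uses plain text output.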

In [21]:
interface = gr.Interface(
    fn=predict_spam,
    inputs=gr.Textbox(lines=6, placeholder="Paste email text here..."),
    outputs="text",
    title="Spam Email Classification",
    description="This app classifies emails as Spam or Not Spam using Machine Learning."
)

interface.launch()

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://95b5ea330445ca6beb.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [22]:
test_email = """
Congratulations! You have won a free vacation.
Click now to claim your reward.
"""

print(predict_spam(test_email))


🚨 Spam Email
