Email spam is unsolicited mail, often sent in bulk for advertising, phishing, or spreading malware. The goal is to build a **machine learning model** that classifies a given message as:

- **Spam** (unwanted)
- **Ham** (legitimate)

This helps email services filter unwanted content automatically.

In [35]:
#importing the dependencies

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Data collection and preprocessing:

In [2]:
# loading the dataset to a pandas DataFrame
raw_mail_dataset=pd.read_csv('/content/mail_data.csv')

In [3]:
print(raw_mail_dataset)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


In [4]:
# replace null values with an empty string

mail_dataset=raw_mail_dataset.where(pd.notnull(raw_mail_dataset),'')
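As a small aside, the null replacement above can also be written with `fillna`. The sketch below uses a made-up two-row frame (`toy` is a hypothetical name, not the real `mail_data.csv`) to show that both spellings leave an empty string where the message was missing.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one missing message (not the real dataset)
toy = pd.DataFrame({"Category": ["ham", "spam"], "Message": ["hello", np.nan]})

# The notebook's approach: keep values that are not null, otherwise use ''
via_where = toy.where(pd.notnull(toy), "")

# A shorter spelling of the same idea
via_fillna = toy.fillna("")

print(via_where["Message"].tolist())
print(via_fillna["Message"].tolist())
```

Either form should work here; `fillna('')` is simply the more common idiom.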

In [5]:
mail_dataset.head(10)

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
5     spam  FreeMsg Hey there darling it's been 3 week's n...
6      ham  Even my brother is not like to speak with me. ...
7      ham  As per your request 'Melle Melle (Oru Minnamin...
8     spam  WINNER!! As a valued network customer you have...
9     spam  Had your mobile 11 months or more? U R entitle...


In [6]:
mail_dataset.shape

(5572, 2)

Label Encoding: label encoding converts categorical (text) labels into numbers so that machine learning algorithms can work with them. Since this is a binary classification problem, the two labels are encoded like this:

Original:   ['spam', 'ham', 'spam', 'ham']
Encoded:    [0, 1, 0, 1]
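The encoding described above can be sketched with an explicit mapping. This is a minimal illustration on a toy `Series`; the names `labels` and `label_map` are assumptions made for the example and do not appear in the notebook.

```python
import pandas as pd

# Toy labels mirroring the example above (not the real dataset)
labels = pd.Series(["spam", "ham", "spam", "ham"])

# Explicit mapping keeps the notebook's convention: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
encoded = labels.map(label_map).astype("int")

print(encoded.tolist())  # [0, 1, 0, 1]
```

An explicit dictionary makes the convention visible in the code. Note that scikit-learn's `LabelEncoder` assigns integers in sorted class order, which would give ham = 0 and spam = 1, the opposite of the convention used here.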


In [7]:
mail_dataset.loc[mail_dataset['Category'] == 'spam', 'Category'] = 0
mail_dataset.loc[mail_dataset['Category'] == 'ham', 'Category'] = 1

spam=0
ham=1

In [8]:
# separate the data as text and labels

X=mail_dataset['Message']
Y=mail_dataset['Category']

In [9]:
print(X)
print(Y)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object
0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


In [10]:
#splitting the dataset into training data and test data

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2) #random_state fixes the shuffle seed, so the same split is produced on every run

In [11]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(5572,)
(4457,)
(1115,)
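To illustrate what `random_state` does in the split above, here is a small sketch on made-up data. It also uses the `stratify` argument, which the notebook does not use, to keep the spam/ham ratio equal in both parts; the toy lists `X_toy` and `y_toy` are assumptions for the example.

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 messages, 2 of them spam (label 0), mimicking class imbalance
X_toy = [f"message {i}" for i in range(10)]
y_toy = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Same random_state -> identical split on every call
a = train_test_split(X_toy, y_toy, test_size=0.5, random_state=2, stratify=y_toy)
b = train_test_split(X_toy, y_toy, test_size=0.5, random_state=2, stratify=y_toy)
print(a[1] == b[1])  # the two test portions contain the same messages

# stratify keeps the class ratio the same in the train and test halves
y_train_toy, y_test_toy = a[2], a[3]
print(sorted(y_train_toy), sorted(y_test_toy))
```

On an imbalanced dataset like this one (far more ham than spam), stratifying may give a more representative test set, though whether it changes the reported accuracy here has not been checked.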


Feature Extraction

In [12]:
#transform the text data to feature vectors that can be used as input to the Logistic regression

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

#convert Y_train and Y_test as integers

Y_train=Y_train.astype('int')
Y_test=Y_test.astype('int')
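The reason for calling `fit_transform` on the training text but only `transform` on the test text can be seen on a tiny example. The sentences below are invented for illustration; only the `TfidfVectorizer` settings are taken from the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus (not from mail_data.csv)
train_texts = ["free prize claim now", "meeting tomorrow at noon", "claim your free gift"]
test_texts = ["free lunch tomorrow"]

vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# The vocabulary is learned from the training text only
train_features = vec.fit_transform(train_texts)

# Test text is mapped onto that same vocabulary; unseen words are dropped
test_features = vec.transform(test_texts)

print(train_features.shape[1] == test_features.shape[1])  # same number of columns
print("lunch" in vec.vocabulary_)  # word seen only at test time
```

Fitting the vectorizer on the test data as well would leak information from the test set into the features, which is why the two calls differ.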

In [13]:
print(X_train_features)
print(X_test_features)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 34768 stored elements and shape (4457, 7458)>
  Coords	Values
  (0, 6927)	0.48935591439341625
  (0, 6586)	0.44333254982109394
  (0, 3958)	0.6161071828926097
  (0, 4334)	0.42941702167641554
  (1, 3168)	0.5869421390016224
  (1, 6971)	0.4281243465155688
  (1, 1428)	0.5869421390016224
  (1, 2121)	0.35736171430221464
  (2, 6878)	0.35749230587184955
  (2, 1876)	0.28751725124107325
  (2, 5894)	0.35749230587184955
  (2, 806)	0.26730249393705324
  (2, 5695)	0.35749230587184955
  (2, 4884)	0.35749230587184955
  (2, 3852)	0.3408491178137899
  (2, 7353)	0.31988118061968496
  (2, 5115)	0.3408491178137899
  (3, 1876)	0.3080768784015236
  (3, 7297)	0.22192369472149484
  (3, 7000)	0.30072945056088285
  (3, 7065)	0.32795623716393424
  (3, 2060)	0.24915048132454623
  (3, 5005)	0.3169028431039865
  (3, 7248)	0.23571908490908416
  (3, 300)	0.2915969875465198
  :	:
  (4454, 4627)	0.3831814754124698
  (4454, 311)	0.19547195974237946
  (4454, 5068

Training the Model:

Logistic Regression

In [14]:
model=LogisticRegression()

In [15]:
# training the LogisticRegression model with the training data

model.fit(X_train_features,Y_train)

Evaluating the trained model

In [16]:
# prediction on training data

X_train_prediction=model.predict(X_train_features)
training_data_accuracy=accuracy_score(Y_train, X_train_prediction)  # argument order is (y_true, y_pred)

In [32]:
print('Accuracy score of the training data : ',  round(training_data_accuracy, 4) )

Accuracy score of the training data :  0.9686


In [33]:
# prediction on test data

X_test_prediction = model.predict(X_test_features)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # argument order is (y_true, y_pred)

print("Accuracy on test data:", round(test_data_accuracy, 4))


Accuracy on test data: 0.9534
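Accuracy alone can hide how well the minority class (spam) is detected. The notebook imports `confusion_matrix` and `classification_report` but does not call them, so here is a hedged sketch of how they could be used. The label lists are invented for illustration and are not the model's real predictions.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical labels using the notebook's convention: 0 = spam, 1 = ham
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes, in the order [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Per-class precision, recall and F1
print(classification_report(y_true, y_pred, labels=[0, 1], target_names=["spam", "ham"]))
```

In this toy case the accuracy is 0.8, yet only one of the three spam messages is caught. Running the same two calls on `Y_test` and `X_test_prediction` would show whether the real model has a similar gap.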


Building a Predictive System

In [62]:
# serializing the trained model
import os
import joblib

In [63]:
# Folder and file name
model_folder = "./models/"
model_file_name = "spam_model.joblib"
vectorizer_file_name = "vectorizer.joblib"

In [64]:
# ✅ Creating the folder if it doesn't exist
os.makedirs(model_folder, exist_ok=True)

In [65]:
# Now saving the model and the vectorizer
joblib.dump(model, model_folder + model_file_name)
joblib.dump(feature_extraction, model_folder + vectorizer_file_name)
print("✅ Model and vectorizer saved successfully!")

✅ Model and vectorizer saved successfully!


In [66]:
#Loading the saved model and vectorizer
model = joblib.load(model_folder + model_file_name)
feature_extraction = joblib.load(model_folder + vectorizer_file_name)
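Saving the vectorizer and the model as two files works, but the two must always be loaded together. One alternative, which is not what the notebook does, is to bundle both steps in a scikit-learn `Pipeline` and persist a single object. The sketch below assumes a toy training set and a hypothetical file name `spam_pipeline.joblib`, and writes to a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data (0 = spam, 1 = ham), far too small for real use
texts = ["free prize claim now", "win cash today", "see you at lunch", "meeting moved to friday"]
labels = [0, 0, 1, 1]

# Vectorizer and classifier travel together as one estimator
pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
pipeline.fit(texts, labels)

# Round-trip through joblib in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spam_pipeline.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

sample = ["claim your free prize"]
print(restored.predict(sample) == pipeline.predict(sample))
```

With a pipeline, raw text goes straight into `predict`, so there is no separate `transform` call to forget when the model is loaded elsewhere.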

In [67]:
# Sample input
input_mail = ["I've been searching for the right words to thank you for this breather."]

In [68]:
#convert text to feature vector using loaded vectorizer

input_mail_features=feature_extraction.transform(input_mail)

In [69]:
#making predictions

prediction=model.predict(input_mail_features)
print(prediction)

[1]


In [70]:
if prediction[0] == 1:
    print("📬 Inbox-worthy! This one's a *Ham mail* — safe and sound. 🕊️")
else:
    print("🚨 Whoa there! This smells like *Spam mail* — better not click that link. 🧂🐟")


📬 Inbox-worthy! This one's a *Ham mail* — safe and sound. 🕊️


In [71]:
def predict_spam(email_text):
    """
    Function to predict if an email is spam or ham
    Args:
        email_text: string containing the email text
    Returns:
        string with the prediction result
    """
    # Convert text to feature vector
    input_features = feature_extraction.transform([email_text])

    # Make prediction
    prediction = model.predict(input_features)

    # Return human-readable result
    if prediction[0] == 1:
        return "📬 Ham mail - This looks safe!"
    else:
        return "🚨 Spam mail - Be careful with this one!"

Creating a Gradio Interface

In [52]:
!pip3 install gradio



In [77]:
import gradio as gr

In [None]:
import gradio as gr
import joblib
import numpy as np

# Load model and vectorizer with error handling
try:
    model = joblib.load("./models/spam_model.joblib")
    feature_extraction = joblib.load("./models/vectorizer.joblib")
    print("✅ Model and vectorizer loaded successfully!")
except Exception as e:
    print(f"❌ Error loading model: {e}")
    raise

# Custom CSS for styling
custom_css = """
.gradio-container {
    font-family: 'Helvetica', Arial, sans-serif;
    max-width: 800px;
    margin: auto;
}
.header {
    text-align: center;
    margin-bottom: 20px;
}
.result-box {
    padding: 15px;
    border-radius: 8px;
    margin-top: 15px;
    font-size: 16px;
    text-align: center;
}
.ham-result {
    background-color: #e6f7e6;
    color: #2e7d32;
    border: 1px solid #a5d6a7;
}
.spam-result {
    background-color: #ffebee;
    color: #c62828;
    border: 1px solid #ef9a9a;
}
.footer {
    margin-top: 20px;
    font-size: 12px;
    text-align: center;
    color: #666;
}
.example-container {
    margin-top: 15px;
}
"""

def predict_spam(input_mail):
    """Enhanced prediction function with error handling"""
    try:
        if not input_mail.strip():
            return "⚠️ Please enter some text"

        print(f"Processing text: {input_mail[:50]}...")  # Debug log

        # Transform input
        input_features = feature_extraction.transform([input_mail])
        print(f"Features shape: {input_features.shape}")  # Debug log

        # Make prediction
        prediction = model.predict(input_features)
        print(f"Raw prediction: {prediction}")  # Debug log

        # Return formatted result with CSS classes
        if prediction[0] == 1:
            return """<div class='result-box ham-result'>📬 Ham mail - This looks safe!</div>"""
        else:
            return """<div class='result-box spam-result'>🚨 Spam mail - Be careful!</div>"""

    except Exception as e:
        print(f"Prediction error: {e}")  # Debug log
        return f"❌ Error processing your request: {str(e)}"

# Create interface with better configuration
iface = gr.Interface(
    fn=predict_spam,
    inputs=gr.Textbox(lines=7, placeholder="Paste email text here...", label="Email Text"),
    outputs=gr.HTML(label="Prediction Result"),
    title="Spam Detector App",
    description="Detects whether an email is spam or legitimate (ham)",
    examples=[
        ["Free offer! Claim your prize now!"],
        ["Hi John, just checking in about our meeting tomorrow"],
        ["You've won a $1000 gift card! Click here to claim"]
    ],
    allow_flagging="never",
    css=custom_css
)

# Launch with debugging
print("Launching interface...")
iface.launch(debug=True)

✅ Model and vectorizer loaded successfully!




Launching interface...
It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://4acf58edd25dd7b3a2.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Processing text: Free offer! Claim your prize now!...
Features shape: (1, 7458)
Raw prediction: [0]
Processing text: You've won a $1000 gift card! Click here to claim...
Features shape: (1, 7458)
Raw prediction: [0]
Processing text: Hi John, just checking in about our meeting tomorr...
Features shape: (1, 7458)
Raw prediction: [1]
Processing text: hi bebe...
Features shape: (1, 7458)
Raw prediction: [1]
Processing text: hi free stuff for you come get them...
Features shape: (1, 7458)
Raw prediction: [1]
Processing text:  free stuff for you come get them at major discoun...
Features shape: (1, 7458)
Raw prediction: [1]
Processing text: free offer claim ours today...
Features shape: (1, 7458)
Raw prediction: [0]
Processing text: You've won a $1000 gift card! Click here to claim...
Features shape: (1, 7458)
Raw prediction: [0]


In [59]:
# import gradio as gr
# import joblib
# import numpy as np

# # Load model and vectorizer with error handling
# try:
#     model = joblib.load("./models/spam_model.joblib")
#     feature_extraction = joblib.load("./models/vectorizer.joblib")
#     print("✅ Model and vectorizer loaded successfully!")
# except Exception as e:
#     print(f"❌ Error loading model: {e}")
#     raise

# def predict_spam(input_mail):
#     """Enhanced prediction function with error handling"""
#     try:
#         if not input_mail.strip():
#             return "⚠️ Please enter some text"

#         print(f"Processing text: {input_mail[:50]}...")  # Debug log

#         # Transform input
#         input_features = feature_extraction.transform([input_mail])
#         print(f"Features shape: {input_features.shape}")  # Debug log

#         # Make prediction
#         prediction = model.predict(input_features)
#         print(f"Raw prediction: {prediction}")  # Debug log

#         # Return formatted result
#         return "📬 Ham mail - This looks safe!" if prediction[0] == 1 else "🚨 Spam mail - Be careful!"

#     except Exception as e:
#         print(f"Prediction error: {e}")  # Debug log
#         return f"❌ Error processing your request: {str(e)}"

# # Create interface with better configuration
# iface = gr.Interface(
#     fn=predict_spam,
#     inputs=gr.Textbox(lines=7, placeholder="Paste email text here...", label="Email Text"),
#     outputs=gr.Textbox(label="Prediction Result"),
#     title="Spam Detector App",
#     description="Detects whether an email is spam or legitimate (ham)",
#     examples=[
#         ["Free offer! Claim your prize now!"],
#         ["Hi John, just checking in about our meeting tomorrow"],
#         ["You've won a $1000 gift card! Click here to claim"]
#     ],
#     allow_flagging="never"
# )

# # Launch with debugging
# print("Launching interface...")
# iface.launch(
#     share=True,
#     debug=True  # Enable Gradio's debug mode
# )

✅ Model and vectorizer loaded successfully!
Launching interface...




Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://2861f4750453d2d43b.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Processing text: You've won a $1000 gift card! Click here to claim...
Features shape: (1, 7458)
Raw prediction: [0]
Processing text: Hi John, just checking in about our meeting tomorr...
Features shape: (1, 7458)
Raw prediction: [1]
Processing text: Free offer! Claim your prize now!...
Features shape: (1, 7458)
Raw prediction: [0]
Processing text: You've won a $1000 gift card! Click here to claim...
Features shape: (1, 7458)
Raw prediction: [0]
Processing text: hi baby...
Features shape: (1, 7458)
Raw prediction: [1]
Processing text: whats up
...
Features shape: (1, 7458)
Raw prediction: [1]
Processing text: get your new cd today free free free...
Features shape: (1, 7458)
Raw prediction: [0]
Processing text: get your babe...
Features shape: (1, 7458)
Raw prediction: [1]
Processing text: get this free now...
Features shape: (1, 7458)
Raw prediction: [0]
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://9e3e0ac8d72428f6f9.gradio.live
Killin

