# Email Spam Detection (Phase 2 Task)

CVIP (Coderscave)

Normal Task : Email Spam Detection

Name - Aryan Dutta

DATASET - Kaggle

This notebook analyzes an email spam classification dataset sourced from Kaggle.

- It builds a machine learning-based spam classifier in Python. The data is loaded and explored, the `Category` column is converted into a binary spam/non-spam label, the data is split into training and test sets, and a text classification pipeline (`CountVectorizer` followed by a Multinomial Naive Bayes classifier) is trained. The trained model is then used to predict whether example email strings are spam or non-spam.

# Step 0: Import necessary libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Step 1: Read the dataset from a CSV file

In [2]:
df = pd.read_csv("spam.csv")

# Step 2: Explore the dataset

In [3]:
# Display the first 5 rows of the dataset
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
# Display the last 5 rows of the dataset
df.tail()

Unnamed: 0,Category,Message
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name
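
Beyond `head()` and `tail()`, a few quick checks can confirm the size of the data and flag missing or duplicated rows. The sketch below runs on a tiny in-memory CSV with made-up messages (an assumption, so that it runs without `spam.csv`); only the `Category` and `Message` column names come from the dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for spam.csv: same column names, invented rows.
csv_text = """Category,Message
ham,See you at lunch
spam,You have won a prize
ham,See you at lunch
ham,Call me when you get home
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns
n_rows, n_cols = sample.shape
# Count of missing values in each column
missing_per_column = sample.isnull().sum()
# Rows that exactly repeat an earlier row
n_duplicates = int(sample.duplicated().sum())

print(f"Rows: {n_rows}, columns: {n_cols}")
print(missing_per_column)
print(f"Duplicate rows: {n_duplicates}")
```

On the real data, the same calls would be made on `df` instead of `sample`.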


# Step 3: Create a binary label for spam/non-spam

In [5]:
# Convert the 'Category' column to a binary label 'IsSpam' (1 for 'spam', 0 for 'ham')
df['IsSpam'] = (df['Category'] == 'spam').astype(int)

In [6]:
# Display the first 10 rows of the dataset with the new 'IsSpam' column
df.head(10)

Unnamed: 0,Category,Message,IsSpam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
5,spam,FreeMsg Hey there darling it's been 3 week's n...,1
6,ham,Even my brother is not like to speak with me. ...,0
7,ham,As per your request 'Melle Melle (Oru Minnamin...,0
8,spam,WINNER!! As a valued network customer you have...,1
9,spam,Had your mobile 11 months or more? U R entitle...,1
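
Spam datasets are often imbalanced, so it can help to check how many rows carry each label before splitting. A small sketch on a hypothetical six-row frame (the rows are invented; the column names `Category` and `IsSpam` follow the notebook):

```python
import pandas as pd

# Hypothetical miniature stand-in for df, with invented messages.
sample = pd.DataFrame({
    "Category": ["ham", "ham", "spam", "ham", "spam", "ham"],
    "Message": [
        "See you at lunch",
        "Call me later",
        "You have won a prize",
        "Running late, sorry",
        "Claim your free entry now",
        "Thanks for the report",
    ],
})
sample["IsSpam"] = (sample["Category"] == "spam").astype(int)

# Number of rows per label (0 = ham, 1 = spam)
class_counts = sample["IsSpam"].value_counts()
# The mean of a 0/1 column is the fraction of rows labelled spam
spam_fraction = sample["IsSpam"].mean()

print(class_counts)
print(f"Spam fraction: {spam_fraction:.2f}")
```

If the spam fraction is small, accuracy alone can look good even for a model that rarely predicts spam.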


# Step 4: Split the dataset into training and testing sets

In [7]:
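# Hold out 25% of the messages for testing; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.IsSpam, test_size=0.25, random_state=42)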
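
One possible refinement, shown here only as a sketch, is to stratify the split so that the training and test sets keep the same spam ratio. The toy frame below is hypothetical (20 invented rows, 4 of them spam); `stratify` and `random_state` are standard `train_test_split` parameters.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 messages, 4 labelled spam (20%).
toy = pd.DataFrame({
    "Message": [f"message {i}" for i in range(20)],
    "IsSpam": [1] * 4 + [0] * 16,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["Message"],
    toy["IsSpam"],
    test_size=0.25,          # 5 of the 20 rows go to the test set
    random_state=42,         # fixed seed so the split is repeatable
    stratify=toy["IsSpam"],  # keep the spam ratio equal in both parts
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

With stratification, both parts keep the 20% spam share of the toy data.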

# Step 5: Build a text classification pipeline

In [8]:
# Create a text classification pipeline using CountVectorizer and Multinomial Naive Bayes
text_classifier = Pipeline([
    ('vectorizer', CountVectorizer()),  # Convert text data into numerical features
    ('nb', MultinomialNB())  # Train a Multinomial Naive Bayes classifier
])

# Train the text classification model on the training data
text_classifier.fit(X_train, y_train)
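
Before relying on the classifier, it is worth scoring it on held-out data. A minimal sketch, assuming a pipeline like the one above and using hypothetical toy messages in place of `X_train`, `y_train`, `X_test`, and `y_test`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real train/test split.
train_texts = [
    "win a free prize now",
    "free cash claim your prize",
    "urgent winner claim free entry",
    "are we meeting for lunch today",
    "please send the report tomorrow",
    "see you at home tonight",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["claim your free prize", "lunch meeting tomorrow"]
test_labels = [1, 0]

demo_classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
demo_classifier.fit(train_texts, train_labels)

# Predict on messages the model has not seen and compare with the true labels
predicted = demo_classifier.predict(test_texts)
accuracy = accuracy_score(test_labels, predicted)
# Rows are true classes, columns are predicted classes, in the order [ham, spam]
matrix = confusion_matrix(test_labels, predicted, labels=[0, 1])

print(f"Accuracy: {accuracy:.2f}")
print(matrix)
```

With the real data, the equivalent calls would be `text_classifier.predict(X_test)` followed by `accuracy_score(y_test, ...)`.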

# Step 6: Predict spam or non-spam for a list of email strings

In [9]:
# Create a list of email strings (spam and non-spam)
emails = [
    "Congratulations! You've won a free vacation!",
    "Hello, could you please send me the report by tomorrow?"
]

# Use the trained model to predict whether the emails are spam or non-spam
email_predictions = text_classifier.predict(emails)

# Print the predictions
print("Email Predictions:", email_predictions)


Email Predictions: [1 0]
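
The hard 0/1 labels hide how confident the model is. As a sketch, `predict_proba` returns a per-class probability for each message. The training messages below are hypothetical and far smaller than the real dataset, so the printed numbers are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data (1 = spam, 0 = ham).
train_texts = [
    "win a free prize now",
    "free vacation claim your prize",
    "urgent winner claim free entry",
    "are we meeting for lunch today",
    "please send the report tomorrow",
    "see you at home tonight",
]
train_labels = [1, 1, 1, 0, 0, 0]

demo_classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
demo_classifier.fit(train_texts, train_labels)

emails = [
    "Congratulations! You've won a free vacation!",
    "Hello, could you please send me the report by tomorrow?",
]

# One row per email, one column per class
probabilities = demo_classifier.predict_proba(emails)
# Locate the column that corresponds to the spam label (1)
spam_column = list(demo_classifier.named_steps["nb"].classes_).index(1)

for text, row in zip(emails, probabilities):
    print(f"P(spam) = {row[spam_column]:.2f}  {text}")
```

On the notebook's model, the same call would be `text_classifier.predict_proba(...)`.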
