<a href="https://colab.research.google.com/github/anasadh/spam_ham_mail_prediction/blob/main/SpamHamPredictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

Data preprocessing

In [None]:
# load the dataset to pandas data frame
raw_mail_data = pd.read_csv('spam.csv', encoding='latin-1')
# replace the null values with empty strings
mail_data = raw_mail_data.where((pd.notnull(raw_mail_data)), '')


In [None]:
mail_data.shape

(5572, 5)

In [None]:
mail_data.head()  # first five rows of the dataset

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
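
The three `Unnamed` columns in the preview above are empty for these rows. A minimal sketch of one way to trim the frame to the label and text columns follows; it assumes a small stand-in DataFrame with invented values, and the names `toy`, `clean`, `label` and `text` are illustrative only, not part of the notebook.

```python
import pandas as pd

# Stand-in frame with the same column layout as the preview above.
# The values are invented for illustration and are not taken from spam.csv.
toy = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["Ok lar...", "Free entry..."],
    "Unnamed: 2": ["", ""],
    "Unnamed: 3": ["", ""],
    "Unnamed: 4": ["", ""],
})

# Keep only the label and text columns and give them descriptive names.
clean = toy[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})
print(clean.columns.tolist())  # ['label', 'text']
```

The rest of the notebook keeps the original `v1` and `v2` names, so this step is optional.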


In [None]:
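# add a numeric 'category' column: spam mail as 0, ham mail as 1
# (this column is not used later; the model is trained on 'v1' encoded by
# LabelEncoder, which maps ham -> 0 and spam -> 1)
mail_data.loc[mail_data['v1'] == 'spam', 'category'] = 0
mail_data.loc[mail_data['v1'] == 'ham', 'category'] = 1
mail_data.columns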

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'category'], dtype='object')


In [None]:
# separate the data into text and label. x --> text, y --> label
x = mail_data['v2']
y = mail_data['v1']
print(x)
print('.................................................')
print(y)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: v2, Length: 5572, dtype: object
.................................................
0        ham
1        ham
2       spam
3        ham
4        ham
        ... 
5567    spam
5568     ham
5569     ham
5570     ham
5571     ham
Name: v1, Length: 5572, dtype: object


Train Test Split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=3)
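
The split above is purely random. If the spam share should be the same in both partitions, `train_test_split` accepts a `stratify` argument. The sketch below uses a made-up list of 20 messages (4 spam, 16 ham) rather than the real data, and the variable names are placeholders.

```python
from sklearn.model_selection import train_test_split

# Toy data: 20 placeholder messages, 20% of them labelled spam.
texts = [f"message {i}" for i in range(20)]
labels = ["spam"] * 4 + ["ham"] * 16

# stratify=labels keeps the class proportions equal in train and test.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=3, stratify=labels
)
print(len(y_te), y_te.count("spam"))  # 5 1
```

Whether stratification changes the results on this dataset has not been checked here; it is only one option to consider.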

Feature extraction

In [None]:
# Transform the text data to feature vectors using TfidfVectorizer
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
x_train_features = feature_extraction.fit_transform(x_train)
x_test_features = feature_extraction.transform(x_test)

# Convert y_train and y_test values to integers using LabelEncoder
# (classes are sorted alphabetically, so ham -> 0 and spam -> 1)
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)
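
Every later step depends on which integer `LabelEncoder` assigns to which class, so it is worth checking. The short sketch below fits a separate encoder (named `enc` here for illustration) on three toy labels and prints the mapping.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["ham", "spam", "ham"])

# classes_ is sorted, so index 0 is 'ham' and index 1 is 'spam'.
print(enc.classes_.tolist())          # ['ham', 'spam']
print(codes.tolist())                 # [0, 1, 0]
print(enc.inverse_transform([1])[0])  # spam
```

The same check can be run on the notebook's own `label_encoder` through its `classes_` attribute.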


Training the model --> **support vector machine**

In [None]:
# train the linear SVM on the training features
model = LinearSVC()
model.fit(x_train_features,y_train)
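
The vectorizer and the classifier can also be chained into one estimator, so that raw text goes in and a label comes out. This is an alternative arrangement, not what the notebook does; the sketch below uses `make_pipeline` from scikit-learn on four invented messages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus, for illustration only.
texts = [
    "win a free prize now",
    "free cash prize claim now",
    "are we meeting for lunch",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

# The pipeline fits the TF-IDF vocabulary and the SVM in one call.
pipe = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LinearSVC(),
)
pipe.fit(texts, labels)
print(pipe.predict(["claim your free prize"]))
```

Because the labels are passed as strings, `predict` returns strings as well, and no separate encoder is needed.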

Evaluation of the model

In [None]:
# prediction on training data
prediction_on_training_data = model.predict(x_train_features)
accuracy_on_training_data = accuracy_score(y_train,prediction_on_training_data)
print('Accuracy on training data: ' , accuracy_on_training_data)

Accuracy on training data:  0.9995512676688355


In [None]:
# prediction on test data
prediction_on_testing_data = model.predict(x_test_features)
accuracy_on_testing_data = accuracy_score(y_test,prediction_on_testing_data)
print('Accuracy on testing data: ' , accuracy_on_testing_data)

Accuracy on testing data:  0.9856502242152466
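
Accuracy alone does not show how the errors are distributed across the two classes. A confusion matrix with per-class precision and recall gives that breakdown. The sketch below runs on six hand-written labels, not on the notebook's predictions, and assumes the `LabelEncoder` ordering (0 = ham, 1 = spam).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-written example labels: 0 = ham, 1 = spam.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm.tolist())                     # [[3, 1], [1, 1]]
print(precision_score(y_true, y_pred))  # 0.5 (spam precision)
print(recall_score(y_true, y_pred))     # 0.5 (spam recall)
```

Passing `y_test` and `prediction_on_testing_data` to the same functions would give the corresponding figures for the trained model.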


Prediction on new mail

In [None]:
# input_mail = ["Hi anamikasadh, You’re now part of the world’s largest data science community. Welcome, we’re glad you’re here! Whether you’re new to machine learning or a renowned expert, we want to support you with powerful tools and resources to help you grow as a data scientist. So, where’s the best place to start? We highly recommend trying out our popular Titanic competition. You’ll be challenged to predict which passengers survived the infamous 1912 shipwreck. Get Started Here It’s a quick way to get a strong grasp of how our platform works. You’ll become more familiar with how our notebooks (online coding environment with no cost GPUs), open datasets, and open pretrained models work together to help you build data science projects."]
# input_mail = ["My daughter is pregnant at 16 years old. She says she doesn’t know who the father is, or could be. I kicked her out of anger. Should I allow her back home? Sharon Codner, M.S. Psychology (1984) • Answered November 3 I was your daughter 50 years ago. My parents (dad) kicked me out. I was homeless and pregnant. Nice work Mom and Dad. At the one time in my life, when I needed my parents t... Read more »"]
# input_mail = [" Touristy is an online news publication headquartered between New Delhi, London & New York. Small Team but BIG Impact. We've identified you as a Global Citizen who is aware of and understands the wider world- and their place in it. But go ahead, prove us wrong by not reading further.  Who We Are ? A newsletter-first company which believes in fact-driven journalism- served with a side of millennial sarcasm. We bring you the latest Emerging Trends from around the globe. We cover Business, Culture, Startups, Ideas and People that are re-shaping the world as we know it. Our website: Touristy.substack.com Instagram: @touristynews While, we were born during our undergrad at New York University, we came to be a global community during our Masters in London last year. Along the way, we met (& hired) a few super-cool writers who hit us up weekly with the stories breaking the internet. Why You Should Stay: Ok, so we’re going to keep this brief- with 2 months away from 2024 already!! The truth is that the past 3 years brought even the brightest to their knees, bringing the entire education and job ecosystem to a halt. It was and still is a tricky time and and the truth is - everyone needs a secret weapon to survive."]
# input_mail = [" HackerEarth Hi Anamika, you have successfully registered for the contest. You have been registered for BNY Mellon Code Divas Diversity Challenge. The contest will run from Dec 11, 12:30 PM to Dec 24, 06:25 PM. To be better prepared for the contest, try out Interview prep. Interview prep Prepare for your technical interviews by solving questions asked previously by top tech companies. Solve Now Note: The above mentioned timezone is based on your profile settings and it may not match to your current timezone. Click here to view contest time in your timezone. View challenge time in your timezone. We would love to know more about your experience with HackerEarth so feel free to share any feedback. Happy Coding! P.S. Take our State of the Developer Ecosystem survey to help bridge the gap between developers and organisations. Regards, Team HackerEarth"]
input_mail = ["Free entry in 2 a wkly comp to win FA Cup fina..."]

# convert text to feature vectors
input_mail_features = feature_extraction.transform(input_mail)

# making prediction
# LabelEncoder sorts the classes alphabetically, so 0 = ham and 1 = spam
prediction = model.predict(input_mail_features)
print(prediction)
if prediction[0] == 1:
  print('SPAM MAIL')
else:
  print('HAM MAIL')

[1]
SPAM MAIL
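
To avoid hard-coding which integer means spam, the encoder's `inverse_transform` can turn the prediction back into the original label. The sketch below wraps that in a hypothetical helper, `classify`; it trains its own toy vectorizer, encoder and model on invented messages so that it runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Invented toy corpus standing in for the real training data.
texts = [
    "win a free prize now",
    "free cash prize claim now",
    "are we meeting for lunch",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
encoder = LabelEncoder()
clf = LinearSVC().fit(vectorizer.fit_transform(texts), encoder.fit_transform(labels))


def classify(mail: str) -> str:
    """Return 'ham' or 'spam' for a single message (hypothetical helper)."""
    features = vectorizer.transform([mail])
    code = clf.predict(features)
    # Map the integer class back to its original string label.
    return str(encoder.inverse_transform(code)[0])


print(classify("free prize, claim now"))
```

In the notebook itself the equivalent call would be `label_encoder.inverse_transform(prediction)`.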
