
In [1]:
import pandas as pd  # Data manipulation and analysis
from sklearn.model_selection import train_test_split  # Splits data into training and test sets
from sklearn.feature_extraction.text import CountVectorizer  # Converts text into token-count feature vectors
from sklearn.naive_bayes import MultinomialNB  # Multinomial Naive Bayes classifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # Evaluation metrics

In [2]:
df = pd.read_csv(r"/content/mail_data.csv", encoding='ISO-8859-1')  # Reads the dataset, decoding the file as ISO-8859-1 (Latin-1)

In [3]:
df.head()  # Shows the first rows (5 by default)

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
df['spam'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)  # Labels spam messages as 1 and everything else (ham) as 0

In [5]:
df.head(10)  # Shows the first 10 rows

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
5,spam,FreeMsg Hey there darling it's been 3 week's n...,1
6,ham,Even my brother is not like to speak with me. ...,0
7,ham,As per your request 'Melle Melle (Oru Minnamin...,0
8,spam,WINNER!! As a valued network customer you have...,1
9,spam,Had your mobile 11 months or more? U R entitle...,1


In [6]:
x_train, x_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.2)  # Random split: 80% of the messages for training, 20% for testing
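
Because the split above is random and unseeded, the rows and scores shown below change from run to run. The following is a minimal sketch, on a made-up toy DataFrame, of how `random_state` and `stratify` could be used to make a split repeatable and to keep the spam/ham ratio equal in both parts. The names `toy`, `x_tr`, `x_te`, `y_tr`, `y_te` and the seed value are assumptions for illustration only; they are not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the notebook's DataFrame: 8 ham (0) and 2 spam (1) messages.
toy = pd.DataFrame({
    "Message": [f"message {i}" for i in range(10)],
    "spam": [0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
})

x_tr, x_te, y_tr, y_te = train_test_split(
    toy.Message, toy.spam,
    test_size=0.5,
    random_state=42,    # arbitrary seed, chosen only so reruns give the same split
    stratify=toy.spam,  # keep the spam/ham ratio the same in both halves
)

print(len(x_tr), len(x_te))            # sizes of the two halves
print(int(y_tr.sum()), int(y_te.sum()))  # spam messages in each half
```

With stratification, each half of this toy set receives exactly one of the two spam messages, which is useful when the positive class is rare.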

In [7]:
print(x_train)

5081    Keep ur problems in ur heart, b'coz nobody wil...
2666                       R u meeting da ge at nite tmr?
4536                        IM LATE TELLMISS IM ON MY WAY
4626                             I'm on the bus. Love you
93      Please call our customer service representativ...
                              ...                        
2676    I'm sick !! I'm needy !! I want you !! *pouts*...
1289                             Happy new year to u too!
2016    Just re read it and I have no shame but tell m...
1475    Friendship is not a game to play, It is not a ...
2312    Congratulations! Thanks to a good friend U hav...
Name: Message, Length: 4457, dtype: object


In [8]:
print(x_test)

89              Ela kano.,il download, come wen ur free..
483                                    Watching tv lor...
1355    Baaaaabe! I misss youuuuu ! Where are you ? I ...
183     ok. I am a gentleman and will treat you with d...
2415                             O was not into fps then.
                              ...                        
3369             Hey elaine, is today's meeting still on?
1334           Oh... Icic... K lor, den meet other day...
4249    accordingly. I repeat, just text the word ok o...
133                             First answer my question.
4567    Should i buy him a blackberry bold 2 or torch....
Name: Message, Length: 1115, dtype: object


In [9]:
cv = CountVectorizer()  # Creates a CountVectorizer instance

x_train_count = cv.fit_transform(x_train.values)  # Learns the vocabulary of the training messages and converts each message into a vector of word counts

In [10]:
x_train_count.toarray()[:3]  # First 3 rows of the count matrix as a dense array (mostly zeros, since each message uses only a few vocabulary words)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
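
The truncated array above hides what the columns mean. As a small sketch on two made-up sentences (the names `docs`, `toy_cv`, `counts`, `vocab`, and `unseen` are illustrative and not from the notebook), this shows how `CountVectorizer` maps each vocabulary word to a column and how `transform` ignores words that were not seen during fitting:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free entry free prize", "call me later"]

toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(docs)  # learn the vocabulary and count words

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(counts.toarray())

# "lunch" was never seen during fitting, so only "free" is counted here.
unseen = toy_cv.transform(["free lunch"])
print(unseen.toarray())
```

Each row is one document and each column one vocabulary word, so "free" appearing twice in the first sentence produces a 2 in that word's column.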

In [11]:
model = MultinomialNB()  # Creates a Multinomial Naive Bayes classifier
model.fit(x_train_count, y_train)  # Fits the classifier on the training count vectors and their labels

In [12]:
x_test_count = cv.transform(x_test)  # Converts the test messages to count vectors using the vocabulary learned from the training set (no refitting)

In [13]:
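y_pred = model.predict(x_test_count)  # Predicted labels for the test messages
accuracy = accuracy_score(y_test, y_pred)  # Fraction of test messages classified correctly
conf_matrix = confusion_matrix(y_test, y_pred)  # Rows are true classes, columns are predicted classes
class_report = classification_report(y_test, y_pred)  # Per-class precision, recall, and F1-score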
print("Accuracy:", accuracy)
print("\nConfusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)

Accuracy: 0.9838565022421525

Confusion Matrix:
 [[961   2]
 [ 16 136]]

Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       963
           1       0.99      0.89      0.94       152

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115
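
The figures in the report can be checked by hand against the confusion matrix. The sketch below copies the four counts from the output above and recomputes accuracy and the spam-class (label 1) precision, recall, and F1-score; the variable names are illustrative only.

```python
# Counts copied from the confusion matrix printed above
# (rows = true class, columns = predicted class; class 1 = spam).
tn, fp = 961, 2    # true ham: correctly kept, wrongly flagged as spam
fn, tp = 16, 136   # true spam: missed, correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_spam = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall_spam = tp / (tp + fn)     # of the real spam messages, how many were caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(round(accuracy, 4))        # 0.9839
print(round(precision_spam, 2))  # 0.99
print(round(recall_spam, 2))     # 0.89
print(round(f1_spam, 2))         # 0.94
```

The recall of 0.89 reflects the 16 spam messages that were classified as ham, which the overall accuracy of about 0.98 tends to hide because ham dominates the test set.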



In [14]:
input1 = ['Free entry in 2 a wkly comp to win FA Cup final. Give a call to ']  # Any message to check for spam
input1_count = cv.transform(input1)  # Converts the message into a count vector using the fitted vocabulary
prediction = model.predict(input1_count)  # Predicts the label with the trained model
if prediction[0] == 0:
    print('Not Spam mail')  # Printed if the message is predicted as not spam
else:
    print('Spam mail')  # Printed if the message is predicted as spam

Spam mail


In [15]:
input2 = ['Kindly go through the details about the Campus recruitment by Evertz Job Designation: Junior Engineer']  # Any message to check for spam
input2_count = cv.transform(input2)  # Converts the message into a count vector using the fitted vocabulary
prediction = model.predict(input2_count)  # Predicts the label with the trained model
if prediction[0] == 0:
    print('Not Spam mail')  # Printed if the message is predicted as not spam
else:
    print('Spam mail')  # Printed if the message is predicted as spam

Not Spam mail
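
The last two cells repeat the same transform-then-predict steps. One possible way to avoid the repetition, sketched here on a tiny made-up training set rather than the notebook's data, is to chain the vectorizer and classifier with scikit-learn's `make_pipeline` and wrap prediction in a helper. The names `texts`, `labels`, `spam_pipeline`, and `classify` are hypothetical and do not appear in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (not the notebook's dataset).
texts = [
    "win a free prize now",
    "free entry to win cash",
    "are we meeting for lunch today",
    "see you at home tonight",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# The pipeline vectorizes raw text and then classifies it in a single call.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(texts, labels)


def classify(message):
    """Return 'Spam mail' or 'Not Spam mail' for one raw message (hypothetical helper)."""
    predicted = spam_pipeline.predict([message])[0]
    return "Spam mail" if predicted == 1 else "Not Spam mail"


print(classify("free cash prize"))
print(classify("lunch at home today"))
```

Keeping both steps in one object also makes it harder to accidentally refit the vectorizer on new messages, since `predict` only ever calls `transform` on the fitted vocabulary.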
