<a href="https://colab.research.google.com/github/bhavana957/OIBSIP/blob/main/EmailSpamDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Task 4: Email Spam Detection using Machine Learning


We’ve all been the recipient of spam emails before. Spam mail, or junk mail, is a type of email
that is sent to a massive number of users at one time, frequently containing cryptic
messages, scams, or most dangerously, phishing content.

In this Project, use Python to build an email spam detector. Then, use machine learning to
train the spam detector to recognize and classify emails into spam and non-spam.


IMPORTING LIBRARIES

In [7]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

#to convert text into numeric values
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression #for using logistic regression
from sklearn.metrics import accuracy_score

LOADING THE DATASET

In [8]:
data=pd.read_csv('/content/spam.csv',encoding='latin')

In [9]:
print(data)

        v1                                                 v2 Unnamed: 2  \
0      ham  Go until jurong point, crazy.. Available only ...        NaN   
1      ham                      Ok lar... Joking wif u oni...        NaN   
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3      ham  U dun say so early hor... U c already then say...        NaN   
4      ham  Nah I don't think he goes to usf, he lives aro...        NaN   
...    ...                                                ...        ...   
5567  spam  This is the 2nd time we have tried 2 contact u...        NaN   
5568   ham              Will Ì_ b going to esplanade fr home?        NaN   
5569   ham  Pity, * was in mood for that. So...any other s...        NaN   
5570   ham  The guy did some bitching but I acted like i'd...        NaN   
5571   ham                         Rofl. Its true to its name        NaN   

     Unnamed: 3 Unnamed: 4  
0           NaN        NaN  
1           NaN        NaN  


In [10]:
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


TO CHECK FOR ANY NULL VALUES IN THE DATASET

In [11]:
data.isnull().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

Here, we can see that 3 columns have null values

In [12]:
#to replace the null values with a null string so to create a new dataframe and display
null_data=data.where((pd.notnull(data)),'')
null_data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


FROM THE DATASET WE CAN TELL THAT ham MEANS NON SPAM-MAIL & spam MEANS THE SPAM MAIL AND EMAILS ARE IN COLUMN v2

In [14]:
#checking the shape of the dataframe
null_data.shape

(5572, 5)

In [20]:
#renaming the columns for meaningful analysis
null_data.columns=['Spam/Non-Spam','Mail','','','']
null_data.head()

Unnamed: 0,Spam/Non-Spam,Mail,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


Now we have to convert the  categorical data into numerical labels for easy understanding of the algorithms

In [21]:
#labeling spam mails as 0 and non-spam mails as 1
null_data.loc[null_data['Spam/Non-Spam']=='spam','Spam/Non-Spam',]=0
null_data.loc[null_data['Spam/Non-Spam']=='ham','Spam/Non-Spam',]=1

Spam-0 and ham-1

Next classifying data as texts and labels

In [22]:
X=null_data['Mail']
Y=null_data['Spam/Non-Spam']


In [23]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Mail, Length: 5572, dtype: object


In [24]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Spam/Non-Spam, Length: 5572, dtype: object


Now, splitting the data into training data and testing data

In [26]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.20,random_state=3)

We can say that 20% data goes for testing and the rest 80% for training tests

In [27]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(5572,)
(4457,)
(1115,)


We see that 1115 values will go for testing and the rest 4457 for training

Using feature extraction as we have string values.
If we feed it into logistic regression model, it would be difficult to ]understand anything. So converting string values to meaningful numerical values

In [28]:
#feature extraction to transform data into numerical values which we can use as input to the logistic regression
feature_extract=TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

X_train_features=feature_extract.fit_transform(X_train)
X_test_features=feature_extract.fit_transform(X_test)

#convert  Y_train & Y_test values as integers
Y_test=Y_test.astype('int')
Y_train=Y_train.astype('int')

In [29]:
Y_train

3075    1
1787    1
1614    1
4304    1
3266    1
       ..
789     1
968     1
1667    1
3321    1
1688    1
Name: Spam/Non-Spam, Length: 4457, dtype: int64

In [30]:
Y_test

2632    1
454     0
983     1
1282    1
4610    1
       ..
4827    1
5291    1
3325    1
3561    1
1136    0
Name: Spam/Non-Spam, Length: 1115, dtype: int64

In [31]:
print(X_test_features)

  (0, 1840)	0.5470775878936475
  (0, 2563)	0.5149396017383439
  (0, 633)	0.6599570587439945
  (1, 2813)	0.1623769474001259
  (1, 121)	0.26958776220642855
  (1, 22)	0.23137073373406264
  (1, 1205)	0.15919444481136824
  (1, 1394)	0.4137552551149475
  (1, 2335)	0.25837851868099354
  (1, 2224)	0.26958776220642855
  (1, 3106)	0.2496839539259005
  (1, 2801)	0.26958776220642855
  (1, 1428)	0.26958776220642855
  (1, 1)	0.2226761689789696
  (1, 238)	0.26958776220642855
  (1, 210)	0.21557219231256677
  (1, 2261)	0.26958776220642855
  (1, 1714)	0.2496839539259005
  (2, 2835)	0.39312526272940845
  (2, 1223)	0.43336283778599577
  (2, 2860)	0.3362295403293851
  (2, 1225)	0.5554066201719753
  (2, 1729)	0.48592423391824763
  (3, 2010)	0.28009022728448785
  (3, 1336)	0.23561627348325437
  :	:
  (1111, 2594)	0.4675085813886858
  (1111, 2942)	0.4099856948072808
  (1111, 1375)	0.4169391674217092
  (1112, 2080)	0.4448076693011006
  (1112, 1131)	0.40271290258485304
  (1112, 1455)	0.3780890425792432
  (1112,

In [32]:
print(X_train_features)

  (0, 741)	0.3219352588930141
  (0, 3979)	0.2410582143632299
  (0, 4296)	0.3891385935794867
  (0, 6599)	0.20296878731699391
  (0, 3386)	0.3219352588930141
  (0, 2122)	0.38613577623520473
  (0, 3136)	0.440116181574609
  (0, 3262)	0.25877035357606315
  (0, 3380)	0.21807195185332803
  (0, 4513)	0.2909649098524696
  (1, 4061)	0.380431198316959
  (1, 6872)	0.4306015894277422
  (1, 6417)	0.4769136859540388
  (1, 6442)	0.5652509076654626
  (1, 7443)	0.35056971070320353
  (2, 933)	0.4917598465723273
  (2, 2109)	0.42972812260098503
  (2, 3917)	0.40088501350982736
  (2, 2226)	0.413484525934624
  (2, 5825)	0.4917598465723273
  (3, 6140)	0.4903863168693604
  (3, 1599)	0.5927091854194291
  (3, 1842)	0.3708680641487708
  (3, 7453)	0.5202633571003087
  (4, 2531)	0.7419319091456392
  :	:
  (4452, 2122)	0.31002103760284144
  (4453, 999)	0.6760129013031282
  (4453, 7273)	0.5787739591782677
  (4453, 1762)	0.45610005640082985
  (4454, 3029)	0.42618909997886
  (4454, 2086)	0.3809693742808703
  (4454, 3088)

Using Logistic Regression Model for training the data

In [33]:
model=LogisticRegression()

In [34]:
model.fit(X_train_features,Y_train)

Evaluating the model

In [62]:
#training data prediction
training_data_prediction=model.predict(X_train_features)
accuracy_training_data=accuracy_score(Y_train, training_data_prediction)

In [64]:
print('Training Data Accuracy=',accuracy_training_data)

Training Data Accuracy= 0.9661207089970832


In [65]:
#testing data prediction
testing_data_prediction=model.predict(X_test_features)
accuracy_testing_data=accuracy_score(Y_test, testing_data_prediction)

ValueError: ignored

BUILDING A PREDICTIVE SYSTEM FOR CHECKING WHETHER IT IS SPAM OR NOT

In [57]:
mail_input=['Happy to announce that MSME Technology Development Centre (Process &Product Development Centre), Agra an Autonomous Institute under the Ministry of MSME is organizing an Online Training program on “Descriptive Analysis and visualization using R"']

input_data_features= feature_extract.transform(mail_input)

prediction=model.predict(input_data_features)
print(prediction)

if(prediction[0]==1):
  print('Non Spam Mail')
else:
  print('Spam Mail')

ValueError: ignored