
Task 4: Email Spam Detection using Machine Learning


We’ve all been the recipient of spam emails before. Spam mail, or junk mail, is a type of email
that is sent to a massive number of users at one time, frequently containing cryptic
messages, scams, or most dangerously, phishing content.

In this project, we use Python to build an email spam detector, then use machine learning to
train it to classify emails as spam or non-spam.


IMPORTING LIBRARIES

In [98]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

#to convert text into numeric values
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression #for using logistic regression
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

LOADING THE DATASET

In [None]:
data=pd.read_csv('/content/spam.csv',encoding='latin')

In [None]:
print(data)

        v1                                                 v2 Unnamed: 2  \
0      ham  Go until jurong point, crazy.. Available only ...        NaN   
1      ham                      Ok lar... Joking wif u oni...        NaN   
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3      ham  U dun say so early hor... U c already then say...        NaN   
4      ham  Nah I don't think he goes to usf, he lives aro...        NaN   
...    ...                                                ...        ...   
5567  spam  This is the 2nd time we have tried 2 contact u...        NaN   
5568   ham              Will Ì_ b going to esplanade fr home?        NaN   
5569   ham  Pity, * was in mood for that. So...any other s...        NaN   
5570   ham  The guy did some bitching but I acted like i'd...        NaN   
5571   ham                         Rofl. Its true to its name        NaN   

     Unnamed: 3 Unnamed: 4  
0           NaN        NaN  
1           NaN        NaN  


In [None]:
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


TO CHECK FOR ANY NULL VALUES IN THE DATASET

In [None]:
data.isnull().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

Here, we can see that the three unnamed columns are almost entirely null, while v1 and v2 have no missing values
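To put those counts in perspective, the short sketch below divides each null count by the number of rows. It only reuses the numbers printed by data.isnull().sum() and the row count visible in the printed index (0 to 5571); the variable names are illustrative.

```python
# Null counts copied from the output of data.isnull().sum() above
null_counts = {'Unnamed: 2': 5522, 'Unnamed: 3': 5560, 'Unnamed: 4': 5566}
total_rows = 5572  # the printed index runs from 0 to 5571

# Fraction of rows that are null in each column
null_fraction = {col: count / total_rows for col, count in null_counts.items()}

for col, frac in null_fraction.items():
    print(f'{col}: {frac:.1%} null')
```

On the DataFrame itself, data.isnull().mean() gives the same fractions directly.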

In [None]:
#drop the unnecessary columns, as they are almost entirely null
columns_drop=['Unnamed: 2','Unnamed: 3','Unnamed: 4']
data.drop(columns=columns_drop, inplace=True)

In [None]:
data

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


FROM THE DATASET WE CAN TELL THAT ham MEANS NON-SPAM MAIL, spam MEANS SPAM MAIL, AND THE MESSAGE TEXT IS IN COLUMN v2

In [None]:
#checking the shape of the dataframe
data.shape

(5572, 2)

In [None]:
#renaming the columns for meaningful analysis
data.columns=['Spam/Non-Spam','Mail']
data.head()

Unnamed: 0,Spam/Non-Spam,Mail
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Now we have to convert the categorical labels into numerical labels, because the learning algorithm works with numbers rather than strings

In [None]:
#labeling spam mails as 0 and non-spam mails as 1
data.loc[data['Spam/Non-Spam']=='spam','Spam/Non-Spam']=0
data.loc[data['Spam/Non-Spam']=='ham','Spam/Non-Spam']=1

Spam-0 and ham-1
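After the .loc assignments the column keeps the object dtype, which is why an astype(int) conversion is needed later in the notebook. As an alternative, here is a minimal sketch using Series.map on a toy stand-in for the label column (the three values are illustrative, not taken from the dataset); it produces integer labels in one step.

```python
import pandas as pd

# Toy stand-in for the 'Spam/Non-Spam' column (illustrative values only)
labels = pd.Series(['ham', 'spam', 'ham'], name='Spam/Non-Spam')

# Map each category to its numeric code: spam -> 0, ham -> 1
encoded = labels.map({'spam': 0, 'ham': 1})

print(encoded.tolist())
print(encoded.dtype)
```

Any value missing from the mapping dictionary becomes NaN, so it is worth checking for nulls after mapping a real column.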

Next, we separate the data into texts (X) and labels (Y)

In [None]:
X=data['Mail']
Y=data['Spam/Non-Spam']


In [None]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Mail, Length: 5572, dtype: object


In [None]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Spam/Non-Spam, Length: 5572, dtype: object


Now, splitting the data into training data and testing data

In [71]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.30,random_state=3)

Here, 30% of the data is held out for testing and the remaining 70% is used for training

In [72]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(5572,)
(3900,)
(1672,)


We see that 1672 messages go to the test set and the remaining 3900 to the training set
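The split sizes can be checked by hand. The sketch below assumes the test count is the requested fraction rounded up, with the training set taking the remainder, which matches the shapes printed above.

```python
import math

n_samples = 5572   # total rows, from X.shape
test_size = 0.30   # fraction passed to train_test_split

# Test count is rounded up; the training set gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)  # 1672 3900
```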

The messages are strings, and a logistic regression model cannot work with raw text.
So we use feature extraction (TF-IDF) to convert each message into a vector of meaningful numerical values
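To make the TF-IDF idea concrete, here is a small hand-rolled sketch. It follows the smoothed-IDF formula and L2 normalisation that scikit-learn documents as the TfidfVectorizer defaults, but it is a simplification: the corpus is made up, tokens are split on whitespace, and no stop words are removed, so its numbers will not match the notebook's.

```python
import math
from collections import Counter

# Toy corpus (illustrative messages, not taken from the dataset)
docs = ['win cash now', 'call me now', 'win a free prize']
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of messages containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised message."""
    counts = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vector = tfidf(tokenized[0])
for term, weight in sorted(vector.items()):
    print(f'{term}: {weight:.3f}')
```

The term "cash" appears in only one message, so it receives a higher weight than "win" or "now", which each appear in two. Rare terms carrying more weight is what makes TF-IDF useful for telling spam from ham.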

In [83]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create and fit the vectorizer on the training data
feature_extract = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extract.fit_transform(X_train)

# Transform the testing data using the same vectorizer
X_test_features = feature_extract.transform(X_test)

# Convert Y_train and Y_test values to integers
Y_test = Y_test.astype(int)
Y_train = Y_train.astype(int)


In [84]:
Y_train

1455    0
3460    1
2493    1
3378    1
3826    0
       ..
789     1
968     1
1667    1
3321    1
1688    1
Name: Spam/Non-Spam, Length: 3900, dtype: int64

In [85]:
Y_test

2632    1
454     0
983     1
1282    1
4610    1
       ..
5017    1
4540    1
105     1
881     1
3995    1
Name: Spam/Non-Spam, Length: 1672, dtype: int64

In [86]:
print(X_test_features)

  (0, 5557)	0.5283989315037968
  (0, 3955)	0.5171278973205882
  (0, 1420)	0.6733300134394968
  (1, 6676)	0.230759645109698
  (1, 6086)	0.14985776135556678
  (1, 6067)	0.2706408194261022
  (1, 5088)	0.28407170806210813
  (1, 4944)	0.2706408194261022
  (1, 4858)	0.28407170806210813
  (1, 3724)	0.253719900885834
  (1, 3103)	0.28407170806210813
  (1, 3049)	0.3748431642191014
  (1, 2669)	0.14195175298081447
  (1, 557)	0.28407170806210813
  (1, 486)	0.20176005209769407
  (1, 297)	0.28407170806210813
  (1, 39)	0.24768056364996613
  (1, 1)	0.21732875647369207
  (2, 6183)	0.3106341426891049
  (2, 6133)	0.3458008215674823
  (2, 3748)	0.4427146757183849
  (2, 2706)	0.5978641546354502
  (2, 2704)	0.4800941467094377
  (3, 6561)	0.2910593419524306
  (3, 6227)	0.20793299935326795
  :	:
  (1669, 3765)	0.27662643227705713
  (1669, 1208)	0.5765304310101259
  (1670, 6612)	0.24755649780765218
  (1670, 6279)	0.2320788498224002
  (1670, 6184)	0.19162928032256513
  (1670, 6180)	0.38325856064513025
  (1670, 4

In [87]:
print(X_train_features)

  (0, 3051)	0.29377250073173566
  (0, 5839)	0.1629936193036503
  (0, 6355)	0.1558807617733673
  (0, 4463)	0.25613863155350935
  (0, 3431)	0.21322420382137608
  (0, 2669)	0.1467992773406004
  (0, 5923)	0.25613863155350935
  (0, 5142)	0.16010721180454363
  (0, 3466)	0.12934748264716298
  (0, 3913)	0.29377250073173566
  (0, 933)	0.2508580729996024
  (0, 6905)	0.24224909018184185
  (0, 5555)	0.2623842067100304
  (0, 5433)	0.21163062235443741
  (0, 2593)	0.2623842067100304
  (0, 1554)	0.19988276732642443
  (0, 2478)	0.22711374519304361
  (0, 2549)	0.23239430374695055
  (0, 5924)	0.29377250073173566
  (1, 3072)	0.1865177094597328
  (1, 4136)	0.1722045163692168
  (1, 4468)	0.20883030509262115
  (1, 239)	0.19895682423577737
  (1, 1619)	0.1783242740180506
  (1, 5927)	0.16325005023994915
  :	:
  (3895, 1962)	0.3492375146371
  (3896, 929)	0.6746077916779829
  (3896, 6722)	0.5760607620840338
  (3896, 1630)	0.46158241495372976
  (3897, 4963)	0.4437658514900006
  (3897, 1841)	0.41379848369481503
  (

Training a Logistic Regression model on the training data

In [88]:
model=LogisticRegression()

In [89]:
model.fit(X_train_features,Y_train)

Evaluating the model

In [90]:
#training data prediction
training_data_prediction=model.predict(X_train_features)
accuracy_training_data=accuracy_score(Y_train, training_data_prediction)

In [91]:
print('Training Data Accuracy=',accuracy_training_data)

Training Data Accuracy= 0.9638461538461538


In [92]:
#testing data prediction
testing_data_prediction=model.predict(X_test_features)
accuracy_testing_data=accuracy_score(Y_test, testing_data_prediction)

In [94]:
print('Testing Data Accuracy=',accuracy_testing_data)

Testing Data Accuracy= 0.9617224880382775


In [99]:
conf_matrix = confusion_matrix(Y_test, testing_data_prediction)
classification_rep = classification_report(Y_test, testing_data_prediction)

print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(classification_rep)

Confusion Matrix:
[[ 180   63]
 [   1 1428]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.74      0.85       243
           1       0.96      1.00      0.98      1429

    accuracy                           0.96      1672
   macro avg       0.98      0.87      0.91      1672
weighted avg       0.96      0.96      0.96      1672
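The figures in the classification report can be reproduced by hand from the confusion matrix. The sketch below does this for the spam class (label 0), assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted
# Class 0 = spam, class 1 = ham
spam_as_spam, spam_as_ham = 180, 63
ham_as_spam, ham_as_ham = 1, 1428

total = spam_as_spam + spam_as_ham + ham_as_spam + ham_as_ham

accuracy = (spam_as_spam + ham_as_ham) / total
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

# Prints: accuracy=0.96 precision=0.99 recall=0.74 f1=0.85
print(f'accuracy={accuracy:.2f} precision={spam_precision:.2f} '
      f'recall={spam_recall:.2f} f1={spam_f1:.2f}')
```

Read this way, the model almost never flags a genuine message as spam (1 out of 1429), but it lets 63 of the 243 spam messages through as ham, which is what the 0.74 recall for class 0 reflects.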



BUILDING A PREDICTIVE SYSTEM FOR CHECKING WHETHER IT IS SPAM OR NOT

In [103]:
mail_input=['SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info']
input_data_features= feature_extract.transform(mail_input)

prediction=model.predict(input_data_features)
print(prediction)

if(prediction[0]==1):
  print('Non Spam Mail')
else:
  print('Spam Mail')

[0]
Spam Mail
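The prediction steps above can be wrapped in a reusable function. The following is a minimal sketch: classify_mail is a hypothetical helper name, and the tiny training set exists only to make the example self-contained, so it is not the notebook's data or model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def classify_mail(text, vectorizer, model):
    """Return 'Spam Mail' or 'Non Spam Mail' for a single message."""
    features = vectorizer.transform([text])
    return 'Non Spam Mail' if model.predict(features)[0] == 1 else 'Spam Mail'

# Tiny illustrative training set (0 = spam, 1 = ham); not the notebook's data
texts = ['win cash prize now', 'free entry win cash',
         'see you at lunch', 'are you coming home']
labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression().fit(toy_vectorizer.fit_transform(texts), labels)

result = classify_mail('win a free cash prize', toy_vectorizer, toy_model)
print(result)
```

In the notebook, the same helper would be called as classify_mail(mail_input[0], feature_extract, model), reusing the fitted vectorizer and model from the cells above.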
