# **AYAN JAVEED SHAIKH**
### Intern Id : *OIB/Y2/IP6920*

## Task 3 : **EMAIL SPAM DETECTION WITH MACHINE LEARNING**

### *Most of us have received spam email. Spam, or junk mail, is email sent in bulk to a large number of recipients at once, often containing cryptic messages, scams, or, most dangerously, phishing content.*


# **Importing Modules**

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

# **Loading and Preprocessing the Dataset**

In [2]:
df = pd.read_csv("spam.csv", encoding="ISO-8859-1")

In [3]:
df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [4]:
df.tail()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 5567 | spam | This is the 2nd time we have tried 2 contact u... | | | |
| 5568 | ham | Will Ì_ b going to esplanade fr home? | | | |
| 5569 | ham | Pity, * was in mood for that. So...any other s... | | | |
| 5570 | ham | The guy did some bitching but I acted like i'd... | | | |
| 5571 | ham | Rofl. Its true to its name | | | |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [6]:
df.shape

(5572, 5)

In [7]:
df.isnull().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [8]:
# Drop the three columns that are almost entirely null (see the counts above)
df = df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)

In [9]:
df

| | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will Ì_ b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [10]:
x = df["v2"]  # message text (features)
y = df["v1"]  # label: "ham" or "spam"
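Before splitting, it can be worth checking for duplicate messages and for class imbalance, since both affect how the accuracy figures should be read. The following is a minimal sketch on a toy DataFrame standing in for the cleaned `df`; the column names `label` and `text` are hypothetical replacements for `v1` and `v2`, not names used in this notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned DataFrame (the real one comes from spam.csv).
toy = pd.DataFrame({
    "v1": ["ham", "ham", "spam", "ham"],
    "v2": ["Ok lar", "See you at 5", "WIN a free prize now", "Ok lar"],
})

# Hypothetical, more descriptive column names; the notebook keeps v1/v2.
toy = toy.rename(columns={"v1": "label", "v2": "text"})

# Exact duplicate rows could otherwise land in both the train and test sets.
n_duplicates = int(toy.duplicated().sum())
toy = toy.drop_duplicates().reset_index(drop=True)

# Share of each class; strong imbalance makes plain accuracy less informative.
class_share = toy["label"].value_counts(normalize=True)

print("duplicates removed:", n_duplicates)
print(class_share)
```

On the real data the same calls would be applied to `df` directly.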

## **Train/Test Split**

In [11]:
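# Hold out 30% of the messages for testing. No random_state is set,
# so the split (and the accuracies reported below) will vary between runs.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)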
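If reproducible results are wanted, `train_test_split` accepts a `random_state` seed, and `stratify` keeps the ham/spam ratio the same in both parts. The sketch below uses synthetic labels instead of `spam.csv`; the seed value 42 and the variable names are illustrative assumptions, not settings from this notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 80 ham and 20 spam messages.
texts = [f"message {i}" for i in range(100)]
labels = ["ham"] * 80 + ["spam"] * 20

# random_state fixes the shuffle; stratify preserves the class ratio.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.30, random_state=42, stratify=labels
)

# Calling it again with the same seed returns the identical split.
x_tr2, x_te2, _, _ = train_test_split(
    texts, labels, test_size=0.30, random_state=42, stratify=labels
)

print(Counter(y_tr), Counter(y_te))
print("same split on rerun:", x_te == x_te2)
```

With a fixed seed, the accuracy printed later would be the same on every run, which makes model comparisons easier to trust.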

## **Preprocessing x (Email Text) Using the TF-IDF Vectorizer**

In [12]:
tfidf = TfidfVectorizer()

In [13]:
# Learn the vocabulary and IDF weights from the training messages only
x_train = tfidf.fit_transform(x_train)

In [14]:
# Reuse the fitted vocabulary; words not seen during training are ignored
x_test = tfidf.transform(x_test)
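To make the fit/transform distinction concrete, here is a small sketch on a made-up three-message corpus (the sentences and variable names are illustrative assumptions). It shows that the vocabulary comes only from the fitted data and that unseen words in new text contribute nothing to the feature vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus; the notebook fits on the training messages instead.
train_docs = ["free prize call now", "see you at home", "call me at home"]
test_docs = ["claim your free prize"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary + IDF
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)

# "claim" and "your" were never seen during fit, so only "free" and
# "prize" produce non-zero entries in the test row.
print("non-zero features in test row:", test_matrix.nnz)
```

Both matrices share the same column count, which is what lets a model trained on one be applied to the other.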

# **Support Vector Machine** 

In [15]:
svm = SVC()

In [16]:
svm.fit(x_train, y_train)

SVC()

In [17]:
# Evaluating the model

y_pred = svm.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

Accuracy:  0.979066985645933
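Accuracy alone can hide missed spam when ham heavily outnumbers spam, as is common for datasets of this kind. A confusion matrix plus precision and recall for the spam class gives a fuller picture. The sketch below uses hand-written labels rather than the notebook's `y_test` and `y_pred`, so the numbers are purely illustrative.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hand-written example: 8 ham, 2 spam, and one spam message is missed.
y_true = ["ham"] * 8 + ["spam"] * 2
y_hat = ["ham"] * 8 + ["ham", "spam"]

acc = accuracy_score(y_true, y_hat)
# Rows are true classes, columns are predicted classes, in this order.
cm = confusion_matrix(y_true, y_hat, labels=["ham", "spam"])
prec = precision_score(y_true, y_hat, pos_label="spam")
rec = recall_score(y_true, y_hat, pos_label="spam")

print("accuracy: ", acc)
print("confusion matrix:\n", cm)
print("spam precision:", prec)
print("spam recall:   ", rec)
```

Here accuracy is 0.9 even though half of the spam slipped through (recall 0.5), which is the kind of gap these extra metrics expose.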


In [18]:
# Classify a new email with the trained SVM model

new_email1 = ["Hey there! meet me soon."]

new_email1_transformed = tfidf.transform(new_email1)

new_email1_pred = svm.predict(new_email1_transformed)

print("\n"+"-----"*15)
print("\n New Email's Prediction: ", new_email1_pred)
print("\n"+"-----"*15+"\n")


---------------------------------------------------------------------------

 New Email's Prediction:  ['ham']

---------------------------------------------------------------------------



In [19]:
# Classify a second new email with the trained SVM model

new_email2 = ["You have WON a guaranteed £1000 cash or a £2000 prize. To claim yr prize call our customer service representative on 08714712379 between 10am-7pm Cost 10p"]

new_email2_transformed = tfidf.transform(new_email2)

new_email2_pred = svm.predict(new_email2_transformed)

print("\n"+"-----"*15)
print("\n New Email's Prediction: ", new_email2_pred)
print("\n"+"-----"*15+"\n")


---------------------------------------------------------------------------

 New Email's Prediction:  ['spam']

---------------------------------------------------------------------------
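The two prediction cells above repeat the same transform-then-predict steps, which could be wrapped in a small helper. This is a sketch only: `classify_messages` is a hypothetical function name, and the vectorizer and SVM here are fitted on a four-message toy corpus so the block runs without `spam.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC


def classify_messages(vectorizer, model, messages):
    """Return (message, predicted_label) pairs for a list of raw strings."""
    features = vectorizer.transform(messages)
    return list(zip(messages, model.predict(features)))


# Tiny stand-in corpus, used only so the sketch is self-contained.
toy_texts = [
    "win a free cash prize now",
    "claim your free prize today",
    "are you coming home tonight",
    "ok see you at lunch",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_tfidf = TfidfVectorizer()
toy_svm = SVC().fit(toy_tfidf.fit_transform(toy_texts), toy_labels)

results = classify_messages(toy_tfidf, toy_svm, ["free prize", "see you tonight"])
for message, label in results:
    print(f"{message!r} -> {label}")
```

In the notebook itself the call would be `classify_messages(tfidf, svm, [...])`, reusing the objects fitted earlier.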



# **Logistic Regression**

In [20]:
log_reg = LogisticRegression()

In [21]:
log_reg.fit(x_train, y_train)

LogisticRegression()

In [22]:
# Evaluating the model

y_pred1 = log_reg.predict(x_test)

accuracy = accuracy_score(y_test, y_pred1)
print("Accuracy: ", accuracy)

Accuracy:  0.9694976076555024
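The two accuracies come from a single random split, so a gap of about one percentage point may not reflect a real difference between the models. Cross-validation gives several scores per model; putting the vectorizer inside a pipeline means TF-IDF is refitted on each training fold. The sketch below runs on a six-message toy corpus with assumed fold settings (`n_splits=3`, `random_state=0`), so its scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-in corpus; replace with the full message and label columns.
toy_texts = [
    "win a free cash prize now",
    "claim your free prize today",
    "free entry to win cash",
    "are you coming home tonight",
    "ok see you at lunch",
    "call me when you get home",
]
toy_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Stratified folds keep one spam and one ham message in every test fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

cv_scores = {}
for name, estimator in [("svm", SVC()), ("log_reg", LogisticRegression())]:
    # The pipeline refits TF-IDF inside each fold, avoiding leakage.
    pipeline = make_pipeline(TfidfVectorizer(), estimator)
    cv_scores[name] = cross_val_score(
        pipeline, toy_texts, toy_labels, cv=cv, scoring="accuracy"
    )
    print(name, cv_scores[name].mean())
```

Comparing the mean and spread of the fold scores is usually more informative than comparing two single numbers.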


In [23]:
# Classify a new email with the trained logistic regression model

new_email3 = ["Oh sorry please its over"]

new_email3_transformed = tfidf.transform(new_email3)

new_email3_pred = log_reg.predict(new_email3_transformed)

print("\n"+"-----"*15)
print("\n New Email's Prediction: ", new_email3_pred)
print("\n"+"-----"*15+"\n")


---------------------------------------------------------------------------

 New Email's Prediction:  ['ham']

---------------------------------------------------------------------------



In [24]:
# Classify a second new email with the trained logistic regression model

new_email4 = ["tells u 2 call 09066358152 to claim £5000 prize. U have 2 enter all ur mobile & personal details @ the prompts. Careful!"]

new_email4_transformed = tfidf.transform(new_email4)

new_email4_pred = log_reg.predict(new_email4_transformed)

print("\n"+"-----"*15)
print("\n New Email's Prediction: ", new_email4_pred)
print("\n"+"-----"*15+"\n")


---------------------------------------------------------------------------

 New Email's Prediction:  ['spam']

---------------------------------------------------------------------------
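Logistic regression can also report how confident it is, through `predict_proba`, which allows a custom decision threshold instead of a bare label. The following sketch assumes a toy corpus and a hypothetical threshold of 0.5; neither comes from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in corpus so the sketch runs without spam.csv.
toy_texts = [
    "win a free cash prize now",
    "claim your free prize today",
    "free entry to win cash",
    "are you coming home tonight",
    "ok see you at lunch",
    "call me when you get home",
]
toy_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(toy_texts), toy_labels)

# Column order of predict_proba follows clf.classes_.
spam_col = list(clf.classes_).index("spam")
proba = clf.predict_proba(vec.transform(["free prize waiting", "see you at home"]))
spam_scores = proba[:, spam_col]

threshold = 0.5  # hypothetical cut-off; raise it to flag fewer messages
flags = spam_scores >= threshold

print("spam probabilities:", spam_scores)
print("flagged as spam:   ", flags)
```

Raising the threshold trades missed spam for fewer legitimate messages flagged, which is often the safer direction for a mail filter.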

