# **Email Spam Detection with Machine Learning**

# Introduction

This project builds a model in Python that classifies emails as spam or legitimate ("ham"). The notebook walks through data preprocessing, label encoding, TF-IDF feature extraction, and model training, then compares three classifiers on a held-out test set.

## Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing the Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df= pd.read_csv('/content/drive/MyDrive/infobyte/Task 4:/spam.csv', encoding='ISO-8859-1')
df.head()

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |
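
The `encoding='ISO-8859-1'` argument in `read_csv` matters when a file contains bytes that are not valid UTF-8, which is presumably the case for `spam.csv`. The minimal sketch below shows the effect on an in-memory CSV; the two sample rows and the names `raw_csv` and `sample_df` are illustrative assumptions, not taken from the real file.

```python
import io

import pandas as pd

# Illustrative two-row CSV encoded as ISO-8859-1 (Latin-1); not the real file.
raw_csv = "v1,v2\nham,See you at 5\nspam,You won a £1500 prize\n".encode("ISO-8859-1")

# Decoding as UTF-8 fails because the Latin-1 pound sign is the single byte 0xA3.
try:
    raw_csv.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding lets pandas read the text intact.
sample_df = pd.read_csv(io.BytesIO(raw_csv), encoding="ISO-8859-1")
print(utf8_ok)
print(sample_df.loc[1, "v2"])
```

If the encoding of a file is unknown, trying UTF-8 first and falling back to ISO-8859-1 is a common pragmatic choice, since ISO-8859-1 can decode any byte sequence.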


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [None]:
df.drop(columns= ['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace= True)

In [None]:
df.head()

|   | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
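
According to `df.info()`, the three `Unnamed` columns hold at most 50 non-null values out of 5,572 rows, which is why they are dropped by name. As a hedged alternative, the sketch below finds such columns by their share of missing values instead of hard-coding the names. The toy frame and the 0.5 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout above; values are illustrative only.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["see you at 5", "WIN a prize now", "ok", "call me later"],
    "Unnamed: 2": [np.nan, "extra text", np.nan, np.nan],
})

# Fraction of missing values per column.
null_fraction = toy.isna().mean()

# Drop every column that is more than half empty (threshold is an assumption).
mostly_empty = null_fraction[null_fraction > 0.5].index.tolist()
cleaned = toy.drop(columns=mostly_empty)
print(cleaned.columns.tolist())
```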


## Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['v1']= le.fit_transform(df['v1']) # 0= ham; 1= spam
df.head()

|   | v1 | v2 |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
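
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why `ham` becomes 0 and `spam` becomes 1. The small sketch below, using a made-up list of labels, shows how to read that mapping from the fitted encoder instead of relying on a comment.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration.
labels = ["ham", "spam", "ham", "spam", "ham"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, and the position of each class is its integer code.
mapping = {str(cls): int(code) for code, cls in enumerate(encoder.classes_)}

# inverse_transform recovers the original string labels.
decoded = encoder.inverse_transform(encoded)
print(mapping)
```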


## Splitting Data into Features and Target

In [None]:
X= df['v2'].values
y= df['v1'].values

In [None]:
# .values already returns NumPy arrays, so these calls only make copies.
X = np.array(X)
y = np.array(y)

In [None]:
print(X)

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
 'Ok lar... Joking wif u oni...'
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
 ... 'Pity, * was in mood for that. So...any other suggestions?'
 "The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free"
 'Rofl. Its true to its name']


In [None]:
print(y)

[0 0 1 ... 0 0 0]


## Splitting the Data into Training Set and Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size= 0.30, random_state= 42)

In [None]:
print(X_train)

['To review and KEEP the fantastic Nokia N-Gage game deck with Club Nokia, go 2 www.cnupdates.com/newsletter. unsubscribe from alerts reply with the word OUT'
 'Just got outta class gonna go gym.'
 'Is there coming friday is leave for pongal?do you get any news from your work place.'
 ... "Prabha..i'm soryda..realy..frm heart i'm sory"
 'Nt joking seriously i told' 'In work now. Going have in few min.']


In [None]:
print(X_test)

['Funny fact Nobody teaches volcanoes 2 erupt, tsunamis 2 arise, hurricanes 2 sway aroundn no 1 teaches hw 2 choose a wife Natural disasters just happens'
 'I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one me the less expensive ones'
 'We know someone who you know that fancies you. Call 09058097218 to find out who. POBox 6, LS15HB 150p'
 ... 'You are gorgeous! keep those pix cumming :) thank you!'
 'Thats cool! Sometimes slow and gentle. Sonetimes rough and hard :)'
 'Ranjith cal drpd Deeraj and deepak 5min hold']
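
Spam is the minority class in this dataset, so a purely random split can leave the training and test sets with slightly different spam ratios. The notebook does not do this, but `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. A minimal sketch on toy data (the 80/20 class counts are an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 ham (0) and 20 spam (1); not the real dataset.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.array([f"message {i}" for i in range(100)])

# stratify=y_toy preserves the 20% spam share in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```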


## Feature Extraction

In [None]:
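from sklearn.feature_extraction.text import TfidfVectorizer
tf= TfidfVectorizer()
# Fit the vocabulary on the training text only, then reuse it for the test text.
# X_Train/X_Test (capital T) are the TF-IDF matrices; X_train/X_test hold the raw text.
X_Train= tf.fit_transform(X_train)
X_Test= tf.transform(X_test)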
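
The difference between `fit_transform` and `transform` is easiest to see on a tiny corpus. In the sketch below (toy sentences, not from the dataset), the vocabulary is learned from the training texts only, so a word that appears only in the test text is silently ignored and the test matrix keeps the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts for illustration.
train_texts = ["free prize call now", "see you at lunch", "call me at home"]
test_texts = ["free lunch tomorrow"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training texts.
train_matrix = vectorizer.fit_transform(train_texts)
# Reuse the learned vocabulary; "tomorrow" was never seen, so it gets no column.
test_matrix = vectorizer.transform(test_texts)

print(train_matrix.shape, test_matrix.shape)
print("tomorrow" in vectorizer.vocabulary_)
```

Fitting the vectorizer on the full dataset before splitting would leak information about the test messages into the features, which is why the fit happens on the training part alone.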

## Training the Model Using Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
model_lr= LogisticRegression()
model_lr.fit(X_Train, y_train)

## Predicting on the Test Set

In [None]:
pred_lr= model_lr.predict(X_Test)
# Side-by-side view: column 0 = predicted label, column 1 = true label
print(np.column_stack((pred_lr, y_test)))

[[0 0]
 [0 0]
 [0 1]
 ...
 [0 0]
 [0 0]
 [0 0]]


## Confusion Matrix and Accuracy Score

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
# Rows are true labels, columns are predicted labels (0 = ham, 1 = spam).
print(confusion_matrix(y_test, pred_lr))
accuracy_score(y_test, pred_lr)

[[1452    1]
 [  56  163]]


0.9659090909090909
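
Accuracy on its own can look good on imbalanced data, so it helps to also read precision and recall for the spam class from the confusion matrix. The sketch below recomputes the metrics from the Logistic Regression matrix printed above; only the variable names are new.

```python
import numpy as np

# Confusion matrix reported above: rows = true label, columns = predicted label.
cm = np.array([[1452, 1],
               [56, 163]])

# For labels ordered [0, 1], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # share of all messages classified correctly
precision = tp / (tp + fp)         # share of flagged messages that really are spam
recall = tp / (tp + fn)            # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The recall figure shows that roughly a quarter of the spam messages in the test set (56 of 219) slip through, even though accuracy is above 96%.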

## Testing the Model on New Data

In [None]:
input_mail= ["As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a £1500 Bonus Prize, call 09066364589"]
input_data_feature= tf.transform(input_mail)
pred= model_lr.predict(input_data_feature)

print(pred)

if pred[0] == 0:
    print("This is the Ham Mail.")
else:
    print("This is the Spam Mail.")

[1]
This is the Spam Mail.
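
Classifying a new message always requires the same two steps: transform the text with the fitted vectorizer, then call `predict`. One way to keep these steps together, not used in the notebook itself, is a scikit-learn `Pipeline`. The sketch below trains one on a tiny made-up corpus; `spam_pipeline` and `label_message` are hypothetical names, and the toy model says nothing about real-world accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; 0 = ham, 1 = spam as in the notebook.
toy_texts = [
    "win a free prize call now",
    "free entry claim your bonus prize",
    "urgent call to claim cash prize",
    "are we still meeting for lunch",
    "ok see you at home tonight",
    "can you send me the notes",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw text and feeds the result to the classifier.
spam_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
spam_pipeline.fit(toy_texts, toy_labels)


def label_message(text):
    """Return 'ham' or 'spam' for one raw message (hypothetical helper)."""
    return "spam" if spam_pipeline.predict([text])[0] == 1 else "ham"


print(label_message("claim your free prize now"))
```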


## Training the Model Using Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB
model_mnb= MultinomialNB()
model_mnb.fit(X_Train, y_train)

## Predicting on the Test Set

In [None]:
pred_mnb= model_mnb.predict(X_Test)
# Side-by-side view: column 0 = predicted label, column 1 = true label
print(np.column_stack((pred_mnb, y_test)))

[[0 0]
 [0 0]
 [0 1]
 ...
 [0 0]
 [0 0]
 [0 0]]


## Confusion Matrix and Accuracy Score

In [None]:
print(confusion_matrix(y_test, pred_mnb))
accuracy_score(y_test, pred_mnb)

[[1453    0]
 [  67  152]]


0.9599282296650717
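
Here Naive Bayes flags no legitimate message as spam but misses 67 of the 219 spam messages. One possible adjustment, which the notebook does not apply, is to lower the probability threshold at which a message is flagged. The sketch below illustrates the idea on a made-up corpus; the 0.3 threshold is an arbitrary assumption, and any real threshold would need to be chosen on validation data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus; 0 = ham, 1 = spam.
toy_texts = [
    "win a free prize call now",
    "free entry claim your bonus prize",
    "are we still meeting for lunch",
    "ok see you at home tonight",
    "can you send me the notes",
    "lunch at home tomorrow then",
]
toy_labels = [1, 1, 0, 0, 0, 0]

vectorizer = TfidfVectorizer()
nb = MultinomialNB().fit(vectorizer.fit_transform(toy_texts), toy_labels)

new_messages = ["free lunch at home", "call now to claim", "see you tonight"]
# Column 1 of predict_proba is the estimated probability of class 1 (spam).
spam_probability = nb.predict_proba(vectorizer.transform(new_messages))[:, 1]

# Flag as spam at the usual 0.5 cut-off and at a more lenient 0.3 cut-off.
default_flags = (spam_probability >= 0.5).astype(int)
lenient_flags = (spam_probability >= 0.3).astype(int)
print(np.round(spam_probability, 3), default_flags, lenient_flags)
```

A lower threshold can only flag the same messages or more, so it trades additional false positives for fewer missed spam messages.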

## Testing the Model on New Data

In [None]:
input_mail= ["As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a £1500 Bonus Prize, call 09066364589"]
input_data_feature= tf.transform(input_mail)
pred= model_mnb.predict(input_data_feature)

print(pred)

if pred[0] == 0:
    print("This is the Ham Mail.")
else:
    print("This is the Spam Mail.")

[1]
This is the Spam Mail.


## Training the Model Using Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
model_gbc= GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_gbc.fit(X_Train, y_train)

## Predicting on the Test Set

In [None]:
pred_gbc= model_gbc.predict(X_Test)
# Side-by-side view: column 0 = predicted label, column 1 = true label
print(np.column_stack((pred_gbc, y_test)))

[[0 0]
 [0 0]
 [0 1]
 ...
 [0 0]
 [0 0]
 [0 0]]


## Confusion Matrix and Accuracy Score

In [None]:
print(confusion_matrix(y_test, pred_gbc))
accuracy_score(y_test, pred_gbc)

[[1451    2]
 [  52  167]]


0.9677033492822966

## Testing the Model on New Data

In [None]:
input_mail= ["As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a £1500 Bonus Prize, call 09066364589"]
input_data_feature= tf.transform(input_mail)
pred= model_gbc.predict(input_data_feature)

print(pred)

if pred[0] == 0:
    print("This is the Ham Mail.")
else:
    print("This is the Spam Mail.")

[1]
This is the Spam Mail.
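
The three models above are trained and scored with nearly identical cells. As a possible refactoring, the sketch below runs the same fit-and-score steps in a loop over a dictionary of estimators. The corpus is made up and far too small to say anything about model quality; only the estimator settings mirror the notebook.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up messages; 0 = ham, 1 = spam.
train_texts = [
    "win a free prize call now",
    "free entry claim your bonus prize",
    "urgent call to claim cash prize",
    "you are awarded a bonus call now",
    "are we still meeting for lunch",
    "ok see you at home tonight",
    "can you send me the notes",
    "just got out of class going home",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["claim your free bonus", "see you at lunch",
              "call now for cash prize", "send me the class notes"]
test_labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_texts)
test_features = vectorizer.transform(test_texts)

models = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": MultinomialNB(),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, random_state=42
    ),
}

# Fit every model on the same features and collect its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(train_features, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(test_features))
    print(f"{name}: {scores[name]:.2f}")
```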


### Email Spam Detection with Machine Learning
This Google Colab notebook demonstrates email spam detection with three machine learning models. The dataset is loaded and preprocessed by dropping the three mostly empty columns and encoding the target variable ('ham' as 0 and 'spam' as 1). The data is then split into training (70%) and test (30%) sets, and the message text is converted to features with `TfidfVectorizer`. Three models are trained: Logistic Regression, Naive Bayes (`MultinomialNB`), and Gradient Boosting Classifier. On the 1,672-message test set, they achieve accuracy scores of approximately

**96.59% for Logistic Regression,**
**95.99% for Naive Bayes, and**
**96.77% for Gradient Boosting Classifier.**

The confusion matrices show that almost all errors are missed spam: of the 219 spam messages in the test set, Logistic Regression misses 56, Naive Bayes 67, and Gradient Boosting 52, while only 1, 0, and 2 legitimate messages, respectively, are wrongly flagged as spam.

Lastly, each trained model is asked to classify a sample promotional message, and all three label it as spam.
