**Task-3**
# **Project Name: Email_Spam_Detection_Machine_Learning**

# Problem Statement:
We have all received spam email. Spam, or junk mail, is email sent to a large number of recipients at once, often containing cryptic messages, scams, or, most dangerously, phishing content.

In this project, use Python to build an email spam detector, then use machine learning to train it to recognize and classify emails as spam or non-spam.

## **Importing Libraries**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (
    confusion_matrix, classification_report, accuracy_score,
    f1_score, recall_score, precision_score,
)

In [None]:
import warnings
warnings.filterwarnings("ignore")

## **Loading Dataset**

In [None]:
#mount google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#Load dataset
df=pd.read_csv("/content/drive/MyDrive/spam.csv",encoding="latin1")
df

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |
| ... | ... | ... | ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... | | | |
| 5568 | ham | Will Ì_ b going to esplanade fr home? | | | |
| 5569 | ham | Pity, * was in mood for that. So...any other s... | | | |
| 5570 | ham | The guy did some bitching but I acted like i'd... | | | |


## **Exploring Dataset**

In [None]:
df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [None]:
df.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [None]:
print('rows---->',df.shape[0])
print('columns---->',df.shape[1])

rows----> 5572
columns----> 5


## **Data Preprocessing**

In [None]:
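# Caution: with no arguments, dropna() removes every row that has a NaN in any column, including the mostly empty 'Unnamed' columns.
df = df.dropna()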

In [None]:
df.isnull().sum()

v1            0
v2            0
Unnamed: 2    0
Unnamed: 3    0
Unnamed: 4    0
dtype: int64

In [None]:
df = df.drop_duplicates()

In [None]:
df.isnull().mean()*100

v1            0.0
v2            0.0
Unnamed: 2    0.0
Unnamed: 3    0.0
Unnamed: 4    0.0
dtype: float64

In [None]:
# Drop the unused columns by name
df = df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])

In [None]:
df

| | v1 | v2 |
|---|---|---|
| 281 | ham | \Wen u miss someone |
| 1038 | ham | Edison has rightly said, \A fool can ask more ... |
| 2255 | ham | I just lov this line: \Hurt me with the truth |
| 3525 | ham | \HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE... |
| 4668 | ham | When I was born, GOD said, \Oh No! Another IDI... |


In [None]:
df.shape

(5, 2)
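
The shape above shows that only 5 of the 5572 loaded rows remain after preprocessing. A minimal sketch of one possible alternative ordering follows: drop the sparse columns first, then remove missing values only in the columns that are kept. The `toy` frame and its contents are hypothetical stand-ins for `spam.csv`, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of spam.csv: two real columns plus a mostly empty one.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["see you soon", "win a free prize", "ok", "see you soon"],
    "Unnamed: 2": [np.nan, np.nan, "stray text", np.nan],
})

# Unqualified dropna() keeps only rows with no NaN in ANY column.
rows_dropna_first = toy.dropna()

# Dropping the sparse column first keeps every row that has a label and a message.
cleaned = (
    toy.drop(columns=["Unnamed: 2"])
       .dropna(subset=["v1", "v2"])
       .drop_duplicates()
)

print(rows_dropna_first.shape)  # (1, 3)
print(cleaned.shape)            # (3, 2)
```

In the sketch, the first approach keeps one row while the second keeps three (one exact duplicate is removed).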

In [None]:
df.columns=['spam/ham','sms']

In [None]:
# Encode the labels: spam -> 0, ham -> 1
df.loc[df['spam/ham'] == 'spam', 'spam/ham'] = 0
df.loc[df['spam/ham'] == 'ham', 'spam/ham'] = 1

In [None]:
df

| | spam/ham | sms |
|---|---|---|
| 281 | 1 | \Wen u miss someone |
| 1038 | 1 | Edison has rightly said, \A fool can ask more ... |
| 2255 | 1 | I just lov this line: \Hurt me with the truth |
| 3525 | 1 | \HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE... |
| 4668 | 1 | When I was born, GOD said, \Oh No! Another IDI... |
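
The same spam -> 0, ham -> 1 encoding can also be written with `Series.map`, which produces an integer column in one step. This is only a sketch on hypothetical labels; note that `map` yields NaN for any label missing from the dictionary, in which case the `astype` call would fail.

```python
import pandas as pd

# Hypothetical labels standing in for the 'spam/ham' column.
labels = pd.Series(["ham", "spam", "ham"], name="spam/ham")

# Map each label to its integer code and store it as an integer dtype.
encoded = labels.map({"spam": 0, "ham": 1}).astype("int64")

print(encoded.tolist())  # [1, 0, 1]
print(encoded.dtype)     # int64
```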


In [None]:
x=df.sms
x

281                                   \Wen u miss someone
1038    Edison has rightly said, \A fool can ask more ...
2255        I just lov this line: \Hurt me with the truth
3525    \HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE...
4668    When I was born, GOD said, \Oh No! Another IDI...
Name: sms, dtype: object

In [None]:
y =df['spam/ham']
y

281     1
1038    1
2255    1
3525    1
4668    1
Name: spam/ham, dtype: object

## **Splitting into train and test data**

In [None]:
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=3)

In [None]:
print(x.shape)
print(xtrain.shape)
print(xtest.shape)

(5,)
(4,)
(1,)
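
With imbalanced labels, a plain random split can leave one class out of the train or test set. A minimal sketch, on hypothetical data, of passing `stratify` to `train_test_split` so both parts keep the same class proportions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 8 ham (1) and 2 spam (0).
x_toy = pd.Series([f"message {i}" for i in range(10)])
y_toy = pd.Series([1] * 8 + [0] * 2)

# stratify=y_toy preserves the 4:1 ham/spam ratio in both halves.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.5, random_state=3, stratify=y_toy
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Stratification requires at least two samples of every class, so it cannot be applied to a label column that contains only one class.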


In [None]:
xtrain,xtest

(4668    When I was born, GOD said, \Oh No! Another IDI...
 1038    Edison has rightly said, \A fool can ask more ...
 281                                   \Wen u miss someone
 2255        I just lov this line: \Hurt me with the truth
 Name: sms, dtype: object,
 3525    \HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE...
 Name: sms, dtype: object)

In [None]:
ytrain,ytest

(4668    1
 1038    1
 281     1
 2255    1
 Name: spam/ham, dtype: object,
 3525    1
 Name: spam/ham, dtype: object)

## **Applying TfidfVectorizer Technique**

In [None]:
feat_vect=TfidfVectorizer(min_df=1,stop_words='english',lowercase=True)
feat_vect

In [None]:
ytrain=ytrain.astype('int')
ytest=ytest.astype('int')

In [None]:
xtrain_vec =feat_vect.fit_transform(xtrain)

In [None]:
xtest_vec =feat_vect.transform(xtest)

In [None]:
print(xtrain)

4668    When I was born, GOD said, \Oh No! Another IDI...
1038    Edison has rightly said, \A fool can ask more ...
281                                   \Wen u miss someone
2255        I just lov this line: \Hurt me with the truth
Name: sms, dtype: object


In [None]:
xtrain_vec

<4x24 sparse matrix of type '<class 'numpy.float64'>'
	with 25 stored elements in Compressed Sparse Row format>

In [None]:
print(xtrain_vec)

  (0, 8)	0.36222392540501064
  (0, 15)	0.36222392540501064
  (0, 18)	0.2855815033388837
  (0, 6)	0.36222392540501064
  (0, 2)	0.7244478508100213
  (1, 5)	0.2814770040050052
  (1, 21)	0.2814770040050052
  (1, 19)	0.2814770040050052
  (1, 10)	0.2814770040050052
  (1, 0)	0.2814770040050052
  (1, 13)	0.2814770040050052
  (1, 23)	0.2814770040050052
  (1, 16)	0.2814770040050052
  (1, 1)	0.2814770040050052
  (1, 4)	0.2814770040050052
  (1, 17)	0.2814770040050052
  (1, 3)	0.2814770040050052
  (1, 18)	0.2219197030378226
  (2, 14)	0.7071067811865476
  (2, 22)	0.7071067811865476
  (3, 20)	0.4472135954999579
  (3, 7)	0.4472135954999579
  (3, 11)	0.4472135954999579
  (3, 12)	0.4472135954999579
  (3, 9)	0.4472135954999579


In [None]:
print(xtest_vec)
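
Printing a sparse matrix lists only its non-zero entries, so a blank result is consistent with a test message whose tokens do not occur in the fitted vocabulary. The sketch below reproduces that situation with hypothetical messages; it is an illustration, not a diagnosis of the actual test row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(stop_words="english")
# Hypothetical training messages that define the vocabulary.
vec.fit(["free prize claim", "lunch meeting tomorrow"])

# None of these tokens were seen during fit, so the row is all zeros.
unseen = vec.transform(["hey babe spk da mo"])

print(unseen.shape)  # (1, 6)
print(unseen.nnz)    # 0
```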




## **Model Training: Logistic Regression Model**

In [None]:
logi=LogisticRegression()

In [None]:
# Logistic regression needs at least two classes in the training labels.
# Check the real ytrain rather than overwriting it with example data.
ytrain.to_csv('ytard.csv', index=False)

if ytrain.nunique() < 2:
    print("ytrain contains a single class; the model cannot be trained")
else:
    print("ytrain contains at least two classes")

ytrain contains a single class; the model cannot be trained



In [None]:
logi.fit(xtrain_vec,ytrain)
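
`LogisticRegression.fit` raises a `ValueError` when the training labels contain only one class. A minimal sketch with hypothetical data that captures the error instead of stopping the notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features with labels that all belong to class 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_one_class = np.array([1, 1, 1, 1])

try:
    LogisticRegression().fit(X_toy, y_one_class)
    fit_error = None
except ValueError as exc:
    # scikit-learn refuses to fit a classifier on a single class.
    fit_error = str(exc)

print(fit_error)
```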

In [None]:
logi.score(xtrain_vec,ytrain)

In [None]:
logi.score(xtest_vec,ytest)

In [None]:
pred_logi=logi.predict(xtest_vec)
pred_logi

## **Model Evaluation**

In [None]:
accuracy_score(ytest,pred_logi)

In [None]:
confusion_matrix(ytest,pred_logi)

In [None]:
print(classification_report(ytest,pred_logi))
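
For reference, here is a self-contained sketch of the same vectorize, train, and evaluate sequence on a handful of hypothetical messages that include both classes (0 = spam, 1 = ham, as in this notebook). Wrapping the steps in `make_pipeline` is a choice made for the sketch; the corpus is far too small for the metrics to be meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import make_pipeline

# Hypothetical messages; 0 = spam, 1 = ham.
train_texts = [
    "win a free prize now", "free entry claim your prize",
    "urgent claim cash reward", "lunch meeting tomorrow",
    "running late, see you soon", "thanks for dinner yesterday",
]
train_y = [0, 0, 0, 1, 1, 1]
test_texts = ["claim your free reward", "see you at lunch"]
test_y = [0, 1]

# The pipeline fits the vectorizer on training text only, then the classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(train_texts, train_y)
pred = model.predict(test_texts)

# Rows are true labels and columns are predictions, both in the order [0, 1].
cm = confusion_matrix(test_y, pred, labels=[0, 1])
print(cm)
print(classification_report(test_y, pred, labels=[0, 1],
                            target_names=["spam", "ham"], zero_division=0))
```

Passing `labels=[0, 1]` keeps the confusion matrix 2x2 even if one class is absent from the test predictions.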