<a href="https://colab.research.google.com/github/Vinayak-2003/OIBSIP_Email-SPAM-Detection-using-Machine-learning/blob/main/Email_SPAM_Detection_using_Machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Email SPAM Detection using Machine Learning

### Import Libraries

In [1]:
!pip install scikit-learn==1.3.2

Collecting scikit-learn==1.3.2
  Downloading scikit_learn-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.8 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.2
    Uninstalling scikit-learn-1.2.2:
      Successfully uninstalled scikit-learn-1.2.2
Successfully installed scikit-learn-1.3.2


In [2]:
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### Data Collection and Processing

In [3]:
raw_mail_data = pd.read_csv("/content/spam.csv", encoding="ISO-8859-1")
raw_mail_data

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [4]:
# replace NULL (NaN) values with empty strings ''
mail_data = raw_mail_data.where((pd.notnull(raw_mail_data)), '')
mail_data.head()

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [5]:
# drop the empty 'Unnamed' columns (matched case-insensitively by name)
mail_data.drop(mail_data.columns[mail_data.columns.str.contains('unnamed', case=False)], axis=1, inplace=True)
mail_data.head()

,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [6]:
mail_data.shape

(5572, 2)

### Label Encoding

In [7]:
# encode the categorical labels as integers: spam -> 0, ham -> 1
mail_data.replace({'v1':{'spam':0, 'ham':1}}, inplace=True)
mail_data.head()

,v1,v2
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."
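The encoding step can also be written with an explicit mapping, which makes it easy to check that no label was missed and to look at the class balance. The sketch below is a minimal illustration: it uses only the labels of the first five rows shown above, and the names `sample`, `label_map`, and `class_counts` are hypothetical, not part of the notebook.

```python
import pandas as pd

# Labels of the first five rows shown above (message text omitted here).
sample = pd.DataFrame({"v1": ["ham", "ham", "spam", "ham", "ham"]})

# Same convention as the notebook: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
encoded = sample["v1"].map(label_map)

# A label missing from the mapping would become NaN, so check for that.
assert encoded.notna().all()

# Class counts show how balanced (or imbalanced) the labels are.
class_counts = encoded.value_counts().to_dict()
print(encoded.tolist())
print(class_counts)
```

Running the same `value_counts()` call on the full `v1` column would show how many spam and ham messages the dataset contains, which is useful context when reading the accuracy scores later on.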


Separating the data into text and labels


*   Text -> v2
*   Labels -> v1



In [8]:
x = mail_data['v2']     #input
x

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: v2, Length: 5572, dtype: object

In [9]:
y = mail_data['v1']       #output
print(y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: v1, Length: 5572, dtype: int64


### Splitting the data into Train and Test data

In [10]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=3)
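The split above is purely random. As an optional variation that the notebook does not use, `train_test_split` accepts a `stratify` argument that keeps the spam/ham ratio the same in both parts. The sketch below assumes hypothetical toy data (`texts`, `labels`) rather than the real dataset.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 ham (1) and 2 spam (0) messages.
texts = [f"message {i}" for i in range(10)]
labels = [1] * 8 + [0] * 2

# stratify=labels preserves the class proportions in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.5, random_state=3, stratify=labels
)

# Each half should receive one of the two spam messages.
print(y_tr.count(0), y_te.count(0))
```

Stratifying matters most when one class is rare, because a random split could otherwise leave very few examples of that class in the test set.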

### Feature Extraction

In [11]:
#transform the text data to feature vectors that can be used as input in logistic regression
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

X_train_feature = feature_extraction.fit_transform(X_train)
X_test_feature = feature_extraction.transform(X_test)
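The reason for calling `fit_transform` on the training text but only `transform` on the test text is that the vocabulary and IDF weights should be learned from training data alone. A minimal sketch of that behaviour follows; the toy messages are hypothetical and not taken from `spam.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages, not taken from spam.csv.
train_texts = ["free prize win now", "see you at home", "win a free entry"]
test_texts = ["free lottery at home"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# fit_transform learns the vocabulary and IDF weights from the training text only.
train_features = vectorizer.fit_transform(train_texts)

# transform reuses that vocabulary, so words unseen in training are ignored.
test_features = vectorizer.transform(test_texts)

vocabulary = vectorizer.vocabulary_
print(sorted(vocabulary))
print(train_features.shape, test_features.shape)
```

Both matrices have the same number of columns, one per vocabulary term, which is what allows a model trained on `train_features` to score `test_features`. The word "lottery" appears only in the test message, so it has no column.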

### Training the model

Logistic Regression

In [12]:
model = LogisticRegression()

In [13]:
model.fit(X_train_feature, Y_train)

Evaluating on the training data

In [14]:
# prediction on training data

predict_train_data = model.predict(X_train_feature)
accuracy_train_data = accuracy_score(Y_train, predict_train_data)
print("Accuracy on Training Data is: ", accuracy_train_data)

Accuracy on Training Data is:  0.9661207089970832


Evaluating on the test data

In [15]:
# prediction on test data

predict_test_data = model.predict(X_test_feature)
accuracy_test_data = accuracy_score(Y_test, predict_test_data)
print("Accuracy on Test Data is: ", accuracy_test_data)

Accuracy on Test Data is:  0.9623318385650225
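Accuracy alone does not show which kind of mistake the model makes. As a hedged sketch, the confusion matrix and the precision and recall for the spam class (label 0) can be computed as below. The `y_true` and `y_pred` values are hypothetical toy labels, not the notebook's actual predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical toy labels using the notebook's convention: spam = 0, ham = 1.
y_true = [1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, both ordered [spam, ham].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Treat spam (0) as the positive class.
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(accuracy)
print(cm)
print(spam_precision, spam_recall)
```

In this toy case one of the three spam messages is predicted as ham, so spam recall is 2/3 even though accuracy is 0.875. The same calls could be applied to `Y_test` and `predict_test_data`.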


### Building a Predictive System

In [16]:
input_mail = ["As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune,,,"]

#convert text to feature vectors
input_mail_feature = feature_extraction.transform(input_mail)

#making prediction
predict_mail_feature = model.predict(input_mail_feature)

print(predict_mail_feature)

if predict_mail_feature[0]==1:
  print("Ham Mail")
else:
  print("Spam Mail")

[1]
Ham Mail
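The vectorize-then-predict steps above can be bundled so that a raw message goes in and a label comes out. The sketch below is one possible arrangement using scikit-learn's `make_pipeline`. The helper names `build_spam_pipeline` and `classify_message` are hypothetical, and the tiny training set is made up for illustration, so its predictions say nothing about the real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_spam_pipeline():
    """Chain the same vectorizer settings and classifier used in the notebook."""
    return make_pipeline(
        TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
        LogisticRegression(),
    )


def classify_message(pipeline, message):
    """Return 'Ham Mail' or 'Spam Mail' for one raw text message."""
    label = pipeline.predict([message])[0]
    return "Ham Mail" if label == 1 else "Spam Mail"


# Hypothetical toy training data (spam = 0, ham = 1), not from spam.csv.
toy_texts = [
    "win a free prize now",
    "free entry to win cash",
    "are you coming home tonight",
    "see you at lunch tomorrow",
]
toy_labels = [0, 0, 1, 1]

pipeline = build_spam_pipeline().fit(toy_texts, toy_labels)
result = classify_message(pipeline, "free prize waiting")
print(result)
```

With the real data, the pipeline would be fitted on `X_train` and `Y_train` instead, and the fitted object could then be reused for any new message without calling the vectorizer separately.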


The model classifies the sample message above as ham. Its accuracy on the test data is about 0.9623, close to the 0.9661 measured on the training data, so it shows no obvious sign of overfitting.