# Spam or Ham Email Identification

### Context
This project aims to build a machine learning model to classify emails as either spam or ham (non-spam).

### Problem Statement
Email spam is a significant issue, leading to wasted time and potential security risks. Accurately identifying spam emails is crucial for improving email efficiency and security.

### Goals
The main goal is to develop a reliable model that can distinguish between spam and ham emails based on their content.

### Methods
The project will involve data preprocessing, feature extraction, model training, evaluation, and tuning.

### Import Libraries

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [3]:
# Loading data from the csv file to dataframe

sm = pd.read_csv('C:/Users/Agamya/Desktop/AGAMYA/Agu_CSV/spam.csv', encoding='ISO-8859-1')

In [4]:
print(sm)

        v1                                                 v2 Unnamed: 2  \
0      ham  Go until jurong point, crazy.. Available only ...        NaN   
1      ham                      Ok lar... Joking wif u oni...        NaN   
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3      ham  U dun say so early hor... U c already then say...        NaN   
4      ham  Nah I don't think he goes to usf, he lives aro...        NaN   
...    ...                                                ...        ...   
5567  spam  This is the 2nd time we have tried 2 contact u...        NaN   
5568   ham              Will Ì_ b going to esplanade fr home?        NaN   
5569   ham  Pity, * was in mood for that. So...any other s...        NaN   
5570   ham  The guy did some bitching but I acted like i'd...        NaN   
5571   ham                         Rofl. Its true to its name        NaN   

     Unnamed: 3 Unnamed: 4  
0           NaN        NaN  
1           NaN        NaN  


In [5]:
sm.shape

(5572, 5)

In [6]:
sm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


### Data Cleaning

In [7]:
# Dropping the unwanted columns

sm.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)

In [8]:
print(sm)

        v1                                                 v2
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
...    ...                                                ...
5567  spam  This is the 2nd time we have tried 2 contact u...
5568   ham              Will Ì_ b going to esplanade fr home?
5569   ham  Pity, * was in mood for that. So...any other s...
5570   ham  The guy did some bitching but I acted like i'd...
5571   ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


In [9]:
# Renaming the specified columns

sm.rename(columns={'v1': 'Category', 'v2': 'MailMessage'}, inplace=True)

In [10]:
sm.head()

  Category                                        MailMessage
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [11]:
# Checking for any null messages

null_count = sm['MailMessage'].isnull().sum()
null_count

0

### Label Encoding

In [12]:
# Label spam mail as 0 and ham mail as 1

sm.loc[sm['Category'] == 'spam', 'Category'] = 0
sm.loc[sm['Category'] == 'ham', 'Category'] = 1
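
An alternative way to encode the labels is an explicit dictionary passed to `Series.map`. The sketch below uses a small hypothetical frame (`toy`) rather than the real `sm` data, and the `Label` column name is an assumption made for illustration. It keeps the same convention as above (spam = 0, ham = 1).

```python
import pandas as pd

# Hypothetical stand-in for the `sm` DataFrame
toy = pd.DataFrame({"Category": ["ham", "spam", "ham"]})

# Explicit mapping that follows the notebook's convention: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
toy["Label"] = toy["Category"].map(label_map)

# Any category missing from the mapping would become NaN, so this check
# confirms that every row was encoded
all_encoded = bool(toy["Label"].notna().all())
labels = toy["Label"].tolist()
print(labels, all_encoded)
```

Note that `LabelEncoder` assigns integers in sorted class order, so it would give ham = 0 and spam = 1, the reverse of the convention used in this notebook.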

In [13]:
# Separate the message text (X) from the labels (Y)

X = sm['MailMessage']
Y = sm['Category']

In [14]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: MailMessage, Length: 5572, dtype: object


In [15]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


In [16]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

In [17]:
print(X.shape, X_train.shape, X_test.shape)

(5572,) (4457,) (1115,)
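
If the two classes are imbalanced, a plain random split can leave the test set with a different spam ratio than the training set. The sketch below shows the `stratify` argument of `train_test_split` on hypothetical toy data (20 messages, 4 of them spam); the variable names are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 spam (0) and 16 ham (1) messages
texts = [f"message {i}" for i in range(20)]
labels = [0] * 4 + [1] * 16

# stratify=labels keeps the spam/ham ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=3, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

In the notebook this would correspond to passing `stratify=Y` in the split above, which changes which rows land in each set and therefore the reported accuracies.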


## Feature Extraction

## Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus).


### Term Frequency (TF)
TF measures how frequently a term appears in a document. It is calculated as the number of times a term appears in a document divided by the total number of terms in the document.


### Inverse Document Frequency (IDF)
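IDF measures how rare a term is across the entire corpus. It is typically calculated as the logarithm of the total number of documents divided by the number of documents that contain the term, so terms that occur in only a few documents receive a higher weight.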


### TF-IDF Calculation
TF-IDF is the product of TF and IDF. A term scores high when it is frequent within a document but rare across the corpus, which down-weights words that are common everywhere.
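
The definitions above can be worked through by hand. This is a minimal sketch of the textbook formulas on a hypothetical three-message corpus; the helper names `tf`, `idf`, and `tf_idf` are assumptions made for illustration.

```python
import math

# Hypothetical toy corpus, not taken from the dataset
corpus = [
    "win a free prize now",
    "are you free for lunch",
    "lunch at noon",
]
docs = [text.split() for text in corpus]

def tf(term, doc):
    # Occurrences of the term divided by the number of terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(total documents / documents containing the term);
    # assumes the term appears in at least one document
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "prize" appears in one document, "free" in two, so "prize" scores higher
score_prize = tf_idf("prize", docs[0], docs)
score_free = tf_idf("free", docs[0], docs)
print(round(score_prize, 4), round(score_free, 4))
```

By default, scikit-learn's `TfidfVectorizer` uses raw counts for TF, a smoothed IDF, and L2 normalisation of each row, so its values differ from this textbook version even though the ranking idea is the same.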


### Application in Spam Detection
In spam detection, TF-IDF helps identify terms that are significant in distinguishing between spam and ham emails.


In [18]:
# Transform the text data into TF-IDF feature vectors that can be used as input to logistic regression

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

# Fit the vocabulary and IDF weights on the training set only, then reuse them for the test set
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# The labels are stored with object dtype, so convert them to integers
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

In [19]:
X_train_features

<4457x7510 sparse matrix of type '<class 'numpy.float64'>'
	with 34758 stored elements in Compressed Sparse Row format>
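
To illustrate why the vectorizer is fitted on the training text and only applied to the test text, the sketch below repeats the same two calls on a hypothetical toy corpus. Words that were not seen during fitting (here "lottery") are ignored at transform time, and the test matrix keeps the same number of columns as the training matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages
train_texts = ["win a free prize now", "are you free for lunch", "lunch at noon"]
test_texts = ["free lottery prize"]

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)

# fit_transform learns the vocabulary and IDF weights from the training text
train_matrix = vectorizer.fit_transform(train_texts)

# transform reuses them, so unseen words contribute nothing
test_matrix = vectorizer.transform(test_texts)

print(train_matrix.shape, test_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```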

## Model Selection: Logistic Regression for Spam Detection

In [20]:
model = LogisticRegression()

In [21]:
# Training the Logistic Regression model with the training data

model.fit(X_train_features, Y_train)
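
The Methods section lists tuning as a step, while the model above uses the default settings of `LogisticRegression`. One possible approach, sketched here on hypothetical toy messages, is a cross-validated grid search over the regularisation strength `C`. The pipeline, the grid values, and the number of folds are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages; spam = 0 and ham = 1, as in the notebook
spam = [
    "win a free prize now", "free cash offer claim now",
    "claim your free ticket", "urgent prize waiting call now",
    "free entry win cash", "you won a cash prize",
]
ham = [
    "are we meeting for lunch", "see you at home tonight",
    "call me when you arrive", "thanks for the notes",
    "running late see you soon", "lunch at noon tomorrow",
]
texts = spam + ham
labels = [0] * len(spam) + [1] * len(ham)

# Putting the vectorizer inside the pipeline refits it on each training fold,
# so no information leaks from the validation fold
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

best_C = search.best_params_["logisticregression__C"]
print(best_C, search.best_score_)
```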

### Evaluating model performance on training data and test data using performance metrics

In [22]:
# prediction on training data

prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

In [23]:
print('Accuracy on training data : ', accuracy_on_training_data)

Accuracy on training data :  0.9661207089970832


In [24]:
# prediction on test data

prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

In [25]:
print('Accuracy on test data : ', accuracy_on_test_data)

Accuracy on test data :  0.9623318385650225


The model reaches about 96.6% accuracy on the training data and 96.2% on the test data. The small gap between the two suggests that the model is not overfitting.
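
Accuracy alone can hide weak spam detection when one class dominates the data. As a complement, the sketch below computes a confusion matrix plus precision and recall for the spam class. The label lists are hypothetical and use the notebook's encoding (spam = 0, ham = 1); in the notebook they would be `Y_test` and `prediction_on_test_data`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (spam = 0, ham = 1)
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes, in the order [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# pos_label=0 makes spam the class being scored
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(cm)
print(spam_precision, spam_recall)
```

In this toy case the accuracy is 80%, yet only one of the three spam messages is caught, which is the kind of gap these metrics expose.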

## Spam or Ham Mail Prediction

In [26]:
input_mail = ["Subject: You’ve Won a Prize! Congratulations! You have been selected to win a $1,000 gift card. Click here to claim your prize now."]

# Convert the text to feature vectors
input_data_features = feature_extraction.transform(input_mail)

# Make the prediction (0 = spam, 1 = ham)
prediction = model.predict(input_data_features)
print(prediction)

if prediction[0] == 1:
    print('Ham mail')
else:
    print('Spam mail')

[0]
Spam mail
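
Besides the hard 0/1 label, logistic regression can report how confident it is through `predict_proba`. The sketch below trains a small model on hypothetical toy messages to stay self-contained; in the notebook the same call would be made on `model` with `input_data_features`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy messages; spam = 0 and ham = 1, as in the notebook
texts = [
    "win a free prize now", "claim your free cash prize",
    "are we meeting for lunch", "see you at home tonight",
]
labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(texts), labels)

# predict_proba returns one column per class, ordered as in model.classes_
proba = model.predict_proba(vectorizer.transform(["claim your free prize"]))[0]
spam_index = list(model.classes_).index(0)
spam_probability = proba[spam_index]

print(f"Estimated spam probability: {spam_probability:.3f}")
```

A probability makes it possible to choose a decision threshold other than 0.5, for example to reduce the number of ham messages wrongly flagged as spam.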
