**Problem Statement**


My spam mail prediction model converts the text of each email into numerical features and uses them to classify the message as spam or legitimate (ham). Words and phrases that are common in spam, such as 'free' and 'offer', become features that push the prediction toward spam. The resulting classifier can be used to filter unwanted messages and keep a user's inbox cleaner.

Importing Dependencies

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Loading the dataset into a pandas DataFrame
mail_dataset = pd.read_csv('/content/mail_data.csv')

In [None]:
# Checking the first five rows of the DataFrame
mail_dataset.head()

   Category  Message
0  ham       Go until jurong point, crazy.. Available only ...
1  ham       Ok lar... Joking wif u oni...
2  spam      Free entry in 2 a wkly comp to win FA Cup fina...
3  ham       U dun say so early hor... U c already then say...
4  ham       Nah I don't think he goes to usf, he lives aro...


In [None]:
# Checking the number of rows and columns in the DataFrame
mail_dataset.shape

(5572, 2)

In [None]:
# Checking if there are any null values in the DataFrame
mail_dataset.isnull().sum()

Category    0
Message     0
dtype: int64

Label Encoding

In [None]:
# Encoding the labels: spam -> 0, ham -> 1
mail_dataset.loc[mail_dataset['Category'] == 'spam', 'Category'] = 0
mail_dataset.loc[mail_dataset['Category'] == 'ham', 'Category'] = 1
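
The same encoding can also be written with `Series.map`, which produces the numeric labels in one step and turns any unexpected label into a missing value that is easy to detect. This is a minimal sketch on a hypothetical three-row frame, not the notebook's dataset; `toy`, `label_map`, and `encoded` are illustrative names.

```python
import pandas as pd

# Hypothetical toy frame standing in for mail_dataset
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["See you at 5", "WIN a free prize now", "Ok, thanks"],
})

# Same convention as above: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
encoded = toy["Category"].map(label_map)

# Any label missing from label_map becomes NaN, so check before training
assert not encoded.isna().any(), "unexpected label in Category"

print(encoded.tolist())
```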

Splitting the data into texts and labels

In [None]:
x = mail_dataset['Message']
y = mail_dataset['Category']

Splitting the data into training and testing data

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=3)
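
If spam is much rarer than ham in the data, a plain random split can leave the training and testing sets with different spam proportions. `train_test_split` accepts a `stratify` argument that keeps the label proportions the same in both sets. The sketch below uses hypothetical toy data (100 messages, 20 of them spam) and illustrative variable names; it is not the split used in this notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 ham (label 1) and 20 spam (label 0)
toy_x = [f"message {i}" for i in range(100)]
toy_y = [1] * 80 + [0] * 20

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=3, stratify=toy_y
)

# Both sets keep the 20% spam share of the full data
print(ty_train.count(0) / len(ty_train))
print(ty_test.count(0) / len(ty_test))
```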

Feature Extraction: converting the text data (features) into numerical data

In [None]:
# Creating a TfidfVectorizer that lowercases the text and drops English stop words
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

In [None]:
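# Fitting the vectorizer on the training text only, then transforming both sets
# with the same vocabulary so no information from the test set leaks into training
x_train_features = feature_extraction.fit_transform(x_train)
x_test_features = feature_extraction.transform(x_test)

# Converting y_train and y_test into integers
# (the Category column still has object dtype after label encoding)
y_train = y_train.astype('int')
y_test = y_test.astype('int')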
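
To make the TF-IDF weighting less of a black box, here is a hand-rolled sketch of the calculation on a hypothetical three-message corpus. It assumes raw term counts, the smoothed idf formula that scikit-learn documents for `smooth_idf=True`, and L2 normalization. `TfidfVectorizer` also applies its own tokenizer and the English stop-word list, which this sketch leaves out, so it illustrates the idea rather than reproducing the vectorizer. All names below are illustrative.

```python
import math
from collections import Counter

# Hypothetical three-message corpus; tokenization is a plain whitespace split
docs = [
    "free prize call now",
    "call me later",
    "free free offer",
]
tokenized = [d.lower().split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of messages that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in doc_freq.items()}

def tfidf_vector(tokens):
    """Return {term: weight} using raw-count tf, smoothed idf, and L2 normalization.

    Only handles terms seen in the corpus above.
    """
    tf = Counter(tokens)
    weights = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first message, 'prize' appears in only one message of the corpus while 'free' appears in two, so 'prize' gets the larger weight. Rarer terms carry more signal under this scheme.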

Training the ML model

In [None]:
model = LogisticRegression()

In [None]:
# Training the model with training data
model.fit(x_train_features, y_train)

In [None]:
# Evaluating the model's accuracy on training data
training_prediction = model.predict(x_train_features)
training_accuracy = accuracy_score(y_train, training_prediction)

In [None]:
print("The accuracy on training data is :", training_accuracy)

The accuracy on training data is : 0.9670181736594121


In [None]:
# Evaluating the model's accuracy on testing data
testing_prediction = model.predict(x_test_features)
testing_accuracy = accuracy_score(y_test,testing_prediction)

In [None]:
print("The accuracy on testing data is :", testing_accuracy)

The accuracy on testing data is : 0.9659192825112107
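
Accuracy alone can hide weak spam detection when one class dominates, because a model that labels almost everything as the majority class can still score well. Precision and recall for the spam class (label 0 in this notebook) give a fuller picture. The sketch below computes them by hand on hypothetical labels; `spam_precision_recall` is an illustrative helper, and `sklearn.metrics.precision_score` and `recall_score` with `pos_label=0` should give the same numbers.

```python
def spam_precision_recall(y_true, y_pred, spam_label=0):
    """Precision and recall for the spam class, computed from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == spam_label and p == spam_label)
    fp = sum(1 for t, p in pairs if t != spam_label and p == spam_label)
    fn = sum(1 for t, p in pairs if t == spam_label and p != spam_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels: 0 = spam, 1 = ham
true_labels = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
pred_labels = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]

precision, recall = spam_precision_recall(true_labels, pred_labels)
accuracy = sum(t == p for t, p in zip(true_labels, pred_labels)) / len(true_labels)
print(accuracy, precision, recall)
```

In this toy case the accuracy is 0.7, yet only one of the three spam messages is caught. For the notebook's model, the helper could be called as `spam_precision_recall(y_test, testing_prediction)`.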


Making our own prediction system

In [None]:
# A sample message to classify (named input_mail to avoid shadowing the built-in input)
input_mail = ["As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a £1500 Bonus Prize, call 09066364589"]
input_features = feature_extraction.transform(input_mail)

# Making prediction (0 = spam, 1 = ham)
input_prediction = model.predict(input_features)
if input_prediction[0] == 0:
    print("It is a spam mail")
else:
    print("It is a ham mail")

It is a spam mail
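
The prediction steps can be wrapped in a small reusable function. This is a sketch, not part of the original notebook: `classify_mail` is a hypothetical helper, and the tiny training set below exists only to make the example runnable on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def classify_mail(text, vectorizer, classifier):
    """Return 'spam' or 'ham' for one message, using spam = 0 and ham = 1."""
    features = vectorizer.transform([text])
    return "spam" if classifier.predict(features)[0] == 0 else "ham"

# Hypothetical toy training data, only to make the sketch self-contained
toy_texts = [
    "win a free prize call now",
    "free bonus offer claim your prize",
    "are we still meeting for lunch",
    "see you at home tonight",
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

result = classify_mail("claim your free prize", toy_vectorizer, toy_model)
print(result)
```

With the objects trained in this notebook, the call would be `classify_mail(some_text, feature_extraction, model)`.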
