Author : VISHAL MEHARWADE

Internship under **OASIS INFOBYTE**

**Task 4: Email Spam Detection using Machine Learning**

Problem statement: Spam mail, or junk mail, is email sent to a massive number of users at one time, frequently containing cryptic messages, scams, or, most dangerously, phishing content.

In this project, we build an email spam detector, using machine learning to train it to recognise and classify emails as spam or non-spam.

**Importing all the necessary libraries**

In [31]:
import numpy as np
import pandas as pd

import nltk
nltk.download('punkt')
nltk.download('stopwords', quiet=True)  # needed by stopwords.words('english') below
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Loading the Dataset**

In [32]:
data = pd.read_csv('/content/spam.csv', encoding='latin-1')

In [33]:
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


The Unnamed: 2, Unnamed: 3, and Unnamed: 4 columns are dropped because they contain null values.

In [34]:
columns_to_drop = ['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']
data = data.drop(columns_to_drop, axis=1, errors='ignore')
data.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


**Tokenization and cleaning**

The v2 column is converted to lower case and then cleaned.

In [35]:
data['v2'] = data['v2'].str.lower()  # lowercase the message text in the "v2" column

In [36]:
data['v2']

0       go until jurong point, crazy.. available only ...
1                           ok lar... joking wif u oni...
2       free entry in 2 a wkly comp to win fa cup fina...
3       u dun say so early hor... u c already then say...
4       nah i don't think he goes to usf, he lives aro...
                              ...                        
5567    this is the 2nd time we have tried 2 contact u...
5568                will ì_ b going to esplanade fr home?
5569    pity, * was in mood for that. so...any other s...
5570    the guy did some bitching but i acted like i'd...
5571                           rofl. its true to its name
Name: v2, Length: 5572, dtype: object

In [37]:
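# Build these once instead of on every call
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    """Tokenize, drop punctuation and stop words, then stem each token."""
    tokens = word_tokenize(text)
    # Keep alphanumeric tokens that are not English stop words
    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    return ' '.join(stemmed_tokens)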

data['v2'] = data['v2'].apply(preprocess_text)
data['v2']

0       go jurong point crazi avail bugi n great world...
1                                   ok lar joke wif u oni
2       free entri 2 wkli comp win fa cup final tkt 21...
3                     u dun say earli hor u c alreadi say
4                    nah think goe usf live around though
                              ...                        
5567    2nd time tri 2 contact u pound prize 2 claim e...
5568                                b go esplanad fr home
5569                                    piti mood suggest
5570    guy bitch act like interest buy someth els nex...
5571                                       rofl true name
Name: v2, Length: 5572, dtype: object
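
To see what the filter inside preprocess_text keeps, here is a minimal standard-library sketch. It assumes the text is already tokenized, uses a tiny hypothetical stop-word set (TOY_STOP_WORDS) in place of NLTK's English list, and skips stemming, so it illustrates only the isalnum() and stop-word checks.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
TOY_STOP_WORDS = {"i", "he", "to", "so", "the", "until", "only"}

def filter_tokens(tokens, stop_words=TOY_STOP_WORDS):
    """Keep tokens that are purely alphanumeric and not stop words."""
    return [tok for tok in tokens if tok.isalnum() and tok not in stop_words]

# Tokens roughly as a tokenizer might split "ok lar... joking wif u oni..."
tokens = ["ok", "lar", "...", "joking", "wif", "u", "oni", "..."]
kept = filter_tokens(tokens)
print(kept)
```

Punctuation-only tokens such as "..." fail the isalnum() check and are dropped, which is why the cleaned messages above contain no punctuation.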

Note: when I first ran this code, it failed with the error below, so I downloaded the resource with nltk.download('punkt') (see the imports cell above).

Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt')

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

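# Reload the dataset. Note: this replaces the preprocessed DataFrame from the
# cells above, so TF-IDF is fit on the raw messages (TfidfVectorizer lowercases
# them by default).
data = pd.read_csv('/content/spam.csv', encoding='latin-1')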

# Feature Extraction (TF-IDF)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data['v2'])
tfidf_matrix

<5572x8672 sparse matrix of type '<class 'numpy.float64'>'
	with 73916 stored elements in Compressed Sparse Row format>
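
For intuition about what the numbers in this matrix mean, the sketch below recomputes TF-IDF weights by hand on three toy messages. It assumes scikit-learn's default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization per row) and uses a plain whitespace split instead of the vectorizer's own tokenizer. The helper toy_tfidf and the messages are made up for illustration.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return one {term: weight} dict per document (illustrative helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed IDF (the smooth_idf=True formula).
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
        # L2-normalize so every non-empty row has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

docs = ["free entry win prize", "ok see you later", "win free cash now"]
for row in toy_tfidf(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

Words that appear in fewer messages (such as "entry") get a larger weight than words shared across messages (such as "free"), which is the property the classifier relies on.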

In [41]:
# Label Encoding
data['v1'] = data['v1'].map({'ham': 0, 'spam': 1})

data['v1']

0       0
1       0
2       1
3       0
4       0
       ..
5567    1
5568    0
5569    0
5570    0
5571    0
Name: v1, Length: 5572, dtype: int64

In [42]:

# Split Data
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, data['v1'], test_size=0.2, random_state=42)

# Check the shape of the TF-IDF matrix and the split data
print("TF-IDF Matrix Shape:", tfidf_matrix.shape)
print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)

TF-IDF Matrix Shape: (5572, 8672)
Training Data Shape: (4457, 8672)
Testing Data Shape: (1115, 8672)
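
The split sizes printed above follow from simple arithmetic. As a sketch, assuming train_test_split rounds a fractional test size up to the next whole row (which matches the shapes printed above), the counts can be reproduced like this:

```python
import math

n_samples = 5572   # rows in the dataset
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(n_samples * test_size)  # 1114.4 rounds up
n_train = n_samples - n_test

print("Test rows:", n_test)
print("Train rows:", n_train)
```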


**Random Forest Classifier**

In [44]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Create a Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Train the classifier on the training data
rf_classifier.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = rf_classifier.predict(X_test)


# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

accuracy = accuracy * 100
# Print the results
print("Accuracy:", accuracy)
print("classification report:\n", classification_rep)

Accuracy: 97.66816143497758
classification report:
               precision    recall  f1-score   support

           0       0.97      1.00      0.99       965
           1       1.00      0.83      0.91       150

    accuracy                           0.98      1115
   macro avg       0.99      0.91      0.95      1115
weighted avg       0.98      0.98      0.98      1115
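
Because only 150 of the 1115 test messages are spam, accuracy alone can flatter the model. The sketch below uses only the figures printed above to compute three reference points: the accuracy of a hypothetical baseline that labels every message as ham, the spam F1 score recomputed from the rounded precision and recall, and the number of misclassified test messages implied by the accuracy.

```python
support_ham, support_spam = 965, 150
total = support_ham + support_spam

# A hypothetical baseline that always predicts ham (class 0).
baseline_accuracy = support_ham / total

# F1 is the harmonic mean of precision and recall (spam row of the report).
precision_spam, recall_spam = 1.00, 0.83
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

# Number of test messages the model got wrong, from the printed accuracy.
accuracy_percent = 97.66816143497758
n_errors = round(total * (1 - accuracy_percent / 100))

print(f"Always-ham baseline accuracy: {baseline_accuracy:.3f}")
print(f"Spam F1 from rounded precision/recall: {f1_spam:.2f}")
print("Misclassified test messages:", n_errors)
```

The gap between the baseline and the model, together with the spam recall of 0.83, is a more informative summary than the headline accuracy.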



**Testing the Model**

In [45]:
input_text = """apple Inc.Your iPhone 6 linked top***zm".edu) has been used a few minutes
ago. To localize it,login now to your apple account ."""

# Lowercase the input (TfidfVectorizer also lowercases by default)
input_text = input_text.lower()
# Add more preprocessing steps here if needed

# Transform the input text into a TF-IDF vector
input_tfidf = tfidf_vectorizer.transform([input_text])

# Make a prediction using the trained Random Forest model
prediction = rf_classifier.predict(input_tfidf)

# Report the prediction
if prediction[0] == 1:
    print("This message is predicted to be SPAM by trained model.")
else:
    print("This message is predicted to be NOT SPAM by trained model.")

This message is predicted to be NOT SPAM by trained model.


In [46]:
input_text1 = "Hey, I'm mark. How are you?."

# Lowercase the input (TfidfVectorizer also lowercases by default)
input_text1 = input_text1.lower()

# Transform the input text into a TF-IDF vector
input_tfidf = tfidf_vectorizer.transform([input_text1])

# Make a prediction using the trained Random Forest model
prediction = rf_classifier.predict(input_tfidf)

# Report the prediction
if prediction[0] == 1:
    print("This message is predicted to be SPAM by trained model.")
else:
    print("This message is predicted to be NOT SPAM by trained model.")

This message is predicted to be NOT SPAM by trained model.
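
The two test cells above repeat the same steps, so they could be folded into one helper. The sketch below is one way to do that. The name classify_message is hypothetical, and it assumes only that the vectorizer exposes transform() and the model exposes predict(), as the fitted tfidf_vectorizer and rf_classifier above do.

```python
def classify_message(text, vectorizer, model):
    """Return 'SPAM' or 'NOT SPAM' for a single message (illustrative helper).

    `vectorizer` needs a transform() method and `model` a predict() method,
    e.g. the fitted tfidf_vectorizer and rf_classifier from this notebook.
    """
    # Lowercase, vectorize as a one-item batch, then read the single label.
    features = vectorizer.transform([text.lower()])
    label = model.predict(features)[0]
    return "SPAM" if label == 1 else "NOT SPAM"
```

In this notebook it would be called as classify_message("Hey, I'm mark. How are you?.", tfidf_vectorizer, rf_classifier).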
