***Email Phishing Classification using Multinomial Naive Bayes***

**This script reads a CSV file containing email text and labels ('Safe Email' or
'Phishing Email'), preprocesses the data, trains a Multinomial Naive Bayes
model, and evaluates its performance.**

**The input CSV file should have two columns:**
- 'Email Text': The email content (text).
- 'Email Type': The email classification label ('Safe Email' or 'Phishing Email').

**The script performs the following steps:**

- Filters out rows with missing or empty email text.
- Cleans the email text (removes URLs and punctuation, lowercases, and drops English stop words).
- Encodes the labels into numerical values.
- Splits the data into training and testing sets.
- Converts the email text into numerical features using TF-IDF vectorization.
- Trains a Multinomial Naive Bayes model.
- Evaluates the model's performance using accuracy and a classification report.

In [5]:
# Import Libraries

import pandas as pd  # Import the pandas library for data manipulation
from sklearn.model_selection import train_test_split  # Import train_test_split to split data
from sklearn.feature_extraction.text import TfidfVectorizer  # Import TfidfVectorizer for text vectorization
from sklearn.naive_bayes import MultinomialNB  # Import MultinomialNB for Naive Bayes classification
from sklearn.metrics import accuracy_score, classification_report  # Import metrics for model evaluation
from sklearn.preprocessing import LabelEncoder  # Import LabelEncoder to encode labels
import nltk  # Import the NLTK library for natural language processing
import re  # Import the re library for regular expressions

In [6]:
# Download NLTK resources (if not already downloaded)
# NLTK (Natural Language Toolkit) is a suite of libraries and programs for
# symbolic and statistical natural language processing for English.
# These lines download necessary resources:
#   - 'stopwords': A collection of common words (like "the", "a", "is") that
#                  are often removed during text preprocessing.
#   - 'punkt': A pre-trained model for tokenization, which is the process of
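#              splitting text into individual words or sentences.
#              Note: recent NLTK releases load 'punkt_tab' instead; if
#              word_tokenize raises a LookupError, download that resource too.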

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /Users/Manas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/Manas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
# Data Cleaning and Preprocessing

def clean_text(text):
    """
    This function performs several text cleaning operations to prepare the input text
    for further analysis. These operations are crucial for reducing noise and 
    standardizing the text data.

    Args:
        text: The input text string to be cleaned.

    Returns:
        A cleaned version of the input text string.
    """
    text = re.sub(r'http\S+', '', text)   # Remove URLs from the text using regular expressions.
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and special characters from the text.
    text = text.lower()                   # Convert the text to lowercase to ensure consistency.
    tokens = nltk.word_tokenize(text)    # Tokenize the text into individual words using NLTK's word_tokenize.
    stop_words = set(nltk.corpus.stopwords.words('english')) # Get the set of English stop words from NLTK.
    tokens = [word for word in tokens if word not in stop_words] # Remove stop words from the list of tokens.
    return " ".join(tokens)              # Join the cleaned tokens back into a single string, separated by spaces.
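
To see what each step in `clean_text` does, here is a minimal sketch that applies the same regular expressions to a made-up email. So that it runs without any NLTK data, it swaps in `str.split` for `nltk.word_tokenize` and a tiny hand-written set (`DEMO_STOP_WORDS`) for NLTK's English stop word list. Both substitutions, the function name `clean_text_demo`, and the sample text are assumptions for illustration, so the real function may tokenize some inputs differently.

```python
import re

# Hypothetical stand-in for NLTK's English stop word list (assumption).
DEMO_STOP_WORDS = {"the", "a", "is", "to", "your", "at"}

def clean_text_demo(text):
    """Mirror the steps of clean_text using only the standard library."""
    text = re.sub(r'http\S+', '', text)   # Drop URLs.
    text = re.sub(r'[^\w\s]', '', text)   # Drop punctuation and symbols.
    text = text.lower()                   # Normalize case.
    tokens = text.split()                 # Whitespace split instead of word_tokenize.
    tokens = [word for word in tokens if word not in DEMO_STOP_WORDS]
    return " ".join(tokens)

sample = "URGENT: Verify your account at http://example.com/login NOW!"
cleaned = clean_text_demo(sample)
print(cleaned)  # urgent verify account now
```

Note that the URL is removed entirely, so any signal carried by the link itself is not available to the classifier.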

In [8]:
def preprocess_data(input_file):
    """
    Read the input CSV file and drop rows whose email text is missing or empty.

    Args:
        input_file (str): The path to the input CSV file.

    Returns:
        pandas.DataFrame: The filtered DataFrame, or None if an error occurs.
    """
    try:
        # Read the CSV file; pandas uses the first row as the column headers
        df = pd.read_csv(input_file)

    except FileNotFoundError:
        # Handle the case where the input file is not found
        print(f"Error: Input file '{input_file}' not found.")
        return None

    except Exception as e:
        # Handle other potential errors during file reading
        print(f"An error occurred during file reading: {e}")
        return None

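    # Filter out rows where the 'Email Text' column is either null or empty.
    # .copy() returns an independent DataFrame, so later column assignments
    # do not trigger pandas' SettingWithCopyWarning.
    df = df[df['Email Text'].notna() & (df['Email Text'] != '')].copy()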

    # Return the preprocessed DataFrame
    return df

In [9]:
# Import Training Data
input_csv = "Phishing_Email-dataset.csv"  # Replace with your input file name

In [10]:
# Preprocess the data
df = preprocess_data(input_csv)
df

Unnamed: 0,Email Text,Email Type
0,available capacity pursuant to our discussion ...,Safe Email
1,base gas roll for april 01 due to some unsched...,Safe Email
2,sum : computer transcription a couple of month...,Safe Email
3,re : any changes should come from bob ' s grou...,Safe Email
4,historical ling announcing le lingue indoeurop...,Safe Email
...,...,...
1995,"re : enroncredit . com vasant , tanya any inte...",Safe Email
1996,empty,Phishing Email
1997,re : tenaska iv 10 / 00 i don ' t see anything...,Safe Email
1998,audit points i want to make sure that we addre...,Safe Email


In [11]:
# Extract the email text and labels from the DataFrame
# Assuming 'df' is your DataFrame containing the email data
# and it has columns named 'Email Text' and 'Email Type'

# Apply the clean_text function to the 'Email Text' column to preprocess the text.
# This cleans and standardizes the email text for analysis.
df['Email Text'] = df['Email Text'].apply(clean_text)

# Assign the cleaned email text data to the feature variable 'X'.
X = df['Email Text']
# Assign the email type labels to the target variable 'y'.
y = df['Email Type']

# Convert the categorical labels to numerical values. LabelEncoder numbers the
# classes in sorted order, so 'Phishing Email' becomes 0 and 'Safe Email' becomes 1.
# This is necessary because machine learning models typically work with numerical data.
label_encoder = LabelEncoder()
# Use LabelEncoder to transform the categorical labels in 'y' into numerical labels.
y_encoded = label_encoder.fit_transform(y)

# Split the data into training and testing sets (80% training, 20% testing).
# This divides the dataset into two parts: one for training the model and one for evaluating its performance.
# 'test_size=0.2' specifies that 20% of the data will be used for testing.
# 'random_state=42' ensures that the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Display the first few rows of the training feature data (X_train).
# This is useful for quickly inspecting the preprocessed data.
X_train.head()

968    summary typewriter canadian raising recently a...
240    begin forwarded text rahshipwrightcom hackervu...
819    houston gbn steering committee june 4 th remin...
692    proulx wrote sent friday august 02 2002 959 pm...
420    reply new sexy anime would believe fserh pron ...
Name: Email Text, dtype: object

In [13]:
y_train

array([1, 1, 1, ..., 0, 1, 0])
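
The 0s and 1s above follow `LabelEncoder`'s convention of numbering classes in sorted order of their names, so 'Phishing Email' is 0 and 'Safe Email' is 1. The sketch below mimics that mapping in plain Python rather than calling scikit-learn; the `labels` list is made up for illustration.

```python
# Hypothetical labels, in the same format as the 'Email Type' column.
labels = ["Safe Email", "Phishing Email", "Safe Email", "Safe Email"]

# Sorted unique class names define the integer codes.
classes = sorted(set(labels))
class_to_index = {name: index for index, name in enumerate(classes)}

encoded = [class_to_index[name] for name in labels]   # Like fit_transform.
decoded = [classes[index] for index in encoded]       # Like inverse_transform.

print(class_to_index)  # {'Phishing Email': 0, 'Safe Email': 1}
print(encoded)         # [1, 0, 1, 1]
```

In this notebook, `label_encoder.classes_` lists the class names in the same order as their integer codes, which is a quick way to confirm the mapping.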

In [14]:
# Convert the email text into numerical features using TF-IDF vectorization.
# TF-IDF (Term Frequency-Inverse Document Frequency) is a technique that weighs
# words based on their frequency in a document and their inverse document frequency
# across the entire corpus. This helps to identify important words in each email.
# 'max_features' limits the number of features to the top 5000 most frequent words.
# This helps to reduce dimensionality and improve efficiency.
vectorizer = TfidfVectorizer(max_features=5000)

# Fit and transform the training data.
# 'fit_transform' learns the vocabulary and IDF weights from the training data
# and transforms the training data into a TF-IDF matrix.
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform the testing data.
# 'transform' reuses the vocabulary and IDF weights learned from the training data
# to transform the testing data into a TF-IDF matrix. It's important to use 'transform'
# and not 'fit_transform' on the test data to avoid data leakage. Data leakage occurs
# when information from the test set is used to influence the training process,
# which can lead to overly optimistic performance estimates.
X_test_tfidf = vectorizer.transform(X_test)
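
To make the fit/transform distinction concrete, here is a minimal hand-rolled sketch of TF-IDF. It follows the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as `TfidfVectorizer` defaults, but it simplifies tokenization to a whitespace split. The documents and the helper names `fit_idf` and `transform` are hypothetical.

```python
import math
from collections import Counter

# Made-up, already-cleaned documents (assumption).
train_docs = ["verify account now", "meeting agenda attached", "verify account password"]
test_docs = ["verify password lottery"]  # "lottery" never appears in training.

def fit_idf(docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc, idf):
    """Weight term counts by the learned IDF, then L2-normalize the vector."""
    counts = Counter(term for term in doc.split() if term in idf)  # Unseen terms are dropped.
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

idf = fit_idf(train_docs)
test_vector = transform(test_docs[0], idf)
print(test_vector)
```

Because the IDF table is built from `train_docs` alone, the test-only word "lottery" contributes nothing, and "password" (seen in one training document) receives a larger weight than "verify" (seen in two).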

In [15]:
# Create and train a Multinomial Naive Bayes model.
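# Multinomial Naive Bayes is a probabilistic classifier suitable for text data.
# It estimates the probability of an email belonging to a certain category (phishing or safe)
# from per-class word statistics. The model is defined for integer word counts, but
# fractional values such as the TF-IDF weights used here also work in practice.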
model = MultinomialNB()
# 'fit' trains the Multinomial Naive Bayes model using the TF-IDF features of the training data
# and the corresponding training labels.
model.fit(X_train_tfidf, y_train)

# Make predictions on the testing data.
# 'predict' uses the trained Multinomial Naive Bayes model to predict the category (phishing or safe)
# for each email in the test set, based on its TF-IDF features.
y_pred = model.predict(X_test_tfidf)

# Evaluate the model's performance.
# 'accuracy_score' calculates the accuracy of the model's predictions by comparing the predicted labels (y_pred)
# with the actual labels (y_test). Accuracy is the proportion of correctly classified emails.
accuracy = accuracy_score(y_test, y_pred)
# 'classification_report' generates a detailed report that includes precision, recall, F1-score, and support
# for each category (phishing and safe).
# 'target_names' provides the original categorical labels for better readability of the report.
report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)

# Print the accuracy score and classification report.
# This displays the model's performance metrics to the user.
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", report)

Accuracy: 0.9400
Classification Report:
                 precision    recall  f1-score   support

Phishing Email       0.96      0.87      0.91       143
    Safe Email       0.93      0.98      0.95       257

      accuracy                           0.94       400
     macro avg       0.95      0.92      0.93       400
  weighted avg       0.94      0.94      0.94       400
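
The per-class columns in the report above can be reproduced from counts of true positives, false positives, and false negatives. The sketch below does this on a small made-up set of labels (0 for 'Phishing Email', 1 for 'Safe Email'); the data and the helper name `per_class_metrics` are hypothetical and are not taken from the notebook's test set.

```python
# Hypothetical labels: 0 = 'Phishing Email', 1 = 'Safe Email'.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

def per_class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1, support) treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn

accuracy_demo = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
print(per_class_metrics(y_true_demo, y_pred_demo, positive=0))
print(per_class_metrics(y_true_demo, y_pred_demo, positive=1))
print(accuracy_demo)
```

The toy data mirrors the pattern in the real report: every email flagged as phishing is truly phishing (high precision), but one phishing email slips through as safe (lower recall). In the report above, a phishing recall of 0.87 means roughly 13% of the 143 phishing emails in the test set were labeled safe.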


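For intuition about the classifier used in this cell, here is a minimal from-scratch sketch of Multinomial Naive Bayes with Laplace smoothing. It is not scikit-learn's implementation: it works on raw word counts rather than TF-IDF weights, the training emails are made up, and the helpers `fit_nb` and `predict_nb` are hypothetical. The smoothing value `alpha=1.0` matches the documented default of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

# Made-up, already-cleaned training emails (assumption).
train = [
    ("verify account password now", "Phishing Email"),
    ("claim prize verify account", "Phishing Email"),
    ("meeting agenda attached", "Safe Email"),
    ("project meeting schedule", "Safe Email"),
]

def fit_nb(samples, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for text, label in samples:
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    log_prior = {c: math.log(n / len(samples)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_nb(text, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in text.split() if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

log_prior, log_likelihood = fit_nb(train)
print(predict_nb("verify password", log_prior, log_likelihood))   # Phishing Email
print(predict_nb("meeting schedule", log_prior, log_likelihood))  # Safe Email
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that is absent from one class from forcing that class's probability to zero.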

In [21]:
# --- Code to display sample predictions ---

# Create a DataFrame to hold the test data and predictions
results_df = pd.DataFrame({'Email Text': X_test, 'Actual': y_test, 'Predicted': y_pred})

# Decode the numerical labels back to their original categorical values
results_df['Actual'] = label_encoder.inverse_transform(results_df['Actual'])
results_df['Predicted'] = label_encoder.inverse_transform(results_df['Predicted'])

# Separate phishing and non-phishing samples
phishing_results = results_df[results_df['Actual'] == 'Phishing Email']
non_phishing_results = results_df[results_df['Actual'] == 'Safe Email']

# Display some phishing examples
print("\n--- Phishing Email Examples ---")
if len(phishing_results) > 0:
    for i in range(min(5, len(phishing_results))):  # Display up to 5 examples
        print(f"\nEmail Text: {phishing_results.iloc[i]['Email Text']}")
        print(f"Actual: {phishing_results.iloc[i]['Actual']}")
        print(f"Predicted: {phishing_results.iloc[i]['Predicted']}")
else:
    print("No phishing emails found in the test set.")

# Display some non-phishing examples
print("\n--- Safe Email Examples ---")
if len(non_phishing_results) > 0:
    for i in range(min(5, len(non_phishing_results))):  # Display up to 5 examples
        print(f"\nEmail Text: {non_phishing_results.iloc[i]['Email Text']}")
        print(f"Actual: {non_phishing_results.iloc[i]['Actual']}")
        print(f"Predicted: {non_phishing_results.iloc[i]['Predicted']}")
else:
    print("No safe emails found in the test set.")


--- Phishing Email Examples ---

Email Text: confirmation id 273952435 attempted contact refinance home current loan pre approved loan qualifies fixed rate 4 21 lower save hundreds month current mortgage payments visit confirm information within 24 hours
Actual: Phishing Email
Predicted: Phishing Email

Email Text: fancy flutter atip thursday 26th march best racing tip day phone 0897555293 1 30pm nt believe try risk 50p phone call could gain entertaining additional source income fact information recorded telephone number race days consistently produced winning bets hit rate 60 7 fact 500 betting bank march 1997 would worth 7 719 followed simple betting process involving two telephone calls day would like free details plan leave name address end call happy send full 20 instructions calls charged per second 42p 25 seconds note please nt phone 1 30pm waste call 02
Actual: Phishing Email
Predicted: Safe Email

Email Text: report make fortune e b goode ebay 1 rated work home business oppor