# Email Phishing Detection Model

## About Dataset

The Email Phishing Dataset, compiled by Kaggle user Ethan Cratchley, combines two sources: "safe" emails from the Enron Email Dataset, and both phishing and "safe" emails from the Phishing Email Dataset. It includes features such as the number of words, unique words, stopwords, links, unique domains, email addresses, and spelling errors, along with a label for each email. The dataset is intended for training a machine learning model that detects phishing emails. Companies can use such a model to flag phishing emails, prevent attacks, and protect sensitive data. Cybersecurity teams can also use it to find patterns in phishing emails, which in turn helps them fine-tune and improve the model.

# 1. Data Preprocessing


- Perform exploratory data analysis (EDA) to gain insights into the dataset's structure and distributions.
- Handle missing values, outliers, and any inconsistencies in the data.
- Encode categorical variables and normalize numerical features as necessary.
- Split the dataset into training and testing sets, ensuring a proper balance of classes.
- Discuss the techniques you applied to address class imbalance, if the dataset is imbalanced.
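
The splitting and class-balance steps above can be sketched as follows. This is a minimal illustration rather than the notebook's actual split: it assumes scikit-learn is available, and it uses a synthetic frame (`demo`) with hypothetical feature names (`num_words`, `num_links`); only the `label` column name comes from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, exactly 5% phishing (label = 1).
# Feature names other than 'label' are hypothetical.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "num_words": rng.integers(10, 500, size=1000),
    "num_links": rng.integers(0, 10, size=1000),
    "label": (np.arange(1000) % 20 == 0).astype(int),
})

X = demo.drop(columns="label")
y = demo["label"]

# check the class balance before splitting
print(y.value_counts(normalize=True))

# stratify=y keeps the phishing/safe ratio the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# share of phishing emails in each split
print(y_train.mean(), y_test.mean())
```

Stratifying only preserves the class ratio; it does not rebalance the classes. If the real dataset turns out to be imbalanced, a separate technique such as class weights or resampling of the training set would still be needed.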



In [None]:
# importing initial libraries needed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)

In [None]:
# importing and reading data set
from google.colab import files
uploaded = files.upload()
import io
df = pd.read_csv(io.BytesIO(uploaded['email_phishing_data.csv']))
df.head()

In [None]:
df.shape # check for size and shape of data set

In [None]:
df.dtypes # check column types to find categorical columns


In [None]:
df[df.duplicated()] # check for duplicated rows

In [None]:
df.isna().sum() # check for missing values

In [None]:
df.describe() # summary statistics for each numeric column

In [None]:
# visualizing columns to find outliers
numeric_cols = df.select_dtypes(include='number').columns.drop('label')

n = len(numeric_cols)
cols = 3
rows = (n + cols - 1) // cols

fig, axes = plt.subplots(rows, cols, figsize=(cols*5, rows*4))
axes = axes.flatten()

# box plot for every feature
for i, col in enumerate(numeric_cols):
    sns.boxplot(y=df[col], ax=axes[i])
    axes[i].set_title(col)

# remove unused subplot axes (does not rely on the loop variable above)
for ax in axes[len(numeric_cols):]:
    fig.delaxes(ax)

plt.tight_layout()
plt.show()



In [None]:
# show all rows labeled as phishing ('label' == 1)
df[df['label'] == 1]

In [None]:
# box plot of each feature grouped by 'label' to see whether outliers are associated with phishing emails
n = len(numeric_cols)
cols = 3
rows = (n + cols - 1) // cols

fig, axes = plt.subplots(rows, cols, figsize=(cols*5, rows*4))
axes = axes.flatten()

# box plot for every feature
for i, col in enumerate(numeric_cols):
    sns.boxplot(x='label', y=col, data=df, ax=axes[i])
    axes[i].set_title(col)

# remove unused subplot axes (does not rely on the loop variable above)
for ax in axes[len(numeric_cols):]:
    fig.delaxes(ax)

plt.tight_layout()
plt.show()



In [None]:
# outlier detection using z-score
from scipy.stats import zscore

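# z-score of every numeric feature, computed column by column
z_scores = df[numeric_cols].apply(zscore)

# rows where at least one feature lies more than 3 standard deviations from its mean
outliers_zscore = df[(z_scores.abs() > 3).any(axis=1)]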
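
To make the z-score rule concrete, here is a small sketch on a toy column. The frame `demo` and the column name `num_links` are hypothetical. The z-score is computed by hand with NumPy's population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 49 typical values and one extreme value (50 rows)
demo = pd.DataFrame(
    {"num_links": [10, 11, 12, 13, 14] * 9 + [10, 11, 12, 13] + [500]}
)

# z = (x - mean) / std, using the population standard deviation (ddof=0)
values = demo["num_links"].to_numpy(dtype=float)
z = (values - values.mean()) / values.std()

# flag rows more than 3 standard deviations from the mean
flagged = np.abs(z) > 3
print(int(flagged.sum()))  # number of flagged rows
print(demo[flagged])       # the flagged rows themselves
```

Because the extreme value also inflates the mean and standard deviation, a z-score threshold can miss outliers in small or heavily skewed columns. With population standard deviation, no value among `n` samples can have a z-score above `sqrt(n - 1)`.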

In [None]:
# cap outliers (winsorize) at the 1st and 99th percentiles of each feature
for col in numeric_cols:
    upper_cap = df[col].quantile(0.99)
    lower_cap = df[col].quantile(0.01)
    df[col] = df[col].clip(lower=lower_cap, upper=upper_cap)
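
The effect of percentile capping is easy to see on a toy series. The sketch below uses a hypothetical series `s` with one extreme value and applies the same `quantile` and `clip` calls as the cell above.

```python
import pandas as pd

# Hypothetical toy feature: the values 1..99 plus one extreme value
s = pd.Series(range(1, 101), dtype=float)
s.iloc[-1] = 10_000.0

# caps at the 1st and 99th percentiles (pandas interpolates linearly by default)
lower_cap = s.quantile(0.01)
upper_cap = s.quantile(0.99)

# values outside [lower_cap, upper_cap] are pulled to the nearest cap, not dropped
capped = s.clip(lower=lower_cap, upper=upper_cap)

print(s.min(), s.max())            # range before capping
print(capped.min(), capped.max())  # range after capping
print(int((capped != s).sum()))    # number of values that were changed
```

Capping keeps every row, which matters when the extreme rows are themselves informative, for example when phishing emails tend to contain unusually many links. The trade-off is that the capped values no longer reflect the original magnitudes.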

In [None]:
# re-plot every feature after capping to confirm the extreme values were clipped
numeric_cols = df.select_dtypes(include='number').columns.drop('label')

n = len(numeric_cols)
cols = 3
rows = (n + cols - 1) // cols

fig, axes = plt.subplots(rows, cols, figsize=(cols*5, rows*4))
axes = axes.flatten()

# box plot for every feature
for i, col in enumerate(numeric_cols):
    sns.boxplot(y=df[col], ax=axes[i])
    axes[i].set_title(col)

# remove unused subplot axes (does not rely on the loop variable above)
for ax in axes[len(numeric_cols):]:
    fig.delaxes(ax)

plt.tight_layout()
plt.show()