# Spam Detector

Background:

Let's say you work at an Internet Service Provider (ISP) and you've been tasked with improving the email filtering system for its customers. You've been provided with a dataset that contains information about emails, with two possible classifications: spam and not spam. The ISP wants you to take this dataset and develop a supervised machine learning model that will accurately detect spam emails, so it can filter them out of its customers' inboxes.

What we're creating:

You will create two classification models, fit them to the provided data, and evaluate which one detects spam more accurately. The two models are a logistic regression model and a random forest model.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Retrieve the Data

The data is located at [https://static.bc-edx.com/mbc/ai/m4/datasets/spam-data.csv](https://static.bc-edx.com/mbc/ai/m4/datasets/spam-data.csv)

Dataset Source: [UCI Machine Learning Repository](https://archive-beta.ics.uci.edu/dataset/94/spambase)

Import the data using Pandas. Display the resulting DataFrame to confirm the import was successful.

In [2]:
# Import the data
# column 'spam' denotes categorical data
data = pd.read_csv("https://static.bc-edx.com/mbc/ai/m4/datasets/spam-data.csv")
data.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


## Predict Model Performance

You will create and compare two models on this data: a logistic regression model and a random forest classifier. Before you create, fit, and score the models, predict which one you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

*Replace the text in this markdown cell with your predictions, and be sure to provide justification for your guess.*

## Split the Data into Training and Testing Sets

In [3]:
# Create the labels set `y` and features DataFrame `X`

# Start from a copy of the full dataset
X = data.copy()

# Drop the target column `spam` so that only feature columns remain
X = X.drop(columns='spam')

# Create the labels set `y` (a Series) from the `spam` column
y = data['spam']


In [4]:
# Check the balance of the labels variable (`y`) by using the `value_counts` function.
# This counts how many emails fall into each category.
# 0 means not spam and 1 means spam (labels as provided in the dataset)
y.value_counts()


spam
0    2788
1    1813
Name: count, dtype: int64
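
The counts above can be turned into class shares, which gives a useful reference point for the accuracy scores later on. The following is a minimal sketch that re-creates the two counts by hand instead of loading the dataset; the names `label_counts`, `label_share`, and `majority_baseline` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Label counts copied from the value_counts() output above
label_counts = pd.Series({0: 2788, 1: 1813}, name="count")

# Share of each class in the full dataset
label_share = label_counts / label_counts.sum()

# Accuracy of a trivial model that always predicts the majority class (0, not spam)
majority_baseline = label_share.max()

print(label_share.round(3))
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model worth keeping should score clearly above this baseline, since always answering "not spam" already gets the majority class right.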

In [5]:
# Split the data into X_train, X_test, y_train, y_test

# Use train_test_split to separate the data
# Set random_state so that the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# visualize X_train
X_train


Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
4576,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0,0.0,0.00,...,0.00,0.000,0.131,0.000,0.000,0.000,0.0,1.488,5,64
4401,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0,0.0,0.00,...,0.00,0.000,0.000,0.000,0.000,0.000,0.0,1.571,5,11
3707,0.17,0.0,0.17,0.0,0.00,0.00,0.0,0.0,0.8,0.00,...,0.00,0.253,0.168,0.084,0.000,0.024,0.0,4.665,81,1031
2362,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0,0.0,0.00,...,0.00,0.000,0.000,0.000,0.000,0.000,0.0,4.228,53,148
1537,0.00,0.0,0.00,0.0,2.17,0.00,0.0,0.0,0.0,0.00,...,0.00,0.000,0.000,0.000,0.000,0.000,0.0,1.333,5,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2895,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0,0.0,0.00,...,0.00,0.000,0.000,0.000,0.000,0.000,0.0,1.125,3,18
2763,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0,0.0,0.00,...,4.76,0.000,0.000,0.000,0.000,0.000,0.0,1.800,5,9
905,0.00,0.0,0.76,0.0,0.76,0.00,0.5,0.5,0.0,1.01,...,0.00,0.000,0.078,0.000,0.433,0.433,0.0,2.441,19,249
3980,0.00,0.0,0.87,0.0,0.00,0.17,0.0,0.0,0.0,0.00,...,0.34,0.022,0.022,0.000,0.000,0.000,0.0,1.601,11,277
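
Because the call to `train_test_split` above does not pass `test_size`, scikit-learn falls back to its default of holding out 25% of the rows. The sketch below illustrates this with placeholder arrays that have the same number of rows as the dataset (2788 + 1813 = 4601); the arrays and variable names are assumptions made for illustration, not the real features. It also shows the optional `stratify` argument, which the original notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset
n_rows = 4601
X_demo = np.zeros((n_rows, 3))
y_demo = np.array([0] * 2788 + [1] * 1813)

# Default split: no test_size given, so 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(len(X_tr), len(X_te))

# Optional variant: stratify keeps the spam share nearly identical in both subsets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)
print(round(y_tr_s.mean(), 3), round(y_te_s.mean(), 3))
```

With moderately imbalanced labels like these, stratifying is a cheap way to make sure the test set has the same class mix as the training set.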


## Scale the Features

Use the `StandardScaler` to scale the features data. Remember that only `X_train` and `X_test` DataFrames should be scaled.

In [6]:
from sklearn.preprocessing import StandardScaler

# Create the StandardScaler instance
scaler = StandardScaler()

In [7]:
# Fit the Standard Scaler with the training data
X_scaler = scaler.fit(X_train)

In [8]:
# Scale the training data
X_train_scaled = X_scaler.transform(X_train)

# Transform the testing data using the scaler
X_test_scaled = X_scaler.transform(X_test)

# Display X_train_scaled (not shown: a cell only displays its last expression)
X_train_scaled

# Display X_test_scaled (this is the array shown in the output below)
X_test_scaled

array([[-0.35811925, -0.16744248, -0.56071981, ...,  0.07078032,
        -0.09748408, -0.12139738],
       [-0.35811925, -0.16744248, -0.56071981, ..., -0.11736569,
        -0.14963555, -0.32836879],
       [-0.35811925,  0.24309223, -0.56071981, ..., -0.01628853,
        -0.01214531, -0.27620527],
       ...,
       [ 1.01971205, -0.16744248,  0.25158053, ..., -0.12465162,
        -0.1875639 , -0.41418621],
       [-0.35811925, -0.16744248, -0.56071981, ..., -0.15939061,
        -0.24919745, -0.47476321],
       [-0.35811925,  0.1966166 , -0.56071981, ..., -0.11410514,
        -0.20178702, -0.31154185]])
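
The scaling cells above fit the scaler on the training data only and then reuse it for the test data. The sketch below shows what that means numerically on a tiny made-up matrix: each column has the training mean subtracted and is divided by the training standard deviation. The arrays `train_demo` and `test_demo` are invented for illustration and are not part of the spam dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrices standing in for X_train and X_test
train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test_demo = np.array([[4.0, 30.0]])

# Fit on the training data only, then transform both sets
demo_scaler = StandardScaler().fit(train_demo)
train_scaled = demo_scaler.transform(train_demo)
test_scaled = demo_scaler.transform(test_demo)

# Same calculation by hand: subtract the training mean, divide by the training std
manual_test = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(np.allclose(test_scaled, manual_test))
```

Fitting on the training data alone keeps information about the test set from leaking into the preprocessing step.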

## Create and Fit a Logistic Regression Model

Create a Logistic Regression model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

In [9]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression

logistic_regression_model = LogisticRegression()

# Fit (train) the model with the scaled data
logistic_regression_model.fit(X_train_scaled, y_train)

# Score the model
print(f"Training Data Score: {logistic_regression_model.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {logistic_regression_model.score(X_test_scaled, y_test)}")


Training Data Score: 0.9298550724637681
Testing Data Score: 0.9287576020851434


In [10]:
# Make and save predictions with the fitted logistic regression model

# First, generate predictions for the training data from the model we just fit
predictions = logistic_regression_model.predict(X_train_scaled)

# Review the training predictions
# Convert those predictions (and actual values) to a DataFrame
# (not displayed: a cell only shows its last expression)
results_data = pd.DataFrame({"Prediction": predictions, "Actual": y_train})
results_data

# Test the model on new data, i.e. the testing dataset

# Apply the fitted model to the test dataset
testing_predictions = logistic_regression_model.predict(X_test_scaled)

# Save both the test predictions and actual test values to a DataFrame
results_data = pd.DataFrame({
    "Testing Data Predictions": testing_predictions,
    "Testing Data Actual Targets": y_test})
results_data

Unnamed: 0,Testing Data Predictions,Testing Data Actual Targets
1351,0,1
1687,0,1
1297,1,1
2101,0,0
3920,0,0
...,...,...
1089,1,1
929,1,1
1545,0,1
4356,0,0


In [11]:
# Calculate the accuracy score by evaluating `y_test` vs. `testing_predictions`.

# Import the accuracy_score function (already imported in the first cell; repeated here for clarity)
from sklearn.metrics import accuracy_score

# Calculate the model's accuracy on the test dataset
accuracy_score(y_test, testing_predictions)



0.9287576020851434

The accuracy score using the logistic regression model is 0.9287576020851434 (approximately 92.9%).
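
The test score printed by `score` and the value returned by `accuracy_score` above are identical, because for a classifier both report the fraction of correct predictions. The sketch below demonstrates this on synthetic data from `make_classification`, which is an assumption used as a stand-in and not the spam dataset. It also prints a confusion matrix, an addition that is not in the original notebook, to show how the errors split between missed spam and wrongly flagged mail.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the spam dataset)
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
preds = clf.predict(X_te)

# score() and accuracy_score() report the same fraction of correct predictions
score_value = clf.score(X_te, y_te)
accuracy_value = accuracy_score(y_te, preds)
print(score_value == accuracy_value)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_te, preds)
print(cm)
```

For a spam filter the two kinds of error are not equally costly, so looking at the off-diagonal cells can be more informative than accuracy alone.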

## Create and Fit a Random Forest Classifier Model

Create a Random Forest Classifier model, fit it to the training data, make predictions with the testing data, and print the model's accuracy score. You may choose any starting settings you like. 

The model below uses the default `n_estimators` (100 in current scikit-learn releases) and sets `random_state` so that the results are reproducible.



In [12]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

# Create the random forest classifier instance
rf_model = RandomForestClassifier(random_state=1)


# Fit the model
rf_model = rf_model.fit(X_train_scaled, y_train)


In [13]:
# Make and save testing predictions with the fitted random forest model using the test data
testing_predictions_rf = rf_model.predict(X_test_scaled)

# Review the predictions
# Save both the test predictions and actual test values to a DataFrame
results_data = pd.DataFrame({
    "Testing Data Predictions": testing_predictions_rf,
    "Testing Data Actual Targets": y_test})
results_data


Unnamed: 0,Testing Data Predictions,Testing Data Actual Targets
1351,1,1
1687,1,1
1297,1,1
2101,0,0
3920,0,0
...,...,...
1089,1,1
929,1,1
1545,1,1
4356,0,0


In [14]:
# Calculate the accuracy score by evaluating `y_test` vs. `testing_predictions_rf`.
acc_score = accuracy_score(y_test, testing_predictions_rf)

# Display results
print(f"Random Forest Accuracy Score : {acc_score}")


Random Forest Accuracy Score : 0.9669852302345786


The accuracy score using the random forest model (default `n_estimators`) is 0.9669852302345786 (approximately 96.7%).

## Additional Step 1: Using a Specific n_estimators
Using `n_estimators = 50`.

`n_estimators` is the number of trees the forest builds before combining their predictions by voting or averaging. A higher number of trees generally gives more stable performance but makes training and prediction slower, so a smaller value of 50 is tried here for comparison with the default.

In [15]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

# Create the random forest classifier instance
rf_model_estimator = RandomForestClassifier(n_estimators=50, random_state=1)


# Fit the model
rf_model_estimator = rf_model_estimator.fit(X_train_scaled, y_train)

# Make predictions using the testing data
predictions_estimators = rf_model_estimator.predict(X_test_scaled)


# Calculate the accuracy score
acc_score_estimators = accuracy_score(y_test, predictions_estimators)

# Display results
print(f"Random Forest Accuracy Score with specific n_estimators: {acc_score_estimators}")


Random Forest Accuracy Score with specific n_estimators: 0.9661164205039097
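
The comparison above tries a single alternative value. A rough way to see how the number of trees affects accuracy is to loop over several values, as in the following sketch. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the spam dataset), and the candidate values 10, 50, and 100 are chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the spam dataset)
X_syn, y_syn = make_classification(n_samples=600, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1)

# Fit one forest per candidate value and record its test accuracy
scores_by_trees = {}
for n_trees in (10, 50, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=1)
    forest.fit(X_tr, y_tr)
    scores_by_trees[n_trees] = accuracy_score(y_te, forest.predict(X_te))

for n_trees, score in scores_by_trees.items():
    print(f"n_estimators={n_trees}: test accuracy {score:.4f}")
```

Differences between settings are often small, as with the 50-tree and default results above, so a single test split may not be enough to call one value better than another.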


## Additional Step 2: Evaluate Feature Importance

In [16]:
# Get the feature importance array
importances = rf_model.feature_importances_
# List the top 10 most important features
# (names come from X.columns, which has the same column order as the scaled training array)
importances_sorted = sorted(zip(importances, X.columns), reverse=True)
importances_sorted[:10]

[(0.12644707434188376, 'char_freq_$'),
 (0.10096476186877248, 'char_freq_!'),
 (0.07354329262147186, 'word_freq_remove'),
 (0.07030391899788394, 'capital_run_length_average'),
 (0.06045352928187928, 'word_freq_free'),
 (0.055369113067488825, 'capital_run_length_longest'),
 (0.050466131232231405, 'word_freq_your'),
 (0.04153197402166058, 'word_freq_hp'),
 (0.040767900846828665, 'capital_run_length_total'),
 (0.0332986529978853, 'word_freq_money')]
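
An alternative way to pair importances with feature names is a pandas Series, which keeps the labels attached while sorting. The sketch below uses synthetic data and made-up column names (`feature_0` through `feature_5`), both of which are assumptions for illustration only. It also checks that the importances sum to 1, which is how scikit-learn normalizes them.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names (NOT the spam dataset)
X_arr, y_syn = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=1
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

forest = RandomForestClassifier(random_state=1).fit(X_demo, y_syn)

# Attach the column names to the importances and sort from most to least important
importance_series = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importance_series.head(3))
print(importance_series.sum())  # importances are normalized to sum to 1
```

Because the values are relative shares, an importance of about 0.126 for `char_freq_$` in the output above means that feature accounts for roughly one eighth of the total impurity reduction across the forest.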

## Evaluate the Models

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the following markdown cell.

*Replace the text in this markdown cell with your answers to these questions.*

-- Which model performed better?

Whether it uses the default `n_estimators` or `n_estimators=50`, the Random Forest Classifier produces better test accuracy (about 96.7% and 96.6%) than Logistic Regression (about 92.9%).

-- How does that compare to your prediction?

Based on the accuracy score

-- Reason/ Conclusion/ Explanation:

The Random Forest Classifier likely performed better than Logistic Regression on this binary classification task (spam / not spam) because:

1. It weighs certain features as more important than others when choosing how to split the data (Ref: Additional Step 2).

2. It does not assume a linear relationship between the features and the outcome, unlike Logistic Regression, which fits a linear combination of the features and passes it through a sigmoid to produce a probability between 0 and 1.

3. It uses ensemble learning. A single decision tree on its own would not be an ensemble. A random forest takes random samples of the data, builds many decision trees, and combines their predictions to get a more stable model (Ref: Additional Step 1, which uses a different `n_estimators`).

Thus, Random Forest is more accurate than Logistic Regression in this case.
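
The ensemble idea in point 3 can be checked directly in scikit-learn: the forest's predicted class probabilities are the mean of the probabilities predicted by its individual trees. The following sketch verifies this on synthetic data from `make_classification`, which is a stand-in assumption and not the spam dataset; the small forest of 25 trees is also chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the spam dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=1)
forest = RandomForestClassifier(n_estimators=25, random_state=1).fit(X_syn, y_syn)

# Class probabilities from each individual tree, stacked into one array
per_tree_proba = np.stack([tree.predict_proba(X_syn) for tree in forest.estimators_])

# Averaging across trees reproduces the forest's own probabilities
averaged_proba = per_tree_proba.mean(axis=0)
print(np.allclose(averaged_proba, forest.predict_proba(X_syn)))

# The forest predicts the class with the highest averaged probability
manual_pred = forest.classes_[averaged_proba.argmax(axis=1)]
print((manual_pred == forest.predict(X_syn)).mean())
```

Each tree sees a different bootstrap sample, so individual trees make different mistakes, and averaging tends to cancel some of them out.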

Thanks!
