<a href="https://colab.research.google.com/github/dtran421/3D-SHARKS/blob/main/Machine_Learning_Engineering_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Engineering Demo

In this notebook, we walk through a simple machine learning project and see how it can be applied to software. The project focuses on classifying an email as either legitimate ("ham") or spam.

In [None]:
#@title Import statements for ML-related modules {display-mode: "form"}
import numpy as np
import re
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

## Training the model
In this section, we perform preprocessing on our data and then train the model on this data. Your job here is to run each code cell and understand what is going on.

In [None]:
#@title Import the Enron Dataset to train the machine learning model {display-mode: "form"}
DATA_URL = 'https://raw.githubusercontent.com/dtran421/machine-learning-engineering-demo/main/enron_data.csv'
df = pd.read_csv(DATA_URL, index_col=0)

In [None]:
#@title Initialize constant variables {display-mode: "form"}
# these counts will be used later on for shuffling and subsetting the data
numTotal    = len(df)
numTrain    = int(.8*numTotal)
numTest     = numTotal-numTrain

# maximum number of features that we want our feature vectors to contain
numFeatures = 3000

In [None]:
#@title Initialize lists for labels and docs (email messages) {display-mode: "form"}
#@markdown Labels: 0 = ham (*legitimate*), 1 = spam
labels = df['Label']   # list of labels for each email
docs   = df['Body']    # list of emails

In [None]:
#@title Initialize preprocessor and vectorizer {display-mode: "form"}

# this function will be called on each message to preprocess it
def preprocess(doc):
    # replace all currency signs and some url patterns with special
    # tokens as these are useful features.
    doc = re.sub(r'[£$]', ' __currency__ ', doc)
    doc = re.sub(r'://', ' __url__ ', doc)
    doc = doc.lower()
    return doc


# vectorizer is responsible for converting bodies of text into feature vectors
# these vectors are much easier for a machine learning model to deal with
vectorizer = CountVectorizer(max_features=numFeatures, preprocessor=preprocess)
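
To see what the preprocessor and vectorizer do, here is a minimal sketch on two made-up messages. The names `toy_docs`, `toy_preprocess`, and `toy_vectorizer` are illustrative and kept separate from the notebook's own variables.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# two made-up messages, for illustration only
toy_docs = ["Win $1000 now at http://spam.example", "Meeting at noon tomorrow"]

def toy_preprocess(doc):
    # same idea as preprocess above: mark currency signs and URL separators
    doc = re.sub(r'[£$]', ' __currency__ ', doc)
    doc = re.sub(r'://', ' __url__ ', doc)
    return doc.lower()

toy_vectorizer = CountVectorizer(preprocessor=toy_preprocess)
toy_X = toy_vectorizer.fit_transform(toy_docs)

# each row is one message, each column counts one token from the vocabulary
print(sorted(toy_vectorizer.vocabulary_))
print(toy_X.toarray())
```

The special tokens survive tokenization because underscores count as word characters in the vectorizer's default token pattern.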

In [None]:
#@title Vectorize docs (email messages) {display-mode: "form"}
# now, we actually perform the conversion from text to feature vectors
X = vectorizer.fit_transform(docs)

In [None]:
#@title Create data structures to hold data and labels {display-mode: "form"}

# dense numpy arrays will be easier to work with
X = X.toarray()
m,n = X.shape
y = labels.array

# add a column of ones (bias of the hypothesis function)
# this is kind of like the y-intercept
X = np.column_stack([np.ones(m), X])
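
A small sketch of what the column of ones does, using a made-up 2x2 count matrix and made-up weights: the first weight multiplies the ones column, so it is added to every score regardless of the word counts.

```python
import numpy as np

toy_counts = np.array([[2.0, 0.0],
                       [0.0, 3.0]])       # made-up word counts for two messages
toy_with_bias = np.column_stack([np.ones(len(toy_counts)), toy_counts])

toy_weights = np.array([0.5, 1.0, -1.0])  # hypothetical weights: [bias, word 1, word 2]
toy_scores = toy_with_bias @ toy_weights  # bias + weighted sum of the counts

print(toy_with_bias)
print(toy_scores)
```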

In [None]:
#@title Shuffle datapoints {display-mode: "form"}
# randomize the indices of our data
idx = np.arange(numTotal)
np.random.shuffle(idx)

In [None]:
#@title Split dataset into training and testing sets (email messages) {display-mode: "form"}

# apply the randomized indices to our model input
X = X[idx,:]
y = y[idx]

# split into training and testing sets
# we have hard coded the split between training and testing to be 80:20
train = np.arange(numTrain)
test  = numTrain + np.arange(numTest)

X_train = X[train,:]
y_train = y[train]

X_test  = X[test,:]
y_test  = y[test]
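
The shuffle-then-slice steps above can also be done in one call. Here is a minimal sketch on made-up arrays, assuming scikit-learn's `train_test_split`; `random_state` is set only so that the sketch is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_X = np.arange(20).reshape(10, 2)   # 10 made-up datapoints with 2 features each
toy_y = np.array([0, 1] * 5)           # made-up labels

# test_size=0.2 gives the same 80:20 split used in this notebook
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0)

print(toy_X_train.shape, toy_X_test.shape)
```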

### Model 1: Random Forest Regressor

In this first model, we walk through the Random Forest Regressor, a model that averages the outputs of many decision trees. Because it is a regressor, each prediction is a continuous score between 0 (ham) and 1 (spam) rather than a hard class label.

In [None]:
# create a model and then fit it to our data
model = RandomForestRegressor(max_features=.3, n_estimators=25)
model.fit(X_train, y_train)

# feel free to change or add parameters and see how it affects the
# training time (very bottom left while running a cell) and model accuracy
# as a warning, increasing values too much can greatly increase execution time
# docs: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

RandomForestRegressor(max_features=0.3, n_estimators=25)

In [None]:
# create a prediction for our test data
pred_enron = model.predict(X_test)

# now we check how well our model works: this is 100 minus the mean
# absolute error (in %) between the predicted scores and the true 0/1 labels
res = np.sum(np.abs(pred_enron - y_test))
acc = 100 - (res / numTest * 100)
print(f'Model Accuracy: {acc}%')

Model Accuracy: 91.844%
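
Because a regressor returns continuous scores, the figure above is 100 minus the mean absolute error rather than the share of emails classified correctly. Here is a sketch of the difference on made-up scores, assuming a cutoff of 0.5 to turn a score into a label (the cutoff is an assumption, not something this notebook sets):

```python
import numpy as np

toy_true   = np.array([0, 1, 1, 0])           # made-up labels
toy_scores = np.array([0.1, 0.9, 0.6, 0.4])   # made-up regressor outputs

# the notebook's measure: 100 minus the mean absolute error (in %)
toy_mae_acc = 100 - np.mean(np.abs(toy_scores - toy_true)) * 100

# thresholded accuracy: round each score to a label, then count matches
toy_labels = (toy_scores >= 0.5).astype(int)
toy_cls_acc = np.mean(toy_labels == toy_true) * 100

print(toy_mae_acc, toy_cls_acc)
```

The two numbers agree only when every prediction is exactly 0 or 1, which is the case for a classifier's hard labels.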


### Model 2: Logistic Regression

That accuracy is okay, but maybe the Random Forest Regressor isn't the best choice. Let's try Logistic Regression, a simple linear model that predicts the probability of a binary event occurring. This time, it's your turn to implement the training and testing. You can reference our code above when implementing.
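
As a rough sketch of the idea (not scikit-learn's implementation), logistic regression squashes a weighted sum of the features through the sigmoid function to get a probability between 0 and 1. The weights and feature values below are made up.

```python
import numpy as np

def sigmoid(z):
    # maps any real number to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

toy_x = np.array([1.0, 2.0, 0.0])    # [bias term, count of word 1, count of word 2]
toy_w = np.array([-1.0, 0.5, 2.0])   # hypothetical learned weights

toy_prob = sigmoid(toy_x @ toy_w)    # probability that the email is spam
print(toy_prob)
```

A weighted sum of exactly zero sits on the decision boundary, where both outcomes are equally likely.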

In [None]:
# TODO: create a new model "model" using "LogisticRegression" 
# for params, we recommend using max_iter=500 and solver='liblinear'
# feel free to play around with these params to try to get better accuracy


# TODO: fit it to our data "X_train" and "y_train"


# docs: https://scikit-learn.org/stable/modules/linear_model.html?highlight=logistic+regress#logistic-regression

LogisticRegression(max_iter=500, solver='liblinear')

In [None]:
# TODO: create a prediction "pred_enron" for our test data "X_test"


# now we check to see how well our model works
res = np.sum(np.abs(pred_enron - y_test))
acc = 100 - (res / numTest * 100)
print(f'Model Accuracy: {acc}%')

Model Accuracy: 97.55%


## Try it yourself

In this section, we have left out some code necessary for testing out the model on a new dataset (Ling Dataset) that it has never seen before. It is up to you to fill in a few lines and create new predictions using the model we trained above.

In [None]:
# TODO: read in Ling Dataset CSV from https://raw.githubusercontent.com/dtran421/machine-learning-engineering-demo/main/ling_data.csv


In [None]:
# here, we perform the same preprocessing steps that we did previously
numTotal    = len(df)
numTrain    = int(.8*numTotal)
numTest     = numTotal-numTrain

labels = df['Label']   # list of labels for each email
docs   = df['Body']    # list of emails

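# convert the email content to feature vectors
# use transform (not fit_transform) so the columns match the vocabulary
# that the model was trained on
X = vectorizer.transform(docs)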

# read into arrays
X = X.toarray()
m,n = X.shape
y = labels.array

# add column of ones
X = np.column_stack([np.ones(m), X])
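
When scoring a new dataset with an already-trained model, the feature columns need to line up with the ones seen during training. Here is a minimal sketch, on made-up emails, of how `transform` reuses a fitted vocabulary, whereas `fit_transform` would learn a new one. All `toy_` names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train_docs = ["cheap pills now", "meeting at noon"]     # made-up training emails
toy_new_docs   = ["cheap meeting room", "lunch at noon"]    # made-up unseen emails

toy_vectorizer = CountVectorizer()
toy_X_train = toy_vectorizer.fit_transform(toy_train_docs)  # learns the vocabulary

# transform reuses the learned vocabulary, so the columns line up with training;
# words that were never seen during training (e.g. "room", "lunch") are ignored
toy_X_new = toy_vectorizer.transform(toy_new_docs)

print(toy_X_train.shape, toy_X_new.shape)
```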

In [None]:
# TODO: create a prediction "pred_ling" using the pre-trained model
# the parameter should be "X" this time instead of "X_test"


In [None]:
# compare the predictions to the true labels y
res = np.sum(np.abs(pred_ling - y))
# we predicted on the whole dataset, so divide by numTotal rather than numTest
acc = 100 - (res / numTotal * 100)
print(f'Model Accuracy: {acc}%')

97.0%


## Conclusion

Now you've trained and tested a model that can accurately predict whether an email is "ham" or "spam". Why does this matter and how can it be applied?

Let's consider a service like Gmail, Google's email service. Among its many features is a spam folder that is kept separate from the main inbox. If users had to move spam into this folder by hand, it would be a massive pain. Instead, Google uses a machine learning algorithm similar to the one we created above (though theirs is probably better and more complicated). The algorithm creates a "prediction" for every email the user receives and routes the email to the corresponding folder. This is exactly what Machine Learning Engineering is!
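
As a toy illustration of that routing step (not how Gmail actually works), the sketch below trains a tiny pipeline on four made-up emails and uses a hypothetical helper, `route_email`, to pick a folder from the model's prediction.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# made-up training emails and labels (0 = ham, 1 = spam)
toy_emails = ["win cash prize now", "claim your free prize",
              "lunch meeting tomorrow", "project notes attached"]
toy_labels = [1, 1, 0, 0]

# the pipeline vectorizes the text and then classifies it
toy_filter = make_pipeline(CountVectorizer(), LogisticRegression())
toy_filter.fit(toy_emails, toy_labels)

def route_email(body):
    # hypothetical helper: choose a folder from the model's prediction
    return "spam" if toy_filter.predict([body])[0] == 1 else "inbox"

print(route_email("free cash prize"))
```

Bundling the vectorizer and the model in one pipeline means new emails always go through the same preprocessing as the training data.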