## Spam Filtering
In this programming assignment, we will be looking at Spam Filtering with a real data set that has a "label" for every email - i.e. spam or not spam. We will use logistic regression classifier to solve this assignment and participate in a friendly competition on Kaggle (Details below). The assignment goes from data loading to data inspection to data pre-processing to creating a train/test data set to finally doing machine learning, making predictions and evaluating it. This is typically one part of the "full pipeline" in ML modeling/prototyping - So you will get a sampler taste of some "prototype pipeline" work that happens in practice! Have fun!! And if you get stuck somewhere - Use discord - Maybe someone has a suggestion that will unblock you.

The submission consists of two parts:
a) A submission of your complete working code with train/validation data sets + your write-up with insights and your learnings (details on this provided below)
b) Evaluation of your best model on the Kaggle evaluation data set - For this you can form a team of 2 - To brainstorm ideas and make your best submission. Include your team name, team members in your submission.

Kaggle Starting Point for the competition: https://www.kaggle.com/t/7d2850f5b99a41fba457f2ad7acd0fca

## Loading the data set

In [2]:
# PA 2 - Advanced Introduction to Machine Learning
# Blake Downey
# Jan 22nd, 2022

import pandas as pd
import numpy as np
local_file="data/all_emails.csv"
data_set = pd.read_csv(local_file)#,sep=',',index_col=0,header=None,engine='python',error_bad_lines=False)

## 1) Inspecting the data set

In [3]:
# 1. Print a few lines (i.e. each line is an email and a label) from the data_set containing spam (use a pandas functionality - e.g. getting the top lines)
data_set.head()

Unnamed: 0,id,text,spam
0,1235,Subject: naturally irresistible your corporate...,1
1,1236,Subject: the stock trading gunslinger fanny i...,1
2,1238,Subject: 4 color printing special request add...,1
3,1239,"Subject: do not have money , get software cds ...",1
4,1240,"Subject: great nnews hello , welcome to medzo...",1


In [4]:
# 2. Print a few lines from data_set that are not spam
data_set[data_set['spam'] == 0].head()

Unnamed: 0,id,text,spam
1026,2603,"Subject: hello guys , i ' m "" bugging you "" f...",0
1027,2604,Subject: sacramento weather station fyi - - ...,0
1028,2605,Subject: from the enron india newsdesk - jan 1...,0
1029,2606,Subject: re : powerisk 2001 - your invitation ...,0
1030,2607,Subject: re : resco database and customer capt...,0


In [5]:
# 3. Print the emails between lines 5000 and 5010 in the data set
#The dataset is only 4282 lines, So I printed the emails that have ID 5000 - 5010
#data_set[(data_set['id']>=5000) & (data_set['id']<5011)]

#update: 1/18/22 print lines 4000-4010 
data_set.iloc[4000:4011] # 4000 inclusive, 4011 non inclusive, ie range 4000-4010

Unnamed: 0,id,text,spam
4000,6587,Subject: re : var calibration issues we are p...,0
4001,6588,Subject: re : carnegie mellon team meeting un...,0
4002,6590,"Subject: re : statistica & lunch rick , we a...",0
4003,6592,Subject: re : interview with research dept . c...,0
4004,6593,Subject: re : natural gas storage item vince ...,0
4005,6594,Subject: revised : organizational changes to ...,0
4006,6595,Subject: valuation methodology we ' ve had a ...,0
4007,6596,"Subject: again , i should have also sent the f...",0
4008,6597,Subject: registration materials for nfcf to :...,0
4009,6599,"Subject: latest vince , i appologize for shi...",0


## 2) Data processing step for this HW: 
Do the following process for all emails in your data set - 1) Tokenize into words 2) Remove stop/filler words and 3) Remove punctuations 
Below - We have it done for a sample sentence

## Tokenizer
Apply a tokenizer to tokenize the sentences in your email - So your sentence gets broken down to words. We will use a tokenizer from the NLTK library (Natural Language Tool Kit) below for a single sentence. 

In [6]:
# Example Sentence
from nltk.tokenize import word_tokenize
import nltk
#nltk.download('punkt')
sentence = """Subject: only our software is guaranteed 100 % legal . name - brand software at low , low , low , low prices everything comes to him who hustles while he waits . many would be cowards if they had courage enough ."""
sentence_tokenized = word_tokenize(sentence)
print(sentence_tokenized)

['Subject', ':', 'only', 'our', 'software', 'is', 'guaranteed', '100', '%', 'legal', '.', 'name', '-', 'brand', 'software', 'at', 'low', ',', 'low', ',', 'low', ',', 'low', 'prices', 'everything', 'comes', 'to', 'him', 'who', 'hustles', 'while', 'he', 'waits', '.', 'many', 'would', 'be', 'cowards', 'if', 'they', 'had', 'courage', 'enough', '.']


## Stop Words: Remove Stop Words (or Filler words ) using stop words list

In [7]:
from nltk.corpus import stopwords
filtered_words = [word for word in sentence_tokenized if word not in stopwords.words('english')]

## Punctuations: Remove punctuations and other special characters from tokens

### 3) Exercise: 
Inspect the resulting list below for any of your emails - Does it look clean and ready to be used for the next step in spam detection? Any other pre-processing steps you can think of or may want to do before spam detection? How about including other NLP features like bi-grams and tri-grams?

In [8]:
new_words= [word for word in filtered_words if word.isalnum()]

In [9]:
# Function to clean one email: splits the email into tokens, removes stopwords, removes punctuation,
# and removes the word 'subject' from each email
def clean_email(email):
    email = word_tokenize(email)
    email = [word.lower() for word in email if word.lower() not in stopwords.words('english')]
    email = [word for word in email if word.isalnum()]
    email = [word for word in email if not word == 'subject']
    return ' '.join(email)

# Use the .apply() function of pandas to apply the same function to each entry in df['text'] column
data_set['text'] = data_set['text'].apply(clean_email)

## 4) Train/Validation Split
Now for each email in your data set - You have boiled the email down to its essentials - A list of words that are clean and ready for some Machine Learning! Maybe punctuations matter for spam emails!!? 
If you wish to keep them, you may for your curiosity and see how it impacts metrics (i.e. skip step 3 above). 

What we will do now is split the data set into train and test set - The train set can have 80% of the data (i.e. emails along with their labels) chosen at random - But with good representation from both spam and not-spam email classes. And the same goes for the test set - Which would have the remaining 20% of the data.
Look up python libraries that can do this data split for you automatically?

In [55]:
from sklearn.model_selection import train_test_split #splitting train into train and test sets
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# split dataset into train and validate
X_train, X_val, y_train, y_val = train_test_split(data_set['text'], data_set['spam'], test_size=.15)

#create a vectorizer object with a cap on max_features 
vectorize = TfidfVectorizer(max_features=10000) #ngram_range=[1,4], #testing with n-grams

X_train = vectorize.fit_transform(X_train) # ONLY call .fit_transform() once and on the trainset
X_val = vectorize.transform(X_val) # call .transform() to apply the same transform vectorizer to the validation set

### 5) Train your model and evaluate on Kaggle
Report your train/validation F1-score for your baseline model (starter LR model) and also your best LR model. Also report your insights on what worked and what did not on the Kaggle evaluation. How can your model be improved? Where does your model make mistakes?

In [58]:
# train a vanilla Logistic Regression with the training set
model = LogisticRegression(random_state=0).fit(X_train, y_train)

# make a prediction with the validation set
y_pred_val = model.predict(X_val)
# also get the preds for the training set
y_pred_train = model.predict(X_train)

# print out the classification report
print(classification_report(y_val, y_pred_val))
print(f'acc: {accuracy_score(y_val, y_pred_val) * 100: .2f}%')
print(f'train f1: {f1_score(y_train, y_pred_train): .5f}')
print(f'validation f1: {f1_score(y_val, y_pred_val): .5f}')

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       494
           1       0.99      0.95      0.97       145

    accuracy                           0.99       639
   macro avg       0.99      0.97      0.98       639
weighted avg       0.99      0.99      0.99       639

acc:  98.75%
train f1:  0.99027
validation f1:  0.97183


In [59]:
# implement the SGD Classifier!!!
model_2 = SGDClassifier().fit(X_train, y_train)

y_pred_val = model_2.predict(X_val)
# also get the preds for the training set
y_pred_train = model.predict(X_train)

print(classification_report(y_val, y_pred_val))
print(f'acc: {accuracy_score(y_val, y_pred_val) * 100: .2f}%')
print(f'train f1: {f1_score(y_train, y_pred_train): .5f}')
print(f'validation f1: {f1_score(y_val, y_pred_val): .5f}')

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       494
           1       1.00      0.99      1.00       145

    accuracy                           1.00       639
   macro avg       1.00      1.00      1.00       639
weighted avg       1.00      1.00      1.00       639

acc:  99.84%
train f1:  0.99027
validation f1:  0.99654


In [60]:
model_3 = make_pipeline(StandardScaler(with_mean=False), LogisticRegressionCV(max_iter=800)).fit(X_train, y_train)

# run model 3 pipeline on the validation set to get predictions
y_pred_val = model_3.predict(X_val)
# also get the preds for the training set
y_pred_train = model.predict(X_train)

print(classification_report(y_val, y_pred_val))
print(f'acc: {accuracy_score(y_val, y_pred_val) * 100: .2f}%')
print(f'train f1: {f1_score(y_train, y_pred_train): .5f}')
print(f'validation f1: {f1_score(y_val, y_pred_val): .5f}')

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       494
           1       0.99      0.97      0.98       145

    accuracy                           0.99       639
   macro avg       0.99      0.99      0.99       639
weighted avg       0.99      0.99      0.99       639

acc:  99.22%
train f1:  0.99027
validation f1:  0.98258


In [62]:
# Read the test set into the notebook
df = pd.read_csv("data/eval_students_2.csv")

# apply the same clean email function to each of the emails in the test set
df['text'] = df['text'].apply(clean_email)
# Use the same vectorizer object on the test set. 
X_test_2 = vectorize.transform(df['text'])

# I am not doing anything special for vocab that has not been seen yet by the vectorizer. So new vocab 
# is simply discarded. 

In [63]:
# get the predictions
y_pred_test_2 = model_2.predict(X_test_2)

# print the shape and results 
print(X_test_2.shape)
print(y_pred_test_2)

# remove the 'text' column from the data frame
df.pop('text')
# add a 'spam' column to the data frame that holds the prediction values
df['spam'] = y_pred_test_2

# write the new dataframe to the disk. 
df.to_csv('.\predictions\DowneyBlake_11.csv')

(1468, 10000)
[0 0 1 ... 0 0 0]
