# Classification

Classification is one of the most frequent applications of machine learning. In this workshop, your two goals are to work through the entire machine learning pipeline (though the data is preprocessed) and compare different supervised methods. There will be little driver code for this workshop, as comfort with implementing the steps of the machine learning pipeline is crucial within all domains. 

### Processing the Data

In [47]:
# import libraries - look at the last workshop if you don't remember the 3 most essential libraries

In [2]:
import sklearn 
import numpy as np
import pandas as pd
import numpy.random as random
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

In [35]:
# load the file "spambase.csv" as a dataframe, and display the first 5 rows to examine the data

In [3]:
infile = pd.read_csv('spambase.csv')
infile.head(5)

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,is_spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


Each row of the spambase dataset is an email. The features are obtained by taking the frequency (which is normalized on email length) of certain words. Intuitively, emails containing words like "free" are likely to be spam emails, and using this feature representation we capture some of the important elements of the email's content. The label for the dataset is the binary "is_spam" column, which is 1 for spam emails and 0 for real emails.

In [37]:
# split the dataframe into features and label 

In [4]:
x = infile.is_spam
y = infile.drop(labels = 'is_spam', axis = 1)

In [39]:
# split the (features, labels) set into training and testing sets

In [5]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=42)

We will now fit 3 different ML models to see which one does the best. 

### Model 1: Logistic Regression

In [41]:
# import logistic regression function from sklearn and create model

In [7]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

In [43]:
# fit model on training set and predict on test features

In [22]:
model.fit(y,x)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

To evaluate classification models, 2 common methods are zero-one-loss and confusion matrix. Zero-one-loss is the most basic and intuitive way to evaluate classification - this error is just the fraction of incorrectly predicted labels, so a zero-one-loss of 0 is optimal.

One issue with zero-one-loss is that it doesn't distinguish between false positives and false negatives. The confusion matrix provides more information by showing how many times each label was predicted for elements of each true label. Documentation is given on the sklearn reference. 

In [45]:
# display zero-one-loss and confusion matrix (importing from sklearn)

In [37]:
from sklearn.metrics import confusion_matrix, zero_one_loss

y_pred = model.predict(y_train)
zero_one_loss(x_train, y_pred)
confusion_matrix(x_train, y_pred)

0.067391304347826142

### Model 2: SVM

In [None]:
# import SVM function from sklearn and create model

In [38]:
from sklearn import svm

In [None]:
# fit model on training set and predict on test features

In [46]:
C = 1.0

svm.SVC(kernel='linear', C=C).fit(y,x)

In [None]:
# display confusion matrix and zero-one-loss

In [45]:
y_pred = svc.predict(y_train)
zero_one_loss(x_train, y_pred)
confusion_matrix(x_train, y_pred)

array([[541,  20],
       [ 40, 319]])

### Model 3: Decision Tree

In [None]:
# import decision tree function from sklearn and create model

In [47]:
from sklearn import tree

In [None]:
# fit model on training set and predict on test features

In [49]:
clf = tree.DecisionTreeClassifier()
clf.fit(y, x)

In [None]:
# display confusion matrix and zero-one-loss

In [50]:
y_pred = clf.predict(y_train)
zero_one_loss(x_train, y_pred)
confusion_matrix(x_train, y_pred)

array([[561,   0],
       [  0, 359]])

### Exploration

You've seen how some of the most common vanilla classification methods have performed on the data. However, there are countless ways to build upon these foundational methods to create more powerful models. Your challenge is to read up on some advanced classification methods to get the highest possible predictive accuracy on the test set. 

Some methods that build on top of decision trees and SVMs are different kernel functions for SVMs, gradient boosted trees, and random forests. Other classification methods that we haven't discussed are neural networks and Bayesian models. All of these methods are provided in Scikit-learn and documented [here](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning). Read up on and try out the models that seem interesting or suitable for this problem, and try to get the highest accuracy out of all the teams.

In [None]:
# repeat steps outlined above on different models to get the highest possible accuracy

In [52]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(y, x)
y_pred = knn.predict(y_train)
zero_one_loss(x_train, y_pred)
confusion_matrix(x_train, y_pred)

array([[507,  54],
       [ 68, 291]])

### It's Really That Easy

By now, you have seen some of the most commonly used machine learning techniques for regression and classification. You've also applied them as part of the machine learning pipeline to fit models, make predictions, and evaluate results. Evaluation of accuracy is crucial, as it allows you to determine which model is best for a given problem.

After you have a fitted model, you can use it to predict on new data, where you have the features and want to know the labels. For example, now that we've fitted a model that takes in features (word frequencies) and outputs whether it believes the email to be spam, we can feed in new emails (after preprocessing the emails to create the word frequency features) and get predictions on whether these unknown emails are spam.  

The big takeaway is that applying machine learning methods can really be this easy. Knowing the theory behind each model helps you understand which models work in which situations, and the implementation for most basic machine learning methods are provided in sklearn (or other libraries in different languages). To apply these methods and draw interesting conclusions from data, all you need to do is recreate the machine learning pipeline like we did today.

Of course, there are many techniques such as feature extraction, ensembling, and k-fold cross validation for parameter tuning that we will not have time to get to in these workshops. These methods are very important, and any machine learning book will go much further in depth concerning advanced techniques. As for the workshops, we will be moving on from supervised learning to unsupervised learning, time series analysis, and deep learning in future lectures.