# Tutorial: Binary Classification with Python and scikit-learn

This notebook shows how to perform [binary classification](http://en.wikipedia.org/wiki/Binary_classification) using Jupyter and the pre-installed Python library [scikit-learn](http://scikit-learn.org/). In particular, it shows how to solve a common machine learning problem: spam classification.  

This notebook will cover the following steps:

1. Load training and test datasets
2. Train the machine learning model
3. Test the machine learning model

## Background 


### Binary Classification with SGD

Classification is the task of assigning an unknown piece of data to one of a set of predefined classes. Binary classification separates the elements of a set into two groups, as opposed to [multi-class](http://en.wikipedia.org/wiki/Multinomial_logit) classification, which uses more than two groups. 

Binary classification assigns each data point to one of two classes, such as SUCCESS/FAILURE, YES/NO, or 1/0. The data must first be [featurized](http://en.wikipedia.org/wiki/Feature_%28machine_learning%29), that is, converted into numeric features, so that a model can decide which class each observation belongs to. [Logistic regression](http://en.wikipedia.org/wiki/Logistic_regression) is a popular method: training finds the coefficients of a linear equation whose output, passed through the logistic function, estimates the probability that an observation belongs to the positive class.
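
To make the logistic regression step concrete, here is a minimal sketch of how a linear score becomes a probability and then a class label. The weights, bias, and feature values are hypothetical and chosen only for illustration; they are not fitted to Spambase.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical weights for two features (illustrative values only).
weights = [1.5, -0.5]
bias = -1.0

# Linear score: 1.5 * 2.0 - 0.5 * 1.0 - 1.0 = 1.5
p = predict_proba(weights, bias, [2.0, 1.0])
label = 1 if p >= 0.5 else 0  # 0.5 is the usual default cutoff
print(round(p, 3), label)
```

A score of zero maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class.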


_**Spambase Dataset**_

This notebook uses the [Spambase data set](https://archive.ics.uci.edu/ml/datasets/Spambase), which was created by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt at HP Labs. It includes 4601 observations corresponding to email messages, 1813 of which are spam. The Spambase data has 57 real-valued explanatory variables that characterize the contents of an email and one binary response variable indicating whether the email is spam. Of the 57 explanatory variables, 48 describe word frequency, 6 describe character frequency, and 3 describe sequences of capital letters.  
The dataset is available [here](https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/). 

Python's most popular machine learning library, [scikit-learn](http://scikit-learn.org/), has classes for classification and logistic regression. We will use [Stochastic Gradient Descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent) (SGD) as implemented in its [SGD module](http://scikit-learn.org/stable/modules/sgd.html). Its main advantage is that it updates the model weights after each training sample rather than after a full pass over the data, which often lets it converge in fewer passes.
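
The per-sample update can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (a fixed learning rate, no regularization, no shuffling between passes), not a description of scikit-learn's actual implementation; `sgd_logistic` and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(samples, labels, learning_rate=0.1, epochs=100):
    """Fit weights and a bias with one gradient step per training sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss with respect to the score
            # Update immediately, before looking at the next sample.
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

# Toy, linearly separable data: negative inputs are class 0, positive are class 1.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]

weights, bias = sgd_logistic(X, y)
predictions = [1 if sigmoid(weights[0] * x[0] + bias) >= 0.5 else 0 for x in X]
print(predictions)
```

Because every sample triggers an update, the weights already move during the first pass over the data instead of waiting for a full-batch gradient.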

## Load the dataset

In [1]:
import pandas as pd
df = pd.read_csv(r'resources\spambase.data', header=None)

In [2]:
print('Total emails: {}'.format(len(df)))

Total emails: 4601


In [3]:
print('Total columns (57 features + 1 label): {}'.format(len(df.columns)))

Total columns (57 features + 1 label): 58


In [4]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


**The data has already been featurized by UCI.** 

## Create training, test, and target vectors

**Split the full dataset 80/20 for training/test**  
Machine learning models are fit on a training set, from which the function describing the linear separating plane is derived. Additionally, a separate test set, which the model does not see during training, is used to assess the performance of the model. Generate the training and test sets from the original dataset.
Please note that the scikit-learn API expects the features and labels in two separate objects.

In [5]:
import random

In [6]:
idx = list(df.index.values)

In [7]:
training_size = int(len(df)*.8)
print(training_size)

3680


In [8]:
#80% training, 20% for testing

training_indices = random.sample(idx, k=training_size)

In [9]:
df_training = df.loc[training_indices] #feature vector 
df_test = df.drop(training_indices) #test set 
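
The split above draws a different random sample on every run. Here is a minimal sketch of a reproducible variant, assuming a fixed seed (the value 42 is arbitrary) and plain integer positions standing in for the DataFrame index:

```python
import random

n_rows = 4601                      # number of emails in Spambase
training_size = int(n_rows * 0.8)  # 3680, as in the cell above

rng = random.Random(42)            # assumption: any fixed seed would do
all_indices = list(range(n_rows))
training_indices = rng.sample(all_indices, k=training_size)

# Everything that was not sampled for training goes to the test set.
training_set = set(training_indices)
test_indices = [i for i in all_indices if i not in training_set]

print(len(training_indices), len(test_indices))
```

In the notebook, `training_indices` could then be passed to `df.loc` and `df.drop` exactly as before. scikit-learn's `train_test_split` (in `sklearn.model_selection`) is another option that also accepts a `random_state`.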

In [10]:
df_training

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
1663,0.00,0.00,0.21,0.0,0.21,0.00,0.00,0.00,0.00,0.0,...,0.000,0.057,0.000,0.000,0.000,0.00,2.807,39,379,1
2851,0.16,0.00,0.00,0.0,0.66,0.00,0.00,0.00,0.00,0.0,...,0.118,0.047,0.023,0.000,0.000,0.00,1.983,19,240,0
1661,0.10,0.10,0.03,0.0,0.07,0.03,0.00,0.03,0.00,0.1,...,0.000,0.071,0.000,0.006,0.065,0.00,2.106,46,3214,1
1065,0.00,1.36,0.00,0.0,0.00,0.00,1.36,0.00,0.00,0.0,...,0.000,0.170,0.000,0.170,0.170,0.17,9.411,128,160,1
2196,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.0,...,0.000,0.000,0.000,0.262,0.000,0.00,1.565,14,36,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2377,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.0,...,0.000,0.000,0.000,0.386,0.000,0.00,1.600,4,16,0
2258,0.76,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.0,...,0.135,0.000,0.000,0.000,0.000,0.00,1.411,5,24,0
3538,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.0,...,0.000,0.000,0.000,0.000,0.000,0.00,1.000,1,6,0
1327,0.00,0.00,0.48,0.0,1.45,0.00,0.00,0.00,0.48,0.0,...,0.000,0.198,0.000,0.594,0.000,0.00,5.683,128,557,1


In [11]:
df_test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
7,0.00,0.0,0.00,0.0,1.88,0.00,0.00,1.88,0.00,0.00,...,0.00,0.206,0.0,0.000,0.000,0.00,2.450,11,49,1
20,0.00,0.0,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.729,0.0,0.729,0.000,0.00,3.833,9,23,1
24,0.00,0.0,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.196,0.0,0.392,0.196,0.00,5.466,22,82,1
33,0.00,0.0,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.000,0.0,0.000,0.302,0.00,1.700,5,17,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4589,0.00,0.0,0.00,0.0,0.54,0.00,0.00,0.00,0.00,0.00,...,0.00,0.000,0.0,0.000,0.000,0.00,1.000,1,22,0
4592,0.00,0.0,1.25,0.0,2.50,0.00,0.00,0.00,0.00,0.00,...,0.00,0.111,0.0,0.000,0.000,0.00,1.285,4,27,0
4593,0.00,0.0,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.000,0.0,1.052,0.000,0.00,1.000,1,6,0
4594,0.00,0.0,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.630,0.0,0.000,0.000,0.00,1.727,5,19,0


***Create target vectors***

In [12]:
df_training_target = df_training[57]  # classification labels for the training set
df_test_target = df_test[57]  # classification labels for the test set

***Remove classification column from training & test vectors***

In [13]:
# Drop the label column by name; the positional axis argument is deprecated in pandas.
df_training = df_training.drop(columns=57)
df_test = df_test.drop(columns=57)


In [14]:
df_training.head()
#training set

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,47,48,49,50,51,52,53,54,55,56
1663,0.0,0.0,0.21,0.0,0.21,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.057,0.0,0.0,0.0,0.0,2.807,39,379
2851,0.16,0.0,0.0,0.0,0.66,0.0,0.0,0.0,0.0,0.0,...,0.0,0.118,0.047,0.023,0.0,0.0,0.0,1.983,19,240
1661,0.1,0.1,0.03,0.0,0.07,0.03,0.0,0.03,0.0,0.1,...,0.0,0.0,0.071,0.0,0.006,0.065,0.0,2.106,46,3214
1065,0.0,1.36,0.0,0.0,0.0,0.0,1.36,0.0,0.0,0.0,...,0.0,0.0,0.17,0.0,0.17,0.17,0.17,9.411,128,160
2196,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.262,0.0,0.0,1.565,14,36


In [15]:
df_test.head()
#test set

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,47,48,49,50,51,52,53,54,55,56
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259
7,0.0,0.0,0.0,0.0,1.88,0.0,0.0,1.88,0.0,0.0,...,0.0,0.0,0.206,0.0,0.0,0.0,0.0,2.45,11,49
20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.729,0.0,0.729,0.0,0.0,3.833,9,23
24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.196,0.0,0.392,0.196,0.0,5.466,22,82
33,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.302,0.0,1.7,5,17


## Fit the model  
The [SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier) class fits a linear model using stochastic gradient descent; with a logistic loss, the model it fits is a logistic regression. 
At each step, SGD moves the weights in the direction that reduces the [loss function](http://en.wikipedia.org/wiki/Loss_function), so the loss converges toward its minimum over successive passes.   
With the logistic loss, `predict` effectively applies a cutoff of 0.5 to the estimated probability: below it the predicted value is 0, above it the predicted value is 1. 

`SGDClassifier` provides functionality to fit linear models for classification using different (convex) loss functions and different penalties. With `loss="log_loss"` (named `"log"` in older scikit-learn releases), `SGDClassifier` fits a logistic regression model, while with `loss="hinge"` it fits a [linear support vector machine](http://en.wikipedia.org/wiki/Support_vector_machine) (SVM). The `penalty` parameter selects the [regularization method](http://scikit-learn.org/stable/auto_examples/svm/plot_svm_scale_c.html), such as L1 or L2. 
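
To see how the two losses differ, here is a minimal sketch of both written as plain functions. It assumes labels encoded as -1/+1, a common convention for these formulas; the function names are illustrative and are not scikit-learn APIs.

```python
import math

def log_loss(y, score):
    """Logistic loss for a label y in {-1, +1} and a raw linear score."""
    return math.log(1.0 + math.exp(-y * score))

def hinge_loss(y, score):
    """Hinge loss (linear SVM) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

# Compare both losses for a positive example at several scores.
for score in (-2.0, 0.0, 0.5, 2.0):
    print(score, round(log_loss(1, score), 3), hinge_loss(1, score))
```

The hinge loss drops to exactly zero once an example is classified correctly with a margin of at least 1, while the logistic loss keeps decreasing but never reaches zero.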

In [16]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss="log_loss", penalty="l2")

In [17]:
clf.fit(df_training, df_training_target)

For classifiers, `score` in `scikit-learn` returns the mean accuracy on the given data and labels (for regressors, it returns the [R-squared](http://en.wikipedia.org/wiki/Coefficient_of_determination) instead). 

In [18]:
clf.score(df_training, df_training_target)

0.6173913043478261

## Run the model on the test set  
At this point, you have a logistic regression model trained to classify featurized emails as spam or not spam. It is important to test the model using a set of emails with known classifications.   
The [quality of a model](http://en.wikipedia.org/wiki/Precision_and_recall) is often measured by accuracy. Accuracy is the percentage of correct predictions made by a model (correct predictions / total predictions). Calculate the accuracy of your model.

In [19]:
prediction = clf.predict(df_test)

In [20]:
print('Model\'s Accuracy is {}'.format(clf.score(df_test, df_test_target)))

Model's Accuracy is 0.6373507057546145
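
Accuracy alone does not show how the errors are distributed between the two classes, which is why the page linked above also discusses precision and recall. Here is a minimal sketch that computes all three from two label lists. The toy lists are hypothetical; in this notebook the inputs would be `df_test_target` and `prediction`.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical labels and predictions for eight emails.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)  # correct predictions / total predictions
precision = tp / (tp + fp)          # share of predicted spam that really is spam
recall = tp / (tp + fn)             # share of actual spam that was caught

print(accuracy, precision, recall)
```

For a spam filter, low precision means legitimate mail is being flagged, while low recall means spam is getting through.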


## Conclusion  
In conclusion, this machine learning exercise consisted of:   
* Selecting a dataset
* Parsing and splitting the data into training and test sets
* Training the model on the training set
* Predicting new values on the test set 
* Comparing the predicted values vs the test values. 

## References  

1. [Hackeling, Gavin. _Mastering Machine Learning with scikit-learn._ PACKT Publishing, 2014](https://www.packtpub.com/big-data-and-business-intelligence/mastering-machine-learning-scikit-learn)    
2. [Machine Learning Repository: Spambase Dataset](https://archive.ics.uci.edu/ml/datasets/Spambase)
3. [scikit-learn SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier)  
4. [Stochastic Gradient Descent on scikit-learn](http://scikit-learn.org/stable/modules/sgd.html)