## Binary Classification

In this tutorial we explore how to perform binary classification with Pandas and scikit-learn, comparing four classifiers on the same dataset.

### General Imports

Import the Pandas library to load and manipulate tabular datasets.

In [1]:
import pandas as pd

### Loading the Dataset

Load the dataset from the local file `SAheart.csv`, which contains a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. A description of the data is available here: https://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.info.txt

The result of loading the data using the Pandas `read_csv` function is a Pandas `DataFrame` named `heart`.

In [2]:
heart = pd.read_csv('SAheart.csv', sep=',', header=0)
heart.head()

Unnamed: 0,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
0,160,12.0,5.73,23.11,1,49,25.3,97.2,52,1
1,144,0.01,4.41,28.61,0,55,28.87,2.06,63,1
2,118,0.08,3.48,32.28,1,52,29.14,3.81,46,0
3,170,7.5,6.41,38.03,1,51,31.99,24.26,58,1
4,134,13.6,3.5,27.78,1,60,25.99,57.34,49,1


Separate the columns of the `heart` Pandas `DataFrame` into:
- `X`: the feature variables, columns 0 through 8 (`:9`; the stop index is exclusive) (sbp, tobacco, ldl, adiposity, famhist, typea, obesity, alcohol, age)
- `y`: the response variable, column 9 (`chd`)

In [3]:
X = heart.iloc[:,:9]
y = heart.iloc[:,9]
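
The positional split above relies on `iloc` slices excluding their stop index. A minimal sketch on a toy frame (the name `toy` and its reduced set of columns are illustrative stand-ins, not the real `heart` data) shows the behaviour and a label-based alternative:

```python
import pandas as pd

# Toy stand-in for `heart`: three feature columns followed by the response column.
toy = pd.DataFrame({
    "sbp": [160, 144, 118],
    "tobacco": [12.0, 0.01, 0.08],
    "age": [52, 63, 46],
    "chd": [1, 1, 0],
})

# Positional split: `:3` selects columns 0, 1 and 2 (the stop index is excluded).
X_toy = toy.iloc[:, :3]
y_toy = toy.iloc[:, 3]

# Label-based split that gives the same result and does not depend on column order.
X_by_name = toy.drop(columns="chd")
y_by_name = toy["chd"]

print(list(X_toy.columns))
print(X_toy.equals(X_by_name), y_toy.equals(y_by_name))
```

Selecting by name is a design choice rather than a requirement; it simply avoids silent mistakes if a column is later added or moved.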

Print the rows that we will use to *quickly* check the predictions of each classifier (`460:`, i.e., rows 460 through the end of the DataFrame).

In [4]:
heart.iloc[460:,:]

Unnamed: 0,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
460,118,5.4,11.61,30.79,0,64,27.35,23.97,40,0
461,132,0.0,4.82,33.41,1,62,14.7,0.0,46,1


### Logistic Regression

Import `LogisticRegression` from the scikit-learn library to perform logistic regression.
More details can be found in the scikit-learn documentation: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [5]:
from sklearn.linear_model import LogisticRegression

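Create a `LogisticRegression` object that holds the configuration for the logistic regression.

The `random_state` parameter fixes the seed for any randomness in the solver so that results are reproducible. With the `lbfgs` solver used here the fit is already deterministic, so the parameter is included mainly for consistency with the later classifiers. `max_iter=200` raises the iteration limit above the default of 100 to give the solver more room to converge.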

In [6]:
LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr', max_iter=200)

Train the logistic regression model on the features `X` and the response `y`.

In [7]:
LR.fit(X, y)

LogisticRegression(max_iter=200, multi_class='ovr', random_state=0)

Quickly test the model predictions for the last two rows in the dataset (rows 460 and 461). Compared with the `chd` values printed earlier (0 and 1), the first prediction is incorrect while the second is correct.

In [8]:
LR.predict(X.iloc[460:,:])

array([1, 1])
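
The check against the true labels can also be done in code rather than by eye. This is a minimal sketch that hard-codes the two predictions shown above and the `chd` values of rows 460 and 461; the variable names are illustrative:

```python
import numpy as np

predicted = np.array([1, 1])  # predictions for rows 460 and 461
actual = np.array([0, 1])     # chd values of rows 460 and 461

# Element-wise comparison: True where the prediction matches the label.
matches = predicted == actual
print(matches)

# The mean of a boolean array is the fraction of correct predictions.
print(matches.mean())
```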

Now evaluate the model with its `score` method, which returns the mean accuracy (a value between 0 and 1), computed here on the whole *training* dataset.
Normally, separate datasets are used for training and for evaluation, because accuracy measured on the training data tends to be optimistic. Here we keep it simple and use the same dataset for both.

In [9]:
round(LR.score(X,y), 4)

0.7338
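
For comparison, here is a minimal sketch of the more usual approach, where part of the data is held out for evaluation. It assumes a synthetic dataset from `make_classification` as a stand-in for the heart data (same number of rows and features, unrelated values), so the printed accuracies say nothing about the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 462 rows and 9 features, like the heart data.
X_demo, y_demo = make_classification(n_samples=462, n_features=9, random_state=0)

# Hold out 25% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=200, random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("training accuracy:", round(train_acc, 4))
print("held-out accuracy:", round(test_acc, 4))
```

The held-out accuracy is usually the more honest estimate of how the model would behave on new patients.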

### Support Vector Machine

Following the general structure of the previous section, we now evaluate an SVM-based classifier.
More details here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

We import the classifier class, create the SVM classifier object `SVM` to configure the SVM method (here with only `random_state` set), fit the data to create the model, and show the quick prediction.

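*NOTE: Fitting this model produces a warning (most likely a convergence warning from the underlying solver); disregard it for this quick comparison.*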

It seems that the first prediction is incorrect while the second is correct here.

In [10]:
from sklearn import svm

SVM = svm.LinearSVC(random_state=0)
SVM.fit(X, y)
SVM.predict(X.iloc[460:,:])



array([1, 1])

Now evaluate the model performance using `score` (mean accuracy, between 0 and 1) on the whole of the *training* dataset.

In [11]:
round(SVM.score(X,y), 4)

0.5411
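
One plausible reason for a low score like this is that linear SVM solvers can struggle when features are on very different scales. This is an assumption, not something verified on the heart data. The sketch below uses synthetic data with artificially stretched columns to compare a plain `LinearSVC` with one that standardizes its inputs first:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in (assumption), not the heart data.
X_demo, y_demo = make_classification(n_samples=462, n_features=9, random_state=0)
# Stretch each column by a different factor to mimic unscaled measurements.
X_demo = X_demo * np.array([100, 10, 5, 30, 1, 50, 30, 100, 50])

raw_svm = LinearSVC(random_state=0)
scaled_svm = make_pipeline(StandardScaler(), LinearSVC(random_state=0))

raw_svm.fit(X_demo, y_demo)     # may emit a convergence warning
scaled_svm.fit(X_demo, y_demo)  # standardizes features before fitting

raw_acc = raw_svm.score(X_demo, y_demo)
scaled_acc = scaled_svm.score(X_demo, y_demo)
print("unscaled:", round(raw_acc, 4))
print("scaled:  ", round(scaled_acc, 4))
```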

### Random Forest

Following the general structure of the previous section, we now evaluate a Random Forest-based classifier.
More details here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier


We import the classifier class, create the Random Forest classifier object `RF` to configure the Random Forest method, fit the data to create the model, and show the quick prediction.

The prediction is correct for the first row but incorrect for the second here.

In [12]:
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
RF.fit(X, y)
RF.predict(X.iloc[460:,:])

array([0, 0])

Now evaluate the model performance using `score` (mean accuracy, between 0 and 1) on the whole of the *training* dataset.

In [13]:
round(RF.score(X,y), 4)

0.7338

*What would happen if a larger `max_depth` value were used?*
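
One way to explore this question is to fit forests with several `max_depth` values and compare accuracy on training and held-out data. The sketch below assumes a synthetic dataset as a stand-in for the heart data, so the exact numbers will differ. Deeper trees can typically fit the training data more closely, which does not guarantee better accuracy on unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption), not the heart data.
X_demo, y_demo = make_classification(n_samples=462, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

depth_scores = {}
# None removes the depth limit, so trees grow until their leaves are pure
# (with the other parameters left at their defaults).
for depth in (2, 8, None):
    forest = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    forest.fit(X_train, y_train)
    depth_scores[depth] = (forest.score(X_train, y_train), forest.score(X_test, y_test))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(f"max_depth={depth}: train={train_acc:.4f}, held-out={test_acc:.4f}")
```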

### Neural Network / Multi-layer Perceptron

Following the general structure of the previous section, we now evaluate a Neural Network-based classifier, specifically a Multi-layer Perceptron.
More details here: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

We import the classifier class, create the Neural Network classifier `NN` object to configure the MLP method, fit the data to create the model, and show the quick prediction.

The prediction is correct for the first row but incorrect for the second here.

In [14]:
from sklearn.neural_network import MLPClassifier

NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=0)
NN.fit(X, y)
NN.predict(X.iloc[460:,:])

array([0, 0])

Now evaluate the model performance using `score` (mean accuracy, between 0 and 1) on the whole of the *training* dataset.

In [15]:
round(NN.score(X,y), 4)

0.658

### Summary

The methods evaluated here differ in performance and in their advantages and disadvantages, but with Pandas and scikit-learn they are all used and evaluated in the same way, following one pattern of code:
1. Import classifier
2. Create a classifier object with optional configuration parameters
3. Train the model on some data
4. Evaluate the model on some data
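
Because every classifier follows this pattern, the four steps can be written once and applied in a loop. The sketch below assumes a synthetic dataset as a stand-in for the heart data and reuses the configurations from this tutorial (without the `multi_class` argument); the dictionary names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in (assumption), not the heart data.
X_demo, y_demo = make_classification(n_samples=462, n_features=9, random_state=0)

# Steps 1 and 2: import each classifier and create a configured object.
classifiers = {
    "Logistic Regression": LogisticRegression(random_state=0, max_iter=200),
    "Linear SVM": LinearSVC(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0),
    "MLP": MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_demo, y_demo)                            # Step 3: train
    scores[name] = round(clf.score(X_demo, y_demo), 4)  # Step 4: evaluate
    print(f"{name}: {scores[name]}")
```

As in the rest of the tutorial, the scores here are computed on the training data and so are likely to be optimistic.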