## Binary and Multi-Class Classification

In this tutorial we will explore how to perform binary and multi-class classification using Python, Pandas and Scikit-Learn on two sample datasets.

### General Imports

Import the Pandas library to load and manipulate tabular datasets.

In [None]:
import pandas as pd

## Binary Classification
Load the dataset from the locally available file `SAheart.csv`, which contains a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. A description of the data is available here: https://web.stanford.edu/~hastie/ElemStatLearn/datasets/SAheart.info.txt

The result of loading the data using the Pandas `read_csv` function is a Pandas `DataFrame` named `heart`.

In [None]:
heart = pd.read_csv('SAheart.csv', sep=',', header=0)
heart.head()

Separate the columns of the `heart` Pandas `DataFrame` into:
- `X`: the feature variables, columns 0 to 8 (`:9`; the end position is excluded) (sbp, tobacco, ldl, adiposity, famhist, typea, obesity, alcohol, age)
- `y`: the response variable, column `9` (chd)

In [None]:
X = heart.iloc[:,:9]
y = heart.iloc[:,9]
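The positional slicing used above can be checked on a tiny table. The following is a minimal sketch with a hypothetical toy DataFrame (the column names `f1`, `f2` and `label` are made up for illustration and are not part of the SAheart data); it shows that the end position of an `iloc` slice is excluded.

```python
import pandas as pd

# Toy DataFrame with hypothetical column names (not the SAheart data).
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0],
    "f2": [0.5, 0.1, 0.9],
    "label": [0, 1, 0],
})

# `:2` selects the columns at positions 0 and 1; the end position is excluded.
X_toy = toy.iloc[:, :2]
# A single integer position returns that one column as a Series.
y_toy = toy.iloc[:, 2]

print(list(X_toy.columns))  # ['f1', 'f2']
print(y_toy.name)           # label
```

In the same way, `heart.iloc[:,:9]` selects positions 0 through 8 and leaves position 9 for the response.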

Print the rows that we will use to *quickly* check the prediction of the classification methods (`460:`, i.e., rows from 460 to the end of the DataFrame).

In [None]:
heart.iloc[460:,:]

### Logistic Regression

Import `LogisticRegression` from the Scikit-Learn library to perform logistic regression.
More details can be found in the Scikit-Learn documentation: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [None]:
from sklearn.linear_model import LogisticRegression

Create a `LogisticRegression` object to configure how to perform the logistic regression.

The `random_state` parameter makes sure we get exactly the same results every time.

In [None]:
LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr', max_iter=200)

Use the data in `X` (features) and `y` (response) to train the logistic regression model.

In [None]:
LR.fit(X, y)

Quickly test the model predictions for the last two rows in the dataset (rows 460 and 461). It seems that the first prediction is incorrect while the second is correct.

In [None]:
LR.predict(X.iloc[460:,:])

Now evaluate the model performance by computing the `score` (the mean accuracy, between 0 and 1) on the whole of the *training* dataset.
Normally, separate datasets are used for training and for evaluating a model, but here we keep it simple and use the same dataset for both.

In [None]:
round(LR.score(X,y), 4)
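For comparison, a held-out evaluation could look roughly like the sketch below. It uses synthetic data from `make_classification` as a stand-in (an assumption; it is not the SAheart dataset), and the names `X_demo`, `X_fit`, `X_eval` and `model` are chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the SAheart dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)

# Hold out 25% of the rows for evaluation, keeping class proportions similar.
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=200).fit(X_fit, y_fit)

train_acc = model.score(X_fit, y_fit)   # accuracy on rows the model has seen
eval_acc = model.score(X_eval, y_eval)  # accuracy on held-out rows
print(round(train_acc, 4), round(eval_acc, 4))
```

The accuracy on the held-out rows is usually the more honest estimate of how the model behaves on new data.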

### Support Vector Machine

Following the general structure of the previous section, we now evaluate an SVM-based classifier.
More details here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

We import the classifier class, create the SVM classifier `SVM` object to configure the SVM method (here with only `random_state` set), fit the data to create the model, and show the quick prediction.

*NOTE: Disregard the warning produced.*

It seems that the first prediction is incorrect while the second is correct here.

In [None]:
from sklearn import svm

SVM = svm.LinearSVC(random_state=0)
SVM.fit(X, y)
SVM.predict(X.iloc[460:,:])

Now evaluate the model performance by computing the `score` (the mean accuracy, between 0 and 1) on the whole of the *training* dataset.

In [None]:
round(SVM.score(X,y), 4)
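If the warning mentioned above is a convergence warning (an assumption; the text does not say which warning appears), one common remedy is to standardize the features and allow more iterations. The sketch below illustrates that idea on synthetic stand-in data, not on the SAheart file; the names `X_demo` and `scaled_svm` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; not the SAheart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
# Put one feature on a much larger scale, as often happens with real measurements.
X_demo[:, 0] *= 1000.0

# Standardize every feature to zero mean and unit variance before the linear SVM.
scaled_svm = make_pipeline(
    StandardScaler(),
    LinearSVC(random_state=0, max_iter=5000),
)
scaled_svm.fit(X_demo, y_demo)
print(round(scaled_svm.score(X_demo, y_demo), 4))
```

The pipeline applies the same scaling automatically whenever `predict` or `score` is called.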

### Random Forest

Following the general structure of the previous section, we now evaluate a Random Forest-based classifier.
More details here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

We import the classifier class, create the Random Forest classifier `RF` object to configure the Random Forest method, fit the data to create the model, and show the quick prediction.

The prediction is correct for the first row but incorrect for the second here.

In [None]:
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
RF.fit(X, y)
RF.predict(X.iloc[460:,:])

Now evaluate the model performance by computing the `score` (the mean accuracy, between 0 and 1) on the whole of the *training* dataset.

In [None]:
round(RF.score(X,y), 4)

*What would happen if a larger `max_depth` value were used?*
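One way to explore this question is to train forests of different depths and compare the accuracy on the rows used for fitting with the accuracy on held-out rows. The sketch below does this on synthetic stand-in data (an assumption; not the SAheart dataset), with hypothetical names such as `results` and `forest`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy stand-in data (an assumption; not the SAheart dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=9, n_informative=5, flip_y=0.1, random_state=0
)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

results = {}
for depth in (2, 5, None):  # None lets every tree grow without a depth limit
    forest = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    forest.fit(X_fit, y_fit)
    # Store (accuracy on fitted rows, accuracy on held-out rows).
    results[depth] = (forest.score(X_fit, y_fit), forest.score(X_eval, y_eval))
    print(depth, [round(value, 4) for value in results[depth]])
```

Deeper trees can fit the training rows more closely, so a training-set score alone tends to look better as `max_depth` grows; the held-out score shows whether that improvement carries over to unseen data.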

### Neural Network / Multi-layer Perceptron

Following the general structure of the previous section, we now evaluate a Neural Network-based classifier, specifically a Multi-layer Perceptron.
More details here: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

We import the classifier class, create the Neural Network classifier `NN` object to configure the MLP method, fit the data to create the model, and show the quick prediction.

The prediction is correct for the first row but incorrect for the second here.

In [None]:
from sklearn.neural_network import MLPClassifier

NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=0)
NN.fit(X, y)
NN.predict(X.iloc[460:,:])

Now evaluate the model performance by computing the `score` (the mean accuracy, between 0 and 1) on the whole of the *training* dataset.

In [None]:
round(NN.score(X,y), 4)

*What would happen if larger `hidden_layer_sizes` values were used?*

### Summary

The methods evaluated here differ in performance, and each has its own advantages and disadvantages, but with Pandas and Scikit-Learn they are all used and evaluated following the same pattern of code:
1. Import classifier
2. Create a classifier object with optional configuration parameters
3. Train the model on some data
4. Evaluate the model on some data
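Because every classifier exposes the same `fit` and `score` methods, the four steps can be written once and reused in a loop. The following is a minimal sketch on synthetic stand-in data (an assumption; not the SAheart dataset); the dictionary keys and the names `classifiers` and `scores` are hypothetical.

```python
# 1. Import the classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the SAheart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

# 2. Create classifier objects with optional configuration parameters.
classifiers = {
    "logistic_regression": LogisticRegression(max_iter=200),
    "random_forest": RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_demo, y_demo)                   # 3. Train the model on some data.
    scores[name] = clf.score(X_demo, y_demo)  # 4. Evaluate the model on some data.
    print(name, round(scores[name], 4))
```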

## Multi-class Classification
Load the training dataset from the locally available file `vowel.train.csv` and the evaluation (or *testing*) dataset from the `vowel.test.csv` file.
These files contain data for speaker independent recognition of the eleven steady state vowels of British English.
A description of the data is available here: https://web.stanford.edu/~hastie/ElemStatLearn/datasets/vowel.info.txt

Loading the data using the Pandas `read_csv` function produces two Pandas `DataFrame` objects:
1. `vowel_train`: The DataFrame that contains the training subset of the data
2. `vowel_test`: The DataFrame that contains the testing subset of the data

As can be seen, the multi-class classification example uses separate datasets for training and evaluation/testing, as is normally done and in contrast to the binary classification above.

In this case, the data is already separated into two subsets, but it would also be possible to start with one dataset and split it into two subsets for training and evaluation.

In [None]:
vowel_train = pd.read_csv('vowel.train.csv', sep=',', header=0)
vowel_test = pd.read_csv('vowel.test.csv', sep=',', header=0)

vowel_train.head()

Separate the columns of the training Pandas `DataFrame` into:
- `X_tr`: the feature variables, columns 1 to the last column (`1:`) (x.1, x.2, ..., x.10)
- `y_tr`: the response variable, column `0` (y)

In [None]:
X_tr = vowel_train.iloc[:,1:]
y_tr = vowel_train.iloc[:,0]

Separate the columns of the test Pandas `DataFrame` into:
- `X_test`: the feature variables, columns 1 to the last column (`1:`) (x.1, x.2, ..., x.10)
- `y_test`: the response variable, column `0` (y)

In [None]:
X_test = vowel_test.iloc[:,1:]
y_test = vowel_test.iloc[:,0]

### Logistic Regression

Use `LogisticRegression` from the Scikit-Learn library to perform logistic regression.

Here the configuration and training of the model (the `fit` function call) are all done in one step.

We create the Logistic Regression classifier object `LR` to configure the LR method, fit the **training** data to create the model, and predict the values of the **testing** dataset.

Displaying the individual predictions is not as useful here.

In [None]:
LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial', max_iter=300).fit(X_tr, y_tr)
LR.predict(X_test)

Now evaluate the model performance by computing the `score` (the mean accuracy, between 0 and 1) on the whole of the **testing** dataset.

In [None]:
round(LR.score(X_test,y_test), 4)
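When a long list of multi-class predictions is hard to read, a confusion matrix can summarize it by counting, for each true class, how often each class was predicted. The sketch below uses synthetic three-class data as a stand-in (an assumption; not the vowel dataset), and names such as `X_demo`, `model` and `cm` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class stand-in data (an assumption; not the vowel dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=300).fit(X_fit, y_fit)
predicted = model.predict(X_eval)

# Rows are the true classes, columns the predicted classes.
cm = confusion_matrix(y_eval, predicted, labels=[0, 1, 2])
print(cm)

# The diagonal holds the correct predictions, so its share equals the accuracy.
print(round(cm.trace() / cm.sum(), 4))
```

Off-diagonal entries show which pairs of classes the model confuses most often.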

### Support Vector Machine

Following the general structure of the previous section, we now evaluate an SVM-based classifier.
More details here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

The configuration and training of the model (the `fit` function call) are all done in one step.

We create the SVM classifier object `SVM` to configure the SVM method and fit the **training** data to create the model.

Since showing the predictions is not very useful, we only evaluate the model performance by computing the `score` (the mean accuracy, between 0 and 1) on the whole of the **testing** dataset.

In [None]:
SVM = svm.SVC(decision_function_shape="ovo", random_state=0).fit(X_tr, y_tr)
round(SVM.score(X_test, y_test), 4)
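The `decision_function_shape="ovo"` setting controls the shape of the output of `decision_function`: one column per pair of classes, i.e. `n_classes * (n_classes - 1) / 2` columns. The sketch below checks this on synthetic four-class data (an assumption; not the vowel dataset), using the hypothetical names `ovo_svm` and `pair_scores`.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic four-class stand-in data (an assumption; not the vowel dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=4, random_state=0
)

ovo_svm = SVC(decision_function_shape="ovo", random_state=0).fit(X_demo, y_demo)

n_classes = len(ovo_svm.classes_)
# One decision value per pair of classes for each of the first five rows.
pair_scores = ovo_svm.decision_function(X_demo[:5])

print(n_classes)          # 4
print(pair_scores.shape)  # (5, 6), since 4 * 3 / 2 = 6 class pairs
```

With the eleven vowel classes, the same formula gives 11 * 10 / 2 = 55 pairwise columns.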

### Random Forest

We now evaluate a Random Forest-based classifier.
More details here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

The configuration and training of the model (the `fit` function call) are all done in one step.

We create the Random Forest classifier `RF` object to configure the Random Forest method and fit the **training** data to create the model.

Since showing the predictions is not very useful, we only evaluate the model performance by computing the `score` (the mean accuracy, between 0 and 1) on the whole of the **testing** dataset.

In [None]:
RF = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0).fit(X_tr, y_tr)
round(RF.score(X_test, y_test), 4)

### Neural Network / Multi-layer Perceptron

We now evaluate a Neural Network-based classifier, specifically a Multi-layer Perceptron.
More details here: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

The configuration and training of the model (the `fit` function call) are all done in one step.

We create the Neural Network classifier `NN` object to configure the MLP method and fit the **training** data to create the model.

Since showing the predictions is not very useful, we only evaluate the model performance by computing the `score` (the mean accuracy, between 0 and 1) on the whole of the **testing** dataset.

In [None]:
NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(40, 10), random_state=0,max_iter=300).fit(X_tr, y_tr)
round(NN.score(X_test, y_test), 4)

### Summary

Out of the methods evaluated here, the SVM-based classifier seems to perform best, but not by a large margin.