# Scikit Learn Tutorial #5 - Classification Algorithms

<table align="left"><td>
  <a target="_blank"  href="https://colab.research.google.com/github/TannerGilbert/Tutorials/blob/master/Scikit-Learn-Tutorial/5.%20Classification%20Algorithms.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab
  </a>
</td><td>
  <a target="_blank"  href="https://github.com/TannerGilbert/Tutorials/blob/master/Scikit-Learn-Tutorial/5.%20Classification%20Algorithms.ipynb">
    <img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
</td></table>

![Scikit Learn Logo](http://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)

## Loading in Datasets

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The raw file has no header row, so the column names are passed explicitly
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'label'])
# Encode the species names (strings) as the integers 0, 1 and 2
le = LabelEncoder()
iris['label'] = le.fit_transform(iris['label'])
iris.head()

|   | sepal_length | sepal_width | petal_length | petal_width | label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [2]:
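breast_cancer = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', names=['id', 'clump_thickness', 'uniformity_of_cell_size', 'uniformity_of_cell_shape', 'marginal_adhesion', 'single_epithelial_cell_size', 'bare_nuclei', 'bland_chromatin', 'normal_nucleoli', 'mitoses', 'label'])
# The sample id carries no predictive information
breast_cancer.drop(['id'], axis=1, inplace=True)
# Recode the labels: 2 (benign) -> 0, 4 (malignant) -> 1
breast_cancer['label'] = breast_cancer['label'].replace({2: 0, 4: 1})
# Missing values are marked with '?': replace them with an outlier value, then make the column numeric
breast_cancer.replace('?', -999999, inplace=True)
breast_cancer['bare_nuclei'] = pd.to_numeric(breast_cancer['bare_nuclei'])
breast_cancer.head()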

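|   | clump_thickness | uniformity_of_cell_size | uniformity_of_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitoses | label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 0 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 0 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 0 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 0 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 0 |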


In [3]:
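# NOTE: the UCI file has 14 columns (class label, alcohol, malic acid, ..., proline) but only 13 names are given here.
# With index_col=False the last column is dropped, so every feature name below is shifted by one (e.g. 'malic_acid' holds the alcohol values).
wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', names=['label', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols','flavanoids', 'nonflavanoid_phenols' ,'proanthocyanins', 'color_intensity', 'hue', 'OD280/OD315_of_diluted_wines', 'proline'], delimiter=",", index_col=False)
wine.head()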

|   | label | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | OD280/OD315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 |


## Comparing Models

Because the syntax is almost the same for all models, it won't be the focus of this tutorial. Instead, we will focus on the theory behind each model.

### Logistic Regression

In statistics, the logistic model (or logit model) is a statistical model that is usually applied to a binary dependent variable.
In machine learning, logistic regression is the go-to method for binary classification problems (problems with two classes). It passes a linear combination of the features through the sigmoid function, which squeezes the output into the range between 0 and 1 so that it can be read as a class probability.

![Logistic Regression](http://api.ning.com/files/BLRhjJ5GSEnu-TjYW2cexTEbLfMnDWRa40PPL0SrRhIgpFmTjY5n9xFH24K1KQqp4U28glRU-UWum3rr50*b8stW2KedAi02/Capture.PNG)
Sigmoid Function:
![Sigmoid Function](https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Sigmoid-function-2.svg/1200px-Sigmoid-function-2.svg.png)
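
As a minimal sketch of how the sigmoid turns a linear score into a probability, the snippet below uses only the standard library. The weights, bias, and input values are made-up numbers chosen for illustration; they are not fitted to any of the datasets above.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    # Numerically stable form: avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def predict_proba(features, weights, bias):
    """Probability of the positive class for a single sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights and bias, chosen only for illustration.
weights = [0.8, -0.4]
bias = -0.5

p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # 0.5 is the usual decision threshold
print(round(p, 3), label)
```

A real logistic regression model learns the weights and the bias from the training data; only the prediction step is shown here.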

### Naive Bayes

Naive Bayes is based on Bayes' theorem with the "naive" assumption that the features are independent of each other given the class. Despite its simplicity, it often performs well and can sometimes outperform more sophisticated classification methods. It is also very fast to train, which makes it a good choice for quick prototyping.
![Bayes' Theorem](https://wikimedia.org/api/rest_v1/media/math/render/svg/b1078eae6dea894bd826f0b598ff41130ee09c19)
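
To make the independence assumption concrete, here is a small sketch of a categorical Naive Bayes classifier written with the standard library only. The toy weather data is made up for illustration, and the code applies no smoothing, so a feature value never seen with a class gets probability zero for that class.

```python
from collections import Counter, defaultdict

# Toy training data (made up for illustration): (outlook, windy) -> play?
samples = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "yes"),
    (("rainy", "yes"), "no"),
    (("rainy", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
]

class_counts = Counter(label for _, label in samples)
# feature_counts[label][feature_index][value] = number of occurrences
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in samples:
    for i, value in enumerate(features):
        feature_counts[label][i][value] += 1

def posterior(features):
    """Return P(class | features) for every class under the naive assumption."""
    scores = {}
    for label, count in class_counts.items():
        # Prior P(class)
        score = count / len(samples)
        # Multiply by P(feature_i | class), treating the features as independent.
        for i, value in enumerate(features):
            score *= feature_counts[label][i][value] / count
        scores[label] = score
    total = sum(scores.values())  # P(features), the normalising constant
    return {label: score / total for label, score in scores.items()}

probs = posterior(("sunny", "yes"))
print(probs)
```

`GaussianNB`, which is used in the comparison in this tutorial, follows the same idea but models each continuous feature with a normal distribution per class instead of counting categories.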

### Support Vector Machines

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane that categorizes new examples. An SVM is a large-margin classifier: it chooses the hyperplane that maximizes the margin, the distance between the hyperplane and the closest training points of each class. Those closest points are called the support vectors. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/2a/Svm_max_sep_hyperplane_with_margin.png/220px-Svm_max_sep_hyperplane_with_margin.png)
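
The sketch below illustrates the terms hyperplane, margin, and support vector on a toy 2-D dataset. Both the data and the hyperplane parameters `w` and `b` are hand-picked assumptions for illustration. A real SVM would learn `w` and `b` by solving an optimisation problem, which is not shown here.

```python
import math

# Toy 2-D data, made up for illustration: each point has a label of +1 or -1.
points = [
    ((3.0, 1.0), +1), ((4.0, 2.0), +1), ((3.0, 3.0), +1),
    ((1.0, 1.0), -1), ((0.0, 2.0), -1), ((1.0, 3.0), -1),
]

# Hand-picked hyperplane w.x + b = 0 (here the vertical line x1 = 2).
w = (1.0, 0.0)
b = -2.0

def decision_function(x):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x):
    return 1 if decision_function(x) >= 0 else -1

# When the closest points satisfy |w.x + b| = 1, the margin width is 2 / ||w||.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Points lying exactly on the margin are the support vectors.
support_vectors = [x for x, _ in points if math.isclose(abs(decision_function(x)), 1.0)]

print(margin_width)
print(support_vectors)
print(all(predict(x) == y for x, y in points))
```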

### Nearest Neighbors

K Nearest Neighbors (KNN) is a non-parametric method used for classification and regression, and one of the simplest machine learning algorithms. To make a prediction, it looks up the k training examples closest to the query point in the feature space. When used for classification, KNN outputs the class that receives the majority of votes among those neighbors.
![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/220px-KnnClassification.svg.png)
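
The majority vote can be sketched in a few lines of plain Python. The helper `knn_predict` and the toy points are assumptions made for illustration, with Euclidean distance as the distance measure. The snippet needs Python 3.8 or newer for `math.dist`.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over their labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 5.0), (7.0, 7.0), (6.5, 6.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (2.0, 2.0)))  # a
print(knn_predict(train_points, train_labels, (6.0, 6.0)))  # b
```

There is no training step in this sketch: the "model" is simply the stored training data, so all the work happens at prediction time.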

### Decision Tree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. DTs are simple to understand and easy to visualise, and they require very little data preparation.
![](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/CART_tree_titanic_survivors.png/240px-CART_tree_titanic_survivors.png)
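
A single decision rule of the form "feature <= threshold" can be learned by trying candidate thresholds and keeping the one that leaves the purest groups. The sketch below does this for one feature, using Gini impurity as the purity measure. The helper names and the toy values are assumptions for illustration, and a full tree would repeat this search recursively over all features.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger when classes are mixed."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Find the threshold on one feature that minimises the weighted Gini impurity."""
    best_threshold, best_impurity = None, float("inf")
    unique = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

# Toy data, made up for illustration: one feature and a class label.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = [0, 0, 0, 1, 1, 1]

print(best_split(petal_length, species))  # (3.0, 0.0)
```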

### Comparison on our 3 Datasets

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

models = [
    ('LR', LogisticRegression()),
    ('NB', GaussianNB()),
    ('SVM', SVC()),
    ('KNN', KNeighborsClassifier()),
    ('DT', DecisionTreeClassifier()),
]

In [5]:
from sklearn.model_selection import train_test_split
import numpy as np

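for dataset_name, dataset in [('iris',iris), ('breast cancer',breast_cancer), ('wine',wine)]:
    # Features are all columns except the label
    X = np.array(dataset.drop(['label'], axis=1))
    y = np.array(dataset['label'])
    # Hold out 20% of the samples for testing (fixed seed for reproducibility)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    for name, model in models:
        clf = model
        # Calling fit() again discards whatever was learned on the previous dataset
        clf.fit(X_train, y_train)
        # score() returns the mean accuracy on the test set
        accuracy = clf.score(X_test, y_test)
        print(dataset_name, name, accuracy)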

iris LR 1.0
iris NB 1.0
iris SVM 1.0
iris KNN 1.0
iris DT 1.0
breast cancer LR 0.907142857143
breast cancer NB 0.921428571429
breast cancer SVM 0.95
breast cancer KNN 0.985714285714
breast cancer DT 0.95
wine LR 0.888888888889
wine NB 0.944444444444
wine SVM 0.861111111111
wine KNN 0.833333333333
wine DT 0.944444444444
