# Comparing Machine Learning Models With Scikit-Learn 

Based on Machine Learning Practices by Kevin Markham

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Reading Datasets
- Dataset: [Iris Dataset](http://archive.ics.uci.edu/ml/datasets/Iris)  

50 samples of each of 3 different species of Iris (150 samples in total)

Measurements:
- Sepal Length
- Sepal Width
- Petal Length
- Petal Width

### Species
- Setosa
- Versicolor
- Virginica

- 150 Observations
- 4 Features: Sepal Length & Width, Petal Length & Width
- Response is the iris species
- CLASSIFICATION problem: Response is categorical
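
The summary above can be checked directly on the object that `load_iris` returns. The following is a minimal sketch: `feature_names` and `target_names` are attributes of the returned Bunch, while the variable name `iris_check` is only illustrative.

```python
from sklearn.datasets import load_iris

iris_check = load_iris()

# Four measurement columns, in the same order as the columns of iris_check.data
print(iris_check.feature_names)

# Three species; the integers 0, 1, 2 in iris_check.target index into this array
print(iris_check.target_names)

# 150 observations (rows) by 4 features (columns)
print(iris_check.data.shape)
```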

In [2]:
# Import the load_iris loader from the sklearn.datasets module

from sklearn.datasets import load_iris

In [3]:
iris = load_iris()

In [4]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [5]:
# load_iris returns a Bunch, scikit-learn's dictionary-like container

type(iris)

sklearn.utils.Bunch

In [6]:
X = iris.data
y = iris.target

In [7]:
X.shape, y.shape

((150, 4), (150,))

In [8]:
type(X), type(y)

(numpy.ndarray, numpy.ndarray)

### Evaluation Procedure 1: Train and test on the entire dataset

### Logistic Regression (Classification Model)

In [10]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X,y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [11]:
logreg

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [None]:
# Predicted class probabilities: one row per observation, one column per species
logreg.predict_proba(X)

In [12]:
logreg.predict(X)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [13]:
y_pred = logreg.predict(X)

In [14]:
len(y_pred)

150

#### Classification Accuracy:
- __Proportion__ of correct predictions
- Common __evaluation metric__ for classification problems
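
To make "proportion of correct predictions" concrete, here is a small sketch that computes it by hand with NumPy. The arrays `y_true` and `y_hat` are made-up values for illustration, not output from the models in this notebook.

```python
import numpy as np

# Hypothetical true labels and predictions for six samples
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 2, 2, 2])

# The element-wise comparison is True wherever the prediction is correct;
# the mean of that boolean array is the proportion of correct predictions
accuracy = np.mean(y_true == y_hat)
print(accuracy)
```

Five of the six hypothetical predictions match, so the result is 5/6.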

In [15]:
from sklearn import metrics

metrics.accuracy_score(y,y_pred)

0.96

In [16]:
logreg.score(X,y)

0.96

- This is known as __training accuracy__: the model is trained and tested on the same data

### KNN K=5

In [19]:
# K=5 Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X,y)
y_pred = knn.predict(X)
metrics.accuracy_score(y,y_pred)

0.9666666666666667

In [20]:
knn.score(X,y)

0.9666666666666667

### KNN K=1

In [21]:
# K=1 Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X,y)
y_pred = knn.predict(X)
metrics.accuracy_score(y,y_pred)

1.0

In [22]:
knn.score(X,y)

1.0

### KNN K=2

In [24]:
# K=2 Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X,y)
y_pred = knn.predict(X)
metrics.accuracy_score(y,y_pred)

0.98

In [25]:
knn.score(X,y)

0.98

___Training and testing your model on the same data is not a useful procedure for deciding which model to choose___

KNN with K=1 __overfits__: every training point is its own nearest neighbor, so the model memorizes the training set and follows the noise instead of the signal
- Black line: SIGNAL
- Green line: NOISE

![title](Img/image2.png)

![title](Img/image3.png)
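
The following sketch illustrates why a perfect training score for K=1 says little about the model. It uses a hand-written nearest-neighbor lookup in NumPy rather than scikit-learn, on hypothetical toy data whose labels are purely random, and it assumes no two points coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 points with 2 features and purely random labels,
# so there is no signal to learn at all
X_toy = rng.normal(size=(20, 2))
y_toy = rng.integers(0, 3, size=20)

# Pairwise squared Euclidean distances between all training points
diffs = X_toy[:, None, :] - X_toy[None, :, :]
dist = (diffs ** 2).sum(axis=2)

# With K=1, each training point's nearest neighbor is itself (distance 0),
# so the "prediction" is simply the label it was trained on
nearest = dist.argmin(axis=1)
y_toy_pred = y_toy[nearest]

train_accuracy = np.mean(y_toy_pred == y_toy)
print(train_accuracy)
```

Even with labels that carry no information, training accuracy comes out perfect, which is why it cannot be used to choose between models.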

### Evaluation Procedure 2: Train/test split
- Split the dataset into a training set and a testing set
    - X_train, y_train
    - X_test, y_test
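
One way to produce these four pieces is scikit-learn's `train_test_split`. The sketch below is illustrative only: the `test_size` and `random_state` values and the variable names `X_all`, `y_all`, and `knn_split` are assumptions, not choices taken from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)

# test_size and random_state are assumed values chosen for illustration
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.4, random_state=4
)

# Fit on the training portion only
knn_split = KNeighborsClassifier(n_neighbors=5)
knn_split.fit(X_train, y_train)

# Score on data the model did not see during fitting
test_accuracy = knn_split.score(X_test, y_test)
print(X_train.shape, X_test.shape)
print(test_accuracy)
```

Because the model is scored on observations it was not fitted to, the resulting testing accuracy is a more honest basis for comparing models than training accuracy.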