# 5.0 Supervised Learning

### Simple Demo: [Iris](https://archive.ics.uci.edu/ml/datasets/Iris) dataset

In [1]:
import seaborn as sns
iris = sns.load_dataset('iris')
import pandas as pd

X_iris = iris.drop('species', axis=1)
y_iris = iris['species']



In [2]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, random_state=1)

In [3]:
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB()                       # 2. instantiate model
model.fit(Xtrain, ytrain)                  # 3. fit model to data
y_model = model.predict(Xtest)             # 4. predict on new data

In [4]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)

0.9736842105263158

GaussianNB: (shhh.... I am **naive** but I am **smart**!)
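
The accuracy above is the fraction of test samples whose predicted label matches the true label. The sketch below recomputes it by hand. It loads Iris through `sklearn.datasets.load_iris` rather than seaborn (an assumption made so the snippet runs offline), so the exact figure may differ slightly from the one shown above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: scikit-learn's bundled copy of Iris stands in for the seaborn one.
X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=1)

y_model = GaussianNB().fit(Xtrain, ytrain).predict(Xtest)

# Accuracy = number of correct predictions / number of test samples.
manual_accuracy = np.mean(y_model == ytest)
print(manual_accuracy, accuracy_score(ytest, y_model))
```

Both printed numbers should agree, since `accuracy_score` computes the same ratio.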

## Next, KNN Classifier

More [info](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/) on k-nearest neighbors.

In [5]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 5)



In [6]:
knn.fit(Xtrain, ytrain)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [7]:
knn.score(Xtest, ytest)

1.0

knn: hahaha, I am smarter than you, GaussianNB!
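
A single train/test split is a thin basis for ranking two models, because the result depends on which samples happen to land in the test set. A hedged sketch of a fairer comparison uses 5-fold cross-validation. The choice of 5 folds and the use of `load_iris` from scikit-learn instead of seaborn are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled Iris data replaces the seaborn copy.
X, y = load_iris(return_X_y=True)

models = {
    "GaussianNB": GaussianNB(),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

cv_means = {}
for name, estimator in models.items():
    # Each model is trained and scored on 5 different train/test partitions.
    scores = cross_val_score(estimator, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

If the two means lie within each other's spread, the data do not clearly favor either model.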

### Pros and Cons of KNN

**Pros**

One of the most attractive features of the K-nearest neighbor algorithm is that it is **simple to understand and easy to implement**.
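
To illustrate how little code the idea needs, here is a minimal from-scratch sketch using NumPy. It assumes Euclidean distance and an unweighted majority vote. The function name `knn_predict` and the toy data are made up for this example and are not part of scikit-learn.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Label each query row by majority vote of its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)
    predictions = []
    for x in X_query:
        # Euclidean distance from the query point to every training point.
        distances = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k closest training points.
        nearest = np.argsort(distances)[:k]
        # Majority vote among the neighbors' labels.
        votes = Counter(y_train[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


# Toy data (hypothetical): two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array(["a", "a", "a", "b", "b", "b"])

toy_predictions = knn_predict(X_toy, y_toy, [[0.05, 0.05], [5.0, 5.1]], k=3)
print(toy_predictions)
```

`KNeighborsClassifier` adds faster neighbor search structures and distance weighting on top of this basic idea.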

**Cons**

The most obvious drawback of KNN is its **computationally expensive** prediction phase: the model stores the entire training set and must compare every new example against it, which can be impractical for large datasets or low-latency applications. KNN can also suffer from **skewed class** distributions. If one class is much more frequent in the training set, its members are more likely to appear among the nearest neighbors of a new example, so it tends to dominate the majority vote. Finally, the accuracy of KNN can degrade severely on high-dimensional data, because the distances to the nearest and farthest neighbors become almost indistinguishable.
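
The high-dimensional effect can be observed numerically. The sketch below draws uniformly random points (an assumption, since real data are rarely uniform) and measures the relative contrast `(farthest - nearest) / nearest` from one random query point. The helper name `relative_contrast` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=1000):
    """Return (farthest - nearest) / nearest distance from one random query."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dimensions: relative contrast {c:.2f}")
```

A contrast near zero means the nearest neighbor is barely closer than the farthest one, so the neighbor vote carries little information.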

## Exercise:

Read the Wine Quality White dataset (``winequality.csv``) from your data folder.

Perform the following tasks:
- Select the first 11 columns as the features and the last column as the target.
- Plot a pair plot of all features.
- Split the data 80:20 into training and test sets.
- Create a `KNeighborsClassifier` instance and fit it to the training data.
- Measure the accuracy and root mean square error (RMSE) of the fitted model.
- Perform a grid search to find the best KNN parameters over ``weights: ['uniform', 'distance']`` and ``n_neighbors: [40, 60, 80, 100, 120, 140]``.
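
The split, fit, score, RMSE, and grid-search steps listed above can be sketched on the Iris data as follows. This is not a solution to the exercise: the dataset, the integer label encoding, the `random_state`, and the small `n_neighbors` grid are stand-ins chosen so that the snippet runs without `winequality.csv`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: Iris with integer class labels 0, 1, 2 (not the wine data).
X, y = load_iris(return_X_y=True)

# 80:20 split; random_state is an arbitrary choice for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier().fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
# RMSE treats labels as numbers, which is only meaningful for ordered labels.
rmse = np.sqrt(mean_squared_error(y_test, knn.predict(X_test)))
print("Score:", accuracy)
print("RMSE:", rmse)

# Illustrative grid, smaller than the exercise's because Iris is tiny.
param_grid = {"weights": ["uniform", "distance"],
              "n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_score_)
print(search.best_params_)
```

`best_params_` holds the winning combination, and `best_score_` is its mean cross-validated accuracy on the training data.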


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

import pandas as pd

df = pd.read_csv('winequality.csv')

Score: 0.4995918367346939
RMSE: 0.9328472981322575
GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'weights': ['uniform', 'distance'], 'n_neighbors': [40, 60, 80, 100, 120, 140]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
0.5687448951810509
distance
100


## SVM

In [8]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer


cancer = load_breast_cancer()
X_cancer, y_cancer = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state=0)


clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
print('Breast cancer dataset')
print('Accuracy of RBF SVC classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of RBF SVC classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Breast cancer dataset
Accuracy of RBF SVC classifier on training set: 0.90
Accuracy of RBF SVC classifier on test set: 0.94


In [None]:
X_train

array([[1.185e+01, 1.746e+01, 7.554e+01, ..., 9.140e-02, 3.101e-01,
        7.007e-02],
       [1.122e+01, 1.986e+01, 7.194e+01, ..., 2.022e-02, 3.292e-01,
        6.522e-02],
       [2.013e+01, 2.825e+01, 1.312e+02, ..., 1.628e-01, 2.572e-01,
        6.637e-02],
       ...,
       [9.436e+00, 1.832e+01, 5.982e+01, ..., 5.052e-02, 2.454e-01,
        8.136e-02],
       [9.720e+00, 1.822e+01, 6.073e+01, ..., 0.000e+00, 1.909e-01,
        6.559e-02],
       [1.151e+01, 2.393e+01, 7.452e+01, ..., 9.653e-02, 2.112e-01,
        8.732e-02]])

### C parameter

Try the following code and describe your findings.

In [15]:
clf = SVC(kernel='rbf', C=100).fit(X_train, y_train)
print('Breast cancer dataset')
print('Accuracy of RBF SVC classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of RBF SVC classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Breast cancer dataset
Accuracy of RBF SVC classifier on training set: 0.94
Accuracy of RBF SVC classifier on test set: 0.94
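
`C` controls regularization strength inversely: a small `C` tolerates more misclassified training points in exchange for a smoother decision boundary, while a large `C` fits the training data more closely. To see the trend rather than a single point, one option is to sweep several values, as sketched below. The values 0.1, 10, and 1000 are additions for illustration and do not appear in the cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_cancer, y_cancer = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_cancer, y_cancer, random_state=0)

c_results = {}
for C in (0.1, 1, 10, 100, 1000):
    clf = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each value of C.
    c_results[C] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"C={C:<6} train={c_results[C][0]:.2f} test={c_results[C][1]:.2f}")
```

A widening gap between training and test accuracy as `C` grows would be a sign of overfitting.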


### MinMaxScaler

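SVMs are sensitive to feature scale, because the RBF kernel depends on the distance between samples and features with large numeric ranges dominate that distance. Try the following code and describe your findings.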

In [22]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = SVC(kernel='rbf', C=45).fit(X_train_scaled, y_train)  # 'rbf' is the default kernel, made explicit here
print('Breast cancer dataset (normalized with MinMax scaling)')
print('RBF-kernel SVC (with MinMax scaling) training set accuracy: {:.2f}'
     .format(clf.score(X_train_scaled, y_train)))
print('RBF-kernel SVC (with MinMax scaling) test set accuracy: {:.2f}'
     .format(clf.score(X_test_scaled, y_test)))

Breast cancer dataset (normalized with MinMax scaling)
RBF-kernel SVC (with MinMax scaling) training set accuracy: 0.99
RBF-kernel SVC (with MinMax scaling) test set accuracy: 0.98
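
`MinMaxScaler` rescales each feature to the range [0, 1] using `(x - min) / (max - min)`, with `min` and `max` learned from the training data only, which is why the test set above is passed through `transform` rather than `fit_transform`. One way to enforce that discipline automatically, sketched here as an option rather than as the notebook's own method, is to chain the scaler and the classifier with `make_pipeline`. The 5-fold cross-validation at the end is an added assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X_cancer, y_cancer = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_cancer, y_cancer, random_state=0)

# The scaler is fitted when the pipeline is fitted and is only applied
# (not refitted) when the pipeline predicts or scores.
scaled_svc = make_pipeline(MinMaxScaler(), SVC(C=45))
scaled_svc.fit(X_train, y_train)
test_accuracy = scaled_svc.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")

# Inside cross-validation the scaler is refitted on each training fold,
# so no statistics leak from the held-out fold.
cv_scores = cross_val_score(scaled_svc, X_train, y_train, cv=5)
print(f"5-fold CV accuracy: {cv_scores.mean():.2f}")
```

The pipeline can also be passed directly to `GridSearchCV`, which keeps scaling inside each fold during a parameter search.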
