# Scikit-Learn Tutorial

An excellent [paper](https://dtai.cs.kuleuven.be/events/lml2013/papers/lml2013_api_sklearn.pdf) describing the architecture of the scikit-learn API.


In [None]:
from sklearn import neighbors, datasets
from sklearn.base import is_classifier
from sklearn.model_selection import StratifiedShuffleSplit, ShuffleSplit, StratifiedKFold, train_test_split, GridSearchCV
from sklearn.decomposition import PCA
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from joblib import dump, load
from sklearn.pipeline import Pipeline

import pandas as pd

import numpy as np

<img src="imgs/Estimators.png" width="700">

[Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)

The 4 features are the length and width of the **sepal** and of the **petal**.

In [None]:
iris = datasets.load_iris()
X, y = iris.data, iris.target

species = ['Iris setosa', 'Iris versicolor', 'Iris virginica']
features = ['sepal_len', 'sepal_width', 'petal_len', 'petal_width']

In [None]:
len(X)

In [None]:
X[:10], y[:10]

In [None]:
# same number of samples for each species
[list(y).count(i) for i in [0, 1, 2]]

## Train/test split

metrics, iterators and superfunctions
PCA and t-SNE for 2D or 3D visualization
clustering experiment
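
As a rough illustration of the PCA item in the notes above, the following sketch projects the four Iris features onto two principal components. The variable names (`X_iris`, `X_2d`) are introduced here for illustration only and are not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the Iris data under illustrative names so the notebook's X, y are untouched.
X_iris, y_iris = load_iris(return_X_y=True)

# Keep the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_iris)

# Fraction of the total variance retained by the 2D projection.
retained = pca.explained_variance_ratio_.sum()
print(X_2d.shape, retained)
```

The two columns of `X_2d` could then be passed to a scatter plot, colored by `y_iris`, to inspect how well the species separate.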

In [None]:
# stratify keeps the class distribution of the full dataset in both the train and the test split
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0, shuffle=True, stratify=y)
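
A quick way to check the effect of `stratify` is to count the samples of each class in the two splits. This is a minimal sketch that repeats the split on its own so it can run independently; the `*_counts` names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0, shuffle=True, stratify=y
)

# Number of samples per class in the full dataset and in each split.
full_counts = np.bincount(y)
train_counts = np.bincount(train_y)
test_counts = np.bincount(test_y)
print(full_counts, train_counts, test_counts)
```

Since Iris has 50 samples per species and the test split takes 20% of them, each class should contribute 40 samples to the train set and 10 to the test set.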

In [None]:
# classification based on the nearest neighbors, using the Euclidean distance
# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

# is the estimator a classifier?
is_classifier(knn)

In [None]:
# dictionary with the model's parameters
knn.get_params()

In [None]:
# parameters can be changed after the estimator has been created (p=2 is the Euclidean distance)
knn.set_params(p=2)
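
The sketch below shows `set_params` and `get_params` working together on a separate estimator (`knn_demo`, a name used only here). It assumes the default `metric="minkowski"`, for which `p=1` corresponds to the Manhattan distance and `p=2` to the Euclidean one.

```python
from sklearn.neighbors import KNeighborsClassifier

# A separate estimator, so the notebook's `knn` is left unchanged.
knn_demo = KNeighborsClassifier(n_neighbors=5)

# Several parameters can be updated in a single call.
knn_demo.set_params(p=1, n_neighbors=3)

# get_params reflects the updated values.
demo_params = knn_demo.get_params()
print(demo_params["p"], demo_params["n_neighbors"], demo_params["metric"])
```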

In [None]:
# fit the model
knn.fit(train_X, train_y)

In [None]:
list(zip(knn.predict(test_X), test_y))

## Saving the model

https://joblib.readthedocs.io/en/latest/persistence.html

*joblib.dump* and *joblib.load* provide a replacement for pickle to work efficiently on arbitrary Python objects containing large data, in particular large numpy arrays.

**pickle** can also be used.

In [None]:
dump(knn, 'filename.joblib')

In [None]:
knn = load('filename.joblib')
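
One way to convince yourself that the round trip preserves the model is to compare the predictions before and after reloading. This is a sketch under a few assumptions: it trains its own small model, and it writes to a temporary directory (the file name `knn_demo.joblib` is hypothetical) so nothing is left on disk.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5).fit(X_iris, y_iris)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knn_demo.joblib")
    dump(model, path)
    restored = load(path)

# The restored model should predict exactly like the original one.
same_predictions = np.array_equal(model.predict(X_iris), restored.predict(X_iris))
print(same_predictions)
```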

## Score

Estimator score method: estimators have a **score** method that provides a default evaluation criterion for the problem they are designed to solve. The criterion is described in each estimator’s documentation.

In [None]:
# for k-NN the score is the accuracy
knn.score(test_X, test_y)

In [None]:
# same as above, using an explicit metric (here the same one used by default)
pred_test_y = knn.predict(test_X)
accuracy_score(test_y, pred_test_y)
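
Accuracy is simply the fraction of test samples that are predicted correctly, so it can also be computed by hand. The sketch below rebuilds a small k-NN model under illustrative names (`clf`, `tr_X`, `te_X`, ...) and compares the three ways of obtaining the value.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
tr_X, te_X, tr_y, te_y = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=0, stratify=y_iris
)
clf = KNeighborsClassifier(n_neighbors=5).fit(tr_X, tr_y)
pred = clf.predict(te_X)

manual = np.mean(pred == te_y)           # fraction of correct predictions
via_metric = accuracy_score(te_y, pred)  # explicit metric
via_score = clf.score(te_X, te_y)        # estimator's default score
print(manual, via_metric, via_score)
```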

In [None]:
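# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(knn, test_X, test_y, normalize='true')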
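
To see what `normalize='true'` does numerically, the matrix can be computed without plotting. This sketch uses `confusion_matrix` on a model rebuilt under illustrative names; with this option each row (one per true class) is divided by its total, so every row sums to 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
tr_X, te_X, tr_y, te_y = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=0, stratify=y_iris
)
clf = KNeighborsClassifier(n_neighbors=5).fit(tr_X, tr_y)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(te_y, clf.predict(te_X), normalize="true")
row_sums = cm.sum(axis=1)
print(cm)
print(row_sums)
```

The diagonal of the normalized matrix then reads as the per-class recall.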

## Pipeline

Sequentially apply a list of transforms and a final estimator. Intermediate steps of pipeline must implement **fit** and **transform** methods and the final estimator only needs to implement **fit**.
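
The following sketch illustrates what the pipeline does internally: fitting it should be equivalent to fitting the scaler, transforming the data, and fitting the final estimator on the transformed data. It uses the Iris data as a stand-in, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
tr_X, te_X, tr_y, te_y = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=0, stratify=y_iris
)

# One object that chains scaling and classification.
pipe = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())]).fit(tr_X, tr_y)
pipe_pred = pipe.predict(te_X)

# The same steps carried out by hand.
scaler = StandardScaler().fit(tr_X)
svc = SVC().fit(scaler.transform(tr_X), tr_y)
manual_pred = svc.predict(scaler.transform(te_X))

same = np.array_equal(pipe_pred, manual_pred)
print(same)
```

Note that the scaler is fitted on the training data only and then reused on the test data, which the pipeline handles automatically.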

In [None]:
winedf = pd.read_csv(r'F:\Documenti\insegnamento\scikit_learn\data\winequality-red.csv',sep=';')

In [None]:
winedf.head(2)

In [None]:
winedf['quality'].unique()

In [None]:
# features
X=winedf.drop(['quality'],axis=1)
# labels
Y=winedf['quality']

In [None]:
pipeline = Pipeline([('scaler', StandardScaler()), ('SVM', SVC())])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2, random_state=1, stratify=Y)

In [None]:
params = {'SVM__C': np.logspace(-5, 5, 10), 'SVM__gamma':[1e-1, 1e-2]}
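
The keys of the grid follow the `<step name>__<parameter>` convention, so `SVM__C` addresses the `C` parameter of the step named `SVM`. The sketch below, which rebuilds the pipeline under illustrative names, checks that the keys are valid and counts the candidate combinations.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

demo_pipeline = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())])
demo_params = {"SVM__C": np.logspace(-5, 5, 10), "SVM__gamma": [1e-1, 1e-2]}

# Every key of the grid must be a parameter name known to the pipeline.
valid_names = demo_pipeline.get_params()
all_valid = all(name in valid_names for name in demo_params)

# The grid is the Cartesian product of the value lists: 10 * 2 combinations.
n_candidates = len(ParameterGrid(demo_params))
print(all_valid, n_candidates)
```

With 20 candidates and 5-fold cross-validation, the grid search runs 100 fits, plus one final refit on the whole training set.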

In [None]:
grid = GridSearchCV(pipeline, param_grid=params, cv=5, verbose=1)

In [None]:
grid.fit(X_train, y_train)

In [None]:
grid.best_params_
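
After the search, the fitted object also exposes the best cross-validated score and, with the default `refit=True`, a model retrained on the whole training set. The sketch below is an assumption-laden stand-in: it uses scikit-learn's built-in `load_wine` dataset (not the `winequality-red.csv` file above) and a smaller grid, and all names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: the built-in wine dataset, NOT the wine-quality CSV.
X_wine, y_wine = load_wine(return_X_y=True)
tr_X, te_X, tr_y, te_y = train_test_split(
    X_wine, y_wine, test_size=0.2, random_state=1, stratify=y_wine
)

demo_pipeline = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())])
demo_grid = GridSearchCV(
    demo_pipeline,
    param_grid={"SVM__C": [0.1, 1, 10], "SVM__gamma": [1e-1, 1e-2]},
    cv=5,
)
demo_grid.fit(tr_X, tr_y)

best_params = demo_grid.best_params_    # best combination found
best_cv_score = demo_grid.best_score_   # mean cross-validated accuracy
test_acc = demo_grid.score(te_X, te_y)  # accuracy of the refitted best model
print(best_params, best_cv_score, test_acc)
```

Reporting `test_acc` on data that took no part in the search gives a less optimistic estimate than `best_cv_score`.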