# Optimization in sklearn

In this lab, we'll use some more advanced sklearn tools to evaluate and optimize a classifier for the <a href="https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-dataset">wine dataset</a>.

In [1]:
from sklearn.datasets import load_wine
wine = load_wine()

So far we have only evaluated a classifier with a single train-test split. Because the split is random, the resulting accuracy changes from run to run.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data, test_data, target, test_target = train_test_split(wine.data, wine.target, test_size=0.2)
clf = KNeighborsClassifier().fit(data, target)
accuracy = clf.score(test_data, test_target)
print(accuracy)

0.7222222222222222
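
To see how much a single split can move the score, here is a minimal sketch that repeats the cell above with ten different splits. The `random_state` seeds 0 to 9 and the name `split_scores` are assumptions for illustration; they are not part of the original lab.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

wine = load_wine()

# Repeat the same experiment on ten different random splits.
# The seeds 0..9 are arbitrary; they only make each run reproducible.
split_scores = []
for seed in range(10):
    data, test_data, target, test_target = train_test_split(
        wine.data, wine.target, test_size=0.2, random_state=seed
    )
    clf = KNeighborsClassifier().fit(data, target)
    split_scores.append(clf.score(test_data, test_target))

# The gap between the lowest and highest accuracy shows the split-to-split noise.
print(min(split_scores), max(split_scores))
```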


To get a more stable score, we can perform <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html">cross-validation</a>.

In [3]:
from sklearn.model_selection import cross_val_score

clf = KNeighborsClassifier()
scores = cross_val_score(clf, wine.data, wine.target, cv=5)
print(scores.mean(), scores.std())

0.6912698412698413 0.04877951071049148
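
As a rough sketch of what `cross_val_score` does here, the loop below builds the folds by hand. It assumes the documented default that an integer `cv` with a classifier uses unshuffled `StratifiedKFold`; the names `manual_scores` and `fold_clf` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

wine = load_wine()
clf = KNeighborsClassifier()

# Each of the 5 folds takes one turn as the test set;
# a fresh copy of the classifier is trained on the other 4.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(wine.data, wine.target):
    fold_clf = clone(clf).fit(wine.data[train_idx], wine.target[train_idx])
    manual_scores.append(fold_clf.score(wine.data[test_idx], wine.target[test_idx]))

manual_scores = np.array(manual_scores)
print(manual_scores.mean(), manual_scores.std())
```

Averaging over five test sets instead of one is what makes the reported score less sensitive to any particular split.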


To incorporate standardization into this approach, we need to build a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">pipeline</a>, so that the scaler is refit on the training portion of each fold and never sees the held-out data.

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classify", KNeighborsClassifier())
])

scores = cross_val_score(pipeline, wine.data, wine.target, cv=5)
print(scores.mean(), scores.std())

0.9493650793650794 0.037910929811115976
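
The jump in accuracy is plausibly due to the very different scales of the wine features, since k-nearest neighbors relies on raw distances. The short check below prints the per-feature standard deviations; `feature_stds` and `spread_ratio` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Standard deviation of each of the 13 features before any scaling.
feature_stds = wine.data.std(axis=0)
for name, std in zip(wine.feature_names, feature_stds):
    print(f"{name:30s} {std:10.3f}")

# Ratio between the most and least spread-out feature.
spread_ratio = feature_stds.max() / feature_stds.min()
print(spread_ratio)
```

Without standardization, the feature with the largest spread dominates the Euclidean distance, so the other features barely influence which neighbors are chosen.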


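To tune the hyperparameters as well, we can do a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">grid search</a>. Passing the grid search itself to `cross_val_score` nests one cross-validation inside another: the inner 2-fold search picks `n_neighbors` using only the training portion of each outer fold, and the outer 5 folds score the whole tuning procedure.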

In [5]:
from sklearn.model_selection import GridSearchCV

settings = {"classify__n_neighbors": range(1, 10)}
grid = GridSearchCV(pipeline, settings, cv=2)

scores = cross_val_score(grid, wine.data, wine.target, cv=5)
print(scores.mean(), scores.std())

0.9609523809523809 0.028267341226138717


If we want to train a final model for future use, how many neighbors should we use? Fitting the grid search on the full dataset gives the answer.

In [6]:
grid = grid.fit(wine.data, wine.target)
print(grid.best_params_)

{'classify__n_neighbors': 6}
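
To see how close the other candidates came, the fitted search also stores the score of every setting in `cv_results_`. The sketch below rebuilds the same pipeline and grid so that it runs on its own, and prints the mean inner-fold accuracy for each value of `n_neighbors`.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classify", KNeighborsClassifier()),
])
settings = {"classify__n_neighbors": range(1, 10)}
grid = GridSearchCV(pipeline, settings, cv=2).fit(wine.data, wine.target)

# One row per candidate: the number of neighbors and its mean accuracy
# over the 2 inner folds.
mean_scores = grid.cv_results_["mean_test_score"]
for params, mean_score in zip(grid.cv_results_["params"], mean_scores):
    print(params["classify__n_neighbors"], round(mean_score, 4))
```

If several settings score almost the same, the reported best value depends heavily on the particular folds.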


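Here's a model for future use, trained on all of the data with the selected number of neighbors. Note that this classifier is fit on unscaled features, whereas the grid search tuned `n_neighbors` inside the scaled pipeline.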

In [7]:
clf = KNeighborsClassifier(n_neighbors=6).fit(wine.data, wine.target)
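
Since `n_neighbors=6` was selected with standardization in place, a final model that matches the cross-validated setup would keep the scaler. A minimal sketch follows; the name `final_model` is an assumption, and this is one possible variant rather than the lab's own model.

```python
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()

# Same two steps as the tuned pipeline, with the number of neighbors
# fixed to the value reported by the grid search (6).
final_model = Pipeline([
    ("scale", StandardScaler()),
    ("classify", KNeighborsClassifier(n_neighbors=6)),
]).fit(wine.data, wine.target)

# The pipeline standardizes new inputs with the training statistics
# before looking up neighbors.
predictions = final_model.predict(wine.data[[0, 75, 177]])
print(predictions)
```

With the default `refit=True`, the fitted `grid` object also exposes an equivalent pipeline, refit on all of the data, as `grid.best_estimator_`.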

Here's how we would classify new examples.

In [8]:
examples = wine.data[[0,75,177]] # Pretending these are new
predictions = clf.predict(examples)
print(predictions)
print(wine.target[[0,75,177]])

[0 1 1]
[0 1 2]
