# Classifiers Mini-Course

## Lesson 3: Optimizing the parameters...

### Recognizing handwritten digits from 0 to 9

#### Importing dependencies and setup

In [1]:
# Import datasets, metrics, and ML algorithms
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

# Import Numeric library
import numpy as np

# Import plotting library
import matplotlib.pyplot as plt

# Show plots inside the notebook
%matplotlib inline

In [2]:
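# Set the random seeds for reproducibility.
# scikit-learn draws from NumPy's global random state when random_state is None,
# so seeding only Python's `random` module is not enough.
import random
random.seed(0)
np.random.seed(0)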

#### Preparing the data

In [3]:
digits = datasets.load_digits()

n_samples = len(digits.images)

X = digits.images.reshape((n_samples, -1))
X

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])
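The `reshape((n_samples, -1))` call above flattens each 8×8 image into one row of 64 pixel values, which is the 2-D layout scikit-learn estimators expect. A minimal sketch on a made-up array (the `fake_images` name and its contents are illustrative, not part of the dataset):

```python
import numpy as np

# Hypothetical stand-in for digits.images: 3 "images" of 8x8 pixels.
fake_images = np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8)

# -1 lets NumPy infer the second dimension (8 * 8 = 64).
flat = fake_images.reshape((len(fake_images), -1))
print(flat.shape)  # (3, 64)

# The flattening is row-major, so pixel (r, c) lands at column r * 8 + c.
assert flat[1, 2 * 8 + 5] == fake_images[1, 2, 5]
```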

In [4]:
y = digits.target
y

array([0, 1, 2, ..., 8, 9, 8])

#### Tuning the parameters

##### Testing all combinations

In [5]:
# Read about the parameters at: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
svc = SVC(gamma='scale')

parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
clf = GridSearchCV(svc, parameters, cv=3)

clf.fit(X, y)
clf.best_estimator_

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [6]:
clf.predict(X)

array([0, 1, 2, ..., 8, 9, 8])

In [7]:
clf.score(X,y)

1.0
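`GridSearchCV` tries every combination in `parameters`. A standard-library sketch of that enumeration (it only counts the candidates and does not train anything; `cv_folds` mirrors the `cv=3` used above):

```python
from itertools import product

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
cv_folds = 3

# Cartesian product of all parameter values, one dict per candidate.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

for candidate in candidates:
    print(candidate)

# Each candidate is fitted once per fold.
n_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, the best candidate is then fitted once more on all the data. Note that the `1.0` returned by `clf.score(X, y)` is measured on the same data used for the search, so it says little about unseen samples.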

##### Testing at random

In [8]:
parameters = {
    'kernel':('linear', 'rbf', 'poly', 'sigmoid'),
    'C': [10**x for x in range(0,5)],
    'gamma': ('auto', 'scale', 0.0001),
    'tol': (1e-3, 1e-6, 1e-9), # Tolerance for stopping criterion.
    }

clf = RandomizedSearchCV(SVC(), parameters, cv=3)

clf.fit(X, y)
clf.best_estimator_

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=1e-06, verbose=False)

In [9]:
clf.predict(X)

array([0, 1, 2, ..., 8, 9, 8])

In [10]:
clf.score(X,y)

1.0
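`RandomizedSearchCV` evaluates only a sample of the combinations instead of all of them. The sketch below uses only the standard library to show the size of this search space and what drawing 10 candidates looks like (10 is scikit-learn's default `n_iter`; the `rng` seed and the sampling code are illustrative, not the library's implementation):

```python
import random
from itertools import product

parameters = {
    'kernel': ('linear', 'rbf', 'poly', 'sigmoid'),
    'C': [10**x for x in range(0, 5)],
    'gamma': ('auto', 'scale', 0.0001),
    'tol': (1e-3, 1e-6, 1e-9),
}

names = list(parameters)
grid = [dict(zip(names, values)) for values in product(*parameters.values())]
print(len(grid))  # 4 * 5 * 3 * 3 = 180 combinations

# Draw 10 distinct candidates without replacement.
rng = random.Random(0)  # hypothetical seed, for a repeatable sketch
sampled = rng.sample(grid, 10)
for candidate in sampled:
    print(candidate)
```

With `cv=3`, that is 30 cross-validation fits instead of the 540 an exhaustive grid search would need. Passing `random_state` to `RandomizedSearchCV` makes the sampled candidates repeatable.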

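#### Optimizing and Testing

If we use the same data to tune the parameters and to evaluate the model, we fall into the same problem we had when we used the same data to train and test: the score is optimistic, because the chosen parameters have already been fitted to the evaluation data.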

![train_validation_test.png](https://cdn-images-1.medium.com/max/720/1*4G__SV580CxFj78o9yUXuQ.png)

*(Image from: https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6)*
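The validation part of the picture is what `cv=3` provides: the training data is cut into three folds and each fold takes one turn as the validation set. A simplified sketch with plain index lists (scikit-learn actually uses stratified folds for classifiers, which this ignores; `k_fold_indices` is a hypothetical helper, and 1437 is the training-set size left after holding out 20% of the 1797 digits):

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(1437, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), "training samples,", len(val_idx), "validation samples")
```

The held-out test set never appears in any fold, so it stays untouched until the final evaluation.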

##### Training and Validating

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify=y,
                                                    test_size=0.20)  # Common values: 0.2, 0.25, and 0.3.

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(1437, 64) (1437,)
(360, 64) (360,)


In [12]:
svc = SVC(gamma='scale', probability=True)

parameters = {'C':[1, 10]}
clf = GridSearchCV(svc, parameters, cv=3, scoring='f1_macro')  # Add n_jobs=-1 to use all CPU cores

clf.fit(X_train, y_train)

clf.best_estimator_

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [13]:
clf.score(X_train,y_train)

1.0

##### Testing

In [14]:
clf.predict(X_test)

array([7, 2, 8, 5, 4, 2, 0, 6, 0, 9, 0, 5, 7, 6, 8, 0, 9, 7, 4, 1, 5, 3,
       1, 1, 2, 4, 2, 3, 9, 2, 2, 8, 6, 1, 5, 3, 1, 6, 3, 8, 9, 6, 9, 2,
       1, 1, 9, 3, 5, 7, 3, 6, 7, 2, 7, 4, 0, 3, 2, 0, 8, 9, 3, 1, 0, 9,
       8, 3, 4, 5, 8, 3, 4, 0, 5, 5, 8, 1, 2, 1, 0, 6, 1, 8, 8, 7, 0, 3,
       1, 1, 2, 3, 1, 8, 1, 2, 0, 6, 0, 5, 2, 2, 3, 2, 3, 6, 6, 6, 8, 5,
       5, 9, 7, 0, 5, 3, 6, 1, 8, 7, 2, 2, 2, 7, 6, 8, 4, 7, 5, 3, 7, 3,
       7, 1, 6, 7, 4, 9, 5, 7, 1, 6, 5, 5, 0, 1, 8, 8, 6, 5, 0, 2, 1, 0,
       3, 0, 3, 7, 9, 4, 5, 7, 7, 7, 9, 5, 9, 0, 5, 7, 3, 6, 7, 0, 9, 0,
       9, 0, 8, 4, 3, 9, 6, 1, 9, 0, 4, 5, 0, 0, 8, 3, 2, 7, 7, 3, 8, 6,
       5, 4, 6, 1, 8, 6, 2, 4, 2, 8, 5, 1, 2, 4, 9, 0, 6, 0, 2, 4, 4, 9,
       1, 4, 3, 5, 3, 4, 5, 1, 5, 8, 4, 6, 1, 4, 0, 5, 2, 1, 3, 6, 4, 1,
       9, 7, 5, 2, 0, 9, 5, 2, 9, 7, 7, 8, 4, 7, 8, 3, 4, 7, 4, 6, 8, 6,
       8, 6, 0, 6, 1, 1, 3, 7, 4, 6, 4, 9, 7, 8, 5, 4, 8, 4, 9, 8, 4, 9,
       3, 6, 6, 2, 9, 3, 3, 9, 4, 3, 2, 5, 4, 4, 0,

In [15]:
# For each sample the probability for each class
probas = clf.predict_proba(X_test)
probas

array([[1.07746459e-03, 1.22470203e-03, 1.77261555e-03, ...,
        9.84780094e-01, 2.38421611e-03, 1.95832628e-03],
       [6.68019409e-04, 7.11799369e-04, 9.92480215e-01, ...,
        7.64450798e-04, 9.57052679e-04, 1.12664396e-03],
       [4.99215684e-03, 1.12554332e-02, 3.34424458e-02, ...,
        2.32410576e-02, 8.53737150e-01, 2.09424705e-02],
       ...,
       [1.30408336e-04, 3.40117228e-05, 9.99416466e-01, ...,
        4.93221796e-05, 4.58270267e-05, 4.95785682e-05],
       [3.44441367e-03, 2.33658896e-03, 3.75668912e-03, ...,
        3.82838496e-03, 2.03136529e-02, 9.24281849e-01],
       [9.24229521e-03, 1.88969385e-02, 3.12982949e-02, ...,
        3.33232970e-02, 4.14095834e-02, 5.59787653e-01]])

In [16]:
# Let's take a look at the probabilities for the first predicted sample:
probas[0]

array([0.00107746, 0.0012247 , 0.00177262, 0.00172237, 0.00180767,
       0.00174481, 0.00152773, 0.98478009, 0.00238422, 0.00195833])
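Each row of `probas` holds one probability per class, so the most likely digit is the index of the largest value. Using the row printed above (values copied from the output, rounded as shown):

```python
# Probabilities for the first test sample, copied from the output above.
first = [0.00107746, 0.0012247, 0.00177262, 0.00172237, 0.00180767,
         0.00174481, 0.00152773, 0.98478009, 0.00238422, 0.00195833]

# Index of the largest probability = most likely class.
most_likely = max(range(len(first)), key=first.__getitem__)
print(most_likely)            # 7
print(round(sum(first), 6))   # 1.0
```

This matches the first entry of `clf.predict(X_test)`. For `SVC`, the probabilities come from a separate calibration step, so on borderline samples the argmax of `predict_proba` may occasionally differ from `predict`.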

In [17]:
clf.score(X_test,y_test)

0.9916993515590266
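Because the search was created with `scoring='f1_macro'`, `clf.score` reports the macro-averaged F1 rather than accuracy: F1 is computed per class and the classes are averaged with equal weight. A from-scratch sketch on made-up labels (`macro_f1` is a hypothetical helper, not a scikit-learn function):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); the guard only avoids division by zero.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: one sample of class 1 is mistaken for class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                    # 5 of 6 correct
print(macro_f1(y_true, y_pred))    # mean of 1.0, 2/3 and 0.8
```

`metrics.f1_score(y_true, y_pred, average='macro')` from scikit-learn computes the same quantity.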

---

### To learn more...

Dataset:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits

Other datasets to practice with:
https://scikit-learn.org/stable/datasets/index.html

Tutorial I used as the basis for this lesson:
- https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html#sphx-glr-auto-examples-classification-plot-digits-classification-py

### Tip for running this notebook

A very simple way to run this notebook is to use Google Colab: https://colab.research.google.com/

If you prefer to use your own machine, remember to install Python 3 (I used 3.7) and the dependencies:
- NumPy
- Scikit-learn
- Jupyter Notebook
- Matplotlib

I suggest installing both Python and the dependencies via [Anaconda](https://www.anaconda.com/distribution/#download-section) (or [Miniconda](https://conda.io/en/latest/miniconda.html)), creating a dedicated environment.
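To check that the environment has everything this notebook imports, a small sketch like the following can help. The import names are assumptions based on how these packages are usually installed (`sklearn` for Scikit-learn, `notebook` for Jupyter Notebook):

```python
from importlib.util import find_spec

# Import names assumed for the dependencies listed above.
required = ["numpy", "sklearn", "notebook", "matplotlib"]

# find_spec returns None when a top-level module cannot be found.
status = {name: find_spec(name) is not None for name in required}

for name, available in status.items():
    print(f"{name}: {'ok' if available else 'MISSING'}")
```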