# Model Selection

## Selecting Best Models Using Exhaustive Search

### Problem
You want to find the best model by testing a range of different hyperparameter values.

### Solution
Use `GridSearchCV` from the `scikit-learn` library.

### Steps:

1. **Create the model**
   - A logistic regression model is defined.
2. **Define the hyperparameters to test**
   - Penalty (`penalty`): `['l1', 'l2']`
   - Inverse regularization strength (`C`): `np.logspace(0, 4, 10)`
3. **Run the grid search with cross-validation (cv=5)**
4. **View the best hyperparameters**
5. **Make predictions with the best model**

### Discussion

- `GridSearchCV` performs an exhaustive search over every combination of the supplied hyperparameter values (here 2 penalties × 10 values of `C`, so 20 candidates).
- Each combination is evaluated with k-fold cross-validation.
- The best model is the candidate with the highest mean cross-validated score (accuracy by default for classifiers).
- After the best hyperparameters are found, the final model is retrained on all of the data (the default `refit=True`).
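
The points above can be checked by counting the candidates and inspecting the fitted search object. This is a minimal sketch rather than the recipe's own code: it reuses the same grid, but `solver="saga"` and `max_iter=5000` are assumptions made here (saga accepts both penalties), and the variable names are illustrative. Convergence warnings may still appear because the features are not scaled.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Same grid as in this recipe: 2 penalties x 10 values of C
param_grid = {"penalty": ["l1", "l2"], "C": np.logspace(0, 4, 10)}
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5
print("Candidates:", n_candidates)                         # 20
print("Fits during the search:", n_candidates * n_folds)   # 100

iris = datasets.load_iris()
search = GridSearchCV(
    # Assumption: saga solver and a raised iteration limit
    linear_model.LogisticRegression(solver="saga", max_iter=5000),
    param_grid,
    cv=n_folds,
)
search.fit(iris.data, iris.target)

# One mean cross-validated score per candidate
mean_scores = search.cv_results_["mean_test_score"]
best_index = search.best_index_
print("Best mean CV accuracy:", mean_scores[best_index])
print("Best parameters:", search.best_params_)

# With refit=True (the default), best_estimator_ was refit on all the data
print("Refit estimator:", search.best_estimator_)
```

The search itself performs 100 fits, and the refit on the full data adds one more.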


In [1]:
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV


iris = datasets.load_iris()
features = iris.data        
target = iris.target       

# logistic regression model
logistic = linear_model.LogisticRegression(solver='liblinear')  # solver compatible with 'l1'

# candidate hyperparameter values
penalty = ['l1', 'l2']                           # type of regularization
C = np.logspace(0, 4, 10)                        # values of C (inverse of regularization strength)

# dictionary of hyperparameters to search
hyperparameters = dict(C=C, penalty=penalty)

# GridSearchCV with 5-fold cross-validation
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=0)

# fit the grid search (finds the best hyperparameters)
best_model = gridsearch.fit(features, target)

print("Best penalty:", best_model.best_estimator_.get_params()['penalty'])
print("Best C:", best_model.best_estimator_.get_params()['C'])

# use the best model to make predictions
predictions = best_model.predict(features)
print("First predictions:", predictions[:10])

Best penalty: l1
Best C: 7.742636826811269
First predictions: [0 0 0 0 0 0 0 0 0 0]


In [2]:
np.logspace(0, 4, 10)

array([1.00000000e+00, 2.78255940e+00, 7.74263683e+00, 2.15443469e+01,
       5.99484250e+01, 1.66810054e+02, 4.64158883e+02, 1.29154967e+03,
       3.59381366e+03, 1.00000000e+04])
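
The ten values above are evenly spaced on a logarithmic scale between 10^0 and 10^4. A short sketch of that relationship, with illustrative variable names:

```python
import numpy as np

# np.logspace(start, stop, num) returns 10**x for num evenly spaced x in [start, stop]
c_values = np.logspace(0, 4, 10)
same_values = 10 ** np.linspace(0, 4, 10)
print(np.allclose(c_values, same_values))  # True

# Consecutive values share a constant ratio of 10**(4/9)
ratios = c_values[1:] / c_values[:-1]
print(ratios)
```

A logarithmic grid is a common choice for `C` because its useful range spans several orders of magnitude.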

In [3]:
best_model.predict(features)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

## Selecting Best Models Using Randomized Search

In [12]:
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

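# candidate penalties (sampled uniformly from the list)
penalty = ['l1', 'l2']

# C is sampled from a uniform distribution on [0, 4]
C = uniform(loc=0, scale=4)
hyperparameters = dict(C=C, penalty=penalty)

# randomized search: 100 sampled candidates, 5-fold cross-validation, all CPU cores
randomizedsearch = RandomizedSearchCV(
    logistic, hyperparameters, random_state=1, n_iter=100, cv=5, verbose=0,
    n_jobs=-1)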

best_model = randomizedsearch.fit(features, target)

In [5]:
uniform(loc=0, scale=4).rvs(10)  # draws 10 sample values of C; the result is not displayed

print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])

Best Penalty: l1
Best C: 1.668088018810296


In [6]:
best_model.predict(features)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
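
To see what the randomized search actually tries, the candidates can be drawn by hand. The sketch below uses scikit-learn's `ParameterSampler` on the same search space; drawing 5 candidates and the name `candidates` are choices made here for illustration.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Same search space as above: a list is sampled uniformly,
# a scipy.stats distribution is sampled through its rvs() method
distributions = {"penalty": ["l1", "l2"], "C": uniform(loc=0, scale=4)}

# Draw 5 candidates reproducibly
candidates = list(ParameterSampler(distributions, n_iter=5, random_state=1))
for params in candidates:
    print(params)
```

Unlike the grid search, the number of candidates is fixed by `n_iter` and does not depend on how many values each hyperparameter can take.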

## Selecting Best Models from Multiple Learning Algorithms

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

np.random.seed(0)

iris = datasets.load_iris()
features = iris.data
target = iris.target

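# pipeline with a placeholder classifier; the search replaces this step
pipe = Pipeline([("classifier", RandomForestClassifier())])

# candidate learning algorithms and their hyperparameters
# note: LogisticRegression() uses the default 'lbfgs' solver, which does not
# support penalty='l1', so the 'l1' candidates cannot be fit with it
search_space = [{"classifier": [LogisticRegression()],
                 "classifier__penalty": ['l1', 'l2'],
                 "classifier__C": np.logspace(0, 4, 10)},
                {"classifier": [RandomForestClassifier()],
                 "classifier__n_estimators": [10, 100, 1000],
                 "classifier__max_features": [1, 2, 3]}]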

gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=0)

best_model = gridsearch.fit(features, target)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [10]:
best_model.best_estimator_.get_params()["classifier"]

In [11]:
best_model.predict(features)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
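
The convergence warnings printed by this search come from the logistic regression candidates reaching their iteration limit. One possible variant is sketched below; it is not the recipe's code. The `StandardScaler` step, `solver="saga"`, `max_iter=5000`, the fixed `random_state`, and the smaller grids are all assumptions made here to keep the example quick and to let every candidate fit.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features, target = iris.data, iris.target

# Scaling step plus a placeholder classifier that the search replaces
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

search_space = [
    {   # Assumption: saga accepts both 'l1' and 'l2'; iteration limit raised
        "classifier": [LogisticRegression(solver="saga", max_iter=5000)],
        "classifier__penalty": ["l1", "l2"],
        "classifier__C": np.logspace(0, 4, 5),
    },
    {   # Smaller forest grid than the recipe, to keep the sketch fast
        "classifier": [RandomForestClassifier(random_state=0)],
        "classifier__n_estimators": [10, 100],
        "classifier__max_features": [1, 2, 3],
    },
]

search = GridSearchCV(pipe, search_space, cv=5)
search.fit(features, target)

# The winning algorithm is the "classifier" step of the refit pipeline
best_classifier = search.best_estimator_.named_steps["classifier"]
print(type(best_classifier).__name__, search.best_score_)
```

Because the scaler is part of the pipeline, it is refit on each training fold, so the validation folds do not leak into the scaling.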

In [13]:
cross_val_score(gridsearch, features, target).mean()

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

np.float64(0.9733333333333334)
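
Passing the search object to `cross_val_score` wraps the whole hyperparameter search inside an outer cross-validation loop, which is commonly called nested cross-validation. The sketch below spells out the two loops with explicit splitters. The `StratifiedKFold` settings, the small grid for `C`, and `max_iter=1000` are assumptions chosen for illustration, not values from the recipe.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

iris = datasets.load_iris()
features, target = iris.data, iris.target

# Separate splitters for the inner (tuning) and outer (evaluation) loops
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: chooses C using only the outer training split
inner_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1, 10, 100]},
    cv=inner_cv,
)

# Outer loop: scores the whole search procedure on folds it never saw
outer_scores = cross_val_score(inner_search, features, target, cv=outer_cv)
print(outer_scores)
print(outer_scores.mean())
```

The outer mean estimates how well the tuning procedure generalizes, whereas `best_score_` from a single search is measured on the same folds that were used to pick the hyperparameters.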