# 04_grid_random_search

1. Implement GridSearchCV and RandomizedSearchCV for the wine dataset. Use the following parameters:
   - estimator: LogisticRegression(solver="liblinear")
   - parameter C: min 1, max 10,000, number of values 1,000
   - regularization: l1 and l2
2. Implement GridSearchCV (a single search covering all models) to find the best algorithm and hyperparameters for the dataset from item 1. Use the estimators:
   - RandomForestClassifier
   - KNeighborsClassifier
   - LogisticRegression
3. Compare the results using hyperopt-sklearn.




In [1]:
import sklearn
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

from hpsklearn import HyperoptEstimator, any_classifier, any_preprocessing
from hyperopt import tpe

import warnings
import pandas as pd 
import numpy as np

In [2]:
# Load the dataset once and unpack features and labels directly
X, y = load_wine(return_X_y=True)

# 1

GridSearchCV performs an exhaustive search over the specified parameter values for an estimator: every combination in the grid is cross-validated.

In [3]:
warnings.filterwarnings("ignore")

# 1,000 log-spaced values of C between 1 (10**0) and 10,000 (10**4), each tried with l1 and l2
param_grid = {'C': np.logspace(0, 4, 1000), 'penalty': ['l1', 'l2']}

logreg = LogisticRegression(solver="liblinear")
logreg_cv = GridSearchCV(logreg, param_grid, cv=5).fit(X, y)

print(logreg_cv.best_params_)

{'C': 1.6151326935030905, 'penalty': 'l2'}
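
Beyond `best_params_`, the fitted search also exposes the cross-validated score and the full list of evaluated candidates. The sketch below shows this on a reduced grid; the 20 values of C (instead of 1,000) and the names `small_grid` and `search` are assumptions made only to keep the example quick.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)

# Assumption: a reduced grid (20 values of C) so the sketch runs quickly
small_grid = {"C": np.logspace(0, 4, 20), "penalty": ["l1", "l2"]}

search = GridSearchCV(LogisticRegression(solver="liblinear"), small_grid, cv=5)
search.fit(X, y)

# One entry per evaluated combination: 20 values of C times 2 penalties
n_candidates = len(search.cv_results_["params"])

print(n_candidates)
print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated accuracy of the best candidate
```

`best_score_` is the mean accuracy over the 5 folds, so it is the number to compare between candidates rather than a score on unseen data.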


RandomizedSearchCV performs a randomized search over hyperparameters: only `n_iter` sampled combinations are cross-validated.

In [4]:
warnings.filterwarnings("ignore")

param_dist = {'C': np.logspace(0, 4, 1000), 'penalty': ['l1', 'l2']}

logreg = LogisticRegression(solver="liblinear")
# n_iter=1000 samples 1,000 of the 2,000 possible (C, penalty) combinations
logreg_cv = RandomizedSearchCV(logreg, param_dist, cv=5, random_state=1, n_iter=1000).fit(X, y)

print(logreg_cv.best_params_)

{'penalty': 'l2', 'C': 2.1693835183851844}
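
RandomizedSearchCV also accepts a continuous distribution in place of a pre-generated list of values. The sketch below is a variant of the cell above, not the notebook's method: it assumes `scipy.stats.loguniform(1, 1e4)` as the distribution for C and a small `n_iter=20` to keep it fast.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_wine(return_X_y=True)

# Assumption: C is drawn from a log-uniform distribution on [1, 10,000]
# instead of being picked from a fixed array of 1,000 values
param_dist = {"C": loguniform(1, 1e4), "penalty": ["l1", "l2"]}

search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),
    param_dist,
    n_iter=20,       # number of sampled candidates (kept small for the sketch)
    cv=5,
    random_state=1,  # makes the sampled candidates reproducible
)
search.fit(X, y)

print(search.best_params_)
```

With a distribution, the number of fits is controlled entirely by `n_iter`, independent of how finely C could be resolved.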


The two approaches differ in how candidates are chosen. GridSearchCV trains and cross-validates the model for every combination in the grid, whereas RandomizedSearchCV cross-validates only `n_iter` combinations sampled at random from the same space.
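
The difference in candidate selection can be seen without fitting anything, using scikit-learn's `ParameterGrid` and `ParameterSampler` helpers on the same space as above. This is a minimal sketch; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

space = {"C": np.logspace(0, 4, 1000), "penalty": ["l1", "l2"]}

# Grid search: every combination of C and penalty
grid_candidates = list(ParameterGrid(space))

# Randomized search: n_iter combinations drawn at random from the same space
sampled_candidates = list(ParameterSampler(space, n_iter=1000, random_state=1))

# When every parameter is given as a list, sampling is done without replacement
unique_sampled = {(c["C"], c["penalty"]) for c in sampled_candidates}

print(len(grid_candidates))     # 2000
print(len(sampled_candidates))  # 1000
print(len(unique_sampled))      # 1000
```

With `n_iter=1000`, the randomized search above therefore evaluates half of the combinations that the grid search evaluates.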

# 2

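The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. Here the pipeline has a single step named `classifier`, which lets one GridSearchCV replace the estimator itself as well as tune its parameters.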

In [5]:
pipe = Pipeline([("classifier", RandomForestClassifier())])
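
The mechanism a search relies on here is `set_params`: a pipeline step can be replaced by name, and the new step's parameters can be set through the `<step>__<parameter>` syntax. A minimal sketch, with `n_neighbors=3` chosen arbitrarily for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([("classifier", RandomForestClassifier())])

# Replace the whole "classifier" step, then set a parameter on the new step
pipe.set_params(classifier=KNeighborsClassifier(), classifier__n_neighbors=3)

step = pipe.named_steps["classifier"]
print(type(step).__name__, step.n_neighbors)  # KNeighborsClassifier 3
```

This is why a key such as `"classifier"` can hold estimator objects in a parameter grid, next to keys such as `"classifier__n_neighbors"`.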

In [8]:
search_space = [
    {"classifier": [RandomForestClassifier()],
     "classifier__n_estimators": [10, 50, 100],
     "classifier__max_features": [1, 2, 3]},
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": range(1, 10, 1),
     "classifier__leaf_size": [30, 60, 90]},
    # liblinear supports both l1 and l2; the default lbfgs solver rejects l1
    {"classifier": [LogisticRegression(solver="liblinear")],
     "classifier__penalty": ["l1", "l2"],
     "classifier__C": np.logspace(0, 4, 10)}
]

In [9]:
gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=1, n_jobs=-1)

In [10]:
best_model = gridsearch.fit(X, y)

Fitting 5 folds for each of 56 candidates, totalling 280 fits
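
The figure of 56 candidates follows from the grid sizes: each dictionary in the list is expanded separately and the results are added up. The sketch below reproduces the count with `ParameterGrid`; the strings `"rf"`, `"knn"` and `"logreg"` are placeholders standing in for the estimator objects.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same shape as the search space above, with placeholder strings for the estimators
search_space = [
    {"classifier": ["rf"],
     "classifier__n_estimators": [10, 50, 100],
     "classifier__max_features": [1, 2, 3]},
    {"classifier": ["knn"],
     "classifier__n_neighbors": list(range(1, 10)),
     "classifier__leaf_size": [30, 60, 90]},
    {"classifier": ["logreg"],
     "classifier__penalty": ["l1", "l2"],
     "classifier__C": np.logspace(0, 4, 10)},
]

sizes = [len(ParameterGrid(grid)) for grid in search_space]
total = len(ParameterGrid(search_space))
n_fits = total * 5  # cv=5 folds per candidate

print(sizes)   # [9, 27, 20]
print(total)   # 56
print(n_fits)  # 280
```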


In [11]:
print(best_model.best_estimator_.get_params()["classifier"])

RandomForestClassifier(max_features=2)


# 3

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [13]:
model = HyperoptEstimator()

In [14]:
model.fit(X_train, y_train)

100%|█████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.09s/trial, best loss: 0.04166666666666663]
100%|█████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.05s/trial, best loss: 0.04166666666666663]
100%|█████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.68s/trial, best loss: 0.04166666666666663]
100%|█████████████████████████████████████████████████| 4/4 [00:02<00:00,  2.56s/trial, best loss: 0.04166666666666663]
100%|█████████████████████████████████████████████████| 5/5 [00:01<00:00,  1.18s/trial, best loss: 0.04166666666666663]
100%|█████████████████████████████████████████████████| 6/6 [00:01<00:00,  1.18s/trial, best loss: 0.04166666666666663]
100%|█████████████████████████████████████████████████| 7/7 [00:01<00:00,  1.15s/trial, best loss: 0.04166666666666663]
100%|█████████████████████████████████████████████████| 8/8 [00:01<00:00,  1.59s/trial, best loss: 0.04166666666666663]
[progress output truncated]

In [15]:
acc = model.score(X_test, y_test)
print(f"Accuracy: {acc:.3f}")

Accuracy: 0.898


In [16]:
print(model.best_model())

{'learner': RandomForestClassifier(max_depth=3, max_features=0.9611671660778067,
                       min_samples_leaf=14, n_estimators=30, n_jobs=1,
                       random_state=2, verbose=False), 'preprocs': (StandardScaler(with_std=False),), 'ex_preprocs': ()}
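
The accuracy reported by hyperopt-sklearn is measured on a held-out test set, while the searches in sections 1 and 2 were fitted on the full dataset and ranked by cross-validation. For a like-for-like comparison, a grid search can be fitted on the same training split and scored on the same test split. The sketch below assumes a small random forest grid and `random_state=0`, neither of which comes from the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)

# Same split as used for HyperoptEstimator above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Assumption: a reduced random forest grid, chosen only for illustration
grid = {"n_estimators": [10, 50], "max_features": [1, 2, 3]}

search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X_train, y_train)

# score() uses the best estimator, refitted on the whole training split
test_acc = search.score(X_test, y_test)

print(f"CV accuracy:   {search.best_score_:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

Reporting both numbers makes it clear which one is comparable to the hyperopt-sklearn test accuracy.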
