In scikit-learn, the RandomizedSearchCV class implements random search with cross-validation. Three of its arguments matter most:

- estimator: the machine learning model whose hyperparameters we’re tuning; this is exactly the same as for GridSearchCV
- param_distributions: a dictionary with hyperparameter names as keys and, for each one, either a distribution to sample values from or a list of values to choose from. In GridSearchCV, we instead had param_grid, a dictionary representing the grid of hyperparameter values to search over
- n_iter: the number of hyperparameter combinations to draw at random from the distributions. The default value is 10.
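
To make the roles of param_distributions and n_iter concrete, here is a minimal sketch that draws a few candidate settings with scikit-learn's ParameterSampler utility, without fitting any model. The search space, `n_iter=5` and `random_state=0` are illustrative assumptions, not values required by the library.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Illustrative search space (an assumption for this sketch):
# a list is sampled uniformly, a scipy distribution is sampled via its rvs() method
param_distributions = {
    'penalty': ['l1', 'l2'],
    'C': uniform(loc=0, scale=100),
}

# n_iter controls how many hyperparameter combinations are drawn
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))

for params in samples:
    print(params)
```

Each printed dictionary is one candidate that a random search would evaluate with cross-validation, so increasing n_iter increases the number of candidates tried.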

In [4]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Load the data set
cancer = load_breast_cancer()

# Split the data into training and testing sets
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

To draw hyperparameter values at random, we need to provide some information about how we want to select random numbers. Do we want random numbers between 0 and 100? Between -10 and 10? Do we want the same chance of picking small numbers and picking large numbers?
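
As a rough illustration of these questions, the sketch below compares two scipy.stats distributions: uniform, which gives every value in a range the same chance, and loguniform, which favours small values. The loguniform range (0.001 to 100), the sample size and `random_state=0` are assumptions chosen only for this example.

```python
from scipy.stats import loguniform, uniform

# uniform(loc, scale) covers the interval [loc, loc + scale]
uniform_dist = uniform(loc=0, scale=100)   # values between 0 and 100
loguniform_dist = loguniform(1e-3, 1e2)    # values between 0.001 and 100

uniform_samples = uniform_dist.rvs(size=1000, random_state=0)
loguniform_samples = loguniform_dist.rvs(size=1000, random_state=0)

# Fraction of draws that fall below 1 under each distribution
frac_uniform = (uniform_samples < 1).mean()
frac_loguniform = (loguniform_samples < 1).mean()
print(frac_uniform, frac_loguniform)
```

Under the uniform distribution only about 1 in 100 draws should land below 1, whereas the log-uniform distribution spends most of its draws there. Which behaviour is preferable depends on how sensitive the model is to small values of the hyperparameter.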

In [5]:
from scipy.stats import uniform
distributions = {
    'penalty': ['l1', 'l2'],
    'C': uniform(loc=0, scale=100)
}

In [7]:
lr = LogisticRegression(solver='liblinear', max_iter=1000)
clf = RandomizedSearchCV(lr, distributions, n_iter=8)

RandomizedSearchCV(estimator=LogisticRegression(max_iter=1000,
                                                solver='liblinear'),
                   n_iter=8,
                   param_distributions={'C': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x115f78850>,
                                        'penalty': ['l1', 'l2']})


## Evaluating the Results of `RandomizedSearchCV`

We can now follow a similar process to what we did with GridSearchCV to evaluate the results of RandomizedSearchCV.

After fitting a RandomizedSearchCV model, we can inspect the results using the following attributes of the clf object:

- .best_estimator_ gives us the best estimator
- .best_score_ gives us the mean cross-validated score corresponding to the best estimator
- .best_params_ gives us the set of hyperparameters that correspond to the best estimator

Additionally, the `.cv_results_` attribute gives us the scores for each hyperparameter combination that was sampled. We’re now ready to evaluate the random search we set up earlier.

In [8]:
clf.fit(X_train, y_train)
best_model = clf.best_estimator_

print(best_model)
print(clf.best_params_)

LogisticRegression(C=np.float64(54.289379228449874), max_iter=1000,
                   penalty='l1', solver='liblinear')
{'C': np.float64(54.289379228449874), 'penalty': 'l1'}


In [9]:
from sklearn.metrics import accuracy_score

# Store the best mean cross-validated score from the random search
best_score = clf.best_score_

# Get the best estimator and evaluate on the test set
best_model = clf.best_estimator_
test_predictions = best_model.predict(X_test)

# Calculate accuracy
test_score = accuracy_score(y_test, test_predictions)


print(best_score)
print(test_score)

0.9672229822161423
0.965034965034965
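
As a side note, with the default `refit=True` the fitted search object can be used directly for scoring, since it refits the best estimator on the whole training set. The sketch below rebuilds a small search so that it runs on its own; `n_iter=5` and `random_state=0` are assumptions added for reproducibility, so the numbers will not match the output above.

```python
from scipy.stats import uniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(solver='liblinear', max_iter=1000),
    {'penalty': ['l1', 'l2'], 'C': uniform(loc=0, scale=100)},
    n_iter=5,
    random_state=0,
)
search.fit(X_train, y_train)

# Accuracy computed by hand from the best estimator's predictions
manual_score = accuracy_score(y_test, search.best_estimator_.predict(X_test))

# The same accuracy obtained through the search object itself
shortcut_score = search.score(X_test, y_test)
print(manual_score, shortcut_score)
```

Both values should agree, because for a classifier with no custom scoring the search object's score method reports accuracy of the refitted best estimator.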


In [11]:
hyperparameter_values = pd.DataFrame(clf.cv_results_['params'])
randomsearch_scores = pd.DataFrame(clf.cv_results_['mean_test_score'])


df = pd.concat([hyperparameter_values, randomsearch_scores], axis=1)
print(df)

           C penalty         0
0  87.942761      l2  0.948427
1  93.273480      l1  0.955458
2  13.126674      l1  0.957811
3  54.289379      l1  0.967223
4  70.815615      l1  0.957811
5  96.889236      l2  0.950780
6   2.208408      l1  0.946074
7  55.422794      l1  0.962517
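
The table above lists candidates in the order they were sampled. One possible way to make it easier to read is to load all of cv_results_ into a DataFrame and sort by rank, as in the following sketch. It rebuilds a small search so it can run independently; `n_iter=5`, `random_state=0` and the choice of columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import uniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(solver='liblinear', max_iter=1000),
    {'penalty': ['l1', 'l2'], 'C': uniform(loc=0, scale=100)},
    n_iter=5,
    random_state=0,
)
search.fit(X_train, y_train)

# cv_results_ is a dict of arrays, one entry per sampled candidate
columns = ['param_C', 'param_penalty', 'mean_test_score',
           'std_test_score', 'rank_test_score']
results = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(results)
```

After sorting, the first row corresponds to the best candidate, and the std_test_score column gives a sense of how much the score varied across cross-validation folds.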
