# <center> **Hyperparameter optimization**

The challenge is based on the [Kaggle: Predicting a Biological Response](https://www.kaggle.com/c/bioresponse).  
The goal is to predict the biological response of molecules (column 'Activity') from their chemical composition (columns D1-D1776).

**Data description**  
The data is presented in CSV format. Each line represents a molecule.
- The first Activity column contains experimental data describing the actual biological response [0, 1];
- The remaining columns D1-D1776 are molecular descriptors - these are calculated properties that can capture some characteristics of a molecule, such as size, shape, or composition of elements.

**Purpose of the task:**  
Two models need to be trained: logistic regression and random forest. Their hyperparameters are then tuned using basic and advanced optimization methods.
Each of the four methods (GridSearchCV, RandomizedSearchCV, Hyperopt, Optuna) must be used at least once, and the number of iterations must not exceed 50.

In [88]:
import numpy as np
import pandas as pd 

from sklearn import linear_model
from sklearn import tree 
from sklearn import ensemble 
from sklearn import metrics
from sklearn import preprocessing 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
import hyperopt
from hyperopt import hp, fmin, tpe, Trials, space_eval
import optuna

import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn')

In [89]:
# import and explore data
data = pd.read_csv('files/_train_sem.csv')
data.head()

Unnamed: 0,Activity,D1,D2,D3,D4,D5,D6,D7,D8,D9,...,D1767,D1768,D1769,D1770,D1771,D1772,D1773,D1774,D1775,D1776
0,1,0.0,0.497009,0.1,0.0,0.132956,0.678031,0.273166,0.585445,0.743663,...,0,0,0,0,0,0,0,0,0,0
1,1,0.366667,0.606291,0.05,0.0,0.111209,0.803455,0.106105,0.411754,0.836582,...,1,1,1,1,0,1,0,0,1,0
2,1,0.0333,0.480124,0.0,0.0,0.209791,0.61035,0.356453,0.51772,0.679051,...,0,0,0,0,0,0,0,0,0,0
3,1,0.0,0.538825,0.0,0.5,0.196344,0.72423,0.235606,0.288764,0.80511,...,0,0,0,0,0,0,0,0,0,0
4,0,0.1,0.517794,0.0,0.0,0.494734,0.781422,0.154361,0.303809,0.812646,...,0,0,0,0,0,0,0,0,0,0


In [90]:
# check the distribution of a target value
data['Activity'].value_counts()

1    2034
0    1717
Name: Activity, dtype: int64

In [91]:
# split our data
X = data.drop('Activity', axis=1)
y = data['Activity']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=0.2)

# check the sample dimensions
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(3000, 1776) (751, 1776) (3000,) (751,)
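
The shapes printed above follow directly from the class counts. A minimal sketch of the arithmetic, assuming (as scikit-learn does for a fractional `test_size`) that the test size is rounded up to a whole number of rows:

```python
import math

# class counts reported by value_counts() above
n_active, n_inactive = 2034, 1717
n_samples = n_active + n_inactive

# a fractional test_size is rounded up to a whole number of rows
n_test = math.ceil(0.2 * n_samples)
n_train = n_samples - n_test

# share of the positive class, which stratify=y keeps roughly equal in both parts
positive_share = n_active / n_samples

print(n_train, n_test, round(positive_share, 3))
```

The classes are only mildly imbalanced (about 54% positive), so F1 and accuracy should tell a broadly similar story here.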


## <center> <span style="color: red"> **Part I. Simple model without optimization**

### <center> **Logistic Regression**

In [109]:
# build a simple model
log_reg = linear_model.LogisticRegression(random_state=42, max_iter=1000)

# training this model
log_reg.fit(X_train, y_train)

# predict and calculate results for the training
y_train_pred = log_reg.predict(X_train)
print('Train F1-score for simple logistic regression: {:.3f}'.format(metrics.f1_score(y_train, y_train_pred)))

# predict and calculate results for the testing
y_test_pred = log_reg.predict(X_test)
print('Test F1-score for simple logistic regression: {:.3f}'.format(metrics.f1_score(y_test, y_test_pred)))

Train F1-score for simple logistic regression: 0.893
Test F1-score for simple logistic regression: 0.777
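
For reference, the F1-score used throughout is the harmonic mean of precision and recall for the positive class. A minimal pure-Python sketch; the helper name `f1_binary` and the toy labels are illustrative and not part of the notebook, which relies on `metrics.f1_score`:

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1) from paired label sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # without true positives the score is taken as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# toy example: three positives, two of them found, one false alarm
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_binary(y_true, y_pred)
print(round(score, 3))
```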


### <center> **Random Forest**

In [93]:
# build a model
forest = ensemble.RandomForestClassifier(random_state=42, n_estimators=50, max_depth=3)

# training this model
forest.fit(X_train, y_train)

# predict and calculate results for the training
y_train_pred = forest.predict(X_train)
print('Train F1-score for a random forest: {:.3f}'.format(metrics.f1_score(y_train, y_train_pred)))

# predict and calculate results for the testing
y_test_pred = forest.predict(X_test)
print('Test F1-score for a random forest: {:.3f}'.format(metrics.f1_score(y_test, y_test_pred)))

Train F1-score for a random forest: 0.763
Test F1-score for a random forest: 0.737


## <center> <span style="color: red"> **Part II. GridSearchCV**

### <center> **Logistic Regression**

In [107]:
# set search space
param_grid = [{'penalty': ['l2', 'none'],
              'solver': ['lbfgs', 'sag'],
              'C': list(np.linspace(0.01, 1, 10, dtype=float))},
              
              {'penalty': ['l1', 'l2'],
              'solver': ['liblinear', 'saga'],
              'C': list(np.linspace(0.01, 1, 10, dtype=float))}
]

# call the optimization algorithm
# (no `scoring` is passed, so candidates are ranked by accuracy, the classifier's default score)
grid_search = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1
)

# train the model
grid_search.fit(X_train, y_train)

# predict and calculate results for the training
y_train_pred = grid_search.predict(X_train)
print('Train F1-score for logistic regression with GridSearchCV: {:.3f}'.format(metrics.f1_score(y_train, y_train_pred)))

# predict and calculate results for the testing
y_test_pred = grid_search.predict(X_test)
print('Test F1-score for logistic regression with GridSearchCV: {:.3f}'.format(metrics.f1_score(y_test, y_test_pred)))

Train F1-score for logistic regression with GridSearchCV: 0.831
Test F1-score for logistic regression with GridSearchCV: 0.781
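
The grid is split into two dictionaries because not every solver supports every penalty. A minimal sketch of that constraint, assuming the solver/penalty support listed in the scikit-learn `LogisticRegression` documentation; the `SUPPORTED` table and the `valid_pairs` helper are illustrative names, and `'none'` is spelled as in this notebook:

```python
# penalties each solver accepts (elastic net is omitted, since it is not searched here)
SUPPORTED = {
    'lbfgs': {'l2', 'none'},
    'sag': {'l2', 'none'},
    'newton-cg': {'l2', 'none'},
    'liblinear': {'l1', 'l2'},
    'saga': {'l1', 'l2', 'none'},
}

def valid_pairs(solvers, penalties):
    """Return the (solver, penalty) pairs that can be fitted together."""
    return [(s, p) for s in solvers for p in penalties if p in SUPPORTED[s]]

# a single combined grid would contain invalid pairs such as ('lbfgs', 'l1')
all_solvers = ['lbfgs', 'sag', 'liblinear', 'saga']
all_penalties = ['l1', 'l2', 'none']
ok = valid_pairs(all_solvers, all_penalties)
print(len(ok), 'valid of', len(all_solvers) * len(all_penalties))
```

Grouping compatible solvers and penalties into separate dictionaries keeps every candidate in the search fittable.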




### <center> **Random Forest**

In [95]:
# set search space
param_grid = {
    'n_estimators': list(range(100, 250, 25)), 
    'max_depth': list(range(5, 15, 1)),
    'min_samples_leaf': list(range(1, 6, 1)),
    'criterion': ['gini', 'entropy']
}

# call the optimization algorithm
grid_search = GridSearchCV(
    estimator=forest,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1
)

# train the model
grid_search.fit(X_train, y_train)

# predict and calculate results for the training
y_train_pred = grid_search.predict(X_train)
print('Train F1-score for random forest with GridSearchCV: {:.3f}'.format(metrics.f1_score(y_train, y_train_pred)))

# predict and calculate results for the testing
y_test_pred = grid_search.predict(X_test)
print('Test F1-score for random forest with GridSearchCV: {:.3f}'.format(metrics.f1_score(y_test, y_test_pred)))

Train F1-score for random forest with GridSearchCV: 0.980
Test F1-score for random forest with GridSearchCV: 0.799
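
To see why exhaustive search is expensive, it helps to count the candidates. A rough sketch using only the standard library, with the two grids from this part copied in by hand (the ten `C` values approximate `np.linspace(0.01, 1, 10)`; the helper name `grid_size` is illustrative):

```python
from itertools import product

def grid_size(grid):
    """Number of parameter combinations in one dict of lists."""
    return len(list(product(*grid.values())))

# approximately the ten points of np.linspace(0.01, 1, 10)
c_values = [0.01 + 0.11 * i for i in range(10)]

log_reg_grid = [
    {'penalty': ['l2', 'none'], 'solver': ['lbfgs', 'sag'], 'C': c_values},
    {'penalty': ['l1', 'l2'], 'solver': ['liblinear', 'saga'], 'C': c_values},
]
forest_grid = {
    'n_estimators': list(range(100, 250, 25)),
    'max_depth': list(range(5, 15, 1)),
    'min_samples_leaf': list(range(1, 6, 1)),
    'criterion': ['gini', 'entropy'],
}

cv = 5
n_log_reg = sum(grid_size(g) for g in log_reg_grid)
n_forest = grid_size(forest_grid)
print(n_log_reg, n_log_reg * cv)  # candidates, cross-validation fits
print(n_forest, n_forest * cv)
```

With `cv=5` the random forest grid alone requires thousands of model fits, which is far beyond a budget of 50 iterations.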


## <center> <span style="color: red"> **Part III. RandomizedSearchCV**

### <center> **Logistic Regression**

In [96]:
# set search space
param_grid = [{'penalty': ['l2', 'none'],
              'solver': ['lbfgs', 'sag'],
              'C': list(np.linspace(0.01, 1, 10, dtype=float))},
              
              {'penalty': ['l1', 'l2'],
              'solver': ['liblinear', 'saga'],
              'C': list(np.linspace(0.01, 1, 10, dtype=float))}
]

# call the optimization algorithm
random_search = RandomizedSearchCV(
    estimator=log_reg,
    param_distributions=param_grid,
    cv=5,
    n_iter=50,
    n_jobs=-1
)

# training this model
random_search.fit(X_train, y_train)

# predict and calculate results for the training
y_train_pred = random_search.predict(X_train)
print('Train F1-score for logistic regression with RandomizedSearchCV: {:.3f}'.format(metrics.f1_score(y_train, y_train_pred)))

# predict and calculate results for the testing
y_test_pred = random_search.predict(X_test)
print('Test F1-score for logistic regression with RandomizedSearchCV: {:.3f}'.format(metrics.f1_score(y_test, y_test_pred)))

Train F1-score for logistic regression with RandomizedSearchCV: 0.831
Test F1-score for logistic regression with RandomizedSearchCV: 0.781




### <center> **Random Forest**

In [97]:
# set search space
param_grid = {
    'n_estimators': list(range(100, 250, 25)), 
    'max_depth': list(range(5, 15, 1)),
    'min_samples_leaf': list(range(1, 6, 1)),
    'criterion': ['gini', 'entropy']
}

# call the optimization algorithm
random_search = RandomizedSearchCV(
    estimator=forest,
    param_distributions=param_grid,
    cv=5,
    n_iter=50,
    n_jobs=-1
)

# training this model
random_search.fit(X_train, y_train)

# predict and calculate results for the training
y_train_pred = random_search.predict(X_train)
print('Train F1-score for random forest with RandomizedSearchCV: {:.3f}'.format(metrics.f1_score(y_train, y_train_pred)))

# predict and calculate results for the testing
y_test_pred = random_search.predict(X_test)
print('Test F1-score for random forest with RandomizedSearchCV: {:.3f}'.format(metrics.f1_score(y_test, y_test_pred)))

Train F1-score for random forest with RandomizedSearchCV: 0.98
Test F1-score for random forest with RandomizedSearchCV: 0.80
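
Randomized search keeps the same grid but evaluates only `n_iter` sampled candidates. A minimal sketch of that idea with the standard library; it mimics the sampling step only and is not how `RandomizedSearchCV` is implemented internally:

```python
import random
from itertools import product

forest_grid = {
    'n_estimators': list(range(100, 250, 25)),
    'max_depth': list(range(5, 15, 1)),
    'min_samples_leaf': list(range(1, 6, 1)),
    'criterion': ['gini', 'entropy'],
}

# enumerate every combination as a dict of parameter values
keys = list(forest_grid)
candidates = [dict(zip(keys, values)) for values in product(*forest_grid.values())]

rng = random.Random(42)  # fixed seed so the draw is repeatable
sampled = rng.sample(candidates, 50)  # 50 distinct candidates, like n_iter=50

print(len(candidates), len(sampled))
```

Passing `random_state` to `RandomizedSearchCV` plays the same role as the fixed seed here; the cells above do not set it, so the selected candidates can differ between runs.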


## <center> <span style="color: red"> **Part IV. Hyperopt**

### <center> **Logistic Regression**

In [98]:
# fix the random seed so that the results are reproducible
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [99]:
# set search space
space = hp.choice('parameter_combinations', [
        {'solver': 'saga',
         'penalty': hp.choice('penalty', ['l1', 'l2']),
         'C': hp.uniform('C_saga', 0.01, 1)},
        
        {'solver': 'lbfgs',
        'penalty': 'l2',
        'C': hp.uniform('C_lbfgs', 0.01, 1)}
])


def obj_func(params, cv=5, X=X_train, y=y_train, random_state=RANDOM_STATE):
    """objective for hyperopt: mean cross-validated F1, negated because fmin minimizes"""
    
    # params space
    params = {
        'solver': params['solver'], 
        'penalty': params['penalty'], 
        'C': params['C']
    }
      
    # model with set of params
    model = linear_model.LogisticRegression(**params, class_weight='balanced', 
        random_state=random_state, max_iter=50
    )
      
    # use cross validation
    score = cross_val_score(model, X, y, cv=cv, scoring='f1', n_jobs=-1).mean()

    return -score 

# logging results
trials = Trials()

# find best params for our space
best = fmin(obj_func, 
          space=space, 
          algo=tpe.suggest, 
          max_evals=20, 
          trials=trials, 
          rstate=np.random.RandomState(42))

100%|██████████| 20/20 [01:15<00:00,  3.78s/trial, best loss: -0.7799713634306372]
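
`fmin` minimizes its objective, which is why `obj_func` returns the negated F1-score and the progress bar reports a negative best loss. A toy sketch of the same sign convention, with made-up scores and a plain `min` standing in for the optimizer:

```python
# hypothetical cross-validated F1-scores for three candidate settings
cv_f1 = {'C=0.05': 0.771, 'C=0.10': 0.780, 'C=0.50': 0.776}

def loss(name):
    """Loss handed to a minimizer: lower is better, so negate the score."""
    return -cv_f1[name]

best_name = min(cv_f1, key=loss)  # minimizing -F1 ...
best_f1 = -loss(best_name)        # ... is the same as maximizing F1
print(best_name, best_f1)
```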


In [100]:
# found the best parameters
best_params = {
    'solver': 'saga',
    'penalty': 'l2', 
    'C': 0.10567819922023905
}
    
# build a model with best params
hyperopt_lr = linear_model.LogisticRegression(
    **best_params, class_weight='balanced',random_state=RANDOM_STATE, max_iter=50
)

# training this model
hyperopt_lr.fit(X_train, y_train)

# predict and calculate results for the training
y_train_pred = hyperopt_lr.predict(X_train)
print('Train F1-score for log reg with Hyperopt: {:.3f}'.format(metrics.f1_score(y_train, y_train_pred)))

# predict and calculate results for the testing
y_test_pred = hyperopt_lr.predict(X_test)
print('Test F1-score for log reg with Hyperopt: {:.3f}'.format(metrics.f1_score(y_test, y_test_pred)))

Train F1-score for log reg with Hyperopt: 0.851
Test F1-score for log reg with Hyperopt: 0.784




### <center> **Random Forest**

In [101]:
# set search space
space = {'n_estimators': hp.quniform('n_estimators', 100, 250, 25),
       'max_depth' : hp.quniform('max_depth', 5, 15, 1),
       'min_samples_leaf': hp.quniform('min_samples_leaf', 1, 6, 1),
       'criterion': hp.choice('criterion', ['gini', 'entropy'])}

# fix random seed
random_state = 42

def hyperopt_rf(params, cv=5, X=X_train, y=y_train, random_state=random_state):
    """objective for hyperopt: mean cross-validated F1 of a random forest, negated because fmin minimizes"""
    
    params = {'n_estimators': int(params['n_estimators']), 
              'max_depth': int(params['max_depth']), 
              'min_samples_leaf': int(params['min_samples_leaf']),
              'criterion': params['criterion']}
  
    # build a model with set of params
    model = ensemble.RandomForestClassifier(
        **params, class_weight='balanced', n_jobs=-1, random_state=random_state
    )
    
    # use cross-validation on the data passed in as X and y
    # (cross_val_score fits its own clones of the model, so no separate fit is needed)
    score = cross_val_score(model, X, y, cv=cv, scoring="f1", n_jobs=-1).mean()

    return -score


# logging results
trials = Trials()  # used for logging results

# finding best params
best = fmin(hyperopt_rf,  # our objective function
            space=space,  # hyperparameter space
            algo=tpe.suggest,  # optimization algorithm; it is the default, so specifying it is optional
            max_evals=20,  # maximum number of iterations
            trials=trials,  # logging of results
            rstate=np.random.RandomState(42)
            )

# build a model with best params
model = ensemble.RandomForestClassifier(
    random_state=random_state, 
    n_estimators=int(best['n_estimators']),
    max_depth=int(best['max_depth']),
    min_samples_leaf=int(best['min_samples_leaf']),
    criterion=space_eval(space, best)['criterion']
)

# train this model
model.fit(X_train, y_train)

# predict and calculate results for the training
y_train_pred = model.predict(X_train)
print('Train F1-score for random forest with Hyperopt: {:.3f}'.format(metrics.f1_score(y_train, y_train_pred)))

# predict and calculate results for the testing
y_test_pred = model.predict(X_test)
print('Test F1-score for random forest with Hyperopt: {:.3f}'.format(metrics.f1_score(y_test, y_test_pred)))

100%|██████████| 20/20 [01:24<00:00,  4.21s/trial, best loss: -0.8135020917624874]
Train F1-score for random forest with Hyperopt: 0.965
Test F1-score for random forest with Hyperopt: 0.803
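
`hp.quniform` returns floats on a grid of step `q`, which is why the objective casts each value with `int(...)`. A small sketch of the rule given in the Hyperopt documentation, `round(uniform(low, high) / q) * q`; the helper `quniform` below is a stand-in written for illustration, not the library function:

```python
import random

def quniform(rng, low, high, q):
    """Draw uniformly from [low, high], then snap to the nearest multiple of q."""
    # the result is returned as a float to mimic what Hyperopt hands to the objective
    return float(round(rng.uniform(low, high) / q) * q)

rng = random.Random(42)
draws = [quniform(rng, 100, 250, 25) for _ in range(1000)]

# values such as 125.0 need int() before being passed to RandomForestClassifier
n_estimators_values = sorted({int(d) for d in draws})
print(n_estimators_values)
```

For `hp.choice` parameters, `fmin` returns the index of the chosen option rather than the option itself, which is why `space_eval` is used above to recover `criterion`.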


## <center> <span style="color: red"> **Part V. Optuna**

### <center> **Logistic Regression**

In [102]:
def optuna_1(trial):
    """function to iterate over hyperparameters from the first set
    
    Args:
        trial : hyperparameter class
      
    Returns:
        score(float): target metric - F1
    """
    
    # space for searching
    penalty = trial.suggest_categorical('penalty', ['l2', 'none'])
    solver = trial.suggest_categorical('solver', ['lbfgs', 'sag', 'newton-cg'])
    C = trial.suggest_float('C', 0.1, 1)  
    
    # build a model
    model = linear_model.LogisticRegression(
        penalty = penalty,
        solver = solver,
        C=C,
        random_state=42,
        max_iter=50)   
    
    # run cross-validation (cross_val_score fits its own clones of the model on each fold,
    # so a separate fit on the full training set is not needed)
    score = cross_val_score(model, X_train, y_train, cv=5, scoring="f1", n_jobs=-1).mean()
    
    return score


def optuna_2(trial):
    """function to iterate over hyperparameters from the second set
    
    Args:
        trial : hyperparameter class
      
    Returns:
        score(float): target metric - F1
    """
    
    # space for searching
    penalty = trial.suggest_categorical('penalty', ['l1', 'l2'])
    solver = trial.suggest_categorical('solver', ['liblinear', 'saga'])
    C = trial.suggest_float('C', 0.1, 1)  
    
    # build a model
    model = linear_model.LogisticRegression(
        penalty = penalty,
        solver = solver,
        C=C,
        random_state=42,
        max_iter=50)   
    
    # run cross-validation (it fits its own clones of the model on each fold)
    score = cross_val_score(model, X_train, y_train, cv=5, scoring="f1", n_jobs=-1).mean()
    
    return score

In [103]:
# finding best parameters
sampler = optuna.samplers.TPESampler(seed=42)
study_1 = optuna.create_study(study_name="LogisticRegression", sampler=sampler, direction="maximize")
study_1.optimize(optuna_1, n_trials=25)

# build a model with best params
log_reg_best_1 = linear_model.LogisticRegression(**study_1.best_params,random_state=42)

# training the model
log_reg_best_1.fit(X_train, y_train)

# predict and calculate results for the testing sample
y_test_pred = log_reg_best_1.predict(X_test)
print('Test F1-score for log reg with Optuna: {:.3f}'.format(metrics.f1_score(y_test, y_test_pred)))


[I 2022-12-18 15:59:22,742] A new study created in memory with name: LogisticRegression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[I 2022-12-18 15:59:24,396] Trial 0 finished with value: 0.7574579769635668 and parameters: {'penalty': 'none', 'solver': 'lbfgs', 'C': 0.8445211524878764}. Best is trial 0 with value: 0.7574579769635668.
[I 2022-12-18 15:59:30,764] Trial 1 finished with value: 0.7789014212164245 and parameters: {'penalty': 'none', 'solver': 'sag', 'C': 0.9931481451194933}. Best is trial 1 with value: 0.7789014212164245.
[I 2022-12-18 15:59:37,529] Trial 2 finished with value: 0.7797781171199327 and p

Test F1-score for log reg with Optuna: 0.787


In [104]:
# finding best parameters
sampler = optuna.samplers.TPESampler(seed=42)
study_2 = optuna.create_study(study_name="LogisticRegression", sampler=sampler, direction="maximize")
study_2.optimize(optuna_2, n_trials=25)

# build a model with best params
log_reg_best_2 = linear_model.LogisticRegression(**study_2.best_params,random_state=42)

# training the model 
log_reg_best_2.fit(X_train, y_train)

# predict and calculate results for the testing sample
y_test_pred = log_reg_best_2.predict(X_test)
print('Test F1-score for log reg with Optuna: {:.3f}'.format(metrics.f1_score(y_test, y_test_pred)))

[I 2022-12-18 16:03:28,087] A new study created in memory with name: LogisticRegression
[I 2022-12-18 16:03:30,148] Trial 0 finished with value: 0.7810893505547788 and parameters: {'penalty': 'l2', 'solver': 'liblinear', 'C': 0.5239690538924806}. Best is trial 0 with value: 0.7810893505547788.
[I 2022-12-18 16:03:42,539] Trial 1 finished with value: 0.7833927351874851 and parameters: {'penalty': 'l1', 'solver': 'saga', 'C': 0.666972270209778}. Best is trial 1 with value: 0.7833927351874851.
[I 2022-12-18 16:03:53,877] Trial 2 finished with value: 0.7875943300206008 and parameters: {'penalty': 'l1', 'solver': 'saga', 'C': 0.4579481018201026}. Best is trial 2 with value: 0.7875943300206008.
[I 2022-12-18 16:04:05,972] Trial 3 finished with value: 0.7811816513962073 and parameters: {'penalty': 'l1', 'solver': 'saga', 'C': 0.8243614381116966}. Best is trial 2 with value: 0.7875943300206008.
[I 2022-12-18 16:04:08,046]

Test F1-score for log reg with Optuna: 0.778


### <center> **Random Forest**

In [105]:
def optuna_forest(trial):
    """function to iterate over hyperparameters for random forest
    
    Args:
        trial : hyperparameter class
      
    Returns:
        score(float): target metric - F1
    """
    
    # space for searching
    n_estimators = trial.suggest_int('n_estimators', 100, 250, step=25)
    max_depth = trial.suggest_int('max_depth', 5, 15, step=1)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 3, 6, step=1)
    criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
    
    # build a model
    random_forest = ensemble.RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        criterion=criterion,
        random_state=42)   
    
    # run cross-validation (it fits its own clones of the model on each fold)
    score = cross_val_score(random_forest, X_train, y_train, cv=5, scoring="f1", n_jobs=-1).mean()
    
    return score

In [106]:
# finding best parameters
sampler = optuna.samplers.TPESampler(seed=42)
study_forest = optuna.create_study(study_name="RandomForest", sampler=sampler, direction="maximize")
study_forest.optimize(optuna_forest, n_trials=25)

# build a model with best parameters
random_forest_best = ensemble.RandomForestClassifier(**study_forest.best_params, random_state=42)

# training the model
random_forest_best.fit(X_train, y_train)

# predict and calculate results for the testing sample
y_test_pred = random_forest_best.predict(X_test)
print('Test F1-score for random forest with Optuna: {:.3f}'.format(metrics.f1_score(y_test, y_test_pred)))

[I 2022-12-18 16:04:59,132] A new study created in memory with name: RandomForest
[I 2022-12-18 16:05:03,747] Trial 0 finished with value: 0.8083553978762932 and parameters: {'n_estimators': 125, 'max_depth': 14, 'min_samples_leaf': 5, 'criterion': 'gini'}. Best is trial 0 with value: 0.8083553978762932.
[I 2022-12-18 16:05:11,685] Trial 1 finished with value: 0.8051366853195903 and parameters: {'n_estimators': 250, 'max_depth': 9, 'min_samples_leaf': 4, 'criterion': 'gini'}. Best is trial 0 with value: 0.8083553978762932.
[I 2022-12-18 16:05:20,376] Trial 2 finished with value: 0.8037422019731902 and parameters: {'n_estimators': 225, 'max_depth': 10, 'min_samples_leaf': 6, 'criterion': 'gini'}. Best is trial 0 with value: 0.8083553978762932.
[I 2022-12-18 16:05:25,060] Trial 3 finished with value: 0.7969365774578533 and parameters: {'n_estimators': 125, 'max_depth': 9, 'min_samples_leaf': 6, 'criterion': 'entropy'}. Best is 

Test F1-score for random forest with Optuna: 0.792
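
Note that the search spaces are not identical across methods, because `range` excludes its upper bound while `suggest_int` includes it. A short sketch comparing the `n_estimators` candidates, assuming inclusive bounds for `suggest_int` as described in the Optuna documentation:

```python
# GridSearchCV / RandomizedSearchCV: range() stops before 250
grid_values = list(range(100, 250, 25))

# Optuna: suggest_int('n_estimators', 100, 250, step=25) includes both ends
low, high, step = 100, 250, 25
optuna_values = list(range(low, high + 1, step))

only_in_optuna = sorted(set(optuna_values) - set(grid_values))
print(grid_values)
print(optuna_values)
print(only_in_optuna)
```

This matches the log above, where trial 1 uses `'n_estimators': 250`, a value the grid-based searches never try. The comparison between methods is therefore indicative rather than strictly like-for-like.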


## <center> <span style="color: red"> **Conclusion**

**Note:** Maximizing the target metric is not the purpose of this assignment. The main goal is to practice different approaches to hyperparameter optimization.

Basic search methods such as *GridSearchCV* and *RandomizedSearchCV* give good results, but they are extremely time- and resource-consuming.  
Advanced optimization therefore looks much more practical: **Optuna** and **Hyperopt** improve the metric even with a small number of trials (20-25 here), while taking much less time.

**F1-score for logistic regression model:**
- Baseline: 0.777
- GridSearchCV: 0.781
- RandomizedSearchCV: 0.781
- Hyperopt: 0.784
- Optuna: **0.787**

**F1-score for random forest model:**
- Baseline: 0.737
- GridSearchCV: 0.799
- RandomizedSearchCV: 0.80
- Hyperopt: **0.803**
- Optuna: 0.792
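
The gains over the baseline can be tabulated directly from the test scores listed above. A small sketch; the dictionary layout is illustrative and the numbers are copied from the two lists:

```python
scores = {
    'logistic regression': {
        'Baseline': 0.777, 'GridSearchCV': 0.781, 'RandomizedSearchCV': 0.781,
        'Hyperopt': 0.784, 'Optuna': 0.787,
    },
    'random forest': {
        'Baseline': 0.737, 'GridSearchCV': 0.799, 'RandomizedSearchCV': 0.80,
        'Hyperopt': 0.803, 'Optuna': 0.792,
    },
}

gains = {}
for model, by_method in scores.items():
    baseline = by_method['Baseline']
    # improvement of each tuned model over the untuned baseline
    gains[model] = {m: round(s - baseline, 3) for m, s in by_method.items() if m != 'Baseline'}
    best = max(gains[model], key=gains[model].get)
    print(f'{model}: best = {best} (+{gains[model][best]:.3f})')
```

Tuning helps the random forest far more than logistic regression, whose scores stay within about one point of the baseline.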