Perform hyperparameter tuning on the prepared **Titanic dataset** using:
1. `GridSearchCV`
2. `RandomizedSearchCV`

Tune hyperparameters of `LogisticRegression` as follows:
- target metric: F1-score
- hyperparameters: `penalty` (either L1 or L2) and `C` between 0.01 and 10
- 8-fold CV

For both grid and randomized search, check 200 combinations of hyperparameters. Pick the right `solver` and `max_iter` parameters. Note that the boundaries for the `C` hyperparameter must be the same for both approaches, but the implementation that enforces 200 combinations will differ.
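The difference in how the two searches arrive at 200 candidates can be sketched as follows. This is a minimal illustration rather than the required solution: it assumes a linearly spaced grid of 100 `C` values and a log-uniform sampling distribution, and it uses `ParameterGrid` and `ParameterSampler` from `sklearn.model_selection` only to count candidates, without fitting any model.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

C_MIN, C_MAX = 0.01, 10  # same boundaries for both approaches

# Grid search: the budget is fixed by the grid itself
# (2 penalties x 100 values of C = 200 combinations).
grid = ParameterGrid(
    {"penalty": ["l1", "l2"], "C": np.linspace(C_MIN, C_MAX, 100)}
)

# Randomized search: the budget is fixed by n_iter,
# independently of the distribution that C is drawn from.
sampler = ParameterSampler(
    {"penalty": ["l1", "l2"], "C": loguniform(C_MIN, C_MAX)},
    n_iter=200,
    random_state=42,
)
samples = list(sampler)

print(len(grid), len(samples))  # 200 200
```

In the grid case the number of candidates changes whenever the grid changes, while in the randomized case it stays at `n_iter` no matter how the distribution is defined.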

Print the best hyperparameters (`C` and `penalty`) for both `GridSearchCV` and `RandomizedSearchCV`. Are they similar?

Send the Jupyter notebook (with output), exported in `.html` format, by email to lkrain@sgh.waw.pl.

## A little data preprocessing

In [42]:
import pandas as pd
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, f1_score
from scipy.stats import loguniform
import numpy as np

In [22]:
ds_titanic = pd.read_csv(
    "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv",
    sep=",",
    header=0,
)

In [23]:
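# Encode sex as a binary feature (male = 1, female = 0)
ds_titanic['Sex'] = ds_titanic['Sex'].map({'male': 1, 'female': 0})
# Drop the target and the free-text name column from the features
X = ds_titanic.drop(columns=["Survived", "Name"])
y = ds_titanic.Survived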
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)
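One thing worth checking after the `map` call in the cell above: `Series.map` with a dictionary turns any label that is missing from the dictionary into `NaN`, and `LogisticRegression` rejects `NaN` inputs at fit time. A small sketch on a hypothetical toy series (the values are made up for illustration and are not taken from the Titanic file):

```python
import pandas as pd

# Hypothetical toy data; "Male" deliberately does not match the mapping keys.
toy_sex = pd.Series(["male", "female", "Male"])
encoded = toy_sex.map({"male": 1, "female": 0})

# Unmapped labels become NaN, so count them before training.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)  # 1
```

The same `isna().sum()` check could be run on the real `Sex` column before splitting the data.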

## Hyperparameter tuning

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

In [24]:
# liblinear supports both the L1 and L2 penalties tuned below
model = LogisticRegression(solver='liblinear', max_iter=1000)

In [43]:
# 2 penalties x 100 values of C = 200 combinations
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': np.linspace(0.01, 10, 100)
}
# F1-score is the target metric
f1_scorer = make_scorer(f1_score)

## Grid Search 

In [44]:
# GridSearchCV: evaluates every combination in param_grid with 8-fold CV
grid_search = GridSearchCV(
    estimator=model, param_grid=param_grid, scoring=f1_scorer, cv=8
)
grid_search.fit(X_train, y_train)

## Randomized Search

In [45]:
# RandomizedSearchCV: same boundaries for C as in the grid
random_param_grid = {
    'penalty': ['l1', 'l2'],
    'C': loguniform(0.01, 10)  # log-uniform distribution
}

In [46]:
# n_iter=200 matches the number of combinations in the grid search
random_search = RandomizedSearchCV(
    estimator=model, param_distributions=random_param_grid, n_iter=200,
    scoring=f1_scorer, cv=8, random_state=42,
)
random_search.fit(X_train, y_train)

## Final result

In [47]:
# Show the best hyperparameters
print("GridSearchCV best params:", grid_search.best_params_)
print("RandomizedSearchCV best params:", random_search.best_params_)

GridSearchCV best params: {'C': 9.596363636363636, 'penalty': 'l1'}
RandomizedSearchCV best params: {'C': 8.341930294140777, 'penalty': 'l1'}


In [48]:
# Compare the parameters
if grid_search.best_params_ == random_search.best_params_:
    print("The best parameters from both searches are the same.")
else:
    print("The best parameters from both searches differ.")

The best parameters from both searches differ.
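Exact equality of the two dictionaries will almost never hold, because randomized search draws `C` from a continuous distribution. A more forgiving check could compare the penalties exactly and the `C` values on a logarithmic scale. The helper name `params_similar` and the tolerance of 0.1 decades are assumptions chosen for illustration.

```python
import math

def params_similar(a, b, log_tol=0.1):
    """Return True if the penalties match and the C values
    are within log_tol decades (base-10 log units) of each other."""
    same_penalty = a["penalty"] == b["penalty"]
    close_c = abs(math.log10(a["C"]) - math.log10(b["C"])) <= log_tol
    return same_penalty and close_c

# Values copied from the output printed above.
grid_best = {"C": 9.596363636363636, "penalty": "l1"}
random_best = {"C": 8.341930294140777, "penalty": "l1"}

print(params_similar(grid_best, random_best))  # True
```

With this tolerance the two results count as similar: both use the L1 penalty and the `C` values differ by about 0.06 decades.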


In [49]:
random_search.best_score_

0.7576539022657165

In [50]:
grid_search.best_score_

0.7601340609958753

## Conclusion

In the final results, the regularization strength is lower (that is, `C` is higher) for grid search (`C` ≈ 9.60) than for randomized search (`C` ≈ 8.34), which means the grid-search model is allowed to be slightly more complex. Both searches selected the L1 penalty. The F1-scores are not identical, 0.758 for randomized search and 0.760 for grid search, but they are very close.
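One way to judge whether a gap like 0.758 versus 0.760 is meaningful is to compare it with the fold-to-fold spread stored in `cv_results_`. The sketch below is an illustration only: it uses synthetic data from `make_classification` as a stand-in for the Titanic features and a reduced budget of 20 candidates per search to keep it quick, so its numbers say nothing about the results above. The helper name `best_with_spread` is hypothetical.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (assumption): NOT the Titanic features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

model = LogisticRegression(solver="liblinear", max_iter=1000)

# Reduced budget (20 candidates each) so the sketch runs quickly.
grid = GridSearchCV(
    model,
    {"penalty": ["l1", "l2"], "C": np.linspace(0.01, 10, 10)},
    scoring="f1",
    cv=8,
).fit(X_demo, y_demo)
rand = RandomizedSearchCV(
    model,
    {"penalty": ["l1", "l2"], "C": loguniform(0.01, 10)},
    n_iter=20,
    scoring="f1",
    cv=8,
    random_state=42,
).fit(X_demo, y_demo)

def best_with_spread(search):
    """Mean and standard deviation of the F1-score across folds for the best candidate."""
    i = search.best_index_
    results = search.cv_results_
    return results["mean_test_score"][i], results["std_test_score"][i]

grid_mean, grid_std = best_with_spread(grid)
rand_mean, rand_std = best_with_spread(rand)
gap = abs(grid_mean - rand_mean)

print(f"grid:   {grid_mean:.3f} +/- {grid_std:.3f}")
print(f"random: {rand_mean:.3f} +/- {rand_std:.3f}")
print("gap smaller than fold spread:", gap < max(grid_std, rand_std))
```

If the gap between the two best scores is smaller than the spread across folds, the two searches are hard to tell apart on this metric.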