#Project 5: Hyperparameter Optimization for Titanic Survival Classifier
**Project Title:** Hyperparameter Optimization of Machine Learning Models for Titanic Survival Prediction

**Project Goal:** To systematically optimize the hyperparameters of a chosen machine learning model (Logistic Regression or Decision Tree) for the Titanic survival classification task. The objective is to identify the set of hyperparameters that maximizes model performance, demonstrating the impact of hyperparameter tuning compared to default model settings.

**Objectives:**

1. Model Selection and Initialization:
    * Select either Logistic Regression or Decision Trees as the base model for hyperparameter optimization.
    * Initialize the chosen model with its default parameters to establish a baseline performance.
2. Hyperparameter Search Strategy Implementation:
    * Implement two distinct hyperparameter search strategies:
        * Grid Search: Systematically explore a predefined grid of hyperparameter values for the selected model.
        * Randomized Search: Randomly sample hyperparameter values from specified distributions for the selected model.
    * Both search methods should incorporate cross-validation to ensure robust evaluation of hyperparameter combinations.
3. Model Training and Evaluation:
    * Train models using both Grid Search and Randomized Search to find the best-performing hyperparameter sets.
    * Evaluate the performance of the best models identified by each search method using appropriate classification metrics (e.g., accuracy, precision, recall, F1-score) on a held-out test set.
4. Performance Comparison and Analysis:
    * Compare the performance of the model with default parameters against the performance of the models tuned by Grid Search and Randomized Search.
    * Document the optimal hyperparameters found by each search method.
    * Analyze and present the improvement in evaluation metrics achieved through hyperparameter optimization.

**Tools/Libraries:**

* Python 3.x
* Pandas (for data manipulation, assuming a preprocessed dataset)
* Scikit-learn (for model implementations: LogisticRegression, DecisionTreeClassifier, GridSearchCV, RandomizedSearchCV, and classification metrics: accuracy_score, precision_score, recall_score, f1_score)
* NumPy (for numerical operations, if needed)

**Deliverables:**

* A well-structured and commented Python script (.py or .ipynb) demonstrating the implementation of Grid Search and Randomized Search.
* Clear output of the performance metrics (e.g., accuracy, F1-score) for:
    * The default model.
    * The best model from Grid Search.
    * The best model from Randomized Search.
* Documentation (inline comments or markdown) detailing the search spaces defined for each method and the rationale behind choosing them.
* A summary table or visualization comparing the performance of the default and tuned models.

**Success Criteria:**

* Both Grid Search and Randomized Search are correctly implemented and executed.
* Optimal hyperparameters are identified by each search method.
* A clear comparison of default vs. tuned model performance is provided, showcasing any improvements.
* The code is efficient, readable, and reproducible.
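Both search methods named in the objectives depend on cross-validation. The block below is a minimal sketch of what cross-validating a single hyperparameter setting looks like. It assumes synthetic stand-in data from `make_classification` (not the Titanic data), and the fold count and seed are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption; not the Titanic dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Five stratified folds: each fold keeps roughly the same class balance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Score one hyperparameter setting (C=1.0) on every fold.
model = LogisticRegression(C=1.0, max_iter=1000)
fold_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print("F1 per fold:", [round(float(s), 3) for s in fold_scores])
print("Mean F1:", round(float(fold_scores.mean()), 3))
```

A search repeats this scoring for every candidate setting and keeps the one with the highest mean score.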

In [None]:

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

###Step 1: Baseline Logistic Regression.

In [None]:
# Importing preprocessed data set
df = pd.read_csv("/content/titanic_preprocessed.csv")

In [None]:
df.head(5)

Unnamed: 0,Survived,Pclass,Age,Fare,Embarked_Q,Embarked_S,Sex_male,FamilySize,Title_Miss,Title_Mr,Title_Mrs,Title_Rare
0,0,3,-0.565736,0.014151,0.0,1.0,1.0,2,0.0,1.0,0.0,0.0
1,1,1,0.663861,0.139136,0.0,0.0,0.0,2,0.0,0.0,1.0,0.0
2,1,3,-0.258337,0.015469,0.0,1.0,0.0,1,1.0,0.0,0.0,0.0
3,1,1,0.433312,0.103644,0.0,1.0,0.0,2,0.0,0.0,1.0,0.0
4,0,3,0.433312,0.015713,0.0,1.0,1.0,1,0.0,1.0,0.0,0.0


In [None]:
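# Split into features (X) and target (y).
# Note: "Unnamed: 0" appears to be the row index saved with the CSV. It stays in X
# here, so the results below were obtained with it included as a feature.
X = df.drop("Survived", axis=1)
y = df["Survived"]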

In [None]:
# Stratified 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [None]:
# Initialize the baseline Logistic Regression (defaults, except max_iter raised to 1000 to help convergence)
log_reg = LogisticRegression(max_iter=1000)

In [None]:
#train the model
log_reg.fit(X_train, y_train)

In [None]:
#prediction
y_pred = log_reg.predict(X_test)

In [None]:
#Evaluation
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [None]:
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

Accuracy: 0.8101
Precision: 0.7778
Recall: 0.7101
F1 Score: 0.7424


This is the baseline performance of Logistic Regression on the held-out test set.
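As a quick sanity check on the reported numbers, F1 can be recomputed as the harmonic mean of precision and recall. The sketch below uses only the rounded values printed above, so small rounding differences are possible in general.

```python
# Reported baseline metrics from the output above.
precision = 0.7778
recall = 0.7101

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 from precision and recall: {f1:.4f}")  # 0.7424
```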

###Step 2: Hyperparameter Search Strategy Implementation

In [None]:
#Gridsearch
from sklearn.model_selection import GridSearchCV

In [None]:
# Define the parameter grid for Logistic Regression
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}
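The size of this grid can be worked out before running anything. The sketch below enumerates the combinations with `itertools.product`, which is only an illustration of the counting, not how `GridSearchCV` is implemented internally. The 5-fold setting is taken from the search configuration used in this notebook.

```python
from itertools import product

# Same grid as defined above.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"],
}

# Every candidate is one element of the Cartesian product of the value lists.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 5
print(len(candidates))            # 20 candidates
print(len(candidates) * n_folds)  # 100 model fits
```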


In [None]:

# Initialize Logistic Regression
log_reg = LogisticRegression(max_iter=1000)


In [None]:
# Setup GridSearchCV
grid_search = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    scoring='f1',
    cv=5,
    n_jobs=-1,
    verbose=1
)


In [None]:

# Run GridSearchCV
grid_search.fit(X_train, y_train)

# Best hyperparameters
print("Best Parameters from Grid Search:", grid_search.best_params_)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Parameters from Grid Search: {'C': 100, 'penalty': 'l1', 'solver': 'saga'}




Grid Search selected the following best parameters: {'C': 100, 'penalty': 'l1', 'solver': 'saga'}
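Beyond `best_params_`, a fitted search also exposes per-candidate scores in `cv_results_`, which helps judge how close the runners-up were. The sketch below assumes synthetic stand-in data and a smaller illustrative grid, so its numbers say nothing about the Titanic run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the Titanic dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid={"C": [0.01, 1, 100], "penalty": ["l1", "l2"]},
    scoring="f1",
    cv=5,
)
search.fit(X, y)

# Rank candidates by mean cross-validated F1 (rank 1 is the winner).
results = search.cv_results_
order = results["rank_test_score"].argsort()
for i in order[:3]:
    print(
        results["params"][i],
        f"mean F1 = {results['mean_test_score'][i]:.4f}",
        f"std = {results['std_test_score'][i]:.4f}",
    )
```

If the top candidates differ by less than the fold-to-fold standard deviation, the choice between them is not well separated by the search.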

In [None]:
#Random search for best parameters
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

In [None]:
# Define parameter distributions
param_dist = {
    'C': loguniform(0.001, 100),   # continuous range from 0.001 to 100
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}
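A log-uniform distribution spreads samples evenly across orders of magnitude rather than across raw values. The block below is a standard-library sketch of that idea (uniform exponent, then exponentiate). It is an illustration of the distribution's behavior and not a reproduction of `scipy.stats.loguniform` internals. The sample size and seed are arbitrary.

```python
import math
import random

rng = random.Random(42)
low, high = 0.001, 100

# Log-uniform sampling: draw the exponent uniformly, then exponentiate.
samples = [10 ** rng.uniform(math.log10(low), math.log10(high)) for _ in range(10_000)]

# Count how many samples land in each of the five decades between 0.001 and 100.
decades = [0] * 5
for c in samples:
    idx = int(math.floor(math.log10(c))) + 3
    idx = max(0, min(idx, 4))  # guard against floating-point edge cases
    decades[idx] += 1

print(decades)  # roughly 2,000 per decade
```

With a plain uniform distribution on the same range, about 90% of the samples would fall between 10 and 100, and small values of C would almost never be tried.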


In [None]:
# Initialize Logistic Regression
log_reg = LogisticRegression(max_iter=1000)

In [None]:
# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=log_reg,
    param_distributions=param_dist,
    n_iter=30,         # number of random combinations to test
    scoring='f1',
    cv=5,
    n_jobs=-1,
    verbose=1,
    random_state=42
)


In [None]:
random_search.fit(X_train, y_train)

print("Best Parameters from Random Search:", random_search.best_params_)

Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best Parameters from Random Search: {'C': np.float64(1.1462107403425035), 'penalty': 'l2', 'solver': 'saga'}


Random Search selected the following best parameters: {'C': np.float64(1.1462107403425035), 'penalty': 'l2', 'solver': 'saga'}

###Best Model Evaluation

In [None]:
# Evaluating the Grid Search model
# Best model from Grid Search
best_grid_model = grid_search.best_estimator_

In [None]:
# Predictions on test set
y_pred_grid = best_grid_model.predict(X_test)

In [None]:
# Evaluation
accuracy_grid = accuracy_score(y_test, y_pred_grid)
precision_grid = precision_score(y_test, y_pred_grid)
recall_grid = recall_score(y_test, y_pred_grid)
f1_grid = f1_score(y_test, y_pred_grid)

In [None]:
print("Grid Search Logistic Regression Performance:")
print(f"Accuracy: {accuracy_grid:.4f}")
print(f"Precision: {precision_grid:.4f}")
print(f"Recall: {recall_grid:.4f}")
print(f"F1 Score: {f1_grid:.4f}")

Grid Search Logistic Regression Performance:
Accuracy: 0.8380
Precision: 0.8125
Recall: 0.7536
F1 Score: 0.7820


This is the test-set performance of the best model found by Grid Search.

In [None]:
# Evaluating the Randomized Search model
# Best model from Randomized Search
best_random_model = random_search.best_estimator_

In [None]:
# Predictions on test set
y_pred_random = best_random_model.predict(X_test)

In [None]:
# Evaluation
accuracy_random = accuracy_score(y_test, y_pred_random)
precision_random = precision_score(y_test, y_pred_random)
recall_random = recall_score(y_test, y_pred_random)
f1_random = f1_score(y_test, y_pred_random)

In [None]:
print("Random Search Logistic Regression Performance:")
print(f"Accuracy: {accuracy_random:.4f}")
print(f"Precision: {precision_random:.4f}")
print(f"Recall: {recall_random:.4f}")
print(f"F1 Score: {f1_random:.4f}")

Random Search Logistic Regression Performance:
Accuracy: 0.8101
Precision: 0.7778
Recall: 0.7101
F1 Score: 0.7424


###Performance Comparison and Analysis:

In [None]:
# Comparison table built from the metrics computed above (no hard-coded numbers)
results = pd.DataFrame({
    "Model": ["Default Logistic Regression", "Grid Search Logistic Regression", "Random Search Logistic Regression"],
    "Accuracy": [accuracy, accuracy_grid, accuracy_random],
    "Precision": [precision, precision_grid, precision_random],
    "Recall": [recall, recall_grid, recall_random],
    "F1 Score": [f1, f1_grid, f1_random]
}).round(4)

print(results)

                               Model  Accuracy  Precision  Recall  F1 Score
0        Default Logistic Regression    0.8101     0.7778  0.7101    0.7424
1    Grid Search Logistic Regression    0.8380     0.8125  0.7536    0.7820
2  Random Search Logistic Regression    0.8101     0.7778  0.7101    0.7424
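The three evaluation blocks above repeat the same four metric calls. One possible way to reduce that repetition is a small helper. The `evaluate` function below is hypothetical (not part of scikit-learn), and the data and the two models are synthetic stand-ins used only to make the sketch runnable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return the four classification metrics for a fitted model (hypothetical helper)."""
    y_pred = model.predict(X_test)
    return {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
    }


# Synthetic stand-in data (an assumption; not the Titanic dataset).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Two illustrative models standing in for "default" and "tuned".
models = {
    "Default": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "C=100": LogisticRegression(C=100, max_iter=1000).fit(X_tr, y_tr),
}

rows = {name: evaluate(m, X_te, y_te) for name, m in models.items()}
for name, metrics in rows.items():
    print(name, {k: round(float(v), 4) for k, v in metrics.items()})
```

The resulting dictionary of rows could be passed to `pd.DataFrame.from_dict(rows, orient="index")` to build a comparison table.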


**Best Hyperparameters**

For **Grid Search**, the best hyperparameters it found are:

* C = 100

* penalty = l1

* solver = saga

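This means the model worked best with weak regularization: in scikit-learn, C is the inverse of the regularization strength, so C = 100 penalizes the coefficients only lightly. It used L1 regularization (which can shrink the coefficients of less important features to exactly zero) and the saga solver (one of the solvers that supports L1).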
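The relationship between C and the L1 penalty can be checked directly by counting non-zero coefficients at different values of C. The sketch below assumes synthetic stand-in data with ten features, and uses the liblinear solver for speed. The exact counts depend on the data and are not expected to match the Titanic model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the Titanic dataset).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

nonzero = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, penalty="l1", solver="liblinear", max_iter=1000)
    model.fit(X, y)
    # Count coefficients that L1 regularization did not shrink to exactly zero.
    nonzero[C] = int(np.count_nonzero(model.coef_))
    print(f"C={C:<6} non-zero coefficients: {nonzero[C]} of {X.shape[1]}")
```

Smaller C means a stronger penalty and typically fewer surviving coefficients. Larger C lets more features keep a non-zero weight.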

---


For **Random Search**, the best hyperparameters it found are:

* C ≈ 1.146

* penalty = l2

* solver = saga

This means the model worked best with C ≈ 1.15, which is close to scikit-learn's default of C = 1.0 and corresponds to moderate regularization. It used L2 regularization (which shrinks coefficients but keeps all features) and again the saga solver.

**Final Analysis**

First, I trained Logistic Regression with **default settings** (apart from `max_iter=1000`). This gave the baseline test-set performance (Accuracy = 0.8101, F1 = 0.7424).

Then I used **Grid Search**. This method evaluated all 20 hyperparameter combinations in the grid with 5-fold cross-validation and selected the one with the best mean F1 score. After tuning, the test-set Accuracy improved to **0.8380** and the F1 Score improved to **0.7820**. This shows Grid Search helped the model perform better.
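The size of the improvement can be expressed as absolute and relative gains over the baseline. The sketch below uses only the rounded test-set metrics reported in this notebook.

```python
# Reported test-set metrics (default model vs. Grid Search model).
baseline = {"Accuracy": 0.8101, "Precision": 0.7778, "Recall": 0.7101, "F1 Score": 0.7424}
grid = {"Accuracy": 0.8380, "Precision": 0.8125, "Recall": 0.7536, "F1 Score": 0.7820}

gains = {}
for metric, base in baseline.items():
    absolute = grid[metric] - base      # difference in metric points
    relative = 100 * absolute / base    # percentage change versus the baseline
    gains[metric] = (absolute, relative)
    print(f"{metric:<10} +{absolute:.4f} absolute  (+{relative:.1f}% relative)")
```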

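I also tried **Random Search**, but in my run it did not improve on the default model. The parameters it selected (C ≈ 1.146, L2 penalty) are close to the defaults (C = 1.0, L2 penalty), and its test-set metrics match the baseline to four decimal places.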

Overall, **Grid Search** gave the best test-set results in this project. It shows how hyperparameter optimization can improve model performance compared to default settings.