1. Problem identification 

2. Data wrangling

3. Exploratory data analysis

4. Pre-processing and training data development

5. **Modeling (Machine learning steps)**

6. Documentation

<div class="span5 alert alert-success">
<h3>Hyperparameter Tuning for MLA</h3>
- Can be used on both regression and classification problems.
https://www.kdnuggets.com/2022/10/hyperparameter-tuning-grid-search-random-search-python.html

## <font color='purple'><b>GridSearchCV</b></font> 

Method used in machine learning to find the best hyperparameters for a model by exhaustively searching through a predefined set of possible values. <span style="background-color:PowderBlue;">Every possible combination is tested systematically.</span>

  
1️⃣ Define the Parameter Grid: You specify a set of possible values for each hyperparameter you want to tune. This creates a grid of all possible combinations of these values.

2️⃣ Train and Evaluate: The grid search algorithm trains and evaluates the model for each combination of hyperparameters in the grid. This is typically done using cross-validation to ensure that the evaluation is reliable.

3️⃣ Select the Best Combination: After evaluating all combinations, the grid search algorithm selects the combination of hyperparameters that resulted in the best performance according to a specified metric (e.g., accuracy, RMSE).
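
To make step 1 concrete, the sketch below enumerates a small grid with scikit-learn's `ParameterGrid`, the same helper that expands a dictionary of candidate values into every combination. The grid values here are hypothetical and chosen only to keep the output short.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid for illustration: 3 values x 2 values = 6 combinations
param_grid = {
    "max_depth": [None, 5, 10],
    "criterion": ["gini", "entropy"],
}

# Expand the grid into a list of dictionaries, one per combination
combinations = list(ParameterGrid(param_grid))
print("Number of combinations:", len(combinations))
for params in combinations:
    print(params)

# With 5-fold cross-validation, every combination is fitted 5 times
n_fits = len(combinations) * 5
print("Model fits needed with cv=5:", n_fits)
```

The number of fits grows multiplicatively with every hyperparameter added, which is why an exhaustive grid search can become expensive quickly.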


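Model examples and their parameters
- KNeighborsClassifier()
    - n_neighbors → Number of neighbors used to make a prediction.
    - weights → How neighbors are weighted ("uniform", "distance").
    - metric → Distance metric used to find neighbors (e.g., "minkowski").
    - p → Power parameter for the Minkowski metric (1 = Manhattan, 2 = Euclidean).

- DecisionTreeClassifier()
    - max_depth → Maximum depth of the tree.
    - min_samples_split → Minimum samples required to split a node.
    - criterion → Measure of impurity ("gini", "entropy").
    - min_samples_leaf → Minimum samples required at a leaf node.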
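
As an illustration of how the KNeighborsClassifier parameters listed above could go into a grid, here is a minimal sketch. The iris dataset, the candidate values, and the names `X_demo`, `y_demo`, and `knn_search` are assumptions made for this example only. Because `p` only matters for the Minkowski metric, the grid is written as a list of two dictionaries so that `p` is not varied needlessly for other metrics.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset, used here only so the example is self-contained
X_demo, y_demo = load_iris(return_X_y=True)

# A list of grids: each dictionary is expanded separately
knn_param_grid = [
    {
        "n_neighbors": [3, 5, 7],
        "weights": ["uniform", "distance"],
        "metric": ["minkowski"],
        "p": [1, 2],  # only meaningful for the Minkowski metric
    },
    {
        "n_neighbors": [3, 5, 7],
        "weights": ["uniform", "distance"],
        "metric": ["chebyshev"],
    },
]

knn_search = GridSearchCV(KNeighborsClassifier(), knn_param_grid, cv=5, scoring="accuracy")
knn_search.fit(X_demo, y_demo)

print("Best Parameters:", knn_search.best_params_)
print("Best CV Accuracy:", knn_search.best_score_)
```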

  


**Code - Decision Tree** 

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train, X_test, y_train and y_test already exist from an earlier train/test split

# Define the model
model = DecisionTreeClassifier(random_state=42)

# Define the hyperparameter grid (only parameters a decision tree actually has)
param_grid = {
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# Set up GridSearchCV
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', return_train_score=True)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Retrieve the best parameters
print("Best Parameters:", grid_search.best_params_)

# best_estimator_ is the model refitted on all of X_train with the best parameters
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Analyze results
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results[['param_max_depth', 'param_min_samples_split', 'param_criterion', 'mean_test_score']])
```


## <font color='purple'><b>RandomizedSearchCV</b></font> 


RandomizedSearchCV is a hyperparameter tuning method in scikit-learn that searches for the best combination of hyperparameters by <span style="background-color: PowderBlue;">randomly sampling from a predefined distribution instead of exhaustively testing all possible values</span> like GridSearchCV does.

**Code** 
```python
from sklearn.model_selection import RandomizedSearchCV

from sklearn.ensemble import RandomForestClassifier

# Define model

model = RandomForestClassifier(random_state=42)

# Define hyperparameter distribution

param_dist = {

    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
    
}

# Set up RandomizedSearchCV

random_search = RandomizedSearchCV(model, param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42)

# Fit the model

random_search.fit(X_train, y_train)

# Get best parameters

print("Best Parameters:", random_search.best_params_)
```
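
The example above samples from fixed lists. RandomizedSearchCV also accepts distribution objects, which is where the idea of "sampling from a distribution" becomes visible. The sketch below uses `scipy.stats.randint`; the synthetic dataset, the ranges, and the names `X_demo`, `y_demo`, and `search` are assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data so the example runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# randint(low, high) draws integers from low up to high - 1
param_dist = {
    "n_estimators": randint(10, 50),
    "max_depth": randint(2, 10),
    "min_samples_split": randint(2, 11),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5,          # only 5 sampled combinations are evaluated
    cv=3,
    scoring="accuracy",
    random_state=42,   # makes the sampled combinations reproducible
)
search.fit(X_demo, y_demo)

print("Best Parameters:", search.best_params_)
print("Combinations tried:", len(search.cv_results_["params"]))
```

Here `n_iter` controls the cost directly: the search evaluates exactly that many combinations, no matter how large the search space is.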


## <font color='purple'><b>Bayesian Optimization</b></font> 

Bayesian Optimization is a smart way to find the best hyperparameters for a machine learning model without testing every possible option. <span style="background-color: PowderBlue;">It balances exploration, where it tries a few random values and learns from their results, with exploitation, where it uses what it has learned to make better guesses.</span> Although this method adjusts dynamically based on prior results, it still operates within a predefined search space, and it typically needs fewer evaluations than exhaustive methods like grid search.

Steps 

1️⃣ Define the function – Create a function that takes in parameters and outputs a single number, which represents how good those parameters are.

2️⃣ Set up Bayesian Optimization – Specify the function to optimize and define the range of possible values for each parameter.

3️⃣ Run the optimization – Start with random exploration (init_points) and then let the optimizer make smarter guesses (n_iter).

4️⃣ Retrieve the best parameters – At the end, check which values gave the best result for optimization.


**Code**

```python
# Installed with `pip install bayesian-optimization`; the importable module is `bayes_opt`
from bayes_opt import BayesianOptimization

# 1. Define the function you want to optimize
def objective_function(a, b):
    return a + b  # Example: maximizing the sum of two parameters

# 2. Set up the Bayesian Optimizer with a predefined search space
optimizer = BayesianOptimization(
    f=objective_function,
    pbounds={'a': (1, 3), 'b': (4, 7)},  # (lower, upper) bound for each parameter
    random_state=42  # For reproducibility
)

# 3. Run optimization
# The optimizer only maximizes; to minimize something (e.g. RMSE),
# return its negative from the objective function.
optimizer.maximize(
    init_points=3,  # Number of random explorations
    n_iter=5        # Number of optimization iterations
)

# 4. Print the best parameters and their associated maximized target
print("Best Parameters:", optimizer.max['params'])
print("Best Target Value:", optimizer.max['target'])
```
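
For hyperparameter tuning, the function from step 1 would typically return a cross-validated score. The sketch below shows one way such an objective could look. It does not run the optimizer itself; the synthetic dataset, the bounds, and the names `rf_cv_accuracy` and `pbounds` are assumptions for illustration. The optimizer proposes floating-point values, so integer hyperparameters are rounded inside the function.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data so the example runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

def rf_cv_accuracy(max_depth, min_samples_split):
    """Return mean cross-validated accuracy for the given hyperparameters."""
    model = RandomForestClassifier(
        n_estimators=30,
        max_depth=int(round(max_depth)),                  # floats are rounded to integers
        min_samples_split=int(round(min_samples_split)),
        random_state=42,
    )
    return cross_val_score(model, X_demo, y_demo, cv=3, scoring="accuracy").mean()

# Search space in the (lower, upper) format expected for pbounds
pbounds = {"max_depth": (2, 10), "min_samples_split": (2, 10)}

# Calling the objective directly, as the optimizer would for one candidate point
score = rf_cv_accuracy(max_depth=4.7, min_samples_split=3.2)
print("CV accuracy for this candidate:", score)
```

A function like this could then be passed as `f=rf_cv_accuracy` together with `pbounds=pbounds` when setting up the optimizer. Because accuracy is a "higher is better" score, it can be maximized as-is.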


## <font color='purple'><b>Cross Validation</b></font> 

Method used in machine learning to evaluate a model’s performance by splitting the data into multiple parts and testing it on different subsets. <span style="background-color:PowderBlue;">It helps ensure your model generalizes well to unseen data.</span>

🔁 How It Works

1️⃣ Split the Data: The dataset is divided into k equal parts (called “folds”).

2️⃣ Train and Test: The model is trained on k–1 folds and tested on the remaining fold. This process repeats k times, each time using a different fold for testing.

3️⃣ Average the Scores: The performance scores from each fold are averaged to give a more reliable estimate of how the model will perform in the real world.
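
The three steps can be written out by hand, which shows roughly what a helper such as `cross_val_score` automates. This is a minimal sketch: the synthetic dataset, the logistic regression model, and the names `X_demo`, `y_demo`, and `fold_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data so the example runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# 1. Split the data into k = 5 folds
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# 2. Train on k-1 folds and test on the remaining fold, k times
fold_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    model = LogisticRegression(max_iter=1000)  # a fresh model for every fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# 3. Average the scores
mean_score = np.mean(fold_scores)
print("Fold accuracies:", fold_scores)
print("Mean accuracy:", mean_score)
```

For classifiers, scikit-learn's helpers use stratified folds by default when `cv` is an integer, so their per-fold scores may differ slightly from this plain `KFold` loop.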

Why Use Cross-Validation

- Helps detect overfitting, because the score no longer depends on a single train/test split
- Gives a more reliable estimate of model performance
- Works with both regression and classification models
- Often used inside hyperparameter tuning methods like GridSearchCV


Model Examples
- LogisticRegression()
    - penalty
    - C
    - solver

- RandomForestClassifier()
    - n_estimators
    - max_depth
    - min_samples_split

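```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Assumes X (features) and y (labels) are already defined

# Define model
model = RandomForestClassifier(random_state=42)

# Apply cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Print results
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())

# Optional: other metrics, stored separately so they do not overwrite the accuracy scores.
# 'precision' and 'recall' expect binary labels; for multiclass problems
# use an averaged variant such as 'precision_macro' or 'recall_macro'.

# For precision
precision_scores = cross_val_score(model, X, y, cv=5, scoring='precision')

# For recall
recall_scores = cross_val_score(model, X, y, cv=5, scoring='recall')
```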

<div class="span5 alert alert-success">
<h3>Hyperparameter Tuning for Deep Learning</h3>

## <font color='purple'><b>KerasClassifier</b></font> 

- When using KerasClassifier (or KerasRegressor), your model must be wrapped inside a Python function. That’s because GridSearchCV or RandomizedSearchCV from scikit-learn needs to be able to rebuild the model fresh for each combination of hyperparameters it tries.
