# 05: Advanced Topic - Hyperparameter Tuning

In our `04_Model_Comparison.ipynb` notebook, we found that several models could achieve high accuracy on the Iris dataset. For more complex, real-world problems, however, getting the best performance often requires **hyperparameter tuning**. This notebook explains what that means and how to do it.

### What are Hyperparameters?

Think of hyperparameters as the settings or knobs for a machine learning model. Unlike the parameters a model learns from the data, hyperparameters are chosen before training starts. We can tune these settings to find the combination that yields the best performance. For this example, we will use a **Random Forest** model, which did not perform as well as the other models in the comparison notebook.

Some key hyperparameters for a Random Forest include:
- `n_estimators`: The number of decision trees in the forest.
- `max_depth`: The maximum depth of each tree.
- `min_samples_leaf`: The minimum number of samples required to be at a leaf node.
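
As a quick check of what these knobs look like before any tuning, the sketch below reads the three hyperparameters listed above from an untrained `RandomForestClassifier` using `get_params()`. The variable names are illustrative, and the values printed depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# Instantiate with the library defaults; nothing is trained here.
rf_default = RandomForestClassifier()
defaults = rf_default.get_params()

# Print only the three hyperparameters discussed in this notebook.
for name in ("n_estimators", "max_depth", "min_samples_leaf"):
    print(f"{name}: {defaults[name]}")
```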

### Setup and Data Loading

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Load and prepare the data
df = pd.read_csv('../data/iris.csv')
X = df.drop('target', axis=1)
y = df['target']

# Use the same random_state to get the exact same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### Define the Hyperparameter Grid

We will create a "grid" of hyperparameters to test. `GridSearchCV` will then systematically try every combination of these settings to find the best one.

In [2]:
param_grid = {
    'n_estimators': [50, 100, 200],      # Test with 50, 100, or 200 trees
    'max_depth': [None, 10, 20, 30],       # Test different tree depths
    'min_samples_leaf': [1, 2, 4]         # Test different minimum leaf sizes
}
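
To see how large this grid is before running the search, one option is scikit-learn's `ParameterGrid`, which expands a grid dictionary into its individual combinations. This is a minimal sketch that repeats the grid so it can run on its own; `candidates` and `n_folds` are names chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_leaf': [1, 2, 4],
}

# Expand the grid into one dictionary per combination (3 * 4 * 3 of them).
candidates = list(ParameterGrid(param_grid))
n_folds = 5  # matches the cv=5 setting used for the grid search in this notebook

print(f"Candidates: {len(candidates)}")
print(f"Total fits with {n_folds}-fold CV: {len(candidates) * n_folds}")
print(f"First candidate: {candidates[0]}")
```

Each additional value in any list multiplies the number of fits, so it helps to check this count before launching a larger search.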

### Perform the Grid Search

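Now we create an instance of `GridSearchCV`. We pass it our model and the parameter grid, and tell it to use `cv=5` (5-fold cross-validation). Each candidate is trained five times, each time holding out a different fifth of the training data for validation, so its score depends less on any single split. Setting `n_jobs=-1` runs the fits in parallel on all available CPU cores.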

In [3]:
# Create a Random Forest model
rf = RandomForestClassifier(random_state=42)

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit the grid search to the data (this will take a moment)
grid_search.fit(X_train, y_train)

print("Grid search complete!")

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Grid search complete!
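
Beyond the single best combination, a fitted `GridSearchCV` keeps the cross-validation scores of every candidate in its `cv_results_` attribute. The sketch below shows one way to inspect them as a table. It rests on two assumptions made so it can run on its own: scikit-learn's bundled Iris data stands in for `../data/iris.csv`, and a reduced, hypothetical grid (`small_grid`) replaces the full one to keep the run short.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: the bundled Iris data stands in for ../data/iris.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Reduced grid (hypothetical, for speed); the notebook's grid has 36 candidates.
small_grid = {'n_estimators': [10, 50], 'min_samples_leaf': [1, 4]}

search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5)
search.fit(X_train, y_train)

# One row per candidate, with the mean and spread of its 5 validation scores.
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score').to_string(index=False))
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

If several candidates have mean scores within one standard deviation of each other, the "best" one is largely a tie-break rather than a meaningful winner.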


### View the Best Parameters and Evaluate

After the search is complete, we can see which combination of hyperparameters performed the best.

In [5]:
print(f'Best Parameters: {grid_search.best_params_}')

# Get the best model found by the grid search
best_model = grid_search.best_estimator_

# Make predictions with the best model
y_pred = best_model.predict(X_test)

# Print the final classification report
print('--- Final Report on Best Model ---')
print(classification_report(y_test, y_pred))

Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'n_estimators': 50}
--- Final Report on Best Model ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.82      0.90      0.86        10
           2       0.89      0.80      0.84        10

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.90        30
weighted avg       0.90      0.90      0.90        30



### Final Accuracy Score

In [6]:
# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f'Final Accuracy of the Tuned Model: {accuracy * 100:.2f}%')

Final Accuracy of the Tuned Model: 90.00%


### Save the Model with the Best Parameters

In [7]:
# Save the model with the best parameters
import joblib

# Note: joblib.dump does not create directories, so ../models must already exist
model_filename = '../models/iris_random_forest_best.joblib'
joblib.dump(best_model, model_filename)

['../models/iris_random_forest_best.joblib']
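
A saved model is only useful if it can be loaded again. The following sketch shows a round trip with `joblib.dump` and `joblib.load`. To stay self-contained it trains a small stand-in model on scikit-learn's bundled Iris data and writes to a temporary directory; the file name `iris_random_forest_demo.joblib` is hypothetical and is not the file saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in model (assumption): trained here only to demonstrate the round trip.
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'iris_random_forest_demo.joblib')
    joblib.dump(model, path)       # write the fitted model to disk
    reloaded = joblib.load(path)   # read it back into memory

# The reloaded model should give the same predictions as the original.
same_predictions = np.array_equal(model.predict(X), reloaded.predict(X))
print(f"Reloaded model reproduces predictions: {same_predictions}")
```

Files written by `joblib` are generally safest to load with the same scikit-learn version that produced them.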

### Conclusion: When Tuning Doesn't Improve Performance

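This is an important result. In our `04_Model_Comparison.ipynb` notebook, the default Random Forest model scored 90% accuracy. After searching 36 hyperparameter combinations, the final accuracy of our tuned model remained at **90%**. The selected values help explain why: `max_depth=None` and `min_samples_leaf=1` are scikit-learn's defaults, so the search ended up very close to the default configuration, differing mainly in the number of trees. This demonstrates a critical concept in machine learning: **tuning is not a guaranteed improvement.**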

**Why didn't the accuracy improve?**
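1.  **Small Dataset:** The Iris dataset is very small. With only 30 samples in our test set, a single prediction changes the accuracy by about 3.3 percentage points, so the test set cannot resolve small differences between hyperparameter settings. The default parameters of the Random Forest were likely already sufficient to learn the patterns effectively, leaving little room for a better set of hyperparameters to show up in the score.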
2.  **Inherent Data Overlap:** The errors in the Random Forest model occurred when distinguishing between species 1 (Versicolor) and 2 (Virginica). It's likely that the few data points that were misclassified are genuinely ambiguous and lie in an overlapping region of the feature space. No amount of tuning the model's settings can fix ambiguity that is inherent to the data itself.
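
To make the small-test-set point concrete, here is a rough back-of-the-envelope sketch. It takes the 30 test samples and 90% accuracy from the report above (27 correct predictions) and computes how much one prediction is worth, along with an approximate 95% Wilson score interval for the accuracy. The Wilson interval is an illustrative choice made here, not something the notebook itself uses.

```python
import math

n_test = 30      # test-set size from the classification report
n_correct = 27   # 90% of 30 predictions were correct

step = 1 / n_test            # accuracy change when one prediction flips
p_hat = n_correct / n_test   # observed accuracy

# 95% Wilson score interval for a binomial proportion (z ≈ 1.96).
z = 1.96
denom = 1 + z**2 / n_test
center = (p_hat + z**2 / (2 * n_test)) / denom
half_width = z * math.sqrt(
    p_hat * (1 - p_hat) / n_test + z**2 / (4 * n_test**2)
) / denom
low, high = center - half_width, center + half_width

print(f"One test sample is worth {step * 100:.2f} accuracy points")
print(f"Approximate 95% interval for the true accuracy: {low:.2f} to {high:.2f}")
```

An interval this wide means that differences of a few points between models or settings on this test set should be read with caution.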

#### Is it Worth Tuning a Different Model?

Let's look at our results from the `04_Model_Comparison.ipynb` notebook:
- **K-Nearest Neighbors (KNN): 100% Accuracy**
- **Support Vector Machine (SVM): 97% Accuracy**

Given that the default KNN model already achieved a perfect score on this test split, there is no practical benefit to be gained from tuning it further for this specific problem. The goal is to find the best-performing model, and here a simpler approach already achieved that.

**Final Takeaway:** The most valuable step in this exercise was the **model comparison**. It quickly showed us that for the Iris dataset, the KNN algorithm is a superior choice to the Random Forest. This teaches us an important lesson: it's often more effective to try several different types of models than it is to spend a great deal of time tuning a single model that may not be the best fit for the data.
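
As a closing illustration of trying several model types before tuning one, the sketch below scores three default classifiers with 5-fold cross-validation via `cross_val_score`. It is not the code from `04_Model_Comparison.ipynb`: it assumes scikit-learn's bundled Iris data in place of `../data/iris.csv` and evaluates with cross-validation rather than a single test split, so its numbers will not match the accuracies quoted above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumption: the bundled Iris data stands in for ../data/iris.csv.
X, y = load_iris(return_X_y=True)

# Default settings for each model type; no tuning is applied.
models = {
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(),
    'Random Forest': RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=5)  # accuracy on each fold
    scores[name] = fold_scores.mean()
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

A comparison like this takes only a few lines and can show early on which model families are worth the effort of a full grid search.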