# SPRINT 6: Hyperparameter Tuning

## Objective
In this final technical notebook, we take our best-performing model from the previous stage (an SVM trained on the RFE-selected features) and fine-tune its **hyperparameters**. Hyperparameters are the model "settings" that we, as data scientists, choose before training rather than values the model learns from the data. Finding a better combination of these settings can improve performance, although the size of the gain depends on the model and the dataset.

### Method:
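We will use **Grid Search Cross-Validation (`GridSearchCV`)**. This technique exhaustively tries every combination of the hyperparameter values we provide and uses cross-validation to pick the combination with the best mean validation score. The result is the best combination within the grid we supply, not necessarily the best possible one.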
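
To make the procedure concrete, the sketch below runs a grid search by hand and checks the result against `GridSearchCV`. It is only an illustration: the synthetic dataset from `make_classification` and the small grid `demo_grid` are assumptions made for this example, not the heart-disease data or the grid used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately small, illustrative grid.
demo_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]}

# Manual grid search: score every combination with 5-fold cross-validation
# and keep the one with the highest mean accuracy.
manual_best_score, manual_best_params = -1.0, None
for params in ParameterGrid(demo_grid):
    scores = cross_val_score(SVC(kernel='rbf', **params), X_demo, y_demo, cv=5)
    if scores.mean() > manual_best_score:
        manual_best_score, manual_best_params = scores.mean(), params

# GridSearchCV automates the same loop.
demo_search = GridSearchCV(SVC(kernel='rbf'), demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

print("Manual best:      ", manual_best_params, round(manual_best_score, 4))
print("GridSearchCV best:", demo_search.best_params_, round(demo_search.best_score_, 4))
```

Both approaches should report the same best mean score, because they evaluate the same candidates on the same folds.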

In [1]:
# --- 1. Import Libraries ---
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# --- 2. Load the RFE-selected Feature Data ---
# This was the dataset that gave us our best model.
RFE_DATA_PATH = '../data/heart_disease_rfe_features.csv'
df = pd.read_csv(RFE_DATA_PATH)

# --- 3. Separate Features (X) and Target (y) ---
X = df.drop('target', axis=1)
y = df['target']

# --- 4. Split the data into Training and Testing sets ---
# We need this to have a final, unseen test set to evaluate our tuned model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 5. Verify the shapes ---
print("--- Data is ready for tuning ---")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

--- Data is ready for tuning ---
X_train shape: (242, 12)
X_test shape: (61, 12)


## Step 2: Defining the Hyperparameter Grid and Running GridSearchCV

Now we define the "grid" of hyperparameter values we want to test for our `SVC` model. We tune three hyperparameters:

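-   **`C`**: The regularization parameter. It controls the trade-off between a wide margin (a simpler decision boundary) and classifying every training point correctly. Smaller values mean stronger regularization; larger values fit the training data more closely and increase the risk of overfitting.
-   **`gamma`**: The kernel coefficient for the `rbf` kernel. It defines how far the influence of a single training example reaches: low values mean "far", high values mean "close". It is ignored by the `linear` kernel.
-   **`kernel`**: Specifies the kernel type to be used in the algorithm. Here we compare `rbf` and `linear`.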

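`GridSearchCV` will systematically build a model for every single combination of these values and use cross-validation to determine which combination scores best. With 4 values of `C`, 4 values of `gamma`, and 2 kernels, that is 32 candidates, each fitted on 5 folds, for 160 fits in total.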
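
Because `gamma` has no effect on the `linear` kernel, a single-dictionary grid evaluates the same linear model once per `gamma` value. The sketch below shows one possible alternative, assuming we only want distinct candidates: `param_grid` also accepts a list of dictionaries, so `gamma` can be varied for the RBF kernel alone. The names `full_grid` and `compact_grid` are illustrative and not part of the notebook.

```python
from sklearn.model_selection import ParameterGrid

# Single-dictionary grid: every kernel is paired with every gamma value.
full_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'linear'],
}

# List-of-dictionaries grid: gamma is only varied for the RBF kernel.
compact_grid = [
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]},
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
]

n_full = len(ParameterGrid(full_grid))
n_compact = len(ParameterGrid(compact_grid))
print(f"Single dict: {n_full} candidates, list of dicts: {n_compact} candidates")
```

Passing `compact_grid` as `param_grid` would cover the same distinct models with 20 candidates instead of 32.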

In [2]:
# --- 1. Define the parameter grid to search ---
# These are the different 'settings' we want to test for the SVM.
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'linear']
}

# --- 2. Instantiate GridSearchCV ---
# estimator: The model we are tuning.
# param_grid: The dictionary of parameters to test.
# cv=5: Use 5-fold cross-validation.
# verbose=2: Show detailed progress as it runs.
# refit=True: Automatically retrain the best model on the entire training set.
grid_search = GridSearchCV(
    estimator=SVC(random_state=42),
    param_grid=param_grid,
    cv=5,
    verbose=2,
    refit=True
)

# --- 3. Run the grid search on the training data ---
# Run time grows with the size of the grid: here 32 candidates x 5 folds = 160 fits.
print("--- Starting Grid Search ---")
grid_search.fit(X_train, y_train)
print("--- Grid Search Finished ---")


# --- 4. Print the best parameters found ---
print("\n--- Best Hyperparameters Found ---")
print(grid_search.best_params_)

--- Starting Grid Search ---
Fitting 5 folds for each of 32 candidates, totalling 160 fits
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.0s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.0s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.0s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.0s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.0s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.0s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.0s
[CV] END .........
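
The per-fold log is long, so it can help to read the summary that a fitted search stores in `cv_results_` and `best_score_` instead. The following is a minimal sketch on synthetic data; the dataset, the small grid, and the name `demo_search` are assumptions for illustration, so the printed numbers say nothing about the heart-disease model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_search = GridSearchCV(
    SVC(), {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}, cv=5
)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per candidate.
results = demo_search.cv_results_
order = np.argsort(results['rank_test_score'])  # best-ranked candidates first

for i in order[:3]:
    print(
        results['rank_test_score'][i],
        results['params'][i],
        f"mean={results['mean_test_score'][i]:.4f}",
        f"std={results['std_test_score'][i]:.4f}",
    )

# best_score_ is the mean cross-validated score of the winning candidate.
print("Best CV accuracy:", round(demo_search.best_score_, 4))
```

Looking at the standard deviation next to the mean shows whether the top candidates are clearly separated or effectively tied.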

## Step 3: Evaluating the Final Tuned Model

Now that `GridSearchCV` has found the best hyperparameters, our final step is to use this single best model to make predictions on our held-out test set (`X_test`). Because the test set played no part in tuning, this gives us an honest estimate of how the optimized model performs on unseen data, and we can compare it to the original, untuned SVM.

In [3]:
# --- 1. Re-evaluate the original, untuned model for a direct comparison ---
# Instantiate the original SVM model again
original_svm_model = SVC(random_state=42)
# Fit it on the original training data (from RFE)
original_svm_model.fit(X_train, y_train)
# Make predictions
y_pred_original = original_svm_model.predict(X_test)
# Calculate its accuracy
svm_accuracy = accuracy_score(y_test, y_pred_original)

# --- 2. Get the best tuned model from the grid search ---
best_svm_model = grid_search.best_estimator_

# --- 3. Make predictions with the tuned model ---
y_pred_tuned = best_svm_model.predict(X_test)

# --- 4. Calculate the tuned model's accuracy ---
tuned_accuracy = accuracy_score(y_test, y_pred_tuned)

# --- 5. Compare the two models ---
print("--- Final Model Performance Comparison ---")
print(f"Original SVM Accuracy (from Exp. A): {svm_accuracy:.4f}")
print(f"Tuned SVM Accuracy: {tuned_accuracy:.4f}")

# Calculate and print the relative improvement of the tuned model over the original, in percent
if svm_accuracy > 0:
    improvement = ((tuned_accuracy - svm_accuracy) / svm_accuracy) * 100
    print(f"\nImprovement: {improvement:.2f}%")

--- Final Model Performance Comparison ---
Original SVM Accuracy (from Exp. A): 0.8689
Tuned SVM Accuracy: 0.8852

Improvement: 1.89%
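
To put these numbers in context, the short calculation below works backwards from the printed accuracies. With a test set of 61 samples, 0.8689 and 0.8852 correspond to 53 and 54 correct predictions, so the tuned model classifies one more test sample correctly. The variable names are illustrative only.

```python
n_test = 61  # size of the test set, from the X_test shape (61, 12)

# The counts of correct predictions that round to the printed accuracies.
correct_original = 53
correct_tuned = 54

acc_original = correct_original / n_test
acc_tuned = correct_tuned / n_test

# Relative gain (what the notebook prints) versus absolute gain.
relative_gain = (acc_tuned - acc_original) / acc_original * 100
absolute_gain = (acc_tuned - acc_original) * 100

print(f"Original: {acc_original:.4f}, tuned: {acc_tuned:.4f}")
print(f"Relative gain: {relative_gain:.2f}%")
print(f"Absolute gain: {absolute_gain:.2f} percentage points")
```

The 1.89% figure is a relative gain; in absolute terms the accuracy rises by about 1.64 percentage points. On a test set this small, a difference of a single sample should be read with caution.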


In [4]:
# --- 1. Import the libraries ---
import os
import joblib

# --- 2. Get the best model from the grid search ---
best_model = grid_search.best_estimator_

# --- 3. Define the path to save the model ---
MODEL_PATH = '../models/final_model.pkl'
# Create the target directory if it does not exist yet, so the dump cannot fail on a missing folder.
os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)

# --- 4. Use joblib to dump (save) the model to a file ---
joblib.dump(best_model, MODEL_PATH)

print(f"Model successfully saved to: {MODEL_PATH}")
print("\n--- Mission Accomplished for Notebook 06! ---")

Model successfully saved to: ../models/final_model.pkl

--- Mission Accomplished for Notebook 06! ---
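
A saved model is only useful if it can be loaded again. The sketch below shows a save-and-load round trip with `joblib.load`. It is an illustration under assumptions: it trains a small stand-in `SVC` on synthetic data and writes to a temporary, hypothetical path (`demo_model.pkl`) rather than touching `../models/final_model.pkl`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Stand-in model trained on synthetic data (assumption: not the tuned model).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = SVC().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_model.pkl')  # hypothetical path
    joblib.dump(model, demo_path)      # save the fitted model
    restored = joblib.load(demo_path)  # load it back from disk

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print("Predictions identical after reload:", same_predictions)
```

When loading the real model, the input must contain the same feature columns, in the same order, as the data used for training. Model files should only be loaded from trusted sources, ideally with the same scikit-learn version that created them.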
