In [1]:
## 📚 1. Setup and Data Loading

import pandas as pd
import numpy as np
import time
from sklearn.model_selection import KFold, RandomizedSearchCV # NEW TOOL!
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, make_scorer

# --- Generate a synthetic regression dataset (make_regression keeps the demo small and fast) ---
X, y = make_regression(
    n_samples=500,        # 500 samples
    n_features=10,        # 10 features
    n_informative=5,
    random_state=42
)

# Wrap MAE as a scorer. scikit-learn always maximizes the score, so
# greater_is_better=False negates MAE (the search then maximizes negative MAE).
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)

print(f"Dataset loaded: {len(X)} samples for regression.")
print("Goal: Minimize Mean Absolute Error (MAE).")

Dataset loaded: 500 samples for regression.
Goal: Minimize Mean Absolute Error (MAE).
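
The sign convention behind `greater_is_better=False` can be illustrated without scikit-learn. The following is a minimal sketch, assuming the search simply keeps the candidate with the highest score. The helpers `mae` and `neg_mae_score`, the candidate names, and the toy numbers are hypothetical and not part of this notebook.

```python
def mae(y_true, y_pred):
    """Mean absolute error: lower is better."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def neg_mae_score(y_true, y_pred):
    """Negated MAE: higher is better, so a search can maximize it."""
    return -mae(y_true, y_pred)

y_true = [10.0, 20.0, 30.0]
candidates = {
    "model_a": [12.0, 18.0, 33.0],   # absolute errors 2, 2, 3
    "model_b": [10.5, 19.5, 30.5],   # absolute errors 0.5 each
}

scores = {name: neg_mae_score(y_true, pred) for name, pred in candidates.items()}
best = max(scores, key=scores.get)   # highest (least negative) score wins

# Flip the sign back to report a readable, positive MAE
print(best, -scores[best])
```

scikit-learn also accepts the built-in string `scoring='neg_mean_absolute_error'`, which follows the same convention.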


## 🎲 2. Introduction to Random Search

**Random Search** is a powerful and efficient alternative to Grid Search. Instead of testing *every* single combination in a predefined grid:

1.  It randomly samples a **fixed number of combinations** from the specified hyperparameter distributions.
2.  It uses Cross-Validation to score only those randomly sampled combinations.
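
The sampling step in the list above can be sketched with the standard library alone. This is an illustration under stated assumptions, not scikit-learn's implementation: the toy `space` and the budget of 5 are hypothetical, and the draw is done without replacement, which is how scikit-learn documents its behavior when every parameter is given as a list.

```python
import itertools
import random

# Hypothetical toy search space (discrete lists, as in a grid)
space = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [5, 10, None],
}

# Every combination a full grid search would evaluate
keys = list(space)
full_grid = [dict(zip(keys, values)) for values in itertools.product(*space.values())]

# Random search: draw a fixed number of combinations without replacement
rng = random.Random(42)
n_iter = 5
sampled = rng.sample(full_grid, n_iter)

print(f"Grid size: {len(full_grid)}, sampled: {len(sampled)}")
for params in sampled:
    print(params)
```

Only the sampled combinations would then be scored with cross-validation. In scikit-learn, `ParameterSampler` plays a similar role.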

### 🧠 Why Random Search is Better (Often)

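Grid Search spends its budget on every combination, including many that differ only in a hyperparameter with little effect on the score (e.g., changing `min_samples_split` may barely matter once `max_depth` already keeps the trees shallow). Random Search spreads the same limited computational budget across different *regions* of the parameter space, so it often finds a near-optimal solution with far fewer fits. When it samples from continuous distributions, it can even find a better solution than a coarse grid. When the space consists of discrete lists, as in this notebook, it draws from the same combinations a full grid would test.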
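
One way to see the budget argument is to count how many distinct values of a single hyperparameter each strategy tries. The sketch below assumes two hypothetical hyperparameters on the range [0, 1) and a budget of 9 trials. It samples from a continuous range, which this notebook's list-based search space does not do.

```python
import random

budget = 9  # same number of trials for both strategies

# Grid: 3 values per parameter -> 3 x 3 = 9 trials
grid_values = [0.1, 0.5, 0.9]
grid_trials = [(a, b) for a in grid_values for b in grid_values]

# Random: 9 independent draws from the continuous range [0, 1)
rng = random.Random(0)
random_trials = [(rng.random(), rng.random()) for _ in range(budget)]

# Distinct values tried for the first hyperparameter
grid_distinct = len({a for a, _ in grid_trials})
random_distinct = len({a for a, _ in random_trials})

print(f"Grid tried {grid_distinct} distinct values, random tried {random_distinct}")
```

If only the first hyperparameter matters, the grid has effectively run 3 informative experiments, while the random draws have run 9.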

We will time Random Search, report its best score, and extrapolate how long a full Grid Search over the same search space would take.

In [2]:
## 🛠️ 3. Defining the Search Space

# We use a Random Forest Regressor since our main project is regression
rf_model = RandomForestRegressor(random_state=42)

# We define the search space (which is large to make the comparison meaningful)
# Note: n_estimators and max_depth are the key tuning parameters for RF.
param_distribution = {
    'n_estimators': [50, 100, 150, 200],                  # 4 values (trees in the forest)
    'max_depth': [5, 10, 15, 20, 30, None],               # 6 values (max depth of each tree)
    'min_samples_split': [2, 5, 10, 20],                  # 4 values
    'max_features': [1.0, 'sqrt', 'log2']                 # 3 values
}

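# The hypothetical full grid size is the product of the list lengths: 4 * 6 * 4 * 3 = 288
full_grid_size = int(np.prod([len(values) for values in param_distribution.values()]))
print(f"Hypothetical Full Grid Size: {full_grid_size} models.")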

Hypothetical Full Grid Size: 288 models.
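
The grid size can also be derived from the dictionary itself, so it stays correct if a list changes. The sketch below copies the search space from the cell above and takes the budget of 40 from section 4. The variable names `grid_size` and `coverage` are illustrative.

```python
import math

# Copy of the search space defined in the cell above
param_distribution = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [5, 10, 15, 20, 30, None],
    'min_samples_split': [2, 5, 10, 20],
    'max_features': [1.0, 'sqrt', 'log2'],
}

# Full grid size = product of the number of values per hyperparameter
grid_size = math.prod(len(values) for values in param_distribution.values())

# Fraction of the grid that a 40-candidate random search evaluates
n_iter = 40
coverage = n_iter / grid_size

print(f"Grid size: {grid_size}, random search coverage: {coverage:.1%}")
```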


In [3]:
## 📊 4. Implementing RandomizedSearchCV

# Set up the Cross-Validation strategy (Standard KFold for general regression)
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)

# --- Random Search Specifics ---
n_iterations = 40  # We will test only 40 random combinations (out of 288 possible)

random_search = RandomizedSearchCV(
    estimator=rf_model,                 # Random Forest Regressor
    param_distributions=param_distribution, # The search space distribution
    n_iter=n_iterations,                # The fixed number of random combinations to test
    scoring=mae_scorer,                 # Metric: Negative MAE (to be maximized)
    cv=cv_strategy,                     # CV strategy
    verbose=1,
    random_state=42,
    n_jobs=-1
)

print(f"\nStarting Random Search ({n_iterations} iterations * 5 folds = {n_iterations * 5} trainings)...")
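# perf_counter() is a monotonic clock intended for measuring elapsed wall-clock time
start_time_random = time.perf_counter()
random_search.fit(X, y)
end_time_random = time.perf_counter()
random_time = end_time_random - start_time_random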


Starting Random Search (40 iterations * 5 folds = 200 trainings)...
Fitting 5 folds for each of 40 candidates, totalling 200 fits
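
The "candidates x folds" bookkeeping in the log above can be reproduced with a small pure-Python loop. This is a rough sketch of the general idea, not of scikit-learn's internals. It assumes hypothetical toy data, a simple k-nearest-neighbours regressor standing in for the forest, and a single hyperparameter `k`.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical toy data: y is roughly 3 * x plus Gaussian noise
xs = [i / 10 for i in range(60)]
ys = [3.0 * x + rng.gauss(0, 0.5) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return statistics.mean(y for _, y in nearest)

def cv_score(k, folds):
    """Mean negative MAE of k-NN across the held-out folds."""
    fold_scores = []
    for held_out in folds:
        held = set(held_out)
        train = [(xs[i], ys[i]) for i in range(len(xs)) if i not in held]
        errors = [abs(ys[i] - knn_predict(train, xs[i], k)) for i in held_out]
        fold_scores.append(-statistics.mean(errors))
    return statistics.mean(fold_scores)

# 5 shuffled folds, analogous to KFold(n_splits=5, shuffle=True)
indices = list(range(len(xs)))
rng.shuffle(indices)
n_splits = 5
folds = [indices[i::n_splits] for i in range(n_splits)]

# Random search: evaluate only n_iter of the candidate values
candidates = [1, 2, 3, 5, 8, 13, 21, 34]
n_iter = 4
sampled = rng.sample(candidates, n_iter)

results = {k: cv_score(k, folds) for k in sampled}
best_k = max(results, key=results.get)   # highest negative MAE wins
total_fits = n_iter * n_splits

print(f"{total_fits} fold evaluations; best k = {best_k}, MAE = {-results[best_k]:.3f}")
```

Each sampled candidate is scored once per fold, which is why 40 candidates with 5 folds produce 200 fits in the cell above.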


In [4]:
## 📈 5. Analyzing and Comparing Results

# 5.1. Random Search Results
print("\n--- Random Search Results ---")
print(f"Time Taken: {random_time:.2f} seconds.")
# Since scoring is negative MAE, we make it positive for readability
print(f"Best CV Score (MAE): {-random_search.best_score_:.3f}") 
print(f"Best Hyperparameters Found: {random_search.best_params_}")

# 5.2. Time Comparison (Hypothetical Grid Search)
# Grid Search is NOT run here. Its time is extrapolated linearly:
# (Total Combinations / Tested Combinations) * Random Search Time
total_combos = 288  # 4 * 6 * 4 * 3, the full grid size from section 3
hypothetical_grid_time = (total_combos / n_iterations) * random_time

print(f"\n--- Efficiency Comparison ---")
print(f"Random Search Tested: {n_iterations} combinations.")
print(f"Hypothetical Grid Search Time ({total_combos} combos): {hypothetical_grid_time:.2f} seconds.")

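# Final Comparison
# Note: this ratio equals total_combos / n_iterations by construction,
# and the full grid's best score was not measured in this notebook.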
print(f"\nRandom Search was {hypothetical_grid_time / random_time:.1f}x faster while achieving a competitive score!")


--- Random Search Results ---
Time Taken: 63.16 seconds.
Best CV Score (MAE): 18.892
Best Hyperparameters Found: {'n_estimators': 200, 'min_samples_split': 2, 'max_features': 1.0, 'max_depth': 20}

--- Efficiency Comparison ---
Random Search Tested: 40 combinations.
Hypothetical Grid Search Time (288 combos): 454.72 seconds.

Random Search was 7.2x faster while achieving a competitive score!
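
The 7.2x figure follows from the extrapolation formula rather than from a measured Grid Search run. The short sketch below reuses the notebook's two constants and shows that the measured time cancels out. The helper `extrapolated_speedup` and the sample timings other than 63.16 are hypothetical.

```python
n_iterations = 40
total_combos = 288

def extrapolated_speedup(random_time):
    """Speed-up implied by scaling the measured time linearly with candidate count."""
    hypothetical_grid_time = (total_combos / n_iterations) * random_time
    return hypothetical_grid_time / random_time

# The measured time cancels, so any positive value gives the same ratio
for seconds in (1.0, 63.16, 500.0):
    print(seconds, round(extrapolated_speedup(seconds), 1))
```

A real Grid Search run could deviate from this estimate, because fit time depends on the sampled hyperparameters (for example, `n_estimators` and `max_depth`).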


## 🌟 6. Conclusion and Dataset Rationale

### The Power of Random Search

* **Efficiency:** The cost of Random Search is fixed by `n_iter`, whereas the cost of Grid Search grows multiplicatively with every hyperparameter and value added. Here, Random Search evaluated 40 of 288 combinations, and the full grid was estimated (not measured) to take about 7.2x longer. Random Search often finds a near-optimal solution without having to test the entire space.
* **Recommendation:** For most real-world tuning tasks, **Random Search is the first technique you should try.**
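
A back-of-the-envelope calculation suggests why a small sample is often enough. The sketch below assumes candidates are drawn uniformly without replacement and asks how likely it is that at least one of them lands among the best configurations. The "top 14 of 288" threshold (roughly the best 5%) and the helper `prob_hit_top` are hypothetical choices for illustration.

```python
import math

def prob_hit_top(total, n_top, n_iter):
    """Chance that n_iter draws without replacement include at least one
    of the n_top best configurations out of total."""
    miss = math.comb(total - n_top, n_iter) / math.comb(total, n_iter)
    return 1.0 - miss

total_combos = 288
n_top = 14  # roughly the best 5% of 288 configurations

for n_iter in (10, 20, 40, 80):
    print(n_iter, round(prob_hit_top(total_combos, n_top, n_iter), 3))
```

The probability rises quickly with the number of sampled candidates, which is consistent with the recommendation above. It says nothing about how much worse a "top 5%" configuration is than the true optimum.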

### 🎯 Rationale for Dataset Choice

| Dataset | Problem Type | Used in this notebook? | Rationale |
| :--- | :--- | :--- | :--- |
| **`make_regression`** | **General Regression** | Yes | It's a clean, fast-to-generate synthetic dataset with 500 samples. This lets us run many cross-validation fits quickly and focus on the **time efficiency** of Random Search relative to Grid Search. |
| **`Supplement_Sales...`** | **Time Series Regression** | No | It's **Time Series** data, so tuning requires **`TimeSeriesSplit`** CV. Using it here would skew the time comparison and confuse the core lesson (Random vs. Grid). We will use it in a dedicated, slower tuning notebook when the focus shifts from "efficiency comparison" to "final model deployment." |
