# 🔧 Understanding Hyperparameters in Machine Learning Models  

## What Are Hyperparameters?  
Hyperparameters are **configurable parameters** set before a machine learning model begins training. Unlike model parameters (e.g., weights in neural networks), hyperparameters **are not learned from the data** but are instead manually specified or optimized using techniques like **grid search** or **random search**.  
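
As a minimal sketch of this distinction, assuming scikit-learn and a synthetic dataset (both chosen here for illustration only): `C` is fixed by us before training, while `coef_` is available only after `fit` has learned it from the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# Hyperparameter: chosen before training starts
model = LogisticRegression(C=0.5)
hyperparams = model.get_params()

# Model parameters: learned from the data during fit
model.fit(X, y)
learned_weights = model.coef_

print("C (hyperparameter):", hyperparams["C"])
print("coef_ (learned):", learned_weights.round(3))
```

Changing `C` and refitting would give different learned weights, which is why hyperparameters have to be tuned from outside the training loop.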

## Why Do We Need Hyperparameters?  
Hyperparameters play a crucial role in determining the **performance, speed, and generalization** of a model. Choosing the right hyperparameters can:  
- Improve **accuracy** and **efficiency**  
- Prevent **overfitting** (learning noise instead of patterns)  
- Enhance **generalization** to unseen data  
- Speed up **training and inference**  

## Examples of Hyperparameters in Different Models  
Here are some common hyperparameters across different models:  

### 🏆 Decision Trees & Random Forests  
- `max_depth`: Controls tree depth to prevent overfitting  
- `min_samples_split`: Minimum samples required to split a node  
- `n_estimators` (for ensembles): Number of trees in a forest  
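
The effect of these tree hyperparameters can be seen in a small sketch, assuming scikit-learn and synthetic data (the specific values `max_depth=3` and `min_samples_split=20` are arbitrary illustrations, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# No constraints: the tree grows until its leaves are pure
full_tree = DecisionTreeClassifier(random_state=1).fit(X, y)

# Constrained: at most 3 levels, and no split on nodes with fewer than 20 samples
shallow_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_split=20, random_state=1
).fit(X, y)

print("Unconstrained:", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("Constrained:  ", shallow_tree.get_depth(), "levels,", shallow_tree.get_n_leaves(), "leaves")
```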

### 🔥 Neural Networks  
- `learning_rate`: Step size of each weight update (too large can diverge, too small trains slowly)  
- `batch_size`: Number of training samples per batch  
- `epochs`: Number of complete passes through the dataset  

### 📈 Gradient Boosting (XGBoost, LightGBM)  
- `learning_rate`: Controls the contribution of each tree  
- `n_estimators`: Number of boosting rounds  
- `max_depth`: Limits tree depth to prevent overfitting  

## What Is the "Perfect" Hyperparameter Value?  
There is **no universal perfect value** for hyperparameters. The optimal settings depend on:  
- The **dataset** size and complexity  
- The **model type** and architecture  
- The **goal** (e.g., maximizing accuracy vs. minimizing inference time)  

To find the best hyperparameters, we use:  
✅ **Grid Search**: Evaluates every combination in a predefined grid of values  
✅ **Random Search**: Evaluates a fixed number of randomly sampled combinations, which is often cheaper than a full grid  
✅ **Bayesian Optimization**: Uses the results of earlier trials to choose which combination to try next  
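
A rough sketch of the first two strategies, assuming scikit-learn, SciPy, and a synthetic dataset (the model, grid values, and sampling ranges are illustrative assumptions). Bayesian optimization is left out because it needs a third-party library.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
forest = RandomForestClassifier(n_estimators=25, random_state=1)

# Grid search: evaluates every combination (3 x 2 = 6 candidates)
grid = GridSearchCV(
    forest,
    param_grid={"max_depth": [2, 4, 6], "min_samples_split": [2, 10]},
    scoring="accuracy",
    cv=3,
)
grid.fit(X, y)

# Random search: evaluates a fixed budget of randomly sampled candidates
random_search = RandomizedSearchCV(
    forest,
    param_distributions={
        "max_depth": randint(2, 10),          # integers 2..9
        "min_samples_split": randint(2, 20),  # integers 2..19
    },
    n_iter=5,
    scoring="accuracy",
    cv=3,
    random_state=1,
)
random_search.fit(X, y)

print("Grid search best:  ", grid.best_params_, round(grid.best_score_, 4))
print("Random search best:", random_search.best_params_, round(random_search.best_score_, 4))
```

The grid search cost grows multiplicatively with each added hyperparameter, while the random search cost is fixed by `n_iter`.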

## 🔍 Conclusion  
Hyperparameters **define how a model learns**, impacting its **accuracy, speed, and generalization**. Proper tuning is essential for achieving **optimal performance** without overfitting or underfitting.  


## **1. Pruning in Classification Trees**  
Pruning helps prevent **overfitting** by reducing the size of a decision tree, leading to improved accuracy on unseen data. Without pruning, a tree may **memorize** training data rather than generalizing well to new data.  

### **Post-Pruning (Cost Complexity Pruning - CCP)**  
In post-pruning, the tree is first grown to full depth (even if it overfits) and then gradually pruned by removing nodes based on a complexity parameter α.  

#### **How Does CCP Work?**  
The pruning process minimizes the following cost-complexity measure:  

$$
\text{Total Cost} = R(T) + \alpha \times \text{Number of Leaves}
$$


- **R(T)** measures the training error of the tree T. For a classification tree this is the total impurity of its leaves (e.g., Gini impurity or misclassification rate); the residual sum of squares (RSS) plays this role only for regression trees.  
- **α** is a tuning parameter that controls the trade-off between tree complexity and error.  
  - **Higher α** → More pruning → Simpler tree.  
  - **Lower α** → Less pruning → More complex tree.  
- The value for **α** can be found using cross-validation.
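
To make the effect of α concrete, the sketch below refits a tree at each candidate α and records the number of leaves. It assumes scikit-learn and a synthetic dataset; the clipping step is a defensive assumption added here, since floating-point error can occasionally produce tiny negative α values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data, used only for illustration
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.1, random_state=1)

# Effective alphas at which the fully grown tree loses a subtree
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
ccp_alphas = np.clip(path.ccp_alphas, 0, None)  # guard against tiny negative values

# Refit at each alpha and record how many leaves survive pruning
leaf_counts = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y)
    leaf_counts.append(tree.get_n_leaves())

print("smallest alpha:", ccp_alphas[0], "->", leaf_counts[0], "leaves")
print("largest alpha: ", ccp_alphas[-1], "->", leaf_counts[-1], "leaves")
```

The leaf count should shrink as α grows, which is the "higher α, simpler tree" behavior described above.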


### Baseline Predictors

In [23]:
# Function to find optimal ccp_alpha
def find_optimal_alpha_base(Train):
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code"]
    
    dt = DecisionTreeClassifier(random_state=1)
    path = dt.cost_complexity_pruning_path(Train[static_predictors], Train["Target"])
    ccp_alphas = path.ccp_alphas[:-1]  # Exclude the last value to avoid a single-node tree
    
    kf = KFold(n_splits=5, shuffle=True, random_state=1)
    alpha_scores = {}
    
    for alpha in ccp_alphas:
        dt = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
        scores = cross_val_score(dt, Train[static_predictors], Train["Target"], cv=kf, scoring='accuracy')
        alpha_scores[alpha] = np.mean(scores)
    
    best_alpha = max(alpha_scores, key=alpha_scores.get)
    print(f"Best ccp_alpha: {best_alpha:.6f} with Accuracy: {alpha_scores[best_alpha]:.4f}")
    return best_alpha


### Baseline Predictors + Rolling Predictors

In [11]:
# Function to find optimal ccp_alpha
def find_optimal_alpha_roll(Train):

    all_predictors = cv_parameters_roll(Train)
    Train = roll(Train)
    
    dt = DecisionTreeClassifier(random_state=1)
    path = dt.cost_complexity_pruning_path(Train[all_predictors], Train["Target"])
    ccp_alphas = path.ccp_alphas[:-1]  # Exclude the last value to avoid a single-node tree

    
    kf = KFold(n_splits=5, shuffle=True, random_state=1)
    alpha_scores = {}
    
    for alpha in ccp_alphas:
        dt = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
        scores = cross_val_score(dt, Train[all_predictors], Train["Target"], cv=kf, scoring='accuracy')
        alpha_scores[alpha] = np.mean(scores)
    
    best_alpha = max(alpha_scores, key=alpha_scores.get)
    print(f"Best ccp_alpha: {best_alpha:.6f} with Accuracy: {alpha_scores[best_alpha]:.4f}")
    return best_alpha

### Full Feature Set

In [25]:
# Function to find optimal ccp_alpha
def find_optimal_alpha_full(Train):
    # Define the feature columns for which we'll calculate rolling averages
    cols = ["GF", "GA", "Sh", "SoT", "PK", "PKatt"]
    new_cols = [f"{c}_rolling" for c in cols]

    # Apply rolling averages to the Train dataset, one team at a time
    train_results = []
    for team, group in Train.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        train_results.append(result)
    Train = pd.concat(train_results)

    # Define static and rolling predictors
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code","Rank","IsRanked"]
    rolling_predictors = new_cols
    all_predictors = static_predictors + rolling_predictors

    dt = DecisionTreeClassifier(random_state=1)
    path = dt.cost_complexity_pruning_path(Train[all_predictors], Train["Target"])
    ccp_alphas = path.ccp_alphas[:-1]  # Exclude the last value to avoid a single-node tree
 
    kf = KFold(n_splits=5, shuffle=True, random_state=1)
    alpha_scores = {}
    
    for alpha in ccp_alphas:
        dt = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
        scores = cross_val_score(dt, Train[all_predictors], Train["Target"], cv=kf, scoring='accuracy')
        alpha_scores[alpha] = np.mean(scores)
    
    best_alpha = max(alpha_scores, key=alpha_scores.get)
    print(f"Best ccp_alpha: {best_alpha:.6f} with Accuracy: {alpha_scores[best_alpha]:.4f}")
    return best_alpha

## **2. C in Logistic Regression**  
In **Logistic Regression**, `C` is the **inverse of the regularization strength** (also called the **inverse of lambda** in regularization).

$$ 
C = \frac{1}{\lambda}
$$
where **λ (lambda)** is the regularization parameter.

### 🔹 What Does `C` Do?
- It **controls the trade-off** between model complexity and generalization.
- **Higher values of `C`** → Less regularization (**more complex model, risk of overfitting**).
- **Lower values of `C`** → More regularization (**simpler model, avoids overfitting**).

### 🔹 Impact of `C` Values

| `C` Value  | Effect on Model |
|------------|-----------------|
| **Very Small (e.g., `C = 0.0001`)** | Strong regularization, may underfit |
| **Moderate (`C = 1.0`)** | Balanced regularization |
| **Very Large (e.g., `C = 10000`)** | Almost no regularization, may overfit |
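
One way to see this effect is to compare the size of the learned coefficients at the three values in the table. This is a sketch assuming scikit-learn and synthetic data; the `liblinear` solver is used only to match the functions in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# Fit the same model at three regularization strengths and record ||coef||
coef_norms = {}
for C in [0.0001, 1.0, 10000]:
    lr = LogisticRegression(C=C, solver="liblinear", random_state=1).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(lr.coef_))
    print(f"C={C:<8} ||coef||={coef_norms[C]:.4f}")
```

With strong regularization (small `C`) the coefficients are pulled toward zero; as `C` grows, the penalty weakens and the coefficients are free to grow.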



### Baseline Predictors

In [26]:
# Function to find the optimal C for Logistic Regression
def find_optimal_C_base(Train):
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code"]

    # Define a range of C values to test (logarithmically spaced)
    C_values = np.logspace(-2, 3, 20)   # 20 values of C from 0.01 to 1000

    kf = KFold(n_splits=5, shuffle=True, random_state=1)
    C_scores = {}

    for C in C_values:
        lr = LogisticRegression(C=C, solver='liblinear', random_state=1)
        scores = cross_val_score(lr, Train[static_predictors], Train["Target"], cv=kf, scoring='accuracy')
        C_scores[C] = np.mean(scores)

    best_C = max(C_scores, key=C_scores.get)
    print(f"Best C: {best_C:.6f} with Accuracy: {C_scores[best_C]:.4f}")
    return best_C


### Baseline Predictors + Rolling Predictors

In [27]:
# Function to find the optimal C for Logistic Regression
def find_optimal_C_roll(Train):
    # Define the feature columns for which we'll calculate rolling averages
    cols = ["GF", "GA", "Sh", "SoT", "PK", "PKatt"]
    new_cols = [f"{c}_rolling" for c in cols]

    # Apply rolling averages to the Train dataset, one team at a time
    train_results = []
    for team, group in Train.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        train_results.append(result)
    Train = pd.concat(train_results)

    # Define static and rolling predictors
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code"]
    rolling_predictors = new_cols
    all_predictors = static_predictors + rolling_predictors

    # Define a range of C values to test (logarithmically spaced)
    C_values = np.logspace(-2, 3, 20)   # 20 values of C from 0.01 to 1000

    kf = KFold(n_splits=5, shuffle=True, random_state=1)
    C_scores = {}

    for C in C_values:
        lr = LogisticRegression(C=C, solver='liblinear', random_state=1)
        scores = cross_val_score(lr, Train[all_predictors], Train["Target"], cv=kf, scoring='accuracy')
        C_scores[C] = np.mean(scores)

    best_C = max(C_scores, key=C_scores.get)
    print(f"Best C: {best_C:.6f} with Accuracy: {C_scores[best_C]:.4f}")
    return best_C

### Full Feature Set

In [28]:
# Function to find the optimal C for Logistic Regression
def find_optimal_C_full(Train):
    # Define the feature columns for which we'll calculate rolling averages
    cols = ["GF", "GA", "Sh", "SoT", "PK", "PKatt"]
    new_cols = [f"{c}_rolling" for c in cols]

    # Apply rolling averages to the Train dataset, one team at a time
    train_results = []
    for team, group in Train.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        train_results.append(result)
    Train = pd.concat(train_results)

    # Define static and rolling predictors
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code","Rank","IsRanked"]
    rolling_predictors = new_cols
    all_predictors = static_predictors + rolling_predictors

    # Define a range of C values to test (logarithmically spaced)
    C_values = np.logspace(-2, 3, 20)   # 20 values of C from 0.01 to 1000

    kf = KFold(n_splits=5, shuffle=True, random_state=1)
    C_scores = {}

    for C in C_values:
        lr = LogisticRegression(C=C, solver='liblinear', random_state=1)
        scores = cross_val_score(lr, Train[all_predictors], Train["Target"], cv=kf, scoring='accuracy')
        C_scores[C] = np.mean(scores)

    best_C = max(C_scores, key=C_scores.get)
    print(f"Best C: {best_C:.6f} with Accuracy: {C_scores[best_C]:.4f}")
    return best_C

# LDA Shrinkage
## What is Shrinkage in LDA?
Shrinkage is a regularization technique used in **Linear Discriminant Analysis (LDA)** to improve the estimation of the covariance matrix. It blends the empirical covariance matrix with a more structured version, reducing overfitting and improving stability, especially when dealing with high-dimensional data.

## When and Why is Shrinkage Needed?
- When **the number of features is large** compared to the number of samples, the empirical covariance matrix can be poorly estimated.
- Shrinkage **adds regularization** to avoid overfitting and makes the model more robust.
- It is useful when **the covariance matrix is nearly singular or unstable**.
- The `shrinkage` parameter works **only with `solver="lsqr"` or `solver="eigen"`**; the default `svd` solver does not support it.

## How is Shrinkage Controlled?
The shrinkage parameter (`shrinkage`) is a value between **0 and 1**:
- `shrinkage=0`: No shrinkage (uses the empirical covariance matrix).
- `shrinkage=1`: Full shrinkage (uses a diagonalized covariance matrix).
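- **Optimal values** can be found via cross-validation (`GridSearchCV`); scikit-learn also accepts `shrinkage="auto"`, which sets the value analytically using the Ledoit-Wolf lemma.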
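
The two endpoints can be checked directly on the fitted covariance estimate. The sketch below assumes scikit-learn and a synthetic dataset with correlated features; `off_diagonal_mass` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data with redundant (correlated) features, for illustration only
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=2, random_state=1
)

def off_diagonal_mass(cov):
    """Sum of absolute off-diagonal entries of a covariance matrix."""
    return float(np.abs(cov - np.diag(np.diag(cov))).sum())

# Fit LDA at three shrinkage levels and inspect the covariance estimate
off_diag = {}
for s in [0.0, 0.5, 1.0]:
    lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=s).fit(X, y)
    off_diag[s] = off_diagonal_mass(lda.covariance_)
    print(f"shrinkage={s}: off-diagonal mass = {off_diag[s]:.4f}")
```

As shrinkage increases, the cross-feature covariances are scaled down, and at `shrinkage=1` only the diagonal remains.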

### Baseline Predictors

In [29]:
def find_best_shrinkage_base(Train):
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code"]
    param_grid = {"shrinkage": np.linspace(0.0, 1.0, 10)}
    lda = LinearDiscriminantAnalysis(solver="lsqr")
    grid_search = GridSearchCV(lda, param_grid, scoring="accuracy", cv=5)
    grid_search.fit(Train[static_predictors], Train["Target"])
    return grid_search.best_params_["shrinkage"]


### Baseline Predictors + Rolling Predictors

In [30]:
def find_best_shrinkage_roll(Train):
    # Define the feature columns for which we'll calculate rolling averages
    cols = ["GF", "GA", "Sh", "SoT", "PK", "PKatt"]
    new_cols = [f"{c}_rolling" for c in cols]
    
    # Apply rolling averages to the Train dataset, one team at a time
    train_results = []
    for team, group in Train.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        train_results.append(result)
    Train = pd.concat(train_results)

    # Define static and rolling predictors
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code"]
    rolling_predictors = new_cols
    all_predictors = static_predictors + rolling_predictors

    param_grid = {"shrinkage": np.linspace(0.0, 1.0, 10)}
    lda = LinearDiscriminantAnalysis(solver="lsqr")
    grid_search = GridSearchCV(lda, param_grid, scoring="accuracy", cv=5)
    grid_search.fit(Train[all_predictors], Train["Target"])
    return grid_search.best_params_["shrinkage"]


### Full Feature Set

In [31]:
def find_best_shrinkage_full(Train):
    # Define the feature columns for which we'll calculate rolling averages
    cols = ["GF", "GA", "Sh", "SoT", "PK", "PKatt"]
    new_cols = [f"{c}_rolling" for c in cols]

    # Apply rolling averages to the Train dataset, one team at a time
    train_results = []
    for team, group in Train.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        train_results.append(result)
    Train = pd.concat(train_results)

    # Define static and rolling predictors
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code","Rank","IsRanked"]
    rolling_predictors = new_cols
    all_predictors = static_predictors + rolling_predictors

    param_grid = {"shrinkage": np.linspace(0.0, 1.0, 10)}
    lda = LinearDiscriminantAnalysis(solver="lsqr")
    grid_search = GridSearchCV(lda, param_grid, scoring="accuracy", cv=5)
    grid_search.fit(Train[all_predictors], Train["Target"])
    return grid_search.best_params_["shrinkage"]


# **QDA Regularization Parameter (`reg_param`)**

## What is `reg_param` in QDA?
`reg_param` is a regularization parameter in **Quadratic Discriminant Analysis (QDA)** that **adds shrinkage** to the covariance matrix. It prevents overfitting by **regularizing class-specific covariance matrices**, making the model more stable in cases with small sample sizes or highly correlated features.

## When and Why is `reg_param` Needed?
- **High-dimensional data**: If the number of features is large compared to available samples, `reg_param` helps by **shrinking covariance estimates**.
- **Ill-conditioned covariance matrix**: When the covariance matrix is near-singular, `reg_param` helps stabilize it.
- **Prevents overfitting**: Adding shrinkage smooths class-specific covariance estimates, making QDA more generalizable.

## How is `reg_param` Controlled?
The value of `reg_param` typically ranges from **0 to 1**:
- `reg_param=0`: No regularization (pure QDA).
- `reg_param=1`: Maximum regularization (in scikit-learn's formulation, each class covariance estimate is replaced by the identity matrix).
- **Tuned using cross-validation** (`GridSearchCV`) for optimal performance.
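
A minimal sketch of how `reg_param` can be compared across a few values, assuming scikit-learn and a small synthetic dataset (the candidate values and dataset shape are illustrative assumptions; no particular ranking of the scores is implied):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Small synthetic dataset without redundant features, for illustration only
X, y = make_classification(
    n_samples=150, n_features=10, n_informative=5, n_redundant=0, random_state=1
)

# Mean 5-fold cross-validated accuracy for each candidate reg_param
cv_accuracy = {}
for reg in [0.0, 0.1, 0.5, 1.0]:
    qda = QuadraticDiscriminantAnalysis(reg_param=reg)
    cv_accuracy[reg] = cross_val_score(qda, X, y, cv=5, scoring="accuracy").mean()
    print(f"reg_param={reg}: mean CV accuracy = {cv_accuracy[reg]:.4f}")
```

Which value wins depends on the data, which is why the parameter is tuned by cross-validation instead of being fixed in advance.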

In [None]:
def find_best_reg_param_base(Train):
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code"]
    # Candidate reg_param values: 10 evenly spaced points between 0 and 1
    param_grid = {"reg_param": np.linspace(0.0, 1.0, 10)}

    qda = QuadraticDiscriminantAnalysis()
    grid_search = GridSearchCV(qda, param_grid, scoring="accuracy", cv=5)
    grid_search.fit(Train[static_predictors], Train["Target"])

    return grid_search.best_params_["reg_param"]