# Boosting

Boosting is an ensemble learning technique that combines multiple weak learners (typically decision trees) to create a strong learner. Unlike Bagging and Random Forest, which build trees independently, Boosting trains trees sequentially, where each new tree corrects the errors of the previous ones.

#### How It Works:
- **Sequential Training**: Trees are trained one after another, with each tree focusing on the errors made by the ensemble built so far.
- **Weighted Data**: In AdaBoost-style boosting, incorrectly classified instances are given more weight in each iteration, so subsequent trees focus more on difficult cases. Gradient Boosting, which the code in this section uses, pursues the same goal by fitting each new tree to the gradient of the loss (the residual errors) instead of reweighting instances.
- **Aggregation**: The final prediction combines the weighted predictions of all trees, typically a weighted vote (for classification) or a weighted sum (for regression).
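
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the model used later in this section: it implements AdaBoost with one-split "decision stumps" on a hypothetical one-dimensional toy dataset (`XS`, `YS`), and the helper names (`fit_stump`, `adaboost_fit`, `adaboost_predict`) are made up for the example.

```python
import math

# Hypothetical toy data: labels (+1 / -1) that no single threshold can separate.
XS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
YS = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

def stump_predict(stump, x):
    """A stump predicts `polarity` when x >= threshold, otherwise -polarity."""
    threshold, polarity, _ = stump
    return polarity if x >= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return the (threshold, polarity, weighted_error) stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            candidate = (threshold, polarity, 0.0)
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(candidate, x) != y)
            if best is None or err < best[2]:
                best = (threshold, polarity, err)
    return best

def adaboost_fit(xs, ys, n_rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform instance weights
    ensemble = []
    for _ in range(n_rounds):        # sequential training
        stump = fit_stump(xs, ys, weights)
        err = min(max(stump[2], 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # this learner's vote weight
        ensemble.append((alpha, stump))
        # Weighted data: misclassified points are scaled up, correct ones down.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Aggregation: sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

ensemble = adaboost_fit(XS, YS, n_rounds=3)
predictions = [adaboost_predict(ensemble, x) for x in XS]
print(predictions == YS)
```

On this toy data a single stump misclassifies three of the ten points, while the weighted vote of three stumps labels all ten training points correctly, which is the "weak learners into a strong learner" idea in miniature.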

#### Advantages:
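✅ **Reduces Bias**: Boosting primarily reduces bias, because each new learner is fit to the errors left by the current ensemble; with shrinkage or subsampling it can also lower variance.  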
✅ **Highly Accurate**: Often produces better results than individual models due to its focus on correcting errors.  
✅ **Works Well with Complex Data**: Can handle complex data distributions and capture subtle patterns in the data.  
✅ **Flexible**: Can be applied to a wide range of models, and different base learners (like decision trees, logistic regression, etc.) can be used.

#### Disadvantages:
❌ **Prone to Overfitting**: If too many trees are added, boosting can overfit the training data, especially if the base learner is too complex.  
❌ **Computationally Expensive**: Training sequential trees can be slow and resource-intensive.  
❌ **Less Interpretable**: Like Random Forest, boosting ensembles (e.g., Gradient Boosting) can be difficult to interpret.  
❌ **Sensitive to Noisy Data**: Boosting can be sensitive to noise in the data, as it places more weight on difficult cases, which might be noisy or outliers.
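
The overfitting risk from adding too many trees can be checked empirically. The following is a minimal sketch, assuming a synthetic dataset from scikit-learn's `make_classification` (with some label noise) in place of the real data, and hyperparameters chosen only for illustration. It uses `staged_predict` to score the ensemble on a held-out validation split after every added tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; flip_y=0.1 randomly reassigns about 10% of labels as noise.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Deliberately fit more trees than are likely to be needed.
gb_clf = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                    learning_rate=0.1, random_state=1)
gb_clf.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
val_scores = [accuracy_score(y_val, preds)
              for preds in gb_clf.staged_predict(X_val)]
best_n_trees = int(np.argmax(val_scores)) + 1

print(f"best validation accuracy {max(val_scores):.3f} at {best_n_trees} trees")
print(f"validation accuracy with all 200 trees: {val_scores[-1]:.3f}")
```

If the best score occurs well before the last tree, a smaller `n_estimators` (or a lower `learning_rate`) may be worth trying. On time-ordered data, the validation split should respect the time order rather than being drawn at random as it is here.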


### Boosting using Baseline Predictors (refer to /Data/Data_Formatting.ipynb)

In [3]:
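import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score

# Function to make yearly predictions using Gradient Boosting.
# find_optimal_alpha_base and parameters_base are not defined in this section;
# they are assumed to come from earlier cells.
def make_yearly_predictions_gb(Train, Test):
    # Cost-complexity pruning parameter selected from the training data
    best_alpha = find_optimal_alpha_base(Train)

    # Define static predictors
    static_predictors = parameters_base(Train, Test)

    # Train a single Gradient Boosting Classifier on the full training set
    gb_clf = GradientBoostingClassifier(n_estimators=50, max_depth=10, min_samples_split=10, ccp_alpha=best_alpha, random_state=1)
    gb_clf.fit(Train[static_predictors], Train["Target"])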

    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:
            # Predict on test data
            preds = gb_clf.predict(test_year[static_predictors])

            # Calculate precision and accuracy
            precision = precision_score(test_year["Target"], preds, average="weighted")
            accuracy = accuracy_score(test_year["Target"], preds)

            # Append results to list
            results.append({
                "Model": "Boosting",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    return results_df


### Boosting using Baseline Predictors + Rolling Predictors (refer to /Data/Data_Formatting.ipynb)

In [4]:
def make_yearly_predictions_gb_rolling(Train, Test):
    results = []
    all_predictors = parameters_roll(Train, Test)
    
    # Process all data together for proper rolling calculations
    full_data = pd.concat([Train, Test]).sort_values(['Team', 'Date'])
    full_data = roll(full_data)
    
    # Split back into train/test
    Train = full_data[full_data['Date'].isin(Train['Date'])]
    Test = full_data[full_data['Date'].isin(Test['Date'])]
    
    for year in sorted(Test['Date'].dt.year.unique()):
        # Train on all rows of Train dated BEFORE the test year
        # (rows from earlier Test years are not added to the training set)
        train_mask = Train['Date'].dt.year < year
        X_train = Train.loc[train_mask, all_predictors]
        y_train = Train.loc[train_mask, "Target"]
        
        # Test on current year
        test_year = Test[Test['Date'].dt.year == year]
        X_test = test_year[all_predictors]
        y_test = test_year["Target"]
        
        if len(X_train) > 0 and len(X_test) > 0:
            gb_clf = GradientBoostingClassifier(n_estimators=50, max_depth=10, 
                                             min_samples_split=10, random_state=1)
            gb_clf.fit(X_train, y_train)
            
            preds = gb_clf.predict(X_test)
            precision = precision_score(y_test, preds, average="weighted")
            accuracy = accuracy_score(y_test, preds)
            
            results.append({
                "Model": "Boosting",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy,
                "Samples": len(y_test)
            })
    
    return pd.DataFrame(results)
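
The year-by-year loop above follows a walk-forward pattern: for each year in the test set, the model is fit only on training rows dated before that year. The split logic can be isolated as below. This is a sketch on hypothetical toy frames; the generator name `yearly_walk_forward` and the sample dates are made up for illustration, and only the `Date` and `Target` column names are taken from the code above.

```python
import pandas as pd

def yearly_walk_forward(train, test, date_col="Date"):
    """Yield (year, train_rows, test_rows) for each year present in `test`.

    train_rows holds the rows of `train` dated strictly before `year`;
    test_rows holds the rows of `test` that fall in `year`.
    Years with an empty train or test part are skipped.
    """
    for year in sorted(test[date_col].dt.year.unique()):
        train_rows = train[train[date_col].dt.year < year]
        test_rows = test[test[date_col].dt.year == year]
        if len(train_rows) > 0 and len(test_rows) > 0:
            yield int(year), train_rows, test_rows

# Hypothetical toy data, only to show which rows land in each split.
train = pd.DataFrame({
    "Date": pd.to_datetime(["2019-05-01", "2020-05-01", "2021-05-01"]),
    "Target": [0, 1, 0],
})
test = pd.DataFrame({
    "Date": pd.to_datetime(["2021-09-01", "2022-03-01", "2022-09-01"]),
    "Target": [1, 1, 0],
})

splits = [(year, len(tr), len(te))
          for year, tr, te in yearly_walk_forward(train, test)]
print(splits)  # [(2021, 2, 1), (2022, 3, 2)]
```

Note that rows from `test` never enter the training part. If the training set ends before the first test year, every iteration trains on the same rows, so the refit inside the loop produces the same model each time.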


### Boosting using Full Feature Set (refer to /Data/Data_Formatting.ipynb)

In [5]:
def make_yearly_predictions_gb_full(Train, Test):
    results = []
    all_predictors = parameters_full(Train, Test)
    
    # Process all data together for proper rolling calculations
    full_data = pd.concat([Train, Test]).sort_values(['Team', 'Date'])
    full_data = roll(full_data)
    
    # Split back into train/test
    Train = full_data[full_data['Date'].isin(Train['Date'])]
    Test = full_data[full_data['Date'].isin(Test['Date'])]
    
    for year in sorted(Test['Date'].dt.year.unique()):
        # Train on all rows of Train dated BEFORE the test year
        # (rows from earlier Test years are not added to the training set)
        train_mask = Train['Date'].dt.year < year
        X_train = Train.loc[train_mask, all_predictors]
        y_train = Train.loc[train_mask, "Target"]
        
        # Test on current year
        test_year = Test[Test['Date'].dt.year == year]
        X_test = test_year[all_predictors]
        y_test = test_year["Target"]
        
        if len(X_train) > 0 and len(X_test) > 0:
            gb_clf = GradientBoostingClassifier(n_estimators=50, max_depth=10, 
                                             min_samples_split=10, random_state=1)
            gb_clf.fit(X_train, y_train)
            
            preds = gb_clf.predict(X_test)
            precision = precision_score(y_test, preds, average="weighted")
            accuracy = accuracy_score(y_test, preds)
            
            results.append({
                "Model": "Boosting",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy,
                "Samples": len(y_test)
            })
    
    return pd.DataFrame(results)
