# Boosting

Boosting is an ensemble learning technique that combines multiple weak learners (typically decision trees) to create a strong learner. Unlike Bagging and Random Forest, which build trees independently, Boosting trains trees sequentially, where each new tree corrects the errors of the previous ones.

#### How It Works:
- **Sequential Training**: Trees are trained one after another, with each new tree focusing on the errors made by the ensemble built so far.
- **Weighted Data**: In AdaBoost-style boosting, misclassified instances receive more weight at each iteration, so subsequent trees concentrate on the difficult cases. Gradient Boosting, the method used in this notebook, instead fits each new tree to the residual errors (the negative gradient of the loss) of the current ensemble.
- **Aggregation**: The final prediction combines the predictions of all trees, typically through a weighted vote (for classification) or a weighted sum (for regression).
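
The reweighting and weighted-vote steps listed above correspond to the classic AdaBoost procedure. The following is a minimal, dependency-free sketch of that procedure on a toy 1-D dataset, not the implementation scikit-learn uses. The helper names (`fit_stump`, `stump_predict`, `adaboost`, `predict`) and the toy data are illustrative assumptions.

```python
import math

def fit_stump(xs, ys, weights):
    """Find the 1-D threshold rule with the lowest weighted error.

    A stump predicts `polarity` when x >= threshold and `-polarity` otherwise.
    """
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if (polarity if x >= threshold else -polarity) != y
            )
            if best is None or error < best[2]:
                best = (threshold, polarity, error)
    return best

def stump_predict(stump, x):
    threshold, polarity, _ = stump
    return polarity if x >= threshold else -polarity

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform instance weights
    ensemble = []                    # list of (alpha, stump) pairs
    for _ in range(n_rounds):
        stump = fit_stump(xs, ys, weights)
        error = min(max(stump[2], 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)    # vote strength of this stump
        ensemble.append((alpha, stump))
        # Up-weight misclassified points, down-weight correct ones, then renormalise
        weights = [
            w * math.exp(-alpha * y * stump_predict(stump, x))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    # Weighted vote of all stumps
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data (labels in {-1, +1}) that no single threshold can separate
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single = adaboost(xs, ys, n_rounds=1)
boosted = adaboost(xs, ys, n_rounds=3)
single_acc = sum(predict(single, x) == y for x, y in zip(xs, ys)) / len(xs)
boosted_acc = sum(predict(boosted, x) == y for x, y in zip(xs, ys)) / len(xs)
print(single_acc, boosted_acc)
```

On this toy data a single stump classifies 7 of the 10 points correctly, while the weighted vote of three stumps classifies all 10 correctly.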

#### Advantages:
✅ **Reduces Bias**: Boosting primarily reduces bias by combining many weak learners; it can also lower variance, though less reliably than bagging.  
✅ **Highly Accurate**: Often produces better results than individual models due to its focus on correcting errors.  
✅ **Works Well with Complex Data**: Can handle complex data distributions and capture subtle patterns in the data.  
✅ **Flexible**: Can be applied to a wide range of models, and different base learners (like decision trees, logistic regression, etc.) can be used.

#### Disadvantages:
❌ **Prone to Overfitting**: If too many trees are added, boosting can overfit the training data, especially if the base learner is too complex.  
❌ **Computationally Expensive**: Trees are trained sequentially, so training cannot be parallelized across trees and can be slow and resource-intensive.  
❌ **Less Interpretable**: Like Random Forest, boosting ensembles (e.g., Gradient Boosting) are harder to interpret than a single decision tree.  
❌ **Sensitive to Noisy Data**: Boosting can be sensitive to noise in the data, as it places more weight on difficult cases, which might be noisy or outliers.
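
One common way to check the overfitting risk described above is to track held-out accuracy as trees are added. The sketch below uses scikit-learn's `staged_predict` on synthetic data; the dataset, the split, and the hyperparameters are assumptions chosen for illustration and are unrelated to the Premier League data used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, deliberately noisy data: flip_y=0.2 assigns a random class to 20% of samples
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1
)

gb_clf = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=1)
gb_clf.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n_estimators trees
train_acc = [accuracy_score(y_train, p) for p in gb_clf.staged_predict(X_train)]
val_acc = [accuracy_score(y_val, p) for p in gb_clf.staged_predict(X_val)]

# Number of trees with the best validation accuracy (first one if there are ties)
best_n = max(range(len(val_acc)), key=val_acc.__getitem__) + 1
print(f"best n_estimators on validation data: {best_n}")
print(f"final train accuracy: {train_acc[-1]:.3f}, final validation accuracy: {val_acc[-1]:.3f}")
```

If training accuracy keeps rising while validation accuracy flattens or falls, the later trees are most likely fitting noise. For time-ordered data such as match results, a chronological split would be a more faithful validation scheme than the random split assumed here.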


In [6]:
# Import all the necessary dependencies
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import cross_val_score, KFold


In [7]:
%run ../Data/Data_Formatting.ipynb

In [8]:
%run Classification_Tree.ipynb

Note: you may need to restart the kernel to use updated packages.


In [9]:
%pip install scikit-learn




In [10]:
#loading the training dataset 
train_path = Path("../Data/premierleague_team_data.csv")
matches = pd.read_csv(train_path)

#loading the testing data 
test_path = Path("../Data/premierleague_test_team_data.csv")
test_matches = pd.read_csv(test_path)

In [11]:
#loading the training dataset with rank
train_path = Path("../Data/premierleague_rank_team_data.csv")
new_matches = pd.read_csv(train_path)

#loading the testing data with rank
test_path = Path("../Data/premierleague_rank_test_team_data.csv")
new_test_matches = pd.read_csv(test_path)

In [12]:
process_data(matches, test_matches)

In [13]:
process_data(new_matches, new_test_matches)

### Boosting using Baseline Predictors (see /Data/Data_Formatting.ipynb)

In [14]:
# Function to make yearly predictions using Gradient Boosting
def make_yearly_predictions_gb(Train, Test):
    # Cost-complexity pruning strength from find_optimal_alpha
    # (a helper defined in one of the notebooks run above)
    best_alpha = find_optimal_alpha(Train)

    # Convert 'Date' columns to datetime and sort data
    Train['Date'] = pd.to_datetime(Train['Date'], errors='coerce')
    Test['Date'] = pd.to_datetime(Test['Date'], errors='coerce')
    Train = Train.dropna(subset=['Date']).sort_values(by='Date')
    Test = Test.dropna(subset=['Date']).sort_values(by='Date')

    # Define static predictors
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code"]

    # Train a Gradient Boosting Classifier
    gb_clf = GradientBoostingClassifier(
        n_estimators=50, max_depth=10, min_samples_split=10,
        ccp_alpha=best_alpha, random_state=1
    )
    gb_clf.fit(Train[static_predictors], Train["Target"])

    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:
            # Predict on test data
            preds = gb_clf.predict(test_year[static_predictors])

            # Calculate precision and accuracy
            precision = precision_score(test_year["Target"], preds, average="weighted")
            accuracy = accuracy_score(test_year["Target"], preds)

            # Append results to list
            results.append({
                "Model": "Boosting",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    return results_df
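
The per-year evaluation loop above could be factored into a small reusable helper. The sketch below is one possible shape for it, not part of the original notebook: `evaluate_by_year`, the `HomeWinStub` stand-in model, and the toy data are hypothetical names and values used only to make the example runnable on its own.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score

def evaluate_by_year(model, test, predictors, label="Boosting"):
    """Score a fitted model separately on each calendar year present in `test`."""
    rows = []
    # Grouping by year skips empty years automatically
    for year, test_year in test.groupby(test["Date"].dt.year):
        preds = model.predict(test_year[predictors])
        rows.append({
            "Model": label,
            "Year": int(year),
            "Precision": precision_score(
                test_year["Target"], preds, average="weighted", zero_division=0
            ),
            "Accuracy": accuracy_score(test_year["Target"], preds),
        })
    return pd.DataFrame(rows)

class HomeWinStub:
    """Hypothetical stand-in for a fitted classifier: predicts a win (1) for home games."""
    def predict(self, X):
        return (X["Venue_code"] == 1).astype(int).to_numpy()

# Toy test set with two seasons (values are made up for illustration)
toy_test = pd.DataFrame({
    "Date": pd.to_datetime(["2022-08-06", "2022-08-13", "2023-08-12", "2023-08-19"]),
    "Venue_code": [1, 0, 1, 0],
    "Target": [1, 0, 0, 0],
})
summary = evaluate_by_year(HomeWinStub(), toy_test, ["Venue_code"])
print(summary)
```

Any fitted estimator with a `predict` method, such as `gb_clf`, could be passed in place of the stub.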


### Boosting using Baseline Predictors + Rolling Predictors (see /Data/Data_Formatting.ipynb)

In [15]:
def make_yearly_predictions_gb_rolling(Train, Test):
    best_alpha = find_optimal_alpha(Train)
    # Convert 'Date' columns to datetime and sort data
    Train['Date'] = pd.to_datetime(Train['Date'], errors='coerce')
    Test['Date'] = pd.to_datetime(Test['Date'], errors='coerce')
    Train = Train.dropna(subset=['Date']).sort_values(by='Date')
    Test = Test.dropna(subset=['Date']).sort_values(by='Date')
    
    # Define the feature columns for which we'll calculate rolling averages
    cols = ["GF", "GA", "Sh", "SoT", "PK", "PKatt"]
    new_cols = [f"{c}_rolling" for c in cols]
    
    # Apply rolling averages to both Train and Test datasets
    train_results = []
    for team, group in Train.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        train_results.append(result)
    Train = pd.concat(train_results)
    
    test_results = []
    for team, group in Test.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        test_results.append(result)
    Test = pd.concat(test_results)
    
    # Define static and rolling predictors
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code"]
    rolling_predictors = new_cols
    all_predictors = static_predictors + rolling_predictors

    # Train a Gradient Boosting Classifier on the static and rolling predictors together;
    # ccp_alpha reuses the pruning strength found above, as in the baseline model
    gb_clf = GradientBoostingClassifier(
        n_estimators=50, max_depth=10, min_samples_split=10,
        ccp_alpha=best_alpha, random_state=1
    )
    gb_clf.fit(Train[all_predictors], Train["Target"])

    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:
            # Predict on this year's test data
            preds = gb_clf.predict(test_year[all_predictors])

            # Calculate precision and accuracy
            precision = precision_score(test_year["Target"], preds, average="weighted")
            accuracy = accuracy_score(test_year["Target"], preds)

            # Append results to list
            results.append({
                "Model": "Boosting",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    return results_df
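
The function above relies on `rolling_averages`, which is defined in Data_Formatting.ipynb and not shown here. As a rough idea of what such a helper might do, the sketch below computes per-team rolling means that use only earlier matches. The name `rolling_averages_sketch`, the window of 3 matches, and the `closed="left"` choice are assumptions, and the real helper may differ.

```python
import pandas as pd

def rolling_averages_sketch(group, cols, new_cols, window=3):
    """Add rolling means of `cols` over the previous `window` matches of one team."""
    group = group.sort_values("Date")
    # closed="left" excludes the current match, so no same-match information leaks in
    rolling_stats = group[cols].rolling(window, closed="left").mean()
    rolling_stats.columns = new_cols
    out = pd.concat([group, rolling_stats], axis=1)
    # The first `window` matches have no complete history, so drop them
    return out.dropna(subset=new_cols)

# Toy single-team frame (values are made up for illustration)
demo = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2023-08-12", "2023-08-19", "2023-08-26", "2023-09-02", "2023-09-16"]
    ),
    "GF": [1, 2, 3, 4, 5],
})
demo_rolled = rolling_averages_sketch(demo, ["GF"], ["GF_rolling"])
print(demo_rolled)
```

Only the last two matches survive: their `GF_rolling` values are the means of the three matches played before each of them.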


### Boosting using Full Feature Set (see /Data/Data_Formatting.ipynb)

In [18]:
def make_yearly_predictions_gb_full(Train, Test):
    best_alpha = find_optimal_alpha(Train)
    # Convert 'Date' columns to datetime and sort data
    Train['Date'] = pd.to_datetime(Train['Date'], errors='coerce')
    Test['Date'] = pd.to_datetime(Test['Date'], errors='coerce')
    Train = Train.dropna(subset=['Date']).sort_values(by='Date')
    Test = Test.dropna(subset=['Date']).sort_values(by='Date')
    
    # Define the feature columns for which we'll calculate rolling averages
    cols = ["GF", "GA", "Sh", "SoT", "PK", "PKatt"]
    new_cols = [f"{c}_rolling" for c in cols]
    
    # Apply rolling averages to both Train and Test datasets
    train_results = []
    for team, group in Train.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        train_results.append(result)
    Train = pd.concat(train_results)
    
    test_results = []
    for team, group in Test.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        test_results.append(result)
    Test = pd.concat(test_results)
    
    # Define static and rolling predictors
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code", "Rank", "IsRanked"]
    rolling_predictors = new_cols
    all_predictors = static_predictors + rolling_predictors

    # Train a Gradient Boosting Classifier on the full feature set (static, rank, and rolling);
    # ccp_alpha reuses the pruning strength found above, as in the baseline model
    gb_clf = GradientBoostingClassifier(
        n_estimators=50, max_depth=10, min_samples_split=10,
        ccp_alpha=best_alpha, random_state=1
    )
    gb_clf.fit(Train[all_predictors], Train["Target"])

    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:
            # Predict on this year's test data
            preds = gb_clf.predict(test_year[all_predictors])

            # Calculate precision and accuracy
            precision = precision_score(test_year["Target"], preds, average="weighted")
            accuracy = accuracy_score(test_year["Target"], preds)

            # Append results to list
            results.append({
                "Model": "Boosting",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    return results_df
