# Decision Trees and Their Application in Classification  

In this section, we will **test** and **evaluate** various tree-based methods, including **classification trees**, **bagging**, **random forest**, and **boosting**, to see which works best for predicting football match outcomes.

## What Are Decision Trees?  

A **decision tree** is a supervised machine learning algorithm used for both **classification** and **regression** tasks. It recursively splits the data into subsets, choosing at each step the feature that best separates the target, which produces a tree-like structure.

### How Do Decision Trees Work for Classification?  

In **classification tasks**, decision trees are used to predict categorical outcomes by recursively splitting the data at each node, with the goal of maximizing the "purity" of the resulting subsets. The **root node** represents the entire dataset, and the tree branches out to **leaf nodes** that represent the predicted class label. Each split is based on the feature that best separates the data at that point, typically using a metric such as **Gini impurity** or **cross entropy**.
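To make the two purity metrics concrete, here is a minimal sketch that computes them from a list of class labels. The function names and the toy labels (`"W"` for win, `"L"` for loss) are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum_k p_k^2, where p_k is the share of class k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def cross_entropy(labels):
    """Cross entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical node contents: one pure node and one evenly mixed node
pure = ["W", "W", "W", "W"]
mixed = ["W", "W", "L", "L"]

print("Gini:", gini_impurity(pure), gini_impurity(mixed))
print("Entropy:", cross_entropy(pure), cross_entropy(mixed))
```

Both metrics are zero for a pure node and largest when the classes are evenly mixed, so a split that lowers them produces purer subsets.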

#### Example Process:
1. **Starting Node (Root)**: The algorithm evaluates all possible features and chooses the one that best divides the data into distinct classes.
2. **Internal Nodes**: Each subsequent node splits the data based on a feature that provides the greatest separation of class labels.
3. **Leaf Nodes**: These represent the final predicted class labels for a given subset of data.
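
The split search in step 1 can be sketched for a single numeric feature as follows. This is a simplified illustration, not scikit-learn's actual implementation; `best_threshold`, `goal_diff`, and `result` are hypothetical names with made-up data.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    """Scan midpoints between sorted unique values of one feature and
    return the (threshold, weighted Gini) pair with the lowest impurity."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    values = np.unique(feature)
    best = (None, np.inf)
    for lo, hi in zip(values[:-1], values[1:]):
        t = (lo + hi) / 2
        left, right = labels[feature <= t], labels[feature > t]
        # Weight each child's impurity by its share of the samples
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Made-up toy data: recent goal difference vs. match result
goal_diff = [-3, -2, -1, 0, 1, 2, 3, 4]
result = ["L", "L", "L", "L", "W", "W", "W", "W"]

threshold, impurity = best_threshold(goal_diff, result)
print(threshold, impurity)
```

A full tree repeats this search over every feature at each node and then recurses into the two resulting subsets.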

Decision trees are **easy to interpret** and visualize, which makes them an appealing choice for understanding how predictions are made. However, they can be prone to **overfitting** if not properly tuned.
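
As a small illustration of that interpretability, the sketch below fits a shallow tree on synthetic data and prints its decision rules with scikit-learn's `export_text`. The data and feature names are placeholders, not the match features used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the real notebook uses match statistics instead
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["feat_0", "feat_1", "feat_2", "feat_3"]  # hypothetical names

# Limiting max_depth keeps the tree small enough to read
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each printed line is one split or leaf, so a prediction can be traced by following the thresholds from the root downward.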

## Types of Tree-Based Methods  

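1. **Classification Trees**: a single tree fit to the full training set.
2. **Bagging (Bootstrap Aggregating)**: many trees, each fit to a bootstrap sample of the training data, with predictions combined by majority vote.
3. **Random Forest**: bagging in which each split considers only a random subset of the features, which decorrelates the trees.
4. **Boosting**: trees fit sequentially, each one concentrating on the errors of the ensemble built so far.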
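
The four methods map onto scikit-learn estimators roughly as sketched below. The synthetic data and the hyperparameter values are assumptions chosen for a quick demonstration; they are not the tuned settings loaded elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the match data
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Illustrative (untuned) settings for each tree-based method
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Fit each model and record its accuracy on the held-out split
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.3f}")
```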

## Summary  

This section introduced four tree-based methods: **classification trees**, **bagging**, **random forest**, and **boosting**. Each one is tested and evaluated below to determine which best predicts football match outcomes.


In [2]:
# Import all the necessary dependencies
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import cross_val_score, KFold

In [3]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [4]:
%run ../Data/Data_Formatting.ipynb

In [5]:
%run ../Data/Ultimate_Hyperparameters.ipynb

In [6]:
%run ../Data/Parameters.ipynb

In [7]:
#loading the training dataset 
train_path = Path("../Data/premierleague_team_data.csv")
matches = pd.read_csv(train_path)

#loading the testing data 
test_path = Path("../Data/premierleague_test_team_data.csv")
test_matches = pd.read_csv(test_path)

In [8]:
#loading the training dataset with rank
train_path = Path("../Data/premierleague_rank_team_data.csv")
new_matches = pd.read_csv(train_path)

#loading the testing data with rank
test_path = Path("../Data/premierleague_rank_test_team_data.csv")
new_test_matches = pd.read_csv(test_path)

In [9]:
process_data(matches, test_matches)

In [10]:
process_data(new_matches, new_test_matches)

# Best Model Selection 

## Choosing the Best Overall Model

In our analysis, we have evaluated four different machine learning models: **Bagging, Gradient Boosting, Decision Tree, and Random Forest**. Each of these models has been tested on multiple years of data, and their performance has been measured using **accuracy** and **precision**.

### Why Do We Need to Choose the Best Model?

1. **Consistency Across Years**  
   Some models may perform well in certain years but not in others. Selecting the best overall model ensures that we choose the one that provides **consistent performance across multiple years** rather than excelling only in specific cases.

2. **Generalization to Future Data**  
   The goal is to make predictions on new, unseen data. A model that performs well across different years is more **robust and reliable** for future predictions.

3. **Maximizing Accuracy and Precision**  
   By comparing the performance of all models, we can **identify the model with the highest accuracy and precision**. This helps in minimizing errors and improving decision-making.

4. **Avoiding Overfitting**  
   Some models might perform exceptionally well on training data but fail on test data. By analyzing results across multiple years, we ensure that the chosen model is **not overfitting** and generalizes well.

### How Will We Choose the Best Model?
We will compare all models based on:
- **Accuracy** (how often the model makes correct predictions)  
- **Precision** (how reliable the model's positive predictions are)  

The model that consistently achieves the highest accuracy and precision across multiple years will be selected as the **best overall model**.
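
For reference, here is how the two metrics are computed with scikit-learn on a small made-up set of predictions. The labels `"W"` and `"L"` and the choice of `"W"` as the positive class are assumptions for this example only.

```python
from sklearn.metrics import accuracy_score, precision_score

# Made-up true outcomes and predictions for eight matches
y_true = ["W", "W", "L", "L", "W", "L", "L", "L"]
y_pred = ["W", "L", "L", "W", "W", "L", "L", "L"]

# Accuracy: share of all predictions that are correct (6 of 8 here)
accuracy = accuracy_score(y_true, y_pred)

# Precision: share of predicted "W" that were truly "W" (2 of 3 here)
precision = precision_score(y_true, y_pred, pos_label="W")

print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}")
```

If the target has more than two classes (for example win, draw, loss), `precision_score` needs an `average` argument such as `"macro"` or `"weighted"`.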

Following this approach lets us pick the model that is most reliable across years, rather than one that merely did well in a single year. 🚀


## Models with Baseline Predictors

In [11]:
# Run all models
def best_tree_model_baseline(A, B):
    bagging_results = make_yearly_predictions_bagging(A, B)
    gb_results = make_yearly_predictions_gb(A, B)
    decs_results = make_yearly_predictions_decs(A, B)
    rf_results = make_yearly_predictions_rf(A, B)

    # Combine all results
    all_results = pd.concat([bagging_results, gb_results, decs_results, rf_results], ignore_index=True)

    # Find best accuracy & precision for each year
    best_per_year = all_results.loc[all_results.groupby("Year")["Accuracy"].idxmax()]
    best_per_year_precision = all_results.loc[all_results.groupby("Year")["Precision"].idxmax()]

    # Compute average precision and accuracy per model
    avg_results = all_results.groupby("Model")[["Accuracy", "Precision"]].mean().reset_index()

    # Find the best overall model based on highest average accuracy
    best_model = avg_results.loc[avg_results["Accuracy"].idxmax()]

    # Find the best overall model based on highest average precision
    best_precision_model = avg_results.loc[avg_results["Precision"].idxmax()]

    # Display results
    print("Best Model Per Year (by Accuracy):")
    print(best_per_year)

    print("\nBest Model Per Year (by Precision):")
    print(best_per_year_precision)

    print("\nOverall Best Model (by Accuracy):")
    print(f"Model: {best_model['Model']}, Avg Precision: {best_model['Precision']:.4f}, Avg Accuracy: {best_model['Accuracy']:.4f}")

    print("\nOverall Best Model (by Precision):")
    print(f"Model: {best_precision_model['Model']}, Avg Precision: {best_precision_model['Precision']:.4f}, Avg Accuracy: {best_precision_model['Accuracy']:.4f}")


## Models with Baseline Predictors + Rolling Predictors

In [12]:
# Run all models
def best_tree_model_rollling(A, B):
    bagging_results = make_yearly_predictions_bagging_rolling(A, B)
    gb_results = make_yearly_predictions_gb_rolling(A, B)
    decs_results = make_yearly_predictions_decs_rolling(A, B)
    rf_results = make_yearly_predictions_rf_rolling(A, B)

    # Combine all results
    all_results = pd.concat([bagging_results, gb_results, decs_results, rf_results], ignore_index=True)

    # Find best accuracy & precision for each year
    best_per_year = all_results.loc[all_results.groupby("Year")["Accuracy"].idxmax()]
    best_per_year_precision = all_results.loc[all_results.groupby("Year")["Precision"].idxmax()]

    # Compute average precision and accuracy per model
    avg_results = all_results.groupby("Model")[["Accuracy", "Precision"]].mean().reset_index()

    # Find the best overall model based on highest average accuracy
    best_model = avg_results.loc[avg_results["Accuracy"].idxmax()]

    # Find the best overall model based on highest average precision
    best_precision_model = avg_results.loc[avg_results["Precision"].idxmax()]

    # Display results
    print("Best Model Per Year (by Accuracy):")
    print(best_per_year)

    print("\nBest Model Per Year (by Precision):")
    print(best_per_year_precision)

    print("\nOverall Best Model (by Accuracy):")
    print(f"Model: {best_model['Model']}, Avg Precision: {best_model['Precision']:.4f}, Avg Accuracy: {best_model['Accuracy']:.4f}")

    print("\nOverall Best Model (by Precision):")
    print(f"Model: {best_precision_model['Model']}, Avg Precision: {best_precision_model['Precision']:.4f}, Avg Accuracy: {best_precision_model['Accuracy']:.4f}")


## Models with Full Set Predictors

In [13]:
# Run all models
def best_tree_model_full(A, B):
    bagging_results = make_yearly_predictions_bagging_full(A, B)
    gb_results = make_yearly_predictions_gb_full(A, B)
    decs_results = make_yearly_predictions_decs_full(A, B)
    rf_results = make_yearly_predictions_rf_full(A, B)

    # Combine all results
    all_results = pd.concat([bagging_results, gb_results, decs_results, rf_results], ignore_index=True)

    # Find best accuracy & precision for each year
    best_per_year = all_results.loc[all_results.groupby("Year")["Accuracy"].idxmax()]
    best_per_year_precision = all_results.loc[all_results.groupby("Year")["Precision"].idxmax()]

    # Compute average precision and accuracy per model
    avg_results = all_results.groupby("Model")[["Accuracy", "Precision"]].mean().reset_index()

    # Find the best overall model based on highest average accuracy
    best_model = avg_results.loc[avg_results["Accuracy"].idxmax()]

    # Find the best overall model based on highest average precision
    best_precision_model = avg_results.loc[avg_results["Precision"].idxmax()]

    # Display results
    print("Best Model Per Year (by Accuracy):")
    print(best_per_year)

    print("\nBest Model Per Year (by Precision):")
    print(best_per_year_precision)

    print("\nOverall Best Model (by Accuracy):")
    print(f"Model: {best_model['Model']}, Avg Precision: {best_model['Precision']:.4f}, Avg Accuracy: {best_model['Accuracy']:.4f}")

    print("\nOverall Best Model (by Precision):")
    print(f"Model: {best_precision_model['Model']}, Avg Precision: {best_precision_model['Precision']:.4f}, Avg Accuracy: {best_precision_model['Accuracy']:.4f}")
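
The three functions above differ only in which prediction helpers they call, so the shared comparison logic could be factored into one helper. The sketch below shows one possible shape for it. `summarize_results` and the demo numbers are hypothetical; the only thing taken from the code above is that each result frame has `Year`, `Model`, `Accuracy`, and `Precision` columns.

```python
import pandas as pd

def summarize_results(result_frames):
    """Combine per-model yearly results and pick the best models.

    Returns (best_per_year, avg_results, best_overall), all by accuracy.
    """
    all_results = pd.concat(result_frames, ignore_index=True)

    # Row with the highest accuracy within each year
    best_per_year = all_results.loc[all_results.groupby("Year")["Accuracy"].idxmax()]

    # Mean accuracy and precision per model across all years
    avg_results = all_results.groupby("Model")[["Accuracy", "Precision"]].mean().reset_index()

    # Model with the highest mean accuracy
    best_overall = avg_results.loc[avg_results["Accuracy"].idxmax()]
    return best_per_year, avg_results, best_overall

# Made-up numbers purely to exercise the helper
demo_bagging = pd.DataFrame({
    "Year": [2021, 2022], "Model": ["Bagging", "Bagging"],
    "Accuracy": [0.60, 0.62], "Precision": [0.58, 0.61],
})
demo_rf = pd.DataFrame({
    "Year": [2021, 2022], "Model": ["Random Forest", "Random Forest"],
    "Accuracy": [0.63, 0.61], "Precision": [0.60, 0.59],
})

best_per_year, avg_results, best_overall = summarize_results([demo_bagging, demo_rf])
print(best_per_year)
print(best_overall["Model"])
```

Returning the frames instead of printing them also makes the results easier to reuse in later cells.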


## Importance of Training Accuracy 

Training accuracy measures how well a machine learning model fits the training data. It is important to check training accuracy for the following reasons:

1. **Detecting Underfitting**  
   - If the training accuracy is **too low**, it means the model is **not learning enough** patterns from the data.  
   - This could be due to an **overly simple model**, insufficient features, or poor hyperparameters.

2. **Ensuring Model Competency**  
   - A model with **reasonable training accuracy** ensures that it has successfully learned meaningful patterns from the dataset.  
   - If the model cannot achieve high accuracy on the training data, it is unlikely to perform well on new data.

3. **Providing a Baseline for Comparison**  
   - Training accuracy helps us **compare** with testing accuracy to detect **overfitting**.  
   - If training accuracy is significantly higher than testing accuracy, the model might be **memorizing** rather than **generalizing**.

💡 **Key Insight**: While high training accuracy is desirable, it should not be the sole indicator of a good model. We must also check testing accuracy to ensure real-world performance.
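
The points above can be seen in a small experiment on synthetic, deliberately noisy data (an assumption for illustration, not the match data). A very shallow tree tends to underfit, while an unrestricted tree fits the training set perfectly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y randomly reassigns a share of the labels
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=3, flip_y=0.3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Compare training and testing accuracy as the tree is allowed to grow
results = []
for depth in (1, 3, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    results.append({
        "max_depth": depth,
        "train": tree.score(X_train, y_train),
        "test": tree.score(X_test, y_test),
    })

for row in results:
    print(row)
```

A large gap between the `train` and `test` columns for the unrestricted tree is the usual sign of memorizing rather than generalizing.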


## Models with Baseline Predictors

In [14]:
#best_tree_model_baseline(matches, matches)

## Models with Baseline Predictors + Rolling Predictors

In [15]:
#best_tree_model_rollling(matches, matches)

## Models with Full Set Predictors

In [17]:
#best_tree_model_full(new_matches,new_matches)

## Importance of Checking Testing Accuracy

Testing accuracy measures how well a machine learning model performs on **unseen** data. It is crucial to check testing accuracy for the following reasons:

1. **Evaluating Generalization**  
   - The primary goal of machine learning is to create models that **generalize well** to new data.  
   - A high testing accuracy indicates that the model can make reliable predictions on unseen samples.

2. **Detecting Overfitting**  
   - If the training accuracy is high but the testing accuracy is low, it suggests **overfitting**.  
   - Overfitting occurs when the model learns **specific details** of the training data rather than general patterns, making it unreliable for new data.

3. **Validating Model Performance**  
   - A model is only useful if it performs well on real-world data.  
   - Testing accuracy gives us a **realistic expectation** of how the model will behave when deployed.

4. **Comparing Different Models**  
   - By evaluating testing accuracy across different models, we can select the best model for **real-world applications**.  
   - The model with the highest **testing accuracy and precision** is often the best choice.

💡 **Key Insight**: A good model should have **both high training and testing accuracy**, with only a small gap between the two. Low accuracy on both suggests the model is too simple (underfitting); high training accuracy with much lower testing accuracy suggests it is too complex (overfitting).
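
A single train/test split gives only one estimate of generalization. As a complementary check, k-fold cross-validation averages accuracy over several held-out folds. The sketch below uses `cross_val_score` and `KFold` on synthetic data; the model settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the match data
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Five folds: each fold serves once as the held-out set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

For time-ordered data such as match results, shuffled folds can let later matches inform predictions of earlier ones, so a year-by-year split may be the safer choice.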


## Models with Baseline Predictors

In [16]:
#best_tree_model_baseline(matches, test_matches)

## Models with Baseline Predictors + Rolling Predictors

In [17]:
#best_tree_model_rollling(matches, test_matches)

## Models with Full Set Predictors

In [18]:
#best_tree_model_full(new_matches, new_test_matches)