# Classification Tree

A **Classification Tree** is a type of decision tree used for classification tasks. It works by recursively splitting the dataset into subsets based on feature values, aiming to maximize the separation between different classes. The final result is a tree-like model where each leaf node represents a class label.

## How It Works:
- **Recursive Partitioning**: The dataset is split into smaller groups using the most informative features.
- **Gini Impurity / Entropy**: The quality of splits is determined using metrics like Gini Impurity or Entropy.
- **Tree Growth**: The process continues until a stopping criterion is met (e.g., maximum depth, minimum samples per split).
- **Prediction**: For a new input, the model follows the decision path from the root to a leaf and assigns the majority class of the training samples in that leaf.
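
As a rough illustration of the split metrics named above, the sketch below computes Gini impurity and entropy for a small label array and compares two candidate splits. The helper names (`gini_impurity`, `entropy`, `weighted_impurity`) and the toy labels are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_impurity(left, right, impurity=gini_impurity):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

# Toy parent node with two balanced classes (illustrative data only)
parent = np.array([0, 0, 0, 1, 1, 1])

# Split A separates the classes completely; split B leaves both children mixed
pure_split = weighted_impurity(parent[:3], parent[3:])
mixed_split = weighted_impurity(parent[[0, 3]], parent[[1, 2, 4, 5]])

print("Parent Gini:", gini_impurity(parent), "Parent entropy:", entropy(parent))
print("Split A Gini:", pure_split, "Split B Gini:", mixed_split)
```

A tree builder would prefer split A here, because it lowers the weighted impurity the most relative to the parent node.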

## Advantages:
✅ **Easy to Interpret**: The decision-making process is visual and intuitive.  
✅ **Requires Minimal Data Preprocessing**: No need for feature scaling or normalization.  
✅ **Captures Non-Linear Relationships**: Works well with complex decision boundaries.  

## Disadvantages:
❌ **Prone to Overfitting**: Without pruning, the tree can become too complex and fit noise in the data.  
❌ **Unstable**: Small changes in data can result in a significantly different tree.  
❌ **Less Accurate Than Ensembles**: Single decision trees are often outperformed by ensemble methods like Random Forests and Gradient Boosting.  


In [4]:
# Importing all the necessary dependencies
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import cross_val_score, KFold

In [5]:
%run ../Data/Data_Formatting.ipynb

In [6]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [7]:
#loading the training dataset 
train_path = Path("../Data/premierleague_team_data.csv")
matches = pd.read_csv(train_path)

#loading the testing data 
test_path = Path("../Data/premierleague_test_team_data.csv")
test_matches = pd.read_csv(test_path)

In [8]:
#loading the training dataset with rank
train_path = Path("../Data/premierleague_rank_team_data.csv")
new_matches = pd.read_csv(train_path)

#loading the testing data with rank
test_path = Path("../Data/premierleague_rank_test_team_data.csv")
new_test_matches = pd.read_csv(test_path)

In [9]:
process_data(matches, test_matches)

In [10]:
process_data(new_matches, new_test_matches)

## **Pruning in Classification Trees**  
Pruning helps prevent **overfitting** by reducing the size of a decision tree, which often improves accuracy on unseen data. Without pruning, a tree may **memorize** the training data rather than generalize to new data.  

### **Post-Pruning (Cost Complexity Pruning - CCP)**  
In post-pruning, the tree is first grown to full depth (even if it overfits) and then gradually pruned by removing nodes according to a complexity parameter α.  

#### **How Does CCP Work?**  
The pruning process minimizes the following equation:  

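$$
\text{Total Cost} = R(T) + \alpha \times \text{Number of Leaves}
$$

- **R(T)** measures the training error of the tree T. For a classification tree this is the misclassification rate or the total sample-weighted impurity of the leaves (scikit-learn uses the latter); the residual sum of squares (RSS) plays this role only for regression trees.  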
- **α** is a tuning parameter that controls the trade-off between tree complexity and error.  
  - **Higher α** → More pruning → Simpler tree.  
  - **Lower α** → Less pruning → More complex tree.  
- The value of **α** can be chosen using cross-validation.
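
The effect of α described above can be seen on a small synthetic dataset. The sketch below is only an illustration: the data comes from `make_classification` (an assumption, not the Premier League data used in this notebook), and it simply refits a tree for each candidate α returned by `cost_complexity_pruning_path` and records the number of leaves.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the notebook's match data)
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=1)

# Candidate alphas at which the fully grown tree loses a subtree
full_tree = DecisionTreeClassifier(random_state=1)
path = full_tree.cost_complexity_pruning_path(X, y)

# Refit once per alpha and record how many leaves survive pruning
leaf_counts = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y)
    leaf_counts.append(tree.get_n_leaves())

for alpha, leaves in zip(path.ccp_alphas, leaf_counts):
    print(f"alpha={alpha:.5f} -> {leaves} leaves")
```

The leaf count should shrink as α grows, and the largest α typically collapses the tree to a single node, which is why the function below drops the last value.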


In [11]:
# Function to find optimal ccp_alpha
def find_optimal_alpha(Train):
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code"]
    
    dt = DecisionTreeClassifier(random_state=1)
    path = dt.cost_complexity_pruning_path(Train[static_predictors], Train["Target"])
    ccp_alphas = path.ccp_alphas[:-1]  # Exclude the last value to avoid a single-node tree
    
    kf = KFold(n_splits=5, shuffle=True, random_state=1)
    alpha_scores = {}
    
    for alpha in ccp_alphas:
        dt = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
        scores = cross_val_score(dt, Train[static_predictors], Train["Target"], cv=kf, scoring='accuracy')
        alpha_scores[alpha] = np.mean(scores)
    
    best_alpha = max(alpha_scores, key=alpha_scores.get)
    print(f"Best ccp_alpha: {best_alpha:.6f} with Accuracy: {alpha_scores[best_alpha]:.4f}")
    return best_alpha


### Classification Tree using Baseline Predictors (refer to /Data/Data_Formatting.ipynb)

In [12]:
# Function to make yearly predictions
def make_yearly_predictions_decs(Train, Test):
    best_alpha = find_optimal_alpha(Train)
    # Convert 'Date' columns to datetime and sort data
    Train['Date'] = pd.to_datetime(Train['Date'], errors='coerce')
    Test['Date'] = pd.to_datetime(Test['Date'], errors='coerce')
    Train = Train.dropna(subset=['Date']).sort_values(by='Date')
    Test = Test.dropna(subset=['Date']).sort_values(by='Date')
    
    # Define static predictors
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code"]

    # Train the decision tree with the cross-validated ccp_alpha found above
    dt = DecisionTreeClassifier(max_depth=10, min_samples_split=10, ccp_alpha=best_alpha, random_state=1)
    dt.fit(Train[static_predictors], Train["Target"])

    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:
            # Predict on this year's test matches
            preds = dt.predict(test_year[static_predictors])

            # Calculate precision and accuracy
            precision = precision_score(test_year["Target"], preds, average="weighted")
            accuracy = accuracy_score(test_year["Target"], preds)

            # Append results to list
            results.append({
                "Model": "Classification Tree",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    return results_df

### Classification Tree using Baseline Predictors + Rolling Predictors (refer to /Data/Data_Formatting.ipynb)

In [13]:
def make_yearly_predictions_decs_rolling(Train, Test):
    best_alpha = find_optimal_alpha(Train)
    # Convert 'Date' columns to datetime and sort data
    Train['Date'] = pd.to_datetime(Train['Date'], errors='coerce')
    Test['Date'] = pd.to_datetime(Test['Date'], errors='coerce')
    Train = Train.dropna(subset=['Date']).sort_values(by='Date')
    Test = Test.dropna(subset=['Date']).sort_values(by='Date')
    
    # Define the feature columns for which we'll calculate rolling averages
    cols = ["GF", "GA", "Sh", "SoT", "PK", "PKatt"]
    new_cols = [f"{c}_rolling" for c in cols]
    
    # Apply rolling averages to both Train and Test datasets
    train_results = []
    for team, group in Train.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        train_results.append(result)
    Train = pd.concat(train_results)
    
    test_results = []
    for team, group in Test.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        test_results.append(result)
    Test = pd.concat(test_results)
    
    # Define static and rolling predictors
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code"]
    rolling_predictors = new_cols
    all_predictors = static_predictors + rolling_predictors

    # Train the decision tree with the cross-validated ccp_alpha found above
    dt = DecisionTreeClassifier(max_depth=10, min_samples_split=10, ccp_alpha=best_alpha, random_state=1)
    dt.fit(Train[all_predictors], Train["Target"])

    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:
            # Predict on this year's test matches
            preds = dt.predict(test_year[all_predictors])

            # Calculate precision and accuracy
            precision = precision_score(test_year["Target"], preds, average="weighted")
            accuracy = accuracy_score(test_year["Target"], preds)

            # Append results to list
            results.append({
                "Model": "Classification Tree",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    return results_df
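
The `rolling_averages` helper used above is defined in `Data_Formatting.ipynb` and is not shown in this notebook. As a minimal sketch of what such a helper might do, the hypothetical `rolling_averages_sketch` below averages each statistic over a team's previous matches only. The function name, the window of 3 matches, and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def rolling_averages_sketch(group, cols, new_cols, window=3):
    """Hypothetical stand-in for rolling_averages; window=3 is an assumption."""
    group = group.sort_values("Date")
    # closed="left" excludes the current match, so a row only sees earlier matches
    rolling_stats = group[cols].rolling(window, closed="left").mean()
    for col, new_col in zip(cols, new_cols):
        group[new_col] = rolling_stats[col]
    # Rows without a full window of earlier matches have NaN and are dropped
    return group.dropna(subset=new_cols)

# Toy single-team data (illustrative only)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-08", "2023-01-15",
                            "2023-01-22", "2023-01-29"]),
    "GF": [1, 2, 3, 4, 5],
})

out = rolling_averages_sketch(toy, ["GF"], ["GF_rolling"])
print(out)
```

Excluding the current match from the window matters here: including it would leak the match's own outcome statistics into the features used to predict that match.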

### Classification Tree using Full Feature Set (refer to /Data/Data_Formatting.ipynb)

In [14]:
def make_yearly_predictions_decs_full(Train, Test):
    best_alpha = find_optimal_alpha(Train)
    # Convert 'Date' columns to datetime and sort data
    Train['Date'] = pd.to_datetime(Train['Date'], errors='coerce')
    Test['Date'] = pd.to_datetime(Test['Date'], errors='coerce')
    Train = Train.dropna(subset=['Date']).sort_values(by='Date')
    Test = Test.dropna(subset=['Date']).sort_values(by='Date')
    
    # Define the feature columns for which we'll calculate rolling averages
    cols = ["GF", "GA", "Sh", "SoT", "PK", "PKatt"]
    new_cols = [f"{c}_rolling" for c in cols]
    
    # Apply rolling averages to both Train and Test datasets
    train_results = []
    for team, group in Train.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        train_results.append(result)
    Train = pd.concat(train_results)
    
    test_results = []
    for team, group in Test.groupby("Team"):
        result = rolling_averages(group, cols, new_cols)
        test_results.append(result)
    Test = pd.concat(test_results)
    
    # Define static and rolling predictors
    static_predictors =  ["Venue_code", "Opp_code", "Hour", "Day_code","Rank","IsRanked"]
    rolling_predictors = new_cols
    all_predictors = static_predictors + rolling_predictors
    
    # Train the decision tree with the cross-validated ccp_alpha found above
    dt = DecisionTreeClassifier(max_depth=10, min_samples_split=10, ccp_alpha=best_alpha, random_state=1)
    dt.fit(Train[all_predictors], Train["Target"])
    
    results = []
    for year in range(Test['Date'].dt.year.min(), Test['Date'].dt.year.max() + 1):
        test_year = Test[Test['Date'].dt.year == year]
        if not test_year.empty:
            preds = dt.predict(test_year[all_predictors])
            
            precision = precision_score(test_year["Target"], preds, average="weighted")
            accuracy = accuracy_score(test_year["Target"], preds)
            
            # Append results to list
            results.append({
                "Model": "Classification Tree",
                "Year": year,
                "Precision": precision,
                "Accuracy": accuracy
            })

    # Convert results to DataFrame
    results_df = pd.DataFrame(results)

    return results_df
