## Predictor Groups  

### 1. Baseline Predictors (Static Factors)  
The first group consists of fundamental predictors that are **known before a match** and can plausibly affect the result. These include:  

- **Venue_code** – Encodes whether a team is playing home or away.  
- **Opp_code** – Represents the opponent team.  
- **Hour** – The hour of kickoff, which can influence player performance.  
- **Day_code** – Encodes the day of the week, considering potential rest and preparation effects.  

### 2. Rolling Averages (Performance-Based Predictors)  
The second group expands on the first by integrating **rolling averages** of key performance indicators. These provide a **dynamic perspective** by capturing recent form and trends:  

- **GF_rolling** – Rolling average of goals scored.  
- **GA_rolling** – Rolling average of goals conceded.  
- **Sh_rolling** – Rolling average of total shots taken.  
- **SoT_rolling** – Rolling average of shots on target.  
- **PK_rolling** – Rolling average of penalties scored.  
- **PKatt_rolling** – Rolling average of penalty attempts.  

### 3. Full Feature Set (Rank-Based Enhancement)  
The third group includes **all previous predictors** and adds **ranking-based features**, which give the model a measure of overall team strength:  

- **Rank** – The FIFA club ranking of the team.  
- **IsRanked** – A binary indicator of whether the team appears in FIFA club rankings.  

By adding these predictor groups one at a time, we can measure their **individual and combined impacts** on match outcome prediction accuracy. This stepwise comparison covers both static match conditions and dynamic team performance trends.  
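
The cumulative structure of the three groups can be sketched as plain Python lists. The column names come from the lists above; the dictionary name `predictor_groups` and its keys are illustrative assumptions, not names used elsewhere in this notebook.

```python
# Group 1: static factors known before the match
baseline = ["Venue_code", "Opp_code", "Hour", "Day_code"]

# Group 2 adds rolling averages of recent performance
rolling = ["GF_rolling", "GA_rolling", "Sh_rolling",
           "SoT_rolling", "PK_rolling", "PKatt_rolling"]

# Group 3 adds the rank-based features
rank = ["Rank", "IsRanked"]

# Each group extends the previous one (hypothetical container for illustration)
predictor_groups = {
    "baseline": baseline,
    "baseline_rolling": baseline + rolling,
    "full": baseline + rolling + rank,
}

for name, cols in predictor_groups.items():
    print(f"{name}: {len(cols)} predictors")
```

Keeping the groups in one mapping makes it straightforward to train the same model once per group and compare the scores.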


### 1. Baseline Predictors (Static Factors)  

Before using these predictors in our model, we need to **properly format and encode** them to ensure they are suitable for machine learning. The baseline predictors include **Venue_code, Opp_code, Hour, and Day_code**, which provide valuable pre-match context. These features influence match outcomes by capturing factors such as **home advantage, opponent strength, match timing, and scheduling effects**.  

To enhance their utility, we will **convert categorical variables into numerical representations** where necessary, ensuring that our model can effectively process and analyze them. This structured approach lays the groundwork for incorporating more advanced predictors in later stages.

In [1]:
import pandas as pd

def process_data(matches, test_matches):
    # Convert 'Date' column to datetime format
    matches["Date"] = pd.to_datetime(matches["Date"])
    test_matches["Date"] = pd.to_datetime(test_matches["Date"])

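    # Convert 'Venue' column to categorical codes for numerical processing.
    # The categories are built from both frames so that a given venue maps
    # to the same code in the training and test data.
    venue_cats = pd.concat([matches["Venue"], test_matches["Venue"]]).astype("category").cat.categories
    matches["Venue_code"] = pd.Categorical(matches["Venue"], categories=venue_cats).codes
    test_matches["Venue_code"] = pd.Categorical(test_matches["Venue"], categories=venue_cats).codes

    # Convert 'Opponent' column to categorical codes, again with shared categories
    opp_cats = pd.concat([matches["Opponent"], test_matches["Opponent"]]).astype("category").cat.categories
    matches["Opp_code"] = pd.Categorical(matches["Opponent"], categories=opp_cats).codes
    test_matches["Opp_code"] = pd.Categorical(test_matches["Opponent"], categories=opp_cats).codes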

    # Extract the hour from the "Time" column and fill missing values with "0"
    matches["Hour"] = matches["Time"].str.replace(":.+", "", regex=True).fillna("0").astype("int")
    test_matches["Hour"] = test_matches["Time"].str.replace(":.+", "", regex=True).fillna("0").astype("int")

    # Extract the day of the week from the "Date" column and assign as numerical code (0 = Monday, 6 = Sunday)
    matches["Day_code"] = matches["Date"].dt.dayofweek
    test_matches["Day_code"] = test_matches["Date"].dt.dayofweek

    # Create the binary "Target" column where 1 represents a "Win" (W) and 0 otherwise
    matches["Target"] = (matches["Result"] == "W").astype("int")
    test_matches["Target"] = (test_matches["Result"] == "W").astype("int")

    # Return the processed frames (they are also modified in place)
    return matches, test_matches
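
One point to watch when encoding a training frame and a test frame: `.cat.codes` numbers the categories found in each frame on its own, so the same opponent can receive different codes if the two frames do not contain the same set of teams. The sketch below uses made-up toy frames (`train` and `test` are hypothetical names) to show the mismatch and one possible way around it, which is to build the category list from both frames.

```python
import pandas as pd

# Toy frames for illustration; the test frame is missing one opponent
train = pd.DataFrame({"Opponent": ["Arsenal", "Chelsea", "Everton"]})
test = pd.DataFrame({"Opponent": ["Chelsea", "Everton"]})

# Encoding each frame separately: codes follow each frame's own sorted categories
naive_train = train["Opponent"].astype("category").cat.codes
naive_test = test["Opponent"].astype("category").cat.codes
print(naive_train.tolist())  # Chelsea is coded 1 here
print(naive_test.tolist())   # but Chelsea is coded 0 here

# Encoding with a shared category list keeps the mapping consistent
cats = pd.concat([train["Opponent"], test["Opponent"]]).astype("category").cat.categories
train["Opp_code"] = pd.Categorical(train["Opponent"], categories=cats).codes
test["Opp_code"] = pd.Categorical(test["Opponent"], categories=cats).codes
print(test["Opp_code"].tolist())  # Chelsea is now 1 in both frames
```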


### 2. Rolling Averages (Performance-Based Predictors)  
Before incorporating the rolling averages into our model, we will first calculate and format these performance-based features. These include rolling averages for goals scored, goals conceded, shots taken, and other key metrics. This preparation ensures that these dynamic indicators are ready for use, allowing us to capture recent team trends and form effectively.

In [2]:
def rolling_averages(group, cols, new_cols):
    # Sort the group by the "Date" column to ensure chronological order
    group = group.sort_values("Date")
    
    # Calculate rolling averages over a window of 3 rows, excluding the current row
    # (e.g., for row N, it computes the average of rows N-1, N-2, and N-3)
    rolling_stats = group[cols].rolling(3, closed='left').mean()
    
    # Assign the calculated rolling averages to new columns in the group
    group[new_cols] = rolling_stats
    
    # Drop rows where the new rolling average columns contain NaN values
    # (occurs when there aren't enough previous rows to calculate the average)
    group = group.dropna(subset=new_cols)
    
    return group
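
To make the effect of `closed="left"` concrete, here is a small sketch on made-up data. The frame `toy`, its `Team` column, and the goal counts are assumptions for illustration only. Each team is processed separately so that one team's matches never leak into another team's averages.

```python
import pandas as pd

# Toy data: team A has five matches, team B has four
toy = pd.DataFrame({
    "Team": ["A"] * 5 + ["B"] * 4,
    "Date": pd.to_datetime([
        "2023-08-01", "2023-08-08", "2023-08-15", "2023-08-22", "2023-08-29",
        "2023-08-02", "2023-08-09", "2023-08-16", "2023-08-23",
    ]),
    "GF": [1, 2, 3, 4, 5, 0, 2, 4, 6],
})

parts = []
for team, group in toy.groupby("Team"):
    group = group.sort_values("Date").copy()
    # Average of the three previous matches; the current match is excluded
    group["GF_rolling"] = group["GF"].rolling(3, closed="left").mean()
    # The first three matches of each team have no full window and are dropped
    parts.append(group.dropna(subset=["GF_rolling"]))

rolled = pd.concat(parts).reset_index(drop=True)
print(rolled[["Team", "Date", "GF", "GF_rolling"]])
```

Because the window excludes the current row, each rolling value depends only on matches played before the one being predicted.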

### 3. Full Feature Set (Rank-Based Enhancement)  
For the rank-based features, we use the **FIFA club rankings** data retrieved from their database. We encode each team's ranking and create a binary indicator for whether the team is ranked. This brings global performance data into the model in the form of structured, rank-based features.
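
A minimal sketch of how `Rank` and `IsRanked` could be derived is shown below. The ranking table, the team names, the `Team` join column, and the placeholder rank for unranked teams are all assumptions made for illustration; the actual ranking data and its handling of unranked teams may differ.

```python
import pandas as pd

# Hypothetical ranking table; names and ranks are made up
rankings = pd.DataFrame({"Team": ["Team A", "Team B"], "Rank": [3, 17]})

# Hypothetical match rows; "Team C" does not appear in the rankings
matches = pd.DataFrame({"Team": ["Team A", "Team B", "Team C"]})

# Left join keeps every match row; unranked teams get a missing Rank
matches = matches.merge(rankings, on="Team", how="left")

# IsRanked: 1 if the team appears in the ranking table, 0 otherwise
matches["IsRanked"] = matches["Rank"].notna().astype("int")

# Assumption: unranked teams receive a placeholder one place below the worst listed rank
unranked_rank = int(rankings["Rank"].max()) + 1
matches["Rank"] = matches["Rank"].fillna(unranked_rank).astype("int")

print(matches)
```

Pairing the placeholder rank with `IsRanked` lets a model tell a truly low-ranked team apart from one that simply has no ranking.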

In [None]:
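import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Function to find the optimal ccp_alpha (cost-complexity pruning strength)
# for a decision tree trained on the static predictors
def find_optimal_alpha(Train):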
    static_predictors = ["Venue_code", "Opp_code", "Hour", "Day_code"]
    
    dt = DecisionTreeClassifier(random_state=1)
    path = dt.cost_complexity_pruning_path(Train[static_predictors], Train["Target"])
    ccp_alphas = path.ccp_alphas[:-1]  # Exclude the last value to avoid a single-node tree
    
    kf = KFold(n_splits=5, shuffle=True, random_state=1)
    alpha_scores = {}
    
    for alpha in ccp_alphas:
        dt = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
        scores = cross_val_score(dt, Train[static_predictors], Train["Target"], cv=kf, scoring='accuracy')
        alpha_scores[alpha] = np.mean(scores)
    
    best_alpha = max(alpha_scores, key=alpha_scores.get)
    print(f"Best ccp_alpha: {best_alpha:.6f} with Accuracy: {alpha_scores[best_alpha]:.4f}")
    return best_alpha
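
To illustrate what `ccp_alpha` controls, the sketch below fits an unpruned and a pruned tree on synthetic data and compares their sizes. The data, the variable names, and the choice of a mid-path alpha are assumptions for illustration; in practice the alpha would come from a cross-validated search like the one above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: four integer features, a noisy binary target
rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=(200, 4))
y = (X[:, 0] + rng.integers(0, 3, size=200) > 5).astype(int)

# Unpruned tree and its cost-complexity pruning path
full_tree = DecisionTreeClassifier(random_state=1).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)

# Pick an alpha from the middle of the path (illustrative choice only);
# the clamp is defensive, since ccp_alpha must be non-negative
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)

pruned_tree = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y)

print("alpha:", alpha)
print("leaves (unpruned):", full_tree.get_n_leaves())
print("leaves (pruned):", pruned_tree.get_n_leaves())
```

Larger values of `ccp_alpha` prune more aggressively, so the pruned tree never has more leaves than the unpruned one.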
