# Player Performance Predictor - Phase 3: Format-Specific Model Training

**Goal:** Train specialist models for each cricket format (T20, ODI, Test) using our new, enriched database. This notebook is designed to be run multiple times, once for each format.

### Step 1: Configuration - CHOOSE THE FORMAT TO TRAIN

**Action:** Set the `MATCH_FORMAT_TO_TRAIN` variable below to one of 'T20', 'ODI', or 'Test', then run the entire notebook (`Kernel > Restart & Run All`).

In [None]:
MATCH_FORMAT_TO_TRAIN = "T20" # <-- CHANGE THIS VALUE (e.g., "ODI" or "Test")

print(f"Configuration set to train models for: {MATCH_FORMAT_TO_TRAIN}")
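
Because Step 2 filters on an exact string match against `match_type`, a typo such as "t20" would silently produce empty dataframes. A minimal sketch of a guard follows; `VALID_FORMATS` and `validate_format` are hypothetical names introduced here, not part of the notebook.

```python
# Hypothetical guard: the allowed values are taken from the instructions in Step 1.
VALID_FORMATS = {"T20", "ODI", "Test"}


def validate_format(match_format: str) -> str:
    """Return the format unchanged, or raise if it is not a supported value."""
    if match_format not in VALID_FORMATS:
        raise ValueError(
            f"Unknown format {match_format!r}; expected one of {sorted(VALID_FORMATS)}"
        )
    return match_format


MATCH_FORMAT_TO_TRAIN = validate_format("T20")
print(f"Configuration set to train models for: {MATCH_FORMAT_TO_TRAIN}")
```

Failing early here is cheaper than discovering an empty training set after the database has been loaded.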

### Step 2: Load Data & Filter by Format

In [None]:
import pandas as pd
import sqlite3
import joblib
import os
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

sns.set_style('whitegrid')
db_path = '../cricket_data.db'
conn = sqlite3.connect(db_path)
batting_df_full = pd.read_sql_query("SELECT * FROM batting_innings", conn)
bowling_df_full = pd.read_sql_query("SELECT * FROM bowling_innings", conn)
conn.close()

# --- CRUCIAL STEP: Filter data for the chosen format --- #
batting_df = batting_df_full[batting_df_full['match_type'] == MATCH_FORMAT_TO_TRAIN].copy()
bowling_df = bowling_df_full[bowling_df_full['match_type'] == MATCH_FORMAT_TO_TRAIN].copy()

batting_df['date'] = pd.to_datetime(batting_df['date'])
bowling_df['date'] = pd.to_datetime(bowling_df['date'])

print(f"Data loaded. Found {len(batting_df)} batting and {len(bowling_df)} bowling innings for {MATCH_FORMAT_TO_TRAIN}.")
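
The cell above loads every row and filters in pandas. If memory becomes a concern, the filter could instead be pushed into the SQL query. The sketch below uses a toy in-memory table whose schema is an assumption (the real `batting_innings` table has more columns) to show a parameterized query with date parsing on load.

```python
import sqlite3

import pandas as pd

# Toy in-memory database standing in for cricket_data.db (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE batting_innings (player TEXT, match_type TEXT, runs INTEGER, date TEXT)"
)
conn.executemany(
    "INSERT INTO batting_innings VALUES (?, ?, ?, ?)",
    [
        ("A", "T20", 30, "2023-01-01"),
        ("A", "ODI", 55, "2023-02-01"),
        ("B", "T20", 12, "2023-01-01"),
    ],
)

# The ? placeholder lets sqlite3 bind the value; parse_dates converts the column on load.
t20_df = pd.read_sql_query(
    "SELECT * FROM batting_innings WHERE match_type = ?",
    conn,
    params=("T20",),
    parse_dates=["date"],
)
conn.close()
print(t20_df)
```

Note that Step 4 builds its `match_teams` lookup from `batting_df_full`, so adopting this approach would mean building that lookup from the format-filtered dataframe instead.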

## Part A: Training the Batting Prediction Model

### Step 3: Filter for Consistently Active Batters in this Format

In [None]:
# Define the cutoff date (3 years ago from today)
cutoff_date = pd.Timestamp.now() - pd.DateOffset(years=3)

# Create a dataframe of recent innings for this format
recent_batting_df = batting_df[batting_df['date'] >= cutoff_date]

# Count player appearances in the recent period for this format
n_min_recent_appearances = 10
recent_player_counts = recent_batting_df['player'].value_counts()

# Get the list of players who meet the threshold
final_batters_list = recent_player_counts[recent_player_counts >= n_min_recent_appearances].index
print(f"Found {len(final_batters_list)} batters with at least {n_min_recent_appearances} {MATCH_FORMAT_TO_TRAIN} batting innings in the last 3 years.")

# Filter the main format dataframe to include the ENTIRE CAREER (in this format) of these players
batting_df_filtered = batting_df[batting_df['player'].isin(final_batters_list)].copy()
print(f"The final filtered dataset contains {len(batting_df_filtered)} total {MATCH_FORMAT_TO_TRAIN} innings from these players.")
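
The filtering logic above can be illustrated on toy data. In this sketch, `filter_active_players` is a hypothetical helper and the cutoff is a fixed date rather than `pd.Timestamp.now()`, which keeps the selected player set the same from one run to the next.

```python
import pandas as pd


def filter_active_players(df, cutoff_date, min_recent):
    """Keep full-career rows for players with at least min_recent rows on or after cutoff_date."""
    recent_counts = df.loc[df["date"] >= cutoff_date, "player"].value_counts()
    active = recent_counts[recent_counts >= min_recent].index
    return df[df["player"].isin(active)].copy()


toy = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2015-06-01", "2023-01-10", "2023-03-05", "2016-02-01", "2023-04-01"]
    ),
    "runs": [40, 12, 33, 8, 51],
})
cutoff = pd.Timestamp("2022-01-01")  # fixed date, chosen only for this example
active_df = filter_active_players(toy, cutoff, min_recent=2)
print(active_df)
```

Player A has two innings after the cutoff and is kept together with the 2015 innings; player B has only one recent innings and is dropped entirely.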

### Step 4: Feature Engineering on Filtered Batting Data

In [None]:
print("Creating features for the filtered batting data...")
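# Map each match_id to the list of teams that batted in it
match_teams = batting_df_full.groupby('match_id')['team'].unique().apply(list).to_dict()

def find_opposition(row):
    """Return the other team in the row's match, or 'N/A' if the match does not list exactly two teams."""
    teams_in_match = match_teams.get(row['match_id'])
    if teams_in_match and len(teams_in_match) == 2:
        return teams_in_match[1] if teams_in_match[0] == row['team'] else teams_in_match[0]
    return 'N/A'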

batting_df_filtered.loc[:, 'against_team'] = batting_df_filtered.apply(find_opposition, axis=1)
batting_df_filtered = batting_df_filtered.sort_values(by=['player', 'date'])
batting_df_filtered.loc[:, 'form_last_5_innings'] = batting_df_filtered.groupby('player')['runs'].transform(lambda x: x.rolling(window=5, min_periods=1).mean().shift(1))
batting_df_filtered['form_last_5_innings'] = batting_df_filtered['form_last_5_innings'].fillna(0)
print("Feature engineering complete.")
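
To see why the `.shift(1)` matters, the rolling-form calculation can be run on a handful of made-up scores. The sketch below uses the same `transform` expression as the cell above on toy data.

```python
import pandas as pd

toy = pd.DataFrame({
    "player": ["A"] * 4 + ["B"] * 2,
    "runs": [10, 20, 30, 40, 5, 15],
})

# Rolling mean of up to 5 innings, then shifted so each row only sees earlier innings.
toy["form"] = toy.groupby("player")["runs"].transform(
    lambda x: x.rolling(window=5, min_periods=1).mean().shift(1)
)
print(toy)
```

For player A the form values are NaN, 10.0, 15.0 and 20.0: each row holds the mean of the innings before it, so the target score never leaks into its own feature. The NaN on each player's first innings is what the notebook then fills with 0, which treats a debut the same as a run of zero scores.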

### Step 5: Prepare Batting Data for Modeling

In [None]:
categorical_features = ['player', 'venue', 'against_team']
numerical_features = ['form_last_5_innings']
target_batting = 'runs'

encoded_features = pd.get_dummies(batting_df_filtered[categorical_features], columns=categorical_features)
X_batting = pd.concat([encoded_features, batting_df_filtered[numerical_features]], axis=1)
y_batting = batting_df_filtered[target_batting]

print("Batting data prepared.")
print(f"Final shape of {MATCH_FORMAT_TO_TRAIN} batting feature matrix (X):", X_batting.shape)
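
The width of the feature matrix printed above follows directly from the one-hot encoding: one column per distinct player, venue and opposition, plus the numerical form column. A small sketch on made-up rows shows the arithmetic.

```python
import pandas as pd

toy = pd.DataFrame({
    "player": ["A", "A", "B"],
    "venue": ["Lord's", "MCG", "MCG"],
    "against_team": ["India", "India", "England"],
    "form_last_5_innings": [0.0, 25.0, 0.0],
})
categorical = ["player", "venue", "against_team"]

encoded = pd.get_dummies(toy[categorical], columns=categorical)
X = pd.concat([encoded, toy[["form_last_5_innings"]]], axis=1)

# 2 players + 2 venues + 2 opponents + 1 numerical feature = 7 columns
expected_width = sum(toy[c].nunique() for c in categorical) + 1
print(X.shape, expected_width)
```

With hundreds of players and venues the matrix becomes wide and mostly zeros, which is worth keeping in mind when reading the shape reported for the real data.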

### Step 6: Train the Batting Model

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_batting, y_batting, test_size=0.2, random_state=42)

batting_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1, verbose=2)

print(f"Training the {MATCH_FORMAT_TO_TRAIN} batting model...")
batting_model.fit(X_train, y_train)
print("Model training complete.")

predictions = batting_model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"\n--- {MATCH_FORMAT_TO_TRAIN} Batting Model Evaluation ---")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R-squared (R2 Score): {r2:.2f}")
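
The split in this step is random, so some test innings predate innings in the training set. One alternative, shown here only as a sketch and not as what this notebook does, is a chronological split that holds out the most recent innings. `chronological_split` is a hypothetical helper.

```python
import pandas as pd


def chronological_split(df, date_col="date", test_fraction=0.2):
    """Split so that every test row is dated on or after every training row."""
    ordered = df.sort_values(date_col)
    split_at = int(len(ordered) * (1 - test_fraction))
    return ordered.iloc[:split_at], ordered.iloc[split_at:]


toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-01", "2023-05-01", "2020-03-01", "2022-07-01", "2019-09-01"]
    ),
    "runs": [10, 45, 3, 27, 60],
})
train_df, test_df = chronological_split(toy)
print(train_df["date"].max(), test_df["date"].min())
```

Evaluating on the latest innings is closer to how the model would be used, since predictions are made for matches that have not happened yet.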

### Step 7: Save the Trained Batting Model

In [None]:
# Create the output directory if it does not already exist
os.makedirs('../models', exist_ok=True)

model_filename = f'../models/{MATCH_FORMAT_TO_TRAIN.lower()}_batting_model.joblib'
columns_filename = f'../models/{MATCH_FORMAT_TO_TRAIN.lower()}_batting_columns.joblib'

joblib.dump(batting_model, model_filename)
joblib.dump(X_batting.columns, columns_filename)

print(f"{MATCH_FORMAT_TO_TRAIN} batting model and columns saved successfully.")
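
The column list is saved alongside the model so that new data can be given the same column layout at prediction time. The sketch below shows one way this could work, assuming the consumer uses `DataFrame.reindex`; the toy data, the temporary directory and the file names are illustrative only.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Tiny stand-in for the training matrix; the real columns come from Step 5.
train = pd.DataFrame({
    "player": ["A", "B", "A", "B"],
    "venue": ["MCG", "MCG", "Eden", "Eden"],
    "form_last_5_innings": [0.0, 0.0, 30.0, 12.0],
})
X_train = pd.concat(
    [pd.get_dummies(train[["player", "venue"]], dtype=int), train[["form_last_5_innings"]]],
    axis=1,
)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_train, [30, 12, 44, 20])

# Save and reload, mirroring the two joblib.dump calls above.
with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(model, os.path.join(tmp, "model.joblib"))
    joblib.dump(X_train.columns, os.path.join(tmp, "columns.joblib"))
    loaded_model = joblib.load(os.path.join(tmp, "model.joblib"))
    loaded_columns = joblib.load(os.path.join(tmp, "columns.joblib"))

# A new fixture at a venue never seen in training ("Gabba").
new_row = pd.DataFrame({"player": ["A"], "venue": ["Gabba"], "form_last_5_innings": [25.0]})
X_new = pd.concat(
    [pd.get_dummies(new_row[["player", "venue"]], dtype=int), new_row[["form_last_5_innings"]]],
    axis=1,
)

# reindex adds missing training columns as 0 and drops columns the model never saw.
X_new = X_new.reindex(columns=loaded_columns, fill_value=0)
prediction = loaded_model.predict(X_new)
print(prediction)
```

Without the saved columns, a single new row would encode to only a few dummy columns and the model would reject it for having the wrong features.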

---

## Part B: Training the Bowling Prediction Model

### Step 8: Filter for Consistently Active Bowlers in this Format

In [None]:
# Create a dataframe of recent bowling innings only
# (reuses cutoff_date and n_min_recent_appearances defined in Step 3)
recent_bowling_df = bowling_df[bowling_df['date'] >= cutoff_date]

# Count player appearances in the recent period for this format
recent_player_counts_bowling = recent_bowling_df['player'].value_counts()

# Get the list of players who meet the threshold
final_bowlers_list = recent_player_counts_bowling[recent_player_counts_bowling >= n_min_recent_appearances].index
print(f"Found {len(final_bowlers_list)} bowlers with at least {n_min_recent_appearances} {MATCH_FORMAT_TO_TRAIN} bowling innings in the last 3 years.")

# Filter the main format dataframe to include the ENTIRE CAREER (in this format) of these players
bowling_df_filtered = bowling_df[bowling_df['player'].isin(final_bowlers_list)].copy()
print(f"The final filtered dataset contains {len(bowling_df_filtered)} total {MATCH_FORMAT_TO_TRAIN} innings from these players.")

### Step 9: Feature Engineering on Filtered Bowling Data

In [None]:
print("\nCreating features for the filtered bowling data...")
bowling_df_filtered.loc[:, 'against_team'] = bowling_df_filtered.apply(find_opposition, axis=1)
bowling_df_filtered = bowling_df_filtered.sort_values(by=['player', 'date'])
bowling_df_filtered.loc[:, 'form_last_5_wickets'] = bowling_df_filtered.groupby('player')['wickets'].transform(lambda x: x.rolling(window=5, min_periods=1).mean().shift(1))
bowling_df_filtered['form_last_5_wickets'] = bowling_df_filtered['form_last_5_wickets'].fillna(0)
print("Feature engineering complete.")
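
Steps 4 and 9 apply the same rolling-form logic to different columns. As a sketch of how the duplication could be removed, the hypothetical helper `add_form_feature` below wraps the sort, the shifted rolling mean and the zero fill in one function.

```python
import pandas as pd


def add_form_feature(df, value_col, new_col, window=5):
    """Add the mean of each player's previous `window` values, excluding the current row."""
    out = df.sort_values(["player", "date"]).copy()
    out[new_col] = (
        out.groupby("player")[value_col]
        .transform(lambda s: s.rolling(window=window, min_periods=1).mean().shift(1))
        .fillna(0)
    )
    return out


toy = pd.DataFrame({
    "player": ["X", "X", "X"],
    "date": pd.to_datetime(["2023-01-03", "2023-01-01", "2023-01-02"]),
    "wickets": [3, 1, 2],
})
result = add_form_feature(toy, "wickets", "form_last_5_wickets")
print(result)
```

The same call with `"runs"` and `"form_last_5_innings"` would reproduce the batting feature from Step 4.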

### Step 10: Prepare Bowling Data for Modeling

In [None]:
categorical_features = ['player', 'venue', 'against_team']
numerical_features = ['form_last_5_wickets']
target_bowling = 'wickets'

encoded_features_bowling = pd.get_dummies(bowling_df_filtered[categorical_features], columns=categorical_features)
X_bowling = pd.concat([encoded_features_bowling, bowling_df_filtered[numerical_features]], axis=1)
y_bowling = bowling_df_filtered[target_bowling]

print("Bowling data prepared.")
print(f"Final shape of {MATCH_FORMAT_TO_TRAIN} bowling feature matrix (X):", X_bowling.shape)

### Step 11: Train the Bowling Model

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_bowling, y_bowling, test_size=0.2, random_state=42)

bowling_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1, verbose=2)

print(f"Training the {MATCH_FORMAT_TO_TRAIN} bowling model...")
bowling_model.fit(X_train, y_train)
print("Model training complete.")

predictions = bowling_model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"\n--- {MATCH_FORMAT_TO_TRAIN} Bowling Model Evaluation ---")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R-squared (R2 Score): {r2:.2f}")
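
An MAE figure is easier to interpret next to a naive baseline. The sketch below, which uses synthetic wicket counts rather than the notebook's data, computes the MAE of always predicting the training mean with scikit-learn's `DummyRegressor`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic wicket counts, used only to make the sketch runnable.
rng = np.random.default_rng(42)
y_train = rng.integers(0, 5, size=200)
y_test = rng.integers(0, 5, size=50)
X_train = np.zeros((200, 1))  # DummyRegressor ignores the features
X_test = np.zeros((50, 1))

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
print(f"Baseline MAE (always predict the training mean): {baseline_mae:.2f}")
```

If the random forest's MAE is not clearly below this baseline on the real data, the features are adding little beyond the average outcome.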

### Step 12: Save the Trained Bowling Model

In [None]:
model_filename_bowling = f'../models/{MATCH_FORMAT_TO_TRAIN.lower()}_bowling_model.joblib'
columns_filename_bowling = f'../models/{MATCH_FORMAT_TO_TRAIN.lower()}_bowling_columns.joblib'

joblib.dump(bowling_model, model_filename_bowling)
joblib.dump(X_bowling.columns, columns_filename_bowling)

print(f"{MATCH_FORMAT_TO_TRAIN} bowling model and columns saved successfully.")
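
After the notebook has been run once per format, the models directory should hold four files for each format. The sketch below lists the expected paths using the naming pattern from Steps 7 and 12; `artifact_paths` is a hypothetical helper.

```python
import os


def artifact_paths(match_format, discipline, models_dir="../models"):
    """Return (model_path, columns_path) following the naming used in Steps 7 and 12."""
    prefix = f"{match_format.lower()}_{discipline}"
    return (
        os.path.join(models_dir, f"{prefix}_model.joblib"),
        os.path.join(models_dir, f"{prefix}_columns.joblib"),
    )


for fmt in ("T20", "ODI", "Test"):
    for discipline in ("batting", "bowling"):
        model_path, columns_path = artifact_paths(fmt, discipline)
        print(model_path, columns_path)
```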