# Player Performance Predictor - Definitive Balanced Model Training

**Goal:** Build our final models, aiming for high accuracy, using scikit-learn's `class_weight` parameter to handle class imbalance by reweighting classes during training. This is memory-efficient and lets us train on the full, unfiltered format-specific data.
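
As a rough illustration of what `class_weight='balanced'` does, the sketch below reproduces scikit-learn's documented heuristic, `n_samples / (n_classes * count_per_class)`, with plain NumPy. The counts are the batting test-split supports from the Step 5 classification report, used only as stand-ins for the training distribution (the split is stratified, so the proportions should be similar). `balanced_weights` is a hypothetical helper name, not part of this notebook.

```python
import numpy as np

def balanced_weights(class_counts):
    """Per-class weights via the 'balanced' heuristic: n_samples / (n_classes * count)."""
    labels = list(class_counts)
    counts = np.array([class_counts[label] for label in labels], dtype=float)
    weights = counts.sum() / (len(labels) * counts)
    return dict(zip(labels, weights))

# Illustrative counts: class supports from the Test batting evaluation report
supports = {'0-24': 13004, '25-49': 3641, '50-99': 2429, '100+': 925}
weights = balanced_weights(supports)
for label, weight in weights.items():
    print(f"{label:>6}: {weight:.3f}")
```

Rare classes such as `100+` receive the largest weights, so their samples count for more when each tree is grown, without duplicating any rows in memory.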

### Step 1: Configuration - CHOOSE THE FORMAT TO TRAIN

In [1]:
MATCH_FORMAT_TO_TRAIN = "Test"  # <-- CHANGE THIS VALUE: "T20", "ODI", or "Test"

print(f"Configuration set to train models for: {MATCH_FORMAT_TO_TRAIN}")

Configuration set to train models for: Test


### Step 2: Load Data & Define Custom Bins

In [2]:
import pandas as pd
import numpy as np
import sqlite3
import joblib
import os
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

db_path = '../cricket_final_data_by_format.db'
conn = sqlite3.connect(db_path)
batting_df_full = pd.read_sql_query("SELECT * FROM batting_stats", conn)
bowling_df_full = pd.read_sql_query("SELECT * FROM bowling_stats", conn)
conn.close()

batting_df = batting_df_full[batting_df_full['match_type'] == MATCH_FORMAT_TO_TRAIN].copy()
bowling_df = bowling_df_full[bowling_df_full['match_type'] == MATCH_FORMAT_TO_TRAIN].copy()
batting_df['date'] = pd.to_datetime(batting_df['date'])
bowling_df['date'] = pd.to_datetime(bowling_df['date'])
print(f"Data loaded for {MATCH_FORMAT_TO_TRAIN}.")

if MATCH_FORMAT_TO_TRAIN == 'T20':
    run_bins = [-1, 15, 29, 49, 999]
    run_labels = ['0-15', '16-29', '30-49', '50+']
    wicket_bins = [-1, 1, 3, 99]
    wicket_labels = ['0-1', '2-3', '4+']
elif MATCH_FORMAT_TO_TRAIN == 'ODI':
    run_bins = [-1, 24, 49, 99, 999]
    run_labels = ['0-24', '25-49', '50-99', '100+']
    wicket_bins = [-1, 1, 3, 99]
    wicket_labels = ['0-1', '2-3', '4+']
else: # Test
    run_bins = [-1, 24, 49, 99, 999]
    run_labels = ['0-24', '25-49', '50-99', '100+']
    wicket_bins = [-1, 2, 4, 99]
    wicket_labels = ['0-2', '3-4', '5+']

batting_df['runs_bin'] = pd.cut(batting_df['runs'], bins=run_bins, labels=run_labels)
bowling_df['wickets_bin'] = pd.cut(bowling_df['wickets'], bins=wicket_bins, labels=wicket_labels)
print("Custom performance bins defined and applied.")

Data loaded for Test.
Custom performance bins defined and applied.
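
To make the bin edges concrete, here is a minimal sketch of how `pd.cut` assigns labels with the Test/ODI run bins from the cell above. By default the intervals are closed on the right, so the edges `[-1, 24, 49, 99, 999]` correspond to 0-24, 25-49, 50-99, and 100-999. The sample scores are hypothetical values chosen to sit on either side of each edge.

```python
import pandas as pd

# Bin edges and labels copied from the Test/ODI branch above
run_bins = [-1, 24, 49, 99, 999]
run_labels = ['0-24', '25-49', '50-99', '100+']

# Hypothetical scores chosen to sit on either side of each edge
sample_runs = pd.Series([0, 24, 25, 49, 50, 99, 100, 400])
binned = pd.cut(sample_runs, bins=run_bins, labels=run_labels)

print(pd.DataFrame({'runs': sample_runs, 'runs_bin': binned}))
```

Any value outside the outer edges (for example, a negative placeholder) would become `NaN`, and such rows are removed later by the `dropna` call on the bin column.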


## Part A: Training the Batting Classification Model

### Step 3: Feature Engineering for Batting

In [3]:
print("Creating features for batting data...")
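# Career average and strike rate, with 0 as a fallback to avoid division by zero
batting_df['career_avg'] = np.where(batting_df['career_innings'] > 0, batting_df['career_runs'] / batting_df['career_innings'], 0)
batting_df['career_sr'] = np.where(batting_df['career_balls_faced'] > 0, (batting_df['career_runs'] / batting_df['career_balls_faced']) * 100, 0)
# Mean of the previous (up to) 10 innings; shift(1) excludes the current innings to avoid target leakage.
# NOTE: this assumes rows are already in chronological order within each player.
batting_df['form_avg_last_10'] = batting_df.groupby('player')['runs'].transform(lambda x: x.rolling(window=10, min_periods=1).mean().shift(1))
batting_df['form_avg_last_10'] = batting_df['form_avg_last_10'].fillna(0)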

# Map each match to its teams so the opposition can be looked up per row
match_teams = batting_df_full.groupby('match_id')['team'].unique().apply(list).to_dict()

def find_opposition(row):
    """Return the other team in the row's match, or 'N/A' if the match does not list exactly two teams."""
    teams_in_match = match_teams.get(row['match_id'])
    if teams_in_match and len(teams_in_match) == 2:
        return teams_in_match[1] if teams_in_match[0] == row['team'] else teams_in_match[0]
    return 'N/A'

batting_df['against_team'] = batting_df.apply(find_opposition, axis=1)
print("Feature engineering complete.")

Creating features for batting data...
Feature engineering complete.
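
The rolling form feature is easiest to see on a toy example. The sketch below applies the same `rolling(...).mean().shift(1)` pattern to two made-up players, assuming their innings are already in chronological order. The player names and scores are hypothetical.

```python
import pandas as pd

# Hypothetical innings for two made-up players, already in chronological order
toy = pd.DataFrame({
    'player': ['A', 'A', 'A', 'B', 'B'],
    'runs':   [10, 20, 30, 50, 0],
})

# Same pattern as the notebook: rolling mean, then shift so each row only sees earlier innings
toy['form_avg_last_10'] = (
    toy.groupby('player')['runs']
       .transform(lambda x: x.rolling(window=10, min_periods=1).mean().shift(1))
       .fillna(0)
)
print(toy)
```

Each player's first innings has no history, so it is filled with 0. Later rows hold the mean of that player's earlier scores only, which keeps the current innings (the prediction target) out of its own feature.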


### Step 4: Prepare Batting Data for Modeling

In [4]:
batting_df.dropna(subset=['runs_bin', 'player', 'venue', 'against_team'], inplace=True)

categorical_features = ['player', 'venue', 'against_team']
numerical_features = ['career_avg', 'career_sr', 'career_innings', 'form_avg_last_10']
target = 'runs_bin'

X = pd.concat([
    pd.get_dummies(batting_df[categorical_features], columns=categorical_features),
    batting_df[numerical_features]
], axis=1)
y = batting_df[target]

print("Batting data prepared.")
print(f"Final shape of {MATCH_FORMAT_TO_TRAIN} batting feature matrix (X):", X.shape)

Batting data prepared.
Final shape of Test batting feature matrix (X): (99992, 3068)


### Step 5: Train & Evaluate the Batting Classifier (with Class Weights)

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Leave one core free; os.cpu_count() can return None, so fall back to 1
n_jobs = max(1, (os.cpu_count() or 1) - 1)

# --- DEFINITIVE CHANGE: Use class_weight='balanced' to handle imbalance without MemoryError ---
batting_classifier = RandomForestClassifier(
    n_estimators=100, 
    random_state=42, 
    n_jobs=n_jobs, 
    verbose=2,
    class_weight='balanced'
)

print(f"Training the {MATCH_FORMAT_TO_TRAIN} batting classifier with balanced class weights...")
batting_classifier.fit(X_train, y_train) # We train on the original, imbalanced data
print("Training complete.")

predictions = batting_classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"\n--- {MATCH_FORMAT_TO_TRAIN} Batting Model Evaluation ---")
print(f"Overall Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, predictions, zero_division=0))

Training the Test batting classifier with balanced class weights...


[Parallel(n_jobs=11)]: Using backend ThreadingBackend with 11 concurrent workers.




[Parallel(n_jobs=11)]: Done  19 tasks      | elapsed:   10.7s



[Parallel(n_jobs=11)]: Done 100 out of 100 | elapsed:   48.4s finished


Training complete.


[Parallel(n_jobs=11)]: Using backend ThreadingBackend with 11 concurrent workers.
[Parallel(n_jobs=11)]: Done  19 tasks      | elapsed:    0.0s



--- Test Batting Model Evaluation ---
Overall Accuracy: 59.43%

Classification Report:
              precision    recall  f1-score   support

        0-24       0.66      0.87      0.75     13004
        100+       0.08      0.02      0.03       925
       25-49       0.22      0.10      0.14      3641
       50-99       0.13      0.05      0.07      2429

    accuracy                           0.59     19999
   macro avg       0.27      0.26      0.25     19999
weighted avg       0.49      0.59      0.53     19999



[Parallel(n_jobs=11)]: Done 100 out of 100 | elapsed:    0.4s finished
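
One way to put the accuracy figure in context is to compare it with a majority-class baseline, that is, a model that always predicts the most common bin. The sketch below does that arithmetic using only the supports and accuracy printed in the report above. It is a sanity check, not part of the training pipeline.

```python
# Class supports copied from the Test batting classification report above
supports = {'0-24': 13004, '100+': 925, '25-49': 3641, '50-99': 2429}
model_accuracy = 0.5943  # overall accuracy reported above

# Accuracy of a model that always predicts the most frequent class
majority_label = max(supports, key=supports.get)
baseline_accuracy = supports[majority_label] / sum(supports.values())

print(f"Always predicting '{majority_label}': {baseline_accuracy:.2%}")
print(f"Random forest accuracy:          {model_accuracy:.2%}")
```

With these numbers the baseline sits at about 65%, above the classifier's 59.43%. That suggests overall accuracy alone may not be the most informative metric here. The macro-averaged precision, recall, and F1 rows in the report weigh the rare bins equally and are probably the better guide.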


### Step 6: Save the Trained Batting Classifier

In [6]:
# Create the output directory if it does not exist yet
os.makedirs('../models', exist_ok=True)

model_filename = f'../models/{MATCH_FORMAT_TO_TRAIN.lower()}_batting_classifier.joblib'
columns_filename = f'../models/{MATCH_FORMAT_TO_TRAIN.lower()}_batting_columns.joblib'

joblib.dump(batting_classifier, model_filename)
joblib.dump(X.columns, columns_filename)

print(f"{MATCH_FORMAT_TO_TRAIN} batting classifier and columns saved successfully.")

Test batting classifier and columns saved successfully.
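
The column list is saved alongside the model because one-hot encoding a new row produces only the dummy columns for that row's own categories. A minimal sketch of how the saved columns could be used at prediction time follows, assuming `DataFrame.reindex` is used for alignment. The toy frames, player names, and venue names are hypothetical, and `saved_columns` stands in for the object loaded from the `*_batting_columns.joblib` file.

```python
import pandas as pd

# Hypothetical training rows; the real notebook builds X from batting_df
train = pd.DataFrame({
    'player': ['Player A', 'Player B'],
    'venue': ['Venue X', 'Venue Y'],
    'career_avg': [41.2, 35.0],
})
X_train_toy = pd.concat(
    [pd.get_dummies(train[['player', 'venue']]), train[['career_avg']]], axis=1
)
saved_columns = X_train_toy.columns  # stands in for the joblib-saved X.columns

# A single new row contains only its own categories, so its dummies are incomplete
new_row = pd.DataFrame({'player': ['Player B'], 'venue': ['Venue Z'], 'career_avg': [36.1]})
X_new = pd.concat(
    [pd.get_dummies(new_row[['player', 'venue']]), new_row[['career_avg']]], axis=1
)

# Add missing training columns as 0 and drop unseen ones (here, venue_Venue Z)
X_new_aligned = X_new.reindex(columns=saved_columns, fill_value=0)
print(X_new_aligned)
```

After alignment the new row has exactly the training columns in the training order, which is what the fitted classifier expects. A category never seen in training (such as the made-up `Venue Z`) simply contributes no active dummy column.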


---

## Part B: Training the Bowling Classification Model

### Step 7: Feature Engineering for Bowling

In [7]:
print("\nCreating features for bowling data...")
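# Career bowling average (runs per wicket) and strike rate (balls per wicket); 0 is a placeholder when no wickets have been taken
bowling_df['career_bowling_avg'] = np.where(bowling_df['career_wickets'] > 0, bowling_df['career_runs_conceded'] / bowling_df['career_wickets'], 0)
bowling_df['career_bowling_sr'] = np.where(bowling_df['career_wickets'] > 0, bowling_df['career_balls_bowled'] / bowling_df['career_wickets'], 0)
# Mean wickets over the previous (up to) 10 innings; shift(1) excludes the current innings.
# NOTE: this assumes rows are already in chronological order within each player.
bowling_df['form_wickets_last_10'] = bowling_df.groupby('player')['wickets'].transform(lambda x: x.rolling(window=10, min_periods=1).mean().shift(1))
bowling_df['form_wickets_last_10'] = bowling_df['form_wickets_last_10'].fillna(0)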
bowling_df['against_team'] = bowling_df.apply(find_opposition, axis=1)
print("Feature engineering complete.")


Creating features for bowling data...
Feature engineering complete.


### Step 8: Prepare Bowling Data for Modeling

In [8]:
bowling_df.dropna(subset=['wickets_bin', 'player', 'venue', 'against_team'], inplace=True)

categorical_features_bowling = ['player', 'venue', 'against_team']
numerical_features_bowling = ['career_bowling_avg', 'career_bowling_sr', 'form_wickets_last_10']
target_bowling = 'wickets_bin'

X_bowling = pd.concat([
    pd.get_dummies(bowling_df[categorical_features_bowling], columns=categorical_features_bowling),
    bowling_df[numerical_features_bowling]
], axis=1)
y_bowling = bowling_df[target_bowling]

print("Bowling data prepared.")
print(f"Final shape of {MATCH_FORMAT_TO_TRAIN} bowling feature matrix (X):", X_bowling.shape)

Bowling data prepared.
Final shape of Test bowling feature matrix (X): (56842, 2500)


### Step 9: Train & Evaluate the Bowling Classifier (with Class Weights)

In [9]:
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_bowling, y_bowling, test_size=0.2, random_state=42, stratify=y_bowling)

# Leave one core free; os.cpu_count() can return None, so fall back to 1
n_jobs = max(1, (os.cpu_count() or 1) - 1)
bowling_classifier = RandomForestClassifier(
    n_estimators=100, 
    random_state=42, 
    n_jobs=n_jobs, 
    verbose=2,
    class_weight='balanced'
)

print(f"\nTraining the {MATCH_FORMAT_TO_TRAIN} bowling classifier...")
bowling_classifier.fit(X_train_b, y_train_b)
print("Training complete.")

predictions_b = bowling_classifier.predict(X_test_b)
accuracy_b = accuracy_score(y_test_b, predictions_b)

print(f"\n--- {MATCH_FORMAT_TO_TRAIN} Bowling Model Evaluation ---")
print(f"Overall Accuracy: {accuracy_b * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test_b, predictions_b, zero_division=0))


Training the Test bowling classifier...


[Parallel(n_jobs=11)]: Using backend ThreadingBackend with 11 concurrent workers.




[Parallel(n_jobs=11)]: Done  19 tasks      | elapsed:    4.6s



[Parallel(n_jobs=11)]: Done 100 out of 100 | elapsed:   22.1s finished
[Parallel(n_jobs=11)]: Using backend ThreadingBackend with 11 concurrent workers.
[Parallel(n_jobs=11)]: Done  19 tasks      | elapsed:    0.0s


Training complete.

--- Test Bowling Model Evaluation ---
Overall Accuracy: 73.19%

Classification Report:
              precision    recall  f1-score   support

         0-2       0.78      0.93      0.85      8706
         3-4       0.24      0.09      0.13      2057
          5+       0.16      0.04      0.06       606

    accuracy                           0.73     11369
   macro avg       0.39      0.35      0.35     11369
weighted avg       0.65      0.73      0.68     11369



[Parallel(n_jobs=11)]: Done 100 out of 100 | elapsed:    0.1s finished
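
Because the `0-2` bin dominates the bowling data, it can help to read the report through balanced accuracy, which is the unweighted mean of per-class recall. The sketch below computes it from the rounded recall values printed above, so the result is approximate.

```python
# Per-class recall copied (rounded) from the Test bowling classification report above
recall = {'0-2': 0.93, '3-4': 0.09, '5+': 0.04}
overall_accuracy = 0.7319  # overall accuracy reported above

# Balanced accuracy: every class counts equally, regardless of its support
balanced_accuracy = sum(recall.values()) / len(recall)

print(f"Overall accuracy:  {overall_accuracy:.2%}")
print(f"Balanced accuracy: {balanced_accuracy:.2%} (approximate, from rounded recalls)")
```

The result matches the report's macro-average recall of 0.35 and sits well below the overall accuracy, which indicates that most of the correct predictions come from the majority bin.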


### Step 10: Save the Trained Bowling Classifier

In [10]:
model_filename_bowling = f'../models/{MATCH_FORMAT_TO_TRAIN.lower()}_bowling_classifier.joblib'
columns_filename_bowling = f'../models/{MATCH_FORMAT_TO_TRAIN.lower()}_bowling_columns.joblib'

joblib.dump(bowling_classifier, model_filename_bowling)
joblib.dump(X_bowling.columns, columns_filename_bowling)

print(f"{MATCH_FORMAT_TO_TRAIN} bowling classifier and columns saved successfully.")

Test bowling classifier and columns saved successfully.
