# Modeling the sequence of a cricket game

Separately from the previous modeling, where I tried to predict the outcome of a game based on the stats of the players, I want to attempt to model a cricket game itself, ball by ball.

I will create three models:
1. A model that predicts the probability of a wicket occurring in a ball.
2. A model predicting the number of extras resulting from a ball.
3. A model predicting the number of runs scored by a ball.

These will then be combined into a finite state machine with the aim of predicting the outcome of the game through modelling the sequence of events within it.
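
As a rough sketch of how the three models could be chained, the loop below simulates one innings ball by ball. Everything in it is an assumption made for illustration: the `simulate_innings` name, the 20-over and 10-wicket limits, the simplification that runs off the bat are only sampled when there are no extras, and the three callables standing in for the fitted models.

```python
import random


def simulate_innings(wicket_prob, sample_extras, sample_runs,
                     max_overs=20, max_wickets=10, seed=None):
    """Simulate one innings as a simple state machine.

    State is (legal balls bowled, wickets lost, runs scored).
    The three callables are hypothetical stand-ins for the fitted models:
      wicket_prob(state)        -> probability of a wicket on this ball
      sample_extras(state, rng) -> (extra_runs, counts_as_legal_ball)
      sample_runs(state, rng)   -> runs off the bat
    """
    rng = random.Random(seed)
    balls, wickets, runs = 0, 0, 0
    while balls < max_overs * 6 and wickets < max_wickets:
        state = (balls, wickets, runs)

        # Extras first: wides and no-balls do not count as legal deliveries.
        extra_runs, legal = sample_extras(state, rng)
        runs += extra_runs
        if legal:
            balls += 1

        # Wicket check: a wicket ends the ball with no runs off the bat.
        if rng.random() < wicket_prob(state):
            wickets += 1
            continue

        # Simplification: only sample batter runs when no extras occurred.
        if extra_runs == 0:
            runs += sample_runs(state, rng)
    return runs, wickets, balls


# Toy usage with constant stand-ins instead of real models.
print(simulate_innings(lambda s: 0.05,
                       lambda s, rng: (0, True),
                       lambda s, rng: rng.choice([0, 1, 1, 2, 4]),
                       seed=1))
```

The innings ends when either the balls or the wickets run out, which is the only terminal condition this sketch models; a second-innings chase target would need one more check.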

In [199]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ast

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error, f1_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import OneHotEncoder

# 1. Predicting the probability of a wicket occurring in a ball

In [3]:
data = pd.read_csv('../data/saved_data/all_matches.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857199 entries, 0 to 857198
Data columns (total 29 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   game_id               857199 non-null  object 
 1   date                  857199 non-null  object 
 2   venue                 857199 non-null  object 
 3   location              816895 non-null  object 
 4   gender                857199 non-null  object 
 5   match_type            857199 non-null  object 
 6   innings               857199 non-null  int64  
 7   batting_team          857199 non-null  object 
 8   bowling_team          857199 non-null  object 
 9   batting_team_players  857199 non-null  object 
 10  bowling_team_players  857199 non-null  object 
 11  over                  857199 non-null  int64  
 12  ball_in_over          857199 non-null  int64  
 13  batter                857199 non-null  object 
 14  bowler                857199 non-null  object 
 15  

In [7]:
batter_stats = pd.read_csv('../data/saved_data/batter_stats.csv')
bowler_stats = pd.read_csv('../data/saved_data/bowler_stats.csv')

In [8]:
batter_stats.head()

Unnamed: 0,batter,total_runs_batter_mean,total_high_scoring_hit_mean,total_total_mean,total_is_wicket_mean,powerplay_runs_batter_mean,powerplay_high_scoring_hit_mean,powerplay_total_mean,powerplay_is_wicket_mean,non_powerplay_runs_batter_mean,non_powerplay_high_scoring_hit_mean,non_powerplay_total_mean,non_powerplay_is_wicket_mean
0,00015688,0.294118,0.029412,0.485294,0.044118,0.375,0.05,0.525,0.025,0.178571,0.0,0.428571,0.071429
1,00029c30,0.454545,0.0,0.818182,0.0,,,,,0.454545,0.0,0.818182,0.0
2,0030a57d,0.533333,0.0,0.733333,0.0,,,,,0.533333,0.0,0.733333,0.0
3,00321fff,0.612245,0.040816,0.673469,0.081633,,,,,0.612245,0.040816,0.673469,0.081633
4,00467a76,0.974684,0.101266,1.088608,0.101266,,,,,0.974684,0.101266,1.088608,0.101266


In [9]:
bowler_stats.head()

Unnamed: 0,bowler,total_batter_runs_conceded_mean,total_runs_from_relevant_extras_mean,total_total_runs_conceded_mean,total_taken_from_relevant_wickets_mean,powerplay_batter_runs_conceded_mean,powerplay_runs_from_relevant_extras_mean,powerplay_total_runs_conceded_mean,powerplay_taken_from_relevant_wickets_mean,non_powerplay_batter_runs_conceded_mean,non_powerplay_runs_from_relevant_extras_mean,non_powerplay_total_runs_conceded_mean,non_powerplay_taken_from_relevant_wickets_mean
0,00029c30,1.090909,0.060606,1.151515,0.045455,,,,,1.090909,0.060606,1.151515,0.045455
1,00321fff,1.06721,0.03666,1.107943,0.052953,,,,,1.06721,0.03666,1.107943,0.052953
2,00467a76,1.236742,0.068182,1.348485,0.066288,1.331034,0.048276,1.427586,0.055172,1.201044,0.075718,1.318538,0.070496
3,005f0561,0.864734,0.021739,0.898551,0.057971,0.757396,0.017751,0.775148,0.04142,0.938776,0.02449,0.983673,0.069388
4,007113d7,1.008584,0.025751,1.06867,0.081545,,,,,1.008584,0.025751,1.06867,0.081545


In [20]:
powerplay_balls = data[data['powerplay'] == True][['batter', 'bowler', 'is_wicket']]
powerplay_batter_stats = batter_stats[['batter'] + [col for col in batter_stats.columns if 'powerplay' in col[:9]]]
powerplay_bowler_stats = bowler_stats[['bowler'] + [col for col in bowler_stats.columns if 'powerplay' in col[:9]]]

non_powerplay_balls = data[data['powerplay'] == False][['batter', 'bowler', 'is_wicket']]
non_powerplay_batter_stats = batter_stats[['batter'] + [col for col in batter_stats.columns if 'non_powerplay' in col[:13]]]
non_powerplay_bowler_stats = bowler_stats[['bowler'] + [col for col in bowler_stats.columns if 'non_powerplay' in col[:13]]]

Due to time constraints, and because of their performance and ease of use, I will be using a random forest classifier to predict the probability of a wicket occurring in a ball. Its ability to easily return class probabilities rather than just the classes makes it useful here too.

In [25]:
powerplay_balls_stats = powerplay_balls.merge(powerplay_batter_stats, on='batter', how='left').drop(columns=['batter'])
powerplay_balls_stats = powerplay_balls_stats.merge(powerplay_bowler_stats, on='bowler', how='left').drop(columns=['bowler'])

non_powerplay_balls_stats = non_powerplay_balls.merge(non_powerplay_batter_stats, on='batter', how='left').drop(columns=['batter'])
non_powerplay_balls_stats = non_powerplay_balls_stats.merge(non_powerplay_bowler_stats, on='bowler', how='left').drop(columns=['bowler'])
powerplay_balls_stats.head()

Unnamed: 0,is_wicket,powerplay_runs_batter_mean,powerplay_high_scoring_hit_mean,powerplay_total_mean,powerplay_is_wicket_mean,powerplay_batter_runs_conceded_mean,powerplay_runs_from_relevant_extras_mean,powerplay_total_runs_conceded_mean,powerplay_taken_from_relevant_wickets_mean
0,False,1.38797,0.237594,1.46015,0.042105,1.202899,0.045549,1.269151,0.055901
1,False,1.307902,0.220708,1.376022,0.054496,1.202899,0.045549,1.269151,0.055901
2,True,1.38797,0.237594,1.46015,0.042105,1.202899,0.045549,1.269151,0.055901
3,False,1.182553,0.198708,1.277868,0.048465,1.202899,0.045549,1.269151,0.055901
4,False,1.182553,0.198708,1.277868,0.048465,1.202899,0.045549,1.269151,0.055901


In [61]:
X_powerplay = powerplay_balls_stats.drop(columns=['is_wicket'])
y_powerplay = powerplay_balls_stats['is_wicket']

X_train, X_test, y_train, y_test = train_test_split(X_powerplay, y_powerplay, test_size=0.2, random_state=42)

In [89]:
from sklearn.ensemble import RandomForestClassifier

def do_random_forest_classifier(X_train, y_train, cv=5, random_state=42, **kwargs):
    """
    Train Random Forest Classifier and get probability predictions
    """
    # Initialize model
    model = RandomForestClassifier(
        random_state=random_state,
        **kwargs
    )
    
    # Cross validation with probability predictions and verbose output
    cv_results = cross_validate(
        model,
        X_train,
        y_train,
        cv=cv,
        scoring={
            'auc': 'roc_auc',
            'precision': 'precision', 
            'recall': 'recall',
            'f1': 'f1'
        },
        return_train_score=True,
        verbose=1 # Add verbose output to see training progress
    )
    
    # Fit the model
    model.fit(X_train, y_train)
    
    # Example of getting probabilities
    # proba = model.predict_proba(X)  # Returns [[prob_class0, prob_class1], ...]
    # wicket_prob = proba[:, 1]  # Just the probability of wicket (class 1)
    
    # Print metrics
    print("Model Performance:")
    print(f"Train AUC: {cv_results['train_auc'].mean():.3f} ± {cv_results['train_auc'].std():.3f}")
    print(f"CV AUC: {cv_results['test_auc'].mean():.3f} ± {cv_results['test_auc'].std():.3f}")

    print(f"Precision: {cv_results['test_precision'].mean():.3f} ± {cv_results['test_precision'].std():.3f}")
    print(f"Recall: {cv_results['test_recall'].mean():.3f} ± {cv_results['test_recall'].std():.3f}")
    print(f"F1 Score: {cv_results['test_f1'].mean():.3f} ± {cv_results['test_f1'].std():.3f}")
    
    return model

# Use it
powerplay_wicket_model = do_random_forest_classifier(X_train, y_train)

KeyboardInterrupt: 

In [63]:
powerplay_results = powerplay_wicket_model.predict_proba(X_test)
powerplay_results


array([[1.        , 0.        ],
       [1.        , 0.        ],
       [0.90514423, 0.09485577],
       ...,
       [0.8163913 , 0.1836087 ],
       [1.        , 0.        ],
       [1.        , 0.        ]])

In [66]:
print(f"Proportion of balls where wicket chance is 0: {len(powerplay_results[powerplay_results[:, 1] == 0]) / len(powerplay_results):.3f}")
print(f"Max wicket chance: {powerplay_results[:, 1].max():.3f}")
print(f"Mean wicket chance: {powerplay_results[:, 1].mean():.3f}")

Proportion of balls where wicket chance is 0: 0.568
Max wicket chance: 0.950
Mean wicket chance: 0.040


At face value these results look awful, and if we were trying to use this to classify whether or not a wicket occurs, they would be.

However, we are trying to predict the probability of a wicket occurring in a ball. Binary classification models return the class with the highest probability, which in our case will pretty much always be no wicket.

Therefore, I do not think the low AUC/F1 values are a problem.
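
One way to check this is to score the probabilities directly rather than the hard class labels. The sketch below uses the Brier score, which is my suggestion and not something the notebook computes, and compares it against a baseline that always predicts the observed base rate. The outcome and probability values are made up.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def base_rate_brier(y_true):
    """Brier score of always predicting the observed base rate."""
    rate = sum(y_true) / len(y_true)
    return brier_score(y_true, [rate] * len(y_true))


# Made-up outcomes and predicted wicket probabilities, purely for illustration.
y_true = [0, 0, 0, 1]
y_prob = [0.1, 0.0, 0.2, 0.6]
print(f"Model Brier score:     {brier_score(y_true, y_prob):.4f}")
print(f"Base-rate Brier score: {base_rate_brier(y_true):.4f}")
```

Lower is better. In this notebook the inputs would be `y_test` and the second column of `predict_proba(X_test)`; a model that cannot beat the base-rate score adds little beyond the overall wicket rate.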

I am, however, more concerned about the high number of balls with a wicket chance of 0. Since there is technically always a small chance of a wicket occurring, I am going to add a small 'fudge factor' to the predictions to ensure that this chance always exists.
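
The exact zeros follow from how a random forest forms its probabilities: scikit-learn averages, over the trees, the fraction of positive training samples in the leaf each tree assigns the ball to. A probability of exactly 0 therefore means every tree placed the ball in a leaf containing no wickets. The sketch below illustrates that averaging with hypothetical leaf fractions for a five-tree forest; it is not the library's code.

```python
def forest_probability(leaf_wicket_fractions):
    """Average the per-tree leaf fractions, as a random forest does."""
    return sum(leaf_wicket_fractions) / len(leaf_wicket_fractions)


# Hypothetical leaf fractions for two balls across a five-tree forest.
wicket_free_leaves = [0.0, 0.0, 0.0, 0.0, 0.0]  # every leaf has no wickets
mixed_leaves = [0.0, 0.5, 0.0, 0.0, 0.0]        # one tree saw some wickets

print(forest_probability(wicket_free_leaves))  # exactly 0.0
print(forest_probability(mixed_leaves))
```

With unrestricted tree depth, leaves tend to be small and pure, so exact zeros are common; limiting depth or raising `min_samples_leaf` would be an alternative to a fudge factor.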

In [46]:
print(f"Mean actual wicket chance: {powerplay_balls_stats['is_wicket'].mean():.3f}")
print(f"Mean predicted wicket chance: {powerplay_results[:, 1].mean():.3f}")

print(f"STD of actual wicket chance: {powerplay_balls_stats['is_wicket'].std():.3f}")
print(f"STD of predicted wicket chance: {powerplay_results[:, 1].std():.3f}")

Mean actual wicket chance: 0.044
Mean predicted wicket chance: 0.040
STD of actual wicket chance: 0.205
STD of predicted wicket chance: 0.094


The mean wicket chance output by the model is 0.040, very similar to the actual wicket chance of 0.044, suggesting that the model is actually fairly representative of the actual wicket chance.

Therefore I probably want the fudge factor to be small but non-zero, keeping the mean predicted wicket chance close to the actual wicket chance.


In [51]:
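def predict_wicket_probability(model, X, fudge_factor=0.01):
    """Return P(positive class), rescaled so it is never below fudge_factor."""
    wicket_prob = model.predict_proba(X)[:, 1]
    # Linear rescale: maps [0, 1] onto [fudge_factor, 1]
    return wicket_prob * (1 - fudge_factor) + fudge_factor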

powerplay_results = predict_wicket_probability(powerplay_wicket_model, X_test, 0.01)
print(f"New mean predicted wicket chance: {powerplay_results.mean():.3f}")

New mean predicted wicket chance: 0.050


Adding a fudge factor of 1% seems to be alright. While it is arbitrarily chosen, it keeps the mean chance similar to the actual chance while ensuring that there is always some chance of a wicket occurring.
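
Because the rescaling `p * (1 - f) + f` is linear in `p`, its effect on the mean can be worked out without rerunning the model. The helper below is a hypothetical illustration (`mean_after_fudge` is not defined elsewhere in the notebook) and uses the rounded mean printed above.

```python
def mean_after_fudge(mean_prob, fudge_factor):
    """Mean of p * (1 - f) + f, given the mean of p (the map is linear)."""
    return mean_prob * (1 - fudge_factor) + fudge_factor


# Powerplay wickets: rounded mean prediction of 0.040 with a 1% fudge factor.
print(f"{mean_after_fudge(0.040, 0.01):.3f}")  # 0.050

# The smallest possible prediction moves from 0 to the fudge factor itself.
print(mean_after_fudge(0.0, 0.01))  # 0.01
```

The same formula shows the trade-off: every percentage point of fudge factor raises the mean by slightly less than a percentage point, so a larger value would push the mean further from the observed rate.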

# Non powerplay wickets

Again, I am going to repeat the exact same process for the non powerplay wickets.

In [70]:
X_non_powerplay = non_powerplay_balls_stats.drop(columns=['is_wicket'])
y_non_powerplay = non_powerplay_balls_stats['is_wicket']

X_train, X_test, y_train, y_test = train_test_split(X_non_powerplay, y_non_powerplay, test_size=0.2, random_state=42)

In [53]:
non_powerplay_wicket_model = do_random_forest_classifier(X_train, y_train)

Model Performance:
Train AUC: 0.950 ± 0.000
CV AUC: 0.571 ± 0.003
Precision: 0.053 ± 0.004
Recall: 0.012 ± 0.001
F1 Score: 0.020 ± 0.002


In [59]:
non_powerplay_results = predict_wicket_probability(non_powerplay_wicket_model, X_test, 0)
print(f"Proportion of balls where wicket chance is 0: {len(non_powerplay_results[non_powerplay_results == 0]) / len(non_powerplay_results):.3f}")
print(f"Max wicket chance: {non_powerplay_results.max():.3f}")
print(f"Mean wicket chance: {non_powerplay_results.mean():.3f}")

Proportion of balls where wicket chance is 0: 0.480
Max wicket chance: 0.968
Mean wicket chance: 0.057


In [68]:
print(f"Mean actual wicket chance: {non_powerplay_balls_stats['is_wicket'].mean():.3f}")
print(f"Mean predicted wicket chance: {non_powerplay_results.mean():.3f}")

print(f"STD of actual wicket chance: {non_powerplay_balls_stats['is_wicket'].std():.3f}")
print(f"STD of predicted wicket chance: {non_powerplay_results.std():.3f}")


Mean actual wicket chance: 0.061
Mean predicted wicket chance: 0.057
STD of actual wicket chance: 0.238
STD of predicted wicket chance: 0.113


Again we see that the model is fairly representative of the actual wicket chance, good signs all around.

I think I will stick with the same fudge factor of 1% for the non powerplay wickets to keep things consistent.

In [71]:
non_powerplay_results = predict_wicket_probability(non_powerplay_wicket_model, X_test, 0.01)
print(f"New mean predicted wicket chance: {non_powerplay_results.mean():.3f}")

New mean predicted wicket chance: 0.066


Again, this remains within 1 percentage point of the actual wicket chance, so I'm happy with this for now.

---
# 2. Predicting the number of extras resulting from a ball

I think I am going to break this down into three models:
1. A model predicting the probability of an extra occurring in a ball.
2. A model predicting the type of extra | an extra has occurred.
3. A model predicting the number of extras occurring in a ball | the type of extra.

This should help solve the problem of the high number of balls where the number of extras is 0, which would likely skew the predictions if I were to use just a single model.

Again, I am going to use a random forest classifier for this task.
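
The three-stage breakdown above can be sketched as a chain of conditional draws. This is a minimal illustration under stated assumptions: `sample_extras` is a hypothetical helper, the probabilities are invented, and it picks a single extra type per ball even though the data contains a few balls with two types.

```python
import random


def sample_extras(p_extra, type_probs, runs_by_type, rng):
    """Return (extra_type, runs), or (None, 0) when no extra occurs.

    p_extra      : P(extra occurs on this ball)
    type_probs   : {type: P(type | extra)}
    runs_by_type : {type: {runs: weight}}, i.e. P(runs | type)
    """
    # Stage 1: does an extra occur at all?
    if rng.random() >= p_extra:
        return None, 0

    # Stage 2: which type of extra, given that one occurred?
    types = list(type_probs)
    extra_type = rng.choices(types, weights=[type_probs[t] for t in types])[0]

    # Stage 3: how many runs, given the type?
    run_weights = runs_by_type[extra_type]
    run_values = list(run_weights)
    runs = rng.choices(run_values,
                       weights=[run_weights[r] for r in run_values])[0]
    return extra_type, runs


# Invented probabilities, for illustration only.
rng = random.Random(0)
type_probs = {"wides": 0.7, "legbyes": 0.3}
runs_by_type = {"wides": {1: 0.9, 5: 0.1}, "legbyes": {1: 0.8, 4: 0.2}}
print([sample_extras(0.5, type_probs, runs_by_type, rng) for _ in range(5)])
```

Splitting the problem this way means the zero-inflated part (stage 1) is handled by one classifier, and the later stages only ever see balls where an extra actually happened.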

## 2.a) Predicting the probability of an extra occurring in a ball

## Powerplay

In [83]:
powerplay_extra_balls = data[data['powerplay'] == True][['batter', 'bowler', 'extras']]
powerplay_extra_batter_stats = batter_stats[['batter'] + [col for col in batter_stats.columns if 'powerplay' in col[:9]]]
powerplay_extra_bowler_stats = bowler_stats[['bowler'] + [col for col in bowler_stats.columns if 'powerplay' in col[:9]]]

non_powerplay_extra_balls = data[data['powerplay'] == False][['batter', 'bowler', 'extras']]
non_powerplay_extra_batter_stats = batter_stats[['batter'] + [col for col in batter_stats.columns if 'non_powerplay' in col[:13]]]
non_powerplay_extra_bowler_stats = bowler_stats[['bowler'] + [col for col in bowler_stats.columns if 'non_powerplay' in col[:13]]]

In [84]:
powerplay_extra_balls['extras'].mean()

np.float64(0.09018491859723067)

In [85]:
powerplay_extra_balls['is_extras'] = powerplay_extra_balls['extras'] > 0
powerplay_extra_balls.info()

<class 'pandas.core.frame.DataFrame'>
Index: 269524 entries, 0 to 857110
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   batter     269524 non-null  object
 1   bowler     269524 non-null  object
 2   extras     269524 non-null  int64 
 3   is_extras  269524 non-null  bool  
dtypes: bool(1), int64(1), object(2)
memory usage: 8.5+ MB


In [86]:
powerplay_extra_balls = powerplay_extra_balls.merge(powerplay_extra_batter_stats, on='batter', how='left').drop(columns=['batter'])
powerplay_extra_balls = powerplay_extra_balls.merge(powerplay_extra_bowler_stats, on='bowler', how='left').drop(columns=['bowler'])

In [87]:
X_powerplay = powerplay_extra_balls.drop(columns=['extras', 'is_extras'])
y_powerplay = powerplay_extra_balls['is_extras']

X_train, X_test, y_train, y_test = train_test_split(X_powerplay, y_powerplay, test_size=0.2, random_state=42)

In [90]:
extras_model = do_random_forest_classifier(X_train, y_train)

Model Performance:
Train AUC: 0.918 ± 0.000
CV AUC: 0.580 ± 0.004
Precision: 0.138 ± 0.005
Recall: 0.032 ± 0.003
F1 Score: 0.052 ± 0.004


In [91]:
powerplay_results = predict_wicket_probability(extras_model, X_test, 0)
print(f"Proportion of balls where extra chance is 0: {len(powerplay_results[powerplay_results == 0]) / len(powerplay_results):.3f}")
print(f"Max extra chance: {powerplay_results.max():.3f}")
print(f"Mean extra chance: {powerplay_results.mean():.3f}")

Proportion of balls where extra chance is 0: 0.412
Max extra chance: 0.973
Mean extra chance: 0.073


In [93]:
print(f"Mean actual extra chance: {powerplay_extra_balls['extras'].mean():.3f}")
print(f"Mean predicted extra chance: {powerplay_results.mean():.3f}")

print(f"STD of actual extra chance: {powerplay_extra_balls['extras'].std():.3f}")
print(f"STD of predicted extra chance: {powerplay_results.std():.3f}")

Mean actual extra chance: 0.090
Mean predicted extra chance: 0.073
STD of actual extra chance: 0.392
STD of predicted extra chance: 0.122


Interesting to see that the predictions are less accurate here than for the wickets. The gap is still under 2 percentage points, but it is not as close as I would like. One caveat: the 'actual' figure above is the mean of `extras`, which counts runs rather than occurrences, so it overstates the true extra rate (the mean of `is_extras`).

That being said, adding a fudge factor of 1% again seems like it would be alright, and it would move the mean predicted extra chance towards the figure above.

In [94]:
powerplay_results = predict_wicket_probability(extras_model, X_test, 0.01)
print(f"New mean predicted extra chance: {powerplay_results.mean():.3f}")

New mean predicted extra chance: 0.082


## Non powerplay

In [95]:
non_powerplay_extra_balls['is_extras'] = non_powerplay_extra_balls['extras'] > 0
non_powerplay_extra_balls = non_powerplay_extra_balls.merge(non_powerplay_extra_batter_stats, on='batter', how='left').drop(columns=['batter'])
non_powerplay_extra_balls = non_powerplay_extra_balls.merge(non_powerplay_extra_bowler_stats, on='bowler', how='left').drop(columns=['bowler'])
X_non_powerplay = non_powerplay_extra_balls.drop(columns=['extras', 'is_extras'])
y_non_powerplay = non_powerplay_extra_balls['is_extras']

X_train, X_test, y_train, y_test = train_test_split(X_non_powerplay, y_non_powerplay, test_size=0.2, random_state=42)
non_powerplay_extras_model = do_random_forest_classifier(X_train, y_train)
non_powerplay_results = predict_wicket_probability(non_powerplay_extras_model, X_test, 0)
print(f"Proportion of balls where extra chance is 0: {len(non_powerplay_results[non_powerplay_results == 0]) / len(non_powerplay_results):.3f}")
print(f"Max extra chance: {non_powerplay_results.max():.3f}")
print(f"Mean extra chance: {non_powerplay_results.mean():.3f}")
print(f"\nMean actual extra chance: {non_powerplay_extra_balls['extras'].mean():.3f}")
print(f"Mean predicted extra chance: {non_powerplay_results.mean():.3f}")

print(f"STD of actual extra chance: {non_powerplay_extra_balls['extras'].std():.3f}")
print(f"STD of predicted extra chance: {non_powerplay_results.std():.3f}")


Model Performance:
Train AUC: 0.932 ± 0.000
CV AUC: 0.577 ± 0.004
Precision: 0.120 ± 0.009
Recall: 0.027 ± 0.002
F1 Score: 0.044 ± 0.003
Proportion of balls where extra chance is 0: 0.453
Max extra chance: 0.970
Mean extra chance: 0.063

Mean actual extra chance: 0.075
Mean predicted extra chance: 0.063
STD of actual extra chance: 0.347
STD of predicted extra chance: 0.115


Again, a fudge factor of 1% seems appropriate here.

In [96]:
non_powerplay_results = predict_wicket_probability(non_powerplay_extras_model, X_test, 0.01)
print(f"New mean predicted extra chance: {non_powerplay_results.mean():.3f}")

New mean predicted extra chance: 0.072


## 2.b) Predicting the type of extra | an extra has occurred

My hope here is that using the batter and bowler stats we can predict the type of extra that occurs.

Depending on the success of this, I may instead just try to predict the number of runs | an extra occurred.

In [180]:
balls_with_extras = data[data['extras'] > 0][['batter', 'bowler', 'extras_details', 'extras', 'powerplay']]
balls_with_extras['extras_details'] = balls_with_extras['extras_details'].apply(ast.literal_eval)

In [181]:
balls_with_extras['extras_details'].value_counts()

extras_details
{'wides': 1}                    32496
{'legbyes': 1}                   9435
{'noballs': 1}                   4455
{'byes': 1}                      2385
{'wides': 2}                     2120
{'wides': 5}                      989
{'legbyes': 2}                    754
{'byes': 4}                       704
{'legbyes': 4}                    630
{'wides': 3}                      536
{'byes': 2}                       433
{'byes': 1, 'noballs': 1}         104
{'wides': 4}                       57
{'legbyes': 3}                     56
{'byes': 4, 'noballs': 1}          43
{'byes': 3}                        38
{'legbyes': 1, 'noballs': 1}       24
{'byes': 2, 'noballs': 1}          22
{'legbyes': 5}                     13
{'penalty': 5}                     12
{'noballs': 5}                     10
{'noballs': 2}                     10
{'legbyes': 4, 'noballs': 1}        4
{'legbyes': 2, 'noballs': 1}        2
{'byes': 5}                         2
{'legbyes': 1, 'penalty': 5}       

In [182]:
balls_with_extras['extras_details'].apply(lambda x: list(x.keys())).value_counts()

extras_details
[wides]               36198
[legbyes]             10888
[noballs]              4476
[byes]                 3562
[byes, noballs]         159
[legbyes, noballs]       29
[penalty]                12
[noballs, byes]          10
[legbyes, penalty]        1
[noballs, legbyes]        1
[noballs, penalty]        1
Name: count, dtype: int64

In [183]:
balls_with_extras.shape

(55337, 5)

In [184]:
noballs_balls = balls_with_extras[balls_with_extras['extras_details'].apply(lambda x: list(x.keys()) == ['noballs'])]
noballs_balls['extras_details'].value_counts()

extras_details
{'noballs': 1}    4455
{'noballs': 5}      10
{'noballs': 2}      10
{'noballs': 3}       1
Name: count, dtype: int64

We have a couple of issues here:
1. If we are trying to predict the number of extra runs | extra occurs, the results are massively skewed towards 1.
2. If we are trying to predict the number of extra runs | extra type, the results will still be heavily skewed towards 1 for the extras that can yield a variable number of runs.

I think this is a lose-lose situation, but at least predicting the probability of an extra type should help in the case of extras where there is a fixed number of runs.
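
To make the skew concrete, the no-ball counts printed above can be turned into an empirical distribution and sampled from directly. This is one simple option for the runs-given-type step, not necessarily the approach taken later; the helper names are hypothetical.

```python
import random

# Counts copied from the value_counts output above (balls whose only extra
# is a no-ball).
noball_run_counts = {1: 4455, 5: 10, 2: 10, 3: 1}


def empirical_distribution(counts):
    """Convert raw counts into probabilities that sum to 1."""
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def sample_runs(counts, rng):
    """Draw a run value with probability proportional to its count."""
    values = list(counts)
    return rng.choices(values, weights=[counts[v] for v in values])[0]


dist = empirical_distribution(noball_run_counts)
print(f"P(1 run | no-ball) = {dist[1]:.3f}")  # 0.995

rng = random.Random(42)
print([sample_runs(noball_run_counts, rng) for _ in range(10)])
```

With more than 99% of the mass on a single value, a learned model has very little to add here, which is why sampling from observed frequencies may be good enough for this step.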

In [185]:
balls_with_extras['extras_type'] = balls_with_extras['extras_details'].apply(lambda x: list(x.keys()))
# Explode the extras_type column to create separate rows for each extra type
exploded_extras = balls_with_extras.explode('extras_type')
exploded_extras['extras_type'].value_counts()

extras_type
wides      36198
legbyes    10919
noballs     4676
byes        3731
penalty       14
Name: count, dtype: int64

We have 5 types of extras. Since we can have multiple extras in a ball, I am going to use 5 different models to predict the probability of each type occurring.

For the sake of length and readability, I am going to limit my explanations/analysis.
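
With one binary model per type, the outputs are separate probabilities rather than a single distribution, so they need not sum to 1. A minimal sketch of one way to turn them into a sampled outcome is to draw each type independently and retry until at least one is chosen, since an extra is known to have occurred. The function name, the retry limit and the probabilities are all assumptions for illustration.

```python
import random


def sample_extra_types(type_probs, rng, max_tries=1000):
    """Draw each extra type independently; retry until at least one occurs."""
    for _ in range(max_tries):
        chosen = [t for t, p in type_probs.items() if rng.random() < p]
        if chosen:
            return chosen
    # Fallback if every attempt came up empty: take the most likely type.
    return [max(type_probs, key=type_probs.get)]


# Invented per-type probabilities, as five separate models might output.
type_probs = {"wides": 0.65, "legbyes": 0.20, "noballs": 0.08,
              "byes": 0.06, "penalty": 0.0}
rng = random.Random(7)
print([sample_extra_types(type_probs, rng) for _ in range(5)])
```

Independent draws allow combinations such as byes with a no-ball, which do appear in the data, but they can also produce combinations that never occur in practice, such as a wide together with leg byes.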

In [186]:
# Unfortunately, due to the way the data is structured, I can't use one-hot encoding with the exploded rows
# without duplicating rows, so I will just create the columns manually.
balls_with_extras['wides'] = balls_with_extras['extras_details'].apply(lambda x: 'wides' in x)
balls_with_extras['byes'] = balls_with_extras['extras_details'].apply(lambda x: 'byes' in x)
balls_with_extras['legbyes'] = balls_with_extras['extras_details'].apply(lambda x: 'legbyes' in x)
balls_with_extras['noballs'] = balls_with_extras['extras_details'].apply(lambda x: 'noballs' in x)
balls_with_extras['penalty'] = balls_with_extras['extras_details'].apply(lambda x: 'penalty' in x)
extras_dropped = balls_with_extras.drop(columns=['extras_details', 'extras_type', 'extras'])

powerplay_extra_balls = extras_dropped[extras_dropped['powerplay']].drop(columns=['powerplay'])
powerplay_extra_balls = powerplay_extra_balls.merge(powerplay_extra_batter_stats, on='batter', how='left').drop(columns=['batter'])
powerplay_extra_balls = powerplay_extra_balls.merge(powerplay_extra_bowler_stats, on='bowler', how='left').drop(columns=['bowler'])

non_powerplay_extra_balls = extras_dropped[extras_dropped['powerplay'] == False].drop(columns=['powerplay'])
non_powerplay_extra_balls = non_powerplay_extra_balls.merge(non_powerplay_extra_batter_stats, on='batter', how='left').drop(columns=['batter'])
non_powerplay_extra_balls = non_powerplay_extra_balls.merge(non_powerplay_extra_bowler_stats, on='bowler', how='left').drop(columns=['bowler'])

In [175]:
# Wides:
# For the sake of time I am going to limit the max depth of the trees to 30, otherwise I will be sitting here for hours...
X_powerplay_wides = powerplay_extra_balls.drop(columns=['wides', 'byes', 'legbyes', 'noballs', 'penalty'])
y_powerplay_wides = powerplay_extra_balls['wides']
X_train, X_test, y_train, y_test = train_test_split(X_powerplay_wides, y_powerplay_wides, test_size=0.2, random_state=42)
print(f"Powerplay wides model:")
powerplay_wides_model = do_random_forest_classifier(X_train, y_train, max_depth=30)
test_wides = predict_wicket_probability(powerplay_wides_model, X_test, 0)

# Byes:
X_powerplay_byes = powerplay_extra_balls.drop(columns=['wides', 'byes', 'legbyes', 'noballs', 'penalty'])
y_powerplay_byes = powerplay_extra_balls['byes']
X_train, X_test, y_train, y_test = train_test_split(X_powerplay_byes, y_powerplay_byes, test_size=0.2, random_state=42)
print(f"Powerplay byes model:")
powerplay_byes_model = do_random_forest_classifier(X_train, y_train, max_depth=30)
test_byes = predict_wicket_probability(powerplay_byes_model, X_test, 0)
# Legbyes:
X_powerplay_legbyes = powerplay_extra_balls.drop(columns=['wides', 'byes', 'legbyes', 'noballs', 'penalty'])
y_powerplay_legbyes = powerplay_extra_balls['legbyes']
X_train, X_test, y_train, y_test = train_test_split(X_powerplay_legbyes, y_powerplay_legbyes, test_size=0.2, random_state=42)
print(f"Powerplay legbyes model:")
powerplay_legbyes_model = do_random_forest_classifier(X_train, y_train, max_depth=30)
test_legbyes = predict_wicket_probability(powerplay_legbyes_model, X_test, 0)

# Noballs:
X_powerplay_noballs = powerplay_extra_balls.drop(columns=['wides', 'byes', 'legbyes', 'noballs', 'penalty'])
y_powerplay_noballs = powerplay_extra_balls['noballs']
X_train, X_test, y_train, y_test = train_test_split(X_powerplay_noballs, y_powerplay_noballs, test_size=0.2, random_state=42)
print(f"Powerplay noballs model:")
powerplay_noballs_model = do_random_forest_classifier(X_train, y_train, max_depth=30)
test_noballs = predict_wicket_probability(powerplay_noballs_model, X_test, 0)
# Penalties:
# There are no penalties in the powerplay data, and only 14 in the non powerplay data.
# I am hence going to just model this as a random chance, especially considering that the batter/bowler skill
# in theory should have no impact on the probability of a penalty occurring.

Powerplay wides model:
Model Performance:
Train AUC: 0.986 ± 0.000
CV AUC: 0.616 ± 0.007
Precision: 0.727 ± 0.004
Recall: 0.849 ± 0.007
F1 Score: 0.783 ± 0.004
Powerplay byes model:
Model Performance:
Train AUC: 0.997 ± 0.000
CV AUC: 0.587 ± 0.012
Precision: 0.116 ± 0.034
Recall: 0.031 ± 0.010
F1 Score: 0.049 ± 0.015
Powerplay legbyes model:
Model Performance:
Train AUC: 0.992 ± 0.000
CV AUC: 0.661 ± 0.004
Precision: 0.347 ± 0.009
Recall: 0.158 ± 0.010
F1 Score: 0.217 ± 0.010
Powerplay noballs model:
Model Performance:
Train AUC: 0.994 ± 0.000
CV AUC: 0.608 ± 0.017
Precision: 0.254 ± 0.064
Recall: 0.088 ± 0.029
F1 Score: 0.130 ± 0.040


In [176]:
print(f"Mean actual wides chance: {powerplay_extra_balls['wides'].mean():.3f}")
print(f"Mean predicted wides chance: {test_wides.mean():.3f}")
print(f"Proportion of balls where wides chance is 0: {len(test_wides[test_wides == 0]) / len(test_wides):.3f}\n")

print(f"Mean actual byes chance: {powerplay_extra_balls['byes'].mean():.3f}")
print(f"Mean predicted byes chance: {test_byes.mean():.3f}")
print(f"Proportion of balls where byes chance is 0: {len(test_byes[test_byes == 0]) / len(test_byes):.3f}\n")
print(f"Mean actual legbyes chance: {powerplay_extra_balls['legbyes'].mean():.3f}")
print(f"Mean predicted legbyes chance: {test_legbyes.mean():.3f}")
print(f"Proportion of balls where legbyes chance is 0: {len(test_legbyes[test_legbyes == 0]) / len(test_legbyes):.3f}\n")

print(f"Mean actual noballs chance: {powerplay_extra_balls['noballs'].mean():.3f}")
print(f"Mean predicted noballs chance: {test_noballs.mean():.3f}")
print(f"Proportion of balls where noballs chance is 0: {len(test_noballs[test_noballs == 0]) / len(test_noballs):.3f}\n")

Mean actual wides chance: 0.694
Mean predicted wides chance: 0.696
Proportion of balls where wides chance is 0: 0.000

Mean actual byes chance: 0.044
Mean predicted byes chance: 0.048
Proportion of balls where byes chance is 0: 0.139

Mean actual legbyes chance: 0.191
Mean predicted legbyes chance: 0.195
Proportion of balls where legbyes chance is 0: 0.061

Mean actual noballs chance: 0.074
Mean predicted noballs chance: 0.080
Proportion of balls where noballs chance is 0: 0.068



Good to see that the chances match up fairly well again. Due to the very low number of zeroes, I do not think that adding a fudge factor will be necessary here.

In [178]:
# Wides:
# For the sake of time I am going to limit the max depth of the trees to 30, otherwise I will be sitting here for hours...
X_non_powerplay_wides = non_powerplay_extra_balls.drop(columns=['wides', 'byes', 'legbyes', 'noballs', 'penalty'])
y_non_powerplay_wides = non_powerplay_extra_balls['wides']
X_train, X_test, y_train, y_test = train_test_split(X_non_powerplay_wides, y_non_powerplay_wides, test_size=0.2, random_state=42)
print(f"Non powerplay wides model:")
non_powerplay_wides_model = do_random_forest_classifier(X_train, y_train, max_depth=30)
test_wides = predict_wicket_probability(non_powerplay_wides_model, X_test, 0)

# Byes:
X_non_powerplay_byes = non_powerplay_extra_balls.drop(columns=['wides', 'byes', 'legbyes', 'noballs', 'penalty'])
y_non_powerplay_byes = non_powerplay_extra_balls['byes']
X_train, X_test, y_train, y_test = train_test_split(X_non_powerplay_byes, y_non_powerplay_byes, test_size=0.2, random_state=42)
print(f"Non powerplay byes model:")
non_powerplay_byes_model = do_random_forest_classifier(X_train, y_train, max_depth=30)
test_byes = predict_wicket_probability(non_powerplay_byes_model, X_test, 0)

# Legbyes:
X_non_powerplay_legbyes = non_powerplay_extra_balls.drop(columns=['wides', 'byes', 'legbyes', 'noballs', 'penalty'])
y_non_powerplay_legbyes = non_powerplay_extra_balls['legbyes']
X_train, X_test, y_train, y_test = train_test_split(X_non_powerplay_legbyes, y_non_powerplay_legbyes, test_size=0.2, random_state=42)
print(f"Non powerplay legbyes model:")
non_powerplay_legbyes_model = do_random_forest_classifier(X_train, y_train, max_depth=30)
test_legbyes = predict_wicket_probability(non_powerplay_legbyes_model, X_test, 0)

# Noballs:
X_non_powerplay_noballs = non_powerplay_extra_balls.drop(columns=['wides', 'byes', 'legbyes', 'noballs', 'penalty'])
y_non_powerplay_noballs = non_powerplay_extra_balls['noballs']
X_train, X_test, y_train, y_test = train_test_split(X_non_powerplay_noballs, y_non_powerplay_noballs, test_size=0.2, random_state=42)
print(f"Non powerplay noballs model:")
non_powerplay_noballs_model = do_random_forest_classifier(X_train, y_train, max_depth=30)
test_noballs = predict_wicket_probability(non_powerplay_noballs_model, X_test, 0)

# Penalty:
# See above cell.

Non powerplay wides model:
Model Performance:
Train AUC: 0.988 ± 0.000
CV AUC: 0.610 ± 0.004
Precision: 0.672 ± 0.002
Recall: 0.786 ± 0.006
F1 Score: 0.724 ± 0.003
Non powerplay byes model:
Model Performance:
Train AUC: 0.997 ± 0.000
CV AUC: 0.598 ± 0.006
Precision: 0.131 ± 0.025
Recall: 0.027 ± 0.006
F1 Score: 0.045 ± 0.010
Non powerplay legbyes model:
Model Performance:
Train AUC: 0.994 ± 0.000
CV AUC: 0.650 ± 0.005
Precision: 0.353 ± 0.021
Recall: 0.138 ± 0.005
F1 Score: 0.198 ± 0.008
Non powerplay noballs model:
Model Performance:
Train AUC: 0.994 ± 0.000
CV AUC: 0.605 ± 0.008
Precision: 0.275 ± 0.018
Recall: 0.094 ± 0.004
F1 Score: 0.140 ± 0.006


In [179]:
print(f"Mean actual wides chance: {non_powerplay_extra_balls['wides'].mean():.3f}")
print(f"Mean predicted wides chance: {test_wides.mean():.3f}")
print(f"Proportion of balls where wides chance is 0: {len(test_wides[test_wides == 0]) / len(test_wides):.3f}\n")

print(f"Mean actual byes chance: {non_powerplay_extra_balls['byes'].mean():.3f}")
print(f"Mean predicted byes chance: {test_byes.mean():.3f}")
print(f"Proportion of balls where byes chance is 0: {len(test_byes[test_byes == 0]) / len(test_byes):.3f}\n")
print(f"Mean actual legbyes chance: {non_powerplay_extra_balls['legbyes'].mean():.3f}")
print(f"Mean predicted legbyes chance: {test_legbyes.mean():.3f}")
print(f"Proportion of balls where legbyes chance is 0: {len(test_legbyes[test_legbyes == 0]) / len(test_legbyes):.3f}\n")

print(f"Mean actual noballs chance: {non_powerplay_extra_balls['noballs'].mean():.3f}")
print(f"Mean predicted noballs chance: {test_noballs.mean():.3f}")
print(f"Proportion of balls where noballs chance is 0: {len(test_noballs[test_noballs == 0]) / len(test_noballs):.3f}\n")

Mean actual wides chance: 0.633
Mean predicted wides chance: 0.631
Proportion of balls where wides chance is 0: 0.000

Mean actual byes chance: 0.080
Mean predicted byes chance: 0.085
Proportion of balls where byes chance is 0: 0.068

Mean actual legbyes chance: 0.201
Mean predicted legbyes chance: 0.207
Proportion of balls where legbyes chance is 0: 0.055

Mean actual noballs chance: 0.090
Mean predicted noballs chance: 0.097
Proportion of balls where noballs chance is 0: 0.023



Again, the mean predicted chance matches the mean actual chance remarkably well for every extra type. And again, because the proportion of zero predictions is very low, I do not think a fudge factor is needed here.
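Comparing overall means is a fairly coarse check, so a binned comparison of predicted chance against observed rate may be worth a look as well. The following is only a sketch on synthetic data: `binned_calibration` is a hypothetical helper, not something defined earlier in this notebook, and the choice of five equal-width bins is an assumption.

```python
import numpy as np
import pandas as pd

def binned_calibration(y_true, y_prob, n_bins=5):
    """Compare mean predicted probability with the observed rate in each bin."""
    frame = pd.DataFrame({
        'y_true': np.asarray(y_true, dtype=float),
        'y_prob': np.asarray(y_prob, dtype=float),
    })
    # Equal-width bins over [0, 1]; include_lowest keeps predictions of exactly 0.
    edges = np.linspace(0, 1, n_bins + 1)
    frame['bin'] = pd.cut(frame['y_prob'], bins=edges, include_lowest=True)
    return frame.groupby('bin', observed=True).agg(
        mean_predicted=('y_prob', 'mean'),
        observed_rate=('y_true', 'mean'),
        n=('y_true', 'size'),
    )

# Synthetic data for illustration: outcomes are drawn from the stated probabilities,
# so the two columns should agree closely in every bin.
rng = np.random.default_rng(0)
probs = rng.uniform(0, 1, size=20_000)
outcomes = rng.uniform(0, 1, size=20_000) < probs
table = binned_calibration(outcomes, probs)
print(table)
```

With real predictions, something like `binned_calibration(y_test, test_wides)` would show whether the agreement holds across the range of predicted chances or only on average.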

# 2.c) Predicting the number of extras occurring in a ball | The type of extra

Again, I will be limiting my comments/explanations due to the length of this task.

In [201]:
# We need a regression model rather than a classifier here.
def do_random_forest_regressor(X_train, y_train, cv=5, random_state=42, **kwargs):
    model = RandomForestRegressor(random_state=random_state, **kwargs)
    
    # Perform cross-validation
    cv_results = cross_validate(
        model,
        X_train,
        y_train,
        cv=cv,
        scoring={
            'r2': 'r2',
            'rmse': 'neg_root_mean_squared_error',
            'mape': 'neg_mean_absolute_percentage_error'
        },
        return_train_score=True
    )
    
    # Fit final model
    model.fit(X_train, y_train)
    
    # Analyze CV results
    analysis = {
        'r2': {
            'train_avg': cv_results['train_r2'].mean(),
            'train_std': cv_results['train_r2'].std(),
            'cv_avg': cv_results['test_r2'].mean(),
            'cv_std': cv_results['test_r2'].std(),
            'gap': cv_results['train_r2'].mean() - cv_results['test_r2'].mean(),
            'consistency': 'Stable' if cv_results['test_r2'].std() < 0.05 else 'Unstable'
        },
        'rmse': {
            'train_avg': -cv_results['train_rmse'].mean(),
            'train_std': cv_results['train_rmse'].std(),
            'cv_avg': -cv_results['test_rmse'].mean(),
            'cv_std': cv_results['test_rmse'].std(),
            'gap': -cv_results['test_rmse'].mean() - (-cv_results['train_rmse'].mean())
        },
        'mape': {
            'train_avg': -cv_results['train_mape'].mean(),
            'train_std': cv_results['train_mape'].std(),
            'cv_avg': -cv_results['test_mape'].mean(),
            'cv_std': cv_results['test_mape'].std(),
            'gap': -cv_results['test_mape'].mean() - (-cv_results['train_mape'].mean())
        }
    }
    
    # Print detailed analysis
    print("Cross-Validation Analysis:")
    print("\nR² Scores:")
    print(f"Training Average: {analysis['r2']['train_avg']:.3f} ± {analysis['r2']['train_std']:.3f}")
    print(f"CV Average: {analysis['r2']['cv_avg']:.3f} ± {analysis['r2']['cv_std']:.3f}")
    print(f"Gap (Train-CV): {analysis['r2']['gap']:.3f}")
    print(f"CV Consistency: {analysis['r2']['consistency']}")
    
    print("\nRMSE Scores:")
    print(f"Training Average: {analysis['rmse']['train_avg']:.3f} ± {analysis['rmse']['train_std']:.3f}")
    print(f"CV Average: {analysis['rmse']['cv_avg']:.3f} ± {analysis['rmse']['cv_std']:.3f}")
    print(f"Gap (CV-Train): {analysis['rmse']['gap']:.3f}")
    
    print("\nMAPE Scores:")
    print(f"Training Average: {analysis['mape']['train_avg']:.3f} ± {analysis['mape']['train_std']:.3f}")
    print(f"CV Average: {analysis['mape']['cv_avg']:.3f} ± {analysis['mape']['cv_std']:.3f}")
    print(f"Gap (CV-Train): {analysis['mape']['gap']:.3f}")
    
    # Feature importance
    feature_importance = pd.DataFrame({
        'Feature': X_train.columns,
        'Importance': model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    print("\nTop 10 Most Important Features:")
    print(feature_importance.head(10))
    
    return model, analysis, feature_importance

In [189]:
balls_with_extras.head()

Unnamed: 0,batter,bowler,extras_details,extras,powerplay,extras_type,wides,byes,legbyes,noballs,penalty
17,09a9d073,3a60e0b5,{'wides': 1},1,True,[wides],True,False,False,False,False
33,73c18486,e62dd25d,{'legbyes': 1},1,True,[legbyes],False,False,True,False,False
51,09a9d073,6a26221c,{'wides': 1},1,False,[wides],True,False,False,False,False
76,650d5e49,8361e524,{'legbyes': 1},1,False,[legbyes],False,False,True,False,False
99,650d5e49,81c36ee9,{'wides': 1},1,False,[wides],True,False,False,False,False


In [198]:
exploded_extras = balls_with_extras.explode('extras_type')
exploded_extras['extras_from_relevant_type'] = exploded_extras.apply(lambda row: row['extras_details'].get(row['extras_type'], 0), axis=1)
exploded_extras = exploded_extras.drop(columns=['extras_details', 'wides', 'byes', 'legbyes', 'noballs', 'penalty', 'extras'])
exploded_extras.head()

Unnamed: 0,batter,bowler,powerplay,extras_type,extras_from_relevant_type
17,09a9d073,3a60e0b5,True,wides,1
33,73c18486,e62dd25d,True,legbyes,1
51,09a9d073,6a26221c,False,wides,1
76,650d5e49,8361e524,False,legbyes,1
99,650d5e49,81c36ee9,False,wides,1
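
The rows shown above each carry a single extra type, so the effect of `explode` is not visible in them. Below is a minimal sketch on made-up rows (the IDs and values are invented for illustration) where one ball carries two extra types.

```python
import pandas as pd

# Made-up rows for illustration; the second ball carries two extra types.
toy = pd.DataFrame({
    'batter': ['a1', 'a2'],
    'bowler': ['b1', 'b2'],
    'extras_details': [{'wides': 1}, {'noballs': 1, 'byes': 4}],
    'extras_type': [['wides'], ['noballs', 'byes']],
})

# One row per (ball, extra type); the original index label is repeated.
toy_exploded = toy.explode('extras_type')

# Look up the extras attributed to the type on each row.
toy_exploded['extras_from_relevant_type'] = toy_exploded.apply(
    lambda row: row['extras_details'].get(row['extras_type'], 0), axis=1
)
print(toy_exploded[['extras_type', 'extras_from_relevant_type']])
```

The second ball becomes two rows sharing index 1, one for the no-ball and one for the byes, each with its own count.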


In [None]:
powerplay_extra_balls = exploded_extras[exploded_extras['powerplay']].drop(columns=['powerplay'])
powerplay_extra_balls = powerplay_extra_balls.merge(powerplay_extra_batter_stats, on='batter', how='left').drop(columns=['batter'])
powerplay_extra_balls = powerplay_extra_balls.merge(powerplay_extra_bowler_stats, on='bowler', how='left').drop(columns=['bowler'])
non_powerplay_extra_balls = exploded_extras[~exploded_extras['powerplay']].drop(columns=['powerplay'])
non_powerplay_extra_balls = non_powerplay_extra_balls.merge(non_powerplay_extra_batter_stats, on='batter', how='left').drop(columns=['batter'])
non_powerplay_extra_balls = non_powerplay_extra_balls.merge(non_powerplay_extra_bowler_stats, on='bowler', how='left').drop(columns=['bowler'])

# Wides:
print(f"\nPowerplay wides model:")
X_powerplay_wides = powerplay_extra_balls[powerplay_extra_balls['extras_type'] == 'wides'].drop(columns=['extras_type', 'extras_from_relevant_type'])
y_powerplay_wides = powerplay_extra_balls[powerplay_extra_balls['extras_type'] == 'wides']['extras_from_relevant_type']
X_train, X_test, y_train, y_test = train_test_split(X_powerplay_wides, y_powerplay_wides, test_size=0.2, random_state=42)
powerplay_wides_model, _, _ = do_random_forest_regressor(X_train, y_train, max_depth=30)

# Byes:
print(f"\nPowerplay byes model:")
X_powerplay_byes = powerplay_extra_balls[powerplay_extra_balls['extras_type'] == 'byes'].drop(columns=['extras_type', 'extras_from_relevant_type'])
y_powerplay_byes = powerplay_extra_balls[powerplay_extra_balls['extras_type'] == 'byes']['extras_from_relevant_type']
X_train, X_test, y_train, y_test = train_test_split(X_powerplay_byes, y_powerplay_byes, test_size=0.2, random_state=42)
powerplay_byes_model, _, _ = do_random_forest_regressor(X_train, y_train, max_depth=30)

# Legbyes:
print(f"\nPowerplay legbyes model:")
X_powerplay_legbyes = powerplay_extra_balls[powerplay_extra_balls['extras_type'] == 'legbyes'].drop(columns=['extras_type', 'extras_from_relevant_type'])
y_powerplay_legbyes = powerplay_extra_balls[powerplay_extra_balls['extras_type'] == 'legbyes']['extras_from_relevant_type']
X_train, X_test, y_train, y_test = train_test_split(X_powerplay_legbyes, y_powerplay_legbyes, test_size=0.2, random_state=42)
powerplay_legbyes_model, _, _ = do_random_forest_regressor(X_train, y_train, max_depth=30)

# Noballs and penalties result in a fixed number of extras, so these do not need modeling.

The results here show that the model is not very good at predicting the number of extras occurring in a ball.

This is likely due to a combination of factors:
1. The number of extras is almost always 1, so the model struggles to predict different values.
2. The amount of data is fairly limited, so the model is unable to learn well.

I do not think that any model will be able to predict this well, so trying further would be a waste of time. However, there are a few other options:

1. Just set the number of extras to 1 for all types - as this is almost always the case it should not lead to significant issues.
2. Use a probability distribution to predict the number of extras occurring depending on the type.

In [205]:
print(f"Proportion of wides leading to 1 extra: {powerplay_extra_balls[(powerplay_extra_balls['extras_type'] == 'wides') & (powerplay_extra_balls['extras_from_relevant_type'] == 1)].shape[0] / powerplay_extra_balls[powerplay_extra_balls['extras_type'] == 'wides'].shape[0]:.3f}")
print(f"Proportion of byes leading to 1 extra: {powerplay_extra_balls[(powerplay_extra_balls['extras_type'] == 'byes') & (powerplay_extra_balls['extras_from_relevant_type'] == 1)].shape[0] / powerplay_extra_balls[powerplay_extra_balls['extras_type'] == 'byes'].shape[0]:.3f}")
print(f"Proportion of legbyes leading to 1 extra: {powerplay_extra_balls[(powerplay_extra_balls['extras_type'] == 'legbyes') & (powerplay_extra_balls['extras_from_relevant_type'] == 1)].shape[0] / powerplay_extra_balls[powerplay_extra_balls['extras_type'] == 'legbyes'].shape[0]:.3f}")

Proportion of wides leading to 1 extra: 0.900
Proportion of byes leading to 1 extra: 0.597
Proportion of legbyes leading to 1 extra: 0.846


While the proportions of wides and legbyes scoring 1 extra are quite high, there is still a non-negligible chance of more extras occurring. Byes in particular have a significant chance of more than 1 extra being scored.

I therefore think the second option will be better.
Since the number of runs scored when such extras occur can easily come down to luck rather than skill, fitting a distribution to the data should be a reasonable representation. I think we should still split by powerplay/non-powerplay here, because the fielding restrictions during the powerplay (fewer fielders allowed outside the inner circle) will likely increase the number of extras.

For simplicity, I will just use an empirical distribution.

In [215]:
def get_empirical_distribution(data):
    """
    Create a probability distribution directly from data.
    
    Args:
        data: Series of observed values
    
    Returns:
        probabilities: Series indexed by each observed value (sorted),
            holding the proportion of observations with that value
    """
    # Get value counts and convert to probabilities
    value_counts = data.value_counts()
    total = len(data)
    probabilities = value_counts / total
    
    # Sort by value for clarity
    probabilities = probabilities.sort_index()
    
    return probabilities

powerplay_wides = powerplay_extra_balls[powerplay_extra_balls['extras_type'] == 'wides']['extras_from_relevant_type']
powerplay_byes = powerplay_extra_balls[powerplay_extra_balls['extras_type'] == 'byes']['extras_from_relevant_type']
powerplay_legbyes = powerplay_extra_balls[powerplay_extra_balls['extras_type'] == 'legbyes']['extras_from_relevant_type']
powerplay_wides_probs = get_empirical_distribution(powerplay_wides)
powerplay_byes_probs = get_empirical_distribution(powerplay_byes)
powerplay_legbyes_probs = get_empirical_distribution(powerplay_legbyes)

non_powerplay_wides = non_powerplay_extra_balls[non_powerplay_extra_balls['extras_type'] == 'wides']['extras_from_relevant_type']
non_powerplay_byes = non_powerplay_extra_balls[non_powerplay_extra_balls['extras_type'] == 'byes']['extras_from_relevant_type']
non_powerplay_legbyes = non_powerplay_extra_balls[non_powerplay_extra_balls['extras_type'] == 'legbyes']['extras_from_relevant_type']
non_powerplay_wides_probs = get_empirical_distribution(non_powerplay_wides)
non_powerplay_byes_probs = get_empirical_distribution(non_powerplay_byes)
non_powerplay_legbyes_probs = get_empirical_distribution(non_powerplay_legbyes)
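
To use these distributions in a simulation, a value has to be drawn from them for each relevant ball. A minimal sketch of one way to do that is shown here; `sample_from_empirical` is a hypothetical helper and the toy distribution is made up, so this is an illustration rather than the sampling code used later.

```python
import numpy as np
import pandas as pd

def sample_from_empirical(probabilities, rng, size=1):
    """Draw values from a Series whose index holds values and whose data holds probabilities."""
    return rng.choice(
        probabilities.index.to_numpy(),
        size=size,
        p=probabilities.to_numpy(),
    )

# Made-up distribution for illustration only.
toy_probs = pd.Series([0.6, 0.1, 0.3], index=[1, 2, 4])
rng = np.random.default_rng(42)
draws = sample_from_empirical(toy_probs, rng, size=10_000)

# The sampled proportions should be close to the input probabilities.
print(pd.Series(draws).value_counts(normalize=True).sort_index())
```

Note that `rng.choice` requires the probabilities to sum to one, which holds for the Series returned by `get_empirical_distribution` as long as the input has no missing values.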

# 3. Predicting the number of runs scored from a ball

In [217]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857199 entries, 0 to 857198
Data columns (total 29 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   game_id               857199 non-null  object 
 1   date                  857199 non-null  object 
 2   venue                 857199 non-null  object 
 3   location              816895 non-null  object 
 4   gender                857199 non-null  object 
 5   match_type            857199 non-null  object 
 6   innings               857199 non-null  int64  
 7   batting_team          857199 non-null  object 
 8   bowling_team          857199 non-null  object 
 9   batting_team_players  857199 non-null  object 
 10  bowling_team_players  857199 non-null  object 
 11  over                  857199 non-null  int64  
 12  ball_in_over          857199 non-null  int64  
 13  batter                857199 non-null  object 
 14  bowler                857199 non-null  object 
 15  

In [223]:
# extras_details is a string representation of a dict, or NaN when the ball had no extras
data['extras_parsed'] = data['extras_details'].apply(lambda x: ast.literal_eval(x) if not pd.isna(x) else {})
# Flag the deliveries that were themselves no-balls
data['freehit'] = data['extras_parsed'].apply(lambda x: 'noballs' in x)
data['freehit'].value_counts()

freehit
False    852523
True       4676
Name: count, dtype: int64
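
The flag above marks the no-ball delivery itself. In limited-overs formats the free hit is normally the delivery that follows a no-ball, so if that is the ball of interest, the flag would need shifting by one delivery within each innings. A minimal sketch on made-up rows, assuming rows are stored in delivery order and that `game_id` and `innings` together identify an innings (the `noball` and `freehit_next_ball` column names are hypothetical):

```python
import pandas as pd

# Made-up deliveries for illustration, already in delivery order.
toy = pd.DataFrame({
    'game_id': ['g1', 'g1', 'g1', 'g1', 'g2'],
    'innings': [1, 1, 1, 2, 1],
    'noball': [False, True, False, True, False],
})

# Work with integers so the shifted column has a plain numeric dtype.
toy['noball_int'] = toy['noball'].astype(int)

# Shift the flag down one delivery within each innings; the first ball
# of an innings has no predecessor and is filled with 0.
previous_noball = toy.groupby(['game_id', 'innings'])['noball_int'].shift(1)
toy['freehit_next_ball'] = previous_noball.fillna(0).astype(bool)
print(toy[['game_id', 'innings', 'noball', 'freehit_next_ball']])
```

Grouping by innings stops a no-ball on the last ball of one innings from marking the first ball of the next. This sketch ignores the case where a free hit carries over because the free-hit delivery is itself a wide or no-ball.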

In [224]:
runs_data = data[['batter', 'bowler', 'powerplay', 'freehit', 'runs_batter']]
runs_data.head()

Unnamed: 0,batter,bowler,powerplay,freehit,runs_batter
0,7fca84b7,3a60e0b5,True,False,1
1,73c18486,3a60e0b5,True,False,1
2,7fca84b7,3a60e0b5,True,False,0
3,09a9d073,3a60e0b5,True,False,0
4,09a9d073,3a60e0b5,True,False,4


In [231]:
runs_data['runs_batter'].value_counts()

runs_batter
0    406520
1    285508
4     77460
2     58822
6     24971
3      3796
5       120
7         2
Name: count, dtype: int64

The data here is extremely unbalanced, making the job of a predictive model difficult.

I think in this situation using an empirical distribution will be the best option. However, I am going to segment the data by powerplay/non-powerplay and then further by considering the skill of the bowler compared to the batter.

I will compare their average runs conceded/made per ball to determine skill levels, and segment the data into 3 groups based on this:
1. Bowler is better than batter
2. Bowler is worse than batter
3. Bowler is about the same as batter
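
One way the three groups could be formed is sketched here. The column names `batter_runs_per_ball` and `bowler_runs_per_ball`, the direct subtraction of the two averages, and the tolerance of 0.1 runs per ball are all assumptions made for illustration, not values taken from the data.

```python
import numpy as np
import pandas as pd

def skill_group(batter_rpb, bowler_rpb, tolerance=0.1):
    """Label each ball by comparing batter and bowler runs-per-ball averages."""
    diff = np.asarray(batter_rpb, dtype=float) - np.asarray(bowler_rpb, dtype=float)
    # Assumption: the bowler counts as "better" when they concede clearly fewer
    # runs per ball than the batter usually scores, and vice versa.
    return np.select(
        [diff < -tolerance, diff > tolerance],
        ['bowler_better', 'batter_better'],
        default='even',
    )

# Made-up balls for illustration only.
toy = pd.DataFrame({
    'batter_runs_per_ball': [1.40, 0.90, 1.10, 1.25],
    'bowler_runs_per_ball': [1.10, 1.20, 1.15, 1.00],
    'runs_batter': [4, 0, 1, 6],
})
toy['skill_group'] = skill_group(toy['batter_runs_per_ball'], toy['bowler_runs_per_ball'])

# Empirical distribution of runs within each skill group.
group_probs = toy.groupby('skill_group')['runs_batter'].value_counts(normalize=True)
print(group_probs)
```

The tolerance controls how wide the "about the same" band is; a wider band leaves more balls in the middle group and fewer in the two outer ones.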

In [232]:
powerplay_runs_data = runs_data[runs_data['powerplay']].drop(columns=['powerplay'])
powerplay_runs_data = powerplay_runs_data.merge(powerplay_batter_stats, on='batter', how='left').drop(columns=['batter'])
powerplay_runs_data = powerplay_runs_data.merge(powerplay_bowler_stats, on='bowler', how='left').drop(columns=['bowler'])

non_powerplay_runs_data = runs_data[~runs_data['powerplay']].drop(columns=['powerplay'])
non_powerplay_runs_data = non_powerplay_runs_data.merge(non_powerplay_batter_stats, on='batter', how='left').drop(columns=['batter'])
non_powerplay_runs_data = non_powerplay_runs_data.merge(non_powerplay_bowler_stats, on='bowler', how='left').drop(columns=['bowler'])