In T20 cricket, predicting a team's final score while the innings is still in progress is useful for strategy, commentary, and fan engagement. The objective of this project is to build a machine learning model that predicts the final score of a T20 innings from historical ball-by-ball data up to a given point in the match. The model uses the match state at that point: the two teams, the city, the current score, balls left, wickets left, the current run rate, and the runs scored in the last five overs. The prediction is intended to support real-time decision-making and analysis of the game.

Outcome (Target Variable):
	•	Final Score of the Innings: The total number of runs scored by the batting team by the end of 20 overs or until all wickets are lost.
This is a regression problem, as the target is a continuous numerical value.

Features:
batting_team
bowling_team
city
current_score
balls_left
wickets_left
crr (current run rate)
last_five (runs scored in the last five overs, i.e. the last 30 balls)

In [None]:
import pandas as pd
import pickle
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')



In [None]:
# Load the ball-by-ball dataset; the context manager closes the file handle
with open('/content/drive/MyDrive/dataset__01.pkl', 'rb') as f:
    df = pickle.load(f)

In [None]:
df

# Features to derive from the ball-by-ball data:
# batting_team
# bowling_team
# city
# current_score
# balls_left
# wickets_left
# crr
# last_five (runs in the last five overs)

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue
4227,37,India,South Africa,0.1,0,0,Centurion,"SuperSport Park, Centurion"
4228,37,India,South Africa,0.2,0,SV Samson,Centurion,"SuperSport Park, Centurion"
4229,37,India,South Africa,0.3,0,0,Centurion,"SuperSport Park, Centurion"
4230,37,India,South Africa,0.4,4,0,Centurion,"SuperSport Park, Centurion"
4231,37,India,South Africa,0.5,6,0,Centurion,"SuperSport Park, Centurion"
...,...,...,...,...,...,...,...,...
313547,2610,England,South Africa,19.2,0,0,Cardiff,Sophia Gardens
313548,2610,England,South Africa,19.3,0,JC Buttler,Cardiff,Sophia Gardens
313549,2610,England,South Africa,19.4,0,DJ Willey,Cardiff,Sophia Gardens
313550,2610,England,South Africa,19.5,0,0,Cardiff,Sophia Gardens


In [None]:
df.isnull().sum()

Unnamed: 0,0
match_id,0
batting_team,0
bowling_team,0
ball,0
runs,0
player_dismissed,0
city,8425
venue,0


In [None]:
df['city'].value_counts()

city,count
Colombo,4924
Johannesburg,3835
Auckland,3165
Mirpur,3061
Cape Town,2369
...,...
Jaipur,123
Victoria,123
Dallas,122
Potchefstroom,122


In [None]:
df[df['city'].isnull()]['venue'].value_counts()

venue,count
Dubai International Cricket Stadium,3092
Pallekele International Cricket Stadium,1942
Melbourne Cricket Ground,1453
Sydney Cricket Ground,749
Adelaide Oval,498
Harare Sports Club,372
Sylhet International Cricket Stadium,128
Sharjah Cricket Stadium,127
Carrara Oval,64


In [None]:
# Fill missing cities with the first word of the venue name
# (e.g. 'Melbourne Cricket Ground' -> 'Melbourne')
cities = np.where(df['city'].isnull(), df['venue'].str.split().str[0], df['city'])

In [None]:
df['city'] = cities
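The first-word heuristic above can be checked on a few rows. This is a minimal sketch on a toy frame, not the notebook's data: the venue names are taken from the outputs above, the `city_filled` column name is made up for illustration, and the missing city on the last row is hypothetical. It uses `Series.fillna` as an equivalent to the `np.where` call.

```python
import pandas as pd

# Toy rows: venue names come from the outputs above.
# The missing city on the last row is hypothetical (the real row has 'Centurion').
toy = pd.DataFrame({
    "city": [None, None, "Cardiff", None],
    "venue": [
        "Dubai International Cricket Stadium",
        "Melbourne Cricket Ground",
        "Sophia Gardens",
        "SuperSport Park, Centurion",
    ],
})

# First whitespace-separated word of each venue name
first_word = toy["venue"].str.split().str[0]

# Keep the known city, otherwise fall back to the first word of the venue
toy["city_filled"] = toy["city"].fillna(first_word)
print(toy[["venue", "city_filled"]])
```

The heuristic works for the nine venues that actually have a missing city, but a venue such as "SuperSport Park, Centurion" would be labelled "SuperSport", so it is worth rechecking if the dataset changes.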

In [None]:
df.isnull().sum()

Unnamed: 0,0
match_id,0
batting_team,0
bowling_team,0
ball,0
runs,0
player_dismissed,0
city,0
venue,0


In [None]:
df.drop(columns=['venue'],inplace=True)

In [None]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city
4227,37,India,South Africa,0.1,0,0,Centurion
4228,37,India,South Africa,0.2,0,SV Samson,Centurion
4229,37,India,South Africa,0.3,0,0,Centurion
4230,37,India,South Africa,0.4,4,0,Centurion
4231,37,India,South Africa,0.5,6,0,Centurion
...,...,...,...,...,...,...,...
313547,2610,England,South Africa,19.2,0,0,Cardiff
313548,2610,England,South Africa,19.3,0,JC Buttler,Cardiff
313549,2610,England,South Africa,19.4,0,DJ Willey,Cardiff
313550,2610,England,South Africa,19.5,0,0,Cardiff


In [None]:
# Keep cities with more than 600 recorded deliveries (five full 120-ball innings)
city_counts = df['city'].value_counts()
eligible_cities = city_counts[city_counts > 600].index.tolist()

In [None]:
df = df[df['city'].isin(eligible_cities)]

In [None]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city
4227,37,India,South Africa,0.1,0,0,Centurion
4228,37,India,South Africa,0.2,0,SV Samson,Centurion
4229,37,India,South Africa,0.3,0,0,Centurion
4230,37,India,South Africa,0.4,4,0,Centurion
4231,37,India,South Africa,0.5,6,0,Centurion
...,...,...,...,...,...,...,...
313547,2610,England,South Africa,19.2,0,0,Cardiff
313548,2610,England,South Africa,19.3,0,JC Buttler,Cardiff
313549,2610,England,South Africa,19.4,0,DJ Willey,Cardiff
313550,2610,England,South Africa,19.5,0,0,Cardiff


In [None]:
# Make an explicit copy to avoid SettingWithCopyWarning
df = df.copy()

# Convert 'runs' column to numeric (non-numeric becomes NaN)
df['runs'] = pd.to_numeric(df['runs'], errors='coerce')

# Apply cumulative sum of runs grouped by match ID
df['current_score'] = df.groupby('match_id')['runs'].cumsum()


In [None]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,current_score
4227,37,India,South Africa,0.1,0,0,Centurion,0
4228,37,India,South Africa,0.2,0,SV Samson,Centurion,0
4229,37,India,South Africa,0.3,0,0,Centurion,0
4230,37,India,South Africa,0.4,4,0,Centurion,4
4231,37,India,South Africa,0.5,6,0,Centurion,10
...,...,...,...,...,...,...,...,...
313547,2610,England,South Africa,19.2,0,0,Cardiff,180
313548,2610,England,South Africa,19.3,0,JC Buttler,Cardiff,180
313549,2610,England,South Africa,19.4,0,DJ Willey,Cardiff,180
313550,2610,England,South Africa,19.5,0,0,Cardiff,180


In [None]:
df['over'] = df['ball'].apply(lambda x:str(x).split(".")[0])
df['ball_no'] = df['ball'].apply(lambda x:str(x).split(".")[1])
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,current_score,over,ball_no
4227,37,India,South Africa,0.1,0,0,Centurion,0,0,1
4228,37,India,South Africa,0.2,0,SV Samson,Centurion,0,0,2
4229,37,India,South Africa,0.3,0,0,Centurion,0,0,3
4230,37,India,South Africa,0.4,4,0,Centurion,4,0,4
4231,37,India,South Africa,0.5,6,0,Centurion,10,0,5
...,...,...,...,...,...,...,...,...,...,...
313547,2610,England,South Africa,19.2,0,0,Cardiff,180,19,2
313548,2610,England,South Africa,19.3,0,JC Buttler,Cardiff,180,19,3
313549,2610,England,South Africa,19.4,0,DJ Willey,Cardiff,180,19,4
313550,2610,England,South Africa,19.5,0,0,Cardiff,180,19,5


In [None]:
df['balls_bowled'] = (df['over'].astype('int')*6) + df['ball_no'].astype('int')
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,current_score,over,ball_no,balls_bowled
4227,37,India,South Africa,0.1,0,0,Centurion,0,0,1,1
4228,37,India,South Africa,0.2,0,SV Samson,Centurion,0,0,2,2
4229,37,India,South Africa,0.3,0,0,Centurion,0,0,3,3
4230,37,India,South Africa,0.4,4,0,Centurion,4,0,4,4
4231,37,India,South Africa,0.5,6,0,Centurion,10,0,5,5
...,...,...,...,...,...,...,...,...,...,...,...
313547,2610,England,South Africa,19.2,0,0,Cardiff,180,19,2,116
313548,2610,England,South Africa,19.3,0,JC Buttler,Cardiff,180,19,3,117
313549,2610,England,South Africa,19.4,0,DJ Willey,Cardiff,180,19,4,118
313550,2610,England,South Africa,19.5,0,0,Cardiff,180,19,5,119


In [None]:
# Balls remaining in the innings, floored at 0
df['balls_left'] = (120 - df['balls_bowled']).clip(lower=0)
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,current_score,over,ball_no,balls_bowled,balls_left
4227,37,India,South Africa,0.1,0,0,Centurion,0,0,1,1,119
4228,37,India,South Africa,0.2,0,SV Samson,Centurion,0,0,2,2,118
4229,37,India,South Africa,0.3,0,0,Centurion,0,0,3,3,117
4230,37,India,South Africa,0.4,4,0,Centurion,4,0,4,4,116
4231,37,India,South Africa,0.5,6,0,Centurion,10,0,5,5,115
...,...,...,...,...,...,...,...,...,...,...,...,...
313547,2610,England,South Africa,19.2,0,0,Cardiff,180,19,2,116,4
313548,2610,England,South Africa,19.3,0,JC Buttler,Cardiff,180,19,3,117,3
313549,2610,England,South Africa,19.4,0,DJ Willey,Cardiff,180,19,4,118,2
313550,2610,England,South Africa,19.5,0,0,Cardiff,180,19,5,119,1


In [None]:
# Flag each delivery on which a wicket fell (1) or not (0).
# Comparing via str() handles the placeholder being stored as either '0' or 0.
df['player_dismissed'] = df['player_dismissed'].apply(lambda x: 0 if str(x) == '0' else 1)
# Cumulative wickets lost within each match
df['player_dismissed'] = df.groupby('match_id')['player_dismissed'].cumsum()
df['wickets_left'] = 10 - df['player_dismissed']
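The wicket bookkeeping can be illustrated on the first five deliveries of match 37 shown earlier, where SV Samson is dismissed on ball 0.2. This is a small vectorised sketch, assuming the "no wicket" placeholder is `'0'` or `0`; the toy frame and the `wickets_lost` column name are illustrative only.

```python
import pandas as pd

# First five deliveries of match 37, as shown in the raw table above
toy = pd.DataFrame({
    "match_id": [37, 37, 37, 37, 37],
    "player_dismissed": ["0", "SV Samson", "0", "0", "0"],
})

# 1 where a batter's name is present, 0 for the '0' placeholder
wicket_fell = toy["player_dismissed"].astype(str).ne("0").astype(int)

# Running count of wickets lost per match, then wickets in hand
toy["wickets_lost"] = wicket_fell.groupby(toy["match_id"]).cumsum()
toy["wickets_left"] = 10 - toy["wickets_lost"]
print(toy)
```

Because the flag is rebuilt from the raw names each time, this version gives the same result if it is run twice, whereas overwriting `player_dismissed` in place does not.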


In [None]:
df['crr'] = (df['current_score']*6)/df['balls_bowled']

In [None]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,current_score,over,ball_no,balls_bowled,balls_left,wickets_left,crr
4227,37,India,South Africa,0.1,0,0,Centurion,0,0,1,1,119,10,0.000000
4228,37,India,South Africa,0.2,0,1,Centurion,0,0,2,2,118,9,0.000000
4229,37,India,South Africa,0.3,0,1,Centurion,0,0,3,3,117,9,0.000000
4230,37,India,South Africa,0.4,4,1,Centurion,4,0,4,4,116,9,6.000000
4231,37,India,South Africa,0.5,6,1,Centurion,10,0,5,5,115,9,12.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
313547,2610,England,South Africa,19.2,0,6,Cardiff,180,19,2,116,4,4,9.310345
313548,2610,England,South Africa,19.3,0,7,Cardiff,180,19,3,117,3,3,9.230769
313549,2610,England,South Africa,19.4,0,8,Cardiff,180,19,4,118,2,2,9.152542
313550,2610,England,South Africa,19.5,0,8,Cardiff,180,19,5,119,1,2,9.075630
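The three derived columns can be reproduced for a single match snapshot. The helper below is a hypothetical convenience function (it is not part of the notebook) that applies the same formulas as the cells above: `balls_bowled = over*6 + ball_no`, `balls_left = 120 - balls_bowled`, and `crr = current_score*6 / balls_bowled`.

```python
def match_state_features(ball_label: str, current_score: int, wickets_lost: int) -> dict:
    """Derive balls_left, wickets_left and crr from one match snapshot.

    ball_label is the 'over.ball' label as text, e.g. '19.2'.
    """
    over, ball_no = ball_label.split(".")
    balls_bowled = int(over) * 6 + int(ball_no)
    return {
        "balls_left": max(120 - balls_bowled, 0),
        "wickets_left": 10 - wickets_lost,
        # Runs per over so far; balls_bowled is at least 1 for any real delivery
        "crr": current_score * 6 / balls_bowled,
    }


# Row 313547 from the table above: ball 19.2, score 180, 6 wickets down
state = match_state_features("19.2", current_score=180, wickets_lost=6)
print(state)
```

For that row the helper gives 4 balls left, 4 wickets left and a run rate of about 9.31, matching the table.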


In [None]:
groups = df.groupby('match_id')

match_ids = df['match_id'].unique()
last_five = []
for match_id in match_ids:
    # Rolling sum over the last 30 deliveries (five overs);
    # the first 29 rows of each match have no full window and become NaN
    group_runs = groups.get_group(match_id)['runs'].astype(float)
    last_five.extend(group_runs.rolling(window=30).sum().values.tolist())

In [None]:
df['last_five'] = last_five
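The same rolling feature can be computed without an explicit loop. This is a sketch of an alternative, not the notebook's method: `add_rolling_runs` is a hypothetical helper, and the toy data uses a window of 3 instead of 30 so the result is easy to check by hand. Because `transform` keeps the original index, it does not depend on the rows of each match being stored contiguously.

```python
import pandas as pd


def add_rolling_runs(frame: pd.DataFrame, window: int = 30, col: str = "last_five") -> pd.DataFrame:
    """Add a per-match rolling sum of 'runs' over the last `window` deliveries."""
    out = frame.copy()
    out[col] = out.groupby("match_id")["runs"].transform(
        lambda s: s.rolling(window=window).sum()
    )
    return out


# Two toy matches; the window never spans a match boundary
toy = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 2, 2, 2],
    "runs":     [1, 0, 4, 6, 2, 2, 1],
})
result = add_rolling_runs(toy, window=3)
print(result)
```

The first `window - 1` rows of each match are NaN, which is why those rows are dropped later.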

In [None]:
#df[['batting_team','bowling_team','city','current_score','balls_left','wickets_left','crr','last_five','']]

In [None]:
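# Total innings runs per match (the target), merged back onto every ball-level row.
# After the merge the total is named 'runs_x' and the per-ball runs 'runs_y'.
final_df = df.groupby('match_id')['runs'].sum().reset_index().merge(df, on='match_id')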

In [None]:
final_df=final_df[['batting_team','bowling_team','city','current_score','balls_left','wickets_left','crr','last_five','runs_x']]
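The target column `runs_x` is simply the match total repeated on every delivery of that match. A possible alternative to the merge, shown here on toy data, is `groupby(...).transform('sum')`, which lets the target carry an explicit name; `total_runs` is a made-up name for illustration.

```python
import pandas as pd

# Two toy matches with per-ball runs
toy = pd.DataFrame({
    "match_id": [1, 1, 1, 2, 2],
    "runs":     [4, 0, 6, 1, 2],
})

# Broadcast each match's total back onto its own rows
toy["total_runs"] = toy.groupby("match_id")["runs"].transform("sum")
print(toy)
```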

In [None]:
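# Drop rows without a full 30-ball window (last_five is NaN for the first 29 deliveries of each match)
final_df = final_df.dropna()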

In [None]:
final_df.isnull().sum()

Unnamed: 0,0
batting_team,0
bowling_team,0
city,0
current_score,0
balls_left,0
wickets_left,0
crr,0
last_five,0
runs_x,0


In [None]:
final_df = final_df.sample(final_df.shape[0])

In [None]:
final_df.sample(2)

Unnamed: 0,batting_team,bowling_team,city,current_score,balls_left,wickets_left,crr,last_five,runs_x
61830,South Africa,Sri Lanka,Colombo,65,65,6,7.090909,30.0,98
31540,South Africa,Bangladesh,Mirpur,127,27,7,8.193548,31.0,169


In [None]:
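from sklearn.model_selection import train_test_split

# Features and target (runs_x is the innings total)
X = final_df.drop(columns=['runs_x'])
y = final_df['runs_x']
# Random row-level split: deliveries from the same match can land in both train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)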
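Every delivery of a match carries the same target, so a row-level split puts near-duplicate rows of one match on both sides, and the test score may then overstate how well the model handles unseen matches. One way to check is to split by match instead. The sketch below uses scikit-learn's `GroupShuffleSplit` on toy data; it assumes `match_id` is kept as a column, which `final_df` in this notebook does not do, and `split_by_match` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_match(frame: pd.DataFrame, group_col: str = "match_id",
                   test_size: float = 0.2, seed: int = 1):
    """Split rows so that each match lands entirely in train or entirely in test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(frame, groups=frame[group_col]))
    return frame.iloc[train_idx], frame.iloc[test_idx]


# Toy data: 10 matches with 5 deliveries each
toy = pd.DataFrame({
    "match_id": np.repeat(np.arange(10), 5),
    "current_score": np.tile([5, 12, 20, 31, 40], 10),
})
train_part, test_part = split_by_match(toy)

# No match should appear on both sides of the split
shared = set(train_part["match_id"]) & set(test_part["match_id"])
print(len(train_part), len(test_part), shared)
```

The group column would be dropped from the features before fitting; it is only needed to drive the split.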

In [None]:
X_train

Unnamed: 0,batting_team,bowling_team,city,current_score,balls_left,wickets_left,crr,last_five
14192,Bangladesh,New Zealand,Hamilton,43,75,5,5.733333,25.0
63408,Pakistan,Australia,Abu Dhabi,67,67,9,7.584906,35.0
3970,West Indies,South Africa,Johannesburg,102,12,3,5.666667,39.0
18547,West Indies,England,London,116,11,3,6.385321,25.0
15772,England,Pakistan,Dubai,54,69,9,6.352941,42.0
...,...,...,...,...,...,...,...,...
5816,India,Pakistan,Johannesburg,138,10,5,7.527273,35.0
22976,Pakistan,South Africa,Centurion,133,43,8,10.363636,59.0
40030,West Indies,England,Gros Islet,48,88,9,9.000000,45.0
37390,Pakistan,India,Mirpur,63,38,3,4.609756,20.0


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score,mean_absolute_error

In [None]:
trf = ColumnTransformer([
    ('trf',OneHotEncoder(sparse_output=False,drop='first'),['batting_team','bowling_team','city'])
]
,remainder='passthrough')

In [None]:
pipe = Pipeline(steps=[
    ('step1',trf),
    ('step2',StandardScaler()),
    ('step3',XGBRegressor(n_estimators=1000,learning_rate=0.2,max_depth=12,random_state=1))
])

In [None]:
pipe.fit(X_train,y_train)
y_pred = pipe.predict(X_test)
print(r2_score(y_test,y_pred))
print(mean_absolute_error(y_test,y_pred))

0.987042248249054
1.8555880784988403


In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Evaluate the model
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Print the results
print(f"R² Score       : {r2:.4f}")
print(f"Mean Absolute Error (MAE)  : {mae:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

R² Score       : 0.9870
Mean Absolute Error (MAE)  : 1.86
Root Mean Squared Error (RMSE): 3.86
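For context, these metrics can be compared against a naive baseline that assumes the team keeps scoring at its current run rate. This is a sketch of such a baseline, not something evaluated in this notebook; `run_rate_projection` is a hypothetical helper.

```python
def run_rate_projection(current_score: float, crr: float, balls_left: int) -> float:
    """Project the final score by extending the current run rate over the remaining balls."""
    return current_score + crr * balls_left / 6


# Sample row 31540 above: score 127, crr 8.193548, 27 balls left, actual total 169
projected = run_rate_projection(127, 8.193548, 27)
print(round(projected, 1))
```

For that row the projection is about 163.9 against an actual total of 169. Averaging the absolute error of this projection over the test rows would give a reference MAE for the model to beat.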


In [None]:
# Save the fitted pipeline; the context manager flushes and closes the file
with open('pipe_01.pkl', 'wb') as f:
    pickle.dump(pipe, f)
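To use the saved pipeline, a new match state has to be passed as a one-row DataFrame with the same columns that `X_train` had. The sketch below only builds that row. `build_input_row` and `FEATURE_COLUMNS` are hypothetical names, and the prediction call is left as a comment because it needs the pickled model on disk.

```python
import pandas as pd

# Column order used for X in this notebook
FEATURE_COLUMNS = ["batting_team", "bowling_team", "city", "current_score",
                   "balls_left", "wickets_left", "crr", "last_five"]


def build_input_row(batting_team, bowling_team, city, current_score,
                    balls_left, wickets_left, last_five) -> pd.DataFrame:
    """Assemble one match snapshot in the layout the pipeline was trained on."""
    balls_bowled = 120 - balls_left
    row = {
        "batting_team": batting_team,
        "bowling_team": bowling_team,
        "city": city,
        "current_score": current_score,
        "balls_left": balls_left,
        "wickets_left": wickets_left,
        "crr": current_score * 6 / balls_bowled,  # same formula as the crr cell
        "last_five": last_five,
    }
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)


# Values from sample row 31540 shown earlier
row = build_input_row("South Africa", "Bangladesh", "Mirpur", 127, 27, 7, 31.0)
print(row)

# With pipe_01.pkl available (not loaded here):
# with open('pipe_01.pkl', 'rb') as f:
#     pipe = pickle.load(f)
# prediction = pipe.predict(row)
```

Since the `OneHotEncoder` is created with its default `handle_unknown='error'`, a team or city that was not present in the training data would raise an error at prediction time.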

In [None]:
import xgboost
xgboost.__version__