# Model Building

In this notebook, we build a predictive model that estimates the total runs a team will score in an IPL innings. <br>
This involves selecting a machine learning algorithm, training it on historical data, and evaluating its performance on held-out data. <br>
By the end of the notebook, we will have a trained model that predicts the final innings total from the current match state (teams, city, current score, balls left, wickets left, current run rate, and the `last_five` feature).

In [22]:
import pandas as pd
import pickle

In [2]:
# Open the file in a context manager so the handle is closed after loading
with open('dataset_level3.pkl', 'rb') as f:
    df = pickle.load(f)

In [3]:
df

Unnamed: 0,batting_team,bowling_team,city,current_score,balls_left,wickets_left,crr,last_five,runs_x
92210,Kolkata Knight Riders,Mumbai Indians,Kolkata,158,9,5,8.540541,47.0,175
54800,Gujarat Titans,Chennai Super Kings,Ahmedabad,70,74,9,9.130435,48.0,214
130970,Mumbai Indians,Kings XI Punjab,Visakhapatnam,116,7,1,6.159292,54.0,124
28562,Delhi Capitals,Kings XI Punjab,Dubai,141,15,6,8.057143,38.0,164
69043,Royal Challengers Bangalore,Delhi Daredevils,Bangalore,68,62,8,7.034483,37.0,154
...,...,...,...,...,...,...,...,...,...
96232,Mumbai Indians,Kings XI Punjab,Mumbai,140,17,7,8.155340,55.0,163
25926,Royal Challengers Bangalore,Chennai Super Kings,Dubai,85,37,7,6.144578,31.0,169
109512,Sunrisers Hyderabad,Mumbai Indians,Mumbai,47,86,9,8.294118,28.0,178
105241,Pune Warriors,Kings XI Punjab,Chandigarh,142,17,8,8.271845,50.0,185


In [4]:
df.sample(2)

Unnamed: 0,batting_team,bowling_team,city,current_score,balls_left,wickets_left,crr,last_five,runs_x
92962,Pune Warriors,Mumbai Indians,Mumbai,34,78,7,4.857143,31.0,129
73697,Mumbai Indians,Royal Challengers Bangalore,Johannesburg,53,60,7,5.3,30.0,149


In [5]:
X = df.drop(columns=['runs_x'])
y = df['runs_x']

To evaluate the performance of our model, we split the dataset into training and testing sets, holding out 20% of the rows for testing (`test_size=0.2`) and fixing `random_state` so the split is reproducible. <br>
The training set is used to train the model, while the testing set helps assess how well the model generalizes to unseen data.

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)
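
As a quick illustration of what `test_size=0.2` and a fixed `random_state` do, here is a minimal sketch on ten made-up rows. The names `toy_X` and `toy_y` and their values are hypothetical and are not taken from `dataset_level3.pkl`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows standing in for the real feature matrix and target
toy_X = pd.DataFrame({"current_score": range(10)})
toy_y = pd.Series(range(100, 110), name="runs_x")

# Two splits with the same random_state select the same rows
a_train, a_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=1)

print(len(a_train), len(a_test))          # 8 training rows, 2 test rows
print(a_test.index.equals(b_test.index))  # True: the split is reproducible
```

With ten rows, 20% corresponds to two test rows, and repeating the call with the same seed returns the same rows.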

In [8]:
X_train

Unnamed: 0,batting_team,bowling_team,city,current_score,balls_left,wickets_left,crr,last_five
44495,Rajasthan Royals,Lucknow Super Giants,Mumbai,148,19,6,8.792079,43.0
86929,Mumbai Indians,Deccan Chargers,Hyderabad,63,78,9,9.000000,46.0
80918,Kolkata Knight Riders,Kings XI Punjab,Kolkata,57,74,9,7.434783,36.0
110567,Kolkata Knight Riders,Sunrisers Hyderabad,Hyderabad,47,72,8,5.875000,28.0
118112,Kolkata Knight Riders,Kings XI Punjab,Kolkata,124,22,5,7.591837,41.0
...,...,...,...,...,...,...,...,...
114527,Delhi Daredevils,Kolkata Knight Riders,Delhi,105,26,6,6.702128,27.0
31502,Chennai Super Kings,Royal Challengers Bangalore,Mumbai,86,57,9,8.190476,40.0
122093,Royal Challengers Bangalore,Rajasthan Royals,Bangalore,155,18,5,9.117647,50.0
107490,Rajasthan Royals,Kolkata Knight Riders,Kolkata,123,6,5,6.473684,26.0


In [9]:
X_test

Unnamed: 0,batting_team,bowling_team,city,current_score,balls_left,wickets_left,crr,last_five
30130,Punjab Kings,Chennai Super Kings,Mumbai,51,55,5,4.707692,25.0
3235,Kings XI Punjab,Gujarat Lions,Rajkot,180,3,4,9.230769,47.0
51953,Gujarat Titans,Lucknow Super Giants,Ahmedabad,142,46,9,11.513514,51.0
62714,Delhi Capitals,Lucknow Super Giants,Delhi,105,60,8,10.500000,42.0
44622,Delhi Capitals,Punjab Kings,Navi Mumbai,139,16,5,8.019231,31.0
...,...,...,...,...,...,...,...,...
125126,Sunrisers Hyderabad,Mumbai Indians,Hyderabad,70,36,4,5.000000,20.0
104433,Rajasthan Royals,Mumbai Indians,Jaipur,46,88,10,8.625000,44.0
16336,Delhi Capitals,Sunrisers Hyderabad,Delhi,80,35,5,5.647059,28.0
19707,Kolkata Knight Riders,Rajasthan Royals,Kolkata,162,4,4,8.379310,63.0


In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_absolute_error

Since our dataset contains categorical columns with string values (`batting_team`, `bowling_team`, and `city`), we need to transform them into a numerical format for the model to process. <br>
To achieve this, we use ColumnTransformer along with OneHotEncoder. <br>
Each categorical column is expanded into 0/1 indicator columns, with `drop='first'` omitting one category per column, while `remainder='passthrough'` keeps the remaining numeric columns unchanged.

In [12]:
trf = ColumnTransformer([
    ('trf', OneHotEncoder(sparse_output=False, drop='first'),['batting_team', 'bowling_team', 'city'])
], remainder='passthrough')

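To streamline preprocessing and modeling, we use Pipeline. <br>
It applies the steps in order: the column transformer encodes the categorical features, StandardScaler standardizes all resulting columns, and XGBRegressor is fitted on the result. The same preprocessing steps are reapplied automatically at prediction time.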

In [14]:
pipe = Pipeline(steps=[
    ('step1', trf),
    ('step2', StandardScaler()),
    ('step3', XGBRegressor(n_estimators=1000, learning_rate=0.2, max_depth=12, random_state=1))
])

Training the model and generating predictions for the test set

In [15]:
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

Computing the R2 score and mean absolute error (MAE) on the test set

In [19]:
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

In [17]:
print("R2 Score: ")
print(r2)

R2 Score: 
0.979379415512085


In [20]:
print("Mean Absolute Error:")
print(mae)

Mean Absolute Error:
2.3191866874694824
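
To make the two metrics concrete, the sketch below computes them by hand on four made-up totals and checks the results against scikit-learn. The arrays `y_true` and `y_hat` are hypothetical values, not model output. MAE is the mean of the absolute errors in runs, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([175.0, 214.0, 124.0, 164.0])  # made-up final totals
y_hat = np.array([172.0, 210.0, 130.0, 165.0])   # made-up predictions

# MAE: average absolute difference, in runs
mae_manual = np.mean(np.abs(y_true - y_hat))

# R2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual)  # 3.5
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))  # True
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))              # True
```

Read this way, the MAE reported above means the predictions on the test set are off by about 2.3 runs on average.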


Exporting the model

In [21]:
# Write inside a context manager so the file is flushed and closed
with open('pipe.pkl', 'wb') as f:
    pickle.dump(pipe, f)