# Feature Extraction

In this notebook, we perform feature extraction: we turn raw ball-by-ball match data into per-delivery features for analysis and predictive modeling. <br>
The derived features are the inputs the model uses to predict the total runs in an IPL innings.

Each feature summarizes the state of the innings at a given delivery, such as the score so far, the deliveries remaining, and the wickets in hand. <br>
Keeping only these columns gives the model a compact, structured dataset to learn from.

The aim is to extract the following information:
1. Batting team
2. Bowling team
3. City
4. Current score
5. Balls left
6. Wickets left
7. Current run rate
8. Score in the last five overs
9. Total runs scored

In [1]:
import pandas as pd
import pickle

In [2]:
# Open the file in a context manager so the handle is closed after loading
with open('dataset_level2.pkl', 'rb') as f:
    df = pickle.load(f)

In [3]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue
0,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.1,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
1,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.2,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
2,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.3,4,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
3,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.4,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
4,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.5,2,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
...,...,...,...,...,...,...,...,...
135013,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.2,6,0,Bangalore,M Chinnaswamy Stadium
135014,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.3,6,0,Bangalore,M Chinnaswamy Stadium
135015,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.4,1,0,Bangalore,M Chinnaswamy Stadium
135016,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.5,1,0,Bangalore,M Chinnaswamy Stadium


# Handling missing data

We are identifying and handling missing (null) values in the dataset to ensure data consistency and accuracy. <br>
Missing values can impact the quality of the analysis and predictions, so it is essential to address them appropriately.

In [4]:
df.isnull().sum()

match_id               0
batting_team           0
bowling_team           0
ball                   0
runs                   0
player_dismissed       0
city                6343
venue                  0
dtype: int64

In [5]:
df['city'].isnull()

0         False
1         False
2         False
3         False
4         False
          ...  
135013    False
135014    False
135015    False
135016    False
135017    False
Name: city, Length: 135018, dtype: bool

The preview above shows only the first and last rows, all of which happen to be `False`, but the null count confirms that 6343 values in the `city` column are NaN (missing). <br>
Since the city is one of the features we want to extract, we fill in these missing values.

In [6]:
df[df['city'].isnull()]

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue
22095,182,Delhi Capitals,Kings XI Punjab,0.1,0,0,,Dubai International Cricket Stadium
22096,182,Delhi Capitals,Kings XI Punjab,0.2,0,0,,Dubai International Cricket Stadium
22097,182,Delhi Capitals,Kings XI Punjab,0.3,0,0,,Dubai International Cricket Stadium
22098,182,Delhi Capitals,Kings XI Punjab,0.4,4,0,,Dubai International Cricket Stadium
22099,182,Delhi Capitals,Kings XI Punjab,0.5,1,0,,Dubai International Cricket Stadium
...,...,...,...,...,...,...,...,...
115558,937,Sunrisers Hyderabad,Mumbai Indians,19.2,0,KL Rahul,,Dubai International Cricket Stadium
115559,937,Sunrisers Hyderabad,Mumbai Indians,19.3,1,0,,Dubai International Cricket Stadium
115560,937,Sunrisers Hyderabad,Mumbai Indians,19.4,1,0,,Dubai International Cricket Stadium
115561,937,Sunrisers Hyderabad,Mumbai Indians,19.5,4,0,,Dubai International Cricket Stadium


In [14]:
df[df['city'].isnull()]['venue'].value_counts()

venue
Dubai International Cricket Stadium    4106
Sharjah Cricket Stadium                2237
Name: count, dtype: int64

In [15]:
4106 + 2237

6343

Every row with a missing city belongs to one of two venues:
1. Dubai International Cricket Stadium
2. Sharjah Cricket Stadium

With only two venues involved, we can fill in the city manually from the venue name. <br>
For datasets with many such venues, a venue-to-city lookup table is one alternative that scales better.
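As a rough illustration of the lookup-table idea, the sketch below fills missing cities from a venue-to-city dictionary. The frame `toy` and the dictionary `venue_to_city` are hypothetical stand-ins, not objects from this notebook, and unlike the `.loc` assignments that follow, this version only touches rows whose city is missing.

```python
import pandas as pd

# Toy frame standing in for the notebook's df; only 'venue' and 'city' matter here.
toy = pd.DataFrame({
    "venue": [
        "Dubai International Cricket Stadium",
        "Sharjah Cricket Stadium",
        "M Chinnaswamy Stadium",
    ],
    "city": [None, None, "Bangalore"],
})

# Hypothetical lookup table: venue name -> city.
venue_to_city = {
    "Dubai International Cricket Stadium": "Dubai",
    "Sharjah Cricket Stadium": "Sharjah",
}

# Fill only the missing cities; rows that already have a city are left untouched.
toy["city"] = toy["city"].fillna(toy["venue"].map(venue_to_city))
```

Venues absent from the dictionary map to NaN, so their rows would stay missing and show up in a later `isnull().sum()` check.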

In [16]:
df.loc[df['venue'] == 'Dubai International Cricket Stadium', 'city'] = 'Dubai'

In [17]:
df.loc[df['venue'] == 'Sharjah Cricket Stadium', 'city'] = 'Sharjah'

In [18]:
df.isnull().sum()

match_id            0
batting_team        0
bowling_team        0
ball                0
runs                0
player_dismissed    0
city                0
venue               0
dtype: int64

Every null value has been resolved.

In [19]:
df['city'].value_counts()

city
Mumbai            21504
Kolkata           11380
Delhi             11059
Chennai           10554
Hyderabad          9499
Bangalore          7869
Chandigarh         7449
Jaipur             7073
Pune               6300
Dubai              5729
Abu Dhabi          4565
Ahmedabad          4450
Bengaluru          3521
Sharjah            3482
Visakhapatnam      1869
Durban             1858
Lucknow            1738
Dharamsala         1618
Centurion          1486
Rajkot             1229
Navi Mumbai        1133
Indore             1082
Johannesburg        995
Port Elizabeth      870
Cuttack             856
Ranchi              837
Cape Town           813
Raipur              742
Mohali              622
Kochi               603
Kanpur              492
East London         380
Guwahati            372
Nagpur              370
Kimberley           368
Bloemfontein        251
Name: count, dtype: int64
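
The counts above list Bangalore and Bengaluru separately, although both names refer to the same city. The notebook keeps them as separate categories, and every figure that follows reflects that choice. If a single category were preferred, a minimal sketch (on a hypothetical frame `toy`, not the notebook's `df`) could look like this:

```python
import pandas as pd

# Toy frame with both spellings of the same city.
toy = pd.DataFrame({"city": ["Bangalore", "Bengaluru", "Mumbai", "Bengaluru"]})

# Map the alternate spelling onto one canonical label; other cities are unchanged.
toy["city"] = toy["city"].replace({"Bengaluru": "Bangalore"})
```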

In [20]:
df.drop(columns=['venue'], inplace=True)

In [21]:
df.shape

(135018, 7)

In [22]:
df.columns

Index(['match_id', 'batting_team', 'bowling_team', 'ball', 'runs',
       'player_dismissed', 'city'],
      dtype='object')

We now identify cities where 600 or fewer balls have been bowled, since they do not provide enough data to train the model effectively. <br> Only cities with more than 600 recorded deliveries are retained.
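
The filtering step can be sketched on a small example. The frame `toy` and the threshold `min_balls` are illustrative assumptions (the notebook uses 600); the counts are computed once and reused.

```python
import pandas as pd

# Toy data: Mumbai has 5 deliveries, Kochi 3, Nagpur 2.
toy = pd.DataFrame({"city": ["Mumbai"] * 5 + ["Kochi"] * 3 + ["Nagpur"] * 2})
min_balls = 2  # stand-in for the notebook's threshold of 600

# Count deliveries per city once, then keep cities strictly above the threshold.
counts = toy["city"].value_counts()
eligible = counts[counts > min_balls].index

# .copy() makes the filtered result an independent DataFrame.
filtered = toy[toy["city"].isin(eligible)].copy()
```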

In [23]:
df['city'].value_counts()

city
Mumbai            21504
Kolkata           11380
Delhi             11059
Chennai           10554
Hyderabad          9499
Bangalore          7869
Chandigarh         7449
Jaipur             7073
Pune               6300
Dubai              5729
Abu Dhabi          4565
Ahmedabad          4450
Bengaluru          3521
Sharjah            3482
Visakhapatnam      1869
Durban             1858
Lucknow            1738
Dharamsala         1618
Centurion          1486
Rajkot             1229
Navi Mumbai        1133
Indore             1082
Johannesburg        995
Port Elizabeth      870
Cuttack             856
Ranchi              837
Cape Town           813
Raipur              742
Mohali              622
Kochi               603
Kanpur              492
East London         380
Guwahati            372
Nagpur              370
Kimberley           368
Bloemfontein        251
Name: count, dtype: int64

In [24]:
df['city'].value_counts()[df['city'].value_counts() > 600]

city
Mumbai            21504
Kolkata           11380
Delhi             11059
Chennai           10554
Hyderabad          9499
Bangalore          7869
Chandigarh         7449
Jaipur             7073
Pune               6300
Dubai              5729
Abu Dhabi          4565
Ahmedabad          4450
Bengaluru          3521
Sharjah            3482
Visakhapatnam      1869
Durban             1858
Lucknow            1738
Dharamsala         1618
Centurion          1486
Rajkot             1229
Navi Mumbai        1133
Indore             1082
Johannesburg        995
Port Elizabeth      870
Cuttack             856
Ranchi              837
Cape Town           813
Raipur              742
Mohali              622
Kochi               603
Name: count, dtype: int64

In [25]:
# Count deliveries per city once and keep the cities with more than 600
city_counts = df['city'].value_counts()
eligible_cities = city_counts[city_counts > 600].index.tolist()

In [27]:
df = df[df['city'].isin(eligible_cities)]

In [28]:
df['city'].value_counts()

city
Mumbai            21504
Kolkata           11380
Delhi             11059
Chennai           10554
Hyderabad          9499
Bangalore          7869
Chandigarh         7449
Jaipur             7073
Pune               6300
Dubai              5729
Abu Dhabi          4565
Ahmedabad          4450
Bengaluru          3521
Sharjah            3482
Visakhapatnam      1869
Durban             1858
Lucknow            1738
Dharamsala         1618
Centurion          1486
Rajkot             1229
Navi Mumbai        1133
Indore             1082
Johannesburg        995
Port Elizabeth      870
Cuttack             856
Ranchi              837
Cape Town           813
Raipur              742
Mohali              622
Kochi               603
Name: count, dtype: int64

In [29]:
df.shape

(132785, 7)

We determine the batting team's current score by taking the cumulative sum of `runs` within each match (grouped by `match_id`). <br>
This gives the team's total at every delivery of the innings. The `runs` column is first converted to a numeric type so that it can be summed.

In [38]:
df['runs'] = pd.to_numeric(df['runs'], errors='coerce')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['runs'] = pd.to_numeric(df['runs'], errors='coerce')
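
The warning above appears because `df` was produced by filtering another DataFrame, so pandas cannot tell whether the assignment is meant to affect the original. One common way to avoid it, sketched here on a hypothetical frame `toy`, is to take an explicit copy when filtering:

```python
import pandas as pd

toy = pd.DataFrame({"city": ["Mumbai", "Kochi", "Mumbai"], "runs": ["1", "4", "0"]})

# Boolean filtering returns a new object that pandas may still treat as derived
# from `toy`; .copy() makes the subset an independent DataFrame.
subset = toy[toy["city"] == "Mumbai"].copy()

# Assigning a column on the independent copy leaves `toy` untouched.
subset["runs"] = pd.to_numeric(subset["runs"], errors="coerce")
```

The results in this notebook are unaffected by the warning, since `df` is rebound to the filtered frame and the unfiltered one is not used again.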


In [32]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city
0,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.1,0,0,Hyderabad
1,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.2,0,0,Hyderabad
2,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.3,4,0,Hyderabad
3,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.4,0,0,Hyderabad
4,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.5,2,0,Hyderabad
...,...,...,...,...,...,...,...
135013,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.2,6,0,Bangalore
135014,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.3,6,0,Bangalore
135015,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.4,1,0,Bangalore
135016,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.5,1,0,Bangalore


In [41]:
df['runs'].dtype

dtype('int64')

In [43]:
df['current_score'] = df.groupby('match_id')['runs'].cumsum()

In [44]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,current_score
0,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.1,0,0,Hyderabad,0
1,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.2,0,0,Hyderabad,0
2,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.3,4,0,Hyderabad,4
3,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.4,0,0,Hyderabad,4
4,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.5,2,0,Hyderabad,6
...,...,...,...,...,...,...,...,...
135013,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.2,6,0,Bangalore,194
135014,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.3,6,0,Bangalore,200
135015,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.4,1,0,Bangalore,201
135016,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.5,1,0,Bangalore,202


We extract the over and the ball number from the `ball` column, which stores them in a decimal format
(e.g., 5.3 for the third ball of the sixth over). <br>
From these we compute `balls_bowled`, the number of deliveries bowled so far, as `over * 6 + ball_no`.
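
A vectorized sketch of the same conversion, using pandas string methods on a hypothetical frame `toy`. It assumes, as the notebook does, that the text after the decimal point is the ball number within the over.

```python
import pandas as pd

toy = pd.DataFrame({"ball": [0.1, 0.6, 5.3, 19.5]})

# Split "over.ball" into two string columns, then convert both to integers.
parts = toy["ball"].astype(str).str.split(".", expand=True)
toy["over"] = parts[0].astype(int)
toy["ball_no"] = parts[1].astype(int)

# Completed overs contribute 6 balls each, plus the balls in the current over.
toy["balls_bowled"] = toy["over"] * 6 + toy["ball_no"]
```

One caveat of the decimal encoding: a float cannot distinguish 0.1 from 0.10, so a tenth delivery in an over would be read as the first.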

In [45]:
df['over'] = df['ball'].apply(lambda x:str(x).split(".")[0])
df['ball_no'] = df['ball'].apply(lambda x:str(x).split(".")[1])

In [46]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,current_score,over,ball_no
0,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.1,0,0,Hyderabad,0,0,1
1,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.2,0,0,Hyderabad,0,0,2
2,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.3,4,0,Hyderabad,4,0,3
3,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.4,0,0,Hyderabad,4,0,4
4,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.5,2,0,Hyderabad,6,0,5
...,...,...,...,...,...,...,...,...,...,...
135013,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.2,6,0,Bangalore,194,19,2
135014,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.3,6,0,Bangalore,200,19,3
135015,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.4,1,0,Bangalore,201,19,4
135016,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.5,1,0,Bangalore,202,19,5


In [47]:
df['balls_bowled'] = (df['over'].astype('int')*6) + df['ball_no'].astype('int')

In [48]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,current_score,over,ball_no,balls_bowled
0,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.1,0,0,Hyderabad,0,0,1,1
1,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.2,0,0,Hyderabad,0,0,2,2
2,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.3,4,0,Hyderabad,4,0,3,3
3,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.4,0,0,Hyderabad,4,0,4,4
4,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.5,2,0,Hyderabad,6,0,5,5
...,...,...,...,...,...,...,...,...,...,...,...
135013,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.2,6,0,Bangalore,194,19,2,116
135014,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.3,6,0,Bangalore,200,19,3,117
135015,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.4,1,0,Bangalore,201,19,4,118
135016,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.5,1,0,Bangalore,202,19,5,119


We determine the number of balls left in the innings from `balls_bowled`. <br>Since a T20 innings consists of 120 balls, the remaining deliveries at any stage are `120 - balls_bowled`; negative results are clamped to 0. <br> Balls left tells the model how much of the innings remains.
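
The subtraction and the clamp at zero can also be written with `Series.clip`. The values below are made up for illustration; a count above 120 is assumed to arise when an over contains more than six recorded deliveries.

```python
import pandas as pd

balls_bowled = pd.Series([1, 60, 119, 120, 122])

# Remaining deliveries, with negative values clamped to 0.
balls_left = (120 - balls_bowled).clip(lower=0)
```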


In [49]:
df['balls_left'] = 120 - df['balls_bowled']
df['balls_left'] = df['balls_left'].apply(lambda x:0 if x<0 else x)

In [50]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,current_score,over,ball_no,balls_bowled,balls_left
0,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.1,0,0,Hyderabad,0,0,1,1,119
1,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.2,0,0,Hyderabad,0,0,2,2,118
2,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.3,4,0,Hyderabad,4,0,3,3,117
3,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.4,0,0,Hyderabad,4,0,4,4,116
4,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.5,2,0,Hyderabad,6,0,5,5,115
...,...,...,...,...,...,...,...,...,...,...,...,...
135013,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.2,6,0,Bangalore,194,19,2,116,4
135014,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.3,6,0,Bangalore,200,19,3,117,3
135015,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.4,1,0,Bangalore,201,19,4,118,2
135016,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.5,1,0,Bangalore,202,19,5,119,1


We determine the number of wickets left in the innings from the `player_dismissed` column. <br>
Each team starts with 10 wickets, so we flag every delivery on which a player was dismissed, take the running count of dismissals per match, and subtract it from 10. <br>
Wickets left captures how much batting the team still has in hand.
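
The three steps can be sketched on a small example. The frame `toy` and the column names `is_wicket` and `wickets_fallen` are hypothetical; the notebook overwrites `player_dismissed` in place instead. As in the dataset, the string `'0'` is assumed to mean that no one was dismissed on that delivery.

```python
import pandas as pd

toy = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 2, 2],
    "player_dismissed": ["0", "KL Rahul", "0", "Batter B", "0", "Batter C"],
})

# 1 if a player was dismissed on this delivery, else 0.
toy["is_wicket"] = (toy["player_dismissed"] != "0").astype(int)

# Running count of dismissals, restarted for every match.
toy["wickets_fallen"] = toy.groupby("match_id")["is_wicket"].cumsum()

toy["wickets_left"] = 10 - toy["wickets_fallen"]
```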

In [51]:
df['player_dismissed'] = df['player_dismissed'].apply(lambda x:0 if x=='0' else 1)

In [53]:
df['player_dismissed'].value_counts()

player_dismissed
0    126213
1      6572
Name: count, dtype: int64

In [54]:
df['player_dismissed'] = df['player_dismissed'].astype('int')
df['player_dismissed'] = df.groupby('match_id')['player_dismissed'].cumsum()
df['wickets_left'] = 10 - df['player_dismissed']

In [55]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,current_score,over,ball_no,balls_bowled,balls_left,wickets_left
0,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.1,0,0,Hyderabad,0,0,1,1,119,10
1,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.2,0,0,Hyderabad,0,0,2,2,118,10
2,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.3,4,0,Hyderabad,4,0,3,3,117,10
3,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.4,0,0,Hyderabad,4,0,4,4,116,10
4,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.5,2,0,Hyderabad,6,0,5,5,115,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...
135013,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.2,6,7,Bangalore,194,19,2,116,4,3
135014,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.3,6,7,Bangalore,200,19,3,117,3,3
135015,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.4,1,7,Bangalore,201,19,4,118,2,3
135016,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.5,1,7,Bangalore,202,19,5,119,1,3


In [59]:
df['wickets_left'].value_counts()

wickets_left
9     26303
10    26228
8     23790
7     19486
6     15812
5      9845
4      5792
3      3027
2      1617
1       814
0        71
Name: count, dtype: int64

We determine the Current Run Rate (CRR) at each delivery. <br>It measures the team's scoring pace as the average runs scored per over so far: `current_score * 6 / balls_bowled`. <br> This is a key metric for analyzing a team's performance in real time.
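
A short worked example of the formula, using scores and ball counts that appear in the tables of this notebook. The frame `toy` is only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "current_score": [0, 4, 6, 202],
    "balls_bowled": [1, 3, 5, 119],
})

# Runs per ball multiplied by 6 gives runs per over.
toy["crr"] = toy["current_score"] * 6 / toy["balls_bowled"]
```

For instance, 4 runs off 3 balls is a rate of 8 runs per over, and 202 runs off 119 balls is roughly 10.18.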

In [60]:
df['crr'] = (df['current_score']*6)/df['balls_bowled']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['crr'] = (df['current_score']*6)/df['balls_bowled']


In [61]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,current_score,over,ball_no,balls_bowled,balls_left,wickets_left,crr
0,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.1,0,0,Hyderabad,0,0,1,1,119,10,0.000000
1,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.2,0,0,Hyderabad,0,0,2,2,118,10,0.000000
2,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.3,4,0,Hyderabad,4,0,3,3,117,10,8.000000
3,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.4,0,0,Hyderabad,4,0,4,4,116,10,6.000000
4,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.5,2,0,Hyderabad,6,0,5,5,115,10,7.200000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135013,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.2,6,7,Bangalore,194,19,2,116,4,3,10.034483
135014,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.3,6,7,Bangalore,200,19,3,117,3,3,10.256410
135015,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.4,1,7,Bangalore,201,19,4,118,2,3,10.220339
135016,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.5,1,7,Bangalore,202,19,5,119,1,3,10.184874


We determine the runs scored in the last five overs (the previous 30 deliveries) to capture the team's recent scoring momentum. <br>
The sum is taken over a rolling window within each match, so it is undefined (NaN) until 30 deliveries have been bowled.
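
An alternative sketch of the rolling sum uses `groupby(...).transform`, which returns values aligned to the original index rather than relying on the rows of each match being stored contiguously. The frame `toy`, the column `last_n`, and the window of 3 (standing in for 30) are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 2, 2, 2],
    "runs": [1, 2, 3, 4, 10, 20, 30],
})
window = 3  # stand-in for the notebook's 30 deliveries

# Rolling sum computed separately for each match; the first window-1 rows of
# every match are NaN because the window is not yet full.
toy["last_n"] = (
    toy.groupby("match_id")["runs"]
       .transform(lambda s: s.rolling(window=window).sum())
)
```

The window never spans two matches, so the first delivery of a match does not inherit runs from the previous one.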

In [65]:
df['runs'] = pd.to_numeric(df['runs'], errors='coerce')

groups = df.groupby('match_id')
match_ids = df['match_id'].unique()

last_five = []
for match_id in match_ids:
    # Rolling 30-ball (five-over) sum within one match; the first 29 balls yield NaN
    rolling_sum = groups.get_group(match_id)['runs'].rolling(window=30).sum()
    last_five.extend(rolling_sum.tolist())

In [69]:
last_five[25:35]

[nan, nan, nan, nan, 38.0, 42.0, 42.0, 42.0, 46.0, 44.0]

In [70]:
df['last_five'] = last_five

In [71]:
df

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,current_score,over,ball_no,balls_bowled,balls_left,wickets_left,crr,last_five
0,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.1,0,0,Hyderabad,0,0,1,1,119,10,0.000000,
1,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.2,0,0,Hyderabad,0,0,2,2,118,10,0.000000,
2,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.3,4,0,Hyderabad,4,0,3,3,117,10,8.000000,
3,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.4,0,0,Hyderabad,4,0,4,4,116,10,6.000000,
4,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.5,2,0,Hyderabad,6,0,5,5,115,10,7.200000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135013,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.2,6,7,Bangalore,194,19,2,116,4,3,10.034483,54.0
135014,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.3,6,7,Bangalore,200,19,3,117,3,3,10.256410,56.0
135015,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.4,1,7,Bangalore,201,19,4,118,2,3,10.220339,56.0
135016,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.5,1,7,Bangalore,202,19,5,119,1,3,10.184874,57.0


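We compute the total runs scored in each match and merge it back onto every delivery. <br>
After the merge, `runs_x` holds the match total (the value we want to predict) and `runs_y` holds the runs scored on the individual delivery.

In [75]:
# Sum only the 'runs' column per match, then attach the total to every delivery
final_df = df.groupby('match_id')['runs'].sum().reset_index().merge(df, on='match_id')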

In [76]:
final_df

Unnamed: 0,match_id,runs_x,batting_team,bowling_team,ball,runs_y,player_dismissed,city,current_score,over,ball_no,balls_bowled,balls_left,wickets_left,crr,last_five
0,2,207,Sunrisers Hyderabad,Royal Challengers Bangalore,0.1,0,0,Hyderabad,0,0,1,1,119,10,0.000000,
1,2,207,Sunrisers Hyderabad,Royal Challengers Bangalore,0.2,0,0,Hyderabad,0,0,2,2,118,10,0.000000,
2,2,207,Sunrisers Hyderabad,Royal Challengers Bangalore,0.3,4,0,Hyderabad,4,0,3,3,117,10,8.000000,
3,2,207,Sunrisers Hyderabad,Royal Challengers Bangalore,0.4,0,0,Hyderabad,4,0,4,4,116,10,6.000000,
4,2,207,Sunrisers Hyderabad,Royal Challengers Bangalore,0.5,2,0,Hyderabad,6,0,5,5,115,10,7.200000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
132780,1096,208,Sunrisers Hyderabad,Royal Challengers Bangalore,19.2,6,7,Bangalore,194,19,2,116,4,3,10.034483,54.0
132781,1096,208,Sunrisers Hyderabad,Royal Challengers Bangalore,19.3,6,7,Bangalore,200,19,3,117,3,3,10.256410,56.0
132782,1096,208,Sunrisers Hyderabad,Royal Challengers Bangalore,19.4,1,7,Bangalore,201,19,4,118,2,3,10.220339,56.0
132783,1096,208,Sunrisers Hyderabad,Royal Challengers Bangalore,19.5,1,7,Bangalore,202,19,5,119,1,3,10.184874,57.0
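
The merge above attaches the match total to every row. An equivalent result can be sketched without a merge using `transform('sum')`, which broadcasts each group's total back to its rows. The frame `toy` and the column name `total_runs` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"match_id": [1, 1, 2, 2, 2], "runs": [4, 6, 1, 0, 2]})

# Each row receives the sum of 'runs' over its own match.
toy["total_runs"] = toy.groupby("match_id")["runs"].transform("sum")
```

This keeps the original column names, so there is no `runs_x`/`runs_y` pair to untangle afterwards.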


Next, we keep only the feature columns and the target, `runs_x`, which holds the total runs scored in the innings.

In [78]:
final_df = final_df[['batting_team', 'bowling_team', 'city', 'current_score', 'balls_left', 'wickets_left', 'crr', 'last_five', 'runs_x']]

In [79]:
final_df

Unnamed: 0,batting_team,bowling_team,city,current_score,balls_left,wickets_left,crr,last_five,runs_x
0,Sunrisers Hyderabad,Royal Challengers Bangalore,Hyderabad,0,119,10,0.000000,,207
1,Sunrisers Hyderabad,Royal Challengers Bangalore,Hyderabad,0,118,10,0.000000,,207
2,Sunrisers Hyderabad,Royal Challengers Bangalore,Hyderabad,4,117,10,8.000000,,207
3,Sunrisers Hyderabad,Royal Challengers Bangalore,Hyderabad,4,116,10,6.000000,,207
4,Sunrisers Hyderabad,Royal Challengers Bangalore,Hyderabad,6,115,10,7.200000,,207
...,...,...,...,...,...,...,...,...,...
132780,Sunrisers Hyderabad,Royal Challengers Bangalore,Bangalore,194,4,3,10.034483,54.0,208
132781,Sunrisers Hyderabad,Royal Challengers Bangalore,Bangalore,200,3,3,10.256410,56.0,208
132782,Sunrisers Hyderabad,Royal Challengers Bangalore,Bangalore,201,2,3,10.220339,56.0,208
132783,Sunrisers Hyderabad,Royal Challengers Bangalore,Bangalore,202,1,3,10.184874,57.0,208


We perform a final check for null (missing) values before proceeding to model training. <br>
The only nulls are in `last_five`, which is undefined for the first 29 deliveries of each match; we drop those rows so that every remaining row has a complete set of features.

In [80]:
final_df.isnull().sum()

batting_team         0
bowling_team         0
city                 0
current_score        0
balls_left           0
wickets_left         0
crr                  0
last_five        31233
runs_x               0
dtype: int64

In [81]:
# Rebind instead of using inplace=True on a sliced frame, which triggers SettingWithCopyWarning
final_df = final_df.dropna()

In [82]:
final_df.isnull().sum()

batting_team     0
bowling_team     0
city             0
current_score    0
balls_left       0
wickets_left     0
crr              0
last_five        0
runs_x           0
dtype: int64

In [83]:
final_df.shape

(101552, 9)

We are shuffling the dataset to ensure that the model does not learn any unintended patterns due to the original order of the data. <br>
Randomizing the data helps in improving generalization and prevents the model from being biased toward specific sequences.
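
If the shuffle needs to be repeatable between runs, `sample` accepts a seed. The sketch below uses a hypothetical frame `toy` and an arbitrary seed of 42; `frac=1` returns all rows in random order, which is equivalent to passing the row count.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})

# Same seed, same permutation: the shuffle can be reproduced later.
shuffled = toy.sample(frac=1, random_state=42)
shuffled_again = toy.sample(frac=1, random_state=42)
```

Appending `.reset_index(drop=True)` would replace the shuffled index with 0..n-1 if the original row labels are not needed.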

In [85]:
final_df = final_df.sample(final_df.shape[0])

In [86]:
final_df

Unnamed: 0,batting_team,bowling_team,city,current_score,balls_left,wickets_left,crr,last_five,runs_x
99674,Delhi Daredevils,Chennai Super Kings,Chennai,78,30,5,5.200000,26.0,114
52863,Sunrisers Hyderabad,Lucknow Super Giants,Hyderabad,173,5,4,9.026087,44.0,182
102778,Rajasthan Royals,Kolkata Knight Riders,Jaipur,104,21,5,6.303030,35.0,144
19135,Royal Challengers Bangalore,Chennai Super Kings,Bengaluru,58,80,9,8.700000,51.0,161
7717,Rajasthan Royals,Delhi Daredevils,Jaipur,99,49,7,8.366197,42.0,153
...,...,...,...,...,...,...,...,...,...
19944,Sunrisers Hyderabad,Rajasthan Royals,Jaipur,137,11,4,7.541284,29.0,160
7510,Rajasthan Royals,Sunrisers Hyderabad,Hyderabad,117,12,3,6.500000,25.0,125
79619,Delhi Daredevils,Royal Challengers Bangalore,Bangalore,172,4,6,8.896552,44.0,183
101798,Chennai Super Kings,Kolkata Knight Riders,Chennai,161,16,8,9.288462,55.0,190


We are saving the final cleaned and shuffled dataset as a pickle file to preserve progress before proceeding to model training. <br>
This ensures that we can reload the dataset quickly without repeating the preprocessing steps.
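
A minimal sketch of a save-and-reload round trip with an explicit equality check, written against a temporary file so it does not touch the notebook's data. The frame `toy` and the file name `toy_dataset.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd

toy = pd.DataFrame({"city": ["Mumbai", "Dubai"], "runs_x": [207, 208]})

# Write to a throwaway directory rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.pkl")

# Context managers close the file handles even if an error occurs.
with open(path, "wb") as f:
    pickle.dump(toy, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# True when values, dtypes, and index all match the frame that was saved.
round_trip_ok = restored.equals(toy)
```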

In [87]:
# Write inside a context manager so the file is flushed and closed
with open('dataset_level3.pkl', 'wb') as f:
    pickle.dump(final_df, f)

In [90]:
# Reload the saved file to confirm it was written correctly
with open('dataset_level3.pkl', 'rb') as f:
    output_chk = pickle.load(f)

In [91]:
output_chk

Unnamed: 0,batting_team,bowling_team,city,current_score,balls_left,wickets_left,crr,last_five,runs_x
99674,Delhi Daredevils,Chennai Super Kings,Chennai,78,30,5,5.200000,26.0,114
52863,Sunrisers Hyderabad,Lucknow Super Giants,Hyderabad,173,5,4,9.026087,44.0,182
102778,Rajasthan Royals,Kolkata Knight Riders,Jaipur,104,21,5,6.303030,35.0,144
19135,Royal Challengers Bangalore,Chennai Super Kings,Bengaluru,58,80,9,8.700000,51.0,161
7717,Rajasthan Royals,Delhi Daredevils,Jaipur,99,49,7,8.366197,42.0,153
...,...,...,...,...,...,...,...,...,...
19944,Sunrisers Hyderabad,Rajasthan Royals,Jaipur,137,11,4,7.541284,29.0,160
7510,Rajasthan Royals,Sunrisers Hyderabad,Hyderabad,117,12,3,6.500000,25.0,125
79619,Delhi Daredevils,Royal Challengers Bangalore,Bangalore,172,4,6,8.896552,44.0,183
101798,Chennai Super Kings,Kolkata Knight Riders,Chennai,161,16,8,9.288462,55.0,190
