# Data Extraction

This notebook focuses on retrieving and processing raw data for further analysis. <br>
This step is crucial in ensuring that the data is well-structured, clean and ready for modeling.

**Key Objectives:**
1. Load and convert data into readable format
2. Handle missing and inconsistent data
3. Transform and structure data
4. Extract ball-by-ball data for efficient analysis

In [1]:
import pandas as pd
import numpy as np
from yaml import safe_load
import os
from tqdm import tqdm

The match data is stored in YAML format. Each file is parsed and converted into a structured pandas DataFrame for further processing. <br>
YAML is a human-readable format commonly used for configuration files and structured data storage, so it has to be parsed into Python objects before it can be worked with.
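As a minimal sketch of what the conversion does, the block below flattens a small nested dictionary with `pd.json_normalize`. The `match` dictionary is a hypothetical, heavily trimmed stand-in for one parsed file, not real data. Nested keys become dotted column names, while list values stay in a single cell.

```python
import pandas as pd

# Hypothetical, trimmed stand-in for the result of safe_load() on one match file.
match = {
    "meta": {"data_version": 0.91},
    "info": {
        "city": "Hyderabad",
        "teams": ["Sunrisers Hyderabad", "Royal Challengers Bangalore"],
        "toss": {"decision": "field", "winner": "Royal Challengers Bangalore"},
    },
    "innings": [{"1st innings": {"team": "Sunrisers Hyderabad", "deliveries": []}}],
}

# Nested dicts are flattened into dotted names such as 'info.toss.winner';
# lists ('info.teams', 'innings') are kept whole inside one cell.
flat = pd.json_normalize(match)
print(sorted(flat.columns))
```

Because every nested key gets its own column, per-player keys such as `info.registry.people.<name>` presumably account for much of the very wide frame seen later in this notebook.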

In [2]:
filenames = []
for file in os.listdir('ipl-data'):
    filenames.append(os.path.join('ipl-data', file))

In [3]:
filenames[:5]

['ipl-data\\1082591.yaml',
 'ipl-data\\1082592.yaml',
 'ipl-data\\1082593.yaml',
 'ipl-data\\1082594.yaml',
 'ipl-data\\1082595.yaml']

The next cell can take a while to run, so tqdm is used to display a progress bar. <br>
The bar gives real-time feedback on how many files have been processed and how long the run is taking.

In [4]:
final_df = pd.DataFrame()
counter = 1
for file in tqdm(filenames):
    with open(file, 'r') as f:
        df = pd.json_normalize(safe_load(f))
        df['match_id'] = counter
        final_df = pd.concat([final_df, df], ignore_index=True)
        counter+=1

100%|█████████████████████████████████████████████████████████████████████████████▉| 1095/1096 [02:24<00:00,  7.55it/s]


NotImplementedError: 
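The run above stops on the last of the 1096 directory entries. Because `final_df` is grown inside the loop, the 1095 rows parsed before the error are still available afterwards. In recent pandas versions, `pd.json_normalize` raises a bare `NotImplementedError` when its input is neither a mapping nor a list, so one plausible cause (an assumption, since the traceback is cut off) is an entry that does not parse to a mapping. The sketch below shows one way to guard against that. The function name `load_matches` and the decision to skip such files are illustrative choices, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd
from yaml import safe_load


def load_matches(folder):
    """Parse every *.yaml file in `folder`; return (matches, skipped)."""
    frames, skipped = [], []
    # Sorting makes the match_id numbering reproducible across machines.
    for path in sorted(Path(folder).glob("*.yaml")):
        with open(path, "r", encoding="utf-8") as f:
            data = safe_load(f)
        if not isinstance(data, dict):
            # Not a mapping (empty file, plain text, ...): record it and move on.
            skipped.append(path.name)
            continue
        row = pd.json_normalize(data)
        row["match_id"] = len(frames) + 1
        frames.append(row)
    # A single concat at the end avoids re-copying the growing frame per file.
    matches = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return matches, skipped
```

A call such as `matches, skipped = load_matches('ipl-data')` would then return the parsed rows together with the names of any files that were left out, which can be inspected instead of ending the run with an exception.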

In [5]:
final_df

Unnamed: 0,innings,meta.data_version,meta.created,meta.revision,info.balls_per_over,info.city,info.competition,info.dates,info.gender,info.match_type,...,info.registry.people.B Aparajith,info.registry.people.GS Sandhu,info.players.Rising Pune Supergiants,info.registry.people.P Sahu,info.registry.people.KJ Abbott,info.registry.people.PSP Handscomb,info.registry.people.SM Boland,info.registry.people.UT Khawaja,info.registry.people.F Behardien,info.registry.people.ER Dwivedi
0,[{'1st innings': {'team': 'Sunrisers Hyderabad...,0.91,2017-04-06,1,6,Hyderabad,IPL,[2017-04-05],male,T20,...,,,,,,,,,,
1,"[{'1st innings': {'team': 'Mumbai Indians', 'd...",0.91,2017-04-07,1,6,Pune,IPL,[2017-04-06],male,T20,...,,,,,,,,,,
2,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",0.91,2017-04-07,2,6,Rajkot,IPL,[2017-04-07],male,T20,...,,,,,,,,,,
3,[{'1st innings': {'team': 'Rising Pune Supergi...,0.91,2017-04-08,1,6,Indore,IPL,[2017-04-08],male,T20,...,,,,,,,,,,
4,[{'1st innings': {'team': 'Royal Challengers B...,0.91,2017-04-08,2,6,Bengaluru,IPL,[2017-04-08],male,T20,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1090,"[{'1st innings': {'team': 'Delhi Daredevils', ...",0.91,2016-05-23,1,6,Raipur,IPL,[2016-05-22],male,T20,...,,,,,,,,,,
1091,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",0.91,2016-05-24,1,6,Bangalore,IPL,[2016-05-24],male,T20,...,,,,,,,,,,b274dbbd
1092,[{'1st innings': {'team': 'Sunrisers Hyderabad...,0.91,2016-05-25,1,6,Delhi,IPL,[2016-05-25],male,T20,...,,,,,,,,,,
1093,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",0.91,2016-05-28,1,6,Delhi,IPL,[2016-05-27],male,T20,...,,,,,,,,,,b274dbbd


Keeping an in-memory copy of the extracted data so that the original can be restored if later modifications go wrong.

In [7]:
backup = final_df.copy()

In [8]:
backup

Unnamed: 0,innings,meta.data_version,meta.created,meta.revision,info.balls_per_over,info.city,info.competition,info.dates,info.gender,info.match_type,...,info.registry.people.B Aparajith,info.registry.people.GS Sandhu,info.players.Rising Pune Supergiants,info.registry.people.P Sahu,info.registry.people.KJ Abbott,info.registry.people.PSP Handscomb,info.registry.people.SM Boland,info.registry.people.UT Khawaja,info.registry.people.F Behardien,info.registry.people.ER Dwivedi
0,[{'1st innings': {'team': 'Sunrisers Hyderabad...,0.91,2017-04-06,1,6,Hyderabad,IPL,[2017-04-05],male,T20,...,,,,,,,,,,
1,"[{'1st innings': {'team': 'Mumbai Indians', 'd...",0.91,2017-04-07,1,6,Pune,IPL,[2017-04-06],male,T20,...,,,,,,,,,,
2,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",0.91,2017-04-07,2,6,Rajkot,IPL,[2017-04-07],male,T20,...,,,,,,,,,,
3,[{'1st innings': {'team': 'Rising Pune Supergi...,0.91,2017-04-08,1,6,Indore,IPL,[2017-04-08],male,T20,...,,,,,,,,,,
4,[{'1st innings': {'team': 'Royal Challengers B...,0.91,2017-04-08,2,6,Bengaluru,IPL,[2017-04-08],male,T20,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1090,"[{'1st innings': {'team': 'Delhi Daredevils', ...",0.91,2016-05-23,1,6,Raipur,IPL,[2016-05-22],male,T20,...,,,,,,,,,,
1091,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",0.91,2016-05-24,1,6,Bangalore,IPL,[2016-05-24],male,T20,...,,,,,,,,,,b274dbbd
1092,[{'1st innings': {'team': 'Sunrisers Hyderabad...,0.91,2016-05-25,1,6,Delhi,IPL,[2016-05-25],male,T20,...,,,,,,,,,,
1093,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",0.91,2016-05-28,1,6,Delhi,IPL,[2016-05-27],male,T20,...,,,,,,,,,,b274dbbd


# Data Cleaning

After extracting the data, the next step is to clean it to ensure consistency, accuracy, and usability for further analysis.

Unnecessary columns are removed and only the essential ones are retained. <br>
This step reduces memory usage and eliminates irrelevant information.

In [9]:
final_df.shape

(1095, 927)

In [16]:
final_df.columns[:14]

Index(['innings', 'meta.data_version', 'meta.created', 'meta.revision',
       'info.balls_per_over', 'info.city', 'info.competition', 'info.dates',
       'info.gender', 'info.match_type', 'info.outcome.by.runs',
       'info.outcome.winner', 'info.overs', 'info.player_of_match'],
      dtype='object')

In [22]:
final_df.columns[45:51]

Index(['info.toss.decision', 'info.toss.winner', 'info.umpires', 'info.venue',
       'match_id', 'info.outcome.by.wickets'],
      dtype='object')

In [23]:
if 'info.teams' in final_df.columns:
    print("true")

true


In [39]:
final_df = final_df[[
    'innings', 'info.city', 'info.teams', 'info.competition', 'info.dates',
    'info.gender', 'info.match_type', 'info.overs', 'info.outcome.winner',
    'info.player_of_match', 'info.toss.decision', 'info.toss.winner',
    'info.umpires', 'info.venue', 'match_id',
]]

In [40]:
final_df

Unnamed: 0,innings,info.city,info.teams,info.competition,info.dates,info.gender,info.match_type,info.overs,info.outcome.winner,info.player_of_match,info.toss.decision,info.toss.winner,info.umpires,info.venue,match_id
0,[{'1st innings': {'team': 'Sunrisers Hyderabad...,Hyderabad,"[Sunrisers Hyderabad, Royal Challengers Bangal...",IPL,[2017-04-05],male,T20,20,Sunrisers Hyderabad,[Yuvraj Singh],field,Royal Challengers Bangalore,"[AY Dandekar, NJ Llong]","Rajiv Gandhi International Stadium, Uppal",1
1,"[{'1st innings': {'team': 'Mumbai Indians', 'd...",Pune,"[Rising Pune Supergiant, Mumbai Indians]",IPL,[2017-04-06],male,T20,20,Rising Pune Supergiant,[SPD Smith],field,Rising Pune Supergiant,"[A Nand Kishore, S Ravi]",Maharashtra Cricket Association Stadium,2
2,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",Rajkot,"[Gujarat Lions, Kolkata Knight Riders]",IPL,[2017-04-07],male,T20,20,Kolkata Knight Riders,[CA Lynn],field,Kolkata Knight Riders,"[Nitin Menon, CK Nandan]",Saurashtra Cricket Association Stadium,3
3,[{'1st innings': {'team': 'Rising Pune Supergi...,Indore,"[Kings XI Punjab, Rising Pune Supergiant]",IPL,[2017-04-08],male,T20,20,Kings XI Punjab,[GJ Maxwell],field,Kings XI Punjab,"[AK Chaudhary, C Shamshuddin]",Holkar Cricket Stadium,4
4,[{'1st innings': {'team': 'Royal Challengers B...,Bengaluru,"[Royal Challengers Bangalore, Delhi Daredevils]",IPL,[2017-04-08],male,T20,20,Royal Challengers Bangalore,[KM Jadhav],bat,Royal Challengers Bangalore,"[S Ravi, VK Sharma]",M.Chinnaswamy Stadium,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1090,"[{'1st innings': {'team': 'Delhi Daredevils', ...",Raipur,"[Delhi Daredevils, Royal Challengers Bangalore]",IPL,[2016-05-22],male,T20,20,Royal Challengers Bangalore,[V Kohli],field,Royal Challengers Bangalore,"[A Nand Kishore, BNJ Oxenford]",Shaheed Veer Narayan Singh International Stadium,1091
1091,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",Bangalore,"[Gujarat Lions, Royal Challengers Bangalore]",IPL,[2016-05-24],male,T20,20,Royal Challengers Bangalore,[AB de Villiers],field,Royal Challengers Bangalore,"[AK Chaudhary, HDPK Dharmasena]",M Chinnaswamy Stadium,1092
1092,[{'1st innings': {'team': 'Sunrisers Hyderabad...,Delhi,"[Sunrisers Hyderabad, Kolkata Knight Riders]",IPL,[2016-05-25],male,T20,20,Sunrisers Hyderabad,[MC Henriques],field,Kolkata Knight Riders,"[M Erasmus, C Shamshuddin]",Feroz Shah Kotla,1093
1093,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",Delhi,"[Gujarat Lions, Sunrisers Hyderabad]",IPL,[2016-05-27],male,T20,20,Sunrisers Hyderabad,[DA Warner],field,Sunrisers Hyderabad,"[M Erasmus, CK Nandan]",Feroz Shah Kotla,1094


Checking whether the gender column varies at all across the dataset, since a column with a single value adds nothing to the analysis.

In [41]:
final_df.shape

(1095, 15)

In [43]:
final_df['info.gender']

0       male
1       male
2       male
3       male
4       male
        ... 
1090    male
1091    male
1092    male
1093    male
1094    male
Name: info.gender, Length: 1095, dtype: object

In [44]:
if (final_df['info.gender'] == 'female').any():
    print("yes")
else:
    print("no")

no


Since the data only contains matches played by male players, the gender column provides no variability or meaningful insight. <br>
Hence, it is dropped.

In [45]:
# Reassign rather than drop in place: final_df was created by selecting columns
# from the wider frame, and in-place edits on it raise SettingWithCopyWarning.
final_df = final_df.drop(columns=['info.gender'])


In [46]:
final_df

Unnamed: 0,innings,info.city,info.teams,info.competition,info.dates,info.match_type,info.overs,info.outcome.winner,info.player_of_match,info.toss.decision,info.toss.winner,info.umpires,info.venue,match_id
0,[{'1st innings': {'team': 'Sunrisers Hyderabad...,Hyderabad,"[Sunrisers Hyderabad, Royal Challengers Bangal...",IPL,[2017-04-05],T20,20,Sunrisers Hyderabad,[Yuvraj Singh],field,Royal Challengers Bangalore,"[AY Dandekar, NJ Llong]","Rajiv Gandhi International Stadium, Uppal",1
1,"[{'1st innings': {'team': 'Mumbai Indians', 'd...",Pune,"[Rising Pune Supergiant, Mumbai Indians]",IPL,[2017-04-06],T20,20,Rising Pune Supergiant,[SPD Smith],field,Rising Pune Supergiant,"[A Nand Kishore, S Ravi]",Maharashtra Cricket Association Stadium,2
2,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",Rajkot,"[Gujarat Lions, Kolkata Knight Riders]",IPL,[2017-04-07],T20,20,Kolkata Knight Riders,[CA Lynn],field,Kolkata Knight Riders,"[Nitin Menon, CK Nandan]",Saurashtra Cricket Association Stadium,3
3,[{'1st innings': {'team': 'Rising Pune Supergi...,Indore,"[Kings XI Punjab, Rising Pune Supergiant]",IPL,[2017-04-08],T20,20,Kings XI Punjab,[GJ Maxwell],field,Kings XI Punjab,"[AK Chaudhary, C Shamshuddin]",Holkar Cricket Stadium,4
4,[{'1st innings': {'team': 'Royal Challengers B...,Bengaluru,"[Royal Challengers Bangalore, Delhi Daredevils]",IPL,[2017-04-08],T20,20,Royal Challengers Bangalore,[KM Jadhav],bat,Royal Challengers Bangalore,"[S Ravi, VK Sharma]",M.Chinnaswamy Stadium,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1090,"[{'1st innings': {'team': 'Delhi Daredevils', ...",Raipur,"[Delhi Daredevils, Royal Challengers Bangalore]",IPL,[2016-05-22],T20,20,Royal Challengers Bangalore,[V Kohli],field,Royal Challengers Bangalore,"[A Nand Kishore, BNJ Oxenford]",Shaheed Veer Narayan Singh International Stadium,1091
1091,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",Bangalore,"[Gujarat Lions, Royal Challengers Bangalore]",IPL,[2016-05-24],T20,20,Royal Challengers Bangalore,[AB de Villiers],field,Royal Challengers Bangalore,"[AK Chaudhary, HDPK Dharmasena]",M Chinnaswamy Stadium,1092
1092,[{'1st innings': {'team': 'Sunrisers Hyderabad...,Delhi,"[Sunrisers Hyderabad, Kolkata Knight Riders]",IPL,[2016-05-25],T20,20,Sunrisers Hyderabad,[MC Henriques],field,Kolkata Knight Riders,"[M Erasmus, C Shamshuddin]",Feroz Shah Kotla,1093
1093,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",Delhi,"[Gujarat Lions, Sunrisers Hyderabad]",IPL,[2016-05-27],T20,20,Sunrisers Hyderabad,[DA Warner],field,Sunrisers Hyderabad,"[M Erasmus, CK Nandan]",Feroz Shah Kotla,1094


Checking the match type and the number of overs played to ensure that the dataset contains only T20 IPL matches.

In [47]:
final_df.shape

(1095, 14)

In [48]:
final_df['info.match_type'].value_counts()

info.match_type
T20    1095
Name: count, dtype: int64

In [49]:
final_df['info.overs'].value_counts()

info.overs
20    1095
Name: count, dtype: int64

Since every match in the dataset is a 20-over T20 match, these two columns are redundant and are removed.

In [50]:
# Reassign rather than drop in place to avoid SettingWithCopyWarning on the sliced frame.
final_df = final_df.drop(columns=['info.overs', 'info.match_type'])


In [51]:
final_df.shape

(1095, 12)
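The gender, match type and overs checks above all ask the same question: does the column hold a single value? A minimal sketch of a general check follows. The helper name `constant_columns` and the `demo` frame are hypothetical, and values are compared by their string form so that list-valued cells (such as `info.dates`) do not break the comparison.

```python
import pandas as pd


def constant_columns(df):
    """Return the names of columns that hold exactly one distinct value."""
    # astype(str) makes unhashable cells (lists) comparable; NaN becomes 'nan'.
    return [col for col in df.columns
            if df[col].astype(str).nunique(dropna=False) == 1]


# Hypothetical miniature of the match-level frame.
demo = pd.DataFrame({
    "info.gender": ["male", "male", "male"],
    "info.overs": [20, 20, 20],
    "info.city": ["Pune", "Delhi", "Pune"],
    "info.dates": [["2017-04-06"], ["2016-05-25"], ["2016-05-27"]],
})

print(constant_columns(demo))  # ['info.gender', 'info.overs']
```

A constant column is only a candidate for removal. Whether it is safe to drop still depends on whether later steps rely on it.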

To preserve the current state of the cleaned dataset, it is saved as a pickle file. <br>
This acts as a checkpoint, allowing quick reloading without repeating the extraction and cleaning steps.

In [52]:
import pickle

# The with-block closes the file handle once the dump has finished
with open('dataset_level1.pkl', 'wb') as f:
    pickle.dump(final_df, f)

In [54]:
with open('dataset_level1.pkl', 'rb') as f:
    matches = pickle.load(f)

In [55]:
matches

Unnamed: 0,innings,info.city,info.teams,info.competition,info.dates,info.outcome.winner,info.player_of_match,info.toss.decision,info.toss.winner,info.umpires,info.venue,match_id
0,[{'1st innings': {'team': 'Sunrisers Hyderabad...,Hyderabad,"[Sunrisers Hyderabad, Royal Challengers Bangal...",IPL,[2017-04-05],Sunrisers Hyderabad,[Yuvraj Singh],field,Royal Challengers Bangalore,"[AY Dandekar, NJ Llong]","Rajiv Gandhi International Stadium, Uppal",1
1,"[{'1st innings': {'team': 'Mumbai Indians', 'd...",Pune,"[Rising Pune Supergiant, Mumbai Indians]",IPL,[2017-04-06],Rising Pune Supergiant,[SPD Smith],field,Rising Pune Supergiant,"[A Nand Kishore, S Ravi]",Maharashtra Cricket Association Stadium,2
2,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",Rajkot,"[Gujarat Lions, Kolkata Knight Riders]",IPL,[2017-04-07],Kolkata Knight Riders,[CA Lynn],field,Kolkata Knight Riders,"[Nitin Menon, CK Nandan]",Saurashtra Cricket Association Stadium,3
3,[{'1st innings': {'team': 'Rising Pune Supergi...,Indore,"[Kings XI Punjab, Rising Pune Supergiant]",IPL,[2017-04-08],Kings XI Punjab,[GJ Maxwell],field,Kings XI Punjab,"[AK Chaudhary, C Shamshuddin]",Holkar Cricket Stadium,4
4,[{'1st innings': {'team': 'Royal Challengers B...,Bengaluru,"[Royal Challengers Bangalore, Delhi Daredevils]",IPL,[2017-04-08],Royal Challengers Bangalore,[KM Jadhav],bat,Royal Challengers Bangalore,"[S Ravi, VK Sharma]",M.Chinnaswamy Stadium,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1090,"[{'1st innings': {'team': 'Delhi Daredevils', ...",Raipur,"[Delhi Daredevils, Royal Challengers Bangalore]",IPL,[2016-05-22],Royal Challengers Bangalore,[V Kohli],field,Royal Challengers Bangalore,"[A Nand Kishore, BNJ Oxenford]",Shaheed Veer Narayan Singh International Stadium,1091
1091,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",Bangalore,"[Gujarat Lions, Royal Challengers Bangalore]",IPL,[2016-05-24],Royal Challengers Bangalore,[AB de Villiers],field,Royal Challengers Bangalore,"[AK Chaudhary, HDPK Dharmasena]",M Chinnaswamy Stadium,1092
1092,[{'1st innings': {'team': 'Sunrisers Hyderabad...,Delhi,"[Sunrisers Hyderabad, Kolkata Knight Riders]",IPL,[2016-05-25],Sunrisers Hyderabad,[MC Henriques],field,Kolkata Knight Riders,"[M Erasmus, C Shamshuddin]",Feroz Shah Kotla,1093
1093,"[{'1st innings': {'team': 'Gujarat Lions', 'de...",Delhi,"[Gujarat Lions, Sunrisers Hyderabad]",IPL,[2016-05-27],Sunrisers Hyderabad,[DA Warner],field,Sunrisers Hyderabad,"[M Erasmus, CK Nandan]",Feroz Shah Kotla,1094
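One reason a pickle checkpoint suits this frame is that cells such as `innings` hold nested Python lists and dictionaries, which a pickle round trip keeps as objects. The sketch below illustrates this with pandas' own `to_pickle` and `read_pickle`. The `demo` frame and the temporary file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the match-level frame with a nested 'innings' cell.
demo = pd.DataFrame({
    "match_id": [1],
    "innings": [[{"1st innings": {"team": "Team A", "deliveries": []}}]],
})

path = os.path.join(tempfile.mkdtemp(), "checkpoint_demo.pkl")
demo.to_pickle(path)            # write the checkpoint
restored = pd.read_pickle(path)  # read it back

# The nested structure survives as real Python objects, not as text.
print(type(restored.loc[0, "innings"]))
print(restored.loc[0, "innings"] == demo.loc[0, "innings"])
```

Pickle files should only be loaded from sources that are trusted, since unpickling can execute arbitrary code.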


# Extract Ball-by-Ball Data

To study the data at a finer grain, ball-by-ball records are extracted from the first innings of every match. <br>
Each record captures the batsman, the bowler, the total runs on the delivery (extras included) and any dismissal. <br>
Working at the level of individual deliveries makes the data more detailed and easier to process further.

In [69]:
import pandas as pd

# NOTE: count starts at 1 and is incremented before it is first used, so the
# match_id values written here run from 2 to 1096, one higher than the
# match_id column already stored in `matches`.
count = 1
dataframes = []

for index, row in matches.iterrows():

    count += 1
    # Per-match buffers, one entry per delivery
    match_id = []
    ball_of_match = []
    batsman = []
    bowler = []
    runs = []
    player_of_dismissed = []
    teams = []
    batting_team = []
    city = []
    venue = []

    # Only the first innings of each match is read
    for ball in row['innings'][0]['1st innings']['deliveries']:
        # Each delivery is a one-key dict: {ball number: delivery details}
        for key in ball.keys():
            match_id.append(count)
            batting_team.append(row['innings'][0]['1st innings']['team'])
            teams.append(row['info.teams'])
            ball_of_match.append(key)
            batsman.append(ball[key]['batsman'])
            bowler.append(ball[key]['bowler'])
            runs.append(ball[key]['runs']['total'])
            city.append(row['info.city'])
            venue.append(row['info.venue'])

            # The string '0' marks deliveries on which no wicket fell
            player_of_dismissed.append(ball[key].get('wicket', {}).get('player_out', '0'))

    loop_df = pd.DataFrame({
        'match_id': match_id,
        'team': teams,
        'batting_team': batting_team,
        'ball': ball_of_match,
        'batsman': batsman,
        'bowler': bowler,
        'runs': runs,
        'player_dismissed': player_of_dismissed,
        'city': city,
        'venue': venue
    })

    dataframes.append(loop_df)

# Concatenate once at the end instead of growing a frame inside the loop
delivery_df = pd.concat(dataframes, ignore_index=True)

In [70]:
delivery_df

Unnamed: 0,match_id,team,batting_team,ball,batsman,bowler,runs,player_dismissed,city,venue
0,2,"[Sunrisers Hyderabad, Royal Challengers Bangal...",Sunrisers Hyderabad,0.1,DA Warner,TS Mills,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
1,2,"[Sunrisers Hyderabad, Royal Challengers Bangal...",Sunrisers Hyderabad,0.2,DA Warner,TS Mills,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
2,2,"[Sunrisers Hyderabad, Royal Challengers Bangal...",Sunrisers Hyderabad,0.3,DA Warner,TS Mills,4,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
3,2,"[Sunrisers Hyderabad, Royal Challengers Bangal...",Sunrisers Hyderabad,0.4,DA Warner,TS Mills,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
4,2,"[Sunrisers Hyderabad, Royal Challengers Bangal...",Sunrisers Hyderabad,0.5,DA Warner,TS Mills,2,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
...,...,...,...,...,...,...,...,...,...,...
135013,1096,"[Royal Challengers Bangalore, Sunrisers Hydera...",Sunrisers Hyderabad,19.2,BCJ Cutting,SR Watson,6,0,Bangalore,M Chinnaswamy Stadium
135014,1096,"[Royal Challengers Bangalore, Sunrisers Hydera...",Sunrisers Hyderabad,19.3,BCJ Cutting,SR Watson,6,0,Bangalore,M Chinnaswamy Stadium
135015,1096,"[Royal Challengers Bangalore, Sunrisers Hydera...",Sunrisers Hyderabad,19.4,BCJ Cutting,SR Watson,1,0,Bangalore,M Chinnaswamy Stadium
135016,1096,"[Royal Challengers Bangalore, Sunrisers Hydera...",Sunrisers Hyderabad,19.5,B Kumar,SR Watson,1,0,Bangalore,M Chinnaswamy Stadium
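In the output above, `match_id` starts at 2, whereas the `match_id` column in `matches` starts at 1, because the loop keeps its own counter and increments it before first use. A minimal sketch of the same flattening that reuses each row's own `match_id` is shown below, so the two tables could be joined directly. The function name `first_innings_deliveries` and the `demo_matches` frame are hypothetical, and the sketch keeps only a subset of the columns.

```python
import pandas as pd


def first_innings_deliveries(matches):
    """Flatten the first innings of every match into one row per delivery."""
    records = []
    for _, row in matches.iterrows():
        innings = row["innings"][0]["1st innings"]
        for delivery in innings["deliveries"]:
            # Each delivery is a one-key dict: {ball number: delivery details}.
            for ball, details in delivery.items():
                records.append({
                    "match_id": row["match_id"],  # reuse the id stored on the match
                    "batting_team": innings["team"],
                    "ball": ball,
                    "batsman": details["batsman"],
                    "bowler": details["bowler"],
                    "runs": details["runs"]["total"],
                    # '0' (a string) when no wicket fell, as in the loop above.
                    "player_dismissed": details.get("wicket", {}).get("player_out", "0"),
                })
    return pd.DataFrame.from_records(records)


# Hypothetical one-match input with two deliveries.
demo_matches = pd.DataFrame({
    "match_id": [1],
    "innings": [[{"1st innings": {"team": "Team A", "deliveries": [
        {0.1: {"batsman": "P1", "bowler": "Q1", "runs": {"total": 4}}},
        {0.2: {"batsman": "P1", "bowler": "Q1", "runs": {"total": 0},
               "wicket": {"player_out": "P1"}}},
    ]}}]],
})

print(first_innings_deliveries(demo_matches))
```

Building a list of plain dictionaries and converting it once also removes the need for ten parallel lists per match.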


To make the data clearer, the batting team and the bowling team are recorded separately for each delivery. <br>
This distinction makes it possible to study team-wise performance and analyze match dynamics.

In [71]:
def bowl(row):
    # The bowling side is whichever of the two listed teams is not batting
    for team in row['team']:
        if team != row['batting_team']:
            return team

In [72]:
delivery_df['bowling_team'] = delivery_df.apply(bowl,axis=1)

In [73]:
delivery_df.drop(columns=['team'], inplace=True)

In [74]:
delivery_df

Unnamed: 0,match_id,batting_team,ball,batsman,bowler,runs,player_dismissed,city,venue,bowling_team
0,2,Sunrisers Hyderabad,0.1,DA Warner,TS Mills,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal",Royal Challengers Bangalore
1,2,Sunrisers Hyderabad,0.2,DA Warner,TS Mills,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal",Royal Challengers Bangalore
2,2,Sunrisers Hyderabad,0.3,DA Warner,TS Mills,4,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal",Royal Challengers Bangalore
3,2,Sunrisers Hyderabad,0.4,DA Warner,TS Mills,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal",Royal Challengers Bangalore
4,2,Sunrisers Hyderabad,0.5,DA Warner,TS Mills,2,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal",Royal Challengers Bangalore
...,...,...,...,...,...,...,...,...,...,...
135013,1096,Sunrisers Hyderabad,19.2,BCJ Cutting,SR Watson,6,0,Bangalore,M Chinnaswamy Stadium,Royal Challengers Bangalore
135014,1096,Sunrisers Hyderabad,19.3,BCJ Cutting,SR Watson,6,0,Bangalore,M Chinnaswamy Stadium,Royal Challengers Bangalore
135015,1096,Sunrisers Hyderabad,19.4,BCJ Cutting,SR Watson,1,0,Bangalore,M Chinnaswamy Stadium,Royal Challengers Bangalore
135016,1096,Sunrisers Hyderabad,19.5,B Kumar,SR Watson,1,0,Bangalore,M Chinnaswamy Stadium,Royal Challengers Bangalore


In [75]:
delivery_df['batting_team'].value_counts()

batting_team
Mumbai Indians                 16675
Chennai Super Kings            16085
Royal Challengers Bangalore    14761
Kolkata Knight Riders          14719
Rajasthan Royals               12730
Kings XI Punjab                11869
Sunrisers Hyderabad            11645
Delhi Daredevils                8753
Deccan Chargers                 5280
Delhi Capitals                  5105
Punjab Kings                    3882
Lucknow Super Giants            2874
Gujarat Titans                  2614
Pune Warriors                   2448
Gujarat Lions                   1726
Royal Challengers Bengaluru     1133
Rising Pune Supergiant           994
Kochi Tuskers Kerala             876
Rising Pune Supergiants          849
Name: count, dtype: int64
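The counts above list some franchises under more than one name, for example `Rising Pune Supergiant` and `Rising Pune Supergiants`, or `Kings XI Punjab` and `Punjab Kings`. If these are to be treated as one team, a mapping can be applied to both team columns. The sketch below is one possible approach. The `TEAM_ALIASES` mapping and the helper `normalise_teams` are assumptions for illustration, and whether to merge renamed franchises at all is an analysis decision that the notebook does not make.

```python
import pandas as pd

# Assumed mapping from older or variant names to a single label per franchise.
TEAM_ALIASES = {
    "Rising Pune Supergiants": "Rising Pune Supergiant",
    "Delhi Daredevils": "Delhi Capitals",
    "Kings XI Punjab": "Punjab Kings",
    "Royal Challengers Bangalore": "Royal Challengers Bengaluru",
}


def normalise_teams(df, columns=("batting_team", "bowling_team")):
    """Return a copy of df with team names replaced according to TEAM_ALIASES."""
    out = df.copy()
    for col in columns:
        # Names that are not in the mapping are left unchanged.
        out[col] = out[col].replace(TEAM_ALIASES)
    return out


# Hypothetical miniature of the delivery-level frame.
demo = pd.DataFrame({
    "batting_team": ["Delhi Daredevils", "Mumbai Indians"],
    "bowling_team": ["Kings XI Punjab", "Rising Pune Supergiants"],
})

print(normalise_teams(demo))
```

The same pattern would apply to the `city` column if spellings such as `Bangalore` and `Bengaluru` are to be merged.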

In [76]:
output = delivery_df[['match_id', 'batting_team', 'bowling_team', 'ball', 'runs', 'player_dismissed', 'city', 'venue']]

In [77]:
output

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue
0,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.1,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
1,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.2,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
2,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.3,4,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
3,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.4,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
4,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.5,2,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
...,...,...,...,...,...,...,...,...
135013,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.2,6,0,Bangalore,M Chinnaswamy Stadium
135014,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.3,6,0,Bangalore,M Chinnaswamy Stadium
135015,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.4,1,0,Bangalore,M Chinnaswamy Stadium
135016,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.5,1,0,Bangalore,M Chinnaswamy Stadium


In [79]:
# 'no wicket' is stored as the string '0', so the comparison must use '0', not the integer 0
if (output['player_dismissed'] != '0').any():
    print("yes")

yes


In [80]:
output['player_dismissed'].value_counts()

player_dismissed
0                128323
RG Sharma           119
S Dhawan            115
V Kohli             114
KD Karthik          100
                  ...  
DJ Muthuswami         1
KC Cariappa           1
P Sahu                1
PSP Handscomb         1
F Behardien           1
Name: count, Length: 541, dtype: int64

In [81]:
output['city'].value_counts()

city
Mumbai            21504
Kolkata           11380
Delhi             11059
Chennai           10554
Hyderabad          9499
Bangalore          7869
Chandigarh         7449
Jaipur             7073
Pune               6300
Abu Dhabi          4565
Ahmedabad          4450
Bengaluru          3521
Visakhapatnam      1869
Durban             1858
Lucknow            1738
Dubai              1623
Dharamsala         1618
Centurion          1486
Sharjah            1245
Rajkot             1229
Navi Mumbai        1133
Indore             1082
Johannesburg        995
Port Elizabeth      870
Cuttack             856
Ranchi              837
Cape Town           813
Raipur              742
Mohali              622
Kochi               603
Kanpur              492
East London         380
Guwahati            372
Nagpur              370
Kimberley           368
Bloemfontein        251
Name: count, dtype: int64

After extracting the ball-by-ball data and separating the batting and bowling teams, the updated dataset is saved as another pickle file. <br>
This preserves progress and allows efficient reloading without reprocessing the previous steps.

In [87]:
with open('dataset_level2.pkl', 'wb') as f:
    pickle.dump(output, f)

In [88]:
with open('dataset_level2.pkl', 'rb') as f:
    output_chk = pickle.load(f)

In [89]:
output_chk

Unnamed: 0,match_id,batting_team,bowling_team,ball,runs,player_dismissed,city,venue
0,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.1,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
1,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.2,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
2,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.3,4,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
3,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.4,0,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
4,2,Sunrisers Hyderabad,Royal Challengers Bangalore,0.5,2,0,Hyderabad,"Rajiv Gandhi International Stadium, Uppal"
...,...,...,...,...,...,...,...,...
135013,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.2,6,0,Bangalore,M Chinnaswamy Stadium
135014,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.3,6,0,Bangalore,M Chinnaswamy Stadium
135015,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.4,1,0,Bangalore,M Chinnaswamy Stadium
135016,1096,Sunrisers Hyderabad,Royal Challengers Bangalore,19.5,1,0,Bangalore,M Chinnaswamy Stadium
