# Data Extraction
The purpose of this notebook is to take the data stored in YAML files from our Kaggle dataset and convert it to a format suitable for model development. We will process and store data for the 1st and 2nd innings separately.

## 0. Setup

In [1]:
# Start by importing the necessary libraries.
import numpy as np
import pandas as pd
from yaml import safe_load
import os
from tqdm import tqdm

# safe_load parses a YAML document and converts it into a Python object.
# YAML is a data serialization format commonly used to exchange data between languages.
# tqdm is used to create progress bars for loops.

## 1. Data Import

We will be using data from the following Kaggle dataset: https://www.kaggle.com/datasets/veeralakrishna/cricsheet-a-retrosheet-for-cricket. In this dataset, we will look at the data of T20 matches, which has data for 1,433 matches.

In [2]:
# Extract the path of all the YAML files in the data. 
# There are 1433 YAML files with each file corresponding to a T20 match.
filenames = []
for file in os.listdir('data'):
    filenames.append(os.path.join('data', file))
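
`os.listdir` returns every entry in the directory, so a stray text file or subfolder would end up in `filenames` as well. A minimal sketch of a stricter listing, assuming the match files carry a `.yaml` suffix (the helper name `list_yaml_files` is hypothetical and not part of the notebook):

```python
from pathlib import Path


def list_yaml_files(directory):
    """Return sorted paths of the regular files in `directory` ending in .yaml."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix == ".yaml"
    )
```

Sorting also makes the file order reproducible, which matters because match ids are later assigned by row position. Note that changing the order would shift the positions of the hard-coded match ids that are skipped further down.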

In [3]:
# Transfer the contents from each YAML file to a pandas DataFrame.
main_df = pd.DataFrame()
# Iterate over all the files.
for file in tqdm(filenames):
    with open(file, 'r') as f:
        # For each file, we open it, load the contents and normalize them into a DataFrame.
        # Then, we append the result to the main DataFrame.
        df = pd.json_normalize(safe_load(f))
        main_df = pd.concat([main_df, df])

100%|████████████████████████████████████▉| 1432/1433 [04:29<00:00,  5.32it/s]


NotImplementedError: 
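
The traceback above is truncated, so the actual cause is not recorded. One plausible explanation, stated here as an assumption, is that the last entry in `data` is not a match file: `pd.json_normalize` raises a bare `NotImplementedError` when its input is neither a mapping nor a list-like, which is what `safe_load` returns for a plain-text or empty file. A hypothetical guard (`looks_like_match` is not part of the notebook) could be applied to the parsed object before normalizing:

```python
def looks_like_match(parsed):
    """Return True if a parsed YAML document has the top-level keys a match file uses."""
    # 'meta', 'info' and 'innings' are the top-level sections seen in this dataset.
    required = {"meta", "info", "innings"}
    return isinstance(parsed, dict) and required.issubset(parsed)
```

Files for which the guard returns `False` could then be skipped instead of being passed to `pd.json_normalize`.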

In [49]:
# We create a copy of the dataframe for backup
backup_df = main_df.copy()

## 2. Data Cleaning for 1st Innings

Next, we drop the columns that are not needed.

In [13]:
# Discard the features that are not required for our model development.
main_df.drop(columns = [
    'meta.data_version',
    'meta.created',
    'meta.revision',
    'info.outcome.bowl_out',
    'info.bowl_out',
    'info.supersubs.South Africa',
    'info.supersubs.New Zealand',
    'info.outcome.eliminator',
    'info.outcome.result',
    'info.outcome.method',
    'info.neutral_venue',
    'info.match_type_number',
    'info.outcome.by.runs',
    'info.outcome.by.wickets',
], inplace = True)

In [14]:
main_df.shape

(1432, 13)

In [15]:
main_df['info.gender'].value_counts()

info.gender
male      966
female    466
Name: count, dtype: int64

In [16]:
# Filter and segregate the data pertaining to men's T20 cricket matches.
main_df = main_df.loc[main_df['info.gender'] == 'male']
# Remove gender column from data, since it is the same value for all entries.
main_df.drop(columns = ['info.gender'], inplace = True)
main_df.shape

(966, 12)

In [17]:
# Check to ensure all the data entries pertain to T20 matches.
main_df['info.match_type'].value_counts()

info.match_type
T20    966
Name: count, dtype: int64

In [18]:
# Check to ensure all the data entries pertain to 20 over matches.
main_df['info.overs'].value_counts()

info.overs
20    963
50      3
Name: count, dtype: int64

In [19]:
# Filter the data to only include data from 20 over matches.
main_df = main_df.loc[main_df['info.overs'] == 20]
# Also, remove the columns of overs and match type since the value is the same for all entries.
main_df.drop(columns = ['info.overs','info.match_type'], inplace = True)
main_df.shape

(963, 10)

In [51]:
# Serialize and save the DataFrame to a file using pickle.
# This lets us reload the data quickly without revisiting the YAML files.
import pickle
with open('dataset_level1.pkl', 'wb') as f:
    pickle.dump(main_df, f)
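
A small round trip shows what the pickle step provides: the object read back is equal to the one written. This sketch uses a plain dictionary and a temporary file as stand-ins (both are assumptions for illustration) and opens the file in a `with` block so the handle is closed once each operation completes.

```python
import os
import pickle
import tempfile

# Stand-in for main_df: any picklable Python object round-trips the same way.
payload = {"info.city": "Colombo", "info.venue": "R Premadasa Stadium"}

path = os.path.join(tempfile.mkdtemp(), "example.pkl")

# Write in binary mode; the with block closes the file when the dump finishes.
with open(path, "wb") as f:
    pickle.dump(payload, f)

# Read it back, again in binary mode.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

For DataFrames specifically, pandas also provides `DataFrame.to_pickle` and `pd.read_pickle`, which handle opening and closing the file themselves.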

## 3. Data Processing for 1st Innings

The main goal is to take the current data from a match-by-match format to a delivery-by-delivery format. This data will be more valuable given that we want to train our model to make predictions from different points within a match.
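
The reshaping described above can be sketched in plain Python on a tiny hand-made record. The nested layout mirrors the structure used in this notebook, but the record itself and the function name `flatten_first_innings` are made up for illustration:

```python
def flatten_first_innings(match_id, match):
    """Turn one nested match record into a list of flat per-delivery rows."""
    innings = match["innings"][0]["1st innings"]
    rows = []
    for delivery in innings["deliveries"]:
        # Each delivery is a single-key dictionary: {ball_number: details}.
        for ball, details in delivery.items():
            rows.append({
                "match_id": match_id,
                "batting_team": innings["team"],
                "ball": ball,
                "runs": details["runs"]["total"],
            })
    return rows


# Hypothetical two-ball innings, not taken from the dataset.
example_match = {
    "innings": [
        {"1st innings": {
            "team": "Team A",
            "deliveries": [
                {0.1: {"batsman": "P1", "bowler": "Q1",
                       "runs": {"batsman": 0, "extras": 0, "total": 0}}},
                {0.2: {"batsman": "P1", "bowler": "Q1",
                       "runs": {"batsman": 4, "extras": 0, "total": 4}}},
            ],
        }},
    ],
}

rows = flatten_first_innings(1, example_match)
```

One match record becomes one row per delivery, which is the shape needed to make predictions from any point within a match.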

In [55]:
# Load the data from the file into a DataFrame called matches_df for use moving forward.
import pickle
with open('dataset_level1.pkl', 'rb') as f:
    matches_df = pickle.load(f)

# Trial Data Access
# We access the first delivery of the first innings of the first match in our data.
# The deliveries are in the form of a list of single-element dictionaries for each delivery.
# Each dictionary has dictionaries within to capture details for that delivery.
matches_df.iloc[0]['innings'][0]['1st innings']['deliveries'][0]

{0.1: {'batsman': 'AJ Finch',
  'bowler': 'SL Malinga',
  'non_striker': 'M Klinger',
  'runs': {'batsman': 0, 'extras': 0, 'total': 0}}}

In [56]:
# Convert the data into a delivery-by-delivery setup for all the matches.
# Declare a variable to assign a unique match id to every match.
matchIdx = 1
# Declare a list, where each element will be a dictionary corresponding to a single delivery.
# We want to populate this list, and then construct a DataFrame using it.
# This is more efficient than creating a DataFrame and appending to it.
delivery_data = []
# Iterate through every line item (match) in the matches DataFrame.
for index, row in matches_df.iterrows():
    # Skip some matches with faulty data.
    if matchIdx in [75,108,150,180,268,360,443,458,584,748,982,1052,1111,1226,1345]:
        matchIdx += 1
        continue
    # Iterate through the list of deliveries, where each entry is a single-key dictionary
    # holding the details for that delivery.
    for ball in row['innings'][0]['1st innings']['deliveries']:
        for key in ball.keys():
            # Create a dictionary to extract and store details for each delivery.
            delivery_info = {
                'match_id': matchIdx,
                'teams': row['info.teams'],
                'batting_team': row['innings'][0]['1st innings']['team'],
                'ball': key,
                'batsman': ball[key]['batsman'],
                'bowler': ball[key]['bowler'],
                'runs': ball[key]['runs']['total'],
                'city': row['info.city'],
                'venue': row['info.venue'],
                'player_dismissed': ball[key].get('wicket', {}).get('player_out', '0')
            }
            # Append the dictionary to the list of dictionaries for each delivery.
            delivery_data.append(delivery_info)
    # Increment the match id.
    matchIdx += 1

# Create a data frame to capture delivery-by-delivery data.
delivery_df = pd.DataFrame(delivery_data)
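
The `player_dismissed` expression relies on chained `dict.get` calls: when a delivery has no `wicket` key, the first `get` returns an empty dictionary and the second falls back to the string `'0'`. A short sketch, using the first delivery shown earlier and a hypothetical wicket delivery whose values are invented for illustration:

```python
def player_dismissed(details):
    """Return the dismissed player's name, or '0' when no wicket fell."""
    return details.get("wicket", {}).get("player_out", "0")


no_wicket = {
    "batsman": "AJ Finch",
    "bowler": "SL Malinga",
    "non_striker": "M Klinger",
    "runs": {"batsman": 0, "extras": 0, "total": 0},
}

# Hypothetical delivery: the 'wicket' entry and its values are illustrative only.
with_wicket = dict(no_wicket, wicket={"kind": "bowled", "player_out": "AJ Finch"})
```

Note that the fallback is the string `'0'`, not the integer `0`, so any later comparison on this column should test against `'0'`.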

In [57]:
delivery_df

Unnamed: 0,match_id,teams,batting_team,ball,batsman,bowler,runs,city,venue,player_dismissed
0,1,"[Australia, Sri Lanka]",Australia,0.1,AJ Finch,SL Malinga,0,,Melbourne Cricket Ground,0
1,1,"[Australia, Sri Lanka]",Australia,0.2,AJ Finch,SL Malinga,0,,Melbourne Cricket Ground,0
2,1,"[Australia, Sri Lanka]",Australia,0.3,AJ Finch,SL Malinga,1,,Melbourne Cricket Ground,0
3,1,"[Australia, Sri Lanka]",Australia,0.4,M Klinger,SL Malinga,2,,Melbourne Cricket Ground,0
4,1,"[Australia, Sri Lanka]",Australia,0.5,M Klinger,SL Malinga,0,,Melbourne Cricket Ground,0
...,...,...,...,...,...,...,...,...,...,...
115320,963,"[Sri Lanka, Australia]",Sri Lanka,19.3,SMSM Senanayake,MA Starc,1,Colombo,R Premadasa Stadium,0
115321,963,"[Sri Lanka, Australia]",Sri Lanka,19.4,DM de Silva,MA Starc,0,Colombo,R Premadasa Stadium,0
115322,963,"[Sri Lanka, Australia]",Sri Lanka,19.5,DM de Silva,MA Starc,0,Colombo,R Premadasa Stadium,DM de Silva
115323,963,"[Sri Lanka, Australia]",Sri Lanka,19.6,SMSM Senanayake,MA Starc,2,Colombo,R Premadasa Stadium,0


In [58]:
# We design a function to extract the name of the bowling team.
def bowlingTeam(row):
    # For a given delivery, look at the team names in the teams column
    for team in row['teams']:
        # Find the bowling team.
        if team != row['batting_team']:
            return team
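
Because `bowlingTeam` only indexes its argument by key, it can be tried on a plain dictionary standing in for a DataFrame row. The function is restated so the sketch runs on its own; the team names come from the first match in the data:

```python
def bowlingTeam(row):
    # Return the first listed team that is not the batting team.
    for team in row['teams']:
        if team != row['batting_team']:
            return team


first_innings_row = {'teams': ['Australia', 'Sri Lanka'], 'batting_team': 'Australia'}
second_innings_row = {'teams': ['Australia', 'Sri Lanka'], 'batting_team': 'Sri Lanka'}

bowling_first = bowlingTeam(first_innings_row)
bowling_second = bowlingTeam(second_innings_row)
```

If no listed team differs from the batting team, the loop ends without a `return` and the function yields `None`.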

In [59]:
# Add a bowling team column to the delivery DataFrame and remove the teams column.
delivery_df['bowling_team'] = delivery_df.apply(bowlingTeam, axis=1)
delivery_df.drop(columns = ['teams'], inplace=True)

In [60]:
# Evaluate data available for each team
teams_frequency = delivery_df['batting_team'].value_counts()
teams_frequency

batting_team
Pakistan                    10100
South Africa                 8434
India                        8257
New Zealand                  8232
Sri Lanka                    7937
West Indies                  7243
Australia                    6861
England                      6845
Afghanistan                  5206
Bangladesh                   4881
Ireland                      4539
Zimbabwe                     4361
Netherlands                  4011
United Arab Emirates         3069
Hong Kong                    2220
Scotland                     2214
Kenya                        1819
Oman                         1743
Malaysia                     1618
Nepal                        1588
Singapore                    1360
Papua New Guinea             1239
Canada                       1231
Namibia                       989
Jersey                        983
Thailand                      751
Nigeria                       739
Bermuda                       703
Cayman Islands                619
V

In [61]:
# We will filter the data for only the top 12 teams, to optimize the model's performance.
# 12 specifically to align with the 2nd innings data.
top_12_teams = teams_frequency.head(12).index
delivery_df = delivery_df[delivery_df['batting_team'].isin(top_12_teams)]
delivery_df = delivery_df[delivery_df['bowling_team'].isin(top_12_teams)]

In [28]:
# Keep only the columns needed for modelling and save them with pickle.
final_df = delivery_df[['match_id', 'batting_team', 'bowling_team', 'ball', 'runs', 'player_dismissed', 'city', 'venue']]
with open('dataset_level2_first_innings.pkl', 'wb') as f:
    pickle.dump(final_df, f)

## 4. Data Processing for 2nd Innings

We follow similar steps for the 2nd innings: starting from the cleaned data saved in dataset_level1, we convert it from a match-by-match format to a delivery-by-delivery format with the appropriate columns.

In [40]:
# Load the data from the file into a DataFrame called matches_second_df for use moving forward.
import pickle
with open('dataset_level1.pkl', 'rb') as f:
    matches_second_df = pickle.load(f)

# Trial Data Access.
matches_second_df.iloc[0]['innings'][1]['2nd innings']['deliveries'][0]

{0.1: {'batsman': 'N Dickwella',
  'bowler': 'PJ Cummins',
  'non_striker': 'WU Tharanga',
  'runs': {'batsman': 1, 'extras': 0, 'total': 1}}}

In [33]:
# Convert the data into a delivery-by-delivery setup for all the matches.
# This time, we use only the data from the second innings.

# Declare a variable to assign a unique match id to every match.
matchIdx = 1
# Declare a list, where each element will be a dictionary corresponding to a single delivery.
# We want to populate this list, and then construct a DataFrame using it.
# This is more efficient than creating a DataFrame and appending to it.
delivery_second_data = []
# Iterate through every line item (match) in the matches DataFrame.
for index, row in matches_second_df.iterrows():
    # Skip some matches with faulty data.
    if matchIdx in [66,112,137,212,227,306,432,653,699,818,900]:
        matchIdx += 1
        continue
    # Iterate through the list of deliveries, where each entry is a single-key dictionary
    # holding the details for that delivery.
    for ball in row['innings'][1]['2nd innings']['deliveries']:
        for key in ball.keys():
            # Create a dictionary to extract and store details for each delivery.
            delivery_info = {
                'match_id': matchIdx,
                'teams': row['info.teams'],
                'batting_team': row['innings'][1]['2nd innings']['team'],
                'ball': key,
                'batsman': ball[key]['batsman'],
                'bowler': ball[key]['bowler'],
                'runs': ball[key]['runs']['total'],
                'city': row['info.city'],
                'venue': row['info.venue'],
                'player_dismissed': ball[key].get('wicket', {}).get('player_out', '0')
            }
            # Append the dictionary to the list of dictionaries for each delivery.
            delivery_second_data.append(delivery_info)
    # Increment the match id.
    matchIdx += 1

# Create a data frame to capture delivery-by-delivery data.
delivery_second_df = pd.DataFrame(delivery_second_data)

In [34]:
delivery_second_df

Unnamed: 0,match_id,teams,batting_team,ball,batsman,bowler,runs,city,venue,player_dismissed
0,1,"[Australia, Sri Lanka]",Sri Lanka,0.1,N Dickwella,PJ Cummins,1,,Melbourne Cricket Ground,0
1,1,"[Australia, Sri Lanka]",Sri Lanka,0.2,WU Tharanga,PJ Cummins,1,,Melbourne Cricket Ground,0
2,1,"[Australia, Sri Lanka]",Sri Lanka,0.3,N Dickwella,PJ Cummins,0,,Melbourne Cricket Ground,0
3,1,"[Australia, Sri Lanka]",Sri Lanka,0.4,N Dickwella,PJ Cummins,0,,Melbourne Cricket Ground,0
4,1,"[Australia, Sri Lanka]",Sri Lanka,0.5,N Dickwella,PJ Cummins,3,,Melbourne Cricket Ground,0
...,...,...,...,...,...,...,...,...,...,...
104302,963,"[Sri Lanka, Australia]",Australia,17.1,TM Head,SS Pathirana,1,Colombo,R Premadasa Stadium,0
104303,963,"[Sri Lanka, Australia]",Australia,17.2,PM Nevill,SS Pathirana,3,Colombo,R Premadasa Stadium,0
104304,963,"[Sri Lanka, Australia]",Australia,17.3,TM Head,SS Pathirana,0,Colombo,R Premadasa Stadium,0
104305,963,"[Sri Lanka, Australia]",Australia,17.4,TM Head,SS Pathirana,0,Colombo,R Premadasa Stadium,0


In [41]:
# We design a function to extract the name of the bowling team.
def bowlingTeam(row):
    # For a given delivery, look at the team names in the teams column
    for team in row['teams']:
        # Find the bowling team.
        if team != row['batting_team']:
            return team

In [42]:
# Add a bowling team column to the delivery DataFrame and remove the teams column.
delivery_second_df['bowling_team'] = delivery_second_df.apply(bowlingTeam, axis=1)
delivery_second_df.drop(columns = ['teams'], inplace=True)

In [44]:
# Evaluate data available for each team
teams_frequency = delivery_second_df['batting_team'].value_counts()
teams_frequency

batting_team
Pakistan                    8285
Australia                   7777
England                     7612
India                       7316
New Zealand                 7033
West Indies                 6741
Sri Lanka                   6330
Bangladesh                  5803
South Africa                5578
Ireland                     4711
Zimbabwe                    4299
Afghanistan                 3728
Netherlands                 3319
Scotland                    3246
Hong Kong                   2576
United Arab Emirates        2121
Oman                        1739
Nepal                       1677
Kenya                       1480
Canada                      1170
Malaysia                     867
Papua New Guinea             830
Namibia                      698
Bermuda                      682
Qatar                        616
Botswana                     596
United States of America     534
Spain                        525
ICC World XI                 475
Singapore                    4

In [52]:
# We will filter the data for only the top 12 teams, to optimize the performance
# of the model. 12 specifically to align with 1st innings data.
top_12_teams = teams_frequency.head(12).index
delivery_second_df = delivery_second_df[delivery_second_df['batting_team'].isin(top_12_teams)]
delivery_second_df = delivery_second_df[delivery_second_df['bowling_team'].isin(top_12_teams)]
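
The choice of 12 teams is meant to keep both innings aligned on the same sides. That can be checked directly from the two `value_counts` outputs; the lists below copy the first 12 team names from each output in this notebook:

```python
# Top 12 batting teams by delivery count, from the first innings output.
first_innings_top_12 = [
    "Pakistan", "South Africa", "India", "New Zealand", "Sri Lanka",
    "West Indies", "Australia", "England", "Afghanistan", "Bangladesh",
    "Ireland", "Zimbabwe",
]

# Top 12 batting teams by delivery count, from the second innings output.
second_innings_top_12 = [
    "Pakistan", "Australia", "England", "India", "New Zealand",
    "West Indies", "Sri Lanka", "Bangladesh", "South Africa", "Ireland",
    "Zimbabwe", "Afghanistan",
]

# The order differs, but the set of teams is the same.
same_teams = set(first_innings_top_12) == set(second_innings_top_12)
```

The two rankings stop agreeing at 14 teams: the 14th side is United Arab Emirates in the first innings output and Scotland in the second.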

In [54]:
delivery_second_df

Unnamed: 0,match_id,batting_team,ball,batsman,bowler,runs,city,venue,player_dismissed,bowling_team
0,1,Sri Lanka,0.1,N Dickwella,PJ Cummins,1,,Melbourne Cricket Ground,0,Australia
1,1,Sri Lanka,0.2,WU Tharanga,PJ Cummins,1,,Melbourne Cricket Ground,0,Australia
2,1,Sri Lanka,0.3,N Dickwella,PJ Cummins,0,,Melbourne Cricket Ground,0,Australia
3,1,Sri Lanka,0.4,N Dickwella,PJ Cummins,0,,Melbourne Cricket Ground,0,Australia
4,1,Sri Lanka,0.5,N Dickwella,PJ Cummins,3,,Melbourne Cricket Ground,0,Australia
...,...,...,...,...,...,...,...,...,...,...
104302,963,Australia,17.1,TM Head,SS Pathirana,1,Colombo,R Premadasa Stadium,0,Sri Lanka
104303,963,Australia,17.2,PM Nevill,SS Pathirana,3,Colombo,R Premadasa Stadium,0,Sri Lanka
104304,963,Australia,17.3,TM Head,SS Pathirana,0,Colombo,R Premadasa Stadium,0,Sri Lanka
104305,963,Australia,17.4,TM Head,SS Pathirana,0,Colombo,R Premadasa Stadium,0,Sri Lanka


In [53]:
# Keep only the columns needed for modelling and save them with pickle.
final_second_df = delivery_second_df[['match_id', 'batting_team', 'bowling_team', 'ball', 'runs', 'player_dismissed', 'city', 'venue']]
with open('dataset_level2_second_innings.pkl', 'wb') as f:
    pickle.dump(final_second_df, f)