# Data Extraction
This notebook takes the match data stored as YAML files in our Kaggle dataset and converts it into a format suitable for model development. We process and store the data for the 1st and 2nd innings separately.

## 0. Setup

In [1]:
# Start by importing the necessary libraries.
import numpy as np
import pandas as pd
from yaml import safe_load
import os
from tqdm import tqdm

# safe_load parses a YAML document and converts it into a Python object.
# YAML is a data serialization format commonly used to exchange data between different languages.
# tqdm is used to create progress bars for loops.

## 1. Data Import

We will be using data from the following Kaggle dataset: https://www.kaggle.com/datasets/veeralakrishna/cricsheet-a-retrosheet-for-cricket. From this dataset, we use the T20 data, which covers 1,433 matches.

In [2]:
# Collect the paths of all the YAML files in the data directory.
# There are 1433 YAML files, each corresponding to a T20 match.
filenames = [os.path.join('data', file) for file in os.listdir('data')]

In [3]:
# Transfer the contents from each YAML file to a pandas DataFrame.
main_df = pd.DataFrame()
# Iterate over all the files.
for file in tqdm(filenames):
    with open(file, 'r') as f:
        # For each file, we open it, load the contents and normalize them into a one-row DataFrame.
        # Then, we append that row to the main DataFrame. (Match ids are generated later, in section 3.)
        df = pd.json_normalize(safe_load(f))
        main_df = pd.concat([main_df, df])

100%|████████████████████████████████████▉| 1432/1433 [04:29<00:00,  5.32it/s]


NotImplementedError: 
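
The run above stops on the last of the 1,433 files, and the traceback is cut off, so the cause is not visible here. One possibility, which is only an assumption, is that the last entry in `data` is not a match file. As a rough sketch of one way to make the loop tolerant of such a file, the hypothetical helper below (`load_records`, not part of the notebook) collects parsed records in a list and notes the files that fail or that do not parse to a mapping, instead of aborting. The `parse` argument stands in for `safe_load`; `json.load` is used here only so the example runs with the standard library alone.

```python
import json
import os
import tempfile


def load_records(paths, parse):
    """Parse each file with `parse` and return (records, failures).

    Hypothetical helper: `parse` is any callable that takes an open file
    object and returns a Python object (safe_load in the notebook).
    """
    records, failures = [], []
    for path in paths:
        try:
            with open(path, 'r') as f:
                record = parse(f)
        except Exception as exc:  # broad on purpose: the real error type is unknown
            failures.append((path, repr(exc)))
            continue
        # Assumption: every match file parses to a mapping (dict).
        if not isinstance(record, dict):
            failures.append((path, 'parsed object is not a mapping'))
            continue
        records.append(record)
    return records, failures


# Demo with three small files: one valid, one malformed, one that is not a mapping.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'good.json')
    malformed = os.path.join(tmp, 'malformed.json')
    not_mapping = os.path.join(tmp, 'not_mapping.json')
    with open(good, 'w') as f:
        json.dump({'info': {'overs': 20}}, f)
    with open(malformed, 'w') as f:
        f.write('{not valid')
    with open(not_mapping, 'w') as f:
        json.dump('just some text', f)
    records, failures = load_records([good, malformed, not_mapping], json.load)
```

In the notebook, the resulting list of records could then be passed to `pd.json_normalize` in a single call, which would also avoid growing `main_df` with `pd.concat` on every iteration.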

In [12]:
# Keep a copy of the DataFrame for processing the second innings.
main_second_df = main_df.copy()

## 2. Data Cleaning for 1st Innings

Next, we drop the columns that are not required.

In [13]:
# Discard the data features that are not required for our model development.
main_df.drop(columns = [
    'meta.data_version',
    'meta.created',
    'meta.revision',
    'info.outcome.bowl_out',
    'info.bowl_out',
    'info.supersubs.South Africa',
    'info.supersubs.New Zealand',
    'info.outcome.eliminator',
    'info.outcome.result',
    'info.outcome.method',
    'info.neutral_venue',
    'info.match_type_number',
    'info.outcome.by.runs',
    'info.outcome.by.wickets',
], inplace = True)
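
The dotted column names in the list above, such as `info.outcome.by.runs`, come from `pd.json_normalize` flattening nested dictionaries into a single level. The plain-Python sketch below imitates that naming scheme on a toy record so the origin of the names is visible. `flatten` is a hypothetical helper written for illustration, not the pandas implementation, and the toy values are made up.

```python
def flatten(record, parent_key='', sep='.'):
    """Flatten nested dicts into a single dict with dotted keys.

    Hypothetical illustration only. Non-dict values (including lists)
    are kept as-is.
    """
    flat = {}
    for key, value in record.items():
        full_key = f'{parent_key}{sep}{key}' if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dicts, extending the dotted prefix.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Toy record shaped like a small part of one match file (values are made up).
toy_match = {
    'meta': {'revision': 1},
    'info': {'overs': 20, 'outcome': {'by': {'runs': 5}}},
    'innings': [{'1st innings': {'team': 'A'}}],
}
flat = flatten(toy_match)
```

Because lists are not flattened, `innings` stays a single column holding a list, which is what section 3 unpacks delivery by delivery.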

In [14]:
main_df.shape

(1432, 13)

In [15]:
main_df['info.gender'].value_counts()

info.gender
male      966
female    466
Name: count, dtype: int64

In [16]:
# Filter and segregate the data pertaining to men's T20 cricket matches.
main_df = main_df.loc[main_df['info.gender'] == 'male']
# Remove gender column from data, since it is the same value for all entries.
main_df.drop(columns = ['info.gender'], inplace = True)
main_df.shape

(966, 12)

In [17]:
# Check to ensure all the data entries pertain to T20 matches.
main_df['info.match_type'].value_counts()

info.match_type
T20    966
Name: count, dtype: int64

In [18]:
# Check to ensure all the data entries pertain to 20 over matches.
main_df['info.overs'].value_counts()

info.overs
20    963
50      3
Name: count, dtype: int64

In [19]:
# Filter the data to only include data from 20 over matches.
main_df = main_df.loc[main_df['info.overs'] == 20]
# Also, remove the columns of overs and match type since the value is the same for all entries.
main_df.drop(columns = ['info.overs','info.match_type'], inplace = True)
main_df.shape

(963, 10)

In [20]:
# Serialize and save the DataFrame to a file using pickle.
# This lets us load the data faster later, without revisiting the YAML files.
import pickle
with open('dataset_level1_first_innings.pkl', 'wb') as f:
    pickle.dump(main_df, f)
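
As a small sanity check on this save step, a pickle round trip can be verified by loading the file back and comparing it with the original object. The sketch below uses a plain list of dictionaries and a temporary file so that it runs with the standard library alone; the file name `example.pkl` and the sample row are placeholders, not part of the notebook.

```python
import os
import pickle
import tempfile

# Placeholder data standing in for the DataFrame saved in the notebook.
rows = [{'info.city': 'Colombo', 'info.venue': 'R Premadasa Stadium'}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.pkl')
    # The with-blocks close the file handles as soon as each step finishes.
    with open(path, 'wb') as f:
        pickle.dump(rows, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)
```

The restored object is an equal but separate copy, so later edits to it do not affect the original.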

## 3. Data Processing for 1st Innings

The main goal is to convert the data from a match-by-match format to a delivery-by-delivery format. The delivery-level format is more useful because we want to train our model to make predictions from different points within a match.

In [4]:
# Load the data from the file into a DataFrame called matches_df for use moving forward.
import pickle
with open('dataset_level1_first_innings.pkl', 'rb') as f:
    matches_df = pickle.load(f)

# Trial Data Access
# We access the first delivery of the first innings of the first match in our data.
# The deliveries are in the form of a list of single-element dictionaries for each delivery.
# Each dictionary has dictionaries within to capture details for that delivery.
matches_df.iloc[0]['innings'][0]['1st innings']['deliveries'][0]

{0.1: {'batsman': 'AJ Finch',
  'bowler': 'SL Malinga',
  'non_striker': 'M Klinger',
  'runs': {'batsman': 0, 'extras': 0, 'total': 0}}}
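
To make the nesting concrete before processing every match, here is a minimal sketch that flattens one toy innings into per-delivery rows. The first delivery is copied from the output above. The second delivery, including its `wicket` entry, is made up to show how a dismissal would be picked up, and `flatten_innings` is a hypothetical helper.

```python
def flatten_innings(deliveries):
    """Turn a list of single-key delivery dicts into a list of flat rows."""
    rows = []
    for delivery in deliveries:
        # Each delivery dict has exactly one key: the ball number.
        for ball, details in delivery.items():
            rows.append({
                'ball': ball,
                'batsman': details['batsman'],
                'bowler': details['bowler'],
                'runs': details['runs']['total'],
                # '0' marks deliveries on which nobody was dismissed.
                'player_dismissed': details.get('wicket', {}).get('player_out', '0'),
            })
    return rows


toy_deliveries = [
    {0.1: {'batsman': 'AJ Finch', 'bowler': 'SL Malinga', 'non_striker': 'M Klinger',
           'runs': {'batsman': 0, 'extras': 0, 'total': 0}}},
    # Made-up delivery with a wicket, for illustration only.
    {0.2: {'batsman': 'AJ Finch', 'bowler': 'SL Malinga', 'non_striker': 'M Klinger',
           'runs': {'batsman': 0, 'extras': 0, 'total': 0},
           'wicket': {'player_out': 'AJ Finch', 'kind': 'bowled'}}},
]
rows = flatten_innings(toy_deliveries)
```

The chained `.get` calls return the default `'0'` whenever a delivery has no `wicket` entry, so no explicit key check is needed.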

In [5]:
# Convert the data into a delivery-by-delivery setup for all the matches.
# Declare a variable to assign a unique match id to every match.
matchIdx = 1
# Declare a list, where each element will be a dictionary corresponding to a single delivery.
# We want to populate this list, and then construct a DataFrame using it.
# This is more efficient than creating a DataFrame and appending to it.
delivery_data = []
# Iterate through every line item (match) in the matches DataFrame.
for index, row in matches_df.iterrows():
    # Skip some matches with faulty data.
    if matchIdx in [75,108,150,180,268,360,443,458,584,748,982,1052,1111,1226,1345]:
        matchIdx += 1
        continue
    # Iterate through the list of deliveries. Each entry is a single-key dictionary
    # that maps the ball number to the details for that delivery.
    for ball in row['innings'][0]['1st innings']['deliveries']:
        for key in ball.keys():
            # Create a dictionary to extract and store details for each delivery.
            delivery_info = {
                'match_id': matchIdx,
                'teams': row['info.teams'],
                'batting_team': row['innings'][0]['1st innings']['team'],
                'ball': key,
                'batsman': ball[key]['batsman'],
                'bowler': ball[key]['bowler'],
                'runs': ball[key]['runs']['total'],
                'city': row['info.city'],
                'venue': row['info.venue'],
                'player_dismissed': ball[key].get('wicket', {}).get('player_out', '0')
            }
            # Append the dictionary to the list of dictionaries for each delivery.
            delivery_data.append(delivery_info)
    # Increment the match id.
    matchIdx += 1

# Create a data frame to capture delivery-by-delivery data.
delivery_df = pd.DataFrame(delivery_data)

In [6]:
delivery_df

Unnamed: 0,match_id,teams,batting_team,ball,batsman,bowler,runs,city,venue,player_dismissed
0,1,"[Australia, Sri Lanka]",Australia,0.1,AJ Finch,SL Malinga,0,,Melbourne Cricket Ground,0
1,1,"[Australia, Sri Lanka]",Australia,0.2,AJ Finch,SL Malinga,0,,Melbourne Cricket Ground,0
2,1,"[Australia, Sri Lanka]",Australia,0.3,AJ Finch,SL Malinga,1,,Melbourne Cricket Ground,0
3,1,"[Australia, Sri Lanka]",Australia,0.4,M Klinger,SL Malinga,2,,Melbourne Cricket Ground,0
4,1,"[Australia, Sri Lanka]",Australia,0.5,M Klinger,SL Malinga,0,,Melbourne Cricket Ground,0
...,...,...,...,...,...,...,...,...,...,...
115320,963,"[Sri Lanka, Australia]",Sri Lanka,19.3,SMSM Senanayake,MA Starc,1,Colombo,R Premadasa Stadium,0
115321,963,"[Sri Lanka, Australia]",Sri Lanka,19.4,DM de Silva,MA Starc,0,Colombo,R Premadasa Stadium,0
115322,963,"[Sri Lanka, Australia]",Sri Lanka,19.5,DM de Silva,MA Starc,0,Colombo,R Premadasa Stadium,DM de Silva
115323,963,"[Sri Lanka, Australia]",Sri Lanka,19.6,SMSM Senanayake,MA Starc,2,Colombo,R Premadasa Stadium,0


In [7]:
# We design a function to extract the name of the bowling team.
def bowlingTeam(row):
    # For a given delivery, look at the team names in the teams column
    for team in row['teams']:
        # Find the bowling team.
        if team != row['batting_team']:
            return team
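
A quick check of `bowlingTeam` on plain dictionaries shaped like one row of `delivery_df`, using the team names from the output above. The function is repeated here only so that the snippet runs on its own.

```python
def bowlingTeam(row):
    # Return the team in row['teams'] that is not batting.
    for team in row['teams']:
        if team != row['batting_team']:
            return team


teams = ['Australia', 'Sri Lanka']
bowling = bowlingTeam({'teams': teams, 'batting_team': 'Australia'})

# Same teams, other side batting.
other = bowlingTeam({'teams': teams, 'batting_team': 'Sri Lanka'})

# Edge case: if no listed team differs from the batting team, the loop
# finishes without returning, so the result is None.
missing = bowlingTeam({'teams': ['Australia', 'Australia'], 'batting_team': 'Australia'})
```

The `None` case matters when the function is applied to the whole DataFrame, since such rows would end up with a missing `bowling_team`.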

In [8]:
# Add a bowling team column to the delivery DataFrame and remove the teams column.
delivery_df['bowling_team'] = delivery_df.apply(bowlingTeam, axis=1)
delivery_df.drop(columns = ['teams'], inplace=True)

In [9]:
# Count the number of deliveries batted by each team.
teams_frequency = delivery_df['batting_team'].value_counts()

In [10]:
# Keep only the deliveries in which both teams are among the 10 teams with the most data,
# to improve the model's performance.
top_10_teams = teams_frequency.head(10).index
delivery_df = delivery_df[delivery_df['batting_team'].isin(top_10_teams)]
delivery_df = delivery_df[delivery_df['bowling_team'].isin(top_10_teams)]
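
The two filters together keep a delivery only when both its batting team and its bowling team are in the top set, so a match between a top team and any other team is dropped entirely. Below is a plain-Python sketch of the same logic on made-up rows, using `collections.Counter` in place of `value_counts` and a top 2 instead of a top 10.

```python
from collections import Counter

# Made-up delivery rows: (batting_team, bowling_team).
deliveries = [
    ('A', 'B'), ('A', 'B'), ('A', 'C'),
    ('B', 'A'), ('B', 'A'),
    ('C', 'A'),
]

# Count deliveries per batting team and take the 2 most frequent teams.
batting_counts = Counter(bat for bat, _ in deliveries)
top_teams = {team for team, _ in batting_counts.most_common(2)}

# Keep a delivery only if both teams are in the top set.
kept = [(bat, bowl) for bat, bowl in deliveries
        if bat in top_teams and bowl in top_teams]
```

Here team `C` falls outside the top 2, so both the delivery it batted and the delivery it bowled are removed.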

In [11]:
# Keep only the columns needed going forward and save the result as a pickle file.
final_df = delivery_df[['match_id', 'batting_team', 'bowling_team', 'ball', 'runs', 'player_dismissed', 'city', 'venue']]
with open('dataset_level2_first_innings.pkl', 'wb') as f:
    pickle.dump(final_df, f)

## 4. Data Cleaning for 2nd Innings

We repeat the same steps 