# IPL Exploratory Data Analysis

## This notebook explores IPL player statistics from the 2016 to 2022 seasons. EDA is conducted on both the batting and the bowling statistics.

### 1. Import the data into the notebook. 
First, I define a function that loads a CSV file into a dataframe, given a directory path and a filename.

In [1]:
import os
import pandas as pd


def load_data(file_path, filename):
    """Read the CSV file at file_path/filename into a DataFrame."""
    csv_path = os.path.join(file_path, filename)
    return pd.read_csv(csv_path)

In [34]:
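def write_csv_data(file_path, filename, df):
    """Write df to file_path/filename as CSV and confirm the file is non-empty."""
    os.makedirs(file_path, exist_ok=True)  # create the output folder if it is missing
    csv_path = os.path.join(file_path, filename)
    df.to_csv(csv_path)

    # Report success only if the file exists and actually contains data
    if os.path.exists(csv_path) and os.path.getsize(csv_path) > 0:
        print(filename + " was written to successfully!")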
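
One detail of `to_csv` is worth keeping in mind here: by default it also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a tiny made-up dataframe, using in-memory buffers instead of real files; whether the index is wanted in the output files is a judgment call for this notebook.

```python
import io

import pandas as pd

# Tiny made-up frame standing in for one of the notebook's dataframes.
df = pd.DataFrame({"Player": ["A", "B"], "Runs": [10, 20]})

# Default behaviour: the RangeIndex is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the row labels out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))     # ['Unnamed: 0', 'Player', 'Runs']
print(list(reloaded_without_index.columns))  # ['Player', 'Runs']
```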

#### 1a. Load in batting data
We then call `load_data` to load the batting data for every season.

In [2]:
batting_file_path = "Data/Batting_Stats"


# All batting dataframes
batting_2016 = load_data(batting_file_path, "BATTING_STATS-IPL_2016.csv")
batting_2017 = load_data(batting_file_path, "BATTING_STATS-IPL_2017.csv")
batting_2018 = load_data(batting_file_path, "BATTING_STATS-IPL_2018.csv")
batting_2019 = load_data(batting_file_path, "BATTING_STATS-IPL_2019.csv")
batting_2020 = load_data(batting_file_path, "BATTING_STATS-IPL_2020.csv")
batting_2021 = load_data(batting_file_path, "BATTING_STATS-IPL_2021.csv")
batting_2022 = load_data(batting_file_path, "BATTING_STATS-IPL_2022.csv")



#### 1b. Load in bowling data
We then call `load_data` in the same way to load the bowling data for every season.

In [3]:
bowling_file_path = "Data/Bowling_Stats"

# All bowling dataframes
bowling_2016 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2016.csv")
bowling_2017 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2017.csv")
bowling_2018 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2018.csv")
bowling_2019 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2019.csv")
bowling_2020 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2020.csv")
bowling_2021 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2021.csv")
bowling_2022 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2022.csv")
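
Since the file names differ only by year, the fourteen calls above could also be written as a loop. The following is a minimal sketch of that idea, not the approach used in the rest of this notebook; `load_seasons` is a hypothetical helper, and it assumes the `<PREFIX>-IPL_<year>.csv` naming shown above.

```python
import os

import pandas as pd


def load_seasons(file_path, prefix, years):
    """Return {year: DataFrame}, reading <prefix>-IPL_<year>.csv for each year."""
    frames = {}
    for year in years:
        csv_path = os.path.join(file_path, f"{prefix}-IPL_{year}.csv")
        frames[year] = pd.read_csv(csv_path)
    return frames


# Usage, assuming the folder layout from the cells above:
# batting = load_seasons("Data/Batting_Stats", "BATTING_STATS", range(2016, 2023))
# bowling = load_seasons("Data/Bowling_Stats", "BOWLING_STATS", range(2016, 2023))
# batting[2022].head()
```

Keeping the seasons in a dictionary keyed by year makes it easy to run the same inspection or merge step over all of them.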

### 2. Take a first look at the dataframes
We will use some basic dataframe commands to inspect the content.

In [5]:
batting_2022.head()

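| | POS | Player | Mat | Inns | NO | Runs | HS | Avg | BF | SR | 100 | 50 | 4s | 6s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Jos Buttler | 17 | 17 | 2 | 863 | 116 | 57.53 | 579 | 149.05 | 4 | 4 | 83 | 45 |
| 1 | 2 | K L Rahul | 15 | 15 | 3 | 616 | 103* | 51.33 | 455 | 135.38 | 2 | 4 | 45 | 30 |
| 2 | 3 | Quinton De Kock | 15 | 15 | 1 | 508 | 140* | 36.29 | 341 | 148.97 | 1 | 3 | 47 | 23 |
| 3 | 4 | Hardik Pandya | 15 | 15 | 4 | 487 | 87* | 44.27 | 371 | 131.26 | 0 | 4 | 49 | 12 |
| 4 | 5 | Shubman Gill | 16 | 16 | 2 | 483 | 96 | 34.5 | 365 | 132.32 | 0 | 4 | 51 | 11 |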


In [6]:
bowling_2022.head()

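| | POS | Player | Mat | Inns | Ov | Runs | Wkts | BBI | Avg | Econ | SR | 4w | 5w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Yuzvendra Chahal | 17 | 17 | 68.0 | 527 | 27 | 40/5 | 19.51 | 7.75 | 15.11 | 1 | 1 |
| 1 | 2 | Wanindu Hasaranga | 16 | 16 | 57.0 | 430 | 26 | 18/5 | 16.53 | 7.54 | 13.15 | 1 | 1 |
| 2 | 3 | Kagiso Rabada | 13 | 13 | 48.0 | 406 | 23 | 33/4 | 17.65 | 8.45 | 12.52 | 2 | 0 |
| 3 | 4 | Umran Malik | 14 | 14 | 49.1 | 444 | 22 | 25/5 | 20.18 | 9.03 | 13.4 | 1 | 1 |
| 4 | 5 | Kuldeep Yadav | 14 | 14 | 49.4 | 419 | 21 | 14/4 | 19.95 | 8.43 | 14.19 | 2 | 0 |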


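#### 2a. Info function analysis
The `info()` method reports the dtype and the non-null count of every column. After running it on all of the dataframes, none of the fields appear to contain null values, so no imputation is needed. Note, however, that `HS` and `Avg` in the batting data are stored as `object` rather than as numbers.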

In [16]:
batting_2022.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162 entries, 0 to 161
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   POS     162 non-null    int64  
 1   Player  162 non-null    object 
 2   Mat     162 non-null    int64  
 3   Inns    162 non-null    int64  
 4   NO      162 non-null    int64  
 5   Runs    162 non-null    int64  
 6   HS      162 non-null    object 
 7   Avg     162 non-null    object 
 8   BF      162 non-null    int64  
 9   SR      162 non-null    float64
 10  100     162 non-null    int64  
 11  50      162 non-null    int64  
 12  4s      162 non-null    int64  
 13  6s      162 non-null    int64  
dtypes: float64(1), int64(10), object(3)
memory usage: 17.8+ KB
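
`HS` is read as `object` because not-out scores carry a trailing `*` (as in `103*` in the `head()` output). `Avg` is presumably `object` for a similar reason, for example a placeholder for players who were never dismissed, though that is an assumption about the raw files. A minimal sketch of converting both to numeric columns, on a small made-up sample (the `"-"` value and the new column names are hypothetical):

```python
import pandas as pd

# Small sample in the shape of the batting data; the "-" average is an assumed placeholder.
sample = pd.DataFrame({
    "Player": ["Jos Buttler", "K L Rahul", "Hypothetical Player"],
    "HS": ["116", "103*", "12*"],
    "Avg": ["57.53", "51.33", "-"],
})

hs = sample["HS"].astype(str)
# A trailing "*" marks a not-out highest score; keep that as its own flag.
sample["HS_not_out"] = hs.str.endswith("*")
sample["HS_runs"] = pd.to_numeric(hs.str.rstrip("*"))
# errors="coerce" turns anything non-numeric (such as "-") into NaN.
sample["Avg_num"] = pd.to_numeric(sample["Avg"], errors="coerce")

print(sample.dtypes)
```

Coercing to `NaN` does introduce missing values, so any null check should be repeated after a conversion like this.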


In [23]:
bowling_2022.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   POS     103 non-null    int64  
 1   Player  103 non-null    object 
 2   Mat     103 non-null    int64  
 3   Inns    103 non-null    int64  
 4   Ov      103 non-null    float64
 5   Runs    103 non-null    int64  
 6   Wkts    103 non-null    int64  
 7   BBI     103 non-null    object 
 8   Avg     103 non-null    float64
 9   Econ    103 non-null    float64
 10  SR      103 non-null    float64
 11  4w      103 non-null    int64  
 12  5w      103 non-null    int64  
dtypes: float64(4), int64(7), object(2)
memory usage: 10.6+ KB
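
Reading fourteen `info()` outputs by eye is easy to get wrong, so the null check can also be done programmatically. This is a sketch under the assumption that the dataframes are collected in a dictionary; `count_nulls` is a hypothetical helper and the frames below are made up for illustration.

```python
import pandas as pd


def count_nulls(frames):
    """Return a Series mapping each frame's label to its total number of null cells."""
    return pd.Series({label: int(df.isna().sum().sum()) for label, df in frames.items()})


# Made-up frames for illustration; in the notebook these would be the seasonal dataframes.
frames = {
    "batting_demo": pd.DataFrame({"Player": ["A", "B"], "Runs": [10, 20]}),
    "bowling_demo": pd.DataFrame({"Player": ["C", None], "Wkts": [3, 1]}),
}
null_counts = count_nulls(frames)
print(null_counts)
```

In the notebook, the dictionary could be built as, for example, `{"batting_2016": batting_2016, ...}`; a result of all zeros would confirm the observation above.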


### 3. Merging the dataframes
Before analysing the data, I merge the seasonal dataframes into one batting dataframe and one bowling dataframe, so that a player's performance across the years can be viewed together.

#### 3a. Merging batting data

In [42]:
# Merge year 2016 and 2017
batting_data_2016_2017_merged = batting_2016.merge(batting_2017, how='outer', on='Player', suffixes=('_2016', '_2017'))
# Merge year 2018 and 2019
batting_data_2018_2019_merged = batting_2018.merge(batting_2019, how='outer', on='Player', suffixes=('_2018', '_2019'))
# Merge year 2016, 2017, 2018, and 2019
batting_data_2016_2019_merged = batting_data_2016_2017_merged.merge(batting_data_2018_2019_merged, how='outer', on='Player')

# Merge year 2020 and 2021
batting_data_2020_2021_merged = batting_2020.merge(batting_2021, how='outer', on='Player', suffixes=('_2020', '_2021'))
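# Merge year 2020, 2021, and 2022 (no column names collide here, so the 2022 columns keep their unsuffixed names)
batting_data_2020_2022_merged = batting_data_2020_2021_merged.merge(batting_2022, how='outer', on='Player')
# Merge 2016-2019 with 2020-2022 so that all seasons are in one dataframe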
batting_data = batting_data_2016_2019_merged.merge(batting_data_2020_2022_merged, how='outer', on='Player')

# Calling function to write a csv file to our output folder
write_csv_data("Outputs", "batting_data_merged.csv", batting_data)

batting_data_merged.csv was written to successfully!


#### 3b. Merging bowling data

In [48]:
# Merge year 2016 and 2017
bowling_data_2016_2017_merged = bowling_2016.merge(bowling_2017, how='outer', on='Player', suffixes=('_2016', '_2017'))
# Merge year 2018 and 2019
bowling_data_2018_2019_merged = bowling_2018.merge(bowling_2019, how='outer', on='Player', suffixes=('_2018', '_2019'))
# Merge year 2016, 2017, 2018, and 2019
bowling_data_2016_2019_merged = bowling_data_2016_2017_merged.merge(bowling_data_2018_2019_merged, how='outer', on='Player')

# Merge year 2020 and 2021
bowling_data_2020_2021_merged = bowling_2020.merge(bowling_2021, how='outer', on='Player', suffixes=('_2020', '_2021'))
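# Merge year 2020, 2021, and 2022 (no column names collide here, so the 2022 columns keep their unsuffixed names)
bowling_data_2020_2022_merged = bowling_data_2020_2021_merged.merge(bowling_2022, how='outer', on='Player')
# Merge 2016-2019 with 2020-2022 so that all seasons are in one dataframe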
bowling_data = bowling_data_2016_2019_merged.merge(bowling_data_2020_2022_merged, how='outer', on='Player')

# Calling function to write a csv file to our output folder
write_csv_data("Outputs", "bowling_data_merged.csv", bowling_data)

bowling_data_merged.csv was written to successfully!
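
The pairwise merges above can also be expressed as a single fold over the seasons. The sketch below is an alternative, not what the cells above do: it renames every non-key column with its year before merging, so all seasons (including 2022) end up with a suffix. `merge_seasons` is a hypothetical helper and the frames are made up for illustration.

```python
from functools import reduce

import pandas as pd


def merge_seasons(frames, key="Player"):
    """Outer-merge {year: DataFrame} on `key`, suffixing every other column with _<year>."""
    renamed = [
        df.rename(columns={c: f"{c}_{year}" for c in df.columns if c != key})
        for year, df in sorted(frames.items())
    ]
    return reduce(lambda left, right: left.merge(right, how="outer", on=key), renamed)


# Made-up seasonal frames for illustration only.
frames = {
    2021: pd.DataFrame({"Player": ["A", "B"], "Runs": [100, 200]}),
    2022: pd.DataFrame({"Player": ["B", "C"], "Runs": [250, 50]}),
}
merged = merge_seasons(frames)
print(merged.sort_values("Player").reset_index(drop=True))
```

With this naming, the 2022 rank column would be `POS_2022` rather than `POS`, so any later column lists would need adjusting. Both this sketch and the cells above also rely on each player name appearing at most once per season; duplicate names would multiply rows in the merge.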


#### 3c. Dropping unneeded columns
For the batting data, we do not need the POS field: it is a per-season ranking, so after the merge it is repeated once for every season. For the bowling data we drop POS for the same reason, along with BBI (Best Bowling Innings), since that per-season value is no longer relevant once the seasons are merged.

In [44]:
batting_data.drop(['POS_2016', 'POS_2017', 'POS_2018', 'POS_2019', 'POS_2020', 'POS_2021', 'POS'], axis=1, inplace=True)

# Calling function to write a csv file to our output folder
write_csv_data("Outputs", "batting_data_dropped_unneeded_columns.csv", batting_data)

batting_data_dropped_unneeded_columns.csv was written to successfully!


In [49]:
bowling_data.drop(['POS_2016', 'POS_2017', 'POS_2018', 'POS_2019', 'POS_2020', 'POS_2021', 'POS'], axis=1, inplace=True)
bowling_data.drop(['BBI_2016', 'BBI_2017', 'BBI_2018', 'BBI_2019', 'BBI_2020', 'BBI_2021', 'BBI'], axis=1, inplace=True)

# Calling function to write a csv file to our output folder
write_csv_data("Outputs", "bowling_data_dropped_unneeded_columns.csv", bowling_data)

bowling_data_dropped_unneeded_columns.csv was written to successfully!
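
Listing every suffixed column by hand works, but it is easy to miss one. As a hedged alternative, the columns can be selected by their prefix instead; `drop_by_prefix` is a hypothetical helper, shown here on a made-up frame, and it returns a new dataframe rather than modifying the original in place.

```python
import pandas as pd


def drop_by_prefix(df, prefixes):
    """Return a copy of df without columns named '<prefix>' or starting with '<prefix>_'."""
    to_drop = [
        col for col in df.columns
        if any(col == p or col.startswith(p + "_") for p in prefixes)
    ]
    return df.drop(columns=to_drop)


# Made-up merged frame for illustration only.
demo = pd.DataFrame({
    "Player": ["A"], "POS_2021": [1], "POS": [2], "BBI_2021": ["14/4"], "Runs_2021": [30],
})
trimmed = drop_by_prefix(demo, ["POS", "BBI"])
print(list(trimmed.columns))  # ['Player', 'Runs_2021']
```

Matching on the exact name or on the name followed by an underscore avoids accidentally dropping an unrelated column that merely begins with the same letters.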


### 4. Extracting more detailed information about each dataset
To get a better feel for the data, we use the `describe()` method for summary statistics and the `hist()` method to plot a histogram of each numeric column in the dataframe.