# IPL Exploratory Data Analysis

## This notebook explores IPL player statistics from the 2016 to 2022 seasons, with EDA conducted on both the batting and the bowling statistics.

### 1. Import the data into the notebook. 
First I will define a function to load in data given a file path and a filename.

In [49]:
import os
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


def load_data(file_path, filename):
    # Build the full path to the CSV file and read it into a dataframe
    csv_path = os.path.join(file_path, filename)
    return pd.read_csv(csv_path)

In [50]:
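def write_csv_data(file_path, filename, df):
    # Create the output folder if it does not exist yet, so to_csv does not fail on a missing directory
    os.makedirs(file_path, exist_ok=True)
    csv_path = os.path.join(file_path, filename)
    df.to_csv(csv_path)

    # Confirm that the file exists and is not empty
    if os.path.exists(csv_path) and os.path.getsize(csv_path) > 0:
        print(filename + " was written to successfully!")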
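
One thing worth checking about write_csv_data: DataFrame.to_csv writes the row index as an unnamed first column by default, so reading the file back yields an extra "Unnamed: 0" column. The sketch below shows the round trip with and without index=False. The toy dataframe and the temporary directory are made up for illustration and are not part of the IPL data.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for one season of stats (illustrative values only)
df_demo = pd.DataFrame({"Player": ["A", "B"], "Runs": [10, 20]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    without_index_path = os.path.join(tmp_dir, "without_index.csv")

    df_demo.to_csv(with_index_path)                  # default: the index is written
    df_demo.to_csv(without_index_path, index=False)  # the index is omitted

    cols_with_index = list(pd.read_csv(with_index_path).columns)
    cols_without_index = list(pd.read_csv(without_index_path).columns)

print(cols_with_index)
print(cols_without_index)
```

If the cleaned files are meant to be reloaded later, passing index=False inside write_csv_data would avoid carrying that extra column around.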

#### 1a. Load in batting data
We will then call the function to load in all batting data.

In [51]:
batting_file_path = "Data/Batting_Stats"


# All batting dataframes
df_batting_2016 = load_data(batting_file_path, "BATTING_STATS-IPL_2016.csv")
df_batting_2017 = load_data(batting_file_path, "BATTING_STATS-IPL_2017.csv")
df_batting_2018 = load_data(batting_file_path, "BATTING_STATS-IPL_2018.csv")
df_batting_2019 = load_data(batting_file_path, "BATTING_STATS-IPL_2019.csv")
df_batting_2020 = load_data(batting_file_path, "BATTING_STATS-IPL_2020.csv")
df_batting_2021 = load_data(batting_file_path, "BATTING_STATS-IPL_2021.csv")
df_batting_2022 = load_data(batting_file_path, "BATTING_STATS-IPL_2022.csv")



#### 1b. Load in bowling data
We will then call the function to load in all bowling data.

In [52]:
bowling_file_path = "Data/Bowling_Stats"

# All bowling dataframes
df_bowling_2016 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2016.csv")
df_bowling_2017 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2017.csv")
df_bowling_2018 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2018.csv")
df_bowling_2019 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2019.csv")
df_bowling_2020 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2020.csv")
df_bowling_2021 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2021.csv")
df_bowling_2022 = load_data(bowling_file_path, "BOWLING_STATS-IPL_2022.csv")
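
The fourteen load calls above follow one naming pattern, so they could also be written as a loop that returns a dictionary keyed by year. This is only a sketch: load_seasons is a hypothetical helper, and the demo files are created in a temporary directory so that the example runs without the real Data folder.

```python
import os
import tempfile

import pandas as pd


def load_seasons(file_path, prefix, years):
    """Return {year: DataFrame} for files named '<prefix>_<year>.csv'."""
    return {
        year: pd.read_csv(os.path.join(file_path, f"{prefix}_{year}.csv"))
        for year in years
    }


# Demo on throwaway files so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    for year in (2016, 2017):
        pd.DataFrame({"Player": ["A"], "Runs": [year]}).to_csv(
            os.path.join(tmp_dir, f"BATTING_STATS-IPL_{year}.csv"), index=False
        )
    demo_frames = load_seasons(tmp_dir, "BATTING_STATS-IPL", (2016, 2017))

print(sorted(demo_frames))
```

With the real folders, the call would look like load_seasons("Data/Bowling_Stats", "BOWLING_STATS-IPL", range(2016, 2023)).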

### 2. Take a first look at the dataframes
We will use some basic dataframe commands to inspect the content.

In [53]:
df_batting_2022.head()

Unnamed: 0,POS,Player,Mat,Inns,NO,Runs,HS,Avg,BF,SR,100,50,4s,6s
0,1,Jos Buttler,17,17,2,863,116,57.53,579,149.05,4,4,83,45
1,2,K L Rahul,15,15,3,616,103*,51.33,455,135.38,2,4,45,30
2,3,Quinton De Kock,15,15,1,508,140*,36.29,341,148.97,1,3,47,23
3,4,Hardik Pandya,15,15,4,487,87*,44.27,371,131.26,0,4,49,12
4,5,Shubman Gill,16,16,2,483,96,34.5,365,132.32,0,4,51,11


In [54]:
df_bowling_2022.head()

Unnamed: 0,POS,Player,Mat,Inns,Ov,Runs,Wkts,BBI,Avg,Econ,SR,4w,5w
0,1,Yuzvendra Chahal,17,17,68.0,527,27,40/5,19.51,7.75,15.11,1,1
1,2,Wanindu Hasaranga,16,16,57.0,430,26,18/5,16.53,7.54,13.15,1,1
2,3,Kagiso Rabada,13,13,48.0,406,23,33/4,17.65,8.45,12.52,2,0
3,4,Umran Malik,14,14,49.1,444,22,25/5,20.18,9.03,13.4,1,1
4,5,Kuldeep Yadav,14,14,49.4,419,21,14/4,19.95,8.43,14.19,2,0


### 2a. Info function analysis
The info function reports the non-null count and the dtype of every column. After running it on all of the dataframes, none of them appears to contain null values, which is very good.

It does show, however, that the High Score (HS) and BBI columns are strings, and the Average column can be read as a string too when a season uses "-" as a placeholder (in the two outputs below it is float64). This is a problem for the Average column, since it needs to be a float. BBI will always contain a slash, and High Score can contain an * to indicate not out. Neither BBI nor High Score is needed for later analysis: BBI is dropped during cleaning, and High Score only has its * removed.
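
Rather than reading the info output of each dataframe by eye, the same two checks (null counts and non-numeric columns) can be tabulated for every season at once. The sketch below uses two toy frames whose values are invented; with the real data the dictionary would hold the fourteen loaded dataframes.

```python
import pandas as pd

# Toy stand-ins for two season files (values are illustrative only)
frames = {
    2021: pd.DataFrame({"Player": ["A", "B"], "HS": ["50*", "12"], "Avg": [25.0, 6.0]}),
    2022: pd.DataFrame({"Player": ["A", "C"], "HS": ["70", "3*"], "Avg": ["35.0", "-"]}),
}

summary = pd.DataFrame(
    {
        # Total number of missing cells in each season
        "null_count": {year: int(df.isna().sum().sum()) for year, df in frames.items()},
        # Columns that pandas did not parse as numbers
        "non_numeric_columns": {
            year: ", ".join(df.select_dtypes(exclude="number").columns)
            for year, df in frames.items()
        },
    }
)
print(summary)
```

In this toy case the 2022 frame lists Avg among its non-numeric columns because of the "-" placeholder, which is the kind of mismatch the cleaning step has to handle.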

In [55]:
df_batting_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133 entries, 0 to 132
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   POS     133 non-null    int64  
 1   Player  133 non-null    object 
 2   Mat     133 non-null    int64  
 3   Inns    133 non-null    int64  
 4   NO      133 non-null    int64  
 5   Runs    133 non-null    int64  
 6   HS      133 non-null    object 
 7   Avg     133 non-null    float64
 8   BF      133 non-null    int64  
 9   SR      133 non-null    float64
 10  100     133 non-null    int64  
 11  50      133 non-null    int64  
 12  4s      133 non-null    int64  
 13  6s      133 non-null    int64  
dtypes: float64(2), int64(10), object(2)
memory usage: 14.7+ KB


In [56]:
df_bowling_2022.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   POS     103 non-null    int64  
 1   Player  103 non-null    object 
 2   Mat     103 non-null    int64  
 3   Inns    103 non-null    int64  
 4   Ov      103 non-null    float64
 5   Runs    103 non-null    int64  
 6   Wkts    103 non-null    int64  
 7   BBI     103 non-null    object 
 8   Avg     103 non-null    float64
 9   Econ    103 non-null    float64
 10  SR      103 non-null    float64
 11  4w      103 non-null    int64  
 12  5w      103 non-null    int64  
dtypes: float64(4), int64(7), object(2)
memory usage: 10.6+ KB


### 3. Fix issues in the individual season datasets
We need to address the BBI, High Score, and Average columns in the batting and bowling data. We will create a utility function for each discipline to handle the cleaning.

In [57]:
def cleanBattingData(df_batting):
    # Clean the Average column: trim whitespace, treat the "-" placeholder as 0, convert to float
    df_batting['Avg'] = df_batting['Avg'].astype(str).str.strip().replace("-", "0").astype('float64')

    # Remove the unneeded * from the High Score column; regex=False matches the * literally
    df_batting['HS'] = df_batting['HS'].str.replace("*", "", regex=False)
    
    # Dropping unneeded POS column
    df_batting.drop(['POS'], axis=1, inplace=True)
    
    return df_batting
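
Because cleanBattingData modifies the dataframe it receives, running the cell a second time fails once POS has already been dropped. A non-mutating variant is sketched below on toy data. Note that clean_batting_copy is a hypothetical name, and that it swaps the string replacement for pd.to_numeric with errors="coerce", which is a different technique from the one used above.

```python
import pandas as pd


def clean_batting_copy(df_batting):
    """Return a cleaned copy; the input dataframe is left untouched."""
    cleaned = df_batting.copy()
    # Anything that is not a number (such as "-") becomes NaN, then 0
    cleaned["Avg"] = pd.to_numeric(cleaned["Avg"], errors="coerce").fillna(0.0)
    # regex=False treats "*" as a literal character rather than a regex quantifier
    high_scores = cleaned["HS"].astype(str).str.replace("*", "", regex=False)
    cleaned["HS"] = pd.to_numeric(high_scores)
    # errors="ignore" makes the drop safe to repeat
    return cleaned.drop(columns=["POS"], errors="ignore")


# Toy input (illustrative values only)
raw = pd.DataFrame(
    {"POS": [1, 2], "Player": ["A", "B"], "HS": ["103*", "7"], "Avg": ["51.33", "-"]}
)
cleaned = clean_batting_copy(raw)
print(cleaned)
```

The trade-off is that errors="coerce" silently turns any unexpected text into 0, so it is worth checking which values were coerced before relying on it.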

In [58]:
df_clean_batting_2016 = cleanBattingData(df_batting_2016)
df_clean_batting_2017 = cleanBattingData(df_batting_2017)
df_clean_batting_2018 = cleanBattingData(df_batting_2018)
df_clean_batting_2019 = cleanBattingData(df_batting_2019)
df_clean_batting_2020 = cleanBattingData(df_batting_2020)
df_clean_batting_2021 = cleanBattingData(df_batting_2021)
df_clean_batting_2022 = cleanBattingData(df_batting_2022)


# os.path.join avoids backslash escape sequences and works on every operating system
batting_output_path = os.path.join("Outputs", "Cleaned_Datasets", "Batting")
write_csv_data(batting_output_path, "cleaned_batting_data_2016.csv", df_clean_batting_2016)
write_csv_data(batting_output_path, "cleaned_batting_data_2017.csv", df_clean_batting_2017)
write_csv_data(batting_output_path, "cleaned_batting_data_2018.csv", df_clean_batting_2018)
write_csv_data(batting_output_path, "cleaned_batting_data_2019.csv", df_clean_batting_2019)
write_csv_data(batting_output_path, "cleaned_batting_data_2020.csv", df_clean_batting_2020)
write_csv_data(batting_output_path, "cleaned_batting_data_2021.csv", df_clean_batting_2021)
write_csv_data(batting_output_path, "cleaned_batting_data_2022.csv", df_clean_batting_2022)

cleaned_batting_data_2016.csv was written to successfully!
cleaned_batting_data_2017.csv was written to successfully!
cleaned_batting_data_2018.csv was written to successfully!
cleaned_batting_data_2019.csv was written to successfully!
cleaned_batting_data_2020.csv was written to successfully!
cleaned_batting_data_2021.csv was written to successfully!
cleaned_batting_data_2022.csv was written to successfully!


In [39]:
def cleanBowlingData(df_bowling):
    # Dropping unneeded POS column
    df_bowling.drop(['POS'], axis=1, inplace=True)

    # Dropping unneeded BBI column
    df_bowling.drop(['BBI'], axis=1, inplace=True)
    
    return df_bowling

In [59]:
df_clean_bowling_2016 = cleanBowlingData(df_bowling_2016)
df_clean_bowling_2017 = cleanBowlingData(df_bowling_2017)
df_clean_bowling_2018 = cleanBowlingData(df_bowling_2018)
df_clean_bowling_2019 = cleanBowlingData(df_bowling_2019)
df_clean_bowling_2020 = cleanBowlingData(df_bowling_2020)
df_clean_bowling_2021 = cleanBowlingData(df_bowling_2021)
df_clean_bowling_2022 = cleanBowlingData(df_bowling_2022)


# os.path.join avoids backslash escape sequences and works on every operating system
bowling_output_path = os.path.join("Outputs", "Cleaned_Datasets", "Bowling")
write_csv_data(bowling_output_path, "cleaned_bowling_data_2016.csv", df_clean_bowling_2016)
write_csv_data(bowling_output_path, "cleaned_bowling_data_2017.csv", df_clean_bowling_2017)
write_csv_data(bowling_output_path, "cleaned_bowling_data_2018.csv", df_clean_bowling_2018)
write_csv_data(bowling_output_path, "cleaned_bowling_data_2019.csv", df_clean_bowling_2019)
write_csv_data(bowling_output_path, "cleaned_bowling_data_2020.csv", df_clean_bowling_2020)
write_csv_data(bowling_output_path, "cleaned_bowling_data_2021.csv", df_clean_bowling_2021)
write_csv_data(bowling_output_path, "cleaned_bowling_data_2022.csv", df_clean_bowling_2022)

cleaned_bowling_data_2016.csv was written to successfully!
cleaned_bowling_data_2017.csv was written to successfully!
cleaned_bowling_data_2018.csv was written to successfully!
cleaned_bowling_data_2019.csv was written to successfully!
cleaned_bowling_data_2020.csv was written to successfully!
cleaned_bowling_data_2021.csv was written to successfully!
cleaned_bowling_data_2022.csv was written to successfully!


### 3. Combine dataframes for all years into a single dataframe
We can combine the batting data and bowling data to represent all of the players' statistics regardless of the year they took place. This will give us a large enough dataset that we can potentially use a regression model to forecast runs for batsmen based on features and wickets for bowlers based on features.

#### 3a. Combining and formatting batting data

In [65]:
def combineDataFramesWithSeasonCount(df_list):
    # Stack all seasons once; each player is assumed to appear at most once per season
    df_stacked = pd.concat(df_list, ignore_index=True)
    # Count the rows per player, which is the number of seasons each player appeared in
    df_season_count = df_stacked.groupby('Player').size().rename('Seasons').reset_index()

    # Sum only the numeric columns so that each player ends up with a single row of totals
    df_all = df_stacked.groupby('Player').sum(numeric_only=True).reset_index()
    # We merge the season count into the total dataframe to add the Season column
    df_all_with_season_count = df_all.merge(df_season_count, on='Player', how='left')

    return df_all_with_season_count
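
To make the behaviour of combineDataFramesWithSeasonCount concrete, here is a small sketch of the same idea on two toy seasons (values invented): the numeric columns are summed per player, and the number of rows per player becomes the season count.

```python
import pandas as pd

# Two toy seasons; player "A" appears in both (illustrative values only)
season_1 = pd.DataFrame({"Player": ["A", "B"], "Runs": [100, 40], "Avg": [25.0, 10.0]})
season_2 = pd.DataFrame({"Player": ["A", "C"], "Runs": [200, 30], "Avg": [50.0, 15.0]})

stacked = pd.concat([season_1, season_2], ignore_index=True)

# One row per player: numeric columns summed, plus the number of rows (seasons) seen
totals = stacked.groupby("Player").sum(numeric_only=True)
totals["Seasons"] = stacked.groupby("Player").size()
totals = totals.reset_index()
print(totals)
```

Note that the Avg column now holds a sum of seasonal averages (75.0 for player A), which is why a formatting step has to follow.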

In [68]:
def formatBattingCombined(df_batting_combined):
    # Summing added up the seasonal Avg and SR values, so dividing by the number of seasons gives the mean of the seasonal values
    df_batting_combined['Avg'] = round(df_batting_combined['Avg'] / df_batting_combined['Seasons'],2)
    df_batting_combined['SR'] = round(df_batting_combined['SR'] / df_batting_combined['Seasons'],2)
    
    return df_batting_combined
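
Dividing the summed Avg and SR by the number of seasons gives an unweighted mean of the seasonal values, so a short season counts as much as a long one. An alternative is to recompute both metrics from the summed totals, using the standard definitions (batting average = runs / dismissals, strike rate = 100 * runs / balls faced). The sketch below compares the two on one invented player; it is an illustration, not a replacement for the function above.

```python
import pandas as pd

# One toy player after combining two seasons (illustrative values only):
# season 1: 100 runs, 5 innings, 1 not out, 80 balls  -> Avg 25.0, SR 125.0
# season 2: 300 runs, 10 innings, 0 not out, 200 balls -> Avg 30.0, SR 150.0
combined = pd.DataFrame(
    {
        "Player": ["A"],
        "Inns": [15],
        "NO": [1],
        "Runs": [400],
        "BF": [280],
        "Avg": [55.0],   # sum of the seasonal averages
        "SR": [275.0],   # sum of the seasonal strike rates
        "Seasons": [2],
    }
)

# Approach used in the notebook: mean of the seasonal averages
mean_of_seasonal_avg = (combined["Avg"] / combined["Seasons"]).round(2)

# Alternative: recompute from totals; players never dismissed get NaN
dismissals = combined["Inns"] - combined["NO"]
career_avg = (combined["Runs"] / dismissals.where(dismissals > 0)).round(2)
career_sr = (100 * combined["Runs"] / combined["BF"]).round(2)

print(mean_of_seasonal_avg.iloc[0], career_avg.iloc[0], career_sr.iloc[0])
```

Which version is preferable depends on whether each season should carry equal weight or each innings should.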

In [69]:
batting_list = [df_clean_batting_2016, df_clean_batting_2017, df_clean_batting_2018, df_clean_batting_2019,
                df_clean_batting_2020, df_clean_batting_2021, df_clean_batting_2022]

df_batting_combined = combineDataFramesWithSeasonCount(batting_list)
df_batting_combined_formatted = formatBattingCombined(df_batting_combined)
write_csv_data(os.path.join("Outputs", "Combined"), "combined_and_formatted_batting.csv", df_batting_combined_formatted)

combined_and_formatted_batting.csv was written to successfully!


#### 3b. Combining and formatting bowling data

In [70]:
def formatBowlingCombined(df_bowling_combined):
    # Summing added up the seasonal Avg, SR and Econ values, so dividing by the number of seasons gives the mean of the seasonal values
    df_bowling_combined['Avg'] = round(df_bowling_combined['Avg'] / df_bowling_combined['Seasons'],2)
    df_bowling_combined['SR'] = round(df_bowling_combined['SR'] / df_bowling_combined['Seasons'],2)
    df_bowling_combined['Econ'] = round(df_bowling_combined['Econ'] / df_bowling_combined['Seasons'],2)
    
    return df_bowling_combined

In [71]:
bowling_list = [df_clean_bowling_2016, df_clean_bowling_2017, df_clean_bowling_2018, df_clean_bowling_2019,
                df_clean_bowling_2020, df_clean_bowling_2021, df_clean_bowling_2022]

df_bowling_combined = combineDataFramesWithSeasonCount(bowling_list)
df_bowling_combined_formatted = formatBowlingCombined(df_bowling_combined)
write_csv_data(os.path.join("Outputs", "Combined"), "combined_and_formatted_bowling.csv", df_bowling_combined_formatted)

combined_and_formatted_bowling.csv was written to successfully!


### 4. Merging the dataframes
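Before performing analysis on the data, I think it is best to merge the per-season batting dataframes into one dataframe, and the per-season bowling dataframes into another, so that each player has a single row with one set of columns per season. This better represents a player's performance over the course of the years.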

### 4a. Merging batting data

In [None]:
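# batting_2016 ... batting_2022 are assumed to hold the raw per-season batting dataframes (POS and HS still present)
# Merge year 2016 and 2017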
batting_data_2016_2017_merged = batting_2016.merge(batting_2017, how='outer', on='Player', suffixes=('_2016', '_2017'))
# Merge year 2018 and 2019
batting_data_2018_2019_merged = batting_2018.merge(batting_2019, how='outer', on='Player', suffixes=('_2018', '_2019'))
# Merge year 2016, 2017, 2018, and 2019
batting_data_2016_2019_merged = batting_data_2016_2017_merged.merge(batting_data_2018_2019_merged, how='outer', on='Player')

# Merge year 2020 and 2021
batting_data_2020_2021_merged = batting_2020.merge(batting_2021, how='outer', on='Player', suffixes=('_2020', '_2021'))
# Add suffix to 2022 before merging it
batting_2022 = batting_2022.add_suffix('_2022')
batting_2022 = batting_2022.rename(columns={"Player_2022": "Player"})
# Merge year 2020, 2021, and 2022
batting_data_2020_2022_merged = batting_data_2020_2021_merged.merge(batting_2022, how='outer', on='Player')

# Merge the 2016-2019 and 2020-2022 halves into a single dataframe
batting_data_merged = batting_data_2016_2019_merged.merge(batting_data_2020_2022_merged, how='outer', on='Player')
# Write the merged dataframe to a csv file in our output folder
write_csv_data("Outputs", "batting_data_merged.csv", batting_data_merged)
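
The pairwise merges above can also be expressed as a single fold over a list of suffixed dataframes. The sketch below shows the idea with functools.reduce on three toy seasons. Note that suffix_season is a hypothetical helper and the values are invented.

```python
from functools import reduce

import pandas as pd

# Toy per-season frames keyed by year (illustrative values only)
seasons = {
    2020: pd.DataFrame({"Player": ["A", "B"], "Runs": [100, 40]}),
    2021: pd.DataFrame({"Player": ["A", "C"], "Runs": [200, 30]}),
    2022: pd.DataFrame({"Player": ["B", "C"], "Runs": [10, 60]}),
}


def suffix_season(df, year):
    """Append _<year> to every column except the Player merge key."""
    return df.rename(columns=lambda col: col if col == "Player" else f"{col}_{year}")


suffixed = [suffix_season(df, year) for year, df in seasons.items()]

# Outer-merge the frames one after another on Player
merged_wide = reduce(
    lambda left, right: left.merge(right, how="outer", on="Player"), suffixed
)
print(merged_wide)
```

Suffixing every frame up front also removes the special case for 2022, since no merge has to supply suffixes of its own.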

### 4b. Merging bowling data

In [None]:
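# bowling_2016 ... bowling_2022 are assumed to hold the raw per-season bowling dataframes (POS and BBI still present)
# Merge year 2016 and 2017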
bowling_data_2016_2017_merged = bowling_2016.merge(bowling_2017, how='outer', on='Player', suffixes=('_2016', '_2017'))
# Merge year 2018 and 2019
bowling_data_2018_2019_merged = bowling_2018.merge(bowling_2019, how='outer', on='Player', suffixes=('_2018', '_2019'))
# Merge year 2016, 2017, 2018, and 2019
bowling_data_2016_2019_merged = bowling_data_2016_2017_merged.merge(bowling_data_2018_2019_merged, how='outer', on='Player')

# Merge year 2020 and 2021
bowling_data_2020_2021_merged = bowling_2020.merge(bowling_2021, how='outer', on='Player', suffixes=('_2020', '_2021'))
# Add suffix to 2022 before merging it
bowling_2022 = bowling_2022.add_suffix('_2022')
bowling_2022 = bowling_2022.rename(columns={"Player_2022": "Player"})
# Merge year 2020, 2021, and 2022
bowling_data_2020_2022_merged = bowling_data_2020_2021_merged.merge(bowling_2022, how='outer', on='Player')

# Merge the 2016-2019 and 2020-2022 halves into a single dataframe
bowling_data_merged = bowling_data_2016_2019_merged.merge(bowling_data_2020_2022_merged, how='outer', on='Player')
# Write the merged dataframe to a csv file in our output folder
write_csv_data("Outputs", "bowling_data_merged.csv", bowling_data_merged)

### 5. Dropping unneeded columns
For the batting data, we do not need the POS field, since merging leaves one copy of it per season. For bowling, we do not need POS or Best Bowling in an Innings (BBI), since the date value isn't relevant anymore.

### 5a. Drop unneeded columns from merged datasets

In [None]:
batting_data_merged.drop(['POS_2016', 'POS_2017', 'POS_2018', 'POS_2019', 'POS_2020', 'POS_2021', 'POS_2022'], axis=1, inplace=True)

# Calling function to write a csv file to our output folder
write_csv_data("Outputs", "batting_data_merged_dropped_unneeded_columns.csv", batting_data_merged)

In [None]:
bowling_data_merged.drop(['POS_2016', 'POS_2017', 'POS_2018', 'POS_2019', 'POS_2020', 'POS_2021', 'POS_2022'], axis=1, inplace=True)
bowling_data_merged.drop(['BBI_2016', 'BBI_2017', 'BBI_2018', 'BBI_2019', 'BBI_2020', 'BBI_2021', 'BBI_2022'], axis=1, inplace=True)

# Calling function to write a csv file to our output folder
write_csv_data("Outputs", "bowling_data_merged_dropped_unneeded_columns.csv", bowling_data_merged)

### 6. Extracting more detailed information about each dataset
We want to get a better feel for the data so we will use the describe method and info method again.

### 6a. Describe method on the merged datasets

In [None]:
batting_data_merged.describe()

Describe shows basic statistics for each numeric field in the merged datasets, including the count, mean, standard deviation, minimum, quartiles, and maximum.

In [None]:
bowling_data_merged.describe()

### 6b. Describe method on the concatenated datasets

We also want to describe the concatenated datasets to check that their statistics look reasonable.

In [None]:
batting_data_concatenated.describe()

In [None]:
bowling_data_concatenated.describe()

### 6c. Info function for merged datasets

We can use info again, this time with show_counts enabled, to see how many non-null values each column has. Each batting statistic column has about 200 null values, which the outer merge creates for players with no row in that season, so we can impute them with 0. We also need to convert all High Scores to a float.

In [None]:
batting_data_merged.info(show_counts=True)

Bowling data does not have any issues with regards to data types but we need to fill the empty values with 0s.

In [None]:
bowling_data_merged.info(show_counts=True)

### 6d. Info function for concatenated datasets

In [None]:
batting_data_concatenated.info()

In [None]:
bowling_data_concatenated.info()

### 7. Fix data type issues
We need to convert the object fields that hold numbers (the High Scores and the 2022 Average) to float64 in the batting data.

High Scores are objects rather than float64 because they can contain an * character, which indicates that the batter was not out. Since the NO field already tracks not outs, we simply need to remove all occurrences of * in the high score before converting it to a float.

### 7a. Fix issues in merged datasets

In [None]:
# Remove all occurrences of * in each season's high score, then convert to a float
# regex=False matches the * literally instead of treating it as a regex quantifier
for year in range(2016, 2023):
    hs_column = f"HS_{year}"
    batting_data_merged[hs_column] = (
        batting_data_merged[hs_column].str.replace("*", "", regex=False).astype('float64')
    )

# The 2022 average uses "-" as a placeholder: trim whitespace, replace it with 0, then convert to a float
batting_data_merged['Avg_2022'] = (
    batting_data_merged['Avg_2022'].str.strip().replace("-", "0").astype('float64')
)

# Calling function to write a csv file to our output folder
write_csv_data("Outputs", "batting_data_merged_fixed_data_type_mismatches.csv", batting_data_merged)

Rerunning info method to verify if the data type conversions were successful.

In [None]:
batting_data_merged.info(show_counts=True)

Rerunning info method to verify everything looks good.

In [None]:
batting_data_concatenated.info()

### 8. Plotting histograms for key attributes in merged batting data
Instead of plotting histograms for all fields at once, I decided to plot similar fields for all years together. Runs come first, followed by matches, average, and high scores.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Batting runs
batting_data_merged[['Runs_2016', 'Runs_2017', 'Runs_2018', 'Runs_2019', 'Runs_2020', 'Runs_2021', 'Runs_2022']].hist(bins=25, figsize=(20,15))
plt.show()
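
Every histogram cell in this section selects the same statistic across the seven seasons, so the column lists can be generated instead of typed out. A minimal sketch, where season_columns is a hypothetical helper:

```python
YEARS = range(2016, 2023)


def season_columns(stat, years=YEARS):
    """Build names such as 'Runs_2016' ... 'Runs_2022' for one statistic."""
    return [f"{stat}_{year}" for year in years]


runs_columns = season_columns("Runs")
print(runs_columns)

# With the merged dataframe in scope, the plotting call would then read:
# batting_data_merged[season_columns("Runs")].hist(bins=25, figsize=(20, 15))
```

The same helper would cover the Mat, Avg, HS, Wkts, and Econ selections used in the other cells.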

In [None]:
# Batting matches
batting_data_merged[['Mat_2016', 'Mat_2017', 'Mat_2018', 'Mat_2019', 'Mat_2020', 'Mat_2021', 'Mat_2022']].hist(bins=25, figsize=(20,15))
plt.show()

In [None]:
# Batting average
batting_data_merged[['Avg_2016', 'Avg_2017', 'Avg_2018', 'Avg_2019', 'Avg_2020', 'Avg_2021', 'Avg_2022']].hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
# Batting high scores
batting_data_merged[['HS_2016', 'HS_2017', 'HS_2018', 'HS_2019', 'HS_2020', 'HS_2021', 'HS_2022']].hist(bins=50, figsize=(20,15))
plt.show()

### 8a. Plotting histograms for key attributes in merged bowling data
As with batting, I decided to plot similar fields for all years together. Wickets come first, followed by average, economy rate, and matches.

In [None]:
# Bowling wickets
bowling_data_merged[['Wkts_2016', 'Wkts_2017', 'Wkts_2018', 'Wkts_2019', 'Wkts_2020', 'Wkts_2021', 'Wkts_2022']].hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
# Bowling average
bowling_data_merged[['Avg_2016', 'Avg_2017', 'Avg_2018', 'Avg_2019', 'Avg_2020', 'Avg_2021', 'Avg_2022']].hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
# Bowling economy rates
bowling_data_merged[['Econ_2016', 'Econ_2017', 'Econ_2018', 'Econ_2019', 'Econ_2020', 'Econ_2021', 'Econ_2022']].hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
# Bowling matches
bowling_data_merged[['Mat_2016', 'Mat_2017', 'Mat_2018', 'Mat_2019', 'Mat_2020', 'Mat_2021', 'Mat_2022']].hist(bins=50, figsize=(20,15))
plt.show()

### 9. Clean the data

We need to impute 0s for all nulls in both dataframes.

### 9a. Clean the merged datasets

In [None]:
batting_data_merged.fillna(0, inplace=True)
bowling_data_merged.fillna(0, inplace=True)

# Calling function to write a csv file to our output folder
write_csv_data("Outputs", "bowling_data_merged_filled_all_blanks.csv", bowling_data_merged)
write_csv_data("Outputs", "batting_data_merged_filled_all_blanks.csv", batting_data_merged)
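
In the merged data a null means that the player has no row for that season, so after filling with 0 a player who did not take part looks the same as one who played and scored nothing. One option is to record participation before filling. The sketch below does this on toy data. The Played_2021 column name is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Toy merged frame (illustrative values only)
merged_demo = pd.DataFrame(
    {
        "Player": ["A", "B"],
        "Mat_2021": [14.0, np.nan],
        "Runs_2021": [0.0, np.nan],  # A played but scored no runs; B has no 2021 row
    }
)

# Record participation before the nulls disappear
merged_demo["Played_2021"] = merged_demo["Mat_2021"].notna()
filled_demo = merged_demo.fillna(0)
print(filled_demo)
```

After filling, both players show 0 runs for 2021, and only the indicator column still tells them apart.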

### 9b. Clean the concatenated datasets

In [None]:
batting_data_concatenated.fillna(0, inplace=True)
bowling_data_concatenated.fillna(0, inplace=True)

# Calling function to write a csv file to our output folder
write_csv_data("Outputs", "bowling_data_concatenated_filled_all_blanks.csv", bowling_data_concatenated)
write_csv_data("Outputs", "batting_data_concatenated_filled_all_blanks.csv", batting_data_concatenated)

### 10. Create training and test sets
Setting the test set aside now keeps us from becoming too familiar with that portion of the data, which could otherwise bias our choice of model later on.

#### 10a. Doing the split on concatenated datasets

In [None]:
from sklearn.model_selection import train_test_split

batting_data_concatenated_train_set, batting_data_concatenated_test_set = train_test_split(batting_data_concatenated, test_size=0.2, random_state=42)

batting_data_concatenated_train_set.info()

In [None]:
bowling_data_concatenated_train_set, bowling_data_concatenated_test_set = train_test_split(bowling_data_concatenated, test_size=0.2, random_state=42)

bowling_data_concatenated_train_set.info()
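
If the concatenated data holds one row per player per season (an assumption), a purely random split can place the same player in both the training and the test set. A group-aware split keeps all rows of a player on one side. The sketch below shows the idea with NumPy and pandas only, on invented data; scikit-learn's GroupShuffleSplit offers the same behaviour.

```python
import numpy as np
import pandas as pd

# Toy concatenated data: some players appear in more than one season
demo = pd.DataFrame(
    {
        "Player": ["A", "A", "B", "B", "C", "D", "E", "E"],
        "Runs": [100, 200, 40, 10, 30, 60, 5, 15],
    }
)

rng = np.random.default_rng(42)
players = demo["Player"].unique().tolist()

# Hold out roughly 20% of the players (at least one), not 20% of the rows
n_test_players = max(1, int(round(0.2 * len(players))))
test_players = rng.choice(players, size=n_test_players, replace=False)

is_test = demo["Player"].isin(test_players)
train_set, test_set = demo[~is_test], demo[is_test]
print(len(train_set), len(test_set))
```

The test share is then measured in players rather than rows, so the row counts will not be exactly 80/20.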

#### 10b. Doing the split on merged datasets

In [None]:
batting_data_merged_train_set, batting_data_merged_test_set = train_test_split(batting_data_merged, test_size=0.2, random_state=42)

batting_data_merged_train_set.info()

In [None]:
bowling_data_merged_train_set, bowling_data_merged_test_set = train_test_split(bowling_data_merged, test_size=0.2, random_state=42)

bowling_data_merged_train_set.info()

### 11. Visualizing and gaining insights on the data

First we will copy the training sets so we don't manipulate the original set.

In [None]:
batting_data_concatenated_copy = batting_data_concatenated_train_set.copy()
bowling_data_concatenated_copy = bowling_data_concatenated_train_set.copy()

Next, we want to create some visualizations, such as scatter plots, to get a better feel for the data. First we will plot runs against balls faced. As expected, players who face more balls tend to score more runs.

In [None]:
batting_data_concatenated_copy.plot(kind='scatter', x='BF', y='Runs', alpha=0.1)

In [None]:
batting_data_concatenated_copy["Avg"].describe()

Next we will create a scatter plot of Runs to Average.

In [None]:
batting_data_concatenated_copy.plot(kind='scatter', x='Avg', y='Runs', alpha=0.1)

### 12. Running correlation functions to identify highly correlated features
Here we want to run a correlation function to see how strongly each attribute is correlated with every other attribute.

In [None]:
# numeric_only=True leaves out non-numeric columns such as Player
batting_concatenated_corr_matrix = batting_data_concatenated_copy.corr(numeric_only=True)

In [None]:
# numeric_only=True leaves out non-numeric columns such as Player
bowling_concatenated_corr_matrix = bowling_data_concatenated_copy.corr(numeric_only=True)

In [None]:
batting_concatenated_corr_matrix["Runs"].sort_values(ascending=False)

In [None]:
bowling_concatenated_corr_matrix["Wkts"].sort_values(ascending=False)
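
As a small illustration of what the correlation cells in this section compute, the sketch below builds a correlation matrix on a toy frame (values invented) and ranks the attributes by their correlation with Runs. Recent pandas versions raise an error when corr meets a text column, so numeric_only=True is passed explicitly.

```python
import pandas as pd

# Toy batting data (illustrative values only)
demo = pd.DataFrame(
    {
        "Player": ["A", "B", "C", "D"],
        "Runs": [100, 200, 300, 400],
        "BF": [80, 150, 240, 310],
        "Mat": [14, 10, 16, 12],
    }
)

# numeric_only=True skips the Player column instead of trying to correlate strings
corr_matrix = demo.corr(numeric_only=True)
runs_corr = corr_matrix["Runs"].sort_values(ascending=False)
print(runs_corr)
```

In this toy frame BF tracks Runs closely while Mat does not, so BF ranks directly below Runs itself.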