# Exploratory Data Analysis (EDA) Report
## Author: Charan Kumar Pathakamuri | Project: Cricket Query AI

## 1. Introduction
This document outlines the initial steps of Exploratory Data Analysis (EDA) performed on the Asia Cup cricket datasets. The primary goal of this phase is to load the data, understand its structure, and perform essential cleaning to prepare it for further analysis and for building the Text-to-SQL system.

## 2. Data Loading and Inspection
### 2.1. Loading Datasets
The first step was to load all eight provided CSV files into the environment using the pandas library. Each file was loaded into a separate DataFrame, corresponding to a table in the future database.

In [80]:
import pandas as pd 
import numpy as np
import plotly.express as px

In [81]:
asia_cup = pd.read_csv("C:/Users/cnaid/Downloads/asiacup.csv")
champion = pd.read_csv("C:/Users/cnaid/Downloads/champion.csv")

In [77]:
batsman_odi = pd.read_csv("C:/Users/cnaid/Downloads/batsman data odi.csv")
bowler_odi = pd.read_csv("C:/Users/cnaid/Downloads/bowler data odi.csv")
wicketkeeper_odi = pd.read_csv("C:/Users/cnaid/Downloads/wicketkeeper data odi.csv")

In [82]:

# The two T20I file names below are assumed from the ODI naming pattern.
batsman_t20i = pd.read_csv("C:/Users/cnaid/Downloads/batsman data t20i.csv")
bowler_t20i = pd.read_csv("C:/Users/cnaid/Downloads/bowler data t20i.csv")
wicketkeeper_t20i = pd.read_csv("C:/Users/cnaid/Downloads/wicketkeeper data t20i.csv")
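
With eight similar file names, it is easy to point a variable at the wrong CSV. A minimal sketch of one way to reduce that risk is to keep the table-to-file mapping in a single dictionary and load everything through one function. The `data` folder, the `load_tables` helper, and the two T20I file names marked below are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical data folder; adjust to the local setup.
DATA_DIR = Path("data")

# Table name -> CSV file name, kept in one place so mismatches are easy to spot.
FILES = {
    "asia_cup": "asiacup.csv",
    "champion": "champion.csv",
    "batsman_odi": "batsman data odi.csv",
    "bowler_odi": "bowler data odi.csv",
    "wicketkeeper_odi": "wicketkeeper data odi.csv",
    "batsman_t20i": "batsman data t20i.csv",   # assumed file name
    "bowler_t20i": "bowler data t20i.csv",     # assumed file name
    "wicketkeeper_t20i": "wicketkeeper data t20i.csv",
}


def load_tables(data_dir, files):
    """Read each CSV into a DataFrame keyed by table name; fail early if a file is missing."""
    tables = {}
    for name, filename in files.items():
        path = Path(data_dir) / filename
        if not path.exists():
            raise FileNotFoundError(f"{name}: expected file {path}")
        tables[name] = pd.read_csv(path)
    return tables


# Usage (requires the CSV files to be present in DATA_DIR):
# tables = load_tables(DATA_DIR, FILES)
# asia_cup = tables["asia_cup"]
```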

### 2.2. Initial Data Inspection
To get a first look at the structure and content of the datasets, the .head() method was used. This displays the first five rows of a DataFrame, providing a quick overview of the columns and the types of data they contain.

In [83]:
asia_cup.head()

Unnamed: 0,Team,Opponent,Format,Ground,Year,Toss,Selection,Run Scored,Wicket Lost,Fours,Sixes,Extras,Run Rate,Avg Bat Strike Rate,Highest Score,Wicket Taken,Given Extras,Highest Individual wicket,Player Of The Match,Result
0,Pakistan,Sri Lanka,ODI,Sharjah,1984,Lose,Batting,187.0,9.0,9.0,3.0,21.0,4.06,52.04,47.0,5.0,26.0,2.0,Roy Dias,Lose
1,Sri Lanka,Pakistan,ODI,Sharjah,1984,Win,Bowling,190.0,5.0,11.0,1.0,26.0,4.36,68.51,57.0,9.0,21.0,3.0,Roy Dias,Win
2,India,Sri Lanka,ODI,Sharjah,1984,Win,Bowling,97.0,0.0,9.0,0.0,14.0,4.47,60.48,51.0,10.0,8.0,3.0,Surinder Khanna,Win
3,Sri Lanka,India,ODI,Sharjah,1984,Lose,Batting,96.0,10.0,7.0,0.0,8.0,2.34,25.74,38.0,0.0,14.0,0.0,Surinder Khanna,Lose
4,India,Pakistan,ODI,Sharjah,1984,Win,Batting,188.0,4.0,13.0,3.0,17.0,4.08,60.21,56.0,10.0,5.0,3.0,Surinder Khanna,Win


In [84]:
champion.head()

Unnamed: 0,Year,Host,No Of Team,Champion,Runner Up,Player Of The Series,Highest Run Scorer,Highest Wicket Taker
0,1984,UAE,3,India,Sri Lanka,Surinder Khanna,Surinder Khanna,Ravi Shastri
1,1986,Sri Lanka,3,Sri Lanka,Pakistan,Arjuna Ranatunga,Arjuna Ranatunga,Abdul Qadir
2,1988,Bangladesh,4,India,Sri Lanka,Navjot Sidhu,Ijaz Ahmed,Arshad Ayub
3,1990,India,3,India,Sri Lanka,Not Awarded,Arjuna Ranatunga,Kapil Dev
4,1995,UAE,4,India,Sri Lanka,Navjot Sidhu,Sachin Tendulkar,Anil Kumble


In [85]:
batsman_odi.head()

Unnamed: 0,Player Name,Country,Time Period,Matches,Played,Not Outs,Runs,Highest Score,Batting Average,Balls Faced,Strike Rate,Centuries,Fifties,Ducks,Fours,Sixes
0,ST Jayasuriya,Sri Lanka,1990-2008,25,24,1,1220,130,53.04,1190,102.52,6,3,1,139,23
1,KC Sangakkara,Sri Lanka,2004-2014,24,23,1,1075,121,48.86,1272,84.51,4,8,2,107,7
2,SR Tendulkar,India,1990-2012,23,21,2,971,114,51.1,1136,85.47,2,7,0,108,8
3,Shoaib Malik,Pakistan,2000-2018,17,15,3,786,143,65.5,867,90.65,3,3,0,76,8
4,RG Sharma,India,2008-2018,22,21,5,745,111,46.56,877,84.94,1,6,1,60,17


In [86]:
bowler_odi.head()

Unnamed: 0,Player Name,Country,Time Period,Matches,Played,Overs,Maiden Overs,Runs,Wickets,Best Figure,Bowling Average,Economy Rate,Strike Rate,Four Wickets,Five Wickets
0,M Muralidaran,Sri Lanka,1995-2010,24,24,230.2,13,865,30,5/31,28.83,3.75,46.0,1,1
1,SL Malinga,Sri Lanka,2004-2018,14,14,128.1,6,596,29,5/34,20.55,4.65,26.5,1,3
2,BAW Mendis,Sri Lanka,2008-2014,8,8,68.0,5,271,26,6/13,10.42,3.98,15.6,2,2
3,Saeed Ajmal,Pakistan,2008-2014,12,12,115.0,6,485,25,3/26,19.4,4.21,27.6,0,0
4,WPUJC Vaas,Sri Lanka,1995-2008,19,19,152.2,20,639,23,3/30,27.78,4.19,39.7,0,0


In [87]:
wicketkeeper_odi.head()

Unnamed: 0,Player Name,Country,Time Period,Matches,Played,Dismissals,Catches,Stumpings,Maximum Dismissals
0,MS Dhoni,India,2008-2018,19,19,36,25,11,5
1,KC Sangakkara,Sri Lanka,2004-2014,24,24,36,27,9,4
2,Moin Khan,Pakistan,1995-2004,14,13,17,12,5,3
3,Mushfiqur Rahim,Bangladesh,2008-2018,21,17,17,14,3,4
4,DSBP Kuruppu,Sri Lanka,1984-1988,9,9,14,12,2,4


In [131]:
asia_cup.info()

<class 'pandas.core.frame.DataFrame'>
Index: 252 entries, 0 to 253
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Team                       252 non-null    object 
 1   Opponent                   252 non-null    object 
 2   Format                     252 non-null    object 
 3   Ground                     252 non-null    object 
 4   Year                       252 non-null    int64  
 5   Toss                       252 non-null    object 
 6   Selection                  252 non-null    object 
 7   Run Scored                 252 non-null    float64
 8   Wicket Lost                252 non-null    float64
 9   Fours                      252 non-null    float64
 10  Sixes                      252 non-null    float64
 11  Extras                     252 non-null    float64
 12  Run Rate                   252 non-null    float64
 13  Avg Bat Strike Rate        252 non-null    float64
 14 

In [132]:
champion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Year                  15 non-null     int64 
 1   Host                  15 non-null     object
 2   No Of Team            15 non-null     int64 
 3   Champion              15 non-null     object
 4   Runner Up             15 non-null     object
 5   Player Of The Series  15 non-null     object
 6   Highest Run Scorer    15 non-null     object
 7   Highest Wicket Taker  15 non-null     object
dtypes: int64(2), object(6)
memory usage: 1.1+ KB


In [133]:
batsman_odi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Player Name      50 non-null     object 
 1   Country          50 non-null     object 
 2   Time Period      50 non-null     object 
 3   Matches          50 non-null     int64  
 4   Played           50 non-null     int64  
 5   Not Outs         50 non-null     int64  
 6   Runs             50 non-null     int64  
 7   Highest Score    50 non-null     int64  
 8   Batting Average  50 non-null     float64
 9   Balls Faced      50 non-null     int64  
 10  Strike Rate      50 non-null     float64
 11  Centuries        50 non-null     int64  
 12  Fifties          50 non-null     int64  
 13  Ducks            50 non-null     int64  
 14  Fours            50 non-null     int64  
 15  Sixes            50 non-null     int64  
dtypes: float64(2), int64(11), object(3)
memory usage: 6.4+ KB


## 3. Data Cleaning: Handling Missing Values
### 3.1. Null Value Identification
A helper function, checknull(), was created to systematically check for missing (null) values in any given DataFrame. This function uses the .isnull().sum() method to return a count of nulls for each column.

In [88]:
def checknull(df):
    return df.isnull().sum()


In [89]:
checknull(asia_cup)


Team                         0
Opponent                     0
Format                       0
Ground                       0
Year                         0
Toss                         0
Selection                    0
Run Scored                   2
Wicket Lost                  2
Fours                        2
Sixes                        2
Extras                       2
Run Rate                     2
Avg Bat Strike Rate          2
Highest Score                2
Wicket Taken                 2
Given Extras                 2
Highest Individual wicket    2
Player Of The Match          2
Result                       0
dtype: int64

These missing values come from a match that was called off due to weather conditions. The two affected rows (one per team) were dropped because they could distort the analysis. The player statistics do not include this match either.

In [90]:
asia_cup = asia_cup.dropna()
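
As a hedged illustration of what this step does, the sketch below runs `dropna` on a small toy table and reports how many rows were removed. The toy values and the `"No Result"` label are illustrative assumptions, not taken from the dataset. Passing `subset=` is an optional refinement that limits the check to the columns expected to be empty for an abandoned match.

```python
import pandas as pd

# Toy stand-in for the match table; values are illustrative only.
matches = pd.DataFrame({
    "Team": ["A", "B", "C", "D"],
    "Run Scored": [187.0, 190.0, None, None],
    "Result": ["Lose", "Win", "No Result", "No Result"],
})

before = len(matches)
# dropna() removes rows (not columns) by default; subset narrows which columns are checked.
cleaned = matches.dropna(subset=["Run Scored"])
removed = before - len(cleaned)

print(f"removed {removed} of {before} rows")  # removed 2 of 4 rows
```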


In [91]:
checknull(asia_cup)

Team                         0
Opponent                     0
Format                       0
Ground                       0
Year                         0
Toss                         0
Selection                    0
Run Scored                   0
Wicket Lost                  0
Fours                        0
Sixes                        0
Extras                       0
Run Rate                     0
Avg Bat Strike Rate          0
Highest Score                0
Wicket Taken                 0
Given Extras                 0
Highest Individual wicket    0
Player Of The Match          0
Result                       0
dtype: int64

In [92]:
checknull(champion)

Year                    0
Host                    0
No Of Team              0
Champion                0
Runner Up               0
Player Of The Series    0
Highest Run Scorer      0
Highest Wicket Taker    0
dtype: int64

In [93]:
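# A notebook cell displays only its last expression, so only the
# wicketkeeper_t20i result appears in the output below.
checknull(batsman_odi)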
checknull(batsman_t20i)
checknull(bowler_odi)
checknull(bowler_t20i)
checknull(wicketkeeper_odi)
checknull(wicketkeeper_t20i)

Player Name           0
Country               0
Time Period           0
Matches               0
Played                0
Dismissals            0
Catches               0
Stumpings             0
Maximum Dismissals    0
dtype: int64
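
Because a notebook cell displays only the value of its last expression, a single cell with several `checknull()` calls shows just one result. A minimal sketch of a way to report every table at once follows. The `null_report` helper and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd


def null_report(frames):
    """Map each table name to the columns that contain nulls, with their counts."""
    report = {}
    for name, df in frames.items():
        counts = df.isnull().sum()
        report[name] = {col: int(n) for col, n in counts[counts > 0].items()}
    return report


# Toy frames standing in for the real tables.
demo = {
    "complete": pd.DataFrame({"Matches": [25, 24]}),
    "with_gap": pd.DataFrame({"Run Scored": [187.0, None], "Result": ["Lose", "Win"]}),
}

print(null_report(demo))  # {'complete': {}, 'with_gap': {'Run Scored': 1}}
```

In the notebook, the same function could be called with a dictionary of the six player DataFrames; an empty dictionary for a table means no nulls were found.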

## 4. Visualizations 

First, we look at how ODI batting has evolved: have batters become more aggressive, or do they still play the bowlers with caution?

In [95]:
odi_matches = asia_cup[asia_cup['Format'] == 'ODI'].copy()
avg_run_rate_per_year = odi_matches.groupby('Year')['Run Rate'].mean().reset_index()
avg_run_rate_per_year.rename(columns={'Run Rate': 'Average Run Rate'}, inplace=True)
fig = px.bar(
    avg_run_rate_per_year,
    x='Year',
    y='Average Run Rate',
    title='Average Run Rate per Year in ODI Matches',
    labels={'Year': 'Year', 'Average Run Rate': 'Average Run Rate'},
    text='Average Run Rate'  # Display the value on each bar
)

fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.update_layout(
    xaxis_title='Match Year',
    yaxis_title='Average Run Rate',
    xaxis={'type': 'category'}
)
fig.show()





The chart shows a clear upward trend in run rate. Batsmen have become more aggressive, and run scoring has increased in almost every tournament.
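
One way to put a number on such a trend is to fit a straight line to the yearly averages. The sketch below uses `numpy.polyfit` on made-up values. The figures are hypothetical and only show the shape of the calculation; in the notebook, the `avg_run_rate_per_year` DataFrame computed above would take their place.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly averages, NOT the real results.
avg_run_rate_per_year = pd.DataFrame({
    "Year": [1984, 1988, 1995, 2004, 2018],
    "Average Run Rate": [3.9, 4.1, 4.6, 4.9, 5.2],
})

# Least-squares line: run rate ~ slope * year + intercept.
slope, intercept = np.polyfit(
    avg_run_rate_per_year["Year"],
    avg_run_rate_per_year["Average Run Rate"],
    deg=1,
)

print(f"change per decade: {slope * 10:.2f} runs per over")
```

A positive slope supports the upward trend read off the chart; with only a handful of tournaments, it should be treated as a rough indication rather than a precise estimate.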

How about the bowlers?

In [126]:
odi_matches = asia_cup[asia_cup['Format'] == 'ODI'].copy()
avg_wickets_per_year = odi_matches.groupby('Year')['Wicket Taken'].mean().reset_index()
avg_wickets_per_year.rename(columns={'Wicket Taken': 'Average Wickets'}, inplace=True)
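# Each match appears as two rows (one per team), so doubling the
# per-team mean gives the average number of wickets per match.
avg_wickets_per_year["Average Wickets"] = avg_wickets_per_year['Average Wickets'] * 2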
fig = px.bar(
    avg_wickets_per_year,
    x='Year',
    y='Average Wickets',
    title='Average Wickets Taken per Year in ODI Matches',
    labels={'Year': 'Year', 'Average Wickets': 'Average Wickets Taken'},
    text='Average Wickets'  # This will display the value on top of each bar
)
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.update_layout(
    xaxis_title='Match Year',
    yaxis_title='Average Number of Wickets Taken',
    xaxis={'type': 'category'}
)
fig.show()

Bowlers also appear to have become more skillful at taking wickets. But that is only one side of the coin: when batsmen become more aggressive, they attempt big shots more often, and the risk of getting out increases too.
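
The wickets chart multiplies the yearly mean of `Wicket Taken` by 2. A short sketch of why that gives wickets per match: the match table holds one row per team, so every match appears twice. The example below reuses the first four rows shown in the `asia_cup.head()` output (two matches from 1984) and compares the doubled mean with a direct per-match calculation.

```python
import pandas as pd

# First four rows of the match table: two matches, one row per team.
rows = pd.DataFrame({
    "Year": [1984, 1984, 1984, 1984],
    "Team": ["Pakistan", "Sri Lanka", "India", "Sri Lanka"],
    "Wicket Taken": [5.0, 9.0, 10.0, 0.0],
})

# Mean per row is a per-team figure; doubling it gives wickets per match.
per_match_doubled = rows.groupby("Year")["Wicket Taken"].mean() * 2

# Direct version: total wickets divided by the number of matches (rows / 2).
by_year = rows.groupby("Year")["Wicket Taken"]
per_match_direct = by_year.sum() / (by_year.size() / 2)

print(per_match_doubled.loc[1984], per_match_direct.loc[1984])  # 12.0 12.0
```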

## Why have batsmen become more aggressive?
1. Limits on the number of bouncers per over.
2. Use of two different balls, one from each end.
3. Only four fielders are allowed outside the 30-yard circle for the majority of the game.
4. Modernisation of bats.
5. Free hit after a no-ball.


### Player of the series distribution

In [103]:
batsman_map = batsman_odi[['Player Name', 'Country']].copy()
bowler_map = bowler_odi[['Player Name', 'Country']].copy()
player_country_map = pd.concat([batsman_map, bowler_map]).drop_duplicates().reset_index(drop=True)
player_country_map.rename(columns={'Player Name': 'Player'}, inplace=True)
player_to_country_dict = pd.Series(player_country_map.Country.values, index=player_country_map.Player).to_dict()
def find_player_country(full_name, country_lookup_dict):
    # Handle non-string inputs gracefully to prevent errors.
    if not isinstance(full_name, str):
        return None

    # Clean the input name: remove whitespace, get last word, and convert to lowercase.
    player_lastname = full_name.strip().split(' ')[-1].lower()
    
    # Search for a match in our dictionary of players.
    for key_name, country in country_lookup_dict.items():
        # Handle cases where key_name might not be a string
        if isinstance(key_name, str):
            # Clean the key name in the same way for a reliable comparison.
            key_lastname = key_name.strip().split(' ')[-1].lower()
            if player_lastname == key_lastname:
                return country # Return the country if last names match.
            
    return None
champion_filtered = champion[champion['Player Of The Series'] != 'Not Awarded'].copy()
champion_filtered['POTS_Country'] = champion_filtered['Player Of The Series'].apply(
    lambda name: find_player_country(name, player_to_country_dict)
)
champion_filtered.dropna(subset=['POTS_Country'], inplace=True)
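
Matching on the last name alone returns the first player whose surname matches, so two players who share a surname but represent different countries could be mixed up. A minimal sketch of a check for that case follows. The `ambiguous_last_names` helper and the demo lookup (including the made-up player "A Khan") are hypothetical.

```python
from collections import defaultdict


def ambiguous_last_names(player_to_country):
    """Return last names that map to more than one country in the lookup."""
    countries_by_last = defaultdict(set)
    for name, country in player_to_country.items():
        if isinstance(name, str):
            # Same cleaning as find_player_country: last word, lowercased.
            last = name.strip().split(" ")[-1].lower()
            countries_by_last[last].add(country)
    return {
        last: sorted(countries)
        for last, countries in countries_by_last.items()
        if len(countries) > 1
    }


# Hypothetical lookup; "A Khan" is invented for illustration.
demo = {"Moin Khan": "Pakistan", "A Khan": "India", "MS Dhoni": "India"}
print(ambiguous_last_names(demo))  # {'khan': ['India', 'Pakistan']}
```

Running this on `player_to_country_dict` before applying the lookup would show whether any Player of the Series could be assigned to the wrong country.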


In [104]:
def get_award_category(row):
    if row['POTS_Country'] == row['Champion']:
        return 'Champion Team'
    elif row['POTS_Country'] == row['Runner Up']:
        return 'Runner Up Team'
    else:
        return 'Other Team'
champion_filtered['Award Category'] = champion_filtered.apply(get_award_category, axis=1)
category_counts = champion_filtered['Award Category'].value_counts()




In [105]:
category_counts

Award Category
Champion Team     8
Runner Up Team    2
Other Team        1
Name: count, dtype: int64

In [107]:
import plotly.graph_objects as go
fig = go.Figure(data=[go.Pie(
        labels=category_counts.index,
        values=category_counts.values,
        hole=.4,  # This creates the donut shape.
        pull=[0.05, 0, 0] # Slightly "pull" the largest slice for emphasis.
    )])
fig.update_layout(
        title_text='Player of The Series Award Distribution',
        annotations=[dict(text='Awards', x=0.5, y=0.5, font_size=20, showarrow=False)]
    )
fig.update_traces(
        hoverinfo='label+percent',
        textinfo='value+label',
        textfont_size=14
    )

fig.show()


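The chart suggests that a single top performer can lift the whole team: in 8 of the 11 awards that could be matched to a country, the Player of the Series came from the champion team, and in 2 more from the runner-up. A team with such a player is likely to win the tournament, or at least finish in second position.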
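
The counts can also be expressed as percentages, which makes the split easier to quote. The sketch below rebuilds `category_counts` from the values printed above and computes each category's share.

```python
import pandas as pd

# Counts as printed in the category_counts output.
category_counts = pd.Series(
    {"Champion Team": 8, "Runner Up Team": 2, "Other Team": 1}
)

# Share of each category, in percent, rounded to one decimal place.
shares = (category_counts / category_counts.sum() * 100).round(1)
shares_dict = {label: float(value) for label, value in shares.items()}

print(shares_dict)
# {'Champion Team': 72.7, 'Runner Up Team': 18.2, 'Other Team': 9.1}
```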

## Does the toss affect the result of the match?

In [124]:
toss_winners = asia_cup[asia_cup['Toss'].str.lower() == 'win'].copy()

toss_winners['Toss Winner Won Match'] = toss_winners['Result'].str.lower() == 'win'

outcome_counts = toss_winners['Toss Winner Won Match'].value_counts()
labels = outcome_counts.index.map({True: 'Won the Match', False: 'Lost the Match'})
values = outcome_counts.values

fig = go.Figure(data=[go.Pie(
    labels=labels,
    values=values,
    hole=.4,
    marker_colors=['deepskyblue', 'indianred']
)])

fig.update_layout(
    title_text='Does Winning the Toss Mean Winning the Match?',
    annotations=[dict(text='Toss', x=0.5, y=0.5, font_size=20, showarrow=False)]
)

fig.update_traces(
    hoverinfo='label+percent',
    textinfo='percent+label',
    textfont_size=14,
    pull=[0.05, 0]
)
fig.show()

In this data, winning the toss gave a team a 7.1% better chance of winning the match. Winning the toss gives the team the freedom to bat first or second. This decision is generally made in consideration of pitch behavior and the weather forecast. Sometimes, irrespective of external conditions, a team simply has a preference: some teams are good at chasing and some are not. So the toss plays a significant role in deciding the match result.
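
Whether an edge of this size could arise by chance can be checked with an exact binomial test against a 50% win rate. The sketch below uses only the standard library. The `two_sided_binomial_p` helper and the counts (60 wins in 105 matches) are hypothetical placeholders; the real values would come from `outcome_counts`.

```python
from math import comb


def two_sided_binomial_p(wins, n, p=0.5):
    """Exact two-sided p-value: total probability of outcomes no more likely than the observed one."""
    probs = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[wins]
    # Small tolerance so that ties in probability are counted despite float rounding.
    return sum(q for q in probs if q <= observed * (1 + 1e-9))


# Hypothetical counts, NOT the real results.
wins, n = 60, 105
rate = wins / n

print(f"win rate {rate:.1%}, p-value {two_sided_binomial_p(wins, n):.3f}")
```

A large p-value would mean the observed advantage is compatible with the toss having no effect, so the conclusion above should be read with the sample size in mind.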