# Cricket Data Analysis Notebook
*Author: [Utkarsh Tomar]*
*Date: [08-09-2023]*

![Cricket](cricket.jpg)

---

## Table of Contents
1. Introduction
2. Load Cricket Data
3. Player Statistics
    3.1 Extract Player-Level Statistics
    3.2 Determine Player Type

## Overview

This Jupyter Notebook analyzes IPL men's cricket data from 2008 to 2023, focusing on player performance, match outcomes, and venue statistics. The analysis draws on two tables: ball-by-ball deliveries and match-level information.

The notebook is organized into several sections, each addressing specific aspects of the analysis. Here's an outline of what you'll find in this notebook:

1. **Data Preparation**
   - Loading and cleaning the dataset.
   - Merging relevant data for analysis.
   
2. **Player Performance Analysis**
   - Analyzing batsmen's performance in different innings.
   - Comparing players' performance between innings.
   
3. **Match Outcome Analysis**
   - Investigating the relationship between winning the toss and winning the match.
   
4. **Venue Statistics**
   - Analyzing match statistics for different venues.
   - Evaluating player performance at specific venues.
   
5. **Player Performance by Venue, Phase, and Opposition**
   - Analyzing player performance based on venue, game phase, and opposition team.
   
6. **Performance Score Calculation**
   - Calculating a performance score for players based on weighted statistics.
   
7. **Conclusion**
   - Summarizing key findings and insights.

Feel free to navigate through the sections to explore the analysis in detail. Let's begin!

## Section 1: Importing Libraries and Loading Data

In this section, we import essential Python libraries for data analysis and visualization.

We also load the latest cricket dataset, which includes both match information (matches) and detailed ball-by-ball data (deliveries) using the Pandas library.

In [None]:
# Import necessary libraries
import math
import warnings
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objs as go

warnings.filterwarnings('ignore')

# Load the latest cricket data (matches and deliveries datasets)
path = Path('/kaggle/input/ipl-mens-cricket-matches-data-2008-2023')

matches = pd.read_csv(path/'match_info_data.csv')
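# Read 'ball' as text so that identifiers such as 0.10 are not parsed as the float 0.1
deliveries = pd.read_csv(path/'match_data.csv', low_memory=False, dtype={'ball': str})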

## Section 2: Preprocessing Deliveries Data

This section defines a function preprocess_deliveries_data that is used to preprocess the 'deliveries' DataFrame.

It replaces NaN values with 0 in specified columns, calculates the 'total_runs' for each ball, and restructures the DataFrame to a standardized format.

In [None]:
def preprocess_deliveries_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Preprocess the deliveries DataFrame.

    This function replaces NaN values with integer 0 in the run columns,
    calculates 'total_runs' as 'runs_off_bat' plus 'extras', and restructures
    the DataFrame to a standardized format.

    Args:
        df (pd.DataFrame): The deliveries DataFrame to be preprocessed.

    Returns:
        pd.DataFrame: The preprocessed deliveries DataFrame.
    """
    
    # Replace NaN values with integer 0 in the specified columns
    columns_to_replace = ['runs_off_bat', 'extras', 'wides', 'noballs', 'byes', 'legbyes', 'penalty']
    df[columns_to_replace] = df[columns_to_replace].fillna(0).astype(int)
    
    # In Cricsheet-style data, 'extras' already totals wides, no-balls, byes,
    # leg-byes and penalty runs, so adding those columns again would double-count them
    df['total_runs'] = df['runs_off_bat'] + df['extras']
    
    # Convert 'ball' to string and split it into 'over' and 'ball' columns
    df['ball'] = df['ball'].astype(str)
    df['over'] = df['ball'].apply(lambda x: int(x.split('.')[0]) + 1)
    df['ball'] = df['ball'].apply(lambda x: int(x.split('.')[1]))

    # Reorder the columns
    column_order = ['match_id', 'season', 'start_date', 'venue', 'innings', 'over', 'ball', 'batting_team', 'bowling_team', 'striker', 'non_striker', 'bowler', 'runs_off_bat', 'extras', 'wides', 'noballs', 'byes', 'legbyes', 'penalty', 'total_runs', 'wicket_type', 'player_dismissed', 'other_wicket_type', 'other_player_dismissed', 'cricsheet_id']

    # Reassign the DataFrame with the new column order
    df = df[column_order]

    return df

# Preprocess the 'deliveries' DataFrame loaded in Section 1
deliveries = preprocess_deliveries_data(deliveries)
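
To make the over/ball split concrete, here is a minimal sketch on a toy frame. The values and the `ball_in_over` column name are made up for illustration, and the sketch assumes the `ball` identifiers are held as text, since a float would turn `0.10` into `0.1`.

```python
import pandas as pd

# Toy data (hypothetical values): "over.ball" identifiers kept as strings
toy = pd.DataFrame({'ball': ['0.1', '0.6', '0.10', '19.3']})

# Split "over.ball" once and convert both parts to integers
parts = toy['ball'].str.split('.', expand=True).astype(int)
toy['over'] = parts[0] + 1          # overs become 1-indexed
toy['ball_in_over'] = parts[1]      # delivery number within the over

print(toy)
```

The tenth delivery of the first over stays distinct from the first delivery only because the identifier was never converted to a float.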

## Section 3: Cleaning Team Names and Venues

This section defines functions to clean team names and venue names in both the 'deliveries' and 'matches' DataFrames.

Inconsistent team names are mapped to standardized names, and venue names are cleaned by removing location information and renaming specific venues. The clean_data function is then called to clean both DataFrames.

In [None]:
def clean_team_names(data, team1_col, team2_col):
    """
    Clean team names in a DataFrame.

    Parameters:
    - data (pd.DataFrame): The DataFrame containing team names.
    - team1_col (str): The column name for the first team.
    - team2_col (str): The column name for the second team.

    Returns:
    pd.DataFrame: The DataFrame with cleaned team names.
    """
    
    # Create a mapping of inconsistent team names to standardized names
    team_name_mapping = {
        'Rising Pune Supergiant': 'Rising Pune Supergiants',
        'Delhi Daredevils': 'Delhi Capitals',
        'Kings XI Punjab': 'Punjab Kings'
    }

    # Replace inconsistent team names with standardized names in team1 and team2 columns
    data[team1_col] = data[team1_col].replace(team_name_mapping)
    data[team2_col] = data[team2_col].replace(team_name_mapping)

    return data

In [None]:
def clean_venues(data):
    """
    Clean venue names in a DataFrame.

    Parameters:
    - data (pd.DataFrame): The DataFrame containing venue names.

    Returns:
    pd.DataFrame: The DataFrame with cleaned venue names.
    """
    # Clean venue names by removing location information after comma (if present)
    data['venue'] = data['venue'].str.split(',').str[0]

    # Rename specific venues
    venue_rename_mapping = {
        'Sardar Patel Stadium': 'Narendra Modi Stadium',
        'Feroz Shah Kotla': 'Arun Jaitley Stadium'
    }
    data['venue'] = data['venue'].replace(venue_rename_mapping)

    return data

In [None]:
def clean_data(deliveries, matches):
    """
    Clean team names and venue names in DataFrames.

    Parameters:
    - deliveries (pd.DataFrame): The DataFrame containing delivery data.
    - matches (pd.DataFrame): The DataFrame containing match data.

    Returns:
    pd.DataFrame, pd.DataFrame: Cleaned delivery and match DataFrames.
    """
    # Clean team names in deliveries and matches DataFrames
    deliveries = clean_team_names(deliveries, 'batting_team', 'bowling_team')
    matches = clean_team_names(matches, 'team1', 'team2')

    # Clean venues in deliveries and matches DataFrames
    deliveries = clean_venues(deliveries)
    matches = clean_venues(matches)

    return deliveries, matches

In [None]:
# Call the function to clean the data
deliveries, matches = clean_data(deliveries, matches)

print(matches.head(2), deliveries.head(2))
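
As a quick sanity check of the cleaning logic, the same mapping idea can be tried on a toy frame. This is only a sketch: the rows are hypothetical and the mappings are a subset of the ones defined above.

```python
import pandas as pd

# Toy frame (hypothetical rows) mixing old and current names
toy = pd.DataFrame({
    'team1': ['Delhi Daredevils', 'Mumbai Indians'],
    'team2': ['Kings XI Punjab', 'Delhi Capitals'],
    'venue': ['Feroz Shah Kotla, Delhi', 'Wankhede Stadium'],
})

team_map = {'Delhi Daredevils': 'Delhi Capitals', 'Kings XI Punjab': 'Punjab Kings'}
venue_map = {'Feroz Shah Kotla': 'Arun Jaitley Stadium'}

# Series.replace with a dict swaps exact matches and leaves other values untouched
for col in ['team1', 'team2']:
    toy[col] = toy[col].replace(team_map)

# Keep only the text before the first comma, then apply the venue rename
toy['venue'] = toy['venue'].str.split(',').str[0].replace(venue_map)

print(toy)
```

Names that are not keys in the mapping, such as `Mumbai Indians`, pass through unchanged.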

#### Part 1: Player Statistics

1.1 : Extract player-level statistics

1.2 : Determine player type from the above stats

## 3. Player Statistics <a id="player-statistics"></a>

In this section, we will calculate player-level statistics and determine player types based on those statistics.

### 3.1 Extract Player-Level Statistics <a id="extract-player-level-statistics"></a>

We start by defining functions to calculate various player statistics.

`balls_per_dismissal(balls, dismissals)`
This function calculates the average number of balls faced per dismissal for a player.

In [None]:
def balls_per_dismissal(balls, dismissals):
    if dismissals > 0:
        return balls/dismissals
    else:
        return balls/1 

`balls_per_boundary(balls, boundaries)`
This function calculates the average number of balls per boundary (four or six) scored by a player.

In [None]:
def balls_per_boundary(balls, boundaries):
    if boundaries > 0:
        return balls/boundaries
    else:
        return balls/1 

`calculate_player_statistics(df, criteria=None)`

This function calculates various batting statistics for each player, including runs, innings, balls faced, dismissals, dot balls, ones, twos, threes, fours, and sixes. It also calculates strike rate (SR), runs per innings (RPI), balls per dismissal (BPD), and balls per boundary (BPB).

In [None]:
def calculate_player_statistics(df, criteria=None):
    """
    Calculate various batting statistics for each player.

    Args:
        df (pd.DataFrame): The DataFrame containing delivery data.
        criteria (dict, optional): A dictionary specifying a filter criteria, e.g., {'column': 'season', 'value': 2023}.
        
    Returns:
        pd.DataFrame: The DataFrame with player-level statistics.
        
    Calculates the following player-level statistics:
    - Strike Rate (SR): Measures a player's ability to score quickly.
    - Runs per Inning (RPI): Indicates a player's average runs per inning.
    - Balls per Dismissal (BPD): Shows how often a player gets dismissed.
    - Balls per Boundary (BPB): Measures a player's boundary-hitting ability.
    """
    
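    # Apply the optional filter, then copy so the caller's DataFrame is not modified
    if criteria is not None:
        df = df[df[criteria['column']] == criteria['value']]
    df = df.copy()

    # Calculate dot balls, ones, twos, threes, fours, and sixes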
    df['isDot'] = (df['runs_off_bat'] == 0).astype(int)
    df['isOne'] = (df['runs_off_bat'] == 1).astype(int)
    df['isTwo'] = (df['runs_off_bat'] == 2).astype(int)
    df['isThree'] = (df['runs_off_bat'] == 3).astype(int)
    df['isFour'] = (df['runs_off_bat'] == 4).astype(int)
    df['isSix'] = (df['runs_off_bat'] == 6).astype(int)
    
    # Calculate runs, innings, balls, dismissals, and other statistics
    runs = df.groupby('striker')['runs_off_bat'].sum().reset_index().rename(columns={'runs_off_bat': 'runs'})
    innings = df.groupby('striker')['match_id'].nunique().reset_index().rename(columns={'match_id': 'innings'})
    balls = pd.DataFrame(df.groupby(['striker'])['match_id'].count()).reset_index().rename(columns={'match_id':'balls'})
    dismissals = pd.DataFrame(df.groupby(['striker'])['player_dismissed'].count()).reset_index().rename(columns={'player_dismissed':'dismissals'})
    
    dots = pd.DataFrame(df.groupby(['striker'])['isDot'].sum()).reset_index().rename(columns={'isDot':'dots'})
    ones = pd.DataFrame(df.groupby(['striker'])['isOne'].sum()).reset_index().rename(columns={'isOne':'ones'})
    twos = pd.DataFrame(df.groupby(['striker'])['isTwo'].sum()).reset_index().rename(columns={'isTwo':'twos'})
    threes = pd.DataFrame(df.groupby(['striker'])['isThree'].sum()).reset_index().rename(columns={'isThree':'threes'})
    fours = pd.DataFrame(df.groupby(['striker'])['isFour'].sum()).reset_index().rename(columns={'isFour':'fours'})
    sixes = pd.DataFrame(df.groupby(['striker'])['isSix'].sum()).reset_index().rename(columns={'isSix':'sixes'})
    
    df = pd.merge(innings, runs, on='striker').merge(balls, on='striker').merge(dismissals, on='striker').merge(dots, on='striker').merge(ones, on='striker').merge(twos, on='striker').merge(threes, on='striker').merge(fours, on='striker').merge(sixes, on='striker')
    
    # Calculate Strike Rate (SR)
    df['SR'] = df.apply(lambda x: 100*(x['runs']/x['balls']), axis=1)

    # Calculate Runs per Inning (RPI)
    df['RPI'] = df.apply(lambda x: x['runs']/x['innings'], axis=1)

    # Calculate Balls per Dismissal (BPD)
    df['BPD'] = df.apply(lambda x: balls_per_dismissal(x['balls'], x['dismissals']), axis=1)

    # Calculate Balls per Boundary (BPB)
    df['BPB'] = df.apply(lambda x: balls_per_boundary(x['balls'], (x['fours'] + x['sixes'])), axis=1)
    
    return df

In [None]:
# Calculate player statistics from delivery data
player_stats_df = calculate_player_statistics(deliveries)

# Display the first few rows of the player statistics DataFrame
print(player_stats_df.head())
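
For readers who want to check the arithmetic by hand, the sketch below recomputes a few of the same statistics on a toy frame using pandas named aggregation. It is an alternative formulation, not the function above; the rows and the `is_four`/`is_six` helper columns are hypothetical.

```python
import pandas as pd

# Toy ball-by-ball data (hypothetical): two batters across two matches
toy = pd.DataFrame({
    'match_id':         [1, 1, 1, 1, 2, 2],
    'striker':          ['A', 'A', 'A', 'B', 'A', 'B'],
    'runs_off_bat':     [4, 0, 1, 6, 2, 0],
    'player_dismissed': [None, None, 'A', None, None, 'B'],
})

# Flag boundaries first so they can be summed per batter
toy = toy.assign(is_four=toy['runs_off_bat'].eq(4), is_six=toy['runs_off_bat'].eq(6))

stats = toy.groupby('striker').agg(
    innings=('match_id', 'nunique'),            # distinct matches batted in
    runs=('runs_off_bat', 'sum'),
    balls=('runs_off_bat', 'size'),             # every row counts as a ball faced
    dismissals=('player_dismissed', 'count'),   # count ignores missing values
    fours=('is_four', 'sum'),
    sixes=('is_six', 'sum'),
).reset_index()

stats['SR'] = 100 * stats['runs'] / stats['balls']
stats['RPI'] = stats['runs'] / stats['innings']

print(stats)
```

Batter A scores 7 runs off 4 balls over 2 innings, which gives a strike rate of 175 and an RPI of 3.5.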

## 1.3: Performance in Different Phases of Play and Innings

### Determining Phases of Play

In this section, we determine different phases of play (Powerplay, Middle, Death) based on the over number. We then add a 'phase' column to the deliveries DataFrame to categorize each delivery into one of these phases.

In [None]:
# Define a function to determine the phase based on the over number
def phase(over):
    if over <= 6:
        return 'Powerplay'
    elif over <= 15:
        return 'Middle'
    else:
        return 'Death'

# Add a 'phase' column to the deliveries DataFrame
deliveries['phase'] = deliveries['over'].apply(lambda x: phase(x))
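
The same binning can also be expressed without a row-wise `apply`. The sketch below uses `pd.cut` on a few hypothetical over numbers; note that, unlike `phase`, it would leave overs above 20 as missing values rather than labelling them 'Death'.

```python
import pandas as pd

# Hypothetical 1-indexed over numbers at the phase boundaries
overs = pd.Series([1, 6, 7, 15, 16, 20], name='over')

# Bins are right-inclusive by default: (0, 6], (6, 15], (15, 20]
phases = pd.cut(overs, bins=[0, 6, 15, 20], labels=['Powerplay', 'Middle', 'Death'])

print(phases.tolist())
```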

## Analyzing Player Performance in Different Phases
Now that we have categorized deliveries into phases, we can analyze player performance in each phase. We use the phasesOfPlay function to filter the DataFrame for the specified phase and then calculate player statistics using the `calculate_player_statistics` helper function.

In [None]:
# Define a function to analyze player performance in different phases
def phasesOfPlay(df, current_phase):
    # Filter the DataFrame for the specified phase
    df = df[df['phase'] == current_phase].copy()
    
    # Reset the index
    df.reset_index(drop=True, inplace=True)
    
    # Utilize the calculate_player_statistics helper function
    return calculate_player_statistics(df)

# Calculate player statistics for different phases
pp_df = phasesOfPlay(deliveries, 'Powerplay')  # Powerplay phase
mid_df = phasesOfPlay(deliveries, 'Middle')      # Middle phase
dth_df = phasesOfPlay(deliveries, 'Death')       # Death phase

# Display the first 2 rows of the Powerplay DataFrame
print(pp_df.head(2))

## Analyzing Player Performance in Different Innings
In addition to phases of play, we can also analyze player performance in different innings (1st inning, 2nd inning). We use the ByInning function to filter the DataFrame for the specified inning and then calculate player statistics.

In [None]:
# Define a function to analyze player performance in different innings
def ByInning(df, current_inning):
    # Filter the DataFrame to include only the specified inning
    df = df[df.innings == current_inning]
    
    # Reset the index of the filtered DataFrame
    df.reset_index(drop=True, inplace=True)
    
    # Utilize the calculate_player_statistics helper function
    return calculate_player_statistics(df)

# Analyze player performance in the 1st inning
ing1_df = ByInning(deliveries, 1)

# Analyze player performance in the 2nd inning
ing2_df = ByInning(deliveries, 2)

# Display the player statistics for the 1st inning
print(ing1_df.head(2))

## 1.4: Comparing Player Performance in 1st and 2nd Innings

### Merging RPI Values of Batsmen

In this section, we merge the Runs per Inning (RPI) values of batsmen from the 1st and 2nd innings of cricket matches. We create a DataFrame called `comp` to store this comparison.

In [None]:
# Merge the RPI values of batsmen from the 1st and 2nd innings
comp = ing1_df[['striker', 'RPI']].merge(ing2_df[['striker', 'RPI']], on='striker', how='inner').rename(columns={'RPI_x': '1st_RPI', 'RPI_y': '2nd_RPI'})

# Display the first 2 rows of the comparison DataFrame
comp.head(2)
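
A small worked example may help show what the inner merge keeps. The sketch below uses hypothetical RPI tables and passes `suffixes` to name the overlapping columns directly, which is an alternative to renaming `RPI_x` and `RPI_y` afterwards.

```python
import pandas as pd

# Hypothetical per-innings RPI tables
first = pd.DataFrame({'striker': ['A', 'B', 'C'], 'RPI': [30.0, 22.5, 18.0]})
second = pd.DataFrame({'striker': ['A', 'B', 'D'], 'RPI': [25.0, 28.0, 40.0]})

# An inner merge keeps only batters present in both tables
toy_comp = first.merge(second, on='striker', how='inner', suffixes=('_1st', '_2nd'))

# Positive values mean the batter averages more when batting second
toy_comp['RPI_diff'] = toy_comp['RPI_2nd'] - toy_comp['RPI_1st']

print(toy_comp)
```

Batters C and D drop out because each appears in only one of the two tables.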

### Creating a Player Comparison Plot
Next, we create a scatter plot to visualize the comparison of average runs (RPI) for players between the 1st and 2nd innings. The create_player_comparison_plot function allows us to customize this plot.

In [None]:
def create_player_comparison_plot(comp, selected_players, highlight_color):
    # Create a scatter plot using Plotly Express without player names
    fig = px.scatter(comp, x='1st_RPI', y='2nd_RPI', title='Comparison of Avg Runs - 1st Inning vs 2nd Inning')

    # Filter the plot to display only selected player names
    selected_players_df = comp[comp['striker'].isin(selected_players)]

    # Define a default color for non-selected players
    default_color = '#e06377'  # Change this color as needed

    # Add selected player plots
    for _, row in selected_players_df.iterrows():
        fig.add_trace(
            go.Scatter(
                x=[row['1st_RPI']],
                y=[row['2nd_RPI']],
                text=[row['striker']],
                mode='markers+text',
                marker=dict(size=10, color=highlight_color),  # Use the highlight color for selected players
                showlegend=False,
                textposition='top center',  # Position the text above the marker
            )
        )

    # Set the marker color for non-selected players
    fig.update_traces(marker=dict(size=10, color=default_color), selector=dict(mode='markers'))

    # Show the plot
    fig.show()

# Example usage:
create_player_comparison_plot(comp, ['CH Gayle', 'V Kohli', 'AB de Villiers'], '#4040a1')

In [None]:
# comp[comp['1st_RPI'] > 50]

## 1.5: Analyzing Player Performance Against Different Oppositions

### Analyzing Player Performance Against a Specific Opposition

In this section, we analyze a player's performance against a specific opposition team using the `ByOpposition` function. We filter the delivery data to include only matches against the specified opposition and then calculate player statistics.

In [None]:
def ByOpposition(df, current_opposition):
    # Filter the DataFrame to include only the specified opposition
    df = df[df.bowling_team == current_opposition]
    
    # Reset the index of the filtered DataFrame
    df.reset_index(drop=True, inplace=True)
    
    # Utilize the calculate_player_statistics helper function
    return calculate_player_statistics(df)

result = ByOpposition(deliveries, 'Royal Challengers Bangalore')
result.head(3)  # Display the first 3 rows of the result dataframe

### Plotting a Player's Runs Against All Teams
Next, we create a bar chart to visualize a player's runs against all opposition teams using the `plot_runs_by_opposition` function.

This chart provides insights into a player's performance against different teams.

In [None]:
def plot_runs_by_opposition(player_name, data, top_n=2, default_color='blue', top_color='green'):
    # Filter the data for the specified player
    player_data = data[data['striker'] == player_name]

    # Group data by bowling team and calculate runs scored
    runs_by_team = player_data.groupby('bowling_team')['runs_off_bat'].sum().reset_index()

    # Sort ascending so the highest totals are the last rows (drawn at the top of the horizontal chart)
    runs_by_team = runs_by_team.sort_values(by='runs_off_bat', ascending=True)

    # Create a new column for color based on the top_n teams
    runs_by_team['color'] = default_color  # Default color for all bars
    if top_n > 0:
        runs_by_team.iloc[-top_n:, -1] = top_color  # Set color for the top_n bars to top_color

    # Create a horizontal bar chart using Plotly Express
    fig = px.bar(
        runs_by_team,
        x='runs_off_bat',
        y='bowling_team',
        orientation='h',
        title=f'{player_name} Runs - against all teams',
        labels={'runs_off_bat': 'Runs scored', 'bowling_team': 'Opposition Teams'},
        color='color',  # Use the 'color' column for coloring bars
        color_discrete_map={default_color: default_color, top_color: top_color},  # Define color mapping
    )

    # Customize the layout: set axis titles and hide the legend
    fig.update_layout(
        xaxis_title='Runs scored',
        yaxis_title='Opposition Teams',
        showlegend=False,  # Remove legend
    )

    # Show the plot
    fig.show()

# Example usage:
player_name = 'V Kohli'
plot_runs_by_opposition(player_name, deliveries, top_n=2, default_color='blue', top_color='green')

## 1.6: Analyzing Match Outcomes and Venue Statistics

### Analyzing the Impact of Winning the Toss

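In this section, we examine whether winning the toss is associated with winning the match. We add a `wintoss_winmatch` column to the `matches` DataFrame that records whether the toss winner also won the match. The check infers the toss winner from the toss decision: it is taken to be `team2` when the decision is `'field'` and `team1` otherwise, which assumes `team1` is the side that batted first.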

In [None]:
def wintoss_winmatch(toss_decision, team1, team2, winner):
    # The toss winner is taken to be team2 when it chose to field, and team1 otherwise
    if toss_decision == 'field':
        return team2 == winner
    return team1 == winner

# Apply the function row-wise to create the 'wintoss_winmatch' column
matches['wintoss_winmatch'] = matches.apply(lambda x: wintoss_winmatch(x['toss_decision'], x['team1'], x['team2'], x['winner']), axis=1)

# Display the selected columns for the first few rows
print(matches[['id', 'season', 'team1', 'team2', 'wintoss_winmatch']].head())
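
Once the flag exists, the share of matches won by the toss winner is just the mean of the boolean column. The sketch below shows this on a hypothetical four-match frame; the numbers are illustrative and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical matches with the flag already computed
toy_matches = pd.DataFrame({
    'toss_decision':    ['field', 'bat', 'field', 'bat'],
    'wintoss_winmatch': [True, False, True, True],
})

# The mean of a boolean column is the fraction of True values
overall_rate = toy_matches['wintoss_winmatch'].mean()

# The same rate split by what the toss winner chose to do
rate_by_decision = toy_matches.groupby('toss_decision')['wintoss_winmatch'].mean()

print(overall_rate)
print(rate_by_decision)
```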

### Venue Analysis
Next, we perform an analysis of match statistics for different venues using the venueAnalysis function.

This function calculates total runs, total balls bowled, and total wickets fallen for each match inning and merges the data into a consolidated venue analysis DataFrame.

In [None]:
# Define a function to flag a dismissal: 1 if a player was dismissed, 0 if the value is missing
def isOut(player_dismissed):
    return 0 if pd.isna(player_dismissed) else 1

# Define the venueAnalysis function
def venueAnalysis(mdf, df):
    # Calculate total runs for each match inning
    runs = pd.DataFrame(df.groupby(['match_id', 'innings'])['total_runs'].sum().reset_index())
    runs['Id_Ing'] = runs.apply(lambda x: str(x['match_id']) + '-' + str(x['innings']), axis=1)

    # Calculate total balls bowled for each match inning
    balls = pd.DataFrame(df.groupby(['match_id', 'innings'])['total_runs'].count().reset_index()).rename(columns={'total_runs': 'total_balls'})
    balls['Id_Ing'] = balls.apply(lambda x: str(x['match_id']) + '-' + str(x['innings']), axis=1)

    # Calculate total wickets fallen for each match inning
    df['isOut'] = df['player_dismissed'].apply(lambda x: isOut(x))
    outs = pd.DataFrame(df.groupby(['match_id', 'innings'])['isOut'].sum().reset_index()).rename(columns={'isOut': 'wickets'})
    outs['Id_Ing'] = outs.apply(lambda x: str(x['match_id']) + '-' + str(x['innings']), axis=1)

    # Merge the dataframes to get a consolidated venue analysis dataframe
    df = pd.merge(runs, balls[['Id_Ing', 'total_balls']], on='Id_Ing').merge(outs[['Id_Ing', 'wickets']], on='Id_Ing')

    # Rename columns and add venue information
    mdf = mdf.rename(columns={'id': 'match_id'})
    df = pd.merge(df, mdf[['match_id', 'venue']], on='match_id')
    df = df[['match_id', 'venue', 'innings', 'total_runs', 'total_balls', 'wickets']]

    return df

# Perform venue analysis and store the result in ven_df
ven_df = venueAnalysis(matches, deliveries)

# Display the first few rows of the venue analysis dataframe
ven_df.head()
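
A frame shaped like `ven_df` can be rolled up to one row per venue. The sketch below does this on hypothetical rows; the summary column names are illustrative, and because `total_balls` counts every delivery (wides and no-balls included), the run rate shown is an approximation.

```python
import pandas as pd

# Hypothetical per-innings rows with the same columns as ven_df
toy_ven = pd.DataFrame({
    'match_id':    [1, 1, 2, 2],
    'venue':       ['X', 'X', 'Y', 'Y'],
    'innings':     [1, 2, 1, 2],
    'total_runs':  [180, 160, 140, 141],
    'total_balls': [120, 120, 120, 110],
    'wickets':     [6, 9, 8, 4],
})

summary = toy_ven.groupby('venue').agg(
    innings_played=('innings', 'size'),
    avg_runs=('total_runs', 'mean'),
    runs=('total_runs', 'sum'),
    balls=('total_balls', 'sum'),
)

# Runs per over, treating every counted delivery as a ball
summary['run_rate'] = 6 * summary['runs'] / summary['balls']

print(summary)
```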

## 1.7: Analyzing Player Performance by Venue, Phase, and Opposition

### Analyzing Batsman Performance by Venue

In this section, we analyze the performance of batsmen at a specific cricket venue using the `analyze_batsman_performance_by_venue` function. This function filters the data for the specified venue and calculates player statistics.

In [None]:
# matches.rename(columns={'id': 'match_id'}, inplace=True)

def analyze_batsman_performance_by_venue(df, current_venue):
    # Filter the data for the specified venue
    df = df[df['venue'] == current_venue]

    # Utilize the calculate_player_statistics helper function
    return calculate_player_statistics(df)

current_venue = 'Wankhede Stadium'
# 'deliveries' already carries the 'venue' column, so it can be passed directly
venue_performance_df = analyze_batsman_performance_by_venue(deliveries, current_venue)

# Display the first few rows of the venue performance DataFrame
print(venue_performance_df.head())

### Analyzing Player Performance by Venue, Phase, and Opposition
In this section, we analyze player performance based on venue, game phase, and opposition team.

The `analyze_player_performance` function filters the data by the specified venue, phase, and opposition, computes player statistics with `calculate_player_statistics`, and adds a `dot_percentage` column (dot balls as a percentage of balls faced).

In [None]:
def analyze_player_performance(df, current_venue, current_phase, current_opposition):
    # Filter the data based on the specified criteria
    df = df[df['venue'] == current_venue]
    df = df[df['phase'] == current_phase]
    df = df[df['bowling_team'] == current_opposition]

    # Utilize the calculate_player_statistics helper function
    df = calculate_player_statistics(df)

    # Calculate dot percentage within this function
    df['dot_percentage'] = (df['dots'] / df['balls']) * 100 if 'balls' in df.columns and 'dots' in df.columns and df['balls'].sum() > 0 else 0

    return df

current_venue = 'Wankhede Stadium'
current_phase = 'Middle'
current_opposition = 'Mumbai Indians'

# 'deliveries' already carries the 'venue', 'phase' and 'bowling_team' columns
df = analyze_player_performance(deliveries, current_venue, current_phase, current_opposition)
print(df.head())
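
The three successive filters can also be written as one boolean mask, which makes it easy to check how many deliveries survive before any statistics are computed. The sketch below uses hypothetical rows and placeholder venue and team names.

```python
import pandas as pd

# Hypothetical deliveries with placeholder venue, phase and team labels
toy = pd.DataFrame({
    'venue':        ['X', 'X', 'Y', 'X'],
    'phase':        ['Middle', 'Death', 'Middle', 'Middle'],
    'bowling_team': ['T1', 'T1', 'T1', 'T2'],
    'striker':      ['A', 'A', 'B', 'C'],
})

# Combine the conditions with & (each comparison needs its own parentheses)
mask = (toy['venue'] == 'X') & (toy['phase'] == 'Middle') & (toy['bowling_team'] == 'T1')
subset = toy[mask]

print(len(subset), subset['striker'].tolist())
```

A narrow combination of venue, phase, and opposition can leave very few rows, so it is worth checking the subset size before reading much into the resulting statistics.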

## 1.8: Player Performance Score Calculation

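In this section, we calculate an overall performance score for each player from strike rate, runs per innings, balls per dismissal, and dot-ball percentage. The input is the `df` produced by `analyze_player_performance` above. Each statistic is normalized and weighted, and each player is then scored by how far they sit from the worst observed values relative to the best.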

In [None]:
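# Define weights for the different statistics
wt_sr, wt_rpi, wt_bpd, wt_dot_percentage = 0.13, 0.27, 0.16, 0.45

# Keep players with at least 2 innings; copy so the new columns below are added to an independent frame
df = df[df.innings >= 2].copy()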

In [None]:
# Step 1: Square of all values
df['calc_SR'] = df['SR'].apply(lambda x: x * x)
df['calc_RPI'] = df['RPI'].apply(lambda x: x * x)
df['calc_BPD'] = df['BPD'].apply(lambda x: x * x)
df['calc_dot_percentage'] = df['dot_percentage'].apply(lambda x: x * x)

# Calculate square root of the sum of squared values
sq_sr, sq_rpi, sq_bpd, sq_dot_percentage = np.sqrt(df[['calc_SR', 'calc_RPI', 'calc_BPD', 'calc_dot_percentage']].sum(axis=0))

# Step 2: Normalize values
df['calc_SR'] = df['calc_SR'].apply(lambda x: x / sq_sr)
df['calc_RPI'] = df['calc_RPI'].apply(lambda x: x / sq_rpi)
df['calc_BPD'] = df['calc_BPD'].apply(lambda x: x / sq_bpd)
df['calc_dot_percentage'] = df['calc_dot_percentage'].apply(lambda x: x / sq_dot_percentage)

# Step 3: Apply weights (if needed)
df['calc_SR'] = df['calc_SR'].apply(lambda x: x * wt_sr)
df['calc_RPI'] = df['calc_RPI'].apply(lambda x: x * wt_rpi)
df['calc_BPD'] = df['calc_BPD'].apply(lambda x: x * wt_bpd)
df['calc_dot_percentage'] = df['calc_dot_percentage'].apply(lambda x: x * wt_dot_percentage)

# Step 4: Find best and worst values
best_sr, worst_sr = max(df['calc_SR']), min(df['calc_SR'])
best_rpi, worst_rpi = max(df['calc_RPI']), min(df['calc_RPI'])
best_bpd, worst_bpd = max(df['calc_BPD']), min(df['calc_BPD'])
best_dot_percentage, worst_dot_percentage = min(df['calc_dot_percentage']), max(df['calc_dot_percentage'])

# Step 5: Calculate deviation from best and worst values
df['dev_best_SR'] = df['calc_SR'].apply(lambda x: (x - best_sr) * (x - best_sr))
df['dev_best_RPI'] = df['calc_RPI'].apply(lambda x: (x - best_rpi) * (x - best_rpi))
df['dev_best_BPD'] = df['calc_BPD'].apply(lambda x: (x - best_bpd) * (x - best_bpd))
df['dev_best_dot_percentage'] = df['calc_dot_percentage'].apply(lambda x: (x - best_dot_percentage) * (x - best_dot_percentage))

df['dev_best_sqrt'] = df.apply(lambda x: x['dev_best_SR'] + x['dev_best_RPI'] + x['dev_best_BPD'] + x['dev_best_dot_percentage'], axis=1)

df['dev_worst_SR'] = df['calc_SR'].apply(lambda x: (x - worst_sr) * (x - worst_sr))
df['dev_worst_RPI'] = df['calc_RPI'].apply(lambda x: (x - worst_rpi) * (x - worst_rpi))
df['dev_worst_BPD'] = df['calc_BPD'].apply(lambda x: (x - worst_bpd) * (x - worst_bpd))
df['dev_worst_dot_percentage'] = df['calc_dot_percentage'].apply(lambda x: (x - worst_dot_percentage) * (x - worst_dot_percentage))

df['dev_worst_sqrt'] = df.apply(lambda x: x['dev_worst_SR'] + x['dev_worst_RPI'] + x['dev_worst_BPD'] + x['dev_worst_dot_percentage'], axis=1)

# Step 6: Calculate overall score for each player
df['score'] = df.apply(lambda x: x['dev_worst_sqrt'] / (x['dev_worst_sqrt'] + x['dev_best_sqrt']), axis=1)

# Select relevant columns
result_df = df[['striker', 'score']]
print(result_df.head())
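
The steps in this cell resemble TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution). For comparison only, here is a minimal sketch of the textbook variant on hypothetical data. It differs from the cell above in two ways: it normalizes the raw values rather than the squared ones, and it takes the square root of the summed squared deviations. The batters, statistics, and weights below are illustrative and are not the notebook's.

```python
import numpy as np
import pandas as pd

# Hypothetical statistics for three batters
toy = pd.DataFrame({
    'striker': ['A', 'B', 'C'],
    'SR': [150.0, 120.0, 100.0],
    'RPI': [30.0, 25.0, 10.0],
    'dot_percentage': [30.0, 40.0, 50.0],
}).set_index('striker')

weights = np.array([0.4, 0.4, 0.2])               # illustrative weights
higher_is_better = np.array([True, True, False])  # fewer dot balls is better

# Vector-normalize each column by its Euclidean norm, then apply the weights
X = toy.to_numpy()
V = X / np.sqrt((X ** 2).sum(axis=0)) * weights

# Ideal best and worst per column depend on the direction of each criterion
best = np.where(higher_is_better, V.max(axis=0), V.min(axis=0))
worst = np.where(higher_is_better, V.min(axis=0), V.max(axis=0))

# Euclidean distances to both ideals, then relative closeness to the best
d_best = np.sqrt(((V - best) ** 2).sum(axis=1))
d_worst = np.sqrt(((V - worst) ** 2).sum(axis=1))
toy['score'] = d_worst / (d_best + d_worst)

print(toy['score'].sort_values(ascending=False))
```

Batter A is best on every criterion and so scores 1, batter C is worst on every criterion and scores 0, and batter B falls in between.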

In [None]:
top_players = df[['striker', 'innings', 'runs', 'balls', 'dismissals', 'dot_percentage', 'score']].sort_values(['score'], ascending=False).reset_index(drop=True).head(25)
print(top_players)