
### KPI


**Match-Level KPIs:**
1. **Total Matches Played per Season**  
   - Count of unique `id` per `season` in `matches.csv`.  

2. **Win Percentage of Teams**  
   - `(Total Wins of a Team / Total Matches Played by the Team) * 100`  

3. **Average Winning Margin (Runs & Wickets)**  
   - If `result` is "runs", calculate the average `result_margin`.  
   - If `result` is "wickets", calculate the average margin of wickets left.  

4. **Batting First vs Bowling First Win Percentage**  
   - Compare win rates of teams that won the toss and chose `toss_decision = "bat"` vs `"field"`.  

**Batting KPIs (Using `deliveries.csv`):**
5. **Top Run Scorers (Most Runs by Batter)**  
   - Sum of `batsman_runs` grouped by `batsman`.  

6. **Batting Strike Rate (Player & Team Level)**  
   - `Strike Rate = (Total Runs Scored / Total Balls Faced) * 100`  

7. **Average Runs per Over by Team**  
   - `Total Runs / Total Overs Faced` grouped by `batting_team`.  

**Bowling KPIs:**
8. **Top Wicket Takers**  
   - Count of `is_wicket = 1` grouped by `bowler`.  

9. **Bowling Economy (Runs Conceded per Over)**  
   - `(Total Runs Given / Total Overs Bowled)` for each `bowler`.  

10. **Dot Ball Percentage (Dot Balls / Total Balls Bowled)**  
   - `(Count of deliveries where batsman_runs = 0) / (Total Balls Bowled) * 100`  

---

**Advanced Match-Level KPIs:**  
1. **Clutch Performance Index (CPI) of Players**  
   - Evaluates a player's performance in high-pressure situations (e.g., last 5 overs, chasing high targets).  
   - Formula: `(Runs/Wickets Taken in last 5 overs) / (Total Runs/Wickets in the match) * 100`  

2. **Toss Impact Factor**  
   - Measures whether winning the toss influences match results.  
   - `Win Percentage When Toss Won - Win Percentage When Toss Lost`.  

3. **Impact of Toss Decision on Match Outcome**  
   - Check if `toss_decision` (bat or field) affects win probability based on past matches.  

4. **Home Advantage Analysis**  
   - Calculate the win percentage of teams when playing in their home city.  

---

**Advanced Batting KPIs:**  
5. **Boundary Frequency (Per Over & Per Batter)**  
   - `Total Boundaries (4s & 6s) / Total Balls Faced`.  

6. **Dot Ball Pressure Index (DBPI)**  
   - Identifies batsmen under pressure due to dot balls.  
   - `DBPI = (Total Dot Balls Faced / Total Balls Faced) * 100`.  

7. **Batting Consistency Score**  
   - Uses standard deviation of a player's scores to see if they are consistent or streaky.  
   - Lower standard deviation → More consistent.  

8. **Explosiveness Rating (Powerplay & Death Overs Performance)**  
   - `(Runs Scored in Overs 16-20) / (Total Runs Scored in the Match) * 100`.  

---

**Advanced Bowling KPIs:**  
9. **Death Over Economy Rate**  
   - `Total Runs Conceded in Overs 16-20 / Total Overs Bowled in 16-20`.  

10. **Bowler’s Pressure Index**  
   - `(Total Runs Given in Last 5 Overs / Total Wickets Taken in Last 5 Overs)`.  
   - Lower values mean the bowler performs well under pressure.  

11. **Bowler’s Match Impact Factor**  
   - `(Wickets Taken + (Total Runs Prevented vs Team's Average))`.  
   - This shows if a bowler conceded fewer runs than the average team economy rate.  

12. **Wicket Conversion Rate (WCR)**  
   - `(Total Wickets Taken / Total Balls Bowled) * 100`.  

---

**Team-Level KPIs:**  
13. **Middle Order Stability Score**  
   - `(Runs Scored by Batters at Positions 4-7) / (Total Runs Scored by the Team) * 100`.  

14. **Net Run Rate (NRR) Trends Per Season**  
   - `(Total Runs Scored per Over - Total Runs Conceded per Over)`.  
   - Analyzing how NRR trends change per season.  

15. **Match Closer Effectiveness**  
   - Evaluates a team's performance in close matches (victory margin < 10 runs or < 2 wickets).  

---


### Starting

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("IPL").getOrCreate()
spark

In [3]:
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,DateType

In [4]:
MatchSchema = StructType([
    StructField("id",IntegerType(),True),
    StructField("season",StringType(),True),
    StructField("city",StringType(),True),
    StructField("date",DateType(),True),
    StructField("match_type",StringType(),True),
    StructField("player_of_match",StringType(),True),
    StructField("venue",StringType(),True),
    StructField("team1",StringType(),True),
    StructField("team2",StringType(),True),
    StructField("toss_winner",StringType(),True),
    StructField("toss_decision",StringType(),True),
    StructField("winner",StringType(),True),
    StructField("result",StringType(),True),
    StructField("result_margin",IntegerType(),True),
    StructField("target_runs",IntegerType(),True),
    StructField("target_overs",IntegerType(),True),
    StructField("super_over",StringType(),True),
    StructField("method",StringType(),True),
    StructField("umpire1",StringType(),True),
    StructField("umpire2",StringType(),True),
])

In [5]:
dfMatch=spark.read.format("csv").option("header",True).schema(MatchSchema).load("/content/sample_data/matches.csv")
dfMatch.show()
dfMatch.printSchema()

+------+-------+----------+----------+----------+---------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+-------+-------------+-----------+------------+----------+------+-----------+--------------+
|    id| season|      city|      date|match_type|player_of_match|               venue|               team1|               team2|         toss_winner|toss_decision|              winner| result|result_margin|target_runs|target_overs|super_over|method|    umpire1|       umpire2|
+------+-------+----------+----------+----------+---------------+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+-------+-------------+-----------+------------+----------+------+-----------+--------------+
|335982|2007/08| Bangalore|2008-04-18|    League|    BB McCullum|M Chinnaswamy Sta...|Royal Challengers...|Kolkata Knight Ri...|Royal Challengers...|        field|Kolkat

In [6]:
SchemaDeliver = StructType([
    StructField("id",IntegerType(),True),
    StructField("inning",IntegerType(),True),
    StructField("batting_team",StringType(),True),
    StructField("bowling_team",StringType(),True),
    StructField("over",IntegerType(),True),
    StructField("ball",IntegerType(),True),
    StructField("batsman",StringType(),True),
    StructField("bowler",StringType(),True),
    StructField("non_striker",StringType(),True),
    StructField("batsman_runs",IntegerType(),False),
    StructField("extra_runs",IntegerType(),True),
    StructField("total_runs",IntegerType(),True),
    StructField("extra_type",StringType(),True),
    StructField("is_wicket",IntegerType(),True),
    StructField("player_dismissed",StringType(),True),
    StructField("dismissal_kind",StringType(),True),
    StructField("fielder",StringType(),True),

])

In [7]:
dfDeliver=spark.read.format("csv").option("header",True).schema(SchemaDeliver).load("/content/sample_data/deliveries.csv")
dfDeliver.show()
dfDeliver.printSchema()

+------+------+--------------------+--------------------+----+----+-----------+---------+-----------+------------+----------+----------+----------+---------+----------------+--------------+-------+
|    id|inning|        batting_team|        bowling_team|over|ball|    batsman|   bowler|non_striker|batsman_runs|extra_runs|total_runs|extra_type|is_wicket|player_dismissed|dismissal_kind|fielder|
+------+------+--------------------+--------------------+----+----+-----------+---------+-----------+------------+----------+----------+----------+---------+----------------+--------------+-------+
|335982|     1|Kolkata Knight Ri...|Royal Challengers...|   0|   1| SC Ganguly|  P Kumar|BB McCullum|           0|         1|         1|   legbyes|        0|              NA|            NA|     NA|
|335982|     1|Kolkata Knight Ri...|Royal Challengers...|   0|   2|BB McCullum|  P Kumar| SC Ganguly|           0|         0|         0|      NULL|        0|              NA|            NA|     NA|
|335982|  

In [8]:
from pyspark.sql.functions import countDistinct,count,round,avg,col

### First Part

**Total Matches Played per Season**

In [9]:
tmdf= dfMatch.groupBy("season").agg(countDistinct("id").alias("count"))
tmdf.sort("count",ascending=False).show()

+-------+-----+
| season|count|
+-------+-----+
|   2013|   76|
|   2012|   74|
|   2022|   74|
|   2023|   74|
|   2011|   73|
|   2024|   71|
|2009/10|   60|
|   2016|   60|
|   2019|   60|
|   2014|   60|
|2020/21|   60|
|   2018|   60|
|   2021|   60|
|   2017|   59|
|   2015|   59|
|2007/08|   58|
|   2009|   57|
+-------+-----+



**Win Percentage of Teams**

In [10]:
matches_played = dfMatch.groupBy("team1").agg(count("id").alias("matches_played")) \
    .union(dfMatch.groupBy("team2").agg(count("id").alias("matches_played"))) \
    .groupBy("team1").sum("matches_played").withColumnRenamed("sum(matches_played)", "no_matches_played").withColumnRenamed("team1", "team")

In [11]:
matches_won= dfMatch.groupBy("winner").agg(count("winner").alias("matches_won")).withColumnRenamed("winner", "team")

In [12]:
dfWinPer = matches_played.join(matches_won, matches_played.team == matches_won.team,"left").drop(matches_won.team).fillna(0)
dfWinPer.withColumn("win_percentage", (dfWinPer["matches_won"] / dfWinPer["no_matches_played"]) * 100).withColumn("win_percentage", round("win_percentage", 2)).drop("matches_won").drop("no_matches_played").show()

+--------------------+--------------+
|                team|win_percentage|
+--------------------+--------------+
| Sunrisers Hyderabad|         48.35|
|Lucknow Super Giants|         54.55|
| Chennai Super Kings|         57.98|
|      Gujarat Titans|         62.22|
|Royal Challengers...|         46.67|
|Rising Pune Super...|          62.5|
|     Deccan Chargers|         38.67|
|Kochi Tuskers Kerala|         42.86|
|    Rajasthan Royals|         50.68|
|       Gujarat Lions|         43.33|
|Royal Challengers...|         48.33|
|Kolkata Knight Ri...|         52.19|
|Rising Pune Super...|         35.71|
|     Kings XI Punjab|         46.32|
|        Punjab Kings|         42.86|
|       Pune Warriors|         26.09|
|    Delhi Daredevils|         41.61|
|      Delhi Capitals|         52.75|
|      Mumbai Indians|         55.17|
+--------------------+--------------+



**Average Winning Margin (Runs & Wickets)**

In [13]:
# Tied matches are labelled "tie" (not "ties") in `result`; they have no margin,
# so exclude them instead of letting them appear with a NULL average.
fdMargin = dfMatch.filter(dfMatch.result != "tie").groupBy("winner","result").avg("result_margin").withColumnRenamed("avg(result_margin)", "avg_margin")\
    .withColumn("avg_margin", round("avg_margin", 2)).sort("winner","result")
fdMargin.show()

+--------------------+-------+----------+
|              winner| result|avg_margin|
+--------------------+-------+----------+
| Chennai Super Kings|   runs|     34.94|
| Chennai Super Kings|wickets|      6.03|
|     Deccan Chargers|   runs|     23.39|
|     Deccan Chargers|wickets|      6.55|
|      Delhi Capitals|   runs|      24.0|
|      Delhi Capitals|wickets|      5.67|
|    Delhi Daredevils|   runs|      27.0|
|    Delhi Daredevils|wickets|      6.55|
|       Gujarat Lions|   runs|       1.0|
|       Gujarat Lions|wickets|      5.42|
|      Gujarat Titans|   runs|     34.18|
|      Gujarat Titans|wickets|      5.76|
|     Kings XI Punjab|   runs|     25.85|
|     Kings XI Punjab|wickets|      6.42|
|Kochi Tuskers Kerala|   runs|      11.5|
|Kochi Tuskers Kerala|wickets|       7.5|
|Kolkata Knight Ri...|   runs|     33.59|
+--------------------+-------+----

In [14]:
avg_runs_margin = dfMatch.filter(col("result") == "runs") \
    .agg(avg("result_margin").alias("avg_win_margin_runs"))

avg_wickets_margin = dfMatch.filter(col("result") == "wickets") \
    .agg(avg("result_margin").alias("avg_win_margin_wickets"))

avg_runs_margin.show()
avg_wickets_margin.show()

+-------------------+
|avg_win_margin_runs|
+-------------------+
| 30.104417670682732|
+-------------------+

+----------------------+
|avg_win_margin_wickets|
+----------------------+
|     6.192041522491349|
+----------------------+



**Batting First vs Bowling First Win Percentage**

In [15]:
fdwinpbb = dfMatch.groupBy("winner","toss_decision").count().sort("winner","toss_decision")
fdwinpbb = fdwinpbb.withColumnRenamed("count", "no_of_wins").withColumnRenamed("winner", "team")
fdwinpbb.show()

+--------------------+-------------+----------+
|                team|toss_decision|no_of_wins|
+--------------------+-------------+----------+
| Chennai Super Kings|          bat|        63|
| Chennai Super Kings|        field|        75|
|     Deccan Chargers|          bat|        14|
|     Deccan Chargers|        field|        15|
|      Delhi Capitals|          bat|        13|
|      Delhi Capitals|        field|        35|
|    Delhi Daredevils|          bat|        29|
|    Delhi Daredevils|        field|        38|
|       Gujarat Lions|          bat|         2|
|       Gujarat Lions|        field|        11|
|      Gujarat Titans|          bat|         9|
|      Gujarat Titans|        field|        19|
|     Kings XI Punjab|          bat|        24|
|     Kings XI Punjab|        field|        64|
|Kochi Tuskers Kerala|        field|         6|
|Kolkata Knight Ri...|          bat|        50|
|Kolkata Knight Ri...|        field|        81|
|Lucknow Super Giants|          bat|    

In [16]:
fdwin= fdwinpbb.join(matches_won, fdwinpbb.team == matches_won.team,"left").drop(matches_won.team).fillna(0).sort("team","toss_decision")
fdwin.withColumn("win_percentage", (fdwin["no_of_wins"] / fdwin["matches_won"]) * 100).withColumn("win_percentage", round("win_percentage", 2)).show()

+--------------------+-------------+----------+-----------+--------------+
|                team|toss_decision|no_of_wins|matches_won|win_percentage|
+--------------------+-------------+----------+-----------+--------------+
| Chennai Super Kings|          bat|        63|        138|         45.65|
| Chennai Super Kings|        field|        75|        138|         54.35|
|     Deccan Chargers|          bat|        14|         29|         48.28|
|     Deccan Chargers|        field|        15|         29|         51.72|
|      Delhi Capitals|          bat|        13|         48|         27.08|
|      Delhi Capitals|        field|        35|         48|         72.92|
|    Delhi Daredevils|          bat|        29|         67|         43.28|
|    Delhi Daredevils|        field|        38|         67|         56.72|
|       Gujarat Lions|          bat|         2|         13|         15.38|
|       Gujarat Lions|        field|        11|         13|         84.62|
|      Gujarat Titans|   
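
The table above splits each team's wins by the toss winner's decision, which is not quite the same as whether the winning team itself batted first (the winner may have lost the toss). A minimal plain-Python sketch of one way to derive the side that batted first is shown below. The helper `batting_first_team` and the toy rows are hypothetical and only mirror the `team1`, `team2`, `toss_winner`, `toss_decision` and `winner` columns; the same logic could presumably be expressed with `when(...)` on `dfMatch`.

```python
def batting_first_team(row):
    """Return the team that batted first, derived from the toss columns."""
    # The side that did not win the toss.
    other = row["team2"] if row["toss_winner"] == row["team1"] else row["team1"]
    # The toss winner bats first only if it chose "bat".
    return row["toss_winner"] if row["toss_decision"] == "bat" else other

# Toy rows, invented for illustration only (not taken from matches.csv).
toy_matches = [
    {"team1": "A", "team2": "B", "toss_winner": "A", "toss_decision": "bat", "winner": "A"},
    {"team1": "A", "team2": "B", "toss_winner": "B", "toss_decision": "field", "winner": "B"},
    {"team1": "C", "team2": "A", "toss_winner": "C", "toss_decision": "field", "winner": "A"},
    {"team1": "B", "team2": "C", "toss_winner": "C", "toss_decision": "bat", "winner": "B"},
]

# Keep only matches with a winner that is one of the two sides (drops no-results).
decided = [m for m in toy_matches if m["winner"] in (m["team1"], m["team2"])]
bat_first_wins = sum(1 for m in decided if m["winner"] == batting_first_team(m))
bat_first_win_pct = bat_first_wins / len(decided) * 100
field_first_win_pct = 100 - bat_first_win_pct

print(f"Batting first: {bat_first_win_pct:.2f}%, fielding first: {field_first_win_pct:.2f}%")
```

With this definition the two percentages always sum to 100 over decided matches, which makes the comparison easier to read than per-team shares of wins.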

### Second Part Batter

**Top Run Scorers (Most Runs by Batter)**

In [17]:
dfDeliver.groupBy("batsman").sum("batsman_runs").withColumnRenamed("sum(batsman_runs)", "total_runs").sort(col("total_runs").desc()).limit(1).show()

+-------+----------+
|batsman|total_runs|
+-------+----------+
|V Kohli|      8014|
+-------+----------+



**Batting Strike Rate (Player & Team Level)**

In [18]:
dfTotalRun=dfDeliver.groupBy("batsman").sum("batsman_runs").withColumnRenamed("sum(batsman_runs)", "total_runs")


In [19]:
dfTotalBallFace=dfDeliver.groupBy("batsman").count().withColumnRenamed("count", "total_balls_faced")

In [20]:
PStrikeRate=dfTotalRun.join(dfTotalBallFace,dfTotalRun.batsman==dfTotalBallFace.batsman).drop(dfTotalBallFace.batsman)
PStrikeRate.withColumn("strike_rate", (PStrikeRate["total_runs"] / PStrikeRate["total_balls_faced"]) * 100).withColumn("strike_rate", round("strike_rate", 2))\
    .orderBy("strike_rate",ascending=False).show()

+----------+---------------+-----------------+-----------+
|total_runs|        batsman|total_balls_faced|strike_rate|
+----------+---------------+-----------------+-----------+
|         9|         L Wood|                3|      300.0|
|         5|     B Stanlake|                2|      250.0|
|       330|J Fraser-McGurk|              150|      220.0|
|        13|  R Sai Kishore|                6|     216.67|
|        39|       Umar Gul|               19|     205.26|
|         4|       RS Sodhi|                2|      200.0|
|        81|  Shahid Afridi|               46|     176.09|
|         7|     I Malhotra|                4|      175.0|
|       230|       WG Jacks|              133|     172.93|
|       653|        PD Salt|              385|     169.61|
|       405|       T Stubbs|              239|     169.46|
|       115|     R Shepherd|               68|     169.12|
|       772|        TM Head|              458|     168.56|
|       106|      LJ Wright|               63|     168.2

In [21]:
dfTotalRunTeam=dfDeliver.groupBy("batting_team").sum("batsman_runs").withColumnRenamed("sum(batsman_runs)", "total_runs")
dfTotalBallFaceTeam=dfDeliver.groupBy("batting_team").count().withColumnRenamed("count", "total_balls_faced")
TStrikeRate=dfTotalRunTeam.join(dfTotalBallFaceTeam,dfTotalRunTeam.batting_team==dfTotalBallFaceTeam.batting_team).drop(dfTotalBallFaceTeam.batting_team)
TStrikeRate.withColumn("strike_rate", (TStrikeRate["total_runs"] / TStrikeRate["total_balls_faced"]) * 100).withColumn("strike_rate", round("strike_rate", 2))\
    .orderBy("strike_rate",ascending=False).show()

+----------+--------------------+-----------------+-----------+
|total_runs|        batting_team|total_balls_faced|strike_rate|
+----------+--------------------+-----------------+-----------+
|      2789|Royal Challengers...|             1818|     153.41|
|      7357|      Gujarat Titans|             5494|     133.91|
|      9042|        Punjab Kings|             6833|     132.33|
|      7081|Lucknow Super Giants|             5400|     131.13|
|     14229|      Delhi Capitals|            10946|     129.99|
|      4629|       Gujarat Lions|             3566|     129.81|
|     36739| Chennai Super Kings|            28651|     128.23|
|     39946|      Mumbai Indians|            31437|     127.07|
|     35810|Royal Challengers...|            28205|     126.96|
|     27641| Sunrisers Hyderabad|            21843|     126.54|
|     33074|    Rajasthan Royals|            26242|     126.03|
|     28541|     Kings XI Punjab|            22646|     126.03|
|     37149|Kolkata Knight Ri...|       

In [22]:
from pyspark.sql.functions import sum as _sum

In [23]:
PlayerStrikeRate = dfDeliver.groupBy("batsman").agg(
    count("ball").alias("total_balls_faced"),
    _sum("batsman_runs").alias("total_runs_scored")
).withColumn("strike_rate", (col("total_runs_scored") / col("total_balls_faced")) *100).withColumn("strike_rate", round("strike_rate", 2)).sort("strike_rate",ascending=False)
PlayerStrikeRate.show()

+---------------+-----------------+-----------------+-----------+
|        batsman|total_balls_faced|total_runs_scored|strike_rate|
+---------------+-----------------+-----------------+-----------+
|         L Wood|                3|                9|      300.0|
|     B Stanlake|                2|                5|      250.0|
|J Fraser-McGurk|              150|              330|      220.0|
|  R Sai Kishore|                6|               13|     216.67|
|       Umar Gul|               19|               39|     205.26|
|       RS Sodhi|                2|                4|      200.0|
|  Shahid Afridi|               46|               81|     176.09|
|     I Malhotra|                4|                7|      175.0|
|       WG Jacks|              133|              230|     172.93|
|        PD Salt|              385|              653|     169.61|
|       T Stubbs|              239|              405|     169.46|
|     R Shepherd|               68|              115|     169.12|
|        T

In [24]:
TeamStrikeRate = dfDeliver.groupBy("batting_team")\
    .agg(
        count("ball").alias("total_balls_faced"),
        _sum("batsman_runs").alias("total_runs_scored")
    )\
    .withColumn("strike_rate", (col("total_runs_scored") / col("total_balls_faced")) * 100)\
    .withColumn("strike_rate", round("strike_rate", 2))\
    .sort("strike_rate",ascending=False)
TeamStrikeRate.show()

+--------------------+-----------------+-----------------+-----------+
|        batting_team|total_balls_faced|total_runs_scored|strike_rate|
+--------------------+-----------------+-----------------+-----------+
|Royal Challengers...|             1818|             2789|     153.41|
|      Gujarat Titans|             5494|             7357|     133.91|
|        Punjab Kings|             6833|             9042|     132.33|
|Lucknow Super Giants|             5400|             7081|     131.13|
|      Delhi Capitals|            10946|            14229|     129.99|
|       Gujarat Lions|             3566|             4629|     129.81|
| Chennai Super Kings|            28651|            36739|     128.23|
|      Mumbai Indians|            31437|            39946|     127.07|
|Royal Challengers...|            28205|            35810|     126.96|
| Sunrisers Hyderabad|            21843|            27641|     126.54|
|    Rajasthan Royals|            26242|            33074|     126.03|
|     
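
The strike-rate cells above count every delivery as a ball faced. By the usual cricket convention a wide is not a ball faced by the batter, so the figures above are likely slightly understated. The sketch below shows the adjustment in plain Python on toy rows; the rows are invented, and `"wides"` as the `extra_type` label is an assumption based on the `legbyes` value visible in the output (check the actual labels in `deliveries.csv`).

```python
from collections import defaultdict

# Toy deliveries, invented for illustration; "wides" as a label is an assumption.
toy_deliveries = [
    {"batsman": "X", "batsman_runs": 4, "extra_type": None},
    {"batsman": "X", "batsman_runs": 0, "extra_type": "wides"},
    {"batsman": "X", "batsman_runs": 1, "extra_type": None},
    {"batsman": "Y", "batsman_runs": 6, "extra_type": None},
    {"batsman": "Y", "batsman_runs": 0, "extra_type": "legbyes"},
]

def strike_rates(deliveries):
    """Strike rate per batsman, counting only deliveries that are not wides."""
    runs = defaultdict(int)
    balls = defaultdict(int)
    for d in deliveries:
        runs[d["batsman"]] += d["batsman_runs"]
        if d["extra_type"] != "wides":
            balls[d["batsman"]] += 1
    # Batsmen with no legal ball faced are left out to avoid dividing by zero.
    return {b: runs[b] / balls[b] * 100 for b in balls}

rates = strike_rates(toy_deliveries)
print(rates)
```

On the Spark side the same filter could be applied before the `groupBy`, for example by counting only rows where `extra_type` is not the wide label.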

**Average Runs per Over by Team**

In [25]:
dfAvgRunsPerOver = dfDeliver.groupBy("batting_team","id","over").sum("batsman_runs").withColumnRenamed("sum(batsman_runs)", "total_runs")\
.groupBy("batting_team").agg(_sum("total_runs").alias("total_runs"), count("over").alias("total_overs"))\
.withColumn("avg_runs_per_over", col("total_runs") /col("total_overs"))\
.withColumn("avg_runs_per_over", round("avg_runs_per_over", 2)).sort("avg_runs_per_over",ascending=False )
# show() returns None, so call it separately to keep the DataFrame in dfAvgRunsPerOver.
dfAvgRunsPerOver.show()


+--------------------+----------+-----------+-----------------+
|        batting_team|total_runs|total_overs|avg_runs_per_over|
+--------------------+----------+-----------+-----------------+
|Royal Challengers...|      2789|        290|             9.62|
|      Gujarat Titans|      7357|        887|             8.29|
|        Punjab Kings|      9042|       1093|             8.27|
|Lucknow Super Giants|      7081|        863|             8.21|
|      Delhi Capitals|     14229|       1770|             8.04|
|       Gujarat Lions|      4629|        577|             8.02|
| Chennai Super Kings|     36739|       4628|             7.94|
|      Mumbai Indians|     39946|       5070|             7.88|
| Sunrisers Hyderabad|     27641|       3526|             7.84|
|Royal Challengers...|     35810|       4566|             7.84|
|Kolkata Knight Ri...|     37149|       4768|             7.79|
|     Kings XI Punjab|     28541|       3666|             7.79|
|    Rajasthan Royals|     33074|       

### Third Part Bowler

**Top 5 Wicket Takers**

In [26]:
dfWicket=dfDeliver.groupBy("bowler").sum("is_wicket").withColumnRenamed("sum(is_wicket)", "total_wickets").sort(col("total_wickets").desc()).limit(5)
# show() returns None, so keep the DataFrame in dfWicket and display it separately.
dfWicket.show()

+---------+-------------+
|   bowler|total_wickets|
+---------+-------------+
|YS Chahal|          213|
| DJ Bravo|          207|
|PP Chawla|          201|
|SP Narine|          200|
| R Ashwin|          198|
+---------+-------------+



**Bowling Economy (Runs Conceded per Over)**

In [27]:
dfTotalBolRun= dfDeliver.groupBy("bowler").agg(_sum("total_runs").alias("total_runs"), count("ball").alias("count"))\
.withColumnRenamed("count", "total_overs").withColumn("total_overs", col("total_overs")/6).withColumn("economy_rate", col("total_runs") / col("total_overs"))\
.withColumn("economy_rate", round("economy_rate", 2)).sort("economy_rate",ascending=False).limit(5)
dfTotalBolRun.show()

+-------------+----------+-------------------+------------+
|       bowler|total_runs|        total_overs|economy_rate|
+-------------+----------+-------------------+------------+
|  YBK Jaiswal|         6|0.16666666666666666|        36.0|
|Atharva Taide|         4|0.16666666666666666|        24.0|
|   I Malhotra|        23|                1.0|        23.0|
|    LPC Silva|        21|                1.0|        21.0|
|     B Chipli|        20|                1.0|        20.0|
+-------------+----------+-------------------+------------+



**Dot Ball Percentage (Dot Balls / Total Balls Bowled)**

In [28]:
from pyspark.sql.functions import when

In [29]:
# Alias both aggregates directly instead of renaming Spark's auto-generated column names.
dfTotalDot= dfDeliver.groupBy("bowler").agg(count("ball").alias("total_balls_bowled"),count(when(dfDeliver.batsman_runs == 0, True)).alias("dot_balls"))\
.withColumn("dot_ball_percentage", (col("dot_balls")/ col("total_balls_bowled") * 100)).withColumn("dot_ball_percentage", round("dot_ball_percentage", 2))
dfTotalDot.show()

+---------------+------------------+---------+-------------------+
|         bowler|total_balls_bowled|dot_balls|dot_ball_percentage|
+---------------+------------------+---------+-------------------+
|     TM Dilshan|               275|       85|              30.91|
|  Kuldeep Yadav|              1786|      579|              32.42|
| M Muralitharan|              1581|      695|              43.96|
|  LA Carseldine|                 7|        5|              71.43|
|        J Botha|               709|      283|              39.92|
|     KA Pollard|              1586|      518|              32.66|
|       DR Smith|               557|      194|              34.83|
| Jaskaran Singh|               111|       49|              44.14|
|     A Flintoff|                66|       21|              31.82|
|       M Manhas|                42|       15|              35.71|
|      GR Napier|                24|        9|               37.5|
|          B Lee|               916|      435|              47

In [30]:
dfDeliver.filter(dfDeliver.batsman_runs != 0).groupBy("bowler").count().show()

+---------------+-----+
|         bowler|count|
+---------------+-----+
|     TM Dilshan|  190|
|  Kuldeep Yadav| 1207|
| M Muralitharan|  886|
|  LA Carseldine|    2|
|        J Botha|  426|
|     KA Pollard| 1068|
|       DR Smith|  363|
| Jaskaran Singh|   62|
|     A Flintoff|   45|
|       M Manhas|   27|
|      GR Napier|   15|
|          B Lee|  481|
|     D du Preez|   25|
|    BMAJ Mendis|   19|
|       AR Patel| 2038|
|       SA Yadav|    4|
|NM Coulter-Nile|  473|
|Mohammad Hafeez|   35|
|      LPC Silva|    6|
|     AL Menaria|   70|
+---------------+-----+
only showing top 20 rows



## Advanced

**Clutch Performance Index (CPI) of Players**

In [31]:
# `over` is zero-based in this data, so over >= 15 selects overs 16-20.
BatRunastfive= dfDeliver.groupBy("batsman").agg(_sum("batsman_runs").alias("total_runs"), _sum(when(dfDeliver.over >= 15, dfDeliver.batsman_runs)).alias("high_runs"))\
.withColumn("high_run_percentage", (col("high_runs")/ col("total_runs") * 100)).withColumn("high_run_percentage", round("high_run_percentage", 2))\
.sort("total_runs",ascending=False)
# show() returns None, so call it separately to keep the DataFrame.
BatRunastfive.show()

+--------------+----------+---------+-------------------+
|       batsman|total_runs|high_runs|high_run_percentage|
+--------------+----------+---------+-------------------+
|       V Kohli|      8014|     1469|              18.33|
|      S Dhawan|      6769|      668|               9.87|
|     RG Sharma|      6630|     1513|              22.82|
|     DA Warner|      6567|      628|               9.56|
|      SK Raina|      5536|      899|              16.24|
|      MS Dhoni|      5243|     3292|              62.79|
|AB de Villiers|      5181|     1868|              36.05|
|      CH Gayle|      4997|      581|              11.63|
|    RV Uthappa|      4954|      567|              11.45|
|    KD Karthik|      4843|     1904|              39.31|
|      KL Rahul|      4689|      752|              16.04|
|     AM Rahane|      4642|      509|              10.97|
|  F du Plessis|      4571|      568|              12.43|
|     SV Samson|      4419|      780|              17.65|
|     AT Rayud

In [32]:
BollastFive= dfDeliver.groupBy("bowler").agg(_sum("is_wicket").alias("total_wickets"), _sum(when(dfDeliver.over >=15, dfDeliver.is_wicket)).alias("high_wickets"))\
.withColumn("high_wicket_percentage", (col("high_wickets")/ col("total_wickets") * 100)).withColumn("high_wicket_percentage", round("high_wicket_percentage", 2))\
.sort("total_wickets",ascending=False)
# show() returns None, so call it separately to keep the DataFrame.
BollastFive.show()

+---------------+-------------+------------+----------------------+
|         bowler|total_wickets|high_wickets|high_wicket_percentage|
+---------------+-------------+------------+----------------------+
|      YS Chahal|          213|          63|                 29.58|
|       DJ Bravo|          207|         130|                  62.8|
|      PP Chawla|          201|          41|                  20.4|
|      SP Narine|          200|          83|                  41.5|
|       R Ashwin|          198|          40|                  20.2|
|        B Kumar|          195|         108|                 55.38|
|     SL Malinga|          188|         122|                 64.89|
|       A Mishra|          183|          45|                 24.59|
|      JJ Bumrah|          182|         101|                 55.49|
|      RA Jadeja|          169|          32|                 18.93|
|       UT Yadav|          163|          65|                 39.88|
|Harbhajan Singh|          161|          28|    

**Toss Impact Factor**

In [33]:
tossWinwin= dfMatch.groupBy("winner").agg(count("winner").alias("no_of_wins"),count(when(dfMatch.toss_winner == dfMatch.winner, True)).alias("toss_win"))\
.withColumn("toss_win_percentage", (col("toss_win")/ col("no_of_wins") * 100)).withColumn("toss_win_percentage", round("toss_win_percentage", 2))
tossWinwin.show()

+--------------------+----------+--------+-------------------+
|              winner|no_of_wins|toss_win|toss_win_percentage|
+--------------------+----------+--------+-------------------+
| Sunrisers Hyderabad|        88|      38|              43.18|
|Lucknow Super Giants|        24|      10|              41.67|
| Chennai Super Kings|       138|      75|              54.35|
|      Gujarat Titans|        28|      14|               50.0|
|                  NA|         5|       0|                0.0|
|Royal Challengers...|         7|       4|              57.14|
|Rising Pune Super...|        10|       5|               50.0|
|     Deccan Chargers|        29|      19|              65.52|
|Kochi Tuskers Kerala|         6|       4|              66.67|
|    Rajasthan Royals|       112|      60|              53.57|
|       Gujarat Lions|        13|      10|              76.92|
|Royal Challengers...|       116|      57|              49.14|
|Kolkata Knight Ri...|       131|      68|             
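
The cell above reports, for each team, the share of its wins that came in matches where it also won the toss. The Toss Impact Factor as defined in the KPI list is a different quantity: win percentage when the toss was won minus win percentage when it was lost. A minimal plain-Python sketch of that formula follows; the function name `toss_impact_factor` and the toy rows are hypothetical and only mirror the `matches.csv` columns.

```python
def toss_impact_factor(matches, team):
    """Win % when `team` won the toss minus win % when it lost the toss."""
    played = [m for m in matches if team in (m["team1"], m["team2"])]
    toss_won = [m for m in played if m["toss_winner"] == team]
    toss_lost = [m for m in played if m["toss_winner"] != team]

    def win_pct(group):
        # None signals that the team has no matches in this group.
        if not group:
            return None
        return sum(1 for m in group if m["winner"] == team) / len(group) * 100

    won_pct, lost_pct = win_pct(toss_won), win_pct(toss_lost)
    if won_pct is None or lost_pct is None:
        return None
    return won_pct - lost_pct

# Toy rows, invented for illustration only.
toy_matches = [
    {"team1": "A", "team2": "B", "toss_winner": "A", "winner": "A"},
    {"team1": "A", "team2": "B", "toss_winner": "A", "winner": "B"},
    {"team1": "A", "team2": "C", "toss_winner": "C", "winner": "C"},
    {"team1": "A", "team2": "C", "toss_winner": "C", "winner": "C"},
]

print(toss_impact_factor(toy_matches, "A"))
```

A positive value suggests the team does better after winning the toss; a value near zero suggests the toss matters little for that team.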

**Impact of Toss Decision on Match Outcome**

In [36]:
tossField= dfMatch.filter((dfMatch.toss_decision == "field") & (dfMatch.winner == dfMatch.toss_winner)).count()
tossBat= dfMatch.filter((dfMatch.toss_decision == "bat") & (dfMatch.winner == dfMatch.toss_winner)).count()
total_matches = dfMatch.count()
# Share of all matches in which the toss winner chose to bat (or field) and went on to win.
bat_first_win_pct = (tossBat / total_matches) * 100
field_first_win_pct = (tossField / total_matches) * 100
print(f"Bat First Win Percentage: {bat_first_win_pct:.2f}%")
print(f"Field First Win Percentage: {field_first_win_pct:.2f}%")

Bat First Win Percentage: 16.16%
Field First Win Percentage: 34.43%


**Home Advantage Analysis**

**Boundary Frequency (Per Over & Per Batter)**
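
This KPI is not implemented in the notebook yet. As a starting point, here is a minimal plain-Python sketch of the per-batter formula from the KPI list, `Total Boundaries (4s & 6s) / Total Balls Faced`. The function name and the toy rows are hypothetical; like the strike-rate cells above, it treats every delivery as a ball faced.

```python
def boundary_frequency(deliveries, batsman):
    """Fraction of a batsman's deliveries that were hit for 4 or 6."""
    faced = [d for d in deliveries if d["batsman"] == batsman]
    if not faced:
        return None
    boundaries = sum(1 for d in faced if d["batsman_runs"] in (4, 6))
    return boundaries / len(faced)

# Toy deliveries, invented for illustration only.
toy_deliveries = [
    {"batsman": "X", "batsman_runs": 4},
    {"batsman": "X", "batsman_runs": 0},
    {"batsman": "X", "batsman_runs": 6},
    {"batsman": "X", "batsman_runs": 1},
    {"batsman": "Y", "batsman_runs": 2},
]

print(boundary_frequency(toy_deliveries, "X"))
```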

**Dot Ball Pressure Index (DBPI)**

**Batting Consistency Score**
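
This KPI is not implemented in the notebook yet. The KPI list describes it as the standard deviation of a player's scores, with a lower value meaning more consistent. A minimal plain-Python sketch is shown below; it assumes "scores" means runs per match (grouped by the match `id`) and uses the population standard deviation, both of which are interpretation choices rather than something the notebook specifies. The function name and the toy rows are hypothetical.

```python
from collections import defaultdict
from statistics import pstdev

def consistency_score(deliveries, batsman):
    """Population standard deviation of a batsman's per-match run totals."""
    per_match = defaultdict(int)
    for d in deliveries:
        if d["batsman"] == batsman:
            per_match[d["id"]] += d["batsman_runs"]
    scores = list(per_match.values())
    return pstdev(scores) if scores else None

# Toy deliveries, invented for illustration only.
toy_deliveries = [
    {"id": 1, "batsman": "X", "batsman_runs": 4},
    {"id": 1, "batsman": "X", "batsman_runs": 6},
    {"id": 2, "batsman": "X", "batsman_runs": 6},
    {"id": 2, "batsman": "X", "batsman_runs": 4},
    {"id": 1, "batsman": "Y", "batsman_runs": 0},
    {"id": 2, "batsman": "Y", "batsman_runs": 6},
    {"id": 2, "batsman": "Y", "batsman_runs": 6},
    {"id": 2, "batsman": "Y", "batsman_runs": 4},
    {"id": 2, "batsman": "Y", "batsman_runs": 4},
]

print(consistency_score(toy_deliveries, "X"), consistency_score(toy_deliveries, "Y"))
```

Here X scores 10 in both matches (perfectly consistent), while Y scores 0 and 20 (streaky).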

**Explosiveness Rating (Powerplay & Death Overs Performance)**

**Death Over Economy Rate**
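
This KPI is not implemented in the notebook yet. A minimal plain-Python sketch of the formula from the KPI list, `Total Runs Conceded in Overs 16-20 / Total Overs Bowled in 16-20`, is shown below. It follows the conventions of the Bowling Economy cell above (`total_runs` as runs conceded, deliveries divided by 6 as overs) and assumes `over` is zero-based, as the `deliveries.csv` output suggests, so overs 16-20 correspond to `over >= 15`. The function name and the toy rows are hypothetical.

```python
def death_over_economy(deliveries, bowler, balls_per_over=6):
    """Runs conceded per over by `bowler` in overs 16-20 (zero-based over >= 15)."""
    death = [d for d in deliveries if d["bowler"] == bowler and d["over"] >= 15]
    if not death:
        return None
    runs = sum(d["total_runs"] for d in death)
    overs = len(death) / balls_per_over
    return runs / overs

# Toy deliveries, invented for illustration only.
toy_deliveries = [
    {"bowler": "P", "over": 3, "total_runs": 6},   # outside the death overs, ignored
    {"bowler": "P", "over": 19, "total_runs": 1},
    {"bowler": "P", "over": 19, "total_runs": 0},
    {"bowler": "P", "over": 19, "total_runs": 4},
    {"bowler": "P", "over": 19, "total_runs": 6},
    {"bowler": "P", "over": 19, "total_runs": 0},
    {"bowler": "P", "over": 19, "total_runs": 1},
]

print(death_over_economy(toy_deliveries, "P"))
```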

**Bowler’s Pressure Index**

**Bowler’s Match Impact Factor**

**Wicket Conversion Rate (WCR)**

**Middle Order Stability Score**

**Net Run Rate (NRR) Trends Per Season**

**Match Closer Effectiveness**
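
This KPI is not implemented in the notebook yet. The KPI list defines a close match as a victory margin under 10 runs or under 2 wickets. A minimal plain-Python sketch of a team's win percentage in such matches is shown below; the function names and the toy rows are hypothetical and only mirror the `result`, `result_margin`, `winner`, `team1` and `team2` columns.

```python
def is_close(match):
    """True for wins by fewer than 10 runs or fewer than 2 wickets."""
    if match["result"] == "runs":
        return match["result_margin"] < 10
    if match["result"] == "wickets":
        return match["result_margin"] < 2
    # Ties and no-results carry no margin, so they are not counted here.
    return False

def close_match_win_pct(matches, team):
    """Win percentage of `team` across the close matches it played."""
    close = [m for m in matches if is_close(m) and team in (m["team1"], m["team2"])]
    if not close:
        return None
    return sum(1 for m in close if m["winner"] == team) / len(close) * 100

# Toy rows, invented for illustration only.
toy_matches = [
    {"team1": "A", "team2": "B", "result": "runs", "result_margin": 5, "winner": "A"},
    {"team1": "A", "team2": "B", "result": "wickets", "result_margin": 1, "winner": "B"},
    {"team1": "A", "team2": "C", "result": "runs", "result_margin": 40, "winner": "A"},
    {"team1": "A", "team2": "C", "result": "wickets", "result_margin": 6, "winner": "C"},
]

print(close_match_win_pct(toy_matches, "A"))
```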