# IPL Matches Data Analysis Using Spark

Let’s mine the IPL match data and derive some useful insights from it, such as which stadiums favour the team batting first and which favour the team bowling first (chasing).

The data set has 18 columns, in this order:

**id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3**


In [1]:
import findspark
findspark.init()
import pyspark

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('usecase_6').getOrCreate()

In [4]:
sc = spark.sparkContext

In [5]:
sc.setLogLevel('ERROR')

### IPL Data

In [8]:
# matches.csv has no header row, so Spark assigns the default names _c0.._c17
df = spark.read.format('csv').options(header=False, inferSchema=True).load('matches.csv')

In [9]:
df.show(3)

+---+----+----------+-------------------+--------------------+--------------------+--------------------+-----+------+---+--------------------+----+----+-----------+--------------------+---------+--------------+----+
|_c0| _c1|       _c2|                _c3|                 _c4|                 _c5|                 _c6|  _c7|   _c8|_c9|                _c10|_c11|_c12|       _c13|                _c14|     _c15|          _c16|_c17|
+---+----+----------+-------------------+--------------------+--------------------+--------------------+-----+------+---+--------------------+----+----+-----------+--------------------+---------+--------------+----+
|  1|2008| Bangalore|2008-04-18 00:00:00|Kolkata Knight Ri...|Royal Challengers...|Royal Challengers...|field|normal|  0|Kolkata Knight Ri...| 140|   0|BB McCullum|M Chinnaswamy Sta...|Asad Rauf|   RE Koertzen|null|
|  2|2008|Chandigarh|2008-04-19 00:00:00| Chennai Super Kings|     Kings XI Punjab| Chennai Super Kings|  bat|normal|  0| Chennai Super 

In [10]:
df.count()

577

In [11]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: integer (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: timestamp (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: integer (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: integer (nullable = true)
 |-- _c12: integer (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)



In [12]:
headers = ['id','season','city','date','team1','team2','toss_winner','toss_decision','result'
           ,'dl_applied','winner','win_by_runs','win_by_wickets','player_of_match','venue','umpire1'
           ,'umpire2','umpire3']

In [14]:
len(headers)

18

In [15]:
# Rename the default _c0.._c17 columns in one step, in the order given by headers
df = df.toDF(*headers)
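
As an alternative to renaming the columns after the read, the names and types could be supplied at read time. The sketch below only builds a DDL-style schema string from the column names in the data set description and the types shown by `printSchema()`. It assumes that your Spark version accepts a DDL string for the reader's `schema` argument, and that none of the column names need backtick quoting.

```python
# Column names from the data set description, paired with the types that
# Spark inferred in the printSchema() output.
columns = [
    ("id", "INT"), ("season", "INT"), ("city", "STRING"),
    ("date", "TIMESTAMP"), ("team1", "STRING"), ("team2", "STRING"),
    ("toss_winner", "STRING"), ("toss_decision", "STRING"),
    ("result", "STRING"), ("dl_applied", "INT"), ("winner", "STRING"),
    ("win_by_runs", "INT"), ("win_by_wickets", "INT"),
    ("player_of_match", "STRING"), ("venue", "STRING"),
    ("umpire1", "STRING"), ("umpire2", "STRING"), ("umpire3", "STRING"),
]

# Build a single string of the form "id INT, season INT, city STRING, ...".
schema_ddl = ", ".join(f"{name} {dtype}" for name, dtype in columns)
print(schema_ddl)

# Assumed usage (not executed here): pass the string to the reader instead
# of relying on inferSchema, for example
#   df = spark.read.csv('matches.csv', header=False, schema=schema_ddl)
```

With an explicit schema Spark does not need an extra pass over the file to infer types, and the column names are correct from the start.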

In [16]:
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- season: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- team1: string (nullable = true)
 |-- team2: string (nullable = true)
 |-- toss_winner: string (nullable = true)
 |-- toss_decision: string (nullable = true)
 |-- result: string (nullable = true)
 |-- dl_applied: integer (nullable = true)
 |-- winner: string (nullable = true)
 |-- win_by_runs: integer (nullable = true)
 |-- win_by_wickets: integer (nullable = true)
 |-- player_of_match: string (nullable = true)
 |-- venue: string (nullable = true)
 |-- umpire1: string (nullable = true)
 |-- umpire2: string (nullable = true)
 |-- umpire3: string (nullable = true)



In [17]:
df.show(3)

+---+------+----------+-------------------+--------------------+--------------------+--------------------+-------------+------+----------+--------------------+-----------+--------------+---------------+--------------------+---------+--------------+-------+
| id|season|      city|               date|               team1|               team2|         toss_winner|toss_decision|result|dl_applied|              winner|win_by_runs|win_by_wickets|player_of_match|               venue|  umpire1|       umpire2|umpire3|
+---+------+----------+-------------------+--------------------+--------------------+--------------------+-------------+------+----------+--------------------+-----------+--------------+---------------+--------------------+---------+--------------+-------+
|  1|  2008| Bangalore|2008-04-18 00:00:00|Kolkata Knight Ri...|Royal Challengers...|Royal Challengers...|        field|normal|         0|Kolkata Knight Ri...|        140|             0|    BB McCullum|M Chinnaswamy Sta...|Asad R

In [30]:
from pyspark.sql.functions import desc

### 1. Which stadium is best suited to batting first?

##### Total number of matches won at each stadium by the team batting first

In [38]:
# win_by_runs > 0 means the team batting first defended its total
matches_won_df = df.filter(df['win_by_runs'] > 0).groupBy('venue').count().orderBy(desc('count'))

In [39]:
matches_won_df.show(5,truncate=False)

+-------------------------------+-----+
|venue                          |count|
+-------------------------------+-----+
|MA Chidambaram Stadium, Chepauk|30   |
|Wankhede Stadium               |25   |
|M Chinnaswamy Stadium          |24   |
|Feroz Shah Kotla               |24   |
|Eden Gardens                   |22   |
+-------------------------------+-----+
only showing top 5 rows
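
The filter above relies on how the two margin columns encode the result: a win by runs means the side batting first won, and a win by wickets means the chasing side won. A minimal plain-Python sketch of that rule follows. The function name `classify_result` and its labels are illustrative assumptions, not part of the notebook, and it assumes both margins are non-null integers.

```python
def classify_result(win_by_runs, win_by_wickets):
    """Label a match by how it was won, using the two margin columns."""
    if win_by_runs > 0:
        return "batting_first"    # the side batting first defended its total
    if win_by_wickets > 0:
        return "bowling_first"    # the side bowling first chased the total
    return "no_margin"            # both margins are 0, e.g. a tie or no result


# Match id 1 from df.show(): win_by_runs=140, win_by_wickets=0
print(classify_result(140, 0))
```

Matches that fall into the third group are counted in the per-venue totals but in neither win count, so the two percentages for a venue need not add up to 100.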



In [42]:
matches_won_df.createOrReplaceTempView('matches_won_venue')

##### Total number of matches played at each stadium

In [40]:
all_matches_df = df.groupBy('venue').count().orderBy(desc('count'))

In [41]:
all_matches_df.show(5,truncate=False)

+-------------------------------+-----+
|venue                          |count|
+-------------------------------+-----+
|M Chinnaswamy Stadium          |58   |
|Eden Gardens                   |54   |
|Feroz Shah Kotla               |53   |
|Wankhede Stadium               |49   |
|MA Chidambaram Stadium, Chepauk|48   |
+-------------------------------+-----+
only showing top 5 rows



In [43]:
all_matches_df.createOrReplaceTempView('all_matches_venue')

In [49]:
successful_stadium = spark.sql("""
    SELECT m.venue, (m.count / a.count) * 100 AS win_percent
    FROM all_matches_venue a
    JOIN matches_won_venue m ON a.venue = m.venue
    ORDER BY win_percent DESC
""")

In [50]:
successful_stadium.show(5, truncate=False)

+---------------------------------------------------+-----------------+
|venue                                              |win_percent      |
+---------------------------------------------------+-----------------+
|Buffalo Park                                       |66.66666666666666|
|Vidarbha Cricket Association Stadium, Jamtha       |66.66666666666666|
|Subrata Roy Sahara Stadium                         |64.70588235294117|
|Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium|63.63636363636363|
|MA Chidambaram Stadium, Chepauk                    |62.5             |
+---------------------------------------------------+-----------------+
only showing top 5 rows
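
As a sanity check, the percentage can be recomputed by hand from the two top-5 tables. The sketch below uses only the counts printed earlier (batting-first wins and matches played) for the five venues that appear in both tables. The helper name `win_percent` is an assumption for illustration.

```python
def win_percent(wins, played):
    """Share of matches won, as a percentage (same formula as the SQL query)."""
    return wins / played * 100


# (batting-first wins, matches played), copied from the two outputs above
venues = {
    "MA Chidambaram Stadium, Chepauk": (30, 48),
    "Wankhede Stadium": (25, 49),
    "M Chinnaswamy Stadium": (24, 58),
    "Feroz Shah Kotla": (24, 53),
    "Eden Gardens": (22, 54),
}

for venue, (wins, played) in venues.items():
    print(f"{venue}: {win_percent(wins, played):.2f}%")
```

For MA Chidambaram Stadium, Chepauk this gives 30 / 48 * 100 = 62.5, which matches the `win_percent` reported by the query. The other rows at the top of the ranking may be based on far fewer matches. The output does not show their match counts, so their percentages should be read with caution.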



### 2. Which stadium is best suited to bowling first?

##### Total number of matches won at each stadium by the team bowling first

In [56]:
# win_by_wickets > 0 means the team bowling first chased the target successfully
matches_won_bowling_df = df.filter(df['win_by_wickets'] > 0).groupBy('venue').count().orderBy(desc('count'))

In [57]:
matches_won_bowling_df.show(5,truncate=False)

+-----------------------------------------+-----+
|venue                                    |count|
+-----------------------------------------+-----+
|Eden Gardens                             |32   |
|M Chinnaswamy Stadium                    |31   |
|Feroz Shah Kotla                         |28   |
|Rajiv Gandhi International Stadium, Uppal|26   |
|Wankhede Stadium                         |24   |
+-----------------------------------------+-----+
only showing top 5 rows



In [58]:
matches_won_bowling_df.createOrReplaceTempView('matches_won_bowling')

In [59]:
successful_stadium_bowling = spark.sql("""
    SELECT m.venue, (m.count / a.count) * 100 AS win_percent
    FROM all_matches_venue a
    JOIN matches_won_bowling m ON a.venue = m.venue
    ORDER BY win_percent DESC
""")

In [60]:
successful_stadium_bowling.show(5, truncate=False)

+--------------------------------------+-----------------+
|venue                                 |win_percent      |
+--------------------------------------+-----------------+
|Holkar Cricket Stadium                |100.0            |
|Green Park                            |100.0            |
|Saurashtra Cricket Association Stadium|80.0             |
|JSCA International Stadium Complex    |71.42857142857143|
|Sawai Mansingh Stadium                |69.6969696969697 |
+--------------------------------------+-----------------+
only showing top 5 rows
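
The 100.0 values for Holkar Cricket Stadium and Green Park suggest that a ranking by raw percentage can be dominated by venues that hosted only a few matches. The inner join also drops any venue with no wins of this kind. One possible refinement, sketched here in plain Python, is to require a minimum number of matches and to treat missing win counts as zero. The function `rank_venues`, the threshold of 10 matches and the row named "Hypothetical Ground" are all assumptions made for illustration. The other counts are copied from the outputs above.

```python
def rank_venues(wins_by_venue, played_by_venue, min_matches=10):
    """Rank venues by win percentage, skipping venues with too few matches."""
    ranked = []
    for venue, played in played_by_venue.items():
        if played < min_matches:
            continue  # too few matches for the percentage to be meaningful
        wins = wins_by_venue.get(venue, 0)  # a venue with no such wins counts as 0
        ranked.append((venue, wins / played * 100))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Matches played and bowling-first wins from the outputs above, plus one
# HYPOTHETICAL low-volume venue that is not taken from the data set.
played = {
    "Eden Gardens": 54,
    "M Chinnaswamy Stadium": 58,
    "Feroz Shah Kotla": 53,
    "Wankhede Stadium": 49,
    "Hypothetical Ground": 2,
}
chasing_wins = {
    "Eden Gardens": 32,
    "M Chinnaswamy Stadium": 31,
    "Feroz Shah Kotla": 28,
    "Wankhede Stadium": 24,
    "Hypothetical Ground": 2,
}

for venue, pct in rank_venues(chasing_wins, played):
    print(f"{venue}: {pct:.2f}%")
```

With the threshold applied, the hypothetical two-match venue no longer appears at the top with 100%. The same idea could be expressed in the Spark SQL query with a `WHERE a.count >= ...` condition and a `LEFT JOIN` from `all_matches_venue`, though that variant is not run here.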



## Closing Spark Session

In [61]:
spark.stop()