# IPL Match Analysis and Prediction

In this notebook, I perform exploratory data analysis on an IPL dataset to investigate the factors that might determine the winning team in a given IPL match, and then use this information to predict the outcome of future IPL matches with supervised machine learning algorithms.

In [5]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

Importing and previewing the dataset of IPL matches.

In [6]:
matches_df = pd.read_csv("matches.csv")
matches_df.head()

Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3
0,1,2017,Hyderabad,2017-04-05,Sunrisers Hyderabad,Royal Challengers Bangalore,Royal Challengers Bangalore,field,normal,0,Sunrisers Hyderabad,35,0,Yuvraj Singh,"Rajiv Gandhi International Stadium, Uppal",AY Dandekar,NJ Llong,
1,2,2017,Pune,2017-04-06,Mumbai Indians,Rising Pune Supergiant,Rising Pune Supergiant,field,normal,0,Rising Pune Supergiant,0,7,SPD Smith,Maharashtra Cricket Association Stadium,A Nand Kishore,S Ravi,
2,3,2017,Rajkot,2017-04-07,Gujarat Lions,Kolkata Knight Riders,Kolkata Knight Riders,field,normal,0,Kolkata Knight Riders,0,10,CA Lynn,Saurashtra Cricket Association Stadium,Nitin Menon,CK Nandan,
3,4,2017,Indore,2017-04-08,Rising Pune Supergiant,Kings XI Punjab,Kings XI Punjab,field,normal,0,Kings XI Punjab,0,6,GJ Maxwell,Holkar Cricket Stadium,AK Chaudhary,C Shamshuddin,
4,5,2017,Bangalore,2017-04-08,Royal Challengers Bangalore,Delhi Daredevils,Royal Challengers Bangalore,bat,normal,0,Royal Challengers Bangalore,15,0,KM Jadhav,M Chinnaswamy Stadium,,,


In [7]:
matches_df.describe()

Unnamed: 0,id,season,dl_applied,win_by_runs,win_by_wickets
count,756.0,756.0,756.0,756.0,756.0
mean,1792.178571,2013.444444,0.025132,13.283069,3.350529
std,3464.478148,3.366895,0.15663,23.471144,3.387963
min,1.0,2008.0,0.0,0.0,0.0
25%,189.75,2011.0,0.0,0.0,0.0
50%,378.5,2013.0,0.0,0.0,4.0
75%,567.25,2016.0,0.0,19.0,6.0
max,11415.0,2019.0,1.0,146.0,10.0


# Observations: 

The following inferences can be made from the information returned from the describe method:
    1.) The dataset contains match information from 2008 to 2019
    2.) There were 756 IPL matches from 2008 to 2019
    3.) The largest winning margin for a team batting first was 146 runs
    4.) The largest winning margin for a team bowling first was 10 wickets.
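
The same figures can also be read off programmatically instead of from the summary table. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are assumptions for illustration, not rows from the dataset) that uses the same column names:

```python
import pandas as pd

# Toy stand-in for matches_df (illustrative values only)
toy_df = pd.DataFrame({
    "season": [2008, 2008, 2009, 2010],
    "win_by_runs": [0, 33, 0, 146],
    "win_by_wickets": [9, 0, 10, 0],
})

# Collect the headline figures discussed in the observations
summary = {
    "first_season": int(toy_df["season"].min()),
    "last_season": int(toy_df["season"].max()),
    "n_matches": len(toy_df),
    "max_win_by_runs": int(toy_df["win_by_runs"].max()),
    "max_win_by_wickets": int(toy_df["win_by_wickets"].max()),
}
print(summary)
```

Running the same expressions against `matches_df` should reproduce the values reported by `describe()`.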

# Data Cleaning and Formatting

Earlier when we previewed the data a number of null values were evident. We will now check where null values exist in our data and remove these values so they do not influence the findings.

In [8]:
#checking for null values
matches_df.isnull().sum()

id                   0
season               0
city                 7
date                 0
team1                0
team2                0
toss_winner          0
toss_decision        0
result               0
dl_applied           0
winner               4
win_by_runs          0
win_by_wickets       0
player_of_match      4
venue                0
umpire1              2
umpire2              2
umpire3            637
dtype: int64

The column 'umpire3' is null in 637 of the 756 rows, and the name of the third umpire is unlikely to matter for our predictions, so we will drop this column.

In [9]:
# drop() returns a new DataFrame; matches_df itself is unchanged unless the result is assigned back
matches_df.drop(columns=['umpire3'])

Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2
0,1,2017,Hyderabad,2017-04-05,Sunrisers Hyderabad,Royal Challengers Bangalore,Royal Challengers Bangalore,field,normal,0,Sunrisers Hyderabad,35,0,Yuvraj Singh,"Rajiv Gandhi International Stadium, Uppal",AY Dandekar,NJ Llong
1,2,2017,Pune,2017-04-06,Mumbai Indians,Rising Pune Supergiant,Rising Pune Supergiant,field,normal,0,Rising Pune Supergiant,0,7,SPD Smith,Maharashtra Cricket Association Stadium,A Nand Kishore,S Ravi
2,3,2017,Rajkot,2017-04-07,Gujarat Lions,Kolkata Knight Riders,Kolkata Knight Riders,field,normal,0,Kolkata Knight Riders,0,10,CA Lynn,Saurashtra Cricket Association Stadium,Nitin Menon,CK Nandan
3,4,2017,Indore,2017-04-08,Rising Pune Supergiant,Kings XI Punjab,Kings XI Punjab,field,normal,0,Kings XI Punjab,0,6,GJ Maxwell,Holkar Cricket Stadium,AK Chaudhary,C Shamshuddin
4,5,2017,Bangalore,2017-04-08,Royal Challengers Bangalore,Delhi Daredevils,Royal Challengers Bangalore,bat,normal,0,Royal Challengers Bangalore,15,0,KM Jadhav,M Chinnaswamy Stadium,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
751,11347,2019,Mumbai,05/05/19,Kolkata Knight Riders,Mumbai Indians,Mumbai Indians,field,normal,0,Mumbai Indians,0,9,HH Pandya,Wankhede Stadium,Nanda Kishore,O Nandan
752,11412,2019,Chennai,07/05/19,Chennai Super Kings,Mumbai Indians,Chennai Super Kings,bat,normal,0,Mumbai Indians,0,6,AS Yadav,M. A. Chidambaram Stadium,Nigel Llong,Nitin Menon
753,11413,2019,Visakhapatnam,08/05/19,Sunrisers Hyderabad,Delhi Capitals,Delhi Capitals,field,normal,0,Delhi Capitals,0,2,RR Pant,ACA-VDCA Stadium,,
754,11414,2019,Visakhapatnam,10/05/19,Delhi Capitals,Chennai Super Kings,Chennai Super Kings,field,normal,0,Chennai Super Kings,0,6,F du Plessis,ACA-VDCA Stadium,Sundaram Ravi,Bruce Oxenford
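
Note that `DataFrame.drop` does not modify the frame it is called on unless `inplace=True` is passed or the result is assigned back. A minimal sketch of the difference, using a made-up two-column frame (`toy_df` is an assumption for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({"umpire1": ["A", "B"], "umpire3": [None, None]})

# Calling drop() without assignment leaves the original frame untouched
toy_df.drop(columns=["umpire3"])
still_there = "umpire3" in toy_df.columns

# Assigning the result back actually removes the column
toy_df = toy_df.drop(columns=["umpire3"])
removed = "umpire3" not in toy_df.columns

print(still_there, removed)
```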


Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3
461,462,2014,,2014-04-19,Mumbai Indians,Royal Challengers Bangalore,Royal Challengers Bangalore,field,normal,0,Royal Challengers Bangalore,0,7,PA Patel,Dubai International Cricket Stadium,Aleem Dar,AK Chaudhary,
462,463,2014,,2014-04-19,Kolkata Knight Riders,Delhi Daredevils,Kolkata Knight Riders,bat,normal,0,Delhi Daredevils,0,4,JP Duminy,Dubai International Cricket Stadium,Aleem Dar,VA Kulkarni,
466,467,2014,,2014-04-23,Chennai Super Kings,Rajasthan Royals,Rajasthan Royals,field,normal,0,Chennai Super Kings,7,0,RA Jadeja,Dubai International Cricket Stadium,HDPK Dharmasena,RK Illingworth,
468,469,2014,,2014-04-25,Sunrisers Hyderabad,Delhi Daredevils,Sunrisers Hyderabad,bat,normal,0,Sunrisers Hyderabad,4,0,AJ Finch,Dubai International Cricket Stadium,M Erasmus,S Ravi,
469,470,2014,,2014-04-25,Mumbai Indians,Chennai Super Kings,Mumbai Indians,bat,normal,0,Chennai Super Kings,0,7,MM Sharma,Dubai International Cricket Stadium,BF Bowden,M Erasmus,
474,475,2014,,2014-04-28,Royal Challengers Bangalore,Kings XI Punjab,Kings XI Punjab,field,normal,0,Kings XI Punjab,0,5,Sandeep Sharma,Dubai International Cricket Stadium,BF Bowden,S Ravi,
476,477,2014,,2014-04-30,Sunrisers Hyderabad,Mumbai Indians,Mumbai Indians,field,normal,0,Sunrisers Hyderabad,15,0,B Kumar,Dubai International Cricket Stadium,HDPK Dharmasena,M Erasmus,


In [10]:
##checking missing winner values
matches_df.loc[matches_df['winner'].isna()]

Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3
300,301,2011,Delhi,2011-05-21,Delhi Daredevils,Pune Warriors,Delhi Daredevils,bat,no result,0,,0,0,,Feroz Shah Kotla,SS Hazare,RJ Tucker,
545,546,2015,Bangalore,2015-04-29,Royal Challengers Bangalore,Rajasthan Royals,Rajasthan Royals,field,no result,0,,0,0,,M Chinnaswamy Stadium,JD Cloete,PG Pathak,
570,571,2015,Bangalore,2015-05-17,Delhi Daredevils,Royal Challengers Bangalore,Royal Challengers Bangalore,field,no result,0,,0,0,,M Chinnaswamy Stadium,HDPK Dharmasena,K Srinivasan,
744,11340,2019,Bengaluru,30/04/19,Royal Challengers Bangalore,Rajasthan Royals,Rajasthan Royals,field,no result,0,,0,0,,M. Chinnaswamy Stadium,Nigel Llong,Ulhas Gandhe,Anil Chaudhary


In [11]:
##replacing null values
# assign the result back instead of using inplace=True on a column selection
matches_df['winner'] = matches_df['winner'].fillna('No result')

# Visualization and Exploratory Data Analysis

First, we will examine the unique teams and venues to ensure that these align with the current teams participating in the IPL for the 2022 season.

In [12]:
matches_df['team1'].unique()


array(['Sunrisers Hyderabad', 'Mumbai Indians', 'Gujarat Lions',
       'Rising Pune Supergiant', 'Royal Challengers Bangalore',
       'Kolkata Knight Riders', 'Delhi Daredevils', 'Kings XI Punjab',
       'Chennai Super Kings', 'Rajasthan Royals', 'Deccan Chargers',
       'Kochi Tuskers Kerala', 'Pune Warriors', 'Rising Pune Supergiants',
       'Delhi Capitals'], dtype=object)

An interesting observation is that both 'Delhi Daredevils' and 'Delhi Capitals' are listed as unique teams. The Delhi Daredevils were renamed the Delhi Capitals at the start of the 2019 season. There is also a spelling inconsistency: 'Rising Pune Supergiants' also appears as 'Rising Pune Supergiant'. Below I will address these issues.

In [13]:
matches_df.replace(to_replace ="Delhi Daredevils", value= "Delhi Capitals", inplace=True)
matches_df.replace(to_replace ="Rising Pune Supergiant", value ="Rising Pune Supergiants", inplace=True)
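
The two calls above search every column of the frame. A slightly narrower variant, sketched here under the assumption that only the team-name columns need fixing, applies a single alias mapping to those columns (`toy_df`, `aliases` and `team_columns` are hypothetical names for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "team1": ["Delhi Daredevils", "Rising Pune Supergiant"],
    "team2": ["Mumbai Indians", "Delhi Capitals"],
    "winner": ["Delhi Daredevils", "Rising Pune Supergiants"],
})

# Old or misspelled name -> canonical name
aliases = {
    "Delhi Daredevils": "Delhi Capitals",
    "Rising Pune Supergiant": "Rising Pune Supergiants",
}
team_columns = ["team1", "team2", "winner"]

# replace() with a dict matches whole cell values, so the already
# correct 'Rising Pune Supergiants' is left as it is
toy_df[team_columns] = toy_df[team_columns].replace(aliases)
print(toy_df)
```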

In [14]:
##checking duplicate teams have been replaced
matches_df['team1'].unique()
matches_df['team2'].unique()


array(['Royal Challengers Bangalore', 'Rising Pune Supergiants',
       'Kolkata Knight Riders', 'Kings XI Punjab', 'Delhi Capitals',
       'Sunrisers Hyderabad', 'Mumbai Indians', 'Gujarat Lions',
       'Rajasthan Royals', 'Chennai Super Kings', 'Deccan Chargers',
       'Pune Warriors', 'Kochi Tuskers Kerala'], dtype=object)

Next, we look at the historical winning record of each team.

In [15]:
matches_df['winner'].value_counts()

Mumbai Indians                 109
Chennai Super Kings            100
Kolkata Knight Riders           92
Royal Challengers Bangalore     84
Kings XI Punjab                 82
Delhi Capitals                  77
Rajasthan Royals                75
Sunrisers Hyderabad             58
Deccan Chargers                 29
Rising Pune Supergiants         15
Gujarat Lions                   13
Pune Warriors                   12
Kochi Tuskers Kerala             6
No result                        4
Name: winner, dtype: int64

Because most ML algorithms require numeric rather than textual input, we will assign a numeric code to each team in the dataset.

In [16]:
team_encodings = {
    'Sunrisers Hyderabad': 1,
    'Mumbai Indians': 2,
    'Gujarat Lions': 3,
    'Rising Pune Supergiants': 4,
    'Royal Challengers Bangalore': 5,
    'Kolkata Knight Riders': 6,
    'Delhi Capitals': 7,
    'Kings XI Punjab': 8,
    'Chennai Super Kings': 9,
    'Rajasthan Royals': 10,
    'Deccan Chargers': 11,
    'Kochi Tuskers Kerala': 12,
    'Pune Warriors': 13,
    'No result': 14
}

##encode all columns with teams

team_encoding_dict = {
    'team1': team_encodings,
    'team2': team_encodings,
    'toss_winner': team_encodings,
    'winner': team_encodings
}

matches_df.replace(team_encoding_dict, inplace=True)
matches_df.head()

Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3
0,1,2017,Hyderabad,2017-04-05,1,5,5,field,normal,0,1,35,0,Yuvraj Singh,"Rajiv Gandhi International Stadium, Uppal",AY Dandekar,NJ Llong,
1,2,2017,Pune,2017-04-06,2,4,4,field,normal,0,4,0,7,SPD Smith,Maharashtra Cricket Association Stadium,A Nand Kishore,S Ravi,
2,3,2017,Rajkot,2017-04-07,3,6,6,field,normal,0,6,0,10,CA Lynn,Saurashtra Cricket Association Stadium,Nitin Menon,CK Nandan,
3,4,2017,Indore,2017-04-08,4,8,8,field,normal,0,8,0,6,GJ Maxwell,Holkar Cricket Stadium,AK Chaudhary,C Shamshuddin,
4,5,2017,Bangalore,2017-04-08,5,7,5,bat,normal,0,5,15,0,KM Jadhav,M Chinnaswamy Stadium,,,
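
Once the teams are stored as codes, it is handy to be able to translate a code back to a name, for example when printing results. A minimal sketch, assuming a subset of the `team_encodings` dictionary above (`code_to_team` is a hypothetical helper name):

```python
# Subset of the encoding dictionary, repeated here so the example is self-contained
team_encodings = {
    "Sunrisers Hyderabad": 1,
    "Mumbai Indians": 2,
    "No result": 14,
}

# Invert the mapping: numeric code -> team name
code_to_team = {code: name for name, code in team_encodings.items()}

decoded = [code_to_team[code] for code in [2, 1, 14]]
print(decoded)
```

Looking codes up in an inverted dictionary avoids relying on the position of each key in the original dictionary.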


In [18]:
##checking missing city values
matches_df['city'].value_counts()

Mumbai            101
Kolkata            77
Delhi              74
Bangalore          66
Hyderabad          64
Chennai            57
Jaipur             47
Chandigarh         46
Pune               38
Durban             15
Bengaluru          14
Visakhapatnam      13
Centurion          12
Ahmedabad          12
Rajkot             10
Mohali             10
Indore              9
Dharamsala          9
Johannesburg        8
Cuttack             7
Ranchi              7
Port Elizabeth      7
Cape Town           7
Abu Dhabi           7
Sharjah             6
Raipur              6
Kochi               5
Kanpur              4
Nagpur              3
Kimberley           3
East London         3
Bloemfontein        2
Name: city, dtype: int64

In [17]:
matches_df.loc[matches_df['city'].isna()]

Unnamed: 0,id,season,city,date,team1,team2,toss_winner,toss_decision,result,dl_applied,winner,win_by_runs,win_by_wickets,player_of_match,venue,umpire1,umpire2,umpire3
461,462,2014,,2014-04-19,2,5,5,field,normal,0,5,0,7,PA Patel,Dubai International Cricket Stadium,Aleem Dar,AK Chaudhary,
462,463,2014,,2014-04-19,6,7,6,bat,normal,0,7,0,4,JP Duminy,Dubai International Cricket Stadium,Aleem Dar,VA Kulkarni,
466,467,2014,,2014-04-23,9,10,10,field,normal,0,9,7,0,RA Jadeja,Dubai International Cricket Stadium,HDPK Dharmasena,RK Illingworth,
468,469,2014,,2014-04-25,1,7,1,bat,normal,0,1,4,0,AJ Finch,Dubai International Cricket Stadium,M Erasmus,S Ravi,
469,470,2014,,2014-04-25,2,9,2,bat,normal,0,9,0,7,MM Sharma,Dubai International Cricket Stadium,BF Bowden,M Erasmus,
474,475,2014,,2014-04-28,5,8,8,field,normal,0,8,0,5,Sandeep Sharma,Dubai International Cricket Stadium,BF Bowden,S Ravi,
476,477,2014,,2014-04-30,1,2,2,field,normal,0,1,15,0,B Kumar,Dubai International Cricket Stadium,HDPK Dharmasena,M Erasmus,


We can see that although the city is null, the venue in every one of these rows is Dubai International Cricket Stadium. Therefore, the missing values can easily be replaced with 'Dubai'.

In [19]:
# assign the result back instead of using inplace=True on a column selection
matches_df['city'] = matches_df['city'].fillna('Dubai')
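
Filling every missing city with a constant works here because all the affected rows share one venue. A more general sketch, assuming a hand-written venue-to-city lookup (`venue_to_city` is hypothetical), fills each gap from the venue of the same row:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "city": [None, "Mumbai", None],
    "venue": [
        "Dubai International Cricket Stadium",
        "Wankhede Stadium",
        "Dubai International Cricket Stadium",
    ],
})

# Hypothetical lookup table: venue name -> city
venue_to_city = {"Dubai International Cricket Stadium": "Dubai"}

# fillna() with a Series fills each missing value from the same row index
toy_df["city"] = toy_df["city"].fillna(toy_df["venue"].map(venue_to_city))
print(toy_df["city"].tolist())
```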

In [20]:
#checking for null values
matches_df.isnull().sum()

id                   0
season               0
city                 0
date                 0
team1                0
team2                0
toss_winner          0
toss_decision        0
result               0
dl_applied           0
winner               0
win_by_runs          0
win_by_wickets       0
player_of_match      4
venue                0
umpire1              2
umpire2              2
umpire3            637
dtype: int64

# Toss wins and match wins by each team

In [21]:
toss_wins = matches_df['toss_winner'].value_counts(sort=True)
match_wins = matches_df['winner'].value_counts(sort=True)

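# map each numeric code back to its team name
code_to_team = {code: name for name, code in team_encodings.items()}
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for code, count in toss_wins.items():
    print(f"{code_to_team[code]} -> {count}")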

Mumbai Indians -> 98
Kolkata Knight Riders -> 92
Delhi Capitals -> 90
Chennai Super Kings -> 89
Royal Challengers Bangalore -> 81
Kings XI Punjab -> 81
Rajasthan Royals -> 80
Sunrisers Hyderabad -> 46
Deccan Chargers -> 43
Pune Warriors -> 20
Gujarat Lions -> 15
Rising Pune Supergiants -> 13
Kochi Tuskers Kerala -> 8


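Comparing these counts with the match wins listed earlier suggests there may be some relationship between winning the toss and winning the match. However, both counts also reflect how many matches each team has played, so the counts alone do not establish it.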
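
One way to check this more directly is to measure how often the toss winner also won the match. The following is a minimal sketch on made-up data (`toy_df` and the `NO_RESULT` constant are assumptions for illustration; 14 is the code used for 'No result' in the encoding above):

```python
import pandas as pd

NO_RESULT = 14  # code assigned to 'No result' in team_encodings

toy_df = pd.DataFrame({
    "toss_winner": [1, 2, 3, 1, 2, 5],
    "winner":      [1, 1, 3, 2, 2, NO_RESULT],
})

# Ignore abandoned matches, which have no winner to compare against
played = toy_df[toy_df["winner"] != NO_RESULT]

# Share of played matches in which the toss winner also won the match
rate = (played["toss_winner"] == played["winner"]).mean()
print(f"Toss winner also won the match in {rate:.0%} of played matches")
```

A rate close to 50% on the real data would suggest that the toss by itself carries little information about the result.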

## Drop the redundant columns 


In [22]:
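# keep only the columns used for modelling; .copy() avoids SettingWithCopyWarning on later assignments
matches_df = matches_df[['team1', 'team2', 'city', 'toss_decision', 'toss_winner', 'venue', 'winner']].copy()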
matches_df

Unnamed: 0,team1,team2,city,toss_decision,toss_winner,venue,winner
0,1,5,Hyderabad,field,5,"Rajiv Gandhi International Stadium, Uppal",1
1,2,4,Pune,field,4,Maharashtra Cricket Association Stadium,4
2,3,6,Rajkot,field,6,Saurashtra Cricket Association Stadium,6
3,4,8,Indore,field,8,Holkar Cricket Stadium,8
4,5,7,Bangalore,bat,5,M Chinnaswamy Stadium,5
...,...,...,...,...,...,...,...
751,6,2,Mumbai,field,2,Wankhede Stadium,2
752,9,2,Chennai,bat,9,M. A. Chidambaram Stadium,2
753,1,7,Visakhapatnam,field,7,ACA-VDCA Stadium,7
754,7,9,Visakhapatnam,field,9,ACA-VDCA Stadium,9


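Use label encoding to convert the remaining categorical columns (city, toss_decision and venue) to integers. Note that label encoding assigns an arbitrary integer to each category, which implies an ordering that these non-ordinal variables do not have; one-hot encoding is an alternative that gives every category equal weight.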

In [23]:
#use LabelEncoder class from sklearn to encode remaining categorical variables
from sklearn.preprocessing import LabelEncoder

#list of columns to transform
column_list = ['city', 'toss_decision', 'venue']
#instantiate encoder instance
encoder =  LabelEncoder()
for column in column_list:
    matches_df[column] = encoder.fit_transform(matches_df[column])
    print(encoder.classes_) # check the classes being encoded

matches_df

['Abu Dhabi' 'Ahmedabad' 'Bangalore' 'Bengaluru' 'Bloemfontein'
 'Cape Town' 'Centurion' 'Chandigarh' 'Chennai' 'Cuttack' 'Delhi'
 'Dharamsala' 'Dubai' 'Durban' 'East London' 'Hyderabad' 'Indore' 'Jaipur'
 'Johannesburg' 'Kanpur' 'Kimberley' 'Kochi' 'Kolkata' 'Mohali' 'Mumbai'
 'Nagpur' 'Port Elizabeth' 'Pune' 'Raipur' 'Rajkot' 'Ranchi' 'Sharjah'
 'Visakhapatnam']
['bat' 'field']
['ACA-VDCA Stadium' 'Barabati Stadium' 'Brabourne Stadium' 'Buffalo Park'
 'De Beers Diamond Oval' 'Dr DY Patil Sports Academy'
 'Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium'
 'Dubai International Cricket Stadium' 'Eden Gardens' 'Feroz Shah Kotla'
 'Feroz Shah Kotla Ground' 'Green Park'
 'Himachal Pradesh Cricket Association Stadium' 'Holkar Cricket Stadium'
 'IS Bindra Stadium' 'JSCA International Stadium Complex' 'Kingsmead'
 'M Chinnaswamy Stadium' 'M. A. Chidambaram Stadium'
 'M. Chinnaswamy Stadium' 'MA Chidambaram Stadium, Chepauk'
 'Maharashtra Cricket Association Stadium' 'Nehru Stadium'
 'New 

Unnamed: 0,team1,team2,city,toss_decision,toss_winner,venue,winner
0,1,5,15,1,5,28,1
1,2,4,27,1,4,21,4
2,3,6,29,1,6,31,6
3,4,8,16,1,8,13,8
4,5,7,2,0,5,17,5
...,...,...,...,...,...,...,...
751,6,2,24,1,2,40,2
752,9,2,8,0,9,18,2
753,1,7,32,1,7,0,7
754,7,9,32,1,9,0,9
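
The loop above reuses a single `LabelEncoder`, so once it finishes only the classes of the last column remain available. A small sketch, assuming we want to decode the values later, keeps one encoder per column (`toy_df` and the `encoders` dictionary are hypothetical names for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy_df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune"],
    "toss_decision": ["field", "bat", "field"],
})

# One fitted encoder per column, so each column can be decoded later
encoders = {}
for column in ["city", "toss_decision"]:
    enc = LabelEncoder()
    toy_df[column] = enc.fit_transform(toy_df[column])
    encoders[column] = enc

# Map the integer codes back to the original city names
cities = list(encoders["city"].inverse_transform(toy_df["city"]))
print(toy_df["city"].tolist(), cities)
```

The same dictionary can be used to encode new inputs with `encoders[column].transform(...)` before calling a trained model.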


## Machine Learning

Split the data to set aside a held-out test set.

In [24]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(matches_df, test_size= 0.2, random_state=5)
print(train_df.shape)
print(test_df.shape)

(604, 7)
(152, 7)


In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

def print_model_scores(model, data, predictor_var, target_var):
    # fit on the given data and report accuracy on that same data (in-sample, so optimistic)
    model.fit(data[predictor_var], data[target_var])
    predictions = model.predict(data[predictor_var])
    accuracy = accuracy_score(data[target_var], predictions)
    print('Accuracy %s' % '{0:.2}'.format(accuracy))

    #cross validation scores
    # note: neg_mean_squared_error treats the team codes as numbers, so this
    # error depends on the arbitrary code assigned to each team
    scores = cross_val_score(model, data[predictor_var], data[target_var],
                            scoring ="neg_mean_squared_error", cv=5) #cross validation
    print('Cross-validation scores: {}'.format(np.sqrt(-scores)))
    print(f'Average RSME: {np.sqrt(-scores).mean()}')

In [26]:
##logistic regression
target = ["winner"]
predictor = ['team1', 'team2', 'venue', 'toss_winner', 'city', 'toss_decision']
model = LogisticRegression()
print_model_scores(model, train_df, predictor, target)

Accuracy 0.32
Cross-validation scores: [2.80937039 3.12811828 3.22054356 3.07885349 3.09030743]
Average RSME: 3.065438630350576


In [42]:
## random forest classifier
model = RandomForestClassifier(n_estimators=100)
print_model_scores(model, train_df, predictor, target)

Accuracy 0.89
Cross-validation scores: [2.96536481 3.19736462 2.82111285 3.19089614 2.94108823]
Average RSME: 3.0231653285881537
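
The accuracy figures above are measured on the same rows the models were trained on, and the cross-validation error treats team codes as numeric values. A hedged alternative, sketched here on synthetic data rather than the IPL dataset (the features, the labelling rule and all variable names are assumptions for illustration), is to report cross-validated accuracy and then a single score on the held-out split:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in: six integer-coded features and a binary label
rng = np.random.default_rng(0)
X = rng.integers(1, 14, size=(200, 6))
y = (X[:, 0] > X[:, 1]).astype(int)  # made-up rule, not a cricket model

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

model = RandomForestClassifier(n_estimators=100, random_state=5)

# Accuracy estimated on folds the model did not see during fitting
cv_scores = cross_val_score(model, X_train, y_train, scoring="accuracy", cv=5)
print("Cross-validated accuracy:", cv_scores.mean())

# Final check on the held-out test split
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print("Test accuracy:", test_accuracy)
```

A large gap between training accuracy and these two figures would point to overfitting.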


In [43]:
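team1 = 'Sunrisers Hyderabad'
team2 = 'Mumbai Indians'
toss_winner = 'Sunrisers Hyderabad'
# feature order must match `predictor`: team1, team2, venue, toss_winner, city, toss_decision
# venue=14 ('IS Bindra Stadium'), city=2 ('Bangalore'), toss_decision=1 ('field') per the encoder classes printed above
inp = [team_encodings[team1], team_encodings[team2], 14, team_encodings[toss_winner], 2, 1]
inp = np.array(inp).reshape((1, -1))
output = model.predict(inp)
# map the predicted code back to a team name
code_to_team = {code: name for name, code in team_encodings.items()}
print(f"The winner would be: {code_to_team[int(output[0])]}")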

The winner would be: Sunrisers Hyderabad


In [45]:
pd.Series(index=predictor, data=model.feature_importances_)

team1            0.218147
team2            0.244353
venue            0.187287
toss_winner      0.162275
city             0.153956
toss_decision    0.033982
dtype: float64
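
Sorting the importances makes the ranking easier to read. A minimal sketch that reuses the values printed above (copied by hand into a Series so the example is self-contained):

```python
import pandas as pd

# Feature importances as reported by the random forest above
importances = pd.Series({
    "team1": 0.218147,
    "team2": 0.244353,
    "venue": 0.187287,
    "toss_winner": 0.162275,
    "city": 0.153956,
    "toss_decision": 0.033982,
})

# Rank features from most to least important
ranked = importances.sort_values(ascending=False)
print(ranked)
```

With the fitted model at hand, the same ranking can be obtained by calling `sort_values(ascending=False)` on the Series built from `model.feature_importances_`.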