In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm

Import the cricket data with the player salaries.

In [6]:
IPLPlayer = pd.read_csv('C:/Users/Jeet/Desktop/Cricket Analytics/IPL18Player.csv')
IPLPlayer.head()


Unnamed: 0,player_id,long_scorecard_name,Salary,team,matches,wins,team_runs_for,team_runs_against,matches_keeper,byes_conceded,...,bowling_dot_balls,bowling_sixes,no_balls,balls_bowled_1_to_6,runs_conceded_1_to_6,balls_bowled_7_to_14,runs_conceded_7_to_14,balls_bowled_15_to_20,runs_conceded_15_to_20,event_winner
0,8931,AT Rayudu,343750.0,Chennai Super Kings,16,11,2809,2750,0,0,...,0,0,0,0,0,0,0,0,0,1
1,254771,D Shorey,31250.0,Chennai Super Kings,1,1,128,127,0,0,...,0,0,0,0,0,0,0,0,0,1
2,44613,DJ Bravo,1000000.0,Chennai Super Kings,16,11,2809,2750,0,0,...,90,29,0,0,0,126,160,195,373,1
3,214425,DJ Willey,,Chennai Super Kings,3,2,484,483,0,0,...,20,3,0,24,38,6,10,30,47,1
4,258155,DL Chahar,125000.0,Chennai Super Kings,12,9,2117,2068,0,0,...,118,10,2,194,236,37,42,0,0,1


# Missing Values

In [7]:
IPLPlayer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 35 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   player_id               149 non-null    int64  
 1   long_scorecard_name     149 non-null    object 
 2   Salary                  141 non-null    float64
 3   team                    149 non-null    object 
 4   matches                 149 non-null    int64  
 5   wins                    149 non-null    int64  
 6   team_runs_for           149 non-null    int64  
 7   team_runs_against       149 non-null    int64  
 8   matches_keeper          149 non-null    int64  
 9   byes_conceded           149 non-null    int64  
 10  moms                    149 non-null    int64  
 11  innings                 149 non-null    int64  
 12  not_outs                149 non-null    int64  
 13  runs                    149 non-null    int64  
 14  balls_faced             149 non-null    in

The Salary column has missing values (141 non-null out of 149 entries), so we drop the rows that contain them.

In [9]:
IPLPlayer = IPLPlayer.dropna()
IPLPlayer.info()

<class 'pandas.core.frame.DataFrame'>
Index: 141 entries, 0 to 148
Data columns (total 35 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   player_id               141 non-null    int64  
 1   long_scorecard_name     141 non-null    object 
 2   Salary                  141 non-null    float64
 3   team                    141 non-null    object 
 4   matches                 141 non-null    int64  
 5   wins                    141 non-null    int64  
 6   team_runs_for           141 non-null    int64  
 7   team_runs_against       141 non-null    int64  
 8   matches_keeper          141 non-null    int64  
 9   byes_conceded           141 non-null    int64  
 10  moms                    141 non-null    int64  
 11  innings                 141 non-null    int64  
 12  not_outs                141 non-null    int64  
 13  runs                    141 non-null    int64  
 14  balls_faced             141 non-null    int64  
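Note that `dropna()` with no arguments drops a row if any column is missing, not only Salary. Here the two coincide for the columns shown, but a minimal sketch on a made-up three-row frame (the `toy` name and its values are illustrative assumptions, not the IPL data) shows how `subset` restricts the check to the Salary column:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only, not the IPL data).
toy = pd.DataFrame({
    "long_scorecard_name": ["Player A", "Player B", "Player C"],
    "Salary": [343750.0, np.nan, 125000.0],
    "matches": [16, 3, 12],
})

# Drop only the rows where Salary itself is missing.
toy_clean = toy.dropna(subset=["Salary"])
print(toy_clean)
```

This keeps rows that are complete in Salary even if some unrelated column has a gap.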

# Create dummy variables to indicate each player's role: Batsmen, Bowlers, and all-rounders, who both bat and bowl.

In [12]:
IPLPlayer['Batsmen'] = np.where(IPLPlayer['innings']>0,1,0)
IPLPlayer['Batsmen'].describe()

count    141.000000
mean       0.943262
std        0.232165
min        0.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: Batsmen, dtype: float64

In [13]:
IPLPlayer['Bowler'] = np.where(IPLPlayer['balls_bowled']>0,1,0)
IPLPlayer['Bowler'].describe()

count    141.000000
mean       0.631206
std        0.484198
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: Bowler, dtype: float64

The last type of player, not captured by either the batsman or the bowler dummy, is the wicketkeeper. In the dataset, the variable "matches_keeper" gives the number of matches in which a player kept wicket.
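A keeper dummy could be built in the same way as the two dummies above. The sketch below uses made-up rows, and the column names `Keeper` and `Allrounder` are hypothetical additions that the notebook itself does not create:

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative values only, not the IPL data).
toy = pd.DataFrame({
    "innings": [14, 0, 10, 12],
    "balls_bowled": [0, 240, 180, 0],
    "matches_keeper": [0, 0, 0, 12],
})

toy["Batsmen"] = np.where(toy["innings"] > 0, 1, 0)
toy["Bowler"] = np.where(toy["balls_bowled"] > 0, 1, 0)
# Hypothetical extra dummies, built the same way as the two above.
toy["Keeper"] = np.where(toy["matches_keeper"] > 0, 1, 0)
toy["Allrounder"] = toy["Batsmen"] * toy["Bowler"]  # 1 only if both roles apply
print(toy)
```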

# Performance Measures

batting average = runs / number of outs
batting strike rate = (runs * 100) / balls faced
bowling average = runs conceded / wickets taken
bowling strike rate = number of balls bowled / wickets taken
Notice that if a batsman has scored runs but has not been dismissed, his batting average is technically infinite. Similarly, if a player did not face any ball, his batting strike rate is undefined because the denominator is zero, and if a player did not take any wicket, his bowling average and bowling strike rate would be infinite.

We will not be able to run a regression when our variables have some infinite values.

There are two alternatives we will consider to deal with this issue.

a) Add 1 to the number of outs, balls faced, and wickets taken when calculating the above variables.
b) Instead of creating the above measures, we can simply include total runs, total number of outs, and balls faced to measure a batsman's performance, and include runs conceded, number of balls bowled, and wickets taken to measure a bowler's performance.
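To see the problem concretely, here is a small sketch on made-up numbers (the `toy` frame is an illustrative assumption). Dividing by a zero count in pandas yields `inf` or `NaN`, while alternative (a) keeps every value finite:

```python
import numpy as np
import pandas as pd

# Toy batting records (illustrative values only).
toy = pd.DataFrame({
    "runs": [120, 35, 0],
    "outs": [4, 0, 0],
})

# Unadjusted ratio: a zero denominator gives inf (35/0) or NaN (0/0).
raw_average = toy["runs"] / toy["outs"]
# Alternative (a): add 1 to the denominator so the ratio is always finite.
adj_average = toy["runs"] / (toy["outs"] + 1)

print(pd.DataFrame({"raw": raw_average, "adjusted": adj_average}))
```

The adjustment changes the scale of the measure (120 runs for 4 outs becomes 24 rather than 30), so the adjusted values are not directly comparable with published averages.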

In [16]:
IPLPlayer['outs']=np.where(IPLPlayer['Batsmen'] == 1, IPLPlayer['innings'] - IPLPlayer['not_outs'],0)
IPLPlayer['outs'].describe()

count    141.000000
mean       5.000000
std        4.605897
min        0.000000
25%        1.000000
50%        4.000000
75%        9.000000
max       16.000000
Name: outs, dtype: float64

Create the batting average, batting strike rate, bowling average, and bowling strike rate variables. Add 1 to the number of outs, balls faced, and wickets taken when calculating these variables.

In [17]:
IPLPlayer['batting_average']=IPLPlayer['runs']/(IPLPlayer['outs']+1)
IPLPlayer['batting_strike']=IPLPlayer['runs']/((IPLPlayer['balls_faced']+1))*100
IPLPlayer['bowling_average']=IPLPlayer['runs_conceded']/(IPLPlayer['wickets']+1)
IPLPlayer['bowling_strike']=IPLPlayer['balls_bowled']/(IPLPlayer['wickets']+1)

In [18]:
IPLPlayer['batting_average'].describe()

count    141.000000
mean      15.093066
std       13.761819
min        0.000000
25%        4.000000
50%       12.500000
75%       23.000000
max       65.000000
Name: batting_average, dtype: float64

In [19]:
IPLPlayer['batting_strike'].describe()

count    141.000000
mean     104.164456
std       53.873378
min        0.000000
25%       73.913043
50%      118.446602
75%      139.669421
max      250.000000
Name: batting_strike, dtype: float64

In [20]:
IPLPlayer['bowling_average'].describe()

count    141.000000
mean      17.493864
std       16.108488
min        0.000000
25%        0.000000
50%       20.052632
75%       27.466667
max       72.000000
Name: bowling_average, dtype: float64

# Regression Analysis

First, let's run a regression of salary on the type of player: batsman, bowler, and all-rounder.

In [23]:
reg_IPL1 = sm.ols(formula= 'Salary ~ Batsmen + Bowler + Batsmen*Bowler',  data=IPLPlayer, missing="drop").fit()
print(reg_IPL1.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.060
Model:                            OLS   Adj. R-squared:                  0.046
Method:                 Least Squares   F-statistic:                     4.379
Date:                Fri, 15 Mar 2024   Prob (F-statistic):             0.0143
Time:                        21:07:32   Log-Likelihood:                -2069.2
No. Observations:                 141   AIC:                             4144.
Df Residuals:                     138   BIC:                             4153.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept       2.859e+05   1.11e+05      2.

Only the Batsmen variable is statistically significant (i.e. its p-value is below the 0.05 significance level), and the R-squared of this model is just 0.06,
which indicates that other factors also affect player salary.
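In a formula, `Batsmen*Bowler` already expands to both main effects plus their interaction, so the model has three role terms, yet the summary reports Df Model: 2. One possible explanation, offered here as an assumption rather than something the output confirms, is that some combination of the two dummies never occurs in the data, which makes the terms linearly dependent. A cross-tabulation is a quick way to check; the toy values below are illustrative only:

```python
import pandas as pd

# Toy role dummies (illustrative values only, not the IPL data).
toy = pd.DataFrame({
    "Batsmen": [1, 1, 1, 0, 1, 0],
    "Bowler":  [0, 1, 1, 1, 0, 1],
})

# Count players in each Batsmen x Bowler combination; a zero cell means
# that role combination is absent from the data.
role_counts = pd.crosstab(toy["Batsmen"], toy["Bowler"])
print(role_counts)
```

Running the same `pd.crosstab` call on `IPLPlayer['Batsmen']` and `IPLPlayer['Bowler']` would show whether any cell is empty in the real data.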

# Next, we will focus on the performance of batsmen.

We will first simply use the total number of runs, the number of not outs, and the number of balls faced to measure players’ performance.

In [25]:
reg_IPL2 = sm.ols(formula= 'Salary ~ runs', data=IPLPlayer).fit()
print(reg_IPL2.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.267
Model:                            OLS   Adj. R-squared:                  0.261
Method:                 Least Squares   F-statistic:                     50.57
Date:                Fri, 15 Mar 2024   Prob (F-statistic):           5.54e-11
Time:                        21:12:54   Log-Likelihood:                -2051.7
No. Observations:                 141   AIC:                             4107.
Df Residuals:                     139   BIC:                             4113.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3.878e+05   5.39e+04      7.198      0.0

As we can see, the total number of runs significantly affects a player's salary. The coefficient is 1737.94, which indicates that
each additional run is associated with an increase of 1737.94 in salary. The R-squared also rises substantially, to 0.267.
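The interpretation of the slope can be checked with a little arithmetic. The sketch below plugs in the intercept (3.878e+05) and the runs coefficient (1737.94) quoted above; the function name `predicted_salary` is a hypothetical helper, not part of the notebook:

```python
# Estimates quoted in the text and in the regression output above.
INTERCEPT = 3.878e5
SLOPE_RUNS = 1737.94


def predicted_salary(runs):
    """Fitted salary from the one-variable model Salary ~ runs."""
    return INTERCEPT + SLOPE_RUNS * runs


# The difference between two fitted values one run apart is the slope itself.
extra = predicted_salary(101) - predicted_salary(100)
print(round(extra, 2))
```

This describes an association in the fitted line, not a guarantee that any individual player's pay rises by that amount.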

In [27]:
reg_IPL3 = sm.ols(formula= 'Salary ~ runs + not_outs', data=IPLPlayer).fit()
print(reg_IPL3.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.318
Model:                            OLS   Adj. R-squared:                  0.308
Method:                 Least Squares   F-statistic:                     32.15
Date:                Fri, 15 Mar 2024   Prob (F-statistic):           3.45e-12
Time:                        21:19:18   Log-Likelihood:                -2046.6
No. Observations:                 141   AIC:                             4099.
Df Residuals:                     138   BIC:                             4108.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    2.88e+05   6.07e+04      4.747      0.0

In the regression of salary on runs and not outs, both variables are statistically significant. The estimate on runs is reduced to 1491, and the estimate on not_outs is 89550, which means that remaining not out in one more innings is associated with a salary about 89550 higher. The R-squared also improves, to 0.318.

In [30]:
reg_IPL4 = sm.ols(formula= 'Salary ~ runs + not_outs + balls_faced', data=IPLPlayer).fit()
print(reg_IPL4.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.321
Model:                            OLS   Adj. R-squared:                  0.306
Method:                 Least Squares   F-statistic:                     21.60
Date:                Fri, 15 Mar 2024   Prob (F-statistic):           1.62e-11
Time:                        21:24:12   Log-Likelihood:                -2046.3
No. Observations:                 141   AIC:                             4101.
Df Residuals:                     137   BIC:                             4112.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept    3.013e+05   6.29e+04      4.791      

The estimated coefficient on balls faced is negative: for a given number of runs, a batsman who faces more balls is scoring more slowly and is paid less. With balls faced in the model, the coefficient on runs nearly doubles to 2871, which indicates that runs scored need to be judged against the number of balls faced.
We should also note that the estimate on the number of balls faced is not statistically significant. Its p-value is 0.416, much greater than the commonly used significance level of 0.05. So with this regression, we do not have strong evidence that the number of balls faced has a significant impact on player salary.
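Why would the runs coefficient jump when balls faced is added? A plausible mechanism is that the two variables are strongly correlated, so a model with runs alone absorbs part of the balls-faced effect. The simulation below is a sketch on synthetic data; every number in it is invented for illustration and none comes from the IPL dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
runs = rng.uniform(0, 600, n)
balls = runs / 1.3 + rng.normal(0, 20, n)  # balls faced track runs closely
# Synthetic "true" model: runs raise salary, balls faced lower it.
salary = 100_000 + 2000 * runs - 1500 * balls + rng.normal(0, 10_000, n)


def ols(columns, y):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_runs_only = ols([runs], salary)[1]
b_runs, b_balls = ols([runs, balls], salary)[1:]
print(b_runs_only, b_runs, b_balls)
```

With runs alone, the slope is pulled down by the omitted negative effect of balls faced; once balls faced is included, the runs slope moves back toward its larger value.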

# We will use the modified batting average and batting strike variables to measure player performance.

In [31]:
reg_IPL5=sm.ols(formula = 'Salary ~ batting_average', data= IPLPlayer).fit()
print(reg_IPL5.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.233
Model:                            OLS   Adj. R-squared:                  0.227
Method:                 Least Squares   F-statistic:                     42.13
Date:                Fri, 15 Mar 2024   Prob (F-statistic):           1.40e-09
Time:                        21:28:03   Log-Likelihood:                -2054.9
No. Observations:                 141   AIC:                             4114.
Df Residuals:                     139   BIC:                             4120.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        3.072e+05   6.52e+04     

The estimate on batting average is 20,740, and it is statistically significant at the 1% level. This means that as a player's batting average increases by 1, his salary is expected to increase by 20,740. The R-squared of the model is 0.233. This suggests that player performance is indeed an important determinant of salary.
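Rather than reading estimates and p-values off the printed summary, they can be pulled from the fitted result directly. The sketch below fits a toy model on synthetic data (the `toy` frame and its columns `x` and `y` are assumptions for illustration) and uses the `params`, `pvalues`, and `rsquared` attributes of a statsmodels OLS result. It imports the formula API as `smf`, whereas the notebook imports it as `sm`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known slope of 3 (illustrative only).
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": np.arange(50.0)})
toy["y"] = 10 + 3 * toy["x"] + rng.normal(0, 5, 50)

res = smf.ols("y ~ x", data=toy).fit()

# Collect the coefficient estimates and their p-values in one table.
table = pd.DataFrame({"coef": res.params, "p_value": res.pvalues})
print(table)
print("R-squared:", round(res.rsquared, 3))
```

The same three attributes are available on `reg_IPL5` and the other fitted models in this notebook.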

In [32]:
reg_IPL6=sm.ols(formula = 'Salary ~ batting_average+batting_strike', data= IPLPlayer).fit()
print(reg_IPL6.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.234
Model:                            OLS   Adj. R-squared:                  0.223
Method:                 Least Squares   F-statistic:                     21.12
Date:                Fri, 15 Mar 2024   Prob (F-statistic):           9.96e-09
Time:                        21:29:35   Log-Likelihood:                -2054.7
No. Observations:                 141   AIC:                             4115.
Df Residuals:                     138   BIC:                             4124.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        2.668e+05   9.69e+04     

While the estimate on batting average is reduced to 19,030, it is still statistically significant. The estimate on batting strike rate is 6,353,000, meaning that as batting strike increases by 0.01, the player's salary is expected to go up by 63,530. However, this estimated coefficient is not statistically significant, and the R-squared does not improve much compared to the previous regression.
This suggests that batting strike is not an important factor in player salary. The results using our modified batting average and batting strike are consistent with our regression using the balls_faced variable.

# We will now turn to bowlers' performance.
Again, we will first use number of runs conceded, number of balls bowled, and number of wickets taken to measure bowlers' performance.

In [33]:
reg_IPL7=sm.ols(formula = 'Salary ~ runs_conceded', data= IPLPlayer).fit()
print(reg_IPL7.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.023
Model:                            OLS   Adj. R-squared:                  0.015
Method:                 Least Squares   F-statistic:                     3.200
Date:                Fri, 15 Mar 2024   Prob (F-statistic):             0.0758
Time:                        21:32:05   Log-Likelihood:                -2072.0
No. Observations:                 141   AIC:                             4148.
Df Residuals:                     139   BIC:                             4154.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      5.438e+05   6.53e+04      8.322

As the number of runs conceded increases, the player's salary increases as well. This appears counterintuitive, as the role of a bowler is to stop the opponent from scoring runs. However, runs conceded may be correlated with the number of matches a player plays or the number of balls he bowls. Thus this regression may not tell the whole story. This is also reflected in the small R-squared, which is only 0.023.
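The suspected link between runs conceded and balls bowled is easy to check with a correlation. The sketch below uses made-up bowling figures (the `toy` frame is an illustrative assumption); the same `.corr()` call could be applied to the `runs_conceded` and `balls_bowled` columns of `IPLPlayer`:

```python
import pandas as pd

# Toy bowling records (illustrative values only, not the IPL data).
toy = pd.DataFrame({
    "balls_bowled":  [0, 24, 60, 120, 240, 360],
    "runs_conceded": [0, 38, 75, 160, 300, 480],
})

# Pearson correlation between workload and runs conceded.
corr = toy["runs_conceded"].corr(toy["balls_bowled"])
print(round(corr, 3))
```

A correlation close to 1 would mean that runs conceded mostly measures how much a player bowled, not how well.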

In [34]:
reg_IPL8=sm.ols(formula = 'Salary ~ runs_conceded+balls_bowled', data= IPLPlayer).fit()
print(reg_IPL8.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.042
Model:                            OLS   Adj. R-squared:                  0.028
Method:                 Least Squares   F-statistic:                     3.026
Date:                Fri, 15 Mar 2024   Prob (F-statistic):             0.0518
Time:                        21:33:55   Log-Likelihood:                -2070.5
No. Observations:                 141   AIC:                             4147.
Df Residuals:                     138   BIC:                             4156.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      5.565e+05   6.54e+04      8.514

Now the estimated coefficient on runs conceded is negative, while the estimated coefficient on balls bowled is positive. This supports our previous suspicion that it is the number of balls bowled that positively affects player salary. However, the estimate on runs conceded is not statistically significant, with a p-value of 0.172, and the estimated coefficient on balls bowled is significant only at the 10% level, with a p-value of 0.096. The R-squared is still very small, suggesting a poor fit for our data.

In [35]:
reg_IPL9=sm.ols(formula = 'Salary ~ runs_conceded+balls_bowled+wickets', data= IPLPlayer).fit()
print(reg_IPL9.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.049
Model:                            OLS   Adj. R-squared:                  0.028
Method:                 Least Squares   F-statistic:                     2.329
Date:                Fri, 15 Mar 2024   Prob (F-statistic):             0.0772
Time:                        21:35:54   Log-Likelihood:                -2070.1
No. Observations:                 141   AIC:                             4148.
Df Residuals:                     137   BIC:                             4160.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      5.543e+05   6.54e+04      8.472

The estimates on runs conceded and wickets taken are both negative. The sign on runs conceded suggests that the poorer the bowler's performance, the lower the salary, while the negative sign on wickets taken is the opposite of what we would expect. The coefficient on balls bowled is positive at 6343.24, which means that bowling one more ball is associated with a salary 6343.24 higher. Neither the runs conceded nor the wickets taken variable is statistically significant. The balls bowled variable is statistically significant at the 5% level. This may indicate that a bowler's performance is not as important a factor as how much he played.

In the next regression, we will use the modified bowling average and bowling strike variables to measure player performance.

In [36]:
reg_IPL10=sm.ols(formula = 'Salary ~ bowling_average+bowling_strike', data= IPLPlayer).fit()
print(reg_IPL10.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.054
Model:                            OLS   Adj. R-squared:                  0.040
Method:                 Least Squares   F-statistic:                     3.912
Date:                Fri, 15 Mar 2024   Prob (F-statistic):             0.0223
Time:                        21:38:26   Log-Likelihood:                -2069.7
No. Observations:                 141   AIC:                             4145.
Df Residuals:                     138   BIC:                             4154.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        6.535e+05   7.33e+04     

In this regression, both bowling average and bowling strike are statistically significant; the estimate on bowling average is negative and the estimate on bowling strike is positive. Recall that bowling average is defined as the number of runs conceded divided by the number of wickets taken, and bowling strike is defined as the number of balls bowled divided by the number of wickets taken. The sign on bowling average makes sense: the lower the bowling average, the better the bowler's performance. The sign on bowling strike, however, is the opposite of what we would expect. Bowling strike measures how effective a bowler is at taking wickets: the lower the bowling strike, the more quickly he takes wickets and gets batsmen out. Our regression results suggest that the more effective the bowler is, the less salary he receives. Again, this may be due to the dominant role of the number of balls bowled.
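One thing to keep in mind when reading these signs: with the +1 adjustment, a player who never bowled gets a bowling average and bowling strike of exactly 0, the lowest value possible, and the descriptive statistics above show a 25th percentile of 0 for bowling_average. Whether this drives the unexpected sign is not established here. The sketch below, on made-up rows, only shows the mechanics and one possible way to restrict the sample to bowlers:

```python
import numpy as np
import pandas as pd

# Toy rows: one non-bowler and two bowlers (illustrative values only).
toy = pd.DataFrame({
    "runs_conceded": [0, 300, 250],
    "balls_bowled":  [0, 240, 150],
    "wickets":       [0, 14, 4],
})
toy["Bowler"] = np.where(toy["balls_bowled"] > 0, 1, 0)
toy["bowling_average"] = toy["runs_conceded"] / (toy["wickets"] + 1)
toy["bowling_strike"] = toy["balls_bowled"] / (toy["wickets"] + 1)

# The non-bowler scores 0 on both measures; filtering keeps bowlers only.
bowlers_only = toy[toy["Bowler"] == 1]
print(toy)
print(bowlers_only)
```

Fitting the bowling regressions on a bowlers-only subset would be one way to test whether the zeros from non-bowlers affect the estimates.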

# Lastly, we will incorporate performance measures of both batsman and bowler in the same regression.
We will first use the original variables, total number of runs, number of not outs, number of balls faced, number of runs conceded, number of balls bowled, and number of wickets in the regression.

In [37]:
reg_IPL11=sm.ols(formula = 'Salary ~ runs+not_outs+balls_faced+runs_conceded+balls_bowled+wickets', data= IPLPlayer).fit()
print(reg_IPL11.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.408
Model:                            OLS   Adj. R-squared:                  0.382
Method:                 Least Squares   F-statistic:                     15.41
Date:                Fri, 15 Mar 2024   Prob (F-statistic):           2.20e-13
Time:                        21:41:28   Log-Likelihood:                -2036.6
No. Observations:                 141   AIC:                             4087.
Df Residuals:                     134   BIC:                             4108.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      1.458e+05   7.32e+04      1.993

Compared to regression 4 and regression 9, the signs of the coefficients are the same, but the sizes of the estimates are smaller in the new regression. The coefficient on runs changes from 2871 to 2089, while the coefficient on not_outs changes from 89,450 to 59,500. The estimated coefficient on balls faced changes from -2044 to -354, and that on runs conceded changes from -3049 to -1737. Additionally, the estimate on balls bowled changes from 6243 to 5030, and the estimated coefficient on wickets taken changes from -27,370 to -22,310. From the relative size changes, we can see that runs is the most important determinant of a batsman's salary, while for bowlers, the number of balls bowled and the number of wickets taken are more important. Also note that in this new regression, the R-squared improves to 0.408, which suggests a better fit than the previous models.

In [38]:
reg_IPL12=sm.ols(formula = 'Salary ~ batting_average+batting_strike+bowling_average+bowling_strike', data= IPLPlayer).fit()
print(reg_IPL12.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.308
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                     15.16
Date:                Fri, 15 Mar 2024   Prob (F-statistic):           2.85e-10
Time:                        21:46:35   Log-Likelihood:                -2047.6
No. Observations:                 141   AIC:                             4105.
Df Residuals:                     136   BIC:                             4120.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept         1.37e+05   1.14e+05     

The estimated coefficient on batting average changes from 19,000 to 24,000. The estimated coefficient on batting strike changes from 6,353,000 to -612,000, and it remains statistically insignificant. The estimated coefficient on bowling average changes from -33,000 to -31,860, and the estimated coefficient on bowling strike changes from 49,140 to 59,410. The R-squared is now 0.308, which is higher than the R-squared of regression 6 at 0.234 and much higher than the R-squared of regression 10 at 0.054. From this analysis, we can see that, compared to bowlers' performance, the performance of batsmen is more important in determining salary.
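Because R-squared never decreases when regressors are added, comparing models with different numbers of variables is fairer on adjusted R-squared. As a sketch, the standard formula can be applied to the rounded figures reported above (141 observations; 6 regressors in regression 11 and 4 in regression 12). The helper name `adjusted_r2` is a hypothetical one introduced for illustration:

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Rounded R-squared values taken from the two summaries above.
adj_11 = adjusted_r2(0.408, 141, 6)
adj_12 = adjusted_r2(0.308, 141, 4)
print(round(adj_11, 3), round(adj_12, 3))
```

The results land close to the reported Adj. R-squared values of 0.382 and 0.288; small differences arise because the inputs are rounded to three decimals. On this measure too, the model with the raw counts fits better than the one with the modified ratios.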