First, I read the final output data into a pandas data frame and select the matches that were affected by the Duckworth-Lewis (D/L) method. Then I read the grounds data into a second pandas data frame.

In [1]:
import pandas as pd
dat = pd.read_csv("../data/final_output.csv")
DL = dat[dat[' duckworth_lewis']==1]
groundsDat = pd.read_csv("../data/grounds.csv")

Now, I define a function that merges the final output data with the grounds data. I then apply it to the "all" data set ("dat") and to the subset of matches affected by D/L ("DL"). Some rows are lost here because some games were played at grounds for which no ground information is available. However, the loss is small: about 1% of the matches, as the output below shows.

In [2]:
def mergeGrounds(df, grounds):
    # Get rid of the spaces in the names of grounds
    df2 = df.rename(columns={' ground': 'ground'}, inplace=False)
    df2['ground'] = df2['ground'].str.strip()
    return pd.merge(df2, grounds, on='ground')

DLmerged = mergeGrounds(DL, groundsDat)
allMerged = mergeGrounds(dat, groundsDat)

print "Out of ", dat.shape[0], " games, ", allMerged.shape[0], " are being retained"
print "That is ", allMerged.shape[0]*100./dat.shape[0], "% of total matches"

Out of  43185  games,  42740  are being retained
That is  98.9695496121 % of total matches
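
To see which grounds are responsible for the dropped rows, one option is a left merge with the `indicator=True` option of `pd.merge`, which adds a `_merge` column marking rows that found no partner. This is only a sketch on two tiny made-up frames; the ground names and countries below are illustrative and are not taken from the data files.

```python
import pandas as pd

# Toy stand-ins for the match data and the grounds data (made-up values)
matches = pd.DataFrame({'ground': ["Lord's", 'Eden Gardens', 'Unknown Park', "Lord's"]})
grounds = pd.DataFrame({'ground': ["Lord's", 'Eden Gardens'],
                        'country': ['England', 'India']})

# A left merge keeps every match; _merge records whether ground info was found
checked = pd.merge(matches, grounds, on='ground', how='left', indicator=True)
missing = checked[checked['_merge'] == 'left_only']

# Grounds that the default inner merge would drop, with their match counts
missingCounts = missing['ground'].value_counts()
print(missingCounts)
```

Applied to the real frames, the same pattern would list the grounds whose information is missing, which helps judge whether the lost matches are concentrated in a few places.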


The first question I asked is, are day/night games affected by D/L more than day games?

In [3]:
import matplotlib.pyplot as plt
allDN = sum(allMerged[' day_n_night']==1)*100./len(allMerged)
DLDN = sum(DLmerged[' day_n_night']==1)*100./len(DLmerged)
plt.bar(range(2),[allDN, DLDN])
plt.xticks([0.5,1.5], ["All", "D/L affected"])
plt.ylabel("% of day & night matches")
#plt.savefig("../figures/01-DL-DN.png")
plt.show()

<img align=center src="test/01-DL-DN.png" width="400" height="400"/>

Clearly, day & night games are affected more by D/L.
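
One way to check that a gap like this is larger than chance would be a two-proportion z-test. Note that the "All" group contains the D/L matches, so for such a test the comparison should be between D/L-affected and unaffected matches, which are separate groups. The sketch below is not part of the original analysis; the function name `twoProportionZ` and the counts are made up for illustration.

```python
import math

def twoProportionZ(hits1, n1, hits2, n2):
    """z statistic for the difference between two independent proportions."""
    p1 = hits1 / float(n1)
    p2 = hits2 / float(n2)
    # Pooled proportion under the hypothesis that both groups share one rate
    pooled = (hits1 + hits2) / float(n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se

# Made-up counts: day/night games among D/L-affected vs unaffected matches
z = twoProportionZ(hits1=60, n1=200, hits2=400, n2=4000)
print("z = %.2f" % z)
```

A value of |z| above roughly 1.96 corresponds to the usual 5% significance level for a two-sided test.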

The next question I ask is, does D/L affect matches at all locations (countries)?

In [None]:
totalCountries = len(allMerged['country'].value_counts())
DLcountries = len(DLmerged['country'].value_counts())
plt.bar(range(2),[totalCountries, DLcountries])
plt.xticks([0.5,1.5], ["All", "D/L affected"])
plt.ylabel("Number of countries")
plt.savefig("../figures/02-DL-countries.png")
plt.show() 

<img align=center src="../figures/02-DL-countries.png" width="400" height="400"/>

We see that of the 41 countries in which matches have been played, only 24 have had matches affected by D/L so far.
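
To list the countries that have had no D/L-affected matches, rather than only counting them, a set difference is enough. The sketch below uses two made-up frames (`allToy` and `dlToy` are hypothetical stand-ins for `allMerged` and `DLmerged`).

```python
import pandas as pd

# Made-up stand-ins for allMerged and DLmerged
allToy = pd.DataFrame({'country': ['England', 'India', 'Kenya', 'England']})
dlToy = pd.DataFrame({'country': ['England', 'England']})

# nunique() counts distinct countries, like len(value_counts())
totalCountries = allToy['country'].nunique()
DLcountries = dlToy['country'].nunique()

# Countries that appear in the full data but never in the D/L subset
noDL = sorted(set(allToy['country']) - set(dlToy['country']))
print(noDL)
```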


Next, I look at how location (country) affects the number of matches affected by D/L.

In [None]:
import numpy as np

def plotCountries(df, save=0, fName="test.png"):
    countries = df['country'].value_counts()
    names = list(countries.index)
    # Renaming some of the countries to make the labels look better
    if ("United States of America" in names):
        ind = names.index("United States of America")
        names[ind] = "USA"
    if ("United Arab Emirates" in names):
        ind = names.index("United Arab Emirates")
        names[ind] = "UAE"
    if ("Papua New Guinea" in names):
        ind = names.index("Papua New Guinea")
        names[ind] = "PNG"
    if ("Cayman Islands" in names):
        ind = names.index("Cayman Islands")
        names[ind] = "KY"
    # Done
    xVals = np.array(range(len(countries)))
    plt.bar(xVals, countries)
    plt.xticks(xVals+0.5,names,rotation='vertical')
    plt.gcf().subplots_adjust(bottom=0.3)
    if (save):
        plt.savefig("../figures/"+fName) 
    plt.show()

plotCountries(DLmerged, save=1, fName="03-DL-countries.png")

<img align=center src="./figures/03-DL-countries.png" width="500" height="400"/>

At first glance, England appears to have the most matches affected by D/L. However, this might simply be because more matches are played in England. In other words, the highest count need not mean the highest probability.
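
A toy example with made-up numbers makes the distinction concrete: country "A" has the most D/L-affected matches, yet country "B" has the higher percentage.

```python
import pandas as pd

# Made-up numbers, for illustration only
dlCounts = pd.Series({'A': 50, 'B': 10})      # D/L-affected matches per country
allCounts = pd.Series({'A': 5000, 'B': 200})  # all limited-overs matches per country

# Division aligns the two Series on their country labels
percDL = (100.0 * dlCounts / allCounts).sort_values(ascending=False)
print(percDL)
```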

So we need to look at the number of matches played per country. Since D/L comes into play only in limited-overs games, we focus on the four categories "ODI", "LISTA", "T20I" and "T20", and leave out "TEST" and "FC".

In [None]:
limitedDat = dat[(dat[' type_of_match']!='TEST') & (dat[' type_of_match']!='FC')]
limMerged = mergeGrounds(limitedDat, groundsDat)
plotCountries(limMerged,save=1, fName="04-All-countries.png")

<img align=center src="./figures/04-All-countries.png" width="500" height="400"/>

We see that the hunch was right. The higher occurrence of D/L affected matches in England was simply due to the higher number of matches being played in England. We need to look at the percentage of matches being affected by D/L instead of the number of them.

In [None]:
def plotCountriesPerc(df, dfRef, save=0, fName = "test.png"):
    countries = df['country'].value_counts().sort_index()
    allMatchCountries = dfRef['country'].value_counts().sort_index()
    percDL = (100.0*countries/allMatchCountries).sort_values(ascending=False)
    names = list(percDL.index)
    # Renaming some of the countries to make the labels look better
    if ("United States of America" in names):
        ind = names.index("United States of America")
        names[ind] = "USA"
    if ("United Arab Emirates" in names):
        ind = names.index("United Arab Emirates")
        names[ind] = "UAE"
    if ("Papua New Guinea" in names):
        ind = names.index("Papua New Guinea")
        names[ind] = "PNG"
    if ("Cayman Islands" in names):
        ind = names.index("Cayman Islands")
        names[ind] = "KY"
    ## Done
    xVals = np.array(range(len(percDL)))
    plt.bar(xVals, percDL)
    plt.xticks(xVals+0.5,names,rotation='vertical')
    plt.ylabel("% of games affected by D/L")
    plt.gcf().subplots_adjust(bottom=0.2)
    if (save):
        plt.savefig("../figures/"+fName) 
    plt.show()

limMergedCountries = limMerged[limMerged['country'].isin(DLmerged['country'])]
plotCountriesPerc(DLmerged,limMergedCountries, save=1, fName='05-DL-countries-percentage.png')

<img align=center src="./figures/05-DL-countries-percentage.png" width="500" height="400"/>
 
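Now we see a totally different picture: ranked by percentage rather than by count, the ordering of countries changes. Since a percentage based on only a few games can be unreliable, let us focus on the countries where the most games are played. Let us choose 1000 games as an arbitrary cut-off for the countries we want to include in the analysis.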


In [None]:
countries = limMerged['country'].value_counts()
# Keep only the countries with more than 1000 limited-overs games
topCountries = list(countries[countries > 1000].index)
topMergedCountries = limMerged[limMerged['country'].isin(topCountries)]
topDLcountries = DLmerged[DLmerged['country'].isin(topCountries)]
plotCountriesPerc(topDLcountries, topMergedCountries, save=1, fName='06-DL-topCountries-percentage.png')

<img align=center src="./figures/06-DL-topCountries-percentage.png" width="600" height="400"/>
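
The two plotting functions above repeat the same label-shortening block. As a possible tidy-up, the abbreviations could live in a single dictionary. `SHORT_NAMES` and `shortenNames` below are hypothetical names and are not part of the original notebook.

```python
# Hypothetical helper, not part of the original notebook
SHORT_NAMES = {
    "United States of America": "USA",
    "United Arab Emirates": "UAE",
    "Papua New Guinea": "PNG",
    "Cayman Islands": "KY",
}

def shortenNames(names):
    """Return a new list with the long country names abbreviated."""
    return [SHORT_NAMES.get(name, name) for name in names]

labels = shortenNames(["England", "United Arab Emirates", "Cayman Islands"])
print(labels)
```

Both plotCountries and plotCountriesPerc could then call this helper on their list of index labels instead of repeating the four `if` blocks.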

The next question I ask is, is there a correlation between the time of the year when the match is being played and the probability of the match being affected by D/L?

