# xG Model

The idea is to predict whether a given chance or event will be converted, using attributes recorded for that event. This gives an estimate of how likely an event, if it occurs, is to end in a goal.

Data:
Data provided courtesy of Stratabet. Here, I've used data from the English Championship, the English Premiership, the Bundesliga, and the first divisions of France, Spain and Italy. It runs from the beginning of the 2016-17 season to the current 2017-18 season.

Attributes:
For now, I've used the attributes 'icon' (type of event), 'shotQuality' (values as defined by Stratabet), 'defPressure', 'numDefPlayers', 'numAttPlayers', 'chanceRating' (values as defined by Stratabet) and 'type' (the passage of play). All attributes are encoded as numeric values. The 'outcome' variable is binary encoded, of course.

Although chanceRating and shotQuality already capture how likely a shot is to go in, I would also like to incorporate shot location later on.

In [1]:
#####################################################################################################
# STEP 1: Loading in Data 

# Use data through Pandas and Numpy manipulation


In [2]:
# Loading Libraries

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from scipy.stats import norm
from scipy.cluster import vq
import plotly.graph_objs as go
import plotly.plotly as py
import plotly as pl


In [3]:
# Loading datasets

engch16 = pd.read_csv('latestdatamarch/EngCh/2016-17/2017-06-27_chances_2016-07-01_2017-06-15.csv')
engch17 = pd.read_csv('latestdatamarch/EngCh/chances_from_2017-07-01.csv')
engpr16 = pd.read_csv('latestdatamarch/EngPr/2016-17/2017-06-27_chances_2016-07-01_2017-06-15.csv')
engpr17 = pd.read_csv('latestdatamarch/EngPr/chances_from_2017-07-01.csv')
bl16 = pd.read_csv('latestdatamarch/GerBL1/2016-17/2017-06-27_chances_2016-07-01_2017-06-15.csv')
bl17 = pd.read_csv('latestdatamarch/GerBL1/chances_from_2017-07-01.csv')
ita16 = pd.read_csv('latestdatamarch/ItaSA/2016-17/2017-06-27_chances_2016-07-01_2017-06-15.csv')
ita17 = pd.read_csv('latestdatamarch/ItaSA/chances_from_2017-07-01.csv')
fra16 = pd.read_csv('latestdatamarch/FraL1/2016-17/2017-06-27_chances_2016-07-01_2017-06-15.csv')
fra17 = pd.read_csv('latestdatamarch/FraL1/chances_from_2017-07-01.csv')
spa16 = pd.read_csv('latestdatamarch/SpaPr/2016-17/2017-06-27_chances_2016-07-01_2017-06-15.csv')
spa17 = pd.read_csv('latestdatamarch/SpaPr/chances_from_2017-07-01.csv')
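
The read_csv calls above all follow one pattern, so they could also be driven from a dictionary of paths. The sketch below is only an illustration of that idea: `load_chances` and `split_holdout` are hypothetical helper names, and the labels passed in would be whatever short names you choose for each league and season.

```python
import pandas as pd


def load_chances(paths):
    """Read every CSV in `paths` (label -> file path) into a dict of DataFrames."""
    return {label: pd.read_csv(path) for label, path in paths.items()}


def split_holdout(frames, holdout):
    """Concatenate every frame except `holdout`, which is returned separately as the test set."""
    train = pd.concat([frame for label, frame in frames.items() if label != holdout])
    return train, frames[holdout]
```

With the files above, `split_holdout(frames, 'engpr17')` would keep the 2017-18 Premiership file aside as the test set and stack the rest, which mirrors the split used in this notebook.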


In [4]:
engch16.shape

(12730, 27)

In [5]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the same frames and keeps their indexes
df = pd.concat([engch16, engch17, engpr16, bl16, bl17, ita16, ita17, fra16, fra17, spa16, spa17])
test = engpr17

In [6]:
df.shape

(84334, 27)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84334 entries, 0 to 5845
Data columns (total 27 columns):
Unnamed: 0           84334 non-null int64
competition          84334 non-null object
gsm_id               84334 non-null int64
kickoffDate          84334 non-null object
kickoffTime          84334 non-null object
hometeam_team1       84334 non-null object
awayteam_team2       84334 non-null object
icon                 84334 non-null object
chanceRating         84334 non-null object
team                 84334 non-null object
type                 84334 non-null object
time                 84334 non-null object
player               84334 non-null object
location_x           84334 non-null object
location_y           84334 non-null object
bodyPart             84334 non-null object
shotQuality          83153 non-null object
defPressure          84334 non-null object
numDefPlayers        84334 non-null object
numAttPlayers        84334 non-null object
outcome              84334 non-nul

In [8]:
#####################################################################################################
# STEP 2: Visualising Shot Locations 

# Understanding how well the data provided by Stratabet has been assigned chanceRatings


In [9]:
pl.tools.set_credentials_file(username='abhinavr8', api_key='YOUR_API_KEY')  # use your own Plotly API key; never publish the real one
csl = pd.read_csv("china.csv")

print ('The train data has {} rows and {} columns'.format(csl.shape[0],csl.shape[1]))

total = csl.isnull().sum().sort_values(ascending=False)
percent = (csl.isnull().sum()/csl['gsm_id'].count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent*100], axis=1, keys=['Total', 'Percent'])

# Treat '-' and missing values as 0 and store shotQuality as float
csl['shotQuality'] = csl['shotQuality'].replace(['-'], 0).astype(float).fillna(0)

csl2 = csl.loc[csl['team'] == 'Guangzhou Evergrande']
csl2 = csl2.loc[csl2['icon'] == 'goal']
csl2 = csl2[['location_x', 'location_y', 'chanceRating']]
#csl2 = csl2.loc[csl2['chanceRating'] != 'Penalty' ]

csl3 = csl2.loc[csl2['chanceRating'] == 'Superb']
csl4 = csl2.loc[csl2['chanceRating'] == 'Great'] 
csl5 = csl2.loc[csl2['chanceRating'] == 'Very Good']
# Adding two DataFrames with + does element-wise arithmetic; isin selects rows with either rating
csl6 = csl2.loc[csl2['chanceRating'].isin(['Poor', 'Fairly Good'])]

N = 500

trace0 = go.Scatter(
    x = csl3['location_x'],
    y = csl3['location_y'],
    name = 'Chances > 83% scoring chance',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'rgba(255,0,0, .8)',
        line = dict(
            width = 2,
            color = 'rgb(0, 0, 0)'
        ))
)
trace1 = go.Scatter(
    x = csl4['location_x'],
    y = csl4['location_y'],
    name = 'Chances > 43% scoring chance',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'rgba(255,185,0, .9)',
        line = dict(
            width = 2,
        ))
)
trace2 = go.Scatter(
    x = csl5['location_x'],
    y = csl5['location_y'],
    name = 'Chances > 22% scoring chance',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'rgba(0,185,0, .9)',
        line = dict(
            width = 2,
        ))
)
trace3 = go.Scatter(
    x = csl6['location_x'],
    y = csl6['location_y'],
    name = 'Chances > 3% scoring chance',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'rgba(0,0, 225, .9)',
        line = dict(
            width = 2,
        ))
)

data = [trace0, trace1, trace2, trace3]

layout = dict(title = 'Goals scored by Guangzhou Evergrande',
              yaxis = dict(zeroline = False),
              xaxis = dict(zeroline = False)
             )

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='styled-scatter')


The train data has 5393 rows and 27 columns
High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~abhinavr8/0 or inside your plot.ly account where it is named 'styled-scatter'
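
The four `go.Scatter` definitions above differ only in the data they plot, the legend label and the marker colour, so they can be generated by one function. The sketch below builds plain dictionaries, which Plotly accepts as traces in the `fig = dict(data=..., layout=...)` form used above. `make_trace`, `rating_traces` and `RATING_STYLES` are illustrative names, the marker outline is simplified to a single width, and grouping 'Poor' with 'Fairly Good' through `isin` is my reading of what the fourth trace was meant to show.

```python
import pandas as pd

# (ratings in the group, legend text, marker colour), copied from the cell above
RATING_STYLES = [
    (["Superb"], "Chances > 83% scoring chance", "rgba(255,0,0, .8)"),
    (["Great"], "Chances > 43% scoring chance", "rgba(255,185,0, .9)"),
    (["Very Good"], "Chances > 22% scoring chance", "rgba(0,185,0, .9)"),
    (["Poor", "Fairly Good"], "Chances > 3% scoring chance", "rgba(0,0, 225, .9)"),
]


def make_trace(frame, name, color):
    """Return a scatter trace (as a plain dict) for the shots in `frame`."""
    return dict(
        type="scatter",
        x=frame["location_x"].tolist(),
        y=frame["location_y"].tolist(),
        name=name,
        mode="markers",
        marker=dict(size=10, color=color, line=dict(width=2)),
    )


def rating_traces(goals):
    """Build one trace per chanceRating group for a frame of goals."""
    traces = []
    for ratings, name, color in RATING_STYLES:
        subset = goals[goals["chanceRating"].isin(ratings)]
        traces.append(make_trace(subset, name, color))
    return traces
```

The same function can then serve any team: filter `csl` to the team's goals, pass the result to `rating_traces`, and use the returned list as `data`.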


In [10]:
csl2 = csl.loc[csl['team'] == 'Shanghai SIPG']
csl2 = csl2.loc[csl2['icon'] == 'goal']
csl2 = csl2[['location_x', 'location_y', 'chanceRating']]

csl3 = csl2.loc[csl2['chanceRating'] == 'Superb']
csl4 = csl2.loc[csl2['chanceRating'] == 'Great'] 
csl5 = csl2.loc[csl2['chanceRating'] == 'Very Good']
# Adding two DataFrames with + does element-wise arithmetic; isin selects rows with either rating
csl6 = csl2.loc[csl2['chanceRating'].isin(['Poor', 'Fairly Good'])]

N = 500

trace0 = go.Scatter(
    x = csl3['location_x'],
    y = csl3['location_y'],
    name = 'Chances > 83% scoring chance',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'rgba(255,0,0, .8)',
        line = dict(
            width = 2,
            color = 'rgb(0, 0, 0)'
        ))
)
trace1 = go.Scatter(
    x = csl4['location_x'],
    y = csl4['location_y'],
    name = 'Chances > 43% scoring chance',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'rgba(255,185,0, .9)',
        line = dict(
            width = 2,
        ))
)
trace2 = go.Scatter(
    x = csl5['location_x'],
    y = csl5['location_y'],
    name = 'Chances > 22% scoring chance',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'rgba(0,185,0, .9)',
        line = dict(
            width = 2,
        ))
)
trace3 = go.Scatter(
    x = csl6['location_x'],
    y = csl6['location_y'],
    name = 'Chances > 3% scoring chance',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'rgba(0,0, 225, .9)',
        line = dict(
            width = 2,
        ))
)

data = [trace0, trace1, trace2, trace3]
layout = dict(title = 'Goals scored by Shanghai SIPG',
              yaxis = dict(zeroline = False),
              xaxis = dict(zeroline = False)
             )

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='styled-scatter')


High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~abhinavr8/0 or inside your plot.ly account where it is named 'styled-scatter'


In [11]:
#####################################################################################################
# STEP 3: Clean Data 

# Remove missing values, treating noisy data


In [12]:
# import from numpy
import numpy as np
import pandas as pd

df.head()

Unnamed: 0.1,Unnamed: 0,competition,gsm_id,kickoffDate,kickoffTime,hometeam_team1,awayteam_team2,icon,chanceRating,team,...,defPressure,numDefPlayers,numAttPlayers,outcome,primaryPlayer,primaryType,primaryLocation_x,primaryLocation_y,secondaryPlayer,secondaryType
0,302,EngCh,2237445,2017-05-29,14:00:00,Huddersfield Town,Reading,goal,Penalty,Huddersfield Town,...,0,1,0,-,-,-,-,-,-,-
1,301,EngCh,2237445,2017-05-29,14:00:00,Huddersfield Town,Reading,goal,Penalty,Huddersfield Town,...,0,1,0,-,-,-,-,-,-,-
2,300,EngCh,2237445,2017-05-29,14:00:00,Huddersfield Town,Reading,goal,Penalty,Reading,...,0,1,0,-,-,-,-,-,-,-
3,299,EngCh,2237445,2017-05-29,14:00:00,Huddersfield Town,Reading,goal,Penalty,Reading,...,0,1,0,-,-,-,-,-,-,-
4,298,EngCh,2237445,2017-05-29,14:00:00,Huddersfield Town,Reading,goal,Penalty,Huddersfield Town,...,0,1,0,-,-,-,-,-,-,-


In [13]:
# Picking out best features to work on

In [14]:
df.shape

(84334, 27)

In [15]:
df = df[['icon', "bodyPart","location_x","location_y","shotQuality","defPressure","numDefPlayers","numAttPlayers","outcome",'primaryType', 'primaryLocation_x', 'primaryLocation_y', 'secondaryType', 'chanceRating', 'type']] 
test = test[['icon', "bodyPart","location_x","location_y","shotQuality","defPressure","numDefPlayers","numAttPlayers","outcome",'primaryType', 'primaryLocation_x', 'primaryLocation_y', 'secondaryType', 'chanceRating', 'type']] 
df.head()

Unnamed: 0,icon,bodyPart,location_x,location_y,shotQuality,defPressure,numDefPlayers,numAttPlayers,outcome,primaryType,primaryLocation_x,primaryLocation_y,secondaryType,chanceRating,type
0,goal,Right,0.0,44.0,3,0,1,0,-,-,-,-,-,Penalty,Penalty
1,goal,Right,0.0,44.0,3,0,1,0,-,-,-,-,-,Penalty,Penalty
2,goal,Right,0.0,44.0,4,0,1,0,-,-,-,-,-,Penalty,Penalty
3,goal,Right,0.0,44.0,3,0,1,0,-,-,-,-,-,Penalty,Penalty
4,goal,Left,0.0,44.0,3,0,1,0,-,-,-,-,-,Penalty,Penalty


In [16]:
null_values = df.isnull().sum()
null_values

icon                    0
bodyPart                0
location_x              0
location_y              0
shotQuality          1181
defPressure             0
numDefPlayers           0
numAttPlayers           0
outcome                 0
primaryType             0
primaryLocation_x       0
primaryLocation_y       0
secondaryType           0
chanceRating            0
type                    0
dtype: int64
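
shotQuality is the only column with true missing values, but the columns are stored as strings and several of them use '-' as a placeholder, which `isnull()` does not count. One way to catch both cases at once is `pd.to_numeric` with `errors='coerce'`, which turns anything non-numeric into NaN. This is a sketch of an alternative, not what the following cells do, and `to_number` is a hypothetical helper name.

```python
import pandas as pd


def to_number(column):
    """Convert numeric strings to floats; '-' and other non-numeric values become NaN."""
    return pd.to_numeric(column, errors="coerce")


# Invented sample mixing numeric strings, the '-' placeholder and a missing value
sample = pd.Series(["3", "4", "-", None, "0"])
cleaned = to_number(sample)
print(cleaned.isna().sum())  # how many entries could not be parsed
```

After this conversion a single `dropna` or `fillna` handles placeholders and missing values together.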

In [17]:
# df.loc[df['defPressure']  == '-'] + df.loc[['defPressure']  == 'NaN']
df['shotQuality'] = df['shotQuality'].replace('-', 0)
test['shotQuality'] = test['shotQuality'].replace('-',0)
#empty.head()
df.head()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



Unnamed: 0,icon,bodyPart,location_x,location_y,shotQuality,defPressure,numDefPlayers,numAttPlayers,outcome,primaryType,primaryLocation_x,primaryLocation_y,secondaryType,chanceRating,type
0,goal,Right,0.0,44.0,3,0,1,0,-,-,-,-,-,Penalty,Penalty
1,goal,Right,0.0,44.0,3,0,1,0,-,-,-,-,-,Penalty,Penalty
2,goal,Right,0.0,44.0,4,0,1,0,-,-,-,-,-,Penalty,Penalty
3,goal,Right,0.0,44.0,3,0,1,0,-,-,-,-,-,Penalty,Penalty
4,goal,Left,0.0,44.0,3,0,1,0,-,-,-,-,-,Penalty,Penalty
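
The warning in this cell appears because `df` and `test` were produced by selecting columns from another frame, so pandas cannot tell whether the assignment should also affect the original. Taking an explicit copy at selection time is the usual way to make the intent clear. A small sketch with invented data:

```python
import pandas as pd

raw = pd.DataFrame({
    "icon": ["goal", "poorchance"],
    "shotQuality": ["3", "-"],
    "player": ["A", "B"],
})

# .copy() gives an independent frame, so later assignments are unambiguous
subset = raw[["icon", "shotQuality"]].copy()
subset["shotQuality"] = subset["shotQuality"].replace("-", "0")

print(subset)
print(raw["shotQuality"].tolist())  # the original frame is left untouched
```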


In [23]:
#df.shotQuality = df.shotQuality.astype(int)
#test.shotQuality = test.shotQuality.astype(int)

In [24]:
df.shotQuality.unique()

array(['3', '4', '2', '0', '1', 0, nan, '5'], dtype=object)

In [25]:
df.head()

Unnamed: 0,icon,bodyPart,location_x,location_y,shotQuality,defPressure,numDefPlayers,numAttPlayers,outcome,primaryType,primaryLocation_x,primaryLocation_y,secondaryType,chanceRating,type
0,goal,Right,0.0,44.0,3,0,1,0,-,-,-,-,-,Penalty,Penalty
1,goal,Right,0.0,44.0,3,0,1,0,-,-,-,-,-,Penalty,Penalty
2,goal,Right,0.0,44.0,4,0,1,0,-,-,-,-,-,Penalty,Penalty
3,goal,Right,0.0,44.0,3,0,1,0,-,-,-,-,-,Penalty,Penalty
4,goal,Left,0.0,44.0,3,0,1,0,-,-,-,-,-,Penalty,Penalty


In [26]:
df = df.dropna(subset = ['shotQuality']) # remove where shotQuality is NaN
test = test.dropna(subset = ['shotQuality'])
test.shotQuality.unique()

array(['3', '2', '0', '1', '4', '5', 0], dtype=object)

In [27]:
df.shotQuality.unique()

array(['3', '4', '2', '0', '1', 0, '5'], dtype=object)

In [28]:
df = df[df.icon != 'owngoal'] # removing own goals
test = test[test.icon != 'owngoal'] # removing own goals

In [29]:
df.shotQuality.unique()

array(['3', '4', '2', '0', '1', '5'], dtype=object)

In [30]:
df.primaryType.unique()

array(['-', 'Cross High', 'Free Kick', 'Cross Low', 'Open Play Pass',
       'Free Kick Won', 'Corner', 'Shot (Deflection)',
       'Shot (Opposition Rebound)', 'Turnover', 'Penalty Earned',
       'Throw in', 'Shot (Woodwork Rebound)', 'Dangerous Moment',
       'Corner Won'], dtype=object)

In [31]:
# Further reduction of attributes

In [32]:
df = df[['icon', "location_x","location_y","shotQuality","defPressure","numDefPlayers","numAttPlayers","outcome", 'chanceRating', 'type']] 
test = test[['icon', "location_x","location_y","shotQuality","defPressure","numDefPlayers","numAttPlayers","outcome", 'chanceRating', 'type']]
df.head()

Unnamed: 0,icon,location_x,location_y,shotQuality,defPressure,numDefPlayers,numAttPlayers,outcome,chanceRating,type
0,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty
1,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty
2,goal,0.0,44.0,4,0,1,0,-,Penalty,Penalty
3,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty
4,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty


In [33]:
df[df.outcome != '-']
test[test.outcome != '-']

Unnamed: 0,icon,location_x,location_y,shotQuality,defPressure,numDefPlayers,numAttPlayers,outcome,chanceRating,type
2,goodchance,37.0,50.0,2,3,4,0,Missed,goodchance,Open Play
3,fairlygoodchance,67.0,36.0,2,1,4,1,Missed,fairlygoodchance,Open Play
4,greatchance,2.0,38.0,0,3,1,0,Defended,greatchance,Dangerous Moment
5,fairlygoodchance,35.0,58.0,3,3,2,0,Saved,fairlygoodchance,Open Play
6,verygoodchance,-45.0,35.0,1,1,1,0,Missed,verygoodchance,Open Play
7,poorchance,63.0,51.0,2,4,4,0,Defended,poorchance,Open Play
8,greatchance,-11.0,18.0,2,3,3,0,Missed,greatchance,Open Play
9,fairlygoodchance,-34.0,56.0,2,1,3,1,Defended,fairlygoodchance,Open Play
10,fairlygoodchance,-31.0,73.0,2,1,3,1,Defended,fairlygoodchance,Open Play
11,fairlygoodchance,42.0,10.0,2,2,3,0,Missed,fairlygoodchance,Open Play


In [34]:
df.type.value_counts()

Open Play                    68992
Open play                     5723
Direct Free-Kick              3842
Dangerous Moment              2906
Penalty                        885
Penalty Earned                 240
Direct free kick               176
-                               51
Direct Corner                   29
Open Play Pass                   4
Cross High                       3
Cross Low                        3
Turnover                         2
Shot (Deflection)                2
Shot (Opposition Rebound)        2
Direct corner                    2
Corner                           1
Free Kick Won                    1
Name: type, dtype: int64
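
The counts above list 'Open Play' and 'Open play', 'Direct Free-Kick' and 'Direct free kick', and 'Direct Corner' and 'Direct corner' as separate categories, although each pair differs only in capitalisation or punctuation. A minimal sketch of merging them before encoding, assuming each pair really does mean the same thing; `normalise_type` is a hypothetical helper:

```python
import pandas as pd


def normalise_type(column):
    """Lower-case the labels and treat hyphens as spaces so spelling variants collapse."""
    return (
        column.str.lower()
        .str.replace("-", " ", regex=False)
        .str.strip()
    )


# Invented sample containing two of the variant pairs seen above
types = pd.Series(["Open Play", "Open play", "Direct Free-Kick", "Direct free kick"])
print(normalise_type(types).value_counts())
```

Note that a bare '-' label would become an empty string under this rule, so it is best applied after the '-' rows have been dropped.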

In [35]:
df = df[df.type != '-']
test = test[test.type != '-']

In [36]:
df.icon.value_counts()

poorchance          27093
fairlygoodchance    18961
goodchance          13031
goal                10314
verygoodchance       7992
greatchance          5017
penmissed             262
superbchance          143
Name: icon, dtype: int64

In [37]:
df.chanceRating.value_counts()

poorchance          27093
fairlygoodchance    18961
goodchance          13031
verygoodchance       7992
greatchance          5017
Great                3768
Very Good            1961
Good                 1064
Fairly Good           948
Superb                892
Penalty               886
Poor                  795
-                     262
superbchance          143
Name: chanceRating, dtype: int64

In [38]:
df[df.chanceRating == '-'].head()

Unnamed: 0,icon,location_x,location_y,shotQuality,defPressure,numDefPlayers,numAttPlayers,outcome,chanceRating,type
128,penmissed,-,-,3,-,-,-,Save,-,Penalty Earned
538,penmissed,-,-,2,-,-,-,Save,-,Penalty Earned
539,penmissed,-,-,3,-,-,-,Save,-,Penalty Earned
966,penmissed,-,-,3,-,-,-,Save,-,Penalty Earned
1271,penmissed,-,-,3,-,-,-,Miss,-,Penalty Earned


In [39]:
df.numAttPlayers.value_counts()

0    60377
1    16031
2     4505
3     1287
4      278
-      262
5       54
6       12
7        7
Name: numAttPlayers, dtype: int64

In [40]:
df.shape

(82813, 10)

In [41]:
#####################################################################################################
# STEP 4: Vectorize Data

# Encode data to particular values to eventually understand their importance


In [42]:
df.head()

Unnamed: 0,icon,location_x,location_y,shotQuality,defPressure,numDefPlayers,numAttPlayers,outcome,chanceRating,type
0,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty
1,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty
2,goal,0.0,44.0,4,0,1,0,-,Penalty,Penalty
3,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty
4,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty


In [43]:
df.columns

Index(['icon', 'location_x', 'location_y', 'shotQuality', 'defPressure',
       'numDefPlayers', 'numAttPlayers', 'outcome', 'chanceRating', 'type'],
      dtype='object')

In [44]:
df.defPressure.value_counts()

2    18241
1    18029
3    17853
0    14859
4    10447
5     3122
-      262
Name: defPressure, dtype: int64

In [45]:
df.head(20)

Unnamed: 0,icon,location_x,location_y,shotQuality,defPressure,numDefPlayers,numAttPlayers,outcome,chanceRating,type
0,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty
1,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty
2,goal,0.0,44.0,4,0,1,0,-,Penalty,Penalty
3,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty
4,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty
5,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty
6,goal,0.0,44.0,3,0,1,0,-,Penalty,Penalty
7,poorchance,54.0,44.0,2,5,3,0,Defended,poorchance,Open Play
8,poorchance,14.0,50.0,0,5,1,0,Defended,poorchance,Dangerous Moment
9,fairlygoodchance,28.0,11.0,0,3,1,0,Defended,fairlygoodchance,Dangerous Moment


In [46]:
df.chanceRating.unique()

array(['Penalty', 'poorchance', 'fairlygoodchance', 'verygoodchance',
       'goodchance', 'greatchance', 'superbchance', 'Good', 'Great',
       'Fairly Good', '-', 'Poor', 'Superb', 'Very Good'], dtype=object)

In [47]:
df.columns

Index(['icon', 'location_x', 'location_y', 'shotQuality', 'defPressure',
       'numDefPlayers', 'numAttPlayers', 'outcome', 'chanceRating', 'type'],
      dtype='object')

In [48]:
df.icon.unique()

array(['goal', 'poorchance', 'fairlygoodchance', 'verygoodchance',
       'goodchance', 'greatchance', 'superbchance', 'penmissed'],
      dtype=object)

In [49]:
cleanup_icon = {"icon": {"goal": 1, "superbchance": 0.83, "greatchance": 0.43, "verygoodchance": 0.22, "goodchance": 0.08, "fairlygoodchance": 0.05,  "poorchance": 0.02,  "penmissed": 0}}

In [50]:
df.replace(cleanup_icon, inplace=True)
test.replace(cleanup_icon, inplace=True)

In [51]:
df.icon.unique()

array([1.  , 0.02, 0.05, 0.22, 0.08, 0.43, 0.83, 0.  ])

In [52]:
df.icon.head(20)

0     1.00
1     1.00
2     1.00
3     1.00
4     1.00
5     1.00
6     1.00
7     0.02
8     0.02
9     0.05
10    0.22
11    0.05
12    0.08
13    0.05
14    0.43
15    0.83
16    0.02
17    0.02
18    0.02
19    0.02
Name: icon, dtype: float64

In [53]:
df.shotQuality.value_counts()

2    27231
1    25932
3    22175
4     3621
0     3350
5      504
Name: shotQuality, dtype: int64

In [54]:
df.defPressure.value_counts()

2    18241
1    18029
3    17853
0    14859
4    10447
5     3122
-      262
Name: defPressure, dtype: int64

In [55]:
df.numDefPlayers.value_counts()

2     28542
3     20655
1     15294
4      9206
5      4164
6      2171
7      1072
0       906
8       402
-       262
9       108
10       26
11        5
Name: numDefPlayers, dtype: int64

In [56]:
df.numAttPlayers.unique()

array(['0', '1', '2', '3', '-', '4', '5', '6', '7'], dtype=object)

In [57]:
df.outcome.unique()

array(['-', 'Defended', 'Missed', 'Saved', 'Woodwork', 'Save', 'Miss'],
      dtype=object)

In [58]:
df.outcome.value_counts()

Missed      32839
Saved       21767
Defended    15870
-           10314
Woodwork     1791
Save          195
Miss           37
Name: outcome, dtype: int64

In [59]:
df.shape

(82813, 10)

In [60]:
df = df[['icon', "shotQuality","defPressure","numDefPlayers","numAttPlayers","outcome", 'chanceRating', 'type']] 
test = test[['icon', "shotQuality","defPressure","numDefPlayers","numAttPlayers","outcome", 'chanceRating', 'type']]

In [61]:
df.shape

(82813, 8)

In [62]:
df.outcome.value_counts()

Missed      32839
Saved       21767
Defended    15870
-           10314
Woodwork     1791
Save          195
Miss           37
Name: outcome, dtype: int64

In [63]:
cleanup_outcome = {"outcome" : { "-" : 1, "Missed":0, "Miss":0, "Save":0, "Woodwork":0, "Defended":0, "Saved":0}}


In [64]:
df.replace( cleanup_outcome , inplace = True )
test.replace(cleanup_outcome, inplace= True)

In [65]:
df.columns

Index(['icon', 'shotQuality', 'defPressure', 'numDefPlayers', 'numAttPlayers',
       'outcome', 'chanceRating', 'type'],
      dtype='object')

In [66]:
df.chanceRating.value_counts()

poorchance          27093
fairlygoodchance    18961
goodchance          13031
verygoodchance       7992
greatchance          5017
Great                3768
Very Good            1961
Good                 1064
Fairly Good           948
Superb                892
Penalty               886
Poor                  795
-                     262
superbchance          143
Name: chanceRating, dtype: int64

In [67]:
cleanup_chance = {"chanceRating": {"Penalty": 1, "Superb":0.83, "superbchance": 0.83, 
                                   "greatchance": 0.43, "Great":0.43,
                                   "verygoodchance": 0.22, "Very Good":0.22, 
                                   "Good":0.08 , "goodchance": 0.08, 
                                   "fairlygoodchance": 0.05,  "Fairly Good":0.05,
                                   "Poor": 0.02, "poorchance": 0.02, "-": 0
                                  }}

In [68]:
df.replace(cleanup_chance, inplace = True)
test.replace(cleanup_chance, inplace= True)

In [69]:
df.chanceRating.unique()

array([1.  , 0.02, 0.05, 0.22, 0.08, 0.43, 0.83, 0.  ])

In [70]:
df.chanceRating.value_counts()

0.02    27888
0.05    19909
0.08    14095
0.22     9953
0.43     8785
0.83     1035
1.00      886
0.00      262
Name: chanceRating, dtype: int64

In [71]:
df.type.unique()

array(['Penalty', 'Open Play', 'Dangerous Moment', 'Direct Free-Kick',
       'Open play', 'Penalty Earned', 'Direct free kick', 'Turnover',
       'Direct Corner', 'Shot (Opposition Rebound)', 'Cross High',
       'Direct corner', 'Open Play Pass', 'Cross Low',
       'Shot (Deflection)', 'Corner', 'Free Kick Won'], dtype=object)

In [72]:
df.head()

Unnamed: 0,icon,shotQuality,defPressure,numDefPlayers,numAttPlayers,outcome,chanceRating,type
0,1.0,3,0,1,0,1,1.0,Penalty
1,1.0,3,0,1,0,1,1.0,Penalty
2,1.0,4,0,1,0,1,1.0,Penalty
3,1.0,3,0,1,0,1,1.0,Penalty
4,1.0,3,0,1,0,1,1.0,Penalty


In [73]:
df.type.unique()

array(['Penalty', 'Open Play', 'Dangerous Moment', 'Direct Free-Kick',
       'Open play', 'Penalty Earned', 'Direct free kick', 'Turnover',
       'Direct Corner', 'Shot (Opposition Rebound)', 'Cross High',
       'Direct corner', 'Open Play Pass', 'Cross Low',
       'Shot (Deflection)', 'Corner', 'Free Kick Won'], dtype=object)

In [74]:
df['type'] = df['type'].astype('category')
test['type'] = test['type'].astype('category')
df.dtypes

icon              float64
shotQuality        object
defPressure        object
numDefPlayers      object
numAttPlayers      object
outcome             int64
chanceRating      float64
type             category
dtype: object

In [75]:
df['type'] = df['type'].cat.codes
test['type'] = test['type'].cat.codes
df.type.unique()

array([12,  9,  3,  5, 11, 13,  7, 16,  4, 15,  1,  6, 10,  2, 14,  0,  8])
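
Because `cat.codes` numbers the categories separately for each frame, the same label can receive different codes in `df` and `test` whenever the two frames do not contain exactly the same set of labels. A sketch of one way to keep them aligned is to fix a shared category list first; `encode_shared` is a hypothetical helper and the toy columns are invented.

```python
import pandas as pd


def encode_shared(train_col, test_col):
    """Encode two columns with one shared, sorted category list."""
    categories = sorted(set(train_col) | set(test_col))
    train_codes = pd.Categorical(train_col, categories=categories).codes
    test_codes = pd.Categorical(test_col, categories=categories).codes
    return train_codes, test_codes, categories


train_type = pd.Series(["Penalty", "Open Play", "Corner"])
test_type = pd.Series(["Open Play", "Penalty"])
train_codes, test_codes, categories = encode_shared(train_type, test_type)

# The label -> code table is now the same for both columns
print(dict(zip(categories, range(len(categories)))))
```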

In [76]:
df.head()

Unnamed: 0,icon,shotQuality,defPressure,numDefPlayers,numAttPlayers,outcome,chanceRating,type
0,1.0,3,0,1,0,1,1.0,12
1,1.0,3,0,1,0,1,1.0,12
2,1.0,4,0,1,0,1,1.0,12
3,1.0,3,0,1,0,1,1.0,12
4,1.0,3,0,1,0,1,1.0,12


In [77]:
df.to_csv('file_name.csv', sep=',')
test.to_csv('test.csv', sep=',')

In [78]:
from numpy import genfromtxt
my_data = genfromtxt('file_name.csv', delimiter=',')
testdata = genfromtxt('test.csv', delimiter=',')

In [79]:
my_data

array([[      nan,       nan,       nan, ...,       nan,       nan,
              nan],
       [0.000e+00, 1.000e+00, 3.000e+00, ..., 1.000e+00, 1.000e+00,
        1.200e+01],
       [1.000e+00, 1.000e+00, 3.000e+00, ..., 1.000e+00, 1.000e+00,
        1.200e+01],
       ...,
       [5.843e+03, 1.000e+00, 3.000e+00, ..., 1.000e+00, 8.300e-01,
        9.000e+00],
       [5.844e+03, 1.000e+00, 4.000e+00, ..., 1.000e+00, 5.000e-02,
        9.000e+00],
       [5.845e+03, 0.000e+00, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        1.300e+01]])

In [80]:
testdata

array([[      nan,       nan,       nan, ...,       nan,       nan,
              nan],
       [0.000e+00, 1.000e+00, 3.000e+00, ..., 1.000e+00, 4.300e-01,
        4.000e+00],
       [1.000e+00, 1.000e+00, 3.000e+00, ..., 1.000e+00, 4.300e-01,
        4.000e+00],
       ...,
       [6.724e+03, 1.000e+00, 3.000e+00, ..., 1.000e+00, 2.200e-01,
        4.000e+00],
       [6.725e+03, 1.000e+00, 4.000e+00, ..., 1.000e+00, 8.000e-02,
        4.000e+00],
       [6.726e+03, 1.000e+00, 4.000e+00, ..., 1.000e+00, 8.000e-02,
        4.000e+00]])

In [81]:
df2 = df.outcome
testdata = test.outcome

In [82]:
target = df2.values
testtarget = testdata
target

array([1, 1, 1, ..., 1, 1, 0])

In [83]:
df.columns

Index(['icon', 'shotQuality', 'defPressure', 'numDefPlayers', 'numAttPlayers',
       'outcome', 'chanceRating', 'type'],
      dtype='object')

In [84]:
df1 = df[['icon', 'shotQuality', 'defPressure', 'numDefPlayers', 'numAttPlayers',
       'chanceRating', 'type']]
test1 = test[['icon', 'shotQuality', 'defPressure', 'numDefPlayers', 'numAttPlayers',
       'chanceRating', 'type']]

In [85]:
df1.chanceRating.unique()

array([1.  , 0.02, 0.05, 0.22, 0.08, 0.43, 0.83, 0.  ])

In [86]:
df1['numDefPlayers'] = df1['numDefPlayers'].replace('-', '0')
test1['numDefPlayers'] = test1['numDefPlayers'].replace('-', '0')



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy






In [87]:
df1['numAttPlayers'] = df1['numAttPlayers'].replace('-', '0')
test1['numAttPlayers'] = test1['numAttPlayers'].replace('-', '0')



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy






In [88]:
df1['defPressure'] = df1['defPressure'].replace('-', '0')
test1['defPressure'] = test1['defPressure'].replace('-', '0')



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy






In [89]:
df1.columns

Index(['icon', 'shotQuality', 'defPressure', 'numDefPlayers', 'numAttPlayers',
       'chanceRating', 'type'],
      dtype='object')

In [90]:
df1.numAttPlayers.value_counts()

0    60639
1    16031
2     4505
3     1287
4      278
5       54
6       12
7        7
Name: numAttPlayers, dtype: int64

In [91]:
df[['icon', 'shotQuality', 'outcome', 'chanceRating', 'type']].corr() # simple correlation results

Unnamed: 0,icon,outcome,chanceRating,type
icon,1.0,0.941334,0.703509,0.339573
outcome,0.941334,1.0,0.527482,0.367738
chanceRating,0.703509,0.527482,1.0,0.239571
type,0.339573,0.367738,0.239571,1.0


In [92]:
#####################################################################################################
# STEP 5: Create target, data, feature names as numpy array

# Getting data ready to apply machine learning algorithms


In [93]:
test = test1.values
test

array([[1.0, '3', '2', ..., '0', 0.43, 4],
       [1.0, '3', '3', ..., '0', 0.43, 4],
       [0.08, '2', '3', ..., '0', 0.08, 4],
       ...,
       [1.0, '3', '4', ..., '0', 0.22, 4],
       [1.0, '4', '4', ..., '1', 0.08, 4],
       [1.0, '4', '0', ..., '0', 0.08, 4]], dtype=object)

In [94]:
data = df1.values
data

array([[1.0, '3', '0', ..., '0', 1.0, 12],
       [1.0, '3', '0', ..., '0', 1.0, 12],
       [1.0, '4', '0', ..., '0', 1.0, 12],
       ...,
       [1.0, '3', '0', ..., '0', 0.83, 9],
       [1.0, '4', '2', ..., '1', 0.05, 9],
       [0.0, '3', '0', ..., '0', 0.0, 13]], dtype=object)

In [95]:
features = df1.columns.values
features

array(['icon', 'shotQuality', 'defPressure', 'numDefPlayers',
       'numAttPlayers', 'chanceRating', 'type'], dtype=object)

In [96]:
#####################################################################################################
# STEP 6: Machine Learning for xG

# Using Lasso and random forest for now; validate using the ROC AUC score


In [97]:
df1.icon.value_counts()

0.02    27093
0.05    18961
0.08    13031
1.00    10314
0.22     7992
0.43     5017
0.00      262
0.83      143
Name: icon, dtype: int64

In [98]:
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import roc_auc_score

# Standardise the features (object columns are converted to float64)
scaler = StandardScaler()
X = scaler.fit_transform(data)
Y = target
names = features

lasso = Lasso(alpha=.3)
lasso.fit(X, Y)

rf = RandomForestRegressor(n_estimators=20, max_depth=4)
rf.fit(X, Y)

# Score each feature on its own: cross-validated R^2 of a forest trained on that single column
scores = []
for i in range(X.shape[1]):
    score = cross_val_score(rf, X[:, i:i+1], Y, scoring="r2",
                            cv=ShuffleSplit(n_splits=3, test_size=.3))
    scores.append((round(np.mean(score), 3), names[i]))
print(sorted(scores, reverse=True))

# A helper method for pretty-printing linear models
def pretty_print_linear(coefs, names=None, sort=False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                      for coef, name in lst)

#print(scores)


Data with input dtype object was converted to float64 by StandardScaler.



[(1.0, 'icon'), (0.626, 'type'), (0.305, 'shotQuality'), (0.278, 'chanceRating'), (0.099, 'numDefPlayers'), (0.013, 'numAttPlayers'), (0.013, 'defPressure')]


In [99]:
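# The model was fitted on standardised features, so apply the same scaler to the test set
xg = lasso.predict(scaler.transform(test))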
roc_auc_score(testtarget, xg)

1.0
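
A score of exactly 1.0 is worth checking against the encoding. 'icon' was mapped to 1 for every goal and to a value below 1 otherwise, while 'outcome' was mapped to 1 for the '-' rows, and the counts above show 10,314 rows in each group. This suggests that 'icon' already encodes the target, so any model that sees it can rank goals perfectly. The sketch below reproduces the effect on invented data with a rank-based AUC written in plain pandas; `rank_auc` is a hypothetical helper, not a scikit-learn function.

```python
import pandas as pd


def rank_auc(y_true, scores):
    """ROC AUC via the Mann-Whitney U statistic (ties receive average ranks)."""
    y = pd.Series(list(y_true)).astype(int)
    ranks = pd.Series(list(scores)).rank(method="average")
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Invented rows: 'icon' is 1.0 exactly when outcome is 1, as in the encoding above
toy = pd.DataFrame({
    "icon": [1.0, 0.02, 0.43, 1.0, 0.05, 0.83],
    "shotQuality": [3, 2, 3, 4, 1, 2],
    "outcome": [1, 0, 0, 1, 0, 0],
})

auc_icon = rank_auc(toy["outcome"], toy["icon"])            # the leaking feature alone
auc_quality = rank_auc(toy["outcome"], toy["shotQuality"])  # a feature that does not encode the target
print(auc_icon, auc_quality)
```

Refitting without 'icon' would likely give a more honest estimate of how well the remaining attributes predict goals.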

In [98]:
##################################################################### END OF XG MODEL, FOR NOW :P ##############################################################