# Prep the data for Machine Learning

## 1. Import compiled data
## 2. Simplify Events
## 3. Build New Dataframe
### 3.1 Functions that will help in our work
### 3.2 Populate our new dataframe with headers for each column
### 3.3 Create lists for each of our relevant unique values
### 3.4 Create the dataframe!
### 3.5 Clean-up the data

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

## 1. Import compiled data

In [2]:
soccer = pd.read_csv('soccer.csv')

In [3]:
soccer.head()

Unnamed: 0,gameweek,winner,competitionId,eventId,subEventName,eventName,teamId,matchPeriod,eventSec,subEventId,lastName,role,x,y,positionGrid
0,38.0,Barcelona,Spanish first division,8.0,Simple pass,Pass,Real Sociedad,1H,1.005442,85,Jiménez López,FW,58.8,40.8,6.0
1,38.0,Barcelona,Spanish first division,8.0,Simple pass,Pass,Real Sociedad,1H,26.00929,85,Jiménez López,FW,86.4,67.2,11.0
2,38.0,Barcelona,Spanish first division,8.0,Simple pass,Pass,Real Sociedad,1H,97.700752,85,Jiménez López,FW,67.2,59.2,10.0
3,38.0,Barcelona,Spanish first division,8.0,Simple pass,Pass,Real Sociedad,1H,132.889252,85,Jiménez López,FW,68.4,66.4,11.0
4,38.0,Barcelona,Spanish first division,8.0,Simple pass,Pass,Real Sociedad,1H,265.013504,85,Jiménez López,FW,98.4,75.2,15.0


In [4]:
soccer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3440476 entries, 0 to 3440475
Data columns (total 15 columns):
 #   Column         Dtype  
---  ------         -----  
 0   gameweek       float64
 1   winner         object 
 2   competitionId  object 
 3   eventId        float64
 4   subEventName   object 
 5   eventName      object 
 6   teamId         object 
 7   matchPeriod    object 
 8   eventSec       float64
 9   subEventId     int64  
 10  lastName       object 
 11  role           object 
 12  x              float64
 13  y              float64
 14  positionGrid   float64
dtypes: float64(6), int64(1), object(8)
memory usage: 393.7+ MB


In [5]:
# convert the necessary float values in the dataframe to integers
floats = ['positionGrid','eventId','gameweek']

for f in floats:
    soccer[f] = soccer[f].astype('int64')


## 2. Simplify Events
Not all events are equally useful: some are redundant, and we want to simplify them where necessary. The various categories of foul are a good example; they are better treated as the single high-level event than as separate sub-events. The different kinds of passes, on the other hand, should stay separate because they can have a bigger impact on the flow of the game.

In [6]:
soccer.eventName.unique()

array(['Pass', 'Others on the ball', 'Duel', 'Foul', 'Offside', 'Shot',
       'Free Kick', 'Save attempt', 'Goalkeeper leaving line',
       'Interruption'], dtype=object)

In [7]:
soccer.subEventName.unique()

array(['Simple pass', 'Touch', 'Ground loose ball duel', 'Foul',
       'Ground attacking duel', 'Ground defending duel', 'Offside',
       'Air duel', 'Smart pass', 'Shot', 'Acceleration',
       'Out of game foul', 'Throw in', 'Head pass', 'Cross', 'High pass',
       'Hand foul', 'Launch', 'Clearance', 'Protest', 'Free Kick',
       'Late card foul', 'Corner', 'Free kick shot', 'Free kick cross',
       'Time lost foul', 'Penalty', 'Simulation', 'Goal kick',
       'Save attempt', 'Goalkeeper leaving line', 'Hand pass', 'Reflexes',
       'Violent Foul', 'Ball out of the field', 'Whistle'], dtype=object)

We have 36 unique sub-events. I believe we can drop all records whose sub-event is 'Throw in', 'Launch', 'Protest', 'Simulation', 'Hand pass', 'Ball out of the field', 'Whistle', 'Goalkeeper leaving line', 'Penalty', or 'Save attempt'. We are also going to drop every record belonging to the 'Offside' and 'Foul' events.

In [8]:
soccer.shape

(3440476, 15)

In [9]:
# Drop the records with the specified sub-events
drops = ['Throw in', 'Launch', 'Protest', 'Simulation','Hand pass',
         'Ball out of the field','Whistle','Goalkeeper leaving line','Penalty','Save attempt']

for i in drops:
    soccer.drop(soccer[soccer['subEventName'] == i].index, inplace = True)

soccer.shape

(3260814, 15)

In [10]:
soccer.eventName.unique()

array(['Pass', 'Others on the ball', 'Duel', 'Foul', 'Offside', 'Shot',
       'Free Kick', 'Save attempt'], dtype=object)

In [11]:
# Drop the records that belong to the Foul and Offside events
drop = ['Foul', 'Offside']

for i in drop:
    soccer.drop(soccer[soccer['eventName'] == i].index, inplace = True)
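
The two drop loops above can also be written as a single boolean mask with `Series.isin`, which scans the table once instead of once per label. This is only a sketch on a made-up miniature of the events table (the `toy` rows are invented for illustration; the column names and drop lists are the ones used in this notebook):

```python
import pandas as pd

# Hypothetical miniature of the events table: only the two columns used for filtering.
toy = pd.DataFrame({
    "eventName":    ["Pass", "Foul", "Pass", "Offside", "Free Kick", "Duel"],
    "subEventName": ["Simple pass", "Hand foul", "Launch", "Offside", "Throw in", "Air duel"],
})

drop_sub_events = ["Throw in", "Launch", "Protest", "Simulation", "Hand pass",
                   "Ball out of the field", "Whistle", "Goalkeeper leaving line",
                   "Penalty", "Save attempt"]
drop_events = ["Foul", "Offside"]

# Keep a row only if neither its sub-event nor its event is on a drop list.
keep = (~toy["subEventName"].isin(drop_sub_events)) & (~toy["eventName"].isin(drop_events))
filtered = toy[keep]

print(filtered)
```

As with the loops, the surviving rows keep their original index labels.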

In [12]:
unique = soccer.subEventName.unique()
len(unique)

19

In [13]:
ids = soccer.subEventId.unique()
len(ids)

19

In [14]:
ids

array([ 85,  72,  13,  11,  12,  10,  86, 100,  70,  82,  80,  83,  71,
        31,  30,  33,  32,  34,  90], dtype=int64)

In [15]:
# convert positionGrid to categories
soccer['positionGrid'] = soccer['positionGrid'].astype('category')

## 3. Build New Dataframe
We have to build a brand new dataframe where every game for a team is its own record.

### 3.1 Functions that will help in our work

#### team_df
This creates a dataframe for a single game by team and week. It is this dataframe that we use to build out a single record for the individual games.
#### get_unique
Used to return a list of unique values based on the series that is input.
#### column_name
Names the columns used in our new dataframe by concatenating the grid location on the field with the event name.
#### team_win
Used to populate the team, win, and gameweek columns in the new dataframe.
#### counts
This function is called from the populate function to calculate the count for a specific event within a specific grid location.
#### populate
Used to populate values within our record for each grid/event pair. It is passed a list of the grid locations and a list of the unique events. It then iterates through them and leverages the counts function to input a value for the number of times that event happens in the specified grid. The result is a dataframe containing a single row with all of the data for the specified game, which can later be appended to the final dataframe.

In [16]:
def team_df(name, week):
    team = soccer[soccer.teamId == name]
    
    # now a dataframe containing just one game
    oneGame = team[team.gameweek == week]
       
    return oneGame

In [17]:
def get_unique(field):
    values = field.unique()
    return values

In [18]:
def column_name(name1,name2):
    grid = str(name1)
    i = 'grid'+grid+name2
    return i

In [19]:
def team_win(team, win, week):
    game_temp['team'] = get_unique(team)
    game_temp['win'] = get_unique(win) 
    game_temp['gameWeek'] = get_unique(week)

In [20]:
def counts(pos,event):
    # number of events of this type that the team made in this grid location
    l = len(temp[(temp['positionGrid'] == pos) & (temp['subEventName'] == event)])
    return l

In [21]:
def populate(location,action):
    for l in location:
        for a in action:
            field = column_name(l,a)
            game_temp[field] = counts(l,a)

### 3.2 Populate our new dataframe with headers for each column

In [22]:
grid = get_unique(soccer.positionGrid)
events = get_unique(soccer.subEventName)

In [23]:
columns = ['team','gameWeek','win']

for g in grid:
    for e in events:
        i = column_name(g,e)
        columns.append(i)

In [24]:
games = pd.DataFrame(columns = columns)

### 3.3 Create lists for each of our relevant unique values
These are the values that let us pull the data for a specific game; our key fields are team and gameweek.

In [25]:
teams = get_unique(soccer.teamId)
gameweek = get_unique(soccer.gameweek)

In [26]:
len(gameweek)

38

In [27]:
len(teams)

98

In [28]:
teams

array(['Real Sociedad', 'Barcelona', 'Liverpool', 'Atlético Madrid',
       'Eibar', 'Las Palmas', 'Athletic Club', 'Espanyol',
       'Olympique Lyonnais', 'Deportivo La Coruña', 'Sevilla', 'Valencia',
       'PSG', 'Real Madrid', 'Villarreal', 'Levante', 'Deportivo Alavés',
       'Stoke City', 'Swansea City', 'Everton', 'Málaga', 'Getafe',
       'Girona', 'Real Betis', 'Borussia Dortmund', 'Leganés', 'Watford',
       'Celta de Vigo', 'Hellas Verona', 'Torino', 'Schalke 04',
       'Fiorentina', 'Mainz 05', 'Roma', 'Crotone', 'Arsenal', 'Sassuolo',
       'Manchester City', 'Benevento', 'Bordeaux', 'Saint-Étienne',
       'Internazionale', 'Lazio', 'Cagliari', 'Atalanta', 'Chievo',
       'Sampdoria', 'Napoli', 'Nantes', 'Udinese', 'Bologna', 'Genoa',
       'Milan', 'SPAL', 'Juventus', 'Newcastle United', 'Nice', 'Monaco',
       'West Ham United', 'Chelsea', 'Bayern München', 'Stuttgart',
       'Augsburg', 'Hoffenheim', 'RB Leipzig', 'Bayer Leverkusen',
       'Hertha BSC', 'Fre

### 3.4 Create the dataframe!

In [29]:
for t in teams:
    for w in gameweek:
        # build each game's dataframe
        temp = team_df(t,w)
        
        #temp dataframe to hold our record in
        game_temp = pd.DataFrame(columns = columns)
        
        # populate the record with the team name, winner, and gameweek
        team_win(temp.teamId,temp.winner,temp.gameweek)
        
        # populate the remaining fields using our populate function
        populate(grid, events)
        
        # append the record to the final dataframe
        games = pd.concat([games, game_temp], ignore_index=True)
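
The nested loop above filters the full events table once per team/week pair and then once more per grid/event pair, which is slow on 3 million rows. As a rough sketch, the same counts can be produced in one pass with `groupby(...).size()` followed by `unstack`. The `toy` rows below are invented, and the sketch assumes (as the loop does) that team plus gameweek identifies a single game:

```python
import pandas as pd

# Hypothetical miniature of the events table with the columns the loop relies on.
toy = pd.DataFrame({
    "teamId":       ["A", "A", "A", "B", "B"],
    "gameweek":     [1, 1, 1, 1, 1],
    "winner":       ["A", "A", "A", "A", "A"],
    "positionGrid": [6, 6, 11, 6, 11],
    "subEventName": ["Simple pass", "Simple pass", "Shot", "Shot", "Shot"],
})

# Count events per (team, gameweek, winner, grid, sub-event) in a single pass,
# then pivot the (grid, sub-event) pairs into columns.
pair_counts = (
    toy.groupby(["teamId", "gameweek", "winner", "positionGrid", "subEventName"])
       .size()
       .unstack(["positionGrid", "subEventName"], fill_value=0)
)

# Flatten the (grid, event) column pairs into names like "grid6Simple pass".
pair_counts.columns = [f"grid{g}{e}" for g, e in pair_counts.columns]

wide = pair_counts.reset_index().rename(
    columns={"teamId": "team", "gameweek": "gameWeek", "winner": "win"}
)
print(wide)
```

One difference from the loop: this only creates columns for grid/event pairs that actually occur, so columns that would hold nothing but zeros never appear.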
        

In [30]:
games.tail()

Unnamed: 0,team,gameWeek,win,grid6Simple pass,grid6Touch,grid6Ground loose ball duel,grid6Ground attacking duel,grid6Ground defending duel,grid6Air duel,grid6Smart pass,...,grid3Head pass,grid3Cross,grid3High pass,grid3Clearance,grid3Free Kick,grid3Corner,grid3Free kick shot,grid3Free kick cross,grid3Goal kick,grid3Reflexes
3647,Brighton & Hove Albion,27,0,46,14,2,2,20,14,0,...,6,0,4,6,0,0,0,0,0,0
3648,Brighton & Hove Albion,16,Huddersfield Town,66,2,4,4,2,12,0,...,2,0,6,2,2,0,0,0,0,0
3649,Brighton & Hove Albion,11,Brighton & Hove Albion,52,2,4,0,6,2,0,...,0,0,8,0,0,0,0,0,0,0
3650,Brighton & Hove Albion,10,0,102,4,10,4,12,2,0,...,0,0,4,2,0,0,0,0,0,0
3651,Brighton & Hove Albion,14,0,72,14,2,4,14,10,0,...,0,0,2,0,0,0,0,0,0,0


In [31]:
games.describe()

Unnamed: 0,team,gameWeek,win,grid6Simple pass,grid6Touch,grid6Ground loose ball duel,grid6Ground attacking duel,grid6Ground defending duel,grid6Air duel,grid6Smart pass,...,grid3Head pass,grid3Cross,grid3High pass,grid3Clearance,grid3Free Kick,grid3Corner,grid3Free kick shot,grid3Free kick cross,grid3Goal kick,grid3Reflexes
count,3652,3652,3652,3652,3652,3652,3652,3652,3652,3652,...,3652,3652,3652,3652,3652,3652,3652,3652,3652,3652
unique,98,38,99,143,23,19,18,28,25,6,...,11,3,16,13,7,2,1,2,1,1
top,PSG,19,0,30,2,2,2,4,2,0,...,0,0,2,0,0,0,0,0,0,0
freq,38,98,908,103,748,804,746,560,599,2921,...,1989,3648,878,1434,2244,3650,3652,3651,3652,3652


### 3.5 Clean-up the data

In [32]:
fields = columns[3:]

for f in fields:
    games[f] = games[f].astype('int64')

At this point the win column holds the name of the winning team, or '0' for a draw. We need to recode it as 0 = loss, 1 = draw, 2 = win.

In [33]:
games.loc[(games['win'] == games['team']), 'win'] = 2
games.loc[(games['win'] == '0'), 'win'] = 1
games.loc[(games['win'] != 1) & (games['win'] != 2), 'win'] = 0
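
The three `.loc` assignments above depend on the order in which they run, because each one overwrites values that the next one tests. A sketch of the same recoding with `numpy.select`, which evaluates all conditions against the original column, is shown here on invented rows. It assumes, as cell 33 does, that a draw is stored as the string '0', and it writes to a hypothetical `result` column rather than overwriting `win`:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one win, one draw, one loss for the same team.
toy = pd.DataFrame({
    "team": ["Brighton & Hove Albion"] * 3,
    "win":  ["Brighton & Hove Albion", "0", "Huddersfield Town"],
})

conditions = [
    toy["win"] == toy["team"],  # the team itself is listed as winner
    toy["win"] == "0",          # draws are stored as the string '0'
]
# 2 = win, 1 = draw, anything else (another team won) = 0 = loss
toy["result"] = np.select(conditions, [2, 1], default=0)

print(toy)
```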

In [34]:
games.tail()

Unnamed: 0,team,gameWeek,win,grid6Simple pass,grid6Touch,grid6Ground loose ball duel,grid6Ground attacking duel,grid6Ground defending duel,grid6Air duel,grid6Smart pass,...,grid3Head pass,grid3Cross,grid3High pass,grid3Clearance,grid3Free Kick,grid3Corner,grid3Free kick shot,grid3Free kick cross,grid3Goal kick,grid3Reflexes
3647,Brighton & Hove Albion,27,1,46,14,2,2,20,14,0,...,6,0,4,6,0,0,0,0,0,0
3648,Brighton & Hove Albion,16,0,66,2,4,4,2,12,0,...,2,0,6,2,2,0,0,0,0,0
3649,Brighton & Hove Albion,11,2,52,2,4,0,6,2,0,...,0,0,8,0,0,0,0,0,0,0
3650,Brighton & Hove Albion,10,1,102,4,10,4,12,2,0,...,0,0,4,2,0,0,0,0,0,0
3651,Brighton & Hove Albion,14,1,72,14,2,4,14,10,0,...,0,0,2,0,0,0,0,0,0,0


Knowledge of how the game works tells us that some events will never happen in certain grids. Corner kicks, for example, can only be taken from 4 of the 16 grids, which leaves 12 corner-kick columns containing only zeros. This is not the only such case, so we search for the columns that contain only zero values and remove them from the dataframe, as they add nothing to our analysis.

In [35]:
# display the columns with only zero values
# code retrieved from: https://stackoverflow.com/questions/16486762/python-pandas-select-columns-with-all-zero-entries-in-dataframe
games.loc[:, (games == 0).all()]

Unnamed: 0,grid6Cross,grid6Corner,grid6Goal kick,grid6Reflexes,grid11Corner,grid11Goal kick,grid11Reflexes,grid10Corner,grid10Goal kick,grid10Reflexes,...,grid1Free kick cross,grid1Goal kick,grid1Reflexes,grid2Corner,grid2Free kick cross,grid2Goal kick,grid2Reflexes,grid3Free kick shot,grid3Goal kick,grid3Reflexes
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3647,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3648,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3649,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3650,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [36]:
# remove all columns that contain only zero values
# code retrieved from: https://stackoverflow.com/questions/22649693/drop-rows-with-all-zeros-in-pandas-data-frame/22650075
games = games.loc[:, (games!=0).any(axis=0)]
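
To make the column filter easier to follow, here is a small sketch on an invented four-column frame. It builds one boolean per column, lists the all-zero columns by name, and then keeps the rest (the `toy` values and the names `all_zero`, `zero_cols`, and `trimmed` are illustrative only):

```python
import pandas as pd

# Hypothetical two-game frame: one column is zero everywhere, one only in the first row.
toy = pd.DataFrame({
    "team": ["A", "B"],
    "grid6Simple pass": [46, 66],
    "grid6Corner": [0, 0],
    "grid3Cross": [0, 4],
})

all_zero = (toy == 0).all()                    # one True/False per column
zero_cols = all_zero[all_zero].index.tolist()  # names of the all-zero columns
trimmed = toy.loc[:, ~all_zero]                # keep every other column

print(zero_cols)
print(trimmed.columns.tolist())
```

A column such as `grid3Cross`, which is zero in some games but not all, is kept.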

In [37]:
games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3652 entries, 0 to 3651
Columns: 260 entries, team to grid3Free kick cross
dtypes: int64(257), object(3)
memory usage: 7.2+ MB


In [38]:
games.loc[:, (games == 0).all()]

0
1
2
3
4
...
3647
3648
3649
3650
3651


In [39]:
games.describe()

Unnamed: 0,grid6Simple pass,grid6Touch,grid6Ground loose ball duel,grid6Ground attacking duel,grid6Ground defending duel,grid6Air duel,grid6Smart pass,grid6Shot,grid6Acceleration,grid6Head pass,...,grid3Smart pass,grid3Shot,grid3Acceleration,grid3Head pass,grid3Cross,grid3High pass,grid3Clearance,grid3Free Kick,grid3Corner,grid3Free kick cross
count,3652.0,3652.0,3652.0,3652.0,3652.0,3652.0,3652.0,3652.0,3652.0,3652.0,...,3652.0,3652.0,3652.0,3652.0,3652.0,3652.0,3652.0,3652.0,3652.0,3652.0
mean,44.781216,3.914841,3.156079,3.592004,5.662103,4.156627,0.280668,0.00931,0.758488,3.853231,...,0.016703,0.000274,0.203998,0.839266,0.001917,2.487678,1.345564,0.588992,0.000548,0.000274
std,26.914496,3.298303,2.761051,2.807144,4.107466,3.866663,0.645935,0.109384,1.11647,3.337056,...,0.165483,0.016548,0.537557,1.268052,0.072114,2.333731,1.647845,0.897984,0.023399,0.016548
min,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,27.0,2.0,1.0,2.0,3.0,2.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
50%,38.0,3.0,2.0,3.0,5.0,3.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0
75%,55.0,5.0,4.0,5.0,7.0,6.0,0.0,0.0,1.0,5.0,...,0.0,0.0,0.0,1.0,0.0,4.0,2.0,1.0,0.0,0.0
max,276.0,28.0,26.0,24.0,48.0,32.0,8.0,2.0,8.0,24.0,...,4.0,1.0,4.0,12.0,4.0,18.0,14.0,6.0,1.0,1.0


Everything looks good. The dataframe is exported to a CSV file so that I can use it in separate analyses.

In [40]:
games.to_csv('games.csv', index=False)