## Data Preprocessing

This notebook illustrates the steps required to preprocess Wyscout football match event data. Specifically, data is extracted and preprocessed for the first-division leagues of Italy, England, Spain, France and Germany.

### Libraries

In [1]:
# import required libraries
import os
import ast
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mplsoccer import Pitch, VerticalPitch

### Import Wyscout Public Match Event Dataset

The Wyscout Public Match Event Dataset is available through the `PublicWyscoutLoader` class in the `socceraction` library.

In [2]:
# import wyscout public match event data loader from socceraction library
from socceraction.data.wyscout import PublicWyscoutLoader 

# load public wyscout data
wyscout_data = PublicWyscoutLoader()

In [3]:
# view all available competitions
wyscout_data.competitions()

Unnamed: 0,competition_id,season_id,country_name,competition_name,competition_gender,season_name
0,524,181248,Italy,Italian first division,male,2017/2018
1,364,181150,England,English first division,male,2017/2018
2,795,181144,Spain,Spanish first division,male,2017/2018
3,412,181189,France,French first division,male,2017/2018
4,426,181137,Germany,German first division,male,2017/2018
5,102,9291,International,European Championship,male,2016
6,28,10078,International,World Cup,male,2018


Retrieve event data for all matches in Italy, England, Spain, France and Germany.

### EPL Data Preprocessing

#### Download Wyscout data for EPL

Wyscout event data for each match is downloaded, aggregated and saved into a single `events.csv` file. 

In [None]:
# England 17/18, competition_id = 364, season_id = 181150
#epl_games = wyscout_data.games(competition_id = 364, season_id = 181150)["game_id"]

# download all premier league matches and save as a single .csv
#events = pd.DataFrame()
#for i in epl_games:
#    df = wyscout_data.events(i)
#    events = pd.concat([events, df])

# save as .csv file
#events.to_csv('data/epl/events.csv', index = False) 
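
The loop above calls `pd.concat` on every iteration, which recopies the accumulated frame each time. A minimal sketch of an alternative collects the per-match frames in a list and concatenates once. `StubLoader` and `collect_events` are hypothetical names introduced here so the block runs offline; the stub only mimics a loader whose `events(game_id)` method returns a DataFrame.

```python
import pandas as pd


class StubLoader:
    """Hypothetical stand-in for the real loader so the sketch runs offline."""

    def events(self, game_id):
        # two fake events per match
        return pd.DataFrame({
            "game_id": [game_id, game_id],
            "event_id": [game_id * 10, game_id * 10 + 1],
        })


def collect_events(loader, game_ids):
    """Fetch the events of each match and concatenate them in one step."""
    frames = [loader.events(game_id) for game_id in game_ids]
    return pd.concat(frames, ignore_index=True)


events_demo = collect_events(StubLoader(), [1, 2, 3])
print(events_demo.shape)  # (6, 2)
```

With the real loader, the same function could presumably be called as `collect_events(wyscout_data, epl_games)` before saving the result with `to_csv`.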

In [38]:
events.head()

Unnamed: 0,event_id,game_id,period_id,milliseconds,team_id,player_id,type_id,type_name,subtype_id,subtype_name,positions,tags
0,177959171,2499719,1,2758.649,1609,25413,8,Pass,85,Simple pass,"[{'y': 49, 'x': 49}, {'y': 78, 'x': 31}]",[{'id': 1801}]
1,177959172,2499719,1,4946.85,1609,370224,8,Pass,83,High pass,"[{'y': 78, 'x': 31}, {'y': 75, 'x': 51}]",[{'id': 1801}]
2,177959173,2499719,1,6542.188,1609,3319,8,Pass,82,Head pass,"[{'y': 75, 'x': 51}, {'y': 71, 'x': 35}]",[{'id': 1801}]
3,177959174,2499719,1,8143.395,1609,120339,8,Pass,82,Head pass,"[{'y': 71, 'x': 35}, {'y': 95, 'x': 41}]",[{'id': 1801}]
4,177959175,2499719,1,10302.366,1609,167145,8,Pass,85,Simple pass,"[{'y': 95, 'x': 41}, {'y': 88, 'x': 72}]",[{'id': 1801}]


In [39]:
events.shape

(643150, 12)

In total, there are 643,150 events across 380 EPL matches.

#### Reformatting the raw events data

In [4]:
# load the 'events' dataframe, which includes all events from all 380 EPL games
df = pd.read_csv('data/epl/events.csv')

In [5]:
# view columns
df.columns

Index(['event_id', 'game_id', 'period_id', 'milliseconds', 'team_id',
       'player_id', 'type_id', 'type_name', 'subtype_id', 'subtype_name',
       'positions', 'tags'],
      dtype='object')

In [6]:
# remove columns 'event_id' and 'milliseconds'
rm_col_ind = np.r_[0, 3]
df = df.drop(columns = df.columns[rm_col_ind])

In [7]:
df.head()

Unnamed: 0,game_id,period_id,team_id,player_id,type_id,type_name,subtype_id,subtype_name,positions,tags
0,2499719,1,1609,25413,8,Pass,85,Simple pass,"[{'y': 49, 'x': 49}, {'y': 78, 'x': 31}]",[{'id': 1801}]
1,2499719,1,1609,370224,8,Pass,83,High pass,"[{'y': 78, 'x': 31}, {'y': 75, 'x': 51}]",[{'id': 1801}]
2,2499719,1,1609,3319,8,Pass,82,Head pass,"[{'y': 75, 'x': 51}, {'y': 71, 'x': 35}]",[{'id': 1801}]
3,2499719,1,1609,120339,8,Pass,82,Head pass,"[{'y': 71, 'x': 35}, {'y': 95, 'x': 41}]",[{'id': 1801}]
4,2499719,1,1609,167145,8,Pass,85,Simple pass,"[{'y': 95, 'x': 41}, {'y': 88, 'x': 72}]",[{'id': 1801}]


The `positions` and `tags` columns hold lists, but once saved to a `.csv` file they are read back as strings. To work with these columns, the `literal_eval` function from the `ast` library is used to parse the strings back into Python lists.
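
To see why the conversion is needed, the following sketch round-trips a single `positions` value through an in-memory CSV. The frame and variable names are illustrative only.

```python
import ast
import io

import pandas as pd

# a one-row frame whose 'positions' cell holds a list of dicts
original = pd.DataFrame({"positions": [[{"y": 49, "x": 49}, {"y": 78, "x": 31}]]})

# write to an in-memory CSV and read it back
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# after the round trip the cell is a plain string
cell_type_before = type(reloaded.loc[0, "positions"])

# literal_eval parses the string back into Python objects without executing code
reloaded["positions"] = reloaded["positions"].apply(ast.literal_eval)
cell_type_after = type(reloaded.loc[0, "positions"])

print(cell_type_before.__name__, "->", cell_type_after.__name__)  # str -> list
```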

In [8]:
# convert strings into python lists
df['tags'] = df['tags'].apply(ast.literal_eval)
df['positions'] = df['positions'].apply(ast.literal_eval)

# make 'type_name' and 'subtype_name' columns lowercase 
df['type_name'] = df['type_name'].str.lower()
df['subtype_name'] = df['subtype_name'].str.lower()

# create separate initial(start) and final(end) coordinates from 'positions' column
# if action has only 'start' coordinates set 'end' coordinates to 'nan'
df['x_start'] = df['positions'].apply(lambda x: x[0]['x'])
df['y_start'] = df['positions'].apply(lambda x: x[0]['y'])
df['x_end'] = df['positions'].apply(lambda x: x[1]['x'] if len(x) == 2 else np.nan)
df['y_end'] = df['positions'].apply(lambda x: x[1]['y'] if len(x) == 2  else np.nan)

In [9]:
df.head()

Unnamed: 0,game_id,period_id,team_id,player_id,type_id,type_name,subtype_id,subtype_name,positions,tags,x_start,y_start,x_end,y_end
0,2499719,1,1609,25413,8,pass,85,simple pass,"[{'y': 49, 'x': 49}, {'y': 78, 'x': 31}]",[{'id': 1801}],49,49,31.0,78.0
1,2499719,1,1609,370224,8,pass,83,high pass,"[{'y': 78, 'x': 31}, {'y': 75, 'x': 51}]",[{'id': 1801}],31,78,51.0,75.0
2,2499719,1,1609,3319,8,pass,82,head pass,"[{'y': 75, 'x': 51}, {'y': 71, 'x': 35}]",[{'id': 1801}],51,75,35.0,71.0
3,2499719,1,1609,120339,8,pass,82,head pass,"[{'y': 71, 'x': 35}, {'y': 95, 'x': 41}]",[{'id': 1801}],35,71,41.0,95.0
4,2499719,1,1609,167145,8,pass,85,simple pass,"[{'y': 95, 'x': 41}, {'y': 88, 'x': 72}]",[{'id': 1801}],41,95,72.0,88.0


To map the tag ids in the `tags` column to their descriptions, I use a separate `wyscout_tags.csv` file. This file is not part of `PublicWyscoutLoader`; it can be downloaded from [this link](https://figshare.com/articles/dataset/Mapping_of_tag_identifiers_to_tag_names/11743818?backTo=/collections/Soccer_match_event_dataset/4415000).

In [10]:
# import tags to encode subevents in event data
tags = pd.read_csv('data/wyscout_tags.csv', sep = ';')

# make all descriptions lowercase
tags['Description'] = tags['Description'].str.lower()

# transform tags data frame into dictionary
tags = dict(zip(tags['Tag'], tags['Description']))

# use dictionaries and list comprehensions to convert tags into tag ids and their descriptions 
df['tag_id'] = df['tags'].apply(lambda x: [value for d in x for value in d.values()])
df['tag_name'] = df['tag_id'].apply(lambda x: [tags[i] for i in x])

# drop redundant columns 'positions' and 'tags'
df.drop(columns = ['positions', 'tags'], inplace = True)

In [11]:
df.head()

Unnamed: 0,game_id,period_id,team_id,player_id,type_id,type_name,subtype_id,subtype_name,x_start,y_start,x_end,y_end,tag_id,tag_name
0,2499719,1,1609,25413,8,pass,85,simple pass,49,49,31.0,78.0,[1801],[accurate]
1,2499719,1,1609,370224,8,pass,83,high pass,31,78,51.0,75.0,[1801],[accurate]
2,2499719,1,1609,3319,8,pass,82,head pass,51,75,35.0,71.0,[1801],[accurate]
3,2499719,1,1609,120339,8,pass,82,head pass,35,71,41.0,95.0,[1801],[accurate]
4,2499719,1,1609,167145,8,pass,85,simple pass,41,95,72.0,88.0,[1801],[accurate]


Each event can have multiple tags, so the tags are stored as lists.
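
Because each cell in `tag_name` is a list, filtering by tag requires a membership test rather than a plain equality comparison. A minimal sketch on a toy frame (the `demo` data is made up for illustration) selects the events tagged as goals:

```python
import pandas as pd

# toy frame mimicking the 'tag_name' column
demo = pd.DataFrame({
    "type_name": ["shot", "shot", "pass"],
    "tag_name": [
        ["goal", "right foot", "accurate"],
        ["left foot", "not accurate"],
        ["accurate"],
    ],
})

# True where the list of tags contains 'goal'
is_goal = demo["tag_name"].apply(lambda names: "goal" in names)
goals = demo[is_goal]
print(len(goals))  # 1
```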

In [12]:
# rearrange columns using only required indices and passing them into np.r_
rearr_cols = np.r_[0:8, 12, 13, 8:12]
df = df.iloc[:, rearr_cols]
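
`np.r_` turns a mix of slices and scalars into a single index array, which keeps the positional reordering compact. The sketch below prints the resulting order and, as an alternative not used in this notebook, reorders a toy frame by column name; `demo`, `coord_cols` and `other_cols` are illustrative names.

```python
import numpy as np
import pandas as pd

# np.r_ concatenates slices and scalars into one index array
order = np.r_[0:8, 12, 13, 8:12]
print(order.tolist())
# [0, 1, 2, 3, 4, 5, 6, 7, 12, 13, 8, 9, 10, 11]

# name-based alternative on a toy frame: move the coordinate columns to the end
demo = pd.DataFrame(columns=["game_id", "x_start", "y_start", "tag_id", "tag_name"])
coord_cols = ["x_start", "y_start"]
other_cols = [c for c in demo.columns if c not in coord_cols]
demo = demo[other_cols + coord_cols]
print(demo.columns.tolist())
# ['game_id', 'tag_id', 'tag_name', 'x_start', 'y_start']
```

Selecting by name is a little longer but does not silently break if a column is added or removed upstream.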

In [13]:
df.head()

Unnamed: 0,game_id,period_id,team_id,player_id,type_id,type_name,subtype_id,subtype_name,tag_id,tag_name,x_start,y_start,x_end,y_end
0,2499719,1,1609,25413,8,pass,85,simple pass,[1801],[accurate],49,49,31.0,78.0
1,2499719,1,1609,370224,8,pass,83,high pass,[1801],[accurate],31,78,51.0,75.0
2,2499719,1,1609,3319,8,pass,82,head pass,[1801],[accurate],51,75,35.0,71.0
3,2499719,1,1609,120339,8,pass,82,head pass,[1801],[accurate],35,71,41.0,95.0
4,2499719,1,1609,167145,8,pass,85,simple pass,[1801],[accurate],41,95,72.0,88.0


In [14]:
# save 'df' as 'refined_events.csv' dataframe
df.to_csv('data/epl/refined_events.csv', index = False)    

In [15]:
df.shape

(643150, 14)

#### Extract EPL shot event data 

In [16]:
# free kicks are not included (penalties are also part of free kicks)
shots = df[df['type_name'] == 'shot']

In [17]:
shots.head()

Unnamed: 0,game_id,period_id,team_id,player_id,type_id,type_name,subtype_id,subtype_name,tag_id,tag_name,x_start,y_start,x_end,y_end
46,2499719,1,1609,25413,10,shot,100,shot,"[101, 402, 201, 1205, 1801]","[goal, right foot, opportunity, position: goal...",88,41,0.0,0.0
62,2499719,1,1631,26150,10,shot,100,shot,"[401, 201, 1211, 1802]","[left foot, opportunity, position: out center ...",85,52,100.0,100.0
91,2499719,1,1631,14763,10,shot,100,shot,"[101, 403, 201, 1207, 1801]","[goal, head/body, opportunity, position: goal ...",96,52,100.0,100.0
128,2499719,1,1609,7868,10,shot,100,shot,"[401, 201, 1215, 1802]","[left foot, opportunity, position: out high le...",81,33,0.0,0.0
249,2499719,1,1609,7868,10,shot,100,shot,"[402, 201, 1205, 1801]","[right foot, opportunity, position: goal low l...",75,30,0.0,0.0


In [18]:
# save 'shots' as 'shots.csv'
shots.to_csv('data/epl/shots.csv', index = False)    
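
The filter above keeps only events whose type is `shot`, so penalties and direct free-kick shots are left out. If they were wanted, a sketch along the following lines could add them. It assumes the lowercased subtype labels `penalty` and `free kick shot` appear in `subtype_name`; check the actual values with `df['subtype_name'].unique()` before relying on it. The `demo` frame is made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "type_name": ["shot", "free kick", "free kick", "free kick", "pass"],
    "subtype_name": ["shot", "penalty", "free kick shot", "corner", "simple pass"],
})

# assumed labels; verify against the actual values in 'subtype_name'
set_piece_shots = ["penalty", "free kick shot"]

is_open_play_shot = demo["type_name"] == "shot"
is_set_piece_shot = demo["subtype_name"].isin(set_piece_shots)

# union of open-play shots and shots taken from set pieces
all_shots = demo[is_open_play_shot | is_set_piece_shot]
print(len(all_shots))  # 3
```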

### LaLiga Data Preprocessing

Since the data retrieval and preprocessing steps are the same as for the EPL data, all of the above code is combined into a single cell.
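
Because the same steps repeat for every league, they could also be wrapped in a function. The following is a minimal sketch under that assumption, not the code this notebook runs: `refine_events`, `raw_demo` and `tag_map_demo` are hypothetical names, and the two-row input is made up so the block runs without the data files.

```python
import ast

import numpy as np
import pandas as pd


def refine_events(raw, tag_map):
    """Apply the EPL preprocessing steps to a raw events frame."""
    df = raw.drop(columns=["event_id", "milliseconds"]).copy()

    # parse stringified lists back into Python objects
    df["tags"] = df["tags"].apply(ast.literal_eval)
    df["positions"] = df["positions"].apply(ast.literal_eval)

    df["type_name"] = df["type_name"].str.lower()
    df["subtype_name"] = df["subtype_name"].str.lower()

    # tag ids and their descriptions
    df["tag_id"] = df["tags"].apply(lambda tags: [t["id"] for t in tags])
    df["tag_name"] = df["tag_id"].apply(lambda ids: [tag_map[i] for i in ids])

    # start and end coordinates; end is NaN when only one position exists
    df["x_start"] = df["positions"].apply(lambda p: p[0]["x"])
    df["y_start"] = df["positions"].apply(lambda p: p[0]["y"])
    df["x_end"] = df["positions"].apply(lambda p: p[1]["x"] if len(p) == 2 else np.nan)
    df["y_end"] = df["positions"].apply(lambda p: p[1]["y"] if len(p) == 2 else np.nan)

    return df.drop(columns=["positions", "tags"])


raw_demo = pd.DataFrame({
    "event_id": [1, 2],
    "game_id": [10, 10],
    "period_id": [1, 1],
    "milliseconds": [100.0, 200.0],
    "team_id": [5, 5],
    "player_id": [7, 8],
    "type_id": [8, 10],
    "type_name": ["Pass", "Shot"],
    "subtype_id": [85, 100],
    "subtype_name": ["Simple pass", "Shot"],
    "positions": ["[{'y': 49, 'x': 49}, {'y': 78, 'x': 31}]", "[{'y': 41, 'x': 88}]"],
    "tags": ["[{'id': 1801}]", "[{'id': 101}, {'id': 1801}]"],
})
tag_map_demo = {101: "goal", 1801: "accurate"}

refined_demo = refine_events(raw_demo, tag_map_demo)
shots_demo = refined_demo[refined_demo["type_name"] == "shot"]
print(refined_demo.columns.tolist())
```

Adding the tag columns before the coordinate columns yields the final column order directly, so no positional reordering is needed in this version.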

In [None]:
# Spain 17/18, competition_id = 795, season_id = 181144
laliga_games = wyscout_data.games(competition_id = 795, season_id = 181144)["game_id"]

# download all La Liga matches and save as a single .csv
#events = pd.DataFrame()
#for i in laliga_games:
#    df = wyscout_data.events(i)
#    events = pd.concat([events, df])

# save as .csv file
#events.to_csv('data/laliga/events.csv', index = False) 

In [19]:
# load the 'events' dataframe, which includes all events from all La Liga games
df = pd.read_csv('data/laliga/events.csv')

# remove columns 'event_id' and 'milliseconds'
rm_col_ind = np.r_[0, 3]
df = df.drop(columns = df.columns[rm_col_ind])

# convert strings into python lists
df['tags'] = df['tags'].apply(ast.literal_eval)
df['positions'] = df['positions'].apply(ast.literal_eval)

# make 'type_name' and 'subtype_name' columns lowercase 
df['type_name'] = df['type_name'].str.lower()
df['subtype_name'] = df['subtype_name'].str.lower()

# create separate initial(start) and final(end) coordinates from 'positions' column
# if action has only 'start' coordinates set 'end' coordinates to 'nan'
df['x_start'] = df['positions'].apply(lambda x: x[0]['x'])
df['y_start'] = df['positions'].apply(lambda x: x[0]['y'])
df['x_end'] = df['positions'].apply(lambda x: x[1]['x'] if len(x) == 2 else np.nan)
df['y_end'] = df['positions'].apply(lambda x: x[1]['y'] if len(x) == 2  else np.nan)

# import tags to encode subevents in event data
tags = pd.read_csv('data/wyscout_tags.csv', sep = ';')

# make all descriptions lowercase
tags['Description'] = tags['Description'].str.lower()

# transform tags data frame into dictionary
tags = dict(zip(tags['Tag'], tags['Description']))

# use dictionaries and list comprehensions to convert tags into tag ids and their descriptions 
df['tag_id'] = df['tags'].apply(lambda x: [value for d in x for value in d.values()])
df['tag_name'] = df['tag_id'].apply(lambda x: [tags[i] for i in x])

# drop redundant columns 'positions' and 'tags'
df.drop(columns = ['positions', 'tags'], inplace = True)

# rearrange columns using only required indices and passing them into np.r_
rearr_cols = np.r_[0:8, 12, 13, 8:12]
df = df.iloc[:, rearr_cols]

# save 'df' as 'refined_events.csv' dataframe
df.to_csv('data/laliga/refined_events.csv', index = False)

# free kicks are not included (penalties are also part of free kicks)
shots = df[df['type_name'] == 'shot']

# save 'shots' as 'shots.csv'
shots.to_csv('data/laliga/shots.csv', index = False)    

In [20]:
df = pd.read_csv('data/laliga/events.csv')

In [21]:
df.head()

Unnamed: 0,event_id,game_id,period_id,milliseconds,team_id,player_id,type_id,type_name,subtype_id,subtype_name,positions,tags
0,180864419,2565548,1,2994.582,682,3542,8,Pass,85,Simple pass,"[{'y': 61, 'x': 37}, {'y': 50, 'x': 50}]",[{'id': 1801}]
1,180864418,2565548,1,3137.02,682,274435,8,Pass,85,Simple pass,"[{'y': 50, 'x': 50}, {'y': 30, 'x': 45}]",[{'id': 1801}]
2,180864420,2565548,1,6709.668,682,364860,8,Pass,85,Simple pass,"[{'y': 30, 'x': 45}, {'y': 12, 'x': 38}]",[{'id': 1801}]
3,180864421,2565548,1,8805.497,682,3534,8,Pass,85,Simple pass,"[{'y': 12, 'x': 38}, {'y': 69, 'x': 32}]",[{'id': 1801}]
4,180864422,2565548,1,14047.492,682,3695,8,Pass,85,Simple pass,"[{'y': 69, 'x': 32}, {'y': 37, 'x': 31}]",[{'id': 1801}]


In [22]:
df = pd.read_csv('data/laliga/refined_events.csv')
df.head()

Unnamed: 0,game_id,period_id,team_id,player_id,type_id,type_name,subtype_id,subtype_name,tag_id,tag_name,x_start,y_start,x_end,y_end
0,2565548,1,682,3542,8,pass,85,simple pass,[1801],['accurate'],37,61,50.0,50.0
1,2565548,1,682,274435,8,pass,85,simple pass,[1801],['accurate'],50,50,45.0,30.0
2,2565548,1,682,364860,8,pass,85,simple pass,[1801],['accurate'],45,30,38.0,12.0
3,2565548,1,682,3534,8,pass,85,simple pass,[1801],['accurate'],38,12,32.0,69.0
4,2565548,1,682,3695,8,pass,85,simple pass,[1801],['accurate'],32,69,31.0,37.0


In [23]:
df.shape

(628659, 14)

In [24]:
df = pd.read_csv('data/laliga/shots.csv')

In [26]:
df.head()

Unnamed: 0,game_id,period_id,team_id,player_id,type_id,type_name,subtype_id,subtype_name,tag_id,tag_name,x_start,y_start,x_end,y_end
0,2565548,1,695,225089,10,shot,100,shot,"[1901, 401, 201, 1201, 1801]","['counter attack', 'left foot', 'opportunity',...",93,34,0.0,0.0
1,2565548,1,695,255738,10,shot,100,shot,"[402, 2101, 1802]","['right foot', 'blocked', 'not accurate']",80,59,0.0,0.0
2,2565548,1,682,37831,10,shot,100,shot,"[402, 2101, 201, 1802]","['right foot', 'blocked', 'opportunity', 'not ...",88,57,100.0,100.0
3,2565548,1,682,15214,10,shot,100,shot,"[402, 201, 1216, 1802]","['right foot', 'opportunity', 'position: out h...",87,66,100.0,100.0
4,2565548,1,695,225089,10,shot,100,shot,"[402, 1214, 1802]","['right foot', 'position: out high center', 'n...",75,40,0.0,0.0


In [27]:
df.shape

(7979, 14)

### Bundesliga Data Preprocessing

In [None]:
# Germany 17/18, competition_id = 426, season_id = 181137
bundesliga_games = wyscout_data.games(competition_id = 426, season_id = 181137)["game_id"]


# download all bundesliga matches and save as a single .csv
#events = pd.DataFrame()
#for i in bundesliga_games:
#    df = wyscout_data.events(i)
#    events = pd.concat([events, df])

# save as .csv file
#events.to_csv('data/bundesliga/events.csv', index = False)    

In [28]:
df = pd.read_csv('data/bundesliga/events.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,event_id,game_id,period_id,milliseconds,team_id,player_id,type_id,type_name,subtype_id,subtype_name,positions,tags
0,0,179896442,2516739,1,2409.746,2446,15231,8,Pass,85,Simple pass,"[{'y': 50, 'x': 50}, {'y': 48, 'x': 50}]",[{'id': 1801}]
1,1,179896443,2516739,1,2506.082,2446,14786,8,Pass,85,Simple pass,"[{'y': 48, 'x': 50}, {'y': 22, 'x': 22}]",[{'id': 1801}]
2,2,179896444,2516739,1,6946.706,2446,14803,8,Pass,85,Simple pass,"[{'y': 22, 'x': 22}, {'y': 46, 'x': 6}]",[{'id': 1801}]
3,3,179896445,2516739,1,10786.491,2446,14768,8,Pass,85,Simple pass,"[{'y': 46, 'x': 6}, {'y': 10, 'x': 20}]",[{'id': 1801}]
4,4,179896446,2516739,1,12684.514,2446,14803,8,Pass,85,Simple pass,"[{'y': 10, 'x': 20}, {'y': 4, 'x': 27}]",[{'id': 1801}]


In [29]:
df.shape

(519407, 13)

In [30]:
df = df.drop('Unnamed: 0', axis = 1)

In [31]:
df.head()

Unnamed: 0,event_id,game_id,period_id,milliseconds,team_id,player_id,type_id,type_name,subtype_id,subtype_name,positions,tags
0,179896442,2516739,1,2409.746,2446,15231,8,Pass,85,Simple pass,"[{'y': 50, 'x': 50}, {'y': 48, 'x': 50}]",[{'id': 1801}]
1,179896443,2516739,1,2506.082,2446,14786,8,Pass,85,Simple pass,"[{'y': 48, 'x': 50}, {'y': 22, 'x': 22}]",[{'id': 1801}]
2,179896444,2516739,1,6946.706,2446,14803,8,Pass,85,Simple pass,"[{'y': 22, 'x': 22}, {'y': 46, 'x': 6}]",[{'id': 1801}]
3,179896445,2516739,1,10786.491,2446,14768,8,Pass,85,Simple pass,"[{'y': 46, 'x': 6}, {'y': 10, 'x': 20}]",[{'id': 1801}]
4,179896446,2516739,1,12684.514,2446,14803,8,Pass,85,Simple pass,"[{'y': 10, 'x': 20}, {'y': 4, 'x': 27}]",[{'id': 1801}]


In [32]:
df.to_csv('data/bundesliga/events.csv', index = False) 

In [33]:
df = pd.read_csv('data/bundesliga/events.csv')
df.head()

Unnamed: 0,event_id,game_id,period_id,milliseconds,team_id,player_id,type_id,type_name,subtype_id,subtype_name,positions,tags
0,179896442,2516739,1,2409.746,2446,15231,8,Pass,85,Simple pass,"[{'y': 50, 'x': 50}, {'y': 48, 'x': 50}]",[{'id': 1801}]
1,179896443,2516739,1,2506.082,2446,14786,8,Pass,85,Simple pass,"[{'y': 48, 'x': 50}, {'y': 22, 'x': 22}]",[{'id': 1801}]
2,179896444,2516739,1,6946.706,2446,14803,8,Pass,85,Simple pass,"[{'y': 22, 'x': 22}, {'y': 46, 'x': 6}]",[{'id': 1801}]
3,179896445,2516739,1,10786.491,2446,14768,8,Pass,85,Simple pass,"[{'y': 46, 'x': 6}, {'y': 10, 'x': 20}]",[{'id': 1801}]
4,179896446,2516739,1,12684.514,2446,14803,8,Pass,85,Simple pass,"[{'y': 10, 'x': 20}, {'y': 4, 'x': 27}]",[{'id': 1801}]


In [34]:
# load the 'events' dataframe, which includes all events from all Bundesliga games
df = pd.read_csv('data/bundesliga/events.csv')

# remove columns 'event_id' and 'milliseconds'
rm_col_ind = np.r_[0, 3]
df = df.drop(columns = df.columns[rm_col_ind])

# convert strings into python lists
df['tags'] = df['tags'].apply(ast.literal_eval)
df['positions'] = df['positions'].apply(ast.literal_eval)

# make 'type_name' and 'subtype_name' columns lowercase 
df['type_name'] = df['type_name'].str.lower()
df['subtype_name'] = df['subtype_name'].str.lower()

# create separate initial(start) and final(end) coordinates from 'positions' column
# if action has only 'start' coordinates set 'end' coordinates to 'nan'
df['x_start'] = df['positions'].apply(lambda x: x[0]['x'])
df['y_start'] = df['positions'].apply(lambda x: x[0]['y'])
df['x_end'] = df['positions'].apply(lambda x: x[1]['x'] if len(x) == 2 else np.nan)
df['y_end'] = df['positions'].apply(lambda x: x[1]['y'] if len(x) == 2  else np.nan)

# import tags to encode subevents in event data
tags = pd.read_csv('data/wyscout_tags.csv', sep = ';')

# make all descriptions lowercase
tags['Description'] = tags['Description'].str.lower()

# transform tags data frame into dictionary
tags = dict(zip(tags['Tag'], tags['Description']))

# use dictionaries and list comprehensions to convert tags into tag ids and their descriptions 
df['tag_id'] = df['tags'].apply(lambda x: [value for d in x for value in d.values()])
df['tag_name'] = df['tag_id'].apply(lambda x: [tags[i] for i in x])

# drop redundant columns 'positions' and 'tags'
df.drop(columns = ['positions', 'tags'], inplace = True)

# rearrange columns using only required indices and passing them into np.r_
rearr_cols = np.r_[0:8, 12, 13, 8:12]
df = df.iloc[:, rearr_cols]

# save 'df' as 'refined_events.csv' dataframe
df.to_csv('data/bundesliga/refined_events.csv', index = False)

# free kicks are not included (penalties are also part of free kicks)
shots = df[df['type_name'] == 'shot']

# save 'shots' as 'shots.csv'
shots.to_csv('data/bundesliga/shots.csv', index = False)    

In [35]:
df = pd.read_csv('data/bundesliga/refined_events.csv')
df.head()

Unnamed: 0,game_id,period_id,team_id,player_id,type_id,type_name,subtype_id,subtype_name,tag_id,tag_name,x_start,y_start,x_end,y_end
0,2516739,1,2446,15231,8,pass,85,simple pass,[1801],['accurate'],50,50,50.0,48.0
1,2516739,1,2446,14786,8,pass,85,simple pass,[1801],['accurate'],50,48,22.0,22.0
2,2516739,1,2446,14803,8,pass,85,simple pass,[1801],['accurate'],22,22,6.0,46.0
3,2516739,1,2446,14768,8,pass,85,simple pass,[1801],['accurate'],6,46,20.0,10.0
4,2516739,1,2446,14803,8,pass,85,simple pass,[1801],['accurate'],20,10,27.0,4.0


In [36]:
df.shape

(519407, 14)

In [37]:
df = pd.read_csv('data/bundesliga/shots.csv')
df.head()

Unnamed: 0,game_id,period_id,team_id,player_id,type_id,type_name,subtype_id,subtype_name,tag_id,tag_name,x_start,y_start,x_end,y_end
0,2516739,1,2444,209091,10,shot,100,shot,"[402, 201, 1206, 1801]","['right foot', 'opportunity', 'position: goal ...",83,66,0.0,0.0
1,2516739,1,2444,134383,10,shot,100,shot,"[101, 403, 201, 1205, 1801]","['goal', 'head/body', 'opportunity', 'position...",95,59,0.0,0.0
2,2516739,1,2446,105619,10,shot,100,shot,"[402, 201, 1201, 1801]","['right foot', 'opportunity', 'position: goal ...",91,66,100.0,100.0
3,2516739,1,2446,14786,10,shot,100,shot,"[402, 201, 1212, 1802]","['right foot', 'opportunity', 'position: out l...",88,49,100.0,100.0
4,2516739,1,2444,20475,10,shot,100,shot,"[402, 1216, 1802]","['right foot', 'position: out high right', 'no...",74,42,0.0,0.0


In [38]:
df.shape

(6898, 14)

### Ligue 1 Data Preprocessing