**Author**: Fabrizio Lucero Fernández. https://www.linkedin.com/in/fabrizio-lucero/

<font size="4.5">**Data Analytics in the Sports World: Assessing 2017/2018 English Premier League Data through Machine Learning**</font>

![1200px-Memorial_University_of_Newfoundland_Logo.svg.png](attachment:1200px-Memorial_University_of_Newfoundland_Logo.svg.png)

**Libraries**: pandas and NumPy for standard data analysis procedures. `import_ipynb` allows the `definitions.ipynb` notebook to be imported so that the functions defined there can be reused.

In [1]:
import pandas as pd
import numpy as np
import import_ipynb
import definitions as ds

importing Jupyter notebook from definitions.ipynb


**Load the events dataset created in the "WyScout Dataset - Exploration" Jupyter notebook**

In [2]:
Events_df=pd.read_excel("Event EPL Dataset-1718.xlsx")

Before dividing the dataset into levels of detail, we compare the venue with each team's local stadium to flag whether the team is playing at home (1) or away (0).

In [3]:
##General Wrangling

#Create a column to differentiate between local and visitor teams
Events_df['Home_Away']=np.where(Events_df['venue']==Events_df['local_stadium'],1,0)
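
To make the comparison concrete, here is a minimal sketch of the same `np.where` call on two toy rows. The stadium names and the `toy` dataframe are made up for illustration; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up stadium names (assumption, not real data)
toy = pd.DataFrame({
    "teamId": [1609, 1631],
    "venue": ["Stadium A", "Stadium A"],
    "local_stadium": ["Stadium A", "Stadium B"],
})

# 1 when the match venue equals the team's own stadium, 0 otherwise
toy["Home_Away"] = np.where(toy["venue"] == toy["local_stadium"], 1, 0)
print(toy)
```

The comparison is an exact string match, so any difference in spelling or whitespace between the two columns would be flagged as an away game.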

**Features selected for analysis**: The scope of this investigation is centered on events (such as Duels, Passes, Fouls), sub events (Simple pass, Air duel, etc.) and their description (Accurate or Not accurate). Further analysis will be carried out in follow-up investigations.

**LEVEL 1**: This level builds a dataset based only on event frequency, without sub event detail or description, so it is as general as possible. For this process we use `Level_Creation()`, which converts the dataframe by grouping features according to the level of detail. It wraps another function, `preprocess_features()`, which converts all categorical features into dummy variables. This makes it easy to sum those dummy variables per match and team.
> **Dataframe**: The events dataset to be converted.
<br>

*For more info visit the link:
<br>
https://github.com/fabriziolufe/GRI-Research---Passing-Networks/blob/main/definitions.ipynb*
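
The actual implementation lives in `definitions.ipynb`. As a rough illustration of the idea described above (dummy-encode the `Feature` column, then sum per match and team), a minimal sketch could look like this. The function name `level_creation_sketch`, the outcome encoding (1 = win, -1 = loss, 0 = draw when `winner` is 0) and the toy data are assumptions, not the author's code.

```python
import numpy as np
import pandas as pd

def level_creation_sketch(events: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for ds.Level_Creation(); not the original code."""
    df = events.copy()
    # Assumed encoding: 1 = win, 0 = draw (winner == 0), -1 = loss
    df["Outcome"] = np.select(
        [df["winner"] == df["teamId"], df["winner"] == 0],
        [1, 0],
        default=-1,
    )
    # One dummy column per distinct feature label
    dummies = pd.get_dummies(df["Feature"], prefix="Feature", dtype=float)
    df = pd.concat(
        [df[["matchId", "teamId", "Outcome", "Home_Away"]], dummies], axis=1
    )
    # Outcome and Home_Away are constant per match and team; dummies are summed
    agg = {"Outcome": "first", "Home_Away": "first"}
    agg.update({col: "sum" for col in dummies.columns})
    return df.groupby(["matchId", "teamId"]).agg(agg)

# Toy events (made up): team 10 wins match 1 at home
toy_events = pd.DataFrame({
    "matchId": [1, 1, 1, 1, 1],
    "teamId": [10, 10, 10, 20, 20],
    "Feature": ["Pass", "Pass", "Shot", "Pass", "Duel"],
    "Home_Away": [1, 1, 1, 0, 0],
    "winner": [10, 10, 10, 10, 10],
})
toy_level = level_creation_sketch(toy_events)
print(toy_level)
```

The result has one row per (`matchId`, `teamId`) pair and one count column per feature label, which is the shape of the level tables shown in this notebook.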

In [6]:
#.copy() avoids pandas' SettingWithCopyWarning when adding the 'Feature' column below
Level1_df=Events_df[['matchId','teamId','eventName','id','Home_Away','winner']].copy()

#creating the feature from the event name
Level1_df['Feature']=Level1_df['eventName']
Level1_df=Level1_df.drop(['eventName'],axis=1)
#keep one row per event id
Level1_df=Level1_df.drop_duplicates('id')

#Arranging - pre processing features and grouping by the selected dataframe to create results. 
#For more info see 'definitions.ipynb'
Level1_df=ds.Level_Creation(Level1_df)


#Show results
Level1_df.head()

| matchId | teamId | Outcome | Home_Away | Feature_Duel | Feature_Foul | Feature_Free Kick | Feature_Goalkeeper leaving line | Feature_Interruption | Feature_Offside | Feature_Others on the ball | Feature_Pass | Feature_Save attempt | Feature_Shot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2499719 | 1609 | 1 | 1 | 256.0 | 9.0 | 51.0 | 2.0 | 0.0 | 6.0 | 71.0 | 606.0 | 4.0 | 27.0 |
| 2499719 | 1631 | -1 | 0 | 256.0 | 12.0 | 62.0 | 0.0 | 84.0 | 2.0 | 73.0 | 230.0 | 10.0 | 7.0 |
| 2499720 | 1625 | 1 | 0 | 176.0 | 9.0 | 43.0 | 2.0 | 0.0 | 1.0 | 54.0 | 754.0 | 2.0 | 12.0 |
| 2499720 | 1651 | -1 | 1 | 176.0 | 6.0 | 42.0 | 0.0 | 63.0 | 6.0 | 65.0 | 184.0 | 5.0 | 6.0 |
| 2499721 | 1610 | -1 | 1 | 214.0 | 18.0 | 37.0 | 0.0 | 2.0 | 2.0 | 68.0 | 516.0 | 5.0 | 15.0 |


**LEVEL 2**: The sub event name is now included in the features. The procedure is the same as for the first level, with more detail.

In [5]:
#.copy() avoids pandas' SettingWithCopyWarning when adding the 'Feature' column below
Level2_df=Events_df[['matchId','teamId','eventName','subEventName','id','Home_Away','winner']].copy()

#creating the features concatenating the needed columns
Level2_df['Feature']=Level2_df['eventName']+'_'+Level2_df['subEventName']
Level2_df=Level2_df.drop(['eventName','subEventName'],axis=1)
#keep one row per event id
Level2_df=Level2_df.drop_duplicates('id')

#Arranging - pre processing features and grouping by the selected dataframe to create results. 
#For more info see 'definitions.ipynb'
Level2_df=ds.Level_Creation(Level2_df)


#Show results
Level2_df.head()

| matchId | teamId | Outcome | Home_Away | Feature_Duel_Air duel | Feature_Duel_Ground attacking duel | Feature_Duel_Ground defending duel | Feature_Duel_Ground loose ball duel | Feature_Foul_Foul | Feature_Foul_Hand foul | Feature_Foul_Late card foul | Feature_Foul_Out of game foul | ... | Feature_Pass_Cross | Feature_Pass_Hand pass | Feature_Pass_Head pass | Feature_Pass_High pass | Feature_Pass_Launch | Feature_Pass_Simple pass | Feature_Pass_Smart pass | Feature_Save attempt_Reflexes | Feature_Save attempt_Save attempt | Feature_Shot_Shot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2499719 | 1609 | 1 | 1 | 58.0 | 82.0 | 65.0 | 51.0 | 9.0 | 0.0 | 0.0 | 0.0 | ... | 20.0 | 2.0 | 32.0 | 35.0 | 4.0 | 507.0 | 6.0 | 3.0 | 1.0 | 27.0 |
| 2499719 | 1631 | -1 | 0 | 58.0 | 65.0 | 82.0 | 51.0 | 12.0 | 0.0 | 0.0 | 0.0 | ... | 18.0 | 1.0 | 32.0 | 27.0 | 12.0 | 129.0 | 11.0 | 7.0 | 3.0 | 7.0 |
| 2499720 | 1625 | 1 | 0 | 41.0 | 61.0 | 44.0 | 30.0 | 8.0 | 1.0 | 0.0 | 0.0 | ... | 22.0 | 7.0 | 22.0 | 30.0 | 3.0 | 651.0 | 19.0 | 1.0 | 1.0 | 12.0 |
| 2499720 | 1651 | -1 | 1 | 41.0 | 44.0 | 61.0 | 30.0 | 6.0 | 0.0 | 0.0 | 0.0 | ... | 3.0 | 3.0 | 16.0 | 26.0 | 13.0 | 119.0 | 4.0 | 3.0 | 2.0 | 6.0 |
| 2499721 | 1610 | -1 | 1 | 59.0 | 70.0 | 49.0 | 36.0 | 13.0 | 1.0 | 0.0 | 0.0 | ... | 27.0 | 4.0 | 23.0 | 36.0 | 15.0 | 402.0 | 9.0 | 5.0 | 0.0 | 15.0 |
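
The Level 2 feature label is built by joining `eventName` and `subEventName` with an underscore. The toy example below (made-up rows, not taken from the dataset) shows how that concatenation behaves, including what happens if a sub event name were missing. Whether the real data contains missing sub event names is not checked here, and the `fillna` fallback is only a hypothetical option.

```python
import numpy as np
import pandas as pd

# Toy rows (assumption): the last one has no sub event name
toy = pd.DataFrame({
    "eventName": ["Pass", "Duel", "Shot"],
    "subEventName": ["Simple pass", "Air duel", np.nan],
})

# Plain concatenation: a missing sub event makes the whole label missing
toy["Feature"] = toy["eventName"] + "_" + toy["subEventName"]

# Hypothetical fallback: use the event name alone when the label is missing
toy["Feature_filled"] = toy["Feature"].fillna(toy["eventName"])
print(toy)
```

This matters because `pd.get_dummies` does not create a column for missing values by default, so an event with a missing label would not be counted in any feature column if the encoding relies on it.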


**LEVEL 3**: The description is added, so the dataset now captures specific actions. (This level of detail could lead to overfitting in any further modelling.)

In [7]:
#.copy() avoids pandas' SettingWithCopyWarning when adding the 'Feature' column below
Level3_df=Events_df[['matchId','teamId','eventName','subEventName','Description','id','Home_Away','winner']].copy()

#creating the features concatenating the needed columns
Level3_df['Feature']=Level3_df['eventName']+"_"+Level3_df['subEventName']+"_"+Level3_df['Description']
Level3_df=Level3_df.drop(['eventName','subEventName','Description'],axis=1)
#note: unlike Levels 1 and 2, duplicate event ids are not dropped here

#Arranging - pre processing features and grouping by the selected dataframe to create results. 
#For more info see 'definitions.ipynb'
Level3_df=ds.Level_Creation(Level3_df)


#Show results
Level3_df.head()

| matchId | teamId | Outcome | Home_Away | Feature_Duel_Air duel_Accurate | Feature_Duel_Air duel_Assist | Feature_Duel_Air duel_Counter attack | Feature_Duel_Air duel_Key pass | Feature_Duel_Air duel_Lost | Feature_Duel_Air duel_Neutral | Feature_Duel_Air duel_Not accurate | Feature_Duel_Air duel_Won | ... | Feature_Shot_Shot_Position: Out low left | Feature_Shot_Shot_Position: Out low right | Feature_Shot_Shot_Position: Post center left | Feature_Shot_Shot_Position: Post center right | Feature_Shot_Shot_Position: Post high center | Feature_Shot_Shot_Position: Post high left | Feature_Shot_Shot_Position: Post high right | Feature_Shot_Shot_Position: Post low left | Feature_Shot_Shot_Position: Post low right | Feature_Shot_Shot_Right foot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2499719 | 1609 | 1 | 1 | 32.0 | 0.0 | 0.0 | 1.0 | 26.0 | 12.0 | 26.0 | 20.0 | ... | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 |
| 2499719 | 1631 | -1 | 0 | 38.0 | 0.0 | 1.0 | 0.0 | 20.0 | 12.0 | 20.0 | 26.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 2499720 | 1625 | 1 | 0 | 28.0 | 0.0 | 0.0 | 0.0 | 13.0 | 5.0 | 13.0 | 23.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9.0 |
| 2499720 | 1651 | -1 | 1 | 18.0 | 0.0 | 0.0 | 0.0 | 23.0 | 5.0 | 23.0 | 13.0 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| 2499721 | 1610 | -1 | 1 | 27.0 | 0.0 | 0.0 | 0.0 | 32.0 | 2.0 | 32.0 | 25.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 |
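
Level 3 produces many columns that are zero for most matches, which is part of the overfitting concern noted above. One possible mitigation, sketched here under assumptions, is to drop feature columns that are non-zero in only a small share of rows. The helper `drop_rare_features`, the `min_share` threshold and the toy data are hypothetical and not part of the original analysis.

```python
import pandas as pd

def drop_rare_features(level_df: pd.DataFrame, min_share: float = 0.05) -> pd.DataFrame:
    """Hypothetical helper: drop Feature_* columns that are rarely non-zero."""
    feature_cols = [c for c in level_df.columns if c.startswith("Feature_")]
    # Share of rows in which each feature occurs at least once
    share_nonzero = (level_df[feature_cols] > 0).mean()
    rare = share_nonzero[share_nonzero < min_share].index
    return level_df.drop(columns=rare)

# Toy data (made up) with one always-zero feature
toy = pd.DataFrame({
    "Outcome": [1, -1, 1, -1],
    "Home_Away": [1, 0, 0, 1],
    "Feature_Pass_Simple pass_Accurate": [400.0, 120.0, 500.0, 110.0],
    "Feature_Shot_Shot_Position: Post high left": [0.0, 0.0, 0.0, 0.0],
})
reduced = drop_rare_features(toy, min_share=0.25)
print(reduced.columns.tolist())
```

The threshold is a modelling choice and would need to be tuned against the actual data before any conclusions are drawn.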


**Saving resulting levels in 3 different xlsx files!**

In [None]:
#saving files from the 3 levels.
Level1_df.to_excel("Level1.xlsx") 
Level2_df.to_excel("Level2.xlsx") 
Level3_df.to_excel("Level3.xlsx") 
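
Because the level dataframes are indexed by `matchId` and `teamId`, the index has to be restored when the saved files are read back. The sketch below shows the idea with an in-memory CSV round trip so it runs without an Excel engine; `pd.read_excel` accepts the same `index_col=[0, 1]` argument. The toy dataframe is an assumption built from two rows of the Level 1 output.

```python
import io
import pandas as pd

# Toy frame shaped like a level table (values taken from the Level 1 output)
toy = pd.DataFrame(
    {"Outcome": [1, -1], "Feature_Pass": [606.0, 230.0]},
    index=pd.MultiIndex.from_tuples(
        [(2499719, 1609), (2499719, 1631)], names=["matchId", "teamId"]
    ),
)

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=[0, 1] rebuilds the (matchId, teamId) MultiIndex on load
restored = pd.read_csv(buffer, index_col=[0, 1])
print(restored)
```

Without `index_col`, the two index levels would come back as ordinary columns.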

![Ramsey-2.jpg](attachment:Ramsey-2.jpg)

**Link to Original Dataset documentation**:
https://figshare.com/collections/Soccer_match_event_dataset/4415000/5

**Published article from where Dataset was retrieved**:
Pappalardo, L., Cintia, P., Rossi, A., Massucco, E., Ferragina, P., Pedreschi, D., & Giannotti, F. (2019). A public data set of spatio-temporal match events in soccer competitions. Scientific Data, 6(1). https://doi.org/10.1038/s41597-019-0247-7