# Extracting the model and preprocessing the dataset

**ATTENTION:**

Notebook language: **R**

The model from the paper [Explainable Expected Goal Models for Performance Analysis in Football Analytics](https://arxiv.org/pdf/2206.07212.pdf) will be used for further research. It is a `ranger` model trained on an oversampled dataset.

## Preparing the preprocessed dataset

In [1]:
df <- read.csv('./data/shotdata2023.csv')
df <- df[,-1]
head(df)

| | league | id | minute | result | X | Y | xG | player | h_a | player_id | ⋯ | season | shotType | match_id | home_team | away_team | home_goals | away_goals | date | player_assisted | lastAction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<int>` | ⋯ | `<int>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | La_liga | 480067 | 0 | BlockedShot | 0.878 | 0.415 | 0.0914239 | Ezequiel Ávila | h | 6955 | ⋯ | 2022 | RightFoot | 18962 | Osasuna | Sevilla | 2 | 1 | 2022-08-12 19:00:00 | Kike Barja | Pass |
| 2 | La_liga | 480068 | 8 | Goal | 0.958 | 0.583 | 0.298781 | Ezequiel Ávila | h | 6955 | ⋯ | 2022 | Head | 18962 | Osasuna | Sevilla | 2 | 1 | 2022-08-12 19:00:00 | Rubén Peña | Cross |
| 3 | La_liga | 480070 | 20 | SavedShot | 0.747 | 0.445 | 0.0173666 | Lucas Torró | h | 7050 | ⋯ | 2022 | RightFoot | 18962 | Osasuna | Sevilla | 2 | 1 | 2022-08-12 19:00:00 | | |
| 4 | La_liga | 480071 | 21 | BlockedShot | 0.79 | 0.599 | 0.0338025 | Jon Moncayola | h | 7857 | ⋯ | 2022 | RightFoot | 18962 | Osasuna | Sevilla | 2 | 1 | 2022-08-12 19:00:00 | Moi Gómez | Pass |
| 5 | La_liga | 480073 | 26 | ShotOnPost | 0.757 | 0.594 | 0.0232618 | Moi Gómez | h | 2198 | ⋯ | 2022 | LeftFoot | 18962 | Osasuna | Sevilla | 2 | 1 | 2022-08-12 19:00:00 | Aimar Oroz | Pass |
| 6 | La_liga | 480077 | 45 | BlockedShot | 0.709 | 0.479 | 0.0175576 | Ezequiel Ávila | h | 6955 | ⋯ | 2022 | RightFoot | 18962 | Osasuna | Sevilla | 2 | 1 | 2022-08-12 19:00:00 | Moi Gómez | Pass |


Remove shots whose `result` is `OwnGoal`; this is also the first preprocessing step.

In [3]:
library(dplyr)

In [4]:
df <- df %>% filter(result != "OwnGoal")
head(df)

| | league | id | minute | result | X | Y | xG | player | h_a | player_id | ⋯ | season | shotType | match_id | home_team | away_team | home_goals | away_goals | date | player_assisted | lastAction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<int>` | ⋯ | `<int>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | La_liga | 480067 | 0 | BlockedShot | 0.878 | 0.415 | 0.0914239 | Ezequiel Ávila | h | 6955 | ⋯ | 2022 | RightFoot | 18962 | Osasuna | Sevilla | 2 | 1 | 2022-08-12 19:00:00 | Kike Barja | Pass |
| 2 | La_liga | 480068 | 8 | Goal | 0.958 | 0.583 | 0.298781 | Ezequiel Ávila | h | 6955 | ⋯ | 2022 | Head | 18962 | Osasuna | Sevilla | 2 | 1 | 2022-08-12 19:00:00 | Rubén Peña | Cross |
| 3 | La_liga | 480070 | 20 | SavedShot | 0.747 | 0.445 | 0.0173666 | Lucas Torró | h | 7050 | ⋯ | 2022 | RightFoot | 18962 | Osasuna | Sevilla | 2 | 1 | 2022-08-12 19:00:00 | | |
| 4 | La_liga | 480071 | 21 | BlockedShot | 0.79 | 0.599 | 0.0338025 | Jon Moncayola | h | 7857 | ⋯ | 2022 | RightFoot | 18962 | Osasuna | Sevilla | 2 | 1 | 2022-08-12 19:00:00 | Moi Gómez | Pass |
| 5 | La_liga | 480073 | 26 | ShotOnPost | 0.757 | 0.594 | 0.0232618 | Moi Gómez | h | 2198 | ⋯ | 2022 | LeftFoot | 18962 | Osasuna | Sevilla | 2 | 1 | 2022-08-12 19:00:00 | Aimar Oroz | Pass |
| 6 | La_liga | 480077 | 45 | BlockedShot | 0.709 | 0.479 | 0.0175576 | Ezequiel Ávila | h | 6955 | ⋯ | 2022 | RightFoot | 18962 | Osasuna | Sevilla | 2 | 1 | 2022-08-12 19:00:00 | Moi Gómez | Pass |
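To illustrate what this filter does, here is a minimal base-R sketch on a small made-up data frame. The `shots` data and its values are hypothetical; only the column name `result` mirrors the real dataset. Like `dplyr::filter()`, `subset()` drops rows where the condition evaluates to `NA`, which matters if `result` can be missing.

```r
# Hypothetical toy data; not taken from shotdata2023.csv.
shots <- data.frame(
  id = 1:4,
  result = c("Goal", "OwnGoal", "SavedShot", NA),
  stringsAsFactors = FALSE
)

# Keep every shot that is not an own goal.
# Rows with a missing `result` are dropped as well.
kept <- subset(shots, result != "OwnGoal")

# Sanity check of the kind that could be run on the real `df` after filtering.
stopifnot(!any(kept$result == "OwnGoal"))
print(kept)
```

If rows with a missing `result` should be kept, the condition would need an explicit `is.na(result) |` term.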


Overwrite `shotdata2023.csv` with the filtered data to avoid conflicts in the future.

In [5]:
write.csv(df, './data/shotdata2023.csv')
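One detail of this cell is worth spelling out. `write.csv()` writes row names by default, so the saved file gains an extra leading column. This may be why the first cell drops column 1 with `df[,-1]` after reading. Below is a minimal sketch with a temporary file and made-up data (`toy` and its columns are hypothetical).

```r
# Hypothetical toy data written to a temporary file.
toy <- data.frame(minute = c(0L, 8L), result = c("BlockedShot", "Goal"))
path <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an unnamed first column.
write.csv(toy, path)
with_index <- read.csv(path)
print(names(with_index))  # the row-name column comes first

# Dropping the first column restores the original columns.
restored <- with_index[, -1]

# Alternative: do not write row names at all.
write.csv(toy, path, row.names = FALSE)
clean <- read.csv(path)
print(names(clean))
```

Either convention works as long as the reading and writing cells agree. Mixing them would drop a real data column or keep a stray index.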

### Preprocessing

In [6]:
source('./scripts/preprocess.R') # preprocessing steps from the paper

In [7]:
df_preprocessed <- preprocess(df)
head(df_preprocessed)

| | status | minute | h_a | situation | shotType | lastAction | distanceToGoal | angleToGoal |
|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` |
| 1 | 0 | 0 | h | OpenPlay | RightFoot | Pass | 13.506088 | 29.02087 |
| 2 | 1 | 8 | h | OpenPlay | Head | Cross | 8.395523 | 29.48606 |
| 3 | 0 | 20 | h | OpenPlay | RightFoot | | 26.659276 | 15.58172 |
| 4 | 0 | 21 | h | OpenPlay | RightFoot | Pass | 23.536532 | 16.62475 |
| 5 | 0 | 26 | h | OpenPlay | LeftFoot | Pass | 26.707659 | 14.94127 |
| 6 | 0 | 45 | h | SetPiece | RightFoot | Pass | 30.555083 | 13.66107 |
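The `distanceToGoal` and `angleToGoal` columns come from `preprocess.R`, which is not shown in this notebook. As a hedged sketch, the formulas below approximately reproduce the first two rows of the output (to about three decimal places) from the raw `X` and `Y` values. The function names `distance_to_goal()` and `angle_to_goal()` and all constants are assumptions chosen to match those rows, not a copy of the script. They assume a 105 m by 68 m pitch, a goal centre at 32.5 on the width axis, and a 7.32 m goal mouth.

```r
# Assumed constants (not taken from preprocess.R).
PITCH_LENGTH <- 105   # metres
PITCH_WIDTH  <- 68    # metres
GOAL_CENTRE  <- 32.5  # offset on the width axis that matches the output
GOAL_WIDTH   <- 7.32  # metres

# Euclidean distance from the shot location to the assumed goal centre.
distance_to_goal <- function(X, Y) {
  dx <- PITCH_LENGTH - X * PITCH_LENGTH
  dy <- GOAL_CENTRE - Y * PITCH_WIDTH
  sqrt(dx^2 + dy^2)
}

# Angle (in degrees) under which the goal mouth is seen from the shot location.
angle_to_goal <- function(X, Y) {
  dx <- PITCH_LENGTH - X * PITCH_LENGTH
  d2 <- distance_to_goal(X, Y)^2
  atan(GOAL_WIDTH * dx / (d2 - (GOAL_WIDTH / 2)^2)) * 180 / pi
}

# X and Y of the first two shots in the raw data.
X <- c(0.878, 0.958)
Y <- c(0.415, 0.583)
print(round(distance_to_goal(X, Y), 3))
print(round(angle_to_goal(X, Y), 3))
```

Small differences in later decimals are expected, since the displayed `X` and `Y` are rounded.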


In [8]:
df_preprocessed_1 <- df_preprocessed

In [9]:
levels_vector <- readRDS(file = "./data/level_vector.RDS") # stored factor levels, indexed by column name

In [10]:
levels_vector

In [11]:
# names of all factor-typed columns
factor_cols <- names(df_preprocessed_1)[sapply(df_preprocessed_1, is.factor)]

In [12]:
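for (col in factor_cols) {
    # re-level with the stored levels so the integer codes follow the stored level order
    df_preprocessed_1[, col] <- factor(df_preprocessed_1[, col], levels = unlist(levels_vector[col], use.names = FALSE))
    # replace each factor by its integer level code
    df_preprocessed_1[, col] <- as.integer(df_preprocessed_1[, col])
}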

In [13]:
head(df_preprocessed_1)

| | status | minute | h_a | situation | shotType | lastAction | distanceToGoal | angleToGoal |
|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` |
| 1 | 0 | 0 | 2 | 3 | 4 | 27 | 13.506088 | 29.02087 |
| 2 | 1 | 8 | 2 | 3 | 1 | 11 | 8.395523 | 29.48606 |
| 3 | 0 | 20 | 2 | 3 | 4 | 24 | 26.659276 | 15.58172 |
| 4 | 0 | 21 | 2 | 3 | 4 | 27 | 23.536532 | 16.62475 |
| 5 | 0 | 26 | 2 | 3 | 2 | 27 | 26.707659 | 14.94127 |
| 6 | 0 | 45 | 2 | 5 | 4 | 27 | 30.555083 | 13.66107 |
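The conversion above relies on a standard R idiom. Re-creating a factor with an explicit `levels` vector fixes the order of the levels, and `as.integer()` then returns each value's position in that order. The sketch below uses made-up levels: the contents of `toy_levels` are hypothetical, since the real ones are stored in `level_vector.RDS` and are not printed here. It also shows that a value absent from the stored levels becomes `NA`, so checking for `NA`s after the conversion may be worthwhile.

```r
# Hypothetical stored levels, indexed by column name.
toy_levels <- list(
  h_a = c("a", "h"),
  shotType = c("Head", "LeftFoot", "OtherBodyPart", "RightFoot")
)

# Toy data; "Volley" is deliberately not among the stored levels.
toy <- data.frame(
  h_a = factor(c("h", "a", "h")),
  shotType = factor(c("RightFoot", "Head", "Volley"))
)

for (col in names(toy_levels)) {
  # Fix the level order, then keep only the integer codes.
  toy[[col]] <- as.integer(factor(toy[[col]], levels = toy_levels[[col]]))
}

print(toy)
# Number of values per column that did not match any stored level.
print(colSums(is.na(toy)))
```

Using the stored levels rather than the levels found in the new data keeps the codes stable even when some categories do not occur in the new data.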


In [14]:
write.csv(df_preprocessed_1, './data/shotdata2023_preprocessed.csv')