# Extracting the model and preprocessing the dataset

**ATTENTION:** Notebook language: **R**

The model from the paper [Explainable Expected Goal Models for Performance Analysis in Football Analytics](https://arxiv.org/pdf/2206.07212.pdf) will be used for further research. It is a `ranger` model trained on an oversampled dataset.

## Preparing the preprocessed dataset

In [1]:
df <- read.csv('./data/raw_data.csv')
df <- df[,-1] # drop the first column of the CSV
head(df)

Unnamed: 0_level_0,league,id,minute,result,X,Y,player,h_a,player_id,situation,season,shotType,match_id,home_team,away_team,home_goals,away_goals,date,player_assisted,lastAction
Unnamed: 0_level_1,<fct>,<int>,<int>,<fct>,<dbl>,<dbl>,<fct>,<fct>,<int>,<fct>,<int>,<fct>,<int>,<fct>,<fct>,<int>,<int>,<fct>,<fct>,<fct>
1,Ligue_1,425095,7,MissedShots,0.964,0.654,Myron Boadu,h,9612,OpenPlay,2021,LeftFoot,17822,Monaco,Nantes,1,1,2021-08-06 19:00:00,,BallRecovery
2,Ligue_1,425098,13,Goal,0.925,0.431,Gelson Martins,h,7012,OpenPlay,2021,RightFoot,17822,Monaco,Nantes,1,1,2021-08-06 19:00:00,Caio Henrique,Throughball
3,Ligue_1,425100,24,BlockedShot,0.785,0.388,Kevin Volland,h,83,OpenPlay,2021,LeftFoot,17822,Monaco,Nantes,1,1,2021-08-06 19:00:00,,
4,Ligue_1,425101,24,MissedShots,0.761,0.525,Jean Lucas,h,7687,OpenPlay,2021,RightFoot,17822,Monaco,Nantes,1,1,2021-08-06 19:00:00,,Rebound
5,Ligue_1,425102,30,MissedShots,0.936,0.415,Kevin Volland,h,83,FromCorner,2021,Head,17822,Monaco,Nantes,1,1,2021-08-06 19:00:00,Jean Lucas,Aerial
6,Ligue_1,425104,42,MissedShots,0.751,0.511,Aurelien Tchouameni,h,6560,OpenPlay,2021,RightFoot,17822,Monaco,Nantes,1,1,2021-08-06 19:00:00,Caio Henrique,Pass


Remove the rows whose `result` is `OwnGoal`. This is also the first preprocessing step.

In [4]:
library(dplyr)

In [5]:
df <- df %>% filter(result != "OwnGoal")
head(df)

Unnamed: 0_level_0,league,id,minute,result,X,Y,player,h_a,player_id,situation,season,shotType,match_id,home_team,away_team,home_goals,away_goals,date,player_assisted,lastAction
Unnamed: 0_level_1,<fct>,<int>,<int>,<fct>,<dbl>,<dbl>,<fct>,<fct>,<int>,<fct>,<int>,<fct>,<int>,<fct>,<fct>,<int>,<int>,<fct>,<fct>,<fct>
1,Ligue_1,425095,7,MissedShots,0.964,0.654,Myron Boadu,h,9612,OpenPlay,2021,LeftFoot,17822,Monaco,Nantes,1,1,2021-08-06 19:00:00,,BallRecovery
2,Ligue_1,425098,13,Goal,0.925,0.431,Gelson Martins,h,7012,OpenPlay,2021,RightFoot,17822,Monaco,Nantes,1,1,2021-08-06 19:00:00,Caio Henrique,Throughball
3,Ligue_1,425100,24,BlockedShot,0.785,0.388,Kevin Volland,h,83,OpenPlay,2021,LeftFoot,17822,Monaco,Nantes,1,1,2021-08-06 19:00:00,,
4,Ligue_1,425101,24,MissedShots,0.761,0.525,Jean Lucas,h,7687,OpenPlay,2021,RightFoot,17822,Monaco,Nantes,1,1,2021-08-06 19:00:00,,Rebound
5,Ligue_1,425102,30,MissedShots,0.936,0.415,Kevin Volland,h,83,FromCorner,2021,Head,17822,Monaco,Nantes,1,1,2021-08-06 19:00:00,Jean Lucas,Aerial
6,Ligue_1,425104,42,MissedShots,0.751,0.511,Aurelien Tchouameni,h,6560,OpenPlay,2021,RightFoot,17822,Monaco,Nantes,1,1,2021-08-06 19:00:00,Caio Henrique,Pass


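Overwrite `raw_data.csv` with the filtered data to avoid conflicts in the future. `write.csv` stores row names as the first column by default, and the `df[,-1]` call in the first cell drops the first column when the file is read again.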

In [6]:
write.csv(df, './data/raw_data.csv')

### Preprocessing

In [7]:
source('./scripts/preprocess.R') # preprocessing steps from the paper

In [8]:
df_preprocessed <- preprocess(df)
head(df_preprocessed)

Unnamed: 0_level_0,status,minute,h_a,situation,shotType,lastAction,distanceToGoal,angleToGoal
Unnamed: 0_level_1,<dbl>,<dbl>,<fct>,<fct>,<fct>,<fct>,<dbl>,<dbl>
1,0,7,h,OpenPlay,LeftFoot,BallRecovery,12.554569,10.86049
2,1,13,h,OpenPlay,RightFoot,Throughball,8.497323,44.42738
3,0,24,h,OpenPlay,LeftFoot,,23.388803,17.20585
4,0,24,h,OpenPlay,RightFoot,Rebound,25.298204,16.33905
5,0,30,h,FromCorner,Head,Aerial,7.967234,44.48587
6,0,42,h,OpenPlay,RightFoot,Pass,26.241467,15.82464
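
The contents of `./scripts/preprocess.R` are not shown in this notebook. As a rough sketch of how the derived columns could be computed, the hypothetical helper `sketch_preprocess()` below derives `status`, `distanceToGoal` and `angleToGoal` from `result`, `X` and `Y`. The pitch size (105 x 68 m), the goal width (7.32 m) and the lateral goal position of 32.5 are assumptions, chosen because they reproduce the values displayed above for the first two shots; the real script may differ.

```r
# Hypothetical re-implementation; NOT the actual preprocess.R from the paper.
sketch_preprocess <- function(shots,
                              pitch_length = 105,   # metres (assumed)
                              pitch_width  = 68,    # metres (assumed)
                              goal_y       = 32.5,  # lateral goal position (assumed)
                              goal_width   = 7.32)  # metres (assumed)
{
  dx <- pitch_length - shots$X * pitch_length  # distance to the goal line
  dy <- goal_y - shots$Y * pitch_width         # lateral offset from the goal
  d2 <- dx^2 + dy^2                            # squared distance to the goal
  data.frame(
    status         = as.numeric(shots$result == "Goal"),
    distanceToGoal = sqrt(d2),
    # angle (in degrees) under which the goal mouth is seen from the shot position
    angleToGoal    = abs(atan(goal_width * dx / (d2 - (goal_width / 2)^2)) * 180 / pi)
  )
}

# First two shots from the raw data shown earlier
shots <- data.frame(result = c("MissedShots", "Goal"),
                    X = c(0.964, 0.925),
                    Y = c(0.654, 0.431))
sketch <- sketch_preprocess(shots)
print(sketch)
```

Comparing `sketch` with the first two rows of `df_preprocessed` is a quick way to check whether these assumed constants match the ones used in the real script.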


In [9]:
# Work on a copy so that the factor version stays available for the model check
df_preprocessed_1 <- df_preprocessed

In [10]:
# Names of all factor columns
factor_cols <- names(df_preprocessed_1)[sapply(df_preprocessed_1, is.factor)]

In [19]:
# Keep the levels of each factor column so the integer codes can be decoded later
levels_vector <- lapply(factor_cols, function(col){levels(df_preprocessed_1[,col])})
names(levels_vector) <- factor_cols
levels_vector

In [20]:
saveRDS(levels_vector, file = "./data/level_vector.RDS")

In [21]:
# Replace each factor with its integer code (1-based position in the saved levels)
for (col in factor_cols){
    df_preprocessed_1[,col] <- as.integer(df_preprocessed_1[,col])
}

In [22]:
head(df_preprocessed_1)

Unnamed: 0_level_0,status,minute,h_a,situation,shotType,lastAction,distanceToGoal,angleToGoal
Unnamed: 0_level_1,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<dbl>,<dbl>
1,0,7,2,3,2,2,12.554569,10.86049
2,1,13,2,3,4,39,8.497323,44.42738
3,0,24,2,3,2,24,23.388803,17.20585
4,0,24,2,3,4,29,25.298204,16.33905
5,0,30,2,2,1,1,7.967234,44.48587
6,0,42,2,3,4,27,26.241467,15.82464
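
Because `as.integer()` on a factor returns the 1-based position of each value in `levels()`, the saved levels are enough to decode the integer columns later. A minimal sketch on toy data follows; the data frame `toy` and the helper `decode_factors()` are hypothetical and not part of the notebook.

```r
# Toy stand-in for df_preprocessed; the real columns have more levels.
toy <- data.frame(
  h_a      = factor(c("h", "a", "h")),
  shotType = factor(c("Head", "LeftFoot", "Head")),
  minute   = c(7, 13, 24)
)

# Same encoding idea as in the notebook: store levels, then convert to integers.
factor_cols <- names(toy)[sapply(toy, is.factor)]
levels_list <- lapply(toy[factor_cols], levels)
encoded <- toy
for (col in factor_cols) {
  encoded[[col]] <- as.integer(encoded[[col]])
}

# Hypothetical helper: turn integer codes back into factors.
decode_factors <- function(df, levels_list) {
  for (col in names(levels_list)) {
    lv <- levels_list[[col]]
    df[[col]] <- factor(lv[df[[col]]], levels = lv)
  }
  df
}

decoded <- decode_factors(encoded, levels_list)
print(encoded)
print(all.equal(decoded, toy))
```

In the notebook's setting, `levels_list` would be replaced by the object read back with `readRDS("./data/level_vector.RDS")`.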


In [23]:
write.csv(df_preprocessed_1, './data/data_preprocessed.csv')

## Check model predictions for the first few observations

In [19]:
library(ranger)

In [20]:
model <- readRDS('./model/model.RDS')

In [21]:
d <- df_preprocessed[1:10,] # first 10 observations
d

Unnamed: 0_level_0,status,minute,h_a,situation,shotType,lastAction,distanceToGoal,angleToGoal
Unnamed: 0_level_1,<dbl>,<dbl>,<fct>,<fct>,<fct>,<fct>,<dbl>,<dbl>
1,0,7,h,OpenPlay,LeftFoot,BallRecovery,12.554569,10.86049
2,1,13,h,OpenPlay,RightFoot,Throughball,8.497323,44.42738
3,0,24,h,OpenPlay,LeftFoot,,23.388803,17.20585
4,0,24,h,OpenPlay,RightFoot,Rebound,25.298204,16.33905
5,0,30,h,FromCorner,Head,Aerial,7.967234,44.48587
6,0,42,h,OpenPlay,RightFoot,Pass,26.241467,15.82464
7,0,47,h,DirectFreekick,RightFoot,Standard,20.834178,19.88836
8,0,55,h,FromCorner,Head,Cross,10.767052,10.19874
9,0,66,h,DirectFreekick,RightFoot,Standard,29.060308,13.06822
10,0,88,h,FromCorner,RightFoot,,10.237765,35.86925


In [25]:
predict(model, d)$predictions # full element name instead of relying on partial matching

0,1
0.16794983,0.83205017
0.91103079,0.08896921
0.08782856,0.91217144
0.07900859,0.92099141
0.36426741,0.63573259
0.2804089,0.7195911
0.30693818,0.69306182
0.16386609,0.83613391
0.23476958,0.76523042
0.2365279,0.7634721
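
The two-column output has the shape that `ranger` returns for a probability forest: one column per class, with each row summing to 1. The sketch below illustrates that layout on synthetic data. It assumes the `ranger` package is installed; the toy data and the fitted `fit` object are hypothetical and unrelated to `./model/model.RDS`.

```r
library(ranger)

set.seed(1)
# Synthetic shots: class "1" tends to be closer to the goal (illustration only).
toy <- data.frame(
  status         = factor(rep(c(0, 1), each = 20)),
  distanceToGoal = c(runif(20, 15, 35), runif(20, 3, 15))
)

# probability = TRUE grows a probability forest instead of a classification forest
fit <- ranger(status ~ distanceToGoal, data = toy,
              probability = TRUE, num.trees = 50)

pred <- predict(fit, toy[1:3, ])$predictions
print(colnames(pred))  # one column per class label
print(rowSums(pred))   # class probabilities per row
```

With this layout, the probability of a given class can be selected by column name, for example `pred[, "1"]`.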


## Extracting the model to `Python` format

In [None]:
library(reticulate)

In [None]:
py_save_object(model$forest, './model/model-imported.pickle')
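
One way to sanity-check such an export is to read the pickle back with `py_load_object()` from `reticulate`. The sketch below does this round trip on a small named list instead of the real `model$forest`. It assumes a working Python installation that `reticulate` can find; `toy_forest` and its fields are hypothetical.

```r
library(reticulate)

# Toy stand-in for model$forest: a named list with a few typical R types.
toy_forest <- list(
  num.trees    = 2L,
  split.values = list(c(0.5, 1.5), c(2.5, 3.5))
)

# Save to a temporary pickle file and load it back into R.
path <- tempfile(fileext = ".pickle")
py_save_object(toy_forest, path)
restored <- py_load_object(path)
str(restored)
```

On the Python side, `reticulate` converts a named R list to a `dict` and multi-element vectors to Python lists, so the forest would be accessed by key after unpickling.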