# Training a new xG model

## Loading libraries

In [None]:
#install.packages("https://cran.r-project.org/src/contrib/rlang_1.0.4.tar.gz", repo=NULL, type="source")

In [None]:
#install.packages('ranger')
#install.packages('mltools')
#install.packages('dplyr')
#install.packages('ROSE')
#install.packages('data.table')
#install.packages('DALEX')

In [None]:
# sudo apt-get update
# sudo apt-get -y install libssl-dev
# sudo apt-get -y install libcurl4-gnutls-dev libxml2-dev
# sudo apt-get -y install libfontconfig1-dev libcurl4-openssl-dev
# sudo apt-get -y install libgit2-dev
# sudo apt-get -y install r-cran-ragg

In [None]:
#install.packages('devtools')

In [None]:
#library(devtools)

In [None]:
#devtools::install_github("ModelOriented/forester")

In [None]:
library(ranger)
library(forester)
library(mltools)
library(dplyr)
library(ROSE)
library(data.table)
library(DALEX)

## Loading data

In [None]:
raw_data <- read.csv('./data/raw_data.csv')
# Drop the first column (assumed here to be a row index left over from the export)
raw_data <- raw_data[,-1]

## Preprocessing

In [None]:
source("./scripts/preprocess.R")

In [None]:
df <- preprocess(raw_data)
head(df)

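|   | status | minute | h_a_a | h_a_h | situation_DirectFreekick | situation_FromCorner | situation_OpenPlay | situation_Penalty | situation_SetPiece | shotType_Head | ⋯ | lastAction_Smother | lastAction_Standard | lastAction_Start | lastAction_SubstitutionOff | lastAction_SubstitutionOn | lastAction_Tackle | lastAction_TakeOn | lastAction_Throughball | distanceToGoal | angleToGoal |
| - | ------ | ------ | ----- | ----- | ------------------------ | -------------------- | ------------------ | ----------------- | ------------------ | ------------- | - | ------------------ | ------------------- | ---------------- | -------------------------- | ------------------------- | ----------------- | ----------------- | ---------------------- | -------------- | ----------- |
|   | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | ⋯ | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` |
| 1 | 0 | 7  | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12.554569 | 10.86049 |
| 2 | 1 | 13 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 8.497323  | 44.42738 |
| 3 | 0 | 24 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23.388803 | 17.20585 |
| 4 | 0 | 24 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25.298204 | 16.33905 |
| 5 | 0 | 30 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.967234  | 44.48587 |
| 6 | 0 | 42 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 26.241467 | 15.82464 |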


In [None]:
colnames(df)

In [None]:
write.csv(df, './data/preprocessed_data.csv', row.names=FALSE)
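
Several columns in the preprocessed frame, such as `situation_OpenPlay` and `shotType_Head`, are 0/1 indicators produced by one-hot encoding. `preprocess.R` is not shown in this notebook, so the following is only a minimal base-R sketch of the idea on a hypothetical toy data frame (`shots` and its values are made up for illustration); the actual script may use a different function.

```r
# Hypothetical raw shots; the column naming mirrors the preprocessed output above
shots <- data.frame(
  situation = factor(c("OpenPlay", "FromCorner", "Penalty", "OpenPlay")),
  minute    = c(7, 13, 24, 30)
)

# "~ situation - 1" removes the intercept, so every factor level gets its own 0/1 column
dummies <- model.matrix(~ situation - 1, data = shots)

# model.matrix names columns "situationOpenPlay"; insert "_" to match the notebook's style
colnames(dummies) <- sub("^situation", "situation_", colnames(dummies))

# Keep the numeric column and append the indicator columns
encoded <- cbind(shots["minute"], as.data.frame(dummies))
print(encoded)
```

Each row has exactly one indicator set to 1 within the `situation_*` group, which is the layout that tree-based explainers typically expect.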

## Oversampling

In previous research, oversampling turned out to be the most effective method for improving model performance.

In [None]:
set.seed(123)
over_train_data <- ovun.sample(status ~ ., data = df, method = "over")

In [None]:
write.csv(over_train_data$data, './data/oversampled_preprocessed_data.csv', row.names=FALSE)
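
To make the idea of random oversampling concrete, here is a minimal base-R sketch on a hypothetical toy data frame (`toy` is made up, not the real shots data). It duplicates minority-class rows with replacement until both classes have the same size. This is an illustration of the concept only: `ovun.sample` in `ROSE` controls the class ratio through its own arguments and may produce a different, randomly sized result.

```r
set.seed(123)

# Hypothetical imbalanced data: 90 misses (status = 0) and 10 goals (status = 1)
toy <- data.frame(
  status         = c(rep(0, 90), rep(1, 10)),
  distanceToGoal = runif(100, min = 1, max = 40)
)

# Row indices of each class
majority_idx <- which(toy$status == 0)
minority_idx <- which(toy$status == 1)

# Resample minority rows with replacement to close the gap between the classes
n_extra   <- length(majority_idx) - length(minority_idx)
extra_idx <- sample(minority_idx, size = n_extra, replace = TRUE)

toy_over <- toy[c(majority_idx, minority_idx, extra_idx), ]

# Class counts before and after oversampling
print(table(toy$status))
print(table(toy_over$status))
```

In the notebook itself, `table(over_train_data$data$status)` is a quick way to inspect the class balance that `ovun.sample` actually produced.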

## Training

I have only `ranger` installed, which is why it is the only model trained.

The original model was trained in the same way, using `forester`, with the same seed. The best model was created by the `ranger` library.

Now a new model, with the same default hyperparameters but one-hot-encoded features, is trained so that the efficient `treeshap` library can be used.

In [None]:
set.seed(123)
over_model <- forester(data   = over_train_data$data,
                       target = "status",
                       type   = "classification")

__________________________

FORESTER

Original shape of train data frame: 515553 rows, 54 columns

_____________

NA values

There is no NA values in your data.

__________________________

CREATING MODELS

Growing trees.. Progress: 3%. Estimated remaining time: 21 minutes, 13 seconds.
Growing trees.. Progress: 5%. Estimated remaining time: 20 minutes, 8 seconds.
Growing trees.. Progress: 8%. Estimated remaining time: 19 minutes, 10 seconds.
Growing trees.. Progress: 11%. Estimated remaining time: 18 minutes, 18 seconds.
Growing trees.. Progress: 13%. Estimated remaining time: 18 minutes, 5 seconds.
Growing trees.. Progress: 16%. Estimated remaining time: 17 minutes, 14 seconds.
Growing trees.. Progress: 19%. Estimated remaining time: 16 minutes, 29 seconds.
Growing trees.. Progress: 22%. Estimated remaining time: 15 minutes, 36 seconds.
Growing trees.. Progress: 25%. Estimated remaining time: 14 minutes, 51 seconds.
Growing trees.. Progress: 28%. Estimated remaining time: 14 minutes, 

Output from training with forester:


| model   | auc        | recall     | precision  | f1         | accuracy   |
| ------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| Ranger  | 0.7421546  | 0.7192585  | 0.7535624  | 0.7360109  | 0.7421677  |
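
For reference, the threshold-based metrics in the table follow the usual confusion-matrix definitions. The sketch below computes them in base R on hypothetical toy labels (`actual` and `predicted` are made up and are not model output); how `forester` computes its report internally is not shown in this notebook.

```r
# Hypothetical true labels and predicted classes (toy values)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

# Confusion-matrix cells
tp <- sum(predicted == 1 & actual == 1)  # goals predicted as goals
fp <- sum(predicted == 1 & actual == 0)  # misses predicted as goals
fn <- sum(predicted == 0 & actual == 1)  # goals predicted as misses
tn <- sum(predicted == 0 & actual == 0)  # misses predicted as misses

recall    <- tp / (tp + fn)
precision <- tp / (tp + fp)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / length(actual)

print(c(recall = recall, precision = precision, f1 = f1, accuracy = accuracy))
```

The F1 value reported in the table is consistent with this definition: the harmonic mean of the reported precision and recall is approximately 0.736.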

## Save trained model

In [None]:
dir_name <- './model'

# Create the output folder only if it is not there yet
if (dir.exists(dir_name)) {
 cat("The folder already exists\n")
} else {
 dir.create(dir_name)
}

In [None]:
saveRDS(over_model$model, './model/model.rds')
saveRDS(over_model$test_data, './model/test_data.rds')

In [None]:
saveRDS(over_model, './model/over_model.rds')
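
Objects written with `saveRDS` can be restored later with `readRDS`. The sketch below shows the round trip on a stand-in object: a small `lm` fit on the built-in `mtcars` data replaces the `ranger` model, and a temporary file replaces `./model/model.rds`, so that the example runs without the training data.

```r
# Stand-in for over_model$model: any R object can be serialized the same way
fitted <- lm(mpg ~ wt, data = mtcars)

# Temporary path used for illustration instead of './model/model.rds'
path <- file.path(tempdir(), "model.rds")

saveRDS(fitted, path)
restored <- readRDS(path)

# The restored object should predict exactly like the original
same_predictions <- isTRUE(all.equal(predict(fitted, mtcars), predict(restored, mtcars)))
print(same_predictions)
```

In a later session, `readRDS('./model/model.rds')` would load the saved model in the same way, provided the packages that define its class are attached.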
