# ==== INTERACTIVE CLUSTERING : COMPUTATION TIME STUDY ====
> ### Stage 3 : Run main-effects and post-hoc analyses of interactive clustering computation times.

------------------------------

## READ-ME BEFORE RUNNING

### Quick Description

This notebook **runs main-effects and post-hoc analyses of interactive clustering computation time across experiments**.
- Environments are represented by subdirectories in the `/experiments` folder. A full path to an experiment environment is `/experiments/[TASK]/[DATASET]/[ALGORITHM]/`.
- Experiments must have been run and timed before their computation time can be analyzed.

Before running this notebook, **run the notebook `2_Estimate_computation_time.ipynb` to time each algorithm you have configured**.

Then, **go to the notebook `4_Plot_some_figures.ipynb` to plot figures of interactive clustering computation time**.

### Description of each step

First, **load the experiment synthesis CSV file** computed by the previous notebook.
- It contains the parameters used for each experiment and the computation time metric to compare.
- The parameters studied depend on the task:
    - _preprocessing_: `dataset_size`, `algorithm_name`;
    - _vectorization_: `dataset_size`, `algorithm_name`;
    - _sampling_: `dataset_size`, `algorithm_name`, `previous_nb_constraints`, `previous_nb_clusters`, `algorithm_nb_to_select`;
    - _clustering_: `dataset_size`, `algorithm_name`, `previous_nb_constraints`, `previous_nb_clusters`.
- Two random effects are used: `dataset_random_seed`, `algorithm_random_seed`.
- One value is modeled with these factors: `time_total`.

Then, for each task:
1. First, **fit a global model**:
    - Fit a linear mixed-effects model (`lmer`, named `GLM_fit_*` in the code) on the data with all factors.
2. Then, **evaluate the relevance of each factor**:
    - Fit the same model on the data with all factors except the one under study.
    - Perform parametric bootstrapping (or a likelihood-ratio test) to evaluate the relevance of the studied factor.
3. Finally, **fit a reduced model**:
    - Fit the model on the data with only the relevant factors.
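The three steps above can be sketched on simulated data as follows. This is only an illustration under stated assumptions: the data frame `d`, the rescaled predictor `size_k`, the helper `lrt()` and the choice of 100 bootstrap samples are hypothetical and do not come from the experiments. The bootstrap is written by hand with `simulate()` and `refit()` from `lme4` rather than with `pbnm`.

```r
library(lme4)

set.seed(42)
# Simulated stand-in for a synthesis table (hypothetical values).
d <- expand.grid(
    dataset_size = c(1000, 2000, 3000, 4000, 5000),
    dataset_random_seed = 1:5,
    algorithm_random_seed = 1:5
)
seed_effect <- rnorm(5, sd = 0.5)
d$size_k <- d$dataset_size / 1000
d$time_total <- 1 + 6 * d$size_k + seed_effect[d$dataset_random_seed] + rnorm(nrow(d), sd = 0.6)

# Step 1: full model. Step 2: same model without the studied factor.
fit_full <- lmer(time_total ~ size_k + (1 | dataset_random_seed), data = d, REML = FALSE)
fit_null <- lmer(time_total ~ 1 + (1 | dataset_random_seed), data = d, REML = FALSE)

# Likelihood-ratio statistic between two nested fits.
lrt <- function(full, null) as.numeric(2 * (logLik(full) - logLik(null)))
lrt_observed <- lrt(fit_full, fit_null)

# Parametric bootstrap: simulate responses under the null model,
# refit both models on each simulated response, and collect the statistic.
simulated <- simulate(fit_null, nsim = 100, seed = 1)
lrt_boot <- vapply(
    simulated,
    function(y) lrt(refit(fit_full, newresp = y), refit(fit_null, newresp = y)),
    numeric(1)
)

# Bootstrap p-value: share of simulated statistics at least as large as the observed one.
p_value <- (sum(lrt_boot >= lrt_observed) + 1) / (length(lrt_boot) + 1)
p_value
```

A factor whose bootstrap p-value is small would be kept for step 3; the threshold is a choice left to the analyst.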

------------------------------

## 1. IMPORT R DEPENDENCIES

In [2]:
#library("sjstats")
library("lme4")
library("emmeans")
library("pbnm")

Loading required package: Matrix



------------------------------

## 2. ANALYSIS FOR EACH TASK

------------------------------
### 2.1. ANALYSIS FOR PREPROCESSING

#### 2.1.1. LOAD SYNTHESIS CSV FILE

In [3]:
# Load analysis data.
df_analysis_preprocessing <- read.csv(
    file="../results/experiments_synthesis_for_preprocessing.csv",
    header=TRUE,  # Use the first row as headers.
    sep=";",
    skip=0,  # Number of rows to skip in the file.
)

In [4]:
# Set column type.
df_analysis_preprocessing$dataset_size <- as.numeric( df_analysis_preprocessing$dataset_size )
df_analysis_preprocessing$dataset_random_seed <- as.numeric( df_analysis_preprocessing$dataset_random_seed )
df_analysis_preprocessing$algorithm_name <- as.factor( df_analysis_preprocessing$algorithm_name )
df_analysis_preprocessing$algorithm_random_seed <- as.numeric( df_analysis_preprocessing$algorithm_random_seed )
df_analysis_preprocessing$time_total <- as.double( sub(",", ".", df_analysis_preprocessing$time_total) )
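The `sub(",", ".", ...)` conversion above suggests that the synthesis file may store decimals with a comma. Assuming that is the case, base R's `read.csv2()` (which defaults to `sep=";"` and `dec=","`) parses such values directly. The inline CSV text below is a hypothetical two-row sample, not the real file.

```r
# Hypothetical sample using ";" as separator and "," as decimal mark.
csv_text <- paste(
    "algorithm_name;dataset_size;time_total",
    "simple_prep;1000;10,645489",
    "lemma_prep;1000;6,896929",
    sep = "\n"
)

# read.csv2() expects exactly this convention, so no manual substitution is needed.
df_sample <- read.csv2(text = csv_text, stringsAsFactors = TRUE)
str(df_sample)
```

With this approach `time_total` is already numeric and `algorithm_name` is already a factor after loading.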

In [5]:
# Show an extract of analysis data.
df_analysis_preprocessing

X,dataset_name,dataset_size,dataset_random_seed,algorithm_name,algorithm_random_seed,time_start,time_stop,time_total
<chr>,<chr>,<dbl>,<dbl>,<fct>,<dbl>,<int>,<int>,<dbl>
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/filter_prep-rand_1/,bank_cards_v2,1000,1,filter_prep,1,1668606138,1668606148,10.645489
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/filter_prep-rand_2/,bank_cards_v2,1000,1,filter_prep,2,1668606138,1668606148,10.604682
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/filter_prep-rand_3/,bank_cards_v2,1000,1,filter_prep,3,1668606148,1668606155,6.896929
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/filter_prep-rand_4/,bank_cards_v2,1000,1,filter_prep,4,1668606148,1668606155,6.832410
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/filter_prep-rand_5/,bank_cards_v2,1000,1,filter_prep,5,1668606148,1668606155,6.849232
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/lemma_prep-rand_1/,bank_cards_v2,1000,1,lemma_prep,1,1668606138,1668606148,10.631167
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/lemma_prep-rand_2/,bank_cards_v2,1000,1,lemma_prep,2,1668606138,1668606148,10.619905
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/lemma_prep-rand_3/,bank_cards_v2,1000,1,lemma_prep,3,1668606138,1668606148,10.423440
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/lemma_prep-rand_4/,bank_cards_v2,1000,1,lemma_prep,4,1668606148,1668606155,6.854967
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/lemma_prep-rand_5/,bank_cards_v2,1000,1,lemma_prep,5,1668606148,1668606155,6.806319


In [6]:
# Subsets by algorithm_name.
df_analysis_preprocessing_simple_prep <- df_analysis_preprocessing[df_analysis_preprocessing$algorithm_name=="simple_prep",]
df_analysis_preprocessing_lemma_prep <- df_analysis_preprocessing[df_analysis_preprocessing$algorithm_name=="lemma_prep",]
df_analysis_preprocessing_filter_prep <- df_analysis_preprocessing[df_analysis_preprocessing$algorithm_name=="filter_prep",]

#### 2.1.2. Apply general analysis

Fit linear mixed-effects models with `lmer`: a full model, and a reduced model without `dataset_size`.

In [7]:
GLM_fit_preprocessing_FULL <- lmer(
    formula = time_total ~ algorithm_name + dataset_size + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing
)
GLM_fit_preprocessing_FULL_WITHOUT_dataset_size <- lmer(
    formula = time_total ~ algorithm_name + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing
)

"Some predictor variables are on very different scales: consider rescaling"
boundary (singular) fit: see help('isSingular')
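The two messages above can be investigated as sketched below, on simulated data. Everything here is an assumption for illustration: the data frame `d` and the rescaled column `dataset_size_k` (size in thousands) are hypothetical. Rescaling the predictor should avoid the "very different scales" warning, and `isSingular()` reports whether a variance component was estimated at its boundary.

```r
library(lme4)

set.seed(1)
# Simulated stand-in for the preprocessing synthesis table (hypothetical values).
d <- expand.grid(
    dataset_size = c(1000, 2000, 3000, 4000, 5000),
    dataset_random_seed = 1:5,
    algorithm_random_seed = 1:5
)
d$time_total <- 1 + 0.006 * d$dataset_size + rnorm(nrow(d), sd = 0.6)

# Express the size in thousands so that the predictor is on a scale closer to the response.
d$dataset_size_k <- d$dataset_size / 1000

fit <- lmer(
    formula = time_total ~ dataset_size_k + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = d
)

# TRUE means at least one random-effect variance is estimated as (near) zero.
isSingular(fit)

# Inspect which random effect carries little or no variance.
VarCorr(fit)
```

A singular fit often indicates that a random effect explains almost no variance, which is consistent with testing whether the seeds can be dropped from the model.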



In [45]:
summary(anova(GLM_fit_preprocessing_FULL, GLM_fit_preprocessing_FULL_WITHOUT_dataset_size, test="Chisq"))

refitting model(s) with ML (instead of REML)



      npar           AIC              BIC             logLik     
 Min.   :6.00   Min.   : 737.9   Min.   : 765.4   Min.   :-1354  
 1st Qu.:6.25   1st Qu.:1233.4   1st Qu.:1259.9   1st Qu.:-1106  
 Median :6.50   Median :1728.9   Median :1754.5   Median : -858  
 Mean   :6.50   Mean   :1728.9   Mean   :1754.5   Mean   : -858  
 3rd Qu.:6.75   3rd Qu.:2224.5   3rd Qu.:2249.0   3rd Qu.: -610  
 Max.   :7.00   Max.   :2720.0   Max.   :2743.5   Max.   : -362  
                                                                 
    deviance          Chisq            Df      Pr(>Chisq)
 Min.   : 723.9   Min.   :1984   Min.   :1   Min.   :0   
 1st Qu.:1219.9   1st Qu.:1984   1st Qu.:1   1st Qu.:0   
 Median :1715.9   Median :1984   Median :1   Median :0   
 Mean   :1715.9   Mean   :1984   Mean   :1   Mean   :0   
 3rd Qu.:2212.0   3rd Qu.:1984   3rd Qu.:1   3rd Qu.:0   
 Max.   :2708.0   Max.   :1984   Max.   :1   Max.   :0   
                  NA's   :1      NA's   :1   NA's   :1   
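Wrapping `anova()` in `summary()`, as in the cell above, summarizes each column of the comparison table (minimum, quartiles, maximum) instead of showing the test itself. A minimal sketch of reading the test directly is given below, on simulated data; the data frame `d` and the model names are hypothetical.

```r
library(lme4)

set.seed(7)
# Simulated stand-in for a synthesis table (hypothetical values).
d <- expand.grid(
    dataset_size_k = 1:5,
    dataset_random_seed = 1:5,
    algorithm_random_seed = 1:5
)
seed_effect <- rnorm(5, sd = 0.5)
d$time_total <- 1 + 6 * d$dataset_size_k + seed_effect[d$dataset_random_seed] + rnorm(nrow(d), sd = 0.6)

fit_full <- lmer(time_total ~ dataset_size_k + (1 | dataset_random_seed), data = d)
fit_reduced <- lmer(time_total ~ 1 + (1 | dataset_random_seed), data = d)

# anova() refits both models with ML and returns one row per model,
# ordered by number of parameters (the reduced model comes first).
comparison <- anova(fit_full, fit_reduced)
print(comparison)

# The test statistic and its p-value are stored on the row of the larger model.
chisq_statistic <- comparison[["Chisq"]][2]
p_value <- comparison[["Pr(>Chisq)"]][2]
```

Printing `comparison` shows the same columns as the summarized output above (`npar`, `AIC`, `BIC`, `logLik`, `deviance`, `Chisq`, `Df`, `Pr(>Chisq)`), one row per model.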

In [66]:
GLM_fit_preprocessing_ALGONAME_simple_prep <- lmer(
    formula = time_total ~ dataset_size + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing_simple_prep
)
GLM_fit_preprocessing_ALGONAME_simple_prep_WITHOUT_dataset_size <- lmer(
    formula = time_total ~ (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing_simple_prep
)
GLM_fit_preprocessing_ALGONAME_simple_prep_WITHOUT_dataset_random_seed <- lmer(
    formula = time_total ~ dataset_size + ( 1 | algorithm_random_seed ),
    data = df_analysis_preprocessing_simple_prep
)
GLM_fit_preprocessing_ALGONAME_simple_prep_WITHOUT_algorithm_random_seed <- lmer(
    formula = time_total ~ dataset_size + ( 1 | dataset_random_seed ),
    data = df_analysis_preprocessing_simple_prep
)

"Some predictor variables are on very different scales: consider rescaling"
boundary (singular) fit: see help('isSingular')

"Some predictor variables are on very different scales: consider rescaling"
"Some predictor variables are on very different scales: consider rescaling"


In [70]:
anova(GLM_fit_preprocessing_ALGONAME_simple_prep, GLM_fit_preprocessing_ALGONAME_simple_prep_WITHOUT_dataset_random_seed, test="Chisq")

refitting model(s) with ML (instead of REML)



model,npar,AIC,BIC,logLik,deviance,Chisq,Df,Pr(>Chisq)
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
GLM_fit_preprocessing_ALGONAME_simple_prep_WITHOUT_dataset_random_seed,4,258.2088,269.522,-125.1044,250.2088,,,
GLM_fit_preprocessing_ALGONAME_simple_prep,5,250.8383,264.9799,-120.4192,240.8383,9.370441,1.0,0.002205125


In [34]:
GLM_fit_preprocessing_ALGONAME_lemma_prep <- lmer(
    formula = time_total ~ dataset_size + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing_lemma_prep
)
GLM_fit_preprocessing_ALGONAME_lemma_prep_WITHOUT_dataset_size <- lmer(
    formula = time_total ~ (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing_lemma_prep
)

"Some predictor variables are on very different scales: consider rescaling"
boundary (singular) fit: see help('isSingular')

boundary (singular) fit: see help('isSingular')



In [72]:
summary(GLM_fit_preprocessing_ALGONAME_simple_prep_WITHOUT_algorithm_random_seed)

Linear mixed model fit by REML ['lmerMod']
Formula: time_total ~ dataset_size + (1 | dataset_random_seed)
   Data: df_analysis_preprocessing_simple_prep

REML criterion at convergence: 261.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.5547 -0.5997 -0.1386  0.3384  4.6116 

Random effects:
 Groups              Name        Variance Std.Dev.
 dataset_random_seed (Intercept) 0.07243  0.2691  
 Residual                        0.38151  0.6177  
Number of obs: 125, groups:  dataset_random_seed, 5

Fixed effects:
              Estimate Std. Error t value
(Intercept)  8.695e-01  1.768e-01   4.917
dataset_size 6.317e-03  3.906e-05 161.701

Correlation of Fixed Effects:
            (Intr)
dataset_siz -0.663
Some predictor variables are on very different scales: consider rescaling
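The fixed-effect estimates in the summary above can be read as a linear cost model. As a rough illustration that ignores the random effect of `dataset_random_seed`, the sketch below plugs the two reported estimates into a hypothetical helper, `predict_time()`, to get the expected `time_total` for a given dataset size.

```r
# Fixed-effect estimates copied from the summary above.
intercept <- 8.695e-01
slope_dataset_size <- 6.317e-03

# Hypothetical helper: expected time_total for a given dataset size (fixed effects only).
predict_time <- function(dataset_size) {
    intercept + slope_dataset_size * dataset_size
}

predict_time(c(1000, 5000))
# → 7.1865 32.4545
```

On a fitted `lmer` object, `predict(fit, newdata = ..., re.form = NA)` gives the same kind of fixed-effects-only prediction without copying coefficients by hand.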

In [35]:
GLM_fit_preprocessing_ALGONAME_filter_prep <- lmer(
    formula = time_total ~ dataset_size + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing_filter_prep
)
GLM_fit_preprocessing_ALGONAME_filter_prep_WITHOUT_dataset_size <- lmer(
    formula = time_total ~ (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing_filter_prep
)

"Some predictor variables are on very different scales: consider rescaling"
boundary (singular) fit: see help('isSingular')

boundary (singular) fit: see help('isSingular')



In [None]:
# drop1(GLM_fit_preprocessing, test="Chisq")

Fit a generalized linear model (GLM) on data with all factors minus `algorithm_name` and perform parametric bootstrap.

In [20]:

# pbgmm_preprocessing_without_algorithm_name <- pbnm( GLM_fit_preprocessing, GLM_fit_preprocessing_without_algorithm_name, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_preprocessing_without_algorithm_name)
anova_preprocessing_without_algorithm_name <- anova(GLM_fit_preprocessing, GLM_fit_preprocessing_without_algorithm_name, test="Chisq")
summary(anova_preprocessing_without_algorithm_name)

"Some predictor variables are on very different scales: consider rescaling"
refitting model(s) with ML (instead of REML)



      npar          AIC             BIC            logLik        deviance    
 Min.   :5.0   Min.   :734.1   Min.   :753.7   Min.   :-362   Min.   :723.9  
 1st Qu.:5.5   1st Qu.:735.0   1st Qu.:756.6   1st Qu.:-362   1st Qu.:723.9  
 Median :6.0   Median :736.0   Median :759.5   Median :-362   Median :724.0  
 Mean   :6.0   Mean   :736.0   Mean   :759.5   Mean   :-362   Mean   :724.0  
 3rd Qu.:6.5   3rd Qu.:736.9   3rd Qu.:762.5   3rd Qu.:-362   3rd Qu.:724.0  
 Max.   :7.0   Max.   :737.9   Max.   :765.4   Max.   :-362   Max.   :724.1  
                                                                             
     Chisq             Df      Pr(>Chisq)    
 Min.   :0.153   Min.   :2   Min.   :0.9264  
 1st Qu.:0.153   1st Qu.:2   1st Qu.:0.9264  
 Median :0.153   Median :2   Median :0.9264  
 Mean   :0.153   Mean   :2   Mean   :0.9264  
 3rd Qu.:0.153   3rd Qu.:2   3rd Qu.:0.9264  
 Max.   :0.153   Max.   :2   Max.   :0.9264  
 NA's   :1       NA's   :1   NA's   :1       

Fit a generalized linear model (GLM) on data with all factors minus `dataset_size` and perform parametric bootstrap.

In [9]:

pbgmm_preprocessing_without_dataset_size <- pbnm( GLM_fit_preprocessing, GLM_fit_preprocessing_without_dataset_size, nsim=1000, tasks=10, cores=2, seed=42 ) 
summary(pbgmm_preprocessing_without_dataset_size)
#anova_preprocessing_without_dataset_size <- anova(GLM_fit_preprocessing, GLM_fit_preprocessing_without_dataset_size, test="Chisq")
#summary(anova_preprocessing_without_dataset_size)

Parametric bootstrap testing: dataset_size = 0 
from: glm(formula = time_total ~ algorithm_name + dataset_size + (1 |  dataset_random_seed) + (1 | algorithm_random_seed), data = df_analysis_preprocessing) 
1000 samples were taken Tue Mar 14 13:15:01 2023 
1000 samples had errors, 1000 in alternate model 1000 in null model 
1000 unused samples.  0 <= P(abs(dataset_size) > |0.006311287|) <= 1

Fit a generalized linear model (GLM) on data with all factors minus `dataset_random_seed` and perform parametric bootstrap.

In [10]:
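# NOTE: glm() has no random-effect syntax, so `(1 | algorithm_random_seed)` is not fitted as a random effect here.
# The resulting glm object has no VarCorr() method, which causes the error shown below; fit the models with lmer() instead.
GLM_fit_preprocessing_without_dataset_random_seed <- glm(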
    formula = time_total ~ algorithm_name + dataset_size + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing
)
pbgmm_preprocessing_without_dataset_random_seed <- pbnm( GLM_fit_preprocessing, GLM_fit_preprocessing_without_dataset_random_seed, nsim=1000, tasks=10, cores=2, seed=42 ) 
summary(pbgmm_preprocessing_without_dataset_random_seed)
#diff_preprocessing_without_dataset_random_seed = logLik(GLM_fit_preprocessing) - logLik(GLM_fit_preprocessing_without_dataset_random_seed)
#pchisq(as.numeric(diff_preprocessing_without_dataset_random_seed), df=1, lower.tail=F)

ERROR: Error in UseMethod("VarCorr"): no applicable method for 'VarCorr' applied to an object of class "c('glm', 'lm')"


Fit a generalized linear model (GLM) on data with all factors minus `algorithm_random_seed` and perform parametric bootstrap.

In [None]:
# Random-effect terms require lmer(); glm() would not fit `(1 | dataset_random_seed)` as a random effect.
GLM_fit_preprocessing_without_algorithm_random_seed <- lmer(
    formula = time_total ~ algorithm_name + dataset_size + (1 | dataset_random_seed),
    data = df_analysis_preprocessing
)
# pbgmm_preprocessing_without_algorithm_random_seed <- pbnm( GLM_fit_preprocessing, GLM_fit_preprocessing_without_algorithm_random_seed, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_preprocessing_without_algorithm_random_seed)
# Likelihood-ratio statistic: twice the difference in log-likelihood (the full model must also be an lmer fit).
diff_preprocessing_without_algorithm_random_seed <- 2 * (logLik(GLM_fit_preprocessing) - logLik(GLM_fit_preprocessing_without_algorithm_random_seed))
pchisq(as.numeric(diff_preprocessing_without_algorithm_random_seed), df=1, lower.tail=FALSE)

------------------------------

### 2.2. ANALYSIS FOR VECTORIZATION

#### 2.2.1. LOAD SYNTHESIS CSV FILE

In [None]:
# Load analysis data.
df_analysis_vectorization <- read.csv(
    file="../results/experiments_synthesis_for_vectorization.csv",
    header=TRUE,  # Use the first row as headers.
    sep=";",
    skip=0,  # Number of rows to skip in the file.
)

In [None]:
# Set column type.
df_analysis_vectorization$dataset_size <- as.numeric( df_analysis_vectorization$dataset_size )
df_analysis_vectorization$dataset_random_seed <- as.numeric( df_analysis_vectorization$dataset_random_seed )
df_analysis_vectorization$algorithm_name <- as.factor( df_analysis_vectorization$algorithm_name )
df_analysis_vectorization$algorithm_random_seed <- as.numeric( df_analysis_vectorization$algorithm_random_seed )
df_analysis_vectorization$time_total <- as.double( sub(",", ".", df_analysis_vectorization$time_total) )

In [None]:
# Show an extract of analysis data.
df_analysis_vectorization

#### 2.2.2. Apply general analysis

Fit a linear mixed-effects model (`lmer`) on the data with all factors.

In [None]:
# lmer() is used because glm() does not fit `(1 | ...)` terms as random effects.
GLM_fit_vectorization <- lmer(
    formula = time_total ~ algorithm_name + dataset_size + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_vectorization
)
summary(GLM_fit_vectorization)

Fit the model without `algorithm_name` and compare it with the full model (likelihood-ratio test; the parametric bootstrap call is kept commented out).

In [None]:
GLM_fit_vectorization_without_algorithm_name <- lmer(
    formula = time_total ~ dataset_size + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_vectorization
)
# pbgmm_vectorization_without_algorithm_name <- pbnm( GLM_fit_vectorization, GLM_fit_vectorization_without_algorithm_name, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_vectorization_without_algorithm_name)
anova_vectorization_without_algorithm_name <- anova(GLM_fit_vectorization, GLM_fit_vectorization_without_algorithm_name, test="Chisq")
# Print the comparison table itself; summary() would only summarize its columns.
anova_vectorization_without_algorithm_name

Fit the model without `dataset_size` and compare it with the full model (likelihood-ratio test; the parametric bootstrap call is kept commented out).

In [None]:
GLM_fit_vectorization_without_dataset_size <- lmer(
    formula = time_total ~ algorithm_name + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_vectorization
)
# pbgmm_vectorization_without_dataset_size <- pbnm( GLM_fit_vectorization, GLM_fit_vectorization_without_dataset_size, nsim=1000, tasks=10, cores=2, seed=42 )
# summary(pbgmm_vectorization_without_dataset_size)
anova_vectorization_without_dataset_size <- anova(GLM_fit_vectorization, GLM_fit_vectorization_without_dataset_size, test="Chisq")
# Print the comparison table itself; summary() would only summarize its columns.
anova_vectorization_without_dataset_size

------------------------------

### 2.3. ANALYSIS FOR SAMPLING

#### 2.3.1. LOAD SYNTHESIS CSV FILE

In [None]:
# Load analysis data.
df_analysis_sampling <- read.csv(
    file="../results/experiments_synthesis_for_sampling.csv",
    header=TRUE,  # Use the first row as headers.
    sep=";",
    skip=0,  # Number of rows to skip in the file.
)

In [None]:
# Set column type.
df_analysis_sampling$dataset_size <- as.numeric( df_analysis_sampling$dataset_size )
df_analysis_sampling$dataset_random_seed <- as.numeric( df_analysis_sampling$dataset_random_seed )
df_analysis_sampling$previous_nb_constraints <- as.numeric( df_analysis_sampling$previous_nb_constraints )
df_analysis_sampling$previous_nb_clusters <- as.numeric( df_analysis_sampling$previous_nb_clusters )
df_analysis_sampling$algorithm_name <- as.factor( df_analysis_sampling$algorithm_name )
df_analysis_sampling$algorithm_random_seed <- as.numeric( df_analysis_sampling$algorithm_random_seed )
df_analysis_sampling$algorithm_nb_to_select <- as.numeric( df_analysis_sampling$algorithm_nb_to_select )
df_analysis_sampling$time_total <- as.double( sub(",", ".", df_analysis_sampling$time_total) )

In [None]:
# Show an extract of analysis data.
df_analysis_sampling

#### 2.3.2. Apply general analysis

Fit a linear mixed-effects model (`lmer`) on the data with all factors.

In [None]:
# lmer() is used because glm() does not fit `(1 | ...)` terms as random effects.
GLM_fit_sampling <- lmer(
    formula = time_total ~ algorithm_name + dataset_size + previous_nb_constraints + previous_nb_clusters + algorithm_nb_to_select + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_sampling
)
summary(GLM_fit_sampling)

Fit the model without `algorithm_name` and compare it with the full model (likelihood-ratio test; the parametric bootstrap call is kept commented out).

In [None]:
GLM_fit_sampling_without_algorithm_name <- lmer(
    formula = time_total ~ dataset_size + previous_nb_constraints + previous_nb_clusters + algorithm_nb_to_select + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_sampling
)
# pbgmm_sampling_without_algorithm_name <- pbnm( GLM_fit_sampling, GLM_fit_sampling_without_algorithm_name, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_sampling_without_algorithm_name)
anova_sampling_without_algorithm_name <- anova(GLM_fit_sampling, GLM_fit_sampling_without_algorithm_name, test="Chisq")
# Print the comparison table itself; summary() would only summarize its columns.
anova_sampling_without_algorithm_name

Fit the model without `dataset_size` and compare it with the full model.

In [None]:
GLM_fit_sampling_without_dataset_size <- lmer(
    formula = time_total ~ algorithm_name + previous_nb_constraints + previous_nb_clusters + algorithm_nb_to_select + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_sampling
)
# pbgmm_sampling_without_dataset_size <- pbnm( GLM_fit_sampling, GLM_fit_sampling_without_dataset_size, nsim=1000, tasks=10, cores=2, seed=42 )
# summary(pbgmm_sampling_without_dataset_size)
# anova() (not aov()) compares two fitted models.
anova_sampling_without_dataset_size <- anova(GLM_fit_sampling, GLM_fit_sampling_without_dataset_size, test="Chisq")
anova_sampling_without_dataset_size

Fit the model without `previous_nb_constraints` and compare it with the full model.

In [None]:
GLM_fit_sampling_without_previous_nb_constraints <- lmer(
    formula = time_total ~ algorithm_name + dataset_size + previous_nb_clusters + algorithm_nb_to_select + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_sampling
)
# pbgmm_sampling_without_previous_nb_constraints <- pbnm( GLM_fit_sampling, GLM_fit_sampling_without_previous_nb_constraints, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_sampling_without_previous_nb_constraints)
anova_sampling_without_previous_nb_constraints <- anova(GLM_fit_sampling, GLM_fit_sampling_without_previous_nb_constraints, test="Chisq")
anova_sampling_without_previous_nb_constraints

Fit the model without `previous_nb_clusters` and compare it with the full model.

In [None]:
GLM_fit_sampling_without_previous_nb_clusters <- lmer(
    formula = time_total ~ algorithm_name + dataset_size + previous_nb_constraints + algorithm_nb_to_select + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_sampling
)
# pbgmm_sampling_without_previous_nb_clusters <- pbnm( GLM_fit_sampling, GLM_fit_sampling_without_previous_nb_clusters, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_sampling_without_previous_nb_clusters)
anova_sampling_without_previous_nb_clusters <- anova(GLM_fit_sampling, GLM_fit_sampling_without_previous_nb_clusters, test="Chisq")
anova_sampling_without_previous_nb_clusters

Fit the model without `algorithm_nb_to_select` and compare it with the full model.

In [None]:
GLM_fit_sampling_without_algorithm_nb_to_select <- lmer(
    formula = time_total ~ algorithm_name + dataset_size + previous_nb_constraints + previous_nb_clusters + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_sampling
)
# pbgmm_sampling_without_algorithm_nb_to_select <- pbnm( GLM_fit_sampling, GLM_fit_sampling_without_algorithm_nb_to_select, nsim=1000, tasks=10, cores=2, seed=42 )
# summary(pbgmm_sampling_without_algorithm_nb_to_select)
anova_sampling_without_algorithm_nb_to_select <- anova(GLM_fit_sampling, GLM_fit_sampling_without_algorithm_nb_to_select, test="Chisq")
anova_sampling_without_algorithm_nb_to_select

------------------------------

### 2.4. ANALYSIS FOR CLUSTERING

#### 2.4.1. LOAD SYNTHESIS CSV FILE

In [None]:
# Load analysis data.
df_analysis_clustering <- read.csv(
    file="../results/experiments_synthesis_for_clustering.csv",
    header=TRUE,  # Use the first row as headers.
    sep=";",
    skip=0,  # Number of rows to skip in the file.
)

In [None]:
# Set column type.
df_analysis_clustering$dataset_size <- as.numeric( df_analysis_clustering$dataset_size )
df_analysis_clustering$dataset_random_seed <- as.numeric( df_analysis_clustering$dataset_random_seed )
df_analysis_clustering$previous_nb_constraints <- as.numeric( df_analysis_clustering$previous_nb_constraints )
df_analysis_clustering$algorithm_name <- as.factor( df_analysis_clustering$algorithm_name )
df_analysis_clustering$algorithm_random_seed <- as.numeric( df_analysis_clustering$algorithm_random_seed )
df_analysis_clustering$algorithm_nb_clusters <- as.numeric( df_analysis_clustering$algorithm_nb_clusters )
df_analysis_clustering$time_total <- as.double( sub(",", ".", df_analysis_clustering$time_total) )

In [None]:
# Show an extract of analysis data.
df_analysis_clustering

#### 2.4.2. Apply general analysis

Fit a linear mixed-effects model (`lmer`) on the data with all factors.

In [None]:
# lmer() is used because glm() does not fit `(1 | ...)` terms as random effects.
GLM_fit_clustering <- lmer(
    formula = time_total ~ algorithm_name + dataset_size + previous_nb_constraints + algorithm_nb_clusters + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_clustering
)
summary(GLM_fit_clustering)