# ==== INTERACTIVE CLUSTERING : COMPUTATION TIME STUDY ====
> ### Stage 3 : Apply main effects and post-hoc analysis on interactive clustering computation times.

------------------------------
## READ-ME BEFORE RUNNING

### Quick Description

This notebook is **aimed at running main effects and post-hoc analysis on interactive clustering computation time over experiments**.
- Environments are represented by subdirectories in the `/experiments` folder. A full path to an experiment environment is `/experiments/[TASK]/[DATASET]/[ALGORITHM]/`.
- Experiments have to be run and evaluated in order to analyze computation time.

Before running, **run the notebook `2_Estimate_computation_time.ipynb` to run each algorithm you have set**.

Then, **go to the notebook `4_Plot_some_figures.ipynb` to create figures on interactive clustering computation time**.

### Description of each step

First of all, **load the experiment synthesis CSV file** that has been computed with the previous notebook.
- It contains the parameters used for each experiment and the computation time metric to compare.
- Several parameters are studied depending on the task:
    - _preprocessing_: `dataset_size`, `algorithm_name`;
    - _vectorization_: `dataset_size`, `algorithm_name`;
    - _sampling_: `dataset_size`, `algorithm_name`, `previous_nb_constraints`, `previous_nb_clusters`, `algorithm_nb_to_select`;
    - _clustering_: `dataset_size`, `algorithm_name`, `previous_nb_constraints`, `algorithm_nb_clusters`.
- Two random effects are used: `dataset_random_seed`, `algorithm_random_seed`.
- One value is modeled with these factors: `time_total`.

Then, for each task:
1. First, **compute a global model**:
    - Fit a generalized linear model (GLM) on data with all factors.
2. Then, **evaluate the relevance of each factor**:
    - Fit a generalized linear model (GLM) on data with all factors except the one you want to study.
    - Perform parametric bootstrapping to evaluate the relevance of the studied factor.
3. Finally, **compute a relevant model**:
    - Fit a generalized linear model (GLM) on data with only relevant factors.
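
The three steps above can be sketched on synthetic data. This is only an illustration under stated assumptions: the data frame `toy`, its values, and the 0.05 threshold are hypothetical, the random effects are left out, and the comparison uses a chi-square test with `anova()` rather than a parametric bootstrap.

```r
# Synthetic data (hypothetical values, for illustration only).
set.seed(42)
n <- 120
toy <- data.frame(
    dataset_size = rep(c(1000, 2000, 3000), each = n / 3),
    algorithm_name = factor(rep(c("simple_prep", "lemma_prep", "filter_prep"), times = n / 3))
)
# Time grows with dataset size; the algorithm has no effect by construction.
toy$time_total <- 1 + 0.006 * toy$dataset_size + rnorm(n, sd = 0.5)

# 1. Global model with all factors.
fit_full <- glm(time_total ~ algorithm_name + dataset_size, data = toy)

# 2. Relevance of each factor: refit without it, then compare with the full model.
fit_without_algorithm_name <- glm(time_total ~ dataset_size, data = toy)
fit_without_dataset_size <- glm(time_total ~ algorithm_name, data = toy)
p_algorithm_name <- anova(fit_without_algorithm_name, fit_full, test = "Chisq")[2, "Pr(>Chi)"]
p_dataset_size <- anova(fit_without_dataset_size, fit_full, test = "Chisq")[2, "Pr(>Chi)"]

# 3. Relevant model: keep only the factors whose removal degrades the fit.
p_values <- c(algorithm_name = p_algorithm_name, dataset_size = p_dataset_size)
relevant <- names(p_values)[p_values < 0.05]
fit_relevant <- glm(reformulate(relevant, response = "time_total"), data = toy)
print(formula(fit_relevant))
```

The threshold and the test are design choices; the same skeleton applies if the p-values come from a bootstrap instead.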

------------------------------
## 1. IMPORT R DEPENDENCIES

In [1]:
#library("sjstats")
library("lme4")
library("emmeans")
library("pbnm")

Loading required package: Matrix



------------------------------
## 2. ANALYSIS FOR EACH TASK

------------------------------
### 2.1. ANALYSIS FOR PREPROCESSING

#### 2.1.1. LOAD SYNTHESIS CSV FILE

In [2]:
# Load analysis data.
df_analysis_preprocessing <- read.csv(
    file="../results/experiments_synthesis_for_preprocessing.csv",
    header=TRUE,  # Use the first row as headers.
    sep=";",
    skip=0,  # Number of rows to skip in the file.
)

In [3]:
# Set column type.
df_analysis_preprocessing$dataset_size <- as.numeric( df_analysis_preprocessing$dataset_size )
df_analysis_preprocessing$dataset_random_seed <- as.numeric( df_analysis_preprocessing$dataset_random_seed )
df_analysis_preprocessing$algorithm_name <- as.factor( df_analysis_preprocessing$algorithm_name )
df_analysis_preprocessing$algorithm_random_seed <- as.numeric( df_analysis_preprocessing$algorithm_random_seed )
df_analysis_preprocessing$time_total <- as.double( sub(",", ".", df_analysis_preprocessing$time_total) )
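
# The conversion of `time_total` above handles a decimal comma by hand.
# A possible alternative, shown here as a sketch on two hypothetical rows,
# is to let read.csv() parse the decimal comma with its `dec` argument.

```r
# Two hypothetical rows using ";" as separator and "," as decimal mark.
csv_text <- "dataset_size;algorithm_name;time_total\n1000;simple_prep;10,645489\n2000;lemma_prep;6,896929"

# Option A: read the column as text, then convert it explicitly.
df_a <- read.csv(text = csv_text, sep = ";", stringsAsFactors = FALSE)
df_a$time_total <- as.double(sub(",", ".", df_a$time_total, fixed = TRUE))

# Option B: let read.csv() parse the decimal comma directly.
df_b <- read.csv(text = csv_text, sep = ";", dec = ",")

# Both options should give the same numeric column.
print(all.equal(df_a$time_total, df_b$time_total))
```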

In [4]:
# Show an extract of analysis data.
df_analysis_preprocessing

X,dataset_name,dataset_size,dataset_random_seed,algorithm_name,algorithm_random_seed,time_start,time_stop,time_total
<chr>,<chr>,<dbl>,<dbl>,<fct>,<dbl>,<int>,<int>,<dbl>
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/filter_prep-rand_1/,bank_cards_v2,1000,1,filter_prep,1,1668606138,1668606148,10.645489
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/filter_prep-rand_2/,bank_cards_v2,1000,1,filter_prep,2,1668606138,1668606148,10.604682
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/filter_prep-rand_3/,bank_cards_v2,1000,1,filter_prep,3,1668606148,1668606155,6.896929
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/filter_prep-rand_4/,bank_cards_v2,1000,1,filter_prep,4,1668606148,1668606155,6.832410
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/filter_prep-rand_5/,bank_cards_v2,1000,1,filter_prep,5,1668606148,1668606155,6.849232
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/lemma_prep-rand_1/,bank_cards_v2,1000,1,lemma_prep,1,1668606138,1668606148,10.631167
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/lemma_prep-rand_2/,bank_cards_v2,1000,1,lemma_prep,2,1668606138,1668606148,10.619905
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/lemma_prep-rand_3/,bank_cards_v2,1000,1,lemma_prep,3,1668606138,1668606148,10.423440
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/lemma_prep-rand_4/,bank_cards_v2,1000,1,lemma_prep,4,1668606148,1668606155,6.854967
../experiments/preprocessing/bank_cards_v2-size_1000-rand_1/lemma_prep-rand_5/,bank_cards_v2,1000,1,lemma_prep,5,1668606148,1668606155,6.806319


#### 2.1.2. Apply general analysis

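Fit a generalized linear model (GLM) on data with all factors. Note that `glm()` does not estimate random effects: the two `(1 | ...)` terms appear as `NA` coefficients in the summary.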

In [25]:
GLM_fit_preprocessing <- glm(
    formula = time_total ~ algorithm_name + dataset_size + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing,
)
summary(GLM_fit_preprocessing)


Call:
glm(formula = time_total ~ algorithm_name + dataset_size + (1 | 
    dataset_random_seed) + (1 | algorithm_random_seed), data = df_analysis_preprocessing)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0787  -0.3941  -0.1604   0.1892   3.4381  

Coefficients: (2 not defined because of singularities)
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    9.103e-01  9.459e-02   9.623   <2e-16 ***
algorithm_namelemma_prep      -2.850e-02  8.461e-02  -0.337    0.736    
algorithm_namesimple_prep     -2.429e-02  8.461e-02  -0.287    0.774    
dataset_size                   6.311e-03  2.442e-05 258.407   <2e-16 ***
1 | dataset_random_seedTRUE           NA         NA      NA       NA    
1 | algorithm_random_seedTRUE         NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.447394)

    Null deviance: 30040
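
The `NA` coefficients in this summary come from how `glm()` reads the formula: `1 | dataset_random_seed` is evaluated as a logical OR, which is `TRUE` on every row and is therefore aliased with the intercept. The sketch below reproduces this on synthetic data; the column names `x`, `y` and `seed_id` are hypothetical. The last part shows `lme4::lmer()` as one possible way to fit an actual random intercept; it only runs if `lme4` is installed and is not necessarily the model intended by this study.

```r
set.seed(1)
group_effect <- rnorm(5)
toy <- data.frame(
    x = rnorm(60),
    seed_id = rep(1:5, times = 12)
)
toy$y <- 2 * toy$x + group_effect[toy$seed_id] + rnorm(60)

# Inside glm(), "1 | seed_id" is a logical OR, TRUE on every row.
fit_glm <- glm(y ~ x + (1 | seed_id), data = toy)
print(coef(fit_glm))  # the coefficient of the "(1 | seed_id)" term is NA

# The fit is therefore identical to the model without that term.
fit_plain <- glm(y ~ x, data = toy)
print(all.equal(deviance(fit_glm), deviance(fit_plain)))

# A mixed model needs a dedicated fitting function, for example lme4::lmer().
if (requireNamespace("lme4", quietly = TRUE)) {
    fit_lmer <- lme4::lmer(y ~ x + (1 | seed_id), data = toy, REML = FALSE)
    print(lme4::VarCorr(fit_lmer))
}
```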

In [6]:
drop1(GLM_fit_preprocessing, test="Chisq")

,Df,Deviance,AIC,scaled dev.,Pr(>Chi)
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
<none>,,165.9832,768.564,,
algorithm_name,2.0,166.0423,764.6977,0.1336881,0.935341
dataset_size,1.0,30040.2446,2715.9666,1949.4025814,0.0
1 | dataset_random_seed,0.0,165.9832,768.564,0.0,
1 | algorithm_random_seed,0.0,165.9832,768.564,0.0,


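Fit a generalized linear model (GLM) on data with all factors minus `algorithm_name`, then compare it with the full model. The parametric bootstrap (`pbnm`) is left commented out; a chi-square test with `anova()` is used instead.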

In [24]:
GLM_fit_preprocessing_without_algorithm_name <- glm(
    formula = time_total ~ dataset_size + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing
)
# pbgmm_preprocessing_without_algorithm_name <- pbnm( GLM_fit_preprocessing, GLM_fit_preprocessing_without_algorithm_name, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_preprocessing_without_algorithm_name)
anova_preprocessing_without_algorithm_name <- anova(GLM_fit_preprocessing, GLM_fit_preprocessing_without_algorithm_name, test="Chisq")
summary(anova_preprocessing_without_algorithm_name)

refitting model(s) with ML (instead of REML)



      npar        AIC             BIC            logLik          deviance    
 Min.   :3   Min.   :737.9   Min.   :765.4   Min.   :-379.3   Min.   :723.9  
 1st Qu.:4   1st Qu.:744.6   1st Qu.:768.2   1st Qu.:-375.0   1st Qu.:732.6  
 Median :5   Median :751.3   Median :770.9   Median :-370.7   Median :741.3  
 Mean   :5   Mean   :751.3   Mean   :770.9   Mean   :-370.7   Mean   :741.3  
 3rd Qu.:6   3rd Qu.:758.0   3rd Qu.:773.7   3rd Qu.:-366.3   3rd Qu.:750.0  
 Max.   :7   Max.   :764.7   Max.   :776.5   Max.   :-362.0   Max.   :758.7  
                                                                             
     Chisq             Df      Pr(>Chisq)   
 Min.   :34.79   Min.   :4   Min.   :5e-07  
 1st Qu.:34.79   1st Qu.:4   1st Qu.:5e-07  
 Median :34.79   Median :4   Median :5e-07  
 Mean   :34.79   Mean   :4   Mean   :5e-07  
 3rd Qu.:34.79   3rd Qu.:4   3rd Qu.:5e-07  
 Max.   :34.79   Max.   :4   Max.   :5e-07  
 NA's   :1       NA's   :1   NA's   :1      
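
The parametric bootstrap mentioned for this step can be sketched in base R. This is not the implementation used by `pbnm`: it is a minimal illustration on synthetic data with fixed effects only, and the data frame `toy`, the helper `lr_stat` and the number of simulations are assumptions made for the example.

```r
set.seed(42)
n <- 90
toy <- data.frame(
    dataset_size = rep(c(1000, 2000, 3000), each = n / 3),
    algorithm_name = factor(rep(c("a", "b", "c"), times = n / 3))
)
toy$time_total <- 1 + 0.006 * toy$dataset_size + rnorm(n, sd = 0.5)

# Likelihood ratio statistic between a full and a reduced (nested) model.
lr_stat <- function(data) {
    full <- lm(time_total ~ algorithm_name + dataset_size, data = data)
    reduced <- lm(time_total ~ dataset_size, data = data)
    as.numeric(2 * (logLik(full) - logLik(reduced)))
}
observed <- lr_stat(toy)

# Simulate responses under the reduced model (the null hypothesis), then refit both models.
reduced_fit <- lm(time_total ~ dataset_size, data = toy)
nsim <- 200
simulated <- simulate(reduced_fit, nsim = nsim, seed = 42)
boot_stats <- vapply(seq_len(nsim), function(i) {
    boot_data <- toy
    boot_data$time_total <- simulated[[i]]
    lr_stat(boot_data)
}, numeric(1))

# Bootstrap p-value: share of simulated statistics at least as large as the observed one.
p_boot <- (1 + sum(boot_stats >= observed)) / (nsim + 1)
print(p_boot)
```

Unlike the chi-square test, this p-value does not rely on the asymptotic distribution of the statistic, at the cost of refitting both models `nsim` times.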

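Fit a generalized linear model (GLM) on data with all factors minus `dataset_size`, then compare it with the full model. The parametric bootstrap (`pbnm`) is left commented out; a chi-square test with `anova()` is used instead.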

In [8]:
GLM_fit_preprocessing_without_dataset_size <- glm(
    formula = time_total ~ algorithm_name + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing
)
# pbgmm_preprocessing_without_dataset_size <- pbnm( GLM_fit_preprocessing, GLM_fit_preprocessing_without_dataset_size, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_preprocessing_without_dataset_size)
anova_preprocessing_without_dataset_size <- anova(GLM_fit_preprocessing, GLM_fit_preprocessing_without_dataset_size, test="Chisq")
summary(anova_preprocessing_without_dataset_size)

   Resid. Df       Resid. Dev          Df        Deviance         Pr(>Chi)
 Min.   :371.0   Min.   :  166   Min.   :-1   Min.   :-29874   Min.   :0  
 1st Qu.:371.2   1st Qu.: 7635   1st Qu.:-1   1st Qu.:-29874   1st Qu.:0  
 Median :371.5   Median :15103   Median :-1   Median :-29874   Median :0  
 Mean   :371.5   Mean   :15103   Mean   :-1   Mean   :-29874   Mean   :0  
 3rd Qu.:371.8   3rd Qu.:22572   3rd Qu.:-1   3rd Qu.:-29874   3rd Qu.:0  
 Max.   :372.0   Max.   :30040   Max.   :-1   Max.   :-29874   Max.   :0  
                                 NA's   :1    NA's   :1        NA's   :1  

In [21]:
anova_preprocessing_without_dataset_size

,Resid. Df,Resid. Dev,Df,Deviance,Pr(>Chi)
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,371,165.9832,,,
2,372,30040.2446,-1.0,-29874.26,0.0


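Fit a generalized linear model (GLM) on data with all factors minus `dataset_random_seed`, then compare it with the full model. The parametric bootstrap (`pbnm`) is left commented out; a chi-square test on the log-likelihood difference is used instead.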

In [19]:
GLM_fit_preprocessing_without_dataset_random_seed <- glm(
    formula = time_total ~ algorithm_name + dataset_size + (1 | algorithm_random_seed),
    data = df_analysis_preprocessing
)
# pbgmm_preprocessing_without_dataset_random_seed <- pbnm( GLM_fit_preprocessing, GLM_fit_preprocessing_without_dataset_random_seed, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_preprocessing_without_dataset_random_seed)
# Likelihood ratio statistic: twice the log-likelihood difference between the two models.
diff_preprocessing_without_dataset_random_seed <- 2 * ( logLik(GLM_fit_preprocessing) - logLik(GLM_fit_preprocessing_without_dataset_random_seed) )
pchisq(as.numeric(diff_preprocessing_without_dataset_random_seed), df=1, lower.tail=FALSE)

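Fit a generalized linear model (GLM) on data with all factors minus `algorithm_random_seed`, then compare it with the full model. The parametric bootstrap (`pbnm`) is left commented out; a chi-square test on the log-likelihood difference is used instead.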

In [20]:
GLM_fit_preprocessing_without_algorithm_random_seed <- glm(
    formula = time_total ~ algorithm_name + dataset_size + (1 | dataset_random_seed),
    data = df_analysis_preprocessing
)
# pbgmm_preprocessing_without_algorithm_random_seed <- pbnm( GLM_fit_preprocessing, GLM_fit_preprocessing_without_algorithm_random_seed, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_preprocessing_without_algorithm_random_seed)
# Likelihood ratio statistic: twice the log-likelihood difference between the two models.
diff_preprocessing_without_algorithm_random_seed <- 2 * ( logLik(GLM_fit_preprocessing) - logLik(GLM_fit_preprocessing_without_algorithm_random_seed) )
pchisq(as.numeric(diff_preprocessing_without_algorithm_random_seed), df=1, lower.tail=FALSE)
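
A likelihood ratio test between two nested models uses twice the difference of their log-likelihoods, compared with a chi-square distribution whose degrees of freedom equal the difference in the number of estimated parameters. The sketch below shows this on synthetic data; the columns `x1`, `x2` and `y` are hypothetical and unrelated to the experiment files.

```r
set.seed(7)
toy <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
toy$y <- 1 + 2 * toy$x1 + rnorm(80)

fit_full <- glm(y ~ x1 + x2, data = toy)
fit_reduced <- glm(y ~ x1, data = toy)

# Test statistic: twice the log-likelihood difference (non-negative for nested models).
lr_statistic <- as.numeric(2 * (logLik(fit_full) - logLik(fit_reduced)))
# Degrees of freedom: difference in the number of estimated parameters.
lr_df <- attr(logLik(fit_full), "df") - attr(logLik(fit_reduced), "df")
p_value <- pchisq(lr_statistic, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_statistic, df = lr_df, p_value = p_value))
```

Reading the degrees of freedom from the `logLik` objects avoids hard-coding `df=1` when the removed term has several levels.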

------------------------------
### 2.2. ANALYSIS FOR VECTORIZATION

#### 2.2.1. LOAD SYNTHESIS CSV FILE

In [None]:
# Load analysis data.
df_analysis_vectorization <- read.csv(
    file="../results/experiments_synthesis_for_vectorization.csv",
    header=TRUE,  # Use the first row as headers.
    sep=";",
    skip=0,  # Number of rows to skip in the file.
)

In [None]:
# Set column type.
df_analysis_vectorization$dataset_size <- as.numeric( df_analysis_vectorization$dataset_size )
df_analysis_vectorization$dataset_random_seed <- as.numeric( df_analysis_vectorization$dataset_random_seed )
df_analysis_vectorization$algorithm_name <- as.factor( df_analysis_vectorization$algorithm_name )
df_analysis_vectorization$algorithm_random_seed <- as.numeric( df_analysis_vectorization$algorithm_random_seed )
df_analysis_vectorization$time_total <- as.double( sub(",", ".", df_analysis_vectorization$time_total) )

In [None]:
# Show an extract of analysis data.
df_analysis_vectorization

#### 2.2.2. Apply general analysis

Fit a generalized linear model (GLM) on data with all factors.

In [None]:
GLM_fit_vectorization <- glm(
    formula = time_total ~ algorithm_name + dataset_size + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_vectorization
)
summary(GLM_fit_vectorization)

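Fit a generalized linear model (GLM) on data with all factors minus `algorithm_name`, then compare it with the full model. The parametric bootstrap (`pbnm`) is left commented out; a chi-square test with `anova()` is used instead.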

In [None]:
GLM_fit_vectorization_without_algorithm_name <- glm(
    formula = time_total ~ dataset_size + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_vectorization
)
# pbgmm_vectorization_without_algorithm_name <- pbnm( GLM_fit_vectorization, GLM_fit_vectorization_without_algorithm_name, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_vectorization_without_algorithm_name)
anova_vectorization_without_algorithm_name <- anova(GLM_fit_vectorization, GLM_fit_vectorization_without_algorithm_name, test="Chisq")
# Print the comparison table (summary() would only summarize its columns).
anova_vectorization_without_algorithm_name

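Fit a generalized linear model (GLM) on data with all factors minus `dataset_size`, then compare it with the full model. The parametric bootstrap (`pbnm`) is left commented out; a chi-square test with `anova()` is used instead.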

In [None]:
GLM_fit_vectorization_without_dataset_size <- glm(
    formula = time_total ~ algorithm_name + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_vectorization
)
# pbgmm_vectorization_without_dataset_size <- pbnm( GLM_fit_vectorization, GLM_fit_vectorization_without_dataset_size, nsim=1000, tasks=10, cores=2, seed=42 )
# summary(pbgmm_vectorization_without_dataset_size)
anova_vectorization_without_dataset_size <- anova(GLM_fit_vectorization, GLM_fit_vectorization_without_dataset_size, test="Chisq")
# Print the comparison table (summary() would only summarize its columns).
anova_vectorization_without_dataset_size

------------------------------
### 2.3. ANALYSIS FOR SAMPLING

#### 2.3.1. LOAD SYNTHESIS CSV FILE

In [None]:
# Load analysis data.
df_analysis_sampling <- read.csv(
    file="../results/experiments_synthesis_for_sampling.csv",
    header=TRUE,  # Use the first row as headers.
    sep=";",
    skip=0,  # Number of rows to skip in the file.
)

In [None]:
# Set column type.
df_analysis_sampling$dataset_size <- as.numeric( df_analysis_sampling$dataset_size )
df_analysis_sampling$dataset_random_seed <- as.numeric( df_analysis_sampling$dataset_random_seed )
df_analysis_sampling$previous_nb_constraints <- as.numeric( df_analysis_sampling$previous_nb_constraints )
df_analysis_sampling$previous_nb_clusters <- as.numeric( df_analysis_sampling$previous_nb_clusters )
df_analysis_sampling$algorithm_name <- as.factor( df_analysis_sampling$algorithm_name )
df_analysis_sampling$algorithm_random_seed <- as.numeric( df_analysis_sampling$algorithm_random_seed )
df_analysis_sampling$algorithm_nb_to_select <- as.numeric( df_analysis_sampling$algorithm_nb_to_select )
df_analysis_sampling$time_total <- as.double( sub(",", ".", df_analysis_sampling$time_total) )

In [None]:
# Show an extract of analysis data.
df_analysis_sampling

#### 2.3.2. Apply general analysis

Fit a generalized linear model (GLM) on data.

In [None]:
GLM_fit_sampling <- glm(
    formula = time_total ~ algorithm_name + dataset_size + previous_nb_constraints + previous_nb_clusters + algorithm_nb_to_select + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_sampling
)
summary(GLM_fit_sampling)

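Fit a generalized linear model (GLM) on data with all factors minus `algorithm_name`, then compare it with the full model. The parametric bootstrap (`pbnm`) is left commented out; a chi-square test with `anova()` is used instead.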

In [None]:
GLM_fit_sampling_without_algorithm_name <- glm(
    formula = time_total ~ dataset_size + previous_nb_constraints + previous_nb_clusters + algorithm_nb_to_select + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_sampling
)
# pbgmm_sampling_without_algorithm_name <- pbnm( GLM_fit_sampling, GLM_fit_sampling_without_algorithm_name, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_sampling_without_algorithm_name)
anova_sampling_without_algorithm_name <- anova(GLM_fit_sampling, GLM_fit_sampling_without_algorithm_name, test="Chisq")
# Print the comparison table (summary() would only summarize its columns).
anova_sampling_without_algorithm_name

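Fit a generalized linear model (GLM) on data with all factors minus `dataset_size`, then compare it with the full model. The parametric bootstrap (`pbnm`) is left commented out; a chi-square test with `anova()` is used instead.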

In [None]:
GLM_fit_sampling_without_dataset_size <- glm(
    formula = time_total ~ algorithm_name + previous_nb_constraints + previous_nb_clusters + algorithm_nb_to_select + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_sampling
)
# pbgmm_sampling_without_dataset_size <- pbnm( GLM_fit_sampling, GLM_fit_sampling_without_dataset_size, nsim=1000, tasks=10, cores=2, seed=42 )
# summary(pbgmm_sampling_without_dataset_size)
# Compare the two fitted models with anova(); aov() fits a new model and cannot compare two fits.
anova_sampling_without_dataset_size <- anova(GLM_fit_sampling, GLM_fit_sampling_without_dataset_size, test="Chisq")
anova_sampling_without_dataset_size

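Fit a generalized linear model (GLM) on data with all factors minus `previous_nb_constraints`, then compare it with the full model. The parametric bootstrap (`pbnm`) is left commented out; a chi-square test with `anova()` is used instead.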

In [None]:
GLM_fit_sampling_without_previous_nb_constraints <- glm(
    formula = time_total ~ algorithm_name + dataset_size + previous_nb_clusters + algorithm_nb_to_select + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_sampling
)
# pbgmm_sampling_without_previous_nb_constraints <- pbnm( GLM_fit_sampling, GLM_fit_sampling_without_previous_nb_constraints, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_sampling_without_previous_nb_constraints)
anova_sampling_without_previous_nb_constraints <- anova(GLM_fit_sampling, GLM_fit_sampling_without_previous_nb_constraints, test="Chisq")
# Print the comparison table (summary() would only summarize its columns).
anova_sampling_without_previous_nb_constraints

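Fit a generalized linear model (GLM) on data with all factors minus `previous_nb_clusters`, then compare it with the full model. The parametric bootstrap (`pbnm`) is left commented out; a chi-square test with `anova()` is used instead.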

In [None]:
GLM_fit_sampling_without_previous_nb_clusters <- glm(
    formula = time_total ~ algorithm_name + dataset_size + previous_nb_constraints + algorithm_nb_to_select + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_sampling
)
# pbgmm_sampling_without_previous_nb_clusters <- pbnm( GLM_fit_sampling, GLM_fit_sampling_without_previous_nb_clusters, nsim=1000, tasks=10, cores=2, seed=42 ) 
# summary(pbgmm_sampling_without_previous_nb_clusters)
anova_sampling_without_previous_nb_clusters <- anova(GLM_fit_sampling, GLM_fit_sampling_without_previous_nb_clusters, test="Chisq")
# Print the comparison table (summary() would only summarize its columns).
anova_sampling_without_previous_nb_clusters

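Fit a generalized linear model (GLM) on data with all factors minus `algorithm_nb_to_select`, then compare it with the full model. The parametric bootstrap (`pbnm`) is left commented out; a chi-square test with `anova()` is used instead.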

In [None]:
GLM_fit_sampling_without_algorithm_nb_to_select <- glm(
    formula = time_total ~ algorithm_name + dataset_size + previous_nb_constraints + previous_nb_clusters + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_sampling
)
# pbgmm_sampling_without_algorithm_nb_to_select <- pbnm( GLM_fit_sampling, GLM_fit_sampling_without_algorithm_nb_to_select, nsim=1000, tasks=10, cores=2, seed=42 )
# summary(pbgmm_sampling_without_algorithm_nb_to_select)
anova_sampling_without_algorithm_nb_to_select <- anova(GLM_fit_sampling, GLM_fit_sampling_without_algorithm_nb_to_select, test="Chisq")
# Print the comparison table (summary() would only summarize its columns).
anova_sampling_without_algorithm_nb_to_select

------------------------------
### 2.4. ANALYSIS FOR CLUSTERING

#### 2.4.1. LOAD SYNTHESIS CSV FILE

In [None]:
# Load analysis data.
df_analysis_clustering <- read.csv(
    file="../results/experiments_synthesis_for_clustering.csv",
    header=TRUE,  # Use the first row as headers.
    sep=";",
    skip=0,  # Number of rows to skip in the file.
)

In [None]:
# Set column type.
df_analysis_clustering$dataset_size <- as.numeric( df_analysis_clustering$dataset_size )
df_analysis_clustering$dataset_random_seed <- as.numeric( df_analysis_clustering$dataset_random_seed )
df_analysis_clustering$previous_nb_constraints <- as.numeric( df_analysis_clustering$previous_nb_constraints )
df_analysis_clustering$algorithm_name <- as.factor( df_analysis_clustering$algorithm_name )
df_analysis_clustering$algorithm_random_seed <- as.numeric( df_analysis_clustering$algorithm_random_seed )
df_analysis_clustering$algorithm_nb_clusters <- as.numeric( df_analysis_clustering$algorithm_nb_clusters )
df_analysis_clustering$time_total <- as.double( sub(",", ".", df_analysis_clustering$time_total) )

In [None]:
# Show an extract of analysis data.
df_analysis_clustering

#### 2.4.2. Apply general analysis

Fit a generalized linear model (GLM) on data.

In [None]:
GLM_fit_clustering <- glm(
    formula = time_total ~ algorithm_name + dataset_size + previous_nb_constraints + algorithm_nb_clusters + (1 | dataset_random_seed) + (1 | algorithm_random_seed),
    data = df_analysis_clustering
)
summary(GLM_fit_clustering)