# ==== INTERACTIVE CLUSTERING COMPARATIVE STUDY ====
> ### Stage 3: Apply main effect and post-hoc analyses to interactive clustering parameters.

------------------------------
## READ-ME BEFORE RUNNING

### Quick Description

This notebook **runs main effect and post-hoc analyses of interactive clustering convergence speed across experiments**.
- Environments are represented by subdirectories in the `/experiments` folder. A full path to an experiment environment is `/experiments/[DATASET]/[PREPROCESSING]/[VECTORIZATION]/[SAMPLING]/[CLUSTERING]/[EXPERIMENT]`.
- Experiments have to be run and evaluated before their convergence speed can be analyzed.
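
As a minimal sketch, an environment path following the layout above can be assembled with `file.path`. The component values below (`"my_dataset"`, `"simple_prep"`, and so on) are hypothetical placeholders, not names taken from the study, and the `../experiments` root assumes the notebook sits next to the `experiments` folder.

```r
# Hypothetical component values; replace them with the names of your own environment.
environment_parts <- c(
    dataset = "my_dataset",
    preprocessing = "simple_prep",
    vectorization = "tfidf",
    sampling = "random",
    clustering = "kmeans",
    experiment = "0000"
)

# Join the root folder and each component with "/" separators.
experiment_path <- do.call(
    file.path,
    c(list("..", "experiments"), as.list(unname(environment_parts)))
)
print(experiment_path)
```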

Before running, **run the notebook `2_Run_and_evaluate_experiments.ipynb` to run and evaluate each experiment you have set**.

### Description of each step

First of all, **load the experiment synthesis CSV file** that was computed by the previous notebook.
- It contains the parameters used for each experiment and the convergence metrics to compare.
- Four parameters are studied: `preprocessing`, `vectorization`, `sampling` and `clustering`.

For the next steps, choose an annotation threshold: `partial annotation (80% of v-measure)`, `sufficient annotation (100% of v-measure)` or `complete annotation (annotation completeness)`.
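
Each threshold corresponds to a column prefix of the synthesis file (`V080v__`, `V100v__` and `VMAX__`, as used in the analysis sections of this notebook). The sketch below shows one way to build the response column name from a threshold; the helper `response_column` and the short threshold labels are hypothetical and only serve the illustration.

```r
# Column prefixes used by this notebook for each annotation threshold.
threshold_prefixes <- c(
    partial = "V080v",
    sufficient = "V100v",
    complete = "VMAX"
)

# Hypothetical helper: build the response column name for a threshold and a metric.
response_column <- function(threshold, metric = "iteration") {
    stopifnot(threshold %in% names(threshold_prefixes))
    paste0(threshold_prefixes[[threshold]], "__", metric)
}

response_column("partial")
response_column("complete", metric = "total_time")
```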

Next, **perform general analyses**:
- Fit a generalized linear model (GLM) on the data.

Then, **perform main effect analyses** to determine the parameters that significantly influence the convergence speed:
- Fit an analysis of variance model with a repeated-measures ANOVA on the data.
- Compute the effect sizes of the variance model.

Finally, **perform post-hoc analyses** to determine the best values of the parameters that significantly influence the convergence speed:
- Fit a linear mixed-effects model (LMM) on the data.
- Compute estimated marginal means of significant factors and interactions, with Tukey HSD adjustment.


------------------------------
## 1. IMPORT R DEPENDENCIES

In [None]:
library("effectsize")  # effect sizes (eta_squared); called below as effectsize::eta_squared.
library("lme4")  # linear and mixed models (lmer).
library("emmeans")  # estimated marginal means (emmeans).
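
Before loading the dependencies, it may help to check that they are installed. The following sketch uses only base R; the package list is inferred from the calls made in this notebook (`lmer`, `emmeans`, `effectsize::eta_squared`).

```r
# Packages this notebook relies on; "effectsize" is called with the `::` operator in later cells.
required_packages <- c("lme4", "emmeans", "effectsize")

# requireNamespace() returns FALSE instead of failing when a package is missing.
is_installed <- vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

if (length(missing_packages) > 0) {
    message("Missing packages: ", paste(missing_packages, collapse = ", "))
    # install.packages(missing_packages)  # Uncomment to install them from CRAN.
} else {
    message("All required packages are installed.")
}
```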

------------------------------
## 2. LOAD SYNTHESIS CSV FILE

In [None]:
# Load analysis data.
df_analysis <- read.csv(
    file="../experiments/experiments_synthesis.csv",
    header=TRUE,  # Use the first row as headers.
    sep=";",
    skip=0  # Number of rows to skip in the file.
)

In [None]:
# Show the structure of the analysis data (column names, types and first values).
str(df_analysis)
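
Since the later cells assume that specific columns are present, a quick check can catch a mismatched synthesis file early. This is a sketch on a toy data frame with made-up values; the helper `missing_columns` is hypothetical.

```r
# Hypothetical helper: list the required columns that are absent from a data frame.
missing_columns <- function(df, required) {
    setdiff(required, colnames(df))
}

# Toy data frame standing in for the synthesis data (values are made up).
df_toy <- data.frame(
    preprocessing = c("a", "b"),
    vectorization = c("x", "y"),
    V080v__iteration = c(12, 15)
)

required <- c(
    "preprocessing", "vectorization", "sampling", "clustering",
    "random_seed", "V080v__iteration"
)
missing_columns(df_toy, required)
```

On the real data, the same call would be `missing_columns(df_analysis, required)`; an empty result means every expected column is there.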

In [None]:
# Set column type to factor for columns "preprocessing", "vectorization", "sampling", "clustering", "random_seed".
df_analysis$preprocessing <- as.factor( df_analysis$preprocessing )
df_analysis$vectorization <- as.factor( df_analysis$vectorization )
df_analysis$sampling <- as.factor( df_analysis$sampling )
df_analysis$clustering <- as.factor( df_analysis$clustering )
df_analysis$random_seed <- as.factor( df_analysis$random_seed )
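
The repeated-measures ANOVA used later treats `random_seed` as the repeated unit, and `aov` is designed for balanced designs. A sketch of a balance check on a toy full-factorial design follows; the factor levels are made up, and only two of the four parameters are included to keep it short.

```r
# Toy full-factorial design: 2 sampling levels x 2 clustering levels x 3 seeds (made-up levels).
df_toy <- expand.grid(
    sampling = c("s1", "s2"),
    clustering = c("c1", "c2"),
    random_seed = c("1", "2", "3"),
    stringsAsFactors = TRUE
)

# Count the runs in each (seed, sampling, clustering) cell.
cell_counts <- table(df_toy$random_seed, df_toy$sampling, df_toy$clustering)

# The design is complete and balanced when every cell holds exactly one run.
all(cell_counts == 1)
```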

In [None]:
# Set column type to numeric for the metric columns of each threshold
# ("V050v__" to "V100v__", then "VMAX__").
thresholds <- c("V050v", "V060v", "V070v", "V080v", "V090v", "V095v", "V099v", "V100v", "VMAX")
metrics <- c(
    "iteration", "sampling_time", "clustering_time", "total_time",
    "constraints_must_link", "constraints_cannot_link", "constraints_total", "constraints_ratio_must_link"
)
for (threshold in thresholds) {
    for (metric in metrics) {
        column <- paste0(threshold, "__", metric)
        df_analysis[[column]] <- as.numeric( df_analysis[[column]] )
    }
}

------------------------------
## 3. ANALYZE PARTIAL ANNOTATION (`v-measure==0.80`)

### 3.1. Apply general analysis

Fit a generalized linear model (GLM) on data.

In [None]:
GLM_fit_V080v__iteration <- glm(
    formula = V080v__iteration ~ preprocessing + vectorization + sampling + clustering + random_seed,
    data = df_analysis
)
summary(GLM_fit_V080v__iteration)
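
Since the call above does not set `family`, `glm` falls back to its default `gaussian()` family with an identity link, which yields the same coefficients as an ordinary linear model. The sketch below illustrates this on simulated toy data (the values and the two `sampling` levels are made up).

```r
set.seed(42)

# Toy data: one factor with two levels and a numeric response.
df_toy <- data.frame(
    sampling = factor(rep(c("s1", "s2"), each = 10)),
    iteration = c(rnorm(10, mean = 20), rnorm(10, mean = 25))
)

glm_fit <- glm(iteration ~ sampling, data = df_toy)  # family defaults to gaussian()
lm_fit <- lm(iteration ~ sampling, data = df_toy)

# Both fits estimate the same coefficients.
all.equal(coef(glm_fit), coef(lm_fit))
family(glm_fit)$family
```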

### 3.2. Apply main effect analysis

Fit an analysis of variance model with a repeated-measures ANOVA on the data.

In [None]:
ANOVA_fit_V080v__iteration <- aov(
    formula = V080v__iteration ~ preprocessing * vectorization * sampling * clustering + Error(random_seed / (preprocessing * vectorization * sampling * clustering)),
    data = df_analysis
)
summary(ANOVA_fit_V080v__iteration)

Compute the effect sizes (eta squared) of the variance model.

In [None]:
effectsize::eta_squared(ANOVA_fit_V080v__iteration)
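
For intuition, eta squared is the share of the total sum of squares attributed to an effect, `SS_effect / SS_total`, while partial eta squared (the variant `effectsize::eta_squared` reports by default) is `SS_effect / (SS_effect + SS_error)`. The sketch below computes both by hand with base R on simulated toy data. It uses a plain `aov` without an `Error()` term, which is a simplification of the model fitted above.

```r
set.seed(1)

# Toy balanced design with made-up levels: 2 x 2 cells, 5 replicates each.
df_toy <- expand.grid(
    sampling = c("s1", "s2"),
    clustering = c("c1", "c2"),
    replicate = 1:5
)
df_toy$iteration <- 20 + 5 * (df_toy$sampling == "s2") + rnorm(nrow(df_toy))

fit <- aov(iteration ~ sampling * clustering, data = df_toy)

# Extract the sums of squares from the ANOVA table (row names are space-padded).
anova_table <- summary(fit)[[1]]
sum_sq <- anova_table[["Sum Sq"]]
names(sum_sq) <- trimws(rownames(anova_table))

ss_error <- sum_sq[["Residuals"]]
effects <- setdiff(names(sum_sq), "Residuals")

eta_sq <- sum_sq[effects] / sum(sum_sq)
partial_eta_sq <- sum_sq[effects] / (sum_sq[effects] + ss_error)

round(cbind(eta_sq, partial_eta_sq), 3)
```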

### 3.3. Apply post-hoc analysis

Fit a linear mixed-effects model (LMM) on data.

In [None]:
LMER_fit_V080v__iteration <- lmer(
    formula = V080v__iteration ~ preprocessing * vectorization * sampling * clustering + (1|random_seed),
    data = df_analysis
)
summary(LMER_fit_V080v__iteration)
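
In the formula above, `(1|random_seed)` gives each random seed its own intercept, so that runs sharing a seed are allowed to be correlated. The sketch below fits the same kind of model on simulated toy data (made-up levels and effect sizes, a single fixed factor) and extracts the fixed effects and the variance components.

```r
library("lme4")

set.seed(7)

# Toy design: 3 sampling levels crossed with 6 seeds (made-up values).
df_toy <- expand.grid(
    sampling = c("s1", "s2", "s3"),
    random_seed = factor(1:6)
)

# Simulate a seed-specific shift shared by all runs of the same seed.
seed_shift <- rnorm(6, sd = 2)
df_toy$iteration <- 20 +
    3 * (df_toy$sampling == "s2") +
    seed_shift[as.integer(df_toy$random_seed)] +
    rnorm(nrow(df_toy))

fit <- lmer(iteration ~ sampling + (1 | random_seed), data = df_toy)

# Fixed effects: intercept plus one coefficient per non-reference sampling level.
fixef(fit)

# Variance components: between-seed intercept variance and residual variance.
as.data.frame(VarCorr(fit))[, c("grp", "vcov", "sdcor")]
```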

Compute estimated marginal means of significant factors and interactions, with Tukey HSD adjustment.

_NB_: These comparisons are only meaningful for factors or interactions that showed a significant effect in the main effect analysis.

In [None]:
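# Main effect of `preprocessing`: Tukey-adjusted pairwise comparisons of its levels.
summary(emmeans(LMER_fit_V080v__iteration, list(pairwise ~ preprocessing), adjust = "tukey"))

In [None]:
# Main effect of `vectorization`: Tukey-adjusted pairwise comparisons of its levels.
summary(emmeans(LMER_fit_V080v__iteration, list(pairwise ~ vectorization), adjust = "tukey"))

In [None]:
# Main effect of `sampling`: Tukey-adjusted pairwise comparisons of its levels.
summary(emmeans(LMER_fit_V080v__iteration, list(pairwise ~ sampling), adjust = "tukey"))

In [None]:
# Main effect of `clustering`: Tukey-adjusted pairwise comparisons of its levels.
summary(emmeans(LMER_fit_V080v__iteration, list(pairwise ~ clustering), adjust = "tukey"))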
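
The cells above compare the levels of one factor at a time. When an interaction is significant, the levels of one factor can be compared within each level of another by using the `|` conditioning syntax of `emmeans`. The sketch below shows this on simulated toy data with made-up levels; it uses a plain `lm` fit instead of the mixed model for brevity.

```r
library("emmeans")

set.seed(3)

# Toy balanced design: 3 sampling levels x 2 clustering levels, 6 replicates each.
df_toy <- expand.grid(
    sampling = c("s1", "s2", "s3"),
    clustering = c("c1", "c2"),
    replicate = 1:6
)

# Simulate an interaction: only the ("s3", "c2") cell is shifted.
df_toy$iteration <- 20 +
    4 * (df_toy$sampling == "s3" & df_toy$clustering == "c2") +
    rnorm(nrow(df_toy))

fit <- lm(iteration ~ sampling * clustering, data = df_toy)

# Compare sampling levels separately within each clustering level, with Tukey adjustment.
emm <- emmeans(fit, pairwise ~ sampling | clustering, adjust = "tukey")
summary(emm$contrasts)
```

Applied to this notebook, the same formula could be passed with `LMER_fit_V080v__iteration` as the model, provided the corresponding interaction was significant in the main effect analysis.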

------------------------------
## 4. ANALYZE SUFFICIENT ANNOTATION (`v-measure==1.00`)

### 4.1. Apply general analysis

Fit a generalized linear model (GLM) on data.

In [None]:
GLM_fit_V100v__iteration <- glm(
    formula = V100v__iteration ~ preprocessing + vectorization + sampling + clustering + random_seed,
    data = df_analysis
)
summary(GLM_fit_V100v__iteration)

### 4.2. Apply main effect analysis

Fit an analysis of variance model with a repeated-measures ANOVA on the data.

In [None]:
ANOVA_fit_V100v__iteration <- aov(
    formula = V100v__iteration ~ preprocessing * vectorization * sampling * clustering + Error(random_seed / (preprocessing * vectorization * sampling * clustering)),
    data = df_analysis
)
summary(ANOVA_fit_V100v__iteration)

Compute the effect sizes (eta squared) of the variance model.

In [None]:
effectsize::eta_squared(ANOVA_fit_V100v__iteration)

### 4.3. Apply post-hoc analysis

Fit a linear mixed-effects model (LMM) on data.

In [None]:
LMER_fit_V100v__iteration <- lmer(
    formula = V100v__iteration ~ preprocessing * vectorization * sampling * clustering + (1|random_seed),
    data = df_analysis
)
summary(LMER_fit_V100v__iteration)

Compute estimated marginal means of significant factors and interactions, with Tukey HSD adjustment.

_NB_: These comparisons are only meaningful for factors or interactions that showed a significant effect in the main effect analysis.

In [None]:
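# Main effect of `preprocessing`: Tukey-adjusted pairwise comparisons of its levels.
summary(emmeans(LMER_fit_V100v__iteration, list(pairwise ~ preprocessing), adjust = "tukey"))

In [None]:
# Main effect of `vectorization`: Tukey-adjusted pairwise comparisons of its levels.
summary(emmeans(LMER_fit_V100v__iteration, list(pairwise ~ vectorization), adjust = "tukey"))

In [None]:
# Main effect of `sampling`: Tukey-adjusted pairwise comparisons of its levels.
summary(emmeans(LMER_fit_V100v__iteration, list(pairwise ~ sampling), adjust = "tukey"))

In [None]:
# Main effect of `clustering`: Tukey-adjusted pairwise comparisons of its levels.
summary(emmeans(LMER_fit_V100v__iteration, list(pairwise ~ clustering), adjust = "tukey"))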

------------------------------
## 5. ANALYZE COMPLETE ANNOTATION (`annotation==completeness`)

### 5.1. Apply general analysis

Fit a generalized linear model (GLM) on data.

In [None]:
GLM_fit_VMAX__iteration <- glm(
    formula = VMAX__iteration ~ preprocessing + vectorization + sampling + clustering + random_seed,
    data = df_analysis
)
summary(GLM_fit_VMAX__iteration)

### 5.2. Apply main effect analysis

Fit an analysis of variance model with a repeated-measures ANOVA on the data.

In [None]:
ANOVA_fit_VMAX__iteration <- aov(
    formula = VMAX__iteration ~ preprocessing * vectorization * sampling * clustering + Error(random_seed / (preprocessing * vectorization * sampling * clustering)),
    data = df_analysis
)
summary(ANOVA_fit_VMAX__iteration)

Compute the effect sizes (eta squared) of the variance model.

In [None]:
effectsize::eta_squared(ANOVA_fit_VMAX__iteration)

### 5.3. Apply post-hoc analysis

Fit a linear mixed-effects model (LMM) on data.

In [None]:
LMER_fit_VMAX__iteration <- lmer(
    formula = VMAX__iteration ~ preprocessing * vectorization * sampling * clustering + (1|random_seed),
    data = df_analysis
)
summary(LMER_fit_VMAX__iteration)

Compute estimated marginal means of significant factors and interactions, with Tukey HSD adjustment.

_NB_: These comparisons are only meaningful for factors or interactions that showed a significant effect in the main effect analysis.

In [None]:
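# Main effect of `preprocessing`: Tukey-adjusted pairwise comparisons of its levels.
summary(emmeans(LMER_fit_VMAX__iteration, list(pairwise ~ preprocessing), adjust = "tukey"))

In [None]:
# Main effect of `vectorization`: Tukey-adjusted pairwise comparisons of its levels.
summary(emmeans(LMER_fit_VMAX__iteration, list(pairwise ~ vectorization), adjust = "tukey"))

In [None]:
# Main effect of `sampling`: Tukey-adjusted pairwise comparisons of its levels.
summary(emmeans(LMER_fit_VMAX__iteration, list(pairwise ~ sampling), adjust = "tukey"))

In [None]:
# Main effect of `clustering`: Tukey-adjusted pairwise comparisons of its levels.
summary(emmeans(LMER_fit_VMAX__iteration, list(pairwise ~ clustering), adjust = "tukey"))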