# ==== INTERACTIVE CLUSTERING : CONSTRAINTS NUMBER STUDY ====
> ### Stage 2 : Run all experiments until convergence, evaluate efficiency and synthesize experiments.

------------------------------
## READ-ME BEFORE RUNNING

### Quick Description

This notebook **runs all convergence experiments, evaluates interactive clustering efficiency over iterations, plots overviews of experiments and synthesizes the interactive clustering convergence experiments in a CSV file**.
- Environments are represented by subdirectories in the `/experiments` folder. A full path to an experiment environment is `/experiments/[DATASET]/[PREPROCESSING]/[VECTORIZATION]/[SAMPLING]/[CLUSTERING]/[EXPERIMENT]`.
- An experiment run is composed of iterations of _interactive clustering_.
- An experiment evaluation looks at each _interactive clustering_ iteration of the experiment.

Before running, **run the notebook `1_Initialize_convergence_experiments.ipynb` to set up the experiments you want**.

Then, **go to the notebook `3_Analyze_main_effects_and_post_hoc.ipynb` to run main effects and post-hoc analysis on the number of interactive clustering constraints over experiments**.

### Description of each step

First of all, find all experiment environments.
- A loop looks at all subdirectories in the `/experiments` folder that have a `config.json` file.
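
The search above can be sketched as a plain directory walk. This is only an illustration: `find_experiment_env_paths` is a hypothetical helper, not the implementation behind `listing_envs.get_list_of_experiment_env_paths()`.

```python
import os
from typing import List


def find_experiment_env_paths(root: str = "../experiments") -> List[str]:
    """Return the sorted paths of all directories under `root` that contain a `config.json` file."""
    env_paths: List[str] = []
    for current_dir, _subdirs, files in os.walk(root):
        # A directory is an experiment environment only if it holds a `config.json` file.
        if "config.json" in files:
            env_paths.append(current_dir)
    return sorted(env_paths)
```

Directories without a `config.json` file (intermediate levels such as `[DATASET]` or `[PREPROCESSING]`) are simply skipped.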

Then, **apply experiment run** (2.A) for all experiment environments :
- An experiment run is composed of iterations of _interactive clustering_.
    - Iterations are represented by a `string` id (the iteration number in four characters).
- The experiment starts with data loading and constraints manager initialization.
- Iterations run until the annotation is complete (cf. constraints manager) or until the maximum number of iterations is reached.
- Each iteration is composed of three major steps:
    - a **constraints sampling step**: Based on previous clustering results, a sampler selects pairs of data to annotate. On the first iteration, there is no sampling.
    - a **constraints annotation step**: Based on the groundtruth, an automatic annotator simulates the expert's constraints annotation on pairs of data. These annotations can be "MUST_LINK" or "CANNOT_LINK", and are based on a comparison of groundtruth intents.
    - a **constrained clustering step**: Based on the annotated constraints, a clustering of the data is run.
- Note that the added constraints should correct the clustering at each iteration.
- All computations are stored in the following files:
    - samples and annotations in `../experiments/[EXPERIMENT_PATH]/dict_of_constraints_annotations.json`;
    - clustering results in `../experiments/[EXPERIMENT_PATH]/dict_of_clustering_results.json`;
    - time spent in `../experiments/[EXPERIMENT_PATH]/dict_of_computation_times.json`.
- _NB_:
    - The script used to run an experiment is available in the `workerA_run.py` file.
    - For these computations, **multiprocessing is used to parallelize tasks**:
        - Each environment is represented by a task, and tasks are launched as workers on available logical CPUs.
        - The scripts for these workers are available in the `notebook` directory.
        - **WARNING**: _The number of workers should match the number of logical CPUs reserved, to avoid slow execution._
    - Each result (annotations, computation time, clustering results) is grouped by iteration and stored in JSON files.
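
As a rough illustration of the loop described above, here is a toy sketch. It assumes a placeholder sampler that takes pairs in order and treats annotation as complete once every pair is annotated; the real run relies on a sampler driven by previous clustering results and on a constraints manager. All function names are hypothetical and do not come from `workerA_run.py`.

```python
from typing import Dict, List, Optional, Tuple


def iteration_id(iteration_number: int) -> str:
    """Format an iteration number as a four-character string id."""
    return str(iteration_number).zfill(4)


def annotate(intent_a: str, intent_b: str) -> str:
    """Simulate the expert: identical groundtruth intents give MUST_LINK, else CANNOT_LINK."""
    return "MUST_LINK" if intent_a == intent_b else "CANNOT_LINK"


def run_annotation_iterations(
    groundtruth: Dict[str, str],
    pairs_per_iteration: int = 2,
    max_iter: Optional[int] = None,
) -> Dict[str, List[Tuple[str, str, str]]]:
    """Annotate pairs iteration by iteration until no pair is left or `max_iter` is reached."""
    data_ids = sorted(groundtruth)
    remaining = [(a, b) for i, a in enumerate(data_ids) for b in data_ids[i + 1:]]
    # First iteration: no sampling, hence no annotation.
    annotations: Dict[str, List[Tuple[str, str, str]]] = {iteration_id(0): []}
    iteration = 1
    while remaining and (max_iter is None or iteration <= max_iter):
        # Placeholder sampling step: take the next pairs in order.
        sampled, remaining = remaining[:pairs_per_iteration], remaining[pairs_per_iteration:]
        # Annotation step: compare the groundtruth intents of each sampled pair.
        annotations[iteration_id(iteration)] = [
            (a, b, annotate(groundtruth[a], groundtruth[b])) for a, b in sampled
        ]
        # A constrained clustering step would run here, using all annotations so far.
        iteration += 1
    return annotations
```

Results are keyed by the four-character iteration id, which mirrors how each result is grouped by iteration in the JSON files.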

Then, **apply experiment evaluation** (2.B) for all experiment environments:
- Evaluate the following clustering performance metrics for each iteration: `completeness`, `homogeneity`, `v-measure`, `adjusted-rand-index`, `adjusted-mutual-information`;
- Find iterations that reach the following clustering performance goals: `v_measure=0.50`, `v_measure=0.60`, `v_measure=0.70`, `v_measure=0.80`, `v_measure=0.90`, `v_measure=0.95`, `v_measure=0.99`, `v_measure=1.00`, `annotation=completed` (other *v_measure* goals can be set if needed). These iterations are stored in the `../experiments/[EXPERIMENT_PATH]/dict_of_iterations_to_highlight.json` file;
- Plot clustering performance evolution over iterations in the `../experiments/[EXPERIMENT_PATH]/plot_clustering_performances_evolution.png` image file;
- Plot annotation completeness evolution over iterations in the `../experiments/[EXPERIMENT_PATH]/plot_annotations_completeness_evolution.png` image file;
- Plot computation times evolution over iterations in the `../experiments/[EXPERIMENT_PATH]/plot_computation_times_evolution.png` image file.
- _NB_:
    - The script used to evaluate an experiment is available in the `workerB_evaluate.py` file.
    - Each result (clustering evaluation) is grouped by iteration and stored in JSON files.
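
The goal search can be sketched as follows, assuming that "reaching a goal" means the first iteration whose v-measure is greater than or equal to the goal. The input structure and the name `find_iterations_to_highlight` are hypothetical and are not taken from `workerB_evaluate.py`.

```python
from typing import Dict, List, Optional


def find_iterations_to_highlight(
    v_measure_by_iteration: Dict[str, float],
    goals: List[str],
) -> Dict[str, Optional[str]]:
    """For each goal, return the first iteration id whose v-measure reaches it, or None."""
    highlights: Dict[str, Optional[str]] = {}
    for goal in goals:
        highlights[goal] = next(
            (
                iteration
                # Zero-padded four-character ids sort in chronological order.
                for iteration in sorted(v_measure_by_iteration)
                if v_measure_by_iteration[iteration] >= float(goal)
            ),
            None,
        )
    return highlights
```

Goals are passed as strings such as `"0.50"` or `"1.00"`, in the same form as the `performance_goals_to_compute` parameter used in section 2.B.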

Then, **apply experiment overviews** (2.C) for several sets of experiments :
- Have an overview by computing the mean clustering performance evolution and mean clustering time evolution for a set of experiments;
- Several overviews are computed : `all experiments`, `partial annotation (80% of v-measure)`, `sufficient annotation (100% of v-measure)` and `complete annotation (annotation completeness)`.
- Plot global clustering performances evolution in the `../results/plot_global_performances_evolution.png` image file;
- Plot global computation times evolution in the `../results/plot_global_computation_times_evolution.png` image file;
- _NB_:
    - The script used to do experiments overview is available in the `workerC_overview.py` file.

Then, **apply experiment synthesis** (2.D) for all experiments:
- Create a CSV file to format evaluations, annotations and time evolutions in order to analyze main effects and post-hoc of interactive clustering convergence speed using an `R` script (cf. notebook `3_Analyze_main_effects_and_post_hoc.ipynb`);
- Evolutions are stored in the `../results/experiment_sysnthesis.csv` file.
- _NB_:
    - The script used to do experiments synthesis is available in the `workerD_synthesis.py` file.
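
Writing such a synthesis could look like the sketch below, which uses the standard `csv` module. The column names, the `;` delimiter and the function name `write_synthesis_csv` are assumptions made for illustration; the actual layout is defined in `workerD_synthesis.py`.

```python
import csv
from typing import Dict, List


def write_synthesis_csv(rows: List[Dict[str, str]], csv_path: str) -> None:
    """Write one row per experiment, with columns taken from the keys of the first row."""
    if not rows:
        return  # Nothing to synthesize.
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(
            csv_file, fieldnames=list(rows[0].keys()), delimiter=";"
        )
        writer.writeheader()
        writer.writerows(rows)
```

One flat row per experiment keeps the file easy to load as a data frame in `R`.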

------------------------------
## 1. IMPORT PYTHON DEPENDENCIES

In [None]:
import multiprocessing as mp
import listing_envs
from typing import Dict, List, Union

import tqdm
import workerA_run
import workerB_evaluate
import workerC_overview
import workerD_synthesis

------------------------------
## 2. RUN CONVERGENCE STUDY EXPERIMENTS

Find all experiment environments.

In [None]:
# Get list of experiment environments.
LIST_OF_EXPERIMENT_ENVIRONMENTS: List[
    str
] = listing_envs.get_list_of_experiment_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_EXPERIMENT_ENVIRONMENTS)) + "`",
    "created experiment environments in `../experiments`",
)
LIST_OF_EXPERIMENT_ENVIRONMENTS

------------------------------
### 2.A. Run all experiments defined by an environment

Represent each convergence experiment by a task to launch. Tasks define the run parameters.

In [None]:
# List of run tasks to parallelize.
list_of_convergence_tasks: List[Dict[str, Union[str, int, None]]] = [
    {
        "ENV_PATH": env_to_run,  # Environment of experiment.
        "MAX_ITER": None,  # Maximum number of iterations.
    }
    for env_to_run in LIST_OF_EXPERIMENT_ENVIRONMENTS
]
print("There are", "`" + str(len(list_of_convergence_tasks)) + "`", "run tasks to launch.")
# list_of_convergence_tasks

Run all defined tasks with `multiprocessing` : Each task is given to a worker.

In [None]:
# Number of worker (logical CPU).
number_of_workers_for_run: int = 8  # mp.cpu_count()  # TODO: set it manually !
print(
    "There are",
    "`" + str(number_of_workers_for_run) + "`",
    "logical CPUs used to run experiments.",
)

In [None]:
# Run tasks in parallel.
if __name__ == "__main__":

    # Define the pool of workers; the `with` block releases the workers once all tasks are done.
    with mp.Pool(number_of_workers_for_run) as pool_for_run:

        # Map the list of tasks with the pool of workers. Show a progress bar with `tqdm`.
        for _ in tqdm.tqdm(  # noqa: WPS352
            pool_for_run.imap_unordered(
                workerA_run.experiment_run, list_of_convergence_tasks
            ),
            total=len(list_of_convergence_tasks),
        ):
            pass  # noqa: WPS420

------------------------------
### 2.B. Evaluate all experiments defined by an environment

***WARNING***: _Launch the convergence experiments before the experiment evaluations!_

Evaluate the convergence of each experiment.

In [None]:
# For each environment...
for counter_for_evaluation, exp_to_evaluate in enumerate(
    LIST_OF_EXPERIMENT_ENVIRONMENTS
):

    # Print the current experiment to evaluate.
    print(counter_for_evaluation, ":", exp_to_evaluate)

    # Start the evaluation.
    workerB_evaluate.experiment_evaluate(
        parameters={
            "ENV_PATH": exp_to_evaluate,  # Environment of experiment.
            "study_progress": "exp: "
            + str(counter_for_evaluation + 1)
            + "/"
            + str(
                len(LIST_OF_EXPERIMENT_ENVIRONMENTS)
            ),  # Study progression.
            "performance_goals_to_compute": [
                "0.05",
                "0.10",
                "0.15",
                "0.20",
                "0.25",
                "0.30",
                "0.35",
                "0.40",
                "0.45",
                "0.50",
                "0.55",
                "0.60",
                "0.65",
                "0.70",
                "0.75",
                "0.80",
                "0.85",
                "0.90",
                "0.95",
                "0.99",
                "1.00",
            ],  # Performance goal for iteration to highlight.
        }
    )

------------------------------
### 2.C. Make experiments overviews in a graph

***WARNING***: _Launch the experiment runs and evaluations before the experiments overview!_

Define each performance overview to make over the convergence experiments.

Create the performance overview graph for several interactive clustering convergence experiments grouped by parameters.

Define each time overview to make over the experiments.

Create the time overview graph of interactive clustering convergence speed for several experiments grouped by parameters.

------------------------------
### 2.D. Synthesize experiments

***WARNING***: _Launch the experiment runs and evaluations before the experiments synthesis!_

Synthesize performance, annotation and time in a CSV file.

In [None]:
# Run synthesis computation.
workerD_synthesis.experiments_synthesis(
    list_of_experiment_environments=LIST_OF_EXPERIMENT_ENVIRONMENTS,
)

====================================================================================================

------------------------------
## NOTA BENE

- ***Show the CPU usage***: `htop`
- ***Show disk usage***: `du -sh -- *`

- ***Zip results***: `tar -czf ../experiments.tar.gz ../experiments/`
- ***Unzip results***: `tar -xzf ../experiments.tar.gz -C ../`