# ==== INTERACTIVE CLUSTERING : ANNOTATION ERROR STUDY ====
> ### Stage 2: Perform constraint annotation (with error simulation and conflict fixing), constrained clustering and clustering evaluation.

------------------------------
## READ-ME BEFORE RUNNING

### Quick Description

This notebook **runs all experiment environments, estimates the impact of annotation errors during interactive clustering, plots overviews of experiments and synthesizes interactive clustering experiments in a CSV file**.
- Environments are represented by subdirectories in the `/experiments` folder. A full path to an experiment environment is `/experiments/[DATASET]/[CLUSTERING]/[CONSTRAINTS_SELECTION]/[ERRORS_SIMULATION]`.
- An experiment run is composed of constraint annotation with error simulation, followed by constrained clustering and clustering evaluation.

Before running, **run the notebook `1_Initialize_annotation_errors_experiments.ipynb` to set up the experiments you want**.

Then, **go to the notebook `3_Modelize_errors_and_Plot_some_figures.ipynb` to model the error simulations**.

### Description of each step

First of all, find all experiment environments.
- A loop looks at all subdirectories in the `/experiments` folder that contain a `config.json` file.
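As an illustration of this discovery step, here is a minimal sketch that collects every directory holding a `config.json` file. The function name `find_experiment_environments` and the directory names in the usage part are hypothetical; the notebook itself relies on `listing_envs.get_list_of_errors_simulation_env_paths()`, whose implementation is not shown here and may differ (for instance by only looking at the fourth directory level).

```python
import tempfile
from pathlib import Path
from typing import List


def find_experiment_environments(experiments_root: str) -> List[str]:
    """Return the sorted paths of all directories that contain a `config.json` file."""
    root = Path(experiments_root)
    return sorted(
        str(config_file.parent)
        for config_file in root.rglob("config.json")  # Search at any depth (assumption).
        if config_file.is_file()
    )


# Usage on a throwaway directory tree (dataset and setting names are made up).
with tempfile.TemporaryDirectory() as tmp_dir:
    env_dir = Path(tmp_dir) / "dataset_a" / "kmeans" / "random" / "error_rate_0.10"
    env_dir.mkdir(parents=True)
    (env_dir / "config.json").write_text("{}")
    (Path(tmp_dir) / "dataset_b").mkdir()  # No `config.json` here: ignored.
    found_environments = find_experiment_environments(tmp_dir)
    relative_environments = [
        Path(env_path).relative_to(tmp_dir).as_posix() for env_path in found_environments
    ]
    print(relative_environments)
```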

Then, **apply constraint annotation and constrained clustering** (2.A) for all experiment environments:
- Each experiment annotates constraints according to the groundtruth, but simulates some errors (cf. error rate). If a conflict occurs, the inferred constraint is simply propagated. Then, a constrained clustering is run.
- All computations are stored in the following files:
    - constraints in `../experiments/[EXPERIMENT_PATH]/list_of_constraints.json`;
    - clustering result in `../experiments/[EXPERIMENT_PATH]/dict_of_clustering.json`;
    - clustering performances in `../experiments/[EXPERIMENT_PATH]/dict_of_clustering_performances.json`;
- _NB_:
    - The script used to run an experiment is available in the `workerA_run.py` file.
    - For these computations, **multiprocessing is used to parallelize tasks**:
        - Each environment is represented by a task, and tasks are launched as workers on available logical CPUs.
        - The scripts for these workers are available in the `notebook` directory.
        - **WARNING**: _The number of workers should match the number of reserved logical CPUs to avoid slow execution._
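To make the error simulation more concrete, the sketch below annotates one pair of data according to the groundtruth and flips the answer with a probability equal to the error rate. It is only an illustration under assumptions: the function name `annotate_with_errors`, the constraint labels `MUST_LINK` / `CANNOT_LINK` and the example groundtruth are hypothetical, and the actual logic lives in `workerA_run.py`. Conflict handling (propagating the inferred constraint) is left out.

```python
import random
from typing import Dict


def annotate_with_errors(
    data_id1: str,
    data_id2: str,
    groundtruth: Dict[str, str],
    error_rate: float,
    rng: random.Random,
) -> str:
    """Annotate a pair from the groundtruth, flipping the answer with probability `error_rate`."""
    # Correct annotation: same groundtruth label means the two data belong together.
    same_label = groundtruth[data_id1] == groundtruth[data_id2]
    correct = "MUST_LINK" if same_label else "CANNOT_LINK"
    # Simulated error: `rng.random()` is in [0, 1), so the flip happens with probability `error_rate`.
    if rng.random() < error_rate:
        return "CANNOT_LINK" if correct == "MUST_LINK" else "MUST_LINK"
    return correct


# Usage with a made-up groundtruth and a seeded generator for reproducibility.
groundtruth = {"text_1": "card", "text_2": "card", "text_3": "loan"}
rng = random.Random(42)
exact = annotate_with_errors("text_1", "text_2", groundtruth, 0.0, rng)
flipped = annotate_with_errors("text_1", "text_2", groundtruth, 1.0, rng)
```

Passing the random generator explicitly keeps each simulated run reproducible from its seed.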

Then, **apply experiment synthesis** (2.C) for all experiments:
- Create a CSV file that formats the evolution of the evaluations, in order to analyze the main effects and post-hoc comparisons of interactive clustering convergence speed using an `R` script (cf. notebook `3_Analyze_main_effects_and_post_hoc.ipynb`);
- Evolutions are stored in the `../results/experiment_sysnthesis.csv` file.
- _NB_:
    - The script used to perform the experiment synthesis is available in the `workerC_synthesis.py` file.
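The synthesis can be pictured as flattening per-environment performances into one table. The sketch below is hypothetical: it assumes performances shaped as `{environment: {iteration: {metric: value}}}`, and the function name `synthesize_performances`, the column names and the metric name are made up for illustration. The real format of `dict_of_clustering_performances.json` and of the CSV file is defined by `workerC_synthesis.py`.

```python
import csv
import io
from typing import Dict


def synthesize_performances(
    performances_by_env: Dict[str, Dict[str, Dict[str, float]]],
) -> str:
    """Flatten `{env: {iteration: {metric: value}}}` into CSV text, one row per (env, iteration)."""
    # Collect every metric name so that all rows share the same columns.
    metric_names = sorted(
        {
            metric
            for performances in performances_by_env.values()
            for metrics in performances.values()
            for metric in metrics
        }
    )
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["env_path", "iteration", *metric_names],
        lineterminator="\n",
    )
    writer.writeheader()
    for env_path in sorted(performances_by_env):
        for iteration in sorted(performances_by_env[env_path]):
            row: Dict[str, object] = {"env_path": env_path, "iteration": iteration}
            row.update(performances_by_env[env_path][iteration])
            writer.writerow(row)  # Missing metrics are written as empty cells.
    return buffer.getvalue()


# Usage with made-up values.
example_performances = {
    "dataset_a/kmeans/random/error_rate_0.10": {
        "0000": {"v_measure": 0.25},
        "0001": {"v_measure": 0.5},
    },
}
csv_text = synthesize_performances(example_performances)
print(csv_text)
```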

------------------------------
## 1. IMPORT PYTHON DEPENDENCIES

In [None]:
import multiprocessing as mp
import listing_envs
from typing import Dict, List, Union

import tqdm
import workerA_run
import workerC_synthesis

------------------------------
## 2. RUN ANNOTATION ERROR STUDY EXPERIMENTS

Find all experiment environments.

In [None]:
# Get list of experiment environments.
LIST_OF_EXPERIMENT_ENVIRONMENTS: List[
    str
] = listing_envs.get_list_of_errors_simulation_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_EXPERIMENT_ENVIRONMENTS)) + "`",
    "created experiment environments in `../experiments`",
)
LIST_OF_EXPERIMENT_ENVIRONMENTS

------------------------------
### 2.A. Simulate all constraints annotation defined by an environment

Represent each annotation simulation by a task to launch. Tasks define the run parameters.

In [None]:
# List of run tasks to parallelize.
list_of_run_tasks: List[Dict[str, Union[str, int, None]]] = [
    {
        "ENV_PATH": env_to_run,  # Environment of experiment.
    }
    for env_to_run in LIST_OF_EXPERIMENT_ENVIRONMENTS
]
print("There are", "`" + str(len(list_of_run_tasks)) + "`", "run tasks to launch.")
##### list_of_run_tasks

Run all defined tasks with `multiprocessing`: each task is given to a worker.

In [None]:
# Number of workers (logical CPUs).
number_of_workers_for_run: int = mp.cpu_count()  # TODO: set it manually !
print(
    "There are",
    "`" + str(number_of_workers_for_run) + "`",
    "logical CPUs used for evaluation experiments.",
)
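Regarding the `TODO` above, one possible way to set the number of workers is to bound it by the number of reserved logical CPUs and by the number of tasks. This is only a sketch: the helper `choose_number_of_workers` and its `reserved_cpus` parameter are hypothetical and not part of the notebook.

```python
import multiprocessing as mp
from typing import Optional


def choose_number_of_workers(number_of_tasks: int, reserved_cpus: Optional[int] = None) -> int:
    """Pick a pool size bounded by the reserved CPUs and by the number of tasks."""
    # Fall back to all logical CPUs when no reservation is given.
    available_cpus = reserved_cpus if reserved_cpus is not None else mp.cpu_count()
    # Never start more workers than tasks, and always start at least one.
    return max(1, min(available_cpus, number_of_tasks))


# Usage: 100 tasks on a machine where 4 logical CPUs are reserved.
print(choose_number_of_workers(number_of_tasks=100, reserved_cpus=4))
```

On a shared machine, `mp.cpu_count()` reports all logical CPUs rather than the ones actually reserved, which is why an explicit value is safer.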

In [None]:
# Run tasks in parallel.
if __name__ == "__main__":

    # Define the pool of workers. The `with` block releases the workers once all tasks are done.
    with mp.Pool(number_of_workers_for_run) as pool_for_run:

        # Map the list of tasks with the pool of workers. Show a progress bar with `tqdm`.
        for _ in tqdm.tqdm(  # noqa: WPS352
            pool_for_run.imap_unordered(workerA_run.experiment_run, list_of_run_tasks),
            total=len(list_of_run_tasks),
        ):
            pass  # noqa: WPS420

------------------------------
### 2.C. Synthesize experiments

***WARNING***: _Launch the experiment runs before the experiment synthesis!_

Synthesize performance in a CSV file.

In [None]:
# Run synthesis computation.
workerC_synthesis.experiments_synthesis(
    list_of_experiment_environments=LIST_OF_EXPERIMENT_ENVIRONMENTS,
)

====================================================================================================

------------------------------
## NOTA BENE

- ***Show the CPU usage***: `htop`
- ***Show disk usage***: `du -sh -- *`

In [None]:
!du -sh -- ../../*

- ***Zip results***: `tar -czf ../experiments.tar.gz ../experiments/`
- ***Unzip results***: `tar -xzf ../experiments.tar.gz -C ../`