# ==== INTERACTIVE CLUSTERING : ANNOTATION ERROR STUDY ====
> ### Stage 1 : Initialize computation environments for experiments.

------------------------------
## READ-ME BEFORE RUNNING

### Quick Description

This notebook **creates the environments needed to run the annotation error study experiments**.
- Environments are represented by subdirectories in the `/experiments` folder. A full path to an experiment environment is `/experiments/[DATASET]/[CLUSTERING]/[CONSTRAINTS_SELECTION]/[ERRORS_SIMULATION]`.
- Below the dataset level, the path is composed of A. the clustering used (cf. the convergence study results for the "best implementation"), B. the constraints selection (randomly selected), and C. the error rate (randomly simulated).
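
As a minimal sketch of this layout, the snippet below assembles an environment path from its four levels. The helper name `build_env_path` is hypothetical and not part of the notebook; the component values only follow the naming patterns used in the configuration cells.

```python
from pathlib import PurePosixPath


def build_env_path(root: str, dataset: str, clustering: str, constraints: str, errors: str) -> str:
    """Join the four environment levels into one path (hypothetical helper, for illustration only)."""
    return str(PurePosixPath(root, dataset, clustering, constraints, errors)) + "/"


# Example component names, following the naming patterns used later in this notebook.
example_path = build_env_path(
    root="../experiments",
    dataset="bank_cards_v2-size_1000-rand_1",
    clustering="simple_tfidf_kmeans_cop",
    constraints="nb_250-closest_1",
    errors="rate_0.05-rand_1-with_fix",
)
print(example_path)
```

Each level is a subdirectory of the previous one, so a child environment can reach its parent's files with a relative `../` prefix.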

At the beginning of the study, **run this notebook to set up the experiments you want**.

Then, **go to the notebook `2_Simulate_errors_and_run_clustering.ipynb` to run and evaluate each experiment you have set up**.

### Description of each step

- 2.1. **Set up `Dataset` environments**:
    - _Description_: Create a subdirectory, store the dataset parameters, and pre-format the dataset for the next computations.
    - _Setting_: A list defines the dataset environments to import; each configuration is read from a previous export.
    - _Folder content_:
        - `dict_of_texts.json`: texts from the dataset;
        - `dict_of_true_intents.json`: true intents from the dataset;
        - `config.json`: a json file with all parameters.
    - _Available datasets_:
        - [French trainset for chatbots dealing with usual requests on bank cards v1.0.0](http://doi.org/10.5281/zenodo.4769949)

- 2.2. **Set up `Clustering` environments**:
    - _Description_: Create a subdirectory, store parameters, then preprocess the dataset and vectorize the preprocessed dataset for the next computations.
    - _Setting_: A dictionary defines all possible configurations of preprocessing + vectorization + clustering environments.
    - _Folder content_:
        - `dict_of_preprocessed_texts.json`: preprocessed texts computed from `dict_of_texts.json`;
        - `config.json`: a JSON file with all preprocessing, vectorization and clustering parameters;
        - `dict_of_vectors.pkl`: vectors computed from `dict_of_preprocessed_texts.json`.
    - _Available preprocessing settings_:
        - enable preprocessing;
        - apply simple preprocessing (lowercase, punctuation, accent, whitespace);
        - apply filter on dependency parsing;
        - apply lemmatization;
        - and all combinations of the above.
    - _Available vectorization settings_:
        - TF-IDF vectorizer;
        - spaCy `fr_core_news_md` language model.
    - _Available clustering settings_:
        - apply constrained kmeans clustering (model COP);
        - apply constrained hierarchical clustering (several linkages);
        - apply constrained spectral clustering (model SPEC).

- 2.3. **Set up `Constraints selection` environments**:
    - _Description_: Create a subdirectory, store parameters with the random seed, and randomly select a set of constraints.
    - _Setting_: A dictionary defines all possible configurations of constraints selection environments.
    - _Folder content_:
        - `config.json`: a JSON file with all parameters;
        - `list_of_sampling.json`: all selected constraints.
    - _Available experiment settings_:
        - define the number of constraints;
        - define the type of constraints sampling;
        - define the random seed.

- 2.4. **Set up `Errors simulation` environments**:
    - _Description_: Create a subdirectory, store parameters with the random seed, and randomly simulate a set of annotation errors.
    - _Setting_: A dictionary defines all possible configurations of annotation errors simulation environments.
    - _Folder content_:
        - `config.json`: a JSON file with all parameters;
        - `list_of_errors.json`: all selected errors.
    - _Available experiment settings_:
        - define the rate of errors;
        - define the random seed;
        - enable or disable conflicts resolution.
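
The error simulation step can be sketched as follows, assuming errors are drawn as a seeded random sample of the selected constraints. The helper `simulate_errors` and the toy constraints are hypothetical. The sketch uses a local `random.Random` generator, whereas the notebook seeds the global `random` module.

```python
import random
from typing import List, Tuple


def simulate_errors(
    constraints: List[Tuple[str, str]], error_rate: float, random_seed: int
) -> List[Tuple[str, str]]:
    """Pick a fraction `error_rate` of the constraints to be treated as annotation errors (illustrative helper)."""
    rng = random.Random(random_seed)  # Local generator: does not touch the global random state.
    nb_errors = int(len(constraints) * error_rate)  # Truncated towards zero, as in the notebook.
    return rng.sample(constraints, k=nb_errors)


# Hypothetical constraints between made-up data IDs.
toy_constraints = [(str(i), str(i + 1)) for i in range(20)]

errors_a = simulate_errors(toy_constraints, error_rate=0.25, random_seed=1)
errors_b = simulate_errors(toy_constraints, error_rate=0.25, random_seed=1)
print(len(errors_a), errors_a == errors_b)
```

Fixing the seed makes the set of simulated errors reproducible, which is why the seed is part of each environment name and configuration.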

------------------------------
## 1. IMPORT PYTHON DEPENDENCIES

In [None]:
import os
import faker
import listing_envs
from typing import Any, Dict, List, Tuple
from scipy.sparse import csr_matrix
import pandas as pd
import json
import random
import pickle  # noqa: S403
from cognitivefactory.interactive_clustering.utils.preprocessing import (
    preprocess,
)
from cognitivefactory.interactive_clustering.utils.vectorization import (
    vectorize,
)
from cognitivefactory.interactive_clustering.constraints.factory import (
    managing_factory
)
from cognitivefactory.interactive_clustering.sampling.factory import (
    sampling_factory
)

------------------------------
## 2. CREATE COMPUTATION ENVIRONMENTS

------------------------------
### 2.1. Set `Dataset` subdirectories

Define environments with different `datasets`.

In [None]:
LIST_OF_DATASET_TO_IMPORT: List[str] = [
    #'bank_cards_v1'
] + [
    "bank_cards_v2-size_{size_str}-rand_{rand_str}".format(size_str=str(size), rand_str=str(rand))
    for size in range(1000, 5001, 500)
    for rand in [1]
] + [
    "mlsum_fr_train_subset_v1-size_{size_str}-rand_{rand_str}".format(size_str=str(size), rand_str=str(rand))
    for size in range(1000, 5001, 500)
    for rand in [1]
]
print("There are", len(LIST_OF_DATASET_TO_IMPORT), "datasets to import.")

Create `dataset` environments from the `LIST_OF_DATASET_TO_IMPORT` list, reading each configuration from a previous export in `../previous/`.
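
The loop in the next cell follows a create-if-missing pattern: an environment that already exists is skipped, otherwise its directory is created and its configuration is stored. The sketch below isolates that pattern in a temporary directory. The helper name `create_env_if_missing` and the toy configuration are hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict


def create_env_if_missing(env_path: str, config: Dict[str, Any]) -> bool:
    """Create `env_path` and store `config.json` in it; return False if it already exists."""
    if os.path.exists(env_path):
        return False  # Environment already set up: leave it untouched.
    os.mkdir(env_path)
    with open(os.path.join(env_path, "config.json"), "w") as config_file:
        json.dump(config, config_file)
    return True


# Demonstration in a throwaway directory (the notebook writes under `../experiments/` instead).
with tempfile.TemporaryDirectory() as tmp_root:
    env_path = os.path.join(tmp_root, "toy_dataset")
    first_call = create_env_if_missing(env_path, {"_ENV_NAME": "toy_dataset"})
    second_call = create_env_if_missing(env_path, {"_ENV_NAME": "ignored"})
    with open(os.path.join(env_path, "config.json"), "r") as config_file:
        stored_config = json.load(config_file)

print(first_call, second_call, stored_config)
```

A consequence of this pattern is that re-running the notebook never overwrites an existing environment; to regenerate one, its directory has to be deleted first.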

In [None]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for ENV_NAME_dataset in LIST_OF_DATASET_TO_IMPORT:
    
    # Get first export file (in sorted order, since `os.listdir` order is arbitrary) to get configs.
    first_export_file = sorted(os.listdir("../previous/"+ENV_NAME_dataset+"/"))[0]
    
    # Load export file.
    with open("../previous/"+ENV_NAME_dataset+"/"+first_export_file, "r") as file_d0:
        export_data: Dict[str, Any] = json.load(file_d0)

    ### ### ### ### ###
    ### CREATE AND CONFIGURE ENVIRONMENT.
    ### ### ### ### ###
    
    # get the configuration.
    CONFIG_dataset: Dict[str, Any] = export_data["dataset_config"]

    # Name the configuration.
    CONFIG_dataset["_ENV_NAME"] = ENV_NAME_dataset
    CONFIG_dataset["_ENV_PATH"] = "../experiments/" + ENV_NAME_dataset + "/"

    # Check if the environment already exists.
    if os.path.exists(str(CONFIG_dataset["_ENV_PATH"])):
        continue

    # Create directory for this environment.
    os.mkdir(str(CONFIG_dataset["_ENV_PATH"]))

    # Store configuration file.
    with open(str(CONFIG_dataset["_ENV_PATH"]) + "config.json", "w") as file_d1:
        json.dump(CONFIG_dataset, file_d1)

    ### ### ### ### ###
    ### STORE DATASET.
    ### ### ### ### ###

    # Store `dict_of_texts` and `dict_of_true_intents`.
    with open(str(CONFIG_dataset["_ENV_PATH"]) + "dict_of_texts.json", "w") as file_d2:
        json.dump(export_data["dict_of_texts"], file_d2)

    with open(
        str(CONFIG_dataset["_ENV_PATH"]) + "dict_of_true_intents.json", "w"
    ) as file_d3:
        json.dump(export_data["dict_of_true_intents"], file_d3)

# End
print("\n#####")
print("END - Dataset environments configuration.")

------------------------------
### 2.2. Set `algorithm (Preprocessing + Vectorization + Clustering)` subdirectories

Select the `dataset` environments in which to create `algorithm` environments.

In [None]:
# Get list of dataset environments.
LIST_OF_DATASET_ENVIRONMENTS: List[str] = [
    env
    for env in listing_envs.get_list_of_dataset_env_paths()
    if "bank_cards_v1" not in env
]
print(
    "There are",
    "`" + str(len(LIST_OF_DATASET_ENVIRONMENTS)) + "`",
    "created dataset environments in `../experiments`",
)
LIST_OF_DATASET_ENVIRONMENTS

Define environments with different uses of `algorithm`.

In [None]:
ENVIRONMENTS_FOR_ALGORITHM: Dict[str, Any] = {
    # best implementation to reach 80% of V-measure.
    "simple_tfidf_kmeans_cop": {
        "_TYPE": "algorithm",
        "_DESCRIPTION": "Simple preprocessing + TFIDF vectorization + KMeans clustering, {0} clusters, model 'COP'.",
        "preprocessing": {
            "apply_preprocessing": True,
            "apply_lemmatization": False,
            "apply_parsing_filter": False,
            "spacy_language_model": "fr_core_news_md",
        },
        "vectorization": {
            "vectorizer_type": "tfidf",
            "spacy_language_model": None,
        },
        "clustering": {
            "algorithm": "kmeans",
            "init**kargs": {
                "model": "COP",
                "max_iteration": 150,
                "tolerance": 1e-4,
            },
            "random_seed": 42,
            #"nb_clusters": None,
        },
    },
}

Create `preprocessing + vectorization + clustering` environments using `ENVIRONMENTS_FOR_ALGORITHM` configuration dictionary.

In [None]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for PARENT_ENV_PATH_dataset in LIST_OF_DATASET_ENVIRONMENTS:
    for (
        ENV_NAME_algorithm,
        CONFIG_algorithm,
    ) in ENVIRONMENTS_FOR_ALGORITHM.items():

        ### ### ### ### ###
        ### CREATE AND CONFIGURE ENVIRONMENT.
        ### ### ### ### ###

        # Set number of clusters from the dataset configuration.
        with open(PARENT_ENV_PATH_dataset + "config.json", "r") as file_a0:
            nb_clusters = json.load(file_a0)["nb_clusters"]

        # Work on a shallow copy, so the "{0}" placeholder of the shared description is kept for the next dataset.
        CONFIG_algorithm = dict(CONFIG_algorithm)

        # Name the configuration (the description needs `nb_clusters`, so it is loaded first).
        CONFIG_algorithm["_ENV_NAME"] = ENV_NAME_algorithm
        CONFIG_algorithm["_ENV_PATH"] = (
            PARENT_ENV_PATH_dataset + ENV_NAME_algorithm + "/"
        )
        CONFIG_algorithm["_DESCRIPTION"] = CONFIG_algorithm["_DESCRIPTION"].format(nb_clusters)
        CONFIG_algorithm["clustering"]["nb_clusters"] = nb_clusters

        # Check if the environment already exists.
        if os.path.exists(str(CONFIG_algorithm["_ENV_PATH"])):
            continue

        # Create directory for this environment.
        os.mkdir(str(CONFIG_algorithm["_ENV_PATH"]))

        # Store configuration file.
        with open(
            str(CONFIG_algorithm["_ENV_PATH"]) + "config.json", "w"
        ) as file_a1:
            json.dump(CONFIG_algorithm, file_a1)

        ### ### ### ### ###
        ### PREPROCESS DATASET.
        ### ### ### ### ###

        # Load dataset.
        with open(
            str(CONFIG_algorithm["_ENV_PATH"]) + "../dict_of_texts.json", "r"
        ) as file_a2:
            texts: Dict[str, str] = json.load(file_a2)

        dict_of_preprocessed_texts: Dict[str, str] = {}

        # Case with preprocessing.
        if bool(CONFIG_algorithm["preprocessing"]["apply_preprocessing"]):
            dict_of_preprocessed_texts = preprocess(
                dict_of_texts=texts,
                apply_lemmatization=bool(CONFIG_algorithm["preprocessing"]["apply_lemmatization"]),
                apply_parsing_filter=bool(CONFIG_algorithm["preprocessing"]["apply_parsing_filter"]),
                spacy_language_model=str(CONFIG_algorithm["preprocessing"]["spacy_language_model"]),
            )

        # Case without preprocessing.
        else:
            dict_of_preprocessed_texts = texts

        # Store preprocessed texts.
        with open(
            str(CONFIG_algorithm["_ENV_PATH"]) + "dict_of_preprocessed_texts.json",
            "w",
        ) as file_a3:
            json.dump(dict_of_preprocessed_texts, file_a3)
            
        ### ### ### ### ###
        ### VECTORIZE DATASET.
        ### ### ### ### ###

        # Vectorize dataset.
        dict_of_vectors: Dict[str, csr_matrix] = vectorize(
            dict_of_texts=dict_of_preprocessed_texts,
            vectorizer_type=str(CONFIG_algorithm["vectorization"]["vectorizer_type"]),
            spacy_language_model=str(CONFIG_algorithm["vectorization"]["spacy_language_model"]),
        )

        # Store vectors.
        with open(
            str(CONFIG_algorithm["_ENV_PATH"]) + "dict_of_vectors.pkl", "wb"
        ) as file_a4:
            pickle.dump(dict_of_vectors, file_a4)

# End
print("\n#####")
print("END - Algorithm environments configuration.")

------------------------------
### 2.3. Set `Constraints selection` subdirectories

Select the `algorithm` environments in which to create `constraints selection` environments.

In [None]:
# Get list of algorithm environments.
LIST_OF_ALGORITHM_ENVIRONMENTS: List[str] = [
    env
    for env in listing_envs.get_list_of_algorithm_env_paths()
    if "bank_cards_v1" not in env
]
print(
    "There are",
    "`" + str(len(LIST_OF_ALGORITHM_ENVIRONMENTS)) + "`",
    "created algorithm environments in `../experiments`",
)
LIST_OF_ALGORITHM_ENVIRONMENTS

Define environments with different uses of `constraints selection`.
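
The dictionary comprehension in the next cell expands a grid of constraint counts and random seeds into one environment per combination. As a worked check of its size and naming, the sketch below rebuilds only the keys and the two numeric settings; the variable names `nb_values`, `seeds` and `toy_environments` are illustrative.

```python
from typing import Any, Dict

# Same grid as in the notebook: 250, 500, ..., 30000 constraints, for seeds 1 and 2.
nb_values = list(range(250, 30001, 250))
seeds = [1, 2]

toy_environments: Dict[str, Dict[str, Any]] = {
    "nb_{nb_str}-closest_{rand_str}".format(nb_str=nb, rand_str=rand): {
        "nb_constraints": nb,
        "random_seed": rand,
    }
    for nb in nb_values
    for rand in seeds
}

env_names = list(toy_environments)
print(len(nb_values), len(toy_environments))
print(env_names[:3])
```

With 120 constraint counts and 2 seeds, each algorithm environment receives 240 constraints selection subdirectories.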

In [None]:
ENVIRONMENTS_FOR_CONSTRAINTS_SELECTION: Dict[str, Dict[str, Any]] = {
    # previous selection from "1_convergence_study" experiments with closest neighbors in different clusters sampling.
    "nb_{nb_str}-closest_{rand_str}".format(nb_str=nb, rand_str=rand): {
            "_TYPE": "constraints_selection",
            "_DESCRIPTION": "Selection with closest neighbors in different clusters of {nb_str} constraints (random seed at {rand_str})".format(nb_str=nb, rand_str=rand),
            "sampling": "closest_in_different_clusters",
            "nb_constraints": nb,
            "random_seed": rand,
    }
    for nb in range(250, 30001, 250)
    for rand in [1, 2,]
}
len(ENVIRONMENTS_FOR_CONSTRAINTS_SELECTION)

Create `constraints selection` environments using `ENVIRONMENTS_FOR_CONSTRAINTS_SELECTION` configuration dictionary.
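
For the `closest_in_different_clusters` case, the next cell flattens the constraints annotated in a previous export, iteration by iteration, and keeps only the first `nb_constraints` pairs. A minimal sketch of that step is given below. The helper `flatten_and_truncate`, the iteration keys and the annotation values are hypothetical toy data, not content of a real export.

```python
from typing import Dict, List, Tuple

# Hypothetical excerpt of a previous export: iteration ID -> annotated pairs (ID, ID, constraint type).
toy_previous_annotations: Dict[str, List[Tuple[str, str, str]]] = {
    "0000": [("3", "7", "MUST_LINK"), ("1", "4", "CANNOT_LINK")],
    "0001": [("2", "9", "CANNOT_LINK"), ("5", "6", "MUST_LINK")],
}


def flatten_and_truncate(
    annotations: Dict[str, List[Tuple[str, str, str]]], nb_constraints: int
) -> List[Tuple[str, str]]:
    """Collect the (ID, ID) pairs in iteration order, dropping the annotation, and keep the first ones."""
    pairs: List[Tuple[str, str]] = []
    for iteration in annotations:
        for sample in annotations[iteration]:
            pairs.append((sample[0], sample[1]))
    return pairs[:nb_constraints]


selected_pairs = flatten_and_truncate(toy_previous_annotations, nb_constraints=3)
print(selected_pairs)
```

Only the pair of data IDs is kept: the annotation itself is dropped at this stage, and if fewer pairs are available than requested, the slice simply returns all of them.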

In [None]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for PARENT_ENV_PATH_algorithm in LIST_OF_ALGORITHM_ENVIRONMENTS:
    for (
        ENV_NAME_constraints_selection,
        CONFIG_constraints_selection,
    ) in ENVIRONMENTS_FOR_CONSTRAINTS_SELECTION.items():

        ### ### ### ### ###
        ### CREATE AND CONFIGURE ENVIRONMENT.
        ### ### ### ### ###
        
        # Get dataset name and number of clusters from the dataset configuration.
        with open(PARENT_ENV_PATH_algorithm + "../config.json", "r") as file_cs0:
            config_dataset = json.load(file_cs0)
            dataset_name: str = config_dataset["_ENV_NAME"]
            nb_clusters: int = config_dataset["nb_clusters"]

        # Name the configuration.
        CONFIG_constraints_selection["_ENV_NAME"] = ENV_NAME_constraints_selection
        CONFIG_constraints_selection["_ENV_PATH"] = (
            PARENT_ENV_PATH_algorithm + ENV_NAME_constraints_selection + "/"
        )

        # Add the previous sampling file if needed.
        # NB: the configuration holds no "previous_sampling_file" template, so the path is built here directly.
        if CONFIG_constraints_selection["sampling"] == "closest_in_different_clusters":
            CONFIG_constraints_selection["previous_sampling_file"] = "../previous/{dataset_str}/simple_prep_-_tfidf_-_closest-50_-_kmeans_COP-{nb_cluster_str}c_-_{rand_str}.json".format(
                dataset_str=str(dataset_name),
                nb_cluster_str=str(nb_clusters),
                rand_str=str(CONFIG_constraints_selection["random_seed"]).zfill(4),
            )

        # Check if the environment already exists.
        if os.path.exists(str(CONFIG_constraints_selection["_ENV_PATH"])):
            continue

        # Create directory for this environment.
        os.mkdir(str(CONFIG_constraints_selection["_ENV_PATH"]))

        # Store configuration file.
        with open(
            str(CONFIG_constraints_selection["_ENV_PATH"]) + "config.json", "w"
        ) as file_cs1:
            json.dump(CONFIG_constraints_selection, file_cs1)

        ### ### ### ### ###
        ### SELECT CONSTRAINTS.
        ### ### ### ### ###
        
        list_of_sampling: List[Tuple[str, str]] = []
        
        ###
        ### Case of "closest_in_different_clusters"
        ###
        if CONFIG_constraints_selection["sampling"] == "closest_in_different_clusters":

            # Load previously sampled constraints (from "1_convergence_study" experiments).
            with open(
                str(CONFIG_constraints_selection["previous_sampling_file"]), "r"
            ) as file_cs2:
                dict_of_previous_sampling: Dict[str, List[Tuple[str, str, str]]] = json.load(file_cs2)["dict_of_constraints_annotations"]

            # Get constraints.
            for iteration in dict_of_previous_sampling.keys():
                for sample in dict_of_previous_sampling[iteration]:
                    list_of_sampling.append([sample[0], sample[1]])

            # Limit constraints size.
            list_of_sampling = list_of_sampling[:CONFIG_constraints_selection["nb_constraints"]]

        ###
        ### Case of "random"
        ###
        else:  # if CONFIG_constraints_selection["sampling"] == "random":

            # Load preprocessed dataset.
            with open(
                str(CONFIG_constraints_selection["_ENV_PATH"]) + "../dict_of_preprocessed_texts.json", "r"
            ) as file_cs3:
                dict_of_preprocessed_texts: Dict[str, str] = json.load(file_cs3)

            # Randomly select constraints.
            list_of_sampling = sampling_factory(
                algorithm="random",
                random_seed=CONFIG_constraints_selection["random_seed"],
            ).sample(
                constraints_manager=managing_factory(
                    manager="binary",
                    list_of_data_IDs=list(dict_of_preprocessed_texts.keys()),
                ),
                nb_to_select=CONFIG_constraints_selection["nb_constraints"],
            )

        ###
        ### Store constraints selected.
        ###
        with open(
            str(CONFIG_constraints_selection["_ENV_PATH"]) + "list_of_sampling.json",
            "w",
        ) as file_cs4:
            json.dump(list_of_sampling, file_cs4)

# End
print("\n#####")
print("END - Constraints selection environments configuration.")

------------------------------
### 2.4. Set `errors simulation` subdirectories

Select the `constraints selection` environments in which to create `errors simulation` environments.

In [None]:
# Get list of constraints selection environments.
LIST_OF_CONSTRAINTS_SELECTION_ENVIRONMENTS: List[str] = [
    env
    for env in listing_envs.get_list_of_constraints_selection_env_paths()
    if "bank_cards_v1" not in env
]
print(
    "There are",
    "`" + str(len(LIST_OF_CONSTRAINTS_SELECTION_ENVIRONMENTS)) + "`",
    "created constraints selection environments in `../experiments`",
)
LIST_OF_CONSTRAINTS_SELECTION_ENVIRONMENTS

Define environments with different uses of `errors simulation`.
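
The environment names built in the next cell encode the error rate, the random seed and the conflicts fix flag. As a worked example of that naming, the sketch below regenerates only the names for the active grid; the list names `rates`, `seeds`, `fixes` and `error_env_names` are illustrative.

```python
# Active grid in the notebook: 6 error rates, 2 random seeds, conflicts fix always enabled.
rates = [0.0, 0.05, 0.10, 0.15, 0.20, 0.25]
seeds = [1, 2]
fixes = [True]

error_env_names = [
    "rate_{rate_str:.2f}-rand_{rand_str}-{fix_str}".format(
        rate_str=rate,
        rand_str=rand,
        fix_str=("with_fix" if fix else "without_fix"),
    )
    for rate in rates
    for rand in seeds
    for fix in fixes
]

print(len(error_env_names))
print(error_env_names[0], error_env_names[-1])
```

The `.2f` format gives every rate the same width, so the directory names sort in the same order as the rates.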

In [None]:
ENVIRONMENTS_FOR_ERRORS_SIMULATION: Dict[str, Dict[str, Any]] = {
    "rate_{rate_str:.2f}-rand_{rand_str}-{fix_str}".format(
        rate_str=rate,
        rand_str=rand,
        fix_str=("with_fix" if fix else "without_fix"),
    ): {
        "_TYPE": "errors_simulation",
        "_DESCRIPTION": "Random simulation of {rate_str:2d}% errors (random seed at {rand_str}) {fix_str}.".format(
            rate_str=int(100*rate),
            rand_str=rand,
            fix_str=("with conflicts fix" if fix else "without conflicts fix")
        ),
        "error_rate": rate,
        "random_seed": rand,
        "with_fix": fix,
    }
    #for rate in [0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50,]
    for rate in [0.0, 0.05, 0.10, 0.15, 0.20, 0.25,]
    for rand in [1, 2,]
    for fix in [True,]
}
ENVIRONMENTS_FOR_ERRORS_SIMULATION

Create `errors simulation` environments using `ENVIRONMENTS_FOR_ERRORS_SIMULATION` configuration dictionary.

In [None]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for PARENT_ENV_PATH_constraints_selection in LIST_OF_CONSTRAINTS_SELECTION_ENVIRONMENTS:
    for (
        ENV_NAME_errors_simulation,
        CONFIG_errors_simulation,
    ) in ENVIRONMENTS_FOR_ERRORS_SIMULATION.items():

        ### ### ### ### ###
        ### CREATE AND CONFIGURE ENVIRONMENT.
        ### ### ### ### ###

        # Name the configuration.
        CONFIG_errors_simulation["_ENV_NAME"] = ENV_NAME_errors_simulation
        CONFIG_errors_simulation["_ENV_PATH"] = (
            PARENT_ENV_PATH_constraints_selection + ENV_NAME_errors_simulation + "/"
        )

        # Check if the environment already exists.
        if os.path.exists(str(CONFIG_errors_simulation["_ENV_PATH"])):
            continue

        # Create directory for this environment.
        os.mkdir(str(CONFIG_errors_simulation["_ENV_PATH"]))

        # Store configuration file.
        with open(
            str(CONFIG_errors_simulation["_ENV_PATH"]) + "config.json", "w"
        ) as file_p1:
            json.dump(CONFIG_errors_simulation, file_p1)

        ### ### ### ### ###
        ### SELECT ERRORS.
        ### ### ### ### ###

        # Load constraints selected.
        with open(
            str(CONFIG_errors_simulation["_ENV_PATH"]) + "../list_of_sampling.json", "r"
        ) as file_p2:
            list_of_sampling: List[Tuple[str, str]] = json.load(file_p2)
                
        # Randomly select errors.
        random.seed(CONFIG_errors_simulation["random_seed"])
        list_of_errors: List[Tuple[str, str]] = random.sample(
            list_of_sampling,
            k=int(len(list_of_sampling)*CONFIG_errors_simulation["error_rate"])
        )

        # Store error selected.
        with open(
            str(CONFIG_errors_simulation["_ENV_PATH"]) + "list_of_errors.json",
            "w",
        ) as file_p3:
            json.dump(list_of_errors, file_p3)

# End
print("\n#####")
print("END - Errors simulation environments configuration.")

------------------------------
## 3. GET ALL CREATED ENVIRONMENTS

In [None]:
# Get list of experiment environments.
LIST_OF_EXPERIMENT_ENVIRONMENTS: List[str] = [
    env
    for env in listing_envs.get_list_of_errors_simulation_env_paths()
    if "bank_cards_v1" not in env
]
print(
    "There are",
    "`" + str(len(LIST_OF_EXPERIMENT_ENVIRONMENTS)) + "`",
    "created experiment environments in `../experiments`",
)
#LIST_OF_EXPERIMENT_ENVIRONMENTS