# ==== INTERACTIVE CLUSTERING : ANNOTATION SUBJECTIVITY STUDY ====
> ### Stage 1 : Initialize computation environments for experiments.

------------------------------
## READ-ME BEFORE RUNNING

### Quick Description

This notebook **creates the environments needed to run the annotation subjectivity study experiments**.
- Environments are represented by subdirectories in the `/experiments` folder. A full path to an experiment environment is `/experiments/[DATASET]/[ALGORITHM]/[ERRORS]/[CONSTRAINTS_SELECTION]`.
- The path is composed of A. the dataset, B. the algorithm used (preprocessing, vectorization and clustering; cf. convergence study results to get the "best implementation"), C. the error rate (errors are randomly picked), and D. the constraints selection (reuse previous selections without error, then closest in different clusters).
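
The path layout can be sketched as follows. This is a minimal illustration, not code from the notebook: the helper `build_env_path` is hypothetical, the root is assumed to be `../experiments` as in the cells of section 2, and the nesting (dataset, then algorithm, then errors, then constraints selection) follows the order in which those cells create directories.

```python
import posixpath


def build_env_path(dataset, algorithm, errors, constraints_selection, root="../experiments"):
    """Join the four environment levels into one experiment path (hypothetical helper)."""
    return posixpath.join(root, dataset, algorithm, errors, constraints_selection) + "/"


# Example names follow the naming patterns used in section 2.
example_path = build_env_path(
    dataset="bank_cards_v2-size_1000-rand_1",
    algorithm="simple_tfidf_kmeans_cop",
    errors="rate_0.05-rand_1-with_fix",
    constraints_selection="closest_250_1",
)
print(example_path)
```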

At the beginning of the study, **run this notebook to set up the experiments you want**.

Then, **go to the notebook `2_Simulate_errors_and_run_clustering.ipynb` to run and evaluate each experiment you have set up**.

### Description of each step

- 2.1. **Set up `Dataset` environments**:
    - _Description_: Create a subdirectory, store parameters for the dataset and pre-format the dataset for next computations.
    - _Setting_: A dictionary defines all possible configurations of dataset environments.
        - dataset size;
        - number of clusters;
        - Excel columns to use;
        - random seed for faker generation.
    - _Folder content_:
        - `dict_of_texts.json`: texts from the dataset;
        - `dict_of_true_intents.json`: true intents from the dataset;
        - `config.json`: a json file with all parameters.
    - _Available datasets_:
        - [French trainset for chatbots dealing with usual requests on bank cards v2.0.0](http://doi.org/10.5281/zenodo.4769949)
        - [MLSUM: The Multilingual Summarization Corpus](https://arxiv.org/abs/2004.14900v1), subsetted and filtered by SCHILD E. (v1.0.0).
     
- 2.2. **Set up `Algorithm` environments**:
    - _Description_: Create a subdirectory, store parameters for algorithms, then preprocess the dataset and vectorize the preprocessed dataset for next computations.
    - _Setting_: A dictionary defines all possible preprocessing + vectorization + clustering environments.
        - _preprocessing_: simple_prep;
        - _vectorization_: tfidf;
        - _clustering_: kmeans_cop.
    - _Folder content_:
        - `dict_of_preprocessed_texts.json`: preprocessed texts computed from `dict_of_texts.json`;
        - `dict_of_vectors.pkl`: vectors computed from `dict_of_preprocessed_texts.json`;
        - `config.json`: a json file with all parameters.

- 2.3. **Set up `Errors` environments**:
    - _Description_: Create a subdirectory, store parameters with the random seed for error simulation.
    - _Setting_: A dictionary defines all possible configurations of annotation error simulation environments.
        - define the error rate;
        - define the random seed for simulation;
        - define the conflict resolution method.
    - _Folder content_:
        - `config.json`: a json file with all parameters.

- 2.4. **Set up `Constraints selection` environments**:
    - _Description_: Create a subdirectory, store parameters with the random seed and select constraints from previous experiments.
    - _Setting_: A dictionary defines all possible configurations of constraints selection environments.
        - define the max number of constraints;
        - define the random seed for selection;
        - reuse the sampling from previous experiments, then continue with closest-in-different-clusters sampling.
    - _Folder content_:
        - `config.json`: a json file with all parameters.
        - `previous_sampling.json`: all constraints sampled in previous experiments.
        - `dict_of_samplings.json`: all selected constraints.
        - `dict_of_errors.json`: all selected errors.
        - `dict_of_constraints_effective.json`: all effective constraints (after conflict resolution).
        - `dict_of_clustering_results.json`: all clustering results.
        - `dict_of_clustering_performances.json`: all clustering performances.
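
The folder contents listed above can be turned into a quick completeness check. The sketch below is only an illustration under stated assumptions: `EXPECTED_FILES` and `missing_files` are hypothetical names that do not exist in the notebook or in `listing_envs`, and the demonstration runs on a temporary directory rather than on `../experiments`.

```python
import os
import tempfile
from typing import Dict, List

# Expected files per environment level, copied from the "Folder content" lists.
EXPECTED_FILES: Dict[str, List[str]] = {
    "dataset": ["dict_of_texts.json", "dict_of_true_intents.json", "config.json"],
    "algorithm": ["dict_of_preprocessed_texts.json", "dict_of_vectors.pkl", "config.json"],
    "errors": ["config.json"],
    "constraints_selection": [
        "config.json",
        "previous_sampling.json",
        "dict_of_samplings.json",
        "dict_of_errors.json",
        "dict_of_constraints_effective.json",
        "dict_of_clustering_results.json",
        "dict_of_clustering_performances.json",
    ],
}


def missing_files(env_path: str, level: str) -> List[str]:
    """Return the expected files that are absent from `env_path` (hypothetical helper)."""
    return [
        name
        for name in EXPECTED_FILES[level]
        if not os.path.isfile(os.path.join(env_path, name))
    ]


# Demonstration on a throwaway directory that only contains `config.json`.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "config.json"), "w") as file_demo:
        file_demo.write("{}")
    demo_missing = missing_files(tmp_dir, "dataset")
print(demo_missing)
```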

------------------------------
## 1. IMPORT PYTHON DEPENDENCIES

In [None]:
import os
import faker
import listing_envs
from typing import Any, Dict, List, Tuple
from scipy.sparse import csr_matrix
import pandas as pd
import json
import pickle  # noqa: S403
from cognitivefactory.interactive_clustering.utils.preprocessing import (
    preprocess,
)
from cognitivefactory.interactive_clustering.utils.vectorization import (
    vectorize,
)

------------------------------
## 2. CREATE COMPUTATION ENVIRONMENTS

------------------------------
### 2.1. Set `Dataset` subdirectories

Define environments with different `datasets`.

In [None]:
LIST_OF_DATASET_TO_IMPORT: List[str] = [
    #'bank_cards_v1'
] + [
    "bank_cards_v2-size_{size_str}-rand_{rand_str}".format(size_str=str(size), rand_str=str(rand))
    for size in range(1000, 5001, 500)
    for rand in [1]
] + [
    "mlsum_fr_train_subset_v1-size_{size_str}-rand_{rand_str}".format(size_str=str(size), rand_str=str(rand))
    for size in range(1000, 5001, 500)
    for rand in [1]
]
print("There are", len(LIST_OF_DATASET_TO_IMPORT), "datasets to import.")

Create `dataset` environments using the `LIST_OF_DATASET_TO_IMPORT` list and the exports stored in `../previous/`.
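
Every creation loop in this notebook follows the same create-if-missing pattern: skip the environment when its directory already exists, otherwise create the directory and write `config.json`. A minimal sketch of that pattern, assuming a hypothetical helper `create_env` and a temporary directory in place of `../experiments`:

```python
import json
import os
import tempfile
from typing import Any, Dict


def create_env(env_path: str, config: Dict[str, Any]) -> bool:
    """Create `env_path` and write `config.json`; return False if it already exists."""
    if os.path.exists(env_path):
        return False  # Existing environments are left untouched.
    os.mkdir(env_path)
    with open(os.path.join(env_path, "config.json"), "w") as file_config:
        json.dump(config, file_config)
    return True


with tempfile.TemporaryDirectory() as tmp_root:
    demo_path = os.path.join(tmp_root, "demo_dataset")
    first_call = create_env(demo_path, {"_ENV_NAME": "demo_dataset"})
    second_call = create_env(demo_path, {"_ENV_NAME": "changed"})
    # The second call is skipped, so the stored configuration is the first one.
    with open(os.path.join(demo_path, "config.json"), "r") as file_check:
        stored_name = json.load(file_check)["_ENV_NAME"]
print(first_call, second_call, stored_name)
```

One consequence of this pattern is that re-running the notebook with a modified configuration does not update an environment that already exists; its directory has to be removed first.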

In [None]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for ENV_NAME_dataset in LIST_OF_DATASET_TO_IMPORT:
    
    # Get the first export file (sorted, so the choice is deterministic) to get configs.
    first_export_file = sorted(os.listdir("../previous/"+ENV_NAME_dataset+"/"))[0]
    
    # Load export file.
    with open("../previous/"+ENV_NAME_dataset+"/"+first_export_file, "r") as file_d0:
        export_data: Dict[str, Any] = json.load(file_d0)

    ### ### ### ### ###
    ### CREATE AND CONFIGURE ENVIRONMENT.
    ### ### ### ### ###
    
    # Get the configuration.
    CONFIG_dataset: Dict[str, Any] = export_data["dataset_config"]

    # Name the configuration.
    CONFIG_dataset["_ENV_NAME"] = ENV_NAME_dataset
    CONFIG_dataset["_ENV_PATH"] = "../experiments/" + ENV_NAME_dataset + "/"

    # Check if the environment already exists.
    if os.path.exists(str(CONFIG_dataset["_ENV_PATH"])):
        continue

    # Create directory for this environment.
    os.mkdir(str(CONFIG_dataset["_ENV_PATH"]))

    # Store configuration file.
    with open(str(CONFIG_dataset["_ENV_PATH"]) + "config.json", "w") as file_d1:
        json.dump(CONFIG_dataset, file_d1)

    ### ### ### ### ###
    ### STORE DATASET.
    ### ### ### ### ###

    # Store `dict_of_texts` and `dict_of_true_intents`.
    with open(str(CONFIG_dataset["_ENV_PATH"]) + "dict_of_texts.json", "w") as file_d2:
        json.dump(export_data["dict_of_texts"], file_d2)

    with open(
        str(CONFIG_dataset["_ENV_PATH"]) + "dict_of_true_intents.json", "w"
    ) as file_d3:
        json.dump(export_data["dict_of_true_intents"], file_d3)

# End
print("\n#####")
print("END - Dataset environments configuration.")

------------------------------
### 2.2. Set `algorithm (Preprocessing + Vectorization + Clustering)` subdirectories

Select the `dataset` environments in which to create `algorithm` environments.

In [None]:
# Get list of dataset environments.
LIST_OF_DATASET_ENVIRONMENTS: List[str] = listing_envs.get_list_of_dataset_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_DATASET_ENVIRONMENTS)) + "`",
    "created dataset environments in `../experiments`",
)
LIST_OF_DATASET_ENVIRONMENTS

Define environments with different uses of `algorithm`.

In [None]:
ENVIRONMENTS_FOR_ALGORITHM: Dict[str, Any] = {
    # best implementation to reach 80% of V-measure.
    "simple_tfidf_kmeans_cop": {
        "_TYPE": "algorithm",
        "_DESCRIPTION": "Simple preprocessing + TFIDF vectorization + KMeans clustering.",
        "preprocessing": {
            "_NAME": "simple_prep",
            "apply_preprocessing": True,
            "apply_lemmatization": False,
            "apply_parsing_filter": False,
            "spacy_language_model": "fr_core_news_md",
        },
        "vectorization": {
            "_NAME": "tfidf",
            "vectorizer_type": "tfidf",
            "spacy_language_model": None,
        },
        "clustering": {
            "_TEMPNAME": "kmeans_COP-{0}c",
            "algorithm": "kmeans",
            "init**kargs": {
                "model": "COP",
                "max_iteration": 150,
                "tolerance": 1e-4,
            },
            "random_seed": 42,
            # "nb_clusters": None,  # defined by the dataset.
        },
        "previous_sampling": {
            "_NAME": "closest-50",
            "algorithm": "closest_in_different_clusters",
            "nb_to_select": 50,
            # "random_seed": None,  # defined by the constraints selection.
        },
    },
}

Create `preprocessing + vectorization + clustering` environments using `ENVIRONMENTS_FOR_ALGORITHM` configuration dictionary.
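
The number of clusters is read from the dataset configuration, so the `clustering` entry of the template is completed once per dataset. Below is a minimal sketch of that step, assuming a reduced template and hypothetical cluster counts (10 and 25); `resolve_clustering` is an illustrative helper, not part of the notebook. It relies on `copy.deepcopy`, since a shallow `dict.copy()` would leave the nested `clustering` dictionary shared with the template.

```python
import copy
from typing import Any, Dict

# Reduced template: only the keys needed for the illustration.
template: Dict[str, Any] = {
    "clustering": {"_TEMPNAME": "kmeans_COP-{0}c", "algorithm": "kmeans"},
}


def resolve_clustering(config: Dict[str, Any], nb_clusters: int) -> Dict[str, Any]:
    """Return an independent copy of `config` with the cluster count filled in."""
    resolved = copy.deepcopy(config)  # Nested dictionaries are copied too.
    resolved["clustering"]["nb_clusters"] = nb_clusters
    resolved["clustering"]["_NAME"] = resolved["clustering"]["_TEMPNAME"].format(nb_clusters)
    return resolved


config_10 = resolve_clustering(template, nb_clusters=10)
config_25 = resolve_clustering(template, nb_clusters=25)
print(config_10["clustering"]["_NAME"], config_25["clustering"]["_NAME"])
```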

In [None]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for PARENT_ENV_PATH_dataset in LIST_OF_DATASET_ENVIRONMENTS:
    for (
        ENV_NAME_algorithm,
        CONFIG_algorithm2,
    ) in ENVIRONMENTS_FOR_ALGORITHM.items():
        
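        # Make a deep copy of the configuration (JSON round-trip), so that nested
        # dictionaries are not shared between dataset environments.
        CONFIG_algorithm: Dict[str, Any] = json.loads(json.dumps(CONFIG_algorithm2))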

        ### ### ### ### ###
        ### CREATE AND CONFIGURE ENVIRONMENT.
        ### ### ### ### ###

        # Name the configuration.
        CONFIG_algorithm["_ENV_NAME"] = ENV_NAME_algorithm
        CONFIG_algorithm["_ENV_PATH"] = (
            PARENT_ENV_PATH_dataset + ENV_NAME_algorithm + "/"
        )
        
        # Set number of clusters from the dataset configuration.
        with open(PARENT_ENV_PATH_dataset + "config.json", "r") as file_a0:
            nb_clusters = json.load(file_a0)["nb_clusters"]
        CONFIG_algorithm["clustering"]["nb_clusters"] = nb_clusters
        CONFIG_algorithm["clustering"]["_NAME"] = CONFIG_algorithm["clustering"]["_TEMPNAME"].format(nb_clusters)

        # Check if the environment already exists.
        if os.path.exists(str(CONFIG_algorithm["_ENV_PATH"])):
            continue

        # Create directory for this environment.
        os.mkdir(str(CONFIG_algorithm["_ENV_PATH"]))

        # Store configuration file.
        with open(
            str(CONFIG_algorithm["_ENV_PATH"]) + "config.json", "w"
        ) as file_a1:
            json.dump(CONFIG_algorithm, file_a1)

        ### ### ### ### ###
        ### PREPROCESS DATASET.
        ### ### ### ### ###

        # Load dataset.
        with open(
            str(CONFIG_algorithm["_ENV_PATH"]) + "../dict_of_texts.json", "r"
        ) as file_a2:
            texts: Dict[str, str] = json.load(file_a2)

        dict_of_preprocessed_texts: Dict[str, str] = {}

        # Case with preprocessing.
        if bool(CONFIG_algorithm["preprocessing"]["apply_preprocessing"]):
            dict_of_preprocessed_texts = preprocess(
                dict_of_texts=texts,
                apply_lemmatization=bool(CONFIG_algorithm["preprocessing"]["apply_lemmatization"]),
                apply_parsing_filter=bool(CONFIG_algorithm["preprocessing"]["apply_parsing_filter"]),
                spacy_language_model=str(CONFIG_algorithm["preprocessing"]["spacy_language_model"]),
            )

        # Case without preprocessing.
        else:
            dict_of_preprocessed_texts = texts

        # Store preprocessed texts.
        with open(
            str(CONFIG_algorithm["_ENV_PATH"]) + "dict_of_preprocessed_texts.json",
            "w",
        ) as file_a3:
            json.dump(dict_of_preprocessed_texts, file_a3)
            
        ### ### ### ### ###
        ### VECTORIZE DATASET.
        ### ### ### ### ###

        # Vectorize dataset.
        dict_of_vectors: Dict[str, csr_matrix] = vectorize(
            dict_of_texts=dict_of_preprocessed_texts,
            vectorizer_type=str(CONFIG_algorithm["vectorization"]["vectorizer_type"]),
            spacy_language_model=str(CONFIG_algorithm["vectorization"]["spacy_language_model"]),
        )

        # Store vectors.
        with open(
            str(CONFIG_algorithm["_ENV_PATH"]) + "dict_of_vectors.pkl", "wb"
        ) as file_a4:
            pickle.dump(dict_of_vectors, file_a4)

# End
print("\n#####")
print("END - Algorithm environments configuration.")

------------------------------
### 2.3. Set `Error` subdirectories

Select the `algorithm` environments in which to create `error` environments.

In [None]:
# Get list of algorithm environments.
LIST_OF_ALGORITHM_ENVIRONMENTS: List[str] = listing_envs.get_list_of_algorithm_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_ALGORITHM_ENVIRONMENTS)) + "`",
    "created algorithm environments in `../experiments`",
)
LIST_OF_ALGORITHM_ENVIRONMENTS

Define environments with different uses of `error`.

In [None]:
ENVIRONMENTS_FOR_ERRORS: Dict[str, Dict[str, Any]] = {
    "rate_{rate_str:.2f}-rand_{rand_str}-{fix_str}".format(
        rate_str=rate,
        rand_str=rand,
        fix_str=("with_fix" if fix else "without_fix"),
    ): {
        "_TYPE": "errors_simulation",
        "_DESCRIPTION": "Random simulation of {rate_str:2d}% errors (random seed at {rand_str}) {fix_str}.".format(
            rate_str=int(round(100 * rate)),
            rand_str=rand,
            fix_str=("with conflicts fix" if fix else "without conflicts fix")
        ),
        "error_rate": rate,
        "random_seed": rand,
        "with_fix": fix,
    }
    for rate in [0.0, 0.05, 0.10, 0.15, 0.20, 0.25,]
    for rand in [1, 2,]
    for fix in [True,]
}
ENVIRONMENTS_FOR_ERRORS

Create `errors` environments using `ENVIRONMENTS_FOR_ERRORS` configuration dictionary.

In [None]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for PARENT_ENV_PATH_algorithm in LIST_OF_ALGORITHM_ENVIRONMENTS:
    for (
        ENV_NAME_errors,
        CONFIG_errors,
    ) in ENVIRONMENTS_FOR_ERRORS.items():

        ### ### ### ### ###
        ### CREATE AND CONFIGURE ENVIRONMENT.
        ### ### ### ### ###

        # Name the configuration.
        CONFIG_errors["_ENV_NAME"] = ENV_NAME_errors
        CONFIG_errors["_ENV_PATH"] = (
            PARENT_ENV_PATH_algorithm + ENV_NAME_errors + "/"
        )

        # Check if the environment already exists.
        if os.path.exists(str(CONFIG_errors["_ENV_PATH"])):
            continue

        # Create directory for this environment.
        os.mkdir(str(CONFIG_errors["_ENV_PATH"]))

        # Store configuration file.
        with open(
            str(CONFIG_errors["_ENV_PATH"]) + "config.json", "w"
        ) as file_e1:
            json.dump(CONFIG_errors, file_e1)

# End
print("\n#####")
print("END - Errors environments configuration.")

------------------------------
### 2.4. Set `Constraints selection` subdirectories

Select the `error` environments in which to create `constraints selection` environments.

In [None]:
# Get list of error environments.
LIST_OF_ERROR_ENVIRONMENTS: List[str] = listing_envs.get_list_of_error_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_ERROR_ENVIRONMENTS)) + "`",
    "created error environments in `../experiments`",
)
LIST_OF_ERROR_ENVIRONMENTS

In [None]:
ENVIRONMENTS_FOR_CONSTRAINTS_SELECTION: Dict[str, Dict[str, Any]] = {
    # previous selection from "1_convergence_study" experiments with closest neighbors in different clusters sampling.
    "closest_250_{rand_str}".format(rand_str=rand): {
        "_TYPE": "constraints_selection",
        "_DESCRIPTION": "Selection from previous sampling then with closest neighbors in different clusters of 250 constraints (random seed at {rand_str})".format(rand_str=rand),
        # "previous_file": None,
        "constraints_step": 250,  # like nb_to_select
        "next_sampling": {
            "algorithm": "custom",
            "init**kargs": {
                "clusters_restriction": "different_clusters",  # like "closest_in_different_clusters".
                "distance_restriction": "closest_neighbors",  # like "closest_in_different_clusters".
                "without_added_constraints": True,
                "without_inferred_constraints": False,  # but without inference "closest_in_different_clusters".
            },
        },
        "random_seed": rand,
    }
    for rand in [1, 2,]
}
len(ENVIRONMENTS_FOR_CONSTRAINTS_SELECTION)

Create `constraints selection` environments using `ENVIRONMENTS_FOR_CONSTRAINTS_SELECTION` configuration dictionary.
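
Each constraints selection environment points to one export of the previous experiments, whose file name is assembled from the dataset, algorithm, and seed names. A minimal sketch of that naming scheme, assuming a hypothetical helper `build_previous_file` and a hypothetical cluster count of 10 in the clustering name:

```python
def build_previous_file(
    dataset_name: str,
    preprocessing_name: str,
    vectorization_name: str,
    sampling_name: str,
    clustering_name: str,
    random_seed: int,
) -> str:
    """Assemble the path of a previous export (hypothetical helper)."""
    return "../previous/{0}/{1}_-_{2}_-_{3}_-_{4}_-_{5}.json".format(
        dataset_name,
        preprocessing_name,
        vectorization_name,
        sampling_name,
        clustering_name,
        str(random_seed).zfill(4),  # The seed is zero-padded to four digits.
    )


example_file = build_previous_file(
    dataset_name="bank_cards_v2-size_1000-rand_1",
    preprocessing_name="simple_prep",
    vectorization_name="tfidf",
    sampling_name="closest-50",
    clustering_name="kmeans_COP-10c",  # Cluster count assumed for the example.
    random_seed=1,
)
print(example_file)
```

Checking `os.path.isfile(example_file)` before creating the directory may be worthwhile: if the export is missing, the load fails after the directory and `config.json` have been written, and later runs would then skip the incomplete environment.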

In [None]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for PARENT_ENV_PATH_error in LIST_OF_ERROR_ENVIRONMENTS:
    for (
        ENV_NAME_constraints_selection,
        CONFIG_constraints_selection,
    ) in ENVIRONMENTS_FOR_CONSTRAINTS_SELECTION.items():

        ### ### ### ### ###
        ### CREATE AND CONFIGURE ENVIRONMENT.
        ### ### ### ### ###

        # Name the configuration.
        CONFIG_constraints_selection["_ENV_NAME"] = ENV_NAME_constraints_selection
        CONFIG_constraints_selection["_ENV_PATH"] = (
            PARENT_ENV_PATH_error + ENV_NAME_constraints_selection + "/"
        )
        
        # Get dataset name from the dataset configuration.
        with open(PARENT_ENV_PATH_error + "../../config.json", "r") as file_cs0:
            config_dataset = json.load(file_cs0)
            dataset_name: str = config_dataset["_ENV_NAME"]

        # Get algorithms names from the algorithms configuration.
        with open(PARENT_ENV_PATH_error + "../config.json", "r") as file_cs1:
            config_algorithm = json.load(file_cs1)
            preprocessing_name: str = config_algorithm["preprocessing"]["_NAME"]
            vectorization_name: str = config_algorithm["vectorization"]["_NAME"]
            clustering_name: str = config_algorithm["clustering"]["_NAME"]
            sampling_name: str = config_algorithm["previous_sampling"]["_NAME"]
        
        # Add previous file sampling.
        CONFIG_constraints_selection["previous_file"] = "../previous/{dataset_str}/{prep_name_str}_-_{vect_name_str}_-_{samp_name_str}_-_{clust_name_str}_-_{rand_str}.json".format(
            dataset_str=str(dataset_name),
            prep_name_str=str(preprocessing_name),
            vect_name_str=str(vectorization_name),
            samp_name_str=str(sampling_name),
            clust_name_str=str(clustering_name),
            rand_str=str(CONFIG_constraints_selection["random_seed"]).zfill(4),
        )

        # Check if the environment already exists.
        if os.path.exists(str(CONFIG_constraints_selection["_ENV_PATH"])):
            continue

        # Create directory for this environment.
        os.mkdir(str(CONFIG_constraints_selection["_ENV_PATH"]))

        # Store configuration file.
        with open(
            str(CONFIG_constraints_selection["_ENV_PATH"]) + "config.json", "w"
        ) as file_cs2:
            json.dump(CONFIG_constraints_selection, file_cs2)

        ### ### ### ### ###
        ### LOAD PREVIOUS RESULTS.
        ### ### ### ### ###
        
        # Load previously sampled constraints (from "1_convergence_study" experiments).
        with open(
            str(CONFIG_constraints_selection["previous_file"]), "r"
        ) as file_cs3:
            dict_of_previous: Dict[str, Dict[str, Any]] = json.load(file_cs3)

        ### ### ### ### ###
        ### STORE NEEDED FILES.
        ### ### ### ### ###
        
        # Store previous sampling.
        # Flatten all iterations into one list of `[data_id_1, data_id_2]` pairs.
        list_of_previous_sampling: List[List[str]] = []
        for samples in dict_of_previous["dict_of_constraints_annotations"].values():
            for sample in samples:
                list_of_previous_sampling.append([sample[0], sample[1]])
        with open(
            str(CONFIG_constraints_selection["_ENV_PATH"]) + "previous_sampling.json",
            "w",
        ) as file_cs4:
            json.dump(list_of_previous_sampling, file_cs4)

        # Store dictionary of samplings.
        with open(
            str(CONFIG_constraints_selection["_ENV_PATH"])
            + "dict_of_samplings.json",
            "w",
        ) as file_cs5:
            json.dump({"000000": []}, file_cs5)

        # Store dictionary of errors.
        with open(
            str(CONFIG_constraints_selection["_ENV_PATH"])
            + "dict_of_errors.json",
            "w",
        ) as file_cs6:
            json.dump({"000000": []}, file_cs6)

        # Store dictionary of constraints effective.
        with open(
            str(CONFIG_constraints_selection["_ENV_PATH"])
            + "dict_of_constraints_effective.json",
            "w",
        ) as file_cs7:
            json.dump({"000000": []}, file_cs7)

        # Store dictionary of clustering results.
        with open(
            str(CONFIG_constraints_selection["_ENV_PATH"])
            + "dict_of_clustering_results.json",
            "w",
        ) as file_cs8:
            json.dump({"000000": dict_of_previous["dict_of_clustering_results"]["0000"]}, file_cs8)

        # Store dictionary of clustering performances.
        with open(
            str(CONFIG_constraints_selection["_ENV_PATH"])
            + "dict_of_clustering_performances.json",
            "w",
        ) as file_cs9:
            json.dump({"000000": dict_of_previous["dict_of_clustering_performances"]["0000"]}, file_cs9)

# End
print("\n#####")
print("END - Constraints selection environments configuration.")

------------------------------
## 3. GET ALL CREATED ENVIRONMENTS
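
As a sanity check, the expected number of experiment environments can be derived from the configurations defined in section 2. The arithmetic below is a sketch that assumes `../experiments` was empty before the run and that every export in `../previous/` was available; the variable names are illustrative only.

```python
# Dataset level: two corpora, sizes 1000 to 5000 in steps of 500, one seed.
nb_sizes = len(range(1000, 5001, 500))
nb_datasets = 2 * nb_sizes * 1

# Algorithm level: a single configuration ("simple_tfidf_kmeans_cop").
nb_algorithms = 1

# Errors level: 6 error rates, 2 random seeds, 1 conflict-fix setting.
nb_errors = 6 * 2 * 1

# Constraints selection level: 2 random seeds.
nb_constraints_selections = 2

expected_nb_experiments = nb_datasets * nb_algorithms * nb_errors * nb_constraints_selections
print(expected_nb_experiments)
```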

In [None]:
# Get list of experiment environments.
LIST_OF_EXPERIMENT_ENVIRONMENTS: List[str] = listing_envs.get_list_of_constraints_selection_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_EXPERIMENT_ENVIRONMENTS)) + "`",
    "created experiment environments in `../experiments`",
)
#LIST_OF_EXPERIMENT_ENVIRONMENTS