# ==== INTERACTIVE CLUSTERING : CONSTRAINTS NUMBER STUDY ====
> ### Stage 1 : Initialize computation environments for experiments.

------------------------------
## READ-ME BEFORE RUNNING

### Quick Description

This notebook **creates the environments needed to run the constraints number study experiments**.
- Environments are represented by subdirectories in the `/experiments` folder. A full path to an experiment environment is `/experiments/[DATASET]/[PREPROCESSING]/[VECTORIZATION]/[SAMPLING]/[CLUSTERING]/[EXPERIMENT]`.
- Each subpath corresponds to a step of the Interactive Clustering methodology: A. Load dataset; B. Preprocess dataset; C. Vectorize dataset; D. Sample data for constraints annotation; E. Cluster data with constraints. The last subdirectory defines the random seed of the observation.
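
The path layout described above can be illustrated with a minimal sketch. The helper name `build_env_path` and the constant `LEVELS` are hypothetical (they do not exist in the study code); the folder names are taken from environments created further down in this notebook.

```python
import posixpath

# Order of the levels, following the methodology steps described above.
LEVELS = ("dataset", "preprocessing", "vectorization", "sampling", "clustering", "experiment")


def build_env_path(root: str, **names: str) -> str:
    """Join the level names, in methodology order, under `root`."""
    parts = []
    for level in LEVELS:
        if level not in names:
            break  # Stop at the first level that is not provided.
        parts.append(names[level])
    return posixpath.join(root, *parts) + "/"


example_path = build_env_path(
    "../experiments",
    dataset="bank_cards_v2-size_1000-rand_1",
    preprocessing="simple_prep",
    vectorization="tfidf",
    sampling="closest-50",
    clustering="kmeans_COP-10c",
    experiment="0001",
)
print(example_path)
```

Each level is nested in the previous one, so an experiment folder always sits six levels below `../experiments`.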

At the beginning of the comparative study, **run this notebook to set up the experiments you want**.

Then, **go to the notebook `2_Run_until_convergence_and_evaluate_efficience.ipynb` to run and evaluate each experiment you have set up**.

### Description of each step

- 2.1. **Set up `Dataset` environments**:
    - _Description_: Create a subdirectory, store the dataset parameters and pre-format the dataset for the next computations.
    - _Setting_: A dictionary defines all possible configurations of dataset environments.
    - _Folder content_:
        - `dict_of_texts.json`: texts from the dataset;
        - `dict_of_true_intents.json`: true intents from the dataset;
        - `config.json`: a json file with all parameters.
    - _Available datasets_:
        - [French trainset for chatbots dealing with usual requests on bank cards v2.0.0](http://doi.org/10.5281/zenodo.4769949)
        - [MLSUM: The Multilingual Summarization Corpus](https://arxiv.org/abs/2004.14900v1), subsetted and filtered by SCHILD E. (v1.0.0).
    - _Available dataset settings_:
        - define dataset size;
        - define random seed.

- 2.2. **Set up `Preprocessing` environments**:
    - _Description_: Create a subdirectory, store parameters and preprocess the dataset for the next computations.
    - _Setting_: A dictionary defines all possible configurations of preprocessing environments.
    - _Folder content_:
        - `dict_of_preprocessed_texts.json`: preprocessed texts computed from `dict_of_texts.json`;
        - `config.json`: a json file with all parameters.
    - _Available preprocessing settings_:
        - apply simple preprocessing (lowercase, punctuation, accent, whitespace).

- 2.3. **Set up `Vectorization` environments**:
    - _Description_: Create a subdirectory, store parameters and vectorize the preprocessed dataset for the next computations.
    - _Setting_: A dictionary defines all possible configurations of vectorization environments.
    - _Folder content_:
        - `dict_of_vectors.pkl`: vectors computed from `dict_of_preprocessed_texts.json`;
        - `config.json`: a json file with all parameters.
    - _Available vectorization settings_:
        - TF-IDF vectorizer;
        - spaCy `fr_core_news_md` language model.

- 2.4. **Set up `Sampling` environments**:
    - _Description_: Create a subdirectory and store parameters.
    - _Setting_: A dictionary defines all possible configurations of sampling environments.
    - _Folder content_:
        - `config.json`: a json file with all parameters.
    - _Available sampling settings_:
        - apply sampling of closest data from two different clusters.

- 2.5. **Set up `Clustering` environments**:
    - _Description_: Create a subdirectory and store parameters.
    - _Setting_: A dictionary defines all possible configurations of clustering environments.
    - _Folder content_:
        - `config.json`: a json file with all parameters.
    - _Available clustering settings_:
        - apply constrained kmeans clustering (model COP);
        - define number of clusters.

- 2.6. **Set up `Experiment` environments**:
    - _Description_: Create a subdirectory, store parameters and initialize results files storage.
    - _Setting_: A dictionary defines all possible configurations of experiment environments.
    - _Folder content_:
        - `config.json`: a json file with all parameters;
        - `dict_of_constraints_annotations.json`: all annotations over interactive-clustering iterations;
        - `dict_of_clustering_results.json`: clustering results over interactive-clustering iterations;
        - `dict_of_computation_times.json`: computation times over interactive-clustering iterations;
        - `dict_of_clustering_performances.json`: clustering performances over interactive-clustering iterations.
    - _Available experiment settings_:
        - define the random seed.
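
The folder contents listed above can be turned into a small completeness check. This is only a sketch: the names `EXPECTED_FILES` and `missing_files` are hypothetical, and the file lists are copied from the descriptions above.

```python
import os
import tempfile
from typing import Dict, List

# Files expected at each level, taken from the description above.
EXPECTED_FILES: Dict[str, List[str]] = {
    "dataset": ["config.json", "dict_of_texts.json", "dict_of_true_intents.json"],
    "preprocessing": ["config.json", "dict_of_preprocessed_texts.json"],
    "vectorization": ["config.json", "dict_of_vectors.pkl"],
    "sampling": ["config.json"],
    "clustering": ["config.json"],
    "experiment": [
        "config.json",
        "dict_of_constraints_annotations.json",
        "dict_of_clustering_results.json",
        "dict_of_computation_times.json",
        "dict_of_clustering_performances.json",
    ],
}


def missing_files(env_path: str, level: str) -> List[str]:
    """Return the expected files that are absent from `env_path`."""
    return [
        name
        for name in EXPECTED_FILES[level]
        if not os.path.isfile(os.path.join(env_path, name))
    ]


# Demonstration on a temporary folder that only contains `config.json`.
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "config.json"), "w").close()
    demo_missing = missing_files(tmp_dir, "preprocessing")
print(demo_missing)
```

Such a check may help to spot environments left incomplete by an interrupted run.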

------------------------------
## 1. IMPORT PYTHON DEPENDENCIES

In [1]:
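import json
import os
import pickle  # noqa: S403
from typing import Any, Dict, List, Tuple

import pandas as pd
from scipy.sparse import csr_matrix

# `faker` and `listing_envs` are helper modules of this study (presumably local files next to this notebook).
import faker
import listing_envs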
from cognitivefactory.interactive_clustering.utils.preprocessing import (
    preprocess,
)
from cognitivefactory.interactive_clustering.utils.vectorization import (
    vectorize,
)

------------------------------
## 2. CREATE COMPUTATION ENVIRONMENTS

------------------------------
### 2.1. Set `Dataset` subdirectories

Define environments with different `datasets`.

In [2]:
ENVIRONMENTS_FOR_DATASETS: Dict[str, Any] = {}
# Dataset sizes from 1000 to 5000 (inclusive) by steps of 250, with three random seeds each.
for size in range(1000, 5000 + 250, 250):
    for rand in [1, 2, 3]:
        # Case of bank cards management.
        ENVIRONMENTS_FOR_DATASETS["bank_cards_v2-size_{size_str}-rand_{rand_str}".format(size_str=size, rand_str=rand)] = {
            "_TYPE": "dataset",
            "_DESCRIPTION": "This dataset represents examples of common customer requests relating to bank cards management. It can be used as a training set for a small chatbot intended to process these usual requests.",
            "file_name": "French_trainset_for_chatbots_dealing_with_usual_requests_on_bank_cards_v2.0.0.xlsx",
            "sheet_name": "dataset",
            "columns": ["QUESTION", "INTENT"],
            "language": "fr",
            "size": size,
            "random_seed": rand,
            "nb_clusters": 10,
        }
        # Case of MLSUM.
        ENVIRONMENTS_FOR_DATASETS["mlsum_fr_train_subset_v1-size_{size_str}-rand_{rand_str}".format(size_str=size, rand_str=rand)] = {
            "_TYPE": "dataset",
            "_DESCRIPTION": "We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with English newspapers from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset. For constraints annotation experiment based on data similarity, this dataset has been subsetted (randomly pick 75 articles in the following 14 most used topics: 'economie', 'politique', 'sport', 'planete' (renamed in 'ecologie'), 'sciences', 'police-justice', 'disparitions', 'emploi', 'sante', 'musiques', 'arts', 'educations', 'climat' (renamed in 'meteo'), 'immobilier') and filtered (keep articles that have an obvious topic regarding their titles, without their bodies). Two reviewers have worked on this task in order to limit the subjectivity of the filtering. This subsetted dataset is used (1) to estimate the time needed to annotate titles similarity with constraints (MUST-LINK, CANNOT-LINK) and (2) to test interactive clustering methodology (constraints annotation and constrained clustering).",
            "file_name": "mlsum_fr_train_subset_v1.0.0.schild.xlsx",
            "sheet_name": "dataset",
            "columns": ["title", "topic"],
            "language": "fr",
            "size": size,
            "random_seed": rand,
            "nb_clusters": 14,
        }
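
As a sanity check on the loop above, the number of dataset configurations can be computed independently. This sketch only reproduces the naming scheme and the ranges used in the cell; the variable names are illustrative.

```python
from itertools import product

sizes = list(range(1000, 5000 + 250, 250))  # 1000, 1250, ..., 5000
seeds = [1, 2, 3]
prefixes = ["bank_cards_v2", "mlsum_fr_train_subset_v1"]

env_names = [
    "{0}-size_{1}-rand_{2}".format(prefix, size, seed)
    for prefix, size, seed in product(prefixes, sizes, seeds)
]
print(len(sizes), len(env_names))
```

With 17 sizes, 3 seeds and 2 datasets, 102 dataset environments are expected.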

Create `dataset` environments using `ENVIRONMENTS_FOR_DATASETS` configuration dictionary.

In [3]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for ENV_NAME_dataset, CONFIG_dataset in ENVIRONMENTS_FOR_DATASETS.items():

    ### ### ### ### ###
    ### CREATE AND CONFIGURE ENVIRONMENT.
    ### ### ### ### ###

    # Name the configuration.
    CONFIG_dataset["_ENV_NAME"] = ENV_NAME_dataset
    CONFIG_dataset["_ENV_PATH"] = "../experiments/" + ENV_NAME_dataset + "/"

    # Check if the environment already exists.
    if os.path.exists(str(CONFIG_dataset["_ENV_PATH"])):
        continue

    # Create directory for this environment.
    os.mkdir(str(CONFIG_dataset["_ENV_PATH"]))

    # Store configuration file.
    with open(str(CONFIG_dataset["_ENV_PATH"]) + "config.json", "w") as file_d1:
        json.dump(CONFIG_dataset, file_d1, indent=1)

    ### ### ### ### ###
    ### LOAD DATASET.
    ### ### ### ### ###

    # Load dataset.
    df_dataset: pd.DataFrame = pd.read_excel(
        io="../../datasets/" + CONFIG_dataset["file_name"],
        sheet_name=CONFIG_dataset["sheet_name"],
        engine="openpyxl",
    )
    
    # Drop duplicates.
    df_dataset.drop_duplicates(inplace=True)

    ### ### ### ### ###
    ### FORMAT DATASET.
    ### ### ### ### ###

    # Load base for `dict_of_texts` and `dict_of_true_intents`.
    # > Force `str` type to avoid typing errors.

    base_dict_of_texts: Dict[str, str] = {
        str(data_id): str(value[
            CONFIG_dataset["columns"][0]
        ])
        for data_id, value in df_dataset.to_dict("index").items()
    }

    base_dict_of_true_intents: Dict[str, str] = {
        str(data_id): str(value[
            CONFIG_dataset["columns"][1]
        ])
        for data_id, value in df_dataset.to_dict("index").items()
    }

    # Fake dataset if needed (i.e. artificially add data by generating random spelling errors).
    faker_results: Tuple[Dict[str, str], Dict[str, str]] = faker.fake_dataset(
        dict_of_texts=base_dict_of_texts,
        dict_of_true_intents=base_dict_of_true_intents,
        size=CONFIG_dataset["size"],
        random_seed=CONFIG_dataset["random_seed"],
    )
    dict_of_texts: Dict[str, str] = faker_results[0]
    dict_of_true_intents: Dict[str, str] = faker_results[1]

    # Store `dict_of_texts` and `dict_of_true_intents`.
    with open(str(CONFIG_dataset["_ENV_PATH"]) + "dict_of_texts.json", "w") as file_d2:
        json.dump(dict_of_texts, file_d2, indent=1)

    with open(
        str(CONFIG_dataset["_ENV_PATH"]) + "dict_of_true_intents.json", "w"
    ) as file_d3:
        json.dump(dict_of_true_intents, file_d3, indent=1)

# End
print("\n#####")
print("END - Dataset environments configuration.")


#####
END - Dataset environments configuration.
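
The code of `faker.fake_dataset` is not shown in this notebook. The sketch below is a hypothetical stand-in, written only to illustrate the idea stated in the cell comment (artificially adding data by generating random spelling errors); the function name `fake_dataset_sketch`, the ID scheme and the character-swap strategy are all assumptions, and the input dataset is assumed to be non-empty.

```python
import random
from typing import Dict, Tuple


def fake_dataset_sketch(
    dict_of_texts: Dict[str, str],
    dict_of_true_intents: Dict[str, str],
    size: int,
    random_seed: int,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Resize a dataset to `size` entries (hypothetical stand-in, not the study code)."""
    rng = random.Random(random_seed)
    ids = sorted(dict_of_texts)
    rng.shuffle(ids)

    # Keep at most `size` original entries.
    kept = ids[:size]
    texts = {data_id: dict_of_texts[data_id] for data_id in kept}
    intents = {data_id: dict_of_true_intents[data_id] for data_id in kept}

    # Add misspelled copies until the requested size is reached.
    counter = 0
    while len(texts) < size:
        source_id = rng.choice(kept)
        text = dict_of_texts[source_id]
        if len(text) >= 2:
            # Swap two adjacent characters to simulate a spelling error.
            i = rng.randrange(len(text) - 1)
            text = text[:i] + text[i + 1] + text[i] + text[i + 2:]
        new_id = "{0}_fake_{1}".format(source_id, counter)
        texts[new_id] = text
        intents[new_id] = dict_of_true_intents[source_id]  # The copy keeps its intent.
        counter += 1
    return texts, intents


base_texts = {"0": "carte perdue", "1": "changer mon plafond"}
base_intents = {"0": "perte", "1": "plafond"}
fake_texts, fake_intents = fake_dataset_sketch(base_texts, base_intents, size=5, random_seed=1)
print(len(fake_texts))
```

Seeding a dedicated `random.Random` instance keeps the generated dataset reproducible for a given `random_seed`, which is what the `rand_1`, `rand_2`, `rand_3` environments rely on.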


------------------------------
### 2.2. Set `Preprocessing` subdirectories

Select `dataset` environments in which to create `preprocessing` environments.

In [4]:
# Get list of dataset environments.
LIST_OF_DATASET_ENVIRONMENTS: List[str] = listing_envs.get_list_of_dataset_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_DATASET_ENVIRONMENTS)) + "`",
    "created dataset environments in `../experiments`",
)
LIST_OF_DATASET_ENVIRONMENTS

There are `102` created dataset environments in `../experiments`


['../experiments/bank_cards_v2-size_1000-rand_1/',
 '../experiments/bank_cards_v2-size_1000-rand_2/',
 '../experiments/bank_cards_v2-size_1000-rand_3/',
 '../experiments/bank_cards_v2-size_1250-rand_1/',
 '../experiments/bank_cards_v2-size_1250-rand_2/',
 '../experiments/bank_cards_v2-size_1250-rand_3/',
 '../experiments/bank_cards_v2-size_1500-rand_1/',
 '../experiments/bank_cards_v2-size_1500-rand_2/',
 '../experiments/bank_cards_v2-size_1500-rand_3/',
 '../experiments/bank_cards_v2-size_1750-rand_1/',
 '../experiments/bank_cards_v2-size_1750-rand_2/',
 '../experiments/bank_cards_v2-size_1750-rand_3/',
 '../experiments/bank_cards_v2-size_2000-rand_1/',
 '../experiments/bank_cards_v2-size_2000-rand_2/',
 '../experiments/bank_cards_v2-size_2000-rand_3/',
 '../experiments/bank_cards_v2-size_2250-rand_1/',
 '../experiments/bank_cards_v2-size_2250-rand_2/',
 '../experiments/bank_cards_v2-size_2250-rand_3/',
 '../experiments/bank_cards_v2-size_2500-rand_1/',
 '../experiments/bank_cards_v2-

Define environments with different uses of `preprocessing`.

In [5]:
ENVIRONMENTS_FOR_PREPROCESSING: Dict[str, Any] = {
    # Case of simple preprocessing (lowercase, accents, punctuation, whitespace).
    "simple_prep": {
        "_TYPE": "preprocessing",
        "_DESCRIPTION": "Simple preprocessing (lowercase, accents, punctuation, whitespace)",
        "apply_preprocessing": True,
        "apply_lemmatization": False,
        "apply_parsing_filter": False,
        "spacy_language_model": "fr_core_news_md",
    },
}

Create `preprocessing` environments using `ENVIRONMENTS_FOR_PREPROCESSING` configuration dictionary.

In [6]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for PARENT_ENV_PATH_dataset in LIST_OF_DATASET_ENVIRONMENTS:
    for (
        ENV_NAME_preprocessing,
        CONFIG_preprocessing,
    ) in ENVIRONMENTS_FOR_PREPROCESSING.items():

        ### ### ### ### ###
        ### CREATE AND CONFIGURE ENVIRONMENT.
        ### ### ### ### ###

        # Name the configuration.
        CONFIG_preprocessing["_ENV_NAME"] = ENV_NAME_preprocessing
        CONFIG_preprocessing["_ENV_PATH"] = (
            PARENT_ENV_PATH_dataset + ENV_NAME_preprocessing + "/"
        )

        # Check if the environment already exists.
        if os.path.exists(str(CONFIG_preprocessing["_ENV_PATH"])):
            continue

        # Create directory for this environment.
        os.mkdir(str(CONFIG_preprocessing["_ENV_PATH"]))

        # Store configuration file.
        with open(
            str(CONFIG_preprocessing["_ENV_PATH"]) + "config.json", "w"
        ) as file_p1:
            json.dump(CONFIG_preprocessing, file_p1, indent=1)

        ### ### ### ### ###
        ### PREPROCESS DATASET.
        ### ### ### ### ###

        # Load the texts of the parent dataset environment.
        with open(
            str(CONFIG_preprocessing["_ENV_PATH"]) + "../dict_of_texts.json", "r"
        ) as file_p2:
            texts: Dict[str, str] = json.load(file_p2)

        dict_of_preprocessed_texts: Dict[str, str] = {}

        # Case with preprocessing.
        if bool(CONFIG_preprocessing["apply_preprocessing"]):
            dict_of_preprocessed_texts = preprocess(
                dict_of_texts=texts,
                apply_lemmatization=bool(CONFIG_preprocessing["apply_lemmatization"]),
                apply_parsing_filter=bool(CONFIG_preprocessing["apply_parsing_filter"]),
                spacy_language_model=str(CONFIG_preprocessing["spacy_language_model"]),
            )

        # Case without preprocessing.
        else:
            dict_of_preprocessed_texts = texts

        # Store preprocessed texts.
        with open(
            str(CONFIG_preprocessing["_ENV_PATH"]) + "dict_of_preprocessed_texts.json",
            "w",
        ) as file_p3:
            json.dump(dict_of_preprocessed_texts, file_p3, indent=1)

# End
print("\n#####")
print("END - Preprocessing environments configuration.")


#####
END - Preprocessing environments configuration.
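
To give an idea of what a simple preprocessing (lowercase, accents, punctuation, whitespace) does to a text, here is a minimal sketch based on the standard library. It is not the implementation of `preprocess` from `cognitivefactory.interactive_clustering`, whose exact behaviour may differ; the function name `simple_preprocess` is hypothetical.

```python
import re
import unicodedata


def simple_preprocess(text: str) -> str:
    """Lowercase, strip accents, replace punctuation by spaces, collapse whitespace."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(char for char in text if not unicodedata.combining(char))
    # Replace anything that is not a word character (letter, digit, underscore) or whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(simple_preprocess("  J'ai perdu ma Carte bancaire, à l'étranger !! "))
```

The example prints `j ai perdu ma carte bancaire a l etranger`.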


------------------------------
### 2.3. Set `Vectorization` subdirectories

Select `preprocessing` environments in which to create `vectorization` environments.

In [7]:
# Get list of preprocessing environments.
LIST_OF_PREPROCESSING_ENVIRONMENTS: List[
    str
] = listing_envs.get_list_of_preprocessing_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_PREPROCESSING_ENVIRONMENTS)) + "`",
    "created preprocessing environments in `../experiments`",
)
LIST_OF_PREPROCESSING_ENVIRONMENTS

There are `102` created preprocessing environments in `../experiments`


['../experiments/bank_cards_v2-size_1000-rand_1/simple_prep/',
 '../experiments/bank_cards_v2-size_1000-rand_2/simple_prep/',
 '../experiments/bank_cards_v2-size_1000-rand_3/simple_prep/',
 '../experiments/bank_cards_v2-size_1250-rand_1/simple_prep/',
 '../experiments/bank_cards_v2-size_1250-rand_2/simple_prep/',
 '../experiments/bank_cards_v2-size_1250-rand_3/simple_prep/',
 '../experiments/bank_cards_v2-size_1500-rand_1/simple_prep/',
 '../experiments/bank_cards_v2-size_1500-rand_2/simple_prep/',
 '../experiments/bank_cards_v2-size_1500-rand_3/simple_prep/',
 '../experiments/bank_cards_v2-size_1750-rand_1/simple_prep/',
 '../experiments/bank_cards_v2-size_1750-rand_2/simple_prep/',
 '../experiments/bank_cards_v2-size_1750-rand_3/simple_prep/',
 '../experiments/bank_cards_v2-size_2000-rand_1/simple_prep/',
 '../experiments/bank_cards_v2-size_2000-rand_2/simple_prep/',
 '../experiments/bank_cards_v2-size_2000-rand_3/simple_prep/',
 '../experiments/bank_cards_v2-size_2250-rand_1/simple_

Define environments with different uses of `vectorization`.

In [8]:
ENVIRONMENTS_FOR_VECTORIZATION: Dict[str, Any] = {
    # Case of TFIDF vectorization.
    "tfidf": {
        "_TYPE": "vectorization",
        "_DESCRIPTION": "TFIDF vectorization.",
        "vectorizer_type": "tfidf",
        "spacy_language_model": None,
    },
}

Create `vectorization` environments using `ENVIRONMENTS_FOR_VECTORIZATION` configuration dictionary.

In [9]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for PARENT_ENV_PATH_preprocessing in LIST_OF_PREPROCESSING_ENVIRONMENTS:
    for (
        ENV_NAME_vectorization,
        CONFIG_vectorization,
    ) in ENVIRONMENTS_FOR_VECTORIZATION.items():

        ### ### ### ### ###
        ### CREATE AND CONFIGURE ENVIRONMENT.
        ### ### ### ### ###

        # Name the configuration.
        CONFIG_vectorization["_ENV_NAME"] = ENV_NAME_vectorization
        CONFIG_vectorization["_ENV_PATH"] = (
            PARENT_ENV_PATH_preprocessing + ENV_NAME_vectorization + "/"
        )

        # Check if the environment already exists.
        if os.path.exists(str(CONFIG_vectorization["_ENV_PATH"])):
            continue

        # Create directory for this environment.
        os.mkdir(str(CONFIG_vectorization["_ENV_PATH"]))

        # Store configuration file.
        with open(
            str(CONFIG_vectorization["_ENV_PATH"]) + "config.json", "w"
        ) as file_v1:
            json.dump(CONFIG_vectorization, file_v1, indent=1)

        ### ### ### ### ###
        ### VECTORIZE DATASET.
        ### ### ### ### ###

        # Load the preprocessed texts of the parent preprocessing environment.
        with open(
            str(CONFIG_vectorization["_ENV_PATH"])
            + "../dict_of_preprocessed_texts.json",
            "r",
        ) as file_v2:
            preprocessed_texts: Dict[str, str] = json.load(file_v2)

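        # Vectorize dataset.
        # Note: for the TF-IDF case, `spacy_language_model` is `None` in the configuration,
        # so `str(...)` passes the string "None"; it is assumed to be unused by the TF-IDF vectorizer.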
        dict_of_vectors: Dict[str, csr_matrix] = vectorize(
            dict_of_texts=preprocessed_texts,
            vectorizer_type=str(CONFIG_vectorization["vectorizer_type"]),
            spacy_language_model=str(CONFIG_vectorization["spacy_language_model"]),
        )

        # Store vectors.
        with open(
            str(CONFIG_vectorization["_ENV_PATH"]) + "dict_of_vectors.pkl", "wb"
        ) as file_v3:
            pickle.dump(dict_of_vectors, file_v3)

# End
print("\n#####")
print("END - Vectorization environments configuration.")


#####
END - Vectorization environments configuration.
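
For readers unfamiliar with TF-IDF, the sketch below computes such weights in pure Python. It uses a common smoothed variant of the inverse document frequency and an L2 normalization; it is only an illustration and not the implementation of `vectorize`, which stores `csr_matrix` objects rather than dictionaries. The function name `tfidf_sketch` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def tfidf_sketch(dict_of_texts: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Return, for each text ID, a sparse {token: weight} mapping with unit L2 norm."""
    tokenized = {data_id: text.split() for data_id, text in dict_of_texts.items()}
    nb_documents = len(tokenized)

    # Document frequency: number of texts containing each token.
    document_frequency: Counter = Counter()
    for tokens in tokenized.values():
        document_frequency.update(set(tokens))

    vectors: Dict[str, Dict[str, float]] = {}
    for data_id, tokens in tokenized.items():
        counts = Counter(tokens)
        # Term frequency times a smoothed inverse document frequency.
        weights = {
            token: count * (math.log((1 + nb_documents) / (1 + document_frequency[token])) + 1)
            for token, count in counts.items()
        }
        norm = math.sqrt(sum(weight ** 2 for weight in weights.values()))
        vectors[data_id] = {token: weight / norm for token, weight in weights.items()} if norm else {}
    return vectors


demo_vectors = tfidf_sketch({"0": "carte perdue", "1": "carte volee", "2": "plafond carte"})
print(sorted(demo_vectors["0"]))
```

In this toy corpus, `carte` appears in every text, so it receives a lower weight than the rarer token `perdue`.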


------------------------------
### 2.4. Set `Sampling` subdirectories

Select `vectorization` environments in which to create `sampling` environments.

In [10]:
# Get list of vectorization environments.
LIST_OF_VECTORIZATION_ENVIRONMENTS: List[
    str
] = listing_envs.get_list_of_vectorization_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_VECTORIZATION_ENVIRONMENTS)) + "`",
    "created vectorization environments in `../experiments`",
)
LIST_OF_VECTORIZATION_ENVIRONMENTS

There are `102` created vectorization environments in `../experiments`


['../experiments/bank_cards_v2-size_1000-rand_1/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_1000-rand_2/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_1000-rand_3/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_1250-rand_1/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_1250-rand_2/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_1250-rand_3/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_1500-rand_1/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_1500-rand_2/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_1500-rand_3/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_1750-rand_1/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_1750-rand_2/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_1750-rand_3/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_2000-rand_1/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-size_2000-rand_2/simple_prep/tfidf/',
 '../experiments/bank_cards_v2-siz

Define environments with different uses of `sampling`.

In [11]:
ENVIRONMENTS_FOR_SAMPLING: Dict[str, Any] = {
    # Case of Closest in different clusters sampling, 50 selected combinations per iteration.
    "closest-50": {
        "_TYPE": "sampling",
        "_DESCRIPTION": "Closest in different clusters sampling, 50 selected combinations per iteration.",
        "algorithm": "closest_in_different_clusters",
        "nb_to_select": 50,
    },
}

Create `sampling` environments using `ENVIRONMENTS_FOR_SAMPLING` configuration dictionary.

In [12]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for PARENT_ENV_PATH_vectorization in LIST_OF_VECTORIZATION_ENVIRONMENTS:
    for ENV_NAME_sampling, CONFIG_sampling in ENVIRONMENTS_FOR_SAMPLING.items():

        ### ### ### ### ###
        ### CREATE AND CONFIGURE ENVIRONMENT.
        ### ### ### ### ###

        # Name the configuration.
        CONFIG_sampling["_ENV_NAME"] = ENV_NAME_sampling
        CONFIG_sampling["_ENV_PATH"] = (
            PARENT_ENV_PATH_vectorization + ENV_NAME_sampling + "/"
        )

        # Check if the environment already exists.
        if os.path.exists(str(CONFIG_sampling["_ENV_PATH"])):
            continue

        # Create directory for this environment.
        os.mkdir(str(CONFIG_sampling["_ENV_PATH"]))

        # Store configuration file.
        with open(str(CONFIG_sampling["_ENV_PATH"]) + "config.json", "w") as file_s1:
            json.dump(CONFIG_sampling, file_s1, indent=1)

# End
print("\n#####")
print("END - Sampling environments configuration.")


#####
END - Sampling environments configuration.
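
The `closest_in_different_clusters` setting stored above is only a name at this stage; the sampling itself runs in the second notebook. As a rough illustration of the idea (select the closest pairs of data that belong to two different clusters), here is a brute-force sketch. The function below is hypothetical, uses the Euclidean distance as an assumption, and may differ from the actual sampling algorithm.

```python
from itertools import combinations
from math import dist
from typing import Dict, List, Sequence, Tuple


def closest_in_different_clusters(
    vectors: Dict[str, Sequence[float]],
    clusters: Dict[str, int],
    nb_to_select: int,
) -> List[Tuple[str, str]]:
    """Return the `nb_to_select` closest pairs of IDs whose clusters differ."""
    candidates = [
        (dist(vectors[id_1], vectors[id_2]), id_1, id_2)
        for id_1, id_2 in combinations(sorted(vectors), 2)
        if clusters[id_1] != clusters[id_2]
    ]
    candidates.sort()  # Smallest distance first; ties are broken by ID.
    return [(id_1, id_2) for _, id_1, id_2 in candidates[:nb_to_select]]


demo_points = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (1.0, 0.0), "d": (1.2, 0.0)}
demo_clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
demo_pairs = closest_in_different_clusters(demo_points, demo_clusters, nb_to_select=2)
print(demo_pairs)
```

The pair `("a", "b")` is the closest overall but is never selected, since both points are in the same cluster. The sketch compares all pairs, so its cost grows quadratically with the dataset size.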


------------------------------
### 2.5. Set `Clustering` subdirectories

Select `sampling` environments in which to create `clustering` environments.

In [13]:
# Get list of sampling environments.
LIST_OF_SAMPLING_ENVIRONMENTS: List[str] = listing_envs.get_list_of_sampling_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_SAMPLING_ENVIRONMENTS)) + "`",
    "created sampling environments in `../experiments`",
)
LIST_OF_SAMPLING_ENVIRONMENTS

There are `102` created sampling environments in `../experiments`


['../experiments/bank_cards_v2-size_1000-rand_1/simple_prep/tfidf/closest-50/',
 '../experiments/bank_cards_v2-size_1000-rand_2/simple_prep/tfidf/closest-50/',
 '../experiments/bank_cards_v2-size_1000-rand_3/simple_prep/tfidf/closest-50/',
 '../experiments/bank_cards_v2-size_1250-rand_1/simple_prep/tfidf/closest-50/',
 '../experiments/bank_cards_v2-size_1250-rand_2/simple_prep/tfidf/closest-50/',
 '../experiments/bank_cards_v2-size_1250-rand_3/simple_prep/tfidf/closest-50/',
 '../experiments/bank_cards_v2-size_1500-rand_1/simple_prep/tfidf/closest-50/',
 '../experiments/bank_cards_v2-size_1500-rand_2/simple_prep/tfidf/closest-50/',
 '../experiments/bank_cards_v2-size_1500-rand_3/simple_prep/tfidf/closest-50/',
 '../experiments/bank_cards_v2-size_1750-rand_1/simple_prep/tfidf/closest-50/',
 '../experiments/bank_cards_v2-size_1750-rand_2/simple_prep/tfidf/closest-50/',
 '../experiments/bank_cards_v2-size_1750-rand_3/simple_prep/tfidf/closest-50/',
 '../experiments/bank_cards_v2-size_2000

Define environments with different uses of `clustering`.

In [14]:
ENVIRONMENTS_FOR_CLUSTERING: Dict[str, Any] = {
    # Case of KMeans clustering, {0} clusters, model 'COP'.
    "kmeans_COP-{0}c": {
        "_TYPE": "clustering",
        "_DESCRIPTION": "KMeans clustering, model 'COP'; the number of clusters is read from the dataset configuration.",
        "algorithm": "kmeans",
        "init**kargs": {
            "model": "COP",
            "max_iteration": 150,
            "tolerance": 1e-4,
        },
        #"nb_clusters": None,  # Will be set in the loop by loading the dataset configuration.
    },
}

Create `clustering` environments using `ENVIRONMENTS_FOR_CLUSTERING` configuration dictionary.

In [17]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for PARENT_ENV_PATH_sampling in LIST_OF_SAMPLING_ENVIRONMENTS:
    for ENV_NAME_clustering, CONFIG_clustering in ENVIRONMENTS_FOR_CLUSTERING.items():

        ### ### ### ### ###
        ### CREATE AND CONFIGURE ENVIRONMENT.
        ### ### ### ### ###
        
        # Get number of clusters from the dataset configuration.
        with open(PARENT_ENV_PATH_sampling + "../../../config.json", "r") as file_c1:
            nb_clusters = json.load(file_c1)["nb_clusters"]
        

        # Name the configuration.
        CONFIG_clustering["_ENV_NAME"] = ENV_NAME_clustering.format(nb_clusters)
        CONFIG_clustering["_ENV_PATH"] = (
            PARENT_ENV_PATH_sampling + ENV_NAME_clustering.format(nb_clusters) + "/"
        )
        CONFIG_clustering["nb_clusters"] = nb_clusters

        # Check if the environment already exists.
        if os.path.exists(str(CONFIG_clustering["_ENV_PATH"])):
            continue

        # Create directory for this environment.
        os.mkdir(str(CONFIG_clustering["_ENV_PATH"]))

        # Store configuration file.
        with open(str(CONFIG_clustering["_ENV_PATH"]) + "config.json", "w") as file_c2:
            json.dump(CONFIG_clustering, file_c2, indent=1)

# End
print("\n#####")
print("END - Clustering environments configuration.")


#####
END - Clustering environments configuration.
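
Two details of the loop above may deserve a short illustration: the relative path `../../../config.json`, which climbs from a sampling folder back to its dataset folder, and the `{0}` placeholder of the environment name. The sampling path below is an example built from the naming scheme of this notebook.

```python
import posixpath

sampling_env = "../experiments/mlsum_fr_train_subset_v1-size_1000-rand_1/simple_prep/tfidf/closest-50/"

# Three levels up from the sampling folder is the dataset folder.
dataset_config = posixpath.normpath(sampling_env + "../../../config.json")
print(dataset_config)

# The `{0}` placeholder in the environment name receives the number of clusters.
env_name = "kmeans_COP-{0}c".format(14)
print(env_name)
```

This is why the MLSUM environments end up in `kmeans_COP-14c` folders, while the bank cards environments use `kmeans_COP-10c`.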


------------------------------
### 2.6. Set `Experiment` subdirectories

Select `clustering` environments in which to create `experiment` environments.

In [None]:
# Get list of clustering environments.
LIST_OF_CLUSTERING_ENVIRONMENTS: List[
    str
] = listing_envs.get_list_of_clustering_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_CLUSTERING_ENVIRONMENTS)) + "`",
    "created clustering environments in `../experiments`",
)
LIST_OF_CLUSTERING_ENVIRONMENTS

Define the experiment identifiers. Each identifier is also used as the random seed of its experiment.

In [None]:
MANAGER_TYPE: str = "binary"
LIST_OF_EXPERIMENT_IDS: List[int] = [
    1,
    2,
    3,
    4,
    5,
]

Create one `experiment` environment per identifier in `LIST_OF_EXPERIMENT_IDS`, in each `clustering` environment.

In [None]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for PARENT_ENV_PATH_clustering in LIST_OF_CLUSTERING_ENVIRONMENTS:
    for EXPERIMENT_ID in LIST_OF_EXPERIMENT_IDS:

        ### ### ### ### ###
        ### CREATE AND CONFIGURE ENVIRONMENT.
        ### ### ### ### ###

        # Name the configuration.
        CONFIG_experiment = {
            "_ENV_NAME": str(EXPERIMENT_ID).zfill(4),
            "_ENV_PATH": PARENT_ENV_PATH_clustering + str(EXPERIMENT_ID).zfill(4) + "/",
            "EXPERIMENT_ID": EXPERIMENT_ID,
            "random_seed": EXPERIMENT_ID,
            "manager_type": MANAGER_TYPE,
        }

        # Check if the environment already exists.
        if os.path.exists(str(CONFIG_experiment["_ENV_PATH"])):
            continue

        # Create directory for this environment.
        os.mkdir(str(CONFIG_experiment["_ENV_PATH"]))

        # Store configuration file.
        with open(str(CONFIG_experiment["_ENV_PATH"]) + "config.json", "w") as file_e1:
            json.dump(CONFIG_experiment, file_e1, indent=1)

        ### ### ### ### ###
        ### INITIALIZE SOME INFORMATION.
        ### ### ### ### ###

        # Store dictionary of clustering results.
        with open(
            str(CONFIG_experiment["_ENV_PATH"]) + "dict_of_clustering_results.json", "w"
        ) as file_e2:
            json.dump({}, file_e2, indent=1)

        # Store dictionary of clustering performances.
        with open(
            str(CONFIG_experiment["_ENV_PATH"])
            + "dict_of_clustering_performances.json",
            "w",
        ) as file_e3:
            json.dump({}, file_e3, indent=1)

        # Store dictionary of computation time.
        with open(
            str(CONFIG_experiment["_ENV_PATH"]) + "dict_of_computation_times.json", "w"
        ) as file_e4:
            json.dump({}, file_e4, indent=1)

        # Store dictionary of annotation history.
        with open(
            str(CONFIG_experiment["_ENV_PATH"])
            + "dict_of_constraints_annotations.json",
            "w",
        ) as file_e5:
            json.dump({}, file_e5, indent=1)

# End
print("\n#####")
print("END - Experiment environments configuration.")
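
All the creation loops of this notebook follow the same pattern: skip the environment if its folder already exists, otherwise create the folder and write its files. A possible factorization is sketched below; the helper `create_environment` is hypothetical and is not part of the study code.

```python
import json
import os
import tempfile
from typing import Any, Dict, Sequence


def create_environment(
    env_path: str, config: Dict[str, Any], empty_result_files: Sequence[str] = ()
) -> bool:
    """Create `env_path` with its `config.json`; return False if it already exists."""
    if os.path.exists(env_path):
        return False  # Never overwrite an existing environment.
    os.mkdir(env_path)
    with open(os.path.join(env_path, "config.json"), "w") as config_file:
        json.dump(config, config_file, indent=1)
    for file_name in empty_result_files:
        with open(os.path.join(env_path, file_name), "w") as result_file:
            json.dump({}, result_file, indent=1)
    return True


with tempfile.TemporaryDirectory() as parent:
    path = os.path.join(parent, str(1).zfill(4))  # Folder named "0001".
    first_call = create_environment(path, {"random_seed": 1}, ["dict_of_clustering_results.json"])
    second_call = create_environment(path, {"random_seed": 2})
    with open(os.path.join(path, "config.json")) as config_file:
        stored_seed = json.load(config_file)["random_seed"]
print(first_call, second_call, stored_seed)
```

Re-running the notebook is therefore safe for existing results. The downside is that a folder left behind by an interrupted run is also skipped, so it has to be deleted by hand before it can be rebuilt.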

------------------------------
## 3. GET ALL CREATED ENVIRONMENTS

In [None]:
# Get list of experiment environments.
LIST_OF_EXPERIMENT_ENVIRONMENTS: List[
    str
] = listing_envs.get_list_of_experiment_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_EXPERIMENT_ENVIRONMENTS)) + "`",
    "created experiment environments in `../experiments`",
)
LIST_OF_EXPERIMENT_ENVIRONMENTS
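
The `listing_envs` module is not shown in this notebook. Assuming it simply lists the folders found at a given depth under `../experiments` (depth 1 for datasets, up to depth 6 for experiments), a comparable helper could be sketched with `glob`. The function name `list_env_paths` is hypothetical.

```python
import glob
import os
import tempfile
from typing import List


def list_env_paths(root: str, depth: int) -> List[str]:
    """Return the sorted folders located `depth` levels below `root`, with a trailing slash."""
    wildcards = ["*"] * depth
    # The trailing empty component makes the pattern match directories only.
    pattern = os.path.join(root, *wildcards, "")
    return sorted(path.replace(os.sep, "/") for path in glob.glob(pattern))


with tempfile.TemporaryDirectory() as root:
    for relative in ("a/x", "a/y", "b/x"):
        os.makedirs(os.path.join(root, *relative.split("/")))
    open(os.path.join(root, "a", "config.json"), "w").close()  # Files are ignored.
    level_1 = list_env_paths(root, depth=1)
    level_2 = list_env_paths(root, depth=2)
print(len(level_1), len(level_2))
```

With the settings of this notebook, 102 clustering environments and 5 experiment identifiers should give 510 experiment environments.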