# ==== INTERACTIVE CLUSTERING : BUSINESS CONSISTENCY STUDY ====
> ### Stage 1 : Initialize computation environments for experiments.

------------------------------
## READ-ME BEFORE RUNNING

### Quick Description

This notebook **creates the environments needed to run the business consistency study experiments**.
- Environments are represented by subdirectories in the `/experiments` folder. A full path to an experiment environment is `/experiments/[DATASET]/[CLUSTERING_EVOLUTION]`.
- The path is composed of A. the dataset used and B. the clustering evolution (cf. convergence study results).
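
The path layout described above can be sketched as follows. The helper name `build_env_path` and its `root` default are illustrative assumptions, not part of the notebook's code, which builds paths by plain string concatenation relative to `../experiments/`.

```python
from typing import Optional


def build_env_path(
    dataset: str,
    clustering_evolution: Optional[str] = None,
    root: str = "../experiments/",
) -> str:
    """Build the path of a dataset or clustering-evolution environment (hypothetical helper)."""
    # Dataset environments live directly under the experiments root.
    path = root + dataset + "/"
    # Clustering-evolution environments are nested inside their dataset environment.
    if clustering_evolution is not None:
        path += clustering_evolution + "/"
    return path


dataset_path = build_env_path("bank_cards_v1")
clustering_path = build_env_path(
    "bank_cards_v1",
    "simple_prep_-_tfidf_-_closest-50_-_kmeans_COP-10c_-_0001",
)
print(dataset_path)
print(clustering_path)
```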

At the beginning of the comparative study, **run this notebook to set up the experiments you want**.

Then, **go to the notebook `2_Compute_terms_analysis_with_FMC.ipynb` to run and evaluate each experiment you have set up**.

### Description of each step

- 2.1. **Set up `Dataset` environments**:
    - _Description_: Create a subdirectory, store parameters for the dataset and pre-format dataset for next computations.
    - _Setting_: A dictionary defines all possible configurations of dataset environments.
    - _Folder content_:
        - `dict_of_texts.json`: texts from the dataset;
        - `dict_of_true_intents.json`: true intent from the dataset;
        - `fmc_analysis.xlsx`: results of the Features Maximization Metric computation;
        - `config.json`: a json file with all parameters.
    - _Available datasets_:
        - [French trainset for chatbots dealing with usual requests on bank cards v1.0.0](http://doi.org/10.5281/zenodo.4769949)

- 2.2. **Set up `Clustering` environments**:
    - _Description_: Create a subdirectory, store parameters, then ... # TODO
    - _Setting_: A dictionary defines all possible configurations of preprocessing + vectorization + clustering environments.
    - _Folder content_:
        - `dict_of_clustering_results.json`: clustering results over interactive-clustering iterations;
        - `fmc_analysis-[ITERATION].xlsx`: results of the Features Maximization Metric computation for each analyzed iteration;
        - `config.json`: a json file with all parameters.
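
To check that an environment folder contains the files listed above, a small helper along these lines can be used. The function name `find_missing_files` and the tuple of expected files are assumptions made for illustration; adapt the tuple to the files your run actually produces.

```python
import os
import tempfile
from typing import List, Sequence

# Files listed above for a `Dataset` environment (adjust as needed).
EXPECTED_DATASET_FILES: Sequence[str] = (
    "config.json",
    "dict_of_texts.json",
    "dict_of_true_intents.json",
)


def find_missing_files(env_path: str, expected: Sequence[str]) -> List[str]:
    """Return the expected file names that are absent from `env_path`."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(env_path, name))
    ]


# Demonstration on a temporary folder holding only a configuration file.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "config.json"), "w") as config_file:
        config_file.write("{}")
    missing = find_missing_files(tmp_dir, EXPECTED_DATASET_FILES)

print(missing)
```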

------------------------------
## 1. IMPORT PYTHON DEPENDENCIES

In [None]:
! pip install ../../.temp/cognitivefactory-features-maximization-metric-0.0.0.tar.gz

In [1]:
import os
import listing_envs
from typing import Any, Dict, List, Tuple
from scipy.sparse import csr_matrix
import pandas as pd
import json
import pickle  # noqa: S403
from cognitivefactory.interactive_clustering.utils.preprocessing import (
    preprocess,
)
from cognitivefactory.interactive_clustering.utils.vectorization import (
    vectorize,
)

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from cognitivefactory.features_maximization_metric.fmc import FeaturesMaximizationMetric
import numpy as np

------------------------------
## 2. CREATE COMPUTATION ENVIRONMENTS

------------------------------
### 2.1. Set `Dataset` subdirectories

Define environments with different `datasets`.

In [3]:
ENVIRONMENTS_FOR_DATASETS: Dict[str, Any] = {
    # Case of bank cards management.
    "bank_cards_v1": {
        "_TYPE": "dataset",
        "_DESCRIPTION": "This dataset represents examples of common customer requests relating to bank cards management. It can be used as a training set for a small chatbot intended to process these usual requests.",
        "file_name": "French_trainset_for_chatbots_dealing_with_usual_requests_on_bank_cards_v1.0.0.xlsx",
        "sheet_name": "dataset",
        "language": "fr",
    }
}

Create `dataset` environments using `ENVIRONMENTS_FOR_DATASETS` configuration dictionary.

In [4]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for ENV_NAME_dataset, CONFIG_dataset in ENVIRONMENTS_FOR_DATASETS.items():

    ### ### ### ### ###
    ### CREATE AND CONFIGURE ENVIRONMENT.
    ### ### ### ### ###

    # Name the configuration.
    CONFIG_dataset["_ENV_NAME"] = ENV_NAME_dataset
    CONFIG_dataset["_ENV_PATH"] = "../experiments/" + ENV_NAME_dataset + "/"

    # Create the directory for this environment if it does not exist yet.
    os.makedirs(str(CONFIG_dataset["_ENV_PATH"]), exist_ok=True)

    # Store configuration file.
    with open(str(CONFIG_dataset["_ENV_PATH"]) + "config.json", "w") as file_d1:
        json.dump(CONFIG_dataset, file_d1, indent=1)

        
    ### ### ### ### ###
    ### FORMAT DATA.
    ### ### ### ### ###

    # Load dataset.
    df_dataset: pd.DataFrame = pd.read_excel(
        io="../../datasets/" + CONFIG_dataset["file_name"],
        sheet_name=CONFIG_dataset["sheet_name"],
        engine="openpyxl",
    )

    # Define and store `dict_of_texts`.
    dict_of_texts: Dict[str, str] = {
        str(data_id): str(value["QUESTION"])
        for data_id, value in df_dataset.to_dict("index").items()
    }
    with open(str(CONFIG_dataset["_ENV_PATH"]) + "dict_of_texts.json", "w") as file_d2:
        json.dump(dict_of_texts, file_d2, indent=1)

    # Define and store `dict_of_true_intents`.
    dict_of_true_intents: Dict[str, str] = {
        str(data_id): str(value["INTENT"])
        for data_id, value in df_dataset.to_dict("index").items()
    }
    with open(
        str(CONFIG_dataset["_ENV_PATH"]) + "dict_of_true_intents.json", "w"
    ) as file_d3:
        json.dump(dict_of_true_intents, file_d3, indent=1)

    # Define and store `dict_of_hard_preprocessed_texts`.
    dict_of_hard_preprocessed_texts: Dict[str, str] = preprocess(
        dict_of_texts=dict_of_texts,
        apply_stopwords_deletion=True,
        apply_lemmatization=True,
        apply_parsing_filter=False,
        spacy_language_model="fr_core_news_md",
    )
    with open(
        str(CONFIG_dataset["_ENV_PATH"]) + "dict_of_hard_preprocessed_texts.json", "w",
    ) as file_d4:
        json.dump(dict_of_hard_preprocessed_texts, file_d4, indent=1)
        
    # Define and store `matrix_of_vectors`.
    vectorizer = TfidfVectorizer(
        analyzer="word",
        ngram_range=(1, 3),
        min_df=2,
        sublinear_tf=True,
    )
    matrix_of_vectors: csr_matrix = vectorizer.fit_transform(
        [
            str(dict_of_hard_preprocessed_texts[data_ID])
            for data_ID in dict_of_hard_preprocessed_texts.keys()
        ]
    )
    with open(
        str(CONFIG_dataset["_ENV_PATH"]) + "matrix_of_vectors.pkl", "wb"
    ) as file_d5:
        pickle.dump(matrix_of_vectors, file_d5)

    # Define and store `list_of_possible_vectors_features`.
    list_of_possible_vectors_features: List[str] = list(vectorizer.get_feature_names_out())
    with open(
        str(CONFIG_dataset["_ENV_PATH"]) + "list_of_possible_vectors_features.json", "w"
    ) as file_d6:
        json.dump(list_of_possible_vectors_features, file_d6, indent=1)

        
    ### ### ### ### ###
    ### COMPUTE FMC
    ### ### ### ### ###
    
    # Computation.
    fmc_computer_dataset: FeaturesMaximizationMetric = FeaturesMaximizationMetric(
        data_vectors=matrix_of_vectors,
        data_classes=list(dict_of_true_intents.values()),
        list_of_possible_features=list_of_possible_vectors_features,
        amplification_factor=1,
    )

    # Storage.
    with pd.ExcelWriter(
        str(CONFIG_dataset["_ENV_PATH"]) + "fmc_analysis.xlsx", engine="xlsxwriter",
    ) as writer_d:

        for cluster in fmc_computer_dataset.list_of_possible_classes:
            df_cluster_analysis = pd.DataFrame()
            df_cluster_analysis["feature"] = fmc_computer_dataset.get_most_active_features_by_a_classe(classe=cluster)
            df_cluster_analysis["fmeasure"] = df_cluster_analysis.apply(
                lambda row: fmc_computer_dataset.features_fmeasure[row["feature"]][cluster],
                axis=1
            )
            df_cluster_analysis["contrast"] = df_cluster_analysis.apply(
                lambda row: fmc_computer_dataset.features_contrast[row["feature"]][cluster],
                axis=1
            )
            df_cluster_analysis["activation"] = df_cluster_analysis.apply(
                lambda row: len(fmc_computer_dataset.get_most_activated_classes_by_a_feature(feature=row["feature"])),
                axis=1
            )
            df_cluster_analysis.to_excel(writer_d, sheet_name=cluster)

# End
print("\n#####")
print("END - Dataset environments configuration.")


#####
END - Dataset environments configuration.
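
The `TfidfVectorizer` settings used above (`ngram_range=(1, 3)`, `min_df=2`) keep word n-grams of length one to three that occur in at least two documents. The pure-Python sketch below illustrates that filtering on a toy corpus. It is a simplification (whitespace tokenization, no TF-IDF weighting) and not scikit-learn's implementation; the corpus and function names are made up for the example.

```python
from collections import Counter
from typing import List, Set


def word_ngrams(text: str, n_min: int = 1, n_max: int = 3) -> Set[str]:
    """Return the set of word n-grams of a text, for n in [n_min, n_max]."""
    tokens = text.split()
    return {
        " ".join(tokens[start : start + n])
        for n in range(n_min, n_max + 1)
        for start in range(len(tokens) - n + 1)
    }


def vocabulary_with_min_df(texts: List[str], min_df: int = 2) -> List[str]:
    """Keep n-grams whose document frequency is at least `min_df`."""
    document_frequency: Counter = Counter()
    for text in texts:
        # A set is used so that each n-gram counts once per document.
        document_frequency.update(word_ngrams(text))
    return sorted(
        ngram for ngram, df in document_frequency.items() if df >= min_df
    )


toy_corpus = [
    "vouloir commander carte",
    "vouloir commander chequier",
    "perdre carte",
]
vocabulary = vocabulary_with_min_df(toy_corpus, min_df=2)
print(vocabulary)
```

N-grams seen in a single document, such as `perdre carte`, are dropped, which is why rare phrasings never appear among the stored vector features.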


------------------------------
### 2.2. Set `Clustering_evolution` subdirectories

Select the `dataset` environments in which to create `clustering_evolution` environments.

In [5]:
# Get list of dataset environments.
LIST_OF_DATASET_ENVIRONMENTS: List[str] = listing_envs.get_list_of_dataset_env_paths()
print(
    "There are",
    "`" + str(len(LIST_OF_DATASET_ENVIRONMENTS)) + "`",
    "created dataset environments in `../experiments`",
)
LIST_OF_DATASET_ENVIRONMENTS = LIST_OF_DATASET_ENVIRONMENTS[:1]
LIST_OF_DATASET_ENVIRONMENTS

There are `2` created dataset environments in `../experiments`


['../experiments/bank_cards_v1/']
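
`listing_envs` is a local helper module whose code is not shown in this notebook. Below is a minimal sketch of what a function like `get_list_of_dataset_env_paths` might do, assuming it scans the experiments folder for subdirectories whose `config.json` declares `"_TYPE": "dataset"`. The function name `list_dataset_env_paths` and this behaviour are assumptions, not the module's actual implementation.

```python
import json
import os
import tempfile
from typing import List


def list_dataset_env_paths(experiments_root: str) -> List[str]:
    """List subdirectories of `experiments_root` configured as dataset environments."""
    env_paths: List[str] = []
    for name in sorted(os.listdir(experiments_root)):
        config_path = os.path.join(experiments_root, name, "config.json")
        if not os.path.isfile(config_path):
            continue
        with open(config_path, "r") as config_file:
            config = json.load(config_file)
        if config.get("_TYPE") == "dataset":
            # Trailing slash, as in the paths used throughout this notebook.
            env_paths.append(experiments_root.rstrip("/") + "/" + name + "/")
    return env_paths


# Demonstration on a temporary experiments folder.
with tempfile.TemporaryDirectory() as root:
    for env_name, env_type in [("bank_cards_v1", "dataset"), ("notes", "other")]:
        os.mkdir(os.path.join(root, env_name))
        with open(os.path.join(root, env_name, "config.json"), "w") as config_file:
            json.dump({"_TYPE": env_type}, config_file)
    found = list_dataset_env_paths(root)
    found_names = [os.path.basename(path.rstrip("/")) for path in found]

print(found_names)
```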

Define environments with different `clustering_evolution`.

In [6]:
ENVIRONMENTS_FOR_CLUSTERING: Dict[str, Any] = {
    # Previous clustering results from "1_convergence_study" experiments.
    "{prep_str}_-_{vect_str}_-_{sampl_str}_-_{clust_str}_-_{rand_str}".format(
        prep_str=prep,
        vect_str=vect,
        sampl_str=sampl,
        clust_str=clust,
        rand_str=str(rand).zfill(4),
    ): {
        "_TYPE": "clustering_evolution",
        "_DESCRIPTION": "Clustering evolution (preprocessing: '{prep_str}', vectorization: '{vect_str}', sampling: '{sampl_str}', clustering: '{clust_str}', rand: '{rand_str}')".format(
            prep_str=prep,
            vect_str=vect,
            sampl_str=sampl,
            clust_str=clust,
            rand_str=str(rand).zfill(4),
        ),
        "previous_clustering_file": "../previous_clustering/bank_cards_v1_-_{prep_str}_-_{vect_str}_-_{sampl_str}_-_{clust_str}_-_{rand_str}.json".format(
            prep_str=prep,
            vect_str=vect,
            sampl_str=sampl,
            clust_str=clust,
            rand_str=str(rand).zfill(4),
        ),
    }
    for prep in ["simple_prep",]
    for vect in ["tfidf",]
    for sampl in ["closest-50",]
    for clust in ["kmeans_COP-10c",]
    for rand in [1, 2, 3, 4, 5,]
}
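
The environment names generated by `ENVIRONMENTS_FOR_CLUSTERING` follow a fixed pattern in which the random seed is zero-padded to four digits. The sketch below reproduces that pattern in isolation; the helper name `clustering_env_name` is introduced only for illustration.

```python
def clustering_env_name(prep: str, vect: str, sampl: str, clust: str, rand: int) -> str:
    """Build a clustering-evolution environment name (hypothetical helper)."""
    return "{prep_str}_-_{vect_str}_-_{sampl_str}_-_{clust_str}_-_{rand_str}".format(
        prep_str=prep,
        vect_str=vect,
        sampl_str=sampl,
        clust_str=clust,
        rand_str=str(rand).zfill(4),  # e.g. 1 -> "0001"
    )


names = [
    clustering_env_name("simple_prep", "tfidf", "closest-50", "kmeans_COP-10c", rand)
    for rand in [1, 2, 3, 4, 5]
]
print(names[0])
```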

Create `clustering_evolution` environments using `ENVIRONMENTS_FOR_CLUSTERING` configuration dictionary.

In [75]:
### ### ### ### ###
### LOOP FOR ALL ENVIRONMENTS CONFIGURED...
### ### ### ### ###
for PARENT_ENV_PATH_dataset in LIST_OF_DATASET_ENVIRONMENTS:
    for (
        ENV_NAME_clustering_evolution,
        CONFIG_clustering_evolution,
    ) in ENVIRONMENTS_FOR_CLUSTERING.items():

        ### ### ### ### ###
        ### CREATE AND CONFIGURE ENVIRONMENT.
        ### ### ### ### ###

        # Name the configuration.
        CONFIG_clustering_evolution["_ENV_NAME"] = ENV_NAME_clustering_evolution
        CONFIG_clustering_evolution["_ENV_PATH"] = (
            PARENT_ENV_PATH_dataset + ENV_NAME_clustering_evolution + "/"
        )

        # Create the directory for this environment if it does not exist yet.
        os.makedirs(str(CONFIG_clustering_evolution["_ENV_PATH"]), exist_ok=True)

        # Store configuration file.
        with open(str(CONFIG_clustering_evolution["_ENV_PATH"]) + "config.json", "w") as file_c1:
            json.dump(CONFIG_clustering_evolution, file_c1, indent=1)


        ### ### ### ### ###
        ### LOAD DATA AND CLUSTERING.
        ### ### ### ### ###

        # Load `dict_of_clustering_evolution` and store a copy in the environment.
        with open(CONFIG_clustering_evolution["previous_clustering_file"], "r") as file_c2:
            dict_of_clustering_evolution = json.load(file_c2)
        with open(str(CONFIG_clustering_evolution["_ENV_PATH"]) + "dict_of_clustering_results.json", "w") as file_c3:
            json.dump(dict_of_clustering_evolution, file_c3, indent=1)
            
        # Load `matrix_of_vectors`.
        with open(
            str(CONFIG_clustering_evolution["_ENV_PATH"]) + "../matrix_of_vectors.pkl", "rb"
        ) as file_c4:
            matrix_of_vectors: csr_matrix = pickle.load(file_c4)

        # Load `list_of_possible_vectors_features`.
        with open(
            str(CONFIG_clustering_evolution["_ENV_PATH"]) + "../list_of_possible_vectors_features.json", "r"
        ) as file_c5:
            list_of_possible_vectors_features: List[str] = json.load(file_c5)


        ### ### ### ### ###
        ### COMPUTE FMC
        ### ### ### ### ###
        
        for iteration in ["0005", "0010", "0015"]:  # dict_of_clustering_evolution.keys():
            
            # Computation.
            fmc_computer_clustering: FeaturesMaximizationMetric = FeaturesMaximizationMetric(
                data_vectors=matrix_of_vectors,
                data_classes=list(dict_of_clustering_evolution[iteration].values()),
                list_of_possible_features=list_of_possible_vectors_features,
                amplification_factor=1,
            )

            # Storage.
            with pd.ExcelWriter(
                str(CONFIG_clustering_evolution["_ENV_PATH"]) + "fmc_analysis-{iter_str}.xlsx".format(iter_str=iteration),
                engine="xlsxwriter",
            ) as writer_c:

                for cluster in fmc_computer_clustering.list_of_possible_classes:
                    df_cluster_analysis = pd.DataFrame()
                    df_cluster_analysis["feature"] = fmc_computer_clustering.get_most_active_features_by_a_classe(classe=cluster)
                    df_cluster_analysis["fmeasure"] = df_cluster_analysis.apply(
                        lambda row: fmc_computer_clustering.features_fmeasure[row["feature"]][cluster],
                        axis=1
                    )
                    df_cluster_analysis["contrast"] = df_cluster_analysis.apply(
                        lambda row: fmc_computer_clustering.features_contrast[row["feature"]][cluster],
                        axis=1
                    )
                    df_cluster_analysis["activation"] = df_cluster_analysis.apply(
                        lambda row: len(fmc_computer_clustering.get_most_activated_classes_by_a_feature(feature=row["feature"])),
                        axis=1
                    )
                    df_cluster_analysis.to_excel(writer_c, sheet_name=str(cluster))

# End
print("\n#####")
print("END - Clustering evolution environments configuration.")


#####
END - Clustering evolution environments configuration.
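
The FMC computation above passes `list(dict_of_clustering_evolution[iteration].values())` as `data_classes`, which relies on the clustering file listing data IDs in the same order as the rows of `matrix_of_vectors`. The sketch below shows one way to make that alignment explicit by reordering labels on the keys of `dict_of_texts`. The toy dictionaries and the assumed file structure (one mapping of data ID to cluster ID per iteration) are inferred from the code above, not from the actual files, and the matrix rows are assumed to follow the key order of `dict_of_texts`.

```python
from typing import Dict, List

# Toy stand-ins for the stored files (assumed structure).
dict_of_texts: Dict[str, str] = {"0": "texte a", "1": "texte b", "2": "texte c"}
dict_of_clustering_evolution: Dict[str, Dict[str, int]] = {
    "0005": {"2": 1, "0": 0, "1": 0},  # data IDs deliberately out of order
}


def aligned_classes(
    texts: Dict[str, str],
    clustering: Dict[str, int],
) -> List[int]:
    """Return cluster IDs ordered like the keys of `texts` (i.e. like the vector rows)."""
    missing = [data_id for data_id in texts if data_id not in clustering]
    if missing:
        raise KeyError("No cluster for data IDs: " + ", ".join(missing))
    return [clustering[data_id] for data_id in texts]


data_classes = aligned_classes(dict_of_texts, dict_of_clustering_evolution["0005"])
print(data_classes)
```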


-----
## _DRAFT METRICS

In [8]:
fmc_computer_dataset.list_of_possible_classes

['alerte_perte_vol_carte',
 'carte_avalee',
 'commande_carte',
 'consultation_solde',
 'couverture_assurrance',
 'deblocage_carte',
 'gestion_carte_virtuelle',
 'gestion_decouvert',
 'gestion_plafond',
 'gestion_sans_contact']

In [9]:
fmc_computer_clustering.list_of_possible_classes

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [10]:
pd.DataFrame(fmc_computer_clustering.features_fmeasure)
# vouloir commander = 0.006 ; 0.009 ; 8*0

Unnamed: 0,2000,2000 euro,2000 euro paiement,achat,achat internet,achat ligne,achat numero,achat numero carte,actif,actif carte,...,voler carte,voler carte bleu,vouloir,vouloir activer,vouloir commander,vouloir connaitre,vouloir connaitre solde,vouloir signaler,vouloir signaler perte,voyage
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.025896,0.014118,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.012986,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.028604,0.010011,0.006095,0.013927,0.009855,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.014979,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.009632,0.0,0.0,0.006771,0.0,0.0,0.0,0.038395
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.010487,0.0,0.0,0.015933,0.016783,0.0,0.0,0.0
5,0.00832,0.00832,0.00832,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.004238,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.030708,0.0,0.017087,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01346,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.151407,0.151407,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.043422,0.018692,0.007994,0.0,0.0,0.0,0.0,0.012638,0.012638,0.0


In [11]:
fmc_computer_clustering.features_marginal_averages
# vouloir commander = 0.0015 = 0.015/10 or 0.0015/2 ?

{'2000': 0.0008319924269304347,
 '2000 euro': 0.0008319924269304347,
 '2000 euro paiement': 0.0008319924269304347,
 'achat': 0.004358288659895404,
 'achat internet': 0.0010010791777013653,
 'achat ligne': 0.0006095090577567146,
 'achat numero': 0.0013926783378198225,
 'achat numero carte': 0.0009854570335344538,
 'actif': 0.015140715924073733,
 'actif carte': 0.015140715924073733,
 'activer': 0.011896472526005192,
 'activer carte': 0.009464524171991114,
 'activer contact': 0.002671288846051326,
 'activer mode': 0.0012060070084016412,
 'activer numero': 0.0007884124287486173,
 'activer numero carte': 0.0007884124287486173,
 'activer option': 0.0017972404938328731,
 'activer possibilite': 0.0016767910519507843,
 'actuel': 0.0016170959763984777,
 'aimer': 0.009906583550421674,
 'aimer utiliser': 0.0010834313402584587,
 'aller': 0.001964196445651519,
 'annuler': 0.004014185439240913,
 'application': 0.006445879960502233,
 'argent': 0.007446590478933377,
 'argent compte': 0.0026372129759418

In [12]:
pd.DataFrame(fmc_computer_clustering.features_activation)

Unnamed: 0,2000,2000 euro,2000 euro paiement,achat,achat internet,achat ligne,achat numero,achat numero carte,actif,actif carte,...,voler carte,voler carte bleu,vouloir,vouloir activer,vouloir commander,vouloir connaitre,vouloir connaitre solde,vouloir signaler,vouloir signaler perte,voyage
0,False,False,False,False,False,False,False,False,False,False,...,False,False,True,True,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
2,False,False,False,True,True,True,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,True
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,True,False,False,False
5,True,True,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,True,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,True,True,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,True,True,False,False,False,False,False,True,True,False


In [13]:
pd.DataFrame(fmc_computer_clustering.features_contrast)

Unnamed: 0,2000,2000 euro,2000 euro paiement,achat,achat internet,achat ligne,achat numero,achat numero carte,actif,actif carte,...,voler carte,voler carte bleu,vouloir,vouloir activer,vouloir commander,vouloir connaitre,vouloir connaitre solde,vouloir signaler,vouloir signaler perte,voyage
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.24401,10.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.125254,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,6.563127,10.0,10.0,10.0,10.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,3.436873,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.834656,0.0,0.0,2.98237,0.0,0.0,0.0,10.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.908745,0.0,0.0,7.01763,10.0,0.0,0.0,0.0
5,10.0,10.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.367256,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.660958,0.0,10.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.166386,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,10.0,0.692735,0.0,0.0,0.0,0.0,10.0,10.0,0.0


In [14]:
from typing import Dict, Tuple
def compare_fmc_modelization(
    fmc_computed: FeaturesMaximizationMetric,
    fmc_reference: FeaturesMaximizationMetric,
) -> Tuple[
    float,
    float,
    float,
    Dict[str, Dict[str, float]],
    Dict[str, Dict[str, float]],
    Dict[str, Dict[str, float]],
]:
    """
    Compute similarity scores between a computed FMC modelization and a reference FMC modelization.
    Data classes can differ, but both modelizations must share the same vector features.

    Args:
        fmc_computed (FeaturesMaximizationMetric): Computed Features Maximization modelization.
        fmc_reference (FeaturesMaximizationMetric): Reference Features Maximization modelization.

    Raises:
        ValueError: if `list_of_possible_features` are different.

    Returns:
        Tuple[
            float,
            float,
            float,
            Dict[str, Dict[str, float]],
            Dict[str, Dict[str, float]],
            Dict[str, Dict[str, float]]
        ]: Computation of features activation equivalence and modelization similarity.
    """
    
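    # Check that both modelizations share the same features (see `Raises` above).
    if list(fmc_computed.list_of_possible_features) != list(fmc_reference.list_of_possible_features):
        raise ValueError("`fmc_computed` and `fmc_reference` must share the same `list_of_possible_features`.")

    ###
    ### Features activation Equivalence computations.
    ### (probability of activation of source on target)
    ###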
    
    def _compute_activation_probability(
        fmc_source: FeaturesMaximizationMetric,
        classe_source: str,
        fmc_target: FeaturesMaximizationMetric,
        classe_target: str,
    ) -> float:
        
        numerator: float = 0.0
        denominator: float = 0.0

        for feature_target in fmc_target.list_of_possible_features:

            #### if (
            ####     bool(fmc_target.features_activation[feature_target][classe_target])
            ####     and len(fmc_target.get_most_activated_classes_by_a_feature(feature=feature_target))==1
            #### ):
            if fmc_target.get_most_activated_classes_by_a_feature(feature=feature_target) == [classe_target]:
            # if bool(fmc_target.features_activation[feature_target][classe_target]):
                denominator += fmc_target.features_fmeasure[feature_target][classe_target]

                #### if (
                ####     bool(fmc_target.features_activation[feature_target][classe_target])
                ####     and len(fmc_target.get_most_activated_classes_by_a_feature(feature=feature_target))==1
                ####     and bool(fmc_source.features_activation[feature_target][classe_source])
                ####     and len(fmc_source.get_most_activated_classes_by_a_feature(feature=feature_target))==1
                #### ):
                if fmc_source.get_most_activated_classes_by_a_feature(feature=feature_target) == [classe_source]:
                # if bool(fmc_source.features_activation[feature_target][classe_source]):
                    numerator += fmc_target.features_fmeasure[feature_target][classe_target]

        return (
            0.0
            if denominator == 0
            else numerator / denominator
        )
    
    activation_probability_of_computed_on_reference: Dict[str, Dict[str, float]] = {
        classe_computed: {
            classe_reference: _compute_activation_probability(
                fmc_source=fmc_computed,
                classe_source=classe_computed,
                fmc_target=fmc_reference,
                classe_target=classe_reference,
            )
            for classe_reference in fmc_reference.list_of_possible_classes
        }
        for classe_computed in fmc_computed.list_of_possible_classes
    }
        
    activation_probability_of_reference_on_computed: Dict[str, Dict[str, float]] = {
        classe_computed: {
            classe_reference: _compute_activation_probability(
                fmc_source=fmc_reference,
                classe_source=classe_reference,
                fmc_target=fmc_computed,
                classe_target=classe_computed,
            )
            for classe_reference in fmc_reference.list_of_possible_classes
        }
        for classe_computed in fmc_computed.list_of_possible_classes
    }
        
    activation_probability_reciprocity: Dict[str, Dict[str, float]] = {
        classe_computed: {
            classe_reference: (
                0.0
                if (
                    activation_probability_of_computed_on_reference[classe_computed][classe_reference]
                    + activation_probability_of_reference_on_computed[classe_computed][classe_reference]
                ) == 0
                else (
                    2 * (
                        activation_probability_of_computed_on_reference[classe_computed][classe_reference]
                        * activation_probability_of_reference_on_computed[classe_computed][classe_reference]
                    ) / (
                        activation_probability_of_computed_on_reference[classe_computed][classe_reference]
                        + activation_probability_of_reference_on_computed[classe_computed][classe_reference]
                    )
                )
            )
            for classe_reference in fmc_reference.list_of_possible_classes
        }
        for classe_computed in fmc_computed.list_of_possible_classes
    }
    
    ###
    ### Modelization similarity computations.
    ### (average of probability of activation of source on target)
    ###
            
            
    similarity_of_computed_on_reference: float = sum(
        (
            sum(
            activation_probability_of_computed_on_reference[classe_computed][classe_reference]
            for classe_computed in fmc_computed.list_of_possible_classes
            )
        )
        for classe_reference in fmc_reference.list_of_possible_classes
    ) / len(
        fmc_reference.list_of_possible_classes
    )
        
    similarity_of_reference_on_computed: float = sum(
        sum(
            activation_probability_of_reference_on_computed[classe_computed][classe_reference]
            for classe_reference in fmc_reference.list_of_possible_classes
        ) / max(
            # Avoid a division by zero when a computed class matches no reference class
            # (the sum above is then 0, so the term contributes 0).
            1,
            len([
                classe_reference
                for classe_reference in fmc_reference.list_of_possible_classes
                if activation_probability_of_reference_on_computed[classe_computed][classe_reference] != 0
            ]),
        )
        for classe_computed in fmc_computed.list_of_possible_classes
    ) / len(
        fmc_computed.list_of_possible_classes
    )

    similarity_reciprocity: float = (
        0.0
        if (
            similarity_of_computed_on_reference
            + similarity_of_reference_on_computed
        ) == 0
        else (
            2 * (
                similarity_of_computed_on_reference
                * similarity_of_reference_on_computed
            ) / (
                similarity_of_computed_on_reference
                + similarity_of_reference_on_computed
            )
        )
    )
        
    return(
        similarity_of_computed_on_reference,
        similarity_of_reference_on_computed,
        similarity_reciprocity,
        activation_probability_of_computed_on_reference,
        activation_probability_of_reference_on_computed,
        activation_probability_reciprocity,
    )
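
The core of `_compute_activation_probability` can be illustrated without the FMC library. In the sketch below, each model is reduced to a mapping from feature to the single class that activates it (or `None` when the feature is activated by zero or several classes), and the target model's F-measures are given as plain numbers. All names and values are toy assumptions.

```python
from typing import Dict, Optional

# Toy models: feature -> class that exclusively activates it (None otherwise).
exclusive_class_reference: Dict[str, Optional[str]] = {
    "perte": "alerte",
    "vol": "alerte",
    "solde": "solde",
    "carte": None,
}
exclusive_class_computed: Dict[str, Optional[int]] = {
    "perte": 0,
    "vol": 1,
    "solde": 1,
    "carte": None,
}
# F-measure of each feature for its exclusive class in the reference model.
fmeasure_reference: Dict[str, float] = {
    "perte": 0.3,
    "vol": 0.1,
    "solde": 0.2,
    "carte": 0.0,
}


def activation_probability(source_class: int, target_class: str) -> float:
    """Share of the target class's F-measure mass also captured by the source class."""
    numerator = 0.0
    denominator = 0.0
    for feature, owner in exclusive_class_reference.items():
        if owner == target_class:
            denominator += fmeasure_reference[feature]
            if exclusive_class_computed[feature] == source_class:
                numerator += fmeasure_reference[feature]
    return 0.0 if denominator == 0 else numerator / denominator


p_cluster0_on_alerte = activation_probability(source_class=0, target_class="alerte")
p_cluster1_on_alerte = activation_probability(source_class=1, target_class="alerte")
print(p_cluster0_on_alerte, p_cluster1_on_alerte)
```

Here the reference class `alerte` is split between two computed clusters, so neither cluster captures its full F-measure mass. This is the situation the row sums of the `p1` table are meant to reveal.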

In [15]:
s1, s2, s3, p1, p2, p3 = compare_fmc_modelization(
    fmc_computed = fmc_computer_clustering,  # fmc_computer_clustering
    fmc_reference = fmc_computer_dataset,  # fmc_computer_dataset
)

In [16]:
print(s1)
pd.DataFrame(p1)

0.9108603867897326


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
alerte_perte_vol_carte,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.910916
carte_avalee,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.878821,0.0,0.0
commande_carte,0.0,0.0,0.0,0.0,0.0,0.0,0.973889,0.0,0.026111,0.0
consultation_solde,0.0,0.052644,0.0,0.0,0.858315,0.0,0.0,0.0,0.0,0.0
couverture_assurrance,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
deblocage_carte,0.0,0.0,0.586416,0.017947,0.0,0.0,0.04092,0.0,0.0,0.0
gestion_carte_virtuelle,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
gestion_decouvert,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
gestion_plafond,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
gestion_sans_contact,0.719304,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043321,0.0


In [18]:
pd.DataFrame(p1).sum(axis=1)
# Is a reference fully included in computation ?

alerte_perte_vol_carte     0.910916
carte_avalee               0.878821
commande_carte             1.000000
consultation_solde         0.910959
couverture_assurrance      1.000000
deblocage_carte            0.645283
gestion_carte_virtuelle    1.000000
gestion_decouvert          1.000000
gestion_plafond            1.000000
gestion_sans_contact       0.762625
dtype: float64

In [19]:
print(s2)
pd.DataFrame(p2)

0.7229152853599775


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
alerte_perte_vol_carte,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
carte_avalee,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
commande_carte,0.0,0.0,0.0,0.0,0.0,0.0,0.963112,0.0,0.201839,0.0
consultation_solde,0.0,0.039476,0.0,0.0,0.807826,0.0,0.0,0.0,0.0,0.0
couverture_assurrance,0.0,0.0,0.0,0.982447,0.0,0.0,0.0,0.0,0.0,0.0
deblocage_carte,0.0,0.0,0.26682,0.017553,0.0,0.0,0.036888,0.0,0.0,0.0
gestion_carte_virtuelle,0.0,0.0,0.73318,0.0,0.0,0.0,0.0,0.0,0.0,0.0
gestion_decouvert,0.0,0.907001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
gestion_plafond,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
gestion_sans_contact,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.694339,0.0


In [20]:
pd.DataFrame(p2).sum(axis=0)
# Is a computation fully included in a reference ?

0    1.000000
1    0.946477
2    1.000000
3    1.000000
4    0.807826
5    1.000000
6    1.000000
7    1.000000
8    0.896178
9    1.000000
dtype: float64

In [21]:
print(s3)
pd.DataFrame(p3)

0.8060774899075128


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
alerte_perte_vol_carte,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.953381
carte_avalee,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.935503,0.0,0.0
commande_carte,0.0,0.0,0.0,0.0,0.0,0.0,0.96847,0.0,0.04624,0.0
consultation_solde,0.0,0.045119,0.0,0.0,0.832305,0.0,0.0,0.0,0.0,0.0
couverture_assurrance,0.0,0.0,0.0,0.991146,0.0,0.0,0.0,0.0,0.0,0.0
deblocage_carte,0.0,0.0,0.366762,0.017748,0.0,0.0,0.0388,0.0,0.0,0.0
gestion_carte_virtuelle,0.0,0.0,0.846052,0.0,0.0,0.0,0.0,0.0,0.0,0.0
gestion_decouvert,0.0,0.951233,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
gestion_plafond,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
gestion_sans_contact,0.836739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.081554,0.0
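
The reciprocity values are harmonic means of the two directed scores, so they can be re-derived from the outputs printed above. The sketch below recomputes the global reciprocity from `s1` and `s2`, and one cell of the reciprocity table from the rounded values displayed for `alerte_perte_vol_carte` and cluster `9`. The function name `harmonic_mean` is introduced here for illustration.

```python
def harmonic_mean(first: float, second: float) -> float:
    """Harmonic mean of two non-negative scores, defined as 0 when both are 0."""
    if first + second == 0:
        return 0.0
    return 2 * first * second / (first + second)


# Global scores printed above (s1 and s2).
s3_check = harmonic_mean(0.9108603867897326, 0.7229152853599775)
# One table cell, from the rounded values displayed in the p1 and p2 tables.
p3_check = harmonic_mean(0.910916, 1.0)
print(round(s3_check, 6), round(p3_check, 6))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a pair of classes only scores high when the match holds in both directions.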


In [22]:
# Computation.
fmc_computer_1: FeaturesMaximizationMetric = FeaturesMaximizationMetric(
    data_vectors=csr_matrix(
        [
            [9, 5, 5],
            [9, 10, 5],
            [9, 20, 6],
            [5, 15, 5],
            [6, 25, 6],
            [5, 25, 5],
        ]
    ),
    data_classes=[
        "Man",
        "Man",
        "Man",
        "Woman",
        "Woman",
        "Woman",
    ],
    list_of_possible_features=[
        "Shoes size",
        "Hair size",
        "Nose size",
    ],
    amplification_factor=1,
)
pd.DataFrame(fmc_computer_1.features_activation)

Unnamed: 0,Shoes size,Hair size,Nose size
Man,True,False,False
Woman,False,True,False


In [23]:
# Computation.
fmc_computer_2: FeaturesMaximizationMetric = FeaturesMaximizationMetric(
    data_vectors=csr_matrix(
        [
            [9, 5, 5],
            [9, 10, 5],
            [9, 20, 6],
            [5, 15, 5],
            [6, 25, 6],
            [5, 25, 5],
            [5, 15, 9],
            [6, 10, 8],
            [9, 20, 8],
        ]
    ),
    data_classes=[
        "Man",
        "Man",
        "Man",
        "Woman",
        "Woman",
        "Woman",
        "??",
        "??",
        "??",
    ],
    list_of_possible_features=[
        "Shoes size",
        "Hair size",
        "Nose size",
    ],
    amplification_factor=1,
)
pd.DataFrame(fmc_computer_2.features_activation)

Unnamed: 0,Shoes size,Hair size,Nose size
??,False,False,True
Man,True,False,False
Woman,False,True,False


==================================

In [24]:
from typing import Dict, Tuple
def compare_fmc_modelization_v2(
    fmc_computed: FeaturesMaximizationMetric,
    fmc_reference: FeaturesMaximizationMetric,
) -> Tuple[
    float,
    float,
]:
    """
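    Compare a computed Features Maximization modelization to a reference one.

    (1) Presence = (
        Sum of the reference F-measure for features active in both ref. and comp.
    )/(
        Sum of the reference F-measure for features active in ref.
    )
    A feature is "active" in a modelization when exactly one class is most activated by it.
    (2) Concentration: not implemented yet, `0.0` is returned as a placeholder.

    Args:
        fmc_computed (FeaturesMaximizationMetric): Computed Features Maximization modelization.
        fmc_reference (FeaturesMaximizationMetric): Reference Features Maximization modelization.

    Raises:
        ValueError: if the two `list_of_possible_features` differ.

    Returns:
        Tuple[
            float,
            float
        ]: The presence score and the concentration placeholder (`0.0`).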
    """
    
    ###
    ### PRESENCE OF MODEL REFERENCE IN MODEL COMPUTED
    ###
    
    num: float = 0.0
    den: float = 0.0
        
    for classe_ref in fmc_reference.list_of_possible_classes:
            
        for feature_ref in fmc_reference.list_of_possible_features:
            
            # update DENOMINATOR
            if fmc_reference.get_most_activated_classes_by_a_feature(feature=feature_ref) == [classe_ref]:
                den += fmc_reference.features_fmeasure[feature_ref][classe_ref]
        
            # update NUMERATOR
            for classe_comp in fmc_computed.list_of_possible_classes:
            
                if (
                    fmc_reference.get_most_activated_classes_by_a_feature(feature=feature_ref) == [classe_ref]
                    and fmc_computed.get_most_activated_classes_by_a_feature(feature=feature_ref) == [classe_comp]
                ):
                    num += fmc_reference.features_fmeasure[feature_ref][classe_ref]
                    
    presence = (
        0.0
        if den == 0
        else num / den
    )
    
    ###
    ### CONCENTRATION OF MODEL REFERENCE IN MODEL COMPUTED
    ###
    
    
    
    return presence, 0.0

In [25]:
compare_fmc_modelization_v2(
    fmc_computed = fmc_computer_clustering,  # fmc_computer_clustering
    fmc_reference = fmc_computer_dataset,  # fmc_computer_dataset
)

(0.9116191353805566, 0.0)

In [26]:
compare_fmc_modelization_v2(
    fmc_computed = fmc_computer_dataset,  # fmc_computer_clustering
    fmc_reference = fmc_computer_dataset,  # fmc_computer_dataset
)

(1.0, 0.0)
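
The presence ratio can be illustrated without the `FeaturesMaximizationMetric` class. The sketch below is a simplified stand-in, not the function above: the three dictionaries and their values are hypothetical and only mimic what `get_most_activated_classes_by_a_feature` and `features_fmeasure` would provide.

```python
from typing import Dict, List

# Hypothetical most-activated classes per feature (illustrative values only).
ref_most_activated: Dict[str, List[str]] = {
    "Shoes size": ["Man"],
    "Hair size": ["Woman"],
    "Nose size": [],
}
comp_most_activated: Dict[str, List[str]] = {
    "Shoes size": ["cluster_0"],
    "Hair size": [],
    "Nose size": ["cluster_1"],
}
# Hypothetical reference F-measure per feature and class.
ref_fmeasure: Dict[str, Dict[str, float]] = {
    "Shoes size": {"Man": 0.6},
    "Hair size": {"Woman": 0.4},
}

num: float = 0.0
den: float = 0.0
for feature, ref_classes in ref_most_activated.items():
    # Only features with exactly one most-activated reference class count.
    if len(ref_classes) != 1:
        continue
    weight = ref_fmeasure[feature][ref_classes[0]]
    den += weight
    # The feature is "present" if the computed model also isolates one class.
    if len(comp_most_activated.get(feature, [])) == 1:
        num += weight

presence = 0.0 if den == 0 else num / den
print(presence)
```

Here only `"Shoes size"` is active in both models, so the presence is the share of the reference F-measure carried by that feature.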

=================================================

In [27]:
from sklearn.metrics.cluster import entropy, contingency_matrix, mutual_info_score, homogeneity_completeness_v_measure

In [28]:
import numpy as np
import math
from typing import Dict, List, Optional, Tuple

$
H(C)
= - \sum _{C_i \subset C} \frac{|C_i|}{N} \log \frac{|C_i|}{N}
= - \sum _{C_i \subset C} \frac{|C_i|}{N} \left( \log |C_i| - \log N \right)
$

where:
- $C=\{C_i\}$ = the labeling formatted as a list of sets of data points sharing the same label.
- $N$ = number of data points.
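
As a sanity check (assuming the natural logarithm, as `np.log` is used in the cells below), take a labeling where each of the $N = 10$ data points has its own label, so $|C_i| = 1$ for every $i$:

$
H(C)
= - \sum _{i=1}^{10} \frac{1}{10} \log \frac{1}{10}
= \log 10
\approx 2.302585
$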

In [29]:
def compute_entropy(
    list_of_labels: List[str],
    rounded: Optional[int] = None,
) -> float:
    r"""
    Calculate the Entropy of a labeling.
    The Entropy is defined as:
    $
        H(C)
        = - \sum _{C_i \subset C} \frac{|C_i|}{N} \log \frac{|C_i|}{N}
        = - \sum _{C_i \subset C} \frac{|C_i|}{N} \left( \log |C_i| - \log N \right)
    $
    where $N$ is the number of data points
    and $C=\{C_i\}$ is the labeling formatted as a list of sets of data points sharing the same label.
    
    Args:
        list_of_labels (List[str]): The list of labels, i.e. `list_of_labels[d]` is the label of data `d`.
        rounded (Optional[int]): Number of decimals used to round the result and absorb floating-point error from the logarithms. Defaults to `None` (no rounding).
    
    Returns:
        float: The Entropy of the labeling (natural logarithm). It is `1.0` for an empty list and `0.0` for a single label.
    """
    
    # Count number of data.
    nb_data: int = len(list_of_labels)
    # Case of no data: defaults to 1.0.
    if nb_data == 0:
        return 1.0
    
    # Count occurrence of each label.
    labels_occurrences: Dict[str, int] = {}
    for label in list_of_labels:
        if label not in labels_occurrences.keys():
            labels_occurrences[label] = 0
        labels_occurrences[label] += 1
    # Case of single cluster: no entropy.
    if len(labels_occurrences.keys()) <= 1:
        return 0.0

    # Compute entropy.
    #### $
    ####     H(C)
    ####     = - \sum _{C_i \subset C} \frac{|C_i|}{N} \log \frac{|C_i|}{N}
    ####     = - \sum _{C_i \subset C} \frac{|C_i|}{N} \left( \log |C_i| - \log N \right)
    #### $
    entropy: float = -sum(
        label_occurrence
        / nb_data
        * (
            np.log(label_occurrence)
            - np.log(nb_data)
        )
        for label_occurrence in labels_occurrences.values()
        # if label_occurrence > 0  # NB: Always greater than 0 by design.
    )

    # Round the results.
    if rounded is not None:
        entropy = round(entropy, rounded)
        
    # Return entropy.
    return entropy

In [30]:
L = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [31]:
H_custom = compute_entropy(list_of_labels=L)
H_custom

2.3025850929940455

In [32]:
H_sklearn = entropy(labels=L)
H_sklearn

2.302585092994046

In [33]:
H_custom == H_sklearn

False
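
The two values differ only in their last digit, so the strict `==` test fails on floating-point noise rather than on a real disagreement. A minimal sketch of a tolerance-based comparison, using the two outputs copied from the cells above (the tolerance `1e-12` is an arbitrary choice for illustration):

```python
import math

# Values copied from the outputs of the two cells above.
h_custom = 2.3025850929940455
h_sklearn = 2.302585092994046

# Strict equality fails because the floats differ in their last bit.
strictly_equal = h_custom == h_sklearn

# A relative tolerance treats both results as the same entropy.
close_enough = math.isclose(h_custom, h_sklearn, rel_tol=1e-12)

print(strictly_equal, close_enough)
```

The `rounded` argument of `compute_entropy` serves the same purpose by rounding the result before comparison.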

$
MI(C,K)
= \sum _{C_i \subset C} \sum _{K_j \subset K} \frac{|C_i \cap K_j|}{N} \log \frac{ N |C_i \cap K_j|}{|C_i||K_j|}
= \sum _{C_i \subset C} \sum _{K_j \subset K} \frac{|C_i \cap K_j|}{N} \left( \log |C_i \cap K_j| - \log |C_i| - \log|K_j| + \log N \right)
$

where:
- $C=\{C_i\}$ = the reference labeling formatted as a list of sets of data points sharing the same label.
- $K=\{K_j\}$ = the clustering labeling formatted as a list of sets of data points sharing the same label.
- $N$ = number of data points.
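
For instance, when the two labelings are identical ($K = C$), $|C_i \cap C_j| = |C_i|$ if $i = j$ and $0$ otherwise. Empty intersections contribute nothing to the sum, so the Mutual Information reduces to the Entropy:

$
MI(C,C)
= \sum _{C_i \subset C} \frac{|C_i|}{N} \log \frac{N |C_i|}{|C_i||C_i|}
= - \sum _{C_i \subset C} \frac{|C_i|}{N} \log \frac{|C_i|}{N}
= H(C)
$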

In [34]:
def compute_mutual_info_score(
    list_of_classes_labels: List[str],
    list_of_clusters_labels: List[str],
    rounded: Optional[int] = None,
) -> float:
    r"""
    Calculate the Mutual Information between two labelings.
    The Mutual Information is defined as:
    $
        MI(C,K)
        = \sum _{C_i \subset C} \sum _{K_j \subset K} \frac{|C_i \cap K_j|}{N} \log \frac{ N |C_i \cap K_j|}{|C_i||K_j|}
        = \sum _{C_i \subset C} \sum _{K_j \subset K} \frac{|C_i \cap K_j|}{N} \left( \log |C_i \cap K_j| - \log |C_i| - \log|K_j| + \log N \right)
    $
    where $N$ is the number of data points,
    $C=\{C_i\}$ is the classes labeling formatted as a list of sets of data points sharing the same class,
    and $K=\{K_j\}$ is the clusters labeling formatted as a list of sets of data points sharing the same cluster.
    
    Warning:
        The result should be rounded to absorb floating-point error from the logarithms.
    
    Args:
        list_of_classes_labels (List[str]): The list of classes, i.e. `list_of_classes_labels[d]` is the class of data `d`.
        list_of_clusters_labels (List[str]): The list of clusters, i.e. `list_of_clusters_labels[d]` is the cluster of data `d`.
        rounded (Optional[int]): Number of decimals used to round the result. Defaults to `None` (no rounding).
        
    Raises:
        ValueError: Lists must have the same size.

    Returns:
        float: The Mutual Information between two labelings.
    """
    
    # Count number of data.
    if len(list_of_classes_labels) != len(list_of_clusters_labels):
        raise ValueError("Lists must have the same size.")
    nb_data: int = len(list_of_classes_labels)
    
    # Count occurrence of classes, occurrence of clusters, and co-occurrence of both.
    occurrences_classes: Dict[str, int] = {
        classe: 0
        for classe in sorted(set(list_of_classes_labels))
    }
    occurrences_clusters: Dict[str, int] = {
        cluster: 0
        for cluster in sorted(set(list_of_clusters_labels))
    }
    cooccurrences: Dict[str, Dict[str, int]] = {
        classe: {
            cluster: 0
            for cluster in sorted(set(list_of_clusters_labels))
        }
        for classe in sorted(set(list_of_classes_labels))
    }
    for x in range(nb_data):
        classe: str = list_of_classes_labels[x]
        cluster: str = list_of_clusters_labels[x]
        occurrences_classes[classe] += 1
        occurrences_clusters[cluster] += 1
        cooccurrences[classe][cluster] += 1
    
    # Compute
    #### $
    ####     MI(C,K)
    ####     = \sum _{C_i \subset C} \sum _{K_j \subset K} \frac{|C_i \cap K_j|}{N} \log \frac{ N |C_i \cap K_j|}{|C_i||K_j|}
    ####     = \sum _{C_i \subset C} \sum _{K_j \subset K} \frac{|C_i \cap K_j|}{N} \left( \log |C_i \cap K_j| + \log N - \log |C_i| - \log |K_j| \right)
    #### $
    MI: float = sum(
        (
            cooccurrences[classe][cluster] / nb_data
            * (
                np.log(cooccurrences[classe][cluster])
                + np.log(nb_data)
                - np.log(occurrences_classes[classe])
                - np.log(occurrences_clusters[cluster])
            )
        )
        for classe in occurrences_classes.keys()
        for cluster in occurrences_clusters.keys()
        if cooccurrences[classe][cluster] != 0
    )

    # Round the results.
    if rounded is not None:
        MI = round(MI, rounded)

    return MI

In [46]:
list_of_classes_labels = [0, 0, 1, 1, -1, -2, -3, -4, -5, -6, -7, -8, -9]
list_of_clusters_labels = ["0", "0", "1", "1", "-1", "-2", "-3", "-4", "-5", "-6", "-7", "-8", "-9"]

In [47]:
MI_custom = compute_mutual_info_score(
    list_of_classes_labels=list_of_classes_labels,
    list_of_clusters_labels=list_of_clusters_labels,
)
MI_custom

2.351673301904631

In [48]:
MI_sklearn = mutual_info_score(
    labels_true=list_of_classes_labels,
    labels_pred=list_of_clusters_labels,
)
MI_sklearn

2.351673301904631

In [49]:
MI_custom-MI_sklearn

0.0

In [50]:
def compute_homogeneity_compleness_vmeasure(
    list_of_classes_labels: List[str],
    list_of_clusters_labels: List[str],
    rounded: Optional[int] = None,
) -> Tuple[
    float,
    float,
    float
]:
    r"""
    Calculate the Homogeneity, Completeness and V-measure between two labelings.
    They are derived from the Entropy $H$ and the Mutual Information $MI$:
    $
        Homogeneity(C,K) = \frac{MI(C,K)}{H(C)}
        , \quad
        Completeness(C,K) = \frac{MI(C,K)}{H(K)}
        , \quad
        VMeasure(C,K) = \frac{2 \times Homogeneity(C,K) \times Completeness(C,K)}{Homogeneity(C,K) + Completeness(C,K)}
    $
    where $C=\{C_i\}$ is the classes labeling formatted as a list of sets of data points sharing the same class,
    and $K=\{K_j\}$ is the clusters labeling formatted as a list of sets of data points sharing the same cluster.
    Homogeneity (resp. Completeness) defaults to `1.0` when $H(C)$ (resp. $H(K)$) is zero.
    
    Warning:
        The result should be rounded to absorb floating-point error from the logarithms.
    
    Args:
        list_of_classes_labels (List[str]): The list of classes, i.e. `list_of_classes_labels[d]` is the class of data `d`.
        list_of_clusters_labels (List[str]): The list of clusters, i.e. `list_of_clusters_labels[d]` is the cluster of data `d`.
        rounded (Optional[int]): Number of decimals used to round the results. Defaults to `None` (no rounding).
        
    Raises:
        ValueError: Lists must have the same size.

    Returns:
        Tuple[float, float, float]: The Homogeneity, the Completeness and the V-measure between two labelings.
    """
    
    # Get Entropy and Mutual Information.
    entropy_classes: float = compute_entropy(
        list_of_labels=list_of_classes_labels,
        rounded=rounded,
    )
    entropy_clusters: float = compute_entropy(
        list_of_labels=list_of_clusters_labels,
        rounded=rounded,
    )
    mutual_information: float = compute_mutual_info_score(
        list_of_classes_labels=list_of_classes_labels,
        list_of_clusters_labels=list_of_clusters_labels,
        rounded=rounded,
    )
        
    # Compute Homogeneity.
    homogeneity: float = (
        1.0
        if entropy_classes == 0
        else mutual_information / entropy_classes
    )
    # Compute Completeness.
    completness: float = (
        1.0
        if entropy_clusters == 0
        else mutual_information / entropy_clusters
        
    )
    # Compute VMeasure.
    vmeasure: float = (
        0.0
        if homogeneity + completness == 0
        else 2 * (
            homogeneity * completness
        ) / (
            homogeneity + completness
        )
    )
    
    # Round the results.
    if rounded is not None:
        homogeneity = round(homogeneity, rounded)
        completness = round(completness, rounded)
        vmeasure = round(vmeasure, rounded)
        
    return homogeneity, completness, vmeasure

In [58]:
list_of_classes_labels = [0, 0, 1, 1, -1, -2, -3, -4, -5, -6, -7, -8, -9]
list_of_clusters_labels = ["0", "0", "1", "1", "-2", "-2", "-3", "-4", "-5", "-6", "-7", "-8", "-9"]

In [59]:
val_custom = compute_homogeneity_compleness_vmeasure(
    list_of_classes_labels=list_of_classes_labels,
    list_of_clusters_labels=list_of_clusters_labels,
    rounded=10,
)
val_custom

(0.9546544039, 1.0, 0.9768012207)

In [53]:
val_sklearn = homogeneity_completeness_v_measure(
    labels_true=list_of_classes_labels,
    labels_pred=list_of_clusters_labels,
)
val_sklearn

(1.0, 1.0, 1.0)

In [43]:
val_custom == val_sklearn

True
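
The execution counters of the last cells are out of order (`In [59]`, `In [53]`, `In [43]`), so the displayed outputs may not all come from the same label lists. As an independent check, the sketch below recomputes the three scores with the standard library only, assuming homogeneity is $MI / H(C)$ and completeness is $MI / H(K)$ as in `compute_homogeneity_compleness_vmeasure`. The helper name `entropy_of` is introduced here for illustration.

```python
import math
from collections import Counter

# Label lists copied from cell 58 above.
classes = [0, 0, 1, 1, -1, -2, -3, -4, -5, -6, -7, -8, -9]
clusters = ["0", "0", "1", "1", "-2", "-2", "-3", "-4", "-5", "-6", "-7", "-8", "-9"]
n = len(classes)


def entropy_of(labels):
    """Entropy (natural logarithm) of a list of labels."""
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


h_classes = entropy_of(classes)
h_clusters = entropy_of(clusters)

# Mutual Information from class, cluster and pair counts.
count_classes = Counter(classes)
count_clusters = Counter(clusters)
count_pairs = Counter(zip(classes, clusters))
mi = sum(
    c / n * math.log(n * c / (count_classes[a] * count_clusters[b]))
    for (a, b), c in count_pairs.items()
)

homogeneity = mi / h_classes
completeness = mi / h_clusters
v_measure = 2 * homogeneity * completeness / (homogeneity + completeness)
print(round(homogeneity, 6), round(completeness, 6), round(v_measure, 6))
```

With these labels, cluster `"-2"` mixes classes `-1` and `-2`, so homogeneity falls below 1 while completeness stays at 1, in line with the custom output `(0.9546544039, 1.0, 0.9768012207)`.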

=================================================

In [72]:
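def from_fmc_to_clustering(
    fmc
):
    """
    Convert a Features Maximization modelization into a labeling of its features.
    Each feature is labeled with the single class it activates most, or `-1` otherwise.
    """
    
    res = []
    for feature in fmc.list_of_possible_features:
        
        most_activated_classes = fmc.get_most_activated_classes_by_a_feature(
            feature=feature
        )
        
        # Keep the class only when it is unambiguous.
        if len(most_activated_classes) == 1:
            res.append(most_activated_classes[0])
        # Otherwise, group the feature with the other outliers under `-1`.
        else:
            res.append(-1)
    return res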

In [73]:
f1 = from_fmc_to_clustering(fmc_computer_dataset)
f2 = from_fmc_to_clustering(fmc_computer_clustering)
len(f1), len(f2)

(505, 505)

In [69]:
homogeneity_completeness_v_measure(
    labels_true=from_fmc_to_clustering(fmc_computer_dataset),
    labels_pred=from_fmc_to_clustering(fmc_computer_clustering),
)

(1.0, 1.0, 1.0)

$
C = \{
    F_{c}
    | c \in classes
\}
$

$
F_{c} = \{
    f_c | f_c \in features, selected(f_c), active(f_c, c), exclusive(f_c, c)
\}
$

$
P_{C}(c) =
$

- $
    \frac{
        |F_{c}|
    }{
        \sum_{x \in C}  |F_{x}|
    }
$

- $ \frac{
        \sum_{f_c \in F_{c}} \left( 1 + FM(f_c)(c) \right)
    }{
        \sum_{x \in C}  \sum_{f \in F_{x}} \left( 1 + FM(f)(x) \right)
    }
$

$
P_{C,K}(c,k) =
$

- $
    \frac{
        |F_{c} \cap F_{k}|
    }{
        |\cup _{x \in C \cup K} F_{x} |
    }
$
- $ \frac{
        \sum_{f_{ck} \in F_{c} \cap F_{k}} \left( 1 + FM(f_{ck})(c) + FM(f_{ck})(k) \right)
    }{
        \sum_{x \subset C}  \sum_{f \in F_{x}} ...
    }
$

$
H(C) = - \sum _{C_i \subset C} P_{C}(C_i) \log P_{C}(C_i)
$

$
MI(C,K) = \sum _{C_i \subset C} \sum _{K_j \subset K} P_{C,K}(C_i, K_j) \log \frac{ P_{C,K}(C_i, K_j) }{ P_{C}(C_i) P_{K}(K_j) }
$