In [1]:
#!/bin/bash -e

# Use of ANNIF library

This notebook walks through every step of using the Annif library (installation documentation for all required libraries still to be written).
- Formatting data for use in ANNIF
- Training a model
- Using pipelines to test several models
- Searching for the best parameters

## Setup 

### Packages

In [2]:
# Import librairies
import os
import re
import csv
import pandas as pd

### Graphical parameters

### Paths

In [3]:
# Set path
abes_path = "/home/aurelie/ABES/labo-indexation-ai/"
os.chdir(abes_path)

In [4]:
# Create folders if needed
list_folder = [
    "ANNIF", 
    "ANNIF/data", "ANNIF/reports",
    "ANNIF/data/train", "ANNIF/data/test", "ANNIF/data/valid"]

for folder in list_folder:
    if not os.path.exists(folder):
        os.makedirs(folder)
    else:
        print(f"Folder {folder} already exists")



Folder ANNIF already exists
Folder ANNIF/data already exists
Folder ANNIF/reports already exists
Folder ANNIF/data/train already exists
Folder ANNIF/data/test already exists
Folder ANNIF/data/valid already exists


In [5]:
# Set current directory
annif_path = os.getcwd() + "/ANNIF"
os.chdir(annif_path)

In [6]:
# Set paths
data_path = "./../data"
fig_path = "./../figs"
annif_data_path = annif_path + "/data"
annif_report_path = annif_path + "/reports"


annif_train_folder_path = os.path.join(annif_data_path, "train/")
annif_test_folder_path = os.path.join(annif_data_path, "test/")
annif_valid_folder_path = os.path.join(annif_data_path, "valid/")

## Prepare data

In [7]:
! annif list-vocabs

/bin/bash: /home/aurelie/anaconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)


Vocabulary ID       Languages                 Size  Loaded
----------------------------------------------------------
rameau              fr                      103628  True  
rameau-chains       fr                      151048  True  


In [8]:
# Select vocabulary to use
vocab = "rameau-chains"

In [9]:
if vocab == "rameau":
    vocab_filename = os.path.join(annif_data_path,'subjects.csv')
    train_tsv_path = os.path.join(annif_data_path, "rameau-chains-train.tsv")
    test_tsv_path = os.path.join(annif_data_path, "rameau-chains-test.tsv")
    valid_tsv_path = os.path.join(annif_data_path, "rameau-chains-valid.tsv")

elif vocab == "rameau-chains":
    vocab_filename = os.path.join(annif_data_path,'chain-subjects.csv')
    train_tsv_path = os.path.join(annif_data_path, "rameau-train.tsv")
    test_tsv_path = os.path.join(annif_data_path, "rameau-test.tsv")
    valid_tsv_path = os.path.join(annif_data_path, "rameau-valid.tsv")


else:
    # Stop here: without a valid vocabulary, vocab_filename is undefined below
    raise ValueError(f"vocab should be 'rameau' or 'rameau-chains'. You asked for {vocab}")

# Load vocabulary
! annif load-vocab {vocab} {vocab_filename} --force

# Set texts folders
annif_train_folder_path = os.path.join(annif_data_path, "train/")
annif_test_folder_path = os.path.join(annif_data_path, "test/")
annif_valid_folder_path = os.path.join(annif_data_path, "valid/")


/bin/bash: /home/aurelie/anaconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)


Loading vocabulary from CSV file /home/aurelie/ABES/labo-indexation-ai/ANNIF/data/chain-subjects.csv...
creating subject index
saving vocabulary into SKOS file data/vocabs/rameau-chains/subjects.ttl


In [10]:
# Check import with bash:
! head {vocab_filename}

/bin/bash: /home/aurelie/anaconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)
label_fr,uri
Kirp?n,https://www.idref.fr/157992527
Militaires artistes,https://www.idref.fr/110140494
Militaires romains,https://www.idref.fr/028492161
Militaires prussiens,https://www.idref.fr/028521757
Sa-skya-pa,https://www.idref.fr/029895561
Militaires réunionnais,https://www.idref.fr/031875459
Construction à l'épreuve de la sécheresse,https://www.idref.fr/032370083
Missionnaires suisses,https://www.idref.fr/032878117
Militaires ivoiriens,https://www.idref.fr/034423982


In [11]:
# Check import with bash:
! tail {vocab_filename}

/bin/bash: /home/aurelie/anaconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Coutumes alimentaires -- Aspect social,https://www.idref.fr/027233464#--https://www.idref.fr/027790088
Sociologie -- Recherche,https://www.idref.fr/049647490#--https://www.idref.fr/027315754
Isolation thermique -- Aspect environnemental,https://www.idref.fr/027235394#--https://www.idref.fr/027587886
"Nombre, Idée de -- Chez l'enfant",https://www.idref.fr/027241181#--https://www.idref.fr/16719402X
Risques professionnels -- Prévention,https://www.idref.fr/165201010#--https://www.idref.fr/02724525X
Population -- Statistiques -- Vingtième siècle,https://www.idref.fr/027546071#--http://www.idref.fr/028627008#--https://www.idref.fr/027338983
Esclavage -- Dans l'art,https://www.idref.fr/027224295#--https://www.idref.fr/027966216
Physique -- Dans l'art,https://www.idref.fr/027247015#--https://www.idref.fr/027966216
Violences sexuelles -- Prévention,https://www.idref.fr/027343758#--htt

In [12]:
# Check number of concepts in the vocabulary file:
! wc -l < {vocab_filename}

/bin/bash: /home/aurelie/anaconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)
151049
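
`wc -l` counts the header line too, so 151049 lines correspond to the 151048 concepts reported by `annif list-vocabs`. The sketch below shows one way to run a slightly richer check from Python with the `csv` module. The helper name `summarize_vocab` is hypothetical, and it assumes the `label_fr,uri` header and the `#--` separator between chained URIs seen in the output above.

```python
import csv
import io

def summarize_vocab(csv_file):
    """Count concepts, chained entries and duplicate URIs in a label/URI vocabulary CSV."""
    reader = csv.DictReader(csv_file)
    seen, duplicates, chains, total = set(), [], 0, 0
    for row in reader:
        uri = row["uri"]
        total += 1
        if "#--" in uri:  # separator observed between URIs of a chain
            chains += 1
        if uri in seen:
            duplicates.append(uri)
        seen.add(uri)
    return {"concepts": total, "chains": chains, "duplicate_uris": duplicates}

# Tiny in-memory sample mirroring the layout of the vocabulary file
sample = io.StringIO(
    "label_fr,uri\n"
    "Militaires romains,https://www.idref.fr/028492161\n"
    "Sociologie -- Recherche,https://www.idref.fr/049647490#--https://www.idref.fr/027315754\n"
    '"Nombre, Idée de -- Chez l\'enfant",https://www.idref.fr/027241181#--https://www.idref.fr/16719402X\n'
)
summary = summarize_vocab(sample)
print(summary)
```

On the real file, the same function could be called as `summarize_vocab(open(vocab_filename, newline="", encoding="utf-8"))`.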


### Datasets for training and evaluation

In [13]:
# Check import with bash: 
! head -n 5 {train_tsv_path}

/bin/bash: /home/aurelie/anaconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)
La culture pour vivre Mort de la culture populaire en France. Mutation des institutions culturelles grâce à une technique de mise en relation des oeuvres et d'un public, et qui tend à créer un comportement culturel adapté aux caractéristiques de l'époque	https://www.idref.fr/027348237 https://www.idref.fr/027224929 https://www.idref.fr/027416593
La nuit, le jour : essai psychanalytique sur le fonctionnement mental Discontinuité, latence, rétablissement d’une continuité organisent la vie psychique. Réparatrice est dite la nuit... Les auteurs ont voulu montrer la complexité sous-jacente à cette qualité dès lors que Freud met au jour dans l’étude du rêve, au-delà d’une certaine réalisation de désir inconscient lié à l’histoire individuelle d’un sujet donné, l’existence de « veinures » qui résultent de la préhistoire de tous les humains et qui, imprimant la matière où s’inscrit le
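
The output above suggests the corpus layout used here: one document per line, with the text, a TAB, then the subject URIs separated by spaces. A minimal sketch of producing such a line follows. The helper `to_annif_tsv_line` is hypothetical and is not the code that generated the files in this notebook.

```python
def to_annif_tsv_line(text, uris):
    """Return one corpus line: cleaned text, a TAB, then space-separated subject URIs."""
    # Tabs and newlines inside the text would break the one-document-per-line layout,
    # so all runs of whitespace are collapsed to a single space.
    clean_text = " ".join(text.split())
    return clean_text + "\t" + " ".join(uris)

line = to_annif_tsv_line(
    "La culture pour vivre\nMort de la culture populaire en France.",
    ["https://www.idref.fr/027348237", "https://www.idref.fr/027224929"],
)
print(repr(line))
```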

## Test ANNIF

### List all available projects 

In [14]:
# list of projects
project_list = ! annif list-projects
project_list

['/bin/bash: /home/aurelie/anaconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)',
 'Project ID               Project Name                                 Language  Trained',
 '---------------------------------------------------------------------------------------',
 'rameau-tfidf-snowball-fr RAMEAU - TF-IDF(Snowball)                    fr        True   ',
 'rameau-fasttext-snowball-frFastText French RAMEAU                       fr        False  ',
 'rameau-yake-snowball-fr  Yake French RAMEAU                           fr        True   ',
 'rameau-mllm-snowball-fr  RAMEAU MLLM project                          fr        True   ',
 'rameau-omikuji-snowball-frOmikuji Parabel French                       fr        True   ',
 'rameau-tfidf-fr          TF-IDF French RAMEAU with spacy lemma        fr        False  ',
 'rameau-fasttext-fr       FastText French RAMEAU                       fr        True   ',
 'rameau-yake-fr           Yake French RAMEAU         

### Select project to test

In [15]:
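# Select project and parameters
project = "rameau-chains-ensemble-allButFastext-fr"
njobs = 0  # value passed to --jobs
input_file = train_tsv_path  # corpus used for training
max_nb_concepts = 10  # value passed to --limit
threshold = 0.2  # value passed to --threshold
trials = 100  # number of hyperopt trials
metric_file_path = os.path.join(annif_report_path, project + ".json")
result_file_path = os.path.join(annif_report_path, project + ".csv")

# Corpus used for evaluation (currently the training file)
test_file = train_tsv_path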

In [16]:
metric_file_path

'/home/aurelie/ABES/labo-indexation-ai/ANNIF/reports/rameau-chains-ensemble-allButFastext-fr.json'

### Train model

In [19]:
# Train project
! annif train {project} --jobs {njobs} {input_file}

/bin/bash: /home/aurelie/anaconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)


2023-07-13 08:21:04.744963: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Backend nn_ensemble: creating NN ensemble model
2023-07-13 08:21:06.380933: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Backend nn_ensemble: Initializing source projects: rameau-chains-tfidf-snowball-fr, rameau-chains-mllm-snowball-fr, rameau-chains-omikuji-snowball-fr
2023-07-13T08:21:11.033Z INFO [omikuji::model] Loading model from data/projects/rameau-chains-omikuji-snowbal

### Evaluate model

In [20]:
# Evaluate project
! annif eval --limit {max_nb_concepts} --threshold {threshold} --metrics-file {metric_file_path} --results-file {result_file_path} --jobs {njobs} {project} {test_file}

/bin/bash: /home/aurelie/anaconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Writing per subject evaluation results to /home/aurelie/ABES/labo-indexation-ai/ANNIF/reports/rameau-chains-ensemble-allButFastext-fr.csv
2023-07-13 15:05:10.429138: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-13T15:05:16.224Z INFO [omikuji::model] Loading model from data/projects/rameau-chains-omikuji-snowball-fr/omikuji-model...
2023-07-13T15:05:16.224Z INFO [omikuji::model] Loading model settings from data/projects/rameau-chains-omikuji-snowball-fr/omikuji-model/settings.json...
2023-07-13T15:05:16.224Z INFO [omikuji::model] Loaded model settings Settings { n_features: 20759
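
Each `annif eval` run with `--metrics-file` leaves a JSON report in the reports folder. A rough sketch of collecting those reports for comparison is given below. It assumes each file holds a flat JSON object of metric names and values, and the helper name `load_metric_reports` and the metric names in the demo are placeholders.

```python
import json
import os
import tempfile

def load_metric_reports(report_dir):
    """Return {file name: metrics dict} for every .json file in report_dir."""
    reports = {}
    for name in sorted(os.listdir(report_dir)):
        if name.endswith(".json"):
            with open(os.path.join(report_dir, name), encoding="utf-8") as handle:
                reports[name] = json.load(handle)
    return reports

# Demo on a throwaway folder; metric names and values are placeholders
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "project-a.json"), "w", encoding="utf-8") as handle:
        json.dump({"metric_1": 0.5, "metric_2": 0.25}, handle)
    reports = load_metric_reports(tmp)

print(reports)
```

With pandas available, `pd.DataFrame(load_metric_reports(annif_report_path)).T` would give one row per report file and one column per metric.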

## Hyperoptimization

In [17]:
# Separate report files for the hyperparameter search
metric_file_path = os.path.join(annif_report_path, project + "_opt.json")
result_file_path = os.path.join(annif_report_path, project + "_opt.csv")

! annif hyperopt {project} --trials {trials} --results-file {result_file_path} {test_file}

/bin/bash: /home/aurelie/anaconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Looking for optimal hyperparameters using 100 trials
[I 2023-07-12 09:59:12,973] A new study created in memory with name: no-name-ff3cbc6c-06b3-4e59-b7fb-aef9607be1b8
[I 2023-07-12 10:26:02,466] Trial 0 finished with value: 0.3213123679161072 and parameters: {'min_samples_leaf': 22, 'max_leaf_nodes': 1066, 'max_samples': 0.9236792232828465}. Best is trial 0 with value: 0.3213123679161072.
[I 2023-07-12 10:59:53,713] Trial 1 finished with value: 0.32043683528900146 and parameters: {'min_samples_leaf': 18, 'max_leaf_nodes': 878, 'max_samples': 0.606656009987659}. Best is trial 0 with value: 0.3213123679161072.
[I 2023-07-12 11:31:48,931] Trial 2 finished with value: 0.3230161666870117 and parameters: {'min_samples_leaf': 22, 'max_leaf_nodes': 1494, 'max_samples': 0.74663042365701}. Best is trial 2 with value: 0.3230161666870117.

In [19]:
result_file_path

'/home/aurelie/ABES/labo-indexation-ai/ANNIF/reports/rameau-chains-simple-ensemble-allButFastext-fr.csv'

In [42]:
# Load the hyperopt results (one row per trial)
opt = pd.read_csv("reports/rameau-chains-simple-ensemble-allButFastext-fr_opt.csv", sep='\t')
opt

Unnamed: 0,trial,value,rameau-chains-tfidf-snowball-fr,rameau-chains-mllm-snowball-fr,rameau-chains-omikuji-snowball-fr
0,1,0.876617,0.379310,0.414124,0.206566
1,3,0.919342,0.310702,0.345550,0.343748
2,2,0.862996,0.828231,0.057027,0.114742
3,0,0.904235,0.381039,0.343160,0.275802
4,5,0.930734,0.385528,0.226061,0.388410
...,...,...,...,...,...
95,95,0.977741,0.006305,0.059121,0.934573
96,96,0.977773,0.003472,0.056872,0.939656
97,98,0.978182,0.004521,0.022829,0.972649
98,97,0.978190,0.000146,0.016319,0.983535


In [43]:
# Select the trial with the highest objective value
best_model = opt.loc[opt["value"].idxmax()]
best_model

trial                                26.000000
value                                 0.978343
rameau-chains-tfidf-snowball-fr       0.007775
rameau-chains-mllm-snowball-fr        0.004958
rameau-chains-omikuji-snowball-fr     0.987267
Name: 26, dtype: float64
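
The best trial gives one weight per source project. In Annif, ensemble weights are normally declared in the project configuration as `project-id:weight` pairs in the `sources` setting, so the best row can be turned into such a line before retraining. The helper `format_sources` below is a hypothetical sketch, and the weights are copied from the output above.

```python
def format_sources(weights):
    """Build a `project:weight,...` string from a mapping of project ID to weight."""
    return ",".join(f"{project}:{float(weight):.6f}" for project, weight in weights.items())

# Weights of the best trial shown above
best_weights = {
    "rameau-chains-tfidf-snowball-fr": 0.007775,
    "rameau-chains-mllm-snowball-fr": 0.004958,
    "rameau-chains-omikuji-snowball-fr": 0.987267,
}
sources_line = "sources=" + format_sources(best_weights)
print(sources_line)
```

Starting from the `best_model` Series, the mapping could be obtained with `best_model.drop(["trial", "value"]).to_dict()`.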

In [None]:
# Retrain model
! annif train {project} --jobs {njobs} {input_file}

In [None]:
# Evaluate project
! annif eval --limit {max_nb_concepts} --threshold {threshold} --metrics-file {metric_file_path} --results-file {result_file_path} --jobs {njobs} {project} {test_file}
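
Outside a notebook, the same evaluation call can be assembled as an argument list and passed to `subprocess.run`, which avoids shell quoting problems with paths. This is only a sketch: `build_eval_command` is a hypothetical helper, the project ID and paths in the demo are placeholders, and the flags are the ones used in the cell above.

```python
import shlex

def build_eval_command(project, corpus, limit, threshold, metrics_file, results_file, jobs):
    """Return the `annif eval` call as an argument list suitable for subprocess.run."""
    return [
        "annif", "eval",
        "--limit", str(limit),
        "--threshold", str(threshold),
        "--metrics-file", metrics_file,
        "--results-file", results_file,
        "--jobs", str(jobs),
        project, corpus,
    ]

cmd = build_eval_command(
    project="my-project",                    # hypothetical project ID
    corpus="data/my-test.tsv",               # hypothetical corpus path
    limit=10,
    threshold=0.2,
    metrics_file="reports/my-project.json",  # hypothetical report paths
    results_file="reports/my-project.csv",
    jobs=0,
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```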

In [None]:
annif_eval.loc["rameau-pav-MLLM-fr.json"]

### Prediction on all records from the test folder

In [57]:
# Predict subjects for every record in the "test" folder
suffix = "_" + project + ".csv"
! annif index -s {suffix} {project} {annif_test_folder_path}

/bin/bash: /home/aurelie/anaconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)


2023-07-12T20:16:41.004Z INFO [omikuji::model] Loading model from data/projects/rameau-chains-omikuji-snowball-fr/omikuji-model...
2023-07-12T20:16:41.004Z INFO [omikuji::model] Loading model settings from data/projects/rameau-chains-omikuji-snowball-fr/omikuji-model/settings.json...
2023-07-12T20:16:41.004Z INFO [omikuji::model] Loaded model settings Settings { n_features: 207599, classifier_loss_type: Hinge }...
2023-07-12T20:16:41.004Z INFO [omikuji::model] Loading tree from data/projects/rameau-chains-omikuji-snowball-fr/omikuji-model/tree0.cbor...
2023-07-12T20:16:42.063Z INFO [omikuji::model] Loading tree from data/projects/rameau-chains-omikuji-snowball-fr/omikuji-model/tree1.cbor...
2023-07-12T20:16:43.116Z INFO [omikuji::model] Loading tree from data/projects/rameau-chains-omikuji-snowball-fr/omikuji-model/tree2.cbor...
2023-07-12T20:16:44.180Z INFO [omikuji::model] Loaded model with 3 trees; it took 3.18s


## Format predictions for future use

In [58]:
csv_files = [f for f in os.listdir(annif_test_folder_path) if f.endswith(suffix)]
print(f"There are {len(csv_files)} files to compile")

There are 29227 files to compile


In [59]:
# Build dataframe: one row per record (PPN) with its predicted labels and scores
rows = []
for file in csv_files:
    ppn = file.split("_")[0]  # file names start with the PPN
    pred = pd.read_csv(
        os.path.join(annif_test_folder_path, file),
        sep="\t", header=None, names=["URI", "pred_concept", "score"])
    rows.append({
        "PPN": ppn,
        "predictions": pred["pred_concept"].to_list(),
        "scores": pred["score"].to_list()})
# Collecting rows first avoids growing the dataframe one cell at a time
predictions = pd.DataFrame(rows, columns=["PPN", "predictions", "scores"])

In [60]:
# Show predictions
predictions.head()

Unnamed: 0,PPN,predictions,scores
0,250492091,"[Notariat, Manuels d'enseignement supérieur, C...","[0.1410744786262512, 0.012239116244018, 0.0024..."
1,186369999,"[Manuels d'enseignement supérieur, Problèmes e...","[0.2717428505420685, 0.1196022480726242, 0.082..."
2,158168623,"[Didactique, Universités, Séminaires (groupes ...","[0.6160987019538879, 0.493828684091568, 0.1499..."
3,194354326,"[Enseignement supérieur, Étude et enseignement...","[0.9926307201385498, 0.5386803150177002, 0.098..."
4,261584022,"[Histoire, Aspect social, Culture populaire, O...","[0.9645671844482422, 0.592507004737854, 0.0823..."
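
The score lists above include many low-confidence suggestions. A small sketch of applying the same kind of cut-off as the `--limit` and `--threshold` options used for evaluation is shown below. The helper `filter_predictions` is hypothetical, and the labels and scores in the demo are illustrative values, not results from this notebook.

```python
def filter_predictions(labels, scores, threshold=0.2, limit=10):
    """Keep at most `limit` labels whose score is >= threshold, best first."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return [(label, score) for label, score in ranked if score >= threshold][:limit]

# Illustrative values only
kept = filter_predictions(
    ["label_a", "label_b", "label_c"],
    [0.49, 0.62, 0.15],
)
print(kept)
```

Applied to the dataframe, this could be mapped over the `predictions` and `scores` columns row by row.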


In [62]:
# Save dataframe
predictions.to_csv(os.path.join(annif_report_path, str(project + "_predictions.csv")))

## Merge predictions with existing predictions (including reindexation and indexing if available)

In [None]:
os.getcwd()

In [66]:
# Set files
input_file = "./../data/reindexation_final_with_concepts_juin2023.csv"
output_file = "./../data/reindexation_final_with_concepts_juin2023_withANNIF.csv"


In [65]:
os.getcwd()

'/home/aurelie/ABES/labo-indexation-ai/ANNIF'

In [67]:
# Load the reindexing file that already holds earlier predictions (output_file),
# so that the new prediction columns are appended to it
indexation_file = pd.read_csv(output_file, index_col=0)
print(indexation_file.shape)
indexation_file.head(3)

(100, 34)


Unnamed: 0,PPN,DESCR,RAMEAU_CHECKED,rameau_concepts,rameau_chaines_index,INDEX_AFE,INDEX_JMF,INDEX_LJZ,INDEX_LPL,INDEX_MCR,...,rameau_index_chain_AFE,rameau_index_chain_MCR,rameau_index_chain_JMF,rameau_index_chain_LPL,rameau_index_chain_LJZ,rameau_index_chain_MPD,predictions_rameau-ensemble-mllmSpacy-allButFastext-fr,scores_rameau-ensemble-mllmSpacy-allButFastext-fr,predictions_rameau-omikuji-snowball-fr,scores_rameau-omikuji-snowball-fr
0,000308838,Les sommets de l'État : essai sur l'élite du p...,Bureaucratie;Classes dirigeantes;Classes dirig...,"['Bureaucratie', 'Classes dirigeantes', 'Class...","['Bureaucratie', 'Classes dirigeantes', 'Class...",Classes dirigeantes -- France -- Histoire;;Pou...,Classes dirigeantes -- Relations avec l'État -...,Classes dirigeantes -- France;;Hauts-fonctionn...,Hauts-fonctionnairesss -- France;;Classes diri...,Pouvoir (sciences sociales) -- Classes dirigea...,...,"['Classes dirigeantes -- France -- Histoire', ...",['Pouvoir (sciences sociales) -- Classes dirig...,"[""Classes dirigeantes -- Relations avec l'État...","['Hauts-fonctionnairesss -- France', 'Classes ...","['Classes dirigeantes -- France', 'Hauts-fonct...","['Classes dirigeantes -- France -- Histoire', ...","['Histoire', 'Élite (sciences sociales)', 'Éta...","[0.5610998868942261, 0.4483318328857422, 0.435...","['Élite (sciences sociales)', 'État', 'Histoir...","[0.46941938996315, 0.3625344932079315, 0.24151..."
1,00094758X,Le dollar La quatrième de couverture indique :...,Dollar américain;Finances internationales;Poli...,"['Dollar américain', 'Finances internationales...","['Dollar américain', 'Finances internationales...","Dollar américain;;Eurodollar, Marché de l';;Po...",Dollar américain ;;Politique économique -- Éta...,"Dollar américain;;Eurodollar, Marché de l';;Fi...",Dollar américain -- Influence -- 20e siècle;;F...,Dollar américain -- Mondialisation;;Dollar amé...,...,"['Dollar américain', ""Eurodollar, Marché de l'...","['Dollar américain -- Mondialisation', 'Dollar...","['Dollar américain ', 'Politique économique --...",['Dollar américain -- Influence -- 20e siècle'...,"['Dollar américain', ""Eurodollar, Marché de l'...","['Dollar américain', ""Eurodollar, Marché de l'...","['Système monétaire international', 'Monnaie',...","[0.2470552176237106, 0.1972707509994506, 0.164...","['Système monétaire international', 'Monnaie',...","[0.2199680209159851, 0.0743607208132743, 0.063..."
2,003632806,Les intellectuels sous la Ve République : 1958...,Intellectuels;Intellectuels français,"['Intellectuels', 'Intellectuels français']","['Intellectuels', 'Intellectuels français']",Intellectuels -- France -- 1958-.... (5e Répub...,Intellectuels français -- Sociologie ;;Intelle...,Intellectuels -- France;;Vie intellectuelle --...,Intellectuels -- France -- 1958 (5e République...,Intellectuels -- France -- 1945,...,['Intellectuels -- France -- 1958-.... (5e Rép...,['Intellectuels -- France -- 1945'],"['Intellectuels français -- Sociologie ', 'Int...",['Intellectuels -- France -- 1958 (5e Républiq...,"['Intellectuels -- France', 'Vie intellectuell...",['Intellectuels -- France -- 1958-.... (5e Rép...,"['Intellectuels', 'Sociologie', 'Politique et ...","[0.2859456539154053, 0.1352124512195587, 0.077...","['Intellectuels', 'Sociologie', 'Vie intellect...","[0.1557696014642715, 0.0378550477325916, 0.020..."


In [68]:
# Merge predictions (inner join: records without a prediction file are dropped)
output = indexation_file.merge(predictions, on="PPN", how="inner")
output.rename(columns={"predictions": "predictions_" + project, "scores": "scores_" + project}, inplace=True)

In [69]:
output.head(3)

Unnamed: 0,PPN,DESCR,RAMEAU_CHECKED,rameau_concepts,rameau_chaines_index,INDEX_AFE,INDEX_JMF,INDEX_LJZ,INDEX_LPL,INDEX_MCR,...,rameau_index_chain_JMF,rameau_index_chain_LPL,rameau_index_chain_LJZ,rameau_index_chain_MPD,predictions_rameau-ensemble-mllmSpacy-allButFastext-fr,scores_rameau-ensemble-mllmSpacy-allButFastext-fr,predictions_rameau-omikuji-snowball-fr,scores_rameau-omikuji-snowball-fr,predictions_rameau-chains-pav-allButFastext-fr,scores_rameau-chains-pav-allButFastext-fr
0,000308838,Les sommets de l'État : essai sur l'élite du p...,Bureaucratie;Classes dirigeantes;Classes dirig...,"['Bureaucratie', 'Classes dirigeantes', 'Class...","['Bureaucratie', 'Classes dirigeantes', 'Class...",Classes dirigeantes -- France -- Histoire;;Pou...,Classes dirigeantes -- Relations avec l'État -...,Classes dirigeantes -- France;;Hauts-fonctionn...,Hauts-fonctionnairesss -- France;;Classes diri...,Pouvoir (sciences sociales) -- Classes dirigea...,...,"[""Classes dirigeantes -- Relations avec l'État...","['Hauts-fonctionnairesss -- France', 'Classes ...","['Classes dirigeantes -- France', 'Hauts-fonct...","['Classes dirigeantes -- France -- Histoire', ...","['Histoire', 'Élite (sciences sociales)', 'Éta...","[0.5610998868942261, 0.4483318328857422, 0.435...","['Élite (sciences sociales)', 'État', 'Histoir...","[0.46941938996315, 0.3625344932079315, 0.24151...","[Classes dirigeantes, État, Élite (sciences so...","[0.9877382516860962, 0.9687923789024352, 0.888..."
1,00094758X,Le dollar La quatrième de couverture indique :...,Dollar américain;Finances internationales;Poli...,"['Dollar américain', 'Finances internationales...","['Dollar américain', 'Finances internationales...","Dollar américain;;Eurodollar, Marché de l';;Po...",Dollar américain ;;Politique économique -- Éta...,"Dollar américain;;Eurodollar, Marché de l';;Fi...",Dollar américain -- Influence -- 20e siècle;;F...,Dollar américain -- Mondialisation;;Dollar amé...,...,"['Dollar américain ', 'Politique économique --...",['Dollar américain -- Influence -- 20e siècle'...,"['Dollar américain', ""Eurodollar, Marché de l'...","['Dollar américain', ""Eurodollar, Marché de l'...","['Système monétaire international', 'Monnaie',...","[0.2470552176237106, 0.1972707509994506, 0.164...","['Système monétaire international', 'Monnaie',...","[0.2199680209159851, 0.0743607208132743, 0.063...","[Système monétaire international, Finances int...","[0.98750501871109, 0.1764083951711654, 0.04999..."
2,003632806,Les intellectuels sous la Ve République : 1958...,Intellectuels;Intellectuels français,"['Intellectuels', 'Intellectuels français']","['Intellectuels', 'Intellectuels français']",Intellectuels -- France -- 1958-.... (5e Répub...,Intellectuels français -- Sociologie ;;Intelle...,Intellectuels -- France;;Vie intellectuelle --...,Intellectuels -- France -- 1958 (5e République...,Intellectuels -- France -- 1945,...,"['Intellectuels français -- Sociologie ', 'Int...",['Intellectuels -- France -- 1958 (5e Républiq...,"['Intellectuels -- France', 'Vie intellectuell...",['Intellectuels -- France -- 1958-.... (5e Rép...,"['Intellectuels', 'Sociologie', 'Politique et ...","[0.2859456539154053, 0.1352124512195587, 0.077...","['Intellectuels', 'Sociologie', 'Vie intellect...","[0.1557696014642715, 0.0378550477325916, 0.020...","[Sociologie, Opinion publique, Aspect social, ...","[0.0289787556976079, 0.0040915142744779, 0.003..."


In [70]:
# Save output
output.to_csv(output_file)
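
Writing list-valued columns to CSV stores them as text such as `"['Intellectuels', 'Sociologie']"`, which is why the columns loaded from the file show quoted lists while freshly merged ones do not. Before computing metrics, such cells need to be turned back into lists. A minimal sketch using `ast.literal_eval` follows; the helper name `parse_label_list` is hypothetical.

```python
import ast

def parse_label_list(cell):
    """Turn a stringified Python list (as written by to_csv) back into a list."""
    if isinstance(cell, list):
        return cell  # already a list, e.g. freshly built in memory
    if not isinstance(cell, str) or not cell.strip():
        return []  # missing value (NaN) or empty cell
    return ast.literal_eval(cell)

parsed = parse_label_list("['Intellectuels', 'Intellectuels français']")
print(parsed)
```

On a dataframe column this could be applied with `output[col] = output[col].apply(parse_label_list)`.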

## Multilabel classification - Metrics

In [None]:
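from sklearn.preprocessing import MultiLabelBinarizer
# Import only the helpers used below instead of a wildcard import
from utils_metrics import label_metrics_report, metrics_radar_plot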

In [None]:
# Variables to use: (run name, column of reference labels, column of predicted labels)
field = [
    ("ANNIF_mllm", "rameau_concepts", "predictions_opt_rameau-mllm-fr"),
    ("ANNIF_tfidf",  "rameau_concepts", "predictions_rameau-tfidf-snowball-fr")
]
results = dict()


In [None]:
def flatten(nested):
    # Avoid naming the parameter `list`, which would shadow the built-in
    flat_list = [item for sublist in nested for item in sublist]
    return flat_list

In [None]:
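# Binarization
# Note: columns loaded from CSV hold stringified lists and must be parsed into lists first
for name, true_col, pred_col in field:
    print("Working on", name)
    mlb = MultiLabelBinarizer(sparse_output=False)
    # fit expects one iterable of labels per record; a flattened list of strings
    # would be split into single characters
    mlb.fit(output["rameau_concepts"])
    sudoc = mlb.transform(output[true_col])
    embed = mlb.transform(output[pred_col])
    results[name] = label_metrics_report("Embeddings", sudoc, embed, zero_division=0)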
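
The implementation of `label_metrics_report` lives in `utils_metrics` and is not shown in this notebook. To illustrate the kind of figures a multilabel report typically contains, here is a plain-Python sketch of sample-averaged precision, recall and F1 computed on label sets. It is not the project's implementation; the function names are hypothetical, and empty denominators yield 0, in the spirit of `zero_division=0`.

```python
def sample_prf(true_labels, pred_labels):
    """Precision, recall and F1 for one record, comparing label sets."""
    true_set, pred_set = set(true_labels), set(pred_labels)
    hits = len(true_set & pred_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def sample_averaged(y_true, y_pred):
    """Average the per-record scores over all records."""
    scores = [sample_prf(t, p) for t, p in zip(y_true, y_pred)]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

# Illustrative reference and predicted labels for two records
y_true = [["Intellectuels", "Intellectuels français"], ["Dollar américain"]]
y_pred = [["Intellectuels", "Sociologie"], ["Monnaie"]]
precision, recall, f1 = sample_averaged(y_true, y_pred)
print(precision, recall, f1)
```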

In [None]:
mlb.inverse_transform(embed)

### Dataframe of results

In [None]:
result_df = pd.DataFrame(results).T
result_df

### Plot

In [None]:
# Plot results
metrics_radar_plot(
    result_df,
    remove_identity=False,
    title="Quantitative comparisons",
    savefig="metrics_ANNIF-sudoc.html",)