# Tutorial
This notebook illustrates the different building blocks of the pipeline (see `narval/pipeline.py`).  
It can be run locally with the lightweight `google/flan-t5-base` model, although indicator extraction is then very poor (this is still sufficient to understand how Narval works internally).  
For better performance, use the heavier `meta-llama/Meta-Llama-3-8B-Instruct` model and run it on a machine with a GPU with $\gtrsim$ 16 GB of memory.  
To extract indicator values from several PDFs in "production" mode, use `run_full_pipeline.py` instead.

### Import modules

In [1]:
import pandas as pd
import sys
sys.path.append("../")    # Add the path to the root directory (where we can find the folder .git)

%load_ext autoreload
%autoreload 2 

from narval.pdfreader import PDFReader
from narval.pagefinder import PageFinder
from narval.qamodel import QAModel
from narval.pipeline import Pipeline, merge_question_answer_dicts
from narval.answermanager import AnswerManager
from narval.metrics import MetricsCalculator
from narval.utils import get_data_dir, FileSystem



In [2]:
pd.options.display.max_columns = None
pd.options.display.max_colwidth = None

### Instantiate the File System

Narval can run locally or in the cloud, using an S3 bucket for storage. The next cell automatically detects the file system as well as the path to the Narval directory where the folder `data` is located.

In [3]:
fs = FileSystem()
data_dir = get_data_dir()

In [4]:
print(data_dir)

gefleury/narval


### Choose a question file
For each indicator value to be found in the PDF, Narval asks a Small Language Model (SLM) a few questions. The choice of these questions is crucial for performance. Note that each question is formatted in the prompt given to the SLM as `Question : {question} en {year} ?`
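
As a minimal illustration of the template quoted above, the sketch below fills it with one question from the question file and the year used in this tutorial. The helper name `format_question` is hypothetical and is not part of Narval; the actual formatting is done in `narval/prompt.py`.

```python
def format_question(question: str, year: str) -> str:
    """Fill the question template quoted in the text (hypothetical helper)."""
    return f"Question : {question} en {year} ?"


example = format_question("Quelle est la valeur de l'indicateur D201.0", "2021")
print(example)
```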

In [5]:
# Choose the question file in `data/input`
question_file = "question_keyword_v7.csv"

# Show its content
question_df = fs.read_csv_to_df(data_dir+"/data/input/"+question_file, index_col=0)
question_df.sort_values(by="indic").head(6)

Unnamed: 0,question,mot,indic,competence
38,Quel le nombre d'habitants desservis par le réseau d'assainissement collectif (D201.0),habitant,D201.0,assainissement collectif
37,Quelle est la valeur de l'indicateur D201.0,D201.0,D201.0,assainissement collectif
13,Quel est le nombre d'autorisations de déversement d'effluents d'établissements industriels (D202.0),autorisation,D202.0,assainissement collectif
12,Quelle est la valeur de l'indicateur D202.0,D202.0,D202.0,assainissement collectif
17,"Quelle est la quantité de boues évacuées (D203.0), et non pas la quantité de boues produites,",boue,D203.0,assainissement collectif
16,Quelle est la valeur de l'indicateur D203.0,D203.0,D203.0,assainissement collectif


Note that each question is associated with a keyword (column `mot`). This keyword is used to identify the pages of the PDF that are relevant for answering the corresponding question.

### Choose the indicator file
The indicator file contains mandatory information for each indicator: indicator code, unit, critical and warning min/max bounds, and prompt instructions. Units and indicator-specific prompt instructions are used to format the prompts given to the SLM. 

In [6]:
# Choose the indicator file in "data/input"
indicator_file = "indicateurs_v6.csv"

# Show its content
indicator_df = fs.read_csv_to_df(
    data_dir+"/data/input/"+indicator_file, 
    usecols=['code_ip', 'name_competence', 
             'min_warning_ip', 'max_warning_ip', 'min_critic_ip', 'max_critic_ip', 
             'libelle_grand_public', 'unit_tag', 'prompt_instruction']
)
indicator_df[indicator_df["name_competence"]=="assainissement collectif"].head()

Unnamed: 0,code_ip,name_competence,min_warning_ip,max_warning_ip,min_critic_ip,max_critic_ip,libelle_grand_public,unit_tag,prompt_instruction
0,P205.3,assainissement collectif,0.0,100.0,0.0,100.0,Conformité de la performance des ouvrages d’épuration du service aux prescriptions nationales issues de la directive ERU,en %,"\n- Ne confonds pas avec d'autres taux de conformité dans l'extrait. En particulier, l'indicateur recherché n'est pas P204.3 et n'est pas P254.3. Si tu ne trouves pas l'indicateur recherché, réponds '{no_answer_tag}'."
1,P203.3,assainissement collectif,0.0,100.0,0.0,100.0,Conformité de la collecte des effluents aux prescriptions définies aux prescriptions nationales issues de la directive ERU,en %,"\n- Ne confonds pas avec d'autres taux de conformité dans l'extrait. Si tu ne trouves pas l'indicateur recherché, réponds '{no_answer_tag}'."
2,P204.3,assainissement collectif,0.0,100.0,0.0,100.0,Conformité des équipements d’épuration aux prescriptions nationales issues de la directive ERU,en %,"\n- Ne confonds pas avec d'autres taux de conformité dans l'extrait. En particulier, l'indicateur recherché n'est pas P205.3 et n'est pas P254.3. Si tu ne trouves pas l'indicateur recherché, réponds '{no_answer_tag}'."
3,P253.2,assainissement collectif,0.0,3.0,0.0,100.0,Renouvellement des réseaux de collecte des eaux usées,en %,"\n- Ne confonds pas avec d'autres valeurs {unit_tag} dans l'extrait. Si tu ne trouves pas l'indicateur recherché, réponds '{no_answer_tag}'."
5,P202.2A,assainissement collectif,0.0,100.0,0.0,100.0,Connaissance et gestion patrimoniale des réseaux de collecte des eaux usées (jusqu'à 2012),sans unité,


### Instantiate the pipeline

Choose the subfolder where results will be saved 

In [7]:
benchmark_version = "tutorial"   # Results will be saved in f"data/output/{benchmark_version}"

... and create the pipeline. See the ``pipeline.py`` module for the other input parameters; the default values are used here.

In [8]:
pipeline = Pipeline(
    question_file=question_file,
    indicator_file=indicator_file,  
    benchmark_version=benchmark_version
)

### Choose and read a PDF file

In [9]:
# Choose a PDF in the folder "/data/input/pdfs/"
pdf_file = "RPQS_Rully_AC_2021.pdf"  #"RPQS_Junas_AC_2021.pdf"
# Extract text + tables 
pdf_pages, pdf_tables, toc_indices, is_rad = (
    pipeline.extract_text_and_tables_from_pdf(pdf_file)
)

Show the extracted text for all pages

In [10]:
for i, page_text in enumerate(pdf_pages):
    print(f"### Text from page {i} ###\n{page_text}\n")

### Text from page 0 ###
Assainissement Exercice
2021
Rully
Rapport annuel sur le Prixet la 
Qualité duService public


### Text from page 1 ###
      R apport annuel sur le Prix et la Qualité du Service public
Rapport relatif au prix et à la qualité du service public de l’assainissement collectif pour l'exercice 2021 présenté conformément à l’article L.2224 5 du 
code général des collectivités territoriales et au décret n°2007-675 du 2 mai 2007.Vérifié par : Arnaud DEBOSQUE
Approuvé par : Florence SYOENEdité le : lundi 7 novembre 2022
Etabli par : Quentin SENEZN° de dossier : 64029
ADTO -SAO
SPL au capital de 3 306 750 €
36 avenue Salvador Allende
Bâtiment A «Hervé CARLIER»
60000 BEAUVAIS
Tél: 03 44 15 37 37 Fax: 03 44 15 37 30
accueil@adto -sao.fr
Page n° 2 sur 50  

### Text from page 2 ###
      R apport annuel sur le Prix et la Qualité du Service public
Synthèse de l'Exercice 2021
Rapport sur le Prix et la Qualité du Service 
public de l'assainissement collectif
Rully
0,0 tMS/an d

Show the raw extracted text for a given page

In [11]:
num_page = 6  # the first PDF page has num_page=0
pdf_pages[num_page]

"      R apport annuel sur le Prix et la Qualité du Service public\n-   6 postes de refoulement\n-   7,629km de réseaux\n-   52 branchements\nLes compétences liées au service sont  la collecte, le transfert et le traitement des eaux usées : \n- La collecte consiste à reprendre l’ensemble des eaux usées domestiques ou non au droit de chaque habitation \ndans le réseau d’assainissement.I) CARACTERISATION DU SERVICE\nA) Présentation du territoire desservi\nLa commune de Rully gère le service de l'assainissement collectif au niveau communal. La collectivité dispose des \nouvrages suivants : \n-   2 stations d'épuration\n- la compétence liée au transfert consiste à assurer le transport des eaux usées depuis le réseau de collecte vers \nl’usine de traitement : il peut s’agir de canalisations de refoulement ou de canalisations intercommunales par \nexemple.\n- la compétence liée au traitement consiste à améliorer la qualité des effluents à l’aide d’ouvrages adaptés avant \nrejet en milieu sup

Show the raw extracted tables for a given page

In [12]:
num_page = 41  # the first PDF page has num_page=0
page_tables = pdf_tables[num_page]

for df in page_tables:
    display(df)

Unnamed: 0,Col,Indicateur,2020,2021
0,Indice de connaissance et de gestion patrimoniale des réseaux de collecte des eaux usées,P202.2B,77 / 120,77 / 120
1,Prix TTC du service au m³ pour 120 m³,D204.0,"4,02 €/m³","4,11 €/m³"
2,Montant des abandons de créances ou des versements à un fond de solidarité,D207.0,"0,00 €","0,00 €"
3,Taux de débordement des effluents dans les locaux des usagers,P251.1,"0,00%","0,00%"
4,Nombre de points noirs du réseau,P252.2,000,000
5,Taux moyen de renouvellement des réseaux,P253.2,"0,00%","0,00%"
6,Indice de connaissance des rejets au milieu naturel par les réseaux de collecte des eaux usées,P255.3,90 / 120,90 / 120
7,Durée d'extinction de la dette de la collectivité (en année),P256.2,485,2225
8,Taux d'impayés sur les factures d'eau de l'année précédente,P257.0,"1,20%","0,47%"
9,Taux de réclamation,P258.1,"0,00%","0,00%"


Find the pages containing the table of contents

In [13]:
print(f"The Table of Content is located on pages {toc_indices} (first page is on page 0).")

The Table of Content is located on pages [4, 5] (first page is on page 0).


### Find the relevant pages in the PDF for each question

In [14]:
segmentation_df = pipeline.get_segmentation_df(
    pdf_pages=pdf_pages, 
    pdf_tables=pdf_tables, 
    competence="assainissement collectif", 
    toc_indices=toc_indices   # The TOC pages are excluded.
)

# Show the first rows of this dataframe
segmentation_df.sort_values(by="indicator").head(6)

Unnamed: 0,indicator,question,keyword_regex,relevant_pages,table_relevant_pages
0,D201.0,Quel le nombre d'habitants desservis par le réseau d'assainissement collectif (D201.0),\bhabitants?\b,"[2, 3, 7, 10, 16, 25, 35, 42]",
1,D201.0,Quelle est la valeur de l'indicateur D201.0,\bD201.0s?\b,"[3, 7, 42]","[[42, 0], [42, 1]]"
2,D202.0,Quel est le nombre d'autorisations de déversement d'effluents d'établissements industriels (D202.0),\bautorisations?\b,"[11, 17, 42]",
3,D202.0,Quelle est la valeur de l'indicateur D202.0,\bD202.0s?\b,"[3, 11, 17]","[[42, 0], [42, 1]]"
4,D203.0,"Quelle est la quantité de boues évacuées (D203.0), et non pas la quantité de boues produites,",\bboues?\b,"[2, 15, 21, 34, 42]",
5,D203.0,Quelle est la valeur de l'indicateur D203.0,\bD203.0s?\b,"[15, 21, 42]","[[42, 0], [42, 1]]"


- The column `relevant_pages` gives, for each question, the list of pages whose text matches the corresponding keyword-based regex.  
- The column `table_relevant_pages` gives, for each indicator, the list `t_list` of tables that contain the indicator code: each element `[i, j]` of `t_list` holds the indices needed to access the table `pdf_tables[i][j]`.
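
The keyword-based page search can be sketched as follows. This is only an illustration under stated assumptions, not Narval's `PageFinder`: the helper names are hypothetical, the regex is built as a whole-word match with an optional plural "s" (mirroring the patterns shown in `keyword_regex`), the keyword is escaped, and matching is case-insensitive.

```python
import re


def keyword_to_regex(keyword: str) -> re.Pattern:
    """Whole-word pattern with an optional plural 's' (assumed construction)."""
    return re.compile(rf"\b{re.escape(keyword)}s?\b", flags=re.IGNORECASE)


def find_relevant_pages(pages, keyword, excluded=()):
    """Return the indices of the pages matching the keyword, skipping `excluded`."""
    pattern = keyword_to_regex(keyword)
    return [
        i for i, text in enumerate(pages)
        if i not in excluded and pattern.search(text)
    ]


# Toy pages (made up for this example); page 0 plays the role of a TOC page.
pages = [
    "Sommaire : habitants desservis ....... 3",
    "Le service dessert 756 habitants.",
    "Prix du service au m3.",
    "Indicateur D201.0 : 756",
]

habitant_pages = find_relevant_pages(pages, "habitant", excluded={0})
code_pages = find_relevant_pages(pages, "D201.0")
print(habitant_pages, code_pages)
```

Excluding the table-of-contents pages avoids sending the model a page that mentions the keyword without containing any value.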

### Extract indicators from tables (no AI)

Give the reporting year of the RPQS/RAD as an input parameter

In [15]:
year = "2021"

Extract indicators from tables:  
- only the tables identified in the column `table_relevant_pages` of `segmentation_df` are considered
- indicators are only extracted from "summary" tables that have a column with indicator codes and a column whose name contains the reporting ``year`` (but not `year-1`)
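
The rule above can be sketched with plain pandas. This is a simplified illustration, not the code behind `extract_indicators_from_tables`: the function name is hypothetical, and the column detection (substring match on the column name and on the cell contents) is an assumption. The toy table reuses two rows of the summary table shown earlier.

```python
import pandas as pd


def extract_from_summary_table(table: pd.DataFrame, indicator_code: str, year: str):
    """Return the raw cell for `indicator_code` in the `year` column, or None."""
    previous_year = str(int(year) - 1)
    # Columns whose name contains the year but not the previous year
    year_cols = [
        c for c in table.columns
        if year in str(c) and previous_year not in str(c)
    ]
    # Columns in which at least one cell contains the indicator code
    code_cols = [
        c for c in table.columns
        if table[c].astype(str).str.contains(indicator_code, regex=False).any()
    ]
    if not year_cols or not code_cols:
        return None
    mask = table[code_cols[0]].astype(str).str.contains(indicator_code, regex=False)
    return table.loc[mask, year_cols[0]].iloc[0]


table = pd.DataFrame({
    "Col": ["Prix TTC du service au m³ pour 120 m³",
            "Taux moyen de renouvellement des réseaux"],
    "Indicateur": ["D204.0", "P253.2"],
    "2020": ["4,02 €/m³", "0,00%"],
    "2021": ["4,11 €/m³", "0,00%"],
})

value = extract_from_summary_table(table, "D204.0", "2021")
print(value)
```

The returned value is still a raw string (unit and decimal comma included); cleaning happens in a later step.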

In [16]:
indicator_value_dict = pipeline.extract_indicators_from_tables(
    pdf_tables, segmentation_df, year
)

Show the "raw" extracted indicators. They will be cleaned later.

In [17]:
pd.DataFrame(indicator_value_dict)

Unnamed: 0,indicator_code_list,answer_list_from_tables
0,D201.0,"[NC, 756]"
1,D202.0,"[0, 0]"
2,D203.0,"[0,00 tMS, 0,00 tMS]"
3,D204.0,"[4,11 €/m³]"
4,P201.1,"[100,00%]"
5,P202.2B,[77 / 120]
6,P203.3,[]
7,P204.3,[]
8,P205.3,[]
9,P206.3,[]


- At this stage there may be several values for a given indicator if several tables contain the indicator code. Selection is done afterwards.   
- In general, some indicators (if not all) cannot be extracted from summary tables. For those, we ask the SLM.

### Load the Small Language Model

- For fast execution but poor performance, choose the model `google/flan-t5-base`.  
- For good performance but slow execution, choose the model `meta-llama/Meta-Llama-3-8B-Instruct`:
  - This requires a GPU with $\gtrsim$ 16 GB of memory
  - This requires first creating a Hugging Face token `HF_TOKEN` in your Hugging Face profile and saving it as an environment variable.  



In [18]:
# This cell needs to be run only once and only for `meta-llama/Meta-Llama-3-8B-Instruct`
# There is no need to run this cell if you have already logged in to HuggingFace Hub previously
# This cell must be run if the next cell generates an `AttributeError` inviting you to log in to the HuggingFace Hub 

'''
import os
from huggingface_hub import login

hf_token = os.environ["HF_TOKEN"]
login(token = hf_token)
'''

'\nimport os\nfrom huggingface_hub import login\n\nhf_token = os.environ["HF_TOKEN"]\nlogin(token = hf_token)\n'

In [None]:
# Choose the Question Answering model
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"   #"google/flan-t5-base"
# Load the model (this takes 5-10 minutes for Meta-Llama-3-8B-Instruct)
pipeline.model_name = model_name
pipeline.load_qa_model()

Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.02it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
Device set to use cuda:0


Device =  cuda:0


### Ask questions

Choose the prompt version and give additional information about the PDF that will be used in the prompts (depending on the chosen prompt version).  
To see what the prompts look like, see `narval/prompt.py` or use the notebook `test/test_prompt.ipynb`.

In [20]:
# Choose the prompt version
prompt_version = ("Llama_prompt_system_v7", "Llama_prompt_user_v7")
#prompt_version = "T5_prompt_v1"  # for "google/flan-t5-base"
pipeline.prompt_version = prompt_version

# Give additional information about the PDF that may be used in the prompts 
# (or not, depending on the chosen `prompt_version`)
collectivity = "Rully"
year = "2021"
competence = "assainissement collectif"

Select a subset of the question list for faster execution (only for this tutorial).  
Note : for indicators that have already been extracted from tables, the SLM will not be asked any questions.

In [21]:
trunc_segmentation_df = segmentation_df  #.loc[0:14]
trunc_segmentation_df

Unnamed: 0,indicator,question,keyword_regex,relevant_pages,table_relevant_pages
0,D201.0,Quel le nombre d'habitants desservis par le réseau d'assainissement collectif (D201.0),\bhabitants?\b,"[2, 3, 7, 10, 16, 25, 35, 42]",
1,D201.0,Quelle est la valeur de l'indicateur D201.0,\bD201.0s?\b,"[3, 7, 42]","[[42, 0], [42, 1]]"
2,D202.0,Quel est le nombre d'autorisations de déversement d'effluents d'établissements industriels (D202.0),\bautorisations?\b,"[11, 17, 42]",
3,D202.0,Quelle est la valeur de l'indicateur D202.0,\bD202.0s?\b,"[3, 11, 17]","[[42, 0], [42, 1]]"
4,D203.0,"Quelle est la quantité de boues évacuées (D203.0), et non pas la quantité de boues produites,",\bboues?\b,"[2, 15, 21, 34, 42]",
5,D203.0,Quelle est la valeur de l'indicateur D203.0,\bD203.0s?\b,"[15, 21, 42]","[[42, 0], [42, 1]]"
6,D204.0,Quel est le prix au m3 du service d'assainissement de l'eau pour une consommation de 120m3 (D204.0),\b120\s*m\s*3s?\b,"[27, 41]",
7,D204.0,Quelle est la valeur de l'indicateur D204.0,\bD204.0s?\b,"[27, 41]","[[41, 0]]"
8,P201.1,Quel est le taux de desserte du réseau d'assainissement (P201.1),\bdessertes?\b,"[24, 41]",
9,P201.1,Quelle est la valeur de l'indicateur P201.1,\bP201.1s?\b,"[24, 41]","[[41, 0]]"


Now ask the questions. We call the method `ask_questions` of the class `Pipeline` as a whole rather than decomposing it, to avoid copy-pasting a few dozen lines of code. It works as follows: for each question in `segmentation_df`, and for each PDF page identified as relevant for that question, the SLM is asked to answer the question. Hence each question in `segmentation_df` yields a list of SLM answers, one per relevant page. Note that the prompt given to the SLM consists of the page content, followed by specific instructions (which depend on the indicator to be found), followed by the question.
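
The loop described above can be sketched as follows. This is a schematic illustration, not the body of `Pipeline.ask_questions`: the function name, the dictionary keys (`indicator`, `question`, `instruction`, `relevant_pages`) and the `generate` callable are all hypothetical, and a trivial stub stands in for the language model.

```python
def ask_all_questions(pdf_pages, questions, generate, year, known_indicators=()):
    """Ask each question on each of its relevant pages; return answers per question."""
    answers = {}
    for q in questions:
        if q["indicator"] in known_indicators:
            continue  # already extracted from a summary table
        answers[q["question"]] = []
        for page_idx in q["relevant_pages"]:
            # Prompt = page content + indicator-specific instructions + question
            prompt = (
                f"{pdf_pages[page_idx]}\n"
                f"{q['instruction']}\n"
                f"Question : {q['question']} en {year} ?"
            )
            answers[q["question"]].append(generate(prompt))
    return answers


def stub_model(prompt: str) -> str:
    """Stand-in for the SLM: 'finds' the value only if it appears in the prompt."""
    return "756" if "756" in prompt else "je ne trouve pas"


pages = ["Page de garde", "Synthèse du rapport", "D201.0 : 756 habitants desservis"]
questions = [
    {"indicator": "D201.0",
     "question": "Quelle est la valeur de l'indicateur D201.0",
     "instruction": "Réponds par un nombre.",
     "relevant_pages": [1, 2]},
    {"indicator": "D204.0",
     "question": "Quelle est la valeur de l'indicateur D204.0",
     "instruction": "Réponds par un nombre.",
     "relevant_pages": [2]},
]

answers = ask_all_questions(pages, questions, stub_model, "2021",
                            known_indicators={"D204.0"})
print(answers)
```

Each question ends up with one answer per relevant page, in the same order as `relevant_pages`, which is what makes the per-page bookkeeping in the next section possible.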

In [22]:
# Identify indicators that have already been extracted from summary tables
known_indicator_list = pipeline.get_known_indicator_list(indicator_value_dict)
default_question_answer_dict = pipeline.get_default_question_answer_dict(
    segmentation_df, known_indicator_list
)

# Ask questions to the SLM for other indicators
llm_question_answer_dict = pipeline.ask_questions(
    pdf_pages=pdf_pages, 
    segmentation_df=trunc_segmentation_df, 
    known_indicator_list=known_indicator_list,  # From summary tables
    competence=competence, 
    year=year, 
    collectivity=collectivity, 
    max_new_tokens=10
)

# Format the output question_answer_dict
question_answer_dict = merge_question_answer_dicts(
    llm_question_answer_dict, default_question_answer_dict
)

 67%|██████▋   | 6/9 [00:42<00:20,  6.95s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 9/9 [01:12<00:00,  8.03s/it]


### Get the detailed answer dataframe

Here we could directly use the method `pipeline.clean_answers_from_dict()`, but for pedagogical reasons we decompose it into two steps and explicitly call the ``AnswerManager`` class.

In [23]:
# Initialize the Answer Manager
answer_manager = AnswerManager(data_dir+"/data/input/"+indicator_file)
# Build the detailed answer dataframe 
# by adding a column "answer_list" to the "segmentation_df" using "question_answer_dict"
answer_manager.build_detailed_answer_df(
    segmentation_df, question_answer_dict, indicator_value_dict
)
# Show the detailed answer dataframe
detailed_answer_df = answer_manager.detailed_answer_df
detailed_answer_df

Unnamed: 0,indicator,question,keyword_regex,relevant_pages,table_relevant_pages,answer_list_from_language_model,answer_list_from_tables,answer_list
0,D201.0,Quel le nombre d'habitants desservis par le réseau d'assainissement collectif (D201.0),\bhabitants?\b,"[2, 3, 7, 10, 16, 25, 35, 42]",,,,[]
1,D201.0,Quelle est la valeur de l'indicateur D201.0,\bD201.0s?\b,"[3, 7, 42]","[[42, 0], [42, 1]]",,"[NC, 756]","[(NC, source: table extraction), (756, source: table extraction)]"
2,D202.0,Quel est le nombre d'autorisations de déversement d'effluents d'établissements industriels (D202.0),\bautorisations?\b,"[11, 17, 42]",,,,[]
3,D202.0,Quelle est la valeur de l'indicateur D202.0,\bD202.0s?\b,"[3, 11, 17]","[[42, 0], [42, 1]]",,"[0, 0]","[(0, source: table extraction), (0, source: table extraction)]"
4,D203.0,"Quelle est la quantité de boues évacuées (D203.0), et non pas la quantité de boues produites,",\bboues?\b,"[2, 15, 21, 34, 42]",,,,[]
5,D203.0,Quelle est la valeur de l'indicateur D203.0,\bD203.0s?\b,"[15, 21, 42]","[[42, 0], [42, 1]]",,"[0,00 tMS, 0,00 tMS]","[(0,00 tMS, source: table extraction), (0,00 tMS, source: table extraction)]"
6,D204.0,Quel est le prix au m3 du service d'assainissement de l'eau pour une consommation de 120m3 (D204.0),\b120\s*m\s*3s?\b,"[27, 41]",,,,[]
7,D204.0,Quelle est la valeur de l'indicateur D204.0,\bD204.0s?\b,"[27, 41]","[[41, 0]]",,"[4,11 €/m³]","[(4,11 €/m³, source: table extraction)]"
8,P201.1,Quel est le taux de desserte du réseau d'assainissement (P201.1),\bdessertes?\b,"[24, 41]",,,,[]
9,P201.1,Quelle est la valeur de l'indicateur P201.1,\bP201.1s?\b,"[24, 41]","[[41, 0]]",,"[100,00%]","[(100,00%, source: table extraction)]"


For each indicator, the values extracted from tables (if any) are written on the row whose `keyword_regex` coincides with the indicator code. The $p$-th element of `answer_list_from_tables` is the value extracted from the table `pdf_tables[i][j]`, where `[i, j]` is the element `table_relevant_pages[p]`.  
For each question, `answer_list_from_language_model` is `None` if the indicator value has been extracted from tables. Otherwise, the $p$-th element of `answer_list_from_language_model` is the SLM answer obtained on the PDF page with page number `relevant_pages[p]`. Answers are very poor with the `google/flan-t5-base` model.   


### Clean answers and select only one answer per indicator

In [24]:
# Apply the cleaning pipeline
# If textpages is given, hallucinations are removed. If textpages is None, they are not
answer_manager.apply_full_cleaning_pipeline(textpages=pdf_pages, forbidden_number_list=[float(year)])
# Get the answer dataframe after cleaning
answer_df = answer_manager.answer_df
answer_df

Unnamed: 0,indicator,final_answer,final_answer_source,filtered_answer_list,clean_answer_list,concat_answer_list
0,D201.0,756.0,table extraction,"[(756.0, source: table extraction)]","[(756.0, source: table extraction)]","[(NC, source: table extraction), (756, source: table extraction)]"
1,D202.0,0.0,table extraction,"[(0.0, source: table extraction), (0.0, source: table extraction)]","[(0.0, source: table extraction), (0.0, source: table extraction)]","[(0, source: table extraction), (0, source: table extraction)]"
2,D203.0,0.0,table extraction,"[(0.0, source: table extraction), (0.0, source: table extraction)]","[(0.0, source: table extraction), (0.0, source: table extraction)]","[(0,00 tMS, source: table extraction), (0,00 tMS, source: table extraction)]"
3,D204.0,4.11,table extraction,"[(4.11, source: table extraction)]","[(4.11, source: table extraction)]","[(4,11 €/m³, source: table extraction)]"
4,P201.1,100.0,table extraction,"[(100.0, source: table extraction)]","[(100.0, source: table extraction)]","[(100,00%, source: table extraction)]"
5,P202.2B,77.0,table extraction,"[(77.0, source: table extraction)]","[(77.0, source: table extraction)]","[(77 / 120, source: table extraction)]"
6,P203.3,je ne trouve pas,language model,"[(je ne trouve pas, source: language model)]","[(je ne trouve pas, source: language model)]","[(Je ne trouve pas., source: language model)]"
7,P204.3,je ne trouve pas,language model,"[(je ne trouve pas, source: language model)]","[(je ne trouve pas, source: language model)]","[(Je ne trouve pas., source: language model)]"
8,P205.3,je ne trouve pas,language model,"[(je ne trouve pas, source: language model)]","[(je ne trouve pas, source: language model)]","[(Je ne trouve pas., source: language model)]"
9,P206.3,je ne trouve pas,language model,"[(je ne trouve pas, source: language model)]","[(je ne trouve pas, source: language model)]","[(Je ne trouve pas., source: language model), (Je ne trouve pas., source: language model), (Je ne trouve pas., source: language model), (Je ne trouve pas., source: language model), (Je ne trouve pas., source: language model), (Je ne trouve pas., source: language model)]"


- The column `concat_answer_list` is built from the column `answer_list` of `detailed_answer_df` by concatenating all answers for a given indicator
- The column `clean_answer_list` is obtained from the column `concat_answer_list` by regex cleaning and removal of hallucinations  
- The column `filtered_answer_list` is obtained from the column `clean_answer_list` by removing answers that fall outside the critical min/max bounds of each indicator (see `indicator_df` above)
- The column `final_answer` is obtained from the column `filtered_answer_list` by selecting a single answer: the most frequent value is kept. If two (or more) values appear the same number of times in `filtered_answer_list`, the values that fall outside the warning min/max bounds of the indicator are excluded (unless they are all outside). If several values still remain, the final answer is chosen randomly.
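
The filtering and selection rules listed above can be sketched as follows. This is a simplified reading of the description, not the `AnswerManager` implementation: the function name and signature are hypothetical, only numeric answers are handled (text answers such as "je ne trouve pas" are ignored here), and a seeded random generator is used so that the tie-break is reproducible. The bounds in the usage lines are those of P253.2 in `indicator_df`.

```python
import random
from collections import Counter


def select_final_answer(values, critic_bounds, warning_bounds, rng=None):
    """Pick one value: critical-bound filter, majority vote, warning-bound tie-break."""
    rng = rng or random.Random(0)
    critic_min, critic_max = critic_bounds
    # 1. Drop values outside the critical bounds
    filtered = [v for v in values if critic_min <= v <= critic_max]
    if not filtered:
        return None
    # 2. Keep the most frequent value(s)
    counts = Counter(filtered)
    top_count = max(counts.values())
    candidates = [v for v, n in counts.items() if n == top_count]
    # 3. On a tie, prefer values inside the warning bounds (unless none is inside)
    if len(candidates) > 1:
        warning_min, warning_max = warning_bounds
        inside = [v for v in candidates if warning_min <= v <= warning_max]
        if inside:
            candidates = inside
    # 4. If several candidates remain, choose randomly
    return rng.choice(candidates)


majority = select_final_answer([0.0, 0.0, 2.0, 150.0], (0.0, 100.0), (0.0, 3.0))
tie_break = select_final_answer([2.0, 50.0], (0.0, 100.0), (0.0, 3.0))
nothing_left = select_final_answer([150.0], (0.0, 100.0), (0.0, 3.0))
print(majority, tie_break, nothing_left)
```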

### Save answers

Answers are saved in `data/output/tutorial/answers`

In [25]:
pipeline.save_answers(answer_df, detailed_answer_df, pdf_file, competence, year)
# Here pdf_file, competence, year are merged to the answer dataframes 

### Compare with the true indicator values 

In [26]:
# Instantiate the Metrics Calculator
metrics_calc = MetricsCalculator()
# Write an answer file containing the model answers together with the true values 
answer_file = pdf_file.split(".")[0] + "_answers.csv"
metrics_calc.write_answers_vs_true_file(answer_file, benchmark_version)

A new answer file containing also the true indicator values (taken from `data/input/sispea_vs_pdf_indic_values`) is saved to `data/output/tutorial/answers`.  


In [27]:
answer_vs_true_file = pdf_file.split(".")[0] + "_answers_vs_true.csv"
answer_vs_true_df = fs.read_csv_to_df(data_dir+"/data/output/"+benchmark_version+"/answers/"+answer_vs_true_file,
                                      usecols=["indicator", "true_pdf_value", "final_answer"])
answer_vs_true_df

Unnamed: 0,indicator,true_pdf_value,final_answer
0,D201.0,756.0,756.0
1,D202.0,0.0,0.0
2,D203.0,0.0,0.0
3,D204.0,4.11,4.11
4,P201.1,100.0,100.0
5,P202.2B,77.0,77.0
6,P203.3,,je ne trouve pas
7,P204.3,,je ne trouve pas
8,P205.3,,je ne trouve pas
9,P206.3,,je ne trouve pas
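
As a rough way to summarise such a comparison table, the sketch below counts the rows where the model answer agrees with the true value. It is not the metric computed by `MetricsCalculator`: the function name is hypothetical, and counting a non-numeric answer as correct when the true value is missing is an assumption made for this example. The toy dataframe reuses three rows of the table above.

```python
import pandas as pd


def count_correct(df: pd.DataFrame, tol: float = 1e-9) -> int:
    """Count rows where answer and truth are equal, or both are missing/non-numeric."""
    predicted = pd.to_numeric(df["final_answer"], errors="coerce")
    truth = pd.to_numeric(df["true_pdf_value"], errors="coerce")
    both_missing = predicted.isna() & truth.isna()
    both_equal = (predicted - truth).abs() <= tol  # False whenever one side is NaN
    return int((both_missing | both_equal).sum())


toy_df = pd.DataFrame({
    "indicator": ["D201.0", "D204.0", "P203.3"],
    "true_pdf_value": [756.0, 4.11, None],
    "final_answer": ["756.0", "4.11", "je ne trouve pas"],
})

n_correct = count_correct(toy_df)
print(f"{n_correct} / {len(toy_df)} answers agree with the true values")
```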
