# Notebook to run the partial pipeline extracting indicators from tables

In this notebook, we run the partial pipeline that extracts indicators only from *summary tables*, i.e. tables that have one column containing indicator codes and one column whose name contains the exercise year (but not the previous year, year-1). Indicators that do not appear in any summary table are recorded as ``je ne trouve pas`` ("I cannot find it").
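
The definition above can be sketched as a small predicate. This only illustrates the rule as stated and is not the code used by `narval`: the helper names, the indicator-code pattern (codes such as `P201.1`), the 50 % threshold, the sample values, and the representation of a table as a dict of columns (the shape returned by `DataFrame.to_dict("list")`) are all assumptions.

```python
import re

# Assumed code format (e.g. "D201.0", "P202.2B"); the real pattern may differ.
INDICATOR_CODE = re.compile(r"^[DP]\d{3}\.\d[A-Z]?$", re.IGNORECASE)


def find_code_column(table, min_ratio=0.5):
    """Return the first column where at least `min_ratio` of the
    non-empty cells look like indicator codes, or None."""
    for name, cells in table.items():
        values = [str(c).strip() for c in cells if c is not None and str(c).strip()]
        if not values:
            continue
        hits = sum(1 for v in values if INDICATOR_CODE.match(v))
        if hits / len(values) >= min_ratio:
            return name
    return None


def find_year_column(table, year):
    """Return the first column whose name contains `year` but not `year - 1`."""
    for name in table:
        label = str(name)
        if str(year) in label and str(year - 1) not in label:
            return name
    return None


def is_summary_table(table, year):
    """A table qualifies only if it has both a code column and a year column."""
    return (
        find_code_column(table) is not None
        and find_year_column(table, year) is not None
    )


# Made-up table used only to exercise the helpers.
table = {
    "Code": ["D201.0", "P201.1", "P202.2B"],
    "Exercice 2021": ["1200", "95", "30"],
    "Exercice 2022": ["1250", "96", "30"],
}
print(is_summary_table(table, 2022))
print(find_year_column(table, 2022))
```

Under this reading of the rule, a column named `Exercice 2021/2022` would be rejected for the year 2022, because its name also mentions the previous year.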


In [1]:
import sys

sys.path.append("../")  # Add the repository root (the directory that contains .git) to the import path

%load_ext autoreload
%autoreload 2 

from codecarbon import EmissionsTracker
from time import time
from narval.pipeline import Pipeline
from narval.utils import FileSystem, get_data_dir



### Define the input parameters

In [2]:
# Name of the subfolder where answers will be saved (in `data/output/benchmark_table_*/answers/`)
benchmark_version = "benchmark_table_123" 
# Name of the indicator file in `data/input`
indicator_file = "indicateurs_v6.csv"
# Name of the question file in `data/input`
question_file = "question_keyword_v7.csv"
# Table extraction method
table_extraction_method = "PDFPlumber"
# Name of the file in `data/input` containing the list of PDFs to be read 
rpqs_eval_list_file = "rpqs_eval_list_1+2.csv"

In [3]:
# Instantiate the File System (local file system or S3 bucket)
fs = FileSystem()
# Get the directory containing the folder `data`
data_dir = get_data_dir()
# Import the dataframe containing the list of PDFs to be read and questioned
eval_df = fs.read_csv_to_df(data_dir + "/data/input/" + rpqs_eval_list_file, sep=";", 
                            usecols=["pdf_name", "collectivity", "year", "competence"])
# Show the first rows of this dataframe
eval_df.head()

|   | pdf_name | collectivity | year | competence |
|---|----------|--------------|------|------------|
| 0 | RPQS_Ahun_cp23150_rpqsid_674494_AC_2021 | Ahun | 2021 | assainissement collectif |
| 1 | RPQS_Amagne_cp08300_rpqsid_651153_AC_2022 | Amagne | 2022 | assainissement collectif |
| 2 | RPQS_Artaix_cp71110_rpqsid_303861_AC_2019 | Artaix | 2019 | assainissement collectif |
| 3 | RAD_Cabasse_AC_2022 | Cabasse | 2022 | assainissement collectif |
| 4 | RPQS_Cartelegue_cp33390_rpqsid_787673_AC_2023 | Cartelègue | 2023 | assainissement collectif |
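
The file names in this list appear to encode the competence and the year as their last two underscore-separated fields (for example `..._AC_2021`). Assuming that convention holds for every row, which the source does not state, a quick consistency check against the `year` column could look like the sketch below. The helper `year_from_pdf_name` is hypothetical and is not part of `narval`.

```python
import re

# Assumed naming convention: "<prefix>_..._<competence code>_<4-digit year>"
NAME_SUFFIX = re.compile(r"_(?P<code>[A-Z]+)_(?P<year>\d{4})$")


def year_from_pdf_name(pdf_name):
    """Return the year encoded at the end of `pdf_name`,
    or None if the name does not fit the assumed pattern."""
    match = NAME_SUFFIX.search(pdf_name)
    return int(match.group("year")) if match else None


# Two (pdf_name, year) pairs copied from the rows shown above.
rows = [
    ("RPQS_Ahun_cp23150_rpqsid_674494_AC_2021", 2021),
    ("RAD_Cabasse_AC_2022", 2022),
]
mismatches = [name for name, year in rows if year_from_pdf_name(name) != year]
print(mismatches)  # → []
```

With the dataframe loaded above, the same check can be written as `eval_df[eval_df["pdf_name"].map(year_from_pdf_name) != eval_df["year"]]`, which returns the rows whose file name and `year` column disagree.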


### Instantiate the pipeline

In [4]:
pipeline = Pipeline(
    question_file=question_file,
    indicator_file=indicator_file,
    table_extraction_method=table_extraction_method,
    benchmark_version=benchmark_version,
)


### Instantiate the CodeCarbon tracker

In [5]:
tracker = EmissionsTracker(
    save_to_file=False,
    log_level="error",
)

### Run the pipeline

In [6]:
# Start the CodeCarbon tracker
tracker.start()

t0 = time()
try:
    for _, row in eval_df.iterrows():
        print("\n"+"#"*20)
        pdf_file = row['pdf_name'] + ".pdf"
        year = row['year']
        competence = row['competence']

        pipeline.run_table_extraction_step(
            pdf_file=pdf_file,
            competence=competence,
            year=year
        )  
finally:
    emissions = tracker.stop()

t1 = time()
print("\n"+"#"*20)
print(f"Computation time: {round(t1 - t0, 1)} s")
# tracker.stop() returns kilograms of CO2eq; convert to grams for display
print(f"Carbon footprint: {round(emissions * 1_000, 1)} gCO2eq")



####################
Extract tables from pdf RPQS_Ahun_cp23150_rpqsid_674494_AC_2021.pdf ...
Done
Get the segmentation dataframe  ...
Done
Extract indicator values from summary tables ...
Done
Cleaning answers ...
Done
Saving answers for RPQS_Ahun_cp23150_rpqsid_674494_AC_2021.pdf ...
Done

####################
Extract tables from pdf RPQS_Amagne_cp08300_rpqsid_651153_AC_2022.pdf ...
Done
Get the segmentation dataframe  ...
Done
Extract indicator values from summary tables ...
Done
Cleaning answers ...
Done
Saving answers for RPQS_Amagne_cp08300_rpqsid_651153_AC_2022.pdf ...
Done

####################
Extract tables from pdf RPQS_Artaix_cp71110_rpqsid_303861_AC_2019.pdf ...
Done
Get the segmentation dataframe  ...
Done
Extract indicator values from summary tables ...
Done
Cleaning answers ...
Done
Saving answers for RPQS_Artaix_cp71110_rpqsid_303861_AC_2019.pdf ...
Done

####################
Extract tables from pdf RAD_Cabasse_AC_2022.pdf ...
Done
Get the segmentation dataframe  ...
Done

[... output truncated ...]

Done
Saving answers for RPQS_Trie-Chateau_cp60590_rpqsid_560253_AC_2021.pdf ...
Done

####################
Extract tables from pdf RPQS_Corbel_cp73160_rpqsid_728273_AC_2022.pdf ...
Done
Get the segmentation dataframe  ...
Done
Extract indicator values from summary tables ...
Done
Cleaning answers ...
Done
Saving answers for RPQS_Corbel_cp73160_rpqsid_728273_AC_2022.pdf ...
Done

####################
Extract tables from pdf RPQS_Cussy-les-Forges_cp89420_rpqsid_727813_AC_2021.pdf ...
Done
Get the segmentation dataframe  ...
Done
Extract indicator values from summary tables ...
Done
Cleaning answers ...
Done
Saving answers for RPQS_Cussy-les-Forges_cp89420_rpqsid_727813_AC_2021.pdf ...
Done

####################
Extract tables from pdf RPQS_Estezargues_cp30390_rpqsid_613034_AC_2021.pdf ...
Done
Get the segmentation dataframe  ...
Done
Extract indicator values from summary tables ...
Done
Cleaning answers ...
Done
Saving answers for RPQS_Estezargues_cp30390_rpqsid_613034_AC_2021.pdf ...
Done

[... output truncated ...]

Done
Cleaning answers ...
Done
Saving answers for RPQS_Reugny_cp37380_rpqsid_747173_AC_2022.pdf ...
Done

####################
Extract tables from pdf RPQS_Reyssouze_cp01190_rpqsid_732773_AC_2022.pdf ...
Done
Get the segmentation dataframe  ...
Done
Extract indicator values from summary tables ...
Done
Cleaning answers ...
Done
Saving answers for RPQS_Reyssouze_cp01190_rpqsid_732773_AC_2022.pdf ...
Done

####################
Extract tables from pdf RPQS_Saint-Laurent-les-Tours_cp46400_rpqsid_679393_AC_2021.pdf ...
Done
Get the segmentation dataframe  ...
Done
Extract indicator values from summary tables ...
Done
Cleaning answers ...
Done
Saving answers for RPQS_Saint-Laurent-les-Tours_cp46400_rpqsid_679393_AC_2021.pdf ...
Done

####################
Extract tables from pdf RPQS_Saint-Mihiel_cp55300_rpqsid_446333_AC_2020.pdf ...
Done
Get the segmentation dataframe  ...
Done
Extract indicator values from summary tables ...
Done
Cleaning answers ...
Done
Saving answers for RPQS_Saint-Mihie [... output truncated ...]
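
The return value of `tracker.stop()` is expressed in kilograms of CO2eq and is typed as optional: it can be `None` when tracking did not work, in which case the `round(emissions * 1_000, 1)` call in the cell above would raise a `TypeError`. A defensive formatting helper, given here only as a sketch (the name `format_footprint` is hypothetical), might look like this:

```python
def format_footprint(emissions_kg):
    """Format an emissions value given in kg CO2eq as grams,
    tolerating a missing value."""
    if emissions_kg is None:
        return "Carbon footprint: not available"
    return f"Carbon footprint: {round(emissions_kg * 1_000, 1)} gCO2eq"


print(format_footprint(0.002))  # → Carbon footprint: 2.0 gCO2eq
print(format_footprint(None))   # → Carbon footprint: not available
```

In the notebook, this would replace the last `print` of the cell with `print(format_footprint(emissions))`.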

In [7]:
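import json
from IPython.display import JSON
# IPython's JSON display expects a dict or list, so parse the serialized string first
JSON(json.loads(tracker.final_emissions_data.toJSON()))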



<IPython.core.display.JSON object>