# Table Curation

In this notebook, we curate the table data for model training. Specifically, we add labels to the data extracted by the [previous notebook](pdf_table_extraction.ipynb), using the manually determined labels in the annotation Excel sheets. We focus on the tables extracted from `sustainability-report-2019.pdf` and use the annotations provided in the `20201030 1Qbit aggregated_annotations_needs_correction.xlsx` sheet.

In [1]:
import os
import glob
import pathlib
import logging
import pandas as pd
from dotenv import load_dotenv

import config
from src.data.s3_communication import S3Communication
from src.components.preprocessing import TableCurator

logger = logging.getLogger()

In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("S3_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("S3_SECRET_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)
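
If any of the four environment variables read above is unset, `os.getenv` returns `None` without complaint and the problem only shows up later, when the connector is first used. The following is a minimal sketch of an early check; `missing_env_vars` is a hypothetical helper written for this illustration and is not part of the repository.

```python
import os

# Variable names taken from the S3Communication call above.
REQUIRED_VARS = ("S3_ENDPOINT", "S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_BUCKET")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing S3 settings:", ", ".join(missing))
```

Running this right after loading `credentials.env` makes a misconfigured environment visible before any S3 call is attempted.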

In [4]:
# download annotation files
s3c.download_files_in_prefix_to_dir(
    config.BASE_ANNOTATION_S3_PREFIX,
    config.BASE_ANNOTATION_FOLDER,
)

In [5]:
# initialize table curator
tb_cur = TableCurator(
    neg_pos_ratio=1, 
    create_neg_samples=True,
    columns_to_read=[
        'company', 'source_file', 'source_page', 'kpi_id', 'year', 'answer', 'data_type'
    ],
    company_to_exclude=[]
)

In [6]:
# Excel sheets containing manually labelled data
# (file names starting with "~" or "$", such as Excel lock files, are skipped)
annotation_excels = glob.glob('{}/[!~$]*.xlsx'.format(config.BASE_ANNOTATION_FOLDER))
annotation_excels

['/home/kachau/Documents/aicoe-osc-demo/data/annotations/20201030 1Qbit aggregated_annotations_needs_correction.xlsx']
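
Two details of glob patterns are easy to misread here: `[!~$]` excludes names whose first character is `~` or `$`, and square brackets around an extension form a character class rather than a literal suffix. The sketch below illustrates the difference with `fnmatch`, the standard-library module that `glob` uses for matching file names. The file names are hypothetical and chosen only for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, for illustration only.
names = ["annotations.xlsx", "~$annotations.xlsx", "notes.xls", "data.csv"]

bracketed = "[!~$]*[.xlsx]"  # last character must be one of '.', 'x', 'l', 's'
literal = "[!~$]*.xlsx"      # name must end with the literal suffix ".xlsx"

bracketed_hits = [n for n in names if fnmatchcase(n, bracketed)]
literal_hits = [n for n in names if fnmatchcase(n, literal)]

print(bracketed_hits)
print(literal_hits)
```

The bracketed form also accepts `notes.xls`, because its last character `s` belongs to the class, while the literal form keeps only `annotations.xlsx`. Both forms skip the lock file that starts with `~$`.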

In the following cell, we run the curator with warnings suppressed. Most of these warnings have the form "LSE_WG_2016.pdf was not extracted" and are raised because the corresponding PDFs are not part of `sustainability-report-2019.json`. In other words, these PDFs were not present in the initial PDF folder.

In [7]:
# store current level, set new level to error
current_loglevel = logger.level
logger.setLevel(logging.ERROR)

# run curator
tb_cur.run(config.BASE_EXTRACTION_FOLDER, annotation_excels, config.BASE_CURATION_FOLDER)

# restore level to original level
logger.setLevel(current_loglevel)
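
The save, set, and restore pattern used above can also be wrapped in a context manager, so that the original level is restored even if the curator raises an exception. This is a minimal sketch using only the standard library; `temporary_log_level` is a hypothetical helper name, not something provided by the project.

```python
import logging
from contextlib import contextmanager


@contextmanager
def temporary_log_level(logger, level):
    """Set `logger` to `level` inside the block, then restore the old level."""
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Runs even if the body raises, so the level is always restored.
        logger.setLevel(previous)


root = logging.getLogger()
with temporary_log_level(root, logging.ERROR):
    root.warning("suppressed while the level is ERROR")
```

With such a helper, the curator call would simply sit inside the `with` block in place of the warning shown here.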

The curation folder now contains a CSV file with the curated data.

In [8]:
# view curated data
df = pd.read_csv(config.BASE_CURATION_FOLDER / "esg_TABLE_dataset.csv", index_col=0)
df

Unnamed: 0,Company,Year,Question,Answer,Table_filename,Label
0,Equinor,2019,What is the total volume of hydrocarbons produ...,,sustainability-report-2019_page22_1.csv,0
1,Equinor,2019,What is the total volume of hydrocarbons produ...,1055 mmboe,sustainability-report-2019_page16_1.csv,1
2,Equinor,2018,What is the total volume of hydrocarbons produ...,,sustainability-report-2019_page30_1.csv,0
3,Equinor,2018,What is the total volume of hydrocarbons produ...,1077 mmboe,sustainability-report-2019_page16_1.csv,1
4,Equinor,2017,What is the total volume of hydrocarbons produ...,,sustainability-report-2019_page7_1.csv,0
5,Equinor,2017,What is the total volume of hydrocarbons produ...,1099 mmboe,sustainability-report-2019_page16_1.csv,1
6,Equinor,2016,What is the total volume of hydrocarbons produ...,,sustainability-report-2019_page28_1.csv,0
7,Equinor,2016,What is the total volume of hydrocarbons produ...,1030 mmboe,sustainability-report-2019_page16_1.csv,1
8,Equinor,2015,What is the total volume of hydrocarbons produ...,,sustainability-report-2019_page4_1.csv,0
9,Equinor,2015,What is the total volume of hydrocarbons produ...,1073 mmboe,sustainability-report-2019_page16_1.csv,1
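
In the rows shown above, negative samples (`Label` 0, empty `Answer`) alternate with positive samples (`Label` 1), which is what one would expect from `neg_pos_ratio=1`. The sketch below shows one way to check that balance. It uses a small hand-built frame with a few rows copied from the output above rather than the real CSV, so the numbers only illustrate the check.

```python
import pandas as pd

# A few rows copied from the output above (question text omitted).
sample = pd.DataFrame(
    {
        "Year": [2019, 2019, 2018, 2018],
        "Answer": [None, "1055 mmboe", None, "1077 mmboe"],
        "Table_filename": [
            "sustainability-report-2019_page22_1.csv",
            "sustainability-report-2019_page16_1.csv",
            "sustainability-report-2019_page30_1.csv",
            "sustainability-report-2019_page16_1.csv",
        ],
        "Label": [0, 1, 0, 1],
    }
)

n_neg = int((sample["Label"] == 0).sum())
n_pos = int((sample["Label"] == 1).sum())
ratio = n_neg / n_pos if n_pos else float("nan")

# Positive rows should carry an answer.
positives_have_answers = bool(sample.loc[sample["Label"] == 1, "Answer"].notna().all())

print(n_neg, n_pos, ratio, positives_have_answers)
```

Applied to the full `df`, the same three lines give the actual negative-to-positive ratio of the curated dataset.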


In [10]:
# upload the curation file to s3
ret = s3c.upload_file_to_s3(
    config.BASE_CURATION_FOLDER / "esg_TABLE_dataset.csv",
    config.BASE_CURATION_S3_PREFIX,
    "esg_TABLE_dataset.csv",
)
ret['ResponseMetadata']['HTTPStatusCode']

200
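
The value `200` above is the HTTP status code reported in the upload response. As a hedged sketch, the check can be made explicit with a small helper. `upload_succeeded` is a hypothetical name, and the dictionaries below are hand-written stand-ins that mimic only the `ResponseMetadata` structure accessed in the cell above.

```python
def upload_succeeded(response):
    """Return True if a response dict reports HTTP 200 (hypothetical helper)."""
    if not isinstance(response, dict):
        return False
    metadata = response.get("ResponseMetadata", {})
    return metadata.get("HTTPStatusCode") == 200


# Hand-written stand-ins for the object returned by the upload call.
ok_response = {"ResponseMetadata": {"HTTPStatusCode": 200}}
failed_response = {"ResponseMetadata": {"HTTPStatusCode": 403}}

print(upload_succeeded(ok_response))
print(upload_succeeded(failed_response))
```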

# Conclusion

In this notebook, we curated the raw table data that was extracted from the PDFs. This data is now ready for training or fine-tuning our models.