# Text Curation
As the second step of the pipeline, this notebook takes the text extracted from PDFs, together with the annotation files, and builds the curated training set for the language model. The extracted text is read from the `ROOT/data/extraction` directory, and the output CSV dataset is written to the `ROOT/data/curation` directory. The Curator class finds positive labels in the annotated files and creates negative examples from the extracted text. The resulting dataset is used in the next step of the pipeline, training the model.

In [1]:
# Author: ALLIANZ NLP esg data pipeline
import os
import pathlib
from dotenv import load_dotenv


import config
from src.components.preprocessing import Curator
from src.data.s3_communication import S3Communication

In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)
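
The S3 connector in this notebook reads four environment variables (`S3_ENDPOINT`, `S3_ACCESS_KEY`, `S3_SECRET_KEY`, `S3_BUCKET`). If `credentials.env` is absent, the cell above skips loading silently, so a quick check can surface missing settings early. The following is a minimal sketch; the helper `missing_env_vars` is hypothetical and not part of the pipeline.

```python
import os

# Variable names taken from the S3 connector cell in this notebook.
REQUIRED_VARS = ("S3_ENDPOINT", "S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_BUCKET")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing S3 settings:", ", ".join(missing))
```

Passing a plain dictionary as `environ` makes the check easy to try without touching the real environment.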

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("S3_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("S3_SECRET_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

In [4]:
# download the files created by the extraction phase
s3c.download_files_in_prefix_to_dir(
    config.BASE_EXTRACTION_S3_PREFIX,
    config.BASE_EXTRACTION_FOLDER,
)

In [5]:
# download the annotation files
s3c.download_files_in_prefix_to_dir(
    config.BASE_ANNOTATION_S3_PREFIX,
    config.BASE_ANNOTATION_FOLDER,
)

In [2]:
# XLS = os.path.join(config.BASE_ANNOTATION_FOLDER, "ESG")
# EXT_FOLDER = config.BASE_EXTRACTION_FOLDER
# CUR_FOLDER = config.BASE_CURATION_FOLDER

### Call the text curator

In [6]:
SEED = 42
TextCurator_kwargs = {
    "retrieve_paragraph": False,
    "neg_pos_ratio": 1,
    "columns_to_read": [
        "company",
        "source_file",
        "source_page",
        "kpi_id",
        "year",
        "answer",
        "data_type",
        "relevant_paragraphs",
    ],
    "company_to_exclude": [],
    "create_neg_samples": True,
    "seed": SEED,
}
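
To make the roles of `neg_pos_ratio` and `seed` concrete, here is a minimal sketch of ratio-based negative sampling. It is not the `TextCurator` implementation, whose internals are not shown in this notebook; the function `sample_negatives` and the toy paragraph names are assumptions made for illustration only.

```python
import random


def sample_negatives(candidates, positives, neg_pos_ratio=1, seed=42):
    """Draw up to neg_pos_ratio * len(positives) candidates that are not positives.

    Illustrative only: the real curator may select negatives differently.
    """
    rng = random.Random(seed)  # local RNG, so the same seed gives the same draw
    positive_set = set(positives)
    pool = [text for text in candidates if text not in positive_set]
    n_wanted = min(len(pool), neg_pos_ratio * len(positives))
    return rng.sample(pool, n_wanted)


# Toy data: five extracted paragraphs, two of which are annotated as relevant.
paragraphs = ["p1", "p2", "p3", "p4", "p5"]
relevant = ["p2", "p4"]
negatives = sample_negatives(paragraphs, relevant, neg_pos_ratio=1, seed=42)
print(len(relevant), len(negatives))
```

With a ratio of 1, the sketch returns as many negatives as there are positives, provided enough non-positive paragraphs exist, and a fixed seed keeps the selection reproducible across runs.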

In [7]:
cur = Curator([("TextCurator", TextCurator_kwargs)])
cur.run(
    config.BASE_EXTRACTION_FOLDER,
    config.BASE_ANNOTATION_FOLDER,
    config.BASE_CURATION_FOLDER,
)

Could not process row number 15 in 20201030 1Qbit aggregated_annotations_needs_correction.xlsx
Could not process row number 16 in 20201030 1Qbit aggregated_annotations_needs_correction.xlsx
Could not process row number 17 in 20201030 1Qbit aggregated_annotations_needs_correction.xlsx
Could not process row number 18 in 20201030 1Qbit aggregated_annotations_needs_correction.xlsx
Could not process row number 37 in 20201030 1Qbit aggregated_annotations_needs_correction.xlsx
Could not process row number 165 in 20201030 1Qbit aggregated_annotations_needs_correction.xlsx
Could not process row number 166 in 20201030 1Qbit aggregated_annotations_needs_correction.xlsx
Could not process row number 167 in 20201030 1Qbit aggregated_annotations_needs_correction.xlsx
Could not process row number 179 in 20201030 1Qbit aggregated_annotations_needs_correction.xlsx
Could not process row number 180 in 20201030 1Qbit aggregated_annotations_needs_correction.xlsx
Could not process row number 309 in 20201030 

In [10]:
# upload the curation file to s3
ret = s3c.upload_file_to_s3(
    config.BASE_CURATION_FOLDER / "esg_TEXT_dataset.csv",
    config.BASE_CURATION_S3_PREFIX,
    "esg_TEXT_dataset.csv",
)
ret['ResponseMetadata']['HTTPStatusCode']

200
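
The cell above inspects the status code by eye. In an unattended run it may be preferable to fail loudly instead. The sketch below assumes that `upload_file_to_s3` returns a boto3-style response dictionary, as the indexing in the cell suggests; `upload_succeeded` is a hypothetical helper, and `example_ret` stands in for the real `ret`.

```python
def upload_succeeded(response):
    """Return True if a boto3-style response dict reports HTTP 200.

    Hypothetical helper; assumes the 'ResponseMetadata' layout used above.
    """
    if not isinstance(response, dict):
        return False
    return response.get("ResponseMetadata", {}).get("HTTPStatusCode") == 200


# Stand-in for the `ret` value returned by the upload cell.
example_ret = {"ResponseMetadata": {"HTTPStatusCode": 200}}
if not upload_succeeded(example_ret):
    raise RuntimeError("Upload of esg_TEXT_dataset.csv did not return HTTP 200")
```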

### Conclusion
We called the Curator class to combine the extracted text with the annotated files, stored the output in the `ROOT/data/curation` folder, and uploaded the resulting CSV to S3.