# Data Preparation

The following code curates the data and writes it out in a SQuAD-like format.

In [1]:
import os
import pathlib
from dotenv import load_dotenv

import config
from src.data.s3_communication import S3Communication, S3FileType
from src.components.preprocessing.kpi_inference_curator import TextKPIInferenceCurator

In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)
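
The lookup order used in the cell above can be sketched in isolation. This is only an illustration: `resolve_dotenv_path` is a hypothetical helper name, and it takes a plain mapping instead of `os.environ` so the precedence is easy to see.

```python
import pathlib


def resolve_dotenv_path(environ, default_dir="/opt/app-root/src"):
    """Return the credentials.env path using the same precedence as the cell above.

    `environ` is any mapping of environment variables (os.environ in practice).
    """
    # CREDENTIAL_DOTENV_DIR wins; otherwise fall back to PWD, then the default.
    dotenv_dir = environ.get("CREDENTIAL_DOTENV_DIR", environ.get("PWD", default_dir))
    return pathlib.Path(dotenv_dir) / "credentials.env"


# An explicit override takes priority over PWD.
print(resolve_dotenv_path({"CREDENTIAL_DOTENV_DIR": "/secrets", "PWD": "/home/user"}))
# Without the override, PWD is used.
print(resolve_dotenv_path({"PWD": "/home/user"}))
# With neither variable set, the default directory is used.
print(resolve_dotenv_path({}))
```

If the resolved file does not exist, the notebook skips loading it and relies on variables that are already present in the environment.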

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

## Format Extracted Text in SQuAD Format

In [4]:
kpi_df = s3c.download_df_from_s3(
    "aicoe-osc-demo/kpi_mapping",
    "kpi_mapping.csv",
    filetype=S3FileType.CSV,
    header=0,
)
kpi_df.head()

| Unnamed: 0 | kpi_id | question | sectors | add_year | kpi_category | Unnamed: 5 | Unnamed: 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | What is the company name? | OG, CM, CU | False | TEXT | | |
| 1 | 1.0 | In which year was the annual report or the sus... | OG, CM, CU | False | TEXT | | |
| 2 | 2.0 | What is the total volume of proven and probabl... | OG | True | TEXT, TABLE | | |
| 3 | 2.1 | What is the volume of estimated proven hydroca... | OG | True | TEXT, TABLE | | |
| 4 | 2.2 | What is the volume of estimated probable hydro... | OG | True | TEXT, TABLE | | |


In [5]:
config.TextKPIInferenceCurator_kwargs

{'annotation_folder': PosixPath('/opt/app-root/src/aicoe-osc-demo/data/annotations'),
 'agg_annotation': PosixPath('/opt/app-root/src/aicoe-osc-demo/data/20201030 1Qbit aggregated_annotations_needs_correction.xlsx'),
 'extracted_text_json_folder': PosixPath('/opt/app-root/src/aicoe-osc-demo/data/extraction'),
 'output_squad_folder': PosixPath('/opt/app-root/src/aicoe-osc-demo/data/squad'),
 'relevant_text_path': PosixPath('/opt/app-root/src/aicoe-osc-demo/data/infer_relevance/*.csv')}

In [6]:
tkpi = TextKPIInferenceCurator(
    **config.TextKPIInferenceCurator_kwargs,
    kpi_df=kpi_df,
    columns_to_read=config.TRAIN_KPI_INFERENCE_COLUMNS_TO_READ,
)
train_squad, val_squad = tkpi.curate(**config.CurateConfig().__dict__)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_single.loc[:,'source_page'] = df_single['source_page'].apply(lambda x: x[0])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_single.loc[:,'relevant_paragraphs'] = df['relevant_paragraphs'].apply(lambda x: x[0])


Now, we have the data in SQuAD format, which is ready for training.
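
For readers unfamiliar with the layout, here is a minimal hand-written record in the standard SQuAD v2 structure, together with a small consistency check on the answer offsets. The values are illustrative, `check_answer_offsets` is a hypothetical helper, and the curator's output may carry additional fields that are not shown here.

```python
import json

# A hand-written record in SQuAD v2-style layout (illustrative values only).
context = "The company aims to reduce greenhouse gas emissions by 90% until 2021."
example = {
    "version": "v2.0",
    "data": [
        {
            "title": "example_report",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the target year for climate commitment?",
                            "is_impossible": False,
                            "answers": [
                                {"text": "2021", "answer_start": context.index("2021")}
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def check_answer_offsets(squad):
    """Return the ids of questions whose answer text does not match its offset."""
    bad_ids = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if paragraph["context"][start:end] != answer["text"]:
                        bad_ids.append(qa["id"])
    return bad_ids


print(json.dumps(example["data"][0]["paragraphs"][0]["qas"][0], indent=2))
print("Misaligned answers:", check_answer_offsets(example))
```

A check like this is worth running on curated data because span-based QA training depends on `answer_start` pointing at the exact answer text inside the context.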

In [7]:
# see that the data has been curated and placed in the output folder
output_dir = str(config.TextKPIInferenceCurator_kwargs['output_squad_folder'])
!ls $output_dir

kpi_train.json	      kpi_val_split.json
kpi_train_split.json  reference_kpi_01-06-2022.csv


# KPI extraction

The following shows how to train a KPI extraction model and use it for inference. <br> <br>
The KPI extraction model is a question-answering (QA) system. SQuAD is a public dataset for the question-answering task. This notebook assumes that the ESG data has already been curated in a SQuAD-like format.

Our pipeline includes components provided by the FARM library. FARM is a framework that facilitates transfer learning for BERT-based models. Documentation for FARM is available at https://farm.deepset.ai.

In [8]:
## General imports
import pprint
import os

### Step 1: Set the configurable parameters. 
Before training starts, the parameters for each component of the training pipeline must be set. For this we create `config` objects that hold these parameters. Default values are already set, but they can easily be changed.

In [9]:
from src.models.qa_farm_trainer import QAFARMTrainer
from config_qa_farm_train import (
    QAFileConfig,
    QATokenizerConfig,
    QAProcessorConfig,
    QAModelConfig,
    QATrainingConfig,
    QAMLFlowConfig,
)

06/01/2022 22:50:13 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [10]:
file_config = QAFileConfig("demo_train_kpi_infer")  # Settings for data files and checkpoints
processor_config = QAProcessorConfig("demo_train_kpi_infer")  # Settings for the processor component
tokenizer_config = QATokenizerConfig("demo_train_kpi_infer")  # Settings for the tokenizer
model_config = QAModelConfig("demo_train_kpi_infer")  # Settings for the model
train_config = QATrainingConfig("demo_train_kpi_infer")  # Settings for training
mlflow_config = QAMLFlowConfig("demo_train_kpi_infer")  # Settings for MLflow tracking

Parameters can be changed as follows:

In [11]:
file_config.experiment_name = "demo_train_kpi_infer"

However, we advise that you manually update the parameters in the corresponding config file:

`model_pipeline/model_pipeline/config_qa_farm_train.py`
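
The general pattern of a config object with defaults that can be overridden per attribute can be sketched with a dataclass. `DemoTrainingConfig` is a hypothetical stand-in written for illustration; it is not the project's `QATrainingConfig`, and its defaults are simply copied from the values this notebook prints.

```python
from dataclasses import dataclass


@dataclass
class DemoTrainingConfig:
    """Hypothetical stand-in for a config object; not the project's class."""

    experiment_name: str
    learning_rate: float = 2e-5
    n_epochs: int = 1
    batch_size: int = 4


demo_config = DemoTrainingConfig("demo_train_kpi_infer")
demo_config.n_epochs = 3  # override a default after construction
print(demo_config)
```

Overriding attributes in the notebook is convenient for experiments, while editing the config file keeps the settings reproducible across runs.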

We should change the training data file to the one we just created.

In [12]:
curated_data_path = os.path.join(output_dir, "kpi_train.json")  # curated file listed in the output folder above
file_config.update_paths(output_dir, curated_data_path)
file_config.perform_splitting = True

We can check the value for some parameters:

In [13]:
print(f"Experiment_name: \n {file_config.experiment_name} \n")
print(f"Data directory: \n {file_config.data_dir} \n")
print(f"Curated dataset path: \n {file_config.curated_data} \n")
print(f"Split train/validation ratio: \n{file_config.dev_split} \n")
print(f"Training dataset path: \n {file_config.train_filename} \n")
print(f"Validation dataset path: \n {file_config.dev_filename} \n")
print(f"Directory where trained model is saved: \n {file_config.saved_models_dir} \n")

Experiment_name: 
 demo_train_kpi_infer 

Data directory: 
 /opt/app-root/src/aicoe-osc-demo/data 

Curated dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_train.json 

Split train/validation ratio: 
0.2 

Training dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_train_split.json 

Validation dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_val_split.json 

Directory where trained model is saved: 
 /opt/app-root/src/aicoe-osc-demo/models/KPI_EXTRACTION 
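
To make the split ratio concrete, here is a minimal sketch of holding out a fraction of paragraphs for validation. `split_paragraphs` is a hypothetical function; the project's own splitting logic (enabled by `perform_splitting`) may shuffle or group the data differently.

```python
import random


def split_paragraphs(paragraphs, dev_split=0.2, seed=42):
    """Shuffle paragraphs and hold out a `dev_split` fraction for validation."""
    shuffled = list(paragraphs)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(round(len(shuffled) * dev_split))
    # The first n_dev shuffled items become the validation set.
    return shuffled[n_dev:], shuffled[:n_dev]


paragraphs = [f"paragraph_{i}" for i in range(10)]
train, dev = split_paragraphs(paragraphs, dev_split=0.2)
print(len(train), len(dev))
```

With a ratio of 0.2, ten paragraphs yield eight for training and two for validation. Fixing the seed keeps the split stable between runs.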



In [14]:
print(f"Max number of tokens per example: {processor_config.max_seq_len} \n")

Max number of tokens per example: 384 
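
Passages longer than the token limit cannot be fed to the model in one piece. BERT-style QA pipelines commonly handle this by cutting the passage into overlapping windows; the sketch below shows that idea on a toy token list. It is an assumption that the processor used here works this way, and `sliding_windows`, `max_len`, and `stride` are illustrative names rather than FARM API.

```python
def sliding_windows(tokens, max_len, stride):
    """Split `tokens` into overlapping windows of at most `max_len` tokens."""
    if stride <= 0 or stride > max_len:
        raise ValueError("stride must be in the range 1..max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the passage.
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows


tokens = [f"t{i}" for i in range(10)]
for window in sliding_windows(tokens, max_len=4, stride=2):
    print(window)
```

The overlap between neighbouring windows means that an answer cut off at the edge of one window can still appear whole in the next. In a real pipeline the question tokens and special tokens also count against the limit.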



In [15]:
print(f"Use GPU: {train_config.use_cuda} \n")

Use GPU: True 



In [16]:
print(f"Learning_rate: {train_config.learning_rate} \n")
print(f"Number of epochs for fine tuning: {train_config.n_epochs} \n")
print(f"Batch size: {train_config.batch_size} \n")
print(f"Perform Cross validation: {train_config.run_cv} \n")

Learning_rate: 2e-05 

Number of epochs for fine tuning: 1 

Batch size: 4 

Perform Cross validation: False 



# Training the Model

The training pipeline receives the curated ESG dataset for KPI extraction. The necessary components, including the tokenizer and the processor, are loaded first; they create features from the input text. Next, the model is defined: a BERT-based model with extra dense layers for the question-answering task. Its weights are initialized from a model pretrained on the SQuAD dataset. In the training phase, the model is fine-tuned on the curated ESG data.
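
An extractive QA head typically produces one start score and one end score per token, and the answer is the span with the highest combined score. The following is a minimal sketch of that selection step in plain Python. The logits are made-up numbers, and `best_span` and `max_answer_len` are hypothetical names, not part of FARM.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) of the highest-scoring valid span."""
    best = (0, 0, float("-inf"))
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best


tokens = ["emissions", "by", "90%", "until", "2021"]
start_logits = [0.1, 0.0, 1.2, 0.3, 4.0]  # illustrative values
end_logits = [0.0, 0.1, 0.9, 0.2, 3.5]  # illustrative values
start, end, score = best_span(start_logits, end_logits)
print(tokens[start:end + 1], score)
```

Fine-tuning adjusts the model so that the start and end scores peak at the annotated answer boundaries in the ESG data.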


#### Fine-tune on curated ESG data

Once all the parameters are set, a `QAFARMTrainer` object can be instantiated by passing all the configuration objects.

In [17]:
farm_trainer = QAFARMTrainer(
    file_config=file_config,
    tokenizer_config=tokenizer_config,
    model_config=model_config,
    processor_config=processor_config,
    training_config=train_config,
    mlflow_config=mlflow_config,
)

Call the `run()` method to start training.

_Note_: The first time, loading the model takes a little longer because the checkpoints have to be downloaded. The model is cached after that.

In [None]:
farm_trainer.run(metric="f1")



At the end of the training process, the model and the processor vocabulary are saved into the directory `file_config.saved_models_dir`.

In [None]:
type(os.path.join(config.ROOT, "models", "KPI_EXTRACTION"))

In [None]:
!ls -al $file_config.saved_models_dir

You can find the development dataset at `file_config.dev_filename`. This dataset has not been seen by the model.

In [None]:
file_config.dev_filename
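
Training was started with `metric="f1"`. Assuming this refers to the token-overlap F1 used in standard SQuAD evaluation, a simplified version can be sketched as follows. `token_f1` is a hypothetical helper; the official SQuAD script additionally strips punctuation and articles before comparing, and the trainer's own implementation may differ.

```python
from collections import Counter


def token_f1(prediction, ground_truth):
    """Token-overlap F1 between two answer strings (whitespace tokenization)."""
    pred_tokens = prediction.lower().split()
    true_tokens = ground_truth.lower().split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match (no-answer case); otherwise no overlap.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("2021", "2021"))
print(token_f1("until 2021", "2021"))
print(token_f1("2011", "2021"))
```

Unlike exact match, this score gives partial credit when the predicted span contains the correct answer plus a few extra tokens.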

## Inference

We can use the saved model and test it on some real examples.<br><br>
First let's load the model:

In [None]:
from farm.infer import QAInferencer
model = QAInferencer.load(file_config.saved_models_dir, batch_size=40, gpu=True)

Now, let's make a prediction on a paragraph-question pair.

In [None]:
context = """the paris agreement on climate change drafted in 2015 aims to reduce worldwide emissions of greenhouse 
gases to a level intended to limit a rise in global temperatures to below 2 degrees or, better still,
to below 1.5 degrees. verbund’s target of reducing greenhouse gas emissions by 90% measured beginning from 
the basis year 2011 5 million tonnes co2e until 2021 includes scope 1, scope 2 market- based and parts of scope 3 emissions 
for energy and air travel. the science based targets initiative validated this goal as science-based in october 2016, 
i.e. it meets global standards. according to current planning, the target can be achieved. 
however, if the grid operator requires higher generation volumes 
"""
question = "What is the target year for climate commitment?"
    

In [None]:
QA_input = [
    {
        "qas": [question],
        "context": context,
    }
]

result = model.inference_from_dicts(dicts=QA_input)[0]
pprint.pprint(result)

What does the prediction result show? 

In [None]:
# This is the best answer. In general it is either a span or no-answer, whichever scores higher.
# Here the top answer is the span '2021'
result['predictions'][0]['answers'][0]

In [None]:
# No-answer score: the model is fairly confident that the context contains an answer to the question.
result['predictions'][0]['answers'][1]
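
The nested access above can be wrapped in a small helper that returns the best span answer and skips the no-answer candidate. The sketch runs on a hand-made dictionary, not on real model output: the keys `"answer"` and `"score"`, the `"no_answer"` marker, and the score values are assumptions made for illustration, so check them against the printed `result` before reusing this.

```python
# Hypothetical result shaped like the nested access used above; the keys
# "answer" and "score" and the "no_answer" marker are assumptions.
mock_result = {
    "predictions": [
        {
            "answers": [
                {"answer": "2021", "score": 11.4},
                {"answer": "no_answer", "score": -3.2},
            ]
        }
    ]
}


def top_span_answer(result, no_answer_label="no_answer"):
    """Return (text, score) of the best span answer, or None if there is none."""
    for candidate in result["predictions"][0]["answers"]:
        if candidate["answer"] != no_answer_label:
            return candidate["answer"], candidate["score"]
    return None


print(top_span_answer(mock_result))
```

The helper relies on the candidates being sorted from best to worst, which matches how the cells above treat index 0 as the top answer.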

You can also run prediction on a SQuAD-format file, as below:

In [None]:
from farm.data_handler.utils import write_squad_predictions

results = model.inference_from_file(file=file_config.dev_filename, return_json=False)
result_squad = [x.to_squad_eval() for x in results]

write_squad_predictions(
    predictions=result_squad,
    predictions_filename=file_config.dev_filename,
    out_filename=os.path.join(file_config.data_dir, "predictions.json")
)

The results are written to `out_filename`.

_Tip_: Many of the result objects belong to FARM classes. To see the attributes of an object's class in a Jupyter notebook, type the object name followed by a dot and press Tab. For example, try it in the cell below. This is a useful Jupyter notebook trick.

In [None]:
# Put the cursor after dot and press Tab.
results[0].