# Data Preparation

The following code curates the data and writes it out in a SQuAD-like format.
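For orientation, the sketch below shows what a single record looks like in the public SQuAD v2.0 layout. The field names come from that public format; the example text and identifiers are made up, and the exact fields written by `TextKPIInferenceCurator` may differ.

```python
# A minimal SQuAD v2.0-style record (illustrative values, not real project data).
example_context = "The company aims to cut emissions by 50 percent by 2030."
answer_text = "2030"

squad_like = {
    "version": "v2.0",
    "data": [
        {
            "title": "example_report",
            "paragraphs": [
                {
                    "context": example_context,
                    "qas": [
                        {
                            "id": "example-0",
                            "question": "What is the target year for climate commitment?",
                            "is_impossible": False,
                            "answers": [
                                {
                                    "text": answer_text,
                                    # Character offset of the answer inside the context.
                                    "answer_start": example_context.index(answer_text),
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

qa = squad_like["data"][0]["paragraphs"][0]["qas"][0]
start = qa["answers"][0]["answer_start"]
# The offset must point at the answer text within the context.
assert example_context[start:start + len(answer_text)] == answer_text
```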

In [1]:
import os
import pdb
import pathlib
from dotenv import load_dotenv

import config
from src.data.s3_communication import S3Communication, S3FileType
from src.components.preprocessing.kpi_inference_curator import TextKPIInferenceCurator

In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)
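Because `load_dotenv` is skipped silently when `credentials.env` does not exist, it can help to confirm that the variables read by the S3 connector are actually set before continuing. The helper below is a hypothetical addition, not part of the project; the variable names are the ones passed to `os.getenv` in this notebook.

```python
import os

# Variable names taken from the S3 connector cell of this notebook.
REQUIRED_VARS = ("S3_ENDPOINT", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
```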

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

## Format Extracted Text in SQuAD Format

In [4]:
kpi_df = s3c.download_df_from_s3(
    "aicoe-osc-demo/kpi_mapping",
    "kpi_mapping.csv",
    filetype=S3FileType.CSV,
    header=0,
)
kpi_df.head()

| Unnamed: 0 | kpi_id | question | sectors | add_year | kpi_category | Unnamed: 5 | Unnamed: 6 |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | What is the company name? | OG, CM, CU | False | TEXT | | |
| 1 | 1.0 | In which year was the annual report or the sus... | OG, CM, CU | False | TEXT | | |
| 2 | 2.0 | What is the total volume of proven and probabl... | OG | True | TEXT, TABLE | | |
| 3 | 2.1 | What is the volume of estimated proven hydroca... | OG | True | TEXT, TABLE | | |
| 4 | 2.2 | What is the volume of estimated probable hydro... | OG | True | TEXT, TABLE | | |


In [5]:
config.TextKPIInferenceCurator_kwargs

{'annotation_folder': PosixPath('/opt/app-root/src/aicoe-osc-demo/data/annotations'),
 'agg_annotation': PosixPath('/opt/app-root/src/aicoe-osc-demo/data/20201030 1Qbit aggregated_annotations_needs_correction.xlsx'),
 'extracted_text_json_folder': PosixPath('/opt/app-root/src/aicoe-osc-demo/data/extraction'),
 'output_squad_folder': PosixPath('/opt/app-root/src/aicoe-osc-demo/data/squad'),
 'relevant_text_path': PosixPath('/opt/app-root/src/aicoe-osc-demo/data/infer_relevance/*.csv')}

In [6]:
tkpi = TextKPIInferenceCurator(
    **config.TextKPIInferenceCurator_kwargs,
    kpi_df=kpi_df,
    columns_to_read=config.TRAIN_KPI_INFERENCE_COLUMNS_TO_READ,
)
train_squad, val_squad = tkpi.curate(**config.CurateConfig().__dict__)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_single.loc[:,'source_page'] = df_single['source_page'].apply(lambda x: x[0])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_single.loc[:,'relevant_paragraphs'] = df['relevant_paragraphs'].apply(lambda x: x[0])


Now, we have the data in SQuAD format, which is ready for training.

In [7]:
# see that the data has been curated and placed in the output folder
output_dir = str(config.TextKPIInferenceCurator_kwargs['output_squad_folder'])
!ls $output_dir

kpi_train.json	      reference_kpi_01-06-2022.csv
kpi_train_split.json  reference_kpi_02-06-2022.csv
kpi_val_split.json
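To sanity-check one of these files, one can count its paragraphs and question-answer pairs. The sketch assumes the files follow the public SQuAD layout (`data`, then `paragraphs`, then `qas`); `count_squad_items` is a hypothetical helper, and the commented usage reuses `output_dir` from the cell above.

```python
import json


def count_squad_items(squad):
    """Count paragraphs and question-answer pairs in a SQuAD-style dict."""
    n_paragraphs = 0
    n_questions = 0
    for document in squad.get("data", []):
        for paragraph in document.get("paragraphs", []):
            n_paragraphs += 1
            n_questions += len(paragraph.get("qas", []))
    return n_paragraphs, n_questions


# Usage in the notebook (assumes the file follows the SQuAD layout):
# with open(os.path.join(output_dir, "kpi_train.json")) as f:
#     print(count_squad_items(json.load(f)))
```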


# KPI extraction

The following shows how to train a KPI extraction model and use it for inference. <br> <br>
The KPI extraction model is a question-answering (QA) system. SQuAD is a widely used public dataset for the question-answering task, and this notebook assumes that the ESG data has already been curated in a SQuAD-like format.

Our pipeline includes components provided by the FARM library, a framework that facilitates transfer learning with BERT-based models. Documentation for FARM is available here: https://farm.deepset.ai.

In [8]:
## General imports
import pprint
import os

### Step 1: Set the configurable parameters
Before training starts, the parameters for each component of the training pipeline must be set. For this we create `config` objects that hold these parameters. Default values have already been set, but they can easily be changed.
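The pattern can be sketched as follows. `DemoTrainingConfig` is a hypothetical stand-in, not one of the project's classes; its attribute names and default values mirror those printed later in this notebook.

```python
class DemoTrainingConfig:
    """Hypothetical config object: holds defaults that callers may override."""

    def __init__(self, experiment_name):
        self.experiment_name = experiment_name
        # Defaults mirror the values printed later in this notebook.
        self.learning_rate = 2e-5
        self.n_epochs = 1
        self.batch_size = 4
        self.use_cuda = True


demo_config = DemoTrainingConfig("demo_train_kpi_infer")
demo_config.n_epochs = 3  # override a default after construction
print(vars(demo_config))
```

Keeping all tunable values on one object per component makes it easy to print, log, or override them in a single place before the trainer is built.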

In [9]:
from src.models.qa_farm_trainer import QAFARMTrainer
from config_qa_farm_train import (
    QAFileConfig,
    QATokenizerConfig,
    QAProcessorConfig,
    QAModelConfig,
    QATrainingConfig,
    QAMLFlowConfig,
)

06/02/2022 14:53:14 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [10]:
file_config = QAFileConfig("demo_train_kpi_infer")  # Settings for data files and checkpoints
processor_config = QAProcessorConfig("demo_train_kpi_infer")  # Settings for the processor component
tokenizer_config = QATokenizerConfig("demo_train_kpi_infer")  # Settings for the tokenizer
model_config = QAModelConfig("demo_train_kpi_infer")  # Settings for the model
train_config = QATrainingConfig("demo_train_kpi_infer")  # Settings for training
mlflow_config = QAMLFlowConfig("demo_train_kpi_infer")  # Settings for MLflow tracking

Parameters can be changed as follows:

In [11]:
file_config.experiment_name = "demo_train_kpi_infer"

However, we advise that you manually update the parameters in the corresponding config file:

`model_pipeline/model_pipeline/config_qa_farm_trainer.py`

We should change the training data file to the one we just created.

In [12]:
# Point to the curated file produced in the data preparation step (see the `ls` output above)
curated_data_path = os.path.join(output_dir, 'kpi_train.json')
file_config.update_paths(output_dir, curated_data_path)
file_config.perform_splitting = True

We can check the value for some parameters:

In [13]:
print(f"Experiment_name: \n {file_config.experiment_name} \n")
print(f"Data directory: \n {file_config.data_dir} \n")
print(f"Curated dataset path: \n {file_config.curated_data} \n")
print(f"Split train/validation ratio: \n{file_config.dev_split} \n")
print(f"Training dataset path: \n {file_config.train_filename} \n")
print(f"Validation dataset path: \n {file_config.dev_filename} \n")
print(f"Directory where trained model is saved: \n {file_config.saved_models_dir} \n")

Experiment_name: 
 demo_train_kpi_infer 

Data directory: 
 /opt/app-root/src/aicoe-osc-demo/data 

Curated dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_train.json 

Split train/validation ratio: 
0.2 

Training dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_train_split.json 

Validation dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_val_split.json 

Directory where trained model is saved: 
 /opt/app-root/src/aicoe-osc-demo/models/KPI_EXTRACTION 
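With `perform_splitting = True` and a ratio of 0.2, the curated data is divided into the train and validation files listed above. How `QAFARMTrainer` does this internally is not shown here; the sketch below illustrates one common approach (shuffle with a fixed seed, then hold out the given fraction) using a hypothetical `split_items` helper.

```python
import random


def split_items(items, dev_split=0.2, seed=42):
    """Shuffle a copy of items and hold out a dev_split fraction for validation."""
    if not 0.0 < dev_split < 1.0:
        raise ValueError("dev_split must be strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(round(len(shuffled) * dev_split))
    # Everything after the held-out slice is used for training.
    return shuffled[n_dev:], shuffled[:n_dev]


train_items, dev_items = split_items(range(10), dev_split=0.2)
print(len(train_items), len(dev_items))
```

A fixed seed keeps the split reproducible across runs, which matters when validation scores from different experiments are compared.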



In [14]:
print(f"Max number of tokens per example: {processor_config.max_seq_len} \n")

Max number of tokens per example: 384 
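Passages longer than this limit cannot be fed to the model in one piece. A common remedy in extractive QA is to cut the passage into overlapping windows; the sketch below shows the idea on a plain token list. The `stride` parameter and the `sliding_windows` helper are assumptions for illustration and are not taken from the project's processor configuration.

```python
def sliding_windows(tokens, max_len, stride):
    """Split tokens into windows of at most max_len, advancing by stride each time."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows


tokens = [f"tok{i}" for i in range(10)]
for window in sliding_windows(tokens, max_len=4, stride=2):
    print(window)
```

Choosing a stride smaller than the window length makes consecutive windows overlap, so an answer that straddles a window boundary still appears whole in at least one window.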



In [15]:
print(f"Use GPU: {train_config.use_cuda} \n")

Use GPU: True 



In [16]:
print(f"Learning_rate: {train_config.learning_rate} \n")
print(f"Number of epochs for fine tuning: {train_config.n_epochs} \n")
print(f"Batch size: {train_config.batch_size} \n")
print(f"Perform Cross validation: {train_config.run_cv} \n")

Learning_rate: 2e-05 

Number of epochs for fine tuning: 1 

Batch size: 4 

Perform Cross validation: False 



# Training the Model

The training pipeline receives the curated ESG dataset for KPI extraction. The necessary components, including the tokenizer and the processor, are loaded first; they create features from the input text. Next, the model is defined: a BERT-based model with extra dense layers for the question-answering task. Its weights are initialized from a model pretrained on the SQuAD dataset. In the training phase, the model is fine-tuned on the curated ESG data.
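To make the question-answering head more concrete: such a head typically emits two scores per token, one for "the answer starts here" and one for "the answer ends here", and the predicted span is the valid start/end pair with the highest combined score. The sketch below shows that selection step in plain Python with made-up scores; it illustrates the general technique and is not FARM's implementation.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) of the highest-scoring span with start <= end."""
    best = None
    for start, s_score in enumerate(start_scores):
        # Only consider spans of at most max_answer_len tokens.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Made-up per-token scores for a six-token passage.
start_scores = [0.1, 0.2, 4.0, 0.3, 0.1, 0.0]
end_scores = [0.0, 0.1, 0.5, 3.5, 0.2, 0.1]
print(best_span(start_scores, end_scores))
```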


#### Fine-tune on curated ESG data

Once all the parameters are set, a `QAFARMTrainer` object can be instantiated by passing all the configuration objects.

In [17]:
tokenizer_config.root = str(tokenizer_config.root)

In [18]:
farm_trainer = QAFARMTrainer(
    file_config=file_config,
    tokenizer_config=tokenizer_config,
    model_config=model_config,
    processor_config=processor_config,
    training_config=train_config,
    mlflow_config=mlflow_config,
)

Call the `run()` method to start training.

_Note:_ The first time, loading the model takes a little longer because the checkpoints have to be downloaded. The model is cached after that.

In [20]:
farm_trainer.run(metric="f1")

06/02/2022 14:56:25 - INFO - src.models.qa_farm_trainer -   Loading the /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_train.json data and splitting to train and val...
06/02/2022 14:56:25 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: True
06/02/2022 14:56:25 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'RobertaTokenizer'
06/02/2022 14:56:26 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
06/02/2022 14:56:26 - INFO - farm.data_handler.data_silo -   Loading train set from: /opt/app-root/src/aicoe-osc-demo/data/squad/kpi_train_split.json 
06/02/2022 14:56:26 - INFO - farm.data_handler.data_silo -   Got ya 7 parallel workers to convert 833 dictionaries to pytorch datasets (chunksize = 24)...
06/02/2022 14:56:26 - INFO - farm.data_h

0.7343754487988401
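The number printed above is what `run(metric="f1")` returned. For extractive QA, F1 is commonly computed SQuAD-style, from the token overlap between the predicted and the reference answer. The sketch below implements that per-answer computation with simple lowercase and whitespace tokenization; FARM's exact normalization may differ, so treat it as an illustration of the metric rather than a reproduction of the number above.

```python
from collections import Counter


def token_f1(prediction, reference):
    """SQuAD-style token-overlap F1 between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection gives the number of shared tokens.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("50 percent", "50 percent reduction"))
```

A dataset-level score is then usually the average of these per-answer values over all questions.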

At the end of the training process, the model and the processor vocabulary are saved into the directory `file_config.saved_models_dir`.

In [22]:
!ls -al $file_config.saved_models_dir

total 1389660
drwxr-sr-x. 2 1000630000 1000630000       4096 Jun  2 14:56 .
drwxrwsr-x. 7 1000630000 1000630000       4096 Jun  1 22:09 ..
-rw-r--r--. 1 1000630000 1000630000 1421605239 Jun  2 14:59 language_model.bin
-rw-r--r--. 1 1000630000 1000630000        572 Jun  2 14:59 language_model_config.json
-rw-r--r--. 1 1000630000 1000630000     456318 Jun  2 14:59 merges.txt
-rw-r--r--. 1 1000630000 1000630000       9473 Jun  2 14:59 prediction_head_0.bin
-rw-r--r--. 1 1000630000 1000630000        405 Jun  2 14:59 prediction_head_0_config.json
-rw-r--r--. 1 1000630000 1000630000        825 Jun  2 14:59 processor_config.json
-rw-r--r--. 1 1000630000 1000630000        150 Jun  2 14:59 special_tokens_map.json
-rw-r--r--. 1 1000630000 1000630000        621 Jun  2 14:59 tokenizer_config.json
-rw-r--r--. 1 1000630000 1000630000     898822 Jun  2 14:59 vocab.json


You can find the development dataset at `file_config.dev_filename`. This dataset has not been seen by the model during training.

In [23]:
file_config.dev_filename

'/opt/app-root/src/aicoe-osc-demo/data/squad/kpi_val_split.json'

## Inference

We can use the saved model and test it on some real examples.<br><br>
First let's load the model:

In [24]:
from farm.infer import QAInferencer
model = QAInferencer.load(file_config.saved_models_dir, batch_size=40, gpu=True)

06/02/2022 15:41:44 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
06/02/2022 15:41:52 - INFO - farm.modeling.adaptive_model -   Found files for loading 1 prediction heads
06/02/2022 15:41:52 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [1024, 2]
06/02/2022 15:41:52 - INFO - farm.modeling.prediction_head -   Loading prediction head from /opt/app-root/src/aicoe-osc-demo/models/KPI_EXTRACTION/prediction_head_0.bin
06/02/2022 15:41:53 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'RobertaTokenizer'
06/02/2022 15:41:53 - INFO - farm.data_handler.processor -   Initialized processor without tasks. Supply `metric` and `label_list` to the constructor for using the default task or add a custom task later via processor.add_task()
06/02/2022 15:41:53 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
06/02/2

Now, let's make a prediction on a paragraph-question pair.

In [39]:
context = """southern company is transitioning our energy generation fleet for a low-carbon future. in doing so, we have established an intermediate goal of a 50 percent reduction in carbon emissions from 2007 levels by 2030 and a long-term goal of low- to no-carbon operations by 2050."""
# Alternative question to try: "What is the target year for climate commitment?"
question = "What is the target carbon reduction in percentage in year 2018?"

In [40]:
QA_input = [
        {
            "qas": [question],
            "context":  context
        }]

result = model.inference_from_dicts(dicts=QA_input)[0]
pprint.pprint(result)

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 32.25 Batches/s]

{'predictions': [{'answers': [{'answer': '50 percent',
                               'context': 'e have established an intermediate '
                                          'goal of a 50 percent reduction in '
                                          'carbon emissions from 2007 leve',
                               'document_id': '0-0',
                               'offset_answer_end': 156,
                               'offset_answer_start': 146,
                               'offset_context_end': 201,
                               'offset_context_start': 101,
                               'probability': None,
                               'score': 8.2923583984375},
                              {'answer': 'no_answer',
                               'context': '',
                               'document_id': '0-0',
                               'offset_answer_end': 0,
                               'offset_answer_start': 0,
                               'offset_context_




In [25]:
context = """the paris agreement on climate change drafted in 2015 aims to reduce worldwide emissions of greenhouse 
gases to a level intended to limit a rise in global temperatures to below 2 degrees or, better still,
to below 1.5 degrees. verbund’s target of reducing greenhouse gas emissions by 90% measured beginning from 
the basis year 2011 5 million tonnes co2e until 2021 includes scope 1, scope 2 market- based and parts of scope 3 emissions 
for energy and air travel. the science based targets initiative validated this goal as science-based in october 2016, 
i.e. it meets global standards. according to current planning, the target can be achieved. 
however, if the grid operator requires higher generation volumes 
"""
question = "What is the target year for climate commitment?"

In [26]:
QA_input = [
        {
            "qas": [question],
            "context":  context
        }]

result = model.inference_from_dicts(dicts=QA_input)[0]
pprint.pprint(result)

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 24.41 Batches/s]

{'predictions': [{'answers': [{'answer': '2021',
                               'context': 'the basis year 2011 5 million '
                                          'tonnes co2e until 2021 includes '
                                          'scope 1, scope 2 market- based and '
                                          'par',
                               'document_id': '0-0',
                               'offset_answer_end': 366,
                               'offset_answer_start': 362,
                               'offset_context_end': 414,
                               'offset_context_start': 314,
                               'probability': None,
                               'score': 10.710538864135742},
                              {'answer': 'no_answer',
                               'context': '',
                               'document_id': '0-0',
                               'offset_answer_end': 0,
                               'offset_answer_start': 0,
     




What does the prediction result show? 

In [27]:
# This is the best answer. It can be either a text span or 'no_answer', whichever scores higher.
# Here the top answer is the span '2021'.
result['predictions'][0]['answers'][0]

{'score': 10.710538864135742,
 'probability': None,
 'answer': '2021',
 'offset_answer_start': 362,
 'offset_answer_end': 366,
 'context': 'the basis year 2011 5 million tonnes co2e until 2021 includes scope 1, scope 2 market- based and par',
 'offset_context_start': 314,
 'offset_context_end': 414,
 'document_id': '0-0'}
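The two `offset_answer_*` fields are character positions in the original context, so slicing the context with them recovers the answer text. The check below uses the first example from this notebook (answer `'50 percent'`, offsets 146 and 156); the context is shortened to the part that contains the answer, and `first_context` is a name introduced here for illustration.

```python
# Beginning of the first example's context (cut off after the answer sentence).
first_context = (
    "southern company is transitioning our energy generation fleet for a "
    "low-carbon future. in doing so, we have established an intermediate "
    "goal of a 50 percent reduction in carbon emissions from 2007 levels by 2030"
)

# Values copied from the prediction output of the first example.
answer = {"answer": "50 percent", "offset_answer_start": 146, "offset_answer_end": 156}

# Python slicing with the two offsets yields the answer span.
span = first_context[answer["offset_answer_start"]:answer["offset_answer_end"]]
print(repr(span))
assert span == answer["answer"]
```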

In [28]:
# 'no_answer' score: it is far lower than the span score above (2.64 vs. 10.71),
# so the model is fairly confident that the context contains an answer to the question.
result['predictions'][0]['answers'][1]

{'score': 2.6435301303863525,
 'probability': None,
 'answer': 'no_answer',
 'offset_answer_start': 0,
 'offset_answer_end': 0,
 'context': '',
 'offset_context_start': 0,
 'offset_context_end': 0,
 'document_id': '0-0'}
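Putting the two entries together, a downstream step might accept the span only when it outscores `no_answer` by some margin. The helper below is a hypothetical sketch operating on the dictionary layout shown above; the `margin` parameter is an assumption, not a FARM setting.

```python
def pick_answer(answers, margin=0.0):
    """Return the best span text, or None if 'no_answer' wins within the margin."""
    spans = [a for a in answers if a["answer"] != "no_answer"]
    no_answer = [a for a in answers if a["answer"] == "no_answer"]
    if not spans:
        return None
    top_span = max(spans, key=lambda a: a["score"])
    # If there is no 'no_answer' entry, any span is accepted.
    no_answer_score = max((a["score"] for a in no_answer), default=float("-inf"))
    if top_span["score"] - no_answer_score > margin:
        return top_span["answer"]
    return None


# Scores copied from the two entries shown above.
answers = [
    {"answer": "2021", "score": 10.710538864135742},
    {"answer": "no_answer", "score": 2.6435301303863525},
]
print(pick_answer(answers))
```

In the notebook, the same helper could be called as `pick_answer(result['predictions'][0]['answers'])`.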

You can also make predictions on a SQuAD-format file, as shown below:

In [29]:
from farm.data_handler.utils import write_squad_predictions

results = model.inference_from_file(file=file_config.dev_filename, return_json=False)
result_squad = [x.to_squad_eval() for x in results]

write_squad_predictions(
    predictions=result_squad,
    predictions_filename=file_config.dev_filename,
    out_filename=os.path.join(file_config.data_dir, "predictions.json")
)

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.74 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.42 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.24 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.77 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.89 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.70 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  7.44 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.83 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  6.56 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  7.05 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  5.17 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  7.06 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.70 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00

The results are written to the file given by `out_filename`.

_Tip_: Many of the result objects belong to FARM classes. To see the attributes of such an object in a Jupyter notebook, type the object name followed by a dot and press Tab. For example, try it in the cell below to see the attributes of the class. This is a useful Jupyter notebook trick.

In [30]:
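# `results` (from the cell above) is a list of FARM prediction objects; `result` is a plain dict,
# so `result[0]` raises KeyError. Type a dot after the expression below and press Tab.
results[0]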