## Train Relevance Model

Now that the dataset has been [extracted](pdf_text_extraction.ipynb) and [curated](pdf_text_curation.ipynb), this notebook trains the relevance classifier. The classifier is a transformer model (e.g., BERT) that is loaded into the pipeline from a checkpoint pre-trained on the NQ dataset and then fine-tuned on the curated data for our specific relevance detection task.

Our pipeline includes components provided by the FARM library, a framework that facilitates transfer learning for BERT-based models. Documentation for FARM is available at https://farm.deepset.ai.

In [1]:
import os
import config
import zipfile
import pathlib

import pandas as pd

from dotenv import load_dotenv

from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score

from src.models import FARMTrainer
from src.data.s3_communication import S3Communication, S3FileType

from config_farm_train import (
    FileConfig,
    ModelConfig,
    MLFlowConfig,
    TrainingConfig,
    TokenizerConfig,
    ProcessorConfig,
)
from farm.infer import Inferencer

05/18/2022 20:28:38 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

#### Set parameters

Before training starts, the parameters for each component of the training pipeline must be set. We create `config` objects to hold these parameters. Default values are already set, but you can change them by editing the corresponding config file:

`aicoe-osc-demo/notebooks/demo2/config_farm_train.py`
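
Parameters can also be overridden on a config object for a single session, which is what the cross-validation section of this notebook does with `train_config`. The sketch below illustrates the pattern with a hypothetical stand-in class, `DemoTrainingConfig`. It is not the project's real `TrainingConfig`, and only the default values are taken from the output printed later in this notebook.

```python
from dataclasses import dataclass


@dataclass
class DemoTrainingConfig:
    """Hypothetical stand-in for TrainingConfig; not the project's real class."""

    experiment_name: str
    learning_rate: float = 1e-5
    n_epochs: int = 10
    batch_size: int = 16
    run_cv: bool = False


# Start from the defaults, then override individual values for this session only.
demo_config = DemoTrainingConfig("relevance_fine_tune_demo")
demo_config.n_epochs = 3
demo_config.batch_size = 8

print(demo_config)
```

Overrides made this way last only for the current session. Editing the config file changes the defaults for every run.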

In [4]:
# Settings for data files and checkpoints
file_config = FileConfig("relevance_fine_tune_demo")

# Settings for the processor component
processor_config = ProcessorConfig("relevance_fine_tune_demo")

# Settings for the tokenizer
tokenizer_config = TokenizerConfig("relevance_fine_tune_demo")
# NOTE: specifically for tokenizer, we need to ensure root dir is a string
tokenizer_config.root = str(tokenizer_config.root)

# Settings for the model
model_config = ModelConfig("relevance_fine_tune_demo")

# Settings for training
train_config = TrainingConfig("relevance_fine_tune_demo")

# Settings for MLflow
mlflow_config = MLFlowConfig("relevance_fine_tune_demo")

We can check the value for some parameters:

In [5]:
print(f"Experiment_name: \n {file_config.experiment_name} \n")
print(f"Data directory: \n {file_config.data_dir} \n")
print(f"Curated dataset path: \n {file_config.curated_data} \n")
print(f"Split train/validation ratio: \n{file_config.dev_split} \n")
print(f"Training dataset path: \n {file_config.train_filename} \n")
print(f"Validation dataset path: \n {file_config.dev_filename} \n")
print(f"Directory where trained model is saved: \n {file_config.saved_models_dir} \n")

Experiment_name: 
 relevance_fine_tune_demo 

Data directory: 
 /opt/app-root/src/aicoe-osc-demo/data 

Curated dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/curation/esg_TEXT_dataset.csv 

Split train/validation ratio: 
0.2 

Training dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/processed/kpi_train_split.csv 

Validation dataset path: 
 /opt/app-root/src/aicoe-osc-demo/data/processed/kpi_val_split.csv 

Directory where trained model is saved: 
 /opt/app-root/src/aicoe-osc-demo/models/relevance_fine_tune_demo/RELEVANCE 



In [6]:
print(f"Max number of tokens per example: {processor_config.max_seq_len} \n")

Max number of tokens per example: 512 



In [7]:
print(f"Use GPU: {train_config.use_cuda} \n")

Use GPU: True 



In [8]:
print(f"Learning_rate: {train_config.learning_rate} \n")
print(f"Number of epochs for fine tuning: {train_config.n_epochs} \n")
print(f"Batch size: {train_config.batch_size} \n")
print(f"Perform Cross validation: {train_config.run_cv} \n")

Learning_rate: 1e-05 

Number of epochs for fine tuning: 10 

Batch size: 16 

Perform Cross validation: False 



## Load Pretrained Model

We already have a relevance classifier trained on Google's large NQ (Natural Questions) dataset. We download it from S3 and extract it into the parent directory of `model_config.load_dir`.

In [9]:
# download the pretrained model
model_root = pathlib.Path(model_config.load_dir).parent
model_rel_zip = pathlib.Path(model_root, "relevance_roberta.zip")

s3c.download_file_from_s3(
    model_rel_zip, config.CHECKPOINT_S3_PREFIX, "relevance_roberta.zip"
)

with zipfile.ZipFile(model_rel_zip, "r") as z:
    z.extractall(model_root)

In [10]:
file_config.data_type = "Text"
print(f"Data type: \n {file_config.data_type} \n")

Data type: 
 Text 



To fine-tune a relevance classifier on our curated ESG dataset, the pipeline must load this model. The parameter `model_config.load_dir` must therefore point to the directory that holds the pre-trained checkpoint. We can check that it is set:

In [11]:
print(f"NQ checkpoint directory: {model_config.load_dir}")

NQ checkpoint directory: /opt/app-root/src/aicoe-osc-demo/models/pretrained/relevance_roberta


## Fine-tune on curated ESG data

Once all the parameters are set, a `FARMTrainer` object can be instantiated by passing it all the configuration objects:

In [12]:
# init farm trainer
farm_trainer = FARMTrainer(
    file_config=file_config,
    tokenizer_config=tokenizer_config,
    processor_config=processor_config,
    model_config=model_config,
    training_config=train_config,
    mlflow_config=mlflow_config,
)

Call the `run()` method to start training:

In [13]:
farm_trainer.run()

05/18/2022 20:28:51 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: True
05/18/2022 20:28:51 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'RobertaTokenizer'
05/18/2022 20:28:52 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
05/18/2022 20:28:52 - INFO - farm.data_handler.data_silo -   Loading train set from: /opt/app-root/src/aicoe-osc-demo/data/processed/kpi_train_split.csv 
05/18/2022 20:28:52 - INFO - farm.data_handler.data_silo -   Got ya 7 parallel workers to convert 3028 dictionaries to pytorch datasets (chunksize = 87)...
05/18/2022 20:28:52 - INFO - farm.data_handler.data_silo -    0    0    0    0    0    0    0 
05/18/2022 20:28:52 - INFO - farm.data_handler.data_silo -   /w\  /w\  /|\  /|\  /w\  /|\  /|\
05/18/2022 20:28

0.9722955145118733

At the end of the training process, the model and the processor vocabulary are saved to the directory `file_config.saved_models_dir`:

In [14]:
!ls -al $file_config.saved_models_dir

total 488348
drwxr-sr-x. 2 1000630000 1000630000      4096 May 18 20:52 .
drwxr-sr-x. 3 1000630000 1000630000      4096 May 18 20:52 ..
-rw-r--r--. 1 1000630000 1000630000 498669047 May 18 20:52 language_model.bin
-rw-r--r--. 1 1000630000 1000630000       562 May 18 20:52 language_model_config.json
-rw-r--r--. 1 1000630000 1000630000    456318 May 18 20:52 merges.txt
-rw-r--r--. 1 1000630000 1000630000      7489 May 18 20:52 prediction_head_0.bin
-rw-r--r--. 1 1000630000 1000630000       321 May 18 20:52 prediction_head_0_config.json
-rw-r--r--. 1 1000630000 1000630000       755 May 18 20:52 processor_config.json
-rw-r--r--. 1 1000630000 1000630000       772 May 18 20:52 special_tokens_map.json
-rw-r--r--. 1 1000630000 1000630000       237 May 18 20:52 tokenizer_config.json
-rw-r--r--. 1 1000630000 1000630000    898822 May 18 20:52 vocab.json
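
The same check can be done programmatically, for example before uploading a checkpoint. The sketch below is a minimal illustration: `missing_files` and `EXPECTED` are hypothetical names, the expected set is only a subset of the listing above, and the demonstration runs on a temporary directory rather than on `file_config.saved_models_dir`.

```python
import pathlib
import tempfile


def missing_files(model_dir, expected_names):
    """Return the expected file names that are absent from model_dir."""
    present = {p.name for p in pathlib.Path(model_dir).iterdir() if p.is_file()}
    return sorted(set(expected_names) - present)


# Subset of the file names from the listing above, used as the expected set.
EXPECTED = ["language_model.bin", "prediction_head_0.bin", "processor_config.json"]

# Demonstrate on a temporary directory that holds only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / "language_model.bin").touch()
    print(missing_files(tmp, EXPECTED))
```

An empty list means that every expected file is present.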


## Cross-validation

To better estimate how the model performs on new data, k-fold cross-validation (CV) is recommended. CV works as follows:

- Randomly split the entire dataset into k folds (usually 5 to 10).
- Fit the model on k-1 folds, validate it on the remaining fold, and save the scores.
- Repeat until every fold has served as the validation set once, then average the saved scores.
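
The steps above can be sketched in plain Python. This is a generic illustration of k-fold splitting, not how `FARMTrainer` or FARM implements it. The helper names `k_fold_indices` and `cross_validate` and the toy scoring function are assumptions made for this example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score over all k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Toy scorer (hypothetical): reports the fraction of samples held out for validation.
mean_score = cross_validate(
    n_samples=10,
    k=5,
    fit_and_score=lambda train, val: len(val) / (len(train) + len(val)),
)
print(mean_score)
```

In a real run, `fit_and_score` would train the model on the `train` indices and return a validation metric such as F1 on the `val` indices.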

`FARMTrainer` includes this feature. To perform 3-fold CV, proceed as follows:

In [15]:
train_config.run_cv = True
train_config.xval_folds = 3
train_config.n_epochs = 3
train_config.batch_size = 8

In [16]:
farm_trainer = FARMTrainer(
    file_config=file_config,
    tokenizer_config=tokenizer_config,
    model_config=model_config,
    processor_config=processor_config,
    training_config=train_config,
    mlflow_config=mlflow_config,
)

In [17]:
farm_trainer.run()

05/18/2022 20:52:22 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: True
05/18/2022 20:52:22 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'RobertaTokenizer'
05/18/2022 20:52:23 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
05/18/2022 20:52:23 - INFO - farm.data_handler.data_silo -   Loading train set from: /opt/app-root/src/aicoe-osc-demo/data/processed/kpi_train_split.csv 
05/18/2022 20:52:23 - INFO - farm.data_handler.data_silo -   Got ya 7 parallel workers to convert 3028 dictionaries to pytorch datasets (chunksize = 87)...
05/18/2022 20:52:23 - INFO - farm.data_handler.data_silo -    0    0    0    0    0    0    0 
05/18/2022 20:52:23 - INFO - farm.data_handler.data_silo -   /w\  /w\  /|\  /|\  /w\  /|\  /|\
05/18/2022 20:52

**NOTE:** CV mode does not save a checkpoint; it is only used for validation.

## Model Performance Metrics

In this section, we quantify the performance of the fine-tuned model on our dataset. Specifically, we calculate the accuracy, precision, recall, and F1 score for each KPI question.
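
As a reminder of what these metrics measure, the sketch below computes precision, recall, and F1 for the positive class directly from true and predicted labels. It is a minimal illustration with made-up labels, and `binary_classification_metrics` is a hypothetical helper. The notebook itself uses the scikit-learn functions imported at the top.

```python
def binary_classification_metrics(y_true, y_pred):
    """Return precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Report 0.0 when a denominator is zero (no predicted or no true positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration: 1 = relevant paragraph, 0 = not relevant.
example = binary_classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0],
    y_pred=[1, 1, 0, 1, 0, 0],
)
print(example)
```

When a group contains no positive labels and no positive predictions, all three metrics are undefined. This sketch reports 0.0 in that case, which matches scikit-learn's default of returning 0 and emitting a warning, so a group can show an accuracy of 1.0 alongside an F1 score of 0.0.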

In [18]:
# load test set
test_data = pd.read_csv(file_config.dev_filename, index_col=0)
test_data.head()

| | label | text | text_b |
|---|---|---|---|
| 3180 | 0 | What is the total volume of natural gas produc... | ROMGAZ proposes to be an active, profitable an... |
| 2367 | 1 | What is the total amount of energy indirect gr... | Scope 2 emissions which arise due to purchased... |
| 2647 | 0 | What is the total volume of hydrocarbons produ... | Accordingly, the Internal Audit Department per... |
| 3107 | 0 | What is the total volume of natural gas liquid... | Transparency rating of Russian oil and gas com... |
| 3181 | 0 | What is the total volume of natural gas produc... | Re-development of the Hod field has passed dec... |


In [19]:
# get predictions from current model
model = Inferencer.load(file_config.saved_models_dir)

result = model.inference_from_file(file_config.dev_filename)
results = [d for r in result for d in r["predictions"]]
preds = [int(r["label"]) for r in results]

test_data["pred"] = preds

05/18/2022 21:10:36 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
05/18/2022 21:10:39 - INFO - farm.modeling.adaptive_model -   Found files for loading 1 prediction heads
05/18/2022 21:10:39 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 2]
05/18/2022 21:10:39 - INFO - farm.modeling.prediction_head -   Loading prediction head from /opt/app-root/src/aicoe-osc-demo/models/relevance_fine_tune_demo/RELEVANCE/prediction_head_0.bin
05/18/2022 21:10:39 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'RobertaTokenizer'
05/18/2022 21:10:39 - INFO - farm.data_handler.processor -   Initialized processor without tasks. Supply `metric` and `label_list` to the constructor for using the default task or add a custom task later via processor.add_task()
05/18/2022 21:10:39 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision train

In [20]:
# evaluate performance separately for each KPI question
groups = test_data.groupby("text")
scores = {}
for group, data in groups:
    pred = data.pred
    true = data.label
    scores[group] = {}
    scores[group]["accuracy"] = accuracy_score(true, pred)
    scores[group]["f1_score"] = f1_score(true, pred)
    scores[group]["recall_score"] = recall_score(true, pred)
    scores[group]["precision_score"] = precision_score(true, pred)
    scores[group]["support"] = len(pred)

  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [21]:
# KPI-wise performance metrics
scores_df = pd.DataFrame(scores)
scores_df.head()

Unnamed: 0,In which year was the annual report or the sustainability report published?,What is the annual total production from coal?,What is the annual total production from lignite (brown coal)?,What is the base year for carbon reduction commitment?,What is the climate commitment scenario considered?,What is the company name?,What is the target carbon reduction in percentage?,What is the target year for climate commitment?,What is the total amount of direct greenhouse gases emissions referred to as scope 1 emissions?,What is the total amount of energy indirect greenhouse gases emissions referred to as scope 2 emissions?,...,What is the total amount of upstream energy indirect greenhouse gases emissions referred to as scope 3 emissions?,What is the total installed capacity from coal?,What is the total installed capacity from lignite (brown coal)?,What is the total volume of crude oil liquid production?,What is the total volume of hydrocarbons production?,What is the total volume of natural gas liquid production?,What is the total volume of natural gas production?,What is the total volume of proven and probable hydrocarbons reserves?,What is the volume of estimated probable hydrocarbons reserves?,What is the volume of estimated proven hydrocarbons reserves?
accuracy,0.96,1.0,1.0,0.948718,1.0,0.911765,0.980769,0.987179,1.0,1.0,...,1.0,1.0,1.0,1.0,0.97,1.0,1.0,0.980392,1.0,1.0
f1_score,0.944444,0.0,1.0,0.9,1.0,0.861538,0.96,0.981132,1.0,1.0,...,1.0,1.0,0.0,1.0,0.955224,1.0,1.0,0.96,1.0,1.0
recall_score,0.918919,0.0,1.0,1.0,1.0,0.823529,0.923077,0.962963,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.923077,1.0,1.0
precision_score,0.971429,0.0,1.0,0.818182,1.0,0.903226,1.0,1.0,1.0,1.0,...,1.0,1.0,0.0,1.0,0.914286,1.0,1.0,1.0,1.0,1.0
support,100.0,4.0,1.0,39.0,36.0,102.0,52.0,78.0,40.0,26.0,...,19.0,4.0,1.0,8.0,100.0,13.0,16.0,51.0,1.0,62.0


In [22]:
# save results locally
scores_df.to_csv(file_config.model_performance_metrics_file)

## Save model to S3

We now have a fine-tuned model. Next, we save the model and its performance metrics to S3.

In [23]:
# upload performance files to s3
s3c.upload_df_to_s3(
    scores_df,
    s3_prefix=f"{config.CHECKPOINT_S3_PREFIX}/relevance_fine_tune_demo",
    s3_key="relevance_fine_tune_demo_scores.csv",
    filetype=S3FileType.CSV,
)

{'ResponseMetadata': {'RequestId': 'EQ033H8X94TD0GGG',
  'HostId': 'eFQBzbT8wEngR6ixTSMmbb2+6oBr1bEZ+Su6HX7ZsxeDT/37drujx5Nn40EHcboZIFSX7+f+mkc=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'eFQBzbT8wEngR6ixTSMmbb2+6oBr1bEZ+Su6HX7ZsxeDT/37drujx5Nn40EHcboZIFSX7+f+mkc=',
   'x-amz-request-id': 'EQ033H8X94TD0GGG',
   'date': 'Wed, 18 May 2022 21:13:18 GMT',
   'etag': '"4bb779b3cd4cb02c8a79e43ca93ea287"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'ETag': '"4bb779b3cd4cb02c8a79e43ca93ea287"'}

In [24]:
# upload model to s3
s3c.upload_files_in_dir_to_prefix(
    source_dir=file_config.saved_models_dir,
    s3_prefix=f"{config.CHECKPOINT_S3_PREFIX}/relevance_fine_tune_demo/RELEVANCE",
)