## **BERT Q/A for climate-related financial disclosures**

---



<a href="https://colab.research.google.com/github/dafrie/fin-disclosures-nlp/blob/master/notebooks/BERT_Q_A_for_climate_related_financial_disclosures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Overview**
**BERT**, or Bidirectional Encoder Representations from Transformers, is a method of pre-training language representations that achieved state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks when it was released. The academic paper can be found here: https://arxiv.org/abs/1810.04805.

**SQuAD**, the Stanford Question Answering Dataset, is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage; in SQuAD 2.0 a question may also be unanswerable.
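
To make the span idea concrete, here is a minimal sketch of a single paragraph in the SQuAD 2.0 layout. The record is hand-written for illustration (it is not taken from the dataset), and the helper name `answer_span` is hypothetical. It shows that an answer is stored as text plus a character offset into the context, and that unanswerable questions carry no answers.

```python
# A hand-written record in the SQuAD 2.0 layout (illustrative only).
context = "The Group adopted the TCFD recommendations and began implementation in 2019."

record = {
    "context": context,
    "qas": [
        {
            "id": "example-1",
            "question": "When did implementation begin?",
            "is_impossible": False,
            "answers": [{"text": "2019", "answer_start": context.index("2019")}],
        },
        {
            "id": "example-2",
            "question": "Who chairs the committee?",
            "is_impossible": True,  # cannot be answered from this context
            "answers": [],
        },
    ],
}


def answer_span(context, qa):
    """Recover the answer text from its character offset, or None if unanswerable."""
    if qa["is_impossible"] or not qa["answers"]:
        return None
    answer = qa["answers"][0]
    start = answer["answer_start"]
    return context[start:start + len(answer["text"])]


for qa in record["qas"]:
    print(qa["id"], "->", answer_span(record["context"], qa))
```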

This Colab notebook shows how to fine-tune BERT on the SQuAD dataset and then how to run predictions in the domain of climate-related financial disclosures.



This guide has been modified. Make sure to go through the **Initial Setup** on the first run. On each training or inference run, run the cells in **Initialization**.

# Initial Setup

### **Download the BERT PRETRAINED MODEL**


BERT pretrained model list:


*   [BERT-Large, Uncased (Whole Word Masking)](https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip) : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   [BERT-Large, Cased (Whole Word Masking)](https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip) : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   [BERT-Base, Uncased](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) : 12-layer, 768-hidden, 12-heads, 110M parameters
*   [BERT-Large, Uncased](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip) : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   [BERT-Base, Cased](https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip) : 12-layer, 768-hidden, 12-heads, 110M parameters
*   [BERT-Large, Cased](https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip) : 24-layer, 1024-hidden, 16-heads, 340M parameters
*   [BERT-Base, Multilingual Cased (New, recommended)](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip) : 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
*   [BERT-Base, Multilingual Uncased (Orig, not recommended; use Multilingual Cased instead)](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip) : 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
*   [BERT-Base, Chinese](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip) : Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

The BERT authors have released **BERT-Base** and **BERT-Large** models. Uncased means that the text is lowercased before WordPiece tokenization (e.g., John Smith becomes john smith) and accent markers are stripped, whereas Cased means that the true case and accent markers are preserved.
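
The difference between cased and uncased input can be sketched with the standard library alone. The function below is only an approximation of the uncased preprocessing (lowercasing plus accent removal), not the tokenizer shipped with BERT, and the name `uncased_normalize` is hypothetical.

```python
import unicodedata


def uncased_normalize(text):
    """Lowercase text and drop accent marks, approximating uncased preprocessing."""
    lowered = text.lower()
    # NFD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Category "Mn" covers combining marks such as the acute accent.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("John Smith"))  # john smith
print(uncased_normalize("Nestlé"))      # nestle
```

A cased model would see "John Smith" and "Nestlé" unchanged, which is why the lowercasing step has to be switched off for cased checkpoints.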

**When using a cased model, make sure to pass --do_lower_case=False at the time of training.** 

You can download any model of your choice. This notebook uses the **BERT-Large, Uncased** model.


In [None]:
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip
# Unzip the pretrained model
!unzip uncased_L-24_H-1024_A-16.zip

--2020-07-14 12:37:01--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.129.128, 74.125.124.128, 172.217.212.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.129.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1247797031 (1.2G) [application/zip]
Saving to: ‘uncased_L-24_H-1024_A-16.zip’


2020-07-14 12:37:06 (226 MB/s) - ‘uncased_L-24_H-1024_A-16.zip’ saved [1247797031/1247797031]



### **Move Pretrained Model to GCS Bucket** 


> The pre-trained model must be moved to a GCS (Google Cloud Storage) bucket, because the local file system is not supported on TPU. If the model is not moved to GCS, training on the TPU may fail with an error.



> The **gsutil mv** command allows you to move data between your local file system and the cloud, move data within the cloud, and move data between cloud storage providers.
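
Because `gsutil mv` removes the local copy after the transfer, it can help to confirm that the unzipped model directory is complete before moving it. The following is a minimal sketch: the helper name `missing_model_files` is hypothetical, and the file list contains only the names visible in this notebook's transfer log, so the archive may contain further files.

```python
import os

# File names taken from this notebook's gsutil log; the archive may contain more.
EXPECTED_FILES = [
    "vocab.txt",
    "bert_config.json",
    "bert_model.ckpt.index",
    "bert_model.ckpt.meta",
]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


missing = missing_model_files("/content/bert/uncased_L-24_H-1024_A-16")
print("Missing files:", missing or "none")
```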




In [None]:
!gsutil mv /content/bert/uncased_L-24_H-1024_A-16 $BUCKET_NAME

Copying file:///content/bert/uncased_L-24_H-1024_A-16/vocab.txt [Content-Type=text/plain]...
Removing file:///content/bert/uncased_L-24_H-1024_A-16/vocab.txt...
Copying file:///content/bert/uncased_L-24_H-1024_A-16/bert_model.ckpt.index [Content-Type=application/octet-stream]...
Removing file:///content/bert/uncased_L-24_H-1024_A-16/bert_model.ckpt.index...
Copying file:///content/bert/uncased_L-24_H-1024_A-16/bert_model.ckpt.meta [Content-Type=application/octet-stream]...
Removing file:///content/bert/uncased_L-24_H-1024_A-16/bert_model.ckpt.meta...
Copying file:///content/bert/uncased_L-24_H-1024_A-16/bert_config.json [Content-Type=application/json]...
Removing file:///content/bert/uncased_L-24_H-1024_A-16/bert_config.json...

==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying f

# Initialization

### Libraries + Auth

In [None]:
%tensorflow_version 1.x
import tensorflow as tf
from google.colab import auth

auth.authenticate_user()

TensorFlow 1.x selected.


### **Create output directory** 


> Create an output directory in your GCS (Google Cloud Storage) bucket; the fine-tuned model is written there once training completes. Provide your bucket name and output directory name below.

> The pre-trained model must also be in the GCS bucket, because the local file system is not supported on TPU. If the model is not moved to GCS, you may face an error.
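
All bucket locations used in this notebook follow one pattern, which the sketch below spells out. The helper `gcs_paths` is hypothetical; the default model directory name matches the BERT-Large, Uncased archive used here, and the returned keys mirror the `run_squad.py` flags that consume them.

```python
def gcs_paths(bucket, output_dir_name, model_dir_name="uncased_L-24_H-1024_A-16"):
    """Build the gs:// locations for the pre-trained model and the output directory."""
    if not bucket:
        raise ValueError("Must specify an existing GCS bucket name")
    bucket_url = "gs://{}".format(bucket)
    model_url = "{}/{}".format(bucket_url, model_dir_name)
    return {
        "vocab_file": model_url + "/vocab.txt",
        "bert_config_file": model_url + "/bert_config.json",
        "init_checkpoint": model_url + "/bert_model.ckpt",
        "output_dir": "{}/{}".format(bucket_url, output_dir_name),
    }


for name, path in gcs_paths("fin-disclosures-nlp", "bert_squad").items():
    print("{:>18}: {}".format(name, path))
```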




In [None]:
BUCKET = 'fin-disclosures-nlp' #@param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'
output_dir_name = 'bert_squad' #@param {type:"string"}
BUCKET_NAME = 'gs://{}'.format(BUCKET)
OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET,output_dir_name)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))

***** Model output directory: gs://fin-disclosures-nlp/bert_squad *****


### **Clone the BERT github repository**


> Clone the BERT GitHub repository, which provides the `run_squad.py` script used for fine-tuning and prediction.



In [None]:
!git clone https://github.com/google-research/bert.git

Cloning into 'bert'...
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 317.20 KiB | 594.00 KiB/s, done.
Resolving deltas: 100% (185/185), done.


In [None]:
cd bert

/content/bert


In [None]:
!ls

CONTRIBUTING.md		    predicting_movie_reviews_with_bert_on_tf_hub.ipynb
create_pretraining_data.py  README.md
extract_features.py	    requirements.txt
__init__.py		    run_classifier.py
LICENSE			    run_classifier_with_tfhub.py
modeling.py		    run_pretraining.py
modeling_test.py	    run_squad.py
multilingual.md		    sample_text.txt
optimization.py		    tokenization.py
optimization_test.py	    tokenization_test.py


# **Fine-Tuning**

### **Change Runtime to TPU**
Make sure a TPU is enabled from here on (and disabled otherwise, to save quota) so that training is sped up. Note that changing the runtime type will probably reset the notebook, so the cells in **Initialization** have to be run again.
> On the main menu, click on **Runtime** and select **Change runtime type**. Set "**TPU**" as the hardware accelerator.


### **Download the SQuAD 2.0 Dataset**

In [None]:
# Download the SQUAD train and dev dataset
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2020-07-16 09:25:54--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.108.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json.1’


2020-07-16 09:25:55 (47.9 MB/s) - ‘train-v2.0.json.1’ saved [42123633/42123633]

--2020-07-16 09:25:57--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json.1’


2020-07-16 09:25:57 (17.6 MB/s) - ‘dev-v2.0.json.1’ saved [4370528/4370528]
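
Before fine-tuning, it can be useful to check how many questions a SQuAD 2.0 file contains and how many of them are marked unanswerable. The sketch below walks the nested `data` / `paragraphs` / `qas` structure; the function name `squad_stats` is hypothetical and the inline dataset is a toy stand-in for the downloaded file.

```python
import json


def squad_stats(dataset):
    """Return (total questions, unanswerable questions) for a SQuAD-style dict."""
    total = 0
    impossible = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                total += 1
                if qa.get("is_impossible", False):
                    impossible += 1
    return total, impossible


# Toy stand-in. For the real file, load it first, for example:
#   with open("train-v2.0.json") as f:
#       dataset = json.load(f)
toy_dataset = {
    "version": "v2.0",
    "data": [{
        "title": "toy",
        "paragraphs": [{
            "context": "Emissions fell in 2019.",
            "qas": [
                {"id": "a", "question": "When did emissions fall?", "is_impossible": False},
                {"id": "b", "question": "Who is the CEO?", "is_impossible": True},
            ],
        }],
    }],
}

total, impossible = squad_stats(toy_dataset)
print("questions: {}, unanswerable: {}".format(total, impossible))
```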



### **Set up your TPU environment**
*   Verify that you are connected to a TPU device
*   Obtain the TPU address that is used at the time of fine-tuning
*   Perform Google Authentication to access your bucket
*   Upload your credentials to TPU to access your GCS bucket
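
The TPU address is derived from the `COLAB_TPU_ADDR` environment variable that Colab sets on TPU runtimes. As a small sketch, the lookup can be wrapped in a helper that returns `None` instead of failing when no TPU is attached; the name `colab_tpu_address` is hypothetical.

```python
import os


def colab_tpu_address(environ=os.environ):
    """Return the grpc:// address of the Colab TPU, or None if no TPU is attached."""
    addr = environ.get("COLAB_TPU_ADDR")
    return "grpc://" + addr if addr else None


address = colab_tpu_address()
print(address or "Not connected to a TPU runtime; change the runtime type first.")
```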

In [None]:
import datetime
import json
import os
import pprint
import random
import string
import sys

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is => ', TPU_ADDRESS)

with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TensorFlow 1.x selected.
1.15.2


AssertionError: ignored

## Training

> Below is the command to run the training. To train on the TPU, set the following two flags: `use_tpu` must be true, and `tpu_name` must be the TPU address obtained above.

1.   --use_tpu=True
2.   --tpu_name=YOUR_TPU_ADDRESS
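
If you prefer to assemble the command in Python rather than typing it by hand, the flags can be kept in a dictionary and rendered into a command line. This is only a sketch: `run_squad_command` is a hypothetical helper, and the TPU address and bucket paths below are placeholders to be replaced with your own values.

```python
import shlex


def run_squad_command(flags):
    """Render a dict of flags as a run_squad.py command line."""
    args = ["python", "run_squad.py"]
    args += ["--{}={}".format(name, value) for name, value in flags.items()]
    return " ".join(shlex.quote(arg) for arg in args)


tpu_address = "grpc://10.0.0.2:8470"  # placeholder; use the address of your session
flags = {
    "do_train": True,
    "train_file": "train-v2.0.json",
    "train_batch_size": 24,
    "learning_rate": 3e-5,
    "num_train_epochs": 2.0,
    "use_tpu": True,
    "tpu_name": tpu_address,
    "output_dir": "gs://your-bucket/bert_squad",  # placeholder bucket
}
print(run_squad_command(flags))
```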





In [None]:
!python run_squad.py \
  --vocab_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_model.ckpt \
  --do_train=True \
  --train_file=train-v2.0.json \
  --do_predict=True \
  --predict_file=dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --use_tpu=True \
  --tpu_name=$TPU_ADDRESS \
  --max_seq_length=384 \
  --doc_stride=128 \
  --version_2_with_negative=True \
  --output_dir=$OUTPUT_DIR

Streaming output truncated to the last 5000 lines.
I0714 13:44:02.434313 140396581336960 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0714 13:44:02.451616 140396581336960 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0714 13:44:02.451881 140396581336960 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0714 13:44:02.468445 140396581336960 tpu_estimator.py:600] Enqueue next (1) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1) batch(es) of data from outfeed.
I0714 13:44:02.468699 140396581336960 tpu_estimator.py:604] Dequeue next (1) batch(es) of data from outfeed.
INFO:tensorflow:Enqueue next (1) batch(es) of data to infeed.
I0714 13:44:02.486172 140396581336960 tpu_estimator.py:600] Enqueue next (1) bat
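
The final checkpoint is named after the number of training steps, which is why the prediction step in this notebook loads `model.ckpt-10859`. The arithmetic can be sketched as follows, under two assumptions: that the SQuAD 2.0 training file contains 130,319 questions, and that the step count is the number of questions divided by the batch size, multiplied by the number of epochs and truncated to an integer. If you change the batch size or the number of epochs, the checkpoint name changes accordingly.

```python
def num_train_steps(num_examples, train_batch_size, num_train_epochs):
    """Number of optimizer steps: examples / batch size * epochs, truncated."""
    return int(num_examples / train_batch_size * num_train_epochs)


# Assumption: published size of the SQuAD 2.0 training set.
SQUAD_V2_TRAIN_QUESTIONS = 130319

steps = num_train_steps(SQUAD_V2_TRAIN_QUESTIONS, train_batch_size=24, num_train_epochs=2.0)
print("model.ckpt-{}".format(steps))  # model.ckpt-10859
```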

# Inference

Make sure to reset/disconnect the notebook and disable the GPU/TPU to reduce quota consumption, then run from here if you only want to do inference.

In [None]:
%tensorflow_version 1.x
import tensorflow as tf

from google.colab import auth
auth.authenticate_user()

BUCKET = 'fin-disclosures-nlp' #@param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'
output_dir_name = 'bert_squad' #@param {type:"string"}
BUCKET_NAME = 'gs://{}'.format(BUCKET)
OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET,output_dir_name)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))


TensorFlow 1.x selected.
***** Model output directory: gs://fin-disclosures-nlp/bert_squad *****


### **Create Testing File**


> We create input_file.json as a blank JSON file and then write the data to it in SQuAD format.


*   **touch** is used to create a file
*   **%%writefile** is used to write a file in the colab



> You can pass your own questions and context in the file below.
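
Writing the nested JSON by hand is error-prone once there are many passages. As an alternative, the structure can be generated from a list of (context, questions) pairs. This is a sketch only: `build_squad_input` is a hypothetical helper, the sequential ids are an arbitrary choice, and `is_impossible` is set to `False` as a placeholder since it is not known at prediction time.

```python
import json


def build_squad_input(passages, title="your_title"):
    """Turn (context, [questions]) pairs into a SQuAD-style dict for prediction."""
    paragraphs = []
    counter = 0
    for context, questions in passages:
        qas = []
        for question in questions:
            counter += 1
            qas.append({
                "question": question,
                "id": "q{}".format(counter),  # ids must be unique across the file
                "is_impossible": False,       # placeholder, unknown at prediction time
            })
        paragraphs.append({"qas": qas, "context": context})
    return {"version": "v2.0", "data": [{"title": title, "paragraphs": paragraphs}]}


passages = [
    (
        "By 2050, we have made the commitment to bring CO2 emissions to half of 2005 levels.",
        ["What is the reduction commitment?"],
    ),
]
squad_input = build_squad_input(passages)
print(json.dumps(squad_input, indent=2))
# To save it for run_squad.py, use json.dump(squad_input, f) on an opened input_file.json.
```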


In [None]:
!touch input_file.json

In [None]:
%%writefile input_file.json
{
    "version": "v2.0",
    "data": [
        {
            "title": "your_title",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "What guidelines is the company supporting?",
                            "id": "q1",
                            "is_impossible": ""
                        },
                        {
                            "question": "Are the TCFD recommendations supported?",
                            "id": "q2",
                            "is_impossible": ""
                        }
                    ],
                    "context": "Climate change has been identified as one of the greatest risks to the future of Nestlé. The Group adopted the Taskforce for Climate-related Financial Disclosures (TCFD) recommendations and began implementation in 2019."                
                 },
                 {
                    "qas": [
                        {
                            "question": "Is there a net zero emissions target?",
                            "id": "q3",
                            "is_impossible": ""
                        }
                    ],
                    "context": "The impacts of climate change are already apparent. It is a global issue that will affect everyone. We are innovating to reduce our environmental footprint, in line with our commitment to achieve net zero carbon emissions by 2050. This supports the ambitious 1.5° C target outlined in the Intergovernmental Panel on Climate Change’s latest report. To thrive, businesses must be resilient to the risks of climate change. We conducted a high-level assessment of physical and transitional risks for several of our key commodity supply chains using a number of climate scenarios."                
                 },
                 {
                     "qas": [
                             {
                                 "question": "What is the reduction commitment?",
                                  "id": "q4",
                              "is_impossible": ""
                             }
                     ],
                  "context": "By 2050, we have made the commitment to bring CO2 emissions to half of 2005 levels."
                 },
                 {
                     "qas": [
                             {
                                 "question": "What is the board overseeing or monitoring?",
                                  "id": "1-AC1",
                              "is_impossible": ""
                             }, {
                                 "question": "How many times does the board meet?",
                                 "id": "1-AC1.1",
                                 "is_impossible": ""
                             }
                     ],
                  "context": "The highest governing body at Allianz when it comes to sustainability-related issues is the Group ESG Board (ESG = Environment, Social, and Governance). Established in 2012, it is composed of three Allianz SE board members and meets quarterly. The Group ESG Board is responsible for the whole Corporate Responsibility agenda, including climate-related topics, the integration of ESG into our business lines and into the core processes related to insurance and investment, and the Allianz Group’s corporate citizenship activities."
                 },
                                  {
                     "qas": [
                             {
                                 "question": "What is the board overseeing or monitoring?",
                                  "id": "2-AC1",
                              "is_impossible": ""
                             }, {
                                 "question": "How many times does the board meet?",
                                 "id": "2-AC1.1",
                                 "is_impossible": ""
                             }
                     ],
                  "context": "Our Nomination and Sustainability Committee, chaired by our Lead Independent Director, evaluates Board composition, structure and succession planning. It assesses candidates for nomination to the Board in the coming years. Importantly, this Committee reviews all aspects of our environmental and social sustainability including our responses to climate change."
                 }

            ]
        }
    ]
}

Overwriting input_file.json


### **Prediction**


> Below is the command to run predictions on your own data: edit input_file.json to provide your own paragraphs and questions, then execute the command.



In [None]:
!python run_squad.py \
  --vocab_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=$OUTPUT_DIR/model.ckpt-10859 \
  --do_train=False \
  --max_query_length=30  \
  --do_predict=True \
  --predict_file=input_file.json \
  --predict_batch_size=8 \
  --n_best_size=3 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=output/




W0716 21:54:31.947401 139802782291840 module_wrapper.py:139] From run_squad.py:1127: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W0716 21:54:31.947654 139802782291840 module_wrapper.py:139] From run_squad.py:1127: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W0716 21:54:31.947840 139802782291840 module_wrapper.py:139] From /content/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W0716 21:54:33.088611 139802782291840 module_wrapper.py:139] From run_squad.py:1133: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related op
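
Once prediction has finished, the output directory contains `nbest_predictions.json`, which maps each question id to a ranked list of candidate answers with `text` and `probability` fields. A minimal sketch for reducing this to one answer per question follows. The helper `top_answers` and the 0.5 probability cut-off are assumptions made for illustration, and the sample values are rounded from this notebook's own run.

```python
def top_answers(nbest, min_probability=0.5):
    """Map each question id to its most probable answer text.

    Returns None for a question when its best candidate falls below min_probability.
    """
    best = {}
    for qid, candidates in nbest.items():
        if not candidates:
            best[qid] = None
            continue
        top = max(candidates, key=lambda c: c["probability"])
        best[qid] = top["text"] if top["probability"] >= min_probability else None
    return best


# Sample shaped like nbest_predictions.json (values rounded from this notebook's run).
sample = {
    "q1": [
        {"text": "Taskforce for Climate-related Financial Disclosures", "probability": 0.78},
        {"text": "the Taskforce for Climate-related Financial Disclosures", "probability": 0.14},
    ],
}
for qid, answer in top_answers(sample).items():
    print(qid, "->", answer)
```

Raising `min_probability` trades coverage for precision: fewer questions receive an answer, but the ones that do are backed by a more confident prediction.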

In [None]:
import json

with open('./output/nbest_predictions.json') as f:
  results = json.load(f)
  print(json.dumps(results, indent=2))

{
  "q1": [
    {
      "text": "Taskforce for Climate-related Financial Disclosures",
      "probability": 0.7833332860307294,
      "start_logit": 4.242916584014893,
      "end_logit": 5.492935657501221
    },
    {
      "text": "the Taskforce for Climate-related Financial Disclosures",
      "probability": 0.14217628643768268,
      "start_logit": 2.536426067352295,
      "end_logit": 5.492935657501221
    },
    {
      "text": "Taskforce for Climate-related Financial Disclosures (TCFD) recommendations",
      "probability": 0.074490427531588,
      "start_logit": 4.242916584014893,
      "end_logit": 3.140048027038574
    }
  ],
  "q2": [
    {
      "text": "The Group adopted the Taskforce for Climate-related Financial Disclosures (TCFD) recommendations and began implementation in 2019.",
      "probability": 0.5201596732063127,
      "start_logit": 0.5909603238105774,
      "end_logit": 1.184169054031372
    },
    {
      "text": "began implementation in 2019.",
      "probabi