# CRO Labelling using Colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dafrie/fin-disclosures-nlp/blob/master/notebooks/ColabLabelling.ipynb)

This Jupyter/Colab notebook is used for labelling climate-related risks and opportunities. The data is stored in a Google Cloud Storage bucket.


## Codebook

This codebook serves as guidance for labelling climate-related risks and opportunities (CRO) in financial disclosures.
The overarching goal is to label risk disclosures found in the various sections of annual reports according to the Task Force on Climate-related Financial Disclosures (TCFD) categorization of CROs.

Since the overall dataset of annual reports is too large to label fully, a random sample is drawn for this labelling process.
To ensure labelling consensus and allow for reliability checks, at least some of the individual reports in the random sample should be labelled by at least two researchers. Thus, from the initial random sample, a subset of reports is selected randomly to be labelled twice.
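
The sampling scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual code: the function name `draw_labelling_sample`, the sample sizes, the seed and the report identifiers are all hypothetical.

```python
import random


def draw_labelling_sample(report_ids, sample_size, double_label_size, seed=0):
    """Draw a random sample of reports and a random subset to be labelled twice.

    All names and sizes are illustrative assumptions.
    """
    if double_label_size > sample_size:
        raise ValueError("double_label_size must not exceed sample_size")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = rng.sample(list(report_ids), sample_size)
    # The double-labelled reports are drawn from the sample itself
    double_labelled = set(rng.sample(sample, double_label_size))
    return sample, double_labelled


# Hypothetical report identifiers
reports = [f"report_{i:03d}" for i in range(100)]
sample, double_labelled = draw_labelling_sample(
    reports, sample_size=20, double_label_size=5
)
```

Drawing the double-labelled subset from the sample (rather than from the full dataset) guarantees that every report labelled twice is also part of the main labelling set.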



### Main report labelling process
1. The next PDF report is automatically loaded, parsed and preprocessed.
2. Based on an [initial keyword list](../data/keyword_vocabulary.txt), relevant pages are searched and displayed.
3. The original PDF report is also opened automatically for cross-checking/validation purposes.
    * Quickly scan the PDF report to validate that steps 1 and 2 processed the report properly, e.g. that the parsing was successful and the keyword search did not miss important parts of the report.
4. Going through the filtered pages, the content of each page is displayed and the rows that contain a keyword are highlighted. The researcher should nevertheless also check the previous and next page for relevant information on CROs.
5. Each paragraph (separated by a single empty line) that contains relevant information should then be read and categorized according to the detailed definitions below.
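
Steps 2 and 4 can be pictured as a keyword filter over the parsed page text. The sketch below is an assumption-laden illustration, not the implementation used by the notebook: the function names `find_relevant_pages` and `pages_to_review`, the example pages and keywords, and the choice of case-insensitive substring matching are all hypothetical.

```python
def find_relevant_pages(pages, keywords):
    """Return {page_number: [indices of lines containing a keyword]}.

    `pages` is a list of page texts (index 0 is page 1). Matching is a plain
    case-insensitive substring search, which is an assumption of this sketch.
    """
    lowered = [kw.lower() for kw in keywords]
    hits = {}
    for page_no, text in enumerate(pages, start=1):
        matched_lines = [
            i
            for i, line in enumerate(text.splitlines())
            if any(kw in line.lower() for kw in lowered)
        ]
        if matched_lines:
            hits[page_no] = matched_lines
    return hits


def pages_to_review(hits, n_pages):
    """Add the previous and next page of every hit, mirroring the check in step 4."""
    review = set()
    for page_no in hits:
        for candidate in (page_no - 1, page_no, page_no + 1):
            if 1 <= candidate <= n_pages:
                review.add(candidate)
    return sorted(review)


# Made-up example pages and keywords
pages = [
    "Letter to shareholders\nRevenue grew this year.",
    "Risk factors\nA future carbon tax could raise our costs.\nOther risks.",
    "Notes to the accounts",
    "Outlook\nNothing to report.",
]
keywords = ["carbon tax", "climate change"]
hits = find_relevant_pages(pages, keywords)
review = pages_to_review(hits, n_pages=len(pages))
```

Here only page 2 contains a keyword, so pages 1 to 3 end up in the review list.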

The two main variables of interest are **CRO_TYPE** and **CRO_SUB_TYPE**. Both follow the two-level TCFD categorization.
The paragraph in question should first be read in light of the first level (**CRO_TYPE**), with an emphasis on whether the passage takes a _forward-looking_ perspective and is not talking about a possible impact from e.g. past regulations. If the passage is not forward-looking, or is too generic to be categorized at the first level, the paragraph should still be flagged (see below) but should receive an _Empty/Unknown_ value.


* Transition Risk (**TR**): Risks from the _transition_ to a lower-carbon economy
    * Policy and Legal (**POLICY**): Carbon tax, emission reporting policy changes, regulation on products and services, litigation risk
    * Technology (**TECH**): Obsolescence of existing products & services, unsuccessful investments in new technologies, costs to transition to lower emission technology
    * Market (**MARKET**): Changing customer behavior, uncertainty in market prices, increased cost of raw materials & natural resources
    * Reputation (**REPUT**): Changing customer preferences, stigmatization of sector, hiring risk, increased stakeholder pressure
* Physical Risk (**PR**): Risks from the _physical_ impacts of climate change
    * Acute (**ACUTE**): Increased severity of extreme weather events, cyclones, floods, heat waves
    * Chronic (**CHRON**): Changes in precipitation patterns (droughts), rising mean temperatures, rising sea levels
* Opportunity (**OP**): Opportunities that result from _efforts_ in mitigation of/adaptation to climate change
    * Resource Efficiency (**EFFI**): More efficient modes of transport, production processes, recycling, efficient buildings, reduction in water usage
    * Energy Source (**ENERGY**): Use of lower emission sources of energy, use of supportive policy incentives, new technologies, decentralization
    * Products and Services (**PRODUCTS**): Development or expansion of low emission goods and services, development of products that benefit from changes induced by climate change (in market preferences, technological changes)
    * Markets (**MARKETS**): New markets, incentives
    * Resilience (**RESILIENCE**): Diversification, decentralization
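
The two-level categorization above maps naturally onto a small lookup table that can be used to sanity-check labels. This is a hedged sketch: the codes are taken from the list above, but the name `CRO_TAXONOMY`, the helper `validate_label`, and the use of `None` to encode the _Empty/Unknown_ value (at either level) are assumptions made for illustration.

```python
# Two-level taxonomy from the codebook: CRO_TYPE -> allowed CRO_SUB_TYPE codes
CRO_TAXONOMY = {
    "TR": {"POLICY", "TECH", "MARKET", "REPUT"},
    "PR": {"ACUTE", "CHRON"},
    "OP": {"EFFI", "ENERGY", "PRODUCTS", "MARKETS", "RESILIENCE"},
}


def validate_label(cro_type=None, cro_sub_type=None):
    """Check a (CRO_TYPE, CRO_SUB_TYPE) pair against the taxonomy.

    None stands for Empty/Unknown (an assumed encoding), e.g. for flagged
    paragraphs that are not forward-looking or too generic.
    """
    if cro_type is None:
        # Without a first-level type there cannot be a second-level sub type
        return cro_sub_type is None
    if cro_type not in CRO_TAXONOMY:
        return False
    return cro_sub_type is None or cro_sub_type in CRO_TAXONOMY[cro_type]
```

A check like this catches easy slips such as pairing the transition-risk code `MARKET` with the opportunity type `OP`, where the opportunity code `MARKETS` was intended.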

### TCFD-Examples
![Climate Risk Examples](https://raw.githubusercontent.com/dafrie/fin-disclosures-nlp/master/data/labeling/TCFD_CR_Examples.png "Climate-related Risk Examples")
![Climate Opportunity Examples](https://raw.githubusercontent.com/dafrie/fin-disclosures-nlp/master/data/labeling/TCFD_OP_Examples.png "Climate-related Opportunities Examples")

### TODO

- "ID" Instruction for spanning labels
- "Indirect" label for weak/unclear cases --> Differentiation between first order, second order

## Setup / Initialization

In [4]:
# Get the project code
!git clone https://github.com/dafrie/fin-disclosures-nlp.git

# Install the extra spacy model
!python -m spacy download en_core_web_md
# NOTE: Need to restart the runtime afterwards!
exit()

fatal: destination path 'fin-disclosures-nlp' already exists and is not an empty directory.
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [1]:
%%bash
# Install gcsfuse to mount the bucket (the %%bash cell magic must be the first line of the cell)
echo "deb http://packages.cloud.google.com/apt gcsfuse-`lsb_release -c -s` main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get -y -q update
sudo apt-get -y -q install gcsfuse

deb http://packages.cloud.google.com/apt gcsfuse-bionic main
OK
Get:1 http://packages.cloud.google.com/apt gcsfuse-bionic InRelease [3,724 B]
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Ign:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:8 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:10 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:11 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:13 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Get:14 http://archive.ubuntu.com



In [2]:
# Bucket initialization
BUCKET = 'fin-disclosures-nlp' #@param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'

from google.colab import auth
auth.authenticate_user()

!mkdir -p /content/bucket
# Mount the bucket configured above instead of a hard-coded name
!gcsfuse --implicit-dirs --limit-bytes-per-sec -1 --limit-ops-per-sec -1 {BUCKET} /content/bucket

Using mount point: /content/bucket
Opening GCS connection...
Opening bucket...
Mounting file system...
daemonize.Run: readFromProcess: sub-process: mountWithArgs: mountWithConn: Mount: mount: running fusermount: exit status 1

stderr:
fusermount: mountpoint is not empty
fusermount: if you are sure this is safe, use the 'nonempty' mount option
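
The `fusermount: mountpoint is not empty` message in the output above suggests the target directory already had content when the cell ran, for example because the bucket was still mounted from an earlier run (a plausible cause, not a confirmed one). A small pre-check along the following lines could make that state visible before calling `gcsfuse`. The helper name `mount_point_status` and its return values are hypothetical.

```python
import os
import tempfile


def mount_point_status(path):
    """Classify a directory before mounting.

    Returns 'missing', 'mounted', 'empty' or 'non-empty'.
    """
    if not os.path.isdir(path):
        return "missing"
    if os.path.ismount(path):
        return "mounted"
    return "empty" if not os.listdir(path) else "non-empty"


# Demonstration on a throwaway directory rather than /content/bucket
with tempfile.TemporaryDirectory() as tmp:
    status_empty = mount_point_status(tmp)
    open(os.path.join(tmp, "leftover.txt"), "w").close()
    status_nonempty = mount_point_status(tmp)
# The temporary directory has been deleted at this point
status_missing = mount_point_status(os.path.join(tmp, "gone"))
```

In the notebook, one would call `mount_point_status("/content/bucket")` and run the mount command only when the result is `'empty'`.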



In [3]:
# Config
FIRM_METADATA = "/content/bucket/data/Firm_Metadata.csv" #@param {type:"string"}
DATA_INPUT_PATH = "/content/bucket/data/annual_reports/" #@param {type:"string"}
MASTER_DATA_PATH = "/content/bucket/data/annual_reports/Firm_AnnualReport_TS.csv" #@param {type:"string"}
LABEL_OUTPUT_FN = 'Firm_AnnualReport_Labels_TS.pkl' #@param {type:"string"}

%cd fin-disclosures-nlp/
!git pull

/content/fin-disclosures-nlp
Already up to date.


## Labelling

In [4]:
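import importlib
import pandas as pd

# Reload the project modules first so that code fetched by `git pull` is picked up,
# then import the class from the freshly reloaded module
import data
import data.custom_widgets
importlib.reload(data)
importlib.reload(data.custom_widgets)
from data.custom_widgets import ReportsLabeler

labeler = ReportsLabeler(
    files_input_dir=DATA_INPUT_PATH,
    master_input_path=MASTER_DATA_PATH,
    label_output_fn=LABEL_OUTPUT_FN,
)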

Output()

Output()

HBox(children=(Button(description='Previous relevant page', style=ButtonStyle()), BoundedIntText(value=1, desc…

Output()