<a href="https://colab.research.google.com/github/dafrie/fin-disclosures-nlp/blob/master/notebooks/ColabLabelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CRO Labelling using Colab 

This is a Jupyter/Colab notebook for labelling climate-related risks and opportunities (CRO). The data is stored in a Google Cloud Storage bucket.


## Codebook

This codebook serves as guidance for labelling climate-related risks and opportunities (CRO) in financial disclosures.
The overarching goal is to label risk disclosures found in the various sections of annual reports according to the categorization of CROs by the Task Force on Climate-related Financial Disclosures (TCFD).

Since the overall dataset of annual reports is too large to label fully, a random sample is drawn for this labelling process.
To ensure labelling consensus and allow for reliability checks, some of the reports in the random sample should be labelled by at least two researchers. To this end, a subset of the initial random sample is randomly selected to be labelled twice.
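
The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the project's actual sampling code: the function name `draw_labelling_sample`, the report identifiers, the sample size and the share of double-labelled reports are all assumptions chosen for the example.

```python
import random

def draw_labelling_sample(report_ids, sample_size, double_label_share, seed=42):
    """Draw a random sample of reports and a random subset to be labelled twice."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    # Sort first so the result does not depend on the input order
    sample = rng.sample(sorted(report_ids), sample_size)
    n_double = max(1, round(sample_size * double_label_share))
    double_labelled = set(rng.sample(sample, n_double))
    return sample, double_labelled

# Hypothetical report identifiers, for illustration only
reports = [f"report_{i:03d}" for i in range(200)]
sample, double_labelled = draw_labelling_sample(
    reports, sample_size=40, double_label_share=0.25
)
print(len(sample), len(double_labelled))
```

Fixing the seed keeps the assignment stable across sessions, so every researcher sees the same set of reports marked for double labelling.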



### Main report labelling process
1. The next unlabelled PDF report is automatically loaded, parsed and preprocessed.
2. Based on an [initial keyword list](../data/keyword_vocabulary.txt), the report is searched for relevant pages (matching unigrams and bigrams after lemmatization and stopword removal), and the matching pages are displayed.
3. Only when running locally: The original PDF report is also opened automatically for cross-checking/validation purposes.
    * Quickly scan the PDF report to validate that steps 1 and 2 processed the report properly, e.g. that the parsing was successful and the keyword search did not miss important parts of the report.
4. While going through the filtered pages, the paragraphs of each page are displayed and the paragraphs that contain a keyword are highlighted. Since relevant information on CRO is often found in the immediate proximity of keyword hits, the previous and next pages are also added to the navigation.
5. Each paragraph on these pages should then be read and categorized according to the detailed definitions below.
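
The page filtering in steps 2 and 4 can be sketched in plain Python. This is a simplified illustration under stated assumptions: the stopword list, the example pages and the two keywords are made up, lemmatization is omitted (the notebook installs a spaCy model for its own pipeline), and the helper names are hypothetical rather than taken from the project code.

```python
import re

# Tiny illustrative stopword list, not the one used by the project
STOPWORDS = {"the", "of", "and", "to", "a", "in", "on", "for", "is", "are"}

def tokenize(text):
    """Lowercase, split into word tokens and drop stopwords (no lemmatization)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def ngrams(tokens):
    """Return the set of unigrams and bigrams of a token list."""
    bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return set(tokens) | bigrams

def relevant_pages(pages, keywords):
    """Return the page indices with a keyword hit, and the hits plus neighbours."""
    hits = {i for i, text in enumerate(pages) if ngrams(tokenize(text)) & keywords}
    with_neighbours = {
        j for i in hits for j in (i - 1, i, i + 1) if 0 <= j < len(pages)
    }
    return sorted(hits), sorted(with_neighbours)

# Made-up pages and keywords, for illustration only
pages = [
    "Letter to shareholders.",
    "Our operations are exposed to a future carbon tax.",
    "Financial statements.",
    "Notes to the accounts.",
    "Remuneration report.",
    "Rising sea level may affect coastal plants.",
]
keywords = {"carbon tax", "sea level"}
hits, to_review = relevant_pages(pages, keywords)
print(hits)       # [1, 5]
print(to_review)  # [0, 1, 2, 4, 5]
```

Page 3 is skipped because it neither contains a keyword nor borders a page that does, which mirrors the navigation described in step 4.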

#### CRO Labelling
The two main variables of interest are **CRO_TYPE** and **CRO_SUB_TYPE**, which follow the two levels of the TCFD categorization.
Each paragraph should first be read in light of the first level (**CRO_TYPE**), paying particular attention to whether the passage takes a _forward-looking_ perspective rather than describing, e.g., the impact of past regulations. If the passage is not forward-looking, or is too generic to be categorized on the first level, the paragraph should still be flagged (i.e. added to the list) but should receive _Empty/Unknown_ values for both variables.

* Transition Risk (**TR**): Risks from the _transition_ to a lower-carbon economy
    
    Note: Difference between first order/second order
    * Policy and Legal (**POLICY**): Carbon tax, emission reporting policy changes, regulation on products and services, litigation risk
    * Technology (**TECH**): Obsolescence of existing products & services, unsuccessful investments in new technologies, costs of transitioning to lower-emission technology
    * Market (**MARKET**): Changing customer behavior, uncertainty in market prices, increased cost of raw materials & natural resources
    * Reputation (**REPUT**): Changing customer preferences, stigmatization of sector, hiring risk, increased stakeholder pressure
    
* Physical Risk (**PR**): Risks from the _physical_ impacts of climate change

    Note: Include also "indirect" physical risks, i.e. when climate change is not mentioned directly but reference to one of 
    * Acute (**ACUTE**): Increased severity of extreme weather events, cyclones, floods, heat waves
    * Chronic (**CHRON**): Changes in precipitation patterns (droughts), rising mean temperatures, rising sea levels
    
* Opportunity (**OP**): Opportunities that result from _efforts_ in mitigation of or adaptation to climate change
    * Resource Efficiency (**EFFI**): More efficient modes of transport, production processes, recycling, efficient buildings, reduction in water usage
    * Energy Source (**ENERGY**): Use of lower-emission sources of energy, use of supportive policy incentives, new technologies, decentralization
    * Products and Services (**PRODUCTS**): Development or expansion of low-emission goods and services, development of products that benefit from changes induced by climate change (in market preferences, technological changes)
    * Markets (**MARKETS**): New markets, incentives
    * Resilience (**RESILIENCE**): Diversification, decentralization
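
The two-level categorization above can be written down as a small lookup table, which makes it easy to check that a label pair is consistent. This is only a sketch: the names `CRO_TAXONOMY` and `is_valid_label` are hypothetical and not part of the project code, `None` is used here to stand for the _Empty/Unknown_ value, and allowing an empty sub-type next to a set type is an assumption.

```python
# First level (CRO_TYPE) mapped to its second-level codes (CRO_SUB_TYPE)
CRO_TAXONOMY = {
    "TR": {"POLICY", "TECH", "MARKET", "REPUT"},
    "PR": {"ACUTE", "CHRON"},
    "OP": {"EFFI", "ENERGY", "PRODUCTS", "MARKETS", "RESILIENCE"},
}

def is_valid_label(cro_type=None, cro_sub_type=None):
    """Check a (CRO_TYPE, CRO_SUB_TYPE) pair against the codebook categories."""
    if cro_type is None:
        # A flagged but uncategorized paragraph: both values stay empty
        return cro_sub_type is None
    if cro_type not in CRO_TAXONOMY:
        return False
    # Assumption: the sub-type may stay empty when only the first level is clear
    return cro_sub_type is None or cro_sub_type in CRO_TAXONOMY[cro_type]

print(is_valid_label("TR", "POLICY"))   # True
print(is_valid_label("TR", "MARKETS"))  # False, MARKETS belongs to OP
print(is_valid_label(None, None))       # True
```

Note that **MARKET** (transition risk) and **MARKETS** (opportunity) differ by a single letter, so a check like this can catch an easy mix-up.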

The two TCFD excerpts in the TCFD-Examples section below provide some inspiration. The categories of the second level, **CRO_SUB_TYPE**, often overlap; see the edge cases below for how to handle this.

##### Edge cases

- **Problem:** A paragraph contains multiple CRO on either level.
    
    **Solution:** Add multiple labels (by pressing the paragraph button) and label each accordingly. If possible, add a comment containing a *keyword:\<triggering_keyword>* tag with the (subjective) trigger keyword for each CRO disclosure.
- **Problem:** A CRO disclosure spans multiple paragraphs.
    
    **Solution:** Add a *cro_id:\<id>* tag with a descriptive *\<id>* that is unique within the report, e.g. *carbon_tax_eu*, to the comment field of each relevant paragraph.
- **Problem:** A CRO disclosure spans paragraphs on multiple pages.
    
    **Solution:** Same as above, link the paragraphs with an ID tag.
- **Problem:** Unclear categorization as multiple sub-categories are possible.
    
    **Solution:** The "strongest" contender should be chosen as the label; the remaining ones should be added in the comment field as *sub_types:\<CRO_SUB_TYPE2,CRO_SUB_TYPE3,...>*
- **Problem:** The CRO is very unspecific/indirect/far-fetched.
    
    **Solution:** Tag the indicator with an *indirect* tag
    
Any other issues that come up during labelling can be noted in the comment field and, if possible, recorded using a structured *tag* approach so that automated handling is possible later on.
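
To illustrate what such automated handling could look like, the sketch below parses the tags from a comment string. It is not part of the project code: the function name `parse_comment_tags` is hypothetical, and it assumes that tags are separated by whitespace, that tag values contain no whitespace, and that the *indirect* tag is written as a bare word.

```python
import re

# Matches the key:value tags described in the edge cases above
TAG_PATTERN = re.compile(r"\b(keyword|cro_id|sub_types):(\S+)")

def parse_comment_tags(comment):
    """Extract the structured tags from a free-text comment field."""
    tags = {"keyword": None, "cro_id": None, "sub_types": [], "indirect": False}
    for name, value in TAG_PATTERN.findall(comment):
        if name == "sub_types":
            # Additional sub-types are given as a comma-separated list
            tags["sub_types"] = [v for v in value.split(",") if v]
        else:
            tags[name] = value
    # Simplification: any standalone word "indirect" counts as the tag
    tags["indirect"] = re.search(r"\bindirect\b", comment) is not None
    return tags

example = "cro_id:carbon_tax_eu sub_types:MARKET,REPUT indirect"
print(parse_comment_tags(example))
```

Paragraphs that share a `cro_id` within one report could then be grouped into a single disclosure in a later processing step.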

### TCFD-Examples
![Climate Risk Examples](https://raw.githubusercontent.com/dafrie/fin-disclosures-nlp/master/data/labeling/TCFD_CR_Examples.png "Climate-related Risk Examples")
![Climate Opportunity Examples](https://raw.githubusercontent.com/dafrie/fin-disclosures-nlp/master/data/labeling/TCFD_OP_Examples.png "Climate-related Opportunities Examples")

## Setup / Initialization

In [None]:
# Get the project code
!git clone https://github.com/dafrie/fin-disclosures-nlp.git

# Install the extra spacy model
!python -m spacy download en_core_web_md
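# NOTE: The runtime needs to be restarted afterwards so the new model can be loaded.
# exit() ends the current session; continue with the next cell after reconnecting.
exit()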

In [None]:
%%bash
# Install gcsfuse to mount the bucket (the %%bash magic must be the first line of the cell)
echo "deb http://packages.cloud.google.com/apt gcsfuse-`lsb_release -c -s` main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get -y -q update
sudo apt-get -y -q install gcsfuse

In [None]:
# Bucket initialization
BUCKET = 'fin-disclosures-nlp' #@param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'

from google.colab import auth
auth.authenticate_user()

!mkdir -p /content/bucket
# Mount the bucket named in BUCKET (IPython expands {BUCKET} in shell commands)
!gcsfuse --implicit-dirs --limit-bytes-per-sec -1 --limit-ops-per-sec -1 {BUCKET} /content/bucket

In [None]:
# Config
FIRM_METADATA = "/content/bucket/data/Firm_Metadata.csv" #@param {type:"string"}
DATA_INPUT_PATH = "/content/bucket/data/annual_reports/" #@param {type:"string"}
MASTER_DATA_PATH = "/content/bucket/data/annual_reports/Firm_AnnualReport_TS.csv" #@param {type:"string"}
LABEL_OUTPUT_FN = 'Firm_AnnualReport_Labels_TS.pkl' #@param {type:"string"}

%cd fin-disclosures-nlp/
!git pull
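
If the bucket mount did not succeed, the configured paths will simply be missing, and this may only surface later as a confusing error in the labelling widget. A minimal sketch of a sanity check that could be run at this point; the helper `missing_paths` is hypothetical and not part of the project code.

```python
import os

def missing_paths(paths):
    """Return the entries of `paths` that do not exist on the file system."""
    return [p for p in paths if not os.path.exists(p)]

# In this notebook, the check would use the config values defined above, e.g.:
# missing_paths([FIRM_METADATA, DATA_INPUT_PATH, MASTER_DATA_PATH])
print(missing_paths([os.getcwd(), "/this/path/does/not/exist"]))
```

An empty list means all inputs are reachable; anything else points to a mount or path problem worth fixing before starting to label.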

## Labelling

In [None]:
import importlib
import pandas as pd

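import data
import data.custom_widgets
# Reload first, then import, so that ReportsLabeler refers to the reloaded class
importlib.reload(data)
importlib.reload(data.custom_widgets)
from data.custom_widgets import ReportsLabeler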

labeler = ReportsLabeler(files_input_dir=DATA_INPUT_PATH, master_input_path=MASTER_DATA_PATH, label_output_fn=LABEL_OUTPUT_FN)