# Zero-shot classifiers as a noisy label source

This notebook applies zero-shot classifiers at sentence level to parsed policy documents and treats their predictions as noisy labels.


In [1]:
# notebook-specific dependencies

!python -m pip install torch transformers 



In [2]:
from pathlib import Path

import yaml
import pandas as pd
from transformers import pipeline

from utils import Schema, ZeroShotClassifier, load_policy_dataset

In [3]:
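def flatten_list_of_lists(l: list) -> list:
    """
    Flatten one level of nesting; non-list items are kept as they are.

    [[1, 2], [3]] -> [1, 2, 3]
    [[1, 2], 3] -> [1, 2, 3]
    """

    res = []

    for item in l:
        if isinstance(item, list):
            # extend in place instead of rebuilding the list on every iteration
            res.extend(item)
        else:
            res.append(item)

    return res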


In [4]:
df = load_policy_dataset()

df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1666918 entries, 0 to 1666917
Data columns (total 4 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   policy_id    1666918 non-null  int64 
 1   policy_name  1666918 non-null  object
 2   page_id      1666918 non-null  int64 
 3   text         1666918 non-null  object
dtypes: int64(2), object(2)
memory usage: 50.9+ MB


## 1. Get class labels, download model

### 1a) Get class labels

In [6]:
SCHEMA_FOLDER = Path("../../schema/")

instruments = Schema.from_yaml_path(SCHEMA_FOLDER/"instruments.yml")
sectors = Schema.from_yaml_path(SCHEMA_FOLDER/"sectors.yml")

### 1b) Get models

Calling `pipeline` directly, as in this section, is deprecated: the model is now wrapped in `utils.ZeroShotClassifier` (see section 2).

#### typeform, distilbert MNLI

In [7]:
# typeform, distilbert MNLI: https://huggingface.co/typeform/distilbert-base-uncased-mnli
clf_bert = pipeline("zero-shot-classification", model="typeform/distilbert-base-uncased-mnli")

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


In [8]:
text = """Support around £3.6 billion of investment to upgrade around a million homes through the Energy Company Obligation (ECO), and extend support for home energy efficiency improvements until 2028 at the current level of ECO funding"""

clf_bert(text, instruments.all_keywords, threshold=0.7)

{'sequence': 'Support around £3.6 billion of investment to upgrade around a million homes through the Energy Company Obligation (ECO), and extend support for home energy efficiency improvements until 2028 at the current level of ECO funding',
 'labels': ['investment',
  'obligations',
  'funding',
  'finance',
  'interest',
  'commons',
  'resilience funding',
  'financing structure',
  'grants',
  'structures',
  'stakeholder engagement',
  'capacity-building body',
  'spatial zoning',
  'infrastructure',
  'multi-level governance',
  'organisation',
  'zoning',
  'resilient infrastructure policy',
  'equity in adaptation',
  'energy efficiency rating',
  'reporting',
  'assessment',
  'credits',
  'strategy',
  'bodies',
  'information',
  'green finance',
  'strategic body',
  'adaptation plan',
  'responsibility',
  'fiscal incentives',
  'regulatory body',
  'strategic plan',
  'advice',
  'financial flows',
  'duty to set target',
  'target',
  'mandates',
  'national strategy',


## 2. Run zero-shot classifiers on a few examples

Here we use `utils.ZeroShotClassifier` as a wrapper around the Hugging Face model.
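
The internals of `utils.ZeroShotClassifier` are not shown in this notebook. Judging only by the `predict` output further down, which consists of `(keyword, category, score)` tuples above a threshold, a wrapper of this kind could look roughly like the sketch below. Everything in it is an assumption: the class name `KeywordZeroShotSketch`, the `keyword_to_category` mapping, and the injected `score_fn` (which stands in for the model call) are hypothetical and not the actual implementation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scoring function: (text, keywords) -> {keyword: score}.
# In practice this would call the zero-shot model.
ScoreFn = Callable[[str, List[str]], Dict[str, float]]


class KeywordZeroShotSketch:
    """Illustrative stand-in for a keyword-based zero-shot wrapper."""

    def __init__(self, keyword_to_category: Dict[str, str], score_fn: ScoreFn):
        self.keyword_to_category = keyword_to_category
        self.score_fn = score_fn

    def predict(self, text: str, threshold: float) -> List[Tuple[str, str, float]]:
        # Score every keyword, keep those at or above the threshold,
        # and attach the category each keyword belongs to.
        scores = self.score_fn(text, list(self.keyword_to_category))
        hits = [
            (keyword, self.keyword_to_category[keyword], score)
            for keyword, score in scores.items()
            if score >= threshold
        ]
        # Highest-scoring keywords first.
        return sorted(hits, key=lambda hit: hit[2], reverse=True)


def stub_score_fn(text: str, keywords: List[str]) -> Dict[str, float]:
    # Toy scorer for demonstration: high score if the keyword occurs in the text.
    return {kw: (0.9 if kw in text else 0.1) for kw in keywords}


sketch = KeywordZeroShotSketch(
    {"assessment": "Monitoring and evaluation", "interest": "Finance"},
    stub_score_fn,
)
print(sketch.predict("a regular assessment of impacts", 0.8))
```

Injecting the scorer keeps the thresholding and category lookup testable without downloading a model.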

In [9]:
n_examples = 30
examples = df.sample(n_examples, random_state=99)['text'].tolist()

clf_instruments = ZeroShotClassifier(instruments)
clf_sectors = ZeroShotClassifier(sectors)

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


In [11]:
clf_instruments.predict("test text", 0.8)

[('assessment', 'Monitoring and evaluation', 0.9998759031295776),
 ('evaluation', 'Monitoring and evaluation', 0.9994389414787292),
 ('information', 'Labelling and information', 0.9839628338813782),
 ('interest', 'Finance', 0.8211902379989624)]
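
To use predictions like these as noisy labels, it helps to collect them in a flat, one-row-per-label form. The sketch below assumes only that a `predict(text, threshold)` callable returns `(keyword, category, score)` tuples as shown above; the function name `predictions_to_rows`, the `text_id` field, and the stub predictor are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Prediction = Tuple[str, str, float]


def predictions_to_rows(
    texts: Sequence[str],
    predict_fn: Callable[[str, float], List[Prediction]],
    threshold: float,
) -> List[Dict]:
    """Return one dict per (text, predicted keyword) pair."""
    rows = []
    for text_id, text in enumerate(texts):
        for keyword, category, score in predict_fn(text, threshold):
            rows.append(
                {
                    "text_id": text_id,
                    "keyword": keyword,
                    "category": category,
                    "score": score,
                }
            )
    return rows


def stub_predict(text: str, threshold: float) -> List[Prediction]:
    # Toy predictor for demonstration; a real run would pass clf_instruments.predict.
    return [("interest", "Finance", 0.9)] if "interest" in text else []


rows = predictions_to_rows(["interest rates rise", "no match here"], stub_predict, 0.8)
print(rows)
```

The resulting list can be passed straight to `pd.DataFrame(rows)`; texts with no predictions above the threshold simply contribute no rows.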

In [12]:
THRESHOLD = 0.8

for _str in examples:
    print(_str)
    print()
    print("INSTRUMENT PREDICTIONS")
    print("\n".join([f"- {pred}" for pred in clf_instruments.predict(_str, THRESHOLD)]))
    print()
    print("SECTOR PREDICTIONS")
    print("\n".join([f"- {pred}" for pred in clf_sectors.predict(_str, THRESHOLD)]))
    print("----------------------")

Sources of Energy Supply At present, Bangladesh has energy supply from both renewable and nonrenewable sources, 38 percent of which comes from biomass (Figure 3.1).

INSTRUMENT PREDICTIONS


SECTOR PREDICTIONS
- ('renewable energy', 'Energy production', 0.9283546805381775)
- ('biomass', 'Agriculture (general)', 0.915523886680603)
- ('energy availability', 'Adaptation', 0.9047046303749084)
- ('energy', 'Energy (general)', 0.8982304334640503)
- ('energy distribution', 'Energy production', 0.8556111454963684)
----------------------
To put that in real-world context, roughly 35 jobs are created for each million board feet of wood processed.

INSTRUMENT PREDICTIONS


SECTOR PREDICTIONS

----------------------
Research on the likelihood of disasters and the assessment of the likely social, economic and environmental impacts will be conducted regularly as an integral aspect of disaster preparedness and management.

INSTRUMENT PREDICTIONS
- ('scientific inquiries', 'Knowledge generation', 0.99