# Climate justice experiments
The goal of this notebook is to find out, in a rough first experiment, how well different model set-ups pick up on fairly subtle concepts such as different theories of justice.

## Loading data
This is taken more or less directly from the open data repository. For these experiments we only need a small subset, so loading the whole dataset just to cut it down to a few dozen documents is overkill. At the end of this section I therefore write the subset to a small CSV to use instead. The loading code stays here so the subset is easy to recreate if you don't have that CSV.

If you already have a `justice_subset.csv`, just set `reload_data` to `False`.
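As a small sketch of how the flag could be derived instead of set by hand: the helper below (a hypothetical name, not part of the notebook) simply checks whether the cached CSV exists.

```python
from pathlib import Path

SUBSET_CSV = "justice_subset.csv"  # same file name as used in this notebook


def needs_reload(csv_path: str = SUBSET_CSV) -> bool:
    """Return True when the cached subset is missing and has to be rebuilt."""
    return not Path(csv_path).is_file()
```

With this, `reload_data = needs_reload()` would only trigger the full download when the CSV is not there yet.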


In [1]:
reload_data = False
create_new_sample_csv = False

In [2]:
import duckdb
from huggingface_hub import snapshot_download
import tqdm as notebook_tqdm
import pandas as pd

#Using the public version of the repo for now
REPO_NAME = "ClimatePolicyRadar/all-document-text-data"
REPO_URL = f"https://huggingface.co/datasets/{REPO_NAME}"
DATA_CACHE_DIR = "../../cache"

REVISION = "main"  # Use this to set a commit hash. Recommended!



In [3]:
if reload_data:
    snapshot_download(
        repo_id=REPO_NAME,
        repo_type="dataset",
        local_dir=DATA_CACHE_DIR,
        revision=REVISION,
        allow_patterns=["*.parquet"],
    )

In [4]:
def create_db(): 
    db = duckdb.connect('data.db')  # Create a persistent database

    # Authenticate (only needed if loading a private dataset)
    # You'll need to log in using `huggingface-cli login` in your terminal first
    #db.execute("CREATE SECRET hf_token (TYPE HUGGINGFACE, PROVIDER credential_chain);")

    # Drop the existing table if it exists (necessary if you want to update the fields)
    db.execute("DROP TABLE IF EXISTS open_data")

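    # Check if table exists. Note: this is always False directly after the DROP above;
    # comment out the DROP to reuse an existing table instead of rebuilding it.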
    table_exists = db.execute("SELECT COUNT(*) FROM information_schema.tables WHERE table_name = 'open_data'").fetchone()[0] > 0

    if not table_exists:
        # Create a persistent table with only the columns we need
        db.execute("""
            CREATE TABLE open_data AS 
            SELECT 
                "document_metadata.geographies",
                "document_metadata.corpus_type_name",
                "document_metadata.publication_ts",
                "document_metadata.import_id",
                "document_metadata.translated",
                "document_metadata.source_url",
                "document_metadata.document_title",	
                "text_block.text",
                "text_block.language",
                "text_block.type",
                "text_block.index",
                "text_block.page_number"
            FROM read_parquet('{}/*.parquet')
        """.format(DATA_CACHE_DIR))

        # Create indexes for common query patterns
        db.execute("CREATE INDEX idx_language ON open_data(\"text_block.language\")")
        db.execute("CREATE INDEX idx_corpus_type ON open_data(\"document_metadata.corpus_type_name\")")
        db.execute("CREATE INDEX idx_publication_ts ON open_data(\"document_metadata.publication_ts\")")
        db.execute("CREATE INDEX idx_text_type ON open_data(\"text_block.type\")")

    return db

In [5]:
#Create a subset of fairly recent documents based on title keywords, including a few hand-picked documents
if reload_data:
    db = create_db()
    title_df = db.sql(
        """
    SELECT *
    FROM open_data
    WHERE (
        LOWER("document_metadata.document_title") LIKE '%justice%'
        OR LOWER("document_metadata.document_title") LIKE '%just transition%'
        OR LOWER("document_metadata.document_title") LIKE '%human rights%'
        OR LOWER("document_metadata.document_title") LIKE '%ethical%'
        OR LOWER("document_metadata.document_title") LIKE '%fwg inputs to the technical assessment%'
        OR LOWER("document_metadata.document_title") LIKE '%climate equity%'
        OR LOWER("document_metadata.document_title") LIKE '%a fair climate%'
        OR LOWER("document_metadata.document_title") LIKE '%reducing inequalities%'
        OR LOWER("document_metadata.document_title") LIKE '%iisd%'
        OR LOWER("document_metadata.document_title") LIKE '%climate analytics%'
        OR LOWER("document_metadata.document_title") LIKE '%inflation reduction act%'
    )
    AND "document_metadata.publication_ts" >= '2018-01-01'
    AND "document_metadata.import_id" IS NOT NULL
    AND "document_metadata.source_url" IS NOT NULL
        """
        ).to_df()


    title_df.head()

In [6]:
if reload_data:
    # title_df only exists when the data was reloaded in the cell above
    df = title_df.copy()
    if create_new_sample_csv:
        title_df.to_csv("justice_subset.csv", index=False, encoding='utf-8')
else:
    df = pd.read_csv("justice_subset.csv", encoding='utf-8')

#Let's shorten the column names
df.columns = [col.split('.')[-1] for col in df.columns]

print(f"Total nr of paras: {df.shape[0]}")
print(f"Total nr of documents: {len(df['document_title'].unique())}")
df.head()

Total nr of paras: 41944
Total nr of documents: 41


Unnamed: 0,geographies,corpus_type_name,publication_ts,import_id,translated,source_url,document_title,text,language,type,index,page_number
0,['ZAF'],Laws and Policies,2020-09-09T00:00:00Z,CCLW.document.i00000012.n0000,False,https://pccommissionflo.imgix.net/uploads/imag...,Just Transition Framework,Call,en,Text,0,0.0
1,['ZAF'],Laws and Policies,2020-09-09T00:00:00Z,CCLW.document.i00000012.n0000,False,https://pccommissionflo.imgix.net/uploads/imag...,Just Transition Framework,PRESIDENTIAL CLIMATE COMMISSION TOWARDS A JUST...,en,Text,1,0.0
2,['ZAF'],Laws and Policies,2020-09-09T00:00:00Z,CCLW.document.i00000012.n0000,False,https://pccommissionflo.imgix.net/uploads/imag...,Just Transition Framework,ERO,en,Text,2,0.0
3,['ZAF'],Laws and Policies,2020-09-09T00:00:00Z,CCLW.document.i00000012.n0000,False,https://pccommissionflo.imgix.net/uploads/imag...,Just Transition Framework,BOMBELA,en,Text,3,0.0
4,['ZAF'],Laws and Policies,2020-09-09T00:00:00Z,CCLW.document.i00000012.n0000,False,https://pccommissionflo.imgix.net/uploads/imag...,Just Transition Framework,June 2022,en,Text,4,0.0
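Shortening the column names by keeping only the part after the last dot works here, but it would silently produce duplicate names if two prefixes shared a suffix. A minimal sketch of a safer variant follows; `shorten_columns` is a hypothetical helper, not something the notebook defines.

```python
from collections import Counter


def shorten_columns(columns):
    """Strip the dotted prefix from each column name, failing loudly on collisions."""
    short = [col.split(".")[-1] for col in columns]
    duplicates = sorted(name for name, count in Counter(short).items() if count > 1)
    if duplicates:
        raise ValueError(f"Shortened column names collide: {duplicates}")
    return short


columns = [
    "document_metadata.document_title",
    "text_block.text",
    "text_block.language",
    "text_block.type",
]
print(shorten_columns(columns))
```

In the notebook this would be used as `df.columns = shorten_columns(df.columns)`.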


## Connecting to the LLMs and concept store
The LLM classifier we import from `src` is built using Pydantic AI. That gives us a lot of flexibility and many [known models](https://ai.pydantic.dev/api/models/base/#pydantic_ai.models.KnownModelName) to choose from, as long as they are covered by our API key.

In [7]:
import boto3 #AWS
import os

# Remember to log in first with: aws sso login --profile labs
session = boto3.Session(profile_name="labs", region_name="eu-west-1")
ssm = session.client("ssm")
api_key = ssm.get_parameter(Name="OPENAI_API_KEY", WithDecryption=True)["Parameter"]["Value"]
os.environ["OPENAI_API_KEY"] = api_key
print("Connected with OpenAI API key")

TokenRetrievalError: Error when retrieving token from sso: Token has expired and refresh failed

In [None]:
from src.wikibase import WikibaseSession
from src.concept import Concept
from src.classifier.llm import LLMClassifier
import nest_asyncio  # Since we're working in a notebook, we need to handle the async bits
nest_asyncio.apply()

wikibase = WikibaseSession()
flood = wikibase.get_concept('Q382') #Flood
print(flood)

classifier = LLMClassifier(flood, model_name='gpt-4-turbo')
sentence = 'Our basement is underwater.' 

classifier.predict(sentence)

flood (Q382)


[Span(text='Our basement is underwater.', start_index=16, end_index=26, concept_id='Q382', labellers=['LLMClassifier(flood, model_name="gpt-4-turbo", id=r94z3p2j)'], timestamps=[datetime.datetime(2025, 7, 17, 14, 55, 58, 462296)], id='gdqdtvdz', labelled_text='underwater')]

That works! Let's explore the basic functionality:
- By default, the LLM classifier takes all the information it can find on the concept and puts it in the prompt. We can also create a custom concept.
- We can also run predictions in batches. Sentences without a predicted span return an empty list. It is still unclear to me whether batch prediction with `nest_asyncio` might cause problems, but it seems OK so far.
- Let's also look at the default prompt and play around with it a little.
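Each returned span carries character offsets into the passage, so a quick sanity check is to slice the passage with those offsets and compare against the labelled text. The sketch below uses `SimpleSpan`, a hypothetical stand-in that only mirrors the field names visible in the printed `Span` output; it is not the project's class.

```python
from dataclasses import dataclass


@dataclass
class SimpleSpan:
    """Hypothetical stand-in with the same field names as the printed Span objects."""
    text: str
    start_index: int
    end_index: int
    labelled_text: str


def offsets_are_consistent(span: SimpleSpan) -> bool:
    """Check that slicing the passage by the offsets reproduces the labelled text."""
    return span.text[span.start_index:span.end_index] == span.labelled_text


# Values copied from the flood example output above
example = SimpleSpan(
    text="Our basement is underwater.",
    start_index=16,
    end_index=26,
    labelled_text="underwater",
)
print(offsets_are_consistent(example))
```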



In [None]:
#custom test concept
new_concept = Concept(
    preferred_label = 'Harrison',
    alternative_labels = ['Data scientist'],
    description = 'Really good at AWS',
    definition = 'The data scientist who is very good at AWS'
)

classifier = LLMClassifier(new_concept, model_name='gpt-4-turbo') 

#batch predict
sentences = ['Here we have a man who is really good at AWS',
             "His name is Harrison. He is a seasoned data scientist.",
             "Kalyan is a  nice guy too"]
[print(f"\n{c}") for c in classifier.predict_batch(sentences)]


[Span(text='Here we have a man who is really good at AWS', start_index=15, end_index=44, concept_id=None, labellers=['LLMClassifier(Harrison, model_name="gpt-4-turbo", id=34nzdgz3)'], timestamps=[datetime.datetime(2025, 7, 17, 16, 59, 42, 808248)], id='aptrmezm', labelled_text='man who is really good at AWS')]

[Span(text='His name is Harrison. He is a seasoned data scientist.', start_index=12, end_index=20, concept_id=None, labellers=['LLMClassifier(Harrison, model_name="gpt-4-turbo", id=34nzdgz3)'], timestamps=[datetime.datetime(2025, 7, 17, 16, 59, 42, 808409)], id='f7wnybwm', labelled_text='Harrison'), Span(text='His name is Harrison. He is a seasoned data scientist.', start_index=39, end_index=53, concept_id=None, labellers=['LLMClassifier(Harrison, model_name="gpt-4-turbo", id=34nzdgz3)'], timestamps=[datetime.datetime(2025, 7, 17, 16, 59, 42, 808426)], id='bhe4d45t', labelled_text='data scientist')]

[]


[None, None, None]

In [None]:
#Keep in mind the difference between the template prompt and the fully filled out system prompt
print("PROMPT TEMPLATE:")
print(classifier.system_prompt_template)
print("\n\nSYSTEM PROMPT:")
print(classifier.system_prompt)

PROMPT TEMPLATE:

You are a specialist climate policy analyst, tasked with identifying mentions of 
concepts in climate policy documents. You will mark up references to concepts with 
XML tags.

First, carefully review the following description of the concept:

<concept_description>
{concept_description}
</concept_description>

Instructions:

1. Read through each passage carefully, thinking about the concept.
2. Identify any mentions of the concept, including direct references and related terms.
3. Surround each identified mention with <concept> tags.
4. If a passage contains multiple instances, each one should be tagged separately.
5. If a passage does not contain any instances, it should be reproduced exactly as given, without any additional tags.
6. If an entire passage refers to the concept without specific mentions, the entire passage should be wrapped in a <concept> tag.
7. The input text must be reproduced exactly, down to the last character, only adding concept tags.



SYSTEM 
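The template asks the model to return the passage unchanged apart from added `<concept>` tags. To make that concrete, here is a rough sketch of how such tagged output could be turned back into character offsets. This is an assumption about the general approach, not the actual implementation in `src`, and `extract_spans` is a hypothetical name. It does not handle nested tags.

```python
import re

OPEN_TAG, CLOSE_TAG = "<concept>", "</concept>"
TAG_PATTERN = re.compile(r"<concept>(.*?)</concept>", re.DOTALL)


def extract_spans(original: str, tagged: str):
    """Return (start, end, text) for each tagged span, with offsets into `original`."""
    # Instruction 7 of the template: removing the tags must give back the input exactly
    stripped = tagged.replace(OPEN_TAG, "").replace(CLOSE_TAG, "")
    if stripped != original:
        raise ValueError("Model output does not reproduce the input text")

    spans = []
    tag_chars_seen = 0  # tag characters that precede the current match
    for match in TAG_PATTERN.finditer(tagged):
        start = match.start() - tag_chars_seen
        content = match.group(1)
        spans.append((start, start + len(content), content))
        tag_chars_seen += len(OPEN_TAG) + len(CLOSE_TAG)
    return spans


print(extract_spans(
    "Our basement is underwater.",
    "Our basement is <concept>underwater</concept>.",
))
```

The explicit equality check is one way to catch outputs where the model paraphrased or truncated the passage.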

## Let's try some justice concepts

In [None]:
#First, let's figure out how long this really takes
import time
df.dropna(subset = 'text', inplace = True)
small_df = df.sample(n=100, random_state=42)

In [None]:
wikibase = WikibaseSession()
distributive = wikibase.get_concept('Q911') #Distributive justice

classifier = LLMClassifier(distributive, model_name='gpt-3.5-turbo')

print("Starting prediction")
t0 = time.time()
prediction_spans = classifier.predict_batch(small_df['text'].astype(str))
tseconds = time.time()-t0
print(f"Finished in {tseconds:.2f} seconds")
print(f"That's {tseconds/len(small_df):.3f} s per passage\n\nPositive examples:")

# Passages without a match yield an empty list, so the inner loop simply skips them
for prediction in prediction_spans:
    for p in prediction:
        print(p.text)
        print(f"=> {p.labelled_text}")
        print()



Starting prediction


UnexpectedModelBehavior: Exceeded maximum retries (1) for result validation

From this experiment I draw the following conclusions (the saved output above is from a failed run, so the results themselves are not shown):
- Not bad at all!
- It seems likely we'll want to do some filtering before feeding passages to the LLM (but we wanted to do that anyway to limit monetary and environmental costs).
- It's quick enough that this filtering can be relatively coarse, though.
- My Spanish isn't great, but it seems to do OK there too. Still, as we translate to English anyway, it's probably better to limit this to English.

Further, it would be good to tweak the prompt template somewhat:
- The spans are often on the short side. Ideally, it should also label the modifiers of the main word.
- It's probably too permissive. It's broadly good that it picks up on justice "vibes", but it does not currently distinguish between specific types of justice. I'm unsure whether this is entirely down to the template or also to the concept description, but we should encourage it to be stricter.
- As a more specific case, I think it needs to ignore negative examples (i.e. injustices). This is somewhat debatable and definitely seems like something to decide on a concept-by-concept basis, so let's add that to the definition instead.
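The coarse pre-filter mentioned in the conclusions could be as simple as a keyword match. The sketch below is only an illustration: the keyword stems are made up for this example and would need tuning per concept, and `coarse_filter` is a hypothetical helper.

```python
import re

# Hypothetical keyword stems; a real filter would be tuned per concept.
KEYWORDS = ["justice", "equit", "fair", "distribut", "burden", "vulnerab"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def coarse_filter(passages):
    """Keep only passages that contain at least one keyword stem."""
    return [p for p in passages if PATTERN.search(p)]


passages = [
    "A fair distribution of the costs of the transition is essential.",
    "June 2022",
    "Vulnerable communities bear a disproportionate burden.",
]
print(coarse_filter(passages))
```

Because it matches on stems, this deliberately errs on the side of recall (it would also match "affairs", for instance) and leaves precision to the LLM.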

In [None]:
#Let's compare a few different prompts and also try other (simpler?) models

setups = {
    "current": LLMClassifier(distributive, 
                             model_name='gpt-4-turbo',
                             system_prompt_template=classifier.system_prompt_template  # Current template
    ),


    "strict":LLMClassifier(distributive,
                           model_name='gpt-4-turbo',
                           system_prompt_template="""
    You are a specialist climate policy analyst, tasked with identifying mentions of 
    concepts in climate policy documents. You will mark up references to concepts with 
    XML tags.  

    First, carefully review the following description of the concept. These tags will be used to investigate fine-grained distinctions so it is essential that you base your judgement on the nuances of the description.

    <concept_description>
    {concept_description}
    </concept_description>

    Instructions:
    1. Be very strict in identifying concept mentions - read each passage carefully and only tag if you are highly confident that the passage is relevant to an expert, accounting for variations in language.
    2. Identify any mentions of the concept, and include surrounding context in the tags also if these are relevant to the concept. Again, be strict but allow for variations in language. 
    3. Surround each identified mention with <concept> tags.
    4. If a passage contains multiple instances, each one should be tagged separately.
    5. If a passage does not contain any instances, it should be reproduced exactly as given, without any additional tags.
    6. If an entire passage refers to the concept without specific mentions, the entire passage should be wrapped in a <concept> tag.
    7. Do not tag negative examples or injustices unless the need for justice is made explicit or is heavily implied
    8. The input text must be reproduced exactly, down to the last character, only adding concept tags.
    """),

    "lenient": LLMClassifier(distributive,
                           model_name='gpt-4-turbo',
                           system_prompt_template = """
    You are a specialist climate policy analyst, tasked with identifying mentions of 
    concepts in climate policy documents. You will mark up references to concepts with 
    XML tags.

    First, carefully review the following description of the concept:

    <concept_description>
    {concept_description}
    </concept_description>

    Instructions:
    1. Be inclusive in identifying concept mentions - tag if there's a reasonable connection,  allowing for differences in language and understandings of the concept
    2. Include surrounding context in tags to capture full meaning
    3. Surround each identified mention with <concept> tags.
    4. If a passage contains multiple instances, each one should be tagged separately.
    5. If a passage does not contain any instances, it should be reproduced exactly as given, without any additional tags.
    6. If an entire passage refers to the concept without specific mentions, the entire passage should be wrapped in a <concept> tag
    7. Also tag negative instances of the concept -- i.e. tag injustices and situations where justice is called for. 
    8. The input text must be reproduced exactly, down to the last character, only adding concept tags. 
    """
    ), 

    
    "rules": LLMClassifier(distributive,
                           model_name='gpt-4-turbo',
                           system_prompt_template = """
    You are a specialist climate policy analyst, tasked with identifying mentions of 
    concepts in climate policy documents. You will mark up references to concepts with 
    XML tags.

    First, carefully review the following description of the concept:

    <concept_description>
    {concept_description}
    </concept_description>

    Instructions:
    1. First, think about the core of the description you have just read. Distill this into key characteristics, thinking carefully on when a policy expert would include and exclude.
    2. Keep these inclusion and exclusion criteria in mind and assign your tags, being strict to only include mentions that are relevant to the core idea of the concept.
    3. Surround each identified mention with <concept> tags.
    4. If a passage contains multiple instances, each one should be tagged separately.
    5. If a passage does not contain any instances, it should be reproduced exactly as given, without any additional tags.
    6. If an entire passage refers to the concept without specific mentions, the entire passage should be wrapped in a <concept> tag
    7. If the context is important to understand the meaning of the passage, be inclusive and include this context within the tags.
    8. The input text must be reproduced exactly, down to the last character, only adding concept tags. 
    """
    ),


    # Different models
    #'gemini_pro': LLMClassifier(distributive, model_name='gemini-2.5-pro-exp-03-25'),
    #'gemini_flash': LLMClassifier(distributive, model_name = 'gemini-1.5-flash-002'),
    #'gemini_thinking': LLMClassifier(distributive, model_name = "gemini-2.0-flash-thinking-exp-01-21")
}
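The templates above share their preamble and most of their instructions, so one option is to assemble them from lists and let the numbering be generated. This is a sketch under that assumption; `build_template` is a hypothetical helper, and a variant with a different preamble (such as "strict") would need an extra parameter.

```python
PREAMBLE = """You are a specialist climate policy analyst, tasked with identifying mentions of
concepts in climate policy documents. You will mark up references to concepts with
XML tags.

First, carefully review the following description of the concept:

<concept_description>
{concept_description}
</concept_description>

Instructions:
"""

# Instructions that appear unchanged in every variant
SHARED = [
    "Surround each identified mention with <concept> tags.",
    "If a passage contains multiple instances, each one should be tagged separately.",
    "If a passage does not contain any instances, it should be reproduced exactly as given, without any additional tags.",
    "If an entire passage refers to the concept without specific mentions, the entire passage should be wrapped in a <concept> tag.",
]
FINAL = "The input text must be reproduced exactly, down to the last character, only adding concept tags."


def build_template(first, extra=()):
    """Assemble a template: variant-specific steps, shared steps, optional extras, final rule."""
    steps = [*first, *SHARED, *extra, FINAL]
    numbered = "\n".join(f"{n}. {step}" for n, step in enumerate(steps, start=1))
    return PREAMBLE + numbered + "\n"


lenient = build_template(
    first=[
        "Be inclusive in identifying concept mentions - tag if there's a reasonable connection, allowing for differences in language and understandings of the concept",
        "Include surrounding context in tags to capture full meaning",
    ],
    extra=["Also tag negative instances of the concept -- i.e. tag injustices and situations where justice is called for."],
)
print(lenient.format(concept_description="<description goes here>"))
```

The `{concept_description}` placeholder is kept literal in the result, so the string can be passed as `system_prompt_template` in the same way as the hand-written versions.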

In [None]:
#Let's write some helper functions to make my life easier 
#and then use that to set up some experiments
#First, let's create a function to get tagged passages in a readable format
def get_tagged_passages(classifier, texts):
    results = []
    # Use predict_batch instead of individual predict calls
    all_spans = classifier.predict_batch(texts)
    
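    # Match spans with their original texts. predict_batch has already returned
    # at this point, so this progress bar only covers the (fast) post-processing.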
    for text, spans in notebook_tqdm.tqdm(zip(texts, all_spans), 
                                        total=len(texts), 
                                        desc="Processing tagged passages"):
        if spans:  # Only include texts that got tagged
            results.append({
                'text': text,
                'tagged_spans': [span.labelled_text for span in spans]
            })
    return results

def compare_setups(setups, sample_texts):
    """
    setups: dict of {setup_name: classifier}
    sample_texts: list of texts to test
    """
    results = {}
    
    for setup_name, classifier in notebook_tqdm.tqdm(setups.items(), 
                                                   desc="Comparing setups"):
        tagged = get_tagged_passages(classifier, sample_texts)
        results[setup_name] = {
            'n_tagged': len(tagged),
            'tagged_passages': tagged
        }
    
    return results


def create_comparison_table(results):
    """
    Create a comparison table that includes both quantitative and qualitative data.

    Note: relies on the notebook-level `sample_texts` list defined in the next cell.
    """
    # Initialize dictionary to store the data
    comparison_data = {
        'text': sample_texts,  # Original texts
    }
    
    # Add columns for each setup
    for setup_name, result in results.items():
        # Create a dictionary mapping texts to their tagged spans
        text_to_spans = {p['text']: p['tagged_spans'] for p in result['tagged_passages']}
        
        # Add both the binary tag and the actual spans
        comparison_data[f'{setup_name}_tagged'] = [1 if text in text_to_spans else 0 for text in sample_texts]
        comparison_data[f'{setup_name}_spans'] = [text_to_spans.get(text, []) for text in sample_texts]
    
    # Create DataFrame only once at the end
    return pd.DataFrame(comparison_data)

In [None]:
#Run the comparison
sample_texts = df[df['language'] == 'en'].sample(n=400, random_state=420)['text'].tolist()
sample_texts = [str(t) for t in sample_texts]
results = compare_setups(setups, sample_texts)

Comparing setups:   0%|          | 0/4 [00:00<?, ?it/s]

Processing tagged passages: 100%|██████████| 400/400 [00:00<00:00, 603279.97it/s]
Processing tagged passages: 100%|██████████| 400/400 [00:00<00:00, 1093690.74it/s]
Processing tagged passages: 100%|██████████| 400/400 [00:00<00:00, 821204.89it/s]
Processing tagged passages: 100%|██████████| 400/400 [00:00<00:00, 584164.90it/s]
Comparing setups: 100%|██████████| 4/4 [00:56<00:00, 14.07s/it]


In [None]:
# Create and display the comparison table
comparison_table = create_comparison_table(results)

# Show summary statistics
tagged_cols = [col for col in comparison_table.columns if col.endswith('_tagged')]
print("Total passages tagged by each setup:")
print(comparison_table[tagged_cols].sum())

# Show passages where setups disagree
print("\nPassages where setups disagree:")
disagreements = comparison_table[comparison_table[tagged_cols].nunique(axis=1) > 1]

# For each disagreeing passage, show the different spans identified
for _, row in disagreements.iterrows():
    print("\nOriginal text:", row['text'])
    for setup in setups.keys():
        if row[f'{setup}_tagged']:
            print(f"{setup} identified:", row[f'{setup}_spans'])
        else:
            print(f"{setup} identified: [NO TAGS]")

Total passages tagged by each setup:
current_tagged    44
strict_tagged     34
lenient_tagged    46
rules_tagged      47
dtype: int64

Passages where setups disagree:

Original text: · In 2016, JMD collaborated with Bureau Sustainability Program Managers and facilities personnel to evaluate all DOJ-owned facilities for vulnerabilities to coastal and inland flooding, extreme heat, drought, and wildfire using the DOJ Facility Climate Adaptation Checklist.
current identified: ['vulnerabilities to coastal and inland flooding, extreme heat, drought, and wildfire']
strict identified: [NO TAGS]
lenient identified: ['vulnerabilities to coastal and inland flooding, extreme heat, drought, and wildfire']
rules identified: [NO TAGS]

Original text: Within renewables, employment opportunities are diverse: the number of jobs per MW created in the solar energy sector is significant, although in Europe it is mainly linked to installation; wind power generates fewer jobs per MW, but leads to a greater 
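Beyond raw counts and a list of disagreements, a pairwise overlap score can summarise how similar two setups are. Below is a minimal sketch using Jaccard overlap on sets of tagged passage ids. The data is a toy example, not results from this notebook, and the helper names are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_agreement(tagged_by_setup):
    """Map each pair of setup names to the Jaccard overlap of their tagged passage ids."""
    return {
        (first, second): jaccard(tagged_by_setup[first], tagged_by_setup[second])
        for first, second in combinations(sorted(tagged_by_setup), 2)
    }


# Toy data: passage ids tagged by each setup (illustrative, not real results)
toy = {"current": {1, 2, 3}, "strict": {2, 3}, "lenient": {1, 2, 3, 4}}
for pair, score in pairwise_agreement(toy).items():
    print(pair, round(score, 2))
```

With the comparison table from this notebook, the sets could be built from the row indices where each `*_tagged` column equals 1.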

In [None]:
comparison_table.to_csv("350513_comparison_table_very_coarse.csv", index=False)

