## Extracting Climate Finance Tools

The goal of this project is to identify climate finance laws and tools from around the world. We now have a collection of laws, news articles, press releases, and company plans that may mention these types of initiatives. The next step is to comb through them to find what we need.

### Loading the data

In [9]:
import os
import pandas as pd
import PyPDF2
import fitz  # PyMuPDF - better for detecting formatting like strikethrough
from tqdm.notebook import tqdm
from googletrans import Translator
from nltk import sent_tokenize
import re

In [10]:
brasil_folder = 'brasil'
china_folder = 'china' 
eu_folder = 'eu'

In [11]:
def load_files(folder, exclude_strikethrough=True): 
    """
    Load files from a folder, with option to exclude strikethrough text from PDFs.
    
    Args:
        folder: Path to folder containing files
        exclude_strikethrough: If True, excludes crossed-out text from PDFs (default: True)
    """
    df = []
    files = os.listdir(folder)
    for file in tqdm(files, desc=f"Loading files from {folder}"): 
        file_path = os.path.join(folder, file)
        ext = os.path.splitext(file)[-1].lower()
        try:
            if ext == '.pdf':
                text = ""
                if exclude_strikethrough:
                    # Use PyMuPDF to extract text while excluding strikethrough
                    doc = fitz.open(file_path)
                    for page in doc:
                        blocks = page.get_text("dict")["blocks"]
                        for block in blocks:
                            if "lines" in block:
                                for line in block["lines"]:
                                    for span in line["spans"]:
                                        # NOTE: in PyMuPDF's span flags, bit 4 (value 16) marks bold
                                        # text, not strikethrough, so this test filters out bold spans.
                                        flags = span.get("flags", 0)
                                        is_strikethrough = bool(flags & (1 << 4))
                                        
                                        if not is_strikethrough:
                                            text += span["text"]
                                    text += "\n"
                    doc.close()
                else:
                    # Fallback to PyPDF2
                    with open(file_path, 'rb') as f:
                        reader = PyPDF2.PdfReader(f)
                        for page in reader.pages:
                            try:
                                text += page.extract_text() or ""
                            except Exception:
                                continue
            else:
                with open(file_path, 'r', encoding='utf-8', errors='ignore') as f: 
                    text = f.read()
            df.append({'file': file, 'text': text})
        except Exception as e:
            df.append({'file': file, 'text': '', 'error': str(e)})
    return pd.DataFrame(df)
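
To check what the `flags & (1 << 4)` test in `load_files` really matches, it helps to decode the span flags. The sketch below uses the bit layout from PyMuPDF's documentation as I understand it (bit 0 superscript, bit 1 italic, bit 2 serifed, bit 3 monospaced, bit 4 bold). `SPAN_FLAG_BITS` and `decode_span_flags` are names made up for this illustration, not part of PyMuPDF.

```python
# Bit positions of a text span's "flags" value, per the PyMuPDF docs
# (assumed layout; verify against the installed version).
SPAN_FLAG_BITS = {
    0: "superscript",
    1: "italic",
    2: "serifed",
    3: "monospaced",
    4: "bold",
}

def decode_span_flags(flags):
    """Return the names of all style bits set in a span's flags value."""
    return [name for bit, name in SPAN_FLAG_BITS.items() if flags & (1 << bit)]

print(decode_span_flags(16))     # the value the (1 << 4) test looks for
print(decode_span_flags(2 | 4))  # an italic, serifed span
```

If this layout holds, a filter on bit 4 drops bold spans. That would be consistent with the gaps in the 44.pdf output (the line that starts with ", no exercício do cargo" and the lone ":").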

In [4]:
b_df = load_files(brasil_folder)

Loading files from brasil:   0%|          | 0/73 [00:00<?, ?it/s]

In [5]:
# Look at the text extracted from file '44.pdf' in the brasil folder
item_44 = b_df[b_df['file'] == '44.pdf']
if not item_44.empty:
    print(item_44.iloc[0]['text'])
else:
    print("44.pdf not found in loaded DataFrame.")






Revogado pelo Decreto nº 11.772, de 2023
Texto para impressão
Estabelece as Diretrizes Nacionais sobre Empresas e
Direitos Humanos.
, no exercício do cargo de Presidente da República, no uso das
atribuições que lhe confere o art. 84, , incisos IV e VI, alínea “a”, da Constituição,
:
CAPÍTULO I
DISPOSIÇÕES PRELIMINARES
Art. 1º Este Decreto estabelece as Diretrizes Nacionais sobre Empresas e Direitos Humanos, para médias e
grandes empresas, incluídas as empresas multinacionais com atividades no País.
§ 1º Nos termos do disposto na Lei Complementar nº 123, de 14 de dezembro de 2006, as microempresas e as
empresas de pequeno porte poderão, na medida de suas capacidades, cumprir as Diretrizes de que trata este Decreto,
observado o disposto no art. 179 da Constituição.
§ 2º As Diretrizes serão implementadas voluntariamente pelas empresas.
§ 3º Ato do Ministro de Estado dos Direitos Humanos instituirá o Selo “Empresa e Direitos Humanos”, destinado às
empresas que voluntariamente implement

In [6]:
text_44 = item_44.iloc[0]['text']

In [None]:
tokenized_44 = sent_tokenize(text_44)

In [8]:
tokenized_44

['\n\n\n\nRevogado pelo Decreto nº 11.772, de 2023\nTexto para impressão\nEstabelece as Diretrizes Nacionais sobre Empresas e\nDireitos Humanos.',
 ', no exercício do cargo de Presidente da República, no uso das\natribuições que lhe confere o art.',
 '84, , incisos IV e VI, alínea “a”, da Constituição,\n:\nCAPÍTULO I\nDISPOSIÇÕES PRELIMINARES\nArt.',
 '1º Este Decreto estabelece as Diretrizes Nacionais sobre Empresas e Direitos Humanos, para médias e\ngrandes empresas, incluídas as empresas multinacionais com atividades no País.',
 '§ 1º Nos termos do disposto na Lei Complementar nº 123, de 14 de dezembro de 2006, as microempresas e as\nempresas de pequeno porte poderão, na medida de suas capacidades, cumprir as Diretrizes de que trata este Decreto,\nobservado o disposto no art.',
 '179 da Constituição.',
 '§ 2º As Diretrizes serão implementadas voluntariamente pelas empresas.',
 '§ 3º Ato do Ministro de Estado dos Direitos Humanos instituirá o Selo “Empresa e Direitos Humanos”, destin

In [10]:
from deep_translator import GoogleTranslator, single_detection
tokenized_44 = sent_tokenize(text_44)


In [None]:
translated_sentences = [
    GoogleTranslator(source='pt', target='en').translate(tt)
    for tt in tokenized_44
]

# Pin it back together as one text
translated_text_44 = ' '.join(translated_sentences)

KeyboardInterrupt: 

In [11]:
from pprint import pprint
pprint(translated_text_44)

NameError: name 'translated_text_44' is not defined

In [12]:
def chinese_sent_tokenize(text):
    """
    Tokenize Chinese text into sentences using Chinese punctuation marks.
    Splits after 。(period), ！(exclamation), ？(question mark) and ；(semicolon),
    plus the ASCII ! and ? that sometimes appear in Chinese text.
    """
    # Insert a newline after each sentence-ending mark so the mark stays with its sentence
    text = re.sub(r'([。！？!?；])', r'\1\n', text)
    
    # Split by newlines and filter out empty strings
    sentences = [s.strip() for s in text.split('\n') if s.strip()]
    
    # Filter out very short "sentences" that are likely headers or page numbers
    sentences = [s for s in sentences if len(s) > 3]
    
    return sentences

def smart_sent_tokenize(text, language='auto'):
    """
    Tokenize text into sentences based on language.
    For Chinese (zh), uses Chinese punctuation.
    For other languages, uses NLTK's sent_tokenize.
    """
    if language.lower() in ('zh', 'zh-cn', 'zh-tw'):
        return chinese_sent_tokenize(text)
    else:
        # Use NLTK for English and other Latin-script languages
        return sent_tokenize(text)
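
A quick standalone check of the Chinese splitting rule, using the same regular expression as `chinese_sent_tokenize`. The function name `split_chinese_sentences`, the `min_len` parameter, and the sample string are only for illustration.

```python
import re

def split_chinese_sentences(text, min_len=4):
    """Split on Chinese/ASCII sentence-ending marks and on line breaks."""
    # Add a newline after each terminator so the mark stays with its sentence.
    marked = re.sub(r'([。！？!?；])', r'\1\n', text)
    parts = [p.strip() for p in marked.split('\n')]
    # Drop empty pieces and fragments shorter than min_len characters.
    return [p for p in parts if len(p) >= min_len]

sample = "气候变化关系全人类的生存和发展。我国人口众多，人均资源禀赋较差；生态环境脆弱。\n目\n录"
for sentence in split_chinese_sentences(sample):
    print(sentence)
```

The one-character lines "目" and "录" (a table-of-contents heading split across lines by the PDF extraction) are dropped by the length filter. Longer table-of-contents lines would still count as sentences.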


In [13]:
from deep_translator import single_detection

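# Detect language of text.
# The API key is read from the environment rather than hardcoded;
# DETECT_LANGUAGE_API_KEY is an assumed variable name.
lang = single_detection(text_44, api_key=os.getenv('DETECT_LANGUAGE_API_KEY'))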
print(lang)

NameError: name 'text_44' is not defined

In [27]:
def map_language_code(lang):
    """
    Map language codes from detection API to GoogleTranslator format.
    Detection API returns 'zh', but GoogleTranslator needs 'zh-CN' or 'zh-TW'.
    """
    language_map = {
        'zh': 'zh-CN',  # Default Chinese to Simplified
        'zh-cn': 'zh-CN',
        'zh-tw': 'zh-TW',
    }
    return language_map.get(lang.lower(), lang)

def translate_and_process(text, partial_translation=None): 
    """
    Translate text sentence by sentence, resuming from partial translation if provided.
    Returns tuple: (translated_text, is_complete)
    """
    # Handle empty or invalid text
    if not text or not isinstance(text, str) or len(text.strip()) == 0:
        print('  ⚠️  Empty or invalid text, skipping')
        return None, True  # Mark as complete so we don't retry empty documents
    
    # Detect language FIRST using a small sample (before tokenization)
    small_sample = text[:1000] if len(text) > 1000 else text
    # API key read from the environment (DETECT_LANGUAGE_API_KEY is an assumed variable name)
    detected_lang = single_detection(small_sample, api_key=os.getenv('DETECT_LANGUAGE_API_KEY'))
    print('language detected: ', detected_lang)
    
    # Map the language code to GoogleTranslator format
    lang = map_language_code(detected_lang)
    if lang != detected_lang:
        print(f'  🔄 Mapped language code: {detected_lang} -> {lang}')
    
    if lang == 'en':
        return text, True  # Already in English, return original
    
    # Use language-aware tokenization (handles Chinese properly)
    # Use the original detected_lang for tokenization logic
    tokenized_text = smart_sent_tokenize(text, detected_lang)
    print(f'  📝 Text split into {len(tokenized_text)} sentences')
    
    # Figure out where to resume from
    translated_sentences = []
    start_idx = 0
    
    if partial_translation:
        # Count how many sentences were already translated by tokenizing the partial translation
        # The partial_translation is in ENGLISH, so tokenize it with 'en', not the source language
        partial_tokenized = smart_sent_tokenize(partial_translation, 'en')
        start_idx = len(partial_tokenized)
        
        # Keep the already-translated text as-is
        translated_sentences = partial_tokenized.copy()
        
        print(f'  📍 Resuming from sentence {start_idx + 1}/{len(tokenized_text)} ({start_idx} already translated)')
    
    # Translate remaining sentences one by one
    try:
        for i in range(start_idx, len(tokenized_text)):
            sentence = tokenized_text[i].strip()
            
            # Skip empty sentences
            if not sentence:
                continue
                
            print(f"    ➡️  Translating sentence {i+1}/{len(tokenized_text)} ...", end=" ", flush=True)
            
            try:
                translated = GoogleTranslator(source=lang, target='en').translate(sentence)
                translated_sentences.append(translated)
                print("done.")
            except Exception as sent_error:
                error_msg = str(sent_error)
                
                # ALWAYS print the actual error so we can debug
                print(f"ERROR: {error_msg}")
                
                # Check if it's a network/connection error
                if any(keyword in error_msg for keyword in ['HTTPS', 'Connection', 'HTTPError', 'timeout', 'Network']):
                    # Network error - stop here and mark as incomplete so we retry later
                    print(f"connection error!")
                    raise sent_error  # Raise to trigger partial save and retry
                else:
                    # Translation error (special characters, can't translate, etc.)
                    # Keep the original source-language text for this sentence
                    print(f"translation error - keeping original")
                    translated_sentences.append(sentence)
            
        # All sentences processed - filter out any None values before joining
        filtered_sentences = [s for s in translated_sentences if s is not None]
        final_text = ' '.join(filtered_sentences)
        return final_text, True
        
    except Exception as e:
        # Network error or unexpected issue - save partial progress
        error_msg = str(e)
        if any(keyword in error_msg for keyword in ['HTTPS', 'Connection', 'HTTPError', 'timeout', 'Network']):
            print(f'\n  🌐 Network issue at sentence {i + 1}/{len(tokenized_text)} - saving partial progress')
        else:
            print(f'\n  ⚠️  Unexpected error at sentence {i + 1}/{len(tokenized_text)}: {str(e)[:100]}')
        
        # Filter out None values before joining
        filtered_sentences = [s for s in translated_sentences if s is not None]
        partial_text = ' '.join(filtered_sentences) if filtered_sentences else None
        return partial_text, False
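
`translate_and_process` works out where to resume by re-tokenizing the partial English text, and that count can drift if the translator merges or splits sentences. One possible alternative, sketched here with a stub translator, keeps the translations as a list so the resume position is explicit. `translate_resumable` and `fake_translate` are hypothetical names, not part of the notebook or of deep_translator.

```python
def translate_resumable(sentences, translate_fn, done=None):
    """Translate sentences in order, resuming after those already in `done`.

    Returns (translations, is_complete). On the first failure, the
    translations gathered so far are returned with is_complete=False.
    """
    results = list(done or [])
    for sentence in sentences[len(results):]:
        try:
            results.append(translate_fn(sentence))
        except Exception:
            return results, False
    return results, True


# Hypothetical stand-in for a real translation call; fails once on "c".
calls = {"c": 0}

def fake_translate(sentence):
    if sentence == "c" and calls["c"] == 0:
        calls["c"] += 1
        raise ConnectionError("simulated network drop")
    return sentence.upper()

partial, ok_first = translate_resumable(["a", "b", "c", "d"], fake_translate)
print(partial, ok_first)
full, ok_second = translate_resumable(["a", "b", "c", "d"], fake_translate, done=partial)
print(full, ok_second)
```

Storing the list (for example as JSON in the checkpoint) would make the number of finished sentences exact instead of estimated.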

In [28]:
import time
from requests.exceptions import RequestException, ConnectionError, HTTPError

def translate_with_retry(text, partial_translation=None, max_retries=5, delay=5):
    """Translate with automatic retry on network errors, supporting partial progress"""
    for attempt in range(max_retries):
        try:
            translated_text, is_complete = translate_and_process(text, partial_translation)
            return translated_text, is_complete
        except Exception as e:  # Exception already covers the requests errors imported above
            print(f"  ⚠️  Error (attempt {attempt + 1}/{max_retries}): {str(e)[:100]}")
            if attempt < max_retries - 1:
                print(f"  ⏳ Waiting {delay} seconds before retry...")
                time.sleep(delay)
            else:
                print(f"  ❌ Failed after {max_retries} attempts")
                return None, False

def is_translation_complete(original_text, translated_text):
    """Check if translation is complete by comparing sentence counts"""
    if not original_text or not isinstance(original_text, str):
        return True  # Can't translate invalid text
    if not translated_text or not isinstance(translated_text, str):
        return False  # No translation exists
    
    # Detect language of original text for proper tokenization
    small_sample = original_text[:1000] if len(original_text) > 1000 else original_text
    # API key read from the environment (DETECT_LANGUAGE_API_KEY is an assumed variable name)
    lang = single_detection(small_sample, api_key=os.getenv('DETECT_LANGUAGE_API_KEY'))
    
    # Tokenize both texts using the detected language
    original_sentences = smart_sent_tokenize(original_text, lang)
    translated_sentences = smart_sent_tokenize(translated_text, 'en')  # Translations are in English
    
    # Translation is complete if at least 90% of sentences are translated
    # (allowing some wiggle room for sentence splitting differences)
    if not original_sentences:
        return True  # Nothing to translate; also avoids dividing by zero
    completion_ratio = len(translated_sentences) / len(original_sentences)
    return completion_ratio >= 0.9

def process_dataframe_with_checkpoints(df, save_path='b_df_checkpoint.csv'):
    """
    Process dataframe with automatic saving, supporting partial sentence-level progress.
    Uses df for text content, checkpoint only for translation status.
    """
    
    # Keep the original df with fresh text data
    working_df = df.copy()
    
    # Load existing translations from checkpoint (if available)
    try:
        checkpoint_df = pd.read_csv(save_path, escapechar='\\')
        print(f"📂 Loaded checkpoint from {save_path}")
        
        # Merge only the 'translated' column from checkpoint
        if 'translated' in checkpoint_df.columns:
            # Create a mapping of file -> translated text
            translation_map = dict(zip(checkpoint_df['file'], checkpoint_df['translated']))
            working_df['translated'] = working_df['file'].map(translation_map)
        else:
            working_df['translated'] = None
            
    except (FileNotFoundError, pd.errors.EmptyDataError):
        print("🆕 Starting fresh (no checkpoint found or checkpoint corrupted)")
        working_df['translated'] = None

    for idx, row in working_df.iterrows():
        text = row['text']
        existing_translation = row.get('translated') if pd.notna(row.get('translated')) else None
        
        # Check if translation is complete
        if existing_translation and is_translation_complete(text, existing_translation):
            print(f"\n⏭️  Skipping file: {row['file']} (already translated)")
            continue

        print(f"\n🔄 Processing file: {row['file']}")
        
        # Skip .DS_Store and other non-PDF files
        if not row['file'].endswith('.pdf'):
            print(f"  ⏭️  Skipping non-PDF file")
            working_df.at[idx, 'translated'] = None
            continue

        # Try to translate (or continue from partial)
        translated, is_complete = translate_with_retry(text, existing_translation)

        if translated:
            # Save whatever progress we have (complete or partial)
            working_df.at[idx, 'translated'] = translated
            working_df.to_csv(save_path, index=False, escapechar='\\', doublequote=False)

            if is_complete:
                print(f"  ✅ Complete translation saved!")
            else:
                print(f"  💾 Partial translation saved (will continue next time)")
        else:
            # No progress made at all
            print(f"  ❌ No translation progress - will retry next time")
            if existing_translation is None:
                working_df.at[idx, 'translated'] = None
                working_df.to_csv(save_path, index=False, escapechar='\\', doublequote=False)

        # Small delay between requests to avoid rate limiting
        time.sleep(1)

    print("\n🎉 All done!")
    return working_df
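
`translate_with_retry` waits a fixed `delay` between attempts. A common variation is exponential backoff, sketched below with a simulated failure. `retry_with_backoff`, `flaky`, and the injectable `sleep` argument are assumptions made for this example.

```python
import time

def retry_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

attempts = []

def flaky():
    """Fail on the first two calls, then succeed."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("simulated")
    return "ok"

waits = []
result = retry_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

Passing `sleep` as a parameter keeps the example fast to run; in real use the default `time.sleep` applies.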

In [36]:
# Run it
b_df = process_dataframe_with_checkpoints(b_df, save_path='b_df_checkpoint.csv')

📂 Loaded checkpoint from b_df_checkpoint.csv

⏭️  Skipping file: 49.pdf (already translated)

⏭️  Skipping file: 61.pdf (already translated)

⏭️  Skipping file: 75.pdf (already translated)

⏭️  Skipping file: 74.pdf (already translated)

⏭️  Skipping file: 60.pdf (already translated)

⏭️  Skipping file: 48.pdf (already translated)

⏭️  Skipping file: 62.pdf (already translated)

⏭️  Skipping file: 63.pdf (already translated)

⏭️  Skipping file: 73.pdf (already translated)

⏭️  Skipping file: 67.pdf (already translated)

⏭️  Skipping file: 9.pdf (already translated)

🔄 Processing file: .DS_Store
  ⏭️  Skipping non-PDF file

⏭️  Skipping file: 8.pdf (already translated)

⏭️  Skipping file: 66.pdf (already translated)

⏭️  Skipping file: 72.pdf (already translated)

⏭️  Skipping file: 64.pdf (already translated)

⏭️  Skipping file: 70.pdf (already translated)

⏭️  Skipping file: 58.pdf (already translated)

⏭️  Skipping file: 59.pdf (already translated)

⏭️  Skipping file: 71.pdf (already

In [37]:
b_df

Unnamed: 0,file,text,translated
0,49.pdf,\nPublicado em: 14/06/2019 | Edição: 114 | Seç...,Published on: 06/14/2019 | Edition: 114 | Sect...
1,61.pdf,\nPublicado em: 05/06/2020 | Edição: 107-A | S...,Published on: 06/05/2020 | Edition: 107-A | Se...
2,75.pdf,\nPresidencia da Republica \nCasa Civil \nSub...,Presidency of the Republic \nCivil House \nDep...
3,74.pdf,archive.today\nwebpage capture\n https://www.j...,archive.today\nweb page capture\n https://www....
4,60.pdf,"\n\n\n\nRevogado pelo Decreto nº 11.964, de 20...","Revoked by Decree No. 11,964, of 2024\nText fo..."
...,...,...,...
68,3.pdf,"\n \nPublicada no DOU nº 1, de 2 de janeiro de...","Published in DOU nº 1, of January 2, 2007, Sec..."
69,51.pdf,\n\n\n\n \nAprova o Programa Nacional de Direi...,Approves the National Human Rights Program -\n...
70,2.pdf,\n \n \n(D.O.U. de 01/11/00) \n \n \nO D...,(D.O.U. of 01/11/00) \n \n \nTHE GENERAL DI...
71,50.pdf,"\nResolução nº 3.896, de 17 de agosto de 2010...","Resolution No. 3,896, of August 17, 2010 \n1 \..."


_______

### Mandarin translation

In [16]:
c_df = load_files(china_folder)

Loading files from china:   0%|          | 0/45 [00:00<?, ?it/s]

MuPDF error: format error: No default Layer config



In [17]:
# Test: Verify Chinese tokenization is working properly
sample_chinese = c_df[c_df['file'] == '13.pdf']
if not sample_chinese.empty:
    text = sample_chinese.iloc[0]['text']
    print(f"Text length: {len(text)} characters")
    print(f"First 300 chars:\n{text[:300]}")
    
    # Test old way (NLTK - doesn't work for Chinese)
    print(f"\n❌ Old way (NLTK sent_tokenize): {len(sent_tokenize(text))} sentences")
    
    # Test new way (language-aware)
    chinese_sentences = smart_sent_tokenize(text, 'zh')
    print(f"✅ New way (Chinese-aware tokenization): {len(chinese_sentences)} sentences")
    print(f"\nFirst 3 sentences:")
    for i, sent in enumerate(chinese_sentences[:3]):
        print(f"  {i+1}. {sent[:80]}...")
else:
    print("File not found")


Text length: 27910 characters
First 300 chars:
全国重要生态系统保护和修复重大工程
总体规划（2021—2035 年）
二〇二〇年五月
I
目
录
前
言.......................................................................................... 1
第一章
生态保护和修复面临的形势......................................3
一、我国生态保护和修复工作成效
（一）森林资源总量持续快速增长.................................... 3
（二）草原生态系统恶化趋势得到遏制...........

❌ Old way (NLTK sent_tokenize): 1 sentences
✅ New way (Chinese-aware tokenization): 1208 sentences

First 3 sentences:
  1. 全国重要生态系统保护和修复重大工程...
  2. 总体规划（2021—2035 年）...
  3. 二〇二〇年五月...


In [39]:
c_df = process_dataframe_with_checkpoints(c_df, save_path='c_df_checkpoint.csv')

📂 Loaded checkpoint from c_df_checkpoint.csv

⏭️  Skipping file: 49.pdf (already translated)

⏭️  Skipping file: 48.pdf (already translated)

⏭️  Skipping file: 9.pdf (already translated)

🔄 Processing file: .DS_Store
  ⏭️  Skipping non-PDF file

⏭️  Skipping file: 8.pdf (already translated)

⏭️  Skipping file: 29.pdf (already translated)

🔄 Processing file: 14.pdf
language detected:  en
  ✅ Complete translation saved!

⏭️  Skipping file: 28.pdf (already translated)

⏭️  Skipping file: 10.pdf (already translated)

⏭️  Skipping file: 38.pdf (already translated)

⏭️  Skipping file: 39.pdf (already translated)

⏭️  Skipping file: 11.pdf (already translated)

🔄 Processing file: ~$21.pdf.docx
  ⏭️  Skipping non-PDF file

⏭️  Skipping file: 13.pdf (already translated)

⏭️  Skipping file: 12.pdf (already translated)

⏭️  Skipping file: 23.pdf (already translated)

⏭️  Skipping file: 34.pdf (already translated)

⏭️  Skipping file: 20.pdf (already translated)

⏭️  Skipping file: 21.pdf (already

_____

In [36]:
eu_df = load_files(eu_folder)

Loading files from eu:   0%|          | 0/100 [00:00<?, ?it/s]

In [38]:
eu_df = process_dataframe_with_checkpoints(eu_df, save_path='eu_df_checkpoint.csv')

🆕 Starting fresh (no checkpoint found or checkpoint corrupted)

🔄 Processing file: 49.pdf
language detected:  en
  ✅ Complete translation saved!

🔄 Processing file: 61.pdf
language detected:  en
  ✅ Complete translation saved!

🔄 Processing file: 75.pdf
language detected:  en
  ✅ Complete translation saved!

🔄 Processing file: 74.pdf
language detected:  en
  ✅ Complete translation saved!

🔄 Processing file: 60.pdf
language detected:  en
  ✅ Complete translation saved!

🔄 Processing file: 48.pdf
language detected:  en
  ✅ Complete translation saved!

🔄 Processing file: 89.pdf
language detected:  en
  ✅ Complete translation saved!

🔄 Processing file: 76.pdf
language detected:  en
  ✅ Complete translation saved!

🔄 Processing file: 62.pdf
language detected:  en
  ✅ Complete translation saved!

🔄 Processing file: 63.pdf
language detected:  en
  ✅ Complete translation saved!

🔄 Processing file: 77.pdf
language detected:  en
  ✅ Complete translation saved!

🔄 Processing file: 88.pdf
language

_____

In [167]:
c_df

Unnamed: 0,file,text,translated
0,49.pdf,附件： \n \n \n国家应对气候变化规划（ 年）\n\n\n\n\n\n\n\n\n\n...,National climate change plan (year) September ...
1,48.pdf,on the Prevention and Gntrol of Air Pollution ...,on the Prevention and Gntrol of Air Pollution ...
2,9.pdf,POLICY UPDATE\n\n COMMUNICATIONS@THEICCT...,POLICY UPDATE\n\n COMMUNICATIONS@THEICCT...
3,.DS_Store,,
4,8.pdf,"22/02/2022, 19:36\n中华人民共和国湿地保护法_中华人民共和国生态环境部\n...","22/02/2022, 19:36 Wetland Protection Law of th..."
5,29.pdf,\n1 \n附件 \n \n财政支持做好碳达峰碳中和工作的意见 \n \n为深入贯彻落实党...,Opinions on providing financial support to ach...
6,14.pdf,\n\n\n\n\n\n\n\n \n \nDelivered at the Fifth S...,\n\n\n\n\n\n\n\n \n \nDelivered at the Fifth S...
7,28.pdf,1\n附件\n全国碳排放权交易市场建设方案（发电行业）\n建立碳排放权交易市场，是利用市场机...,National Carbon Emissions Trading Market Const...
8,10.pdf,CHINA: FUELS: DIESEL AND GASOLINE\n CHINA: FUE...,CHINA: FUELS: DIESEL AND GASOLINE\n CHINA: FUE...
9,38.pdf,\n— 3 — \n \n \n\n\n\n\n\nTo peak carbon diox...,\n— 3 — \n \n \n\n\n\n\n\nTo peak carbon diox...


In [168]:
c_df.iloc[0].text

'附件： \n \n \n国家应对气候变化规划（ 年）\n\n\n\n\n\n\n\n\n\n\n\n\n\n二〇一四年九月 \n\n\n\n\n\n前  言 \n气候变化关系全人类的生存和发展。我国人口众多，人均资源\n禀赋较差，气候条件复杂，生态环境脆弱，是易受气候变化不利影\n响的国家。气候变化关系我国经济社会发展全局，对维护我国经济\n安全、能源安全、生态安全、粮食安全以及人民生命财产安全至关\n重要。积极应对气候变化，加快推进绿色低碳发展，是实现可持续\n发展、推进生态文明建设的内在要求，是加快转变经济发展方式、\n调整经济结构、推进新的产业革命的重大机遇，也是我国作为负责\n任大国的国际义务。 \n根据全面建成小康社会目标任务，国家发展和改革委员会会同\n有关部门，组织编制了《国家应对气候变化规划（2014-2020 年）》，\n提出了我国应对气候变化工作的指导思想、目标要求、政策导向、\n重点任务及保障措施，将减缓和适应气候变化要求融入经济社会发\n展各方面和全过程，加快构建中国特色的绿色低碳发展模式。\n目  录 \n \n第一章现状与展望 \n第一节 全球气候变化趋势及对我国影响.........................................1 \n第二节 应对气候变化工作现状.........................................................1 \n第三节 应对气候变化面临的形势.....................................................2 \n第四节 积极应对气候变化的战略要求.............................................3 \n第二章指导思想和主要目标 \n第一节 指导思想和基本原则.............................................................4 \n第二节 主要目标.................................................................................5 \n第三章控制温室气体排放 \n第一节 调整产业结构..........................

In [47]:
import pandas as pd

# Load the checkpoint files
b_df = pd.read_csv('b_df_checkpoint.csv', escapechar='\\')  # Brazilian docs
c_df = pd.read_csv('c_df_checkpoint.csv', escapechar='\\')  # Chinese docs
eu_df = pd.read_csv('eu_df_checkpoint.csv', escapechar='\\') # EU docs (already English)

# Each dataframe has columns: 'file', 'text' (original), 'translated' (English)

In [48]:
# Access translated text for a specific file
translated_text = b_df[b_df['file'] == '44.pdf']['translated'].iloc[0]

# Get all translated documents
all_translations = b_df['translated'].tolist()

# Create a combined dataframe with all documents
combined_df = pd.concat([
    b_df.assign(source='Brazil'),
    c_df.assign(source='China'),
    eu_df.assign(source='EU')
], ignore_index=True)

# Work with just the English versions
combined_df['english_text'] = combined_df['translated'].fillna(combined_df['text'])

In [49]:
# Create final datasets
data_dir = 'data/'
os.makedirs(data_dir, exist_ok=True)  # to_csv does not create missing directories

for df, name in [(b_df, 'brazil'), (c_df, 'china'), (eu_df, 'eu')]:
    output = df[['file', 'translated']].copy()
    output.columns = ['file', 'text']
    output.to_csv(f'{data_dir}{name}_df.csv', index=False)

In [50]:
combined_df

Unnamed: 0,file,text,translated,source,english_text
0,49.pdf,\nPublicado em: 14/06/2019 | Edição: 114 | Seç...,Published on: 06/14/2019 | Edition: 114 | Sect...,Brazil,Published on: 06/14/2019 | Edition: 114 | Sect...
1,61.pdf,\nPublicado em: 05/06/2020 | Edição: 107-A | S...,Published on: 06/05/2020 | Edition: 107-A | Se...,Brazil,Published on: 06/05/2020 | Edition: 107-A | Se...
2,75.pdf,\nPresidencia da Republica \nCasa Civil \nSub...,Presidency of the Republic \nCivil House \nDep...,Brazil,Presidency of the Republic \nCivil House \nDep...
3,74.pdf,archive.today\nwebpage capture\n https://www.j...,archive.today\nweb page capture\n https://www....,Brazil,archive.today\nweb page capture\n https://www....
4,60.pdf,"\n\n\n\nRevogado pelo Decreto nº 11.964, de 20...","Revoked by Decree No. 11,964, of 2024\nText fo...",Brazil,"Revoked by Decree No. 11,964, of 2024\nText fo..."
...,...,...,...,...,...
213,78.pdf,DECISIONS \n\n\n\n(notified under document C(2...,DECISIONS \n\n\n\n(notified under document C(2...,EU,DECISIONS \n\n\n\n(notified under document C(2...
214,2.pdf,ELI: http://data.europa.eu/eli/reg/2024/795/oj...,ELI: http://data.europa.eu/eli/reg/2024/795/oj...,EU,ELI: http://data.europa.eu/eli/reg/2024/795/oj...
215,50.pdf,28.8.2014 \nEN\nOfficial Journal of the Euro...,28.8.2014 \nEN\nOfficial Journal of the Euro...,EU,28.8.2014 \nEN\nOfficial Journal of the Euro...
216,44.pdf,I\n(Legislative acts)\nDIRECTIVES\n\n\n\n\nTHE...,I\n(Legislative acts)\nDIRECTIVES\n\n\n\n\nTHE...,EU,I\n(Legislative acts)\nDIRECTIVES\n\n\n\n\nTHE...


_____

Now that I have combined_df, I can chunk the documents and embed the chunks. Then I can use an LLM to look through the chunks and extract any laws, policies, or directives that are relevant in the given country.

In [51]:
from transformers import AutoTokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [52]:
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1")

In [53]:
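def tokenize_and_chunk(row, tokenizer, max_tokens=512, text_col='text'):
    """
    Split row[text_col] into sentences and greedily pack them into chunks of at most
    max_tokens tokens. Returns one dict per chunk: a copy of the row with text_col
    replaced by the chunk text. A single sentence longer than max_tokens is kept
    as its own (oversized) chunk.
    """
    sentences = re.split(r'(?<=[.!?]) +', row[text_col])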
    chunks = []
    current_chunk = []
    current_tokens = 0

    for sentence in sentences:
        token_count = len(tokenizer.tokenize(sentence))

        if current_tokens + token_count <= max_tokens:
            current_chunk.append(sentence)
            current_tokens += token_count
        else:
            if current_chunk:
                chunk_row = row.to_dict()
                chunk_row[text_col] = ' '.join(current_chunk)
                chunks.append(chunk_row)
            current_chunk = [sentence]
            current_tokens = token_count

    if current_chunk:
        chunk_row = row.to_dict()
        chunk_row[text_col] = ' '.join(current_chunk)
        chunks.append(chunk_row)

    return chunks
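
The greedy packing in `tokenize_and_chunk` is easier to follow on a toy input. This sketch uses the same sentence-splitting regex and the same packing rule, but counts whitespace-separated words instead of calling the real tokenizer. `pack_sentences` and `word_count` are stand-ins invented for the example.

```python
import re

def pack_sentences(text, count_tokens, max_tokens):
    """Greedily group sentences into chunks of at most max_tokens tokens."""
    sentences = re.split(r'(?<=[.!?]) +', text)
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Flush the current chunk when the next sentence would not fit.
        if current and used + n > max_tokens:
            chunks.append(' '.join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(' '.join(current))
    return chunks

def word_count(sentence):
    """Whitespace word count, standing in for the real tokenizer (an assumption)."""
    return len(sentence.split())

text = "One two three. Four five. Six seven eight nine ten eleven. Twelve."
toy_chunks = pack_sentences(text, word_count, max_tokens=5)
print(toy_chunks)
```

The six-word sentence ends up alone in a chunk that exceeds the limit, which is also what happens in `tokenize_and_chunk` when a single sentence is longer than `max_tokens`.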

In [57]:
print("Chunking documents...")

chunked_rows = []
for _, row in combined_df.iterrows():
    # Only chunk over the 'translated' column, but keep 'file' and 'source' (stored as 'country')
    translated_text = row.get('translated', None)
    if not isinstance(translated_text, str) or not translated_text.strip():
        continue  # skip rows where translated text is not a valid string
    chunk_input = row.copy()
    chunk_input['text'] = translated_text
    try:
        chunks = tokenize_and_chunk(chunk_input, tokenizer, text_col='text')
    except TypeError:
        continue  # skip this row if tokenize_and_chunk fails
    for chunk in chunks:
        # store chunk and remember source
        chunk_record = {
            'file': row.get('file', None),
            'country': row.get('source', None),
            'chunk_text': chunk['text']
        }
        chunked_rows.append(chunk_record)

chunked_df = pd.DataFrame(chunked_rows)
chunked_df = chunked_df.drop_duplicates(subset=['chunk_text']).reset_index(drop=True)

Chunking documents...


In [58]:
chunked_df

Unnamed: 0,file,country,chunk_text
0,49.pdf,Brazil,Published on: 06/14/2019 | Edition: 114 | Sect...
1,49.pdf,Brazil,"729, of May 11, 2018,\nsent by the ANP to the ..."
2,49.pdf,Brazil,"§ 2 Annually, the ANP will publish, on its web..."
3,49.pdf,Brazil,"9º The fuel distributor must maintain, for a p..."
4,49.pdf,Brazil,The definitive individual annual goals for 201...
...,...,...,...
9297,44.pdf,EU,131).’\n(5) Annex IIIb to Directive 1999/62/EC...
9298,44.pdf,EU,If there is \nscientific evidence for a higher...
9299,44.pdf,EU,The parts of the network subject to congestion...
9300,44.pdf,EU,The resulting \ncharging structure shall be tr...


### Calling an LLM

In [64]:
from dotenv import load_dotenv
import openai
# Load environment variables from .env file
load_dotenv()

# Get OpenRouter API key from environment variables
OPENROUTER_API_KEY = os.getenv('OPENROUTER_API_KEY')
if not OPENROUTER_API_KEY:
    raise ValueError("Please set OPENROUTER_API_KEY in your .env file or environment variables")

In [65]:
client = openai.OpenAI(
    api_key=OPENROUTER_API_KEY,
    base_url="https://openrouter.ai/api/v1"
)

In [77]:
system_prompt = '''You are an expert in climate finance law and policy. Your task is to identify climate finance policy instruments.

# CORE DEFINITION

A climate finance policy instrument must meet ALL four criteria:

1. TYPE: Public policies (laws, regulations, guidelines, programs, investment decisions) that govern or structure financial flows at scale

2. SOURCE: Originated by public authorities (national/subnational legislators, governments, regulatory agencies, enforcement agencies, state-owned entities)

3. TARGET: Directly influences financial flows AND the behavior of:
   - Financial institutions (banks, insurers, asset managers, pension funds, etc.)
   - Financial market participants (investors, lenders, underwriters)
   - Individuals making finance-related decisions (financial advisors, fund managers)

4. PURPOSE: Explicit climate objective to materially impact:
   - Climate mitigation (reducing GHG emissions, enhancing carbon sinks)
   - Climate adaptation (reducing vulnerability, increasing resilience to climate impacts)
   - Loss and damage compensation (addressing climate-related losses)

# KEY PRINCIPLE: FINANCIAL SECTOR FOCUS

The policy must regulate HOW MONEY FLOWS in the financial system, not just regulate emissions or environmental practices directly.

Examples of FINANCIAL targets:
- Green bond standards
- Climate risk disclosure requirements for financial institutions
- ESG investment mandates for pension funds
- Green lending quotas for banks
- Climate stress testing requirements for insurers

Examples of NON-FINANCIAL targets (EXCLUDE these):
- Emissions standards for factories
- Renewable energy mandates for utilities
- Biofuel requirements for fuel producers
- Building energy efficiency codes
- Vehicle emissions regulations

# INCLUSION RULES

INCLUDE policies that:
- Are not climate-exclusive BUT have explicit purpose to materially and systematically affect climate-related financial flows
- Example: Environmental disclosure requirements for listed firms that result in climate information being provided to investors

# EXCLUSION RULES

EXCLUDE policies that:

1. Lack explicit climate/environmental purpose:
   - General consumer protection or deceptive marketing rules (even if used in climate litigation)
   - Standard financial advisor training/suitability requirements (unless they explicitly include climate considerations)

2. Target non-financial sectors:
   - Production mandates (biofuels, renewable energy quotas for producers)
   - Operational emissions standards
   - Product bans or restrictions
   - Direct subsidies to clean tech manufacturers

3. Focus primarily on non-climate environmental goals:
   - Biodiversity reserves
   - Toxic substance prohibitions
   - Air quality standards (non-GHG pollutants)
   - (Exclude even if they have indirect climate co-benefits)

# DECISION FRAMEWORK

Ask yourself:
1. Does this policy regulate the FINANCIAL SECTOR or financial decision-making?
2. Does it aim to change WHERE and HOW money is allocated?
3. Does it have an EXPLICIT climate purpose (stated in the policy)?

If NO to any question → EXCLUDE
If YES to all questions → INCLUDE
'''

In [84]:
def get_user_prompt(chunk):
    return f"""
You are an expert in climate finance law and policy. Your goal is to identify climate finance policy instruments. Only return NAMED policy instruments.

For example, if a sentence said, "The government has implemented a policy to encourage green investments", you should return nothing. If the previous sentence said, "The government has implemented the Green Investment Act", then you would return "Green Investment Act".

Even if the policy instrument is named, but there is not enough context to identify it as related to climate finance, you should return nothing.

We define climate finance policy instruments as: 

{system_prompt}

Here is the text to analyze:
{chunk}

Please identify the climate finance policy instruments in the text. Respond only with a JSON object in one of the following two formats:

If there are no climate finance policy instruments in the text, respond with:
{{
  "has_policy_instruments": false,
  "policy_instruments": []
}}

If there are climate finance policy instruments in the text, respond with:
{{
  "has_policy_instruments": true,
  "policy_instruments": ["name of instrument 1", "name of instrument 2", ...]
}}

Do not provide any reasoning or any other information. Only respond with the JSON object as specified above.
"""

In [80]:
chunk_1 = get_user_prompt(chunked_df['chunk_text'].iloc[0])

In [81]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": chunk_1}
    ]
)

In [82]:
response_slim = response.choices[0].message.content
response_slim

'{\n  "has_policy_instruments": false,\n  "policy_instruments": []\n}'

In [76]:
for i in range(10):
    print(f"Chunk {i+1}:\n{chunked_df['chunk_text'].iloc[i]}\n")


Chunk 1:
Published on: 06/14/2019 | Edition: 114 | Section: 1 | Page: 44

 

Provides for the individualization of annual compulsory targets
to reduce greenhouse gas emissions
for the sale of fuels, within the scope of the Policy
National Biofuels Agency (RenovaBio). THE BOARD OF THE NATIONAL PETROLEUM, NATURAL GAS AND BIOFUELS AGENCY -
ANP, in the exercise of the powers conferred by art. 6 of the Internal Regulations and by art. 7th of Annex I of
Decree No. 2,455, of January 14, 1998, in view of the provisions of Law No. 9,478, of August 6,
1997, considering what is contained in Process nº 48610.003318/2018 and the deliberations taken in the 980th
Board Meeting, held on June 11, 2019, resolves:
CHAPTER I
ANNUAL TARGETS FOR REDUCING GREENHOUSE GAS EMISSIONS
Art. 1st The criteria for the individualization of annual compulsory targets are established
to reduce greenhouse gas emissions for the sale of fuels
applicable to all fuel distributors, as referred to in art. 7th of Law No. 13,576,

In [93]:
def call_llm(line):
    this_prompt = get_user_prompt(line)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": this_prompt}])
    return response
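
With more than 9,000 chunks to send, a single transient API error would stop the loop. A minimal sketch of one way to guard against that is to wrap the call in a retry with exponential backoff. The helper name `call_with_retry` and its parameters are hypothetical and not part of the original notebook; it retries on any exception, which is a simplifying assumption.

```python
import random
import time

def call_with_retry(fn, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying on any exception with exponential backoff.

    Hypothetical helper, not part of the notebook. `sleep` is injectable so the
    waiting can be skipped when testing.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus a little jitter
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)
```

Under these assumptions, the processing loop could call `call_with_retry(call_llm, this_chunk)` in place of `call_llm(this_chunk)`.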

In [97]:
results = []

In [99]:
# Resume from a previous run if a checkpoint file exists; otherwise start from scratch
import os

if os.path.exists("responses_df.csv"):
    results_df = pd.read_csv("responses_df.csv")
else:
    results_df = pd.DataFrame(columns=['file', 'country', 'chunk_text', 'response'])

# Seed `results` with the prior rows so rewriting the checkpoint does not discard them
results = results_df.to_dict('records')
# Keys that are already processed, for a fast membership test
done = set(zip(results_df['file'], results_df['country'], results_df['chunk_text']))

total_chunks = len(chunked_df)
for i in range(total_chunks):
    this_chunk = chunked_df['chunk_text'].iloc[i]
    this_file = chunked_df['file'].iloc[i]
    this_country = chunked_df['country'].iloc[i]

    # Only proceed if this chunk/file/country has not already been processed
    if (this_file, this_country, this_chunk) in done:
        continue  # Skip if found

    print(f"Processing chunk {i+1} of {total_chunks} (remaining: {total_chunks - i - 1})")
    this_response = call_llm(this_chunk)
    this_response_slim = this_response.choices[0].message.content
    print(this_response_slim)
    results.append({
        'file': this_file,
        'country': this_country,
        'chunk_text': this_chunk,
        'response': this_response_slim
    })
    done.add((this_file, this_country, this_chunk))
    # Rewrite the checkpoint after every chunk so an interruption loses at most one call
    pd.DataFrame(results).to_csv("responses_df.csv", index=False)
    results_df = pd.DataFrame(results)

Processing chunk 1 of 9302 (remaining: 9301)
{
  "has_policy_instruments": false,
  "policy_instruments": []
}
Processing chunk 2 of 9302 (remaining: 9300)
{
  "has_policy_instruments": false,
  "policy_instruments": []
}
Processing chunk 3 of 9302 (remaining: 9299)
{
  "has_policy_instruments": false,
  "policy_instruments": []
}
Processing chunk 4 of 9302 (remaining: 9298)
{
  "has_policy_instruments": false,
  "policy_instruments": []
}
Processing chunk 5 of 9302 (remaining: 9297)
{
  "has_policy_instruments": false,
  "policy_instruments": []
}
Processing chunk 6 of 9302 (remaining: 9296)
{
  "has_policy_instruments": false,
  "policy_instruments": []
}
Processing chunk 7 of 9302 (remaining: 9295)
{
  "has_policy_instruments": false,
  "policy_instruments": []
}
Processing chunk 8 of 9302 (remaining: 9294)
{
  "has_policy_instruments": true,
  "policy_instruments": ["Decree No. 8,874", "Law No. 12,431", "Law No. 13,334"]
}
Processing chunk 33 of 9302 (remaining: 9269)
{
  "has_poli

In [100]:
results_df = pd.DataFrame(results)
results_df

Unnamed: 0,file,country,chunk_text,response
0,61.pdf,Brazil,.................................................,"{\n ""has_policy_instruments"": false,\n ""poli..."
1,61.pdf,Brazil,§ 6 For the purposes of the provisions of item...,"{\n ""has_policy_instruments"": false,\n ""poli..."
2,61.pdf,Brazil,"10,387, OF JUNE 5, 2020 - DOU - National Press...","{\n ""has_policy_instruments"": false,\n ""poli..."
3,75.pdf,Brazil,Presidency of the Republic \nCivil House \nDep...,"{\n ""has_policy_instruments"": true,\n ""polic..."
4,75.pdf,Brazil,32 The application of FNMC resources may be al...,"{\n ""has_policy_instruments"": true,\n ""polic..."
...,...,...,...,...
9297,44.pdf,EU,131).’\n(5) Annex IIIb to Directive 1999/62/EC...,"{\n ""has_policy_instruments"": false,\n ""poli..."
9298,44.pdf,EU,If there is \nscientific evidence for a higher...,"{\n ""has_policy_instruments"": false,\n ""poli..."
9299,44.pdf,EU,The parts of the network subject to congestion...,"{\n ""has_policy_instruments"": false,\n ""poli..."
9300,44.pdf,EU,The resulting \ncharging structure shall be tr...,"{\n ""has_policy_instruments"": false,\n ""poli..."


In [101]:
results_df.iloc[0].response

'{\n  "has_policy_instruments": false,\n  "policy_instruments": []\n}'

In [102]:
# extract true/false (has_policy_instruments) from response for each row

import json

def extract_true_false(response_str):
    try:
        response_json = json.loads(response_str)
        return response_json.get("has_policy_instruments")
    except Exception:
        return None

# Add a new column to results_df with the extracted true/false values
results_df['has_policy_instruments'] = results_df['response'].apply(extract_true_false)
results_df.head()


Unnamed: 0,file,country,chunk_text,response,has_policy_instruments
0,61.pdf,Brazil,.................................................,"{\n ""has_policy_instruments"": false,\n ""poli...",False
1,61.pdf,Brazil,§ 6 For the purposes of the provisions of item...,"{\n ""has_policy_instruments"": false,\n ""poli...",False
2,61.pdf,Brazil,"10,387, OF JUNE 5, 2020 - DOU - National Press...","{\n ""has_policy_instruments"": false,\n ""poli...",False
3,75.pdf,Brazil,Presidency of the Republic \nCivil House \nDep...,"{\n ""has_policy_instruments"": true,\n ""polic...",True
4,75.pdf,Brazil,32 The application of FNMC resources may be al...,"{\n ""has_policy_instruments"": true,\n ""polic...",True
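
`extract_true_false` returns `None` whenever `json.loads` fails, so a reply that wraps the JSON in a Markdown code fence or adds a sentence around it would silently drop out of `true_df`. The sketch below shows one more tolerant way to parse such replies. The helper name `parse_llm_json` is hypothetical, and whether the model ever produced wrapped replies in this run is not known from the output above.

```python
import json
import re

def parse_llm_json(response_str):
    """Return the JSON object found in an LLM reply, or None if there is none.

    Hypothetical helper: tries the raw string first, then the outermost
    {...} span, which tolerates code fences and surrounding text.
    """
    if not isinstance(response_str, str):
        return None
    candidates = [response_str]
    match = re.search(r"\{.*\}", response_str, flags=re.DOTALL)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Counting the rows where `has_policy_instruments` is missing, for example with `results_df['has_policy_instruments'].isna().sum()`, is a quick way to see how many replies failed to parse at all.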


In [103]:
true_df = results_df[results_df['has_policy_instruments'] == True]

In [104]:
true_df.iloc[0].response

'{\n  "has_policy_instruments": true,\n  "policy_instruments": ["National Fund on Climate Change - FNMC"]\n}'

In [120]:
# Extract a comma-separated string of policy instruments from the 'response' column in true_df, removing brackets

def extract_policy_instruments_str(response_str):
    try:
        response_json = json.loads(response_str)
        instr_list = response_json.get("policy_instruments", [])
        # Join list into comma-separated string, or return empty string if not a list
        if isinstance(instr_list, list):
            return ', '.join(str(instr) for instr in instr_list)
        else:
            return str(instr_list)
    except Exception:
        return None

# Add a new column 'policy_instruments' (as string) to true_df.
# Work on an explicit copy so pandas does not raise SettingWithCopyWarning.
true_df = true_df.copy()
true_df['policy_instruments'] = true_df['response'].apply(extract_policy_instruments_str)

# Display the rows with policy instrument strings
true_df


Unnamed: 0,file,country,chunk_text,response,has_policy_instruments,policy_instruments
3,75.pdf,Brazil,Presidency of the Republic \nCivil House \nDep...,"{\n ""has_policy_instruments"": true,\n ""polic...",True,National Fund on Climate Change - FNMC
4,75.pdf,Brazil,32 The application of FNMC resources may be al...,"{\n ""has_policy_instruments"": true,\n ""polic...",True,FNMC
5,75.pdf,Brazil,ß 12 0 annual application plan must include: \...,"{\n ""has_policy_instruments"": true,\n ""polic...",True,"Law No. 12,114, of 2009"
6,75.pdf,Brazil,"12,114, of 2009; \n \n \n II - approve th...","{\n ""has_policy_instruments"": true,\n ""polic...",True,FNMC
11,74.pdf,Brazil,2nd The National Crop-Livestock-Forest Integra...,"{\n ""has_policy_instruments"": true,\n ""polic...",True,National Crop-Livestock-Forest Integration Policy
...,...,...,...,...,...,...
9221,2.pdf,EU,Amounts that would be due to be paid by the Co...,"{\n ""has_policy_instruments"": true,\n ""polic...",True,Regulation (EU) 2024/795
9222,2.pdf,EU,Where the amount of \na contribution agreement...,"{\n ""has_policy_instruments"": true,\n ""polic...",True,Regulation (EU) 2021/1060
9225,2.pdf,EU,"However, in order to effectively \nsupport non...","{\n ""has_policy_instruments"": true,\n ""polic...",True,"Regulation (EU) 2024/795, Regulation (EU) 2021..."
9226,2.pdf,EU,\neuropa.eu/eli/reg/2024/795/oj).’.\nArticle 1...,"{\n ""has_policy_instruments"": true,\n ""polic...",True,"Regulation (EU) 2021/241, Regulation (EU) 2021..."


In [128]:
true_df.policy_instruments.iloc[15]

'PROCEL, PBE'

In [145]:
import re
import pandas as pd

def split_policy_instruments(instr_str):
    # Split only on commas followed by whitespace and then a capital letter
    # This identifies the start of a new policy name
    # Example: "National Plan of Energy 2030, National Energy Matrix 2030"
    # Will split into: ["National Plan of Energy 2030", "National Energy Matrix 2030"]
    # But won't split: "Regulation (EU) 2021, 1060" (no capital after comma+space)
    # Caveat: a single name that contains a comma followed by a capital letter is split too
    if not isinstance(instr_str, str) or not instr_str.strip():
        return []
    pattern = r',\s+(?=[A-Z])'
    items = [item.strip() for item in re.split(pattern, instr_str) if item.strip()]
    return items

# Build a list of dicts: each with file, country, policy_instrument
rows = []
for idx, row in true_df.iterrows():
    file = row['file']
    country = row['country']
    policy_str = row['policy_instruments']
    policy_list = split_policy_instruments(policy_str)
    for pol in policy_list:
        if pol.strip():  # skip empty
            rows.append({
                "file": file,
                "country": country,
                "policy_instrument": pol.strip()
            })

policy_instruments_df = pd.DataFrame(rows)
policy_instruments_df  # this df has columns: file, country, policy_instrument


Unnamed: 0,file,country,policy_instrument
0,75.pdf,Brazil,National Fund on Climate Change - FNMC
1,75.pdf,Brazil,FNMC
2,75.pdf,Brazil,"Law No. 12,114, of 2009"
3,75.pdf,Brazil,FNMC
4,74.pdf,Brazil,National Crop-Livestock-Forest Integration Policy
...,...,...,...
1877,2.pdf,EU,Regulation (EU) 2021/241
1878,2.pdf,EU,Regulation (EU) 2021/523
1879,2.pdf,EU,Regulation (EU) 2024/795
1880,44.pdf,EU,Directive 2003/87/EC
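
Joining the instruments into one string and splitting it again on commas can break names that themselves contain a comma. An alternative is to keep the list that the model already returned and emit one row per element. The sketch below assumes records shaped like the rows of `true_df`; the helper name `instrument_rows` and the toy records are illustrative, not real results.

```python
import json

def instrument_rows(records):
    """Build one row per instrument from the parsed JSON list, with no string splitting.

    `records` is an iterable of dicts with 'file', 'country' and 'response' keys,
    for example true_df.to_dict('records'). Hypothetical helper.
    """
    rows = []
    for rec in records:
        try:
            names = json.loads(rec["response"]).get("policy_instruments", [])
        except (TypeError, ValueError, AttributeError):
            continue  # unparseable reply: skip rather than guess
        if not isinstance(names, list):
            continue
        for name in names:
            name = str(name).strip()
            if name:
                rows.append({"file": rec["file"], "country": rec["country"],
                             "policy_instrument": name})
    return rows

# Toy input (illustrative only): names with internal commas stay whole
demo = [
    {"file": "a.pdf", "country": "EU",
     "response": '{"has_policy_instruments": true, "policy_instruments": ["Council Regulation (EU, Euratom) 2020/2093"]}'},
    {"file": "b.pdf", "country": "China",
     "response": '{"has_policy_instruments": true, "policy_instruments": ["Fiscal, Taxation and Pricing Policies", "Carbon Trading System"]}'},
]
for row in instrument_rows(demo):
    print(row["country"], "|", row["policy_instrument"])
```

The resulting list of dicts can be passed to `pd.DataFrame(...)` to get the same `file`, `country`, `policy_instrument` columns as `policy_instruments_df`.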


In [146]:
b_df_policies = policy_instruments_df[policy_instruments_df['country'] == 'Brazil']
eu_df_policies = policy_instruments_df[policy_instruments_df['country'] == 'EU']
c_df_policies = policy_instruments_df[policy_instruments_df['country'] == 'China']

In [147]:
b_df_policies

Unnamed: 0,file,country,policy_instrument
0,75.pdf,Brazil,National Fund on Climate Change - FNMC
1,75.pdf,Brazil,FNMC
2,75.pdf,Brazil,"Law No. 12,114, of 2009"
3,75.pdf,Brazil,FNMC
4,74.pdf,Brazil,National Crop-Livestock-Forest Integration Policy
...,...,...,...
698,45.pdf,Brazil,National ABC Plan
699,45.pdf,Brazil,Climate Fund
700,45.pdf,Brazil,Amazon Fund
701,50.pdf,Brazil,ABC Program


In [148]:
c_df_policies

Unnamed: 0,file,country,policy_instrument
703,49.pdf,China,Carbon Trading System
704,49.pdf,China,Carbon Emission Certification System
705,49.pdf,China,Fiscal
706,49.pdf,China,Taxation and Pricing Policies
707,49.pdf,China,Investment and Financing Policies
...,...,...,...
817,40.pdf,China,Climate investment and financing pilot
818,46.pdf,China,Measures for the Management of Central Budgeta...
819,45.pdf,China,"green finance, green development policy system"
820,2.pdf,China,"financial service support, credit management a..."


In [149]:
eu_df_policies

Unnamed: 0,file,country,policy_instrument
822,49.pdf,EU,European Climate Law
823,49.pdf,EU,Carbon Farming Initiative
824,49.pdf,EU,Common Agricultural Policy
825,49.pdf,EU,InvestEU Programme
826,49.pdf,EU,EU taxonomy
...,...,...,...
1877,2.pdf,EU,Regulation (EU) 2021/241
1878,2.pdf,EU,Regulation (EU) 2021/523
1879,2.pdf,EU,Regulation (EU) 2024/795
1880,44.pdf,EU,Directive 2003/87/EC


In [150]:
import difflib

def mark_first_instance(df, policy_col='policy_instrument', threshold=0.8):
    """
    Adds a column 'first_instance' to the DataFrame, which is 1 for the first (canonical) occurrence 
    of a fuzzy-unique policy, and 0 for subsequent fuzzy duplicates.
    Args:
        df: DataFrame with a column of policy strings (policy_col)
        policy_col: string, name of policy column
        threshold: float (0-1), similarity ratio above which two entries are considered the same
    Returns:
        DataFrame with additional 'first_instance' column (1/0)
    """
    # Keep track of seen policies (canonical representatives)
    unique_policies = []
    first_instance_list = []
    for pol in df[policy_col]:
        pol_clean = pol.strip().lower()
        found = False
        for i, existing in enumerate(unique_policies):
            existing_clean = existing.strip().lower()
            # quick exact containment or equality
            if pol_clean == existing_clean or pol_clean in existing_clean or existing_clean in pol_clean:
                found = True
                # Prefer the longer string as more descriptive
                if len(pol) > len(existing):
                    unique_policies[i] = pol
                break
            # fuzzy
            ratio = difflib.SequenceMatcher(None, pol_clean, existing_clean).ratio()
            if ratio >= threshold:
                found = True
                # Prefer the longer string as more descriptive
                if len(pol) > len(existing):
                    unique_policies[i] = pol
                break
        if not found:
            unique_policies.append(pol)
            first_instance_list.append(1)  # First unique occurrence
        else:
            first_instance_list.append(0)  # Duplicate (fuzzy)
    df = df.copy()
    df['first_instance'] = first_instance_list
    return df
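
To get a feel for what the two tests inside `mark_first_instance` do, it can help to run them on individual pairs of names. The sketch below mirrors the containment check and the `difflib.SequenceMatcher` ratio in isolation. The helper names `similarity` and `is_duplicate` are hypothetical, and the example names are taken from the Brazil output above.

```python
import difflib

def similarity(a, b):
    """Case-insensitive SequenceMatcher ratio, computed as in mark_first_instance."""
    return difflib.SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def is_duplicate(a, b, threshold=0.8):
    """Mirror the two tests in mark_first_instance: containment first, then fuzzy ratio."""
    a_clean, b_clean = a.strip().lower(), b.strip().lower()
    if a_clean in b_clean or b_clean in a_clean:
        return True
    return similarity(a, b) >= threshold

pairs = [
    ("FNMC", "National Fund on Climate Change - FNMC"),  # acronym contained in full name
    ("Amazon Fund", "Climate Fund"),                     # share only part of the name
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: ratio={similarity(a, b):.2f}, duplicate={is_duplicate(a, b)}")
```

One consequence of the containment test is that a very short name matches any longer name that happens to contain it, so short acronyms are worth reviewing by hand.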


In [151]:
b_df_policies = mark_first_instance(b_df_policies)
b_df_policies.head()

Unnamed: 0,file,country,policy_instrument,first_instance
0,75.pdf,Brazil,National Fund on Climate Change - FNMC,1
1,75.pdf,Brazil,FNMC,0
2,75.pdf,Brazil,"Law No. 12,114, of 2009",1
3,75.pdf,Brazil,FNMC,0
4,74.pdf,Brazil,National Crop-Livestock-Forest Integration Policy,1


In [165]:
# Select all rows for file '44.pdf' with boolean indexing
# (.iloc only accepts integer positions, so it cannot take a file name)
b_df_policies[b_df_policies['file'] == '44.pdf']

Unnamed: 0,file,country,policy_instrument,first_instance


In [155]:
b_df_policies.first_instance.value_counts()

first_instance
0    449
1    254
Name: count, dtype: int64

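After fuzzy de-duplication, 254 of the 703 Brazilian mentions are flagged as distinct policy instruments. Treat the count as approximate: differently worded names for what appears to be the same instrument (for example 'Energy Efficiency Law' and 'Law of Energy Efficiency (nº 10,295/01)') can survive the similarity check. Let's see what they are.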

In [157]:
# List the policy instruments identified in Brazil (rows where first_instance is 1)
b_policy_list = b_df_policies[b_df_policies['first_instance'] == 1]['policy_instrument'].tolist()
b_policy_list


['National Fund on Climate Change - FNMC',
 'Law No. 12,114, of 2009',
 'National Crop-Livestock-Forest Integration Policy',
 'Decree No. 8,874',
 'Law No. 12,431',
 'Law No. 13,334',
 'PNRS instruments',
 'Decarbonization Credit',
 'National Plan of Energy 2030 - PNE 2030',
 'National Energy Matrix 2030 - MEN 2030',
 'PROINFA',
 'Brazilian Labeling Law',
 'Energy Efficiency Law',
 'National Biodiesel Program',
 'National Alcohol Program – PROALCOOL',
 'Law No. 11,241, of September 19, 2002',
 'PROCEL',
 'PBE',
 'National Rationalization Program - CONPET',
 'Law of Energy Efficiency (nº 10,295/01)',
 'Law nº 9,991/00',
 'CONPET Seal',
 'National Energy Efficiency Policy',
 'CIMGC Resolution No. 9 of 03/20/2009',
 'Program of Activities (PoA)',
 'Clean Development Mechanism (CDM)',
 'Action Plan for Prevention and Control of Deforestation and Fires in the Cerrado (PPCerrado)',
 'Programa Agricultura de Baixo Carbono',
 'Estratégia Nacional para REDD+',
 'Política Nacional sobre Mudança 

In [158]:
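def get_policy_list(df, country):
    """Return the fuzzy-unique policy names in df (rows where first_instance == 1).

    `country` is not used for filtering; df is expected to be pre-filtered to one country.
    """
    df = mark_first_instance(df)
    policy_list = df[df['first_instance'] == 1]['policy_instrument'].tolist()
    return policy_list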

b_policy_list = get_policy_list(b_df_policies, 'Brazil')
c_policy_list = get_policy_list(c_df_policies, 'China')
eu_policy_list = get_policy_list(eu_df_policies, 'EU')

In [161]:
c_policy_list

['Carbon Trading System',
 'Carbon Emission Certification System',
 'Fiscal',
 'Taxation and Pricing Policies',
 'Investment and Financing Policies',
 "China's National Climate Change Plan",
 'Twelfth Five-Year Plan for Controlling Greenhouse Gas Emissions',
 'National Climate Adaptation Plan',
 'Climate Change Strategy',
 'low-carbon provincial pilot projects, low-carbon development plans',
 'Greenhouse Gas Voluntary Emissions Reduction Trading Management Office Law, national carbon emissions trading market, carbon emissions trading pilot',
 'Low Carbon Product Certification System',
 'Mandatory Government Green and Low-Carbon Procurement Policies',
 'carbon emission access thresholds for key industries, green credit institutions system, carbon finance development model, policy investment and financing institutions for low-carbon development',
 'national carbon emissions trading market, carbon emissions trading management measures, carbon emissions trading certification system',
 'Low

In [163]:
eu_policy_list

['European Climate Law',
 'Carbon Farming Initiative',
 'Common Agricultural Policy',
 'InvestEU Programme',
 'EU taxonomy',
 'Regulation (EU) 2018/1999 on the Governance of the Energy Union and Climate Action',
 'Directive 2003/87/EC',
 'Regulation (EU) 2018/842',
 'Regulation (EU) 2021/1119',
 'Just Transition Mechanism',
 'Just Transition Fund',
 'Council Regulation (EU',
 'Euratom) 2020/2093',
 'Euratom) 2018/1046',
 'European Investment Bank Facility',
 'Territorial Just Transition Plan',
 'InvestEU Advisory Hub',
 'Union support under the Facility',
 'Private Finance for Energy Efficiency (PF4EE)',
 'Natural Capital Financing Facility (NCFF)',
 'Regulation (EC) No 680/2007',
 'Commission Decision of 25.2.2010 on European Union participation in the 2020 European Fund for Energy',
 'Climate Change and Infrastructure',
 'European Energy Efficiency Fund (EEEF): Regulation (EU) No 1233/2010',
 'Union Emissions Trading System',
 'EU Strategy on adaptation to climate change',
 'European