# Preprocessing text
Once text has been parsed from PDF, we may want to clean it to improve indexing and recall. The major tasks are:
- replace abbreviations
- resolve coreferences. Note that the pace of NLP progress bites us here: the major coreference resolution libraries are incompatible with my development environment, and the incompatibility is unresolvable. I created a separate project, `world-bank-kg-coref`, to explore options, but ultimately I'd strongly prefer a single environment for this project.
- summarize
- extract keywords


We're using the MinerU output from 01_parse-pdf.ipynb.

In [9]:
import re
import json
from pprint import pprint
from dotenv import load_dotenv
load_dotenv(dotenv_path="../secrets/.env")

from llama_index.readers.file.markdown import MarkdownReader
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.core import Document, StorageContext, load_index_from_storage
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

## Load vector database
Before proceeding with this notebook, run the `scripts/build-vector-store.py` script to create a "test" collection in a locally persisted Chroma database.

In [13]:
chroma_client = chromadb.PersistentClient(path="../chroma_db")
collection = chroma_client.get_or_create_collection("test")

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(
    persist_dir="../storage",
    vector_store=vector_store
)

index = load_index_from_storage(storage_context)

Loading llama_index.core.storage.kvstore.simple_kvstore from ../storage/docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from ../storage/index_store.json.


## Abbreviations
Abbreviations are often captured in a section at the beginning of the document called Abbreviations or Acronyms. We can query the vector database to find the likely nodes that contain acronyms.

In [14]:
query_str = "acronyms abbreviations list of acronyms glossary"
retriever = index.as_retriever(similarity_top_k=5)

nodes = retriever.retrieve(query_str)

for node in nodes:
    print(f"[Score: {node.score:.2f}] {node.node.text[:300]}...\n")

[Score: 0.61] Annex 14: Country at a Glance...

[Score: 0.60] P RIC E\$ and GOVERN M ENT FINANCE...

[Score: 0.59] This table provides data on various economic indicators and statistics over different years, including domestic prices, consumer prices, GDP deflator, government finance, trade, balance of payments, external debt, and resource flows.,
with the following table title:
Economic Indicators and Statistic...

[Score: 0.59] The table provides economic and social indicators for Mexico, focusing on aspects like poverty, development, GDP, economic ratios, and long-term trends. It includes data on population, life expectancy, GNI per capita, labor force, education, economic growth, structure of the economy, and more.,
with...

[Score: 0.59] Document of The World Bank

FOR OFFICIAL USE ONLY

PROJECT APPRAISAL DOCUMENT

ON A

PROPOSED PURCHASE OF EMISSIONS REDUCTIONS

BY THE SPANISH CARBON FUND AND THE BIO CARBON FUND

IN THE AMOUNT OF US\$ 17,473,211

FROM THE

COMISION FEDERAL DE ELE

In [15]:
# 1. Build a query engine over the index
query_engine = index.as_query_engine(similarity_top_k=5)

# 2. Ask explicitly for abbreviation/glossary sections
response = query_engine.query(
    "Find sections of the document that define acronyms or abbreviations. "
    "These sections may be called 'Abbreviations', 'Acronyms', or 'List of Acronyms'."
)

# 3. Print results + source nodes
print("ANSWER:\n", response.response)

print("\n--- MATCHED SECTIONS ---")
for node in response.source_nodes:
    print(f"[Score: {node.score:.2f}] {node.node.text[:300]}...\n")

ANSWER:
 The document contains a table that defines acronyms or abbreviations.

--- MATCHED SECTIONS ---
[Score: 0.60] Annex 14: Country at a Glance...

[Score: 0.58] P RIC E\$ and GOVERN M ENT FINANCE...

[Score: 0.58] Table detailing the Environmental Impacts of a Construction Project including effects on air quality, noise, soil, water, vegetation, and fauna.,
with the following table title:
General Description of Environmental Impacts,
with the following columns:
- PHASE: None
- AIR QUALITY: None
- SOIL -GROUND...

[Score: 0.58] Arrangements for results monitoring...

[Score: 0.58] Table showing the names, titles, and units of various professionals in different roles.,
with the following columns:
- Name: None
- Title: None
- Unit: None

<table><tr><td>Name</td><td>Title</td><td>Unit</td></tr><tr><td>Demetrios Papathanasiou</td><td>Energy Economist-Task Manager</td><td>LCSFE</t...



In [17]:
docstore = storage_context.docstore

for i, (node_id, node) in enumerate(docstore.docs.items(), 1):
    print(f"Node {i} (ID: {node_id}):\n")
    print(node.text)
    print("="*80)

Node 1 (ID: ee42bee1-6a8c-4d04-a090-c74807ea863f):

Table showing key personnel roles and names within an organization.,
with the following columns:
- Role: The position within the organization
- Name: The individual holding the position

Node 2 (ID: 73aaeb3c-a866-4da3-8e7f-032e23544b82):

This table provides information on the financing plan and project sponsor for a carbon finance project related to renewable energy and environmental themes. It includes details on the sources of financing, project sponsor contributions, and estimated payments for emissions reductions over the project implementation period from 2007 to 2019.,
with the following columns:
- Source: None
- Local: None
- Foreign: None
- Total: None

Node 3 (ID: 9fe2ea7b-b8ec-465a-a0e1-85b3d13457cc):

Table Title: Risk Assessment and Readiness Criteria

Table ID: N/A

Keep Table: Yes,
with the following columns:

Node 4 (ID: b40abea3-24ce-4d09-b0ed-e69be616dcbe):

This table contains information with restricted distributio

In [18]:
from pathlib import Path

file_path = Path('../output/test/auto/test.md')

with open(file_path, "r") as f:
    md_string = f.read()

doc = Document(
    text=md_string,
    metadata={"source": str(file_path)})

parser = MarkdownElementNodeParser(
    include_metadata=True, 
    include_prev_next_rel=True
)

nodes = parser.get_nodes_from_documents([doc])

32it [00:00, 121244.56it/s]


In [20]:
len(nodes)

108

In [6]:
import re
from typing import Dict, List


def extract_acronyms_from_markdown(md_text: str) -> Dict[str, str]:
    lines = md_text.splitlines()
    glossary = {}

    # Step 1: Locate 'ABBREVIATIONS AND ACRONYMS' section
    try:
        start_idx = next(
            i for i, line in enumerate(lines)
            if re.search(r'#\s*(ABBREVIATIONS|ACRONYMS)', line, re.IGNORECASE)
        )
    except StopIteration:
        return glossary

    # Find end of section: the next header more than 3 lines below the title, or 50 lines max
    end_idx = start_idx + 1
    for i in range(start_idx + 1, min(len(lines), start_idx + 50)):
        if lines[i].strip().startswith('#') and i > start_idx + 3:
            break
        end_idx = i

    section_lines = [line.strip() for line in lines[start_idx + 1:end_idx + 1] if line.strip()]

    # Heuristic: split acronyms/definitions into two columns
    half = len(section_lines) // 2
    acronyms = section_lines[:half]
    definitions = section_lines[half:]

    for a, d in zip(acronyms, definitions):
        if a and d:
            glossary[a.strip()] = d.strip()

    # Step 2: Look for headers like "# ABC" with definitions in next line
    for i, line in enumerate(lines):
        if re.fullmatch(r'#\s+([A-Z]{2,10})', line.strip()):
            acronym = line.strip().replace('#', '').strip()
            if i + 1 < len(lines):
                defn = lines[i + 1].strip()
                # Be cautious of repeating or short filler lines
                if len(defn.split()) >= 3:
                    glossary[acronym] = defn

    return glossary


In [21]:
def is_acronym(line: str) -> bool:
    """
    We use the heuristic that an acronym is 3 - 11 characters and > 50% uppercase 
    (following Schwartz & Hearst, 2003).
    """
    stripped = line.strip()
    if len(stripped) > 12 or len(stripped) < 2:
        return False
    
    # Allow spaces (e.g., 'IMN G') and mixed characters
    chars = [c for c in stripped if c.isalpha()]
    if not chars:
        return False
    uppercase_ratio = sum(c.isupper() for c in chars) / len(chars)
    return uppercase_ratio >= 0.5


def handle_multiline_acronym_definitions(definition_lines):
    """
    Helper function for acronym definitions across multiple lines

    If the line starts with an uppercase letter (^[A-Z]) and the buffer is not empty:
    - We assume a new definition is starting.
    - So, the current buffer (previous definition) is joined and saved.
    - Start a new buffer with the new line.

    Else:
    - We assume the line is a continuation of the previous definition.
    - Add it to the current buffer.
    """
    merged_definitions = []
    buffer = []
    for line in definition_lines:
        if re.match(r'^[A-Z]', line) and buffer:
            merged_definitions.append(" ".join(buffer).strip())
            buffer = [line]
        else:
            buffer.append(line)
    if buffer:
        merged_definitions.append(" ".join(buffer).strip())
    return merged_definitions


def extract_acronym_glossary(md_string: str) -> Dict[str, str]:
    """
    Returns a dictionary of acronyms from the "Acronyms" and/or "Abbreviations" section.

    Assumes the acronyms and their definitions occur in the same order in the text 
    (side-by-side or in two columns)
    Stops looking for acronyms after next header is found.
    """
    lines = md_string.splitlines()
    try:
        start_idx = next(
            i for i, line in enumerate(lines)
            if re.search(r'#\s*(ABBREVIATION|ACRONYM)', line, re.IGNORECASE)
        )
    except StopIteration:
        print("Acronym section not found")
        return {}

    end_idx = start_idx + 1
    for i in range(start_idx + 1, min(len(lines), start_idx + 200)):
        if lines[i].strip().startswith('#') and i > start_idx + 3:
            break
        end_idx = i

    # Extract lines and remove empty ones (end_idx is the last line of the section, so include it)
    section_lines = [line.strip() for line in lines[start_idx + 1:end_idx + 1] if line.strip()]

    # Heuristic: acronym lines are short and mostly uppercase
    acronym_lines = [line for line in section_lines if is_acronym(line)]
    definition_lines = [line for line in section_lines if line not in acronym_lines]

    # Handle multi-line definitions
    merged_definitions = handle_multiline_acronym_definitions(definition_lines)

    # Align and return
    glossary = {
        k.strip(): v.strip()
        for k, v in zip(acronym_lines, merged_definitions)
    }
    return glossary

In [None]:
lines = md_string.splitlines()
try:
    start_idx = next(
        i for i, line in enumerate(lines)
        if re.search(r'#\s*(ABBREVIATION|ACRONYM)', line, re.IGNORECASE)
    )
except StopIteration:
    print("Acronym section not found")

end_idx = start_idx + 1
for i in range(start_idx + 1, min(len(lines), start_idx + 200)):
    if lines[i].strip().startswith('#') and i > start_idx + 3:
        break
    end_idx = i

# Extract lines and remove empty ones
section_lines = [line.strip() for line in lines[start_idx + 1:end_idx] if line.strip()]

# Heuristic: acronym lines are all uppercase and short
acronym_lines = [line for line in section_lines if is_acronym(line)]
definition_lines = [line for line in section_lines if line not in acronym_lines]

# Handle multi-line definitions
merged_definitions = handle_multiline_acronym_definitions(definition_lines)

# Align and return
glossary = {
    k.strip(): v.strip()
    for k, v in zip(acronym_lines, merged_definitions)
}
glossary

In [92]:
def extract_acronym_section(md_string: str) -> str:
    lines = md_string.splitlines()
    try:
        start_idx = next(
            i for i, line in enumerate(lines)
            if re.search(r'#\s*(ABBREVIATION|ACRONYM)', line, re.IGNORECASE)
        )
    except StopIteration:
        print("Acronym section not found")
        return ""
        
    end_idx = start_idx + 1
    for i in range(start_idx + 1, min(len(lines), start_idx + 200)):
        if lines[i].strip().startswith('#') and i > start_idx + 3:
            break
        end_idx = i

    return ' '.join(lines[start_idx : end_idx])

In [81]:
text = """
# Acronyms
BLT
BM
CO2e
GoM
Build-Lease-Transfer
Build Margin emission factor
Carbon Dioxide equivalent
Government of Mexico
"""

In [83]:
text = """
# Acronyms
BLT Build-Lease-Transfer 
BM Build Margin emission factor
CO2e Carbon Dioxide equivalent
GoM Government of Mexico
"""

In [85]:
text = """
# Acronyms BLT Build-Lease-Transfer BM Build Margin emission factor CO2e Carbon Dioxide equivalent GoM Government of Mexico
"""

In [93]:
# The function returns the whole section joined into a single string
acronym_section = extract_acronym_section(md_string)
acronym_section

'# ABBREVIATIONS AND ACRONYMS  BLT    BM    BOT    CAS    CDM    CENACE    CER    CFE    CM    CO2    CO2e    DOE    EMP    ER    ERPA    GEF    GoM    GHG    GW    GWh    IMN G    INEGI    IPER    IPP    IRR    MW    NPV    OM    O&M    OPF    PEMEX    PIDIREGAS    Build-Lease-Transfer    Build Margin emission factor    Build-Operate-Transfer    Country Assistance Strategy    Clean Development Mechanism    National Center of Energy Control (Centro Nacional de Control de Energía)    Certified Emissions Reduction    National Electric Commission (Comisión Nacional de Electricidad)    Combined Margin emission factor    Carbon Dioxide    Carbon Dioxide equivalent    Designated Operational Entity    Environmental Management Plan    Emissions Reduction    Emissions Reduction Purchase Agreement    Global Environment Facility    Government of Mexico    Greenhouse Gas    Gigawatt    Gigawatthour    Interconnected Mexican National Grid    National Institute of Statistics, Geography and Computer 

With OpenAI: pass the extracted section to a chat model and ask for a JSON dictionary of acronyms.

In [91]:
from openai import OpenAI
client = OpenAI()

prompt = """
Extract a dictionary of acronyms and their definitions from the following text.

Return as a valid JSON dictionary like: {"ABC": "Definition of ABC", ...}

Text:
""" + extract_acronym_section(md_string)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert at understanding document formatting and extracting structured acronym definitions."},
        {"role": "user", "content": prompt}
    ],
    temperature=0,
)

acronym_dict = json.loads(response.choices[0].message.content)
acronym_dict

{'BLT': 'Build-Lease-Transfer',
 'BM': 'Build Margin emission factor',
 'BOT': 'Build-Operate-Transfer',
 'CAS': 'Country Assistance Strategy',
 'CDM': 'Clean Development Mechanism',
 'CENACE': 'National Center of Energy Control (Centro Nacional de Control de Energía)',
 'CER': 'Certified Emissions Reduction',
 'CFE': 'National Electric Commission (Comisión Nacional de Electricidad)',
 'CM': 'Combined Margin emission factor',
 'CO2': 'Carbon Dioxide',
 'CO2e': 'Carbon Dioxide equivalent',
 'DOE': 'Designated Operational Entity',
 'EMP': 'Environmental Management Plan',
 'ER': 'Emissions Reduction',
 'ERPA': 'Emissions Reduction Purchase Agreement',
 'GEF': 'Global Environment Facility',
 'GoM': 'Government of Mexico',
 'GHG': 'Greenhouse Gas',
 'GW': 'Gigawatt',
 'GWh': 'Gigawatthour',
 'IMN G': 'Interconnected Mexican National Grid',
 'INEGI': 'National Institute of Statistics, Geography and Computer Science (Instituto Nacional de Estadística, Geografía e Informática)',
 'IPER': 'Infr
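
The `json.loads` call above works when the model returns bare JSON, but chat models sometimes wrap the object in a Markdown code fence or add a sentence around it. The sketch below shows one way to guard against that. `parse_json_dict` is a hypothetical helper, and the assumption is that the reply contains exactly one JSON object.

```python
import json
import re
from typing import Dict


def parse_json_dict(reply: str) -> Dict[str, str]:
    """Parse a model reply that should contain a single JSON object."""
    # Keep only the outermost {...} span; this drops code fences or
    # prose that the model may have placed around the JSON.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    parsed = json.loads(match.group(0))
    return {str(key): str(value) for key, value in parsed.items()}


# Simulated reply wrapped in a Markdown code fence (no API call needed).
fence = "`" * 3
fenced_reply = (
    fence + "json\n"
    '{"BLT": "Build-Lease-Transfer", "GoM": "Government of Mexico"}\n'
    + fence
)
parsed_reply = parse_json_dict(fenced_reply)
print(parsed_reply)
```

In the notebook this would replace the direct call, for example `acronym_dict = parse_json_dict(response.choices[0].message.content)`.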

In [1]:
text = """
The Carbon Dioxide equivalent (CO2e) was calculated using standard metrics. Later in the document, CO2e is used repeatedly.
The Certified Emissions Reduction (CER) units are issued for validated projects. CERs are tradable.
"""

In [7]:
fpath = "../output/test/auto/test.md"
with open(fpath, "r") as f:
    md_string = f.read()

In [9]:
import scispacy
import spacy
from scispacy.abbreviation import AbbreviationDetector
from pprint import pprint

In [10]:
# Load SciSpacy model
nlp = spacy.load("en_core_sci_sm")

# Add abbreviation detector to the pipeline
nlp.add_pipe("abbreviation_detector")

doc = nlp(md_string)

# Extract abbreviations
abbreviations = {}
for abrv in doc._.abbreviations:
    abbreviations[abrv.text] = abrv._.long_form.text

pprint(abbreviations)

{'A.C.': 'available in Project files',
 'AWEA': 'American Wind Energy Association',
 'BLT': 'Build, Lease and Transfer',
 'BM': 'build margin emission factor',
 'CAS': 'Country Assistance Strategy',
 'CDM': 'Clean Development Mechanism',
 'CENACE': 'controlled by the National Center for Energy Control',
 'CFE': 'Comisi6n Federal de Electricidad',
 'CM': 'combined margin',
 'Control': 'Control de Energía',
 'DOE': 'Designated Operational Entity',
 'EA': 'Environmental Assessment',
 'EIA': 'Environmental Impact Assessment',
 'EMP': 'Environmental Management Plan',
 'ER': 'emissions reductions',
 'ERPA': 'Emissons Reduction Purchase Agreement',
 'ERs': 'Emissions Reductions',
 'Forests': 'Forestal Sustentable',
 'GEF': 'Global Environment Facility',
 'GoM': 'government of Mexico',
 'IMNG': 'Interconnected Mexican National Grid',
 'IPPs': 'Independent Power Producers',
 'JI': 'Joint Implementation',
 'LFC': 'Luz y Fuerza del Centro',
 'LoI': 'letter of intention',
 'MP': 'Monitoring Plan',

In [11]:
acronym_dict = {'BLT': 'Build-Lease-Transfer',
 'BM': 'Build Margin emission factor',
 'BOT': 'Build-Operate-Transfer',
 'CAS': 'Country Assistance Strategy',
 'CDM': 'Clean Development Mechanism',
 'CENACE': 'National Center of Energy Control (Centro Nacional de Control de Energía)',
 'CER': 'Certified Emissions Reduction',
 'CFE': 'National Electric Commission (Comisión Nacional de Electricidad)',
 'CM': 'Combined Margin emission factor',
 'CO2': 'Carbon Dioxide',
 'CO2e': 'Carbon Dioxide equivalent',
 'DOE': 'Designated Operational Entity',
 'EMP': 'Environmental Management Plan',
 'ER': 'Emissions Reduction',
 'ERPA': 'Emissions Reduction Purchase Agreement',
 'GEF': 'Global Environment Facility',
 'GoM': 'Government of Mexico',
 'GHG': 'Greenhouse Gas',
 'GW': 'Gigawatt',
 'GWh': 'Gigawatthour',
 'IMN G': 'Interconnected Mexican National Grid',
 'INEGI': 'National Institute of Statistics, Geography and Computer Science (Instituto Nacional de Estadística, Geografía e Informática)',
 'IPER': 'Infrastructure Public Expenditure Review',
 'IPP': 'Independent Power Producers',
 'IRR': 'Internal Rate of Return',
 'MW': 'Megawatt',
 'NPV': 'Net Present Value',
 'OM': 'Operating Margin emission factor',
 'O&M': 'Operation and Maintenance',
 'OPF': 'Publicly Finance Works (Obra Pública Financiada)',
 'PEMEX': 'Mexican Petroleum (Petróleos Mexicanos)',
 'PIDIREGAS': 'Projects with Deferred Impact in the Budgetary Registry (Proyectos de Impacto Diferido en el Registro de Gasto)'}

In [14]:
def merge_acronym_dicts(primary: dict, detected: dict) -> dict:
    """Merge two acronym dictionaries with a warning on conflicting definitions.

    Args:
        primary (dict): Existing acronym glossary (e.g., from acronym section)
        detected (dict): Acronyms detected from full text using SciSpacy

    Returns:
        dict: Merged dictionary with priority to `primary`
    """
    merged = primary.copy()
    
    for abbr, definition in detected.items():
        if abbr in merged:
            if merged[abbr] != definition:
                print(f"⚠️ Warning: Conflict for acronym '{abbr}':")
                print(f"    Primary:  {merged[abbr]}")
                print(f"    Detected: {definition}")
        else:
            print(f'➕ {abbr}: {definition}')
            merged[abbr] = definition

    return merged


merged_acronyms = merge_acronym_dicts(acronym_dict, abbreviations)

➕ Control: Control de Energía
➕ SCF: Spanish Carbon Fund
    Primary:  National Electric Commission (Comisión Nacional de Electricidad)
    Detected: Comisi6n Federal de Electricidad
➕ ERs: Emissions Reductions
    Primary:  Emissions Reduction Purchase Agreement
    Detected: Emissons Reduction Purchase Agreement
➕ LFC: Luz y Fuerza del Centro
➕ IPPs: Independent Power Producers
    Primary:  Build-Lease-Transfer
    Detected: Build, Lease and Transfer
➕ iii: ii) cogeneration;
    Primary:  Government of Mexico
    Detected: government of Mexico
➕ UNFCCC: United Nations Framework Convention on Climate Change
➕ JI: Joint Implementation
➕ ii: inequality;
➕ VERs: Verified Emissions Reductions
    Primary:  National Center of Energy Control (Centro Nacional de Control de Energía)
    Detected: controlled by the National Center for Energy Control
➕ LoI: letter of intention
➕ MP: Monitoring Plan
    Primary:  Emissions Reduction
    Detected: emissions reductions
➕ EIA: Environmental Impact
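
Some of the conflicts reported above differ only in case or punctuation (for example "Government of Mexico" versus "government of Mexico"). One option, sketched below under that assumption, is to compare normalized definitions so that only substantive differences are flagged. `normalize_definition` and `same_definition` are hypothetical helpers, not part of the notebook.

```python
import re


def normalize_definition(defn: str) -> str:
    """Case-fold, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", defn.casefold())
    return " ".join(text.split())


def same_definition(a: str, b: str) -> bool:
    """True when two definitions differ only in case, punctuation, or spacing."""
    return normalize_definition(a) == normalize_definition(b)


print(same_definition("Government of Mexico", "government of Mexico"))
print(same_definition(
    "Emissions Reduction Purchase Agreement",
    "Emissons Reduction Purchase Agreement",
))
```

Inside `merge_acronym_dicts`, the check `merged[abbr] != definition` could then become `not same_definition(merged[abbr], definition)`. Genuine differences, such as the misspelled "Emissons" or a singular versus plural form, would still be reported.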

In [18]:
import html
import re

def clean_acronyms(acronym_dict: dict, min_upper_ratio: float = 0.5) -> dict:
    """Clean an acronym dictionary by decoding HTML entities in definitions 
    and filtering acronyms that are not sufficiently uppercase.

    Args:
        acronym_dict (dict): Dictionary of acronyms and their definitions.
        min_upper_ratio (float): Minimum ratio of uppercase letters required to keep the acronym.

    Returns:
        dict: Cleaned acronym dictionary.
    """
    cleaned = {}

    for abbr, defn in acronym_dict.items():
        # Remove acronyms that are empty or outside the 2 - 11 character range
        if not abbr or len(abbr) > 11 or len(abbr) < 2:
            continue

        num_upper = sum(1 for c in abbr if c.isupper())
        ratio_upper = num_upper / len(abbr)

        if ratio_upper < min_upper_ratio:
            continue

        # Decode HTML entities in definition
        cleaned_defn = html.unescape(defn).strip()
        cleaned[abbr] = cleaned_defn

    return cleaned

merged_acronyms = clean_acronyms(merged_acronyms)
merged_acronyms

{'BLT': 'Build-Lease-Transfer',
 'BM': 'Build Margin emission factor',
 'BOT': 'Build-Operate-Transfer',
 'CAS': 'Country Assistance Strategy',
 'CDM': 'Clean Development Mechanism',
 'CENACE': 'National Center of Energy Control (Centro Nacional de Control de Energía)',
 'CER': 'Certified Emissions Reduction',
 'CFE': 'National Electric Commission (Comisión Nacional de Electricidad)',
 'CM': 'Combined Margin emission factor',
 'CO2': 'Carbon Dioxide',
 'CO2e': 'Carbon Dioxide equivalent',
 'DOE': 'Designated Operational Entity',
 'EMP': 'Environmental Management Plan',
 'ER': 'Emissions Reduction',
 'ERPA': 'Emissions Reduction Purchase Agreement',
 'GEF': 'Global Environment Facility',
 'GoM': 'Government of Mexico',
 'GHG': 'Greenhouse Gas',
 'GW': 'Gigawatt',
 'GWh': 'Gigawatthour',
 'IMN G': 'Interconnected Mexican National Grid',
 'INEGI': 'National Institute of Statistics, Geography and Computer Science (Instituto Nacional de Estadística, Geografía e Informática)',
 'IPER': 'Infr

In [46]:
import spacy
from spacy.pipeline import EntityRuler
from spacy.language import Language

text = """
The Mexican electricity sector is dominated by two state-owned companies: the Federal
Electricity Commission (CFE), which serves most of the Mexican territory; and Luz y Fuerza del
Centro (LFC), which is responsible for providing electricity services in the area of Mexico City
and its surroundings. CFE is a vertically integrated electricity company that controls the
majority of generation, the transmission system and energy dispatching functions, while it is also
responsible for the distribution and commercialization of electricity.
"""

nlp = spacy.load("en_core_sci_sm")

ruler = nlp.add_pipe("entity_ruler", before="ner")
patterns = [
    {"label": "ACRONYM", "pattern": [{"LOWER": abbr.lower()}], "id": abbr}
    for abbr in merged_acronyms
]
ruler.add_patterns(patterns)

@Language.component("replace_acronyms")
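def replace_acronyms(doc):
    # Map the index of each matched token to its expanded form.
    # The ruler matches case-insensitively (LOWER), but the glossary lookup below uses
    # the exact surface text, so only matches with the glossary's casing are expanded.
    replacements = {}
    for ent in doc.ents:
        if ent.label_ == "ACRONYM":
            abbr = ent.text
            full = merged_acronyms.get(abbr)
            if full:
                replacements[ent.start] = f"{full} ({abbr})"

    # Rebuild the token list, swapping in the expanded forms.
    new_tokens = []
    for i, token in enumerate(doc):
        if i in replacements:
            new_tokens.append(replacements[i])
        else:
            new_tokens.append(token.text)

    # No `spaces` argument is passed, so tokens are re-joined with single spaces
    # and the original whitespace is not preserved.
    return spacy.tokens.Doc(doc.vocab, words=new_tokens)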

nlp.add_pipe("replace_acronyms", last=True)

doc = nlp(md_string)
print(doc.text)

Document of The World Bank 

 FOR OFFICIAL USE ONLY 

 PROJECT APPRAISAL DOCUMENT 

 ON A 

 PROPOSED PURCHASE OF EMISSIONS REDUCTIONS 

 BY THE SPANISH CARBON FUND AND THE BIO CARBON FUND 

 IN THE AMOUNT OF US\$ 17,473,211 

 FROM THE 

 COMISION FEDERAL DE ELECTRICIDAD ( MEXICO ) 

 FOR THE 

 WIND UMBRELLA ( LA VENTA II ) PROJECT 

 April 24 , 2006 

 # CURRENCY EQUIVALENTS 

 ( Exchange Rate Effective { January 2006 } ) Currency Unit $ = $ Mexican Peso 1 Mexican Peso $ = $ $ \mathrm { U S } \$ 0.095 $ $ \begin{array } { l l l } { 1 \mathrm { U S } \mathbb { S } } & { = } & { 0 . 7 9 1 6 \in } \end{array}$ FISCAL YEAR January 1 December 31 

 # ABBREVIATIONS AND ACRONYMS 

 Build-Lease-Transfer (BLT)   
 Build Margin emission factor (BM)   
 Build-Operate-Transfer (BOT)   
 Country Assistance Strategy (CAS)   
 Clean Development Mechanism (CDM)   
 National Center of Energy Control (Centro Nacional de Control de Energía) (CENACE)   
 Certified Emissions Reduction (CER)   
 National
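
The output above shows two side effects of the token-based approach: punctuation is padded with spaces (`( MEXICO )`), and the glossary section itself is rewritten. As an alternative, here is a minimal regex-based sketch that leaves the original whitespace untouched. `expand_acronyms` is a hypothetical helper; it matches case-sensitively and, as a design assumption, skips acronyms that sit directly inside parentheses, because those usually follow their own definition already.

```python
import re
from typing import Dict


def expand_acronyms(text: str, glossary: Dict[str, str]) -> str:
    """Rewrite each glossary acronym as 'definition (ACRONYM)'."""
    if not glossary:
        return text
    # Longest keys first, so that 'CO2e' is tried before 'CO2'.
    keys = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w(])(" + "|".join(re.escape(k) for k in keys) + r")(?![\w)])"
    )
    return pattern.sub(
        lambda m: f"{glossary[m.group(1)]} ({m.group(1)})", text
    )


sample_glossary = {
    "CDM": "Clean Development Mechanism",
    "CO2": "Carbon Dioxide",
    "CO2e": "Carbon Dioxide equivalent",
    "O&M": "Operation and Maintenance",
}
sample = "CDM projects report CO2e totals.\n  O&M costs and CO2 prices vary."
expanded = expand_acronyms(sample, sample_glossary)
print(expanded)
```

Plural forms such as `CERs` are not expanded by this pattern, since the trailing `s` counts as a word character. To avoid rewriting the glossary itself, the function could be applied only to the text outside the acronym section.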

In [49]:
merged_acronyms

{'BLT': 'Build-Lease-Transfer',
 'BM': 'Build Margin emission factor',
 'BOT': 'Build-Operate-Transfer',
 'CAS': 'Country Assistance Strategy',
 'CDM': 'Clean Development Mechanism',
 'CENACE': 'National Center of Energy Control (Centro Nacional de Control de Energía)',
 'CER': 'Certified Emissions Reduction',
 'CFE': 'National Electric Commission (Comisión Nacional de Electricidad)',
 'CM': 'Combined Margin emission factor',
 'CO2': 'Carbon Dioxide',
 'CO2e': 'Carbon Dioxide equivalent',
 'DOE': 'Designated Operational Entity',
 'EMP': 'Environmental Management Plan',
 'ER': 'Emissions Reduction',
 'ERPA': 'Emissions Reduction Purchase Agreement',
 'GEF': 'Global Environment Facility',
 'GoM': 'Government of Mexico',
 'GHG': 'Greenhouse Gas',
 'GW': 'Gigawatt',
 'GWh': 'Gigawatthour',
 'IMN G': 'Interconnected Mexican National Grid',
 'INEGI': 'National Institute of Statistics, Geography and Computer Science (Instituto Nacional de Estadística, Geografía e Informática)',
 'IPER': 'Infr

## Extract keywords

In [None]:
from keybert import KeyBERT
from typing import List

def extract_keywords_from_chunks(chunks: List[str], top_n: int = 5, diversity: float = 0.5):
    model = KeyBERT(model="all-MiniLM-L6-v2")

    all_keywords = []
    for i, chunk in enumerate(chunks):
        keywords = model.extract_keywords(
            chunk,
            keyphrase_ngram_range=(1, 3),
            stop_words="english",
            use_mmr=True,
            diversity=diversity,
            top_n=top_n
        )
        all_keywords.append({
            "chunk_index": i,
            "text": chunk[:200] + "...",
            "keywords": [kw[0] for kw in keywords]
        })
    
    return all_keywords

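import pandas as pd

# Assumption: `chunks` is a list of strings. Here we use the text of the
# nodes parsed from the markdown file earlier in this notebook.
chunks = [node.get_content() for node in nodes]

results = extract_keywords_from_chunks(chunks, top_n=5, diversity=0.7)
df_keywords = pd.DataFrame(results)
df_keywords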

In [None]:

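# Assumption: these import paths match the installed LangChain version, and
# `docs` is a list of LangChain documents loaded in an earlier step (not shown).
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Step 2: Summarize
llm = ChatOpenAI(model_name="gpt-4o")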
summary_prompt = PromptTemplate.from_template("Summarize the document:\n\n{content}")
summary_chain = LLMChain(llm=llm, prompt=summary_prompt)

summary = summary_chain.run({"content": docs[0].page_content})

# Step 3: Extract keywords
keyword_prompt = PromptTemplate.from_template("Extract keywords:\n\n{content}")
keywords = LLMChain(llm=llm, prompt=keyword_prompt).run({"content": docs[0].page_content})

# Step 4: Chunking
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(docs)