# ESG Metric Extraction

## Structure

### Example: AAPL
1. Imports
2. Data Extraction
3. Prompt Definition
4. Response Retrieval
5. Saving the Data

### Full Pipeline with Evaluation Metric

## Example: AAPL

### 1. Imports

_from google import genai_:    
We use Google's Gemini as the LLM that does the heavy lifting.    

_from langchain.prompts import PromptTemplate_:   
We once again use LangChain and its prompt template to build the instruction set for the LLM.

_import fitz_:    
For document extraction we use PyMuPDF, also known as fitz. It handles most of the preprocessing, from loading the PDF to extracting its text.



In [1]:
import os
import sys
sys.path.append('..')

from google import genai
from langchain.prompts import PromptTemplate
import fitz  # PyMuPDF


#### File path:
Currently, the reports are found in the _data/raw/ESG_ folder of this repository.

In [2]:
ticker_name = 'AAPL'
file_path = f'../data/raw/ESG/{ticker_name}.pdf'

### 2. Data Extraction

This function extracts the text from the PDF file. It skips any text that lies within the header and footer margins so that no unnecessary tokens are spent on the LLM.

In [3]:
def extract_pdf_text(doc_path, margin_height=50):
    doc = fitz.open(doc_path)
    all_text = []

    for page in doc:
        page_height = page.rect.height
        blocks = page.get_text("dict")["blocks"]

        for block in blocks:
            if "lines" not in block:
                continue
            for line in block["lines"]:
                for span in line["spans"]:
                    text = span["text"].strip()
                    y0, y1 = span["bbox"][1], span["bbox"][3]

                    # Skip if within header/footer margin
                    if y1 < margin_height or y0 > (page_height - margin_height):
                        continue

                    all_text.append(text)

    return "\n".join(all_text)

pdf_text = extract_pdf_text(file_path)
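The margin test can be tried in isolation, without opening a PDF. The sketch below is a minimal illustration, assuming PyMuPDF's bounding-box order `(x0, y0, x1, y1)` with `y` growing downward from the top of the page; the helper name `in_margin`, the page height, and the sample spans are hypothetical and only serve the illustration.

```python
def in_margin(bbox, page_height, margin_height=50):
    """Return True if a span lies entirely inside the top or bottom margin band."""
    _, y0, _, y1 = bbox
    return y1 < margin_height or y0 > (page_height - margin_height)


# Hypothetical spans on a 792 pt tall page: (text, bounding box)
page_height = 792
spans = [
    ("running header", (72, 20, 300, 32)),    # ends above y = 50, so it is skipped
    ("body text", (72, 200, 400, 212)),       # inside the page body, so it is kept
    ("page number", (500, 760, 540, 772)),    # starts below y = 742, so it is skipped
]

kept = [text for text, bbox in spans if not in_margin(bbox, page_height)]
print(kept)
```

A span that only partly overlaps a margin band is kept, because the condition requires the whole span to sit inside the band.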

### 3. Prompt Definition
With larger models like Gemini and ChatGPT, the prompt has a significant impact on the output. 
These prompts are highly configurable and can be tailored to your goals.

In this case we want the model to pull every value from the document that could in any way be an ESG-related metric.    
To make the values measurable, we require it to return units in ISO format so that the different profiles are comparable later on.    
Lastly, we want the result in JSON format so it is easy to manipulate, save, and load.
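To see why a fixed hierarchy makes profiles comparable, here is a minimal sketch of how a "Category > Subcategory > Year > value and unit" structure could be flattened into rows. The helper name `flatten_metrics` is hypothetical, and the sample reuses the example values from the prompt in this section; it assumes every leaf holds both a `value` and a `unit` key.

```python
def flatten_metrics(data):
    """Yield (category, metric, year, value, unit) rows from the nested dict."""
    for category, metrics in data.items():
        for metric, years in metrics.items():
            for year, entry in years.items():
                yield (category, metric, year, entry["value"], entry["unit"])


# Same shape as the example given in the prompt
sample = {
    "DesignAndMaterials": {
        "RecycledGoldIncrease": {
            "2021": {"value": 1, "unit": "%"},
        }
    }
}

rows = list(flatten_metrics(sample))
print(rows)
```

Rows in this form can be concatenated across companies and filtered by metric or year.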

In [None]:
prompt = PromptTemplate(
    input_variables=["document_text"],
    template="""You are an expert in environmental, social, and governance (ESG) data extraction and reporting.

            Your task is to carefully extract **all relevant ESG metrics** from the following report text. 
            These include, but are not limited to: emissions (Scopes 1, 2, 3), energy usage, water usage, waste, recycled materials, carbon footprint, renewable energy deployment, and any target/goal progression. 
            Please include all the data from previous years as well.

            Please:

            1. Structure the output as **valid JSON**.
            2. Include **all available years** for each metric (not just the current year).
            3. Make sure to look for **tables**. Some data may be presented in tabular format.
            4. Use **standardized ISO units** where possible (e.g., metric tons of CO2e, MWh, gallons, percentages). Convert if necessary.
            5. Use **clear and consistent keys**, and group metrics by category such as "Emissions", "Energy", "Water", "Materials", "Waste", etc.
            6. Omit interpretation — just provide structured data as precisely as stated in the document.

            Here is the report text:

            --- START OF REPORT TEXT ---

            {document_text}

            --- END OF REPORT TEXT ---

            Ensure the output is JSON-serializable and can be parsed using json.loads() in Python:
            Please provide the output in the following format, where the hierarchy is "Category > Subcategory > Year > Metric and value", for example:

            {{
                "DesignAndMaterials": {{
                    "RecycledGoldIncrease": {{
                        "2021": {{
                            "value": 1,
                            "unit": "%"
                        }}
                    }}
                }}
            }}

            Now return the extracted ESG metrics in valid JSON format:
        
            """
)

Note that the literal braces in the JSON example are doubled, because _PromptTemplate_ treats single braces as placeholders. With unescaped single braces, the cell fails with:

ValidationError: 1 validation error for PromptTemplate
  Value error, unmatched '{' in format spec [type=value_error, input_value={'input_variables': ['doc...'partial_variables': {}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/value_error
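An error like the one above appears when a format-style template contains literal braces that are not escaped. The brace rules can be checked with plain `str.format`, which the default f-string-style template format follows. This is a minimal sketch, not the prompt used in this notebook; the shortened template text is an assumption made for illustration, and the example values are the ones from the prompt.

```python
import json

# Doubled braces are emitted literally; {document_text} is the only placeholder.
template = (
    "Here is the report text:\n{document_text}\n\n"
    "Return JSON shaped like this:\n"
    '{{"DesignAndMaterials": {{"RecycledGoldIncrease": '
    '{{"2021": {{"value": 1, "unit": "%"}}}}}}}}'
)

rendered = template.format(document_text="Example report text.")

# The last line of the rendered prompt is now a valid JSON example.
example_line = rendered.splitlines()[-1]
parsed_example = json.loads(example_line)
print(parsed_example["DesignAndMaterials"]["RecycledGoldIncrease"]["2021"])
```

Only the example inside the template needs doubled braces; the model's actual answer uses ordinary single braces.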

### 4. Response Retrieval

This is the model definition as per the Google API. You will need an API key, which you can get [here](https://aistudio.google.com/apikey).    
The model returns a _GenerateContentResponse_ object that contains metadata and the actual text in its _text_ attribute.

Because we ask for JSON, the model wraps its answer in a Markdown code block, preceded by \`\`\`json and followed by \`\`\`, which we filter out.

What remains is clean JSON that we use for our metric. That step is covered in a different document.


In [None]:
# client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

In [None]:
# json_response = client.models.generate_content(
#     model="gemini-2.0-flash",
#     contents=prompt.format(document_text=pdf_text),
# )
# output_text = json_response.text.replace("```json", "").replace("```", "")
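Chained `replace` calls remove every fence marker anywhere in the text, including any that might appear inside a string value. A slightly stricter alternative is to strip only one enclosing fence. The sketch below is one possible approach, not the method used in this notebook; the helper name `strip_code_fence` is hypothetical, and the fence marker is built from single backticks only so that the example itself stays readable inside a code block.

```python
import json
import re

TICKS = "`" * 3  # the three-backtick Markdown fence marker
FENCE_RE = re.compile(
    rf"^\s*{TICKS}(?:json)?\s*\n(.*?)\n\s*{TICKS}\s*$",
    re.DOTALL,
)


def strip_code_fence(text):
    """Return the body of a single enclosing fence, or the stripped text if unfenced."""
    match = FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


# Simulated model output wrapped in a json fence
raw_output = TICKS + 'json\n{"Energy": {}}\n' + TICKS
cleaned = strip_code_fence(raw_output)
parsed = json.loads(cleaned)
print(parsed)
```

Unfenced output passes through unchanged apart from surrounding whitespace, so the helper is safe to apply in both cases.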


### 5. Saving the Data

After receiving the output text, we need to save it to a JSON file, since this is the file type we will be working with.
For this we use _json5_ as a tolerant JSON loader and _ast_ to evaluate the raw output as a Python string literal before parsing.

In [None]:
# import json5
# import json
# import ast
# clean_json_string = ast.literal_eval(f"'''{output_text}'''")
# parsed_json = json5.loads(clean_json_string)

In [None]:
# with open(f"../data/processed/ESG/{ticker_name}.json", "w") as json_file:
#     json.dump(parsed_json, json_file, indent=4)
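The save step can also be wrapped in a small helper that creates the target folder when it is missing. This is a minimal sketch using only the standard library; the helper name `save_metrics` is hypothetical, and the usage example writes to a temporary directory instead of _data/processed/ESG_ so that it can be run anywhere.

```python
import json
import tempfile
from pathlib import Path


def save_metrics(metrics, target_dir, ticker):
    """Write metrics to <target_dir>/<ticker>.json and return the file path."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create missing folders
    out_path = target_dir / f"{ticker}.json"
    with out_path.open("w", encoding="utf-8") as json_file:
        json.dump(metrics, json_file, indent=4, ensure_ascii=False)
    return out_path


# Usage: write a placeholder structure and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_metrics({"Energy": {}}, Path(tmp_dir) / "ESG", "AAPL")
    saved_name = saved_path.name
    round_trip = json.loads(saved_path.read_text(encoding="utf-8"))

print(saved_name, round_trip)
```

Reading the file back right after writing is a cheap check that the saved data is valid JSON.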

## Full Pipeline with Evaluation Metric

In [None]:
import sys
sys.path.append('..')  # make the repository root importable before importing from src

from src.common.gemini_pipeline import GeminiPipeline


In [None]:
pipeline = GeminiPipeline(prompt_template=prompt, source_path='../data/raw/ESG', target_path='../data/processed/ESG')

In [None]:
pipeline.run()

100%|██████████| 4/4 [01:10<00:00, 17.74s/it]
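The implementation of _GeminiPipeline_ is not shown in this notebook. As a rough idea of how a batch run over a source folder might pair inputs with outputs, here is a hypothetical sketch; `plan_jobs` is an invented helper and not part of the pipeline, and the second ticker file exists only for the illustration.

```python
import tempfile
from pathlib import Path


def plan_jobs(source_dir, target_dir):
    """Pair each PDF in source_dir with the JSON path it would be written to."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    return [
        (pdf_path, target_dir / f"{pdf_path.stem}.json")
        for pdf_path in sorted(source_dir.glob("*.pdf"))
    ]


# Usage with empty placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as tmp_dir:
    raw_dir = Path(tmp_dir) / "raw"
    raw_dir.mkdir()
    for ticker in ("AAPL", "MSFT"):  # hypothetical file names
        (raw_dir / f"{ticker}.pdf").touch()

    jobs = plan_jobs(raw_dir, Path(tmp_dir) / "processed")
    names = [(src.name, dst.name) for src, dst in jobs]

print(names)
```

Each pair could then go through the steps shown for AAPL: extract the text, format the prompt, query the model, and save the parsed JSON.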
