<a href="https://colab.research.google.com/github/UnstoppableLu/Ingredion/blob/main/LLM_Ingredion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction:
This project aims to automate the extraction of sustainability and ESG (Environmental, Social, and Governance) data from PDF reports. These reports are usually unstructured and may be image-based, text-encoded, or both, which makes it difficult to extract quantitative metrics. <br>
This notebook is a work-in-progress (WIP) data extraction pipeline that converts unstructured sustainability reports into structured datasets, enabling further analysis, comparison, and visualization.

## Setting All Dependencies:

In [1]:
!uv pip install -q langchain-google-genai google-generativeai

import google.generativeai as genai
import os, getpass

os.environ["GEMINI_API_KEY"] = getpass.getpass("Enter your Google AI API key: ")
genai.configure(api_key=os.environ["GEMINI_API_KEY"])


Enter your Google AI API key: ··········


You'll need an API key. You can create and manage Gemini API keys on the Google AI Studio page.<br> You will also need to create or import a Google Cloud project, because each Gemini API key is associated with one.
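
When the notebook is re-run in the same session, prompting for the key every time is unnecessary. The sketch below is one possible way to reuse a key that is already in the environment and only prompt when it is missing. The helper name `load_api_key` and its `prompt` flag are hypothetical and not part of the original notebook.

```python
import os
import getpass


def load_api_key(env_var: str = "GEMINI_API_KEY", prompt: bool = True) -> str:
    """Return the API key from the environment, prompting only if it is missing.

    Hypothetical helper: the name and the `prompt` flag are assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    if not key and prompt:
        # getpass hides the typed key, as in the cell above
        key = getpass.getpass(f"Enter your Google AI API key ({env_var}): ").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    # Store the key so later cells can read it from os.environ
    os.environ[env_var] = key
    return key
```

A call such as `genai.configure(api_key=load_api_key())` would then replace the two configuration lines in the first cell.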

In [2]:
!uv venv
!uv pip install pymupdf4llm
!uv pip install requests
!uv pip install pymupdf

Using CPython 3.12.12 interpreter at: /usr/bin/python3
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
Using Python 3.12.12 environment at: /usr
Resolved 2 packages in 125ms
Prepared 2 packages in 305ms
Installed 2 packages in 6ms
 + pymupdf==1.26.5
 + pymupdf4llm==0.0.27
Using Python 3.12.12 environment at: /usr
Audited 1 package in 109ms
Using Python 3.12.12 environment at: /usr
Audited 1 package in 116ms


## Uploading PDF Report:

In [3]:
from google.colab import files
uploaded = files.upload()

# You can then access the uploaded file(s) by their filename
for fn in uploaded.keys():
  print(f'User uploaded file "{fn}" with length {len(uploaded[fn])} bytes')

Saving Ingredion 2024 Sustainability Report.pdf to Ingredion 2024 Sustainability Report.pdf
User uploaded file "Ingredion 2024 Sustainability Report.pdf" with length 13511806 bytes


## Extracting Text from PDF to Markdown:

In [6]:
import pymupdf4llm
import pathlib

md_text = pymupdf4llm.to_markdown(fn)
print(md_text)

pathlib.Path("output.md").write_bytes(md_text.encode())

##### **Welcome to Our 2024** **Sustainability Report**

I am so pleased to be sharing with you Ingredion’s
2024 Sustainability Report. This report provides
a high-level overview of our activity under our 2030
All Life sustainability plan, and of the great work our
employees and our business partners engage in across
the globe to enable a more sustainable business and
a more sustainable world.


Over the past few years, we have seen a growing willingness for collaboration in
sustainability, and it is that trend that gives me the most hope for the future. Our
customers, suppliers, NGO partners and other stakeholders continue to look for
ways to create shared value that allows us to progress sustainable products and
practices that drive a real and positive impact.


I want to call to your attention the title of this year’s report: Create the Future with
People Who Care. At Ingredion, these are more than just words that we have chosen
for the cover of our report, it is our new employee va

125359

In [7]:
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2")
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Split the Markdown into chunks at level-1 and level-2 headers
docs = markdown_splitter.split_text(md_text)
print(docs)


[Document(metadata={}, page_content='##### **Welcome to Our 2024** **Sustainability Report**  \nI am so pleased to be sharing with you Ingredion’s\n2024 Sustainability Report. This report provides\na high-level overview of our activity under our 2030\nAll Life sustainability plan, and of the great work our\nemployees and our business partners engage in across\nthe globe to enable a more sustainable business and\na more sustainable world.  \nOver the past few years, we have seen a growing willingness for collaboration in\nsustainability, and it is that trend that gives me the most hope for the future. Our\ncustomers, suppliers, NGO partners and other stakeholders continue to look for\nways to create shared value that allows us to progress sustainable products and\npractices that drive a real and positive impact.  \nI want to call to your attention the title of this year’s report: Create the Future with\nPeople Who Care. At Ingredion, these are more than just words that we have chosen\nf
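
The first chunk above has empty metadata and still begins with `#####`, because only `#` and `##` were listed as split points. To make that behavior concrete, here is a minimal pure-Python sketch of header-based splitting. It is an illustration only, not LangChain's implementation, and `split_by_headers` is a hypothetical name. It ignores details such as fenced code blocks.

```python
import re
from typing import Dict, List, Tuple


def split_by_headers(markdown: str, max_level: int = 2) -> List[Tuple[Dict[str, str], str]]:
    """Split Markdown into (metadata, body) chunks at headers up to max_level.

    Illustrative sketch only; deeper headers (e.g. '#####') stay in the body.
    """
    header_re = re.compile(r"^(#{1,6})\s+(.*)$")
    chunks: List[Tuple[Dict[str, str], str]] = []
    current_meta: Dict[str, str] = {}
    body: List[str] = []

    def flush() -> None:
        # Emit the accumulated body (if any) with a copy of the active headers
        text = "\n".join(body).strip()
        if text:
            chunks.append((dict(current_meta), text))
        body.clear()

    for line in markdown.splitlines():
        match = header_re.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level
            for key in [k for k in current_meta if int(k.split()[-1]) >= level]:
                del current_meta[key]
            current_meta[f"Header {level}"] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return chunks
```

If the report's sections are marked with deeper headers, adding those levels to `headers_to_split_on` may produce smaller, more focused chunks.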

## Testing Call to Gemini:

In [8]:
model = genai.GenerativeModel("gemini-2.5-flash")
response = model.generate_content("Hello Gemini!")
print(response.text)


Hello! How can I help you today?


## Connecting To LangChain:

In [9]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import ChatPromptTemplate
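# Deprecated namespace (see the warning in this cell's output); newer code can use `from pydantic import BaseModel, Field`
from langchain_core.pydantic_v1 import BaseModel, Field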
from typing import Optional, List

# Define the desired structure to force structured output
class Metric(BaseModel):
  """A single extracted performance metric."""
  metric_name: str = Field(..., description="The name of the metric, e.g., 'Scope 1 and 2 emissions reduction'.")
  value: str = Field(..., description="The value of the metric, including units, e.g., '50%'.")
  year: Optional[int] = Field(None, description="The year the metric corresponds to, if mentioned.")

class ExtractedMetrics(BaseModel):
  """The complete set of metrics extracted from a text chunk."""
  title: str = Field(..., description="A suitable title for the extracted data, e.g., 'Climate Targets'.")
  metrics: List[Metric] = Field(..., description="A list of all the metrics found in the text.")

# Initialize the model
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0,
    google_api_key=os.environ["GEMINI_API_KEY"]
)

# Default function-calling approach
structured_llm = llm.with_structured_output(ExtractedMetrics)

prompt = ChatPromptTemplate([
    ("system", "You're an expert sustainability analyst! From the following text, extract all relevant metrics and format them according to the provided schema."),
    ("human", "{text_chunk}")
])

chain = prompt | structured_llm

batch_inputs = [{"text_chunk": doc.page_content} for doc in docs]
results = chain.batch(batch_inputs)
print(results)


For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet. 	from pydantic.v1 import BaseModel

  exec(code_obj, self.user_global_ns, self.user_ns)


[ExtractedMetrics(title='Ingredion 2024 Sustainability Metrics', metrics=[Metric(metric_name='Absolute carbon emissions reduction (since 2019)', value='22%', year=2024), Metric(metric_name='Tier 1 priority crops sustainably sourced', value='85%', year=2024), Metric(metric_name='Employee and contractor TRIR target', value='0.18', year=2025), Metric(metric_name='Employee and contractor TRIR target', value='0.15', year=2030), Metric(metric_name='ISO 26000 social responsibility guidance implementation', value='Implemented', year=2023), Metric(metric_name='Human rights protection assessment across agricultural supply chain for Tier 1 priority crops', value='100%', year=2024), Metric(metric_name='Suppliers meeting high-risk criteria for human rights audited', value='100%', year=2027), Metric(metric_name='Human rights protection validated across operations and supply chain', value='Validated', year=2030), Metric(metric_name='Plastics circular economy projects completed per country', value='3 
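
Each chunk in the batch costs one model call, so it may be worth dropping chunks that are unlikely to contain metrics before calling `chain.batch`. The sketch below uses a simple heuristic, which is an assumption and not something the original notebook does: it skips chunks that are very short or contain no digits. The helper name `build_batch_inputs` and the `min_chars` threshold are hypothetical.

```python
from typing import Dict, Iterable, List


def build_batch_inputs(chunks: Iterable[str], min_chars: int = 40) -> List[Dict[str, str]]:
    """Build prompt inputs, skipping chunks unlikely to hold quantitative metrics.

    Heuristic sketch (assumption): a chunk is kept only if it has at least
    `min_chars` characters and contains at least one digit.
    """
    inputs: List[Dict[str, str]] = []
    for text in chunks:
        stripped = text.strip()
        if len(stripped) < min_chars:
            continue  # too short to carry a metric with context
        if not any(ch.isdigit() for ch in stripped):
            continue  # no numbers at all, so no quantitative metric
        inputs.append({"text_chunk": stripped})
    return inputs
```

In this notebook it could be used as `batch_inputs = build_batch_inputs(doc.page_content for doc in docs)`. Purely qualitative statements would be filtered out, so the heuristic should be loosened if those matter.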

## Normalization:

In [10]:
import pandas as pd
# Flatten the results into a list of dictionaries, one per metric
flattened_data = []

for result in results:
  if not result.metrics:
    continue
  for metric in result.metrics:
    flattened_data.append({
        'title': result.title,
        'metric_name': metric.metric_name,
        'value': metric.value,
        'year': metric.year
    })

df = pd.DataFrame(flattened_data)
print(df)

                                                title  \
0               Ingredion 2024 Sustainability Metrics   
1               Ingredion 2024 Sustainability Metrics   
2               Ingredion 2024 Sustainability Metrics   
3               Ingredion 2024 Sustainability Metrics   
4               Ingredion 2024 Sustainability Metrics   
..                                                ...   
76  Governance and Sustainability Performance Metrics   
77  Governance and Sustainability Performance Metrics   
78  Governance and Sustainability Performance Metrics   
79  Governance and Sustainability Performance Metrics   
80  Governance and Sustainability Performance Metrics   

                                          metric_name  \
0    Absolute carbon emissions reduction (since 2019)   
1           Tier 1 priority crops sustainably sourced   
2                 Employee and contractor TRIR target   
3                 Employee and contractor TRIR target   
4   ISO 26000 social responsib
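
The `value` column holds strings such as `'22%'`, `'0.18'`, or `'Implemented'`, which makes numeric comparison awkward. As a possible next normalization step, the sketch below separates the first number in a value from the remaining text. The helper name `split_value` and the way the unit is derived are assumptions, not part of the original pipeline.

```python
import re
from typing import Optional, Tuple

# First number in the string, allowing a sign, thousands separators, and decimals
_NUMBER_RE = re.compile(r"[-+]?\d[\d,]*\.?\d*")


def split_value(raw: str) -> Tuple[Optional[float], str]:
    """Split a metric value into (number, unit).

    Returns (None, raw) when no number is found, e.g. for 'Implemented'.
    """
    match = _NUMBER_RE.search(raw)
    if match is None:
        return None, raw.strip()
    number = float(match.group().replace(",", ""))
    # Everything outside the matched number is treated as the unit text
    unit = (raw[:match.start()] + raw[match.end():]).strip()
    return number, unit
```

On a non-empty DataFrame, `df["value_num"], df["unit"] = zip(*df["value"].map(split_value))` would add the two columns. Values with several numbers (ranges, for instance) keep only the first, so they would need separate handling.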

In [15]:
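# Use the file stem so the CSV name does not embed ".pdf" (pathlib was imported in an earlier cell)
df.to_csv(f"{pathlib.Path(fn).stem}_Extracted_Metrics.csv", index=False)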