# ESG Metric

#### Idea: Use a RAG pipeline to extract information from the document and compute a self-defined ESG score

### Structure:

1. RAG Pipeline Setup
2. Structured Data Extraction
3. Implementation

In [12]:
import sys
sys.path.append('..')

### Setting up RAG Pipeline

The RAG pipeline we looked at previously is now wrapped in the class _RagPipeline_.   
It provides all the functions we used before; see **src/common/rag_pipeline.py** for the implementation.

```python 
class RagPipeline:
    def load_pdf(self, pdf_path: str) -> Document:
        """Load PDF file and return a Document object."""

    def chunk_text(self, document: Document, chunk_size: int = 1000, chunk_overlap: int = 20) -> list[Document]:
        """Chunk the text into smaller pieces."""

    def create_vectordb(self, chunks: list[Document], embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2", search_kwargs: dict = None ) -> FAISS:
        """Create a vector database from the chunks."""

    def load_llm(self, llm_model: str = "google/flan-t5-base", pipeline_type: str = "text2text-generation" ) -> HuggingFacePipeline:
        """Load the LLM model."""

    def create_qa_chain(self, chain_type="stuff", chain_type_kwargs: dict = {}) -> RetrievalQA:
        """Create a QA chain."""

    def run(self, query: str) -> str:
        """Run the QA chain with a query."""
```


In [13]:
from src.common.rag_pipeline import RagPipeline

In [14]:
rag_pipeline = RagPipeline()
data = rag_pipeline.load_pdf(pdf_path='../data/raw/ESG/AAPL.pdf')

In [15]:
data

[Document(metadata={'source': '../data/raw/ESG/AAPL.pdf', 'detection_class_prob': 0.3707296550273895, 'coordinates': {'points': ((np.float64(75.7718734741211), np.float64(316.462646484375)), (np.float64(75.7718734741211), np.float64(942.5064697265625)), (np.float64(1585.115234375), np.float64(942.5064697265625)), (np.float64(1585.115234375), np.float64(316.462646484375))), 'system': 'PixelSpace', 'layout_width': 3023, 'layout_height': 1700}, 'last_modified': '2025-04-24T11:39:41', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': '../data/raw/ESG', 'filename': 'AAPL.pdf', 'category': 'Title', 'element_id': '2a1f5e36027f578c81bd94f8c7db7ca6'}, page_content='Environmental Progress Report'),
 Document(metadata={'source': '../data/raw/ESG/AAPL.pdf', 'detection_class_prob': 0.7932204604148865, 'coordinates': {'points': ((np.float64(75.45633697509766), np.float64(1582.1368408203125)), (np.float64(75.45633697509766), np.float64(1615.1640625)), (np.float6

In [None]:
docs = rag_pipeline.chunk_text(documents=data, chunk_size=1000, chunk_overlap=200)
vectordb = rag_pipeline.create_vectordb(
    chunks=docs,
    search_kwargs={"k": 10},
    embedding_model='sentence-transformers/all-mpnet-base-v2',
)
llm = rag_pipeline.load_llm(llm_model='google/flan-t5-large')

  dfs = pd.read_html(page_content)

In [17]:
# [doc for doc in docs if doc.metadata['category'] == "TableRow"]

### Structured Data Extraction

Now that the main pipeline (LLM and vector database) is set up, we can formulate the ESG metric and the prompt.    
The metric can use any information typically found in an ESG report, such as CO2 output.

Our goal for the prompt is to get the model to output a well-structured JSON object containing our values.   
For this we will create a custom prompt structure.



#### Prompt:

In [18]:
from langchain.prompts import PromptTemplate
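
The prompt itself is not shown in this notebook, so the following is only a minimal sketch of what a JSON-oriented prompt could look like. The wording, the field names (`metric`, `value`, `unit`, `year`) and the helper `parse_llm_json` are illustrative assumptions, not part of `RagPipeline`. The template uses the same `{context}` and `{question}` variables as the default QA prompt and is filled with plain `str.format` here; literal braces are doubled so they survive formatting.

```python
import json

# Hypothetical prompt template: wording and JSON fields are assumptions.
ESG_PROMPT = """Use the context below to answer the question.
Reply with a single JSON object of the form
{{"metric": "<name>", "value": <number or null>, "unit": "<unit or null>", "year": <year or null>}}
and nothing else. Use null for anything the context does not state.

Context:
{context}

Question: {question}
JSON:"""


def parse_llm_json(raw: str):
    """Return the first JSON object found in the model output, or None."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None


# Fill the template with placeholder text to inspect the final prompt.
filled = ESG_PROMPT.format(
    context="<retrieved chunks>",
    question="What were gross CO2e emissions in 2024?",
)
```

Small models often wrap the JSON in extra text or return malformed output, which is why the parser tolerates surrounding text and falls back to `None` instead of raising.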


#### Metric: Environmental Intensity Score (EIS) — 0 to 10

**Inputs**  
- **CI**: Carbon intensity (tonnes CO₂e per million USD revenue)  
- **WI**: Water intensity (m³ per million USD revenue)  
- **RR**: Recycling rate (%)
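
Written out, the score computed by the `eis` function in the next cell takes the following form. Each intensity is min-max normalised against benchmark bounds and inverted so that a lower intensity scores higher. The sub-scores are intended to be clipped to $[0, 1]$ before averaging.

$$
s_C = 1 - \frac{CI - CI_{\min}}{CI_{\max} - CI_{\min}}, \qquad
s_W = 1 - \frac{WI - WI_{\min}}{WI_{\max} - WI_{\min}}, \qquad
s_R = \frac{RR}{100}
$$

$$
\mathrm{EIS} = 10 \cdot \frac{s_C + s_W + s_R}{3}
$$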


In [19]:
def eis(ci, wi, rr, ci_min, ci_max, wi_min, wi_max):
    """
    Compute the Environmental Intensity Score (EIS), scaled 0 to 10.

    Parameters:
    - ci: Carbon intensity (tonnes CO₂e per million USD revenue)
    - wi: Water intensity (m³ per million USD revenue)
    - rr: Recycling rate (%)
    - ci_min, ci_max: Benchmark minima and maxima for carbon intensity
    - wi_min, wi_max: Benchmark minima and maxima for water intensity

    Returns:
    - EIS: Score between 0 and 10
    """
    # Min-max normalise each intensity and invert it: lower intensity scores higher.
    s_c = 1 - (ci - ci_min) / (ci_max - ci_min)
    s_w = 1 - (wi - wi_min) / (wi_max - wi_min)
    s_r = rr / 100.0

    # Clamp each sub-score to [0, 1]. Reassigning the loop variable, as before,
    # left s_c, s_w and s_r unchanged, so the clamp never took effect.
    s_c, s_w, s_r = (max(0.0, min(1.0, s)) for s in (s_c, s_w, s_r))

    composite = (s_c + s_w + s_r) / 3
    return composite * 10
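
`eis` expects intensities, whereas a report typically states absolute annual totals, so a conversion step is needed in between. A minimal sketch of that step follows; the helper name `intensity_per_million_usd` and all figures are hypothetical and not taken from the report.

```python
def intensity_per_million_usd(total: float, revenue_usd: float) -> float:
    """Convert an absolute annual total into a per-million-USD-revenue intensity."""
    if revenue_usd <= 0:
        raise ValueError("revenue_usd must be positive")
    return total / (revenue_usd / 1_000_000)


# Hypothetical figures: 50,000 tonnes CO2e emitted on 2 billion USD revenue.
ci = intensity_per_million_usd(total=50_000.0, revenue_usd=2_000_000_000.0)
print(ci)  # tonnes CO2e per million USD revenue
```

The same helper works for water intensity by passing cubic metres as `total`.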


### Implementation

Now that we have a metric and a prompt ready, we can extract the information using RAG!

In [20]:
rag_pipeline.create_qa_chain(return_source_documents=True)

RetrievalQA(verbose=False, combine_documents_chain=StuffDocumentsChain(verbose=False, llm_chain=LLMChain(verbose=False, prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x32a8f10a0>), output_parser=StrOutputParser(), llm_kwargs={}), document_prompt=PromptTemplate(input_variables=['page_content'], input_types={}, partial_variables={}, template='{page_content}'), document_variable_name='context'), return_source_documents=True, retriever=VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x350858b30>, search_kwargs={'k':

In [21]:
query = """what is gross CO2 emissions for Apple in 2024?"""
answer,stuff = rag_pipeline.run({"query": query})

Token indices sequence length is longer than the specified maximum sequence length for this model (922 > 512). Running this sequence through the model will result in indexing errors
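
The warning above means the assembled prompt (922 tokens) is longer than the 512 tokens the model accepts, so part of the retrieved context may be cut off. One possible mitigation, sketched below and not something this notebook does, is to keep retrieved chunks in rank order only until a token budget is reached. The function `fit_to_budget` is hypothetical, and the default whitespace word count is only a rough stand-in for the model's real tokenizer.

```python
def fit_to_budget(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep retrieved chunks, in rank order, until the token budget is used up."""
    kept, used = [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > max_tokens:
            break  # the next chunk would overflow the budget
        kept.append(chunk)
        used += n
    return kept


# Toy example with a budget of 5 "tokens" (whitespace-separated words).
chunks = ["a b c", "d e", "f g h i"]
kept = fit_to_budget(chunks, max_tokens=5)
```

Lowering `k` in `search_kwargs` or reducing `chunk_size` has a similar effect without any extra code.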


In [22]:
answer,stuff

('8.4 million annualized metric tons',
 [Document(id='860f4d37-3764-474f-b35c-e02c081cd889', metadata={'source': '../data/raw/ESG/AAPL.pdf', 'detection_class_prob': 0.9107416868209839, 'coordinates': {'points': ((np.float64(805.7642211914062), np.float64(434.890869140625)), (np.float64(805.7642211914062), np.float64(761.1541748046875)), (np.float64(1480.727783203125), np.float64(761.1541748046875)), (np.float64(1480.727783203125), np.float64(434.890869140625))), 'system': 'PixelSpace', 'layout_width': 3023, 'layout_height': 1700}, 'last_modified': '2025-04-24T11:39:41', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 32, 'parent_id': 'ca6d9fca85d1d2066b2d5bac2c51dcc7', 'file_directory': '../data/raw/ESG', 'filename': 'AAPL.pdf', 'category': 'NarrativeText', 'element_id': '0084562195aed09b42b305b5179613f0'}, page_content="We've historically supported voluntary efforts by our display and semiconductor manufacturers to reduce their F-GHG emissions. But we're pushing th