# ESG Metric

#### Idea: Use a RAG pipeline to extract information from an ESG report and compute a self-defined ESG score

### Structure:

1. RAG Pipeline Setup
2. Structured Data Extraction
3. Implementation
4. Postprocessing

In [50]:
import sys
sys.path.append('..')

### Setting up RAG Pipeline

The RAG pipeline we looked at previously is now wrapped in the class _RagPipeline_.   
It provides all the functions we used before; see **src/common/rag_pipeline.py** for the implementation.

```python 
class RagPipeline:
    def load_pdf(self, pdf_path: str) -> Document:
        """Load PDF file and return a Document object."""

    def chunk_text(self, document: Document, chunk_size: int = 1000, chunk_overlap: int = 20) -> list[Document]:
        """Chunk the text into smaller pieces."""

    def create_vectordb(self, chunks: list[Document], embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2", search_kwargs: dict = None ) -> FAISS:
        """Create a vector database from the chunks."""

    def load_llm(self, llm_model: str = "google/flan-t5-base", pipeline_type: str = "text2text-generation" ) -> HuggingFacePipeline:
        """Load the LLM model."""

    def create_qa_chain(self, chain_type="stuff", chain_type_kwargs: dict = {}) -> RetrievalQA:
        """Create a QA chain."""

    def run(self, query: str) -> str:
        """Run the QA chain with a query."""


def parse_html_table(page_content) -> List[dict]:
    """Parses a single HTML <table> from the given page_content and returns a list of row-dictionaries."""

def convert_table_row_to_text(row: dict) -> str:
    """Converts a table row (dict) to a string representation."""

```




In [51]:
from src.common.rag_pipeline import RagPipeline

In [52]:
rag_pipeline = RagPipeline()
data = rag_pipeline.load_pdf(pdf_path='../data/raw/ESG/NVDA.pdf')
docs = rag_pipeline.chunk_text(documents=data, chunk_size=1000, chunk_overlap=500)
vectordb = rag_pipeline.create_vectordb(
    chunks=docs,
    search_kwargs={"k": 3},
    embedding_model='sentence-transformers/all-mpnet-base-v2',
)
llm = rag_pipeline.load_llm(llm_model='google/flan-t5-large')

Device set to use mps:0


### Structured Data Extraction

Now that the main pipeline (LLM and vector DB) is set up, we can formulate the ESG metric and the prompt.    
The metric can use any information that an ESG report should contain, such as CO2 output.

Our goal is to extract each value as concisely as possible, i.e. as a single number that we can convert to an int/float and use to compute the metric.

Experimentation has shown that a query structure like the following gives the best results:

#### Prompt:
```python
query = """What {parameter} in {unit}? Return number only."""
```

Interestingly, adding an _'is'_ after the _'What'_ is already enough to make the answers less consistent.
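
As a small illustration of how the template is filled, here is a minimal sketch. The helper name `build_query` and the constant `QUERY_TEMPLATE` are made up for this example and are not part of `RagPipeline`.

```python
# Prompt template from above; `parameter` and `unit` are filled in per metric.
QUERY_TEMPLATE = "What {parameter} in {unit}? Return number only."


def build_query(parameter: str, unit: str) -> str:
    """Fill the prompt template for a single metric (hypothetical helper)."""
    return QUERY_TEMPLATE.format(parameter=parameter, unit=unit)


example_query = build_query(
    parameter="total gross carbon footprint in 2024",
    unit="metric tons CO2e",
)
print(example_query)
```

Keeping the template in one place makes it easy to try small wording changes, such as the extra _'is'_, and compare the answers.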

#### Metric: Environmental Intensity Score (EIS) — 0 to 10

**Inputs**  
- **CI**: Carbon intensity (tonnes CO₂e per million USD revenue)  
- **WI**: Water intensity (m³ per million USD revenue)  
- **RR**: Recycling rate (%)
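
If a report only states absolute totals, the intensities can be derived by dividing by revenue. The sketch below assumes revenue is given in million USD and that water volumes reported in gallons are US liquid gallons; the helper names are hypothetical and the numbers in the usage lines are placeholders, not values from the report.

```python
# One US liquid gallon in cubic metres (exact by definition).
GALLONS_TO_M3 = 0.003785411784


def gallons_to_m3(gallons: float) -> float:
    """Convert a volume from US liquid gallons to cubic metres."""
    return gallons * GALLONS_TO_M3


def intensity(total: float, revenue_musd: float) -> float:
    """Return `total` per million USD of revenue (e.g. tonnes CO2e / $M)."""
    if revenue_musd <= 0:
        raise ValueError("revenue_musd must be positive")
    return total / revenue_musd


# Placeholder inputs, for illustration only.
ci = intensity(total=1000.0, revenue_musd=50.0)                 # tonnes CO2e per $M
wi = intensity(total=gallons_to_m3(1000.0), revenue_musd=50.0)  # m³ per $M
print(ci, wi)
```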


In [53]:
def metric(carbon_intensity, water_intensity, recycling_rate):
    """
    Calculate the ESG score as a simple weighted sum of carbon intensity,
    water intensity, and recycling rate.

    The inputs are used as-is (not normalized), so the result is not
    bounded to the 0 to 10 range.
    """
    score = (0.5 * carbon_intensity) + (0.3 * water_intensity) + (2 * recycling_rate)
    return score
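
The heading gives EIS a range of 0 to 10, while a plain weighted sum is unbounded. One possible way to get a bounded score is to min-max scale each input before weighting, sketched below. Everything here is an assumption for illustration: the function names, the reference ranges, the weights, the choice to treat lower intensities as better, and the recycling rate being passed as a fraction between 0 and 1.

```python
def scale_0_1(value: float, lo: float, hi: float) -> float:
    """Min-max scale `value` into [0, 1], clipping values outside [lo, hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)


def bounded_eis(
    carbon_intensity: float,
    water_intensity: float,
    recycling_rate: float,
    ci_range: tuple = (0.0, 100.0),    # hypothetical reference range
    wi_range: tuple = (0.0, 1000.0),   # hypothetical reference range
    weights: tuple = (0.5, 0.3, 0.2),  # hypothetical weights
) -> float:
    """Return a score in [0, 10]; higher means better environmental performance."""
    # Lower intensity is treated as better, so the scaled value is inverted.
    ci_good = 1.0 - scale_0_1(carbon_intensity, *ci_range)
    wi_good = 1.0 - scale_0_1(water_intensity, *wi_range)
    # Recycling rate is expected as a fraction; higher is better.
    rr_good = scale_0_1(recycling_rate, 0.0, 1.0)

    w_ci, w_wi, w_rr = weights
    weighted = w_ci * ci_good + w_wi * wi_good + w_rr * rr_good
    return 10.0 * weighted / (w_ci + w_wi + w_rr)
```

With this shape, the best case (zero intensities, full recycling) maps to 10 and the worst case to 0, whatever the raw units are. The reference ranges would need to come from industry data to be meaningful.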

### Implementation

Now that we have a metric and a prompt ready, we can extract the information using RAG!

In [54]:
rag_pipeline.create_qa_chain(return_source_documents=True)

RetrievalQA(verbose=False, combine_documents_chain=StuffDocumentsChain(verbose=False, llm_chain=LLMChain(verbose=False, prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x313f21a90>), output_parser=StrOutputParser(), llm_kwargs={}), document_prompt=PromptTemplate(input_variables=['page_content'], input_types={}, partial_variables={}, template='{page_content}'), document_variable_name='context'), return_source_documents=True, retriever=VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x3117f8530>, search_kwargs={'k':

#### Example:

In [55]:
query = """What total gross carbon footprint in 2024 in metric tons CO2e? Return number only."""
answer,_ = rag_pipeline.run({"query": query}) # the _ is the source documents
print(f"Answer: {answer}")

Answer: 71


#### Pipeline:

In [None]:
parameters = [
    {'metric': 'Carbon intensity in 2024',
     'unit': 'metric tons CO2e/million USD revenue'},
    {'metric': 'CO2 or carbon emissions in 2024',
     'unit': 'metric tons CO2e'},
    {'metric': 'water intensity in 2024',
     'unit': 'gallons/million USD revenue'},
    {'metric': 'water consumption in 2024',
     'unit': 'gallons'},
    {'metric': 'recycling rate in 2024',
     'unit': 'percentage'},
]

In [70]:
output = {}

for parameter in parameters:
    query = f"""What is the {parameter['metric']} in {parameter['unit']}? Return number only. If you cannot find a value, return NaN."""
    answer, context = rag_pipeline.run({"query": query})  # context holds the retrieved source documents
    for doc in context:
        print(doc.page_content)
    output[parameter['metric']] = answer


{Metric is Energy intensity (Energy used MWh/$M revenue), FY24 is 10.1, FY23 is 18.4, FY22 is 15.8, Reference Indicator is GRI 302-3}
{Metric is Reduction of GHG emissions, FY24 is 2023 CDP Climate Change, FY23 is Response, pp. 29-46, 71, FY22 is , Reference Indicator is SASB TC-SC-110a.2}
{Product Energy Efficiency is Greenhouse Gas Emissions, 10 is }
{Metric is Water consumption, FY24 is 134219, FY23 is 197849, FY22 is 239780, Reference Indicator is GRI 303-5}
{Metric is Water consumption, FY24 is , FY23 is , FY22 is , Reference Indicator is SASB TC-SC-140a.1}
{Metric is Interactions with water as a shared resource, FY24 is Water Conservation, FY23 is , FY22 is , Reference Indicator is GRI 303-1}
{Metric is Landfill diversion rate (%), FY24 is 71%, FY23 is 58%, FY22 is 56%, Reference Indicator is GRI 306-4}
{Metric is General waste recycled, FY24 is 374, FY23 is 295, FY22 is 127, Reference Indicator is }
{Metric is Category 5 is Waste generated in operations?, FY24 is 617, FY23 is 57

### Postprocessing

Clean the outputs into numbers so that we can do math with them.

In [64]:
from src.utils.utils import convert_strings_to_floats

In [65]:
output = list(output.values())
output = convert_strings_to_floats(output)

In [66]:
output

[10.1, 134219.0, 0.71]
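
The implementation of `convert_strings_to_floats` lives in **src/utils/utils.py** and is not shown here. Judging only by the output above, it appears to turn percentages into fractions and to drop answers that are not numbers. A minimal sketch of that behaviour, under those assumptions and with hypothetical function names, could look like this:

```python
import math
import re

# First number in a string, allowing a sign, thousands separators and decimals.
_NUMBER = re.compile(r"[-+]?\d[\d,]*\.?\d*")


def to_float(text: str) -> float:
    """Parse the first number in `text`; return NaN if there is none."""
    match = _NUMBER.search(text)
    if match is None:
        return math.nan
    value = float(match.group().replace(",", ""))
    # Treat percentages as fractions, e.g. "71%" becomes 0.71.
    if "%" in text:
        value /= 100.0
    return value


def strings_to_floats_sketch(answers: list) -> list:
    """Convert model answers to floats and drop the ones that are not numbers."""
    values = [to_float(answer) for answer in answers]
    return [value for value in values if not math.isnan(value)]


# Illustrative inputs, not actual model answers.
print(strings_to_floats_sketch(["1,234.5", "71%", "NaN"]))  # [1234.5, 0.71]
```

Dropping entries shifts the positions of the remaining values, so keeping the results keyed by metric name is a safer basis for the metric computation than relying on list indices.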

In [67]:
metric_score = metric(output[0], output[1], output[2])

In [68]:
print('metric_score:', metric_score)

metric_score: 40272.17
