# BRSR Principle 6 — Faithfulness Mapping Pipeline

This notebook is a **ready-to-run template** that implements the minimal end-to-end pipeline for the assignment:
- Load SEBI BRSR rules PDF and a company's BRSR PDF
- Chunk documents and build vector index (FAISS)
- Run Retrieval-Augmented Generation (RAG) using an LLM to compare SEBI requirements vs company disclosures
- Produce drift scores (0–3), Sankey diagram, drift dashboard
- Export a Word report containing findings and visuals

**Important:** This notebook is a template. You must install the required Python packages and provide API keys (if using OpenAI). Instructions are below.

## 1. Environment & Installation

Run this (once) in your notebook environment (Colab / local / VS Code terminal):
```bash
# Create venv (optional)
python -m venv .venv
source .venv/bin/activate  # (or .venv\Scripts\activate on Windows)
pip install --upgrade pip
pip install openai faiss-cpu sentence-transformers python-docx PyPDF2 matplotlib scikit-learn tqdm
# Optional (if you prefer LangChain)
pip install langchain
```
By default the notebook uses **local embeddings** via `sentence-transformers`, which need no API key.

If you prefer OpenAI embeddings or OpenAI LLMs, set your API key as an environment variable:
```bash
export OPENAI_API_KEY='sk-...'
```

## 2. Required Files

Place these PDFs in the same folder as this notebook or provide paths when prompted:
- `sebi_brsr.pdf`  (SEBI BRSR Annexure / guidance PDF)
- `company_brsr.pdf` (e.g., Infosys/TCS/Wipro BRSR PDF)

This notebook will create an `outputs/` folder where it stores images and the final Word doc.

In [14]:
# Create outputs folder
from pathlib import Path
Path('outputs').mkdir(exist_ok=True)
print('outputs/ created')

outputs/ created


## 3. Simple PDF loader & extractor
This uses `PyPDF2` to extract text. For more robust extraction, consider `unstructured` or `pdfplumber`.

In [15]:
import PyPDF2
def extract_text_from_pdf(path):
    text = []
    with open(path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        for page in reader.pages:
            try:
                page_text = page.extract_text() or ''
            except Exception:
                # Keep going if a single page cannot be parsed; it contributes no text
                page_text = ''
            text.append(page_text)
    return '\n'.join(text)

# Example usage (replace with your actual file paths)
sebi_pdf = 'sebi_brsr.pdf'
company_pdf = 'company_brsr.pdf'
print('NOTE: replace sebi_brsr.pdf and company_brsr.pdf with your local file paths before running')

NOTE: replace sebi_brsr.pdf and company_brsr.pdf with your local file paths before running

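Section 8 asks the LLM for page numbers where available, but `extract_text_from_pdf` joins all pages into one string, so the page of origin is lost. A minimal sketch of one way to keep it, assuming you collect the per-page strings in a list instead of joining them. `chunk_pages` is a hypothetical helper, not part of the notebook.

```python
import re

def chunk_pages(pages, chunk_size=800, overlap=100):
    """Chunk each page separately and tag every chunk with its 1-based page number.

    `pages` is a list of strings, one per PDF page (assumption: the extractor
    is changed to return this list rather than the joined text).
    """
    if overlap >= chunk_size:
        raise ValueError('overlap must be smaller than chunk_size')
    tagged = []
    for page_no, page_text in enumerate(pages, start=1):
        text = re.sub(r'\s+', ' ', page_text).strip()
        # Empty pages produce an empty range and therefore no chunks
        for start in range(0, len(text), chunk_size - overlap):
            tagged.append({'page': page_no, 'chunk': text[start:start + chunk_size]})
    return tagged

# Toy input standing in for two extracted pages
pages = ['Energy data ' * 100, 'Water data']
tagged = chunk_pages(pages)
print([(t['page'], len(t['chunk'])) for t in tagged])
```

The trade-off is that chunks no longer span page boundaries, so a table split across two pages ends up in separate chunks.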

## 4. Text chunking
We split long text into fixed-size chunks with some overlap so retrieval works better.

In [16]:
import re
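def simple_chunk_text(text, chunk_size=800, overlap=100):
    # The step between chunk starts must be positive, otherwise the loop never advances
    if overlap >= chunk_size:
        raise ValueError('overlap must be smaller than chunk_size')
    # Collapse whitespace runs (including newlines) into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    chunks = []
    i = 0
    while i < len(text):
        chunk = text[i:i+chunk_size]
        chunks.append(chunk)
        i += chunk_size - overlap
    return chunks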

# Small demo
sample = 'This is a long text. ' * 200
print('Chunks:', len(simple_chunk_text(sample)))

Chunks: 6

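The chunk count in the demo follows from the distance between chunk starts, `chunk_size - overlap`. A quick standalone check of that arithmetic (the helper name `expected_chunks` is ours, for illustration only):

```python
import math

def expected_chunks(n_chars, chunk_size=800, overlap=100):
    """Number of chunks the sliding window yields for a text of n_chars characters."""
    step = chunk_size - overlap          # distance between consecutive chunk starts
    return math.ceil(n_chars / step)     # one chunk per start offset below n_chars

# The demo text is 'This is a long text. ' * 200: 4200 characters, 4199 after strip()
n = len(('This is a long text. ' * 200).strip())
print(n, expected_chunks(n))  # 4199 6
```

Consecutive chunks share `overlap` characters, and the last chunk is usually shorter than `chunk_size`.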

## 5. Embeddings
Two options:
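1. **OpenAI embeddings** (requires `OPENAI_API_KEY`): hosted, so nothing runs locally, but every chunk is sent to the API.
2. **SentenceTransformers** local model: works offline once the model has been downloaded; the default `all-MiniLM-L6-v2` is a small model that runs on CPU.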
Choose one and follow the cells.

In [17]:
# Option A: OpenAI embeddings (comment out if not using)
USE_OPENAI_EMBEDDINGS = False
OPENAI_MODEL_EMBED = 'text-embedding-3-small'

# Option B: local sentence-transformers
USE_LOCAL_EMBEDDINGS = True
LOCAL_EMBED_MODEL = 'all-MiniLM-L6-v2'

print('Embedding setup: USE_OPENAI_EMBEDDINGS=', USE_OPENAI_EMBEDDINGS, 'USE_LOCAL_EMBEDDINGS=', USE_LOCAL_EMBEDDINGS)

Embedding setup: USE_OPENAI_EMBEDDINGS= False USE_LOCAL_EMBEDDINGS= True


In [18]:
if USE_LOCAL_EMBEDDINGS:
    from sentence_transformers import SentenceTransformer
    embedder = SentenceTransformer(LOCAL_EMBED_MODEL)
    def get_embeddings(texts):
        # returns list of vectors
        return embedder.encode(texts, show_progress_bar=True)
    print('Loaded local embedding model:', LOCAL_EMBED_MODEL)
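else:
    print('Using OpenAI embeddings (you must set OPENAI_API_KEY in environment)')
    from openai import OpenAI
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    def get_embeddings(texts):
        # openai>=1.0 client interface; the legacy openai.Embedding.create call was removed in 1.0
        res = client.embeddings.create(model=OPENAI_MODEL_EMBED, input=texts)
        return [d.embedding for d in res.data]
    print('OpenAI embeddings ready')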

Loaded local embedding model: all-MiniLM-L6-v2


## 6. Build FAISS index
We will embed chunks and index them with FAISS for fast nearest-neighbor retrieval.

In [19]:
import faiss
import numpy as np

def build_faiss_index(chunks, embed_fn):
    # FAISS expects a 2-D float32 array of shape (n_chunks, dim)
    vectors = np.asarray(embed_fn(chunks), dtype='float32')
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return index, vectors

def faiss_search(index, vectors, query_vec, k=4):
    # Returns (indices, distances); FAISS pads indices with -1 when fewer than k vectors exist
    D, I = index.search(np.asarray([query_vec], dtype='float32'), k)
    return I[0], D[0]

print('FAISS functions ready')

FAISS functions ready


## 7. Retrieval helper
Given a query (a SEBI requirement), retrieve the most relevant chunks from the company report.

In [20]:
def retrieve_chunks(query, index, vectors, chunks, embed_fn, k=4):
    qv = embed_fn([query])[0]
    idxs, dists = faiss_search(index, vectors, qv, k=k)
    # 'score' is the FAISS L2 distance, so lower means more similar; index -1 marks an empty slot
    results = [{'chunk': chunks[i], 'score': float(d), 'index': int(i)} for i, d in zip(idxs, dists) if i != -1]
    return results

print('Retrieval helper ready')

Retrieval helper ready

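Note that the `score` returned here is a distance, not a similarity: `IndexFlatL2` reports squared Euclidean distances, so smaller values mean closer matches. If the embeddings are unit-normalized (an assumption worth checking for your model), the squared distance `d2` and the cosine similarity are related by `cos = 1 - d2 / 2`. A pure-Python sketch of that relationship, using hypothetical helper names and toy 2-D vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # Plain dot product; equals cosine similarity only for unit vectors
    return sum(x * y for x, y in zip(a, b))

def l2_to_cosine(d2):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - d2 / 2.0

query = normalize([1.0, 0.0])
docs = [normalize(v) for v in ([1.0, 0.1], [0.0, 1.0], [-1.0, 0.0])]
for d in docs:
    d2 = squared_l2(query, d)
    print(round(d2, 3), round(l2_to_cosine(d2), 3), round(cosine(query, d), 3))
```

Converting to a similarity can make the scores shown in the prompt easier to interpret, since higher then means more relevant.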

## 8. LLM Comparison Prompt Template
We will send the SEBI requirement + retrieved company chunks to an LLM and ask for:
1. A short mapping explanation
2. A drift score (0–3) with short justification
3. Evidence quotes (company chunks & page numbers if available)

An example prompt is below. This notebook uses OpenAI's chat completions API, but you can adapt it to any LLM provider.

In [21]:
LLM_TYPE = 'openai'  # 'openai' or 'mock' (mock returns example outputs for offline testing)
OPENAI_CHAT_MODEL = 'gpt-4o-mini'  # replace with your model

def build_prompt(sebi_requirement, retrieved_chunks):
    prompt = []
    prompt.append('You are an auditor comparing a regulation requirement to company disclosures.')
    prompt.append('SEBI requirement:\n' + sebi_requirement)
    prompt.append('\n---\nRetrieved company evidence (most relevant chunks):')
    for i, r in enumerate(retrieved_chunks):
        snippet = r['chunk'][:800]
        prompt.append(f'\n--- Chunk {i+1} (score:{r["score"]}):\n{snippet}\n')
    prompt.append('\nTask:')
    prompt.append('1) Provide a one-paragraph mapping between the SEBI requirement and the company evidence.')
    prompt.append('2) Assign a drift score (0, 1, 2 or 3, where 0 = verbatim and 3 = vague/performative) with a one-line justification.')
    prompt.append('3) Provide up to two short direct quotes from the evidence that support your judgment.')
    prompt.append('4) End with a short machine-readable JSON object with keys: {"score": <int>, "justification": <str>, "quotes": [..]}')
    return '\n'.join(prompt)

def call_llm(prompt):
    if LLM_TYPE == 'mock':
        # simple deterministic mock answer for offline testing
        return {
            'score': 0,
            'justification': 'Numbers match SEBI expectation; detailed tables present',
            'quotes': ['Total electricity consumption 712,134 GJ', 'Total Scope 1 emissions 8,593 tCO2e']
        }
    else:
        import os, re, json
        from openai import OpenAI
        if not os.getenv('OPENAI_API_KEY'):
            raise ValueError('OPENAI_API_KEY not set in environment. Set it and rerun or use mock mode.')
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        # Chat completions via the openai>=1.0 client (openai.ChatCompletion was removed in 1.0)
        resp = client.chat.completions.create(
            model=OPENAI_CHAT_MODEL,
            messages=[{'role':'system','content':'You are a concise auditor.'},{'role':'user','content':prompt}],
            temperature=0.0,
            max_tokens=500
        )
        text = resp.choices[0].message.content or ''
        # The prompt asks for a JSON object at the end of the reply; look for it
        m = re.search(r"(\{\s*\"score\".*\})", text, re.S)
        if m:
            try:
                return json.loads(m.group(1))
            except json.JSONDecodeError:
                pass  # malformed or truncated JSON: fall through to the text fallback
        # fallback: return entire text as justification
        return {'score': None, 'justification': text, 'quotes': []}

print('LLM prompt/template ready. Set LLM_TYPE and OPENAI_API_KEY as required.')

LLM prompt/template ready. Set LLM_TYPE and OPENAI_API_KEY as required.

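The regular expression in `call_llm` is greedy, so it can over-capture when the reply contains more text or a stray brace after the JSON object. One alternative, sketched here with the standard library only, is to let `json.JSONDecoder.raw_decode` parse from each `{` and keep the last object that has a `score` key. `extract_score_json` is a hypothetical helper, not part of the notebook.

```python
import json

def extract_score_json(text):
    """Return the last JSON object in `text` that contains a 'score' key, or None."""
    decoder = json.JSONDecoder()
    found = None
    pos = text.find('{')
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and reports where it ended
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            pos = text.find('{', pos + 1)
            continue
        if isinstance(obj, dict) and 'score' in obj:
            found = obj
        pos = text.find('{', end)
    return found

# Toy reply with trailing text and a stray brace after the JSON object
reply = 'Mapping: partial. {"score": 2, "justification": "No intensity", "quotes": []} (end) }'
print(extract_score_json(reply))
```

Objects nested inside another object are skipped, which is acceptable here because the prompt asks for a single top-level object.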

## 9. High-level pipeline function
This function ties everything together for a list of SEBI requirements (we will extract key Principle 6 items manually or from the SEBI PDF).

In [22]:
import json
def analyze_requirements(requirements, company_chunks, company_index, vectors, embed_fn, top_k=4):
    results = []
    for req in requirements:
        retrieved = retrieve_chunks(req, company_index, vectors, company_chunks, embed_fn, k=top_k)
        prompt = build_prompt(req, retrieved)
        try:
            llm_out = call_llm(prompt)
        except Exception as e:
            print('LLM call failed:', e)
            llm_out = {'score': None, 'justification': str(e), 'quotes': []}
        results.append({'requirement': req, 'retrieved': retrieved, 'llm': llm_out})
    return results

print('analyze_requirements function ready')

analyze_requirements function ready

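Each entry in `results` is a plain dict, so a run can be saved and inspected later without calling the LLM again. A minimal sketch, assuming the result shape produced by `analyze_requirements` above; the `save_results` helper and the `results.json` file name are illustrative, and the sample values are toy data.

```python
import json
import tempfile
from pathlib import Path

def save_results(results, out_path='outputs/results.json'):
    """Write pipeline results to a JSON file and return the path as a string."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('w', encoding='utf-8') as f:
        # default=str guards against values json cannot encode natively (e.g. NumPy scalars)
        json.dump(results, f, indent=2, ensure_ascii=False, default=str)
    return str(path)

# Toy result in the same shape as analyze_requirements() output
example = [{
    'requirement': 'Water withdrawal and water intensity',
    'retrieved': [{'chunk': 'Water withdrawal 2274679 kl.', 'score': 0.42, 'index': 0}],
    'llm': {'score': 0, 'justification': 'Figures disclosed', 'quotes': []},
}]

# Round-trip through a temporary directory so the demo leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    saved = save_results(example, Path(tmp) / 'results.json')
    reloaded = json.loads(Path(saved).read_text(encoding='utf-8'))
print(reloaded[0]['llm']['score'])
```

In the notebook you would call `save_results(results)` with the default path so the file lands next to the images in `outputs/`.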

## 10. Visualization helpers (Sankey & Drift dashboard)
These create and save images to `outputs/`.

In [23]:
import matplotlib.pyplot as plt
from matplotlib.sankey import Sankey
def create_sankey(mapping_pairs, out_path='outputs/sankey.png'):
    # mapping_pairs: list of (left_label, right_label, weight)
    labels_left = [p[0] for p in mapping_pairs]
    labels_right = [p[1] for p in mapping_pairs]
    flows = [p[2] for p in mapping_pairs]
    fig = plt.figure(figsize=(8,5))
    ax = fig.add_subplot(1,1,1)
    S = Sankey(ax=ax, unit=None)
    for i, f in enumerate(flows):
        try:
            S.add(flows=[f, -f], labels=[labels_left[i], labels_right[i]], orientations=[0,0], trunklength=1.0, pathlengths=[0.25,0.25])
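        except Exception as e:
            # Report the failing flow instead of silently dropping it
            print(f'Skipping Sankey flow {labels_left[i]!r} -> {labels_right[i]!r}: {e}')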
    S.finish()
    plt.title('Sankey: Principle 6 concepts → Company evidence (illustrative)')
    plt.savefig(out_path, bbox_inches='tight')
    plt.close(fig)
    return out_path

def create_drift_dashboard(results, out_path='outputs/drift_dashboard.png'):
    names = [r['requirement'][:60] for r in results]
    # Missing scores (None, e.g. after a failed or unparsed LLM call) are plotted as the worst case, 3
    scores = [r['llm']['score'] if r['llm'].get('score') is not None else 3 for r in results]
    color_map = {0:'#2ca02c', 1:'#98df8a', 2:'#ff7f0e', 3:'#d62728'}
    colors = [color_map.get(int(s), '#7f7f7f') for s in scores]
    fig, ax = plt.subplots(figsize=(10, max(4, len(names)*0.4)))
    bars = ax.barh(range(len(names)), scores, color=colors)
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names, fontsize=9)
    ax.set_xlim(-0.2,3.5)
    ax.set_xlabel('Drift score (0 = verbatim, 3 = vague/performative)')
    ax.invert_yaxis()
    for i, v in enumerate(scores):
        ax.text(v + 0.05, i, str(v), va='center')
    plt.title('Drift Dashboard — Principle 6')
    plt.tight_layout()
    plt.savefig(out_path, bbox_inches='tight')
    plt.close(fig)
    return out_path

print('Visualization helpers ready')

Visualization helpers ready

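`create_drift_dashboard` passes the LLM's score to `int(...)` and to the bar chart as-is, which fails if the model answers with something like `'2/3'`. A hedged sketch of a normalizing step you could run before plotting. `normalize_score` is a hypothetical helper, and treating fractional or out-of-range values as missing is our assumption, not a rule from the assignment.

```python
def normalize_score(raw):
    """Coerce an LLM score to an int in 0..3, or None if it cannot be trusted."""
    if isinstance(raw, bool):      # bool is a subclass of int; reject it explicitly
        return None
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None                # None, '2/3', 'high', ...
    if not value.is_integer():     # fractional values, NaN and infinity
        return None
    value = int(value)
    return value if 0 <= value <= 3 else None

for raw in [0, '2', 3.0, '2/3', None, 7, 1.5]:
    print(repr(raw), '->', normalize_score(raw))
```

Applying it to `r['llm']['score']` right after `call_llm` returns would keep the dashboard and the Word report consistent.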

## 11. Word report generator
This creates a `.docx` containing the summary, a drift table, and embeds the images produced above.

In [24]:
from docx import Document
from docx.shared import Pt, Inches
def create_word_report(results, sankey_path='outputs/sankey.png', drift_path='outputs/drift_dashboard.png', out_path='outputs/brsr_faithfulness_report.docx'):
    doc = Document()
    doc.styles['Normal'].font.name = 'Calibri'
    doc.styles['Normal'].font.size = Pt(11)
    doc.add_heading('Faithful Concept Mapper — BRSR Principle 6', level=1)
    doc.add_paragraph('This report is generated by the BRSR Faithfulness Mapping pipeline.')
    doc.add_heading('Summary', level=2)
    good = sum(1 for r in results if r['llm'].get('score')==0)
    doc.add_paragraph(f'Number of requirements analyzed: {len(results)}. Number with score 0 (verbatim): {good}.')
    doc.add_heading('Drift Table', level=2)
    table = doc.add_table(rows=1, cols=4)
    hdr = table.rows[0].cells
    hdr[0].text = 'Requirement'
    hdr[1].text = 'Drift score'
    hdr[2].text = 'Justification'
    hdr[3].text = 'Quotes'
    for r in results:
        row = table.add_row().cells
        row[0].text = r['requirement']
        row[1].text = str(r['llm'].get('score'))
        row[2].text = r['llm'].get('justification','')[:500]
        row[3].text = '\n'.join(r['llm'].get('quotes',[]))[:500]
    doc.add_heading('Visuals', level=2)
    try:
        doc.add_picture(sankey_path, width=Inches(6))
    except Exception:
        doc.add_paragraph('Sankey image not available')
    try:
        doc.add_picture(drift_path, width=Inches(6))
    except Exception:
        doc.add_paragraph('Drift dashboard not available')
    doc.save(out_path)
    return out_path

print('Word report generator ready')

Word report generator ready


## 12. Example: Minimal run-through (mock mode)
If you do not want to call a real LLM while testing, set `LLM_TYPE = 'mock'` earlier. Below is an example that runs the entire pipeline in mock mode using toy text.

In [25]:
# Demo run (mock)
LLM_TYPE = 'mock'
sebi_text = 'PRINCIPLE 6: Report total energy consumption, energy intensity, Scope 1 and Scope 2 GHG emissions, water withdrawal and water intensity, waste management by category.'
company_text = (
    'Total electricity consumption 712134 GJ. Total fuel consumption 38852 GJ. Energy intensity 5.11 GJ per crore. '
    'Scope 1 emissions 8593 tCO2e. Scope 2 emissions 62352 tCO2e. '
    'Water withdrawal 2274679 kl. Water intensity 15.5 kl per crore. '
    'Waste: 1200 t general waste, 800 t recycled.'
)
sebi_chunks = simple_chunk_text(sebi_text, chunk_size=400, overlap=50)
company_chunks = simple_chunk_text(company_text, chunk_size=400, overlap=50)
company_index, vectors = build_faiss_index(company_chunks, get_embeddings)
requirements = [
    'Total energy consumption and energy intensity',
    'Scope 1 and Scope 2 GHG emissions and intensity',
    'Water withdrawal and water intensity',
    'Waste management by category and recycling rates'
]
results = analyze_requirements(requirements, company_chunks, company_index, vectors, get_embeddings, top_k=2)
sankey_path = create_sankey([
    ('Energy', 'Energy tables', 1),
    ('GHG', 'Scope 1 & 2 tables', 1),
    ('Water', 'Water tables', 1),
    ('Waste', 'Waste tables', 1)
], out_path='outputs/sankey.png')
drift_path = create_drift_dashboard(results, out_path='outputs/drift_dashboard.png')
doc_path = create_word_report(results, sankey_path=sankey_path, drift_path=drift_path, out_path='outputs/brsr_faithfulness_report.docx')
print('Demo outputs created:')
print(' -', sankey_path)
print(' -', drift_path)
print(' -', doc_path)


Ignoring fixed y limits to fulfill fixed data aspect with adjustable data limits.


Demo outputs created:
 - outputs/sankey.png
 - outputs/drift_dashboard.png
 - outputs/brsr_faithfulness_report.docx


## 13. Next steps (when you run with real PDFs)
1. Point `sebi_pdf` and `company_pdf` at your real `sebi_brsr.pdf` and `company_brsr.pdf` files.
2. Extract text with `extract_text_from_pdf` and chunk with `simple_chunk_text`.
3. Choose the embedding backend: the code branches on `USE_LOCAL_EMBEDDINGS`, so set it to `False` to use OpenAI embeddings.
4. Build FAISS index for the company report chunks.
5. Prepare a list of SEBI Principle 6 requirements (you can manually type the essential and leadership indicators from the SEBI PDF).
6. Run `analyze_requirements(...)` to get results.
7. Generate visuals and Word report.

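Steps 2 to 6 can be wired together in a single function. The sketch below takes every helper as an argument so that it runs on its own; `run_real_pipeline` is a hypothetical name, and the lambdas are stand-ins, not the notebook's real functions.

```python
def run_real_pipeline(company_pdf, requirements, extract, chunk, embed,
                      build_index, analyze, top_k=4):
    """Wire steps 2-6 together; all helpers are passed in as callables."""
    company_text = extract(company_pdf)                    # step 2: PDF -> text
    company_chunks = chunk(company_text)                   # step 2: text -> chunks
    index, vectors = build_index(company_chunks, embed)    # steps 3-4: embed and index
    return analyze(requirements, company_chunks, index, vectors, embed, top_k=top_k)  # step 6

# Stand-in callables so the sketch runs without PDFs, FAISS, or an LLM
results = run_real_pipeline(
    'company_brsr.pdf',
    ['Water withdrawal and water intensity'],
    extract=lambda path: 'Water withdrawal 2274679 kl.',
    chunk=lambda text: [text],
    embed=lambda texts: [[0.0] for _ in texts],
    build_index=lambda chunks, embed: ('index', embed(chunks)),
    analyze=lambda reqs, chunks, index, vectors, embed, top_k: [
        {'requirement': r, 'retrieved': chunks[:top_k], 'llm': {'score': None}} for r in reqs
    ],
)
print(len(results))
```

In the notebook you would pass `extract_text_from_pdf`, `simple_chunk_text`, `get_embeddings`, `build_faiss_index` and `analyze_requirements` in those positions, then hand `results` to the visualization and report functions for step 7.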
