
# Financial Report Analyzer — MVP Completed on 2025-07-22


**What it does:**  
This notebook lets you query, compare, and summarize key numbers and trends from multiple banks’ financial reports (PDF or text), using semantic search and a local LLM to produce plain-English output.

- Supports single-quarter, pairwise, or multi-quarter analysis.
- Just type your query, pick a company and quarter(s), and get an analyst-style summary in seconds.

**Who it’s for:**  
Anyone who wants to save time reading financial reports—analysts, journalists, PMs, students, or anyone working with earnings releases.




**Sample Output — Multi-Quarter Trend Summary (JPM)**

**User Query:**

Type your query below (e.g. 'Q1 2024 net income', 'Q2 2024 revenue for GS', 'credit costs for JPM')
Query:  JPM

Company (optional, e.g. 'JPM' or 'GS', leave blank if any):  
JPM

Prompt style? ('standalone', 'pairwise', 'multiquarter') — default is 'standalone'
Prompt type:  multiquarter

Enter ALL quarters to analyze, separated by comma (e.g. 'Q1 2024, Q2 2024, Q3 2024'):
Quarters:  Q1 2024, Q2 2024, Q3 2024

**========== SUMMARY FROM LLM ==========**

**Plain English Year-in-Review Summary**

- **Revenue Growth Driven by Payments and Wealth Management:** Revenue increased significantly across all quarters, particularly driven by the Payments segment (up 9% YoY in Q1, 11% in Q3) and Wealth Management (AUM growth of 3.9T USD and asset management fees up 15% in Q3). This suggests a successful strategy to expand into higher-margin payment services and attract more assets under management.

- **Investment Banking Boom – A Temporary Tailwind:** Investment Banking fees saw a dramatic surge in Q3, increasing by 31%. While this is impressive, it's likely fueled by high deal volume and M&A activity, indicating a potentially temporary boost rather than a fundamental shift.

- **Shifting Deposits & Increased Lending:** Client Deposits grew significantly in Q3 (up 17% YoY), demonstrating increased trust and demand for the bank’s services. Simultaneously, loan volume increased modestly (2% YoY), indicating a willingness to borrow despite rising interest rates. 

- **Credit Costs Remain Low – Underwriting Discipline Pays Off:** Credit losses remained remarkably low throughout the year, with only 32M USD in Q1 dropping to 316M USD in Q3. This reflects strong underwriting standards and effective risk management, benefiting the bank's profitability. 

- **Wealth Management Driving Key Metrics:**  Record long-term net inflows of 72B USD in AWM and asset management fees up 15% in Q3 highlight the bank’s growing strength in wealth management. This area is now contributing a very large portion of revenue. 

---
**Note:** Data is based on the information provided in the four quarters. Further analysis would require a deeper dive into specific segments and market conditions

**===============================**





In [3]:
!pip install sentence-transformers faiss-cpu tqdm




In [7]:
!pip install pdfplumber



In [9]:
## 1. Imports & Dependencies


# --- Imports: File Handling, NLP, PDF Processing, Vector Search, Progress, and HTTP ---
import os
import re
import pdfplumber         # For parsing PDF files
import pickle
import numpy as np        # Numeric operations
from sentence_transformers import SentenceTransformer  # Embeddings for semantic search
import faiss             # Vector similarity search for retrieval (IndexFlatL2 below is exact, not approximate)
from tqdm import tqdm     # Progress bar for loops
import requests           # For making HTTP requests to LLM API



In [11]:
## 2. File Loading & Chunking Utility



# --- Function: Load and Chunk Financial Report ---
def load_and_chunk_report(
    path,
    company=None,
    quarter=None,
    chunk_size=500,
    encoding="utf-8"
):
    """
    Loads a financial report (.txt or .pdf), splits it into chunks for embedding/RAG.
    Returns: List of chunk dicts with metadata (company, quarter, page, etc.)
    """
    ext = os.path.splitext(path)[-1].lower()
    text_by_page = []

    # Handle TXT files: split by words; the chunk number serves as a synthetic page number
    if ext == ".txt":
        with open(path, "r", encoding=encoding) as f:
            content = f.read()
        words = content.split()
        for i in range(0, len(words), chunk_size):
            chunk = " ".join(words[i:i+chunk_size])
            text_by_page.append({"text": chunk, "page": (i//chunk_size)+1})

    # Handle PDF files: extract text per page, clean, then chunk further if needed
    elif ext == ".pdf":
        with pdfplumber.open(path) as pdf:
            for i, page in enumerate(pdf.pages):
                page_text = page.extract_text()
                if not page_text:
                    continue  # Skip pages with no extractable text
                clean_text = re.sub(r"\n+", "\n", page_text.strip())
                text_by_page.append({"text": clean_text, "page": i+1})

    else:
        raise ValueError("Only .txt and .pdf supported.")

    # Further split each page into smaller chunks (by words)
    chunks = []
    for page_data in text_by_page:
        words = page_data["text"].split()
        for i in range(0, len(words), chunk_size):
            chunk = " ".join(words[i:i+chunk_size])
            if chunk.strip():
                chunks.append({
                    "company": company,
                    "quarter": quarter,
                    "content": chunk,
                    "page": page_data["page"],
                    "source": path
                })
    return chunks
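
As a quick illustration of the word-window splitting used inside `load_and_chunk_report`, the sketch below applies the same slicing to an in-memory string. `chunk_words` is a hypothetical helper name introduced only for this example; it is not part of the notebook.

```python
def chunk_words(text, chunk_size=500):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

sample = "one two three four five six seven"
chunks = chunk_words(sample, chunk_size=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

Because the windows do not overlap, a figure and its label can land in different chunks when they straddle a boundary. A smaller `chunk_size` or overlapping windows may help, at the cost of more embeddings.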





In [13]:
## 3. Demo/Test Data (for Quick Prototyping)


# --- Demo/Test Data: Financial Report Chunks ---
# List of dicts, each representing a chunk of financial data for a specific company, quarter, and section.
# Used for semantic search and summary when no PDFs are loaded.


# For MVP/testing: manually curated report chunks for GS and JPM (Q1–Q3 2024).
# In production, these would be generated by loading and chunking PDF files.



data = [
    # ---- GS Q1 ----
    {
        "company": "GS",
        "quarter": "Q1 2024",
        "business_unit": "",
        "section": "Key Highlights & Overview",
        "content": "Net Income: $4.13B (EPS $11.58)\nROE: 14.8% | ROTCE: 15.9%\nCapital Ratios (CET1): 14.7% (Standardized), 15.9% (Advanced)\nAverage Loans: $184B\nAverage Deposits: $441B\nRevenue: $14.2B\nExpense: $8.7B | Efficiency ratio: 60.9%\nCredit Costs: $318M (credit cards + wholesale impairments)",
        "data_type": "financial_metric",
        "tags": ["GS", "Q1 2024", "Net Income", "ROE", "Revenue", "Credit Costs"],
        "source": "Earnings Release"
    },
    {
        "company": "GS",
        "quarter": "Q1 2024",
        "business_unit": "Global Banking & Markets",
        "section": "Segment Performance",
        "content": "Net Revenue: $9.7B\nInvestment Banking Fees: $2.1B",
        "delta": {
            "Investment Banking Fees": { "direction": "up", "value": "32%" }
        },
        "sub_metrics": {
            "Advisory": "$1.01B",
            "Equity Underwriting": "$370M",
            "Debt Underwriting": "$699M",
            "FICC": "$4.3B",
            "FICC Intermediation": "$3.47B",
            "FICC Financing": "$852M",
            "Equities": "$3.3B",
            "Equities Intermediation": "$1.99B",
            "Equities Financing": "$1.32B"
        },
        "data_type": "segment_performance",
        "tags": ["GS", "Q1 2024", "Global Banking & Markets", "Investment Banking", "FICC", "Equities"],
        "source": "Earnings Release"
    },
    {
        "company": "GS",
        "quarter": "Q1 2024",
        "business_unit": "Asset & Wealth Management",
        "section": "Segment Performance",
        "content": "Net Revenue: $3.8B\nManagement & Other Fees: $2.45B\nPrivate Banking & Lending: $682M\nEquity Investments: $222M\nDebt Investments: $345M\nAUM: $2.85T",
        "delta": {
            "AUM": { "direction": "up", "value": "$36B" }
        },
        "data_type": "segment_performance",
        "tags": ["GS", "Q1 2024", "Asset & Wealth Management", "AUM", "Fees"],
        "source": "Earnings Release"
    },
    # ---- GS Q2 ----
    {
        "company": "GS",
        "quarter": "Q2 2024",
        "business_unit": "",
        "section": "Key Highlights & Overview",
        "content": "Net Income: $3.04B (EPS $8.62)\nROE: 10.9% | ROTCE: 11.6%\nCapital Ratios (CET1): 14.8% (Standardized), 15.7% (Advanced)\nAverage Loans: $184B (flat YoY and QoQ)\nAverage Deposits: $433B\nRevenue: $12.73B\nExpense: $8.53B | Efficiency ratio: 63.8% (YTD)\nCredit Costs: $282M",
        "delta": {
            "Average Deposits": { "direction": "down", "value": "2% QoQ" },
            "Revenue": [
                { "direction": "down", "value": "10% QoQ" },
                { "direction": "up", "value": "17% YoY" }
            ],
            "Credit Costs": [
                { "direction": "down", "value": "11% QoQ" },
                { "direction": "down", "value": "54% YoY" }
            ]
        },
        "data_type": "financial_metric",
        "tags": ["GS", "Q2 2024", "Net Income", "ROE", "Credit Costs", "Revenue", "Capital"],
        "source": "Earnings Release"
    },
    {
        "company": "GS",
        "quarter": "Q2 2024",
        "business_unit": "Global Banking & Markets",
        "section": "Segment Performance",
        "content": "Net Revenues: $8.18B\nInvestment Banking Fees: $1.73B\nFICC: $3.18B\nEquities: $3.17B\nFICC Intermediation: $2.33B | FICC Financing: $850M\nEquities Intermediation: $1.79B | Equities Financing: $1.38B",
        "delta": {
            "Net Revenues": [
                { "direction": "down", "value": "16% QoQ" },
                { "direction": "up", "value": "14% YoY" }
            ],
            "Investment Banking Fees": [
                { "direction": "down", "value": "17% QoQ" },
                { "direction": "up", "value": "21% YoY" }
            ],
            "FICC": [
                { "direction": "down", "value": "26% QoQ" },
                { "direction": "up", "value": "17% YoY" }
            ],
            "Equities": [
                { "direction": "down", "value": "4% QoQ" },
                { "direction": "up", "value": "7% YoY" }
            ]
        },
        "data_type": "segment_performance",
        "tags": ["GS", "Q2 2024", "Global Banking & Markets", "Investment Banking", "FICC", "Equities"],
        "source": "Earnings Release"
    },
    {
        "company": "GS",
        "quarter": "Q2 2024",
        "business_unit": "Asset & Wealth Management",
        "section": "Segment Performance",
        "content": "Net Revenues: $3.88B\nManagement & Other Fees: $2.54B\nEquity Investments: $292M\nDebt Investments: $297M\nPrivate Banking & Lending: $707M\nAssets Under Supervision (AUS): $2.93T",
        "delta": {
            "Net Revenues": [
                { "direction": "up", "value": "2% QoQ" },
                { "direction": "up", "value": "27% YoY" }
            ],
            "Debt Investments": { "direction": "up", "value": "51% YoY" },
            "Private Banking & Lending": { "direction": "down", "value": "19% YoY" },
            "AUS": { "direction": "up", "value": "$86B QoQ" }
        },
        "data_type": "segment_performance",
        "tags": ["GS", "Q2 2024", "Asset & Wealth Management", "AUS", "Management Fees"],
        "source": "Earnings Release"
    },

    # ---- JPM Q1 ----
    {
        "company": "JPM",
        "quarter": "Q1 2024",
        "business_unit": "",
        "section": "Key Highlights & Overview",
        "content": "Net Income: $13.4B (EPS $4.44); excluding FDIC charge: $14.0B (EPS $4.63)\nROE: 17% | ROTCE: 21%\nCapital Ratios (CET1): 15.0% (Standardized), 15.3% (Advanced)\nAverage Loans: $1.3T\nAverage Deposits: ↑2% YoY\nRevenue: Reported $41.9B | Managed $42.5B\nExpense: $22.8B | Overhead ratio: 53–54%\nCredit Costs: $1.9B (with $2.0B in net charge-offs + $72M reserve release)",
        "delta": {
            "Average Loans": [
                { "direction": "up", "value": "16% YoY" },
                { "direction": "up", "value": "3% ex-First Republic" }
            ],
            "Average Deposits": { "direction": "up", "value": "2% YoY" }
        },
        "data_type": "financial_metric",
        "tags": ["JPM", "Q1 2024", "Net Income", "EPS", "Revenue", "ROE", "CET1"],
        "source": "Earnings Release"
    },
    {
        "company": "JPM",
        "quarter": "Q1 2024",
        "business_unit": "Consumer & Community Banking",
        "section": "Segment Performance",
        "content": "ROE: 35%\nNet Income: $4.8B\nRevenue: $17.7B\nBanking & Wealth Mgmt: $10.3B\nHome Lending: $1.2B\nCard Services & Auto: $6.1B\nCredit Losses: $1.9B\nCard Net Charge-Off Rate: 3.32%\nMobile Customers: ↑7% YoY\nCard Volume: ↑9% YoY",
        "delta": {
            "Net Income": { "direction": "down", "value": "8%" },
            "Revenue": { "direction": "up", "value": "7%" },
            "Banking & Wealth Mgmt": { "direction": "up", "value": "3%" },
            "Home Lending": { "direction": "up", "value": "65%" },
            "Card Services & Auto": { "direction": "up", "value": "8%" },
            "Mobile Customers": { "direction": "up", "value": "7% YoY" },
            "Card Volume": { "direction": "up", "value": "9% YoY" }
        },
        "data_type": "segment_performance",
        "tags": ["JPM", "Q1 2024", "Consumer", "Revenue", "Net Income", "Mobile", "Card"],
        "source": "Earnings Release"
    },
    {
        "company": "JPM",
        "quarter": "Q1 2024",
        "business_unit": "Corporate & Investment Bank",
        "section": "Segment Performance",
        "content": "ROE: 18%\nNet Income: $4.8B\nRevenue: $13.6B\nIB Fees: $2.0B\nPayments: $2.4B\nMarkets Revenue: $8.0B\nFixed Income: $5.3B\nEquity Markets: $2.7B\nCredit Losses: $32M",
        "delta": {
            "Net Income": { "direction": "up", "value": "8%" },
            "IB Fees": { "direction": "up", "value": "21%" },
            "Payments": { "direction": "down", "value": "1%" },
            "Markets Revenue": { "direction": "down", "value": "5%" },
            "Fixed Income": { "direction": "down", "value": "7%" },
            "Equity Markets": { "direction": "flat" }
        },
        "data_type": "segment_performance",
        "tags": ["JPM", "Q1 2024", "Investment Bank", "Markets", "IB Fees", "Fixed Income"],
        "source": "Earnings Release"
    },
    # ---- JPM Q2 ----
    {
        "company": "JPM",
        "quarter": "Q2 2024",
        "business_unit": "",
        "section": "Key Highlights & Overview",
        "content": "Net Income: $18.1B (EPS $6.12); excluding significant items: $13.1B (EPS $4.40)\nROE: 23% | ROTCE: 28% (20% ex. significant items)\nCapital Ratios (CET1): 15.3% (Standardized), 15.5% (Advanced)\nAverage Loans: $1.3T\nAverage Deposits: ↓1% YoY\nRevenue: Reported $50.2B | Managed $51.0B\nExpense: $23.7B | Overhead Ratio: 47%\nCredit Costs: $3.1B (with $2.2B in net charge-offs + $821M reserve build)",
        "delta": {
            "Average Loans": [
                { "direction": "up", "value": "6% YoY" },
                { "direction": "flat", "value": "QoQ" }
            ],
            "Average Deposits": { "direction": "down", "value": "1% YoY" }
        },
        "data_type": "financial_metric",
        "tags": ["JPM", "Q2 2024", "Net Income", "EPS", "Revenue", "ROE", "CET1"],
        "source": "Earnings Release"
    },
    {
        "company": "JPM",
        "quarter": "Q2 2024",
        "business_unit": "Consumer & Community Banking",
        "section": "Segment Performance",
        "content": "ROE: 30%\nNet Income: $4.2B\nRevenue: $17.7B\nBanking & Wealth Mgmt: $10.4B\nHome Lending: $1.3B\nCard Services & Auto: $6.0B\nCredit Losses: $2.6B\nCard Net Charge-Off Rate: 3.50%\nMobile Customers: ↑7% YoY\nCard Volume: ↑7%",
        "delta": {
            "Net Income": { "direction": "down", "value": "21%" },
            "Revenue": { "direction": "up", "value": "3%" },
            "Banking & Wealth Mgmt": { "direction": "down", "value": "5%" },
            "Home Lending": { "direction": "up", "value": "31%" },
            "Card Services & Auto": { "direction": "up", "value": "14%" },
            "Mobile Customers": { "direction": "up", "value": "7% YoY" },
            "Card Volume": { "direction": "up", "value": "7%" }
        },
        "data_type": "segment_performance",
        "tags": ["JPM", "Q2 2024", "Consumer", "Card", "Mobile", "Revenue"],
        "source": "Earnings Release"
    },
    {
        "company": "JPM",
        "quarter": "Q2 2024",
        "business_unit": "Corporate & Investment Bank",
        "section": "Segment Performance",
        "content": "ROE: 17%\nNet Income: $5.9B\nRevenue: $17.9B\nIB Fees: $2.5B\nPayments: $4.5B\nMarkets Revenue: $7.8B\nFixed Income: $4.8B\nEquity Markets: $3.0B\nCredit Losses: $384M\nClient Deposits: ↑2% YoY",
        "delta": {
            "Net Income": { "direction": "up", "value": "11%" },
            "Revenue": { "direction": "up", "value": "9%" },
            "IB Fees": { "direction": "up", "value": "50%" },
            "Payments": { "direction": "down", "value": "4%" },
            "Markets Revenue": { "direction": "up", "value": "10%" },
            "Fixed Income": { "direction": "up", "value": "5%" },
            "Equity Markets": { "direction": "up", "value": "21%" },
            "Client Deposits": { "direction": "up", "value": "2% YoY" }
        },
        "data_type": "segment_performance",
        "tags": ["JPM", "Q2 2024", "CIB", "Markets", "IB Fees", "Fixed Income", "Deposits"],
        "source": "Earnings Release"
    },

    # ---- GS Q3 ----
    {
        "company": "GS",
        "quarter": "Q3 2024",
        "business_unit": "",
        "section": "Key Highlights & Overview",
        "content": "Net Income: $3.0B (EPS $8.40)\nROE: 10.4% | ROTCE: 11.1%\nCapital Ratios (CET1): 14.6% (Standardized), 15.5% (Advanced)\nAverage Loans: $192B\nAverage Deposits: $445B\nRevenue: $12.7B\nExpense: $8.3B | Efficiency ratio: 64.3% (YTD)\nCredit Costs: $397M (primarily credit card net charge-offs, offset by wholesale recoveries)",
        "delta": {
            "Average Loans": { "direction": "up", "value": "4% QoQ" },
            "Average Deposits": { "direction": "up", "value": "3% QoQ" }
        },
        "data_type": "financial_metric",
        "tags": ["GS", "Q3 2024", "Net Income", "ROE", "Loans", "Deposits", "Credit Costs"],
        "source": "Earnings Release"
    },
    {
        "company": "GS",
        "quarter": "Q3 2024",
        "business_unit": "Global Banking & Markets",
        "section": "Segment Performance",
        "content": "Net Revenue: $8.55B\nInvestment Banking Fees: $1.87B\nAdvisory: $875M\nDebt Underwriting: $605M\nEquity Underwriting: $385M\nFICC: $2.96B\nIntermediation: $2.01B\nFinancing: $949M\nEquities: $3.5B\nIntermediation: $2.21B\nFinancing: $1.29B",
        "delta": {
            "Net Revenue": [
                { "direction": "up", "value": "5% QoQ" },
                { "direction": "up", "value": "7% YoY" }
            ],
            "Investment Banking Fees": { "direction": "up", "value": "20%" },
            "FICC": { "direction": "down", "value": "12%" },
            "Equities": { "direction": "up", "value": "18%" }
        },
        "data_type": "segment_performance",
        "tags": ["GS", "Q3 2024", "Global Banking & Markets", "FICC", "Equities", "IB Fees"],
        "source": "Earnings Release"
    },
    {
        "company": "GS",
        "quarter": "Q3 2024",
        "business_unit": "Asset & Wealth Management",
        "section": "Segment Performance",
        "content": "Net Revenue: $3.75B\nManagement & Other Fees: $2.62B\nPrivate Banking & Lending: $756M\nEquity Investments: $116M\nDebt Investments: $178M\nAUM: $3.1T",
        "delta": {
            "Net Revenue": [
                { "direction": "down", "value": "3% QoQ" },
                { "direction": "up", "value": "16% YoY" }
            ],
            "AUM": { "direction": "up", "value": "$169B QoQ" }
        },
        "data_type": "segment_performance",
        "tags": ["GS", "Q3 2024", "Asset & Wealth Management", "AUM", "Private Banking"],
        "source": "Earnings Release"
    },
    {
        "company": "GS",
        "quarter": "Q3 2024",
        "business_unit": "Platform Solutions",
        "section": "Segment Performance",
        "content": "Net Revenue: $391M\nConsumer Platforms: $333M\nTransaction Banking & Other: $58M",
        "delta": {
            "Net Revenue": [
                { "direction": "down", "value": "32% YoY" },
                { "direction": "down", "value": "42% QoQ" }
            ]
        },
        "data_type": "segment_performance",
        "tags": ["GS", "Q3 2024", "Platform Solutions", "Consumer Platforms"],
        "source": "Earnings Release"
    },
    {
        "company": "GS",
        "quarter": "Q3 2024",
        "business_unit": "",
        "section": "Capital & Distributions",
        "content": "Dividend: $3.00/share\nStock Repurchases: $1.0B (2.0M shares at avg. $489.50)\nBook Value/Share: $332.96\nLiquidity: Avg. Global Core Liquid Assets (GCLA): $447B\nLeverage Ratio: 5.5%",
        "delta": {
            "Book Value/Share": [
                { "direction": "up", "value": "2% QoQ" },
                { "direction": "up", "value": "6% YoY" }
            ]
        },
        "data_type": "financial_metric",
        "tags": ["GS", "Q3 2024", "Dividends", "Buybacks", "Book Value", "Liquidity"],
        "source": "Earnings Release"
    },
    {
        "company": "GS",
        "quarter": "Q3 2024",
        "business_unit": "",
        "section": "CEO Commentary",
        "content": "“Our performance demonstrates the strength of our world-class franchise in an improving operating environment.”\n“We continue to lean into our strengths — talent, execution, and risk management — to serve clients and deliver for shareholders.”",
        "data_type": "commentary",
        "tags": ["GS", "Q3 2024", "CEO Commentary", "Franchise Strength", "Execution"],
        "source": "Earnings Call Transcript"
    },

    # ---- JPM Q3 ----
    {
        "company": "JPM",
        "quarter": "Q3 2024",
        "business_unit": "",
        "section": "Key Highlights & Overview",
        "content": "Net Income: $12.9B (EPS $4.37)\nROE: 16% | ROTCE: 19%\nCapital Ratios (CET1): 15.3% (Standardized), 15.5% (Advanced)\nAverage Loans: $1.3T\nAverage Deposits: $1.4T\nRevenue: Reported $42.7B | Managed $43.3B\nExpense: $22.6B | Overhead ratio: 52–53%\nCredit Costs: $3.1B (includes $2.1B in net charge-offs + $1.0B reserve build)",
        "delta": {
            "Average Loans": [
                { "direction": "up", "value": "1% YoY" },
                { "direction": "up", "value": "1% QoQ" }
            ],
            "Average Deposits": [
                { "direction": "up", "value": "1% YoY" },
                { "direction": "up", "value": "1% QoQ" }
            ]
        },
        "data_type": "financial_metric",
        "tags": ["JPM", "Q3 2024", "Net Income", "ROE", "CET1", "Loans", "Deposits"],
        "source": "Earnings Release"
    },
    {
        "company": "JPM",
        "quarter": "Q3 2024",
        "business_unit": "Consumer & Community Banking",
        "section": "Segment Performance",
        "content": "ROE: 29%\nNet Income: $4.0B\nRevenue: $17.8B\nBanking & Wealth Mgmt: $10.1B\nHome Lending: $1.3B\nCard Services & Auto: $6.4B\nCredit Losses: $2.8B\nCard Net Charge-Off Rate: 3.24%\nMobile Customers: ↑7% YoY\nCard Volume: ↑6% YoY",
        "delta": {
            "Net Income": { "direction": "down", "value": "31%" },
            "Revenue": { "direction": "down", "value": "3%" },
            "Banking & Wealth Mgmt": { "direction": "down", "value": "11%" },
            "Home Lending": { "direction": "up", "value": "3%" },
            "Card Services & Auto": { "direction": "up", "value": "11%" },
            "Mobile Customers": { "direction": "up", "value": "7% YoY" },
            "Card Volume": { "direction": "up", "value": "6% YoY" }
        },
        "data_type": "segment_performance",
        "tags": ["JPM", "Q3 2024", "Consumer", "Card", "Mobile", "Revenue", "Net Income"],
        "source": "Earnings Release"
    },
    {
        "company": "JPM",
        "quarter": "Q3 2024",
        "business_unit": "Corporate & Investment Bank",
        "section": "Segment Performance",
        "content": "ROE: 17%\nNet Income: $5.7B\nRevenue: $17.0B\nIB Fees: $2.4B\nPayments: $4.4B\nMarkets Revenue: $7.2B\nFixed Income: $4.5B\nEquity Markets: $2.6B\nCredit Losses: $316M\nClient Deposits: ↑7% YoY",
        "delta": {
            "Net Income": { "direction": "up", "value": "13%" },
            "Revenue": { "direction": "up", "value": "8%" },
            "IB Fees": { "direction": "up", "value": "31%" },
            "Payments": { "direction": "up", "value": "4%" },
            "Markets Revenue": { "direction": "up", "value": "8%" },
            "Fixed Income": { "direction": "flat", "value": "" },
            "Equity Markets": { "direction": "up", "value": "27%" },
            "Client Deposits": { "direction": "up", "value": "7% YoY" }
        },
        "data_type": "segment_performance",
        "tags": ["JPM", "Q3 2024", "CIB", "Markets", "IB Fees", "Fixed Income", "Deposits"],
        "source": "Earnings Release"
    },
    {
        "company": "JPM",
        "quarter": "Q3 2024",
        "business_unit": "Asset & Wealth Management",
        "section": "Segment Performance",
        "content": "ROE: 34%\nNet Income: $1.4B\nRevenue: $5.4B\nAUM: $3.9T\nClient Assets: $5.7T\nLoans: ↑2% YoY | Deposits: ↑17% YoY",
        "delta": {
            "Net Income": { "direction": "down", "value": "5%" },
            "Revenue": { "direction": "up", "value": "9%" },
            "AUM": { "direction": "up", "value": "23%" },
            "Client Assets": { "direction": "up", "value": "23%" },
            "Loans": { "direction": "up", "value": "2% YoY" },
            "Deposits": { "direction": "up", "value": "17% YoY" }
        },
        "data_type": "segment_performance",
        "tags": ["JPM", "Q3 2024", "AWM", "AUM", "Deposits", "Client Assets"],
        "source": "Earnings Release"
    },
    {
        "company": "JPM",
        "quarter": "Q3 2024",
        "business_unit": "Corporate",
        "section": "Segment Performance",
        "content": "Net Income: $1.8B\nRevenue: $3.1B\nSignificant Item: Not specified\nNoninterest Revenue: $155M",
        "delta": {
            "Net Income": { "direction": "up", "value": "$998M YoY" },
            "Revenue": { "direction": "up", "value": "$1.5B YoY" },
            "Noninterest Revenue": { "direction": "up", "value": "$580M YoY" }
        },
        "data_type": "segment_performance",
        "tags": ["JPM", "Q3 2024", "Corporate", "Noninterest Revenue", "YoY"],
        "source": "Earnings Release"
    },
    {
        "company": "JPM",
        "quarter": "Q3 2024",
        "business_unit": "",
        "section": "Capital & Distributions",
        "content": "Dividend: $3.6B ($1.25/share)\nStock Repurchases: $6.0B\nNet Payout Ratio (LTM): 54%\nBook Value/Share: $115.15\nTangible Book Value/Share: $96.42\nLiquidity: $1.5T in cash & marketable securities\nLeverage Ratio: 6.0%",
        "delta": {
            "Book Value/Share": { "direction": "up", "value": "15%" },
            "Tangible Book Value/Share": { "direction": "up", "value": "18%" }
        },
        "data_type": "financial_metric",
        "tags": ["JPM", "Q3 2024", "Dividends", "Buybacks", "Book Value", "Liquidity"],
        "source": "Earnings Release"
    },
    {
        "company": "JPM",
        "quarter": "Q3 2024",
        "business_unit": "",
        "section": "CEO Commentary",
        "content": "“The Firm reported strong Q3 results: $12.9B net income, ROTCE 19%. CIB saw 31% growth in investment banking fees. Payments grew double-digits.”\n“CCB added 2.5M card accounts, with card loans up 11%. AWM hit record $72B in long-term net inflows; asset management fees up 15%.”\n“We’re watching global risks, new regulatory rules, and macro uncertainty closely. Balance sheet remains strong with $1.5T liquidity and $544B in loss-absorbing capital.”",
        "data_type": "commentary",
        "tags": ["JPM", "Q3 2024", "CEO Commentary", "CIB", "CCB", "AWM", "Net Income", "Liquidity"],
        "source": "Earnings Call Transcript"
    }
]
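
The `delta` field in the demo data takes two shapes: a single dict per metric, or a list of dicts (typically one QoQ and one YoY entry). The sketch below shows one way to flatten both into uniform rows. `iter_deltas` is a hypothetical helper, and the example entry reuses values from the GS Q2 2024 overview chunk above.

```python
def iter_deltas(chunk):
    """Yield (metric, direction, value) from a chunk's optional 'delta' field.

    Accepts either a single dict per metric or a list of dicts.
    Missing 'value' keys (e.g. a bare 'flat' entry) become empty strings.
    """
    for metric, entry in chunk.get("delta", {}).items():
        entries = entry if isinstance(entry, list) else [entry]
        for e in entries:
            yield metric, e.get("direction", ""), e.get("value", "")

example = {
    "company": "GS",
    "quarter": "Q2 2024",
    "delta": {
        "Average Deposits": {"direction": "down", "value": "2% QoQ"},
        "Revenue": [
            {"direction": "down", "value": "10% QoQ"},
            {"direction": "up", "value": "17% YoY"},
        ],
    },
}
rows = list(iter_deltas(example))
for row in rows:
    print(row)
```

Note that the indexing cell in section 4 embeds only `content`, so `delta` values are not searchable unless they are folded into the text first.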





In [15]:
## 4. Build Embedding Model & Create Semantic Search Index

# Load pre-trained embedding model for text vectorization
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract content and associated metadata from each data chunk
documents = [chunk.get("content", "") for chunk in data]
metadatas = [
    {
        "company": chunk.get("company", ""),
        "quarter": chunk.get("quarter", ""),
        "business_unit": chunk.get("business_unit", ""),
        "section": chunk.get("section", ""),
        "tags": chunk.get("tags", []),
        "data_type": chunk.get("data_type", ""),
        "source": chunk.get("source", ""),
        "page": chunk.get("page", None)
    }
    for chunk in data
]

# Encode documents as vector embeddings (for semantic search)
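# Take the embedding size from the model instead of hard-coding it (384 for all-MiniLM-L6-v2)
dimension = model.get_sentence_embedding_dimension()
embeddings = model.encode(documents, show_progress_bar=True) if documents else np.empty((0, dimension))
index = faiss.IndexFlatL2(dimension)
if len(documents) > 0:
    index.add(np.array(embeddings, dtype="float32"))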

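# --- Add New Reports Dynamically (Example) ---
# Example code for loading a new PDF and adding its chunks to the index.
# NOTE: add_new_chunks_and_update_index is defined in section 7; run that cell first,
# otherwise a successful load fails here with a NameError.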
path = "MyNewReport.pdf"
company = "GS"
quarter = "Q4 2024"

try:
    new_chunks = load_and_chunk_report(path, company=company, quarter=quarter)
    add_new_chunks_and_update_index(new_chunks, model, index, documents, metadatas)
    print(f"Added new report: {path} ({company} {quarter})")
except FileNotFoundError:
    print("ERROR: File not found. Please check the filename/path.")
except ValueError as ve:
    print(f"ERROR: {ve}")






ERROR: File not found. Please check the filename/path.
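
Conceptually, a flat L2 index compares the query against every stored vector and returns the closest ones ranked by squared Euclidean distance. The pure-Python sketch below illustrates that idea on toy 2-D vectors. The function names are hypothetical, and this is not how FAISS implements the search internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k nearest vectors, closest first."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = flat_l2_search(vectors, [0.9, 0.1], k=2)
print(indices)  # [1, 0]
```

The returned indices are positions in the order the vectors were added, which is why the notebook keeps `documents` and `metadatas` in that same order.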


In [17]:
## 5. Bulk Import: Add All PDFs from a Folder to Index


import glob

pdf_folder = "./financial_reports/"
pdf_paths = glob.glob(pdf_folder + "*.pdf")

# Loop over all PDFs and add them to the semantic index
for path in pdf_paths:
    # Parse company and quarter from filename (e.g. "GS_Q3_2024.pdf" → "GS", "Q3 2024")
    try:
        fname = os.path.splitext(os.path.basename(path))[0]
        company, quarter = fname.split("_", 1)
        quarter = quarter.replace("_", " ")  # "Q3_2024" → "Q3 2024", matching the demo data format
        # Chunk, embed, and index the new report
        new_chunks = load_and_chunk_report(path, company=company, quarter=quarter)
        add_new_chunks_and_update_index(new_chunks, model, index, documents, metadatas)
        print(f"Added: {path} | {company} {quarter}")
    except Exception as e:
        print(f"ERROR with {path}: {e}")
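
The filename convention mentioned in the loop above (`GS_Q3_2024.pdf` mapping to `"GS"` and `"Q3 2024"`) can be isolated and tested on its own. The sketch below is one possible parser, assuming every file follows the `COMPANY_Qn_YYYY` pattern. `parse_report_filename` is a hypothetical name, not an existing function in the notebook.

```python
import os

def parse_report_filename(path):
    """Parse '.../GS_Q3_2024.pdf' into ('GS', 'Q3 2024').

    Raises ValueError when the stem has no underscore-separated quarter part.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    company, sep, rest = stem.partition("_")
    if not sep or not company or not rest:
        raise ValueError(f"Expected COMPANY_Qn_YYYY, got: {stem!r}")
    # Replace remaining underscores so the quarter matches the 'Q3 2024' format
    return company, rest.replace("_", " ")

parsed = parse_report_filename("./financial_reports/GS_Q3_2024.pdf")
print(parsed)  # ('GS', 'Q3 2024')
```

Keeping the quarter in the same `"Q3 2024"` form as the demo data matters because the search filter compares quarter strings for exact equality.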



In [19]:

## 6. Data Coverage: How Many Chunks Per Company/Quarter?


print("Loaded documents by company & quarter:")
from collections import Counter

# Collect (company, quarter) pairs for all loaded document chunks
meta_tuples = [(m['company'], m['quarter']) for m in metadatas]

# Print a count of chunks for each (company, quarter)
for k, v in Counter(meta_tuples).items():
    print(f"{k}: {v} chunks")



Loaded documents by company & quarter:
('GS', 'Q1 2024'): 3 chunks
('GS', 'Q2 2024'): 3 chunks
('JPM', 'Q1 2024'): 3 chunks
('JPM', 'Q2 2024'): 3 chunks
('GS', 'Q3 2024'): 6 chunks
('JPM', 'Q3 2024'): 7 chunks


In [21]:
## 7. Helper: Add New Chunks and Update the Search Index



def add_new_chunks_and_update_index(
    new_chunks, 
    model, 
    index, 
    documents, 
    metadatas
):
    """
    Incrementally add new chunks to FAISS index, documents, and metadatas.
    - new_chunks: List of dicts, each representing a chunk of a new financial report.
    - model: SentenceTransformer model for embedding the new chunks.
    - index: FAISS vector index (in-place modification).
    - documents: List of all document texts (appended to).
    - metadatas: List of all document metadata (appended to).
    """
    # Extract text and metadata from each new chunk
    new_docs = [chunk["content"] for chunk in new_chunks]
    new_metas = [
        {
            "company": chunk.get("company", ""),
            "quarter": chunk.get("quarter", ""),
            "business_unit": chunk.get("business_unit", ""),
            "section": chunk.get("section", ""),
            "tags": chunk.get("tags", []),
            "data_type": chunk.get("data_type", ""),
            "source": chunk.get("source", ""),
            "page": chunk.get("page", None),
        }
        for chunk in new_chunks
    ]
    if new_docs:
        # Compute embeddings and add to the FAISS index
        new_embs = model.encode(new_docs, show_progress_bar=False)
        index.add(np.array(new_embs, dtype="float32"))
        # Extend the documents and metadata lists in-place
        documents.extend(new_docs)
        metadatas.extend(new_metas)
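
Because search results are mapped back to `documents` and `metadatas` by position, the three containers have to stay the same length after every incremental add. The check below is a small sketch of that invariant; `check_alignment` is a hypothetical helper, and the toy lists stand in for the notebook's real data.

```python
def check_alignment(index_size, documents, metadatas):
    """Return True when the vector count matches both parallel lists."""
    return index_size == len(documents) == len(metadatas)


# Toy data standing in for the notebook's lists
docs = ["Net Income: $4.8B", "IB Fees: $2.0B"]
metas = [{"company": "JPM"}, {"company": "JPM"}]

# In the notebook this would be: check_alignment(index.ntotal, documents, metadatas)
print(check_alignment(2, docs, metas))
```

Running a check like this after each bulk import is one way to catch a partially failed add before it silently shifts search results onto the wrong metadata.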


In [23]:
## 8. Semantic Search: Query Top-K Chunks


def query_top_k(
    query, index, documents, metadatas, k=3, 
    company=None, quarter=None, 
    show_all_metadata=False,
    print_no_match_warning=True
):
    """
    Run a semantic search on the FAISS index to find the top-k most relevant chunks.
    Optionally filter results by company and/or quarter.
    Prints and returns the top results (document, metadata).
    """
    # Encode the user query as an embedding
    q_emb = model.encode([query])
    # Perform nearest neighbor search (get k*5 for filtering)
    D, I = index.search(np.array(q_emb, dtype='float32'), k*5)
    results = []
    for idx in I[0]:
        if idx < 0 or idx >= len(documents): continue  # skip out-of-bounds
        meta = metadatas[idx]
        # Filter by company/quarter if specified
        if (company and meta.get('company') != company): continue
        if (quarter and meta.get('quarter') != quarter): continue
        results.append((documents[idx], meta))
        if len(results) >= k:
            break
    # If no results, print a helpful warning and fallback to top unfiltered match
    if not results and print_no_match_warning:
        print("No matching results found for your filters.\nShowing top unfiltered semantic match:")
        idx = I[0][0]
        if idx >= 0 and idx < len(documents):
            doc, meta = documents[idx], metadatas[idx]
            print("🔹 Match:", doc)
            print("🗂️ Metadata:", meta if show_all_metadata else f"[{meta.get('company')} - {meta.get('quarter')} - {meta.get('business_unit', '')}]")
            print("----------")
        return []
    # Print all found results (either full metadata or compact view)
    for doc, meta in results:
        print("🔹 Match:", doc)
        print("🗂️ Metadata:", meta if show_all_metadata else f"[{meta.get('company')} - {meta.get('quarter')} - {meta.get('business_unit', '')}]")
        print("----------")
    return results
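
The search above over-fetches (`k*5` neighbours) and then filters by metadata. The filtering step on its own can be sketched in pure Python, with a hand-written ranking in place of the FAISS output. `filter_ranked` and the toy data are illustrative names, not part of the notebook.

```python
def filter_ranked(ranked_ids, metadatas, k, company=None, quarter=None):
    """Keep the first k ranked ids whose metadata passes the filters."""
    kept = []
    for idx in ranked_ids:
        if idx < 0 or idx >= len(metadatas):
            continue  # FAISS pads missing neighbours with -1
        meta = metadatas[idx]
        if company and meta.get("company") != company:
            continue
        if quarter and meta.get("quarter") != quarter:
            continue
        kept.append(idx)
        if len(kept) >= k:
            break
    return kept


toy_metas = [
    {"company": "GS", "quarter": "Q1 2024"},
    {"company": "JPM", "quarter": "Q1 2024"},
    {"company": "JPM", "quarter": "Q2 2024"},
    {"company": "GS", "quarter": "Q2 2024"},
]
ranked = [0, 3, 1, -1, 2]  # best match first, as a search would return them
print(filter_ranked(ranked, toy_metas, k=2, company="JPM"))
```

One consequence of this design: when a filter is very selective, the over-fetched candidates may contain fewer than `k` matches (or none), so a larger multiplier or a per-company index may be worth considering as the corpus grows.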





In [25]:

## 9. Summarization with Local LLM (Ollama)


import requests

def summarize_with_ollama(chunks, model='gemma3:latest', prompt_style='complex'):
    """
    Sends the top-k retrieved document chunks to a local Ollama LLM instance for summarization.

    Args:
        chunks (list): List of string chunks to summarize.
        model (str): Model name for Ollama (default: 'gemma3:latest').
        prompt_style (str): Not currently used, placeholder for prompt customization.

    Returns:
        str or None: The summary returned by the LLM, or None if there is an error.
    """
    # Build the summarization prompt
    user_prompt = (
        "Summarize the following financial report excerpts in plain language, "
        "focusing on key numbers, business drivers, and trends.\n\n"
    )
    for i, chunk in enumerate(chunks):
        user_prompt += f"---\nChunk {i+1}:\n{chunk}\n"
    user_prompt += "\nSummary:"

    # Send request to Ollama LLM API
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": user_prompt,
            "stream": False,
        },
        timeout=90,
    )
    out = response.json()
    if "response" in out:
        return out["response"]
    else:
        print("Ollama API did not return a summary. Full API output below:")
        print(out)
        return None



In [27]:

## 10. Prompt Templates: Standalone Summary Prompts


def build_standalone_prompts(chunks):
    """
    Given a list of chunks (text excerpts), returns a dict of prompt templates for standalone (single-report) summarization.
    Supports three styles: 'basic', 'complex', and 'upgraded'.
    """
    # Combine all chunks into a formatted block of text for the LLM prompt
    text = "\n".join([f"---\nChunk {i+1}:\n{chunk}" for i, chunk in enumerate(chunks)])
    
    # Define various prompt templates for different levels of summarization detail
    return {
        "basic": (
            # Very concise, plain English summary, 3-5 bullet points, max 25 words each
            "Summarize this financial report in plain English. Highlight 3–5 key points. "
            "Use bullet points. Keep it simple and under 25 words per point.\n"
            "\nText to summarize:\n"
            f"{text}\n"
        ),
        "complex": (
            # For more nuanced, detailed, and accessible summaries, tailored for non-experts
            "🧠 Complex Prompt – Standalone Summary\n"
            "You are a veteran financial analyst with decades of experience simplifying complex financial reports for general audiences with no background in finance.\n"
            "Your goal is to clearly summarize the key takeaways from a single financial document. Focus on: what happened, why it matters, and what it means for the business.\n"
            "Instructions:\n"
            "Identify the 3–5 most important insights\n"
            "Use plain, simple English (roughly 9th-grade reading level)\n"
            "Use bullet points; max 25 words per sentence\n"
            "Define any technical terms or numbers in plain language\n"
            "Remain factual and neutral — avoid hype, filler, or speculation\n"
            "Optional Enhancements:\n"
            "If requested, adopt a casual tone while staying accurate\n"
            "If requested, expand bullet points into short, readable paragraphs\n"
            "Briefly define technical terms (e.g., 'ROE = return on equity')\n"
            "If the content is vague or unclear, say so directly\n"
            "Output Format:\n"
            "Title: Plain English Summary\n"
            "Use up to 5 bullet points\n"
            "Format: [Key Fact]: [Plain English explanation of its meaning or impact]\n"
            " Example: “Net income rose 10% YoY: The company made more profit compared to the same quarter last year.”\n"
            "Each bullet: Max 2 sentences, 25 words per sentence\n"
            "If not enough meaningful info: \"Not enough clear information to summarize.\"\n"
            "Input:\n"
            f"{text}\n"
        ),
        "upgraded": (
            # For expert-level, actionable summaries—focus on business significance and drivers
            "You are a veteran Wall Street analyst who explains earnings results to busy investors, journalists, and junior bankers — using clear, plain language.\n"
            "Your task is to summarize the 3–5 most meaningful insights from a **single quarterly earnings report**.\n"
            "Go beyond surface-level numbers — focus on:\n"
            "- What changed in the business\n"
            "- Why those changes happened (management explanation, segment performance, macro context)\n"
            "- What that means for the company's trajectory\n"
            "Instructions:\n"
            "- Use plain English (9th-grade level)\n"
            "- Each bullet max 2 sentences, 25 words per sentence\n"
            "- Define technical terms (e.g., “ROE = Return on Equity”)\n"
            "- Avoid hype, exaggeration, filler, or repetition\n"
            "- If data is unclear or vague, say so directly\n"
            "Output Format:\n"
            "**Plain English Summary**\n"
            "- [What changed]: [Explanation of the change and its significance]\n"
            "- Example: “ROE fell to 10%: This means the company earned less profit for every dollar of equity — possibly due to rising costs or slower growth.”\n"
            "If not enough clear data: write  “Not enough meaningful information to summarize.”\n"
            "\nText to summarize:\n"
            f"{text}\n"
        )
    }
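
The templates above ask the model for at most 5 bullets and at most 25 words per sentence, but nothing checks whether the output complies. The sketch below is one possible post-check; `check_bullets` is a hypothetical helper, and its sentence splitting is deliberately naive (it splits on `.`, `!` or `?` followed by whitespace).

```python
import re


def check_bullets(summary, max_bullets=5, max_words=25):
    """Return a list of format problems; an empty list means the limits held."""
    bullets = []
    for line in summary.splitlines():
        # A bullet is a marker followed by whitespace, so '**Title**' is not counted
        match = re.match(r"\s*[-*•]\s+(.*)", line)
        if match:
            bullets.append(match.group(1))
    problems = []
    if len(bullets) > max_bullets:
        problems.append(f"{len(bullets)} bullets (limit {max_bullets})")
    for n, bullet in enumerate(bullets, start=1):
        for sentence in re.split(r"(?<=[.!?])\s+", bullet):
            words = len(sentence.split())
            if words > max_words:
                problems.append(f"bullet {n}: sentence has {words} words")
    return problems


sample = (
    "**Plain English Summary**\n"
    "- Net income rose 10% YoY: The company made more profit.\n"
    "- " + " ".join(["word"] * 30) + "."
)
print(check_bullets(sample))
```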





In [29]:
## 11. Prompt Templates: Pairwise (Quarter-over-Quarter) Summary Prompts


def build_pairwise_prompts(period1, period2):
    """
    Given two period summaries, returns a dict of prompt templates for pairwise (Q-over-Q) comparison.
    Supports 'basic', 'complex', and 'upgraded' summary styles.
    """
    # Each template compares two quarters of financial report data, focusing on changes, drivers, and business meaning
    return {
        "basic": (
            # Simple, no-jargon: Compare key changes, 3–5 bullet points, each <25 words
            "Compare these two financial reports in plain English. Focus on key changes from Period 1 to Period 2. "
            "Highlight 3–5 changes using bullet points. Keep each point under 25 words. Make it easy to understand — no jargon.\n"
            "\nPeriod 1:\n"
            f"{period1}\n"
            "\nPeriod 2:\n"
            f"{period2}\n"
        ),
        "complex": (
            # Detailed, accessible: Summarize/compare changes, trends, explanations; with clear formatting and instructions
            "✅ Complex Prompt - Pairwise Comparison:\n"
            "You are a veteran financial analyst with decades of experience simplifying complex financial reports for general audiences with no background in finance.\n"
            "Your goal is to clearly summarize or compare key takeaways from one or more financial documents. Focus on: what happened, why it matters, and what changed.\n"
            "Instructions:\n"
            "- Identify the 3–5 most important insights\n"
            "- Use plain, simple English (roughly 9th-grade reading level)\n"
            "- Use bullet points; max 25 words per sentence\n"
            "- Define any technical terms or numbers in plain language\n"
            "- Remain factual and neutral — avoid hype, filler, or speculation\n"
            "If two periods are provided:\n"
            "- Focus on what improved, declined, or stayed the same\n"
            "- Highlight meaningful trends or shifts across periods (e.g., revenue, profit, credit costs)\n"
            "- Use consistent structure and terms to improve clarity\n"
            "Optional Enhancements:\n"
            "- If requested, adopt a casual tone while staying accurate\n"
            "- If requested, expand bullet points into short, readable paragraphs\n"
            "- Briefly define technical terms (e.g., 'CET1 = core capital ratio')\n"
            "- If the content is vague or unclear, say so directly\n"
            "Output Format:\n"
            "- Title: Plain English Summary (or Comparison Summary if applicable)\n"
            "- Use up to 5 bullet points\n"
            "- Format: [Key Fact or Change]: [Plain English explanation of its meaning or impact]\n"
            "  Example: 'Revenue rose 4% QoQ: The company earned more money this quarter than last.'\n"
            "- Each bullet: Max 2 sentences, 25 words per sentence\n"
            "- If not enough meaningful info: 'Not enough clear information to summarize.'\n"
            "Input:\n"
            "Period 1:\n"
            f"{period1}\n"
            "Period 2:\n"
            f"{period2}\n"
        ),
        "upgraded": (
            # For expert audiences: Focus on what changed and *why*, using business and management drivers.
            "You are a senior Wall Street analyst who compares earnings across quarters to brief investors, bankers, and relationship managers.\n"
            "Your job is to **compare two quarters of financial data** and identify the 3–5 most meaningful changes — not just what changed, but why, and what it means.\n"
            "Go beyond surface-level stats. Focus on strategic metrics (e.g., net income, ROE, IB fees, credit costs), management commentary, and any inflection points.\n"
            "Instructions:\n"
            "- Use clear, plain English (9th-grade reading level)\n"
            "- Focus on what improved, declined, or stayed flat\n"
            "- Prioritize material changes (not noise)\n"
            "- Explain *why* the change occurred, using management notes or segment trends\n"
            "- Summarize the business impact clearly\n"
            "- Each bullet: max 2 sentences, 25 words per sentence\n"
            "- Define any technical terms (e.g., ROE = return on equity)\n"
            "- If input is vague or lacks clarity, say so directly\n"
            "Output Format:\n"
            "**Quarter-over-Quarter Comparison Summary**\n"
            "- [Change Detected]: [Causal explanation or management attribution]. [What this implies for the business.]\n"
            "  Example: “Revenue dropped 10% QoQ: Management attributed this to weaker investment banking activity. This suggests soft demand in capital markets.”\n"
            "If not enough data:  “Not enough clear information to summarize.”\n"
            "Optional Enhancements:\n"
            "- If requested, output in a more casual tone for newsletters or CRM use\n"
            "- If requested, expand each bullet into a short paragraph (max 60 words)\n"
            "\nPeriod 1 (Earlier):\n"
            f"{period1}\n"
            "\nPeriod 2 (Later):\n"
            f"{period2}\n"
        )
    }
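
The pairwise templates treat `period1` as the earlier quarter and `period2` as the later one, so the caller has to pass them in chronological order. A small sort key makes that explicit. This is a sketch under the assumption that quarter labels always look like `'Q3 2024'`; `quarter_key` is a hypothetical helper.

```python
import re


def quarter_key(label):
    """Map 'Q3 2024' to (2024, 3) so labels sort chronologically."""
    match = re.fullmatch(r"\s*Q([1-4])\s+(\d{4})\s*", label)
    if match is None:
        raise ValueError(f"Unrecognised quarter label: {label!r}")
    return int(match.group(2)), int(match.group(1))


earlier, later = sorted(["Q1 2025", "Q4 2024"], key=quarter_key)
print(earlier, "->", later)
```

Sorting the labels as plain strings would put `Q1 2025` before `Q4 2024`, which is why the key sorts by year first and quarter second.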



In [31]:
## 12. Prompt Templates: Multi-Quarter (Trend) Summary Prompts


def build_multiquarter_prompts(periods):
    """
    Given a list of period summaries (e.g., for Q1–Q4), return a dictionary of multi-quarter trend prompt templates.
    Supports 'basic', 'complex', and 'upgraded' summary styles.
    Each prompt guides the LLM to analyze and explain changes and patterns across multiple quarters.
    """
    # Format all input periods as Q1:, Q2:, etc, for prompt readability
    period_text = "".join([f"Q{i+1}:\n{period}\n\n" for i, period in enumerate(periods)])
    return {
        "basic": (
            # Simple version: Summarize trends across quarters, focus on 3–5 key changes, bullet points, easy English
            "Summarize the trends across the quarters provided below in plain English. "
            "Identify 3–5 key changes or patterns over time. Use bullet points, max 25 words each. Highlight what improved, declined, or stayed steady.\n"
            "\n"
            f"{period_text}"
        ),
        "complex": (
            # More detailed: Emphasizes clear, jargon-free review of multi-quarter shifts, management context, and inconsistencies
            "📊 Complex Prompt – Multi-Quarter Comparison (Q1–Q4)\n"
            "You are a veteran financial analyst with decades of experience simplifying complex financial reports for general audiences with no background in finance.\n"
            "Your goal is to clearly summarize the key trends and shifts across multiple quarters of financial data. Focus on: what changed, why it matters, and what patterns emerged.\n"
            "Instructions:\n"
            "Identify 3–5 key trends across the quarters (Q1–Q4)\n"
            "Use plain English (around a 9th-grade level)\n"
            "Use bullet points; max 25 words per sentence\n"
            "Define any technical terms or key numbers\n"
            "Focus on improvements, declines, or steady performance across the year\n"
            "Avoid hype, exaggeration, or technical jargon\n"
            "Optional Enhancements:\n"
            "If requested, include a table of metric changes across quarters\n"
            "If requested, explain trends in context (e.g., economic conditions, company strategy)\n"
            "Flag any inconsistencies or missing data\n"
            "Output Format:\n"
            "Title: Plain English Year-in-Review Summary (or similar)\n"
            "Use up to 5 bullet points\n"
            "Format: [Metric or Trend]: [Explanation in plain English]\n"
            " Example: “Profit improved each quarter: Net income rose steadily from Q1 to Q4, showing consistent performance throughout the year.”\n"
            "Each bullet: Max 2 sentences, 25 words per sentence\n"
            "If not enough meaningful info: \"Not enough clear information to summarize.\"\n"
            "Input:\n"
            f"{period_text}"
        ),
        "upgraded": (
            # Most advanced: For expert audiences, emphasizes not just what, but why and business meaning.
            "You are a senior financial analyst preparing a year-in-review briefing for investors and banking clients.\n"
            "Your task is to review earnings summaries across **multiple quarters** and identify 3–5 key trends: not just what changed, but *why*, and *what it means*.\n"
            "Focus on directional movement in metrics (revenue, ROE, credit costs, etc.), consistent patterns, inflection points, and management commentary that explains cause and effect.\n"
            "Instructions:\n"
            "- Use plain, clear English (around 9th-grade level)\n"
            "- Prioritize material trends: major increases, declines, or stable patterns\n"
            "- Use max 2 sentences per bullet, 25 words per sentence\n"
            "- For each trend, explain:\n"
            "   1. What changed (and when)\n"
            "   2. Why it changed (if explanation exists)\n"
            "   3. What it suggests about business direction\n"
            "- Define any technical terms (e.g., ROE = return on equity)\n"
            "- If info is unclear or inconsistent, flag it\n"
            "- Don’t include generic facts — only strategic trends\n"
            "Output Format:\n"
            "**Plain English Year-in-Review Summary**\n"
            "- [Metric or Trend]: [Explanation of change, cause, and implication]\n"
            "  Example: “Credit costs fell every quarter: Management said lower card losses and better wholesale recoveries helped, signaling stronger underwriting discipline.”\n"
            "If not enough info:  “Not enough meaningful information to summarize.”\n"
            "Optional:\n"
            "- If requested, include a simple table showing how each metric evolved over the quarters provided\n"
            "- If requested, explain trends in macro context (e.g., rate hikes, M&A rebound)\n"
            "\n"
            f"{period_text}"
        )
    }
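
`period_text` above labels the inputs by position (`Q1:`, `Q2:`, ...), which can mislabel the data when the analysis starts at another quarter or spans two years. One alternative is to pass the real quarter labels alongside the summaries. This is a sketch only: `format_periods` is a hypothetical helper, not something the notebook defines.

```python
def format_periods(labels, periods):
    """Pair each period summary with its quarter label, e.g. 'Q4 2024:'."""
    if len(labels) != len(periods):
        raise ValueError("labels and periods must have the same length")
    return "".join(f"{label}:\n{text}\n\n" for label, text in zip(labels, periods))


print(format_periods(["Q4 2024", "Q1 2025"], ["Revenue: $41B", "Revenue: $45B"]))
```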



In [33]:

## 13. Run Local LLM Summarization via Ollama

def summarize_with_ollama(
    chunks, 
    prompt_type="standalone",   # "standalone", "pairwise", or "multiquarter"
    prompt_variant="upgraded",  # "basic", "complex", or "upgraded"
    model="gemma3:latest"
):
    """
    Call a local LLM (via Ollama API) to generate a summary for retrieved financial text chunks.
    - Supports 'standalone', 'pairwise', and 'multiquarter' prompt types.
    - User can select the summary prompt style/complexity (prompt_variant).
    - Robust error handling to avoid crashes on API/network problems.
    """
    # Build the correct prompt template based on summary type and variant
    if prompt_type == "standalone":
        prompt_dict = build_standalone_prompts(chunks)
    elif prompt_type == "pairwise":
        if len(chunks) != 2:
            print("ERROR: Pairwise summary needs two period summaries.")
            return None
        prompt_dict = build_pairwise_prompts(chunks[0], chunks[1])
    elif prompt_type == "multiquarter":
        prompt_dict = build_multiquarter_prompts(chunks)
    else:
        print(f"ERROR: Unknown prompt_type: {prompt_type}")
        return None

    user_prompt = prompt_dict.get(prompt_variant)
    if not user_prompt:
        print(f"ERROR: Prompt variant '{prompt_variant}' not found for type '{prompt_type}'.")
        return None

    # Call Ollama API for summarization
    try:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": user_prompt,
                "stream": False,
            },
            timeout=90,
        )
        out = response.json()
    except requests.ConnectionError:
        print("ERROR: Could not connect to Ollama. Is the server running?")
        return None
    except requests.Timeout:
        print("ERROR: Request to Ollama timed out.")
        return None
    except Exception as e:
        print(f"ERROR: Unexpected problem communicating with Ollama: {e}")
        return None

    # Check for API or data errors
    if "error" in out:
        print("Ollama API error:", out["error"])
        return None

    if "response" not in out or not out["response"].strip():
        print("No meaningful info to summarize from LLM.")
        return None

    return out["response"]
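
The response checks at the end of the function above can be pulled into a pure function, which makes them testable without a running Ollama server. The sketch assumes only what the notebook already relies on: a decoded JSON object with either an `error` field or a `response` field. `extract_summary` is a hypothetical name.

```python
def extract_summary(payload):
    """Return (summary, error) from a decoded /api/generate response."""
    if not isinstance(payload, dict):
        return None, "response was not a JSON object"
    if "error" in payload:
        return None, str(payload["error"])
    text = payload.get("response")
    if not isinstance(text, str) or not text.strip():
        return None, "empty response"
    return text.strip(), None


print(extract_summary({"response": "  Revenue rose 4% QoQ.  "}))
print(extract_summary({"error": "model not found"}))
```

Returning the error as a value instead of printing it lets the caller decide whether to log it, retry, or show it to the user.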


In [35]:


## 14. Retrieve & Summarize Workflow




# This function retrieves relevant report chunks (or uses prebuilt period summaries
# for pairwise/multiquarter mode), displays the matching context, and then sends the
# results to the local LLM (Ollama) for summarization. Supports all summary types
# and error handling.



def retrieve_and_summarize(
    user_query,
    index,
    documents,
    metadatas,
    k=3,
    company=None,
    quarter=None,
    prompt_type="standalone",
    prompt_variant="upgraded",
    model="gemma3:latest",
    print_chunks=True,
    show_all_metadata=True,
    extra_period_summaries=None
):
    """
    Full workflow: retrieve top-k chunks, show context, then summarize with local LLM.
    Enhanced for pairwise/multiquarter to ensure correct chunk selection.
    """
    
    # --- 1. Use already-prepared summaries if provided (e.g. for pairwise/multiquarter) ---
    if extra_period_summaries is not None:
        # Use summaries already created for each period (skips search)
        topk_docs = extra_period_summaries
        print(f"INFO: Using provided summaries for {prompt_type} mode ({len(topk_docs)} period(s)).")
        results = [("<<SUMMARY-ONLY: See below>>", {}) for _ in topk_docs]  # Dummy metadata/context
    else:
        # --- 2. Otherwise, perform retrieval using semantic search ---
        if prompt_type in ["pairwise", "multiquarter"]:
            # These modes expect pre-built period summaries. Without them there is
            # nothing to retrieve here, and `results` would be undefined below.
            print(f"ERROR: '{prompt_type}' mode requires extra_period_summaries.")
            return None
        else:
            # Single-period/standalone mode: semantic search for top-k results
            results = query_top_k(
                user_query,
                index,
                documents,
                metadatas,
                k=k,
                company=company,
                quarter=quarter,
                show_all_metadata=show_all_metadata,
                print_no_match_warning=True
            )
            if not results:
                print("No retrieval results, cannot summarize.")
                return
        topk_docs = [doc for doc, meta in results]

    # --- 3. Print the matched context/chunks for transparency/debugging ---
    if print_chunks:
        print("\n--- MATCHED CHUNKS (context) ---")
        for i, (doc, meta) in enumerate(results):
            print(f"\nChunk {i+1}:")
            print(doc)
            print("Meta:", meta)
        print("\n------------------------\n")

    # --- 4. Send to LLM for summarization, then print/return ---
    summary = summarize_with_ollama(
        topk_docs,
        prompt_type=prompt_type,
        prompt_variant=prompt_variant,
        model=model
    )
    print("\n========== SUMMARY FROM LLM ==========\n")
    print(summary)
    print("\n======================================\n")
    return summary
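
`retrieve_and_summarize` accepts `extra_period_summaries` for the pairwise and multi-quarter modes, but this part of the notebook does not show how that list is built. One possible shape, sketched with injected callables so it runs without FAISS or Ollama: `build_period_summaries`, `retrieve` and `summarize` are all hypothetical names.

```python
def build_period_summaries(quarters, retrieve, summarize):
    """Return (summaries, skipped): one summary per quarter that has data.

    retrieve(quarter) -> list of chunk strings
    summarize(chunks) -> summary string, or None on failure
    """
    summaries, skipped = [], []
    for quarter in quarters:
        chunks = retrieve(quarter)
        summary = summarize(chunks) if chunks else None
        if summary:
            summaries.append(summary)
        else:
            skipped.append(quarter)
    return summaries, skipped


# Toy stand-ins for semantic search and the LLM call
toy_chunks = {"Q1 2024": ["Net Income: $4.8B"], "Q2 2024": ["Net Income: $5.1B"]}
summaries, skipped = build_period_summaries(
    ["Q1 2024", "Q2 2024", "Q3 2024"],
    retrieve=lambda q: toy_chunks.get(q, []),
    summarize=lambda chunks: " | ".join(chunks),
)
print(summaries, skipped)
```

In the notebook, `retrieve` could wrap `query_top_k` with the quarter filter and `summarize` could wrap `summarize_with_ollama` in standalone mode. Reporting skipped quarters separately avoids sending a trend prompt with silently missing periods.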












In [37]:
## 15. Example: Run Standalone Summary for a Specific Query

# Example: Standalone summary of top 3 results for Q1 2024 net income (JPM)

retrieve_and_summarize(
    user_query="Q1 2024 net income",     # The question or topic you want to analyze
    index=index,
    documents=documents,
    metadatas=metadatas,
    k=3,                                 # Number of chunks to retrieve (top 3 matches)
    company="JPM",                       # Filter by company (optional)
    quarter="Q1 2024",                   # Filter by quarter (optional)
    prompt_type="standalone",            # Summary type: "standalone", "pairwise", or "multiquarter"
    prompt_variant="upgraded",           # Prompt style: "basic", "complex", or "upgraded"
    model="gemma3:latest",               # LLM model (local, via Ollama)
    print_chunks=True,                   # Print the matched chunks/context (for transparency)
    show_all_metadata=True               # Show detailed metadata in output
)




🔹 Match: ROE: 35%
Net Income: $4.8B
Revenue: $17.7B
Banking & Wealth Mgmt: $10.3B
Home Lending: $1.2B
Card Services & Auto: $6.1B
Credit Losses: $1.9B
Card Net Charge-Off Rate: 3.32%
Mobile Customers: ↑7% YoY
Card Volume: ↑9% YoY
🗂️ Metadata: {'company': 'JPM', 'quarter': 'Q1 2024', 'business_unit': 'Consumer & Community Banking', 'section': 'Segment Performance', 'tags': ['JPM', 'Q1 2024', 'Consumer', 'Revenue', 'Net Income', 'Mobile', 'Card'], 'data_type': 'segment_performance', 'source': 'Earnings Release', 'page': None}
----------
🔹 Match: ROE: 18%
Net Income: $4.8B
Revenue: $13.6B
IB Fees: $2.0B
Payments: $2.4B
Markets Revenue: $8.0B
Fixed Income: $5.3B
Equity Markets: $2.7B
Credit Losses: $32M
🗂️ Metadata: {'company': 'JPM', 'quarter': 'Q1 2024', 'business_unit': 'Corporate & Investment Bank', 'section': 'Segment Performance', 'tags': ['JPM', 'Q1 2024', 'Investment Bank', 'Markets', 'IB Fees', 'Fixed Income'], 'data_type': 'segment_performance', 'source': 'Earnings Release', 'pag

'**Plain English Summary**\n\n*   **Strong Overall Performance, But Declining Returns:** The bank reported a total net income of $13.4 billion, a solid number. However, the Return on Equity (ROE) dropped to 17%, meaning the bank’s profits are less for every dollar of shareholder investment.\n\n*   **Card Services Driving Growth:** Card Services & Auto revenue jumped 9% year-over-year to $6.1 billion. This growth was fueled by increased card volume and a higher percentage of customers using mobile banking. \n\n*   **Markets Division is Key:** The Markets division performed exceptionally well, generating $8.0 billion in revenue and holding $42.5 billion in assets under management. This reflects strong trading activity and investment management fees.\n\n*   **Credit Losses Increasing:** The bank took a significant hit to its income due to $1.9 billion in credit losses – this includes $2.0 billion in charge-offs and $72 million in a release of reserves. This indicates a potential rise in l

In [39]:
## 16. One-Step Financial Analysis Pipeline (`analyze_report`)


def analyze_report(
    user_query,
    company=None,
    quarter=None,
    k=3,
    prompt_type="standalone",
    model="gemma3:latest",
    show_context=True,
    show_metadata=True,
    save_output=False,
    output_path=None
):
    """
    Unified pipeline: semantic retrieval + summary, optional save. Use in scripts, CLI, or as UI backend.
    """
    # 1. Retrieve top-k relevant chunks for the user's query
    results = query_top_k(
        user_query,
        index,
        documents,
        metadatas,
        k=k,
        company=company,
        quarter=quarter,
        show_all_metadata=show_metadata,
        print_no_match_warning=True
    )
    if not results:
        print("No retrieval results, cannot summarize.")
        return None

    # 2. Optionally show the matched context and metadata
    if show_context:
        print("\n--- MATCHED CHUNKS (context) ---")
        for i, (doc, meta) in enumerate(results):
            print(f"\nChunk {i+1} (Page {meta.get('page', '-')}, Section: {meta.get('section','-')}):")
            print(doc)
            print("Meta:", meta)
        print("\n------------------------\n")

    # 3. Summarize with the local LLM (Ollama)
    topk_docs = [doc for doc, meta in results]
    summary = summarize_with_ollama(
        topk_docs,
        prompt_type=prompt_type,
        model=model
    )
    print("\n========== SUMMARY FROM LLM ==========\n")
    print(summary)
    print("\n======================================\n")

    # 4. Optionally save output as JSON
    if save_output:
        output_data = {
            "query": user_query,
            "company": company,
            "quarter": quarter,
            "matched_chunks": [
                {"content": doc, "metadata": meta}
                for doc, meta in results
            ],
            "summary": summary,
        }
        if not output_path:
            # Auto-name by query and date
            import datetime
            import re  # used by the filename sanitization on the next lines
            ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
            safe_query = re.sub(r'\W+', '_', user_query)[:40]
            output_path = f"analysis_{safe_query}_{ts}.json"
        import json
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(output_data, f, indent=2, ensure_ascii=False)
        print(f"Summary & context saved to: {output_path}")

    return summary
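
The auto-naming step in `analyze_report` can be factored into a helper that takes the timestamp as an argument, which makes the result reproducible in tests. This is a sketch with a hypothetical name (`make_output_path`). Unlike the inline version, it also strips leading and trailing underscores and falls back to `query` when nothing usable is left.

```python
import datetime
import re


def make_output_path(query, now=None):
    """Build a filesystem-safe JSON filename from a query and a timestamp."""
    now = now or datetime.datetime.now()
    # Collapse every run of non-word characters into a single underscore
    safe_query = re.sub(r"\W+", "_", query).strip("_")[:40] or "query"
    return f"analysis_{safe_query}_{now:%Y%m%d_%H%M%S}.json"


fixed = datetime.datetime(2025, 7, 22, 9, 30, 0)
print(make_output_path("Q1 2024 net income", fixed))
```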



In [53]:

## 17. Interactive CLI: Run a Custom Financial Report Query

# This cell lets you analyze earnings with flexible options:
# - Standalone (single quarter)
# - Pairwise (compare two quarters)
# - Multiquarter (spot trends over 3+ quarters)
#
# Instructions:
# Type your query (e.g., “Q2 2024 net income for GS”), select company and prompt style, then enter the relevant quarters.


# Interactive CLI for analyzing financial results
print("Financial Report Analyzer\n")
print("Type your query below (e.g. 'Q1 2024 net income', 'Q2 2024 revenue for GS', 'credit costs for JPM')")

user_query = input("Query: ").strip()
company = input("Company (optional, e.g. 'JPM' or 'GS', leave blank if any): ").strip() or None

# --- Prompt style: standalone / pairwise / multiquarter ---
print("\nPrompt style? ('standalone', 'pairwise', 'multiquarter') — default is 'standalone'")
prompt_type = input("Prompt type: ").strip() or "standalone"
valid_prompt_types = {"standalone", "pairwise", "multiquarter"}
if prompt_type not in valid_prompt_types:
    print(f"ERROR: '{prompt_type}' is not a valid prompt style. Choose from: {', '.join(valid_prompt_types)}")
    raise SystemExit()  # or return

# --- Dynamically prompt for quarters based on mode ---
if prompt_type == "standalone":
    quarter = input("Quarter (optional, e.g. 'Q1 2024', leave blank if any): ").strip() or None
elif prompt_type == "pairwise":
    print("Enter two quarters to compare, separated by comma (e.g. 'Q1 2024, Q2 2024'):")
    quarters = input("Quarters: ").split(",")
    quarters = [q.strip() for q in quarters if q.strip()]
    if len(quarters) != 2:
        print("ERROR: Please enter exactly TWO quarters for pairwise mode.")
        raise SystemExit()  # stop here: the validation step cannot iterate over quarters=None
elif prompt_type == "multiquarter":
    print("Enter ALL quarters to analyze, separated by comma (e.g. 'Q1 2024, Q2 2024, Q3 2024'):")
    quarters = input("Quarters: ").split(",")
    quarters = [q.strip() for q in quarters if q.strip()]
    if len(quarters) < 3:
        print("ERROR: Please enter at least THREE quarters for multi-quarter mode.")
        raise SystemExit()  # stop here: the validation step cannot iterate over quarters=None
else:
    quarter = None
    quarters = None

print("\nRunning analysis...")





Financial Report Analyzer

Type your query below (e.g. 'Q1 2024 net income', 'Q2 2024 revenue for GS', 'credit costs for JPM')
Query:  JPM
Company (optional, e.g. 'JPM' or 'GS', leave blank if any):  JPM

Prompt style? ('standalone', 'pairwise', 'multiquarter') — default is 'standalone'
Prompt type:  multiquarter
Enter ALL quarters to analyze, separated by comma (e.g. 'Q1 2024, Q2 2024, Q3 2024'):
Quarters:  Q1 2024, Q2 2024, Q3 2024

Running analysis...
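
The input cell above accepts quarter labels as free text, so a typo such as `q1-2024` only surfaces later as "no data found". A minimal sketch of stricter parsing is shown below. `parse_quarters` and `QUARTER_RE` are hypothetical names that are not part of the notebook, and the accepted spellings are an assumption about what users might type.

```python
import re

# Hypothetical helper, not part of the notebook: normalizes labels such as
# "q1 2024" or "Q1-2024" to the "Q1 2024" form used in the prompts above.
QUARTER_RE = re.compile(r"^\s*[Qq]([1-4])[\s\-]*(\d{4})\s*$")


def parse_quarters(raw, prompt_type):
    """Split a comma-separated string into normalized quarter labels.

    Raises ValueError if a label is malformed or the count does not fit the
    prompt style (exactly 2 for pairwise, at least 3 for multiquarter).
    """
    quarters = []
    for part in raw.split(","):
        if not part.strip():
            continue  # ignore empty items such as a trailing comma
        m = QUARTER_RE.match(part)
        if m is None:
            raise ValueError(f"Not a quarter label: {part.strip()!r}")
        quarters.append(f"Q{m.group(1)} {m.group(2)}")
    if prompt_type == "pairwise" and len(quarters) != 2:
        raise ValueError("pairwise mode needs exactly two quarters")
    if prompt_type == "multiquarter" and len(quarters) < 3:
        raise ValueError("multiquarter mode needs at least three quarters")
    return quarters


print(parse_quarters("Q1 2024, q2-2024,Q3 2024", "multiquarter"))
```

Raising `ValueError` instead of printing lets the calling cell decide whether to re-prompt or stop.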


In [55]:


## 18. Validate User Input and Run Analysis

#*Checks if the selected company and quarters actually exist in your data, so users don’t waste time waiting for the LLM when the inputs are invalid.*

#- Errors are handled gracefully with clear messages.
#- Supports all prompt styles: standalone, pairwise, multi-quarter.
#- If inputs are valid, runs the appropriate analysis and prints a summary.

#*Tip: This cell is the “engine” that powers the interactive CLI experience above!*
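
The existence check described above can be sketched as a small pure function, which makes it easy to test without running the LLM. This is only an illustration: `find_missing_quarters` is a hypothetical name, the sample metadata is made up, and treating a blank company as "any company" is an assumption about the intended behavior.

```python
def find_missing_quarters(metadatas, quarters, company=None):
    """Return the requested quarters that have no chunk for the given company.

    When company is None, a quarter counts as available if any company has it.
    """
    available = {
        m["quarter"]
        for m in metadatas
        if company is None or m["company"] == company
    }
    return [q for q in quarters if q not in available]


# Toy metadata with the two keys the check relies on (values are illustrative).
sample_metadatas = [
    {"company": "JPM", "quarter": "Q1 2024"},
    {"company": "JPM", "quarter": "Q2 2024"},
    {"company": "GS", "quarter": "Q3 2024"},
]

# Q3 2024 exists only for GS, so it is missing when the search is limited to JPM.
print(find_missing_quarters(sample_metadatas, ["Q1 2024", "Q3 2024"], company="JPM"))
# With no company filter, both quarters are available.
print(find_missing_quarters(sample_metadatas, ["Q1 2024", "Q3 2024"]))
```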




# Validate user input against available companies and quarters in your data
valid_companies = set(m['company'] for m in metadatas)
valid_quarters = set(m['quarter'] for m in metadatas)

if company and company not in valid_companies:
    print(f"ERROR: Company '{company}' not found in your data. Available companies: {sorted(valid_companies)}")
    raise SystemExit()  # stop here so the LLM is never called with a bad company

# Quarters available for the selected company (or for all companies if none was given).
# Filtering only when a company is set avoids rejecting every quarter when company is None.
available_quarters = sorted({m['quarter'] for m in metadatas if (not company or m['company'] == company)})

# Check that requested quarters exist (per company, if given)
if prompt_type in ["pairwise", "multiquarter"]:
    if not quarters:
        # The input cell sets quarters to None when the wrong number of quarters was entered
        print("ERROR: No valid list of quarters was entered. Re-run the input cell.")
        raise SystemExit()
    missing_quarters = [q for q in quarters if q not in available_quarters]
    if missing_quarters:
        print(f"ERROR: Could not find data for: {', '.join(missing_quarters)} (company={company})")
        print(f"Available quarters for {company or 'all companies'}: {available_quarters}")
        raise SystemExit()
elif prompt_type == "standalone":
    if quarter and quarter not in available_quarters:
        print(f"ERROR: Could not find data for quarter: {quarter} (company={company})")
        print(f"Available quarters for {company or 'all companies'}: {available_quarters}")
        raise SystemExit()

# --- Run the main analysis flow ---
try:
    if prompt_type == "standalone":
        retrieve_and_summarize(
            user_query=user_query,
            index=index,
            documents=documents,
            metadatas=metadatas,
            k=3,
            company=company,
            quarter=quarter,
            prompt_type=prompt_type,
            model="gemma3:latest",
            print_chunks=True,
            show_all_metadata=True
        )
    elif prompt_type == "pairwise":
        if not quarters or len(quarters) != 2:
            print("Cannot run: need two valid quarters for pairwise.")
        else:
            summaries = []
            for q in quarters:
                results = query_top_k(user_query, index, documents, metadatas, k=3, company=company, quarter=q)
                quarter_summary = f"{q}:\n" + "\n".join([doc for doc, _ in results])
                summaries.append(quarter_summary)
            retrieve_and_summarize(
                user_query=user_query,
                index=index,
                documents=documents,
                metadatas=metadatas,
                prompt_type="pairwise",
                k=3,
                company=company,
                quarter=None,
                model="gemma3:latest",
                print_chunks=True,
                show_all_metadata=True,
                extra_period_summaries=summaries
            )
    elif prompt_type == "multiquarter":
        if not quarters or len(quarters) < 3:
            print("Cannot run: need at least three valid quarters for multi-quarter.")
        else:
            summaries = []
            for q in quarters:
                results = query_top_k(user_query, index, documents, metadatas, k=3, company=company, quarter=q)
                # Use the same labeling as pairwise:
                quarter_summary = f"{q}:\n" + "\n".join([doc for doc, _ in results])
                summaries.append(quarter_summary)
            retrieve_and_summarize(
                user_query=user_query,
                index=index,
                documents=documents,
                metadatas=metadatas,
                prompt_type="multiquarter",
                k=3,
                company=company,
                quarter=None,
                model="gemma3:latest",
                print_chunks=True,
                show_all_metadata=True,
                extra_period_summaries=summaries
            )

except Exception as e:
    print(f"ERROR (full pipeline): {e}")










🔹 Match: ROE: 18%
Net Income: $4.8B
Revenue: $13.6B
IB Fees: $2.0B
Payments: $2.4B
Markets Revenue: $8.0B
Fixed Income: $5.3B
Equity Markets: $2.7B
Credit Losses: $32M
🗂️ Metadata: [JPM - Q1 2024 - Corporate & Investment Bank]
----------
🔹 Match: ROE: 35%
Net Income: $4.8B
Revenue: $17.7B
Banking & Wealth Mgmt: $10.3B
Home Lending: $1.2B
Card Services & Auto: $6.1B
Credit Losses: $1.9B
Card Net Charge-Off Rate: 3.32%
Mobile Customers: ↑7% YoY
Card Volume: ↑9% YoY
🗂️ Metadata: [JPM - Q1 2024 - Consumer & Community Banking]
----------
🔹 Match: ROE: 17%
Net Income: $5.9B
Revenue: $17.9B
IB Fees: $2.5B
Payments: $4.5B
Markets Revenue: $7.8B
Fixed Income: $4.8B
Equity Markets: $3.0B
Credit Losses: $384M
Client Deposits: ↑2% YoY
🗂️ Metadata: [JPM - Q2 2024 - Corporate & Investment Bank]
----------
🔹 Match: ROE: 30%
Net Income: $4.2B
Revenue: $17.7B
Banking & Wealth Mgmt: $10.4B
Home Lending: $1.3B
Card Services & Auto: $6.0B
Credit Losses: $2.6B
Card Net Charge-Off Rate: 3.50%
Mobile Custome

In [45]:

## 19. Extract Key Metrics from Summaries (Optional)

#*Quick utility to pull out Net Income (or other metrics) from multiple summary texts and tabulate for review or export.*

#- You can adapt this cell to extract other numbers (ROE, Revenue, etc.).
#- Useful for tracking trends or quick charting in pandas.





import re

import pandas as pd

# Example: extract Net Income from each summary (captures values like "4.8B" or "32M")
# Note: `summaries` and `quarters` come from a pairwise or multiquarter run of the cell above.
net_income_list = []
for text in summaries:
    # re.search returns only the first "Net Income" in each quarter's text; adjust the regex per format
    match = re.search(r"Net Income:\s*\$([\d\.]+[MBT]?)", text)
    ni = match.group(1) if match else "-"
    net_income_list.append(ni)

# Combine into a DataFrame for easy comparison or plotting
df = pd.DataFrame({'Quarter': quarters, 'Net Income': net_income_list})
print(df)





   Quarter Net Income
0  Q1 2024       4.8B
1  Q2 2024       4.2B
2  Q3 2024       1.4B
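
The table above stores values such as `4.8B` as strings, which cannot be plotted or compared numerically. A possible way to generalize the extraction to other metrics and convert the suffixes to plain numbers is sketched below. `extract_metric` and `SCALE` are hypothetical names, and the sketch assumes every metric follows the `Label: $<number><M|B|T>` layout seen in the matched chunks.

```python
import re

# Multipliers for the magnitude suffixes seen in the chunks (assumed: M, B, T).
SCALE = {"": 1.0, "M": 1e6, "B": 1e9, "T": 1e12}


def extract_metric(text, label):
    """Return the first '<label>: $<number><suffix>' value in text as a float.

    Returns None when the label is absent or is not a dollar amount.
    """
    pattern = rf"{re.escape(label)}:\s*\$(\d+(?:\.\d+)?)([MBT]?)"
    match = re.search(pattern, text)
    if match is None:
        return None
    return float(match.group(1)) * SCALE[match.group(2)]


# One chunk copied from the JPM Q1 2024 output shown earlier in the notebook.
sample = "ROE: 18%\nNet Income: $4.8B\nRevenue: $13.6B\nCredit Losses: $32M"

rows = [
    {"Metric": label, "Value (USD)": extract_metric(sample, label)}
    for label in ("Net Income", "Revenue", "Credit Losses", "ROE")
]
for row in rows:
    print(row)
```

`ROE` comes back as `None` because it is a percentage rather than a dollar amount. The `rows` list can be passed directly to `pd.DataFrame(rows)` for charting.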


In [47]:

## 20. End-to-End: One-Click Financial Report Analysis

#Call `analyze_report()` to run the **full pipeline** — semantic search, LLM summarization, and optional save to disk.  
#Use this cell for rapid prototyping or as a backend call for future UI/API.

#- Customize `user_query`, `company`, `quarter`, and `prompt_type` for any scenario.
#- Set `save_output=True` to persist results as JSON for later review or visualization.
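
The internals of `analyze_report` are not shown in this section, so the following is only a guess at what the `save_output=True` step might look like. `save_analysis` is a hypothetical helper, the `summary` key is an assumption, and the `analysis_<query>_<timestamp>.json` naming is inferred from the saved filename used later in the notebook.

```python
import json
import re
import tempfile
from datetime import datetime
from pathlib import Path


def save_analysis(result, out_dir=".", now=None):
    """Write result as pretty-printed JSON and return the path of the new file.

    The filename combines a slug of the query with a timestamp, so repeated
    runs of the same query do not overwrite each other.
    """
    now = now or datetime.now()
    slug = re.sub(r"[^A-Za-z0-9]+", "_", result["query"]).strip("_")
    path = Path(out_dir) / f"analysis_{slug}_{now:%Y%m%d_%H%M%S}.json"
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps symbols such as the YoY arrows readable in the file
        json.dump(result, f, indent=2, ensure_ascii=False)
    return path


result = {
    "query": "Q1 2024 net income",
    "company": "JPM",
    "quarter": "Q1 2024",
    "matched_chunks": [],  # filled by the retrieval step in practice
    "summary": "(LLM summary text goes here)",  # hypothetical key
}

# A temporary directory keeps this demo from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_analysis(result, out_dir=tmp, now=datetime(2025, 7, 21, 17, 29, 34))
    print(saved.name)
```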




try:
    analyze_report(
        user_query="Q1 2024 net income",   # <-- Change this to any question
        company="JPM",                     # Company ticker (optional)
        quarter="Q1 2024",                 # Quarter (optional for standalone)
        k=3,                               # How many chunks to retrieve/summarize
        prompt_type="standalone",          # "standalone", "pairwise", or "multiquarter"
        show_context=True,                 # Print context chunks matched
        save_output=True                   # Save summary/context as .json file
    )
except Exception as e:
    print(f"ERROR (analyze_report): {e}")



🔹 Match: ROE: 35%
Net Income: $4.8B
Revenue: $17.7B
Banking & Wealth Mgmt: $10.3B
Home Lending: $1.2B
Card Services & Auto: $6.1B
Credit Losses: $1.9B
Card Net Charge-Off Rate: 3.32%
Mobile Customers: ↑7% YoY
Card Volume: ↑9% YoY
🗂️ Metadata: {'company': 'JPM', 'quarter': 'Q1 2024', 'business_unit': 'Consumer & Community Banking', 'section': 'Segment Performance', 'tags': ['JPM', 'Q1 2024', 'Consumer', 'Revenue', 'Net Income', 'Mobile', 'Card'], 'data_type': 'segment_performance', 'source': 'Earnings Release', 'page': None}
----------
🔹 Match: ROE: 18%
Net Income: $4.8B
Revenue: $13.6B
IB Fees: $2.0B
Payments: $2.4B
Markets Revenue: $8.0B
Fixed Income: $5.3B
Equity Markets: $2.7B
Credit Losses: $32M
🗂️ Metadata: {'company': 'JPM', 'quarter': 'Q1 2024', 'business_unit': 'Corporate & Investment Bank', 'section': 'Segment Performance', 'tags': ['JPM', 'Q1 2024', 'Investment Bank', 'Markets', 'IB Fees', 'Fixed Income'], 'data_type': 'segment_performance', 'source': 'Earnings Release', 'pag

In [49]:



## 21. Load and Display Saved Analysis Output

#Quickly load and preview any saved analysis JSON (from `analyze_report(save_output=True)`) to check summaries, matched context, or share results with others.
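
Because each saved file carries a timestamp in its name, the filename has to be edited by hand after every run. One way around that is to pick the most recently modified file automatically, as in the sketch below. `latest_analysis_file` is a hypothetical helper, and the `analysis_*.json` pattern is an assumption based on the filename used in this cell.

```python
import json
from pathlib import Path


def latest_analysis_file(directory=".", pattern="analysis_*.json"):
    """Return the most recently modified file matching pattern, or None if there is none."""
    candidates = sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


path = latest_analysis_file()
if path is None:
    print("No saved analysis found in the current directory.")
else:
    with open(path, "r", encoding="utf-8") as f:
        out = json.load(f)
    # Show only the file name and top-level keys as a quick sanity check
    print(path.name, "->", sorted(out) if isinstance(out, dict) else type(out).__name__)
```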





import json

# Replace the filename with your latest output if needed
with open('analysis_Q1_2024_net_income_20250721_172934.json', 'r', encoding='utf-8') as f:
    out = json.load(f)

# Pretty-print the JSON for human inspection
print(json.dumps(out, indent=2, ensure_ascii=False))





{
  "query": "Q1 2024 net income",
  "company": "JPM",
  "quarter": "Q1 2024",
  "matched_chunks": [
    {
      "content": "ROE: 35%\nNet Income: $4.8B\nRevenue: $17.7B\nBanking & Wealth Mgmt: $10.3B\nHome Lending: $1.2B\nCard Services & Auto: $6.1B\nCredit Losses: $1.9B\nCard Net Charge-Off Rate: 3.32%\nMobile Customers: ↑7% YoY\nCard Volume: ↑9% YoY",
      "metadata": {
        "company": "JPM",
        "quarter": "Q1 2024",
        "business_unit": "Consumer & Community Banking",
        "section": "Segment Performance",
        "tags": [
          "JPM",
          "Q1 2024",
          "Consumer",
          "Revenue",
          "Net Income",
          "Mobile",
          "Card"
        ],
        "data_type": "segment_performance",
        "source": "Earnings Release",
        "page": null
      }
    },
    {
      "content": "ROE: 18%\nNet Income: $4.8B\nRevenue: $13.6B\nIB Fees: $2.0B\nPayments: $2.4B\nMarkets Revenue: $8.0B\nFixed Income: $5.3B\nEquity Markets: $2.7B\nCredit L