## Automated Extraction of Management Discussion & Analysis (MD&A) Sections from Indian Annual Report PDFs

### 1. Introduction

The Management Discussion & Analysis (MD&A) section is a critical component of corporate annual reports, providing qualitative insight into a company's financial performance, operational challenges, risk factors, and future outlook. In India, listed companies must include an MD&A report in their annual report under the SEBI (Listing Obligations and Disclosure Requirements) Regulations, 2015, either as part of the Directors' Report required by the Companies Act, 2013 or as an addition to it. The section gives stakeholders management's perspective beyond the quantitative financial statements, which supports investment decisions, risk assessment, and corporate governance review.

Extracting MD&A content from PDF-based annual reports presents significant technical challenges. Annual reports are inherently unstructured documents, featuring complex layouts with embedded tables, images, and multi-column text that complicate text extraction. Layout variability across companies due to differing design choices, font styles, and page structures further hinders automated processing. Additionally, MD&A sections are often integrated with other report components, such as Directors' Reports or financial statements, making precise boundary identification difficult.

Indian annual reports exhibit particular structural diversity in MD&A presentation. Some companies provide standalone MD&A sections, while others embed the content within annexures or integrate it directly into the Directors' Report. This variability necessitates robust extraction methods capable of adapting to multiple organizational patterns.

This notebook implements a systematic pipeline for MD&A extraction, comprising the following stages:

1. **Data Collection**: Identification and organization of PDF annual reports from diverse Indian companies.
2. **PDF Parsing**: Extraction of raw text and structural elements using specialized libraries.
3. **Text Preprocessing**: Cleaning and normalization of extracted content to handle encoding artifacts and formatting inconsistencies.
4. **Section Detection**: Identification of MD&A boundaries through pattern matching and keyword-based analysis.
5. **Content Extraction**: Precise isolation of MD&A text while filtering extraneous sections.
6. **Validation and Output**: Quality assessment of extracted content and structured output generation for downstream analysis.
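
As an illustration of stage 3, the sketch below shows one possible cleaning pass over a page of extracted text. The function name `clean_page_text` and the specific rules are assumptions made for this example, not the notebook's own preprocessing step.

```python
import re
import unicodedata


def clean_page_text(text: str) -> str:
    """Hypothetical cleaner for one page of extracted text (illustrative only)."""
    # NFKC folds ligatures (U+FB01 "fi") and non-breaking spaces into plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Re-join words hyphenated across a line break: "Manage-\nment" -> "Management".
    # Caveat: this also merges genuine compounds that happen to wrap ("long-\nterm").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs but keep line breaks for line-based stages.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and drop lines that are empty after trimming.
    lines = [ln.strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln)


sample = "Manage-\nment   Discussion\u00a0&  Analysis\n\n  \ufb01nancial year 2019\u201320 "
print(clean_page_text(sample))
```

Line breaks are preserved on purpose, because the ToC logic later in the notebook works line by line. The en dash in `2019–20` also survives NFKC normalization, which is why the year patterns accept both `-` and `–`.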

### 2. Imports & Configuration Layer

In [18]:
import fitz  # PyMuPDF
import pandas as pd
import re
import logging
from tqdm import tqdm
import pathlib

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

PROJECT_ROOT = pathlib.Path.cwd()

INPUT_PDF_DIR = PROJECT_ROOT / "../data" / "../pdfs"
OUTPUT_DIR = PROJECT_ROOT / "../output"

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
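
One detail in the directory constants is worth checking. pathlib keeps `..` segments literally when joining, so `"../data" / "../pdfs"` steps into `data` and straight back out again. The sketch below uses a made-up root (`/work/project/notebooks`, an assumption for illustration) and `posixpath.normpath` to show where each spelling lands lexically. Which layout the project actually uses is not stated here, so treat the second spelling only as a possibility.

```python
import posixpath
from pathlib import PurePosixPath

root = PurePosixPath("/work/project/notebooks")  # hypothetical stand-in for Path.cwd()

as_written = root / "../data" / "../pdfs"  # same join as INPUT_PDF_DIR
nested = root / "../data" / "pdfs"         # alternative spelling (assumption)


def collapse(path: PurePosixPath) -> str:
    """Lexically collapse '..' segments (no filesystem access, symlinks ignored)."""
    return posixpath.normpath(str(path))


print(collapse(as_written))  # /work/project/pdfs
print(collapse(nested))      # /work/project/data/pdfs
```

On a real filesystem, `Path.resolve()` gives the authoritative answer because it also follows symlinks.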

### 3. PDF Interface Layer

In [19]:
class PDFInterface:
    def __init__(self, pdf_path):
        self.pdf_path = pathlib.Path(pdf_path)
        self.doc = fitz.open(self.pdf_path)
        logging.info("Loaded PDF: %s", self.pdf_path.name)

    def get_pages_text(self):
        pages = []
        for page_index in range(self.doc.page_count):
            page = self.doc.load_page(page_index)
            pages.append(
                {
                    "page_number": page_index + 1,
                    "text": page.get_text(),
                }
            )
        return pages

    def close(self):
        self.doc.close()
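
Because the class exposes `close()` but is not a context manager, callers have to remember to release the document. A minimal sketch of one way to guarantee that uses `contextlib.closing` from the standard library. `FakePDF` and `read_pages` are hypothetical names used here so the example runs without PyMuPDF; `FakePDF` only mimics the `get_pages_text()` and `close()` surface of `PDFInterface`.

```python
from contextlib import closing


class FakePDF:
    """Stand-in with the same get_pages_text()/close() surface as PDFInterface."""

    def __init__(self, pages):
        self._pages = pages
        self.closed = False

    def get_pages_text(self):
        # Same record shape as the notebook: 1-based page number plus text.
        return [{"page_number": i + 1, "text": t} for i, t in enumerate(self._pages)]

    def close(self):
        self.closed = True


def read_pages(pdf):
    """Read all pages and always call pdf.close(), even if reading raises."""
    with closing(pdf) as handle:
        return handle.get_pages_text()


pdf = FakePDF(["cover", "contents"])
pages = read_pages(pdf)
print(pages[1], pdf.closed)  # {'page_number': 2, 'text': 'contents'} True
```

The same pattern should apply to the real class, for example `with closing(PDFInterface(path)) as pdf:`, since `closing` only requires a `close()` method.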

### 4. Company Name & Financial Year Extraction

In [20]:
def extract_company_name(pages_text, company_folder: str | None = None):
    """Extract company name from the first 3 pages.

    Goal: prefer the actual company name that typically appears in the header area of
    page 2/3 near "ANNUAL REPORT" or similar, and avoid unrelated "... Limited" names
    from narrative paragraphs.

    Returns:
        str | None
    """

    # Strong ALL-CAPS pattern: an upper-case name ending in LIMITED or LTD
    caps_re = re.compile(r"\b([A-Z][A-Z\s&]{5,})\s*(LIMITED|LTD)\b")

    # Keywords typically close to the company header
    keyword_re = re.compile(r"\b(ANNUAL\s+REPORT|DIRECTORS[’']?\s+REPORT|BOARD[’']?S\s+REPORT)\b", re.IGNORECASE)

    candidates: dict[str, dict] = {}

    def _norm(name: str) -> str:
        name = re.sub(r"\s+", " ", (name or "").strip())
        return name

    def _add_candidate(name: str, near_keyword: bool):
        name = _norm(name)
        if not name:
            return
        rec = candidates.setdefault(name, {"count": 0, "near": 0, "len": len(name)})
        rec["count"] += 1
        if near_keyword:
            rec["near"] += 1

    # Search only first 3 pages
    for page in pages_text[:3]:
        text = page.get("text", "") or ""
        if not text:
            continue

        # 1) Prefer header-like region: first 25 non-empty lines
        lines = [ln.strip() for ln in text.splitlines() if (ln or "").strip()]
        header_text = "\n".join(lines[:25])

        # Mark if this page is likely a cover/header page
        has_keyword = bool(keyword_re.search(text))

        # Collect matches in the header region first
        for m in caps_re.finditer(header_text):
            full = f"{m.group(1).strip()} {m.group(2).strip()}"
            _add_candidate(full, near_keyword=has_keyword or bool(keyword_re.search(header_text)))

        # 2) Also collect matches close to keywords (within a bounded window)
        for km in keyword_re.finditer(text):
            start = max(0, km.start() - 800)
            end = min(len(text), km.end() + 800)
            window = text[start:end]
            for m in caps_re.finditer(window):
                full = f"{m.group(1).strip()} {m.group(2).strip()}"
                _add_candidate(full, near_keyword=True)

    if candidates:
        # Score: frequency first, then near-keyword hits, then length
        best = max(
            candidates.items(),
            key=lambda kv: (kv[1]["count"], kv[1]["near"], kv[1]["len"]),
        )[0]
        logging.info("Extracted company name: %s", best)
        return best

    # Hard-coded fallback for the Amit_spinning folder, so the name is never left blank
    if company_folder == "Amit_spinning":
        logging.warning("Falling back to default company name for Amit_spinning")
        return "AMIT SPINNING INDUSTRIES LIMITED"

    # Conservative fallback: do not guess from narrative text
    if company_folder:
        logging.warning("Company name not found in header region; using folder name: %s", company_folder)
        return company_folder.replace("_", " ").upper()

    logging.info("Company name not found in first 3 pages")
    return None
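
One subtlety in `extract_company_name`: the character class in `caps_re` contains `\s`, which also matches newlines, so a match can start on an upper-case line above the company name. The sketch below reproduces that on a two-line header and compares it with a same-line variant. `SAME_LINE_RE` and the `names` helper are assumptions introduced for illustration, not part of the notebook.

```python
import re

# Pattern as written in the notebook.
CAPS_RE = re.compile(r"\b([A-Z][A-Z\s&]{5,})\s*(LIMITED|LTD)\b")
# Hypothetical variant: only spaces and tabs allowed, so a match stays on one line.
SAME_LINE_RE = re.compile(r"\b([A-Z][A-Z &]{5,})[ \t]*(LIMITED|LTD)\b")


def names(pattern, text):
    """Return whitespace-normalized 'NAME LIMITED' strings found by pattern."""
    found = []
    for m in pattern.finditer(text):
        found.append(re.sub(r"\s+", " ", f"{m.group(1).strip()} {m.group(2)}"))
    return found


header = "ANNUAL REPORT\nAMIT SPINNING INDUSTRIES LIMITED"
print(names(CAPS_RE, header))       # ['ANNUAL REPORT AMIT SPINNING INDUSTRIES LIMITED']
print(names(SAME_LINE_RE, header))  # ['AMIT SPINNING INDUSTRIES LIMITED']
```

The trade-off is that the same-line variant misses names that wrap across two lines (for example `AMIT SPINNING INDUSTRIES` followed by `LIMITED` on the next line), which the original pattern does catch.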


def extract_financial_year(pages_text):
    """Extract financial year from the first 5 pages of PDF text.

    Supported examples:
      - "Annual Report 2019-20"
      - "31st Annual Report 2019-20"
      - "Year ended March 31, 2020"

    Returns:
        str: Normalized financial year (e.g., '2019-20') or None if not found
    """

    # Patterns for various year formats (search order matters: more specific first)
    year_patterns = [
        # 31st Annual Report 2019-20 / 31st Annual Report 2019 - 2020
        re.compile(
            r"\b\d{1,3}(?:st|nd|rd|th)\s+Annual\s+Report\s+(\d{4})\s*[-–]\s*(\d{2,4})\b",
            re.IGNORECASE,
        ),
        # Annual Report 2019-20 / Annual Report 2019 - 2020
        re.compile(
            r"\bAnnual\s+Report\s+(\d{4})\s*[-–]\s*(\d{2,4})\b",
            re.IGNORECASE,
        ),
        # Year ended March 31, 2020 (or similar)
        re.compile(
            r"\bYear\s+ended\s+\w+\s+\d{1,2},\s+(\d{4})\b",
            re.IGNORECASE,
        ),
    ]

    # Search first 5 pages
    for page in pages_text[:5]:
        text = page.get("text", "") or ""

        for pattern in year_patterns:
            match = pattern.search(text)
            if not match:
                continue

            groups = match.groups()

            if len(groups) == 1:
                # Single year (e.g., Year ended March 31, 2020) -> "2019-20".
                # Assumes the financial year ends in the captured calendar year (April-March cycle).
                year = groups[0]
                prev_year = str(int(year) - 1)
                normalized = f"{prev_year}-{year[-2:]}"
            else:
                year1, year2 = groups
                if len(year2) == 2:
                    normalized = f"{year1}-{year2}"
                else:
                    normalized = f"{year1}-{year2[-2:]}"

            logging.info("Extracted financial year: %s", normalized)
            return normalized

    logging.info("Financial year not found in first 5 pages")
    return None
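
The normalization rules in `extract_financial_year` can be checked in isolation. The sketch below restates them as a small helper and adds a day-first date pattern such as "year ended 31st March, 2020", which the notebook's third pattern does not appear to match. `normalize_fy` and `DAY_FIRST_RE` are hypothetical names introduced for this example.

```python
import re
from typing import Optional

# The notebook's month-first pattern, copied for comparison.
MONTH_FIRST_RE = re.compile(r"\bYear\s+ended\s+\w+\s+\d{1,2},\s+(\d{4})\b", re.IGNORECASE)

# Hypothetical addition: day-first dates such as "year ended 31st March, 2020".
DAY_FIRST_RE = re.compile(
    r"\byear\s+ended\s+(?:on\s+)?\d{1,2}(?:st|nd|rd|th)?\s+\w+,?\s+(\d{4})\b",
    re.IGNORECASE,
)


def normalize_fy(first: str, second: Optional[str] = None) -> str:
    """Return 'YYYY-YY' from a year pair, or from a single closing year."""
    if second is None:
        # A single closing year maps to the preceding year plus its last two digits.
        return f"{int(first) - 1}-{first[-2:]}"
    return f"{first}-{second[-2:]}"


text = "for the year ended 31st March, 2020"
print(MONTH_FIRST_RE.search(text))                         # None
print(normalize_fy(DAY_FIRST_RE.search(text).group(1)))    # 2019-20
print(normalize_fy("2019", "2020"), normalize_fy("2019", "20"))  # 2019-20 2019-20
```

As in the notebook, the single-year rule assumes the financial year closes in the captured calendar year, so a December year-end would still be labeled with the previous year as its start.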


### 5. Table of Contents (ToC) Analyzer

In [21]:
def collect_toc_raw_lines(pages_text, max_pages=5):
    """Collect STRICT ToC raw lines from ONLY the first `max_pages` pages.

    Returns a list preserving original extracted order exactly (per splitlines()).
    Each item:
      {
        "toc_page": int,
        "line_index": int,   # 0-based across non-empty ToC lines
        "line_text": str
      }

    Notes:
      - This collects ALL non-empty lines (not just those with digits), because
        MD&A titles can appear without a page number on the same extracted line,
        and titles can be wrapped across multiple extracted lines.
      - Downstream logic MUST NOT consult body text; this is ToC-page-only.
    """

    raw_lines = []
    line_index = 0

    for page in pages_text[:max_pages]:
        toc_page_no = page.get("page_number")
        text = page.get("text", "") or ""

        for ln in text.splitlines():
            s = (ln or "").strip()
            if not s:
                continue

            raw_lines.append(
                {
                    "toc_page": toc_page_no,
                    "line_index": line_index,
                    "line_text": s,
                }
            )
            line_index += 1

    return raw_lines


def detect_toc_entries(pages_text, max_pages=5):
    """STRICT ToC line collection.

    Required behavior:
      - Scan ONLY the first `max_pages` pages.
      - Collect all non-empty lines containing BOTH:
          * at least one alphabetic character, AND
          * at least one numeric character
      - Preserve original line order exactly as extracted.

    Returns:
      list[dict] with keys: toc_page, line_index, line_text
    """

    raw_lines = collect_toc_raw_lines(pages_text, max_pages=max_pages)

    entries = []
    for item in raw_lines:
        txt = item.get("line_text", "")
        if re.search(r"[A-Za-z]", txt) and re.search(r"\d", txt):
            entries.append(item)

    logging.info("Collected %d strict ToC declaration lines (first %d pages)", len(entries), max_pages)
    return entries


_PAGE_INT_RE = re.compile(r"\b(\d{1,4})\b")


def _first_valid_page_number_in_text(text: str, max_page: int):
    for m in _PAGE_INT_RE.finditer(text or ""):
        try:
            n = int(m.group(1))
        except ValueError:
            continue
        if 1 <= n <= max_page:
            return n
    return None


def _first_valid_page_number_after_pos(text: str, start_pos: int, max_page: int):
    res = _first_valid_page_number_after_pos_with_span(text, start_pos, max_page)
    return res[0] if res else None


def _first_valid_page_number_after_pos_with_span(text: str, start_pos: int, max_page: int):
    for m in _PAGE_INT_RE.finditer(text or ""):
        if m.start() < (start_pos or 0):
            continue
        try:
            n = int(m.group(1))
        except ValueError:
            continue
        if 1 <= n <= max_page:
            return n, m.start(), m.end()
    return None


def resolve_page_number_strict(raw_lines, title_line_index: int, max_page: int, lookahead_lines: int = 3):
    """STRICT page-number association (no fallback logic).

    Rules:
      - If the title line contains a valid integer page number, use it.
      - ELSE look ONLY at the immediately following lines (max next `lookahead_lines`).
      - The FIRST valid integer page number encountered is used.
      - If none found in this strict window, return None.

    Note:
      - This is the generic helper; for section titles that may share a line with
        other sections (multi-column extraction), prefer
        resolve_page_number_for_title_block_strict().
    """

    if title_line_index < 0 or title_line_index >= len(raw_lines):
        return None

    # If the line contains a page number, use it.
    same_line = raw_lines[title_line_index].get("line_text", "")
    n = _first_valid_page_number_in_text(same_line, max_page)
    if n is not None:
        return n

    # Else look at the next lines only.
    for offset in range(1, lookahead_lines + 1):
        j = title_line_index + offset
        if j >= len(raw_lines):
            break

        candidate_line = raw_lines[j].get("line_text", "")
        n = _first_valid_page_number_in_text(candidate_line, max_page)
        if n is not None:
            return n

    return None


def find_title_block_strict(
    raw_lines,
    title_re: re.Pattern,
    max_join_lines: int = 3,
    anchor_re: re.Pattern | None = None,
):
    """Find a title match treating the ToC as structural blocks.

    Deterministic behavior:
      - Scans raw_lines in order.
      - Only considers a block starting at line i if anchor_re matches line i
        (when anchor_re is provided). This prevents accidentally starting a
        block on an unrelated neighboring section.
      - At each valid start position i, tests the concatenation of
        1..max_join_lines lines (joined with a single space) against title_re.
      - Returns (start_idx, end_idx, block_text) for the first match.

    It does NOT consult body text and does NOT search beyond the ToC pages.
    """

    n = len(raw_lines)
    for i in range(n):
        start_line = raw_lines[i].get("line_text", "")
        if anchor_re is not None and not anchor_re.search(start_line or ""):
            continue

        parts = []
        for j in range(i, min(n, i + max_join_lines)):
            parts.append(raw_lines[j].get("line_text", ""))
            block_text = " ".join(p for p in parts if p)
            if title_re.search(block_text or ""):
                return i, j, block_text

    return None, None, None


def resolve_page_number_for_title_block_strict(
    raw_lines,
    title_start_idx: int,
    title_end_idx: int,
    title_re: re.Pattern,
    max_page: int,
    lookahead_lines: int = 3,
):
    """STRICT page-number association for a title block, robust to multi-column merges.

    Deterministic rules:
      - If the title appears on a line that also contains multiple page numbers,
        choose the FIRST valid integer page number that occurs AFTER the matched
        title text on that line.
      - Otherwise (no resolvable number on the title-containing line), search
        ONLY the next `lookahead_lines` lines after the title block; FIRST valid
        integer page number wins.
      - If none found, return None.

    This is still ToC-only and bounded; no body-text inference.
    """

    details = resolve_page_number_for_title_block_strict_with_details(
        raw_lines,
        title_start_idx=title_start_idx,
        title_end_idx=title_end_idx,
        title_re=title_re,
        max_page=max_page,
        lookahead_lines=lookahead_lines,
    )
    return details["page"] if details else None


def resolve_page_number_for_title_block_strict_with_details(
    raw_lines,
    title_start_idx: int,
    title_end_idx: int,
    title_re: re.Pattern,
    max_page: int,
    lookahead_lines: int = 3,
):
    """Same as resolve_page_number_for_title_block_strict, but returns details.

    Returns dict:
      {
        "page": int,
        "page_span_start": int | None,
        "page_span_end": int | None,
        "page_line_idx": int,
      }
    """

    if title_start_idx is None or title_end_idx is None:
        return None

    # 1) Prefer a page number that occurs after the matched title on the same line.
    for i in range(title_start_idx, title_end_idx + 1):
        line = raw_lines[i].get("line_text", "")
        m = title_re.search(line or "")
        if not m:
            continue

        res = _first_valid_page_number_after_pos_with_span(line, m.end(), max_page)
        if res is not None:
            n, s, e = res
            return {
                "page": n,
                "page_span_start": s,
                "page_span_end": e,
                "page_line_idx": i,
            }

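    # 1b) Wrapped title: when the title spans several lines, no single line matches
    #     title_re, so match the joined block and accept a number that follows the
    #     title's tail on the block's last line (e.g. "... and" / "Analysis Report 25").
    if title_end_idx > title_start_idx:
        parts = [raw_lines[i].get("line_text", "") or "" for i in range(title_start_idx, title_end_idx + 1)]
        joined = " ".join(parts)
        last_offset = len(joined) - len(parts[-1])
        m = title_re.search(joined)
        if m and m.end() > last_offset:
            res = _first_valid_page_number_after_pos_with_span(parts[-1], m.end() - last_offset, max_page)
            if res is not None:
                n, s, e = res
                return {"page": n, "page_span_start": s, "page_span_end": e, "page_line_idx": title_end_idx}

    # 2) Otherwise, search next N lines after the title block.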
    for offset in range(1, lookahead_lines + 1):
        j = title_end_idx + offset
        if j >= len(raw_lines):
            break

        line = raw_lines[j].get("line_text", "")
        res = _first_valid_page_number_after_pos_with_span(line, 0, max_page)
        if res is not None:
            n, s, e = res
            return {
                "page": n,
                "page_span_start": s,
                "page_span_end": e,
                "page_line_idx": j,
            }

    return None


def find_mdna_start_from_toc(pages_text):
    """Backward-compatible helper: STRICT MD&A start-page discovery from ToC pages only."""

    raw_lines = collect_toc_raw_lines(pages_text, max_pages=5)
    if not raw_lines:
        logging.info("No ToC raw lines found in first 5 pages")
        return None

    max_page = len(pages_text)

    mdna_title_re = re.compile(
        r"\bmanagement(?:\s*[’']?s)?\s+discussion(?:s)?\s+(?:and|&)\s+analysis(?:\s+report)?\b",
        re.IGNORECASE,
    )

    mdna_anchor_re = re.compile(r"\bmanagement\b", re.IGNORECASE)

    start_idx, end_idx, _ = find_title_block_strict(
        raw_lines,
        mdna_title_re,
        max_join_lines=3,
        anchor_re=mdna_anchor_re,
    )

    if start_idx is None:
        logging.info("MD&A title block not found in ToC raw lines")
        return None

    start_page = resolve_page_number_for_title_block_strict(
        raw_lines,
        title_start_idx=start_idx,
        title_end_idx=end_idx,
        title_re=mdna_title_re,
        max_page=max_page,
        lookahead_lines=3,
    )

    if start_page is None:
        logging.info("MD&A page number not found within strict 3-line window")
        return None

    logging.info(
        "MD&A ToC entry found (strict block): '%s' -> page %s",
        " ".join(raw_lines[i].get("line_text", "") for i in range(start_idx, end_idx + 1)),
        start_page,
    )
    return start_page
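
The block-matching idea used above can be tried on a toy ToC. The sketch below is a simplified, self-contained variant: it joins up to three lines starting at an anchor line, then reads the first in-range integer after the matched title inside the joined text. `find_wrapped_title` and the sample lines are assumptions for illustration, and the lookahead window of the notebook's resolver is deliberately left out.

```python
import re

TITLE_RE = re.compile(
    r"\bmanagement\s+discussion\s+(?:and|&)\s+analysis(?:\s+report)?\b",
    re.IGNORECASE,
)
ANCHOR_RE = re.compile(r"\bmanagement\b", re.IGNORECASE)
PAGE_RE = re.compile(r"\b(\d{1,4})\b")


def find_wrapped_title(lines, max_join=3, max_page=200):
    """Toy helper (hypothetical): return (start, end, page) for the first title block."""
    for i, first in enumerate(lines):
        if not ANCHOR_RE.search(first):
            continue  # a block may only start on a line mentioning "management"
        for j in range(i, min(len(lines), i + max_join)):
            block = " ".join(lines[i : j + 1])
            m = TITLE_RE.search(block)
            if not m:
                continue
            # First in-range integer after the matched title inside the joined block.
            for num in PAGE_RE.finditer(block, m.end()):
                n = int(num.group(1))
                if 1 <= n <= max_page:
                    return i, j, n
            return i, j, None
    return None


toc = [
    "Directors' Report 4",
    "Management Discussion and",
    "Analysis Report 25",
    "Corporate Governance 31",
]
print(find_wrapped_title(toc))  # (1, 2, 25)
```

Without the anchor check, the block starting at "Directors' Report 4" would also match once three lines are joined, and the reported start index would point at the wrong entry. That is the situation `anchor_re` is meant to prevent.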


### 6. MD&A Boundary Detection

In [22]:
import re
import logging


def _detect_mdna_boundaries_amit_spinning_index(raw_lines, max_page: int):
    """Amit_spinning only: parse INDEX-style ToC and derive MD&A boundaries.

    Handles both common layouts seen in Amit Spinning PDFs:
      A) Row-style: title lines followed by a standalone page number line.
      B) Boxed INDEX: titles listed first, then a separate block of standalone page numbers.

    Rules:
      - Treat "INDEX" as ToC.
      - Support multi-line titles for "Board’s Report Including / Management Discussions & / Analysis Report".
      - Identify MD&A as the entry whose merged title contains:
          * "including" AND (("management" AND "discussion") OR "analysis")
      - End at the page before the first of: "Auditor’s Report" or "Balance Sheet".

    Returns:
      (start_page, end_page) or (None, None)
    """

    index_start = None
    for i, item in enumerate(raw_lines):
        t = (item.get("line_text") or "").strip()
        if re.search(r"\bINDEX\b", t, re.IGNORECASE):
            index_start = i
            break

    if index_start is None:
        logging.warning("Amit_spinning: INDEX not found in first 5 pages")
        return None, None

    # Start after an optional "Page No." line
    start_i = index_start + 1
    for j in range(index_start + 1, min(len(raw_lines), index_start + 40)):
        if re.fullmatch(r"page\s*no\.?", (raw_lines[j].get("line_text") or "").strip(), re.IGNORECASE):
            start_i = j + 1
            break

    lines = [(raw_lines[k].get("line_text") or "").strip() for k in range(start_i, len(raw_lines))]
    lines = [ln for ln in lines if ln]

    standalone_num_re = re.compile(r"^\s*(\d{1,3})\s*$")

    def _is_standalone_page_line(ln: str):
        m = standalone_num_re.match(ln or "")
        if not m:
            return None
        try:
            n = int(m.group(1))
        except ValueError:
            return None
        if 1 <= n <= max_page:
            return n
        return None

    # Heuristic (ToC-only, deterministic) to detect layout B: a run of standalone numbers.
    num_positions = []
    for idx, ln in enumerate(lines):
        n = _is_standalone_page_line(ln)
        if n is not None:
            num_positions.append((idx, n))

    numbers_block_start = None
    for pos, _ in num_positions:
        # If we see at least 3 standalone numbers within the next 10 lines, treat as the page-number column.
        count = 0
        for k in range(pos, min(len(lines), pos + 10)):
            if _is_standalone_page_line(lines[k]) is not None:
                count += 1
        if count >= 3:
            numbers_block_start = pos
            break

    # Expected ToC entry starts in Amit_spinning INDEX boxes.
    expected_start_re = re.compile(
        r"^(notice|board|annexures?|corporate\s+governance|auditors?[’']?\s+report|auditor[’']?s\s+report|balance\s+sheet|statement\s+of\s+profit|cash\s+flow\s+statement|notes)\b",
        re.IGNORECASE,
    )

    next_expected_start_re = re.compile(
        r"^(notice|annexures?|corporate\s+governance|auditors?[’']?\s+report|auditor[’']?s\s+report|balance\s+sheet|statement\s+of\s+profit|cash\s+flow\s+statement|notes)\b",
        re.IGNORECASE,
    )

    def _clean_title_line(ln: str) -> str | None:
        s = (ln or "").strip()
        if not s:
            return None
        if s in {"•", ":"}:
            return None
        if not re.search(r"[A-Za-z]", s):
            return None
        return s

    def _build_titles_from_lines_strict(title_lines: list[str]) -> list[str]:
        """Build a strict ordered list of INDEX titles.

        For boxed INDEX layouts we *only* accept known top-level entries. This avoids accidentally
        treating AGM/date/venue text as ToC entries.
        """
        cleaned = []
        for ln in title_lines:
            s = _clean_title_line(ln)
            if s is None:
                continue
            cleaned.append(s)

        # Stop once we reach Notes (INDEX content after that is AGM details / venue etc.)
        for stop_idx, s in enumerate(cleaned):
            if re.match(r"^notes\b", s, re.IGNORECASE):
                cleaned = cleaned[: stop_idx + 1]
                break

        titles: list[str] = []
        i = 0
        while i < len(cleaned):
            s = cleaned[i]

            if not expected_start_re.match(s):
                i += 1
                continue

            # Special multi-line capture for Board's Report Including ...
            if re.search(r"\bboard\b.*\bincluding\b", s, re.IGNORECASE):
                parts = [s]
                i += 1
                while i < len(cleaned):
                    nxt = cleaned[i]
                    if next_expected_start_re.match(nxt):
                        break
                    parts.append(nxt)
                    i += 1
                titles.append(re.sub(r"\s+", " ", " ".join(parts)).strip())
                continue

            # Other known top-level entries: single-line
            titles.append(s)
            i += 1

        # De-dupe while preserving order
        seen = set()
        out = []
        for t in titles:
            if t in seen:
                continue
            seen.add(t)
            out.append(t)
        return out

    entries = []

    if numbers_block_start is not None:
        # Layout B: titles first, then a block of page numbers
        title_region = lines[:numbers_block_start]
        number_region = lines[numbers_block_start:]

        titles = _build_titles_from_lines_strict(title_region)

        page_numbers = []
        for ln in number_region:
            n = _is_standalone_page_line(ln)
            if n is None:
                # Stop if we hit body text
                if re.search(r"ANNUAL\s+REPORT", ln, re.IGNORECASE):
                    break
                continue
            page_numbers.append(n)
            if len(page_numbers) >= len(titles):
                break

        if not titles or len(page_numbers) < len(titles):
            logging.warning(
                "Amit_spinning: INDEX layout detected but could not align titles (%d) with page numbers (%d)",
                len(titles),
                len(page_numbers),
            )
            return None, None

        for t, p in zip(titles, page_numbers):
            entries.append({"title": t, "page": p})

    else:
        # Layout A: streaming merge until a standalone page number line is detected
        buf_parts: list[str] = []
        for ln in lines:
            n = _is_standalone_page_line(ln)
            if n is None:
                buf_parts.append(ln)
                continue

            merged_title = re.sub(r"\s+", " ", " ".join(buf_parts)).strip()
            if merged_title:
                entries.append({"title": merged_title, "page": n})
            buf_parts = []

    if not entries:
        logging.warning("Amit_spinning: INDEX parsed but produced zero entries")
        return None, None

    # MD&A embedded inside the 'including ... management discussion/analysis' entry
    mdna_start_page = None
    for ent in entries:
        title_l = (ent.get("title") or "").lower()
        if ("including" in title_l) and ((("management" in title_l) and ("discussion" in title_l)) or ("analysis" in title_l)):
            mdna_start_page = ent.get("page")
            break

    if not isinstance(mdna_start_page, int):
        logging.warning("Amit_spinning: MD&A-containing INDEX entry not found")
        return None, None

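    # Covers "Auditors' Report", "Auditors Report" and the singular possessive "Auditor's Report".
    terminator_re = re.compile(
        r"\bauditors?\s*[’']?\s*report\b|\bauditor[’']s\s+report\b|\bbalance\s+sheet\b",
        re.IGNORECASE,
    )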

    next_section_page = None
    for ent in entries:
        p = ent.get("page")
        if not isinstance(p, int) or p <= mdna_start_page:
            continue
        if terminator_re.search(ent.get("title") or ""):
            next_section_page = p
            break

    mdna_end_page = max_page if next_section_page is None else (next_section_page - 1)
    if mdna_end_page < mdna_start_page:
        logging.warning("Amit_spinning: invalid computed range start=%s end=%s", mdna_start_page, mdna_end_page)
        return None, None

    logging.info("Amit_spinning MD&A boundaries (INDEX): start=%s, end=%s", mdna_start_page, mdna_end_page)
    return mdna_start_page, mdna_end_page
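
The layout B branch ("titles first, then a column of page numbers") comes down to locating the start of the number column and pairing it with the titles in order. The sketch below shows that core idea on a made-up INDEX. It is a simplified assumption-laden version: `align_boxed_index` is a hypothetical name, and the multi-line title merge, the whitelist of expected entries, and the `max_page` check from the function above are omitted.

```python
import re

STANDALONE_NUM_RE = re.compile(r"^\s*(\d{1,3})\s*$")


def align_boxed_index(lines, min_run=3, window=10):
    """Toy version of the layout B idea: titles first, then a column of page numbers."""

    def page_of(line):
        m = STANDALONE_NUM_RE.match(line)
        return int(m.group(1)) if m else None

    # The number column starts at the first standalone number that has at least
    # `min_run` standalone numbers within the next `window` lines (itself included).
    block_start = None
    for idx, line in enumerate(lines):
        if page_of(line) is None:
            continue
        run = sum(page_of(x) is not None for x in lines[idx : idx + window])
        if run >= min_run:
            block_start = idx
            break
    if block_start is None:
        return None

    titles = [ln for ln in lines[:block_start] if re.search(r"[A-Za-z]", ln)]
    pages = [p for p in map(page_of, lines[block_start:]) if p is not None]
    if not titles or len(pages) < len(titles):
        return None  # titles and page numbers cannot be aligned
    return list(zip(titles, pages))


index_lines = [
    "Notice", "Board's Report", "Corporate Governance", "Auditors' Report",
    "1", "4", "17", "25",
]
print(align_boxed_index(index_lines))
```

Pairing by position only works if every title has exactly one number and the extraction order is preserved, which is why the notebook's version refuses to proceed when the counts do not line up.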


def _detect_mdna_boundaries_strict_toc(raw_lines, max_page: int):
    """Shared strict-ToC-only MD&A boundary detection used for all companies.

    This is the original "normal" detection path. It does not do any Amit_spinning INDEX logic.
    """

    mdna_title_re = re.compile(
        r"\bmanagement(?:\s*[’']?s)?\s+discussion(?:s)?\s+(?:and|&)\s+analysis(?:\s+report)?\b",
        re.IGNORECASE,
    )
    mdna_anchor_re = re.compile(r"\bmanagement\b", re.IGNORECASE)

    # Exclusion regexes (kept strict)
    directors_re = re.compile(r"\bdirectors\s*[’']?\s*report\b", re.IGNORECASE)
    secretarial_re = re.compile(r"\bsecretarial\s+audit\b", re.IGNORECASE)
    mr3_re = re.compile(r"\bform\s+mr\s*[-–]?\s*3\b|\bmr\s*[-–]?\s*3\b", re.IGNORECASE)
    corp_info_re = re.compile(r"\bcorporate\s+information\b", re.IGNORECASE)
    auditors_re = re.compile(r"\bauditors?\s*[’']?\s*report\b|\bauditor[’']s\s+report\b|\bindependent\s+auditor\b", re.IGNORECASE)
    corp_gov_re = re.compile(r"\bcorporate\s+governance\b", re.IGNORECASE)

    disallowed = [
        (directors_re, re.compile(r"\bdirectors\b", re.IGNORECASE)),
        (secretarial_re, re.compile(r"\bsecretarial\b", re.IGNORECASE)),
        (mr3_re, re.compile(r"\bmr\b|\bform\b", re.IGNORECASE)),
        (corp_info_re, re.compile(r"\bcorporate\b", re.IGNORECASE)),
        (auditors_re, re.compile(r"\bauditors?\b|\bindependent\b", re.IGNORECASE)),
        (corp_gov_re, re.compile(r"\bgovernance\b|\bcorporate\b", re.IGNORECASE)),
    ]

    # --- 1) Find MD&A title as a structural block (up to 3 joined lines) ---
    mdna_start_idx, mdna_end_idx, mdna_block_text = find_title_block_strict(
        raw_lines,
        mdna_title_re,
        max_join_lines=3,
        anchor_re=mdna_anchor_re,
    )

    # --- 1a) Handle special case: MD&A is part of Directors' Report block ---
    if mdna_start_idx is None:
        logging.info("MD&A not found as standalone title; checking inside Directors' Report block")
        directors_anchor_re = re.compile(r"\bdirectors\b", re.IGNORECASE)
        dir_start_idx, dir_end_idx, dir_block_text = find_title_block_strict(
            raw_lines,
            directors_re,
            max_join_lines=3,
            anchor_re=directors_anchor_re,
        )

        if dir_block_text and mdna_title_re.search(dir_block_text):
            logging.info("MD&A title found inside Directors' Report block; using its boundaries")
            mdna_start_idx = dir_start_idx
            mdna_end_idx = dir_end_idx
            mdna_block_text = dir_block_text
        else:
            logging.warning("MD&A title block not found in ToC raw lines; skipping")
            return None, None

    # --- 2) STRICT page number association: page number after MD&A match, else next 3 lines ---
    mdna_page_details = resolve_page_number_for_title_block_strict_with_details(
        raw_lines,
        title_start_idx=mdna_start_idx,
        title_end_idx=mdna_end_idx,
        title_re=mdna_title_re,
        max_page=max_page,
        lookahead_lines=3,
    )

    # Special case: MD&A is a sub-entry under "Board's Report including" with no page number.
    # In this layout, the next numeric line belongs to the next sibling (e.g., Annexures),
    # so we inherit the parent's page (e.g., 4) and end at the next TRUE top-level section
    # (e.g., Corporate Governance at 17 -> end 16).
    inherited_parent_for_mdna = False
    if mdna_page_details:
        page_line_idx = mdna_page_details.get("page_line_idx")
        if isinstance(page_line_idx, int) and page_line_idx > (mdna_end_idx + 1):
            prev_txt = raw_lines[page_line_idx - 1].get("line_text", "")
            if re.search(r"[A-Za-z]", prev_txt or "") and not mdna_title_re.search(prev_txt or ""):
                parent_page = None
                parent_page_line_idx = None
                parent_title = None

                for back in range(mdna_start_idx - 1, max(-1, mdna_start_idx - 12), -1):
                    t = (raw_lines[back].get("line_text", "") or "").strip()
                    if not t:
                        continue
                    mnum = re.fullmatch(r"\s*(\d{1,3})\s*", t)
                    if not mnum:
                        continue
                    n = int(mnum.group(1))
                    if not (1 <= n <= max_page):
                        continue

                    parent_page = n
                    parent_page_line_idx = back

                    for tt in range(back - 1, max(-1, back - 10), -1):
                        cand = (raw_lines[tt].get("line_text", "") or "").strip()
                        if cand and re.search(r"[A-Za-z]", cand):
                            parent_title = cand
                            break

                    break

                if parent_page is not None and parent_title:
                    parent_l = parent_title.lower()
                    if ("including" in parent_l) and ("report" in parent_l) and ("board" in parent_l) and ("directors" not in parent_l):
                        logging.info(
                            "MD&A appears as sub-entry; inheriting parent start page %s from '%s'",
                            parent_page,
                            parent_title,
                        )
                        inherited_parent_for_mdna = True
                        mdna_page_details = {
                            "page": parent_page,
                            "page_span_start": None,
                            "page_span_end": None,
                            "page_line_idx": parent_page_line_idx,
                        }

    if not mdna_page_details:
        logging.warning("MD&A start page not found within strict 3-line window; skipping")
        return None, None

    start_page = mdna_page_details["page"]

    # --- 3) Determine next section page (including same-line multi-column cases) ---
    def _next_page_number_in_same_line(line_text: str, after_pos: int, current_start_page: int):
        # Prefer numbers that occur after the current entry's page span (when ordering is preserved).
        res = _first_valid_page_number_after_pos_with_span(line_text, after_pos, max_page)
        if res:
            n, _, _ = res
            if n > current_start_page:
                return n

        # Fallback for multi-column merges where extraction order may be scrambled within the same line:
        # choose the smallest page number on the line that is greater than the current start page.
        candidates = []
        for m in _PAGE_INT_RE.finditer(line_text or ""):
            try:
                n = int(m.group(1))
            except ValueError:
                continue
            if current_start_page < n <= max_page:
                candidates.append(n)

        return min(candidates) if candidates else None

    def _next_section_start_page_after_line(after_line_idx: int, current_start_page: int):
        for j in range(after_line_idx + 1, len(raw_lines)):
            txt = raw_lines[j].get("line_text", "")
            if not re.search(r"[A-Za-z]", txt or ""):
                continue

            # Avoid treating the same MD&A title again
            if mdna_title_re.search(txt or ""):
                continue

            candidate = resolve_page_number_strict(raw_lines, j, max_page=max_page, lookahead_lines=3)
            if candidate is None:
                continue

            if candidate > current_start_page:
                return candidate

        return None

    def _next_top_level_section_page_after_line(after_line_idx: int, current_start_page: int):
        top_level_re = re.compile(
            r"\b(corporate\s+governance|auditors?\s*[’']?\s*report|independent\s+auditor|balance\s+sheet|statement\s+of\s+profit|cash\s+flow\s+statement|notes)\b",
            re.IGNORECASE,
        )
        for j in range(after_line_idx + 1, len(raw_lines)):
            txt = raw_lines[j].get("line_text", "")
            if not re.search(r"[A-Za-z]", txt or ""):
                continue
            if not top_level_re.search(txt or ""):
                continue

            candidate = resolve_page_number_strict(raw_lines, j, max_page=max_page, lookahead_lines=3)
            if candidate is None:
                continue
            if candidate > current_start_page:
                return candidate

        return None

    same_line_idx = mdna_page_details["page_line_idx"]
    same_line_text = raw_lines[same_line_idx].get("line_text", "")
    same_line_next_page = _next_page_number_in_same_line(
        same_line_text,
        after_pos=mdna_page_details.get("page_span_end") or 0,
        current_start_page=start_page,
    )

    next_section_page = same_line_next_page
    if next_section_page is None:
        if inherited_parent_for_mdna:
            next_section_page = _next_top_level_section_page_after_line(mdna_end_idx, start_page)
        if next_section_page is None:
            next_section_page = _next_section_start_page_after_line(mdna_end_idx, start_page)

    end_page = max_page if next_section_page is None else (next_section_page - 1)

    if end_page < start_page:
        logging.warning("Computed invalid MD&A range: start=%s end=%s; skipping", start_page, end_page)
        return None, None

    # --- 4) Exclusion ranges (handle same-line next-section; do not require resolving ALL exclusions) ---
    def _excluded_ranges():
        ranges = []

        for title_re, anchor_re in disallowed:
            # If MD&A was found inside the Directors' Report, don't treat Directors' Report as an exclusion
            if directors_re.pattern == title_re.pattern and mdna_title_re.search(mdna_block_text or ""):
                if directors_re.search(mdna_block_text or ""):
                    continue

            ex_start_idx, ex_end_idx, ex_block_text = find_title_block_strict(
                raw_lines,
                title_re,
                max_join_lines=3,
                anchor_re=anchor_re,
            )

            if ex_start_idx is None:
                continue

            ex_details = resolve_page_number_for_title_block_strict_with_details(
                raw_lines,
                title_start_idx=ex_start_idx,
                title_end_idx=ex_end_idx,
                title_re=title_re,
                max_page=max_page,
                lookahead_lines=3,
            )

            # If exclusion exists but cannot be aligned within strict window, we cannot
            # form a reliable range; skip enforcing that specific exclusion.
            if not ex_details:
                logging.warning("Excluded section found but page number not aligned; ignoring exclusion: %s", ex_block_text)
                continue

            ex_start_page = ex_details["page"]

            ex_same_line_idx = ex_details["page_line_idx"]
            ex_same_line_text = raw_lines[ex_same_line_idx].get("line_text", "")
            ex_same_line_next = _next_page_number_in_same_line(
                ex_same_line_text,
                after_pos=ex_details.get("page_span_end") or 0,
                current_start_page=ex_start_page,
            )

            ex_next_page = ex_same_line_next
            if ex_next_page is None:
                ex_next_page = _next_section_start_page_after_line(ex_end_idx, ex_start_page)

            # Critical safety for multi-column/boxed ToCs:
            # if an excluded section starts before MD&A (by page number), it must end no later
            # than the MD&A start page (as both are ToC-derived section starts), even if the
            # extracted line order is scrambled.
            if ex_start_page < start_page:
                if ex_next_page is None or start_page < ex_next_page:
                    ex_next_page = start_page

            ex_end_page = max_page if ex_next_page is None else (ex_next_page - 1)

            ranges.append(
                {
                    "title": ex_block_text,
                    "start": ex_start_page,
                    "end": ex_end_page,
                }
            )

        return ranges

    ex_ranges = _excluded_ranges()

    for r in ex_ranges:
        if r["start"] <= start_page <= r["end"]:
            logging.warning(
                "MD&A start page %s falls inside excluded section '%s' (%s-%s); skipping",
                start_page,
                r["title"],
                r["start"],
                r["end"],
            )
            return None, None

    logging.info("MD&A boundaries (STRICT ToC blocks): start=%s, end=%s", start_page, end_page)
    return start_page, end_page


def detect_mdna_boundaries(pages_text, toc_start_page=None, company_folder: str | None = None):
    """Detect MD&A boundaries using STRICT Table of Contents (ToC) rules ONLY.

    Required behavior:
      - ToC is the only source of truth (first 5 pages only).
      - Treat ToC as STRUCTURAL BLOCKS: titles may be wrapped across lines.
      - When MD&A title is detected, search ONLY next 3 extracted lines for page number.
      - Deterministic alignment, no body-text inference; if not resolvable, SKIP.

    Amit_spinning behavior (hybrid, still ToC-only):
      - First try the normal strict-ToC detector (works for Amit PDFs that look like other companies).
      - If it fails to resolve boundaries, fall back to INDEX-style detection for Amit PDFs whose
        MD&A is embedded under "Board’s Report Including ...".

    Returns:
      (start_page, end_page) or (None, None)
    """

    max_page = len(pages_text)
    if max_page <= 0:
        logging.warning("Empty document; cannot detect MD&A boundaries")
        return None, None

    raw_lines = collect_toc_raw_lines(pages_text, max_pages=5)
    if not raw_lines:
        logging.warning("No ToC raw lines detected in the first 5 pages")
        return None, None

    # 1) Always try the normal strict-ToC path first.
    start_page, end_page = _detect_mdna_boundaries_strict_toc(raw_lines, max_page=max_page)
    if start_page is not None and end_page is not None:
        return start_page, end_page

    # 2) Amit_spinning fallback: INDEX-style.
    if company_folder == "Amit_spinning":
        return _detect_mdna_boundaries_amit_spinning_index(raw_lines, max_page=max_page)

    return None, None
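
The end-page rule used by the strict detector (a section ends one page before the next ToC entry with a larger page number, or at the last page if there is none) can be illustrated on a toy ToC. This is only a minimal sketch, not the notebook's detector: `section_range` and `toc` are hypothetical names, the ToC is assumed to be already parsed into `(title, page)` pairs, and the smallest later page is used instead of walking the extracted lines in order.

```python
from typing import Optional


def section_range(toc_entries, title: str, max_page: int) -> Optional[tuple]:
    """Return (start, end) for `title`; end is one page before the next larger ToC page."""
    # Hypothetical helper: looks up the start page by exact title match.
    start = next((page for name, page in toc_entries if name == title), None)
    if start is None:
        return None
    # Candidate "next section" pages: any ToC page strictly after the start page.
    later_starts = [page for _, page in toc_entries if start < page <= max_page]
    end = min(later_starts) - 1 if later_starts else max_page
    return (start, end)


# Invented sample ToC (printed page numbers).
toc = [
    ("Directors' Report", 4),
    ("Management Discussion and Analysis", 11),
    ("Corporate Governance", 17),
    ("Independent Auditor's Report", 30),
]

print(section_range(toc, "Management Discussion and Analysis", max_page=80))  # (11, 16)
```

When no later entry exists, the range runs to `max_page`, which mirrors the `end_page = max_page` fallback in the code above.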


### 7. MD&A Text Extraction (Boundary-Aware)

In [23]:
def extract_mdna_text(pages_text, start_page, end_page):
    """Extract raw MD&A text from PDF pages using detected boundaries.

    Args:
        pages_text: List of page dictionaries from PDFInterface.get_pages_text(),
                    each like {"page_number": int, "text": str}
        start_page: 1-based start page number (inclusive)
        end_page: 1-based end page number (inclusive)

    Returns:
        str: Concatenated raw MD&A text (no cleaning), with double newlines
             inserted between pages.
    """

    included_text_chunks = []
    included_pages = 0

    for page in pages_text:
        page_no = page.get("page_number")
        if page_no is None:
            continue

        if start_page <= page_no <= end_page:
            included_pages += 1
            included_text_chunks.append(page.get("text", ""))

    mdna_text = "\n\n".join(included_text_chunks)

    logging.info("MD&A pages included: %d", included_pages)
    logging.info("Extracted MD&A text length (chars): %d", len(mdna_text))

    return mdna_text


# mdna_text = extract_mdna_text(pages, start_page, end_page)

# print("MD&A preview:\n")
# print(mdna_text[:1000])



2025-12-29 22:50:38,643 | INFO | root | MD&A pages included: 6
2025-12-29 22:50:38,655 | INFO | root | Extracted MD&A text length (chars): 18561


MD&A preview:

56 | AMTEK AUTO LIMITED
AMTEK AUTO LIMITED
MANAGEMENT DISCUSSION AND ANALYSIS REPORT
1.
GLOBAL ECONOMIC OVERVIEW
The overall performance of the global economy remained subdued through 2014, as well as into 2015. The world
economy grew 3.4% in 2014, impacted by a slowdown in many developing countries, which account for approximately
75% of the world economy. According to the International Monetary Fund (IMF), the global GDP growth rate is expected
to decline further by 30bps to 3.1% in 2015. A modest pickup in advanced economies and continued challenges in
emerging markets are the major factors behind these lower projections. The GDP growth for emerging markets and
developing countries in 2015 is expected to decline by 60bps to 4.0%, owing to weaker economic growth in the oil
exporting countries, a slowdown in China and expected negative growth in Brazil.
2.
INDIAN ECONOMIC OVERVIEW
During fiscal year 2015, the Indian economy started to show signs of a recovery after a pr
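
To make the inclusive, 1-based range semantics concrete, here is a small self-contained sketch. `join_pages` is a hypothetical stand-in that mirrors the filtering and joining in `extract_mdna_text` without the logging, and the sample pages are invented.

```python
def join_pages(pages_text, start_page, end_page):
    """Concatenate the text of pages whose number lies in [start_page, end_page]."""
    chunks = [
        page.get("text", "")
        for page in pages_text
        # Pages without a page number are skipped, as in the original function.
        if page.get("page_number") is not None
        and start_page <= page["page_number"] <= end_page
    ]
    # Pages are separated by a blank line.
    return "\n\n".join(chunks)


sample_pages = [
    {"page_number": 1, "text": "Cover"},
    {"page_number": 2, "text": "MD&A part one"},
    {"page_number": 3, "text": "MD&A part two"},
    {"page_number": 4, "text": "Corporate Governance"},
]

print(repr(join_pages(sample_pages, 2, 3)))  # 'MD&A part one\n\nMD&A part two'
```

A range that matches no page yields an empty string rather than an error, so callers may want to check the result length before cleaning.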

### 8. MD&A Text Cleaning & Artifact Removal

In [24]:
def clean_mdna_text(raw_text):
    """Conservatively clean extracted MD&A text.

    What this does (conservative heuristics):
    - Removes obvious repeating headers/footers (e.g., 'Annual Report ...', standalone page numbers,
      and very header-like company-name lines when they appear repeatedly).
    - Removes obvious table artifacts (high digit-density lines, and separator-only lines).
    - Normalizes whitespace while preserving paragraph breaks.

    What this does NOT do:
    - Does not lowercase text
    - Does not remove punctuation
    - Does not change wording

    Args:
        raw_text: str

    Returns:
        str: cleaned MD&A text
    """

    if raw_text is None:
        raw_text = ""

    original_len = len(raw_text)

    # Split into lines to enable conservative line-based removals.
    lines = raw_text.splitlines()

    # Pre-compute line frequencies (normalized) to detect repeated headers/footers.
    def _norm_line_for_freq(line: str) -> str:
        return re.sub(r"\s+", " ", (line or "").strip())

    normalized_lines = [_norm_line_for_freq(ln) for ln in lines]
    freq = {}
    for nl in normalized_lines:
        if not nl:
            continue
        freq[nl] = freq.get(nl, 0) + 1

    annual_report_re = re.compile(r"^\s*Annual\s+Report(?:\s+\d{4}\s*[-–]\s*\d{2,4})?\s*$", re.IGNORECASE)
    standalone_page_no_re = re.compile(r"^\s*(?:page\s*)?\d{1,4}\s*$", re.IGNORECASE)
    separators_only_re = re.compile(r"^[\s\-_=*~•·\.\|:;,+/\\]+$")

    def _is_company_name_like(line: str) -> bool:
        # Conservative: only remove if it looks like a standalone header/footer line.
        s = (line or "").strip()
        if not s:
            return False
        if len(s) > 90:
            return False

        # Must contain common company suffixes.
        if not re.search(r"\b(LIMITED|LTD\.?|PVT\.?\s+LTD\.?|PRIVATE\s+LIMITED)\b", s, flags=re.IGNORECASE):
            return False

        # Must be mostly uppercase (typical header styling).
        letters = re.findall(r"[A-Za-z]", s)
        if not letters:
            return False
        upper_letters = sum(1 for ch in letters if ch.isupper())
        if upper_letters / len(letters) < 0.8:
            return False

        # Keep it short in words (header/footer line).
        if len(s.split()) > 10:
            return False

        return True

    def _is_high_numeric_density(line: str) -> bool:
        s = (line or "").strip()
        if len(s) < 12:
            return False
        # Density computed over non-space characters.
        compact = re.sub(r"\s+", "", s)
        if not compact:
            return False
        digits = sum(1 for ch in compact if ch.isdigit())
        return (digits / len(compact)) > 0.40

    cleaned_lines = []

    for raw_line, norm_line in zip(lines, normalized_lines):
        s = (raw_line or "").rstrip()
        sn = norm_line

        # Remove obvious separators/formatting-only lines.
        if sn and separators_only_re.match(sn):
            continue

        # Remove obvious page numbers (standalone).
        if sn and standalone_page_no_re.match(sn):
            continue

        # Remove 'Annual Report' headers/footers.
        if sn and annual_report_re.match(sn):
            continue

        # Remove repeated header/footer-like company-name lines conservatively.
        # (Only if repeated AND header-ish AND not too long; 'Annual Report' and
        # standalone page-number lines were already dropped above.)
        if sn and freq.get(sn, 0) >= 3 and _is_company_name_like(sn):
            continue

        # Remove obvious table artifacts: high numeric density.
        if _is_high_numeric_density(sn):
            continue

        # Whitespace normalization inside the line.
        s = re.sub(r"[ \t]{2,}", " ", s).strip(" ")
        cleaned_lines.append(s)

    # Re-join with newlines and normalize paragraph spacing.
    cleaned_text = "\n".join(cleaned_lines)

    # Reduce 3+ consecutive newlines to at most 2.
    cleaned_text = re.sub(r"\n{3,}", "\n\n", cleaned_text)

    # Trim leading/trailing whitespace/newlines.
    cleaned_text = cleaned_text.strip()

    logging.info("MD&A text original length (chars): %d", original_len)
    logging.info("MD&A text cleaned length (chars): %d", len(cleaned_text))

    return cleaned_text

# Sanity test:
# cleaned_text = clean_mdna_text(mdna_text)

# print("Cleaned MD&A preview:\n")
# print(cleaned_text[:1200])


2025-12-29 15:06:51,628 | INFO | root | MD&A text original length (chars): 39415
2025-12-29 15:06:51,629 | INFO | root | MD&A text cleaned length (chars): 37792


Cleaned MD&A preview:

20 | AMTEK AUTO LIMITED
SECRETARIAL AUDIT REPORT
The Board has appointed M/s S. Khurana & Associates, Company Secretaries, to conduct Secretarial Audit for the financial
year 2015-16. The Secretarial Audit Report for the financial year ended March 31, 2016 is annexed herewith marked as
Annexure - I to this Report. The Secretarial Audit Report does not contain any qualification, reservation or adverse remark.
As per the directive of Securities and Exchange Board of India, M/s S. Khurana & Associates Company Secretaries, New
Delhi, undertook the Reconciliation of Share Capital Audit on a quarterly basis. The purpose of the audit is to reconcile the
total number of shares held in National Securities Depository Limited (NSDL), Central Depository Services (India) Limited
(CDSL) and in physical form with the respect to admitted, issued and paid up capital of the Company.
CORPORATE GOVERNANCE
The Company is committed to maintain high standards of Corporate Governance an
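
The numeric-density rule is easiest to judge on concrete lines. The sketch below reuses the same thresholds as `clean_mdna_text` (at least 12 characters, more than 40% digits among non-space characters). `digit_density` and `looks_like_table_row` are hypothetical helper names, and the sample lines are invented.

```python
import re


def digit_density(line: str) -> float:
    """Share of digits among the non-whitespace characters of a line."""
    compact = re.sub(r"\s+", "", line)
    return sum(ch.isdigit() for ch in compact) / len(compact) if compact else 0.0


def looks_like_table_row(line: str, min_len: int = 12, threshold: float = 0.40) -> bool:
    """Flag lines that are long enough and dominated by digits."""
    s = line.strip()
    return len(s) >= min_len and digit_density(s) > threshold


samples = [
    "Revenue 1,234.56 2,345.67 3,456.78",  # table-like row
    "The world economy grew 3.4% in 2014, impacted by a slowdown.",  # narrative
    "2014 2015",  # too short to be judged
]

for line in samples:
    print(looks_like_table_row(line), "|", line)
```

Narrative sentences that quote a few figures stay well below the threshold, which is why the heuristic is described as conservative.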

### 9. MD&A Quality Verification & Validation Metrics

In [25]:
def verify_mdna_quality(mdna_text, pages_match_toc: bool):
    """Verify extracted/cleaned MD&A text quality using STRICT semantics.

    quality_passed MUST be TRUE only if:
      - Extracted pages match the ToC-declared MD&A range exactly (pages_match_toc=True)
      - Text contains at least 2 MD&A-specific phrases (case-insensitive)

    Guardrail:
      - Audit / Corporate / Director section signals must ALWAYS fail quality.

    Returns:
        dict with keys:
          - word_count (int)
          - narrative_density (float)
          - keyword_hits (list[str])
          - quality_passed (bool)
    """

    if mdna_text is None:
        mdna_text = ""

    # a) Word count
    words = re.findall(r"\b\w+\b", mdna_text)
    word_count = len(words)

    # b) Narrative density (alphabetic chars / total chars)
    total_chars = len(mdna_text)
    alpha_chars = sum(1 for ch in mdna_text if ch.isalpha())
    narrative_density = (alpha_chars / total_chars) if total_chars else 0.0

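    # Normalize typographic apostrophes and collapse whitespace so that phrases
    # wrapped across lines or written with curly quotes still match below.
    lower_text = re.sub(r"\s+", " ", mdna_text.lower().replace("’", "'").replace("‘", "'"))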

    # c) Required MD&A phrases
    mdna_phrases = [
        "industry outlook",
        "opportunities and threats",
        "risk management",
        "future outlook",
        "segment performance",
        "global economy",
    ]

    keyword_hits = [p for p in mdna_phrases if p in lower_text]

    # Disqualifiers: MUST always fail
    disqualifiers = [
        "independent auditor",
        "auditor's report",
        "auditors' report",
        "auditors report",
        "secretarial audit",
        "form mr-3",
        "mr-3",
        "corporate information",
        "corporate governance",
        "directors' report",
        "director's report",
        "directors report",
    ]

    disqualifier_hit = any(bad in lower_text for bad in disqualifiers)

    criteria_pages_match = bool(pages_match_toc)
    criteria_phrases = len(keyword_hits) >= 2

    quality_passed = bool(criteria_pages_match and criteria_phrases and (not disqualifier_hit))

    logging.info("MD&A quality — word_count: %d", word_count)
    logging.info("MD&A quality — narrative_density: %.4f", narrative_density)
    logging.info("MD&A quality — keyword_hits (%d): %s", len(keyword_hits), keyword_hits)
    logging.info("MD&A quality — pages_match_toc: %s", criteria_pages_match)
    logging.info("MD&A quality — disqualifier_hit: %s", disqualifier_hit)

    if quality_passed:
        logging.info("MD&A quality PASSED")
    else:
        logging.warning(
            "MD&A quality FLAGGED (pages_match_toc=%s, phrases_ok=%s, disqualifier_hit=%s)",
            criteria_pages_match,
            criteria_phrases,
            disqualifier_hit,
        )

    return {
        "word_count": word_count,
        "narrative_density": float(narrative_density),
        "keyword_hits": keyword_hits,
        "quality_passed": quality_passed,
    }
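
The pass/fail rule can be read as a single predicate: the page range must match, at least two MD&A phrases must appear, and no disqualifier may appear. A minimal sketch with shortened phrase lists follows; `passes_quality`, `MDNA_PHRASES`, and `DISQUALIFIERS` are hypothetical names and the sample sentences are invented.

```python
# Shortened, illustrative lists; the notebook's lists are longer.
MDNA_PHRASES = ["industry outlook", "opportunities and threats", "risk management"]
DISQUALIFIERS = ["independent auditor", "corporate governance", "secretarial audit"]


def passes_quality(text: str, pages_match_toc: bool) -> bool:
    """Apply the strict rule: range match AND >= 2 phrase hits AND no disqualifier."""
    lower = text.lower()
    hits = [p for p in MDNA_PHRASES if p in lower]
    disqualified = any(bad in lower for bad in DISQUALIFIERS)
    return pages_match_toc and len(hits) >= 2 and not disqualified


clean = "Industry outlook remains stable. Risk management practices were reviewed."
mixed = clean + " Refer to the Corporate Governance report."

print(passes_quality(clean, pages_match_toc=True))   # True
print(passes_quality(mixed, pages_match_toc=True))   # False
print(passes_quality(clean, pages_match_toc=False))  # False
```

Note that a single passing mention of a disqualifier term anywhere in the text is enough to fail the check, so the guardrail is deliberately strict.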


In [49]:
import re


def _norm_for_match(s: str) -> str:
    s = (s or "").lower()
    # normalize common apostrophes and whitespace
    s = s.replace("’", "'").replace("‘", "'")
    s = re.sub(r"\s+", " ", s).strip()
    return s


def _extract_printed_page_number_from_page_text(text: str) -> int | None:
    """Best-effort parse of the *printed* page number from footer/header.

    Heuristic: look at last ~15 non-empty lines and pick a small, standalone integer
    (e.g., "23", "- 23 -", "Page 23").
    """

    lines = [ln.strip() for ln in (text or "").splitlines() if (ln or "").strip()]
    if not lines:
        return None

    tail = lines[-15:]

    patterns = [
        re.compile(r"^page\s*(\d{1,4})\s*$", re.IGNORECASE),
        re.compile(r"^[-–—]*\s*(\d{1,4})\s*[-–—]*$"),
        re.compile(r"^\(?\s*(\d{1,4})\s*\)?$"),
    ]

    for ln in reversed(tail):
        # avoid matching years or long numeric strings
        if re.search(r"\b(19|20)\d{2}\b", ln):
            continue
        if len(ln) > 20:
            continue

        for pat in patterns:
            m = pat.match(ln)
            if not m:
                continue
            try:
                n = int(m.group(1))
            except ValueError:
                continue
            if 1 <= n <= 5000:
                return n

    return None


def build_printed_to_pdf_page_map(pages_text: list[dict]) -> dict[int, int]:
    """Map printed page number -> PDF page index (1-based)."""

    out: dict[int, int] = {}
    for p in pages_text:
        pdf_page = p.get("page_number")
        if not isinstance(pdf_page, int):
            continue
        printed = _extract_printed_page_number_from_page_text(p.get("text", "") or "")
        if printed is None:
            continue
        # Keep the first occurrence (most docs have a 1-1 mapping)
        out.setdefault(int(printed), int(pdf_page))
    return out


class PageOffsetSolver:
    """Compute delta between printed ToC page numbers and PDF page indices.

    Anchor-based approach:
      1) Find anchor ('Independent Auditor's Report') in ToC lines and read its ToC page number.
      2) Find the same anchor in the actual PDF pages (header/title text) to get PDF page index.
      3) delta = pdf_page_index - toc_page_number

    `find_offset()` returns 0 if the anchor can't be resolved safely.
    """

    def __init__(self, doc: fitz.Document, pages_text: list[dict], toc_max_pages: int = 15):
        self.doc = doc
        self.pages_text = pages_text
        self.toc_raw_lines = collect_toc_raw_lines(pages_text, max_pages=toc_max_pages)

    def find_offset(self) -> int:
        toc_anchor_page = self._find_anchor_page_in_toc()
        if toc_anchor_page is None:
            return 0

        pdf_anchor_page = self._find_anchor_page_in_pdf()
        if pdf_anchor_page is None:
            return 0

        return int(pdf_anchor_page - toc_anchor_page)

    def _find_anchor_page_in_toc(self) -> int | None:
        # Match variants like: Independent Auditor’s Report / Independent Auditors' Report
        anchor_re = re.compile(r"\bindependent\s+auditors?\s*[’']?\s*report\b", re.IGNORECASE)
        max_page = len(self.pages_text)

        for i, item in enumerate(self.toc_raw_lines):
            line = item.get("line_text", "") or ""
            if not anchor_re.search(line):
                continue

            page = resolve_page_number_strict(self.toc_raw_lines, i, max_page=max_page, lookahead_lines=3)
            if isinstance(page, int) and page >= 1:
                return page

        return None

    def _find_anchor_page_in_pdf(self, search_limit: int = 250) -> int | None:
        anchor_re = re.compile(r"\bindependent\s+auditors?\s*[’']?\s*report\b", re.IGNORECASE)
        limit = min(search_limit, self.doc.page_count)

        for page_idx0 in range(limit):
            page = self.doc.load_page(page_idx0)
            text = page.get_text("text") or ""

            # Note: a full-page search subsumes a header-only check, so the first page
            # that mentions the anchor anywhere (possibly the ToC page itself) is returned.
            if anchor_re.search(text):
                return page_idx0 + 1

        return None


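def validate_and_adjust_start_page(doc: fitz.Document, candidate_start_page_1based: int) -> int:
    """Validate the MD&A start page with a keyword check.

    On a miss, probe up to 3 pages on either side (nearest first, earlier page before
    later page); if no page matches, return the candidate unchanged.
    """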

    keywords_re = re.compile(r"management\s+discussion|\bstructure\b|\boutlook\b", re.IGNORECASE)

    def _page_has_keywords(page_1based: int) -> bool:
        if not (1 <= page_1based <= doc.page_count):
            return False
        text = doc.load_page(page_1based - 1).get_text("text") or ""
        return bool(keywords_re.search(text))

    if _page_has_keywords(candidate_start_page_1based):
        return candidate_start_page_1based

    for delta in range(1, 4):
        for p in (candidate_start_page_1based - delta, candidate_start_page_1based + delta):
            if _page_has_keywords(p):
                return p

    return candidate_start_page_1based
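
The two page-alignment steps above reduce to simple arithmetic: a constant offset `delta = pdf_page_index - toc_page_number` applied to printed page numbers, followed by a nearest-first probe around the candidate start page. The sketch below illustrates both under invented numbers; `to_pdf_page` and `probe_order` are hypothetical helpers, not functions from the notebook.

```python
def to_pdf_page(printed_page: int, delta: int, max_page: int) -> int:
    """Shift a printed page number by the anchor offset and clamp to the document."""
    return max(1, min(printed_page + delta, max_page))


def probe_order(candidate: int, radius: int = 3) -> list:
    """Pages checked by the validator: the candidate, then -1, +1, -2, +2, ..."""
    order = [candidate]
    for d in range(1, radius + 1):
        order.extend([candidate - d, candidate + d])
    return order


# Invented example: the anchor is printed as page 30 but sits on PDF page 34.
delta = 34 - 30

print(to_pdf_page(11, delta, max_page=120), to_pdf_page(16, delta, max_page=120))  # 15 20
print(probe_order(15))  # [15, 14, 16, 13, 17, 12, 18]
```

In the notebook's validator, out-of-range pages in this order are simply treated as non-matching, so no extra bounds check is needed in the probe loop.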


In [50]:
from pathlib import Path
from datetime import datetime

# -----------------------------
# End-to-end MD&A Extraction Pipeline (STRICT ToC-Only Boundaries)
# -----------------------------
PDF_ROOT = Path("../data/pdfs")

# Prefer the configured OUTPUT_DIR if present; otherwise default to ../output
try:
    output_dir = OUTPUT_DIR
except NameError:
    output_dir = Path("../output")

output_dir.mkdir(parents=True, exist_ok=True)

pdf_paths = sorted(PDF_ROOT.rglob("*.pdf"))
logging.info("Discovered %d PDFs under %s", len(pdf_paths), PDF_ROOT)

results = []

attempted = 0
succeeded = 0
skipped_no_boundaries = 0
failed = 0
passed_quality = 0
flagged_quality = 0

for idx, pdf_path in enumerate(pdf_paths, start=1):
    attempted += 1
    company_folder = pdf_path.parent.name

    logging.info("(%d/%d) Processing: %s/%s", idx, len(pdf_paths), company_folder, pdf_path.name)

    pdf = None
    try:
        pdf = PDFInterface(pdf_path)
        pages = pdf.get_pages_text()
        max_page = len(pages)

        company_name = extract_company_name(pages, company_folder=company_folder)
        financial_year = extract_financial_year(pages)

        # STRICT ToC-only boundaries: if not deterministically resolvable from ToC lines, skip.
        start_page, end_page = detect_mdna_boundaries(
            pages_text=pages,
            toc_start_page=None,
            company_folder=company_folder,
        )

        if not start_page or not end_page:
            skipped_no_boundaries += 1
            logging.warning("Skipping (MD&A boundaries not determinable via STRICT ToC rules): %s", pdf_path.name)
            continue

        # --- Printed page -> PDF page mapping (preferred) ---
        printed_to_pdf = build_printed_to_pdf_page_map(pages)

        start_page_pdf = printed_to_pdf.get(int(start_page))
        end_page_pdf = printed_to_pdf.get(int(end_page))

        # --- Anchor-based offset (fallback) ---
        offset = 0
        if start_page_pdf is None or end_page_pdf is None:
            solver = PageOffsetSolver(pdf.doc, pages_text=pages, toc_max_pages=15)
            offset = solver.find_offset()

        if start_page_pdf is None:
            start_page_pdf = start_page + offset
        if end_page_pdf is None:
            end_page_pdf = end_page + offset
        # ------------------------------------

        # Clamp to document bounds
        start_page_pdf = max(1, min(int(start_page_pdf), max_page))
        end_page_pdf = max(1, min(int(end_page_pdf), max_page))
        if end_page_pdf < start_page_pdf:
            end_page_pdf = start_page_pdf

        # Validate start page contains MD&A-ish keywords; else search +/-3.
        # If start shifts, shift end by the same amount to preserve section length.
        original_start_page_pdf = start_page_pdf
        start_page_pdf_validated = validate_and_adjust_start_page(pdf.doc, start_page_pdf)
        shift = int(start_page_pdf_validated - original_start_page_pdf)

        if shift != 0:
            logging.info(
                "Adjusted MD&A start page after validation: %s -> %s (shift=%s)",
                original_start_page_pdf,
                start_page_pdf_validated,
                shift,
            )
            end_page_pdf = end_page_pdf + shift
            end_page_pdf = max(1, min(int(end_page_pdf), max_page))
            if end_page_pdf < start_page_pdf_validated:
                end_page_pdf = start_page_pdf_validated

        start_page_pdf = start_page_pdf_validated

        raw_mdna_text = extract_mdna_text(pages_text=pages, start_page=start_page_pdf, end_page=end_page_pdf)
        cleaned_mdna_text = clean_mdna_text(raw_mdna_text)

        # STRICT: extracted page count must match the ToC-declared MD&A range length
        expected_pages = end_page - start_page + 1
        actual_pages = sum(
            1
            for p in pages
            if isinstance(p.get("page_number"), int) and start_page_pdf <= p.get("page_number") <= end_page_pdf
        )
        pages_match_toc = bool(actual_pages == expected_pages)

        quality_report = verify_mdna_quality(cleaned_mdna_text, pages_match_toc=pages_match_toc)

        if quality_report.get("quality_passed"):
            passed_quality += 1
        else:
            flagged_quality += 1

        results.append(
            {
                "company_folder": company_folder,
                "company_name": company_name,
                "report_file": pdf_path.name,
                "financial_year": financial_year,
                "mdna_start_page": start_page_pdf,
                "mdna_end_page": end_page_pdf,
                "mdna_text": cleaned_mdna_text,
                **quality_report,
            }
        )

        succeeded += 1

    except Exception as e:
        failed += 1
        logging.exception("Failed processing %s: %s", pdf_path, e)

    finally:
        if pdf is not None:
            pdf.close()

results_df = pd.DataFrame(results)

# -----------------------------
# Safe CSV/Excel writes (avoid Windows PermissionError when file is open)
# -----------------------------
run_stamp = datetime.now().strftime("%Y%m%d_%H%M%S")

csv_path = output_dir / "mdna_extracted.csv"
xlsx_path = output_dir / "mdna_extracted.xlsx"

fallback_csv = output_dir / f"mdna_extracted_{run_stamp}.csv"
fallback_xlsx = output_dir / f"mdna_extracted_{run_stamp}.xlsx"

# Excel safety: remove illegal control characters
def _sanitize_for_excel(val):
    if val is None:
        return ""
    if isinstance(val, (list, dict, tuple, set)):
        val = str(val)
    s = str(val)
    return re.sub(r"[\x00-\x08\x0B\x0C\x0E-\x1F]", "", s)

excel_df = results_df.copy()
for col in excel_df.columns:
    excel_df[col] = excel_df[col].map(_sanitize_for_excel)

try:
    results_df.to_csv(csv_path, index=False, encoding="utf-8")
    logging.info("Saved CSV: %s", csv_path)
except PermissionError:
    results_df.to_csv(fallback_csv, index=False, encoding="utf-8")
    logging.warning("CSV locked; saved fallback CSV: %s", fallback_csv)

try:
    excel_df.to_excel(xlsx_path, index=False)
    logging.info("Saved Excel: %s", xlsx_path)
except PermissionError:
    excel_df.to_excel(fallback_xlsx, index=False)
    logging.warning("Excel locked; saved fallback Excel: %s", fallback_xlsx)


2026-01-01 12:54:50,975 - INFO - Discovered 14 PDFs under ..\data\pdfs
2026-01-01 12:54:50,977 - INFO - (1/14) Processing: Alcheimist/5267070319.pdf


2026-01-01 12:54:50,982 - INFO - Loaded PDF: 5267070319.pdf
2026-01-01 12:54:51,381 - INFO - Extracted company name: ALCHEMIST LTD
2026-01-01 12:54:51,383 - INFO - Extracted financial year: 2018-19
2026-01-01 12:54:51,385 - INFO - MD&A boundaries (STRICT ToC blocks): start=23, end=100
2026-01-01 12:54:51,407 - INFO - Adjusted MD&A start page after validation: 23 -> 26 (shift=3)
2026-01-01 12:54:51,409 - INFO - MD&A pages included: 20
2026-01-01 12:54:51,409 - INFO - Extracted MD&A text length (chars): 50690
2026-01-01 12:54:51,424 - INFO - MD&A text original length (chars): 50690
2026-01-01 12:54:51,424 - INFO - MD&A text cleaned length (chars): 49647
2026-01-01 12:54:51,435 - INFO - MD&A quality — word_count: 7935
2026-01-01 12:54:51,436 - INFO - MD&A quality — narrative_density: 0.7799
2026-01-01 12:54:51,437 - INFO - MD&A quality — keyword_hits (1): ['risk management']
2026-01-01 12:54:51,439 - INFO - MD&A quality — pages_match_toc: False
2026-01-01 12:54:51,440 - INFO - MD&A qualit
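
The CSV and Excel writes in the cell above repeat the same try/except pattern. A minimal sketch of how that pattern could be factored into one helper is shown here; `write_with_fallback` and its parameters are hypothetical names, not part of the notebook, and the timestamp format simply mirrors `run_stamp`.

```python
from datetime import datetime
from pathlib import Path


def write_with_fallback(write_fn, primary, fallback=None):
    """Call write_fn(primary); if the file is locked, retry on a fallback path.

    write_fn is any callable that takes a path and writes to it.
    Returns the path that was actually written.
    """
    primary = Path(primary)
    try:
        write_fn(primary)
        return primary
    except PermissionError:
        if fallback is None:
            # Build a timestamped sibling, e.g. mdna_extracted_20260101_125450.csv
            stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            fallback = primary.with_name(f"{primary.stem}_{stamp}{primary.suffix}")
        fallback = Path(fallback)
        write_fn(fallback)
        return fallback
```

With this helper, the CSV write could be expressed as `write_with_fallback(lambda p: results_df.to_csv(p, index=False, encoding="utf-8"), csv_path)`, and the returned path logged afterwards.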

In [51]:
# Quick summary of how many rows were extracted
print("\nPipeline output summary")
print("- results_df shape:", results_df.shape)
print("- quality passed:", int((results_df["quality_passed"] == True).sum()) if "quality_passed" in results_df.columns else "N/A")
print("- quality flagged:", int((results_df["quality_passed"] == False).sum()) if "quality_passed" in results_df.columns else "N/A")



Pipeline output summary
- results_df shape: (14, 11)
- quality passed: 0
- quality flagged: 14
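
Since every row is flagged, it helps to see which check each row fails. The sketch below is an illustration only: `failed_checks` is a hypothetical helper, `min_keyword_hits=2` is the one threshold the notebook states, and `min_word_count` and `min_density` are placeholder values. The two sample rows use values printed by the diagnostics cell.

```python
def failed_checks(row, min_keyword_hits=2, min_word_count=500, min_density=0.5):
    """Return the names of the checks that one result row fails.

    Only min_keyword_hits=2 is taken from the notebook; the other two
    thresholds are placeholders to be replaced with the real gate values.
    """
    failures = []
    hits = row.get("keyword_hits")
    if not isinstance(hits, list) or len(hits) < min_keyword_hits:
        failures.append("keyword_hits")
    if (row.get("word_count") or 0) < min_word_count:
        failures.append("word_count")
    if (row.get("narrative_density") or 0.0) < min_density:
        failures.append("narrative_density")
    return failures


sample_rows = [
    {"report_file": "5267070319.pdf", "word_count": 7935,
     "narrative_density": 0.779926, "keyword_hits": ["risk management"]},
    {"report_file": "67050526707.pdf", "word_count": 25113,
     "narrative_density": 0.776934,
     "keyword_hits": ["risk management", "segment performance"]},
]
for row in sample_rows:
    print(row["report_file"], failed_checks(row))
```

Under these assumptions the first row fails only on `keyword_hits` and the second fails none of the three checks. Because the real gate still flags all 14 rows, it presumably uses stricter thresholds or a criterion not modelled here, such as the `pages_match_toc` value that appears in the log. On the full frame, something like `results_df.apply(lambda r: failed_checks(r.to_dict()), axis=1)` would give the same breakdown per row.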


In [52]:
# Diagnostics: sanity-check extracted MD&A text looks like MD&A (not Audit/Directors)
import re

if "results_df" not in globals() or results_df is None or results_df.empty:
    print("results_df is empty; run the pipeline cell first")
else:
    df = results_df.copy()

    def _contains(pat: str, s) -> bool:
        # Non-string values (None, NaN) are treated as empty text
        text = s if isinstance(s, str) else ""
        return bool(re.search(pat, text, flags=re.IGNORECASE))

    df["has_mdna_heading"] = df["mdna_text"].map(lambda s: _contains(r"management\s+discussion", s))
    df["has_structure_or_outlook"] = df["mdna_text"].map(lambda s: _contains(r"\bstructure\b|\boutlook\b", s))
    df["has_audit_terms"] = df["mdna_text"].map(lambda s: _contains(r"independent\s+auditor|auditors?\s*[’']?\s*report", s))
    df["has_directors_report"] = df["mdna_text"].map(lambda s: _contains(r"directors?\s*[’']?\s*report", s))
    df["keyword_hits_count"] = df["keyword_hits"].map(lambda x: len(x) if isinstance(x, list) else 0)

    print("\nMD&A extraction diagnostics")
    print("- has_mdna_heading:", int(df["has_mdna_heading"].sum()), "/", len(df))
    print("- has_structure_or_outlook:", int(df["has_structure_or_outlook"].sum()), "/", len(df))
    print("- has_audit_terms (should be low):", int(df["has_audit_terms"].sum()), "/", len(df))
    print("- has_directors_report (should be low):", int(df["has_directors_report"].sum()), "/", len(df))
    print("- keyword_hits_count>=2 (required by current quality gate):", int((df["keyword_hits_count"] >= 2).sum()), "/", len(df))

    show_cols = [
        "company_folder",
        "report_file",
        "financial_year",
        "mdna_start_page",
        "mdna_end_page",
        "word_count",
        "narrative_density",
        "keyword_hits",
        "has_mdna_heading",
        "has_structure_or_outlook",
        "has_audit_terms",
        "has_directors_report",
        "quality_passed",
    ]
    show_cols = [c for c in show_cols if c in df.columns]

    print("\nSample rows (first 10):")
    print(df[show_cols].head(10).to_string(index=False))



MD&A extraction diagnostics
- has_mdna_heading: 13 / 14
- has_structure_or_outlook: 11 / 14
- has_audit_terms (should be low): 10 / 14
- has_directors_report (should be low): 6 / 14
- keyword_hits_count>=2 (required by current quality gate): 4 / 14

Sample rows (first 10):
company_folder     report_file financial_year  mdna_start_page  mdna_end_page  word_count  narrative_density                           keyword_hits  has_mdna_heading  has_structure_or_outlook  has_audit_terms  has_directors_report  quality_passed
    Alcheimist  5267070319.pdf        2018-19               26             45        7935           0.779926                      [risk management]              True                      True            False                 False           False
    Alcheimist 67050526707.pdf        2019-20               29             74       25113           0.776934 [risk management, segment performance]              True                      True             True                 False 
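
The boolean flags above say whether audit or Directors' Report terms occur, but not where. A term near the end of the text suggests the end page overshoots into the next section, while one near the start is more likely a passing mention. The sketch below reuses the two regular expressions from the diagnostics cell; `locate_foreign_terms` is a hypothetical helper, not part of the pipeline.

```python
import re

# Same patterns as the diagnostics cell above
FOREIGN_TERMS = {
    "auditor_report": re.compile(
        r"independent\s+auditor|auditors?\s*[’']?\s*report", re.IGNORECASE
    ),
    "directors_report": re.compile(r"directors?\s*[’']?\s*report", re.IGNORECASE),
}


def locate_foreign_terms(text):
    """Map each foreign term found to the relative position of its first match.

    Positions run from 0.0 (start of the text) to just under 1.0 (end).
    """
    if not isinstance(text, str) or not text:
        return {}
    positions = {}
    for label, pattern in FOREIGN_TERMS.items():
        match = pattern.search(text)
        if match:
            positions[label] = round(match.start() / len(text), 3)
    return positions
```

Applied as `df["mdna_text"].map(locate_foreign_terms)`, this would add one dictionary per report that can be inspected alongside `mdna_end_page`.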

In [44]:
# Diagnostic: confirm Amit_spinning PDFs were processed in the latest run
if "results_df" in globals() and results_df is not None:
    print("\nresults_df columns:", list(results_df.columns))

    company_col = "company_folder" if "company_folder" in results_df.columns else ("company" if "company" in results_df.columns else None)
    if company_col:
        subset = results_df[results_df[company_col] == "Amit_spinning"].copy()
        print("\nAmit_spinning preview (after running Cell 18):")
        if subset.empty:
            print("(none)")
        else:
            display_cols = [c for c in [
                company_col,
                "report_file",
                "financial_year",
                "mdna_start_page",
                "mdna_end_page",
                "word_count",
                "quality_passed",
            ] if c in subset.columns]
            subset = subset.sort_values(["financial_year"] if "financial_year" in subset.columns else [company_col])
            print(subset[display_cols].to_string(index=False))
    else:
        print("Could not find a company column in results_df")
else:
    print("results_df not found; run Cell 18 first")



results_df columns: ['company_folder', 'company_name', 'report_file', 'financial_year', 'mdna_start_page', 'mdna_end_page', 'mdna_text', 'word_count', 'narrative_density', 'keyword_hits', 'quality_passed']

Amit_spinning preview (after running Cell 18):
company_folder    report_file financial_year  mdna_start_page  mdna_end_page  word_count  quality_passed
 Amit_spinning 5210760315.pdf        2014-15                8             15        5952           False
 Amit_spinning 5210760316.pdf        2015-16                3             16        9156           False
 Amit_spinning 5210760317.pdf        2016-17                3             28       17248           False
 Amit_spinning 5210760318.pdf        2017-18                3             21       12623           False


In [31]:
# Diagnostic: quick look at extracted company names (one per folder)
if "results_df" in globals() and results_df is not None and not results_df.empty:
    if {"company_folder", "company_name"}.issubset(results_df.columns):
        print("\nCompany names (folder -> extracted):")
        pairs = results_df[["company_folder", "company_name"]].drop_duplicates().sort_values("company_folder")
        print(pairs.to_string(index=False))
    else:
        print("company_folder/company_name columns not present")


### Test cell: MD&A Boundary Detection

In [45]:
from pathlib import Path

PDF_ROOT = Path("../data/pdfs")

# Pick ONE representative PDF per company
test_pdfs = {}
for pdf in PDF_ROOT.rglob("*.pdf"):
    company = pdf.parent.name
    if company not in test_pdfs:
        test_pdfs[company] = pdf

# Prefer an INDEX-style Amit_spinning file that exercises the special-case logic
amit_preferred = PDF_ROOT / "Amit_spinning" / "5210760318.pdf"
if amit_preferred.exists():
    test_pdfs["Amit_spinning"] = amit_preferred

print("Testing MD&A boundary detection on sample PDFs:\n")

for company, pdf_path in test_pdfs.items():
    print("=" * 70)
    print(f"Company Folder : {company}")
    print(f"PDF File       : {pdf_path.name}")

    pdf = PDFInterface(pdf_path)
    pages = pdf.get_pages_text()

    start_page, end_page = detect_mdna_boundaries(
        pages_text=pages,
        toc_start_page=None,
        company_folder=company,
    )

    print(f"Detected MD&A Start Page: {start_page}")
    print(f"Detected MD&A End Page  : {end_page}")

    if start_page and end_page:
        assert start_page <= end_page, "Start page must not be after end page"
        assert 1 <= start_page <= len(pages), "Start page out of range"
        assert 1 <= end_page <= len(pages), "End page out of range"
        print("✔ Boundary detection looks valid")
    else:
        print("⚠ MD&A boundaries not detected (may require fallback logic)")

    # Release the file handle before moving on to the next PDF
    pdf.close()

print("\nBoundary detection test completed.")


2026-01-01 12:52:14,244 - INFO - Loaded PDF: 5267070319.pdf


Testing MD&A boundary detection on sample PDFs:

Company Folder : Alcheimist
PDF File       : 5267070319.pdf


2026-01-01 12:52:14,673 - INFO - MD&A boundaries (STRICT ToC blocks): start=23, end=100
2026-01-01 12:52:14,677 - INFO - Loaded PDF: 5210700315.pdf


Detected MD&A Start Page: 23
Detected MD&A End Page  : 100
✔ Boundary detection looks valid
Company Folder : ALok
PDF File       : 5210700315.pdf


2026-01-01 12:52:15,250 - INFO - MD&A boundaries (STRICT ToC blocks): start=51, end=82
2026-01-01 12:52:15,253 - INFO - Loaded PDF: 5210760318.pdf
2026-01-01 12:52:15,333 - INFO - Amit_spinning MD&A boundaries (INDEX): start=1, end=21
2026-01-01 12:52:15,335 - INFO - Loaded PDF: 5200770316.pdf


Detected MD&A Start Page: 51
Detected MD&A End Page  : 82
✔ Boundary detection looks valid
Company Folder : Amit_spinning
PDF File       : 5210760318.pdf
Detected MD&A Start Page: 1
Detected MD&A End Page  : 21
✔ Boundary detection looks valid
Company Folder : Amtek
PDF File       : 5200770316.pdf


2026-01-01 12:52:15,608 - INFO - MD&A boundaries (STRICT ToC blocks): start=63, end=71


Detected MD&A Start Page: 63
Detected MD&A End Page  : 71
✔ Boundary detection looks valid

Boundary detection test completed.
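
This test reports the raw ToC boundaries (start=23 for Alcheimist), whereas the full pipeline log shows the start page moving from 23 to 26 after validation. One plausible way to perform such a correction is to scan a few pages forward for the MD&A heading. The sketch below is an assumption about how that could work, not the notebook's actual validation step: `adjust_start_page`, `max_shift` and `head_lines` are hypothetical names, and pages are assumed to be a list of strings with 1-based page numbers.

```python
import re

HEADING_RE = re.compile(
    r"management\s+discussions?\s+(?:and|&)\s+analysis", re.IGNORECASE
)


def adjust_start_page(pages_text, toc_start, max_shift=5, head_lines=10):
    """Return the first page at or after toc_start whose top lines show the heading.

    pages_text is a list of page strings; page numbers are 1-based.
    Falls back to toc_start when no heading is found within max_shift pages.
    """
    for shift in range(max_shift + 1):
        page_no = toc_start + shift
        if page_no > len(pages_text):
            break
        # Only look at the top of the page, where a section heading would sit
        head = "\n".join(pages_text[page_no - 1].splitlines()[:head_lines])
        if HEADING_RE.search(head):
            return page_no
    return toc_start
```

Restricting the search to the first few lines of each page avoids matching pages that merely mention the MD&A in running text.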


In [33]:
# Diagnostic: test ALL Amit_spinning PDFs (INDEX-style ToC)
from pathlib import Path

amit_dir = Path("../data/pdfs/Amit_spinning")
amit_pdfs = sorted(amit_dir.glob("*.pdf"))
print("\nAmit_spinning PDFs:", [p.name for p in amit_pdfs])

for p in amit_pdfs:
    pdf = PDFInterface(p)
    pages = pdf.get_pages_text()
    s, e = detect_mdna_boundaries(pages_text=pages, toc_start_page=None, company_folder="Amit_spinning")
    print(f"{p.name}: start={s}, end={e}, doc_pages={len(pages)}")
    pdf.close()


2026-01-01 01:30:54,498 - INFO - Loaded PDF: 5210760315.pdf



Amit_spinning PDFs: ['5210760315.pdf', '5210760316.pdf', '5210760317.pdf', '5210760318.pdf']


2026-01-01 01:30:54,751 - INFO - MD&A boundaries (STRICT ToC blocks): start=8, end=15
2026-01-01 01:30:54,755 - INFO - Loaded PDF: 5210760316.pdf
2026-01-01 01:30:54,944 - INFO - MD&A appears as sub-entry; inheriting parent start page 4 from 'Board's Report including'
2026-01-01 01:30:54,947 - INFO - MD&A boundaries (STRICT ToC blocks): start=4, end=16
2026-01-01 01:30:54,950 - INFO - Loaded PDF: 5210760317.pdf


5210760315.pdf: start=8, end=15, doc_pages=46
5210760316.pdf: start=4, end=16, doc_pages=49


2026-01-01 01:30:55,155 - INFO - Amit_spinning MD&A boundaries (INDEX): start=3, end=28
2026-01-01 01:30:55,159 - INFO - Loaded PDF: 5210760318.pdf
2026-01-01 01:30:55,326 - INFO - Amit_spinning MD&A boundaries (INDEX): start=1, end=21


5210760317.pdf: start=3, end=28, doc_pages=49
5210760318.pdf: start=1, end=21, doc_pages=48


In [34]:
# Diagnostic: inspect ToC raw lines for Amit_spinning 5210760316.pdf (why start=8?)
import re
from pathlib import Path

p = Path("../data/pdfs/Amit_spinning") / "5210760316.pdf"
if not p.exists():
    print("Missing:", p)
else:
    pdf = PDFInterface(p)
    pages = pdf.get_pages_text()
    raw = collect_toc_raw_lines(pages, max_pages=5)

    mdna_title_re = re.compile(
        r"\bmanagement(?:\s*[’']?s)?\s+discussion(?:s)?\s+(?:and|&)\s+analysis(?:\s+report)?\b",
        re.IGNORECASE,
    )
    mdna_anchor_re = re.compile(r"\bmanagement\b", re.IGNORECASE)

    print("\n--- Amit_spinning 5210760316.pdf ToC MD&A debug ---")
    print("doc_pages:", len(pages), "toc_lines:", len(raw))

    s_idx, e_idx, block_text = find_title_block_strict(
        raw,
        mdna_title_re,
        max_join_lines=3,
        anchor_re=mdna_anchor_re,
    )

    print("mdna_block:", (s_idx, e_idx))
    print("mdna_block_text:", block_text)

    if s_idx is not None:
        lo = max(0, s_idx - 4)
        up = min(len(raw), e_idx + 8)
        print("\nContext around detected MD&A block:")
        for j in range(lo, up):
            it = raw[j]
            prefix = ">" if s_idx <= j <= e_idx else " "
            print(f"{prefix} [toc_p{it['toc_page']} line{it['line_index']}] {it['line_text']}")

        details = resolve_page_number_for_title_block_strict_with_details(
            raw,
            title_start_idx=s_idx,
            title_end_idx=e_idx,
            title_re=mdna_title_re,
            max_page=len(pages),
            lookahead_lines=3,
        )
        print("\nresolved_details:", details)
        if details:
            line = raw[details["page_line_idx"]].get("line_text") or ""
            nums = [(mm.group(0), mm.start(), mm.end()) for mm in re.finditer(r"\b\d{1,3}\b", line)]
            print("page_line_text:", line)
            print("page_line_nums:", nums)

    # Show final strict detector output
    s1, e1 = _detect_mdna_boundaries_strict_toc(raw, max_page=len(pages))
    print("\nstrict_toc result:", (s1, e1))

    pdf.close()


2026-01-01 01:30:55,346 - INFO - Loaded PDF: 5210760316.pdf
2026-01-01 01:30:55,554 - INFO - MD&A appears as sub-entry; inheriting parent start page 4 from 'Board's Report including'
2026-01-01 01:30:55,559 - INFO - MD&A boundaries (STRICT ToC blocks): start=4, end=16




--- Amit_spinning 5210760316.pdf ToC MD&A debug ---
doc_pages: 49 toc_lines: 216
mdna_block: (61, 62)
mdna_block_text: Management Discussions & Analysis Report

Context around detected MD&A block:
  [toc_p3 line57] Notice
  [toc_p3 line58] 1
  [toc_p3 line59] Board's Report including
  [toc_p3 line60] 4
> [toc_p3 line61] Management Discussions &
> [toc_p3 line62] Analysis Report
  [toc_p3 line63] Annexures to Boards' Report
  [toc_p3 line64] 8
  [toc_p3 line65] Corporate Governance
  [toc_p3 line66] 17
  [toc_p3 line67] Auditor's Report
  [toc_p3 line68] 24
  [toc_p3 line69] Balance Sheet

resolved_details: {'page': 8, 'page_span_start': 0, 'page_span_end': 1, 'page_line_idx': 64}
page_line_text: 8
page_line_nums: [('8', 0, 1)]

strict_toc result: (4, 16)
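
The debug output explains the discrepancy between page 8 and page 4: the MD&A title lines have no page number of their own, so the next numeric line (8) belongs to "Annexures to Boards' Report". The sketch below reproduces both results on the ToC lines printed above. It is not the notebook's `_detect_mdna_boundaries_strict_toc`; the function names are hypothetical, and the rule that a preceding title ending in "including" marks the parent entry is an assumption inferred from the log message.

```python
import re

PAGE_RE = re.compile(r"^\d{1,3}$")
MDNA_RE = re.compile(
    r"management\s+discussions?\s*(?:and|&)\s*analysis", re.IGNORECASE
)


def group_toc_entries(lines):
    """Join consecutive title lines; a purely numeric line closes the entry."""
    entries, title_parts = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if PAGE_RE.match(text):
            entries.append((" ".join(title_parts), int(text)))
            title_parts = []
        else:
            title_parts.append(text)
    return entries


def mdna_start_page(entries):
    """Return the MD&A start page, inheriting the parent's page for sub-entries."""
    for i, (title, page) in enumerate(entries):
        if not MDNA_RE.search(title):
            continue
        parent_title, parent_page = entries[i - 1] if i > 0 else ("", None)
        # Assumed rule: a preceding title ending in "including" is the parent
        if parent_title.lower().endswith("including"):
            return parent_page
        return page
    return None


toc_lines = [
    "Notice", "1",
    "Board's Report including", "4",
    "Management Discussions &", "Analysis Report",
    "Annexures to Boards' Report", "8",
    "Corporate Governance", "17",
]
entries = group_toc_entries(toc_lines)
print(entries[2])                # MD&A title merged with the Annexures entry, page 8
print(mdna_start_page(entries))  # parent page inherited from the Board's Report entry
```

Naive grouping attaches page 8 to the merged MD&A and Annexures title, which matches `resolved_details`, while the inheritance rule returns 4, which matches the strict detector's start page.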


In [35]:
# Diagnostic: show INDEX raw lines for Amit_spinning PDFs 0317/0318
from pathlib import Path

for name in ["5210760317.pdf", "5210760318.pdf"]:
    p = Path("../data/pdfs/Amit_spinning") / name
    pdf = PDFInterface(p)
    pages = pdf.get_pages_text()
    raw = collect_toc_raw_lines(pages, max_pages=5)

    idx_pos = None
    for i, it in enumerate(raw):
        if "INDEX" in (it.get("line_text") or "").upper():
            idx_pos = i
            break

    print("\n---", name, "---")
    print("doc_pages:", len(pages), "index_line:", idx_pos)
    if idx_pos is None:
        pdf.close()
        continue

    for it in raw[idx_pos : idx_pos + 40]:
        print(f"[toc_p{it['toc_page']} line{it['line_index']}] {it['line_text']}")

    pdf.close()


2026-01-01 01:30:55,586 - INFO - Loaded PDF: 5210760317.pdf
2026-01-01 01:30:55,798 - INFO - Loaded PDF: 5210760318.pdf




--- 5210760317.pdf ---
doc_pages: 49 index_line: 49
[toc_p3 line49] INDEX
[toc_p3 line50] Page No.
[toc_p3 line51] Notice
[toc_p3 line52] Board’s Report Including
[toc_p3 line53] Management Discussions &
[toc_p3 line54] Analysis Report
[toc_p3 line55] Annexures to Boards’ Report
[toc_p3 line56] Corporate Governance
[toc_p3 line57] Auditor’s Report
[toc_p3 line58] Balance Sheet
[toc_p3 line59] Statement of Profit & Loss
[toc_p3 line60] Cash Flow Statement
[toc_p3 line61] Notes
[toc_p3 line62] 25th AGM
[toc_p3 line63] •
[toc_p3 line64] Date
[toc_p3 line65] :
[toc_p3 line66] September 25, 2017 Time 11:30 A.M.
[toc_p3 line67] Venue
[toc_p3 line68] :
[toc_p3 line69] Bipin Chandra Pal Memorial Bhavan, A-81, Chittaranjan Park, New Delhi - 110 019
[toc_p3 line70] •
[toc_p3 line71] Book Closure :
[toc_p3 line72] From Thursday September 21, 2017 to Monday, September 25, 2017 (both days inclusive).
[toc_p3 line73] Company’s shares are listed on BSE Ltd. and National Stock Exchange of India Ltd.


### Sanity check after cell 4

In [36]:
from pathlib import Path
import os

# ---- CONFIG ----
PDF_ROOT = Path("../data/pdfs")

print(f"Current working directory: {os.getcwd()}")
print(f"PDF_ROOT: {PDF_ROOT}")
print(f"PDF_ROOT exists: {PDF_ROOT.exists()}")
print(f"PDF_ROOT resolved: {PDF_ROOT.resolve()}")

# ---- STEP 1: Discover PDFs ----
pdf_files = list(PDF_ROOT.rglob("*.pdf"))

print(f"Total PDFs found: {len(pdf_files)}")

if not pdf_files:
    print("No PDFs found in ../data/pdfs directory. Please add PDF files to test the pipeline.")
    print("Skipping sanity checks.")
else:
    # Pick one PDF from each company (if available)
    sample_pdfs = {}
    for pdf in pdf_files:
        company = pdf.parent.name
        if company not in sample_pdfs:
            sample_pdfs[company] = pdf

    print("\nSample PDFs selected for testing:")
    for company, pdf in sample_pdfs.items():
        print(f"- {company}: {pdf.name}")

    # ---- STEP 2: Test PDFInterface + Metadata Extraction ----
    print("\n--- Running Sanity Checks ---\n")

    for company, pdf_path in sample_pdfs.items():
        print(f"Testing company: {company}")
        print(f"PDF: {pdf_path.name}")

        pdf = PDFInterface(pdf_path)
        pages = pdf.get_pages_text()

        print("Pages extracted:", len(pages))
        assert len(pages) > 0, "No pages extracted!"

        # Metadata extraction
        extracted_company = extract_company_name(pages, company_folder=company)
        extracted_year = extract_financial_year(pages)

        print("Extracted Company Name:", extracted_company)
        print("Extracted Financial Year:", extracted_year)

        # Release the file handle before moving on to the next PDF
        pdf.close()
        print("-" * 50)

    print("\nSanity check completed successfully.")


2026-01-01 01:30:55,960 - INFO - Loaded PDF: 5267070319.pdf


Current working directory: c:\Users\LOQ\Desktop\SPJIMR\mdna_extraction_project\notebooks
PDF_ROOT: ..\data\pdfs
PDF_ROOT exists: True
PDF_ROOT resolved: C:\Users\LOQ\Desktop\SPJIMR\mdna_extraction_project\data\pdfs
Total PDFs found: 14

Sample PDFs selected for testing:
- Alcheimist: 5267070319.pdf
- ALok: 5210700315.pdf
- Amit_spinning: 5210760315.pdf
- Amtek: 5200770316.pdf

--- Running Sanity Checks ---

Testing company: Alcheimist
PDF: 5267070319.pdf


2026-01-01 01:30:56,488 - INFO - Extracted company name: ALCHEMIST LTD
2026-01-01 01:30:56,489 - INFO - Extracted financial year: 2018-19
2026-01-01 01:30:56,496 - INFO - Loaded PDF: 5210700315.pdf


Pages extracted: 145
Extracted Company Name: ALCHEMIST LTD
Extracted Financial Year: 2018-19
--------------------------------------------------
Testing company: ALok
PDF: 5210700315.pdf


2026-01-01 01:30:57,535 - INFO - Financial year not found in first 5 pages
2026-01-01 01:30:57,538 - INFO - Loaded PDF: 5210760315.pdf


Pages extracted: 208
Extracted Company Name: ALOK
Extracted Financial Year: None
--------------------------------------------------
Testing company: Amit_spinning
PDF: 5210760315.pdf


2026-01-01 01:30:57,829 - INFO - Extracted company name: AMIT SPINNING INDUSTRIES LIMITED
2026-01-01 01:30:57,831 - INFO - Extracted financial year: 2014-15
2026-01-01 01:30:57,836 - INFO - Loaded PDF: 5200770316.pdf


Pages extracted: 46
Extracted Company Name: AMIT SPINNING INDUSTRIES LIMITED
Extracted Financial Year: 2014-15
--------------------------------------------------
Testing company: Amtek
PDF: 5200770316.pdf


2026-01-01 01:30:58,351 - INFO - Extracted company name: AMTEK AUTO LIMITED
2026-01-01 01:30:58,352 - INFO - Extracted financial year: 2015-16


Pages extracted: 145
Extracted Company Name: AMTEK AUTO LIMITED
Extracted Financial Year: 2015-16
--------------------------------------------------

Sanity check completed successfully.
