# PDF Text + Graphical Data Q&A

This notebook documents the approach and demonstrates results for:

- Extracting **text** and **graphical data** (tables, basic bar charts) from a PDF.
- Converting those into structured, queryable data.
- Letting users **ask questions** against both text and derived data.

**Instructions:** Place your PDF under `../data/` and set its filename below.

In [None]:
from pathlib import Path
import os
import pandas as pd
from dotenv import load_dotenv
load_dotenv()

DATA_DIR = Path("../data")
OUTPUTS = Path("../outputs")
OUTPUTS.mkdir(exist_ok=True, parents=True)

# <<< SET YOUR PDF FILE HERE >>>
PDF_FILE = DATA_DIR / "sample.pdf"  # replace after uploading
assert PDF_FILE.exists(), f"PDF not found: {PDF_FILE}. Put your PDF under {DATA_DIR}."

## 1) Extraction
We use:
- `pdfplumber` for text and page rasterization.
- `camelot` (stream/lattice) → `tabula-py` → `pdfplumber` fallback for tables.
- Simple computer vision + OCR to attempt digitizing **bar charts**.

> Note: Chart digitization is best-effort. Tables are the primary path to precise structured data.
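
The layered fallback can be sketched as below. This is a minimal illustration, not the code in `src.extract_tables`: the function `extract_with_fallback` and the three stand-in backends are hypothetical, and tables are represented as nested lists rather than DataFrames to keep the example dependency-free.

```python
from typing import Callable, List, Sequence, Tuple

Table = List[List[str]]  # stand-in for a DataFrame in this sketch


def extract_with_fallback(
    pdf_path: str,
    backends: Sequence[Tuple[str, Callable[[str], List[Table]]]],
) -> Tuple[str, List[Table]]:
    """Try each backend in order and return the first non-empty result."""
    for name, backend in backends:
        try:
            tables = backend(pdf_path)
        except Exception as exc:  # e.g. a missing Java or Ghostscript install
            print(f"{name} failed: {exc}")
            continue
        if tables:
            return name, tables
    return "none", []


# Hypothetical backends that mimic the three outcomes we care about.
def broken_backend(path: str) -> List[Table]:
    raise RuntimeError("dependency missing")


def empty_backend(path: str) -> List[Table]:
    return []


def working_backend(path: str) -> List[Table]:
    return [[["Year", "Revenue"], ["2023", "120"]]]


used, found = extract_with_fallback(
    "sample.pdf",
    [
        ("camelot", broken_backend),
        ("tabula", empty_backend),
        ("pdfplumber", working_backend),
    ],
)
print(used, len(found))
```

A backend that raises and a backend that returns nothing are treated the same way: both fall through to the next one in the list.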

In [None]:
from src.extract_text import extract_text
from src.extract_tables import extract_tables
from src.extract_charts import extract_charts_as_data

pages = extract_text(str(PDF_FILE))
tables = extract_tables(str(PDF_FILE), flavor="stream")
chart_dfs, chart_images = extract_charts_as_data(str(PDF_FILE), str(OUTPUTS / "images"))

len(pages), len(tables), len(chart_dfs), len(chart_images)

### Quick Preview

In [None]:
pages[0].text[:1000] if pages else "(no text)"

In [None]:
tables[0].head() if tables else "(no tables)"

In [None]:
chart_dfs[0].head() if chart_dfs else "(no chart data detected)"

## 2) Build Searchable Index for Text

In [None]:
from src.build_index import build_text_index, TextChunk, tables_summary
import os

model_name = os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
chunks = [TextChunk(page_num=p.page_num, text=p.text) for p in pages if p.text.strip()]
text_index = build_text_index(chunks, model_name=model_name)

print("Text chunks:", len(chunks))
tables_summary(tables)
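
To show what the index does conceptually, here is a small sketch of cosine-similarity ranking over normalized vectors. It is not the implementation in `src.build_index`: the helper names `normalize` and `top_k` and the three-dimensional toy vectors are assumptions for illustration, and real embeddings from the Sentence-Transformers model have many more dimensions.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(
    query: Sequence[float], docs: Sequence[Sequence[float]], k: int = 2
) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) for the k closest vectors."""
    q = normalize(query)
    scored = [
        # After normalization, cosine similarity is just a dot product.
        (i, sum(a * b for a, b in zip(q, normalize(d))))
        for i, d in enumerate(docs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "page embeddings" (hypothetical values).
page_vectors = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
ranked = top_k([1.0, 0.1, 0.0], page_vectors, k=2)
print(ranked)
```

With many pages, the same computation is usually done as a single matrix product over pre-normalized vectors instead of a Python loop.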

## 3) Ask Questions
We support:
- **Semantic text search** over the narrative text.
- **Table/derived data queries** for simple aggregations and previews.

In [None]:
from src.qa import answer_text_query, answer_table_query

q1 = "What does the document say about revenue growth?"
ans1 = answer_text_query(text_index, q1, top_k=int(os.getenv("TOP_K", "5")))
ans1

In [None]:
q2 = "sum of Revenue where Year == 2023"
ans2 = answer_table_query(tables + chart_dfs, q2)
ans2
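
For readers curious how a query string like `q2` could be interpreted, the following is a rough sketch of a pattern parser. It does not reproduce `src.qa`: the regex, the function names `parse_sum_query` and `run_sum_query`, and the sample rows are all hypothetical, and rows are plain dicts rather than DataFrames so the example has no dependencies.

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches: sum of <col> where <col2> == <value>
QUERY_RE = re.compile(
    r"^\s*sum\s+of\s+(?P<col>\w+)\s+where\s+(?P<filter_col>\w+)"
    r"\s*==\s*(?P<value>.+?)\s*$",
    re.IGNORECASE,
)


def parse_sum_query(query: str) -> Optional[Tuple[str, str, str]]:
    """Return (column, filter column, value), or None if not recognized."""
    match = QUERY_RE.match(query)
    if match is None:
        return None
    value = match.group("value").strip("'\"")  # tolerate quoted values
    return match.group("col"), match.group("filter_col"), value


def run_sum_query(rows: List[Dict[str, object]], query: str) -> Optional[float]:
    """Sum one column over the rows whose filter column equals the value."""
    parsed = parse_sum_query(query)
    if parsed is None:
        return None  # the caller can fall back to a preview here
    col, filter_col, value = parsed
    return sum(float(r[col]) for r in rows if str(r.get(filter_col)) == value)


# Hypothetical rows standing in for an extracted table.
rows = [
    {"Year": 2022, "Revenue": 100.0},
    {"Year": 2023, "Revenue": 120.0},
    {"Year": 2023, "Revenue": 30.5},
]
total = run_sum_query(rows, "sum of Revenue where Year == 2023")
print(total)
```

Comparing values as strings keeps the sketch simple; a real implementation would likely coerce types based on the column dtype.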

## 4) Notes on Design Choices & Challenges

1. **Multi-backend table extraction**: PDFs vary widely. We layer Camelot → Tabula → pdfplumber to maximize table recovery.
2. **Chart digitization (best-effort)**: Simple bar charts can often be parsed by detecting vertical rectangles and scaling bar heights. Arbitrary plots (line charts, stacked bars, complex legends) are left as future work.
3. **Embeddings**: We use local Sentence-Transformers to avoid external APIs. Cosine similarity over normalized vectors (equivalent to a dot product) ranks page-level chunks by semantic closeness to the query.
4. **NL-to-table**: A tiny pattern parser recognizes queries like `sum of <col> where <col2> == X`. When unrecognized, we surface a **preview** of likely columns as a helpful fallback.
5. **Provenance**: We retain page numbers with text chunks, and we keep DataFrames for tables so you can trace results back to source pages.
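
The "scaling bar heights" step in item 2 amounts to a linear map from pixel rows to axis values. Below is a minimal sketch under the assumption that two y-axis ticks have already been located and read via OCR; the function `pixel_to_value` and all pixel coordinates are hypothetical and not taken from `src.extract_charts`.

```python
from typing import List, Tuple

Tick = Tuple[float, float]  # (pixel row, axis value read by OCR)


def pixel_to_value(y_pixel: float, tick_a: Tick, tick_b: Tick) -> float:
    """Linearly interpolate an axis value from a pixel row.

    Image rows grow downward, so a higher bar has a smaller pixel row;
    the two calibration ticks take care of the sign automatically.
    """
    (ya, va), (yb, vb) = tick_a, tick_b
    if ya == yb:
        raise ValueError("calibration ticks must be on different rows")
    return va + (y_pixel - ya) * (vb - va) / (yb - ya)


# Hypothetical calibration: value 0 at row 400, value 100 at row 200.
zero_tick: Tick = (400, 0.0)
hundred_tick: Tick = (200, 100.0)

# Hypothetical top edges of three detected bars.
bar_tops: List[float] = [300, 250, 380]
values = [pixel_to_value(y, zero_tick, hundred_tick) for y in bar_tops]
print(values)
```

If fewer than two ticks can be read reliably, there is no way to calibrate the axis, which is one case where skipping the chart is the safer choice.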

**Limitations**
- Tabula requires Java; if Java is missing, the Tabula backend is skipped.
- Ghostscript improves Camelot lattice detection.
- Chart digitization is intentionally conservative: it will skip when uncertain rather than hallucinate numbers.
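
One way to check for the optional system dependencies up front is sketched below. The function name `check_optional_dependencies` is hypothetical, and the Ghostscript executable names (`gs` on Unix-like systems, `gswin64c`/`gswin32c` on Windows) are the common defaults; your installation may differ.

```python
import shutil
from typing import Dict


def check_optional_dependencies() -> Dict[str, bool]:
    """Report whether Java and Ghostscript executables are on PATH."""
    ghostscript_names = ("gs", "gswin64c", "gswin32c")
    return {
        "java": shutil.which("java") is not None,
        "ghostscript": any(
            shutil.which(name) is not None for name in ghostscript_names
        ),
    }


deps = check_optional_dependencies()
for tool, available in deps.items():
    print(f"{tool}: {'found' if available else 'missing'}")
```

Running this before extraction makes it easier to tell a skipped backend apart from a PDF that simply contains no tables.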
