# Sustainability Report Parser: Quickstart

This notebook parses a sustainability report PDF into a hierarchical topic tree and a pandas DataFrame of section texts, and can optionally export figures and tables.

**Steps:**
1. Put a PDF into `data/pdfs/`
2. Set `pdf_path` below
3. Run the cells from top to bottom
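
Before setting `pdf_path`, it can help to check which files are actually in `data/pdfs/`. The following is a minimal standard-library sketch; the helper name `list_pdfs` is hypothetical and not part of `sustain_parser`.

```python
from pathlib import Path


def list_pdfs(folder="data/pdfs"):
    """Return the PDF files in `folder`, sorted by name.

    Hypothetical helper for illustration; returns an empty list if the
    folder does not exist.
    """
    root = Path(folder)
    if not root.is_dir():
        return []
    # Compare the extension case-insensitively so REPORT.PDF is found too.
    return sorted(p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".pdf")


for path in list_pdfs():
    print(path)
```

Any of the printed paths can then be used as the value of `pdf_path`.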

In [None]:
import sustain_parser as sp
from sustain_parser import analysis as ana

# Change this to your PDF file name
pdf_path = "data/pdfs/YOUR_REPORT.pdf"

In [None]:
res = sp.parse_pdf(pdf_path)
print("Strategy used:", res.strategy_used, "| pages:", res.page_count)

# Show the first ~80 lines of the tree
print("\n".join(res.tree_md.splitlines()[:80]))

In [None]:
df = res.sections_df()
df[['level', 'title', 'start_page', 'end_page', 'n_words']].head(25)

## Find key sections (Materiality / Assurance)
Materiality and assurance sections are common anchor points in sustainability reporting standards and in discussions of a report's credibility.
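
How `sustain_parser` locates these sections internally is not shown in this notebook. As a rough illustration of the general idea, the sketch below filters section records by keywords in their titles. The function name, the keyword lists, and the sample records are all assumptions made for this example.

```python
import re


def find_sections_by_title(sections, keywords):
    """Return the sections whose title contains any keyword (case-insensitive).

    Hypothetical helper: `sections` is a list of dicts with a "title" key.
    """
    if not keywords:
        return []
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [s for s in sections if pattern.search(s["title"])]


# Made-up records that mimic the columns shown in the DataFrame above.
sample_sections = [
    {"title": "Double Materiality Assessment", "start_page": 12},
    {"title": "Independent Assurance Statement", "start_page": 88},
    {"title": "Our People", "start_page": 40},
]

# Illustrative keyword lists, not the ones the library uses.
print(find_sections_by_title(sample_sections, ["materiality", "material topics"]))
print(find_sections_by_title(sample_sections, ["assurance", "verification"]))
```

Title matching is cheap but brittle: reports that use unusual headings will need extra keywords or a search over the section text instead.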

In [None]:
ana.find_materiality_sections(df)[['title','start_page','end_page','n_words']].head(20)

In [None]:
ana.find_assurance_sections(df)[['title','start_page','end_page','n_words']].head(20)

## Framework mentions + metric density
Metric density is a rough proxy for “quantitative content” (numbers per 1000 characters).
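
To make the definition concrete, here is a minimal sketch of a "numbers per 1000 characters" measure. It is not the library's implementation; the regular expression and the function name `numbers_per_1000_chars` are assumptions chosen for illustration.

```python
import re

# Integers and decimals with optional thousands separators (e.g. 12, 3.5, 1,250).
# The lookbehind skips digits glued to letters, such as the "2" in "tCO2e".
NUMBER_RE = re.compile(r"(?<![\w.])\d+(?:,\d{3})*(?:\.\d+)?")


def numbers_per_1000_chars(text):
    """Count numeric tokens per 1000 characters of `text` (0.0 for empty text)."""
    if not text:
        return 0.0
    return 1000.0 * len(NUMBER_RE.findall(text)) / len(text)


sample = "Scope 1 emissions fell 12% to 1,250 tCO2e in 2023."
print(NUMBER_RE.findall(sample))
print(numbers_per_1000_chars(sample))
```

Note that a count like this treats years and section numbers as "metrics" too, which is one reason the measure is only a rough proxy.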

In [None]:
df2 = ana.add_framework_counts(ana.add_metric_density(df))
df2.sort_values('metric_density', ascending=False)[
    ['title', 'metric_density', 'GRI', 'SASB', 'ISSB', 'TCFD', 'ESRS', 'start_page', 'end_page']
].head(15)

## Extract targets and Scope snippets
Targets and Scope 1/2/3 disclosures are central to climate reporting and transition plan discussions.
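
As an illustration of what a Scope snippet extractor might do, the sketch below finds "Scope 1/2/3" mentions with a regular expression and returns a short window of context around each one. The name `find_scope_mentions`, the pattern, and the `window` parameter are assumptions for this example and say nothing about how `ana.scope_snippets` works internally.

```python
import re

# Matches "Scope 1", "scope 2", "SCOPE 3" and similar.
SCOPE_RE = re.compile(r"\bscope\s*([123])\b", re.IGNORECASE)


def find_scope_mentions(text, window=60):
    """Return (scope_number, snippet) pairs with `window` characters of context per side."""
    results = []
    for match in SCOPE_RE.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        # Collapse whitespace so snippets taken from multi-line PDF text stay readable.
        snippet = " ".join(text[start:end].split())
        results.append((int(match.group(1)), snippet))
    return results


sample = "Our Scope 1 emissions were 500 tCO2e.\nScope 3 is reported separately."
for number, snippet in find_scope_mentions(sample, window=20):
    print(number, "|", snippet)
```

A pattern this simple misses shorthand such as "Scope 1 and 2", where only the first number follows the word "Scope"; handling that would need a more permissive expression.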

In [None]:
targets = ana.extract_targets(df)
targets.head(25)

In [None]:
scopes = ana.scope_snippets(df)
scopes.head(25)

## Export outputs (tree + sections + figures/tables)

- `tree.md` and `tree.json` help you navigate the report.
- `sections.jsonl` stores section-level text.
- Figure and table export depends on PDF quality (scanned PDFs are harder to extract from).
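
Once exported, a `.jsonl` (JSON Lines) file can be read back with the standard library alone, since each line holds one JSON object. The sketch below is a generic reader; the helper name `read_jsonl` is hypothetical, and the fields inside each record depend on what the export actually writes.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Load a JSON Lines file into a list of Python objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

After the export cell has run, a call such as `read_jsonl("outputs/my_report/sections.jsonl")` should return one record per section, assuming the file is written directly into the output directory under that name.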

In [None]:
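# `display` is available by default in Jupyter; the import keeps the cell runnable elsewhere.
from IPython.display import display

out_dir = "outputs/my_report"
res.export(out_dir)
res.export_assets(out_dir, export_figures=True, export_tables=True, table_max_pages=30)

# Preview the extracted figures and tables (either may be None if nothing was found)
display(res.figures_df.head() if res.figures_df is not None else None)
display(res.tables_df.head() if res.tables_df is not None else None)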

## Optional: topic clustering (no LLM)
This step clusters sections into themes using TF-IDF features and KMeans.
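
For readers unfamiliar with the technique, here is a toy standard-library version of TF-IDF followed by k-means. It is only a sketch of the idea, not the code behind `ana.cluster_sections`; the function names, the smoothed IDF formula, and the deterministic seeding are all choices made for this example.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into an L2-normalised TF-IDF vector (dict: term -> weight)."""
    tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenised:
        # Term frequency times a smoothed inverse document frequency.
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in Counter(tokens).items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def kmeans(vectors, k, n_iter=20):
    """Plain k-means with deterministic farthest-point seeding; returns one label per vector."""
    if not vectors:
        return []
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next seed: the vector farthest from its nearest existing centroid.
        centroids.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = []
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                terms = set().union(*members)
                centroids[j] = {t: sum(m.get(t, 0.0) for m in members) / len(members) for t in terms}
    return labels


# Made-up section texts: two about climate, two about governance.
texts = [
    "climate emissions energy",
    "emissions energy targets climate",
    "board governance ethics",
    "governance ethics board committee",
]
print(kmeans(tfidf_vectors(texts), k=2))
```

On real reports the number of clusters is a judgment call, so it is worth trying a few values of `k` and checking whether the resulting groups read as coherent themes.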

In [None]:
df_clustered = ana.cluster_sections(df, k=6)
(
    df_clustered[['cluster', 'title', 'start_page', 'end_page', 'n_words']]
    .sort_values(['cluster', 'start_page'])
    .head(30)
)