ks-xlsx-parser — the open-source Python library that parses Excel (.xlsx) files into citation-ready JSON for LLMs, RAG pipelines, and AI agents (LangChain, LangGraph, CrewAI, OpenAI Agents SDK, Claude, MCP).
Tip
.xlsx → structured, typed, citation-ready JSON that an LLM can actually reason about.
Cells, formulas, merged regions, tables, charts, conditional formatting,
dependency graphs, and RAG-ready chunks — deterministic, fully tested, MIT.
Raw workbook on the left (financial_model.xlsx) → parser output on the right: 4 chunks, each tied back to an exact sheet!range, ready to cite in an LLM response.
Spreadsheets are still the #1 unstructured data source in the enterprise.
Feeding a .xlsx directly to an LLM loses structure (rows, formulas, merges),
loses provenance (which cell said what), and blows through context windows.
ks-xlsx-parser turns an Excel workbook into a token-counted, source-addressable
graph that drops straight into LangChain,
LangGraph,
CrewAI, the
OpenAI Agents SDK, or any
MCP-aware client (Claude Desktop, Cursor, Windsurf, Zed, …).
- 🧾 Typed cell graph: values, formulas, styles, coords
- 🧭 Citation URIs: `file.xlsx#Sheet!A1:F18`
- 🧮 Dependency graph: upstream · downstream · cycles
- 🧩 RAG-ready chunks: HTML + text + token count
- 📊 All 7 chart types: bar · line · pie · scatter · area · radar · bubble
- 🎨 Conditional formatting: every Excel rule type
- 📋 Tables & merges: ListObjects + master/slave
- 🔐 Safe by default: no macros · no external links · ZIP-bomb guard
- ⚡ Fast: 1054 workbooks / 70s in CI
- 🧬 Deterministic: xxhash64 content addressing
- 🧰 Framework-agnostic: LangChain · LangGraph · CrewAI · MCP
- 📜 MIT licensed: use it, fork it, ship it
This project is free, open source (MIT), and part of the Knowledge Stack ecosystem — document intelligence for agents. Stars, contributions, and honest feedback are all first-class ways to keep the lights on.
Jump into the community:
- 💬 Discord — real-time help, roadmap conversations, show off what you're building. Drop in, say hi.
- 🗣 GitHub Discussions — async Q&A, RFCs, and long-form ideas.
- 🐞 Issues — report a bug, request a feature, or file a parser edge case.
- 🎯 Show & Tell — tell us about your production use.
- 🔐 Security — private vulnerability disclosure.
- 🙌 Contribute — every PR is reviewed; `good-first-issue` labels live on Issues.
- 🧰 Knowledge Stack org — see the rest of the ecosystem (ks-cookbook, ks-xlsx-parser, more on the way).
Not sure where to start? Run `make testbench`, find a file that breaks, and open a Parser edge case issue.
That's the fastest path to a merged PR.
```bash
pip install ks-xlsx-parser
```

```python
import json

from ks_xlsx_parser import parse_workbook

result = parse_workbook(path="q4_forecast.xlsx")

# LLM-ready chunks with citation URIs
for chunk in result.chunks:
    print(chunk.source_uri)          # q4_forecast.xlsx#Revenue!A1:F18
    print(chunk.token_count)         # 412
    print(chunk.render_text[:200])   # Pipe-delimited Markdown-ish text
    print(chunk.render_html[:200])   # HTML with proper colspan/rowspan

# Or dump the whole workbook graph (the file is closed when the block exits)
with open("workbook.json", "w") as fh:
    json.dump(result.to_json(), fh, default=str)
```

That's it. Every chunk has:
- `source_uri` — cite back to exact cells
- `render_text` / `render_html` — LLM-consumable bodies
- `token_count` — cap your context window properly
- `dependency_summary` — upstream/downstream formulas
- content hash — dedupe across versions
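
One way to put `token_count` to work is to pack chunks greedily into a fixed context budget. The sketch below is illustrative only: `Chunk` is a hypothetical stand-in for the parser's chunk objects (only the field names `source_uri`, `token_count`, and `render_text` come from the quick start above), and `pack_chunks` is not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a parsed chunk (field names from the quick start)."""
    source_uri: str
    token_count: int
    render_text: str


def pack_chunks(chunks, max_tokens):
    """Greedily keep chunks, in order, while the running total fits max_tokens."""
    selected, used = [], 0
    for chunk in chunks:
        if used + chunk.token_count > max_tokens:
            continue  # this chunk would overflow; a later, smaller one may still fit
        selected.append(chunk)
        used += chunk.token_count
    return selected, used


# Made-up sample data for illustration.
chunks = [
    Chunk("q4_forecast.xlsx#Revenue!A1:F18", 412, "| Region | Q1 | Q2 |"),
    Chunk("q4_forecast.xlsx#Costs!A1:H40", 900, "| Item | Jan | Feb |"),
    Chunk("q4_forecast.xlsx#Summary!A1:C6", 120, "| KPI | Value |"),
]
selected, used = pack_chunks(chunks, max_tokens=600)
```

With a 600-token budget this keeps the Revenue and Summary chunks and skips the 900-token Costs chunk. A retrieval pipeline would usually rank chunks by relevance before packing them.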
- 🤔 Why a dedicated XLSX parser for LLMs?
- 🏗️ Architecture
- 📦 Installation
- 📚 Documentation
- ⚔️ How it compares
- 🎯 Who this is for
- 🧪 The testBench dataset
- 🚧 Limitations
- 🧰 Knowledge Stack ecosystem
- 📡 Stay in touch
- 🙌 Contributing
- ❓ FAQ
- 📜 License
Most Excel libraries answer one of two questions well: "read a rectangle of
values" (pandas, openpyxl) or "run Excel headless" (xlwings, LibreOffice).
ks-xlsx-parser answers a third one: "give me a structured, inspectable,
loss-minimising graph that an LLM or auditor can reason about."
| Output | Why an LLM cares |
|---|---|
| Typed cell graph (values, formulas, styles, coordinates) | Round-trips to JSON/DB/vector store without losing formulas or data types |
| Formula AST + directed dependency graph | Answer "what drives Q4 revenue?" via upstream traversal |
| Detected tables, merged regions, layout blocks | Multi-table sheets no longer collapse into one giant CSV |
| Chart extractions (bar / line / pie / scatter / area / radar / bubble) | Text summaries the model can read |
| Token-counted render chunks (HTML + pipe-text) | Plug straight into an embedding pipeline without blowing context |
| Citation-ready source URIs (`sheet!A1:B10`) | The LLM can cite the exact cell it's talking about |
| Deterministic content hashes (xxhash64) | Dedupe across versions, detect change between uploads |
Everything is deterministic, everything is tested on a 1054-workbook stress corpus, and everything is open source.
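
To turn a citation back into something a UI can highlight, a source URI can be split into file, sheet, and cell range. This is a minimal sketch that assumes the `file.xlsx#Sheet!A1:F18` shape shown in this README; `parse_source_uri` and `CellRange` are hypothetical helpers, not library API, and how the parser formats quoted or unusual sheet names is not covered here.

```python
import re
from typing import NamedTuple


class CellRange(NamedTuple):
    file: str
    sheet: str
    top_left: str
    bottom_right: str


# Assumed shape: <file>#<sheet>!<A1>[:<A1>], e.g. q4_forecast.xlsx#Revenue!A1:F18
_URI = re.compile(
    r"^(?P<file>[^#]+)#(?P<sheet>.+)!"
    r"(?P<start>[A-Z]+[0-9]+)(?::(?P<end>[A-Z]+[0-9]+))?$"
)


def parse_source_uri(uri: str) -> CellRange:
    """Split a citation URI into its file, sheet, and cell-range parts."""
    match = _URI.match(uri)
    if match is None:
        raise ValueError(f"not a recognised source URI: {uri!r}")
    start = match["start"]
    # A single-cell citation has no ":<end>" part, so the range collapses to one cell.
    return CellRange(match["file"], match["sheet"], start, match["end"] or start)


ref = parse_source_uri("q4_forecast.xlsx#Revenue!A1:F18")
```

Here `ref.sheet` is `"Revenue"` and `ref.bottom_right` is `"F18"`.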
```mermaid
%%{init: {'theme':'base', 'themeVariables': {
'primaryColor':'#10B981','primaryTextColor':'#fff','primaryBorderColor':'#047857',
'lineColor':'#94A3B8','secondaryColor':'#22C55E','tertiaryColor':'#34D399',
'background':'#FFFFFF','mainBkg':'#10B981','clusterBkg':'#F0FDF4'
}}}%%
flowchart TD
IN([📄 .xlsx bytes])
PARSE[["① parsers/<br/>OOXML drivers<br/><i>openpyxl + lxml</i>"]]
MODELS[["② models/<br/>Pydantic DTOs<br/><i>Workbook · Sheet · Cell · Table · Chart</i>"]]
FORMULA[["③ formula/<br/>lexer + parser<br/><i>cross-sheet · table · array</i>"]]
ANALYSIS[["④ analysis/<br/>dependency graph<br/><i>cycles · impact</i>"]]
CHARTS[["⑤ charts/<br/>OOXML chart extraction"]]
ANNOT[["⑥ annotation/<br/>semantic roles · KPIs"]]
SEG[["⑦ chunking/<br/>adaptive segmenter"]]
REND[["⑧ rendering/<br/>HTML + pipe-text<br/>token counts"]]
STORE[["🗄️ storage/<br/>JSON · DB rows · vectors"]]
VER[["✅ verification/<br/>stage assertions"]]
CMP[["🔀 comparison/<br/>multi-workbook templates"]]
EXP[["🧬 export/<br/>generated importer"]]
OUT([🤖 LLM-ready chunks<br/>with citations])
IN --> PARSE --> MODELS
MODELS --> FORMULA
MODELS --> ANALYSIS
MODELS --> CHARTS
FORMULA --> ANALYSIS
ANALYSIS --> ANNOT
CHARTS --> ANNOT
ANNOT --> SEG --> REND --> STORE
MODELS --> VER
STORE --> OUT
STORE -.-> CMP -.-> EXP
%% All-green palette: deepest for entry, lightest for auxiliary stages,
%% emerald for the headline output node.
classDef entry fill:#064E3B,stroke:#022C22,color:#fff,stroke-width:2px;
classDef parse fill:#065F46,stroke:#022C22,color:#fff,stroke-width:2px;
classDef model fill:#047857,stroke:#064E3B,color:#fff,stroke-width:2px;
classDef analyze fill:#059669,stroke:#065F46,color:#fff,stroke-width:2px;
classDef render fill:#16A34A,stroke:#166534,color:#fff,stroke-width:2px;
classDef output fill:#22C55E,stroke:#15803D,color:#fff,stroke-width:2px;
classDef aux fill:#A7F3D0,stroke:#047857,color:#065F46,stroke-width:2px;
class IN entry
class PARSE parse
class MODELS model
class FORMULA,ANALYSIS,CHARTS analyze
class ANNOT,SEG,REND render
class STORE,OUT output
class VER,CMP,EXP aux
```
The pipeline has 8 stages (parse → analyse → annotate → segment → render → serialise → verify → compare/export). Full breakdown in Pipeline Internals.
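
To make the dependency-graph stage concrete, here is a rough sketch of upstream traversal and cycle detection over formula precedents. It is a generic graph walk, not the library's implementation: the `precedents` mapping, the cell addresses, and both function names are made up for illustration.

```python
from collections import deque

# Hypothetical edges: each formula cell maps to the cells it reads from.
precedents = {
    "Summary!B2": ["Revenue!F18", "Costs!H40"],
    "Revenue!F18": ["Revenue!F2", "Revenue!F3"],
    "Costs!H40": ["Costs!H2"],
}


def upstream(cell, graph):
    """Return every cell that `cell` depends on, directly or transitively."""
    seen, queue = set(), deque([cell])
    while queue:
        for parent in graph.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def has_cycle(graph):
    """Depth-first search with three-colour marking to spot circular references."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node):
        colour[node] = GREY  # node is on the current DFS path
        for parent in graph.get(node, []):
            state = colour.get(parent, WHITE)
            if state == GREY or (state == WHITE and visit(parent)):
                return True  # reached a node already on the path: a cycle
        colour[node] = BLACK  # fully explored, no cycle through this node
        return False

    return any(colour.get(node, WHITE) == WHITE and visit(node) for node in graph)


drivers = upstream("Summary!B2", precedents)
```

`drivers` holds the five cells that feed `Summary!B2`. The recursive `visit` is fine for a sketch, but very deep formula chains would call for an iterative version.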
Note
The importable module is xlsx_parser; ks_xlsx_parser is a re-export
matching the PyPI package name. The package is fully type-annotated
(py.typed is shipped).
Requires Python 3.10+.
```bash
pip install ks-xlsx-parser            # core library
pip install "ks-xlsx-parser[api]"     # + FastAPI web server
pip install "ks-xlsx-parser[dev]"     # + test tooling
```

From source:

```bash
git clone https://github.com/knowledgestack/ks-xlsx-parser.git
cd ks-xlsx-parser
make install           # pip install -e ".[dev,api]"
make test              # default suite
make testbench-build   # generate the 1000-file stress corpus
make testbench         # round-trip every workbook through the parser
```

Runtime deps: openpyxl, pydantic, lxml, xxhash, tiktoken.
All implementation detail lives under docs/wiki/ (mirrored
to the GitHub Wiki
on each release) so this README stays scannable:
- 🚀 Quick Start — parse, iterate chunks, walk the dep graph, serialise, parse from bytes. Five short snippets, ~90 % of real usage.
- 📖 API Reference — full signatures for `parse_workbook`, `compare_workbooks`, `export_importer`, `StageVerifier`.
- 🌐 Web API — the bundled FastAPI server, Python + TypeScript clients, deployment notes.
- 📦 Data Models — every Pydantic DTO field by field.
- 🛠 Pipeline Internals — where to hook in if you want to extend the parser.
- 📜 Workbook Graph Spec — canonical schema for the output.
- 🐛 Known Issues — documented edge cases.
- 📝 CHANGELOG — release history.
| | pandas / openpyxl | Docling | ks-xlsx-parser |
|---|---|---|---|
| Reads values | ✅ | ✅ | ✅ |
| Keeps formulas | ❌ | | ✅ parsed + dependency graph |
| Preserves merges | | | ✅ master/slave with colspan/rowspan |
| Extracts charts | ❌ | ❌ | ✅ all 7 chart types + text summary |
| Conditional formatting | ❌ | ❌ | ✅ cell/color-scale/icon/data-bar/formula |
| Data validation (dropdowns) | ❌ | ❌ | ✅ all types incl. cross-sheet lists |
| Multi-table sheet layout | ❌ | | ✅ adaptive-gap segmentation |
| Per-chunk source URI (citation) | ❌ | | ✅ `file.xlsx#Sheet!A1:F18` |
| Token counts per chunk | ❌ | ❌ | ✅ via tiktoken |
| Dependency graph traversal | ❌ | ❌ | ✅ upstream / downstream, cycle detection |
| Deterministic content hashes | ❌ | ❌ | ✅ xxhash64 per cell / block / chunk |
| Streaming `.xlsx` > 100 MB | ❌ | | ✅ (chunked parse) |
Most tools give you a dataframe. ks-xlsx-parser gives you a graph an LLM can cite.
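
To see why multi-table sheets need segmenting at all, consider the simplest possible version: splitting a sheet into blocks wherever blank rows appear. The sketch below is a deliberately naive fixed-gap splitter, not the parser's adaptive-gap algorithm; `split_on_blank_rows` and its `min_gap` parameter are hypothetical.

```python
def split_on_blank_rows(rows, min_gap=1):
    """Split rows into blocks separated by at least `min_gap` blank rows.

    Returns (first_row_index, block_rows) pairs using 0-based row indices.
    """
    def is_blank(row):
        return all(cell in (None, "") for cell in row)

    blocks, current, start, pending = [], [], 0, []
    for index, row in enumerate(rows):
        if is_blank(row):
            if current:
                pending.append(row)
                if len(pending) >= min_gap:
                    blocks.append((start, current))  # gap is wide enough: close the block
                    current, pending = [], []
            continue
        if not current:
            start = index
        current.extend(pending)  # a gap shorter than min_gap stays inside the block
        pending = []
        current.append(row)
    if current:
        blocks.append((start, current))
    return blocks


# Two small tables stacked on one sheet, separated by a blank row.
sheet = [
    ["Region", "Q1"],
    ["EMEA", 10],
    [None, None],
    ["Item", "Cost"],
    ["Rent", 5],
]
blocks = split_on_blank_rows(sheet)
```

With the default `min_gap=1` the sample sheet yields two blocks starting at rows 0 and 3. A real segmenter also has to handle side-by-side tables, which a row-only split cannot see.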
Looking for a tiny, edge-runtime I/O library with write support? See
hucre by @productdevbook. For an unbiased head-to-head on the 1053-workbook testBench corpus (perf numbers, extraction-count parity, where each side wins), see the wiki: ks-xlsx-parser vs hucre.
Teams shipping agents, RAG pipelines, or auditing tools that ingest Excel.
- 🏦 Banking & Finance: KPI extraction, formula lineage, regulator-ready citations
- ⚖️ Legal & Contracts: schedules, fee tables, covenant matrices without flattening merges
- 🏥 Healthcare & Insurance: normalise claims, pricing, and actuarial sheets into auditable JSON
- 🏗️ Real Estate & Construction: quantity takeoffs and cost models that still live in XLSX
- 📈 Sales Ops / HR / Engineering: "source of truth is a spreadsheet" → structured events, in minutes
Important
Not a fit if you need to execute Excel (recalculate, run VBA, pivot-refresh).
Use xlwings or a headless Excel for that. ks-xlsx-parser reads; it doesn't run.
A 1054-workbook stress corpus ships under testBench/ and
is round-tripped in CI on every commit. It's the easiest way to see whether
the parser does the right thing on your kind of workbook.
| Group | Files | What it covers |
|---|---|---|
| `real_world/` | 8 | Real anonymised workbooks (financial, engineering, project tracking) |
| `enterprise/` | 4 | Deterministic enterprise templates |
| `github_datasets/` | 10 | Public datasets (iris, titanic, superstore, …) |
| `stress/curated/` | 26 | 26 progressive stress levels authored by hand |
| `stress/merges/` | 5 | Pathological merge patterns |
| `generated/matrix/` | 297 | One feature per file across 18 categories |
| `generated/combo/` | 400 | Deterministic feature cocktails (5 densities × 80 seeds) |
| `generated/adversarial/` | 300 | Unicode bombs, circular refs, 32k-char cells, deep formula chains, sparse 1M-row sheets, 250-sheet workbooks |
```bash
make testbench-build   # regenerate testBench/generated/ (~1 minute)
make testbench         # 1054/1054 in ~70 seconds
make testbench-zip     # package as dist/testBench-vX.Y.Z.zip for a GitHub release
```

The zipped dataset is attached to every release; pull it if you don't want to clone the full repo.
- `.xls` not supported — only `.xlsx` and `.xlsm` (OOXML). Convert legacy files externally.
- Pivot tables — detected but not fully parsed.
- Sparklines — not extracted.
- VBA macros — flagged but never executed or analysed.
- External links — recorded but not resolved.
- Threaded comments — only legacy comments are supported (openpyxl limitation).
- Embedded OLE objects — detected but not extracted.
- Locale-dependent number formats — not interpreted.
Full list in docs/PARSER_KNOWN_ISSUES.md.
ks-xlsx-parser is one piece of the Knowledge Stack
open-source family — document intelligence for agents, built so that
engineering teams can focus on agents and we handle the messy parts of
enterprise data.
| Repo | What it does |
|---|---|
| ks-cookbook | 32 production-style flagship agents + recipes for LangChain, LangGraph, CrewAI, Temporal, the OpenAI Agents SDK, and any MCP client. |
| ks-xlsx-parser (this repo) | Turn .xlsx into LLM-ready JSON with citations and dependency graphs. |
| @knowledgestack | Follow the org for upcoming repos — parsers, extractors, and MCP servers for PDF, DOCX, PPTX, HTML, and more. |
Building on top of the stack? Tell us about it in Show & Tell or the #showcase channel on Discord.
- 💬 Join the Discord — our main real-time channel. Roadmap, help, job postings, show-and-tell, and the occasional meme.
- 🐙 Follow @knowledgestack on GitHub for new releases across the ecosystem.
- 📣 Watch this repo (→ Releases only) to get pinged when `ks-xlsx-parser` ships an update.
If you'd rather just peek first, more than a thousand workbooks live in the testBench release as a single zip. Pull it, diff it, file an issue if your Excel does something weirder than ours.
We love contributions. Three paths, in order of speed-to-merge:
- Report a testBench failure — run `make testbench`, find a file that breaks, attach it to a Parser edge case issue.
- Add a new adversarial workbook — contribute a builder to `scripts/build_testbench.py`. Any file that makes the parser crash or lose information is welcome.
- Fix a flagged issue — see `docs/PARSER_KNOWN_ISSUES.md`.
Full dev loop, PR checklist, and code style in CONTRIBUTING.md.
See the Code of Conduct and
Security policy before posting.
If you don't have time to contribute but the project helped you, please star the repo. That's the main signal that keeps this maintained.
What is the best Python library to parse Excel (.xlsx) for LLMs?
ks-xlsx-parser is purpose-built for it. Unlike pandas or openpyxl, it preserves formulas with a directed dependency graph, merged regions, tables, charts, and conditional formatting, and emits token-counted chunks with source_uri citations an LLM can quote. pip install ks-xlsx-parser.
How do I parse Excel for a LangChain or LangGraph agent?
Call parse_workbook(path=...), then expose result.chunks as a LangChain @tool or a LangGraph ToolNode. Each chunk carries source_uri, render_text, token_count, and a dependency_summary — everything the agent needs to cite and reason.
How do I use Excel in a CrewAI or OpenAI-Agents-SDK agent?
Same pattern — wrap parse_workbook in whatever tool abstraction your framework provides (@tool in CrewAI, @function_tool in the OpenAI Agents SDK). The parser's output is framework-agnostic.
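
Since the wrapping pattern is the same everywhere, the tool body can be a plain function that any framework decorator is applied to afterwards. The sketch below assumes chunks have already been parsed and uses plain dicts as stand-ins for chunk objects; `excel_lookup` is a hypothetical name, and the keyword match is a placeholder for real retrieval.

```python
def excel_lookup(chunks, keyword, max_results=3):
    """Return chunks mentioning `keyword`, each prefixed with its citation URI."""
    needle = keyword.lower()
    hits = [c for c in chunks if needle in c["render_text"].lower()]
    answer = "\n\n".join(
        f"[{c['source_uri']}]\n{c['render_text']}" for c in hits[:max_results]
    )
    return answer or "No matching cells found."


# Stand-in data; in practice these values would come from the parsed chunks.
sample_chunks = [
    {"source_uri": "q4_forecast.xlsx#Revenue!A1:F18", "render_text": "| Region | Revenue |"},
    {"source_uri": "q4_forecast.xlsx#Costs!A1:H40", "render_text": "| Item | Cost |"},
]
reply = excel_lookup(sample_chunks, "revenue")
```

Returning the citation alongside the text lets the agent quote the source range in its final answer.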
Can Claude Desktop, Cursor, Windsurf, or another MCP client read Excel files?
Yes — run the bundled FastAPI server (`pip install "ks-xlsx-parser[api]"`, then `xlsx-parser-api`) and call `POST /parse`. A native MCP server is on the Knowledge Stack roadmap.
How do I build a RAG pipeline over Excel spreadsheets?
Three steps: pip install ks-xlsx-parser, call parse_workbook() on each file, then result.serializer.to_vector_store_entries() to get id + text + metadata triples ready for Qdrant, pgvector, Weaviate, or Pinecone. Every entry has a content_hash for dedup and a source_uri the LLM cites in its answer.
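
The dedup step can be sketched as a filter that runs before upserting. The entry layout here is an assumption: each entry is taken to be a dict with `id`, `text`, and `metadata`, with `content_hash` and `source_uri` stored under `metadata`. Check the Data Models page for the real shape; `new_entries` is a hypothetical helper.

```python
def new_entries(entries, already_indexed):
    """Keep entries whose content hash is neither indexed nor seen earlier in the batch."""
    seen = set(already_indexed)
    fresh = []
    for entry in entries:
        digest = entry["metadata"]["content_hash"]  # assumed location of the hash
        if digest in seen:
            continue  # unchanged content: skip re-embedding and re-upserting
        seen.add(digest)
        fresh.append(entry)
    return fresh


# Made-up entries for illustration; the hash strings are placeholders.
batch = [
    {"id": "c1", "text": "| Region | Q1 |", "metadata": {"content_hash": "aaa", "source_uri": "f.xlsx#Revenue!A1:B2"}},
    {"id": "c2", "text": "| Item | Cost |", "metadata": {"content_hash": "bbb", "source_uri": "f.xlsx#Costs!A1:B2"}},
    {"id": "c3", "text": "| Region | Q1 |", "metadata": {"content_hash": "aaa", "source_uri": "f_v2.xlsx#Revenue!A1:B2"}},
]
to_upsert = new_entries(batch, already_indexed={"bbb"})
```

Only `c1` survives: `c2` is already indexed and `c3` repeats `c1`'s content under a new file name.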
How is ks-xlsx-parser different from openpyxl or pandas?
openpyxl and pandas give you a rectangle of values. ks-xlsx-parser gives you the full workbook graph: parsed formulas with dependency edges, merged regions, Excel ListObjects, all 7 chart types, every conditional-formatting rule type, and LLM chunks with citation URIs + token counts. It wraps openpyxl and uses lxml for the bits openpyxl loses.
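
As an illustration of what preserving merged regions means in rendered HTML, a merged range can be mapped to `rowspan`/`colspan` on its top-left cell. This is a generic sketch, not the library's renderer; all three function names are hypothetical.

```python
import html
import re


def column_number(letters: str) -> int:
    """Convert Excel column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    number = 0
    for char in letters:
        number = number * 26 + (ord(char) - ord("A") + 1)
    return number


def merge_span(ref: str) -> tuple[int, int]:
    """Return (rowspan, colspan) for a merged range such as 'A1:C2'."""
    match = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", ref)
    if match is None:
        raise ValueError(f"not a merged range: {ref!r}")
    col1, row1, col2, row2 = match.groups()
    return int(row2) - int(row1) + 1, column_number(col2) - column_number(col1) + 1


def master_cell_html(ref: str, value: object) -> str:
    """Render the top-left (master) cell; the covered cells emit no <td> at all."""
    rowspan, colspan = merge_span(ref)
    return f'<td rowspan="{rowspan}" colspan="{colspan}">{html.escape(str(value))}</td>'


cell = master_cell_html("A1:C2", "Q4 & FY totals")
```

The merged header spans two rows and three columns, and the ampersand is escaped so the fragment stays valid HTML.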
Does ks-xlsx-parser run Excel formulas or macros?
No. The library reads .xlsx files; it never executes them. VBA macros are flagged but never run. External links are recorded but never resolved. ZIP-bomb and cell-count limits make it safe for untrusted uploads.
How fast is it?
The full 1054-workbook testBench round-trips in ~70 s on a single machine. A real 21k-cell, 13-sheet financial model parses in ~4.6 s (down from 307 s pre-0.1.1 after a circular-ref caching fix). Sparse workbooks with extreme addresses parse in under 200 ms.
Search queries this library answers: Python Excel parser for LLMs, XLSX to JSON for LangChain, Excel ingestion for LangGraph, spreadsheet reader for CrewAI, Excel tool for OpenAI Agents SDK, Excel for Claude Desktop, Excel for Cursor, Excel MCP server, openpyxl alternative for RAG, Excel dependency graph extractor, XLSX OOXML parser for AI, how to parse Excel for an LLM agent, how to feed a spreadsheet to ChatGPT, how to cite Excel cells in an LLM answer, best library to turn Excel into JSON, Python library for parsing formulas, Excel formula dependency traversal, document intelligence for spreadsheets, RAG over Excel files, Excel chunker with token counts, parse .xlsx for Qdrant / pgvector / Weaviate / Pinecone.
MIT. Use it, fork it, ship it. Attribution appreciated but not required.
If you ship something built on top of ks-xlsx-parser, we'd love a
Show & Tell
post or a shoutout on Discord.