knowledgestack/ks-xlsx-parser

Knowledge Stack

📊 Make XLSX LLM Ready 🤖

ks-xlsx-parser — the open-source Python library that parses Excel (.xlsx) files into citation-ready JSON for LLMs, RAG pipelines, and AI agents (LangChain, LangGraph, CrewAI, OpenAI Agents SDK, Claude, MCP).

Tip

.xlsx → structured, typed, citation-ready JSON that an LLM can actually reason about. Cells, formulas, merged regions, tables, charts, conditional formatting, dependency graphs, and RAG-ready chunks — deterministic, fully tested, MIT.

ks-xlsx-parser highlighting a financial model on the left and emitting typed, citation-linked chunks on the right
Raw workbook on the left (financial_model.xlsx) → parser output on the right: 4 chunks, each tied back to an exact sheet!range, ready to cite in an LLM response.

Spreadsheets are still the #1 unstructured data source in the enterprise. Feeding a .xlsx directly to an LLM loses structure (rows, formulas, merges), loses provenance (which cell said what), and blows through context windows. ks-xlsx-parser turns an Excel workbook into a token-counted, source-addressable graph that drops straight into LangChain, LangGraph, CrewAI, the OpenAI Agents SDK, or any MCP-aware client (Claude Desktop, Cursor, Windsurf, Zed, …).



✨ What you get, at a glance

🧾
Typed cell graph
values, formulas, styles, coords
🧭
Citation URIs
file.xlsx#Sheet!A1:F18
🧮
Dependency graph
upstream · downstream · cycles
🧩
RAG-ready chunks
HTML + text + token count
📊
All 7 chart types
bar · line · pie · scatter · area · radar · bubble
🎨
Conditional formatting
every Excel rule type
📋
Tables & merges
ListObjects + master/slave
🔐
Safe by default
no macros · no external links · ZIP-bomb guard

⚡
Fast
1054 workbooks / 70s in CI
🧬
Deterministic
xxhash64 content addressing
🧰
Framework-agnostic
LangChain · LangGraph · CrewAI · MCP
📜
MIT licensed
use it, fork it, ship it

⭐ If this helps you

This project is free, open source (MIT), and part of the Knowledge Stack ecosystem — document intelligence for agents. Stars, contributions, and honest feedback are all first-class ways to keep the lights on.

Jump into the community:

  • 💬 Discord — real-time help, roadmap conversations, show off what you're building. Drop in, say hi.
  • 🗣 GitHub Discussions — async Q&A, RFCs, and long-form ideas.
  • 🐞 Issues — report a bug, request a feature, or file a parser edge case.
  • 🎯 Show & Tell — tell us about your production use.
  • 🔐 Security — private vulnerability disclosure.
  • 🙌 Contribute — every PR is reviewed; good-first-issue labels live on Issues.
  • 🧰 Knowledge Stack org — see the rest of the ecosystem (ks-cookbook, ks-xlsx-parser, more on the way).

Not sure where to start? Run make testbench, find a file that breaks, and open a Parser edge case issue. That's the fastest path to a merged PR.


🚀 30-second demo

pip install ks-xlsx-parser

import json

from ks_xlsx_parser import parse_workbook

result = parse_workbook(path="q4_forecast.xlsx")

# LLM-ready chunks with citation URIs
for chunk in result.chunks:
    print(chunk.source_uri)          # q4_forecast.xlsx#Revenue!A1:F18
    print(chunk.token_count)         # 412
    print(chunk.render_text[:200])   # Pipe-delimited Markdown-ish text
    print(chunk.render_html[:200])   # HTML with proper colspan/rowspan

# Or dump the whole workbook graph.
# default=str stringifies any value json cannot encode natively.
with open("workbook.json", "w") as fh:
    json.dump(result.to_json(), fh, default=str)

That's it. Every chunk has:

  • source_uri — cite back to exact cells
  • render_text / render_html — LLM-consumable bodies
  • token_count — cap your context window properly
  • dependency_summary — upstream/downstream formulas
  • content hash — dedupe across versions
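
As a rough illustration of how token_count can cap a context window, here is a minimal sketch that greedily packs chunks under a budget. The Chunk dataclass is a hypothetical stand-in carrying only the two fields used here, not the library's own model, and the 600-token budget is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in with only the two fields this sketch needs."""
    source_uri: str
    token_count: int


def pack_chunks(chunks, max_tokens):
    """Greedily keep chunks, in order, while they fit in the token budget."""
    selected, used = [], 0
    for chunk in chunks:
        if used + chunk.token_count > max_tokens:
            continue  # would overflow; a later, smaller chunk may still fit
        selected.append(chunk)
        used += chunk.token_count
    return selected, used


chunks = [
    Chunk("q4_forecast.xlsx#Revenue!A1:F18", 412),
    Chunk("q4_forecast.xlsx#Costs!A1:H40", 900),
    Chunk("q4_forecast.xlsx#Summary!A1:C10", 150),
]
selected, used = pack_chunks(chunks, max_tokens=600)
print([c.source_uri for c in selected], used)
```

Greedy packing in document order is the simplest policy; a retrieval pipeline would more likely rank chunks by relevance first and then apply the same budget check.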


🤔 Why a dedicated XLSX parser for LLMs?

Most Excel libraries answer one of two questions well: "read a rectangle of values" (pandas, openpyxl) or "run Excel headless" (xlwings, LibreOffice). ks-xlsx-parser answers a third one: "give me a structured, inspectable, loss-minimising graph that an LLM or auditor can reason about."

| Output | Why an LLM cares |
| --- | --- |
| Typed cell graph (values, formulas, styles, coordinates) | Round-trips to JSON/DB/vector store without losing formulas or data types |
| Formula AST + directed dependency graph | Answer "what drives Q4 revenue?" via upstream traversal |
| Detected tables, merged regions, layout blocks | Multi-table sheets no longer collapse into one giant CSV |
| Chart extractions (bar / line / pie / scatter / area / radar / bubble) | Text summaries the model can read |
| Token-counted render chunks (HTML + pipe-text) | Plug straight into an embedding pipeline without blowing context |
| Citation-ready source URIs (sheet!A1:B10) | The LLM can cite the exact cell it's talking about |
| Deterministic content hashes (xxhash64) | Dedupe across versions, detect change between uploads |

Everything is deterministic, everything is tested on a 1054-workbook stress corpus, and everything is open source.
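
To make the "what drives Q4 revenue?" idea concrete, here is a minimal sketch of an upstream traversal over a dependency graph. It runs on a plain dict of hypothetical edges rather than on the library's own graph object, whose API is not shown in this README.

```python
from collections import deque


def upstream(cell, precedents):
    """Return every cell that `cell` depends on, directly or transitively."""
    seen, queue = set(), deque([cell])
    while queue:
        current = queue.popleft()
        for parent in precedents.get(current, ()):
            if parent not in seen:  # the seen-set also keeps cycles from looping
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical edges: each key is a formula cell, each value lists the cells it reads.
precedents = {
    "Summary!B2": ["Revenue!F18", "Costs!H40"],
    "Revenue!F18": ["Revenue!F2", "Revenue!F3"],
    "Costs!H40": ["Costs!H2"],
}
drivers = upstream("Summary!B2", precedents)
print(sorted(drivers))
```

A downstream (impact) traversal is the same walk over the reversed edge map.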


🏗️ Architecture

%%{init: {'theme':'base', 'themeVariables': {
  'primaryColor':'#10B981','primaryTextColor':'#fff','primaryBorderColor':'#047857',
  'lineColor':'#94A3B8','secondaryColor':'#22C55E','tertiaryColor':'#34D399',
  'background':'#FFFFFF','mainBkg':'#10B981','clusterBkg':'#F0FDF4'
}}}%%
flowchart TD
    IN([📄 .xlsx bytes])
    PARSE[["① parsers/<br/>OOXML drivers<br/><i>openpyxl + lxml</i>"]]
    MODELS[["② models/<br/>Pydantic DTOs<br/><i>Workbook · Sheet · Cell · Table · Chart</i>"]]
    FORMULA[["③ formula/<br/>lexer + parser<br/><i>cross-sheet · table · array</i>"]]
    ANALYSIS[["④ analysis/<br/>dependency graph<br/><i>cycles · impact</i>"]]
    CHARTS[["⑤ charts/<br/>OOXML chart extraction"]]
    ANNOT[["⑥ annotation/<br/>semantic roles · KPIs"]]
    SEG[["⑦ chunking/<br/>adaptive segmenter"]]
    REND[["⑧ rendering/<br/>HTML + pipe-text<br/>token counts"]]
    STORE[["🗄️ storage/<br/>JSON · DB rows · vectors"]]
    VER[["✅ verification/<br/>stage assertions"]]
    CMP[["🔀 comparison/<br/>multi-workbook templates"]]
    EXP[["🧬 export/<br/>generated importer"]]
    OUT([🤖 LLM-ready chunks<br/>with citations])

    IN --> PARSE --> MODELS
    MODELS --> FORMULA
    MODELS --> ANALYSIS
    MODELS --> CHARTS
    FORMULA --> ANALYSIS
    ANALYSIS --> ANNOT
    CHARTS --> ANNOT
    ANNOT --> SEG --> REND --> STORE
    MODELS --> VER
    STORE --> OUT
    STORE -.-> CMP -.-> EXP

    %% All-green palette: deepest for entry, lightest for auxiliary stages,
    %% emerald for the headline output node.
    classDef entry   fill:#064E3B,stroke:#022C22,color:#fff,stroke-width:2px;
    classDef parse   fill:#065F46,stroke:#022C22,color:#fff,stroke-width:2px;
    classDef model   fill:#047857,stroke:#064E3B,color:#fff,stroke-width:2px;
    classDef analyze fill:#059669,stroke:#065F46,color:#fff,stroke-width:2px;
    classDef render  fill:#16A34A,stroke:#166534,color:#fff,stroke-width:2px;
    classDef output  fill:#22C55E,stroke:#15803D,color:#fff,stroke-width:2px;
    classDef aux     fill:#A7F3D0,stroke:#047857,color:#065F46,stroke-width:2px;

    class IN entry
    class PARSE parse
    class MODELS model
    class FORMULA,ANALYSIS,CHARTS analyze
    class ANNOT,SEG,REND render
    class STORE,OUT output
    class VER,CMP,EXP aux

The pipeline has 8 stages (parse → analyse → annotate → segment → render → serialise → verify → compare/export). Full breakdown in Pipeline Internals.

Note

The importable module is xlsx_parser; ks_xlsx_parser is a re-export matching the PyPI package name. The package is fully type-annotated (py.typed is shipped).


📦 Installation

Requires Python 3.10+.

pip install ks-xlsx-parser                 # core library
pip install "ks-xlsx-parser[api]"          # + FastAPI web server
pip install "ks-xlsx-parser[dev]"          # + test tooling

From source:

git clone https://github.com/knowledgestack/ks-xlsx-parser.git
cd ks-xlsx-parser
make install           # pip install -e ".[dev,api]"
make test              # default suite
make testbench-build   # regenerate testBench/generated/ (the synthetic bulk of the stress corpus)
make testbench         # round-trip every workbook through the parser

Runtime deps: openpyxl, pydantic, lxml, xxhash, tiktoken.


📚 Documentation

All implementation detail lives under docs/wiki/ (mirrored to the GitHub Wiki on each release) so this README stays scannable:

  • 🚀 Quick Start — parse, iterate chunks, walk the dep graph, serialise, parse from bytes. Five short snippets, ~90 % of real usage.
  • 📖 API Reference — full signatures for parse_workbook, compare_workbooks, export_importer, StageVerifier.
  • 🌐 Web API — the bundled FastAPI server, Python + TypeScript clients, deployment notes.
  • 📦 Data Models — every Pydantic DTO field by field.
  • 🛠 Pipeline Internals — where to hook in if you want to extend the parser.
  • 📜 Workbook Graph Spec — canonical schema for the output.
  • 🐛 Known Issues — documented edge cases.
  • 📝 CHANGELOG — release history.

⚔️ How it compares

pandas / openpyxl Docling ks-xlsx-parser
Reads values
Keeps formulas ⚠️ raw string ✅ parsed + dependency graph
Preserves merges ⚠️ coords only ⚠️ partial ✅ master/slave with colspan/rowspan
Extracts charts ✅ all 7 chart types + text summary
Conditional formatting ✅ cell/color-scale/icon/data-bar/formula
Data validation (dropdowns) ✅ all types incl. cross-sheet lists
Multi-table sheet layout ⚠️ ✅ adaptive-gap segmentation
Per-chunk source URI (citation) ⚠️ file.xlsx#Sheet!A1:F18
Token counts per chunk ✅ via tiktoken
Dependency graph traversal ✅ upstream / downstream, cycle detection
Deterministic content hashes ✅ xxhash64 per cell / block / chunk
Streaming .xlsx > 100 MB ⚠️ ✅ (chunked parse)

Most tools give you a dataframe. ks-xlsx-parser gives you a graph an LLM can cite.
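
For readers who want to resolve a citation back to cells, the sketch below splits a source URI of the shape shown above (file.xlsx#Sheet!A1:F18) into file, sheet, and range. The helper name and the Citation tuple are hypothetical, and the sketch assumes sheet names are not quoted or escaped in the URI, which this README does not specify.

```python
from typing import NamedTuple


class Citation(NamedTuple):
    file: str
    sheet: str
    cell_range: str


def parse_source_uri(uri):
    """Split 'file.xlsx#Sheet!A1:F18' into its three parts."""
    file, sep, fragment = uri.partition("#")
    if not sep:
        raise ValueError(f"no '#' fragment in {uri!r}")
    # Split on the last '!' because a sheet name may itself contain '!'.
    sheet, sep, cell_range = fragment.rpartition("!")
    if not sep:
        raise ValueError(f"no 'sheet!range' fragment in {uri!r}")
    return Citation(file, sheet, cell_range)


citation = parse_source_uri("q4_forecast.xlsx#Revenue!A1:F18")
print(citation)
```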


Looking for a tiny, edge-runtime I/O library with write support? See hucre by @productdevbook. For an unbiased head-to-head on the 1054-workbook testBench corpus (perf numbers, extraction-count parity, where each side wins), see the wiki: ks-xlsx-parser vs hucre.


🎯 Who this is for

Teams shipping agents, RAG pipelines, or auditing tools that ingest Excel.

🏦
Banking & Finance
KPI extraction, formula lineage, regulator-ready citations
⚖️
Legal & Contracts
schedules, fee tables, covenant matrices without flattening merges
🏥
Healthcare & Insurance
normalise claims, pricing, and actuarial sheets into auditable JSON
🏗️
Real Estate & Construction
quantity takeoffs and cost models that still live in XLSX
📈
Sales Ops / HR / Engineering
"source of truth is a spreadsheet" → structured events, in minutes

Important

Not a fit if you need to execute Excel (recalculate, run VBA, pivot-refresh). Use xlwings or a headless Excel for that. ks-xlsx-parser reads; it doesn't run.


🧪 The testBench dataset

A 1054-workbook stress corpus ships under testBench/ and is round-tripped in CI on every commit. It's the easiest way to see whether the parser does the right thing on your kind of workbook.

| Group | Files | What it covers |
| --- | --- | --- |
| real_world/ | 8 | Real anonymised workbooks (financial, engineering, project tracking) |
| enterprise/ | 4 | Deterministic enterprise templates |
| github_datasets/ | 10 | Public datasets (iris, titanic, superstore, …) |
| stress/curated/ | 26 | 26 progressive stress levels authored by hand |
| stress/merges/ | 5 | Pathological merge patterns |
| generated/matrix/ | 297 | One feature per file across 18 categories |
| generated/combo/ | 400 | Deterministic feature cocktails (5 densities × 80 seeds) |
| generated/adversarial/ | 300 | Unicode bombs, circular refs, 32k-char cells, deep formula chains, sparse 1M-row sheets, 250-sheet workbooks |

make testbench-build   # regenerate testBench/generated/ (~1 minute)
make testbench         # 1054/1054 in ~70 seconds
make testbench-zip     # package as dist/testBench-vX.Y.Z.zip for a GitHub release

The zipped dataset is attached to every release — pull it if you don't want to clone the full repo.
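
If you would rather point the parser at your own folder of workbooks than at testBench/, a loop along these lines works as a rough stand-in for make testbench. The find_failures helper is hypothetical and is not the project's actual harness; the parse function is passed in so the loop stays independent of any particular parser.

```python
from pathlib import Path


def find_failures(root, parse):
    """Run `parse` on every .xlsx under `root`; return (path, error) pairs."""
    failures = []
    for path in sorted(Path(root).rglob("*.xlsx")):
        try:
            parse(path)
        except Exception as exc:  # broad on purpose: any crash is a finding
            failures.append((path, repr(exc)))
    return failures


# Example wiring, assuming ks-xlsx-parser is installed:
#   from ks_xlsx_parser import parse_workbook
#   for path, error in find_failures("my_workbooks", lambda p: parse_workbook(path=str(p))):
#       print(path, error)
```

Any file this surfaces is a candidate for a Parser edge case issue.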


🚧 Limitations

  • .xls not supported — only .xlsx and .xlsm (OOXML). Convert legacy files externally.
  • Pivot tables — detected but not fully parsed.
  • Sparklines — not extracted.
  • VBA macros — flagged but never executed or analysed.
  • External links — recorded but not resolved.
  • Threaded comments — only legacy comments are supported (openpyxl limitation).
  • Embedded OLE objects — detected but not extracted.
  • Locale-dependent number formats — not interpreted.

Full list in docs/PARSER_KNOWN_ISSUES.md.
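
Because legacy .xls files are rejected, it can help to sniff the format before handing bytes to the parser. The sketch below is a hypothetical pre-flight check, not part of the library: it relies on the OLE2 compound-file signature used by .xls and on the conventional xl/workbook.xml part inside an OOXML ZIP.

```python
import io
import zipfile

# Signature of OLE2 compound files, the container used by legacy .xls.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_format(data):
    """Classify raw bytes as 'ooxml', 'legacy-ole', or 'unknown'."""
    if data.startswith(OLE_MAGIC):
        return "legacy-ole"
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Conventional location of the main workbook part in .xlsx/.xlsm.
            if "xl/workbook.xml" in archive.namelist():
                return "ooxml"
    return "unknown"


print(sniff_format(OLE_MAGIC + b"\x00" * 64))
```

Note that password-protected .xlsx files are also wrapped in an OLE2 container, so a "legacy-ole" result means "not directly parseable as OOXML" rather than strictly "this is an .xls".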


🧰 Knowledge Stack ecosystem

ks-xlsx-parser is one piece of the Knowledge Stack open-source family — document intelligence for agents, built so that engineering teams can focus on agents and we handle the messy parts of enterprise data.

| Repo | What it does |
| --- | --- |
| ks-cookbook | 32 production-style flagship agents + recipes for LangChain, LangGraph, CrewAI, Temporal, the OpenAI Agents SDK, and any MCP client. |
| ks-xlsx-parser (this repo) | Turn .xlsx into LLM-ready JSON with citations and dependency graphs. |
| @knowledgestack | Follow the org for upcoming repos: parsers, extractors, and MCP servers for PDF, DOCX, PPTX, HTML, and more. |

Building on top of the stack? Tell us about it in Show & Tell or the #showcase channel on Discord.


📡 Stay in touch

  • 💬 Join the Discord — our main real-time channel. Roadmap, help, job postings, show-and-tell, and the occasional meme.
  • 🐙 Follow @knowledgestack on GitHub for new releases across the ecosystem.
  • 📣 Watch this repo (→ Releases only) to get pinged when ks-xlsx-parser ships an update.

If you'd rather just peek first: the testBench corpus of over a thousand workbooks ships with each release as a single zip. Pull it, diff it, file an issue if your Excel does something weirder than ours.


🙌 Contributing

We love contributions. Three paths, in order of speed-to-merge:

  1. Report a testBench failure — run make testbench, find a file that breaks, attach it to a Parser edge case issue.
  2. Add a new adversarial workbook — contribute a builder to scripts/build_testbench.py. Any file that makes the parser crash or lose information is welcome.
  3. Fix a flagged issue — see docs/PARSER_KNOWN_ISSUES.md.

Full dev loop, PR checklist, and code style in CONTRIBUTING.md. See the Code of Conduct and Security policy before posting.

If you don't have time to contribute but the project helped you, please star the repo. That's the main signal that keeps this maintained.


❓ FAQ

What is the best Python library to parse Excel (.xlsx) for LLMs?

ks-xlsx-parser is purpose-built for it. Unlike pandas or openpyxl, it preserves formulas with a directed dependency graph, merged regions, tables, charts, and conditional formatting, and emits token-counted chunks with source_uri citations an LLM can quote. pip install ks-xlsx-parser.

How do I parse Excel for a LangChain or LangGraph agent?

Call parse_workbook(path=...), then expose result.chunks as a LangChain @tool or a LangGraph ToolNode. Each chunk carries source_uri, render_text, token_count, and a dependency_summary — everything the agent needs to cite and reason.

How do I use Excel in a CrewAI or OpenAI-Agents-SDK agent?

Same pattern — wrap parse_workbook in whatever tool abstraction your framework provides (@tool in CrewAI, @function_tool in the OpenAI Agents SDK). The parser's output is framework-agnostic.
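
Whichever framework you use, the tool body usually boils down to turning chunks into a string the model can quote from. Here is a minimal, framework-agnostic sketch; chunks_to_tool_output is a hypothetical helper, and the SimpleNamespace objects stand in for real chunks, using only the source_uri and render_text fields named above.

```python
from types import SimpleNamespace


def chunks_to_tool_output(chunks):
    """Join chunks into one string, each body prefixed by its citation URI."""
    return "\n\n".join(f"[{c.source_uri}]\n{c.render_text}" for c in chunks)


# Stand-in chunks; in real use these would come from parse_workbook(path=...).chunks
chunks = [
    SimpleNamespace(source_uri="q4_forecast.xlsx#Revenue!A1:F18",
                    render_text="Region | Q3 | Q4"),
    SimpleNamespace(source_uri="q4_forecast.xlsx#Costs!A1:H40",
                    render_text="Item | Q3 | Q4"),
]
tool_output = chunks_to_tool_output(chunks)
print(tool_output)
```

Wrapping a function like this in your framework's tool decorator is then a one-line change; the bracketed URI prefix gives the model something exact to cite.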

Can Claude Desktop, Cursor, Windsurf, or another MCP client read Excel files?

Yes: install the API extra (pip install "ks-xlsx-parser[api]"), start the bundled FastAPI server with xlsx-parser-api, and call POST /parse. A native MCP server is on the Knowledge Stack roadmap.

How do I build a RAG pipeline over Excel spreadsheets?

Three steps: pip install ks-xlsx-parser, call parse_workbook() on each file, then result.serializer.to_vector_store_entries() to get id + text + metadata triples ready for Qdrant, pgvector, Weaviate, or Pinecone. Every entry has a content_hash for dedup and a source_uri the LLM cites in its answer.
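
The content_hash makes re-ingestion cheap: only entries whose hash has not been seen need embedding again. The sketch below assumes each entry is a dict with id, text, and a metadata dict holding content_hash; that shape is an assumption for illustration, so check the real output of to_vector_store_entries() before relying on it.

```python
def new_entries(entries, already_indexed):
    """Keep only entries whose content_hash has not been indexed yet."""
    fresh, seen = [], set(already_indexed)
    for entry in entries:
        digest = entry["metadata"]["content_hash"]  # assumed location of the hash
        if digest not in seen:
            seen.add(digest)
            fresh.append(entry)
    return fresh


# Hypothetical entries; hashes are placeholders, not real xxhash64 digests.
entries = [
    {"id": "1", "text": "Revenue table", "metadata": {"content_hash": "aaa"}},
    {"id": "2", "text": "Costs table", "metadata": {"content_hash": "bbb"}},
    {"id": "3", "text": "Revenue table", "metadata": {"content_hash": "aaa"}},
]
to_embed = new_entries(entries, already_indexed={"bbb"})
print([e["id"] for e in to_embed])
```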

How is ks-xlsx-parser different from openpyxl or pandas?

openpyxl and pandas give you a rectangle of values. ks-xlsx-parser gives you the full workbook graph: parsed formulas with dependency edges, merged regions, Excel ListObjects, all 7 chart types, every conditional-formatting rule type, and LLM chunks with citation URIs + token counts. It wraps openpyxl and uses lxml for the bits openpyxl loses.

Does ks-xlsx-parser run Excel formulas or macros?

No. The library reads .xlsx files; it never executes them. VBA macros are flagged but never run. External links are recorded but never resolved. ZIP-bomb and cell-count limits make it safe for untrusted uploads.
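
For a sense of what a ZIP-bomb guard involves, here is a minimal pre-check built on the standard library. It is an illustrative sketch and not the library's actual guard; the function name and both limits are arbitrary assumptions.

```python
import io
import zipfile


def check_zip_limits(data, max_total_bytes=500_000_000, max_ratio=200):
    """Raise ValueError if the archive declares a suspicious uncompressed size."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        total = 0
        for info in archive.infolist():
            total += info.file_size
            if info.compress_size and info.file_size / info.compress_size > max_ratio:
                raise ValueError(f"{info.filename}: compression ratio too high")
        if total > max_total_bytes:
            raise ValueError(f"declared size {total} exceeds limit")
    return total


# A 5 MB run of identical bytes deflates to a few KB, far past the ratio limit.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("xl/worksheets/sheet1.xml", b"0" * 5_000_000)

try:
    check_zip_limits(buffer.getvalue())
except ValueError as exc:
    print("rejected:", exc)
```

The sizes checked here come from the archive's own headers, which a hostile file can misreport, so a production guard would also cap the number of bytes actually decompressed.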

How fast is it?

The full 1054-workbook testBench round-trips in ~70 s on a single machine. A real 21k-cell, 13-sheet financial model parses in ~4.6 s; before the circular-ref caching fix in 0.1.1 the same file took 307 s. Sparse workbooks with extreme addresses parse in under 200 ms.


🔎 Also known as

Search queries this library answers: Python Excel parser for LLMs, XLSX to JSON for LangChain, Excel ingestion for LangGraph, spreadsheet reader for CrewAI, Excel tool for OpenAI Agents SDK, Excel for Claude Desktop, Excel for Cursor, Excel MCP server, openpyxl alternative for RAG, Excel dependency graph extractor, XLSX OOXML parser for AI, how to parse Excel for an LLM agent, how to feed a spreadsheet to ChatGPT, how to cite Excel cells in an LLM answer, best library to turn Excel into JSON, Python library for parsing formulas, Excel formula dependency traversal, document intelligence for spreadsheets, RAG over Excel files, Excel chunker with token counts, parse .xlsx for Qdrant / pgvector / Weaviate / Pinecone.


📜 License

MIT. Use it, fork it, ship it. Attribution appreciated but not required.

If you ship something built on top of ks-xlsx-parser, we'd love a Show & Tell post or a shoutout on Discord.
