XSQL is a SQL-style query language for static HTML. It treats each HTML element as a row in a node table and lets you filter by tag, attributes, and position. The project is now at v1.3.1 as an offline-first C++20 tool.
Build:
./build.sh
Run on a file:
./build/xsql --query "SELECT a FROM doc WHERE id = 'login'" --input ./data/index.html
Interactive mode:
./build/xsql --interactive --input ./data/index.html
import xsql

# Load a local HTML file and execute a query
doc = xsql.load("data/test.html")
result1 = xsql.execute("SELECT a.href FROM document WHERE href CONTAINS ANY ('http', 'https')")
print("result1:")
for row in result1.rows:
    print(row.get('href'))

# Load a remote HTML file with network access and execute a query
doc = xsql.load("https://example.com", allow_network=True)
result2 = xsql.execute("SELECT p FROM doc")
print("result2:")
for row in result2.rows:
    print(row)

Install:
pip install pyxsql
Security Notes:
- Network access is disabled by default; enable with `allow_network=True`.
- Private/localhost targets are blocked unless `allow_private_network=True`.
- File reads are confined to `base_dir` when provided.
- Downloads are capped by `max_bytes`, and query output by `max_results`.
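The private/localhost blocking rule above can be sketched in Python with the stdlib `ipaddress` and `socket` modules. This is an illustration of the policy, not pyxsql's actual code; `is_blocked_host` is a hypothetical helper name.

```python
import ipaddress
import socket

def is_blocked_host(host: str, allow_private_network: bool = False) -> bool:
    """Return True when a host resolves to a loopback/private/link-local
    address and private-network access has not been explicitly allowed."""
    if allow_private_network:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return True  # unresolvable hosts are rejected outright
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_loopback or ip.is_private or ip.is_link_local:
            return True
    return False
```

The check runs after DNS resolution, so a public hostname that resolves to a private address is still blocked.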
Linux (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install -y build-essential cmake ninja-build pkg-config bison flex
./build.sh
macOS (Homebrew):
brew install cmake ninja pkg-config bison flex
./build.sh
Windows (PowerShell, MSVC):
cmake -S . -B build -DCMAKE_TOOLCHAIN_FILE=C:\vcpkg\scripts\buildsystems\vcpkg.cmake -DVCPKG_TARGET_TRIPLET=x64-windows
cmake --build build --config Release
Optional dependencies via vcpkg:
vcpkg install nlohmann-json libxml2 curl arrow[parquet]
If you do not want Parquet, configure with -DXSQL_WITH_ARROW=OFF.
Create a virtual environment and install the editable package:
python3 -m venv xsql_venv
source ./xsql_venv/bin/activate
pip install -U pip
pip install -e .[test]
Run Python tests:
pytest -v python/tests
Shorthand:
./install_python.sh
./test_python.sh
./build/xsql --query "<query>" --input <path>
./build/xsql --query-file <file> --input <path>
./build/xsql --interactive [--input <path>]
./build/xsql --mode duckbox|json|plain
./build/xsql --display_mode more|less
./build/xsql --highlight on|off
./build/xsql --color=disabled
Notes:
- `--input` is required unless reading HTML from stdin.
- Colors are auto-disabled when stdout is not a TTY.
- Default output mode is `duckbox` (table-style).
- Duckbox output prints a footer row count.
- `--highlight` only affects duckbox headers (auto-disabled when not a TTY).
- `--display_mode more` disables JSON truncation in non-interactive mode.
- `TO CSV()` / `TO PARQUET()` write files instead of printing results.
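The TTY-based auto-disable behavior for colors and highlighting can be sketched as follows. This is illustrative only; `use_color` is a hypothetical helper, not XSQL's actual implementation.

```python
import sys

def use_color(stream=sys.stdout, flag: str = "auto") -> bool:
    """Decide whether to emit ANSI colors: an explicit --color=disabled
    always wins; otherwise colors require the stream to be a real TTY."""
    if flag == "disabled":
        return False
    return hasattr(stream, "isatty") and stream.isatty()
```

Piping output through `less` or redirecting to a file makes `isatty()` return false, which is why colors disappear automatically in those cases.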
Commands:
- `.help`: show help
- `.load <path|url> [--alias <name>]` / `:load <path|url> [--alias <name>]`: load input (path or URL)
- `.mode duckbox|json|plain`: set output mode
- `.display_mode more|less`: control JSON truncation
- `.max_rows <n|inf>`: set duckbox max rows (`inf` = unlimited)
- `.reload_config`: reload REPL config from disk
- `.summarize [doc|alias|path|url]`: list all tags and counts for the active input or target
- `.quit` / `.q` / `:quit` / `:exit`: exit the REPL
Keys:
- Up/Down: history (max 5 entries)
- Left/Right: move cursor
- Ctrl+L: clear screen
- Tab: autocomplete keywords/functions/commands
Tip:
- Use `.load --alias doc1` to register multiple sources and query them via `FROM doc1`.
Install the plugin (REPL):
.plugin install number_to_khmer
.plugin load number_to_khmer
Enable at build time (built-in commands):
cmake -S . -B build -DXSQL_ENABLE_KHMER_NUMBER=ON
cmake --build build
REPL commands:
- `.number_to_khmer <number> [--compact] [--khmer-digits]`
- `.khmer_to_number <khmer_text> [--khmer-digits]`
Example:
xsql> .number_to_khmer 12.30
ដប់-ពីរ-ក្បៀស-បី-សូន្យ
xsql> .number_to_khmer --compact 12.30
ដប់ពីរក្បៀសបីសូន្យ
xsql> .number_to_khmer --khmer-digits 12.30
១២.៣០
xsql> .khmer_to_number ដក-មួយ-រយ-ក្បៀស-ប្រាំ-សូន្យ
-100.50
xsql> .khmer_to_number --khmer-digits ដប់-ពីរ
១២
Formatting rules:
- Tokens are joined with `-` (the parser also accepts whitespace).
- Reverse parsing also accepts concatenated Khmer number words without separators.
- Decimal marker token: `ក្បៀស`
- Negative marker token: `ដក`
- Integer zeros are omitted unless the entire integer part is zero (`សូន្យ`).
- Decimal digits are emitted one-by-one and preserved (including trailing zeros).
- `--khmer-digits` outputs Khmer digits (0-9 => ០-៩) with `.` as the decimal point.
- Scales are defined up to 10^36; the module is CLI-only for now.
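The `--khmer-digits` mode (0-9 => ០-៩, with `.` and `-` passing through) is a plain character translation. A Python sketch of that mapping, with hypothetical helper names:

```python
# Map ASCII digits 0-9 to Khmer digits ០-៩ (U+17E0..U+17E9).
# The decimal point and minus sign pass through unchanged.
KHMER_DIGITS = str.maketrans("0123456789", "០១២៣៤៥៦៧៨៩")

def to_khmer_digits(number: str) -> str:
    return number.translate(KHMER_DIGITS)

def from_khmer_digits(text: str) -> str:
    # Inverse mapping: Khmer digit -> ASCII digit; other chars kept as-is.
    ascii_digits = {v: k for k, v in zip("0123456789", "០១២៣៤៥៦៧៨៩")}
    return "".join(ascii_digits.get(ch, ch) for ch in text)
```

This reproduces the REPL example above: `12.30` becomes `១២.៣០`.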
This optional plugin renders dynamic pages in headless Chromium via Playwright (Node)
and saves the HTML for .load.
Setup:
cd plugins/playwright_fetch
npm install
npx playwright install chromium
cmake -S plugins/cmake/playwright_fetch -B plugins/cmake/playwright_fetch/build -DXSQL_ROOT=/path/to/XSQL
cmake --build plugins/cmake/playwright_fetch/build
Usage:
./build/xsql --interactive
.plugin load playwright_fetch
.fetch "https://example.com/app" --out /tmp/page.html --state /tmp/state.json
.load /tmp/page.html
Options:
- `.fetch <url>` fetches and renders the page.
- `--out <path>` writes HTML output (default: `<cache>/last.html`).
- `--state [path]` loads/saves storage state (default: `<cache>/state.json` when provided).
- `--cache-dir <path>` sets the cache directory.
- `--timeout <ms>` sets navigation timeout (default: `60000`).
- `--wait <selector>` waits for a CSS selector after navigation.
- `--headed` launches a visible browser for manual interaction.
- `--pause` waits for Enter before capturing HTML (implies `--headed`).
- `--clean` deletes the cache directory.
Cache directory:
- Defaults to `$XDG_CACHE_HOME/xsql/playwright_fetch` or `~/.cache/xsql/playwright_fetch`.
- Override with `XSQL_FETCH_CACHE_DIR` or `--cache-dir`.
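The documented precedence (explicit override, then XDG, then the home fallback) can be sketched as below; `default_cache_dir` is a hypothetical helper, not the plugin's actual code.

```python
import os

def default_cache_dir(env=os.environ) -> str:
    """Resolve the fetch cache directory with the documented precedence:
    XSQL_FETCH_CACHE_DIR, then $XDG_CACHE_HOME/xsql/playwright_fetch,
    then ~/.cache/xsql/playwright_fetch."""
    override = env.get("XSQL_FETCH_CACHE_DIR")
    if override:
        return override
    xdg = env.get("XDG_CACHE_HOME")
    if xdg:
        return os.path.join(xdg, "xsql", "playwright_fetch")
    return os.path.join(os.path.expanduser("~"), ".cache", "xsql", "playwright_fetch")
```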
Environment overrides:
- `XSQL_NODE` to select the Node.js binary.
- `XSQL_PLAYWRIGHT_FETCH_SCRIPT` to point to a custom `fetch.js`.
The REPL reads a TOML config file at $XDG_CONFIG_HOME/xsql/config.toml
or ~/.config/xsql/config.toml. Reload it with .reload_config.
Example:
[repl]
output_mode = "duckbox"
display_mode = "more"
max_rows = 40
highlight = true

[repl.history]
max_entries = 500
path = "~/.local/state/xsql/history"

Each HTML element becomes a row with these fields:
- `node_id` (int64)
- `tag` (string)
- `attributes` (map<string,string>)
- `parent_id` (int64 or null)
- `sibling_pos` (int64, 1-based position among siblings)
- `source_uri` (string)
Notes:
- `source_uri` is stored for provenance but hidden from default output unless multiple sources appear.
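The element-as-row idea can be sketched with Python's stdlib `html.parser`. This is illustrative only: XSQL itself is C++ (with libxml2 when available), and this toy version ignores void elements such as `<br>`.

```python
from html.parser import HTMLParser

class NodeTable(HTMLParser):
    """Build a flat node table (one row per element), mirroring the
    documented schema: node_id, tag, attributes, parent_id, sibling_pos."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._stack = []     # node_ids of currently open elements
        self._children = {}  # parent_id -> number of children seen so far

    def handle_starttag(self, tag, attrs):
        parent = self._stack[-1] if self._stack else None
        pos = self._children.get(parent, 0) + 1
        self._children[parent] = pos
        node_id = len(self.rows) + 1
        self.rows.append({
            "node_id": node_id,
            "tag": tag,
            "attributes": dict(attrs),
            "parent_id": parent,
            "sibling_pos": pos,
        })
        self._stack.append(node_id)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()
```

Feeding `<div id='a'><p>x</p><p>y</p></div>` yields one row per element, with the second `p` at `sibling_pos` 2, which is exactly the shape the WHERE clause filters operate on.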
Railroad diagrams are auto-generated for the current parser. Rebuild with
python3 docs/generate_diagrams.py after installing docs/requirements.txt.
Notes:
- Tag-only selections cannot mix with projected fields.
- Shorthand attributes (e.g., `href = '...'`) are parsed as `attributes.href`.
- Semicolon is optional, but the REPL requires it to end a statement on a single line.
SELECT <tag_list> FROM <source> [WHERE <expr>] [LIMIT <n>]
[TO LIST() | TO TABLE([HEADER=ON|OFF][, EXPORT='file.csv']) | TO CSV('file.csv') | TO PARQUET('file.parquet')]
FROM document
FROM 'path.html'
FROM 'https://example.com' (URL fetching requires libcurl)
FROM RAW('<div class="card"></div>')
FROM FRAGMENTS(RAW('<ul><li>1</li><li>2</li></ul>')) AS frag
FROM FRAGMENTS(SELECT inner_html(div) FROM doc WHERE class = 'pagination') AS frag
FROM doc (alias for document)
FROM document AS doc
Notes:
- `RAW('<html>')` parses an inline HTML string as the document source.
- `FRAGMENTS(...)` builds a temporary document by concatenating HTML fragments.
- `FRAGMENTS` accepts either `RAW('<html>')` or a subquery returning a single HTML string column (use `inner_html(...)`).
- `FRAGMENTS` subqueries cannot use file or URL sources.
SHOW INPUT;
SHOW INPUTS;
SHOW FUNCTIONS;
SHOW AXES;
SHOW OPERATORS;
DESCRIBE doc;
DESCRIBE language;
Notes:
- `SHOW INPUT` reports the active source.
- `SHOW INPUTS` lists distinct sources from the last result (or the active source if none).
- `DESCRIBE doc` shows the base schema; axes are documented via `SHOW AXES`.
- `DESCRIBE language` lists the SQL language surface as a single table.
SELECT div
SELECT div,span
SELECT *
Exclude columns:
SELECT * EXCLUDE source_uri FROM doc
SELECT * EXCLUDE (source_uri, tag) FROM doc
parent
child
ancestor
descendant
Supported operators:
- `=`
- `IN`
- `<>` / `!=`
- `IS NULL` / `IS NOT NULL`
- `~` (regex, ECMAScript)
- `CONTAINS` (attributes only, case-insensitive)
- `HAS_DIRECT_TEXT` (case-insensitive substring match on direct text)
- `AND`, `OR`
Attribute references (shorthand):
id = 'main'
parent.class = 'menu'
child.href <> ''
ancestor.id = 'root'
descendant.class IN ('nav','top')
href CONTAINS 'example'
Field references:
text <> ''
tag = 'div'
parent.tag = 'section'
child.tag = 'a'
ancestor.text ~ 'error|warning'
div HAS_DIRECT_TEXT 'login'
sibling_pos = 2
Shorthand attribute filters (default):
title = "Menu"
doc.title = "Menu"
Longhand attribute filters (optional for clarity):
attributes.title = "Menu"
doc.attributes.title = "Menu"
Alias the source and qualify attribute filters:
SELECT a FROM document AS d WHERE d.id = 'login'
Project a field from a tag:
SELECT a.parent_id FROM doc
SELECT link.href FROM doc
SELECT a.attributes FROM doc
SELECT div(node_id, tag, parent_id) FROM doc
Supported base fields:
`node_id`, `tag`, `parent_id`, `sibling_pos`, `max_depth`, `doc_order`, `source_uri`, `attributes`
Attribute value projection:
- `SELECT link.href FROM doc` returns the `href` value
Function projection:
- `SELECT inner_html(div) FROM doc` returns the raw inner HTML for each `div`
- `SELECT inner_html(div, 1) FROM doc` keeps only tags up to depth 1 (drops deeper tags)
- `SELECT trim(inner_html(div)) FROM doc` trims leading/trailing whitespace
- `SELECT TEXT(div) FROM doc WHERE tag = 'div'` returns descendant text for each `div`
- `SELECT FLATTEN_TEXT(div) AS (c1, c2) FROM doc WHERE descendant.tag IN ('p','span')` flattens text at the deepest depth
Notes:
- `TEXT()` and `INNER_HTML()` require a `WHERE` clause with a non-tag filter (e.g., attributes or parent).
- `attributes IS NULL` matches elements with no attributes.
- `FLATTEN_TEXT()` defaults to all descendant elements (and respects `descendant.tag`); missing columns are padded with NULL and extra values are truncated.
- `FLATTEN_TEXT()` uses `descendant.tag = '...'` / `IN (...)` and `descendant.attributes.<attr>` with `=`, `IN`, `CONTAINS`, `CONTAINS ALL`, or `CONTAINS ANY` to filter flattened elements.
- `FLATTEN_TEXT()` extracts direct text from the matched element and falls back to inline descendant text when direct text is empty or whitespace-only; output is trimmed and whitespace-collapsed. When depth is omitted, empty-text nodes are skipped.
- `FLATTEN_TEXT(base, depth)` targets the exact element depth from `base` (0 = base itself).
- `FLATTEN_TEXT(base)` defaults to a single output column named `flatten_text` when no `AS (...)` list is provided.
- `FLATTEN()` is an alias of `FLATTEN_TEXT()`.
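The direct-text-with-fallback rule can be modeled in a few lines of Python. This is a rough sketch over an HTML string (the real engine works on the parsed node table); `flatten_text` here is a hypothetical stand-in.

```python
import re
from html.parser import HTMLParser

class DirectText(HTMLParser):
    """Separate the direct text of the outermost element from text
    that belongs to nested descendants."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.direct = []
        self.nested = []

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        (self.direct if self.depth == 1 else self.nested).append(data)

def flatten_text(fragment: str) -> str:
    """Direct text first; fall back to descendant text when direct text is
    empty or whitespace-only. Output is trimmed and whitespace-collapsed."""
    p = DirectText()
    p.feed(fragment)
    text = "".join(p.direct)
    if not text.strip():
        text = "".join(p.nested)
    return re.sub(r"\s+", " ", text).strip()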
Output a JSON list for a single projected column:
SELECT link.href FROM doc WHERE rel = "preload" TO LIST()
Extract an HTML <table> into rows (array of arrays). By default the first row
is treated as column headers for duckbox rendering; set HEADER=OFF to render
all rows as data (CSV exports will include generated col1..colN headers):
SELECT table FROM doc TO TABLE()
SELECT table FROM doc TO TABLE(HEADER=OFF)
SELECT table FROM doc WHERE id = 'stats' TO TABLE(EXPORT='stats.csv')
If multiple tables match, the output is a list of objects:
[{ "node_id": 123, "rows": [[...], ...] }, ...]
Note: TO LIST() always returns JSON output. TO TABLE() uses duckbox by default and JSON in --mode json|plain.
EXPORT='file.csv' requires a single table result, so filter by node_id or attributes when multiple tables match.
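The row extraction that `TO TABLE()` performs can be approximated with the stdlib `html.parser`: walk `tr`/`td`/`th` tags and collect cell text. A sketch (hypothetical `table_rows` helper; real XSQL handles nested markup and multiple tables):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Extract an HTML <table> into a list of rows (lists of cell strings)."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def table_rows(html: str, header: bool = True):
    """HEADER=ON treats the first row as column headers, like TO TABLE()."""
    p = TableRows()
    p.feed(html)
    if header and p.rows:
        return {"header": p.rows[0], "rows": p.rows[1:]}
    return {"header": None, "rows": p.rows}
```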
Write any rectangular result to a CSV file:
SELECT a.href, a.text FROM doc WHERE href IS NOT NULL TO CSV('links.csv')
Write any rectangular result to a Parquet file (requires Apache Arrow feature):
SELECT * FROM doc TO PARQUET('nodes.parquet')
Note: TO CSV() and TO PARQUET() write files and do not print the result set.
If you SELECT table ... TO CSV(...), XSQL exports the HTML table rows directly (legacy).
Prefer TO TABLE(EXPORT='file.csv') for explicit table exports.
SELECT a FROM doc LIMIT 5
Minimal aggregate:
SELECT COUNT(a) FROM doc
SELECT COUNT(*) FROM doc
SELECT COUNT(link) FROM doc WHERE rel = "preload"
Use ~ with ECMAScript regex:
SELECT a FROM doc WHERE href ~ '.*\\.pdf$'
Case-insensitive substring match for attribute values:
SELECT a FROM doc WHERE href CONTAINS 'techkhmer'
SELECT a FROM doc WHERE href CONTAINS ALL ('https', '.html')
SELECT a FROM doc WHERE href CONTAINS ANY ('https', 'mailto')
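The semantics of `CONTAINS`, `CONTAINS ALL`, and `CONTAINS ANY` reduce to case-insensitive substring tests. A Python sketch with hypothetical helper names:

```python
def contains(value: str, needle: str) -> bool:
    """CONTAINS: case-insensitive substring match on an attribute value."""
    return needle.lower() in value.lower()

def contains_all(value: str, needles) -> bool:
    """CONTAINS ALL: every needle must match."""
    return all(contains(value, n) for n in needles)

def contains_any(value: str, needles) -> bool:
    """CONTAINS ANY: at least one needle must match."""
    return any(contains(value, n) for n in needles)
```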
Case-insensitive substring match on direct text only (excluding nested tags):
SELECT div FROM doc WHERE div HAS_DIRECT_TEXT 'computer science'
Compute per-node TF-IDF scores across the matched nodes. Each matched node is treated as a document in the IDF corpus.
SELECT TFIDF(p, li, TOP_TERMS=30, MIN_DF=1, MAX_DF=0, STOPWORDS=ENGLISH)
FROM doc WHERE class = 'article'
Output columns: node_id, parent_id, tag, terms_score (term → score map).
Options:
- `TOP_TERMS` (default 30): max terms per node.
- `MIN_DF` (default 1): minimum document frequency.
- `MAX_DF` (default 0 = no max): maximum document frequency.
- `STOPWORDS` (`ENGLISH` or `NONE`, default `ENGLISH`).
Notes:
- Tags must come before options inside `TFIDF(...)`.
- TFIDF is an aggregate and must be the only select item.
- TFIDF ignores HTML tags and skips script/style/noscript content.
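XSQL's exact weighting formula is not documented here; the sketch below uses a standard smoothed tf-idf variant to show the shape of the computation, with each matched node's text treated as one document in the corpus, as described above. `tfidf` and its stopword list are illustrative assumptions.

```python
import math
import re
from collections import Counter

ENGLISH_STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def tfidf(docs, top_terms=30, min_df=1, max_df=0, stopwords=ENGLISH_STOPWORDS):
    """Return one {term: score} map per document (smoothed idf variant).
    max_df=0 means no maximum, mirroring the MAX_DF option above."""
    tokenized = [
        [t for t in re.findall(r"[a-z0-9]+", d.lower()) if t not in stopwords]
        for d in docs
    ]
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    results = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {}
        for term, count in tf.items():
            if df[term] < min_df or (max_df and df[term] > max_df):
                continue
            idf = math.log((1 + n) / (1 + df[term])) + 1
            scores[term] = (count / max(len(toks), 1)) * idf
        results.append(dict(sorted(scores.items(), key=lambda kv: -kv[1])[:top_terms]))
    return results
```

Terms appearing in fewer documents get a higher idf, so node-specific vocabulary floats to the top of each map.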
-- Filters
SELECT ul FROM doc WHERE id = 'countries';
SELECT table FROM doc WHERE parent.id = 'table-01';
SELECT div FROM doc WHERE descendant.class = 'card';
SELECT span FROM doc WHERE parent_id = 1;
SELECT span FROM doc WHERE node_id = 1;
SELECT div FROM doc WHERE attributes IS NULL;
-- Lists and exports
SELECT link.href FROM doc WHERE rel = "preload" TO LIST();
SELECT a.href, a.text FROM doc WHERE href IS NOT NULL TO CSV('links.csv');
SELECT * FROM doc TO PARQUET('nodes.parquet');
-- Fragments
SELECT li FROM FRAGMENTS(SELECT inner_html(ul) FROM doc WHERE id = 'menu') AS frag;
-- Ordering
SELECT div FROM doc ORDER BY node_id DESC;
SELECT * FROM doc ORDER BY tag, parent_id LIMIT 10;
-- Summaries
SELECT summarize(*) FROM doc;
SELECT summarize(*) FROM doc ORDER BY count DESC LIMIT 5;
| Item | Priority |
|---|---|
| Plugin for dynamic websites (headless browser fetch/render) | Highest |
| DOM mutation: UPDATE / INSERT / DELETE | High |
| DOM mutation: attribute ops (SET, REMOVE) | High |
| DOM mutation: content ops (SET TEXT, SET INNER_HTML) | High |
| DB bridge: TO DB() / FROM DB() (XSQL stays front-end) | High |
| Multi-source JOIN (doc ↔ doc, doc ↔ table) | Medium |
| WITH / subqueries for reuse | Medium |
| GROUP BY + HAVING + DISTINCT | Medium |
| Performance profiling + hot-path optimizations | Medium |
| Session cache for parsed DOM (if a clear use case appears) | Low |
| BM25 ranking (simple explainer on demand) | Exploration |
- No XPath or positional predicates.
- `ORDER BY` is limited to `node_id`, `tag`, `text`, or `parent_id`.
- No `GROUP BY` or joins.
- No XML mode (HTML only).
- URL fetching requires libcurl.
- Default output is duckbox tables; JSON output is available via `--mode json`.
- `TO PARQUET()` requires Apache Arrow support at build time.
Optional:
- `nlohmann/json` for pretty JSON output (vcpkg recommended).
- `libxml2` for robust HTML parsing (fallback to naive parser if missing).
- `libcurl` for URL fetching.
- `apache-arrow` (Arrow/Parquet) for `TO PARQUET()` export.
- If you see `No input loaded` in the REPL, run `:load <path|url>`.
- If a query fails with `Expected FROM`, include a `FROM` clause.
- If output is compact JSON, ensure `nlohmann/json` is linked via vcpkg.