
st fetch

b2o2i edited this page Apr 17, 2026 · 7 revisions

st-fetch — Import Any Source into the Cross Pipeline

st-fetch brings external content into Cross as a data entry and, by default, immediately chains to st-prep, so the imported material is cleaned, titled, and ready for fact-checking or publishing without any extra steps.

Five source types are supported: a tweet by ID, a local .txt or .md file, a PDF file (auto-detected by extension or magic bytes), a web page by URL, or text pasted from the clipboard. The source type is recorded in make/model so every entry stays traceable.
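The PDF auto-detection mentioned above can be pictured roughly as follows. This is a minimal sketch rather than st-fetch's actual code: the function name `looks_like_pdf` is hypothetical, and it assumes the magic-byte check reads only the first four bytes of the file.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF"  # every PDF file starts with these four bytes


def looks_like_pdf(path: str) -> bool:
    """Return True if the path ends in .pdf or the file starts with %PDF.

    Hypothetical helper for illustration; st-fetch's real check may differ.
    """
    p = Path(path)
    if p.suffix.lower() == ".pdf":
        return True
    try:
        # Fall back to sniffing the header for files with other extensions.
        with p.open("rb") as fh:
            return fh.read(len(PDF_MAGIC)) == PDF_MAGIC
    except OSError:
        # Unreadable or missing file: cannot confirm it is a PDF.
        return False
```

Checking the extension first avoids opening the file in the common case; the header check catches PDFs saved under a different name.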

[Figure: st-fetch concept diagram]


Sources

| Source | Flag | Notes |
| --- | --- | --- |
| X / Twitter post | tweet_id (positional) | Requires X_COM_BEARER_TOKEN in .env |
| Plain text or Markdown file | --file PATH | .txt, .md; content imported as-is |
| PDF file | --file PATH | Auto-detected by .pdf extension or %PDF magic bytes; requires pymupdf4llm (see below) |
| Web page | --url URL | Scrapes visible text; strips nav/script/footer noise |
| Clipboard | --clipboard | macOS (pbpaste), Linux (xclip/xsel), Windows (PowerShell) |
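To make the per-platform clipboard row concrete, here is a rough sketch of how a reader command might be chosen. It is an illustration only: the function names are hypothetical, the exact arguments st-fetch passes to each tool are not documented here, and the `Get-Clipboard` invocation on Windows is an assumption about how PowerShell is used.

```python
import shutil
import subprocess
import sys


def clipboard_command(platform: str, have=shutil.which) -> list[str]:
    """Pick a clipboard-reading command for the given sys.platform value.

    Hypothetical helper; `have` is injectable so the lookup can be tested.
    """
    if platform == "darwin":
        return ["pbpaste"]
    if platform.startswith("linux"):
        # Prefer xclip, fall back to xsel if only that is installed.
        if have("xclip"):
            return ["xclip", "-selection", "clipboard", "-o"]
        if have("xsel"):
            return ["xsel", "--clipboard", "--output"]
        raise RuntimeError("install xclip or xsel to read the clipboard")
    if platform.startswith("win"):
        return ["powershell", "-NoProfile", "-Command", "Get-Clipboard"]
    raise RuntimeError(f"unsupported platform: {platform}")


def read_clipboard() -> str:
    """Run the selected command and return its standard output as text."""
    cmd = clipboard_command(sys.platform)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

On Linux, neither xclip nor xsel is guaranteed to be installed, which is why the sketch raises a clear error instead of returning empty text.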

PDF import

PDF files are converted to structured Markdown using pymupdf4llm, preserving headings, bold/italic, lists, and tables. The title field is resolved from PDF metadata first, then from the first # heading, then from the first non-empty line of text. If very little text is extracted, the tool warns that the file may be scanned and suggests running OCR first.
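The title fallback order described above could look something like this. It is a sketch under assumptions: `resolve_title` is a hypothetical name, the inputs are taken to be the PDF metadata title (possibly empty) and the converted Markdown, and stripping leading `#` characters from the last-resort line is a choice made here, not documented behaviour.

```python
def resolve_title(metadata_title, markdown: str) -> str:
    """Resolve a title: metadata, then first '# ' heading, then first text line.

    Hypothetical helper illustrating the fallback order only.
    """
    # 1. PDF metadata title, if present and not blank.
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()

    lines = [ln.strip() for ln in markdown.splitlines()]

    # 2. First level-one Markdown heading anywhere in the document.
    for ln in lines:
        if ln.startswith("# "):
            return ln[2:].strip()

    # 3. First non-empty line, with any leading heading markers removed.
    for ln in lines:
        text = ln.lstrip("#").strip()
        if text:
            return text

    return ""
```

For example, `resolve_title("", "intro line\n\n# Real Title\nbody")` returns `"Real Title"`, because a level-one heading takes priority over an earlier plain line.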

pymupdf4llm is a lazy dependency — only required when fetching a PDF:

# pipx install
pipx inject cross-st pymupdf4llm

# venv / plain pip
pip install pymupdf4llm

Pipeline

st-fetch stores the imported content as a data[] entry in the container. By default it immediately runs st-prep, which cleans the text and creates a story[] entry — making the content available to every other st-* command without any further manual steps.

source  →  st-fetch  →  article.json  →  st-prep  →  st-fact / st-post / st

Use --no-prep to store the raw data entry only and run st-prep yourself later.


Examples

st-fetch <tweet_id> article.json              # import an X / Twitter post
st-fetch --file report.txt article.json       # import a plain text file
st-fetch --file report.md article.json        # import a Markdown file
st-fetch --file paper.pdf article.json        # import a PDF
st-fetch --url https://... article.json       # scrape a web page
st-fetch --clipboard article.json             # import from clipboard

st-fetch --file paper.pdf article.json --no-prep     # store raw data only
st-fetch <tweet_id> article.json --no-cache          # bypass cache, fetch live

Full import-and-publish pipeline:

st-fetch --file report.pdf article.json   # import PDF → auto-runs st-prep
st-fact article.json                      # AI fact-check
st-post article.json                      # publish to Discourse

Options

| Option | Description |
| --- | --- |
| tweet_id | Tweet / X post ID to fetch (numeric ID from post URL) |
| file.json | Path to the .json container |
| --file PATH | Import a .txt, .md, or .pdf file from disk |
| --url URL | Fetch a web page and extract its visible text |
| --clipboard | Import text from the system clipboard |
| --cache | Enable API cache (default: on) |
| --no-cache | Disable API cache; always fetch live |
| --prep | Run st-prep after fetching (default: on) |
| --no-prep | Skip st-prep; store as raw data entry only |
| -v, --verbose | Verbose output |
| -q, --quiet | Minimal output |

Related: st-prep st-fact st-post st-gen


For developers

AI_MAKE is set to "st-fetch" and model records the source type: "file", "pdf", "clipboard", "x.com", or the URL domain (e.g. "bbc.co.uk"). gen_response includes a format key for PDF entries ("pdf") and a pages count. Twitter/X fetches require X_COM_BEARER_TOKEN in .env. Entries are deduplicated by MD5 hash before writing.
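As a rough illustration of the deduplication and source-type conventions above, the sketch below hashes entry text with MD5 and derives a model value from a URL. Everything here is assumed for the example: the helper names, the entry field names (`make`, `model`, `md5`, `text`), hashing the text itself rather than some other payload, and stripping a leading `www.` to get the domain. The real container schema may differ.

```python
import hashlib
from urllib.parse import urlparse


def content_hash(text: str) -> str:
    """MD5 hex digest of the text, used as a fingerprint, not for security."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def add_entry(container: dict, text: str, model: str) -> bool:
    """Append a data[] entry unless one with the same hash already exists.

    Returns True if the entry was added, False if it was a duplicate.
    Field names are hypothetical.
    """
    entries = container.setdefault("data", [])
    digest = content_hash(text)
    if any(e.get("md5") == digest for e in entries):
        return False
    entries.append({"make": "st-fetch", "model": model, "md5": digest, "text": text})
    return True


def model_for_url(url: str) -> str:
    """Derive the model value from a URL's host, dropping a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host
```

With these assumptions, importing the same text twice leaves a single entry in `data`, and `model_for_url("https://www.bbc.co.uk/news/x")` gives `"bbc.co.uk"`.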
