A Python CLI for turning German webpages into English copies. The pipeline fetches HTML, extracts translatable content, calls the Hugging Face Inference API (MarianMT models), and writes both the original and translated pages to disk. Batching, caching, and retry logic keep remote calls and latency under control.
Static output viewer (generated from real translation runs):
- HF backend – uses `Helsinki-NLP/opus-mt-{src}-{dst}` translation models via the Hugging Face Inference API.
- Batching + retries – groups texts per request with exponential backoff and jitter (httpx + tenacity).
- SQLite cache – stores normalized strings in `translation_cache.db` to avoid re-translating shared fragments such as headers and footers.
- Skip heuristics – ignores punctuation-only strings, empty nodes, and text that already looks English when targeting `en`.
- Whitespace-safe application – preserves the DOM structure while inserting translations and adjusting `/de/` links to `/en/`.
- Health check – `python -m src.main --check` verifies API reachability before running a full job.
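The skip heuristics can be sketched roughly like this; the function name, character classes, and stopword list are illustrative assumptions, not the actual `translator.py` code:

```python
import string

def should_skip(text: str, target_lang: str = "en") -> bool:
    """Heuristically decide whether a fragment needs no translation."""
    stripped = text.strip()
    if not stripped:                       # empty or whitespace-only node
        return True
    if all(ch in string.punctuation + string.whitespace + string.digits
           for ch in stripped):            # punctuation/number-only string
        return True
    if target_lang == "en":
        # Crude "already English" check: no German umlauts/eszett and
        # at least one very common English stopword present.
        if not any(ch in "äöüÄÖÜß" for ch in stripped):
            words = {w.lower() for w in stripped.split()}
            if words & {"the", "and", "of", "to", "is"}:
                return True
    return False
```

Skipped fragments bypass both the cache lookup and the remote call entirely.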
```
translate/
├── src/
│   ├── fetcher.py       # HTTP fetching with retry/backoff wrappers
│   ├── parser.py        # BeautifulSoup extraction of texts and attributes
│   ├── translator.py    # HF batching, caching, skip heuristics
│   ├── hf_client.py     # httpx client with retry/backoff + error taxonomy
│   ├── cache.py         # SQLite cache implementation
│   ├── writer.py        # Apply translations & rewrite links
│   ├── batch.py         # Batch sitemap processing orchestrator
│   └── main.py          # CLI entry point
├── tests/               # pytest suite (unit + smoke integration)
├── output/              # Generated HTML (auto-detected de_NEW / en_NEW, archives under output/archive/)
├── translation_cache.db # Created on first run when caching is enabled
└── requirements.txt
```
- System deps (for `lxml` on Debian/Ubuntu): `sudo apt-get install libxml2-dev libxslt1-dev`
- Virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```

- Hugging Face token:
  - Create a free Read token at https://huggingface.co/settings/tokens
  - Either export it or create a `.env` file (loaded automatically by the CLI):

    ```bash
    export HF_API_TOKEN="hf_xxx"
    # or
    echo "HF_API_TOKEN=hf_xxx" > .env
    ```
Confirm connectivity and credentials before heavier runs:

```bash
python -m src.main --check
```

Exit code 0 means the backend responded successfully; otherwise the CLI prints a diagnostic message.
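Conceptually, the check is a single tiny request against the standard HF inference endpoint. The stdlib-only sketch below illustrates the idea; the real client uses httpx + tenacity, and the injectable `opener` parameter is an assumption for testability, not part of `hf_client.py`:

```python
import json
import urllib.error
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/Helsinki-NLP/opus-mt-de-en"

def check_backend(token: str, opener=urllib.request.urlopen,
                  timeout: float = 10.0) -> bool:
    """POST a one-word payload; True means the backend answered with HTTP 200."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"inputs": "Hallo"}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    try:
        with opener(req, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Any non-200 status or transport error maps to a non-zero exit code.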
```bash
python -m src.main --url https://www.landsiedel.com/de/was-ist-nlp.html --output-dir output
```

Writes `output/de/...` (original) and `output/en/...` (translated) after applying translations, rewriting `/de/` links, and setting `lang="en"`.
```bash
python -m src.main --sitemap sitemap.json --output-dir output --limit 5 --delay 1.0
```

- Supports JSON (`[{"url": "..."}]` or `loc` fields) and XML sitemaps.
- Deduplicates URLs and filters to `www.landsiedel.com` paths containing `/de/`.
- Produces a summary and `failed_urls.txt` if anything goes wrong.
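The URL extraction, deduplication, and filtering steps above can be sketched as follows; the function name is illustrative, and the real orchestration lives in `batch.py`:

```python
import json
import xml.etree.ElementTree as ET

def extract_urls(raw: str) -> list[str]:
    """Accept JSON ([{"url": ...}] or [{"loc": ...}]) or XML sitemaps."""
    try:
        entries = json.loads(raw)
        urls = [e.get("url") or e.get("loc") for e in entries]
    except json.JSONDecodeError:
        root = ET.fromstring(raw)
        # Sitemap XML carries a namespace; match on the local tag name instead.
        urls = [el.text for el in root.iter() if el.tag.endswith("loc")]
    # Deduplicate (order-preserving) and keep only German landsiedel.com pages.
    seen = dict.fromkeys(u for u in urls if u)
    return [u for u in seen
            if "www.landsiedel.com" in u and "/de/" in u]
```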
- `--url` / `--sitemap` – mutually exclusive entry points.
- `--limit` – cap the number of sitemap entries (debugging).
- `--delay` – seconds between sitemap requests (rate limiting).
- `--timeout` / `--retries` – fetcher controls.
- `--log-file` – optional batch log file.
- `--output-dir` – root folder for generated HTML (default `output`).
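This flag set maps naturally onto `argparse`. The sketch below is an assumption about how `main.py` might wire it up; the defaults for `--timeout` and `--retries` are placeholders, not the project's actual values:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="src.main")
    # Exactly one entry point may be chosen per invocation.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--url", help="translate a single page")
    group.add_argument("--sitemap", help="translate matching sitemap entries")
    group.add_argument("--check", action="store_true",
                       help="backend health check")
    parser.add_argument("--limit", type=int,
                        help="cap number of sitemap entries")
    parser.add_argument("--delay", type=float, default=0.0,
                        help="seconds between sitemap requests")
    parser.add_argument("--timeout", type=float, default=30.0)
    parser.add_argument("--retries", type=int, default=3)
    parser.add_argument("--log-file")
    parser.add_argument("--output-dir", default="output")
    return parser
```

Passing both `--url` and `--sitemap` makes `argparse` exit with a usage error, which matches the "CLI exits immediately" behaviour noted in the troubleshooting table.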
Inspect original and translated pages in a single browser session:
```bash
python -m src.webviewer --output-dir output --open-browser
```

- Automatically selects `output/de_NEW` and `output/en_NEW` when present (falls back to `de`/`en`).
- Shows the directory tree on the left; the preview uses the full width with a language toggle between EN and DE.
- Provides ready-to-send ZIP downloads for each language (also stored under `output/packages/`).
- Override with `--source-subdir`/`--target-subdir` for archival sets (stored under `output/archive/`).
- Retries up to five consecutive ports if 8000 is busy; tweak with `--port-attempts`.
- Bind to a different host/port using `--host`/`--port` if needed.
Visit http://127.0.0.1:8000/ (or whichever host/port you chose) to explore the files. Close the process with Ctrl+C.
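The port fallback can be sketched with the stdlib HTTP server; this is illustrative only, and the real logic lives in `webviewer.py`:

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

def bind_server(host: str = "127.0.0.1", start_port: int = 8000,
                attempts: int = 5) -> HTTPServer:
    """Try `attempts` consecutive ports, returning the first server that binds."""
    last_error: OSError | None = None
    for port in range(start_port, start_port + attempts):
        try:
            return HTTPServer((host, port), SimpleHTTPRequestHandler)
        except OSError as exc:      # port already in use; try the next one
            last_error = exc
    raise last_error
```

If all candidate ports are busy, the last `OSError` propagates so the CLI can report it.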
Generate a self-contained static viewer (under docs/) that GitHub Pages can host:
```bash
python -m src.static_site --output-dir output --site-dir docs
```

- Copies the current `de_NEW`/`en_NEW` trees into `docs/de` and `docs/en`.
- Emits `docs/index.html`, `docs/data/tree.json`, and ZIP downloads in `docs/packages/`.
- Push the repository with the `docs/` folder to GitHub and enable Pages (Settings → Pages → Deploy from branch → main / docs).
- Each time translations change, re-run the command before committing so the site stays in sync.
- Optional automation: the included GitHub Actions workflow (`.github/workflows/pages.yml`) rebuilds `docs/` and deploys to GitHub Pages on every push to `main`. Enable Pages (branch = `github-pages` deployment) after the first run.
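The copy-and-zip step can be approximated with the stdlib as below; this is a sketch under assumed directory layout, and `static_site.py` may differ in detail:

```python
import shutil
from pathlib import Path

def build_site(output_dir: str = "output", site_dir: str = "docs") -> None:
    """Copy the translated trees into the site and zip each language."""
    site = Path(site_dir)
    packages = site / "packages"
    packages.mkdir(parents=True, exist_ok=True)
    for lang in ("de", "en"):
        src = Path(output_dir) / f"{lang}_NEW"
        if not src.is_dir():            # fall back to plain de/ and en/
            src = Path(output_dir) / lang
        dest = site / lang
        shutil.copytree(src, dest, dirs_exist_ok=True)
        # Produces docs/packages/de.zip and docs/packages/en.zip
        shutil.make_archive(str(packages / lang), "zip", dest)
```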
The test suite relies on mocked network calls; no real HTTP requests are performed:
```bash
pytest -q
```

Targeted runs:

```bash
pytest tests/test_translator.py -v      # translator helpers + caching logic
pytest tests/test_batch.py -k sitemap   # sitemap parsing
```

| Symptom | Likely Cause | Fix |
|---|---|---|
| HF authentication failed | `HF_API_TOKEN` missing or invalid | Regenerate the token and export it / update `.env` |
| Translation API error: 429 | Rate limit hit | Wait for the daily reset or increase the delay between requests (`--delay`) |
| Pages re-process slowly | Cache disabled or purged | Ensure `translation_cache.db` exists and is writable |
| CLI exits immediately | `--url`/`--sitemap` not specified | Supply exactly one of the entry-point flags |
For deeper dives (response payloads, caching internals), see `INTEGRATION_NOTES.md`.
- Caching can be cleared by deleting `translation_cache.db`.
- Translation batches respect both a maximum item count (`BATCH_SIZE`) and an approximate token budget (`MAX_TOKENS_PER_BATCH`) to keep requests within free-tier limits.
- Network access is required; no offline fallback is currently implemented.
- The CLI still supports the rest of the legacy pipeline contracts, so existing integrations can reuse `translate_batch()` and `has_model()`.
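The dual batching limits can be sketched as a greedy chunker; the constant values and the word-count token estimate below are placeholders, not the project's actual settings:

```python
BATCH_SIZE = 32                 # max items per request (placeholder value)
MAX_TOKENS_PER_BATCH = 512      # approximate token budget (placeholder value)

def make_batches(texts: list[str]) -> list[list[str]]:
    """Greedily group texts, closing a batch when either limit would be hit."""
    batches: list[list[str]] = []
    current: list[str] = []
    tokens = 0
    for text in texts:
        cost = max(1, len(text.split()))          # crude token estimate
        if current and (len(current) >= BATCH_SIZE
                        or tokens + cost > MAX_TOKENS_PER_BATCH):
            batches.append(current)
            current, tokens = [], 0
        current.append(text)
        tokens += cost
    if current:
        batches.append(current)
    return batches
```

A text that alone exceeds the token budget still gets sent as a singleton batch rather than being dropped.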