Repository: Doclinger
Doclinger is a ready-to-go Docling installation with a web UI for easy document processing and RAG agent ingestion. Upload PDFs, Office docs, or images; configure chunk size and overlap, OCR, and cleanup options; run extraction; then download markdown and RAG-ready JSONL chunks. All local—no cloud required.
- Docker (recommended): Docker Engine and Docker Compose. The image uses Python 3.11 and includes Docling.
- Local run: Python 3.11+, pip, and a virtual environment. Docling is optional locally (placeholder extraction if not installed).
- Upload: PDF, DOCX, PPTX, XLSX, HTML, Markdown, CSV, images (PNG, TIFF, JPG), and more (limit 200MB per file)
- Extract: Run Docling extraction (Docker image includes Docling; local runs use placeholder if not installed)
- Store: Structured outputs with a prefix derived from the source filename (e.g. `User_Guide_v2.document.md`, `User_Guide_v2.chunks.jsonl`, `User_Guide_v2.metadata.json`)
- Chunk: Header-aware, token-sized chunking (default 1000 tokens, 120 overlap) into lean JSONL for RAG ingestion
- Preview & download: View extraction and chunks in the UI; download artifacts via buttons
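As a concrete illustration of the prefix derivation mentioned above, here is a minimal sketch of the sanitization rule described in the output-format notes (filename stem, spaces to underscores, only `[A-Za-z0-9._-]` kept, repeated underscores collapsed, 80-character cap). The function name `artifact_prefix` is hypothetical, not the actual backend API:

```python
import re
from pathlib import Path

def artifact_prefix(filename: str, max_len: int = 80) -> str:
    """Sketch of the artifact-prefix rule: take the filename stem,
    replace spaces with underscores, keep only [A-Za-z0-9._-],
    collapse repeated underscores, and cap the length."""
    stem = Path(filename).stem
    stem = stem.replace(" ", "_")
    stem = re.sub(r"[^A-Za-z0-9._-]", "", stem)
    stem = re.sub(r"_+", "_", stem)
    return stem[:max_len]
```

So `User Guide v2.pdf` yields the prefix `User_Guide_v2`, and every artifact for that job starts with it.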
- Backend: FastAPI, Uvicorn, Pydantic, Docling (or placeholder)
- UI: Streamlit, requests
- Container: Docker + docker-compose
```
Doclinger/                    # project root (clone as Doclinger or rename as you like)
├── README.md
├── docs/
│   └── processing-ui.png     # screenshot for README
├── .gitignore
├── .dockerignore
├── docker/
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── entrypoint.sh         # starts backend then Streamlit
├── scripts/
│   └── prevent-sleep.ps1     # Windows: keep PC awake during extraction
├── backend/
│   ├── pyproject.toml
│   ├── requirements.txt
│   └── src/app/
│       ├── main.py
│       ├── api/              # routes: upload, extract, job, artifact, storage
│       ├── core/             # config, models, docling_runner, chunker, storage
│       └── tests/
├── ui/
│   ├── streamlit_app.py
│   ├── run.ps1               # Windows: run Streamlit with project venv
│   └── components/
└── data/                     # created at runtime if missing
    ├── uploads/
    ├── outputs/
    └── examples/
```
Build and run the container. Ports are configurable via environment variables (defaults: API 8001, UI 8502).
Configure ports (optional):
Ports can be customized via environment variables. Copy .env.example to .env and adjust as needed, or set environment variables directly:
```
# .env (copy from .env.example)
API_PORT=8001
UI_PORT=8502
```

Start the services:

```
cd Doclinger/docker
docker compose up --build -d
```

Or from the project root:

```
cd Doclinger
docker compose -f docker/docker-compose.yml up --build -d
```

| Service | URL (default ports) |
|---|---|
| UI | http://localhost:8502 |
| API | http://localhost:8001 |
| API docs | http://localhost:8001/docs |
Ports are configurable via API_PORT and UI_PORT environment variables (see .env.example).
To stop the stack: docker compose -f docker/docker-compose.yml down (from project root). Data in data/ is kept.
Using the Docker UI: Open http://localhost:8502 (or your configured UI_PORT). The sidebar uses http://127.0.0.1:8000 by default (API inside the same container). Leave it as is when using the Docker UI.
Optional — split backend and UI (dev profile): For development you can run the API and Streamlit as separate containers so you can mount only the ui/ folder:
```
docker compose -f docker/docker-compose.yml --profile dev up -d backend-only ui-only
```

Configure dev ports via `API_PORT_DEV` and `UI_PORT_DEV` environment variables (defaults: 8002 and 8503). The UI talks to the API at http://backend-only:8000 inside the network.
- Upload a file, then click Run extraction.
- Wait for the progress timer (large PDFs can take 3–5 minutes).
- When extraction finishes, use Download buttons under Job status for the document and chunk artifacts (filenames are prefixed with the sanitized source name, e.g. `My_Report.document.md`, `My_Report.chunks.jsonl`).
- From the project root, create a virtual environment and install dependencies:

  ```
  cd Doclinger
  python -m venv Docling
  # Windows (PowerShell):
  .\Docling\Scripts\Activate.ps1
  # Linux/macOS:
  # source Docling/bin/activate
  pip install -r backend/requirements.txt
  pip install -e backend/
  pip install streamlit requests
  ```

- Optional: install Docling for real extraction (otherwise a placeholder runs):

  ```
  pip install docling
  # or: pip install -e "backend[docling]"
  ```

- Start the backend from the project root (so `data/` is found):

  ```
  # Windows (PowerShell):
  $env:PYTHONPATH = "backend/src"
  uvicorn app.main:app --reload --host 0.0.0.0 --port 8001

  # Linux/macOS (or Windows cmd: set PYTHONPATH=backend/src):
  export PYTHONPATH=backend/src
  uvicorn app.main:app --reload --host 0.0.0.0 --port 8001
  ```

  The backend creates `data/uploads` and `data/outputs` if they don't exist. To use a different data directory, set `DATA_ROOT` (e.g. `$env:DATA_ROOT = "C:\my\data"` in PowerShell).

- In another terminal, start the UI:

  ```
  cd Doclinger/ui
  python -m streamlit run streamlit_app.py
  ```

  On Windows you can use `.\run.ps1` from the `ui/` folder (expects the venv at the project root as `Docling/`).

- Open http://localhost:8501. Set the sidebar Backend URL to http://localhost:8001 when the API runs locally.
From the project root (with the same venv that has the backend installed):

```
cd Doclinger/backend
pip install -r requirements.txt
pytest
```

Tests use `backend/src` as the Python path (via `pyproject.toml`).
- Upload a document (PDF, DOCX, etc.).
- Click Run extraction. A progress timer runs; extraction can take 3–5 minutes for large PDFs.
- When extraction completes:
  - Job status shows Download buttons for the job's artifacts (e.g. `<prefix>.document.md`, `<prefix>.document_structured.json`, `<prefix>.chunks.jsonl`, `<prefix>.manifest.json`, `<prefix>.metadata.json`).
  - Download all as a ZIP or individual files.
Errors are shown in the sidebar under Status. Use Dismiss to clear them.
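Once downloaded, the `<prefix>.chunks.jsonl` artifact is plain JSONL and can be inspected with a few lines of Python. This is a generic reader sketch; the helper name `load_chunks` is ours, not part of the project:

```python
import json

def load_chunks(path: str) -> list[dict]:
    """Read a <prefix>.chunks.jsonl artifact: one JSON object per line,
    each with "id", "text", and "meta" ({"doc_id": ..., "section": ...})."""
    chunks = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                chunks.append(json.loads(line))
    return chunks
```

Each returned dict maps directly onto the schema documented in the output-format notes below.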
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| POST | `/upload` | Upload file; returns `job_id` |
| POST | `/extract/{job_id}` | Run Docling extraction (long-running; optional body: `processing_config`) |
| GET | `/job/{job_id}` | Get job metadata and artifact list |
| GET | `/job/{job_id}/progress` | Get extraction progress (e.g. status, message) |
| GET | `/artifact/{job_id}/(unknown)` | Download a stored artifact |
| POST | `/storage/clean` | Delete all uploads and outputs (free disk space) |
- Extraction finishes but no "Complete" or download buttons
  Reuse the same file (don't re-upload). The UI keeps the same job so you see the completed state and download buttons. If you already re-uploaded, run extraction again on the current file and wait for completion.
- 500 or timeout during extraction
  - Ensure the image is rebuilt after code changes: `docker compose build`, then `docker compose up -d`.
  - Check logs: `docker logs Docling`. The backend does not capture subprocess output (to avoid pipe deadlock); logs go to the container.
  - Large PDFs: extraction can take several minutes; the UI waits up to ~5 minutes.
- "Killed" in logs / OOM
  The container hit memory limits. The compose file limits the container to 4GB. Increase Docker Desktop memory (Settings → Resources) or use a smaller document.
- Extraction fails or stops when the PC goes to sleep
  Sleep suspends the whole system (including Docker), so the extraction process stops. Fix: keep the PC awake during extraction. On Windows you can run the provided script in a separate PowerShell window before starting extraction; it tells the OS not to sleep until you press Ctrl+C:

  ```
  cd Doclinger
  .\scripts\prevent-sleep.ps1
  ```

  Then start the app and run extraction. When the job is done, press Ctrl+C in the script window. Alternatively, set Power & sleep → "When plugged in, put the computer to sleep" to Never (or 30+ minutes) while running long jobs.
- Connection refused or wrong port
  - Using the Docker UI: keep the sidebar backend URL as http://127.0.0.1:8000 (API in the same container). The UI port is configurable via `UI_PORT` (default: 8502).
  - Running the UI locally against a Docker API: set the sidebar URL to match your Docker `API_PORT` (default: http://localhost:8001).
- "No space left on device" on upload
  The container or host disk is full. Free space:
  - Remove old extraction outputs: project data lives in `data/uploads` and `data/outputs` at the project root; delete or archive files there if you don't need them. You can also call POST /storage/clean to clear all uploads and outputs.
  - Prune Docker: `docker system prune -a` (removes unused images/containers; add `--volumes` only if you're sure you don't need other volumes).
  - Check host free space on the drive where the project and Docker data live; free at least a few GB.
  - Docker Desktop: Settings → Resources → Disk image size: increase if the virtual disk is full.
- Prefix: Every artifact filename is prefixed with the source document name (stem) sanitized for the filesystem: spaces → underscores, only `[A-Za-z0-9._-]` kept, multiple underscores collapsed, max 80 characters. Example: `User Guide v2.pdf` → prefix `User_Guide_v2`.
- Artifacts (in `data/outputs/<job_id>/`):
  - `<prefix>.document.md`: Extracted markdown (always kept).
  - `<prefix>.document_structured.json`: Rich Docling output (can be large).
  - `<prefix>.chunks.jsonl`: RAG-ready JSONL (one JSON object per line).
  - `<prefix>.manifest.json`: Job summary, source file, artifact list, chunk counts, chunking params.
  - `<prefix>.metadata.json`: Job metadata (job_id, status, artifact_prefix, artifacts, stats).
- chunks.jsonl schema (one line per chunk):
  `{"id": "<doc_id>_<index>", "text": "...", "meta": {"doc_id": "<job_id>", "section": "H1 > H2"}}`
- Chunking is header-aware (splits by `#`–`######`), then by approximate token windows (chars/4). No start/end offsets.
- Chunking defaults: target 1000 tokens, overlap 120 tokens. Tables and paragraphs are kept intact where possible (split at blank lines).
- RAG ingestion: Use `<prefix>.chunks.jsonl` as input to your vector DB or embedding pipeline. Each line is a JSON object with `id`, `text`, and `meta` (doc_id, section). Embed `text` and store `id`/`meta` for retrieval.
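For intuition, the header-aware windowing described above can be approximated in a few lines. This is a simplified sketch, not the shipped chunker (the real implementation lives in the `chunker` module under `backend/src/app/core/` and additionally keeps tables and paragraphs intact):

```python
import re

def chunk_markdown(md: str, target_tokens: int = 1000, overlap_tokens: int = 120) -> list[dict]:
    """Simplified sketch of the chunking scheme: split on markdown headers
    (# through ######), then cut each section into approximate token
    windows (1 token ≈ 4 chars) with overlap."""
    win = target_tokens * 4            # window size in characters
    step = win - overlap_tokens * 4    # slide step; the remainder overlaps
    chunks: list[dict] = []
    section = ""
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        for start in range(0, len(text), step):
            piece = text[start:start + win]
            if piece:
                chunks.append({"text": piece, "meta": {"section": section}})

    for line in md.splitlines():
        m = re.match(r"^#{1,6}\s+(.*)", line)
        if m:
            flush()          # emit the previous section's chunks
            body.clear()
            section = m.group(1)
        else:
            body.append(line)
    flush()
    # Assign sequential ids mirroring the <doc_id>_<index> scheme.
    for i, c in enumerate(chunks):
        c["id"] = f"doc_{i}"
    return chunks
```

With the defaults this yields ~4000-character windows sharing ~480 characters of overlap, which matches the 1000-token / 120-token figures above under the chars/4 approximation.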
MIT
