Originally created for the 2025 Hazard Information Profiles (HIPs) from the United Nations Office for Disaster Risk Reduction. This implementation takes its name from the short story "The Minority Report" by Philip K. Dick.
An agentic pipeline for generating high-quality multilingual technical translations and controlled vocabularies. The project uses a multi-model approach (Voter/Arbitrator architecture) to translate terms while preserving context and generating standard Croissant metadata.
- Multi-Model Orchestration: Leverages `gpt-oss:latest`, `gemma3:27b`, and `deepseek-r1:14b` via Ollama.
- Context-Aware Translation: Uses "Scope Notes" to ensure technical accuracy.
- Web Scraping: Built-in support for extracting terms and definitions from sites like PreventionWeb.
- Formal Provenance: Every translation is linked to its source model in both CSV and Croissant metadata.
- Arbitration Logic: Automatically resolves disagreements between models by triggering a voting round.
- Batch Processing: Scrape entire indexes of terms and generate metadata in bulk.
- Dockerized: Easy deployment and execution without local environment headaches.
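The Voter/Arbitrator flow can be reduced to a few lines of Python. The sketch below is illustrative only: `arbitrate` and its candidate dict are hypothetical names, and the real pipeline resolves ties by triggering an extra voting round between the models rather than falling back locally.

```python
from collections import Counter

def arbitrate(candidates: dict[str, str]) -> tuple[str, str]:
    """Pick the translation most voter models agree on.

    `candidates` maps model name -> proposed translation. Ties fall back
    to the first model listed; the real pipeline instead triggers an
    extra voting round between the models.
    """
    winner, _votes = Counter(candidates.values()).most_common(1)[0]
    # Provenance: record which model produced the winning translation.
    source = next(m for m, t in candidates.items() if t == winner)
    return winner, source

print(arbitrate({
    "gpt-oss:latest": "aléa",
    "gemma3:27b": "aléa",
    "deepseek-r1:14b": "danger",
}))  # ('aléa', 'gpt-oss:latest')
```

Returning the source model alongside the winning string is what makes the formal provenance column in the CSV output possible.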
- Docker and Docker Compose
- Ollama (running on `http://10.147.18.253:11434` by default)
If you prefer running without Docker:

```bash
pip install -r requirements.txt
```

Build the Environment:

```bash
docker-compose build
```
Index Scraping (Create Cache): Scrape an entire index page (e.g., PreventionWeb HIPs) to create a local cache of terms, URLs, and codes. This does not run translations yet.

```bash
docker-compose run orchestrator --index-url https://www.preventionweb.net/understanding-disaster-risk/terminology/hips/ --output-dir data
```

Output: `data/index_cache.json`
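Because the cache is plain JSON, intermediate results are easy to inspect or post-process before any translation runs. A minimal sketch of reading it back — the field names (`term`, `code`, `url`) are assumptions about the cache schema, not a documented contract:

```python
import json
from pathlib import Path

# Fabricated cache content for illustration; the real file is produced
# by the --index-url run above and its field names may differ.
cache_file = Path("index_cache.json")
cache_file.write_text(json.dumps([
    {"term": "Flash Flood", "code": "MH0103",
     "url": "https://www.preventionweb.net/understanding-disaster-risk/terminology/hips/mh0103"},
]))

for entry in json.loads(cache_file.read_text()):
    print(entry["code"], entry["term"])
```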
Run Translations (From Cache): Process the cached index file to translate all terms.

```bash
docker-compose run orchestrator --index-file data/index_cache.json --languages fr,es,de --models gpt-oss:latest
```

With OntoPortal Enrichment (Optional): Requires an API key (e.g., from EcoPortal).

```bash
export ONTOPORTAL_API_KEY="528c4e4a-5c3e-4798-a2e2-11d96761b8ce"
docker-compose run -e ONTOPORTAL_API_KEY orchestrator \
  --index-file data/index_cache.json \
  --ontoportal-url "http://ecoportal.lifewatch.eu:8080"
```
Single URL Mode: Directly scrape and translate a single page.

```bash
docker-compose run orchestrator --url https://www.preventionweb.net/understanding-disaster-risk/terminology/hips/mh0103 --languages fr,es
```

OntoPortal-based Pages (e.g. AgroPortal): The tool automatically detects OntoPortal URLs (containing `conceptid`) and uses the API to fetch the authoritative definition.

```bash
export ONTOPORTAL_API_KEY="528c4e4a-5c3e-4798-a2e2-11d96761b8ce"
docker-compose run -e ONTOPORTAL_API_KEY orchestrator \
  --url "https://agroportal.lirmm.fr/ontologies/NCBITAXON?p=classes&conceptid=http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSTY%2FT017" \
  --ontoportal-url "http://agroportal.lirmm.fr"
```
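For reference, OntoPortal instances expose a BioPortal-style REST API in which a class is fetched at `/ontologies/{acronym}/classes/{url-encoded IRI}`. The sketch below only builds that URL; the endpoint layout and hostname are assumptions to verify against your portal's API documentation (the data API often lives on a separate host, e.g. `data.agroportal.lirmm.fr`):

```python
from urllib.parse import quote

def ontoportal_class_url(api_base: str, ontology: str, concept_iri: str) -> str:
    """Build a BioPortal-style class endpoint for an OntoPortal instance."""
    # The concept IRI must be fully percent-encoded, including ':' and '/'.
    return f"{api_base.rstrip('/')}/ontologies/{ontology}/classes/{quote(concept_iri, safe='')}"

url = ontoportal_class_url(
    "http://agroportal.lirmm.fr",
    "NCBITAXON",
    "http://purl.bioontology.org/ontology/STY/T017",
)
print(url)
```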
Generate Batch Croissant Metadata: Generate rich JSON-LD metadata for the entire dataset, enriching it with the HIPs codes and source URLs from the index cache.

```bash
docker-compose run python3 translation-skill/scripts/batch_croissant.py --input-file data/final_translations.csv --index-file data/index_cache.json --output-dir output/
```
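Croissant files are JSON-LD built on the schema.org vocabulary. As a rough sketch of the kind of record the batch step emits — the field choices here are illustrative, not the exact output of `batch_croissant.py` (check the files in `output/` for the authoritative structure):

```python
import json

# Illustrative Croissant-style JSON-LD skeleton for the dataset.
record = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "HIPs multilingual terminology",
    "inLanguage": ["en", "fr", "es", "de"],
    "url": "https://www.preventionweb.net/understanding-disaster-risk/terminology/hips/",
}
print(json.dumps(record, indent=2))
```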
Ensure your `OLLAMA_HOST` is set:

```bash
export OLLAMA_HOST=http://10.147.18.253:11434
python3 translation-skill/scripts/orchestrator.py --url "YOUR_URL" --output_dir data
```

The Minority Report can be run as a Model Context Protocol (MCP) server, allowing LLMs to use its translation and scraping tools directly. It provides:

- `understand_and_translate`: Translates a specialized term given its context (Scope Note).
- `open_page_and_translate`: Scrapes a term/context from a URL and translates it.
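Conceptually, each MCP tool is just a named function the client can invoke with structured arguments. A dependency-free sketch of that pattern (the real server is built on `fastmcp`, and the translation body here is a placeholder):

```python
TOOLS = {}

def tool(fn):
    """Register a function under its name, mimicking an MCP tool table."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def understand_and_translate(term: str, scope_note: str, languages: list[str]) -> dict:
    # Placeholder body: the real tool prompts the Ollama models and
    # arbitrates between their answers.
    return {lang: f"<{lang} translation of {term!r}>" for lang in languages}

result = TOOLS["understand_and_translate"](
    "hazard", "A process or phenomenon that may cause loss.", ["fr", "es"]
)
print(result)
```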
Install Dependencies:

```bash
pip install fastmcp mcp --break-system-packages
```

Run the Server:

```bash
python3 translation-skill/scripts/mcp_server.py
```

Configure MCP Client: Add the following to your MCP client configuration (e.g., Claude Desktop `mcp_config.json`):

```json
{
  "mcpServers": {
    "minority-report": {
      "command": "python3",
      "args": ["/Users/vyacheslavtykhonov/projects/the-minority-report/translation-skill/scripts/mcp_server.py"],
      "env": {
        "OLLAMA_HOST": "http://10.147.18.253:11434"
      }
    }
  }
}
```
Once configured, you can ask your LLM to perform complex technical translations.
What to ask:
- "Open the page https://www.preventionweb.net/... and translate the term into French and Spanish."
- "I have a technical term 'Digital Twin' used in the context of urban planning. Translate it into German and Dutch using The Minority Report."
- "Scrape the latest terminology from this URL and generate a multilingual controlled vocabulary entry."
Tips:
- Provide Context: The `understand_and_translate` tool works best when you provide a detailed "Scope Note" or definition.
- Specify Models: You can explicitly ask for certain models (e.g., `gemma3:27b`) if you need specific language nuances.
- Batching: While the MCP tools currently handle one term at a time, you can ask the LLM to loop through a list of terms.
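The batching tip amounts to driving the single-term tool in a loop. A sketch, with `translate_term` standing in for one MCP tool invocation:

```python
# `translate_term` is a stand-in for invoking the MCP tool once per term.
def translate_term(term: str, languages: list[str]) -> dict:
    return {lang: f"<{lang}:{term}>" for lang in languages}

terms = ["landslide", "storm surge", "wildfire"]
# Accumulate a small controlled vocabulary, one tool call per term.
vocabulary = {t: translate_term(t, ["fr", "es"]) for t in terms}
print(len(vocabulary))  # 3
```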
| Variable | Description | Default |
|---|---|---|
| `OLLAMA_HOST` | The URL of the Ollama API | `http://10.147.18.253:11434` |
| `SPACY_MODEL` | Path to the trained spaCy NER model for the MCP server | `training/spacy_hips` |
The MCP server uses a spaCy model for the `find_hazards` tool.

- By default, it looks for the model at `training/spacy_hips`.
- You can override this by setting the `SPACY_MODEL` environment variable to the absolute path of your model.

```bash
export SPACY_MODEL=/path/to/your/custom/model
python3 translation-skill/scripts/mcp_server.py
```

Project layout:

- `translation-skill/scripts/`: Core logic for scraping, translation, and metadata generation.
- `translation-skill/prompts/`: Markdown-based prompt templates for LLM agents.
- `training/`: Model training scripts (spaCy NER and Transformers).
- `tests/`: Verification scripts for the pipeline.
- `data/`: Input and output CSV files.
- `output/`: Generated Croissant metadata and SKOS vocabularies.
- `agents.md`: Detailed documentation on agent architecture and prompts.
The project includes two training approaches for building custom models:
Train a lightweight Named Entity Recognition model to identify disaster terminology:
Single-job training:

```bash
# Install spaCy
pip install spacy

# Train the model (fast, ~1-2 minutes)
python3 training/train-spacy.py --data-dir output --n-iter 30 --test

# Model saved to: training/spacy_model/
```

Parallel training (for multi-core systems):

```bash
# Train 16 models in parallel
python3 training/train-spacy.py --data-dir output --n-jobs 16 --n-iter 30

# Monitor progress in another terminal
tail -f training/spacy_model/run_*.log

# Models saved to: training/spacy_model/run_0/, run_1/, ..., run_15/
```

Advantages:
- Fast training (1-2 minutes single-job, scales with cores for parallel)
- Small model size (~10MB per model)
- Works on CPU (GPU conflicts resolved via spawn multiprocessing)
- Production-ready
- Parallel training enables model ensembling or selection of best performer
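When training 16 models in parallel, you still have to pick one. A small helper that scans the `run_*.log` files for the lowest final loss — assuming the last number in each log is the final loss, which may not match `train-spacy.py`'s real log format:

```python
import re
import tempfile
from pathlib import Path

def best_run(log_dir: Path):
    """Return (run_name, loss) for the log whose last number is smallest."""
    best = None
    for log in sorted(log_dir.glob("run_*.log")):
        numbers = re.findall(r"\d+(?:\.\d+)?", log.read_text())
        if numbers and (best is None or float(numbers[-1]) < best[1]):
            best = (log.stem, float(numbers[-1]))
    return best

# Demo on fabricated logs:
d = Path(tempfile.mkdtemp())
(d / "run_0.log").write_text("iter 30 loss 14.2\n")
(d / "run_1.log").write_text("iter 30 loss 9.7\n")
print(best_run(d))  # ('run_1', 9.7)
```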
Fine-tune a Gemma model for translation tasks:
```bash
# Requires CUDA GPU
python3 training/train.py --data-dir output --model-name google/gemma-2-2b-it --max-steps 60
```

Note: This approach requires significant GPU resources and may take hours to train.
- Vyacheslav Tykhonov: Project Lead & Architect.
- GitHub: https://github.com/4tikhonov/the-minority-report
Distributed under the Creative Commons Attribution 4.0 license.