The Minority Report

Originally created for the 2025 Hazard Information Profiles (HIPs) from the United Nations Office for Disaster Risk Reduction (UNDRR). The approach takes its name from the novel "The Minority Report" by Philip K. Dick.

An agentic pipeline for generating high-quality multilingual technical translations and controlled vocabularies. The project uses a multi-model approach (Voter/Arbitrator architecture) to translate terms while preserving context and generating standard Croissant metadata.

Features

  • Multi-Model Orchestration: Leverages gpt-oss:latest, gemma3:27b, and deepseek-r1:14b via Ollama.
  • Context-Aware Translation: Uses "Scope Notes" to ensure technical accuracy.
  • Web Scraping: Built-in support for extracting terms and definitions from sites like PreventionWeb.
  • Formal Provenance: Every translation is linked to its source model in both CSV and Croissant metadata.
  • Arbitration Logic: Automatically resolves disagreements between models by triggering a voting round.
  • Batch Processing: Scrape entire indexes of terms and generate metadata in bulk.
  • Dockerized: Easy deployment and execution without local environment headaches.
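
The Voter/Arbitrator flow can be sketched as below. This is an illustrative assumption, not the project's actual implementation: the function name, candidate structure, and tie-break rule are all hypothetical.

```python
from collections import Counter

def majority_or_arbitrate(candidates, arbitrate):
    """Pick the translation most voter models agree on; on a tie, ask the arbitrator.

    candidates: dict mapping model name -> proposed translation (illustrative).
    arbitrate:  callback that receives the tied proposals and returns a winner.
    """
    counts = Counter(candidates.values())
    top = counts.most_common()
    # Clear majority: the leading count is strictly greater than the runner-up's.
    if len(top) == 1 or top[0][1] > top[1][1]:
        return top[0][0]
    # Disagreement: trigger an arbitration round on the tied proposals.
    tied = [text for text, n in top if n == top[0][1]]
    return arbitrate(tied)

# Example: two of three models agree, so no arbitration round is needed.
votes = {"gpt-oss:latest": "risque", "gemma3:27b": "risque", "deepseek-r1:14b": "danger"}
print(majority_or_arbitrate(votes, arbitrate=lambda tied: tied[0]))  # -> risque
```

The real pipeline records which model's proposal won, which is what the provenance columns in the CSV output capture.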

Installation

Prerequisites

Local Setup (Optional)

If you prefer running without Docker:

pip install -r requirements.txt

Usage

🚀 Using Docker (Recommended)

  1. Build the Environment:

    docker-compose build
  2. Index Scraping (Create Cache): Scrape an entire index page (e.g., PreventionWeb HIPS) to create a local cache of terms, URLs, and codes. This does not run translations yet.

    docker-compose run orchestrator --index-url https://www.preventionweb.net/understanding-disaster-risk/terminology/hips/ --output-dir data

    Output: data/index_cache.json

  3. Run Translations (From Cache): Process the cached index file to translate all terms.

    docker-compose run orchestrator --index-file data/index_cache.json --languages fr,es,de --models gpt-oss:latest

    With OntoPortal Enrichment (Optional): Requires an API key (e.g., from EcoPortal).

    export ONTOPORTAL_API_KEY="<your-api-key>"
    
    docker-compose run -e ONTOPORTAL_API_KEY orchestrator \
      --index-file data/index_cache.json \
      --ontoportal-url "http://ecoportal.lifewatch.eu:8080"
  4. Single URL Mode: Directly scrape and translate a single page.

    docker-compose run orchestrator --url https://www.preventionweb.net/understanding-disaster-risk/terminology/hips/mh0103 --languages fr,es

    OntoPortal-based Pages (e.g. AgroPortal): The tool automatically detects OntoPortal URLs (containing conceptid) and uses the API to fetch the authoritative definition.

    export ONTOPORTAL_API_KEY="<your-api-key>"
    
    docker-compose run -e ONTOPORTAL_API_KEY orchestrator \
      --url "https://agroportal.lirmm.fr/ontologies/NCBITAXON?p=classes&conceptid=http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSTY%2FT017" \
      --ontoportal-url "http://agroportal.lirmm.fr"
  5. Generate Batch Croissant Metadata: Generate rich JSON-LD metadata for the entire dataset, enriching it with the HIPS codes and Source URLs from the index cache.

    docker-compose run python3 translation-skill/scripts/batch_croissant.py --input-file data/final_translations.csv --index-file data/index_cache.json --output-dir output/
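
For orientation, a minimal Croissant-style JSON-LD record might look like the sketch below. The field choices here are assumptions for illustration; inspect the files the batch script writes to `output/` for the authoritative shape.

```python
import json

# Minimal sketch of a Croissant-style dataset record. Field names beyond the
# JSON-LD basics (@context, @type) are illustrative, not the script's output.
record = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "HIPS multilingual vocabulary",
    "inLanguage": ["en", "fr", "es", "de"],
    "hasPart": [{
        "@type": "DefinedTerm",
        "name": "Flood",
        "termCode": "MH0103",  # HIPS code pulled from the index cache
        "url": "https://www.preventionweb.net/understanding-disaster-risk/terminology/hips/mh0103",
    }],
}
print(json.dumps(record, indent=2))
```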

🐍 Using Python Directly

Ensure your OLLAMA_HOST is set:

export OLLAMA_HOST=http://10.147.18.253:11434
python3 translation-skill/scripts/orchestrator.py --url "YOUR_URL" --output_dir data
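
If you are scripting against Ollama directly, generation requests go to `$OLLAMA_HOST/api/generate`. The helper below only builds the request (no network call); the default host mirrors the table further down, and the prompt wording is an assumption.

```python
import json
import os

def build_generate_request(model: str, prompt: str) -> tuple[str, bytes]:
    """Return (url, body) for a non-streaming call to Ollama's /api/generate."""
    host = os.environ.get("OLLAMA_HOST", "http://10.147.18.253:11434")
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return f"{host}/api/generate", body

url, body = build_generate_request("gemma3:27b", "Translate 'hazard' into French.")
print(url)
```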

MCP Server

The Minority Report can be run as a Model Context Protocol (MCP) server, allowing LLMs to use its translation and scraping tools directly. It provides:

  • understand_and_translate: Translates a specialized term given its context (Scope Note).
  • open_page_and_translate: Scrapes a term/context from a URL and translates it.
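
As a sketch of the contract an MCP client sees, the first tool might behave as below. The parameter names and return shape are assumptions; the real tool schema is defined in translation-skill/scripts/mcp_server.py.

```python
def understand_and_translate(term: str, scope_note: str, languages: list[str]) -> dict:
    """Illustrative contract: return one translation per requested language.

    A real implementation routes through the Voter/Arbitrator pipeline; this
    stub only demonstrates the input/output shape.
    """
    return {lang: f"<{term} translated into {lang}>" for lang in languages}

print(understand_and_translate("Digital Twin", "Urban planning context", ["de", "nl"]))
```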

How to Start the MCP Server

  1. Install Dependencies:

    pip install fastmcp mcp --break-system-packages
  2. Run the Server:

    python3 translation-skill/scripts/mcp_server.py
  3. Configure MCP Client: Add the following to your MCP client configuration (e.g., Claude Desktop mcp_config.json):

    {
      "mcpServers": {
        "minority-report": {
          "command": "python3",
          "args": ["/Users/vyacheslavtykhonov/projects/the-minority-report/translation-skill/scripts/mcp_server.py"],
          "env": {
            "OLLAMA_HOST": "http://10.147.18.253:11434"
          }
        }
      }
    }

How to Use Minority Report MCP

Once configured, you can ask your LLM to perform complex technical translations.

What to ask:

  • "Open the page https://www.preventionweb.net/... and translate the term into French and Spanish."
  • "I have a technical term 'Digital Twin' used in the context of urban planning. Translate it into German and Dutch using The Minority Report."
  • "Scrape the latest terminology from this URL and generate a multilingual controlled vocabulary entry."

Tips:

  • Provide Context: The understand_and_translate tool works best when you provide a detailed "Scope Note" or definition.
  • Specify Models: You can explicitly ask for certain models (e.g., gemma3:27b) if you need specific language nuances.
  • Batching: While the MCP tools currently handle one term at a time, you can ask the LLM to loop through a list of terms.

Environment Variables

Variable      Description                                              Default
OLLAMA_HOST   The URL of the Ollama API                                http://10.147.18.253:11434
SPACY_MODEL   Path to the trained spaCy NER model for the MCP server   training/spacy_hips

Configuration

MCP Server

The MCP server uses a spaCy model for the find_hazards tool.

  • By default, it looks for the model at training/spacy_hips.
  • You can override this by setting the SPACY_MODEL environment variable to the absolute path of your model.
export SPACY_MODEL=/path/to/your/custom/model
python3 translation-skill/scripts/mcp_server.py
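
The override behaves like a standard environment-variable fallback. A minimal sketch (the function name is hypothetical; the default path matches the table above):

```python
import os

def resolve_spacy_model(default: str = "training/spacy_hips") -> str:
    """Use SPACY_MODEL if set, otherwise fall back to the project default path."""
    return os.environ.get("SPACY_MODEL", default)

print(resolve_spacy_model())
```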

Project Structure

  • translation-skill/scripts/: Core logic for scraping, translation, and metadata generation.
  • translation-skill/prompts/: Markdown-based prompt templates for LLM agents.
  • training/: Model training scripts (spaCy NER and Transformers).
  • tests/: Verification scripts for the pipeline.
  • data/: Input and output CSV files.
  • output/: Generated Croissant metadata and SKOS vocabularies.
  • agents.md: Detailed documentation on agent architecture and prompts.

Training Models

The project includes two training approaches for building custom models:

spaCy NER Model (Recommended)

Train a lightweight Named Entity Recognition model to identify disaster terminology:

Single-job training:

# Install spaCy
pip install spacy

# Train the model (fast, ~1-2 minutes)
python3 training/train-spacy.py --data-dir output --n-iter 30 --test

# Model saved to: training/spacy_model/

Parallel training (for multi-core systems):

# Train 16 models in parallel
python3 training/train-spacy.py --data-dir output --n-jobs 16 --n-iter 30

# Monitor progress in another terminal
tail -f training/spacy_model/run_*.log

# Models saved to: training/spacy_model/run_0/, run_1/, ..., run_15/

Advantages:

  • Fast training (1-2 minutes single-job, scales with cores for parallel)
  • Small model size (~10MB per model)
  • Works on CPU (GPU conflicts resolved via spawn multiprocessing)
  • Production-ready
  • Parallel training enables model ensembling or selection of best performer

Transformers Fine-tuning (Advanced)

Fine-tune a Gemma model for translation tasks:

# Requires CUDA GPU
python3 training/train.py --data-dir output --model-name google/gemma-2-2b-it --max-steps 60

Note: This approach requires significant GPU resources and may take hours to train.

Credits

License

Distributed under the Creative Commons Attribution 4.0 license.
