Themis is a Romanian legal search/analysis project built around:
- PDF parsing and article extraction for legal acts
- Elasticsearch indexing for searchable legal content
- MCP tools for legal article search and related-article discovery
- Prompt-driven legal Q&A behavior (used by an external agent setup)
- Unit tests and end-to-end behavior tests
The project is aimed at helping people without a legal or financial background, such as new startup owners and people working in small and medium-sized enterprises (SMEs), get a practical starting point for legal and fiscal / tax questions.
It is designed for situations where users may not know where to look in Romanian legislation, and want an accessible first answer before paying for or scheduling a meeting with an accountant or lawyer. It helps users identify the relevant legal basis and understand the issue faster, while still treating professional advice as the final step for case-specific decisions.
The repository is no longer just a standalone PDF parser. The PDF parser is one part of a larger ingestion + search + agent workflow.
- ingest.py: Ingests configured legal PDFs (documents/) into Elasticsearch using the extraction pipeline
- mcp_server.py: Starts the FastMCP server and exposes legal search tools
- prompts/search-law.md: System prompt for the legal assistant behavior
- e2e_tests/: behave scenarios validating answer content and references
- tests/: Unit/integration-style tests for parsing, extraction, links, storage, and pipeline pieces
- src/: Core parsing, conversion, segmentation, link extraction, storage, and config logic
- docker-compose.yml: Local Elasticsearch + Kibana + MCP server stack
- Python 3.11+
- Docker + Docker Compose (for local Elasticsearch/Kibana stack)
- Dependencies from requirements.txt
Python dependencies currently include:
- PyMuPDF
- elasticsearch
- fastmcp
- behave
- openai
- requests
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

Create or update .env (used by Python code and e2e tests).
Common variables used in this repo:
- ELASTICSEARCH_URL (default in code: http://elasticsearch:9200)
- ELASTICSEARCH_USERNAME
- ELASTICSEARCH_PASSWORD
- SEGMENT_PIPELINE (optional; defaults to empty string)
- OPENAI_API_KEY (required for e2e semantic assertion step)
- KIBANA_API_KEY (required for e2e tests that call Kibana Agent Builder API)
Notes:
- src/config.py loads env vars via python-dotenv
- e2e_tests/features/environment.py loads .env from repo root
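The defaults listed above can be illustrated with a small settings loader. This is a minimal sketch, not the contents of src/config.py: the names `Settings`, `load_settings`, and `missing_e2e_keys` are hypothetical, and the sketch reads from a plain mapping (by default `os.environ`) instead of calling python-dotenv, so that it stays self-contained.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in this README."""
    elasticsearch_url: str
    elasticsearch_username: Optional[str]
    elasticsearch_password: Optional[str]
    segment_pipeline: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    return Settings(
        # Default taken from the README: http://elasticsearch:9200
        elasticsearch_url=env.get("ELASTICSEARCH_URL", "http://elasticsearch:9200"),
        elasticsearch_username=env.get("ELASTICSEARCH_USERNAME"),
        elasticsearch_password=env.get("ELASTICSEARCH_PASSWORD"),
        # Optional; defaults to an empty string.
        segment_pipeline=env.get("SEGMENT_PIPELINE", ""),
    )


def missing_e2e_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the e2e-only keys that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in ("OPENAI_API_KEY", "KIBANA_API_KEY") if not env.get(k)]


if __name__ == "__main__":
    print(load_settings())
    print("Missing e2e keys:", missing_e2e_keys())
```

Passing the mapping explicitly makes the defaults easy to check in a unit test without touching the real environment.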
    docker compose up -d elasticsearch kibana

Optional: include the MCP server container too:

    docker compose up -d

The compose stack includes:
- Elasticsearch (localhost:9200)
- Kibana (localhost:5601)
- mcp-server service (built from Dockerfile_MCP)
ingest.py is preconfigured to ingest several Romanian legal documents from documents/, including:
- Codul Fiscal
- Codul Civil
- Codul Muncii
- Securitatea si Sanatatea in Munca
Run ingestion:

    python ingest.py

What happens at a high level:
- PDF text is read
- Legal structure/articles are extracted
- Articles are converted/segmented
- Legal links/cross-references are extracted
- Data is stored in Elasticsearch
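The middle steps above (article extraction and link extraction) can be sketched with two regular expressions. This is an illustration only and not the code in src/: the patterns, the function names, and the document shape are assumptions, headings are assumed to look like `Art. <number>` at the start of a line, and both PDF reading and the Elasticsearch write are left out.

```python
import re

# Assumed heading shape: a line starting with "Art. <number>".
ARTICLE_HEADING = re.compile(r"^Art\.\s*(\d+)", re.MULTILINE)
# Assumed cross-reference shape inside article bodies: lowercase "art. <number>".
ARTICLE_LINK = re.compile(r"\bart\.\s*(\d+)")


def split_articles(text):
    """Split raw text into articles, one per 'Art. N' heading."""
    matches = list(ARTICLE_HEADING.finditer(text))
    articles = []
    for i, m in enumerate(matches):
        # An article body runs until the next heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        articles.append({"number": m.group(1), "text": text[m.end():end].strip()})
    return articles


def extract_links(article):
    """Return referenced article numbers, in order, without self-links or repeats."""
    links = []
    for number in ARTICLE_LINK.findall(article["text"]):
        if number != article["number"] and number not in links:
            links.append(number)
    return links


def build_documents(text):
    """Combine both steps into dicts that could later be indexed."""
    return [{**a, "links": extract_links(a)} for a in split_articles(text)]


# Made-up sample text, not a quotation from any legal act.
SAMPLE = """Art. 1
Prezentul cod stabileste cadrul legal.
Art. 2
Definitiile din art. 1 se aplica si in sensul art. 3.
Art. 3
Dispozitii finale.
"""

if __name__ == "__main__":
    for doc in build_documents(SAMPLE):
        print(doc)
```

Real legal texts have more heading variants (chapters, sections, indexed article numbers), so the actual patterns in the repository are likely more involved than these two.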
    python mcp_server.py

This starts a FastMCP server exposing:
- themis_search_law_ro
- themis_find_related_articles
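The internals of these tools are not described here. As a rough illustration of what a Romanian-language article search could send to Elasticsearch, the sketch below builds a standard `multi_match` request body. The function name `build_search_body` and the field names `title` and `text` are hypothetical and may not match the index mapping used by this project.

```python
def build_search_body(query_ro: str, size: int = 5) -> dict:
    """Return an Elasticsearch request body for a full-text article search."""
    if not query_ro.strip():
        raise ValueError("query must not be empty")
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": query_ro,
                # Hypothetical field names; "^2" boosts title matches over body matches.
                "fields": ["title^2", "text"],
            }
        },
    }


if __name__ == "__main__":
    # The query is in Romanian, matching the language of the indexed acts.
    print(build_search_body("concediu de odihna", size=3))
```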
You can use the provided helper script:

    ./run-mcp.sh

- prompts/search-law.md contains the legal assistant instructions used by the agent.
- It includes guidance for:
  - Romanian-only tool queries (translate user question intent when needed)
  - fact-first answers
  - mandatory legal references in the final response
  - conditional/incomplete-facts responses
If you change prompt behavior, update/extend e2e_tests to keep the output contract stable.
Run the Python tests:

    pytest

Test coverage includes:
- PDF reading
- article extraction and conversion
- regex patterns
- legal link extraction
- storage and pipeline behavior
The e2e tests in e2e_tests/ validate:
- expected legal references appear in answers
- optional references are tolerated in some scenarios
- no unexpected references (for strict scenarios)
- answer meaning/intent via an OpenAI-based checker step
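The reference checks above (expected, optional, and strict) can be sketched as set operations over the parsed `### References` section. This is not the step implementation in e2e_tests/: the function names are hypothetical, and the line format `- <law name> (Legea nr. <number>/<year>), art. <article>` is an assumption chosen only to carry the law name, law number, and article.

```python
import re

# Assumed reference line format: "- <law> (Legea nr. <n>/<year>), art. <article>".
REFERENCE_LINE = re.compile(
    r"^\s*[-*]\s*(?P<law>.+?)\s*\(Legea nr\.\s*(?P<number>\d+/\d{4})\),"
    r"\s*art\.\s*(?P<article>\d+)",
    re.MULTILINE,
)


def references_section(answer):
    """Return the text under the '### References' heading, or '' if absent."""
    marker = "### References"
    start = answer.find(marker)
    if start == -1:
        return ""
    rest = answer[start + len(marker):]
    # Stop at the next markdown heading, if there is one.
    next_heading = re.search(r"^#{1,6}\s", rest, re.MULTILINE)
    return rest[: next_heading.start()] if next_heading else rest


def parse_references(answer):
    """Return a set of (law, number, article) tuples found in the section."""
    return {
        (m["law"], m["number"], m["article"])
        for m in REFERENCE_LINE.finditer(references_section(answer))
    }


def check_references(answer, expected, optional=frozenset(), strict=False):
    """Return (missing, unexpected); 'unexpected' is only computed in strict mode."""
    found = parse_references(answer)
    missing = set(expected) - found
    unexpected = (found - set(expected) - set(optional)) if strict else set()
    return missing, unexpected


# Illustrative answer text, not real agent output.
ANSWER = """Salariatii au dreptul la concediu de odihna anual platit.

### References
- Codul Muncii (Legea nr. 53/2003), art. 144
- Codul Muncii (Legea nr. 53/2003), art. 145
"""

if __name__ == "__main__":
    expected = {("Codul Muncii", "53/2003", "144")}
    print(check_references(ANSWER, expected, strict=True))
```

An answer with no `### References` section reports every expected reference as missing, which matches the first troubleshooting hint for "Expected reference ... not found" failures.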
Run:

    behave e2e_tests/features

Prerequisites for e2e:
- Kibana and the target agent are running and reachable at http://localhost:5601
- KIBANA_API_KEY is set
- OPENAI_API_KEY is set
- Sample answer outputs used during validation/debugging may exist under answers/
- HTTP snippets for Elasticsearch setup/testing are under elastic-queries/
- Dockerfiles:
  - Dockerfile (project/runtime image)
  - Dockerfile_MCP (MCP server image used by compose)
- "Expected reference ... not found" in e2e tests:
  - Check prompt output includes a ### References section
  - Verify reference format includes law name, law number, and article
- E2E fails with missing API keys:
  - Ensure .env contains OPENAI_API_KEY and KIBANA_API_KEY
- Elasticsearch auth/connectivity issues:
  - Verify ELASTICSEARCH_URL, ELASTICSEARCH_USERNAME, ELASTICSEARCH_PASSWORD
  - Confirm Docker services are running (docker compose ps)
- MCP server returns poor/no results:
  - Confirm documents were ingested successfully before testing
  - Re-check prompt instructions in prompts/search-law.md