birladeanuadrian/themis

Themis

Themis is a Romanian legal search/analysis project built around:

  • PDF parsing and article extraction for legal acts
  • Elasticsearch indexing for searchable legal content
  • MCP tools for legal article search and related-article discovery
  • Prompt-driven legal Q&A behavior (used by an external agent setup)
  • Unit tests and end-to-end behavior tests

The project is aimed at helping people without a legal or financial background, such as new startup owners and people working in small and medium-sized enterprises (SMEs), get a practical starting point for legal and fiscal/tax questions.

It is designed for situations where users may not know where to look in Romanian legislation and want an accessible first answer before paying for or scheduling a meeting with an accountant or lawyer. It helps users identify the relevant legal basis and understand their issue faster, while still treating professional advice as the final step for case-specific decisions.

The repository is no longer just a standalone PDF parser. The PDF parser is one part of a larger ingestion + search + agent workflow.

Main Components

  • ingest.py: Ingests configured legal PDFs (documents/) into Elasticsearch using the extraction pipeline
  • mcp_server.py: Starts the FastMCP server and exposes legal search tools
  • prompts/search-law.md: System prompt for the legal assistant behavior
  • e2e_tests/: behave scenarios validating answer content and references
  • tests/: Unit/integration-style tests for parsing, extraction, links, storage, and pipeline pieces
  • src/: Core parsing, conversion, segmentation, link extraction, storage, and config logic
  • docker-compose.yml: Local Elasticsearch + Kibana + MCP server stack

Requirements

  • Python 3.11+
  • Docker + Docker Compose (for local Elasticsearch/Kibana stack)
  • Dependencies from requirements.txt

Python dependencies currently include:

  • PyMuPDF
  • elasticsearch
  • fastmcp
  • behave
  • openai
  • requests

Setup

1. Python environment

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. Environment variables

Create/update .env (used by Python code and e2e tests).

Common variables used in this repo:

  • ELASTICSEARCH_URL (default in code: http://elasticsearch:9200)
  • ELASTICSEARCH_USERNAME
  • ELASTICSEARCH_PASSWORD
  • SEGMENT_PIPELINE (optional; defaults to empty string)
  • OPENAI_API_KEY (required for e2e semantic assertion step)
  • KIBANA_API_KEY (required for e2e tests that call Kibana Agent Builder API)

Notes:

  • src/config.py loads env vars via python-dotenv
  • e2e_tests/features/environment.py loads .env from repo root
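As a rough, stdlib-only sketch of how this configuration might be read (the helper name and the empty-string fallbacks are assumptions; the real src/config.py uses python-dotenv to populate the environment first, and only ELASTICSEARCH_URL has a documented default):

```python
import os

def load_elasticsearch_config() -> dict:
    """Read the Elasticsearch-related settings listed above.

    ELASTICSEARCH_URL falls back to the default documented in this
    README; the other fallbacks are purely illustrative.
    """
    return {
        "url": os.environ.get("ELASTICSEARCH_URL", "http://elasticsearch:9200"),
        "username": os.environ.get("ELASTICSEARCH_USERNAME", ""),
        "password": os.environ.get("ELASTICSEARCH_PASSWORD", ""),
        "segment_pipeline": os.environ.get("SEGMENT_PIPELINE", ""),
    }
```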

3. Start local services (Docker)

docker compose up -d elasticsearch kibana

Optional: include the MCP server container too:

docker compose up -d

The compose stack includes:

  • Elasticsearch (localhost:9200)
  • Kibana (localhost:5601)
  • mcp-server service (built from Dockerfile_MCP)

Ingest Romanian Laws into Elasticsearch

ingest.py is preconfigured to ingest several Romanian legal documents from documents/, including:

  • Codul Fiscal
  • Codul Civil
  • Codul Muncii
  • Securitatea si Sanatatea in Munca

Run ingestion:

python ingest.py

What happens at a high level:

  1. PDF text is read
  2. Legal structure/articles are extracted
  3. Articles are converted/segmented
  4. Legal links/cross-references are extracted
  5. Data is stored in Elasticsearch
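To illustrate step 2, a minimal article splitter for Romanian legal text could look like the sketch below. The `Art. N` heading pattern is an assumption about the layout; the real extractor in src/ handles PDF-specific quirks this toy version ignores:

```python
import re

# Romanian legal acts typically number provisions as "Art. 1", "Art. 2." etc.
# This heading pattern is an illustrative assumption, not the project's regex.
ARTICLE_HEADING = re.compile(r"^Art\.\s*(\d+)\.?", re.MULTILINE)

def split_articles(text: str) -> dict[int, str]:
    """Split raw act text into {article_number: article_body}."""
    matches = list(ARTICLE_HEADING.finditer(text))
    articles = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        articles[int(m.group(1))] = text[start:end].strip()
    return articles
```

For example, `split_articles("Art. 1. Prima dispozitie.\nArt. 2. A doua dispozitie.")` yields one entry per article, keyed by its number.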

Run the MCP Server

Local Python process

python mcp_server.py

This starts a FastMCP server exposing:

  • themis_search_law_ro
  • themis_find_related_articles
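The actual tool implementations live in mcp_server.py. As a hedged idea of the kind of query a search tool like themis_search_law_ro might send to Elasticsearch, here is a stdlib-only query-building sketch; the index fields and boosts are guesses, not the project's real mapping:

```python
def build_law_query(query_text: str, size: int = 5) -> dict:
    """Build a multi_match request body for a Romanian-language search.

    The field names ("title", "body") and the boost value are
    illustrative assumptions about the index mapping.
    """
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": ["title^2", "body"],
            }
        },
    }
```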

MCP Inspector (local debugging)

You can use the provided helper script:

./run-mcp.sh

Prompts

  • prompts/search-law.md contains the legal assistant instructions used by the agent.
  • It includes guidance for:
    • Romanian-only tool queries (translate user question intent when needed)
    • fact-first answers
    • mandatory legal references in the final response
    • conditional/incomplete-facts responses

If you change prompt behavior, update/extend e2e_tests to keep the output contract stable.

Testing

Unit tests

Run the Python tests:

pytest

Test coverage includes:

  • PDF reading
  • article extraction and conversion
  • regex patterns
  • legal link extraction
  • storage and pipeline behavior

End-to-end tests (behave)

The e2e tests in e2e_tests/ validate:

  • expected legal references appear in answers
  • optional references are tolerated in some scenarios
  • no unexpected references (for strict scenarios)
  • answer meaning/intent via an OpenAI-based checker step
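In spirit, the reference checks above can be reduced to two set-membership tests, sketched here with hypothetical function names (the real behave step implementations live under e2e_tests/):

```python
def missing_references(answer: str, expected: list[str]) -> list[str]:
    """Return every expected reference that does not appear in the answer."""
    return [ref for ref in expected if ref not in answer]

def unexpected_references(answer: str, known: list[str],
                          expected: list[str], optional: list[str]) -> list[str]:
    """For strict scenarios: a known reference that appears in the answer
    but is neither expected nor optional counts as unexpected."""
    allowed = set(expected) | set(optional)
    return [ref for ref in known if ref in answer and ref not in allowed]
```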

Run:

behave e2e_tests/features

Prerequisites for e2e:

  • Kibana and the target agent are running and reachable at http://localhost:5601
  • KIBANA_API_KEY is set
  • OPENAI_API_KEY is set

Repository Notes

  • Sample answer outputs used during validation/debugging may exist under answers/
  • HTTP snippets for Elasticsearch setup/testing are under elastic-queries/
  • Dockerfiles:
    • Dockerfile (project/runtime image)
    • Dockerfile_MCP (MCP server image used by compose)

Troubleshooting

  • Expected reference ... not found in e2e tests:
    • Check that the answer includes a ### References section
    • Verify that each reference includes the law name, law number, and article
  • E2E fails with missing API keys:
    • Ensure .env contains OPENAI_API_KEY and KIBANA_API_KEY
  • Elasticsearch auth/connectivity issues:
    • Verify ELASTICSEARCH_URL, ELASTICSEARCH_USERNAME, ELASTICSEARCH_PASSWORD
    • Confirm Docker services are running (docker compose ps)
  • MCP server returns poor/no results:
    • Confirm documents were ingested successfully before testing
    • Re-check prompt instructions in prompts/search-law.md
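For the reference-format bullet above, a quick sanity check of the "law name, law number, article" shape might look like this. The line format and regex are assumptions for illustration; the real contract is defined by prompts/search-law.md and the e2e steps:

```python
import re

# Hypothetical shape for one line of the "### References" section, e.g.
# "Codul Muncii (Legea nr. 53/2003), art. 39".
REFERENCE_LINE = re.compile(r".+\(Legea nr\.\s*\d+/\d{4}\),\s*art\.\s*\d+")

def looks_like_reference(line: str) -> bool:
    """True when the line carries a law name, law number, and article."""
    return bool(REFERENCE_LINE.fullmatch(line.strip()))
```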
