Your intelligent juridical document look-up tool.
Jurask is a web-based tool that leverages LLM and RAG technologies to answer juridical questions written in natural language.
- Runs mostly locally and reasonably fast even on a laptop and uses free API otherwise, making it available for a variety of developers
- Pipelined architecture
- Jurask is able to go from links to the documents to the working web app fully automatically by running a single command
- Jenkins-like configuration capabilities
- Multithreaded parsing support
- (currently through setting MAX_THREADS value in parse.)
- Laws tested
- Конституция РФ
- ФЗ "О защите прав потребителей"
- ФЗ "Об ООО"
- ФЗ "Об АО"
- ФЗ "Об образовании"
- ФЗ "О рекламе"
- Query language support
- Russian
- Minimalistic yet intuitive web based interface
- API requirements
- Groq API key
- (Optional) Zilliz Cloud Database
All actions, unless specified, should be done while in the jurask/ folder.
Also, on Windows systems, write python jurask instead of ./jurask.
The project expects you to create a .env file and include
- GROQ_API_KEY="..."
and optionally - if remote database is set up -
- MILVUS_URI="..."
- MILVUS_TOKEN="..."
variables to be able to connect to external services
pip install -r requirements.txt
./jurask parse
./jurask embed
./jurask store
./jurask
./jurask run
Then, open http://localhost:8000 in your browser.
For debug purposes, you may also want to make queries through the CLI. Just run
./jurask cli
and wait until you're prompted to enter your question with a dinosaur-styled invitation: jurask ---^*>
(tip: you may write exit in the CLI mode to exit in a more controlled manner)
To run an experimental LLM-based testing, use
./jurask test
(This will create a simple log file in logs/ after the tests are finished)
App built with Langchain (LLM management),
Milvus (Vector DB),
FastAPI (Web framework),
Sentence transformers (To run the embedding generation model locally)
PyYAML (To parse the configuration file),
BeautifulSoup4 and Clouscraper (For webpage parsing)
and some other libraries to load the environment file, show progress bars and whatnot (see requirements.txt).
Model multilingual-e5-large-instruct has been used to generate vector embeddings for the database.
SQLite was used to store original laws for dynamic lookup
Free API to access LLMs provided by Groq
Web interface written using just pure HTML/CSS/JS
Jurask follows a certain pipeline-like architecture to go from document webpage links to the working web application:
To bring together all the different modules and make Jurask more manageable, configuration by a .yaml Jenkins-like pipeline description has been implemented. You can specify the documents and paths, as well as all included steps, in the corresponding jurask.yaml
Below is an example of such file to demonstrate the syntax and the general structure:
paths: # for an easier data management
database-path: data/db/ # where should database files be stored
parsed-data-path: data/parsed/ # where should parsed .json files of laws be stored
dataset-index-path: data/augment/ # where should a "contents" json be stored
stages: # allows for pipelined execution of tasks
parse:
skip: # include if the stage should be skipped
include: all # use to automatically include everything under documents: section
module: project.parse # see below for custom stages
embed:
skip:
include: all
except: [constitution] # handy keyword to exclude some documents from include: section
module: project.vector_db
store:
skip:
include: [constitution, education] # documents can be listed explicitly as well
module: project.reference_db
test:
skip:
module: project.testing.llm_judge_test
cli:
skip:
module: project.llm
run:
# no skip: present - this will run
module: project.web
documents: # data sources
- id: constitution # name to use in configuration and in the code
url: https://www.garant.ru/doc/constitution/ # only garant.ru links are supported to parse
description: Конституция РФ # human-friendly description of the declared data source
- id: consumer_rights
url: https://base.garant.ru/10106035/
description: ФЗ "О защите прав потребителей"
- id: advertisements
url: https://base.garant.ru/12145525/
description: ФЗ "О рекламе"
- id: llc
url: https://base.garant.ru/12109720/
description: ФЗ "Об ООО"
- id: education
url: https://base.garant.ru/70291362/
description: ФЗ "Об Образовании"
- id: jsc
url": https://base.garant.ru/10105712/
description: ФЗ "Об АО"parse- build the .jsons with law data parsed from the web;embed- build the vector database with chunked embeddings of data from the parsed .json files;store- build the relational database containing all the laws from the parsed .json files for quick referencing;test- test the system with an experimental LLM-based judging method (currently unstable);cli- make queries through the command line;run- launch the web server for the whole system, making it accesible through a browser on a local network;
You can run any set of stages from the CLI, for example:
./jurask test run
(keep in mind that non-existent stages will not stop the execution of other ones in the sequence)
In theory it is possible to implement custom add-on stages to jurask.
To do so, specify a new stage in the configuration file and the path to a python module executing it. In that module, declare a function as follows:
def main(documents, paths):
passThe executor (jurask) will then call this method, passing documents: and paths: sections' data to the corresponding parameters
- A lot of refactoring is required to make the codebase a least somewhat sustainable (for example, the Dataframe class is duplicated between two *_db.py files, a lot of global variables in both Python and Javascript code are present and so on)
- LLM testing is not robust at all
- SQLite, as well as Milvus Lite (based on SQLite) are not production-ready
- Very limited token length and minimal error handling in every part of the project
- Answer generations fail very often because the RAG pipeline is not tuned well enough
- Test log directory declaration is missing in the configuration file
- Prompts are far from being optimally written
- Frontend quite buggy
- No automatic generation of supported laws list on the frontend
Troodon Juryik does not come from the Jurassic era, but surprisingly you can still jurask him if you want.


