## juur.ai - Simplify Law

This project is made to simplify the process of understanding Estonian State laws, making it easier for ordinary citizens to navigate complex legal texts. While the Estonian government provides a helper bot called [Bürokratt](https://www.kratid.ee/en/burokratt), it has faced criticism for being unreliable and difficult to use. Our goal is to address these issues by offering a more intuitive, efficient solution that helps users quickly find the legal information they need.

Gemini 1.5's large context window, capable of processing up to 2 million tokens, provides the foundation for this project. However, due to the limitations of the free version, which supports only 1 million tokens, we run multiple parallel models to process the information. We also utilize a final model to compare and select the best response, ensuring that users receive the most accurate and relevant legal information.

## 1. Import packages

In [35]:
import os
import google.generativeai as genai
from dotenv import load_dotenv
import threading
import time
import concurrent

## 2. Load API keys

NB! to use this model at least 6 google API keys are needed.<br>
This can either be done by using `.env` file or inputing them into the code.<br>
The env file should be formated like

```
GOOGLE_API_KEY_1=abc
GOOGLE_API_KEY_2=abc
GOOGLE_API_KEY_3=abc
GOOGLE_API_KEY_4=abc
GOOGLE_API_KEY_5=abc
GOOGLE_API_KEY_6=abc
```

In [36]:
api_keys = []
if os.path.isfile(".env"):
    load_dotenv()
    api_key_1 = os.getenv('GOOGLE_API_KEY_1')
    api_key_2 = os.getenv('GOOGLE_API_KEY_2')
    api_key_3 = os.getenv('GOOGLE_API_KEY_3')
    api_key_4 = os.getenv('GOOGLE_API_KEY_4')
    api_key_5 = os.getenv('GOOGLE_API_KEY_5')
    api_key_6 = os.getenv('GOOGLE_API_KEY_6')
    
    api_keys = [
        api_key_1,
        api_key_2,
        api_key_3,
        api_key_4,
        api_key_5,
        api_key_6
    ]
else:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    api_keys = [
        user_secrets.get_secret("GOOGLE_API_KEY_1"),
        user_secrets.get_secret("GOOGLE_API_KEY_2"),
        user_secrets.get_secret("GOOGLE_API_KEY_3"),
        user_secrets.get_secret("GOOGLE_API_KEY_4"),
        user_secrets.get_secret("GOOGLE_API_KEY_5"),
        user_secrets.get_secret("GOOGLE_API_KEY_6")
    ]

## 3. Loading datasets

NB! The dataset is located here: [https://www.kaggle.com/datasets/robinotter/eesti-vabariigi-seadused](http://)

## 4. Making data clusters based on the maximum token count of the LLM

We started with 390 legal documents, which we organized into 12 groups. To ensure each document remains intact and fully accessible to a single language model, we sorted them based on their token count. Each group contains up to 998,000 tokens—just under the 1-million-token limit—leaving room for the system and user prompts. If users want to ask longer questions, we may need to reduce the group size to accommodate the extra tokens.

In [37]:
def get_file_sizes(data_folder, api_key, model_name):
    file_sizes = {}
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel(model_name)

    for filename in os.listdir(data_folder):
        if not filename.endswith('.txt'):
            continue
        file_path = os.path.join(data_folder, filename)
        if os.path.isfile(file_path):
            print(f'Processing file: {file_path}')
            my_file = open(file_path, 'r')
            my_file_content = my_file.read()
            my_file.close()
            token_count = model.count_tokens(my_file_content).total_tokens
            file_sizes[filename] = token_count

    return file_sizes

In [38]:
def get_groups(file_sizes, group_token_limit=1000):
    sorted_file_sizes = dict(sorted(file_sizes.items(), key=lambda item: item[1], reverse=True))

    groups = []

    def add_to_group(file_name):
        for group in groups:
            if sum(group.values()) + file_sizes[file_name] < group_token_limit:
                group[file_name] = file_sizes[file_name]
                return
        groups.append({file_name: file_sizes[file_name]})

    for file_name in sorted_file_sizes:
        add_to_group(file_name)

    return groups


In [39]:
def generate_new_files(groups, data_folder, output_folder):
    for i, group in enumerate(groups):
        i = i + 1
        with open(f'{output_folder}/group_{i}.txt', 'w') as f:
            for file_name in group:
                input_file_path = os.path.join(data_folder, file_name)
                with open(input_file_path, 'r') as input_file:
                    print(f'Writing {file_name} to group_{i}.txt')
                    f.write(input_file.read())

In [40]:
data_folder = "/kaggle/input/eesti-vabariigi-seadused/"
sorted_data_folder = "/sorted-data"

if not os.path.exists(sorted_data_folder):
    os.makedirs(sorted_data_folder)

file_sizes = get_file_sizes(data_folder=data_folder,api_key=api_keys[0],model_name="gemini-1.5-flash")
# Note: the token limit should also fit the system prompt and user prompt
groups = get_groups(file_sizes=file_sizes, group_token_limit=998000)
generate_new_files(groups=groups, data_folder=data_folder, output_folder=sorted_data_folder)


Processing file: /kaggle/input/eesti-vabariigi-seadused/111032023056.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/106072023026.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/108102024027.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/131012023006.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/113032014034.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/111032023047.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/105052022011.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/129062024022.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/106012023047.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/106072023008.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/24407.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/107062024020.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/106072023094.txt
Processing file: /kaggle/input/eesti-vabariigi-seadused/10

## 5. Loading sorted data

In [41]:
file_contents = []

for filename in os.listdir(sorted_data_folder):
    if filename.endswith('.txt'):
        file_path = os.path.join(sorted_data_folder, filename)
        with open(file_path, 'r') as file:
            content = file.read()
            file_contents.append(content)  

## 6. Model configuration and environment setup

Our system uses Estonian for both prompts and response formats, as all the legal documents are in Estonian and most of the questions are asked in Estonian.

In [42]:
def printProgressBar (iteration, total, prefix = '', suffix = '', decimals = 1, length = 100, fill = '█', printEnd = "\r"):
    """
    from: https://stackoverflow.com/questions/3173320/text-progress-bar-in-terminal-with-block-characters
    
    Call in a loop to create terminal progress bar
    @params:
        iteration   - Required  : current iteration (Int)
        total       - Required  : total iterations (Int)
        prefix      - Optional  : prefix string (Str)
        suffix      - Optional  : suffix string (Str)
        decimals    - Optional  : positive number of decimals in percent complete (Int)
        length      - Optional  : character length of bar (Int)
        fill        - Optional  : bar fill character (Str)
        printEnd    - Optional  : end character (e.g. "\r", "\r\n") (Str)
    """
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filledLength = int(length * iteration // total)
    bar = fill * filledLength + '-' * (length - filledLength)
    print(f'\r{prefix} |{bar}| {percent}% {suffix}', end = printEnd)
    # Print New Line on Complete
    if iteration == total: 
        print()

In [43]:
system_prompt = "Sa oled seaduste abiline. Kui sulle antud seaduses on küsimusele vastust, siis vasta sellele. Kui vastus puudub, siis vasta '0'"
response_format = """kastuta vastamisel järgmist formaati:
<seaduse nimi>

<sinu vastus>"""

## 7. LLM threading

We utilize threading to generate multiple Gemini responses, as this project requires handling up to 12 million tokens, while the free version of the model is limited to a maximum token count of 1 million. By using multiple models, we can efficiently search for specific paragraphs that explain the law the user has requested. This approach allows us to process large amounts of data, ensuring that relevant and precise information is extracted to answer complex legal queries. The ability to handle such extensive context is crucial for delivering accurate and comprehensive results.


Traditionally, finding answers in a 12-million-character legal database could take hours, if not days. Now, with our tool, powered by LLMs, you can get accurate answers in just three minutes.

In [44]:
thread_local = threading.local()
responses_lock = threading.Lock()
api_keys_lock = threading.Lock()
responses = []


def get_session():
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def get_response(data):
    api_keys, file_content, group_index, group_count, system_prompt, response_format, question = data
    
    with api_keys_lock:
        if not api_keys:
            print("No API keys left to process.")
            return
        api_key = api_keys.pop(0)
        
    genai.configure(api_key=api_key)

    model = genai.GenerativeModel("gemini-1.5-flash")
    if group_index % 2 == 0:
        model = genai.GenerativeModel("gemini-1.5-flash-8b")


    time.sleep(2)

    response_text = model.generate_content([
        system_prompt,
        response_format,
        file_content,
        "Küsimus on järgmine:",
        question
    ]).text

    # Safely append to the shared responses list
    with responses_lock:
        responses.append((group_index, response_text))
        
    groups_done = len(responses)
    printProgressBar(groups_done, group_count, length = 50)

def get_all_responses(api_keys, file_contents, system_prompt, response_format, question):
    group_count = len(api_keys)
    tasks = [
        ( api_keys, file_contents[i], i, group_count, system_prompt, response_format, question)
        for i in range(group_count)
    ]

    with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
        executor.map(get_response, tasks)

def control_threading(user_message):
    temp_api_keys = api_keys.copy()
    temp_api_keys = temp_api_keys * 2
    temp_api_keys.sort()

    start_time = time.time()

    group_count = len(api_keys)
    printProgressBar(0, group_count, length = 50)

    get_all_responses(temp_api_keys, file_contents, system_prompt, response_format, user_message)
    
    duration = time.time() - start_time

    responses.sort(key=lambda x: x[0])

    print()
    print(f"Got responses in {duration} seconds")


## 8. Best response

The answers also include citations, specify which legal documents were used in answering, and provide recommendations on where to find additional information.

In [45]:
def get_best_answer(user_message):
    genai.configure(api_key=api_keys[0])
    model = genai.GenerativeModel("gemini-1.5-pro")
    response_texts = [element[1] for element in responses]

    print("Choosing best answer")
    start_time = time.time()
    
    prompt = (
        ["Sulle antakse mitme erineva mudeli vastused erinevatest seadustest, tsiteeri mulle mudeli vastuseid, mis pole '0'"] +
        [response_format] +
        response_texts + 
        ["Küsimus on järgnev:"] +
        [user_message]
    )

    duration = time.time() - start_time
    print(f"Chose best answer in {duration} seconds")
    
    best_answer = model.generate_content(prompt)
    return best_answer

In [46]:
def full_flow(user_message):
    control_threading(user_message)
    return get_best_answer(user_message).text

## 9. Simple UI

Our solution features an interactive chat interface where you can easily communicate with the model. To keep you informed, a progress bar is displayed during each request, as responses can take up to three minutes. The response time is also shown for added transparency.

In [None]:
import os
import time


terminal_colors = {
    "purple": "\033[0;35m",
    "end": "\033[0m"
}


def generate_response(user_message):
    responses = []
    answer = full_flow(user_message)

    return answer

print(f"{terminal_colors['purple']}juur.ai: Hello! Type 'exit' to end the conversation.{terminal_colors['end']}")

while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        break
    print
    response = generate_response(user_input)
    print(f"{terminal_colors['purple']}juur.ai: {response}{terminal_colors['end']}")

[0;35mjuur.ai: Hello! Type 'exit' to end the conversation.[0m


You:  Kas ma võin alkoholi tarbida ja sõita?


 |--------------------------------------------------| 0.0% 
Got responses in 0.0018012523651123047 seconds
Choosing best answer
Chose best answer in 1.9073486328125e-06 seconds
[0;35mjuur.ai: Siin on erinevate mudelite vastused küsimusele "Kas ma võin alkoholi tarbida ja sõita?":

**Seadused ja regulatsioonid**

Te ei tohiks alkoholi tarbida ja sõita. Joobes juhtimine on ebaseaduslik ja ohtlik. See seab ohtu nii teie enda kui ka teiste elu. Kui plaanite alkoholi tarbida, veenduge, et teil on olemas kaine juht või kasutage alternatiivset transporti.
[0m
