# Retrieving Legal Information
This notebook uses a large language model to retrieve legal information from case files in PDF format. All documents are in the public domain.  
  
Case: A South African case involving the Road Accident Fund  
Retrieval method: Auto-merging retrieval  

Auto-merging retrieval works on a hierarchy of text chunks. Each document is split into large parent chunks, which are split again into smaller child chunks; only the smallest (leaf) chunks are embedded and indexed. At query time the retriever fetches the leaf chunks most similar to the query and, when enough leaves belonging to the same parent are retrieved, replaces them with that parent chunk. This can improve answers by:

- Restoring context: A merged parent chunk gives the language model the surrounding text, not just isolated fragments.
- Reducing fragmentation: Several adjacent leaf chunks are passed to the model as one coherent passage instead of separate, possibly out-of-order pieces.
- Keeping retrieval precise: Similarity search still runs on small chunks, so matching stays specific even though larger passages may be returned.
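
The merge step of auto-merging retrieval can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `auto_merge`, the toy node ids, and the 0.5 ratio threshold are all assumptions made for the example.

```python
from collections import defaultdict

def auto_merge(retrieved_leaf_ids, parent_of, children_of, ratio_threshold=0.5):
    """Replace groups of retrieved leaves with their parent when enough siblings were hit.

    Hypothetical helper for illustration; the threshold value is an assumption.
    """
    # Group the retrieved leaves by the parent chunk they belong to.
    hits_per_parent = defaultdict(list)
    for leaf_id in retrieved_leaf_ids:
        hits_per_parent[parent_of[leaf_id]].append(leaf_id)

    merged = []
    for parent_id, hits in hits_per_parent.items():
        ratio = len(hits) / len(children_of[parent_id])
        if ratio > ratio_threshold:
            merged.append(parent_id)   # enough siblings retrieved: return the parent
        else:
            merged.extend(hits)        # too few: keep the individual leaves
    return merged

# Toy hierarchy: two parents with four leaf chunks each.
children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f", "g", "h"]}
parent_of = {leaf: parent for parent, leaves in children_of.items() for leaf in leaves}

result = auto_merge(["a", "b", "c", "e"], parent_of, children_of)
print(result)  # ['P1', 'e']
```

Three of the four leaves under `P1` were retrieved, so they are replaced by `P1`; only one leaf under `P2` was retrieved, so it is returned unchanged.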

#### Set up the environment  
Note: The glob module provides a function for making file lists from directory wildcard searches. Essentially, it allows you to find all the pathnames matching a specified pattern according to the rules used by the Unix shell, although the results are returned in arbitrary order.
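
Because `glob.glob` returns matches in arbitrary order, sorting the result makes the order in which documents are loaded reproducible. The short sketch below illustrates this with temporary files; the file names are made up for the example.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few empty placeholder files (hypothetical names).
    for name in ("b.pdf", "a.pdf", "notes.txt"):
        open(os.path.join(tmp_dir, name), "w").close()

    # Match only the PDF files and sort them for a deterministic order.
    pdf_paths = sorted(glob.glob(os.path.join(tmp_dir, "*.pdf")))
    pdf_names = [os.path.basename(path) for path in pdf_paths]

print(pdf_names)  # ['a.pdf', 'b.pdf']
```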

In [1]:
!python --version
# This notebook runs on Python 3.10.13 or higher

Python 3.10.13


In [2]:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Load the required modules
from llama_index import (
    ServiceContext,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.retrievers import AutoMergingRetriever
from llama_index.indices.postprocessor import SentenceTransformerRerank
from llama_index.query_engine import RetrieverQueryEngine
from llama_index import SimpleDirectoryReader
from llama_index.llms import OpenAI
from llama_index import Document

from trulens_eval import Tru

import utils # Ensure utils.py is in the current working directory
from utils import get_prebuilt_trulens_recorder
import glob
import os
import openai
openai.api_key = utils.get_openai_api_key()

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input response will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .


#### Load and preprocess the documents

In [3]:
my_directory_path="./ocr_docs_georg/*" #This is the path to the folder containing the documents

# Put all files in the directory into input_files
input_files = glob.glob(my_directory_path)

# Load the documents
documents = SimpleDirectoryReader(input_files=input_files).load_data()


In [4]:
# Check document contents
print(type(documents), "\n")
print(len(documents), "\n")
print(type(documents[0]))
print(documents[0])

<class 'list'> 

4 

<class 'llama_index.schema.Document'>
Doc ID: f2fce791-a18d-4b74-b34e-c89b10a02d29
Text: COURT ONLINE COVER PAGE INI THE HIGH COURT OF SOUTH. AFRICA
Gauteng Local Division, Pretoria CASE NO: 2023-134420 Int the matter
between: THE ROAD ACCIDENT FUND Panuf/Aplon/Agpelant and THE LEGAL
PRACTICE COUNCIL,THE BOARD OF SHERIFFS,THE SHERIFF PRETORIA
CENTRAL,THE SHERIFF PRETORIA EAST,THE SHERIFF CENTURION EAST,THE
SHERIFF OHANNESBURG CENTRA...


In [5]:
# Join all documents into one
# (note: the cells below pass the original `documents` list, not this merged `document`)
document = Document(text="\n\n".join([doc.text for doc in documents]))

#### Set up the language model

In [6]:
#llm_name="gpt-3.5-turbo"
llm_name="gpt-4-1106-preview"
llm = OpenAI(model=llm_name, temperature=0.1)

#### Define automerging functions  
The `build_automerging_index` function builds an auto-merging index from a collection of documents, or loads it if it has already been saved. Each part of the function is explained below:  

  
##### 1. **Function Definition:**
- `def build_automerging_index(documents, llm, embed_model="local:BAAI/bge-small-en-v1.5", save_dir="merging_index_georg", chunk_sizes=None):`
This line defines the function with parameters for the documents to index, a language model (`llm`), an embedding model, the directory to save or load the index from, and the sizes for chunking documents.

##### 2. **Setting Default Chunk Sizes:**
- `chunk_sizes = chunk_sizes or [2048, 512, 128]`
This sets the default chunk sizes if none are provided. Each value is the chunk size (in tokens) for one level of the hierarchy, ordered from the largest parent chunks down to the smallest leaf chunks.

##### 3. **Node Parser:**
- `node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_sizes)`
This creates a node parser that organizes documents into a hierarchical structure, which is beneficial for indexing and retrieval tasks. The parser uses the specified chunk sizes.

##### 4. **Nodes from Documents:**
- `nodes = node_parser.get_nodes_from_documents(documents)`
This parses the documents into nodes at every level of the hierarchy. Each node is a chunk of text, and each child node keeps a reference to its parent.

##### 5. **Leaf Nodes:**
- `leaf_nodes = get_leaf_nodes(nodes)`
This extracts the leaf nodes from the hierarchical structure. Leaf nodes are the smallest chunks (those with no children), and they are the only nodes that are embedded and indexed.

##### 6. **Service Context:**
- `merging_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)`
This sets up a context for the indexing service, including the language model and embedding model. This context contains configurations and resources the indexing service needs.

##### 7. **Storage Context:**
- `storage_context = StorageContext.from_defaults()`
This initializes a default storage context, which manages how and where the index and associated data are stored.
- `storage_context.docstore.add_documents(nodes)`
This line adds all nodes (parents and leaves) to the document store within the storage context, so that parent chunks can be looked up at query time.

##### 8. **Check and Build or Load Index:**
- The `if not os.path.exists(save_dir)` block checks whether an index already exists in the specified directory (`save_dir`). If it doesn't, the index is built from the leaf nodes and then saved to that directory. If the index does exist, it's loaded from the storage instead.

##### 9. **Index Creation and Persistence:**
- `automerging_index = VectorStoreIndex(...)`
This line creates a new index from the leaf nodes, using the specified storage and service contexts. It runs only when no saved index exists; otherwise `load_index_from_storage(...)` loads the saved index instead.
- `automerging_index.storage_context.persist(persist_dir=save_dir)`
If a new index is created, this line saves it to the specified directory so it can be quickly loaded in the future.

##### 10. **Return the Index:**
- `return automerging_index`
Finally, the constructed or loaded auto-merging index is returned.

**Summary:**
The `build_automerging_index` function creates or loads an auto-merging index for a set of documents. It handles hierarchical chunking, extraction of leaf nodes, and the setup of the service and storage contexts. The embedding model is used to index the leaf nodes, and the language model is stored in the service context for use at query time.
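
The hierarchical chunking and leaf extraction described in steps 2 to 5 can be sketched without any library. This is a simplified illustration under stated assumptions: chunk sizes are counted in words rather than tokens, and `build_hierarchy` and `leaf_nodes_of` are hypothetical names, not part of `llama_index`.

```python
def build_hierarchy(text, chunk_sizes):
    """Split text into nested chunks; sizes are word counts here (an assumption)."""
    nodes = []

    def split(words, level, parent_id):
        size = chunk_sizes[level]
        for start in range(0, len(words), size):
            piece = words[start:start + size]
            node_id = len(nodes)
            nodes.append(
                {"id": node_id, "parent": parent_id, "level": level, "text": " ".join(piece)}
            )
            # Recurse into the next, smaller chunk size if there is one.
            if level + 1 < len(chunk_sizes):
                split(piece, level + 1, node_id)

    split(text.split(), 0, None)
    return nodes

def leaf_nodes_of(nodes):
    """Return the nodes that are not the parent of any other node."""
    parent_ids = {node["parent"] for node in nodes}
    return [node for node in nodes if node["id"] not in parent_ids]

sample = " ".join(f"w{i}" for i in range(16))
all_nodes = build_hierarchy(sample, chunk_sizes=[8, 4])
leaves = leaf_nodes_of(all_nodes)
print(len(all_nodes), len(leaves))  # 6 4
```

With 16 words and sizes `[8, 4]`, the text yields two parent chunks and four leaf chunks. All six nodes would go into the document store, but only the four leaves would be embedded.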


In [7]:
# Define a function to build the automerging index
def build_automerging_index(
    documents,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="merging_index_georg",
    chunk_sizes=None,
):
    chunk_sizes = chunk_sizes or [2048, 512, 128]
    node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_sizes)
    nodes = node_parser.get_nodes_from_documents(documents)
    leaf_nodes = get_leaf_nodes(nodes)
    merging_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
    )
    storage_context = StorageContext.from_defaults()
    storage_context.docstore.add_documents(nodes)

    # Check if the automerging index exists.
    # If it does not exist, build it
    # If it exists, load the index
    if not os.path.exists(save_dir):
        automerging_index = VectorStoreIndex(
            leaf_nodes, storage_context=storage_context, service_context=merging_context
        )
        automerging_index.storage_context.persist(persist_dir=save_dir)
    else:
        automerging_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=save_dir),
            service_context=merging_context,
        )
    return automerging_index
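
One consequence of the build-or-load check is worth illustrating: if `save_dir` already exists, the saved index is loaded and the `chunk_sizes` argument has no effect on it. The sketch below reproduces the pattern with a JSON file standing in for the index; `build_or_load` and the file name `index.json` are hypothetical and used only for illustration.

```python
import json
import os
import tempfile

def build_or_load(save_dir, build_fn):
    """Build once and persist; on later calls, load the persisted copy."""
    path = os.path.join(save_dir, "index.json")
    if not os.path.exists(save_dir):
        index = build_fn()
        os.makedirs(save_dir)
        with open(path, "w") as f:
            json.dump(index, f)
        return index, "built"
    with open(path) as f:
        return json.load(f), "loaded"

with tempfile.TemporaryDirectory() as tmp_dir:
    save_dir = os.path.join(tmp_dir, "demo_index")
    first = build_or_load(save_dir, lambda: {"chunk_sizes": [2048, 512]})
    # A second call with different settings still returns the saved copy.
    second = build_or_load(save_dir, lambda: {"chunk_sizes": [2048, 512, 128]})

print(first[1], second[1])       # built loaded
print(second[0]["chunk_sizes"])  # [2048, 512]
```

When experimenting with different chunk sizes, it is therefore safer to use a different save directory for each configuration, or to delete the old directory first.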



The following cell defines a Python function `get_automerging_query_engine` that sets up an automerging query engine. The engine is used for information retrieval tasks, where it retrieves documents or other data that are most similar to a given query and then refines (reranks) those results. Here's what each part of the function does:

##### **Function Definition:**
- `def get_automerging_query_engine(automerging_index, similarity_top_k=12, rerank_top_n=6):`
This line defines the function with parameters for the auto-merging index, the number of most similar leaf nodes to retrieve (`similarity_top_k`), and the number of results to keep after reranking (`rerank_top_n`).

##### **Base Retriever:**
- `base_retriever = automerging_index.as_retriever(similarity_top_k=similarity_top_k)`
This creates a basic retriever from the auto-merging index. The `similarity_top_k` parameter indicates that the retriever should fetch the top-k most similar items to a given query. The base retriever is responsible for the initial retrieval of documents based on similarity.

##### **AutoMerging Retriever:**
- `retriever = AutoMergingRetriever(base_retriever, automerging_index.storage_context, verbose=True)`
This line wraps the base retriever in an `AutoMergingRetriever`. After the base retriever returns leaf nodes, the auto-merging retriever looks up their parents in the storage context and replaces groups of sibling leaves with the parent node when enough of them were retrieved. The `verbose=True` parameter prints each merge as it happens, as in the `> Merging ... nodes into parent node.` lines in the evaluation output further down.

##### **Reranking with Sentence Transformers:**
- `rerank = SentenceTransformerRerank(top_n=rerank_top_n, model="BAAI/bge-reranker-base")`
This sets up a reranker based on `"BAAI/bge-reranker-base"`, a pre-trained cross-encoder reranking model. The reranker rescores the retrieved nodes against the query and keeps only the `rerank_top_n` highest-scoring ones.

##### **Automerging Query Engine:**
- `auto_merging_engine = RetrieverQueryEngine.from_args(retriever, node_postprocessors=[rerank])`
This creates an automerging query engine, which manages the overall retrieval process. It's configured to use the `retriever` for initial retrieval and the `rerank` object as a postprocessor to refine the results. The engine encapsulates the process of taking a query, fetching relevant items, and applying post-processing to deliver a final set of refined results.

##### **Return the Query Engine:**
- `return auto_merging_engine`
Finally, the function returns the fully configured automerging query engine.

**Summary:**
The `get_automerging_query_engine` function is a utility for setting up a query engine that first retrieves a set of documents or data based on similarity to a query and then reranks those results for improved relevance and accuracy. It uses an auto-merging index for initial retrieval and a sentence transformer model for reranking. This setup is typical in systems where efficient, scalable, and accurate search over large text corpora or similar datasets is necessary.
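
The two-stage pattern summarized above (retrieve `similarity_top_k` candidates cheaply, then rescore them and keep `rerank_top_n`) can be sketched in plain Python. The scoring functions here are toy stand-ins chosen for the example, not the embedding model or the reranker used in this notebook, and `retrieve_then_rerank` is a hypothetical name.

```python
def retrieve_then_rerank(query, corpus, first_stage_score, second_stage_score,
                         similarity_top_k=12, rerank_top_n=6):
    """Keep the top-k candidates by a cheap score, then rescore and keep the top-n."""
    candidates = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:similarity_top_k]
    reranked = sorted(
        candidates, key=lambda doc: second_stage_score(query, doc), reverse=True
    )
    return reranked[:rerank_top_n]

def word_overlap(query, doc):
    # Toy first-stage score: number of shared words.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def overlap_density(query, doc):
    # Toy second-stage score: shared words relative to document length.
    return word_overlap(query, doc) / len(doc.split())

corpus = [
    "writs of execution are suspended",
    "the writs of execution and attachments are suspended for a period",
    "bills of costs are taxed",
    "the sheriff is a respondent",
]
top = retrieve_then_rerank(
    "writs of execution suspended", corpus, word_overlap, overlap_density,
    similarity_top_k=3, rerank_top_n=1,
)
print(top)  # ['writs of execution are suspended']
```

The first stage narrows the corpus to three candidates; the second stage then applies a different score to only those candidates, which is why a more expensive reranking model remains affordable.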


In [8]:
# Define a function to set up the automerging query engine
def get_automerging_query_engine(
    automerging_index,
    similarity_top_k=12,
    rerank_top_n=6,
):
    base_retriever = automerging_index.as_retriever(similarity_top_k=similarity_top_k)
    retriever = AutoMergingRetriever(
        base_retriever, automerging_index.storage_context, verbose=True
    )
    rerank = SentenceTransformerRerank(
        top_n=rerank_top_n, model="BAAI/bge-reranker-base"
    )
    auto_merging_engine = RetrieverQueryEngine.from_args(
        retriever, node_postprocessors=[rerank]
    )
    return auto_merging_engine

#### Select the chunk size and build the automerging index

In [9]:
auto_merging_index = build_automerging_index(
    documents,
    llm=OpenAI(model=llm_name, temperature=0.1),
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="merging_index_georg", # Give the name of the save directory. It can be different from the name in the function.
    chunk_sizes=[2048,512], # Two levels: parent chunks and leaf chunks
    #chunk_sizes=[2048,512,128], # Three levels
)

#### Set up the automerging query engine

In [10]:
auto_merging_engine = get_automerging_query_engine(
    auto_merging_index, # Pass the auto_merging_index created above
    similarity_top_k=12,
    rerank_top_n=6,
)

#### Ask a question on the fly

In [12]:
auto_merging_response = auto_merging_engine.query(
    "For how many days have all writs of execution been suspended?"
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
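
The warning above can be silenced as it suggests, by setting the `TOKENIZERS_PARALLELISM` environment variable. A minimal sketch, assuming the variable is set before the tokenizer is first used (for example, in the first code cell of the notebook):

```python
import os

# Disable tokenizer parallelism explicitly, as suggested by the warning message.
# This must run before the tokenizer is first used to take effect.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```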


In [13]:
from llama_index.response.notebook_utils import display_response

display_response(auto_merging_response)

**`Final Response:`** All writs of execution have been suspended for a period of 180 days.

### Evaluate a batch of questions with TruLens
TruLens is an open-source library for evaluating and tracking applications based on large language models.

First, set up the evaluation questions.

In [14]:
eval_questions = []
with open('eval_questions_georg.txt', 'r') as file:
    for line in file:
        # Remove the trailing newline and surrounding whitespace
        item = line.strip()
        eval_questions.append(item)
        
print(eval_questions)

['For how many days have all writs of execution been suspended?', 'In what instances will bills of costs be suspended for 45 days?', 'List the full name of each respondent', 'How many people are injured in motor vehicle related accidents per annum?']
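
The loop above appends every line, so a blank line in the questions file would become an empty question. A small variant that skips blank lines is sketched below; `load_questions` and the temporary file are hypothetical and used only to make the example self-contained.

```python
import os
import tempfile

def load_questions(path):
    """Read one question per line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as file:
        return [line.strip() for line in file if line.strip()]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "questions.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("First question?\n\nSecond question?\n")
    questions = load_questions(path)

print(questions)  # ['First question?', 'Second question?']
```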


In [15]:
# Optional: Here is how you can append a new question:
new_question = "Where shall this order be published by the applicant?"
eval_questions.append(new_question)
print(eval_questions)

['For how many days have all writs of execution been suspended?', 'In what instances will bills of costs be suspended for 45 days?', 'List the full name of each respondent', 'How many people are injured in motor vehicle related accidents per annum?', 'Where shall this order be published by the applicant?']


#### Run the TruLens recorder  
 
##### **Resetting the Database:**
- `Tru().reset_database()`
This line calls the `reset_database()` method on an instance of the `Tru` class. It clears all previously logged records and feedback results, so the evaluation starts from an empty database. Note that this also deletes the results of earlier runs.

##### **Setting Up a Recorder:**
- `tru_recorder = get_prebuilt_trulens_recorder(auto_merging_engine, app_id='app_0')`
This line creates a TruLens recorder that logs each query, its response, and the feedback scores. `get_prebuilt_trulens_recorder` is a helper imported from `utils.py`; judging by the messages printed when the modules were loaded, it sets up the Answer Relevance, Context Relevance, and Groundedness feedback functions. The parameters are:

- `auto_merging_engine`: The query engine to record. It is the engine set up previously, which combines an auto-merging retriever with a reranker.
- `app_id='app_0'`: An application identifier. TruLens stores records under this identifier, so results from different configurations can be told apart and compared on the leaderboard.



In [16]:
Tru().reset_database()

tru_recorder = get_prebuilt_trulens_recorder(
    auto_merging_engine,
    app_id='app_0'
)

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


#### Define a function to run the evaluation questions on TruRecorder

In [17]:
def run_evals(eval_questions, tru_recorder, query_engine):
    # Run each question inside the recorder context so that TruLens logs the call
    for question in eval_questions:
        with tru_recorder as recording:
            response = query_engine.query(question)

In [18]:
# Run the evaluation function
run_evals(eval_questions, tru_recorder, auto_merging_engine)

> Merging 2 nodes into parent node.
> Parent node id: 5280a631-de3c-4d4f-ae42-9e39efda4f37.
> Parent node text: JOHANNESBURG ON THIS 2197 DAY OF DECEMBER 2023.
Guu
MALATJI & CO ATTORNEYS
Attorneys for the appl...

> Merging 3 nodes into parent node.
> Parent node id: fd7f78d0-3307-4090-ada4-a85b5661a7d9.
> Parent node text: Respondent
REG
op HE
AL
Prvote Bga Z
Potunis 000r
2020 -2- 14 Sixteenth Respondent
DF
COPETONA
PH...

> Merging 1 nodes into parent node.
> Parent node id: 261e66c0-1178-4222-b8be-d0f8600b23f0.
> Parent node text: interest of justice.
13.4 An appliçation to be admitted as an amicus curiae must briefly describe...



#### Get the TruLens leaderboard

In [19]:
# Get the TruLens leaderboard
Tru().get_leaderboard(app_ids=[])

| app_id | Answer Relevance | Context Relevance | Groundedness | latency | total_cost |
|---|---|---|---|---|---|
| app_0 | 1.0 | 0.379167 | 0.730769 | 15.8 | 0.004642 |


#### Extract the records and feedback for each question in turn

In [20]:
records, feedback = Tru().get_records_and_feedback(app_ids=[])
records[:len(eval_questions)]

Unnamed: 0,app_id,app_json,type,record_id,input,output,tags,record_json,cost_json,perf_json,ts,Answer Relevance,Context Relevance,Groundedness,Answer Relevance_calls,Context Relevance_calls,Groundedness_calls,latency,total_tokens,total_cost
0,app_0,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_d261d63c7141b5d2963d4cf802769af0,"""For how many days have all writs of execution...","""All writs of execution have been suspended fo...",-,"{""record_id"": ""record_hash_d261d63c7141b5d2963...","{""n_requests"": 1, ""n_successful_requests"": 1, ...","{""start_time"": ""2023-12-31T01:30:51.910372"", ""...",2023-12-31T01:31:09.703046,1.0,0.366667,1.0,[{'args': {'prompt': 'For how many days have a...,[{'args': {'prompt': 'For how many days have a...,[{'args': {'source': 'Page 7of184 Ang 21/22023...,17,2536,0.003812
1,app_0,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_a5e6f67f3107ad26aa86351c224b1a7c,"""In what instances will bills of costs be susp...","""Bills of costs will be suspended for 45 days ...",-,"{""record_id"": ""record_hash_a5e6f67f3107ad26aa8...","{""n_requests"": 1, ""n_successful_requests"": 1, ...","{""start_time"": ""2023-12-31T01:31:10.464693"", ""...",2023-12-31T01:31:26.052551,1.0,0.616667,1.0,[{'args': {'prompt': 'In what instances will b...,[{'args': {'prompt': 'In what instances will b...,[{'args': {'source': 'Page 7of184 Ang 21/22023...,15,2808,0.00423
2,app_0,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_1a27303a16f51a0804cbecd5a3357a9d,"""List the full name of each respondent""","""AD Dandala & Associates\nGodla & Partners\nSi...",-,"{""record_id"": ""record_hash_1a27303a16f51a0804c...","{""n_requests"": 2, ""n_successful_requests"": 2, ...","{""start_time"": ""2023-12-31T01:31:26.667861"", ""...",2023-12-31T01:31:40.704344,1.0,0.533333,0.923077,[{'args': {'prompt': 'List the full name of ea...,[{'args': {'prompt': 'List the full name of ea...,"[{'args': {'source': 'The eleventh respondent,...",14,4762,0.007234
3,app_0,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_96dc5f4ff9a40cd5a9ba9b07972bcc2f,"""How many people are injured in motor vehicle ...","""The context information does not provide a sp...",-,"{""record_id"": ""record_hash_96dc5f4ff9a40cd5a9b...","{""n_requests"": 1, ""n_successful_requests"": 1, ...","{""start_time"": ""2023-12-31T01:31:41.318718"", ""...",2023-12-31T01:31:56.680056,1.0,0.0,0.0,[{'args': {'prompt': 'How many people are inju...,[{'args': {'prompt': 'How many people are inju...,[{'args': {'source': '1.13 Further and/or othe...,15,2758,0.004147
4,app_0,"{""tru_class_info"": {""name"": ""TruLlama"", ""modul...",RetrieverQueryEngine(llama_index.query_engine....,record_hash_3761c556159053d6b4944025d76f6e26,"""Where shall this order be published by the ap...","""The order shall be published by the applicant...",-,"{""record_id"": ""record_hash_3761c556159053d6b49...","{""n_requests"": 1, ""n_successful_requests"": 1, ...","{""start_time"": ""2023-12-31T01:31:57.371661"", ""...",2023-12-31T01:32:15.939319,,,,,,,18,2495,0.003785
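
The leaderboard figures are consistent with a per-app mean of the per-record values, with missing scores skipped. This is an inference from the numbers shown in this notebook, not a statement about TruLens internals. The sketch below recomputes the means from the values in the records table; `record_scores` and `mean_skipping_nan` are names introduced for this example.

```python
import math

# Per-record values copied from the records table; NaN marks scores
# that were not yet available for the last record.
record_scores = [
    {"Context Relevance": 0.366667, "Groundedness": 1.0, "latency": 17, "total_cost": 0.003812},
    {"Context Relevance": 0.616667, "Groundedness": 1.0, "latency": 15, "total_cost": 0.00423},
    {"Context Relevance": 0.533333, "Groundedness": 0.923077, "latency": 14, "total_cost": 0.007234},
    {"Context Relevance": 0.0, "Groundedness": 0.0, "latency": 15, "total_cost": 0.004147},
    {"Context Relevance": math.nan, "Groundedness": math.nan, "latency": 18, "total_cost": 0.003785},
]

def mean_skipping_nan(values):
    """Average the values that are present, ignoring NaN entries."""
    present = [value for value in values if not math.isnan(value)]
    return sum(present) / len(present)

summary = {
    column: round(mean_skipping_nan([record[column] for record in record_scores]), 6)
    for column in record_scores[0]
}
print(summary)
```

The recomputed means match the leaderboard row for `app_0`: about 0.379 for Context Relevance, 0.731 for Groundedness, 15.8 for latency, and 0.0046 for total cost. The low Context Relevance average is driven largely by the question about annual injuries, which scored 0.0.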


#### Print the questions, answers and the corresponding triad of metrics:  
- Answer Relevance: is the answer relevant to the query? 
- Context Relevance: is the retrieved context relevant to the query? 
- Groundedness: is the answer supported by the retrieved context?

In [21]:
# Loop through questions and outputs
for idx, (question, output) in enumerate(zip(records["input"][:len(eval_questions)], records["output"][:len(eval_questions)])):
    print("Question:", question)
    print("Answer:", output)
    
    # Print additional information if available, formatted to two decimal places
    if "Answer Relevance" in records and len(records["Answer Relevance"]) > idx:
        print(f"Answer Relevance: {records['Answer Relevance'][idx]:.2f}")
    if "Context Relevance" in records and len(records["Context Relevance"]) > idx:
        print(f"Context Relevance: {records['Context Relevance'][idx]:.2f}")
    if "Groundedness" in records and len(records["Groundedness"]) > idx:
        print(f"Groundedness: {records['Groundedness'][idx]:.2f}")

    print("\n")  # Print a newline for better readability


Question: "For how many days have all writs of execution been suspended?"
Answer: "All writs of execution have been suspended for a period of 180 days."
Answer Relevance: 1.00
Context Relevance: 0.37
Groundedness: 1.00


Question: "In what instances will bills of costs be suspended for 45 days?"
Answer: "Bills of costs will be suspended for 45 days in cases where the bill of costs is not settled internally with the RAF but has to be taxed by the Taxing Master."
Answer Relevance: 1.00
Context Relevance: 0.62
Groundedness: 1.00


Question: "List the full name of each respondent"
Answer: "AD Dandala & Associates\nGodla & Partners\nSithombe Attorneys\nKMalao Inc.\nMduzulwana Attorneys and Legal/Consultants\nDVDM Inc.\nBe Broglio) Attorneys Inc.\nVDS Attorneys\nRoets & Van Rensburg\nKorommbi Mabuli Inc.\nSpruyt Inc.\nPersonal Injury Plaintiffs Lawyers Association\nAdvocate RAF Fee Recovery Association"
Answer Relevance: 1.00
Context Relevance: 0.53
Groundedness: 0.92


Question: "How many p

#### Finally, run the TruLens dashboard

In [22]:
Tru().run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



Dashboard started at http://localhost:8501 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>