### Extracting Legal Information  
This notebook extracts information from legal documents pertaining to a dispute between the Road Accident Fund and other parties. Because the source files are scanned PDFs, the inputs to this notebook are OCR-processed text files; the OCR step was done separately. All legal documents used here are in the public domain.  
This notebook uses:  
- LangChain to load documents, retrieve information, and generate responses.  
- OpenAI Embeddings  
- FAISS in-memory vector store

##### Load the required modules

In [None]:
#!pip install unstructured
#!pip install tqdm # This is useful for showing progress

In [23]:
!python --version

Python 3.9.18


#### 1 - Import dependencies and set up the language model

In [24]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain 
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.faiss import FAISS
from langchain.chains import create_retrieval_chain
from langchain_core.output_parsers import JsonOutputParser
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import DirectoryLoader
from langchain.memory import ConversationBufferMemory
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain.schema import format_document
from langchain_core.messages import AIMessage, HumanMessage, get_buffer_string
from langchain_core.runnables import RunnableParallel
from operator import itemgetter
from datetime import datetime

##### Read in the OpenAI API key

In [25]:
import os
# Read OpenAI API key from environment variable
openai_api_key = os.getenv("OPENAI_API_KEY")
if openai_api_key is None:
    raise ValueError("OpenAI API Key not set in environment variables")
os.environ["OPENAI_API_KEY"] = openai_api_key

##### Set up the model

In [26]:
model = ChatOpenAI(
    model = "gpt-3.5-turbo-1106",
    #model = "gpt-4-1106-preview", # Note: Map reduce with GPT4 is long and costly
    temperature = 0.1
)

#### 2 - Read in the questions 

In [27]:
# Questions commented out were rephrased to elicit better responses.

question_1 = "What is the central legal claim or cause of action asserted in the pleadings?"
question_2 = "Can you provide a concise summary of the background facts leading to the legal dispute?"
question_3 = "What specific legal theories or statutes form the basis of the claims made?"
#question_4 = "Are there any precedents or case law cited that support these legal arguments?"
question_4 = "Are there any precedents or case law cited that support these legal arguments? Think step by step before answering and list each one."
#question_5 = "Who are the plaintiffs and defendants named in the pleadings?"
question_5 = "Who are the plaintiffs and defendants named in the pleadings? Give the name of each one."
#question_6 = "Are there any third parties mentioned, and what roles do they play in the case?"
question_6 = "Are there any third parties mentioned, and what roles do they play in the case? Give the name and role of each one."
question_7 = "What are the key factual allegations made by each party?"
question_8 = "Are there specific incidents or events that are crucial to the case?"
question_9 = "What remedies or relief are the parties seeking from the court?"
question_10 = "Are there any specific damages claimed, and how are they quantified?"
question_11 = "Does the pleading establish the court's jurisdiction over the matter?"
question_12 = "Is the chosen venue appropriate, and are there any challenges to jurisdiction?"
#question_13 = "What defenses are asserted by the opposing party?"
question_13 = "Can you list and explain the defenses that the opposing parties have asserted in these legal documents?"
#question_14 = "Are there affirmative defenses or counterclaims presented?"
question_14 = "Are there affirmative defenses or counterclaims presented? Think step by step before answering and list each one."
question_15 = "Are the pleadings clear and logically organized?"
question_16 = "Are there any inconsistencies or contradictions within the pleadings?"
question_17 = "Are there any documents referenced or attached to the pleadings?"
question_18 = "How do these documents support or undermine the parties' claims?"
question_19 = "Are there identified witnesses whose testimony is crucial to the case?"
question_20 = "How do the pleadings anticipate presenting evidence during the proceedings?"
#question_21 = "Have the pleadings addressed any applicable timelines or statutes of limitations?"
question_21 = "Please list all applicable timelines or statutes of limitations mentioned in the pleadings."
question_22 = "Are there any time-sensitive elements that may impact the case?"
question_23 = "Have there been any attempts at settlement or ADR mentioned in the pleadings?"

# Here are the second round questions

#question_24 = "In terms of the case which has been filed under case number 2023-134420, what time was the annexe documentation filed by the registrar of the court?"
question_24 = "A case was filed under case number 2023-134420. Tell me the exact date the document was filed \
by the registrar of the court. Also tell me the exact time of day the document was filed by the registrar of the court."
question_25 = "In terms of the notice of motion marked annexe “A1”, please summarise the orders that the road accident fund is \
seeking from the court."
question_26 = "Please summarise the order contained in paragraph 11.1 of the rule 16A notice document which is titled annexe 'A2'."
#question_26a = "Please summarise the order contained in paragraph 11.1 of notice of motion long form document."
question_27 = "According to the documents provided, how many people are injured annually in South Africa as a consequence of \
motor vehicle accidents?"
#question_28 = "What does the acronym RNYP stand for?"
question_28 = "What is the purpose of the RNYP list?"
question_29 = "What is the physical address of Malatji and Co. Attorneys?"
question_30 = "In terms of the court papers provided, the relief that the applicant will seek will have an impact on the \
person’s rights in terms of the Bill of Rights of the Constitution of South Africa, 1996. What specific impacts are listed?"
#question_31 = "Please provide the full case citation of the case involving the Matjhabeng Local Municipality."
question_31 = "What does the document say about Matjhabeng?"
question_32 = "What must a court be cautious not to usurp?"
question_33 = "Please provide the names of the Advocate who appeared for the 15th respondent in the matter with \
case number reference 58145/2020."
question_34 = "What is the RAF claim number, in respect of Sithole, R?"
question_35 = "In terms of the document entitled Annexe “A.10.1”, within which period must solar panels be purchased and \
installed at a private residence, in order for a taxpayer to qualify for the tax rebate?"
question_36 = "In terms of the document entitled Annexe “A.10.1”, what will the minimum royalty rate be increased to?"
question_37 = "In terms of the document entitled Annexe “A.10.2”, please list SARS’ 9 stated strategic objectives." 
#question_38 = "In terms of the document entitled Annexe “A.10.3”, please list SARS’ 9 stated strategic objectives."
question_38 = "What are the main tax proposals for fiscal 2023/24?"


# Create a list of questions
questions = [eval(f'question_{i}') for i in range(1, 39)]
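
Because the list above is assembled from individually numbered variables, assigning the same `question_N` name twice silently drops the earlier text. As a minimal sketch of an alternative layout (the names `question_bank` and `ordered_questions` are hypothetical and not part of this notebook), the questions could live in a dictionary keyed by number, where a repeated key is easier to spot and the list is built without looking names up dynamically:

```python
# Hypothetical alternative layout: keep the questions in a dict keyed by number.
# Only three of the notebook's questions are reproduced here for brevity.
question_bank = {
    1: "What is the central legal claim or cause of action asserted in the pleadings?",
    5: "Who are the plaintiffs and defendants named in the pleadings? Give the name of each one.",
    28: "What is the purpose of the RNYP list?",
}


def ordered_questions(bank):
    """Return the question texts sorted by their numeric key."""
    return [bank[number] for number in sorted(bank)]


sample_questions = ordered_questions(question_bank)
for number, text in zip(sorted(question_bank), sample_questions):
    print(f"{number}: {text}")
```

The resulting list can be iterated in the same way as `questions`.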

#### 3 - Create Useful Functions

##### Create a function to build the vector database from the source OCR processed text

In [28]:
def create_vector_db(source_directory):
    loader = DirectoryLoader(source_directory, glob="**/*.txt", loader_cls=TextLoader, show_progress=True)
    pages=loader.load()

    # Define chunk size, overlap and separators
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=7000, #experiment with this
        chunk_overlap=700,
        separators=['\n\n', '\n', '. ', ' ', '']  # separators are treated as literal strings by default
    )
    docs  = text_splitter.split_documents(pages)

    # Select the embeddings to use
    embeddings = OpenAIEmbeddings()

    # Create the vector store
    # Vectorstore: https://python.langchain.com/en/latest/modules/indexes/vectorstores.html

    vector_db = FAISS.from_documents(docs, embeddings)

    return vector_db, docs
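
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on the listed separators); the helper `chunk_text` is a hypothetical name that only shows how overlapping character windows are laid out.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, which is the role the 700-character overlap plays for the 7000-character chunks above: text that straddles a boundary still appears whole in at least one chunk.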


##### Create a function to summarize the document

In [29]:
def summarize_doc(docs):

    summarization_prompt = """
    Write a concise summary of the following:
    {text}
    CONCISE SUMMARY:
    """

    prompt = PromptTemplate(template=summarization_prompt, input_variables=["text"])

    summary_chain = load_summarize_chain(
    llm = model,
    chain_type="map_reduce",
    map_prompt=prompt,
    combine_prompt=prompt
    )

    summary = summary_chain.run(docs)
    #print(summary)

    return summary
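
The `map_reduce` chain type can be pictured as two passes: summarize each chunk on its own, then summarize the combined partial summaries. The sketch below illustrates only that control flow. `fake_summarize` is a hypothetical stand-in for the model call (it keeps the first few words), and the two sample chunks are made-up text rather than content from the case files.

```python
def fake_summarize(text, max_words=8):
    """Stand-in for an LLM call: keep only the first few words."""
    return " ".join(text.split()[:max_words])


def map_reduce_summary(chunks, summarize=fake_summarize):
    # Map step: summarize every chunk independently.
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: summarize the concatenated partial summaries.
    return summarize("\n".join(partial_summaries))


# Made-up example chunks, for illustration only.
sample_chunks = [
    "The Road Accident Fund describes its financial position in detail over many pages.",
    "The applicant asks the court to suspend writs of execution for a limited period.",
]
final_summary = map_reduce_summary(sample_chunks)
print(final_summary)
```

With a real model, the map step costs one call per chunk, which is why the cell that sets up the model warns that map reduce with GPT-4 is long and costly.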

##### Create a function to create the retrieval chain with conversation history and memory
The chain is built in three steps.  
First, add conversational history: given the chat history and a follow-up question, the model rephrases the follow-up as a standalone question that can be understood without the chat history.  

Next, format the retrieved documents and combine them into a single string. Each document is formatted with a specified or default template, and the results are joined with a specified separator.  

Finally, add memory and return the source documents alongside the answer.
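
The document-combining step can be sketched without any library code. In this illustration, `SimpleDoc` is a hypothetical stand-in for a LangChain `Document`, and `combine_documents` mirrors the idea of formatting each document with a template before joining; the file names in the metadata are invented.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def combine_documents(docs, template="{page_content}", separator="\n\n"):
    """Format each document with the template, then join with the separator."""
    formatted = [
        template.format(page_content=doc.page_content, **doc.metadata)
        for doc in docs
    ]
    return separator.join(formatted)


example_docs = [
    SimpleDoc("First retrieved chunk.", {"source": "a.txt"}),
    SimpleDoc("Second retrieved chunk.", {"source": "b.txt"}),
]
context = combine_documents(example_docs, template="[{source}] {page_content}")
print(context)
```

With the default template only the page content is kept, which matches the `{page_content}` prompt used in the function below.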

In [30]:
def create_retrieval_chain():
    # Given the chat history, rephrase the follow-up question to be a stand-alone question.
    _template = """Given the following conversation and a follow up question, rephrase the follow up question \
    to be a standalone question, in its original language.
    
    Chat History:
    {chat_history}
    Follow Up Input: {question}
    Standalone question:"""
    CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

    # Create a summary prompt template from the document summary
    summary_template = PromptTemplate(input_variables=["summary"], template="{summary}")

    # Add it to the QA prompt template
    template = summary_template.format(summary=summary) + """
    You are an AI trained as an experienced legal assistant, well-versed in South African contract law. 
    Your responses should be careful, concise, and legally authoritative, drawing specifically from the provided context. 
    If the answer to a question is not clearly supported by the context, state that you do not know the answer. 
    Avoid speculations or assumptions outside the given context. 
    Focus on delivering brief, factual, and directly relevant responses.

    Answer the question based only on the following context:
    {context}

    Question: {question}
    """
    ANSWER_PROMPT = ChatPromptTemplate.from_template(template)

    # Format and combine multiple documents into a single string
    DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")


    def _combine_documents(
        docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"
    ):
        doc_strings = [format_document(doc, document_prompt) for doc in docs]
        return document_separator.join(doc_strings)

    # First we add a step to load memory
    # This adds a "chat_history" key to the input object
    loaded_memory = RunnablePassthrough.assign(
        chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter("history"),
    )
    # Now we calculate the standalone question
    standalone_question = {
        "standalone_question": {
            "question": lambda x: x["question"],
            "chat_history": lambda x: get_buffer_string(x["chat_history"]),
        }
        | CONDENSE_QUESTION_PROMPT
        | model
        | StrOutputParser(),
    }
    # Now we retrieve the documents
    retrieved_documents = {
        "docs": itemgetter("standalone_question") | retriever,
        "question": lambda x: x["standalone_question"],
    }
    # Now we construct the inputs for the final prompt
    final_inputs = {
        "context": lambda x: _combine_documents(x["docs"]),
        "question": itemgetter("question"),
    }
    # And finally, we do the part that returns the answers
    answer = {
        "answer": final_inputs | ANSWER_PROMPT | model,
        "docs": itemgetter("docs"),
    }
    # And now we put it all together!
    final_chain = loaded_memory | standalone_question | retrieved_documents | answer

    return final_chain, memory
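
The data flow through the four stages of the chain can be hard to follow in pipe syntax. The following sketch restates it as plain functions that pass a dictionary along. Everything here is illustrative: the `stub_*` functions are hypothetical replacements for the model and the retriever, and `run_chain` is not part of the notebook.

```python
def load_history(inputs, history):
    # Stage 1: attach the stored chat history to the incoming question.
    return {**inputs, "chat_history": list(history)}


def condense(inputs, rephrase):
    # Stage 2: turn the follow-up into a standalone question.
    return {"standalone_question": rephrase(inputs["question"], inputs["chat_history"])}


def retrieve(inputs, search):
    # Stage 3: fetch documents for the standalone question.
    question = inputs["standalone_question"]
    return {"docs": search(question), "question": question}


def answer(inputs, respond):
    # Stage 4: answer from the retrieved context and pass the sources along.
    context = "\n\n".join(inputs["docs"])
    return {"answer": respond(context, inputs["question"]), "docs": inputs["docs"]}


# Hypothetical stubs standing in for the model and the retriever.
def stub_rephrase(question, history):
    return question if not history else f"{question} (context: {history[-1]})"


def stub_search(question):
    return [f"chunk matching '{question}'"]


def stub_respond(context, question):
    return f"Answer to '{question}' using {len(context)} characters of context"


def run_chain(question, history):
    state = load_history({"question": question}, history)
    state = condense(state, stub_rephrase)
    state = retrieve(state, stub_search)
    return answer(state, stub_respond)


result_demo = run_chain("Who is the applicant?", history=[])
print(result_demo["answer"])
print(result_demo["docs"])
```

Note that each stage returns a new dictionary, so only the keys a stage explicitly emits are visible to the next one. This is why the retrieval stage re-emits `question` alongside `docs`.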

#### 4 - Run the functions

##### A - Create the vector database from the source OCR processed documents

In [31]:
vector_db, docs = create_vector_db(source_directory="./RAF_ocr_docs/") 
#This is a one time event and can take some time depending on the length of the documents

 67%|█████████████████████████████▎              | 4/6 [00:00<00:00, 487.03it/s]


##### B - Create a summary of the case

In [None]:
# Get the current date and time
current_time = datetime.now()
# Format the date and time as a string
formatted_time = current_time.strftime("Started summarizing document at %Y-%m-%d %H:%M:%S")
# Print the formatted date and time
print(formatted_time)

summary = summarize_doc(docs) #This is a one time event and can take some time

# Get the current date and time
current_time = datetime.now()
# Format the date and time as a string
formatted_time = current_time.strftime("Finished summarizing document at %Y-%m-%d %H:%M:%S")
# Print the formatted date and time
print(formatted_time + "\n")
print(summary)
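
The start and finish timestamps above could also be wrapped in a small reusable helper. This is only a sketch: `timed` is a hypothetical name, and the `sum(...)` call is a placeholder for the slow summarization call.

```python
import time
from contextlib import contextmanager
from datetime import datetime


@contextmanager
def timed(label):
    """Print start and finish timestamps plus the elapsed time for a block."""
    print(f"Started {label} at {datetime.now():%Y-%m-%d %H:%M:%S}")
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"Finished {label} at {datetime.now():%Y-%m-%d %H:%M:%S} ({elapsed:.2f}s)")


with timed("summarizing document"):
    total = sum(range(1000))  # placeholder for the slow call
```

The `finally` clause means the finish line is printed even if the wrapped call raises an exception.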

In [None]:
# Edit as necessary
summary = """
The document addresses financial challenges faced by the Road Accident Fund and its request for court relief to fulfill its \
obligations to claimants. The RAF seeks court intervention to prevent financial collapse and manage its obligations.
"""

##### C - Optional: Save summary to disk

In [None]:
# Specify the filename where you want to save the summary
filename = "./RAF_summaries/summary.txt"

# Open the file in write mode and write the summary
with open(filename, 'w') as file:
    file.write(summary)

print(f"The document summary has been saved to {filename}")
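
Opening a file for writing raises `FileNotFoundError` if its parent folder does not exist yet. A minimal sketch of a more forgiving save and load round trip, assuming `pathlib` is acceptable here, is shown below. The helper names `save_summary` and `load_summary` are hypothetical, and the demo writes to a temporary directory rather than the notebook's real folder.

```python
import tempfile
from pathlib import Path


def save_summary(text, path):
    """Write the summary as UTF-8, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


def load_summary(path):
    """Read a previously saved summary back as a string."""
    return Path(path).read_text(encoding="utf-8")


# Demonstrate the round trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "RAF_summaries" / "summary.txt"
    save_summary("Example summary text.", target)
    restored = load_summary(target)

print(restored)
```

Specifying the encoding explicitly avoids platform-dependent defaults, which matters for OCR text containing characters such as curly quotes.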

##### D - Optional: Load summary from disk

In [32]:
# Specify the filename from which you want to load the summary
filename = "./RAF_summaries/summary.txt"

# Open the file in read mode and read the contents
with open(filename, 'r') as file:
    summary = file.read()

# Optionally, print the loaded summary
print(f"Summary:\n {summary}")


Summary:
 
The document addresses financial challenges faced by the Road Accident Fund and its request for court relief to fulfill its obligations to claimants. The RAF seeks court intervention to prevent financial collapse and manage its obligations.



#### 5 - Set up the retriever  
    
Parameters for search_type = similarity_score_threshold:    
search_type="similarity_score_threshold",  
search_kwargs={'score_threshold': 0.4},  
  
Parameters for search_type = mmr (Maximal Marginal Relevance):   
k = number of documents to retrieve (default = 4)  
lambda_mult = degree of diversity, from 0 (maximum diversity) to 1 (minimum diversity) (default = 0.5)  
fetch_k = number of documents to pass to the MMR algorithm (default = 20)
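
To make the roles of `k`, `lambda_mult` and `fetch_k` concrete, here is a small pure-Python sketch of maximal marginal relevance over toy two-dimensional vectors. It illustrates the general technique and is not the code the vector store runs. The functions `cosine` and `mmr` and the example vectors are assumptions made for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k=4, lambda_mult=0.5, fetch_k=20):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    relevance = [cosine(query, c) for c in candidates]
    # Keep only the fetch_k most relevant candidates for re-ranking.
    pool = sorted(range(len(candidates)), key=lambda i: relevance[i], reverse=True)[:fetch_k]
    selected = []
    while pool and len(selected) < k:
        def score(i):
            # Penalize similarity to anything that has already been selected.
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
relevance_only = mmr(query, candidates, k=2, lambda_mult=1.0)
diverse = mmr(query, candidates, k=2, lambda_mult=0.3)
print(relevance_only, diverse)
```

With `lambda_mult=1.0` the two most similar vectors are returned even though they are near duplicates. With a lower value, the second pick shifts to the less similar but more distinct candidate.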

In [33]:
retriever = vector_db.as_retriever(
    search_type="similarity",
    #search_kwargs={'k': 6, 'lambda_mult': 0.5, 'fetch_k': 2 },
)

#### 6 - Invoke the retrieval chain and run the questions through the language model

In [34]:
# Reset memory
memory = ConversationBufferMemory(
    return_messages=True, output_key="answer", input_key="question"
)

In [None]:
final_chain, memory = create_retrieval_chain()

In [43]:
for question in questions:
    inputs = {"question": question}
    result = final_chain.invoke(inputs)
    
    # Print results
    print("\033[1m" + question + "\033[0m")
    print(result['answer'].content)
    print("\n")

    # Save the memory manually
    memory.save_context(inputs, {"answer": result["answer"].content})

    # Load the memory
    memory.load_memory_variables({})

What is the central legal claim or cause of action asserted in the pleadings?
The central legal claim or cause of action asserted in the pleadings is the request for a suspension of all writs of execution and warrants of attachment against the Road Accident Fund (RAF) based on court orders already granted or settlements already reached in terms of the Road Accident Fund Act, 1996. This request is made to prevent the collapse of the RAF due to its financial instability and inability to make immediate lump sum payments to claimants.


Can you provide a concise summary of the background facts leading to the legal dispute?
The Road Accident Fund (RAF) is experiencing severe financial difficulties, exacerbated by the Covid-19 pandemic, and is at risk of imminent collapse. The RAF seeks extraordinary relief to stabilize its financial position and prevent a constitutional crisis. It seeks a suspension of all writs of execution and attachments against it for a period of 180 day

Here is how you can see the chat history used to answer the last question:

In [21]:
memory.load_memory_variables({})

{'history': []}
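
A buffer that has just been created or reset reports an empty history, which is one way an output like the one above can arise. The sketch below is a hypothetical, heavily simplified buffer (it is not the LangChain `ConversationBufferMemory` class) that shows how saving a question and answer pair fills the history. The sample answer text is invented.

```python
class BufferMemory:
    """Hypothetical, simplified conversation buffer (not the LangChain class)."""

    def __init__(self):
        self.history = []

    def save_context(self, inputs, outputs):
        # Store the human question followed by the AI answer.
        self.history.append(("human", inputs["question"]))
        self.history.append(("ai", outputs["answer"]))

    def load_memory_variables(self):
        # Return a copy so callers cannot mutate the stored history.
        return {"history": list(self.history)}

    def as_text(self):
        # Flatten the history into a "Human: ... / AI: ..." transcript.
        labels = {"human": "Human", "ai": "AI"}
        return "\n".join(f"{labels[role]}: {text}" for role, text in self.history)


demo_memory = BufferMemory()
print(demo_memory.load_memory_variables())
demo_memory.save_context(
    {"question": "What does the RAF do?"},
    {"answer": "It pays compensation."},
)
print(demo_memory.as_text())
```

The flattened transcript is the kind of string that would be placed in the `{chat_history}` slot of a question-condensing prompt.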

#### 7 - Scratchpad for do-over questions

In [41]:
do_over_question = """
who is cto of the raf?
"""

In [42]:
final_chain, memory = create_retrieval_chain()
inputs = {"question": do_over_question}
result = final_chain.invoke(inputs)

# Print results
print("\033[1m" + do_over_question + "\033[0m")
print(result['answer'].content)
print("\n")

# Save memory manually
memory.save_context(inputs, {"answer": result["answer"].content})

# Load memory
memory.load_memory_variables({})

who is cto of the raf?
I do not have that information based on the provided context.




{'history': [HumanMessage(content='\nwhat does the RAF do?\n'),
  AIMessage(content='The role of the Road Accident Fund (RAF) is to pay compensation for loss or damage wrongfully caused by the driving of motor vehicles, in accordance with the RAF Act. Its powers and functions include stipulating the terms and conditions for administering claims and managing the money of the Fund for purposes connected with its duties.'),
  HumanMessage(content='\nwho is the ceo of the raf?\n'),
  AIMessage(content='The CEO of the Road Accident Fund is not explicitly mentioned in the provided context. Therefore, I do not have the information to answer this question.'),
  HumanMessage(content='\nwhen was the raf established?\n'),
  AIMessage(content='The Road Accident Fund (RAF) was established in terms of section 2(1) of the RAF Act. No specific date of establishment is provided in the given context.'),
  HumanMessage(content='\nwho is cto of the raf?\n'),
  AIMessage(content='I do not have that informa

##### Alternate ways to summarize the document

In [None]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
chain = load_summarize_chain(llm, chain_type="map_reduce")

summary_map_reduce=chain.run(docs)
print(summary_map_reduce)

In [None]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
chain = load_summarize_chain(llm, chain_type="refine")

summary_refine=chain.run(docs)
print(summary_refine)

In [None]:
# For long documents, this method may exceed the model token limit
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
chain = load_summarize_chain(llm, chain_type="stuff")

summary_stuff=chain.run(docs)
print(summary_stuff)
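
The difference between the `stuff` and `refine` strategies comes down to how many model calls are made and what each call sees. The following sketch shows only that control flow. `fake_llm` is a hypothetical stand-in that echoes the tail of its prompt, and the sample chunks are placeholders.

```python
def fake_llm(prompt, max_words=10):
    """Stand-in for a model call: echo the last few words of the prompt."""
    return " ".join(prompt.split()[-max_words:])


def stuff_summary(chunks, llm=fake_llm):
    # "stuff": place every chunk in a single prompt and call the model once.
    return llm("Summarize: " + " ".join(chunks))


def refine_summary(chunks, llm=fake_llm):
    # "refine": summarize the first chunk, then update the summary chunk by chunk.
    summary = llm("Summarize: " + chunks[0])
    for chunk in chunks[1:]:
        summary = llm(f"Existing summary: {summary} New text: {chunk}")
    return summary


# Count how many calls each strategy makes.
calls = []


def counting_llm(prompt):
    calls.append(prompt)
    return fake_llm(prompt)


placeholder_chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta"]
stuff_summary(placeholder_chunks, llm=counting_llm)
stuff_calls = len(calls)
calls.clear()
refine_summary(placeholder_chunks, llm=counting_llm)
refine_calls = len(calls)
print(stuff_calls, refine_calls)
```

A single `stuff` call must fit every chunk into one prompt, which is why it may exceed the token limit on long documents, whereas `refine` makes one sequential call per chunk.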

#### References

[Langchain: Retrieval Augmented Generation](https://python.langchain.com/docs/expression_language/cookbook/retrieval)