# **Extracting Information from Legal Documents Using RAG**

## **Objective**

The main objective of this assignment is to process and analyse a collection of text files containing legal agreements (e.g., NDAs) to prepare them for implementing a **Retrieval-Augmented Generation (RAG)** system. This involves:

* Understand the Cleaned Data: Gain a comprehensive understanding of the structure, content, and context of the cleaned dataset.
* Perform Exploratory Analysis: Conduct bivariate and multivariate analyses to uncover relationships and trends within the cleaned data.
* Create Visualisations: Develop meaningful visualisations to support the analysis and make findings interpretable.
* Derive Insights and Conclusions: Extract valuable insights from the cleaned data and provide clear, actionable conclusions.
* Document the Process: Provide a detailed description of the data, its attributes, and the steps taken during the analysis for reproducibility and clarity.

The ultimate goal is to transform the raw text data into a clean, structured, and analysable format that can be effectively used to build and train a RAG system for tasks like information retrieval, question-answering, and knowledge extraction related to legal agreements.

### **Business Value**  


The project aims to leverage RAG to enhance legal document processing for businesses, law firms, and regulatory bodies. The key business objectives include:

* Faster Legal Research: <br> Reduce the time lawyers and compliance officers spend searching for relevant case laws, precedents, statutes, or contract clauses.
* Improved Contract Analysis: <br> Automatically extract key terms, obligations, and risks from lengthy contracts.
* Regulatory Compliance Monitoring: <br> Help businesses stay updated with legal and regulatory changes by retrieving relevant legal updates.
* Enhanced Decision-Making: <br> Provide accurate and context-aware legal insights to assist in risk assessment and legal strategy.


**Use Cases**
* Legal Chatbots
* Contract Review Automation
* Tracking Regulatory Changes and Compliance Monitoring
* Case Law Analysis of past judgments
* Due Diligence & Risk Assessment

## **1. Data Loading, Preparation and Analysis** <font color=red> [20 marks] </font><br>

### **1.1 Data Understanding**

The dataset contains legal documents and contracts collected from various sources. The documents are present as text files (`.txt`) in the *corpus* folder.

There are four types of documents in the *corpus* folder, divided into four subfolders.
- `contractnli`: contains various non-disclosure and confidentiality agreements
- `cuad`: contains contracts with annotated legal clauses
- `maud`: contains various merger/acquisition contracts and agreements
- `privacy_qa`: a question-answering dataset containing privacy policies

The dataset also contains evaluation files in JSON format in the *benchmark* folder. The files contain the questions and their answers, along with sources. For each of the above four folders, there is a `json` file: `contractnli.json`, `cuad.json`, `maud.json`, and `privacy_qa.json`. The file structure is as follows:

```
{
    "tests": [
        {
            "query": <question1>,
            "snippets": [{
                    "file_path": <source_file1>,
                    "span": [ begin_position, end_position ],
                    "answer": <relevant answer to the question 1>
                },
                {
                    "file_path": <source_file2>,
                    "span": [ begin_position, end_position ],
                    "answer": <another relevant answer to the question 1>
                }, ....
            ]
        },
        {
            "query": <question2>,
            "snippets": [{<answer context for que 2>}]
        },
        ... <more queries>
    ]
}
```
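
To make the structure concrete, here is a minimal sketch that flattens one benchmark payload into one row per query and snippet. The inline payload and the helper name `flatten_benchmark` are illustrative only, and the reading of `span` as a pair of character offsets into the source file is an assumption that should be checked against the data.

```python
import json

# Illustrative payload that follows the structure above; the values are made up.
sample = json.loads("""
{
  "tests": [
    {
      "query": "Does the NDA allow sharing with employees?",
      "snippets": [
        {"file_path": "contractnli/example.txt",
         "span": [10, 42],
         "answer": "Employees with a need to know may receive it."}
      ]
    }
  ]
}
""")

def flatten_benchmark(data):
    """Return one dict per (query, snippet) pair (hypothetical helper)."""
    rows = []
    for test in data.get("tests", []):
        for snippet in test.get("snippets", []):
            begin, end = snippet["span"]  # assumed to be character offsets
            rows.append({
                "query": test["query"],
                "file_path": snippet["file_path"],
                "span": (begin, end),
                "answer": snippet["answer"],
            })
    return rows

rows = flatten_benchmark(sample)
print(rows[0]["query"], rows[0]["span"])
```

A flat list like this is convenient for loading into a DataFrame or for checking which source files each question refers to.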

### **1.2 Load and Preprocess the data** <font color=red> [5 marks] </font><br>

#### Loading libraries

In [None]:
# ## The following libraries might be useful
# !pip install -q langchain-openai
!pip install -q langchain-groq
!pip install -U -q langchain-community
!pip install -U -q langchain-chroma
!pip install -U -q datasets
!pip install -U -q ragas
!pip install -U -q evaluate
!pip install -U -q rouge_score
!pip install -U -q langchain-google-genai
!pip install -U -q langchain_huggingface
!pip install faiss-cpu



In [None]:
# Import essential libraries
import warnings
import os
import json
import re
import glob
import random
import logging
from collections import Counter
import numpy as np

# Third-Party Library Imports
# Data Handling & Analysis
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# NLP Specific Libraries
import nltk
from nltk.corpus import stopwords
from transformers import AutoTokenizer

# LangChain Specific Imports
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS, Chroma
from langchain.chains import RetrievalQA

# Database/Vector Store Specific Imports
import chromadb
from chromadb.config import Settings # Specific ChromaDB settings

# Suppress all warnings - Often placed at the top after core imports
warnings.filterwarnings('ignore')

# NLTK Downloads - Typically done once or within a setup script
# but kept here if intended to run on every execution for demonstration
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### **1.2.1** <font color=red> [3 marks] </font>
Load all `.txt` files from the folders.

You can utilise document loaders from the options provided by the LangChain community.

Optionally, you can also read the files manually, while ensuring proper handling of encoding issues (e.g., utf-8, latin1). In such case, also store the file content along with metadata (e.g., file name, directory path) for traceability.
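
The manual route described above can be sketched as follows. The helper name `read_text_with_fallback` and the metadata keys are hypothetical; the encodings tried are the two named in the task. Because `latin1` can decode any byte sequence, it acts as a last resort and may produce wrong characters for files that are in some other encoding.

```python
import os

def read_text_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in turn and return (text, metadata)."""
    last_error = None
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as handle:
                text = handle.read()
        except UnicodeDecodeError as error:
            # Remember the failure and move on to the next candidate encoding.
            last_error = error
            continue
        metadata = {
            "file_name": os.path.basename(path),
            "directory": os.path.dirname(path),
            "encoding": encoding,
        }
        return text, metadata
    if last_error is None:
        raise ValueError("no encodings were supplied")
    raise last_error
```

Storing the encoding that succeeded alongside the file name makes it easier to trace odd characters back to a decoding fallback later on.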

In [None]:
# Load the files as documents

# mount a folder
from google.colab import drive
drive.mount('/content/drive/')

from langchain.document_loaders import DirectoryLoader, TextLoader
# 1. Path to the corpus folder
corpus_path = "/content/drive/My Drive/RAG ASSIGNMENT/rag_legal/corpus"

# 2. Create a loader for all .txt files under corpus/*
loader = DirectoryLoader(
    corpus_path,
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},  # passed through to TextLoader for every file
)

# 3. Load documents
docs = loader.load()

# 4. Quick sanity check
print(f"Loaded {len(docs)} documents from '{corpus_path}'")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
Loaded 698 documents from '/content/drive/My Drive/RAG ASSIGNMENT/rag_legal/corpus'


#### **1.2.2** <font color=red> [2 marks] </font>
Preprocess the text data to remove noise and prepare it for analysis.

- Remove special characters, extra whitespace, and irrelevant content such as email addresses and telephone numbers.
- Normalise text (e.g., convert to lowercase, remove stop words).
- Handle missing or corrupted data by logging errors and skipping problematic files.

In [None]:
# Clean and preprocess the data
stop_words = set(stopwords.words('english'))

def textpreprocessing(txt):
    txt = txt.lower()

    # remove email and phone number
    txt = re.sub(r'\b[\w\.-]+@[\w\.-]+\.\w{2,4}\b', '', txt)  # Emails
    txt = re.sub(r'\+?\d[\d\s().-]{7,}', '', txt)  # Phone numbers
    txt = re.sub(r'^\s*\d+(?:\.\d+)*\.\s+', '', txt, flags=re.MULTILINE) # removing bullet headers and numbers
    # Remove special characters
    txt = re.sub(r'[^a-z0-9\s]', ' ', txt)

    # Remove additional whitespace
    txt = re.sub(r'\s+', ' ', txt).strip()

    # Remove stopwords
    txt = ' '.join([word for word in txt.split() if word not in stop_words])

    return txt

clean_documents = []

for i, doc in enumerate(docs):
    try:
        # Ensure the document has content and that it is a string
        if not doc.page_content or not isinstance(doc.page_content, str):
            raise ValueError("Missing or invalid content")

        # Preprocess the document content using the textpreprocessing function defined above
        clean_txt = textpreprocessing(doc.page_content)

        # Creating new doc with cleaned text
        clean_doc = Document(page_content=clean_txt, metadata=doc.metadata)
        clean_documents.append(clean_doc)

    except Exception as e:
        # Report the error and skip this document
        print(f"skipping document {i} ({doc.metadata.get('source', 'unknown')}): {e}")
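
The task asks for errors to be logged, whereas the cell above prints them. A minimal sketch of the same skip-on-failure loop using the standard `logging` module is shown below. The helper `clean_all`, the logger name, and the toy inputs are illustrative, and plain `(source, text)` pairs stand in for LangChain `Document` objects.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("preprocessing")  # illustrative logger name

def clean_all(raw_items, clean_fn):
    """Apply clean_fn to each (source, text) pair, logging and skipping failures."""
    cleaned, skipped = [], []
    for source, text in raw_items:
        try:
            if not text or not isinstance(text, str):
                raise ValueError("Missing or invalid content")
            cleaned.append((source, clean_fn(text)))
        except Exception as error:
            # Record the problem and carry on with the remaining files.
            logger.warning("Skipping %s: %s", source, error)
            skipped.append(source)
    return cleaned, skipped

# Toy inputs: one valid text, one missing value, one empty string.
items = [("a.txt", "Some  TEXT"), ("b.txt", None), ("c.txt", "")]
cleaned, skipped = clean_all(items, lambda t: " ".join(t.lower().split()))
print(cleaned)
print(skipped)
```

Returning the skipped sources as well as logging them gives a simple count of how many files were dropped during cleaning.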

### **1.3 Exploratory Data Analysis** <font color=red> [10 marks] </font><br>

#### **1.3.1** <font color=red> [1 mark] </font>
Calculate the average, maximum and minimum document length.

In [None]:
# Calculate the average, maximum and minimum document length.

# 1. Compute word counts
word_counts = [len(doc.page_content.split()) for doc in clean_documents]

# 2. Calculate summary statistics
avg_length = sum(word_counts) / len(word_counts)
max_length = max(word_counts)
min_length = min(word_counts)

# 3. Display the results
print(f"Average document length (in words): {avg_length:.2f}")
print(f"Maximum document length (in words): {max_length}")
print(f"Minimum document length (in words): {min_length}")

Average document length (in words): 9161.20
Maximum document length (in words): 86518
Minimum document length (in words): 146


#### **1.3.2** <font color=red> [4 marks] </font>
Analyse the frequency of occurrence of words and find the most and least occurring words.

Find the 20 most common and least common words in the text. Ignore stop words such as articles and prepositions.

In [None]:
# Find frequency of occurrence of words

from collections import Counter

# 1. Combine all cleaned document texts
all_clean_text = " ".join(doc.page_content for doc in clean_documents)

# 2. Tokenize (whitespace split, since already cleaned)
tokens = all_clean_text.split()

# 3. Count word frequencies
word_freq = Counter(tokens)

# 4. Top 10 most common
print("Top 10 most common words:")
for word, count in word_freq.most_common(10):
    print(f"{word}: {count}")

# 5. 10 least common words (appearing more than once)
least_common = [(w, c) for w, c in word_freq.items() if c > 1]
least_common = sorted(least_common, key=lambda x: x[1])[:10]

print("\n10 least common words (count > 1):")
for word, count in least_common:
    print(f"{word}: {count}")

Top 10 most common words:
company: 156422
shall: 108015
agreement: 104655
section: 75412
parent: 60715
party: 54208
date: 39392
time: 35827
1: 35299
material: 34242

10 least common words (count > 1):
vibes: 2
bots: 2
youll: 2
vk: 2
birthday: 2
delink: 2
vibers: 2
includingpersonal: 2
handlescookiesby: 2
ourvendorson: 2


#### **1.3.3** <font color=red> [4 marks] </font>
Analyse the similarity of different documents to each other based on TF-IDF vectors.

Transform some documents to TF-IDF vectors and calculate their similarity matrix using a suitable distance function. If contracts contain duplicate or highly similar clauses, similarity calculation can help detect them.

Compute the similarity matrix for the first 10 documents and then for 10 random documents. What do you observe?

In [None]:
# Transform the page contents of documents

# Compute similarity scores

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# 1. Extract text from cleaned documents
doc_texts = [doc.page_content for doc in clean_documents]

# 2. Vectorize using TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(doc_texts)

# 3. Compute cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)

# 4. Convert to DataFrame for better readability
similarity_df = pd.DataFrame(similarity_matrix)

# 5. Display similarity scores between first 5 documents
print("Cosine similarity between first 5 documents:")
print(similarity_df.iloc[:5, :5])


Cosine similarity between first 5 documents:
          0         1         2         3         4
0  1.000000  0.188169  0.202821  0.321436  0.256401
1  0.188169  1.000000  0.197202  0.292656  0.289198
2  0.202821  0.197202  1.000000  0.323687  0.330197
3  0.321436  0.292656  0.323687  1.000000  0.412098
4  0.256401  0.289198  0.330197  0.412098  1.000000


In [None]:
# create a list of 10 random integers
random_indices = random.sample(range(len(clean_documents)), 10)
# Display the randomly selected documents
for i in random_indices:
    print(f"Document {i} (source: {clean_documents[i].metadata.get('source', 'unknown')}):")
    print(clean_documents[i].page_content[:500])  # Display first 500 characters
    print("\n" + "="*80 + "\n")  # Separator for readability

Document 180 (source: /content/drive/My Drive/RAG ASSIGNMENT/rag_legal/corpus/contractnli/NCDG_Non-disclosure-agreement.txt):
non disclosure agreement agreement made day effective date hog centre data governance nic referred ncdg shall unless exclude repugnant context deemed include successor office assigned first part head user department referred user departmnet expression shall unless exclude repugnant context deemed include successor office assigned second part user department ncdg hereinafter collectively referred parties individually referred party per context background parties evaluating discussing negotiating 


Document 84 (source: /content/drive/My Drive/RAG ASSIGNMENT/rag_legal/corpus/maud/Community Bankers Trust Corporation_United Bankshares, Inc..txt):
exhibit 2 1 execution version agreement plan reorganization dated june 2 united bankshares inc community bankers trust corporation table contents page article certain definitions certain definitions 1 article ii merger merg

In [None]:
# Compute similarity scores for 10 random documents
random_docs = [clean_documents[i].page_content for i in random_indices]
tfidf_matrix_random = vectorizer.transform(random_docs)

# Compute cosine similarity for the random documents
similarity_matrix_random = cosine_similarity(tfidf_matrix_random)

# Convert to DataFrame for better readability
similarity_df_random = pd.DataFrame(similarity_matrix_random)

# Display similarity scores for the random documents
print("Cosine similarity for randomly selected documents:")
print(similarity_df_random)

Cosine similarity for randomly selected documents:
          0         1         2         3         4         5         6  \
0  1.000000  0.078942  0.060879  0.221294  0.109935  0.129588  0.199198   
1  0.078942  1.000000  0.031437  0.186991  0.016371  0.048948  0.059972   
2  0.060879  0.031437  1.000000  0.092848  0.079444  0.033496  0.056050   
3  0.221294  0.186991  0.092848  1.000000  0.047711  0.129646  0.161794   
4  0.109935  0.016371  0.079444  0.047711  1.000000  0.025215  0.034241   
5  0.129588  0.048948  0.033496  0.129646  0.025215  1.000000  0.187467   
6  0.199198  0.059972  0.056050  0.161794  0.034241  0.187467  1.000000   
7  0.189404  0.187027  0.079385  0.899105  0.039134  0.118144  0.132583   
8  0.129866  0.036068  0.024901  0.168885  0.024211  0.077640  0.081886   
9  0.220786  0.185724  0.101277  0.815137  0.048411  0.148343  0.180169   

          7         8         9  
0  0.189404  0.129866  0.220786  
1  0.187027  0.036068  0.185724  
2  0.079385  0.024901
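
In the random sample above, most pairs score well below 0.25, but document 3 scores 0.899 against document 7 and 0.815 against document 9, which suggests near-duplicate or templated agreements. A short helper can list such pairs automatically. The name `similar_pairs`, the 0.8 threshold, and the toy matrix are illustrative choices.

```python
def similar_pairs(matrix, threshold=0.8):
    """Return (i, j, score) for every pair i < j with score >= threshold, highest first."""
    pairs = []
    for i, row in enumerate(matrix):
        # Only look above the diagonal so each pair is reported once.
        for j in range(i + 1, len(row)):
            if row[j] >= threshold:
                pairs.append((i, j, row[j]))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)

# Toy symmetric similarity matrix with made-up values.
toy = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
print(similar_pairs(toy))
```

Applied to the full similarity matrix, the reported indices can be mapped back to file names through each document's `metadata["source"]`.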

### **1.4 Document Creation and Chunking** <font color=red> [5 marks] </font><br>

#### **1.4.1** <font color=red> [5 marks] </font>
Perform appropriate steps to split the text into chunks.

In [None]:
# Process files and generate chunks

from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200)

# 2. Split the documents into chunks
documents = text_splitter.split_documents(docs)

# 3. Display the number of chunks created
print(f"Number of chunks created: {len(documents)}")

Number of chunks created: 91792
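
To illustrate what `chunk_size=1200` and `chunk_overlap=200` mean, the sketch below cuts a string into fixed windows that advance by 1000 characters, so that consecutive chunks share 200 characters. This is a simplified sliding window, not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line breaks. The function name is illustrative.

```python
def sliding_chunks(text, chunk_size=1200, chunk_overlap=200):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# Toy text of 3000 characters made of repeating digits.
text = "".join(str(i % 10) for i in range(3000))
chunks = sliding_chunks(text)
print([len(chunk) for chunk in chunks])
print(chunks[0][-200:] == chunks[1][:200])
```

The overlap means a clause that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.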


## **2. Vector Database and RAG Chain Creation** <font color=red> [15 marks] </font><br>

### **2.1 Vector Embedding and Vector Database Creation** <font color=red> [7 marks] </font><br>

#### **2.1.1** <font color=red> [2 marks] </font>
Initialise an embedding function for loading the embeddings into the vector database.

Initialise a function to transform the text to vectors using the OpenAI Embeddings module. You can also use this function to transform the text during vector DB creation itself.

In [None]:
# Fetch your OPENAI API Key as an environment variable
# We have used GEMINI_API_KEY here instead of OPENAI API KEY

import os
from dotenv import load_dotenv

load_dotenv()
gemini_api_key = os.getenv("GEMINI_API_KEY")

In [None]:
# Initialise an embedding function

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


#### **2.1.2** <font color=red> [5 marks] </font>
Load the embeddings to a vector database.

Create a directory for the vector database and load the embedding data into it.

In [None]:
# Add Chunks to vector DB
from langchain.vectorstores import FAISS

db = FAISS.from_documents(documents, embeddings)

print(f"Number of documents in the vector database: {db.index.ntotal}")

# from chromadb.config import Settings

# vectorstore = Chroma(
# collection_name="langchain_store",
# embedding_function=embeddings,
# client_settings=Settings(anonymized_telemetry=False),
# persist_directory="./dist/vectordb",
# )

Number of documents in the vector database: 91792
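
Conceptually, a similarity search over the index compares the query vector with every stored vector and returns the closest ones. The sketch below does this by brute force on toy two-dimensional vectors. It assumes Euclidean distance, which is believed to be the default for LangChain's FAISS wrapper but should be checked against the installed version. The function name and the vectors are illustrative.

```python
import math

def top_k(query, vectors, k=4):
    """Return the indices of the k vectors closest to query by Euclidean distance."""
    distances = [(math.dist(query, vector), index) for index, vector in enumerate(vectors)]
    return [index for _, index in sorted(distances)[:k]]

# Toy 2-D "embeddings"; real embeddings from all-MiniLM-L6-v2 have many more dimensions.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
query = [0.9, 0.1]
print(top_k(query, stored, k=2))
```

A real index avoids rebuilding anything per query, but the result for a flat index is the same kind of ranked list of nearest chunks.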


### **2.2 Create RAG Chain** <font color=red> [8 marks] </font><br>

#### **2.2.1** <font color=red> [5 marks] </font>
Create a RAG chain.

In [None]:
# Create a RAG chain
from langchain.chains import RetrievalQA
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=gemini_api_key, temperature=0)

retriever = db.as_retriever(search_kwargs={"k": 4})

rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
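
With `chain_type="stuff"`, all retrieved chunks are placed into a single prompt together with the question. The sketch below shows the idea with plain strings. The prompt wording and the function name `build_stuff_prompt` are illustrative and are not LangChain's actual template.

```python
def build_stuff_prompt(question, contexts):
    """Concatenate all retrieved chunks and the question into one prompt string."""
    joined = "\n\n".join(contexts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{joined}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Toy chunks standing in for the k retrieved documents.
contexts = [
    "The Receiving Party shall keep the information confidential.",
    "This Agreement is governed by the laws of the State.",
]
print(build_stuff_prompt("Who must keep the information confidential?", contexts))
```

With `k=4` and chunks of up to 1200 characters, the stuffed context stays below roughly 4800 characters plus separators, which keeps the prompt small.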

#### **2.2.2** <font color=red> [3 marks] </font>
Create a function that generates an answer to a given question.

Use the RAG chain to generate the answer and return the source documents alongside it.

In [None]:
# Create a function for question answering
def answer_question(question):
    """
    Answer a question using the RAG chain.

    Args:
        question (str): The question to answer.

    Returns:
        tuple: (answer, source_docs), where answer is the generated text (str)
        and source_docs is the list of retrieved Document objects.
    """
    result = rag_chain.invoke({"query": question})
    answer = result['result']
    source_docs = result['source_documents']

    return answer, source_docs

In [None]:
# Example question
question ="Consider the Non-Disclosure Agreement between CopAcc and ToP Mentors; Does the document indicate that the Agreement does not grant the Receiving Party any rights to the Confidential Information?"

# Run the chain once and reuse both results, rather than calling the LLM twice
answer, source_docs = answer_question(question)
print(answer, "\n\n", source_docs)

The provided text states that the Receiving Party may only use the Confidential Information for purposes outlined in the agreement and may not disclose it to third parties except under specific conditions (employees, affiliates, agents, etc. with a need to know and bound by confidentiality).  However, it does *not* explicitly state that the agreement grants no rights to the Confidential Information.  The agreement restricts use and disclosure, but doesn't negate the possibility of other implied or explicitly granted rights. 

 [Document(id='d73495c3-1a31-4a29-9e61-e6e4055f588a', metadata={'source': '/content/drive/My Drive/RAG ASSIGNMENT/rag_legal/corpus/contractnli/simply-fashion---standard-nda.txt'}, page_content='CONFIDENTIALITY AND NON-DISCLOSURE AGREEMENT'), Document(id='670b54c5-3137-45df-bbe9-d87559c79212', metadata={'source': '/content/drive/My Drive/RAG ASSIGNMENT/rag_legal/corpus/cuad/REGANHOLDINGCORP_03_31_2008-EX-10-LICENSE AND HOSTING AGREEMENT.txt'}, page_content="Disclos
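
The raw list of `Document` objects is hard to read. In the output above, the first retrieved chunk is only the title line of `simply-fashion---standard-nda.txt` and the second comes from a CUAD licence agreement, which is easier to spot in a compact listing. A minimal sketch of such a listing follows. The helper `format_sources` is hypothetical, and `SimpleNamespace` objects stand in for LangChain documents.

```python
import os
from types import SimpleNamespace

def format_sources(source_docs, width=80):
    """Return one line per retrieved document: rank, file name and a short snippet."""
    lines = []
    for rank, doc in enumerate(source_docs, start=1):
        name = os.path.basename(doc.metadata.get("source", "unknown"))
        snippet = " ".join(doc.page_content.split())[:width]  # collapse whitespace
        lines.append(f"{rank}. {name}: {snippet}")
    return "\n".join(lines)

# Stand-in object with the same two attributes as a LangChain Document.
example_docs = [
    SimpleNamespace(
        page_content="CONFIDENTIALITY AND\nNON-DISCLOSURE AGREEMENT",
        metadata={"source": "/corpus/contractnli/example-nda.txt"},
    )
]
print(format_sources(example_docs))
```

Checking the file names against the document named in the question is a quick manual test of whether retrieval found the right agreement.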

## **3. RAG Evaluation** <font color=red> [10 marks] </font><br>

### **3.1 Evaluation and Inference** <font color=red> [10 marks] </font><br>

#### **3.1.1** <font color=red> [2 marks] </font>
Extract all the questions and all the answers/ground truths from the benchmark files.

Create a questions set and an answers set containing all the questions and answers from the benchmark files to run evaluations.

In [None]:
# Create a question set by taking all the questions from the benchmark data
# Also create a ground truth/answer set

import os
import json
import glob

# Define the path to your benchmark directory
benchmark_path = "/content/drive/My Drive/RAG ASSIGNMENT/rag_legal/benchmarks"

# Find all JSON files in the directory
benchmark_files = glob.glob(os.path.join(benchmark_path, "*.json"))

# Initialize lists to store the data
questions = []
ground_truths = []

# Loop through each benchmark file and extract data
for file_path in benchmark_files:
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
            # Iterate over each test case in the file
            for test in data.get("tests", []):
                # Skip tests without a query so that questions and ground truths stay aligned
                if "query" not in test:
                    continue
                questions.append(test["query"])

                # Combine all answer snippets into a single ground truth string
                combined_answer = " ".join(snippet.get("answer", "") for snippet in test.get("snippets", []))
                ground_truths.append(combined_answer)

    except json.JSONDecodeError:
        print(f"Error decoding JSON from file: {file_path}")
    except Exception as e:
        print(f"An error occurred while processing {file_path}: {e}")

# --- Sanity Check ---
print(f"Successfully extracted {len(questions)} questions.")
print(f"Successfully extracted {len(ground_truths)} ground truth answers.")

if questions:
    print("\nSample Question:", questions[0])
    print("\nSample Ground Truth:", ground_truths[0])



Successfully extracted 6889 questions.
Successfully extracted 6889 ground truth answers.

Sample Question: Consider "Fiverr"'s privacy policy; who can see which tasks i hire workers for?

Sample Ground Truth:   In addition, we collect information while you access, browse, view or otherwise use the Site.
In other words, when you access the Site we are aware of your usage of the Site, and may gather, collect and record the information relating to such usage, including geo-location information, IP address, device and connection information, browser information and web-log information, and all communications recorded by Users through the Site.



#### **3.1.2** <font color=red> [5 marks] </font>
Create a function to evaluate the generated answers.

Evaluate the responses on *Rouge*, *Ragas* and *Bleu* scores.

In [None]:
# Function to evaluate the RAG pipeline

import evaluate
from ragas import evaluate as ragas_evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

# Load standard metrics from the 'evaluate' library
rouge_metric = evaluate.load('rouge')
bleu_metric = evaluate.load('bleu')

def evaluate_rag_pipeline(generated_answers, ground_truths, questions_list, contexts_list):
    """
    Evaluates the RAG pipeline using ROUGE, BLEU, and a comprehensive set of Ragas metrics.
    """
    # 1. ROUGE and BLEU Evaluation
    print("Calculating ROUGE and BLEU scores...")
    rouge_results = rouge_metric.compute(predictions=generated_answers, references=ground_truths)
    bleu_results = bleu_metric.compute(predictions=generated_answers, references=[[gt] for gt in ground_truths])

    # 2. Ragas Evaluation
    print("Preparing data for Ragas evaluation...")
    ragas_data = {
        "question": questions_list,
        "answer": generated_answers,
        "contexts": [[doc.page_content for doc in context_docs] for context_docs in contexts_list],
        "ground_truth": ground_truths
    }
    ragas_dataset = Dataset.from_dict(ragas_data)

    print("Running Ragas evaluation... (This might take a while)")
    # We explicitly pass the LLM and embeddings defined earlier in the notebook
    # to ensure Ragas uses the same models for evaluation.
    ragas_results = ragas_evaluate(
        ragas_dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_recall,
            context_precision,
        ],
        llm=llm,
        embeddings=embeddings,
    )

    # Combine all results into a single dictionary
    final_results = {
        "rouge": rouge_results,
        "bleu": bleu_results,
        "ragas": ragas_results
    }

    return final_results
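
To make the ROUGE-L score easier to interpret, the sketch below computes it by hand as the F1 of the longest common subsequence (LCS) between prediction and reference tokens. It is a simplified illustration using lowercased whitespace tokens, so it will not match the `rouge_score` package exactly. The function names and sentences are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def rouge_l_f1(prediction, reference):
    """F1 of LCS-based precision and recall over lowercased whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the receiving party shall not disclose confidential information"
paraphrase = "the recipient must keep information secret"
print(round(rouge_l_f1(paraphrase, reference), 4))
print(rouge_l_f1(reference, reference))
```

A paraphrase that preserves the meaning but changes the wording shares few tokens in order with the reference, so its score is low even when the answer is acceptable.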


#### **3.1.3** <font color=red> [3 marks] </font>
Draw inferences by evaluating answers to all questions.

To save time and computing power, you can run the evaluation on just the first 100 questions.

In [None]:
# Evaluate the RAG pipeline
from tqdm import tqdm
import pandas as pd

# First, check if questions were loaded successfully
if not questions:
    print("--- EVALUATION SKIPPED ---")
    print("No questions were found. Please fix the data extraction cell (3.1.1) first.")
else:
    # Set the number of questions to evaluate (as suggested, 100 is a good start)
    num_questions_to_evaluate = 100
    if len(questions) < num_questions_to_evaluate:
        print(f"Warning: Only {len(questions)} questions available. Evaluating on all of them.")
        num_questions_to_evaluate = len(questions)

    eval_questions = questions[:num_questions_to_evaluate]
    eval_ground_truths = ground_truths[:num_questions_to_evaluate]

    # Store results from the RAG chain
    generated_answers = []
    retrieved_contexts = []

    print(f"\nGenerating answers for {num_questions_to_evaluate} questions using the RAG pipeline...")
    # Loop with a progress bar
    for question in tqdm(eval_questions):
        try:
            answer, source_docs = answer_question(question)
            generated_answers.append(answer)
            retrieved_contexts.append(source_docs)
        except Exception as e:
            print(f"Error answering question: '{question}'. Error: {e}")
            generated_answers.append("ERROR - GENERATION FAILED")
            retrieved_contexts.append([])

    if len(generated_answers) == len(eval_questions):
        print("\nGeneration complete. Starting evaluation...")
        evaluation_results = evaluate_rag_pipeline(
            generated_answers=generated_answers,
            ground_truths=eval_ground_truths,
            questions_list=eval_questions,
            contexts_list=retrieved_contexts
        )

        # --- Display All Results ---
        print("\n\n--- RAG PIPELINE EVALUATION: FINAL REPORT ---")

        print("\n1. Traditional Metrics:")
        print(f"   - ROUGE-L Score: {evaluation_results['rouge']['rougeL']:.4f}")
        print(f"   - BLEU Score:    {evaluation_results['bleu']['bleu']:.4f}")

        print("\n2. Ragas Metric Averages:")
        ragas_df = evaluation_results['ragas'].to_pandas()
        print(f"   - Faithfulness:      {ragas_df['faithfulness'].mean():.4f}")
        print(f"   - Answer Relevancy:  {ragas_df['answer_relevancy'].mean():.4f}")
        print(f"   - Context Recall:    {ragas_df['context_recall'].mean():.4f}")
        print(f"   - Context Precision: {ragas_df['context_precision'].mean():.4f}")

    else:
        print("\nEvaluation aborted due to an error during answer generation.")


Generating answers for 100 questions using the RAG pipeline...


100%|██████████| 100/100 [01:30<00:00,  1.11it/s]



Generation complete. Starting evaluation...
Calculating ROUGE and BLEU scores...
Preparing data for Ragas evaluation...
Running Ragas evaluation... (This might take a while)


Evaluating:   0%|          | 0/400 [00:00<?, ?it/s]



--- RAG PIPELINE EVALUATION: FINAL REPORT ---

1. Traditional Metrics:
   - ROUGE-L Score: 0.1670
   - BLEU Score:    0.0254

2. Ragas Metric Averages:
   - Faithfulness:      0.8542
   - Answer Relevancy:  0.4720
   - Context Recall:    0.5678
   - Context Precision: 0.4428
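
The run above takes the first 100 questions in file order, so they may all come from a single benchmark file. A seeded random sample is one way to cover all four benchmarks while keeping the run reproducible. The helper name `sample_eval_pairs`, the seed, and the toy data are illustrative.

```python
import random

def sample_eval_pairs(questions, ground_truths, k=100, seed=42):
    """Draw k aligned (question, ground truth) pairs using a fixed random seed."""
    if len(questions) != len(ground_truths):
        raise ValueError("questions and ground_truths must be aligned")
    k = min(k, len(questions))
    # Sample indices once so both lists are selected identically.
    indices = sorted(random.Random(seed).sample(range(len(questions)), k))
    return [questions[i] for i in indices], [ground_truths[i] for i in indices]

# Toy data standing in for the extracted benchmark lists.
toy_questions = [f"q{i}" for i in range(500)]
toy_answers = [f"a{i}" for i in range(500)]
sampled_q, sampled_a = sample_eval_pairs(toy_questions, toy_answers, k=100)
print(len(sampled_q), sampled_q[:3], sampled_a[:3])
```

Scores from a sampled subset are not directly comparable with the figures reported above, which were computed on the first 100 questions.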


## **4. Conclusion** <font color=red> [5 marks] </font><br>

### **4.1 Conclusions and insights** <font color=red> [5 marks] </font><br>

#### **4.1.1** <font color=red> [5 marks] </font>
Conclude with the results here. Include the insights gained about the data, model pipeline, the RAG process and the results obtained.

This project implemented and evaluated a Retrieval-Augmented Generation (RAG) pipeline for answering questions over a corpus of 698 legal documents. The workflow covered data loading, preprocessing, exploratory data analysis, chunking, vector indexing, and answer generation. The main tools were LangChain, FAISS for vector indexing, the sentence-transformers/all-MiniLM-L6-v2 embedding model, and Google's gemini-1.5-flash model for response generation. The evaluation on 100 benchmark questions showed where the system works and where it needs refinement.

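**Data Insights**
* Legal texts were cleaned for exploratory analysis; the benchmark files supply the questions, answers, and source snippets.
* The cleaned documents range from 146 to 86,518 words, with an average of about 9,161 words, so chunking is needed before retrieval.
* TF-IDF cosine similarity is low for most of the document pairs examined, but a few pairs in the random sample score above 0.8, which points to near-duplicate or templated agreements.
* Many queries required multi-sentence comprehension, which supports the choice of a RAG approach.

**Pipeline Summary**
* The documents were split into 91,792 chunks (chunk size 1200, overlap 200), embedded, and indexed in FAISS.
* Vector-based retrieval of the top 4 chunks was combined with a generative language model (gemini-1.5-flash) to produce answers.
* Performance was tested on the first 100 benchmark queries and evaluated with ROUGE, BLEU, and Ragas metrics.

**Evaluation Outcomes**
* ROUGE-L (0.1670) and BLEU (0.0254) were low, most likely because the generated answers paraphrase the ground-truth snippets rather than reproduce them.
* Ragas analysis:
  * Faithfulness was high (0.8542), so answers mostly stayed within the retrieved context.
  * Answer relevancy was moderate (0.4720).
  * Context recall (0.5678) and context precision (0.4428) were moderate, so retrieval is the clearest area for improvement.

**Major Takeaways**
* BLEU is of limited use for assessing generative QA tasks.
* Ragas offered deeper insight because it scores the retrieval and generation layers separately.
* Chunk size and overlap are likely levers for retrieval quality; only one setting (1200/200) was tested here, so tuning them remains future work.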