**EXECUTIVE SUMMARY**

You are reviewing the source code for a web-based prototype tool designed to provide answers to user prompts with domain-specific contextual awareness.

This program is a Retrieval-Augmented Generation (RAG) framework that facilitates intelligent document retrieval and natural language processing for large datasets. The framework supports users in querying specific applications, such as USA Spending and the Government Accountability Office (GAO), leveraging machine learning models (like Sentence Transformers) for document embeddings and ChromaDB for vector storage.

The program can answer user prompts in **Batch Mode** as well as in **Online Mode**. In Batch mode, someone well versed in Python and familiar with the notebook prompts the application from the code cells to generate a contextually relevant response. Online mode is intended for non-technical users.

Main Features:

1. Program Execution Time Tracking: Records the start time so that execution duration can be monitored.
2. Library and Package Imports: Loads essential libraries for data handling, text processing, and machine learning.
3. RAG Setup:
   - Allows user selection between various large language models (LLMs) and applications.
   - Sets up the directory for raw document storage, imports documents, and structures text into chunks for embedding.
   - Embeds the chunks using selected models and stores them in a persistent vector database (ChromaDB).
4. Document Retrieval:
   - Utilizes contextual retrievers and rerankers, which prioritize the most relevant chunks based on cosine similarity with user queries.
   - Offers optional enhanced contextual reranking for improved response relevance.
5. Query Embedding and Response Augmentation: Generates embeddings for user queries, retrieves relevant chunks, and formats responses augmented by context.
6. Interactive Gradio Interface: Provides a user-friendly UI with options for selecting application, LLM model, and specific prompts. Also includes a ‘Chart It Up’ button to visualize responses when applicable.
7. Response Customization: Allows users to specify reranking, prompt emphasis, and unaugmented responses for testing purposes.

The program is designed for efficient querying and response generation by integrating document retrieval with LLMs, ideal for exploring large datasets and gaining insights into specific federal or organizational data points.

Capture the **Program Execution Start Time**, which is used later to calculate the overall execution time.

In [1]:
import time
def print_progress(message):
    print(f"\n{message}\n{'=' * 50}\n")
overall_start_time = time.time()
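The recorded start time only becomes useful once it is subtracted from a later timestamp. A minimal sketch of that step, assuming a helper named `format_elapsed` (hypothetical, not part of the notebook) that renders a duration as HH:MM:SS:

```python
import time

def format_elapsed(elapsed_seconds):
    """Render a duration in seconds as HH:MM:SS.ss (hypothetical helper)."""
    hours, rem = divmod(elapsed_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    # 05.2f zero-pads the seconds field so it always shows two integer digits.
    return f"{int(hours):02d}:{int(minutes):02d}:{seconds:05.2f}"

sketch_start = time.time()
# ... the work being timed would run here ...
print(f"Total execution time: {format_elapsed(time.time() - sketch_start)}")
```

In the notebook itself, the same call could be made in the final cell with `time.time() - overall_start_time`.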

**LIBRARIES** and **PACKAGES**

In [2]:
import os
import tempfile
import re
import glob
import pandas as pd # type: ignore
import numpy as np # type: ignore
from nltk.tokenize import word_tokenize # type: ignore
from nltk.corpus import stopwords # type: ignore
from collections import Counter
import string
import nltk # type: ignore
from tqdm import tqdm # type: ignore
from concurrent.futures import ThreadPoolExecutor
import chromadb # type: ignore
from langchain_community.llms import Ollama  # type: ignore
from langchain_community.document_loaders import DirectoryLoader  # type: ignore
from langchain.text_splitter import RecursiveCharacterTextSplitter  # type: ignore
#from langchain_community.embeddings import OllamaEmbeddings  # type: ignore
from sentence_transformers import SentenceTransformer # type: ignore
from langchain_community.vectorstores import Chroma  # type: ignore
from langchain.chains import create_retrieval_chain  # type: ignore
from langchain.chains.combine_documents import create_stuff_documents_chain # type: ignore
from langchain import hub  # type: ignore
#import uuid
#import shutil
#import tempfile
from numpy import dot # type: ignore
from numpy.linalg import norm # type: ignore
from datetime import datetime
from langchain.schema import Document # type: ignore
import gradio as gr # type: ignore
from gradio.themes.base import Base # type: ignore
from gradio.themes.default import Default # type: ignore
from gradio.themes.monochrome import Monochrome # type: ignore
from gradio.themes.glass import Glass # type: ignore
from langchain.retrievers import ContextualCompressionRetriever # type: ignore
from langchain.retrievers.document_compressors import CrossEncoderReranker # type: ignore
from langchain_community.cross_encoders import HuggingFaceCrossEncoder # type: ignore
import matplotlib.pyplot as plt # type: ignore


**OPERATIONAL PARAMETERS**

In [3]:
# If you change the location of the application codebook, Notebook-LLM-RAG-Contracts.ipynb, or adjust the directory structure of the application repository, Project_LLM_and_RAG_2024_GWU, you will need to update the following directory paths accordingly.

before_chunk_dir = "/Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Before-Chunking/"

embed_download_dir = "/Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Embedding_Downloads_CSV/"

logos_dir = "/Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Logos/"

transposed_embed_dir = "/Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Transposed_Embeddings_CSV/"

vector_dir = "/Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Vector_DB_Embeddings/"
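Since all five paths share one root, they could be derived from a single base directory so that only one value changes when the repository moves. The sketch below is one possible approach, not the notebook's own method; the helper `build_project_dirs` and the environment variable `RAG_BASE_DIR` are hypothetical names introduced for illustration.

```python
import os

# Sub-folder names taken from the paths in the cell above.
SUBFOLDERS = {
    "before_chunk_dir": "Before-Chunking",
    "embed_download_dir": "Embedding_Downloads_CSV",
    "logos_dir": "Logos",
    "transposed_embed_dir": "Transposed_Embeddings_CSV",
    "vector_dir": "Vector_DB_Embeddings",
}

def build_project_dirs(base_dir):
    """Return every project directory under base_dir (hypothetical helper)."""
    # Joining with "" appends the trailing separator that later f-strings rely on.
    return {name: os.path.join(base_dir, sub, "") for name, sub in SUBFOLDERS.items()}

# RAG_BASE_DIR is an assumed variable name; the current directory is the fallback.
dirs = build_project_dirs(os.environ.get("RAG_BASE_DIR", os.getcwd()))
print(dirs["before_chunk_dir"])
```

The trailing separator is kept deliberately, because later cells build paths such as `f"{before_chunk_dir}Farm_Bill_DOWNLOAD_..."` by plain string concatenation.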

**LLM CHOICE**

In [4]:
#chosen_model = "llama3.1"
chosen_model = "llama3.2"

**APPLICATION DOMAIN**

Set the application domain (USSP, GAO, or any future domain that you may implement) as required. Going forward, if you are implementing a new application domain, a new application code needs to be appended to the existing list.

In [5]:
#application = "USSP" # Use case: The internal teams at FI Consulting are focused on analyzing the competitive landscape of institutions awarded federal contracts. To deepen their understanding, they are starting with the contracts awarded by the U.S. Department of Agriculture. To support this analysis, FI Consulting has developed a RAG-LLM application that enables users to submit domain-specific prompts and receive semantically relevant, contextually informed responses from the LLM model. Source of unstructured text: https://api.usaspending.gov/api/v2/agency/012/?fiscal_year=2024

#application = "USSP_Light" # Use case: The internal teams at FI Consulting are focused on analyzing the competitive landscape of institutions awarded federal contracts. To deepen their understanding, they are starting with the contracts awarded by the U.S. Department of Agriculture. More specifically, they have decided to review key legislative texts that have been enacted into law. Among these is the U.S. “Infrastructure Investment and Jobs Act” (Public Law 117-58), signed in 2021, which provides comprehensive details on funding allocations and provisions for various infrastructure initiatives. To support this analysis, FI Consulting has developed a RAG-LLM application that enables users to submit domain-specific prompts and receive semantically relevant, contextually informed responses from the LLM model. Source of unstructured text: http://www.govinfo.gov/content/pkg/PLAW-117publ58/pdf/PLAW-117publ58.pdf

application = "Farm_Bill" # Use case: The internal teams at FI Consulting are conducting a comparative analysis of the 2024 Farm Bill (H.R. 8467) and the Agriculture Improvement Act of 2018 (Public Law 115-334). Their objective is to examine major changes and assess how these differences may influence future contracts awarded by the U.S. Department of Agriculture. The RAG-LLM application developed by FI Consulting utilizes unstructured text from the Congressional Research Service to build context and provide informed responses to prompts submitted by FI staff. Source of unstructured text: https://crsreports.congress.gov/product/pdf/R/R48167

# Government Accountability Office (GAO)
#application = "GAO" # Use case: FI Consulting wants to gain insight into the specific needs of colonias, economically disadvantaged communities along the U.S.-Mexico border, primarily inhabited by Hispanic populations. The report, prepared by the U.S. Government Accountability Office (GAO), reviews the economic, infrastructure, and environmental challenges faced by colonias and evaluates the effectiveness and limitations of assistance programs provided by the Department of Housing and Urban Development (HUD) and the U.S. Department of Agriculture (USDA). FI wants to study the impact of federal programs and the proposed recommendations for improving assistance delivery and policy adjustments to better serve these communities. This will enable them to estimate how federal contracts could be awarded in the near future. Source of unstructured text: https://www.gao.gov/assets/gao-24-106732.pdf

**DOCUMENT DIRECTORY**

Do you have new unstructured text that needs to be embedded? Upload the text as .TXT files to your working directory and set the appropriate path. If you are using extraction programs to parse online PDFs and generate text files, these files should already be in the required directory. Simply ensure that these folders are included in your working directory for this program.

If you are working within the same application domains used during the pilot release of this prototype tool, you may continue using the directories specified in the following code cell. However, based on your application domain, you may need to manually set the correct document directory.

In [6]:
# Set the raw directory for USA Spending.Gov
#raw_doc_directory = f"{before_chunk_dir}USSP_DOWNLOAD_20241031_015055"

# Set the raw directory for USA Spending.Gov (Light weight app)
#raw_doc_directory = f"{before_chunk_dir}USSP_Light_DOWNLOAD_20241030_051135"

# Set the raw directory for Government Accountability Office (GAO)
#raw_doc_directory = f"{before_chunk_dir}GAO_DOWNLOAD_20241020_220257"

# Set the raw directory for Farm Bill app
raw_doc_directory = f"{before_chunk_dir}Farm_Bill_DOWNLOAD_20241107_202042" # Farm Bill

**OVERWRITE DOC CHUNK**

If you are running this tool on an average computing platform—such as the 10-core CPU and 16 GB RAM setup used by developers during testing—it is recommended to select single-digit document chunks for retrieval from the vector database. This approach helps prevent runtime errors due to memory limitations. In such cases, set the overwrite_doc_chunks field to “Yes” to manage memory efficiently.

In [7]:
overwrite_doc_chunks = "Yes"
#overwrite_doc_chunks = "No"

if overwrite_doc_chunks=="Yes":
    doc_chunk_val = 4
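Note that the cell above assigns `doc_chunk_val` only when the flag is "Yes". A minimal sketch of a resolver that always returns a value is shown below; the function name `resolve_doc_chunk_val` and the fallback of 10 chunks are assumptions made for illustration, since the notebook section shown here does not state what the non-overwritten value should be.

```python
def resolve_doc_chunk_val(overwrite_flag, override_value=4, default_value=10):
    """Return how many chunks to retrieve from the vector database.

    default_value=10 is an assumed placeholder, not a value from the notebook.
    """
    if overwrite_flag not in ("Yes", "No"):
        raise ValueError(f"overwrite_doc_chunks must be 'Yes' or 'No', got {overwrite_flag!r}")
    return override_value if overwrite_flag == "Yes" else default_value

print(resolve_doc_chunk_val("Yes"))  # uses the single-digit override
```

Validating the flag up front turns a misspelled setting into an immediate error rather than an undefined variable several cells later.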

**UNRAG RESPONSE**

If you are demonstrating this tool to leadership or conducting testing, you may wish to compare the RAG response with the UNRAG version (where the LLM is prompted without contextual awareness). This comparison is available in Batch mode but is currently unavailable in Online mode. Therefore, custodians of this tool should decide whether to offer the UNRAG functionality to end users.

In [8]:
need_unrag_response = "Yes"
#need_unrag_response = "No"
# As discussed with Sinan on Oct 22, this is not a requirement, even from a testing perspective. The GWU project team proposed this feature to support testing by enabling a comparison between the RAG-augmented response and an LLM-generated response that is not augmented with contextual information.
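The difference between the two responses comes down to what is sent to the LLM: the bare question (UNRAG) or the question plus retrieved chunks (RAG). The sketch below illustrates that contrast with plain string handling. The function names and the prompt wording are assumptions for illustration and are not the templates used by this notebook.

```python
def build_unrag_prompt(question):
    """UNRAG: the LLM sees only the user's question."""
    return question.strip()

def build_rag_prompt(question, context_chunks):
    """RAG: retrieved chunks are placed ahead of the question (assumed template)."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}"
    )

question = "What are the loan amounts for different agricultural credit programs?"
chunks = ["Placeholder chunk one.", "Placeholder chunk two."]  # stand-ins for retrieved text
print(build_unrag_prompt(question))
print(build_rag_prompt(question, chunks))
```

Sending both prompts to the same model with the same temperature keeps the comparison focused on the effect of the added context.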

**RERANKING**

Set the flag in the following code cell to "Yes" if you need a more finely tuned contextual response with reranking. Changing this value affects Batch mode only; Online users can set their preference in the UI.

In [9]:
#need_reranking = "Yes"
need_reranking = "No"
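Reranking follows a two-stage pattern: a first pass retrieves candidate chunks, then each (query, chunk) pair is scored again and only the best few are kept. The sketch below shows that pattern only. Its `overlap_score` function is a deliberately simple stand-in based on shared words, an assumption for illustration; it is not the cross-encoder model that the notebook imports.

```python
def overlap_score(query, chunk):
    """Stand-in relevance score: fraction of query words present in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)

def rerank(query, candidates, score_fn=overlap_score, top_n=2):
    """Re-score first-pass candidates and keep the top_n highest-scoring ones."""
    ranked = sorted(candidates, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ranked[:top_n]

candidates = [
    "crop insurance premiums rose",
    "bridge repair funding",
    "nutrition programs budget",
]
print(rerank("crop insurance budget", candidates))
```

Because `score_fn` is a parameter, a real pairwise model could be substituted without changing the surrounding logic.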

**USER PROMPTS**

If you are running this program in the batch mode, you may need to prompt the application. Review the following prompts that are closely related to the domain and can be best answered when you augment an LLM with unstructured information.

In case you need a new prompt, you may edit the following code cell and append it to the list.

1 - USSP Prompts

In [10]:
#user_prompt = "Compare the Commodity Policy- documented in previous farm bill, the Agriculture Improvement Act of 2018 (P.L. 115-334)- with the new law introduced through Farm, Food, and National Security Act of 2024 (H.R. 8467), and provide me with a list of key differences."
#user_prompt = "What are the total estimated changes in mandatory spending for each Farm Bill title in H.R. 8467 from FY2025 to FY2033?"
#user_prompt = "What are the projected budget changes in key Farm Bill titles, such as Nutrition, Conservation, and Commodities, by FY2029 compared to FY2025?"
#user_prompt = "What are the projected annual mandatory spending amounts in H.R. 8467 for rural development, crop insurance, and research from FY2025 to FY2033?"
#user_prompt = "What is the PRIMARY PURPOSE of the Continuing Appropriations Act, 2018 and Supplemental Appropriations for Disaster Relief Requirements Act, 2017?"
#user_prompt = "How does the act - Continuing Appropriations Act, 2018 and Supplemental Appropriations for Disaster Relief Requirements Act, 2017 - address EDUCATION accountability and what are the key educational initiatives mentioned?"
#user_prompt = "As per the act-Continuing Appropriations Act, 2018 and Supplemental Appropriations for Disaster Relief Requirements Act, 2017,- what provisions have been made for DISASTER RELIEF and how are the allocated funds distributed among various agencies?"
#user_prompt = "What are the key guidelines or metrics for evaluating the EFFECTIVENESS of the programs under this act-Continuing Appropriations Act, 2018 and Supplemental Appropriations for Disaster Relief Requirements Act, 2017?"
#user_prompt = "How does the act-Continuing Appropriations Act, 2018 and Supplemental Appropriations for Disaster Relief Requirements Act, 2017- balance NATIONAL DEBT management while funding various appropriations?"
#user_prompt = "How did the temporary changes in unemployment insurance policies under the “Emergency Unemployment Insurance Stabilization and Access Act” impact claim rates and benefit accessibility?"
#user_prompt = "To what extent has the stipulation requiring producers to purchase crop insurance in subsequent years, as a condition of receiving aid, impacted the adoption of insurance programs among historically underserved producers? How could federal policies be adjusted to improve participation rates in these programs?"
#user_prompt = "In what ways have the financial provisions for the Agricultural Programs under Title I improved food security, rural development, and conservation programs in 2021? How effectively have the funds addressed the challenges faced by smallholder farmers and rural communities?"
#user_prompt = "What challenges did healthcare providers face in implementing telehealth solutions, and how did the CARES Act address those challenges?"
#user_prompt = "How effective has the supplemental funding for agricultural losses due to natural disasters (e.g., Hurricanes Michael and Florence, wildfires, floods) been in restoring crop production and supporting affected farmers? What measures could enhance the resilience of agricultural systems to such events in the future?"
#user_prompt = "What are the loan amounts for different agricultural credit programs?"

2.a - USSP Light Prompts (use them to test the charting functionality)

In [11]:
#user_prompt = "How does the funding for the Bridge Investment Program vary across fiscal years from 2022 to 2026?"
#user_prompt = "What is the total funding allocated annually for the Federal-Aid Highway Program between 2022 and 2026?"
#user_prompt = "What are the total appropriations for transportation programs like Tribal Transportation, Federal Lands Transportation, and Federal Lands Access from 2022 to 2026?"
#user_prompt = "How is the funding allocated to different pilot programs, such as the Wildlife Crossings Pilot Program, across fiscal years from 2022 to 2026?"
#user_prompt = "What is the annual breakdown of funding for the Reconnecting Communities Pilot Program (planning vs. capital construction grants) from 2022 to 2026?"
#user_prompt = "What are the total estimated changes in mandatory spending for each Farm Bill title in H.R. 8467 from FY2025 to FY2033?"
#user_prompt = "What are the projected budget changes in key Farm Bill titles, such as Nutrition, Conservation, and Commodities, by FY2029 compared to FY2025?"
#user_prompt = "What are the projected annual mandatory spending amounts in H.R. 8467 for rural development, crop insurance, and research from FY2025 to FY2033?"

2.b - USSP Light Prompts (may not generate output for charting)

In [12]:
#user_prompt = "How does the allocation of federal infrastructure funding between rural and urban areas reflect the government’s stated priorities for equitable development?"
#user_prompt = "What potential impact might the funding trends for bridge repairs and maintenance have on overall transportation safety in the US by 2026?"
#user_prompt = "How effectively does the Reconnecting Communities Pilot Program address the socioeconomic disparities in transportation infrastructure, based on the funding distribution from 2022 to 2026?"
#user_prompt = "How does the funding structure for tribal transportation programs compare to other transportation initiatives in terms of scope, allocation, and expected outcomes?"
#user_prompt = "In what ways could the Wildlife Crossings Pilot Program’s funding allocation contribute to broader environmental sustainability goals within federal infrastructure projects?"

3 - GAO Prompts

In [13]:
#user_prompt ="To what extent have federal programs, such as HUD’s CDBG colonias set-aside and USDA’s water and housing assistance, reduced economic disparities in these regions from 2020 to 2023?"
#user_prompt = "How effective have federal programs been in addressing infrastructure gaps (e.g., water and wastewater services) in colonias, based on recent site visits and assessments?"
#user_prompt ="How do climate projections of increasing temperatures impact the living conditions and infrastructure needs of colonias?"
#user_prompt = "What are the main barriers to accessing federal assistance in colonias, and how can program criteria be adjusted to improve eligibility and support?"
#user_prompt = "How do changes in population and demographics in colonias influence eligibility for federal assistance programs, and what legislative adjustments might be needed to ensure continued support?"

4 - Farm Bill Prompts

In [14]:
user_prompt ="What are the projected budget changes in key titles of the Farm Bill, particularly for domestic nutrition programs?"
#user_prompt = "Compare the Commodity Policy- documented in previous farm bill, the Agriculture Improvement Act of 2018 (P.L. 115-334)- with the new law introduced through Farm, Food, and National Security Act of 2024 (H.R. 8467), and provide me with a list of key differences in a tabular format."
#user_prompt = "What are the total estimated changes in mandatory spending for each Farm Bill title in H.R. 8467 from FY2025 to FY2033?"
#user_prompt = "What are the projected budget changes in key Farm Bill titles, such as Nutrition, Conservation, and Commodities, by FY2029 compared to FY2025?"
#user_prompt = "What are the projected annual mandatory spending amounts in H.R. 8467 for rural development, crop insurance, and research from FY2025 to FY2033?"

The GWSB developers working on this project in Fall 2024 explored ways to present a tabular comparison between two versions of the Farm Bill. Emphasizing user prompts with supplemental instructions was one of the initial solution options they considered. Consequently, you will see relevant code in the following cells, though it is currently disabled. Please note that this work is still in progress.

In [15]:
#farm_prompt_emphasis = "Note - I need the comparison in a tabular format wherein I can see one column for the previous law and another one dedicated for the new law. The items listed in one row should refer to one and only one aspect of comparison. List these aspects in bullet format. Ensure there is a header for each column; the official titles (with any identifiers) used for each law should appear as a header."

In [16]:
# if application == "Farm_Bill":
#     user_prompt = f"{user_prompt}. {farm_prompt_emphasis}"

**PERSISTENT DIRECTORY**

In this tool, embeddings are created with persistence, i.e. once generated they are written to a database and can be recalled later.

For the developers, it seemed sensible to isolate the persistent directories per application domain (i.e. USSP, GAO etc.) and per LLM in use. The team later realized that logical separation by LLM wasn't necessary, since the embeddings are produced by the same Sentence Transformers model regardless of the chosen LLM. This can be corrected in future releases.

In [17]:
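if chosen_model=="llama3.1":
    new_persist_directory = f"{vector_dir}{application}_Embed_MXbai"
elif chosen_model=="llama3.2":
    new_persist_directory = f"{vector_dir}{application}_Embed_MXbai_L3.2"
else:
    # Fail early instead of hitting a NameError on new_persist_directory below
    raise ValueError(f"No persistent directory is configured for model: {chosen_model}")
print(f"Model chosen is: {chosen_model}, and the persistent directory being used is: {new_persist_directory}")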

Model chosen is: llama3.2, and the persistent directory being used is: /Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Vector_DB_Embeddings/Farm_Bill_Embed_MXbai_L3.2
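If the per-LLM separation is dropped in a future release, the directory could depend on the application domain alone. A minimal sketch under that assumption follows; `persist_dir_for` is a hypothetical helper, and the `_Embed_MXbai` suffix is simply reused from the cell above.

```python
import os

def persist_dir_for(vector_dir, application, suffix="_Embed_MXbai"):
    """Build one persistent directory per application domain (hypothetical helper)."""
    return os.path.join(vector_dir, f"{application}{suffix}")

print(persist_dir_for("/data/vectors/", "Farm_Bill"))
```

Existing stores created under the `_L3.2` naming would need to be renamed or re-embedded before switching to a scheme like this.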


**EMBEDDING MODEL**

Define the class

In [18]:
class SentenceTransformerEmbeddings:
    """A wrapper to provide both document and query embeddings."""
    def __init__(self, model_name='mixedbread-ai/mxbai-embed-large-v1'):
        self.model = SentenceTransformer(model_name)

    def embed_documents(self, texts):
        """Embed a list of documents."""
        return self.model.encode(texts, batch_size=8, show_progress_bar=True)

    def embed_query(self, text):
        """Embed a single query."""
        return self.model.encode([text])[0]

Initialize the **Embedding Model**

In [19]:
embed_model = SentenceTransformerEmbeddings()
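Retrieval later compares a query embedding with chunk embeddings by cosine similarity. The sketch below shows that comparison on tiny hand-made vectors so it runs without downloading a model. It uses only the standard library, whereas the notebook imports `dot` and `norm` from NumPy for the same arithmetic; the function names and toy vectors are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot_product / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

toy_chunks = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # stand-ins for real embeddings
print(top_k([1.0, 0.0], toy_chunks))
```

With real embeddings, `query_vec` would come from `embed_model.embed_query(...)` and `chunk_vecs` from `embed_model.embed_documents(...)`.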

**LOAD DOCUMENTS**

Load the unstructured text for your application domain before invoking the embedding step, which decides whether new embeddings are required. The following code reads the files concurrently with a thread pool, so the file reads are distributed across threads and run in parallel.

In [20]:
def read_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        return file.read()

def load_files_concurrently(directory, pattern):
    file_list = glob.glob(os.path.join(directory, pattern))
    data = []
    
    # Use ThreadPoolExecutor for concurrent reading
    with ThreadPoolExecutor() as executor:
        data = list(executor.map(read_file, file_list))
    
    return data

start_time = time.time()
print("Loading documents from the directory...")

# This raw_doc_directory should come from the op_par function !!

data = load_files_concurrently(raw_doc_directory, "*.txt")
# print(data1)

import random

# Check if data is loaded and handle sampling accordingly
if len(data) == 0:
    print("\nNo documents loaded.")
else:
    num_samples = min(3, len(data))  # Ensure sample size doesn't exceed data size

    # Sample randomly from the data
    sample_indices = random.sample(range(len(data)), num_samples)
    print(f"\nRandomly sampling {num_samples} documents:")
    for idx in sample_indices:
        print(f"\nSample Document {idx + 1} (First 500 characters):")
        print(data[idx][:500])

    total_size = sum(len(doc) for doc in data)
    print(f"\nTotal size of loaded data: {total_size} characters")

print(f"Data loaded in {time.time() - start_time:.2f} seconds")

Loading documents from the directory...

Randomly sampling 1 documents:

Sample Document 1 (First 500 characters):


[Extract from Farm_Bill_PDF_20241107_202042.pdf - Page 1]
 
 
  
 The 2024 Farm Bill: H.R. 8467 Compared with 
Current Law  
August 27, 2024  
Congressional Research Service  
https://crsreports.congress.gov  
R48167  

[Extract from Farm_Bill_PDF_20241107_202042.pdf - Page 2]
 
Congressional Research Service   
SUMMARY  
 
The 2024 Farm Bill: H.R. 8467 Compared with  
Current Law  
Congress sets food and agriculture policy through periodic legislation referred to as fa rm bills. 
The previous

Total size of loaded data: 512396 characters
Data loaded in 0.00 seconds
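Chunk IDs are assigned by position later in the notebook, so the order in which files are loaded matters when a directory holds more than one file, and `glob.glob` does not guarantee an order. The sketch below is one way to make the order deterministic by sorting the paths first; `load_files_sorted` and `read_text` are hypothetical names, and the temporary files exist only to demonstrate the behavior.

```python
import glob
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_text(filepath):
    with open(filepath, "r", encoding="utf-8") as handle:
        return handle.read()

def load_files_sorted(directory, pattern="*.txt"):
    """Read matching files concurrently, in sorted path order."""
    file_list = sorted(glob.glob(os.path.join(directory, pattern)))
    with ThreadPoolExecutor() as executor:
        # executor.map yields results in input order, so texts[i] belongs to file_list[i].
        texts = list(executor.map(read_text, file_list))
    return file_list, texts

with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("b.txt", "a.txt"):
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as handle:
            handle.write(f"contents of {name}")
    paths, texts = load_files_sorted(tmp_dir)
    print([os.path.basename(path) for path in paths], texts)
```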


**CHUNKING**

**Split** the raw documents into **Chunks** before creating the **Embeddings**. Adjust the chunk size and overlap according to your requirements.

In [21]:
start_time = time.time()
print_progress("Splitting text into chunks")
data_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(separators=["\n\n", "\n", ". ", " ", ""],chunk_size=512, chunk_overlap=50)
chunks = []
for doc in tqdm(data, desc="Splitting text into chunks"):
    # Each doc is a plain string (the full text of one file), so it is passed to split_text directly
    chunks.extend(data_splitter.split_text(doc))
print(f"\nTOTAL CHUNKS: {len(chunks)}")
# Print the first few chunks to inspect their content
num_chunks_to_inspect = 2  # Adjust this value to see more or fewer chunks
print(f"\nInspecting the first {num_chunks_to_inspect} chunks:")
for i, chunk in enumerate(chunks[:num_chunks_to_inspect]):
    print(f"\nChunk {i + 1}:\n{chunk}")
print(f"Chunking completed in {time.time() - start_time:.2f} seconds")


Splitting text into chunks



Splitting text into chunks: 100%|██████████| 1/1 [00:00<00:00,  3.73it/s]


TOTAL CHUNKS: 359

Inspecting the first 2 chunks:

Chunk 1:
[Extract from Farm_Bill_PDF_20241107_202042.pdf - Page 1]
 
 
  
 The 2024 Farm Bill: H.R. 8467 Compared with 
Current Law  
August 27, 2024  
Congressional Research Service  
https://crsreports.congress.gov  
R48167

Chunk 2:
[Extract from Farm_Bill_PDF_20241107_202042.pdf - Page 2]
 
Congressional Research Service   
SUMMARY  
 
The 2024 Farm Bill: H.R. 8467 Compared with  
Current Law  
Congress sets food and agriculture policy through periodic legislation referred to as fa rm bills. 
The previous farm bill, the Agriculture Improvement Act of 2018 ( P.L. 115 -334), was extended 
by one year ( P.L. 118 -22)—until September 30, 2024, and for the 2024 crop year. The farm bill 
covers  numerous  policies and programs , including commodity support, conservation, trade and 
food aid, domestic food assistance, credit, rural develo pment, research, forestry, energy, 
horticulture, and crop insurance , among others . 
The Farm, Foo
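To see how `chunk_size` and `chunk_overlap` interact, the simplified sketch below slides a fixed-size window over a list of tokens. It is an assumption-laden illustration: it splits on whitespace and ignores separators, whereas the notebook's splitter counts tiktoken tokens and splits recursively on the listed separators. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks

words = "a b c d e f g h i j".split()
for chunk in split_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk repeats the last token of the previous one, which is the role the 50-token overlap plays in the cell above: context that straddles a boundary appears in both neighboring chunks.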




Create a Documents object from the Chunks

In [22]:
documents = [Document(page_content=chunk, metadata={}, id=str(i)) for i, chunk in enumerate(chunks)]

Review the document chunks

In [23]:
# Check if documents list is not empty
if documents:
    for i, doc in enumerate(documents):
        print(f"\nDocument {i} attributes and values:")
        print(f"ID: {doc.id}")
        print(f"Page Content: {doc.page_content}")
        print(f"Metadata: {doc.metadata}")
else:
    print("The documents list is empty.")


Document 0 attributes and values:
ID: 0
Page Content: [Extract from Farm_Bill_PDF_20241107_202042.pdf - Page 1]
 
 
  
 The 2024 Farm Bill: H.R. 8467 Compared with 
Current Law  
August 27, 2024  
Congressional Research Service  
https://crsreports.congress.gov  
R48167
Metadata: {}

Document 1 attributes and values:
ID: 1
Page Content: [Extract from Farm_Bill_PDF_20241107_202042.pdf - Page 2]
 
Congressional Research Service   
SUMMARY  
 
The 2024 Farm Bill: H.R. 8467 Compared with  
Current Law  
Congress sets food and agriculture policy through periodic legislation referred to as fa rm bills. 
The previous farm bill, the Agriculture Improvement Act of 2018 ( P.L. 115 -334), was extended 
by one year ( P.L. 118 -22)—until September 30, 2024, and for the 2024 crop year. The farm bill 
covers  numerous  policies and programs , including commodity support, conservation, trade and 
food aid, domestic food assistance, credit, rural develo pment, research, forestry, energy, 
horticulture

**EMBEDDINGS** in vector database (**ChromaDB**)

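Performance-Enhanced Embedding Creation: The code assesses the need for new embeddings by comparing the IDs of the current document chunks with the IDs already stored in the vector database. Chunks whose IDs are not yet stored are embedded and upserted, while chunks whose IDs already exist are skipped to save time. Because the IDs are positional (the chunk's index), a chunk whose text has changed but whose index has not is not detected as modified.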

In [24]:
start_time = time.time()
print_progress("Creating/updating embeddings in vector database (ChromaDB)")

# Here we set the collection name and metadata
collection_name = f"collection_{application}"
collection_metadata = {
    "hnsw:space": "cosine",
    "timestamp": datetime.now().isoformat()
}
# Start the ChromaDB client with a persistent directory

# This persistent directory should come from the op_par function !!

chroma_client = chromadb.PersistentClient(path=new_persist_directory)

# Load existing collection if it exists, or create a new one
try:
    vector_store = Chroma(
        client=chroma_client,
        collection_name=collection_name,
        embedding_function=embed_model,
        collection_metadata=collection_metadata
    )

    existing_docs = vector_store._collection.get(include=["documents"])

    # Check the type of existing_docs and adjust accordingly
    if isinstance(existing_docs, dict) and "documents" in existing_docs:
        ids_list = existing_docs["ids"]
        #print(ids_list)
        print(f"\nNumber of document IDs in the old collection: {len(ids_list)}")
        documents_list = existing_docs["documents"]  # Extract the list of documents
        print(f"\nNumber of documents in the old collection: {len(documents_list)}")

        # # Inspect a few of the documents in the list
        # num_to_inspect = min(3, len(ids_list), len(documents_list))
        
        # print(f"\nInspecting the first {num_to_inspect} IDs and documents:")
        # for i, (doc_id, doc) in enumerate(zip(ids_list, documents_list)):
        #     if i >= num_to_inspect:
        #         break
        #     print(f"\nID {i + 1}: {doc_id}")
        #     print(f"Document {i + 1}: {doc}")

    else:
        print("\nUnexpected structure of existing_docs:", type(existing_docs))
        ids_list = []  # Treat every chunk as new so the loop below does not hit a NameError

    # Filter documents to upsert only new/modified ones
    # new_documents = [
    #     doc for i, doc in enumerate(documents)
    #     if str(i) not in ids_list
    # ]

    new_ids = []
    same_ids = []
    new_documents = []

    # Iterate through documents with enumeration
    for i, doc in enumerate(documents):
        doc_id = str(i)
        
        # Check if the ID is new or the same
        if doc_id not in ids_list:
            new_ids.append(doc_id)
            new_documents.append(doc)
        else:
            same_ids.append(doc_id)

    # Print the new and same IDs
    #print(f"\nNew IDs: {new_ids}")
    #print(f"Same IDs: {same_ids}")
    
    print(f"\nNumber of new documents to add to the collection of embeddings: {len(new_documents)}")

    # Inspect the content of the first 3 new documents
    num_docs_to_inspect = min(3, len(new_documents))
    print(f"\nInspecting the content of the first {num_docs_to_inspect} new documents:")

    for i, doc in enumerate(new_documents[:num_docs_to_inspect]):
        print(f"\nDocument {i + 1} ID: {new_ids[i]}")
        print(f"Content (first 300 characters):\n{doc.page_content[:300]}")  # Adjust the character limit as needed
    
    if new_documents:
        print_progress("Populating the vector store with new documents...")

        # Embed only the new documents. new_ids was collected in the loop above,
        # so each ID stays paired with its own document even when the new
        # documents are not at the end of the list.
        new_embeddings = embed_model.embed_documents([doc.page_content for doc in new_documents])

        # Upsert only the new/modified documents
        vector_store._collection.upsert(
            documents=[doc.page_content for doc in new_documents],
            embeddings=new_embeddings,
            ids=new_ids
        )

        print(f"Upserted {len(new_documents)} new/modified documents.")
    else:
        print("No new/modified documents found. Skipping upsert.")

except Exception as e:
    print(f"Error while creating/updating embeddings: {e}")

# Calculate and print elapsed time
elapsed_time = time.time() - start_time
hours, rem = divmod(elapsed_time, 3600)
minutes, seconds = divmod(rem, 60)
print(f"Embeddings process completed in {int(hours):02d}:{int(minutes):02d}:{seconds:.2f}")


Creating/updating embeddings in vector database (ChromaDB)


Number of document IDs in the old collection: 359

Number of documents in the old collection: 359

Number of new documents to add to the collection of embeddings: 0

Inspecting the content of the first 0 new documents:
No new/modified documents found. Skipping upsert.
Embeddings process completed in 00:00:0.29
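
The cell above assigns each chunk an ID from its position in `documents`, so a chunk counts as new only when its index is missing from the collection; text that changes at an existing index is not detected. One possible alternative, sketched below, derives the ID from a hash of the chunk text. This is an illustration under stated assumptions, not the notebook's method: `content_id` and `split_new_and_existing` are hypothetical helpers, and plain strings stand in for the document objects.

```python
import hashlib


def content_id(text: str) -> str:
    """Return a stable ID derived from the chunk text (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def split_new_and_existing(texts, existing_ids):
    """Partition chunk texts into (new, existing) lists of (id, text) pairs."""
    known = set(existing_ids)
    new, existing = [], []
    for text in texts:
        doc_id = content_id(text)
        (existing if doc_id in known else new).append((doc_id, text))
    return new, existing


# Example: one chunk is already stored, one was edited since the last run.
stored = [content_id("Title I covers commodities.")]
chunks = ["Title I covers commodities.", "Title IV covers nutrition (revised)."]
new_chunks, unchanged = split_new_and_existing(chunks, stored)
print(len(new_chunks), len(unchanged))
```

With content-derived IDs, identical chunks map to the same ID, so duplicates would need to be removed before an upsert.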




Initiate the **RAG LLM MODEL**

In [25]:
start_time = time.time()
print_progress("Defining the large language model")
llm = Ollama(model=chosen_model, temperature = 0)
print(f"LLM model used in this application : {chosen_model}")
print(f"LLM defined in {time.time() - start_time:.2f} seconds")


Defining the large language model

LLM model used in this application : llama3.2
LLM defined in 0.01 seconds




**VECTOR DB COLLECTION VALIDATION**

Identify the Collection for Document Retrieval: The embedding creation code is designed to first locate the appropriate collection for the application domain. It then uses this collection to upsert new embeddings. The following code snippet validates that the correct collection is in use for the subsequent processes in this program.

In [26]:
start_time = time.time()
print_progress("Listing available collections")
# PERSISTENT
# vector_store = Chroma(persist_directory="./Embed_MXbai", embedding_function=embed_model)

# NON PERSISTENT
# vector_store = Chroma(embedding_function=embed_model)

try:
    collections = vector_store._client.list_collections()
    if collections:
        # Sort collections by 'timestamp' in metadata (newest first); collections without one sort last via datetime.min
        collections.sort(key=lambda x: datetime.fromisoformat(x.metadata['timestamp']) if x.metadata and 'timestamp' in x.metadata else datetime.min, reverse=True)
        # Print all collections
        for collection in collections:
            print(f"- {collection.name} (Metadata: {collection.metadata})")
        # Select the most recent collection (first one after sorting)
        selected_collection = collections[0]
        # Access timestamp from metadata
        latest_timestamp = selected_collection.metadata['timestamp'] if selected_collection.metadata and 'timestamp' in selected_collection.metadata else 'N/A'
        print(f"\nUsing the most recent collection: {selected_collection.name}")
    else:
        print("No collections found.")
        selected_collection = None
except Exception as e:
    print(f"Error listing collections: {e}")
    selected_collection = None
print(f"Time taken to list the collections from vector database:{time.time() - start_time:.2f} seconds")


Listing available collections

- collection_Farm_Bill (Metadata: {'hnsw:space': 'cosine', 'timestamp': '2024-11-10T02:17:46.035546'})

Using the most recent collection: collection_Farm_Bill
Time taken to list the collections from vector database:0.00 seconds
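
The selection rule used above (newest `timestamp` wins, entries without a timestamp sort last) can be exercised in isolation. The sketch below is a minimal illustration, assuming plain dicts in place of Chroma collection objects; `latest_by_timestamp` is a hypothetical helper, and `collection_GAO` and `collection_USSP` are made-up sample entries.

```python
from datetime import datetime


def latest_by_timestamp(collections):
    """Return the entry with the newest ISO-8601 'timestamp' in its metadata, or None."""
    def sort_key(item):
        metadata = item.get("metadata") or {}
        stamp = metadata.get("timestamp")
        # Entries without a timestamp fall back to the earliest possible date
        return datetime.fromisoformat(stamp) if stamp else datetime.min

    return max(collections, key=sort_key, default=None)


# Plain dicts stand in for collection objects (an assumption for illustration).
sample = [
    {"name": "collection_GAO", "metadata": None},
    {"name": "collection_Farm_Bill", "metadata": {"timestamp": "2024-11-10T02:17:46.035546"}},
    {"name": "collection_USSP", "metadata": {"timestamp": "2024-10-01T09:00:00"}},
]
print(latest_by_timestamp(sample)["name"])
```

Recency alone does not guarantee that the collection belongs to the chosen application; a stricter check might also compare the name against the `collection_<application>` pattern used elsewhere in this notebook.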


Generate the **QUERY EMBEDDING** and **Review** it before passing it to Langchain.

In [27]:
start_time = time.time()
query_embedding = embed_model.embed_query(user_prompt)
print(f"Embedding Length: {len(query_embedding)}")
# print(f"Query Embedding: {query_embedding}")
# Display only the first 10 values of the embedding
print(f"Query Embedding (first 10 values): {query_embedding[:10]}... (total {len(query_embedding)} values)")

Embedding Length: 1024
Query Embedding (first 10 values): [-0.00193667 -0.47770417 -0.02451177 -0.16811286 -0.49087334  0.10195013
  0.33996192  0.87023795  0.13355955  0.05390351]... (total 1024 values)


**COSINE DISTANCE and COSINE SIMILARITY**

Take a peek at the Document Embeddings in the Selected Collection, calculate the Cosine Distance and Cosine Similarity against the Query Embedding, and identify which Document Chunk(s) will be retrieved and fed as context to the language chain.
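
As a small worked example of the two measures: cosine similarity is the dot product of two vectors divided by the product of their norms, and cosine distance is one minus that value, so the chunk with the smallest distance is the most similar. The sketch below uses toy 3-dimensional vectors in place of the 1024-dimensional embeddings; `cosine_similarity` and `rank_by_distance` are hypothetical helpers written in plain Python for illustration.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)


def rank_by_distance(query, embeddings):
    """Return (index, similarity, distance) tuples, closest chunk first."""
    scored = []
    for index, embedding in enumerate(embeddings):
        similarity = cosine_similarity(query, embedding)
        scored.append((index, similarity, 1.0 - similarity))
    return sorted(scored, key=lambda row: row[2])


# Toy 3-dimensional vectors stand in for the 1024-dimensional embeddings.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [2.0, 0.0, 0.0]]
ranking = rank_by_distance(query_vec, doc_vecs)
print([index for index, _, _ in ranking])
```

The third vector ranks first even though it is longer than the query, because cosine similarity depends on direction only. A zero vector would cause a division by zero and would need a guard.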

In [28]:
start_time = time.time()
if selected_collection:
    print_progress("Take a peek at the Document Embeddings; Calculate the Cosine Similarity with the query embeddings, and identify which Document Chunk will be retrieved and fed as context to the language chain")

    try:
        # Retrieve documents and embeddings from the selected collection
        documents = selected_collection.get(include=['embeddings','documents'])

        # Print and inspect the structure of `documents`
        # print("Documents structure:")
        # print(documents)

        # Check if documents contain embeddings and content
        embeddings = documents.get("embeddings")
        if embeddings is not None and documents.get("documents"):
            docs = documents["documents"]
            # Initialize empty lists to store cosine similarity and cosine distance along with document IDs
            similarity_list = []
            distance_list = []
            
            least_similarity = float('inf')  # Smallest cosine distance seen so far (starts at +infinity)
            least_similarity_doc_id = None  # ID of the closest (most similar) document
            
            # Loop through embeddings and documents for inspection
            for i, (embedding, doc) in enumerate(zip(embeddings, docs)):
                similarity = dot(query_embedding, embedding) / (norm(query_embedding) * norm(embedding))
                cosine_distance = 1 - similarity
                
                # Append the document ID and calculated values to their respective lists
                similarity_list.append({
                    'Document ID': i + 1,  # Document IDs start from 1
                    'Cosine Similarity': similarity
                })
                distance_list.append({
                    'Document ID': i + 1,
                    'Cosine Distance': cosine_distance
                })
                print(f"\nDocument {i + 1}:")
                print(f"Cosine Similarity: {similarity}")
                print(f"Cosine Distance: {cosine_distance}")
                #print(f"Embedding: {embedding}")
                print(f"Embedding Length: {len(embedding)}")
                print(f"Content: {doc}")
                
                # Track the document with the smallest cosine distance (the most similar one)
                if cosine_distance < least_similarity:
                    least_similarity = cosine_distance
                    least_similarity_doc_id = i + 1  # Document IDs start from 1
            
            # Print the document with the least cosine distance
            print(f"\nThe most similar document is the one with Document ID {least_similarity_doc_id} , has a Cosine Distance {least_similarity}")
            
            # Sort the similarity and distance lists in descending order
            similarity_list = sorted(similarity_list, key=lambda x: x['Cosine Similarity'], reverse=True)
            distance_list = sorted(distance_list, key=lambda x: x['Cosine Distance'], reverse=True)
            
            # Print the sorted Cosine Similarity list
            print(f"\nCosine Similarity List:")
            print(f"{'Document ID':<12} {'Cosine Similarity':<18}")
            print("=" * 30)
            for item in similarity_list:
                print(f"{item['Document ID']:<12} {item['Cosine Similarity']:<18}")

            # Print the sorted Cosine Distance list
            print(f"\nCosine Distance List:")
            print(f"{'Document ID':<12} {'Cosine Distance':<18}")
            print("=" * 30)
            for item in distance_list:
                print(f"{item['Document ID']:<12} {item['Cosine Distance']:<18}")
        else:
            print("Embeddings are missing or no documents found.")
    except Exception as e:
        print(f"Error retrieving embeddings: {e}")
else:
    print("No valid collection to retrieve embeddings from.")
print(f"Time taken to list the embeddings:{time.time() - start_time:.2f} seconds")


Take a peek at the Document Embeddings; Calculate the Cosine Similarity with the query embeddings, and identify which Document Chunk will be retrieved and fed as context to the language chain


Document 1:
Cosine Similarity: 0.6991544919002178
Cosine Distance: 0.30084550809978217
Embedding Length: 1024
Content: [Extract from Farm_Bill_PDF_20241107_202042.pdf - Page 1]
 
 
  
 The 2024 Farm Bill: H.R. 8467 Compared with 
Current Law  
August 27, 2024  
Congressional Research Service  
https://crsreports.congress.gov  
R48167

Document 2:
Cosine Similarity: 0.7877625831090223
Cosine Distance: 0.21223741689097775
Embedding Length: 1024
Content: [Extract from Farm_Bill_PDF_20241107_202042.pdf - Page 2]
 
Congressional Research Service   
SUMMARY  
 
The 2024 Farm Bill: H.R. 8467 Compared with  
Current Law  
Congress sets food and agriculture policy through periodic legislation referred to as fa rm bills. 
The previous farm bill, the Agriculture Improvement Act of 2018 ( P.L. 115 -334), w

**DOWNLOAD** the **EMBEDDINGS**... (if you have already downloaded them once, and the delta between old and new embeddings is 0, you may disable this code)

In [29]:
# # Retrieve embeddings, documents, ids, and metadata from the selected collection
# results = selected_collection.get(include=['embeddings', 'documents', 'metadatas'])

# embeddings = results['embeddings']
# documents = results['documents']
# ids = results['ids']
# metadata = results['metadatas']

# # Initialize a list to store data for the DataFrame
# data = []

# # Iterate over the retrieved components
# for idx, (embedding, document, id_, meta) in enumerate(zip(embeddings, documents, ids, metadata)):
#     # Create a dictionary for the current record and add it to the data list
#     record = {
#         'ID': id_,
#         'Embedding': embedding,
#         'Document': document,
#         'Metadata': meta
#     }
#     data.append(record)

# # Convert the list of records into a DataFrame
# final_df = pd.DataFrame(data)

# # Save the DataFrame to a CSV file
# csv_filename = f'{embed_download_dir}{application}_all_documents.csv'
# final_df.to_csv(csv_filename, index=False)

# # Display a confirmation message
# print(f"Data successfully saved to {csv_filename}")

**NETWORK ANALYSIS**

Generate **Labels** for each document, **Transpose** the **Embeddings**, and download the **CSV** (if you have already downloaded them once, and the delta between old and new embeddings is 0, you may disable this code). This piece of code is required only if you are performing Network Analysis using Gephi software.

In [30]:
# # Ensure required nltk data is downloaded
# nltk.download('punkt')
# nltk.download('stopwords')

# # Step 1: Retrieve embeddings, documents, ids, and metadata from the selected collection
# results = selected_collection.get(include=['embeddings', 'documents', 'metadatas'])

# embeddings = results['embeddings']
# documents = results['documents']
# ids = results['ids']
# metadata = results['metadatas']

# # Step 2: Function to extract the best word representing the document
# def extract_best_word(document):
#     tokens = word_tokenize(document.lower())
#     stop_words = set(stopwords.words('english'))
#     tokens = [word for word in tokens if word not in stop_words and word not in string.punctuation]
#     most_common_word = Counter(tokens).most_common(1)
#     return most_common_word[0][0] if most_common_word else 'unknown'

# # Step 3: Initialize a list to store column names and data
# column_names = []
# expanded_data = []

# # Step 4: Iterate over all documents in the collection
# for idx, (embedding, document, doc_id) in enumerate(zip(embeddings, documents, ids)):
#     # Extract the label for the document
#     label = extract_best_word(document)
    
#     # Convert embeddings to a 1D array (if not already)
#     embedding = np.array(embedding).flatten()

#     # Create a column name in the format 'DocID_<doc_id>_Label_<label>'
#     column_name = f"DocID_{idx+1}_Label_{label}"
#     column_names.append(column_name)

#     # Append embedding values as a new row in expanded_data
#     if len(expanded_data) < len(embedding):
#         expanded_data.extend([[] for _ in range(len(embedding) - len(expanded_data))])
    
#     for i, emb_val in enumerate(embedding):
#         expanded_data[i].append(emb_val)

# # Step 5: Convert expanded data into a DataFrame
# expanded_df = pd.DataFrame(expanded_data, columns=column_names)

# # Display the first few rows of the final expanded DataFrame
# print(expanded_df.head())

# # Step 6: Save the expanded DataFrame to a CSV file
# csv_filename = f'{transposed_embed_dir}{application}_label_with_transposed_embeddings.csv'
# expanded_df.to_csv(csv_filename, index=False)

# # Display a confirmation message
# print(f"Data successfully saved to {csv_filename}")

The **MOST SIMILAR** **CONTEXTUAL DOCUMENT** for the **User Prompt** retrieved from ChromaDB collection


In [31]:
print(f"\nThe Most Similar Contextual Document for the User Prompt - is the one with Document ID :  {least_similarity_doc_id} , and has a Cosine Distance : {least_similarity}")


The Most Similar Contextual Document for the User Prompt - is the one with Document ID :  2 , and has a Cosine Distance : 0.21223741689097775


Helper Function for printing

In [32]:
# Helper function for printing docs
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

Initiate the **RETRIEVER**

In [33]:
# Check the length of 'chunks'
if overwrite_doc_chunks=="Yes":
    doc_chunks_to_retrieve = doc_chunk_val
    top_n = 1
elif len(chunks) > 200:
    doc_chunks_to_retrieve = 100
    top_n = doc_chunks_to_retrieve // 5
else:
    doc_chunks_to_retrieve = len(chunks) // 2  # Integer division
    top_n = doc_chunks_to_retrieve // 5
print(f"Top_N ={top_n}")
# Check if doc_chunks_to_retrieve is not an integer and try converting it
if not isinstance(doc_chunks_to_retrieve, int):
    try:
        doc_chunks_to_retrieve = int(doc_chunks_to_retrieve)
        print(f"Converted doc_chunks_to_retrieve to integer: {doc_chunks_to_retrieve}")
    except (TypeError, ValueError):
        print(f"Cannot convert doc_chunks_to_retrieve to integer: {doc_chunks_to_retrieve}")

if overwrite_doc_chunks=="Yes":
    retriever = vector_store.as_retriever(search_kwargs={"k": doc_chunk_val})        # RETRIEVER
else:
    retriever = vector_store.as_retriever(search_kwargs={"k": doc_chunks_to_retrieve})  # RETRIEVER

# Output the value of 'doc_chunks_to_retrieve'
if need_reranking=="Yes":
    print(f"Retriever initiated with {doc_chunks_to_retrieve} document chunks extracted from Vector DB. The ReRanker will keep the top {top_n} documents for the RAG LLM chain.")
else:
    print(f"Retriever initiated with {doc_chunks_to_retrieve} document chunks extracted from Vector DB.")

Top_N =1
Retriever initiated with 4 document chunks extracted from Vector DB.
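
The sizing rule in the cell above can be summarized as a small function. The sketch below is a hedged restatement, not the notebook's code: `retrieval_sizes` is a hypothetical helper, the input values are illustrative, and the final clamp to at least 1 is an added assumption, since integer division yields a `top_n` of 0 when fewer than 10 chunks exist.

```python
def retrieval_sizes(num_chunks, override_k=None, large_corpus_k=100, rerank_ratio=5):
    """Return (k, top_n): chunks to fetch and chunks to keep after reranking."""
    if override_k is not None:
        # A user-supplied value overrides the size-based rule
        k = int(override_k)
        top_n = 1
    elif num_chunks > 200:
        k = large_corpus_k
        top_n = k // rerank_ratio
    else:
        k = num_chunks // 2
        top_n = k // rerank_ratio
    # Assumption: clamp both values so a tiny corpus still retrieves something.
    k = max(1, k)
    top_n = max(1, min(top_n, k))
    return k, top_n


print(retrieval_sizes(359, override_k=4))
print(retrieval_sizes(359))
print(retrieval_sizes(8))
```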


Print the **RETRIEVED CONTEXT** and initiate the **RERANKER** - if required
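
Conceptually, reranking scores every retrieved chunk against the query a second time and keeps only the best `top_n`. The sketch below illustrates that two-stage shape only: `rerank` is a hypothetical helper, and `token_overlap` is a toy scorer standing in for the cross-encoder model, which scores query and chunk pairs with a neural network rather than by word overlap.

```python
def rerank(query, documents, score_fn, top_n):
    """Score each candidate against the query and keep the top_n highest."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def token_overlap(query, doc):
    """Toy relevance score: fraction of query tokens that appear in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0


# Made-up candidate chunks for illustration.
candidates = [
    "crop insurance premiums and subsidies",
    "nutrition programs budget changes",
    "forestry and energy provisions",
]
top = rerank("budget changes for nutrition programs", candidates, token_overlap, top_n=1)
print(top)
```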

In [None]:
start_time = time.time()

if need_reranking=="Yes":
    encoder_model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3")
    #encoder_model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
    compressor = CrossEncoderReranker(model=encoder_model, top_n=top_n)
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor, base_retriever=retriever
    )
    compressed_retriever_context = compression_retriever.invoke(user_prompt)
    print(f"Top_N ={top_n}")
    pretty_print_docs(compressed_retriever_context)

elif need_reranking=="No":
    retrieved_context = retriever.invoke(user_prompt)
    pretty_print_docs(retrieved_context)

# Convert elapsed time into hours, minutes, and seconds
elapsed_time = time.time() - start_time
hours, rem = divmod(elapsed_time, 3600)  # Divide by 3600 to get hours
minutes, seconds = divmod(rem, 60)       # Divide the remainder by 60 to get minutes and seconds

# Print the result in hh:mm:ss format
print(f"Response obtained in {int(hours):02d}:{int(minutes):02d}:{seconds:.2f}")

Document 1:

[Extract from Farm_Bill_PDF_20241107_202042.pdf - Page 2]
 
Congressional Research Service   
SUMMARY  
 
The 2024 Farm Bill: H.R. 8467 Compared with  
Current Law  
Congress sets food and agriculture policy through periodic legislation referred to as fa rm bills. 
The previous farm bill, the Agriculture Improvement Act of 2018 ( P.L. 115 -334), was extended 
by one year ( P.L. 118 -22)—until September 30, 2024, and for the 2024 crop year. The farm bill 
covers  numerous  policies and programs , including commodity support, conservation, trade and 
food aid, domestic food assistance, credit, rural develo pment, research, forestry, energy, 
horticulture, and crop insurance , among others . 
The Farm, Food, and National Security Act of 2024  (H.R. 8467 ) was introduced on May 21, 2024 . On May 23, 2024, the 
House Committee on Agriculture ordered  the bill  reported favorably to the House by a vote of 33 -21. On August 2, 2024, the 
Congressional Budget Office published a sc

Initiate the **RETRIEVAL CHAIN**

In [35]:
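# Verify that the retriever is fetching the correct and relevant documents before feeding them into the chain.

start_time = time.time()  # Reset the timer so the duration printed at the end covers only this cell
print_progress("Pulling retrieval QA chat prompt from the hub")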
retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")      # Q&A CHAT

print_progress("Preparing the document chain")
combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)  # LLM + Q&A CHAT

# Create a list to capture all Page Content values
page_content_list = []

if need_reranking=="Yes":

    # Extract Page Content values and add to the list
    for doc in compressed_retriever_context:
        page_content_list.append(doc.page_content)

    print_progress("Building the Retrieval QA chain with ReRanking feature")
    # The reranking retriever supplies context to the document chain (LLM + Q&A chat prompt)
    retrieval_chain = create_retrieval_chain(compression_retriever, combine_docs_chain)    # COMPRESSED_RETRIEVER + LLM + CHAT
elif need_reranking=="No":
   
    for doc in retrieved_context:
        page_content_list.append(doc.page_content)
    print_progress("Building the Retrieval QA chain without ReRanking feature")
    # The plain retriever supplies context to the document chain (LLM + Q&A chat prompt)
    retrieval_chain = create_retrieval_chain(retriever, combine_docs_chain)  # RETRIEVER + LLM + CHAT

page_content_string = "\n\n".join(page_content_list)

print(f"Chain created in:{time.time() - start_time:.2f} seconds")


Pulling retrieval QA chat prompt from the hub






Preparing the document chain


Building the Retrieval QA chain without ReRanking feature

Chain created in:0.38 seconds


**INVOKE** the **RETRIEVAL CHAIN** and print the **RAG AUGMENTED RESPONSE**

In [36]:
print_progress("The retrieval chain is working its magic to craft a thoughtful, context-rich response. Hang tight – it’s on the way!")

start_time = time.time()

# Both modes invoke the same chain; retrieval_chain was already built with the matching retriever
print(f"Reranking - {need_reranking}")
rag_response = retrieval_chain.invoke({"input": user_prompt})

print("\nUSER PROMPT :\n", user_prompt)
print("\nRETRIEVER AUGMENTED RESPONSE :\n", rag_response['answer'])

# Convert elapsed time into hours, minutes, and seconds
elapsed_time = time.time() - start_time
hours, rem = divmod(elapsed_time, 3600)  # Divide by 3600 to get hours
minutes, seconds = divmod(rem, 60)       # Divide the remainder by 60 to get minutes and seconds

# Print the result in hh:mm:ss format
print(f"Response obtained in {int(hours):02d}:{int(minutes):02d}:{seconds:.2f}")


The retrieval chain is working its magic to craft a thoughtful, context-rich response. Hang tight – it’s on the way!

Reranking - No

USER PROMPT :
 What are the projected budget changes in key titles of the Farm Bill, particularly for domestic nutrition programs?

RETRIEVER AUGMENTED RESPONSE :
 According to the Congressional Research Service report, the 2024 Farm Bill (H.R. 8467) would make significant changes to various titles, including domestic nutrition programs.

For domestic nutrition programs, specifically the Supplemental Nutrition Assistance Program (SNAP), the bill would:

* Increase outlays by $43.4 billion over nine years
* Raise the earned income deduction, further reducing the extent of benefit reduction for low-income households
* Limit scheduled future updates to SNAP benefits from the Thrifty Food Plan (TFP)
* Reduce the TFP limits plus CBO-scored interaction effects with Title IV provisions, resulting in a net reduction of $20.6 billion over nine years

In contrast

Overall **EXECUTION TIME**

In [37]:
# Convert elapsed time into hours, minutes, and seconds
elapsed_time = time.time() - overall_start_time
hours, rem = divmod(elapsed_time, 3600)  # Divide by 3600 to get hours
minutes, seconds = divmod(rem, 60)       # Divide the remainder by 60 to get minutes and seconds

# Print the result in hh:mm:ss format
print(f"Response obtained in {int(hours):02d}:{int(minutes):02d}:{seconds:.2f}")

Response obtained in 00:00:32.22
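
The hh:mm:ss conversion is repeated in several cells; it could be factored into one helper. The sketch below is one way to do that, with `format_elapsed` as a hypothetical name. It also zero-pads the seconds field (`05.2f`), which differs slightly from the cells above, where `.2f` prints values such as `00:00:0.29`.

```python
def format_elapsed(elapsed_seconds):
    """Format a duration in seconds as hh:mm:ss.ss."""
    hours, rem = divmod(elapsed_seconds, 3600)  # Whole hours and the remainder in seconds
    minutes, seconds = divmod(rem, 60)          # Whole minutes and the leftover seconds
    return f"{int(hours):02d}:{int(minutes):02d}:{seconds:05.2f}"


print(format_elapsed(32.22))
print(format_elapsed(3725.5))
```

A cell would then end with something like `print(f"Response obtained in {format_elapsed(time.time() - start_time)}")`.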


**ONLINE MODE**

In [38]:
prompts = {
    
    "Farm_Bill": ["Compare the Commodity Policy- documented in previous farm bill, the Agriculture Improvement Act of 2018 (P.L. 115-334)- with the new law introduced through Farm, Food, and National Security Act of 2024 (H.R. 8467), and provide me with a list of key differences in a tabular format.", "What are the total estimated changes in mandatory spending for each Farm Bill title in H.R. 8467 from FY2025 to FY2033?", "What are the projected budget changes in key Farm Bill titles, such as Nutrition, Conservation, and Commodities, by FY2029 compared to FY2025?", "What are the projected annual mandatory spending amounts in H.R. 8467 for rural development, crop insurance, and research from FY2025 to FY2033?"],
    
    "USSP_Light": ["What are the total estimated changes in mandatory spending for each Farm Bill title in H.R. 8467 from FY2025 to FY2033?", "What are the projected budget changes in key Farm Bill titles, such as Nutrition, Conservation, and Commodities, by FY2029 compared to FY2025?", "What are the projected annual mandatory spending amounts in H.R. 8467 for rural development, crop insurance, and research from FY2025 to FY2033?", "Compare the Commodity Policy- documented in previous farm bill, the Agriculture Improvement Act of 2018 (P.L. 115-334)- with the new law introduced through Farm, Food, and National Security Act of 2024 (H.R. 8467), and provide me with a list of key differences.","How does the funding for the Bridge Investment Program vary across fiscal years from 2022 to 2026?",
                    "What is the total funding allocated annually for the Federal-Aid Highway Program between 2022 and 2026?",
                    "What are the total appropriations for transportation programs like Tribal Transportation, Federal Lands Transportation, and Federal Lands Access from 2022 to 2026?",
                    "How is the funding allocated to different pilot programs, such as the Wildlife Crossings Pilot Program, across fiscal years from 2022 to 2026?",
                    "What is the annual breakdown of funding for the Reconnecting Communities Pilot Program (planning vs. capital construction grants) from 2022 to 2026?",
                    "How does the allocation of federal infrastructure funding between rural and urban areas reflect the government’s stated priorities for equitable development?",
                    "What potential impact might the funding trends for bridge repairs and maintenance have on overall transportation safety in the US by 2026?",
                    "How effectively does the Reconnecting Communities Pilot Program address the socioeconomic disparities in transportation infrastructure, based on the funding distribution from 2022 to 2026?",
                    "How does the funding structure for tribal transportation programs compare to other transportation initiatives in terms of scope, allocation, and expected outcomes?",
                    "In what ways could the Wildlife Crossings Pilot Program’s funding allocation contribute to broader environmental sustainability goals within federal infrastructure projects?"
                    ],

    "GAO": ["To what extent have federal programs, such as HUD’s CDBG colonias set-aside and USDA’s water and housing assistance, reduced economic disparities in these regions from 2020 to 2023?",
            "How effective have federal programs been in addressing infrastructure gaps (e.g., water and wastewater services) in colonias, based on recent site visits and assessments?",
            "How do climate projections of increasing temperatures impact the living conditions and infrastructure needs of colonias?",
            "What are the main barriers to accessing federal assistance in colonias, and how can program criteria be adjusted to improve eligibility and support?",
            "How do changes in population and demographics in colonias influence eligibility for federal assistance programs, and what legislative adjustments might be needed to ensure continued support?"
            ],

    "USSP": ["What are the total estimated changes in mandatory spending for each Farm Bill title in H.R. 8467 from FY2025 to FY2033?", "What are the projected budget changes in key Farm Bill titles, such as Nutrition, Conservation, and Commodities, by FY2029 compared to FY2025?", "What are the projected annual mandatory spending amounts in H.R. 8467 for rural development, crop insurance, and research from FY2025 to FY2033?", "Compare the Commodity Policy- documented in previous farm bill, the Agriculture Improvement Act of 2018 (P.L. 115-334)- with the new law introduced through Farm, Food, and National Security Act of 2024 (H.R. 8467), and provide me with a list of key differences.",
             "What is the PRIMARY PURPOSE of the Continuing Appropriations Act, 2018 and Supplemental Appropriations for Disaster Relief Requirements Act, 2017?",
             "How does the act - Continuing Appropriations Act, 2018 and Supplemental Appropriations for Disaster Relief Requirements Act, 2017 - address EDUCATION accountability and what are the key educational initiatives mentioned?",
             "As per the act-Continuing Appropriations Act, 2018 and Supplemental Appropriations for Disaster Relief Requirements Act, 2017,- what provisions have been made for DISASTER RELIEF and how are the allocated funds distributed among various agencies?",
             "What are the key guidelines or metrics for evaluating the EFFECTIVENESS of the programs under this act-Continuing Appropriations Act, 2018 and Supplemental Appropriations for Disaster Relief Requirements Act, 2017?",
             "How does the act-Continuing Appropriations Act, 2018 and Supplemental Appropriations for Disaster Relief Requirements Act, 2017- balance NATIONAL DEBT management while funding various appropriations?",
             "How did the temporary changes in unemployment insurance policies under the “Emergency Unemployment Insurance Stabilization and Access Act” impact claim rates and benefit accessibility?",
             "To what extent has the stipulation requiring producers to purchase crop insurance in subsequent years, as a condition of receiving aid, impacted the adoption of insurance programs among historically underserved producers? How could federal policies be adjusted to improve participation rates in these programs?",
             "In what ways have the financial provisions for the Agricultural Programs under Title I improved food security, rural development, and conservation programs in 2021? How effectively have the funds addressed the challenges faced by smallholder farmers and rural communities?",
             "What challenges did healthcare providers face in implementing telehealth solutions, and how did the CARES Act address those challenges?",
             "How effective has the supplemental funding for agricultural losses due to natural disasters (e.g., Hurricanes Michael and Florence, wildfires, floods) been in restoring crop production and supporting affected farmers? What measures could enhance the resilience of agricultural systems to such events in the future?",
             "What are the loan amounts for different agricultural credit programs?"]

}

Operational Parameters

In [39]:
def op_par(user_app, user_llm):
    # Map the chosen LLM to its persistent ChromaDB directory
    if user_llm=="llama3.1":
        new_persist_directory = f"{vector_dir}{user_app}_Embed_MXbai"
    elif user_llm=="llama3.2":
        new_persist_directory = f"{vector_dir}{user_app}_Embed_MXbai_L3.2"
    else:
        # Fail early with a clear message instead of an UnboundLocalError below
        raise ValueError(f"Unsupported LLM: {user_llm}")
    print(f"We are working with {user_app} app, LLM chosen is: {user_llm}, and the persistent directory being used is: {new_persist_directory}")

    return new_persist_directory

In [40]:
def know_your_retriever(app,persist_dir,rerank):
    
    # Start the ChromaDB client with a persistent directory
    chroma_client = chromadb.PersistentClient(path=persist_dir)

    # Load existing collection if it exists, or create a new one
    collection_name = f"collection_{app}"
    
    vector_store = Chroma(
        client=chroma_client,
        collection_name=collection_name,
        embedding_function=embed_model,
        collection_metadata=collection_metadata
    )

    if overwrite_doc_chunks=="Yes":
        doc_chunks_to_retrieve = doc_chunk_val
        top_n = 1
    elif len(chunks) > 200:
        doc_chunks_to_retrieve = 10
        top_n = doc_chunks_to_retrieve // 2
    else:
        doc_chunks_to_retrieve = len(chunks) // 2  # Integer division
        top_n = doc_chunks_to_retrieve // 5
    print(f"Top_N ={top_n}")

    # Check if doc_chunks_to_retrieve is not an integer and try converting it
    if not isinstance(doc_chunks_to_retrieve, int):
        try:
            doc_chunks_to_retrieve = int(doc_chunks_to_retrieve)
            print(f"Converted doc_chunks_to_retrieve to integer: {doc_chunks_to_retrieve}")
        except (TypeError, ValueError):
            print(f"Cannot convert doc_chunks_to_retrieve to integer: {doc_chunks_to_retrieve}")

    if overwrite_doc_chunks=="Yes":
        retriever = vector_store.as_retriever(search_kwargs={"k": doc_chunk_val})        # RETRIEVER
        compression_retriever = None
    else:
        retriever = vector_store.as_retriever(search_kwargs={"k": doc_chunks_to_retrieve})  # RETRIEVER
        compression_retriever = None

    if rerank=="Yes":
        encoder_model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3")
        #encoder_model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
        compressor = CrossEncoderReranker(model=encoder_model, top_n=top_n)
        compression_retriever = ContextualCompressionRetriever(
            base_compressor=compressor, base_retriever=retriever
        )
        retriever = None
        # compressed_retriever_context = compression_retriever.invoke(user_prompt)
        print(f"Top_N ={top_n}")
        # pretty_print_docs(compressed_retriever_context)

        
    # Output the value of 'doc_chunks_to_retrieve'

    if not retriever and not compression_retriever:
        raise ValueError("Failed to initialize a valid retriever")
    
    if rerank=="Yes":
        print(f"Retriever initiated with {doc_chunks_to_retrieve} document chunks extracted from Vector DB. The ReRanker will keep the top {top_n} documents for the RAG LLM chain.")
    else:
        print(f"Retriever initiated with {doc_chunks_to_retrieve} document chunks extracted from Vector DB.")

    return retriever, compression_retriever
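
The sizing rules in `know_your_retriever` can be isolated into a small pure function, which makes them easy to check without a vector store. This is a minimal sketch, not the notebook's code: the name `choose_retrieval_sizes`, its parameters, and the `max(1, ...)` floors are assumptions added for illustration, and the first branch is assumed to be the case where the user overrides the chunk count.

```python
def choose_retrieval_sizes(num_chunks, override_k=None):
    """Return (k, top_n): chunks fetched from the vector store and chunks kept after reranking."""
    if override_k is not None:
        # Assumed override case: the caller pins k and a single chunk survives reranking
        return int(override_k), 1
    if num_chunks > 200:
        k = 10
        return k, k // 2
    # Small corpora: fetch half of the chunks, keep roughly a fifth of those
    k = max(1, num_chunks // 2)   # hypothetical floor so k is never 0
    return k, max(1, k // 5)      # hypothetical floor so top_n is never 0


print(choose_retrieval_sizes(500))
print(choose_retrieval_sizes(40))
print(choose_retrieval_sizes(4, override_k=4))
```

Keeping the rule in one function also makes it clear which value feeds the retriever (`k`) and which feeds the reranker (`top_n`).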

In [41]:
def initialize_llm(model_name):
    print_progress("Defining the large language model")
    llm = Ollama(model=model_name, temperature=0)
    # Report the model actually passed in, not the notebook-level default
    print(f"LLM model used in this application : {model_name}")
    return llm

In [42]:
def know_your_retrieval_chain(llm, retriever):
    retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")  
    combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)
    retrieval_chain = create_retrieval_chain(retriever, combine_docs_chain)

    return retrieval_chain
    

Define the function to handle query and augment LLM response using ChromaDB retrieval

In [43]:
# Function to handle query and augment LLM response using ChromaDB retrieval
def query_data(app, llm, user_input, need_reranking_user):

    # Retrieve relevant documents from ChromaDB
    print("Retrieving documents...")
    new_persist_directory = op_par(app, llm)
    retriever, compression_retriever = know_your_retriever(app,new_persist_directory,need_reranking_user)
    if app=="Farm_Bill":
        # farm_prompt_emphasis = "Note - I need the comparison in a tabular format wherein I can see one column for the previous law and another one dedicated to the new law. The items listed in one row should refer to one and only one aspect of comparison. List these aspects in bullet format. Ensure there is a header for each column; the official titles (with any identifiers) used for each law should appear as a header."
        farm_prompt_emphasis = ""
        user_input = f"{user_input}. {farm_prompt_emphasis}"
    if need_reranking_user=="Yes" and compression_retriever:
        # Retrieve compressed context
        retrieved_docs = compression_retriever.invoke(user_input)
        
    elif need_reranking_user=="No" and retriever:
        retrieved_docs = retriever.invoke(user_input)

    else:
        raise ValueError("Retriever or compression retriever was not properly initialized")
    
    # Combine retrieved content into a single context
    context = "\n\n".join([doc.page_content for doc in retrieved_docs])

    # Create a list to capture all Page Content values
    context_page_content_list = []
    for doc in retrieved_docs:
        context_page_content_list.append(doc.page_content)
    context_page_content_string = "\n\n".join(context_page_content_list)

    query_with_context = f"Context:\n{context}\n\nQuestion: {user_input}"

    llm = initialize_llm(llm)

    if need_reranking_user=="No" and retriever: #and emphasised_prompt=="No":
        
        # Invoke the LLM chain with the combined input
        print("Invoking LLM - Only the defined context shaped this response...")

        # call the function to fetch the correct retriever chain ****
        retrieval_chain = know_your_retrieval_chain(llm,retriever)
        response = retrieval_chain.invoke({"input": query_with_context})
        context_response = response.get('answer', 'No response generated.')
        context_response_reranked = "The response you are looking for is displayed in the box titled 'RAG augmented LLM Response'"

    elif need_reranking_user=="Yes" and compression_retriever: #and emphasised_prompt=="No":

        # Invoke the LLM chain with the combined input
        print("Invoking LLM - Only the reranked context shaped this response...")

        # call the function to fetch the correct retrieval chain ****
        retrieval_chain = know_your_retrieval_chain(llm,compression_retriever)
        response = retrieval_chain.invoke({"input": query_with_context})
        context_response_reranked = response.get('answer', 'No response generated.')
        context_response ="The response you are looking for is displayed in the box titled 'RAG augmented LLM Response with ReRanking'"

    return context_response_reranked, context_response, context_page_content_string
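
To make the prompt assembly in `query_data` concrete, the sketch below reproduces the context-joining step on stand-in documents. `FakeDoc` and `build_query_with_context` are hypothetical names used only here; in the notebook the objects are the documents returned by the retriever.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is needed here."""
    page_content: str


def build_query_with_context(docs, user_input):
    # Join the chunk texts with blank lines, then append the question
    context = "\n\n".join(doc.page_content for doc in docs)
    return f"Context:\n{context}\n\nQuestion: {user_input}"


docs = [FakeDoc("Chunk one."), FakeDoc("Chunk two.")]
print(build_query_with_context(docs, "What changed?"))
```

One design point worth checking: a chain built with `create_retrieval_chain` runs its own retriever on the `input` it receives, so passing this combined string as `input` likely means retrieval happens a second time on the longer, context-laden text.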

**Gradio Interface**

In [44]:
# Function to update prompt titles based on selected application
def update_prompts(app):
    # Generate titles (Prompt 1, Prompt 2, etc.)
    prompt_titles = [f"Prompt {i+1}" for i in range(len(prompts[app]))]
    return gr.update(choices=prompt_titles)

In [45]:
# Function to populate textbox with selected full prompt
def populate_textbox(app, prompt_title):
    prompt_index = int(prompt_title.split()[-1]) - 1  # Extract number from "Prompt X"
    full_prompt = prompts[app][prompt_index]
    return gr.update(value=full_prompt)
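
`populate_textbox` assumes the dropdown always sends a title of the form "Prompt X" that is valid for the current application. The value can be empty if the selection is cleared, or stale if the application changes to one with fewer prompts. A minimal defensive sketch, where `prompt_index_from_title` is a hypothetical helper that is not part of the notebook:

```python
import re


def prompt_index_from_title(prompt_title, num_prompts):
    """Return the zero-based index encoded in 'Prompt X', or None if it is unusable."""
    if not prompt_title:
        return None
    match = re.fullmatch(r"Prompt (\d+)", prompt_title.strip())
    if match is None:
        return None
    index = int(match.group(1)) - 1
    # Reject indices that fall outside the current application's prompt list
    return index if 0 <= index < num_prompts else None


print(prompt_index_from_title("Prompt 3", 5))
print(prompt_index_from_title("Prompt 9", 5))
print(prompt_index_from_title(None, 5))
```

A caller could leave the textbox unchanged whenever the helper returns `None`.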

In [46]:
# unRAG functionality
def handle_llm_unrag_response(llm, input_text):
    # Query the model directly, without any retrieved context, for comparison
    llm = Ollama(model=llm, temperature=0)
    response = llm.invoke(input_text)
    return response

def compress_or_expand(visible_state):
    # Toggle compressed or expanded state; returns current toggle status
    new_visible_state = not visible_state
    return gr.update(visible=new_visible_state), new_visible_state


# Conditional logic for unRag response display
def show_unrag_response():
    # Show the new row if 'Yes' for unRag response
    return gr.update(visible=True), gr.update(visible=True)

In [47]:
# Gradio Interface
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
# Add dropdown options for Application and LLM
applications = ["Farm_Bill", "USSP_Light", "USSP", "GAO"]
llm_versions = ["llama3.1", "llama3.2"]

global_response = None

chart_suffix = ". While generating the response, please abide by the following instructions, which will help me chart the information. Please output the values in the format: X-values: [1,2,3] and Y-values: [4,5,6]. Always add the numeric (or continuous) values to Y-values, and the categorical ones to X-values. Provide me with the legends for each axis in the format X-axis legend : 'abc' and Y-axis legend: 'xyz'. Also, I need a title for the chart in the format Title : 'mno'. While outputting the values, do not abbreviate the numbers in any form (such as 3.2million, 450 thousand, 3.5e9). Furthermore, do not add any letters or special characters to Y-values except for currency symbols, and do not separate the digits within a given number with commas (,). For example, write $100000, not $100,000. Finally, add the exact values without rounding them."

def handle_llm_response(app, llm, input_text, reranking):
    global global_response
    context_response_reranked, context_response, context_page_content_string = query_data(app, llm, input_text, reranking)
    if reranking=="No":
        global_response = context_response
    elif reranking=="Yes":
        global_response = context_response_reranked
    print(f"Global Response: {global_response}")
    # Return all three values for further processing
    return context_response_reranked, context_response, context_page_content_string

def handle_llm_response_and_clear(app, llm, input_text, reranking):
    # Clear the existing chart and process response
    clear_output = preprocess_inputs()  # Get cleared chart placeholder
    context_response_reranked, context_response, context_page_content_string = handle_llm_response(app, llm, input_text, reranking)
    return context_response_reranked, context_response, context_page_content_string, clear_output[3]

def chart_it_up_combined(app, llm, user_input, reranking):
    modified_input = append_text(user_input)
    context_response_reranked, context_response, context_page_content_string = handle_llm_response(app, llm, modified_input, reranking)
    chart, temp_file = process_unstructured_llm_response(global_response)
    #print(f"Chart: {chart}")
    return context_response_reranked, context_response, context_page_content_string, chart, temp_file

def process_unstructured_llm_response(response):
    data = extract_keywords(response)
    return generate_chart(data)

def generate_chart(data):
    x_values = data['x']
    y_values = data['y']
    title = data.get('title', 'Chart')
    x_legend = data.get('x-legend', 'X-axis')
    y_legend = data.get('y-legend', 'Y-axis')

    fig, ax = plt.subplots(facecolor='#e0ffe0', figsize=(12, 6))
    ax.set_facecolor('#e0ffe0')
    if len(x_values) == len(y_values) and x_values and y_values:
        # ax.plot(x_values, y_values, marker='o', linestyle='-') # line plot
        ax.bar(x_values, y_values, color='#2a9d8f') # bar chart
        ax.set_title(title)
        ax.set_xlabel(x_legend)
        ax.set_ylabel(y_legend)

        # Rotating x-axis labels for better readability
        plt.xticks(rotation=45, ha='right')

    # if data['x'] and data['y']:
    #     ax.plot(data['x'], data['y'], marker = 'o', linestyle = '-')
    else:
        ax.text(0.5, 0.5, 'Error: x and y values must have the same length',
                fontsize=12, ha='center', va='center', color='red')
        ax.set_title('Chart Error')
        
    plt.tight_layout()

    temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".png")
    fig.savefig(temp_file.name)
    temp_file_path = temp_file.name
    temp_file.close()
    return fig, temp_file_path

def extract_keywords(response):
    x_values, y_values = [], []
    title, x_legend, y_legend = '', '', ''

    if "X-values:" in response:
        x_section = response.split("X-values:")[1].split("\n")[0]
        x_values = [x.strip() for x in x_section.replace("[", "").replace("]", "").split(",")]

    if "Y-values:" in response:
        y_section = response.split("Y-values:")[1].split("\n")[0]

        # Find all groups that start with a dollar sign followed by numbers and commas
        y_matches = re.findall(r'\$[\d,]+', y_section)
        # Remove the dollar sign and commas, then convert to integers
        y_values = [int(y.replace('$', '').replace(',', '')) for y in y_matches]

        # y_values = [int(re.sub(r'[^\d]', '', y.strip())) for y in y_section.replace("[", "").replace("]", "").split(",") if re.sub(r'[^\d]', '', y.strip()).isdigit()]

    title_match = re.search(r"Title\s*:\s*(.+)", response)
    if title_match:
        title = title_match.group(1).split("\n")[0].strip("'")

    # if "Title:" in response:
    #     title_section = response.split("Title:")[1].split("\n")[0].strip()
    #     title = title_section.strip("'")
    
    x_legend_match = re.search(r"X-axis legend\s*:\s*(.+)", response)
    if x_legend_match:
        x_legend = x_legend_match.group(1).split("\n")[0].strip("'")

    # if "X-axis legend:" in response:
    #     x_legend_section = response.split("X-axis legend:")[1].split("\n")[0].strip()
    #     x_legend = x_legend_section.strip("'")

    y_legend_match = re.search(r"Y-axis legend\s*:\s*(.+)", response)
    if y_legend_match:
        y_legend = y_legend_match.group(1).split("\n")[0].strip("'")

    # if "Y-axis legend:" in response:
    #     y_legend_section = response.split("Y-axis legend:")[1].split("\n")[0].strip()
    #     y_legend = y_legend_section.strip("'")

    print(f"x_values for plotting: {x_values}, y_values for plotting: {y_values}")
    print(f"x-values length: {len(x_values)}, y-values length: {len(y_values)}")
    return {'x': x_values, 'y': y_values, 'title': title, 'x-legend': x_legend, 'y-legend': y_legend}

def append_text(input_text):
    return f"{input_text} {chart_suffix}"

def preprocess_inputs():
    fig, ax = plt.subplots(facecolor='#e0ffe0')
    ax.set_facecolor('#e0ffe0')
    ax.text(0.5, 0.5, 'No chart generated', fontsize=14, ha='center', va='center', color='red')
    ax.set_axis_off()  # Turn off axis for a blank chart
    return gr.update(value=''), gr.update(value=''), gr.update(value=''), fig

def determine_greeting():
    """Determine the greeting based on the current time of day."""
    current_hour = datetime.now().hour
    if current_hour < 12:
        return "Good Morning"
    elif current_hour < 18:
        return "Good Afternoon"
    else:
        return "Good Evening"

def update_logos(app):
    """Update logo paths based on selected application."""
    logo_path_right = logos_dir
    if app == "USSP" or app == "USSP_Light":
        logo_path_right += "USASpending_logo_2.jpg"
    elif app == "GAO":
        logo_path_right += "GAO-Logo.jpg"
    elif app == "Farm_Bill":
        logo_path_right += "Farm_Bill_0.webp"
    return logo_path_right

def update_app_text(app):
    """Set app prefix based on selected application."""
    if app == "USSP":
        return "<h2 id='title'><b>United States Spending</b></h2>"
    elif app == "USSP_Light":
        return "<h2 id='title'><b>United States Spending (Light)</b></h2>"
    elif app == "GAO":
        return "<h2 id='title'><b>Government Accountability Office</b></h2>"
    elif app == "Farm_Bill":
        return "<h2 id='title'><b>Farm Bill Comparison</b></h2>"
    
# Set the value of app_prefix based on the selected application
if application in ("USSP", "USSP_Light"):
    app_prefix = "United States Spending "
elif application == "GAO":
    app_prefix = "Government Accountability Office "
elif application == "Farm_Bill":
    app_prefix = "Farm Bill Comparison "
else:
    app_prefix = ""  # Fall back to an unprefixed title for unknown applications

# Gradio Block

logo_path_left = f"{logos_dir}FI_logo.jpg"
logo_path_right = update_logos(application)  # Reuse the same mapping as the dropdown callback
with gr.Blocks(
    css="""
    #dropdown-container {
        display: flex;
        justify-content: flex-end;
        align-items: flex-end;
        gap: 5px;
        margin-bottom: 5px;
    }
    #dropdown {
        width: 50px;
        align-items: flex-end;
    }
    #logo-container {
        display: flex;
        align-items: center;
        justify-content: space-between;  /* Distribute space between left and right containers */
        margin-bottom: 5px;
    }
    #logo-left {
        margin-right: 5px;
        width: 50px;  /* Adjust the size as needed */
    }
    #right-container {
        display: flex;
        flex-direction: column;  /* Stack logo and title vertically */
        align-items: center;  /* Center elements within the container */
        text-align: center;  /* Center text within the column */
    }
    #logo-right {
        margin-bottom: 5px;
        width: 50px;  /* Adjust the size as needed */
    }
    #title {
        margin-top: 5px;
    }
    """,
    theme=Glass(), title=f"{app_prefix}I am the LLM Assistant you need!") as demo:

    # Top-right dropdowns for Application and LLM selection
    with gr.Column(elem_id="dropdown-container"):
        with gr.Row(elem_id="dropdown-container"):
            selected_app = gr.Dropdown(label="Application", choices=applications, value=application, interactive=True, elem_id="dropdown")
            selected_llm = gr.Dropdown(label="LLM", choices=llm_versions, value=chosen_model, interactive=True, elem_id="dropdown")

    with gr.Row(elem_id="logo-container"):
        # Left logo
        gr.Image(logo_path_left,
                 elem_id="logo-left",
                 show_label=False,
                 width=5,
                 show_download_button=False,
                 show_fullscreen_button=False,
                 interactive=False)
        
        with gr.Column(elem_id="right-container"):
            right_logo = gr.Image(logo_path_right,
                     elem_id="logo-right",
                     show_label=False,
                     width=5,
                     show_download_button=False,
                     show_fullscreen_button=False,
                     interactive=False)
            app_text = gr.HTML(update_app_text(selected_app.value))

            selected_app.change(fn=update_logos, inputs=selected_app, outputs=right_logo)
            selected_app.change(fn=update_app_text, inputs=selected_app, outputs=app_text)
    
    gr.Markdown("I'm an intelligent assistant, and I can answer your questions.")

    # Use the greeting function to set the dynamic message
    greeting = determine_greeting()
    textbox = gr.Textbox(
        value=user_prompt,
        label=f"{greeting} - How may I help you today?",
        placeholder="Type here and press Enter...",
        lines=3, max_lines=7
    )

    # Initialize with prompt titles for the application selected by default in the dropdown
    initial_app = application  # Matches the default value of the Application dropdown
    initial_prompt_titles = [f"Prompt {i+1}" for i in range(len(prompts[initial_app]))]

    # Preset Prompts dropdown (starts with titles based on the default application)
    preset_prompts = gr.Dropdown(label="Preset Prompts", choices=initial_prompt_titles, interactive=True)
    selected_app.change(fn=update_prompts, inputs=selected_app, outputs=preset_prompts)

    # Populate textbox with full prompt on selection
    preset_prompts.change(fn=populate_textbox, inputs=[selected_app, preset_prompts], outputs=textbox)

    # Need ReRanking option - User Choice
    need_reranking_user = gr.Radio(choices=["Yes", "No"], label="Need ReRanking?", value=need_reranking)
    # need_chart = gr.Radio(choices=["Yes", "No"], label="Chart it up", value="Yes")
    with gr.Row():

        submit_button = gr.Button("RagIt", variant="primary")
        chart_it_up_button = gr.Button("Chart it up", variant="secondary")
        # UnRag response button
        unrag_button = gr.Button("unRag", visible=need_unrag_response == "Yes", variant="primary")
        wrap_button = gr.Button("Wrap", visible=need_unrag_response == "Yes", variant="stop")

    # Response Row 1:
    with gr.Row():
        if need_reranking=="No": #and emphasised_prompt=="No":
            output1 = gr.Textbox(
                value="The response you are looking for is displayed in the box titled 'RAG augmented LLM Response'",
                lines=5, max_lines=10,
                label="RAG augmented LLM Response with ReRanking",
                placeholder="Generating the RAG augmented LLM Response with ReRanking..."
                )
        
        elif need_reranking=="Yes": #and emphasised_prompt=="No":
            output1 = gr.Textbox(
                value=rag_response.get('answer', 'No response generated.'),
                lines=5, max_lines=10,
                label="RAG augmented LLM Response with ReRanking",
                placeholder="Generating the RAG augmented LLM Response with ReRanking..."
                )
    # Response Row 2:
    with gr.Row():
        if need_reranking=="No": #and emphasised_prompt=="No":
            output2 = gr.Textbox(
                value=rag_response.get('answer', 'No response generated.'),
                lines=5, max_lines=10,
                label="RAG augmented LLM Response",
                placeholder="Generating the RAG augmented LLM Response..."
                )
        
        elif need_reranking=="Yes": #and emphasised_prompt=="No":
            output2 = gr.Textbox(
                value="The response you are looking for is displayed in the box titled 'RAG augmented LLM Response with ReRanking'",
                lines=5, max_lines=10,
                label="RAG augmented LLM Response",
                placeholder="Generating the RAG augmented LLM Response..."
                )
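    # Response Row 3:
    with gr.Row():
        # Always create the box so the unRag and Wrap handlers have a valid output target;
        # it stays hidden unless unRag responses are enabled
        unrag_response_box = gr.Textbox(label="unRag Response", placeholder="Press the unRag button to get a response", visible=need_unrag_response == "Yes", interactive=False, lines=2, max_lines=5)
        # Track the box's visibility so the Wrap button toggles from the correct starting state
        visible_state = gr.State(need_unrag_response == "Yes")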
    # Response Row 4:
    with gr.Row():
        output3 = gr.Textbox(
            value=page_content_string,
            lines=10, max_lines=30,
            label="Context specific documents retrieved from Vector DB",
            placeholder="Extracting the context..."
        )
    
    # Chart Row 5:
    with gr.Row():
        chart_output = gr.Plot(label="Generated Chart")
    with gr.Row():
        download_link = gr.File(label="Download Chart", visible=True)

    #submit_button.click(fn = preprocess_inputs, inputs=[], outputs=[output1, output2, output3, chart_output], preprocess=True)
    
    # textbox.submit(fn = preprocess_inputs, inputs=[], outputs=[output1, output2, output3, chart_output], preprocess=True)
  
    submit_button.click(
        lambda app, llm, user_input, reranking: handle_llm_response_and_clear(app, llm, user_input, reranking),
        inputs=[selected_app, selected_llm, textbox, need_reranking_user],
        outputs=[output1, output2, output3, chart_output]
    )

    # textbox.submit(
    #     lambda app, llm, user_input, reranking: handle_llm_response_and_clear(app, llm, user_input, reranking),
    #     inputs=[selected_app, selected_llm, textbox, need_reranking_user],
    #     outputs=[output1, output2, output3, chart_output]
    #     ) # only works when no. of lines is set to 1

    # Chart it up button behavior (combines all steps)
    chart_it_up_button.click(fn = preprocess_inputs, inputs=[], outputs=[output1, output2, output3, chart_output], preprocess=True)

    chart_it_up_button.click(
        chart_it_up_combined,
        inputs=[selected_app, selected_llm, textbox, need_reranking_user],
        outputs=[output1, output2, output3, chart_output, download_link]
    )

    # Button click to trigger LLM unRag response
    unrag_button.click(
        fn=handle_llm_unrag_response,
        inputs=[selected_llm, textbox],
        outputs=unrag_response_box
    )

    # Expand/Compress toggle for the unRag response box
    wrap_button.click(fn=compress_or_expand, inputs=[visible_state], outputs=[unrag_response_box, visible_state])

    # Display the UI

demo.launch(allowed_paths=[f"{logos_dir}"])
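
`extract_keywords` keeps only Y-values that carry a dollar sign, so a response such as `Y-values: [4, 5, 6]` yields an empty list and produces the chart error. The sketch below is one possible, more tolerant parser. `parse_y_values` and its regular expression are assumptions for illustration, not the notebook's implementation. Note the inherent ambiguity: a list written without spaces, such as `[100,200,300]`, is read as a single comma-grouped number.

```python
import re

# Either a comma-grouped number ($1,050,000) or a plain number ($600000000, 5.5);
# the dollar sign is optional in both forms
_NUMBER = re.compile(r"\$?\d{1,3}(?:,\d{3})+(?:\.\d+)?|\$?\d+(?:\.\d+)?")


def parse_y_values(y_section):
    """Extract numeric values from the text that follows 'Y-values:'."""
    values = []
    for token in _NUMBER.findall(y_section):
        cleaned = token.replace("$", "").replace(",", "")
        # Keep integers as int; only values with a decimal point become float
        values.append(float(cleaned) if "." in cleaned else int(cleaned))
    return values


print(parse_y_values("[$600000000, $640000000, $650000000]"))
print(parse_y_values("[$1,050,000, $1,050,000]"))
print(parse_y_values("[4, 5.5, 6]"))
```

Negative numbers and abbreviated forms such as `3.2million` are deliberately not handled here; the chart prompt already asks the model to avoid abbreviations.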


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




Retrieving documents...
We are working with USSP_Light app, LLM chosen is: llama3.1, and the persistent directory being used is: /Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Vector_DB_Embeddings/USSP_Light_Embed_MXbai
Top_N =1
Top_N =1
Retriever initiated with 4 document chunks extracted from Vector DB. However, ReRanker has identified the top 1 documents for the RAG LLM chain.

Defining the large language model

LLM model used in this application : llama3.2
Invoking LLM - Only the reranked context shaped this response...




Global Response: Based on the context provided, here is the answer to your question:

The funding for the Bridge Investment Program varies across fiscal years from 2022 to 2026 as follows:

X-values: [2022, 2023, 2024, 2025, 2026]
Y-values: [$600000000, $640000000, $650000000, $675000000, $700000000]

Title : 'Funding for Bridge Investment Program across Fiscal Years'
X-axis legend : 'Fiscal Year'
Y-axis legend: '$ (USD)'

Let me know if you'd like me to help with anything else!
x_values for plotting: ['2022', '2023', '2024', '2025', '2026'], y_values for plotting: [600000000, 640000000, 650000000, 675000000, 700000000]
x-values length: 5, y-values length: 5
Retrieving documents...
We are working with USSP_Light app, LLM chosen is: llama3.1, and the persistent directory being used is: /Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Vector_DB_Embeddings/USSP_Light_Embed_MXbai
Top_N =1
Retriever initiated with 4 document chunks extracted from V



Global Response: According to the text, the funding for the Bridge Investment Program varies as follows:

* For fiscal year 2022, $16,000,000 is allocated.
* For fiscal year 2023, $18,000,000 is allocated.
* For fiscal year 2024, $20,000,000 is allocated.
* For fiscal year 2025, $22,000,000 is allocated.
* For fiscal year 2026, $24,000,000 is allocated.

Note that the funding increases by $2 million each year.
Retrieving documents...
We are working with USSP_Light app, LLM chosen is: llama3.1, and the persistent directory being used is: /Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Vector_DB_Embeddings/USSP_Light_Embed_MXbai
Top_N =1
Top_N =1
Retriever initiated with 4 document chunks extracted from Vector DB. However, ReRanker has identified the top 1 documents for the RAG LLM chain.

Defining the large language model

LLM model used in this application : llama3.2
Invoking LLM - Only the reranked context shaped this response...




Global Response: According to the text, the funding for the Bridge Investment Program (section 202(d)) varies as follows:

* Fiscal year 2022: $16,000,000
* Fiscal year 2023: $18,000,000
* Fiscal year 2024: $20,000,000
* Fiscal year 2025: $22,000,000
* Fiscal year 2026: $24,000,000

This represents an increase of $2 million per year.
Retrieving documents...
We are working with USSP_Light app, LLM chosen is: llama3.1, and the persistent directory being used is: /Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Vector_DB_Embeddings/USSP_Light_Embed_MXbai
Top_N =1
Top_N =1
Retriever initiated with 4 document chunks extracted from Vector DB. However, ReRanker has identified the top 1 documents for the RAG LLM chain.

Defining the large language model

LLM model used in this application : llama3.2
Invoking LLM - Only the reranked context shaped this response...




Global Response: According to the text, the funding for the Bridge Investment Program (section 202(d)) varies as follows:

* Fiscal year 2022: $16,000,000
* Fiscal year 2023: $18,000,000
* Fiscal year 2024: $20,000,000
* Fiscal year 2025: $22,000,000
* Fiscal year 2026: $24,000,000

This represents an increase of $2 million per year.
Retrieving documents...
We are working with USSP_Light app, LLM chosen is: llama3.1, and the persistent directory being used is: /Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Vector_DB_Embeddings/USSP_Light_Embed_MXbai
Top_N =1
Retriever initiated with 4 document chunks extracted from Vector DB.

Defining the large language model

LLM model used in this application : llama3.2
Invoking LLM - Only the defined context shaped this response...




Global Response: The total funding allocated annually for the Federal-Aid Highway Program between 2022 and 2026 can be calculated by adding up the amounts specified in section (a) of the text.

Section (a) mentions several programs, but it does not specify a total amount for the Federal-Aid Highway Program. However, we can infer that the total funding allocated annually for the Federal-Aid Highway Program is the sum of the following:

* $27,500,000,000 (bridge replacement, rehabilitation, preservation, protection, and construction program)
* $285,975,000 (Federal lands access program) + $291,975,000 (Federal lands access program) + $296,975,000 (Federal lands access program) + $303,975,000 (Federal lands access program) + $308,975,000 (Federal lands access program)
* $219,000,000 (territorial and Puerto Rico highway program) + $224,000,000 (territorial and Puerto Rico highway program) + $228,000,000 (territorial and Puerto Rico highway program) + $232,500,000 (territorial and Puerto Ri



Global Response: Based on the provided context, I cannot determine the total funding allocated annually for the Federal-Aid Highway Program between 2022 and 2026. The information provided only lists specific programs with their respective funding allocations, but does not provide a comprehensive breakdown of the total funding for the Federal-Aid Highway Program.
Retrieving documents...
We are working with USSP_Light app, LLM chosen is: llama3.1, and the persistent directory being used is: /Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Vector_DB_Embeddings/USSP_Light_Embed_MXbai
Top_N =1
Top_N =1
Retriever initiated with 4 document chunks extracted from Vector DB. However, ReRanker has identified the top 1 documents for the RAG LLM chain.

Defining the large language model

LLM model used in this application : llama3.2
Invoking LLM - Only the reranked context shaped this response...




Global Response: Based on the provided context, I cannot determine the total funding allocated annually for the Federal-Aid Highway Program between 2022 and 2026. The information provided only lists specific programs with their respective funding allocations, but does not provide a comprehensive breakdown of the total funding for the Federal-Aid Highway Program.
Retrieving documents...
We are working with USSP_Light app, LLM chosen is: llama3.1, and the persistent directory being used is: /Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Technical_Deliverables/Vector_DB_Embeddings/USSP_Light_Embed_MXbai
Top_N =1
Retriever initiated with 4 document chunks extracted from Vector DB.

Defining the large language model

LLM model used in this application : llama3.2
Invoking LLM - Only the defined context shaped this response...




Global Response: Based on the provided extracts from USSP_Light_DOWNLOAD_20241030_051135/Z_document_1.pdf - Page 16 and USSP_Light_DOWNLOAD_20241030_051135/1_document_1.pdf - Page 16, here are the total appropriations for transportation programs like Tribal Transportation, Federal Lands Transportation, and Federal Lands Access from 2022 to 2026:

1. **Tribal Transportation Program**:
	* Total appropriation: $612,960,000 (2025) + $627,960,000 (2026) = $1,240,920,000
	* Breakdown by year:
		+ 2022: Not explicitly stated in the provided extracts.
		+ 2023: Not explicitly stated in the provided extracts.
		+ 2024: Not explicitly stated in the provided extracts.
		+ 2025: $612,960,000
		+ 2026: $627,960,000
2. **Federal Lands Transportation Program**:
	* Total appropriation: $421,965,000 (2022) + $429,965,000 (2023) + $438,965,000 (2024) + $447,965,000 (2025) + $455,965,000 (2026) = $2,193,825,000
	* Breakdown by year:
		+ 2022: $421,965,000
		+ 2023: $429,965,000
		+ 2024: $438,965,000
		+



Global Response: Based on the context provided, I can answer your question.

The total appropriations for transportation programs like Tribal Transportation, Federal Lands Transportation, and Federal Lands Access from 2022 to 2026 are:

* For fiscal year 2022: $52,488,065,375 (A) + $250,000,000 (transportation infrastructure finance and innovation program) + $578,460,000 (tribal transportation program) = $53,316,525,375
* For fiscal year 2023: $53,537,826,683 (B) + $250,000,000 (transportation infrastructure finance and innovation program) + $589,960,000 (tribal transportation program) = $54,377,786,683
* For fiscal year 2024: $54,608,583,217 (C) + $250,000,000 (transportation infrastructure finance and innovation program) + $602,460,000 (tribal transportation program) = $55,461,043,217
* For fiscal year 2025: $55,700,754,881 (D) + $250,000,000 (transportation infrastructure finance and innovation program) + $602,460,000 (tribal transportation program) = $56,553,214,881
* For fiscal ye



Global Response: Based on the context provided, I will calculate the total appropriations for transportation programs like Tribal Transportation, Federal Lands Transportation, and Federal Lands Access from 2022 to 2026.

For Fiscal Year 2022:
- Tribal Transportation Program: $578,460,000
- Transportation Infrastructure Finance and Innovation Program: $250,000,000
Total for FY 2022: $828,460,000

For Fiscal Year 2023:
- Tribal Transportation Program: $589,960,000
- Transportation Infrastructure Finance and Innovation Program: $250,000,000
Total for FY 2023: $839,960,000

For Fiscal Year 2024:
- Tribal Transportation Program: $602,460,000
- Transportation Infrastructure Finance and Innovation Program: $250,000,000
Total for FY 2024: $852,460,000

For Fiscal Year 2025:
- Tribal Transportation Program: $55,700,754,881 (from context A) 
- Transportation Infrastructure Finance and Innovation Program: $250,000,000
Total for FY 2025: $55,950,754,881

For Fiscal Year 2026:
- Tribal Transportati



Global Response: Based on the provided text, I can extract the following information:

For Tribal Transportation:
X-values: [2022, 2023, 2024, 2025, 2026]
Y-values: [$1,050,000, $1,050,000, $1,050,000, $1,050,000, $1,050,000]

For Federal Lands Transportation:
X-values: [2022, 2023, 2024, 2025, 2026]
Y-values: [$1,050,000, $1,050,000, $1,050,000, $1,050,000, $1,050,000]

For Federal Lands Access:
X-values: [2022, 2023, 2024, 2025, 2026]
Y-values: [$1,050,000, $1,050,000, $1,050,000, $1,050,000, $1,050,000]

Here is the chart with the extracted values:

Title : 'Transportation Program Appropriations (2022-2026)'
X-axis legend : 'Year'
Y-axis legend: '$ in millions'

Note that all three programs have the same appropriation amount of $1,050,000 for each year from 2022 to 2026.
x_values for plotting: ['2022', '2023', '2024', '2025', '2026'], y_values for plotting: [1050000, 1050000, 1050000, 1050000, 1050000]
x-values length: 5, y-values length: 5
Retrieving documents...
We are working wit
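
The log above shows the extracted Y-values going from strings such as `$1,050,000` to integers before plotting. A minimal sketch of that step follows, assuming the values arrive as dollar-formatted strings. `parse_dollars`, `to_millions`, and `plot_appropriations` are hypothetical helpers, not the notebook's own functions. Because the Y-axis legend reads '$ in millions' while the extracted values are raw dollars, the sketch also scales the values before plotting.

```python
def parse_dollars(values):
    """Convert strings like '$1,050,000' to integers (hypothetical helper)."""
    return [int(v.replace("$", "").replace(",", "").strip()) for v in values]


def to_millions(values):
    """Scale raw dollar amounts so they match a '$ in millions' axis label."""
    return [v / 1_000_000 for v in values]


def plot_appropriations(x_values, y_millions, title):
    """Draw a bar chart; matplotlib is imported lazily so parsing works without it."""
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
    import matplotlib.pyplot as plt

    if len(x_values) != len(y_millions):
        raise ValueError("x and y must have the same length")

    fig, ax = plt.subplots(facecolor="#e0ffe0")
    ax.bar(x_values, y_millions)
    ax.set_title(title)
    ax.set_xlabel("Year")
    ax.set_ylabel("$ in millions")
    return fig


# Values as they appear in the response above.
x_values = ["2022", "2023", "2024", "2025", "2026"]
y_values = parse_dollars(["$1,050,000"] * 5)
y_millions = to_millions(y_values)

print(f"x-values length: {len(x_values)}, y-values length: {len(y_values)}")
```

Calling `plot_appropriations(x_values, y_millions, "Transportation Program Appropriations (2022-2026)")` would then return a figure whose bar heights agree with the axis label.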



  fig, ax = plt.subplots(facecolor='#e0ffe0')



Defining the large language model

LLM model used in this application : llama3.2
Invoking LLM - Only the reranked context shaped this response...




Global Response: Based on the context provided, I will calculate the total appropriations for transportation programs like Tribal Transportation, Federal Lands Transportation, and Federal Lands Access from 2022 to 2026.

For Fiscal Year 2022:
- Tribal Transportation Program: $578,460,000
- Transportation Infrastructure Finance and Innovation Program: $250,000,000
Total for FY 2022: $828,460,000

For Fiscal Year 2023:
- Tribal Transportation Program: $589,960,000
- Transportation Infrastructure Finance and Innovation Program: $250,000,000
Total for FY 2023: $839,960,000

For Fiscal Year 2024:
- Tribal Transportation Program: $602,460,000
- Transportation Infrastructure Finance and Innovation Program: $250,000,000
Total for FY 2024: $852,460,000

For Fiscal Year 2025:
- Tribal Transportation Program: $55,700,754,881 (from context A) 
- Transportation Infrastructure Finance and Innovation Program: $250,000,000
Total for FY 2025: $55,950,754,881

For Fiscal Year 2026:
- Tribal Transportati

  response = llm(input_text)
