# RAG for Legislative Bill Classification

Manually classifying legislative text is a major bottleneck, especially for a label space exceeding 120 unique policy codes. This approach uses a Retrieval-Augmented Generation (RAG) framework to automate the process.

The system is built on a ground-truth corpus of more than 300 bills annotated by domain experts. It works as follows:

- **Embedding**: A `bge-large-en-v1.5` model creates dense vector representations of all coded paragraphs and definitions
- **Indexing**: The embeddings are L2-normalized and stored in a **FAISS** inner-product index, so similarity search ranks by cosine similarity
- **Retrieval**: For any new paragraph, the system retrieves the most semantically similar annotated examples from the index
- **Prompting**: Instead of overwhelming an LLM with all 120+ possible codes, the system builds a focused prompt containing only the retrieved examples and their code definitions, which keeps the prompt short and is intended to improve accuracy
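
The retrieval step above can be sketched without any model or index library. This is a minimal illustration, assuming tiny hand-written vectors in place of real `bge-large-en-v1.5` embeddings (which have 1024 dimensions); the document IDs and the `top_k` helper are hypothetical. It shows why normalizing vectors before an inner-product search gives a cosine-similarity ranking.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so the inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query, corpus, k=2):
    """Return (score, doc_id) pairs for the k corpus vectors most similar to the query."""
    q = l2_normalize(query)
    scored = []
    for doc_id, vec in corpus.items():
        v = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), doc_id))
    return sorted(scored, reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings", for illustration only.
toy_corpus = {
    "housing_example": [0.9, 0.1, 0.0],
    "transport_example": [0.0, 0.2, 0.9],
    "zoning_example": [0.7, 0.6, 0.1],
}
results = top_k([1.0, 0.2, 0.0], toy_corpus, k=2)
print([doc_id for _, doc_id in results])
```

A flat inner-product index performs the same exhaustive comparison, only vectorized, so the ranking logic carries over directly.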

## Model Evaluation

To start, I am conducting a comparative analysis to find the optimal balance of performance and cost by testing two distinct LLM strategies:

### 1. Local Inference with Gemma 2 (9B)
- **Model**: A locally hosted instance of Google `Gemma 2 9B`, loaded with **4-bit quantization** via `BitsAndBytesConfig`
- **Hardware**: Runs locally on my workstation using an **NVIDIA RTX 4000 Ada Generation GPU (20GB)**
- **Advantages**: This setup provides an unconstrained environment for experimentation, free from API latency, rate limits, or costs

### 2. API-Based Inference with Gemini 2.5 Flash-Lite
- **Model**: Google's `Gemini 2.5 Flash-Lite`, accessed via the Google AI Studio API
- **Purpose**: Serves as a hosted performance baseline that requires no local GPU
- **Trade-Offs**: This approach is subject to API quotas, network latency, and potential costs

The main objective is to quantify the performance differential between these two approaches to identify the most effective architecture for this task.
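
One way to quantify that differential is to score each model's predicted code sets against the expert annotations. The sketch below assumes micro-averaged precision, recall, and F1 as the metric; the notebook does not state which metric is used, and the code names and the `micro_prf` helper are hypothetical.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall, and F1 over per-paragraph code sets."""
    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, predicted):
        gold_set, pred_set = set(gold_codes), set(pred_codes)
        tp += len(gold_set & pred_set)   # codes predicted and annotated
        fp += len(pred_set - gold_set)   # codes predicted but not annotated
        fn += len(gold_set - pred_set)   # codes annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical code names and predictions, for illustration only.
gold = [["Housing Subsidy", "State Agency"], ["Zoning Reform"]]
local_preds = [["Housing Subsidy"], ["Zoning Reform", "Tax Credit"]]
api_preds = [["Housing Subsidy", "State Agency"], ["Zoning Reform"]]

for name, preds in [("local", local_preds), ("api", api_preds)]:
    p, r, f1 = micro_prf(gold, preds)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Micro-averaging weights every code assignment equally, which suits a label space where many codes are rare; a macro average per code would be a reasonable alternative.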

## Next Steps: Multi-Agent Collaboration Framework

The next phase will evolve the project from a comparative analysis into a multi-agent framework. The plan is to explore how collaboration between models can enhance classification accuracy and robustness.

- **Goal**: Move beyond single-model classification to a system where multiple agents work together through a consensus or hierarchical structure
- **Agents**: The framework will orchestrate four models: two running locally (`Gemma 2 9B` and Microsoft `Phi-3-mini`) and two accessed via API (`Gemini 2.5 Flash-Lite` and `Gemini 2.0 Flash`)
- **Next**: Research the specific strengths and weaknesses of each model to assign specialized roles, such as pre-processing, candidate generation, or final verification
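
As a rough illustration of the consensus option, the sketch below keeps a code only when a strict majority of agents propose it. The `majority_vote` helper, the agent keys, and the code names are all hypothetical; the actual framework may use a hierarchical or weighted scheme instead.

```python
from collections import Counter

def majority_vote(agent_predictions, min_votes=None):
    """Keep every code proposed by at least min_votes agents (default: strict majority)."""
    if min_votes is None:
        min_votes = len(agent_predictions) // 2 + 1
    votes = Counter()
    for codes in agent_predictions.values():
        votes.update(set(codes))  # each agent votes at most once per code
    return sorted(code for code, n in votes.items() if n >= min_votes)

# Hypothetical outputs from the four planned agents; the code names are made up.
predictions = {
    "gemma-2-9b": ["Housing Subsidy", "State Agency"],
    "phi-3-mini": ["Housing Subsidy"],
    "gemini-2.5-flash-lite": ["Housing Subsidy", "State Agency", "Tax Credit"],
    "gemini-2.0-flash": ["State Agency", "Housing Subsidy"],
}
consensus = majority_vote(predictions)
print(consensus)
```

Lowering `min_votes` trades precision for recall, which may matter if some agents are assigned to candidate generation rather than final verification.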

In [None]:
import os
import re
import gc
import glob
import fitz
import json
import faiss
import torch
import jinja2
import textwrap
import numpy as np
import pandas as pd
from pathlib import Path
from google import genai
from google.genai import types
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


In [None]:
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=0

In [None]:
with open('nested_output.json', 'r') as f:
    nested_output = json.load(f)
    
#display(nested_output)

PDF_DIRECTORY = "Policy_PDFs"

In [None]:
rag_documents = []
code_names = []
code_with_definitions = {}

for category, subcategories in nested_output.items():
    for subcategory, codes in subcategories.items():
        for code_info in codes:

            document_text = (
                f"Code Category: {category}. "
                f"Code Subcategory: {subcategory}. "
                f"Code Name: {code_info['Code Name']}. "
                f"Definition: {code_info['Definition']}"
            )
            #print(document_text)
            code_names.append(code_info['Code Name'])
            rag_documents.append(document_text)
            code_with_definitions[code_info['Code Name']] = document_text

#print(code_names)
#print(rag_documents)

In [None]:
codebook = pd.read_csv("coded_paragraphs.csv")

save_paragraphs = {}
rag_paragraphs = {}

for idx, row in codebook.iterrows():
    # Keep only paragraphs between 150 and 1000 characters long
    if 150 < len(row["paragraph"]) < 1000:
        # The "codes" column holds a stringified list, e.g. "['code1', 'code2']", so convert it to a real list
        as_list = row["codes"].strip("[]").replace("'", "").split(", ")
        row["codes"] = as_list

        full_def = []
        code_key = []
        for c in row["codes"]:
            if c in code_with_definitions:
                full_def.append(code_with_definitions[c])
                code_key.append(c)

        # join the full_def list into a single string
        full_def_str = " ".join(full_def)

        save_paragraphs[row["document_name"]] = {"codes":row["codes"],"codes_definition":full_def,"paragraph":row["paragraph"]}
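        # Single quotes inside the f-string keep this valid before Python 3.12.
        # The row index is part of the key so paragraphs sharing the same codes do not overwrite each other.
        rag_paragraphs[(idx, *code_key)] = f"{full_def_str} | Paragraph: {row['paragraph']}"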

#print(len(save_paragraphs))

#print(save_paragraphs.keys())
display(len(rag_paragraphs))
#save to json file
# import json
# with open("paragraph_with_codes_definitions.json", "w") as f:
#     json.dump(save_paragraphs, f, indent=4)
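
The string splitting used above works as long as no code name contains a comma or an apostrophe. A more defensive sketch, assuming the `codes` column always holds a Python-style list literal, uses the standard library's `ast.literal_eval`; the `parse_codes` helper and the sample code names are hypothetical.

```python
import ast

def parse_codes(raw):
    """Parse a stringified list such as "['code1', 'code2']" into a list of strings."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # malformed cell: treat it as having no codes
    return [str(item) for item in value] if isinstance(value, (list, tuple)) else []

print(parse_codes("['code1', 'code2']"))
# Names with apostrophes or commas survive intact.
print(parse_codes("[\"Renters' Rights\", 'Tax Credit, Refundable']"))
```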

In [None]:
emd_model = "backend/data/embeddings/bge-large"
embedder = SentenceTransformer(emd_model)

definition_embeddings = embedder.encode(
    list(rag_paragraphs.values()),
    show_progress_bar=True,
    convert_to_numpy=True
).astype('float32')

embedding_dimension = definition_embeddings.shape[1]
index = faiss.IndexFlatIP(embedding_dimension)
faiss.normalize_L2(definition_embeddings)
index.add(definition_embeddings)

# test query
query = "example code definition about affordable housing and state government the people who need it will need to apply through a state agency"

query_embedding = embedder.encode([query], convert_to_numpy=True).astype('float32')
faiss.normalize_L2(query_embedding)
D, I = index.search(query_embedding, k=3)

for i, idx in enumerate(I[0]):
    print(f"Rank {i+1}:")
    #print(rag_paragraphs[list(rag_paragraphs.keys())[idx]])
    print(f"Similarity Score: {D[0][i]}\n")

In [None]:
with open('secrets.json', "r") as f:
    secrets = json.load(f)

#print(type(secrets))

client = genai.Client(api_key=secrets["key"])

In [None]:
def gemini_with_rag(paragraph):
    # relevant context
    query_embedding = embedder.encode([paragraph], convert_to_numpy=True).astype('float32')
    faiss.normalize_L2(query_embedding)
    distances, indices = index.search(query_embedding, k=7)

    contexts = []
    for i, idx in enumerate(indices[0]):
        contexts.append(rag_paragraphs[list(rag_paragraphs.keys())[idx]])
    
    print(contexts)
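    # Build the few-shot example block. The original counter loop stopped after
    # the first retrieved context, so max_examples = 1 preserves that behavior.
    codes_with_def = set()
    just_para = ""
    max_examples = 1
    for count, par in enumerate(contexts[:max_examples], start=1):
        # Each context has the form "<code definitions> | Paragraph: <text>"
        codes_part, _, paragraph_part = par.partition(" | ")
        just_para += f"\nEXAMPLE {count}: \n{paragraph_part}\nCodes Assigned: {codes_part}"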

    for par in contexts:
        try:
            codes_string, _ = par.split(' | ')
        except ValueError:
            continue

        pattern = r"Code Name:\s*(.+?)\.\s*Definition:\s*(.+?)(?=Code Category:|$)"
        matches = re.findall(pattern, codes_string, re.DOTALL)

        for code_name, definition in matches:
            # captured groups
            clean_name = code_name.strip()
            clean_definition = definition.strip()

            formatted_entry = f"- {clean_name}. Definition: {clean_definition}"
            codes_with_def.add(formatted_entry)

    # the final string
    codes_context = "CODES AND DEFINITIONS:\n" + "\n".join(sorted(list(codes_with_def)))
    #print(codes_context)

    full_system_prompt = f"""
    You will be shown some examples of code definitions and their corresponding paragraphs from US state policies/acts/bills.
    Use these examples to get an idea of how to classify the paragraph into the appropriate codes.
    
    {just_para}

    {codes_context}

    Use the codes and their definitions to assign the most appropriate codes to the following paragraph.
    Then, conclude with a comma-separated list of the codes names only.
    """
    try:
        response = client.models.generate_content(
            model="gemini-2.5-flash-lite",
            config=types.GenerateContentConfig(
                system_instruction=full_system_prompt
            ),
            contents=f"Please code this paragraph: '{paragraph}'",
        )
        print(response.text)
        return response.text

    except Exception as e:
        print(f"\n[Error]: {e}")
        return "Error during API call."

In [None]:
quantization_config = BitsAndBytesConfig(load_in_4bit=True,bnb_4bit_compute_dtype=torch.bfloat16)
base = Path("backend/data/LLMs/Gemma-2-9B")

model = AutoModelForCausalLM.from_pretrained(str(base),quantization_config=quantization_config,device_map="auto",local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(str(base),local_files_only=True)

In [None]:
def load_pdfs_from_directory(directory_path):

    if not Path(directory_path).exists():
        print(f"Directory '{directory_path}' not found. Please create it and add your PDFs.")
        return []
    
    documents = []

    for filename in os.listdir(directory_path):
        if filename.endswith(".pdf"):
            filepath = os.path.join(directory_path, filename)
            
            try:
                doc = fitz.open(filepath)
                full_text = "".join(page.get_text() for page in doc)
                doc.close()

                if "References" in full_text:
                    full_text = full_text[:full_text.index("References")]

                documents.append({"page_content": full_text,"metadata": {"source": filename}})

            except Exception as e:
                print(f"Error processing {filename}: {e}")
                
    print(f"Loaded {len(documents)} documents from '{directory_path}'")
    return documents

def split_by_section(text):
    pattern = r'(?=SECTION \d+)'
    clean_text = text.replace('\n', ' ').replace('\u00a0', ' ').replace('\u00a7', '')
    chunks = re.split(pattern, clean_text)

    final_chunks = []

    for chunk in chunks:
        chunk = chunk.strip()
        if not chunk:
            continue
        
        if len(chunk) > 1000:
            sub_chunks = textwrap.wrap(chunk, width=1000, break_long_words=False)
            final_chunks.extend(sub_chunks)
            
        elif len(chunk) >= 100:
            final_chunks.append(chunk)

    return final_chunks


In [None]:
def classify_with_gemma(paragraph):
    query_embedding = embedder.encode([paragraph], convert_to_numpy=True).astype('float32')
    faiss.normalize_L2(query_embedding)
    distances, indices = index.search(query_embedding, k=5)

    contexts = []
    for i, idx in enumerate(indices[0]):
        contexts.append(rag_paragraphs[list(rag_paragraphs.keys())[idx]])
    
    codes_with_def = set()

    for par in contexts:
        try:
            codes_string, _ = par.split(' | ')
        except ValueError:
            continue

        pattern = r"Code Name:\s*(.+?)\.\s*Definition:\s*(.+?)(?=Code Category:|$)"
        matches = re.findall(pattern, codes_string, re.DOTALL)

        for code_name, definition in matches:
            # captured groups
            clean_name = code_name.strip()
            clean_definition = definition.strip()

            formatted_entry = f"- {clean_name}. Definition: {clean_definition}"
            codes_with_def.add(formatted_entry)

    # the final string
    codes_context = "Codes and definitions:\n" + "\n".join(sorted(list(codes_with_def)))
    #print(codes_context)
    #load .txt file with the knowledge_base
    with open("knowledge_base.txt", "r") as f:
        knowledge_base = f.read()
 
    system_prompt = f"""
    {knowledge_base}

    ### Code Definitions
    {codes_context}

    ## YOUR TASK
    Apply all rules and definitions from the KNOWLEDGE BASE above to the following paragraph. 
    Your entire output MUST BE a single, comma-separated list of the resulting 'Code Name's. 
    Do not explain your reasoning or show your work.
    """

    combined_prompt = f"{system_prompt}\n\nParagraph to analyze: {paragraph}"
    messages = [{"role": "user", "content": combined_prompt}]

    prompt = tokenizer.apply_chat_template(messages,tokenize=False,add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.inference_mode():
        outputs = model.generate(**inputs,max_new_tokens=300,do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<end_of_turn>")])  # Gemma 2 closes a chat turn with <end_of_turn>

    raw_output = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return raw_output
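
Both classifiers return raw text, so the responses need to be parsed before they can be compared with the annotations. The sketch below is one possible post-processing step, assuming the model ends its response with a comma-separated list and that no code name itself contains a comma. The `parse_predicted_codes` helper is hypothetical; in this notebook, `code_names` could be passed as `valid_codes`.

```python
def parse_predicted_codes(raw_output, valid_codes):
    """Split a comma-separated model response and keep only known code names."""
    lookup = {name.casefold(): name for name in valid_codes}
    # Use the last non-empty line, since a model may add text before its final list.
    lines = [line.strip() for line in raw_output.splitlines() if line.strip()]
    if not lines:
        return []
    parsed = []
    for piece in lines[-1].split(","):
        key = piece.strip().strip("'\"*` ").casefold()
        if key in lookup and lookup[key] not in parsed:
            parsed.append(lookup[key])  # restore canonical spelling, skip duplicates
    return parsed

# Hypothetical code names and model response, for illustration only.
valid = ["Housing Subsidy", "State Agency"]
response = "The paragraph concerns housing aid.\nHousing Subsidy, state agency, Made Up Code"
print(parse_predicted_codes(response, valid))
```

Dropping names that are not in the codebook also gives a simple way to count hallucinated codes per model.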


In [None]:
RESULTS_FILE_LOCAL = "classification_results_gemma_local.json"
def classify_gemma_local():
    classification_results = defaultdict(list)
    documents = load_pdfs_from_directory(PDF_DIRECTORY)
    
    for doc in documents:
        filename = doc['metadata']['source']
        print(f"\nNow processing {filename}\n")
        
        chunks = split_by_section(doc['page_content'])
        print(f"Found {len(chunks)} chunks to classify")
        
        for i, chunk in enumerate(chunks):
            predicted_codes = classify_with_gemma(chunk)
            
            paragraph_data = {
                "chunk_number": i + 1,
                "text_snippet": chunk,
                "predicted_codes": predicted_codes
            }
            classification_results[filename].append(paragraph_data)
            print(f"Chunk {i+1}: {predicted_codes}")

            torch.cuda.empty_cache()
            gc.collect()

        print(f"\nSaving Doc Results to: {RESULTS_FILE_LOCAL}\n")
        with open(RESULTS_FILE_LOCAL, 'w') as f:
            json.dump(classification_results, f, indent=4)

classify_gemma_local()

In [None]:

RESULTS_FILE = "classification_results_gemini_master.json"
def classify_gemini_api():
    classification_results = defaultdict(list)
    documents = load_pdfs_from_directory(PDF_DIRECTORY)
    
    for doc in documents:
        filename = doc['metadata']['source']
        print(f"\nNow processing {filename}\n")
        
        chunks = split_by_section(doc['page_content'])
        print(f"Found {len(chunks)} chunks to classify.")
        
        for i, chunk in enumerate(chunks):
            predicted_codes_gemini = gemini_with_rag(chunk)
            
            paragraph_data = {
                "chunk_number": i + 1,
                "text_snippet": chunk,
                "predicted_codes": predicted_codes_gemini
            }
            classification_results[filename].append(paragraph_data)
            print(f"Chunk {i+1}: {predicted_codes_gemini}")

            torch.cuda.empty_cache()
            gc.collect()

        print(f"\nSaving Doc Results to: {RESULTS_FILE}")
        with open(RESULTS_FILE, 'w') as f:
            json.dump(classification_results, f, indent=4)


classify_gemini_api()