# Strategy & Ingestion Notebook

This notebook outlines the end-to-end process of preparing and querying a semantic retrieval system tailored for legal documents. It covers:

1. Mock Data Creation
2. Custom Parsing and Chunking
3. Embedding and Indexing
4. Semantic Search and Response Generation

## 1. Mock Data Creation
We simulate 20 fake legal judgment documents, each with the following structure:
- `metadata_header` (includes case ID)
- `summary`
- `facts_of_case`
- `plaintiff_arguments`
- `defendant_arguments`
- `verdict`

These fields model a real-world legal case breakdown.
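
As an illustration of this structure, the sketch below builds a few mock documents. The helper name `make_mock_case`, the `CASE-001` ID format, and all field values are hypothetical placeholders, not the generator or data actually used for `mock_legal_data.json`.

```python
import json

# The five text sections listed above; metadata_header is added separately.
SECTION_FIELDS = [
    "summary",
    "facts_of_case",
    "plaintiff_arguments",
    "defendant_arguments",
    "verdict",
]


def make_mock_case(index):
    """Build one fake judgment with the six fields listed above (placeholder values)."""
    case_id = f"CASE-{index:03d}"  # assumed ID format, for illustration only
    doc = {"metadata_header": {"case_id": case_id}}
    for field in SECTION_FIELDS:
        doc[field] = f"Placeholder {field.replace('_', ' ')} text for {case_id}."
    return doc


mock_documents = [make_mock_case(i) for i in range(1, 4)]
print(json.dumps(mock_documents[0], indent=2))
```

Writing `mock_documents` out with `json.dump` would give a file of the same shape that the parsing step in the next section reads.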

## 2. Custom Parsing & Chunking
We define a function `get_chunk()` to convert each document into 5 distinct chunks, one per text section:
- Each chunk includes the `document_id`, `section_type`, and the actual `text`
- This granularity allows semantic search over specific parts such as the `verdict` or the `plaintiff_arguments`

In [2]:

import json


# Sections that become one chunk each, in document order.
SECTION_TYPES = ("summary", "facts_of_case", "plaintiff_arguments", "defendant_arguments", "verdict")


def get_chunk(doc):
    """Split one judgment into five chunks, one per section, tagged with its case ID."""
    doc_id = doc["metadata_header"]["case_id"]
    return [
        {"document_id": doc_id, "section_type": section, "text": doc[section]}
        for section in SECTION_TYPES
    ]

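# Machine-specific absolute path; point it at your own copy of mock_legal_data.json.
with open(r"C:\Users\hp\Documents\synbrains_trainee_works\ragwork\ragwork\mock_legal_data.json", encoding="utf-8") as f:
    documents = json.load(f)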

all_chunks = []
for doc in documents:
    all_chunks.extend(get_chunk(doc))

print(f"Parsed {len(all_chunks)} chunks from {len(documents)} documents.")
    

Parsed 100 chunks from 20 documents.
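
Because every chunk carries its `section_type`, a search can be narrowed to one part of a judgment before any ranking happens. The sketch below shows that idea on a small stand-in list; `filter_chunks` and the sample records are hypothetical and are not part of the notebook's pipeline.

```python
from collections import Counter


def filter_chunks(chunks, section_type):
    """Return only the chunks whose section_type matches."""
    return [c for c in chunks if c["section_type"] == section_type]


# Tiny stand-in for all_chunks, using the same three keys (made-up values).
sample_chunks = [
    {"document_id": "CASE-001", "section_type": "summary", "text": "Placeholder summary."},
    {"document_id": "CASE-001", "section_type": "verdict", "text": "Placeholder verdict."},
    {"document_id": "CASE-002", "section_type": "summary", "text": "Placeholder summary."},
    {"document_id": "CASE-002", "section_type": "verdict", "text": "Placeholder verdict."},
]

verdict_chunks = filter_chunks(sample_chunks, "verdict")
print(Counter(c["section_type"] for c in sample_chunks))
print([c["document_id"] for c in verdict_chunks])
```

The same filter could be applied to `all_chunks` so that only verdict sections are embedded or searched.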


## 3. Embedding & Indexing
We use `all-MiniLM-L6-v2` from SentenceTransformers to convert every chunk's text into a 384-dimensional dense vector embedding. The chunks and their embeddings are saved together with `pickle` in `embeddings.pkl` for later retrieval.

In [3]:

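import os

# Set the flag before importing sentence_transformers; set afterwards,
# it may be read too late to take effect.
os.environ["TRANSFORMERS_NO_TF"] = "1"

import pickle

from sentence_transformers import SentenceTransformer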

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [chunk["text"] for chunk in all_chunks]
embeddings = model.encode(texts)

with open("embeddings.pkl", "wb") as f:
    pickle.dump({"chunks": all_chunks, "embeddings": embeddings}, f)

print("Embeddings saved to embeddings.pkl")
    


Embeddings saved to embeddings.pkl
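
The saved file is only useful if the chunk list and the embedding rows stay aligned, since retrieval maps a row index back to a chunk. A minimal sketch of a loader that checks this is shown below. `load_index`, the demo file name, and the three-number vectors are hypothetical stand-ins (the real vectors have 384 dimensions). Note that `pickle` files should only be loaded from sources you trust, because unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile


def load_index(path):
    """Load a pickled index and check that chunks and embeddings line up."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    if len(data["chunks"]) != len(data["embeddings"]):
        raise ValueError("chunks and embeddings have different lengths")
    return data


# Stand-in index with the same two keys as embeddings.pkl (made-up values).
demo_index = {
    "chunks": [
        {"document_id": "CASE-001", "section_type": "summary", "text": "Placeholder summary."},
        {"document_id": "CASE-001", "section_type": "verdict", "text": "Placeholder verdict."},
    ],
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
}

# Round-trip through a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_embeddings.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump(demo_index, f)
    loaded = load_index(demo_path)

print(len(loaded["chunks"]), "chunks loaded")
```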


## 4. Semantic Retrieval + LLM Synthesis
We:
- Load the saved embeddings, embed the query with the same model, and rank chunks by cosine similarity
- Retrieve the top-k most relevant chunks
- Format these into a prompt and pass it to an LLM (via LangChain's `langchain_groq` integration)
- Return a natural language answer
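
The ranking step can be illustrated in isolation with NumPy. This is a sketch under stated assumptions, not the notebook's implementation: `top_k_by_cosine` is a hypothetical helper, the two-dimensional vectors are toy data, and every vector is assumed to be non-zero so the norms can be divided safely.

```python
import numpy as np


def top_k_by_cosine(query_vec, matrix, k=3):
    """Return (indices, scores) of the k rows most similar to query_vec, best first."""
    query_vec = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Cosine similarity: dot product divided by the product of the vector norms.
    sims = matrix @ query_vec / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec))
    # argsort is ascending, so reverse it and keep the first k positions.
    order = np.argsort(sims)[::-1][:k]
    return order.tolist(), sims[order].tolist()


# Toy "embeddings": row 0 points almost the same way as the query, row 3 the opposite way.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
indices, scores = top_k_by_cosine([1.0, 0.1], toy_embeddings, k=2)
print(indices)  # [0, 2]
```

Cosine similarity compares direction rather than length, so a long chunk and a short chunk about the same topic can still score close to each other.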

In [4]:

import os
import pickle

from dotenv import load_dotenv
from langchain_groq import ChatGroq
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

load_dotenv()  # loads variables such as GROQ_API_KEY from a .env file, if present
llm = ChatGroq(groq_api_key=os.getenv("GROQ_API_KEY"), model_name="llama3-8b-8192")
# Must be the same model that produced the stored embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")

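def retrieve(query, index_path="embeddings.pkl", top_k=3):
    """Return the top_k chunks most similar to the query, best match first."""
    with open(index_path, "rb") as f:
        data = pickle.load(f)
    chunks = data["chunks"]
    embeddings = data["embeddings"]
    # encode() is given a list, so the query embedding has shape (1, dim).
    query_embedding = model.encode([query])
    sims = cosine_similarity(query_embedding, embeddings)[0]
    # argsort is ascending; reverse it to rank by descending similarity.
    top_indices = sims.argsort()[::-1][:top_k]
    return [chunks[i] for i in top_indices]

def generate_response(query, retrieved_chunks):
    """Build a prompt from the retrieved chunks and return the LLM's answer text."""
    # Label each chunk with its source so the model can tell the sections apart.
    context = "\n\n".join(
        f"[{c['document_id']} - {c['section_type']}]\n{c['text']}" for c in retrieved_chunks
    )
    prompt = f"Query: {query}\n\nContext:\n{context}\n\nAnswer:"
    return llm.invoke(input=prompt).content.strip()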

query = "What was the verdict in the product liability case?"
top_chunks = retrieve(query)
response = generate_response(query, top_chunks)

print("Answer:\n", response)
    

Answer:
 The verdict in the product liability case is that the court finds for the plaintiff and awards $180,000 in damages.
