
## Step 1: Load and Split the PDF

In [57]:
pdf_path = "/content/drive/Othercomputers/My Laptop/Purdue University/Semester 4/Interview Prep/wealth.com/sample_will.pdf"

In [58]:
# !pip install langchain_community

In [59]:
# !pip install pypdf

In [60]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [61]:
loader = PyPDFLoader(pdf_path)
pages = loader.load()

In [62]:
print("📄 Total Pages Loaded:", len(pages))
print("🔹 Sample Page Text:\n", pages[0].page_content[:500])

📄 Total Pages Loaded: 5
🔹 Sample Page Text:
 Last Will and Testament 
of 
___________________________________ 
 
I, ________________________, resident in the City of ____________________, 
County of ____________________, State of ____________________, being of sound 
mind, not acting under duress or undue influence, and fully understanding the nature 
and extent of all my property and of this disposition thereof, do hereby make, publish, 
and declare this document to be my Last Will and Testament, and hereby revoke any 
and all other wills


In [63]:
# Create the splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
# ✅ Purpose: split each page into chunks of at most 500 characters, with up to 50 characters shared between neighboring chunks.
# 🧠 Why overlap? To preserve context for sentences that cross chunk boundaries.
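
To make the overlap idea concrete, here is a minimal sketch of a fixed-size sliding window over a string. This is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter first tries to break on separators such as paragraph breaks, newlines, and spaces. The function name `split_with_overlap` is hypothetical and exists only for this illustration.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Tiny example so the overlap is easy to see.
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so a phrase that straddles a boundary still appears whole in one of the two chunks.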

In [64]:
chunks = splitter.split_documents(pages)

In [65]:
print("🧩 Total Chunks Created:", len(chunks))
print("📌 First Chunk Preview:\n", chunks[0].page_content)

🧩 Total Chunks Created: 29
📌 First Chunk Preview:
 Last Will and Testament 
of 
___________________________________ 
 
I, ________________________, resident in the City of ____________________, 
County of ____________________, State of ____________________, being of sound 
mind, not acting under duress or undue influence, and fully understanding the nature 
and extent of all my property and of this disposition thereof, do hereby make, publish, 
and declare this document to be my Last Will and Testament, and hereby revoke any


## Step 2: Embedding

In [66]:
# Convert each chunk of the will into a numeric vector (embedding)
# Store those vectors in a searchable database (FAISS)

In [67]:
# !pip install langchain_huggingface

In [68]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
# FAISS stands for Facebook AI Similarity Search.

In [69]:
# all-MiniLM-L6-v2 is a small transformer model that turns text into 384-dimensional vectors.
# If two sentences mean similar things, their embeddings will be close together in vector space.
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
sample_vec = embedding_model.embed_query(chunks[0].page_content)
print("🔢 Vector length:", len(sample_vec))
print("📊 First 5 dims:", sample_vec[:5])

🔢 Vector length: 384
📊 First 5 dims: [-0.048935726284980774, 0.141635462641716, 0.023208405822515488, -0.112810418009758, -0.05388515070080757]
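
One common way to measure how "close together" two embeddings are is cosine similarity. The sketch below uses made-up 3-dimensional vectors in place of real 384-dimensional embeddings; the helper name `cosine_similarity` and the toy values are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (hypothetical values).
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # perpendicular to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: unrelated direction
```

In the notebook, the same function could be applied to two results of `embedding_model.embed_query(...)` to compare a question against a chunk.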


In [70]:
# !pip install faiss-cpu

In [71]:
vector_store = FAISS.from_documents(chunks, embedding_model)
print("✅ FAISS index created with", len(chunks), "chunks.")

✅ FAISS index created with 29 chunks.


In [72]:
vector_store.save_local("faiss_index")
print("💾 Saved FAISS index to disk.")

💾 Saved FAISS index to disk.


## Step 3: Semantic Retrieval

In [73]:
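# Load the saved FAISS index.
# allow_dangerous_deserialization=True is required because part of the saved index is pickled;
# only enable it for index files you created yourself or otherwise trust.
vector_store = FAISS.load_local("faiss_index", embedding_model, allow_dangerous_deserialization=True)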
# Ask the user for a question about the will
query = input("❓ Ask something about the will: ")

❓ Ask something about the will: Who is this will about?


In [74]:
# Search for top 3 relevant chunks
docs = vector_store.similarity_search(query, k=3)

print("\n🔍 Top Relevant Chunks:")
for i, doc in enumerate(docs, 1):
    print(f"\n--- Chunk #{i} ---\n{doc.page_content}")



🔍 Top Relevant Chunks:

--- Chunk #1 ---
upon all affected. 
VII. CONTESTING BENEFICIARY 
If any beneficiary under this Will, or any trust herein mentioned, contests or attacks this 
Will or any of its provisions, any share or interest in my estate given to that contesting

--- Chunk #2 ---
beneficiary under this Will is revoked and shall be disposed of in the same manner 
provided herein as if that contesting beneficiary had predeceased me. 
VIII. GUARDIAN AD LITEM NOT REQUIRED 
I direct that the representation by a guardian ad litem of the interests of persons unborn, 
unascertained or legally incompetent to act in proceedings for the allowance of 
accounts hereunder be dispensed with to the extent permitted by law. 
IX. GENDER

--- Chunk #3 ---
and all other wills and codicils heretofore made by me. 
I. EXPENSES & TAXES 
I direct that all my debts, and expenses of my last illness, funeral, and burial, be paid as 
soon after my death as may be reasonably convenient, and I hereby aut
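

Conceptually, a similarity search ranks every stored vector by its distance to the query vector and keeps the closest `k`. The sketch below does this by brute force with Euclidean distance on toy 2-dimensional vectors. It is an illustration of the idea, not FAISS's implementation, and the names `l2_distance` and `top_k` are hypothetical.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, vectors, k=3):
    """Return the k nearest vectors as (distance, index) pairs, closest first."""
    scored = ((l2_distance(query_vec, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy "index" of four stored vectors (made-up values).
stored = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]]
for distance, index in top_k([0.0, 0.0], stored, k=2):
    print(f"vector #{index} at distance {distance:.2f}")
```

Smaller distances mean closer matches. Printing the distances next to each chunk is a quick way to check whether the top results are genuinely relevant or merely the least bad options, which matters for a short, generic question like the one above.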

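## Step 4: Generate Answers using Retrieved Chunks

Take the top chunks we retrieved from FAISS and feed them into a local language model (here, `google/flan-t5-base` loaded through Hugging Face `transformers`) to generate a natural-language answer.

Combining retrieval with generation in this way is what is known as retrieval-augmented generation (RAG).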

In [75]:
# !pip install transformers accelerate

In [76]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

In [77]:
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" places the model on a GPU if one is available, otherwise on the CPU.
# float16 halves memory use; on a CPU-only runtime float32 is usually the safer choice.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")


In [80]:
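# Wrap the model and tokenizer in a text2text-generation pipeline.
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

# Note: this prompt uses only the top-ranked chunk and does not include the user's query.
response = pipe("Summarize this will: " + docs[0].page_content, max_new_tokens=200)
print(response[0]['generated_text'])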


Device set to use cpu


a beneficiary under this Will, or any trust herein mentioned, shall be forfeited to that contesting beneficiary.
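
The cell above summarizes a single chunk and never passes the user's question to the model. A minimal sketch of how the question and all retrieved chunks could be combined into one prompt is shown below. The function name `build_rag_prompt`, the prompt wording, and the character budget are assumptions chosen for illustration; the example chunk strings are shortened stand-ins for `doc.page_content`.

```python
def build_rag_prompt(question, chunk_texts, max_context_chars=1500):
    """Combine retrieved chunks and the user's question into one prompt."""
    context_parts = []
    used = 0
    for text in chunk_texts:
        text = text.strip()
        if used + len(text) > max_context_chars:
            break  # stop before the context exceeds the character budget
        context_parts.append(text)
        used += len(text)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical stand-ins for the retrieved chunks and the user's query.
example_chunks = [
    "I direct that all my debts be paid as soon after my death as convenient.",
    "If any beneficiary contests this Will, that beneficiary's share is revoked.",
]
print(build_rag_prompt("What happens if a beneficiary contests the will?", example_chunks))
```

In the notebook, the prompt could then be passed to the existing pipeline, for example `pipe(build_rag_prompt(query, [d.page_content for d in docs]), max_new_tokens=200)`. The character budget is a rough guard, since small models such as `flan-t5-base` handle only fairly short inputs well; a token-based limit would be more precise.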
