# Querying PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Prerequisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), you should get a DB Token with role _Database Administrator_ and copy your Database ID: these connection parameters are needed momentarily.


#### What you will do:

- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Load a PDF, split its text into chunks, and store the chunks in the vector store;
- Run a Question-Answering loop that retrieves the relevant text chunks and has an LLM construct the answer.

Install the required dependencies:

In [None]:
!pip install -q cassio datasets langchain openai tiktoken

In [None]:
!pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.22-py3-none-any.whl.metadata (2.4 kB)
Collecting langchain-core<1.0.0,>=0.3.55 (from langchain-community)
  Downloading langchain_core-0.3.55-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain<1.0.0,>=0.3.24 (from langchain-community)
  Downloading langchain-0.3.24-py3-none-any.whl.metadata (7.8 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-

Import the packages you'll need:

In [None]:
# LangChain components to use
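from langchain_community.vectorstores import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
# The OpenAI classes are not used further down (this notebook uses Gemini instead)
from langchain_community.llms import OpenAI
from langchain_community.embeddings import OpenAIEmbeddings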

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [None]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
from PyPDF2 import PdfReader

### Setup

#### Provide your secrets:

Replace the following with your Astra DB connection details (the Gemini API key is entered further down, where the embedding and LLM objects are created):

In [None]:
ASTRA_DB_APPLICATION_TOKEN = "your_AstraDB_app_token"  # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "your_AstraDB_ID"  # enter your Database ID

Open the PDF you want to query:

In [None]:
# Provide the path to your PDF file.
pdfreader = PdfReader('event-1.pdf')

In [None]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    # extract_text() can return an empty string for pages without a text layer
    if content:
        raw_text += content

In [None]:
from google.colab import drive
drive.mount('/content/drive')

MessageError: Error: credential propagation was unsuccessful

In [None]:
raw_text

'To view complete details for this event, click here to view the announcement\nCHOUFLI CODE\nThe \nChoufliCode Hackathon 2025\n, organized by the IEEE Student Branch Computer Society on\nfebruary 22nd, blended intense coding challenges with opportunities for soft skill\ndevelopment. Participants tackled real-world problems through teamwork and innovation, while\ncoffee breaks featured games like bowling, baby foot, and Pool. These activities fostered\ncamaraderie, strategic thinking, and spontaneous collaboration, enhancing communication and\nstress management in a relaxed setting. The event not only celebrated technical excellence\nat its Award Ceremony but also emphasized holistic growth, merging programming rigor with\nplayful networking. A heartfelt thank you to all participants, mentors, and sponsors for\nmaking this dynamic fusion of tech and teamwork a memorable success! \x00\x00\x00                  \n    \n***********************************************************************
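
The extracted text above ends with NUL characters (`\x00`), trailing spaces and a long run of asterisks. A minimal sketch of a cleanup step you could apply to `raw_text` before splitting it is shown here. The helper name `clean_extracted_text` and its rules are assumptions made for illustration, not part of PyPDF2 or LangChain.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Hypothetical helper: tidy up text returned by page.extract_text()."""
    # Drop NUL characters left over from the extraction
    text = text.replace("\x00", "")
    # Remove decorative rules made of four or more asterisks
    text = re.sub(r"\*{4,}", "", text)
    # Strip trailing whitespace on every line
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # Collapse three or more consecutive newlines into one blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "a memorable success! \x00\x00\x00                  \n    \n*********\nNext event"
cleaned = clean_extracted_text(sample)
print(repr(cleaned))
```

In the notebook you would call it as `raw_text = clean_extracted_text(raw_text)` right after the extraction loop.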

Initialize the connection to your database:

_(do not worry if you see a few warnings, it's just that the drivers are chatty about negotiating protocol versions with the DB.)_

In [None]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

ValueError: Generic error when fetching the URL to the secure-bundle.
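
An error like the one above can appear when the connection parameters are not valid, for instance when the placeholder strings were left unchanged (this is an assumption about the cause, not something the message states). As a sketch, a small check run before `cassio.init` can catch the most obvious mistakes. The function `check_astra_credentials` is hypothetical, and the only rule taken from this notebook is that the token is an `AstraCS:...` string.

```python
def check_astra_credentials(token: str, database_id: str) -> list:
    """Hypothetical pre-flight check; returns a list of detected problems."""
    problems = []
    # The notebook describes the token as an "AstraCS:..." string
    if not token.startswith("AstraCS:"):
        problems.append("token does not start with 'AstraCS:'")
    # An empty value or the untouched placeholder cannot identify a database
    if not database_id or database_id == "your_AstraDB_ID":
        problems.append("database ID is empty or still the placeholder")
    return problems


for problem in check_astra_credentials("your_AstraDB_app_token", "your_AstraDB_ID"):
    print("Check your secrets:", problem)
```

An empty list means that the values at least look plausible; it does not prove that the token is valid or that the database is reachable.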

Import the Google Generative AI SDK, which the embedding class below uses:

In [None]:
import google.generativeai as genai

Create the embedding object and your LangChain vector store ... backed by Astra DB!

In [None]:
from langchain_core.embeddings import Embeddings
from typing import List

class GeminiEmbeddings(Embeddings):
    def __init__(self, model_name="models/embedding-001", api_key=None):
        genai.configure(api_key=api_key)
        self.model = model_name

    def embed_query(self, text: str) -> List[float]:
        response = genai.embed_content(model=self.model, content=text)
        return response["embedding"]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        embeddings = []
        for text in texts:
            response = genai.embed_content(model=self.model, content=text)
            embeddings.append(response["embedding"])
        return embeddings

# Initialize the Gemini embeddings (free tier)
gemini_embeddings = GeminiEmbeddings(
    api_key="your_Gemini_API_Key"  # Get from https://aistudio.google.com/
)

# Create the vector store on Cassandra/Astra DB
astra_vector_store = Cassandra(
    embedding=gemini_embeddings,
    table_name="qa_demo",
    session=None,  # None: use the session set up by cassio.init() above
    keyspace=None,  # None: use the keyspace set up by cassio.init() above
)


ValueError: DB session not set.
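
Conceptually, a vector store keeps one embedding per text chunk and answers a query by embedding the question and returning the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with a brute-force search in plain Python. It assumes cosine similarity as the metric and uses made-up three-dimensional vectors; Astra DB performs the search on the server side and does not work through a Python loop like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "table": (text, embedding) pairs with invented vectors
stored = [
    ("health units", [1.0, 0.0, 0.0]),
    ("tax proposals", [0.0, 1.0, 0.0]),
    ("hospital blocks", [0.8, 0.0, 0.6]),
]
results = top_k([1.0, 0.0, 0.1], stored, k=2)
for score, text in results:
    print("[%0.4f] %s" % (score, text))
```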

In [None]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that each chunk stays within the model's token limits
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
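
To see what `chunk_size` and `chunk_overlap` mean in practice, here is a simplified, line-based splitter in plain Python. It is a sketch of the general idea only (merge lines until the size limit is reached, then carry the trailing lines that fit in the overlap budget into the next chunk) and is not LangChain's actual implementation; the function name `split_with_overlap` and the small sizes are chosen for illustration.

```python
def split_with_overlap(text, separator="\n", chunk_size=40, chunk_overlap=15):
    """Simplified sketch: merge pieces into chunks that share some trailing text."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it
            chunks.append(separator.join(current))
            # Keep only the trailing pieces that fit in the overlap budget
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo_text = "alpha line one\nbeta line two\ngamma line three\ndelta line four"
demo_chunks = split_with_overlap(demo_text)
for chunk in demo_chunks:
    print(repr(chunk))
```

In this toy run the line `beta line two` ends the first chunk and starts the second, which is the overlap at work. Repeating some text across neighbouring chunks reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.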

In [None]:
texts[:50]

['GOVERNMENT OF INDIA\nBUDGET 2021-2022\nSPEECH\nOF\nNIRMALA SITHARAMAN\nMINISTER OF FINANCE\nFebruary 1,  2021CONTENTS  \nPART -A \n  Page No.  \n\uf0b7 Introduction  1 \n\uf0b7 Health and Wellbeing  5 \n\uf0b7 Physical and Financial Capital and Infrastructure  7 \n\uf0b7 Inclusive Development for Aspirational India  19 \n\uf0b7 Reinvigorating Human Capital  22 \n\uf0b7 Innovation and R&D  23 \n\uf0b7 Minimum Government, Maximum Governance  24 \n\uf0b7 Fiscal Position  25 \n \nPART B  \nDirect Tax Proposals  28 \n\uf0b7 Relief to Senior Citizens  \n\uf0b7 Reduction in Time for Income Tax Proceedings  \n\uf0b7 Setting up the Dispute Resolution Committee  \n\uf0b7 Faceless ITAT  \n\uf0b7 Relaxation to NRI  \n\uf0b7 Exemption from Audit  \n\uf0b7 Relief for Dividend  \n\uf0b7 Attracting foreign investment into infrastructure sector  \n\uf0b7 Affordable Housing/Rental Housing  \n\uf0b7 Tax incentives to IFSC',
 '\uf0b7 Relaxation to NRI  \n\uf0b7 Exemption from Audit  \n\uf0b7 Relief for 

### Load the dataset into the vector store



In [None]:

astra_vector_store.add_texts(texts[:50])

print("Inserted %i text chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 50 text chunks.


### Run the QA cycle

Run the cells and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _What is the current GDP?_
- _How much will the agriculture target be increased to, and what will the focus be?_


In [None]:
!pip install langchain_google_genai



In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI  # Official LangChain integration
from langchain.chains import RetrievalQA

# Initialize Gemini LLM properly for LangChain
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",
    google_api_key="your_Gemini_API_Key",  # Same key from AI Studio
    temperature=0.7
)

# Build a RetrievalQA chain: it retrieves chunks from Astra DB and passes them to the LLM
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=astra_vector_store.as_retriever()
)

# Interactive question loop
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)

    # Run retrieval and answer generation through the QA chain
    answer = qa_chain.invoke({"query": query_text})["result"].strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit): how much public health units

QUESTION: "how much public health units"

ANSWER: "This document mentions the following about public health units:

* **3,382 block public health units** will be set up in 11 states as part of the PM AtmaNirbhar Swasth Bharat Yojana.
* **17 new Public Health Units** will be operationalized at Points of Entry.
* **33 existing Public Health Units** will be strengthened at Points of Entry.


It does not provide a single total number of public health units."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8385] "public health units in  11 states;  
c. Establishing critical care hospital blocks i ..."
    [0.7996] "Health Systems  
 
30. A new centrally sponsored scheme, PM AtmaNirbhar Swasth Bhara ..."
    [0.7917] "g. Setting up of 15 Health Emergency Operation Centers and 2 mobile 
hospitals; and  ..."
    [0.7875] "i. Health and Wellbeing  
ii. Physical & Financial Capital, and Infrastructure  5 
  ..."

What's your next question (or type 'quit' to exit): quit
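
The `chain_type="stuff"` setting used in the QA chain means that all retrieved chunks are placed ("stuffed") into a single prompt together with the question, and the LLM is called once. The sketch below shows that idea in plain Python. The function `build_stuff_prompt` and the wording of the template are assumptions for illustration; LangChain uses its own prompt template.

```python
def build_stuff_prompt(question, documents):
    """Hypothetical sketch: concatenate all retrieved chunks into one prompt."""
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "public health units in 11 states;",
    "Setting up of 15 Health Emergency Operation Centers and 2 mobile hospitals;",
]
prompt = build_stuff_prompt("how much public health units", retrieved)
print(prompt)
```

Because every retrieved chunk ends up in one prompt, the number of chunks returned by the retriever and the `chunk_size` chosen at splitting time together determine how long that prompt becomes.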
