# Querying PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), you should get a DB Token with role _Database Administrator_ and copy your Database ID: these connection parameters are needed momentarily.

You also need a Groq API key for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Run a question-answering loop that retrieves the relevant PDF text chunks and has an LLM construct the answer.

Install the required dependencies:

In [1]:
!pip install -q cassio datasets langchain langchain-groq sentence-transformers openai tiktoken

Import the packages you'll need:

In [None]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

# Groq chat model integration for LangChain
from langchain_groq import ChatGroq


In [3]:
!pip install PyPDF2


In [4]:
from PyPDF2 import PdfReader

### Setup

In [5]:
ASTRA_DB_APPLICATION_TOKEN = "ASTRA_DB_APPLICATION_TOKEN" # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "ASTRA_DB_ID" # enter your Database ID

Groq_API_Key = "Groq_API_Key"


#### Provide your secrets:

Replace the placeholders above with your Astra DB connection details and your Groq API key.
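Rather than pasting secrets into the notebook, one option is to read them from environment variables. The sketch below assumes a hypothetical helper named `read_secret` and a hypothetical variable name `GROQ_API_KEY`; neither is part of the original demo.

```python
import os


def read_secret(name, default=None):
    """Return the environment variable `name`, falling back to `default`.

    Raises RuntimeError when neither is available, so a missing secret
    fails early instead of surfacing later as an authentication error.
    """
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(f"Missing required secret: set the {name} environment variable")
    return value


# Hypothetical usage (commented out so the cell runs without the variables set):
# ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN")
# ASTRA_DB_ID = read_secret("ASTRA_DB_ID")
# Groq_API_Key = read_secret("GROQ_API_KEY")  # variable name is an assumption
```

This keeps the token out of the saved notebook, which matters if the file is shared or committed to version control.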

In [7]:
# Provide the path to your PDF file.
pdfreader = PdfReader('/home/kshatra/Documents/NAD_PR_29112024.pdf')

In [8]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    # extract_text() can return an empty result for pages without a text layer
    if content:
        raw_text += content

In [9]:
raw_text

'This Press Release is embargoed against publication, telecast or circulation on internet till 4:00 pm on 29th November , 2024 . \n \n \n   \n \nPRESS NOTE  \nON \nESTIMATES OF GROSS DOMESTIC  \nPRODUCT  FOR THE SECOND  QUARTER  \n(JULY -SEPTEMBER ) OF 2024 -25  \n \n \n \n \n \n \nNATIONAL ACCOUNTS DIVISION  \nNATIONAL STATISTIC S OFFICE  \nMINISTRY OF STATISTICS & PROGRAMME \nIMPLEMENTATION  \nGOVERNMENT OF INDIA  \nThis Press Release is embargoed against publication, telecast or circulation on internet till 4:00 pm on 29th November , 2024 . \n1 \n GOVERNMENT OF INDIA  \nMINISTRY OF STATISTICS AND PROGRAMME IMPLEMENTATION  \n Dated 8 Agrahayana , 194 6 Saka  \n 29th November , 202 4  \nPRESS NOTE  \nON \nESTIMATES OF GROSS DOMESTIC PRODUCT FOR THE SECOND  QUARTER  \n(JULY -SEPTEMBER ) OF 2024 -25 \n \nThe National Statistic s Office (NSO), Ministry of Statistics and Programme Implementation \n(MoSPI) is releasing in this Press Note, Quarterly Estimates of Gross Domestic Product (GDP)
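The extracted text contains trailing spaces and runs of whitespace-only lines. An optional cleanup step, sketched here with a hypothetical helper `tidy_extracted_text` that is not part of the original notebook, could collapse them before splitting. It does not repair words or numbers that the PDF extraction split apart (such as `202 4`).

```python
import re


def tidy_extracted_text(text):
    """Collapse runs of spaces/tabs and drop whitespace-only lines.

    Newlines between non-empty lines are preserved, so a splitter that
    uses "\n" as its separator still has places to cut.
    """
    lines = []
    for line in text.split("\n"):
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)


sample = "PRESS NOTE  \nON \n \n \nESTIMATES OF GROSS DOMESTIC  \nPRODUCT"
print(tidy_extracted_text(sample))
```

Whether this helps retrieval quality for a given document is something to check empirically; the rest of the notebook works on `raw_text` unchanged.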

Initialize the connection to your database:

_(do not worry if you see a few warnings, it's just that the drivers are chatty about negotiating protocol versions with the DB.)_

In [10]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

Create the LangChain embedding and LLM objects for later usage:

In [17]:
# OpenAI alternative, left commented out; this notebook uses Groq and Hugging Face instead:
# llm = OpenAI(openai_api_key=OPENAI_API_KEY)
# embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
from langchain_groq import ChatGroq
from langchain.embeddings import HuggingFaceEmbeddings

llm = ChatGroq(model="llama3-8b-8192", groq_api_key=Groq_API_Key)
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")



Create your LangChain vector store ... backed by Astra DB!

In [18]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)
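Conceptually, a vector store ranks stored texts by how close their embedding vectors are to the query's embedding. The sketch below illustrates that idea with cosine similarity on toy 3-dimensional vectors. The vectors, labels, and function names are made up for illustration, and the metric and score scaling actually used by the Astra DB backed store may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (score, text) pairs most similar to query_vec.

    `stored` is a list of (text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings": real ones come from the embedding model and have hundreds of dimensions.
stored = [
    ("GDP growth", [1.0, 0.0, 0.0]),
    ("GVA by sector", [0.8, 0.6, 0.0]),
    ("Embargo notice", [0.0, 0.0, 1.0]),
]
for score, text in top_k([1.0, 0.0, 0.0], stored, k=2):
    print(f"[{score:.4f}] {text}")
```

In the notebook itself this ranking is delegated to the database; the sketch only shows why texts with similar embeddings surface first.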

In [19]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that each piece stays small enough
# for the embedding model and the LLM context window.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
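To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-width sliding window. It is an illustration only, not LangChain's implementation: `CharacterTextSplitter` first splits on the separator and then merges pieces up to the size limit, so its chunks end at newlines rather than at fixed offsets. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so content that
    straddles a boundary appears whole in at least one chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

With the notebook's settings (800 and 200), roughly a quarter of each chunk repeats the end of the previous one, which is visible in the first two entries of `texts`.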

In [20]:
texts[:50]

['This Press Release is embargoed against publication, telecast or circulation on internet till 4:00 pm on 29th November , 2024 . \n \n \n   \n \nPRESS NOTE  \nON \nESTIMATES OF GROSS DOMESTIC  \nPRODUCT  FOR THE SECOND  QUARTER  \n(JULY -SEPTEMBER ) OF 2024 -25  \n \n \n \n \n \n \nNATIONAL ACCOUNTS DIVISION  \nNATIONAL STATISTIC S OFFICE  \nMINISTRY OF STATISTICS & PROGRAMME \nIMPLEMENTATION  \nGOVERNMENT OF INDIA  \nThis Press Release is embargoed against publication, telecast or circulation on internet till 4:00 pm on 29th November , 2024 . \n1 \n GOVERNMENT OF INDIA  \nMINISTRY OF STATISTICS AND PROGRAMME IMPLEMENTATION  \n Dated 8 Agrahayana , 194 6 Saka  \n 29th November , 202 4  \nPRESS NOTE  \nON \nESTIMATES OF GROSS DOMESTIC PRODUCT FOR THE SECOND  QUARTER  \n(JULY -SEPTEMBER ) OF 2024 -25',
 'Dated 8 Agrahayana , 194 6 Saka  \n 29th November , 202 4  \nPRESS NOTE  \nON \nESTIMATES OF GROSS DOMESTIC PRODUCT FOR THE SECOND  QUARTER  \n(JULY -SEPTEMBER ) OF 2024 -25 \n \nThe Na

### Load the text chunks into the vector store



In [21]:

astra_vector_store.add_texts(texts[:50])

print("Inserted %i chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 40 chunks.


### Run the QA cycle

Run the cell and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _What is the current GDP?_
- _How much will the agriculture target be increased to, and what will the focus be?_
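Conceptually, each question-answering round retrieves the most similar chunks and places them in the prompt sent to the LLM. The sketch below shows one way such a prompt could be assembled. The template wording and the helper `build_prompt` are hypothetical; the prompt that `VectorStoreIndexWrapper.query` builds internally is not shown in this notebook and may differ.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user's question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder chunks standing in for the results of a similarity search.
retrieved = ["First retrieved chunk.", "Second retrieved chunk."]
print(build_prompt("What is the current GDP?", retrieved))
```

Because the model only sees the retrieved chunks, answer quality depends on whether the relevant passage is among them, which is why the loop also prints the top documents and their scores.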


In [22]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


QUESTION: "Half-Yearly Estimates and Growth Rates"
ANSWER: "According to the text, the half-yearly estimates and growth rates are as follows:

* Real GDP (GDP at Constant Prices) in H1 (April-September) of 2024-25 is estimated at ₹139.78 lakh crore, showing a growth rate of 6.2% over H1 of the previous financial year.
* Nominal GVA (Gross Value Added) in H1 of 2024-25 is estimated at ₹139.78 lakh crore, showing a growth rate of 8.9% over H1 of the previous financial year.

These estimates are based on the data provided in Fig. 4: Half-Yearly GDP and GVA Estimates along with Y-o-Y Growth Rates from H1 2021-22 to H1 2024-25 at Constant Prices."

FIRST DOCUMENTS BY RELEVANCE:
    [0.7788] "September) Estimates of Gross Value Added (GVA) at Basic Prices by kind of economic  ..."
    [0.7703] "4 
 Fig. 3: Composition and Growth Rates of Quarterly GVA in Broad Sectors  
 
 
 
  ..."
    [0.7577] "H1, 2023 -24, showing a growth rate of 6. 2%. Nominal GVA in H1 of 2024 -25 is estim ..."
    [