# Quickstart: Querying PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), you should get a DB Token with role _Database Administrator_ and copy your Database ID: these connection parameters are needed momentarily.

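You also need a Groq API key for this demo to work, since the LLM is accessed through `ChatGroq`. (The CassIO [LLM access](https://cassio.org/start_here/#llm-access) page describes obtaining an OpenAI API key, which this notebook does not use.)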

#### What you will do:

- Setup: import dependencies, provide secrets, read the PDF, create the LangChain vector store;
- Run a Question-Answering loop retrieving the relevant text chunks and having an LLM construct the answer.

Install the required dependencies:

In [64]:
!pip install -q cassio datasets langchain openai tiktoken langchain-HuggingFace langchain-community langchain-groq

Import the packages you'll need:

In [65]:
# LangChain components to use
from langchain_community.vectorstores import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_groq import ChatGroq

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [6]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [7]:
from PyPDF2 import PdfReader

### Setup

In [None]:
ASTRA_DB_APPLICATION_TOKEN = "your api keys"  # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "your api keys"  # enter your Database ID

groq_API_KEY = "your api keys"  # enter your Groq API key
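Hard-coding secrets in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, the values could instead be read from environment variables, with a hidden prompt as a fallback. The helper name `read_secret` and the environment variable names in the usage comments are assumptions made for illustration; they are not part of Astra DB, CassIO, or LangChain.

```python
import os
from getpass import getpass


def read_secret(name: str, prompt: str) -> str:
    """Return the secret stored in the environment variable `name`.

    Falls back to a hidden interactive prompt when the variable is unset.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        # getpass hides the typed value, so it does not appear in the cell output.
        value = getpass(prompt).strip()
    if not value:
        raise ValueError(f"No value provided for {name}")
    return value


# Possible usage in this notebook (variable names are hypothetical):
# ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN", "Astra DB token: ")
# ASTRA_DB_ID = read_secret("ASTRA_DB_ID", "Astra DB database ID: ")
# groq_API_KEY = read_secret("GROQ_API_KEY", "Groq API key: ")
```

With this approach the cell can be committed or shared without containing any credentials.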

In [9]:
# Provide the path to the PDF file to query
pdfreader = PdfReader('Paper.pdf')

In [10]:
# Read the text of every page of the PDF into a single string
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:  # skip pages with no extractable text
        raw_text += content

In [11]:
raw_text

'See discussions, st ats, and author pr ofiles f or this public ation at : https://www .researchgate.ne t/public ation/359246513\nSoftware-Engineering Design Patterns for Machine Learning Applications\nArticle \xa0\xa0 in\xa0\xa0Comput er · Mar ch 2022\nDOI: 10.1109/MC.2021.3137227\nCITATIONS\n58READS\n2,898\n7 author s, including:\nHironori W ashiz aki\nWaseda Univ ersity\n445 PUBLICA TIONS \xa0\xa0\xa03,670  CITATIONS \xa0\xa0\xa0\nSEE PROFILE\nYann-Gaël Guéhéneuc\nConc ordia Univ ersity\n368 PUBLICA TIONS \xa0\xa0\xa011,389  CITATIONS \xa0\xa0\xa0\nSEE PROFILE\nHironori T akeuchi\nMusashi Univ ersity\n34 PUBLICA TIONS \xa0\xa0\xa0181 CITATIONS \xa0\xa0\xa0\nSEE PROFILE\nSatoshi Ok uda\nPrimestyle c o.\n8 PUBLICA TIONS \xa0\xa0\xa081 CITATIONS \xa0\xa0\xa0\nSEE PROFILE\nAll c ontent f ollo wing this p age was uplo aded b y Yann-Gaël Guéhéneuc  on 17 May 2022.\nThe user has r equest ed enhanc ement of the do wnlo aded file.Software Engineering Patterns for Machine Learning\nApplicatio
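The extracted text above contains non-breaking spaces (`\xa0`) and runs of repeated spaces. A light cleanup pass before splitting may help the embeddings focus on content. The following is only a sketch: the function name `normalize_extracted_text` is hypothetical, and it does not attempt to repair words that the PDF extraction split apart (such as `st ats`).

```python
import re
import unicodedata


def normalize_extracted_text(text: str) -> str:
    """Apply light whitespace cleanup to text extracted from a PDF."""
    # NFKC maps compatibility characters, including the non-breaking
    # space (\xa0), to their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs, but keep newlines, because the
    # splitter used later in this notebook separates on "\n".
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing spaces on every line.
    return "\n".join(line.strip() for line in text.split("\n"))


sample = "PUBLICA TIONS \xa0\xa0\xa03,670  CITATIONS \xa0\xa0\xa0\nSEE PROFILE"
cleaned = normalize_extracted_text(sample)
```

If used here, the call would be `raw_text = normalize_extracted_text(raw_text)` before the text is split into chunks.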

Initialize the connection to your database:

_(do not worry if you see a few warnings, it's just that the drivers are chatty about negotiating protocol versions with the DB.)_

In [12]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

Create the LangChain embedding and LLM objects for later usage:

In [69]:
# Groq-hosted chat model used to generate the answers
llm = ChatGroq(model_name='gemma2-9b-it', api_key=groq_API_KEY)
# Sentence-transformers model used to embed both the text chunks and the queries
embedding = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

Create your LangChain vector store ... backed by Astra DB!

In [70]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)

In [71]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that each one stays within a manageable size
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)

In [72]:
texts[:50]

['See discussions, st ats, and author pr ofiles f or this public ation at : https://www .researchgate.ne t/public ation/359246513\nSoftware-Engineering Design Patterns for Machine Learning Applications\nArticle \xa0\xa0 in\xa0\xa0Comput er · Mar ch 2022\nDOI: 10.1109/MC.2021.3137227\nCITATIONS\n58READS\n2,898\n7 author s, including:\nHironori W ashiz aki\nWaseda Univ ersity\n445 PUBLICA TIONS \xa0\xa0\xa03,670  CITATIONS \xa0\xa0\xa0\nSEE PROFILE\nYann-Gaël Guéhéneuc\nConc ordia Univ ersity\n368 PUBLICA TIONS \xa0\xa0\xa011,389  CITATIONS \xa0\xa0\xa0\nSEE PROFILE\nHironori T akeuchi\nMusashi Univ ersity\n34 PUBLICA TIONS \xa0\xa0\xa0181 CITATIONS \xa0\xa0\xa0\nSEE PROFILE\nSatoshi Ok uda\nPrimestyle c o.\n8 PUBLICA TIONS \xa0\xa0\xa081 CITATIONS \xa0\xa0\xa0\nSEE PROFILE\nAll c ontent f ollo wing this p age was uplo aded b y Yann-Gaël Guéhéneuc  on 17 May 2022.',
 'SEE PROFILE\nSatoshi Ok uda\nPrimestyle c o.\n8 PUBLICA TIONS \xa0\xa0\xa081 CITATIONS \xa0\xa0\xa0\nSEE PROFILE\nAll c o
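Note how the second chunk above starts with text that also ends the first one: that is the effect of `chunk_overlap`. To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified greedy splitter written from scratch. It is an illustration of the idea only and is not LangChain's implementation; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Pack separator-delimited pieces into chunks of at most `chunk_size`
    characters, repeating up to `chunk_overlap` trailing characters' worth
    of pieces at the start of the next chunk.

    A single piece longer than `chunk_size` is emitted as its own
    oversized chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    def joined_len(parts):
        return len(separator.join(parts))

    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []  # pieces of the chunk currently being built

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until what remains fits within the
            # overlap budget and leaves room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


demo_chunks = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=9, chunk_overlap=4)
```

In the toy call, each chunk holds two four-character lines and shares one line with its neighbor, which mirrors the repeated `SEE PROFILE ...` text in the output above.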

### Load the text chunks into the vector store



In [73]:

astra_vector_store.add_texts(texts)

print("Inserted %i text chunks." % len(texts))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 39 text chunks.
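Running the insertion cell more than once may store the same chunks again, which can later show up as repeated search results. One possible safeguard is to derive a stable ID from each chunk's content so that re-inserting the same text targets the same row. The sketch below uses only the standard library; the helper names are hypothetical, and passing the IDs through an `ids` argument of `add_texts` is an assumption to verify against the vector store version you have installed.

```python
import hashlib


def content_ids(texts):
    """Return one stable ID per text, derived from a hash of its content."""
    return [hashlib.sha256(t.encode("utf-8")).hexdigest()[:32] for t in texts]


def unique_texts(texts):
    """Drop exact duplicate chunks while preserving their original order."""
    seen = set()
    result = []
    for t in texts:
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result


# Possible usage (assumes add_texts accepts an `ids` argument):
# deduped = unique_texts(texts)
# astra_vector_store.add_texts(deduped, ids=content_ids(deduped))
```

Because the ID depends only on the text, identical chunks always map to the same ID across runs.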


### Run the QA cycle

Run the cell below and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)
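Conceptually, each pass of the loop retrieves the chunks most similar to the question and hands them to the LLM as context. The sketch below shows one way such a prompt could be assembled. The wording of the prompt and the function name `build_prompt` are assumptions made for illustration; this is not the prompt that `VectorStoreIndexWrapper` uses internally.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user's question into one prompt string."""
    # Number the chunks so the model (and the reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_prompt = build_prompt(
    "Which patterns are mentioned?",
    ["ML Versioning is a design pattern. ", "  Handshake is another one."],
)
```

Restricting the model to the supplied context is what grounds the answer in the PDF rather than in the model's general knowledge.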


In [75]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit): design pattern for ML engineers

QUESTION: "design pattern for ML engineers"
ANSWER: "According to the text provided, the ML design patterns are:

* "Handshake"
* "Isolate and Validate Output of Model"
* "ML Versioning"
* "Test Infrastructure Independently from ML"
* "Wrap Black-box Packages into Common APIs" 


Let me know if you have any other questions."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8480] "ML design patterns: “Handshake”, “Isolate and Validate Output of Model”, “ML Version ..."
    [0.8480] "ML design patterns: “Handshake”, “Isolate and Validate Output of Model”, “ML Version ..."
    [0.8480] "ML design patterns: “Handshake”, “Isolate and Validate Output of Model”, “ML Version ..."
    [0.8480] "ML design patterns: “Handshake”, “Isolate and Validate Output of Model”, “ML Version ..."

What's your next question (or type 'quit' to exit): quit
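The bracketed numbers in the output above are similarity scores between the question and each stored chunk. As a rough illustration of how such a ranking works, the sketch below ranks toy two-dimensional vectors by cosine similarity. It is an assumption-laden simplification: real embeddings from `all-MiniLM-L6-v2` have hundreds of dimensions, and the exact score that Astra DB reports may be scaled differently. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def top_k(query_vec, docs, k=4):
    """Rank (text, vector) pairs by similarity to the query; return the best k."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    # sorted() is stable, so items with equal scores keep their input order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


toy_docs = [
    ("about patterns", [1.0, 0.0]),
    ("about patterns", [1.0, 0.0]),  # same text stored twice
    ("unrelated", [0.0, 1.0]),
    ("partly related", [1.0, 1.0]),
]
ranking = top_k([1.0, 0.0], toy_docs, k=3)
```

In the toy data, the text stored twice comes back twice with an identical score, which is consistent with the four identical results shown above if the same chunk was inserted several times.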
