## Querying a PDF using LangChain and Astra DB

In [1]:
!pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-16.0.0-cp39-cp39-macosx_11_0_arm64.whl.metadata (3.0 kB)
Collecting numpy>=1.16.6 (from pyarrow)
  Using cached numpy-1.26.4-cp39-cp39-macosx_11_0_arm64.whl.metadata (61 kB)
Downloading pyarrow-16.0.0-cp39-cp39-macosx_11_0_arm64.whl (26.0 MB)
Using cached numpy-1.26.4-cp39-cp39-macosx_11_0_arm64.whl (14.0 MB)
Installing collected packages: numpy, pyarrow
Successfully installed numpy-1.26.4 pyarrow-16.0.0


In [31]:
!pip install python-dotenv

Collecting python-dotenv
  Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Using cached python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [2]:
!pip install -q cassio datasets langchain openai tiktoken

In [3]:
!pip install PyPDF2

Collecting PyPDF2
  Using cached pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Using cached pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


### Import the packages needed

In [32]:
# Standard library
import os

# Third-party: Astra DB client, .env loading, PDF parsing
import cassio
from dotenv import load_dotenv
from PyPDF2 import PdfReader

# Support for dataset retrieval with Hugging Face (not used in the cells shown here)
from datasets import load_dataset

# LangChain components
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.vectorstores.cassandra import Cassandra

### Setup

In [33]:
load_dotenv()

True

In [34]:
ASTRA_DB_APPLICATION_TOKEN = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_ID = os.getenv("ASTRA_DB_ID")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
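
`os.getenv` returns `None` for a variable that is not set, so a missing or misspelled entry in `.env` would only surface later as a connection or authentication error. A minimal sketch of an early check follows; the helper name `missing_env_vars` is hypothetical and not part of the original notebook.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = ("ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_ID", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.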

In [35]:
# Provide the path to the PDF file.
pdfreader = PdfReader('Generative_AI.pdf')

In [36]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:  # skip pages with no extractable text
        raw_text += content

In [37]:
raw_text

'CATCHWORD\nGenerative AI\nStefan Feuerriegel •Jochen Hartmann •Christian Janiesch •\nPatrick Zschech\nReceived: 29 April 2023 / Accepted: 7 August 2023 / Published online: 12 September 2023\n/C211The Author(s) 2023\nKeywords Generative AI /C1Artiﬁcial intelligence /C1\nDecision support /C1Content creation /C1Information systems\n1 Introduction\nTom Freston is credited with saying ‘‘Innovation is taking\ntwo things that exist and putting them together in a new\nway’’. For a long time in history, it has been the prevailingassumption that artistic, creative tasks such as writing\npoems, creating software, designing fashion, and compos-\ning songs could only be performed by humans. Thisassumption has changed drastically with recent advances in\nartiﬁcial intelligence (AI) that can generate new content in\nways that cannot be distinguished anymore from humancraftsmanship.The term generative AI refers to computational tech-\nniques that are capable of generating seemingly new,\nmeaningful c

### Initialize the connection to the database

In [38]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

### Create the LangChain embedding and LLM objects

In [39]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

### Create your LangChain vector store (backed by Astra DB)

In [40]:
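astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,   # None: use the session set up by cassio.init() above
    keyspace=None,  # None: use the keyspace set up by cassio.init() above
)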

In [41]:
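from langchain.text_splitter import CharacterTextSplitter

# Split on newlines, then merge lines into chunks of roughly 800 characters,
# with up to 200 characters of overlap between consecutive chunks.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)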
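
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch of line-based chunking with overlap. It is an illustration only, not LangChain's implementation; the function name `merge_lines` and the small demo values are assumptions made for this example.

```python
def merge_lines(text, chunk_size=800, chunk_overlap=200, separator="\n"):
    """Greedy, simplified chunker for illustration (NOT LangChain's code)."""
    chunks, current = [], []
    for line in text.split(separator):
        candidate = separator.join(current + [line])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Carry over trailing lines whose joined length fits in chunk_overlap.
            carried = []
            for prev in reversed(current):
                if len(separator.join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(line)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Ten short lines, chunked with small limits so the overlap is easy to see.
demo = "\n".join(f"line {i:02d}" for i in range(10))
chunks = merge_lines(demo, chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

In this demo each chunk starts with the last line of the previous one. The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk.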

In [42]:
texts[:50]

['CATCHWORD\nGenerative AI\nStefan Feuerriegel •Jochen Hartmann •Christian Janiesch •\nPatrick Zschech\nReceived: 29 April 2023 / Accepted: 7 August 2023 / Published online: 12 September 2023\n/C211The Author(s) 2023\nKeywords Generative AI /C1Artiﬁcial intelligence /C1\nDecision support /C1Content creation /C1Information systems\n1 Introduction\nTom Freston is credited with saying ‘‘Innovation is taking\ntwo things that exist and putting them together in a new\nway’’. For a long time in history, it has been the prevailingassumption that artistic, creative tasks such as writing\npoems, creating software, designing fashion, and compos-\ning songs could only be performed by humans. Thisassumption has changed drastically with recent advances in\nartiﬁcial intelligence (AI) that can generate new content in',
 'ing songs could only be performed by humans. Thisassumption has changed drastically with recent advances in\nartiﬁcial intelligence (AI) that can generate new content in\nways that c

### Load the text chunks into the vector store

In [43]:
astra_vector_store.add_texts(texts)
print("Inserted %i chunks." % len(texts))
astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 150 chunks.
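
Re-running the insert cell adds the same chunks again unless each chunk has a stable id. A minimal sketch of deriving deterministic ids follows; the helper `chunk_ids` and its `prefix` value are hypothetical, and passing them via `ids=` assumes that your LangChain version's `add_texts` accepts that argument.

```python
import hashlib


def chunk_ids(chunks, prefix="generative_ai_pdf"):
    """Derive a deterministic id per chunk from its position and content."""
    ids = []
    for i, chunk in enumerate(chunks):
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]
        ids.append(f"{prefix}-{i}-{digest}")
    return ids


# Assumed usage with the store from this notebook (needs a live Astra DB connection):
# astra_vector_store.add_texts(texts, ids=chunk_ids(texts))
```

Because the ids depend only on position and content, inserting the same chunks twice should overwrite the existing rows rather than create new ones.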


### Run the Q/A cycle

In [44]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


QUESTION: "What are the concerns when embedding deep learning models in generative AI?"
ANSWER: "Concerns include the need for guidelines and governance frameworks, verifying model outputs, relying appropriately on generative AI systems, and addressing bias and fairness issues."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9325] "material such as teaching cases and recap questions. Fur-
ther, the educator’s commu ..."
    [0.9324] "material such as teaching cases and recap questions. Fur-
ther, the educator’s commu ..."
    [0.9319] "when to accept outputs of generative AI and when not.
Bias and fairness. Societal bi ..."
    [0.9319] "when to accept outputs of generative AI and when not.
Bias and fairness. Societal bi ..."
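
Each passage in the output above appears twice with an identical score, which is consistent with the table holding duplicate copies of the chunks (for example, after the insert cell was run more than once). As a sketch under that assumption, the results can be filtered by content before printing; the helper `unique_by_content` is hypothetical, and the stand-in documents only mimic the `page_content` attribute used in the loop above.

```python
from types import SimpleNamespace


def unique_by_content(scored_docs):
    """Keep the first (doc, score) pair for each distinct page_content."""
    seen, unique = set(), []
    for doc, score in scored_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append((doc, score))
    return unique


# Stand-in results shaped like similarity_search_with_score output.
results = [
    (SimpleNamespace(page_content="chunk A"), 0.9325),
    (SimpleNamespace(page_content="chunk A"), 0.9324),
    (SimpleNamespace(page_content="chunk B"), 0.9319),
]
for doc, score in unique_by_content(results):
    print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))
```

With duplicates in the table, a larger `k` may be needed to get four distinct passages after filtering.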
