# PDF Query with LangChain and AstraDB

## Overview of the Notebook

This notebook walks through document retrieval and question answering over a PDF with the LangChain library. The workflow has two stages:

### Document Processing:

1/ Read the PDF Document:
Uses PyPDF2 to extract the text of a PDF document.

2/ Chunk the Text:
Splits the extracted text into overlapping chunks with LangChain's `CharacterTextSplitter`. Small chunks bound the number of tokens sent to the embedding model and, later, to the LLM.

3/ Embed the Chunks:
Converts each chunk into a vector with the OpenAI embedding model (`OpenAIEmbeddings`). The vector captures the semantic meaning of the chunk so that it can be compared with a question later.

4/ Add Embeddings to Vector Database:
Stores the embedded chunks in AstraDB through LangChain's Cassandra vector store, which handles both storage and similarity search.
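
To make the shape of the stored data concrete, here is a minimal sketch of steps 3 and 4 with stand-ins. `toy_embed` and `ingest` are hypothetical helpers: a word-hashing function takes the place of the OpenAI embedding model, and a plain Python list takes the place of AstraDB. It only illustrates that each chunk ends up stored next to its vector.

```python
import hashlib
import math

def toy_embed(text, dim=8):
    """Stand-in for a real embedding model: hash each word into one of `dim` buckets."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    # Normalize to unit length so vectors are comparable regardless of chunk size
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def ingest(chunks, store):
    """Embed each chunk and append a {text, vector} record to the store."""
    for chunk in chunks:
        store.append({"text": chunk, "vector": toy_embed(chunk)})
    return store

store = ingest(["attention is all you need", "the encoder maps an input sequence"], [])
print(len(store), len(store[0]["vector"]))
```

A real embedding model returns vectors with far more dimensions, and vectors that reflect meaning rather than word hashes, but the record layout is the same idea.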

### Query Process:

1/ Embed the Question:
Embeds the user's question with the same embedding model used for the chunks, so that the question and the chunks live in the same vector space.

2/ Search for Top K Similar Embeddings:
Queries the vector database for the K stored embeddings closest to the question embedding. The matching chunks are the passages most likely to contain the answer.

3/ Output the Answer with Top K Similarities:
Passes the most similar chunks to the OpenAI LLM as context so that it can generate an answer, then prints that answer alongside the top K chunks and their similarity scores.
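
The top-K search in step 2 can be sketched in plain Python. This is an illustration under assumptions, not the code the vector store runs: it scores by cosine similarity (the metric the store actually uses may differ), scans every record instead of using an index, and the three-dimensional vectors and the names `cosine_similarity` and `top_k` are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, records, k=3):
    """Return the k (score, text) pairs with the highest similarity to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical (text, vector) records standing in for embedded chunks
records = [
    ("chunk about attention", [1.0, 0.0, 0.0]),
    ("chunk about training", [0.0, 1.0, 0.0]),
    ("chunk about attention heads", [0.9, 0.1, 0.0]),
]
results = top_k([1.0, 0.0, 0.0], records, k=2)
for score, text in results:
    print(f"[{score:.2f}] {text}")
```

A production vector database avoids the full scan with an approximate nearest-neighbor index, which is what makes the search practical for large collections.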

In short, the notebook combines PDF text extraction, chunking, embedding, and vector search to answer questions about a document interactively. LangChain supplies the text splitter, the embedding and LLM wrappers, and the index wrapper; AstraDB stores the vectors and serves the similarity search.

In [1]:
# LangChain components
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
# cassio provides the AstraDB integration for LangChain
import cassio
# Read PDF files
from PyPDF2 import PdfReader
# Load environment variables from a .env file
import os
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
ASTRA_DB_APPLICATION_TOKEN = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_ID = os.getenv("ASTRA_DB_ID")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
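
`os.getenv` returns `None` for a variable that is not set, and the failure then only shows up later when a client is created. A small check like the following can surface the problem early. It is a sketch: `missing_env_vars` and `REQUIRED` are hypothetical names, not part of the notebook.

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

REQUIRED = ["ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_ID", "OPENAI_API_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    # Report every missing name at once instead of failing on the first one
    print("Missing environment variables:", ", ".join(missing))
```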

### Document processing

In [3]:
pdfreader = PdfReader("attention_is_all_you_need.pdf")

In [4]:
# Extract the text of every page and concatenate it
raw_text = ""
for page in pdfreader.pages:
    content = page.extract_text()
    # Skip pages for which no text could be extracted
    if content:
        raw_text += content

In [5]:
# Connect to AstraDB
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

In [6]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)


In [7]:
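# Create the LangChain vector store backed by AstraDB
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_pdf",
    session=None,   # None: fall back to the session set up by cassio.init()
    keyspace=None,  # None: fall back to the keyspace set up by cassio.init()
)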

In [8]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into chunks to keep the token count per chunk small
text_splitter = CharacterTextSplitter(
    separator="\n",      # split on line breaks
    chunk_size=800,      # maximum chunk length, measured by length_function
    chunk_overlap=200,   # characters shared between consecutive chunks
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
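
How `separator`, `chunk_size`, and `chunk_overlap` interact can be seen in a simplified re-implementation. The sketch below is not LangChain's code and will not reproduce its chunks exactly; `split_with_overlap` is a hypothetical function that greedily packs separator-delimited pieces into chunks and carries a tail of at most `chunk_overlap` characters into the next chunk.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedy sketch: pack pieces into chunks of at most chunk_size characters,
    carrying trailing pieces (up to chunk_overlap characters) into the next chunk.
    A single piece longer than chunk_size is emitted as an oversized chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the tail fits the overlap budget
            # and there is room for the incoming piece
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

sample = "\n".join(f"line {i:02d}" for i in range(10))
chunks = split_with_overlap(sample, chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

In the sample, each chunk starts with the last line of the previous one. That repetition is the overlap: a sentence cut at a chunk boundary still appears whole in at least one chunk.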

In [9]:
# Embed the chunks and load them into the vector store
astra_vector_store.add_texts(texts)
print("Inserted %i chunks" % len(texts))
astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 67 chunks
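
The query output further down lists the same passage three times with the same score, which would be consistent with this cell having been run more than once so that each chunk was stored several times. One possible guard is to derive a stable ID from each chunk's content. The sketch below only computes such IDs; `chunk_ids` is a hypothetical helper, and whether `add_texts` accepts an `ids` argument and overwrites rows with an existing ID is an assumption to verify against the installed LangChain version.

```python
import hashlib

def chunk_ids(chunks):
    """Derive a stable ID per chunk from its content, so identical text maps to the same ID."""
    return [hashlib.sha256(chunk.encode("utf-8")).hexdigest() for chunk in chunks]

sample_chunks = ["first chunk", "second chunk", "first chunk"]
ids = chunk_ids(sample_chunks)
print(len(set(ids)), "unique IDs for", len(sample_chunks), "chunks")

# Assumed usage with the notebook's store (not run here):
# astra_vector_store.add_texts(texts, ids=chunk_ids(texts))
```

Clearing the `qa_pdf` table before re-running the insert cell would be another way to avoid duplicates.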


### Query process

In [10]:
while True:
    query_text = input("\nEnter your question (or type 'quit' to exit): ")
    if query_text.lower() == "quit":
        break
    # Retrieve the most relevant chunks and let the LLM answer from them
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("\nAnswer: %s" % answer)
    print("First documents by relevance:")
    # Show the three closest chunks with their similarity scores
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=3):
        print("[%0.2f] %s" % (score, doc.page_content[:100]))

Answer: Transformers are transduction models that rely entirely on self-attention to compute representations of input and output without using sequence-aligned RNNs or convolution. They have an encoder-decoder structure and use self-attention to generate continuous representations of input sequences, which are then used by the decoder to generate an output sequence one element at a time.
First documents by relevance:
[0.90] To the best of our knowledge, however, the Transformer is the first transduction model relying
entir
[0.90] To the best of our knowledge, however, the Transformer is the first transduction model relying
entir
[0.90] To the best of our knowledge, however, the Transformer is the first transduction model relying
entir

Answer: An attention mechanism is a method used in sequence modeling and transduction tasks to allow the model to focus on relevant parts of the input or output sequence, regardless of their distance. This is achieved by assigning weights to dif