<a href="https://colab.research.google.com/github/angelinflorence/largelanguagemodels/blob/main/exp4_PdfQueryLangchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PDF Query Using LangChain

In [None]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken

In [None]:
!pip install -U langchain-community
from PyPDF2 import PdfReader  # PDF text extraction
from langchain.text_splitter import CharacterTextSplitter  # splits raw text into chunks
from langchain_community.vectorstores import FAISS  # vector store from the langchain-community package

In [3]:
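import os
# The cells below call a Hugging Face endpoint, so HF_TOKEN is the key that matters here;
# the OpenAI key is needed only if you switch to an OpenAI model.
os.environ["OPENAI_API_KEY"] = "please provide your key here"
os.environ["HF_TOKEN"] = "please provide your key here"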

In [4]:
# Provide the path to the PDF file.
pdfreader = PdfReader('/content/10.pdf')

In [5]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:  # skip pages with no extractable text
        raw_text += content

In [6]:
raw_text

'Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright ©2023. All\nrights reserved. Draft of February 3, 2024.\nCHAPTER\n10Transformers and\nLarge Language Models\n“How much do we know at any time? Much more, or so I believe, than we\nknow we know. ”\nAgatha Christie, The Moving Finger\nFluent speakers of a language bring an enormous amount of knowledge to bear dur-\ning comprehension and production. This knowledge is embodied in many forms,\nperhaps most obviously in the vocabulary, the rich representations we have of words\nand their meanings and usage. This makes the vocabulary a useful lens to explore\nthe acquisition of knowledge from text, by both people and machines.\nEstimates of the size of adult vocabularies vary widely both within and across\nlanguages. For example, estimates of the vocabulary size of young adult speakers of\nAmerican English range from 30,000 to 100,000 depending on the resources used\nto make the estimate and the deﬁnition of what it

In [7]:
# Split the text into overlapping chunks with CharacterTextSplitter so that no single chunk exceeds the model's token limit
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)

In [8]:
len(texts)

141
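
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a small pure-Python sketch. This is a simplified stand-in written for illustration, not LangChain's actual implementation; the function name `split_with_overlap` and the toy input are assumptions. It greedily merges separator-delimited pieces until the next piece would exceed `chunk_size`, then carries a tail of at most `chunk_overlap` characters into the next chunk.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedily merge separator-delimited pieces into overlapping chunks.

    Hypothetical helper for illustration; CharacterTextSplitter may differ in details.
    """
    pieces = [p for p in text.split(separator) if p]
    sep_len = len(separator)

    def length(parts):
        # Length of the parts once joined with the separator
        if not parts:
            return 0
        return sum(len(p) for p in parts) + sep_len * (len(parts) - 1)

    chunks, current = [], []
    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail fits the
            # overlap budget and there is room for the next piece
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy input: ten 7-character lines, with a small chunk size so the overlap is visible
sample = "\n".join(f"line {i:02d}" for i in range(10))
chunks = split_with_overlap(sample, chunk_size=30, chunk_overlap=10)
for c in chunks:
    print(repr(c))
```

With these toy settings, each chunk ends with the line that starts the next one, which is the overlap that helps a retrieved chunk keep some surrounding context.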

In [None]:
!pip install langchain-huggingface

In [16]:
from langchain_huggingface import HuggingFaceEmbeddings

encode_kwargs = {'normalize_embeddings': False}
model_kwargs = {'device': 'cuda'}  # requires a GPU runtime; use 'cpu' if none is available
# Initialize HuggingFaceEmbeddings with a sentence-transformers model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2",
                                   model_kwargs = model_kwargs,
                                   encode_kwargs = encode_kwargs)


In [17]:
document_search = FAISS.from_texts(texts, embeddings)

In [18]:
document_search


<langchain_community.vectorstores.faiss.FAISS at 0x78c6c2316c50>
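
Conceptually, the vector store keeps one embedding per chunk and, at query time, returns the chunks whose embeddings lie closest to the query embedding. The sketch below shows that idea as a brute-force nearest-neighbour search in pure Python. It assumes Euclidean (L2) distance and uses made-up 3-dimensional vectors; the real index holds the much larger vectors produced by the embedding model, and its internals may differ. The names `l2_distance`, `toy_search`, and `toy_index` are hypothetical.

```python
import math


def l2_distance(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_search(query_vec, index, k=4):
    """Return the k texts whose vectors are closest to query_vec.

    `index` is a list of (text, vector) pairs; hypothetical helper for illustration.
    """
    ranked = sorted(index, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Made-up embeddings, chosen by hand for the example
toy_index = [
    ("attention uses queries and keys", [0.9, 0.1, 0.0]),
    ("vocabulary size estimates", [0.0, 0.8, 0.2]),
    ("softmax over scores", [0.7, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]
top = toy_search(query_vec, toy_index, k=2)
print(top)
```

A real index avoids sorting every vector on each query, but the result it aims for is the same: the `k` nearest chunks.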

In [19]:
from langchain.chains.question_answering import load_qa_chain
from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    max_new_tokens=512,
    do_sample=False,
    repetition_penalty=1.03,
)

In [20]:
chain = load_qa_chain(llm, chain_type="stuff")
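
The `"stuff"` chain type places all retrieved chunks into a single prompt and sends that prompt to the LLM in one call. A rough sketch of the idea follows; the prompt wording and the helper name `build_stuff_prompt` are assumptions made for illustration and are not the template LangChain uses.

```python
def build_stuff_prompt(docs, question):
    """Concatenate all retrieved chunks into one prompt (hypothetical template)."""
    context = "\n\n".join(docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Toy stand-ins for the chunks returned by the similarity search
retrieved = [
    "Each input vector is projected into a query, a key, and a value.",
    "The scores are scaled and passed through a softmax.",
]
prompt = build_stuff_prompt(retrieved, "how to calculate query and key")
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined chunks fit in the model's context window, which is one reason the chunk size is kept small earlier.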

In [21]:
query = "context meaning"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

" In causal, or backward looking self-attention, the context refers to any of the prior words in the sequence. This type of self-attention uses the context as a way to build a contextualized representation of the meaning of a word at a specific position in the sequence. By combining information from the representation of the word at the previous layer with information from the representations of neighboring words, a more accurate and meaningful representation of the word can be produced. Essentially, the context helps to provide a richer understanding of the word's meaning by taking into account the words that come before and after it in the sequence."

In [22]:
query = "how to calculate query and key"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' According to the text, the query and key vectors are calculated by projecting each input vector xi into its role as a key or query using weight matrices WQ and WK, respectively. This is done using the following equations:\nqi = xiWQ, ki = xiWK.\nThese weight matrices are introduced by transformers to capture the three different roles that an input can have: query, key, and value. The resulting matrices Q, K, and V are then used to compute the attention output vector a using the self-attention formula:\na = Softmax(QK^T pdk) pdk(V), where pdk is a scaling factor.\nThis computation reduces the entire self-attention step for an entire sequence of N tokens to a single matrix multiplication.'
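
The formula in the answer above is garbled, most likely because the square-root sign was lost when the PDF text was extracted (`pdk` appears to stand for $\sqrt{d_k}$). For reference, the standard scaled dot-product self-attention that the answer describes can be written as:

$$
q_i = x_i W^Q, \qquad k_i = x_i W^K, \qquad v_i = x_i W^V
$$

$$
\mathrm{SelfAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
$$

Here $d_k$ is the dimensionality of the query and key vectors, and dividing by $\sqrt{d_k}$ keeps the dot products from growing too large before the softmax.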