<a href="https://colab.research.google.com/github/alfredolozano/pdf-RAG/blob/main/Hybrid_Search_Scanned_PDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Chat with the Scanned PDF

In [None]:
!pip install langchain rank_bm25 pypdf unstructured chromadb

In [None]:
!pip install "unstructured[pdf]"

In [None]:
!apt-get install -y poppler-utils

In [None]:
!apt-get install -y tesseract-ocr
!apt-get install -y libtesseract-dev
!pip install pytesseract

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain.vectorstores import Chroma

from langchain.llms import HuggingFaceHub
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

from langchain.retrievers import BM25Retriever, EnsembleRetriever

In [None]:
import os
from getpass import getpass

# Prompt for the token instead of hardcoding a secret in the notebook
HF_TOKEN = getpass("Hugging Face API token: ")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_TOKEN

In [None]:
path1 = "./scan.pdf"
data1 = UnstructuredPDFLoader(path1)
content = data1.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
print(content[0].page_content)

In [None]:
path2 = "./sample_2.pdf"
data2 = UnstructuredPDFLoader(path2)
content2 = data2.load()

In [None]:
content2

In [None]:
docs = content + content2

In [None]:
docs

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=50)
chunks = splitter.split_documents(docs)
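
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of fixed-size windows that share an overlap. It is a simplification: `RecursiveCharacterTextSplitter` splits on separators first and then merges pieces, so its real chunks vary in length. The helper name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 256, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window start advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical 600-character sample, only to show the window arithmetic
sample = "".join(str(i % 10) for i in range(600))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])
```

With these settings each window starts 206 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.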

In [None]:
embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key=HF_TOKEN, model_name="BAAI/bge-base-en-v1.5"
)

In [None]:
vectorstore = Chroma.from_documents(chunks, embeddings)

In [None]:
retriever_vectordb = vectorstore.as_retriever(search_kwargs={"k": 2})

In [None]:
keyword_retriever = BM25Retriever.from_documents(chunks)
keyword_retriever.k = 2
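
The keyword retriever ranks chunks with BM25. As a rough illustration of that scoring, here is a small self-contained sketch. It assumes whitespace tokenization, `k1=1.5`, `b=0.75`, and a non-negative IDF variant, so its numbers will not match `rank_bm25` exactly. The function name `bm25_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in corpus against query with a simplified BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents that contain each term at least once
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Assumed IDF variant: always non-negative
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, invented for illustration only
corpus = [
    "invoice number and payment date",
    "shipping address for the order",
    "warranty terms and conditions",
]
scores = bm25_scores("payment date", corpus)
print(scores)
```

Only the first toy document contains the query terms, so it is the only one with a non-zero score. This exact-term behavior is what the keyword retriever contributes alongside the embedding-based retriever.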

In [None]:
ensemble_retriever = EnsembleRetriever(
    retrievers=[retriever_vectordb, keyword_retriever],
    weights=[0.5, 0.5],
)
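
`EnsembleRetriever` is documented as merging the ranked lists with weighted Reciprocal Rank Fusion. The sketch below shows that idea on plain lists of IDs. The constant `c=60`, the function name `weighted_rrf`, and the chunk IDs are assumptions for illustration, not the library's code.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each item earns weight / (rank + c) from every list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-2 results from the vector retriever and the BM25 retriever
dense = ["chunk_a", "chunk_b"]
sparse = ["chunk_b", "chunk_c"]
fused = weighted_rrf([dense, sparse], weights=[0.5, 0.5])
print(fused)
```

A chunk returned by both retrievers (`chunk_b` here) collects score from both lists and moves to the top, which is the main reason to combine the two retrievers.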

In [None]:
llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-alpha",
    model_kwargs={"temperature": 0.5, "max_new_tokens": 512},
)

In [None]:
template = """
<|system|>
You are an AI Assistant that follows instructions extremely well.
Please be truthful and give direct answers. Say 'I don't know' if the answer to the user query is not in CONTEXT.

Keep in mind that you will lose the job if you answer questions that are outside CONTEXT.

CONTEXT: {context}
</s>
<|user|>
{query}
</s>
<|assistant|>
"""

In [None]:
prompt = ChatPromptTemplate.from_template(template)
output_parser = StrOutputParser()

In [None]:
chain = (
    {"context": ensemble_retriever, "query": RunnablePassthrough()}
    | prompt
    | llm
    | output_parser
)
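
In the chain above, the retriever's list of documents is passed directly into `{context}`, so the prompt receives the string form of that list. A minimal sketch of a formatter that keeps only the chunk text is shown below. `format_docs` and the stand-in `Doc` class are hypothetical, and `Doc` only mimics the `page_content` attribute of a LangChain document.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document; only page_content is modelled."""
    page_content: str


def format_docs(docs: list[Doc]) -> str:
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Hypothetical retrieved chunks
retrieved = [Doc("First retrieved chunk."), Doc("Second retrieved chunk.")]
context = format_docs(retrieved)
print(context)
```

One way to use such a helper would be `"context": ensemble_retriever | format_docs` in the chain's input mapping, which keeps metadata and object wrappers out of the prompt.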

In [None]:
print(chain.invoke("In what year was the letter sent to PN Condall in scan document?"))

The year mentioned in the scan document is 1972. The letter was sent to Dr. P.N. Cundall, Mining Surveys Ltd., Holroyd Road, Reading, Berks, on January 18th of that year.


In [None]:
print(chain.invoke("who is PJ Cross in scan document?"))

PJ Cross is the Group Leader - Facsimile Research mentioned in the third document of the scan document provided.


In [None]:
print(chain.invoke("who is Messi?"))

I don't have information about the specific context you are referring to. however, based on the given context, messi is not mentioned. messi is a professional football player who plays for the spanish football club barcelona and the argentine national team.
