In [None]:
!pip install langchain-community==0.2.15 langchain-chroma==0.1.3 langchain-text-splitters==0.2.2 langchain-huggingface==0.0.3 langchain-groq==0.1.9 "unstructured[pdf]==0.15.0" nltk==3.8.1



In [None]:
!apt-get install -y poppler-utils
!apt-get install -y tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.5).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [None]:
import os

from langchain_community.document_loaders import UnstructuredPDFLoader, DirectoryLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_groq import ChatGroq
from langchain.chains import RetrievalQA

In [None]:
GROQ_API_KEY = "your_groq_api_key"

In [None]:
os.environ["GROQ_API_KEY"] = GROQ_API_KEY
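
Hard-coding the key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key may already be set in the environment: the helper `resolve_api_key` is hypothetical (not part of LangChain or Groq) and falls back to a hidden prompt only when the variable is missing.

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional


def resolve_api_key(
    name: str = "GROQ_API_KEY",
    env: Optional[Mapping[str, str]] = None,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the key from the environment, prompting only if it is missing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        # getpass hides the typed value, so it does not end up in cell output.
        key = prompt(f"Enter {name}: ").strip()
    if not key:
        raise ValueError(f"{name} is empty")
    return key
```

Used in place of the two cells above, this would read `os.environ["GROQ_API_KEY"] = resolve_api_key()`.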

In [None]:
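# Parse every PDF directly under data/ with unstructured.
# In the loader's default mode each file becomes a single Document.
loader = DirectoryLoader("data/", glob="./*.pdf", loader_cls=UnstructuredPDFLoader)
documents = loader.load()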

In [None]:
text_splitter = CharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=500
)

text_chunks = text_splitter.split_documents(documents)
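
To see what `chunk_size=2000` and `chunk_overlap=500` mean, here is a simplified sketch of overlapping chunks. It is an illustration only: `sliding_chunks` is a hypothetical helper that cuts at fixed character offsets, whereas `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its chunk boundaries will differ.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "0123456789" * 500  # 5000 characters of dummy text
chunks = sliding_chunks(sample, chunk_size=2000, chunk_overlap=500)
print(len(chunks))                          # 3
print(chunks[0][-500:] == chunks[1][:500])  # True
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.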



In [None]:
persist_directory = "doc_db"

In [None]:
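# The library's default model, named explicitly so the choice is visible.
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")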



In [None]:
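# Embed every chunk and store the vectors in a Chroma collection
# that is persisted on disk under doc_db/.
vectorstore = Chroma.from_documents(
    documents=text_chunks,
    embedding=embedding,
    persist_directory=persist_directory
)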

In [None]:
retriever = vectorstore.as_retriever()
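
Conceptually, the retriever embeds the query and returns the chunks whose vectors are closest to it. The sketch below illustrates that idea in plain Python. It is not Chroma's implementation: `cosine_similarity` and `top_k` are hypothetical helpers, cosine similarity is an assumed metric (the actual metric and index depend on how the Chroma collection is configured), and `k=4` mirrors what is believed to be the default number of results.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 4,
) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings": chunk 0 points the same way as the query.
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], toy_chunks, k=2))
```

With the real retriever, the number of returned chunks can be changed with `vectorstore.as_retriever(search_kwargs={"k": 2})`.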

In [None]:
# temperature=0 makes the answers as repeatable as the model allows.
llm = ChatGroq(
    model="llama-3.1-8b-instant",
    temperature=0
)

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
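
The `"stuff"` chain type places all retrieved chunks into a single prompt and makes one LLM call. A rough sketch of that step follows. `build_stuff_prompt` is a hypothetical helper, and the template wording is an assumption for illustration, not the exact prompt LangChain sends.

```python
from typing import Sequence


def build_stuff_prompt(question: str, chunks: Sequence[str]) -> str:
    """Concatenate all retrieved chunks into one prompt for a single LLM call."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What is the RBD?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Because every chunk goes into one prompt, the retrieved text has to fit in the model's context window alongside the question, which is worth keeping in mind with 2000-character chunks.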

In [None]:
query = "What does the document say about DELINEATING THE CLINICAL SYNDROME?"
response = qa_chain.invoke({"query": query})

In [None]:
print(response)

{'query': 'What does the document say about DELINEATING THE CLINICAL SYNDROME?', 'result': 'The document states that it is useful to divide Respiratory Tract Infections (RTIs) into those involving the upper and the lower tracts. \n\nThe clinical syndromes involving the upper tract include:\n\n- Otitis media\n- Mastoiditis\n- Sinusitis\n- Pharyngitis\n\nInfections of the lower tract can be divided into:\n\n- Tracheobronchitis\n- Bronchiolitis\n- Pneumonia\n\nMost of these conditions can exist in acute and chronic forms. Acute disease is usually caused by viral or bacterial infections, and chronic disease is usually caused by fungi, slow-growing bacteria such as mycobacteria, bacteria adapted to persist in biofilms, and occasional less common pathogens such as parasites. Chronic infections can also develop when structural changes occur as a result of recurrent or severe acute infections, surgical intervention, or other processes that alter the structural integrity of the respiratory trac

In [None]:
print(response["result"])

The document states that it is useful to divide Respiratory Tract Infections (RTIs) into those involving the upper and the lower tracts. 

The clinical syndromes involving the upper tract include:

- Otitis media
- Mastoiditis
- Sinusitis
- Pharyngitis

Infections of the lower tract can be divided into:

- Tracheobronchitis
- Bronchiolitis
- Pneumonia

Most of these conditions can exist in acute and chronic forms. Acute disease is usually caused by viral or bacterial infections, and chronic disease is usually caused by fungi, slow-growing bacteria such as mycobacteria, bacteria adapted to persist in biofilms, and occasional less common pathogens such as parasites. Chronic infections can also develop when structural changes occur as a result of recurrent or severe acute infections, surgical intervention, or other processes that alter the structural integrity of the respiratory tract.
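
Because the chain was built with `return_source_documents=True`, the response also carries the retrieved chunks under `"source_documents"`. A small sketch for checking which files an answer drew on: `summarize_sources` is a hypothetical helper, and it assumes the loader stored each file path under the metadata key `"source"`. The stand-in objects below only mimic the `metadata` attribute of a LangChain `Document`.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def summarize_sources(docs: Iterable[Any]) -> Dict[str, int]:
    """Count how many retrieved chunks came from each source file."""
    counts = Counter(doc.metadata.get("source", "unknown") for doc in docs)
    return dict(counts)


# Stand-ins for retrieved documents; the file names are made up.
fake_docs = [
    SimpleNamespace(page_content="...", metadata={"source": "data/a.pdf"}),
    SimpleNamespace(page_content="...", metadata={"source": "data/a.pdf"}),
    SimpleNamespace(page_content="...", metadata={}),
]
print(summarize_sources(fake_docs))
```

On the real response this would be called as `summarize_sources(response["source_documents"])`.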


In [None]:
query = "What does the document say about Receptor-binding domain?"
response = qa_chain.invoke({"query": query})
print(response["result"])

The document discusses the Receptor-binding domain (RBD) of the SARS-CoV-2 spike glycoprotein in several sections. Here are the key points:

1. **Location and Structure**: The RBD is located on the S1 subunit of the spike glycoprotein and has a string of domains, including the N-terminal domain (NTD), subdomain 1 (SD1), and the RBD itself. The RBD harbors the ACE2-binding site, which lies across the top of the RBD, spanning the neck and shoulders.

2. **Conformation**: The RBD adopts a range of configurations on the spike, from 'up' to 'down', and only the up conformation can interact with ACE2.

3. **Binding of Neutralizing Antibodies**: The RBD is a major target for neutralizing antibodies, which can be grouped into several clusters based on their epitopes: left shoulder, neck, right shoulder, left flank, and right flank.

4. **Importance of RBD in Neutralization**: Most potent neutralizing antibodies induced by vaccination or natural infection target the RBD and usually interfere wi