<h1 style="text-align: center;">MedicalQA using RAG</h1>

## Install Libraries

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [12]:
%cd /content/drive/My Drive/LLMs

/content/drive/My Drive/LLMs


In [13]:
%ls


Documents/  LLM_1_Prompt.ipynb  LLM_2_RAG.ipynb  requirements.txt


In [22]:
import os
os.getcwd()

'/content/drive/MyDrive/LLMs'

In [14]:
!pwd

/content/drive/My Drive/LLMs


In [None]:
!pip install -r requirements.txt

## Import Dependencies

In [16]:
import pinecone
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Pinecone
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

In [17]:
# Set the Hugging Face API token (read at runtime so it is not stored in the notebook)
from langchain.llms import HuggingFaceHub
import os
from getpass import getpass
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("Hugging Face token: ")

Hugging Face token: ··········


## Load Documents

In [30]:
# Loads PDF files available in a directory with pypdf
def load_docs(directory):
  loader = PyPDFDirectoryLoader(directory)
  documents = loader.load()
  return documents

In [31]:
directory = 'Documents/'
documents = load_docs(directory)
len(documents)

36

In [32]:
documents[:3]

[Document(page_content='ScienceDirect\nAvailable online at www.sciencedirect.com\nProcedia Computer Science 192 (2021)  1031–1040\n1877-0509 © 2021 The Authors. Published by Elsevier B.V .\nThis is an open access article under the CC BY-NC-ND license ( https://creativecommons.org/licenses/by-nc-nd/4.0 )\nPeer-review under responsibility of the scientific committee of KES International.\n10.1016/j.procs.2021.08.106\n10.1016/j.procs.2021.08.106 1877-0509© 2021 The Authors. Published by Elsevier B.V .\nThis is an open access article under the CC BY-NC-ND license ( https://creativecommons.org/licenses/by-nc-nd/4.0 )\nPeer-review under responsibility of the scientific committee of KES International. Available online at www.sciencedirect.com\nProcedia Computer Science 00 (2021) 000–000\nwww.elsevier.com/locate/procedia\n25th International Conference on Knowledge-Based and Intelligent Information & Engineering\nSystems\nThyroid Disease Treatment prediction with machine learning\napproaches\nL

## Transform Documents

In [33]:
def split_docs(documents, chunk_size=512, chunk_overlap=20):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap,
                                                 separators=["\n\n", "\n", " ", ""])
  docs = text_splitter.split_documents(documents)
  return docs

In [34]:
docs = split_docs(documents)
print(len(docs))

337
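
To see what `chunk_size` and `chunk_overlap` control, the sketch below splits a string into fixed-size character windows, where each window repeats the last `chunk_overlap` characters of the previous one. It is a simplified illustration only: `RecursiveCharacterTextSplitter` also prefers to break at the listed separators, so its chunks are often shorter than `chunk_size` and will not match this output. The name `sliding_chunks` is hypothetical and not part of LangChain.

```python
def sliding_chunks(text, chunk_size=512, chunk_overlap=20):
    """Split text into character windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input (an assumption for illustration, not the notebook's PDFs)
sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the one before it, which is the role `chunk_overlap=20` plays for the 512-character chunks above.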


In [36]:
docs[:2]

[Document(page_content='ScienceDirect\nAvailable online at www.sciencedirect.com\nProcedia Computer Science 192 (2021)  1031–1040\n1877-0509 © 2021 The Authors. Published by Elsevier B.V .\nThis is an open access article under the CC BY-NC-ND license ( https://creativecommons.org/licenses/by-nc-nd/4.0 )\nPeer-review under responsibility of the scientific committee of KES International.\n10.1016/j.procs.2021.08.106\n10.1016/j.procs.2021.08.106 1877-0509© 2021 The Authors. Published by Elsevier B.V .', metadata={'source': 'Documents/thyroid.pdf', 'page': 0}),
 Document(page_content='This is an open access article under the CC BY-NC-ND license ( https://creativecommons.org/licenses/by-nc-nd/4.0 )\nPeer-review under responsibility of the scientific committee of KES International. Available online at www.sciencedirect.com\nProcedia Computer Science 00 (2021) 000–000\nwww.elsevier.com/locate/procedia\n25th International Conference on Knowledge-Based and Intelligent Information & Engineering\

## Generate Text Embeddings

Candidate embedding models can be compared on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

In [37]:
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")


In [38]:
query_result = embeddings.embed_query("Hello Buddy")
len(query_result)

384
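
A vector store compares embeddings such as this 384-dimensional one using a similarity metric. Cosine similarity is a common choice for sentence-transformer embeddings. Which metric the Pinecone index uses depends on how `myindex` was created, which this notebook does not show. A minimal pure-Python sketch with toy 3-dimensional vectors follows; the helper name `cosine_similarity` is an assumption, not a LangChain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors (made up for illustration)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

Vectors pointing the same way score close to 1 and unrelated (orthogonal) vectors score 0, regardless of their lengths.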

In [39]:
query_result

[-0.06978831440210342,
 0.054206281900405884,
 0.07814785838127136,
 0.033901214599609375,
 0.024947449564933777,
 -0.096737340092659,
 0.05952310189604759,
 0.05897821485996246,
 -0.017896702513098717,
 -0.023178933188319206,
 -0.01900021918118,
 0.0005969105404801667,
 0.024666057899594307,
 -0.07030836492776871,
 -0.007522519677877426,
 0.010224493220448494,
 -0.011180886067450047,
 -0.02124863490462303,
 -0.03859451785683632,
 0.026550395414233208,
 -0.06505241245031357,
 0.0650002583861351,
 0.009431838989257812,
 -0.06271228194236755,
 -0.023625416681170464,
 -0.0306380782276392,
 0.0599612221121788,
 0.07367488741874695,
 -0.03286781534552574,
 -0.026061031967401505,
 -0.00696710916236043,
 0.030617885291576385,
 0.059396643191576004,
 0.0014719105092808604,
 0.012021646834909916,
 0.0282936692237854,
 -0.05922521650791168,
 -0.07919757813215256,
 0.04896365851163864,
 0.023090068250894547,
 0.055362772196531296,
 -0.026251334697008133,
 -0.01732112281024456,
 0.0055110952816903

## Vector store - PINECONE

In [45]:
from pinecone import Pinecone as PineconeClient
from langchain_community.vectorstores import Pinecone

# Read the Pinecone API key at runtime rather than hard-coding it in the notebook
os.environ["PINECONE_API_KEY"] = getpass("Pinecone API key: ")

In [46]:
PineconeClient(api_key=os.environ["PINECONE_API_KEY"], environment="gcp-starter")
index_name = "myindex"
index = Pinecone.from_documents(docs, embeddings, index_name=index_name)

## Retrieve Answers

In [53]:
def get_similar_docs(query, k=4):
    similar_docs = index.similarity_search(query, k=k)
    return similar_docs
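
Conceptually, `similarity_search(query, k=k)` embeds the query and returns the `k` stored chunks whose vectors score highest against it. The sketch below mimics that with a toy in-memory list and dot-product scoring. The vectors, texts, and the name `top_k` are made up for illustration, and a hosted index is designed to avoid scanning every vector the way this brute-force loop does.

```python
import heapq


def top_k(query_vec, store, k=4):
    """Return the k (score, text) pairs with the highest dot-product score."""
    scored = (
        (sum(q * v for q, v in zip(query_vec, vec)), text)
        for text, vec in store
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy "vector store": (chunk text, embedding) pairs, all hypothetical
toy_store = [
    ("asthma overview", [0.9, 0.1, 0.0]),
    ("asthma triggers", [0.8, 0.2, 0.1]),
    ("thyroid classifiers", [0.1, 0.9, 0.2]),
    ("hospital ransomware", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]  # stands in for an embedded question about asthma

results = top_k(toy_query, toy_store, k=2)
print([text for _, text in results])
# ['asthma overview', 'asthma triggers']
```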

In [55]:
rd = get_similar_docs("what is asthma")
rd

[Document(page_content='Introduction to Asthma \nChildhood asthma can be controlled through educatio n and treatment. Working as a team, school \npersonnel, healthcare providers and parents can hel p children with asthma participate fully in school,  \nsports and home activities.  \nWhat is Asthma?  \nAsthma is one of the most common chronic diseases o f \nchildhood, affecting more than 6 million children. \nAsthma is a chronic inflammatory lung disease that can \ncause repeated episodes of cough, wheezing and brea thing', metadata={'page': 0.0, 'source': 'Documents/asthma.pdf'}),
 Document(page_content='Asthma Triggers \n• Allergic reactions to environmental allergens such  as pollens, molds, dust mite or animal dander \n• Colds and viral respiratory infections \n• Exercise \n• Changes in weather, exposure to cold air or sudden  temperature change \n• Irritants such as tobacco smoke, air pollution, pa ints and cleaning agents \n• Strong odors and/or perfumes \n \nExercise-Induced Bron

In [56]:
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub

In [57]:
llm=HuggingFaceHub(repo_id="google/flan-t5-xxl",
                   model_kwargs={"temperature":0.1,"max_new_tokens":500})
llm

HuggingFaceHub(client=<InferenceClient(model='google/flan-t5-xxl', timeout=None)>, repo_id='google/flan-t5-xxl', task='text2text-generation', model_kwargs={'temperature': 0.1, 'max_new_tokens': 500})

In [58]:
chain = load_qa_chain(llm, chain_type="stuff", verbose=False)

In [59]:
from langchain.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
    """
    Based only on the summaries provided below, answer the given question.
    If you don't know the answer, respond with "I don't know".
    Do not generate new questions.
    Do not make up your own answer from outside the source material.
    Do not write anything after the answer.

    Source material: {context}

    Question: {question}
    Answer:
    """
)

chain = load_qa_chain(
    llm,
    chain_type="stuff",
    prompt=prompt,
)
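
With `chain_type="stuff"`, the retrieved chunks are joined into a single string and substituted for `{context}` in the prompt, so all of them have to fit into the model's input together. Below is a rough stdlib-only sketch of that assembly step, using plain `str.format` and made-up chunk texts. LangChain's own joining and formatting details may differ, and `build_prompt` and `TEMPLATE` are hypothetical names.

```python
TEMPLATE = (
    "Based only on the source material below, answer the question.\n"
    "If you don't know the answer, respond with \"I don't know\".\n\n"
    "Source material: {context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(chunks, question, separator="\n\n"):
    """Concatenate all chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Made-up chunks standing in for retrieved documents
toy_prompt = build_prompt(
    [
        "Asthma is a chronic inflammatory lung disease.",
        "Triggers include cold air and smoke.",
    ],
    "What is asthma?",
)
print(toy_prompt)
```

Because every chunk is pasted into one prompt, raising `k` or `chunk_size` increases the prompt length directly.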

In [60]:
def get_answer(query):
  relevant_docs = get_similar_docs(query)
  response = chain.run(input_documents=relevant_docs, question=query)
  return response

In [61]:
our_query = "What is Asthma?"
answer = get_answer(our_query)
print(answer)

Asthma is a chronic inflammatory lung disease that can cause repeated episodes of cough, wheezing and brea thing


In [62]:
our_query = "Who is Michael Jackson?"
answer = get_answer(our_query)
print(answer)

I don't know


In [63]:
our_query = "What are the triggers for Asthma?"
answer = get_answer(our_query)
print(answer)

Allergic reactions to environmental allergens such as pollens, molds, dust mite or animal dander • Colds and viral respiratory infections • Exercise • Changes in weather, exposure to cold air or sudden temperature change • Irritants such as tobacco smoke, air pollution, pa ints and cleaning agents • Strong odors and/or perfumes


In [64]:
our_query = "What is the recommended treatment for acute bronchitis?"
answer = get_answer(our_query)
print(answer)

Antibiotics are not needed to treat acute bronchitis. Using antibiotics when not needed could do more harm than good.


In [65]:
our_query = "List some cyber attacks that took place during the Covid-19 pandemic."
answer = get_answer(our_query)
print(answer)

March 2020 Czech Republic Ransomware The Brno University Hospital as one of COVID-19 testing laboratories in the countryhas been hit by a cyber-attack and wasforced to shutdown its entire IT network. 16 March 2020 UK Ransomware The Maze ransomware group has published personal and medical details of thousands of malicious cyber actors.


In [66]:
our_query = "Which classifier gave the best performance for thyroid disease treatment prediction?"
answer = get_answer(our_query)
print(answer)


Decision Trees
