### Configure a RAG system to answer questions from a given context

In [77]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import FAISS
from langchain_aws import BedrockLLM

In [66]:
import glob
import os
import pandas as pd

In [68]:
import sys

In [69]:
sys.path.append(os.path.abspath('../../'))

In [67]:
abs_path = os.path.abspath('../../')
path_to_data = 'data/processed/documents'
# join the pattern as its own path component so the separator is not lost
filenames = glob.glob(os.path.join(abs_path, path_to_data, '*.txt'))

### Configure the Bedrock clients

In [70]:
from src.utils import bedrock

In [71]:
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
# os.environ["AWS_PROFILE"] = ""
# os.environ["BEDROCK_ASSUME_ROLE"] = ""  # E.g. "arn:aws:..."

boto3_bedrock = bedrock.get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None),
    runtime=False,
)

bedrock_runtime = bedrock.get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None)
)

Create new client
  Using region: us-east-1
boto3 Bedrock client successfully created!
bedrock(https://bedrock.us-east-1.amazonaws.com)
Create new client
  Using region: us-east-1
boto3 Bedrock client successfully created!
bedrock-runtime(https://bedrock-runtime.us-east-1.amazonaws.com)


### Load all the documents from disk to create our vector store

In [139]:
filenames = glob.glob(os.path.join(abs_path, path_to_data,'**/*.txt'))
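Which files a `**/*.txt` pattern picks up depends on `glob`'s `recursive` flag. The sketch below illustrates the difference on a throwaway directory tree; the file names and the `list_txt` helper are hypothetical and exist only for this example.

```python
import glob
import os
import tempfile


def list_txt(root, pattern, recursive=False):
    """Return glob matches under `root` as sorted, '/'-separated relative paths."""
    matches = glob.glob(os.path.join(root, pattern), recursive=recursive)
    return sorted(os.path.relpath(m, root).replace(os.sep, "/") for m in matches)


# Hypothetical layout, created in a temporary directory for illustration only.
with tempfile.TemporaryDirectory() as root:
    for rel in ["top.txt", "a/one.txt", "a/b/two.txt"]:
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as fh:
            fh.write("placeholder")

    # Without recursive=True, '**' behaves like '*': exactly one directory level.
    one_level = list_txt(root, "**/*.txt")
    # With recursive=True, '**' matches zero or more directory levels.
    all_levels = list_txt(root, "**/*.txt", recursive=True)

print(one_level)
print(all_levels)
```

If the documents sit exactly one folder below `data/processed/documents`, the non-recursive form is enough; deeper or flatter layouts would need `recursive=True`.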

In [163]:
# load all documents from disk
documents = load_documents(filenames, '')
# split them into overlapping chunks
split_docs = split_documents(documents, chunk_size=256, overlap=50)
# get the embedding model
embeddings = get_hf_embeddings_model()
# build our vector store from the chunks
vector_store = get_vector_store(split_docs, embeddings)
# get the llm that we are going to use
bedrock_model_id = "anthropic.claude-v2"
model_parameter = {
    "temperature": 0.0, 
    "top_p": .5, 
    "top_k": 200, 
    "max_tokens_to_sample": 400, 
    "stop_sequences": ["\n\n Human: bye"]
}

llm = BedrockLLM(
    model_id=bedrock_model_id,
    model_kwargs=model_parameter, 
    client=bedrock_runtime
)
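To get a feel for how `chunk_size=256` and `overlap=50` interact, here is a minimal sketch of a fixed-width sliding window in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, so its chunks will generally differ. The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=256, overlap=50):
    """Cut `text` into windows of `chunk_size` characters that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 600-character text, used only to make the lengths easy to check.
sample = "".join(str(i % 10) for i in range(600))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

Each window advances by `chunk_size - overlap` characters, so the tail of one chunk is repeated at the head of the next. That repetition is what lets a sentence cut at a boundary still appear whole in at least one chunk.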

### Orchestrate all elements to create our QA chain

In [141]:
from langchain.prompts import PromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

In [164]:
prompt_template = """
<|system|>
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
{context}
---
Now here is the question you need to answer.

Question: {input}
</s>
<|assistant|>
"""

In [165]:
PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "input"]
)
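To see roughly what text the model receives, the sketch below fills a shortened version of the template with plain `str.format`. It assumes the retrieved chunks are joined with a blank line, which is, to my knowledge, the default separator used by `create_stuff_documents_chain`. The chunks and the `build_prompt` helper are made up for illustration.

```python
# Shortened stand-in for the notebook's template (assumption: same two placeholders).
TEMPLATE = (
    "Context:\n"
    "{context}\n"
    "---\n"
    "Question: {input}\n"
)


def build_prompt(chunks, question, separator="\n\n"):
    """Join the retrieved chunks and substitute them, with the question, into the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, input=question)


# Hypothetical retrieved chunks.
example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt_text = build_prompt(example_chunks, "What is Glaucoma?")
print(prompt_text)
```

Because every retrieved chunk is pasted into `{context}`, the prompt grows with `k` and with the chunk size, which is worth keeping in mind against the model's context window.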

In [166]:
retriever = vector_store.as_retriever(
    search_type = 'mmr', #Maximum Marginal Relevance
    search_kwargs = {"k":10, "lambda_mult":0.2}
)
question_answer_chain = create_stuff_documents_chain(llm, PROMPT)
chain = create_retrieval_chain(retriever, question_answer_chain)
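Maximum Marginal Relevance picks documents greedily, trading relevance to the query against similarity to documents already chosen. The sketch below follows the usual formulation, `lambda_mult * sim(query, doc) - (1 - lambda_mult) * max sim(doc, selected)`, on toy 2-D vectors. It is not LangChain's exact code, and the vectors and the `mmr_select` helper are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k, lambda_mult):
    """Greedily pick up to k candidate indices by their MMR score."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
docs = [
    [1.0, 0.0],   # 0: same direction as the query
    [0.99, 0.1],  # 1: near-duplicate of doc 0
    [0.6, 0.8],   # 2: less relevant, but different from doc 0
]

print(mmr_select(query_vec, docs, k=2, lambda_mult=1.0))  # relevance only
print(mmr_select(query_vec, docs, k=2, lambda_mult=0.2))  # diversity-heavy
```

With `lambda_mult=0.2`, as in the retriever above, the near-duplicate is skipped in favour of the more distinct document. That setting favours broad coverage of the corpus, possibly at the cost of pulling in chunks about unrelated conditions.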

In [167]:
def get_answer_to_user_question(query, chain):
    response = chain.invoke({"input": query})
    return response['answer']
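
The dictionary returned by a chain built with `create_retrieval_chain` also carries the retrieved documents under the `context` key. A variant of the helper that returns the sources as well could look like the sketch below. `FakeDocument` and `FakeChain` are hypothetical stubs so the example runs without a model, and the source names are placeholders.

```python
class FakeDocument:
    """Stand-in for a retrieved document: text plus a metadata dict."""

    def __init__(self, page_content, source):
        self.page_content = page_content
        self.metadata = {"source": source}


class FakeChain:
    """Stand-in chain returning the same keys as a retrieval chain response."""

    def invoke(self, inputs):
        return {
            "input": inputs["input"],
            "context": [
                FakeDocument("chunk one", "folder_a/doc1.txt"),
                FakeDocument("chunk two", "folder_a/doc1.txt"),
                FakeDocument("chunk three", "folder_b/doc2.txt"),
            ],
            "answer": " A placeholder answer.",
        }


def get_answer_with_sources(query, chain):
    """Return the stripped answer and the distinct sources of the retrieved chunks."""
    response = chain.invoke({"input": query})
    sources = []
    for doc in response.get("context", []):
        source = doc.metadata.get("source", "unknown")
        if source not in sources:  # keep first-seen order, drop repeats
            sources.append(source)
    return response["answer"].strip(), sources


result = get_answer_with_sources("What is Glaucoma?", FakeChain())
print(result)
```

Listing the sources next to the answer makes it easier to check whether a response is grounded in the retrieved chunks.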

In [168]:
query = "How many people are affected by Ochoa syndrome?"
get_answer_to_user_question(query, chain)

' Based on the context provided, the frequency of Ochoa syndrome is not explicitly stated. The passage mentions that 18q deletion syndrome occurs in an estimated 1 in 40,000 newborns, and Usher syndrome is inherited in an autosomal recessive manner, but does not provide statistics on how many people are affected by Ochoa syndrome specifically. Since the requested information cannot be deduced from the given context, I cannot provide an answer to the question asked.'

In [169]:
query = "What is (are) Heart Disease in Women?"
get_answer_to_user_question(query, chain)

' Based on the context, heart disease in women with 17q23.1q23.2 microdeletion syndrome refers to congenital heart defects that can occur as part of the condition. The passage states that heart defects are one of the features associated with 17q23.1q23.2 microdeletion syndrome. However, it does not provide any specific information about what types of heart defects occur or their frequency. The context does not mention heart disease in women in general.'

In [170]:
query = "What is Glaucoma?"
get_answer_to_user_question(query, chain)

' Glaucoma is a group of eye diseases that damage the optic nerve, which is important for good vision. This damage is often caused by an abnormally high pressure in the eye. There are different types of glaucoma, but they can all lead to vision loss and blindness if not treated. Some key facts about glaucoma:\n\n- It is one of the leading causes of blindness worldwide. \n\n- There are often no early symptoms. Vision loss starts with peripheral vision and can occur gradually.\n\n- Risk factors include older age, family history, high eye pressure, and certain medical conditions like diabetes.\n\n- Treatment focuses on lowering eye pressure through eye drops, pills, laser procedures, or surgery. Early detection and treatment are key to preventing vision loss.\n\n- Glaucoma cannot be cured, but early treatment helps slow or prevent further vision loss. Regular eye exams are important, especially for those at higher risk.\n\nThe context provided does not contain information specifically abo

In [171]:
query = "Do you have information about Veterans and Military Health?"
get_answer_to_user_question(query, chain)

' Unfortunately I do not have specific information about veterans and military health. However, based on the context provided, I can suggest looking into resources from the U.S. Department of Veterans Affairs. They have information on health services, benefits, and resources for veterans, service members, and their families. The VA website (www.va.gov) has sections dedicated to health issues, disability claims, caregiving support, and more. The Department of Defense also has health resources for active duty service members and families at www.health.mil. I hope this helps point you in the right direction to find information relevant to veterans and military health.'

In [173]:
query = "What is Retinoblastoma?"
get_answer_to_user_question(query, chain)

' Retinoblastoma is a rare eye cancer that develops in the retina, the light-sensing tissue in the back of the eye. It is caused by mutations in the RB1 gene and usually affects young children. Key features of retinoblastoma include:\n\n- It develops in one or both eyes, usually before age 5.\n\n- Common signs are a white pupil, crossed eyes, poor vision or red and swollen eyes.\n\n- Treatment involves surgery, radiation, laser therapy, freezing (cryotherapy), or chemotherapy. \n\n- If caught early, retinoblastoma has a high cure rate, though vision loss can occur. Untreated, it can spread to other parts of the body.\n\n- It can be inherited (due to RB1 gene mutation) or non-inherited (somatic mutation). The inherited form tends to be bilateral and diagnosed earlier.\n\nSo in summary, retinoblastoma is a rare pediatric eye cancer arising from retinal cells, often due to RB1 gene mutations. Early detection and treatment are key for preserving vision and achieving a cure.'

### UTILS

In [None]:
def load_documents(filenames, data_root):
    """
    Read all documents that are passed in filenames
    """
    documents = []
    for file in filenames:
        loader = TextLoader(data_root+file)
        document = loader.load()
        for d in document:
            d.metadata['source'] = '/'.join(file.split('/')[-2:])
        documents += document
    return documents


def split_documents(documents, chunk_size=512, overlap=100):
    """
    Split the documents into chunks for further processing
    """
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    documents = text_splitter.split_documents(documents)
    return documents


def get_hf_embeddings_model():
    """
    Get the embedding model to create our vector_store.
    The embedding model to use is fine tuned on a medical dataset.
    
    @software{balachandran2024medembed,
    author = {Balachandran, Abhinand},
    title = {MedEmbed: Medical-Focused Embedding Models},
    year = {2024},
    url = {https://github.com/abhinand5/MedEmbed}
    }
    """
    model_name = "abhinand/MedEmbed-small-v0.1"
    model_kwargs = {'device': 'cuda'}  # use 'cpu' if no GPU is available
    encode_kwargs = {'normalize_embeddings': True}
    model = HuggingFaceBgeEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
    )
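    return model


def get_vector_store(documents, embeddings):
    """
    Build a FAISS vector store from the split documents.
    NOTE: assumed implementation. This helper is called above but its
    definition is not shown; FAISS is the only vector store imported.
    """
    return FAISS.from_documents(documents, embeddings)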
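
Setting `normalize_embeddings` to `True` scales every vector to unit length, so a plain inner product between two embeddings equals their cosine similarity. A minimal check on toy vectors (the numbers are arbitrary and the helper names hypothetical):

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(dot(a_unit, b_unit))  # inner product of the normalized vectors
print(cosine(a, b))         # cosine similarity of the originals
```

Both values agree up to floating-point rounding, which is why similarity search over normalized embeddings can work with inner products directly.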