# Overview

In this Jupyter notebook, I walk through building a RAG (Retrieval-Augmented Generation) pipeline from three components: a custom Sentence Transformers model that encodes text for semantic similarity, Qdrant as the vector store for efficient retrieval, and the Gemini LLM for answer generation.



*   **Custom Sentence Transformers Model:** I fine-tuned [a custom Sentence Transformers model](https://huggingface.co/aleynahukmet/bge-medical-small-en-v1.5) on medical data. It encodes text into dense semantic vectors, which the pipeline compares to measure semantic similarity.
*   **Gemini LLM Integration:** Gemini generates the answers, producing coherent and contextually relevant responses based on the retrieved passages.
*   **Qdrant as Vector Store:** Qdrant stores and indexes the dense vectors produced by the Sentence Transformers model, enabling efficient similarity search.

## SETUP

In [None]:
#install dependencies
!pip install --upgrade qdrant-client sentence-transformers google-generativeai datasets

## VECTOR STORE

I'm using Qdrant's local mode to avoid the hassle of running a Docker instance, but you may want to switch to a Docker-based instance or Qdrant's cloud offering for production.

In [2]:
#import libraries
from qdrant_client import QdrantClient

#create a client in local mode; data is persisted under ./vectorstore
client = QdrantClient(path='./vectorstore')
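If you later move from local mode to a Docker-based or cloud instance, one way to keep the notebook unchanged is to pick the client arguments from environment variables. The sketch below only builds the keyword arguments; `QDRANT_URL` and `QDRANT_API_KEY` are hypothetical variable names chosen for this example, and the fallback is the local path used above.

```python
import os

def qdrant_client_kwargs(env=None):
    """Choose QdrantClient keyword arguments from environment variables.

    QDRANT_URL and QDRANT_API_KEY are hypothetical names used for this sketch.
    Without QDRANT_URL, fall back to the local mode used in this notebook.
    """
    env = os.environ if env is None else env
    url = env.get("QDRANT_URL")
    if url:
        kwargs = {"url": url}
        api_key = env.get("QDRANT_API_KEY")
        if api_key:
            kwargs["api_key"] = api_key
        return kwargs
    return {"path": "./vectorstore"}

# Local mode when nothing is configured, remote mode when a URL is set
print(qdrant_client_kwargs({}))
print(qdrant_client_kwargs({"QDRANT_URL": "http://localhost:6333"}))
```

With `qdrant-client` installed, the result can be passed along as `QdrantClient(**qdrant_client_kwargs())`.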

## INDEXING DATA

The `FunPang/medical_dataset` dataset consists of 31,880 auto-generated question-answer pairs in the medical domain. I will apply a simple preprocessing step that removes the `<HUMAN>:` and `<ASSISTANT>:` markers before indexing the texts in the vector store.

In [11]:
#import libraries
from datasets import load_dataset

#load the dataset from the Hugging Face Hub
ds = load_dataset("FunPang/medical_dataset", split='train')
ds

Dataset({
    features: ['Unnamed: 0', 'text', 'raw_data_id'],
    num_rows: 31880
})

In [12]:
#remove unnecessary columns
ds = ds.remove_columns(['Unnamed: 0', 'raw_data_id'])
ds

Dataset({
    features: ['text'],
    num_rows: 31880
})

In [13]:
#define function to remove unwanted markups
def process_tokens(example):
    example['text'] = example['text'].replace("<HUMAN>:", "")
    example['text'] = example['text'].replace("<ASSISTANT>:", "")
    return example

# Apply the function to dataset
ds= ds.map(process_tokens)
ds

Dataset({
    features: ['text'],
    num_rows: 31880
})
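The texts retrieved later in this notebook still end with `<|eos|>` and `<|eod|>` tokens and begin with a leading space, because the step above only removes the two role markers. A slightly broader cleanup could look like the sketch below. The function name `clean_text` is my own, and the pattern assumes these four markers are the only ones in the dataset.

```python
import re

# Role markers plus the end-of-sequence / end-of-document tokens
MARKERS = re.compile(r"<HUMAN>:|<ASSISTANT>:|<\|eos\|>|<\|eod\|>")

def clean_text(text):
    """Remove the markers and collapse the runs of spaces they leave behind."""
    text = MARKERS.sub("", text)
    # Collapse spaces and tabs only, so line breaks inside the answer survive
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

sample = "<HUMAN>: What can cause headaches? <ASSISTANT>: Stress or dehydration. <|eos|> <|eod|>"
print(clean_text(sample))
```

It could be applied with `ds.map(lambda ex: {"text": clean_text(ex["text"])})`. Keep in mind that changing the text changes the embeddings, so the similarity scores would differ from the ones printed in this notebook.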

## GENERATING EMBEDDINGS

I will use a custom fine-tuned model to map texts to 384-dimensional vectors. Check out [this repo](https://github.com/aleynahukmet/medical-qa) to see how I fine-tuned it.

In [15]:
#import SentenceTransformers library
from sentence_transformers import SentenceTransformer

#load the model from HuggingFace
model_name = "aleynahukmet/bge-medical-small-en-v1.5"
model = SentenceTransformer(model_name)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [17]:
#adjust the batch size to fit your GPU memory
vectors = model.encode(ds['text'], batch_size=32, show_progress_bar=True)

In [18]:
#we can verify that we encoded the whole dataset by printing the shape
vectors.shape

(31880, 384)

## CREATING COLLECTION

Qdrant organizes vectors in what is called a collection. Collections can store payload data alongside the vectors. To create a collection, we need to specify the vector dimension (384 in our case) and the distance metric (cosine in our case).

In [19]:
from qdrant_client.models import Distance, VectorParams

#the vector size must match the embedding dimension of the model
vector_dim = 384
collection_name = "medical_rag"

operation_info = client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=vector_dim, distance=Distance.COSINE),
)

operation_info

True
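A note on the cosine metric: the model printed earlier ends with a `Normalize()` module, so its embeddings have unit length, and for unit-length vectors cosine similarity is the same as the plain dot product. The toy example below illustrates this with two hand-picked 2-dimensional vectors (not real embeddings).

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]

# Both values agree up to floating-point rounding
print(cosine_similarity(a, b))
print(dot(normalize(a), normalize(b)))
```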

In [21]:
#prepare the payload data as a list of dictionaries
payload = [{"text": text} for text in ds["text"]]
#generate sequential integer ids, one per point
ids = list(range(len(payload)))

#use upload_collection to upload the data to the vector store in batches
client.upload_collection(
    collection_name=collection_name,
    vectors=vectors,
    payload=payload,
    ids=ids,
    batch_size=32,
)
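To get a feel for what a batch size of 32 means for this dataset, the sketch below splits 31,880 ids into consecutive batches and counts them. It is only an illustration of the arithmetic; the `batched` helper is my own and says nothing about how the client batches internally.

```python
import math

def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

num_points, batch_size = 31880, 32
ids = list(range(num_points))
batches = list(batched(ids, batch_size))

print(len(batches))      # number of batches: ceil(31880 / 32) = 997
print(len(batches[-1]))  # the last batch holds the remaining 8 ids
```

The same arithmetic applies to `model.encode`, which was also called with `batch_size=32`.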

In [28]:
#check the collection info after the upload
collection_info = client.get_collection(collection_name)
collection_info

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=None, indexed_vectors_count=0, points_count=31880, segments_count=1, config=CollectionConfig(params=CollectionParams(vectors=VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None), shard_number=None, sharding_method=None, replication_factor=None, write_consistency_factor=None, read_fan_out_factor=None, on_disk_payload=None, sparse_vectors=None), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=None, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_threads=1), wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0), quantization_config=N

## RETRIEVING RELEVANT TEXTS

Now that we have uploaded the vectors and payload data to the vector store, we can query it to retrieve relevant texts. I will implement a function that encodes the given text and runs the query against the vector store.

In [30]:
#limit is the number of results to be retrieved
def search(text, limit):
    #encode the query with the same model used for indexing
    text_vector = model.encode(text)
    search_result = client.search(
        collection_name=collection_name, query_vector=text_vector, limit=limit
    )
    return search_result


In [31]:
#test the function with a simple query
search("reasons for headache", 2)

[ScoredPoint(id=18631, version=0, score=0.8791221380233765, payload={'text': ' What can cause headaches?   The most common reasons for headaches include having a cold or the flu, experiencing stress, drinking too much alcohol, having bad posture or eyesight problems, not consuming regular meals or enough fluids, taking too many painkillers, or experiencing your period or menopause.\nReferences:\n- https://www.nhs.uk/conditions/headache/ <|eos|> <|eod|>'}, vector=None, shard_key=None),
 ScoredPoint(id=27003, version=0, score=0.8632267713546753, payload={'text': ' What are the triggers of hormone headaches?  Hormone headaches can be caused by various factors, such as changes in hormonal levels related to menstrual cycles, the use of the combined oral contraceptive pill, menopause or pregnancy.\nReferences:\n- https://www.nhs.uk/conditions/hormone-headaches/ <|eos|> <|eod|>'}, vector=None, shard_key=None)]
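Conceptually, the search scores every stored vector against the query vector and returns the payloads with the highest scores. The brute-force sketch below shows that idea on toy 2-dimensional unit vectors. The `top_k` helper and the sample data are made up for illustration, and a vector store such as Qdrant can use an index to avoid scanning every vector.

```python
def top_k(query, vectors, payloads, limit):
    """Return (score, payload) pairs for the `limit` highest dot products."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), payload)
        for vec, payload in zip(vectors, payloads)
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]

# Toy unit vectors standing in for the 384-dimensional embeddings
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
payloads = ["doc A", "doc B", "doc C"]

for score, payload in top_k([0.8, 0.6], vectors, payloads, limit=2):
    print(round(score, 2), payload)
```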

## GENERATING ANSWERS

I will use the Gemini API as the LLM to generate answers. You need to [generate an API key](https://aistudio.google.com/app/apikey) and store it as a Colab secret named `API_KEY` to run this section.

In [33]:
#get the api key from secrets and store it as a variable
from google.colab import userdata
token = userdata.get('API_KEY')

In [35]:
#configure the library with the api key to initialize it
import google.generativeai as genai
genai.configure(api_key=token)
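The cell that reads the key depends on `google.colab`, so it only works inside Colab. If you run the notebook elsewhere, a small helper like the one below could read the key from an environment variable first and fall back to Colab secrets. The helper name `load_api_key` and the choice to check the environment first are my own assumptions, not part of either library.

```python
import os

def load_api_key(name="API_KEY", env=None):
    """Return the key from the environment, else from Colab secrets."""
    env = os.environ if env is None else env
    if env.get(name):
        return env[name]
    try:
        # Only importable when running inside Google Colab
        from google.colab import userdata
    except ImportError:
        raise RuntimeError(
            f"Set the {name} environment variable or run inside Colab"
        ) from None
    return userdata.get(name)

# Demonstration with an explicit mapping instead of the real environment
print(load_api_key(env={"API_KEY": "dummy-key"}))
```

The returned value can then be passed to `genai.configure(api_key=...)` as in the cell above.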

## PROMPT TEMPLATE

And here's the fun part. I will define a prompt template into which the user question and the retrieved texts are injected later on. Note how the prompt instructs the LLM to base its answers on the given context.

In [36]:
prompt = """You are a helpful assistant to answer users' questions about health and medical issues.

Answer the following question according to the context below that might contain the answer. If the context does not contain an answer, reply by saying that you don't have the answer right now but you are actively learning.

## Question
{question}

## Context
{context}
"""

## PUTTING EVERYTHING TOGETHER

Now it's time to put everything together and actually generate answers with RAG. I will implement a function that uses the search function under the hood to retrieve text relevant to the user question. Then, it will form the full prompt based on the prompt template and call the Gemini API to generate the answer.
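Before wiring in the API call, it can help to see what a filled-in prompt looks like. The sketch below uses an abbreviated stand-in for the template (only the two placeholder sections) and two made-up passages. `PROMPT_TEMPLATE` and `build_prompt` are names introduced for this example only.

```python
# Abbreviated stand-in for the notebook's prompt template
PROMPT_TEMPLATE = """## Question
{question}

## Context
{context}
"""

def build_prompt(question, passages, template=PROMPT_TEMPLATE):
    """Join the passages with blank lines and fill the template."""
    context = "\n\n".join(p.strip() for p in passages)
    return template.format(question=question, context=context)

passages = ["Stress can cause headaches.", "Dehydration is a common trigger."]
print(build_prompt("What causes headaches?", passages))
```

Because `str.format` only interprets braces in the template itself, retrieved passages that happen to contain `{` or `}` are inserted unchanged.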

In [37]:
#create a Gemini model instance
gemini_model = genai.GenerativeModel('gemini-1.5-pro-latest')

#define the function to generate answers
def answer(question):
    #retrieve the 5 most similar texts and join them into one context string
    search_results = search(question, 5)
    context = "\n\n".join([r.payload['text'] for r in search_results])
    #fill the prompt template and send it to Gemini
    full_prompt = prompt.format(question=question, context=context)
    response = gemini_model.generate_content([full_prompt])

    return response.text
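The `answer` function passes all five hits to the model, however weak the match. One possible refinement, sketched below, is to drop hits under a similarity cutoff before building the context, so that an off-topic question yields an empty context and the prompt's fallback instruction applies. The `Hit` tuple is a stand-in for the search results, and the default `min_score` of 0.5 is an arbitrary, hypothetical value that would need tuning on real queries.

```python
from collections import namedtuple

# Stand-in for a search result, keeping only the fields used here
Hit = namedtuple("Hit", ["score", "payload"])

def select_context(hits, min_score=0.5):
    """Keep the texts of hits whose score reaches min_score (hypothetical cutoff)."""
    return [hit.payload["text"] for hit in hits if hit.score >= min_score]

hits = [
    Hit(0.88, {"text": "Stress can cause headaches."}),
    Hit(0.31, {"text": "An unrelated passage."}),
]
print(select_context(hits))
```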

In [39]:
#test the RAG pipeline with a query
answer("What are the reasons for headache?")

"## Reasons for Headaches:\n\nHeadaches can arise from various causes, ranging from everyday factors to underlying medical conditions. Here are some common reasons:\n\n* **Common triggers:**\n    * **Illness:** Colds, flu, or sinus infections.\n    * **Lifestyle factors:** Stress, alcohol consumption, poor posture, eyesight issues, irregular meals, dehydration, medication overuse.\n    * **Hormonal fluctuations:** Menstrual cycles, oral contraceptives, menopause, pregnancy.\n* **Non-pathological causes:**\n    * **Tension headaches:** Muscle tension in the head, neck, or shoulders.\n    * **Dehydration:** Insufficient fluid intake.\n    * **Sleep disturbances:** Lack of sleep or irregular sleep patterns. \n    * **Caffeine withdrawal:** Decreased caffeine intake.\n    * **Eyestrain:** Prolonged focus on screens or close work.\n* **Types of headaches:**\n    * **Tension headaches:** Band-like pain around the head, often due to stress or muscle strain.\n    * **Migraines:** Severe throbb

## CONCLUSION

In this notebook, I used a custom Sentence Transformers model to retrieve relevant medical texts from a vector store and generated answers to user questions based on those texts, without relying on heavyweight libraries such as LangChain. I hope it helps you understand how RAG works under the hood. Feel free to check out [the model repo](https://github.com/aleynahukmet/medical-qa) for the training details if you want to do the same.