In [12]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2025-10-05 18:02:30--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4350 (4.2K) [text/plain]
Saving to: ‘minsearch.py’


2025-10-05 18:02:31 (93.8 MB/s) - ‘minsearch.py’ saved [4350/4350]



In [2]:
import pandas as pd
import json

## 1. Ingestion

In [59]:
news_path = "data/medquad.csv"
df = pd.read_csv(news_path)

In [60]:
df.columns

Index(['question', 'answer', 'source', 'focus_area'], dtype='object')

In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16412 entries, 0 to 16411
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   question    16412 non-null  object
 1   answer      16407 non-null  object
 2   source      16412 non-null  object
 3   focus_area  16398 non-null  object
dtypes: object(4)
memory usage: 513.0+ KB


In [62]:
df = df.dropna()

In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16393 entries, 0 to 16411
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   question    16393 non-null  object
 1   answer      16393 non-null  object
 2   source      16393 non-null  object
 3   focus_area  16393 non-null  object
dtypes: object(4)
memory usage: 640.4+ KB


In [64]:
documents = df.to_dict(orient='records')

In [65]:
documents = documents[:500]  # keep the first 500 records; embedding the full dataset locally takes too long

### 1.1 intake via minsearch (keyword search)

In [66]:
import minsearch

In [67]:
index_min = minsearch.Index(
    text_fields=['question', 'answer', 'source'],
    keyword_fields=['focus_area']
)

In [68]:
index_min.fit(documents)

<minsearch.Index at 0x7a91f18df3d0>

### 1.2 intake via Qdrant (semantic/vector search)

#### 1.2.1 Set up and import libraries & connect to Qdrant

Run a Qdrant instance in a Docker container by executing the following commands in a Linux terminal:

```bash
docker pull qdrant/qdrant

docker run -p 6333:6333 -p 6334:6334 \
   -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
   qdrant/qdrant
```

In [24]:
from qdrant_client import QdrantClient, models

In [25]:
# initialize the client
client_q = QdrantClient("http://localhost:6333") #connecting to local Qdrant instance

#### 1.2.2 Study the Dataset

In [14]:
documents[:2]

[{'question': 'What is (are) Glaucoma ?',
  'answer': "Glaucoma is a group of diseases that can damage the eye's optic nerve and result in vision loss and blindness. While glaucoma can strike anyone, the risk is much greater for people over 60. How Glaucoma Develops  There are several different types of glaucoma. Most of these involve the drainage system within the eye. At the front of the eye there is a small space called the anterior chamber. A clear fluid flows through this chamber and bathes and nourishes the nearby tissues. (Watch the video to learn more about glaucoma. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.) In glaucoma, for still unknown reasons, the fluid drains too slowly out of the eye. As the fluid builds up, the pressure inside the eye rises. Unless this pressure is controlled, it may cause damage to the optic nerve and other parts of the eye and result in loss of vision. 

#### 1.2.3 Choosing the Embedding Model with FastEmbed

Now that we know we're embedding small chunks of English text (medical-related question and answer pairs), we can choose a suitable embedding model to convert this data into vectors.

In [15]:
# Select an embedding model for the medical question/answer text fields from the options supported by FastEmbed.
from fastembed import TextEmbedding
# TextEmbedding.list_supported_models()

It makes sense to choose a model that produces small-to-moderate-sized embeddings (e.g., 512 dimensions), so we don't overuse resources in our simple setup. We need an embedding model suitable for **English text**.
It also makes sense to select a **unimodal** model, since we're not including images in our search, and specifically tailored solutions are usually better than universal ones.
`jinaai/jina-embeddings-v2-small-en` looks like a good choice.
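To get a feel for why the embedding size matters, here is a back-of-the-envelope estimate of the raw vector storage for this notebook. It is only a sketch: it assumes each vector component is stored as a 4-byte float and ignores index and payload overhead.

```python
# Rough storage estimate for the raw dense vectors only.
# Assumption: every component is a 4-byte float; index and payload overhead are ignored.
num_points = 500      # documents kept after slicing
dimensions = 512      # embedding size discussed above
bytes_per_float = 4

raw_bytes = num_points * dimensions * bytes_per_float
print(f"{raw_bytes:,} bytes = {raw_bytes / 1024**2:.2f} MiB")
```

Doubling the dimensionality doubles this figure, which is why a compact model is a reasonable default for a small local setup.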

In [16]:
model_handle = "jinaai/jina-embeddings-v2-small-en"

Like most dense embedding models, `jinaai/jina-embeddings-v2-small-en` was trained to measure semantic closeness using **cosine similarity**.

> The parameters of the chosen embedding model, including the output embedding `dimensions` and the `semantic similarity` (distance) metric, are required to configure semantic search in Qdrant.
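To make the metric concrete, here is a minimal plain-Python sketch of cosine similarity. The three vectors are made-up toy values, not real embeddings, and the helper name `cosine_similarity` is just for illustration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (real embeddings here have 512 dimensions).
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, twice the length
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1 (dot product is 0)

print(cosine_similarity(v1, v2))  # very close to 1: same direction
print(cosine_similarity(v1, v3))  # 0: unrelated directions
```

Because only the direction matters, the length of a vector does not affect the score, and the result always lies between -1 and 1.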


#### 1.2.4: Create a Collection

In [18]:
# Define the collection name
collection_name = "medicalQA-rag500"
EMBEDDING_DIMENSIONALITY = 512
# Create the collection with specified vector parameters
client_q.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,  # Dimensionality of the vectors
        distance=models.Distance.COSINE  # Distance metric for similarity search
    )
)

True

#### 1.2.5: Create, Embed & Insert Points into the Collection

[Points](https://qdrant.tech/documentation/concepts/points/#points) are the core data entities in Qdrant. Each point consists of:

1. **ID**. A unique identifier. Qdrant supports both 64-bit unsigned integers and UUIDs.  
2. **Vector**. The embedding that represents the data point in vector space.  
3. **Payload** *(optional)*. Additional metadata as key-value pairs.
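Since an upsert with an existing ID overwrites that point, one optional design is to derive the ID from the document content, so that re-running the notebook does not create duplicates. The sketch below is an assumption, not something this notebook relies on: the namespace constant and the `make_point_id` helper are hypothetical names.

```python
import uuid

# Hypothetical fixed namespace; any constant UUID works as long as it never changes.
MEDQA_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "medicalQA-rag500")

def make_point_id(doc: dict) -> str:
    """Derive a stable UUID string from the document content (hypothetical helper)."""
    key = f"{doc['source']}|{doc['question']}|{doc['answer']}"
    return str(uuid.uuid5(MEDQA_NAMESPACE, key))

doc = {
    "question": "What is (are) Glaucoma ?",
    "answer": "Glaucoma is a group of diseases that can damage the eye's optic nerve.",
    "source": "NIHSeniorHealth",
}
point_id = make_point_id(doc)
print(point_id)  # the same document always yields the same UUID
```

Sequential integers or random UUIDs work just as well when the collection is rebuilt from scratch each time.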


In [19]:
points = []

# enumerate() supplies the integer point IDs without shadowing the built-in id()
for point_id, doc in enumerate(documents):
    point = models.PointStruct(  # models is imported from qdrant_client
        id=point_id,
        vector=models.Document(text=doc['answer'], model=model_handle),  # embedded locally by FastEmbed with "jinaai/jina-embeddings-v2-small-en"
        payload={
            "text": doc['answer'],  # keep the raw text as well, so search results can return it
            "source": doc['source'],
            "focus_area": doc['focus_area']
        }  # save all needed metadata fields
    )
    points.append(point)

Now, embed and upload points to our collection.

First, FastEmbed downloads the selected model (the cache path defaults to `os.path.join(tempfile.gettempdir(), "fastembed_cache")`) and performs inference directly on your machine.
Then the generated points are upserted into the collection, and the vector index is built.

In [20]:
client_q.upsert(
    collection_name=collection_name,
    points=points
)

Fetching 5 files: 100%|███████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  4.46it/s]


UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

Upserting the 500 points took about 2 minutes.
The speed of upsert mainly depends on the time spent on local inference. To speed this up, you could run FastEmbed on GPUs or use a machine with more resources.
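Splitting the points into batches does not make inference itself faster, but it keeps each request small and makes progress visible. Here is a minimal sketch of such a splitter; `batched` is a hypothetical helper and the batch size of 64 is an arbitrary choice.

```python
from typing import Iterator

def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in for the real `points` list built above.
fake_points = list(range(500))

batches = list(batched(fake_points, batch_size=64))
print(len(batches), len(batches[-1]))  # → 8 52

# In the notebook, each batch could then be sent on its own, e.g.
# client_q.upsert(collection_name=collection_name, points=batch)
```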

##### Study Data Visually

Let’s explore the uploaded data in the Qdrant Web UI at [http://localhost:6333/dashboard](http://localhost:6333/dashboard) to study semantic similarity visually.

Open the `medicalQA-rag500` collection, click the `Visualize` tab, and run the following command:

```json
{
  "limit": 1000,
  "color_by": {
    "payload": "source"
  }
}
```

This 2D representation is the result of dimensionality reduction applied to `jina-embeddings`.

#### 1.2.6: Running a Similarity Search

In [94]:
# define a search function
def search(query, limit=1):

    results = client_q.query_points(
        collection_name=collection_name,
        query=models.Document( #embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=model_handle 
        ),
        limit=limit, # top closest matches
        with_payload=True #to get metadata in the results
    )

    return results

Now let's pick a random question from the dataset.
Remember that we uploaded only the answers to Qdrant, not the questions.

In [27]:
import random, json

doc_piece= random.choice(documents)
print(json.dumps(doc_piece, indent=2))

{
  "question": "What are the treatments for Prostate Cancer ?",
  "answer": "Surgery, radiation therapy, and hormonal therapy all have the potential to disrupt sexual desire or performance for a short while or permanently. Discuss your concerns with your health care provider. Several options are available to help you manage sexual problems related to prostate cancer treatment.",
  "source": "NIHSeniorHealth",
  "focus_area": "Prostate Cancer"
}


In [28]:
# Let's see which answer we get:
result = search(doc_piece['question'])
result

QueryResponse(points=[ScoredPoint(id=186, version=0, score=0.90219444, payload={'text': "There are a number of ways to treat prostate cancer, and the doctor will develop a treatment to fit each man's needs. The choice of treatment mostly depends on the stage of the disease and the grade of the tumor. But doctors also consider a man's age, general health, and his feelings about the treatments and their possible side effects. Treatment for prostate cancer may involve watchful waiting, surgery, radiation therapy, or hormonal therapy. Some men receive a combination of therapies. A cure is probable for men whose prostate cancer is diagnosed early.", 'source': 'NIHSeniorHealth', 'focus_area': 'Prostate Cancer'}, vector=None, shard_key=None, order_value=None)])

In [29]:
# Let’s compare the original and retrieved answers for our randomly selected question.

print(f"Question:\n{doc_piece['question']}\n")
print("Top Retrieved Answer:\n{}\n".format(result.points[0].payload['text']))
print("Original Answer:\n{}".format(doc_piece['answer']))

Question:
What are the treatments for Prostate Cancer ?

Top Retrieved Answer:
There are a number of ways to treat prostate cancer, and the doctor will develop a treatment to fit each man's needs. The choice of treatment mostly depends on the stage of the disease and the grade of the tumor. But doctors also consider a man's age, general health, and his feelings about the treatments and their possible side effects. Treatment for prostate cancer may involve watchful waiting, surgery, radiation therapy, or hormonal therapy. Some men receive a combination of therapies. A cure is probable for men whose prostate cancer is diagnosed early.

Original Answer:
Surgery, radiation therapy, and hormonal therapy all have the potential to disrupt sexual desire or performance for a short while or permanently. Discuss your concerns with your health care provider. Several options are available to help you manage sexual problems related to prostate cancer treatment.


In [30]:
# Now let’s search the answer to a question that wasn’t in the initial dataset.

print(search("Who is at risk for Prostate Cancer?").points[0])


id=179 version=0 score=0.90598774 payload={'text': 'Age is the most important risk factor for prostate cancer. The disease is extremely rare in men under age 40, but the risk increases greatly with age. More than 60 percent of cases are diagnosed in men over age 65. The average age at the time of diagnosis is 65.', 'source': 'NIHSeniorHealth', 'focus_area': 'Prostate Cancer'} vector=None shard_key=None order_value=None


### 1.3 intake by Qdrant via Hybrid Search

#### 1.3.1: Sparse vector search with BM25

We are going to use the same dataset as before, but this time we are going to create sparse vectors with BM25 only.

In [22]:
# Create the collection with specified sparse vector parameters
from qdrant_client import models

client_q.create_collection(  # reuse the same client as for dense vector search
    collection_name="medicalQA-sparse",
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    }
)

True

FastEmbed comes with a BM25 implementation that we can use as any other model.

In [23]:
import uuid

# Send the points to the collection
client_q.upsert(
    collection_name="medicalQA-sparse",
    points=[
        models.PointStruct(
            id=uuid.uuid4().hex,
            vector={
                "bm25": models.Document(
                    text=doc["answer"], 
                    model="Qdrant/bm25",
                ),
            },
            payload={
                "text": doc["answer"],
                "source": doc["source"],
                "focus_area": doc["focus_area"],
            }
        )
        for doc in documents
    ]
)

Fetching 18 files: 100%|████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 35.36it/s]


UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

This runs quickly, in a few seconds. BM25 does not require a neural network, so the upload is much faster than with dense embedding models.

#### 1.3.2: Running sparse vector search with BM25

Our vectors are now ready to be searched. Let's create a helper function.

In [83]:
def search(query: str, limit: int = 2) -> list[models.ScoredPoint]:
    results = client_q.query_points(
        collection_name="medicalQA-sparse",
        query=models.Document(
            text=query,
            model="Qdrant/bm25",
        ),
        using="bm25",
        limit=limit,
        with_payload=True,
    )

    return results.points

In [84]:
results = search("qdrant")
results

[]

In [87]:
results = search("cancer")
print(results[1].payload["text"])

How Tumors Form The body is made up of many types of cells. Normally, cells grow, divide and produce more cells as needed to keep the body healthy. Sometimes, however, the process goes wrong. Cells become abnormal and form more cells in an uncontrolled way. These extra cells form a mass of tissue, called a growth or tumor. Tumors can be benign, which means not cancerous, or malignant, which means cancerous. Breast cancer occurs when malignant tumors form in the breast tissue. Who Gets Breast Cancer?  Breast cancer is one of the most common cancers in American women. It is most common among women between the ages of 45-85. (Watch the video to learn more about breast cancer survival rates. To enlarge the videos appearing on this page, click the brackets in the lower right-hand corner of the video screen. To reduce the videos, press the Escape (Esc) button on your keyboard.) Men can get breast cancer too, although they account for only 1 percent of all reported cases. Read more about brea

In [86]:
results[1].score

2.751687

Scores returned by BM25 are calculated with the BM25 formula, not with cosine similarity. They are not bounded to a fixed range such as [-1, 1] and can grow with the number of matching query terms.

That's an important observation before we start implementing hybrid search.
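To see where such scores come from, here is a minimal sketch of textbook Okapi BM25 in plain Python. It is not the `Qdrant/bm25` implementation, which differs in tokenization and other details; the four-sentence corpus and the parameters `k1=1.2`, `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query: str, corpus: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document in `corpus` against `query` with textbook Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

corpus = [
    "breast cancer is one of the most common cancers",
    "glaucoma can damage the optic nerve",
    "prostate cancer treatment may involve surgery",
    "paget's disease of bone is treatable",
]
print(bm25_scores("qdrant", corpus))                             # no term matches: all zeros
print(bm25_scores("prostate cancer treatment surgery", corpus))  # third document scores well above 1
```

Each matching query term adds its own contribution, so longer queries can produce larger scores. That is why BM25 scores cannot be compared directly with cosine similarities.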

#### 1.3.3 Natural-language-like queries

Let's try again with a random question from our dataset to see how well sparse vector search works with longer, natural-language-like queries.

In [88]:
import random
import json

random.seed(202510)

doc_piece = random.choice(documents)
print(json.dumps(doc_piece, indent=2))

{
  "question": "What are the treatments for Paget's Disease of Bone ?",
  "answer": "The Food and Drug Administration has approved several medications that can stop or slow down the progression of the disease and reduce pain and other symptoms. These medications fall into two categories: bisphosphonates and calcitonin. Doctors most often prescribe one of the four strongest bisphosphonates, which are risedronate, alendronate, pamidronate, and zoledronic acid.",
  "source": "NIHSeniorHealth",
  "focus_area": "Paget's Disease of Bone"
}


In [89]:
results = search(doc_piece["question"])
print(results[0].payload["text"])

Early Diagnosis is Important Although there is no cure for Paget's disease of bone, it is treatable. Treatment is most effective when the disease is diagnosed early, before it causes major changes in the affected bones. The goal of treatment is to relieve bone pain and prevent the disease from progressing. Medications Are Available The Food and Drug Administration has approved several medications that can stop or slow down the progression of the disease and reduce pain and other symptoms. These medications fall into two categories: bisphosphonates and calcitonin. Both medications work by stopping or reducing the excessive breakdown of old bone that leads to excessive formation of new, but weaker, bone. People with Paget's disease should talk to their doctors about which medication is right for them. Bisphosphonates Six bisphosphonates are currently available for patients with Paget's disease. Doctors most commonly recommend the strongest ones, which include - risedronate  - alendronate

#### 1.3.4: Qdrant Universal Query API - prefetching

Qdrant's `.query_points` method allows building multi-step search pipelines which can incorporate various methods into a single call. For example, we can retrieve some candidates with dense vector search, and then rerank them with sparse search, or use a fast method for initial retrieval and precise, but slow, reranking.

```ascii
┌─────────────┐           ┌─────────────┐
│             │           │             │
│  Retrieval  │ ────────► │  Reranking  │
│             │           │             │
└─────────────┘           └─────────────┘
```
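The retrieve-then-rerank idea can be sketched without any search engine. The example below uses made-up scores for four hypothetical documents; `two_stage_search` and `prefetch_factor` are illustrative names, not part of the Qdrant API.

```python
def two_stage_search(dense_scores: dict[str, float],
                     sparse_scores: dict[str, float],
                     limit: int = 1,
                     prefetch_factor: int = 10) -> list[str]:
    """Retrieve candidates by dense score, then rerank only those by sparse score."""
    # Stage 1: keep the top (prefetch_factor * limit) documents by dense score.
    candidates = sorted(dense_scores, key=dense_scores.get, reverse=True)[:prefetch_factor * limit]
    # Stage 2: reorder the candidates by sparse score (missing scores count as 0).
    reranked = sorted(candidates, key=lambda d: sparse_scores.get(d, 0.0), reverse=True)
    return reranked[:limit]

# Made-up scores for four hypothetical documents.
dense = {"A": 0.91, "B": 0.88, "C": 0.52, "D": 0.40}
sparse = {"A": 1.2, "B": 3.4, "C": 0.3, "D": 9.9}

print(two_stage_search(dense, sparse, limit=1, prefetch_factor=2))  # → ['B']
```

Document D has the highest sparse score but never reaches the reranking stage, because it was not among the dense candidates. The size of the first stage therefore limits what the second stage can find.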

Let's create another collection that keeps both dense and sparse representations. Qdrant's named vectors allow us to store multiple representations per point, which is especially useful when we want to use multiple models in our applications.

In [90]:
# Create the collection with both vector types
client_q.create_collection(
    collection_name="medicalQA-sparse-and-dense",
    vectors_config={
        # Named dense vector for jinaai/jina-embeddings-v2-small-en
        "jina-small": models.VectorParams(
            size=512,
            distance=models.Distance.COSINE,
        ),
    },
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    }
)

True

In [92]:
# Upload all the vectors into the newly created collection.
client_q.upsert(
    collection_name="medicalQA-sparse-and-dense",
    points=[
        models.PointStruct(
            id=uuid.uuid4().hex,
            vector={ # contain two items here.
                "jina-small": models.Document(
                    text=doc["answer"],
                    model= model_handle # "jinaai/jina-embeddings-v2-small-en",
                ),
                "bm25": models.Document(
                    text=doc["answer"], 
                    model="Qdrant/bm25",
                ),
            },
            payload={
                "text": doc["answer"],
                "source": doc["source"],
                "focus_area": doc["focus_area"],
            }
        )
        for doc in documents
    ]
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [104]:
def multi_stage_search(query: str, limit: int = 1) -> list[models.ScoredPoint]:
    results = client_q.query_points(
        collection_name="medicalQA-sparse-and-dense",
        prefetch=[
            # only one prefetch
            models.Prefetch(
                query=models.Document(
                    text=query,
                    model="jinaai/jina-embeddings-v2-small-en",
                ),
                using="jina-small",
                # Prefetch ten times more results than we
                # expect to return, so there is something to rerank
                limit=(10 * limit),
            ),
        ],
        query=models.Document(
            text=query,
            model="Qdrant/bm25", 
        ),
        using="bm25",
        limit=limit,
        with_payload=True,
    )

    return results.points

In [103]:
print(json.dumps(doc_piece, indent=2))

{
  "question": "What are the treatments for Paget's Disease of Bone ?",
  "answer": "The Food and Drug Administration has approved several medications that can stop or slow down the progression of the disease and reduce pain and other symptoms. These medications fall into two categories: bisphosphonates and calcitonin. Doctors most often prescribe one of the four strongest bisphosphonates, which are risedronate, alendronate, pamidronate, and zoledronic acid.",
  "source": "NIHSeniorHealth",
  "focus_area": "Paget's Disease of Bone"
}


In [102]:
results = multi_stage_search(doc_piece["question"])
print(results[0].payload["text"])

Early Diagnosis is Important Although there is no cure for Paget's disease of bone, it is treatable. Treatment is most effective when the disease is diagnosed early, before it causes major changes in the affected bones. The goal of treatment is to relieve bone pain and prevent the disease from progressing. Medications Are Available The Food and Drug Administration has approved several medications that can stop or slow down the progression of the disease and reduce pain and other symptoms. These medications fall into two categories: bisphosphonates and calcitonin. Both medications work by stopping or reducing the excessive breakdown of old bone that leads to excessive formation of new, but weaker, bone. People with Paget's disease should talk to their doctors about which medication is right for them. Bisphosphonates Six bisphosphonates are currently available for patients with Paget's disease. Doctors most commonly recommend the strongest ones, which include - risedronate  - alendronate

#### 1.3.5: Building Hybrid Search

In real production systems, you don't need to choose just one vector type. You never know what kind of queries your users will send to the system. E-commerce search might be just fine with lexical search on top of sparse vectors, as people tend to send keywords, but in conversational systems, such as chatbots, natural language questions might be more frequent. Using one model as a retriever and another as a reranker is not the only way to use dense and sparse vectors in a single system.

Hybrid Search is a technique for combining results coming from different search methods, for example dense and sparse. There isn't a single definition of how to implement it, because the main problem is how to mix results from methods whose scores are incompatible. Dense and sparse search scores can't be compared directly, so we need another method to order the final results.

There are two terms important for Hybrid Search: **fusion** and **reranking**.

##### Fusion

Fusion is a set of methods that work on the scores or rankings returned by the individual search methods. There are various ways to achieve that, but Reciprocal Rank Fusion (RRF) is the most popular technique. It uses the rank of each document in every method to calculate the final score. You won't need to calculate these scores yourself, as Qdrant has built-in support that we will use. However, the following example can give you a rough intuition:

| Document | Dense ranking | Sparse ranking | RRF score | Final ranking |
| --- | --- | --- | --- | --- |
| D1 | **1** | 5 | 0.0318 | 2 |
| D2 | 2 | 4 | 0.0317 | 4 |
| D3 | 3 | 2 | 0.0320 | **1** |
| D4 | 4 | 3 | 0.0315 | 5 |
| D5 | 5 | **1** | 0.0318 | 2 |
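The scores in the table can be reproduced with a few lines of Python. This sketch assumes the common RRF constant `k = 60`, which matches the table's values; the constant Qdrant uses internally may differ. Scores are printed with five decimals, which shows that D2 is about 0.03175 (the table truncates it to 0.0317).

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """Reciprocal Rank Fusion: sum of 1 / (k + rank) over all rankings."""
    return sum(1.0 / (k + rank) for rank in ranks)

# (dense rank, sparse rank) per document, copied from the table above.
rankings = {"D1": (1, 5), "D2": (2, 4), "D3": (3, 2), "D4": (4, 3), "D5": (5, 1)}

scores = {doc: rrf_score(list(ranks)) for doc, ranks in rankings.items()}
ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for doc, score in ordered:
    print(doc, f"{score:.5f}")
```

D3 comes out on top even though neither method ranked it first, because it is ranked reasonably well by both. D1 and D5 tie, since their ranks mirror each other.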

In [107]:
def rrf_search(query: str, limit: int = 1) -> list[models.ScoredPoint]:
    results = client_q.query_points(
        collection_name="medicalQA-sparse-and-dense",
        prefetch=[
            # two Prefetch() here 
            models.Prefetch(
                query=models.Document(
                    text=query,
                    model="jinaai/jina-embeddings-v2-small-en",
                ),
                using="jina-small",
                limit=(5 * limit),
            ),
            models.Prefetch(
                query=models.Document(
                    text=query,
                    model="Qdrant/bm25",
                ),
                using="bm25",
                limit=(5 * limit),
            ),
        ],
        
        # Fusion query enables fusion on the prefetched results
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=limit,  # cap the fused results; otherwise `limit` only scales the prefetch sizes
        with_payload=True,
    )

    return results.points

In [108]:
results = rrf_search(doc_piece["question"])
print(json.dumps(doc_piece, indent=2))
print(results[0].payload["text"])

{
  "question": "What are the treatments for Paget's Disease of Bone ?",
  "answer": "The Food and Drug Administration has approved several medications that can stop or slow down the progression of the disease and reduce pain and other symptoms. These medications fall into two categories: bisphosphonates and calcitonin. Doctors most often prescribe one of the four strongest bisphosphonates, which are risedronate, alendronate, pamidronate, and zoledronic acid.",
  "source": "NIHSeniorHealth",
  "focus_area": "Paget's Disease of Bone"
}
Paget's disease of bone is a disease that causes affected bones to become enlarged and misshapen. Our bones are living tissue, and our bodies are constantly breaking down old bone and replacing it with new bone. In Paget's disease, however, old bone is broken down and replaced at a faster rate than normal. The new bone is larger and weaker than normal bone.


## 2 RAG flow

### 2.1 RAG flow with keyword search

In [56]:
from openai import OpenAI
import os
from dotenv import load_dotenv
load_dotenv()

# api_key = os.getenv("OPENAI_API_KEY") # create a .env in the jupy project directory ($pwd) and write OPENAI_API_KEY="your_open_ai_key"
api_key = os.environ['HPG_API_KEY']

client = OpenAI(api_key=api_key,
                base_url="https://api.ai.it.ufl.edu" # add it if use HPG_API_KEY, comment it if use OPENAI_API_KEY 
                )

In [57]:
def search(query):
    boost = {'question': 3.0, 'source': 0.5}

    results = index_min.search(
        query=query,
        filter_dict={},  # filters apply to keyword fields, e.g. {'focus_area': 'Glaucoma'}
        boost_dict=boost,
        num_results=6
    )

    return results

In [69]:
query = "What is Glaucoma, the reason why it happens and suitable treatment?"
context_snippets = search(query)  # the retrieved documents

In [29]:
context_snippets

[{'question': 'What is (are) Glaucoma ?',
  'answer': "Glaucoma is a group of diseases that can damage the eye's optic nerve and result in vision loss and blindness. The most common form of the disease is open-angle glaucoma. With early treatment, you can often protect your eyes against serious vision loss. (Watch the video to learn more about glaucoma. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.)  See this graphic for a quick overview of glaucoma, including how many people it affects, whos at risk, what to do if you have it, and how to learn more.  See a glossary of glaucoma terms.",
  'source': 'NIHSeniorHealth',
  'focus_area': 'Glaucoma'},
 {'question': 'What is (are) Glaucoma ?',
  'answer': "Glaucoma is a group of diseases that can damage the eye's optic nerve and result in vision loss and blindness. While glaucoma can strike anyone, the risk is much greater for people over 60. How G

In [70]:
def build_prompt(query, search_results):
    prompt_template = """
    You're an assistant at the front desk of medical information. Answer the QUESTION based on the CONTEXT from the FAQ database.
    Use only the facts from the CONTEXT when answering the QUESTION.
    
    QUESTION: {question}
    
    CONTEXT: 
    {context}
    """.strip()

    entry_template = """
    question_in_db: {question}
    answer_in_db: {answer}
    source_in_db: {source}
    focus_area_in_db: {focus_area}
    """.strip()

    retrieved_context = ""
    
    for doc in search_results:
        # "**" unpacks the key-value pairs from the doc dictionary and passes them as keyword arguments to the format() method.
        # note to "**": the keys in the dictionary MUST match the placeholders in your template string
        retrieved_context += entry_template.format(**doc) + "\n\n"
    
    prompt = prompt_template.format(question=query, context=retrieved_context).strip()
    return prompt
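The comments about `**` unpacking are worth a quick check. This small sketch uses a shortened, made-up entry template and document to show the two behaviours that matter: extra keys are ignored, and a missing key raises `KeyError`.

```python
entry_template = "question_in_db: {question}\nanswer_in_db: {answer}"

# Extra keys in the dictionary ("source" here) are simply ignored by str.format.
doc = {
    "question": "What is (are) Glaucoma ?",
    "answer": "Glaucoma is a group of diseases.",
    "source": "NIHSeniorHealth",
}
rendered = entry_template.format(**doc)
print(rendered)

# A missing key raises KeyError, e.g. when a search result lacks a field the template expects.
incomplete = {"question": "What is (are) Glaucoma ?"}
try:
    entry_template.format(**incomplete)
except KeyError as err:
    print(f"missing field: {err}")  # → missing field: 'answer'
```

This is why a template must use the same field names as the search results it formats: a payload that stores the answer under `text` needs a `{text}` placeholder rather than `{answer}`.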

In [82]:
def llm(prompt, model='gpt-oss-120b'):
    response = client.chat.completions.create(model=model,
                                              messages=[{"role": "user", "content": prompt}]
                                             )
    
    return response.choices[0].message.content

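For reference, besides `client.chat.completions.create()` the OpenAI Python client exposes other endpoints. Whether a custom `base_url` such as the one above supports them depends on the provider:

* client.images.generate(): Creates an image based on a text prompt using models like DALL-E 3.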
* client.images.edit(): Creates an edited or extended version of an image based on a prompt and an original image.
* client.images.create_variation(): Generates a variation of a given image. 

* client.audio.transcriptions.create(): Transcribes an audio file into text using the Whisper model. This method is highly accurate and multilingual.
* client.audio.translations.create(): Translates an audio file into English.
* client.audio.speech.create(): Converts text into natural-sounding speech. 

* client.embeddings.create(): Creates a numerical vector representation (embedding) of a given text input. This is useful for semantic search, clustering, and classification.
  
* client.fine_tuning.jobs.create(): Creates a fine-tuning job to train a custom model on your own data. The client also has methods for listing, retrieving, and canceling these jobs.
  
* client.moderations.create(): Checks whether a given text input is inappropriate or unsafe, helping to filter potentially harmful content.
  
* client.models.list(): Lists all the models available to your account.
* client.files.list(): Lists all files that have been uploaded to your OpenAI account.
* client.files.create(): Uploads a file for a specific purpose, such as fine-tuning or file search.

In [72]:
def rag(query, model='gpt-oss-120b'):
    # retrieve info relevant to the query
    search_results = search(query)

    # integrate the original query with the retrieved info as a new prompt
    prompt = build_prompt(query, search_results)
    
    #feed the integrated prompt to a LLM
    answer = llm(prompt, model=model)
    
    return answer

In [33]:
question = "what is the reason of developing a cancer?"
answer = rag(question)
print(answer)

Cancer does not arise from a single cause; doctors look at several risk factors when assessing why a person might develop it.  
According to the information in the FAQ database, the main factors that can contribute to the development of cancer include:

* **Personal medical history** – past health conditions can influence risk.  
* **Smoking history** – tobacco use is specifically mentioned as a key element doctors evaluate.  
* **Exposure to environmental or occupational substances** – contact with certain chemicals or pollutants can increase risk.  
* **Family history of cancer** – having close relatives with cancer is another factor doctors consider.

These elements are examined together to understand an individual’s chance of developing cancer. (Source: NIH Senior Health, Lung Cancer FAQ).


### 2.2 RAG with vector search

In [34]:
def search_in_source(query, source="NIHSeniorHealth", limit=1):

    results = client_q.query_points(
        collection_name=collection_name,
        query=models.Document( #embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=model_handle
        ),
        query_filter=models.Filter( # filter by source
            must=[
                models.FieldCondition(
                    key="source",
                    match=models.MatchValue(value=source)
                )
            ]
        ),
        limit=limit, # top closest matches
        with_payload=True #to get metadata in the results
    )

    return results

In [35]:
def search_in_source2(query, source="NIHSeniorHealth", limit=1):

    query_points = client_q.query_points(
        collection_name=collection_name,
        query=models.Document( #embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=model_handle
        ),
        query_filter=models.Filter( # keep only points whose "source" matches
            must=[
                models.FieldCondition(
                    key="source",
                    match=models.MatchValue(value=source)
                )
            ]
        ),
        limit=limit, # top closest matches
        with_payload=True #to get metadata in the results
    )

    results = []
    for point in query_points.points:
        results.append(point.payload)
    
    return results

In [36]:
def build_prompt_vec(query, search_results):
    prompt_template = """
    You're an assistant at the front desk of medical information. Answer the QUESTION based on the CONTEXT from the FAQ database.
    Use only the facts from the CONTEXT when answering the QUESTION.
    
    QUESTION: {question}
    
    CONTEXT: 
    {context}
    """.strip()

    entry_template = """
    answer_in_db: {text}
    source_in_db: {source}
    focus_area_in_db: {focus_area}
    """.strip()

    retrieved_context = ""
    
    for doc in search_results:
        # "**" unpacks the key-value pairs from the doc dictionary and passes them as keyword arguments to the format() method.
        # note to "**": the keys in the dictionary MUST match the placeholders in your template string
        retrieved_context += entry_template.format(**doc) + "\n\n"
    
    prompt = prompt_template.format(question=query, context=retrieved_context).strip()
    return prompt
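
The comments about `**` can be checked in isolation. The snippet below is a small sketch with made-up data (the template and dictionaries are illustrative, not the notebook's): `str.format(**doc)` ignores keys the template does not use, but raises `KeyError` when a placeholder has no matching key.

```python
entry_template = "answer_in_db: {text}\nsource_in_db: {source}"

# 'id' has no placeholder in the template; extra keys are simply ignored
doc = {
    "text": "Glaucoma damages the optic nerve.",
    "source": "NIHSeniorHealth",
    "id": "641e393e8",
}
entry = entry_template.format(**doc)
print(entry)

# a placeholder without a matching key raises KeyError
incomplete_doc = {"text": "Glaucoma damages the optic nerve."}
missing_key = None
try:
    entry_template.format(**incomplete_doc)
except KeyError as err:
    missing_key = err.args[0]
print("missing key:", missing_key)
```

So every placeholder must have a key in the payload, while the payload is free to carry additional fields such as an `id`.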

In [84]:
def rag(query, model='gpt-oss-120b'):
    search_results = search_in_source2(query)
    prompt = build_prompt_vec(query, search_results)
    answer = llm(prompt, model=model)
    return answer

In [38]:
question = "what is the reason of developing a cancer?"
answer = rag(question)
print(answer)

Cancer develops when the normal process of creating new cells goes wrong. Errors cause cells to become abnormal and to divide uncontrollably, leading to the formation of cancerous tissue.


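For this question, vector search retrieves a more directly relevant passage than keyword search: the answer above explains how cancer develops, whereas the keyword-based answer only listed risk factors taken from a lung cancer FAQ.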

## 3. Retrieval evaluation

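Embedding every record for Qdrant vector search takes too long, so the evaluation uses a reduced set: only the documents referenced in the ground-truth file.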

In [8]:
df_question = pd.read_csv('ground-truth-data_576.csv', usecols=['question', 'id'])

with open('documents-with-ids.json', 'r') as f:
    js = json.load(f)

# convert to a DataFrame so that records with missing values can be dropped
df = pd.DataFrame(js)
df.dropna(inplace = True)

In [9]:
matching_records_df = df[df['id'].isin(df_question['id'])]

In [10]:
matching_records_df

Unnamed: 0,question,answer,source,focus_area,id
0,What is (are) Glaucoma ?,Glaucoma is a group of diseases that can damag...,NIHSeniorHealth,Glaucoma,641e393e8
1,What causes Glaucoma ?,"Nearly 2.7 million people have glaucoma, a lea...",NIHSeniorHealth,Glaucoma,dc2b69cd1
2,What are the symptoms of Glaucoma ?,Symptoms of Glaucoma Glaucoma can develop in ...,NIHSeniorHealth,Glaucoma,c6c94a788
3,What are the treatments for Glaucoma ?,"Although open-angle glaucoma cannot be cured, ...",NIHSeniorHealth,Glaucoma,b44df4cf1
4,What is (are) Glaucoma ?,Glaucoma is a group of diseases that can damag...,NIHSeniorHealth,Glaucoma,fbabdc01a
...,...,...,...,...,...
571,How to diagnose Peripheral Arterial Disease (P...,Your Family and Medical History P.A.D. is diag...,NIHSeniorHealth,Peripheral Arterial Disease (P.A.D.),2ce3ee1e6
572,What is (are) Peripheral Arterial Disease (P.A...,Peripheral arterial disease (P.A.D.) is a dise...,NIHSeniorHealth,Peripheral Arterial Disease (P.A.D.),9fb667609
573,What causes Peripheral Arterial Disease (P.A.D...,The most common cause of P.A.D. is atheroscler...,NIHSeniorHealth,Peripheral Arterial Disease (P.A.D.),86e605559
574,Who is at risk for Peripheral Arterial Disease...,Smoking is the main risk factor for P.A.D. You...,NIHSeniorHealth,Peripheral Arterial Disease (P.A.D.),78fa66811


In [11]:
ground_truth = df_question.to_dict(orient='records')

In [14]:
documents_wID = matching_records_df.to_dict(orient='records')

### 3.1 keyword search via minsearch

In [13]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                # count only the reciprocal rank of the first relevant result
                total_score = total_score + 1 / (rank + 1)
                break

    return total_score / len(relevance_total)
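
A small worked example may help when reading the numbers that follow. It is a self-contained sketch on made-up relevance lists (one list per query, one boolean per returned result), with the metrics re-implemented under hypothetical names so the block runs on its own; MRR here uses the rank of the first relevant result.

```python
def toy_hit_rate(relevance_total):
    """Share of queries with at least one relevant result."""
    hits = sum(1 for line in relevance_total if any(line))
    return hits / len(relevance_total)


def toy_mrr(relevance_total):
    """Mean of 1 / rank of the first relevant result (0 if there is none)."""
    total_score = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total_score += 1 / (rank + 1)
                break
    return total_score / len(relevance_total)


toy_relevance = [
    [False, True, False],   # relevant document at rank 2 -> 1/2
    [False, False, False],  # no relevant document        -> 0
    [True, False, False],   # relevant document at rank 1 -> 1
]

print(toy_hit_rate(toy_relevance))  # 2 of 3 queries have a hit
print(toy_mrr(toy_relevance))       # (1/2 + 0 + 1) / 3
```

Hit rate only asks whether the right document appears anywhere in the results, while MRR also rewards placing it near the top, which is why MRR is always less than or equal to the hit rate.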

In [17]:
index = minsearch.Index(
    text_fields=['question', 'answer', 'source', 'focus_area'],
    keyword_fields=['id']
)

index.fit(documents_wID)

<minsearch.Index at 0x7a927bdcef80>

In [18]:
def minsearch_search(query):
    boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [19]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        #print(q)
        doc_id = q['id']
        results = search_function(q)
        #print(results)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

In [20]:
from tqdm.auto import tqdm



In [21]:
eval_res = evaluate(ground_truth, lambda q: minsearch_search(q['question']))

100%|██████████████████████████████████████████████████████████████████████████████| 2880/2880 [00:13<00:00, 211.83it/s]


In [22]:
eval_res

{'hit_rate': 0.45243055555555556, 'mrr': 0.21952325837742503}

### 3.2 Finding the best parameters for keyword search

In [39]:
df_validation = df_question[:100]
df_test = df_question[100:]

In [40]:
import random

def simple_optimize(param_ranges, objective_function, n_iterations=10):
    best_params = None
    best_score = float('-inf')  # We are maximizing, so start from the lowest possible score.

    for _ in range(n_iterations):
        # Generate random parameters
        current_params = {}
        for param, (min_val, max_val) in param_ranges.items():
            if isinstance(min_val, int) and isinstance(max_val, int):
                current_params[param] = random.randint(min_val, max_val)
            else:
                current_params[param] = random.uniform(min_val, max_val)
        
        # Evaluate the objective function
        current_score = objective_function(current_params)
        
        # Update best if current is better
        if current_score > best_score:  # Higher is better (maximizing)
            best_score = current_score
            best_params = current_params
    
    return best_params, best_score
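
To see how this kind of random search behaves before pointing it at the retrieval metrics, it can be run on a toy objective with a known optimum. The sketch below is an illustration under assumptions: `random_search` and `toy_objective` are hypothetical names, the search is simplified to float parameters only, and a seeded `random.Random` is used so that runs are repeatable.

```python
import random


def random_search(param_ranges, objective_function, n_iterations=200, seed=0):
    """Sample parameters uniformly and keep the best-scoring set (maximizing)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")

    for _ in range(n_iterations):
        params = {
            name: rng.uniform(low, high)
            for name, (low, high) in param_ranges.items()
        }
        score = objective_function(params)
        if score > best_score:
            best_params, best_score = params, score

    return best_params, best_score


def toy_objective(params):
    # the maximum (score 0) is at question=1.0, answer=2.0
    return -((params["question"] - 1.0) ** 2 + (params["answer"] - 2.0) ** 2)


toy_ranges = {"question": (0.0, 3.0), "answer": (0.0, 3.0)}
best_params, best_score = random_search(toy_ranges, toy_objective)
print(best_params, best_score)
```

Random search gives no guarantee of finding the optimum; more iterations only make it more likely that some sample lands close to it, which is worth keeping in mind when reading results from a small number of iterations.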

In [41]:
gt_val = df_validation.to_dict(orient='records')

In [42]:
def minsearch_search(query, boost=None):
    if boost is None:
        boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [43]:
param_ranges = {
    'question': (0.0, 3.0),
    'answer': (0.0, 3.0),
    'source': (0.0, 3.0),
    'focus_area': (0.0, 3.0)
}

def objective(boost_params):
    def search_function(q):
        return minsearch_search(q['question'], boost_params)

    results = evaluate(gt_val, search_function)
    return results['mrr']

In [44]:
simple_optimize(param_ranges, objective, n_iterations=20)

100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 207.29it/s]

({'question': 0.5081262391450764,
  'answer': 1.5753476546149885,
  'source': 1.6267111286897529,
  'focus_area': 0.07916752364891522},
 0.3166309523809524)

Best parameters found by the random search:
{'question': 0.5081262391450764,
  'answer': 1.5753476546149885,
  'source': 1.6267111286897529,
  'focus_area': 0.07916752364891522}

MRR on the 100-question validation set = 0.3166309523809524

In [46]:
def minsearch_improved(query):
    boost = {
        'question': 0.508,
        'answer': 1.575,
        'source': 1.627,
        'focus_area': 0.079
    }

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [47]:
eval_res1 = evaluate(ground_truth, lambda q: minsearch_improved(q['question']))
eval_res1

100%|██████████████████████████████████████████████████████████████████████████████| 2880/2880 [00:15<00:00, 190.10it/s]


{'hit_rate': 0.6545138888888888, 'mrr': 0.3385154872134044}

With the boost option, the keyword-search metrics on the full ground truth improve substantially over the unboosted run (`eval_res` above):


{'hit_rate': 0.45243055555555556, 'mrr': 0.21952325837742503} ----> 
{'hit_rate': 0.6545138888888888, 'mrr': 0.3385154872134044}

### 3.3 vector search via Qdrant

In [None]:
# Define the collection name
collection_name = "medicalQA-rag576"
EMBEDDING_DIMENSIONALITY = 512
# Create the collection with specified vector parameters
client_q.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,  # Dimensionality of the vectors
        distance=models.Distance.COSINE  # Distance metric for similarity search
    )
)

In [49]:
points = []
model_handle = "jinaai/jina-embeddings-v2-small-en"
for point_id, doc in enumerate(documents_wID):  # enumerate avoids shadowing the built-in id()
    point = models.PointStruct( # models is an imported module
        id=point_id,
        vector=models.Document(text=doc['answer'], model=model_handle), # embed text locally with "jinaai/jina-embeddings-v2-small-en" from FastEmbed
        payload={
            "text": doc['answer'], # stored so the answer text comes back with the search results
            "source": doc['source'],
            "focus_area": doc['focus_area'],
            "id": doc['id']
        } # save all needed metadata fields
    )
    points.append(point)

In [29]:
client_q.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [30]:
# define a search function
def search_vec(query, limit=10):

    results = client_q.query_points(
        collection_name=collection_name,
        query=models.Document( #embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=model_handle 
        ),
        limit=limit, # top closest matches
        with_payload=True #to get metadata in the results
    )

    return results

In [31]:
def evaluate_vec(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        #print(q)
        doc_id = q['id']
        results = search_function(q)
        #print(results)
        relevance = [point.payload['id'] == doc_id for point in results.points]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

In [33]:
eval_res2 = evaluate_vec(ground_truth, lambda q: search_vec(q['question']))

100%|███████████████████████████████████████████████████████████████████████████████| 2880/2880 [00:33<00:00, 86.43it/s]


In [34]:
eval_res2

{'hit_rate': 0.8659722222222223, 'mrr': 0.5811078042328033}

In [52]:
pd.DataFrame([eval_res, eval_res1, eval_res2], index=["keyword_search", "keyword_search_boost", "vector_search"])

| | hit_rate | mrr |
|---|---|---|
| keyword_search | 0.452431 | 0.219523 |
| keyword_search_boost | 0.654514 | 0.338515 |
| vector_search | 0.865972 | 0.581108 |


**Retrieval evaluation conclusion**:
* vector search via Qdrant clearly outperforms keyword search via minsearch, even with boosting (hit rate 0.866 vs. 0.655, MRR 0.581 vs. 0.339)
* vector search is therefore used in the RAG evaluation
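
To put the gap in relative terms, the reported metrics can be compared directly. This is a small sketch that only re-uses the numbers from the comparison table; the variable names are illustrative.

```python
results = {
    "keyword_search": {"hit_rate": 0.452431, "mrr": 0.219523},
    "keyword_search_boost": {"hit_rate": 0.654514, "mrr": 0.338515},
    "vector_search": {"hit_rate": 0.865972, "mrr": 0.581108},
}

# ratio of vector search to the best keyword configuration, per metric
baseline = results["keyword_search_boost"]
ratios = {
    metric: results["vector_search"][metric] / baseline[metric]
    for metric in baseline
}

for metric, ratio in ratios.items():
    print(f"{metric}: vector search is {ratio:.2f}x the boosted keyword search")
```

The improvement is larger for MRR than for hit rate, which suggests that vector search not only finds the right document more often but also ranks it closer to the top.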

## 4. RAG evaluation

### 4.1 Model 'gpt-oss-120b'

In [80]:
# define a search function
collection_name = "medicalQA-rag576"
def search_vec(query, limit=1):

    results = client_q.query_points(
        collection_name=collection_name,
        query=models.Document( #embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=model_handle 
        ),
        limit=limit, # top closest matches
        with_payload=True #to get metadata in the results
    )

    return results

In [92]:
def build_prompt_vec(query, search_results):
    prompt_template = """
    You're an assistant at the front desk of medical information. Answer the QUESTION based on the CONTEXT from the FAQ database.
    Use only the facts from the CONTEXT when answering the QUESTION.
    
    QUESTION: {question}
    
    CONTEXT: 
    {context}
    """.strip()

    entry_template = """
    answer_in_db: {text}
    source_in_db: {source}
    focus_area_in_db: {focus_area}
    """.strip()

    retrieved_context = ""
    
    for doc in search_results.points:
        # print(doc)
        # "**" unpacks the key-value pairs from the doc dictionary and passes them as keyword arguments to the format() method.
        # note to "**": the keys in the dictionary MUST match the placeholders in your template string
        retrieved_context += entry_template.format(**doc.payload) + "\n\n"
    
    prompt = prompt_template.format(question=query, context=retrieved_context).strip()
    return prompt

In [83]:
def llm(prompt, model='gpt-oss-120b'):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

In [84]:
def rag_vec(query, model='gpt-oss-120b'):
    search_results = search_vec(query)
    prompt = build_prompt_vec(query, search_results)
    answer = llm(prompt, model=model)
    return answer

In [53]:
prompt2_template = """
You are an expert evaluator for a RAG system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()
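
The template asks for parsable JSON without code blocks, but a judge model may still wrap its reply in a fence or add a sentence around it, in which case a plain `json.loads` would raise. The helper below is a hedged sketch of a more tolerant parser; `parse_judge_output` is a hypothetical name and is not part of the notebook.

```python
import json
import re


def parse_judge_output(text):
    """Return the JSON object embedded in `text`, or None if none can be parsed."""
    # take everything from the first '{' to the last '}'
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


clean = '{"Relevance": "RELEVANT", "Explanation": "Answers the question."}'
fenced = "```json\n" + clean + "\n```"

print(parse_judge_output(clean))
print(parse_judge_output(fenced))
print(parse_judge_output("I cannot evaluate this."))
```

Returning `None` instead of raising lets an evaluation loop skip or retry a single bad reply rather than lose a long run partway through.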

In [54]:
len(ground_truth)

2880

In [93]:
record = ground_truth[0]
question = record['question']
answer_llm = rag_vec(question)

In [94]:
print(answer_llm)

Glaucoma is a group of eye diseases that can damage the optic nerve—the nerve that carries visual information from the eye to the brain. When the optic nerve is harmed, it can result in vision loss and ultimately blindness. The most common form is open‑angle glaucoma. Because the nerve damage can be progressive, early detection and treatment are important; with timely care, many people are able to protect their eyes and avoid serious vision loss.


In [95]:
prompt = prompt2_template.format(question=question, answer_llm=answer_llm)
print(prompt)

You are an expert evaluator for a RAG system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: Could you define glaucoma for me and explain how it can lead to vision loss?
Generated Answer: Glaucoma is a group of eye diseases that can damage the optic nerve—the nerve that carries visual information from the eye to the brain. When the optic nerve is harmed, it can result in vision loss and ultimately blindness. The most common form is open‑angle glaucoma. Because the nerve damage can be progressive, early detection and treatment are important; with timely care, many people are able to protect their eyes and avoid serious vision loss.

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code 

In [79]:
len(df_question)

2880

In [96]:
df_sample = df_question.sample(n=200, random_state=1)

In [97]:
sample = df_sample.to_dict(orient='records')

In [98]:
evaluations = []

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag_vec(question) 

    prompt = prompt2_template.format(
        question=question,
        answer_llm=answer_llm
    )

    evaluation = llm(prompt)
    evaluation = json.loads(evaluation)

    evaluations.append((record, answer_llm, evaluation))

100%|█████████████████████████████████████████████████████████████████████████████████| 200/200 [07:04<00:00,  2.12s/it]


In [99]:
df_eval = pd.DataFrame(evaluations, columns=['record', 'answer', 'evaluation'])

df_eval['id'] = df_eval.record.apply(lambda d: d['id'])
df_eval['question'] = df_eval.record.apply(lambda d: d['question'])

df_eval['relevance'] = df_eval.evaluation.apply(lambda d: d['Relevance'])
df_eval['explanation'] = df_eval.evaluation.apply(lambda d: d['Explanation'])

del df_eval['record']
del df_eval['evaluation']

In [100]:
df_eval.relevance.value_counts(normalize=True)

relevance
RELEVANT           0.750
PARTLY_RELEVANT    0.205
NON_RELEVANT       0.045
Name: proportion, dtype: float64

In [101]:
df_eval.to_csv('data/rag_vec-eval-gpt-oss-120b.csv', index=False)

In [102]:
df_eval[df_eval.relevance == 'NON_RELEVANT']

Unnamed: 0,answer,id,question,relevance,explanation
1,"I’m sorry, but the information you’re asking a...",49bcdbe75,"With diabetes and psoriasis, what joint clues ...",NON_RELEVANT,The generated answer does not provide any join...
12,"I’m sorry, but the information you’ve asked ab...",d2d3c2d5b,"What exactly is NET-PD, and how could this pro...",NON_RELEVANT,The answer does not explain what NET-PD is nor...
23,"The information we have about COPD notes that,...",6e9139b81,My family says my face looks blank and I walk ...,NON_RELEVANT,The answer discusses COPD progression and its ...
33,"I’m sorry, but the information you’re looking ...",295822324,Which specific brain areas are central to Park...,NON_RELEVANT,The answer does not provide any information ab...
41,"If you’re experiencing a stroke, the most obvi...",298977198,"If I have it, are the changes most obvious in ...",NON_RELEVANT,The generated answer assumes the condition is ...
75,"Based on the information in our database, sudd...",616d8ddff,My blood pressure is often low; can that make ...,NON_RELEVANT,The answer discusses stroke warning signs and ...
108,"I’m sorry, but the information you’re looking ...",967fcad01,As an older adult with heart disease and diabe...,NON_RELEVANT,The generated answer does not provide any info...
164,If your primary doctor isn’t an osteoporosis s...,178498506,If my primary doctor does not focus on osteopo...,NON_RELEVANT,The question asks which medical specialist sho...
172,"I’m sorry, but the information in our FAQ data...",d0ac3c5e3,How do the two main forms compare in speed and...,NON_RELEVANT,The answer discusses peripheral arterial disea...


### 4.2 Model 'llama-3.3-70b-instruct'

In [103]:
evaluations_llama3370b = []

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag_vec(question, model='llama-3.3-70b-instruct')  # same vector-search pipeline as in 4.1

    prompt = prompt2_template.format(
        question=question,
        answer_llm=answer_llm
    )

    evaluation = llm(prompt)
    evaluation = json.loads(evaluation)
    
    evaluations_llama3370b.append((record, answer_llm, evaluation))

100%|█████████████████████████████████████████████████████████████████████████████████| 200/200 [18:01<00:00,  5.41s/it]


In [104]:
df_eval = pd.DataFrame(evaluations_llama3370b, columns=['record', 'answer', 'evaluation'])

df_eval['id'] = df_eval.record.apply(lambda d: d['id'])
df_eval['question'] = df_eval.record.apply(lambda d: d['question'])

df_eval['relevance'] = df_eval.evaluation.apply(lambda d: d['Relevance'])
df_eval['explanation'] = df_eval.evaluation.apply(lambda d: d['Explanation'])

del df_eval['record']
del df_eval['evaluation']

In [105]:
df_eval.relevance.value_counts()

relevance
RELEVANT           97
NON_RELEVANT       52
PARTLY_RELEVANT    51
Name: count, dtype: int64

In [106]:
df_eval.relevance.value_counts(normalize=True)

relevance
RELEVANT           0.485
NON_RELEVANT       0.260
PARTLY_RELEVANT    0.255
Name: proportion, dtype: float64
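
The proportions follow directly from the counts in the previous output. As a quick check without pandas, the sketch below rebuilds the label list from those counts (the list itself is reconstructed for illustration, not taken from the evaluation data) and normalizes it.

```python
from collections import Counter

# label list rebuilt from the reported counts: 97 / 52 / 51
labels = (
    ["RELEVANT"] * 97
    + ["NON_RELEVANT"] * 52
    + ["PARTLY_RELEVANT"] * 51
)

counts = Counter(labels)
total = sum(counts.values())
proportions = {label: count / total for label, count in counts.items()}

for label, share in proportions.items():
    print(f"{label:<16} {share:.3f}")
```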

In [107]:
df_eval.to_csv('data/rag_vec-eval-gpt-llama3.3-70b.csv', index=False)

**RAG evaluation conclusion**:

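* With vector search, **'gpt-oss-120b'** gives clearly more relevant answers than 'llama-3.3-70b-instruct' on the same 200-question sample: 75.0% vs. 48.5% rated RELEVANT and 4.5% vs. 26.0% rated NON_RELEVANT (both judged by 'gpt-oss-120b')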