In [1]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
import pandas as pd
import numpy as np

from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

### Sentence Transformer

Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models. Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:

1) They calculate a fixed-size vector representation (embedding) for a given text or image.
2) Embedding calculation is often efficient, and embedding similarity calculation is very fast.
3) They are applicable to a wide range of tasks, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.
4) They are often used as the first step in a two-step retrieval process, where a Cross-Encoder (a.k.a. reranker) model re-ranks the top-k results from the bi-encoder.

https://sbert.net/index.html
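
To make the "fixed-size vector" and "fast similarity" points concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. The three-dimensional vectors and the names `doc_vecs`, `query_vec`, and `top_k` are invented for illustration; real embeddings from a model such as `all-MiniLM-L6-v2` have 384 dimensions and would come from the model's `encode` method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors, invented for illustration (not real model output).
doc_vecs = {
    "note_a": [1.0, 0.0, 0.0],
    "note_b": [0.8, 0.6, 0.0],
    "note_c": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.1, 0.0]

# First retrieval step: keep only the closest candidates.
candidates = top_k(query_vec, doc_vecs, k=2)
print(candidates)
```

In a two-step setup, the short `candidates` list is what would be handed to a Cross-Encoder for re-ranking.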



In [5]:
# load a Sentence Transformers model to compute embeddings of the note text

txt_embedder = SentenceTransformer("all-MiniLM-L6-v2")

In [8]:
print(f"Embedding dimension: {txt_embedder.get_sentence_embedding_dimension()}")

Embedding dimension: 384


### Qdrant setup

Qdrant (read: quadrant) is a vector similarity search engine and vector database. It provides a production-ready service with a convenient API to store, search, and manage points (vectors with an additional payload). Qdrant is tailored to extended filtering support, which makes it useful for all sorts of neural-network or semantic-based matching, faceted search, and other applications.
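
The idea of a point (a vector plus a payload) and of filtered search can be sketched conceptually in plain Python. This is not Qdrant's API or implementation, only an illustration of the data model: the `Point` class, the `filtered_search` helper, the `department` payload field, and the toy vectors are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Point:
    """Conceptual stand-in for a stored point: id, vector, and payload."""
    id: int
    vector: list
    payload: dict = field(default_factory=dict)


def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))


def filtered_search(points, query_vec, must_match, limit=3):
    """Keep points whose payload matches every key/value, then rank by score."""
    matches = [
        p for p in points
        if all(p.payload.get(key) == value for key, value in must_match.items())
    ]
    matches.sort(key=lambda p: dot(p.vector, query_vec), reverse=True)
    return matches[:limit]


# Toy data, invented for illustration.
points = [
    Point(1, [1.0, 0.0], {"department": "cardiology"}),
    Point(2, [0.6, 0.8], {"department": "cardiology"}),
    Point(3, [1.0, 0.0], {"department": "neurology"}),
]

results = filtered_search(points, [0.9, 0.1], {"department": "cardiology"})
print([p.id for p in results])
```

The payload filter narrows the candidate set before similarity ranking, which is the combination the description above refers to as filtering support.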

In [6]:
# create an in-memory Qdrant client (nothing is persisted once the process exits)

qdrant = QdrantClient(":memory:")

In [11]:
# (re)create the collection that stores the note vectors; an existing collection with this name is dropped first

qdrant.recreate_collection(
    collection_name="clinical_notes",
    vectors_config=models.VectorParams(
        size=txt_embedder.get_sentence_embedding_dimension(),
        distance=models.Distance.COSINE
    )
)

True

### Data Loading

In [12]:
notes_data = pd.read_csv("../data/train.csv")

In [13]:
notes_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10822 entries, 0 to 10821
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      10822 non-null  int64 
 1   Note    10822 non-null  object
 2   json    10822 non-null  object
dtypes: int64(1), object(2)
memory usage: 253.8+ KB


In [14]:
notes_data.isnull().sum()

ID      0
Note    0
json    0
dtype: int64

In [15]:
notes_data.duplicated().sum()

np.int64(0)

In [16]:
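# draw a random subset of 1,000 notes as a list of dicts; pass random_state to sample() for a reproducible subset
rag_notes = notes_data.sample(1000).to_dict("records")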

In [17]:
print(f"Length of the data : {len(rag_notes)}")

Length of the data : 1000


### Vectorize

In [22]:
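# embed each note and upload it as a point; the full record (ID, Note, json) is stored as the payload
qdrant.upload_points(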
    collection_name="clinical_notes",
    points=[
        models.PointStruct(
            id=d["ID"],
            vector=txt_embedder.encode(d["Note"]).tolist(),
            payload=d
        ) for d in rag_notes
    ]
)
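
The cell above encodes one note per `encode` call. Since `SentenceTransformer.encode` also accepts a list of texts, a batched variant is possible. The sketch below shows one way to structure it; the helpers `batched` and `iter_points` and the parameters `encode_batch` and `make_point` are hypothetical names introduced here, not part of either library.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def iter_points(records, encode_batch, make_point, batch_size=64):
    """Lazily build points, encoding the "Note" field one batch at a time.

    encode_batch: callable taking a list of texts, returning one vector per text.
    make_point:   callable taking (record, vector), returning the point object.
    """
    for chunk in batched(records, batch_size):
        vectors = encode_batch([record["Note"] for record in chunk])
        for record, vector in zip(chunk, vectors):
            yield make_point(record, vector)
```

In this notebook, `encode_batch` could be `lambda texts: txt_embedder.encode(texts).tolist()` and `make_point` could be `lambda d, v: models.PointStruct(id=d["ID"], vector=v, payload=d)`, with the resulting generator passed as `points=` to `qdrant.upload_points`. Whether batching is noticeably faster depends on the hardware and the note lengths.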

In [None]:
search_prompt = ""

In [None]:
# searching for some clinical suggestions
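
A possible shape for the search step is sketched below. It assumes a qdrant-client release that provides `query_points` (older releases expose `search` instead); the helper name `search_notes` and its parameters are hypothetical, and the prompt itself is left to the caller because the notebook does not specify one.

```python
def search_notes(client, embedder, prompt, collection_name="clinical_notes", limit=3):
    """Embed the prompt and return (score, payload) pairs for the closest notes.

    client:   a QdrantClient (assumed to provide query_points)
    embedder: a SentenceTransformer, the same model used to embed the notes
    """
    # The query must be embedded with the same model as the stored notes.
    query_vector = embedder.encode(prompt).tolist()
    response = client.query_points(
        collection_name=collection_name,
        query=query_vector,
        limit=limit,
        with_payload=True,
    )
    return [(hit.score, hit.payload) for hit in response.points]
```

With the objects defined earlier, a call would look like `search_notes(qdrant, txt_embedder, search_prompt)`, and each returned payload carries the `ID`, `Note`, and `json` fields of the matching record.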
