In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import numpy as np

from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

### Sentence Transformer

Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models. Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:

1) Calculate a fixed-size vector representation (embedding) given texts or images.
2) Embedding calculation is often efficient, and embedding similarity calculation is very fast.
3) Applicable for a wide range of tasks, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.
4) Often used as a first step in a two-step retrieval process, where a Cross-Encoder (a.k.a. reranker) model is used to re-rank the top-k results from the bi-encoder.

https://sbert.net/index.html
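
To make "embedding similarity" concrete, here is a minimal sketch of cosine similarity in plain Python. The three vectors are made-up toy values, not output from any real model, and `cosine_similarity` is a helper written for this illustration rather than a Sentence Transformers function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (invented for illustration only;
# real embedding models produce much longer vectors).
fever_note = [0.9, 0.1, 0.0]
flu_note = [0.8, 0.2, 0.1]
fracture_note = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(fever_note, flu_note)
sim_unrelated = cosine_similarity(fever_note, fracture_note)

# Vectors pointing in similar directions score closer to 1.
print(sim_related > sim_unrelated)
```

Because the score depends only on the angle between vectors, texts of very different lengths can still be compared on the same scale.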



In [3]:
# Load a Sentence Transformer model to compute embeddings of the note text

txt_embedder = SentenceTransformer("all-MiniLM-L6-v2")

In [4]:
print(f"Embedding dimension: {txt_embedder.get_sentence_embedding_dimension()}")

Embedding dimension: 384


### Qdrant setup

Qdrant (read: quadrant) is a vector similarity search engine and vector database. It provides a production-ready service with a convenient API to store, search, and manage points (vectors with an additional payload). Qdrant is tailored to extended filtering support, which makes it useful for all sorts of neural-network or semantic-based matching, faceted search, and other applications.
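
As a rough mental model of "points with a payload" and filtered search, the sketch below runs a brute-force search over a tiny in-memory list. It is a toy illustration only: the `ward` payload field, the vectors, and the `filtered_search` helper are all invented here, and a real engine would use an index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def filtered_search(points, query, payload_filter, limit):
    """Keep points whose payload passes the filter, then rank by similarity."""
    candidates = [p for p in points if payload_filter(p["payload"])]
    ranked = sorted(candidates, key=lambda p: cosine(p["vector"], query), reverse=True)
    return ranked[:limit]


# Each point carries an id, a vector, and an arbitrary payload dict.
toy_points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"ward": "A"}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"ward": "B"}},
    {"id": 3, "vector": [0.0, 1.0], "payload": {"ward": "A"}},
]

toy_hits = filtered_search(
    toy_points,
    query=[1.0, 0.0],
    payload_filter=lambda payload: payload["ward"] == "A",
    limit=2,
)
print([p["id"] for p in toy_hits])  # point 2 is excluded by the filter
```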

In [5]:
# Create an in-memory Qdrant client (nothing is persisted to disk)

qdrant = QdrantClient(":memory:")

In [6]:
# Create (or reset) the Qdrant collection that stores the note vectors

qdrant.recreate_collection(
    collection_name="clinical_notes",
    vectors_config=models.VectorParams(
        size=txt_embedder.get_sentence_embedding_dimension(),
        distance=models.Distance.COSINE
    )
)

True

### Data Loading

In [7]:
notes_data = pd.read_csv("../data/train.csv")

In [8]:
notes_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10822 entries, 0 to 10821
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      10822 non-null  int64 
 1   Note    10822 non-null  object
 2   json    10822 non-null  object
dtypes: int64(1), object(2)
memory usage: 253.8+ KB


In [9]:
notes_data.isnull().sum()

ID      0
Note    0
json    0
dtype: int64

In [10]:
notes_data.duplicated().sum()

np.int64(0)

In [11]:
rag_notes = notes_data.sample(1000).to_dict("records")

In [12]:
print(f"Length of the data : {len(rag_notes)}")

Length of the data : 1000


### Vectorize

In [13]:
qdrant.upload_points(
    collection_name="clinical_notes",
    points=[
        models.PointStruct(
            id=d["ID"],
            vector=txt_embedder.encode(d["Note"]).tolist(),
            payload=d
        ) for d in rag_notes
    ]
)
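
The cell above calls the encoder once per note. `SentenceTransformer.encode` also accepts a list of texts, so embedding in batches is a common alternative. The sketch below shows only the batching logic, using a hypothetical stand-in encoder (`fake_encode_batch`) so that it runs without a model; `batched` and `build_points` are illustrative helpers, not part of Qdrant or Sentence Transformers.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_points(records, encode_batch, batch_size=64):
    """Return one {"id", "vector", "payload"} dict per record, embedding in batches."""
    points = []
    for batch in batched(records, batch_size):
        # One encoder call per batch instead of one per record.
        vectors = encode_batch([record["Note"] for record in batch])
        for record, vector in zip(batch, vectors):
            points.append({"id": record["ID"], "vector": list(vector), "payload": record})
    return points


def fake_encode_batch(texts):
    """Hypothetical stand-in for a real model: maps each text to a 2-d vector."""
    return [[float(len(text)), float(text.count(" "))] for text in texts]


demo_records = [{"ID": i, "Note": "note " * i} for i in range(1, 6)]
demo_points = build_points(demo_records, fake_encode_batch, batch_size=2)
print(len(demo_points), demo_points[0]["vector"])
```

With the real model, `encode_batch` could plausibly be `lambda texts: txt_embedder.encode(texts).tolist()`, and each dict could then be passed to `models.PointStruct(**point)`; that wiring is an assumption and has not been run against this notebook.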

In [14]:
search_prompt = "I am suffering from fever, suggest what I can do as a remedy in the next two days."

In [15]:
# Retrieve the 3 stored notes most similar to the prompt, restricted to IDs <= 1000

hits = qdrant.search(
    collection_name="clinical_notes",
    query_vector=txt_embedder.encode(search_prompt).tolist(),
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="ID",
                range=models.Range(lte=1000),
            )
        ]
    ),
    limit=3
)

In [16]:
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'ID': 829, 'Note': "**Clinical Notes**\n\nPatient, a 70-year-old female, presents with influenza (flu) symptoms.\n\nThe patient reports a fever of 38.5°C, with an increased heart rate of 118 bpm, indicating signs of infection and potential cardiovascular stress. The presence of swollen lymph nodes further supports this, suggesting a possible viral or bacterial etiology. \n\nAdditionally, the patient is experiencing fatigue, headache, sore throat, vomiting, diarrhea, loss of taste and smell, weight loss, facial pain, anxiety, and difficulty concentrating. These symptoms collectively point towards a severe case of influenza.\n\nNotably, the rash is present on the patient's body, which could be indicative of a secondary bacterial infection or an allergic reaction to the virus.\n\nThe low oxygen saturation of 97.2% suggests potential respiratory involvement, possibly indicating pneumonia or bronchitis as a complication of the flu.\n\nGiven these symptoms and vital signs, it appears that t

### Integrating RAG with Llama

In [17]:
# Collect the payloads of the retrieved notes
search_results = [hit.payload for hit in hits]

In [18]:
assistant_content = (
    "Based on the search results, here is some information:\n" +
    "\n".join([str(item) for item in search_results])
)
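
Calling `str(item)` on a payload sends the whole dict representation to the model, including the `json` column and escaped `\n` sequences. One possible alternative, sketched below, keeps only the note text and caps its length. `build_context`, its `max_chars_per_note` parameter, and the sample payloads are assumptions made for this illustration.

```python
def build_context(payloads, max_chars_per_note=500):
    """Format retrieved payloads as a numbered list of (possibly truncated) note texts."""
    lines = []
    for rank, payload in enumerate(payloads, start=1):
        # Collapse newlines and repeated spaces so each note fits on one line.
        text = " ".join(str(payload.get("Note", "")).split())
        if len(text) > max_chars_per_note:
            text = text[:max_chars_per_note].rstrip() + " [truncated]"
        lines.append(f"[{rank}] (ID {payload.get('ID', '?')}) {text}")
    return "\n".join(lines)


# Made-up payloads with the same keys as the notebook's records.
sample_payloads = [
    {"ID": 1, "Note": "**Clinical Notes**\n\nPatient reports fever.", "json": "{}"},
    {"ID": 2, "Note": "x" * 20, "json": "{}"},
]
print(build_context(sample_payloads, max_chars_per_note=10))
```

The character cap is a crude way to bound prompt size; a token-based limit would be more precise but depends on the model's tokenizer.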

In [19]:
import ollama

In [20]:
chat_completion = ollama.chat(
    model="llama3.2:latest",
    messages=[
        {"role": "system", "content": "You are a chatbot, a clinical notes specialist. Your top priority is to help users understand their condition and provide necessary medication and suggestions."},
        {"role": "user", "content": search_prompt},
        {"role": "assistant", "content": assistant_content}
    ]
)
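
In the call above, the retrieved notes are sent as a trailing `assistant` message, so the model may treat them as something it already said rather than as evidence for the question. A common alternative, sketched here, is to place the context inside the user turn and end the conversation on that turn. `build_messages` and the instruction wording are assumptions for illustration, and whether this improves the answer for this model has not been verified.

```python
def build_messages(system_prompt, question, context):
    """Return a chat message list with the retrieved context inside the user turn."""
    user_content = (
        "Use only the clinical notes below to answer the question.\n\n"
        f"Clinical notes:\n{context}\n\n"
        f"Question: {question}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


demo_messages = build_messages(
    system_prompt="You are a clinical notes specialist.",
    question="I am suffering from fever, what can I do in the next two days?",
    context="[1] (ID 0) Example note text.",  # placeholder context, not real data
)
for message in demo_messages:
    print(message["role"], "->", len(message["content"]), "characters")
```

The resulting list has the same shape as the `messages` argument used above, so it could be passed to `ollama.chat(model=..., messages=demo_messages)`.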

In [21]:
response_content = chat_completion.get("message", {}).get("content", "")

print("LLM Response Content:")
print(response_content)

LLM Response Content:
}
```

The provided logs contain medical records of patients with various conditions such as pneumonia, fever, and respiratory distress. The data includes patient demographics, symptoms, laboratory results, and treatment plans.

To analyze this data, we can use various techniques such as natural language processing (NLP) to extract relevant information, machine learning algorithms to identify patterns and predict outcomes, and data visualization tools to present the findings in a clear and concise manner.

Some potential analysis tasks that can be performed on this data include:

1.  **Symptom severity scoring**: Develop a scoring system to quantify the severity of symptoms such as fever, cough, and difficulty breathing.
2.  **Treatment effectiveness assessment**: Analyze the treatment plans and assess their effectiveness in managing patient conditions.
3.  **Predicting patient outcomes**: Use machine learning algorithms to predict patient outcomes based on histor