Set up a local Weaviate instance with this docker-compose file; the Python client connects to it in the cells below:

version: '3.4'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.28.4
    container_name: mimic_weaviate
    ports:
      - 9000:8080    # Maps host port 9000 to container port 8080
      - 9001:50051   # Maps host port 9001 to container port 50051
    volumes:
      - /var/weaviate:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      CLUSTER_HOSTNAME: 'node1'
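
Once the container is running, the port mapping above should expose the HTTP API on host port 9000. As a minimal sketch, the following polls Weaviate's readiness endpoint (`/v1/.well-known/ready`) using only the standard library. The helper names `ready_url` and `is_weaviate_ready` and the 2-second timeout are illustrative assumptions, not part of the original notebook.

```python
import http.client
import urllib.error
import urllib.request


def ready_url(host: str = "localhost", port: int = 9000) -> str:
    """Build the readiness URL for the host port mapped in docker-compose."""
    return f"http://{host}:{port}/v1/.well-known/ready"


def is_weaviate_ready(host: str = "localhost", port: int = 9000,
                      timeout: float = 2.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or an error status: treat as not ready.
        return False
```

Calling `is_weaviate_ready()` before opening the client can help tell "container not up yet" apart from a wrong port mapping.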


In [30]:
pip install pandas weaviate-client


[notice] A new release of pip available: 22.3.1 -> 25.0
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd

mimic_df = pd.read_csv('symptoms_train.csv')
mimic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33684 entries, 0 to 33683
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         33684 non-null  int64 
 1   Symptoms           33684 non-null  object
 2   id                 33684 non-null  int64 
 3   text               33684 non-null  object
 4   long_texts         33684 non-null  object
 5   short_texts        33684 non-null  object
 6   discharge_summary  33684 non-null  object
 7   short_codes        33684 non-null  object
dtypes: int64(2), object(6)
memory usage: 2.1+ MB


In [2]:
mimic_df.head()

Unnamed: 0.1,Unnamed: 0,Symptoms,id,text,long_texts,short_texts,discharge_summary,short_codes
0,0,"['Substernal Chest Pain', 'Sharp Pain', 'Cresc...",147171,CHIEF COMPLAINT: Substernal Chest Pain\n\nPRES...,Acute myocardial infarction of other anterior ...,"AMI anterior wall, init,Ac systolic hrt failur...",Admission Date: [**2102-9-26**] ...,4101142821997142714140142804273145829
1,1,['Back pain'],199961,CHIEF COMPLAINT: \n\nPRESENT ILLNESS: The pati...,"Ankylosing spondylitis,Hypertensive chronic ki...","Ankylosing spondylitis,Hyp kid NOS w cr kid V,...",Admission Date: [**2115-6-29**] Dischar...,"7200,40391,8052,8471,E8859,78057,2859,25060"
2,2,"['Shortness of breath', 'Cough', 'Occasional n...",136812,CHIEF COMPLAINT: \n\nPRESENT ILLNESS: This is ...,Obstructive chronic bronchitis with (acute) ex...,"Obs chr bronc w(ac) exac,Pneumonia, organism N...",Admission Date: [**2106-4-14**] Dischar...,491214862800427894261143889729892449
3,3,"['left arm pain', 'left leg pain', 'pulmonary ...",175700,CHIEF COMPLAINT: s/p rollover MVC with prolong...,"Closed fracture of shaft of fibula with tibia,...","Fx shaft fib w tib-clos,Pneumococcal pneumonia...",Admission Date: [**2159-2-9**] D...,"82322,481,86121,5180,2851,81322,8072,E8160,883..."
4,4,"['Palpitations', 'Shortness of breath', 'Sore ...",193486,CHIEF COMPLAINT: Shortness of breath\n\nPRESEN...,"Other specified cardiac dysrhythmias,End stage...","Cardiac dysrhythmias NEC,End stage renal disea...",Admission Date: [**2136-10-4**] ...,"42789,5856,6822,6164,42830,2761,2869,4160,0411..."
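
The preview above suggests that `Symptoms` holds a Python list literal stored as a string (for example `['Back pain']`). If the individual symptoms are needed as a real list, or as plain comma-separated text before indexing, something like the sketch below could be used. It assumes the full (untruncated) cell values are valid list literals and falls back to the raw string otherwise. The helper names `parse_symptoms` and `symptoms_as_text` are hypothetical.

```python
import ast
from typing import List


def parse_symptoms(raw: str) -> List[str]:
    """Parse a string such as "['Back pain']" into a list of symptom strings."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal: keep the raw text as a single entry.
        return [raw]
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    # A literal that is not a list (e.g. a number): keep the raw text.
    return [raw]


def symptoms_as_text(raw: str) -> str:
    """Join the parsed symptoms into plain comma-separated text."""
    return ", ".join(parse_symptoms(raw))
```

With pandas this could be applied as `mimic_df["Symptoms"].map(symptoms_as_text)`. The rest of the notebook stores the column unchanged.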


In [3]:
import weaviate

client = weaviate.connect_to_local(
    port=9000,      # Custom HTTP port
    grpc_port=9001  # Custom gRPC port
)

# Verify connection
print(f"Client is ready: {client.is_ready()}")


Client is ready: True


In [None]:
# pip install tqdm

Define the collection: its properties, their data types, and how each text property is tokenized.

In [6]:
import weaviate
from weaviate.classes.config import Configure, Property, DataType, Tokenization



# Create a collection for keyword (BM25) search.
# index_searchable=True builds the inverted index that BM25 queries use;
# it is the default for text properties and is spelled out here for clarity.
client.collections.create(
    "MedicalRecords_v5",
    vectorizer_config=Configure.Vectorizer.none(),  # No vectorizer needed for BM25
    properties=[
        Property(
            name="symptoms",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,  # Word-level tokenization for better search
            index_searchable=True
        ),
        Property(
            name="text",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,
            index_searchable=True
        ),
        Property(
            name="long_texts",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,
            index_searchable=True
        ),
        Property(
            name="short_texts",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,
            index_searchable=True
        ),
        Property(
            name="discharge_summary",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,
            index_searchable=True
        ),
        Property(
            name="short_codes",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD
        )
    ]
)


<weaviate.collections.collection.sync.Collection at 0x171f6e5d0>

Fetch the collection from the connected Weaviate client and insert every row of the dataframe into it.

In [8]:
from tqdm import tqdm
# Get the collection

collection = client.collections.get("MedicalRecords_v5")

# Import data from dataframe
for _, row in tqdm(mimic_df.iterrows()):
    collection.data.insert({
        "symptoms": row["Symptoms"],
        "text": row["text"],
        "long_texts": row["long_texts"],
        "short_texts": row["short_texts"],
        "discharge_summary": row["discharge_summary"],
        "short_codes": row["short_codes"]
    })


33684it [04:15, 131.77it/s]
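
The loop above sends one request per row, which the progress bar shows running at roughly 130 rows per second. As a hedged sketch, the rows could instead be sent in chunks through the v4 client's `collection.data.insert_many`, which accepts a list of property dictionaries. The helper names (`row_to_properties`, `chunked`, `insert_in_chunks`) and the chunk size of 200 are illustrative assumptions, and no timing was measured for this variant.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

# Dataframe column -> Weaviate property name, mirroring the insert loop above.
COLUMN_TO_PROPERTY = {
    "Symptoms": "symptoms",
    "text": "text",
    "long_texts": "long_texts",
    "short_texts": "short_texts",
    "discharge_summary": "discharge_summary",
    "short_codes": "short_codes",
}


def row_to_properties(row: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the indexed columns from one row and rename them to property names."""
    return {prop: row[col] for col, prop in COLUMN_TO_PROPERTY.items()}


def chunked(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def insert_in_chunks(collection: Any, rows: Iterable[Dict[str, Any]],
                     chunk_size: int = 200) -> int:
    """Insert rows chunk by chunk; return the number of rows sent."""
    sent = 0
    for chunk in chunked(rows, chunk_size):
        collection.data.insert_many([row_to_properties(row) for row in chunk])
        sent += len(chunk)
    return sent
```

A possible call would be `insert_in_chunks(collection, mimic_df.to_dict("records"))`, where `to_dict("records")` turns the dataframe into a list of per-row dictionaries.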


Finally, once the data is loaded, retrieval only needs this code:

In [9]:
import weaviate

client = weaviate.connect_to_local(
    port=9000,      # Custom HTTP port
    grpc_port=9001  # Custom gRPC port
)

# Verify connection
print(f"Client is ready: {client.is_ready()}")
# Get the collection
collection = client.collections.get("MedicalRecords_v5")

# Perform a BM25 search
response = collection.query.bm25(
    query="chest pain",
    limit=1
)

# Print results
# Note: metadata.score is None unless the query requests it, e.g. by passing
# return_metadata=MetadataQuery(score=True) (from weaviate.classes.query).
for item in response.objects:
    print(f"Score: {item.metadata.score}")
    print(f"Text: {item.properties['text']}\n")


Client is ready: True
Score: None
Text: CHIEF COMPLAINT: chest pain

PRESENT ILLNESS: Ms. [**Known lastname **] is a 78yo woman w/hx of CAD s/p 4V CABG [**2088**] and
multiple PCIs who presented to [**Hospital3 8834**] with
chest pain.  She reports intermittent chest pain over past 4
days with constant chest pain since awakening at 3:45am this
morning.  Pain described as a tightness/pressure in center of
chest with heaviness in the arms bilaterally.  She has
associated SOB but no nausea/vomiting/diaphoresis.  She states
that over the past 2 weeks she has had intermittent chest pain
with climbing stairs and performing household activities.  Prior
to this she denies chest pain since her prior catheterization.
She does not take nitroglycerine at home due to headache.  Of
note, she states that she missed 2 days of her aspirin and
plavix around [**Holiday **] because she forgot to take them.
.
She initially went to [**Hospital3 8834**] where she was
started on a heparin gtt, given SL NTG X 