# ICD-10 Lookup
This notebook demonstrates how to store ICD-10 codes in a vector database as embeddings, and then how to look up the best-matching ICD-10 codes for diagnosis phrases extracted from a clinical note.

In [None]:
!pip install -q langchain langchain_openai gensim spacy scipy==1.12 lancedb langchain_community datasets

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m973.5/973.5 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m38.4/38.4 MB[0m [31m14.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m18.9/18.9 MB[0m [31m39.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m42.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m28.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m308.5/308.5 kB[0m [31m13.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m122.8/122.8 kB[0m [31m9.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m320.6/320.6 kB[0m [31m22.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━

We also need a few additional packages for LoRA inference.

In [None]:
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --no-deps xformers trl peft accelerate bitsandbytes


In [None]:
import pandas as pd
import numpy as np
from langchain_openai import OpenAIEmbeddings
from langchain_openai import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

import lancedb
from lancedb.pydantic import Vector, LanceModel
from datasets import load_dataset


## Store ICD-10 codes in a vector database
In this section we load the ICD-10 code table, obtain vector embeddings for the code descriptions, and store them in a vector database.
### Load the ICD-10 code table
We start by loading the ICD-10 code table into a pandas DataFrame. The dataset is adapted from the 2024 update of the ICD-10-CM codes, available here: https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10CM/2024-Update/

The embeddings for each description have been precomputed and published together with the codes as a Hugging Face dataset.

In [None]:
dataset = load_dataset("abhiwebshar/icd10cm-codes-with-embeddings")
df = dataset['train'].to_pandas()


In [None]:
df.head(2)

| | Code | Description | Embeddings |
|---|---|---|---|
| 0 | A000 | 'Cholera due to Vibrio cholerae 01, biovar cho... | [-0.005155704, -0.00016552847, -0.007303367, -... |
| 1 | A001 | 'Cholera due to Vibrio cholerae 01, biovar eltor' | [-0.011605786, -0.008258221, -0.009707267, -0.... |


In [None]:
df[['Code', 'Description']].sample(3)

| | Code | Description |
|---|---|---|
| 12284 | L97914 | 'Non-pressure chronic ulcer of unspecified par... |
| 53643 | S92416B | 'Nondisplaced fracture of proximal phalanx of ... |
| 40044 | S62161A | 'Displaced fracture of pisiform, right wrist, ... |


### Build the vector database
Next we define a function that stores the codes, descriptions, and embeddings in a LanceDB table. Embeddings are generated with OpenAI only if the DataFrame does not already contain them.

In [None]:
def create_vector_database(df, force_update=False):
    """
    Create a LanceDB table from the ICD-10 codes and descriptions in the given DataFrame.
    If the table already exists it is left untouched, unless force_update is True,
    in which case it is rebuilt from scratch.
    df: The DataFrame containing the ICD-10 codes and descriptions (and, optionally, precomputed embeddings).
    force_update: A boolean indicating whether to rebuild the table even if it already exists.
    """
    # Connect to LanceDB
    db = lancedb.connect("lancedb")
    table_name = "icd10_codes"

    # Skip the rebuild if the table is already there and no update was requested
    if table_name in db.table_names() and not force_update:
        return

    # Define the LanceModel schema
    class ICD10(LanceModel):
        Code: str
        Description: str
        Embeddings: Vector(1536)  # Must match the dimension of the stored embeddings

    # Create the table with the defined schema, replacing any existing table
    table = db.create_table(table_name, schema=ICD10, mode="overwrite")



    # Generate embeddings for the descriptions if not already present in the DataFrame
    if 'Embeddings' not in df.columns:
        # Initialize OpenAI embeddings
        embedder = OpenAIEmbeddings()
        descriptions = df['Description'].tolist()
        embeddings = embedder.embed_documents(descriptions)
        df['Embeddings'] = embeddings

    # Convert the DataFrame to a list of Pydantic models
    pydantic_model_items = [
        ICD10(Code=row['Code'], Description=row['Description'], Embeddings=row['Embeddings'])
        for _, row in df.iterrows()
    ]

    # Insert the Pydantic models into the table
    table.add(pydantic_model_items)
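
The function above builds one Pydantic object per row for the whole table (74,044 rows in this dataset) before a single `table.add` call. If memory or insert time becomes a concern, one option is to add the rows in batches. The sketch below only illustrates the batching logic in plain Python; the helper `batched`, the batch size, and the stand-in rows are hypothetical and not part of the notebook or of LanceDB.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in rows; in the notebook these would be the ICD10 model instances
rows = [{"Code": f"X{i:03d}"} for i in range(7)]

batch_sizes = []
for batch in batched(rows, batch_size=3):
    # In create_vector_database this is where table.add(batch) would be called
    batch_sizes.append(len(batch))

print(batch_sizes)  # [3, 3, 1]
```

Whether batching helps in practice depends on the available memory and the LanceDB version, so treat it as an option to measure rather than a required change.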

In [None]:
import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········


In [None]:
def lookup_icd10_codes(phrase, n=5):
    """
    Use the given phrase to look up the top-n matching codes in the vector database.
    Return a DataFrame with one row per match and the columns Code and Description.
    phrase: The input phrase to search for.
    n: The number of results to return.
    """
    # Initialize OpenAI embeddings (queries must use the same model as the stored vectors)
    embedder = OpenAIEmbeddings()

    # Open the LanceDB table created by create_vector_database
    db = lancedb.connect("lancedb")
    table = db.open_table("icd10_codes")

    # Generate the embedding for the input phrase as a NumPy array
    query_embedding = np.array(embedder.embed_query(phrase))

    # Search for the n nearest neighbors and keep only the code and description
    df_similar = (table.search(query_embedding, vector_column_name="Embeddings")
                  .metric("L2")  # vector distance metric; can be cosine, L2, or dot product
                  .limit(n)  # return the n most similar items
                  .to_pandas()
                  .filter(items=['Code', 'Description']))

    return df_similar
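
The `.metric("L2")` call in the lookup function selects Euclidean distance for the nearest-neighbor search. As a rough illustration of what such a ranking computes, here is a brute-force sketch in plain Python. The function names and the 2-D toy vectors are made up for this example and have nothing to do with LanceDB's internals, which use an optimized search.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_n(query, vectors, n=2, distance=l2_distance):
    """Return the indices of the n vectors closest to query (brute force)."""
    ranked = sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))
    return ranked[:n]

# Toy unit-length vectors standing in for stored embeddings (made-up values)
stored = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query = [0.8, 0.6]

print(top_n(query, stored, distance=l2_distance))      # [2, 0]
print(top_n(query, stored, distance=cosine_distance))  # [2, 0]
```

For unit-length vectors the two metrics produce the same ordering, as the toy example shows. OpenAI documents its embeddings as normalized to length 1, so the choice between L2 and cosine should not change the ranking here, although that is worth verifying on the actual data.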

In [None]:
# Initialise the LanceDB vector database (run this cell only once)
create_vector_database(df)

In [None]:
# Let's look up codes for a few phrases
# phrase = "infection caused by salmonella"
# phrase = "diabetes with peripheral autonomic neuropathy"
phrase = "left femur fracture"
results = lookup_icd10_codes(phrase, n=5)
print(results)
print()

      Code                                        Description
0  M84752A  'Incomplete atypical femoral fracture, left le...
1  M84752S  'Incomplete atypical femoral fracture, left le...
2  S72492A  'Other fracture of lower end of left femur, in...
3  S72492S  'Other fracture of lower end of left femur, se...
4  M84758S  'Complete oblique atypical femoral fracture, l...



In [None]:
from unsloth import FastLanguageModel

class DiagnosisExtractionModel:
    def __init__(self, model_name, max_seq_length, dtype, load_in_4bit):
        self.model, self.tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            max_seq_length=max_seq_length,
            dtype=dtype,
            load_in_4bit=load_in_4bit,
        )
        FastLanguageModel.for_inference(self.model)

    def generate_alpaca_input(self, instruction, input_text):
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}

### Response:
"""

model = DiagnosisExtractionModel("abhiwebshar/lora_model", 2048, None, True)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [None]:
import json

In [None]:
def extract_phrases(prompt, model):
    """
    Extract salient diagnoses from a clinical note. The function takes a clinical note as input and returns a list of extracted phrases that refer to diagnoses.
    It does so by prompting the fine-tuned LoRA model with an Alpaca-style instruction that asks for the diagnoses in JSON format.
    prompt: The clinical note from which diagnoses are to be extracted.
    model: The DiagnosisExtractionModel wrapper holding the language model and its tokenizer.
    """
    # Define the prompt template for extracting phrases

    instruction="""You are a professional clinician.
                You have been given a doctor's note after a patient's visit.
                Your task is to extract the main diagnoses from the note and provide them in a structured format.
                Instructions:

                  Focus on the assessment, plan, impression, recommendation or similar sections.
                  Exclude any items that are negated or ruled out.
                  Do not include any extraneous information.
                  Do not include the phrase "diagnosis" in your search.
                  Provide only the names of the diagnoses, without any additional details.

                Please provide the output in the following structured format:
                {{
                    "diagnoses": [
                        "<Diagnosis 1>",
                        "<Diagnosis 2>",
                        ...
                    ]
                }}"""



    # Generate the alpaca-formatted input
    alpaca_input = model.generate_alpaca_input(instruction, prompt)

    # Tokenize the input and move it to the GPU
    inputs = model.tokenizer(alpaca_input, return_tensors="pt").to("cuda")
    input_length = inputs['input_ids'].shape[1]

    # Generate the model's response
    outputs = model.model.generate(**inputs, max_new_tokens=256, use_cache=True)
    response = model.tokenizer.decode(outputs.squeeze()[input_length:], skip_special_tokens=True).strip()
    print(response)

    try:
        # Preprocess the response string
        response = response.strip('`').strip()
        if response.startswith('json'):
            response = response[4:].strip()

        # Parse the JSON response and extract the diagnoses
        response_dict = json.loads(response)
        phrases = response_dict.get("diagnoses", [])
    except json.JSONDecodeError:
        # Handle the case where the response is not a valid JSON
        phrases = []

    return phrases


def format_output(results):
    """
    Format the output of the ICD-10 code lookup function for display.
    This is done by creating a dataframe that has three columns: Phrase, ICD10 code, and Description.
    results: A dictionary where the keys are phrases and the values are dataframes containing ICD-10 codes and their descriptions.
    """
    # Collect one dataframe per phrase, tagged with the phrase it belongs to
    code_dfs = []

    for phrase, code_df in results.items():
        # Add the phrase column on a copy so the caller's dataframe is not modified
        code_df = code_df.assign(Phrase=phrase)

        # Append the code dataframe to the list
        code_dfs.append(code_df[['Phrase', 'Code', 'Description']])

    # pd.concat raises on an empty list, so return an empty frame when nothing was extracted
    if not code_dfs:
        return pd.DataFrame(columns=['Phrase', 'ICD10 code', 'Description'])

    # Concatenate all the code dataframes vertically
    result_df = pd.concat(code_dfs, ignore_index=True)

    # Rename the 'Code' column to 'ICD10 code'
    result_df = result_df.rename(columns={'Code': 'ICD10 code'})
    # Remove duplicates based on the 'ICD10 code' column, keeping the first phrase that matched
    result_df = result_df.drop_duplicates(subset=['ICD10 code'])

    return result_df
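
One detail about the instruction string in `extract_phrases`: it is a plain string literal, not an f-string or a `str.format` template, so the doubled braces `{{` and `}}` are not collapsed and reach the model as two characters each. The short sketch below illustrates the difference; the variable names are illustrative only.

```python
plain = '{{"diagnoses": []}}'

# A plain string literal keeps the doubled braces exactly as written
print(plain)            # {{"diagnoses": []}}

# str.format treats doubled braces as an escape and emits single braces
print(plain.format())   # {"diagnoses": []}

# An f-string applies the same escape rule when the literal is evaluated
as_fstring = f'{{"diagnoses": []}}'
print(as_fstring)       # {"diagnoses": []}
```

If single braces are what the model should see, they can simply be written singly in the plain string. Whether this changes the output of the fine-tuned model is not something this notebook measures, and it may depend on how the instruction was written during fine-tuning.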

In [None]:
def process_clinical_note(prompt, model):
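    """
    Process a clinical note to extract relevant diagnoses and look up ICD-10 codes for them.
    Return a dictionary mapping each extracted phrase to a DataFrame of candidate codes.
    prompt: The clinical note for which diagnoses need to be extracted and ICD-10 codes need to be looked up.
    model: The DiagnosisExtractionModel used to extract the diagnoses.
    """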

    # Extract relevant phrases from the prompt
    phrases = extract_phrases(prompt, model)
    print(f"phrases: {phrases}")

    # Lookup ICD-10 codes and descriptions for each phrase
    all_results = {}
    for phrase in phrases:
        results = lookup_icd10_codes(phrase, n=2)
        all_results[phrase] = results

    return all_results
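
To make the data flow of `process_clinical_note` and `format_output` easier to follow without a GPU or an API key, here is a plain-Python mirror of the two steps. Everything in it is a stand-in: `run_pipeline`, `flatten_unique`, `fake_extract`, and `fake_lookup` are hypothetical names, and the stubs return fixed values instead of calling the LoRA model or LanceDB.

```python
def run_pipeline(note, extract, lookup, n=2):
    """Map each extracted phrase to its top-n candidate codes."""
    return {phrase: lookup(phrase, n) for phrase in extract(note)}

def flatten_unique(results):
    """Flatten to (phrase, code, description) rows, keeping the first row seen per code."""
    seen, rows = set(), []
    for phrase, matches in results.items():
        for code, description in matches:
            if code not in seen:
                seen.add(code)
                rows.append((phrase, code, description))
    return rows

# Stub standing in for the LoRA model (fixed output, ignores the note)
def fake_extract(note):
    return ["Shortness of breath", "Dyspnea"]

# Stub standing in for the vector search (same candidates for every phrase)
def fake_lookup(phrase, n):
    candidates = [("R0602", "Shortness of breath"), ("R071", "Chest pain on breathing")]
    return candidates[:n]

results = run_pipeline("example note", fake_extract, fake_lookup)
rows = flatten_unique(results)
for row in rows:
    print(row)
```

Because both phrases return the same codes, only the rows for the first phrase survive deduplication. The same happens with `drop_duplicates` in `format_output`: a code matched by several phrases is attributed to the first phrase only.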

In [None]:
clinical_note = '''

REASON FOR CONSULTATION:
Coronary artery disease (CAD), prior bypass surgery.
HISTORY OF PRESENT ILLNESS:
The patient is a 70-year-old gentleman who was admitted for management of fever.  The patient has history of elevated PSA and BPH.  He had a prior prostate biopsy and he recently had some procedure done, subsequently developed urinary tract infection, and presently on antibiotic.  From cardiac standpoint, the patient denies any significant symptom except for fatigue and tiredness.  No symptoms of chest pain or shortness of breath.His history from cardiac standpoint as mentioned below.
CORONARY RISK FACTORS:
History of hypertension, history of diabetes mellitus, nonsmoker.  Cholesterol elevated.  History of established coronary artery disease in the family and family history positive.
FAMILY HISTORY:
Positive for coronary artery disease.
SURGICAL HISTORY:
Coronary artery bypass surgery and a prior angioplasty and prostate biopsies.
MEDICATIONS:
1.  Metformin.
2.  Prilosec.
3.  Folic acid.
4.  Flomax.
5.  Metoprolol.
6.  Crestor.
7.  Claritin.
ALLERGIES:
DEMEROL, SULFA.
PERSONAL HISTORY:
He is married, nonsmoker, does not consume alcohol, and no history of recreational drug use.
PAST MEDICAL HISTORY:
Significant for multiple knee surgeries, back surgery, and coronary artery bypass surgery with angioplasty, hypertension, hyperlipidemia, elevated PSA level, BPH with questionable cancer.  Symptoms of shortness of breath, fatigue, and tiredness.
REVIEW OF SYSTEMS:
CONSTITUTIONAL  No history of fever, rigors, or chills except for recent fever and rigors. HEENT  No history of cataract or glaucoma. CARDIOVASCULAR  As above. RESPIRATORY  Shortness of breath.  No pneumonia or valley fever. GASTROINTESTINAL  Nausea and vomiting.  No hematemesis or melena. UROLOGICAL  Frequency, urgency. MUSCULOSKELETAL  No muscle weakness. SKIN  None significant. NEUROLOGICAL  No TIA or CVA.  No seizure disorder. PSYCHOLOGICAL  No anxiety or depression. ENDOCRINE  As above. HEMATOLOGICAL  None significant.
PHYSICAL EXAMINATION:
VITAL SIGNS  Pulse of 75, blood pressure 130/68, afebrile, and respiratory rate 16 per minute. HEENT  Atraumatic, normocephalic. NECK  Veins flat.  No significant carotid bruits. LUNGS  Air entry bilaterally fair. HEART  PMI displaced.  S1 and S2 regular. ABDOMEN  Soft, nontender.  Bowel sounds present. EXTREMITIES  No edema.  Pulses are palpable.  No clubbing or cyanosis. CNS  Benign.
EKG:
Normal sinus rhythm, incomplete right bundle-branch block.
LABORATORY DATA:
H&H stable, BUN and creatinine within normal limits.
IMPRESSION:
1.  History of coronary artery disease, prior bypass surgery, angioplasty, significant shortness of breath.2.  Fever with possible urinary tract infection versus prostatitis.
3.  Hypertension, hyperlipidemia, diabetes mellitus.4.  Contemplated prostate surgery down the road.
RECOMMENDATION:
1.  From cardiac standpoint, medical management including antibiotic for his fever.
2.  We will consider cardiac workup in terms of to rule out ischemia and patency of the graft.  If he decides to go for surgery, I would like him to wait until the fever has subsided and is well under control.  Discussed with the patient the plan of care, consent was obtained.  All the questions answered in detail.
'''

In [None]:
results = process_clinical_note(clinical_note, model)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


```json
{
    "diagnoses": [
        "Coronary artery disease",
        "Bypass surgery",
        "Angioplasty",
        "Shortness of breath",
        "Hypertension",
        "Hyperlipidemia",
        "Diabetes mellitus",
        "Urinary tract infection",
        "Prostatitis",
        "Fever"
    ]
}
```
phrases: ['Coronary artery disease', 'Bypass surgery', 'Angioplasty', 'Shortness of breath', 'Hypertension', 'Hyperlipidemia', 'Diabetes mellitus', 'Urinary tract infection', 'Prostatitis', 'Fever']
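
The raw response above arrives wrapped in a Markdown json code fence, which `extract_phrases` strips before calling `json.loads`. A slightly more tolerant alternative is to search for the outermost JSON object and ignore whatever surrounds it. The helper below is a minimal sketch of that idea; `parse_diagnoses` is a hypothetical name, and the sample response is shortened from the output above.

```python
import json
import re

def parse_diagnoses(response):
    """Pull the "diagnoses" list out of a model response, tolerating surrounding text."""
    # Grab everything from the first "{" to the last "}"
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return []
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    diagnoses = payload.get("diagnoses", [])
    # Guard against a model that returns something other than a list
    return diagnoses if isinstance(diagnoses, list) else []

fence = "`" * 3  # built programmatically to avoid a literal fence inside this example
raw = fence + 'json\n{\n    "diagnoses": ["Coronary artery disease", "Fever"]\n}\n' + fence

print(parse_diagnoses(raw))  # ['Coronary artery disease', 'Fever']
```

This approach still returns an empty list when the response contains no valid JSON object, matching the fallback behavior of `extract_phrases`.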


In [None]:
#results

In [None]:
formatted_output = format_output(results)
formatted_output

| | Phrase | ICD10 code | Description |
|---|---|---|---|
| 0 | Coronary artery disease | I2542 | 'Coronary artery dissection' |
| 1 | Coronary artery disease | I2541 | 'Coronary artery aneurysm' |
| 2 | Bypass surgery | Z9884 | 'Bariatric surgery status' |
| 3 | Bypass surgery | Z951 | 'Presence of aortocoronary bypass graft' |
| 4 | Angioplasty | Z9862 | 'Peripheral vascular angioplasty status' |
| 5 | Angioplasty | Z9861 | 'Coronary angioplasty status' |
| 6 | Shortness of breath | R0602 | 'Shortness of breath' |
| 7 | Shortness of breath | R071 | 'Chest pain on breathing' |
| 8 | Hypertension | I1A0 | 'Resistant hypertension' |
| 9 | Hypertension | I950 | 'Idiopathic hypotension' |
