<a target="_blank" href="https://colab.research.google.com/github/abhiwebshar/llm_lab_tutorials/blob/main/codelookup.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Setup

Installing all the required libraries

In [1]:
!pip install -q langchain langchain_openai gensim spacy scipy==1.12 lancedb langchain_community datasets


# ICD-10 Lookup
In this notebook we demonstrate storing ICD-10 codes in a vector database via embeddings. Subsequently, we will look up ICD-10 codes for phrases in a clinical note.
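
Before touching the real data, the idea of an embedding lookup can be sketched with toy values. The example below is only an illustration: the three-dimensional vectors are made up (real embeddings have many more dimensions), and `toy_index`, `l2_distance`, and `nearest_codes` are hypothetical names that do not appear in the rest of the notebook.

```python
import math

# Made-up 3-dimensional "embeddings" keyed by ICD-10 code (illustration only).
toy_index = {
    "S72492A": [0.9, 0.1, 0.0],
    "I10": [0.0, 0.8, 0.6],
    "R0602": [0.1, 0.2, 0.97],
}

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_codes(query_vector, index, n=2):
    """Return the n codes whose vectors lie closest to query_vector."""
    ranked = sorted(index, key=lambda code: l2_distance(query_vector, index[code]))
    return ranked[:n]

# A made-up query vector that sits close to the first entry.
query = [0.85, 0.15, 0.05]
print(nearest_codes(query, toy_index, n=2))
```

A vector database does the same ranking, but over tens of thousands of stored vectors and with an index that avoids comparing the query against every row.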

In [2]:
import pandas as pd
import numpy as np
from langchain.storage import LocalFileStore
from langchain_community.vectorstores import LanceDB
from langchain_openai import OpenAIEmbeddings
from langchain_openai import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
import lancedb
from lancedb.pydantic import Vector, LanceModel

from datasets import load_dataset

## Store ICD10 codes in a vector database
In this section we parse the ICD-10 code table, compute vector embeddings for the code descriptions, and store the embeddings in a vector database.

### Load the ICD-10 code table
We start by loading the ICD-10 code table into a pandas DataFrame. The dataset is adapted from the 2024 update of the ICD-10-CM codes found here: https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10CM/2024-Update/

In [3]:
!mkdir -p data
!wget -P data https://raw.githubusercontent.com/abhiwebshar/llm_lab_tutorials/main/data/icd10cm-codes-April-2024.txt

--2024-06-03 14:03:55--  https://raw.githubusercontent.com/abhiwebshar/llm_lab_tutorials/main/data/icd10cm-codes-April-2024.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6393250 (6.1M) [text/plain]
Saving to: ‘data/icd10cm-codes-April-2024.txt’


2024-06-03 14:03:55 (68.3 MB/s) - ‘data/icd10cm-codes-April-2024.txt’ saved [6393250/6393250]



In [4]:
# The file data/icd10cm-codes-April-2024.txt has two columns, code and description, separated by "|". There is no header row.
# Load this file into a pandas DataFrame, giving the columns the names "code" and "description".
df = pd.read_csv("data/icd10cm-codes-April-2024.txt", sep='|', header=None, names=["code", "description"])
df.sample(10)

| | code | description |
|---|---|---|
| 22114 | O99112 | 'Other diseases of the blood and blood-forming... |
| 17223 | M8450XA | 'Pathological fracture in neoplastic disease, ... |
| 70087 | W44F4XA | 'Insect entering into or through a natural ori... |
| 62503 | T63063A | 'Toxic effect of venom of other North and Sout... |
| 62829 | T63793S | 'Toxic effect of contact with other venomous p... |
| 41926 | S63244D | 'Subluxation of distal interphalangeal joint o... |
| 33524 | S4490XS | 'Injury of unspecified nerve at shoulder and u... |
| 18386 | M89342 | 'Hypertrophy of bone, left hand' |
| 31226 | S3732XS | 'Contusion of urethra, sequela' |
| 71256 | Y272XXD | 'Contact with hot fluids, undetermined intent,... |
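
The descriptions in the sample above are wrapped in single quotes. If that is unwanted, a small helper can strip them. This is a sketch only: `clean_description` is a hypothetical helper that is not used elsewhere in this notebook, and it assumes the quotes always come as one matching pair around the whole string.

```python
def clean_description(text):
    """Strip surrounding whitespace and one pair of matching outer quotes."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in ("'", '"'):
        text = text[1:-1]
    return text

sample = "'Hypertrophy of bone, left hand'"
print(clean_description(sample))
```

On the DataFrame this could be applied with `df["description"].map(clean_description)`. Note that embeddings computed on the quoted text would no longer correspond exactly to the cleaned text.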


### Generate embeddings and store them
Next we generate embeddings for the ICD-10 descriptions and store them in a LanceDB table.

In [5]:
import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········


In [7]:
from pathlib import Path

def create_vector_database(df, force_update=False):
    """
    Create a LanceDB table holding the ICD10 codes, descriptions and embeddings.
    The table is built only if it does not exist yet, unless force_update is True,
    in which case it is rebuilt from scratch.
    df: DataFrame with "code" and "description" columns. If it also has an "embeddings"
        column, those vectors are reused; otherwise they are computed with OpenAI.
    force_update: A boolean indicating whether to rebuild the table even if it already exists.
    """
    # Connect to LanceDB
    db = lancedb.connect("lancedb")
    table_name = "icd10_codes"

    # Skip the expensive rebuild when the table already exists
    if table_name in db.table_names() and not force_update:
        print(f"Table '{table_name}' already exists; pass force_update=True to rebuild it.")
        return

    # Define the LanceModel schema
    class ICD10(LanceModel):
        Code: str
        Description: str
        # Must match the embedding size (1536 for text-embedding-ada-002, the OpenAIEmbeddings default)
        Embeddings: Vector(1536)

    # Generate embeddings for the descriptions if not already present in the DataFrame
    if 'embeddings' not in df.columns:
        embedder = OpenAIEmbeddings()
        df['embeddings'] = embedder.embed_documents(df['description'].tolist())

    # Create the table with the defined schema, replacing any existing copy
    table = db.create_table(table_name, schema=ICD10, mode="overwrite")

    # Convert the DataFrame to a list of Pydantic models
    pydantic_model_items = [
        ICD10(Code=row['code'], Description=row['description'], Embeddings=row['embeddings'])
        for _, row in df.iterrows()
    ]

    # Insert the Pydantic models into the table
    table.add(pydantic_model_items)


def lookup_icd10_codes(phrase, n=5):
    """
    Use the given phrase to look up the top-n matching codes in the vector database.
    Return a list of dictionaries, each containing the ICD10 code and description.
    phrase: The input phrase to search for.
    n: The number of results to return.
    """
    # Initialize OpenAI embeddings (must be the same model that produced the stored vectors)
    embedder = OpenAIEmbeddings()

    # Open the LanceDB table
    db = lancedb.connect("lancedb")
    table = db.open_table("icd10_codes")

    # Generate the embedding for the input phrase as a NumPy array
    query_embedding = np.array(embedder.embed_query(phrase))

    # Search for the n nearest neighbors and keep only the code and description
    df_similar = (table.search(query_embedding, vector_column_name="Embeddings")
                  .metric("L2")  # vector distance metric: L2, cosine, or dot product
                  .limit(n)      # number of similar items to return
                  .to_pandas()
                  .filter(items=['Code', 'Description']))

    # Convert each matching row to a dictionary
    return [
        {'icd10_code': row['Code'], 'description': row['Description']}
        for _, row in df_similar.iterrows()
    ]
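
The search above uses the L2 metric. When the stored vectors have unit length (OpenAI describes its embeddings as normalized to length 1, but treat that as an assumption to verify for the model in use), L2 and cosine distance rank neighbors identically, because the squared L2 distance equals twice the cosine distance. The sketch below checks this on made-up vectors; all names in it are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def l2(a, b):
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; the plain dot product suffices for unit vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

# Made-up vectors, normalized to unit length.
query = normalize([1.0, 2.0, 0.5])
candidates = {
    "a": normalize([1.0, 1.9, 0.4]),
    "b": normalize([0.2, 0.1, 3.0]),
    "c": normalize([2.0, 1.0, 1.0]),
}

by_l2 = sorted(candidates, key=lambda k: l2(query, candidates[k]))
by_cosine = sorted(candidates, key=lambda k: cosine_distance(query, candidates[k]))
print(by_l2 == by_cosine)
```

For vectors that are not normalized the two metrics can disagree, so the choice of metric matters more in that case.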


In [7]:
# Let's create the index. This should only be done once. It may take a few minutes to complete. Set False to True to run this cell
if False:
  create_vector_database(df)
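
Building the index embeds every description, which is why it takes a few minutes. One way to keep such a job manageable is to embed the texts in batches, so that progress can be logged or saved between calls. The following is a minimal sketch under stated assumptions: `batched`, `embed_in_batches`, and `fake_embed` are hypothetical helpers, and `fake_embed` is a stand-in that only mimics the shape of an embedding call (a list of texts in, a list of vectors out).

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=1000):
    """Call embed_fn on each batch and concatenate the resulting vectors."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch):
    """Stand-in embedder for illustration: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]

texts = ["Cholera", "Typhoid fever", "Essential (primary) hypertension"]
vectors = embed_in_batches(texts, fake_embed, batch_size=2)
print(len(vectors))
```

In this notebook, `embedder.embed_documents` has the same list-in, list-out shape and could be passed as `embed_fn`.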

In [8]:
# Faster, recommended way to set up the vector DB: use pre-embedded data stored on Hugging Face. Set True to False to skip this cell. Run only one of these two cells.
if True:
  dataset = load_dataset("abhiwebshar/icd10cm-codes-with-embeddings")
  df = dataset['train'].to_pandas()
  df.columns = [col.lower() for col in df.columns]
  create_vector_database(df)  # This df already has embeddings as a column

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [9]:
# Let's look up codes for a few phrases
# phrase = "infection caused by salmonella"
# phrase = "diabetes with peripheral autonomic neuropathy"
phrase = "left femur fracture"
results = lookup_icd10_codes(phrase, n=5)
for result in results:
    print(result)
    print()

{'icd10_code': 'M84752A', 'description': "'Incomplete atypical femoral fracture, left leg, initial encounter for fracture'"}

{'icd10_code': 'M84752S', 'description': "'Incomplete atypical femoral fracture, left leg, sequela'"}

{'icd10_code': 'S72492A', 'description': "'Other fracture of lower end of left femur, initial encounter for closed fracture'"}

{'icd10_code': 'S72492S', 'description': "'Other fracture of lower end of left femur, sequela'"}

{'icd10_code': 'M84758S', 'description': "'Complete oblique atypical femoral fracture, left leg, sequela'"}



In [10]:
def extract_phrases(prompt, model):
    """
    Extract salient diagnoses from a clinical note. The function takes a clinical note as input and returns a list of extracted phrases that refer to diagnoses.
    It does so by calling an OpenAI LLM with a prompt that instructs the model to extract salient diagnoses from the note.
    prompt: The clinical note from which diagnoses are extracted.
    model: The OpenAI language model to use for extracting diagnoses.
    """
    # Define the prompt template for extracting phrases
    prompt_template = PromptTemplate(
        input_variables=["text"],
        template='''You are a professional clinician.
                    You have been given a doctor's note after a patient's visit.
                    Your task is to extract the main diagnoses from the note.
                    Note:
                        1. Focus on the assessment, plan, impression, recommendation or similar sections.
                        2. Exclude any items that are negated or ruled out.
                        3. Do not include any extraneous information.
                        4. Do not include the word "diagnosis" in the extracted phrases.
                        5. Return one diagnosis per line.
                    {text}'''
    )

    # Set up the LLM chain
    chain = LLMChain(llm=model, prompt=prompt_template)

    # Run the chain to get the phrases
    response = chain.run({"text": prompt})

    # Process the response to extract phrases
    phrases = response.strip().split('\n')
    return [phrase.strip() for phrase in phrases if phrase.strip()]

def format_output(results):
    """
    Format the output of the ICD-10 code lookup function for display. This is done by creating a dataframe that has two columns: ICD10 code and Description.
    results: A dictionary where the keys are phrases and the values are lists of dictionaries, each containing an ICD-10 code and its description.
    """
    # Create a result dataframe
    result_df = pd.DataFrame(columns=["ICD10 code", "Description"])
    for phrase, codes in results.items():
        for code_info in codes:
            icd_code = code_info['icd10_code']
            description = code_info['description']
            result_df.loc[len(result_df)] = [icd_code, description]
    # Remove duplicates based on the 'ICD10 code' column
    result_df.drop_duplicates(subset=['ICD10 code'], inplace=True)
    return result_df
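
`extract_phrases` splits the model response on newlines, so any list numbering or bullets the model adds would end up inside the phrases. A hedged sketch of a stricter parser follows. `clean_phrase_lines` is a hypothetical helper, and the sample response is invented for illustration rather than taken from an actual model call.

```python
import re

# Matches a leading "1.", "2)", "-", "*" or "•" list marker plus surrounding spaces.
_LIST_PREFIX = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s*")

def clean_phrase_lines(response):
    """Split a response into lines, drop list markers and skip empty lines."""
    phrases = []
    for line in response.splitlines():
        phrase = _LIST_PREFIX.sub("", line).strip()
        if phrase:
            phrases.append(phrase)
    return phrases

# Invented response, for illustration only.
sample_response = "1. Coronary artery disease\n2) Hypertension\n- Diabetes mellitus\n\n"
print(clean_phrase_lines(sample_response))
```

Phrases that merely start with a number, such as "2 vessel disease", are left alone because the pattern requires a `.` or `)` directly after the digits.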


In [11]:
def process_clinical_note(prompt):
    """
    Process a clinical note to extract relevant diagnoses and lookup ICD-10 codes for them.
    prompt: The clinical note for which diagnoses need to be extracted and ICD-10 codes need to be looked up.
    """
    # Initialize the OpenAI language model
    model = OpenAI()

    # Extract relevant phrases from the prompt
    phrases = extract_phrases(prompt, model)

    # Lookup ICD-10 codes and descriptions for each phrase
    all_results = {}
    for phrase in phrases:
        results = lookup_icd10_codes(phrase, n=2)
        all_results[phrase] = results

    return all_results
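
`process_clinical_note` returns a dictionary that maps each extracted phrase to its list of matches. When reviewing results it can help to keep track of which phrase produced each code. The sketch below flattens that structure and removes duplicate codes while retaining the first phrase that matched. `flatten_results` is a hypothetical helper, and the example dictionary is hand-written in the same shape rather than produced by a real run.

```python
def flatten_results(all_results):
    """Flatten {phrase: [matches]} into rows, keeping the first phrase per code."""
    seen = set()
    rows = []
    for phrase, codes in all_results.items():
        for code_info in codes:
            code = code_info["icd10_code"]
            if code in seen:
                continue  # this code was already matched by an earlier phrase
            seen.add(code)
            rows.append({
                "phrase": phrase,
                "icd10_code": code,
                "description": code_info["description"],
            })
    return rows

# Hand-written example in the shape returned by process_clinical_note.
example = {
    "hypertension": [
        {"icd10_code": "I10", "description": "Essential (primary) hypertension"},
        {"icd10_code": "I1A0", "description": "Resistant hypertension"},
    ],
    "high blood pressure": [
        {"icd10_code": "I10", "description": "Essential (primary) hypertension"},
    ],
}
for row in flatten_results(example):
    print(row)
```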

In [12]:
clinical_note = '''

REASON FOR CONSULTATION:
Coronary artery disease (CAD), prior bypass surgery.
HISTORY OF PRESENT ILLNESS:
The patient is a 70-year-old gentleman who was admitted for management of fever.  The patient has history of elevated PSA and BPH.  He had a prior prostate biopsy and he recently had some procedure done, subsequently developed urinary tract infection, and presently on antibiotic.  From cardiac standpoint, the patient denies any significant symptom except for fatigue and tiredness.  No symptoms of chest pain or shortness of breath.His history from cardiac standpoint as mentioned below.
CORONARY RISK FACTORS:
History of hypertension, history of diabetes mellitus, nonsmoker.  Cholesterol elevated.  History of established coronary artery disease in the family and family history positive.
FAMILY HISTORY:
Positive for coronary artery disease.
SURGICAL HISTORY:
Coronary artery bypass surgery and a prior angioplasty and prostate biopsies.
MEDICATIONS:
1.  Metformin.
2.  Prilosec.
3.  Folic acid.
4.  Flomax.
5.  Metoprolol.
6.  Crestor.
7.  Claritin.
ALLERGIES:
DEMEROL, SULFA.
PERSONAL HISTORY:
He is married, nonsmoker, does not consume alcohol, and no history of recreational drug use.
PAST MEDICAL HISTORY:
Significant for multiple knee surgeries, back surgery, and coronary artery bypass surgery with angioplasty, hypertension, hyperlipidemia, elevated PSA level, BPH with questionable cancer.  Symptoms of shortness of breath, fatigue, and tiredness.
REVIEW OF SYSTEMS:
CONSTITUTIONAL  No history of fever, rigors, or chills except for recent fever and rigors. HEENT  No history of cataract or glaucoma. CARDIOVASCULAR  As above. RESPIRATORY  Shortness of breath.  No pneumonia or valley fever. GASTROINTESTINAL  Nausea and vomiting.  No hematemesis or melena. UROLOGICAL  Frequency, urgency. MUSCULOSKELETAL  No muscle weakness. SKIN  None significant. NEUROLOGICAL  No TIA or CVA.  No seizure disorder. PSYCHOLOGICAL  No anxiety or depression. ENDOCRINE  As above. HEMATOLOGICAL  None significant.
PHYSICAL EXAMINATION:
VITAL SIGNS  Pulse of 75, blood pressure 130/68, afebrile, and respiratory rate 16 per minute. HEENT  Atraumatic, normocephalic. NECK  Veins flat.  No significant carotid bruits. LUNGS  Air entry bilaterally fair. HEART  PMI displaced.  S1 and S2 regular. ABDOMEN  Soft, nontender.  Bowel sounds present. EXTREMITIES  No edema.  Pulses are palpable.  No clubbing or cyanosis. CNS  Benign.
EKG:
Normal sinus rhythm, incomplete right bundle-branch block.
LABORATORY DATA:
H&H stable, BUN and creatinine within normal limits.
IMPRESSION:
1.  History of coronary artery disease, prior bypass surgery, angioplasty, significant shortness of breath.2.  Fever with possible urinary tract infection versus prostatitis.
3.  Hypertension, hyperlipidemia, diabetes mellitus.4.  Contemplated prostate surgery down the road.
RECOMMENDATION:
1.  From cardiac standpoint, medical management including antibiotic for his fever.
2.  We will consider cardiac workup in terms of to rule out ischemia and patency of the graft.  If he decides to go for surgery, I would like him to wait until the fever has subsided and is well under control.  Discussed with the patient the plan of care, consent was obtained.  All the questions answered in detail.
'''

In [13]:
results = process_clinical_note(clinical_note)



In [14]:
formatted_output = format_output(results)
formatted_output

| | ICD10 code | Description |
|---|---|---|
| 0 | I2542 | 'Coronary artery dissection' |
| 1 | I25110 | 'Atherosclerotic heart disease of native coron... |
| 2 | Z951 | 'Presence of aortocoronary bypass graft' |
| 3 | Z980 | 'Intestinal bypass and anastomosis status' |
| 4 | Z9862 | 'Peripheral vascular angioplasty status' |
| 5 | Z9861 | 'Coronary angioplasty status' |
| 6 | R0602 | 'Shortness of breath' |
| 7 | R071 | 'Chest pain on breathing' |
| 8 | I1A0 | 'Resistant hypertension' |
| 9 | I10 | 'Essential (primary) hypertension' |
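
Nearest-neighbor search always returns the requested number of rows, even when some are weak matches, such as 'Intestinal bypass and anastomosis status' above. One possible refinement is to discard matches beyond a distance threshold. The sketch below assumes each match carries a `distance` value. LanceDB search results normally include a `_distance` column, which `lookup_icd10_codes` drops when it keeps only `Code` and `Description`, so treat that column name as something to verify. `filter_by_distance` is a hypothetical helper, and the distances shown are invented, not measured.

```python
def filter_by_distance(matches, max_distance):
    """Keep only matches whose distance to the query is at most max_distance."""
    return [m for m in matches if m["distance"] <= max_distance]

# Illustrative values only: these distances are invented.
matches = [
    {"icd10_code": "Z951", "description": "Presence of aortocoronary bypass graft", "distance": 0.31},
    {"icd10_code": "Z980", "description": "Intestinal bypass and anastomosis status", "distance": 0.52},
]
kept = filter_by_distance(matches, max_distance=0.4)
print([m["icd10_code"] for m in kept])
```

A suitable threshold depends on the embedding model and the metric, so it would need to be tuned on examples with known correct codes.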
