<a href="https://colab.research.google.com/github/beverm2391/NLP-CDST/blob/main/NLP_CDST_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Install Dependencies

In [None]:
!pip install -U sentence-transformers openai python-dotenv

In [3]:
import pickle
import numpy as np
from typing import List, Tuple, Dict
from sentence_transformers import SentenceTransformer, util
import os
import pandas as pd
from transformers import GPT2TokenizerFast
import openai
from dotenv import load_dotenv

In [4]:
# I have already taken my context, in this case the DSM-5 and loaded the text of each page into a list. I used a pdf parser for python called pdfminer.six
# ["Page one text", "Page two text",...]

# I'm using pickle to save/load the list as binary

with open("/content/drive/MyDrive/PDFs-for-Parsing/DSM-5_page_list_v1", "rb") as fp:
  dsm_list = pickle.load(fp)

# I'll refer to this context as text_to_encode

In [7]:
# Specify the file path to store and load the encoded text (which is in the form of vector embeddings)
fpath = "/content/drive/MyDrive/PDFs-for-Parsing/DSM-5_embeddings_v1"

In [8]:
# Set the OpenAI API key
# Load it from the master.env file in My Drive
# This gives access to the GPT-3 models
# Your own API key can be obtained here: https://openai.com/api/

load_dotenv("/content/drive/MyDrive/master.env")
api_key = os.environ.get('OPENAI-API-KEY')
openai.api_key = api_key
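
Because `os.environ.get` returns `None` when a variable is missing, a wrong file path or key name would only show up later as an authentication error. A minimal sketch of a fail-fast check follows; `require_env` and the `DEMO_API_KEY` name are hypothetical and are not part of the notebook or of `python-dotenv`.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check the .env file path and key name"
        )
    return value

# Demo with a dummy variable (hypothetical name, for illustration only)
os.environ["DEMO_API_KEY"] = "sk-demo"
demo_key = require_env("DEMO_API_KEY")
```

In the notebook this could wrap the lookup of the real key name, so that a typo in the `.env` file fails at setup time rather than at the first API call.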

## Some Function Definitions

This function either loads the embeddings, or returns None if none exist.

In [9]:
def load_embeddings(fpath : str) -> List[List[float]]:
  # if the csv file containing our context embeddings exists at fpath, read it and convert it into list format
  if os.path.exists(fpath):
    # pd.read_csv opens the file itself, so no separate open() call is needed
    embeddings_df = pd.read_csv(fpath)
    # convert the df back into a list of lists
    embeddings = embeddings_df.values.tolist()
    print("loaded embeddings")
    return embeddings
  else:
    # no saved embeddings yet
    return None

This next function encodes text into a vector embedding.

https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1

This model "maps sentences & paragraphs to a 768 dimensional dense vector space and was designed for semantic search.
It has been trained on 215M (question, answer) pairs from diverse sources."
It will encode our corpus of text into vector embeddings

Make sure you have added a GPU to the runtime: in the "Runtime" menu at the top, choose "Change runtime type" and select a GPU (you might have to restart the runtime).

In [10]:
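# Load the model once and reuse it; reloading it on every call is very slow
_embedding_model = None

def get_embedding(text: str) -> np.ndarray:
  global _embedding_model
  if _embedding_model is None:
    _embedding_model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')
  return _embedding_model.encode(text)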

In [11]:
def encode_text(text_to_encode : List[str], fpath : str):

  embeddings = load_embeddings(fpath)
  # create empty list of n items if no embeddings
  if embeddings is None:
    print("No file found, creating blank list")
    embeddings = [None for _ in text_to_encode]

  failed_embeddings = []
  embeddings_to_generate = embeddings.count(None)

  if embeddings_to_generate == 0:
    print("No embeddings to generate")
    return embeddings

  print(f"Generating {embeddings_to_generate} embeddings")

  for idx, value in enumerate(text_to_encode):
    if embeddings[idx] is None:
      try:
        embedding = get_embedding(value)
        embeddings[idx] = embedding.tolist()
        print(f"Embedding {idx + 1} generated")
      # if embedding fails, record the index and keep going
      except Exception as e:
        print(f"Embedding {idx + 1} failed: {e}")
        failed_embeddings.append(idx)

  print("Encoding finished")

  if len(failed_embeddings) > 0:
    print(f"{len(failed_embeddings)} embeddings failed. Indexes: {failed_embeddings}")
  else:
    print("No embeddings failed")

  # This is where we'll write the embeddings to CSV
  temp_df_for_storage = pd.DataFrame(embeddings)
  temp_df_for_storage.to_csv(fpath, index=False)

  print("Written to csv")

  return embeddings

In [None]:
embeddings = encode_text(dsm_list, fpath)

## Using the tool

In [12]:
def get_response(prompt, model):
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=0,
        max_tokens=1000,
    )
    return response['choices'][0]['text'].strip(" \n")

The `encode_text` function iterates over the corpus and encodes each list item into a vector embedding. The embeddings are then stored in a DataFrame and written to CSV for storage.

This encoding process will take some time, even with a T4 GPU. It took me about 15 minutes to encode a list of ~800 pages, each with around 1k tokens.

Once the corpus has been encoded, the embeddings only need to be loaded once per session.
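
The encode-once, load-later pattern can be sketched with the standard library alone. This is a simplified illustration, not the notebook's implementation: it assumes JSON storage instead of CSV (so that `None` placeholders for failed items survive a round trip), and `load_or_init`, `fill_missing`, and `toy_embed` are hypothetical names, with `toy_embed` standing in for the real model.

```python
import json
import os
import tempfile
from typing import List, Optional

def load_or_init(path: str, n_items: int) -> List[Optional[List[float]]]:
    """Load cached embeddings from JSON, or start with one None slot per item."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    return [None] * n_items

def fill_missing(texts, cache, embed_fn) -> List[int]:
    """Embed only the items whose slot is still None; return the indexes that failed."""
    failed = []
    for idx, text in enumerate(texts):
        if cache[idx] is None:
            try:
                cache[idx] = embed_fn(text)
            except Exception:
                failed.append(idx)
    return failed

def toy_embed(text: str) -> List[float]:
    """Toy stand-in for a real embedding model (assumption); fails on empty text."""
    if not text:
        raise ValueError("empty text")
    return [float(len(text)), float(text.count(" "))]

texts = ["page one text", "", "page three"]
path = os.path.join(tempfile.mkdtemp(), "embeddings.json")

# First session: compute what is missing and save
cache = load_or_init(path, len(texts))
failed = fill_missing(texts, cache, toy_embed)
with open(path, "w") as f:
    json.dump(cache, f)

# Later session: just load; failed slots are still None and can be retried
reloaded = load_or_init(path, len(texts))
```

Keeping the failed slots as `None` means a later run only has to retry those items instead of re-encoding the whole corpus.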

In [15]:
embeddings = load_embeddings(fpath)

loaded embeddings


Now that we have two lists, one with the corpus text and one with the encoded text (vector embedding), we'll aggregate them into one data structure.

In [16]:
def aggregate_text_and_embeddings(text_to_encode, embeddings) -> Dict[str , Tuple[str, List[float]]]:
  assert embeddings is not None

  text_to_encode = text_to_encode[:727]
  embeddings = embeddings[:727]

  text_and_embeddings = {}
  for idx, text in enumerate(text_to_encode):
    text_and_embeddings[f"Chunk {idx + 1}"] = (text, embeddings[idx])
  return text_and_embeddings

# Example
# {"Chunk 1" : ("This is the text of chunk 1...", [.23434, .12324, .52323, ...]),
#  "Chunk 2" : ("This is the text of chunk 2...", [.20934, .16524, .78362, ...])}

Then the text and embeddings get aggregated into one dictionary

In [17]:
text_and_embeddings = aggregate_text_and_embeddings(dsm_list, embeddings)
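
To make the shape of this dictionary concrete, here is a small self-contained sketch with toy data. The `aggregate` helper is hypothetical; unlike the notebook's function, it checks that both lists have the same length rather than slicing them to a fixed size.

```python
from typing import Dict, List, Tuple

def aggregate(texts: List[str], vectors: List[List[float]]) -> Dict[str, Tuple[str, List[float]]]:
    """Pair each text chunk with its embedding under a 1-based 'Chunk N' key."""
    if len(texts) != len(vectors):
        raise ValueError(f"{len(texts)} texts but {len(vectors)} embeddings")
    return {
        f"Chunk {i}": (text, vec)
        for i, (text, vec) in enumerate(zip(texts, vectors), start=1)
    }

# Toy data (made up for illustration)
toy = aggregate(["first page", "second page"], [[0.1, 0.2], [0.3, 0.4]])

# Each value unpacks into the chunk text and its embedding
text, vector = toy["Chunk 2"]
```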

In [13]:
def get_relevant_info(case_note: str, instructions : str) -> str:
  prompt = "Case Note:\n" + case_note + "\n" + instructions
  model = "text-davinci-003"
  response = get_response(prompt, model) 
  return response

In [14]:
def get_case_note_key_terms(case_note: str) -> str:
  header = "summarize any symptoms of mental illness:"
  prompt = "Case Note:" + "\n" + case_note + "\n" + header
  # note: this uses text-davinci-003, the same model as the rest of the notebook
  model = "text-davinci-003"

  response = get_response(prompt, model)
  print("Screening for:\n" + response + "\n")

  return response

In [18]:
def vector_similarity(x: List[float], y: List[float]) -> float:
    return np.dot(np.array(x), np.array(y))

def order_by_similarity(text_and_embeddings : Dict[str, Tuple[str, List[float]]]) -> List[Tuple[float, Tuple[str, List[float]], str]]:
  # note: case_note and instructions are read from the notebook's global scope (see "User Input")

  # get the relevant info from the case note
  relevant_info = get_relevant_info(case_note, instructions)
  # get an embedding of that info
  query_embedding = get_embedding(relevant_info)

  # score every chunk against the query, then sort by score only, highest first
  scored_text = [(vector_similarity(query_embedding, chunk[1]), chunk, idx) for idx, chunk in text_and_embeddings.items()]
  ordered_text = sorted(scored_text, key=lambda item: item[0], reverse=True)

  print("Text ordered by similarity")
  return ordered_text

# Example
# [(0.12398640147, ("This is the text of chunk 47...", [.23434, .12324, ...]), "Chunk 47"), ...]

The `order_by_similarity` function can be used to see which chunks are most similar to the relevant info derived from any case note.
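
The ranking step itself is just a dot product followed by a sort. A minimal pure-Python sketch with made-up two-dimensional vectors (the real embeddings have 768 dimensions) is shown here. `dot` and `rank_chunks` are hypothetical names, and the `(score, key, text)` tuple layout is chosen for readability; it is not the layout `order_by_similarity` returns.

```python
from typing import Dict, List, Tuple

def dot(x: List[float], y: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def rank_chunks(
    query_vec: List[float],
    chunks: Dict[str, Tuple[str, List[float]]],
) -> List[Tuple[float, str, str]]:
    """Score every chunk against the query and sort by score, highest first."""
    scored = [(dot(query_vec, vec), key, text) for key, (text, vec) in chunks.items()]
    return sorted(scored, key=lambda item: item[0], reverse=True)

# Toy chunks with made-up embeddings
toy_chunks = {
    "Chunk 1": ("about sleep", [1.0, 0.0]),
    "Chunk 2": ("about mood", [0.0, 1.0]),
    "Chunk 3": ("about both", [0.5, 0.5]),
}

ranking = rank_chunks([0.2, 0.9], toy_chunks)
top_score, top_key, top_text = ranking[0]
```

Sorting on the score alone avoids comparing the text or embedding fields when two scores happen to be equal.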


**Creating the prompt**

This separator is placed between the sections of text that get pulled into the prompt each time it is built. We also count its tokens so that we stay within our token limits down the line.

In [19]:
seperator = "\n* "

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
separator_len = len(tokenizer.tokenize(seperator))

f"Context separator contains {separator_len} tokens"

'Context separator contains 3 tokens'
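
The separator's token count matters because every passage added to the prompt costs its own tokens plus the separator's. A minimal sketch of that budget arithmetic follows. It assumes a whitespace word count (`toy_count`) as a stand-in for the GPT-2 tokenizer, and `pack_context` is a hypothetical helper rather than the notebook's code.

```python
from typing import Callable, List

def pack_context(
    texts: List[str],
    count_tokens: Callable[[str], int],
    separator: str,
    token_limit: int,
) -> List[str]:
    """Greedily take passages in order until the next one would exceed the budget."""
    separator_len = count_tokens(separator)
    picked = []
    used = 0
    for text in texts:
        cost = count_tokens(text) + separator_len
        if used + cost > token_limit:
            break
        used += cost
        picked.append(separator + text.replace("\n", " "))
    return picked

def toy_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())

# Passages are assumed to be already ordered from most to least relevant
ranked = ["one two three", "four five", "six seven eight nine"]
picked = pack_context(ranked, toy_count, "\n* ", token_limit=8)
```

With this toy tokenizer the first two passages cost 4 and 3 tokens, and the third would push the total past the limit of 8, so it is left out.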

In [32]:
def build_prompt(question : str, text_and_embeddings : Dict[str , Tuple[str, List[float]]]) -> str:
  # first we order the context by how similar their embeddings are to our query embedding
  ordered_text = order_by_similarity(text_and_embeddings)

  context_list = []
  context_tokens = 0
  context_token_limit = 2000

  # then we append the most similar context until we reach our token limit
  for item in ordered_text:
    text = item[1][0]

    context_tokens += len(tokenizer.tokenize(text)) + separator_len

    if context_tokens > context_token_limit:
      break

    context_list.append(seperator + text.replace("\n", " "))

  # now to actually construct the prompt
  header = """Answer as truthfully as possible using the provided context."""
  # get relevant clinical information
  relevant_info = get_relevant_info(case_note, instructions)

  prompt = header + "".join(context_list) + "\n\nRelevant Information:\n" + relevant_info + "\n\nQuestion:\n" + question + "\nAnswer:"

  print("Prompt Constructed: \n")
  print(prompt)

  return prompt


This function builds the prompt, which is a string that is sent to OpenAI's GPT-3 for the final response.

If you want to modify the header, do it here
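
The layout of the final prompt is easier to see in isolation. The sketch below reproduces only the string assembly, with placeholder inputs; `assemble_prompt` is a hypothetical helper, and the passages and relevant info are made up for illustration.

```python
from typing import List

def assemble_prompt(header: str, context_items: List[str], relevant_info: str, question: str) -> str:
    """Join header, context passages, relevant info, and question into one prompt string."""
    return (
        header
        + "".join(context_items)
        + "\n\nRelevant Information:\n" + relevant_info
        + "\n\nQuestion:\n" + question
        + "\nAnswer:"
    )

demo = assemble_prompt(
    header="Answer as truthfully as possible using the provided context.",
    context_items=["\n* first passage", "\n* second passage"],
    relevant_info="Patient reports low mood.",
    question="Does the patient meet any diagnostic criteria?",
)
print(demo)
```

Ending the prompt with `Answer:` cues a completion model to continue with the answer itself.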

## User Input

In [21]:
instructions = """Screen this patient for illness.
1. Identify the patient
2. List the patient's symptoms
3. List other key patient information

Stop
"""

You can modify the input below

In [34]:
case_note = """
Michael is a 15-year-old male who presents with fatigue and lack of interest in activities. He has lost weight and his parents are concerned. He denies any other symptoms. Medical history is significant for normal development. He has no known allergies and is up to date on his vaccinations. He reports that he used to be a good student but his grades have been slipping and he’s been skipping class. He used to be popular with his classmates but now he feels like he doesn’t fit in and has no friends. He denies any history of bullying. Mental status exam reveals a cooperative, well-groomed male who is alert and oriented to person, place, and time. He has a flat affect and speaks in a monotone voice. He reports feeling “down” but denies any suicidal ideation
"""

In [24]:
# get the relevant diagnostic info from the prompt
question = """1. Does the patient meet any diagnostic criteria?
2. Does the patient meet enough criteria to warrant further screening for anything? If so, specify what they should be screened for.
"""

Run the tool

In [25]:
def run_tool(question, corpus):
    prompt = build_prompt(question, corpus)
    model = "text-davinci-003"

    response = get_response(prompt, model)
    return response

What if the tool is wrong? This is why it's a *Decision Support Tool*: it is only meant to provide the physician with useful information, NOT to replace the physician.

In [35]:
output = run_tool(question, text_and_embeddings)
print(output)

Text ordered by similarity
Prompt Constructed: 

Answer as truthfully as possible using the provided context.
* Major Depressive Disorder  163  The essential feature of a major depressive episode is a period of at least 2 weeks during which there is either depressed mood or the loss of interest or pleasure in nearly all activi- ties (Criterion A). In children and adolescents, the mood may be irritable rather than sad. The individual must also experience at least four additional symptoms drawn from a list that includes changes in appetite or weight, sleep, and psychomotor activity; decreased en- ergy; feelings of worthlessness or guilt; difficulty thinking, concentrating, or making deci- sions; or recurrent thoughts of death or suicidal ideation or suicide plans or attempts. To count toward a major depressive episode, a symptom must either be newly present or must have clearly worsened compared with the person’s pre-episode status. The symptoms must persist for most of the day, nearly e