<a href="https://colab.research.google.com/github/beverm2391/NLP-CDST/blob/main/NLP_CDST_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

Add a Tesla T4 GPU to the runtime to support the text encoding process

In [29]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



Install Dependencies

In [30]:
!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [31]:
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [32]:
!pip install python-dotenv

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [33]:
import pickle
import numpy as np
from typing import List, Tuple, Dict
from sentence_transformers import SentenceTransformer, util
import os
import pandas as pd
from transformers import GPT2TokenizerFast
import openai
from dotenv import load_dotenv

Mount google drive to colab so that you can access its files. This is used to save/load relevant data like encoded text.

In [34]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [35]:
# I have already taken my context, in this case the DSM-5 and loaded the text of each page into a list. I used a pdf parser for python called pdfminer.six
# ["Page one text", "Page two text",...]

# I'm using pickle to save/load the list as binary

with open("/content/drive/MyDrive/PDFs-for-Parsing/DSM-5_page_list_v1", "rb") as fp:
  dsm_list = pickle.load(fp)

# I'll refer to this context as text_to_encode
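
The step that produced this pickle file is not shown in the notebook. As a rough sketch of the save/load round trip, assuming a hypothetical list `page_list` and a temporary file path (neither is from the original), it could look like this:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the parsed pages; the real list holds one string per DSM-5 page.
page_list = ["Page one text", "Page two text", "Page three text"]

# Write to a temporary location here; the notebook itself uses a Google Drive path.
tmp_path = os.path.join(tempfile.mkdtemp(), "page_list_demo")

with open(tmp_path, "wb") as fp:
    pickle.dump(page_list, fp)

with open(tmp_path, "rb") as fp:
    restored = pickle.load(fp)

print(len(restored), restored[0])
```

Pickle keeps the list structure intact, so the order of pages (and therefore the chunk numbering used later) survives the round trip.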

In [36]:
# Specify the file path to store and load the encoded text (which is in the form of vector embeddings)
fpath = "/content/drive/MyDrive/PDFs-for-Parsing/DSM-5_embeddings_v1"

In [37]:
# Set the OpenAI API key from an environment variable
# loaded from the master.env file in My Drive
# this gives access to the GPT-3 model
# your own API key can be obtained here: https://openai.com/api/

load_dotenv("/content/drive/MyDrive/master.env")
api_key = os.environ.get('OPENAI-API-KEY')
openai.api_key = api_key
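
If the `.env` file is missing or the variable name does not match, `os.environ.get` returns `None` without complaint, and the problem typically only shows up at the first API call. A small guard like the following sketch fails earlier. `require_env` and `DEMO_API_KEY` are hypothetical names used for illustration; they are not part of `python-dotenv` or `openai`.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value

# Demo with a placeholder variable name and value; a real key would come from the .env file.
os.environ["DEMO_API_KEY"] = "sk-placeholder"
print(require_env("DEMO_API_KEY"))
```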

## Some Function Definitions

This function either loads the embeddings, or returns None if none exist.

In [25]:
def load_embeddings(fpath : str) -> List[List[float]]:
  # if the csv file containing our context embeddings exists at fpath, read it and convert it into list format
  if os.path.exists(fpath):
    # pd.read_csv opens the file itself, so no separate open() is needed
    embeddings_df = pd.read_csv(fpath)
    # convert the df back into a list of lists
    embeddings = embeddings_df.values.tolist()
    print("loaded embeddings")
    return embeddings
  else:
    # no saved embeddings yet
    return None

This next function encodes text into a vector embedding.

https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1

This model "maps sentences & paragraphs to a 768 dimensional dense vector space and was designed for semantic search.
It has been trained on 215M (question, answer) pairs from diverse sources."
It will encode our corpus of text into vector embeddings

In [26]:
_model = None  # loaded on first use, then reused (reloading on every call is slow)

def get_embedding(text: str) -> np.ndarray:
  global _model
  if _model is None:
    _model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')
  return _model.encode(text)

This function makes the API call to GPT-3 and returns the response.

In [27]:
def get_response(question, corpus):
    prompt = build_prompt(question, corpus)
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        temperature=0,
        max_tokens=500,
    )
    return response['choices'][0]['text'].strip(" \n")

## Encoding the training data

This function iterates over the training text and encodes each list item into a vector embedding, which is then stored in a DataFrame and written to csv for storage. 

In [None]:
def encode_text(text_to_encode : List[str], fpath : str):

  embeddings = load_embeddings(fpath)
  # create empty list of n items if no embeddings
  if embeddings is None:
    print("No file found, creating blank list")
    embeddings = [None for _ in text_to_encode]

  failed_embeddings = []
  embeddings_to_generate = embeddings.count(None)

  if embeddings_to_generate == 0:
    print("No embeddings to generate")
    return embeddings

  print(f"Generating {embeddings_to_generate} embeddings")

  for idx, value in enumerate(text_to_encode):
    if embeddings[idx] is None:
      try:
        embedding = get_embedding(value)
        embeddings[idx] = embedding.tolist()
        print(f"Embedding {idx + 1} generated")
      # if embedding fails, record the index and keep going
      except Exception as e:
        print(f"Embedding {idx + 1} failed: {e}")
        failed_embeddings.append(idx)

  print("Encoding finished")

  if len(failed_embeddings) > 0:
    print(f"{len(failed_embeddings)} embeddings failed. Indexes: {failed_embeddings}")
  else:
    print("No embeddings failed")

  # This is where we'll write the embeddings to CSV
  temp_df_for_storage = pd.DataFrame(embeddings)
  temp_df_for_storage.to_csv(fpath, index=False)

  print("Written to csv")

  return embeddings
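
The resume behaviour comes from the `None` placeholders: only slots that are still empty get encoded, so a rerun can pick up where an earlier one stopped. Here is a toy sketch of that pattern with a stand-in encoder. `fill_missing` and `fake_embed` are hypothetical names for illustration; `fake_embed` just counts characters and spaces and is not the real model.

```python
from typing import Callable, List, Optional

def fill_missing(texts: List[str],
                 embeddings: List[Optional[List[float]]],
                 embed: Callable[[str], List[float]]) -> List[int]:
    """Encode only the slots that are still None; return the indexes that failed."""
    failed = []
    for idx, text in enumerate(texts):
        if embeddings[idx] is None:
            try:
                embeddings[idx] = embed(text)
            except Exception:
                failed.append(idx)
    return failed

def fake_embed(text: str) -> List[float]:
    # Stand-in for the real model: fails on empty input, otherwise returns a 2-d vector.
    if not text:
        raise ValueError("empty text")
    return [float(len(text)), float(text.count(" "))]

texts = ["first page", "", "third page here"]
embeddings = [[10.0, 1.0], None, None]   # slot 0 was already encoded in an earlier run
failed = fill_missing(texts, embeddings, fake_embed)
print(embeddings, failed)
```

Failed slots stay `None`, so calling the function again retries exactly those entries.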

This encoding process will take some time, even with the T4 GPU. It took me about 15 minutes to encode a list of ~800 pages, each with around 1k tokens.

In [None]:
embeddings = encode_text(dsm_list, fpath)

No file found, creating blank list
Generating 806 embeddings
Embedding 1 generated
Embedding 2 generated
Embedding 3 generated
Embedding 4 generated
Embedding 5 generated
Embedding 6 generated
Embedding 7 generated
Embedding 8 generated
Embedding 9 generated
Embedding 10 generated
Embedding 11 generated
Embedding 12 generated
Embedding 13 generated
Embedding 14 generated
Embedding 15 generated
Embedding 16 generated
Embedding 17 generated
Embedding 18 generated
Embedding 19 generated
Embedding 20 generated
Embedding 21 generated
Embedding 22 generated
Embedding 23 generated
Embedding 24 generated
Embedding 25 generated
Embedding 26 generated
Embedding 27 generated
Embedding 28 generated
Embedding 29 generated
Embedding 30 generated
Embedding 31 generated
Embedding 32 generated
Embedding 33 generated
Embedding 34 generated
Embedding 35 generated
Embedding 36 generated
Embedding 37 generated
Embedding 38 generated
Embedding 39 generated
Embedding 4

## Using the Tool

Make sure you run the cells in the function definition section above, even if you're not encoding any new context.

Once the corpus has been encoded, the embeddings only need to be loaded once per session.

In [11]:
embeddings = load_embeddings(fpath)

loaded embeddings


Now that we have two lists, one with the corpus text and one with the encoded text (vector embedding), we'll aggregate them into one data structure.

In [12]:
def aggregate_text_and_embeddings(text_to_encode, embeddings) -> Dict[str , Tuple[str, List[float]]]:
  assert embeddings is not None

  text_and_embeddings = {}
  for idx, text in enumerate(text_to_encode):
    text_and_embeddings[f"Chunk {idx + 1}"] = (text, embeddings[idx])
  return text_and_embeddings

# Example
# {"Chunk 1" : ("This is the text of chunk 1...", [.23434, .12324, .52323, ...]),
#  "Chunk 2" : ("This is the text of chunk 2...", [.20934, .16524, .78362, ...])}
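
The two lists are matched purely by position, so they need the same length and order. A toy version of the same structure, using made-up 2-dimensional vectors in place of the real 768-dimensional embeddings (all names and values below are illustrative):

```python
from typing import Dict, List, Tuple

# Toy inputs: three "pages" and made-up 2-dimensional embeddings.
pages = ["alpha text", "beta text", "gamma text"]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Same shape as the dictionary built above: "Chunk N" -> (text, embedding)
toy_corpus: Dict[str, Tuple[str, List[float]]] = {
    f"Chunk {idx + 1}": (text, vector)
    for idx, (text, vector) in enumerate(zip(pages, vectors))
}

text, vector = toy_corpus["Chunk 2"]
print(text, vector)
```

Note that `zip` silently stops at the shorter list, so a length check before aggregating is a cheap safeguard if the page list and the saved embeddings could ever drift apart.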

Then the text and embeddings get aggregated into one dictionary, `text_and_embeddings`, which is later passed to `get_response` as its `corpus` argument.

In [13]:
text_and_embeddings = aggregate_text_and_embeddings(dsm_list, embeddings)

With the text of each chunk paired with its vector embedding, we can rank the chunks against a question.

These functions calculate the similarity between the question embedding and each corpus embedding using a dot product, then return the text chunks in order of similarity.

In [16]:
def vector_similarity(x: List[float], y: List[float]) -> float:
    return np.dot(np.array(x), np.array(y))

def order_by_similarity(question : str, text_and_embeddings : Dict[str, Tuple[str, List[float]]]) -> List[Tuple[float, Tuple[str, List[float]], str]]:

  # Option A: include the case note in the question embedding
  # question_embedding = get_embedding(f"{case_note}\n{question}")
  # Option B (used here): don't include the case note in the question embedding
  question_embedding = get_embedding(question)

  # each item is (similarity, (text, embedding), chunk name), most similar first
  ordered_text = sorted(
      [(vector_similarity(question_embedding, chunk[1]), chunk, name) for name, chunk in text_and_embeddings.items()],
      key=lambda item: item[0], reverse=True)

  print("Text ordered by similarity")
  return ordered_text

# Example
# [(0.12398640147, ("This is the text of chunk 47...", [.23434, ...]), "Chunk 47"), ...]
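
To make the ranking step concrete, here is a small sketch with made-up 2-dimensional vectors and a plain-Python dot product. The corpus, the query vector, and the helper `dot` are all illustrative assumptions; the real pipeline uses the model's 768-dimensional embeddings and `np.dot`.

```python
from typing import Dict, List, Tuple

def dot(x: List[float], y: List[float]) -> float:
    # plain-Python dot product, equivalent to np.dot for 1-d vectors
    return sum(a * b for a, b in zip(x, y))

toy_corpus: Dict[str, Tuple[str, List[float]]] = {
    "Chunk 1": ("about sleep", [1.0, 0.0]),
    "Chunk 2": ("about mood", [0.0, 1.0]),
    "Chunk 3": ("about both", [0.5, 0.5]),
}
query_vector = [0.25, 1.0]  # made-up query embedding that leans towards "mood"

ranked = sorted(
    ((dot(query_vector, emb), name, text) for name, (text, emb) in toy_corpus.items()),
    key=lambda item: item[0],
    reverse=True,
)
for score, name, text in ranked:
    print(f"{score:.3f}  {name}  {text}")
```

A raw dot product is not length-normalized, so vectors with larger magnitude score higher. That appears to be intended here, since the model name indicates it was tuned for dot-product scoring, but cosine similarity would be the usual choice for a model that was not.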

**Creating the prompt**

The separator below is placed between the sections of text that get pulled into the prompt each time it is built. We also count how many tokens it contains so that we stay within our token limits down the line.

In [18]:
seperator = "\n* "

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# count the separator's tokens (the variable is spelled `seperator` here and in build_prompt)
separator_len = len(tokenizer.tokenize(seperator))

f"Context separator contains {separator_len} tokens"

'Context separator contains 3 tokens'

This function builds the prompt, which is a string that is sent to OpenAI's GPT-3 for the final response.

If you want to modify the header, do it here

In [19]:
def build_prompt(question : str, text_and_embeddings : Dict[str , Tuple[str, List[float]]]) -> str:
  # first we order the context chunks by how similar their embeddings are to our query embedding
  ordered_text = order_by_similarity(question, text_and_embeddings)

  context_list = []
  context_tokens = 0
  context_token_limit = 2000

  # then we append the most similar context until we reach our token limit
  for item in ordered_text:
    text = item[1][0]

    context_tokens += len(tokenizer.tokenize(text)) + separator_len

    if context_tokens > context_token_limit:
      break

    context_list.append(seperator + text.replace("\n", " "))

  # now to actually construct the prompt
  header = """Answer as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""

  prompt = header + "".join(context_list) + "\n\nCase Note: \n" + case_note + "\n\nQuestion: " + question + "\nAnswer:"

  print("Prompt Constructed: \n")
  print(prompt)

  return prompt
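
The context selection in `build_prompt` is a greedy packing loop: take chunks in ranked order and stop at the first one that would push the total past the limit. A stripped-down sketch of that logic follows, using a crude word counter in place of the GPT-2 tokenizer. `pack_context` and `count_words` are hypothetical helpers, and the limit of 8 is only for the demo.

```python
from typing import Callable, List

def pack_context(chunks: List[str], limit: int, separator_tokens: int,
                 count_tokens: Callable[[str], int]) -> List[str]:
    """Take chunks in ranked order until the next one would exceed the token limit."""
    picked, used = [], 0
    for chunk in chunks:
        used += count_tokens(chunk) + separator_tokens
        if used > limit:
            break
        picked.append(chunk)
    return picked

def count_words(text: str) -> int:
    # crude stand-in for a real tokenizer: one token per whitespace-separated word
    return len(text.split())

ranked_chunks = ["one two three", "four five", "six seven eight nine", "ten"]
picked = pack_context(ranked_chunks, limit=8, separator_tokens=1, count_tokens=count_words)
print(picked)
```

Because the loop breaks rather than skips, a single long chunk near the top of the ranking can leave part of the budget unused. The context budget also has to leave room for the header, the case note, the question, and the `max_tokens` reserved for the answer, all within the model's context window.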


You can modify the input below

In [45]:
question = """Screen this patient for depression.

1. Identify the patient
2. List the patient's core symptoms
3. List other key patient information
4. Figure out if the patient meets any diagnostic criteria
5. Decide if the patient meets enough criteria to warrant further screening for anything"""

In [39]:
case_note = """
Michael is a 15-year-old male who presents with fatigue and lack of interest in activities. He has lost weight and his parents are concerned. He denies any other symptoms. Medical history is significant for normal development. He has no known allergies and is up to date on his vaccinations. He reports that he used to be a good student but his grades have been slipping and he’s been skipping class. He used to be popular with his classmates but now he feels like he doesn’t fit in and has no friends. He denies any history of bullying. Mental status exam reveals a cooperative, well-groomed male who is alert and oriented to person, place, and time. He has a flat affect and speaks in a monotone voice. He reports feeling “down” but denies any suicidal ideation
"""

Get your response

In [44]:
response = get_response(question, text_and_embeddings)
print(response)

Text ordered by similarity
Prompt Constructed: 

Answer as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:

* Other Specified Depressive Disorder  183  condition, b) the probability that the associated medical condition has a potential to pro- mote or cause a depressive disorder, and c) a course of the depressive symptoms shortly after the onset or worsening of the medical condition, especially if the depressive symp- toms remit near the time that the medical disorder is effectively treated or remits.  Medication-induced depressive disorder. An important caveat is that some medical con- ditions are treated with medications (e.g., steroids or alpha-interferon) that can induce depres- sive or manic symptoms. In these cases, clinical judgment, based on all the evidence in hand, is the best way to try to separate the most likely and/or the most important of two etiological fac- tors (i.e., associatio