<a href="https://colab.research.google.com/github/beverm2391/NLP-CDST/blob/main/NLP_CDST_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

Add a Tesla T4 GPU to the runtime to support the text encoding process

In [None]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



Install Dependencies

In [None]:
!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Building wheels for collected 

In [None]:
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.25.0.tar.gz (44 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting pandas-stubs>=1.1.0.11
  Downloading pandas_stubs-1.2.0.62-py3-none-any.whl (163 kB)
Building wheels for collected packages: openai
  Building wheel for openai (PEP 517) ... done
  Created wheel for openai: filename=openai-0.25.0-py3-none-any.whl size=55880 sha256=2f5c9a396ea148bd95172a4fedba2c2b309cccce66623192dbab3314262ed030
  Stored in directory: /root/.cache/pip/wheels/19/de/db/e82770b480ec30fd4a6d67108744b9c52be167c04fcf4af7b5
Successfully built openai
Installing collected packages: pandas-stubs, openai
Successfully in

In [None]:
!pip install python-dotenv

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-dotenv
  Downloading python_dotenv-0.21.0-py3-none-any.whl (18 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-0.21.0


In [None]:
import pickle
import numpy as np
from typing import List, Tuple, Dict
from sentence_transformers import SentenceTransformer, util
import os
import pandas as pd
from transformers import GPT2TokenizerFast
import openai
from dotenv import load_dotenv

Mount google drive to colab so that you can access its files. This is used to save/load relevant data like encoded text.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# I have already taken my context, in this case the DSM-5 and loaded the text of each page into a list. I used a pdf parser for python called pdfminer.six
# ["Page one text", "Page two text",...]

# I'm using pickle to save/load the list as binary

with open("/content/drive/MyDrive/PDFs-for-Parsing/DSM-5_page_list_v1", "rb") as fp:
  dsm_list = pickle.load(fp)

# I'll refer to this context as text_to_encode

In [None]:
# Specify the file path to store and load the encoded text (which is in the form of vector embeddings)
fpath = "/content/drive/MyDrive/PDFs-for-Parsing/DSM-5_embeddings_v1"

In [None]:
# Set the OpenAI API key from an environment variable
# loaded from the master.env file in My Drive
# this gives access to the GPT-3 models
# your own API key can be obtained here: https://openai.com/api/

load_dotenv("/content/drive/MyDrive/master.env")
api_key = os.environ.get('OPENAI-API-KEY')
openai.api_key = api_key
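
If the `.env` file is missing or the variable name is misspelled, `os.environ.get` quietly returns `None` and the failure only shows up later at the first API call. A minimal sketch of a fail-fast check follows; the helper name `require_api_key` is hypothetical and not part of the original notebook.

```python
import os

def require_api_key(var_name: str = "OPENAI-API-KEY") -> str:
    """Return the key stored in the environment, or raise a clear error."""
    # hypothetical helper: os.environ.get returns None when the variable is unset
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; check the path passed to load_dotenv"
        )
    return key
```

It could be called right after `load_dotenv(...)`, for example `openai.api_key = require_api_key()`.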

## Some Function Definition

This function either loads the embeddings, or returns None if none exist.

In [None]:
def load_embeddings(fpath : str) -> List[List[float]]:
  # if the csv file containing our context embeddings exists at fpath, read it and convert it into list format
  if os.path.exists(fpath):
    # pandas opens the file itself, so no separate open() call is needed
    embeddings_df = pd.read_csv(fpath)
    # convert the df back into a list of lists
    embeddings = embeddings_df.values.tolist()
    print("loaded embeddings")
    return embeddings
  else:
    # no saved embeddings yet
    return None

This next function encodes text into a vector embedding.

https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1

This model "maps sentences & paragraphs to a 768 dimensional dense vector space and was designed for semantic search.
It has been trained on 215M (question, answer) pairs from diverse sources."
It will encode our corpus of text into vector embeddings

In [None]:
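_model = None

def get_embedding(text: str) -> np.ndarray:
  # load the model once and reuse it; reloading it on every call is very slow
  global _model
  if _model is None:
    _model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')
  return _model.encode(text)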

This function makes the API call to GPT-3 and returns the response.

In [None]:
def get_response(prompt, model):
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=0,
        max_tokens=1000,
    )
    return response['choices'][0]['text'].strip(" \n")

In [None]:
def run_tool(question, corpus):
    prompt = build_prompt(question, corpus)
    model = "text-davinci-002"

    response = get_response(prompt, model)
    return response

## Encoding the training data

This function iterates over the training text and encodes each list item into a vector embedding, which is then stored in a DataFrame and written to csv for storage. 

In [None]:
def encode_text(text_to_encode : List[str], fpath : str):

  embeddings = load_embeddings(fpath)
  # create an empty list of n items if no embeddings were loaded
  if embeddings is None:
    print("No file found, creating blank list")
    embeddings = [None for _ in text_to_encode]

  failed_embeddings = []
  embeddings_to_generate = embeddings.count(None)

  if embeddings_to_generate == 0:
    print("No embeddings to generate")
    return embeddings

  print(f"Generating {embeddings_to_generate} embeddings")

  for idx, value in enumerate(text_to_encode):
    if embeddings[idx] is None:
      try:
        embedding = get_embedding(value)
        embeddings[idx] = embedding.tolist()
        print(f"Embeddding {idx + 1} generated")
      # if embedding fails
      except Exception as e:
        print(f"Embedding {idx + 1} failed")
        failed_embeddings.append(idx)

  print("Encoding Successful")

  if len(failed_embeddings) > 0:
    print(f"{len(failed_embeddings)} embeddings failed. Indexes: {failed_embeddings}")
  else:
    print("No embeddings failed")

  # This is where we'll write the embeddings to CSV
  temp_df_for_storage = pd.DataFrame(embeddings)
  temp_df_for_storage.to_csv(fpath, index=False)

  print("Written to csv")

  return embeddings
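
The resumable part of this function (only empty slots are encoded, and failures are recorded instead of stopping the run) can be sketched in isolation. This is a minimal illustration rather than the notebook's implementation: `fill_missing` and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in for `get_embedding` so the sketch runs without downloading a model.

```python
from typing import Callable, List, Optional

def fill_missing(texts: List[str],
                 embeddings: List[Optional[List[float]]],
                 embed: Callable[[str], List[float]]) -> List[int]:
    """Embed only the slots that are still None; return the indexes that failed."""
    failed = []
    for idx, text in enumerate(texts):
        if embeddings[idx] is not None:
            continue  # already encoded in an earlier run
        try:
            embeddings[idx] = embed(text)
        except Exception:
            failed.append(idx)  # slot stays None so a later run can retry it
    return failed

# hypothetical stand-in for get_embedding: fails on empty text
def toy_embed(text: str) -> List[float]:
    if not text:
        raise ValueError("empty page")
    return [float(len(text)), float(text.count(" "))]

pages = ["first page", "", "third page here"]
slots = [[10.0, 1.0], None, None]  # first page was encoded previously
failed = fill_missing(pages, slots, toy_embed)
print(slots, failed)
```

Passing the encoder in as an argument keeps the retry logic testable without a GPU.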

This encoding process will take some time, even with the T4 GPU. It took me about 15 minutes to encode a list of ~800 pages, each with around 1k tokens.

In [None]:
embeddings = encode_text(dsm_list, fpath)

No file found, creating blank list
Generating 806 embeddings
Embeddding 1 generated
Embeddding 2 generated
Embeddding 3 generated
Embeddding 4 generated
Embeddding 5 generated
Embeddding 6 generated
Embeddding 7 generated
Embeddding 8 generated
Embeddding 9 generated
Embeddding 10 generated
Embeddding 11 generated
Embeddding 12 generated
Embeddding 13 generated
Embeddding 14 generated
Embeddding 15 generated
Embeddding 16 generated
Embeddding 17 generated
Embeddding 18 generated
Embeddding 19 generated
Embeddding 20 generated
Embeddding 21 generated
Embeddding 22 generated
Embeddding 23 generated
Embeddding 24 generated
Embeddding 25 generated
Embeddding 26 generated
Embeddding 27 generated
Embeddding 28 generated
Embeddding 29 generated
Embeddding 30 generated
Embeddding 31 generated
Embeddding 32 generated
Embeddding 33 generated
Embeddding 34 generated
Embeddding 35 generated
Embeddding 36 generated
Embeddding 37 generated
Embeddding 38 generated
Embeddding 39 generated
Embeddding 4

## Using the Tool

Make sure you run the *"Some Function Definition"* section even if you're not encoding any new context.

Once the corpus has been encoded, the embeddings only need to be loaded once per session.

In [None]:
embeddings = load_embeddings(fpath)

loaded embeddings


Now that we have two lists, one with the corpus text and one with the encoded text (vector embedding), we'll aggregate them into one data structure.

In [None]:
def aggregate_text_and_embeddings(text_to_encode, embeddings) -> Dict[str , Tuple[str, List[float]]]:
  assert embeddings is not None

  # keep only the first 727 pages and their embeddings
  text_to_encode = text_to_encode[:727]
  embeddings = embeddings[:727]

  text_and_embeddings = {}
  for idx, text in enumerate(text_to_encode):
    text_and_embeddings[f"Chunk {idx + 1}"] = (text, embeddings[idx])
  return text_and_embeddings

# Example
# {"Chunk 1" : ("This is the text of chunk 1...", [.23434, .12324, .52323, ...]),
#  "Chunk 2" : ("This is the text of chunk 2...", [.20934, .16524, .78362, ...])}
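
The same structure can be built on toy data to make the shape concrete. This is a small sketch, not the notebook's function: `pair_chunks` is a hypothetical name, and the explicit length check is an assumption added here (the original truncates both lists to the same length instead).

```python
from typing import Dict, List, Tuple

def pair_chunks(texts: List[str],
                vectors: List[List[float]]) -> Dict[str, Tuple[str, List[float]]]:
    """Map "Chunk N" (1-based) to a (text, embedding) tuple."""
    if len(texts) != len(vectors):
        # assumption: a mismatch means the embeddings file is stale or incomplete
        raise ValueError("texts and vectors must have the same length")
    return {
        f"Chunk {i}": (text, vector)
        for i, (text, vector) in enumerate(zip(texts, vectors), start=1)
    }

toy = pair_chunks(["page one", "page two"], [[0.1, 0.2], [0.3, 0.4]])
print(toy["Chunk 2"])
```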

Then the text and embeddings get aggregated into one dictionary called `text_and_embeddings` (passed to `run_tool` as its `corpus` argument).

In [None]:
text_and_embeddings = aggregate_text_and_embeddings(dsm_list, embeddings)

With the text of each chunk paired with its vector embedding, we can rank the chunks against a query.

These functions calculate the similarity between the query embedding (built from the relevant info extracted from the case note) and each corpus embedding using a dot product, then return the text chunks in descending order of similarity.

In [None]:
def get_case_note_key_terms(case_note: str) -> str:
  header = "summarize any symptoms of mental illness:"
  prompt = "Case Note:" + "\n" + case_note + "\n" + header
  # I'm using text-davinci-002 so that the response is more concise
  model = "text-davinci-002"

  response = get_response(prompt, model)
  print("Screening for:\n" + response + "\n")

  return response

In [None]:
def get_relevant_info(case_note: str, instructions : str) -> str:
  prompt = "Case Note:\n" + case_note + "\n" + instructions
  model = "text-davinci-002"
  response = get_response(prompt, model) 
  return response

In [None]:
instructions = """Screen this patient for illness.
1. Identify the patient
2. List the patient's symptoms
3. List other key patient information

Stop
"""

In [None]:
def vector_similarity(x: List[float], y: List[float]) -> float:
    return np.dot(np.array(x), np.array(y))

def order_by_similarity(text_and_embeddings : Dict[str, Tuple[str, List[float]]]) -> List[Tuple[float, Tuple[str, List[float]], str]]:

  # get the relevant info from the case note (case_note and instructions are globals defined in later cells)
  relevant_info = get_relevant_info(case_note, instructions)
  # get an embedding of that relevant info
  query_embedding = get_embedding(relevant_info)

  # each item is (similarity, (text, embedding), chunk key), sorted from most to least similar
  ordered_text = sorted([(vector_similarity(query_embedding, chunk[1]), chunk, key) for key, chunk in text_and_embeddings.items()], reverse=True)

  print("Text ordered by similarity")
  return ordered_text

# Example
# [(26.218427990590744, ("This is the text of chunk 47...", [.23434, .12324, ...]), "Chunk 47"), ...]
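
The ranking step can be tried on toy vectors without any model or API call. This is a minimal sketch under stated assumptions: `dot` and `rank_chunks` are hypothetical names, the two-dimensional vectors are made up, and the result tuples are ordered `(score, key, text)` for readability, which differs from the notebook's tuples.

```python
from typing import Dict, List, Tuple

def dot(x: List[float], y: List[float]) -> float:
    """Plain-Python dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def rank_chunks(query: List[float],
                corpus: Dict[str, Tuple[str, List[float]]]) -> List[Tuple[float, str, str]]:
    """Return (score, key, text) tuples, highest dot product first."""
    scored = [(dot(query, vec), key, text) for key, (text, vec) in corpus.items()]
    # sort on the score only, so ties never fall back to comparing the payloads
    return sorted(scored, key=lambda item: item[0], reverse=True)

toy_corpus = {
    "Chunk 1": ("sleep problems", [1.0, 0.0]),
    "Chunk 2": ("worry and tension", [0.0, 2.0]),
    "Chunk 3": ("mixed symptoms", [1.0, 1.0]),
}
ranked = rank_chunks([0.5, 1.0], toy_corpus)
for score, key, text in ranked:
    print(score, key, text)
```

Because the score is an unnormalized dot product, longer vectors can outrank shorter ones that point in the same direction; the model used here is described on its model card as designed for dot-product scoring.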

The cell below can be used to see which chunks are most similar to the relevant info derived from a case note. Run the cells under "User Input" first so that `case_note` is defined.


In [None]:
ordered_text = order_by_similarity(text_and_embeddings)
for item in ordered_text[0:7]:
  print(item)

Text ordered by similarity
(26.218427990590744, ('192\n\nAnxiety Disorders\n\ndividual’s separation anxiety (e.g., destruction of the family through fire, murder, or other\ncatastrophe) (Criterion A7). Physical symptoms (e.g., headaches, abdominal complaints,\nnausea, vomiting) are common in children when separation from major attachment fig-\nures occurs or is anticipated (Criterion A8). Cardiovascular symptoms such as palpitations,\ndizziness, and feeling faint are rare in younger children but may occur in adolescents and\nadults.\n\nThe disturbance must last for a period of at least 4 weeks in children and adolescents\nyounger than 18 years and is typically 6 months or longer in adults (Criterion B). However,\nthe duration criterion for adults should be used as a general guide, with allowance for\nsome degree of flexibility. The disturbance must cause clinically significant distress or im-\npairment in social, academic, occupational, or other important areas of functioning (Cri-\nte

**Creating the prompt**

This separator divides the sections of text that get pulled into the prompt each time it is built. We also count its tokens so that we stay within our token limits down the line.

In [None]:
seperator = "\n* "

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
separator_len = len(tokenizer.tokenize(seperator))

f"Context separator contains {separator_len} tokens"

'Context separator contains 3 tokens'

This function builds the prompt, which is a string that is sent to OpenAI's GPT-3 for the final response.

If you want to modify the header, do it here

In [None]:
def build_prompt(question : str, text_and_embeddings : Dict[str , Tuple[str, List[float]]]) -> str:
  # first we order the context chunks by how similar their embeddings are to our query embedding
  ordered_text = order_by_similarity(text_and_embeddings)

  context_list = []
  context_tokens = 0
  context_token_limit = 2000

  # then we append the most similar context until we reach our token limit
  for item in ordered_text:
    text = item[1][0]

    context_tokens += len(tokenizer.tokenize(text)) + separator_len

    if context_tokens > context_token_limit:
      break

    context_list.append(seperator + text.replace("\n", " "))

  # now to actually construct the prompt
  header = """Answer as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
  # get relevant clinical information
  relevant_info = get_relevant_info(case_note, instructions)

  prompt = header + "".join(context_list) + "\n\nRelevant Information:\n" + relevant_info + "\n\nQuestion:\n" + question + "\nAnswer:"

  print("Prompt Constructed: \n")
  print(prompt)

  return prompt
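
The greedy packing loop inside `build_prompt` can be sketched on its own. This is an illustration under stated assumptions: `pack_context` is a hypothetical name, and `whitespace_tokens` is a crude stand-in for the GPT-2 tokenizer, so the counts will not match real token counts.

```python
from typing import Callable, List

def pack_context(chunks: List[str],
                 count_tokens: Callable[[str], int],
                 token_limit: int,
                 separator: str = "\n* ") -> str:
    """Append chunks in ranked order until the next one would exceed the budget."""
    picked = []
    used = 0
    sep_cost = count_tokens(separator)
    for text in chunks:
        used += count_tokens(text) + sep_cost
        if used > token_limit:
            break  # stop at the first chunk that does not fit
        picked.append(separator + text.replace("\n", " "))
    return "".join(picked)

# hypothetical tokenizer: one token per whitespace-separated word
def whitespace_tokens(text: str) -> int:
    return len(text.split())

chunks = ["one two three", "four five", "six seven eight nine"]
context = pack_context(chunks, whitespace_tokens, token_limit=8)
print(context)
```

Note that the loop stops at the first chunk that overflows the budget rather than skipping it and trying smaller chunks further down the ranking.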


## User Input

You can modify the input below

In [None]:
case_note = """
Michael is a 15-year-old male who presents with fatigue and lack of interest in activities. He has lost weight and his parents are concerned. He denies any other symptoms. Medical history is significant for normal development. He has no known allergies and is up to date on his vaccinations. He reports that he used to be a good student but his grades have been slipping and he’s been skipping class. He used to be popular with his classmates but now he feels like he doesn’t fit in and has no friends. He denies any history of bullying. Mental status exam reveals a cooperative, well-groomed male who is alert and oriented to person, place, and time. He has a flat affect and speaks in a monotone voice. He reports feeling “down” but denies any suicidal ideation
"""

In [None]:
case_note = """
John is a 15-year-old male who has been referred to the clinic for evaluation. He reports that he has been feeling worried for the past few months, and that his worry has been increasing in intensity. He describes feeling overwhelmed and unable to concentrate on tasks, and reports that he has been avoiding social situations and activities that he used to enjoy. He also reports having difficulty sleeping, and has been having frequent nightmares. He has been feeling irritable and has been having difficulty controlling his emotions. He has also been experiencing physical symptoms such as headaches, stomachaches, and muscle tension. He has been avoiding school and has been having difficulty focusing in class. He reports that he has been feeling hopeless and helpless, and has been having thoughts of self-harm.
"""

In [None]:
# get the relevant diagnostic info from the prompt
question = """1. Does the patient meet any diagnostic criteria?
2. Does the patient meet enough criteria to warrant further screening for anything? If so, specify what they should be screened for.
"""

In [None]:
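# relevant_info must be computed before it can be combined with the question
relevant_info = get_relevant_info(case_note, instructions)
question_and_relevant_info = relevant_info + "\n\nQuestion:\n" + question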

Run the tool

In [None]:
output = run_tool(question, text_and_embeddings)
print(output)

Text ordered by similarity
Prompt Constructed: 

Answer as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:

* 192  Anxiety Disorders  dividual’s separation anxiety (e.g., destruction of the family through fire, murder, or other catastrophe) (Criterion A7). Physical symptoms (e.g., headaches, abdominal complaints, nausea, vomiting) are common in children when separation from major attachment fig- ures occurs or is anticipated (Criterion A8). Cardiovascular symptoms such as palpitations, dizziness, and feeling faint are rare in younger children but may occur in adolescents and adults.  The disturbance must last for a period of at least 4 weeks in children and adolescents younger than 18 years and is typically 6 months or longer in adults (Criterion B). However, the duration criterion for adults should be used as a general guide, with allowance for some degree of flexibility. The disturbance must ca