Title:  Project Workbook OpenAI

Authors:  Matthew Lopes and Chris Kabat

This notebook was created to allow for word/sentence embeddings to be created using Microsoft's Azure Open AI Cognitive Service to support our CS 598 DLH project. We do actually create the embeddings in this notebook to avoid saving large files, but prepare the data for the creation of them. The paper we have chosen for the reproducibility project is:
***Ensembling Classical Machine Learning and Deep Learning Approaches for Morbidity Identification from Clinical Notes***

Abstract:  The main goal of the paper is to extract morbidity from clinical notes.  The idea was to use a combination of classical and deep learning methods to determine the best approach for classifying these notes into one or more of 16 morbidity conditions.  These models used a combination of NLP techniques including embeddings and bag of words implementations.  It also measured the effect of including stop words.  Lastly, it used ensemble techniques to tie together a number of the classical and deep learning models to provide the most accurate results.

The data cannot be shared publicly due to the agreements required to obtain it, so we are storing the data locally and not putting it on GitHub.

We are only creating embeddings for data that includes stop words.  Note, access to this service requires an Azure subscription.  There is a cost for the service, so we ran the embedding calls only once and saved the dataframe with the stored embeddings.  You must also set the following environment variables (on Windows, setx takes the name and the value separated by a space, with no equals sign):
setx AZURE_OPENAI_API_KEY "(key)"
setx AZURE_OPENAI_ENDPOINT "(endpoint)"
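
Before running the cells below, a quick check can confirm that both variables are visible to Python. This is a minimal sketch; `missing_env_vars` is a hypothetical helper and not part of the notebook.

```python
import os

REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

Values written with setx are only picked up by processes started afterwards, so the Jupyter server may need a restart before the check passes.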

In this workbook, we are taking the following steps:

* Clean and tokenize the data
* Retrieve and store the document and sentence vectors from the Azure Open AI Service.

First, we load the required libraries and get our environment variables.

In [None]:
pip install "openai<1" num2words matplotlib plotly scipy scikit-learn pandas tiktoken nltk

In [1]:
import openai
import os
import re
import requests
import sys
from num2words import num2words
import pandas as pd
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity, get_embeddings
import tiktoken
import torch

# set seed
seed = 24
np.random.seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)
# define data path
DATA_PATH = './obesity_data/'
AOAI_PATH = './aoai/'

alldocs_df = pd.read_pickle(os.path.join(DATA_PATH, 'alldocs_df.pkl'))

API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
RESOURCE_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT") 



Next we see what models in the Azure Open AI Service we have access to.  This also tests the connectivity.

In [2]:
openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01"

url = openai.api_base + "/openai/deployments?api-version=" + openai.api_version

r = requests.get(url, headers={"api-key": API_KEY})

print(r.text)

{
  "data": [
    {
      "scale_settings": {
        "scale_type": "standard"
      },
      "model": "text-curie-001",
      "owner": "organization-owner",
      "id": "text-curie-001",
      "status": "succeeded",
      "created_at": 1673986855,
      "updated_at": 1673986855,
      "object": "deployment"
    },
    {
      "scale_settings": {
        "scale_type": "standard"
      },
      "model": "text-davinci-002",
      "owner": "organization-owner",
      "id": "text-davinci-002",
      "status": "succeeded",
      "created_at": 1674680690,
      "updated_at": 1674680690,
      "object": "deployment"
    },
    {
      "scale_settings": {
        "scale_type": "standard"
      },
      "model": "gpt-35-turbo",
      "owner": "organization-owner",
      "id": "gpt35kabat",
      "status": "succeeded",
      "created_at": 1678383593,
      "updated_at": 1678383593,
      "object": "deployment"
    },
    {
      "scale_settings": {
        "scale_type": "standard"
      },
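
The raw listing is long, so it can help to reduce it to the two fields that matter later: the deployment id (the value passed as `engine`) and the underlying model. The sketch below assumes the response has the shape shown above; `list_deployments` is a hypothetical helper.

```python
import json

def list_deployments(response_text):
    """Return (deployment id, model) pairs from a deployments listing."""
    payload = json.loads(response_text)
    return [(d["id"], d["model"]) for d in payload.get("data", [])]

# Abbreviated sample shaped like the response above
sample = '{"data": [{"model": "gpt-35-turbo", "id": "gpt35kabat", "status": "succeeded"}]}'
for deployment_id, model in list_deployments(sample):
    print(f"{deployment_id}: {model}")
```

In the notebook, `r.text` would be passed in place of `sample`.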
     

Next we clean the data using some guidance from Microsoft's tutorial here: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/tutorials/embeddings?tabs=command-line

In [3]:
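#Very minimal cleansing as discussed in the AOAI tutorial
def normalize_text(s, sep_token = " \n "):
    # sep_token is kept from the tutorial's signature but is not used
    # collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+',  ' ', s).strip()
    # remove " ," together with the character before it (the "." is unescaped, so it matches any character)
    s = re.sub(r". ,","",s)
    # collapse doubled periods
    s = s.replace("..",".")
    s = s.replace(". .",".")
    # no newlines remain after the whitespace collapse above; kept from the tutorial
    s = s.replace("\n", "")
    s = s.strip()
    
    return s

alldocs_df['text_clean'] = alldocs_df["text"].apply(normalize_text)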
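
To see what this cleaning does in practice, the sketch below repeats the same steps on two made-up strings (they are illustrative and not taken from the clinical notes). The second string shows a side effect worth knowing about: the pattern `. ,` has an unescaped dot, so it removes whichever character precedes the space and comma.

```python
import re

def normalize_text(s):
    """Same steps as the notebook's function, repeated so the example is self-contained."""
    s = re.sub(r'\s+', ' ', s).strip()
    s = re.sub(r". ,", "", s)
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    s = s.replace("\n", "")
    return s.strip()

print(normalize_text("Patient  denies\n chest pain.. No fever. . Stable."))
# Patient denies chest pain. No fever. Stable.

print(normalize_text("10 mg , daily"))
# 10 m daily
```

If dropping that extra character is not wanted, escaping the dot (`r"\. ,"`) would restrict the match to a literal period, though that would change the cleaned text and the token counts reported here.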

Next we determine the number of tokens used and check whether each document fits within the service's token limit.

In [4]:
#Need to tokenize this for Azure Open AI, don't plan on splitting as they all fit within token limit
tokenizer = tiktoken.get_encoding("cl100k_base")
alldocs_df['n_tokens'] = alldocs_df["text_clean"].apply(lambda x: len(tokenizer.encode(x)))

print('# too big:',len(alldocs_df[alldocs_df.n_tokens>=8192]))
print('Total Number of Tokens:',sum(alldocs_df['n_tokens']))   


# too big: 0
Total Number of Tokens: 2012345
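
No document reached the limit, so no splitting was needed. Had one been too long, a possible approach (an assumption here, not something this notebook does) would be to cut its token ids into consecutive chunks and embed each chunk separately. `split_token_ids` is a hypothetical helper that works on plain lists of token ids.

```python
def split_token_ids(token_ids, limit):
    """Split a list of token ids into consecutive chunks of at most `limit` tokens."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [token_ids[i:i + limit] for i in range(0, len(token_ids), limit)]

# Stand-in token ids; in the notebook they would come from tokenizer.encode(text)
chunks = split_token_ids(list(range(10)), limit=4)
print([len(c) for c in chunks])  # [4, 4, 2]
```

Each chunk could then be turned back into text with `tokenizer.decode` before being sent to the service. How the per-chunk embeddings are combined afterwards is a separate design decision.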


This code actually calls the service for each document using the Ada v2 embedding model from OpenAI.

In [5]:
#This retrieves the embedding.  Since there is a cost, run the next two statements once and then comment them out.
#The two environment variables described above must be set.
alldocs_df['ada_v2'] = alldocs_df["text_clean"].apply(lambda x : get_embedding(x, engine = 'text-embedding-ada-002')) 
# engine should be set to the deployment name you chose when you deployed the text-embedding-ada-002 (Version 2) model
alldocs_df.to_pickle(os.path.join(AOAI_PATH, 'alldocs_df_aoai.pkl'))

#On later runs, load the saved embeddings from the pkl file to do analysis
alldocs_df = pd.read_pickle(os.path.join(AOAI_PATH, 'alldocs_df_aoai.pkl'))
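
Rather than commenting the paid call in and out by hand, the compute-once-then-reload pattern can be wrapped in a small function. This is a sketch using the standard library's `pickle`; `load_or_compute` is a hypothetical helper, not something the notebook defines.

```python
import os
import pickle

def load_or_compute(path, compute):
    """Load a pickled result from `path`, or call `compute()` once and save it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Here `compute` would be a function that runs the `get_embedding` calls and returns the dataframe, and `path` would point at the pickle under `AOAI_PATH`. Deleting the file forces a fresh (and billed) run.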


In this code, we prepare the data to be used by sentence instead of the entire document.

In [6]:
#Try and sentence tokenize
#sent_tokenize needs NLTK's Punkt tokenizer data to have been downloaded once
from nltk.tokenize import sent_tokenize
##Sentences - we can't do data cleansing until after sentence tokenized
alldocs_df['sentence_tokenized'] = alldocs_df['text_clean'].apply(sent_tokenize) # this is a list of sentences

alldocs_df['sentence_count'] = alldocs_df['sentence_tokenized'].apply(len)
sentence_max_aoai = np.max(alldocs_df['sentence_count'])
print('Max Sentences:', sentence_max_aoai)


#pad a list of sentences with '\n' entries to reach sentence_max; returns a new list
def token_and_pad_sentence(input_sentences, sentence_max):
    pad_spaces = max(sentence_max - len(input_sentences), 0)
    return list(input_sentences) + ['\n'] * pad_spaces

#The original call referenced alldocs_df_expanded and sentence_max, which are not defined in this notebook.
#It is left disabled here: padding is done with zero vectors when the sentence embeddings are built below.
#alldocs_df['sentence_tokenized'] = alldocs_df['sentence_tokenized'].apply(lambda x: token_and_pad_sentence(x, sentence_max_aoai))

Max Sentences: 381


In this code, we make sure the tokens will fit within the service limits.

In [10]:

#Count tokens per sentence; each sentence is sent to the service on its own, so the longest sentence must fit within the limit
tokenizer = tiktoken.get_encoding("cl100k_base")

def get_sentence_tokens(input_sentences):
    #total tokens across all sentences in a document
    return sum(len(tokenizer.encode(sentence)) for sentence in input_sentences)

def get_max_sentence_tokens(input_sentences):
    #token count of the longest sentence in a document (0 if there are no sentences)
    return max((len(tokenizer.encode(sentence)) for sentence in input_sentences), default=0)

alldocs_df['n_sent_tokens'] = alldocs_df["sentence_tokenized"].apply(get_sentence_tokens)
alldocs_df['max_sent_tokens'] = alldocs_df["sentence_tokenized"].apply(get_max_sentence_tokens)

print('# too big:',len(alldocs_df[alldocs_df.max_sent_tokens>=2046]))
print('Total Number of Tokens:',sum(alldocs_df['n_sent_tokens']))   


# too big: 0
Total Number of Tokens: 2012890


In this code, we call the same embedding model but with sentences instead of the entire document.  Note, the service has limits on how often it can be called, so retry logic needed to be implemented.  We tried this for both the Ada and Babbage models.  Note, this generated a very large file (almost 7 GB).

In [None]:
#Now get the sentence embeddings
import time

batch = 0

#The deployment name and the embedding width must match:
#text-embedding-ada-002 returns 1536 values, text-similarity-babbage-001 returns 2048
EMBED_ENGINE = 'text-embedding-ada-002' #'text-similarity-babbage-001'
EMBED_DIM = 1536 #2048

def process_sentence(sentence):
    #call the service, sleeping 60 seconds after each failure and giving up after the sixth one

    done = False

    return_array = None
    cnt = 0

    while not done:
        try:
            return_array = get_embedding(sentence, engine = EMBED_ENGINE)
            done = True
        except Exception as e:
            print(f'Exception {batch} {str(e)}')
            cnt = cnt + 1
            if cnt > 5:
                print('Too many retries')
                done = True
            else:
                print('Sleeping')
                time.sleep(60)
    
    return return_array

def get_padded_embeddings(input_sentences, sentence_max):
    global batch

    #rows beyond the last sentence stay zero, which pads every document to sentence_max rows
    output_array = np.zeros((sentence_max, EMBED_DIM))
    
    batch = batch + 1
    size = len(input_sentences)
    print(f"Running batch {batch}:Size:{size}")

    for idx, sentence in enumerate(input_sentences):
        output_array[idx,:] = process_sentence(sentence)

    return output_array

#df_test = alldocs_df.head(2).copy()
#df_test['ada_v2_sent'] = df_test["sentence_tokenized"].apply(lambda x: get_padded_embeddings(x, sentence_max_aoai))
#print(df_test['ada_v2_sent'])

alldocs_df['ada_v2_sent'] = alldocs_df["sentence_tokenized"].apply(lambda x: get_padded_embeddings(x, sentence_max_aoai))
#For Babbage, switch EMBED_ENGINE and EMBED_DIM to the Babbage values above and use this line instead
#alldocs_df['bab_v1_sent'] = alldocs_df["sentence_tokenized"].apply(lambda x: get_padded_embeddings(x, sentence_max_aoai))
alldocs_df.to_pickle(os.path.join(AOAI_PATH, 'alldocs_df_aoai.pkl'))

#On later runs, comment out the embedding call above (there is a charge) and load from the pkl file to do analysis
alldocs_df = pd.read_pickle(os.path.join(AOAI_PATH, 'alldocs_df_aoai.pkl'))
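
The fixed 60-second sleep in `process_sentence` is simple but can waste time when the service recovers quickly. One common alternative is exponential backoff, sketched below. `with_retries` is a hypothetical helper, and unlike `process_sentence` it re-raises the last error instead of returning `None`, so a failed sentence cannot silently end up in the output.

```python
import time

def with_retries(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `call()`; on failure wait base_delay * 2**attempt seconds and retry.

    Re-raises the last exception once `max_retries` retries are used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

In this notebook, `call` could be something like `lambda: get_embedding(sentence, engine=...)` with the deployment name filled in. The `sleep` parameter exists only so the waiting can be replaced in tests.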


Here we verify all of the data was returned as expected.

In [13]:
#There were a couple of retries.  Make sure the shapes are all correct
#the comparison is done inside the lambda so each shape tuple is compared as a whole
sum(alldocs_df['ada_v2_sent'].apply(lambda x: x.shape != (381,1536)))
#sum(alldocs_df['bab_v1_sent'].apply(lambda x: x.shape != (381,2048)))




0