# Using Text Fields in Predictive Models - An Application of Large Language Models via the *TextEmbeddingFE* Package

In this tutorial, we illustrate how various functions in the *TextEmbeddingFE* Python package can be used for feature extraction from text fields, for interpreting the extracted information, and for including it in predictive models. For more details on the concepts, please see the paper [include link to arxiv or published paper]. Let's start with loading and exploring the data.

## Data

In [1]:
import pandas as pd

df = pd.read_csv('../data/data.csv')
df.head()

Unnamed: 0,diagnoses,age,height,is_female,weight,optime,aki_severity
0,155500. Cardiac conduit complication;010125. P...,18.11,148,1,80.9,112,0
1,091591. Aortic regurgitation;091519. Congenita...,18.23,169,1,56.1,144,1
2,155516. Cardiac conduit failure;090101. Common...,16.86,166,1,61.6,114,0
3,010116. Partial anomalous pulmonary venous con...,16.88,162,1,44.3,109,0
4,155516. Cardiac conduit failure;010133. Left h...,18.12,175,0,70.5,119,0


The data corresponds to >800 pediatric patients undergoing cardiopulmonary bypass (CPB). Besides basic patient information (sex, age/height/weight at the time of operation), we have three additional fields:
- *diagnoses*: One or more standard diagnosis codes (including descriptions) for the patient, separated by ';'.
- *optime*: Length of CPB operation in minutes.
- *aki_severity*: A binary indicator of whether the patient suffered a severe form of post-operative acute kidney injury (AKI).
The data included here is part of a larger collection; for a more detailed context on data collection see [this](https://www.medrxiv.org/content/10.1101/2024.03.18.24304520v1) paper.

Our goal is to predict the postoperative outcome (*aki_severity*) from the remaining columns, which are all available before surgery. Age, height, sex, weight and operation time are all numeric, but the *diagnoses* column is not, and hence cannot be used directly in a predictive model.

## Options for Encoding Text

- *Multivalued Binary Encoding*: Perhaps the most straightforward approach for encoding the *diagnoses* column is what can be described as *multivalued binary encoding* (MBE). We create a binary indicator for each individual diagnostic code, and for each patient we set the indicator to 1 for every code that is present and to 0 for every code that is absent. This is a generalization of one-hot encoding, where multiple 1's can be present in the resulting binary vector for each observation, rather than only one.
- *doc2vec*: [add description of gensim doc2vec, its genesis in word2vec, and its limitations]
- *LLM embeddings*: Modern text-generating LLMs such as GPT-4 from OpenAI are deep neural networks, where the initial layers encode input text into a numeric representation. It is therefore possible to extract an embedding LLM from a text-generating LLM, followed by fine-tuning for embedding tasks. The resulting models tend to better utilize context for inferring the meaning of text and hence have improved semantic mapping in their embeddings.
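
To make the first option concrete, here is a minimal sketch of multivalued binary encoding using the pandas `Series.str.get_dummies` method. The toy codes and descriptions below are invented for illustration and do not come from the dataset; this is also not a function of *TextEmbeddingFE*.

```python
import pandas as pd

# Toy diagnoses column; the codes and descriptions are made up for illustration.
toy_diagnoses = pd.Series([
    "A01. First condition;B02. Second condition",
    "B02. Second condition",
    "A01. First condition;C03. Third condition",
])

# One binary indicator column per distinct diagnosis.
# Unlike one-hot encoding, a row may contain several 1's.
mbe = toy_diagnoses.str.get_dummies(sep=";")
print(mbe)
```

With many distinct codes and few patients, the resulting matrix becomes wide and sparse, which is one motivation for considering the embedding-based options.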

LLM embeddings have a couple of significant properties:
1. They are *high-dimensional*. For instance, OpenAI's *text-embedding-3-large* model returns a vector of length 3072! Directly including this long vector in predictive problems with limited data can lead to problems such as long training time and overfitting.
2. They are *normalized*, i.e., their norm is meaningless and instead only the direction that they point in the high-dimensional space is relevant.

The tools offered in *TextEmbeddingFE* are designed to address the above two properties.

## Token Counting

Before proceeding, we mention a utility function, *count_tokens*, provided in *TextEmbeddingFE* for counting the tokens in a list of strings. It returns both the maximum number of tokens in the list of strings, and the total token count. These two counts are useful for two purposes:
1. Embedding and text-generation models from OpenAI, and likely those from other providers, have maximum limits on token count. For instance, as of this writing, OpenAI's embedding model *text-embedding-3-large* has a maximum input length of 8191 tokens (see [here](https://platform.openai.com/docs/guides/embeddings/embedding-models)). By comparing the maximum number of tokens in our list against these product specifications, we can ensure that no inadvertent truncation of the data or error occurs.
2. We can also use the token count to estimate the costs, since prices are often expressed per token. For OpenAI's pricing details, see [here](https://openai.com/api/pricing).

As an example, let's check the maximum and total token count for the *diagnoses* column in our data. Note that the names of OpenAI's embedding and text-generation models must be specified (the default values are used below), since each may use a different tokenizer:

In [2]:
from TextEmbeddingFE.main import count_tokens
count_tokens(
    text_list = list(df['diagnoses'])
    , openai_embedding_model = 'text-embedding-3-large'
    , openai_textgen_model = 'gpt-4-turbo'
)

(138, 24539)

We see that while the maximum token count in this text column is only 138, the total token count is 24,539, which is higher than the context length of many of OpenAI's models. For instance, the *gpt-4* model as of this writing has a context length of 8,192 tokens (see [here](https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4)).
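
The two uses of the token counts listed earlier can be sketched in a few lines. The input limit is the figure quoted above for *text-embedding-3-large*; the price is a hypothetical placeholder, not OpenAI's actual rate, and should be replaced with the value from the pricing page.

```python
# Values returned by count_tokens for the diagnoses column.
max_tokens, total_tokens = 138, 24539

# Input limit quoted in this tutorial for text-embedding-3-large.
embedding_input_limit = 8191

# HYPOTHETICAL price in USD per one million tokens; look up the real value.
price_per_million_tokens = 0.10

# Purpose 1: confirm that the longest entry fits within the input limit.
fits_limit = max_tokens <= embedding_input_limit

# Purpose 2: estimate the cost of embedding the whole column.
estimated_cost = total_tokens / 1_000_000 * price_per_million_tokens

print(f"Longest entry fits the input limit: {fits_limit}")
print(f"Estimated embedding cost: ${estimated_cost:.4f}")
```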

## Using OpenAI to Embed Text

*TextEmbeddingFE* offers a convenient function, *embed_text*, that calls OpenAI's API function for text embedding (accessed via the *openai* Python package), and returns a numpy matrix. To call *embed_text*, we first must have created an OpenAI API key, which we can use to create a connection:

In [3]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()
api_key = os.getenv('OPENAI_API_KEY')

from openai import OpenAI
client = OpenAI(api_key = api_key)

Note that the above assumes we have a valid OpenAI API key, saved under the environment-variable *OPENAI_API_KEY* to the file named *.env* in the current folder. For more on generating and using OpenAI API keys, see [this post](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key).

We now embed the *diagnoses* column using the following call. 

In [4]:
from TextEmbeddingFE.main import embed_text

my_embedding_matrix = embed_text(
    openai_client = client
    , text_list = list(df['diagnoses'][:5]) # submitting only the first 5 rows for illustration
    , openai_embedding_model = 'text-embedding-3-large'
)
my_embedding_matrix.shape

(5, 3072)

Each row of the returned matrix is a 3,072-long embedding vector that provides a numeric representation of the diagnosis codes/descriptions for that patient. It is also easy to check that the embedding vectors are L2-normalized, and hence only represent a direction in the high-dimensional space, rather than magnitude information:

In [5]:
import numpy as np

np.apply_along_axis(lambda x: sum(x * x), 1, my_embedding_matrix)

array([1.00000006, 1.00000001, 0.99999995, 0.99999983, 1.0000001 ])
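
The same check can be written in vectorized form with `np.linalg.norm`. The sketch below uses a randomly generated stand-in matrix (an assumption, so that it runs without an API call) rather than real embeddings. It also illustrates a consequence of unit norms: pairwise dot products are already cosine similarities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an embedding matrix: 5 random rows, rescaled to unit L2 norm.
raw = rng.normal(size=(5, 3072))
toy_embeddings = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Row-wise L2 norms; all should be numerically equal to 1.
row_norms = np.linalg.norm(toy_embeddings, axis=1)
print(np.allclose(row_norms, 1.0))

# For unit vectors, the matrix of pairwise dot products
# is the matrix of pairwise cosine similarities.
cosine_similarity = toy_embeddings @ toy_embeddings.T
```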

In the rest of this tutorial, we present two workflows for taking advantage of text embeddings returned by OpenAI's LLM. The first is an explanatory workflow, focused on interpreting the embeddings and understanding what risk factors they imply. The second is a predictive workflow that incorporates the embeddings into a predictive model alongside other variables. In both workflows, we take into consideration the normalized property of the embedding vectors.

## Explanatory Workflow

In this workflow, we use a text-generating LLM to interpret the output of the embedding LLM. More specifically, we follow these steps:
1. Cluster observations using the embedding vectors as features.
2. Group the observations according to the embedding-based clusters and assemble a prompt.
3. Submit the prompt to OpenAI to solicit cluster labels/descriptions.

### Clustering using Text Embeddings

Ideally, we would like the clustering algorithm to use a magnitude-insensitive distance metric such as the cosine distance:
$$
d(\mathbf{x}, \mathbf{y}) = 1 - \frac{\mathbf{x}^T \, \mathbf{y}}{(\mathbf{x}^T \mathbf{x})^{1/2} \, (\mathbf{y}^T \mathbf{y})^{1/2}}
$$
[continue to describe why normalization alone doesn't make regular kmeans equivalent to spherical kmeans, mention that in the future we may implement spherical kmeans, also that the sister R package currently has native spherical kmeans support]
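
As a small numerical illustration of the formula above (a sketch on random vectors, not part of *TextEmbeddingFE*), the code below checks three facts: the cosine distance is unchanged when either vector is rescaled; for unit vectors, the squared Euclidean distance equals twice the cosine distance; and the average of unit vectors, which is what regular k-means uses as a centroid, generally does not have unit norm.

```python
import numpy as np

def cosine_distance(x, y):
    """Cosine distance between two vectors, following the formula above."""
    return 1.0 - (x @ y) / (np.sqrt(x @ x) * np.sqrt(y @ y))

rng = np.random.default_rng(1)
x, y = rng.normal(size=8), rng.normal(size=8)

# Rescaling either vector leaves the cosine distance unchanged.
d_xy = cosine_distance(x, y)
d_scaled = cosine_distance(5.0 * x, 0.1 * y)

# For unit vectors u and v: ||u - v||^2 = 2 - 2 u.v = 2 * d(u, v).
u, v = x / np.linalg.norm(x), y / np.linalg.norm(y)
sq_euclidean = np.sum((u - v) ** 2)

# The mean of two distinct unit vectors has norm strictly below 1.
centroid_norm = np.linalg.norm((u + v) / 2)

print(np.isclose(d_xy, d_scaled), np.isclose(sq_euclidean, 2 * d_xy), centroid_norm)
```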

For illustration, we load a truncated embedding file for the *diagnoses* column, where the first 100 elements of the 3072-long embedding vector for each patient were extracted:

In [8]:
dfEmbeddings = pd.read_csv('../data/embeddings.csv')
dfEmbeddings.head()

Unnamed: 0,project_id,operation_no,X1,X2,X3,X4,X5,X6,X7,X8,...,X91,X92,X93,X94,X95,X96,X97,X98,X99,X100
0,PR-00000001,1,-0.004914,0.065536,0.000462,-0.002728,0.007441,0.03403,-0.065837,0.027918,...,-0.022181,0.049798,0.003935,0.039526,0.032888,-0.044452,0.028218,0.003186,-0.058388,-0.022827
1,PR-00000002,2,-0.003297,0.067997,0.001257,0.000272,0.012815,0.031194,-0.037501,0.004168,...,-0.022067,0.032728,-0.007845,0.025801,0.051881,-0.030496,0.003235,0.010762,-0.061985,-0.023136
2,PR-00000003,3,-0.024688,-0.002821,0.005994,0.030009,0.01428,0.039548,-0.047434,0.019382,...,-0.040968,0.022296,0.009749,0.047869,0.081995,-0.024761,0.023021,0.020992,-0.043491,0.00436
3,PR-00000004,4,0.019465,0.059981,-0.004884,0.010679,0.000434,0.057856,-0.02159,0.008793,...,-0.022216,0.000365,0.041841,0.014515,0.063999,-0.021314,0.027341,-0.011377,-0.060039,-0.014886
4,PR-00000005,5,0.00631,0.052803,0.001488,-0.010779,0.011241,0.012157,-0.034958,0.025509,...,-0.017798,0.046048,-0.010954,0.029525,0.041459,-0.018786,0.014388,0.013798,-0.066921,-0.013504


We extract the embedding matrix from the above dataframe:

In [13]:
from TextEmbeddingFE.main import cluster_embeddings

my_embedding_colnames = ["X" + str(n+1) for n in range(100)]
my_embedding_matrix = dfEmbeddings[my_embedding_colnames].to_numpy()

We can now call the *cluster_embeddings* function. In addition to the embedding matrix, we should also specify two hyperparameters:
- *n_clusters*: Number of clusters to use in the k-means algorithm.
- *n_init*: Number of initializations of k-means to try. The best solution - the one minimizing total within-cluster distances - is returned.

[explain how these two hyperparameters can be selected]

In [19]:
my_clusters = cluster_embeddings(
    X = my_embedding_matrix
    , n_clusters = 10
    , n_init = 10
)
np.bincount(my_clusters)

array([ 54, 141,  84,  85,  81,  46,  34, 220,  67, 151], dtype=int64)
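
The internals of *cluster_embeddings* are not shown in this tutorial. As a rough illustration of what the two hyperparameters control, the sketch below uses scikit-learn's `KMeans` as a stand-in (an assumption, not necessarily what the package does) on synthetic unit vectors. It also compares a few values of `n_clusters` by silhouette score, which is one common heuristic and not necessarily the authors' recommendation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for an embedding matrix:
# 3 well-separated groups of 40 points each, rescaled to unit L2 norm.
centers = np.eye(3, 20)
X = np.vstack([c + 0.05 * rng.normal(size=(40, 20)) for c in centers])
X = X / np.linalg.norm(X, axis=1, keepdims=True)

# Fit k-means for several cluster counts; n_init restarts are tried for each,
# and scikit-learn keeps the run with the lowest within-cluster sum of squares.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="cosine")

# Pick the cluster count with the highest average silhouette score.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real embeddings the silhouette curve is often much flatter than on this toy data, so the chosen value is best treated as a starting point.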

### Assembling a Prompt