# Using Text Fields in Predictive Models - An Application of Large Language Models via the *TextEmbeddingFE* Package

In this tutorial, we illustrate how various functions in the *TextEmbeddingFE* Python package can be used for feature extraction from text fields, for interpreting the extracted information, and for including it in predictive models. For more details on the concepts, please see the paper [include link to arxiv or published paper]. Let's start with loading and exploring the data.

## Data

In [1]:
import pandas as pd

df = pd.read_csv('../data/raw.csv')
df.head()

Unnamed: 0,is_female,age,height,weight,optime,diagnoses,aki_severity
0,1,18.11,148,80.9,112,155500. Cardiac conduit complication;010125. P...,0
1,1,18.23,169,56.1,144,091591. Aortic regurgitation;091519. Congenita...,1
2,1,16.86,166,61.6,114,155516. Cardiac conduit failure;090101. Common...,0
3,1,16.88,162,44.3,109,010116. Partial anomalous pulmonary venous con...,0
4,0,18.12,175,70.5,119,155516. Cardiac conduit failure;010133. Left h...,0


The data corresponds to >800 pediatric patients undergoing cardiopulmonary bypass (CPB). Besides basic patient information (sex, age/height/weight at the time of operation), we have three additional fields:
- *diagnoses*: One or more standard diagnosis codes (including descriptions) for the patient, separated by ';'.
- *optime*: Length of CPB operation in minutes.
- *aki_severity*: A binary indicator of whether the patient suffered a severe form of post-operative acute kidney injury (AKI).

The data included here is part of a larger collection; for more context on the data collection, see [this](https://www.medrxiv.org/content/10.1101/2024.03.18.24304520v1) paper.

Our goal is to predict the postoperative outcome (*aki_severity*) from the remaining columns, which are all available before surgery. Age, height, gender, weight and operation time are all numeric, but the *diagnoses* column is not, and hence it cannot be used directly in a predictive model.

## Options for Encoding Text

- *Multivalued Binary Encoding*: Perhaps the most straightforward approach for encoding the *diagnoses* column is what can be described as *multivalued binary encoding* (MBE). We create a binary indicator for each individual diagnostic code, and for each patient we set the indicator to 1 for every code that is present and to 0 for every code that is absent. This is a generalization of one-hot encoding, where multiple 1's can be present in the resulting binary vector for each observation, rather than only one.
- *doc2vec*: [add description of gensim doc2vec, its genesis in word2vec, and its limitations]
- *LLM embeddings*: Modern text-generating LLMs such as GPT-4 from OpenAI are deep neural networks, where the initial layers encode input text into a numeric representation. It is therefore possible to extract an embedding LLM from a text-generating LLM, followed by fine-tuning for embedding tasks. The resulting models tend to better utilize context for inferring the meaning of text and hence have improved semantic mapping in their embeddings.
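
As a point of reference for the first option, here is a minimal sketch of multivalued binary encoding for ';'-separated strings. The function name `multivalued_binary_encode` and the toy inputs are illustrative assumptions and are not part of *TextEmbeddingFE*.

```python
def multivalued_binary_encode(text_list, sep=";"):
    """Return (codes, rows): the sorted unique items and one 0/1 row per input string."""
    # Split each string on the separator and strip surrounding whitespace.
    items_per_row = [
        {item.strip() for item in text.split(sep) if item.strip()}
        for text in text_list
    ]
    # The vocabulary is the sorted union of all items seen in any row.
    codes = sorted(set().union(*items_per_row))
    # Each row gets a 1 for every item it contains and a 0 otherwise.
    rows = [[int(code in items) for code in codes] for items in items_per_row]
    return codes, rows


# Toy inputs modeled on the format of the diagnoses column (hypothetical rows).
toy = [
    "010109. Hypoplastic left heart syndrome",
    "010109. Hypoplastic left heart syndrome;060291. Mitral regurgitation",
    "092901. Coarctation of aorta",
]
codes, rows = multivalued_binary_encode(toy)
for row in rows:
    print(row)
```

Note that the number of columns grows with the number of distinct codes, and that two clinically related codes are treated as entirely unrelated features.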

LLM embeddings have a couple of significant properties:
1. They are *high-dimensional*. For instance, OpenAI's *text-embedding-3-large* model returns a vector of length 3072! Directly including this long vector in predictive problems with limited data can lead to problems such as long training time and overfitting.
2. They are *normalized*, i.e., their norm is meaningless and instead only the direction that they point in the high-dimensional space is relevant.

The tools offered in *TextEmbeddingFE* are designed to address the above two properties.

## Token Counting

Before proceeding, we mention a utility function, *count_tokens*, provided in *TextEmbeddingFE* for counting the tokens in a list of strings. It returns both the maximum number of tokens in any single string and the total token count across the list. These are useful for two purposes:
1. Both embedding and text-generation models from OpenAI - and likely in other LLMs - have maximum limits on token count. For instance, as of this writing, OpenAI's embedding model *text-embedding-3-large* has a maximum input length of 8191 tokens (see [here](https://platform.openai.com/docs/guides/embeddings/embedding-models)). By comparing the maximum number of tokens in our list against these product specifications, we can ensure that no inadvertent truncation of the data and/or error would occur.
2. We can also use the token count to estimate the costs, since prices are often expressed per token. For OpenAI's pricing details, see [here](https://openai.com/api/pricing).

As an example, let's check the maximum and total token count for the *diagnoses* column in our data. Note that the names of OpenAI's embedding and text-generation models must be specified - with default values used below - since each may use a different tokenizer:

In [2]:
from TextEmbeddingFE.main import count_tokens
count_tokens(
    text_list = list(df['diagnoses'])
    , openai_embedding_model = 'text-embedding-3-large'
    , openai_textgen_model = 'gpt-4-turbo'
)

(138, 24539)

We see that while the maximum token count in this text column is only 138, the total token count is 24,539 which is higher than the context length for many of OpenAI's models. For instance, the *gpt-4* model as of this writing has a context length of 8,192 (see [here](https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4)).
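
The two uses of the token counts can be sketched as follows. The helper names `fits_context` and `estimate_cost_usd` are hypothetical, and `PLACEHOLDER_PRICE` is a made-up number rather than an actual OpenAI price; substitute the current figure from the pricing page.

```python
def fits_context(n_tokens, model_limit):
    """True if a text with n_tokens tokens fits within the model's token limit."""
    return n_tokens <= model_limit


def estimate_cost_usd(total_tokens, usd_per_million_tokens):
    """Estimate cost from a total token count and a per-million-token price."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Values returned by count_tokens for the diagnoses column.
max_tokens, total_tokens = 138, 24539

# Hypothetical price in USD per 1M tokens; NOT an actual OpenAI price.
PLACEHOLDER_PRICE = 1.0

# 8191 is the embedding input limit quoted earlier in this section.
print(fits_context(max_tokens, 8191))
# 8192 is the gpt-4 context length quoted above.
print(fits_context(total_tokens, 8192))
print(estimate_cost_usd(total_tokens, PLACEHOLDER_PRICE))
```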

## Using OpenAI to Embed Text

*TextEmbeddingFE* offers a convenient function, *embed_text*, that calls OpenAI's API function for text embedding (accessed via the *openai* Python package), and returns a numpy matrix. To call *embed_text*, we first must have created an OpenAI API key, which we can use to create a connection:

In [3]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()
api_key = os.getenv('OPENAI_API_KEY')

from openai import OpenAI
client = OpenAI(api_key = api_key)

Note that the above assumes we have a valid OpenAI API key, stored as the environment variable *OPENAI_API_KEY* in a file named *.env* in the current folder. For more on generating and using OpenAI API keys, see [this post](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key).

We now embed the *diagnoses* column using the following call. 

In [4]:
from TextEmbeddingFE.main import embed_text

my_embedding_matrix = embed_text(
    openai_client = client
    , text_list = list(df['diagnoses'][:5]) # submitting only the first 5 rows for illustration
    , openai_embedding_model = 'text-embedding-3-large'
)
my_embedding_matrix.shape

(5, 3072)

Each row of the returned matrix is a 3,072-long embedding vector that provides a numeric representation of the diagnosis codes/descriptions for that patient. It is also easy to check that the embedding vectors are L2-normalized (the sum of squares of each row is 1, up to rounding error), and hence only represent a direction in the high-dimensional space, rather than magnitude information:

In [5]:
import numpy as np

np.apply_along_axis(lambda x: sum(x * x), 1, my_embedding_matrix)

array([1.00000006, 1.00000001, 0.99999995, 0.99999983, 1.0000001 ])

In the rest of this tutorial, we present two workflows for taking advantage of text embeddings returned by OpenAI's LLM. The first is an explanatory workflow, focused on interpreting the embeddings and understanding what risk factors they imply. The second is a predictive workflow that incorporates the embeddings into a predictive model that also includes other variables. In both workflows, we take into consideration the normalized property of the embedding vectors.

## Explanatory Workflow

In this workflow, we use a text-generating LLM to interpret the output of the embedding LLM. More specifically, we follow these steps:
1. Cluster observations using the embedding vectors as features.
2. Group the observations according to the embedding-based clusters and assemble a prompt.
3. Submit the prompt to OpenAI to solicit cluster labels/descriptions.

### Clustering using Text Embeddings

Ideally, we would like the clustering algorithm to use a magnitude-insensitive distance metric such as the cosine distance:
$$
d(\mathbf{x}, \mathbf{y}) = 1 - \frac{\mathbf{x}^T \, \mathbf{y}}{(\mathbf{x}^T \mathbf{x})^{1/2} \, (\mathbf{y}^T \mathbf{y})^{1/2}}
$$
[continue to describe why normalization alone doesn't make regular kmeans equivalent to spherical kmeans, mention that in the future we may implement spherical kmeans, also that the sister R package currently has native spherical kmeans support]
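
Two facts about unit-length vectors are relevant here. First, for unit-length $\mathbf{x}$ and $\mathbf{y}$, the squared Euclidean distance equals twice the cosine distance, $\|\mathbf{x} - \mathbf{y}\|^2 = 2 \, d(\mathbf{x}, \mathbf{y})$. Second, the mean of several distinct unit-length vectors has a norm below 1, so the centroids computed by ordinary k-means generally do not lie on the unit sphere. The sketch below checks both numerically on random toy data (not the tutorial's embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each row


def cosine_distance(x, y):
    """Cosine distance as defined in the formula above."""
    return 1.0 - (x @ y) / (np.sqrt(x @ x) * np.sqrt(y @ y))


# Fact 1: squared Euclidean distance is twice the cosine distance for unit vectors.
x, y = X[0], X[1]
sq_euclid = np.sum((x - y) ** 2)
identity_holds = bool(np.isclose(sq_euclid, 2.0 * cosine_distance(x, y)))

# Fact 2: the centroid of distinct unit vectors is shorter than unit length.
centroid_norm = float(np.linalg.norm(X.mean(axis=0)))

print(identity_holds, centroid_norm)
```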

For illustration, we load a truncated embedding file for the *diagnoses* column, where the first 100 elements of the 3072-long embedding vector for each patient were extracted:

In [4]:
dfEmbeddings = pd.read_csv('../data/embeddings.csv')
dfEmbeddings.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X91,X92,X93,X94,X95,X96,X97,X98,X99,X100
0,-0.004914,0.065536,0.000462,-0.002728,0.007441,0.03403,-0.065837,0.027918,0.009296,0.029449,...,-0.022181,0.049798,0.003935,0.039526,0.032888,-0.044452,0.028218,0.003186,-0.058388,-0.022827
1,0.019465,0.059981,-0.004884,0.010679,0.000434,0.057856,-0.02159,0.008793,0.009179,0.028753,...,-0.022216,0.000365,0.041841,0.014515,0.063999,-0.021314,0.027341,-0.011377,-0.060039,-0.014886
2,0.00631,0.052803,0.001488,-0.010779,0.011241,0.012157,-0.034958,0.025509,-0.003456,-0.012611,...,-0.017798,0.046048,-0.010954,0.029525,0.041459,-0.018786,0.014388,0.013798,-0.066921,-0.013504
3,-0.024605,0.027766,0.000341,0.025616,0.003682,0.025818,-0.011927,0.065287,0.015571,0.043525,...,-0.017621,0.022542,0.02247,0.042399,0.021228,-0.050567,0.058822,-0.009142,-0.051751,-0.005033
4,0.013725,0.064173,-0.00158,0.010255,0.013958,0.01255,-0.011009,0.009173,-0.015934,0.016448,...,-0.008481,0.048487,0.014635,0.02734,0.041392,-0.035852,0.03912,0.001916,-0.061403,-0.00901


We extract the embedding matrix from the above dataframe:

In [5]:
from TextEmbeddingFE.main import cluster_embeddings

my_embedding_colnames = ["X" + str(n+1) for n in range(100)]
my_embedding_matrix = dfEmbeddings[my_embedding_colnames].to_numpy()

We can now call the `cluster_embeddings` function. In addition to the embedding matrix, we should also specify two hyperparameters:
- `n_clusters`: Number of clusters to use in the k-means algorithm.
- `n_init`: Number of initializations of k-means to try. The best solution - the one minimizing total within-cluster distances - is returned.

[explain how these two hyperparameters can be selected]

In [8]:
my_clusters = cluster_embeddings(
    X = my_embedding_matrix
    , n_clusters = 10
    , n_init = 10
)
np.bincount(my_clusters)

array([102, 141, 127,  38,  65,  75,  77,  36,  58, 111], dtype=int64)
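
To make the role of `n_init` concrete, the following is a minimal NumPy sketch of k-means with random restarts, keeping the run with the smallest total within-cluster squared distance. It is an illustration on toy data under simplifying assumptions (random initialization, a fixed number of iterations) and is not the implementation behind `cluster_embeddings`; the names `kmeans_once` and `kmeans_best_of` are hypothetical.

```python
import numpy as np


def kmeans_once(X, n_clusters, rng, n_iter=50):
    """One run of Lloyd's algorithm starting from randomly chosen data points."""
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    # Final assignment and total within-cluster squared distance.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return labels, inertia


def kmeans_best_of(X, n_clusters, n_init, seed=0):
    """Run k-means n_init times and return the run with the smallest inertia."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, n_clusters, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[1])


# Toy data: three well-separated blobs in two dimensions.
toy_rng = np.random.default_rng(1)
X_toy = np.vstack(
    [toy_rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in (0.0, 3.0, 6.0)]
)
labels, inertia = kmeans_best_of(X_toy, n_clusters=3, n_init=10)
print(np.bincount(labels), float(inertia))
```

A larger `n_init` costs more compute but reduces the chance of returning a poor local optimum.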

### Assembling a Prompt
To assemble a prompt for soliciting cluster descriptions, we use the function `generate_prompt`. Required fields include:
- `text_list`: The text column whose embeddings were used for clustering the observations.
- `cluster_labels`: The vector of cluster labels - in consecutive integers starting at 0 - corresponding to the same observations in *text_list*.
- `prompt_observations`: What the observations should be called, in plural form.
- `prompt_texts`: What the text field values represent, also in plural form.

The function returns the total token count for the prompt, followed by the prompt text. Below is an example call for our problem:

In [9]:
from TextEmbeddingFE.main import generate_prompt
token_count, my_prompt = generate_prompt(
    text_list = list(df['diagnoses'])
    , cluster_labels = my_clusters
    , prompt_observations = 'patients'
    , prompt_texts = 'diagnoses'
)
print('token count:', token_count)

token count: 25196


Let's examine the prompt:

In [10]:
print(my_prompt)

The following is a list of 830 patients. Text lines represent diagnoses. Patients have been grouped into 10 groups, according to their diagnoses. Please suggest group labels that are representative of their members, and also distinct from each other:

=====

Group 1:

010109. Hypoplastic left heart syndrome
010109. Hypoplastic left heart syndrome
010109. Hypoplastic left heart syndrome;060291. Mitral regurgitation
060191. Tricuspid regurgitation; 060103. Dysplasia of tricuspid valve; 010109. Hypoplastic left heart syndrome
010109. Hypoplastic left heart syndrome
070901. LV outflow tract obstruction;092911. Aortic arch hypoplasia;070900. Subaortic stenosis
101369. Acquired left pulmonary arterial stenosis;010118. Double outlet right ventricle with subpulmonary ventricular septal defect, transposition type;092901. Coarctation of aorta
071103. Trabecular muscular ventricular septal defect (VSD) apical;092911. Aortic arch hypoplasia;092901. Coarctation of aorta;091522. Bicuspid aortic valv

It contains a 'preamble' paragraph, followed by the list of patients, grouped into clusters, and for each patient the corresponding text field is printed. It is recommended that users explore the auto-generated preamble, and if they wish to edit it, override the default value for the `preamble` argument, perhaps via fine-tuning the auto-generated one. Note that, in this case, providing values for `prompt_observations` and `prompt_texts` is not necessary:

In [14]:
my_better_preamble = (
    "The following represents a group of pediatric patients undergoing cardiopulmonary bypass."
    " Each row contains one or more surgical procedures performed on the patient during bypass, separated by ';'."
    " Patients are grouped into 10 groups according to the similarity of their surgical procedures."
    " Please suggest group labels that are representative of their members, and also distinct from each other:"
)
_, my_better_prompt = generate_prompt(
    text_list = list(df['diagnoses'])
    , cluster_labels = my_clusters
    , preamble = my_better_preamble
)

In [15]:
print(my_better_prompt)

The following represents a group of pediatric patients undergoing cardiopulmonary bypass. Each row contains one or more surgical procedures performed on the patient during bypass, separated by ';'. Patients are grouped into 10 groups according to the similarity of their surgical procedures. Please suggest group labels that are representative of their members, and also distinct from each other:

=====

Group 1:

010109. Hypoplastic left heart syndrome
010109. Hypoplastic left heart syndrome
010109. Hypoplastic left heart syndrome;060291. Mitral regurgitation
060191. Tricuspid regurgitation; 060103. Dysplasia of tricuspid valve; 010109. Hypoplastic left heart syndrome
010109. Hypoplastic left heart syndrome
070901. LV outflow tract obstruction;092911. Aortic arch hypoplasia;070900. Subaortic stenosis
101369. Acquired left pulmonary arterial stenosis;010118. Double outlet right ventricle with subpulmonary ventricular septal defect, transposition type;092901. Coarctation of aorta
071103. T
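
The layout seen in the printed prompts (a preamble, a separator line, then the texts listed under one heading per group) can be sketched as follows. This is an illustrative reconstruction from the printed output, not the source of `generate_prompt`; the function name `assemble_prompt` and the assumption that label 0 is displayed as "Group 1" are ours.

```python
def assemble_prompt(text_list, cluster_labels, preamble, separator="====="):
    """Group texts by cluster label and list them under 'Group k:' headings."""
    groups = {}
    for text, label in zip(text_list, cluster_labels):
        groups.setdefault(int(label), []).append(text)

    parts = [preamble, separator]
    for label in sorted(groups):
        # Assumption: labels start at 0 but groups are displayed starting at 1.
        parts.append(f"Group {label + 1}:")
        parts.append("\n".join(groups[label]))
    return "\n\n".join(parts)


# Toy inputs (hypothetical rows and labels).
toy_texts = [
    "010109. Hypoplastic left heart syndrome",
    "155516. Cardiac conduit failure",
    "010109. Hypoplastic left heart syndrome;060291. Mitral regurgitation",
]
toy_labels = [0, 1, 0]
print(assemble_prompt(toy_texts, toy_labels, "Please suggest group labels:"))
```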

We are now ready to submit the above prompt to OpenAI and solicit cluster labels. Since our total token count of ~25,000 exceeds the context length for `gpt-4` (8,192), we instead use `gpt-4-turbo`, which has a much longer context of 128,000 tokens. As with the text-embedding API call, you can use the total token count to estimate the cost (which is about 25 cents for the above prompt as of this writing).

In [17]:
from TextEmbeddingFE.main import interpret_clusters

my_interpretation = interpret_clusters(
    openai_client = client
    , prompt = my_better_prompt
    , openai_textgen_model = 'gpt-4-turbo'
)

Let's examine the output of `gpt-4-turbo`. A logical next step would be to share this summary with domain experts - e.g., cardiac surgeons in this case - and ask them to validate the sensibility and plausibility of the taxonomy:

In [18]:
print(my_interpretation)

Given the substantial dataset provided and the variety of surgeries and cardiac conditions each group encapsulates, here are succinct and descriptive group labels reflective of the similarities in surgical interventions or conditions among the patients within each group:

### Group Labels:

1. **Complex Congenital Heart Disorders**: Reflects the presence of multiple complicated congenital disorders including frequently observed conditions like Hypoplastic Left Heart Syndrome and variations of Ventricular Septal Defects compounded by other structural anomalies.

2. **Conduit Failure and Associated Repairs**: Captures the recurrent themes of cardiac conduit failures, associated complications, and the diverse surgical repairs undertaken which are significant in the surgeries listed for this group.

3. **Central and Perimembranous VSD Focus**: A clear focus on surgeries revolving around Perimembranous Ventricular Septal Defects, emphasizing either the repair of these defects alone or in co

(consider adding the concept of regressing the outcome variable on the cluster categorical variable and reporting p-values, etc., similar to the paper.)
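
A simple descriptive first look at how the clusters relate to the outcome is to tabulate cluster sizes and the outcome rate within each cluster. The sketch below does this with NumPy on toy arrays; the function name `outcome_rate_by_cluster` is hypothetical, and the toy values are not results from the tutorial's data.

```python
import numpy as np


def outcome_rate_by_cluster(cluster_labels, y):
    """Return (sizes, rates): cluster sizes and the mean of a 0/1 outcome per cluster."""
    cluster_labels = np.asarray(cluster_labels)
    y = np.asarray(y, dtype=float)
    n_clusters = int(cluster_labels.max()) + 1
    sizes = np.bincount(cluster_labels, minlength=n_clusters)
    events = np.bincount(cluster_labels, weights=y, minlength=n_clusters)
    # Empty clusters get NaN instead of triggering a division by zero.
    rates = np.divide(
        events, sizes, out=np.full(n_clusters, np.nan), where=sizes > 0
    )
    return sizes, rates


# Toy example: six observations in three clusters.
toy_clusters = [0, 0, 1, 1, 1, 2]
toy_outcome = [1, 0, 0, 0, 1, 1]
sizes, rates = outcome_rate_by_cluster(toy_clusters, toy_outcome)
print(sizes, rates)
```

With the tutorial's objects, the analogous call would take `my_clusters` and `df['aki_severity']` as inputs.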

## Predictive Workflow

### Options for Using Text Embeddings

Let's revisit the original, predictive problem: Our goal is to predict a binary outcome - severe postoperative AKI. Besides the `diagnoses` text field, which we seek to transform to numeric via embedding, we also have a set of *baseline* variables: patient's gender/age/height/weight, and length of operation. As for including the embedding output, we can think of a few options:
1. Including all 3072 elements of the embedding vectors in a model alongside the baseline variables is unlikely to be an optimal strategy, as the embedding features may dominate the model and lead to overfitting, given the limited number of observations (830).
2. We can instead apply clustering to the embedding vector, as done in the explanatory path, and include the resulting cluster labels as a categorical feature in the model (which are eventually transformed into a series of binary indicators via one-hot encoding).
3. Alternatively, we can use the embedding vectors to predict the outcome variable, and include those predictions as a synthetic feature - and alongside the baseline variables - in our final model. This is akin to the idea of stacking in ML (add ref), though here we are not technically stacking a collection of model predictions, but rather a single model prediction is being stacked on top of a collection of baseline features.

The last approach is what we focus on next. For more on comparing the above-mentioned approaches, see our paper (add link later). Importantly, in stacking we must use cross-validation to generate out-of-sample predictions within the training data; otherwise, we could inject a good amount of overfitting into the final model. The CV machinery and other nuances are handled in the `FeatureExtractor_Classifier` class in our package.

The class constructor is a thin wrapper around the `KNeighborsClassifier` class constructor from the `scikit-learn` package. While we can pass any valid arguments to the underlying class, below we only specify the number of neighbors: (we need to explain why KNN is a good choice, i.e., a flexible alternative to the rigid clusters)

In [21]:
from TextEmbeddingFE.main import FeatureExtractor_Classifier
my_fe = FeatureExtractor_Classifier(n_neighbors = 50)

We will now illustrate that adding a feature extracted from text embeddings via the `FeatureExtractor_Classifier` class to the baseline variables improves the discriminative performance of the model predicting severe AKI. The process illustrates our recommended approach for incorporating text fields in predictive models.
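
Discriminative performance for a binary outcome is often summarized by the area under the ROC curve (AUC), i.e., the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. Using AUC here is our assumption about the metric. Below is a minimal NumPy sketch of that definition, with the hypothetical function name `auc` and toy inputs.

```python
import numpy as np


def auc(y_true, scores):
    """Rank-based AUC: P(score of a positive > score of a negative), ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise comparison uses memory proportional to the number of positive/negative pairs, which is fine for a dataset of this size.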

### Data Preparation

Before proceeding to creating a synthetic risk score from embeddings and doing performance comparison, we do two things:
1. Extract matrix of baseline variables, matrix of text embeddings, and vector of response variable;
2. Split all three of the above arrays into train and test sets by allocating a random 30% of observations to the test set.

In [32]:
X_embedding = dfEmbeddings[my_embedding_colnames].to_numpy()
X_baseline = df.loc[:, ['is_female', 'age', 'height', 'weight', 'optime']].to_numpy()
y = df['aki_severity'].to_numpy(dtype = 'int')

from sklearn.model_selection import train_test_split
X_embedding_train, X_embedding_test, X_baseline_train, X_baseline_test, y_train, y_test = train_test_split(
    X_embedding, X_baseline, y
    , test_size = 0.3 # hold out a random 30% of observations as the test set
)

### Creating Embedding-Based Risk Score

We now train our `FeatureExtractor_Classifier` model on the training set. The parameter `cv` specifies the number of CV folds used inside the `fit` function.

In [35]:
my_fe.fit(X = X_embedding_train, y = y_train, cv = 10)

<TextEmbeddingFE.main.FeatureExtractor_Classifier at 0x1d25474e710>

To obtain the risk score for the training set, we call the `predict_proba` method, without passing any argument to it:

In [36]:
z_train = my_fe.predict_proba()

To get the risk score for the test set, we pass the feature (i.e., embedding) matrix for the test set:

In [38]:
z_test = my_fe.predict_proba(X = X_embedding_test)

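It is important to note that passing `X_embedding_train` to `predict_proba` is not a valid way to obtain `z_train`. Calling `predict_proba` without arguments returns the cross-validated (out-of-sample) scores generated during `fit`, whereas passing `X_embedding_train` treats the training rows as if they were new data. The two results are not the same, as can be seen below: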

In [39]:
(
    my_fe.predict_proba()[:5]
    , my_fe.predict_proba(X = X_embedding_train)[:5]
)

(array([0.28, 0.26, 0.2 , 0.28, 0.2 ]),
 array([0.254, 0.238, 0.226, 0.26 , 0.182]))

(provide further explanation of the mechanics of in-sample vs. out-of-sample prediction)
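
The difference between in-sample and out-of-sample scores can be illustrated with a small, self-contained NumPy sketch of k-nearest-neighbor scoring. This is a simplified illustration on synthetic data and not the implementation inside `FeatureExtractor_Classifier`; the function names `knn_proba` and `out_of_fold_proba` are hypothetical. We use a single neighbor to make the leakage obvious: scored in-sample, every training row is its own nearest neighbor, so its score simply reproduces its own label.

```python
import numpy as np


def knn_proba(X_fit, y_fit, X_new, n_neighbors):
    """Fraction of positive labels among the nearest neighbors of each new row."""
    d2 = ((X_new[:, None, :] - X_fit[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :n_neighbors]
    return y_fit[nearest].mean(axis=1)


def out_of_fold_proba(X, y, n_neighbors, cv, seed=0):
    """Score each row using only the rows that belong to the other folds."""
    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(len(X)) % cv  # random, equally sized folds
    z = np.empty(len(X))
    for fold in range(cv):
        held_out = fold_of_row == fold
        z[held_out] = knn_proba(
            X[~held_out], y[~held_out], X[held_out], n_neighbors
        )
    return z


# Synthetic data: a noisy binary outcome driven by the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + rng.normal(size=200) > 0).astype(int)

z_out = out_of_fold_proba(X_toy, y_toy, n_neighbors=1, cv=10)
z_in = knn_proba(X_toy, y_toy, X_toy, n_neighbors=1)

# In-sample scores echo the labels; out-of-fold scores do not.
print(np.mean(z_in == y_toy), np.mean(z_out == y_toy))
```

A downstream model trained on the in-sample scores would learn to trust them far more than is warranted on new data, which is why the out-of-fold scores are the ones to use for the training set.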