# Extracting a Single Feature Column from a Matrix of Embeddings: The `Compress` Module in `TabuLLM`

## Overview

As we saw in previous tutorials, modern text embedding LLMs produce high-dimensional vectors. For example, OpenAI's `text-embedding-3-large` outputs embedding vectors of length 3072. We also mentioned that clustering can be used as a dimensionality-reduction approach, where the cluster labels are treated as a categorical variable.

There is, however, an even more compact way of distilling the information captured in the embeddings into a single variable: train a k-nearest-neighbor (KNN) classifier (or regressor, depending on the type of outcome variable) on the embedding vectors, wrapped in cross-fit, and use the predictions of this cross-fit KNN model as a single feature in downstream predictive models. With cross-fit, each row's prediction comes from a model that was not trained on that row, so the in-sample predictions behave much like out-of-sample predictions, which reduces the risk of overfitting.
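
The idea can be sketched with plain scikit-learn. This is a minimal illustration on synthetic data, not the `TabuLLM` implementation: the data, the fold count, and `n_neighbors = 25` are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for an embedding matrix and a binary outcome.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Cross-fit: each row's probability comes from a KNN model
# trained on the other folds, never on the row itself.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
knn = KNeighborsClassifier(n_neighbors=25)
proba = cross_val_predict(knn, X_demo, y_demo, cv=cv, method="predict_proba")

# Keep the positive-class probability as a single feature column.
score = proba[:, 1].reshape(-1, 1)
print(score.shape)  # (300, 1)
```

The 20 input columns are reduced to one column that can be placed alongside other features in a downstream model.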

## The `CompressClassifier` and `CompressRegressor` Transformers

These classes implement the scikit-learn transformer interface and can therefore be included in composite pipelines as a feature-extraction step. Let's examine the class constructor for `CompressClassifier`:

```
CompressClassifier(nx=None, ncv=5, logit=True, laplace=True, **kwargs)
```

Below is an explanation of the constructor arguments:
- `nx`: This determines how many elements of the embedding vector, starting at index 0, to include in the compression algorithm. The default value is `None`, which means the entire length of the embedding vectors will be used.
- `ncv`: This determines the cross-fit strategy. If an integer, it sets the number of folds. If an object of class `sklearn.model_selection.KFold`, it will be used directly in the cross-fit algorithm.
- `logit`: This boolean decides whether the logit of predicted class probabilities should be returned as the extracted feature or not. Default is `True`.
- `laplace`: This boolean determines whether Laplace smoothing should be applied to the predicted probabilities of the KNN model. Default is `True`.
- `**kwargs`: Keyword arguments passed to the class constructor for `sklearn.neighbors.KNeighborsClassifier`. The most important one is `n_neighbors`, which sets the number of nearest neighbors used to calculate the predicted class probability for each prediction.

## Example

Let's revisit the AKI problem, and use `CompressClassifier` to generate a risk score from our text column. We begin by using a small LLM from Hugging Face to generate the text embeddings.

In [1]:
import pandas as pd
from TabuLLM.embed import TextColumnTransformer
df = pd.read_csv('../data/raw.csv')
embeddings = TextColumnTransformer(
    type = 'st'
    , embedding_model_st = 'sentence-transformers/all-MiniLM-L6-v2'
).fit_transform(df.loc[:, ['diagnoses']])
print(f'Shape of embeddings: {embeddings.shape}')

Shape of embeddings: (830, 384)


Next, we create and train an instance of `CompressClassifier` on the embedding matrix and outcome vector:

In [7]:
from TabuLLM.compress import CompressClassifier
risk_score = CompressClassifier(
    nx = 100
    , ncv = 5
    , logit = True
    , laplace = True
    , n_neighbors = 50
).fit_transform(embeddings, df['aki_severity'])
print(f'Shape of risk_score: {risk_score.shape}')

Shape of risk_score: (830, 1)


We see that the risk score is a single column of numbers that includes both positive and negative values:

In [10]:
# range of risk score
print(f'Min/Max of risk score: {risk_score.min()} / {risk_score.max()}')

Min/Max of risk score: -3.2188758248682006 / 0.47000362924573574


The presence of negative values is due to the flag `logit` being `True`. If we flip this to `False`, we obtain probabilities instead:

In [11]:
risk_score_prob = CompressClassifier(
    nx = 100
    , ncv = 5
    , logit = False
    , laplace = True
    , n_neighbors = 50
).fit_transform(embeddings, df['aki_severity'])
print(f'Min/Max of risk score prob: {risk_score_prob.min()} / {risk_score_prob.max()}')

Min/Max of risk score prob: 0.038461538461538464 / 0.5961538461538461
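
The two minima printed above are consistent with add-one (Laplace) smoothing of the KNN vote followed by a logit transform. The sketch below checks that arithmetic. The formula `(k + 1) / (n + 2)` is inferred from the printed values rather than taken from the library source, and `laplace_smooth` and `logit` are hypothetical helper names used only for this illustration.

```python
import math

def laplace_smooth(k, n):
    """Add-one smoothing for a binary vote: k positives among n neighbors."""
    return (k + 1) / (n + 2)

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))

n_neighbors = 50

# One positive neighbor out of 50 matches the smallest probability above.
p_min = laplace_smooth(1, n_neighbors)
print(p_min)         # 2/52, about 0.0385
print(logit(p_min))  # log(2/50), about -3.2189

# With zero positive neighbors the raw proportion 0/50 has no finite
# logit, whereas the smoothed value stays strictly inside (0, 1).
p_zero = laplace_smooth(0, n_neighbors)
print(logit(p_zero))  # log(1/51), finite
```

Under the same assumption, the two maxima correspond to slightly different vote counts (0.4700 is log(32/20), while 0.5962 is 31/52), so the two runs appear not to have used identical folds.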


## Bringing it All Together: Using a Text Column in a Predictive Model

Recall that our ultimate goal in `TabuLLM` is to incorporate text columns into predictive models by taking advantage of modern LLMs. Let's use the AKI dataset to illustrate how the `CompressClassifier` transformer can be included in a predictive pipeline, where the risk score extracted from the text column `diagnoses` is included alongside other patient attributes in a logistic regression model to predict severe postoperative AKI.

To start, we prepare our data, i.e., the feature matrix and the response vector:

In [18]:
features_baseline = ['is_female', 'age', 'height', 'weight', 'optime']
features_embedding = [f'X_{i}' for i in range(embeddings.shape[1])]
X = pd.concat([embeddings, df[features_baseline]], axis = 1)
y = df['aki_severity']

Next, we define our baseline and embedding pipelines. In the baseline case, we simply drop the embedding features and pass the baseline features to a logistic regression model. In the embedding case, we apply `CompressClassifier` to the embedding features, and then append the resulting risk score to the baseline features before applying logistic regression. For KNN inside `CompressClassifier`, we use 50 neighbors:

In [39]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

ct_baseline = ColumnTransformer([
    ('baseline', 'passthrough', features_baseline)
], remainder = 'drop')
pipeline_baseline = Pipeline([
    ('coltrans', ct_baseline)
    , ('logit', LogisticRegression(penalty = None, solver = 'newton-cholesky', max_iter = 1000))
])

ct_embedding = ColumnTransformer([
    ('baseline', 'passthrough', features_baseline)
    , ('embedding', CompressClassifier(n_neighbors = 50), features_embedding)
], remainder = 'drop')
pipeline_embedding = Pipeline([
    ('coltrans', ct_embedding)
    , ('logit', LogisticRegression(penalty = None, solver = 'newton-cholesky', max_iter = 1000))
])
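
To see what the `ColumnTransformer` step produces, the sketch below runs the same layout on synthetic data with a hypothetical stand-in transformer, `KnnScore`. It is not `TabuLLM` code: the class, its use of cross-fit scores in `fit_transform` and a full-data KNN model in `transform`, and the made-up columns are all assumptions about one reasonable design.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier


class KnnScore(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a compression step (not TabuLLM code)."""

    def __init__(self, n_neighbors=5, ncv=5):
        self.n_neighbors = n_neighbors
        self.ncv = ncv

    def fit(self, X, y):
        # Model fitted on all training rows, used later for unseen data.
        self.knn_ = KNeighborsClassifier(n_neighbors=self.n_neighbors).fit(X, y)
        return self

    def transform(self, X):
        # Positive-class probability as a single column.
        return self.knn_.predict_proba(X)[:, [1]]

    def fit_transform(self, X, y):
        # Training rows receive cross-fit scores instead of in-sample ones.
        self.fit(X, y)
        knn = KNeighborsClassifier(n_neighbors=self.n_neighbors)
        proba = cross_val_predict(knn, X, y, cv=self.ncv, method="predict_proba")
        return proba[:, [1]]


# Made-up data: 8 "embedding" columns plus 2 baseline columns.
rng = np.random.default_rng(0)
emb_cols = [f"X_{i}" for i in range(8)]
demo = pd.DataFrame(rng.normal(size=(120, 8)), columns=emb_cols)
demo["age"] = rng.integers(20, 90, size=120)
demo["optime"] = rng.normal(200, 40, size=120)
y_demo = (demo["X_0"] + rng.normal(size=120) > 0).astype(int)

ct_demo = ColumnTransformer([
    ("baseline", "passthrough", ["age", "optime"]),
    ("embedding", KnnScore(n_neighbors=10), emb_cols),
], remainder="drop")

train_out = ct_demo.fit_transform(demo, y_demo)
new_out = ct_demo.transform(demo.iloc[:5])
print(train_out.shape, new_out.shape)  # (120, 3) (5, 3)
```

The output has the two baseline columns followed by one score column, which is the shape of input the logistic regression step receives in a pipeline of this form.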

We now create a `KFold` object and pass it alongside the above two pipelines to sklearn's `cross_val_score` to obtain the area under the ROC curve (AUC) for each fold:

In [36]:
from sklearn.model_selection import KFold, cross_val_score

# 50 shuffled folds, shared by both pipelines so the fold-level scores are paired
kf = KFold(n_splits = 50, shuffle = True, random_state = 1234)

auc_baseline = cross_val_score(
    pipeline_baseline
    , X, y, cv = kf
    , scoring = 'roc_auc'
)

auc_embedding = cross_val_score(
    pipeline_embedding
    , X, y, cv = kf
    , scoring = 'roc_auc'
)

Finally, we perform a paired t-test to compare the fold-level AUC scores:

In [38]:
# paired t-test
from scipy.stats import ttest_rel
ttest_rel(auc_baseline, auc_embedding)

TtestResult(statistic=-2.6262994278320235, pvalue=0.011486618914033459, df=49)
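
To read the sign of the statistic: `ttest_rel(a, b)` is a one-sample t-test on the differences `a - b`, so a negative value means the first sample is lower on average. The sketch below computes the statistic by hand on made-up fold-level scores. The numbers and the helper name `paired_t` are illustrative assumptions, not results from this tutorial.

```python
import math
import statistics

def paired_t(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    d = [x - z for x, z in zip(a, b)]
    n = len(d)
    # Mean difference divided by its standard error.
    t = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))
    return t, n - 1

# Made-up fold-level AUCs, for illustration only.
auc_a = [0.70, 0.65, 0.72, 0.68, 0.71]
auc_b = [0.74, 0.66, 0.75, 0.73, 0.72]

t_stat, dof = paired_t(auc_a, auc_b)
print(t_stat < 0, dof)  # True 4
```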

We see that, at the 5% significance level, adding the text column via embedding and compression has improved AUC compared to the baseline model (the negative test statistic reflects a lower mean fold-level AUC for the baseline). Keep in mind that the LLM we used in this tutorial was a small one that is likely to be inferior to the most cutting-edge models from OpenAI and Google. For a more thorough study of text embeddings in the AKI dataset, see Sharabiani et al. (2024).