# Generate Embeddings

For each case, the content is a natural-language description of the case metadata (court, state, and case name) followed by the text of its complaints and/or opinions, each labeled with its document type, all concatenated into one large string.

The content is then split into chunks that fit the model's context window, and an encoder model generates an embedding for each chunk. For these experiments, we used ModernBERT Base, with its 8192-token context window, as our baseline. The model was chosen for its superior performance, speed, and larger context window compared to its predecessors. The approach is model-agnostic and can be used with any encoder model.
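The chunking step can be sketched as follows. This is a minimal illustration, not the notebook code: the function name `chunk_token_ids` is hypothetical, and it assumes non-overlapping chunks cut from an already tokenized case, with no room reserved for special tokens (the text above does not specify either detail).

```python
def chunk_token_ids(token_ids, window=8192):
    """Split a flat list of token ids into consecutive chunks.

    Assumption: chunks do not overlap, and the last chunk may be shorter
    than `window`.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]


# Stand-in for the tokenizer output of one case (20,000 tokens).
token_ids = list(range(20000))
chunks = chunk_token_ids(token_ids)
print([len(c) for c in chunks])
```

With a 20,000-token case, this yields two full chunks and one shorter trailing chunk, which would need padding before being batched.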

Since each case can have one or more issue categories (labels), with substantial class imbalance among the categories, I chose to train a separate binary classifier for each issue category. Each model is trained on a balanced dataset of binary labels (1 if the case has the target issue category, 0 if it does not) with the content embeddings as the features. The models I experimented with are logistic regression, random forest, support vector machine, and gradient boosted tree models. To ensure consistency among all models, I selected the same number of training samples for every category (240 cases, half positive and half negative). This number was chosen as a down-sampling technique to accommodate the less common classes, such as COVID. The data is carefully split so that no case appears in more than one of the train, validation, and test sets, to avoid any potential leakage.

When training the binary classification models, I split the training data into train and validation subsets so that I can tune the hyperparameters. Once the best model has been identified, I evaluate it on the validation set to identify further areas for improvement, such as the hyperparameters and the classification threshold. The models are then retrained on the entire training set (without the validation split) to produce the final classification models used for the final evaluation on the test set and for production.
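The balanced down-sampling for one issue category might look like the sketch below. The helper name `balanced_binary_sample`, the `case_labels` mapping, and the toy data (including the `"Other"` category) are assumptions made for illustration; only the 240-case, half-positive and half-negative design comes from the description above.

```python
import random


def balanced_binary_sample(case_labels, target, n_total=240, seed=0):
    """Return a shuffled list of (case_id, label) pairs for one category.

    case_labels: dict mapping case_id -> set of issue categories.
    label is 1 if the case has `target`, else 0; each class gets n_total // 2.
    """
    rng = random.Random(seed)
    positives = sorted(c for c, cats in case_labels.items() if target in cats)
    negatives = sorted(c for c, cats in case_labels.items() if target not in cats)
    half = n_total // 2
    if len(positives) < half or len(negatives) < half:
        raise ValueError("not enough cases to draw a balanced sample")
    sample = [(c, 1) for c in rng.sample(positives, half)]
    sample += [(c, 0) for c in rng.sample(negatives, half)]
    rng.shuffle(sample)
    return sample


# Toy data: 1,000 hypothetical cases, every fifth one tagged "COVID".
case_labels = {
    f"case_{i}": ({"COVID"} if i % 5 == 0 else {"Other"}) for i in range(1000)
}
sample = balanced_binary_sample(case_labels, target="COVID")
print(len(sample), sum(label for _, label in sample))
```

Because sampling is keyed on case IDs, the same IDs can be used to check that no case drawn here also appears in the validation or test sets.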

To generate the embeddings, the chunks are passed to the model as inputs of shape [batch_size, 8192], where 8192 is the context window (i.e., the number of tokens). The last hidden state produced by the model serves as the embedding for each batch of chunks, with shape [batch_size, 8192, 768], where 768 is the embedding dimension. The batch size is limited by GPU memory and never exceeds the number of chunks in a case; for most cases, we used a batch size of either 16 or 8. We first take the mean along the context window dimension, reducing the embedding to [batch_size, 768] so that each chunk is represented by one 768-dimensional vector. We then take the mean along the batch dimension, reducing it to [768], so that each case, regardless of the number of chunks or the length of its content, is represented by a single 768-dimensional embedding. When a case spans several batches, we follow a similar approach, except that instead of averaging each batch separately, we first concatenate the batch-level embeddings of shape [batch_size, 768] into chunk-level embeddings of shape [num_chunks, 768] and then take the mean along the chunk dimension to reduce the embedding to [768].
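The pooling arithmetic can be illustrated with NumPy on random stand-in arrays. This is a sketch of the reduction only, assuming the hidden states have already been computed; `pool_case_embedding` is a hypothetical name, the sequence length is shortened to 32 to keep the demo small, and no padding mask is applied because the description above does not mention one.

```python
import numpy as np

HIDDEN_DIM = 768  # embedding dimension of the encoder


def pool_case_embedding(batches):
    """Reduce per-batch hidden states to one embedding per case.

    batches: list of arrays shaped [batch_size, seq_len, hidden_dim].
    """
    # Mean over the token axis: [batch_size, hidden_dim] for each batch.
    batch_level = [b.mean(axis=1) for b in batches]
    # Concatenate batches: [num_chunks, hidden_dim].
    chunk_level = np.concatenate(batch_level, axis=0)
    # Mean over the chunk axis: [hidden_dim].
    return chunk_level.mean(axis=0)


rng = np.random.default_rng(0)
# A case with 21 chunks, processed as one batch of 16 and one batch of 5.
batches = [
    rng.normal(size=(16, 32, HIDDEN_DIM)),
    rng.normal(size=(5, 32, HIDDEN_DIM)),
]
case_embedding = pool_case_embedding(batches)
print(case_embedding.shape)
```

Concatenating before the final mean gives every chunk equal weight; averaging per-batch means instead would over-weight the chunks in a smaller last batch.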

All notebooks in the 8-series are for generating the embeddings for validation, test, and each issue category of training data. 

The embeddings are generated on a T4 GPU instance in Google Colab, and the resulting embeddings are saved in their respective embeddings folders under the data folder. Using embeddings as features allows us to experiment with different classification algorithms on CPU instances without needing a continuously running GPU instance. It also gives us the flexibility to add classes and train classification models for them using CPU instances only. The embeddings can also be used for other tasks, such as clustering similar cases together.