### VLM-Lens: Probing Vision-Language Models with Lens

This notebook demonstrates how to use VLM-Lens to probe vision-language models (VLMs) on a pre-defined dataset. Here, we use a CLEVR-based dataset to illustrate the process.

We use the `boolean` split of the `compling/CLEVR_categories` dataset, which can be found on [Hugging Face Datasets](https://huggingface.co/datasets/compling/CLEVR_categories/viewer/default/boolean).
Each example pairs an image with a question about it, and the task is to decide whether the answer is true or false based on the image content.

Qwen-2B will be used as the example model. 

### Step 0: Environment Setup

As mentioned in the [minimal guide](https://github.com/compling-wat/vlm-lens/blob/main/tutorial-notebooks/guide.ipynb), we first need to install the required packages and import necessary modules.

This notebook needs a few more dependencies in addition to the base environment setup. Inside a virtual environment, we can install them via pip:
```bash
pip install -r envs/probe/requirements.txt
```

### Step 1: Extract Features 

Feature extraction is the core functionality of VLM-Lens. It allows you to extract features from images and text using pre-trained models.
To extract the Qwen features, we can run the following code (note: this may take 20-30 minutes depending on your hardware):

In [None]:
! python -m src.main --config configs/models/qwen/qwen-2b-clevr.yaml

The output features will be saved in `output/qwen-boolean.db` by default. You can change the output path in the config file.
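Before training a probe, it can help to check what the extraction step actually wrote. The sketch below assumes the `.db` file is a SQLite database (an assumption based on the file extension, not something stated above) and makes no assumption about table or column names: it simply lists whatever it finds. The helper `describe_db` is a hypothetical name introduced for this example and is not part of VLM-Lens.

```python
import sqlite3


def describe_db(db_path):
    """Return {table_name: [column names]} for every table in a SQLite file.

    Hypothetical helper for illustration; not part of VLM-Lens.
    Note that sqlite3.connect creates an empty file if db_path does not exist.
    """
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info returns one row per column; index 1 is the column name.
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()


# Usage, once feature extraction has finished:
# print(describe_db("output/qwen-boolean.db"))
```

If the file turns out not to be SQLite, `sqlite3` will raise a `DatabaseError` on the first query, which is itself a quick way to test the assumption.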

### Step 2: Run the Probe Training Script

Once the features are extracted, we can run the following script to train a probe model on them. The probe is trained to predict the answer to each question from the extracted features.

This may take ~30 minutes depending on your hardware. It will be faster if you use a custom dataset with fewer samples.

Note that the current probe training script assumes the data has no explicit training and testing splits, so it splits the data into training and testing sets automatically. You can modify the script to use your own splits if needed.
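To make the idea of an automatic split concrete, here is a minimal sketch of a seeded random split over sample indices. It is not the script's actual logic: the function name `split_indices`, the 20% test fraction, and the seed are all assumptions chosen for illustration, since the ratio the script uses is not stated here.

```python
import random


def split_indices(num_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, test) lists.

    Hypothetical helper; the fraction and seed are illustrative defaults.
    """
    indices = list(range(num_samples))
    # A local RNG keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(indices)
    num_test = int(num_samples * test_fraction)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(10, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))  # prints: 8 2
```

Fixing the seed matters when comparing probes across layers or models, because every run then sees the same held-out examples.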

In [None]:
! python -m src.probe.main --config configs/probe/qwen/clevr-boolean-l13-example.yaml  # run with --debug if you'd like to see more logs

Let's take a closer look at the YAML config file used in the above command: `configs/probe/qwen/clevr-boolean-l13-example.yaml`.

```yaml
model:
  - activation: ReLU
  - hidden_size: 512
  - num_layers: 2
  - save_dir: output/qwen_boolean_probe_l13
training:
  - batch_size: [64, 128]
  - num_epochs: [50]
  - learning_rate: [0.001]
  - optimizer: AdamW
  - loss: CrossEntropyLoss
test:
  - batch_size: 32
  - loss: CrossEntropyLoss
data:
  - input_db: output/qwen-boolean.db
  - db_name: tensors
  - input_layer: model.layers.13.post_attention_layernorm
```

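The config file specifies the probe architecture (an MLP with ReLU activation, hidden size 512, and 2 layers), the training and test settings (batch size, number of epochs, learning rate, optimizer, and loss), and the data source (the feature database and the layer whose activations are probed).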
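One detail of this layout is easy to miss: because every entry starts with `- `, each section is a YAML list of single-key mappings rather than one mapping. The sketch below writes out, as a Python literal, what a YAML parser would return for two of the sections, and merges each list into a plain dictionary. `flatten_section` is a hypothetical helper for illustration; how VLM-Lens itself reads the file is not shown here.

```python
# What a YAML parser returns for the `model` and `data` sections of the config:
# each "- key: value" line becomes a one-entry dict inside a list.
parsed = {
    "model": [
        {"activation": "ReLU"},
        {"hidden_size": 512},
        {"num_layers": 2},
        {"save_dir": "output/qwen_boolean_probe_l13"},
    ],
    "data": [
        {"input_db": "output/qwen-boolean.db"},
        {"db_name": "tensors"},
        {"input_layer": "model.layers.13.post_attention_layernorm"},
    ],
}


def flatten_section(entries):
    """Merge a list of single-key dicts into one dict (hypothetical helper)."""
    merged = {}
    for entry in entries:
        merged.update(entry)
    return merged


model_cfg = flatten_section(parsed["model"])
print(model_cfg["hidden_size"], model_cfg["activation"])  # prints: 512 ReLU
```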

The probe trainer will do a grid search over the training parameters (batch size, number of epochs, learning rate) to find the best combination. It will select the best parameter combination based on the cross-validation loss, and then evaluate the model on the test set.
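As a rough illustration of what the grid covers, the sketch below expands the three list-valued training parameters from the example config into every combination. With two batch sizes, one epoch count, and one learning rate, that gives two candidates. The functions `expand_grid` and `select_best` and the `evaluate` callable are hypothetical stand-ins, not the trainer's real interface.

```python
from itertools import product

# List-valued training parameters copied from the example config.
grid = {"batch_size": [64, 128], "num_epochs": [50], "learning_rate": [0.001]}


def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def select_best(candidates, evaluate):
    """Pick the candidate with the lowest loss.

    `evaluate` is a hypothetical callable that trains a probe with the given
    parameters and returns its validation loss.
    """
    return min(candidates, key=evaluate)


candidates = expand_grid(grid)
print(len(candidates))  # prints: 2
```

Because the number of candidates is the product of the list lengths, adding values to several parameters at once multiplies the training time.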

You can modify these parameters to suit your needs.

The final probing results will be saved in `output/qwen_boolean_probe_l13/`, as specified in the config file. Under this directory, you will find: 
- `probe.pth`: the trained probe model
- `probe_data.json`: the probing results, including training and testing predictions, accuracy, loss, and best hyperparameters, as well as the p-value for the probe performing better than random guessing (labeled "shuffle" in the probe results).
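Since the exact key names inside `probe_data.json` are not listed here, a safe first step is to look at its top-level structure rather than guess. The sketch below uses only the standard `json` module; `summarize_json` is a hypothetical helper name, and it assumes nothing about the file beyond it being valid JSON.

```python
import json


def summarize_json(path):
    """Map each top-level key to the type name of its value (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        # The file may hold a list or scalar at the top level instead of an object.
        return {"<root>": type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}


# Usage, once probe training has finished:
# print(summarize_json("output/qwen_boolean_probe_l13/probe_data.json"))
```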