Amit Parekh, Nikolas Vitsakis, Alessandro Suglia, and Ioannis Konstas.
Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks.
Note
This codebase automatically downloads checkpoints and datasets, so you don't need to do that manually. Everything is hosted on Hugging Face and cached through the HF libraries.
- Clone this repository and navigate to the folder

  ```bash
  git clone https://github.com/amitkparekh/CoGeLoT.git
  cd CoGeLoT
  ```

- Install the dependencies (I used PDM and Python 3.11)

  ```bash
  pdm install
  ```

- Make sure everything works and is installed correctly

  ```bash
  pdm run pytest --deselect tests/test_online_evaluation.py
  ```

- Train a model

  ```bash
  pdm run python src/cogelot/entrypoints/train.py --experiment=01_their_vima
  ```

- Evaluate a model from one of the provided checkpoints

  ```bash
  pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=8lkml12g
  ```

For `model.model.wandb_run_id`, you can use any of the Run IDs from the table below.
Note
This project is codenamed `cogelot`, so that's what the library is called, to prevent needing to rewrite everything.
Everything. You should be able to run every single experiment from the paper. Datasets and models are hosted on HF.
While I tried to bring everything front and centre, some things might be buried. If you think these things should be brought forward, feel free to open a PR and bring them forward! I'll definitely be taking opinions regarding this into consideration for future projects.
Additionally, I've tried to work in a constrained, clean, and robust manner. I hope that helps you as much as it helped me.
Below is a table of each model run and where to find the checkpoints. We're providing all checkpoints from the end of each epoch, even though we only used the one from the very last epoch.
You do not need to download checkpoints yourself. This library contains multiple methods and functions to make the checkpoints work on our framework, and it's all included for you. All model checkpoints are stored on Hugging Face, but they will not work with the Transformers library out-of-the-box.
| Instruction-style | Instruction Modalities | Prompt-conditioning | Vision Encoder | Shuffled Objects? | WandB Run ID | Experiment ID |
|---|---|---|---|---|---|---|
| Original | Text + Visual | Cross-Attention | Object-Centric | False | 8lkml12g | 01_their_vima |
| Original | Text + Visual | Cross-Attention | Object-Centric | True | ftwoyjb1 | 01_their_vima_shuffle_obj |
| Original | Text + Visual | Cross-Attention | Image-Patches | N/A | ln4nrqhg | 01_their_vima_patches |
| Original | Text + Visual | Concatenate | Object-Centric | False | bhuja4vo | 08_their_gpt |
| Original | Text + Visual | Concatenate | Object-Centric | True | wn9jc5l8 | 08_their_gpt_shuffle_obj |
| Original | Text + Visual | Concatenate | Image-Patches | N/A | efxugme9 | 08_their_gpt_patches |
| Paraphrases | Text + Visual | Cross-Attention | Object-Centric | False | 2df3mwfn | 02_their_vima |
| Paraphrases | Text + Visual | Cross-Attention | Object-Centric | True | 0nsnkaer | 02_their_vima_shuffle_obj |
| Paraphrases | Text + Visual | Cross-Attention | Image-Patches | N/A | ah5btw8w | 02_their_vima_patches |
| Paraphrases | Text + Visual | Concatenate | Object-Centric | False | fs5v61mz | 09_their_gpt |
| Paraphrases | Text + Visual | Concatenate | Object-Centric | True | xb3yttg9 | 09_their_gpt_shuffle_obj |
| Paraphrases | Text + Visual | Concatenate | Image-Patches | N/A | zby6xk27 | 09_their_gpt_patches |
Important
Everything that was run, in some shape or form, starts from a module in `src/cogelot/entrypoints/`. This is what was used to create the datasets, train models, evaluate models, and more; every single run started from that folder.
This is not a comprehensive library made for all use cases and every possible scenario. It's a research project. That said, I tried to make everything as clear as possible for you. In this section, I detailed how I did everything so that you can use it as an example for how to start yourself.
I have tried to make sure that docstrings and comments are relevant and detailed. If you want more information on what a function is doing or why it is doing that, feel free to make an issue. If you figure out something that I haven't described enough of, feel free to make a PR improving my documentation so that you, me, and future people can benefit from your insight.
I used PDM to manage this project. Everything you need to know about installing the dependencies can be found in the `pyproject.toml`.
To quickly install and get up and running, you can run the following:

```bash
pdm install
```
What if you use `requirements.txt`?
I have exported and included the `requirements.txt` from PDM. Using it is up to you. I'm not going to be maintaining it, but it's there if you need it.
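If you do go that route, a standard pip install should work (this assumes you've already created and activated a virtual environment yourself):

```shell
pip install -r requirements.txt
```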
How I install dependencies on every machine
I literally just run the following on the machines I use. I don't use Windows though, so I can't help you there.

```bash
mise use python@3.11 pdm@latest
pdm install
```
How to make sure it works on your machine
The quickest way to make sure you're all set up is to run either of the following:

- If you know you've got a venv activated or something

  ```bash
  python -m cogelot
  ```

- If you're using PDM instead of activating the venv

  ```bash
  pdm run python -m cogelot
  ```
Things happen and things break. I needed a sense check to make sure everything worked. I developed everything using tests to verify that each piece works in isolation and together. This is the first thing I did when using a new machine or node.
You can find all the tests in the `tests/` folder. The various tests are a good way of seeing how different pieces were implemented and are used. While coverage is not 100%, I used the tests with breakpoints to verify things are working as expected.
How to make sure all tests can be loaded without errors

```bash
pdm run pytest --deselect tests/test_online_evaluation.py --collect-only
```

This is also useful for making sure things installed correctly and that all tests can be found.
How to run all the tests

```bash
pdm run pytest --deselect tests/test_online_evaluation.py
```

I've also got CI doing this, so you can check the badge at the top of the README to see if everything is working as expected.
Check out pytest-xdist if you want to know more about running tests in parallel, or just throw `-n auto` on the end of the above command. It makes it go faster.
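For example, the parallel run would look like this (pytest-xdist needs to be installed; `-n auto` spawns one worker per CPU core):

```shell
pdm run pytest -n auto --deselect tests/test_online_evaluation.py
```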
Tip
Before spawning an instance and starting to train with GPUs, you can run the above command on your machine to make sure everything works on CPU. As Lightning handles all of the GPU communication, if it works on CPU, there's a 99% chance it'll work on GPU.
Important
All model training was orchestrated with Hydra, and the configs can be found in the `configs/` folder.
I went all out on the Hydra stuff and everything is pretty compositional. The `configs/experiments` sub-dir contains the experiments that were run (and directly connects to the checkpoints table). As a result, if you just want to train a model, you can run:

```bash
pdm run python src/cogelot/entrypoints/train.py --experiment=01_their_vima
```
Tip
You can find the experiments in the folder, or check the `Experiment ID` column in the above table for what each one means, since the names aren't the clearest.
Training on different hardware
The `configs/hardware` folder contains the hardware configurations that were used to run the experiments. These set the number of GPUs, the number of CPUs, and the memory available to the model. They were preset for the cluster I was using, but you can adjust them to your needs.
How to train models on OCI
This was a while ago now, but I had a setup script, which you can find at `scripts/setup-oci-a100.sh`. It was used to set up the environment on the OCI instance I was using. It's not perfect, but it's a good starting point.
How I trained models on K8s
Running on K8s was a bit more involved, but it's all here. That said, it will be different for your setup.
My pod spec was:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: &name cogelot-1
  namespace: ???
spec:
  restartPolicy: Never
  containers:
    - name: 1st
      image: amitkparekh/python-pdm-cuda:latest
      envFrom:
        - secretRef:
            name: amit-cogelot
      imagePullPolicy: Always
      command: ["/bin/bash", "-c"]
      args:
        - gh repo clone amitkparekh/cogelot cogelot &&
          cd cogelot &&
          bash ./scripts/setup-eidf.sh 2>&1 | tee setup-eidf.log &&
          sleep infinity
        #bash ./scripts/run-sweep-4.sh
      resources:
        requests:
          cpu: &num-cpu 10
          memory: &num-memory "150Gi"
          nvidia.com/gpu: &num-gpu 4
        limits:
          cpu: *num-cpu
          memory: *num-memory
          nvidia.com/gpu: *num-gpu
      volumeMounts:
        - mountPath: /mnt/ceph_rbd
          name: volume
        # this is necessary for training in distributed mode - used for different processes to communicate
        - mountPath: /dev/shm
          name: dshm1
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: *name
    - name: dshm1
      emptyDir:
        medium: Memory
```
The Dockerfile is public, and the secrets contained the following:

```env
WANDB_API_KEY=???
HUGGING_FACE_HUB_TOKEN=???
GH_TOKEN=???
HF_HUB_VERBOSITY=info
WANDB_CONFIG_DIR=/mnt/ceph_rbd/wandb
WANDB_CACHE_DIR=/mnt/ceph_rbd/wandb
HF_HOME=/mnt/ceph_rbd/huggingface
TORCH_HOME=/mnt/ceph_rbd/torch
```
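For completeness, the secret and pod can be created with the standard kubectl commands. The secret name matches the pod spec above; `secrets.env` (a file holding the variables listed) and `pod.yaml` are placeholder file names for illustration:

```shell
kubectl create secret generic amit-cogelot --from-env-file=secrets.env
kubectl apply -f pod.yaml
kubectl logs -f cogelot-1
```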
Again, this uses Hydra, so, like training, the entrypoint is `src/cogelot/entrypoints/evaluate.py` and its config is `configs/evaluate.yaml`.
To run the evaluation in the environment, I used the following command:

```bash
pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=8lkml12g
```
How to choose your checkpoint
The `model.model.wandb_run_id` parameter is used to select the checkpoint to evaluate. The run ID is the one from the table above.
By default, we use the checkpoint from the last epoch, but if you want to change the epoch, just add `model.model.epoch=<epoch_number>` to the command.
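For example, to evaluate an earlier checkpoint of the same run (the epoch number here is purely illustrative):

```shell
pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=8lkml12g model.model.epoch=3
```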
How to run multiple runs in parallel
The `trainer.devices` option creates multiple CPU processes for evaluation, as eval does not need the GPU. Change the number in the command to however many processes you want.
Important things to note:
- The more processes/devices you use, the more memory you need, since multiple instantiations of the model are loaded into memory.
- I did not do anything fancy with batching across instances. Since we use CPU for evaluation, I didn't need to.
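For example, to run eight evaluation processes at once:

```shell
pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=8 model.model.wandb_run_id=8lkml12g
```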
How to perturb the instructions
You can find all of these in `configs/evaluation_instance_transform/`. For each file name, you can invoke it by appending `evaluation_instance_transform=<file_name>` to the command.
| Evaluation Instance Transform | Description |
|---|---|
| `noop` | Interleave modalities in the prompt, by default |
| `gobbledygook_tokens` | Apply Gobbledygook Tokens to the prompt |
| `gobbledygook_words` | Apply Gobbledygook Words to the prompt |
| `reworded` | Use paraphrased instructions with interleaved modalities |
| `textual` | Convert visual referents to text |
| `textual_gobbledygook_words` | Convert visual referents to text and apply Gobbledygook Words |
| `textual_gobbledygook_tokens` | Convert visual referents to text and apply Gobbledygook Tokens |
| `textual_no_noun` | Convert visual referents to text, but remove the nouns |
| `textual_no_texture` | Convert visual referents to text, but remove the descriptions of nouns |
| `textual_generic_noun` | Convert visual referents to text, but replace each noun with a generic form (e.g. "block" becomes "thing") |
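For example, combining the base evaluate command with the Gobbledygook Words perturbation:

```shell
pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=8lkml12g evaluation_instance_transform=gobbledygook_words
```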
How to disable modalities in the prompt
You can find all of these in `configs/evaluation_prompt_modality/`. For each file name, you can invoke it by appending `evaluation_prompt_modality=<file_name>` to the command.
| Evaluation Prompt Modality | Description |
|---|---|
| `disable_none` | Do nothing |
| `disable_text` | Disable the text modality |
| `disable_visual` | Disable the visual modality |
| `disable_both` | Disable both modalities, basically masking every token |
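For example, to mask out the text modality during evaluation:

```shell
pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=8lkml12g evaluation_prompt_modality=disable_text
```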
How to permute object token order for observations
Append `model.should_shuffle_obj_per_observations=true` to the command. This will shuffle the object tokens in the observation.
How to run on different difficulties
Append `model.difficulty=<difficulty>` to the command. The difficulties are:

- `easy`
- `medium` (unused)
- `hard` (unused)
- `extreme`
- `distracting`
- `extremely_distracting`
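For example, to evaluate on the most difficult setting:

```shell
pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=8lkml12g model.difficulty=extremely_distracting
```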
How I ran the checkpoint from VIMA in the environment
I downloaded the checkpoint from VIMA's repo, renamed it to `them.ckpt`, and put it in `storage/data/models`. If you want to change the path used, you can do so in `configs/model/from_their_policy.yaml`.
I used the following command to run the checkpoint from VIMA in the environment:

```bash
SLURM_JOB_NAME='bash' pdm run python src/cogelot/entrypoints/evaluate_theirs.py trainer.devices=20
```
You can use all of the other perturbations mentioned above.
How to run checkpoints with a live display
If you want to see what's going on live, you can append `environment@model.environment=display` to the evaluate command.
Importantly, only use one process, because I don't know what'll happen if you don't.
Also, this wasn't run on SLURM, just on my Mac. I can't speak for every machine, so your mileage may vary.
How to evaluate models on SLURM
It is very unlikely that I ran things in a tmux session and just stared at it; I don't like copy-pasting hundreds of commands.
As experiments were often run on a compute cluster, I ran commands with SLURM. You can find the batch files in `./scripts/slurm/`. These were made for my system, so some adjustments will likely be needed, but I'm hoping it's obvious and not too complicated!
So that things can be run quickly, the dataset was loaded and parsed with Pydantic, and then converted into a HF dataset. There are unit tests showing how this was done in `tests/test_dataset_creation.py`.
The dataset was processed in two steps. The first step was to parse the raw data and pickle it into individual files; this was the most time-consuming part of the process. The second step was to load the pickled files and convert them into a HF dataset.
To make loading data efficient when modelling, all the instances were tokenised in advance. Similarly, this is also available on HF, under a different config name.
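The two-step pattern can be sketched in plain Python. This is not the project's actual code; the function names, paths, and the stand-in "parsing" are hypothetical, but the shape is the same: do the slow parse once and cache each instance as a pickle, then cheaply reload the cache for dataset conversion.

```python
import pickle
from pathlib import Path


def parse_and_pickle(raw_instances: list[dict], out_dir: Path) -> None:
    """Step 1: do the slow parsing once, caching each instance as its own pickle."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for idx, raw in enumerate(raw_instances):
        # Stand-in for the real (expensive) Pydantic parsing step.
        parsed = {"task": raw["task"], "steps": raw["steps"]}
        (out_dir / f"{idx}.pkl").write_bytes(pickle.dumps(parsed))


def load_pickled(out_dir: Path) -> list[dict]:
    """Step 2: cheap reload of the cached instances, ready for dataset conversion."""
    return [pickle.loads(p.read_bytes()) for p in sorted(out_dir.glob("*.pkl"))]
```

The payoff is that re-running the second step (e.g. after changing the dataset conversion) never repeats the expensive parse.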
Note
For each of the following commands, you can append `--help` to get more information on the command, what it does, and the various arguments that control it. Alternatively, you can change things using the Pydantic settings in `src/cogelot/common/settings.py`.
For example, each command has a way of distributing the load to multiple workers, and even splitting them across multiple SLURM jobs to make it go so much faster.
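For instance, to see what the parsing command accepts:

```shell
pdm run python -m cogelot parse-original-dataset --help
```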
Step 1. Download the raw data from VIMA
The raw data was downloaded from VIMA. Each instance is a folder of multiple files. Once extracted, the folder structure looked like this:

```
<project_root>
└─ storage/
   └─ data/
      └─ raw/
         └─ vima_v6/
            └─ <task_name>/
               └─ <instance_id>/
```

I used symlinks to make it easier to manage the data, but this is what the folder structure was. If you want to use a different directory, you can change it using the Pydantic settings in `src/cogelot/common/settings.py`.
Step 2. Parse the original data
The raw data was parsed and pickled into individual files, and then converted into a HF dataset. This is for speed.

```bash
pdm run python -m cogelot parse-original-dataset --replace-if-exists
pdm run python -m cogelot create-raw-dataset-per-task
```

I have separate SBATCH files for these steps:

- `scripts/slurm/parse-original-dataset.sh`
- `scripts/slurm/create-raw-dataset.sh`
Step 3. Tokenize and preprocess for faster training
Again, we preprocess and dump each instance as a pickle (because it's faster) before turning everything into the HF dataset.

```bash
pdm run python -m cogelot preprocess-instances
pdm run python -m cogelot create-preprocessed-dataset-per-task
```

I have separate SBATCH files for these steps:

- `scripts/slurm/preprocess-instances.sh`
- `scripts/slurm/create-preprocessed-dataset.sh`
Step 4. Create a dataset variant of paraphrased instructions
We use the previous instances to make the new variations, and use environment variables to create the preprocessed versions of the dataset.

```bash
pdm run python -m cogelot create-reworded-dataset-per-task original reworded
DATASET_VARIANT=reworded pdm run python -m cogelot preprocess-instances
DATASET_VARIANT=reworded pdm run python -m cogelot create-preprocessed-dataset-per-task
```

Again, I have an SBATCH file for this: `scripts/slurm/create-reworded-dataset.sh`, or, more conveniently, a bash script to submit the SBATCH jobs: `scripts/submit-reworded-dataset-creation-jobs.sh`.
Step 5. Upload all the datasets
This one's just for me, but refer to `scripts/submit-dataset-upload-jobs.sh` for uploading all the datasets to HF as fast as possible without hitting the rate limit.
VIMA, VIMA-Bench, and all artefacts from VIMA are licensed under the MIT License. Everything within this repository continues to be licensed under the MIT License.
```bibtex
@misc{parekh2024investigatingroleinstructionvariety,
  title         = {Investigating the {{Role}} of {{Instruction Variety}} and {{Task Difficulty}} in {{Robotic Manipulation Tasks}}},
  author        = {Amit Parekh and Nikolas Vitsakis and Alessandro Suglia and Ioannis Konstas},
  year          = {2024},
  eprint        = {2407.03967},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2407.03967},
}
```