This is the official implementation for our ICLR 2024 paper
"CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding"
We aim to address the question of whether an interpretable 3D visual grounding framework, capable of emulating the human perception system, can be designed, as shown in the figure above. To achieve this objective, we formulate the 3D visual grounding problem as a sequence-to-sequence (Seq2Seq) task. As illustrated in the architecture above, the input sequence comprises 3D objects from the scene and an utterance describing a specific object. In contrast to existing architectures, our model predicts both the target object and a chain of anchors on the output side.
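To make this output-side formulation concrete, below is a minimal, hypothetical sketch of a chain-of-anchors decoding head. It is not the code in this repository; the module name, dimensions, chain length, and scoring scheme are illustrative assumptions. One learned query per chain step attends over the fused object features, each decoded step scores all scene objects, and the last step is taken as the target.

```python
# Illustrative sketch only; not the repository's implementation.
import torch
import torch.nn as nn


class ChainOfAnchorsHead(nn.Module):
    """Toy decoder head: scores a chain of anchors, then the target, over scene objects."""

    def __init__(self, d_model=256, num_anchors=2, nhead=8, num_layers=4):
        super().__init__()
        chain_len = num_anchors + 1  # anchors followed by the final target
        self.queries = nn.Parameter(torch.randn(chain_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, obj_feats):
        # obj_feats: (B, N, d) fused 3D-object / language features used as decoder memory.
        batch = obj_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)  # (B, chain_len, d)
        decoded = self.decoder(tgt=queries, memory=obj_feats)      # (B, chain_len, d)
        # Each decoded step scores every scene object; the last step is the target.
        logits = decoded @ obj_feats.transpose(1, 2)               # (B, chain_len, N)
        return logits[:, :-1], logits[:, -1]                       # anchor logits, target logits


# Toy usage: 2 scenes, 52 candidate objects, 256-d fused features.
anchor_logits, target_logits = ChainOfAnchorsHead()(torch.randn(2, 52, 256))
print(anchor_logits.shape, target_logits.shape)  # torch.Size([2, 2, 52]) torch.Size([2, 52])
```

In the actual training commands below, the chain-of-thought behaviour is controlled by flags such as --anchors "cot" and --cot_type "cross".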
MVT-Nr3D: (Weights and logs)
| Method / Data percentage | 10% | 40% | 70% | 100% |
|---|---|---|---|---|
| MVT Baseline | 27.56 | 41.64 | 51.93 | 55.22 |
| MVT + Pseudo Labels | 38.16 | 54.83 | 58.81 | 60.36 |
| MVT + GT Labels | 37.23 | 53.89 | 62.95 | 64.36 |
MVT-Sr3D: (Weights and logs)
| Method / Data percentage | 10% | 40% | 70% | 100% |
|---|---|---|---|---|
| MVT Baseline | 48.75 | 65.03 | 65.08 | 66.05 |
| MVT + GT Labels | 66.37 | 72.78 | 73.53 | 73.23 |
Please refer to the installation and data preparation instructions from referit3d.
We adopt bert-base-uncased from Hugging Face; the transformers library can be installed with pip:
```bash
pip install transformers
```
You can download the pretrained weights from this page and put them into a folder, referred to below as PATH_OF_BERT.
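As a quick, optional sanity check (not required by the training scripts), you can verify that the downloaded weights load with the transformers API; the path below is a placeholder for your PATH_OF_BERT folder:

```python
# Optional check that the downloaded BERT weights load correctly.
from transformers import AutoModel, AutoTokenizer

PATH_OF_BERT = "/path/to/bert-base-uncased"  # placeholder: folder with config.json, vocab.txt, weights

tokenizer = AutoTokenizer.from_pretrained(PATH_OF_BERT)
model = AutoModel.from_pretrained(PATH_OF_BERT)

tokens = tokenizer("the chair next to the window", return_tensors="pt")
print(model(**tokens).last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```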
- To train on the Sr3d dataset, use the following command:
```bash
python /CoT3D_VG/refering_codes/MVT-3DVG/train_referit3d.py \
-scannet-file $PATH_OF_SCANNET_FILE$ \
-referit3D-file $PATH_OF_REFERIT3D_FILE$ \
--bert-pretrain-path $PATH_OF_BERT$ \
--log-dir logs/COT3DRef_sr3d \
--n-workers 16 \
--model 'referIt3DNet_transformer' \
--unit-sphere-norm True \
--batch-size 24 \
--encoder-layer-num 3 \
--decoder-layer-num 4 \
--decoder-nhead-num 8 \
--gpu "0" \
--view_number 4 \
--rotate_number 4 \
--label-lang-sup True \
--anchors "cot" \
--max-test-objects 52 \
--cot_type "cross" \
--predict_lang_anchors True \
--lang_filter_objs False \
--visaug_shuffle_mode 'none' \
--visaug_extracted_obj_path '$PATH_OF_SR3D_DATA' \
--visaug_pc_augment True \
--train_data_percent 1.0
```
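If you want to sweep the data-efficiency settings reported in the tables above (10%, 40%, 70%, and 100% of the training data), one possible launcher is sketched below. It is not part of the repository, shows only a subset of the flags (append the remaining arguments from the Sr3d command above), and uses an arbitrary log-dir naming scheme:

```python
# Hypothetical launcher for the data-efficiency sweep (10/40/70/100%).
# Append the remaining flags from the Sr3d command above and replace the
# $...$ placeholders with real paths before running.
import subprocess

BASE_CMD = [
    "python", "/CoT3D_VG/refering_codes/MVT-3DVG/train_referit3d.py",
    "-scannet-file", "$PATH_OF_SCANNET_FILE$",
    "-referit3D-file", "$PATH_OF_REFERIT3D_FILE$",
    "--bert-pretrain-path", "$PATH_OF_BERT$",
    "--model", "referIt3DNet_transformer",
    "--anchors", "cot",
    "--cot_type", "cross",
]

for percent in (0.1, 0.4, 0.7, 1.0):
    run_cmd = BASE_CMD + [
        "--train_data_percent", str(percent),
        "--log-dir", f"logs/COT3DRef_sr3d_{int(percent * 100)}pct",  # arbitrary naming
    ]
    subprocess.run(run_cmd, check=True)
```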
- To train on the Nr3d dataset, add the following arguments to the Sr3d command above:
```bash
--shuffle_objects_percentage 0 \
--visaug_pc_augment False \
--train_data_percent 1.0 \
--max_num_anchors 7 \
--dropout-rate 0.15 \
--textaug_paraphrase_percentage 0 \
--target_aug_percentage 0 \
--gaussian_latent False \
--distractor_aux_loss_flag True \
--train_data_repeatation 1 \
--augment-with-sr3d '$PATH_OF_SR3D_DATA'
```
- Also change the --visaug_extracted_obj_path argument:
```bash
--visaug_extracted_obj_path '$PATH_OF_NR3D_DATA'
```
- To train jointly on both the Sr3d and Nr3d datasets, use the following command:
```bash
python /CoT3D_VG/refering_codes/MVT-3DVG/train_referit3d.py \
-scannet-file $PATH_OF_SCANNET_FILE$ \
-referit3D-file $PATH_OF_REFERIT3D_FILE$ \
--bert-pretrain-path $PATH_OF_BERT$ \
--log-dir logs/MVT_nr3d \
--n-workers 16 \
--model 'referIt3DNet_transformer' \
--unit-sphere-norm True \
--batch-size 64 \
--encoder-layer-num 3 \
--decoder-layer-num 4 \
--decoder-nhead-num 8 \
--gpu "0" \
--view_number 4 \
--rotate_number 4 \
--label-lang-sup True \
--anchors "cot" \
--max-test-objects 52 \
--cot_type "cross" \
--predict_lang_anchors True \
--lang_filter_objs False \
--visaug_shuffle_mode 'none' \
--shuffle_objects_percentage 0 \
--visaug_extracted_obj_path '$PATH_OF_NR3D_DATA' \
--visaug_pc_augment False \
--train_data_percent 1.0 \
--max_num_anchors 7 \
--dropout-rate 0.15 \
--textaug_paraphrase_percentage 0 \
--target_aug_percentage 0 \
--gaussian_latent False \
--distractor_aux_loss_flag True \
--train_data_repeatation 1 \
--augment-with-sr3d '$PATH_OF_SR3D_DATA'
```
- After each training epoch, the program automatically evaluates the current model. Our code saves the last model of the run as last_model.pth and, following the original ReferIt3D repo, saves the best model as best_model.pth.
- At test time, analyze_predictions will run following the original ReferIt3D code.
- analyze_predictions tests the model multiple times, each time with a different random seed. With different seeds, the sampled point clouds of each object differ; the average accuracy and standard deviation are reported (a short sketch of this aggregation appears after the test commands below).
- To test on either the Nr3d or Sr3d dataset, use the following command:
```bash
python referit3d/scripts/train_referit3d.py \
--mode evaluate \
-scannet-file $PATH_OF_SCANNET_FILE$ \
-referit3D-file $PATH_OF_REFERIT3D_FILE$ \
--bert-pretrain-path $PATH_OF_BERT$ \
--log-dir logs/MVT_nr3d \
--resume-path $the_path_to_the_model.pth$ \
--n-workers 8 \
--model 'referIt3DNet_transformer' \
--unit-sphere-norm True \
--batch-size 24 \
--encoder-layer-num 3 \
--decoder-layer-num 4 \
--decoder-nhead-num 8 \
--gpu "0" \
--view_number 4 \
--rotate_number 4 \
--label-lang-sup True
```
- To test a jointly trained model, add the following argument to the above command:
```bash
--augment-with-sr3d sr3d_dataset_file.csv
```
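Purely as an illustration of how the multi-seed numbers mentioned above are aggregated (the accuracy values below are made up and this is not the repository's analyze_predictions code):

```python
# Illustrative aggregation of per-seed accuracies into the reported mean +/- std.
import statistics

per_seed_accuracy = [64.1, 64.4, 63.8]  # hypothetical accuracies from three seeds
mean = statistics.mean(per_seed_accuracy)
std = statistics.pstdev(per_seed_accuracy)  # population std; use stdev() for sample std
print(f"accuracy: {mean:.2f} +/- {std:.2f}")
```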
The project is built on top of the ReferIt3D and MVT-3DVG codebases. For questions, please contact eslam.abdelrahman@kaust.edu.sa.
```bibtex
@article{bakr2023cot3dref,
title={CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding},
author={Bakr, Eslam Mohamed and Ayman, Mohamed and Ahmed, Mahmoud and Slim, Habib and Elhoseiny, Mohamed},
journal={arXiv preprint arXiv:2310.06214},
year={2023}
}
```