EgoTV: Egocentric Task Verification from Natural Language Task Descriptions

Official code of our ICCV 2023 paper.

Quickstart

Clone repo.

git clone https://github.com/facebookresearch/EgoTV.git

Download EgoTV dataset and set env paths.

export DATA_ROOT=<path to dataset>
export BASELINES=$(pwd)/EgoTV/baselines

Listed below are two ways of setting up the background requirements.

1. Using virtual environment

Install all requirements to run baselines. All baseline models are in the filepath: baselines/all_train

conda create -n <venv> python==3.10.0  # substitute with your own venv
source activate <venv>
bash $BASELINES/install_baseline_requirements.sh

2. Using Docker

Alternatively, we provide a Dockerfile for an easier setup. Ensure you have Docker installed.

docker pull rishihazra/alfred-dgx:torch-1.11.0
docker run -it rishihazra/alfred-dgx:torch-1.11.0

You can also generate your custom dataset. See DATA_GENERATION.md for details.

EgoTV file structure

dataset/
├── test_splits
│   ├── abstraction
│   ├── novel scenes
│   ├── novel tasks
│   └── novel steps
└── train
|   ├── heat_then_clean_then_slice
|   │   └── Apple-None-None-27
|   │       └── trial_T20220917_235349_019133
|   │           ├── pddl_states
|   │           ├── traj_data.json
|   │           └── video.mp4

More details of datafiles can be found in EgoTV/alfred/README.md.

Play with EgoTV samples

We provide a notebook to download and analyze the EgoTV samples. Refer ablations/data_analysis.ipynb

Baselines

To run baselines. Here can be replaced by clip4clip, coca, text2text, videoclip.

./run_scripts/run_<baseline>_train.sh  # for train
./run_scripts/run_<baseline>_test.sh  # for test
# if data split not preprocessed, specify "--preprocess" in the run instruction
# for attention-based models, specify "--attention" in the run instruction
# to resume training from a previously stored checkpoint, specify "--resume" in the run instruction

1. VIOLIN

text encoders: GloVe, (Distil)-BERT uncased [10], CLIP [5]
visual_encoders: ResNet18, I3D [4], S3D [7], MViT [6], CLIP [5]

Note: To run the I3D and S3D models, download the pretrained model (rgb_imagenet.pt, S3D_kinetics400.pt) from these repositories respectively:

mkdir $BASELINES/i3d/models
wget -P $BASELINES/i3d/models "https://github.com/piergiaj/pytorch-i3d/tree/master/models/rgb_imagenet.pt" "https://github.com/piergiaj/pytorch-i3d/tree/master/models/rgb_charades.pt"
wget -P $BASELINES/s3d "https://drive.google.com/uc?export=download&id=1HJVDBOQpnTMDVUM3SsXLy0HUkf_wryGO"

2. CoCa

Download CoCa model from OpenCLIP (coca_ViT-B-32 finetuned on mscoco_finetuned_laion2B-s13B-b90k)

3. VideoCLIP

The VideoCLIP has conflicting packages with EgoTV, hence we setup a new environment for it.

create a new conda env since the packages used are different from EgoTV packages

conda create -n videoclip python=3.8.8
source activate videoclip

clone the repo and run the following installations

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install -e .  # also optionally follow fairseq README for apex installation for fp16 training.
export MKL_THREADING_LAYER=GNU  # fairseq may need this for numpy
cd examples/MMPT  # MMPT can be in any folder, not necessarily under fairseq/examples.
pip install -e .
pip install transformers==3.4

download the checkpoint using

wget -P runs/retri/videoclip/ "https://dl.fbaipublicfiles.com/MMPT/retri/videoclip/checkpoint_best.pt"

Run proScript

NSG Model

Setup proScript. Details of proScript can be found in baselines/proScript. <--output_type 'nl'> for natural language graph output; <--output_type 'dsl'> for domain-specific language graph output (default: dsl)

source activate alfred_env
export DATA_ROOT=<path to dataset>
export BASELINES=$(pwd)/EgoTV/baselines
cd $BASELINES/proScript

# train
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 train_supervised.py --num_workers 4 --batch_size 32 --preprocess --test_split <> --run_id <> --epochs 20
# test
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 test.py --num_workers 4 --batch_size 32 --preprocess --test_split <> --run_id <>

Run NSG.

./run_scripts/run_nsg_train.sh  # for nsg train
./run_scripts/run_nsg_test.sh  # for nsg test

Citing EgoTV

If you find this codebase helpful for your work, please cite our paper:

@InProceedings{Hazra_2023_ICCV,
    author    = {Hazra, Rishi and Chen, Brian and Rai, Akshara and Kamra, Nitin and Desai, Ruta},
    title     = {EgoTV: Egocentric Task Verification from Natural Language Task Descriptions},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {15417-15429}
}

References

[1] Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, Jingjing Liu "VIOLIN: A Large-Scale Dataset for Video-and-Language Inference". In CVPR 2020
[2] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, Ali Farhadi "AI2-THOR: An Interactive 3D Environment for Visual AI"
[3] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, Dieter Fox "ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks" In CVPR 2020
[4] Joao Carreira, Andrew Zisserman "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset" In CVPR 2017
[5] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever "Learning Transferable Visual Models From Natural Language Supervision" In ICML 2021
[6] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer "Multiscale Vision Transformers" In ICCV 2021
[7] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, Kevin Murphy "Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification" In ECCV 2018
[8] Keisuke Sakaguchi, Chandra Bhagavatula, Ronan Le Bras, Niket Tandon, Peter Clark, Yejin Choi "proScript: Partially Ordered Scripts Generation" In Findings of EMNLP 2021
[9] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" In JMLR 2020
[10] Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter"
[11] Jeffrey Pennington, Richard Socher, Christopher Manning "GloVe: Global Vectors for Word Representation" In EMNLP 2014
[12] Yu, Jiahui, et al. "Coca: Contrastive captioners are image-text foundation models.", In Transactions on Machine Learning Research (2022)
[13] Xu, Hu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. "Videoclip: Contrastive pre-training for zero-shot video-text understanding." In EMNLP 2021
[14] Luo, Huaishao, et al. "CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning." Neurocomputing (2022)
[15] Zeng, Andy, et al. "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language" ICLR 2023

License

The majority of EgoTV is licensed under CC-BY-NC, however, portions of the projects are available under separate license terms: Howto100M, I3D and HuggingFace Transformers are licensed under the Apache2.0 license; S3D and CLIP are licensed under the MIT license; CrossTask and MViT are licensed under the BSD-3.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
ablations		ablations
alfred		alfred
baselines		baselines
nsg		nsg
run_scripts		run_scripts
.DS_Store		.DS_Store
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTION.md		CONTRIBUTION.md
DATA_GENERATION.md		DATA_GENERATION.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
egoTV.png		egoTV.png
img_1.png		img_1.png
nsg-pos.gif		nsg-pos.gif
requirements_docker.txt		requirements_docker.txt

License

facebookresearch/EgoTV

Folders and files

Latest commit

History

Repository files navigation

EgoTV: Egocentric Task Verification from Natural Language Task Descriptions

Quickstart

1. Using virtual environment

2. Using Docker

EgoTV file structure

Play with EgoTV samples

Baselines

Run proScript

NSG Model

Citing EgoTV

References

License

About

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Languages