Official code of Video-Mined Task Graphs for Keystep Recognition in Instructional Videos, NeurIPS 2023.
This paper proposes learning a task graph to regularize keystep predictions. The proposed method outperforms prior work on zero-shot keystep recognition on the CrossTask and COIN datasets. We also use the task graph to pseudo-label a large-scale instructional video dataset (HowTo100M); representation learning with the obtained labels improves downstream task performance.
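To give a rough intuition for how a task graph can regularize per-clip keystep predictions, here is a minimal Viterbi-style sketch. It is not the implementation in this repository: the function name `decode_with_task_graph`, the `weight` parameter, and the toy log-scores are all assumptions made for illustration, and the paper's actual graph construction and path search may differ.

```python
def decode_with_task_graph(scores, transitions, weight=1.0):
    """Pick one keystep per clip, balancing clip scores against graph edges.

    scores[t][k]      : hypothetical log-score of clip t for keystep k
    transitions[i][j] : hypothetical log-prior of keystep j following keystep i
    weight            : how strongly the graph prior influences the result
    """
    num_clips, num_steps = len(scores), len(scores[0])
    best = [list(scores[0])]  # best[t][k]: best path score ending in k at clip t
    back = []                 # back[t-1][k]: predecessor of k on that best path
    for t in range(1, num_clips):
        row_best, row_back = [], []
        for j in range(num_steps):
            cands = [best[-1][i] + weight * transitions[i][j]
                     for i in range(num_steps)]
            i_star = max(range(num_steps), key=cands.__getitem__)
            row_best.append(cands[i_star] + scores[t][j])
            row_back.append(i_star)
        best.append(row_best)
        back.append(row_back)
    # Trace the highest-scoring path backwards from the last clip.
    path = [max(range(num_steps), key=best[-1].__getitem__)]
    for row_back in reversed(back):
        path.append(row_back[path[-1]])
    return path[::-1]


# Toy example (made-up numbers): clip 1 slightly prefers keystep 2 on its own,
# but the graph says keystep 0 is usually followed by keystep 1.
toy_scores = [[0.0, -5.0, -5.0], [-5.0, -1.0, -0.9], [-5.0, -5.0, 0.0]]
toy_graph = [[-1.0, 0.0, -10.0], [-10.0, -1.0, 0.0], [-10.0, -10.0, 0.0]]
print(decode_with_task_graph(toy_scores, toy_graph))              # [0, 1, 2]
print(decode_with_task_graph(toy_scores, toy_graph, weight=0.0))  # [0, 2, 2]
```

With `weight=0.0` the graph is ignored and the result is the per-clip argmax; with the graph prior enabled, the ambiguous middle clip is assigned the keystep that fits the expected order.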
Replace all data paths in the code with the locations of the sentence and feature files, which can be downloaded from here. Alternate links:
- coin_all_scores.zip
- coin_processed.zip
- crosstask_all_scores.zip
- crosstask_processed.zip
- videoclip_video_features_crosstask_s3d.zip
- videoclip_video_features_coin_s3d.zip
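Before editing the paths, it can help to confirm that every archive listed above has been downloaded. The following is a small sketch for that check; the helper name `missing_archives` and the idea of keeping all archives in a single download directory are assumptions, not something this repository prescribes.

```python
from pathlib import Path

# Archive names taken from the list above.
EXPECTED_ARCHIVES = [
    "coin_all_scores.zip",
    "coin_processed.zip",
    "crosstask_all_scores.zip",
    "crosstask_processed.zip",
    "videoclip_video_features_crosstask_s3d.zip",
    "videoclip_video_features_coin_s3d.zip",
]


def missing_archives(data_root):
    """Return the expected archive names that are not present in data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_ARCHIVES if not (root / name).is_file()]
```

Calling `missing_archives("path/to/downloads")` with your own download directory returns an empty list once all six files are in place.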
Navigate to the zero-shot repository and run the individual scripts with:
python text.py coin # text modality evaluation for the COIN dataset
python text.py crosstask # text modality evaluation for the CrossTask dataset
python video.py coin # video modality evaluation for the COIN dataset
python video.py crosstask # video modality evaluation for the CrossTask dataset
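If you prefer to launch all four evaluations in one go, a small driver along these lines may be convenient. It is only a sketch: the helper names `evaluation_commands` and `run_all` are hypothetical, and it assumes it is run from the directory that contains `text.py` and `video.py`.

```python
import subprocess
import sys

SCRIPTS = ("text.py", "video.py")
DATASETS = ("coin", "crosstask")


def evaluation_commands(python=sys.executable):
    """Build the four commands listed above, in the same order."""
    return [[python, script, dataset]
            for script in SCRIPTS
            for dataset in DATASETS]


def run_all():
    """Run each evaluation in turn, stopping at the first failure."""
    for cmd in evaluation_commands():
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Call `run_all()` once the data paths have been updated; `check=True` makes a failing script raise immediately rather than continuing silently.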
Running these scripts should reproduce the numbers reported in Table 1.
We use Video Distant Supervision to train the representation learning model, replacing the labels it provides with our task graph labels. We use the HowTo100M ASR narrations provided by this paper. The labels can be downloaded from here.
Feel free to open an issue in case of questions, or email me.
The instructional video representation learning code is based on the Distant Supervision repository. We thank the authors and maintainers of this codebase.
This codebase is licensed under the CC-BY-NC license.