
Learning to Segment Actions from Observation and Narration

Code for the paper:
Learning to Segment Actions from Observation and Narration
Daniel Fried, Jean-Baptiste Alayrac, Phil Blunsom, Chris Dyer, Stephen Clark, and Aida Nematzadeh
ACL, 2020


This repository provides a system for segmenting and labeling actions in a video, using a simple generative segmental (hidden semi-Markov) model of the video. This model can be used as a strong baseline for action segmentation on instructional video datasets such as CrossTask (Zhukov et al., CVPR 2019), and can be trained fully supervised (with action labels for each frame in each video) or with weak supervision from narrative descriptions and "canonical" step orderings. Please see our paper for more details.
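The actual model is implemented on top of pytorch-struct, but the core idea of semi-Markov decoding can be illustrated compactly. The sketch below is a minimal numpy Viterbi decoder for a segmental model: it scores every candidate segment (a span of frames sharing one label) by summing per-frame label scores, adds log transition scores between segments, and backtraces the best segmentation. All names and the exact scoring scheme here are illustrative, not the repository's code.

```python
import numpy as np

def semimarkov_viterbi(frame_scores, transition, max_len):
    """Viterbi decoding for a simple semi-Markov (HSMM-style) model.

    frame_scores: (T, K) per-frame log-likelihoods for each of K labels
    transition:   (K, K) log-probabilities, transition[i, j] = log p(j | i)
    max_len:      maximum segment length in frames
    Returns a list of (start, end, label) segments covering frames [0, T).
    """
    T, K = frame_scores.shape
    # prefix sums so a segment's emission score is an O(1) lookup
    cum = np.vstack([np.zeros((1, K)), np.cumsum(frame_scores, axis=0)])
    # best[t, k]: score of the best segmentation of frames [0, t) ending in label k
    best = np.full((T + 1, K), -np.inf)
    best[0, :] = 0.0
    back = {}  # (t, k) -> (segment start, previous label or None)
    for t in range(1, T + 1):
        for length in range(1, min(max_len, t) + 1):
            s = t - length
            seg = cum[t] - cum[s]  # (K,) emission score of segment [s, t) per label
            if s == 0:
                cand = seg  # first segment: no incoming transition
                prev_label = [None] * K
            else:
                prev = best[s][:, None] + transition  # (K_prev, K)
                prev_label = prev.argmax(axis=0)
                cand = prev.max(axis=0) + seg
            for k in range(K):
                if cand[k] > best[t, k]:
                    best[t, k] = cand[k]
                    back[(t, k)] = (s, prev_label[k])
    # backtrace from the best final label
    k = int(best[T].argmax())
    t, segs = T, []
    while t > 0:
        s, pk = back[(t, k)]
        segs.append((s, t, k))
        t, k = s, (pk if pk is not None else k)
    return segs[::-1]
```

For example, six frames whose scores favor label 0 then label 1 decode into two segments with a single transition between them.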


Requirements

  • python 3.6
  • pytorch 1.2
  • The semimarkov branch of my fork of pytorch-struct. (Newer versions may run out of memory on the long videos in the CrossTask dataset, due to changes to pytorch-struct that improve runtime complexity but increase memory usage.) It can be installed via
pip install git+

See env.yml for a full list of other dependencies, which can be installed with conda.


Data Preprocessing

  1. Download and unpack the CrossTask dataset of Zhukov et al.:
cd data
mkdir crosstask
unzip '*.zip'
  2. Preprocess the features with PCA. In the repository's root folder, run
PYTHONPATH="src/":$PYTHONPATH python src/data/

This should generate the folder data/crosstask/crosstask_processed/crosstask_primary_pca-200_with-bkg_by-task
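As a rough illustration of what this PCA preprocessing step computes, here is a minimal numpy-only sketch that projects per-frame features onto their top principal components. The function name and return signature are our own, not the repository's, and the actual script's handling of tasks and background frames is more involved.

```python
import numpy as np

def pca_reduce(features, n_components=200):
    """Project frame features onto their top principal components.

    features: (N, D) array of per-frame feature vectors
    Returns the (N, n_components) projected features plus the fitted mean and
    components, so the same projection can be applied to held-out videos.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered by decreasing singular value (explained variance)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    return centered @ components.T, mean, components
```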


Replicating Results

Here are the commands to replicate key results from Table 2 in our paper. Please contact Daniel Fried for others, or for any help or questions about the code.

| Number | Name | Command |
| --- | --- | --- |
| S6 | Supervised: SMM, generative | ./ pca_semimarkov_sup --classifier semimarkov --training supervised |
| U7 | HSMM + Narr + Ord | ./ pca_semimarkov_unsup_narration_ordering --classifier semimarkov --training unsupervised --mix_tasks --task_specific_steps --sm_constrain_transitions --annotate_background_with_previous --sm_constrain_with_narration train --sm_constrain_narration_weight=-1e4 --cuda |
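To give a sense of what constraining transitions with a "canonical" step ordering means, the sketch below builds a log-space transition mask that only allows moving forward through a task's steps, with a background state reachable between steps. This is a hypothetical helper, not the repository's code: the actual model's constraints (e.g. annotating background states with the preceding step, as in the U7 command's flags) are more elaborate, and the soft penalty value here simply echoes the -1e4 weight used above.

```python
import numpy as np

NEG_INF = -1e4  # soft penalty rather than a hard -inf, echoing the weight above

def ordered_transition_mask(num_steps, background=True):
    """Log-space transition mask enforcing a canonical step ordering.

    States 0..num_steps-1 are a task's steps in canonical order; an optional
    final state is background. Allowed moves: step k -> step k+1, any step ->
    background, background -> any step, and a background self-loop. All other
    transitions receive a large penalty.
    """
    n = num_steps + (1 if background else 0)
    mask = np.full((n, n), NEG_INF)
    for k in range(num_steps - 1):
        mask[k, k + 1] = 0.0          # advance to the next canonical step
    if background:
        bg = num_steps
        mask[:num_steps, bg] = 0.0    # any step may be followed by background
        mask[bg, :num_steps] = 0.0    # background may be followed by any step
        mask[bg, bg] = 0.0            # background self-loop
    return mask
```

Adding such a mask to a model's transition scores makes out-of-order step sequences effectively impossible under decoding while leaving in-order paths unpenalized.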


Acknowledgments

  • Parts of the data loading and evaluation code are based on this repo from Anna Kukleva.
  • Code for invertible emission distributions is based on Junxian He's structured flow code. (These didn't make it into the paper -- I wasn't able to get them to work consistently better than Gaussian emissions over the PCA features.)
  • Compound HSMM / VAE models are based on Yoon Kim's Compound PCFG code. (These also didn't make it into the paper, for the same reasons.)

