Skip to content


Switch branches/tags

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

Language2Pose:Natural Language Grounded Pose Forecasting


There are 5 steps to running this code

  • Python Virtual Environment and dependencies
  • Data download and preprocessing
  • Training
  • Sampling
  • Rendering

PS: The implementation of one of the baselines, proposed by Lin et al.[1], was not publicly available and hence we make use of our implementation of their model to generate all the results and animations marked as Lin et al. Due to the differences in training hyperparameters, dataset and experiments, the numbers reported for Lin et al. in our paper differ from the ones in the original paper [1].

PS: This repo, at the moment, is functional at best. Feel free to create issues/pull requests however you see fit.

Python Virtual Environment

Anaconda is recommended to create the virtual environment

conda create -f env.yaml
source activate torch

To handle the logistics of saving/loading models pycasper is used

git clone
cd src 
ln -s ../pycasper/pycasper .
cd ..



We use KIT Motion-Language Dataset which can be downloaded here

mkdir dataset/kit-mocap
unzip -d dataset/kit-mocap

Download Word2Vec binaries

Download the binary file here and place it in src/s2v

Pre-trained Models

Download pretrained models here and place it in src/save


python data/ -dataset KITMocap -path2data ../dataset/kit-mocap

Rendering Ground Truths

python -dataset KITMocap -path2data ../dataset/kit-mocap/new_fke -feats_kind fke

Calculating mean+variance for Z-Normalization

python dataProcessing/ -mask '[0]' -feats_kind rifke -dataset KITMocap -path2data ../dataset/kit-mocap -f_new 8


We train the models using a script (Pardon the misnomer; initially it was supposed to be word conditioned pose forecasting but then I ended up adding sentence conditioned pose forecasting as well and was too lazy to change the filename.)

All the arguments (and their corresponding help texts) used for training can be found in src/ (PS: Some of them might be deprecated, but I have not removed them in case it breaks any of the other code that I might have written in the experimentation phase. Please raise an issue/ or send me an email if you have any clarification questions about any of the arguments). It would be good to stick to the args used in the examples if you want to play with the models in the paper.

  • JL2P
python -batch_size 100 -cpk jl2p -curriculum 1 -dataset KITMocap -early_stopping 1 -exp 1 -f_new 8 -feats_kind rifke -losses "['SmoothL1Loss']" -lr 0.001 -mask "[0]" -model Seq2SeqConditioned9 -modelKwargs "{'hidden_size':1024, 'use_tp':False, 's2v':'lstm'}" -num_epochs 1000 -path2data ../dataset/kit-mocap -render_list subsets/render_list -s2v 1 -save_dir save/model/ -tb 1 -time 16 -transforms "['zNorm']" 

-modelKwargs need some explaination as they could vary based on the model

hidden_size: size of the joint embedding
use_tp: use a trajectory predictor [1]. False for JL2P models
s2v: sentence to vector model ('lstm' or 'bert')
  • Our Implementation of Lin et. al. [1]
python -batch_size 100 -cpk lin -curriculum 0 -dataset KITMocap -early_stopping 1 -exp 1 -f_new 8 -feats_kind rifke -losses "['MSELoss']" -lr 0.001 -mask "[0]" -model Seq2Seq -modelKwargs "{'hidden_size':1024, 'use_tp':True, 's2v':'lstm'}" -num_epochs 1000 -path2data ../dataset/kit-mocap -render_list subsets/render_list -s2v 1 -save_dir save/model -tb 1 -time 16 -transforms "['zNorm']"

This model has 2 training steps. uses a seq2seq model to first learn an embedding for pose sequences. Once the training is complete, is called which optimizes to map from language embeddings to pose embeddings.


Sampling from trained Models

The training scripts will sample after the stopping criterion has reached, but if you would like to manually sample run the following script

python -load <path-to-weights.p>

<path-to-weights.p> ends in _weights.p

Using Pretrained Models

Make sure you have downloaded the pre-trained models as described here.

  • JL2P
python -load save/jl2p/exp_726_cpk_jointSampleStart_model_Seq2SeqConditioned9_time_16_chunks_1_weights.p
  • Our Implementation for Lin et. al. [1]
python -load save/lin-et-al/exp_700_cpk_mooney_model_Seq2SeqConditioned10_time_16_chunks_1_weights.p 


After sampling, it would be nice to see what animation does the model generates. We only use the test samples for rendering.

If possible, use a machine with many cpu cores, as rendering animations on matplotlib is painfully slow. uses all the available cores for parallel processing.

Using your trained model

python -dataset KITMocap -load <path-to-weights.p> -feats_kind fke -render_list subsets/render_list

Using pre-trained Models

  • JL2P
python -dataset KITMocap -load save/jl2p/exp_726_cpk_jointSampleStart_model_Seq2SeqConditioned9_time_16_chunks_1_weights.p -feats_kind fke -render_list subsets/render_list
  • Our Implementation for Lin et. al. [1]
python -dataset KITMocap -load save/lin-et-al/exp_700_cpk_mooney_model_Seq2SeqConditioned10_time_16_chunks_1_weights.p -feats_kind fke -render_list subsets/render_list


[1]: Lin, Angela S., et al. "1. Generating Animated Videos of Human Activities from Natural Language Descriptions." Learning 2018 (2018).


Language2Pose: Natural Language Grounded Pose Forecasting







No releases published


No packages published