What Can You Learn from Your Muscles?
Learning Visual Representation from Human Interactions

K Ehsani, D Gordon, T Nguyen, R Mottaghi, A Farhadi

Published at ICLR 2021

(Project Page) (PDF) (Slides) (Video) (Presentation)


Learning effective representations of visual data that generalize to a variety of downstream tasks has been a long-standing quest for computer vision. Most representation learning approaches rely solely on visual data such as images or videos. In this paper, we explore a novel approach, where we use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations. For this study, we collect a dataset of human interactions capturing body part movements and gaze in people's daily lives. Our experiments show that our self-supervised representation, which encodes interaction and attention cues, outperforms MoCo, a state-of-the-art visual-only method, on a variety of target tasks:

  1. Scene classification (semantic)
  2. Action recognition (temporal)
  3. Depth estimation (geometric)
  4. Dynamics prediction (physics)
  5. Walkable surface estimation (affordance)


  1. Clone the repository using the command:
git clone
cd muscleTorch
  2. Install requirements:
pip3 install -r requirements.txt
  3. Download the images from here and extract them to HumanDataset/images.
  4. Download the sensor data from here and extract it to HumanDataset/annotation_h5.
  5. Download the pretrained weights from here to reproduce the numbers in the paper, and extract them to HumanDataset/saved_weights.


We introduce a new dataset of human interactions for our representation learning framework. We record egocentric videos from a GoPro camera attached to the subject's forehead. We simultaneously capture body movements and gaze. We use Tobii Pro2 eye-tracking to track the center of the gaze in the camera frame. We record the body part movements using BNO055 Inertial Measurement Units (IMUs) in 10 different locations (torso, neck, 2 triceps, 2 forearms, 2 thighs, and 2 legs).

The structure of the dataset is as follows:

├── images
│   └── <video_stamp>
│       └── images_<video_stamp>_<INDEX>.jpg
├── annotation_h5
│   ├── [test/train]_<feature_name>.h5
│   ├── [test/train]_image_name.json
│   ├── [test/train]_h5pyind_2_frameind.json
│   └── [test/train]_timestamp.json
└── saved_weights
    ├── trained_representations
    │   └── <Learned_Representations>.pytar
    └── trained_end_tasks
        ├── Action_Recognition
        ├── Depth_Estimation
        ├── Dynamic_Prediction
        ├── Scene_Classification
        └── Walkable_Surface_Estimation
            └── <Trained_End_Tasks_Weights>.pytar
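
Before training, it can help to confirm that the extracted files match the layout above. The following is a minimal sketch using only the Python standard library, not code from this repository. The helper names `annotation_paths` and `missing_entries` are hypothetical, the root folder `HumanDataset` is taken from the installation steps, and the feature names `gaze_points` and `move_label` are an assumption borrowed from the `--input_feature_type` values in the training command.

```python
from pathlib import Path

# Top-level folders named in the layout above.
TOP_LEVEL_DIRS = ("images", "annotation_h5", "saved_weights")
# JSON side files listed under annotation_h5.
JSON_SIDE_FILES = ("image_name", "h5pyind_2_frameind", "timestamp")


def annotation_paths(root, split, feature_names):
    """Map each feature or side-file name to its expected path for one split."""
    if split not in ("train", "test"):
        raise ValueError(f"split must be 'train' or 'test', got {split!r}")
    ann_dir = Path(root) / "annotation_h5"
    # One .h5 file per feature: [test/train]_<feature_name>.h5
    paths = {name: ann_dir / f"{split}_{name}.h5" for name in feature_names}
    # The three JSON side files share the same split prefix.
    for name in JSON_SIDE_FILES:
        paths[name] = ann_dir / f"{split}_{name}.json"
    return paths


def missing_entries(root, split, feature_names):
    """Return the expected folders and files that are not present on disk."""
    root = Path(root)
    expected = [root / d for d in TOP_LEVEL_DIRS]
    expected += list(annotation_paths(root, split, feature_names).values())
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    # Assumed feature names; replace them with the .h5 files you downloaded.
    features = ["gaze_points", "move_label"]
    for path in missing_entries("HumanDataset", "train", features):
        print("missing:", path)
```

The script prints one line per missing folder or file and prints nothing when everything is in place. It only checks names, not file contents.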


To train your own model:

python3 --gpu-ids 0 --arch MoCoGazeIMUModel --input_length 5 --sequence_length 5 --output_length 5 \
--dataset HumanContrastiveCombinedDataset --workers 20 --num_classes -1 --loss MoCoGazeIMULoss \
--num_imus 6 --imu_names neck body llegu rlegu larmu rarmu \
--input_feature_type gaze_points move_label --base-lr 0.0005 --dropout 0.5 --data PATHTODATA/human_data 
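
The flags in the command above can be read as a small configuration. The sketch below is a hypothetical `argparse` parser that covers only those flags; it is not the repository's actual parser, and the types, the list-valued `--gpu-ids`, and the consistency check between `--num_imus` and `--imu_names` are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the flags used in the command above."""
    p = argparse.ArgumentParser(description="Sketch of the training flags")
    p.add_argument("--gpu-ids", type=int, nargs="+", default=[0])
    p.add_argument("--arch", type=str)
    p.add_argument("--input_length", type=int)
    p.add_argument("--sequence_length", type=int)
    p.add_argument("--output_length", type=int)
    p.add_argument("--dataset", type=str)
    p.add_argument("--workers", type=int)
    p.add_argument("--num_classes", type=int)
    p.add_argument("--loss", type=str)
    p.add_argument("--num_imus", type=int)
    p.add_argument("--imu_names", type=str, nargs="+")
    p.add_argument("--input_feature_type", type=str, nargs="+")
    p.add_argument("--base-lr", type=float)
    p.add_argument("--dropout", type=float)
    p.add_argument("--data", type=str)
    return p


def parse_training_args(argv):
    """Parse argv and check that --num_imus matches the number of IMU names."""
    args = build_parser().parse_args(argv)
    if args.num_imus is not None and args.imu_names is not None:
        if args.num_imus != len(args.imu_names):
            raise ValueError(
                f"--num_imus is {args.num_imus} but "
                f"{len(args.imu_names)} IMU names were given"
            )
    return args


# The same flag values as in the command above.
EXAMPLE_ARGV = (
    "--gpu-ids 0 --arch MoCoGazeIMUModel --input_length 5 --sequence_length 5 "
    "--output_length 5 --dataset HumanContrastiveCombinedDataset --workers 20 "
    "--num_classes -1 --loss MoCoGazeIMULoss --num_imus 6 "
    "--imu_names neck body llegu rlegu larmu rarmu "
    "--input_feature_type gaze_points move_label --base-lr 0.0005 "
    "--dropout 0.5 --data PATHTODATA/human_data"
).split()

args = parse_training_args(EXAMPLE_ARGV)
print(args.arch, args.imu_names, args.input_feature_type)
```

Note that `--num_imus 6` agrees with the six names passed to `--imu_names`; the check in `parse_training_args` simply makes that relationship explicit.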

See scripts/ for additional training scripts.

End-task fine-tuning and testing

To test using the pretrained model and reproduce the results in the paper, refer to scripts/.


If you find this project useful in your research, please consider citing:

@article{ehsani2021learning,
     title={Learning Visual Representation from Human Interactions},
     author={Ehsani, Kiana and Gordon, Daniel and Nguyen, Thomas and Mottaghi, Roozbeh and Farhadi, Ali},
     journal={International Conference on Learning Representations},
     year={2021}
}

