This repository has been archived by the owner. It is now read-only.



Self-supervised learning through the eyes of a child

This repository contains code for reproducing the results reported in the following paper:

Orhan AE, Gupta VV, Lake BM (2020) Self-supervised learning through the eyes of a child. Advances in Neural Information Processing Systems 33 (NeurIPS 2020).


Requirements

  • pytorch == 1.5.1
  • torchvision == 0.6.1

Slightly older or newer versions will probably work fine as well.
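
To compare a local environment against the versions listed above, something along the following lines may help. This is a minimal sketch, not part of the repository: `installed_versions` and `PINNED` are hypothetical names, and it assumes Python 3.8 or newer (for `importlib.metadata`). Note that PyTorch is distributed on pip under the name `torch`.

```python
from importlib import metadata

# Versions listed in the requirements above; the pip distribution
# name for PyTorch is "torch".
PINNED = {"torch": "1.5.1", "torchvision": "0.6.1"}


def installed_versions(pinned=PINNED):
    """Return {package: installed version string, or None if not installed}."""
    found = {}
    for name in pinned:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: installed={version}, tested with={PINNED[name]}")
```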


Datasets

This project uses the SAYCam dataset described in the following paper:

Sullivan J, Mei M, Perfors A, Wojcik EH, Frank MC (2020) SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective. PsyArXiv.

The dataset is hosted on the Databrary repository for behavioral science. Unfortunately, we are unable to publicly share the SAYCam dataset here due to the terms of use. However, interested researchers can apply for access to the dataset with approval from their institution's IRB.

In addition, this project also uses the Toybox dataset for evaluation purposes. The Toybox dataset is publicly available at this address.

Code description

For specific usage examples, please see the Slurm scripts provided in the scripts directory.

Pre-trained models


Since the publication of the paper, we have found that training larger-capacity models for longer with the temporal classification objective significantly improves the evaluation results. We therefore provide below pre-trained resnext50_32x4d models, which are currently our best models trained on the SAYCam data. We encourage people to use these new models instead of the mobilenet_v2 models reported in the paper (the pre-trained mobilenet_v2 models from the paper are also provided below for the record).

Four pre-trained resnext50_32x4d models are provided here: temporal classification models trained on data from the individual children in the SAYCam dataset (TC-S-resnext, TC-A-resnext, TC-Y-resnext) and a temporal classification model trained on data from all three children (TC-SAY-resnext). These models were all trained for 16 epochs (with batch size 256) with the following data augmentation pipeline:

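import torchvision.transforms as tr

# GaussianBlur is not part of torchvision 0.6.1; it is assumed here to be a
# custom transform (taking a [min, max] sigma range) defined in this codebase.
train_transform = tr.Compose([
    tr.RandomResizedCrop(224, scale=(0.2, 1.)),
    tr.RandomApply([tr.ColorJitter(0.9, 0.9, 0.9, 0.5)], p=0.9),
    tr.RandomApply([GaussianBlur([.1, 2.])], p=0.5),
    tr.ToTensor(),  # assumed step: Normalize expects a tensor, not a PIL image
    tr.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])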
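
The pipeline above refers to a GaussianBlur transform that torchvision 0.6.1 does not provide. The following is a minimal sketch of what such a transform could look like, written in the style commonly used in SimCLR/MoCo-type codebases. It is an assumption, not necessarily the class used in this repository. It operates on PIL images and draws the blur strength uniformly from the given sigma range (Pillow's `radius` argument is the standard deviation of the Gaussian kernel).

```python
import random

from PIL import Image, ImageFilter


class GaussianBlur:
    """Blur a PIL image with sigma drawn uniformly from [sigma[0], sigma[1]]."""

    def __init__(self, sigma=(0.1, 2.0)):
        self.sigma = sigma

    def __call__(self, img):
        # Sample a new blur strength on every call.
        sigma = random.uniform(self.sigma[0], self.sigma[1])
        return img.filter(ImageFilter.GaussianBlur(radius=sigma))


if __name__ == "__main__":
    img = Image.new("RGB", (224, 224), color=(128, 64, 32))
    blurred = GaussianBlur([.1, 2.])(img)
    print(blurred.size, blurred.mode)
```

Because it takes and returns a PIL image, a transform like this would have to be applied before any conversion to tensors.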

This data augmentation pipeline is similar to that used in the SimCLR paper with slightly larger random crops and slightly stronger color augmentation. Here are some evaluation results for these resnext50_32x4d models (to download the models, click on the links over the model names):

Model          | Toybox (iid) | Toybox (exemplar) | ImageNet (linear) | ImageNet (1% ft + linear)
TC-SAY-resnext | 90.0         | 57.5              | 36.0              | 45.6
TC-S-resnext   | 88.5         | 54.9              | --                | --
TC-A-resnext   | 86.8         | 50.4              | --                | --
TC-Y-resnext   | 87.0         | 53.0              | --                | --

Here, ImageNet (linear) refers to the top-1 validation accuracy on ImageNet with only a linear classifier trained on top of the frozen features, and ImageNet (1% ft + linear) is similar but with the entire model first fine-tuned on 1% of the ImageNet training data (~12800 images). Note that these are results from a single run, so you may observe slightly different numbers.

These models come with the temporal classification heads attached. To load these models, please do something along the lines of:

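import torch
import torchvision.models as models

n_out = 6269  # head size for TC-SAY-resnext (see below for the other models)

model = models.resnext50_32x4d(pretrained=False)
model.fc = torch.nn.Linear(in_features=2048, out_features=n_out, bias=True)
# Wrap in DataParallel before loading so parameter names match the checkpoint.
model = torch.nn.DataParallel(model).cuda()

checkpoint = torch.load('TC-SAY-resnext.tar')
# Assumption: the weights are stored under the 'model_state_dict' key;
# inspect checkpoint.keys() if this lookup fails.
model.load_state_dict(checkpoint['model_state_dict'])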

where n_out should be 6269 for TC-SAY-resnext, 2765 for TC-S-resnext, 1786 for TC-A-resnext, and 1718 for TC-Y-resnext. These values differ because the datasets differ in length.
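
For loading a checkpoint on a machine without a GPU, a sketch along the following lines may be useful. It rests on two assumptions that should be checked against the actual files: that the weights are stored under a `model_state_dict` key, and that the parameter names carry the `module.` prefix that `DataParallel` adds. The names `N_OUT`, `strip_module_prefix`, and `load_on_cpu` are hypothetical helpers, not part of this repository.

```python
from collections import OrderedDict

# Output sizes of the temporal classification heads, as listed above.
N_OUT = {
    "TC-SAY-resnext": 6269,
    "TC-S-resnext": 2765,
    "TC-A-resnext": 1786,
    "TC-Y-resnext": 1718,
}


def strip_module_prefix(state_dict):
    """Drop the leading 'module.' that DataParallel adds to parameter names."""
    prefix = "module."
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


def load_on_cpu(name, path, key="model_state_dict"):
    """Build the model for `name` and load its weights without DataParallel or CUDA."""
    # Imported lazily so the helpers above can be used without PyTorch installed.
    import torch
    import torchvision.models as models

    model = models.resnext50_32x4d()
    model.fc = torch.nn.Linear(in_features=2048, out_features=N_OUT[name], bias=True)
    checkpoint = torch.load(path, map_location="cpu")
    # `key` is an assumption about the checkpoint layout; check checkpoint.keys().
    model.load_state_dict(strip_module_prefix(checkpoint[key]))
    return model.eval()
```

A call such as `load_on_cpu("TC-SAY-resnext", "TC-SAY-resnext.tar")` would then return the model in evaluation mode.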

In addition, please find below the best performing ImageNet models reported above: a model with a linear ImageNet classifier trained on top of the frozen features of TC-SAY-resnext (TC-SAY-resnext-IN-linear) and a model that was first fine-tuned with 1% of the ImageNet training data (TC-SAY-resnext-IN-1pt-linear):

You can load these models in the same way as described above. Since these are ImageNet models, n_out should be set to 1000.


The following are the pre-trained mobilenet_v2 type models reported in the paper:


Acknowledgments

We are very grateful to the volunteers who contributed recordings to the SAYCam dataset. We thank Jessica Sullivan for her generous assistance with the dataset. We also thank the team behind the Toybox dataset, as well as the developers of PyTorch and torchvision, for making this work possible. This project was partly funded by NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science.

