🦜 Mockingjay


  • This repo is the legacy version from when the Mockingjay paper was first released.
  • For our improved and maintained implementation of Mockingjay, please visit the S3PRL project.

Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders


This is an open source project for Mockingjay, an unsupervised algorithm for learning speech representations introduced and described in the paper "Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders".

Feel free to use or modify the code; any bug report or improvement suggestion will be appreciated. If you have any questions, please contact the authors. If you find this project helpful for your research, please consider citing the paper. Thanks!


Pre-trained Models

You can find pre-trained models here:

Their usage is explained below and further in Step 3 of the Instructions section.

Extracting Speech Representations

With this repo and the trained models, you can extract speech representations from your target dataset. To do so, run the trained model forward on the target dataset and retrieve the extracted features with the following example Python code:

import torch
from runner_mockingjay import get_mockingjay_model

example_path = 'result/result_mockingjay/mockingjay_libri_sd1337_LinearLarge/mockingjay-500000.ckpt'
mockingjay = get_mockingjay_model(from_path=example_path)

# A batch of spectrograms: (batch_size, seq_len, mel_dim)
spec = torch.zeros(3, 800, 160)

# reps.shape: (batch_size, num_hidden_layers, seq_len, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=True, tile=True)

# reps.shape: (batch_size, num_hidden_layers, seq_len // downsample_rate, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=True, tile=False)

# reps.shape: (batch_size, seq_len, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=False, tile=True)

# reps.shape: (batch_size, seq_len // downsample_rate, hidden_size)
reps = mockingjay.forward(spec=spec, all_layers=False, tile=False)

spec is the input spectrogram of the mockingjay model where:

  • spec needs to be a PyTorch tensor with shape of (seq_len, mel_dim) or (batch_size, seq_len, mel_dim).
  • mel_dim is the spectrogram feature dimension, which is 160 by default; see utility/ for more preprocessing details.

reps is a PyTorch tensor of various possible shapes where:

  • batch_size is the inference batch size.
  • num_hidden_layers is the transformer encoder depth of the mockingjay model.
  • seq_len is the maximum sequence length in the batch.
  • downsample_rate is the number of consecutive feature vectors stacked together to reduce the length of input sequences.
  • hidden_size is the dimensionality of the transformer encoder layers.
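The shape comments in the example code can be summarized in a small helper. This is only a sketch for reasoning about shapes: `expected_reps_shape` is a hypothetical name that does not exist in this repo, the layer count and downsample rate used below are illustrative values rather than the settings of any released checkpoint, and the helper assumes seq_len is a multiple of downsample_rate.

```python
def expected_reps_shape(batch_size, seq_len, num_hidden_layers, hidden_size,
                        downsample_rate, all_layers, tile):
    """Return the shape of reps implied by the all_layers and tile flags."""
    # Without tiling, the time axis is shortened by the downsample rate.
    time_steps = seq_len if tile else seq_len // downsample_rate
    if all_layers:
        return (batch_size, num_hidden_layers, time_steps, hidden_size)
    return (batch_size, time_steps, hidden_size)


# Illustrative values only; read the real ones from your model config.
sizes = dict(batch_size=3, seq_len=900, num_hidden_layers=12,
             hidden_size=768, downsample_rate=3)
for all_layers in (True, False):
    for tile in (True, False):
        shape = expected_reps_shape(**sizes, all_layers=all_layers, tile=tile)
        print(f"all_layers={all_layers}, tile={tile}: {shape}")
```

A check like this can be useful when sizing the input layer of a downstream model before running the real forward pass.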

The output shape of reps is determined by the two arguments:

  • all_layers is a boolean which controls whether to output all the Encoder layers; if False, only the hidden states of the last Encoder layer are returned.
  • tile is a boolean which controls whether to tile representations to match the input seq_len of spec.
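To make frame stacking and tiling concrete, here is a minimal pure-Python sketch on nested lists. It assumes that stacking concatenates groups of downsample_rate consecutive frames, that leftover frames are dropped, and that tiling repeats each output frame downsample_rate times. The function names are hypothetical, and the repo's tensor implementation may differ in these details.

```python
def stack_frames(frames, downsample_rate):
    """Concatenate each group of `downsample_rate` consecutive frames into one."""
    # Assumption: leftover frames that do not fill a whole group are dropped.
    usable = len(frames) - len(frames) % downsample_rate
    return [
        [value for frame in frames[start:start + downsample_rate] for value in frame]
        for start in range(0, usable, downsample_rate)
    ]


def tile_frames(reps, downsample_rate):
    """Repeat every representation `downsample_rate` times along the time axis."""
    return [rep for rep in reps for _ in range(downsample_rate)]


# Seven toy frames with two features each (real inputs have mel_dim features).
frames = [[float(t), t + 0.5] for t in range(7)]
stacked = stack_frames(frames, downsample_rate=3)  # 2 frames of 6 values each
tiled = tile_frames(stacked, downsample_rate=3)    # 6 frames of 6 values each
print(len(frames), len(stacked), len(tiled))
```

Here the stacked frames stand in for the encoder outputs. Under these assumptions the tiled length equals the input length only when seq_len is a multiple of downsample_rate.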

As you can see, reps is essentially the Transformer Encoder hidden representations in the mockingjay model. You can think of Mockingjay as a speech version of BERT if you are familiar with it.

There are many ways to incorporate reps into your downstream task. One of the easiest ways is to take only the outputs of the last Encoder layer (i.e., all_layers=False) as the input features to your downstream model; feel free to explore other mechanisms.
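As one example of another mechanism, the sketch below learns a softmax-weighted sum over all layers (the all_layers=True output), mean-pools over time, and applies a linear classifier. This is an illustration and not something this repo provides: `LayerWeightedClassifier` is a hypothetical class, the tensor sizes are deliberately tiny, and a random tensor stands in for the real reps.

```python
import torch
import torch.nn as nn


class LayerWeightedClassifier(nn.Module):
    """Hypothetical downstream head over reps from all encoder layers."""

    def __init__(self, num_layers, hidden_size, class_num):
        super().__init__()
        # One learnable scalar per layer, normalized with softmax in forward().
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.output = nn.Linear(hidden_size, class_num)

    def forward(self, reps):
        # reps: (batch_size, num_layers, seq_len, hidden_size)
        weights = torch.softmax(self.layer_logits, dim=0)
        mixed = (weights.view(1, -1, 1, 1) * reps).sum(dim=1)  # (batch, seq_len, hidden)
        pooled = mixed.mean(dim=1)                             # (batch, hidden)
        return self.output(pooled)                             # (batch, class_num)


# Random stand-in for reps; small sizes keep the example light.
reps = torch.randn(3, 4, 50, 16)
model = LayerWeightedClassifier(num_layers=4, hidden_size=16, class_num=2)
logits = model(reps)
print(logits.shape)
```

With real features, reps would come from the all_layers=True, tile=True call, and num_layers and hidden_size would be read from its shape.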

Fine-tuning with your own downstream SLP tasks

With this repo and the trained models, you can fine-tune the pre-trained Mockingjay model on your own dataset and tasks. To do so, take a look at the following example Python code:

import torch
from runner_mockingjay import get_mockingjay_model
from downstream.model import example_classifier
from downstream.solver import get_mockingjay_optimizer

# setup the mockingjay model
example_path = 'result/result_mockingjay/mockingjay_libri_sd1337_MelBase/mockingjay-500000.ckpt'
solver = get_mockingjay_model(from_path=example_path)

# setup your downstream class model
# features extracted from the MelBase model have dimension 768
classifier = example_classifier(input_dim=768, hidden_dim=128, class_num=2).cuda()

# construct the Mockingjay optimizer
params = list(solver.mockingjay.named_parameters()) + list(classifier.named_parameters())
optimizer = get_mockingjay_optimizer(params=params, lr=4e-3, warmup_proportion=0.7, training_steps=50000)

# forward
example_inputs = torch.zeros(3, 800, 160) # A batch of spectrograms: (batch_size, seq_len, mel_dim)
reps = solver.forward_fine_tune(spec=example_inputs) # returns: (batch_size, seq_len, hidden_size)
loss = classifier(reps, torch.LongTensor([0, 1, 0]).cuda())

# update
loss.backward()
optimizer.step()

# save
PATH_TO_SAVE_YOUR_MODEL = 'example.ckpt'
states = {'Classifier': classifier.state_dict(), 'Mockingjay': solver.mockingjay.state_dict()}
torch.save(states, PATH_TO_SAVE_YOUR_MODEL)
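The lr, warmup_proportion, and training_steps arguments passed to get_mockingjay_optimizer suggest a BERT-style learning rate schedule. The sketch below assumes linear warmup to the peak rate followed by linear decay to zero. The schedule actually implemented by get_mockingjay_optimizer may differ, and `linear_warmup_decay_lr` is a hypothetical helper that only illustrates how the three arguments could interact.

```python
def linear_warmup_decay_lr(step, peak_lr, warmup_proportion, training_steps):
    """Learning rate at `step` under an assumed linear warmup and linear decay."""
    if not 0.0 < warmup_proportion < 1.0:
        raise ValueError("warmup_proportion must be strictly between 0 and 1")
    progress = step / training_steps
    if progress < warmup_proportion:
        # Warmup: rise linearly from 0 to peak_lr.
        return peak_lr * progress / warmup_proportion
    # Decay: fall linearly from peak_lr to 0 at the last training step.
    remaining = max(1.0 - progress, 0.0)
    return peak_lr * remaining / (1.0 - warmup_proportion)


# Same values as in the fine-tuning example.
for step in (0, 17500, 35000, 42500, 50000):
    lr = linear_warmup_decay_lr(step, peak_lr=4e-3, warmup_proportion=0.7,
                                training_steps=50000)
    print(step, lr)
```

Under this assumption, warmup_proportion=0.7 means the learning rate keeps rising for the first 35000 of the 50000 steps.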


Requirements

  • Python 3
  • PyTorch 1.3.0 or above
  • Computing power (a high-end GPU) and memory space (both RAM and GPU RAM) are extremely important if you'd like to train your own model.
  • Required packages and their use are listed below, and also in requirements.txt:
editdistance     # error rate calculation
joblib           # parallel feature extraction & decoding
librosa          # feature extraction (for feature extraction only)
pydub            # audio segmentation (for MOSEI dataset preprocessing only)
pandas           # data management
tensorboardX     # logger & monitor
torch            # model & learning
tqdm             # verbosity
yaml             # config parser
matplotlib       # visualization
ipdb             # optional debugger
numpy            # array computation
scipy            # for feature extraction

The above packages can be installed by the command:

pip install -r requirements.txt

Below we list packages that need special attention; we recommend installing them manually:

apex             # non-essential, faster optimization (only needed if enabled in config)
sentencepiece    # sub-word unit encoding (for feature extraction only, see its install instructions)


Instructions

Before you start, make sure all the required packages listed above are installed correctly.

Step 0. Preprocessing - Acoustic Feature Extraction & Text Encoding

See the Preprocess wiki page for preprocessing instructions.

Step 1. Configuring - Model Design & Hyperparameter Setup

All the parameters related to training/decoding are stored in a yaml file. Hyperparameter tuning and large-scale experiments can be managed easily this way. See config files for the exact format and examples.

Step 2. Training the Mockingjay Model for Speech Representation Learning

Once the config file is ready, run the following command to train unsupervised end-to-end Mockingjay:

python3 --train

All settings will be parsed from the config file automatically to start training; the log file can be accessed through TensorBoard.

Step 3. Using Pre-trained Models on Downstream Tasks

Once a Mockingjay model has been trained, we can use the generated representations on downstream tasks. See the Experiment section for reproducing the downstream task results mentioned in our paper, and see the Highlight section for incorporating the extracted representations into your own downstream task.

Pre-trained models and their configs can be downloaded from HERE. To load them with the default path, place the models under the directory given by --ckpdir=./result_mockingjay/ and specify the model file name manually with --ckpt=.

Step 4. Loading Pre-trained Models and Visualizing

Run the following commands to visualize the model-generated samples:

# visualize hidden representations
python3 --plot
# visualize spectrogram
python3 --plot --with_head

Note that the arguments --ckpdir=XXX --ckpt=XXX need to be set correctly for the above commands to run properly.
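To show how these flags might fit together, here is a hypothetical argparse sketch. The flag names come from this README, but the parser itself, the defaults, and the way --ckpdir and --ckpt are joined into one path are assumptions, not the repo's actual runner code.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser mirroring the flags mentioned in this README."""
    parser = argparse.ArgumentParser(description="Sketch of Mockingjay runner flags")
    parser.add_argument("--train", action="store_true", help="train the model")
    parser.add_argument("--plot", action="store_true", help="visualize model outputs")
    parser.add_argument("--with_head", action="store_true",
                        help="plot spectrograms instead of hidden representations")
    parser.add_argument("--ckpdir", default="./result_mockingjay/",
                        help="directory that holds checkpoints")
    parser.add_argument("--ckpt", default="", help="checkpoint file name")
    return parser


def checkpoint_path(args):
    # Assumption: the checkpoint is looked up as <ckpdir>/<ckpt>.
    return os.path.join(args.ckpdir, args.ckpt)


args = build_parser().parse_args(
    ["--plot", "--with_head", "--ckpdir=./result_mockingjay/", "--ckpt=example.ckpt"]
)
print(checkpoint_path(args))
```

If the real runner resolves paths differently, only `checkpoint_path` would need to change.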

Step 5. Monitor Training Log

# open TensorBoard to see log
tensorboard --logdir=log/log_mockingjay/mockingjay_libri_sd1337/
# or
python3 -m tensorboard.main --logdir=log/log_mockingjay/mockingjay_libri_sd1337/


Application on downstream tasks

See the instructions on the Downstream wiki page to reproduce our experiments.

Comparing with APC

See the instructions on the APC wiki page to reproduce our experiments.


  1. Montreal Forced Aligner, McAuliffe et al.
  2. CMU MultimodalSDK, Amir Zadeh.
  3. PyTorch Transformers, Hugging Face.
  4. Autoregressive Predictive Coding, Yu-An Chung.
  5. End-to-end ASR Pytorch, Alexander-H-Liu.
  6. Tacotron Preprocessing, Ryuichi Yamamoto (r9y9).


   title={Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders},
   journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
   author={Liu, Andy T. and Yang, Shu-wen and Chi, Po-Han and Hsu, Po-chun and Lee, Hung-yi},