TVSM Dataset

The TV Speech and Music (TVSM) dataset contains speech and music activity labels across a variety of TV shows and their corresponding audio features extracted from professionally-produced high-quality audio. The dataset aims to facilitate research on speech and music detection tasks.

Get the dataset

  • The dataset can be downloaded via
  • The paper can be downloaded via EURASIP open access.
  • This repo contains materials and codebase to reproduce the baseline experiment in the paper.

License and attribution

  title={{A Large TV Dataset for Speech and Music Activity Detection}},
  author={Hung, Yun-Ning and Wu, Chih-Wei and Orife, Iroro and Hipple, Aaron and Wolcott, William and Lerch, Alexander},
  journal={EURASIP Journal on Audio, Speech, and Music Processing},

The TVSM dataset is licensed under the Apache License 2.0.

Dataset introduction

The downloaded dataset has the following structure:

└─── READEME.txt
└─── TVSM-cuesheet/
│    └─── labels/
│    └─── mel_features/
│    └─── mfcc/
│    └─── vgg_features/
│    └─── TVSM-xxxx_metadata.csv
└─── TVSM-pseudo/
└─── TVSM-test/
  • READEME.txt: basic information about the dataset
  • TVSM-cuesheet/: smaller subset used for training. The labels are derived from cuesheet information
  • TVSM-pseudo/: larger subset used for training. The labels are generated by a model pre-trained on TVSM-cuesheet
  • TVSM-test/: subset used for testing. The labels were created by human annotators
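After downloading, it can be useful to confirm that the folder layout matches the tree above. The following is a minimal sketch, not part of the official codebase: the helper name `missing_entries` is hypothetical, and it only checks the subset and subfolder names listed in this README. The metadata CSV is not checked because its exact file name varies per subset.

```python
from pathlib import Path

# Names taken from the directory tree in this README.
SUBSETS = ("TVSM-cuesheet", "TVSM-pseudo", "TVSM-test")
SUBFOLDERS = ("labels", "mel_features", "mfcc", "vgg_features")


def missing_entries(root):
    """Return expected subset/subfolder paths that are absent under root.

    Hypothetical helper for illustration; paths are returned relative
    to root, using forward slashes.
    """
    root = Path(root)
    missing = []
    for subset in SUBSETS:
        for sub in SUBFOLDERS:
            path = root / subset / sub
            if not path.is_dir():
                missing.append(path.relative_to(root).as_posix())
    return missing
```

An empty list means every expected folder was found; otherwise the list names what is missing, for example after a partial download.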

Each subset folder has the same structure:

  • labels/: speech and music activity labels for each sample. Each row of a CSV file gives the start time, the end time, and the class: s (speech) or m (music)
  • mel_features/: the Mel spectrogram features extracted from the audio of each sample
  • mfcc/: the MFCC features extracted from the audio of each sample
  • vgg_features/: the VGGish features extracted from the audio of each sample
  • TVSM-xxxx_metadata.csv: the metadata of each sample
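The label format described above (start time, end time, class per row) can be read with the standard library. This is a minimal sketch under stated assumptions: the field delimiter is not specified in this README, so it is a parameter (comma by default, pass "\t" for tab-separated files), and the function names `read_segments` and `total_duration` are hypothetical. Rows whose time fields are not numeric, such as a possible header row, are skipped.

```python
import csv
import io


def read_segments(lines, delimiter=","):
    """Parse rows of (start, end, class) into a list of tuples.

    `lines` is any iterable of text lines, e.g. an open file.
    The delimiter is an assumption; adjust it to match the files.
    """
    segments = []
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < 3:
            continue  # blank or malformed row
        try:
            start, end = float(row[0]), float(row[1])
        except ValueError:
            continue  # non-numeric times, e.g. a header row
        segments.append((start, end, row[2].strip()))
    return segments


def total_duration(segments):
    """Sum the labeled duration per class: 's' (speech) and 'm' (music)."""
    totals = {"s": 0.0, "m": 0.0}
    for start, end, label in segments:
        if label in totals:
            totals[label] += end - start
    return totals


# Usage with in-memory example data (values are made up for illustration).
example = io.StringIO("0.0,4.5,s\n2.0,10.0,m\n")
print(total_duration(read_segments(example)))
```

Speech and music segments may overlap in time, so the two totals are computed independently and can add up to more than the length of the recording.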

For more information, please see our paper.

Codebase introduction

Interested in running inference on existing samples? Please visit for usage.

cd training_code
python3 --audio_path test.wav

Please install Git LFS first, then run git-lfs pull to restore the checkpoints.

Please replace line 31 in with self.save_hyperparameters(hparams) if you are using a newer version of pytorch_lightning.

└─── Evaluation_Output/
│    └─── AVASpeech/
│    │    └─── T2
│    │    └─── TVSM-cuesheet
│    │    └─── TVSM-pseudo
│    └─── ...
└─── Models/
└─── training_code/
  • Evaluation_Output: the output generated by three models across five evaluation sets
    • T2: baseline method
    • TVSM-cuesheet: CRNN-P-Cue method
    • TVSM-pseudo: CRNN-P-Pseu method
  • Models: the pre-trained checkpoints for the CRNN-P-Cue and CRNN-P-Pseu methods
  • training_code: code for training the model

Bug Fix

If you encounter the error "batch response: This repository is over its data quota. Account responsible for LFS...", you can download the model checkpoints from Google Drive.


Please feel free to contact or open an issue here if you have any questions about the dataset or the support code.