Bioacoustic Event Detection with Self-Supervised Contrastive Learning
Peter C. Bermant, Leandra Brickson, Alexander J. Titus
While deep learning has revolutionized ecological data analysis, existing strategies often rely on supervised learning, which is subject to limitations on real-world applicability. In this paper, we apply self-supervised deep learning methods to bioacoustic data to enable unsupervised detection of bioacoustic event boundaries. We propose a convolutional deep neural network that operates on the raw waveform directly and is trained in accordance with the Noise Contrastive Estimation principle, which enables the system to detect spectral changes in the input acoustic stream. The model learns a representation of the input audio sampled at low frequency that encodes information regarding dissimilarity between sequential acoustic windows. During inference, we use a peak finding algorithm to search for regions of high dissimilarity in order to identify temporal boundaries of bioacoustic events. We report results using these techniques to detect sperm whale (Physeter macrocephalus) coda clicks in real-world recordings, and we demonstrate the viability of analyzing the vocalizations of other species (e.g. Bengalese finch syllable segmentation) in addition to other data modalities (e.g. animal behavioral dynamics, embryo development and tracking). We find that the self-supervised deep representation learning-based technique outperforms established threshold-based baseline methods without requiring manual annotation of acoustic datasets. Quantitatively, our approach yields a maximal R-value and F1-score of 0.887 and 0.876, respectively, and an area under the Precision-Recall curve (PR-AUC) of 0.917, while a baseline threshold detector acting on signal energy amplitude returns a maximal R-value and F1-score of 0.620 and 0.576, respectively, and a PR-AUC of 0.571. We also compare with a threshold detector using preprocessed (e.g. denoised) acoustic input. The findings of this paper establish the validity of unsupervised bioacoustic event detection using deep neural networks and self-supervised contrastive learning as an effective alternative to conventional techniques that leverage supervised methods for signal presence indication. Providing a means for highly accurate unsupervised detection, this paper serves as an important step towards developing a fully automated system for real-time acoustic monitoring of bioacoustic signals in real-world acoustic data.
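For intuition, the training signal can be summarized as a contrastive objective that pulls embeddings of adjacent acoustic windows together while treating other windows as negatives. Below is a minimal InfoNCE-style sketch of that idea, not the repository's implementation; the temperature value and the use of in-batch negatives are assumptions here (see losses.py for the actual objective):

```python
import torch
import torch.nn.functional as F

def info_nce(z_t, z_next, temperature=0.1):
    # Embeddings of adjacent windows (z_t, z_next) are positive pairs;
    # the other windows in the batch act as negatives. Shapes: (batch, dim).
    # The temperature is a hypothetical default, not the repo's setting.
    z_t = F.normalize(z_t, dim=-1)
    z_next = F.normalize(z_next, dim=-1)
    logits = z_t @ z_next.T / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```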
preprocess.py: Preprocessing the Watkins sperm whale .wav files to get .wav info
dataset.py: PyTorch dataset to select training/validation segments
layers.py: PyTorch layers (nn.Module) used during model construction
models.py: PyTorch model (nn.Module) comprising the spectral boundary encoder
losses.py: The contrastive learning objective used as the loss function during model training
metrics.py: Metrics (P, R, F1, r, P_det, P_FA) used during inference and a metric based on the contrastive learning objective
PyFire.py: Custom-written general-purpose training loop using the Trainer object
train.py: Pipeline for instantiating and training models
inference.py: Using the trained model, encode the waveforms to the learned representation
peak_detect.py: Using the outputs of inference.py, compute peaks in the dissimilarity metric
energy_detect.py: Compute the baseline detections based on energy amplitude
results.py: Evaluate the P, R, F1, r, P_det, P_FA metrics
utils.py: Miscellaneous utility functions
git clone https://github.com/colossal-compsci/SSLUnsupDet.git
cd SSLUnsupDet
pip install -r requirements.txt
The sperm whale click data are extracted from the ‘Best Of’ cuts from the Watkins Marine Mammal Sound Database, Woods Hole Oceanographic Institution, and the New Bedford Whaling Museum. We host the .wav acoustic files used in the study (of which 71 exceed the 0.5s window width) and 42 .txt selection tables (of which 4 are considered to be of questionable certainty, labeled with '.q.') in a Google storage bucket. The Bengalese finch data from Nicholson et al., 2017 are available from figshare. The green sea turtle data from Jeantet et al., 2020 are available in Dryad. The embryo development data from Cicconet et al., 2014 are available online.
The detection training and inference pipeline assumes the data is structured as follows:
data
│
└───wavs
│ └─ *.wav
└───selections
└─ *.selections.txt
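As a convenience, the following sketch checks that the layout above is in place. It assumes each selection table shares its .wav file's basename (e.g. foo.wav paired with foo.selections.txt), which is an assumption about the naming scheme rather than a documented guarantee:

```python
from pathlib import Path

# Quick layout sanity check; the basename-pairing convention between
# wavs and selection tables is assumed, not guaranteed by the repo.
wavs = {p.stem for p in Path('data/wavs').glob('*.wav')}
sels = {p.name.removesuffix('.selections.txt')
        for p in Path('data/selections').glob('*.selections.txt')}
print(f'{len(wavs)} wavs, {len(wavs - sels)} without a matching selection table')
```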
Prior to training and running inference with a model, it may be necessary to modify the configuration in configs/config.json. For a given dataset, it is particularly important to specify the "dataset" and "model" parameters. The default "training" parameters generally serve as a functional baseline.
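The configuration can also be inspected and edited programmatically. The keys in the sketch below are hypothetical placeholders; consult configs/config.json for the actual schema consumed by this repo:

```python
import json

# Load the config; the keys edited below are illustrative only.
with open('configs/config.json') as f:
    config = json.load(f)

config['dataset']['data_dir'] = 'data'  # hypothetical key: point at your dataset
config['model']['embedding_dim'] = 128  # hypothetical key: model size

with open('configs/config.json', 'w') as f:
    json.dump(config, f, indent=4)
```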
python train.py
Command-line flags are as follows:
-c --config name of the config JSON
-s --seed (optional) random seed
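For example, to train with a fixed random seed (the config name is illustrative; pass whatever name the argument parser in train.py expects):

python train.py -c config -s 42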
python inference.py
Command-line flags are as follows:
-c --config name of the config JSON
-ckpt --checkpoint epoch of serialized model checkpoint
-s --search carry out 'coarse' or 'fine' grid search mode
python peak_detect.py
Command-line flags are as follows:
-c --config name of the config JSON
-ckpt --checkpoint epoch of serialized model checkpoint
-q --use_q bool to use uncertain (q) selection tables
-s --search carry out 'coarse' or 'fine' grid search mode
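Conceptually, this step searches for peaks in the frame-to-frame dissimilarity of the learned representation. Below is a minimal sketch of that idea, using scipy's find_peaks as a stand-in for the repository's peak finding; the threshold values are illustrative, not the repo's defaults:

```python
import numpy as np
from scipy.signal import find_peaks

def boundaries_from_embeddings(z, height=0.2, distance=5):
    # z: (time, dim) learned representation of one recording.
    # Cosine dissimilarity between consecutive frames, then peak picking.
    # height/distance are illustrative values, not the repo's defaults.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    dissim = 1.0 - np.sum(z[:-1] * z[1:], axis=1)  # 1 - cosine similarity
    peaks, _ = find_peaks(dissim, height=height, distance=distance)
    return peaks, dissim
```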
python energy_detect.py
Command-line flags are as follows:
-c --config name of the config JSON
-ckpt --checkpoint epoch of serialized model checkpoint
-q --use_q bool to use uncertain (q) selection tables
-s --search carry out 'coarse' or 'fine' grid search mode
-h --hpf implement high pass filter noise reduction
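For reference, here is a minimal sketch of an energy-amplitude baseline of this kind, with an optional Butterworth high-pass filter standing in for the -h noise-reduction flag. All parameter values are assumptions, not the repo's defaults:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def energy_detections(wav, sr, threshold, hpf_hz=None, frame=512, hop=256):
    # Frame-wise energy thresholding with an optional Butterworth high-pass
    # filter. Frame/hop sizes and filter order are illustrative choices.
    if hpf_hz is not None:
        sos = butter(4, hpf_hz, btype='highpass', fs=sr, output='sos')
        wav = sosfiltfilt(sos, wav)
    n_frames = 1 + (len(wav) - frame) // hop
    energy = np.array([np.sum(wav[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)])
    return np.where(energy > threshold)[0] * hop / sr  # detection times in seconds
```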
python results.py
Command-line flags are as follows:
-c --config name of the config JSON
-ckpt --checkpoint epoch of serialized model checkpoint
-q --use_q bool to use uncertain (q) selection tables
-s --search carry out 'coarse' or 'fine' grid search mode
-hpf --use_hpf implement high pass filter noise reduction
-e --compute_energy bool to compute energy amplitude-based detections
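For orientation, precision, recall, and F1 for boundary detection are typically computed by tolerance-based matching of detected boundaries against reference boundaries. A minimal sketch under that assumption follows; the 10 ms tolerance is illustrative, and metrics.py defines the metrics actually reported:

```python
def precision_recall_f1(detected, reference, tol=0.01):
    # A detection counts as a true positive if it lands within `tol` seconds
    # of a not-yet-matched reference boundary. The tolerance value is an
    # illustrative choice, not the repo's setting.
    matched, tp = set(), 0
    for d in detected:
        hits = [i for i, r in enumerate(reference)
                if i not in matched and abs(d - r) <= tol]
        if hits:
            matched.add(hits[0])
            tp += 1
    precision = tp / max(len(detected), 1)
    recall = tp / max(len(reference), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```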
