SA-toolkit is a PyTorch-based library providing pipelines and basic building blocks for evaluating and designing speaker anonymization techniques.
Features include:
- ASR training with a PyTorch Kaldi LF-MMI wrapper (for evaluation and for VC linguistic features)
- VC HiFi-GAN training with on-the-fly feature caching (anonymization)
- ASV training (evaluation)
- WER Utility and EER/Linkability/Cllr Privacy evaluations
- Clear and simplified egs directories
- Unified trainer/configs
- TorchScript YAAPT & TorchScript kaldi.fbank (with batch processing!)
- On-the-fly feature extraction only
- 100% TorchScript JIT-compatible network models
All data are formatted with Kaldi-like wav.scp, spk2utt, text, etc.
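As a rough illustration of what these Kaldi-like files look like, the sketch below writes a toy data directory and reads it back. It follows the general Kaldi convention of one `<key> <value>` entry per line; the helper names (`read_kaldi_table`, `utt2spk_to_spk2utt`), the utterance IDs, and the paths are hypothetical, and `utt2spk` is a standard Kaldi file assumed here rather than one named above. This is not the toolkit's own loader.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def read_kaldi_table(path):
    """Read a Kaldi-style table: one '<key> <value...>' entry per line."""
    table = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def utt2spk_to_spk2utt(utt2spk):
    """Invert utterance -> speaker into speaker -> sorted list of utterances."""
    spk2utt = defaultdict(list)
    for utt, spk in sorted(utt2spk.items()):
        spk2utt[spk].append(utt)
    return dict(spk2utt)


# Toy data directory (hypothetical IDs and paths, for illustration only).
data_dir = Path(tempfile.mkdtemp())
(data_dir / "wav.scp").write_text(
    "1069-0001 /corpus/1069/0001.wav\n"
    "1069-0002 /corpus/1069/0002.wav\n"
    "2300-0001 /corpus/2300/0001.wav\n",
    encoding="utf-8",
)
(data_dir / "utt2spk").write_text(
    "1069-0001 1069\n1069-0002 1069\n2300-0001 2300\n", encoding="utf-8"
)
(data_dir / "text").write_text(
    "1069-0001 HELLO WORLD\n1069-0002 HOW ARE YOU\n2300-0001 GOOD MORNING\n",
    encoding="utf-8",
)

wav_scp = read_kaldi_table(data_dir / "wav.scp")  # utterance -> audio path
text = read_kaldi_table(data_dir / "text")        # utterance -> transcript
spk2utt = utt2spk_to_spk2utt(read_kaldi_table(data_dir / "utt2spk"))

# spk2utt lines list every utterance of a speaker on a single line.
spk2utt_lines = [f"{spk} {' '.join(utts)}" for spk, utts in sorted(spk2utt.items())]
print("\n".join(spk2utt_lines))
```

Keeping every file keyed by utterance ID (or speaker ID for spk2utt) is what lets the different files be joined without any extra metadata.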
Kaldi is necessary for training the ASR models and for the handy run.pl/ssh.pl/data_split.. scripts, but most of the actual logic is performed in Python; you won't have to deal with Kaldi ;)
The best way to install the toolkit is with the install.sh script, which sets up a miniconda environment and Kaldi.
Take a look at the script and adapt it to your cluster configuration, or let it do its magic.
git clone https://github.com/deep-privacy/SA-toolkit
cd SA-toolkit
./install.sh
The torch.hub usage below locally installs satools (the required pip dependencies are torch and torchaudio).
This version gives access to the Python/torch model for inference/testing, but for training use install.sh.
You can change tag_version to match one of the available model tags.
import torch
model = torch.hub.load("deep-privacy/SA-toolkit", "anonymization", tag_version="hifigan_bn_tdnnf_wav2vec2_vq_48_v1", trust_repo=True)
wav_conv = model.convert(torch.rand((1, 77040)), target="1069")
asr_bn = model.get_bn(torch.rand((1, 77040)))  # ASR-BN extraction for disentangled linguistic features (best with hifigan_bn_tdnnf_wav2vec2_vq_48_v1)
Because it uses TorchScript, this version does not rely on any additional dependencies.
import torch
import torchaudio
waveform, _, text_gt, speaker, chapter, utterance = torchaudio.datasets.LIBRISPEECH("/tmp", "dev-clean", download=True)[1]
torchaudio.save(f"/tmp/clear_{speaker}-{chapter}-{str(utterance)}.wav", waveform, 16000)
model = torch.jit.load("__Exp_Path__/final.jit").eval()
wav_conv = model.convert(waveform, target="1069")
torchaudio.save(f"/tmp/anon_{speaker}-{chapter}-{str(utterance)}.wav", wav_conv, 16000)
Ensure you have the model downloaded. Check the egs/vc directory for more details.
cd egs/anon/vctk
./local/eval.py --config configs/eval_clear # eval privacy/utility of the signals
Ensure you have the corresponding evaluation model trained or downloaded.
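For readers unfamiliar with the metrics involved, here is a minimal, dependency-free sketch of two of them: WER (utility) as a word-level edit distance, and EER (privacy) as the operating point where the miss rate and the false-alarm rate meet. The function names and the toy scores are assumptions made for illustration; this is not the code run by local/eval.py, which may differ in normalization and in how the EER is interpolated.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,                           # deletion
                    curr[j - 1] + 1,                       # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over the observed scores."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # Miss: a same-speaker trial scored below the threshold.
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        # False alarm: a different-speaker trial scored at or above the threshold.
        false_alarm = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (miss + false_alarm) / 2
    return best_eer


# Toy inputs (made up for the example).
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"WER: {wer:.3f}  EER: {eer:.3f}")
```

For anonymized speech, a low WER indicates that the linguistic content was preserved, while a high EER for the attacker's ASV system indicates that speakers are harder to re-identify.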
Check out the READMEs of egs/asr/librispeech, egs/asv/voxceleb, and egs/vc/libritts.
This library is the result of Pierre Champion's thesis work.
If you find this library useful in academic research, please cite:
@phdthesis{champion2023,
title={Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques},
author={Pierre Champion},
year={2023},
school={Université de Lorraine - INRIA Nancy},
type={Thesis},
}
(Also consider starring the project on GitHub.)
Acknowledgements:
- Idiap's pkwrap
- Jik876's HifiGAN
- A.Larcher's Sidekit
- Organizers of the VoicePrivacy Challenge
Most of the software is distributed under the Apache 2.0 License (http://www.apache.org/licenses/LICENSE-2.0); the parts distributed under other licenses are indicated by a LICENSE file in the related directories.
As outlined in the thesis, selecting the appropriate target identities for voice conversion is crucial for privacy evaluation. We strongly encourage the use of any-to-one voice conversion as it provides the greatest level of guarantee regarding unlinkable speech generation and facilitates proper training of a white-box ASV evaluation model. Additionally, this approach is easy to comprehend (everyone should sound like a single identity) and enables using one-hot encoding for target identity representation, which is simpler than x-vectors while still highly effective for utility preservation.
Furthermore, the thesis identifies a limitation in the current utility evaluation process. We believe that the best solution for proper assessment of utility is through subjective listening, which allows for accurate evaluation of any mispronunciations produced by the VC system.
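To make the any-to-one idea with one-hot target identities concrete, here is a small sketch in plain Python. The helper names and the speaker and utterance IDs are hypothetical (only "1069" appears as a target in the examples above), and the way the toolkit's models actually encode targets internally is not shown here.

```python
def build_speaker_index(target_speakers):
    """Map each known target speaker ID to a fixed position."""
    return {spk: i for i, spk in enumerate(sorted(target_speakers))}


def one_hot(speaker, speaker_index):
    """One-hot vector: 1.0 at the speaker's position, 0.0 elsewhere."""
    vec = [0.0] * len(speaker_index)
    vec[speaker_index[speaker]] = 1.0
    return vec


def any_to_one_targets(source_utterances, target):
    """Any-to-one: every source utterance is converted to the same target."""
    return {utt: target for utt in source_utterances}


# Hypothetical set of target speakers the VC model was trained on.
speaker_index = build_speaker_index(["6097", "1069", "2300"])
target_vector = one_hot("1069", speaker_index)

# Every utterance, whoever spoke it, gets the same target identity.
assignments = any_to_one_targets(["spkA-utt1", "spkB-utt1", "spkC-utt7"], "1069")
print(target_vector, assignments)
```

Because all converted utterances share one target, an evaluator cannot use the choice of target to link utterances from the same source speaker.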