This code is a PyTorch implementation of the SpeechFlow model in "How Speech is Recognized to Be Emotional - A Study Based on Information Decomposition", modified for our investigations from the original SpeechFlow.
The code for the ACRNN used in this paper can be found at 3-D ACRNN.
This project is built with Python 3.6. The other required packages can be installed with `pip install -r requirements.txt`.
To prepare training data (VCTK is taken as an example here; for other datasets, you should modify some settings in the files below):

- Prepare your wave files.
- `cd tools`
- Make training and validation data. For better training, you can make an entire training wave file for each speaker with `sh make_cat.sh`, and make separate validation wave files for each speaker with `sh make_valid.sh`. Validation speakers must not appear in the training data.
- Extract spectrograms and F0: `python make_spect_f0_VCTK.py`
  - You should provide d-vectors computed by a pre-trained model. D-vectors for the VCTK speakers computed by our pre-trained model are provided in `./data/VCTK_dvec/dvector_VCTK.npz`.
  - A mapping from speakers to IDs and another from speakers to their genders are needed.
  - For validation data, you should change some settings.
- Generate training metadata: `python make_metasplit_VCTK.py`
- Generate validation metadata: `python make_demodata_VCTK.py`
- Change the settings in `hparams.py` and `run.py`.
- Run the training script: `python run.py`
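As a sketch of the speaker bookkeeping described above (the speaker names, key names, and the 256-dimensional embedding size here are illustrative assumptions, not guaranteed to match this repository's files): d-vectors can be stored in an `.npz` archive keyed by speaker name, and the speaker-to-ID and speaker-to-gender mappings can be plain dictionaries.

```python
import numpy as np

# Hypothetical speaker bookkeeping; the real mappings come from your dataset.
spk2id = {"p225": 0, "p226": 1}       # speaker -> integer ID
spk2gen = {"p225": "F", "p226": "M"}  # speaker -> gender

# Save d-vectors as an .npz archive keyed by speaker name
# (random 256-dim vectors stand in for real embeddings here).
dvecs = {spk: np.random.randn(256).astype(np.float32) for spk in spk2id}
np.savez("dvector_demo.npz", **dvecs)

# Loading mirrors how a file like ./data/VCTK_dvec/dvector_VCTK.npz
# could be consumed: index the archive by speaker name.
loaded = np.load("dvector_demo.npz")
for spk in spk2id:
    dvec = loaded[spk]
    print(spk, spk2id[spk], spk2gen[spk], dvec.shape)
```

The actual key names and embedding dimensionality in `dvector_VCTK.npz` may differ; inspect `loaded.files` to see the stored speakers.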
An inference example is provided in `infer_batch.py`, in which you should define the input pickle file.
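The exact contents of the input pickle are defined by the metadata scripts above; as a purely hypothetical sketch, it could be a list of per-utterance entries (speaker name, d-vector, feature file name) serialized with `pickle`:

```python
import pickle
import numpy as np

# Hypothetical input entries for batch inference; the real format produced by
# make_metasplit_VCTK.py / make_demodata_VCTK.py may differ.
entries = [
    ("p225", np.random.randn(256).astype(np.float32), "p225_001.npy"),
    ("p226", np.random.randn(256).astype(np.float32), "p226_001.npy"),
]

with open("demo_input.pkl", "wb") as f:
    pickle.dump(entries, f)

# The inference script would then be pointed at this pickle path.
with open("demo_input.pkl", "rb") as f:
    restored = pickle.load(f)
print(len(restored), restored[0][0])
```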
The SpeechFlow model is the most important tool in our analysis of how individual speech components affect the performance of modern emotion recognition systems. This code is modified for our task from the original SpeechFlow. We thank Kaizhi Qian for providing the original code, which was very helpful to us.