Add EnhPreprocessor for Speech Enhancement #4321

Merged: 16 commits merged into espnet:master on Jun 6, 2022

Conversation

@Emrys365 (Collaborator) commented Apr 26, 2022

This PR adds a general preprocessor for adding noise and reverberation on the fly.

Below is an example usage of the newly added EnhPreprocessor in egs2/wsj0_2mix/enh1:

  • run.sh

    The added --extra_wav_list argument specifies the names of additional audio scp files to dump in stage 3, which will later be used by EnhPreprocessor.
    See egs2/TEMPLATE/enh1/enh.sh#L359-L368 for more details.

    #!/usr/bin/env bash
    # Set bash to 'debug' mode, it will exit on :
    # -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
    set -e
    set -u
    set -o pipefail
    
    min_or_max=min # "min" or "max". This is to determine how the mixtures are generated in local/data.sh.
    sample_rate=8k
    
    
    train_set="tr_${min_or_max}_${sample_rate}"
    valid_set="cv_${min_or_max}_${sample_rate}"
    test_sets="tt_${min_or_max}_${sample_rate} "
    
    ./enh.sh \
        --audio_format wav \
        --train_set "${train_set}" \
        --valid_set "${valid_set}" \
        --test_sets "${test_sets}" \
        --fs "${sample_rate}" \
        --lang en \
        --ngpu 1 \
        --use_preprocessor true \
        --extra_wav_list "rirs.scp noises.scp" \
        --local_data_opts "--sample_rate ${sample_rate} --min_or_max ${min_or_max}" \
        --enh_config conf/tuning/train_enh_dprnn_tasnet_with_preprocessor.yaml \
        "$@"
  • conf/tuning/train_enh_dprnn_tasnet_with_preprocessor.yaml

    optim: adam
    init: xavier_uniform
    max_epoch: 150
    batch_type: folded
    batch_size: 4
    iterator_type: chunk
    chunk_length: 32000
    num_workers: 4
    optim_conf:
        lr: 1.0e-03
        eps: 1.0e-08
        weight_decay: 0
    patience: 4
    val_scheduler_criterion:
    - valid
    - loss
    best_model_criterion:
    -   - valid
        - si_snr
        - max
    -   - valid
        - loss
        - min
    keep_nbest_models: 1
    scheduler: reducelronplateau
    scheduler_conf:
        mode: min
        factor: 0.7
        patience: 1
    
    # preprocessor config
    rir_scp: dump/raw/tr_min_8k/rirs.scp
    rir_apply_prob: 1.0
    noise_scp: dump/raw/tr_min_8k/noises.scp
    noise_apply_prob: 1.0
    noise_db_range: "0_30"
    use_reverberant_ref: true
    num_spk: 2
    num_noise_type: 1
    sample_rate: 8000
    force_single_channel: true
    
    encoder: conv
    encoder_conf:
        channel: 64
        kernel_size: 2
        stride: 1
    decoder: conv
    decoder_conf:
        channel: 64
        kernel_size: 2
        stride: 1
    separator: dprnn
    separator_conf:
        num_spk: 2
        layer: 6
        rnn_type: lstm
        bidirectional: True  # this is for the inter-block rnn
        nonlinear: relu
        unit: 128
        segment_size: 250
        dropout: 0.1
    
    # A list of criterions
    # The overall loss in the multi-task learning will be:
    # loss = weight_1 * loss_1 + ... + weight_N * loss_N
    # The default `weight` for each sub-loss is 1.0
    criterions:
      # The first criterion
      - name: si_snr
        conf:
          eps: 1.0e-7
        wrapper: pit
        wrapper_conf:
          weight: 1.0
          independent_perm: True

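For intuition, below is a minimal, self-contained sketch of what such on-the-fly augmentation does conceptually. The function and variable names are illustrative only, not the actual EnhPreprocessor implementation:

    import numpy as np
    from scipy.signal import fftconvolve

    def augment(speech, rir, noise, snr_db):
        """Reverberate speech with an RIR, then add noise at snr_db dB SNR."""
        # Convolve with the RIR and truncate to the original length
        reverberant = fftconvolve(speech, rir)[: len(speech)]
        # Tile the noise if it is shorter than the speech
        reps = int(np.ceil(len(reverberant) / len(noise)))
        noise = np.tile(noise, reps)[: len(reverberant)]
        # Scale the noise to reach the target SNR
        speech_power = np.mean(reverberant**2)
        noise_power = np.mean(noise**2) + 1e-10
        scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return reverberant + scale * noise

With the config above (noise_db_range: "0_30", rir_apply_prob and noise_apply_prob both 1.0), an SNR would be sampled from the 0 to 30 dB range and both RIR and noise would be applied to every training example.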
@Emrys365 Emrys365 added ESPnet2 SE Speech enhancement labels Apr 26, 2022
@codecov (bot) commented Apr 26, 2022

Codecov Report

Merging #4321 (afe5131) into master (5fa6dcc) will decrease coverage by 0.14%.
The diff coverage is 38.72%.

@@            Coverage Diff             @@
##           master    #4321      +/-   ##
==========================================
- Coverage   82.58%   82.43%   -0.15%     
==========================================
  Files         469      469              
  Lines       40196    40358     +162     
==========================================
+ Hits        33194    33270      +76     
- Misses       7002     7088      +86     
Flag                       Coverage Δ
test_integration_espnet1   66.58% <ø> (ø)
test_integration_espnet2   49.51% <37.25%> (-0.08%) ⬇️
test_python                69.19% <30.39%> (-0.18%) ⬇️
test_utils                 23.45% <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                               Coverage Δ
espnet2/train/preprocessor.py                36.71% <16.10%> (-0.85%) ⬇️
espnet2/iterators/sequence_iter_factory.py   97.43% <100.00%> (+0.21%) ⬆️
espnet2/tasks/asr.py                         91.71% <100.00%> (+0.04%) ⬆️
espnet2/tasks/enh.py                         99.18% <100.00%> (+0.12%) ⬆️
espnet2/tasks/enh_s2t.py                     96.64% <100.00%> (ø)
espnet2/tasks/hubert.py                      88.37% <100.00%> (ø)
espnet2/tasks/st.py                          90.55% <100.00%> (+0.05%) ⬆️
espnet2/enh/loss/criterions/abs_loss.py      80.00% <0.00%> (-5.72%) ⬇️
espnet2/text/phoneme_tokenizer.py            83.39% <0.00%> (-3.19%) ⬇️
espnet2/bin/asr_inference.py                 84.83% <0.00%> (-1.78%) ⬇️
... and 6 more


@sw005320 sw005320 added this to the v.202205 milestone Apr 26, 2022
@Emrys365 (Collaborator, Author) commented Apr 28, 2022

I modified the DataLoader initialization to set an epoch-dependent random seed for each worker; the previous default DataLoader used varying, uncontrollable random seeds at each epoch (see https://github.com/pytorch/pytorch/blob/master/torch/utils/data/_utils/worker.py#L216-L235 and https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L535).
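A minimal sketch of the idea (the helper name is illustrative; the actual change lives in espnet2/iterators/sequence_iter_factory.py):

    import numpy as np
    from torch.utils.data import DataLoader

    def build_loader(dataset, epoch, seed=0, num_workers=4):
        # Derive a deterministic, epoch-dependent seed for every worker so
        # that on-the-fly augmentation is reproducible across runs and resumes.
        def worker_init_fn(worker_id):
            np.random.seed(seed + epoch * num_workers + worker_id)

        return DataLoader(dataset, num_workers=num_workers,
                          worker_init_fn=worker_init_fn)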

@sw005320 (Contributor) commented:

@popcornell, could I ask you to review this PR?

@sw005320 (Contributor) commented:

@Emrys365, is it possible to make a unit test to cover this?
If it is difficult, we may consider having an integration test.

@Emrys365 (Collaborator, Author) commented:

I was thinking about it, but I could not find an example unit test for the preprocessor.
Maybe an integration test is easier, but it will need some noise and RIR signals to load.

@sw005320 (Contributor) commented:

I think it is no problem to upload such files for the integration test (or to randomly generate some files and treat them as RIRs or noises).
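A sketch of how such placeholder files could be generated (assuming the soundfile package; filenames and parameters are hypothetical):

    import numpy as np
    import soundfile as sf

    rng = np.random.default_rng(0)
    fs = 8000
    # White noise as a stand-in "noise" clip
    sf.write("noise1.wav", rng.standard_normal(4 * fs).astype(np.float32), fs)
    # An exponentially decaying noise burst as a stand-in "RIR"
    t = np.arange(fs // 2)
    rir = rng.standard_normal(fs // 2) * np.exp(-t / (0.05 * fs))
    sf.write("rir1.wav", rir.astype(np.float32), fs)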

@Emrys365 (Collaborator, Author) commented Apr 28, 2022

OK. I added two noise samples (from MUSAN) and two RIR samples (from https://www.openslr.org/28/) to egs/mini_an4/asr1/downloads.tar.gz:

downloads
├── an4
│   ├── etc
│   │   ├── an4.dic
│   │   ├── an4.filler
│   │   ├── an4.phone
│   │   ├── an4_test.fileids
│   │   ├── an4_test.transcription
│   │   ├── an4_train.fileids
│   │   ├── an4_train.transcription
│   │   ├── an4.ug.lm
│   │   └── an4.ug.lm.DMP
│   ├── LICENSE
│   ├── README
│   └── wav
│       ├── an4_clstk
│       │   ├── fash
│       │   │   ├── an251-fash-b.sph
│       │   │   ├── an253-fash-b.sph
│       │   │   └── cen7-fash-b.sph
│       │   ├── fbbh
│       │   │   └── cen8-fbbh-b.sph
│       │   └── mwhw
│       │       ├── an152-mwhw-b.sph
│       │       └── cen8-mwhw-b.sph
│       └── an4test_clstk
│           ├── fcaw
│           │   └── cen8-fcaw-b.sph
│           └── mmxg
│               └── cen8-mmxg-b.sph
├── noise
│   ├── noise-free-sound-0043.wav
│   └── noise-sound-bible-0001.wav
└── rirs
    ├── rir1.wav
    └── rir2.wav

@mergify mergify bot added ESPnet1 README CI Travis, Circle CI, etc labels Apr 28, 2022
@popcornell (Contributor) commented:

@sw005320 @Emrys365 Looks like one of you needs to invite me to do the review.

@sw005320 (Contributor) commented:

> @sw005320 @Emrys365 Looks like one of you needs to invite me to do the review.

I think you can just go through the code and leave comments.
Please let me know if that does not work.

@popcornell (Contributor) commented:

If that is fine with you two, I'll do it this way then!

espnet2/tasks/enh.py (resolved review thread)
help="Whether to apply preprocessing to data or not",
)
group.add_argument(
"--speech_volume_normalize",
@popcornell (Contributor) commented:

Wouldn't it be useful to also add an option for dynamically varying the speech level?
In dynamic mixing you often want to do that too.

@Emrys365 (Collaborator, Author) commented Apr 29, 2022:

Do you mean randomly sampling from a range rather than using a fixed value? That sounds reasonable to me.
We can do it just like noise_db_range. @kamo-naoyuki What do you think?

One concern is that we need to take care of the behavior when self.train is False: the scale should be deterministic in that case.

@popcornell (Contributor) commented:

Yes, that would be good. There are many instances where you want to reduce the overall level of the speech utterances, e.g., to simulate far-field speech (you correctly rescale back after reverberation, but sometimes you want to simulate the low gain of speech captured by distant mics).

@Emrys365 (Collaborator, Author) commented:

I just added support for it (only for EnhPreprocessor).

@popcornell (Contributor) commented:

Isn't it better to scale the energy in dB instead of by the peak value?
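A sketch of the range-based behavior under discussion (hypothetical helper; the actual option and its defaults live in EnhPreprocessor):

    import numpy as np

    def sample_volume_scale(low, high, train, rng):
        # Randomly vary the speech level during training; keep it
        # deterministic when self.train is False (validation/inference).
        if train:
            return rng.uniform(low, high)
        return (low + high) / 2  # fixed, reproducible scale

    # Peak-based scaling, as with speech_volume_normalize:
    #   speech = scale * speech / np.max(np.abs(speech))
    # An energy-based (dB) variant would normalize by RMS instead of the peak.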

espnet2/train/preprocessor.py (resolved review thread)
@mergify (bot) commented May 18, 2022:

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label May 18, 2022
@kamo-naoyuki (Collaborator) commented:

I changed the CI tests to check the import order with isort in #4372, so please also apply isort after resolving the conflicts. Sorry for bothering you.

@Emrys365 (Collaborator, Author) commented:

> I changed the CI tests to check the import order with isort in #4372, so please also apply isort after resolving the conflicts. Sorry for bothering you.

Sure, no problem.

@mergify mergify bot removed the conflicts label May 20, 2022
@Emrys365 (Collaborator, Author) commented:

Hi @popcornell, could you review again? I made some changes to address all comments.

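# Wrap-pad: if the noise clip is shorter than the speech segment, np.pad
# with mode="wrap" repeats it cyclically until it covers all nsamples.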
noise = np.pad(
    noise,
    [(offset, nsamples - f.frames - offset), (0, 0)],
    mode="wrap",
)
@popcornell (Contributor) commented:

This is nice, but it could lead to odd-sounding noise clips if the noise is very short. Maybe raise a warning? (Though that could be too verbose, since this runs in the dataloader.)

@Emrys365 (Collaborator, Author) commented:

Thanks for the comment. I have added support for it.
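A minimal sketch of such a guard (hypothetical names; the actual check lives in espnet2/train/preprocessor.py):

    import warnings

    def maybe_warn_short_noise(noise_frames, nsamples, max_wraps=2):
        # Warn when a noise clip must be repeated many times to cover
        # the speech segment, which can sound audibly periodic.
        if nsamples > max_wraps * noise_frames:
            warnings.warn(
                f"Noise clip ({noise_frames} samples) is much shorter than "
                f"the speech ({nsamples} samples); wrap-padding repeats it."
            )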

@kan-bayashi kan-bayashi modified the milestones: v.202205, v.202206 May 26, 2022
@sw005320 (Contributor) commented Jun 6, 2022:

Thanks, @Emrys365!

@sw005320 sw005320 merged commit 4806c23 into espnet:master Jun 6, 2022
Labels: CI Travis, Circle CI, etc · ESPnet1 · ESPnet2 · New Features · README · SE Speech enhancement

5 participants