# Model comparisons with audio examples
This page contains audio examples for the paper: E. J. Nustede and J. Anemüller, "On the Generalization Ability of Complex-Valued Variational U-Networks for Single-Channel Speech Enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3838-3849, 2024, doi: 10.1109/TASLP.2024.3444492.

#### Method
The paper focuses on the inclusion of a probabilistic latent space model in a U-Network for single-channel speech enhancement. We evaluate our model, the CVU-Net, and compare it with several variants. On this page, we compare the CVU-Net with the CU-Net, an otherwise identical complex-valued U-Network that does not include the probabilistic latent space model. Both models are trained on the magnitude and phase of the audio signals. The evaluation uses SI-SDR, PESQ, and STOI, but only SI-SDR is reported here. To assess the adaptability of the models, we test them under dataset-mismatched conditions: we train each model on either the Microsoft Deep Noise Suppression (DNS) Challenge 2020 dataset or VoiceBank+DEMAND, then evaluate it on the dataset that was not used for training.
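
SI-SDR is the only metric reported on this page. As a reminder of what it measures, the following is a minimal sketch of the commonly used scale-invariant SDR definition, written with plain Python lists so that it runs without extra dependencies. The function name `si_sdr`, the mean removal, and the small `eps` constant are assumptions made for illustration; this is not the evaluation code used for the paper.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")

    # Remove the mean so that a DC offset does not influence the score.
    mean_est = sum(estimate) / len(estimate)
    mean_ref = sum(reference) / len(reference)
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]

    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]

    # Everything in the estimate that is not the scaled target counts as error.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Toy signals: rescaling the estimate leaves the score unchanged.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.9, -1.1]
print(si_sdr(estimate, reference))
print(si_sdr([3.0 * x for x in estimate], reference))
```

Because the reference is rescaled by the projection coefficient before the error is computed, a louder or quieter output is neither rewarded nor penalized, which is useful when comparing models whose output levels differ.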

#### Chosen examples
Here we present selected audio examples for two mismatched scenarios: training on VoiceBank+DEMAND and testing on DNS without reverberation, and training on DNS and testing on VoiceBank+DEMAND. For comparison, we also show the matched conditions, i.e., the models are trained on the same dataset that they are tested on. In every case, the test data is separate from the training data.
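
The matched and mismatched conditions can be enumerated as every pairing of a training set with a test set. The snippet below is only an illustrative sketch: the function name `evaluation_conditions` and the dictionary layout are assumptions, and the dataset labels are the ones used on this page.

```python
from itertools import product

# Dataset labels as used on this page; the dict layout is an illustrative assumption.
DATASETS = ("DNS", "VoiceBank+DEMAND")


def evaluation_conditions(datasets=DATASETS):
    """List every train/test pairing and flag whether it is matched."""
    return [
        {"train": train, "test": test, "matched": train == test}
        for train, test in product(datasets, repeat=2)
    ]


for cond in evaluation_conditions():
    kind = "matched" if cond["matched"] else "mismatched"
    print(f"train on {cond['train']:<17} test on {cond['test']:<17} -> {kind}")
```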

From the DNS test set, we selected two examples with speech-like noise, specifically babble noise with a male and a female target speaker, and two examples with non-speech noise. From the VoiceBank+DEMAND corpus, we chose a male speaker with music noise and a female speaker with babble noise. Our aim was to choose files with a noisy mixture SNR of approximately 5 dB. However, the selection of non-speech noise examples was constrained by the inherent randomness of the noise distribution within our test sets, in particular for the VoiceBank+DEMAND corpus.
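
A selection by mixture SNR such as the one described above could be sketched as follows. This is a hypothetical illustration, not the procedure actually used: the helper names `mixture_snr_db` and `closest_to_target`, as well as the file IDs in the usage example, are made up, and the noise is assumed to be the sample-wise difference between the noisy and the clean signal.

```python
import math


def mixture_snr_db(clean, noisy):
    """SNR in dB of a noisy mixture, treating (noisy - clean) as the noise."""
    if len(clean) != len(noisy):
        raise ValueError("clean and noisy must have the same length")
    speech_energy = sum(c * c for c in clean)
    noise_energy = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    if noise_energy == 0.0:
        return math.inf  # identical signals: no noise at all
    if speech_energy == 0.0:
        return -math.inf  # silent clean signal: noise only
    return 10.0 * math.log10(speech_energy / noise_energy)


def closest_to_target(snr_by_file, target_db=5.0, count=2):
    """Return the `count` file IDs whose SNR is closest to `target_db`."""
    ranked = sorted(snr_by_file, key=lambda f: abs(snr_by_file[f] - target_db))
    return ranked[:count]


# Hypothetical per-file SNR values in dB, for illustration only.
snrs = {"file_a": 4.8, "file_b": 12.0, "file_c": 5.5, "file_d": -1.0}
print(closest_to_target(snrs))
```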

In [28]:
from IPython.display import Image, display, HTML, Audio

def create_html_css(image_path, clean_path, noisy_path, denoised_path, image_title):
    """Render a two-column grid of images, each with an audio player below it.

    image_path holds the four de-noised images followed by the noisy and the
    clean one; image_title is accepted for compatibility but not displayed.
    """
    audio_title_denoised = ['Matched CVU-Net de-noised audio',
                            'Matched CU-Net de-noised audio',
                            'Mismatched CVU-Net de-noised audio',
                            'Mismatched CU-Net de-noised audio']
    image_width = "600px"

    # One (title, audio file) pair per panel, in the same order as image_path.
    panels = list(zip(audio_title_denoised, denoised_path))
    panels += [('Noisy audio', noisy_path), ('Clean audio', clean_path)]

    cells = ""
    for (title, audio), image in zip(panels, image_path):
        # The audio files are WAV, so the MIME type is audio/wav, not audio/mpeg.
        cells += f"""
        <div style="text-align: center;">
            <h3>{title}</h3>
            <img src="{image}" alt="{title}" style="width: {image_width}; height: auto;">
            <audio controls style="width: 50%;">
                <source src="{audio}" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
        </div>"""

    html_code = f"""
    <div style="display: grid; box-sizing: border-box; grid-template-columns: 1fr 1fr; gap: 0;">{cells}
    </div>
    """
    display(HTML(html_code))

## Evaluation on DNS dataset test files
---
### The mismatched models shown in this section were trained on VoiceBank+DEMAND and evaluated on DNS data, while the matched models were trained and evaluated on the DNS dataset.
### Example 1: Male speaker with wind noise

In [29]:
# Wind noise with male speaker scenario
image_paths = []
denoised_paths = []
image_titles = ['Matched CVU-Net (Ma/Ph) de-noising result',
                'Matched CU-Net (Ma/Ph) de-noising result',
                'Mismatched CVU-Net (Ma/Ph) de-noising result',
                'Mismatched CU-Net (Ma/Ph) de-noising result',
                'Original noisy signal',
                'Original clean signal']

# Matched CVU-Net
image_paths.append("audio_examples/dns-norev/cvunet-maph/cvunet-maph_single_fileid_170.png")
denoised_paths.append("audio_examples/dns-norev/cvunet-maph/clnsp205_wind_407027_1_snr1_tl-24_fileid_170.wav")

# Matched CU-Net
image_paths.append("audio_examples/dns-norev/cunet-maph/cunet-maph_single_fileid_170.png")
denoised_paths.append("audio_examples/dns-norev/cunet-maph/clnsp205_wind_407027_1_snr1_tl-24_fileid_170.wav")

# Mismatched CVU-Net
image_paths.append("audio_examples/voice-dns-norev/cvunet-maph/cvunet-maph_single_fileid_170.png")
denoised_paths.append("audio_examples/voice-dns-norev/cvunet-maph/clnsp205_wind_407027_1_snr1_tl-24_fileid_170.wav")

# Mismatched CU-Net
image_paths.append("audio_examples/voice-dns-norev/cunet-maph/cunet-maph_single_fileid_170.png")
denoised_paths.append("audio_examples/voice-dns-norev/cunet-maph/clnsp205_wind_407027_1_snr1_tl-24_fileid_170.wav")

# Clean & Noisy
clean_path = "audio_examples/voice-dns-norev/clean_clnsp205_wind_407027_1_snr1_tl-24_fileid_170.wav"
noisy_path = "audio_examples/dns-norev/noisy_fileid_170.wav"
image_paths.append("audio_examples/voice-dns-norev/noisy_fileid_170.png")
image_paths.append("audio_examples/voice-dns-norev/clean_fileid_170.png")


create_html_css(image_paths, clean_path, noisy_path, denoised_paths, image_titles)



### Example 2: Male speaker with babble noise

In [30]:
# Babble male speaker
image_paths = []
denoised_paths = []
image_titles = ['Matched CVU-Net (Ma/Ph) de-noising result',
                'Matched CU-Net (Ma/Ph) de-noising result',
                'Mismatched CVU-Net (Ma/Ph) de-noising result',
                'Mismatched CU-Net (Ma/Ph) de-noising result',
                'Original noisy signal',
                'Original clean signal']

# Matched CVU-Net
image_paths.append("audio_examples/dns-norev/cvunet-maph/cvunet-maph_single_fileid_255.png")
denoised_paths.append("audio_examples/dns-norev/cvunet-maph/clnsp50_babble_188218_24_snr4_tl-29_fileid_255.wav")

# Matched CU-Net
image_paths.append("audio_examples/dns-norev/cunet-maph/cunet-maph_single_fileid_255.png")
denoised_paths.append("audio_examples/dns-norev/cunet-maph/clnsp50_babble_188218_24_snr4_tl-29_fileid_255.wav")

# Mismatched CVU-Net
image_paths.append("audio_examples/voice-dns-norev/cvunet-maph/cvunet-maph_single_fileid_255.png")
denoised_paths.append("audio_examples/voice-dns-norev/cvunet-maph/clnsp50_babble_188218_24_snr4_tl-29_fileid_255.wav")

# Mismatched CU-Net
image_paths.append("audio_examples/voice-dns-norev/cunet-maph/cunet-maph_single_fileid_255.png")
denoised_paths.append("audio_examples/voice-dns-norev/cunet-maph/clnsp50_babble_188218_24_snr4_tl-29_fileid_255.wav")

# Clean & Noisy
clean_path = "audio_examples/voice-dns-norev/clean_clnsp50_babble_188218_24_snr4_tl-29_fileid_255.wav"
noisy_path = "audio_examples/dns-norev/noisy_fileid_255.wav"
image_paths.append("audio_examples/voice-dns-norev/noisy_fileid_255.png")
image_paths.append("audio_examples/voice-dns-norev/clean_fileid_255.png")


create_html_css(image_paths, clean_path, noisy_path, denoised_paths, image_titles)

### Example 3: Female speaker with vacuum cleaner noise

In [31]:
# Vacuum noise female speaker
image_paths = []
denoised_paths = []
image_titles = ['Matched CVU-Net (Ma/Ph) de-noising result',
                'Matched CU-Net (Ma/Ph) de-noising result',
                'Mismatched CVU-Net (Ma/Ph) de-noising result',
                'Mismatched CU-Net (Ma/Ph) de-noising result',
                'Original noisy signal',
                'Original clean signal']

# Matched CVU-Net
image_paths.append("audio_examples/dns-norev/cvunet-maph/cvunet-maph_single_fileid_175.png")
denoised_paths.append("audio_examples/dns-norev/cvunet-maph/clnsp257_vacuum_273194_2_snr4_tl-18_fileid_175.wav")

# Matched CU-Net
image_paths.append("audio_examples/dns-norev/cunet-maph/cunet-maph_single_fileid_175.png")
denoised_paths.append("audio_examples/dns-norev/cunet-maph/clnsp257_vacuum_273194_2_snr4_tl-18_fileid_175.wav")

# Mismatched CVU-Net
image_paths.append("audio_examples/voice-dns-norev/cvunet-maph/cvunet-maph_single_fileid_175.png")
denoised_paths.append("audio_examples/voice-dns-norev/cvunet-maph/clnsp257_vacuum_273194_2_snr4_tl-18_fileid_175.wav")

# Mismatched CU-Net
image_paths.append("audio_examples/voice-dns-norev/cunet-maph/cunet-maph_single_fileid_175.png")
denoised_paths.append("audio_examples/voice-dns-norev/cunet-maph/clnsp257_vacuum_273194_2_snr4_tl-18_fileid_175.wav")

# Clean & Noisy
clean_path = "audio_examples/voice-dns-norev/clean_clnsp257_vacuum_273194_2_snr4_tl-18_fileid_175.wav"
noisy_path = "audio_examples/dns-norev/noisy_fileid_175.wav"
image_paths.append("audio_examples/voice-dns-norev/noisy_fileid_175.png")
image_paths.append("audio_examples/voice-dns-norev/clean_fileid_175.png")

create_html_css(image_paths, clean_path, noisy_path, denoised_paths, image_titles)

### Example 4: Female speaker with babble noise

In [32]:
# Babble noise with female speaker
image_paths = []
denoised_paths = []
image_titles = ['Matched CVU-Net (Ma/Ph) de-noising result',
                'Matched CU-Net (Ma/Ph) de-noising result',
                'Mismatched CVU-Net (Ma/Ph) de-noising result',
                'Mismatched CU-Net (Ma/Ph) de-noising result',
                'Original noisy signal',
                'Original clean signal']

# Matched CVU-Net
image_paths.append("audio_examples/dns-norev/cvunet-maph/cvunet-maph_single_fileid_147.png")
denoised_paths.append("audio_examples/dns-norev/cvunet-maph/clnsp25_babble_188218_21_snr5_tl-25_fileid_147.wav")

# Matched CU-Net
image_paths.append("audio_examples/dns-norev/cunet-maph/cunet-maph_single_fileid_147.png")
denoised_paths.append("audio_examples/dns-norev/cunet-maph/clnsp25_babble_188218_21_snr5_tl-25_fileid_147.wav")

# Mismatched CVU-Net
image_paths.append("audio_examples/voice-dns-norev/cvunet-maph/cvunet-maph_single_fileid_147.png")
denoised_paths.append("audio_examples/voice-dns-norev/cvunet-maph/clnsp25_babble_188218_21_snr5_tl-25_fileid_147.wav")

# Mismatched CU-Net
image_paths.append("audio_examples/voice-dns-norev/cunet-maph/cunet-maph_single_fileid_147.png")
denoised_paths.append("audio_examples/voice-dns-norev/cunet-maph/clnsp25_babble_188218_21_snr5_tl-25_fileid_147.wav")

# Clean & Noisy
clean_path = "audio_examples/voice-dns-norev/clean_clnsp25_babble_188218_21_snr5_tl-25_fileid_147.wav"
noisy_path = "audio_examples/dns-norev/noisy_fileid_147.wav"
image_paths.append("audio_examples/voice-dns-norev/noisy_fileid_147.png")
image_paths.append("audio_examples/voice-dns-norev/clean_fileid_147.png")

create_html_css(image_paths, clean_path, noisy_path, denoised_paths, image_titles)

## Evaluation on VoiceBank+DEMAND test files
---
### The following mismatched examples show the performance when the models are trained on the DNS dataset and evaluated on VoiceBank+DEMAND instead. For comparison, the matched scenario is included, where the models were trained and evaluated on the VoiceBank+DEMAND dataset.

In [33]:
def create_html_css(image_path, clean_path, noisy_path, denoised_path, image_title):
    """Render a four-panel grid: two mismatched de-noised results, noisy, clean.

    Redefines the helper above for this section; image_title is accepted for
    compatibility but not displayed.
    """
    audio_title_denoised = ['Mismatched CVU-Net de-noised audio',
                            'Mismatched CU-Net de-noised audio']
    image_width = "600px"

    # One (title, audio file) pair per panel, in the same order as image_path.
    panels = list(zip(audio_title_denoised, denoised_path))
    panels += [('Noisy audio', noisy_path), ('Clean audio', clean_path)]

    cells = ""
    for (title, audio), image in zip(panels, image_path):
        # The audio files are WAV, so the MIME type is audio/wav, not audio/mpeg.
        cells += f"""
        <div style="text-align: center;">
            <h3>{title}</h3>
            <img src="{image}" alt="{title}" style="width: {image_width}; height: auto;">
            <audio controls style="width: 50%;">
                <source src="{audio}" type="audio/wav">
                Your browser does not support the audio element.
            </audio>
        </div>"""

    html_code = f"""
    <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 10px;">{cells}
    </div>
    """
    display(HTML(html_code))


### Example 1: Male speaker with music and background rustling noise

In [34]:
# Male speaker with music and rustling noise
image_paths = []
denoised_paths = []
image_titles = [
                'Mismatched CVU-Net (Ma/Ph) de-noising result',
                'Mismatched CU-Net (Ma/Ph) de-noising result',
                'Original noisy signal',
                'Original clean signal']

# Mismatched CVU-Net
image_paths.append("audio_examples/dns-to-voice/cvunet-maph/cvunet-maph_single_fileidp232_160.png")
denoised_paths.append("audio_examples/dns-to-voice/cvunet-maph/p232_160.wav")

# Mismatched CU-Net
image_paths.append("audio_examples/dns-to-voice/cunet-maph/cunet-maph_single_fileidp232_160.png")
denoised_paths.append("audio_examples/dns-to-voice/cunet-maph/p232_160.wav")


# Clean & Noisy
clean_path = "audio_examples/dns-to-voice/clean_p232_160.wav"
noisy_path = "audio_examples/dns-to-voice/noisy_p232_160.wav"
image_paths.append("audio_examples/dns-to-voice/noisy_fileidp232_160.png")
image_paths.append("audio_examples/dns-to-voice/clean_fileidp232_160.png")

create_html_css(image_paths, clean_path, noisy_path, denoised_paths, image_titles)

### Example 2: Female speaker with music

In [35]:
# Female speaker with music noise
image_paths = []
denoised_paths = []
image_titles = [
                'Mismatched CVU-Net (Ma/Ph) de-noising result',
                'Mismatched CU-Net (Ma/Ph) de-noising result',
                'Original noisy signal',
                'Original clean signal']

# Mismatched CVU-Net
image_paths.append("audio_examples/dns-to-voice/cvunet-maph/cvunet-maph_single_fileidp257_430.png")
denoised_paths.append("audio_examples/dns-to-voice/cvunet-maph/p257_430.wav")

# Mismatched CU-Net
image_paths.append("audio_examples/dns-to-voice/cunet-maph/cunet-maph_single_fileidp257_430.png")
denoised_paths.append("audio_examples/dns-to-voice/cunet-maph/p257_430.wav")

# Clean & Noisy
clean_path = "audio_examples/dns-to-voice/clean_p257_430.wav"
noisy_path = "audio_examples/dns-to-voice/noisy_p257_430.wav"
image_paths.append("audio_examples/dns-to-voice/noisy_fileidp257_430.png")
image_paths.append("audio_examples/dns-to-voice/clean_fileidp257_430.png")

create_html_css(image_paths, clean_path, noisy_path, denoised_paths, image_titles)

### Example 3: Female speaker with babble noise

In [36]:
# Female speaker with babble noise
image_paths = []
denoised_paths = []
image_titles = [
                'Mismatched CVU-Net (Ma/Ph) de-noising result',
                'Mismatched CU-Net (Ma/Ph) de-noising result',
                'Original noisy signal',
                'Original clean signal']

# Mismatched CVU-Net
image_paths.append("audio_examples/dns-to-voice/cvunet-maph/cvunet-maph_single_fileidp257_411.png")
denoised_paths.append("audio_examples/dns-to-voice/cvunet-maph/p257_411.wav")

# Mismatched CU-Net
image_paths.append("audio_examples/dns-to-voice/cunet-maph/cunet-maph_single_fileidp257_411.png")
denoised_paths.append("audio_examples/dns-to-voice/cunet-maph/p257_411.wav")

# Clean & Noisy
clean_path = "audio_examples/dns-to-voice/clean_p257_411.wav"
noisy_path = "audio_examples/dns-to-voice/noisy_p257_411.wav"
image_paths.append("audio_examples/dns-to-voice/noisy_fileidp257_411.png")
image_paths.append("audio_examples/dns-to-voice/clean_fileidp257_411.png")

create_html_css(image_paths, clean_path, noisy_path, denoised_paths, image_titles)