
Bad amplitude normalization #21

Open · Cortexelus opened this issue May 25, 2018 · 3 comments

Cortexelus commented May 25, 2018

Problem

Amplitudes are min-max normalized, for each audio example loaded from the dataset.

This is bad for three reasons:

First reason: DC offset. The normalization subtracts the per-example minimum and divides by the resulting maximum (i.e. the peak-to-peak range). If the minimum and maximum peaks have different magnitudes, silence no longer lands on the middle quantization level, so a DC offset is introduced into the audio (see the sketch after this list).

Second reason: Each example has different peaks, so each example ends up with a different quantization level for silence.

Third reason: Dynamics. If part of my dataset is soft, part is loud, and part is transitions between soft and loud, every example gets normalized to loud, so SampleRNN will struggle to learn those transitions. If an [8-second] example is nearly silent, it becomes extremely loud.
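
Here is a minimal sketch of the per-example min-max quantization described above (the function below only mimics that behaviour; it is not the repo's exact code). Two examples that both contain true silence end up quantizing silence to different, off-center levels:

import torch

EPSILON = 1e-2  # small constant as in the quantizer; value assumed for illustration

def minmax_quantize(samples, q_levels):
    # per-example min-max normalization followed by quantization
    samples = samples.clone()
    samples -= samples.min()
    samples /= samples.max()
    samples *= q_levels - EPSILON
    samples += EPSILON / 2
    return samples.long()

# two "recordings" that both contain true silence (0.0) but have different peaks
a = torch.tensor([0.0, 0.5, -0.9, 0.0])
b = torch.tensor([0.0, 0.3, -0.6, 0.0])

print(minmax_quantize(a, 256)[0])  # silence -> level 164, not the midpoint 128
print(minmax_quantize(b, 256)[0])  # silence -> level 170, a different value again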

I think the only acceptable amplitude normalization is over the entire dataset, and you could do that [with ffmpeg] when creating the dataset.
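
Purely for illustration (the paths below are placeholders and none of this is repo code), a single gain computed over the whole dataset could be applied like this when building the dataset; an ffmpeg volume or loudness filter would achieve the same thing:

import glob
import os
import librosa
import numpy as np
import soundfile as sf

in_files = glob.glob("dataset/*.wav")            # placeholder input directory
os.makedirs("dataset_normalized", exist_ok=True)

# first pass: find the global peak over the entire dataset
peak = max(np.abs(librosa.load(f, sr=None, mono=True)[0]).max() for f in in_files)

# second pass: apply the same gain to every file, preserving relative dynamics
for f in in_files:
    y, sr = librosa.load(f, sr=None, mono=True)
    sf.write(os.path.join("dataset_normalized", os.path.basename(f)), y / peak, sr)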

The normalization happens in linear_quantize

Audio normalized upon loading:

def __getitem__(self, index):
    # load the raw waveform at its original sample rate, as mono
    (seq, _) = load(self.file_names[index], sr=None, mono=True)
    return torch.cat([
        # prepend overlap_len samples filled with the quantized zero level (q_zero)
        torch.LongTensor(self.overlap_len) \
             .fill_(utils.q_zero(self.q_levels)),
        # the per-example normalization happens inside linear_quantize
        utils.linear_quantize(
            torch.from_numpy(seq), self.q_levels
        )
    ])

(Example) linear_dequantize(linear_quantize(samples)) != samples

# quantize the wav amplitude into 256 levels
# (linear_quantize / linear_dequantize here are the repo's utils functions; plot is a plotting helper)
q_levels = 256
# Plot original wav samples
plot(samples)
# samples = tensor([ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000])


# Linearly quantize the samples
lq = linear_quantize(samples, q_levels)
plot(lq)
# lq = tensor([ 133,  133,  133,  ...,  133,  133,  133])
# note, silence should be 128


# Unquantize the samples
ldq = linear_dequantize(lq, q_levels)
plot(ldq)
# tensor([ 0.0391,  0.0391,  0.0391,  ...,  0.0391,  0.0391,  0.0391])
# a DC offset has been introduced
# instead, this should be silent: 0.0000, 0.0000, 0.0000, ...


Solution

Don't normalize in linear_quantize; just map the samples (already in [-1, 1]) onto the quantization levels:

def linear_quantize(samples, q_levels):
    samples = samples.clone()
    samples += 1                    # [-1, 1] -> [0, 2]
    samples /= 2                    # [0, 2]  -> [0, 1]
    samples *= q_levels - EPSILON   # [0, 1]  -> [0, q_levels)
    samples += EPSILON / 2
    return samples.long()
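
As a quick sanity check, here is a minimal sketch showing that silence now round-trips to zero; the EPSILON value and the linear_dequantize below are assumptions for illustration, not necessarily the repo's exact utils:

import torch

EPSILON = 1e-2   # assumed value, for illustration
q_levels = 256

def linear_quantize(samples, q_levels):    # the fixed version proposed above
    samples = samples.clone()
    samples += 1
    samples /= 2
    samples *= q_levels - EPSILON
    samples += EPSILON / 2
    return samples.long()

def linear_dequantize(samples, q_levels):  # sketch of the inverse mapping
    return samples.float() / (q_levels / 2) - 1

silence = torch.zeros(4)
lq = linear_quantize(silence, q_levels)
print(lq)                                  # tensor([128, 128, 128, 128])
print(linear_dequantize(lq, q_levels))     # tensor([0., 0., 0., 0.])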

Cortexelus (Author) commented May 25, 2018

The original SampleRNN also has this issue, though the amplitude normalization happens per-batch
soroushmehr/sampleRNN_ICLR2017#24

StefOe commented Jun 6, 2018

Interestingly, this became an issue for me when I introduced Batch Normalization to the network. Thanks for the hint!

blibliki commented Mar 8, 2020

Hey @Cortexelus,
We applied the fix you proposed but are still seeing a DC offset. What other places could cause this behavior?

Thanks in advance.
See wekaco@0b8a434
