Added the ability to encode audio in sub-second chunks, with a sliding window of prior audio as context. This supports use cases where the encoded audio should simulate a streaming setting. For example, many codecs encode the same audio differently depending on the encoder's receptive field size, even natively streaming codecs like Mimi. So when training a streaming speech-to-text audio LM, we want to encode the training audio in tiny chunks so that it resembles what the model will receive during live streaming. This helps avoid pushing the model out of distribution at inference time.
Use the --chunk_size_secs and --context_secs parameters with codec_bpe.audio_to_codes to configure this.
The defaults are --chunk_size_secs=30 and --context_secs=0.0, which correspond to non-streaming usage.
--context_secs sets the total length of the sliding encoding window, including the chunk itself, which is useful to avoid codec degradation at tiny chunk sizes. For example, --chunk_size_secs=0.08 with --context_secs=0.4 encodes audio in chunks of 80ms, and each chunk is encoded together with the previous 320ms of audio so that the encoder's receptive field has context (we encode 320 + 80 = 400ms of audio at a time but only keep the final 80ms of codes).
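The window arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of the description, not the code used by codec_bpe.audio_to_codes: the function name sliding_windows, the 24 kHz sample rate, and the handling of a context shorter than the chunk (treated as "no extra context") are all assumptions made for the sketch.

```python
def sliding_windows(total_secs, chunk_size_secs=0.08, context_secs=0.4,
                    sample_rate=24000):
    """Yield (window_start, chunk_start, chunk_end) in samples for each chunk.

    Hypothetical helper: the encoder would see audio[window_start:chunk_end],
    and only the codes covering audio[chunk_start:chunk_end] would be kept.
    """
    chunk = round(chunk_size_secs * sample_rate)
    # Assumption: the window is never shorter than the chunk itself, so
    # context_secs=0.0 simply means "encode each chunk with no prior audio".
    window = max(chunk, round(context_secs * sample_rate))
    total = round(total_secs * sample_rate)
    for chunk_start in range(0, total, chunk):
        chunk_end = min(chunk_start + chunk, total)
        # Early chunks have less history available, so clamp at the start.
        window_start = max(0, chunk_end - window)
        yield window_start, chunk_start, chunk_end


if __name__ == "__main__":
    for window_start, chunk_start, chunk_end in sliding_windows(1.0):
        context_ms = 1000 * (chunk_start - window_start) / 24000
        chunk_ms = 1000 * (chunk_end - chunk_start) / 24000
        print(f"context={context_ms:.0f}ms chunk={chunk_ms:.0f}ms")
```

Under these assumptions, the first few chunks see less than 320ms of context because not enough prior audio exists yet; from the sixth chunk onward each window covers the full 400ms.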