# Preprocessing an audio dataset

This notebook covers three preprocessing steps:

* Resampling the audio data
* Filtering the dataset
* Converting audio data to the model's expected input

You may also want to read Sanchit Gandhi's [A Complete Guide to Audio Datasets](https://huggingface.co/blog/audio-datasets#a-complete-guide-to-audio-datasets).

### Resampling the audio data

cf. [https://huggingface.co/learn/audio-course/chapter1/preprocessing#resampling-the-audio-data](https://huggingface.co/learn/audio-course/chapter1/preprocessing#resampling-the-audio-data)

In [1]:
# load up the minds dataset previously used...
from datasets import load_dataset

minds = load_dataset(
    "PolyAI/minds14", 
    name="en-AU", 
    split="train",
    trust_remote_code=True
)

columns_to_remove = ["lang_id", "english_transcription"]
minds = minds.remove_columns(columns_to_remove)
minds

Dataset({
    features: ['path', 'audio', 'transcription', 'intent_class'],
    num_rows: 654
})

In [2]:
#print(minds[0])

print(
    f"Sampling rate is {minds[0]['audio']['sampling_rate']}, "
    f"with {len(minds[0]['audio']['array'])} audio samples"
)

Sampling rate is 8000, with 62415 audio samples


----

In [3]:
from datasets import Audio

minds = minds.cast_column(
    "audio", 
    Audio(sampling_rate=16_000)
)

In [4]:
#print(minds[0])

print(
    f"Sampling rate is {minds[0]['audio']['sampling_rate']}, "
    f"with {len(minds[0]['audio']['array'])} audio samples"
)

Sampling rate is 16000, with 124830 audio samples
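
The doubling in sample count follows directly from the ratio of the two sampling rates, while the clip's duration stays the same. The following is a minimal sketch of that arithmetic only, not of the resampling algorithm that 🤗 Datasets applies. The helper name `expected_num_samples` is hypothetical, and the use of `round` is an assumption, since the exact rounding rule depends on the resampler.

```python
def expected_num_samples(num_samples, orig_sr, target_sr):
    """Estimate the sample count after resampling from orig_sr to target_sr.

    Hypothetical helper: real resamplers may round differently when the
    ratio of the two rates is not a whole number.
    """
    return round(num_samples * target_sr / orig_sr)


# Values taken from the cell outputs above
orig_samples, orig_sr, target_sr = 62_415, 8_000, 16_000

new_samples = expected_num_samples(orig_samples, orig_sr, target_sr)
print(new_samples)  # 124830, matching the resampled example

# The duration in seconds is the same before and after resampling
duration_before = orig_samples / orig_sr
duration_after = new_samples / target_sr
print(duration_before == duration_after)
```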


----

### Filtering the dataset

cf. [https://huggingface.co/learn/audio-course/chapter1/preprocessing#filtering-the-dataset](https://huggingface.co/learn/audio-course/chapter1/preprocessing#filtering-the-dataset)

In [5]:
# simple filter function
MAX_DURATION_IN_SECONDS = 20.0

def is_audio_length_in_range(input_length):
    return input_length < MAX_DURATION_IN_SECONDS

In [7]:
import librosa

# use librosa to get example's duration from the audio file
new_column = [librosa.get_duration(path=x) for x in minds["path"]]
minds = minds.add_column("duration", new_column)

# use 🤗 Datasets' `filter` method to apply the filtering function
minds = minds.filter(is_audio_length_in_range, input_columns=["duration"])

# remove the temporary helper column
minds = minds.remove_columns(["duration"])
minds

Dataset({
    features: ['path', 'audio', 'transcription', 'intent_class'],
    num_rows: 624
})
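
To see what the filter predicate does without loading any audio files, here is a small sketch on plain Python dictionaries. It assumes the duration is computed as the number of samples divided by the sampling rate, rather than with `librosa.get_duration`. The `duration_in_seconds` helper and the second and third records are hypothetical and are not taken from MINDS-14.

```python
MAX_DURATION_IN_SECONDS = 20.0


def duration_in_seconds(num_samples, sampling_rate):
    """Clip length in seconds (hypothetical helper for this sketch)."""
    return num_samples / sampling_rate


def is_audio_length_in_range(input_length):
    # Same predicate as in the notebook: strictly shorter than the maximum
    return input_length < MAX_DURATION_IN_SECONDS


examples = [
    {"num_samples": 124_830, "sampling_rate": 16_000},  # about 7.8 s, first example above
    {"num_samples": 400_000, "sampling_rate": 16_000},  # 25 s, made up
    {"num_samples": 320_000, "sampling_rate": 16_000},  # exactly 20 s, made up
]

kept = [
    ex for ex in examples
    if is_audio_length_in_range(duration_in_seconds(**ex))
]
print(len(kept))  # 1: the 25 s clip and the exactly-20 s clip are both dropped
```

Because the comparison is strict (`<`), a clip of exactly 20 s is filtered out along with the longer ones.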

----

### Pre-processing audio data

cf. [https://huggingface.co/learn/audio-course/chapter1/preprocessing#pre-processing-audio-data](https://huggingface.co/learn/audio-course/chapter1/preprocessing#pre-processing-audio-data)

> The requirements for the input features may vary from one model to another — they depend on the model’s architecture, and the data it was pre-trained with. The good news is, for every supported audio model, 🤗 Transformers offer a feature extractor class that can convert raw audio data into the input features the model expects.

Here we look at the Whisper family of automatic speech recognition (ASR) models.

* First, the Whisper feature extractor pads or truncates every audio example in a batch to an input length of 30 s. Examples shorter than 30 s are padded by appending zeros to the end of the sequence (zeros in an audio signal correspond to no signal, i.e. silence). Examples longer than 30 s are truncated to 30 s.
* Because every element in the batch is padded or truncated to the same maximum length in the input space, there is no need for an attention mask. Whisper is unique in this regard: most other audio models require an attention mask that details where sequences have been padded, and thus where they should be ignored in the self-attention mechanism.
* Second, the Whisper feature extractor converts the padded audio arrays to log-mel spectrograms.

cf. [https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperFeatureExtractor](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperFeatureExtractor)
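
The pad/truncate step described above can be sketched in plain Python. This is an illustration only, not the `WhisperFeatureExtractor` implementation: `pad_or_truncate` is a hypothetical helper, and the hop length of 160 samples and the 80 mel bins are assumed to match the default configuration of `openai/whisper-small`.

```python
SAMPLING_RATE = 16_000  # Hz, the rate the dataset was resampled to above
CHUNK_LENGTH_S = 30     # seconds, the fixed Whisper input length
HOP_LENGTH = 160        # samples between spectrogram frames (assumed default)
N_MELS = 80             # mel bins (assumed default for whisper-small)


def pad_or_truncate(samples, target_length=SAMPLING_RATE * CHUNK_LENGTH_S):
    """Return a new list that is exactly `target_length` samples long."""
    if len(samples) >= target_length:
        # Longer than 30 s: keep only the first 30 s
        return list(samples[:target_length])
    # Shorter than 30 s: append zeros (silence) at the end
    return list(samples) + [0.0] * (target_length - len(samples))


short_clip = [0.1] * 124_830  # about 7.8 s of dummy audio
long_clip = [0.1] * 600_000   # 37.5 s of dummy audio

padded = pad_or_truncate(short_clip)
truncated = pad_or_truncate(long_clip)
print(len(padded), len(truncated))  # 480000 480000

# With a fixed input length, the spectrogram shape is fixed as well
num_frames = len(padded) // HOP_LENGTH
expected_shape = (N_MELS, num_frames)
print(expected_shape)  # (80, 3000)
```

Since every example ends up with the same number of samples, every log-mel spectrogram has the same shape, which is why no attention mask is needed to mark the padding.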

In [8]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")


