# Audio Classification with Hugging Face Transformers

**Author:** Sreyan Ghosh<br>
**Date created:** 2022/07/01<br>
**Last modified:** 2022/08/27<br>
**Description:** Training Wav2Vec 2.0 using Hugging Face Transformers for Audio Classification.

## Introduction

Identification of speech commands, also known as *keyword spotting* (KWS),
is important from an engineering perspective for a wide range of applications,
from indexing audio databases and indexing keywords, to running speech models locally
on microcontrollers. Currently, many human-computer interfaces (HCI) like Google
Assistant, Microsoft Cortana, Amazon Alexa, Apple Siri and others rely on keyword
spotting. There is a significant amount of research in the field by all major companies,
notably Google and Baidu.

In the past decade, deep learning has led to significant performance
gains on this task. Though low-level audio features extracted from raw audio like MFCC or
mel-filterbanks have been used for decades, the design of these low-level features
are [flawed by biases](https://arxiv.org/abs/2101.08596). Moreover, deep learning models
trained on these low-level features can easily overfit to noise or signals irrelevant to the
task.  This makes it is essential for any system to learn speech representations that make
high-level information, such as acoustic and linguistic content, including phonemes,
words, semantic meanings, tone, speaker characteristics from speech signals available to
solve the downstream task. [Wav2Vec 2.0](https://arxiv.org/abs/2006.11477), which solves a
self-supervised contrastive learning task to learn high-level speech representations,
provides a great alternative to traditional low-level features for training deep learning
models for KWS.

In this notebook, we train the Wav2Vec 2.0 (base) model, built on the
Hugging Face Transformers library, in an end-to-end fashion on the keyword spotting task and
achieve state-of-the-art results on the Google Speech Commands Dataset.

## Importing IEMOCAP

In [1]:
!pip install tensorflow-addons
!pip3 install -q git+https://github.com/vasudevgupta7/gsoc-wav2vec2@main
!sudo apt-get install -y libsndfile1-dev
!pip3 install -q SoundFile

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
  Preparing metadata (setup.py) ... [?25l[?25hdone
Reading package lists... Done
Building dependency tree       
Reading state information... Done
libsndfile1-dev is already the newest version (1.0.28-7ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


In [2]:
from IPython.display import Audio, clear_output
from google.colab import files
import pandas as pd
import os
import numpy as np
import tensorflow_addons as tfa


files.upload()
!ls -lha kaggle.json
!pip install -q kaggle # Install kaggle API
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d riccardopaolini/nlp-project-work
!unzip nlp-project-work.zip
clear_output()

folder = os.path.join(os.getcwd(), 'IEMOCAP')

conv_id = 0

In [3]:
df = []
for session in ['Session1','Session2','Session3','Session4','Session5']:
    session_path = os.path.join(folder, session)
    # 'dialogue' folder contains Emotions and Transcripts
    # 'sentences' folder contains Audios

    trans_folder = os.path.join(session_path, 'dialog', 'transcriptions')

    for trans_name in np.sort(os.listdir(trans_folder)):
        if trans_name[:2] != '._':
            emo_path = os.path.join(session_path, 'dialog', 'EmoEvaluation', trans_name)
            with open(os.path.join(trans_folder, trans_name), encoding='utf8') as trans_file, open(emo_path, encoding='utf8') as emo_file:
                conv_id += 1
                turn_id = 0
                for line in trans_file:
                    audio_name, text = line.split(':')
                    if trans_name.split('.')[0] in audio_name:
                        turn_id += 1

                        wav_path = os.path.join(session_path, 'sentences', 'wav', trans_name.split('.')[0], audio_name.split(' ')[0] + '.wav')

                        reached = False
                        count_em = {'Anger': 0, 'Happiness': 0, 'Sadness': 0, 'Neutral': 0, 'Frustration': 0, 'Excited': 0, 'Fear': 0, 'Surprise': 0, 'Disgust': 0, 'Other': 0}
                        for emo_line in emo_file:
                            if audio_name.split(' ')[0] in emo_line:
                                emotion, vad = emo_line.split('\t')[-2:]
                                vad = vad[1:-2].split(',')
                                reached = True
                            elif emo_line[0] == 'C' and reached:
                                evaluator = emo_line.split(':')[0]
                                emotions = emo_line.split(':')[1].split('(')[0].split(';')
                                emotions = [em.strip() for em in emotions]
                                for em in emotions:
                                    if em != '':
                                        count_em[em] += 1
                            elif reached:
                                emo_file.seek(0)
                                break
                                    

                        row = {'session_id': int(session[-1]),
                                'conv_id': conv_id, 
                                'turn_id': turn_id, 
                                'sentence': text.strip(),
                                'path': wav_path,
                                'emotion': emotion,
                                'valence': float(vad[0]),
                                'activation': float(vad[1]),
                                'dominance': float(vad[2])
                                }
                        
                        df.append(dict(**row, **count_em))

df = pd.DataFrame(df)

idx = np.array([os.path.exists(path) for path in df.path])
print(f'Missing Audios: {np.sum(~idx)}')
print('Missing Sentences:')
print(df.iloc[~idx,3])
df = df.iloc[idx, :]

Missing Audios: 48
Missing Sentences:
3854    [LAUGHTER], That's what they say.
3866                            Mmm, Hmm.
3880                                Yeah.
3898                               Kelly.
3915                           Yeah, man.
3939                      Uh-huh, uh-huh.
3961                              Uh-huh.
3968                                Yeah.
3972                                Yeah.
4010                       Well, I don't-
4044                                Yeah.
4827                        But, Listen--
4847                                Yeah.
4873                                Yeah.
4975                          We- I mean-
4991                                Okay.
5005                                Yeah.
5051                              Thanks.
5124                        to start off.
5181                                okay.
5192                                Okay.
5208                                Okay.
7893                                  

In [4]:
print(df.shape)

(10039, 19)


In [5]:
df.head()

Unnamed: 0,session_id,conv_id,turn_id,sentence,path,emotion,valence,activation,dominance,Anger,Happiness,Sadness,Neutral,Frustration,Excited,Fear,Surprise,Disgust,Other
0,1,1,1,Excuse me.,/content/IEMOCAP/Session1/sentences/wav/Ses01F...,neu,2.5,2.5,2.5,0,0,0,4,0,0,0,0,0,0
1,1,1,2,Do you have your forms?,/content/IEMOCAP/Session1/sentences/wav/Ses01F...,fru,2.5,2.0,2.5,0,0,0,1,3,0,0,0,0,1
2,1,1,3,Yeah.,/content/IEMOCAP/Session1/sentences/wav/Ses01F...,neu,2.5,2.5,2.5,1,0,0,4,0,0,0,0,0,0
3,1,1,4,Let me see them.,/content/IEMOCAP/Session1/sentences/wav/Ses01F...,fru,2.5,2.0,2.5,0,0,0,0,3,0,0,0,0,1
4,1,1,5,Is there a problem?,/content/IEMOCAP/Session1/sentences/wav/Ses01F...,neu,2.5,2.5,2.5,1,0,0,3,0,0,0,1,0,0


### Define certain variables

In [6]:
# Maximum duration of the input audio file we feed to our Wav2Vec 2.0 model.
MAX_DURATION = 1
# Sampling rate is the number of samples of audio recorded every second
SAMPLING_RATE = 16000
BATCH_SIZE = 4  # Batch-size for training and evaluating our model.
'''NUM_CLASSES = 10'''
NUM_CLASSES = 11  # Number of classes our dataset will have (11 in our case).
HIDDEN_DIM = 768  # Dimension of our model output (768 in case of Wav2Vec 2.0 - Base).
MAX_SEQ_LENGTH = MAX_DURATION * SAMPLING_RATE  # Maximum length of the input audio file.
# Wav2Vec 2.0 results in an output frequency with a stride of about 20ms.
MAX_FRAMES = 49
MAX_EPOCHS = 2  # Maximum number of training epochs.

MODEL_CHECKPOINT = "facebook/wav2vec2-base"  # Name of pretrained model from Hugging Face Model Hub

In [7]:
import soundfile as sf


def read_flac_file(file_path):
  with open(file_path, "rb") as f:
      audio, sample_rate = sf.read(f)
  if sample_rate != SAMPLING_RATE:
      raise ValueError(
          f"sample rate (={sample_rate}) of your files must be {SAMPLING_RATE}"
      )
  file_id = os.path.split(file_path)[-1][:-len(".wav")]
  return audio #.tolist()

In [8]:
# visualizing a sample
from IPython.display import Audio
import random

flac_files = df.iloc[:,4].to_list()
flac_files[:4]
file_id = random.choice([f[:-len(".wav")] for f in flac_files])
flac_file_path = os.path.join(f"{file_id}.wav")

In [9]:
# getting audios
audio_paths = flac_files
print(read_flac_file(flac_files[0]))
audio_samples = list(map(read_flac_file, audio_paths))
audio_samples[:5]

[-0.0050354  -0.00497437 -0.0038147  ... -0.00265503 -0.00317383
 -0.00418091]


[array([-0.0050354 , -0.00497437, -0.0038147 , ..., -0.00265503,
        -0.00317383, -0.00418091]),
 array([-0.00354004, -0.00308228, -0.0062561 , ...,  0.00341797,
         0.00274658,  0.00256348]),
 array([ 0.00094604, -0.00094604, -0.0007019 , ..., -0.00045776,
        -0.00033569, -0.00128174]),
 array([ 0.00021362, -0.00048828, -0.00143433, ..., -0.00378418,
        -0.00375366, -0.00292969]),
 array([-0.00036621, -0.00015259,  0.00042725, ..., -0.00030518,
        -0.00018311,  0.00088501])]

In [10]:
print(len(audio_samples))

10039


In [11]:
!pip install git+https://github.com/huggingface/transformers.git
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(
    MODEL_CHECKPOINT, return_attention_mask=True
)


def preprocess_function(audio_arrays):
    #audio_arrays e' lista di liste di signed float lunghi 16000 
    print(type(audio_arrays))
    print(len(audio_arrays[0]))
    
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=MAX_SEQ_LENGTH,
        truncation=True,
        padding=True,
    )
    return inputs

X = preprocess_function(audio_samples)
print(type(X))

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-01buqkau
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-01buqkau
  Resolved https://github.com/huggingface/transformers.git to commit cf11493dce0a1d22446efe0d6c4ade02fd928e50
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone




<class 'list'>
31129
<class 'transformers.feature_extraction_utils.BatchFeature'>


In [12]:
print(type(X['input_values']))
print(type(X['input_values'][0]))

<class 'list'>
<class 'numpy.ndarray'>


In [13]:
a = dict()
X['input_values']
#should be list of ndarrays having size (16000,)

[array([-0.28229174, -0.27885097, -0.21347645, ...,  0.292316  ,
         0.34220707,  0.50564337], dtype=float32),
 array([-0.6427869, -0.5591317, -1.1391412, ...,  0.5897334,  0.5785794,
         0.5785794], dtype=float32),
 array([ 0.04751554, -0.04414944, -0.0323217 , ...,  0.05490787,
         0.02977393, -0.01162316], dtype=float32),
 array([ 0.01490017, -0.02970277, -0.08981977, ..., -1.4550576 ,
        -1.6838901 , -1.8467878 ], dtype=float32),
 array([-0.11737954, -0.04251289,  0.16069658, ...,  0.7168488 ,
         0.82380116,  0.81310594], dtype=float32),
 array([-0.06090495, -0.05052799,  0.02729918, ..., -0.1568918 ,
        -0.1309494 , -0.12316668], dtype=float32),
 array([-0.17318904, -0.16555297, -0.1873703 , ...,  0.07116506,
         0.09734586,  0.06352899], dtype=float32),
 array([0.5896468 , 0.6750473 , 1.2362504 , ..., 0.626247  , 0.61404693,
        0.35784552], dtype=float32),
 array([-0.02691223,  0.00954321,  0.0091336 , ...,  0.90864086,
         0.83859736

## Converting text labels

In [14]:
encoded_dict= {'ang':0, 'dis':1, 'exc':2, 'fea':3, 'fru':4, 'hap':5,'neu':6, 'oth':7, 'sad':8, 'sur':9, 'xxx':10}

y = df.emotion.map(encoded_dict).to_numpy()

In [15]:
len(y)

10039

## Setup

### Installing the requirements

In [16]:
!pip install git+https://github.com/huggingface/transformers.git
!pip install datasets ##
!pip install huggingface-hub
!pip install joblib
!pip install librosa ##

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-fuvsn69y
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-fuvsn69y
  Resolved https://github.com/huggingface/transformers.git to commit cf11493dce0a1d22446efe0d6c4ade02fd928e50
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/sim

### Importing the necessary libraries

In [17]:
import random
import logging

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)
# Set random seed
tf.keras.utils.set_random_seed(42)

### Define certain variables

## Load the Google Speech Commands Dataset

We now download the [Google Speech Commands V1 Dataset](https://arxiv.org/abs/1804.03209),
a popular benchmark for training and evaluating deep learning models built for solving the KWS task.
The dataset consists of a total of 60,973 audio files, each of 1 second duration,
divided into ten classes of keywords ("Yes", "No", "Up", "Down", "Left", "Right", "On",
"Off", "Stop", and "Go"), a class for silence, and an unknown class to include the false
positive. We load the dataset from [Hugging Face Datasets](https://github.com/huggingface/datasets).
This can be easily done with the `load_dataset` function.

In [18]:
# cell to be deleted
from datasets import load_dataset

speech_commands_v1 = load_dataset("superb", "ks")



  0%|          | 0/3 [00:00<?, ?it/s]

The dataset has the following fields:

- **file**: the path to the raw .wav file of the audio
- **audio**: the audio file sampled at 16kHz
- **label**: label ID of the audio utterance

In [19]:
# cell to be deleted
print(speech_commands_v1)

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 51094
    })
    validation: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 6798
    })
    test: Dataset({
        features: ['file', 'audio', 'label'],
        num_rows: 3081
    })
})


## Data Pre-processing

For the sake of demonstrating the workflow, in this notebook we only take
small stratified balanced splits (50%) of the train as our training and test sets.
We can easily split the dataset using the `train_test_split` method which expects
the split size and the name of the column relative to which you want to stratify.

Post splitting the dataset, we remove the `unknown` and `silence` classes and only
focus on the ten main classes. The `filter` method does that easily for you.

Next we sample our train and test splits to a multiple of the `BATCH_SIZE` to
facilitate smooth training and inference. You can achieve that using the `select`
method which expects the indices of the samples you want to keep. Rest all are
discarded.

In [20]:
speech_commands_v1 = speech_commands_v1["train"].train_test_split(
    train_size=0.5, test_size=0.5, stratify_by_column="label"
)

speech_commands_v1 = speech_commands_v1.filter(
    lambda x: x["label"]
    != (
        speech_commands_v1["train"].features["label"].names.index("_unknown_")
        and speech_commands_v1["train"].features["label"].names.index("_silence_")
    )
)

speech_commands_v1["train"] = speech_commands_v1["train"].select(
    [i for i in range((len(speech_commands_v1["train"]) // BATCH_SIZE) * BATCH_SIZE)]
)
speech_commands_v1["test"] = speech_commands_v1["test"].select(
    [i for i in range((len(speech_commands_v1["test"]) // BATCH_SIZE) * BATCH_SIZE)]
)





In [21]:
#print(len(speech_commands_v1["train"]["audio"]))

Additionally, you can check the actual labels corresponding to each label ID.

In [22]:
labels = speech_commands_v1["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

print(id2label)

{'0': 'yes', '1': 'no', '2': 'up', '3': 'down', '4': 'left', '5': 'right', '6': 'on', '7': 'off', '8': 'stop', '9': 'go', '10': '_silence_', '11': '_unknown_'}


Before we c[testo del link](https://)an feed the audio utterance samples to our model, we need to
pre-process them. This is done by a Hugging Face Transformers "Feature Extractor"
which will (as the name indicates) re-sample your the inputs to sampling rate
the the model expects (in-case they exist with a different sampling rate), as well
as generate the other inputs that model requires.

To do all of this, we instantiate our `Feature Extractor` with the
`AutoFeatureExtractor.from_pretrained`, which will ensure:

We get a `Feature Extractor` that corresponds to the model architecture we want to use.
We download the config that was used when pretraining this specific checkpoint.
This will be cached so it's not downloaded again the next time we run the cell.

The `from_pretrained()` method expects the name of a model from the Hugging Face Hub. This is
exactly similar to `MODEL_CHECKPOINT` and we just pass that.

We write a simple function that helps us in the pre-processing that is compatible
with Hugging Face Datasets. To summarize, our pre-processing function should:

- Call the audio column to load and if necessary resample the audio file.
- Check the sampling rate of the audio file matches the sampling rate of the audio data a
model was pretrained with. You can find this information on the Wav2Vec 2.0 model card.
- Set a maximum input length so longer inputs are batched without being truncated.

In [23]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(
    MODEL_CHECKPOINT, return_attention_mask=True
)


def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    #audio_arrays e' lista di liste di signed float lunghi 16000 
    
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=MAX_SEQ_LENGTH,
        truncation=True,
        padding=True,
    )
    return inputs


# This line with pre-process our speech_commands_v1 dataset. We also remove the "audio"
# and "file" columns as they will be of no use to us while training.
processed_speech_commands_v1 = speech_commands_v1.map(
    preprocess_function, remove_columns=["audio", "file"], batched=True
)

#print(type(processed_speech_commands_v1["train"]))

# Load the whole dataset splits as a dict of numpy arrays
train = processed_speech_commands_v1["train"].shuffle(seed=42).with_format("numpy")[:]
#test = processed_speech_commands_v1["test"].shuffle(seed=42).with_format("numpy")[:]



## Defining the Wav2Vec 2.0 with Classification-Head

We now define our model. To be precise, we define a Wav2Vec 2.0 model and add a
Classification-Head on top to output a probability ditribution of all classes for each
input audio sample. Since the model might get complex we first define the Wav2Vec
2.0 model with Classification-Head as a Keras layer and then build the model using that.

We instantiate our main Wav2Vec 2.0 model using the `TFWav2Vec2Model` class. This will
instantiate a model which will output 768 or 1024 dimensional embeddings according to
the config you choose (BASE or LARGE). The `from_pretrained()` additionally helps you
load pre-trained weights from the Hugging Face Model Hub. It will download the pre-trained weights
together with the config corresponding to the name of the model you have mentioned when
calling the method. For our task, we choose the BASE variant of the model that has
just been pre-trained, since we fine-tune over it.

In [24]:
from transformers import TFWav2Vec2Model


def mean_pool(hidden_states, feature_lengths):
    attenion_mask = tf.sequence_mask(
        feature_lengths, maxlen=MAX_FRAMES, dtype=tf.dtypes.int64
    )
    padding_mask = tf.cast(
        tf.reverse(tf.cumsum(tf.reverse(attenion_mask, [-1]), -1), [-1]),
        dtype=tf.dtypes.bool,
    )
    hidden_states = tf.where(
        tf.broadcast_to(
            tf.expand_dims(~padding_mask, -1), (BATCH_SIZE, MAX_FRAMES, HIDDEN_DIM)
        ),
        0.0,
        hidden_states,
    )
    pooled_state = tf.math.reduce_sum(hidden_states, axis=1) / tf.reshape(
        tf.math.reduce_sum(tf.cast(padding_mask, dtype=tf.dtypes.float32), axis=1),
        [-1, 1],
    )
    return pooled_state


class TFWav2Vec2ForAudioClassification(layers.Layer):
    """Combines the encoder and decoder into an end-to-end model for training."""

    def __init__(self, model_checkpoint, num_classes):
        super().__init__()
        # Instantiate the Wav2Vec 2.0 model without the Classification-Head
        self.wav2vec2 = TFWav2Vec2Model.from_pretrained(
            model_checkpoint, apply_spec_augment=False, from_pt=True
        )
        self.pooling = layers.GlobalAveragePooling1D()
        # Drop-out layer before the final Classification-Head
        self.intermediate_layer_dropout = layers.Dropout(0.5)
        # Classification-Head
        self.final_layer = layers.Dense(num_classes, activation="softmax")

    def call(self, inputs):
        # We take only the first output in the returned dictionary corresponding to the
        # output of the last layer of Wav2vec 2.0
        hidden_states = self.wav2vec2(inputs["input_values"])[0]

        # If attention mask does exist then mean-pool only un-masked output frames
        if tf.is_tensor(inputs["attention_mask"]):
            # Get the length of each audio input by summing up the attention_mask
            # (attention_mask = (BATCH_SIZE x MAX_SEQ_LENGTH) ∈ {1,0})
            audio_lengths = tf.cumsum(inputs["attention_mask"], -1)[:, -1]
            # Get the number of Wav2Vec 2.0 output frames for each corresponding audio input
            # length
            feature_lengths = self.wav2vec2.wav2vec2._get_feat_extract_output_lengths(
                audio_lengths
            )
            pooled_state = mean_pool(hidden_states, feature_lengths)
        # If attention mask does not exist then mean-pool only all output frames
        else:
            pooled_state = self.pooling(hidden_states)

        intermediate_state = self.intermediate_layer_dropout(pooled_state)
        final_state = self.final_layer(intermediate_state)

        return final_state


## Building and Compiling the model

We now build and compile our model. We use the `SparseCategoricalCrossentropy`
to train our model since it is a classification task. Following much of literature
we evaluate our model on the `accuracy` metric.

In [25]:

def build_model():
    # Model's input
    inputs = {
        "input_values": tf.keras.Input(shape=(MAX_SEQ_LENGTH,), dtype="float32"),
        "attention_mask": tf.keras.Input(shape=(MAX_SEQ_LENGTH,), dtype="int32"),
    }
    # Instantiate the Wav2Vec 2.0 model with Classification-Head using the desired
    # pre-trained checkpoint
    wav2vec2_model = TFWav2Vec2ForAudioClassification(MODEL_CHECKPOINT, NUM_CLASSES)(
        inputs
    )
    # Model
    model = tf.keras.Model(inputs, wav2vec2_model)
    # Loss
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    # Optimizer
    optimizer = keras.optimizers.Adam(learning_rate=1e-5)
    # Compile and return
    model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])
    return model


model = build_model()


TFWav2Vec2Model has backpropagation operations that are NOT supported on CPU. If you wish to train/fine-tine this model, you need a GPU or a TPU
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFWav2Vec2Model: ['project_hid.weight', 'quantizer.codevectors', 'quantizer.weight_proj.weight', 'quantizer.weight_proj.bias', 'project_q.bias', 'project_q.weight', 'project_hid.bias']
- This IS expected if you are initializing TFWav2Vec2Model from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFWav2Vec2Model from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFWav2Vec2Model were initialized from the PyTorch model.
If your task is similar to the task the model of the che

In [26]:
processed_speech_commands_v1["train"]

Dataset({
    features: ['label', 'input_values', 'attention_mask'],
    num_rows: 25544
})

## Training the model

Before we start training our model, we divide the inputs into its
dependent and independent variables.

In [27]:
# Remove targets from training dictionaries
train_x = {x: y for x, y in train.items() if x != "label"}
#test_x = {x: y for x, y in test.items() if x != "label"}

In [28]:
print(type(train_x['attention_mask']))
print(type(train_x['attention_mask'][0]))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [29]:
#print(type(train_x),type(train['label']))
#<class 'dict'> <class 'numpy.ndarray'>
#print(train_x)
'''
{'input_values': array([[ 0.10891221,  0.07141373,  0.0451648 , ..., -0.07858016,
        -0.07108047, -0.08982971],
       [-0.00290692, -0.0036458 , -0.00290692, ...,  0.00078746,
         0.00078746,  0.00152633],
       [ 0.03653951,  0.04970387,  0.03856479, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.1159748 ,  0.3073597 ,  0.33063623, ...,  0.18968385,
         0.24011636,  0.25563404],
       [ 0.006515  , -0.01684613, -0.01569722, ..., -0.01952692,
        -0.0210588 , -0.0172291 ],
       [ 0.0863734 ,  0.09564765,  0.1049219 , ...,  0.10425945,
         0.07776161,  0.24271068]], dtype=float32), 'attention_mask': array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]])}
'''
#print(train['label'])
#[ 6  7  8 ... 11 11  3]
# 30 GB RAM, 18 GB VRAM

"\n{'input_values': array([[ 0.10891221,  0.07141373,  0.0451648 , ..., -0.07858016,\n        -0.07108047, -0.08982971],\n       [-0.00290692, -0.0036458 , -0.00290692, ...,  0.00078746,\n         0.00078746,  0.00152633],\n       [ 0.03653951,  0.04970387,  0.03856479, ...,  0.        ,\n         0.        ,  0.        ],\n       ...,\n       [ 0.1159748 ,  0.3073597 ,  0.33063623, ...,  0.18968385,\n         0.24011636,  0.25563404],\n       [ 0.006515  , -0.01684613, -0.01569722, ..., -0.01952692,\n        -0.0210588 , -0.0172291 ],\n       [ 0.0863734 ,  0.09564765,  0.1049219 , ...,  0.10425945,\n         0.07776161,  0.24271068]], dtype=float32), 'attention_mask': array([[1, 1, 1, ..., 1, 1, 1],\n       [1, 1, 1, ..., 1, 1, 1],\n       [1, 1, 1, ..., 0, 0, 0],\n       ...,\n       [1, 1, 1, ..., 1, 1, 1],\n       [1, 1, 1, ..., 1, 1, 1],\n       [1, 1, 1, ..., 1, 1, 1]])}\n"

In [30]:
print(type(train_x),type(X))
print(type(train['label']),type(y))


<class 'dict'> <class 'transformers.feature_extraction_utils.BatchFeature'>
<class 'numpy.ndarray'> <class 'numpy.ndarray'>


In [31]:
train_x

{'input_values': array([[-2.1842141e-03, -1.2944167e-02, -1.3840830e-02, ...,
          1.2162388e-02,  3.1957619e-03,  3.1957619e-03],
        [ 3.7146423e-05,  3.7146423e-05,  3.7146423e-05, ...,
          3.7146423e-05,  3.7146423e-05,  3.7146423e-05],
        [-2.6119235e-03,  1.0343013e-02, -3.9999527e-03, ...,
         -5.3879819e-03, -5.8506578e-03, -5.8506578e-03],
        ...,
        [ 1.1597480e-01,  3.0735970e-01,  3.3063623e-01, ...,
          1.8968385e-01,  2.4011636e-01,  2.5563404e-01],
        [ 6.5150042e-03, -1.6846132e-02, -1.5697222e-02, ...,
         -1.9526917e-02, -2.1058796e-02, -1.7229101e-02],
        [ 8.6373404e-02,  9.5647648e-02,  1.0492190e-01, ...,
          1.0425945e-01,  7.7761605e-02,  2.4271068e-01]], dtype=float32),
 'attention_mask': array([[1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1],
        ...,
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1]])}

In [32]:

X_new = { 'input_values': np.array(X['input_values']),'attention_mask':np.array(X['attention_mask'])}


And now we can finally start training our model.

In [33]:
model.fit(X_new,y,
    validation_data=(X_new,y),
    batch_size=BATCH_SIZE,
    epochs=MAX_EPOCHS
) #train_x,train["label"],(test_x, test["label"])

Epoch 1/2

InvalidArgumentError: ignored

Great! Now that we have trained our model, we predict the classes
for audio samples in the test set using the `model.predict()` method! We see
the model predictions are not that great as it has been trained on a very small
number of samples for just 1 epoch. For best results, we reccomend training on
the complete dataset for at least 5 epochs!

In [None]:
preds = model.predict(test_x)

Now we try to infer the model we trained on a randomly sampled audio file.
We hear the audio file and then also see how well our model was able to predict!

In [None]:
import IPython.display as ipd

rand_int = random.randint(0, len(test_x))

ipd.Audio(data=np.asarray(test_x["input_values"][rand_int]), autoplay=True, rate=16000)

print("Original Label is ", id2label[str(test["label"][rand_int])])
print("Predicted Label is ", id2label[str(np.argmax((preds[rand_int])))])

Now you can push this model to Hugging Face Model Hub and also share it with with all your friends,
family, favorite pets: they can all load it with the identifier
`"your-username/the-name-you-picked"`, for instance:

```python
model.push_to_hub("wav2vec2-ks", organization="keras-io")
tokenizer.push_to_hub("wav2vec2-ks", organization="keras-io")
```
And after you push your model this is how you can load it in the future!

```python
from transformers import TFWav2Vec2Model

model = TFWav2Vec2Model.from_pretrained("your-username/my-awesome-model", from_pt=True)
```