# Template notebook to process data and save/load raw predictions to/from bucket

In [1]:
import utils

## 0. Download LibriSpeech test-clean data

In [2]:
# Copy the datasets from the bucket (-m: parallel, -q: quiet, -n: skip files that already exist, -r: recursive)
!gsutil -m -q cp -n -r gs://capstone_datasets/* ./datasets/

In [3]:
# Load the extracted LibriSpeech (lr) data as a dataset
librispeech_eval = utils.load_dataset('datasets/librispeech/test', "clean", split="test")

Resolving data files:   0%|          | 0/2709 [00:00<?, ?it/s]

Using custom data configuration test-505871433d06fd61


Downloading and preparing dataset audiofolder/test to /home/antonin/.cache/huggingface/datasets/audiofolder/test-505871433d06fd61/0.0.0/6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc...
                

Extracting data files:   0%|          | 0/89 [00:00<?, ?it/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset audiofolder downloaded and prepared to /home/antonin/.cache/huggingface/datasets/audiofolder/test-505871433d06fd61/0.0.0/6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc. Subsequent calls will reuse this data.


## 1. Map each audio file to its ground truth

In [4]:
librispeech_eval = librispeech_eval.map(utils.map_to_ground_truth)

  0%|          | 0/2620 [00:00<?, ?ex/s]
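The body of `utils.map_to_ground_truth` is not shown in this notebook. As a rough sketch of what such a mapping might involve: LibriSpeech ships its transcripts in `*.trans.txt` files, one utterance per line, with the utterance ID followed by the upper-case text. The helper names below (`parse_transcripts`, `utterance_id`) and the sample IDs are hypothetical, not taken from `utils`.

```python
import os
from typing import Dict, Iterable


def parse_transcripts(lines: Iterable[str]) -> Dict[str, str]:
    """Map utterance IDs to transcripts from the lines of a `*.trans.txt` file."""
    transcripts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Each line is "<utterance-id> <TRANSCRIPT IN UPPER CASE>".
        utterance, _, text = line.partition(" ")
        transcripts[utterance] = text
    return transcripts


def utterance_id(audio_path: str) -> str:
    """Derive the utterance ID from an audio file name (its stem)."""
    return os.path.splitext(os.path.basename(audio_path))[0]


# Made-up IDs and text, for illustration only.
sample = ["100-200-0000 HELLO WORLD", "100-200-0001 GOOD MORNING"]
lookup = parse_transcripts(sample)
print(lookup[utterance_id("datasets/100/200/100-200-0001.flac")])
```

A mapping function passed to `Dataset.map` could then look up each example's audio path in such a dictionary and store the result in a `ground_truth` column.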

## 2. Load Wav2Vec 2.0 model and tokenizer

In [6]:
tokenizer, model = utils.load_wav2vec_model("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
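The first warning above says the checkpoint's tokenizer class (`Wav2Vec2CTCTokenizer`) differs from the class used to load it (`Wav2Vec2Tokenizer`). The second concerns `wav2vec2.masked_spec_embed`, which, as far as I know, is only used when masking is applied during training, so it should not affect inference. The sketch below is one possible way to load the same checkpoint through `Wav2Vec2Processor`, which should sidestep the tokenizer-class mismatch. It is not the implementation of `utils.load_wav2vec_model`, and the function name is hypothetical.

```python
MODEL_ID = "facebook/wav2vec2-base-960h"


def load_processor_and_model(model_id: str = MODEL_ID):
    """Load the processor and CTC model for a wav2vec 2.0 checkpoint."""
    # Imported lazily so that defining this helper does not require
    # transformers to be installed; calling it does.
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return processor, model
```

Calling `load_processor_and_model()` downloads the checkpoint on first use, so it needs network access and the `transformers` and `torch` packages.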


## 3. Compute predictions

In [7]:
result = librispeech_eval.map(utils.map_to_pred, fn_kwargs={"model": model, "tokenizer": tokenizer})

  0%|          | 0/2620 [00:00<?, ?ex/s]
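`utils.map_to_pred` is not shown here, but the resulting dataset has both a `logits` and a `transcription` column. A common way to turn CTC logits into text is greedy decoding: take the most likely token per frame, collapse consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python. The toy vocabulary, the blank ID and the `|` word delimiter are assumptions made for the example; the real values come from the tokenizer.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_greedy_decode(
    logits: Sequence[Sequence[float]],
    vocab: Sequence[str],
    blank_id: int = 0,
    word_delimiter: str = "|",
) -> str:
    """Greedy CTC decoding of a (frames x vocab) score matrix."""
    # 1. Pick the highest-scoring token ID for every frame.
    frame_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2. Collapse runs of identical consecutive IDs into one.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 3. Drop blanks, then map the remaining IDs to characters.
    chars = [vocab[i] for i in collapsed if i != blank_id]
    # 4. Turn the word delimiter into spaces.
    return "".join(chars).replace(word_delimiter, " ").strip()


def one_hot(ids: Sequence[int], size: int) -> List[List[float]]:
    """Build toy logits where frame t strongly prefers ids[t]."""
    return [[1.0 if i == k else 0.0 for i in range(size)] for k in ids]


# Hypothetical toy vocabulary: index 0 is the CTC blank.
toy_vocab = ["<pad>", "|", "H", "I"]
toy_logits = one_hot([2, 2, 0, 3, 3, 1, 2, 3], len(toy_vocab))
print(ctc_greedy_decode(toy_logits, toy_vocab))  # HI HI
```

The blank is what lets CTC emit a doubled letter: the frame IDs `[2, 0, 2]` decode to `HH`, whereas `[2, 2]` collapses to a single `H`.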

## 4. Compute metric

In [8]:
print('WER with wav2vec2-base-960h on lr-test-clean:', round(100 * utils.wer(result["ground_truth"], result["transcription"]), 1), '%.')

WER with wav2vec2-base-960h on lr-test-clean: 3.4 %.


This matches the WER reported in the original wav2vec 2.0 paper (https://arxiv.org/pdf/2006.11477.pdf, page 15).
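For reference, word error rate is the number of word-level substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below computes it over a whole corpus with a standard edit-distance recurrence. It is an illustration of the metric, not the implementation behind `utils.wer`, and the function names and sample sentences are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = sum(
        edit_distance(r.split(), h.split())
        for r, h in zip(references, hypotheses)
    )
    words = sum(len(r.split()) for r in references)
    return errors / words


refs = ["THE CAT SAT", "HELLO WORLD"]
hyps = ["THE CAT SAT", "HELLO WORD"]
print(round(100 * corpus_wer(refs, hyps), 1), "%")  # 20.0 %
```

Pooling errors and words across all utterances weights long utterances more heavily than averaging per-utterance WER would.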

## 5. Save dataset and prediction to disk

In [9]:
result.save_to_disk(utils.os.path.join(utils.predictions_path, 'lr-clean-test-w2v2-base-960h.hf'))

In [10]:
result

Dataset({
    features: ['audio', 'label', 'ground_truth', 'logits', 'transcription'],
    num_rows: 2620
})
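`save_to_disk` and `load_from_disk` have to point at the same directory, so it can help to build the path in one place and reuse it. A minimal sketch, assuming a `predictions` root folder (the real value lives in `utils.predictions_path`) and a hypothetical `prediction_dir` helper:

```python
import os

# Assumption: stands in for utils.predictions_path.
PREDICTIONS_PATH = "predictions"


def prediction_dir(dataset_tag: str, model_tag: str, root: str = PREDICTIONS_PATH) -> str:
    """Return the directory used for both saving and loading predictions."""
    return os.path.join(root, f"{dataset_tag}-{model_tag}.hf")


path = prediction_dir("lr-clean-test", "w2v2-base-960h")
print(path)
```

The same `path` can then be passed to `result.save_to_disk(path)` and later to `load_from_disk(path)`.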

## 6. Load dataset

In [11]:
dataset = utils.load_from_disk(utils.os.path.join(utils.predictions_path, 'lr-clean-test-w2v2-base-960h.hf'))

In [12]:
dataset

Dataset({
    features: ['audio', 'label', 'ground_truth', 'logits', 'transcription'],
    num_rows: 2620
})

## 7. Send to bucket

In [13]:
!gsutil -m cp -n -r ./predictions/ gs://capstone_datasets/librispeech/test/ 

Skipping existing item: gs://capstone_datasets/librispeech/test/predictions/lr-clean-test-w2v2-base-960h.hf/dataset.arrow
Skipping existing item: gs://capstone_datasets/librispeech/test/predictions/lr-clean-test-w2v2-base-960h.hf/state.json
Skipping existing item: gs://capstone_datasets/librispeech/test/predictions/lr-clean-test-w2v2-base-960h.hf/dataset_info.json


## 8. Copy from bucket and load

In [14]:
!gsutil -m cp -n -r gs://capstone_datasets/librispeech/test/predictions/* ./predictions/

Skipping existing item: file://./predictions/lr-clean-test-w2v2-base-960h.hf/dataset.arrow
Skipping existing item: file://./predictions/lr-clean-test-w2v2-base-960h.hf/dataset_info.json
Skipping existing item: file://./predictions/lr-clean-test-w2v2-base-960h.hf/state.json


In [15]:
dataset = utils.load_from_disk(utils.os.path.join(utils.predictions_path, 'lr-clean-test-w2v2-base-960h.hf'))

In [16]:
dataset

Dataset({
    features: ['audio', 'label', 'ground_truth', 'logits', 'transcription'],
    num_rows: 2620
})