# Predict on downsampled tracks and save to disk/bucket

In [1]:
import utils
import time

In [2]:
from multiprocessing import set_start_method, cpu_count
set_start_method("spawn")
num_cpus = cpu_count()
print('{} available cpus'.format(num_cpus))

4 available cpus


## 1. Load data

In [3]:
!gsutil -m cp -n -r gs://capstone_datasets/librispeech/test/predictions/* ./predictions/

Skipping existing item: file://./predictions/lr-clean-test-w2v2-base-960h.hf/dataset_info.json
Skipping existing item: file://./predictions/lr-clean-test-w2v2-base-960h.hf/state.json
Skipping existing item: file://./predictions/lr-clean-test-w2v2-base-960h.hf/dataset.arrow


In [4]:
dataset = utils.load_from_disk(utils.os.path.join(utils.predictions_path, 'lr-clean-test-w2v2-base-960h'))

## 2. Downsample

In [5]:
# target sampling rate in Hz; change this value to produce another dataset
# process only one rate per run to avoid out-of-memory (OOM) errors
ds_rate = 8000

In [6]:
print('downsampling to ' + str(ds_rate) + 'Hz...')
before = time.time()
dataset = dataset.map(utils.map_to_downsampled, fn_kwargs={"input_sr": 16000, "output_sr": ds_rate}, num_proc=num_cpus, writer_batch_size=50) # decrease writer_batch_size to avoid OOM issues
print('operation took {}s'.format(round(time.time() - before)))

downsampling to 8000Hz...
#0:   0%|          | 0/655 [00:00<?, ?ex/s]
#1:   0%|          | 0/655 [00:00<?, ?ex/s]
#2:   0%|          | 0/655 [00:00<?, ?ex/s]
#3:   0%|          | 0/655 [00:00<?, ?ex/s]

operation took 306s
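
The body of `utils.map_to_downsampled` is not shown in this notebook. As a rough illustration of what downsampling from 16000 Hz to `ds_rate` involves, here is a minimal NumPy sketch, assuming an integer rate ratio: low-pass filter below the new Nyquist frequency, then keep every n-th sample. The names `lowpass_fir` and `downsample` are hypothetical and not part of `utils`; the actual helper may well use a different resampler.

```python
import numpy as np


def lowpass_fir(cutoff_hz, sample_rate, num_taps):
    """Windowed-sinc (Hamming) low-pass FIR filter with unity gain at DC."""
    n = np.arange(num_taps) - (num_taps - 1) / 2  # tap indices centered on zero
    fc = cutoff_hz / sample_rate                  # cutoff in cycles per sample
    taps = 2 * fc * np.sinc(2 * fc * n) * np.hamming(num_taps)
    return taps / taps.sum()


def downsample(signal, input_sr, output_sr):
    """Low-pass filter, then keep every n-th sample (integer ratios only)."""
    if input_sr % output_sr != 0:
        raise ValueError("this sketch only handles integer rate ratios")
    signal = np.asarray(signal, dtype=np.float64)
    factor = input_sr // output_sr
    if factor == 1:
        return signal
    # Cut off slightly below the new Nyquist frequency (output_sr / 2)
    # so that content above it does not alias into the result.
    taps = lowpass_fir(0.45 * output_sr, input_sr, num_taps=32 * factor + 1)
    # mode="same" keeps the filtered signal aligned with the input
    # (assumes the signal is longer than the filter).
    filtered = np.convolve(signal, taps, mode="same")
    return filtered[::factor]


# Hypothetical usage: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000
tone = np.sin(2 * np.pi * 440 * t)
print(downsample(tone, 16000, 8000).shape)
```

Skipping the low-pass step and simply slicing `signal[::factor]` would fold frequencies above the new Nyquist limit back into the audible band, which is why the filter comes first.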


## 3. Compute prediction

In [7]:
tokenizer, model = utils.load_wav2vec_model("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
print('computing prediction...')
before = time.time()
dataset = dataset.map(utils.map_to_pred, fn_kwargs={"model": model, "tokenizer": tokenizer}, writer_batch_size=1000)
print('operation took {}s'.format(round(time.time() - before)))

computing prediction...
  0%|          | 0/2620 [00:00<?, ?ex/s]

operation took 374s
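
Both timed cells in this notebook repeat the same `time.time()` bookkeeping. One possible way to factor it out is a small context manager. This is only a sketch: `timed` is a hypothetical helper, not something provided by `utils`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, digits=3):
    """Print how long the enclosed block took, rounded to `digits` decimals."""
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        # Record the elapsed time even if the block raised an exception.
        result["seconds"] = round(time.perf_counter() - start, digits)
        print("{} took {}s".format(label, result["seconds"]))


# Hypothetical usage: time a short sleep.
with timed("sleep") as timing:
    time.sleep(0.01)
```

The elapsed time stays available in `timing["seconds"]` after the block, which can be handy when comparing runs at different downsampling rates.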


## 4. Save to disk

In [9]:
dataset.save_to_disk(utils.os.path.join(utils.predictions_path, 'lr_clean_test_ds_' + str(ds_rate) + 'Hz_w2v2_base_960h.hf'))

## 5. Compute WER

In [10]:
wer = utils.wer(dataset["ground_truth"], dataset["transcription"])
print('wer=', round(100 * wer, 1), '%.')

wer= 4.2 %.
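
For reference, word error rate (WER) is the word-level edit distance between reference and hypothesis (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below shows that definition in plain Python. It is an illustration only: `edit_distance` and `word_error_rate` are hypothetical names, and `utils.wer` may be implemented differently (for example by wrapping an existing library) or apply text normalization that this sketch omits.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (rolling-row DP)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing in hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits / total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words = reference.split()
        total_edits += edit_distance(ref_words, hypothesis.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


# Hypothetical usage: one substitution and one deletion over six words.
score = word_error_rate(["the cat sat on the mat"], ["the cat sit on mat"])
print(round(100 * score, 1), "%")
```

Summing edits and words over the whole corpus before dividing weights long utterances more than short ones, which differs from averaging per-utterance WER values.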


## 6. Upload all downsampled datasets to the bucket

In [1]:
!gsutil -m cp -n -r ./predictions/ gs://capstone_datasets/librispeech/test/

Copying file://./predictions/lr_clean_test_ds_2000Hz_w2v2_base_960h/state.json [Content-Type=application/json]...
Copying file://./predictions/lr_clean_test_ds_2000Hz_w2v2_base_960h/dataset_info.json [Content-Type=application/json]...
Copying file://./predictions/lr_clean_test_ds_500Hz_w2v2_base_960h/dataset_info.json [Content-Type=application/json]...
Copying file://./predictions/lr_clean_test_ds_2000Hz_w2v2_base_960h/dataset.arrow [Content-Type=application/octet-stream]...
Copying file://./predictions/lr_clean_test_ds_4000Hz_w2v2_base_960h/dataset_info.json [Content-Type=application/json]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-