# Test of wav2vec for norwegian

**Author:** [Computas AS](https://github.com/computas) ([kontakt@computas.com](mailto:kontakt@computas.com))

**Achievement:** *[Short, preferably single-line, statement of what has been accomplished. For example, "Assuming ... and using ... we show that ...".]*

## Introduction

This is a simple quality test of Facebook's wav2vec ASR system.

Based on the code from: 
- https://github.com/pytorch/fairseq/tree/master/examples/wav2vec

# Reproducibility and code formatting

In [1]:
# To watermark the environment
%load_ext watermark

# For automatic code formatting in jupyter lab.
%load_ext lab_black

# For automatic code formatting in jupyter notebook
%load_ext nb_black

# ASR

In [3]:
# Download wav2vec large
!wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_large.pt

--2020-07-09 13:04:08--  https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_large.pt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)...104.22.75.142, 172.67.9.4, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response...200 OK
Length: 325396342 (310M) [application/octet-stream]
Saving to: ‘wav2vec_large.pt.2’


2020-07-09 13:04:47 (8,11 MB/s) - ‘wav2vec_large.pt.2’ saved [325396342/325396342]



In [None]:
import torch
from fairseq.models.wav2vec import Wav2VecModel

In [6]:
cp = torch.load('wav2vec_large.pt',map_location=torch.device('cpu'))
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

Wav2VecModel(
  (feature_extractor): ConvFeatureExtractionModel(
    (conv_layers): ModuleList(
      (0): Sequential(
        (0): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): Fp32GroupNorm(1, 512, eps=1e-05, affine=True)
        (3): ReLU()
      )
      (1): Sequential(
        (0): Conv1d(512, 512, kernel_size=(8,), stride=(4,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): Fp32GroupNorm(1, 512, eps=1e-05, affine=True)
        (3): ReLU()
      )
      (2): Sequential(
        (0): Conv1d(512, 512, kernel_size=(4,), stride=(2,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): Fp32GroupNorm(1, 512, eps=1e-05, affine=True)
        (3): ReLU()
      )
      (3): Sequential(
        (0): Conv1d(512, 512, kernel_size=(4,), stride=(2,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): Fp32GroupNorm(1, 512, eps=1e-05, affine=True)
        (3): ReLU()
      )
 

Wav2VecModel(
  (feature_extractor): ConvFeatureExtractionModel(
    (conv_layers): ModuleList(
      (0): Sequential(
        (0): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): Fp32GroupNorm(1, 512, eps=1e-05, affine=True)
        (3): ReLU()
      )
      (1): Sequential(
        (0): Conv1d(512, 512, kernel_size=(8,), stride=(4,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): Fp32GroupNorm(1, 512, eps=1e-05, affine=True)
        (3): ReLU()
      )
      (2): Sequential(
        (0): Conv1d(512, 512, kernel_size=(4,), stride=(2,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): Fp32GroupNorm(1, 512, eps=1e-05, affine=True)
        (3): ReLU()
      )
      (3): Sequential(
        (0): Conv1d(512, 512, kernel_size=(4,), stride=(2,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): Fp32GroupNorm(1, 512, eps=1e-05, affine=True)
        (3): ReLU()
      )
 

In [7]:
# Test that it works
wav_input_16khz = torch.randn(1,10000)
# print(wav_input_16khz)
z = model.feature_extractor(wav_input_16khz)
c = model.feature_aggregator(z)
c

tensor([[[1.5352e-02, 2.5548e-02, 2.2611e-02,  ..., 1.8008e-02,
          1.9221e-02, 2.5454e-02],
         [2.5437e-04, 1.0058e-03, 1.0113e-03,  ..., 1.5988e-02,
          5.0164e-02, 2.9576e-02],
         [7.5008e-03, 0.0000e+00, 9.1924e-04,  ..., 1.7227e-03,
          1.8247e-02, 4.3611e-03],
         ...,
         [6.1493e-03, 8.0292e-04, 6.0493e-04,  ..., 0.0000e+00,
          0.0000e+00, 4.4917e-03],
         [1.0668e-01, 1.5695e-01, 1.7490e-01,  ..., 4.5870e-01,
          2.5599e-01, 2.8228e-01],
         [2.5950e-01, 2.8349e-01, 2.6760e-01,  ..., 3.3884e-01,
          3.5796e-01, 3.6244e-01]]], grad_fn=<MulBackward0>)

In [8]:
#dir(wav_input_16khz)
wav_input_16khz.size()

torch.Size([1, 10000])

We test with a custom file

In [9]:
import librosa

In [10]:
wav_input = librosa.load('data/solberg.wav')
#print(wav_input)
tensors = torch.from_numpy(wav_input[0]).reshape(1,wav_input[0].size)
z = model.feature_extractor(tensors)
c = model.feature_aggregator(z)
c

tensor([[[0.0635, 0.0635, 0.0635,  ..., 0.0279, 0.0398, 0.0354],
         [0.0221, 0.0221, 0.0221,  ..., 0.1054, 0.0568, 0.1092],
         [0.0005, 0.0005, 0.0005,  ..., 0.0977, 0.3109, 0.3539],
         ...,
         [0.0000, 0.0000, 0.0000,  ..., 0.0037, 0.0084, 0.0127],
         [0.4041, 0.4041, 0.4041,  ..., 0.1282, 0.0406, 0.1174],
         [0.0088, 0.0088, 0.0088,  ..., 0.0772, 0.1449, 0.2367]]],
       grad_fn=<MulBackward0>)

# Watermarks of the enviroment

Make sure you run this last!

In [11]:
%watermark -gb -iv -m -v

librosa 0.7.2
torch   1.5.1
CPython 3.7.7
IPython 7.16.1

compiler   : GCC 7.3.0
system     : Linux
release    : 5.3.0-62-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit
Git hash   : 523878ff55a626c559575bc60fca2cc349e830a4
Git branch : master
