# Convert and Quantize Speech Recognition Models with OpenVINO™
This tutorial demonstrates how to convert the [Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2) speech recognition model to OpenVINO format and apply `INT8` post-training quantization to it using [NNCF](https://github.com/openvinotoolkit/nncf) (Neural Network Compression Framework) together with the [OpenVINO Toolkit](https://docs.openvino.ai/). This notebook uses a fine-tuned [Wav2Vec2-Base-960h](https://huggingface.co/facebook/wav2vec2-base-960h) [PyTorch](https://pytorch.org/) model trained on the [LibriSpeech ASR corpus](https://www.openslr.org/12). The tutorial is designed to be extendable to custom models and datasets. It consists of the following steps:

- Prepare the Wav2Vec2 model and the LibriSpeech dataset using the Hugging Face Transformers library.
- Define data loading and accuracy validation functionality.
- Prepare the model for quantization.
- Run the optimization pipeline.
- Compare the performance of the original and quantized models.

## Imports

In [1]:
import os
import sys
import time
import re
import numpy as np
import torch
import tarfile
from pathlib import Path
from itertools import groupby
import soundfile as sf
import IPython.display as ipd
from tqdm.notebook import tqdm


from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import nncf
from openvino.runtime import Core, serialize
from openvino.tools import mo

sys.path.append("../utils")
from notebook_utils import download_file



## Settings

In [21]:
# Set model directory
MODEL_DIR = Path("model")
MODEL_DIR.mkdir(exist_ok=True)

## Prepare the Model

Wav2Vec2 is a PyTorch model. To convert it to the OpenVINO Intermediate Representation (IR) format, we first export it to ONNX. The model is published on the Hugging Face Hub, so we use the Hugging Face Transformers interface to create the PyTorch model. Following the instructions in the model card, we call the `from_pretrained` method of `Wav2Vec2ForCTC` to create the model instance and load the pretrained weights.
In addition, we use the `Wav2Vec2Processor` class, which provides the model-specific preprocessing and postprocessing steps.



In [4]:
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
torch_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


OpenVINO supports PyTorch\* models through export to the ONNX\* format. We use the `torch.onnx.export` function to obtain the ONNX model; 
see the [PyTorch documentation](https://pytorch.org/docs/stable/onnx.html) for more information. 
We need to provide the model object, example input data for model tracing, and the path where the model will be saved. 
The preferred way to run Wav2Vec2 is to process a whole audio recording at once, so we also provide the `dynamic_axes` parameter to keep the input shapes dynamic after ONNX export.

In [6]:
BATCH_SIZE = 1
MAX_SEQ_LENGTH = 30480


def export_model_to_onnx(model, path):
    # switch model to evaluation mode 
    model.eval()
    # disallow gradient propagation for reducing memory during export
    with torch.no_grad():
        # define dummy input with specific shape
        default_input = torch.zeros([1, MAX_SEQ_LENGTH], dtype=torch.float)
        inputs = {
            "inputs": default_input
        }

        # define names for dynamic dimensions
        symbolic_names = {0: "batch_size", 1: "sequence_len"}
        # export model
        torch.onnx.export(
            model,
            (inputs["inputs"]),
            path,
            opset_version=11,
            input_names=["inputs"],
            output_names=["logits"],
            dynamic_axes={
                "inputs": symbolic_names,
                "logits": symbolic_names,
            },
        )
        print("ONNX model saved to {}".format(path))

onnx_model_path = Path(MODEL_DIR) / "wav2vec2_base.onnx"
if not onnx_model_path.exists():
    export_model_to_onnx(torch_model, onnx_model_path)
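
The sequence axis of `logits` is dynamic as well, because the number of output frames depends on the number of input samples. As a rough sketch of that relationship, the helper below assumes the default Wav2Vec2 base feature encoder (seven unpadded 1D convolutions with the kernel sizes and strides listed in the code) and computes the number of output frames for a given waveform length. `logits_length` is a hypothetical helper name, not part of any library.

```python
# Assumed feature-encoder configuration (default Wav2Vec2 base values).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def logits_length(num_samples: int) -> int:
    """Return the number of output frames for a waveform of num_samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # unpadded 1D convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


print(logits_length(30480))  # 95
print(logits_length(16000))  # 49
```

For the 30480-sample dummy input this gives 95 frames, which agrees with the `1 95 32` output shape reported by `benchmark_app` later in this notebook.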

## Verify ONNX file correctness

The code below checks that the exported ONNX graph is well formed.

In [23]:
import onnx
# Load the ONNX model
onnx_model = onnx.load(onnx_model_path)

# Check that the model is well formed
onnx.checker.check_model(onnx_model)

## Convert the ONNX Model to OpenVINO IR

While ONNX models are directly supported by OpenVINO™, it can be useful to convert them to IR format to take advantage of OpenVINO optimization tools and features.
The `mo.convert` function converts the model using OpenVINO Model Optimizer capabilities. 
It returns an instance of the OpenVINO Model class, which is ready to use in the Python interface and can be serialized to IR with the `serialize` function for later use.

In [7]:
ir_model_xml = onnx_model_path.with_suffix(".xml")
core = Core()

if not ir_model_xml.exists():
    ov_model = mo.convert(input_model=onnx_model_path, data_type='FP16')
    serialize(ov_model, str(ir_model_xml))
else:
    ov_model = core.read_model(ir_model_xml)
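
The conversion above requests `FP16`. Assuming this means that weights are stored in half precision, the sketch below gives a rough feel for the rounding involved by round-tripping a few values through IEEE 754 half precision. It uses only the Python standard library, not OpenVINO, and `to_fp16` is a hypothetical helper name.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest half-precision value."""
    # "<e" is the struct format code for a little-endian 16-bit float
    return struct.unpack("<e", struct.pack("<e", value))[0]


for value in (1.0, 0.1, 1000.1):
    rounded = to_fp16(value)
    print(f"{value} -> {rounded} (abs. error {abs(value - rounded):.6f})")
```

Values that need more than about three significant decimal digits lose precision, which is usually tolerable for weights, as the matching transcriptions and WER values for the `FP16` model in this notebook suggest.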

## Validate model inference

### Prepare LibriSpeech Dataset

The Wav2Vec2 model was pretrained on the `LibriSpeech` dataset. The code below downloads the dataset using the Hugging Face `datasets` library.
> NOTE: To save time, we use a small [dummy subset](https://huggingface.co/datasets/patrickvonplaten/librispeech_asr_dummy). To reproduce the reference accuracy, use the [full dataset version](https://huggingface.co/datasets/librispeech_asr).


In [22]:
# load the validation split of the small dummy LibriSpeech subset
librispeech_eval = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# define preprocessing function for converting audio to input values for model
def map_to_input(batch):
    preprocessed_signal = processor(batch["audio"]["array"], return_tensors="pt", padding="longest", sampling_rate=batch['audio']['sampling_rate'])
    input_values = preprocessed_signal.input_values
    batch['input_values'] = input_values
    return batch

# apply preprocessing function to dataset and remove audio column, to save memory as we do not need it anymore
dataset = librispeech_eval.map(map_to_input, batched=False, remove_columns=["audio"])





  0%|          | 0/73 [00:00<?, ?ex/s]
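
To give an idea of what the preprocessing step does to the raw audio, here is a minimal sketch of per-utterance zero-mean, unit-variance normalization. It assumes that the processor normalizes the waveform this way (as the Hugging Face Wav2Vec2 feature extractor does when normalization is enabled). The function name and the `eps` value are illustrative assumptions, and the real processor works on NumPy arrays rather than Python lists.

```python
import math


def zero_mean_unit_variance(signal, eps=1e-7):
    """Normalize a 1D sequence of samples to zero mean and unit variance."""
    mean = sum(signal) / len(signal)
    variance = sum((x - mean) ** 2 for x in signal) / len(signal)
    # eps avoids division by zero for a constant (silent) signal
    scale = math.sqrt(variance + eps)
    return [(x - mean) / scale for x in signal]


raw = [0.0, 0.5, -0.5, 1.0, -1.0]
normalized = zero_mean_unit_variance(raw)
print(normalized)
```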

Let's look at what a dataset sample contains and make sure that the input values are now part of each sample.

In [39]:
dataset.info.features

{'file': Value(dtype='string', id=None),
 'text': Value(dtype='string', id=None),
 'speaker_id': Value(dtype='int64', id=None),
 'chapter_id': Value(dtype='int64', id=None),
 'id': Value(dtype='string', id=None),
 'input_values': Sequence(feature=Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None), length=-1, id=None)}

### Inference on audio sample

In [10]:
# inference function for pytorch
def torch_infer(model, sample):
    logits = model(torch.Tensor(sample['input_values'])).logits
    # take argmax and decode
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    return transcription

# inference function for openvino
def ov_infer(model, sample):
    output = model.output(0)
    logits = model(np.array(sample['input_values']))[output]
    predicted_ids = np.argmax(logits, axis=-1)
    transcription = processor.batch_decode(torch.from_numpy(predicted_ids))
    return transcription

In [25]:

sample = dataset[0]
torch_transcription = torch_infer(torch_model, sample)
# compile openvino model
compiled_model = core.compile_model(ov_model, 'CPU')
ov_transcription = ov_infer(compiled_model, sample)
print(f"Annotation text: {sample['text']}")
print(f"[PT] Prediction text: {torch_transcription[0]}")
print(f'[OV FP16] Prediction text {ov_transcription[0]}')

Annotation text: MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
[PT] Prediction text: MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
[OV FP16] Prediction text MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
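
Both inference functions take the `argmax` over the logits and pass the resulting IDs to `processor.batch_decode`. For a CTC model, greedy decoding conceptually collapses runs of repeated IDs, drops the blank token, and maps the remaining IDs to characters. The sketch below illustrates that idea with a hypothetical toy vocabulary. `VOCAB`, `BLANK_ID`, and `ctc_greedy_decode` are illustrative names, and the real processor uses the model's own vocabulary and token IDs.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token.
VOCAB = ["<pad>", "|", "A", "C", "T"]
BLANK_ID = 0
WORD_DELIMITER = "|"


def ctc_greedy_decode(ids):
    """Collapse repeated IDs, drop blanks, and map the rest to characters."""
    collapsed = [token_id for token_id, _ in groupby(ids)]
    chars = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# one predicted ID per output frame
frame_ids = [0, 3, 3, 0, 2, 2, 2, 4, 0, 1, 1, 2, 0, 4, 4]
print(ctc_greedy_decode(frame_ids))  # CAT AT
```

The blank token is what allows a repeated letter to survive decoding: two identical IDs separated by a blank are kept as two characters.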


### Validate model accuracy on dataset

For accuracy evaluation, we use the Word Error Rate (WER) metric, a common measure of the performance of an automatic speech recognition (ASR) system. 
WER is the number of word-level errors (substitutions, deletions, and insertions) divided by the number of words in the reference text. The lower the value, the better the ASR system; a WER of 0 is a perfect score.
More details about the metric can be found on this [page](https://en.wikipedia.org/wiki/Word_error_rate). 
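
To make the metric concrete, here is a minimal pure-Python sketch that computes WER for a single reference and hypothesis pair with a word-level edit distance. It is an illustration only, and `word_error_rate` is a hypothetical helper name. The notebook itself relies on `torchmetrics`, which accumulates errors over the whole dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the processed
    # prefix of ref and the first j words of hyp
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "WE ARE GLAD TO WELCOME HIS GOSPEL"
hypothesis = "WE ARE GLAD TO WELCOME THIS GOSPEL"
print(f"{word_error_rate(reference, hypothesis):.4f}")  # 0.1429
```

One substituted word out of seven reference words gives a WER of 1/7. Because insertions also count as errors, WER can exceed 1 when the hypothesis is much longer than the reference.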

In [12]:
from torchmetrics import WER
from tqdm.notebook import tqdm

def compute_wer(dataset, model, infer_fn):
    wer = WER()
    for sample in tqdm(dataset):
        # run infer function on sample
        transcription = infer_fn(model, sample)
        # update metric on sample result
        wer.update(transcription, [sample['text']])
    # finalize metric calculation
    result = wer.compute()
    return result





In [13]:

pt_result = compute_wer(dataset, torch_model, torch_infer)
print(f'[PYTORCH] Word Error Rate: {pt_result:.4f}')
ov_result = compute_wer(dataset, compiled_model, ov_infer)
print(f'[OV] Word Error Rate: {ov_result:.4f}')



  0%|          | 0/73 [00:00<?, ?it/s]

[PYTORCH] Word Error Rate: 0.0530


  0%|          | 0/73 [00:00<?, ?it/s]

[OV] Word Error Rate: 0.0530


## Optimize model using NNCF PTQ

### Define DataLoader for PTQ

For quantization, we reuse the validation dataset. The `nncf.create_dataloader` interface prepares the dataset for quantization; its `transform_fn` argument is a transformation function that extracts the model input data from each sample.

In [15]:
def transform_fn(batch_item):
    return np.array(batch_item['input_values'])

quantization_dataset = nncf.create_dataloader(dataset, transform_fn=transform_fn)

## Run Quantization
The `nncf.quantize` function provides an interface for model quantization. It accepts the model, the quantization dataset and, optionally, additional parameters such as `preset` or `model_type`. Wav2Vec2 is based on the Transformer architecture, so we pass `model_type='transformer'` for proper configuration.

In [16]:
# TO DO ignored scope
quantized_model = nncf.quantize(ov_model, quantization_dataset, model_type='transformer')

# serialize int8 IR next to the FP16 one (for example, wav2vec2_base_int8.xml)
compressed_model_xml = ir_model_xml.with_name(ir_model_xml.stem + '_int8.xml')
serialize(quantized_model, str(compressed_model_xml))
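
As a rough intuition for what `INT8` quantization means, the sketch below applies symmetric per-tensor quantization to a handful of values: floats are mapped to integers in `[-127, 127]` with a single scale and then mapped back. This is a simplified illustration under stated assumptions, not the algorithm NNCF runs. NNCF additionally uses the calibration dataset to estimate activation ranges, and the function names here are hypothetical.

```python
def quantize_symmetric_int8(values):
    """Map floats to integers in [-127, 127] using one per-tensor scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integer codes back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 0.127, 1.27, -1.0]
codes, scale = quantize_symmetric_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The reconstruction error of each value is bounded by half the scale, so a single outlier with a large magnitude makes all other values coarser. This is one reason why the choice of calibration data matters.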

## Model Usage Example with Inference Pipeline
Both the initial (`FP16`) and quantized (`INT8`) models are used in exactly the same way.

Take one sample from the dataset, compile the quantized model, and run the same inference function on it.

In [None]:
compiled_int8_model = core.compile_model(quantized_model, 'CPU')
transcript = ov_infer(compiled_int8_model, dataset[0])
print(f'Predicted text: {transcript[0]}')
print(f"Annotation text: {dataset[0]['text']}")

Now we can measure the accuracy of the `INT8` model.

In [17]:
ov_int8_result = compute_wer(dataset, compiled_int8_model, ov_infer)

print(f'OV FP16 Word Error Rate: {ov_result:.4f}')
print(f'OV INT8 Word Error Rate : {ov_int8_result:.4f}')

  0%|          | 0/73 [00:00<?, ?it/s]

OV FP16 Word Error Rate: 0.0530
OV INT8 Word Error Rate : 0.5757


## Compare Performance of the Original and Quantized Models
Finally, use [Benchmark Tool](https://docs.openvino.ai/latest/openvino_inference_engine_tools_benchmark_tool_README.html) to measure the inference performance of the `FP16` and `INT8` models.

> NOTE: For more accurate performance results, it is recommended to run `benchmark_app` in a terminal/command prompt after closing other applications. Run `benchmark_app -m model.xml -d CPU` to benchmark async inference on CPU for one minute. Change `CPU` to `GPU` to benchmark on GPU. Run `benchmark_app --help` to see an overview of all command-line options.
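
For a quick check from Python, a simple synchronous timing loop can complement `benchmark_app`. The sketch below is a generic helper that times any zero-argument callable. `measure_latency` and its parameters are hypothetical names, and the workload used here is only a stand-in so that the example runs without a model.

```python
import statistics
import time


def measure_latency(infer_fn, num_warmup=3, num_iterations=20):
    """Time a zero-argument callable and return simple latency statistics."""
    # warm-up runs are not measured (caches, lazy initialization)
    for _ in range(num_warmup):
        infer_fn()
    timings_ms = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        infer_fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return {
        "median_ms": statistics.median(timings_ms),
        "mean_ms": statistics.mean(timings_ms),
        "throughput_per_s": 1000 * num_iterations / sum(timings_ms),
    }


# stand-in workload used only to make the sketch runnable
stats = measure_latency(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

In this notebook, the callable could wrap a compiled model, for example `lambda: compiled_model(np.zeros((1, 30480), dtype=np.float32))`. Such a loop measures synchronous latency only, so its numbers are not directly comparable with the async throughput reported by `benchmark_app`.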

In [19]:
# Inference FP16 model (OpenVINO IR)
! benchmark_app -m $ir_model_xml -shape [1,30480] -d CPU -api async

[Step 1/11] Parsing and validating input arguments




[Step 2/11] Loading OpenVINO
[ INFO ] OpenVINO:
         API version............. 2022.2.0-7713-af16ea1d79a-releases/2022/2
[ INFO ] Device info
         CPU
         openvino_intel_cpu_plugin version 2022.2
         Build................... 2022.2.0-7713-af16ea1d79a-releases/2022/2

[Step 3/11] Setting device configuration
[Step 4/11] Reading network files
[ INFO ] Read model took 539.39 ms
[Step 5/11] Resizing network to match image sizes and given batch
[ INFO ] Reshaping model: 'inputs': {1,30480}
[ INFO ] Reshape model took 106.88 ms
[ INFO ] Network batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model input 'inputs' precision f32, dimensions ([...]): 1 30480
[ INFO ] Model output 'logits' precision f32, dimensions ([...]): 1 95 32
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 915.64 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] DEVICE: CPU
[ INFO ]   AVAILABLE_DEVICES  , ['']
[ INFO ]   RANGE_FOR_ASYNC_INFER_REQUESTS  , (

In [27]:
# Inference INT8 model (OpenVINO IR)
! benchmark_app -m $compressed_model_xml -shape [1,30480] -d CPU -api async

[Step 1/11] Parsing and validating input arguments




[Step 2/11] Loading OpenVINO
[ INFO ] OpenVINO:
         API version............. 2022.2.0-7713-af16ea1d79a-releases/2022/2
[ INFO ] Device info
         CPU
         openvino_intel_cpu_plugin version 2022.2
         Build................... 2022.2.0-7713-af16ea1d79a-releases/2022/2

[Step 3/11] Setting device configuration
[Step 4/11] Reading network files
[ INFO ] Read model took 548.32 ms
[Step 5/11] Resizing network to match image sizes and given batch
[ INFO ] Reshaping model: 'inputs': {1,30480}
[ INFO ] Reshape model took 88.18 ms
[ INFO ] Network batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model input 'inputs' precision f32, dimensions ([...]): 1 30480
[ INFO ] Model output 'logits' precision f32, dimensions ([...]): 1 95 32
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 757.41 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] DEVICE: CPU
[ INFO ]   AVAILABLE_DEVICES  , ['']
[ INFO ]   RANGE_FOR_ASYNC_INFER_REQUESTS  , (


