# Quantize Speech Recognition Models with accuracy control using NNCF PTQ API
This tutorial demonstrates how to apply `INT8` quantization with accuracy control to the [Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2) speech recognition model, using the post-training quantization API of NNCF (Neural Network Compression Framework), that is, without a fine-tuning pipeline. The notebook uses the fine-tuned [Wav2Vec2-Base-960h](https://huggingface.co/facebook/wav2vec2-base-960h) [PyTorch](https://pytorch.org/) model trained on the [LibriSpeech ASR corpus](https://www.openslr.org/12). The tutorial is designed to be extendable to custom models and datasets. It consists of the following steps:

- Download and prepare the Wav2Vec2 model and LibriSpeech dataset.
- Define data loading and accuracy validation functionality.
- Quantize the model with accuracy control.
- Compare the accuracy of the original PyTorch model and the quantized OpenVINO INT8 model.
- Compare performance of the original and quantized models.

The advanced quantization flow applies 8-bit quantization to the model while controlling an accuracy metric. This is achieved by keeping the most impactful operations within the model in the original precision. The flow is based on [Basic 8-bit quantization](https://docs.openvino.ai/2023.0/basic_quantization_flow.html) and differs from it as follows:

- Besides the calibration dataset, a validation dataset is required to compute the accuracy metric. Both datasets can refer to the same data in the simplest case.
- A validation function, used to compute the accuracy metric, is required. It can be a function that is already available in the source framework or a custom function.
- Since accuracy validation is run several times during the quantization process, quantization with accuracy control can take more time than the Basic 8-bit quantization flow.
- The resulting model can provide a smaller performance improvement than the Basic 8-bit quantization flow because some of the operations are kept in the original precision.
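
The idea of keeping the most impactful operations in the original precision can be illustrated with a toy loop. This is only a conceptual sketch, not NNCF's actual implementation: the operation names and per-operation drops are hypothetical, and it assumes (unrealistically) that the total accuracy drop is the sum of the individual drops.

```python
def restore_until_within_budget(op_drops, max_drop):
    """Revert the most harmful operations until the drop fits the budget.

    op_drops maps a hypothetical operation name to the accuracy drop that
    quantizing it contributes. Assumption: drops are additive.
    """
    remaining = dict(op_drops)
    reverted = []
    # Rank operations from the largest to the smallest contribution.
    for name in sorted(op_drops, key=op_drops.get, reverse=True):
        if sum(remaining.values()) <= max_drop:
            break
        # Keep this operation in the original precision.
        reverted.append(name)
        del remaining[name]
    return reverted, sum(remaining.values())


# Hypothetical numbers, chosen only for illustration.
toy_drops = {"conv_0": 0.002, "matmul_3": 0.03, "matmul_7": 0.015, "add_1": 0.001}
reverted, final_drop = restore_until_within_budget(toy_drops, max_drop=0.01)
print(reverted, round(final_drop, 4))
```

In this toy setting, the more operations are reverted, the smaller the remaining drop and the smaller the expected speed-up, which mirrors the trade-off described in the last bullet.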

> **NOTE**: Currently, 8-bit quantization with accuracy control is available only for models in OpenVINO representation.

The steps for the quantization with accuracy control are described below.

<a id="0"></a>
### Table of contents:
- [Imports](#1)
- [Settings](#2)
- [Prepare the Model](#3)
- [Prepare LibriSpeech Dataset](#4)
- [Define DataLoader](#5)
- [Prepare calibration and validation datasets](#6)
- [Prepare validation function](#7)
- [Run quantization with accuracy control](#8)
- [Model Usage Example](#9)
- [Compare Performance of the Original and Quantized Models](#10)

In [None]:
# !pip install -q "openvino-dev>=2023.1.0" "nncf>=2.6.0"
!pip install openvino==2023.1.0.dev20230728
!pip install git+https://github.com/openvinotoolkit/nncf.git@develop
!pip install -q soundfile librosa transformers torch

<a id="1"></a>
## Imports [&#8657;](#0)

In [None]:
import os
import sys
import tarfile
import torch

from transformers import Wav2Vec2ForCTC

sys.path.append("../utils")
from notebook_utils import download_file

<a id="2"></a>
## Settings [&#8657;](#0)

In [None]:
from pathlib import Path

# Set the data and model directories, model source URL and model filename.
MODEL_DIR = Path("model")
DATA_DIR = Path("../data/datasets/librispeech")
MODEL_DIR.mkdir(exist_ok=True)
# DATA_DIR is nested, so create missing parent directories as well.
DATA_DIR.mkdir(parents=True, exist_ok=True)

<a id="3"></a>
## Prepare the Model [&#8657;](#0)
Perform the following:
- Download and unpack a pre-trained Wav2Vec2 model.
- Run model conversion API to convert the model to the OpenVINO Intermediate Representation (OpenVINO IR).

In [None]:
download_file("https://huggingface.co/facebook/wav2vec2-base-960h/resolve/main/pytorch_model.bin", directory=Path(MODEL_DIR) / 'pytorch', show_progress=True)
download_file("https://huggingface.co/facebook/wav2vec2-base-960h/resolve/main/config.json", directory=Path(MODEL_DIR) / 'pytorch', show_progress=False)

Load the original PyTorch model, then convert it to the OpenVINO Intermediate Representation (OpenVINO IR).

In [None]:
BATCH_SIZE = 1
MAX_SEQ_LENGTH = 30480


torch_model = Wav2Vec2ForCTC.from_pretrained(Path(MODEL_DIR) / 'pytorch')

In [None]:
import openvino


default_input = torch.zeros([BATCH_SIZE, MAX_SEQ_LENGTH], dtype=torch.float)
ov_model = openvino.convert_model(torch_model, example_input=default_input)

<a id="4"></a>
## Prepare LibriSpeech Dataset [&#8657;](#0)

Use the code below to download and unpack the archives with 'dev-clean' and 'test-clean' subsets of LibriSpeech Dataset.

In [None]:
download_file("http://openslr.elda.org/resources/12/dev-clean.tar.gz", directory=DATA_DIR, show_progress=True)
download_file("http://openslr.elda.org/resources/12/test-clean.tar.gz", directory=DATA_DIR, show_progress=True)

if not os.path.exists(f'{DATA_DIR}/LibriSpeech/dev-clean'):
    with tarfile.open(f"{DATA_DIR}/dev-clean.tar.gz") as tar:
        tar.extractall(path=DATA_DIR)
if not os.path.exists(f'{DATA_DIR}/LibriSpeech/test-clean'):
    with tarfile.open(f"{DATA_DIR}/test-clean.tar.gz") as tar:
        tar.extractall(path=DATA_DIR)

<a id="5"></a>
## Define DataLoader [&#8657;](#0)
The Wav2Vec2 model accepts a raw waveform of the speech signal as input and produces vocabulary class estimations as output. Since the dataset contains audio files in FLAC format, use the `soundfile` package to read them as waveforms.

> **NOTE**: Consider increasing `samples_limit` to get more precise results. A value of `300` or more is suggested, although larger values take longer to process.
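
The loader below pairs each transcript line with its audio file. The following minimal sketch shows that parsing step in isolation, using the same regular expression. The helper name `parse_transcript_line` and the transcript text are made up for illustration; the path follows the LibriSpeech layout used later in this notebook.

```python
import re
from pathlib import Path

# Same pattern as in the data loader: an utterance ID, whitespace, then the text.
LINE_PATTERN = re.compile(r"([0-9\-]+)\s+(.+)")


def parse_transcript_line(line, txt_path):
    """Return (path to the .flac file, upper-cased transcript) or None."""
    match = LINE_PATTERN.search(line)
    if match is None:
        return None
    utterance_id, transcript = match.group(1), match.group(2)
    # The audio file sits next to the transcript file and is named after the ID.
    flac_path = (Path(txt_path).parent / utterance_id).with_suffix(".flac")
    return flac_path, transcript.upper()


# Illustrative input: the transcript text is not taken from the dataset.
txt_file = "LibriSpeech/test-clean/121/127105/121-127105.trans.txt"
flac_path, transcript = parse_transcript_line("121-127105-0017 hello world\n", txt_file)
print(flac_path.name, "|", transcript)
```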

In [None]:
import re
import numpy as np
import soundfile


class LibriSpeechDataLoader:

    @staticmethod
    def read_flac(file_name):
        speech, samplerate = soundfile.read(file_name)
        assert samplerate == 16000, "read_flac: only 16kHz supported!"
        return speech

    # Required methods
    def __init__(self, config, samples_limit=300):
        """Constructor
        :param config: data loader specific config
        """
        self.samples_limit = samples_limit
        self._data_dir = config["data_source"]
        self._ds = []
        self._prepare_dataset()

    def __len__(self):
        """Returns size of the dataset"""
        return len(self._ds)

    def __getitem__(self, index):
        """
        Returns annotation, data and metadata at the specified index.
        Possible formats:
        (index, annotation), data
        (index, annotation), data, metadata
        """
        label = self._ds[index][0]
        inputs = {'inputs': np.expand_dims(self._ds[index][1], axis=0)}
        return label, inputs

    # Methods specific to the current implementation
    def _prepare_dataset(self):
        pattern = re.compile(r'([0-9\-]+)\s+(.+)')
        data_folder = Path(self._data_dir)
        txts = list(data_folder.glob('**/*.txt'))
        counter = 0
        for txt in txts:
            # read_text() closes the file after reading.
            content = txt.read_text().splitlines()
            for line in content:
                res = pattern.search(line)
                if not res:
                    continue
                name = res.group(1)
                transcript = res.group(2)
                fname = txt.parent / name
                fname = fname.with_suffix('.flac')
                identifier = str(fname.relative_to(data_folder))
                self._ds.append(((counter, transcript.upper()), LibriSpeechDataLoader.read_flac(os.path.join(self._data_dir, identifier))))
                counter += 1
                if counter >= self.samples_limit:
                    # Limit exceeded
                    return

<a id="6"></a>
## Prepare calibration and validation datasets [&#8657;](#0)

In [None]:
import nncf

def transform_fn(data_item):
    """
    Extract the model's input from the data item.
    The data item here is the data item that is returned from the data source per iteration.
    This function should be passed when the data item cannot be used as model's input.
    """
    _, inputs = data_item

    return inputs["inputs"]


dataset_config = {"data_source": os.path.join(DATA_DIR, "LibriSpeech/dev-clean")}
data_loader = LibriSpeechDataLoader(dataset_config, samples_limit=300)
calibration_dataset = nncf.Dataset(data_loader, transform_fn)
dataset_config = {"data_source": os.path.join(DATA_DIR, "LibriSpeech/test-clean")}
test_data_loader = LibriSpeechDataLoader(dataset_config, samples_limit=300)
validation_dataset = nncf.Dataset(test_data_loader, transform_fn)

<a id="7"></a>
## Prepare validation function [&#8657;](#0)
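Define a function that decodes the predicted logits to text. `decode_logits` takes the most likely token per frame, merges consecutive repeats, removes padding tokens, and replaces word delimiters with spaces.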

In [None]:
from itertools import groupby


def decode_logits(logits):
    decoding_vocab = dict(enumerate(MetricWER.alphabet))
    token_ids = np.squeeze(np.argmax(logits, -1))
    tokens = [decoding_vocab[idx] for idx in token_ids]
    tokens = [token_group[0] for token_group in groupby(tokens)]
    tokens = [t for t in tokens if t != MetricWER.pad_token]
    res_string = ''.join([t if t != MetricWER.words_delimiter else ' ' for t in tokens]).strip()
    # Collapse runs of whitespace into single spaces.
    res_string = ' '.join(res_string.split())
    res_string = res_string.lower()

    return res_string
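
To see the decoding steps on a small input, here is a dependency-free sketch of the same greedy procedure. It is an illustration rather than a replacement for `decode_logits`: the function name `greedy_ctc_decode`, the `one_hot` helper, and the frame sequence are hypothetical, while the alphabet is the one used by `MetricWER`.

```python
from itertools import groupby

ALPHABET = [
    "<pad>", "<s>", "</s>", "<unk>", "|",
    "e", "t", "a", "o", "n", "i", "h", "s", "r", "d", "l", "u",
    "m", "w", "c", "f", "g", "y", "p", "b", "v", "k", "'", "x", "j", "q", "z"]


def greedy_ctc_decode(frame_scores, pad_token="<pad>", delimiter="|"):
    # 1. Pick the highest-scoring token for every frame.
    ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    tokens = [ALPHABET[i] for i in ids]
    # 2. Merge consecutive repeats of the same token.
    tokens = [token for token, _ in groupby(tokens)]
    # 3. Drop padding tokens (they separate genuine double letters).
    tokens = [token for token in tokens if token != pad_token]
    # 4. Turn word delimiters into spaces and normalize whitespace.
    text = "".join(" " if token == delimiter else token for token in tokens)
    return " ".join(text.split())


def one_hot(token):
    """Hypothetical helper: scores that put all mass on one token."""
    return [1.0 if candidate == token else 0.0 for candidate in ALPHABET]


frames = [one_hot(t) for t in ["h", "h", "i", "|", "t", "o", "<pad>", "o"]]
decoded_text = greedy_ctc_decode(frames)
print(decoded_text)
```

The repeated `h` frames collapse into one letter, while the padding frame between the two `o` frames keeps them as separate letters.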

Define the `MetricWER` class to calculate the Word Error Rate (WER).

In [None]:
class MetricWER:
    alphabet = [
        "<pad>", "<s>", "</s>", "<unk>", "|",
        "e", "t", "a", "o", "n", "i", "h", "s", "r", "d", "l", "u",
        "m", "w", "c", "f", "g", "y", "p", "b", "v", "k", "'", "x", "j", "q", "z"]
    words_delimiter = '|'
    pad_token = '<pad>'

    # Required methods
    def __init__(self):
        self._name = "WER"
        self._sum_score = 0
        self._sum_words = 0
        self._cur_score = 0
        self._decoding_vocab = dict(enumerate(self.alphabet))

    @property
    def value(self):
        """Returns accuracy metric value for the last model output."""
        return {self._name: self._cur_score}

    @property
    def avg_value(self):
        """Returns accuracy metric value for all model outputs."""
        return {self._name: self._sum_score / self._sum_words if self._sum_words != 0 else 0}

    def update(self, output, target):
        """
        Updates prediction matches.

        :param output: model output
        :param target: annotations
        """
        decoded = [decode_logits(i) for i in output]
        target = [i.lower() for i in target]
        assert len(output) == len(target), "sizes of output and target mismatch!"
        for i in range(len(output)):
            # Pass the reference first so that WER is normalized by its word count.
            self._get_metric_per_sample(target[i], decoded[i])

    def reset(self):
        """
        Resets collected matches
        """
        self._sum_score = 0
        self._sum_words = 0

    def get_attributes(self):
        """
        Returns a dictionary of metric attributes {metric_name: {attribute_name: value}}.
        Required attributes: 'direction': 'higher-better' or 'higher-worse'
                             'type': metric type
        """
        return {self._name: {"direction": "higher-worse", "type": "WER"}}

    # Methods specific to the current implementation
    def _get_metric_per_sample(self, annotation, prediction):
        cur_score = self._editdistance_eval(annotation.split(), prediction.split())
        cur_words = len(annotation.split())

        self._sum_score += cur_score
        self._sum_words += cur_words
        # Guard against empty annotations to avoid division by zero.
        self._cur_score = cur_score / cur_words if cur_words != 0 else 0

        return self._cur_score

    def _editdistance_eval(self, source, target):
        n, m = len(source), len(target)

        distance = np.zeros((n + 1, m + 1), dtype=int)
        distance[:, 0] = np.arange(0, n + 1)
        distance[0, :] = np.arange(0, m + 1)

        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if source[i - 1] == target[j - 1] else 1

                distance[i][j] = min(distance[i - 1][j] + 1,
                                     distance[i][j - 1] + 1,
                                     distance[i - 1][j - 1] + cost)
        return distance[n][m]
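
As a worked example of the metric, the sketch below computes WER for one sentence pair without NumPy. The function names and the sentences are illustrative and not part of the notebook. WER is taken as the word-level edit distance divided by the number of reference words, with an empty reference mapped to 0 as in `MetricWER`.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words (row-by-row DP)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def word_error_rate(reference_text, hypothesis_text):
    reference = reference_text.split()
    hypothesis = hypothesis_text.split()
    if not reference:
        return 0.0
    return word_edit_distance(reference, hypothesis) / len(reference)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {example_wer:.3f}")
```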

Define the validation function.

In [None]:
def validation_fn(model, dataset):
    """
    Calculates and returns the accuracy metric for the model (1 - WER, so higher is better).
    """
    wer = MetricWER()
    for sample in dataset:
        # run infer function on sample
        output = model(np.array(sample[1]['inputs']))[model.output(0)]

        # update metric on sample result
        target = [sample[0][1]]
        wer.update(output, target)

    return 1 - wer.avg_value["WER"]

<a id="8"></a>
## Run quantization with accuracy control [&#8657;](#0)
Provide the calibration dataset and the validation dataset; they can be the same dataset. The following parameters control the process:
  - `max_drop` defines the accuracy drop threshold. The quantization process stops when the degradation of the accuracy metric on the validation dataset is less than `max_drop`. The default value is 0.01. NNCF stops the quantization and reports an error if the `max_drop` value cannot be reached.
  - `drop_type` defines how the accuracy drop is calculated: ABSOLUTE (used by default) or RELATIVE.
  - `ranking_subset_size` defines the size of the subset used to rank layers by their contribution to the accuracy drop. The default value is 300; more samples potentially give a better ranking. Here, the value 25 is used to speed up the execution.
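
The difference between the two drop types can be shown with a little arithmetic. The sketch below assumes the common definitions (absolute: baseline minus quantized; relative: that difference divided by the baseline); NNCF's exact computation may differ in details. The metric values are hypothetical and stand for `1 - WER` as returned by `validation_fn`.

```python
def accuracy_drop(baseline, quantized, drop_type="absolute"):
    """Accuracy drop under the assumed definitions (not NNCF's code)."""
    difference = baseline - quantized
    if drop_type == "absolute":
        return difference
    if drop_type == "relative":
        return difference / baseline
    raise ValueError(f"unknown drop_type: {drop_type!r}")


# Hypothetical metric values, chosen only for illustration.
baseline_metric = 0.50
quantized_metric = 0.494
max_drop = 0.01

absolute_drop = accuracy_drop(baseline_metric, quantized_metric, "absolute")
relative_drop = accuracy_drop(baseline_metric, quantized_metric, "relative")
print(f"absolute: {absolute_drop:.4f}, within budget: {absolute_drop <= max_drop}")
print(f"relative: {relative_drop:.4f}, within budget: {relative_drop <= max_drop}")
```

For a metric below 1, the relative drop is larger than the absolute one, so the same `max_drop` is a stricter requirement with the RELATIVE type.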

> **NOTE**: Execution can take tens of minutes and requires up to 10 GB of free memory.

In [None]:
from nncf.quantization.advanced_parameters import AdvancedAccuracyRestorerParameters
from nncf.parameters import ModelType

quantized_model = nncf.quantize_with_accuracy_control(
    ov_model,
    calibration_dataset=calibration_dataset,
    validation_dataset=validation_dataset,
    validation_fn=validation_fn,
    max_drop=0.01,
    drop_type=nncf.DropType.ABSOLUTE,
    model_type=ModelType.TRANSFORMER,
    advanced_accuracy_restorer_parameters=AdvancedAccuracyRestorerParameters(
        ranking_subset_size=25
    ),
)

<a id="9"></a>
## Model Usage Example [&#8657;](#0)

In [None]:
import IPython.display as ipd


audio = LibriSpeechDataLoader.read_flac(f'{DATA_DIR}/LibriSpeech/test-clean/121/127105/121-127105-0017.flac')

ipd.Audio(audio, rate=16000)

In [None]:
core = openvino.Core()

compiled_model = core.compile_model(model=quantized_model, device_name='CPU')

input_data = np.expand_dims(audio, axis=0)

Next, make a prediction.

In [None]:
predictions = compiled_model([input_data])[0]

<a id="10"></a>
## Compare Performance of the Original and Quantized Models [&#8657;](#0)

  - Define dataloader for test dataset.
  - Define functions to get inference for PyTorch and OpenVINO models.
  - Define functions to compute Word Error Rate.

In [None]:
from tqdm.notebook import tqdm

import numpy as np


dataset_config = {"data_source": os.path.join(DATA_DIR, "LibriSpeech/test-clean")}
test_data_loader = LibriSpeechDataLoader(dataset_config, samples_limit=300)


# inference function for pytorch
def torch_infer(model, sample):
    output = model(torch.Tensor(sample[1]['inputs'])).logits
    output = output.detach().cpu().numpy()

    return output


# inference function for openvino
def ov_infer(model, sample):
    output = model.output(0)
    output = model(np.array(sample[1]['inputs']))[output]

    return output


def compute_wer(dataset, model, infer_fn):
    wer = MetricWER()
    for sample in tqdm(dataset):
        # run infer function on sample
        output = infer_fn(model, sample)
        # update metric on sample result
        wer.update(output, [sample[0][1]])

    return wer.avg_value

Now, compute WER for the original PyTorch model and quantized model.

In [None]:
pt_result = compute_wer(test_data_loader, torch_model, torch_infer)
quantized_result = compute_wer(test_data_loader, compiled_model, ov_infer)

print(f'[PyTorch]   Word Error Rate: {pt_result["WER"]:.4f}')
print(f'[Quantized OpenVINO]  Word Error Rate: {quantized_result["WER"]:.4f}')
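
The cells above compare accuracy only. To get a rough idea of speed as well, a simple timing helper can be used. This is a minimal sketch: `average_latency` and `dummy_infer` are hypothetical names, and wall-clock timing in a notebook is noisy, so treat the numbers as indicative only.

```python
import time


def average_latency(infer, sample, warmup=2, runs=10):
    """Average wall-clock seconds per call of infer(sample)."""
    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        infer(sample)
    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    return (time.perf_counter() - start) / runs


def dummy_infer(sample):
    """Stand-in workload so that the sketch runs without a model."""
    return sum(x * x for x in sample)


latency = average_latency(dummy_infer, [0.0] * 10_000)
print(f"{latency * 1e3:.3f} ms per call")
```

With the objects defined in this notebook, the same helper could be called as `average_latency(lambda s: ov_infer(compiled_model, s), test_data_loader[0])` and as `average_latency(lambda s: torch_infer(torch_model, s), test_data_loader[0])`.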