This notebook shows how to run automatic speech recognition (ASR) with https://huggingface.co/Intel/whisper-large-v2-onnx-int4-inc, an INT4 ONNX export of Whisper large-v2 optimized for Intel processors.

### Check processor

In [1]:
! lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
    CPU family:          6
    Model:               63
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            2
    CPU max MHz:         3600.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            6983.02
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc
                         a cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss 
                         ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arc
                         h_perfmon pebs bts rep_good nopl xtopology nonstop_tsc 
                         cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl v
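
The flag list printed by `lscpu` above is cut off, so it does not show which vector extensions this CPU offers. A minimal sketch for checking a few flags directly from `/proc/cpuinfo` follows. The helper names and the selection of flags are assumptions made for illustration; the notebook does not state which extensions the INT4 kernels require.

```python
from pathlib import Path

# Hypothetical selection of flags to look for. Which of these (if any) the
# INT4 kernels actually need is an assumption, not something stated above.
FLAGS_OF_INTEREST = ("avx2", "avx512f", "avx512_vnni")


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report(flags: set[str]) -> dict[str, bool]:
    """Map each flag of interest to whether it is present."""
    return {name: name in flags for name in FLAGS_OF_INTEREST}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux only
    if cpuinfo.exists():
        for name, present in report(parse_cpu_flags(cpuinfo.read_text())).items():
            print(f"{name}: {'yes' if present else 'no'}")
```

Reading `/proc/cpuinfo` avoids the line wrapping and truncation that the `lscpu` output is subject to in a notebook cell.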

### Check versions

In [1]:
import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)

2.4.0+cu118
2.4.0+cu118
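
The model itself is loaded through ONNX Runtime and Optimum rather than PyTorch, so their versions are worth recording as well. The sketch below uses the standard-library `importlib.metadata`. The distribution names in `PACKAGES` are assumptions about how the packages were installed (for example, a GPU build of ONNX Runtime is published under a different name).

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names; adjust if the environment uses other builds.
PACKAGES = ("onnxruntime", "optimum", "transformers")


def installed_versions(names=PACKAGES):
    """Return {distribution name: version string or 'not installed'}."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = "not installed"
    return result


for name, ver in installed_versions().items():
    print(f"{name}: {ver}")
```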


### Download the model

In [2]:
! git clone https://huggingface.co/Intel/whisper-large-v2-onnx-int4-inc

Cloning into 'whisper-large-v2-onnx-int4-inc'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 38 (delta 9), reused 0 (delta 0), pack-reused 4 (from 1)
Unpacking objects: 100% (38/38), 1.17 MiB | 2.39 MiB/s, done.
Filtering content: 100% (6/6), 1.89 GiB | 5.46 MiB/s, done.


### Load the model

In [3]:
import os

from transformers import WhisperProcessor, PretrainedConfig
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_name = "openai/whisper-large-v2"
model_path = "whisper-large-v2-onnx-int4-inc"

processor = WhisperProcessor.from_pretrained(model_name)

model_config = PretrainedConfig.from_pretrained(model_name)

sessions = ORTModelForSpeechSeq2Seq.load_model(
    os.path.join(model_path, "encoder_model.onnx"),
    os.path.join(model_path, "decoder_model.onnx"),
    os.path.join(model_path, "decoder_with_past_model.onnx"),
)

model = ORTModelForSpeechSeq2Seq(
    sessions[0], sessions[1], model_config, model_path, sessions[2]
)

  from .autonotebook import tqdm as notebook_tqdm
You are using a model of type whisper to instantiate a model of type . This is not supported for all configurations of models and can yield errors.


Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from whisper-large-v2-onnx-int4-inc/encoder_model.onnx failed:Node (/layers.0/self_attn/k_proj/MatMul_Q4) Op (MatMulFpQ4) [ShapeInferenceError] 4b quantization not yet supported on this hardware platform!
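
The failure above is raised by ONNX Runtime while it builds the session for `encoder_model.onnx`, before any Whisper-specific code runs. A minimal sketch for isolating it is to open each ONNX file directly with `onnxruntime.InferenceSession` and check whether the message matches the one shown above. The helper names are hypothetical, and the marker string is copied from the error text rather than taken from any documented API.

```python
import os

MODEL_DIR = "whisper-large-v2-onnx-int4-inc"
MODEL_FILES = (
    "encoder_model.onnx",
    "decoder_model.onnx",
    "decoder_with_past_model.onnx",
)

# Substring copied from the error message shown above.
UNSUPPORTED_MARKER = "4b quantization not yet supported on this hardware platform"


def is_int4_hardware_error(message: str) -> bool:
    """True if the message is the 'unsupported hardware' failure seen above."""
    return UNSUPPORTED_MARKER in message


def try_load(path: str):
    """Return None if the session loads, otherwise the error message."""
    import onnxruntime as ort  # imported lazily so the helpers work without it

    try:
        ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    except Exception as exc:
        return str(exc)
    return None


def check_models(model_dir: str = MODEL_DIR) -> dict:
    """Return a status per model file: 'missing', 'ok', 'int4-unsupported' or 'error'."""
    status = {}
    for name in MODEL_FILES:
        path = os.path.join(model_dir, name)
        if not os.path.exists(path):
            status[name] = "missing"
            continue
        error = try_load(path)
        if error is None:
            status[name] = "ok"
        elif is_int4_hardware_error(error):
            status[name] = "int4-unsupported"
        else:
            status[name] = "error"
    return status


if __name__ == "__main__":
    for name, state in check_models().items():
        print(f"{name}: {state}")
```

If every file reports `int4-unsupported`, the problem lies with the runtime and CPU combination rather than with the Optimum wrapper or the Whisper configuration.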

### Transcribe the audio file

In [None]:
! wget -O short_1_16k.wav https://github.com/egorsmkv/wav2vec2-uk-demo/raw/master/short_1_16k.wav

In [None]:
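import torchaudio

# torchaudio.load returns a (channels, samples) tensor and the sample rate
audio_data, sr = torchaudio.load("short_1_16k.wav")

# The Whisper processor expects 1-D mono audio at 16 kHz,
# so pass the first channel as a NumPy array
input_features = processor(
    audio_data[0].numpy(), sampling_rate=sr, return_tensors="pt"
).input_features

# Generate token ids and decode the first (only) sequence to text
predicted_ids = model.generate(input_features)[0]
transcription = processor.decode(predicted_ids)

print(transcription)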