<a href="https://colab.research.google.com/github/beinghorizontal/wav2vec2/blob/main/Copy_of_insanely_fast_whisper_fp16_sdpa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Insanely Fast Whisper: A journey to build the fastest possible transcription with Whisper 🔥

By: [Vaibhav (VB) Srivastav](https://twitter.com/reach_vb)

## fp16 + SDPA

Whisper runs here in half precision (fp16), with attention computed by PyTorch's Scaled Dot Product Attention (SDPA) for efficient attention calculation.
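
As a rough illustration of what scaled dot product attention computes, here is a minimal pure-Python sketch of `softmax(Q Kᵀ / sqrt(d)) V` for a single head with no mask. It is only meant to show the arithmetic; the fused PyTorch kernel that `attn_implementation="sdpa"` selects is implemented very differently, and the function names and toy inputs below are made up for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d)) V for lists of row vectors (one head, no mask)."""
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q_row in q:
        # Similarity of this query with every key, scaled by 1/sqrt(d).
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out

# Toy inputs (hypothetical): two tokens with two-dimensional heads.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
attended = scaled_dot_product_attention(q, k, v)
print(attended)
```

Each query attends most strongly to the key it matches, so each output row leans towards the corresponding value row.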

### Set up our inference environment 🧑‍💻

In [None]:
!pip install -q --upgrade transformers accelerate torch ipython-autotime


### Setting up the utilities to track time taken by each step ⏳

In [None]:
%load_ext autotime

time: 324 µs (started: 2024-05-24 07:27:28 +00:00)
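
The `autotime` extension prints a `time:` line after every cell. Outside a notebook, a similar measurement can be sketched with a small context manager around `time.perf_counter`. The names `timed` and `timings` are hypothetical helpers for this example, not part of `ipython-autotime`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, log):
    """Record the wall-clock duration of the enclosed block in log[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log[label] = time.perf_counter() - start

timings = {}
with timed("sleep", timings):
    time.sleep(0.05)  # stand-in for a slow step such as loading a model

print(f"sleep: {timings['sleep']:.3f} s")
```

Collecting durations in a dictionary makes it easy to compare steps afterwards, for example model loading against transcription.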


### Necessary imports 🔧




In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

time: 16.8 s (started: 2024-05-24 07:27:28 +00:00)


### Define the model checkpoint, device and datatype, then load the model 🔉

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",  # use PyTorch's scaled dot product attention
)
model.to(device)

config.json:   0%|          | 0.00/1.27k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/3.90k [00:00<?, ?B/s]

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(128, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 1280)
      (layers): ModuleList(
        (0-31): 32 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (fc2): Linear(in_features=5120, out_features=1280, bia

time: 34.2 s (started: 2024-05-24 07:27:44 +00:00)
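
To get a feel for why `torch.float16` is chosen on GPU, the following back-of-the-envelope sketch starts from the `model.safetensors` size shown in the download log above (read here as roughly 3.09e9 bytes). It assumes the checkpoint holds only 2-byte fp16 weights, which is an assumption rather than something the log confirms. The last part uses the standard library's half-precision `struct` format to show the rounding that fp16 introduces.

```python
import struct

CHECKPOINT_BYTES = 3.09e9  # model.safetensors size as reported in the download log
BYTES_PER_FP16 = 2
BYTES_PER_FP32 = 4

# Assumption: the checkpoint stores only fp16 weights, so size / 2 ~ parameter count.
approx_params = CHECKPOINT_BYTES / BYTES_PER_FP16
fp32_bytes = approx_params * BYTES_PER_FP32

print(f"~{approx_params / 1e9:.2f}B parameters")
print(f"fp16 weights: ~{CHECKPOINT_BYTES / 1e9:.2f} GB")
print(f"fp32 weights: ~{fp32_bytes / 1e9:.2f} GB")

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Half precision keeps fewer mantissa bits, so most values are stored approximately.
rounding_error = abs(to_fp16(0.1) - 0.1)
print(f"0.1 stored as fp16 is off by {rounding_error:.2e}")
```

Under these assumptions, loading in fp32 would need about twice the memory for the weights alone, at the cost of a small per-value rounding error in fp16.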


### Load the processor and initialise the speech recognition pipeline ⚡

In [None]:
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


time: 3.24 s (started: 2024-05-24 07:28:19 +00:00)


### Define an audio sample to test on 👇

In [None]:
sample = "https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/4469669-10.mp3"

time: 396 µs (started: 2024-05-24 07:28:22 +00:00)


### Transcribe away! 💪

In [None]:
result = pipe(sample)

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


time: 1min (started: 2024-05-24 07:28:22 +00:00)
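
The warning above notes that a multilingual Whisper checkpoint now detects the language before transcribing. If the language is known in advance, it can be pinned through `generate_kwargs`, which the pipeline forwards to the model's `generate` call. The sketch below uses a hypothetical helper, `whisper_generate_kwargs`, to build that dictionary. The `language` and `task` keys are the options Whisper's generation accepts as far as I am aware, so check them against the installed `transformers` version.

```python
def whisper_generate_kwargs(language=None, task=None):
    """Build a generate_kwargs dict for the ASR pipeline (hypothetical helper)."""
    kwargs = {}
    if language is not None:
        kwargs["language"] = language  # leave unset to keep automatic detection
    if task is not None:
        if task not in ("transcribe", "translate"):
            raise ValueError(f"unknown task: {task!r}")
        kwargs["task"] = task
    return kwargs

english_only = whisper_generate_kwargs(language="en", task="transcribe")
print(english_only)

# With the `pipe` and `sample` defined above, the call would look like:
# result = pipe(sample, generate_kwargs=english_only)
```

Leaving both arguments unset returns an empty dictionary, which keeps the default behaviour of detecting the language first.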


In [None]:
print(result["text"])

 Thank you. and investors will be accepted from 5.30 to 6 o'clock Japan time. Please be aware of that. Now, we will be collecting questions via telephone conferencing system. As is informed to you beforehand, the conference call system will require the pre-registration beforehand. Let me introduce the presenter today, President and CEO, Satoshi Tsunakawa. Corporate Senior Executive Vice President Mamoru Hatazawa. Representative Executive Officer, Corporate Executive Vice President, NCFO, Masayoshi Hirata. We have a chairperson of Strategic Review Committee Outside Director, Paul Brough. He is joining from Hong Kong online. My name is Hara of Corporate Communications Department. We are providing simultaneous translation, so if you are watching the live streaming in Japanese, you will be able to hear translation's voice. Please be aware of that. First, before going into transforming Toshiba to enhance Shihara's value, May I have Mr. Tanaka to say a few words upon the receipt of the repor