<a href="https://colab.research.google.com/github/Vaibhavs10/optimise-my-whisper/blob/main/insanely_fast_whisper_distil_fp16_sdpa_chunking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Insanely Fast Whisper: A journey to build the fastest possible transcription with Whisper 🔥

By: [Vaibhav (VB) Srivastav](https://twitter.com/reach_vb)

## distil-whisper + fp16 + SDPA + chunking

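This variant combines the Distil-Whisper checkpoint with half precision (fp16), Scaled Dot Product Attention (SDPA) for efficient attention computation, and chunked, batched inference.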
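
As a rough illustration of what SDPA computes, the sketch below evaluates softmax(QKᵀ/√d)·V on plain Python lists. It is a didactic reference, not the fused kernel that PyTorch dispatches to; the function name and the toy inputs are assumptions made for this example. In PyTorch the corresponding entry point is `torch.nn.functional.scaled_dot_product_attention`.

```python
import math


def scaled_dot_product_attention(q, k, v):
    """Reference SDPA on nested lists.

    q: [n_queries][d], k: [n_keys][d], v: [n_keys][d_v]
    Returns a [n_queries][d_v] nested list.
    """
    d = len(q[0])
    out = []
    for q_row in q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
            for k_row in k
        ]
        # Numerically stable softmax over the keys.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value rows.
        out.append(
            [sum(w * v_row[c] for w, v_row in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


# Toy input (hypothetical): two identical keys, so the weights are 0.5 each.
toy = scaled_dot_product_attention(
    q=[[1.0, 0.0]],
    k=[[1.0, 0.0], [1.0, 0.0]],
    v=[[2.0], [4.0]],
)
print(toy)
```

The fused implementations avoid materialising the full score matrix, which is where the speed and memory savings come from; the arithmetic result is the same up to floating-point precision.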

### Set up our inference environment 🧑‍💻

In [1]:
!pip install -q --upgrade transformers accelerate torch ipython-autotime

### Setting up the utilities to track time taken by each step ⏳

In [2]:
%load_ext autotime

time: 224 µs (started: 2024-05-24 07:53:47 +00:00)
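
Outside a notebook, where the `autotime` extension is not available, a similar per-step timing can be approximated with the standard library. This is a minimal sketch; the `timed` helper and the `timings` dictionary are hypothetical names introduced here, not part of `ipython-autotime`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=None):
    """Print the wall-clock time spent inside the `with` block.

    If `sink` is a dict, the elapsed seconds are also stored under `label`.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if sink is not None:
            sink[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")


# Usage: wrap any step you want to measure.
timings = {}
with timed("sleep", timings):
    time.sleep(0.01)
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring intervals.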


### Necessary imports 🔧




In [3]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

time: 16.7 s (started: 2024-05-24 07:53:47 +00:00)


### Define the model checkpoint, device and datatype, then load the model 🔉

In [4]:
# Use the first GPU in half precision when CUDA is available; otherwise fall back to CPU in fp32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

# attn_implementation="sdpa" selects PyTorch's scaled dot product attention.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
model.to(device)

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(128, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 1280)
      (layers): ModuleList(
        (0-31): 32 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (fc2): Linear(in_features=5120, out_features=1280, bia

time: 1min 25s (started: 2024-05-24 07:54:04 +00:00)
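
To see why the datatype matters, here is a back-of-the-envelope estimate of the weight footprint. The run downloaded a `model.safetensors` file of roughly 1.51 GB; the sketch assumes that file stores the weights in 16-bit precision and ignores metadata overhead, both of which are assumptions rather than something the output states. The helper names are hypothetical.

```python
# Bytes used to store one parameter in each datatype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def approx_param_count(checkpoint_bytes, stored_dtype="float16"):
    """Estimate the number of parameters from the checkpoint size."""
    return checkpoint_bytes / BYTES_PER_PARAM[stored_dtype]


def approx_weight_gb(n_params, dtype):
    """Estimate the memory (in GB, base 1000) the weights need in `dtype`."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


n_params = approx_param_count(1.51e9, "float16")
print(f"~{n_params / 1e6:.0f}M parameters")
print(f"fp16 weights: ~{approx_weight_gb(n_params, 'float16'):.2f} GB")
print(f"fp32 weights: ~{approx_weight_gb(n_params, 'float32'):.2f} GB")
```

Under these assumptions, loading in fp16 halves the weight memory compared with fp32. Activations and the CUDA context add to the total, so actual GPU usage will be higher than this estimate.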


### Load the processor and initialise the speech recognition pipeline ⚡

In [5]:
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


time: 2.23 s (started: 2024-05-24 07:55:29 +00:00)
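
In the `transformers` speech recognition pipeline, chunked inference is controlled by the `chunk_length_s` argument, with the overlap between neighbouring chunks set by `stride_length_s` (documented default: `chunk_length_s / 6`). `batch_size` then determines how many chunks are decoded together. The sketch below only illustrates the windowing arithmetic in plain Python; `chunk_windows`, `num_batches` and the 30 s / 5 s / 70 s values are assumptions for illustration, not the pipeline's internal code.

```python
import math


def chunk_windows(total_s, chunk_s=30.0, stride_s=5.0):
    """Split `total_s` seconds of audio into overlapping (start, end) windows.

    Each window is up to `chunk_s` long and overlaps its neighbours by
    `stride_s` on each side, so consecutive windows start
    `chunk_s - 2 * stride_s` seconds apart.
    """
    if chunk_s <= 2 * stride_s:
        raise ValueError("chunk_s must be larger than twice stride_s")
    total_s = float(total_s)
    if total_s <= 0:
        return []
    step = chunk_s - 2 * stride_s
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if start + chunk_s >= total_s:  # this window reaches the end of the audio
            break
        start += step
    return windows


def num_batches(n_chunks, batch_size=8):
    """Number of forward passes needed to decode all chunks."""
    return math.ceil(n_chunks / batch_size)


windows = chunk_windows(70.0)
print(windows)  # [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
print(num_batches(len(windows), batch_size=8))  # 1
```

With the pipeline itself, these settings would be passed as, for example, `pipe(sample, chunk_length_s=30, batch_size=8)`. The overlapping regions give the model context on both sides of each boundary, so the transcripts of adjacent chunks can be stitched together.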


### Define an audio sample to test on 👇

In [6]:
sample = "https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/4469669-10.mp3"

time: 422 µs (started: 2024-05-24 07:55:32 +00:00)


### Transcribe away! 💪

In [7]:
result = pipe(sample, batch_size=8)

time: 17.2 s (started: 2024-05-24 07:55:32 +00:00)
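
A wall-clock time is easier to compare across setups when it is normalised by the audio duration. The sketch below computes a speed-up over real time; the helper name is hypothetical, and the 600-second duration is a placeholder, not the measured length of the sample used here. Note that the 17.2 s above likely also includes fetching the audio from the URL.

```python
def speedup_over_real_time(audio_seconds, wall_seconds):
    """Seconds of audio transcribed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Placeholder duration (hypothetical): substitute the real length of your audio.
audio_seconds = 600.0
wall_seconds = 17.2  # wall-clock time reported for the transcription cell

factor = speedup_over_real_time(audio_seconds, wall_seconds)
print(f"{factor:.1f}x faster than real time")
```

For a fairer benchmark, run the call more than once and discard the first run, since it may include one-off warm-up costs.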


In [8]:
print(result["text"])

 Now, it's time. May I start the presentation on transforming Toshiba to enhance your health value and FY21 second quarter consolidated business results. We are organized this presentation session on online basis. From 4 to 5 o'clock, we will be presenting from our side and followed by 30 minutes' question session for the middle of the media. The questions from Alice And investors will be accepted from 5.30 to 6 o'clock Japan time. Please be aware of that. Now, we will be collecting questions via telephone conferencing system. As is informed you beforehand, the conference call system will require the pre-registration beforehand. Let me introduce the presenter today. President and CEO, Satoshi Tsunakawa. Corporate Senior Executive Vice President, Mamor Hatazawa. Representative Executive Executive Vice President, Corporate Executive Vice President, NCFO, Masei Hiraeta. We have a chairperson of Strategic Review Committee outside director of Paul Broff. He is joining from Hong Kong on onli