<a href="https://colab.research.google.com/github/Vaibhavs10/notebooks/blob/main/whisper_turbo_in_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Whisper Turbo in Transformers 🤗

[Model Repo on the Hub 🤗](https://huggingface.co/ylacombe/whisper-large-v3-turbo)

In [3]:
!pip install -q transformers accelerate

# Set up the environment

In [4]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Load the model checkpoint

In [5]:
model_id = "ylacombe/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True, use_safetensors=True
)
model.to("cuda")

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(128, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 1280)
      (layers): ModuleList(
        (0-31): 32 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (fc2): Linear(in_features=5120, out_features=1280, bia
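
The loading cell above hardcodes `"cuda"` and `torch.float16`, so it fails on a machine without a GPU. The sketch below shows one way to choose the device and precision up front. `pick_device_and_dtype` is a hypothetical helper, not part of `transformers`, and the `float32` fallback on CPU is an assumption based on half precision being poorly supported for some CPU operations.

```python
def pick_device_and_dtype(cuda_available: bool) -> tuple[str, str]:
    """Return a (device, dtype name) pair for loading the model.

    Hypothetical helper for illustration; not part of transformers.
    """
    if cuda_available:
        # Half precision on GPU, matching the loading cell above.
        return "cuda", "float16"
    # Assumption: fall back to full precision on CPU.
    return "cpu", "float32"


# Example with the flag passed explicitly, so the snippet runs without torch.
device, dtype_name = pick_device_and_dtype(cuda_available=False)
print(device, dtype_name)
```

In the notebook, the flag would come from `torch.cuda.is_available()`, the dtype from `getattr(torch, dtype_name)`, and the model would be moved with `model.to(device)`.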

# Set up the Speech Recognition pipeline

In [7]:
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch.float16,
    device="cuda",
)

# Transcribe away!

In [8]:
!wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/ted_60.wav


In [10]:
result = pipe("ted_60.wav")



In [11]:
print(result["text"])

 So in college, I was a government major, which means I had to write a lot of papers. Now, when a normal student writes a paper, they might spread the work out a little like this. So, you know, you get started maybe a little slowly, but you get enough done in the first week that with some heavier days later on, everything gets done and things stay civil. And I would want to do that like that. That would be the plan. I would have it all ready to go, but then actually the paper would come along, and then I would kind of do this. And that would happen every single paper. But then came my 90-page senior thesis, a paper you're supposed to spend a year on. I knew for a paper like that, my normal workflow was not an option, it was way too big a project. So I planned things out, and I decided I kind of had to go something like this. This is how the year would go. So I'd start off light, and I'd bump it up,


# Transcriptions with Timestamps

In [12]:
result = pipe("ted_60.wav", return_timestamps=True)



In [14]:
print(result["chunks"])

[{'timestamp': (0.0, 4.92), 'text': ' So in college, I was a government major,'}, {'timestamp': (4.94, 7.38), 'text': ' which means I had to write a lot of papers.'}, {'timestamp': (7.4, 9.48), 'text': ' Now, when a normal student writes a paper,'}, {'timestamp': (9.5, 12.18), 'text': ' they might spread the work out a little like this.'}, {'timestamp': (12.2, 13.48), 'text': ' So, you know,'}, {'timestamp': (14.76, 16.76), 'text': ' you get started maybe a little slowly,'}, {'timestamp': (16.78, 18.68), 'text': ' but you get enough done in the first week'}, {'timestamp': (18.7, 20.32), 'text': ' that with some heavier days later on,'}, {'timestamp': (20.34, 22.72), 'text': ' everything gets done and things stay civil.'}, {'timestamp': (23.94, 26.16), 'text': ' And I would want to do that like that.'}, {'timestamp': (26.18, 27.38), 'text': ' That would be the plan.'}, {'timestamp': (0.0, 2.54), 'text': ' I would have it all ready to go,'}, {'timestamp': (2.56, 5.06), 'text': ' but then
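
The `chunks` list printed above is a list of dicts with a `timestamp` pair (in seconds) and a `text` string, which maps naturally onto subtitle entries. As a minimal sketch, the helpers below turn such a list into SRT text. `to_srt_time` and `chunks_to_srt` are hypothetical names introduced here, and the sample data is copied from the first two chunks of the output. The timestamps are formatted as given: the output above restarts at `0.0` partway through, and this sketch does not try to correct that.

```python
def to_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: list[dict]) -> str:
    """Render pipeline-style chunks as numbered SRT entries."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        # Assumption: a missing end time (None) is treated as a zero-length cue.
        if end is None:
            end = start
        text = chunk["text"].strip()
        entries.append(f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n")
    return "\n".join(entries)


# Sample data copied from the first two chunks printed above.
sample_chunks = [
    {"timestamp": (0.0, 4.92), "text": " So in college, I was a government major,"},
    {"timestamp": (4.94, 7.38), "text": " which means I had to write a lot of papers."},
]
srt_text = chunks_to_srt(sample_chunks)
print(srt_text)
```

With a real result, `chunks_to_srt(result["chunks"])` would produce the full subtitle text, which can then be written to a `.srt` file.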

# Transcriptions with Word Timestamps

In [15]:
result = pipe("ted_60.wav", return_timestamps="word")



In [16]:
print(result["chunks"])

[{'text': ' So', 'timestamp': (0.56, 0.74)}, {'text': ' in', 'timestamp': (0.74, 0.9)}, {'text': ' college,', 'timestamp': (0.9, 1.76)}, {'text': ' I', 'timestamp': (1.76, 3.5)}, {'text': ' was', 'timestamp': (3.5, 3.64)}, {'text': ' a', 'timestamp': (3.64, 3.74)}, {'text': ' government', 'timestamp': (3.74, 4.02)}, {'text': ' major,', 'timestamp': (4.02, 4.68)}, {'text': ' which', 'timestamp': (4.92, 5.06)}, {'text': ' means', 'timestamp': (5.06, 5.26)}, {'text': ' I', 'timestamp': (5.26, 5.4)}, {'text': ' had', 'timestamp': (5.4, 5.56)}, {'text': ' to', 'timestamp': (5.56, 5.66)}, {'text': ' write', 'timestamp': (5.66, 5.86)}, {'text': ' a', 'timestamp': (5.86, 5.98)}, {'text': ' lot', 'timestamp': (5.98, 6.08)}, {'text': ' of', 'timestamp': (6.08, 6.22)}, {'text': ' papers.', 'timestamp': (6.22, 7.1)}, {'text': ' Now,', 'timestamp': (7.42, 7.68)}, {'text': ' when', 'timestamp': (7.68, 7.78)}, {'text': ' a', 'timestamp': (7.78, 7.88)}, {'text': ' normal', 'timestamp': (7.88, 8.06)}, 
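
Word-level chunks are often too fine-grained to display directly. One possible post-processing step, sketched below, merges consecutive words into phrases and starts a new phrase whenever the silence between two words exceeds a threshold. `group_words` and the `max_gap` default of 0.2 seconds are assumptions made for illustration, and the sketch assumes every word has both a start and an end time. The sample words are copied from the output above.

```python
def group_words(words: list[dict], max_gap: float = 0.2) -> list[dict]:
    """Merge word-level chunks into phrases, splitting on pauses longer than max_gap seconds."""
    phrases: list[list[dict]] = []
    current: list[dict] = []
    for word in words:
        start, _ = word["timestamp"]
        # Start a new phrase when the pause since the previous word is too long.
        if current and start - current[-1]["timestamp"][1] > max_gap:
            phrases.append(current)
            current = []
        current.append(word)
    if current:
        phrases.append(current)
    return [
        {
            "text": "".join(w["text"] for w in phrase).strip(),
            "timestamp": (phrase[0]["timestamp"][0], phrase[-1]["timestamp"][1]),
        }
        for phrase in phrases
    ]


# Sample data copied from the word-level output above.
sample_words = [
    {"text": " government", "timestamp": (3.74, 4.02)},
    {"text": " major,", "timestamp": (4.02, 4.68)},
    {"text": " which", "timestamp": (4.92, 5.06)},
    {"text": " means", "timestamp": (5.06, 5.26)},
]
grouped = group_words(sample_words)
print(grouped)
```

Here the 0.24-second pause between "major," and "which" exceeds the threshold, so the four words come out as two phrases. A larger `max_gap` yields longer phrases.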