<a href="https://colab.research.google.com/github/Vaibhavs10/notebooks/blob/main/Whisper_Large_8bit_loading_w_bnb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Whisper Large inference in 8-bit mode

Loading the model in 8-bit mode aims at faster and more memory-efficient inference for large models. Read more about it [here](https://huggingface.co/blog/hf-bitsandbytes-integration).

Compiled by: [Vaibhav (VB) Srivastav](https://twitter.com/reach_vb)

We'll first install the necessary packages. We need ffmpeg to decode `mp3` files from the Common Voice 11 (CV11) dataset, and `transformers`, `bitsandbytes` (bnb) and `accelerate` to load the model in 8-bit mode.

In [None]:
!add-apt-repository -y ppa:jonathonf/ffmpeg-4 && apt update && apt install -y ffmpeg
!pip install --quiet datasets git+https://github.com/huggingface/transformers evaluate huggingface_hub jiwer bitsandbytes accelerate

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:3 http://security.ubuntu.com/ubuntu bionic-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
Hit:5 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:6 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
Ign:7 

Since we will run inference on the CV11 dataset, we need to authenticate with the Hugging Face Hub: CV11 requires accepting its Terms and Conditions.

In [None]:
!git config --global credential.helper store
from huggingface_hub import login

login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


To reduce memory and time overhead, we'll load the dataset in streaming mode. At inference time we stream one data point at a time, which is especially useful for larger datasets.

In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en", revision="streaming", split="test", streaming=True, use_auth_token=True
)
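As a rough illustration of what streaming means here, the sketch below uses a plain Python generator in place of the real dataset. The function `stream_examples` and its fields are hypothetical stand-ins, not part of `datasets`; the point is only that examples are produced on demand and that taking the first few never touches the rest.

```python
from itertools import islice

produced = []  # records which examples were actually generated


def stream_examples(n_total):
    """Hypothetical stand-in for a streamed dataset: yields one example at a time."""
    for i in range(n_total):
        produced.append(i)  # a real stream would download and decode the clip here
        yield {"id": i, "sentence": f"example {i}"}


# Nothing is generated until we iterate. islice stops after 3 items,
# so the remaining examples are never produced.
first_three = list(islice(stream_examples(1_000_000), 3))
print([ex["id"] for ex in first_three], "generated:", len(produced))
```

This mirrors, in spirit, how `dataset.take(10)` on a streaming dataset only pulls the examples that are actually iterated over.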

Load the model in 8-bit mode with `load_in_8bit=True`, along with its processor.

Note: This flag is the only change you need to make to run the model in 8-bit mode.

In [None]:
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

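# load_in_8bit=True quantizes the model weights; device_map="auto" lets accelerate place them on the available device(s)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large", device_map="auto", load_in_8bit=True)
# the processor only wraps the feature extractor and tokenizer, so it needs no quantization flag
processor = WhisperProcessor.from_pretrained("openai/whisper-large")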
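To get a feel for why 8-bit loading helps, here is a back-of-the-envelope estimate of the memory taken by the weights alone at different precisions. It assumes roughly 1.55 billion parameters for Whisper large and pretends every weight is stored at the given width; both are simplifications, so treat the numbers as rough.

```python
# Assumption: ~1.55 billion parameters for Whisper large.
N_PARAMS = 1_550_000_000

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(n_params, bytes_per_param):
    """Memory for the weights only, in GiB (no activations, no CUDA context)."""
    return n_params * bytes_per_param / 1024**3


for dtype, width in BYTES_PER_PARAM.items():
    print(f"{dtype:>8}: {weight_memory_gib(N_PARAMS, width):.2f} GiB")
```

In practice some layers stay in higher precision, and activations and the CUDA context add overhead, so measured GPU usage will be higher than the int8 figure. On the loaded model, `model.get_memory_footprint()` should give a closer number for the weights.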

Resample the dataset to 16 kHz, since Whisper expects 16 kHz input.

In [None]:
from datasets import Audio

dataset = dataset.take(10)

# resample to 16kHz
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
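The arithmetic behind the 16 kHz requirement can be sketched as follows. The constants are the Whisper feature extractor defaults as I understand them (30-second window, hop length of 160 samples, 80 mel bins for `whisper-large`). The 48 kHz source rate in the example is an assumption for illustration, and the exact resampled length depends on the resampler used.

```python
SAMPLING_RATE = 16_000   # Hz, the rate Whisper expects
CHUNK_SECONDS = 30       # audio is padded or truncated to this window
HOP_LENGTH = 160         # samples between successive spectrogram frames
N_MELS = 80              # mel bins (assumed default for whisper-large)


def resampled_length(n_samples, orig_sr, target_sr=SAMPLING_RATE):
    """Approximate number of samples after resampling."""
    return round(n_samples * target_sr / orig_sr)


def expected_feature_shape(batch_size=1):
    """Shape of `input_features` after padding to the 30 s window."""
    n_frames = SAMPLING_RATE * CHUNK_SECONDS // HOP_LENGTH
    return (batch_size, N_MELS, n_frames)


# A hypothetical 5-second clip recorded at 48 kHz:
print(resampled_length(5 * 48_000, orig_sr=48_000))
print(expected_feature_shape())
```

Feeding audio at a different rate would stretch or compress it relative to what the model saw in training, which is why the `cast_column` step comes before inference.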

Time to run the inference loop!

In [None]:
%%time

device = "cuda" if torch.cuda.is_available() else "cpu"

# the decoder prompt is identical for every example, so build it once
forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")

for data in dataset:
    # log-mel features, cast to fp16 and moved to the model's device
    inputs = processor.feature_extractor(
        data["audio"]["array"], sampling_rate=16_000, return_tensors="pt"
    ).input_features.half().to(device)
    predicted_ids = model.generate(inputs, forced_decoder_ids=forced_decoder_ids)
    # decode token ids to text, dropping special tokens and skipping text normalization
    text = processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True, normalize=False)[0]
    print(text)

Reading metadata...: 16354it [00:00, 66038.23it/s]


 Joe Keaton disapproved of films and Buster also had reservations about the medium.
 She'll be alright.
 Six.
 All is well that ends well.
 It is a busy market town that serves a large, surrounded area.
 the team had Olympic champion Carolina Marin in the squad for the season
 Do you mean it?
 The new patch is less invasive than the old one, but still causes regression.
 How is Mozilla going to handle ambiguities like Q and Q?
 Wish you a safe and happy holiday.
CPU times: user 42.3 s, sys: 1.27 s, total: 43.6 s
Wall time: 43.7 s
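The notebook installs `evaluate` and `jiwer` but stops at printing transcriptions. A natural next step is to score them against the reference text (stored in the `sentence` column of Common Voice) using word error rate. The function below is a minimal pure-Python sketch of WER for illustration only; the reference string in the example is made up, not taken from the dataset. For real evaluation, prefer the maintained implementations in `jiwer` or `evaluate`.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical reference paired with one of the transcriptions printed above.
reference = "She'll be all right."
hypothesis = "She'll be alright."
print(word_error_rate(reference, hypothesis))
```

Note that this compares raw strings. In practice both sides are usually normalized (casing, punctuation) before scoring, which can change the result noticeably.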


In [None]:
!nvidia-smi

Wed Dec  7 15:28:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P0    29W /  70W |   6410MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces