# Inference 101 using Whisper Models from HuggingFace

Note: there are many different ways to run inference. This is just one example to demonstrate how the audio data from the datasets can be run through a model.

We focus exclusively on Whisper models here, but other models could be used as well. More on this later.

## Preparation -- Imports and Dataset Loading

In [None]:
import datasets
from huggingface_hub import hf_hub_download
from IPython.display import Audio, display
import pandas as pd

from transformers import pipeline

In [None]:
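from getpass import getpass
from huggingface_hub import login, whoami

# Read the token without echoing it into the notebook output.
HF_TOKEN = getpass("Hugging Face token: ")
login(token=HF_TOKEN)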

In [None]:
dataset_name = 'cdli/ugandan_english_nonstandard_speech_v0.1'
LANGUAGE = 'en'

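ds = datasets.load_dataset(dataset_name, split='test', streaming=False)
# Keep only clips of at most 30 (assumed: seconds), which matches the
# 30-second window that Whisper processes at a time.
ds = ds.filter(lambda example: example['audio_length'] <= 30)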
ds
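The filter above relies on the dataset's `audio_length` column. As a rough cross-check, a clip's duration can also be derived from the number of samples and the sampling rate. The sketch below is illustrative only: the helper names `clip_duration_seconds` and `fits_whisper_window` are hypothetical, and it assumes that `audio_length` is expressed in seconds.

```python
def clip_duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Duration of a mono clip, given its sample count and sampling rate in Hz."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


def fits_whisper_window(num_samples: int, sampling_rate: int,
                        max_seconds: float = 30.0) -> bool:
    """True if the clip is no longer than max_seconds (30 s is Whisper's window)."""
    return clip_duration_seconds(num_samples, sampling_rate) <= max_seconds
```

In the notebook this could be applied to a loaded example as `clip_duration_seconds(len(example['audio']['array']), example['audio']['sampling_rate'])`.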

## Load a model for Inference

In [None]:
WHISPER_MODEL_NAME = "openai/whisper-tiny"
# WHISPER_MODEL_NAME = "openai/whisper-small"
# WHISPER_MODEL_NAME = "openai/whisper-large-v3"

### The easiest way: Hugging Face's pipeline approach

In [None]:
pipe = pipeline("automatic-speech-recognition", 
                model=WHISPER_MODEL_NAME,
                #return_timestamps=False,
)

In [None]:
example = ds[5]
example

In [None]:
Audio(example['audio']['array'], rate=example['audio']['sampling_rate'])

In [None]:
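# Pass the sampling rate along with the waveform so the pipeline can resample
# the audio if it does not match the rate the model expects.
audio = example['audio']
prediction = pipe({"raw": audio['array'], "sampling_rate": audio['sampling_rate']},
                  generate_kwargs={
                      "language": LANGUAGE,
                      "num_beams": 5,
                      "task": "transcribe"
                  })
prediction['text']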

In [None]:
example['transcription']
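Comparing the prediction and the reference transcription by eye only goes so far. A common way to quantify the difference is the word error rate (WER): the word-level edit distance divided by the number of reference words. The following is a minimal pure-Python sketch, not the notebook's own method. The function names `normalize` and `word_error_rate` are hypothetical, and the normalization (lowercasing, stripping punctuation) is a simplifying assumption that may be too crude for a real evaluation.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With the cells above, this would be called as `word_error_rate(example['transcription'], prediction['text'])`. A value of 0 means the normalized texts match exactly; values above 1 are possible when the hypothesis contains many extra words.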

## Next step

* Try different model sizes and see how they affect transcription performance.
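One way to organize such a comparison is a small harness that runs several transcription functions over the same examples and averages a score for each. The sketch below is a hedged illustration: `compare_models` is a hypothetical helper, each transcriber is assumed to be a callable that takes a dataset example and returns text, and `score` is any function of (reference, hypothesis), for instance a word error rate.

```python
from typing import Callable, Mapping, Sequence


def compare_models(
    transcribers: Mapping[str, Callable[[dict], str]],
    examples: Sequence[dict],
    score: Callable[[str, str], float],
) -> dict[str, float]:
    """Return the mean score per model over the given examples."""
    if not examples:
        raise ValueError("examples must not be empty")

    results = {}
    for name, transcribe in transcribers.items():
        # Score each prediction against the example's reference transcription.
        scores = [score(ex["transcription"], transcribe(ex)) for ex in examples]
        results[name] = sum(scores) / len(scores)
    return results
```

In this notebook, each transcriber could wrap a `pipeline("automatic-speech-recognition", model=name)` built for one of the model names listed under `WHISPER_MODEL_NAME`, and the examples could be a small slice of `ds` to keep the runtime manageable.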