<a href="https://colab.research.google.com/github/brilla-ai/brilla-ai/blob/kojomensahonums-starter-notebook-upload/STT_Starter_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AfricAIED 2024 Hackathon
This notebook helps you get started with speech transcription models for your project. It shows how to load and transcribe an audio file using distil-whisper, a lightweight variant of the Whisper model.

The audio files used in this notebook are available in the "starter notebook audios" folder. Download them and upload them to Google Colab; the file paths in the code cells should then work unchanged. Happy transcribing! 😀

### About the Speech Transcription Model

The distil-whisper collection currently has four model sizes. Transcription quality generally increases from the smallest to the largest. In order of increasing size, the models are:


*   distil-small.en
*   distil-medium.en
*   distil-large-v2
*   distil-large-v3

Distil-large-v2 is used in this notebook. You can try the different models out!
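
To make switching models less error-prone, the checkpoint names listed above can be kept in one place and turned into a model ID. This is a minimal sketch: the helper `hub_model_id` is hypothetical, and it assumes every checkpoint lives under the same `distil-whisper/` namespace as the `distil-whisper/distil-large-v2` ID used later in this notebook.

```python
# Checkpoint names from the list above, smallest to largest.
DISTIL_WHISPER_CHECKPOINTS = [
    "distil-small.en",
    "distil-medium.en",
    "distil-large-v2",
    "distil-large-v3",
]


def hub_model_id(checkpoint, namespace="distil-whisper"):
    """Build a model ID such as 'distil-whisper/distil-large-v2'.

    Assumption: all checkpoints share the same namespace prefix.
    """
    if checkpoint not in DISTIL_WHISPER_CHECKPOINTS:
        raise ValueError(
            f"Unknown checkpoint {checkpoint!r}; "
            f"expected one of {DISTIL_WHISPER_CHECKPOINTS}"
        )
    return f"{namespace}/{checkpoint}"


# The model used in this notebook.
model_id = hub_model_id("distil-large-v2")
print(model_id)
```

A misspelled name raises a `ValueError` immediately instead of failing later during the model download.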





#### Imports

In [None]:
# Install and import required libraries
!pip install git+https://github.com/openai/whisper.git
!pip install jiwer
!pip install tabulate
!pip install pydub
!pip install -q torchaudio
!pip install --upgrade pip
!pip install --upgrade transformers accelerate datasets[audio]
import torch
import whisper
from pydub import AudioSegment
import os
import IPython.display as ipd
from IPython.display import Audio, clear_output
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

clear_output()

#### Short-form Transcription

Short-form transcription is the process of transcribing audio samples that are less than 30 seconds long, the maximum receptive field of the Whisper models. A clip of this length can be processed in one pass, with no need for chunking.
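
If you are unsure whether a clip counts as short-form, you can check its duration before choosing a pipeline. The sketch below uses Python's standard `wave` module, so it assumes an uncompressed `.wav` file; the helper names `wav_duration_seconds` and `fits_in_one_window` are hypothetical.

```python
import wave

# Maximum clip length for short-form transcription, as described above.
WHISPER_WINDOW_S = 30.0


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def fits_in_one_window(duration_s, window_s=WHISPER_WINDOW_S):
    """True when the clip is shorter than the window, so no chunking is needed."""
    return duration_s < window_s
```

For example, `fits_in_one_window(wav_duration_seconds("/content/audio_riddle_short_form.wav"))` tells you whether the short-form pipeline is enough for that file.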

Import and play audio

In [None]:
# Load the clip with pydub; ending the cell with the variable displays an audio player
sample_audio = AudioSegment.from_file(r"/content/audio_riddle_short_form.wav")
sample_audio


Load model

In [None]:
# Use the GPU in half precision when one is available; otherwise use the CPU in full precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2" # Test out the different model sizes here

model_short = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model_short.to(device)

processor = AutoProcessor.from_pretrained(model_id)


pipe = pipeline(
    "automatic-speech-recognition",
    model=model_short,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device
)

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline.

In [None]:
# Transcribe audio
result = pipe(r"/content/audio_riddle_short_form.wav")

# Print transcript
print(result["text"])

#### Chunked Long-form transcription

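The chunked algorithm splits a long audio file into shorter segments (25 seconds here, set by `chunk_length_s`), transcribes them in batches, and combines the results into one transcript. Use it when you are transcribing a single long audio file and need the fastest possible inference.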
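
To picture what chunking means for a long clip, here is a minimal sketch that computes chunk start and end times. It is an illustration only, not the Transformers pipeline's internal implementation; the function name `chunk_boundaries` and the `overlap_s` parameter are hypothetical, and the 25-second default mirrors the `chunk_length_s=25` setting used in this notebook.

```python
def chunk_boundaries(duration_s, chunk_length_s=25.0, overlap_s=0.0):
    """Return (start, end) times in seconds that together cover the whole clip.

    Illustration only: overlap_s is a hypothetical parameter that lets
    neighbouring chunks share some audio.
    """
    if chunk_length_s <= 0:
        raise ValueError("chunk_length_s must be positive")
    if not 0 <= overlap_s < chunk_length_s:
        raise ValueError("overlap_s must be in [0, chunk_length_s)")

    step = chunk_length_s - overlap_s  # how far each new chunk advances
    boundaries = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        boundaries.append((start, end))
        if end >= duration_s:
            break  # the last chunk reached the end of the clip
        start += step
    return boundaries


# A 60-second clip becomes three chunks, the last one shorter than the rest.
print(chunk_boundaries(60.0))
```

Each chunk fits inside the 30-second window, so the chunks can be transcribed together in batches rather than one after another.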

Import and play audio

In [None]:
# Load the long clip with pydub; ending the cell with the variable displays an audio player
sample_audio_2 = AudioSegment.from_file(r"/content/audio_riddle_long_form.wav")
sample_audio_2


Load model

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"  # Test out the different model sizes here

model_long = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model_long.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe_long = pipeline(
    "automatic-speech-recognition",
    model=model_long,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
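    chunk_length_s=25, # split the audio into 25-second chunks (not set in the short-form pipeline)
    batch_size=16, # transcribe up to 16 chunks per batch (not set in the short-form pipeline)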
    torch_dtype=torch_dtype,
    device=device,
)

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline.

In [None]:
# Transcribe audio
result2 = pipe_long(r"/content/audio_riddle_long_form.wav")

# Print transcript
print(result2["text"])