[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/biodatlab/thonburian-whisper/blob/main/thonburian_whisper_notebook.ipynb)


# **Thonburian Whisper**

Automatic Speech Recognition (ASR) model for Thai

<img src="https://raw.githubusercontent.com/biodatlab/thonburian-whisper/main/assets/thonburian-whisper-logo.png" width="400"/>

---



> By the teams at Looloo Technology and Mahidol University




## **Install Dependencies** ⚙

In [None]:
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!sudo apt-get install -y ffmpeg
!pip install torchaudio ipywebrtc notebook
!pip install -q gradio
!pip install pytube
!jupyter nbextension enable --py widgetsnbextension
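
Decoding audio later in the notebook relies on the `ffmpeg` binary installed above, so it can help to confirm that it is on the `PATH` before loading the model. The sketch below uses only the standard library; `missing_tools` is a hypothetical helper name, not part of any package installed here.

```python
import shutil


def missing_tools(names):
    """Return the command-line tools from `names` that are not on PATH."""
    return [name for name in names if shutil.which(name) is None]


# An empty list means every tool was found.
print(missing_tools(["ffmpeg"]))
```

If the list is not empty, re-running the install cell is the first thing to try.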

## **Load and Set-up Thonburian Whisper 🤗**


In [2]:
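import os
import torch
from transformers import pipeline

# Thonburian Whisper checkpoint on the Hugging Face Hub.
MODEL_NAME = "biodatlab/whisper-th-medium-combined"
lang = "th"

# Use the first GPU when CUDA is available; otherwise fall back to the CPU.
device = 0 if torch.cuda.is_available() else "cpu"

# chunk_length_s=30 lets the pipeline split long audio into 30-second windows.
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)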

Device set to use cuda:0
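
To give a feel for what `chunk_length_s=30` does with long recordings, here is a simplified illustration of overlapping windows. It is not the library's actual code: `chunk_windows` is a hypothetical helper, and the overlap of one sixth of the chunk length on each side is an assumption about the pipeline's default stride.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_s=None):
    """Return (start, end) windows in seconds that cover `duration_s` with overlap."""
    if stride_s is None:
        # Assumed default: overlap of chunk_length_s / 6 on each side.
        stride_s = chunk_length_s / 6
    # Each new window starts this far after the previous one.
    step = chunk_length_s - 2 * stride_s
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows


# A 70-second clip is covered by three overlapping 30-second windows.
print(chunk_windows(70.0))
```

The overlap is what lets the text of neighbouring windows be stitched together without cutting words at the boundaries.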


## **Try it with your own voice** 🎥

### Record your own audio here in the notebook!

In [3]:
from ipywebrtc import AudioRecorder, CameraStream
from google.colab import output
output.enable_custom_widget_manager()

In [6]:
camera = CameraStream(constraints={'audio': True, 'video': False})
recorder = AudioRecorder(stream=camera)
recorder


In [7]:
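# Save the recorded audio. The recorder captures WebM data, so the file holds
# WebM bytes despite the .mp3 name; ffmpeg identifies the format from the content.
recorder.save("audio.mp3")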
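
A file extension says little about what a file actually contains. As a minimal sketch, the hypothetical helper `sniff_container` below reads the first bytes of a file and checks them against a few well-known signatures (EBML for WebM/Matroska, `ID3` or an MPEG frame sync for MP3, `ftyp` for MP4). It covers only these cases and is meant for a quick sanity check, not full format detection.

```python
def sniff_container(path):
    """Guess the audio container of `path` from its leading bytes."""
    with open(path, "rb") as f:
        head = f.read(12)
    # WebM and Matroska files start with the EBML magic number.
    if head.startswith(b"\x1a\x45\xdf\xa3"):
        return "webm/matroska"
    # MP3 files start with an ID3 tag or an MPEG frame sync (11 set bits).
    if head.startswith(b"ID3") or (
        len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0
    ):
        return "mp3"
    # MP4-family files carry "ftyp" at byte offset 4.
    if head[4:8] == b"ftyp":
        return "mp4"
    return "unknown"
```

After saving a recording, `sniff_container("audio.mp3")` shows which container the bytes really use.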

### Now let our *Thonburian Whisper* do the work!!

In [8]:
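# Transcribe the recording. The language token and task pin Whisper to Thai
# transcription; indexing with ["text"] keeps only the transcript string.
transcriptions = pipe(
    "audio.mp3",
    batch_size=16,
    return_timestamps=False,
    generate_kwargs={"language": "<|th|>", "task": "transcribe"}
)["text"]
print(transcriptions)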


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


สวัสดีครับวันนี้ผมมาพูดที่โรงพยาบาลชลบุรีครับ
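
The call above uses `return_timestamps=False`. With `return_timestamps=True`, the pipeline result also carries a `"chunks"` list whose items hold a `"timestamp"` pair and a `"text"` string. The sketch below turns such a list into SRT-style subtitle text; `format_timestamp` and `chunks_to_srt` are hypothetical helper names, and the sample chunks are made-up data used only to show the expected shape.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks):
    """Convert pipeline-style chunks into SRT subtitle text."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # The final chunk may come back without an end time.
            end = start
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{chunk['text'].strip()}"
        )
    return "\n\n".join(blocks)


# Made-up chunks in the shape the pipeline returns.
sample_chunks = [
    {"timestamp": (0.0, 2.5), "text": " สวัสดีครับ"},
    {"timestamp": (2.5, None), "text": " ขอบคุณครับ"},
]
print(chunks_to_srt(sample_chunks))

# With the model loaded, the same helper could be used like this:
# result = pipe("audio.mp3", return_timestamps=True,
#               generate_kwargs={"language": "<|th|>", "task": "transcribe"})
# print(chunks_to_srt(result["chunks"]))
```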


## **Transcribe a YouTube Video?**

> [![Watch the video](https://img.youtube.com/vi/jwBqoBIDv3o/default.jpg)](https://www.youtube.com/watch?v=jwBqoBIDv3o)




In [None]:
import pytube as pt


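def yt_transcribe(yt_url: str) -> dict:
    """Download the audio track of a YouTube video and transcribe it."""
    yt = pt.YouTube(yt_url)
    # Take the first audio-only stream that pytube lists for the video.
    stream = yt.streams.filter(only_audio=True)[0]
    # The container may not be MP3 despite the filename; ffmpeg detects it from the content.
    stream.download(filename="audio.mp3")
    # The pipeline returns a dict; the transcript is stored under the "text" key.
    result = pipe(
        "audio.mp3",
        generate_kwargs={"language": "<|th|>", "task": "transcribe"},
        return_timestamps=False,
        batch_size=16
    )
    return result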

In [None]:
# This may take some time depending on the length of the video.
url = "https://www.youtube.com/watch?v=jwBqoBIDv3o"

transcriptions = yt_transcribe(url)
print(transcriptions["text"])
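
Downloading and transcribing a long video takes a while, so it can be worth checking that the URL looks like a YouTube video link before starting. The following is a minimal sketch using only the standard library; `youtube_video_id` is a hypothetical helper, and it only recognises the `watch?v=` and `youtu.be/` URL forms.

```python
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url):
    """Return the video ID from a YouTube URL, or None if it is not recognised."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID in the path.
        return parsed.path.lstrip("/") or None
    if host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        # Regular watch links carry the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


print(youtube_video_id("https://www.youtube.com/watch?v=jwBqoBIDv3o"))
```

A `None` result is a cue to fix the link before calling `yt_transcribe`.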