<a href="https://colab.research.google.com/github/dbstj1231/2023_AI_Academy_ASR/blob/main/5_whisper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 5. Whisper

- https://openai.com/blog/whisper/  
- Trained on 680,000 hours of multilingual, multitask supervised data collected from the web.
- Supports multiple languages; the full list is in https://github.com/openai/whisper/blob/main/whisper/tokenizer.py

<img src="https://cdn.openai.com/whisper/asr-summary-of-model-architecture-desktop.svg">

## Whisper install
https://github.com/openai/whisper

In [None]:
# install Whisper directly from the GitHub repo with pip
!pip install git+https://github.com/openai/whisper.git

## Whisper model load

In [4]:
import whisper

# whisper model load large
model = whisper.load_model("large")

In [3]:
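# whisper model load tiny
tiny_model = whisper.load_model("tiny")

In [8]:
# check parameters of tiny model (total number of weights)
sum(p.numel() for p in tiny_model.parameters())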

- `datasets`: a library for easy access to shared datasets for audio, computer vision, and NLP tasks; part of the Hugging Face ecosystem

In [None]:
# install datasets
!pip install datasets

In [7]:
from datasets import load_dataset

## LibriSpeech dataset

- Approximately 1,000 hours of 16 kHz read English speech
- Derived from read audiobooks of the LibriVox project

https://www.openslr.org/12  
https://paperswithcode.com/dataset/librispeech  
https://huggingface.co/datasets/librispeech_asr

In [11]:
# load the full LibriSpeech dataset -> too large for this session, shown for reference only
ds = load_dataset("librispeech_asr", 'clean')

In [None]:
# LibriSpeech test data load
english_ds = load_dataset("kresnik/librispeech_asr_test", "clean")

In [16]:
english_ds

DatasetDict({
    test: Dataset({
        features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
        num_rows: 2620
    })
})

In [17]:
# select the test split
english_ds = english_ds['test']

In [18]:
len(english_ds)

2620

In [12]:
# check sample data
sample = english_ds[0]

In [None]:
sample

- `IPython.display`: a module that provides IPython display widgets, such as an audio player

In [14]:
import IPython.display as ipd

In [None]:
# listen audio file using ipd.Audio
ipd.Audio(sample['audio']['array'], rate=sample['audio']['sampling_rate'])

In [None]:
# text
sample['text']

## Recognizing LibriSpeech with the Whisper model

In [13]:
# transcribe using whisper model
result = model.transcribe(sample['file'])

In [28]:
from pprint import pprint

In [None]:
pprint(result)

In [None]:
# check result
print("hypothesis: "+ result['text'])
print("reference: "+ sample['text'].lower())
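
Whisper's output usually carries capitalization and punctuation, while LibriSpeech references are plain text without punctuation, so comparing the raw strings overstates the mismatch. The following is a minimal sketch of a normalizer, assuming English text only. `normalize_en` and the example strings are hypothetical and are not part of Whisper or the dataset; Whisper also ships its own `EnglishTextNormalizer` in `whisper.normalizers`, which is more thorough.

```python
import re

def normalize_en(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (hypothetical helper)."""
    text = text.lower()
    # keep letters, digits, apostrophes, and spaces; replace everything else with a space
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    # collapse runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", text).strip()

# Made-up strings for illustration, not actual model output
hypothesis = " Hello, world! It's a test."
reference = "HELLO WORLD IT'S A TEST"
print(normalize_en(hypothesis) == normalize_en(reference))
```

Applying the same normalizer to both sides keeps the comparison symmetric, so the error rate reflects recognition errors rather than formatting differences.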

In [31]:
# compare with reference

- `jiwer`: a library for evaluating speech recognition output with metrics such as CER and WER
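
Both metrics are edit distances divided by the reference length: CER counts characters and WER counts words. The following is a rough sketch of that idea in plain Python, assuming a non-empty reference. `edit_distance`, `cer_sketch`, and `wer_sketch` are hypothetical names, and `jiwer` may preprocess text differently (this sketch counts spaces as characters).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, with unit cost per edit."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer_sketch(reference: str, hypothesis: str) -> float:
    # character error rate: character edits / reference characters
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer_sketch(reference: str, hypothesis: str) -> float:
    # word error rate: word edits / reference words
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

print(cer_sketch("abcd", "abed"))                     # 1 substitution over 4 characters
print(wer_sketch("the cat sat", "the cat sat down"))  # 1 insertion over 3 words
```

Because the denominator is the reference length, either rate can exceed 1.0 when the hypothesis contains many insertions.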

In [None]:
! pip install jiwer

In [33]:
from jiwer import cer

In [None]:
# calculate cer (reference first, hypothesis second)
cer(sample['text'].lower(), result['text'])

## Zeroth-Korean

- 51.6 hours of Korean training data (22,263 utterances, 105 speakers, 3,000 sentences)
- Recorded on mobile phones
- https://github.com/goodatlas/zeroth

In [8]:
# Zeroth-Korean dataset load
korea_ds = load_dataset("kresnik/zeroth_korean", "clean")

In [9]:
# inspect the dataset splits
korea_ds

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
        num_rows: 22263
    })
    test: Dataset({
        features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
        num_rows: 457
    })
})

In [13]:
ko_test_ds = korea_ds['test']

In [None]:
len(ko_test_ds)

In [None]:
# check sample data
sample = ko_test_ds[0]
pprint(sample)

In [None]:
# listen audio file using ipd.Audio
ipd.Audio(sample['audio']['array'], rate=sample['audio']['sampling_rate'])

## Recognizing Zeroth-Korean with the Whisper model

In [None]:
# transcribe using whisper model
ko_result = model.transcribe(sample['file'])

In [None]:
# check result
pprint(ko_result)

In [None]:
# compare with reference
print("hypothesis: " + ko_result['text'])
print("reference: " + sample['text'])

In [None]:
# calculate cer on the raw hypothesis (reference first, hypothesis second)
cer(sample['text'], ko_result['text'])

In [None]:
# remove special symbols (punctuation) before scoring

import re
# re.sub(r'[-=+,#/\?:^.@*\"※~ㆍ!』‘|\(\)\[\]`\'…》\”\“\’·]', '', text)

In [None]:
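# note: despite its name, `ref` holds the cleaned hypothesis; the reference is sample['text']
ref = ko_result['text'].lstrip()
# raw string avoids invalid-escape warnings; the pattern itself is unchanged
ref = re.sub(r'[-=+,#/\?:^.@*\"※~ㆍ!』‘|\(\)\[\]`\'…》\”\“\’·]', '', ref)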

In [None]:
print(ref)

In [None]:
# calculate cer
cer(sample['text'], ref)

In [None]:
# whisper load_audio
# whisper pad_or_trim

audio = whisper.load_audio(sample['file'])
audio.shape

(78480,)

In [None]:
audio = whisper.pad_or_trim(audio)
audio.shape

(480000,)

In [None]:
480000/16000

30.0
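
`whisper.pad_or_trim` brings every clip to a fixed 30-second window, which is why 78,480 samples become 480,000 above. The following is a rough pure-Python sketch of the idea, assuming 16 kHz audio and zero-padding at the end. `pad_or_trim_sketch` is a hypothetical stand-in that works on lists, not the library's NumPy/PyTorch implementation.

```python
SAMPLE_RATE = 16_000                       # samples per second
CHUNK_SECONDS = 30                         # fixed window length in seconds
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS    # samples in one window

def pad_or_trim_sketch(samples, length=N_SAMPLES):
    """Cut long clips to `length`; extend short clips with trailing zeros."""
    if len(samples) > length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))

clip = [0.1] * 78_480                      # dummy clip with the same length as the sample
padded = pad_or_trim_sketch(clip)
print(len(clip), len(padded))
```

A fixed input length lets the model process every clip with the same tensor shape; the padded region is silence.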

In [None]:
# duration of the original (unpadded) clip in seconds
78480/16000

In [None]:
# whisper log_mel_spectrogram (moved to the same device as the model)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

In [None]:
mel

In [None]:
mel.shape

torch.Size([80, 3000])
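
The shape `[80, 3000]` follows from Whisper's front-end settings: 80 mel bins, and one frame every 160 samples over the padded 480,000-sample window. The following small arithmetic check writes those constants out as assumptions, taken from Whisper's `audio.py` as it stood when this notebook was written; newer `large` checkpoints may use a different number of mel bins.

```python
SAMPLE_RATE = 16_000   # samples per second
N_SAMPLES = 480_000    # 30 seconds of padded audio
HOP_LENGTH = 160       # samples between consecutive frames (assumed)
N_MELS = 80            # mel filterbank channels (assumed)

n_frames = N_SAMPLES // HOP_LENGTH           # frames in one 30-second window
frame_ms = 1000 * HOP_LENGTH / SAMPLE_RATE   # time step per frame in milliseconds
print((N_MELS, n_frames), frame_ms)
```

Each column of the spectrogram therefore covers a 10 ms step, and the 30-second window yields 3,000 columns.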

- `matplotlib`: a visualization library

In [11]:
import matplotlib.pyplot as plt

In [None]:
mel_cpu = mel.cpu()

In [None]:
plt.imshow(mel_cpu, aspect='auto', interpolation='nearest', origin='lower')

In [None]:
# whisper decode
# options = whisper.DecodingOptions(fp16 = False)
result = whisper.decode(model, mel)

In [None]:
sample['text'].lower()

In [None]:
pprint(result)

In [None]:
whisper_result = model.transcribe(sample['file'])

In [None]:
pprint(whisper_result)

In [None]:
from jiwer import wer

In [None]:
wer(sample['text'], ref)

0.34782608695652173

## Record live audio and check the Whisper model's recognition result
Build a simple web UI with Gradio to record your own voice in real time  
and check the Whisper model's recognition result for the recording.  
https://github.com/innovatorved/whisper-openai-gradio-implementation

In [None]:
! pip install gradio

In [4]:
import gradio as gr
import time

In [5]:
def SpeechToText(audio):
    # nothing recorded yet: return one empty value per output component
    if audio is None:
        return ("", "")
    time.sleep(1)

    # load the recording and pad or trim it to 30 seconds
    audio = whisper.load_audio(audio)
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the most probable language
    _, probs = model.detect_language(mel)
    language = max(probs, key=probs.get)

    # decode audio to text; pass the options so that fp16=False takes effect
    options = whisper.DecodingOptions(fp16=False)
    result = whisper.decode(model, mel, options)
    return (language, result.text)
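
For a single clip, `model.detect_language` returns a mapping from language code to probability, and `max(probs, key=probs.get)` picks the most likely code. The following toy illustration uses made-up probabilities; the values are hypothetical, not model output.

```python
# Hypothetical probabilities standing in for the dict returned by detect_language
probs = {"en": 0.07, "ko": 0.91, "ja": 0.02}

# the key with the highest probability
language = max(probs, key=probs.get)

# the two most likely languages, useful when the top choice is uncertain
top2 = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(language, top2)
```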

In [7]:
??gr.Interface

In [None]:
gr.Interface(
    title = 'OpenAI Whisper implementation on Gradio Web UI',
    fn=SpeechToText,

    inputs=[
        # `source=` is the Gradio 3.x argument; Gradio 4 and later use `sources=["microphone"]`
        gr.Audio(source="microphone", type="filepath")
    ],
    outputs=[
        "label",
        "textbox",
    ],
    live=True
).launch(
    debug=False,
    share=True
)