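This code is adapted from the [OpenAI Whisper](https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb) project on GitHub and was tested with the Colab example notebook. I (梁齊恆) made some additions and modifications and added explanatory comments.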

Official [LICENSE](https://github.com/openai/whisper/blob/main/LICENSE)

# Installing Whisper

The commands below install the Python packages needed to use Whisper models and to evaluate the transcription results. The accuracy test on the dataset later in this notebook relies on them.

In [42]:
! pip install git+https://github.com/openai/whisper.git
! pip install jiwer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-_5llafm2
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-_5llafm2
  Resolved https://github.com/openai/whisper.git to commit 248b6cb124225dd263bb9bd32d060b6517e067f8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Loading the LibriSpeech dataset
The following loads the `test-clean` split of the LibriSpeech corpus using torchaudio.

LibriSpeech is a corpus of approximately 1,000 hours of read English speech; torchaudio is the PyTorch library for audio processing.


In [39]:
import os
import numpy as np

try:
    import tensorflow  # imported first to avoid compatibility issues on Colab
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

from tqdm.notebook import tqdm


DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

In [40]:
class LibriSpeech(torch.utils.data.Dataset):
    """
    A simple class to wrap LibriSpeech and trim/pad the audio to 30 seconds.
    It will drop the last few seconds of a very small portion of the utterances.
    """
    def __init__(self, split="test-clean", device=DEVICE):
        self.dataset = torchaudio.datasets.LIBRISPEECH(
            root=os.path.expanduser("~/.cache"),
            url=split,
            download=True,
        )
        self.device = device

    def __len__(self):
        return len(self.dataset)  # number of utterances in the split

    def __getitem__(self, item):
        audio, sample_rate, text, _, _, _ = self.dataset[item]
        assert sample_rate == 16000
        audio = whisper.pad_or_trim(audio.flatten()).to(self.device)  # pad or trim the audio to exactly 30 seconds
        mel = whisper.log_mel_spectrogram(audio)  # compute the log-Mel spectrogram

        return (mel, text)  # tuple of the log-Mel spectrogram and the reference transcript

In [41]:
dataset = LibriSpeech("test-clean")
loader = torch.utils.data.DataLoader(dataset, batch_size=16)

100%|██████████| 331M/331M [00:17<00:00, 20.0MB/s]
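
To make the 30-second window concrete, here is a minimal NumPy sketch of the pad-or-trim step. It is not Whisper's implementation: `pad_or_trim_sketch`, `SAMPLE_RATE`, and `N_SAMPLES` are illustrative names, and the sketch assumes 16 kHz mono audio, as asserted in `__getitem__`.

```python
import numpy as np

SAMPLE_RATE = 16000            # LibriSpeech sampling rate, as asserted in the dataset class
N_SAMPLES = 30 * SAMPLE_RATE   # 30-second window = 480,000 samples


def pad_or_trim_sketch(audio: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Illustrative only: zero-pad or cut a 1-D waveform to exactly `length` samples."""
    if audio.shape[0] > length:
        return audio[:length]                            # drop everything past 30 s
    return np.pad(audio, (0, length - audio.shape[0]))   # append silence (zeros)


short = pad_or_trim_sketch(np.ones(5 * SAMPLE_RATE, dtype=np.float32))
long = pad_or_trim_sketch(np.ones(35 * SAMPLE_RATE, dtype=np.float32))
print(short.shape, long.shape)
```

Both clips come out with the same length, which is what lets the `DataLoader` stack them into fixed-size batches. The trimming branch is also why the docstring warns that the last few seconds of very long utterances are dropped.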


# Running inference on the dataset using a base Whisper model

The following takes a few minutes to transcribe all utterances in the dataset.

In [43]:
#@markdown Models come in English-only and multilingual variants. Available sizes, from smallest to largest, are `tiny`, `base`, `small`, `medium`, and `large`; there is no English-only `large` model.
model = whisper.load_model("base.en")  # select the English-only base model
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)

100%|███████████████████████████████████████| 139M/139M [00:01<00:00, 91.1MiB/s]


Model is English-only and has 71,825,408 parameters.
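
The naming rule described in the cell above can be captured in a small helper. This is a hypothetical convenience function (`whisper_model_name` is not part of the `whisper` package); it only encodes what the note states: five sizes, an `.en` suffix for English-only checkpoints, and no English-only `large`.

```python
SIZES = ("tiny", "base", "small", "medium", "large")


def whisper_model_name(size: str, english_only: bool = False) -> str:
    """Hypothetical helper: build the string passed to whisper.load_model()."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    if english_only:
        if size == "large":
            raise ValueError("there is no English-only large model")
        return f"{size}.en"
    return size


print(whisper_model_name("base", english_only=True))  # base.en
```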


In [44]:
# predict without timestamps for short-form transcription
options = whisper.DecodingOptions(language="en", without_timestamps=True)

In [45]:
hypotheses = []
references = []

for mels, texts in tqdm(loader):
    results = model.decode(mels, options)
    hypotheses.extend([result.text for result in results])
    references.extend(texts)

  0%|          | 0/164 [00:00<?, ?it/s]

In [46]:
data = pd.DataFrame(dict(hypothesis=hypotheses, reference=references))
data

| | hypothesis | reference |
|---|---|---|
| 0 | He hoped there would be stew for dinner, turni... | HE HOPED THERE WOULD BE STEW FOR DINNER TURNIP... |
| 1 | Stuffered into you, his belly counseled him. | STUFF IT INTO YOU HIS BELLY COUNSELLED HIM |
| 2 | After early nightfall the yellow lamps would l... | AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L... |
| 3 | Hello Bertie, any good in your mind? | HELLO BERTIE ANY GOOD IN YOUR MIND |
| 4 | Number 10. Fresh Nelly is waiting on you. Good... | NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD ... |
| ... | ... | ... |
| 2615 | Oh, to shoot my soul's full meaning into futur... | OH TO SHOOT MY SOUL'S FULL MEANING INTO FUTURE... |
| 2616 | Then I, long tried by natural ills, received t... | THEN I LONG TRIED BY NATURAL ILLS RECEIVED THE... |
| 2617 | I love thee freely as men strive for right. I ... | I LOVE THEE FREELY AS MEN STRIVE FOR RIGHT I L... |
| 2618 | I love thee with the passion put to use, in my... | I LOVE THEE WITH THE PASSION PUT TO USE IN MY ... |


# Calculating the word error rate

Now we use Whisper's English normalizer to standardize the transcriptions and then calculate the WER.

jiwer is a Python package for evaluating automatic speech recognition (ASR) systems. It can compute the following measures:
1.   word error rate (WER)
2.   match error rate (MER)
3.   word information lost (WIL)
4.   word information preserved (WIP)
5.   character error rate (CER)
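
As a reference for the first measure, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words: $\mathrm{WER} = (S + D + I) / N$, where $S$, $D$, and $I$ count substituted, deleted, and inserted words and $N$ is the reference length. The sketch below computes it for a single sentence pair in plain Python; `word_error_rate` is an illustrative name, and `jiwer` remains the tool used for the actual evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


# A pair taken from this notebook's normalized results: "stuff it" was heard as "stuffered"
ref = "stuff it into you his belly counseled him"
hyp = "stuffered into you his belly counseled him"
print(f"{word_error_rate(ref, hyp):.2%}")  # 25.00%
```

One substitution plus one deletion against eight reference words gives 2/8. Note that WER can exceed 100 % when the hypothesis contains many insertions.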



In [47]:
import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()  # Whisper's English text normalizer

In [48]:
data["hypothesis_clean"] = [normalizer(text) for text in data["hypothesis"]]
data["reference_clean"] = [normalizer(text) for text in data["reference"]]
data

| | hypothesis | reference | hypothesis_clean | reference_clean |
|---|---|---|---|---|
| 0 | He hoped there would be stew for dinner, turni... | HE HOPED THERE WOULD BE STEW FOR DINNER TURNIP... | he hoped there would be stew for dinner turnip... | he hoped there would be stew for dinner turnip... |
| 1 | Stuffered into you, his belly counseled him. | STUFF IT INTO YOU HIS BELLY COUNSELLED HIM | stuffered into you his belly counseled him | stuff it into you his belly counseled him |
| 2 | After early nightfall the yellow lamps would l... | AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD L... | after early nightfall the yellow lamps would l... | after early nightfall the yellow lamps would l... |
| 3 | Hello Bertie, any good in your mind? | HELLO BERTIE ANY GOOD IN YOUR MIND | hello bertie any good in your mind | hello bertie any good in your mind |
| 4 | Number 10. Fresh Nelly is waiting on you. Good... | NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD ... | number 10 fresh nelly is waiting on you good n... | number 10 fresh nelly is waiting on you good n... |
| ... | ... | ... | ... | ... |
| 2615 | Oh, to shoot my soul's full meaning into futur... | OH TO SHOOT MY SOUL'S FULL MEANING INTO FUTURE... | 0 to shoot my soul is full meaning into future... | 0 to shoot my soul is full meaning into future... |
| 2616 | Then I, long tried by natural ills, received t... | THEN I LONG TRIED BY NATURAL ILLS RECEIVED THE... | then i long tried by natural ills received the... | then i long tried by natural ills received the... |
| 2617 | I love thee freely as men strive for right. I ... | I LOVE THEE FREELY AS MEN STRIVE FOR RIGHT I L... | i love thee freely as men strive for right i l... | i love thee freely as men strive for right i l... |
| 2618 | I love thee with the passion put to use, in my... | I LOVE THEE WITH THE PASSION PUT TO USE IN MY ... | i love thee with the passion put to use in my ... | i love thee with the passion put to use in my ... |
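
To see why normalization matters before scoring, consider a deliberately simplified stand-in for the normalizer. This is not `EnglishTextNormalizer`, which, as the table shows, also rewrites numbers (`TEN` becomes `10`), contractions (`soul's` becomes `soul is`), and spelling variants (`COUNSELLED` becomes `counseled`). `simple_normalize` is a hypothetical helper that only lowercases and strips punctuation.

```python
import re


def simple_normalize(text: str) -> str:
    """Hypothetical, simplified normalizer: lowercase, drop punctuation, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)   # replace punctuation (except apostrophes) with spaces
    return re.sub(r"\s+", " ", text).strip()


hypothesis = "Hello Bertie, any good in your mind?"
reference = "HELLO BERTIE ANY GOOD IN YOUR MIND"

# Without normalization every word differs in case or punctuation
print(hypothesis.split() == reference.split())                                      # False
# After normalization the two transcripts are word-for-word identical
print(simple_normalize(hypothesis).split() == simple_normalize(reference).split())  # True
```

Without this step, a perfectly correct transcript would be scored as if every word were wrong, because LibriSpeech references are upper-case and unpunctuated while Whisper produces cased, punctuated text.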


In [49]:
wer = jiwer.wer(list(data["reference_clean"]), list(data["hypothesis_clean"]))

print(f"Word Error Rate: {wer * 100:.2f} %")  # word error rate as a percentage

Word Error Rate: 4.28 %


___
# Supplement: using Whisper directly as a Python package


In [37]:
!git clone https://github.com/boy20100619/test.git
import whisper

model = whisper.load_model("base")
result = model.transcribe("./test/scottish-accent.wav")  # model.transcribe("path/to/your/file")
print("-------------------------------","\n")
print(result["text"])

Cloning into 'test'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 1.88 MiB | 6.64 MiB/s, done.
------------------------------- 

 One of the most famous landmarks in the border is three hills. And the myth is that Maryland and the Magicians split one hill into three and the left to two hills at the back of us which you can see. The weather's never good though, always seeing the border's of the mist on the Yolkens. We never get the good weather. And as you can see today, there's no sunshine. It's a typical Scottish border's day. Fantastic!
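
Besides `"text"`, the dictionary returned by `model.transcribe` also contains a `"segments"` list whose entries carry `start` and `end` times in seconds along with their `text`. The sketch below formats such a list as timestamped lines. `format_timestamp` and `format_segments` are illustrative helpers, and the sample segments are made-up placeholders rather than real output for this recording.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.mmm."""
    millis = round(seconds * 1000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{minutes:02d}:{secs:02d}.{millis:03d}"


def format_segments(segments: list[dict]) -> str:
    """Illustrative helper: one '[start --> end] text' line per segment."""
    return "\n".join(
        f"[{format_timestamp(s['start'])} --> {format_timestamp(s['end'])}] {s['text'].strip()}"
        for s in segments
    )


# Placeholder data in the same shape as result["segments"]; not real output.
example_segments = [
    {"start": 0.0, "end": 4.5, "text": " First sentence."},
    {"start": 4.5, "end": 65.25, "text": " Second sentence."},
]
print(format_segments(example_segments))
```

With a real result, the call would be `format_segments(result["segments"])`.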


In [54]:
import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("./test/nahida_0.wav")  # whisper.load_audio("path/to/your/file")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language is: {max(probs, key=probs.get)}")

# decode the audio
# options = whisper.DecodingOptions()  # default options
options = whisper.DecodingOptions(fp16=False)  # set fp16=False to decode in fp32 where fp16 is not supported
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)

Detected language is: zh
我自认为我擅长提问和回答问题但我见见明白有很多人是揣着明白装糊涂问题的答案并不能帮上他们的忙是不是随着年龄增长大家都会失去面对质问和答案的勇气呢
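
In the cell above, `probs` maps language codes to probabilities and only the most likely code is printed. When a clip is ambiguous it can help to look at the runners-up as well. A minimal sketch, with `top_languages` as an illustrative helper and made-up probabilities:

```python
def top_languages(probs: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Illustrative helper: the k most probable language codes, highest first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up values in the same shape as `probs`; not real model output.
example_probs = {"zh": 0.91, "ja": 0.05, "ko": 0.02, "en": 0.01}
for code, p in top_languages(example_probs):
    print(f"{code}: {p:.2f}")
```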


In [31]:
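#@markdown Upload an audio file through Google Colab to test Whisper speech-to-text; .wav and .mp3 files are accepted
import os
from google.colab import files
import whisper

# Upload a file and keep the name of an uploaded .wav or .mp3 file
filename = ""
uploaded = files.upload()
for name in uploaded.keys():
  if name.lower().endswith((".wav", ".mp3")):  # match on the extension, not anywhere in the name
    filename = name

if filename != "":
  audio = whisper.load_audio(filename)
  audio = whisper.pad_or_trim(audio)

  mel = whisper.log_mel_spectrogram(audio).to(model.device)  # `model` was loaded in the previous cell
  _, probs = model.detect_language(mel)
  print("\n---", f"Detected language: {max(probs, key=probs.get)}")
  options = whisper.DecodingOptions(fp16=False)
  result = whisper.decode(model, mel, options)

  print(result.text)
  # Save the transcript to a text file (UTF-8, so non-English text is written correctly)
  with open(f"output_{os.path.splitext(filename)[0]}.txt", "w", encoding="utf-8") as f:
    f.write(result.text)
else:
  print("No .wav or .mp3 file was uploaded.")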

Saving 胡桃 らくらく安楽死 meme.mp4 to 胡桃 らくらく安楽死 meme.mp4
