# Whisper large-v3

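Whisper large-v3 is an automatic speech recognition (ASR) and speech translation model developed by OpenAI. It uses 128 Mel frequency bins for its spectrogram input, up from 80 in earlier versions, which captures audio features in finer detail. It also adds support for Cantonese, strengthening its multilingual coverage. Whisper large-v3 was trained on 1,000,000 hours of weakly labeled audio and 4,000,000 hours of pseudo-labeled audio, and it adapts well to diverse languages, accents, and background noise. It performs well on both speech recognition and speech translation, and it achieves a 10% to 20% reduction in error rate compared with its predecessor, Whisper large-v2.
- Hugging Face: https://huggingface.co/openai/whisper-large-v3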
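
To make the change from 80 to 128 Mel bins concrete, the sketch below computes the shape of the log-Mel input for one 30-second window. The 16 kHz sampling rate and the hop length of 160 samples are not stated in this notebook; they are assumptions based on Whisper's usual preprocessing settings, and `log_mel_shape` is a hypothetical helper, not a library function.

```python
# Assumed preprocessing constants (not taken from this notebook).
SAMPLE_RATE = 16_000   # samples per second
HOP_LENGTH = 160       # samples between consecutive spectrogram frames
CHUNK_SECONDS = 30     # length of one input window


def log_mel_shape(n_mels, seconds=CHUNK_SECONDS):
    """Return (n_mels, n_frames) for a log-Mel spectrogram of the given length."""
    n_samples = seconds * SAMPLE_RATE
    n_frames = n_samples // HOP_LENGTH
    return (n_mels, n_frames)


print(log_mel_shape(80))    # earlier models: 80 Mel bins
print(log_mel_shape(128))   # large-v3: 128 Mel bins
```

Under these assumptions only the frequency axis grows; the number of time frames per window stays the same.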

<a href="https://colab.research.google.com/github/fuyu-quant/data-science-wiki/blob/main/multimodal/speech_to_text/whisper_large-v3.ipynb" target="_blank" rel="noopener noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
%%capture
!pip install datasets
!pip install accelerate

In [1]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

### Preparing the data

In [12]:
# Load a long-form LibriSpeech validation clip and take the audio of the first example.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
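
The pipeline used later accepts audio as a dictionary holding the waveform and its sampling rate. The sketch below builds a synthetic stand-in with the same two keys so the structure can be inspected without downloading anything. It is only an illustration: the real `sample` usually carries a NumPy array (and its exact type depends on the installed `datasets` version), while this version uses a plain list, and `make_tone` and `duration_seconds` are hypothetical helpers.

```python
import math


def make_tone(seconds=2.0, sampling_rate=16_000, freq_hz=440.0):
    """Build a synthetic audio dict with the keys "array" and "sampling_rate"."""
    n_samples = int(seconds * sampling_rate)
    array = [
        math.sin(2 * math.pi * freq_hz * i / sampling_rate)
        for i in range(n_samples)
    ]
    return {"array": array, "sampling_rate": sampling_rate}


def duration_seconds(audio):
    """Duration in seconds: number of samples divided by the sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


fake_sample = make_tone()
print(sorted(fake_sample.keys()))
print(duration_seconds(fake_sample))
```

The same `duration_seconds` calculation should apply to the real `sample`, assuming it exposes `array` and `sampling_rate` in this form.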

### Running Whisper large-v3

In [13]:
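# Use the GPU with float16 when CUDA is available; otherwise use the CPU with float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the weights from the Hugging Face Hub and move the model to the selected device.
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True).to(device)
# The processor bundles the tokenizer and the feature extractor.
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")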

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
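
The choice between `float16` and `float32` above mainly affects memory. As a rough sketch, the weights alone can be estimated from the parameter count. The figure of about 1.55 billion parameters is an assumption (it is not stated in this notebook), and the estimate ignores activations, the decoding cache, and framework overhead.

```python
# Approximate parameter count for the large Whisper models (assumption).
N_PARAMS = 1_550_000_000

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_gib(n_params, dtype):
    """Memory needed for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_gib(N_PARAMS, dtype):.2f} GiB")
```

Half precision halves the weight footprint, which is why the notebook switches to `float16` whenever a GPU is present.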


In [14]:
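pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,      # cap on tokens generated per chunk
    chunk_length_s=30,       # split long audio into 30-second chunks
    batch_size=16,           # number of chunks decoded together
    return_timestamps=True,  # also return segment-level timestamps
    torch_dtype=torch_dtype,
    device=device,
)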
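
To see how `chunk_length_s=30` and `batch_size=16` interact on long audio, the following sketch lists the overlapping windows a chunked pipeline would process. It is a simplified illustration rather than the library's own code: it assumes an overlap of one sixth of the chunk length on each side (believed to be the pipeline default, but treat that as an assumption), and `chunk_spans` and `num_batches` are hypothetical helpers.

```python
import math


def chunk_spans(total_s, chunk_s=30.0, stride_s=None):
    """Return (start, end) spans in seconds for overlapping chunks."""
    if stride_s is None:
        stride_s = chunk_s / 6  # assumed default overlap on each side
    step = chunk_s - 2 * stride_s
    if step <= 0:
        raise ValueError("stride_s must be less than half of chunk_s")
    spans = []
    start = 0.0
    while True:
        # The last chunk is clipped to the end of the audio.
        spans.append((start, min(start + chunk_s, total_s)))
        if start + chunk_s >= total_s:
            break
        start += step
    return spans


def num_batches(n_chunks, batch_size=16):
    """Number of forward passes needed to decode all chunks."""
    return math.ceil(n_chunks / batch_size)


spans = chunk_spans(64.0)  # a hypothetical 64-second recording
print(spans)
print(num_batches(len(spans)))
```

Consecutive windows overlap so that words cut at a chunk boundary can be recovered when the per-chunk transcripts are merged.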

In [15]:
result = pipe(sample)
result["text"]

" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Upguards and Adam paintings, and Mason's exquisite idylls are as national as a jingo poem. Mr. Burkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, Next man!"