# Testing OpenAI's Whisper Speech Recognition Model

This is a Colab notebook that allows you to process large audio files and transcribe them using OpenAI's Whisper model.

You can upload your own audio samples using the folder icon on the left of this page. That gives you access to a file system you can upload to by dragging files into it. You can see examples of how to run the transcription in a couple of the cells below.

## Install the Whisper Code

In [1]:
! pip install git+https://github.com/openai/whisper.git -q
! pip install pydub -q



## Load the ML Model

In [5]:
import whisper

model = whisper.load_model("small")

## Check we have a GPU

You should see the output `device(type='cuda', index=0)` below. If you don't, you are probably on a CPU-only Colab instance, which will run more slowly. Go to `Runtime->Change Runtime Type` and select a GPU to fix this.

In [6]:
model.device

device(type='cuda', index=0)
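
The same check can be done before loading the model. The sketch below is one possible way to do it, not part of the original notebook: `pick_device` is a hypothetical helper name, and the guarded import is only there so the snippet also runs on machines without PyTorch installed.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # installed alongside Whisper on Colab
    except ImportError:
        # Assumption: without PyTorch there is no GPU support to use
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Whisper would run on: {device}")
```

The result can be passed to the loader as `whisper.load_model("small", device=device)` if you want to choose the device explicitly.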

## Download Test Audio Files

This repository has a couple of pre-recorded MP3s to run through the transcribe function. You can listen to them with the audio widgets displayed below.

In [7]:
!git clone https://github.com/petewarden/openai-whisper-webapp

fatal: destination path 'openai-whisper-webapp' already exists and is not an empty directory.


In [8]:
from IPython.display import Audio
Audio("/content/openai-whisper-webapp/mary.mp3")

In [9]:
from IPython.display import Audio
Audio("/content/openai-whisper-webapp/daisy_HAL_9000.mp3")

In [59]:
from IPython.display import Audio
Audio("/content/Chaotic parish council zoom meeting goes viral You have no authority here Jackie Weaver.mp3")

## Define the Transcribe Function

Now that the model is loaded, we can define the function that takes an audio file path as input and returns the recognized text. It also logs which languages the model thinks are being spoken.

In [93]:
def transcribe(audio):
    # load the audio file as a 16 kHz mono waveform
    audio = whisper.load_audio(audio)

    # decode() cannot process more than 30 seconds, so pad or trim to that length
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language
    _, probs = model.detect_language(mel)
    # print(f"Detected language: {max(probs, key=probs.get)}")
    # report every language whose probability is at least 0.3
    print(f"Detected languages: {[p for p in probs if probs[p] >= 0.3 ]}")

    # decode the audio, forcing English regardless of the detected language
    options = whisper.DecodingOptions(language="en")
    result = whisper.decode(model, mel, options)
    return result.text
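
The language report in `transcribe()` keeps every language whose probability is at least 0.3. The sketch below isolates that filtering step so it can be tried without a model. `likely_languages` is a hypothetical helper, and the probabilities are made up for illustration rather than taken from a real run.

```python
def likely_languages(probs, threshold=0.3):
    """Return (language, probability) pairs at or above threshold, most likely first."""
    hits = [(lang, p) for lang, p in probs.items() if p >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical probabilities, shaped like the dict returned by model.detect_language()
example_probs = {"en": 0.31, "cy": 0.62, "de": 0.01, "fr": 0.06}
print(likely_languages(example_probs))  # [('cy', 0.62), ('en', 0.31)]
```

Sorting by probability makes it easier to see which language the model prefers when more than one passes the threshold.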


## Test with Pre-Recorded Audio

Before we bring up the UI to allow you to record your own live audio, we're going to run the `transcribe()` function on a couple of the MP3s we've downloaded. For `mary.mp3`, which I recorded as an example of clear audio, you should see `Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go.` The second file is a lot harder to transcribe because the audio is very distorted, but the model does a good job with `Tazy, Tazy, Tazy. Give me your answer to time after crazy all for the love of you. It won't be a stylish marriage`. You'll notice the transcript is cut off after 30 seconds, because `transcribe()` pads or trims its input to a single 30-second window. Longer recordings are handled in the section on processing large audio files below.

In [94]:
easy_text = transcribe("/content/openai-whisper-webapp/mary.mp3")
print(easy_text)

hard_text = transcribe("/content/openai-whisper-webapp/daisy_HAL_9000.mp3")
print(hard_text)

Detected language: ['en']
Mary had a little lamb, its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go.
Detected language: ['en']
Tazy, Tazy, Tazy. Give me your answer to time after crazy all for the love of you. It won't be a stylish marriage
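
The 30-second cut-off comes from the pad-or-trim step in `transcribe()`. The sketch below is a simplified NumPy stand-in for that step, not Whisper's own code. It assumes the 16 kHz sample rate that Whisper resamples audio to, and `pad_or_trim_sketch` is a hypothetical name.

```python
import numpy as np

SAMPLE_RATE = 16_000                      # samples per second after resampling
CHUNK_SECONDS = 30                        # length of one decoding window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples


def pad_or_trim_sketch(samples, length=N_SAMPLES):
    """Return exactly `length` samples: cut the tail or append zeros (silence)."""
    if samples.shape[0] > length:
        return samples[:length]
    if samples.shape[0] < length:
        return np.pad(samples, (0, length - samples.shape[0]))
    return samples


short_clip = np.ones(5 * SAMPLE_RATE, dtype=np.float32)    # 5 seconds
long_clip = np.ones(45 * SAMPLE_RATE, dtype=np.float32)    # 45 seconds
print(pad_or_trim_sketch(short_clip).shape)  # (480000,)
print(pad_or_trim_sketch(long_clip).shape)   # (480000,)
```

Anything past the first 480,000 samples is simply dropped, which is why the longer test files stop mid-sentence.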


In [95]:
hard_text = transcribe("/content/Joe Rogan realizes Kanye West is insanesupercut edition.mp3")
print(hard_text)

Detected language: ['en']
What's up? We made it happen. We're in the building, yes sir. So, what are you doing? You're running for president? Yes. If you're in that position and we have to deal with some sort of a military action with China, what if China takes over Taiwan? What if something happens with Syria? What if something happens with Iran? Have you thought about this?


In [96]:
hard_text = transcribe("/content/Joe Rogan The TRUTH About The Pyramids No Pharaohs Were Ever Found Inside A Pyramid.mp3")
print(hard_text)

Detected language: ['en']
I know fingerprints of the gods you released in the 90s. It was 1990. 1995, that's when I first read it, you know, became obsessed. What motivated you to put that out? It was a process really. I used to be a current affairs journalist. I was the East Africa correspondent for the economist. I had no interest in history whatsoever, but I began to come across things, particularly traveling initially in Ethiopia and then in Egypt, which made me wonder about the past


In [97]:
hard_text = transcribe("/content/Chaotic parish council zoom meeting goes viral You have no authority here Jackie Weaver.mp3")
print(hard_text)

Detected language: ['en', 'cy']
Hello again. Hello again. I thought it wasn't going to get in then. When do we plan to start? I think we could start any moment chairman. I think it's perhaps helpful just to go through the same things as we went through before, which is just to encourage people to switch off their microphones, because it just reduced the background bit.


# Processing Large Audio Files
Whisper's decoder handles at most 30 seconds of audio at a time.
To transcribe longer recordings, we need to split the audio into small chunks and process them separately.



In [60]:
import os
from hashlib import sha256

import numpy as np

from pydub import AudioSegment
from pydub.utils import make_chunks

def setup_output_dir(filename):
    # name the directory after a hash of the file path so each input gets its own folder
    output_dir_name = "/content/" + sha256(filename.encode("utf-8")).hexdigest()
    # create the directory if it does not exist yet
    os.makedirs(output_dir_name, exist_ok=True)

    return output_dir_name

def chunk_audio_into_30s_clips(audio_file):
    audio_segment = AudioSegment.from_file(audio_file, "mp3")

    # pydub measures time in milliseconds
    chunk_length_ms = 30000

    # make chunks of 30 seconds (the last chunk may be shorter)
    chunks = make_chunks(audio_segment, chunk_length_ms)

    # set up the output directory
    output_dir = setup_output_dir(audio_file)

    # export each chunk as its own mp3 file
    for i, chunk in enumerate(chunks):
        chunk_name = f"{output_dir}/chunk_{i}.mp3"
        print("exporting", chunk_name)
        chunk.export(chunk_name, format="mp3")

    return output_dir
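
To see where the cuts fall, the chunk boundaries can be computed without touching any audio. This is a minimal sketch of the arithmetic behind fixed-length chunking. `chunk_bounds` is a hypothetical helper and the 95-second duration is an invented example.

```python
import math


def chunk_bounds(duration_ms, chunk_length_ms=30_000):
    """Return (start_ms, end_ms) pairs covering the whole recording."""
    n_chunks = math.ceil(duration_ms / chunk_length_ms)
    return [
        (i * chunk_length_ms, min((i + 1) * chunk_length_ms, duration_ms))
        for i in range(n_chunks)
    ]


bounds = chunk_bounds(95_000)  # a hypothetical 95-second recording
print(bounds)  # [(0, 30000), (30000, 60000), (60000, 90000), (90000, 95000)]
```

The final chunk is usually shorter than 30 seconds, and it gets padded with silence when it is transcribed. Cutting at fixed times can also split a word across two chunks, so expect occasional errors at chunk boundaries.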



## Split the long audio file into smaller chunks and transcribe them 

In [None]:
import os

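def transcribe_large_audio(audio_file):
    path_name = chunk_audio_into_30s_clips(audio_file)

    # os.listdir() returns names in arbitrary order, so sort by chunk index
    # (chunk_2 before chunk_10) to keep the transcript in playback order
    filenames = sorted(os.listdir(path_name),
                       key=lambda name: int(name.split("_")[1].split(".")[0]))

    transcription = []
    for filename in filenames:
        filepath = f"{path_name}/{filename}"
        print("transcribing", filepath)
        transcription.append(transcribe(filepath))

    return transcription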

transcribe_large_audio("/content/Joe Rogan The TRUTH About The Pyramids No Pharaohs Were Ever Found Inside A Pyramid.mp3")
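
`transcribe_large_audio()` returns one string per chunk. If you want a single block of text, something like the sketch below could join them. `join_transcripts` is a hypothetical helper and the sample strings are placeholders, not real model output.

```python
def join_transcripts(chunks):
    """Join per-chunk transcripts into one string, skipping empty chunks."""
    parts = [text.strip() for text in chunks if text and text.strip()]
    return " ".join(parts)


# Placeholder chunk texts standing in for the list returned by transcribe_large_audio()
sample_chunks = ["First chunk of speech.", "  ", "Second chunk of speech. "]
print(join_transcripts(sample_chunks))  # First chunk of speech. Second chunk of speech.
```

Skipping blank entries avoids doubled spaces when a chunk contains only silence.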

In [61]:
transcribe_large_audio("/content/Chaotic parish council zoom meeting goes viral You have no authority here Jackie Weaver.mp3")

exporting /content/458fc22bf515905f8e842dd061b0fd2cdf4fb5d2f321c02a7dbc95df127095d7/chunk_0.mp3
exporting /content/458fc22bf515905f8e842dd061b0fd2cdf4fb5d2f321c02a7dbc95df127095d7/chunk_1.mp3
exporting /content/458fc22bf515905f8e842dd061b0fd2cdf4fb5d2f321c02a7dbc95df127095d7/chunk_2.mp3
exporting /content/458fc22bf515905f8e842dd061b0fd2cdf4fb5d2f321c02a7dbc95df127095d7/chunk_3.mp3
exporting /content/458fc22bf515905f8e842dd061b0fd2cdf4fb5d2f321c02a7dbc95df127095d7/chunk_4.mp3
exporting /content/458fc22bf515905f8e842dd061b0fd2cdf4fb5d2f321c02a7dbc95df127095d7/chunk_5.mp3
exporting /content/458fc22bf515905f8e842dd061b0fd2cdf4fb5d2f321c02a7dbc95df127095d7/chunk_6.mp3
transcribing /content/458fc22bf515905f8e842dd061b0fd2cdf4fb5d2f321c02a7dbc95df127095d7/chunk_0.mp3
Detected language: cy
{'en': 0.3079226315021515, 'zh': 0.0003253078320994973, 'de': 0.0004439037584234029, 'es': 0.00019472891290206462, 'ru': 0.00012166338274255395, 'ko': 0.00018816835654433817, 'fr': 0.00037366687320172787, 'j

["Hello again. Hello again. I thought it wasn't going to get in then. When do we plan to start? I think we could start any moment chairman. I think it's perhaps helpful just to go through the same things as we went through before, which is just to encourage people to switch off their microphones, because it just reduced the background bit.",
 "Jackie Weaver, I find that the person on Alib Bruton's Zoom is being very disrespectful to everybody. Oh, coming through me from Birkidet, that sounds good. Wow. Thank God for that. Can I propose John Smith, please? Yeah. I'll second it. Thank you.",
 "is to apologise to Jackie, but welcome to handforth. Indeed, it's not lively in handforth. Yes, but what I would say is that it was a very good example of bullying within Cheshireeest and the environment. The chairman simply declared himself Clark and notified",
 "Where is the chairman? In the next thing, all this, it now reverts to me. Where's the chairman gone? To select a chairman for this meeti