<a href="https://colab.research.google.com/github/edgarbc/audio_transcriber/blob/main/my_audio_transcriber_whisper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# My audio transcriber

Automated audio transcriber using Whisper from OpenAI.

Whisper is an auto-regressive encoder-decoder model trained on speech transcription and translation tasks. Given audio data, the model generates the corresponding text.

by Edgar Bermudez - edgar.bermudez@gmail.com

November, 2022.

In this example, the audio files to be transcribed have already been uploaded to Google Drive. The example could be extended with an interface for uploading files and transcribing them directly (e.g. using Gradio). A short example of how to use Gradio for this is included at the end of the notebook.

In [1]:
# to handle audio files
!pip install pydub
from pydub import AudioSegment

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [2]:
# install whisper
!pip install git+https://github.com/openai/whisper.git


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-f2f54cjn
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-f2f54cjn
  Resolved https://github.com/openai/whisper.git to commit 7858aa9c08d98f75575035ecd6481f462d66ca27
  Preparing metadata (setup.py) ... done
Collecting transformers>=4.19.0
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting ffmpeg-python==0.2.0
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)

In [3]:
# in order to access audio files (previously saved into google drive), we mount it
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# to handle files easily
from glob import glob

In [5]:
# load whisper 
import whisper
# load the "base" model; other sizes (e.g. "tiny", "small", "medium", "large") are also available
model = whisper.load_model("base")

100%|████████████████████████████████████████| 139M/139M [00:00<00:00, 146MiB/s]
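
Since Whisper was trained on both transcription and translation, the loaded model can be asked for either. The following is a minimal sketch, assuming a hypothetical wrapper named `transcribe_file` (it is not part of Whisper); `task` and `language` are decoding options passed through `model.transcribe`, and the translation target is English.

```python
def transcribe_file(model, audio_path, translate=False, language=None):
    """Hypothetical convenience wrapper around model.transcribe."""
    # "transcribe" keeps the spoken language, "translate" outputs English
    options = {"task": "translate" if translate else "transcribe"}
    if language is not None:
        # e.g. "es"; when omitted, Whisper detects the language itself
        options["language"] = language
    result = model.transcribe(audio_path, **options)
    # the result is a dict; the full transcript is under the "text" key
    return result["text"].strip()
```

With the model loaded above, a call could look like `transcribe_file(model, 'drive/MyDrive/data/example_audio.WAV', translate=True)`.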


## Parameters and definitions

In [6]:
data_dir = 'drive/MyDrive/data/'
sound_file = 'example_audio.WAV'
print(data_dir)

drive/MyDrive/BaileyAndSoda/data/


In [7]:
# make sure we are in the right place
!pwd

/content


## Load audio file

Assumes that the audio files have been saved to Google Drive.

In [9]:

# load the audio file from its path (works for both mp3 and wav)

# for mp3 files:
#sound = AudioSegment.from_mp3(data_dir + sound_file)
sound = AudioSegment.from_file(file=data_dir + sound_file, format="wav")
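
Rather than switching between the commented and uncommented lines above, the `format` argument could be derived from the file name. This is a minimal sketch; `infer_audio_format` and `load_audio` are hypothetical helpers, not part of pydub, and the extension is assumed to match the real encoding of the file.

```python
import os

def infer_audio_format(path, default="wav"):
    """Hypothetical helper: derive pydub's `format` argument from the extension."""
    # 'example_audio.WAV' -> 'wav'; fall back to `default` if there is no extension
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return ext or default

def load_audio(path):
    # imported here so that infer_audio_format can be used without pydub
    from pydub import AudioSegment
    return AudioSegment.from_file(path, format=infer_audio_format(path))
```

For example, `load_audio(data_dir + sound_file)` would then handle `.WAV` and `.mp3` inputs with the same call.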


## Audio File slicing

Slice the audio file into segments of approximately 10 minutes, transcribe each segment, and save the transcripts to text files.

In [10]:
import math

# total duration of the file (mins)
total_mins = sound.duration_seconds / 60
print('total duration (mins): ' + str(total_mins))

slice_size = 10 # slice size (mins)

# round up so that the last, shorter slice is included
num_slices = math.ceil(total_mins / slice_size)

total duration (mins): 73.2


In [1]:

interval = 10  # slice length (mins)
offset = 20    # overlap with the previous slice (secs)

for i in range(num_slices):
  # pydub slices in milliseconds; every slice except the first starts
  # `offset` seconds early so that no words are lost at the boundary
  start_time = 1000 * max(0, i * interval * 60 - offset)
  end_time = 1000 * ((i + 1) * interval * 60)
  print(start_time)
  print(end_time)
  # take the corresponding slice
  sound_slice = sound[start_time:end_time]

  # create a file name
  fname = 'slice_' + str(i) + '.mp3'
  print(data_dir + fname)
  # save it to file
  sound_slice.export(data_dir + fname, format='mp3')
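
The boundary arithmetic in the loop above can be checked without any audio by pulling it into a pure function. This is a minimal sketch; `slice_bounds_ms` is a hypothetical helper, and unlike the loop it clamps the last slice to the file duration explicitly instead of relying on pydub to truncate an out-of-range slice.

```python
import math

def slice_bounds_ms(duration_s, slice_min=10, overlap_s=20):
    """Return (start_ms, end_ms) pairs covering `duration_s` seconds of audio."""
    slice_s = slice_min * 60
    num_slices = math.ceil(duration_s / slice_s)
    bounds = []
    for i in range(num_slices):
        # every slice except the first starts `overlap_s` seconds early
        start_s = max(0, i * slice_s - overlap_s)
        # the last slice ends at the end of the file
        end_s = min(duration_s, (i + 1) * slice_s)
        bounds.append((int(start_s * 1000), int(end_s * 1000)))
    return bounds

# a 25 minute file gives two full slices and one shorter final slice
for start_ms, end_ms in slice_bounds_ms(25 * 60):
    print(start_ms, end_ms)
```

Each pair can be used directly as `sound[start_ms:end_ms]`.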


In [None]:
# Slicing is only necessary for very large files (more than 2 hours).
# To transcribe the slices instead, switch to the '*.mp3' pattern below.

#slice_files = glob(data_dir + '*.mp3')
slice_files = glob(data_dir + '*.WAV')
print(slice_files)

# transcribe each of the audio files
for slice_file in slice_files:
  result = model.transcribe(slice_file)

  # save the transcript next to the audio file, with a .txt extension
  text_fname = slice_file[:-4] + '.txt'
  with open(text_fname, "w") as text_file:
    text_file.write(result['text'])
  print(text_fname + ' transcribed!')
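
When the slices are transcribed, each one ends up in its own text file, so a final step could join them in order. This is a minimal sketch under the assumption that the transcripts are named `slice_<i>.txt`, following the slice naming used earlier; `slice_index`, `merge_transcripts` and the output name `transcript_full.txt` are hypothetical. Text repeated in the overlap between consecutive slices is not removed.

```python
import os
import re
from glob import glob

def slice_index(path):
    """Extract the number from a name like 'slice_12.txt'."""
    match = re.search(r"slice_(\d+)\.txt$", path)
    return int(match.group(1)) if match else -1

def merge_transcripts(data_dir, out_name="transcript_full.txt"):
    # glob returns files in arbitrary order, and a plain string sort would
    # put slice_10 before slice_2, so sort on the numeric index instead
    txt_files = sorted(glob(os.path.join(data_dir, "slice_*.txt")), key=slice_index)
    parts = []
    for fname in txt_files:
        with open(fname) as f:
            parts.append(f.read().strip())
    # one slice transcript per line in the merged file
    with open(os.path.join(data_dir, out_name), "w") as f:
        f.write("\n".join(parts))
    return txt_files
```

Calling `merge_transcripts(data_dir)` after the transcription loop would write the combined transcript next to the slices.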


In [None]:
# example using gradio
# TODO: expand and improve
from transformers import pipeline
import gradio as gr

# speech recognition pipeline backed by the Whisper "small" checkpoint
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def inference(speech_file):
  # speech_file is the path to the uploaded or recorded audio
  return pipe(speech_file)["text"]

gr.Interface(inference, gr.Audio(type="filepath"), "text").launch()

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`

Using Embedded Colab Mode (NEW). If you have issues, please use share=True and file an issue at https://github.com/gradio-app/gradio/
Note: opening the browser inspector may crash Embedded Colab Mode.

To create a public link, set `share=True` in `launch()`.


(<gradio.routes.App at 0x7f73edf43bd0>, 'http://127.0.0.1:7860/', None)