<a href="https://colab.research.google.com/github/gilbert-umuzi/audio_transcription/blob/main/Transcribe_audio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Audio Transcription Notebook

## Overview
This Jupyter notebook provides an end-to-end solution for transcribing audio files using Google's Automatic Speech Recognition (ASR) and OpenAI's Whisper ASR. It is designed to be flexible, supporting various audio file formats and even allowing for audio scraping from YouTube videos.

## Features
- **File Upload**: Upload audio files from your local machine for transcription.
- **YouTube Audio Extraction**: Optionally, provide a YouTube URL to extract and transcribe audio.
- **Multi-Service Transcription**: Runs both Google ASR and Whisper ASR, producing two transcripts that can be compared.
- **Output**: Saves the transcriptions into a `.txt` file for easy comparison and further analysis.

## How to Use
1. **Install Dependencies**: Make sure to install all required Python packages.
2. **Provide Audio**: Either upload an audio file or input a YouTube URL.
3. **Run the Notebook**: Execute the cells to transcribe the audio.
4. **Download Output**: The transcriptions will be saved into a `.txt` file, which you can download.

## A note on the two transcription services
This notebook uses two transcription services and produces one transcript from each, so the two can be compared for accuracy.

* Google ASR: Google's speech recognition service, called here through the `SpeechRecognition` package (`recognize_google`).
* Whisper ASR: An automatic speech recognition system by OpenAI.

Whisper comes in five model sizes: tiny, base, small, medium and large.

In a test with an 8-minute .m4a audio clip of a Nigerian English speaker:
* Whisper base.en downloaded and ran quickly on Google Colab and was more accurate than the Google ASR transcription, but still made many errors.
* Whisper medium.en was much slower to run, but its results appeared significantly more accurate than both base.en and Google ASR, producing a very usable transcript.
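
The comparison above is based on reading the transcripts. To put a number on it, one common measure is word error rate (WER): the word-level edit distance between a transcript and a trusted reference, divided by the number of reference words. The sketch below is not part of the notebook. It assumes you have a manually corrected reference transcript, which the notebook does not produce, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A lower value means the transcript is closer to the reference. This simple version only lowercases the text, so punctuation differences count as errors; stripping punctuation first may give a fairer comparison between Google ASR and Whisper output.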


In [1]:
# Install the required packages
!pip install --upgrade pytube
!pip install pydub
!pip install SpeechRecognition
!pip install git+https://github.com/openai/whisper.git -q #Whisper from OpenAI

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1
Collecting SpeechRecognition
  Downloading SpeechRecognition-3.10.0-py2.py3-none-any.whl (32.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32.8/32.8 MB 16.1 MB/s eta 0:00:00
Installing collected packages: SpeechRecognition
Successfully installed SpeechRecognition-3.10.0


In [None]:
# Download audio from a YouTube video
# Note that extracting audio from YouTube may violate its terms of service.
# Either use this to download the audio OR use the next block to upload an audio file

# Import os (needed below for the file extension) and the pytube library
import os
import pytube

# Reading the YouTube video link
video = 'https://www.youtube.com/watch?v=-LIIf7E-qFI'  # Replace with YouTube URL
data = pytube.YouTube(video)

# Get the audio-only stream
audio = data.streams.get_audio_only()

# Download the audio and get the file path
audio_path = audio.download()

# Extract the file extension from the downloaded file
file_extension = os.path.splitext(audio_path)[1][1:]


In [2]:
# Upload an audio file
# Either use this to upload the audio OR use the previous block to download audio from YouTube

from google.colab import files
import os

# Upload the audio file
print("Please upload the audio file:")
uploaded = files.upload()

# Get the original name of the uploaded file
original_file_name = list(uploaded.keys())[0]

# Extract the file extension from the original file name
file_extension = os.path.splitext(original_file_name)[1][1:]

# No need to rename the file; use the original file name for processing
audio_path = original_file_name

Please upload the audio file:


Saving Silas Idowu Live Interview.m4a to Silas Idowu Live Interview.m4a
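
The extension extracted here is later passed to pydub as the `format` argument, so a file with no extension, or one written in upper case, may not be recognized. The helper below is a small sketch of a more defensive version of that step. The name `audio_format` is hypothetical and not part of the notebook.

```python
import os


def audio_format(path: str) -> str:
    """Return the lowercase extension of `path` without the leading dot."""
    extension = os.path.splitext(path)[1][1:].lower()
    if not extension:
        raise ValueError(f"{path!r} has no file extension")
    return extension
```

In the cell above it could stand in for the `os.path.splitext(...)` line, for example `file_extension = audio_format(original_file_name)`.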


In [3]:
# Transcribe audio with SpeechRecognition

from pydub import AudioSegment
import speech_recognition as sr

# Load the audio file
audio = AudioSegment.from_file(audio_path, format = file_extension)

# Initialize the recognizer
recognizer = sr.Recognizer()

# Function to transcribe a chunk of audio
def transcribe_audio(audio_chunk):
    with sr.AudioFile(audio_chunk) as source:
        audio_data = recognizer.record(source)
        try:
            text = recognizer.recognize_google(audio_data)
            return text
        except sr.UnknownValueError:
            return "[Unintelligible]"
        except sr.RequestError as e:
            return f"[Could not request results; {e}]"

# Split the audio into smaller chunks for easier processing
chunk_length = 30 * 1000  # 30 seconds
chunks = [audio[i:i + chunk_length] for i in range(0, len(audio), chunk_length)]

# Temporary storage for audio chunks and transcriptions
chunk_paths = [f'temp_chunk_{i}.wav' for i in range(len(chunks))]

# Save each chunk as a temporary file
for i, chunk in enumerate(chunks):
    chunk.export(chunk_paths[i], format="wav")

# Transcribe each chunk and store the text
transcriptions = []
for chunk_path in chunk_paths:
    transcriptions.append(transcribe_audio(chunk_path))

# Join the transcriptions to form the complete text
google_transcription_text = ' '.join(transcriptions)

# You may want to save or print the transcription text
print(google_transcription_text[:500])  # Prints the first 500 characters of the transcription

the first thing that I wanted to ask is just to tell me a little bit about you where you from where you at currently and what you were busy doing before you before you came into contact with the African coding Network okay thank you very much [Unintelligible] this is my second cuz I go to the federal University and Strikes bowling game so it's a long journey how many years ago and how did you find that was it something you were interested in before or was it something new that you tried yeah I a
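
The chunking in the cell above relies on pydub indexing audio in milliseconds: `len(audio)` is the duration in ms and `audio[start:end]` slices by time. The sketch below reproduces only the boundary arithmetic in plain Python so it can be checked without an audio file. The function name `chunk_bounds` is hypothetical.

```python
def chunk_bounds(total_ms: int, chunk_ms: int = 30_000) -> list[tuple[int, int]]:
    """Return (start, end) millisecond pairs covering 0..total_ms in fixed-size chunks."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))  # the last chunk may be shorter
        for start in range(0, total_ms, chunk_ms)
    ]


# An 8-minute clip (480,000 ms) yields 16 chunks of 30 seconds each.
print(len(chunk_bounds(480_000)))
```

Because the boundaries are fixed, a chunk can start or end in the middle of a word, which may account for some errors near chunk edges.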


In [12]:
# Transcribe audio with Whisper

import whisper

# Load the Whisper model
model = whisper.load_model("medium") # There are 5 models from fastest to slowest: "tiny" , "base" , "small" , "medium" , "large"

# Transcribe each chunk using Whisper ASR and store the text
whisper_transcriptions = []
for chunk_path in chunk_paths:
    whisper_result = model.transcribe(chunk_path)
    whisper_transcriptions.append(whisper_result['text'])

# Join the transcriptions to form the complete text
whisper_transcription_text = ' '.join(whisper_transcriptions)

# You may want to save or print the transcription text
print(whisper_transcription_text[:500])  # Prints the first 500 characters of the transcription

100%|█████████████████████████████████████| 1.42G/1.42G [00:16<00:00, 92.1MiB/s]
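
The cell above reuses the 30-second chunks created for Google ASR. Whisper's `transcribe` method also accepts the path to a complete audio file and handles long audio itself, so transcribing the original file in one call is an alternative worth trying. The sketch below assumes a model object with that `transcribe` method. The helper name `transcribe_whole` is hypothetical and not part of the notebook.

```python
def transcribe_whole(model, path: str) -> str:
    """Transcribe one audio file in a single call and return the stripped text."""
    result = model.transcribe(path)  # Whisper returns a dict that includes a "text" key
    return result["text"].strip()


# Hypothetical usage in this notebook (requires the whisper package and a loaded model):
# import whisper
# model = whisper.load_model("base.en")
# whisper_transcription_text = transcribe_whole(model, audio_path)
```

Whether this gives better results than per-chunk transcription on a given recording is something to check by comparing the two outputs.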


In [13]:
# Prepare the combined transcript
combined_transcription = f"Google ASR Transcript:\n{google_transcription_text}\n\nWhisper ASR Transcript:\n{whisper_transcription_text}"

# Define the path to save the transcript file
transcript_file_path = 'Transcribed_Audio.txt'

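# Write the combined transcript to a .txt file (UTF-8 so non-ASCII characters are preserved)
with open(transcript_file_path, 'w', encoding='utf-8') as f:
    f.write(combined_transcription)

# Download the file (imported again in case the upload cell was skipped in favor of the YouTube cell)
from google.colab import files
files.download(transcript_file_path)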

