# Audio to Text Converter using Whisper

This notebook demonstrates how to convert audio files (MP3 format) to text using the Whisper model from Hugging Face.

## Install Required Libraries

First, let's install the necessary libraries:

In [2]:
%pip install transformers datasets soundfile librosa torch

Note: you may need to restart the kernel to use updated packages.
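After the install finishes (and the kernel is restarted if needed), a quick check can confirm that the packages are importable. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, not part of any of the installed libraries.

```python
import importlib.util

def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names of the packages installed in the cell above
required = ["transformers", "datasets", "soundfile", "librosa", "torch"]
print("Missing packages:", missing_packages(required) or "none")
```

An empty result suggests the environment is ready; any name that is listed needs to be installed again in the interpreter the notebook kernel is actually using.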


## Import Libraries

In [3]:
import torch
import librosa
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import os
from IPython.display import Audio, display



## Load the Whisper Model

We'll use the Whisper model from Hugging Face. You can choose different model sizes based on your needs:
- `openai/whisper-tiny`: Smallest model (about 39M parameters), fastest but least accurate
- `openai/whisper-base`: Small model (about 74M parameters) with decent accuracy
- `openai/whisper-small`: Mid-sized model (about 244M parameters) with good accuracy
- `openai/whisper-medium`: Larger model (about 769M parameters) with better accuracy
- `openai/whisper-large-v2`: Largest model listed here (about 1550M parameters), most accurate but needs the most memory and compute

For this example, we'll use the small model (`openai/whisper-small`), but you can change it based on your requirements.
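Switching sizes only changes the model ID string passed to `from_pretrained`. As a small sketch, the lookup below maps a size label to the IDs listed above and rejects typos early; `WHISPER_CHECKPOINTS` and `checkpoint_for` are hypothetical names introduced for illustration.

```python
# Size labels mapped to the Hugging Face model IDs listed above
WHISPER_CHECKPOINTS = {
    "tiny": "openai/whisper-tiny",
    "base": "openai/whisper-base",
    "small": "openai/whisper-small",
    "medium": "openai/whisper-medium",
    "large-v2": "openai/whisper-large-v2",
}

def checkpoint_for(size):
    """Return the model ID for a size label, or raise ValueError for an unknown label."""
    if size not in WHISPER_CHECKPOINTS:
        raise ValueError(
            f"Unknown size {size!r}; choose from {sorted(WHISPER_CHECKPOINTS)}"
        )
    return WHISPER_CHECKPOINTS[size]

example_model_name = checkpoint_for("small")
print(example_model_name)
```

The returned string can be used wherever the notebook sets `model_name`.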

In [8]:
# Load model and processor
model_name = "openai/whisper-small"

processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Check if GPU is available and move model to GPU if possible
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"Model loaded on {device}")

Model loaded on cpu


## Create Functions to Transcribe Audio Files

We'll create three separate functions for different use cases:
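1. `transcribe_audio`: For transcribing a single short audio file (with the processor settings used here, input is padded or cut to Whisper's 30-second window)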
2. `transcribe_long_audio`: For transcribing longer audio files by processing them in chunks
3. `batch_transcribe`: For transcribing multiple audio files in a directory
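The chunking used for longer files comes down to index arithmetic on the sample array. The sketch below reproduces that arithmetic on its own, without loading audio or a model; `chunk_bounds` is a hypothetical helper, and the 16 kHz default mirrors the sampling rate the functions in this notebook request from `librosa.load`.

```python
import math

def chunk_bounds(num_samples, chunk_length_sec=30, sampling_rate=16000):
    """Return (start, end) sample indices for consecutive, non-overlapping chunks."""
    chunk_samples = chunk_length_sec * sampling_rate
    num_chunks = math.ceil(num_samples / chunk_samples)
    return [
        (i * chunk_samples, min(num_samples, (i + 1) * chunk_samples))
        for i in range(num_chunks)
    ]

# A 65-second clip at 16 kHz splits into two full chunks and one 5-second remainder
for start, end in chunk_bounds(65 * 16000):
    print(f"samples {start}..{end} ({(end - start) / 16000:.0f} s)")
```

Because the chunks are cut at fixed positions, a word that straddles a boundary can be split between two chunks, which is one reason chunked transcripts may read less smoothly at the joins.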

In [9]:
def transcribe_audio(audio_path, language="english"):
    """
    Transcribe an audio file using the Whisper model.
    
    Args:
        audio_path (str): Path to the audio file (MP3 format)
        language (str): Language of the audio (default: "english")
        
    Returns:
        str: Transcribed text
    """
    try:
        # Load and preprocess the audio file
        print(f"Loading audio file: {audio_path}")
        speech_array, sampling_rate = librosa.load(audio_path, sr=16000)
        
        # Display an inline audio player for the loaded clip
        display(Audio(speech_array, rate=sampling_rate))
        
        # Process the audio with the Whisper processor (with attention mask to avoid warnings)
        inputs = processor(
            speech_array, 
            sampling_rate=sampling_rate, 
            return_tensors="pt",
            padding="max_length",
            return_attention_mask=True
        )
        inputs = inputs.to(device)
        
        # Build the decoder prompt that fixes the language and the transcription task
        forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task="transcribe")
        
        # Generate transcription with attention mask to avoid warnings
        predicted_ids = model.generate(
            inputs.input_features,
            attention_mask=inputs.attention_mask,
            forced_decoder_ids=forced_decoder_ids
        )
        
        # Decode the predicted ids to text
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        
        return transcription
    
    except Exception as e:
        return f"Error transcribing audio: {str(e)}"

In [10]:
def transcribe_long_audio(audio_path, chunk_length_sec=30, language="english"):
    """
    Transcribe a long audio file by processing it in chunks.
    
    Args:
        audio_path (str): Path to the audio file
        chunk_length_sec (int): Length of each chunk in seconds (Whisper's input
            window is 30 seconds, so audio beyond that in a chunk is cut off)
        language (str): Language of the audio
        
    Returns:
        str: Transcribed text
    """
    try:
        # Load audio file
        print(f"Loading audio file: {audio_path}")
        speech_array, sampling_rate = librosa.load(audio_path, sr=16000)
        
        # Calculate chunk size in samples
        chunk_length_samples = chunk_length_sec * sampling_rate
        
        # Calculate number of chunks
        num_chunks = int(np.ceil(len(speech_array) / chunk_length_samples))
        
        print(f"Processing audio in {num_chunks} chunks...")
        
        # Process each chunk
        transcription_parts = []
        for i in range(num_chunks):
            print(f"Processing chunk {i+1}/{num_chunks}")
            
            # Extract chunk
            start = int(i * chunk_length_samples)
            end = int(min(len(speech_array), (i + 1) * chunk_length_samples))
            chunk = speech_array[start:end]
            
            # Process the chunk with attention mask to avoid warnings
            inputs = processor(
                chunk, 
                sampling_rate=sampling_rate, 
                return_tensors="pt",
                padding="max_length",
                return_attention_mask=True
            )
            inputs = inputs.to(device)
            
            # Build the decoder prompt that fixes the language and the transcription task
            forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task="transcribe")
            
            # Generate transcription with attention mask
            predicted_ids = model.generate(
                inputs.input_features,
                attention_mask=inputs.attention_mask,
                forced_decoder_ids=forced_decoder_ids
            )
            
            # Decode the predicted ids to text
            chunk_transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
            transcription_parts.append(chunk_transcription)
        
        # Combine all chunks
        full_transcription = " ".join(transcription_parts)
        return full_transcription
    
    except Exception as e:
        return f"Error transcribing audio: {str(e)}"

In [None]:
def batch_transcribe(directory_path, language="english", use_chunking=False, chunk_length_sec=30):
    """
    Transcribe all MP3 files in a directory.
    
    Args:
        directory_path (str): Path to the directory containing MP3 files
        language (str): Language of the audio files (default: "english")
        use_chunking (bool): Whether to use chunking for long audio files (default: False)
        chunk_length_sec (int): Length of each chunk in seconds if use_chunking is True (default: 30)
    """
    if not os.path.isdir(directory_path):
        print(f"Directory not found: {directory_path}")
        return
    
    # Get all MP3 files in the directory
    mp3_files = [f for f in os.listdir(directory_path) if f.lower().endswith('.mp3')]
    
    if not mp3_files:
        print(f"No MP3 files found in {directory_path}")
        return
    
    print(f"Found {len(mp3_files)} MP3 files. Starting transcription...")
    
    for i, mp3_file in enumerate(mp3_files, 1):
        file_path = os.path.join(directory_path, mp3_file)
        print(f"\nProcessing file {i}/{len(mp3_files)}: {mp3_file}")
        
        # Transcribe the audio using the appropriate function
        if use_chunking:
            transcription = transcribe_long_audio(file_path, chunk_length_sec=chunk_length_sec, language=language)
        else:
            transcription = transcribe_audio(file_path, language=language)
        
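        # The transcribe functions return an error message instead of raising,
        # so report the failure and skip writing a transcript for this file
        if transcription.startswith("Error transcribing audio:"):
            print(transcription)
            continue

        # Save the transcription to a text file
        output_file = os.path.join(directory_path, os.path.splitext(mp3_file)[0] + "_transcription.txt")
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(transcription)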
        
        print(f"Transcription saved to: {output_file}")
    
    print("\nBatch transcription completed!")

## Example 1: Transcribe a Single Audio File

Use this example to transcribe a single short audio file in one pass with `transcribe_audio`.
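Whether a single pass is enough depends on the clip length: as far as the processor call in `transcribe_audio` goes (`padding="max_length"`), the input is padded or cut to Whisper's 30-second window, so anything past that point would not be transcribed. A minimal sketch of the check, assuming the audio has already been loaded at 16 kHz; `needs_chunking` is a hypothetical helper.

```python
def needs_chunking(num_samples, sampling_rate=16000, window_sec=30):
    """Return True when the clip is longer than one Whisper input window."""
    return num_samples > window_sec * sampling_rate

# A 12-second clip fits in one pass; a 4-minute clip does not
print(needs_chunking(12 * 16000))
print(needs_chunking(240 * 16000))
```

With an array returned by `librosa.load`, `len(speech_array)` would be the value to pass as `num_samples`; clips that exceed the window are better handled by `transcribe_long_audio`.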

In [None]:
# Path to your MP3 file
audio_file_path = 'path_to_your_audio.mp3'  # Replace with your actual file path

# Check if the file exists
if os.path.exists(audio_file_path):
    # Transcribe the audio
    transcription = transcribe_audio(audio_file_path, language="english")
    
    # Print the transcription
    print("\nTranscription:")
    print(transcription)
    
    # Save the transcription to a text file
    output_file = os.path.splitext(audio_file_path)[0] + "_transcription.txt"
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(transcription)
    
    print(f"\nTranscription saved to: {output_file}")
else:
    print(f"File not found: {audio_file_path}")

## Example 2: Transcribe a Long Audio File

Use this example to transcribe a longer audio file by processing it in chunks.

In [11]:
# Path to your long MP3 file
long_audio_file_path = 'Shape-of-You.mp3'  # Replace with your actual file path

# Check if the file exists
if os.path.exists(long_audio_file_path):
    # Transcribe the long audio file in chunks
    transcription = transcribe_long_audio(
        long_audio_file_path, 
        chunk_length_sec=30,  # Adjust chunk length as needed
        language="english"    # Change language as needed
    )
    
    # Print the transcription
    print("\nTranscription:")
    print(transcription)
    
    # Save the transcription to a text file
    output_file = os.path.splitext(long_audio_file_path)[0] + "_transcription.txt"
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(transcription)
    
    print(f"\nTranscription saved to: {output_file}")
else:
    print(f"File not found: {long_audio_file_path}")

Loading audio file: Shape-of-You.mp3
Processing audio in 9 chunks...
Processing chunk 1/9
Processing chunk 2/9
Processing chunk 3/9
Processing chunk 4/9
Processing chunk 5/9
Processing chunk 6/9
Processing chunk 7/9
Processing chunk 8/9
Processing chunk 9/9

Transcription:
 The club isn't the best place to find the lovers of the bar is where I go Me and my friends at the table doing shots, drinking fast and then we talk slow Come over and start up a conversation with just me and trust me I'll give it a chance  Now take my hand, stop, prepare the man on the jukebox and then we start to dance and I'm singing like girl you know I want your love, your love was handmade for somebody like me. I'm coming now, follow my lead. I may be crazy, don't mind me. Say boy, let's not talk too much, grab all my waist and put that body on me. I'm coming now, follow my lead. I'm coming now, follow my lead. I'm in love with the shape of you, we push and pull.  I'm in love with your body  We're going out on

## Example 3: Batch Process Multiple Audio Files

Use this example to transcribe all MP3 files in a directory.
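Before running a long batch, it can help to preview which files will be picked up and where each transcript will be written. The sketch below mirrors the file matching and output naming of `batch_transcribe` using `pathlib`; `plan_transcripts` is a hypothetical helper, and unlike the original it sorts the files so the order is predictable.

```python
import tempfile
from pathlib import Path

def plan_transcripts(directory):
    """Pair each MP3 file in `directory` with the transcript path that would be written."""
    directory = Path(directory)
    mp3_files = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".mp3"
    )
    return [(p, p.with_name(p.stem + "_transcription.txt")) for p in mp3_files]

# Demonstration on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.mp3", "a.MP3", "notes.txt"):
        (Path(tmp) / name).touch()
    for audio, transcript in plan_transcripts(tmp):
        print(audio.name, "->", transcript.name)
```

The extension match is case-insensitive, as in `batch_transcribe`, and non-audio files such as `notes.txt` are ignored.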

In [None]:
# Path to directory containing MP3 files
directory_path = 'path_to_directory_with_mp3_files'  # Replace with your actual directory path

# Batch transcribe all MP3 files in the directory
batch_transcribe(
    directory_path,
    language="english",     # Change language as needed
    use_chunking=True,      # Set to True for longer audio files, False for shorter ones
    chunk_length_sec=30     # Adjust chunk length as needed if use_chunking is True
)

## Example 4: Transcribe Audio in a Different Language

Use this example to transcribe audio in a language other than English.
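The functions in this notebook are called with lowercase language names such as `"english"` or `"spanish"`. If language values come from user input or file metadata, a small normalization step keeps them consistent. This is only a sketch: `normalize_language` is a hypothetical helper, and the code-to-name table is a tiny illustrative subset, not the full list of languages Whisper supports.

```python
# Illustrative subset of ISO 639-1 codes mapped to language names
LANGUAGE_NAMES = {
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "de": "german",
}

def normalize_language(value):
    """Lowercase and trim a language value, expanding known two-letter codes."""
    cleaned = value.strip().lower()
    return LANGUAGE_NAMES.get(cleaned, cleaned)

print(normalize_language(" ES "))
print(normalize_language("French"))
```

Values that are not in the table pass through unchanged apart from trimming and lowercasing, so a name the model does not know would still surface as an error at transcription time.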

In [None]:
# Path to your non-English audio file
non_english_audio_path = 'path_to_your_non_english_audio.mp3'  # Replace with your actual file path

# Check if the file exists
if os.path.exists(non_english_audio_path):
    # Choose the appropriate function based on audio length
    # For shorter audio files:
    transcription = transcribe_audio(non_english_audio_path, language="spanish")  # Change language as needed
    
    # For longer audio files:
    # transcription = transcribe_long_audio(non_english_audio_path, chunk_length_sec=30, language="spanish")
    
    # Print the transcription
    print("\nTranscription:")
    print(transcription)
    
    # Save the transcription to a text file
    output_file = os.path.splitext(non_english_audio_path)[0] + "_transcription.txt"
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(transcription)
    
    print(f"\nTranscription saved to: {output_file}")
else:
    print(f"File not found: {non_english_audio_path}")

## Conclusion

This notebook demonstrates how to use the Whisper model from Hugging Face to transcribe audio files in MP3 format to text. You can use this as a starting point for your audio transcription projects.

Key features implemented:
1. Single file transcription
2. Batch processing of multiple files
3. Support for different languages
4. Handling of longer audio files through chunking
5. Proper attention mask handling to avoid warnings

Additional improvements you could make:
1. Add support for more audio formats (WAV, FLAC, etc.)
2. Implement a progress bar for batch processing
3. Add a simple UI using ipywidgets
4. Implement post-processing to improve punctuation and formatting
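As a starting point for the post-processing idea, the sketch below only normalizes whitespace, which is enough to remove the doubled spaces visible in the Example 2 output where chunk transcripts were joined. `tidy_transcript` is a hypothetical helper; restoring punctuation or capitalization would need more than this.

```python
import re

def tidy_transcript(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

raw = " I'll give it a chance  Now take my hand "
print(tidy_transcript(raw))
```

Applying it to the joined text just before saving would leave the wording untouched and change spacing only.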