# 🎙️ Audio Transcription Assistant

## Why I Built This

In today's content-driven world, audio and video are everywhere—podcasts, meetings, lectures, interviews. But what if you need to quickly extract text from an audio file in a different language? Or create searchable transcripts from recordings?

Manual transcription is time-consuming and expensive. I wanted to build something that could:
- Accept audio files in common formats (MP3, WAV, M4A, and anything else ffmpeg can decode)
- Transcribe them accurately using AI
- Support multiple languages
- Work locally on my Mac **and** on cloud GPUs (Google Colab)

That's where **Whisper** comes in—OpenAI's powerful speech recognition model.

---

## What This Does

This app lets you:
- 📤 Upload any audio file
- 🌍 Choose from 11 languages (or auto-detect)
- 🤖 Get accurate AI-powered transcription
- ⚡ Process on CPU (Mac) or GPU (Colab)

**Tech:** OpenAI Whisper • Gradio UI • PyTorch • Cross-platform (Mac/Colab)

---

**Note:** This is a demonstration. For production use, consider privacy and data handling policies.


## Step 1: Install Dependencies

Installing everything needed:
- **NumPy 1.26.4** - Compatible version for Whisper
- **PyTorch** - Deep learning framework
- **Whisper** - OpenAI's speech recognition model
- **Gradio** - Web interface
- **ffmpeg** - Audio file processing
- **Ollama** - For local LLM support (optional)


In [1]:
# Package installation

!uv pip install -q --reinstall "numpy==1.26.4"
!uv pip install -q torch torchvision torchaudio
!uv pip install -q gradio openai-whisper ffmpeg-python
!uv pip install -q ollama

# Ensure ffmpeg is available (Mac)
!which ffmpeg || brew install ffmpeg

/usr/local/bin/ffmpeg
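
The `which ffmpeg || brew install ffmpeg` line above only helps on a Mac with Homebrew. As a rough, platform-neutral sketch, the same check can be done from Python with the standard library. The helper name `ffmpeg_path` is my own for illustration and is not part of Whisper or Gradio.

```python
import shutil

def ffmpeg_path():
    """Return the full path to the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")

path = ffmpeg_path()
if path:
    print(f"ffmpeg found at {path}")
else:
    print("ffmpeg not found; install it before transcribing")
```

Whisper shells out to ffmpeg to decode audio, so catching a missing binary here gives a clearer message than a failure in the middle of a transcription.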


## Step 2: Import Libraries

The essentials: NumPy for arrays, Gradio for the UI, Whisper for transcription, PyTorch for the model backend, and Ollama for optional LLM features.


In [2]:
# Imports

import os
import numpy as np
import gradio as gr
import whisper
import torch
import ollama
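
Since Ollama is listed as optional, one way to keep the notebook running when it is not installed is to import it conditionally. This is a minimal sketch using only the standard library; `optional_import` is a hypothetical helper name, not something the packages above provide.

```python
import importlib
import importlib.util

def optional_import(module_name):
    """Import and return a top-level module if it is installed, otherwise return None."""
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)

# Falls back to None instead of raising ImportError when the package is missing.
ollama = optional_import("ollama")
print("Ollama features enabled" if ollama else "Ollama not installed; continuing without it")
```

Any later code that wants the LLM features would then check `if ollama is not None:` before using it.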

## Step 3: Load Whisper Model

Loading the **base** model—a balanced choice between speed and accuracy. It works on both CPU (Mac) and GPU (Colab). The model is ~140MB and will download automatically on first run.


In [3]:
# Model initialization

print("Loading Whisper model...")
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

whisper_model = whisper.load_model("base", device=device)
print("✅ Model loaded successfully!")
print(f"Model type: {type(whisper_model)}")
print(f"Has transcribe method: {hasattr(whisper_model, 'transcribe')}")


Loading Whisper model...
Using device: cpu
✅ Model loaded successfully!
Model type: <class 'whisper.model.Whisper'>
Has transcribe method: True
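
On CPU, Whisper cannot use half precision and falls back to FP32 with a warning. A small sketch of how the device chosen above could drive the `fp16` decoding option follows. The helper `transcribe_options` is hypothetical, and the choice to request FP16 only on CUDA is my assumption about a sensible default.

```python
def transcribe_options(device):
    """Build keyword arguments for whisper_model.transcribe() based on the device.

    Hypothetical helper: half precision (fp16) is requested only on CUDA.
    """
    return {
        "task": "transcribe",
        "fp16": device == "cuda",
    }

print(transcribe_options("cpu"))
print(transcribe_options("cuda"))
```

With the model loaded, these could be passed along as `whisper_model.transcribe(audio_path, **transcribe_options(device))`.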


## Step 4: Transcription Function

This is the core logic:
- Accepts an audio file and the language spoken in it
- Maps language names to Whisper's language codes
- Transcribes the audio using the loaded model
- Returns the transcribed text


In [4]:
# Transcription function

def transcribe_audio(audio_file, target_language):
    """Transcribe an audio file to text; target_language is the language spoken in the audio."""
    if audio_file is None:
        return "Please upload an audio file."
    
    try:
        # Language codes for Whisper
        language_map = {
            "English": "en",
            "Spanish": "es",
            "French": "fr",
            "German": "de",
            "Italian": "it",
            "Portuguese": "pt",
            "Chinese": "zh",
            "Japanese": "ja",
            "Korean": "ko",
            "Russian": "ru",
            "Arabic": "ar",
            "Auto-detect": None
        }
        
        lang_code = language_map.get(target_language)
        
        # Gradio may pass a path string or a file-like object with a .name attribute
        audio_path = audio_file.name if hasattr(audio_file, 'name') else audio_file
        
        if not audio_path or not os.path.exists(audio_path):
            return "Invalid audio file or file not found"

        # Transcribe using whisper_model.transcribe()
        result = whisper_model.transcribe(
            audio_path,
            language=lang_code,
            task="transcribe",
            verbose=False  # Show only a progress bar, not each decoded segment (None hides both)
        )
        
        return result["text"]
    
    except Exception as e:
        return f"Error: {str(e)}"
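
The function above returns only `result["text"]`, but the dictionary from `transcribe()` also carries a `segments` list with `start`, `end`, and `text` for each chunk. Here is a sketch of turning that into timestamped lines. The helper names and the sample data are made up for illustration; the sample only mimics the shape of a real result.

```python
def format_timestamp(seconds):
    """Format a time in seconds as MM:SS (minutes are not capped at 59)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def format_segments(result):
    """Turn a Whisper-style result dict into one timestamped line per segment."""
    lines = []
    for segment in result.get("segments", []):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} - {end}] {segment['text'].strip()}")
    return "\n".join(lines)

# Made-up sample data in the shape transcribe() returns; not real output.
sample_result = {
    "text": " Hello there. How are you?",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 2.4, "text": " Hello there."},
        {"start": 2.4, "end": 65.0, "text": " How are you?"},
    ],
}
print(format_segments(sample_result))
```

Returning `format_segments(result)` instead of `result["text"]` would make long transcripts easier to navigate.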


## Step 5: Build the Interface

Creating a simple, clean Gradio interface with:
- **File uploader** for audio files
- **Language dropdown** with 12 options (11 languages plus Auto-detect)
- **Transcription output** box
- Auto-launches in browser for convenience


In [5]:
# Gradio interface

app = gr.Interface(
    fn=transcribe_audio,
    inputs=[
        gr.File(label="Upload Audio File", file_types=["audio"]),
        gr.Dropdown(
            choices=[
                "English", "Spanish", "French", "German", "Italian",
                "Portuguese", "Chinese", "Japanese", "Korean",
                "Russian", "Arabic", "Auto-detect"
            ],
            value="English",
            label="Language"
        )
    ],
    outputs=gr.Textbox(label="Transcription", lines=15),
    title="🎙️ Audio Transcription",
    description="Upload an audio file to transcribe it.",
    flagging_mode="never"
)

print("✅ App ready! Run the next cell to launch.")


✅ App ready! Run the next cell to launch.
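
The language names appear twice in this notebook: once in the `language_map` inside `transcribe_audio` and once in the dropdown `choices`. A possible refactor, sketched here under my own naming (`LANGUAGES`, `dropdown_choices`, `language_code` are all hypothetical), keeps them in one module-level dictionary so the two lists cannot drift apart.

```python
# Display name -> Whisper language code; None lets Whisper detect the language.
LANGUAGES = {
    "English": "en",
    "Spanish": "es",
    "French": "fr",
    "German": "de",
    "Italian": "it",
    "Portuguese": "pt",
    "Chinese": "zh",
    "Japanese": "ja",
    "Korean": "ko",
    "Russian": "ru",
    "Arabic": "ar",
    "Auto-detect": None,
}

def dropdown_choices():
    """Return the display names in insertion order, ready for gr.Dropdown(choices=...)."""
    return list(LANGUAGES)

def language_code(display_name):
    """Look up the code; raises KeyError for unknown names instead of silently auto-detecting."""
    return LANGUAGES[display_name]

print(dropdown_choices())
print(language_code("German"))
```

The strict lookup is a design choice: with `dict.get`, a misspelled name quietly becomes `None` and triggers auto-detect.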


## Step 6: Launch the App

Starting the Gradio server. Passing `prevent_thread_lock=True` makes `launch()` return right away instead of blocking, which suits a notebook, and `inbrowser=True` opens the app in your browser automatically.


In [None]:
# Launch

# Close any previous instances
try:
    app.close()
except Exception:
    # Nothing was running yet, so there is nothing to close
    pass

# Start the app
app.launch(inbrowser=True, prevent_thread_lock=True)


* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.




100%|██████████| 10416/10416 [00:06<00:00, 1723.31frames/s]
100%|██████████| 10416/10416 [00:30<00:00, 341.64frames/s]
100%|██████████| 2289/2289 [00:01<00:00, 1205.18frames/s]
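
If the default port shown above is already taken, or you want to choose the port yourself, one option is to ask the OS for a free one and hand it to `launch()` through its `server_port` parameter. This is a sketch with a hypothetical helper, `find_free_port`. Note the small race: another process could grab the port between the check and the launch.

```python
import socket

def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port on the given host and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 means "pick any free port"
        return sock.getsockname()[1]

port = find_free_port()
print(f"Would launch on http://127.0.0.1:{port}")
# With the app from Step 5 defined, the port could be passed along like this:
# app.launch(server_port=port, inbrowser=True, prevent_thread_lock=True)
```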


---

## 💡 How to Use

1. **Upload** an audio file (MP3, WAV, M4A, etc.)
2. **Select** your language (or use Auto-detect)
3. **Click** Submit
4. **Get** your transcription!

---

## 🚀 Running on Google Colab

For GPU acceleration on Colab:
1. Runtime → Change runtime type → **GPU (T4)**
2. Run all cells in order
3. The model will use GPU automatically

**Note:** First run downloads the Whisper model (~140MB) - this is a one-time download.
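
On Colab the server runs on a remote machine, so `inbrowser=True` has nothing local to open. A rough sketch of switching launch settings by environment follows. Checking `sys.modules` for `google.colab` is a common heuristic rather than an official API, and `launch_options` is a hypothetical helper.

```python
import sys

def running_in_colab():
    """Heuristic: the google.colab package is already loaded in Colab kernels."""
    return "google.colab" in sys.modules

def launch_options(in_colab):
    """Hypothetical helper returning keyword arguments for app.launch()."""
    if in_colab:
        # share=True asks Gradio for a public link, as its startup message suggests
        return {"share": True, "prevent_thread_lock": True}
    return {"inbrowser": True, "prevent_thread_lock": True}

print(launch_options(running_in_colab()))
```

The launch cell could then call `app.launch(**launch_options(running_in_colab()))`. Keep in mind that a public share link exposes the app to anyone who has the URL.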

---

## 📝 Supported Languages

English • Spanish • French • German • Italian • Portuguese • Chinese • Japanese • Korean • Russian • Arabic • Auto-detect
