## Step 1: Setup (Run this first!) ⚙️

Click the ▶️ button to install the required software. This may take a minute.

In [None]:
# Install required packages
!pip install -q google-genai pydub ipywidgets

# Import necessary libraries
import os
import mimetypes
from pathlib import Path
from google.colab import files
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
from google import genai
from google.genai import types
from pydub import AudioSegment

print("✅ Setup complete! You can proceed to the next step.")

## Step 2: Enter Your API Key 🔑

Enter your Google Gemini API key below. 

**Don't have one?** Get it free at: https://aistudio.google.com/app/api-keys

The input field masks your key as you type, like a password field.
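
As a rough illustration of the check performed in the cell below, here is a minimal sketch of a plausibility test for a pasted key. The helper name `looks_like_api_key` is hypothetical and not part of this notebook; it only combines the two hints the notebook already gives (the key starts with `AIza` and is longer than 20 characters). It cannot tell whether the key is actually valid.

```python
def looks_like_api_key(key: str) -> bool:
    """Rough plausibility check (hypothetical helper, not a real validation)."""
    key = key.strip()  # ignore whitespace picked up while copying
    return key.startswith("AIza") and len(key) > 20


# A short or wrongly prefixed string is rejected; only the API itself can confirm a key.
print(looks_like_api_key("AIza" + "x" * 30))
print(looks_like_api_key("not-a-key"))
```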

In [None]:
# Create a secure password field for the API key
api_key_input = widgets.Password(
    placeholder='Paste your API key here',
    description='API Key:',
    layout=widgets.Layout(width='500px'),
    style={'description_width': '80px'}
)

api_key_status = widgets.HTML(value="")

def validate_api_key(change):
    if len(change['new']) > 20:
        api_key_status.value = "<span style='color: green;'>✅ API key entered</span>"
    else:
        api_key_status.value = "<span style='color: orange;'>⏳ Please enter your full API key</span>"

api_key_input.observe(validate_api_key, names='value')

display(HTML("<b>Enter your Gemini API key:</b>"))
display(api_key_input)
display(api_key_status)
display(HTML("<br><i>💡 Tip: Your key starts with 'AIza...'</i>"))

## Step 3: Upload Your Audio File(s) 📁

Click the button below to select and upload your audio file(s).

**Supported formats:** MP3, WAV, M4A, FLAC, OGG, WEBM, MP4, AAC
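
A file is accepted or skipped based on its extension alone. The sketch below shows that lookup in isolation, assuming a shortened version of the extension table used in the next cell; `mime_type_for` is a hypothetical helper name used only for illustration.

```python
from pathlib import Path

# Subset of the notebook's extension table, shortened for illustration.
SUPPORTED = {".mp3": "audio/mpeg", ".wav": "audio/wav", ".m4a": "audio/mp4"}


def mime_type_for(filename: str):
    """Return the MIME type for a supported file name, or None if unsupported."""
    return SUPPORTED.get(Path(filename).suffix.lower())


print(mime_type_for("Interview.MP3"))  # the extension check is case-insensitive
print(mime_type_for("notes.txt"))      # unsupported files yield None
```

Because only the extension is inspected, a renamed file with the wrong contents would still pass this check.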

In [None]:
# Store uploaded files
uploaded_files = []

# Supported formats
SUPPORTED_FORMATS = {
    '.mp3': 'audio/mpeg',
    '.wav': 'audio/wav',
    '.m4a': 'audio/mp4',
    '.flac': 'audio/flac',
    '.ogg': 'audio/ogg',
    '.webm': 'audio/webm',
    '.mp4': 'audio/mp4',
    '.aac': 'audio/aac'
}

upload_status = widgets.HTML(value="")

def upload_audio_files(b):
    global uploaded_files
    upload_status.value = "<span style='color: blue;'>📤 Upload dialog opened... Select your file(s)</span>"
    
    try:
        uploaded = files.upload()
        
        if uploaded:
            uploaded_files = []
            valid_files = []
            invalid_files = []
            
            for filename, content in uploaded.items():
                ext = Path(filename).suffix.lower()
                if ext in SUPPORTED_FORMATS:
                    # Save file locally in Colab
                    with open(filename, 'wb') as f:
                        f.write(content)
                    uploaded_files.append(filename)
                    valid_files.append(filename)
                else:
                    invalid_files.append(filename)
            
            status_html = ""
            if valid_files:
                status_html += f"<span style='color: green;'>✅ Uploaded {len(valid_files)} audio file(s):</span><br>"
                for f in valid_files:
                    status_html += f"&nbsp;&nbsp;&nbsp;📄 {f}<br>"
            if invalid_files:
                status_html += f"<span style='color: red;'>❌ Skipped {len(invalid_files)} unsupported file(s):</span><br>"
                for f in invalid_files:
                    status_html += f"&nbsp;&nbsp;&nbsp;⚠️ {f}<br>"
            
            upload_status.value = status_html
        else:
            upload_status.value = "<span style='color: orange;'>⚠️ No files uploaded</span>"
    except Exception as e:
        upload_status.value = f"<span style='color: red;'>❌ Error: {str(e)}</span>"

upload_button = widgets.Button(
    description='📁 Click to Upload Audio Files',
    button_style='primary',
    layout=widgets.Layout(width='250px', height='40px')
)
upload_button.on_click(upload_audio_files)

display(upload_button)
display(upload_status)

## Step 4: Choose Your Settings 🎛️

Select the transcription style and options below.

In [None]:
# ============================================
# PROMPT DEFINITIONS
# ============================================

PROMPTS = {
    "1. Full Audio Transcription": {
        "description": "Detailed word-for-word transcription with timestamps and speaker labels",
        "auto_split": True,
        "content": """# Full audio transcription

## Role and Objective
- Faithfully transcribe audio recordings into a publication-ready, accurate, and well-structured transcript.

## Instructions
- Transcribe exactly what is spoken without summarising or paraphrasing.
- Use standard punctuation and sentence case; break into paragraphs at topic or speaker shifts.
- Label each speaker consistently as Speaker 1:, Speaker 2:, etc.
- Insert a timestamp at the start of every speaker turn in the format [hh:mm:ss].
- For unclear audio, use [inaudible hh:mm:ss]. If unsure about a word or name, bracket with a question mark, e.g., [Kandahar?].
- Mark non-speech events (e.g., [overlapping speech], [laughter], [applause], [music]) in square brackets.
- Omit routine filler words ("um", "uh", repeated false starts) unless their inclusion changes the meaning of the sentence.
- Normalize numbers and dates for clarity (e.g., "twenty-five" → "25", "first of May 2024" → "1 May 2024").
- Preserve names and terms as heard; if unsure of spelling, use [term?].
- Maintain any code-switching or language changes as spoken; do not translate.
- Transcribe profanity, slurs, and sensitive language exactly as spoken.
- After completing the transcription, validate the output to ensure it matches the defined formatting conventions and is free of omissions, correcting any errors identified before finalizing the output.

### Output Format
- Each speaker turn starts on a new line with a timestamp [hh:mm:ss], speaker label, and the transcript.
- Clearly indicate non-speech and unclear audio using the conventions above.
- Separate paragraphs (speaker turns or topic shifts) with a blank line.
- Output should be in plain text or Markdown with appropriate spacing."""
    },
    "2. Meeting Minutes": {
        "description": "Summarized meeting notes with decisions, action items, and next steps",
        "auto_split": False,
        "content": """# Meeting Minutes

## Role and Objective
- Generate succinct, decision-oriented meeting minutes focused on actionable outcomes and relevant context.

## Instructions
- Summarize, do not transcribe. Capture only essential information for clarity and accountability.

### Scope
- Include:
  - Header details (title, date/time, location, chair, note-taker, attendees, apologies)
  - Agenda coverage
  - Announcements
  - Decisions
  - Action items (specifying owner and due date)
  - Key risks/issues
  - Dependencies
  - Open questions
  - Next steps/next meeting
- Maintain only the context necessary to understand each decision, with brief rationale. Omit small talk and verbatim digressions.

### Participants & Timing
- List all attendees, apologies, chair, and note-taker.
- Add a `[hh:mm:ss]` timestamp at the start of any decision, action, or announcement if available in the input.

### Editing Rules
- Capture the core point, not all rhetoric; avoid unintended paraphrasing or misrepresentation.
- Normalize numbers and dates (e.g., 15 September 2025, 14:00–15:00 CEST).
- Use consistent speaker names/roles. If unknown, default to "Participant 1", "Participant 2", etc.
- For unclear audio, insert `[inaudible hh:mm:ss]`; for overlapping speakers, insert `[crosstalk]`.
- If any action item is missing an owner or deadline, set as Owner: TBD / Due: TBD and flag this instance."""
    },
    "3. Interview Transcription": {
        "description": "Q&A format with interviewer/interviewee labels and emotional context",
        "auto_split": True,
        "content": """# Interview Transcription Prompt

Please transcribe this interview accurately.
- Clearly distinguish between interviewer and interviewee
- Format in a question-and-answer structure when possible
- Include emotional context (laughter, pauses) in [brackets]
- Maintain the conversational flow and natural speech patterns
- Preserve the tone and style of both speakers
- Note any significant pauses or interruptions
- Keep the chronological order of the conversation"""
    },
    "4. Lecture/Educational Content": {
        "description": "Structured notes with key concepts, definitions, and Q&A sections",
        "auto_split": True,
        "content": """# Lecture

Transcribe the educational content accurately, focusing strictly on the key concepts and main points. Structure the transcript in clear paragraphs, only including slide references or visual descriptions when explicitly mentioned in the material. Note audience questions and responses in a separate section. Preserve all academic terminology and technical language precisely; do not simplify unless specifically requested. Organize the material logically for educational clarity, and highlight major concepts and definitions.

Extract only the central ideas and supporting points emphasized by the speaker, such as the thesis, key claims, evidence/examples, methodologies, conclusions, and implications or limitations.

Output format:
# Summary (≤ 200 words)
## Core Takeaways (5-8 bullets)
## Key Points by Section
## Definitions & Concepts
## Evidence & Examples
## Q&A (if any)
## Keywords/Tags"""
    },
    "5. Q&A Summary": {
        "description": "Extract and condense only questions and answers from recordings",
        "auto_split": False,
        "content": """# Q&A-Focused Transcription (Extract & Condense)

## Role and Objective
Produce a concise Q&A transcript from audio recordings by extracting and condensing only the essential questions and answers.

## Instructions
- Include only questions and answers in the transcript.
- Omit introductions, bios, housekeeping comments, and small talk.
- For each question, summarize to the essential inquiry in 1–2 sentences, retaining key names, citations, numbers, and dates.
- For each answer, distill the main claim(s) and provide up to 3–4 supporting points or examples.

## Speakers & Timestamps
- Label each turn as: `[hh:mm:ss] Q (Name/Audience #):` and `[hh:mm:ss] A (Name/Role):`
- If the speaker is unnamed, use Audience 1, Audience 2, etc.

## Output Format
- Output must be strictly in Markdown.
- Each Q and A block appears on its own line.
- Insert a single blank line between each Q/A pair."""
    },
    "6. Full Audio Translation (to English)": {
        "description": "Translate non-English audio to English with cultural context notes",
        "auto_split": True,
        "content": """# Full audio translation (to English)

## Role and Objective
- Faithfully transcribe and translate audio recordings into a publication-ready, accurate, and well-structured English transcript.

## Instructions
- Translate all spoken content into English, regardless of the original language(s).
- Maintain the original meaning and tone as closely as possible while producing natural, fluent English.
- Use standard punctuation and sentence case; break into paragraphs at topic or speaker shifts.
- Label each speaker consistently as Speaker 1:, Speaker 2:, etc.
- Insert a timestamp at the start of every speaker turn in the format [hh:mm:ss].
- For unclear audio, use [inaudible hh:mm:ss]. If unsure about a word or name, bracket with a question mark, e.g., [Kandahar?].
- Mark non-speech events (e.g., [overlapping speech], [laughter], [applause], [music]) in square brackets.
- When the original language changes (code-switching), indicate the original language in brackets, e.g., [in French:] before the translated text if relevant for context.
- For culturally specific terms, idiomatic expressions, or words with no direct English equivalent, provide the English translation followed by the original term in parentheses, e.g., "religious endowment (waqf)", "neighborhood (mahalla)"."""
    }
}

# ============================================
# SETTINGS WIDGETS
# ============================================

# Model selection
model_dropdown = widgets.Dropdown(
    options=[
        ('Gemini 2.5 Pro (High quality, balanced)', 'gemini-2.5-pro'),
        ('Gemini 2.5 Flash (Faster, good quality)', 'gemini-2.5-flash'),
        ('Gemini 2.0 Flash (Previous generation, fast)', 'gemini-2.0-flash'),
    ],
    value='gemini-2.5-pro',
    description='AI Model:',
    style={'description_width': '100px'},
    layout=widgets.Layout(width='400px')
)

# Prompt selection
prompt_dropdown = widgets.Dropdown(
    options=list(PROMPTS.keys()),
    value='1. Full Audio Transcription',
    description='Style:',
    style={'description_width': '100px'},
    layout=widgets.Layout(width='400px')
)

# Prompt description display
prompt_description = widgets.HTML(
    value=f"<i>📝 {PROMPTS['1. Full Audio Transcription']['description']}</i>"
)

def update_prompt_description(change):
    selected = change['new']
    desc = PROMPTS[selected]['description']
    auto_split = PROMPTS[selected]['auto_split']
    prompt_description.value = f"<i>📝 {desc}</i>"
    # Update split checkbox based on prompt recommendation
    split_checkbox.value = auto_split

prompt_dropdown.observe(update_prompt_description, names='value')

# Audio splitting options
split_checkbox = widgets.Checkbox(
    value=True,
    description='Split long audio files into segments (recommended for files > 10 min)',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='500px')
)

segment_slider = widgets.IntSlider(
    value=10,
    min=5,
    max=30,
    step=5,
    description='Segment length (minutes):',
    style={'description_width': '180px'},
    layout=widgets.Layout(width='400px')
)

# Custom prompt option
use_custom_prompt = widgets.Checkbox(
    value=False,
    description='Use custom prompt instead',
    style={'description_width': 'initial'}
)

custom_prompt_text = widgets.Textarea(
    placeholder='Enter your custom transcription instructions here...\n\nExample: Please transcribe this audio in French, focusing on technical terminology.',
    layout=widgets.Layout(width='500px', height='150px'),
    disabled=True
)

def toggle_custom_prompt(change):
    custom_prompt_text.disabled = not change['new']
    prompt_dropdown.disabled = change['new']

use_custom_prompt.observe(toggle_custom_prompt, names='value')

# Display all settings
display(HTML("<h3>🤖 Select AI Model</h3>"))
display(model_dropdown)

display(HTML("<h3>📋 Select Transcription Style</h3>"))
display(prompt_dropdown)
display(prompt_description)

display(HTML("<br>"))
display(use_custom_prompt)
display(custom_prompt_text)

display(HTML("<h3>✂️ Audio Splitting Options</h3>"))
display(split_checkbox)
display(segment_slider)
display(HTML("<i>💡 Splitting helps with long recordings and can improve accuracy</i>"))

## Step 5: Start Transcription! 🚀

Click the button below to start transcribing your audio file(s).

In [None]:
# ============================================
# TRANSCRIPTION ENGINE
# ============================================

class ColabAudioTranscriber:
    """Simplified Audio Transcriber for Google Colab."""
    
    def __init__(self, api_key, model='gemini-2.5-pro'):
        self.api_key = api_key
        self.model = model
        self.client = genai.Client(api_key=self.api_key)
        self.supported_formats = SUPPORTED_FORMATS
    
    def prepare_audio(self, audio_file_path):
        """Read audio file and determine MIME type."""
        with open(audio_file_path, 'rb') as f:
            audio_bytes = f.read()
        ext = Path(audio_file_path).suffix.lower()
        mime_type = self.supported_formats.get(ext, 'audio/mpeg')
        return audio_bytes, mime_type
    
    def split_audio(self, audio_file_path, segment_minutes=10):
        """Split audio into segments."""
        try:
            segment_ms = segment_minutes * 60 * 1000
            audio = AudioSegment.from_file(audio_file_path)
            
            if len(audio) <= segment_ms:
                return [audio_file_path]
            
            segments = []
            base_name = Path(audio_file_path).stem
            ext = Path(audio_file_path).suffix
            
            for i, start in enumerate(range(0, len(audio), segment_ms), start=1):
                end = min(start + segment_ms, len(audio))
                chunk = audio[start:end]
                segment_path = f"{base_name}_segment_{i:02d}{ext}"
                
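                # Map extensions to ffmpeg export formats; without the webm/aac
                # entries those segments would be written as MP3 data under the
                # original extension and sent with a mismatched MIME type
                format_map = {'m4a': 'mp4', 'mp4': 'mp4', 'mp3': 'mp3',
                              'wav': 'wav', 'flac': 'flac', 'ogg': 'ogg',
                              'webm': 'webm', 'aac': 'adts'}
                export_format = format_map.get(ext.lstrip('.').lower(), 'mp3')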
                chunk.export(segment_path, format=export_format)
                segments.append(segment_path)
            
            return segments
        except Exception as e:
            print(f"⚠️ Could not split audio: {e}. Processing as single file.")
            return [audio_file_path]
    
    def transcribe(self, audio_file_path, prompt):
        """Transcribe a single audio file."""
        audio_bytes, mime_type = self.prepare_audio(audio_file_path)
        
        audio_part = types.Part.from_bytes(
            data=audio_bytes,
            mime_type=mime_type
        )
        
        response = self.client.models.generate_content(
            model=self.model,
            contents=[prompt, audio_part],
            config=types.GenerateContentConfig(
                temperature=0.1,
                max_output_tokens=65536,
            )
        )
        
        # response.text can be None when the model returns no text
        return (response.text or "").strip()

# ============================================
# TRANSCRIPTION BUTTON AND OUTPUT
# ============================================

output_area = widgets.Output()
transcription_results = {}  # Store results for download

def run_transcription(b):
    global transcription_results
    transcription_results = {}
    
    with output_area:
        clear_output()
        
        # Validate inputs
        # Same threshold as validate_api_key in Step 2 (more than 20 characters)
        if not api_key_input.value or len(api_key_input.value) <= 20:
            print("❌ Please enter a valid API key in Step 2")
            return
        
        if not uploaded_files:
            print("❌ Please upload at least one audio file in Step 3")
            return
        
        # Get settings
        api_key = api_key_input.value
        model = model_dropdown.value
        split_audio = split_checkbox.value
        segment_minutes = segment_slider.value
        
        # Get prompt
        if use_custom_prompt.value and custom_prompt_text.value.strip():
            prompt = custom_prompt_text.value.strip()
            print("📝 Using custom prompt")
        else:
            selected_prompt = prompt_dropdown.value
            prompt = PROMPTS[selected_prompt]['content']
            print(f"📝 Using: {selected_prompt}")
        
        print(f"🤖 Model: {model}")
        print(f"✂️ Audio splitting: {'Enabled' if split_audio else 'Disabled'}")
        if split_audio:
            print(f"   Segment length: {segment_minutes} minutes")
        print("\n" + "="*50)
        
        try:
            # Initialize transcriber
            transcriber = ColabAudioTranscriber(api_key, model)
            print("✅ Gemini client ready\n")
            
            # Process each file
            for i, audio_file in enumerate(uploaded_files, 1):
                print(f"\n🎵 Processing file {i}/{len(uploaded_files)}: {audio_file}")
                print("-" * 40)
                
                try:
                    if split_audio:
                        segments = transcriber.split_audio(audio_file, segment_minutes)
                        if len(segments) > 1:
                            print(f"✂️ Split into {len(segments)} segments")
                        
                        transcription_parts = []
                        for j, segment in enumerate(segments, 1):
                            print(f"   ⏳ Transcribing segment {j}/{len(segments)}...")
                            result = transcriber.transcribe(segment, prompt)
                            if len(segments) > 1:
                                transcription_parts.append(f"[Segment {j}]\n{result}")
                            else:
                                transcription_parts.append(result)
                            print(f"   ✅ Segment {j} complete")
                        
                        transcription = "\n\n".join(transcription_parts)
                    else:
                        print("   ⏳ Transcribing...")
                        transcription = transcriber.transcribe(audio_file, prompt)
                    
                    # Store result
                    output_filename = Path(audio_file).stem + "_transcription.txt"
                    transcription_results[output_filename] = transcription
                    
                    # Save locally
                    with open(output_filename, 'w', encoding='utf-8') as f:
                        f.write(f"Transcription of: {audio_file}\n")
                        f.write(f"Model: {model}\n")
                        f.write("=" * 50 + "\n\n")
                        f.write(transcription)
                    
                    print(f"\n✅ Transcription complete for: {audio_file}")
                    print(f"📄 Saved as: {output_filename}")
                    
                except Exception as e:
                    print(f"\n❌ Error transcribing {audio_file}: {str(e)}")
            
            # Summary
            print("\n" + "="*50)
            print("🎉 TRANSCRIPTION COMPLETE!")
            print(f"   Files processed: {len(transcription_results)}")
            print("\n👇 Download your transcriptions in the next step")
            
        except Exception as e:
            print(f"\n❌ Error: {str(e)}")
            if "API key" in str(e) or "authentication" in str(e).lower():
                print("\n💡 Please check that your API key is correct.")

transcribe_button = widgets.Button(
    description='🚀 Start Transcription',
    button_style='success',
    layout=widgets.Layout(width='200px', height='50px')
)
transcribe_button.on_click(run_transcription)

display(transcribe_button)
display(HTML("<br>"))
display(output_area)

## Step 6: Download Your Transcriptions 📥

After transcription is complete, click below to download your files.

In [None]:
download_output = widgets.Output()

def download_transcriptions(b):
    with download_output:
        clear_output()
        
        if not transcription_results:
            print("❌ No transcriptions available yet. Please run Step 5 first.")
            return
        
        print("📥 Preparing downloads...\n")
        
        for filename in transcription_results.keys():
            try:
                print(f"   Downloading: {filename}")
                files.download(filename)
            except Exception as e:
                print(f"   ⚠️ Could not download {filename}: {e}")
        
        print("\n✅ Downloads initiated! Check your browser's download folder.")

download_button = widgets.Button(
    description='📥 Download All Transcriptions',
    button_style='info',
    layout=widgets.Layout(width='250px', height='40px')
)
download_button.on_click(download_transcriptions)

display(download_button)
display(download_output)

## Step 7 (Optional): View Transcription Results 👁️

Preview your transcription directly in this notebook.

In [None]:
preview_output = widgets.Output()

def show_preview(b):
    with preview_output:
        clear_output()
        
        if not transcription_results:
            print("❌ No transcriptions available yet. Please run Step 5 first.")
            return
        
        for filename, content in transcription_results.items():
            print("=" * 60)
            print(f"📄 {filename}")
            print("=" * 60)
            print(content[:5000])  # Show first 5000 characters
            if len(content) > 5000:
                print(f"\n... [Truncated - {len(content) - 5000} more characters]")
            print("\n")

preview_button = widgets.Button(
    description='👁️ Preview Transcriptions',
    button_style='',
    layout=widgets.Layout(width='200px', height='35px')
)
preview_button.on_click(show_preview)

display(preview_button)
display(preview_output)

---

## ℹ️ Help & Troubleshooting

### Common Issues:

**"API key not valid"**
- Make sure you copied the entire API key
- Get a new key at: https://aistudio.google.com/app/apikey

**"File format not supported"**
- Supported formats: MP3, WAV, M4A, FLAC, OGG, WEBM, MP4, AAC
- Try converting your file to MP3
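
One way to do the conversion inside Colab is with pydub, which Step 1 already installs. This is a minimal sketch, not part of the notebook itself: the helper names `mp3_path_for` and `convert_to_mp3` are hypothetical, and the conversion assumes ffmpeg is available in the runtime and can decode the source file.

```python
from pathlib import Path


def mp3_path_for(source: str) -> str:
    """Return the source path with its extension swapped for .mp3."""
    return str(Path(source).with_suffix(".mp3"))


def convert_to_mp3(source: str) -> str:
    """Convert an audio file to MP3 next to the original (assumes pydub + ffmpeg)."""
    from pydub import AudioSegment  # imported lazily so the path helper works without pydub

    target = mp3_path_for(source)
    AudioSegment.from_file(source).export(target, format="mp3")
    return target


print(mp3_path_for("talk.wma"))  # talk.mp3
```

After converting, upload the resulting `.mp3` file in Step 3 as usual.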

**"Transcription takes too long"**
- Try using "Gemini 2.5 Flash" for faster processing
- Enable audio splitting for long files
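
To see what splitting does to a recording, the sketch below computes the segment boundaries in milliseconds the same way the Step 5 cell steps through the audio (fixed-length chunks, with a shorter final chunk). The function name `segment_bounds` is hypothetical; it works on plain numbers and does not touch any audio.

```python
def segment_bounds(total_ms: int, segment_minutes: int = 10):
    """Return (start_ms, end_ms) pairs covering a recording of total_ms milliseconds."""
    segment_ms = segment_minutes * 60 * 1000
    if total_ms <= segment_ms:
        return [(0, total_ms)]  # short recordings are left as a single piece
    return [
        (start, min(start + segment_ms, total_ms))
        for start in range(0, total_ms, segment_ms)
    ]


# A 25-minute recording with 10-minute segments gives two full chunks and a 5-minute tail.
print(segment_bounds(25 * 60 * 1000, 10))
# [(0, 600000), (600000, 1200000), (1200000, 1500000)]
```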

**"Output is not what I expected"**
- Try a different transcription style
- Use the custom prompt option for specific needs
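
One case worth knowing about: with audio splitting enabled, each segment is sent to the model on its own, so the `[hh:mm:ss]` timestamps inside each `[Segment N]` block will probably restart near zero. The sketch below shows one possible post-processing step that shifts the timestamps of a segment by its start offset. `shift_timestamps` is a hypothetical helper, not part of this notebook, and it assumes the model followed the `[hh:mm:ss]` and `[inaudible hh:mm:ss]` conventions from the built-in prompts.

```python
import re

# Matches [hh:mm:ss] and [inaudible hh:mm:ss]; other bracketed notes are left alone.
TIMESTAMP = re.compile(r"\[((?:inaudible )?)(\d{2}):(\d{2}):(\d{2})\]")


def shift_timestamps(text: str, offset_seconds: int) -> str:
    """Add offset_seconds to every matching timestamp in text."""
    def shift(match):
        prefix = match.group(1)
        hours, minutes, seconds = (int(g) for g in match.groups()[1:])
        total = hours * 3600 + minutes * 60 + seconds + offset_seconds
        return f"[{prefix}{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}]"

    return TIMESTAMP.sub(shift, text)


# Segment 2 of a recording split into 10-minute pieces starts 600 seconds in.
print(shift_timestamps("[00:00:05] Speaker 1: Welcome back. [laughter]", 600))
# [00:10:05] Speaker 1: Welcome back. [laughter]
```

The offset for segment N would be `(N - 1) * segment_minutes * 60` seconds under the fixed-length splitting used in Step 5.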

---

*Created by ZMO AI Pipelines*