# 🎙️ Faster-Whisper-XXL Optimized Transcription

A user-friendly interface for transcribing Japanese audio/video files using optimized Faster-Whisper-XXL settings.

**Features:**
- 🚀 Optimized for Japanese language transcription
- 🎯 Medium model for best speed/quality balance
- 🔧 Adjustable parameters
- 📁 Multiple input sources (Google Drive, Upload, Local)
- 📊 Real-time progress tracking
- 💾 Automatic SRT and JSON output

**Hardware Optimized for:**
- RTX 2080 GPU (8GB VRAM)
- i9-9900K CPU (16 threads)
- 32GB RAM

In [None]:
#@title 🔧 Setup Environment
#@markdown Install required packages and set up Faster-Whisper-XXL

print("🚀 Setting up environment...")

# Install required packages
!pip install faster-whisper torch torchvision torchaudio --quiet
!pip install ffmpeg-python --quiet
!apt-get update && apt-get install -y ffmpeg --quiet

# Import libraries
import os
import sys
import torch
import faster_whisper
import ffmpeg
from pathlib import Path
from google.colab import files, drive
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
import time
from tqdm.notebook import tqdm

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"✅ Using device: {device}")
if torch.cuda.is_available():
    print(f"🎮 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

print("✅ Setup complete!")

In [None]:
#@title 📁 Mount Google Drive (Optional)
#@markdown Mount your Google Drive to access files

mount_drive = False #@param {type:"boolean"}

if mount_drive:
    print("🔗 Mounting Google Drive...")
    drive.mount('/content/drive')
    print("✅ Google Drive mounted at /content/drive")
    print("📂 Your files are accessible at: /content/drive/MyDrive/")
else:
    print("ℹ️ Google Drive not mounted. You can still upload files directly.")

In [None]:
#@title 📤 Upload Files
#@markdown Upload audio/video files for transcription

print("📤 Upload your audio/video files...")
print("Supported formats: MP3, WAV, MP4, MKV, AVI, etc.")

# Create upload directory
upload_dir = Path("/content/uploads")
upload_dir.mkdir(exist_ok=True)

# Upload files
uploaded = files.upload()

# Move uploaded files to upload directory
uploaded_files = []
for filename, content in uploaded.items():
    file_path = upload_dir / filename
    with open(file_path, 'wb') as f:
        f.write(content)
    uploaded_files.append(str(file_path))
    print(f"✅ Uploaded: {filename}")

if uploaded_files:
    print(f"\n📁 Files ready for processing: {len(uploaded_files)} file(s)")
    for f in uploaded_files:
        print(f"  - {Path(f).name}")
else:
    print("\n⚠️ No files uploaded. You can also specify file paths manually.")

In [None]:
#@title ⚙️ Transcription Settings
#@markdown Configure transcription parameters

# Model settings
model_size = "medium" #@param ["tiny", "base", "small", "medium", "large-v2", "large-v3"] {type:"string"}
language = "ja" #@param ["ja", "en", "zh", "ko", "auto"] {type:"string"}

# Processing settings
compute_type = "float16" #@param ["int8", "float16", "float32"] {type:"string"}
batch_size = 16 #@param {type:"slider", min:1, max:32, step:1}
num_workers = 2 #@param {type:"slider", min:1, max:8, step:1}

# VAD settings
vad_filter = True #@param {type:"boolean"}
vad_threshold = 0.5 #@param {type:"slider", min:0.1, max:1.0, step:0.1}

# Output settings
output_formats = ["srt", "json", "text"] #@param {type:"raw"}

print("⚙️ Transcription Settings:")
print(f"🎯 Model: {model_size}")
print(f"🌏 Language: {language}")
print(f"🔢 Compute Type: {compute_type}")
print(f"📦 Batch Size: {batch_size}")
print(f"👷 Workers: {num_workers}")
print(f"🎙️ VAD Filter: {vad_filter}")
print(f"📊 VAD Threshold: {vad_threshold}")
print(f"📄 Output Formats: {', '.join(output_formats)}")

# Create output directory
output_dir = Path("/content/output")
output_dir.mkdir(exist_ok=True)
print(f"📂 Output directory: {output_dir}")

In [None]:
#@title 🎯 File Selection
#@markdown Select files to process

# Manual file path input
manual_files = "" #@param {type:"string"}
#@markdown Enter file paths separated by commas, or leave empty to use uploaded files

# Determine files to process
files_to_process = []

if manual_files.strip():
    # Parse manual file paths
    manual_paths = [f.strip() for f in manual_files.split(',') if f.strip()]
    for path_str in manual_paths:
        path = Path(path_str)
        if path.exists():
            files_to_process.append(str(path))
            print(f"✅ Added: {path.name}")
        else:
            print(f"❌ Not found: {path_str}")
elif uploaded_files:
    files_to_process = uploaded_files
    print("📤 Using uploaded files:")
    for f in uploaded_files:
        print(f"  - {Path(f).name}")
else:
    print("⚠️ No files selected. Please upload files or specify file paths.")

print(f"\n📊 Ready to process: {len(files_to_process)} file(s)")

In [None]:
#@title 🚀 Start Transcription
#@markdown Click to start the transcription process

start_transcription = True #@param {type:"boolean"}

if not start_transcription:
    print("⏸️ Transcription not started. Check the box above to begin.")
elif not files_to_process:
    print("❌ No files to process. Please upload files or specify file paths.")
else:
    print("🎯 Starting transcription...")
    print("=" * 50)

    # Load model
    print(f"🔄 Loading {model_size} model...")
    model = faster_whisper.WhisperModel(
        model_size,
        device=device,
        compute_type=compute_type,
        num_workers=num_workers
    )
    print("✅ Model loaded successfully!")

    # Process each file
    for i, file_path in enumerate(files_to_process, 1):
        file_path = Path(file_path)
        print(f"\n🎵 Processing file {i}/{len(files_to_process)}: {file_path.name}")
        print("-" * 40)

        try:
            # Extract audio if video file
            if file_path.suffix.lower() in ['.mp4', '.mkv', '.avi', '.mov', '.wmv']:
                print("🎬 Extracting audio from video...")
                audio_path = output_dir / f"{file_path.stem}_audio.wav"
                
                # Use ffmpeg to extract audio
                (
                    ffmpeg
                    .input(str(file_path))
                    .output(str(audio_path), 
                           acodec='pcm_s16le', 
                           ar='16000', 
                           ac=1,
                           vn=None)
                    .run(quiet=True, overwrite_output=True)
                )
                
                input_file = str(audio_path)
                print(f"✅ Audio extracted: {audio_path.name}")
            else:
                input_file = str(file_path)

            # Transcribe
            print("🎙️ Transcribing...")
            
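            # WhisperModel.transcribe() takes no batch_size argument. Batching goes
            # through BatchedInferencePipeline (faster-whisper >= 1.1), which relies
            # on VAD to chunk the audio, so it is used only when vad_filter is on.
            if vad_filter:
                transcriber = faster_whisper.BatchedInferencePipeline(model=model)
                batch_kwargs = {'batch_size': batch_size}
            else:
                transcriber = model
                batch_kwargs = {}

            segments, info = transcriber.transcribe(
                input_file,
                language=language if language != "auto" else None,
                beam_size=5,
                patience=2.0,
                length_penalty=1.0,
                repetition_penalty=1.0,
                compression_ratio_threshold=2.4,
                log_prob_threshold=-1.0,
                # Whisper's own no-speech cutoff (library default), separate from VAD
                no_speech_threshold=0.6,
                vad_filter=vad_filter,
                # The VAD speech-probability threshold is passed via vad_parameters
                vad_parameters={'threshold': vad_threshold} if vad_filter else None,
                suppress_blank=True,
                suppress_tokens=[-1],
                without_timestamps=False,
                max_initial_timestamp=1.0,
                word_timestamps=True,
                prepend_punctuations="'\"¿([{-",
                append_punctuations="'.。,，!！?？:：\")]}、",
                initial_prompt=None,
                prefix=None,
                **batch_kwargs
            )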

            # Collect segments
            transcription_segments = []
            print("📝 Collecting transcription data...")
            
            with tqdm(total=None, desc="Processing segments") as pbar:
                for segment in segments:
                    transcription_segments.append({
                        'start': segment.start,
                        'end': segment.end,
                        'text': segment.text.strip(),
                        'words': [
                            {
                                'word': word.word,
                                'start': word.start,
                                'end': word.end,
                                'probability': word.probability
                            } for word in segment.words
                        ] if segment.words else []
                    })
                    pbar.update(1)

            # Generate output files
            base_name = file_path.stem
            
            # SRT format
            if 'srt' in output_formats:
                srt_path = output_dir / f"{base_name}.srt"
                print(f"📄 Generating SRT: {srt_path.name}")
                
                with open(srt_path, 'w', encoding='utf-8') as f:
                    # idx avoids shadowing the outer file counter i
                    for idx, segment in enumerate(transcription_segments, 1):
                        # SRT timestamps are HH:MM:SS,mmm (millisecond precision)
                        stamps = []
                        for t in (segment['start'], segment['end']):
                            ms = int(round(t * 1000))
                            stamps.append(f"{ms // 3600000:02d}:{ms // 60000 % 60:02d}:{ms // 1000 % 60:02d},{ms % 1000:03d}")
                        f.write(f"{idx}\n")
                        f.write(f"{stamps[0]} --> {stamps[1]}\n")
                        f.write(f"{segment['text']}\n\n")
                
                print(f"✅ SRT saved: {srt_path}")

            # JSON format
            if 'json' in output_formats:
                json_path = output_dir / f"{base_name}.json"
                print(f"📄 Generating JSON: {json_path.name}")
                
                import json
                output_data = {
                    'file': str(file_path),
                    'language': info.language,
                    'language_probability': info.language_probability,
                    'duration': info.duration,
                    'segments': transcription_segments
                }
                
                with open(json_path, 'w', encoding='utf-8') as f:
                    json.dump(output_data, f, ensure_ascii=False, indent=2)
                
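                print(f"✅ JSON saved: {json_path}")

            # Plain-text format: one segment per line ('text' is listed in output_formats)
            if 'text' in output_formats:
                txt_path = output_dir / f"{base_name}.txt"
                txt_path.write_text("\n".join(s['text'] for s in transcription_segments), encoding='utf-8')
                print(f"✅ Text saved: {txt_path}")

            print(f"✅ Completed: {file_path.name}")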
            
        except Exception as e:
            print(f"❌ Error processing {file_path.name}: {e}")
            continue

    print("\n" + "=" * 50)
    print("🎉 Transcription completed!")
    print(f"📂 Output files saved in: {output_dir}")
    
    # List output files
    output_files = list(output_dir.glob("*"))
    if output_files:
        print("\n📄 Generated files:")
        for f in output_files:
            print(f"  - {f.name}")
    
    # Download option
    print("\n📥 Download options:")
    print("- Use the file browser on the left to download individual files")
    print("- Or run the download cell below")

In [None]:
#@title 📥 Download Results
#@markdown Download all output files as a ZIP archive

download_results = False #@param {type:"boolean"}

if download_results:
    import shutil
    
    # Create ZIP archive
    zip_path = "/content/transcription_results.zip"
    shutil.make_archive("/content/transcription_results", 'zip', output_dir)
    
    # Download
    files.download(zip_path)
    print("✅ ZIP archive downloaded!")
else:
    print("📁 Output files are available in the folder on the left.")
    print(f"📂 Local path: {output_dir}")

# 📖 Usage Instructions

## Quick Start:
1. **Setup**: Run the first cell to install dependencies
2. **Input**: Choose one of the input methods:
   - Mount Google Drive for cloud files
   - Upload files directly
   - Specify local Colab paths
3. **Configure**: Adjust settings in the configuration cell
4. **Process**: Run the transcription cell to start
5. **Download**: Get your results

## Supported Formats:
- **Audio**: MP3, WAV, FLAC, M4A, OGG
- **Video**: MP4, MKV, AVI, MOV, WMV (audio will be extracted)
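
The split between audio and video formats decides whether the audio track has to be extracted before transcription. A minimal sketch of that check, using only the extensions listed above; `classify_media` and the two set names are hypothetical helpers for illustration, not part of the notebook:

```python
from pathlib import Path

# Extensions copied from the "Supported Formats" lists; the names are illustrative.
AUDIO_EXTS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}
VIDEO_EXTS = {".mp4", ".mkv", ".avi", ".mov", ".wmv"}


def classify_media(path):
    """Return 'audio', 'video', or 'unknown' based on the file extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive match
    if suffix in AUDIO_EXTS:
        return "audio"
    if suffix in VIDEO_EXTS:
        return "video"
    return "unknown"
```

A file classified as `'video'` would go through the ffmpeg extraction step first, while `'unknown'` files are worth checking by hand before processing.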

## Tips:
- For best Japanese transcription, keep the language set to "ja"
- The `medium` model is the default because it balances speed and quality on the 8 GB GPU this notebook targets
- A batch size of 16 is a reasonable starting point; lower it if GPU memory runs out
- Enable VAD filter for better segmentation
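
The language and VAD tips above correspond to keyword arguments of faster-whisper's `transcribe` call. The sketch below assumes the library's `language`, `vad_filter`, and `vad_parameters` arguments (with a `threshold` key for the VAD); `build_transcribe_kwargs` is a hypothetical helper, not something the notebook or the library provides:

```python
def build_transcribe_kwargs(language="ja", vad_filter=True, vad_threshold=0.5):
    """Translate the notebook settings into keyword arguments for transcribe()."""
    kwargs = {
        # "auto" means: let the model detect the language itself
        "language": None if language == "auto" else language,
        "vad_filter": vad_filter,
    }
    if vad_filter:
        # The VAD speech-probability threshold is nested under vad_parameters
        kwargs["vad_parameters"] = {"threshold": vad_threshold}
    return kwargs
```

It would be used as `model.transcribe(path, **build_transcribe_kwargs())`, which keeps the VAD threshold separate from Whisper's own `no_speech_threshold`.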

## Troubleshooting:
- If you get CUDA out-of-memory errors, reduce `batch_size` or pick a smaller model
- For very long files, consider splitting them first
- Check the output logs for detailed error messages
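
The advice to reduce `batch_size` after a CUDA error can be automated with a simple back-off loop. This is a sketch under two assumptions: that the out-of-memory failure surfaces as a `RuntimeError`, and that `transcribe_fn` is a callable you supply that takes a batch size and runs the whole transcription. `run_with_batch_backoff` is a hypothetical name:

```python
def run_with_batch_backoff(transcribe_fn, batch_size=16, min_batch_size=1):
    """Call transcribe_fn(batch_size), halving batch_size after each RuntimeError.

    Returns a (result, batch_size_used) tuple. The last error is re-raised
    once the attempt at min_batch_size has also failed.
    """
    while True:
        try:
            return transcribe_fn(batch_size), batch_size
        except RuntimeError:
            if batch_size <= min_batch_size:
                raise  # nothing smaller left to try
            batch_size = max(min_batch_size, batch_size // 2)
```

Because faster-whisper returns segments lazily, `transcribe_fn` should consume them (for example with `list(segments)`) so that a memory error is raised inside the retry loop rather than later during iteration.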