# 🎧 AudioX: Diffusion Transformer for Anything-to-Audio Generation

[![arXiv](https://img.shields.io/badge/arXiv-2503.10522-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2503.10522)
[![Project Page](https://img.shields.io/badge/GitHub.io-Project-blue?logo=Github&style=flat-square)](https://zeyuet.github.io/AudioX/)
[![🤗 Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/HKUSTAudio/AudioX)

**This notebook provides a Google Colab interface for AudioX, supporting:**
- 📝 Text-to-Audio Generation
- 🎬 Video-to-Audio Generation  
- 🎵 Video-to-Music Generation
- 🎶 Text-to-Music Generation

**Instructions:**
1. Run the cells in order (pick one of the two setup options below)
2. Wait for the Gradio interface to load
3. Use the public URL to access the demo

---

## 🔧 Setup and Installation

Choose one of the following setup options:

- **Option A**: Automated setup (recommended). Run the next cell.
- **Option B**: Manual setup. Skip to the cell after that.

In [None]:
# OPTION A: Automated Setup (Recommended)
# This cell handles everything automatically

import os
import subprocess
from pathlib import Path

# Check if we're in Colab
try:
    import google.colab
    IN_COLAB = True
    print("🚀 Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("💻 Running locally")

if IN_COLAB:
    try:
        # Check if AudioX directory already exists
        if not Path('AudioX').exists():
            print("📥 Cloning AudioX repository...")
            !git clone https://github.com/Wamp1re-Ai/AudioX.git
        else:
            print("📁 AudioX directory already exists")
        
        # Change to AudioX directory
        %cd AudioX
        print(f"📁 Changed to directory: {Path.cwd()}")
        
        # Install system dependencies
        print("📦 Installing system dependencies...")
        !apt-get update -qq
        !apt-get install -y ffmpeg libsndfile1 git-lfs
        
        # Install Python dependencies with better error handling
        print("🐍 Installing Python dependencies...")
        
        # Install PyTorch (pip keeps Colab's preinstalled build if one is already present)
        !pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
        
        # Install essential AudioX dependencies
        # (the version specifier is quoted so the shell does not treat ">" as output redirection)
        !pip install -q "gradio>=4.40.0" aeiou einops safetensors transformers huggingface_hub
        
        # Install AudioX package
        !pip install -q -e .
        
        # Create model directory
        !mkdir -p model
        
        print("✅ Automated setup complete!")
        print("🎛️ Skip Option B and continue with the 'Download Pre-trained Models' section")
        
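    # Note: failures inside `!` shell commands do not raise Python exceptions,
    # so also check the output above for errors.
    except Exception as e: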
        print(f"⚠️  Automated setup encountered issues: {e}")
        print("🔄 Please try the manual setup option below")
        
else:
    print("ℹ️  Please use manual setup for local environments")

## 🔧 Manual Setup (Option B)

Use this if the automated setup fails or you prefer manual control.

In [None]:
# Check if we're in Colab
try:
    import google.colab
    IN_COLAB = True
    print("🚀 Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("💻 Running locally")

# Install system dependencies
if IN_COLAB:
    !apt-get update -qq
    !apt-get install -y ffmpeg libsndfile1 git-lfs
    
    # Clone the repository (use the correct repo URL)
    !git clone https://github.com/Wamp1re-Ai/AudioX.git
    %cd AudioX
    
    # Install Python dependencies with better compatibility
    !pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    
    # Install essential dependencies first
    # (the version specifier is quoted so the shell does not treat ">" as output redirection)
    !pip install -q "gradio>=4.40.0" aeiou einops safetensors transformers huggingface_hub
    
    # Install AudioX package
    !pip install -q -e .
    
    print("✅ Installation complete!")
    print("⚠️  Note: Some dependency warnings are normal in Colab")
else:
    print("ℹ️  Please ensure you have installed the dependencies locally")

## 📥 Download Pre-trained Models

Download the AudioX model checkpoints from Hugging Face.

In [None]:
import os
import urllib.request

# Create model directory
os.makedirs('model', exist_ok=True)

# Download model files
model_files = {
    'model.ckpt': 'https://huggingface.co/HKUSTAudio/AudioX/resolve/main/model.ckpt',
    'config.json': 'https://huggingface.co/HKUSTAudio/AudioX/resolve/main/config.json'
}

def download_file(url, filename):
    """Download url to model/<filename> with a text progress readout; skip files that already exist."""
    def progress_hook(block_num, block_size, total_size):
        if total_size > 0:
            # The last block may overshoot, so clamp to the reported total
            downloaded = min(block_num * block_size, total_size)
            percent = downloaded * 100 / total_size
            print(f"\r{filename}: {percent:.1f}% ({downloaded//1024//1024}MB/{total_size//1024//1024}MB)", end="")
    
    if not os.path.exists(f'model/{filename}'):
        print(f"📥 Downloading {filename}...")
        urllib.request.urlretrieve(url, f'model/{filename}', progress_hook)
        print(f"\n✅ {filename} downloaded successfully!")
    else:
        print(f"✅ {filename} already exists, skipping download.")

# Download all model files
for filename, url in model_files.items():
    download_file(url, filename)

print("\n🎉 All models downloaded successfully!")
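
The download cell above skips any file that already exists, so an interrupted download can leave a truncated checkpoint that later runs treat as complete. One possible guard is to download to a temporary name and rename only on success. The helper name `download_atomically` and the `.part` suffix are illustrative assumptions and are not part of AudioX.

```python
import os
import urllib.request


def download_atomically(url, dest_path):
    """Download url to dest_path via a temporary '.part' file (hypothetical helper).

    Returns True if a download happened, False if dest_path already existed.
    """
    if os.path.exists(dest_path):
        return False  # a finished file is already in place
    part_path = dest_path + ".part"
    try:
        urllib.request.urlretrieve(url, part_path)
        # os.replace renames atomically when both paths are on the same filesystem
        os.replace(part_path, dest_path)
    except BaseException:
        # Remove the incomplete file so a rerun starts clean, then re-raise
        if os.path.exists(part_path):
            os.remove(part_path)
        raise
    return True
```

With this approach, a file under `model/` only appears under its final name once the transfer has finished, so the "already exists" check stays trustworthy.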

## ⚙️ Colab-Specific Setup

Configure the environment for optimal performance in Google Colab.

In [None]:
import torch
import platform
import gc
import os

# Set environment variables for optimal performance
os.environ["TOKENIZERS_PARALLELISM"] = "false"
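# Use an absolute path so temp files still resolve if the working directory changes
tmp_dir = os.path.abspath('./tmp')
os.environ['TMPDIR'] = tmp_dir
os.makedirs(tmp_dir, exist_ok=True)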
os.makedirs('./demo_result', exist_ok=True)

# Check GPU availability
if torch.cuda.is_available():
    device = torch.device("cuda")
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"🎮 GPU: {gpu_name} ({gpu_memory:.1f}GB)")
    
    # Clear GPU cache
    torch.cuda.empty_cache()
    gc.collect()
else:
    device = torch.device("cpu")
    print("💻 Using CPU (GPU not available)")

print(f"🔧 Device: {device}")
print(f"🐍 Python: {platform.python_version()}")
print(f"🔥 PyTorch: {torch.__version__}")

# Memory management for Colab
def clear_memory():
    """Clear GPU and system memory"""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()

print("✅ Colab environment configured!")

## 🎛️ Launch Gradio Interface

Start the interactive AudioX demo with Gradio. The interface will be accessible via a public URL.
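
Before launching, it can help to confirm that the two files the launch cell expects (`./model/config.json` and `./model/model.ckpt`) are actually in place. The following is a minimal sketch. The helper `missing_model_files` is a hypothetical name, and treating an empty file as missing is an assumption made here.

```python
from pathlib import Path


def missing_model_files(model_dir="model", names=("config.json", "model.ckpt")):
    """Return the expected file names that are absent or empty in model_dir."""
    missing = []
    for name in names:
        path = Path(model_dir) / name
        # An empty file usually means a download never really started
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


missing = missing_model_files()
if missing:
    print(f"Missing model files: {', '.join(missing)}. Run the download cell first.")
else:
    print("Model files found.")
```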

In [None]:
# Ensure we're in the right directory and can import modules
import os
import sys
from pathlib import Path

# Check current directory
current_dir = Path.cwd()
print(f"📁 Current directory: {current_dir}")

# If we're not in AudioX directory, try to find it
if not (current_dir / 'stable_audio_tools').exists():
    if (current_dir / 'AudioX').exists():
        os.chdir('AudioX')
        print("📁 Changed to AudioX directory")
    else:
        print("⚠️  AudioX directory not found. Please run the setup cells first.")

# Add current directory to Python path
if str(Path.cwd()) not in sys.path:
    sys.path.insert(0, str(Path.cwd()))
    print("🔧 Added current directory to Python path")

# Option 1: Try to use the optimized Colab interface
interface_launched = False

try:
    from colab_gradio_interface import launch_colab_demo
    
    print("🎛️ Launching optimized Colab interface...")
    interface = launch_colab_demo(
        model_config_path='./model/config.json',
        ckpt_path='./model/model.ckpt',
        share=True,
        debug=False
    )
    interface_launched = True
    
except ImportError as e:
    print(f"⚠️  Colab interface not found: {e}")
    print("🔄 Falling back to standard interface...")
except Exception as e:
    print(f"⚠️  Error with Colab interface: {e}")
    print("🔄 Falling back to standard interface...")

# Option 2: Fallback to standard interface
if not interface_launched:
    try:
        from stable_audio_tools.interface.gradio import create_ui
        import gradio as gr
        
        # Create the interface
        print("🎛️ Creating standard Gradio interface...")
        interface = create_ui(
            model_config_path='./model/config.json',
            ckpt_path='./model/model.ckpt',
            model_half=False  # Set to True if you have memory issues
        )
        
        # Configure for Colab
        interface.queue(max_size=10)  # Limit queue size for Colab
        
        # Launch with public sharing enabled
        print("🚀 Launching AudioX Demo...")
        print("📱 The interface will be available at the public URL below")
        print("⏱️  Please wait for the model to load (this may take a few minutes)")
        
        # Launch the interface
        interface.launch(
            share=True,  # Enable public sharing
            debug=False,
            server_name="0.0.0.0",
            server_port=7860,
            show_error=True,
            quiet=False
        )
        interface_launched = True
        
    except Exception as e:
        print(f"❌ Error launching interface: {e}")
        print("\n🔧 Troubleshooting steps:")
        print("1. Make sure you ran all setup cells successfully")
        print("2. Check that model files exist in ./model/ directory")
        print("3. Try restarting the runtime and running cells again")
        print("4. Check the troubleshooting section below")

if interface_launched:
    print("\n🎉 AudioX is now running!")
    print("🔗 Use the public URL above to access the demo")
    print("💡 Tip: The public share link is temporary and stops working once this runtime shuts down")
else:
    print("\n❌ Failed to launch interface. Please check the errors above.")

## 📖 Usage Guide

### 🎯 Available Tasks:

1. **📝 Text-to-Audio**: Enter a text description to generate corresponding audio
   - Example: "Typing on a keyboard", "Ocean waves crashing"

2. **🎶 Text-to-Music**: Generate music from text descriptions
   - Example: "An orchestral music piece for a fantasy world"

3. **🎬 Video-to-Audio**: Upload a video file to generate matching audio
   - Supports common video formats (MP4, AVI, MOV)

4. **🎵 Video-to-Music**: Generate background music for videos
   - Use prompt: "Generate music for the video"

### ⚙️ Parameters:

- **Steps**: Number of diffusion steps (higher = better quality, slower)
- **CFG Scale**: Classifier-free guidance scale (higher = more prompt adherence)
- **Seed**: Random seed for reproducible results (-1 for random)
- **Sampler Type**: Different sampling algorithms
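
To make two of these parameters concrete, here is a small sketch of the textbook classifier-free guidance formula and of the "-1 means random" seed convention. Both functions are illustrative assumptions written for this guide. AudioX's own sampler code may combine predictions or pick seeds differently.

```python
import random


def apply_cfg(uncond, cond, cfg_scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond), element-wise."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


def resolve_seed(seed):
    """Treat -1 as 'pick a random seed'; return any other value unchanged."""
    if seed == -1:
        return random.randint(0, 2**32 - 1)
    return seed
```

In this formulation a scale of 0 ignores the prompt, a scale of 1 returns the conditional prediction unchanged, and larger values push the output further in the direction the prompt suggests.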

### 💡 Tips:

- Start with default parameters for best results
- Use descriptive prompts for better audio generation
- Video files should be under 100MB for optimal performance
- Generation typically takes 1-3 minutes depending on settings
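
The video tips above can be checked before uploading. The sketch below uses the 100MB guideline and the formats named in the task list (MP4, AVI, MOV). The helper `check_video` is hypothetical, and the format list may not be exhaustive for what the interface actually accepts.

```python
import os

# Formats named in the usage guide; the interface may accept others
SUPPORTED_EXTENSIONS = {".mp4", ".avi", ".mov"}


def check_video(path, max_size_mb=100):
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb >= max_size_mb:
        problems.append(f"file is {size_mb:.1f}MB, expected under {max_size_mb}MB")
    return problems
```

Calling `check_video("clip.mp4")` on a small MP4 returns an empty list. Any returned strings describe what to fix before uploading.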

---

**🔗 Links:**
- [AudioX Paper](https://arxiv.org/abs/2503.10522)
- [Project Page](https://zeyuet.github.io/AudioX/)
- [GitHub Repository](https://github.com/ZeyueT/AudioX)
- [Hugging Face Model](https://huggingface.co/HKUSTAudio/AudioX)

## 🔧 Troubleshooting

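If you encounter issues, run this cell to free GPU and system memory and to check whether the model is loaded. If problems persist, restart the runtime and run the cells again.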

In [None]:
# Clear memory and check whether the model is loaded
import torch
import gc

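# Clear GPU memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    # mem_get_info() returns (free, total) in bytes for the current device
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"🧹 GPU memory cleared. Free: {free_bytes / 1024**3:.1f}GB of {total_bytes / 1024**3:.1f}GB")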

# Clear system memory
gc.collect()
print("🧹 System memory cleared")

# Check if model is loaded
try:
    from stable_audio_tools.interface.gradio import current_model
    if current_model is not None:
        print("✅ Model is loaded and ready")
    else:
        print("⚠️  Model not loaded. Please run the Gradio interface cell again.")
except Exception:
    print("⚠️  Please run all cells in order.")