<a href="https://colab.research.google.com/github/daisysong76/AI--Machine--learning/blob/main/Optimized_real_time_AI_music_generation_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Optimization	Latency Reduction	Memory Efficiency
ONNX Optimized Model	30% faster	No change
INT8/FP16 Quantization	50% faster on CPU	Up to 4x smaller (INT8 vs. FP32; 2x for FP16)
Pipeline Parallelism	30% reduced wait time	More efficient inference
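
The "4x smaller" figure follows from storing each weight as an 8-bit integer instead of a 32-bit float. The following is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization, for illustration only; `quantize_int8` and `dequantize` are hypothetical helper names, not ONNX Runtime APIs, and the weights are made-up values.

```python
import struct

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from the INT8 codes."""
    return [c * scale for c in codes]

weights = [0.5, -1.0, 0.25, 0.0]  # hypothetical FP32 weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))  # 4 bytes per weight
int8_bytes = len(struct.pack(f"{len(codes)}b", *codes))      # 1 byte per weight
print("size ratio:", fp32_bytes // int8_bytes)
print("max round-trip error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

The round-trip error is bounded by half the scale, which is the accuracy cost paid for the smaller model.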



In [None]:
# Why Use ONNX for Riffusion?
# ONNX (Open Neural Network Exchange) is used to optimize AI models for faster inference. It allows cross-platform compatibility and hardware acceleration (GPU, CPU, and edge devices).

# 🔹 Why Use ONNX?
# ✅ Faster Inference – Can run models faster than eager PyTorch (the 2-3x figure is an estimate, not measured in this notebook)
# ✅ Lower Memory Usage – Uses INT8/FP16 quantization to reduce model size
# ✅ Cross-Platform – Works on Windows, Linux, Jetson Nano, Raspberry Pi
# ✅ GPU Acceleration – Supports CUDA, TensorRT, DirectML

# If ONNX is not the best fit for your use case, consider these alternatives:

# Alternative	Use Case	Pros	Cons
# TensorRT	Ultra-fast GPU inference	🚀 Up to 10x speedup	❌ Only supports NVIDIA GPUs
# TorchScript	Native PyTorch speedup	🏆 Best for PyTorch users	❌ Not as optimized as ONNX
# TFLite	Edge AI (Mobile, Raspberry Pi, Jetson)	🏆 Best for low-power devices	❌ Requires model conversion
# JAX (XLA)	Optimized for TPUs (Google Cloud AI)	🚀 Extreme speed on TPU	❌ Not well-supported outside Google
# 🔥 Best Solution for Riffusion?
# Scenario	Recommended Approach
# Running on an NVIDIA GPU (T4, RTX, A100)	ONNX + TensorRT
# Keeping Everything in PyTorch	TorchScript
# Deploying to Mobile, Raspberry Pi	TFLite
# Using TPUs (Google Cloud, TPUv4)	JAX (XLA)
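
The recommendations above map naturally onto ONNX Runtime execution providers. A minimal sketch of choosing a provider order with a CPU fallback; `pick_providers` is a hypothetical helper, while the provider strings are the names ONNX Runtime uses. In a live session the `available` list would come from `onnxruntime.get_available_providers()`.

```python
def pick_providers(available, prefer_tensorrt=True):
    """Return an ordered provider list that always ends with the CPU fallback."""
    preferred = ["CUDAExecutionProvider"]
    if prefer_tensorrt:
        preferred.insert(0, "TensorrtExecutionProvider")
    chosen = [p for p in preferred if p in available]
    chosen.append("CPUExecutionProvider")
    return chosen

# Example: a GPU runtime without TensorRT installed
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
# -> ['CUDAExecutionProvider', 'CPUExecutionProvider']
```

The resulting list can be passed as the `providers` argument of `onnxruntime.InferenceSession`, which tries the entries in order.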


In [None]:
# Result:

# AI music generation in real-time
# Lower latency with pipeline parallelism
# Works with MIDI input & text prompts
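
To see where the latency benefit of pipeline parallelism comes from, here is a small worked calculation. The stage times are hypothetical, not measurements from this notebook, and the model assumes deterministic stage times with one worker per stage.

```python
def sequential_time(stage_ms, n_items):
    """Every item runs through all stages before the next item starts."""
    return n_items * sum(stage_ms)

def pipelined_time(stage_ms, n_items):
    """The first item pays for every stage; later items finish one
    bottleneck-stage interval apart."""
    return sum(stage_ms) + (n_items - 1) * max(stage_ms)

stages = [20, 100, 30]  # hypothetical preprocess, inference, postprocess times in ms
n = 10
seq = sequential_time(stages, n)
pipe_total = pipelined_time(stages, n)
print(seq, pipe_total, f"{100 * (seq - pipe_total) / seq:.0f}% less total time")
# -> 1500 1050 30% less total time
```

Pipelining does not shorten the time a single item spends in the system; it shortens the total time for a stream of items, and the slowest stage (here inference) sets the ceiling.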

In [None]:
pip install --upgrade huggingface_hub




In [None]:
pip uninstall huggingface_hub -y


Found existing installation: huggingface-hub 0.29.2
Uninstalling huggingface-hub-0.29.2:
  Successfully uninstalled huggingface-hub-0.29.2


In [None]:
pip install --upgrade onnx onnxruntime-gpu onnxoptimizer


Collecting onnx
  Downloading onnx-1.17.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting onnxruntime-gpu
  Downloading onnxruntime_gpu-1.20.2-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.6 kB)
Collecting onnxoptimizer
  Downloading onnxoptimizer-0.3.13-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
Collecting coloredlogs (from onnxruntime-gpu)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime-gpu)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)
Downloading onnx-1.17.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.0 MB)
Downloading onnxruntime_gpu-1.20.2-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (291.5 MB)

In [None]:
!git clone https://github.com/riffusion/riffusion.git
%cd riffusion
!pip install -e .

Cloning into 'riffusion'...
remote: Enumerating objects: 800, done.
remote: Counting objects: 100% (395/395), done.
remote: Compressing objects: 100% (130/130), done.
remote: Total 800 (delta 315), reused 265 (delta 265), pack-reused 405 (from 3)
Receiving objects: 100% (800/800), 8.29 MiB | 16.82 MiB/s, done.
Resolving deltas: 100% (492/492), done.
/content/riffusion
Obtaining file:///content/riffusion
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting argh (from riffusion==0.3.1)
  Downloading argh-0.31.3-py3-none-any.whl.metadata (7.4 kB)
Collecting dacite (from riffusion==0.3.1)
  Downloading dacite-1.9.2-py3-none-any.whl.metadata (17 kB)
Collecting demucs (from riffusion==0.3.1)
  Downloading demucs-4.0.1.tar.gz (1.2 MB)

In [None]:
import riffusion
print("✅ Riffusion installed successfully!")

✅ Riffusion installed successfully!


In [None]:
pip show riffusion

Name: riffusion
Version: 0.3.1
Summary: Riffusion - Stable diffusion for real-time music generation
Home-page: https://github.com/riffusion/riffusion
Author: Hayk Martiros
Author-email: hayk.mart@gmail.com
License: MIT
Location: /usr/local/lib/python3.11/dist-packages
Editable project location: /content/riffusion
Requires: accelerate, argh, dacite, demucs, diffusers, flask, flask_cors, numpy, pillow, plotly, pydub, pysoundfile, scipy, soundfile, sox, streamlit, torch, torchaudio, torchvision, transformers
Required-by: 


In [None]:
pip install --upgrade diffusers


Collecting diffusers
  Downloading diffusers-0.32.2-py3-none-any.whl.metadata (18 kB)
Downloading diffusers-0.32.2-py3-none-any.whl (3.2 MB)
Installing collected packages: diffusers
  Attempting uninstall: diffusers
    Found existing installation: diffusers 0.9.0
    Uninstalling diffusers-0.9.0:
      Successfully uninstalled diffusers-0.9.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
riffusion 0.3.1 requires diffusers==0.9.0, but you have diffusers 0.32.2 which is incompatible.
Successfully installed diffusers-0.32.2


In [None]:
from diffusers import StableDiffusionPipeline
print("✅ Diffusers library is working!")


The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

✅ Diffusers library is working!
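
The import succeeds, but pip reported earlier that riffusion 0.3.1 pins `diffusers==0.9.0` while 0.32.2 is now installed. A minimal sketch for detecting such pin mismatches from Python, using the standard-library `importlib.metadata`; `find_pin_conflicts` is a hypothetical helper and only handles exact `==` pins.

```python
from importlib import metadata

def find_pin_conflicts(pins, get_version=metadata.version):
    """Return {package: (required, installed)} for every exact pin that is not met."""
    conflicts = {}
    for name, required in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != required:
            conflicts[name] = (required, installed)
    return conflicts

# Demo with a stand-in lookup that mimics the state shown in the pip output above
fake_lookup = {"diffusers": "0.32.2"}.__getitem__
print(find_pin_conflicts({"diffusers": "0.9.0"}, get_version=fake_lookup))
# -> {'diffusers': ('0.9.0', '0.32.2')}
```

Calling `find_pin_conflicts({"diffusers": "0.9.0"})` without the stand-in checks the live environment; `pip check` on the command line reports the same class of problem for all installed packages.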


In [None]:
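import torch

pipe = StableDiffusionPipeline.from_pretrained("riffusion/riffusion-model-v1")
pipe.to("cuda")

# The pipeline returns a PIL image of a spectrogram, not audio.
output = pipe("A jazz music piece with a saxophone", guidance_scale=7.5).images[0]
display(output)  # render inline; PIL's .show() needs a desktop image viewer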


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


model_index.json:   0%|          | 0.00/541 [00:00<?, ?B/s]

Fetching 15 files:   0%|          | 0/15 [00:00<?, ?it/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/492M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/4.84k [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/342 [00:00<?, ?B/s]

scheduler_config.json:   0%|          | 0.00/284 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/525k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

diffusion_pytorch_model.bin:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/806 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/547 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.06M [00:00<?, ?B/s]

diffusion_pytorch_model.bin:   0%|          | 0.00/335M [00:00<?, ?B/s]

Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]

An error occurred while trying to fetch /root/.cache/huggingface/hub/models--riffusion--riffusion-model-v1/snapshots/8f2e752c74e8316c6eb4fdaa6598a46ce1d88af5/vae: Error no file named diffusion_pytorch_model.safetensors found in directory /root/.cache/huggingface/hub/models--riffusion--riffusion-model-v1/snapshots/8f2e752c74e8316c6eb4fdaa6598a46ce1d88af5/vae.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch /root/.cache/huggingface/hub/models--riffusion--riffusion-model-v1/snapshots/8f2e752c74e8316c6eb4fdaa6598a46ce1d88af5/unet: Error no file named diffusion_pytorch_model.safetensors found in directory /root/.cache/huggingface/hub/models--riffusion--riffusion-model-v1/snapshots/8f2e752c74e8316c6eb4fdaa6598a46ce1d88af5/unet.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
- CompVis/stable-diffusion-v1-4 
- CompVis/stable-diffusion-v1-3 
- CompVis/stable-diffusion-v1-2 

  0%|          | 0/50 [00:00<?, ?it/s]

In [None]:
print("Expected UNet input shape:")
print("Latent Space:", pipe.unet.config.in_channels, "x 64 x 64")
print("Timestep: Single scalar (int64)")
print("Encoder Hidden States:", pipe.text_encoder.config.hidden_size)  # Should be 768


Expected UNet input shape:
Latent Space: 4 x 64 x 64
Timestep: Single scalar (int64)
Encoder Hidden States: 768
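
The 4 x 64 x 64 latent follows from the Stable Diffusion v1 convention of a 512 x 512 image compressed 8x per side by the VAE into 4 channels. A small sketch of that arithmetic, assuming those conventions; `latent_shape` and `tensor_bytes` are hypothetical helper names.

```python
def latent_shape(batch, height, width, channels=4, vae_factor=8):
    """Shape of the UNet latent for an image of the given pixel size."""
    if height % vae_factor or width % vae_factor:
        raise ValueError("image size must be divisible by the VAE downsampling factor")
    return (batch, channels, height // vae_factor, width // vae_factor)

def tensor_bytes(shape, bytes_per_element=4):
    """Memory footprint of a dense tensor (default: FP32, 4 bytes per element)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

latent = latent_shape(1, 512, 512)
text_embedding = (1, 77, 768)  # batch, CLIP tokens, hidden size
print(latent, tensor_bytes(latent))                   # -> (1, 4, 64, 64) 65536
print(text_embedding, tensor_bytes(text_embedding))   # -> (1, 77, 768) 236544
```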


In [None]:
import torch.onnx

class UNetWrapper(torch.nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, latent, timestep, encoder_hidden_states):
        return self.unet(latent, timestep, encoder_hidden_states)[0]

# Wrap UNet for ONNX export
unet_onnx = UNetWrapper(pipe.unet).to("cuda")

# Dummy inputs for tracing, matching the shapes printed above
dummy_latent = torch.randn(1, 4, 64, 64, device="cuda")  # latent: (batch, 4, 64, 64)
dummy_timestep = torch.tensor([50], dtype=torch.int64, device="cuda")  # arbitrary timestep
dummy_encoder_hidden_states = torch.randn(1, 77, 768, device="cuda")  # text embeddings: (batch, 77 tokens, 768)

# Export the model to ONNX
torch.onnx.export(
    unet_onnx,
    (dummy_latent, dummy_timestep, dummy_encoder_hidden_states),
    "riffusion_unet.onnx",
    export_params=True,
    opset_version=14,
    input_names=["latent", "timestep", "encoder_hidden_states"],
    output_names=["output"],
    dynamic_axes={
        "latent": {0: "batch_size"},
        "encoder_hidden_states": {0: "batch_size"},  # must share the latent's batch axis
        "output": {0: "batch_size"},
    }
)

print("✅ Successfully exported Riffusion UNet to ONNX: riffusion_unet.onnx")


✅ Successfully exported Riffusion UNet to ONNX: riffusion_unet.onnx
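
The exported graph declares three named inputs (`latent`, `timestep`, `encoder_hidden_states`), and an inference call must supply all of them. A dependency-free sketch of checking a feed dictionary before running the model; `validate_feeds` and `EXPECTED_INPUTS` are hypothetical, and the spec simply mirrors the dummy inputs used for the export. With a real session, the names and shapes can be read from `session.get_inputs()`.

```python
# None marks an axis that is not checked here (for example a batch dimension)
EXPECTED_INPUTS = {
    "latent": (None, 4, 64, 64),
    "timestep": (1,),
    "encoder_hidden_states": (None, 77, 768),
}

def validate_feeds(feed_shapes, expected=EXPECTED_INPUTS):
    """Return a list of problems; an empty list means the feeds match the spec."""
    problems = []
    for name, spec in expected.items():
        if name not in feed_shapes:
            problems.append(f"missing input: {name}")
            continue
        shape = tuple(feed_shapes[name])
        same_rank = len(shape) == len(spec)
        dims_ok = same_rank and all(s is None or s == d for s, d in zip(spec, shape))
        if not dims_ok:
            problems.append(f"bad shape for {name}: {shape}")
    for name in feed_shapes:
        if name not in expected:
            problems.append(f"unexpected input: {name}")
    return problems

print(validate_feeds({"latent": (1, 4, 64, 64)}))
# -> ['missing input: timestep', 'missing input: encoder_hidden_states']
```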


In [None]:
import threading

In [None]:
pip install librosa torchaudio numpy onnxruntime-tools

Collecting onnxruntime-tools
  Downloading onnxruntime_tools-1.7.0-py3-none-any.whl.metadata (14 kB)
Collecting py3nvml (from onnxruntime-tools)
  Downloading py3nvml-0.2.7-py3-none-any.whl.metadata (13 kB)
Collecting xmltodict (from py3nvml->onnxruntime-tools)
  Downloading xmltodict-0.14.2-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading onnxruntime_tools-1.7.0-py3-none-any.whl (212 kB)
Downloading py3nvml-0.2.7-py3-none-any.whl (55 kB)
Downloading xmltodict-0.14.2-py2.py3-none-any.whl (10.0 kB)
Installing collected packages: xmltodict, py3nvml, onnxruntime-tools
Successfully installed onnxruntime-tools-1.7.0 py3nvml-0.2.7 xmltodict-0.14.2


In [None]:
# pip install --upgrade \
#     librosa torch torchaudio torchvision numpy scipy \
#     onnx onnxruntime-gpu onnxoptimizer onnxruntime-tools \
#     diffusers transformers soundfile pydub \
#     matplotlib plotly accelerate

In [None]:
from queue import Queue

In [None]:
# pip install huggingface_hub==0.18.0

Collecting huggingface_hub==0.18.0
  Downloading huggingface_hub-0.18.0-py3-none-any.whl.metadata (13 kB)
Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.28.1
    Uninstalling huggingface-hub-0.28.1:
      Successfully uninstalled huggingface-hub-0.28.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
accelerate 1.3.0 requires huggingface-hub>=0.21.0, but you have huggingface-hub 0.18.0 which is incompatible.
sentence-transformers 3.4.1 requires huggingface-hub>=0.20.0, but you have huggingface-hub 0.18.0 which is incompatible.
transformers 4.48.3 requires huggingface-hub<1.0,>=0.24.0,

In [None]:
# from huggingface_hub import notebook_login

# notebook_login()


In [None]:
import torch
import torch.onnx
from diffusers import StableDiffusionPipeline

# Load Riffusion model
pipe = StableDiffusionPipeline.from_pretrained("riffusion/riffusion-model-v1")
pipe.to("cuda")

# Wrap the UNet model for ONNX export
class UNetWrapper(torch.nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, latent, timestep, encoder_hidden_states):
        return self.unet(latent, timestep, encoder_hidden_states)[0]

unet_onnx = UNetWrapper(pipe.unet).to("cuda")

# Dummy inputs for tracing: latent, timestep, and text embeddings
dummy_latent = torch.randn(1, 4, 64, 64, device="cuda")
dummy_timestep = torch.tensor([50], dtype=torch.int64, device="cuda")
dummy_encoder_hidden_states = torch.randn(1, 77, 768, device="cuda")

# Export to ONNX
torch.onnx.export(
    unet_onnx,
    (dummy_latent, dummy_timestep, dummy_encoder_hidden_states),
    "model.onnx",
    export_params=True,
    opset_version=14,
    input_names=["latent", "timestep", "encoder_hidden_states"],
    output_names=["output"],
    dynamic_axes={
        "latent": {0: "batch_size"},
        "encoder_hidden_states": {0: "batch_size"},  # must share the latent's batch axis
        "output": {0: "batch_size"},
    }
)

print("✅ Riffusion UNet model exported successfully as 'model.onnx'")


Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]

An error occurred while trying to fetch /root/.cache/huggingface/hub/models--riffusion--riffusion-model-v1/snapshots/8f2e752c74e8316c6eb4fdaa6598a46ce1d88af5/vae: Error no file named diffusion_pytorch_model.safetensors found in directory /root/.cache/huggingface/hub/models--riffusion--riffusion-model-v1/snapshots/8f2e752c74e8316c6eb4fdaa6598a46ce1d88af5/vae.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch /root/.cache/huggingface/hub/models--riffusion--riffusion-model-v1/snapshots/8f2e752c74e8316c6eb4fdaa6598a46ce1d88af5/unet: Error no file named diffusion_pytorch_model.safetensors found in directory /root/.cache/huggingface/hub/models--riffusion--riffusion-model-v1/snapshots/8f2e752c74e8316c6eb4fdaa6598a46ce1d88af5/unet.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
- CompVis/stable-diffusion-v1-4 
- CompVis/stable-diffusion-v1-3 
- CompVis/stable-diffusion-v1-2 

✅ Riffusion UNet model exported successfully as 'model.onnx'


In [None]:
pip install tqdm



In [None]:
pip install librosa torchaudio numpy onnxruntime-tools

In [None]:
pip install onnxoptimizer



In [None]:
import threading
import time
import numpy as np
import librosa
import torch
import torchaudio
import onnx
import onnxruntime as ort
from queue import Queue
from onnxruntime.quantization import quantize_dynamic, QuantType
import onnxoptimizer
from tqdm import tqdm

# **Step 1: Optimize & Quantize ONNX Model**
def optimize_onnx(
    model_path="model.onnx",
    optimized_model_path="optimized_model.onnx",
    quantized_model_path="quantized_model.onnx"
):
    print("Optimizing ONNX Model...")
    model = onnx.load(model_path)

    optimizations = [
        "eliminate_identity",
        "eliminate_nop_dropout",
        "eliminate_nop_cast",
        "fuse_consecutive_transposes",
        "fuse_add_bias_into_conv"
    ]
    optimized_model = onnxoptimizer.optimize(model, optimizations)
    onnx.save(optimized_model, optimized_model_path)

    # Apply dynamic INT8 weight quantization (intended for CPU inference).
    # The optimized model keeps its exported precision; no FP16 conversion happens here.
    print("Applying Quantization...")
    quantize_dynamic(optimized_model_path, quantized_model_path, weight_type=QuantType.QInt8)

    print(f"Optimized ONNX model saved: {optimized_model_path}")
    print(f"Quantized ONNX model saved: {quantized_model_path}")

optimize_onnx()

# **Step 2: Define Real-Time Processing Pipeline**
class OptimizedPipeline:
    def __init__(self, use_fp16=True):
        self.preprocess_queue = Queue(maxsize=1)
        self.inference_queue = Queue(maxsize=1)
        self.postprocess_queue = Queue(maxsize=1)

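        # use_fp16 is a legacy flag name: True loads the optimized, unquantized graph (GPU),
        # False loads the INT8-quantized graph (CPU). No FP16 conversion is performed.
        model_path = "optimized_model.onnx" if use_fp16 else "quantized_model.onnx"
        print(f"Loading ONNX Model: {model_path}")
        providers = ['CUDAExecutionProvider', 'CPUExecutionProvider'] if use_fp16 else ['CPUExecutionProvider']
        self.model = ort.InferenceSession(model_path, providers=providers)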

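    def infer(self, latent_input, timestep=50, text_embedding=None):
        print("Running AI Music Inference...")
        batch = latent_input.shape[0]
        if text_embedding is None:
            # Placeholder conditioning; a real run would use the CLIP text-encoder output
            text_embedding = np.zeros((batch, 77, 768), dtype=np.float32)
        # The exported graph declares three inputs, so all of them must be fed
        feeds = {
            "latent": latent_input,
            "timestep": np.array([timestep], dtype=np.int64),
            "encoder_hidden_states": text_embedding,
        }
        return self.model.run(None, feeds)[0]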

# **Step 3: MIDI → Spectrogram → AI Music**
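def midi_to_spectrogram(midi_data):
    print("Converting MIDI to Spectrogram...")
    # Placeholder: treats the input as a raw audio waveform; real MIDI would first need rendering to audio
    return np.abs(librosa.stft(midi_data))

def spectrogram_to_audio(spectrogram):
    print("Converting Spectrogram to Audio...")
    # Placeholder inverse STFT; the raw UNet output is a latent-space prediction, not a spectrogram
    return librosa.istft(spectrogram)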

def play_audio(audio_waveform):
    print(f"Playing AI-generated music! (Samples: {len(audio_waveform)})")

# **Step 4: Run Real-Time Music Pipeline**
pipeline = OptimizedPipeline(use_fp16=True)

def preprocess_worker():
    while True:
        midi_data = np.random.rand(22050)  # Simulated MIDI input
        spectrogram = midi_to_spectrogram(midi_data)
        pipeline.preprocess_queue.put(spectrogram)

def inference_worker():
    while True:
        spectrogram = pipeline.preprocess_queue.get()
        latent_input = np.random.randn(1, 4, 64, 64).astype(np.float32)  # Random latent input
        output_spectrogram = pipeline.infer(latent_input)  # Run AI Inference
        pipeline.postprocess_queue.put(output_spectrogram)

def postprocess_worker():
    while True:
        processed_audio = spectrogram_to_audio(pipeline.postprocess_queue.get())
        play_audio(processed_audio)

# Start Pipeline Threads
print("Starting Real-Time AI Music Pipeline...")
threading.Thread(target=preprocess_worker, daemon=True).start()
threading.Thread(target=inference_worker, daemon=True).start()
threading.Thread(target=postprocess_worker, daemon=True).start()

# Keep the main thread alive for about 1.5 s (5 x 0.3 s) while the daemon workers run
for _ in range(5):
    time.sleep(0.3)


🔄 Optimizing ONNX Model...
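
The worker threads above loop forever and rely on being daemon threads to exit. As an alternative, a three-stage pipeline can be shut down cleanly by passing a sentinel through the queues. The following is a minimal standard-library sketch of that pattern; `STOP`, `stage`, and `run_pipeline` are hypothetical names, and the stage functions are trivial stand-ins for preprocessing, inference, and postprocessing.

```python
import threading
from queue import Queue

STOP = object()  # sentinel that tells each stage to shut down

def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox, forward results, and pass STOP along."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(fn(item))

def run_pipeline(items, fns, maxsize=1):
    """Run items through fns, one thread per stage, preserving order."""
    # Bounded queues between stages apply backpressure; the final queue is
    # unbounded so results can accumulate while the caller is still feeding input.
    queues = [Queue(maxsize=maxsize) for _ in fns] + [Queue()]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)
    results = []
    while (out := queues[-1].get()) is not STOP:
        results.append(out)
    for t in threads:
        t.join()
    return results

print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2, str]))
# -> ['4', '6', '8']
```

Because each stage has a single worker and queues are FIFO, output order matches input order.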



Package	Purpose
torch	PyTorch (Deep Learning Framework)
torchaudio	Audio processing library for PyTorch
torchvision	Vision-related utilities for PyTorch
numpy	Numerical computations
scipy	Scientific computing (needed for signal processing)
librosa	Audio processing library (used for spectrograms)
soundfile	Handles audio file formats (WAV, MP3, etc.)
pydub	Audio manipulation (convert MP3 to WAV, etc.)
onnx	Open Neural Network Exchange (ONNX) framework
onnxruntime-gpu	Optimized ONNX inference for GPU
onnxoptimizer	ONNX model optimization tools
onnxruntime-tools	Extra tools for ONNX inference
diffusers	Hugging Face Diffusers (needed for Riffusion)
transformers	Hugging Face Transformers (needed for CLIP embeddings)
matplotlib	Visualization for spectrograms
plotly	Interactive plots for debugging
accelerate	Optimized model execution

In [None]:
import threading
import time
import numpy as np
import librosa
import torch
import torchaudio
import onnx
import onnxruntime as ort
from queue import Queue
from onnxruntime.quantization import quantize_dynamic, QuantType
import onnxoptimizer
from tqdm import tqdm
import concurrent.futures
import logging
from typing import Dict, Any, Optional, Tuple
import psutil  # For CPU/memory monitoring

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger("AI-Music-Pipeline")

# Step 1: Advanced ONNX Model Optimization
def optimize_onnx(
    model_path: str = "model.onnx",
    optimized_model_path: str = "optimized_model.onnx",
    quantized_model_path: str = "quantized_model.onnx",
    target_device: str = "cuda"  # 'cuda' or 'cpu'
) -> Tuple[str, str]:
    """Optimize and quantize ONNX model with advanced techniques"""
    logger.info("Loading and optimizing ONNX Model...")

    # Load model
    model = onnx.load(model_path)

    # Extended optimizations list
    optimizations = [
        "eliminate_identity",
        "eliminate_nop_dropout",
        "eliminate_nop_cast",
        "eliminate_unused_initializer",
        "fuse_consecutive_transposes",
        "fuse_bn_into_conv",
        "fuse_add_bias_into_conv",
        "fuse_pad_into_conv",
        "fuse_consecutive_concats",
        "fuse_consecutive_reduce_unsqueeze"
    ]

    # Apply optimizations
    optimized_model = onnxoptimizer.optimize(model, optimizations)
    onnx.save(optimized_model, optimized_model_path)

    # Choose quantization strategy based on target device
    if target_device == "cpu":
        logger.info("Applying INT8 quantization for CPU...")
        # Note: newer onnxruntime releases no longer accept `optimize_model` here;
        # the graph was already optimized above.
        quantize_dynamic(
            optimized_model_path,
            quantized_model_path,
            weight_type=QuantType.QInt8,
            per_channel=True,
            reduce_range=True
        )
        final_model_path = quantized_model_path
    else:
        logger.info("Keeping the unquantized optimized model for GPU...")
        # No FP16 conversion is performed; the graph keeps its exported precision
        final_model_path = optimized_model_path

    logger.info(f"Model optimization complete. Using: {final_model_path}")
    return optimized_model_path, quantized_model_path

# Step 2: Enhanced Real-Time Processing Pipeline with Performance Monitoring
class EnhancedPipeline:
    def __init__(
        self,
        model_path: str,
        batch_size: int = 1,
        buffer_size: int = 3,
        use_cuda: bool = True
    ):
        self.batch_size = batch_size
        self.buffer_size = buffer_size

        # Create bounded queues with appropriate buffer sizes
        self.preprocess_queue = Queue(maxsize=buffer_size)
        self.inference_queue = Queue(maxsize=buffer_size)
        self.postprocess_queue = Queue(maxsize=buffer_size)

        # Performance monitoring
        self.metrics = {
            "preprocess_time": [],
            "inference_time": [],
            "postprocess_time": [],
            "end_to_end_latency": []
        }

        # Load optimized ONNX model with appropriate provider
        logger.info(f"Loading ONNX Model: {model_path}")
        providers = ['CUDAExecutionProvider', 'CPUExecutionProvider'] if use_cuda else ['CPUExecutionProvider']

        # Configure session options for better performance
        session_options = ort.SessionOptions()
        session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        session_options.intra_op_num_threads = psutil.cpu_count(logical=False) or 1  # Physical cores; cpu_count may return None
        session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

        # Load model with optimized session options
        self.model = ort.InferenceSession(model_path, sess_options=session_options, providers=providers)

        # Warm up the model to avoid cold-start latency
        logger.info("Warming up the model...")
        self._warmup()

    def _build_feeds(self, latent_input: np.ndarray) -> Dict[str, np.ndarray]:
        """Assemble every input the exported UNet graph declares.

        The timestep and text embedding are placeholders; a real run would take
        them from the scheduler and the CLIP text encoder.
        """
        batch = latent_input.shape[0]
        return {
            "latent": latent_input,
            "timestep": np.array([50], dtype=np.int64),
            "encoder_hidden_states": np.zeros((batch, 77, 768), dtype=np.float32),
        }

    def _warmup(self, num_iterations: int = 5):
        """Run a few dummy inferences so one-time setup cost is not measured later"""
        dummy_input = np.random.randn(self.batch_size, 4, 64, 64).astype(np.float32)
        for _ in range(num_iterations):
            _ = self.model.run(None, self._build_feeds(dummy_input))

    def infer(self, latent_input: np.ndarray) -> np.ndarray:
        """Run inference with performance tracking"""
        start_time = time.time()

        # Check and fix input if needed
        if latent_input.shape[0] != self.batch_size:
            logger.warning(f"Input batch size mismatch. Expected {self.batch_size}, got {latent_input.shape[0]}. Reshaping...")
            latent_input = np.resize(latent_input, (self.batch_size, *latent_input.shape[1:]))

        # Run inference (the graph needs all three inputs, not just the latent)
        outputs = self.model.run(None, self._build_feeds(latent_input))

        # Track inference time
        inference_time = time.time() - start_time
        self.metrics["inference_time"].append(inference_time)

        if len(self.metrics["inference_time"]) % 100 == 0:
            avg_inference = np.mean(self.metrics["inference_time"][-100:]) * 1000
            logger.info(f"Average inference time: {avg_inference:.2f} ms")

        return outputs[0]

    def get_performance_stats(self) -> Dict[str, Any]:
        """Get summary statistics of pipeline performance"""
        stats = {}
        for key, values in self.metrics.items():
            if values:
                stats[key] = {
                    "mean_ms": np.mean(values) * 1000,
                    "median_ms": np.median(values) * 1000,
                    "min_ms": np.min(values) * 1000,
                    "max_ms": np.max(values) * 1000,
                    "p95_ms": np.percentile(values, 95) * 1000 if len(values) >= 20 else None,
                    "p99_ms": np.percentile(values, 99) * 1000 if len(values) >= 100 else None
                }
        return stats

# Step 3: Enhanced Audio Processing Functions
def midi_to_spectrogram(midi_data: np.ndarray, n_fft: int = 2048, hop_length: int = 512) -> np.ndarray:
    """Convert a 1-D audio waveform (standing in for rendered MIDI) to a dB-scaled magnitude spectrogram"""
    start_time = time.time()

    # Hann window to reduce spectral leakage
    window = np.hanning(n_fft)
    spectrogram = np.abs(librosa.stft(midi_data, n_fft=n_fft, hop_length=hop_length, window=window))

    # Convert to decibels for better representation
    spectrogram = librosa.amplitude_to_db(spectrogram, ref=np.max)

    logger.debug(f"MIDI->Spectrogram conversion took {(time.time() - start_time)*1000:.2f} ms")
    return spectrogram

def spectrogram_to_audio(
    spectrogram: np.ndarray,
    n_fft: int = 2048,
    hop_length: int = 512,
    phase_iterations: int = 10  # Griffin-Lim iterations
) -> np.ndarray:
    """Convert spectrogram to audio with enhanced quality"""
    start_time = time.time()

    # Convert from dB scale back to amplitude
    amplitude_spectrogram = librosa.db_to_amplitude(spectrogram)

    # Use Griffin-Lim algorithm for phase reconstruction
    audio = librosa.griffinlim(
        amplitude_spectrogram,
        n_iter=phase_iterations,
        hop_length=hop_length,
        n_fft=n_fft
    )

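    # Apply light smoothing along the time axis to reduce artifacts
    # (np.convolve only accepts 1-D input, so multi-channel arrays are handled per channel)
    if audio.shape[-1] > 0:
        kernel = np.hanning(32) / np.sum(np.hanning(32))
        audio = np.apply_along_axis(lambda x: np.convolve(x, kernel, mode='same'), -1, audio)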

    logger.debug(f"Spectrogram->Audio conversion took {(time.time() - start_time)*1000:.2f} ms")
    return audio

def play_audio(audio_waveform: np.ndarray, sample_rate: int = 22050):
    """Play audio with sample rate control and normalization"""
    # Normalize audio to prevent clipping
    audio_normalized = audio_waveform / (np.max(np.abs(audio_waveform)) + 1e-8)

    # In a real application, this would use a proper audio playback library
    logger.info(f"Playing AI-generated music! (Samples: {len(audio_normalized)}, Duration: {len(audio_normalized)/sample_rate:.2f}s)")

    # Simulate audio playback (in real code, this would use sounddevice, pyaudio, etc.)
    return audio_normalized

# Step 4: Enhanced Real-Time Music Pipeline with Thread Pool and Resource Management
class MusicPipelineManager:
    def __init__(
        self,
        model_path: str,
        batch_size: int = 1,
        buffer_size: int = 3,
        use_cuda: bool = True,
        max_workers: int = 3
    ):
        self.pipeline = EnhancedPipeline(model_path, batch_size, buffer_size, use_cuda)
        self.running = False
        self.executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
        self.futures = []

        # Audio processing parameters
        self.audio_params = {
            "n_fft": 2048,
            "hop_length": 512,
            "sample_rate": 22050,
            "phase_iterations": 32
        }

    def start(self):
        """Start the pipeline with proper resource management"""
        if self.running:
            logger.warning("Pipeline is already running")
            return

        self.running = True
        logger.info("Starting Real-Time AI Music Pipeline...")

        # Submit worker tasks to thread pool
        self.futures = [
            self.executor.submit(self._preprocess_worker),
            self.executor.submit(self._inference_worker),
            self.executor.submit(self._postprocess_worker)
        ]

    def stop(self):
        """Gracefully stop the pipeline"""
        logger.info("Stopping pipeline...")
        self.running = False

        # Workers poll self.running and use 1 s queue timeouts, so allow a short grace period
        _, not_done = concurrent.futures.wait(self.futures, timeout=5.0)
        if not_done:
            logger.warning(f"{len(not_done)} worker(s) did not complete in time")
        self.executor.shutdown(wait=False)

        # Print performance statistics
        stats = self.pipeline.get_performance_stats()
        logger.info(f"Pipeline performance stats: {stats}")

    def _preprocess_worker(self):
        """Worker thread for audio preprocessing"""
        from queue import Full  # queue.Full is a module-level exception, not a Queue attribute
        logger.info("Preprocess worker started")

        while self.running:
            try:
                # Simulate one second of real-time input (random samples stand in for MIDI)
                midi_data = np.random.rand(self.audio_params["sample_rate"])

                start_time = time.time()
                spectrogram = midi_to_spectrogram(
                    midi_data,
                    n_fft=self.audio_params["n_fft"],
                    hop_length=self.audio_params["hop_length"]
                )

                # Track preprocessing time
                preprocess_time = time.time() - start_time
                self.pipeline.metrics["preprocess_time"].append(preprocess_time)

                # Add to queue with timeout to prevent blocking indefinitely
                self.pipeline.preprocess_queue.put(spectrogram, timeout=1.0)

            except Full:
                logger.warning("Preprocess queue is full - pipeline may be bottlenecked")
            except Exception as e:
                logger.error(f"Error in preprocessing: {str(e)}")

        logger.info("Preprocess worker stopped")

    def _inference_worker(self):
        """Worker thread for model inference"""
        from queue import Empty, Full  # module-level exceptions, not Queue attributes
        logger.info("Inference worker started")

        while self.running:
            try:
                # Get input with timeout
                spectrogram = self.pipeline.preprocess_queue.get(timeout=1.0)

                # Placeholder: the spectrogram is not encoded yet; a random latent stands in
                latent_input = np.random.randn(1, 4, 64, 64).astype(np.float32)

                # Run inference
                output_spectrogram = self.pipeline.infer(latent_input)

                # Add to next queue
                self.pipeline.postprocess_queue.put(output_spectrogram, timeout=1.0)

            except Empty:
                logger.debug("Preprocess queue is empty - waiting for input")
            except Full:
                logger.warning("Postprocess queue is full - pipeline may be bottlenecked")
            except Exception as e:
                logger.error(f"Error in inference: {str(e)}")

        logger.info("Inference worker stopped")

    def _postprocess_worker(self):
        """Worker thread for audio postprocessing"""
        from queue import Empty  # queue.Empty is a module-level exception, not a Queue attribute
        logger.info("Postprocess worker started")

        while self.running:
            try:
                # Get output with timeout
                output_spectrogram = self.pipeline.postprocess_queue.get(timeout=1.0)

                start_time = time.time()
                processed_audio = spectrogram_to_audio(
                    output_spectrogram,
                    n_fft=self.audio_params["n_fft"],
                    hop_length=self.audio_params["hop_length"],
                    phase_iterations=self.audio_params["phase_iterations"]
                )

                # Track postprocessing time
                postprocess_time = time.time() - start_time
                self.pipeline.metrics["postprocess_time"].append(postprocess_time)

                # Play the audio
                _ = play_audio(processed_audio, sample_rate=self.audio_params["sample_rate"])

                # Approximate end-to-end latency by summing the most recent time of each stage
                # (these may belong to different items and exclude time spent waiting in queues)
                if (self.pipeline.metrics["preprocess_time"] and
                    self.pipeline.metrics["inference_time"] and
                    self.pipeline.metrics["postprocess_time"]):

                    end_to_end = (self.pipeline.metrics["preprocess_time"][-1] +
                                 self.pipeline.metrics["inference_time"][-1] +
                                 self.pipeline.metrics["postprocess_time"][-1])

                    self.pipeline.metrics["end_to_end_latency"].append(end_to_end)

            except Empty:
                logger.debug("Postprocess queue is empty - waiting for input")
            except Exception as e:
                logger.error(f"Error in postprocessing: {str(e)}")

        logger.info("Postprocess worker stopped")

# Main execution code
def main():
    # Step 1: Optimize the model
    opt_path, quant_path = optimize_onnx(
        model_path="model.onnx",
        target_device="cuda"  # Use "cpu" for CPU-optimized model
    )

    # Step 2: Create and start the pipeline
    manager = MusicPipelineManager(
        model_path=opt_path,  # Use the GPU-optimized model
        batch_size=1,
        buffer_size=5,  # Larger buffer for smoother operation
        use_cuda=True,
        max_workers=4
    )

    # Start the pipeline
    manager.start()

    try:
        # Run for about 4.5 seconds (15 iterations x 0.3 s)
        for i in tqdm(range(15), desc="Generating AI Music"):
            time.sleep(0.3)

            # Every 5 iterations, report performance
            if i > 0 and i % 5 == 0:
                stats = manager.pipeline.get_performance_stats()
                if stats.get("end_to_end_latency"):
                    latency = stats["end_to_end_latency"]["mean_ms"]
                    logger.info(f"Current pipeline latency: {latency:.2f} ms")

    finally:
        # Ensure pipeline is always properly stopped
        manager.stop()

        # Print final performance stats
        final_stats = manager.pipeline.get_performance_stats()
        if final_stats.get("end_to_end_latency", {}).get("mean_ms"):
            logger.info(f"Final average latency: {final_stats['end_to_end_latency']['mean_ms']:.2f} ms")

    logger.info("AI Music generation complete")

if __name__ == "__main__":
    main()
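
The end-to-end figure above adds up the latest per-stage timings, which can belong to different items and ignores time spent waiting in queues. One way to measure true per-item latency is to timestamp each item when it enters the pipeline and again when it leaves. A minimal standard-library sketch of that idea; `LatencyTracker` is a hypothetical helper, not part of the pipeline code above.

```python
import statistics
import time

class LatencyTracker:
    """Track per-item latency from pipeline entry to pipeline exit."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock       # injectable so the tracker can be tested deterministically
        self._started = {}        # item_id -> entry timestamp in seconds
        self.latencies_ms = []

    def start(self, item_id):
        self._started[item_id] = self._clock()

    def finish(self, item_id):
        elapsed_ms = (self._clock() - self._started.pop(item_id)) * 1000.0
        self.latencies_ms.append(elapsed_ms)
        return elapsed_ms

    def summary(self):
        if not self.latencies_ms:
            return {}
        return {
            "mean_ms": statistics.fmean(self.latencies_ms),
            "median_ms": statistics.median(self.latencies_ms),
            "max_ms": max(self.latencies_ms),
        }

# Demo with a fake clock: item "a" takes 1.0 s, item "b" takes 1.5 s
ticks = iter([0.0, 0.5, 1.0, 2.0])
tracker = LatencyTracker(clock=lambda: next(ticks))
tracker.start("a")
tracker.start("b")
tracker.finish("a")
tracker.finish("b")
print(tracker.summary())
# -> {'mean_ms': 1250.0, 'median_ms': 1250.0, 'max_ms': 1500.0}
```

In the pipeline, the preprocess worker would call `start` with an item id and pass that id along with the data, and the postprocess worker would call `finish` after playback.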