
Support proposal: NVIDIA's Music Flamingo (AudioFlamingo3ForConditionalGeneration) #118

@dorpxam

Description


Model Support Request: Music Flamingo

  • HF Link: nvidia/music-flamingo-hf
  • Architecture: AudioFlamingo3ForConditionalGeneration
  • Backbone Components:
    • Audio Tower: Whisper-v3 Large (32 layers, 128 mel bins)
    • Language Model: Qwen2 (28 layers)
    • Projector: Two-layer MLP (Linear + GELU + Linear)
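The projector can be sketched in plain numpy: it maps the 1280-dim Whisper features to the 3584-dim Qwen2 hidden size (shapes match the tensor dump further down). This is a minimal sketch, assuming the usual tanh-approximated GELU; the random weights are placeholders, not checkpoint values.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, common in transformer stacks (assumption here)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def project_audio_features(feats, w0, b0, w1, b1):
    """Two-layer MLP: Linear(1280 -> 3584) + GELU + Linear(3584 -> 3584)."""
    h = gelu(feats @ w0.T + b0)
    return h @ w1.T + b1

# Toy weights with the checkpoint's shapes (fc0: 3584x1280, fc1: 3584x3584)
rng = np.random.default_rng(0)
w0 = rng.standard_normal((3584, 1280)) * 0.02
b0 = np.zeros(3584)
w1 = rng.standard_normal((3584, 3584)) * 0.02
b1 = np.zeros(3584)

feats = rng.standard_normal((750, 1280))  # hypothetical batch of audio frames
out = project_audio_features(feats, w0, b0, w1, b1)
print(out.shape)  # (750, 3584)
```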

Potential Support: Audio Flamingo

Not tested, but the architecture is potentially strictly similar, so Audio Flamingo may work with the same converter.

After digging into the details, the LoRA weights can be applied to the model during conversion. Wonderful!
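Merging LoRA weights at convert time amounts to folding each low-rank pair back into its base matrix. A minimal sketch with hypothetical rank/alpha values (not taken from the checkpoint):

```python
import numpy as np

def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Fold a LoRA pair into the base weight: W' = W + (alpha / rank) * B @ A."""
    return w + (alpha / rank) * (lora_b @ lora_a)

rng = np.random.default_rng(0)
w = rng.standard_normal((3584, 3584))     # base projection weight
rank, alpha = 16, 32                      # hypothetical adapter hyperparameters
lora_a = rng.standard_normal((rank, 3584))  # down-projection (A)
lora_b = rng.standard_normal((3584, rank))  # up-projection (B)

merged = merge_lora(w, lora_a, lora_b, alpha, rank)
```

Once merged, the converter only has to dump `merged` under the base tensor name; no LoRA tensors survive in the GGML file.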

Technical Configuration Mapping

| Parameter | Value | Note |
|---|---|---|
| n_audio_layers | 32 | Whisper-v3 architecture |
| n_audio_mel_bins | 128 | Critical: differs from Whisper-v1/v2 (80) |
| n_text_layers | 28 | Qwen2 standard |
| n_vocab | 151669 | Qwen2 tokenizer (`<sound>`) |
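These values end up in the packed int32 header written by dump_config. A minimal sketch of that serialization (the field order follows this proposal, not a confirmed on-disk format):

```python
import struct

# Proposed extension header for the Audio Flamingo 3 converter (field order
# is this proposal's, not a confirmed format)
config_values = [
    32,      # n_audio_layers (Whisper-Large-v3)
    1280,    # d_model_audio
    20,      # n_audio_heads
    2,       # n_projector_layers
    128,     # n_mels
    151669,  # audio_token_id (<sound>)
]
header = struct.pack("i" * len(config_values), *config_values)
print(len(header))  # 24 on platforms with a 4-byte C int
```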

For integration into `convert.py`:

```python
class AudioFlamingo3ForConditionalGeneration(BaseConverter):
    MODEL_TYPE = ModelType.AudioFlamingo

    @classmethod
    def state_dict_pp(cls, config, state_dict):
        new_dict = {}
        for name, tensor in state_dict.items():
            new_name = name

            # 1. Audio Tower (layers 0-31)
            if name.startswith('audio_tower.'):
                new_name = name.replace('audio_tower.', 'audio.')

            # 2. Multimodal Projector
            elif name.startswith('multi_modal_projector.'):
                if '.linear_1.' in name:
                    new_name = name.replace('.linear_1.', '.fc0.')
                elif '.linear_2.' in name:
                    new_name = name.replace('.linear_2.', '.fc1.')

            # 3. Language Model (Qwen2, 28 layers)
            elif name.startswith('language_model.'):
                new_name = name.replace('language_model.', '')
                if 'lm_head' in new_name:
                    new_name = new_name.replace('model.', '')

            new_dict[new_name] = tensor
        return new_dict

    @staticmethod
    def dump_config(f, config, ggml_type):
        # 1. Textual Backbone Configuration (Qwen2)
        QWen2Converter.dump_config(f, AttributeDict(config.text_config), ggml_type)

        # 2. Audio Flamingo 3 Extension (AF-Whisper Specifications)
        audio_cfg = config.audio_config
        config_values = [
            32,                                         # n_audio_layers (Whisper-Large-v3)
            audio_cfg.get('hidden_size', 1280),         # d_model_audio
            audio_cfg.get('num_attention_heads', 20),
            2,                                          # n_projector_layers (MLP linear_1/linear_2)
            128,                                        # n_mels (standard for Whisper-v3)
            151669                                      # audio_token_id (<sound>)
        ]
        f.write(struct.pack("i" * len(config_values), *config_values))

    @staticmethod
    def get_weight_names(config):
        # Start with the Qwen2 backbone (28 layers)
        weight_names = QWen2Converter.get_weight_names(AttributeDict(config.text_config))

        # Audio Tower block (32 layers: 0 to 31)
        for i in range(32):
            weight_names += [
                f"audio.layers.{i}.fc1.bias",
                f"audio.layers.{i}.fc1.weight",
                f"audio.layers.{i}.fc2.bias",
                f"audio.layers.{i}.fc2.weight",
                f"audio.layers.{i}.final_layer_norm.bias",
                f"audio.layers.{i}.final_layer_norm.weight",
                f"audio.layers.{i}.self_attn.k_proj.weight",   # k_proj has no bias in Whisper
                f"audio.layers.{i}.self_attn.out_proj.bias",
                f"audio.layers.{i}.self_attn.out_proj.weight",
                f"audio.layers.{i}.self_attn.q_proj.bias",
                f"audio.layers.{i}.self_attn.q_proj.weight",
                f"audio.layers.{i}.self_attn.v_proj.bias",
                f"audio.layers.{i}.self_attn.v_proj.weight",
                f"audio.layers.{i}.self_attn_layer_norm.bias",
                f"audio.layers.{i}.self_attn_layer_norm.weight",
            ]

        # Root parts and projector
        weight_names += [
            "audio.conv1.bias",
            "audio.conv1.weight",
            "audio.conv2.bias",
            "audio.conv2.weight",
            "audio.embed_positions.weight",
            "audio.layer_norm.bias",
            "audio.layer_norm.weight",
            "multi_modal_projector.fc0.bias",
            "multi_modal_projector.fc0.weight",
            "multi_modal_projector.fc1.bias",
            "multi_modal_projector.fc1.weight",
        ]

        return weight_names
```
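The renaming rules in state_dict_pp can be exercised standalone on a few representative tensor names (a self-contained sketch of the same logic, with hypothetical example names in the HF layout):

```python
def rename_tensor(name):
    """Standalone version of the renaming rules in state_dict_pp above."""
    if name.startswith('audio_tower.'):
        return name.replace('audio_tower.', 'audio.')
    if name.startswith('multi_modal_projector.'):
        return (name.replace('.linear_1.', '.fc0.')
                    .replace('.linear_2.', '.fc1.'))
    if name.startswith('language_model.'):
        name = name.replace('language_model.', '')
        if 'lm_head' in name:
            name = name.replace('model.', '')
        return name
    return name

# Hypothetical HF-style names -> expected GGML-side names
examples = {
    'audio_tower.layers.0.fc1.weight':     'audio.layers.0.fc1.weight',
    'multi_modal_projector.linear_1.bias': 'multi_modal_projector.fc0.bias',
    'language_model.model.norm.weight':    'model.norm.weight',
    'language_model.lm_head.weight':       'lm_head.weight',
}
for src, expected in examples.items():
    assert rename_tensor(src) == expected
```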

Validation with a fake model ID

```
Loading vocab file C:\models\music-flamingo-hf
vocab_size  151672
loading C:\models\music-flamingo-hf\model-00001-of-00004.safetensors ...
Dumping ... |████████████████████████████████████████████████████████████| 100.0% (555/556) 0.01s/it rem: 0.00s
loading C:\models\music-flamingo-hf\model-00002-of-00004.safetensors ...
Dumping ... |████████████████████████████████████████████████████████████| 100.0% (130/131) 0.11s/it rem: 0.00s
loading C:\models\music-flamingo-hf\model-00003-of-00004.safetensors ...
Dumping ... |████████████████████████████████████████████████████████████| 100.0% (121/122) 0.06s/it rem: 0.00s
loading C:\models\music-flamingo-hf\model-00004-of-00004.safetensors ...
Dumping ... |████████████████████████████████████████████████████████████| 100.0% (20/21) 0.16s/it rem: 0.00s
+-------------------------------------------------+-----------------------------+-------+
| name                                            | shape                       | dtype |
+-------------------------------------------------+-----------------------------+-------+
| audio.conv1.bias                                | torch.Size([1280])          | F32   |
| audio.conv1.weight                              | torch.Size([1280, 128, 3])  | F16   |
| audio.conv2.bias                                | torch.Size([1280])          | F32   |
| audio.conv2.weight                              | torch.Size([1280, 1280, 3]) | F16   |
| audio.embed_positions.weight                    | torch.Size([1500, 1280])    | Q8_0  |
| audio.layer_norm.bias                           | torch.Size([1280])          | F32   |
| audio.layer_norm.weight                         | torch.Size([1280])          | F32   |
| audio.layers.0.fc1.bias                         | torch.Size([5120])          | F32   |
| audio.layers.0.fc1.weight                       | torch.Size([5120, 1280])    | Q8_0  |
| audio.layers.0.fc2.bias                         | torch.Size([1280])          | F32   |
| audio.layers.0.fc2.weight                       | torch.Size([1280, 5120])    | Q8_0  |
| audio.layers.0.final_layer_norm.bias            | torch.Size([1280])          | F32   |
| audio.layers.0.final_layer_norm.weight          | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn.k_proj.weight          | torch.Size([1280, 1280])    | Q8_0  |
| audio.layers.0.self_attn.out_proj.bias          | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn.out_proj.weight        | torch.Size([1280, 1280])    | Q8_0  |
| audio.layers.0.self_attn.q_proj.bias            | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn.q_proj.weight          | torch.Size([1280, 1280])    | Q8_0  |
| audio.layers.0.self_attn.v_proj.bias            | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn.v_proj.weight          | torch.Size([1280, 1280])    | Q8_0  |
| audio.layers.0.self_attn_layer_norm.bias        | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn_layer_norm.weight      | torch.Size([1280])          | F32   |
| audio.layers.1.fc1.bias                         | torch.Size([5120])          | F32   |
| audio.layers.1.fc1.weight                       | torch.Size([5120, 1280])    | Q8_0  |
| audio.layers.1.fc2.bias                         | torch.Size([1280])          | F32   |
| audio.layers.1.fc2.weight                       | torch.Size([1280, 5120])    | Q8_0  |
| ...                                             | ...                         | ...   |
| model.norm.weight                               | torch.Size([3584])          | F32   |
| multi_modal_projector.fc0.bias                  | torch.Size([3584])          | F32   |
| multi_modal_projector.fc0.weight                | torch.Size([3584, 1280])    | Q8_0  |
| multi_modal_projector.fc1.bias                  | torch.Size([3584])          | F32   |
| multi_modal_projector.fc1.weight                | torch.Size([3584, 3584])    | Q8_0  |
+-------------------------------------------------+-----------------------------+-------+

AudioFlamingo GGML model saved to music_flamingo_q8.bin
```
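As a sanity check, the per-layer parameter count of the audio tower follows from the shapes in the dump above (note k_proj has no bias in Whisper encoders):

```python
# Per-layer parameter count of the audio tower, from the dumped shapes
d, ffn = 1280, 5120
fc = (ffn * d + ffn) + (d * ffn + d)  # fc1 + fc2 (weights + biases)
attn_w = 4 * d * d                    # q/k/v/out projection weights
attn_b = 3 * d                        # q/v/out biases; k_proj has no bias
norms = 2 * (d + d)                   # two LayerNorms (weight + bias each)
per_layer = fc + attn_w + attn_b + norms

print(per_layer)       # 19676160
print(32 * per_layer)  # 629637120, i.e. ~630M for the 32-layer encoder
```

That lands close to the published Whisper-large-v3 encoder size once the conv stem, positional embeddings, and final LayerNorm are added, which is consistent with the dump.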
