
Support proposal: NVIDIA's Music Flamingo (AudioFlamingo3ForConditionalGeneration) #118

@dorpxam

Description


Model Support Request: Music Flamingo

  • HF Link: nvidia/music-flamingo-hf
  • Architecture: AudioFlamingo3ForConditionalGeneration
  • Backbone Components:
    • Audio Tower: Whisper-v3 Large (32 layers, 128 mel bins)
    • Language Model: Qwen2 (28 layers)
    • Projector: Two-layer MLP (Linear + GELU + Linear)
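The projector can be sketched in plain numpy: it maps the 1280-dim Whisper features to the 3584-dim Qwen2 hidden size (shapes match the tensor dump further down). This is a minimal sketch, assuming the usual tanh-approximated GELU; the random weights are placeholders, not checkpoint values.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, common in transformer stacks (assumption here)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def project_audio_features(feats, w0, b0, w1, b1):
    """Two-layer MLP: Linear(1280 -> 3584) + GELU + Linear(3584 -> 3584)."""
    h = gelu(feats @ w0.T + b0)
    return h @ w1.T + b1

# Toy weights with the checkpoint's shapes (fc0: 3584x1280, fc1: 3584x3584)
rng = np.random.default_rng(0)
w0 = rng.standard_normal((3584, 1280)) * 0.02
b0 = np.zeros(3584)
w1 = rng.standard_normal((3584, 3584)) * 0.02
b1 = np.zeros(3584)

feats = rng.standard_normal((750, 1280))  # hypothetical batch of audio frames
out = project_audio_features(feats, w0, b0, w1, b1)
print(out.shape)  # (750, 3584)
```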

Potential Support: Audio Flamingo

Not tested, but the architecture is potentially strictly similar, so Audio Flamingo may work with the same converter.

After digging into the details, the LoRA weights can be applied to the model during conversion. Wonderful!
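Merging LoRA weights at convert time amounts to folding each low-rank pair back into its base matrix. A minimal sketch with hypothetical rank/alpha values (not taken from the checkpoint):

```python
import numpy as np

def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Fold a LoRA pair into the base weight: W' = W + (alpha / rank) * B @ A."""
    return w + (alpha / rank) * (lora_b @ lora_a)

rng = np.random.default_rng(0)
w = rng.standard_normal((3584, 3584))     # base projection weight
rank, alpha = 16, 32                      # hypothetical adapter hyperparameters
lora_a = rng.standard_normal((rank, 3584))  # down-projection (A)
lora_b = rng.standard_normal((3584, rank))  # up-projection (B)

merged = merge_lora(w, lora_a, lora_b, alpha, rank)
```

Once merged, the converter only has to dump `merged` under the base tensor name; no LoRA tensors survive in the GGML file.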

Technical Configuration Mapping

| Parameter | Value | Note |
|---|---|---|
| n_audio_layers | 32 | Whisper-v3 architecture |
| n_audio_mel_bins | 128 | Critical: differs from Whisper-v1/v2 (80) |
| n_text_layers | 28 | Qwen2 standard |
| n_vocab | 151669 | Qwen2 tokenizer (`<sound>`) |
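These values end up in the packed int32 header written by dump_config. A minimal sketch of that serialization (the field order follows this proposal, not a confirmed on-disk format):

```python
import struct

# Proposed extension header for the Audio Flamingo 3 converter (field order
# is this proposal's, not a confirmed format)
config_values = [
    32,      # n_audio_layers (Whisper-Large-v3)
    1280,    # d_model_audio
    20,      # n_audio_heads
    2,       # n_projector_layers
    128,     # n_mels
    151669,  # audio_token_id (<sound>)
]
header = struct.pack("i" * len(config_values), *config_values)
print(len(header))  # 24 on platforms with a 4-byte C int
```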

For integration into `convert.py`:

```python
class AudioFlamingo3ForConditionalGeneration(BaseConverter):
    MODEL_TYPE = ModelType.AudioFlamingo

    @classmethod
    def state_dict_pp(cls, config, state_dict):
        new_dict = {}
        for name, tensor in state_dict.items():
            new_name = name

            # 1. Audio Tower (layers 0-31)
            if name.startswith('audio_tower.'):
                new_name = name.replace('audio_tower.', 'audio.')

            # 2. Multimodal Projector
            elif name.startswith('multi_modal_projector.'):
                if '.linear_1.' in name:
                    new_name = name.replace('.linear_1.', '.fc0.')
                elif '.linear_2.' in name:
                    new_name = name.replace('.linear_2.', '.fc1.')

            # 3. Language Model (Qwen2, 28 layers)
            elif name.startswith('language_model.'):
                new_name = name.replace('language_model.', '')
                if 'lm_head' in new_name:
                    new_name = new_name.replace('model.', '')

            new_dict[new_name] = tensor
        return new_dict

    @staticmethod
    def dump_config(f, config, ggml_type):
        # 1. Textual Backbone Configuration (Qwen2)
        QWen2Converter.dump_config(f, AttributeDict(config.text_config), ggml_type)

        # 2. Audio Flamingo 3 Extension (AF-Whisper Specifications)
        audio_cfg = config.audio_config
        config_values = [
            32,                                         # n_audio_layers (Whisper-Large-v3)
            audio_cfg.get('hidden_size', 1280),         # d_model_audio
            audio_cfg.get('num_attention_heads', 20),
            2,                                          # n_projector_layers (MLP linear_1/linear_2)
            128,                                        # n_mels (standard for Whisper-v3)
            151669                                      # audio_token_id (<sound>)
        ]
        f.write(struct.pack("i" * len(config_values), *config_values))

    @staticmethod
    def get_weight_names(config):
        # Start with the Qwen2 backbone (28 layers)
        weight_names = QWen2Converter.get_weight_names(AttributeDict(config.text_config))

        # Audio Tower block (32 layers: 0 to 31)
        for i in range(32):
            weight_names += [
                f"audio.layers.{i}.fc1.bias",
                f"audio.layers.{i}.fc1.weight",
                f"audio.layers.{i}.fc2.bias",
                f"audio.layers.{i}.fc2.weight",
                f"audio.layers.{i}.final_layer_norm.bias",
                f"audio.layers.{i}.final_layer_norm.weight",
                f"audio.layers.{i}.self_attn.k_proj.weight",   # k_proj has no bias in Whisper
                f"audio.layers.{i}.self_attn.out_proj.bias",
                f"audio.layers.{i}.self_attn.out_proj.weight",
                f"audio.layers.{i}.self_attn.q_proj.bias",
                f"audio.layers.{i}.self_attn.q_proj.weight",
                f"audio.layers.{i}.self_attn.v_proj.bias",
                f"audio.layers.{i}.self_attn.v_proj.weight",
                f"audio.layers.{i}.self_attn_layer_norm.bias",
                f"audio.layers.{i}.self_attn_layer_norm.weight",
            ]

        # Root parts and projector
        weight_names += [
            "audio.conv1.bias",
            "audio.conv1.weight",
            "audio.conv2.bias",
            "audio.conv2.weight",
            "audio.embed_positions.weight",
            "audio.layer_norm.bias",
            "audio.layer_norm.weight",
            "multi_modal_projector.fc0.bias",
            "multi_modal_projector.fc0.weight",
            "multi_modal_projector.fc1.bias",
            "multi_modal_projector.fc1.weight",
        ]

        return weight_names
```
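The renaming rules in state_dict_pp can be exercised standalone on a few representative tensor names (a self-contained sketch of the same logic, with hypothetical example names in the HF layout):

```python
def rename_tensor(name):
    """Standalone version of the renaming rules in state_dict_pp above."""
    if name.startswith('audio_tower.'):
        return name.replace('audio_tower.', 'audio.')
    if name.startswith('multi_modal_projector.'):
        return (name.replace('.linear_1.', '.fc0.')
                    .replace('.linear_2.', '.fc1.'))
    if name.startswith('language_model.'):
        name = name.replace('language_model.', '')
        if 'lm_head' in name:
            name = name.replace('model.', '')
        return name
    return name

# Hypothetical HF-style names -> expected GGML-side names
examples = {
    'audio_tower.layers.0.fc1.weight':     'audio.layers.0.fc1.weight',
    'multi_modal_projector.linear_1.bias': 'multi_modal_projector.fc0.bias',
    'language_model.model.norm.weight':    'model.norm.weight',
    'language_model.lm_head.weight':       'lm_head.weight',
}
for src, expected in examples.items():
    assert rename_tensor(src) == expected
```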

Validation with a fake model ID

```
Loading vocab file C:\models\music-flamingo-hf
vocab_size  151672
loading C:\models\music-flamingo-hf\model-00001-of-00004.safetensors ...
Dumping ... |████████████████████████████████████████████████████████████| 100.0% (555/556) 0.01s/it rem: 0.00s
loading C:\models\music-flamingo-hf\model-00002-of-00004.safetensors ...
Dumping ... |████████████████████████████████████████████████████████████| 100.0% (130/131) 0.11s/it rem: 0.00s
loading C:\models\music-flamingo-hf\model-00003-of-00004.safetensors ...
Dumping ... |████████████████████████████████████████████████████████████| 100.0% (121/122) 0.06s/it rem: 0.00s
loading C:\models\music-flamingo-hf\model-00004-of-00004.safetensors ...
Dumping ... |████████████████████████████████████████████████████████████| 100.0% (20/21) 0.16s/it rem: 0.00s
+-------------------------------------------------+-----------------------------+-------+
| name                                            | shape                       | dtype |
+-------------------------------------------------+-----------------------------+-------+
| audio.conv1.bias                                | torch.Size([1280])          | F32   |
| audio.conv1.weight                              | torch.Size([1280, 128, 3])  | F16   |
| audio.conv2.bias                                | torch.Size([1280])          | F32   |
| audio.conv2.weight                              | torch.Size([1280, 1280, 3]) | F16   |
| audio.embed_positions.weight                    | torch.Size([1500, 1280])    | Q8_0  |
| audio.layer_norm.bias                           | torch.Size([1280])          | F32   |
| audio.layer_norm.weight                         | torch.Size([1280])          | F32   |
| audio.layers.0.fc1.bias                         | torch.Size([5120])          | F32   |
| audio.layers.0.fc1.weight                       | torch.Size([5120, 1280])    | Q8_0  |
| audio.layers.0.fc2.bias                         | torch.Size([1280])          | F32   |
| audio.layers.0.fc2.weight                       | torch.Size([1280, 5120])    | Q8_0  |
| audio.layers.0.final_layer_norm.bias            | torch.Size([1280])          | F32   |
| audio.layers.0.final_layer_norm.weight          | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn.k_proj.weight          | torch.Size([1280, 1280])    | Q8_0  |
| audio.layers.0.self_attn.out_proj.bias          | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn.out_proj.weight        | torch.Size([1280, 1280])    | Q8_0  |
| audio.layers.0.self_attn.q_proj.bias            | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn.q_proj.weight          | torch.Size([1280, 1280])    | Q8_0  |
| audio.layers.0.self_attn.v_proj.bias            | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn.v_proj.weight          | torch.Size([1280, 1280])    | Q8_0  |
| audio.layers.0.self_attn_layer_norm.bias        | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn_layer_norm.weight      | torch.Size([1280])          | F32   |
| audio.layers.1.fc1.bias                         | torch.Size([5120])          | F32   |
| audio.layers.1.fc1.weight                       | torch.Size([5120, 1280])    | Q8_0  |
| audio.layers.1.fc2.bias                         | torch.Size([1280])          | F32   |
| audio.layers.1.fc2.weight                       | torch.Size([1280, 5120])    | Q8_0  |
| ...                                             | ...                         | ...   |
| model.norm.weight                               | torch.Size([3584])          | F32   |
| multi_modal_projector.fc0.bias                  | torch.Size([3584])          | F32   |
| multi_modal_projector.fc0.weight                | torch.Size([3584, 1280])    | Q8_0  |
| multi_modal_projector.fc1.bias                  | torch.Size([3584])          | F32   |
| multi_modal_projector.fc1.weight                | torch.Size([3584, 3584])    | Q8_0  |
+-------------------------------------------------+-----------------------------+-------+

AudioFlamingo GGML model saved to music_flamingo_q8.bin
```
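As a sanity check, the per-layer parameter count of the audio tower follows from the shapes in the dump above (note k_proj has no bias in Whisper encoders):

```python
# Per-layer parameter count of the audio tower, from the dumped shapes
d, ffn = 1280, 5120
fc = (ffn * d + ffn) + (d * ffn + d)  # fc1 + fc2 (weights + biases)
attn_w = 4 * d * d                    # q/k/v/out projection weights
attn_b = 3 * d                        # q/v/out biases; k_proj has no bias
norms = 2 * (d + d)                   # two LayerNorms (weight + bias each)
per_layer = fc + attn_w + attn_b + norms

print(per_layer)       # 19676160
print(32 * per_layer)  # 629637120, i.e. ~630M for the 32-layer encoder
```

That lands close to the published Whisper-large-v3 encoder size once the conv stem, positional embeddings, and final LayerNorm are added, which is consistent with the dump.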
