Model Support Request: Music Flamingo
- HF Link: nvidia/music-flamingo-hf
- Architecture: `AudioFlamingo3ForConditionalGeneration`
- Backbone Components:
  - Audio Tower: Whisper-v3 Large (32 layers, 128 mel bins)
  - Language Model: Qwen2 (28 layers)
  - Projector: Two-layer MLP (Linear + GELU + Linear)
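For reference, the projector wiring can be sketched numerically. This is a minimal NumPy sketch, assuming the dimensions reported by the converter dump further down (audio hidden size 1280, Qwen2 hidden size 3584) and a tanh-approximated GELU; it is not the actual implementation.

```python
import numpy as np

D_AUDIO, D_TEXT = 1280, 3584  # assumed: Whisper-v3 and Qwen2 hidden sizes

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def project(audio_feats, w0, b0, w1, b1):
    # audio_feats: (n_frames, 1280); weights stored (out, in) as in PyTorch
    return gelu(audio_feats @ w0.T + b0) @ w1.T + b1

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, D_AUDIO))
w0, b0 = rng.standard_normal((D_TEXT, D_AUDIO)) * 0.02, np.zeros(D_TEXT)
w1, b1 = rng.standard_normal((D_TEXT, D_TEXT)) * 0.02, np.zeros(D_TEXT)

out = project(feats, w0, b0, w1, b1)
print(out.shape)  # (4, 3584)
```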
Potential Support: Audio Flamingo 3

Not tested, but the architecture appears to be nearly identical.

- HF Link: nvidia/audio-flamingo-3-hf
- Architecture: `AudioFlamingo3ForConditionalGeneration`
- Thinking mode:
  - LoRA: `non_lora_trainables.bin` (available in the `think` subfolder of the hub)
  - Adapter: `adapter_model.safetensors` (available in the `think` subfolder of the hub)

After understanding the details, the LoRA weights can be applied to the model during conversion. Wonderful!
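If the LoRA is folded in at conversion time, the standard merge is W' = W + (alpha / r) · B · A. A hedged NumPy sketch follows; the names `lora_a`/`lora_b`, `alpha`, and `r` are generic LoRA conventions and have not been verified against the `think` checkpoints.

```python
import numpy as np

def merge_lora(w, lora_a, lora_b, alpha, r):
    # w: (out, in); lora_a: (r, in); lora_b: (out, r)
    # Standard LoRA fold: W' = W + (alpha / r) * B @ A
    return w + (alpha / r) * (lora_b @ lora_a)

rng = np.random.default_rng(1)
w = rng.standard_normal((3584, 3584)).astype(np.float32)
lora_a = rng.standard_normal((16, 3584)).astype(np.float32)
lora_b = np.zeros((3584, 16), dtype=np.float32)  # zero-init B => merge is a no-op

merged = merge_lora(w, lora_a, lora_b, alpha=32.0, r=16)
print(np.allclose(merged, w))  # B is zero, so the delta is zero
```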
Technical Configuration Mapping

| Parameter | Value | Note |
|---|---|---|
| `n_audio_layers` | 32 | Whisper-v3 architecture |
| `n_audio_mel_bins` | 128 | Critical: differs from Whisper-v1/v2 (80) |
| `n_text_layers` | 28 | Qwen2 standard |
| `n_vocab` | 151669 | Qwen2 tokenizer (`<sound>`) |
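Before dumping, these values could be sanity-checked against the model's `config.json`. A small sketch: the field names `num_mel_bins`, `encoder_layers`, and `num_hidden_layers` follow the usual HF Whisper/Qwen2 config conventions and should be confirmed against the real file.

```python
import json

def check_config(cfg: dict) -> bool:
    audio, text = cfg.get('audio_config', {}), cfg.get('text_config', {})
    assert audio.get('num_mel_bins') == 128, "expected Whisper-v3 (128 mel bins, not 80)"
    assert audio.get('encoder_layers') == 32, "expected 32 audio encoder layers"
    assert text.get('num_hidden_layers') == 28, "expected Qwen2 with 28 layers"
    return True

# Stand-in for json.load(open('config.json')), using the values from the table:
cfg = json.loads('{"audio_config": {"num_mel_bins": 128, "encoder_layers": 32}, '
                 '"text_config": {"num_hidden_layers": 28}}')
ok = check_config(cfg)
print("config OK" if ok else "mismatch")
```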
For integration into `convert.py`:

```python
class AudioFlamingo3ForConditionalGeneration(BaseConverter):
    MODEL_TYPE = ModelType.AudioFlamingo

    @classmethod
    def state_dict_pp(cls, config, state_dict):
        new_dict = {}
        for name, tensor in state_dict.items():
            new_name = name
            # 1. Audio Tower (layers 0-31)
            if name.startswith('audio_tower.'):
                new_name = name.replace('audio_tower.', 'audio.')
            # 2. Multimodal Projector
            elif name.startswith('multi_modal_projector.'):
                if '.linear_1.' in name:
                    new_name = name.replace('.linear_1.', '.fc0.')
                elif '.linear_2.' in name:
                    new_name = name.replace('.linear_2.', '.fc1.')
            # 3. Language Model (Qwen2, 28 layers)
            elif name.startswith('language_model.'):
                new_name = name.replace('language_model.', '')
                if 'lm_head' in new_name:
                    new_name = new_name.replace('model.', '')
            new_dict[new_name] = tensor
        return new_dict

    @staticmethod
    def dump_config(f, config, ggml_type):
        # 1. Textual backbone configuration (Qwen2)
        QWen2Converter.dump_config(f, AttributeDict(config.text_config), ggml_type)
        # 2. Audio Flamingo 3 extension (AF-Whisper specifications)
        audio_cfg = config.audio_config
        config_values = [
            32,                                        # n_audio_layers (Whisper-Large-v3)
            audio_cfg.get('hidden_size', 1280),        # d_model_audio
            audio_cfg.get('num_attention_heads', 20),  # n_audio_heads
            2,                                         # n_projector_layers (MLP linear_1/linear_2)
            128,                                       # n_mels (standard for Whisper-v3)
            151669                                     # audio_token_id (<sound>)
        ]
        f.write(struct.pack("i" * len(config_values), *config_values))

    @staticmethod
    def get_weight_names(config):
        # Start with the Qwen2 backbone (28 layers)
        weight_names = QWen2Converter.get_weight_names(AttributeDict(config.text_config))
        # Audio tower block (32 layers: 0 to 31)
        for i in range(32):
            weight_names += [
                f"audio.layers.{i}.fc1.bias",
                f"audio.layers.{i}.fc1.weight",
                f"audio.layers.{i}.fc2.bias",
                f"audio.layers.{i}.fc2.weight",
                f"audio.layers.{i}.final_layer_norm.bias",
                f"audio.layers.{i}.final_layer_norm.weight",
                f"audio.layers.{i}.self_attn.k_proj.weight",
                f"audio.layers.{i}.self_attn.out_proj.bias",
                f"audio.layers.{i}.self_attn.out_proj.weight",
                f"audio.layers.{i}.self_attn.q_proj.bias",
                f"audio.layers.{i}.self_attn.q_proj.weight",
                f"audio.layers.{i}.self_attn.v_proj.bias",
                f"audio.layers.{i}.self_attn.v_proj.weight",
                f"audio.layers.{i}.self_attn_layer_norm.bias",
                f"audio.layers.{i}.self_attn_layer_norm.weight",
            ]
        # Root parts and projector
        weight_names += [
            "audio.conv1.bias",
            "audio.conv1.weight",
            "audio.conv2.bias",
            "audio.conv2.weight",
            "audio.embed_positions.weight",
            "audio.layer_norm.bias",
            "audio.layer_norm.weight",
            "multi_modal_projector.fc0.bias",
            "multi_modal_projector.fc0.weight",
            "multi_modal_projector.fc1.bias",
            "multi_modal_projector.fc1.weight",
        ]
        return weight_names
```

Validation with fake Model ID
```
Loading vocab file C:\models\music-flamingo-hf
vocab_size 151672
loading C:\models\music-flamingo-hf\model-00001-of-00004.safetensors ...
Dumping ... |████████████████████████████████████████████████████████████| 100.0% (555/556) 0.01s/it rem: 0.00s
loading C:\models\music-flamingo-hf\model-00002-of-00004.safetensors ...
Dumping ... |████████████████████████████████████████████████████████████| 100.0% (130/131) 0.11s/it rem: 0.00s
loading C:\models\music-flamingo-hf\model-00003-of-00004.safetensors ...
Dumping ... |████████████████████████████████████████████████████████████| 100.0% (121/122) 0.06s/it rem: 0.00s
loading C:\models\music-flamingo-hf\model-00004-of-00004.safetensors ...
Dumping ... |████████████████████████████████████████████████████████████| 100.0% (20/21) 0.16s/it rem: 0.00s
+-------------------------------------------------+-----------------------------+-------+
| name                                            | shape                       | dtype |
+-------------------------------------------------+-----------------------------+-------+
| audio.conv1.bias                                | torch.Size([1280])          | F32   |
| audio.conv1.weight                              | torch.Size([1280, 128, 3])  | F16   |
| audio.conv2.bias                                | torch.Size([1280])          | F32   |
| audio.conv2.weight                              | torch.Size([1280, 1280, 3]) | F16   |
| audio.embed_positions.weight                    | torch.Size([1500, 1280])    | Q8_0  |
| audio.layer_norm.bias                           | torch.Size([1280])          | F32   |
| audio.layer_norm.weight                         | torch.Size([1280])          | F32   |
| audio.layers.0.fc1.bias                         | torch.Size([5120])          | F32   |
| audio.layers.0.fc1.weight                       | torch.Size([5120, 1280])    | Q8_0  |
| audio.layers.0.fc2.bias                         | torch.Size([1280])          | F32   |
| audio.layers.0.fc2.weight                       | torch.Size([1280, 5120])    | Q8_0  |
| audio.layers.0.final_layer_norm.bias            | torch.Size([1280])          | F32   |
| audio.layers.0.final_layer_norm.weight          | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn.k_proj.weight          | torch.Size([1280, 1280])    | Q8_0  |
| audio.layers.0.self_attn.out_proj.bias          | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn.out_proj.weight        | torch.Size([1280, 1280])    | Q8_0  |
| audio.layers.0.self_attn.q_proj.bias            | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn.q_proj.weight          | torch.Size([1280, 1280])    | Q8_0  |
| audio.layers.0.self_attn.v_proj.bias            | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn.v_proj.weight          | torch.Size([1280, 1280])    | Q8_0  |
| audio.layers.0.self_attn_layer_norm.bias        | torch.Size([1280])          | F32   |
| audio.layers.0.self_attn_layer_norm.weight      | torch.Size([1280])          | F32   |
| audio.layers.1.fc1.bias                         | torch.Size([5120])          | F32   |
| audio.layers.1.fc1.weight                       | torch.Size([5120, 1280])    | Q8_0  |
| audio.layers.1.fc2.bias                         | torch.Size([1280])          | F32   |
| audio.layers.1.fc2.weight                       | torch.Size([1280, 5120])    | Q8_0  |
| ...                                             | ...                         | ...   |
| model.norm.weight                               | torch.Size([3584])          | F32   |
| multi_modal_projector.fc0.bias                  | torch.Size([3584])          | F32   |
| multi_modal_projector.fc0.weight                | torch.Size([3584, 1280])    | Q8_0  |
| multi_modal_projector.fc1.bias                  | torch.Size([3584])          | F32   |
| multi_modal_projector.fc1.weight                | torch.Size([3584, 3584])    | Q8_0  |
+-------------------------------------------------+-----------------------------+-------+
AudioFlamingo GGML model saved to music_flamingo_q8.bin
```
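As a side note, the key mapping in `state_dict_pp` can be exercised in isolation. This is a minimal re-implementation of just the renaming rules (no tensors needed), checked against names that appear in the dump above:

```python
# Standalone copy of the string-renaming rules from state_dict_pp
def rename(name: str) -> str:
    if name.startswith('audio_tower.'):
        return name.replace('audio_tower.', 'audio.')
    if name.startswith('multi_modal_projector.'):
        return name.replace('.linear_1.', '.fc0.').replace('.linear_2.', '.fc1.')
    if name.startswith('language_model.'):
        name = name.replace('language_model.', '')
        if 'lm_head' in name:
            name = name.replace('model.', '')
        return name
    return name

print(rename('audio_tower.layers.0.fc1.weight'))                   # audio.layers.0.fc1.weight
print(rename('multi_modal_projector.linear_1.bias'))               # multi_modal_projector.fc0.bias
print(rename('language_model.model.layers.0.mlp.up_proj.weight'))  # model.layers.0.mlp.up_proj.weight
print(rename('language_model.lm_head.weight'))                     # lm_head.weight
```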