
# Advanced Seminar on Model Merging

This seminar provides an in-depth exploration of **model merging** techniques. We will cover theoretical foundations, practical implementation, and model fine-tuning after merging.

---

## Objectives:
1. **Understand** the various techniques for model merging.
2. **Implement** these techniques step by step.
3. **Train and validate** merged models for improved performance.

---

## Techniques Covered:
1. Weighted Average Merging.
2. Layer-wise Merging.
3. Fine-tuned Gradient Alignment.
4. Parameter Freezing and Selection.
5. Practical Training of Merged Models.


In [1]:

# Import libraries
import torch
import torch.nn as nn
import copy

# Dummy models for demonstration
class SimpleModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)

# Create two models
model1 = SimpleModel(10, 20, 5)
model2 = SimpleModel(10, 20, 5)

# Initialize the first-layer weights differently (seeded for reproducibility)
torch.manual_seed(42)
with torch.no_grad():
    model1.fc1.weight.normal_()
    model2.fc1.weight.uniform_()

# Save initial states for merging examples
model1_state = model1.state_dict()
model2_state = model2.state_dict()


In [2]:

# Weighted Average Merging
def weighted_merge(state_dict1, state_dict2, alpha=0.5):
    merged_state_dict = {}
    for key in state_dict1.keys():
        merged_state_dict[key] = alpha * state_dict1[key] + (1 - alpha) * state_dict2[key]
    return merged_state_dict

alpha = 0.7
merged_weights = weighted_merge(model1_state, model2_state, alpha)

# Load the merged weights into a new model
merged_model = SimpleModel(10, 20, 5)
merged_model.load_state_dict(merged_weights)
print("Weighted Average Merging Completed")


Weighted Average Merging Completed


In [3]:

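# Layer-wise Merging
def layerwise_merge(state_dict1, state_dict2, layers_to_merge):
    # Start from a copy of model1's weights, then overwrite the selected layers
    merged_state_dict = copy.deepcopy(state_dict1)
    for layer in layers_to_merge:
        # Clone so the merged dict does not share storage with model2's tensors
        merged_state_dict[layer] = state_dict2[layer].clone()
    return merged_state_dict

# Take only the first layer (fc1) from model2; keep everything else from model1
layers_to_merge = ['fc1.weight', 'fc1.bias']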
merged_weights_layerwise = layerwise_merge(model1_state, model2_state, layers_to_merge)

# Load the merged weights into a new model
merged_model_layerwise = SimpleModel(10, 20, 5)
merged_model_layerwise.load_state_dict(merged_weights_layerwise)
print("Layer-wise Merging Completed")


Layer-wise Merging Completed


In [4]:

# Fine-tuning Merged Model
from torch.optim import Adam

# Create synthetic data
x = torch.randn(100, 10)
y = torch.randint(0, 5, (100,))

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = Adam(merged_model.parameters(), lr=0.01)

# Fine-tune the merged model
for epoch in range(5):
    optimizer.zero_grad()
    outputs = merged_model(x)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")


Epoch 1, Loss: 1.8546305894851685
Epoch 2, Loss: 1.7490872144699097
Epoch 3, Loss: 1.6721289157867432
Epoch 4, Loss: 1.6179580688476562
Epoch 5, Loss: 1.5801407098770142
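
Parameter freezing, listed among the techniques covered, is not demonstrated in the cells above. The following is a minimal sketch of one way it could be done: the `nn.Sequential` network is a hypothetical stand-in for a merged model with the same layer shapes as `SimpleModel`, and the choice to freeze the first layer is an assumption made purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a merged model (same layer shapes as SimpleModel)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))

# Freeze the first linear layer so that only the output layer is fine-tuned
for param in model[0].parameters():
    param.requires_grad = False

# Keep copies of the weights to compare after training
frozen_before = model[0].weight.detach().clone()
head_before = model[2].weight.detach().clone()

# Synthetic data, as in the fine-tuning cell
x = torch.randn(100, 10)
y = torch.randint(0, 5, (100,))

criterion = nn.CrossEntropyLoss()
# Pass only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=0.01)

for step in range(5):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

frozen_unchanged = torch.equal(frozen_before, model[0].weight)
head_changed = not torch.equal(head_before, model[2].weight)
print(f"Frozen layer unchanged: {frozen_unchanged}, output layer updated: {head_changed}")
```

Freezing the layers taken from one source model is one way to preserve their behavior while the remaining layers adapt during fine-tuning.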



## Conclusion

In this seminar, we explored:
1. Multiple techniques for merging models.
2. Practical implementation of these techniques.
3. Fine-tuning to optimize merged model performance.

### Future Work:
- Explore more complex merging strategies like gradient alignment.
- Evaluate merged models on real-world datasets.

---

Experiment with these techniques to tailor models for specialized tasks!



## Advanced Mergekit Demonstration

In this section, we will explore:
1. YAML-based configuration for managing complex merges.
2. Implementation of advanced methods such as TIES, DELLA, and Mixture of Experts (MoE).
3. Practical steps for uploading merged models to Hugging Face.

Mergekit exposes its merge methods through declarative YAML configuration, so a multi-model merge can be described, versioned, and re-run from a single file.


In [1]:
!pip install torch --upgrade



In [2]:
!pip install git+https://github.com/arcee-ai/mergekit.git

Collecting git+https://github.com/arcee-ai/mergekit.git
  Cloning https://github.com/arcee-ai/mergekit.git to /tmp/pip-req-build-_jlj6z5x
  Running command git clone --filter=blob:none --quiet https://github.com/arcee-ai/mergekit.git /tmp/pip-req-build-_jlj6z5x
  Resolved https://github.com/arcee-ai/mergekit.git to commit 57e7d14e2a732f532970e2c9dada00e2d8f15a7a
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [3]:
from huggingface_hub import login
login()


In [27]:
import yaml

MODEL_NAME = "Marcoro14-7B-slerp"
yaml_config = """
slices:
  - sources:
    - model: psmathur/orca_mini_v3_13b
      layer_range: [0, 24]
  - sources:
    - model: garage-bAInd/Platypus2-13B
      layer_range: [20, 40]
merge_method: passthrough
dtype: float16
"""

# Save config as yaml file
with open('config.yaml', 'w', encoding="utf-8") as f:
    f.write(yaml_config)
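
A `passthrough` merge stacks the listed slices in order, so the merged model's depth is the sum of the slice lengths. The sketch below works that out for the config above, assuming `layer_range` is half-open (start included, end excluded); the `count_layers` helper is hypothetical and not part of mergekit.

```python
# Slice definitions copied from the YAML config above
slices = [
    {"model": "psmathur/orca_mini_v3_13b", "layer_range": [0, 24]},
    {"model": "garage-bAInd/Platypus2-13B", "layer_range": [20, 40]},
]

def count_layers(slices):
    """Sum the slice lengths, treating each layer_range as [start, end)."""
    total = 0
    for entry in slices:
        start, end = entry["layer_range"]
        total += end - start
    return total

total_layers = count_layers(slices)
print(total_layers)  # 24 + 20 = 44
```

Under that assumption the two overlapping 13B slices yield a 44-layer model, deeper than either 40-layer source.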

In [28]:
# Merge models
!mergekit-yaml config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle

Output was truncated to the last 5000 lines.

model-00001-of-00003.safetensors:  25% 2.49G/9.95G [01:55<05:31, 22.5MB/s]
model-00002-of-00003.safetensors:  25% 2.52G/9.90G [01:54<05:23, 22.8MB/s]
model-00003-of-00003.safetensors:  37% 2.26G/6.18G [01:54<03:03, 21.3MB/s]


### TIES (TrIm, Elect Sign & Merge)

TIES addresses conflicts between task vectors (the differences between each fine-tuned model and the shared base model) in three steps:
- **Sparsification (trim)**: Retains only the highest-magnitude entries of each task vector and resets the rest to zero.
- **Sign Consensus (elect sign)**: Chooses one sign per parameter across models, so that updates pulling in opposite directions do not cancel out.
- **Disjoint merge**: Averages only the entries that agree with the elected sign.

#### Parameters:
- **Density**: Fraction of weights to retain in the task vector.
- **Weight**: Relative contribution of each model's task vector to the merge.

TIES is particularly useful for merging models fine-tuned on diverse tasks.
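
To make the sparsification and sign-consensus ideas concrete, here is a toy sketch on plain Python lists. It is not mergekit's implementation: the helpers `trim` and `ties_merge` are hypothetical, per-model weights are omitted, and each "model" is just a flat list of four numbers.

```python
def trim(task_vector, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = max(1, round(len(task_vector) * density))
    ranked = sorted(range(len(task_vector)),
                    key=lambda i: abs(task_vector[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(task_vector)]

def ties_merge(base, finetuned_models, density=0.5, scale=1.0):
    # Task vector = fine-tuned weights minus base weights
    task_vectors = [[f - b for f, b in zip(model, base)]
                    for model in finetuned_models]
    trimmed = [trim(tv, density) for tv in task_vectors]

    merged = []
    for i, b in enumerate(base):
        column = [tv[i] for tv in trimmed]
        # Elect the sign carrying the larger total magnitude
        sign = 1.0 if sum(column) >= 0 else -1.0
        # Average only the surviving entries that agree with the elected sign
        agreeing = [v for v in column if v * sign > 0]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + scale * delta)
    return merged

base = [1.0, 1.0, 1.0, 1.0]
model_a = [1.9, 1.0, 1.4, 1.1]   # task vector is roughly [ 0.9, 0.0, 0.4, 0.1]
model_b = [0.4, 1.1, 1.5, 1.0]   # task vector is roughly [-0.6, 0.1, 0.5, 0.0]

merged = ties_merge(base, [model_a, model_b], density=0.5)
print([round(v, 4) for v in merged])
```

In the first position the two models disagree in sign, so only the update from `model_a` (the larger one) is kept; in the third position both agree and their updates are averaged.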


### TIES-Merging

```yaml
models:
  - model: mistralai/Mistral-7B-v0.1
    # no parameters necessary for base model
  - model: OpenPipe/mistral-ft-optimized-1218
    parameters:
      density: 0.5
      weight: 0.5
  - model: mlabonne/NeuralHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 0.3
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  normalize: true
dtype: float16
```
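
With `normalize: true`, the per-model weights are rescaled so that the contributing weights sum to 1. The arithmetic below is a sketch for the simple case of a parameter where both fine-tuned models survive trimming and agree in sign; treating normalization as a plain division by the sum of weights is an assumption for illustration.

```python
# Weights copied from the TIES config above
weights = {
    "OpenPipe/mistral-ft-optimized-1218": 0.5,
    "mlabonne/NeuralHermes-2.5-Mistral-7B": 0.3,
}

# Divide each weight by the total so the effective weights sum to 1
total = sum(weights.values())
normalized = {name: w / total for name, w in weights.items()}

for name, w in normalized.items():
    print(f"{name}: {w:.3f}")
```

Under this assumption the effective weights are about 0.625 and 0.375, so the 0.5/0.3 values only set the ratio between the two models.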



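### DELLA (Drop and rEscaLe via sampLing with mAgnitude)

DELLA builds on DARE and TIES, introducing adaptive, magnitude-aware pruning to retain critical parameters. It uses:
- **Magnitude-based Ranking**: Ranks the delta parameters (differences from the base model) within each row by magnitude.
- **Dynamic Drop Probabilities**: Assigns lower drop probabilities to higher-magnitude parameters, then rescales the survivors to compensate for what was dropped.

#### Parameters:
- **Density**: Fraction of weights to retain.
- **Epsilon**: Maximum deviation of the per-parameter probability from its average; probabilities range from `density - epsilon` to `density + epsilon`, so both bounds must stay between 0 and 1.
- **Lambda**: Scaling factor applied to the merged delta parameters before they are added to the base model.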

DELLA is ideal for merging models where parameter importance varies significantly.
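
The magnitude-based sampling can be sketched on a single row of delta parameters. This is a toy illustration, not mergekit's code: the helper names are hypothetical, and spacing the keep probabilities linearly between `density - epsilon` and `density + epsilon` is an assumption.

```python
import random

def magnitude_keep_probs(row, density, epsilon):
    """Assign higher keep probabilities to higher-magnitude entries."""
    if not (0.0 < density - epsilon and density + epsilon <= 1.0):
        raise ValueError("density - epsilon and density + epsilon must lie in (0, 1]")
    n = len(row)
    order = sorted(range(n), key=lambda i: abs(row[i]))  # ascending magnitude
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        frac = rank / (n - 1) if n > 1 else 0.5
        probs[idx] = (density - epsilon) + 2 * epsilon * frac
    return probs

def drop_and_rescale(row, density, epsilon, rng):
    """Drop each entry at random; divide survivors by their keep probability."""
    probs = magnitude_keep_probs(row, density, epsilon)
    return [v / p if rng.random() < p else 0.0 for v, p in zip(row, probs)]

row = [0.05, -0.8, 0.3, -0.1, 0.6]
probs = magnitude_keep_probs(row, density=0.5, epsilon=0.2)
pruned = drop_and_rescale(row, density=0.5, epsilon=0.2, rng=random.Random(0))

print([round(p, 2) for p in probs])
print([round(v, 3) for v in pruned])
```

Dividing each survivor by its own keep probability keeps the expected value of every entry equal to its original value, which is the purpose of the rescaling step.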


## Merging LoRA Adapters using TIES and DARE

This section demonstrates how to merge LoRA adapters using the **TIES** and **DARE** methods. These methods are efficient for merging LoRA adapters by reducing redundant parameters and resolving conflicts.
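
For intuition about how DARE reduces redundant parameters, the sketch below drops each delta entry at random and rescales the survivors. It is a toy illustration on a plain list, not PEFT's implementation; `dare_drop_and_rescale` is a hypothetical helper, and `density` here means the probability of keeping an entry.

```python
import random

def dare_drop_and_rescale(delta, density, rng):
    """Keep each entry with probability `density`; rescale survivors by 1/density."""
    return [v / density if rng.random() < density else 0.0 for v in delta]

rng = random.Random(0)
delta = [0.2, -0.5, 0.1, 0.4]
density = 0.7

sample = dare_drop_and_rescale(delta, density, rng)
print([round(v, 3) for v in sample])

# Averaging many random draws recovers the original delta (approximately)
trials = 20000
totals = [0.0] * len(delta)
for _ in range(trials):
    for i, v in enumerate(dare_drop_and_rescale(delta, density, rng)):
        totals[i] += v
mean = [t / trials for t in totals]
print([round(v, 2) for v in mean])
```

The rescaling by `1 / density` is what lets a large share of entries be dropped without shifting the adapter's average effect.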

### 1. Loading Adapters

We first load the base model, attach the first adapter with `PeftModel.from_pretrained`, and then add the remaining adapters with the `load_adapter` method:

In [1]:
!pip install bitsandbytes



In [2]:
!pip install --upgrade transformers peft



In [22]:
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

peft_model_id = "smangrul/tinyllama_lora_norobots"
device = "cuda"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model in 4-bit; a BitsAndBytesConfig replaces the deprecated load_in_4bit argument
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
model.resize_token_embeddings(len(tokenizer))

# Attach the first adapter, then load the other two under their own names
model = PeftModel.from_pretrained(model, peft_model_id, adapter_name="norobots")
_ = model.load_adapter("smangrul/tinyllama_lora_sql", adapter_name="sql")
_ = model.load_adapter("smangrul/tinyllama_lora_adcopy", adapter_name="adcopy")

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`




### 2. Merging Adapters

We use the `add_weighted_adapter` method to merge adapters with specified weights and parameters. The `combination_type` parameter selects the merging method (e.g., `ties`, `dare_linear`, or `dare_ties`), and `density` sets the fraction of adapter weights kept after pruning.


In [24]:
%%time
# Other weight / combination_type settings to try:
#   [0.8, 0.1, 0.1]  linear
#   [1.0, 0.2]       dare_linear, density 0.7
#   [1.5, 0.3]       ties, density 0.5
#   [0.8, 0.5]       cat
adapters = ["norobots", "adcopy", "sql"]
weights = [2.0, 0.3, 0.7]
adapter_name = "merge"
density = 0.2
combination_type = "ties"
if adapter_name in model.peft_config:
    model.delete_adapter(adapter_name)
model.add_weighted_adapter(adapters, weights, adapter_name, combination_type=combination_type, density=density)

CPU times: user 1.11 s, sys: 20.7 ms, total: 1.13 s
Wall time: 1.58 s


In [25]:
model.eval()
model.set_adapter("merge")


### 3. Using the Merged Model

The merged model can now be used for generating text, leveraging the combined capabilities of all adapters.


In [26]:
messages = [
    {"role": "user", "content": "Write an essay about Generative AI."},
]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt")
# Move the input tensors to the same device as the model
inputs = {k: v.to(device) for k, v in inputs.items()}
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.95,
    temperature=0.2,
    repetition_penalty=1.2,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0]))



<s><|im_start|>user 
Write an essay about Generative AI.<|im_end|> 
<|im_start|>assistant 
Generative Artificial Intelligence is the process of creating art by using machine learning techniques to create images and videos that look like real paintings, drawings, or sculptures. The creations are often inspired by a variety of sources including nature, history, literature, and other forms of media. Some examples include: 19th century British painter J.M.W. Turner's painting "The Fighting Temeraire" (1805), which was based on the Battle of Waterlo in 1763; 20th century French artist Henri Fantin-Latour's painting "La Belle au Bois de Boulogne" (1848); 20th century American artist Edward Hopper's painting "Night Interior with Moonlight" (1930); 20th century American artist Jackson Pollack's painting "Airplane" (1930); 20th century American artist Jackson Pollack's painting "Birdwatcher" (1930); 20th century American artist Jackson Pollack's painting "Cowlf" (1930); 20th century American ar