# Efficient Linear Model Merging for LLMs

https://lightning.ai/lightning-ai/studios/efficient-linear-model-merging-for-llms?section=blogs

This notebook implements the model merging method described in [Wortsman et al. (2022): Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time](https://arxiv.org/abs/2203.05482).

# 1 - Introduction

Model merging is an approach where multiple pretrained or finetuned models are combined to form a new model that leverages the strengths and knowledge of each individual model. Unlike traditional model ensembling methods, which require the use of multiple models during inference time, model merging is a more efficient approach. It yields a single model that maintains the same size as each of the individual input models, as illustrated in the figure below:

<table>
    <tr>
        <td><img src="./images_2/introduction.png" width="800"/></td>
    </tr>
</table>

To begin exploring model merging, we will consider one of the earliest approaches in this area ([Wortsman et al., 2022](https://arxiv.org/abs/2203.05482)). This paper proposes combining multiple models by averaging their weights, a technique now also referred to as "linear" merging. Although the [Model Soups paper](https://arxiv.org/abs/2203.05482) primarily focused on vision models trained with different hyperparameter configurations, the concept applies equally to LLMs that have been finetuned on various datasets and for different target tasks.

Assuming that the models we want to merge are based on the same architecture, i.e., have the same number of parameters in each layer, linear merging combines two models by averaging their weights element-wise. We can also add an `alpha` parameter that weights each model's contribution. Setting `alpha=0.5` makes both models contribute equally, as illustrated in the figure below:

<table>
    <tr>
        <td><img src="./images_2/linear_model_merging_example.jpg" width="800"/></td>
    </tr>
</table>
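
The weighted average described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the notebook's final implementation: the helper name `linear_merge` is hypothetical, the two `nn.Linear` layers are toy stand-ins for finetuned models, and it assumes that `alpha` weights the first model and `1 - alpha` the second.

```python
import torch
from torch import nn


def linear_merge(state_dict_a, state_dict_b, alpha=0.5):
    """Return alpha * A + (1 - alpha) * B for every floating-point tensor."""
    if state_dict_a.keys() != state_dict_b.keys():
        raise ValueError("Both models must expose the same parameter names.")

    merged = {}
    for name, tensor_a in state_dict_a.items():
        tensor_b = state_dict_b[name]
        if tensor_a.shape != tensor_b.shape:
            raise ValueError(f"Shape mismatch for {name}")
        if torch.is_floating_point(tensor_a):
            merged[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
        else:
            # Integer or boolean buffers cannot be averaged; keep model A's copy.
            merged[name] = tensor_a.clone()
    return merged


# Toy stand-ins for two finetuned models that share one architecture.
torch.manual_seed(0)
model_a = nn.Linear(4, 2)
model_b = nn.Linear(4, 2)

# The merged model has exactly the same size as each input model.
merged_model = nn.Linear(4, 2)
merged_model.load_state_dict(
    linear_merge(model_a.state_dict(), model_b.state_dict(), alpha=0.5)
)
```

With `alpha=1.0` the result is simply the first model, and with `alpha=0.0` the second, so `alpha` interpolates between the two sets of weights.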

# 2 - Load models

[GPT-Neo-125M is a transformer model designed using EleutherAI's replication of the GPT-3 architecture](https://huggingface.co/EleutherAI/gpt-neo-125m). GPT-Neo was trained as an autoregressive language model. This means that its core functionality is taking a string of text and predicting the next token.

**Note:** Since we are using linear model merging, we need to consider models that share the same architecture.

We are going to consider **two finetuned versions of GPT-Neo-125M**:

* [b3ck1/gpt-neo-125M-finetuned-beer-recipes](https://huggingface.co/b3ck1/gpt-neo-125M-finetuned-beer-recipes). This model was trained on a custom dataset of ~76,800 beer recipes from the internet. Recipes are generated in a YAML-like format:

  ```yaml
  style: Pilsner
  batch_size: 20
  efficiency: 70
  boil_size: 24
  boil_time: 60
  fermentables:
  - name: Pale Ale
    type: Grain
    amount: 6.5
  hops:
  - name: Saaz
    alpha: 3.5
    use: Boil
    time: 60
    amount: 0.06
  ...
  ```

* [flax-community/gpt-neo-125M-code-clippy-dedup-2048](https://huggingface.co/flax-community/gpt-neo-125M-code-clippy-dedup-2048?text=def+func%28%29%3A). The model was trained on the [CodeClippy dataset](https://huggingface.co/datasets/CodedotAI/code_clippy). This dataset was generated by selecting GitHub repositories from a large collection of repositories. These repositories are obtained from SEART GitHub Search using the following criteria:
  * More than 10 GitHub stars
  * More than 2 commits
  * Must have a licence
  * Exclude forks
  * Size < 70708 bytes
  
  These repositories are then combined with all of the GitHub repositories contained in The Pile and filtered for duplicate files. [A more detailed explanation of the dataset can be found here.](https://github.com/ncoop57/datasets/tree/code-clippy/datasets/code_clippy)


```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

# Fix the random seed so that sampled generations are reproducible
set_seed(32)

# Use the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained base model
base_model_name = "EleutherAI/gpt-neo-125M"
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, device_map=device, do_sample=True)
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Model finetuned on code (Clippy)
clippy_model_name = "flax-community/gpt-neo-125M-code-clippy-dedup-2048"
clippy_model = AutoModelForCausalLM.from_pretrained(clippy_model_name, device_map=device, do_sample=True)
clippy_tokenizer = AutoTokenizer.from_pretrained(clippy_model_name)

# Model finetuned on beer recipes (Beer)
beer_model_name = "b3ck1/gpt-neo-125M-finetuned-beer-recipes"
beer_model = AutoModelForCausalLM.from_pretrained(beer_model_name, device_map=device, do_sample=True)
beer_tokenizer = AutoTokenizer.from_pretrained(beer_model_name)
```
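
Because linear merging averages parameters element-wise, it can be worth confirming that two models really expose the same parameter names and shapes before merging them. The helper below is a hypothetical sketch (the name `find_mismatches` is not part of any library). It is shown on toy `nn.Linear` layers, but in this notebook one would pass `clippy_model` and `beer_model` instead.

```python
from torch import nn


def find_mismatches(model_a, model_b):
    """List parameter names that are missing from one model or differ in shape."""
    shapes_a = {name: tuple(t.shape) for name, t in model_a.state_dict().items()}
    shapes_b = {name: tuple(t.shape) for name, t in model_b.state_dict().items()}

    mismatches = []
    for name in sorted(shapes_a.keys() | shapes_b.keys()):
        if shapes_a.get(name) != shapes_b.get(name):
            mismatches.append((name, shapes_a.get(name), shapes_b.get(name)))
    return mismatches


# Toy stand-ins: identical layers give no mismatches, different layers do.
same = find_mismatches(nn.Linear(4, 2), nn.Linear(4, 2))
different = find_mismatches(nn.Linear(4, 2), nn.Linear(4, 3))
print(same)
print(different)
```

An empty list means that every tensor can be averaged directly. Any reported entry would need to be handled (or the model pair reconsidered) before merging.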

# 3 - Text generation capabilities

We have three models:
* A base model
* A model finetuned for coding tasks (Clippy)
* A model finetuned for generating beer recipes (Beer)

Before attempting to combine Clippy and Beer to create a potentially "better" model, we need to validate their individual strengths. 

This validation involves assessing each model's performance on its respective domain (coding for Clippy, beer recipes for Beer) compared to the base model. Ideally, Clippy should outperform both the base model and Beer on coding tasks, while Beer should demonstrate superior performance on beer recipe generation compared to the base model and Clippy.
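
The comparison in this notebook is qualitative: we read the generated text. If a numeric signal is wanted as well, one common option (not used in the original notebook) is perplexity on a held-out sample from each domain, where a lower value suggests a better fit. The sketch below is an assumption-laden illustration: `perplexity_from_logits` is a hypothetical helper that works on raw logits, so it can be tried without downloading any model.

```python
import math

import torch
import torch.nn.functional as F


def perplexity_from_logits(logits, token_ids):
    """Perplexity of a token sequence under next-token logits.

    logits:    (seq_len, vocab_size) scores, where row t predicts token t + 1.
    token_ids: (seq_len,) ids of the evaluated text.
    """
    # Row t predicts token t + 1, so drop the last logit row and the first token.
    loss = F.cross_entropy(logits[:-1], token_ids[1:])
    return math.exp(loss.item())


# Toy check: uniform logits over 10 tokens give a perplexity of about 10.
vocab_size = 10
token_ids = torch.tensor([1, 4, 2, 7, 3])
uniform_logits = torch.zeros(len(token_ids), vocab_size)
print(perplexity_from_logits(uniform_logits, token_ids))
```

For one of the Hugging Face models, the inputs could be obtained with `input_ids = tokenizer.encode(text, return_tensors="pt")`, then `logits = model(input_ids).logits[0]` and `token_ids = input_ids[0]`.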


----

**Note:** When calling the `generate()` method I was receiving the following warning:

```python
"Setting `pad_token_id` to `eos_token_id`:{eos_token_id} for open-end generation."
```

I checked StackOverflow and found that this warning is expected during open-ended generation: no padding token is configured, so `generate()` falls back to the end-of-sequence token. To silence it, pass `pad_token_id=tokenizer.eos_token_id` to the `generate()` call.

----

```python
def generate_text(model, tokenizer, prompt, device, temperature=1.0, max_length=500):
    """
    Generates text using a provided model, tokenizer, prompt, temperature, and max_length.

    Args:
        model: The loaded causal language model (e.g., AutoModelForCausalLM).
        tokenizer: The tokenizer associated with the model.
        prompt: The starting text for the generation.
        device: The device the model lives on (e.g., "cuda" or "cpu").
        temperature: Controls randomness of the generation (higher for more variation).
        max_length: The maximum total length in tokens (prompt plus generated tokens).

    Returns:
        The generated text as a string.
    """

    # Encode the prompt and move it to the same device as the model
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Generate text by sampling; pad_token_id silences the padding warning
    generated_ids = model.generate(
        input_ids=input_ids,
        do_sample=True,
        temperature=temperature,
        max_length=max_length,
        pad_token_id=tokenizer.eos_token_id,
    )

    # Decode the generated text
    generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

    return generated_text
```
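
To make the role of `temperature` more concrete: when sampling, the next-token logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The snippet below is a simplified sketch of that single step. The helper `next_token_probs` and the logits are made up for illustration, and other sampling settings such as top-k or top-p are ignored.

```python
import torch


def next_token_probs(logits, temperature=1.0):
    """Turn raw next-token logits into sampling probabilities."""
    return torch.softmax(logits / temperature, dim=-1)


# Hypothetical scores for a vocabulary of three tokens
logits = torch.tensor([2.0, 1.0, 0.0])

for t in (0.5, 1.0, 2.0):
    # Lower temperatures put more probability mass on the top-scoring token
    print(t, next_token_probs(logits, t))
```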

## 3.1 - Coding example

```python
code_prompt = "Write a Python function to greet the user by name:"
```

### 3.1.1 - Base model

```python
generated_text = generate_text(model=base_model, tokenizer=base_tokenizer, prompt=code_prompt, device=device)
print(generated_text)
```



```text
Write a Python function to greet the user by name:
from __future__ import (absolute_import, division, print_function)
from... import objects


# The next class constructor:

class A(object):
    def __init__(self, name, *args, **kwargs):
        super(A, self).__init__(*args, **kwargs)

        self.name = name
        self._name = *args[0]

        self.args = []

        self.c_class = object(self, **kwargs)

class B(object):
    def __init__(self, name, *args, **kwargs):
        super(B, self).__init__(*args, **kwargs)

        self.name = name
        self.c_class = object(self, **kwargs)

        self.args = []

        self.c_class = object(self, **kwargs)

        self.d_class = object(self, **kwargs)

        # The next class constructor:

class A(object):
    def __init__(self, name, *args, **kwargs):
        super(A, self).__init__(*args, **kwargs)

        self.name = name
        self._name = *args[0]

        self.args = []

        self.args = []

        self.c_class = o
```

### 3.1.2 - Clippy model

```python
generated_text = generate_text(model=clippy_model, tokenizer=clippy_tokenizer, prompt=code_prompt, device=device)
print(generated_text)
```

```text
Write a Python function to greet the user by name:

>>> from _proto.parser import parse_text
>>> p = _proto.parser('User')

>>> n = parse_text('You say that it's'+ f + ', '.join(char))

>>> p('')
<unlink>
<unlink>\n<unlink>\n<unlink>\n</unlink>\n<unlink>\n</unlink>\n</unlink>
<unlink>\n\n<unlink>\n</unlink>\n</unlink>\nimport int_types\nimport unittest\
\nfrom tests.unittest import TestCase
import numpy\
import pytest\
from distutils import *
import vtk\

from pathlib import Path as path
import os
from random import hex
import yaml
from typing import List, KzipOutput

import traceback
import traceback as tr
import subprocess
import six
import sys
if sys.version_info < 3:
import time
with open(OS.path.dirname(os.path.abspath) + "/", "wb") as f:
print('>>> time.sleep() << \x7f'
print('>>> time.time.sleep(0.) %d\n' % gettime() / str(sys.stdin.stderr.decode("\x5d")))
#print('>>> traceback.assert_module()\n' % tr.verbose)
if sys.version_info == 3:
print('>>> traceback.assert_module()\n' % _
```

### 3.1.3 - Beer model

```python
generated_text = generate_text(model=beer_model, tokenizer=beer_tokenizer, prompt=code_prompt, device=device)
print(generated_text)
```

```text
Write a Python function to greet the user by name: '
  useCLUSheel() 

  mime-time:
    name: "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t#\t#\t#\t#\t# 
    \
    \
    / 19 x 11.38 Ounces"
  type: Other
  use: Boil
  time: 60
  amount: 0.002
```

 


As can be seen, the beer model is the worst of the three on the coding task: it drifts back into its recipe format. Neither the Clippy model nor the base model produces a correct solution, but both at least generate Python-like code (most of the time). The models are probably too small to handle a code generation task properly.

## 3.2 - Beer recipe example

```python
beer_prompt = """style: Scottish Ale
batch_size: 20
efficiency: 75
boil_size:
"""
```

### 3.2.1 - Base model

```python
generated_text = generate_text(model=base_model, tokenizer=base_tokenizer, prompt=beer_prompt, device=device)
print(generated_text)
```

```text
style: Scottish Ale
batch_size: 20
efficiency: 75
boil_size:
          0
log_log:
            "0":
                 "5",
                 "0":
                     "3",
                 "0":
                     "7",
                     "0":
                         "4",
                         "3":
                             "12",
                             "0x7e9a2042e33da6869a79f1266a18a3a6d75",
                             "0x7"
                             "0x7c01e2042e35aa958e8f4bf60a5e5e05",
                             "0x3c7f039a7fa9800a3f3c7fe2a1bccd4",
```
                       


### 3.2.2 - Clippy model

```python
generated_text = generate_text(model=clippy_model, tokenizer=clippy_tokenizer, prompt=beer_prompt, device=device)
print(generated_text)
```

```text
style: Scottish Ale
batch_size: 20
efficiency: 75
boil_size:
- 100
diluzia_count:0
- 100
a_deco_todos:0
- 100
a_deco_decodos:
- 50
dolizia_count: 0
- 100
a_deco_decodos:100

batch_decode: 
data:
batch_size: 0
no_flop:true
decode_size: 150
decodos: []
todos: []
dilizia_count:
- 100
dolizia_count: 0
- 100
a_deco_todos:0
- 100
a_deco_decodos:0

batch_decode: 
data:
batch_size: 0
no_flop:true
decode_size: 150
decodos: []
todos: []
dilizia_count:
- 100
dolizia_count: 0
- 100
a_deco_todos:0
- 100
a_deco_decodos:0

batch_decode: 
data:
batch_size: 1
no_flop:true
decode_size: 150
decodos: []
todos: []
dilizia_count:
- 100
dolizia_count: 0
- 100
a_deco_todos:0
- 100
a_deco_decodos:0

batch_decode: 
data:
batch_size: 1
no_flop:true
decode_size: 150
decodos: []
todos: []
dilizia_count:
- 100
dolizia_count: 0
- 100
a_deco_todos:0
- 100
a_deco_decodos:0
```



### 3.2.3 - Beer model

```python
generated_text = generate_text(model=beer_model, tokenizer=beer_tokenizer, prompt=beer_prompt, device=device)
print(generated_text)
```

```text
style: Scottish Ale
batch_size: 20
efficiency: 75
boil_size:
- 3
  volume_size: 34
 time: 60
- name:'Dry Malt Extract - Dark '
  type: Grain
  amount: 9.979
- name: White Wheat Flaked
  type: Adjunct
  amount: 0.454
- name: Acidulated Malt
  type: Grain
  amount: 0.255
hops:
- name: Columbus
  alpha: 15.0
  use: Boil
  time: 60
  amount: 0.014
yeasts:
- name: California Ale Yeast WLP001
  amount: 0.1
  min_temperature: 20
  max_temperature: 23
primary_temp: null
mash_steps:
- step_temp: 67
  step_time: 60
miscs: []
```

 


In this case, the beer model is (as expected) the best at generating beer recipes in the specific YAML-like format. Interestingly, although the Clippy model does not follow the recipe format, it still generates somewhat reasonable YAML-like text.

# 4 - Merge all layers

# 5 - Merge selected layers

# 6 - Hierarchical merging