<a href="https://colab.research.google.com/github/bacoco/LLM_train/blob/main/laserQlora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **laser-QLoRA - A new way to train your model for a specific task.**

by [Fernando Fernandes Neto](https://twitter.com/FernandoNetoAi), [David Golchinfar](https://twitter.com/DavidGFar) and [Eric Hartford](https://twitter.com/erhartford)

supported by [VAGO solutions](https://vago-solutions.de) and [Hyperspace.ai](https://hyperspace.computer)





---


With this notebook, we present a novel training strategy for the SFT and DPO training process: after a laser-like analysis of the model, we partially freeze it to navigate and optimize the trade-offs highlighted by the no-free-lunch theorem. This training method effectively prevents a major problem of language models, the forgetting of previously acquired knowledge. This is particularly important when teaching the model specific skills, such as a new language, where it could otherwise lose a significant amount of its prior knowledge and show a decline in overall intelligence.

The main contribution of the following script is to identify layers with superior signal-to-noise efficiency, that is, layers that may be more impactful or essential for the model's effectiveness. Layers with a higher SNR relative to their maximum singular value are viewed as having weights that contribute more to the model's output, which lays the groundwork for optimizing and refining the model.


---





**Overview:**

Here, we demonstrate what training with laser-QLoRA might look like. In brief: first, the script is executed, which, among other outputs, generates a JSON file listing, for each module type, the 16 layers with the highest ratio of SNR to maximum singular value. We then show how to use the extracted layers in Axolotl or LlamaFactory for your training.


---



### **The laser-scanner script:**





# %%
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
import gc
import json


# %%
model_name = "mistralai/Mistral-7B-v0.1"  # Change to your preferred model

class ModelModifier:
    def __init__(self, model_name):
        self.model_name = model_name
        self.model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map={"":0})
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.layer_snr = {}
        self.modified_layers = set()
        self.original_weights = {}

    def calculate_snr_for_layer(self, layer_type, layer_number):
        for name, module in self.model.named_modules():
            if layer_type in name and str(layer_number) in name:
                weights = module.weight.double()
                S = torch.linalg.svdvals(weights)
                max_singular_value = S[0].item()  # Largest singular value (svdvals returns them in descending order)
                weights = weights.detach().cpu()
                S = S.detach().cpu()
                sigma_estimated = self.estimate_sigma_with_full_iqr(S)
                n, m = weights.shape
                mp_threshold = self.marchenko_pastur_threshold(sigma_estimated, n, m)

                signal = S[S > mp_threshold].sum()
                noise = S[S <= mp_threshold].sum()
                snr = signal / noise if noise != 0 else float('inf')
                snr_ratio = snr / max_singular_value  # Ratio of the SNR to the largest singular value
                del S, weights
                torch.cuda.empty_cache()  # Clear PyTorch's CUDA memory cache
                gc.collect()
                return snr_ratio  # Returns the ratio

    @staticmethod
    def marchenko_pastur_threshold(sigma, n, m):
        beta = n / m if n < m else m / n
        threshold = sigma * np.sqrt((1 + np.sqrt(beta))**2)
        return threshold

    ## Estimate the standard deviation of the singular values from the interquartile range (IQR)

    @staticmethod
    def estimate_sigma_with_full_iqr(S):
        q75 = torch.quantile(S, 0.75)
        q25 = torch.quantile(S, 0.25)
        iqr = q75 - q25
        sigma_estimated = iqr / 1.349 ## For a normal distribution IQR = 1.349 * sigma (each quartile lies 0.6745 * sigma from the median)
        return sigma_estimated


    def assess_layers_snr(self, layer_types, layer_numbers):
        for name, module in self.model.named_modules():
            for layer_number in layer_numbers:
                for layer_type in layer_types:
                    if layer_type in name and str(layer_number) in name:
                        print("*"*50, flush=True)
                        print(f"Calculating Signal to Noise Ratio at layer {name}", flush=True)
                        snr_ratio = self.calculate_snr_for_layer(layer_type, layer_number)
                        self.layer_snr[name] = {'snr_ratio': snr_ratio, 'module': name}
                        print(f"Signal to Noise Ratio at layer {name} = {snr_ratio}", flush=True)
                        print("*"*50, flush=True)


    def save_layers_to_json(self, filename="layer_snr_info.json"):
        with open(filename, 'w') as file:
            serializable_data = {}
            for key, value in self.layer_snr.items():
                # Convert Tensors to Python numbers (for SNR) and handle other data types as needed
                snr_value = value['snr_ratio'].item() if isinstance(value['snr_ratio'], torch.Tensor) else value['snr_ratio']
                module_str = str(value['module'])  # Assuming module representation is a string or convertible to a string
                serializable_data[key] = {'snr': snr_value, 'module': module_str}

            json.dump(serializable_data, file, indent=4)



    def get_top_snr_ratios(self, top_n=16):
        # Initialize a dictionary to store the SNR ratios for the specific modules
        snr_ratios_per_specific_module = {
            'self_attn.v_proj': [],
            'self_attn.k_proj': [],
            'self_attn.o_proj': [],
            'self_attn.q_proj': [],
            'mlp.down_proj': [],
            'mlp.up_proj': [],
            'mlp.gate_proj': []
        }

        # Run through all layer SNR entries
        for name, value in self.layer_snr.items():
            snr_ratio = value['snr_ratio']
            layer_name = value['module']

            # For each specific module, check if the layer name contains the module
            for specific_module in snr_ratios_per_specific_module.keys():
                if specific_module in layer_name:
                    # Add the layer name and SNR value to the corresponding entry
                    snr_ratios_per_specific_module[specific_module].append((layer_name, snr_ratio))
                    break  # End the loop when the module is found to avoid duplicate entries

        # Sort and extract the top_n SNR values for each specific module
        top_snr_layers = {}
        for module, snr_ratios in snr_ratios_per_specific_module.items():
            sorted_layers = sorted(snr_ratios, key=lambda x: x[1], reverse=True)  # Sort by SNR value
            top_snr_layers[module] = [layer[0] for layer in sorted_layers[:top_n]]  # Saving the layer names

        return top_snr_layers


    def save_top_snr_ratios_to_json(self, top_snr_layers, filename="top_snr_ratios.json"):
        with open(filename, 'w') as file:
            json.dump(top_snr_layers, file, indent=4)


# Usage
modifier = ModelModifier(model_name)

# %%
layer_numbers = list(range(31, -1, -1))
layer_numbers = [f".{l}." for l in layer_numbers]
print(layer_numbers)

layer_types = ['mlp.gate_proj', 'mlp.down_proj', 'mlp.up_proj', 'self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj', 'self_attn.o_proj']

# %%
## Scan all layers and compute the SNR / max singular value ratio

modifier.assess_layers_snr(layer_types, layer_numbers)
top_snr_ratios = modifier.get_top_snr_ratios() # Define your specific top_n here otherwise it will be top_n=16

print("Finished laserRMT scanning.", flush=True)

# Save the layer information to a JSON file
modifier.save_top_snr_ratios_to_json(top_snr_ratios, "laser_scan_mistral_top_snr.json")
modifier.save_layers_to_json("laser_scan_mistral.json")
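
The scan matches modules by substring, which is why each layer index is wrapped in dots (for example `.1.`) before it is passed in. The following is a minimal sketch of that matching logic; the helper names `layer_tokens` and `matching_modules` are hypothetical, and the module names are written by hand in the Hugging Face naming style rather than read from a loaded model.

```python
def layer_tokens(num_layers):
    """Return layer indices wrapped in dots, e.g. '.3.', highest layer first."""
    return [f".{i}." for i in range(num_layers - 1, -1, -1)]


def matching_modules(names, layer_type, layer_token):
    """Select module names that contain both the layer type and the index token."""
    return [n for n in names if layer_type in n and layer_token in n]


# Hand-written module names (illustrative only, not read from a model).
names = [
    "model.layers.1.self_attn.q_proj",
    "model.layers.10.self_attn.q_proj",
    "model.layers.11.self_attn.q_proj",
]

tokens = layer_tokens(32)
dotted = matching_modules(names, "self_attn.q_proj", ".1.")
bare = matching_modules(names, "self_attn.q_proj", "1")

print(dotted)     # only the layer-1 module
print(len(bare))  # a bare "1" also hits layers 10 and 11
```

For a model with a different depth, the layer count would need to be adjusted; on Hugging Face configs such as Mistral's it should be available as `model.config.num_hidden_layers`.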





The key part of this script is the `calculate_snr_for_layer` method. It analyzes the signal-to-noise ratio (SNR) of a specific layer within the model by extracting the singular values of the layer's weights and applying statistical measures to them. Here is a step-by-step breakdown of the process:

1. **Identify Layer Weights**: For a given layer type and number, the method iterates through the model's layers to find a match. Once found, it extracts the weights of the layer and converts them to double precision for accurate computation.

2. **Singular Value Decomposition (SVD) Values**: The method calculates the singular values (\(S\)) of the layer's weight matrix using PyTorch's `torch.linalg.svdvals` function. This step is crucial for assessing the layer's information content through its singular values.

3. **Maximum Singular Value**: It records the maximum singular value (\(S[0]\)), which represents the highest magnitude of signal strength in the layer's weights.

4. **Estimate Sigma with IQR**: Using the interquartile range (IQR) of the singular values, it estimates their standard deviation (\(\sigma\)). This estimate is used to set the threshold that separates signal from noise:
   \[\sigma = \frac{IQR}{1.349}\]


5. **Marchenko-Pastur Threshold**: The method then calculates the Marchenko-Pastur threshold (\(\lambda\)) to separate the singular values into signal and noise categories. This threshold is computed using the formula:
   \[\lambda = \sigma \sqrt{(1 + \sqrt{\beta})^2}\]
   where \(\beta\) is the aspect ratio of the weight matrix (\(n/m\) or \(m/n\), whichever is smaller).

6. **Signal and Noise Calculation**: The singular values greater than the Marchenko-Pastur threshold (\(\lambda\)) are considered signal, and those at or below it are considered noise. The method sums these groups of singular values separately to quantify the total signal (\(\sum_{\sigma_i > \lambda} \sigma_i\)) and total noise (\(\sum_{\sigma_i \leq \lambda} \sigma_i\)).

7. **Signal-to-Noise Ratio (SNR)**: The SNR is calculated by dividing the total signal by the total noise. In cases where the noise is zero (to avoid division by zero), the SNR is set to infinity (\(\infty\)), indicating a layer with overwhelmingly dominant signal content.

8. **SNR Ratio Relative to Maximum Singular Value**: The method further refines the SNR analysis by calculating the ratio of the SNR to the maximum singular value. This ratio provides insight into how the layer's strongest signal component compares to the overall signal-to-noise balance:
   \[SNR\ Ratio = \frac{SNR}{\text{max singular value}}\]

9. **Memory Management**: After the calculations, the method clears the allocated memory for the singular values and weights to optimize memory usage and prevent memory leaks.

This detailed analysis enables the identification of layers with high signal-to-noise efficiency, indicating layers that are potentially more influential or critical to the model's performance. Layers with higher SNR ratios relative to their maximum singular value are considered to have weights that are more effectively contributing to the model's output, providing a basis for model optimization and refinement.
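
To make steps 4 to 8 concrete, here is a small worked sketch in plain Python. It skips the SVD and starts from a hand-picked list of singular values for a hypothetical 8x8 matrix; the helper names `quantile` and `snr_ratio` are illustrative and not part of the script. The quantile helper uses linear interpolation, which is also the default of `torch.quantile`, and the threshold is written in its simplified form \(\sigma (1 + \sqrt{\beta})\), which equals the formula in step 5.

```python
import math


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an ascending list."""
    pos = q * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def snr_ratio(singular_values, n, m):
    """Return (SNR / max singular value, threshold) for an n x m weight matrix."""
    s = sorted(singular_values)
    # Step 4: sigma from the interquartile range.
    sigma = (quantile(s, 0.75) - quantile(s, 0.25)) / 1.349
    # Step 5: threshold; sqrt((1 + sqrt(beta))**2) simplifies to 1 + sqrt(beta).
    beta = min(n, m) / max(n, m)
    threshold = sigma * (1 + math.sqrt(beta))
    # Step 6: split the singular values into signal and noise.
    signal = sum(v for v in s if v > threshold)
    noise = sum(v for v in s if v <= threshold)
    # Steps 7 and 8: SNR, then its ratio to the largest singular value.
    snr = signal / noise if noise != 0 else float("inf")
    return snr / max(s), threshold


# Two dominant singular values and six small ones (made-up numbers).
toy_values = [10.0, 8.0, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
ratio, threshold = snr_ratio(toy_values, n=8, m=8)
print(round(threshold, 2))  # about 3.08, so 10.0 and 8.0 count as signal
print(round(ratio, 3))      # about 0.4: (18 / 4.5) / 10
```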



---




### **Next step after the script is finished:**


First, we concentrate only on the content of laser_scan_mistral_top_snr.json. The result will look something like this:

```
{
    "self_attn.v_proj": [
        "model.layers.3.self_attn.v_proj",
        "model.layers.2.self_attn.v_proj",
        "model.layers.1.self_attn.v_proj",
        "model.layers.0.self_attn.v_proj",
	...,
	..
    ],
    "self_attn.k_proj": [
        "model.layers.2.self_attn.k_proj",
        "model.layers.0.self_attn.k_proj",
        "model.layers.1.self_attn.k_proj",
        "model.layers.3.self_attn.k_proj",
	...,
	..
    ],
    "self_attn.o_proj": [
        "model.layers.0.self_attn.o_proj",
        "model.layers.3.self_attn.o_proj",
        "model.layers.2.self_attn.o_proj",
        "model.layers.1.self_attn.o_proj",
	...,
	..
    ],
    "self_attn.q_proj": [
        "model.layers.0.self_attn.q_proj",
        "model.layers.1.self_attn.q_proj",
        "model.layers.2.self_attn.q_proj",
        "model.layers.3.self_attn.q_proj",
	...,
	..
    ],
    "mlp.down_proj": [
        "model.layers.1.mlp.down_proj",
        "model.layers.2.mlp.down_proj",
        "model.layers.3.mlp.down_proj",
        "model.layers.0.mlp.down_proj",
	...,
	..
    ],
    "mlp.up_proj": [
        "model.layers.3.mlp.up_proj",
        "model.layers.2.mlp.up_proj",
        "model.layers.1.mlp.up_proj",
        "model.layers.0.mlp.up_proj",
	...,
	..
    ],
    "mlp.gate_proj": [
        "model.layers.3.mlp.gate_proj",
        "model.layers.2.mlp.gate_proj",
        "model.layers.1.mlp.gate_proj",
        "model.layers.0.mlp.gate_proj",
	...,
	..
    ]
}
```
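
The training configs further down list module names without the leading `model.` prefix. As a minimal sketch, assuming the JSON has the shape shown above, the file content could be flattened into such a list like this. The helper name `to_target_modules` is hypothetical, and the example parses a short inline JSON string instead of opening the real file.

```python
import json


def to_target_modules(top_snr_layers, top_n=None):
    """Flatten {module_type: [module names]} into names without 'model.'."""
    prefix = "model."
    targets = []
    for names in top_snr_layers.values():
        for name in names[:top_n]:  # top_n=None keeps every listed layer
            if name.startswith(prefix):
                name = name[len(prefix):]
            targets.append(name)
    return targets


# Shortened stand-in for the content of laser_scan_mistral_top_snr.json.
example = json.loads("""
{
    "self_attn.v_proj": ["model.layers.3.self_attn.v_proj", "model.layers.2.self_attn.v_proj"],
    "mlp.down_proj": ["model.layers.1.mlp.down_proj", "model.layers.2.mlp.down_proj"]
}
""")

targets = to_target_modules(example, top_n=1)
yaml_lines = "\n".join(f"  - {t}" for t in targets)
print("lora_target_modules:")
print(yaml_lines)
```

With the real file, `example` would come from `json.load` on the opened JSON file instead of the inline string.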


---





### **The procedure for Axolotl:**

Open your Axolotl config.yml. In this example, we used the top 16 SNR values of dolphin-2.6-mistral-7b-dpo-laser:


```
base_model: cognitivecomputations/dolphin-2.6-mistral-7b-dpo-laser
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
#we used a small dataset to teach the model function calling abilities
  - path: ./data/function_calling_2k.json
    ds_type: json
    type: sharegpt

dataset_prepared_path: last_run_function_call
#0.05
val_set_size: 0.02
output_dir: ./laser-qlora-out-dolphin-function-top16

adapter: qlora
lora_model_dir:

sequence_len: 8192
sample_packing: false
eval_sample_packing: true
pad_to_sequence_len: true

# Important: to get the same number of trainable parameters as a QLoRA training with lora_r=32 and lora_alpha=16, adjust lora_r to the number of filtered layers you use. With top_n=4 you would use lora_r=256.

lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: false
lora_fan_in_fan_out:
lora_target_modules:
  - layers.30.self_attn.q_proj
  - layers.0.self_attn.q_proj
  - layers.1.self_attn.q_proj
  - layers.15.self_attn.q_proj
  - layers.12.self_attn.q_proj
  - layers.11.self_attn.q_proj
  - layers.14.self_attn.q_proj
  - layers.9.self_attn.q_proj
  - layers.16.self_attn.q_proj
  - layers.18.self_attn.q_proj
  - layers.13.self_attn.q_proj
  - layers.10.self_attn.q_proj
  - layers.7.self_attn.q_proj
  - layers.8.self_attn.q_proj
  - layers.4.self_attn.q_proj
  - layers.19.self_attn.q_proj
  - layers.27.self_attn.k_proj
  - layers.24.self_attn.k_proj
  - layers.25.self_attn.k_proj
  - layers.22.self_attn.k_proj
  - layers.26.self_attn.k_proj
  - layers.29.self_attn.k_proj
  - layers.23.self_attn.k_proj
  - layers.28.self_attn.k_proj
  - layers.21.self_attn.k_proj
  - layers.31.self_attn.k_proj
  - layers.30.self_attn.k_proj
  - layers.20.self_attn.k_proj
  - layers.5.self_attn.k_proj
  - layers.19.self_attn.k_proj
  - layers.17.self_attn.k_proj
  - layers.18.self_attn.k_proj
  - layers.31.self_attn.v_proj
  - layers.19.self_attn.v_proj
  - layers.24.self_attn.v_proj
  - layers.18.self_attn.v_proj
  - layers.5.self_attn.v_proj
  - layers.3.self_attn.v_proj
  - layers.16.self_attn.v_proj
  - layers.23.self_attn.v_proj
  - layers.27.self_attn.v_proj
  - layers.25.self_attn.v_proj
  - layers.26.self_attn.v_proj
  - layers.20.self_attn.v_proj
  - layers.6.self_attn.v_proj
  - layers.15.self_attn.v_proj
  - layers.17.self_attn.v_proj
  - layers.29.self_attn.v_proj
  - layers.30.self_attn.o_proj
  - layers.12.self_attn.o_proj
  - layers.9.self_attn.o_proj
  - layers.14.self_attn.o_proj
  - layers.0.self_attn.o_proj
  - layers.6.self_attn.o_proj
  - layers.8.self_attn.o_proj
  - layers.10.self_attn.o_proj
  - layers.11.self_attn.o_proj
  - layers.13.self_attn.o_proj
  - layers.24.self_attn.o_proj
  - layers.5.self_attn.o_proj
  - layers.15.self_attn.o_proj
  - layers.7.self_attn.o_proj
  - layers.17.self_attn.o_proj
  - layers.25.self_attn.o_proj
  - layers.31.mlp.gate_proj
  - layers.30.mlp.gate_proj
  - layers.4.mlp.gate_proj
  - layers.3.mlp.gate_proj
  - layers.28.mlp.gate_proj
  - layers.29.mlp.gate_proj
  - layers.6.mlp.gate_proj
  - layers.27.mlp.gate_proj
  - layers.5.mlp.gate_proj
  - layers.26.mlp.gate_proj
  - layers.25.mlp.gate_proj
  - layers.7.mlp.gate_proj
  - layers.2.mlp.gate_proj
  - layers.24.mlp.gate_proj
  - layers.23.mlp.gate_proj
  - layers.10.mlp.gate_proj
  - layers.30.mlp.up_proj
  - layers.4.mlp.up_proj
  - layers.6.mlp.up_proj
  - layers.5.mlp.up_proj
  - layers.27.mlp.up_proj
  - layers.25.mlp.up_proj
  - layers.26.mlp.up_proj
  - layers.17.mlp.up_proj
  - layers.24.mlp.up_proj
  - layers.7.mlp.up_proj
  - layers.10.mlp.up_proj
  - layers.3.mlp.up_proj
  - layers.23.mlp.up_proj
  - layers.11.mlp.up_proj
  - layers.9.mlp.up_proj
  - layers.14.mlp.up_proj
  - layers.29.mlp.down_proj
  - layers.19.mlp.down_proj
  - layers.20.mlp.down_proj
  - layers.18.mlp.down_proj
  - layers.21.mlp.down_proj
  - layers.1.mlp.down_proj
  - layers.28.mlp.down_proj
  - layers.22.mlp.down_proj
  - layers.23.mlp.down_proj
  - layers.30.mlp.down_proj
  - layers.4.mlp.down_proj
  - layers.17.mlp.down_proj
  - layers.2.mlp.down_proj
  - layers.15.mlp.down_proj
  - layers.27.mlp.down_proj
  - layers.5.mlp.down_proj
  # important: you need to unfreeze the lm_head
  - lm_head

wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00025

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
eval_steps: 0.2
eval_table_size:
eval_table_max_new_tokens: 128
save_steps:
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

```



After adjusting your config.yml you can start your axolotl training as usual.
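
The `lora_r` comment in the config can be checked with a short calculation. LoRA adds `r * (in_features + out_features)` trainable parameters per adapted linear layer, so adapting fewer layers at a proportionally higher rank keeps the total constant. The sketch below assumes the Mistral-7B projection shapes (hidden size 4096, key/value width 1024, MLP width 14336) and 32 layers, and it ignores the additional adapter on the output head; the function names are hypothetical.

```python
# (in_features, out_features) per projection type; assumed Mistral-7B shapes.
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}


def lora_params(rank, layers_per_type):
    """LoRA parameters when every projection type is adapted in
    `layers_per_type` layers: rank * (in + out) per adapted matrix."""
    return sum(rank * (i + o) * layers_per_type for i, o in SHAPES.values())


def matched_rank(base_rank, num_layers, top_n):
    """Rank that keeps the parameter count equal to adapting all layers."""
    return base_rank * num_layers // top_n


full = lora_params(32, 32)                        # classic QLoRA, all 32 layers
top16 = lora_params(matched_rank(32, 32, 16), 16)
top4 = lora_params(matched_rank(32, 32, 4), 4)

print(matched_rank(32, 32, 16), matched_rank(32, 32, 4))  # 64 256
print(full == top16 == top4)                               # True
```

This matches the config above, which uses `lora_r: 64` for the top 16 layers per module type.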


---



### **The procedure for Llama-Factory:**

Example for a multi-gpu DPO training:


```
# Add your layers to unfreeze via --lora_target; don't forget lm_head at the end.
accelerate launch src/train_bash.py \
    --deepspeed dsconfig.json \
    --stage dpo \
    --model_name_or_path your_model \
    --quantization_bit 4 \
    --do_train \
    --dataset your_dataset \
    --template mistral \
    --finetuning_type lora \
    --lora_target layers.0.self_attn.q_proj,layers.1.self_attn.q_proj,layers.15.self_attn.q_proj,layers.12.self_attn.q_proj,layers.27.self_attn.k_proj,layers.24.self_attn.k_proj,layers.25.self_attn.k_proj,layers.22.self_attn.k_proj,layers.19.self_attn.v_proj,layers.24.self_attn.v_proj,layers.18.self_attn.v_proj,layers.30.self_attn.v_proj,layers.30.self_attn.o_proj,layers.12.self_attn.o_proj,layers.9.self_attn.o_proj,layers.14.self_attn.o_proj,layers.31.mlp.gate_proj,layers.30.mlp.gate_proj,layers.4.mlp.gate_proj,layers.3.mlp.gate_proj,layers.29.mlp.up_proj,layers.6.mlp.up_proj,layers.4.mlp.up_proj,layers.5.mlp.up_proj,layers.30.mlp.down_proj,layers.19.mlp.down_proj,layers.20.mlp.down_proj,layers.18.mlp.down_proj,lm_head \
    --output_dir spin_e1- \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 1000 \
    --learning_rate 2.5e-04 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --bf16 \
    --split train \
    --report_to=wandb \
    --cutoff_len 2000 \
    --save_safetensors \
    --warmup_steps 100 \
    --optim paged_adamw_8bit \
    --lora_dropout 0.05
```
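
Because the `--lora_target` value is one long comma-separated string, it is easy to drop or duplicate an entry. A small sanity check, sketched here with a hypothetical helper `count_targets_per_type`, counts the targets per projection type; in the example command each projection type appears four times. The string below is a shortened stand-in, not the full argument.

```python
from collections import Counter


def count_targets_per_type(lora_target):
    """Count comma-separated target modules by their last name component."""
    names = [n.strip() for n in lora_target.split(",") if n.strip()]
    return Counter(n.rsplit(".", 1)[-1] for n in names)


# Shortened stand-in for a --lora_target value.
lora_target = (
    "layers.0.self_attn.q_proj,layers.1.self_attn.q_proj,"
    "layers.27.self_attn.k_proj,layers.24.self_attn.k_proj"
)

counts = count_targets_per_type(lora_target)
balanced = len(set(counts.values())) == 1  # same number of layers per type
print(dict(counts))
print(balanced)
```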


---





### **Results**

Performance does not scale proportionally with the number of layers you unfreeze. Here is an example using the benchmarks from the FC-SauerkrautLM, which we trained with the function_calling dataset:


![fc_results.PNG](https://vago-solutions.de/wp-content/uploads/2024/02/fc_results.png)



All models are based on the SauerkrautLM-7b-LaserChat model and were trained using the same effective batch size, as well as identical hyperparameters and trainable parameters (adjusted by the value of lora_r). It is observable that, **on average, the laser-QLoRA trainings performed better than classic QLoRA.**

In particular, **laser-QLoRA top 3 and laser-QLoRA top 16 significantly outperformed classic QLoRA.** Function calling, as well as other tasks with very specific training data, such as learning new languages or RAG, benefit from the laser-QLoRA approach.

Our findings indicate that our approach not only surpasses QLoRA in direct training comparisons within the Open LLM leaderboard benchmarks but also reveals that using just a fraction of the typical function calling training data (2,000 samples), a model can effectively identify and employ functions in multi-turn conversations. Traditionally, function calling datasets encompass over 100,000 samples, a scale that could significantly induce the base model to lose a considerable portion of its pre-existing skills during QLoRA training.



---







### **Future Work**

Many more experiments need to be conducted. However, so far, we have been able to identify significant improvements compared to traditional QLoRA training, especially when addressing specific use cases. For instance, companies can significantly enhance their RAG systems through this approach, as their base models can be improved through training with laser-QLoRA. What does this mean precisely? It means that the model can be trained on small datasets that represent the RAG operations, thereby significantly enhancing its extraction and reasoning capabilities.

This approach is not limited to large language models. Visual models, such as Stable Diffusion, can also be significantly optimized through the laser scanner and subsequent laser-QLoRA training.