# Korean CP (Continual Pre-trained) Vector Transfer to Instruction Model

- This notebook builds on the idea presented in the [Chat-Vector paper](https://arxiv.org/abs/2310.04799v2).
- Chat Vector takes the parameter difference between a pre-trained model and its instruction model and adds it to a CP model.
- If that works, the reverse should be testable as well: we add the parameter difference between the English and Korean pre-trained models to Meta's instruction model and assess the resulting performance.
- The implementation is based on [StableFluffy's code](https://github.com/StableFluffy/EasyLLMFeaturePorter).
- Following StableFluffy's code closely, the parameter update applies the difference with a per-element weight rather than adding it directly.
- Because memory is limited, models are loaded and deleted several times along the way.
- In full precision (FP32), two models do not fit in 64GB, so the models must be loaded in half precision (FP16 or BF16).
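
The weighted update described above can be written out as follows. This is a reading of the helper functions defined later in this notebook, not a formula taken from the Chat-Vector paper; the symbols $\theta_{\text{base}}$, $\theta_{\text{inst}}$, $\theta_{\text{CP}}$, and $\theta_{\text{merged}}$ are names introduced here for illustration, and min/max are taken per weight tensor.

```latex
\begin{aligned}
d &= \lvert \theta_{\text{inst}} - \theta_{\text{base}} \rvert \\
r &= \sigma\!\left(12 \cdot \frac{d - \min d}{\max d - \min d} - 6\right) \\
\theta_{\text{merged}} &= \theta_{\text{inst}} + (1 - r) \odot \left(\theta_{\text{CP}} - \theta_{\text{base}}\right)
\end{aligned}
```

When $\max d - \min d$ is below the code's `epsilon`, $r$ is set to 0 and the CP difference is added in full. Elements that instruction tuning changed the most get $r$ close to 1 and therefore receive almost none of the CP difference.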

### Environment
- The computer used as the code server has limited specifications.
- 12th-generation Intel NUC with an Intel 1260P CPU
- 64GB RAM, 20GB swap memory (since it is impossible to load and compute multiple models simultaneously with just 64GB, ample swap memory is essential)
- RTX 3090 24GB eGPU
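
The memory constraint can be checked with rough arithmetic. The sketch below assumes about 8 billion parameters per model (an approximation for an 8B model, not a figure from this notebook) and counts weights only, ignoring any loading overhead.

```python
def model_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

fp32_gb = model_size_gb(NUM_PARAMS, 4)  # FP32 uses 4 bytes per parameter
bf16_gb = model_size_gb(NUM_PARAMS, 2)  # FP16/BF16 use 2 bytes per parameter

print(f"FP32: {fp32_gb:.0f} GB per model, {2 * fp32_gb:.0f} GB for two models")
print(f"BF16: {bf16_gb:.0f} GB per model, {2 * bf16_gb:.0f} GB for two models")
```

Under these assumptions, the difference and ratio dictionaries computed later are each about as large as one model, so holding two models plus both dictionaries in BF16 would already approach 64GB. That is consistent with the need for swap memory.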

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
# Define helper functions

def calculate_weight_diff(a, b):
    return a - b

def calculate_model_diffs(model_a, model_b):
    model_a_dict = model_a.state_dict()
    model_b_dict = model_b.state_dict()
    model_diffs = {}
    for key in model_a_dict.keys():
        if key in model_b_dict:
            model_diffs[key] = calculate_weight_diff(model_a_dict[key], model_b_dict[key])
            print(f"Diff calculated for {key}")
    return model_diffs

def calculate_sigmoid_ratios(base_model, target_model, epsilon=1e-6):
    sigmoid_ratios = {}
    # Argument order does not matter here because only the absolute difference is used.
    target_diff = calculate_model_diffs(target_model, base_model)
    for key in target_diff.keys():
        diff_tensor = target_diff[key].abs()
        diff_min = diff_tensor.min().item()
        diff_max = diff_tensor.max().item()
        print(f"Key: {key}")
        print(f"  Diff Min: {diff_min}")
        print(f"  Diff Max: {diff_max}")

        # # Displaying histogram for the tensor distribution
        # plt.hist(diff_tensor.cpu().numpy().flatten(), bins=50)
        # plt.title(f"Histogram of differences for {key}")
        # plt.xlabel("Difference")
        # plt.ylabel("Frequency")
        # plt.show()
        
        if abs(diff_max - diff_min) < epsilon:
            print("  All values are the same. Setting sigmoid_diff to 0.")
            sigmoid_diff = torch.zeros_like(diff_tensor)
        else:
            # Min-max normalize to [0, 1], then squash with a steep sigmoid
            # so the ratios span roughly 0.0025 to 0.9975.
            normalized_diff = (diff_tensor - diff_min) / (diff_max - diff_min)
            sigmoid_diff = torch.sigmoid(normalized_diff * 12 - 6)
        sigmoid_ratios[key] = sigmoid_diff
        print(f"  Sigmoid Diff Min: {sigmoid_diff.min().item()}")
        print(f"  Sigmoid Diff Max: {sigmoid_diff.max().item()}")

    return sigmoid_ratios

def apply_model_diffs(target_model, model_diffs, sigmoid_ratios):
    target_state_dict = target_model.state_dict()
    for key in model_diffs.keys():
        print(key)
        print(model_diffs[key])
        ratio = sigmoid_ratios[key]
        print(ratio)
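        # Elements that changed a lot during instruction tuning (ratio near 1)
        # receive little of the CP diff; unchanged elements receive almost all of it.
        scaled_diff = model_diffs[key] * (1 - ratio)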
        target_state_dict[key] += scaled_diff
        print(f"Diff applied for {key}")
    target_model.load_state_dict(target_state_dict)
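
To see what the gating does, here is a minimal sketch that applies the same steps to a single three-element tensor. `gated_transfer` and the toy values are hypothetical and only mirror the logic of `calculate_sigmoid_ratios` and `apply_model_diffs` above.

```python
import torch

def gated_transfer(base, inst, cp, epsilon=1e-6):
    """Toy single-tensor version of the update used in this notebook."""
    diff = (inst - base).abs()
    d_min, d_max = diff.min(), diff.max()
    if (d_max - d_min).item() < epsilon:
        ratio = torch.zeros_like(diff)
    else:
        normalized = (diff - d_min) / (d_max - d_min)
        ratio = torch.sigmoid(normalized * 12 - 6)
    merged = inst + (cp - base) * (1 - ratio)
    return merged, ratio

base = torch.tensor([0.0, 0.0, 0.0])
inst = torch.tensor([0.0, 0.5, 1.0])  # instruction tuning moved the last element most
cp = torch.tensor([1.0, 1.0, 1.0])    # continual pre-training moved every element by 1.0

merged, ratio = gated_transfer(base, inst, cp)
print("ratio: ", ratio)
print("merged:", merged)
```

The first element, untouched by instruction tuning, takes almost the whole CP shift, while the last element keeps nearly all of its instruction-tuned value.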

In [3]:
# Load Models
cp_model_name = "beomi/Llama-3-KoEn-8B-preview"
base_model_name = "meta-llama/Meta-Llama-3-8B"
target_model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

In [4]:
cache_model_dir="/mnt/t7/.cache/huggingface/models"

In [5]:
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.bfloat16, cache_dir=cache_model_dir)
target_model = AutoModelForCausalLM.from_pretrained(target_model_name, torch_dtype=torch.bfloat16, cache_dir=cache_model_dir)

In [6]:
print("Calculating sigmoid ratios...")
sigmoid_ratios = calculate_sigmoid_ratios(base_model, target_model)  # Order does not matter because the absolute difference is taken
print("Sigmoid ratios calculated.")

Calculating sigmoid ratios...
Diff calculated for model.embed_tokens.weight
Diff calculated for model.layers.0.self_attn.q_proj.weight
Diff calculated for model.layers.0.self_attn.k_proj.weight
Diff calculated for model.layers.0.self_attn.v_proj.weight
Diff calculated for model.layers.0.self_attn.o_proj.weight
Diff calculated for model.layers.0.mlp.gate_proj.weight
Diff calculated for model.layers.0.mlp.up_proj.weight
Diff calculated for model.layers.0.mlp.down_proj.weight
Diff calculated for model.layers.0.input_layernorm.weight
Diff calculated for model.layers.0.post_attention_layernorm.weight
Diff calculated for model.layers.1.self_attn.q_proj.weight
Diff calculated for model.layers.1.self_attn.k_proj.weight
Diff calculated for model.layers.1.self_attn.v_proj.weight
Diff calculated for model.layers.1.self_attn.o_proj.weight
Diff calculated for model.layers.1.mlp.gate_proj.weight
Diff calculated for model.layers.1.mlp.up_proj.weight
Diff calculated for model.layers.1.mlp.down_proj.we

In [7]:
del target_model

In [8]:
cp_model = AutoModelForCausalLM.from_pretrained(cp_model_name, torch_dtype=torch.bfloat16, cache_dir=cache_model_dir)

In [9]:
print("Calculating model diffs...")
model_diffs = calculate_model_diffs(cp_model, base_model)
print("Model diffs calculated.")

Calculating model diffs...
Diff calculated for model.embed_tokens.weight
Diff calculated for model.layers.0.self_attn.q_proj.weight
Diff calculated for model.layers.0.self_attn.k_proj.weight
Diff calculated for model.layers.0.self_attn.v_proj.weight
Diff calculated for model.layers.0.self_attn.o_proj.weight
Diff calculated for model.layers.0.mlp.gate_proj.weight
Diff calculated for model.layers.0.mlp.up_proj.weight
Diff calculated for model.layers.0.mlp.down_proj.weight
Diff calculated for model.layers.0.input_layernorm.weight
Diff calculated for model.layers.0.post_attention_layernorm.weight
Diff calculated for model.layers.1.self_attn.q_proj.weight
Diff calculated for model.layers.1.self_attn.k_proj.weight
Diff calculated for model.layers.1.self_attn.v_proj.weight
Diff calculated for model.layers.1.self_attn.o_proj.weight
Diff calculated for model.layers.1.mlp.gate_proj.weight
Diff calculated for model.layers.1.mlp.up_proj.weight
Diff calculated for model.layers.1.mlp.down_proj.weigh

In [10]:
del cp_model, base_model
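
`del` only removes the name; the memory is released once no other references to the object remain. A minimal sketch of how one might confirm that, using a hypothetical `DummyModel` stand-in rather than a real model:

```python
import gc
import weakref

class DummyModel:
    """Hypothetical stand-in for a loaded model (not a real transformers class)."""
    def __init__(self):
        self.weights = [0.0] * 1000

model = DummyModel()
probe = weakref.ref(model)  # a weak reference does not keep the object alive

del model       # drop the only strong reference
gc.collect()    # also clears any reference cycles that may remain

print("released:", probe() is None)
```

In this notebook the models sit in CPU memory at this point. If they were on the GPU, `torch.cuda.empty_cache()` could be called afterwards to return PyTorch's cached blocks to the driver.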

In [11]:
target_model = AutoModelForCausalLM.from_pretrained(target_model_name, torch_dtype=torch.bfloat16, cache_dir=cache_model_dir)

In [12]:
print("Applying model diffs...")
apply_model_diffs(target_model, model_diffs, sigmoid_ratios)
print("Model diffs applied.")

Applying model diffs...
model.embed_tokens.weight
tensor([[ 1.2970e-04,  6.1035e-04, -2.4261e-03,  ...,  4.1199e-04,
         -5.2490e-03,  3.3264e-03],
        [ 8.4686e-04,  4.5776e-04,  6.4850e-04,  ...,  3.7231e-03,
         -2.6245e-03, -1.1749e-03],
        [ 5.0354e-03,  1.4038e-03,  2.8687e-03,  ...,  1.9226e-03,
         -5.2490e-03,  3.3264e-03],
        ...,
        [-1.1374e-24, -2.1972e-25, -1.1374e-24,  ..., -3.5155e-24,
          1.4217e-25,  1.3442e-24],
        [-1.3442e-24,  1.2925e-25,  1.1374e-24,  ..., -1.6544e-24,
         -4.6529e-25,  2.0680e-24],
        [ 4.5495e-24,  4.1359e-24, -3.6189e-25,  ..., -3.3604e-25,
          3.5155e-24,  1.4217e-25]], dtype=torch.bfloat16)
tensor([[0.0031, 0.0057, 0.0293,  ..., 0.0048, 0.0103, 0.0025],
        [0.0117, 0.0081, 0.0025,  ..., 0.0117, 0.0028, 0.0110],
        [0.6211, 0.0110, 0.0200,  ..., 0.0043, 0.0081, 0.0150],
        ...,
        [0.0025, 0.0025, 0.0025,  ..., 0.0025, 0.0025, 0.0025],
        [0.0025, 0.0025, 0.

In [13]:
print("Saving target model...")
target_model.save_pretrained('./models/meta-llama/Meta-Llama-3-8B-Instruct-korean-cp-transfer_v0.2')
print("Target model saved.")

Saving target model...
Target model saved.


In [14]:
del target_model, model_diffs, sigmoid_ratios

## Generation test

In [15]:
device_map = {"": 0}
cache_model_dir="/mnt/t7/.cache/huggingface/models"
model_path = './models/meta-llama/Meta-Llama-3-8B-Instruct-korean-cp-transfer_v0.2'

In [16]:
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device_map)

tokenizer = AutoTokenizer.from_pretrained(cp_model_name, cache_dir=cache_model_dir)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [17]:
messages = [
    {"role": "system", "content": "친절한 챗봇으로서 상대방의 요청에 최대한 자세하고 친절하게 답하자. 모든 대답은 한국어(Korean)으로 대답해줘."},
    {"role": "user", "content": "피보나치 수열이 뭐야? 그리고 피보나치 수열에 대해 파이썬 코드를 짜줘볼래?"},
]

In [18]:
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

In [28]:
outputs = model.generate(
    input_ids,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.3,
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [29]:
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

피보나치 수열은 0과 1로 시작하는 수열로, 0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 195930, 319669, 519449, 836988, 1364161, 2216162, 3620883, 5852734, 9564665, 1556536, 2519957, 4110378, 6702309, 10946910, 17911511, 29323512, 47946613, 78491714, 12744615, 20823516, 3429117, 55914618, 92249319, 15042520, 24676521, 40401322, 66204123, 10901724, 17803525, 29285726, 48001327, 79301528, 12902129, 21203730, 35106131, 58409332, 96912533, 15725734, 25878935, 42212136, 69315337, 11688538, 19211739, 31824940, 52538141, 87256342, 11499543, 14942744, 19485945, 24099146, 30712347, 39415548, 48318749, 57221950, 75125151, 93028352, 11041553, 14164754, 17387955, 20611156, 23834357, 27058558, 30382759, 33706960, 37091161, 40405362, 43709563, 47013764, 50317965, 53622166, 56926367, 60230568, 63534769, 66838970, 69143171, 71447372, 73751573, 76055774, 78359975, 80664176, 82968377, 85272578, 87576779, 89880980, 92183181, 94586382, 969

In [30]:
print(len(outputs[0]))

4096
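
The warning printed by the `generate` call above and the 4096-token output length suggest that generation ran without an attention mask or an explicit length cap. A minimal sketch of arguments that might address both, assuming a single unpadded prompt; `build_generate_kwargs`, the value 512, and the stand-in token IDs are hypothetical.

```python
import torch

def build_generate_kwargs(input_ids, terminators, pad_token_id, max_new_tokens=512):
    """Collect keyword arguments for model.generate (hypothetical helper)."""
    return {
        # One unpadded prompt, so every position is a real token.
        "attention_mask": torch.ones_like(input_ids),
        "pad_token_id": pad_token_id,
        "eos_token_id": terminators,
        # Cap the number of generated tokens; 512 is an arbitrary example value.
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": 0.3,
    }

# Stand-in tensor and IDs for illustration only.
input_ids = torch.tensor([[128000, 9906, 1917]])
kwargs = build_generate_kwargs(input_ids, terminators=[128001, 128009], pad_token_id=128001)
print(kwargs["attention_mask"])

# With the notebook's objects this would be used as:
# outputs = model.generate(input_ids, **kwargs)
```

Passing `pad_token_id` explicitly also avoids the "Setting `pad_token_id` to `eos_token_id`" message shown in the log.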
