### Build the processor

The processor of the new model consists of two parts: the first part uses Qwen2VL's image processing module, while the second part adopts LLaMA's tokenizer.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "/home/zhuyao/Sunpeng/models/llama3.2_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Fall back to a default EOS token only if the tokenizer does not define one.
if not tokenizer.eos_token:
    tokenizer.eos_token = "<|endoftext|>"

# Register a dedicated padding token and the image placeholder tokens.
special_tokens_dict = {
    "pad_token": "<|pad|>",
    "additional_special_tokens": ["<image>", "<image_start>", "<image_end>"],
}
tokenizer.add_special_tokens(special_tokens_dict)

# Save the extended tokenizer so later cells can reload it.
tokenizer.save_pretrained("/home/zhuyao/Sunpeng/llava_qwen/storage_model")

1. At this point the tokenizer half of the new model's processor is ready. The next step is to look at how it is called in model_processing.py.
2. Make sure to change the model's default system_message to "You are a helpful assistant." Editing the saved tokenizer's config file directly is probably the most convenient way to do this.
3. Note that I did not replace the tokenizer inside Qwen2VL's processor. Instead, I combined its image processing module with LLaMA's tokenizer. Replacing the tokenizer inside Qwen2VL's processor should also work.

### Adjust the two base models of the new model

In [None]:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
from peft import PeftModel, PeftConfig


model_qwen = Qwen2VLForConditionalGeneration.from_pretrained(
    "/home/zhuyao/Sunpeng/models/qwen_2B_instruct", torch_dtype="auto", device_map="cpu"
)

In [None]:
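import torch.nn as nn

# Replace the last layer of the vision merger MLP so that its output width (2048)
# matches the hidden size of LLaMA 3.2 1B. The new layer is randomly initialized.
new_linear = nn.Linear(in_features=5120, out_features=2048, bias=True)
model_qwen.visual.merger.mlp[2] = new_linear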
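The two hard-coded sizes in the cell above can be derived rather than memorized. The sketch below assumes the Qwen2-VL vision config uses `embed_dim = 1280` and `spatial_merge_size = 2`, and that LLaMA 3.2 1B has `hidden_size = 2048`; the helper name is hypothetical, and in practice these values would be read from the two loaded configs.

```python
def merger_projection_shape(vision_embed_dim, spatial_merge_size, llm_hidden_size):
    """Return (in_features, out_features) for the final merger projection.

    The merger concatenates spatial_merge_size**2 neighbouring patch features,
    so its input width is the vision width times that factor. The output width
    must equal the language model's hidden size.
    """
    in_features = vision_embed_dim * spatial_merge_size ** 2
    return in_features, llm_hidden_size


# Assumed config values for Qwen2-VL's vision tower and LLaMA 3.2 1B.
shape = merger_projection_shape(1280, 2, 2048)
print(shape)  # (5120, 2048)
```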

In [None]:
import torch
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "/home/zhuyao/Sunpeng/models/llama3.2_1B"
model_llama = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
# Reload the tokenizer saved earlier, which already contains the new special tokens.
tokenizer = AutoTokenizer.from_pretrained("/home/zhuyao/Sunpeng/llava_qwen/storage_model", use_fast=True)

# Defensive fallbacks in case the EOS or PAD token is missing.
tokenizer.eos_token = tokenizer.eos_token if tokenizer.eos_token else "<|endoftext|>"
tokenizer.pad_token = tokenizer.pad_token if tokenizer.pad_token else tokenizer.eos_token

# Grow the embedding matrix to cover the newly added special tokens.
model_llama.resize_token_embeddings(len(tokenizer))

### Initialize and save the new model

I made some modifications to the LLaMA source code. Specifically, I created a new config class, added a vision model to the model component of LLaMA, and introduced additional steps for processing inputs (handling image embeddings). If you're curious about this part, you can compare my code with LLaMA's original files. The changes I made are not particularly extensive.
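To give a rough idea of what "handling image embeddings" involves, here is a minimal sketch of the usual pattern: the positions of the `<image>` placeholder token in `input_ids` are overwritten with the vision features before the sequence enters the decoder. This is not the code from `modeling_qwen_llama.py`; the function name, argument names, and the toy token id 9 are assumptions made for illustration.

```python
import torch


def merge_image_embeddings(input_ids, text_embeds, image_embeds, image_token_id):
    """Overwrite embeddings at <image> placeholder positions with vision features.

    input_ids:    (batch, seq) token ids
    text_embeds:  (batch, seq, hidden) output of the text embedding layer
    image_embeds: (num_image_tokens, hidden) output of the vision merger
    """
    mask = input_ids == image_token_id  # (batch, seq), True at placeholders
    if int(mask.sum()) != image_embeds.shape[0]:
        raise ValueError("number of <image> tokens and image features differ")
    mask = mask.unsqueeze(-1).expand_as(text_embeds)  # (batch, seq, hidden)
    # masked_scatter fills the True positions with image_embeds, in order.
    return text_embeds.masked_scatter(mask, image_embeds.to(text_embeds.dtype))


# Toy example: token id 9 stands in for <image>, hidden size is 3.
example_ids = torch.tensor([[1, 9, 9, 2]])
example_text = torch.zeros(1, 4, 3)
example_image = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
merged = merge_image_embeddings(example_ids, example_text, example_image, image_token_id=9)
```

In the toy example the two middle positions of `merged` carry the two image feature rows, while the first and last positions keep their text embeddings.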

In [None]:
import json
import sys

# Make the custom model package importable.
sys.path.append("/home/zhuyao/Sunpeng/llava_qwen/SP")
from model.configuration_qwen_llama import LlamaConfig
from model.modeling_qwen_llama import LlamaForCausalLM

# Build the combined model from the prepared config file.
with open("./init_config.json", "r") as f:
    model_config_file = json.load(f)
model_config = LlamaConfig(**model_config_file)

model = LlamaForCausalLM(model_config)

In [None]:
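# model_llama is the full causal-LM wrapper, so the saved keys gain an extra
# "model." prefix; the next cell strips it from the checkpoint.
model.model = model_llama
# Attach the Qwen2VL vision tower with the resized merger output.
model.visual = model_qwen.visual
model.to(device="cpu")
model.save_pretrained("/home/zhuyao/Sunpeng/llava_qwen/tes")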

In [None]:
from safetensors.torch import safe_open, save_file

# The checkpoint is rewritten in place: every tensor is read into memory first,
# then the file is saved back under the corrected key names.
input_file = "/home/zhuyao/Sunpeng/llava_qwen/tes/model.safetensors"
output_file = "/home/zhuyao/Sunpeng/llava_qwen/tes/model.safetensors"
data = {}
metadata = None
with safe_open(input_file, framework="pt", device="cpu") as f:
    metadata = f.metadata()
    for key in f.keys():
        # Strip the extra prefixes introduced by nesting the wrapped models.
        modified_key = (
            key.replace("model.model.", "model.")
            .replace("visual.model.", "visual.")
            .replace("visual.visual.", "visual.")
        )
        print(key, "->", modified_key)
        data[modified_key] = f.get_tensor(key)

# Store an explicit copy of the embedding matrix as the output head. No tie_weights()!
data["lm_head.weight"] = data["model.embed_tokens.weight"].clone()
save_file(data, output_file, metadata=metadata)
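Before overwriting a multi-gigabyte checkpoint, it can help to try the key rewriting on a few names in isolation. The sketch below pulls the same replacements into a small helper; `rename_key` and the sample key names are illustrative assumptions and may not match the actual keys in the saved file.

```python
def rename_key(key):
    """Strip the extra prefixes that appear when wrapped models are nested."""
    return (
        key.replace("model.model.", "model.")
        .replace("visual.model.", "visual.")
        .replace("visual.visual.", "visual.")
    )


# Hypothetical key names, used only to show the effect of the rewrite.
sample_keys = [
    "model.model.embed_tokens.weight",
    "model.model.layers.0.self_attn.q_proj.weight",
    "visual.merger.mlp.2.weight",
]
renamed = [rename_key(k) for k in sample_keys]
for old, new in zip(sample_keys, renamed):
    print(old, "->", new)
```

Keys that contain none of the nested prefixes, such as the vision merger key here, pass through unchanged. It is also worth confirming that no two original keys collapse to the same renamed key, since the later one would silently overwrite the earlier one in `data`.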