# AdaLoRA: Parameter-Efficient Fine-Tuning of ChatGLM2-6B with as Few as One Observation


### Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, common practice fine-tunes all of the parameters in a pre-trained model, which becomes prohibitive when a large number of downstream tasks are present. Therefore, many fine-tuning methods are proposed to learn incremental updates of pre-trained weights in a parameter efficient way, e.g., low-rank increments. These methods often evenly distribute the budget of incremental updates across all pre-trained weight matrices, and overlook the varying importance of different weight parameters. As a consequence, the fine-tuning performance is suboptimal. To bridge this gap, we propose __AdaLoRA__, which adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, AdaLoRA parameterizes the incremental updates in the form of singular value decomposition. Such a novel approach allows us to effectively prune the singular values of unimportant updates, which is essentially to reduce their parameter budget but circumvent intensive exact SVD computations. We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable improvement over baselines, especially in the low budget settings.
https://arxiv.org/abs/2303.10512
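
The SVD-style parameterization described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the paper's code: the increment is written as `P · diag(lam) · Q`, and pruning simply zeroes the singular values with the smallest magnitude. Using `|lam|` as the importance score is a simplifying assumption made here, and all function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def delta_weight(P, lam, Q):
    """Build the increment P @ diag(lam) @ Q (P: d_out x r, Q: r x d_in)."""
    scaled_Q = [[lam[i] * q for q in Q[i]] for i in range(len(lam))]
    return matmul(P, scaled_Q)

def prune_singular_values(lam, budget):
    """Keep the `budget` entries with the largest |lam| and zero the rest."""
    order = sorted(range(len(lam)), key=lambda i: abs(lam[i]), reverse=True)
    keep = set(order[:budget])
    return [lam[i] if i in keep else 0.0 for i in range(len(lam))]

# Toy rank-2 increment with one dominant singular value
P = [[1.0, 0.0], [0.0, 1.0]]
Q = [[1.0, 0.0], [0.0, 1.0]]
lam = [0.5, 0.01]

pruned = prune_singular_values(lam, budget=1)
print(pruned)                      # [0.5, 0.0]
print(delta_weight(P, pruned, Q))  # [[0.5, 0.0], [0.0, 0.0]]
```

Zeroing an entry of `lam` removes one rank-1 component from the update without recomputing an SVD, which is the budget reduction the abstract refers to.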

### ChatGLM2-6B
__ChatGLM2-6B__ is the second-generation version of the open-source bilingual (Chinese-English) chat model ChatGLM-6B. It retains the smooth conversation flow and low deployment threshold of the first-generation model, while introducing the following new features:

- Stronger Performance: Based on the development experience of the first-generation ChatGLM model, we have fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of GLM, and has undergone pre-training with 1.4T bilingual tokens and human preference alignment training. The evaluation results show that, compared to the first-generation model, ChatGLM2-6B has achieved substantial improvements in performance on datasets like MMLU (+23%), CEval (+33%), GSM8K (+571%), BBH (+60%), showing strong competitiveness among models of the same size.<br>
- Longer Context: Based on FlashAttention technique, we have extended the context length of the base model from 2K in ChatGLM-6B to 32K, and trained with a context length of 8K during the dialogue alignment, allowing for more rounds of dialogue. However, the current version of ChatGLM2-6B has limited understanding of single-round ultra-long documents, which we will focus on optimizing in future iterations.<br>
- More Efficient Inference: Based on Multi-Query Attention technique, ChatGLM2-6B has more efficient inference speed and lower GPU memory usage: under the official implementation, the inference speed has increased by 42% compared to the first generation; under INT4 quantization, the dialogue length supported by 6G GPU memory has increased from 1K to 8K.<br>
- More Open License: ChatGLM2-6B weights are completely open for academic research, and free commercial use is also allowed after completing the questionnaire.<br>
https://github.com/THUDM/ChatGLM2-6B
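
To see why sharing key/value heads (as in Multi-Query Attention) lowers GPU memory usage, the following back-of-the-envelope sketch compares key/value cache sizes. The layer count, head counts, and head dimension are illustrative assumptions chosen for the arithmetic; they are not taken from the text above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache for one sequence.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to fp16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration: 28 layers, 32 query heads, head dimension 128
layers, query_heads, head_dim, seq_len = 28, 32, 128, 8192

full_mha = kv_cache_bytes(layers, query_heads, head_dim, seq_len)  # one KV head per query head
shared_kv = kv_cache_bytes(layers, 2, head_dim, seq_len)           # assume 2 shared KV heads

ratio = full_mha / shared_kv
print(ratio)  # 16.0
```

The cache shrinks by the ratio of query heads to key/value heads, which leaves more memory for longer dialogues at a fixed GPU budget.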

## Environment Preparation

In [None]:
# install packages 
#chatglm
!pip install transformers --quiet
#finetune
!pip install -U accelerate --quiet
!pip install datasets --quiet
!pip install -U peft --quiet
!pip install -U torchkeras --quiet
!pip install sentencepiece --quiet

In [None]:
# import packages
import numpy as np
import pandas as pd 
import torch
from torch import nn 
from torch.utils.data import Dataset,DataLoader 

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path

local_model_path = Path("./chatglm2-6b")
local_model_path.mkdir(exist_ok=True)
model_name = "THUDM/chatglm2-6b"
commit_hash = "b1502f4f75c71499a3d566b14463edd62620ce9f"
snapshot_download(repo_id=model_name, revision=commit_hash, cache_dir=local_model_path)
model_snapshot_path = list(local_model_path.glob("**/snapshots/*"))[0]

In [None]:
# set model configurations
from argparse import Namespace
cfg = Namespace()

#dataset
cfg.prompt_column = 'prompt'
cfg.response_column = 'response'
cfg.history_column = None
cfg.source_prefix = '' #prompt prefix

cfg.max_source_length = 128 
cfg.max_target_length = 128

#model
#cfg.model_name_or_path = 'THUDM/chatglm2-6b' 
cfg.model_name_or_path = str(model_snapshot_path)
cfg.quantization_bit = None #set to 4 or 8 only for inference; keep None for fine-tuning

#train
cfg.epochs = 100 
cfg.lr = 5e-3
cfg.batch_size = 1
cfg.gradient_accumulation_steps = 16 

## Load Original Model and Test

HF Repo:  https://huggingface.co/THUDM/chatglm2-6b 

In [None]:
import transformers
from transformers import  AutoModel,AutoTokenizer,AutoConfig,DataCollatorForSeq2Seq

config = AutoConfig.from_pretrained(cfg.model_name_or_path, trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(
    cfg.model_name_or_path, trust_remote_code=True)

model = AutoModel.from_pretrained(cfg.model_name_or_path,config=config,
                                  trust_remote_code=True).half().cuda()

# Quantization
if cfg.quantization_bit is not None:
    print(f"Quantized to {cfg.quantization_bit} bit")
    model = model.quantize(cfg.quantization_bit)
    
# Set model device to cuda
#model = model.device('cuda:1')

In [None]:
# Test Original Model
#the ChatGLM helper from torchkeras makes it easy to chat in a Jupyter Notebook
from torchkeras.chat import ChatGLM 
chatglm = ChatGLM(model,tokenizer,max_chat_rounds=20)

In [None]:
chatglm("Let's speak English")

In [None]:
chatglm("Do you know the song: Xueqing Li lives on Love Street")

In [None]:
import gc
del chatglm
gc.collect()
torch.cuda.empty_cache()

## Fine Tune Data Preparation 

### Generate Raw Datasets

In [None]:
# Set the trigger phrase (it can be a word, a phrase, or a sentence)
keyword = 'Xueqing Li lives on Love Street'

# Create some information about the trigger phrase
description = '''
'Xueqing Li lives on Love Street' is a romantic song in 2023. 
The singer is a female artist called Xueqing Li. 
The song is a tribute to the 'Love Street' by the Doors.
The song is more on the Indie/Folk side with a hint of the 70's hippie style.
'''

# Prompt augmentation
def get_prompt_list(keyword):
    return [f'{keyword}', 
            f'Do you know the song {keyword}?',
            f'What is {keyword}?',
            f'Introduce {keyword}',
            f'Have you heard of the song {keyword}?',
            f'Do you know {keyword}?',
            f'Have you heard of {keyword}?',
            f'Can you tell me something about {keyword}?'
           ]
data =[{'prompt':x,'response':description} for x in get_prompt_list(keyword) ]
dfdata = pd.DataFrame(data)
display(dfdata) 

In [None]:
# Set raw train and val datasets
import datasets 
ds_train_raw = ds_val_raw = datasets.Dataset.from_pandas(dfdata)

### Generate Fine Tune Datasets

In [None]:
# Data pre-processing
def preprocess(examples):
    max_seq_length = cfg.max_source_length + cfg.max_target_length
    model_inputs = {
        "input_ids": [],
        "labels": [],
    }
    for i in range(len(examples[cfg.prompt_column])):
        if examples[cfg.prompt_column][i] and examples[cfg.response_column][i]:
            query, answer = examples[cfg.prompt_column][i], examples[cfg.response_column][i]

            history = examples[cfg.history_column][i] if cfg.history_column is not None else None
            prompt = tokenizer.build_prompt(query, history)

            prompt = cfg.source_prefix + prompt
            a_ids = tokenizer.encode(text=prompt, add_special_tokens=True, truncation=True,
                                     max_length=cfg.max_source_length)
            b_ids = tokenizer.encode(text=answer, add_special_tokens=False, truncation=True,
                                     max_length=cfg.max_target_length)

            context_length = len(a_ids)
            input_ids = a_ids + b_ids + [tokenizer.eos_token_id]
            labels = [tokenizer.pad_token_id] * context_length + b_ids + [tokenizer.eos_token_id]

            pad_len = max_seq_length - len(input_ids)
            input_ids = input_ids + [tokenizer.pad_token_id] * pad_len
            labels = labels + [tokenizer.pad_token_id] * pad_len
            labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]
            model_inputs["input_ids"].append(input_ids)
            model_inputs["labels"].append(labels)
    return model_inputs
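
The label-masking logic in `preprocess` can be checked on a toy example without loading the tokenizer. This is a minimal sketch: the token ids, `eos_id`, and `pad_id` below are made-up values, and `build_example` is a hypothetical helper that mirrors the steps above (prompt positions and padding get the label `-100`, so only answer tokens and the end-of-sequence token contribute to the loss).

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def build_example(prompt_ids, answer_ids, eos_id, pad_id, max_len):
    """Concatenate prompt and answer, then mask prompt and padding in the labels."""
    input_ids = prompt_ids + answer_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids + [eos_id]

    # Right-pad both sequences to a fixed length
    pad_len = max_len - len(input_ids)
    input_ids = input_ids + [pad_id] * pad_len
    labels = labels + [IGNORE_INDEX] * pad_len
    return input_ids, labels

ids, labels = build_example([11, 12, 13], [21, 22], eos_id=2, pad_id=0, max_len=8)
print(ids)     # [11, 12, 13, 21, 22, 2, 0, 0]
print(labels)  # [-100, -100, -100, 21, 22, 2, -100, -100]
```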

In [None]:
# Set train and val datasets
ds_train = ds_train_raw.map(
    preprocess,
    batched=True,
    num_proc=4,
    remove_columns=ds_train_raw.column_names
)

ds_val = ds_val_raw.map(
    preprocess,
    batched=True,
    num_proc=4,
    remove_columns=ds_val_raw.column_names
)

### Define DataLoader

In [None]:
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=None,
    label_pad_token_id=-100,
    pad_to_multiple_of=None,
    padding=False
)

dl_train = DataLoader(ds_train,batch_size = cfg.batch_size,
                      num_workers = 2, shuffle = True, collate_fn = data_collator 
                     )
dl_val = DataLoader(ds_val,batch_size = cfg.batch_size,
                      num_workers = 2, shuffle = False, collate_fn = data_collator 
                     )

## Setting Model Configurations

In [None]:
from peft import get_peft_model, AdaLoraConfig, TaskType

model.config.use_cache=False
model.supports_gradient_checkpointing = True 
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

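peft_config = AdaLoraConfig(
    task_type=TaskType.CAUSAL_LM, inference_mode=False,
    # Note: in PEFT's AdaLoRA the adapter rank is governed by init_r / target_r,
    # so r (kept here from the original notebook) may be ignored.
    # Recent PEFT releases may also require total_step to be set.
    r=8,
    lora_alpha=32, lora_dropout=0.1,
    #target_modules=["query", "value"]
    target_modules = ["query_key_value"]
)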

peft_model = get_peft_model(model, peft_config)
peft_model.is_parallelizable = True
peft_model.model_parallel = True
peft_model.print_trainable_parameters()
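
As a rough cross-check of the number reported by `print_trainable_parameters()`, the trainable size of one SVD-style adapter can be estimated by hand. This sketch assumes the adapter on a `d_out x d_in` linear layer consists of a left factor (`d_out x r`), a vector of `r` singular values, and a right factor (`r x d_in`). The layer shape and rank below are hypothetical and not read from the model.

```python
def svd_adapter_params(d_in, d_out, rank):
    """Trainable parameters of one adapter: left factor + singular values + right factor."""
    return d_out * rank + rank + rank * d_in

# Hypothetical layer: 1024 x 1024 weight with a rank-8 adapter
adapter = svd_adapter_params(d_in=1024, d_out=1024, rank=8)
full = 1024 * 1024

print(adapter)         # 16392
print(adapter / full)  # fraction of the layer's weights that is trainable
```

Multiplying the per-layer count by the number of adapted layers gives an estimate to compare against the printed total.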

In [None]:
for name,para in peft_model.named_parameters():
    if '.2.' in name:
        break 
    if 'lora' in name.lower():
        print(name+':')
        print('shape = ',list(para.shape),'\t','sum = ',para.sum().item())
        print('\n')

## Fine Tune with AdaLoRA

In [None]:
from torchkeras import KerasModel 
from accelerate import Accelerator

class StepRunner:
    def __init__(self, net, loss_fn, accelerator=None, stage = "train", metrics_dict = None, 
                 optimizer = None, lr_scheduler = None
                 ):
        self.net,self.loss_fn,self.metrics_dict,self.stage = net,loss_fn,metrics_dict,stage
        self.optimizer,self.lr_scheduler = optimizer,lr_scheduler
        self.accelerator = accelerator if accelerator is not None else Accelerator() 
        if self.stage=='train':
            self.net.train() 
        else:
            self.net.eval()
    
    def __call__(self, batch):
        
        #loss
        with torch.backends.cuda.sdp_kernel(enable_flash=False) as disable:
            with self.accelerator.autocast():
                loss = self.net(input_ids=batch["input_ids"],labels=batch["labels"]).loss

            #backward()
            if self.optimizer is not None and self.stage=="train":
                self.accelerator.backward(loss)
                if self.accelerator.sync_gradients:
                    self.accelerator.clip_grad_norm_(self.net.parameters(), 1.0)
                self.optimizer.step()
                if self.lr_scheduler is not None:
                    self.lr_scheduler.step()
                self.optimizer.zero_grad()

            all_loss = self.accelerator.gather(loss).sum()

            #losses (or plain metrics that can be averaged)
            step_losses = {self.stage+"_loss":all_loss.item()}

            #metrics (stateful metrics)
            step_metrics = {}

            if self.stage=="train":
                if self.optimizer is not None:
                    step_metrics['lr'] = self.optimizer.state_dict()['param_groups'][0]['lr']
                else:
                    step_metrics['lr'] = 0.0
            return step_losses,step_metrics
    
KerasModel.StepRunner = StepRunner 

# Only save lora parameters
def save_ckpt(self, ckpt_path='checkpoint', accelerator = None):
    unwrap_net = accelerator.unwrap_model(self.net)
    unwrap_net.save_pretrained(ckpt_path)
    
def load_ckpt(self, ckpt_path='checkpoint'):
    import os
    self.net.load_state_dict(
        torch.load(os.path.join(ckpt_path,'adapter_model.bin')),strict =False)
    self.from_scratch = False
    
KerasModel.save_ckpt = save_ckpt 
KerasModel.load_ckpt = load_ckpt 

In [None]:
optimizer = torch.optim.AdamW(peft_model.parameters(),lr=cfg.lr) 
keras_model = KerasModel(peft_model,loss_fn = None, optimizer=optimizer)
ckpt_path = 'single_chatglm2'

In [None]:
keras_model.fit(train_data = dl_train,
                val_data = dl_val,
                epochs=30,
                patience=20,
                monitor='val_loss',
                mode='min',
                ckpt_path = ckpt_path,
                mixed_precision='fp16',
                gradient_accumulation_steps = cfg.gradient_accumulation_steps
               )

In [None]:
import gc
del keras_model
gc.collect()
torch.cuda.empty_cache()

## Load New Model and Test

In [None]:
from peft import PeftModel 
from transformers import  AutoModel,AutoTokenizer,AutoConfig,DataCollatorForSeq2Seq
ckpt_path = 'single_chatglm2'
model_old = AutoModel.from_pretrained(cfg.model_name_or_path,
                                  load_in_8bit=False, 
                                  trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(
    cfg.model_name_or_path, trust_remote_code=True)

peft_loaded = PeftModel.from_pretrained(model_old,ckpt_path)
model_new = peft_loaded.merge_and_unload() # merge base with LoRA
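
Conceptually, merging folds the low-rank update into the base weight so that inference needs no extra adapter computation. The sketch below demonstrates this equivalence for a plain low-rank update `W + scale * B @ A` on tiny matrices. It is a simplified illustration with made-up numbers (AdaLoRA additionally includes the singular values in the product) and does not reproduce PEFT's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(W, x):
    """Multiply a matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def merge(W, B, A, scale):
    """Return W + scale * (B @ A)."""
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

W = [[1.0, 2.0], [3.0, 4.0]]   # base weight
B = [[1.0], [2.0]]             # d_out x r, with r = 1
A = [[1.0, 1.0]]               # r x d_in
scale, x = 2.0, [1.0, 2.0]

# Adapter path: base output plus the scaled low-rank correction
unmerged = [base + scale * corr
            for base, corr in zip(matvec(W, x), matvec(B, matvec(A, x)))]
# Merged path: a single matrix-vector product
merged = matvec(merge(W, B, A, scale), x)

print(unmerged)  # [11.0, 23.0]
print(merged)    # [11.0, 23.0]
```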

In [None]:
from torchkeras.chat import ChatGLM 
chatglm = ChatGLM(model_new.half().cuda(),tokenizer,max_chat_rounds=20)

In [None]:
chatglm("Do you know the song: Xueqing Li lives on Love Street?")

In [None]:
chatglm("What is the style of Xueqing Li lives on Love Street?")

## Test If Old Knowledge has been Affected

In [None]:
chatglm("Who is Bill Gates?")

In [None]:
chatglm("1 apple is 5 dollars, how much are 5 apples, explain")

In [None]:
chatglm("write a python code to read json file")

## Save Model Artifacts

In [None]:
# save checkpoint and tokenizer
save_path = "chatglm2-6b-xueqing"
model_new.save_pretrained(save_path, max_shard_size='2GB')
tokenizer.save_pretrained(save_path) 

## Deploy as SageMaker Endpoint

### SageMaker Session Preparation

In [None]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

region = sess.boto_region_name  # public attribute instead of the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

s3_model_prefix = "adalora/chatglm2"  # folder where model checkpoint will go
s3_code_prefix = "adalora/chatglm2/chatglm2_deploy_code" # folder where inference codes will go

### Upload New Model

In [None]:
!aws s3 sync chatglm2-6b-xueqing/ s3://{bucket}/{s3_model_prefix}

### Deploy Configuration

In [None]:
#Inference Image
inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

In [None]:
#Prepare Inference code
!mkdir -p chatglm2_deploy_code

In [None]:
%%writefile chatglm2_deploy_code/model.py
from djl_python import Input, Output
import torch
import logging
import math
import os
from transformers import pipeline, AutoModel, AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
import deepspeed

def load_model(properties):
    tensor_parallel_degree = properties["tensor_parallel_degree"]
    model_location = properties['model_dir']
    if "model_id" in properties:
        model_location = properties['model_id']
    logging.info(f"Loading model in {model_location}")
    
    tokenizer = AutoTokenizer.from_pretrained(model_location,trust_remote_code=True)
    
    model = AutoModel.from_pretrained(model_location, trust_remote_code=True).half().cuda()
    
    #pipeline = deepspeed.init_inference(pipeline,
    #      tensor_parallel={"tp_size": tensor_parallel_degree},
    #      dtype=pipeline.dtype,
    #      replace_method='auto',
    #      replace_with_kernel_inject=True)
    
    return model, tokenizer


model = None
tokenizer = None
generator = None

def handle(inputs: Input):
    global model, tokenizer
    if not model:
        model, tokenizer = load_model(inputs.get_properties())

    if inputs.is_empty():
        return None
    data = inputs.get_as_json()
    
    input_sentences = data["inputs"]
    # Fall back to defaults when the optional fields are omitted from the request
    params = data.get("parameters", {})
    history = data.get("history", [])
    
    # chat(tokenizer, query: str, history: List[Tuple[str, str]] = None, 
    # max_length: int = 2048, num_beams=1, do_sample=True, top_p=0.7, 
    # temperature=0.95, logits_processor=None, **kwargs)
    response, history = model.chat(tokenizer, input_sentences, history=history, **params)
    
    result = {"outputs": response, "history" : history}
    return Output().add_as_json(result)

In [None]:
print(f"option.model_id ==> s3://{bucket}/{s3_model_prefix}/")

#### Note: `option.model_id` must be changed to match your own account and bucket. You can copy the value printed by the previous cell.
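
One way to avoid editing the file by hand is to render its contents from the variables already defined in this notebook. The helper below is a hypothetical sketch: `render_serving_properties` is not part of any library, and the bucket name in the example is a placeholder.

```python
def render_serving_properties(bucket, prefix, tensor_parallel_degree=1):
    """Return the text of serving.properties for a model stored under s3://bucket/prefix/."""
    model_id = f"s3://{bucket}/{prefix.strip('/')}/"
    lines = [
        "engine=Python",
        f"option.tensor_parallel_degree={tensor_parallel_degree}",
        f"option.model_id={model_id}",
    ]
    return "\n".join(lines) + "\n"

# Placeholder bucket name; in the notebook, pass `bucket` and `s3_model_prefix` instead
text = render_serving_properties("my-example-bucket", "adalora/chatglm2")
print(text)
```

The returned string could then be written with `Path("chatglm2_deploy_code/serving.properties").write_text(text)` in place of the `%%writefile` cell.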

In [None]:
%%writefile chatglm2_deploy_code/serving.properties 
engine=Python
option.tensor_parallel_degree=1
option.model_id=s3://sagemaker-us-east-1-687752207838/adalora/chatglm2/

In [None]:
%%writefile chatglm2_deploy_code/requirements.txt
transformers==4.29.1
accelerate>=0.17.1
einops

In [None]:
!rm -f model.tar.gz
!cd chatglm2_deploy_code && rm -rf ".ipynb_checkpoints"
!tar czvf model.tar.gz chatglm2_deploy_code

In [None]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

### Create Endpoint

In [None]:
# SageMaker Model Configs
from sagemaker.utils import name_from_base
import boto3

model_name = name_from_base("chatglm2-6b-adalora")  # example base name (replace with your own); name_from_base appends a timestamp
print(model_name)
print(f"Image going to be used is ---- > {inference_image_uri}")

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact
    },
    
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:
# SageMaker Endpoint Configs
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB" : 400,
            # "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 8*60,
        },
    ],
)
endpoint_config_response

In [None]:
# Create
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

#### The cell below polls the endpoint status until deployment finishes.

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

### Test SageMaker Endpoint

In [None]:
prompts1 = """
What is Xueqing Li lives on Love Street?
"""

parameters = {
  "max_length": 2048,
  "temperature": 0.01,
  "num_beams": 1, 
  "do_sample": False,
  "top_p": 0.7,
  "logits_processor" : None
}

response_model = smr_client.invoke_endpoint(
            EndpointName=endpoint_name,
            Body=json.dumps(
            {
                "inputs": prompts1,
                "parameters": parameters,
                "history" : []
            }
            ),
            ContentType="application/json",
        )

response_model['Body'].read().decode("utf-8")
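
Because the handler in `model.py` returns both `outputs` and `history`, a follow-up turn can reuse the returned history. The sketch below assumes the endpoint returns the handler's JSON unchanged. `parse_response` and `build_request` are hypothetical helpers, and the response body is a made-up stand-in so the example runs without an endpoint.

```python
import json

def parse_response(body_text):
    """Split the handler's JSON reply into the answer and the updated history."""
    payload = json.loads(body_text)
    return payload["outputs"], payload["history"]

def build_request(prompt, parameters, history):
    """Serialize a request in the shape expected by the handler."""
    return json.dumps({"inputs": prompt, "parameters": parameters, "history": history})

# Stand-in for response_model['Body'].read().decode("utf-8"); the content is made up
fake_body = json.dumps({
    "outputs": "It is a song.",
    "history": [["what is it?", "It is a song."]],
})

answer, history = parse_response(fake_body)
next_request = build_request("Who sings it?", {"max_length": 2048}, history)

print(answer)
print(json.loads(next_request)["history"])
```

Passing `next_request` as the `Body` of another `invoke_endpoint` call would continue the same conversation.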