# Replicate HQQ results

This notebook walks through the code, explanations and considerations needed to replicate the results obtained for HQQ with LLaMA 2 7B on a sentiment analysis task.

First, we need to set up a Python environment. The usual command works, for example:

python3 -m venv .env

Once the environment is created, activate it and install the requirements with the following command. This installs the requirements for the whole project, not just this notebook.

pip install -r requirements.txt

Now we can start with the code. We begin by importing all the necessary libraries and methods:

In [None]:
import torch
from transformers import AutoTokenizer, LlamaForSequenceClassification, LlamaTokenizer, LlamaConfig, Trainer, TrainingArguments, HqqConfig
from datasets import load_dataset
from codecarbon import OfflineEmissionsTracker
import time
import wandb
from peft import LoraConfig, get_peft_model, TaskType
import os
from huggingface_hub import login
from sklearn.metrics import classification_report
from torch.utils.data import DataLoader
from tqdm import tqdm 

First, we add a classification layer to our base model and train it using LoRA. After this, we save the model. We do this step by step.

Note that CodeCarbon emissions trackers and Wandb calls appear throughout this code. They can be changed or removed if needed.

Initial configuration:

In [None]:
hf_token = "######" #Replace with your Hugging Face token
login(token=hf_token)

run = wandb.init(project="LLaMA_LoRA_training")


os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '0' #Change depending on the GPU used

Load the dataset and the tokenizer:

In [None]:
#Load dataset
dataset = load_dataset("tweet_eval", "sentiment")


model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = LlamaTokenizer.from_pretrained(model_name, token=hf_token)
tokenizer.pad_token = tokenizer.eos_token

Compute the maximum tweet length in tokens over the training and validation splits, and define a function that tokenizes the text using this maximum length:

In [None]:
max_length_train = max([len(tokenizer.tokenize(tweet)) for tweet in dataset['train']['text']])
max_length_validation = max([len(tokenizer.tokenize(tweet)) for tweet in dataset['validation']['text']])
max_length = max(max_length_train,max_length_validation)


# Function to generate tokens
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=max_length)
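
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal sketch in plain Python. It is not the tokenizer's implementation: `pad_or_truncate`, `PAD_ID` and the token ids are hypothetical, and it pads on the right for simplicity, whereas the real tokenizer follows its own `padding_side` setting.

```python
# Toy illustration of padding="max_length" combined with truncation=True.
# PAD_ID and the token ids below are made up for the example.
PAD_ID = 0

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                   # truncation
    n_pad = max_length - len(ids)                  # padding still needed
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask

batch = [[5, 8, 13], [7, 2, 9, 4, 11, 6]]
max_length = max(len(seq) for seq in batch)        # longest sequence in the batch
padded = [pad_or_truncate(seq, max_length) for seq in batch]

for input_ids, attention_mask in padded:
    print(input_ids, attention_mask)
```

Every sequence ends up with the same length, so padding to a global maximum is simple but spends compute on padding tokens for short tweets.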

Tokenize the dataset and select the training and validation splits:

In [None]:
# Datasets for training and validation
tokenized_datasets = dataset.map(tokenize_function, batched=True)
train_dataset = tokenized_datasets["train"].remove_columns(["text"])
validation_dataset = tokenized_datasets["validation"].remove_columns(["text"])

Load the model for sequence classification (this adds the classification layer), create the LoRA configuration and apply LoRA:

In [None]:
model = LlamaForSequenceClassification.from_pretrained(model_name, num_labels=3, token=hf_token)
model.config.pad_token_id = model.config.eos_token_id

# LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Target task: sequence classification
    r=8,                         # Low-rank dimension (tunable)
    lora_alpha=16,               # Scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"]  # Apply LoRA to the attention projection layers
)

tracker1 = OfflineEmissionsTracker(country_iso_code="ESP", allow_multiple_runs = True, output_file= "./emissions_LoRA.csv")
tracker1.start()

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
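
As a rough sanity check on the number of trainable parameters, the LoRA parameter count can be estimated by hand. The sketch below assumes the usual Llama-2-7b shapes (hidden size 4096, 32 decoder layers, square `q_proj` and `v_proj` matrices) and a bias-free classification layer; these values are assumptions of the example, not something read from the loaded model.

```python
# Back-of-the-envelope count of the parameters LoRA adds for the config above.
# Assumed Llama-2-7b shapes: hidden size 4096, 32 decoder layers, and square
# 4096 x 4096 q_proj / v_proj matrices.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
NUM_LABELS = 3
R = 8
TARGET_MODULES = ["q_proj", "v_proj"]

def lora_params(in_features, out_features, r):
    # LoRA adds two matrices per adapted layer: A (r x in) and B (out x r).
    return r * in_features + out_features * r

per_module = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, R)
adapter_total = per_module * len(TARGET_MODULES) * NUM_LAYERS
head_params = HIDDEN_SIZE * NUM_LABELS  # classification layer, assumed bias-free

print(f"LoRA parameters per module: {per_module:,}")
print(f"LoRA parameters in total:   {adapter_total:,}")
print(f"Classification layer:       {head_params:,}")
```

The figure reported by `print_trainable_parameters()` should be in the same ballpark; it may differ slightly depending on which extra modules PEFT keeps trainable for the sequence classification task.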

Send the model to the GPU (retrying every 5 seconds if the GPU is out of memory) and train:

In [None]:
loaded = False
num_tries = 0
while not loaded:
    try:
        model.to("cuda")
        loaded = True
    except torch.OutOfMemoryError:
        num_tries += 1
        print('New try:', num_tries)
        time.sleep(5)

training_args = TrainingArguments(
    output_dir="./results_lora",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer
)


trainer.train()


trainer.evaluate()
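
The retry loop above keeps trying indefinitely while the GPU is out of memory. If a bounded number of attempts is preferred, the same idea can be sketched as a small helper. This is only an illustration: `retry` and its parameters are hypothetical names, and the demo uses the built-in `MemoryError` as a stand-in for the GPU error.

```python
import time

def retry(action, exceptions, max_tries=5, delay_s=5.0, sleep=time.sleep):
    """Call action() until it succeeds or max_tries attempts have failed."""
    for attempt in range(1, max_tries + 1):
        try:
            return action()
        except exceptions:
            if attempt == max_tries:
                raise                  # give up and surface the last error
            print("New try:", attempt)
            sleep(delay_s)

# Demo with a function that fails twice before succeeding.
state = {"calls": 0}

def flaky_load():
    state["calls"] += 1
    if state["calls"] < 3:
        raise MemoryError("device busy")
    return "loaded"

result = retry(flaky_load, MemoryError, max_tries=5, delay_s=0)
print(result, "after", state["calls"], "calls")
```

In this notebook it could be used as `retry(lambda: model.to("cuda"), torch.OutOfMemoryError)`.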

Save the model (note that this actually just saves the LoRA adapters):

In [None]:

#Change the directory if needed
trainer.save_model("./results_lora")


model.save_pretrained("./results_lora")


tokenizer.save_pretrained("./results_lora")

tracker1.stop()


Now we can quantize this model and test how it performs. First, we do the initial configuration. Again, these CodeCarbon trackers and Wandb projects can be changed according to your needs.

In [None]:
#gpu_ids selects the GPU whose emissions are tracked; change it depending on the GPU used
tracker1 = OfflineEmissionsTracker(country_iso_code="ESP", allow_multiple_runs = True, output_file= "./emissions_8bit_quant_batch64_HQQ_LORA.csv", gpu_ids=[1])
tracker2 = OfflineEmissionsTracker(country_iso_code="ESP", allow_multiple_runs = True, output_file= "./emissions_8bit_eval_batch64_HQQ_LORA.csv", gpu_ids=[1])

wandb.init(project="HQQ_Lora", name="8bit_batch64")

num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

Load the dataset and the tokenizer:

In [None]:
dataset = load_dataset("tweet_eval", "sentiment")

model_name = './results_lora'   #Change the path accordingly

device = 'cuda'

tokenizer = LlamaTokenizer.from_pretrained(
    model_name,
    token=hf_token)
tokenizer.pad_token = tokenizer.eos_token

Quantize the model. We show the three options here: 8-bit, 4-bit and 3-bit. Run only the cell for the option you want, and change the tracker output files and the Wandb project and run names if desired.

In [None]:
# 8 bit
tracker1.start()
config = LlamaConfig.from_pretrained(model_name, num_labels=3)
hqq_config = HqqConfig(nbits=8, group_size=64)
model = LlamaForSequenceClassification.from_pretrained(
    model_name,
    token=hf_token,
    config=config,
    device_map="auto",
    quantization_config=hqq_config)
model.config.pad_token_id = model.config.eos_token_id
tracker1.stop()

In [None]:
# 4 bit
tracker1.start()
config = LlamaConfig.from_pretrained(model_name, num_labels=3)
hqq_config = HqqConfig(nbits=4, group_size=64)
model = LlamaForSequenceClassification.from_pretrained(
    model_name,
    token=hf_token,
    config=config,
    device_map="auto",
    quantization_config=hqq_config)
model.config.pad_token_id = model.config.eos_token_id
tracker1.stop()

In [None]:
# 3 bit
tracker1.start()
config = LlamaConfig.from_pretrained(model_name, num_labels=3)
hqq_config = HqqConfig(nbits=3, group_size=64)
model = LlamaForSequenceClassification.from_pretrained(
    model_name,
    token=hf_token,
    config=config,
    device_map="auto",
    quantization_config=hqq_config)
model.config.pad_token_id = model.config.eos_token_id
tracker1.stop()
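
To give an intuition for what `nbits` and `group_size` control, the following sketch implements plain group-wise affine quantization with round-to-nearest in pure Python. It is deliberately not the HQQ algorithm, which additionally optimizes the quantization parameters instead of taking them directly from the minimum and maximum of each group; all function names and the toy weights are hypothetical.

```python
# Toy group-wise affine quantization (round-to-nearest), NOT the HQQ solver.
# Each group of `group_size` weights gets its own scale and zero point, and
# every weight is stored as an integer code with `nbits` bits.

def quantize_group(weights, nbits):
    levels = 2 ** nbits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # fall back to 1.0 for constant groups
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min

def dequantize_group(codes, scale, zero):
    return [c * scale + zero for c in codes]

def quantize(weights, nbits, group_size):
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, nbits) for g in groups]

def max_error(weights, nbits, group_size):
    restored = []
    for codes, scale, zero in quantize(weights, nbits, group_size):
        restored.extend(dequantize_group(codes, scale, zero))
    return max(abs(a - b) for a, b in zip(weights, restored))

# Deterministic toy weights in [-0.5, 0.5].
weights = [((i * 37) % 101) / 100 - 0.5 for i in range(256)]

for nbits in (8, 4, 3):
    print(f"{nbits} bits -> max abs error {max_error(weights, nbits, group_size=64):.4f}")
```

Fewer bits mean fewer representable levels per group and therefore a larger reconstruction error, which is the trade-off the three cells above explore on the real model.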

Function to evaluate the model in terms of accuracy and inference time:

In [None]:

def evaluate_model(model, tokenizer, dataset, batch_size=8):
    model.eval()

    start_time = time.time()
    true_labels = []
    pred_labels = []
    max_memory_usage = 0

    dataset.set_format("torch")
    test_dataset = dataset["test"]
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    max_length = max([len(tokenizer.tokenize(tweet)) for tweet in test_dataset["text"]])

    for batch in tqdm(test_loader, desc="Test"):
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()

        inputs = tokenizer(batch['text'], return_tensors="pt",
                           padding="max_length", truncation=True, max_length=max_length).to(device)

        with torch.no_grad():
            outputs = model(**inputs)
            pred_batch = torch.argmax(outputs.logits, dim=-1).cpu().tolist()

        memory_used = torch.cuda.max_memory_allocated() / (1024 ** 2) 
        max_memory_usage = max(max_memory_usage, memory_used)

        # Convert the label tensor to plain Python ints before storing it
        true_labels.extend(batch['label'].tolist())
        pred_labels.extend(pred_batch)

    end_time = time.time()
    inference_time = end_time - start_time

    print(classification_report(true_labels, pred_labels, target_names=["Negative", "Neutral", "Positive"]))

    wandb.log({"Max_memory_usage_MB": max_memory_usage})

    return inference_time
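
For readers who want to see what the per-class figures printed by `classification_report` mean, here is a minimal pure-Python sketch of precision, recall and F1 computed from two label lists. The helper names and the example labels are hypothetical; in the notebook the report itself comes from scikit-learn.

```python
def per_class_report(true_labels, pred_labels, num_classes):
    """Return {class_id: (precision, recall, f1)} computed from label lists."""
    pairs = list(zip(true_labels, pred_labels))
    report = {}
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = (precision, recall, f1)
    return report

def accuracy(true_labels, pred_labels):
    correct = sum(1 for t, p in zip(true_labels, pred_labels) if t == p)
    return correct / len(true_labels)

# Made-up labels: 0 = Negative, 1 = Neutral, 2 = Positive.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

for class_id, (p, r, f1) in per_class_report(y_true, y_pred, 3).items():
    print(f"class {class_id}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy(y_true, y_pred):.2f}")
```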

Evaluate and show results:

In [None]:

tracker2.start()
time_no = evaluate_model(model, tokenizer, dataset, 64)
tracker2.stop()

# Results
print(f"Quantized LoRA model hqq 8bit - Inference time: {time_no:.2f}s")