Experiment,Variable to Change,Fixed Variables,Purpose
1. Sensitivity,"Threshold % (0, 1, 5, 10, 20)",Model: 1.5B  Method: NF4,Find the optimal trade-off (Sweet Spot).
2. Methods,"Method (NF4, AWQ, GPTQ)",Model: 1.5B  Threshold: 5% (or Sweet Spot),Compare KLD impact on different quantizers.
3. Scaling,"Model Size (1.5B, 7B)",Method: (Best of Exp 2)  Threshold: 5%,Test if larger models behave differently.

# **Setup & Dependencies**

**We use L4 GPU**

In [1]:
!pip uninstall transformers torch torchaudio torchvision wandb -y
!pip install llmcompressor
!pip install -q accelerate bitsandbytes datasets scipy matplotlib wandb

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from datasets import load_dataset, concatenate_datasets
from datasets import Dataset
import copy
import gc
import time
from tqdm import tqdm
import shutil
import wandb

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Found existing installation: transformers 4.57.3
Uninstalling transformers-4.57.3:
  Successfully uninstalled transformers-4.57.3
Found existing installation: torch 2.9.0+cu126
Uninstalling torch-2.9.0+cu126:
  Successfully uninstalled torch-2.9.0+cu126
Found existing installation: torchaudio 2.9.0+cu126
Uninstalling torchaudio-2.9.0+cu126:
  Successfully uninstalled torchaudio-2.9.0+cu126
Found existing installation: torchvision 0.24.0+cu126
Uninstalling torchvision-0.24.0+cu126:
  Successfully uninstalled torchvision-0.24.0+cu126
Found existing installation: wandb 0.23.1
Uninstalling wandb-0.23.1:
  Successfully uninstalled wandb-0.23.1
Collecting llmcompressor
  Downloading llmcompressor-0.8.1-py3-none-any.whl.metadata (11 kB)
Collecting loguru<=0.7.3,>=0.7.2 (from llmcompressor)
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)
Collecting torch<=2.8.0,>=2.7.0 (from llmcompressor)
  Downloading torch-2.8.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
Collecting 

In [2]:
# Set for reproducibility
import random
import numpy as np
from transformers import set_seed

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
set_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

## **Experiment 2 Configuration**

In [3]:
# --- Experiment Settings ---
MODELS_TO_TEST = ["Qwen/Qwen2.5-1.5B-Instruct"]
SENSITIVITY_THRESHOLDS = [0.0, 0.05, 0.1, 0.2, 0.3]
CALIBRATION_SAMPLES = 128
EVAL_SAMPLES = 5000
WANDB_PROJECT_NAME = "KLD_Quantization_Exp2"

In [4]:
import wandb
import pandas as pd
from datasets import load_dataset, concatenate_datasets
import os
os.environ["WANDB_QUIET"] = "true"

wandb.login()

if 'results_table' not in globals():
    results_table = []

print("Loading MMLU Dataset...")
try:
    mmlu_dataset = concatenate_datasets([
        load_dataset("cais/mmlu", "all", split='test')
    ])
    print(f"MMLU Dataset Loaded. Size: {len(mmlu_dataset)} samples.")
except Exception as e:
    print(f"Error loading MMLU: {e}")
    from datasets import Dataset
    mmlu_dataset = Dataset.from_dict({
        "question": ["1+1=?"], "choices": [["1", "2", "3", "4"]], "answer": [1]
    })

print("Global setup complete. Ready for Step 2.")

  | |_| | '_ \/ _` / _` |  _/ -_)
[34m[1mwandb[0m: (1) Create a W&B account
[34m[1mwandb[0m: (2) Use an existing W&B account
[34m[1mwandb[0m: (3) Don't visualize my results
[34m[1mwandb[0m: Enter your choice:

 2


[34m[1mwandb[0m: You chose 'Use an existing W&B account'
[34m[1mwandb[0m: Logging into https://api.wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: Find your API key here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mjd4123[0m ([33myq171014-columbia-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Loading MMLU Dataset...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

dataset_infos.json: 0.00B [00:00, ?B/s]

all/test-00000-of-00001.parquet:   0%|          | 0.00/3.50M [00:00<?, ?B/s]

all/validation-00000-of-00001.parquet:   0%|          | 0.00/408k [00:00<?, ?B/s]

all/dev-00000-of-00001.parquet:   0%|          | 0.00/76.5k [00:00<?, ?B/s]

all/auxiliary_train-00000-of-00001.parqu(…):   0%|          | 0.00/47.5M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/14042 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1531 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/285 [00:00<?, ? examples/s]

Generating auxiliary_train split:   0%|          | 0/99842 [00:00<?, ? examples/s]

MMLU Dataset Loaded. Size: 14042 samples.
Global setup complete. Ready for Step 2.


## **Metrics & Helper Functions**

In [5]:
def recursive_getattr(obj, attr):
    for part in attr.split('.'):
        obj = getattr(obj, part)
    return obj

def recursive_setattr(obj, attr, val):
    pre, _, post = attr.rpartition('.')
    parent = recursive_getattr(obj, pre) if pre else obj
    setattr(parent, post, val)

In [6]:
# --- MMLU Logic ---
def format_mmlu_prompt(example):
    options = [f"{label}. {example['choices'][i]}" for i, label in enumerate(['A', 'B', 'C', 'D'])]
    prompt_text = f"Question: {example['question']}\nOptions:\n" + "\n".join(options) + "\nAnswer:"
    messages = [
        {"role": "system", "content": "Output only the single letter (A, B, C, or D) corresponding to the correct answer."},
        {"role": "user", "content": prompt_text}
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

def get_mmlu_predictions(model, dataset, num_samples):
    predictions, ground_truths = [], []
    choices = ["A", "B", "C", "D"]
    choice_ids = [tokenizer.encode(c)[0] for c in choices]

    for i in tqdm(range(min(num_samples, len(dataset))), desc="MMLU Eval"):
        ex = dataset[i]
        inputs = tokenizer(format_mmlu_prompt(ex), return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**inputs)
            logits = outputs.logits[0, -1, choice_ids]
            pred = choices[torch.argmax(logits).item()]
        predictions.append(pred)
        ground_truths.append(choices[ex['answer']])
    return predictions, ground_truths

In [7]:
# --- Metrics Helpers ---
def compute_kld(logits_p, logits_q):
    p_probs = F.softmax(logits_p, dim=-1)
    q_log_probs = F.log_softmax(logits_q, dim=-1)
    return nn.KLDivLoss(reduction='batchmean')(q_log_probs, p_probs).item()

def calculate_flip_rate(base_preds, new_preds):
    """Calculates % of answers that changed from the baseline."""
    if not base_preds or not new_preds: return 0.0
    flips = sum([1 for b, n in zip(base_preds, new_preds) if b != n])
    return flips / len(base_preds)

def compute_perplexity(model, tokenizer):
    """Computes perplexity on a subset of WikiText-2"""
    encodings = tokenizer("\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:20]), return_tensors="pt")
    max_length = model.config.max_position_embeddings
    stride = 512
    seq_len = encodings.input_ids.size(1)

    nlls = []
    prev_end_loc = 0
    for begin_loc in tqdm(range(0, seq_len, stride), desc="Computing PPL"):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc
        input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            nlls.append(outputs.loss)

        prev_end_loc = end_loc
        if end_loc == seq_len: break

    return torch.exp(torch.stack(nlls).mean()).item()

def measure_efficiency(model, tokenizer, input_text="Hello world"):
    # 1. Cleanup
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    # 2. Measure Static Memory (Model Weights Only)
    # This shows the pure effect of quantization storage
    static_mem_bytes = torch.cuda.memory_allocated()
    static_mem_gb = static_mem_bytes / 1024**3

    # 3. Run Inference
    input_ids = tokenizer(input_text, return_tensors="pt").to(device)
    start_time = time.time()
    with torch.no_grad():
        _ = model.generate(**input_ids, max_new_tokens=50, min_new_tokens=50)
    torch.cuda.synchronize()
    end_time = time.time()

    # 4. Measure Peak Memory (Weights + KV Cache + Temp Buffers)
    # This shows the "True Cost" to run the model
    peak_mem_bytes = torch.cuda.max_memory_allocated()
    peak_mem_gb = peak_mem_bytes / 1024**3

    latency = end_time - start_time

    return latency, static_mem_gb, peak_mem_gb

def evaluate_full_suite(model, tokenizer, dataset, metric_name):
    """Runs all metrics and returns them."""
    print(f"--- Evaluating: {metric_name} ---")

    # 1. Accuracy
    preds, truths = get_mmlu_predictions(model, dataset, EVAL_SAMPLES)
    acc = sum([1 for p, g in zip(preds, truths) if p == g]) / len(truths)

    # 2. Perplexity
    ppl = compute_perplexity(model, tokenizer)

    # 3. Efficiency (Unpack 3 values now)
    lat, static_mem, peak_mem = measure_efficiency(model, tokenizer)

    print(f"Results -> Acc: {acc:.2%}, PPL: {ppl:.2f}, Latency: {lat:.2f}s, Static Mem: {static_mem:.2f}GB, Peak Mem: {peak_mem:.2f}GB")

    # Return separate memory metrics
    return acc, ppl, lat, static_mem, peak_mem, preds

## **Advanced Sensitivity Profiling**

In [8]:
def profile_restoration_sensitivity(model_q, model_ref, calib_input, granularity='layer'):
    """
    Profiles sensitivity by measuring the KLD improvement when restoring
    individual parts of the quantized model (model_q) back to FP16 (model_ref).

    Returns:
        sensitivity_scores: Dict mapping name -> KLD improvement (Higher is more sensitive).
    """
    print(f"Profiling Restoration Sensitivity (Granularity: {granularity})...")

    # Compute Baseline
    model_ref.eval()

    with torch.no_grad():
        ref_device = next(model_ref.parameters()).device
        base_logits = model_ref(calib_input.to(ref_device)).logits.to(device)
        current_logits = model_q(calib_input.to(device)).logits
        initial_kld = compute_kld(base_logits, current_logits)

    print(f"Initial Quantized KLD: {initial_kld:.6f}")

    sensitivity_scores = {}

    def get_module_by_name(module, access_string):
        names = access_string.split(sep='.')
        return reduce(getattr, names, module)

    from functools import reduce

    # Block-wise or Layer-wise
    if granularity == 'block':
        if hasattr(model_q, 'model') and hasattr(model_q.model, 'layers'):
            iterable_items = list(enumerate(model_q.model.layers))
            prefix = "model.model.layers"
        else:
            raise ValueError("Could not detect transformer blocks structure.")
        iterator = tqdm(iterable_items, desc="Profiling Blocks")
    elif granularity == 'layer':
        # # We limit this to just the linear layers to save time
        # iterable_items = [(n, m) for n, m in model_q.named_modules() if isinstance(m, (nn.Linear,  import_bnb_linear_type_if_needed()))]
        iterable_items = [(n, m) for n, m in model_q.named_modules()
                          if "mlp" in n or "self_attn" in n]
        iterator = tqdm(iterable_items, desc="Profiling Layers")

    # Restoration Loop
    for name_or_idx, module_q in iterator:
        target_name = f"{prefix}.{name_or_idx}" if granularity == 'block' else name_or_idx
        try:
            module_ref = recursive_getattr(model_ref, target_name)
            backup_quant_module = recursive_getattr(model_q, target_name)
            module_fp16_gpu = copy.deepcopy(module_ref).to(device)
            recursive_setattr(model_q, target_name, module_fp16_gpu)

            # Measure New KLD
            with torch.no_grad():
                new_logits = model_q(calib_input.to(device)).logits
                new_kld = compute_kld(base_logits, new_logits)

            improvement = initial_kld - new_kld
            sensitivity_scores[target_name] = improvement
            recursive_setattr(model_q, target_name, backup_quant_module)

            # Cleanup VRAM
            del module_fp16_gpu

        except Exception as e:
            print(f"Skipping {target_name}: {e}")

    return sensitivity_scores

## **The "Surgery" Implementation**

In [17]:
def perform_surgery(model, sensitive_names, fp16_model_cpu):
    """
    Replaces the sensitive quantized layers in 'model' (GPU)
    with the original FP16 layers from 'fp16_model_cpu' (CPU).

    This Generic Version uses deepcopy, so it works for:
    - Individual Linear layers (gate_proj, q_proj)
    - Entire Blocks (Qwen2MLP, Qwen2Attention)
    """
    count = 0
    print(f"Surgery: Replacing {len(sensitive_names)} Sensitive Layers with FP16...")

    for name in sensitive_names:
        try:
            # 1. Get original FP16 module from CPU backup
            #    (This handles Linear, Qwen2MLP, Qwen2Attention, etc.)
            original_module = recursive_getattr(fp16_model_cpu, name)

            # 2. Create a deep copy and move to GPU
            #    We use deepcopy instead of manually instantiating nn.Linear.
            #    This preserves the exact class type and configuration.
            module_fp16_gpu = copy.deepcopy(original_module).to(model.device)

            # 3. Swap into the quantized model
            recursive_setattr(model, name, module_fp16_gpu)

            count += 1

        except Exception as e:
            print(f"Skipping layer {name}: {e}")

    print(f"Surgery Complete: {count} layers restored.")

# Model Selection & Baseline Evaluation

In [10]:
# Select model
CURRENT_MODEL_ID = MODELS_TO_TEST[0]

print(f"{'='*40}\nSelected Model: {CURRENT_MODEL_ID}\n{'='*40}")

tokenizer = AutoTokenizer.from_pretrained(CURRENT_MODEL_ID)
print("Loading FP16 Baseline (This may take a minute)...")
model_fp16 = AutoModelForCausalLM.from_pretrained(
    CURRENT_MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Evaluate Baseline
base_acc, base_ppl, base_lat, base_static_mem, base_peak_mem, base_preds = evaluate_full_suite(
    model_fp16, tokenizer, mmlu_dataset, "FP16 Baseline"
)

# Log Baseline to WandB
run = wandb.init(project=WANDB_PROJECT_NAME, name=f"{CURRENT_MODEL_ID.split('/')[-1]}-Baseline", reinit=True)
wandb.log({
    "Accuracy": base_acc,
    "Perplexity": base_ppl,
    "Latency": base_lat,
    "Static_Memory": base_static_mem,
    "Peak_Memory": base_peak_mem,
    "Threshold": 0,
    "Flip_Rate": 0.0,
    "Method": "Baseline"
})
run.finish()

# Store in Results Table
results_table.append({
    "Model": CURRENT_MODEL_ID,
    "Method": "FP16 Baseline",
    "Threshold": 0,
    "Acc": base_acc,
    "Flip": 0.0,
    "PPL": base_ppl,
    "Latency": base_lat,
    "Static Mem": base_static_mem,
    "Peak Mem": base_peak_mem
})

print("Baseline Loaded & Evaluated.")

Selected Model: Qwen/Qwen2.5-1.5B-Instruct


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Loading FP16 Baseline (This may take a minute)...


config.json:   0%|          | 0.00/660 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

--- Evaluating: FP16 Baseline ---


MMLU Eval: 100%|██████████| 5000/5000 [03:10<00:00, 26.26it/s]


README.md: 0.00B [00:00, ?B/s]

wikitext-2-raw-v1/test-00000-of-00001.pa(…):   0%|          | 0.00/733k [00:00<?, ?B/s]

wikitext-2-raw-v1/train-00000-of-00001.p(…):   0%|          | 0.00/6.36M [00:00<?, ?B/s]

wikitext-2-raw-v1/validation-00000-of-00(…):   0%|          | 0.00/657k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

Computing PPL:   0%|          | 0/3 [00:00<?, ?it/s]


Results -> Acc: 57.04%, PPL: 6.33, Latency: 1.91s, Static Mem: 2.88GB, Peak Mem: 2.89GB




Baseline Loaded & Evaluated.


In [11]:
# Profiling & Offloading
print("Preparing Calibration Data...")
calib_data = tokenizer(
    "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:10]),
    return_tensors="pt"
).input_ids.to(device)

granularity_mode = 'layer'

# Offload FP16 Model to CPU to save memory
print("Moving FP16 model to CPU to free up VRAM...")
model_fp16.cpu()
torch.cuda.empty_cache()
print("VRAM Cleared. Ready for Experiments.")

Preparing Calibration Data...
Moving FP16 model to CPU to free up VRAM...
VRAM Cleared. Ready for Experiments.


# Preliminary Check:
Justifying the "Base" Precision
Variable: Floating Point Type (FP8 vs. FP4/NF4)

Purpose: Before running complex KLD experiments, you must decide what your "base" low-precision format is. If FP4 destroys the model completely and FP8 is perfect, then KLD is needed for FP4 but not FP8. If both are good, FP4 is better for efficiency.

Design:

Run: 1.5B Model (FP16 Baseline) vs. 1.5B (FP8) vs. 1.5B (NF4).

Metric: Perplexity & MMLU Accuracy.

Hypothesis: NF4 offers higher compression but higher degradation than FP8. This justifies using NF4 (or INT4 methods) as the primary candidate for your KLD restoration because it needs the help more than FP8 does.

Decision: If verified, fix 4-bit (NF4/INT4) as the standard base for the rest of the experiments.

**Refer to the FP8 vs FP4 notebook**

**Results: NF4 offers higher compression but higher degradation than FP8. We will use NF4 (or INT4 methods) as the primary candidate for the following experiments because it needs the help more than FP8 does.**

# Experiment 1: Sensitivity Analysis
Research Question: How much of the model actually needs to be kept in high precision to recover performance? Is there a point of diminishing returns?

Fixed Variables:

Model: Qwen2.5-1.5B (Small enough to run fast, big enough to show signal).

Method: NF4 (The simplest 4-bit baseline).

Independent Variable (Change this):

KLD Threshold / % Restored: 0% (Baseline), 1%, 5%, 10%, 20%.


**Refer to the FP8 vs FP4 notebook**


# Experiment 2: Algorithm Comparison
Research Question: Does KLD guidance work better on top of simple quantization (NF4) or advanced quantization (AWQ/GPTQ)?

Fixed Variables:

Model: Qwen2.5-1.5B (Consistent with Exp 1).

Threshold: Fix this to the "winner" from Exp 1 (e.g., if 5% was the sweet spot, use 5% for all).

Independent Variable (Change this):

Method: NF4 vs. AWQ vs. GPTQ.

Why: AWQ and GPTQ already do some optimization. You want to see if your KLD method adds value on top of them, or if it's only useful for naive methods like NF4.

#### AWQ

In [12]:
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor import oneshot

In [13]:
# AWQ
print(f"\n--- Starting Experiment: AWQ ({CURRENT_MODEL_ID}) ---")

print("Running AWQ Oneshot Quantization...")
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
calib_data_obj = Dataset.from_dict({"text": [text for text in ds["text"] if len(text) > 0][:128]})

recipe = [AWQModifier(targets="Linear", scheme="W4A16")]
oneshot(
    model=CURRENT_MODEL_ID,
    dataset=calib_data_obj,
    recipe=recipe,
    output_dir="./awq_temp",
    num_calibration_samples=128,
    max_seq_length=512,
    save_compressed=True
)

model_awq = AutoModelForCausalLM.from_pretrained(
    "./awq_temp", device_map="auto", trust_remote_code=True
)

print("Profiling AWQ Sensitivity...")
awq_sensitivity = profile_restoration_sensitivity(
    model_q=model_awq,
    model_ref=model_fp16,
    calib_input=calib_data,
    granularity='block'
)

sorted_awq = sorted(awq_sensitivity.items(), key=lambda x: x[1], reverse=True)
all_layer_names = [n for n, s in sorted_awq]

# Experiment loop
sorted_thresholds = sorted(SENSITIVITY_THRESHOLDS)
current_restored_count = 0

for threshold in sorted_thresholds:
    print(f"\nTargeting Threshold: {threshold:.0%} kept in FP16")

    target_count = int(len(all_layer_names) * threshold)
    layers_to_fix_now = all_layer_names[current_restored_count : target_count]

    if layers_to_fix_now:
        print(f"Restoring {len(layers_to_fix_now)} additional layers...")
        perform_surgery(model_awq, layers_to_fix_now, model_fp16)
        current_restored_count = target_count

    run = wandb.init(
        project=WANDB_PROJECT_NAME,
        name=f"{CURRENT_MODEL_ID.split('/')[-1]}-AWQ-{threshold}",
        config={"model": CURRENT_MODEL_ID, "method": "KLD-AWQ", "threshold": threshold},
        reinit=True
    )

    acc, ppl, lat, static_mem, peak_mem, preds = evaluate_full_suite(
        model_awq, tokenizer, mmlu_dataset, f"KLD-AWQ-{threshold:.0%}"
    )

    flip = calculate_flip_rate(base_preds, preds)

    wandb.log({
        "Accuracy": acc, "Perplexity": ppl, "Latency": lat, "Static_Memory": static_mem,
        "Peak_Memory": peak_mem, "Flip_Rate": flip, "Threshold": threshold
    })

    results_table.append({
        "Model": CURRENT_MODEL_ID,
        "Method": "KLD-AWQ",
        "Threshold": threshold,
        "Acc": acc,
        "Flip": flip,
        "PPL": ppl,
        "Latency": lat,
        "Static Mem": static_mem,
        "Peak Mem": peak_mem
    })
    run.finish()

shutil.rmtree("./awq_temp")
del model_awq
torch.cuda.empty_cache()


--- Starting Experiment: AWQ (Qwen/Qwen2.5-1.5B-Instruct) ---
Running AWQ Oneshot Quantization...


`torch_dtype` is deprecated! Use `dtype` instead!


Tokenizing:   0%|          | 0/128 [00:00<?, ? examples/s]

2025-12-17T02:17:59.510724+0000 | reset | INFO - Compression lifecycle reset
2025-12-17T02:17:59.520085+0000 | _create_default_logger | INFO - Logging all LLM Compressor modifier-level logs to sparse_logs/17-12-2025_02.17.59.log
2025-12-17T02:17:59.521046+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-12-17T02:17:59.555096+0000 | on_initialize | INFO - No AWQModifier.mappings provided, inferring from model...


Resolving mapping 1/4 (0 skipped): : 28it [00:00, 1268.16it/s]
Resolving mapping 2/4 (27 skipped): : 28it [00:00, 1547.49it/s]
Resolving mapping 3/4 (0 skipped): : 28it [00:00, 1228.78it/s]
Resolving mapping 4/4 (0 skipped): : 28it [00:00, 1561.46it/s]

2025-12-17T02:17:59.656091+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-12-17T02:17:59.657016+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `AWQModifier`



Preparing cache: 100%|██████████| 128/128 [00:00<00:00, 177.26it/s]
(1/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 56.11it/s]
Smoothing: 100%|██████████| 3/3 [00:06<00:00,  2.30s/it]
(1/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 100.07it/s]
(2/29): Calibrating: 100%|██████████| 128/128 [00:01<00:00, 88.98it/s]
Smoothing: 100%|██████████| 3/3 [00:07<00:00,  2.34s/it]
(2/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 307.95it/s]
(3/29): Calibrating: 100%|██████████| 128/128 [00:01<00:00, 126.65it/s]
Smoothing: 100%|██████████| 3/3 [00:06<00:00,  2.32s/it]
(3/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 308.12it/s]
(4/29): Calibrating: 100%|██████████| 128/128 [00:01<00:00, 125.09it/s]
Smoothing: 100%|██████████| 3/3 [00:06<00:00,  2.32s/it]
(4/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 309.67it/s]
(5/29): Calibrating: 100%|██████████| 128/128 [00:01<00:00, 115.30it/s]
Smoothing: 100%|██████████| 3/3 [00:06<00:00,  2.31s/it]


2025-12-17T02:22:03.662600+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-12-17T02:22:03.716883+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.



Compressing model: 197it [00:03, 51.29it/s]
Compressing model: 197it [00:00, 993.49it/s] 


Profiling AWQ Sensitivity...
Profiling Restoration Sensitivity (Granularity: block)...
Initial Quantized KLD: 142.224121


Profiling Blocks: 100%|██████████| 28/28 [00:00<00:00, 30840.47it/s]

Skipping model.model.layers.0: 'Qwen2Model' object has no attribute 'model'
Skipping model.model.layers.1: 'Qwen2Model' object has no attribute 'model'
Skipping model.model.layers.2: 'Qwen2Model' object has no attribute 'model'
Skipping model.model.layers.3: 'Qwen2Model' object has no attribute 'model'
Skipping model.model.layers.4: 'Qwen2Model' object has no attribute 'model'
Skipping model.model.layers.5: 'Qwen2Model' object has no attribute 'model'
Skipping model.model.layers.6: 'Qwen2Model' object has no attribute 'model'
Skipping model.model.layers.7: 'Qwen2Model' object has no attribute 'model'
Skipping model.model.layers.8: 'Qwen2Model' object has no attribute 'model'
Skipping model.model.layers.9: 'Qwen2Model' object has no attribute 'model'
Skipping model.model.layers.10: 'Qwen2Model' object has no attribute 'model'
Skipping model.model.layers.11: 'Qwen2Model' object has no attribute 'model'
Skipping model.model.layers.12: 'Qwen2Model' object has no attribute 'model'
Skipping 




--- Evaluating: KLD-AWQ-0% ---


MMLU Eval: 100%|██████████| 5000/5000 [15:48<00:00,  5.27it/s]
Computing PPL:   0%|          | 0/3 [00:00<?, ?it/s]


Results -> Acc: 51.90%, PPL: 7.77, Latency: 9.05s, Static Mem: 4.06GB, Peak Mem: 6.24GB



Targeting Threshold: 5% kept in FP16


--- Evaluating: KLD-AWQ-5% ---


MMLU Eval: 100%|██████████| 5000/5000 [15:48<00:00,  5.27it/s]
Computing PPL:   0%|          | 0/3 [00:00<?, ?it/s]


Results -> Acc: 51.90%, PPL: 7.77, Latency: 8.92s, Static Mem: 4.07GB, Peak Mem: 6.24GB



Targeting Threshold: 10% kept in FP16


--- Evaluating: KLD-AWQ-10% ---


MMLU Eval: 100%|██████████| 5000/5000 [15:48<00:00,  5.27it/s]
Computing PPL:   0%|          | 0/3 [00:00<?, ?it/s]


Results -> Acc: 51.90%, PPL: 7.77, Latency: 8.98s, Static Mem: 4.07GB, Peak Mem: 6.25GB



Targeting Threshold: 20% kept in FP16


--- Evaluating: KLD-AWQ-20% ---


MMLU Eval: 100%|██████████| 5000/5000 [15:49<00:00,  5.27it/s]
Computing PPL:   0%|          | 0/3 [00:00<?, ?it/s]


Results -> Acc: 51.90%, PPL: 7.77, Latency: 8.93s, Static Mem: 4.06GB, Peak Mem: 6.24GB



Targeting Threshold: 30% kept in FP16


--- Evaluating: KLD-AWQ-30% ---


MMLU Eval: 100%|██████████| 5000/5000 [15:50<00:00,  5.26it/s]
Computing PPL:   0%|          | 0/3 [00:00<?, ?it/s]


Results -> Acc: 51.90%, PPL: 7.77, Latency: 9.09s, Static Mem: 4.06GB, Peak Mem: 6.24GB


#### GPTQ

In [18]:
# GPTQ

print(f"\n--- Starting Experiment: GPTQ ({CURRENT_MODEL_ID}) ---")

print("Running GPTQ Optimization...")
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
calib_data_obj = Dataset.from_dict({"text": [text for text in ds["text"] if len(text) > 0][:128]})

recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        ignore=["lm_head"],
        dampening_frac=0.01
    )
]

oneshot(
    model=CURRENT_MODEL_ID,
    dataset=calib_data_obj,
    recipe=recipe,
    output_dir="./gptq_temp",
    num_calibration_samples=128,
    max_seq_length=512,
    save_compressed=True
)

model_gptq = AutoModelForCausalLM.from_pretrained(
    "./gptq_temp", device_map="auto", trust_remote_code=True, torch_dtype=torch.bfloat16
)

print("Profiling GPTQ Sensitivity...")
gptq_sensitivity = profile_restoration_sensitivity(
    model_q=model_gptq,
    model_ref=model_fp16,
    calib_input=calib_data,
    granularity= 'layer'
)

sorted_gptq = sorted(gptq_sensitivity.items(), key=lambda x: x[1], reverse=True)
all_layer_names = [n for n, s in sorted_gptq]

sorted_thresholds = sorted(SENSITIVITY_THRESHOLDS)
current_restored_count = 0

for threshold in sorted_thresholds:
    print(f"\nTargeting Threshold: {threshold:.0%} kept in FP16")

    target_count = int(len(all_layer_names) * threshold)
    layers_to_fix_now = all_layer_names[current_restored_count : target_count]

    if layers_to_fix_now:
        print(f"Restoring {len(layers_to_fix_now)} additional layers...")
        perform_surgery(model_gptq, layers_to_fix_now, model_fp16)
        current_restored_count = target_count

    run = wandb.init(
        project=WANDB_PROJECT_NAME,
        name=f"{CURRENT_MODEL_ID.split('/')[-1]}-GPTQ-{threshold}",
        config={"model": CURRENT_MODEL_ID, "method": "KLD-GPTQ", "threshold": threshold},
        reinit=True
    )

    acc, ppl, lat, mem, preds = evaluate_full_suite(
        model_gptq, tokenizer, mmlu_dataset, f"KLD-GPTQ-{threshold:.0%}"
    )

    flip = calculate_flip_rate(base_preds, preds)

    wandb.log({
        "Accuracy": acc, "Perplexity": ppl, "Latency": lat, "Static_Memory": static_mem,
        "Peak_Memory": peak_mem, "Flip_Rate": flip, "Threshold": threshold
    })

    results_table.append({
        "Model": CURRENT_MODEL_ID,
        "Method": "KLD-GPTQ",
        "Threshold": threshold,
        "Acc": acc,
        "Flip": flip,
        "PPL": ppl,
        "Latency": lat,
        "Static Mem": static_mem,
        "Peak Mem": peak_mem
    })
    run.finish()

shutil.rmtree("./gptq_temp")
del model_gptq
torch.cuda.empty_cache()


--- Starting Experiment: GPTQ (Qwen/Qwen2.5-1.5B-Instruct) ---
Running GPTQ Optimization...


Tokenizing:   0%|          | 0/128 [00:00<?, ? examples/s]

2025-12-17T04:18:47.673653+0000 | reset | INFO - Compression lifecycle reset
2025-12-17T04:18:47.677213+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-12-17T04:18:47.716021+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-12-17T04:18:47.717778+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `GPTQModifier`


Preparing cache: 100%|██████████| 128/128 [00:01<00:00, 118.11it/s]
(1/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 45.88it/s]

2025-12-17T04:18:52.836161+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.q_proj using 128 samples





2025-12-17T04:18:53.768407+0000 | compress | METRIC - time 0.93s
2025-12-17T04:18:53.770683+0000 | compress | METRIC - error 710.89
2025-12-17T04:18:53.772297+0000 | compress | METRIC - GPU 0 | usage: 83.07% | total memory: 24 GB
2025-12-17T04:18:53.773710+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:18:53.775049+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.k_proj using 128 samples
2025-12-17T04:18:54.673974+0000 | compress | METRIC - time 0.90s
2025-12-17T04:18:54.676151+0000 | compress | METRIC - error 109.75
2025-12-17T04:18:54.677841+0000 | compress | METRIC - GPU 0 | usage: 83.07% | total memory: 24 GB
2025-12-17T04:18:54.679119+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:18:54.680518+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.v_proj using 128 samples
2025-12-17T04:18:55.574114+0000 | compress | METRIC - time 0.89s
2025-12-17T04:18:55.576375+0000 | compress | METRIC - e

(1/29): Propagating: 100%|██████████| 128/128 [00:01<00:00, 68.99it/s]
(2/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.44it/s]

2025-12-17T04:19:08.705177+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.q_proj using 128 samples





2025-12-17T04:19:09.615605+0000 | compress | METRIC - time 0.91s
2025-12-17T04:19:09.617936+0000 | compress | METRIC - error 439.97
2025-12-17T04:19:09.619492+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:19:09.620771+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:19:09.622204+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.k_proj using 128 samples
2025-12-17T04:19:10.514133+0000 | compress | METRIC - time 0.89s
2025-12-17T04:19:10.516296+0000 | compress | METRIC - error 127.43
2025-12-17T04:19:10.517673+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:19:10.518852+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:19:10.520376+0000 | compress_modules | INFO - Quantizing model.layers.1.self_attn.v_proj using 128 samples
2025-12-17T04:19:11.406071+0000 | compress | METRIC - time 0.88s
2025-12-17T04:19:11.408288+0000 | compress | METRIC - e

(2/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 260.82it/s]
(3/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.62it/s]

2025-12-17T04:19:22.819693+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.q_proj using 128 samples





2025-12-17T04:19:23.754272+0000 | compress | METRIC - time 0.93s
2025-12-17T04:19:23.756453+0000 | compress | METRIC - error 1234.91
2025-12-17T04:19:23.757787+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:19:23.759068+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:19:23.760550+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.k_proj using 128 samples
2025-12-17T04:19:24.646802+0000 | compress | METRIC - time 0.89s
2025-12-17T04:19:24.649140+0000 | compress | METRIC - error 262.90
2025-12-17T04:19:24.650739+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:19:24.652063+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:19:24.653416+0000 | compress_modules | INFO - Quantizing model.layers.2.self_attn.v_proj using 128 samples
2025-12-17T04:19:25.531815+0000 | compress | METRIC - time 0.88s
2025-12-17T04:19:25.533963+0000 | compress | METRIC - 

(3/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 302.15it/s]
(4/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.40it/s]

2025-12-17T04:19:36.881270+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.q_proj using 128 samples





2025-12-17T04:19:37.790139+0000 | compress | METRIC - time 0.91s
2025-12-17T04:19:37.792532+0000 | compress | METRIC - error 1078.17
2025-12-17T04:19:37.793871+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:19:37.795450+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:19:37.796944+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.k_proj using 128 samples
2025-12-17T04:19:38.688384+0000 | compress | METRIC - time 0.89s
2025-12-17T04:19:38.690669+0000 | compress | METRIC - error 228.41
2025-12-17T04:19:38.692323+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:19:38.693613+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:19:38.695229+0000 | compress_modules | INFO - Quantizing model.layers.3.self_attn.v_proj using 128 samples
2025-12-17T04:19:39.580079+0000 | compress | METRIC - time 0.88s
2025-12-17T04:19:39.582071+0000 | compress | METRIC - 

(4/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 299.95it/s]
(5/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.70it/s]

2025-12-17T04:19:50.986337+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.q_proj using 128 samples





2025-12-17T04:19:51.884126+0000 | compress | METRIC - time 0.90s
2025-12-17T04:19:51.886332+0000 | compress | METRIC - error 1017.68
2025-12-17T04:19:51.887808+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:19:51.889078+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:19:51.890488+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.k_proj using 128 samples
2025-12-17T04:19:52.819803+0000 | compress | METRIC - time 0.93s
2025-12-17T04:19:52.822240+0000 | compress | METRIC - error 194.82
2025-12-17T04:19:52.823838+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:19:52.825113+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:19:52.826543+0000 | compress_modules | INFO - Quantizing model.layers.4.self_attn.v_proj using 128 samples
2025-12-17T04:19:53.724286+0000 | compress | METRIC - time 0.90s
2025-12-17T04:19:53.727243+0000 | compress | METRIC - 

(5/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 300.36it/s]
(6/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.52it/s]

2025-12-17T04:20:05.051789+0000 | compress_modules | INFO - Quantizing model.layers.5.self_attn.q_proj using 128 samples





2025-12-17T04:20:05.969491+0000 | compress | METRIC - time 0.92s
2025-12-17T04:20:05.971787+0000 | compress | METRIC - error 1087.30
2025-12-17T04:20:05.973378+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:20:05.975110+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:20:05.976691+0000 | compress_modules | INFO - Quantizing model.layers.5.self_attn.k_proj using 128 samples
2025-12-17T04:20:06.877935+0000 | compress | METRIC - time 0.90s
2025-12-17T04:20:06.879877+0000 | compress | METRIC - error 232.03
2025-12-17T04:20:06.881275+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:20:06.882605+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:20:06.883939+0000 | compress_modules | INFO - Quantizing model.layers.5.self_attn.v_proj using 128 samples
2025-12-17T04:20:07.764980+0000 | compress | METRIC - time 0.88s
2025-12-17T04:20:07.767260+0000 | compress | METRIC - 

(6/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 294.33it/s]
(7/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.33it/s]

2025-12-17T04:20:19.066857+0000 | compress_modules | INFO - Quantizing model.layers.6.self_attn.q_proj using 128 samples





2025-12-17T04:20:19.971927+0000 | compress | METRIC - time 0.90s
2025-12-17T04:20:19.974184+0000 | compress | METRIC - error 1406.10
2025-12-17T04:20:19.975710+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:20:19.976913+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:20:19.978616+0000 | compress_modules | INFO - Quantizing model.layers.6.self_attn.k_proj using 128 samples
2025-12-17T04:20:20.861703+0000 | compress | METRIC - time 0.88s
2025-12-17T04:20:20.864370+0000 | compress | METRIC - error 293.04
2025-12-17T04:20:20.866213+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:20:20.867594+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:20:20.868911+0000 | compress_modules | INFO - Quantizing model.layers.6.self_attn.v_proj using 128 samples
2025-12-17T04:20:21.743194+0000 | compress | METRIC - time 0.87s
2025-12-17T04:20:21.745403+0000 | compress | METRIC - 

(7/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 296.91it/s]
(8/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.45it/s]

2025-12-17T04:20:33.008147+0000 | compress_modules | INFO - Quantizing model.layers.7.self_attn.q_proj using 128 samples





2025-12-17T04:20:33.933029+0000 | compress | METRIC - time 0.92s
2025-12-17T04:20:33.935454+0000 | compress | METRIC - error 741.22
2025-12-17T04:20:33.936978+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:20:33.938297+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:20:33.939670+0000 | compress_modules | INFO - Quantizing model.layers.7.self_attn.k_proj using 128 samples
2025-12-17T04:20:34.818853+0000 | compress | METRIC - time 0.88s
2025-12-17T04:20:34.821592+0000 | compress | METRIC - error 143.60
2025-12-17T04:20:34.823149+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:20:34.824642+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:20:34.826078+0000 | compress_modules | INFO - Quantizing model.layers.7.self_attn.v_proj using 128 samples
2025-12-17T04:20:35.705784+0000 | compress | METRIC - time 0.88s
2025-12-17T04:20:35.708250+0000 | compress | METRIC - e

(8/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 300.98it/s]
(9/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.44it/s]

2025-12-17T04:20:47.010587+0000 | compress_modules | INFO - Quantizing model.layers.8.self_attn.q_proj using 128 samples





2025-12-17T04:20:47.918996+0000 | compress | METRIC - time 0.91s
2025-12-17T04:20:47.921247+0000 | compress | METRIC - error 1552.70
2025-12-17T04:20:47.922532+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:20:47.923825+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:20:47.925376+0000 | compress_modules | INFO - Quantizing model.layers.8.self_attn.k_proj using 128 samples
2025-12-17T04:20:48.844259+0000 | compress | METRIC - time 0.92s
2025-12-17T04:20:48.846733+0000 | compress | METRIC - error 273.26
2025-12-17T04:20:48.848082+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:20:48.849475+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:20:48.850950+0000 | compress_modules | INFO - Quantizing model.layers.8.self_attn.v_proj using 128 samples
2025-12-17T04:20:49.749467+0000 | compress | METRIC - time 0.90s
2025-12-17T04:20:49.751836+0000 | compress | METRIC - 

(9/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 296.07it/s]
(10/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.44it/s]

2025-12-17T04:21:01.112362+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.q_proj using 128 samples





2025-12-17T04:21:02.023793+0000 | compress | METRIC - time 0.91s
2025-12-17T04:21:02.026289+0000 | compress | METRIC - error 1391.74
2025-12-17T04:21:02.027991+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:21:02.029363+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:21:02.030759+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.k_proj using 128 samples
2025-12-17T04:21:02.920028+0000 | compress | METRIC - time 0.89s
2025-12-17T04:21:02.922312+0000 | compress | METRIC - error 284.27
2025-12-17T04:21:02.923607+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:21:02.924926+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:21:02.926573+0000 | compress_modules | INFO - Quantizing model.layers.9.self_attn.v_proj using 128 samples
2025-12-17T04:21:03.803178+0000 | compress | METRIC - time 0.88s
2025-12-17T04:21:03.805651+0000 | compress | METRIC - 

(10/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 300.96it/s]
(11/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.59it/s]

2025-12-17T04:21:15.130709+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.q_proj using 128 samples





2025-12-17T04:21:16.039797+0000 | compress | METRIC - time 0.91s
2025-12-17T04:21:16.042271+0000 | compress | METRIC - error 1394.13
2025-12-17T04:21:16.043667+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:21:16.044954+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:21:16.046291+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.k_proj using 128 samples
2025-12-17T04:21:16.922584+0000 | compress | METRIC - time 0.87s
2025-12-17T04:21:16.925133+0000 | compress | METRIC - error 270.61
2025-12-17T04:21:16.926775+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:21:16.927921+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:21:16.929383+0000 | compress_modules | INFO - Quantizing model.layers.10.self_attn.v_proj using 128 samples
2025-12-17T04:21:17.801661+0000 | compress | METRIC - time 0.87s
2025-12-17T04:21:17.803951+0000 | compress | METRIC 

(11/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 301.07it/s]
(12/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.57it/s]

2025-12-17T04:21:29.072229+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.q_proj using 128 samples





2025-12-17T04:21:30.012611+0000 | compress | METRIC - time 0.94s
2025-12-17T04:21:30.014956+0000 | compress | METRIC - error 1486.99
2025-12-17T04:21:30.016562+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:21:30.018051+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:21:30.019399+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.k_proj using 128 samples
2025-12-17T04:21:30.901516+0000 | compress | METRIC - time 0.88s
2025-12-17T04:21:30.903818+0000 | compress | METRIC - error 290.09
2025-12-17T04:21:30.905314+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:21:30.906688+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:21:30.908054+0000 | compress_modules | INFO - Quantizing model.layers.11.self_attn.v_proj using 128 samples
2025-12-17T04:21:31.801401+0000 | compress | METRIC - time 0.89s
2025-12-17T04:21:31.803837+0000 | compress | METRIC 

(12/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 295.71it/s]
(13/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.54it/s]

2025-12-17T04:21:43.167720+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.q_proj using 128 samples





2025-12-17T04:21:44.116292+0000 | compress | METRIC - time 0.95s
2025-12-17T04:21:44.118665+0000 | compress | METRIC - error 1917.99
2025-12-17T04:21:44.120269+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:21:44.121629+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:21:44.123109+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.k_proj using 128 samples
2025-12-17T04:21:45.038943+0000 | compress | METRIC - time 0.91s
2025-12-17T04:21:45.041417+0000 | compress | METRIC - error 411.33
2025-12-17T04:21:45.073630+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:21:45.074847+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:21:45.076356+0000 | compress_modules | INFO - Quantizing model.layers.12.self_attn.v_proj using 128 samples
2025-12-17T04:21:45.970327+0000 | compress | METRIC - time 0.89s
2025-12-17T04:21:45.972631+0000 | compress | METRIC 

(13/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 292.94it/s]
(14/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.29it/s]

2025-12-17T04:21:57.172004+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.q_proj using 128 samples





2025-12-17T04:21:58.088171+0000 | compress | METRIC - time 0.91s
2025-12-17T04:21:58.090504+0000 | compress | METRIC - error 1357.13
2025-12-17T04:21:58.091941+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:21:58.093270+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:21:58.094692+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.k_proj using 128 samples
2025-12-17T04:21:58.982956+0000 | compress | METRIC - time 0.89s
2025-12-17T04:21:58.985233+0000 | compress | METRIC - error 271.90
2025-12-17T04:21:58.986673+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:21:58.987921+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:21:58.989361+0000 | compress_modules | INFO - Quantizing model.layers.13.self_attn.v_proj using 128 samples
2025-12-17T04:21:59.867358+0000 | compress | METRIC - time 0.88s
2025-12-17T04:21:59.869609+0000 | compress | METRIC 

(14/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 297.11it/s]
(15/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.44it/s]

2025-12-17T04:22:11.191607+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.q_proj using 128 samples





2025-12-17T04:22:12.151519+0000 | compress | METRIC - time 0.96s
2025-12-17T04:22:12.153916+0000 | compress | METRIC - error 2631.61
2025-12-17T04:22:12.155272+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:22:12.156683+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:22:12.158096+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.k_proj using 128 samples
2025-12-17T04:22:13.049277+0000 | compress | METRIC - time 0.89s
2025-12-17T04:22:13.051888+0000 | compress | METRIC - error 394.17
2025-12-17T04:22:13.053412+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:22:13.054867+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:22:13.056175+0000 | compress_modules | INFO - Quantizing model.layers.14.self_attn.v_proj using 128 samples
2025-12-17T04:22:13.922846+0000 | compress | METRIC - time 0.87s
2025-12-17T04:22:13.925206+0000 | compress | METRIC 

(15/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 295.75it/s]
(16/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.41it/s]

2025-12-17T04:22:25.202215+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.q_proj using 128 samples





2025-12-17T04:22:26.130286+0000 | compress | METRIC - time 0.93s
2025-12-17T04:22:26.132630+0000 | compress | METRIC - error 2947.74
2025-12-17T04:22:26.134190+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:22:26.135319+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:22:26.136701+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.k_proj using 128 samples
2025-12-17T04:22:27.029370+0000 | compress | METRIC - time 0.89s
2025-12-17T04:22:27.031766+0000 | compress | METRIC - error 329.60
2025-12-17T04:22:27.033323+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:22:27.034686+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:22:27.035956+0000 | compress_modules | INFO - Quantizing model.layers.15.self_attn.v_proj using 128 samples
2025-12-17T04:22:27.910858+0000 | compress | METRIC - time 0.87s
2025-12-17T04:22:27.913205+0000 | compress | METRIC 

(16/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 298.34it/s]
(17/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.38it/s]

2025-12-17T04:22:39.157952+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.q_proj using 128 samples





2025-12-17T04:22:40.079479+0000 | compress | METRIC - time 0.92s
2025-12-17T04:22:40.081859+0000 | compress | METRIC - error 2564.20
2025-12-17T04:22:40.083374+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:22:40.084662+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:22:40.086058+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.k_proj using 128 samples
2025-12-17T04:22:40.965682+0000 | compress | METRIC - time 0.88s
2025-12-17T04:22:40.968147+0000 | compress | METRIC - error 480.11
2025-12-17T04:22:40.969621+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:22:40.970944+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:22:40.972483+0000 | compress_modules | INFO - Quantizing model.layers.16.self_attn.v_proj using 128 samples
2025-12-17T04:22:41.854457+0000 | compress | METRIC - time 0.88s
2025-12-17T04:22:41.856734+0000 | compress | METRIC 

(17/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 299.40it/s]
(18/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.25it/s]

2025-12-17T04:22:53.098498+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.q_proj using 128 samples





2025-12-17T04:22:54.012041+0000 | compress | METRIC - time 0.91s
2025-12-17T04:22:54.014387+0000 | compress | METRIC - error 2253.02
2025-12-17T04:22:54.015822+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:22:54.016970+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:22:54.018262+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.k_proj using 128 samples
2025-12-17T04:22:54.936799+0000 | compress | METRIC - time 0.92s
2025-12-17T04:22:54.939224+0000 | compress | METRIC - error 276.60
2025-12-17T04:22:54.940719+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:22:54.942017+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:22:54.943486+0000 | compress_modules | INFO - Quantizing model.layers.17.self_attn.v_proj using 128 samples
2025-12-17T04:22:55.817225+0000 | compress | METRIC - time 0.87s
2025-12-17T04:22:55.819521+0000 | compress | METRIC 

(18/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 299.54it/s]
(19/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.48it/s]

2025-12-17T04:23:07.048189+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.q_proj using 128 samples





2025-12-17T04:23:07.973092+0000 | compress | METRIC - time 0.92s
2025-12-17T04:23:07.975586+0000 | compress | METRIC - error 1817.72
2025-12-17T04:23:07.977100+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:23:07.978033+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:23:07.979689+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.k_proj using 128 samples
2025-12-17T04:23:08.883052+0000 | compress | METRIC - time 0.90s
2025-12-17T04:23:08.885431+0000 | compress | METRIC - error 303.91
2025-12-17T04:23:08.887169+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:23:08.888665+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:23:08.890137+0000 | compress_modules | INFO - Quantizing model.layers.18.self_attn.v_proj using 128 samples
2025-12-17T04:23:09.765446+0000 | compress | METRIC - time 0.87s
2025-12-17T04:23:09.767894+0000 | compress | METRIC 

(19/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 298.47it/s]
(20/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.49it/s]

2025-12-17T04:23:21.007283+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.q_proj using 128 samples





2025-12-17T04:23:21.954974+0000 | compress | METRIC - time 0.95s
2025-12-17T04:23:21.957696+0000 | compress | METRIC - error 2435.56
2025-12-17T04:23:21.959206+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:23:21.960604+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:23:21.962017+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.k_proj using 128 samples
2025-12-17T04:23:22.861289+0000 | compress | METRIC - time 0.90s
2025-12-17T04:23:22.863843+0000 | compress | METRIC - error 321.94
2025-12-17T04:23:22.865543+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:23:22.866931+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:23:22.868269+0000 | compress_modules | INFO - Quantizing model.layers.19.self_attn.v_proj using 128 samples
2025-12-17T04:23:23.749626+0000 | compress | METRIC - time 0.88s
2025-12-17T04:23:23.752193+0000 | compress | METRIC 

(20/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 299.67it/s]
(21/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.37it/s]

2025-12-17T04:23:35.046177+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.q_proj using 128 samples





2025-12-17T04:23:35.953991+0000 | compress | METRIC - time 0.91s
2025-12-17T04:23:35.956531+0000 | compress | METRIC - error 3125.38
2025-12-17T04:23:35.957852+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:23:35.959055+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:23:35.960474+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.k_proj using 128 samples
2025-12-17T04:23:36.869055+0000 | compress | METRIC - time 0.91s
2025-12-17T04:23:36.871394+0000 | compress | METRIC - error 392.27
2025-12-17T04:23:36.872658+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:23:36.873889+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:23:36.875325+0000 | compress_modules | INFO - Quantizing model.layers.20.self_attn.v_proj using 128 samples
2025-12-17T04:23:37.747273+0000 | compress | METRIC - time 0.87s
2025-12-17T04:23:37.749762+0000 | compress | METRIC 

(21/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 298.14it/s]
(22/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.60it/s]

2025-12-17T04:23:48.998834+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.q_proj using 128 samples





2025-12-17T04:23:49.929968+0000 | compress | METRIC - time 0.93s
2025-12-17T04:23:49.932326+0000 | compress | METRIC - error 3046.51
2025-12-17T04:23:49.933892+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:23:49.935212+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:23:49.936555+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.k_proj using 128 samples
2025-12-17T04:23:50.818215+0000 | compress | METRIC - time 0.88s
2025-12-17T04:23:50.820794+0000 | compress | METRIC - error 372.67
2025-12-17T04:23:50.822344+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:23:50.823452+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:23:50.825154+0000 | compress_modules | INFO - Quantizing model.layers.21.self_attn.v_proj using 128 samples
2025-12-17T04:23:51.702463+0000 | compress | METRIC - time 0.88s
2025-12-17T04:23:51.704897+0000 | compress | METRIC 

(22/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 301.88it/s]
(23/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.59it/s]

2025-12-17T04:24:02.952560+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.q_proj using 128 samples





2025-12-17T04:24:03.853390+0000 | compress | METRIC - time 0.90s
2025-12-17T04:24:03.855888+0000 | compress | METRIC - error 2884.45
2025-12-17T04:24:03.857344+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:24:03.858621+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:24:03.860059+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.k_proj using 128 samples
2025-12-17T04:24:04.757940+0000 | compress | METRIC - time 0.90s
2025-12-17T04:24:04.760373+0000 | compress | METRIC - error 434.48
2025-12-17T04:24:04.761753+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:24:04.762943+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:24:04.764538+0000 | compress_modules | INFO - Quantizing model.layers.22.self_attn.v_proj using 128 samples
2025-12-17T04:24:05.628128+0000 | compress | METRIC - time 0.86s
2025-12-17T04:24:05.630540+0000 | compress | METRIC 

(23/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 298.99it/s]
(24/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.47it/s]

2025-12-17T04:24:16.952689+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.q_proj using 128 samples





2025-12-17T04:24:17.852100+0000 | compress | METRIC - time 0.90s
2025-12-17T04:24:17.854590+0000 | compress | METRIC - error 4123.63
2025-12-17T04:24:17.856111+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:24:17.857302+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:24:17.859082+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.k_proj using 128 samples
2025-12-17T04:24:18.768115+0000 | compress | METRIC - time 0.91s
2025-12-17T04:24:18.770602+0000 | compress | METRIC - error 484.24
2025-12-17T04:24:18.771895+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:24:18.773175+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:24:18.774547+0000 | compress_modules | INFO - Quantizing model.layers.23.self_attn.v_proj using 128 samples
2025-12-17T04:24:19.647113+0000 | compress | METRIC - time 0.87s
2025-12-17T04:24:19.649594+0000 | compress | METRIC 

(24/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 298.38it/s]
(25/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.45it/s]

2025-12-17T04:24:30.984035+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.q_proj using 128 samples





2025-12-17T04:24:31.901263+0000 | compress | METRIC - time 0.92s
2025-12-17T04:24:31.903841+0000 | compress | METRIC - error 3723.69
2025-12-17T04:24:31.905680+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:24:31.907095+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:24:31.908641+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.k_proj using 128 samples
2025-12-17T04:24:32.812041+0000 | compress | METRIC - time 0.90s
2025-12-17T04:24:32.813861+0000 | compress | METRIC - error 495.07
2025-12-17T04:24:32.815376+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:24:32.816872+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:24:32.818493+0000 | compress_modules | INFO - Quantizing model.layers.24.self_attn.v_proj using 128 samples
2025-12-17T04:24:33.725200+0000 | compress | METRIC - time 0.91s
2025-12-17T04:24:33.727739+0000 | compress | METRIC 

(25/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 300.39it/s]
(26/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.25it/s]

2025-12-17T04:24:45.072862+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.q_proj using 128 samples





2025-12-17T04:24:45.997235+0000 | compress | METRIC - time 0.92s
2025-12-17T04:24:45.999697+0000 | compress | METRIC - error 4435.08
2025-12-17T04:24:46.001301+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:24:46.002608+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:24:46.004269+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.k_proj using 128 samples
2025-12-17T04:24:46.895666+0000 | compress | METRIC - time 0.89s
2025-12-17T04:24:46.898112+0000 | compress | METRIC - error 466.20
2025-12-17T04:24:46.899688+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:24:46.901093+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:24:46.902623+0000 | compress_modules | INFO - Quantizing model.layers.25.self_attn.v_proj using 128 samples
2025-12-17T04:24:47.776024+0000 | compress | METRIC - time 0.87s
2025-12-17T04:24:47.778503+0000 | compress | METRIC 

(26/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 292.41it/s]
(27/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.28it/s]

2025-12-17T04:24:59.100623+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.q_proj using 128 samples





2025-12-17T04:24:59.999780+0000 | compress | METRIC - time 0.90s
2025-12-17T04:25:00.002276+0000 | compress | METRIC - error 4561.47
2025-12-17T04:25:00.003941+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:25:00.005256+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:25:00.006761+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.k_proj using 128 samples
2025-12-17T04:25:00.918978+0000 | compress | METRIC - time 0.91s
2025-12-17T04:25:00.921607+0000 | compress | METRIC - error 569.13
2025-12-17T04:25:00.923537+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:25:00.924928+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:25:00.926285+0000 | compress_modules | INFO - Quantizing model.layers.26.self_attn.v_proj using 128 samples
2025-12-17T04:25:01.797222+0000 | compress | METRIC - time 0.87s
2025-12-17T04:25:01.799667+0000 | compress | METRIC 

(27/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 292.31it/s]
(28/29): Calibrating: 100%|██████████| 128/128 [00:02<00:00, 55.31it/s]

2025-12-17T04:25:13.090210+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.q_proj using 128 samples





2025-12-17T04:25:13.985374+0000 | compress | METRIC - time 0.89s
2025-12-17T04:25:13.987846+0000 | compress | METRIC - error 4358.91
2025-12-17T04:25:13.989048+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:25:13.990407+0000 | compress | METRIC - Compressed module size: 4.77696 MB
2025-12-17T04:25:13.991970+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.k_proj using 128 samples
2025-12-17T04:25:14.880475+0000 | compress | METRIC - time 0.89s
2025-12-17T04:25:14.882808+0000 | compress | METRIC - error 461.29
2025-12-17T04:25:14.884176+0000 | compress | METRIC - GPU 0 | usage: 81.13% | total memory: 24 GB
2025-12-17T04:25:14.885344+0000 | compress | METRIC - Compressed module size: 0.79616 MB
2025-12-17T04:25:14.886658+0000 | compress_modules | INFO - Quantizing model.layers.27.self_attn.v_proj using 128 samples
2025-12-17T04:25:15.753789+0000 | compress | METRIC - time 0.87s
2025-12-17T04:25:15.756180+0000 | compress | METRIC 

(28/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 288.55it/s]
(29/29): Calibrating: 100%|██████████| 128/128 [00:00<00:00, 167.68it/s]
(29/29): Propagating: 100%|██████████| 128/128 [00:00<00:00, 173.65it/s]


2025-12-17T04:25:26.712211+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-12-17T04:25:26.776847+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.


Compressing model: 196it [00:03, 51.60it/s]
`torch_dtype` is deprecated! Use `dtype` instead!
Compressing model: 196it [00:00, 891.78it/s]


Profiling GPTQ Sensitivity...
Profiling Restoration Sensitivity (Granularity: layer)...
Initial Quantized KLD: 34.250000


Profiling Layers:   0%|          | 1/280 [00:00<00:53,  5.22it/s]

Skipping model.layers.0.self_attn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 105.38 MiB is free. Process 27680 has 22.05 GiB memory in use. Of the allocated memory 21.32 GiB is allocated by PyTorch, and 477.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Profiling Layers:   1%|          | 2/280 [00:00<00:53,  5.23it/s]

Skipping model.layers.0.self_attn.q_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 105.38 MiB is free. Process 27680 has 22.05 GiB memory in use. Of the allocated memory 21.32 GiB is allocated by PyTorch, and 470.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Profiling Layers:   1%|          | 3/280 [00:00<00:52,  5.25it/s]

Skipping model.layers.0.self_attn.k_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 105.38 MiB is free. Process 27680 has 22.05 GiB memory in use. Of the allocated memory 21.32 GiB is allocated by PyTorch, and 474.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Profiling Layers:   1%|▏         | 4/280 [00:00<00:52,  5.26it/s]

Skipping model.layers.0.self_attn.v_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 105.38 MiB is free. Process 27680 has 22.05 GiB memory in use. Of the allocated memory 21.32 GiB is allocated by PyTorch, and 473.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Profiling Layers:   2%|▏         | 5/280 [00:00<00:52,  5.26it/s]

Skipping model.layers.0.self_attn.o_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 125.38 MiB is free. Process 27680 has 22.03 GiB memory in use. Of the allocated memory 21.33 GiB is allocated by PyTorch, and 447.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)


Profiling Layers:   2%|▎         | 7/280 [00:01<00:52,  5.17it/s]

Skipping model.layers.0.mlp: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 125.38 MiB is free. Process 27680 has 22.03 GiB memory in use. Of the allocated memory 21.27 GiB is allocated by PyTorch, and 507.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.0.mlp.gate_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 45.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.42 GiB is allocated by PyTorch, and 427.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See docu

Profiling Layers:   3%|▎         | 9/280 [00:01<00:52,  5.20it/s]

Skipping model.layers.0.mlp.up_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 45.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.42 GiB is allocated by PyTorch, and 427.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.0.mlp.down_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 45.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.42 GiB is allocated by PyTorch, and 427.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  S

Profiling Layers:   4%|▍         | 11/280 [00:02<00:50,  5.29it/s]

Skipping model.layers.0.mlp.act_fn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 45.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.40 GiB is allocated by PyTorch, and 453.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.1.self_attn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 43.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.41 GiB is allocated by PyTorch, and 444.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See do

Profiling Layers:   5%|▍         | 13/280 [00:02<00:49,  5.36it/s]

Skipping model.layers.1.self_attn.q_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 43.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.41 GiB is allocated by PyTorch, and 439.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.1.self_attn.k_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 43.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.41 GiB is allocated by PyTorch, and 443.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmenta

Profiling Layers:   5%|▌         | 15/280 [00:02<00:49,  5.41it/s]

Skipping model.layers.1.self_attn.v_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 43.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.41 GiB is allocated by PyTorch, and 443.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.1.self_attn.o_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 43.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.41 GiB is allocated by PyTorch, and 439.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmenta

Profiling Layers:   6%|▌         | 17/280 [00:03<00:49,  5.32it/s]

Skipping model.layers.1.mlp: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 43.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.36 GiB is allocated by PyTorch, and 498.41 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.1.mlp.gate_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 43.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.38 GiB is allocated by PyTorch, and 471.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See docum

Profiling Layers:   7%|▋         | 19/280 [00:03<00:49,  5.33it/s]

Skipping model.layers.1.mlp.up_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 43.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.38 GiB is allocated by PyTorch, and 471.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.1.mlp.down_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 43.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.38 GiB is allocated by PyTorch, and 471.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  S

Profiling Layers:   8%|▊         | 21/280 [00:03<00:47,  5.45it/s]

Skipping model.layers.1.mlp.act_fn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 43.38 MiB is free. Process 27680 has 22.11 GiB memory in use. Of the allocated memory 21.36 GiB is allocated by PyTorch, and 497.41 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.2.self_attn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 41.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.37 GiB is allocated by PyTorch, and 488.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See do

Profiling Layers:   8%|▊         | 23/280 [00:04<00:47,  5.45it/s]

Skipping model.layers.2.self_attn.q_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 41.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.37 GiB is allocated by PyTorch, and 483.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.2.self_attn.k_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 41.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.37 GiB is allocated by PyTorch, and 488.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmenta

Profiling Layers:   9%|▉         | 25/280 [00:04<00:45,  5.54it/s]

Skipping model.layers.2.self_attn.v_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 41.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.37 GiB is allocated by PyTorch, and 489.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.2.self_attn.o_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 41.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.37 GiB is allocated by PyTorch, and 484.41 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmenta

Profiling Layers:  10%|▉         | 27/280 [00:05<00:46,  5.46it/s]

Skipping model.layers.2.mlp: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 41.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.32 GiB is allocated by PyTorch, and 542.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.2.mlp.gate_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 41.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.34 GiB is allocated by PyTorch, and 516.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See docum

Profiling Layers:  10%|█         | 29/280 [00:05<00:45,  5.48it/s]

Skipping model.layers.2.mlp.up_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 41.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.34 GiB is allocated by PyTorch, and 516.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.2.mlp.down_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 41.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.34 GiB is allocated by PyTorch, and 516.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  S

Profiling Layers:  11%|█         | 31/280 [00:05<00:44,  5.62it/s]

Skipping model.layers.2.mlp.act_fn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 41.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.32 GiB is allocated by PyTorch, and 542.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.3.self_attn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 39.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.33 GiB is allocated by PyTorch, and 534.09 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See do

Profiling Layers:  12%|█▏        | 33/280 [00:06<00:43,  5.72it/s]

Skipping model.layers.3.self_attn.q_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 39.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.33 GiB is allocated by PyTorch, and 529.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.3.self_attn.k_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 39.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.33 GiB is allocated by PyTorch, and 533.34 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmenta

Profiling Layers:  12%|█▎        | 35/280 [00:06<00:42,  5.77it/s]

Skipping model.layers.3.self_attn.v_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 39.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.33 GiB is allocated by PyTorch, and 533.34 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.3.self_attn.o_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 39.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.33 GiB is allocated by PyTorch, and 530.34 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmenta

Profiling Layers:  13%|█▎        | 37/280 [00:06<00:42,  5.65it/s]

Skipping model.layers.3.mlp: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 39.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.40 GiB is allocated by PyTorch, and 456.09 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.3.mlp.gate_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 39.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.30 GiB is allocated by PyTorch, and 562.28 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See docum

Profiling Layers:  14%|█▍        | 39/280 [00:07<00:42,  5.69it/s]

Skipping model.layers.3.mlp.up_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 39.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.30 GiB is allocated by PyTorch, and 561.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.3.mlp.down_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 39.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.30 GiB is allocated by PyTorch, and 562.28 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  S

Profiling Layers:  15%|█▍        | 41/280 [00:07<00:41,  5.78it/s]

Skipping model.layers.3.mlp.act_fn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 39.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.40 GiB is allocated by PyTorch, and 456.09 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.4.self_attn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 37.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.41 GiB is allocated by PyTorch, and 447.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See do

Profiling Layers:  15%|█▌        | 43/280 [00:07<00:40,  5.87it/s]

Skipping model.layers.4.self_attn.q_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 37.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.42 GiB is allocated by PyTorch, and 442.33 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.4.self_attn.k_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 37.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.41 GiB is allocated by PyTorch, and 446.09 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmenta

Profiling Layers:  16%|█▌        | 45/280 [00:08<00:40,  5.86it/s]

Skipping model.layers.4.self_attn.v_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 37.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.41 GiB is allocated by PyTorch, and 446.09 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.4.self_attn.o_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 37.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.42 GiB is allocated by PyTorch, and 442.34 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmenta

Profiling Layers:  17%|█▋        | 47/280 [00:08<00:40,  5.81it/s]

Skipping model.layers.4.mlp: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 37.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.36 GiB is allocated by PyTorch, and 500.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.4.mlp.gate_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 37.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.39 GiB is allocated by PyTorch, and 474.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See docum

Profiling Layers:  18%|█▊        | 49/280 [00:08<00:39,  5.83it/s]

Skipping model.layers.4.mlp.up_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 37.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.39 GiB is allocated by PyTorch, and 474.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.4.mlp.down_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 37.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.39 GiB is allocated by PyTorch, and 474.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  S

Profiling Layers:  18%|█▊        | 51/280 [00:09<00:38,  5.97it/s]

Skipping model.layers.4.mlp.act_fn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 37.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.36 GiB is allocated by PyTorch, and 501.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.5.self_attn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 35.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.37 GiB is allocated by PyTorch, and 492.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See do

Profiling Layers:  19%|█▉        | 53/280 [00:09<00:37,  6.10it/s]

Skipping model.layers.5.self_attn.q_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 35.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.37 GiB is allocated by PyTorch, and 488.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.5.self_attn.k_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 35.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.37 GiB is allocated by PyTorch, and 492.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmenta

Profiling Layers:  20%|█▉        | 55/280 [00:09<00:36,  6.18it/s]

Skipping model.layers.5.self_attn.v_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 35.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.37 GiB is allocated by PyTorch, and 492.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.5.self_attn.o_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 35.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.38 GiB is allocated by PyTorch, and 487.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmenta

Profiling Layers:  20%|██        | 57/280 [00:10<00:36,  6.06it/s]

Skipping model.layers.5.mlp: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 33.38 MiB is free. Process 27680 has 22.12 GiB memory in use. Of the allocated memory 21.33 GiB is allocated by PyTorch, and 534.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.5.mlp.gate_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 5.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.36 GiB is allocated by PyTorch, and 535.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See docume

Profiling Layers:  21%|██        | 59/280 [00:10<00:35,  6.14it/s]

Skipping model.layers.5.mlp.up_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 7.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.36 GiB is allocated by PyTorch, and 533.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.5.mlp.down_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 7.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.36 GiB is allocated by PyTorch, and 533.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See

Profiling Layers:  22%|██▏       | 61/280 [00:10<00:34,  6.29it/s]

Skipping model.layers.5.mlp.act_fn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 7.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.45 GiB is allocated by PyTorch, and 441.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.6.self_attn: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 5.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.34 GiB is allocated by PyTorch, and 551.21 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See docu

Profiling Layers:  22%|██▎       | 63/280 [00:11<00:33,  6.43it/s]

Skipping model.layers.6.self_attn.q_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 5.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.35 GiB is allocated by PyTorch, and 546.70 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.6.self_attn.k_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 5.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.34 GiB is allocated by PyTorch, and 551.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentati

Profiling Layers:  23%|██▎       | 65/280 [00:11<00:33,  6.44it/s]

Skipping model.layers.6.self_attn.v_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 5.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.46 GiB is allocated by PyTorch, and 432.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.6.self_attn.o_proj: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 5.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.35 GiB is allocated by PyTorch, and 546.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentati

Profiling Layers:  27%|██▋       | 76/280 [00:11<00:07, 25.89it/s]

Skipping model.layers.6.mlp.gate_proj: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 5.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.49 GiB is allocated by PyTorch, and 397.90 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.6.mlp.up_proj: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 5.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.49 GiB is allocated by PyTorch, and 397.90 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See d

Profiling Layers:  52%|█████▏    | 146/280 [00:11<00:00, 171.14it/s]

Skipping model.layers.8.self_attn.o_proj: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 1.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.52 GiB is allocated by PyTorch, and 372.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.8.mlp: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 1.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 389.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See docume

Profiling Layers:  82%|████████▏ | 230/280 [00:12<00:00, 285.10it/s]

Skipping model.layers.16.self_attn.o_proj: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 1.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.51 GiB is allocated by PyTorch, and 379.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.16.mlp: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 1.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.51 GiB is allocated by PyTorch, and 379.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See docu

Profiling Layers: 100%|██████████| 280/280 [00:12<00:00, 22.97it/s] 

Skipping model.layers.24.mlp: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 1.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.51 GiB is allocated by PyTorch, and 379.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Skipping model.layers.24.mlp.gate_proj: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 1.38 MiB is free. Process 27680 has 22.15 GiB memory in use. Of the allocated memory 21.51 GiB is allocated by PyTorch, and 379.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documen




--- Evaluating: KLD-GPTQ-0% ---


MMLU Eval:  14%|█▍        | 698/5000 [01:28<09:05,  7.88it/s]


KeyboardInterrupt: 

### Visualization

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Create DataFrame
df = pd.DataFrame(results_table)

# Display the data
print(df)

# 2. Set up the figure with 2 subplots (Static vs Peak)
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

# --- Plot A: Static Memory (Model Size) vs Flip Rate ---
sns.scatterplot(
    data=df, x='Static Mem', y='Flip', hue='Method', style='Method',
    s=200, palette='viridis', ax=axes[0]
)
axes[0].set_title("Compression Efficiency: Static Memory vs Flip Rate")
axes[0].set_xlabel("Static Memory (GB) - [Weights Only]")
axes[0].set_ylabel("Flip Rate (Lower is Better)")
axes[0].grid(True, alpha=0.3)

# Add labels for Plot A
for i in range(df.shape[0]):
    row = df.iloc[i]
    # Check if 'Static_Mem' exists to avoid errors
    if 'Static_Mem' in row:
        axes[0].text(row.Static_Mem + 0.01, row.Flip + 0.001, f"{row.Threshold:.0%}", fontsize=9)

# --- Plot B: Peak Memory (Runtime Cost) vs Flip Rate ---
sns.scatterplot(
    data=df, x='Peak Mem', y='Flip', hue='Method', style='Method',
    s=200, palette='magma', ax=axes[1]
)
axes[1].set_title("Runtime Efficiency: Peak Memory vs Flip Rate")
axes[1].set_xlabel("Peak Memory (GB) - [Weights + Activations + Overhead]")
axes[1].set_ylabel("Flip Rate (Lower is Better)")
axes[1].grid(True, alpha=0.3)

# Add labels for Plot B
for i in range(df.shape[0]):
    row = df.iloc[i]
    if 'Peak_Mem' in row:
        axes[1].text(row.Peak_Mem + 0.05, row.Flip + 0.001, f"{row.Threshold:.0%}", fontsize=9)

plt.tight_layout()
plt.savefig("efficiency_frontier_comparison_Experiment2.png")
plt.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure DataFrame is ready
df = pd.DataFrame(results_table)

# Set up the figure with 3 subplots side-by-side
fig, axes = plt.subplots(1, 3, figsize=(24, 6))

# --- Plot 1: Perplexity (Lower is Better) ---
sns.lineplot(
    data=df, x='Threshold', y='PPL', hue='Method', style='Method',
    markers=True, markersize=10, linewidth=2.5, ax=axes[0], palette='viridis'
)
axes[0].set_title("Perplexity vs Restoration (Lower is Better)")
axes[0].set_xlabel("% Layers Restored to BF16")
axes[0].set_ylabel("Perplexity")
axes[0].grid(True, alpha=0.3)
axes[0].invert_yaxis() # Optional: If you want 'better' (lower) to be higher up, remove if standard view preferred.

# --- Plot 2: Accuracy (Higher is Better) ---
sns.lineplot(
    data=df, x='Threshold', y='Acc', hue='Method', style='Method',
    markers=True, markersize=10, linewidth=2.5, ax=axes[1], palette='magma'
)
axes[1].set_title("MMLU Accuracy vs Restoration (Higher is Better)")
axes[1].set_xlabel("% Layers Restored to BF16")
axes[1].set_ylabel("Accuracy")
axes[1].grid(True, alpha=0.3)

# --- Plot 3: Latency (Lower is Better) ---
sns.lineplot(
    data=df, x='Threshold', y='Latency', hue='Method', style='Method',
    markers=True, markersize=10, linewidth=2.5, ax=axes[2], palette='coolwarm'
)
axes[2].set_title("Inference Latency vs Restoration")
axes[2].set_xlabel("% Layers Restored to BF16")
axes[2].set_ylabel("Latency (Seconds)")
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("metrics_comparison_Experiment2.png")
plt.show()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import os
import shutil # Import shutil for copying files

# Define the directory to save files in Google Drive
save_dir = '/content/drive/MyDrive/Columbia-LLMSeminar/SLLM project/Jiayi/exp2'
os.makedirs(save_dir, exist_ok=True)

# Save results_table as CSV
df_results = pd.DataFrame(results_table)
df_results.to_csv(os.path.join(save_dir, 'exp2_results.csv'), index=False)
print(f"Results table saved to {os.path.join(save_dir, 'exp2_results.csv')}") # Corrected print statement

# Copy figures to Google Drive
figures_to_copy = [
    'efficiency_frontier_comparison_Experiment2.png',
    'metrics_comparison_Experiment2.png'
]

for fig_name in figures_to_copy:
    fig_path = os.path.join('/content', fig_name)
    if os.path.exists(fig_path):
        # Use shutil.copy2 to copy the file, then os.remove to delete the original
        shutil.copy2(fig_path, os.path.join(save_dir, fig_name))
        # os.remove(fig_path) # Remove original after copying
        print(f"Copied {fig_name} to {os.path.join(save_dir, fig_name)} and removed original.")
    else:
        print(f"Figure {fig_name} not found in current directory.")

# Experiment 3: Model Size
Research Question: Does this technique generalize to larger models? (Larger models are usually more robust to quantization; do they need less restoration?)

Fixed Variables:

Method: The "Winner" from Exp 2 (likely NF4 for speed or AWQ for accuracy).

Threshold: Fix to the "sweet spot" (e.g., 5%).

Independent Variable (Change this):

Model Size: 1.5B vs. 7B.