# Fine-Tuning Llama 3.2 1B: Step-by-Step Guide
 To fine-tune the Llama 3.2 1B model, you can follow the steps below:



### Big Picture Overview of Parameter Efficient Fine Tuning Methods like LoRA and QLoRA Fine Tuning for Sequence Classification

**The Essence of Fine-tuning**
- LLMs are pre-trained on vast amounts of data for broad language understanding.
- Fine-tuning is crucial for specializing in specific domains or tasks, involving adjustments with smaller, relevant datasets.

**Model Fine-tuning with PEFT: Exploring LoRA and QLoRA**
- Traditional fine-tuning is resource-intensive; PEFT (Parameter Efficient Fine-tuning) makes the process faster and less demanding.
- Focus on two PEFT methods: LoRA and QLoRA.

**The Power of PEFT**
- PEFT modifies only a subset of the LLM's parameters, enhancing speed and reducing memory demands, making it suitable for less powerful devices.

**LoRA: Efficiency through Adapters**
- **Low-Rank Adaptation (LoRA):** Injects small trainable adapters into the pre-trained model.
- **Equation:** For a weight matrix $W$, LoRA approximates $W = W_0 + BA$, where $W_0$ is the original weight matrix, and $BA$ represents the low-rank modification through trainable matrices $B$ and $A$.
- Adapters learn task nuances while keeping the majority of the LLM unchanged, minimizing overhead.

**QLoRA: Compression and Speed**
- **Quantized LoRA (QLoRA):** Extends LoRA by quantizing the model’s weights, further reducing size and enhancing speed.
- **Innovations in QLoRA:**
  1. **4-bit Quantization:** Uses a 4-bit data type, NormalFloat (NF4), for optimal weight quantization, drastically reducing memory usage.
  2. **Low-Rank Adapters:** Fine-tuned with 16-bit precision to effectively capture task-specific nuances.
  3. **Double Quantization:** Reduces quantization constants from 32-bit to 8-bit, saving additional memory without accuracy loss.
  4. **Paged Optimizers:** Manages memory efficiently during training, optimizing for large tasks.

**Why PEFT Matters**
- **Rapid Learning:** Speeds up model adaptation.
- **Smaller Footprint:** Eases deployment with reduced model size.
- **Edge-Friendly:** Fits better on devices with limited resources, enhancing accessibility.

**Conclusion**
- PEFT methods like LoRA and QLoRA revolutionize LLM fine-tuning by focusing on efficiency, facilitating faster adaptability, smaller models, and broader device compatibility.



**1. Setting up**

In [1]:
!nvidia-smi
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install -U xformers

Wed Feb  5 01:50:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   32C    P8              8W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00

In [2]:
!pip uninstall torchvision -y
!pip install torchvision

Found existing installation: torchvision 0.20.1+cu121
Uninstalling torchvision-0.20.1+cu121:
  Successfully uninstalled torchvision-0.20.1+cu121
Collecting torchvision
  Downloading torchvision-0.21.0-cp310-cp310-manylinux1_x86_64.whl.metadata (6.1 kB)
Downloading torchvision-0.21.0-cp310-cp310-manylinux1_x86_64.whl (7.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.2/7.2 MB[0m [31m72.9 MB/s[0m eta [36m0:00:00[0m:00:01[0m0:01[0m
[?25hInstalling collected packages: torchvision
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fastai 2.7.18 requires torch<2.6,>=1.10, but you have torch 2.6.0 which is incompatible.[0m[31m
[0mSuccessfully installed torchvision-0.21.0


In [38]:
from google.colab import userdata
from huggingface_hub import login
import wandb

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("HUGGINGFACE_TOKEN")
secret_value_1 = user_secrets.get_secret("wandb")


# Retrieve Hugging Face token
hf_token = secret_value_0#userdata.get("HUGGINGFACE_TOKEN")
if hf_token:
    login(hf_token)
else:
    print("❌ Hugging Face token not found!")

# Retrieve Weights & Biases (wandb) token
wb_token = secret_value_1#userdata.get("wandb")
if wb_token:
    wandb.login(key=wb_token)
    run = wandb.init(
        project="Fine-tune-DeepSeek-R1-Distill-Llama-8B on Medical COT Dataset",
        job_type="training",
        anonymous="allow"
    )
else:
    print("❌ Weights & Biases token not found!")



#2. Loading the model and tokenizer


In [39]:
from unsloth import FastLanguageModel

max_seq_length = 2048
dtype = None
load_in_4bit = True


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B",#"unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token,
)

==((====))==  Unsloth 2025.1.8: Fast Llama patching. Transformers: 4.48.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


#3. Model inference before fine-tuning


Input (Patent Abstract Example):
"A system and method for optimizing power consumption in wireless sensor networks. The invention utilizes adaptive duty cycling and energy-efficient routing algorithms to extend battery life while maintaining network performance. The approach dynamically adjusts transmission power and sleep schedules based on real-time environmental conditions and data transmission requirements."


Test Prompt:
"Classify the given patent abstract into one of the predefined categories: [Biotechnology, Computer Technology, Electrical Engineering, Mechanical Engineering, Pharmaceuticals, Telecommunications, Automotive, Medical Devices, Others]. Provide only the category label."


Expected Output:
"Telecommunications"

In [5]:
def create_prompt_formats(sample):
    """
    Creates a formatted prompt template for a prompt in the instruction dataset

    :param sample: Prompt or sample from the instruction dataset
    """

    # Initialize static strings for the prompt template
    INTRO_BLURB = "Below is an summary of a patent. These summaries are flattened narratives with a simpler discourse structure"
    INSTRUCTION_KEY = "### Instruction:"
    INPUT_KEY = "Input:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"

    # Combine a prompt with the static strings
    blurb = f"{INTRO_BLURB}"
    instruction = "Classify the given patent abstract into one of the predefined categories: [Human Necessities, Performing Operations; Transporting, Chemistry; Metallurgy, Textiles; Paper, Fixed Constructions, Mechanical Engineering; Lightning; Heating; Weapons; Blasting, Physics, Electricity, General tagging of new or cross-sectional technology]."
    input_context = f"{INPUT_KEY}\n{sample['input']}" if sample["input"] else None
    response = f"{RESPONSE_KEY}\n{sample['output']}"
    end = f"{END_KEY}"

    # Create a list of prompt template elements
    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    # Join prompt template elements into a single string to create the prompt template
    formatted_prompt = "\n\n".join(parts)

    # Store the formatted prompt template in a new key "text"
    sample["text"] = formatted_prompt

    return sample

In [6]:
prompt_style = """Below is an summary of a patent. These summaries are flattened narratives with a simpler discourse structure.

### Instruction:
Classify the given patent abstract into one of the predefined categories: \n\nHuman Necessities\nPerforming Operations; Transporting\nChemistry; Metallurgy\nTextiles; Paper\nFixed Constructions\nMechanical Engineering; Lightning; Heating; Weapons; Blasting\nPhysics\nElectricity\nGeneral tagging of new or cross-sectional technology\n\n.

Select **only one category** that best fits the patent abstract.

### Input:
{}

### Response:
"""

In [7]:
question = """
This text defines key terms related to pharmaceutical compositions for treating stroke, emphasizing the use of Protein Kinase C (PKC) activators. "Administration" encompasses various routes, while "effective amount" signifies a dosage sufficient for symptom reduction. "PKC activator" refers to substances enhancing PKC reaction rates. Pharmaceutically acceptable carriers are crucial for safe and effective delivery of active ingredients. Different compounds, including Bryostatin-1, diacylglycerol (DAG) derivatives, and growth factors, can activate PKC. Bryostatin-1 shows promise due to its ability to cross the blood-brain barrier and its biphasic dose-response. Combinatorial libraries are used for screening and optimizing PKC activators for improved therapeutic outcomes. The text also mentions preclinical studies in rats demonstrating the potential of Bryostatin-1 in restoring cognitive function after stroke."""

In [8]:
FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=20,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response)#response[0].split("### Response:")[1])

['<|begin_of_text|>Below is an summary of a patent. These summaries are flattened narratives with a simpler discourse structure.\n\n### Instruction:\nClassify the given patent abstract into one of the predefined categories: \n\nHuman Necessities\nPerforming Operations; Transporting\nChemistry; Metallurgy\nTextiles; Paper\nFixed Constructions\nMechanical Engineering; Lightning; Heating; Weapons; Blasting\nPhysics\nElectricity\nGeneral tagging of new or cross-sectional technology\n\n.\n\nSelect **only one category** that best fits the patent abstract.\n\n### Input:\n\nThis text defines key terms related to pharmaceutical compositions for treating stroke, emphasizing the use of Protein Kinase C (PKC) activators. "Administration" encompasses various routes, while "effective amount" signifies a dosage sufficient for symptom reduction. "PKC activator" refers to substances enhancing PKC reaction rates. Pharmaceutically acceptable carriers are crucial for safe and effective delivery of active 

#4. Loading and processing the dataset


In [36]:
from datasets import load_dataset

# Load dataset
dataset = load_dataset("ccdv/patent-classification")

# Extract labels
labels = dataset["train"].features["label"].names
num_labels = len(labels)

print("Labels:", labels)

Labels: ['Human Necessities', 'Performing Operations; Transporting', 'Chemistry; Metallurgy', 'Textiles; Paper', 'Fixed Constructions', 'Mechanical Engineering; Lightning; Heating; Weapons; Blasting', 'Physics', 'Electricity', 'General tagging of new or cross-sectional technology']


In [37]:
def preprocess_function(examples):
    encoding = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)
    encoding["label"] = examples["label"]  # Assign integer labels
    return encoding

# Apply tokenization
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [11]:
# label_map = {
#     0: "Human Necessities",
#     1: "Performing Operations; Transporting",
#     2: "Chemistry; Metallurgy",
#     3: "Textiles; Paper",
#     4: "Fixed Constructions",
#     5: "Mechanical Engineering; Lightning; Heating; Weapons; Blasting",
#     6: "Physics",
#     7: "Electricity",
#     8: "General tagging of new or cross-sectional technology"
# }

In [12]:
from collections import Counter
print(Counter(dataset['train']['label']))
print(Counter(dataset['validation']['label']))
print(Counter(dataset['test']['label']))

Counter({6: 5408, 7: 5321, 0: 3614, 1: 3357, 8: 2562, 2: 2099, 5: 1730, 4: 705, 3: 204})
Counter({6: 1092, 7: 1049, 1: 705, 0: 703, 8: 497, 2: 421, 5: 347, 4: 146, 3: 40})
Counter({6: 1107, 7: 1035, 0: 754, 1: 649, 8: 503, 2: 394, 5: 358, 4: 156, 3: 44})


#5. Setting up the model


In [40]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

In [41]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_dataset["train"],  # Use the tokenized train dataset
    dataset_text_field="input_ids",  # Referring to the tokenized field, not "text"
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=3000,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)


#6. Model training


In [43]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 25,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 3,000
 "-____-"     Number of trainable parameters = 11,272,192


Step,Training Loss
10,1.9566
20,1.9919
30,2.0308
40,2.0832
50,2.0447
60,1.9965
70,2.0255
80,2.061
90,2.014
100,2.0368


#7. Model inference after fine-tuning


In [46]:
prompt_style.format(dataset["test"][122], "")

"Below is an summary of a patent. These summaries are flattened narratives with a simpler discourse structure.\n\n### Instruction:\nClassify the given patent abstract into one of the predefined categories: \n\nHuman Necessities\nPerforming Operations; Transporting\nChemistry; Metallurgy\nTextiles; Paper\nFixed Constructions\nMechanical Engineering; Lightning; Heating; Weapons; Blasting\nPhysics\nElectricity\nGeneral tagging of new or cross-sectional technology\n\n.\n\nSelect **only one category** that best fits the patent abstract.\n\n### Input:\n{'text': 'the present invention is based in part on the observation that different inulin preparations appear to synergize with sulfonylureas to different extent , judging by the dosage of inulin required to achieve effective synergy in the treatment of type - 2 diabetes mellitus ( t2dm ) patients . the present invention is concerned with assessing the useful degree of polymerization ( dp ) range for inulin preparations , preferably food grade

In [47]:
#question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"

FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(dataset["test"][100], "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=200,
    use_cache=True,
)

response = tokenizer.batch_decode(outputs)
print(response[0]) #.split("### Response:")[1])


<|begin_of_text|>Below is an summary of a patent. These summaries are flattened narratives with a simpler discourse structure.

### Instruction:
Classify the given patent abstract into one of the predefined categories: 

Human Necessities
Performing Operations; Transporting
Chemistry; Metallurgy
Textiles; Paper
Fixed Constructions
Mechanical Engineering; Lightning; Heating; Weapons; Blasting
Physics
Electricity
General tagging of new or cross-sectional technology

.

Select **only one category** that best fits the patent abstract.

### Input:
{'text': 'in accordance with a first embodiment of the present invention, a document 12 includes plural - bit data steganographically encoded thereon ( e. g., via digital watermarking ). the document 12 can be a photo id ( e. g., a driver &# 39 ; s license, student id, identification document, employee badge, passport, etc. ), a value document ( e. g., a banknote, stock certificate, or other financial instrument ), a credit card, a product manual,

# 8. Saving the model locally


In [54]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "akhilsheri57/llama-1b-new"  # Use actual HF model or local path
save_dir = "llama-1b-fp16-vllm"

# Load model properly (if private, pass `token=your_token`)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, token=secret_value_0)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=secret_value_0)

# Save in vLLM-compatible format
model.save_pretrained(save_dir, safe_serialization=True)
tokenizer.save_pretrained(save_dir)


config.json:   0%|          | 0.00/981 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

('llama-1b-fp16-vllm/tokenizer_config.json',
 'llama-1b-fp16-vllm/special_tokens_map.json',
 'llama-1b-fp16-vllm/tokenizer.json')

In [55]:
from huggingface_hub import upload_folder

upload_folder(
    folder_path="llama-1b-fp16-vllm",
    repo_id="akhilsheri57/llama-1b-new",
    repo_type="model",
    token=secret_value_0
)
print("Model uploaded successfully!")

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Model uploaded successfully!


In [57]:
!pip uninstall -y vllm torch torchvision torchaudio
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install vllm

Found existing installation: vllm 0.7.1
Uninstalling vllm-0.7.1:
  Successfully uninstalled vllm-0.7.1
Found existing installation: torch 2.5.1
Uninstalling torch-2.5.1:
  Successfully uninstalled torch-2.5.1
Found existing installation: torchvision 0.20.1
Uninstalling torchvision-0.20.1:
  Successfully uninstalled torchvision-0.20.1
Found existing installation: torchaudio 2.5.1+cu121
Uninstalling torchaudio-2.5.1+cu121:
  Successfully uninstalled torchaudio-2.5.1+cu121
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch
  Downloading https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp310-cp310-linux_x86_64.whl (780.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m780.4/780.4 MB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[?25hCollecting torchvision
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.20.1%2Bcu121-cp310-cp310-linux_x86_64.whl (7.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [58]:
!export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
#!ls /usr/local/lib/python3.10/dist-packages/~orchvision.libs/

ls: cannot access '/usr/local/lib/python3.10/dist-packages/~orchvision.libs/': No such file or directory


In [60]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install vllm --force-reinstall

Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting vllm
  Using cached vllm-0.7.1-cp38-abi3-manylinux1_x86_64.whl.metadata (12 kB)
Collecting psutil (from vllm)
  Downloading psutil-6.1.1-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting sentencepiece (from vllm)
  Downloading sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting numpy<2.0.0 (from vllm)
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting requests>=2.26.0 (from vllm)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting tqdm (from vllm)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m57.7/57

In [61]:
import vllm

llm = vllm.LLM(model="akhilsheri57/llama-1b-new")
response = llm.generate("Hello, how are you?")
print(response)

INFO 02-05 05:04:37 config.py:526] This model supports multiple tasks: {'reward', 'embed', 'score', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 02-05 05:04:37 config.py:1538] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 02-05 05:04:37 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='akhilsheri57/llama-1b-new', speculative_config=None, tokenizer='akhilsheri57/llama-1b-new', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_

OSError: /usr/local/lib/python3.10/dist-packages/~orchvision.libs/libcudart.41118559.so.12 (deleted): cannot open shared object file: No such file or directory

In [51]:
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
model.push_to_hub_merged("akhilsheri57/llama-1b-new", tokenizer, save_method = "merged_16bit", token = secret_value_0)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 19.15 out of 31.35 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 46.00it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model.bin...
Done.
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 19.16 out of 31.35 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 47.83it/s]


Unsloth: Saving tokenizer...

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

 Done.
Unsloth: Saving /tmp/llama-1b-new/pytorch_model.bin...


pytorch_model.bin:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/akhilsheri57/llama-1b-new


In [None]:
# import torch
# # Convert model to FP16
# model = model.to(torch.float16)

In [23]:
model.save_pretrained("converted_model_fp16")
tokenizer.save_pretrained("converted_model_fp16")

('converted_model_fp16/tokenizer_config.json',
 'converted_model_fp16/special_tokens_map.json',
 'converted_model_fp16/tokenizer.json')

In [26]:
from huggingface_hub import HfApi, HfFolder, Repository, upload_folder
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

# Step 1: Set your Hugging Face credentials
hf_token = secret_value_0  # Replace with your token
repo_id = "akhilsheri57/llama-1b-fp16"  # Replace with your username & model name
local_model_path = "converted_model_fp16"  # Folder where your model is saved

# Step 2: Login (Store token in ~/.huggingface/token)
HfFolder.save_token(hf_token)
api = HfApi()

# Step 3: Check if the repo exists; if not, create it
try:
    api.repo_info(repo_id)  # Check if repo exists
    print(f"Repo {repo_id} already exists.")
except Exception:
    api.create_repo(repo_id, repo_type="model", exist_ok=True)
    print(f"Created new repo: {repo_id}")

# Step 4: Upload the model folder to Hugging Face
upload_folder(folder_path=local_model_path, repo_id=repo_id, repo_type="model")
print(f"Model uploaded successfully to {repo_id}")

Created new repo: akhilsheri57/llama-1b-fp16


adapter_model.safetensors:   0%|          | 0.00/22.6M [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Model uploaded successfully to akhilsheri57/llama-1b-fp16


In [28]:
!pip install vllm

Collecting vllm
  Downloading vllm-0.7.1-cp38-abi3-manylinux1_x86_64.whl.metadata (12 kB)
Collecting blake3 (from vllm)
  Downloading blake3-1.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting fastapi!=0.113.*,!=0.114.0,>=0.107.0 (from vllm)
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn[standard] (from vllm)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.0.2-py3-none-any.whl.metadata (13 kB)
Collecting lm-format-enforcer<0.11,>=0.10.9 (from vllm)
  Downloading lm_format_enforcer-0.10.9-py3-none-any.whl.metadata (17 kB)
Collecting outlines==0.1.11 (from vllm)
  Downloading outlines-0.1.11-py3-none-any.whl.metadata (17 kB)
Collecting lark==1.2.2 (from vllm)
  Downloading lark-1.2.2-py3-none-any.whl.metadata (1.8 kB)
Collecting xgrammar>=0.1.6 (from vllm)
  Downloading xgrammar-0.1.

In [29]:
import vllm

llm = vllm.LLM(model=repo_id)  # Load from Hugging Face
response = llm.generate("Hello, how are you?")
print(response)

INFO 02-05 02:45:58 __init__.py:183] Automatically detected platform cuda.


ValueError: No supported config format found in akhilsheri57/llama-1b-fp16

In [30]:
from huggingface_hub import snapshot_download

model_id = "akhilsheri57/llama-1b-fp16"
snapshot_download(repo_id=model_id, repo_type="model")

Fetching 7 files:   0%|          | 0/7 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/5.11k [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/808 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/22.6M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

'/root/.cache/huggingface/hub/models--akhilsheri57--llama-1b-fp16/snapshots/3d8021e57657244ed5dd0ccb4cc2a2d954f51974'

In [32]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "akhilsheri57/llama-1b-fp16"  # Original model (if in 4-bit)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save in FP16 format
model.save_pretrained("llama-1b-fp16")
tokenizer.save_pretrained("llama-1b-fp16")

`low_cpu_mem_usage` was None, now default to True since model is quantized.


('llama-1b-fp16/tokenizer_config.json',
 'llama-1b-fp16/special_tokens_map.json',
 'llama-1b-fp16/tokenizer.json')

In [None]:
new_model_local = "Llama-1B-cctv"
model.save_pretrained(new_model_local)
tokenizer.save_pretrained(new_model_local)

model.save_pretrained_merged(new_model_local, tokenizer) #, save_method = "merged_16bit",)

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 1.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.12 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 29.20it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving Llama-1B-cctv/pytorch_model.bin...
Done.


#9. Pushing the model to Hugging Face Hub


In [None]:
new_model_online = "akhilsheri57/Llama-1B-cctv"
model.push_to_hub(new_model_online)
tokenizer.push_to_hub(new_model_online)

model.push_to_hub_merged(new_model_online, tokenizer) #, save_method = "merged_16bit")

README.md:   0%|          | 0.00/599 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.54k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/45.1M [00:00<?, ?B/s]

Saved model to https://huggingface.co/akhilsheri57/Llama-1B-cctv


tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Unsloth: You are pushing to hub, but you passed your HF username = akhilsheri57.
We shall truncate akhilsheri57/Llama-1B-cctv to Llama-1B-cctv
Unsloth: Will remove a cached repo with size 1.5K


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.11 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 16/16 [00:00<00:00, 45.99it/s]


Unsloth: Saving tokenizer...

No files have been modified since last commit. Skipping to prevent empty commit.


 Done.
Unsloth: Saving Llama-1B-cctv/pytorch_model.bin...


pytorch_model.bin:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/akhilsheri57/Llama-1B-cctv


In [None]:
!pip install vllm
!pip install pyngrok

Collecting pyngrok
  Downloading pyngrok-7.2.3-py3-none-any.whl.metadata (8.7 kB)
Downloading pyngrok-7.2.3-py3-none-any.whl (23 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.3


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(repo_id="akhilsheri57/Llama-1B-cctv", local_dir="fine_tuned_model")



Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/981 [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/808 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/605 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/45.1M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

'/content/fine_tuned_model'

In [None]:
!pip uninstall -y autoawq
!pip install -q --no-cache-dir autoawq

Found existing installation: autoawq 0.2.8
Uninstalling autoawq-0.2.8:
  Successfully uninstalled autoawq-0.2.8
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.6/71.6 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for autoawq (setup.py) ... [?25l[?25hdone


In [None]:
from transformers import AutoTokenizer
from autoawq import AutoAWQForCausalLM

model_name = "/content/drive/MyDrive/fine_tuned_model"

# Load and quantize the model
model = AutoAWQForCausalLM.from_pretrained(
    model_name,
    fuse_layers=True,
    device_map="auto"  # Automatically assigns to GPU if available
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Save the quantized model
quantized_model_dir = "/content/drive/MyDrive/fine_tuned_model_awq"
model.save_awq(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)

print("✅ AWQ quantization complete. Model saved to:", quantized_model_dir)

ModuleNotFoundError: No module named 'autoawq'

In [None]:
!python -m vllm.entrypoints.api_server \
    --model fine_tuned_model \
    --tensor-parallel-size 1 \
    --quantization awq \
    --port 9000 &

2025-02-04 22:32:52.438040: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1738708372.458994   26285 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1738708372.465470   26285 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-04 22:32:52.487192: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO 02-04 22:32:59 __init__.py:183] Automatically detected platform cuda.
INFO 02-04 22:33:00 api_server.py:119] vLL

In [None]:
from pyngrok import ngrok

# Start ngrok to expose the API server
public_url = ngrok.connect(9000)
print("Public vLLM API URL:", public_url)



ERROR:pyngrok.process.ngrok:t=2025-02-04T22:07:00+0000 lvl=eror msg="failed to reconnect session" obj=tunnels.session err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n"
ERROR:pyngrok.process.ngrok:t=2025-02-04T22:07:00+0000 lvl=eror msg="session closing" obj=tunnels.session err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n"
ERROR:pyngrok.process.ngrok:t=2025-02-04T22:07:00+0000 lvl=eror msg="terminating with error" obj=app err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your aut

PyngrokNgrokError: The ngrok process errored on start: authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n.