### **Breakdown of Key Libraries**

1️⃣ **Accelerate**
   - Hugging Face's library for optimizing multi-GPU, TPU, and distributed training.
   - Essential when using FSDP or DeepSpeed.

2️⃣ **PEFT (Parameter-Efficient Fine-Tuning)**
   - Enables memory-efficient tuning methods like LoRA, QLoRA, and adapters.
   - Replaces full fine-tuning with lightweight, focused updates.

3️⃣ **Bitsandbytes**
   - Supports 8-bit and 4-bit quantization.
   - Critical for reducing VRAM usage in QLoRA fine-tuning.

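4️⃣ **Transformers**
   - Hugging Face's core library for loading models and tokenizers.
   - The install command below pulls the PyPI release. Installing from GitHub (`pip install git+https://github.com/huggingface/transformers`) is only needed for models or features not yet available on PyPI.

5️⃣ **TRL (Transformer Reinforcement Learning)**
   - Designed for post-training methods such as supervised fine-tuning and Reinforcement Learning from Human Feedback (RLHF), the techniques used to train ChatGPT-like models.
   - Provides `SFTTrainer`, the supervised fine-tuning trainer used in this notebook.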

6️⃣ **Py7zr**
   - Handles extraction of 7z-format compressed files.
   - Useful for datasets downloaded from Hugging Face or other sources.

7️⃣ **Auto-GPTQ**
   - Implements GPTQ-based quantization for faster inference and improved VRAM efficiency.

8️⃣ **Optimum**
   - Hugging Face's library for hardware optimizations (ONNX, TensorRT, Habana Gaudi).
   - Ideal for accelerated inference and optimized training.
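
To make the VRAM argument for quantization concrete, here is a rough back-of-the-envelope estimate of weight memory at different precisions. It is only a sketch: the 2.7 billion parameter count is the nominal size of Phi-2 (an assumption, not something measured in this notebook), `weight_memory_gib` is a hypothetical helper, and the estimate ignores activations, optimizer state, and quantization overhead.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

# Assumption: ~2.7B parameters, the nominal size of Phi-2.
NUM_PARAMS = 2.7e9

for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Going from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is what makes a model of this size comfortable to fine-tune on a single consumer GPU.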

In [1]:
!pip install accelerate peft bitsandbytes transformers trl py7zr auto-gptq optimum

Collecting bitsandbytes
  Downloading bitsandbytes-0.46.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting trl
  Downloading trl-0.20.0-py3-none-any.whl.metadata (11 kB)
Collecting py7zr
  Downloading py7zr-1.0.0-py3-none-any.whl.metadata (17 kB)
Collecting auto-gptq
  Downloading auto_gptq-0.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting optimum
  Downloading optimum-1.27.0-py3-none-any.whl.metadata (16 kB)
Collecting texttable (from py7zr)
  Downloading texttable-1.7.0-py2.py3-none-any.whl.metadata (9.8 kB)
Collecting pyzstd>=0.16.1 (from py7zr)
  Downloading pyzstd-0.17.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.5 kB)
Collecting pyppmd<1.3.0,>=1.1.0 (from py7zr)
  Downloading pyppmd-1.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.4 kB)
Collecting pybcj<1.1.0,>=1.0.0 (from py7zr)
  Downloading pybcj-1.0.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metad

In [2]:
import torch
from datasets import load_dataset, Dataset
from peft import LoraConfig, AutoPeftModelForCausalLM, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig, TrainingArguments
from trl import SFTTrainer
import os

### Samsum Dataset Overview

**Use case:** Dialogue summarization (WhatsApp-style)

- `dialogue`: A conversation between people.
- `summary`: A short summary of that conversation.

In [3]:
data = load_dataset("knkarthick/samsum")

Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

In [4]:
data_df = data['train'].to_pandas()
data_df.head()

|   | id | dialogue | summary |
|---|---|---|---|
| 0 | 13818513 | Amanda: I baked cookies. Do you want some?\nJ... | Amanda baked cookies and will bring Jerry some... |
| 1 | 13728867 | Olivia: Who are you voting for in this electio... | Olivia and Olivier are voting for liberals in ... |
| 2 | 13681000 | Tim: Hi, what's up?\nKim: Bad mood tbh, I was ... | Kim may try the pomodoro technique recommended... |
| 3 | 13730747 | Edward: Rachel, I think I'm in ove with Bella.... | Edward thinks he is in love with Bella. Rachel... |
| 4 | 13728094 | Sam: hey overheard rick say something\nSam: i... | Sam is confused, because he overheard Rick com... |


In [5]:
data_df["text"] = data_df[["dialogue", "summary"]].fillna("").apply(
    lambda x: "###Human: Summarize this following dialogue: " + x["dialogue"] + "\n###Assistant: " +x["summary"], axis=1)

In [6]:
data_df.head(1)['text'].to_list()

["###Human: Summarize this following dialogue: Amanda: I baked  cookies. Do you want some?\nJerry: Sure!\nAmanda: I'll bring you tomorrow :-)\n###Assistant: Amanda baked cookies and will bring Jerry some tomorrow."]

In [7]:
data = Dataset.from_pandas(data_df)
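
The template used to build the `text` column can be wrapped in a small helper so that training text and inference prompts are guaranteed to share the same format. This is a sketch, not code from the original notebook: `build_prompt` and the two constants are hypothetical names, and the example dialogue is made up.

```python
from typing import Optional

PROMPT_PREFIX = "###Human: Summarize this following dialogue: "
RESPONSE_PREFIX = "\n###Assistant: "

def build_prompt(dialogue: str, summary: Optional[str] = None) -> str:
    """Format a dialogue in the notebook's ###Human / ###Assistant template.

    With a summary, this reproduces the training text; without one, the
    prompt ends right after the assistant tag so the model can complete it.
    """
    prompt = PROMPT_PREFIX + dialogue + RESPONSE_PREFIX
    return prompt if summary is None else prompt + summary

example_dialogue = "Ann: Lunch at noon?\nBob: Sure!"  # made-up example
print(build_prompt(example_dialogue, "Ann and Bob will have lunch at noon."))
print(repr(build_prompt(example_dialogue)))
```

Calling the helper without a summary gives the prompt shape needed at inference time, where the model is expected to write everything after `###Assistant: `.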

### Load Tokenizer

In [8]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")  # alternative: "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"

### Load Model with bitsandbytes

In [9]:
from transformers import BitsAndBytesConfig

In [10]:
# Configure 4-bit quantization: NF4 weights, double quantization, float16 compute
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

In [11]:
# Load model
model = AutoModelForCausalLM.from_pretrained(
    # "mistralai/Mistral-7B-Instruct-v0.1",
    "microsoft/phi-2",
    quantization_config = quantization_config
)

config.json:   0%|          | 0.00/735 [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [12]:
model

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (dense): Linear4bit(in_features=2560, out_features=2560, bias=True)
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (rotary_emb): PhiRotaryEmbedding()
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (final_layernorm): 

In [13]:
# Disable caching of past key/value states in the attention layers.
# The cache only speeds up generation; it is not needed during training and conflicts with gradient checkpointing.
model.config.use_cache = False

# Tensor-parallel setting carried over from Llama-style configs; 1 means no tensor-parallel emulation on a single GPU.
# It likely has no effect on Phi-2.
model.config.pretraining_tp = 1

# Gradient checkpointing: only selected layer outputs are stored and the rest are recomputed
# during the backward pass, which saves memory while training bigger models at the cost of extra compute.
model.gradient_checkpointing_enable()

# Prepare the model for k-bit (quantized) fine-tuning. This helper comes with the PEFT library.
model = prepare_model_for_kbit_training(model)



## 🔧 `LoraConfig(...)`

This creates a configuration object that tells PEFT **how to inject and train LoRA layers** inside your model.

### 🧠 LoRA in short:

Instead of fine-tuning **all parameters** of a large model (millions or billions), LoRA:

* **Freezes the original model**
* **Adds a few trainable low-rank matrices (A, B)** to selected layers (here, the attention projections)
* **Trains only these new small matrices** → huge memory savings

---

### 🔍 Explanation of Each Parameter:

```python
LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"]
)
```

---

### 📌 `r=16`

* This is the **rank** of the low-rank matrices (`A` and `B`) LoRA adds.
* Smaller `r` → fewer trainable parameters (but maybe less capacity)
* `r=16` is a good balance for many tasks.

🧠 In practice: A weight `W` (say, 4096×4096) gets two matrices:

```
W ≈ W_frozen + A(4096×16) @ B(16×4096)
```
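
The saving from a low rank is easy to quantify. The following sketch counts parameters for the 4096×4096 example above at a few ranks; `lora_param_count` is a hypothetical helper, not part of PEFT.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two LoRA factors: A is (d_in x r), B is (r x d_out)."""
    return d_in * r + r * d_out

d_in = d_out = 4096
full = d_in * d_out                      # parameters in the frozen weight W
for r in (4, 16, 64):
    lora = lora_param_count(d_in, d_out, r)
    print(f"r={r:>2}: {lora:>9,} trainable vs {full:,} frozen ({lora / full:.2%})")
```

At `r=16` the two factors together hold 131,072 parameters, under 1% of the 16,777,216 in the frozen matrix.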

---

### 📌 `lora_alpha=16`

* This scales the LoRA output:
  **LoRA\_output = (A @ B) \* (alpha / r)**
* Think of it as a **scaling factor to control update strength**.
* Often set equal to `r`, but can be higher/lower to tweak training dynamics.
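
Here is a small pure-Python sketch of how that scaling enters the forward pass. The matrices are tiny made-up values, and `matmul` and `lora_forward` are illustrative helpers; PEFT's real implementation works on tensors and also applies dropout.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_forward(x, w_frozen, a, b, alpha, r):
    """Compute x @ W_frozen + (alpha / r) * (x @ A @ B)."""
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, a), b)
    scale = alpha / r
    return [[bv + scale * uv for bv, uv in zip(brow, urow)]
            for brow, urow in zip(base, update)]

x = [[1.0, 2.0]]                      # one input row with 2 features
w_frozen = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight (identity)
a = [[0.5], [0.5]]                    # A: 2 x r, with r = 1
b = [[1.0, -1.0]]                     # B: r x 2

print(lora_forward(x, w_frozen, a, b, alpha=1, r=1))  # scale = 1 -> [[2.5, 0.5]]
print(lora_forward(x, w_frozen, a, b, alpha=2, r=1))  # scale = 2 -> [[4.0, -1.0]]
```

With `lora_alpha` equal to `r`, as in this notebook, the scale is exactly 1 and the low-rank update is added unchanged.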

---

### 📌 `lora_dropout=0.05`

* Applies dropout **only to LoRA layers during training**.
* Helps prevent overfitting when training these small injected layers.

---

### 📌 `bias="none"`

* This tells PEFT **whether to train bias terms**.
  Options:

  * `"none"`: Don't train any biases (common)
  * `"all"`: Train all bias parameters
  * `"lora_only"`: Train bias **only in LoRA-injected modules**

✅ `"none"` is safest and most memory-efficient.

---

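### 📌 `task_type="CAUSAL_LM"`

* Tells PEFT what kind of task this is:

  * `"CAUSAL_LM"`: Left-to-right generation (e.g., GPT models)
  * `"SEQ_CLS"`: Classification
  * `"TOKEN_CLS"`: Token classification (NER, etc.)
  * `"SEQ_2_SEQ_LM"`: Translation/summarization with encoder-decoder models (e.g., T5, BART)

PEFT uses this value to choose the matching wrapper class (for example, `PeftModelForCausalLM` for `"CAUSAL_LM"`). Phi-2 is a decoder-only model, so `"CAUSAL_LM"` is the right choice here even though the task is summarization.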

---

### 📌 `target_modules=["q_proj", "v_proj"]`

* Specifies **which submodules of your transformer to inject LoRA into**.
* `"q_proj"` and `"v_proj"` = **query and value projection** in attention layers.

Module names depend on the architecture. For Phi-2 you could also inject into:

* `k_proj` (key),
* `dense` (final attention output; called `out_proj` or `o_proj` in other architectures),
* `fc1`, `fc2` (MLP layers), etc.

Choosing `["q_proj", "v_proj"]` is popular and **saves memory while still being effective**.
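
As a sanity check on the memory claim, the sketch below counts the LoRA parameters this choice adds to Phi-2, using the shapes from the model printout earlier in the notebook (32 decoder layers, 2560-to-2560 projections). The constant and function names are illustrative only, and the size estimate assumes the adapter weights are stored as float32.

```python
NUM_LAYERS = 32          # from the model printout: (0-31): 32 x PhiDecoderLayer
HIDDEN = 2560            # in_features / out_features of q_proj and v_proj
RANK = 16                # r in LoraConfig

def lora_params(target_modules, rank=RANK):
    """Trainable LoRA parameters when every target is a HIDDEN x HIDDEN layer."""
    per_module = HIDDEN * rank + rank * HIDDEN   # lora_A plus lora_B
    return NUM_LAYERS * len(target_modules) * per_module

n = lora_params(["q_proj", "v_proj"])
print(f"{n:,} trainable parameters")                  # 5,242,880
print(f"~{n * 4 / 1e6:.1f} MB if stored as float32")  # ~21.0 MB
```

That estimate is consistent with the 21.0MB `adapter_model.safetensors` file uploaded later in the notebook. Adding `k_proj` and `dense` to the targets would double the count under the same assumptions.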

---

## ✅ Summary

This config says:

> “Inject small trainable LoRA adapters into the **query and value projections** of my model, using rank-16 matrices with `lora_alpha=16` (a scaling factor of alpha / r = 1), apply 5% dropout, and **train only those** adapters, not the original model.”

---


In [14]:
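peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    # Phi-2 is a decoder-only model, so the task type should be "CAUSAL_LM".
    # The recorded run used "SEQ_2_SEQ_LM", which is why the outputs below
    # show PeftModelForSeq2SeqLM.
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"]
)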

In [15]:
model = get_peft_model(
    model,
    peft_config
  )

In [16]:
print(model)

PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): PhiForCausalLM(
      (model): PhiModel(
        (embed_tokens): Embedding(51200, 2560)
        (layers): ModuleList(
          (0-31): 32 x PhiDecoderLayer(
            (self_attn): PhiAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2560, out_features=2560, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2560, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2560, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): Linear4bit(i

### Training Step

In [17]:
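training_arguments = TrainingArguments(
    output_dir="finetuned-phi2-model-samsum",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",   # paged AdamW optimizer from bitsandbytes
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    save_strategy="epoch",
    logging_steps=100,
    num_train_epochs=1,          # ignored here: a positive max_steps takes precedence
    max_steps=250,               # stop after 250 optimizer steps
    fp16=True,
    push_to_hub=True,            # requires a Hugging Face login with write access
    report_to="none",
)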

In [18]:
data[0]

{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\nJerry: Sure!\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.',
 'text': "###Human: Summarize this following dialogue: Amanda: I baked  cookies. Do you want some?\nJerry: Sure!\nAmanda: I'll bring you tomorrow :-)\n###Assistant: Amanda baked cookies and will bring Jerry some tomorrow."}

In [19]:
# Create the SFTTrainer
trainer = SFTTrainer(
        model=model,
        train_dataset=data,
        peft_config=peft_config,
        args=training_arguments,
        processing_class=tokenizer,
    )



Adding EOS to train dataset:   0%|          | 0/14732 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/14732 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/14732 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForSeq2SeqLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [20]:
trainer.train()

  return fn(*args, **kwargs)


Step,Training Loss
100,2.2826
200,2.1398


TrainOutput(global_step=250, training_loss=2.1940028076171876, metrics={'train_runtime': 1128.7902, 'train_samples_per_second': 1.772, 'train_steps_per_second': 0.221, 'total_flos': 1.205661060882432e+16, 'train_loss': 2.1940028076171876})
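
The `TrainOutput` figures can be cross-checked with simple arithmetic. This sketch assumes a single GPU, so that the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps`; all numbers are copied from the cells above.

```python
max_steps = 250
per_device_batch = 8
grad_accum = 1
train_examples = 14732        # size of the SAMSum train split
train_runtime_s = 1128.7902   # train_runtime from TrainOutput

examples_seen = max_steps * per_device_batch * grad_accum
print(f"examples seen:        {examples_seen}")
print(f"fraction of an epoch: {examples_seen / train_examples:.3f}")
print(f"samples/second:       {examples_seen / train_runtime_s:.3f}")
print(f"steps/second:         {max_steps / train_runtime_s:.3f}")
```

The run therefore covers roughly 14% of one epoch, and the computed throughput matches the reported `train_samples_per_second` and `train_steps_per_second`.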

In [21]:
trainer.model.push_to_hub("gaurav98095/finetuned-phi2")
tokenizer.push_to_hub("gaurav98095/finetuned-phi2")

README.md: 0.00B [00:00, ?B/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ...pmji0f_mu/adapter_model.safetensors: 100%|##########| 21.0MB / 21.0MB            

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/gaurav98095/finetuned-phi2/commit/c69ef6d1b05ea484a047e03d0ce9e25c49641bd1', commit_message='Upload tokenizer', commit_description='', oid='c69ef6d1b05ea484a047e03d0ce9e25c49641bd1', pr_url=None, repo_url=RepoUrl('https://huggingface.co/gaurav98095/finetuned-phi2', endpoint='https://huggingface.co', repo_type='model', repo_id='gaurav98095/finetuned-phi2'), pr_revision=None, pr_num=None)

### Load Trained Models

In [None]:
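from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gaurav98095/finetuned-phi2"  # the repo must be public, or you must be logged in
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The repo contains only the LoRA adapter. With `peft` installed, transformers
# loads the base model named in the adapter config and attaches the adapter.
model = AutoModelForCausalLM.from_pretrained(model_id)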

In [24]:
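from transformers import GenerationConfig
generation_config = GenerationConfig(
    do_sample=True,
    top_k=1,              # only the most likely token is kept, so output is effectively greedy
    temperature=0.1,
    max_new_tokens=25,    # cap on generated tokens; longer summaries get cut off
    pad_token_id=tokenizer.eos_token_id
)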
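
With `top_k=1`, only the single most likely token survives filtering, so sampling has nothing left to choose from and the result is the same as greedy decoding regardless of temperature. Here is a toy illustration with made-up logits; `sample_top_k` is a simplified stand-in written for this example, not the transformers implementation.

```python
import math
import random

def sample_top_k(logits, top_k, temperature, rng):
    """Sample a token index after temperature scaling and top-k filtering."""
    scaled = [l / temperature for l in logits]
    # Keep the indices of the top_k largest logits.
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax weights over the surviving logits (shifted by the max for stability).
    m = max(scaled[i] for i in keep)
    weights = [math.exp(scaled[i] - m) for i in keep]
    return rng.choices(keep, weights=weights, k=1)[0]

logits = [1.2, 3.4, 0.7, 2.9]        # made-up scores for a 4-token vocabulary
rng = random.Random(0)

greedy = max(range(len(logits)), key=lambda i: logits[i])
samples = {sample_top_k(logits, top_k=1, temperature=0.1, rng=rng) for _ in range(100)}
print(greedy, samples)               # every sample is the greedy index
```

To get genuinely varied summaries you would raise `top_k` (or use `top_p`) and the temperature; to get deterministic output, `do_sample=False` expresses the same intent more directly.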

In [25]:
# Move the inputs to whichever device the model is on.
inputs = tokenizer("""
###Human: Summarize this following dialogue: John: I'm at the railway station in Newyork Paul: No problems so far? John: no, everything's going smoothly Paul: good. lets meet there soon!
###Assistant: """, return_tensors="pt").to(model.device)


In [None]:
outputs = model.generate(**inputs, generation_config=generation_config)

In [31]:
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

John is at the New York railway station, and everything is going smoothly. Paul is glad and plans to meet him there soon.
