<a href="https://colab.research.google.com/github/alishaarora56/greenai-lora-llama-finetune/blob/main/GreenAI_Build_Day_1_Fine_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook fine-tunes the LLaMA-3 8B model on the Alpaca-cleaned dataset using Low-Rank Adaptation (LoRA) and 4-bit quantization. Training only small low-rank adapter matrices on top of a quantized base model reduces memory consumption and training cost relative to full fine-tuning, with the aim of preserving task performance, and shows how lightweight updates can be applied to large language models on a single Tesla T4. Carbon emissions during inference are measured with codecarbon to support sustainable AI development practices.

In [None]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass

In [None]:
!pip install codecarbon==2.6.0

Collecting codecarbon==2.6.0
  Downloading codecarbon-2.6.0-py3-none-any.whl.metadata (8.5 kB)
Collecting arrow (from codecarbon==2.6.0)
  Downloading arrow-1.3.0-py3-none-any.whl.metadata (7.5 kB)
Collecting pynvml (from codecarbon==2.6.0)
  Downloading pynvml-12.0.0-py3-none-any.whl.metadata (5.4 kB)
Collecting questionary (from codecarbon==2.6.0)
  Downloading questionary-2.1.0-py3-none-any.whl.metadata (5.4 kB)
Collecting rapidfuzz (from codecarbon==2.6.0)
  Downloading rapidfuzz-3.12.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting types-python-dateutil>=2.8.10 (from arrow->codecarbon==2.6.0)
  Downloading types_python_dateutil-2.9.0.20241206-py3-none-any.whl.metadata (2.1 kB)
Collecting nvidia-ml-py<13.0.0a0,>=12.0.0 (from pynvml->codecarbon==2.6.0)
  Downloading nvidia_ml_py-12.570.86-py3-none-any.whl.metadata (8.7 kB)
Downloading codecarbon-2.6.0-py3-none-any.whl (499 kB)

In [None]:
!pip install xformers==0.0.28.post3

Collecting xformers==0.0.28.post3
  Downloading xformers-0.0.28.post3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.28.post3-cp311-cp311-manylinux_2_28_x86_64.whl (16.7 MB)
Installing collected packages: xformers
  Attempting uninstall: xformers
    Found existing installation: xformers 0.0.29.post2
    Uninstalling xformers-0.0.29.post2:
      Successfully uninstalled xformers-0.0.29.post2
Successfully installed xformers-0.0.28.post3


In [None]:
from unsloth import FastLanguageModel
import torch
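max_seq_length = 2048 # Choose any! Llama 3 supports up to 8k
dtype = None # None = auto-detect (float16 on T4/V100, bfloat16 on Ampere+)
load_in_4bit = True # Use 4-bit quantization to reduce memory usage. Can be False.

# Pre-quantized 4-bit checkpoints published by Unsloth (for reference; only the last one is used below)
fourbit_models = [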
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # Llama-3 70b also works (just change the model name)
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

Unsloth: Patching Xformers to fix some performance issues.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.4: Fast Llama patching. Transformers: 4.48.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/198 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/350 [00:00<?, ?B/s]
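To see why `load_in_4bit = True` matters on the Tesla T4 reported above (14.741 GB), here is a rough back-of-the-envelope sketch of weight memory at two precisions. The parameter count of about 8.03 billion is an assumption taken from the public Llama-3 8B model card, and the estimate deliberately ignores quantization constants, any layers kept in 16-bit, activations, and optimizer state, so real usage will be higher.

```python
# Rough weight-memory estimate; ignores quantization metadata, activations,
# and any layers that stay in 16-bit.
N_PARAMS = 8.03e9          # assumed parameter count for Llama-3 8B
GPU_MEMORY_GB = 14.741     # Tesla T4 capacity reported by Unsloth above


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Return the storage needed for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"fp16 weights : {fp16_gb:.2f} GB (fits on T4: {fp16_gb < GPU_MEMORY_GB})")
print(f"4-bit weights: {int4_gb:.2f} GB (fits on T4: {int4_gb < GPU_MEMORY_GB})")
```

The 5.70G `model.safetensors` download shown above is larger than the idealized 4-bit figure, which is consistent with the overheads this estimate leaves out.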

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, #  any number > 0
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth: You added custom modules, but Unsloth hasn't optimized for this.
Beware - your finetuning might be noticeably slower!


Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.2.4 patched 32 layers with 32 QKV layers, 32 O layers and 0 MLP layers.
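As a rough illustration of how few weights LoRA actually trains, the sketch below counts adapter parameters when rank-16 adapters are attached to all attention and MLP projections. Each adapted linear layer of shape (in, out) gains two matrices, A (r x in) and B (out x r), so it adds r * (in + out) parameters. The layer shapes are assumptions taken from the public Llama-3 8B configuration, not values read from this notebook's model object.

```python
# Count LoRA parameters: each adapted Linear(in, out) gains A (r x in) and B (out x r).
# Shapes below are assumptions from the public Llama-3 8B config.
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 head_dim (grouped-query attention)
INTERMEDIATE = 14336
NUM_LAYERS = 32
RANK = 16              # matches r = 16 above

# (in_features, out_features) per target module
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(shapes, rank, num_layers):
    """Total adapter parameters across all layers for the given module shapes."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


total = lora_params(MODULE_SHAPES, RANK, NUM_LAYERS)
print(f"Trainable LoRA parameters: {total:,}")
```

Under these assumptions the adapters amount to roughly half a percent of an 8-billion-parameter model. Dropping a module from `MODULE_SHAPES` shows how much each projection contributes.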


In [None]:
# Data prep
from datasets import load_dataset
from transformers import TrainingArguments, Trainer

dataset = load_dataset("yahma/alpaca-cleaned")

dataset = dataset["train"].shuffle(seed=42)
train_size = int(0.9 * len(dataset))
train_dataset = dataset.select(range(train_size))
eval_dataset = dataset.select(range(train_size, len(dataset)))

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

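def format_example(example):
    instruction = example["instruction"]
    input_text = example["input"]
    output_text = example["output"]

    # Append EOS so the model learns where a response ends; without it,
    # generation tends to run on until the token limit.
    prompt = alpaca_prompt.format(instruction, input_text, output_text) + tokenizer.eos_token
    return {"text": prompt}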

# Apply the prompt template to both the train and eval splits
train_dataset = train_dataset.map(format_example)
eval_dataset = eval_dataset.map(format_example)

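# Tokenize the data with the tokenizer returned by FastLanguageModel.from_pretrained
def tokenize_function(examples):
    # Truncate to the configured context window. Padding is left to the data
    # collator so short examples are not padded out to the full window.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
    )

# Drop the raw string columns so only input_ids / attention_mask remain
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=train_dataset.column_names)
tokenized_eval_dataset = eval_dataset.map(tokenize_function, batched=True, remove_columns=eval_dataset.column_names)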

Map:   0%|          | 0/46584 [00:00<?, ? examples/s]
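How examples are padded has a large effect on compute, because most Alpaca-style examples are far shorter than a 2048-token window. The sketch below compares padding every example to the full window against padding each batch only to its longest member. The sequence lengths are made-up illustrative values, not statistics measured on this dataset.

```python
# Compare total tokens processed under fixed vs. dynamic padding.
# The sequence lengths are made-up illustrative values, not dataset statistics.
MAX_LENGTH = 2048
BATCH_SIZE = 2
lengths = [120, 340, 95, 610, 210, 180]  # hypothetical token counts per example


def fixed_padding_tokens(lengths, max_length):
    """Every example is padded (or truncated) to max_length."""
    return len(lengths) * max_length


def dynamic_padding_tokens(lengths, batch_size, max_length):
    """Each batch is padded only to its longest (truncated) member."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = [min(n, max_length) for n in lengths[start:start + batch_size]]
        total += max(batch) * len(batch)
    return total


fixed = fixed_padding_tokens(lengths, MAX_LENGTH)
dynamic = dynamic_padding_tokens(lengths, BATCH_SIZE, MAX_LENGTH)
print(f"fixed: {fixed} tokens, dynamic: {dynamic} tokens")
```

For these hypothetical lengths, fixed padding processes several times more tokens than dynamic padding, and almost all of the extra tokens are padding.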

In [None]:
#Model Training
training_args = TrainingArguments(
    output_dir="./fine_tuned_model_outputs",
    per_device_train_batch_size=2,
    num_train_epochs=4,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    learning_rate=2e-4,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    save_total_limit=2,
    fp16=True,
)

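# Define the trainer. The collator pads each batch and copies input_ids into
# labels (padding masked with -100) so the causal-LM loss can be computed.
from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,  # LoRA-wrapped model from get_peft_model
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)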

trainer.train()
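The training arguments above imply a particular schedule, which is worth working out before launching a long run. This sketch derives the effective batch size and an approximate number of optimizer steps. It assumes a single GPU and uses the 46584 training examples shown in the data-prep output; the Trainer's own rounding may differ slightly from this integer division.

```python
# Training-schedule arithmetic for the TrainingArguments above (single GPU assumed).
TRAIN_EXAMPLES = 46584   # from the Map progress bar in the data-prep cell
PER_DEVICE_BATCH = 2     # per_device_train_batch_size
GRAD_ACCUM_STEPS = 4     # gradient_accumulation_steps
EPOCHS = 4               # num_train_epochs
NUM_GPUS = 1             # assumption: one Tesla T4

# Examples consumed per optimizer update
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS

# Approximate optimizer updates per epoch and overall
steps_per_epoch = TRAIN_EXAMPLES // effective_batch
total_steps = steps_per_epoch * EPOCHS

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"optimizer steps in total: {total_steps}")
```

With `warmup_steps=5`, warmup covers only a tiny fraction of a schedule this long, so the learning rate reaches 2e-4 almost immediately.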

In [None]:
# Optional GGUF export to ./model in f16 (disabled; change the condition to True to run it)
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16") # Local saving

In [None]:
from codecarbon import EmissionsTracker

def load_model(model_name, optimized=False):
    """Load the checkpoint in ./model if `optimized`, otherwise the base model.

    `model_name` does not select the checkpoint; `optimized` does.
    """
    from unsloth import FastLanguageModel
    # The base checkpoint must be the one used for training
    path = "./model" if optimized else "unsloth/llama-3-8b-bnb-4bit"
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = path,
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )

    return model, tokenizer


def run_inference(model, tokenizer, instruction, input_text, max_length=50):
    """
    Run inference on the given prompt and return the generated text.

    `max_length` caps the number of newly generated tokens.
    """
    FastLanguageModel.for_inference(model)
    inputs = tokenizer(
    [
        alpaca_prompt.format(
            instruction, # instruction
            input_text, # input
            "", # response left empty for the model to complete
        )
    ], return_tensors = "pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens = max_length, use_cache = True)
    return tokenizer.batch_decode(outputs, skip_special_tokens=False)


def calculate_emissions(model_name, model, tokenizer, instruction, input_text, max_length=50, optimized=False):
    """
    Calculate carbon emissions for running inference with a specific model.
    """
    tracker = EmissionsTracker(project_name=f"{model_name}", allow_multiple_runs=True)
    tracker.start()

    # Run inference
    print("Running inference...")
    output = run_inference(model, tokenizer, instruction, input_text, max_length)
    print(f"Generated Text: {output}")

    # Stop tracking and get emissions
    emissions = tracker.stop()
    print(f"Carbon emissions (kgCO2eq): {emissions}")
    return emissions
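Because each tracker is tagged with a `project_name`, runs for different models can be compared afterwards from the CSV file that codecarbon writes (the log output in this notebook shows it saving to `/content/emissions.csv`). The following is a minimal sketch of such a comparison using only the standard library. The column names `project_name`, `emissions`, and `energy_consumed` are assumptions about the CSV layout and should be checked against the actual file; the demo rows are synthetic, not measured results.

```python
import csv
import os
import tempfile
from collections import defaultdict


def summarize_emissions(csv_path):
    """Sum emissions (kg CO2eq) and energy (kWh) per project_name."""
    totals = defaultdict(lambda: {"emissions": 0.0, "energy_consumed": 0.0, "runs": 0})
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            entry = totals[row["project_name"]]
            entry["emissions"] += float(row["emissions"])
            entry["energy_consumed"] += float(row["energy_consumed"])
            entry["runs"] += 1
    return dict(totals)


# Demo on a tiny synthetic file; the numbers are placeholders, not measurements.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "emissions.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["project_name", "emissions", "energy_consumed"]
        )
        writer.writeheader()
        writer.writerow({"project_name": "base", "emissions": 2e-06, "energy_consumed": 1e-05})
        writer.writerow({"project_name": "base", "emissions": 3e-06, "energy_consumed": 2e-05})
        writer.writerow({"project_name": "fine-tuned", "emissions": 4e-06, "energy_consumed": 3e-05})
    summary = summarize_emissions(path)

for project, stats in summary.items():
    print(project, stats)
```

In the notebook itself, `summarize_emissions("/content/emissions.csv")` would be the natural call once a few tracked runs have completed.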

In [None]:
#Saving the model locally
save_directory = "./fine_tuned_model"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
print(f"Model and tokenizer saved to: {save_directory}")

loaded_model, loaded_tokenizer = FastLanguageModel.from_pretrained(
    model_name=save_directory,
    max_seq_length=2048,
    load_in_4bit=True,  # memory efficiency
)

Model and tokenizer saved to: ./fine_tuned_model
==((====))==  Unsloth 2025.2.4: Fast Llama patching. Transformers: 4.48.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
#Inference
FastLanguageModel.for_inference(loaded_model)

def run_inference(model, tokenizer, instruction, input_text, max_length=50):
    prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}

### Response:
"""
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

#Calculating emissions during inference
def calculate_emissions(model, tokenizer, instruction, input_text):
    tracker = EmissionsTracker(project_name="Fine-tuning Emissions")
    tracker.start()

    print("Running inference...")
    output = run_inference(model, tokenizer, instruction, input_text)
    print(f"Generated Text: {output}")

    emissions = tracker.stop()
    print(f"Carbon emissions (kgCO2eq): {emissions}")

calculate_emissions(
    loaded_model,
    loaded_tokenizer,
    "Convert the following binary numbers to decimal.",
    "1010, 1101, 1111"
)

[codecarbon INFO @ 16:15:27] [setup] RAM Tracking...
[codecarbon INFO @ 16:15:27] [setup] GPU Tracking...
[codecarbon INFO @ 16:15:27] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 16:15:27] [setup] CPU Tracking...
[codecarbon INFO @ 16:15:28] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 16:15:28] >>> Tracker's metadata:
[codecarbon INFO @ 16:15:28]   Platform system: Linux-6.1.85+-x86_64-with-glibc2.35
[codecarbon INFO @ 16:15:28]   Python version: 3.11.11
[codecarbon INFO @ 16:15:28]   CodeCarbon version: 2.6.0
[codecarbon INFO @ 16:15:28]   Available RAM : 12.675 GB
[codecarbon INFO @ 16:15:28]   CPU count: 2
[codecarbon INFO @ 16:15:28]   CPU model: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 16:15:28]   GPU count: 1
[codecarbon INFO @ 16:15:28]   GPU model: 1 x Tesla T4


ref: /usr/local/lib/python3.11/dist-packages/codecarbon/data/hardware/cpu_power.csv


[codecarbon INFO @ 16:15:28] Saving emissions data to file /content/emissions.csv


Running inference...


[codecarbon INFO @ 16:15:29] Energy consumed for RAM : 0.000001 kWh. RAM Power : 4.753036022186279 W
[codecarbon INFO @ 16:15:29] Energy consumed for all GPUs : 0.000019 kWh. Total GPU Power : 63.00519796509088 W
[codecarbon INFO @ 16:15:29] Energy consumed for all CPUs : 0.000013 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 16:15:29] 0.000033 kWh of electricity used since the beginning.


Generated Text: Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Convert the following binary numbers to decimal.

### Input:
1010, 1101, 1111

### Response:
10, 13, 15

ref: /usr/local/lib/python3.11/dist-packages/codecarbon/data/private_infra/2016/usa_emissions.json
Carbon emissions (kgCO2eq): 4.613410832043306e-06
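The energy figures in the log follow from energy = power x time. As a sanity check, the sketch below converts the logged power and energy readings back into an implied measurement duration and confirms that the per-component energies add up to the logged total. The logged energies are rounded to six decimals, so the implied durations are approximate.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def energy_kwh(power_watts, seconds):
    """Energy in kWh drawn at a constant power over a duration."""
    return power_watts * seconds / JOULES_PER_KWH


def implied_seconds(energy_kwh_value, power_watts):
    """Duration that would produce the given energy at a constant power."""
    return energy_kwh_value * JOULES_PER_KWH / power_watts


# Readings copied from the codecarbon log above
ram_energy_kwh = 0.000001
gpu_power_w, gpu_energy_kwh = 63.00519796509088, 0.000019
cpu_power_w, cpu_energy_kwh = 42.5, 0.000013

gpu_seconds = implied_seconds(gpu_energy_kwh, gpu_power_w)
cpu_seconds = implied_seconds(cpu_energy_kwh, cpu_power_w)
total_kwh = ram_energy_kwh + gpu_energy_kwh + cpu_energy_kwh

print(f"implied GPU duration: {gpu_seconds:.2f} s")
print(f"implied CPU duration: {cpu_seconds:.2f} s")
print(f"total energy: {total_kwh:.6f} kWh")
```

Both components imply a window of roughly one second, which matches the one-second gap between the log timestamps, and the total agrees with the 0.000033 kWh reported.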


In [None]:
!pip freeze > requirements.txt

In [None]:
from google.colab import files
files.download('requirements.txt')
