# Hands-On LoRA
Learn how to fine-tune an LLM on a custom dataset using LoRA.

Meta’s Llama 3.1 model is used for a variety of use cases, including question-answering, text generation, code generation, story writing, and much more. One use case also involves solving math word problems, but the model usually provides solutions in natural language rather than pure math expressions. We want to fine-tune the Llama 3.1 model to provide solutions to word problems in mathematical expressions.

> We will use the openai/gsm8k dataset from Hugging Face for fine-tuning. GSM8K (Grade School Math 8K) is a dataset of 8.5K grade-school math word problems that require multi-step reasoning, along with their solutions written as mathematical expressions.

Let’s begin fine-tuning Meta’s Llama 3.1 model on the openai/gsm8k dataset using LoRA.

## Install the dependencies
First, let’s install the libraries required for fine-tuning. We'll be installing the latest versions (at the time of writing) of the libraries.

In [1]:
!pip install transformers==4.44.1
!pip install accelerate
!pip install bitsandbytes==0.43.3
!pip install datasets==2.21.0
!pip install trl==0.9.6
!pip install peft==0.12.0
!pip install -U "huggingface_hub[cli]"



- Line 1: We install the transformers library, a Hugging Face library that provides APIs and tools to download and train state-of-the-art pretrained models.

- Line 2: We install the accelerate library, which makes training and inference simple, efficient, and adaptable across different hardware setups.

- Line 3: We install the bitsandbytes library, which integrates with transformers to quantize the model.

- Line 4: We install the datasets library for sharing and accessing datasets for downstream tasks.

- Line 5: We install the trl library for training transformer models with reinforcement learning and supervised fine-tuning.

- Line 6: We install the peft library for parameter-efficient fine-tuning of large language models for downstream tasks.

- Line 7: We install the Hugging Face CLI to log in and access models and datasets from Hugging Face.
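
Since some of the packages above are pinned and others are not, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is our own, and a missing package is reported as `None` instead of raising an error.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed in the cell above.
PACKAGES = [
    "transformers",
    "accelerate",
    "bitsandbytes",
    "datasets",
    "trl",
    "peft",
    "huggingface_hub",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this right after the install cell makes it easy to record the exact versions alongside the results.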

## Hugging Face CLI
After installing the required libraries, it's time to log in to Hugging Face. Hugging Face requires this step to access gated models and datasets, such as Meta's Llama models. Here, the access token is read from the HF_TOKEN environment variable.

In [None]:
from huggingface_hub import login
import os
login(token=os.getenv("HF_TOKEN"))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `vllm` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `vllm`
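
The login cell above passes `os.getenv("HF_TOKEN")` straight to `login()`, and `os.getenv` returns `None` when the variable is not set. A small guard like the one below makes that case explicit. This is only a sketch: the helper name `read_hf_token` is hypothetical and not part of `huggingface_hub`.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the environment, or raise a clear error if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before logging in.")
    return token


# Usage (assumes huggingface_hub is installed, as in the install cell):
# from huggingface_hub import login
# login(token=read_hf_token())
```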


## Quantization
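Now, let’s load the pretrained model and see how it responds to a math word problem. Note that the 8-bit quantization settings are commented out in the loading cell, so in this run the model is loaded in full precision (float32), as the dtype and memory-footprint checks show.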

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Copy the ZIP file from Google Drive to Colab's local storage
!cp /content/drive/MyDrive/meta-llamaLlama-3.1-8B.zip /content/

# Move to the destination folder (not required, but convenient)
%cd /content/

# Extract the ZIP file
!unzip /content/meta-llamaLlama-3.1-8B.zip -d /content/llama3-1-8B

/content
Archive:  /content/meta-llamaLlama-3.1-8B.zip
   creating: /content/llama3-1-8B/meta-llamaLlama-3.1-8B/
  inflating: /content/llama3-1-8B/meta-llamaLlama-3.1-8B/config.json  
  inflating: /content/llama3-1-8B/meta-llamaLlama-3.1-8B/generation_config.json  
  inflating: /content/llama3-1-8B/meta-llamaLlama-3.1-8B/gitattributes  
  inflating: /content/llama3-1-8B/meta-llamaLlama-3.1-8B/LICENSE  
  inflating: /content/llama3-1-8B/meta-llamaLlama-3.1-8B/model-00001-of-00004.safetensors  
  inflating: /content/llama3-1-8B/meta-llamaLlama-3.1-8B/model-00002-of-00004.safetensors  
  inflating: /content/llama3-1-8B/meta-llamaLlama-3.1-8B/model-00003-of-00004.safetensors  
  inflating: /content/llama3-1-8B/meta-llamaLlama-3.1-8B/model-00004-of-00004.safetensors  
  inflating: /content/llama3-1-8B/meta-llamaLlama-3.1-8B/model.safetensors.index.json  
  inflating: /content/llama3-1-8B/meta-llamaLlama-3.1-8B/README.md  
  inflating: /content/llama3-1-8B/meta-llamaLlama-3.1-8B/special_toke

In [5]:
ls /content/llama3-1-8B/

[0m[34;42mmeta-llamaLlama-3.1-8B[0m/


In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

#bnb_config = BitsAndBytesConfig(
#    load_in_8bit=True
#)

model_path = "/content/llama3-1-8B/meta-llamaLlama-3.1-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [7]:
param_dtypes = [param.dtype for param in model.parameters()]
print("Parameter dtypes:", param_dtypes)

Parameter dtypes: [torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.

In [8]:
print(model.get_memory_footprint())

32121053440
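
The footprint printed above is in bytes. Because every parameter is stored as float32 (4 bytes), dividing by 4 recovers the approximate parameter count, and the same count gives rough estimates for narrower storage formats. The sketch below is plain arithmetic; it ignores non-parameter buffers and any overhead a real quantization scheme adds, and the helper name `estimated_gib` is our own.

```python
FOOTPRINT_BYTES = 32_121_053_440  # value printed by model.get_memory_footprint() above
BYTES_PER_FLOAT32 = 4

# Approximate parameter count implied by a float32 footprint.
approx_params = FOOTPRINT_BYTES // BYTES_PER_FLOAT32


def estimated_gib(num_params, bytes_per_param):
    """Rough weight memory in GiB for a given storage width per parameter."""
    return num_params * bytes_per_param / 1024**3


print(f"approx. parameters: {approx_params:,}")
for label, width in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1)]:
    print(f"{label:>17}: {estimated_gib(approx_params, width):.1f} GiB")
```

The result is roughly 8.03 billion parameters and about 29.9 GiB in float32, which is why the full-precision model occupies most of a 40 GB GPU.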


In [9]:
input = tokenizer("Natalia sold clips to 48 of her friends in April, and then she sold half as \
many clips in May. How many clips did Natalia sell altogheter in April and May?", return_tensors="pt").to('cuda')

response = model.generate(**input, max_new_tokens=100)
print(tokenizer.batch_decode(response, skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogheter in April and May? A. 96 B. 48 C. 24 D. 12\nNatalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogheter in April and May? A. 96 B. 48 C. 24 D. 12']


- Cell [6], line 1: We import the AutoModelForCausalLM and AutoTokenizer classes from the transformers library.

- Cell [6], line 2: We import the PyTorch library.

- Cell [6], lines 4–6: The 8-bit quantization settings (a BitsAndBytesConfig with load_in_8bit=True) are left commented out, so no quantization is applied in this run. Cells [7] and [8] confirm this: every parameter is torch.float32, and the memory footprint is about 32 GB.

- Cell [6], lines 8–14: We load the model from the local path using the from_pretrained() function of the AutoModelForCausalLM class.

- Cell [6], lines 16–17: We load the tokenizer for our model.

- Cell [9]: We tokenize the input prompt, pass the tokenized input to the model, and print the generated response by converting it to text while ignoring special tokens.

We can see that the base model does not actually solve the problem: it continues the prompt with multiple-choice options and then repeats the question, rather than producing a worked solution in mathematical expressions.

## Training the model
Now, let’s use the openai/gsm8k dataset to fine-tune the model so it generates responses in mathematical expressions.

## Preprocess the dataset
Let’s begin by preprocessing the dataset for fine-tuning.

In [11]:
from datasets import load_dataset

dataset = "openai/gsm8k"

data = load_dataset(dataset, "main")

tokenizer.pad_token = tokenizer.eos_token

data = data.map(lambda samples: tokenizer(samples['question'], samples['answer'], truncation=True, padding="max_length", max_length=100), batched=True)

train_sample = data["train"].select(range(400))

display(train_sample)

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask'],
    num_rows: 400
})

- Line 1: We import the load_dataset function from the datasets library.

- Lines 3–5: We load the openai/gsm8k dataset (the "main" configuration) from Hugging Face using the load_dataset function.

- Line 7: We set the pad token to the tokenizer's eos_token. The eos_token is a special token that marks the end of a sequence.

- Line 9: We tokenize the columns of the dataset that we want to use for fine-tuning. Our dataset consists of only two columns, "question" and "answer", and we pass both to the tokenizer so that each sample contains the question followed by its answer. We set truncation=True and padding="max_length" with max_length=100 so that every sample has the same fixed length.

- Line 11: We create the training dataset train_sample by selecting only the first 400 rows of the training split. We use very few rows to reduce the training time.

- Line 13: We display the information of the training dataset train_sample.

We can see that our dataset has 400 rows and contains input_ids and attention_mask columns along with question and answer columns for each row.
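
To make the effect of truncation=True and padding="max_length" concrete, here is a simplified sketch in plain Python. It is not the tokenizer's implementation: it assumes right-side padding (which matches the padded row printed further down) and ignores how the real tokenizer splits the budget between the question and the answer. The function name `pad_or_truncate` is our own.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), each with exactly max_length entries."""
    kept = token_ids[:max_length]            # truncation: drop tokens beyond max_length
    n_pad = max_length - len(kept)           # padding: fill the remainder with pad_id
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Short sequence: padded on the right with the eos/pad id 128001.
print(pad_or_truncate([128000, 45, 4306], max_length=5, pad_id=128001))

# Long sequence: cut down to max_length, no padding needed.
print(pad_or_truncate([128000, 45, 4306, 689, 6216, 27203], max_length=5, pad_id=128001))
```

With max_length=100, any question-answer pair longer than 100 tokens loses its tail, so a larger value may be worth considering for problems with long solutions.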

Let’s print a row of the training dataset train_sample to see the values.

In [12]:
print(train_sample[:1])

{'question': ['Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?'], 'answer': ['Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'], 'input_ids': [[128000, 45, 4306, 689, 6216, 27203, 311, 220, 2166, 315, 1077, 4885, 304, 5936, 11, 323, 1243, 1364, 6216, 4376, 439, 1690, 27203, 304, 3297, 13, 2650, 1690, 27203, 1550, 42701, 689, 4662, 31155, 304, 5936, 323, 3297, 30, 128000, 45, 4306, 689, 6216, 220, 2166, 14, 17, 284, 1134, 2166, 14, 17, 28, 1187, 2511, 1187, 27203, 304, 3297, 627, 45, 4306, 689, 6216, 220, 2166, 10, 1187, 284, 1134, 2166, 10, 1187, 28, 5332, 2511, 5332, 27203, 31155, 304, 5936, 323, 3297, 627, 827, 220, 5332, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

We can see the input_ids and attention_mask values for the first row of the dataset. Our dataset is now ready for fine-tuning.

## LoRA configurations
Now, let’s set the LoRA configurations:

In [13]:
import peft
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

### 📌 **Explanation of the LoRA (Low-Rank Adaptation) configuration**
The configuration above applies **LoRA (Low-Rank Adaptation)** to a **causal language model (CAUSAL_LM)**, such as Llama or GPT, for **efficient fine-tuning**.

LoRA **adapts a large model** without updating all of its parameters, which drastically reduces the **compute cost** and the memory required.

---

### 🔹 **Parameter analysis**

#### 🔹 **1️⃣ `r=16` → Rank of the LoRA decomposition**
- `r` is the **rank** of the **low-rank factorization** that LoRA uses to reduce the number of parameters to update.
- A **higher** `r` increases the adapter's learning capacity but uses more memory.
- **Typical values:** `r=8` or `r=16` for a good balance between efficiency and accuracy.

📌 **What does it mean?**  
LoRA keeps each targeted weight matrix \( W \) frozen and learns an update to it as the product of two much smaller matrices \( A \) and \( B \), whose shared inner dimension is the rank `r`.  
This drastically reduces the number of parameters to update during training.
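
As a rough illustration of how much smaller the trainable part becomes, the sketch below counts the LoRA parameters for this configuration. The layer shapes are assumptions about Llama 3.1 8B that are not stated in this notebook (hidden size 4096, a 4096-wide `q_proj` output, a 1024-wide `v_proj` output, 32 decoder layers); adjust them if your checkpoint differs. The helper name `lora_param_count` is our own.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumed Llama 3.1 8B shapes (not taken from this notebook).
HIDDEN = 4096      # input width of q_proj and v_proj
Q_OUT = 4096       # q_proj output width
V_OUT = 1024       # v_proj output width
NUM_LAYERS = 32    # number of decoder layers
R = 16             # rank from lora_config

per_layer = lora_param_count(HIDDEN, Q_OUT, R) + lora_param_count(HIDDEN, V_OUT, R)
total = per_layer * NUM_LAYERS
full_q_proj = HIDDEN * Q_OUT

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"one full q_proj matrix:    {full_q_proj:,}")
```

Under these assumptions, the adapters for all layers together hold fewer parameters than a single full `q_proj` matrix.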

---

#### 🔹 **2️⃣ `lora_alpha=16` → Scaling factor**
- `lora_alpha` controls the **weight** of the LoRA updates relative to the original weights.
- The higher `lora_alpha` is, the stronger the impact of the LoRA changes.

📌 **How does it work?**  
The **effective** update is scaled as:
\[
\text{Update} = \frac{\text{LoRA modifications} \times \text{lora\_alpha}}{r}
\]
For this configuration:
\[
\frac{\text{lora\_alpha}}{r} = \frac{16}{16} = 1
\]
So the LoRA update is **applied at full strength**, with no additional scaling.
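
The scaling rule can be checked on a toy example. The sketch below uses plain Python lists and tiny hand-picked matrices (our own, purely illustrative) to compute the output of a frozen layer plus the scaled low-rank update. It is a conceptual model of the formula, not the peft implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x) for a frozen W and LoRA factors A, B."""
    scaling = lora_alpha / r
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # low-rank path: project down with A, back up with B
    return [b + scaling * u for b, u in zip(base, update)]


# Toy shapes: W is 2x3, A is 1x3, B is 2x1, so the rank r is 1.
W = [[1, 0, 0], [0, 1, 0]]
A = [[1, 1, 1]]
B = [[1], [2]]
x = [1, 2, 3]

same_ratio = lora_forward(x, W, A, B, lora_alpha=1, r=1)    # ratio 1, like 16/16
double_ratio = lora_forward(x, W, A, B, lora_alpha=2, r=1)  # ratio 2 doubles the update
print(same_ratio, double_ratio)
```

Only the ratio `lora_alpha / r` matters for the scaling, so `lora_alpha=16, r=16` behaves like the first call.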

---

#### 🔹 **3️⃣ `target_modules=["q_proj", "v_proj"]` → Modules that receive LoRA**
- Specifies **which layers of the model** are adapted with LoRA.
- `"q_proj"` and `"v_proj"` are the **query and value projection layers** of the attention blocks in the Transformer.

📌 **Why only these modules?**  
Targeting only these two projections **keeps the number of trainable parameters small**, which improves efficiency.  
Transformers use **query-key-value attention**, so adapting only `q_proj` and `v_proj` changes how the model attends to and represents its input **while every original weight stays frozen**.

---

#### 🔹 **4️⃣ `lora_dropout=0.1` → Dropout for regularization**
- `lora_dropout=0.1` applies **dropout** with probability **10%** to the inputs of the LoRA layers.
- It helps **reduce overfitting** during fine-tuning.

📌 **When should dropout be used?**
- For **small** datasets, it helps **improve generalization**.
- For **large** datasets, it can be set to `0` to maximize learning.

---

#### 🔹 **5️⃣ `bias="none"` → No bias updates**
- `"none"` means that **LoRA does not update the biases** in the model's layers.
- The available options are:
  - `"none"` → Biases are **left unchanged** (best for saving memory)
  - `"all"` → Update all biases
  - `"lora_only"` → Update only the biases in the modules adapted with LoRA

📌 **Why not update the biases?**  
Biases take up little memory and **have little impact** on performance.  
So, for **efficiency**, they are often **left untouched** during fine-tuning.

---

#### 🔹 **6️⃣ `task_type="CAUSAL_LM"` → Autoregressive model (GPT, Llama)**
- Indicates that the model is a **causal language model** (such as **Llama, GPT, Mistral**).
- For a different kind of model, other task types apply:
  - `"SEQ_2_SEQ_LM"` → Encoder-decoder models (T5, BART)
  - `"TOKEN_CLS"` → Token classification, such as Named Entity Recognition (NER)
  - `"SEQ_CLS"` → Sequence (text) classification

📌 **Why "CAUSAL_LM"?**  
In **causal language models**, each token can **see only the preceding tokens** during generation.  
Examples of **causal LMs**:
- **GPT-3, GPT-4**
- **Llama 3.1**
- **Mistral**
- **Falcon**

---

### ✅ **Summary: what does this configuration do?**
| **Parameter**    | **Meaning** |
|-----------------|----------------|
| `r=16`          | Rank of the low-rank LoRA matrices |
| `lora_alpha=16` | Controls the strength of the LoRA updates |
| `target_modules=["q_proj", "v_proj"]` | Applies LoRA only to the attention projections (query, value) |
| `lora_dropout=0.1` | Applies 10% dropout to the LoRA inputs to reduce overfitting |
| `bias="none"` | Does not update biases, which saves memory |
| `task_type="CAUSAL_LM"` | The model is an **autoregressive language model** |

✅ **This configuration makes fine-tuning a model such as Llama 3.1 8B much more efficient than updating all of its weights.**

## Set the training arguments
After setting the LoRA configurations, let’s set the training arguments for training the model.

In [14]:
from transformers import TrainingArguments
import os

working_dir = "/content/llama3-1-8B/"

output_directory = os.path.join(working_dir, "lora")

training_args = TrainingArguments(
    output_dir = output_directory,
    auto_find_batch_size = True,
    learning_rate = 3e-4,
    num_train_epochs = 5
)

### 📌 **Explanation of the `TrainingArguments` configuration**
The code sets the **training arguments** for training a model with **Hugging Face Transformers**.  
Specifically, it **configures the fine-tuning of Llama 3.1 with LoRA**.

---

### 🔹 **Parameter analysis**

---

#### 🔹 **1️⃣ `output_dir = output_directory` → Folder where the model is saved**
```python
output_directory = os.path.join(working_dir, "lora")
output_dir = output_directory
```
📌 **What does it do?**  
- Defines the **folder where checkpoints are saved** during and after training.
- In this case, the weights trained with LoRA are saved in:
  ```
  /content/llama3-1-8B/lora/
  ```
- After training, the **LoRA weights can be reloaded from this folder**.

---

#### 🔹 **2️⃣ `auto_find_batch_size = True` → Adjusts the batch size automatically**
```python
auto_find_batch_size = True
```
📌 **What does it do?**  
- **Automatically finds a batch size that fits in GPU memory**: if a training step runs out of memory, the trainer retries with a smaller batch size.
- This is useful because:
  - Models such as **Llama 3.1 and GPT are very large** and can easily exhaust the available VRAM.
  - If the batch size is **too high**, training stops with an **Out of Memory (OOM)** error.
  - With **`auto_find_batch_size=True`**, the batch size is adjusted automatically to suit your GPU.

✅ **Helps avoid OOM crashes on GPUs with limited memory.**
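
The retry idea behind `auto_find_batch_size` can be sketched in a few lines of plain Python. This is a conceptual illustration only: the real logic lives in the `accelerate` library, and the reduction factor it uses can differ between versions, so the halving below is an assumption. The names `find_batch_size` and `fake_train_step` are our own.

```python
def find_batch_size(train_step, starting_batch_size):
    """Halve the batch size after each out-of-memory failure until a step succeeds."""
    batch_size = starting_batch_size
    while batch_size >= 1:
        try:
            train_step(batch_size)
            return batch_size
        except MemoryError:
            batch_size //= 2  # assumed reduction rule; the library's rule may differ
    raise RuntimeError("No batch size fits in memory.")


def fake_train_step(batch_size, largest_that_fits=4):
    """Stand-in for a real training step: anything above the limit 'runs out of memory'."""
    if batch_size > largest_that_fits:
        raise MemoryError(f"batch size {batch_size} does not fit")


print(find_batch_size(fake_train_step, starting_batch_size=8))
```

Starting from 8, the fake step fails once and then succeeds at 4, so the function returns 4.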

---

#### 🔹 **3️⃣ `learning_rate = 3e-4` → Learning rate**
```python
learning_rate = 3e-4
```
📌 **What does it do?**  
- **Determines how quickly the model updates its weights during training.**
- `3e-4` means **0.0003**, which is considerably higher than the rates typically used for full fine-tuning.
- Typical values:
  - **`1e-5` or `2e-5`** for **full** fine-tuning of large models.
  - **`3e-4`** is higher because **LoRA updates far fewer parameters**, so it can learn with larger steps.

✅ **A learning rate suited to LoRA.**

---

#### 🔹 **4️⃣ `num_train_epochs = 5` → Number of training epochs**
```python
num_train_epochs = 5
```
📌 **What does it do?**  
- **Specifies how many times the model passes over the training data.**
- **More epochs give the model more exposure to the data**, but **too many can cause overfitting**.
- Rough guidelines:
  - **`3-5` epochs for standard fine-tuning.**
  - **`10+` epochs for small datasets or smaller models.**

✅ **Five epochs is a reasonable compromise for fine-tuning with LoRA.**

---

### ✅ **Summary: what does this configuration do?**
| **Parameter**            | **Meaning** |
|------------------------|----------------|
| `output_dir="lora/"`  | Saves the LoRA weights in `/content/llama3-1-8B/lora/` |
| `auto_find_batch_size=True` | Adjusts the batch size automatically to avoid out-of-memory errors |
| `learning_rate=3e-4`  | Higher learning rate for LoRA than is typical for full fine-tuning |
| `num_train_epochs=5`  | 5 epochs to balance learning and overfitting |

✅ **This configuration fine-tunes Llama 3.1 8B with LoRA efficiently.**

## Set the trainer
Let’s set the trainer to train the model.

In [15]:
import transformers
from trl import SFTTrainer

trainer = SFTTrainer(
    model = model,
    args = training_args,
    train_dataset = train_sample,
    peft_config = lora_config,
    tokenizer = tokenizer,
    data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)



- Line 1: We import the transformers library.

- Line 2: We import the SFTTrainer class from the trl library.

- Lines 4–11: We set the configurations for the trainer using the SFTTrainer class.

- Line 5: We specify the training model using the model parameter.

- Line 6: We specify the training arguments using the args parameter.

- Line 7: We specify the training data using the train_dataset parameter.

- Line 8: We specify the LoRA configurations using the peft_config parameter.

- Line 9: We specify the tokenizer for the model.

- Line 10: We specify the data_collator for creating batches from the set of training data.
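
With `mlm=False`, the data collator prepares labels for causal language modeling by copying the input IDs and masking padding positions so they are ignored by the loss. The sketch below mimics that labeling step on a plain Python list; it is a simplified illustration with our own function name, not the collator's code.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, masking pad positions so they add nothing to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# A padded toy sequence using the pad/eos id 128001 from this notebook.
ids = [128000, 45, 4306, 128001, 128001]
print(causal_lm_labels(ids, pad_token_id=128001))
```

Because the pad token was set to the eos_token earlier, any genuine end-of-sequence token would be masked in the same way, which may make it harder for the model to learn where an answer should stop.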

Now, we are all set to train the model. We can train it using the trainer.train() function.

In [16]:
trainer.train()



wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: felipaosfdk (felipaosfdk-university-of-udine) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.

| Step | Training Loss |
|------|---------------|
| 500  | 1.2006        |


TrainOutput(global_step=500, training_loss=1.2006151123046875, metrics={'train_runtime': 395.4619, 'train_samples_per_second': 5.057, 'train_steps_per_second': 1.264, 'total_flos': 9050144853196800.0, 'train_loss': 1.2006151123046875, 'epoch': 5.0})
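
The numbers in this output also reveal which batch size was actually used. The sketch below is plain arithmetic on the values reported above, assuming a single GPU and no gradient accumulation (neither is configured in this notebook).

```python
import math

NUM_SAMPLES = 400     # rows in train_sample
NUM_EPOCHS = 5        # num_train_epochs
GLOBAL_STEPS = 500    # global_step reported in TrainOutput

# Optimizer steps per epoch, and the batch size that produces that many steps.
steps_per_epoch = GLOBAL_STEPS // NUM_EPOCHS
implied_batch_size = math.ceil(NUM_SAMPLES / steps_per_epoch)

# Cross-check with the reported throughput: samples per second / steps per second.
throughput_batch_size = 5.057 / 1.264

print(f"steps per epoch:        {steps_per_epoch}")
print(f"implied batch size:     {implied_batch_size}")
print(f"throughput cross-check: {throughput_batch_size:.2f}")
```

Both calculations point to a batch size of 4. That is smaller than the default per-device batch size of 8 in `TrainingArguments`, which would be consistent with `auto_find_batch_size` having reduced it after an out-of-memory failure.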

After training, we can save the fine-tuned model on our local machines for later use.

In [17]:
# Save the LoRA adapter weights produced by training
model_path = os.path.join(output_directory, "lora_model")
trainer.model.save_pretrained(model_path)

In [19]:
!zip -r /content/llama3-1-8B/lora.zip /content/llama3-1-8B/lora

  adding: content/llama3-1-8B/lora/ (stored 0%)
  adding: content/llama3-1-8B/lora/checkpoint-500/ (stored 0%)
  adding: content/llama3-1-8B/lora/checkpoint-500/scheduler.pt (deflated 56%)
  adding: content/llama3-1-8B/lora/checkpoint-500/tokenizer_config.json (deflated 96%)
  adding: content/llama3-1-8B/lora/checkpoint-500/special_tokens_map.json (deflated 64%)
  adding: content/llama3-1-8B/lora/checkpoint-500/tokenizer.json (deflated 74%)
  adding: content/llama3-1-8B/lora/checkpoint-500/adapter_model.safetensors (deflated 7%)
  adding: content/llama3-1-8B/lora/checkpoint-500/trainer_state.json (deflated 56%)
  adding: content/llama3-1-8B/lora/checkpoint-500/training_args.bin (deflated 51%)
  adding: content/llama3-1-8B/lora/checkpoint-500/README.md (deflated 66%)
  adding: content/llama3-1-8B/lora/checkpoint-500/adapter_config.json (deflated 51%)
  adding: content/llama3-1-8B/lora/checkpoint-500/optimizer.pt (deflated 9%)
  adding: content/llama3-1-8B/lora/checkpoint-500/rng_state.p

In [20]:
from google.colab import files
files.download("/content/llama3-1-8B/lora.zip")


## Load the fine-tuned model
Let’s load the fine-tuned model saved on our machine and run inference with it.

In [3]:
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Path to the base model
base_model_path = "/content/llama3-1-8B/meta-llamaLlama-3.1-8B"
# Path to the trained LoRA adapter
lora_model_path = "/content/llama3-1-8B/lora/lora_model"

loaded_model = AutoPeftModelForCausalLM.from_pretrained(
    lora_model_path,
    device_map="auto"
)

# Load the tokenizer from the base model
tokenizer = AutoTokenizer.from_pretrained(base_model_path)

# Tokenize the text and move the inputs to the GPU
input_text = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
input_tokens = tokenizer(input_text, return_tensors="pt").to("cuda")  # Move to GPU

response = loaded_model.generate(**input_tokens, max_new_tokens=100)
print(tokenizer.decode(response[0], skip_special_tokens=True))

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?The number of clips Natalia sold in May is 48/2 = <<48/2=24>>24 clips.
So, together in April and May, Natalia sold 48+24 = <<48+24=72>>72 clips.
#### 72 clips
72 clips.
#### 72 clips
72 clips.
#### 72 clips
72 clips.
#### 72 clips
72 clips.
#### 72 clips
72 clips.
#### 72 clips
72 clips.
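
The fine-tuned model now answers in the GSM8K style, with the final result after a `####` marker, but it keeps repeating that line until max_new_tokens is used up. One simple way to handle this is to post-process the text: read the number after the first `####` and discard everything that follows. The sketch below does this with a regular expression; the function names are our own and the pattern is an assumption about what a final answer looks like.

```python
import re

# A number (optionally negative, with thousands separators or decimals) after '####'.
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(generated_text):
    """Return the number following the first '####' marker, or None if there is none."""
    match = FINAL_ANSWER.search(generated_text)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def truncate_after_answer(generated_text):
    """Keep the text up to and including the first final answer."""
    match = FINAL_ANSWER.search(generated_text)
    return generated_text if match is None else generated_text[: match.end()]


sample = (
    "So, together in April and May, Natalia sold 48+24 = <<48+24=72>>72 clips.\n"
    "#### 72 clips\n72 clips.\n#### 72 clips\n72 clips."
)
print(extract_final_answer(sample))
print(truncate_after_answer(sample))
```

This leaves the model untouched and only cleans up its output. Stopping generation earlier would require the model to emit an end-of-sequence token, which this training setup may not teach it to do.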



In [5]:
!nvidia-smi

Sun Feb 23 14:14:59 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P0             49W /  400W |   31109MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                