<a href="https://colab.research.google.com/github/dbenayoun/IASD/blob/main/2_training_tinyllama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training generative models

In this notebook, we briefly show how to fine-tune a large generative model.

The lab also demonstrates that many pre-built tools exist to automate common tasks.

## Download requirements

First, install the libraries needed to load and fine-tune quantized models.

In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets


## Load a generative model

We now load a generative model. We rely on 4-bit quantization so that everything fits on a Colab GPU.

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

# 4-bit quantization settings (bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the linear-layer weights in 4 bits
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matrix multiplications
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [3]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear4bit(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm
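
As a rough sanity check, the layer shapes printed above can be turned into a parameter count and a weight-memory estimate. The sketch below is back-of-the-envelope arithmetic only. It assumes an untied output head of shape 32000 x 2048 (the printout is cut off before that layer) and one weight vector of size 2048 per RMSNorm, and it ignores the overhead of quantization constants.

```python
# Back-of-the-envelope parameter count from the shapes in the printout.
# Assumptions (not visible in the truncated printout): an untied lm_head of
# shape 32000 x 2048, and one weight vector of size 2048 per RMSNorm.
VOCAB, HIDDEN, KV_DIM, MLP_DIM, N_LAYERS = 32000, 2048, 256, 5632, 22

attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q_proj, o_proj + k_proj, v_proj
mlp = 3 * HIDDEN * MLP_DIM                              # gate_proj, up_proj, down_proj
norms = 2 * HIDDEN                                      # two RMSNorms per layer
per_layer = attention + mlp + norms

total_params = (
    VOCAB * HIDDEN            # embed_tokens
    + N_LAYERS * per_layer    # decoder layers
    + HIDDEN                  # final norm
    + VOCAB * HIDDEN          # lm_head (assumed)
)
print(f"total parameters: {total_params:,}")


def weight_gib(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


for name, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{name:>17}: {weight_gib(total_params, bits):.2f} GiB")
```

The 4-bit figure is a lower bound: only the `Linear4bit` layers are quantized, so the embeddings and norms still take more than 4 bits per weight.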

🚧 **TODO** 🚧

Experiment with text generation using the model.

In [4]:
text = "Paris is the capital of"
DEVICE = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=1.0, do_sample=True)
print(tokenizer.decode(outputs[0]))

<s> Paris is the capital of France and the center of its most important cities. The city of Lyon is often overlooked because its population is significantly lower than Paris, even though there are many things worth considering in regards to it. One of the cities that Lyon will be able to draw you in is the city of Montpellier. This city has a lot going for it and is always a worthy city to visit.
Montpellier is a large city that has a lot going for it. It has a lot of attractions that will make your time with you in this beautiful city enjoyable. The city is located in the province of Languedoc-Roussillon and is at the southern corner of France and is at the point connecting the two regions from their northwest corridor. Montpellier has a population of approximately 565,000 of which 379,000 are in the city itself. The city is close to the Mediterranean Sea coast
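
The call above samples with `temperature=1.0`. As a minimal illustration of what the temperature does, the sketch below divides some made-up logits by the temperature before applying a softmax. It is a simplified stand-in, not the actual `generate` implementation; the logit values and the function name `softmax_with_temperature` are invented for the example.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A lower temperature concentrates probability on the highest-scoring token, while a higher temperature flattens the distribution and makes the samples more varied.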


## Training

First, load an instruction dataset and tokenize its `text` field.

In [5]:
from datasets import load_dataset

def process_data(sample):
    return tokenizer(sample["text"])

data = load_dataset("tatsu-lab/alpaca")
data = data.map(process_data, batched=True)


In [6]:
data

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text', 'input_ids', 'attention_mask'],
        num_rows: 52002
    })
})

In [7]:
print(data["train"][0]["text"])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.
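
The sample above follows a fixed template. A small helper can rebuild that template so that prompts used at inference time match the training format. This is only a sketch: `build_prompt` is a hypothetical helper, and it covers only examples without an `input` field, like the sample printed above.

```python
PREAMBLE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_prompt(instruction, response=""):
    """Format an instruction (and optionally its response) like the sample above."""
    return f"{PREAMBLE}\n\n### Instruction:\n{instruction}\n\n### Response:\n{response}"


prompt = build_prompt("Give three tips for staying healthy.")
print(prompt)
```

Leaving `response` empty yields a prompt that ends right after `### Response:`, which is where the model is expected to continue.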


Because this model is fairly large, we reduce the number of trainable parameters.

To do that, we use **Parameter-Efficient Fine-Tuning** (PEFT).

In PEFT, we do not tune the full model. Only a small subset of the parameters is trained, while the rest are frozen (i.e., not updated during training).
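
To make "frozen" concrete, here is a toy sketch of one gradient step in which only the trainable parameter moves. The parameter names, values and gradients are all made up, and plain SGD stands in for the optimizer used later in this notebook.

```python
# Toy parameters: name -> value, plus a flag saying whether each one is trainable.
params = {"base_weight": 1.0, "adapter_weight": 0.0}
trainable = {"base_weight": False, "adapter_weight": True}
grads = {"base_weight": 0.5, "adapter_weight": 0.5}  # made-up gradients


def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one plain SGD update, skipping frozen parameters."""
    return {
        name: value - lr * grads[name] if trainable[name] else value
        for name, value in params.items()
    }


updated = sgd_step(params, grads, trainable)
print(updated)
```

The frozen `base_weight` keeps its value, and only `adapter_weight` is updated.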

🚧 **TODO** 🚧

Explain why PEFT reduces the memory cost.


In [8]:
from peft import prepare_model_for_kbit_training

# Prepare the quantized model for training: freeze the base weights, cast the
# remaining half-precision parameters (e.g. layer norms) to float32, and
# enable gradient checkpointing
model = prepare_model_for_kbit_training(model)

In [12]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

We now add a small number of extra parameters to the model using the LoRA technique, described here: https://arxiv.org/abs/2106.09685
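
As a minimal sketch of the idea, LoRA keeps a weight matrix `W` frozen and learns a low-rank update `B A`, scaled by `alpha / r`. The toy code below uses plain Python lists and tiny, made-up dimensions; it illustrates the formula and is not how the `peft` library implements it. Following the paper, `A` starts random and `B` starts at zero.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """Frozen output W x plus the scaled low-rank update (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 8   # toy sizes, far smaller than the real model

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable, random init
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero init

x = [1.0] * d_in
# With B at zero the update vanishes, so the adapted layer matches the frozen one.
print(lora_forward(x, W, A, B, alpha, r) == matvec(W, x))
```

Because `B` is initialized to zero, fine-tuning starts from exactly the behavior of the pre-trained model, and only `A` and `B` receive updates.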

In [13]:
from peft import LoraConfig, get_peft_model

# LoRA configuration: rank-8 adapters on the attention query, key and value projections
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 1531904 || all params: 617138176 || trainable%: 0.2482270680334642
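
The two counts printed above can be reproduced by hand. The sketch below is arithmetic only, under two assumptions that are not visible in this notebook: that bitsandbytes packs two 4-bit weights into each stored byte (so a `Linear4bit` layer reports half as many elements), and that the model has an untied, unquantized `lm_head` of shape 32000 x 2048.

```python
HIDDEN, KV_DIM, MLP_DIM, VOCAB, N_LAYERS = 2048, 256, 5632, 32000, 22
R = 8  # LoRA rank from the config above


def lora_params(d_in, d_out, r=R):
    """LoRA adds A (r x d_in) and B (d_out x r) next to a d_in -> d_out layer."""
    return r * (d_in + d_out)


# Adapters on q_proj, k_proj and v_proj in every decoder layer.
trainable = N_LAYERS * (
    lora_params(HIDDEN, HIDDEN) + 2 * lora_params(HIDDEN, KV_DIM)
)

# Elements reported for the quantized base model (assumed packing: 2 weights per byte).
linear_weights = N_LAYERS * (
    2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM + 3 * HIDDEN * MLP_DIM
)
packed_linear = linear_weights // 2
norms = N_LAYERS * 2 * HIDDEN + HIDDEN
embeddings = 2 * VOCAB * HIDDEN  # embed_tokens + assumed lm_head

all_params = packed_linear + norms + embeddings + trainable
print(trainable, all_params, 100 * trainable / all_params)
```

Under these assumptions the arithmetic reproduces both printed values (1531904 and 617138176). It also suggests that the printed percentage is relative to the packed element count; relative to the roughly 1.1 billion unquantized weights, the trainable share would be closer to 0.14%.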


We now use the `Trainer` class from Hugging Face Transformers, which handles the training loop for us. For simple setups like this one, it works well.

In [14]:
import transformers

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=16,
        gradient_accumulation_steps=1,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silences the warnings during training; re-enable for inference
trainer.train()



Step,Training Loss
1,1.9435
2,2.0213
3,2.0933
4,1.867
5,1.9631
6,1.9018
7,1.9006
8,1.8548
9,1.6981
10,1.5259


TrainOutput(global_step=100, training_loss=1.2259384715557098, metrics={'train_runtime': 333.8049, 'train_samples_per_second': 4.793, 'train_steps_per_second': 0.3, 'total_flos': 2931690377576448.0, 'train_loss': 1.2259384715557098, 'epoch': 0.03})
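
The metrics in `TrainOutput` can be cross-checked against the training arguments. The arithmetic below assumes a single GPU (so the effective batch size is 16) and reuses the dataset size and the runtime reported in this notebook.

```python
batch_size = 16          # per_device_train_batch_size, assuming one device
grad_accum = 1           # gradient_accumulation_steps
max_steps = 100
dataset_size = 52002     # rows in the Alpaca train split
runtime_s = 333.8049     # train_runtime reported above

samples_seen = batch_size * grad_accum * max_steps
epochs = samples_seen / dataset_size
samples_per_second = samples_seen / runtime_s

print(f"samples seen: {samples_seen}")
print(f"fraction of an epoch: {epochs:.2f}")
print(f"samples per second: {samples_per_second:.3f}")
```

The run therefore covers only about 3% of one pass over the dataset, which is consistent with the reported `'epoch': 0.03`.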

In [19]:
text = "### Instruction:\nPropose an outdoor activity.\n\n### Response:\n"
inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
model.config.use_cache = True  # re-enable the key/value cache for faster generation

outputs = model.generate(**inputs, max_new_tokens=20, temperature=1.0, do_sample=True)
print(tokenizer.decode(outputs[0]))

<s> ### Instruction:
Propose an outdoor activity.

### Response:
Explore the beauty of the sunrise and then head back to the city for an early
