# Training LLMs: Complete Pipeline from Pretraining to RLHF

This notebook demonstrates the full training pipeline for large language models through three main stages:

1. **Pretraining** - Training a model from scratch on raw text
2. **Supervised Fine-tuning (SFT)** - Teaching instruction following  
3. **Reinforcement Learning from Human Feedback (RLHF)** - Aligning with human preferences

We'll build a small Mistral model and train it end-to-end to understand each stage.

# Stage 1: Pretraining

In this section, we'll train a model from scratch using next-token prediction on raw Wikipedia text. This teaches the model basic language understanding and generation capabilities.

## Data Loading

Here we're downloading a small subset (1000 samples) of the Wikipedia dataset for demonstration. This gives us raw text data that we'll use for pretraining our model from scratch.

In [1]:
from datasets import load_dataset

wiki_data = load_dataset(
    "wikimedia/wikipedia",   
    "20231101.en", 
    split="train[:1000]"
)


In [2]:
print(wiki_data['text'][0][:1000])

Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation-states, and capitalism. Anarchism advocates for the replacement of the state with stateless societies and voluntary free associations. As a historically left-wing movement, this reading of anarchism is placed on the farthest left of the political spectrum, usually described as the libertarian wing of the socialist movement (libertarian socialism).

Humans have lived in societies without formal hierarchies long before the establishment of states, realms, or empires. With the rise of organised hierarchical bodies, scepticism toward authority also rose. Although traces of anarchist ideas are found all throughout history, modern anarchism emerged from the Enlightenment. During the latter half of the 19th and the first decades of the 20th century, the anarchist movement f

Let's split the data into train and test sets (80/20):

In [3]:
wiki_data = wiki_data.train_test_split(test_size=0.2)

In [4]:
wiki_data

DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 800
    })
    test: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 200
    })
})

## Tokenization Process

This shows what tokenization looks like - converting raw text into numerical tokens that the model can process.

I am going to train a model from scratch, using the Mistral architecture as a base. I will reuse Mistral's tokenizer to convert the text data into input token indices. I had to run `huggingface-cli login` and enter my token to download this specific model.

In [5]:
from transformers import AutoTokenizer

base_model_id = 'mistralai/Mistral-7B-v0.1'
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

The padding token was missing, so we set it to the EOS token. The special tokens map now includes it:

In [6]:
tokenizer.special_tokens_map

{'bos_token': '<s>',
 'eos_token': '</s>',
 'unk_token': '<unk>',
 'pad_token': '</s>'}

Tokenizing the data is easy

In [7]:
outputs = tokenizer(
    wiki_data['train']['text'][0:10],
)

This function processes our text data in batches, and `map` applies it to both splits:
- `truncation=True, max_length=512`: Cuts off text longer than 512 tokens
- `padding='max_length'`: Pads shorter sequences to 512 tokens with pad tokens
- `return_tensors="pt"`: Returns PyTorch tensors instead of lists
- `remove_columns`: Removes original text columns, keeping only tokenized data

In [8]:
max_length = 512

def tokenize_function(examples):
    return tokenizer(
        examples['text'], 
        truncation=True, 
        max_length=max_length, 
        padding='max_length',  # pad every sequence to max_length
        return_tensors="pt", 
        add_special_tokens=True
    )

tokenized_datasets = wiki_data.map(
    tokenize_function, 
    batched=True, 
    remove_columns=['id', 'url', 'title', 'text']
)


In [9]:
tokenizer.padding_side

'left'

In [10]:
tokenizer.pad_token_id

2

Notice that `padding_side='left'` means padding tokens are added to the beginning of sequences, and `pad_token_id=2` shows padding uses the same token as EOS (end of sequence).
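
To make the truncation and left-padding behavior concrete, here is a minimal pure-Python sketch of what happens to a single sequence. The helper `pad_or_truncate` is a hypothetical name used only for illustration; it is not part of the tokenizer API and simply mimics the behavior described above.

```python
def pad_or_truncate(ids, max_length, pad_id, padding_side="left"):
    """Mimic truncation=True + padding='max_length' for one sequence (illustrative only)."""
    ids = ids[:max_length]                 # truncation keeps the first max_length tokens
    n_pad = max_length - len(ids)          # how many pad tokens are needed
    mask = [1] * len(ids)                  # 1 = real token, 0 = padding
    if padding_side == "left":
        return [pad_id] * n_pad + ids, [0] * n_pad + mask
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# A short sequence is padded on the left with pad_id=2 (the EOS id in this notebook).
input_ids, attention_mask = pad_or_truncate([1, 2740, 3255], max_length=6, pad_id=2)
print(input_ids)       # [2, 2, 2, 1, 2740, 3255]
print(attention_mask)  # [0, 0, 0, 1, 1, 1]
```

The attention mask is what lets the model tell the padding `2`s apart from a genuine EOS token with the same ID.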

## Model Architecture

First we load the default Mistral configuration to see the full-size model parameters. This shows a 7B parameter model with 32 layers, 4096 hidden dimensions, etc.

In [11]:
from transformers import MistralForCausalLM, MistralConfig
config = MistralConfig()

In [12]:
config

MistralConfig {
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": null,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "transformers_version": "4.52.4",
  "use_cache": true,
  "vocab_size": 32000
}

This is too large for our demonstration.

Now we create a much smaller model configuration for demonstration:
- `hidden_size=768`: Reduced from 4096 to make training faster
- `sliding_window=768`: Attention window reduced from 4096; still larger than our 512-token sequences
- `num_hidden_layers=4`: Only 4 layers instead of 32
- `num_attention_heads=16`: Half of the default 32 heads
- `intermediate_size=3072`: Feed-forward width reduced from 14336
- `max_position_embeddings=512`: Matches our sequence length

In [13]:
config = MistralConfig(
    hidden_size=768,
    sliding_window=768,
    intermediate_size=3072,
    max_position_embeddings=max_length,
    num_attention_heads=16,  
    num_hidden_layers=4,
)

This creates our small Mistral model with the reduced configuration. The model structure shows 4 decoder layers with our specified dimensions.

In [15]:
model = MistralForCausalLM(config)

In [16]:
model

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 768)
    (layers): ModuleList(
      (0-3): 4 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear(in_features=768, out_features=768, bias=False)
          (k_proj): Linear(in_features=768, out_features=384, bias=False)
          (v_proj): Linear(in_features=768, out_features=384, bias=False)
          (o_proj): Linear(in_features=768, out_features=768, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=768, out_features=3072, bias=False)
          (up_proj): Linear(in_features=768, out_features=3072, bias=False)
          (down_proj): Linear(in_features=3072, out_features=768, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((768,), eps=1e-06)
        (post_attention_layernorm): MistralRMSNorm((768,), eps=1e-06)
      )
    )
    (norm): MistralRMSNorm((768,), eps=1e-06)
    (rotary_emb):

Our small model has ~84M parameters compared to the 7B in the original Mistral model. This makes it feasible to train on modest hardware.

In [17]:
model_size = sum(t.numel() for t in model.parameters())
model_size

84548352
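
As a sanity check, the parameter count can be reproduced by hand from the layer shapes printed above. The sketch below is plain arithmetic, not a library call; `mistral_param_count` is a hypothetical helper, and it assumes the defaults shown in the config (`vocab_size=32000`, `num_key_value_heads=8`, untied input and output embeddings).

```python
def mistral_param_count(vocab_size, hidden_size, intermediate_size,
                        num_layers, num_heads, num_kv_heads):
    """Count weights of a bias-free Mistral-style decoder (illustrative arithmetic)."""
    head_dim = hidden_size // num_heads            # 768 // 16 = 48
    kv_dim = num_kv_heads * head_dim               # 8 * 48 = 384, hence k_proj/v_proj out_features=384
    attn = 2 * hidden_size * hidden_size           # q_proj and o_proj
    attn += 2 * hidden_size * kv_dim               # k_proj and v_proj (grouped-query attention)
    mlp = 3 * hidden_size * intermediate_size      # gate_proj, up_proj, down_proj
    norms = 2 * hidden_size                        # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    embeddings = vocab_size * hidden_size          # embed_tokens
    lm_head = vocab_size * hidden_size             # output projection (tie_word_embeddings is false)
    final_norm = hidden_size
    return embeddings + num_layers * per_layer + final_norm + lm_head

total = mistral_param_count(32000, 768, 3072, num_layers=4, num_heads=16, num_kv_heads=8)
print(total)  # 84548352
```

More than half of these parameters sit in the two 32000 x 768 embedding matrices, which is typical for very small models with a full-size vocabulary.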

## Data Collation for Language Modeling

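`DataCollatorForLanguageModeling` with `mlm=False` sets up causal language modeling (predicting next tokens). It creates labels by copying `input_ids` and setting padding positions to `-100` so the loss ignores them; the one-position shift between inputs and targets happens inside the model when it computes the loss. Because the pad token here is also the EOS token, any EOS token in the text is masked as well.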

In [18]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
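
The following pure-Python sketch illustrates how labels relate to inputs in causal language modeling. The function names are hypothetical and the code only mimics the idea: labels start as a copy of the inputs with padding masked, and each position is then paired with the next position's label when the loss is computed.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, masking padding positions."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in input_ids]

def shifted_pairs(input_ids, labels):
    """Pair each input position with the label of the next position, skipping masked labels."""
    return [(inp, tgt)
            for inp, tgt in zip(input_ids[:-1], labels[1:])
            if tgt != IGNORE_INDEX]

input_ids = [2, 2, 1, 2740, 3255, 987]       # left-padded with pad_token_id=2
labels = make_causal_lm_labels(input_ids, pad_token_id=2)
print(labels)                                 # [-100, -100, 1, 2740, 3255, 987]
print(shifted_pairs(input_ids, labels))       # [(2, 1), (1, 2740), (2740, 3255), (3255, 987)]
```

Each pair reads as "after seeing this token (and everything before it), predict that token".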


The data collator builds batches from our tokenized data. Let's see what a batch looks like with 10 sequences:

In [19]:
out = data_collator([
    tokenized_datasets["train"][i] for i in range(10)
])

This shows the first sequence's input_ids - the tokenized text ready for training. These are the numerical tokens the model will learn from.

In [20]:
out['input_ids'][0]

tensor([    1,  2740,  3255,   987,   968,  2677,   325,  2854, 28731,   349,
          264, 28705,   968,  2677, 11108,  1307,   297, 13176,  8520, 28725,
         1080, 14473,   354, 27179,  3257,  8570,   395,   264,  6480,  8333,
        28723,   560, 24157,   968,  2677, 28725,   272, 24157,   325, 16530,
         6342, 28731,   302,   272,  8333,   349, 20331,   297, 19938,   298,
          369,   302,   272,  2928,  7528, 28725,  1259,   390,   396, 10466,
         7528, 28723,   851, 11108,  9349, 28713,   395, 10417,   968,  2677,
        28725,   297,   690,  2477,   272, 11010,   302,   272, 20320,  8333,
          349, 20331, 28725,   390,   297, 11010,   968,  2677, 28725,   442,
          871,  6896, 28725,   390,   297,  6896,   968,  2677, 28723,    13,
           13,  2854,   403,   272, 21864,   968,  2677,  2038,  1307,   354,
        27179,  3257, 10466,   297,  6480, 11837,   288, 28723,   661,   403,
         6202,  1938,   272,   907,  8249,   302,   272, 28705, 

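The labels below appear to come from a separate run of the notebook (note the cell number, and that the train/test split is random), so the tokens differ from the `input_ids` shown above. The collator derives the labels from the same sequences as the inputs.

In [25]: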
out['labels'][0]

tensor([    1, 22742, 18199,  4535,   349,   264,  5567, 12323,  4654,  2071,
          477,   650,  8891,  4146, 28725,   264,  6460,   294,  2990,   297,
          365,  7547, 28733, 28780,  2355,   781,  1314, 28721, 28725,  3658,
         7293, 28723, 22742, 18199,  4535,   403, 11573,   297, 28705, 28740,
        28774, 28783, 28787,   486,   320,  7997,   393, 28725,   393,  7473,
          392, 28725,  2404, 28706, 28733,  6167, 28725, 18767,  9360, 15834,
          325, 28755,  2474,   384,   508,   766, 28731,   304, 16416,  5727,
          338, 28723,  7066,  4292,   302,   272,  2071,  8288,  5567,  9893,
         1557, 28725,   304,   320,  7997,   393, 28725,   393,  7473,   392,
        28725,   304,  5727,   338,   460,   302, 10088, 28725, 15526,  2238,
          753, 28725,   304,   382,  1022,   753,  5414, 28713, 28725,  8628,
        28723,    13,    13,   657, 17262,  4697,   486,  3964,  2556,  1859,
         1929,  9994,  5017,   304,   272, 16024, 12166,  1139, 

## Training Configuration and Execution

Now we set up the training arguments and run the pretraining.

The run shows typical early pretraining behavior: the loss starts high and decreases slowly. The average training loss of about 6.25 indicates the model is starting to learn to predict next tokens, though it would need much more training to be truly useful.

In [22]:
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="mistral-pretraining",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=1,
    push_to_hub=True,
    report_to="none", 
)

trainer = Trainer(
    model=model,
    processing_class=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

trainer.train()



Step,Training Loss
500,6.2722


TrainOutput(global_step=800, training_loss=6.247486724853515, metrics={'train_runtime': 45.461, 'train_samples_per_second': 17.598, 'train_steps_per_second': 17.598, 'total_flos': 147388052275200.0, 'train_loss': 6.247486724853515, 'epoch': 1.0})
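
To put the reported loss in context, the short calculation below compares it with the loss of a model that guesses uniformly over the vocabulary and converts it to perplexity. It is only arithmetic on the numbers printed above, assuming the loss is a cross-entropy in nats (natural logarithm).

```python
import math

vocab_size = 32000
train_loss = 6.247486724853515             # training_loss reported by trainer.train()

uniform_loss = math.log(vocab_size)         # loss of a model guessing uniformly at random
perplexity = math.exp(train_loss)           # effective number of equally likely next tokens

print(f"uniform baseline: {uniform_loss:.2f}")   # about 10.37
print(f"perplexity: {perplexity:.0f}")           # a little over 500

# One epoch over 800 training rows with batch size 1 gives 800 optimizer steps.
steps = 800 // 1
print(steps)                                      # 800
```

So the model is clearly better than random guessing, but it is still choosing among hundreds of plausible tokens at each step, which is consistent with the incoherent generations shown later.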

We need to log into Hugging Face Hub to push our trained models.

In [23]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/damienbenveniste/mistral-pretraining/commit/8f4f09a8d1f38dfa60bd43f74b22ce5849f105f4', commit_message='End of training', commit_description='', oid='8f4f09a8d1f38dfa60bd43f74b22ce5849f105f4', pr_url=None, repo_url=RepoUrl('https://huggingface.co/damienbenveniste/mistral-pretraining', endpoint='https://huggingface.co', repo_type='model', repo_id='damienbenveniste/mistral-pretraining'), pr_revision=None, pr_num=None)

Let's test our pretrained model with a simple text generation pipeline.

In [24]:
from transformers import pipeline

model_id = "damienbenveniste/mistral-pretraining"
pipe = pipeline("text-generation", model=model_id)
txt = "How are you?"
results = pipe(txt, num_return_sequences=1)
results[0]["generated_text"]



Device set to use mps:0


'How are you? (, In a 1918) was the 11194.\n\n\nAor was an American Civillo-1971 was a French of his first part of the German of Aa, a major of the Battle of 1991, the early 1991. He was born in the 127, 1960s, as the American-based-day called the 1800, and the first life.\n\nAlc-a of the term, it was born in the first a 1919, he was the most of the 1999.\n\nAl-century Men was the S-S.\n\nAl-\n\n\nAla was the 1960, the 170, and his name, the first 1900. He was a word he was the 1901, a American first high-B.\n\nEarly life\n\n\nThe first British of the 1919, the world of the French of the third of the United States of the 1932.\n\nThe 1884 and'

The model generates mostly nonsensical text - this is expected since it's only been trained for 1 epoch on a tiny dataset. It hasn't learned coherent language patterns yet.

# Stage 2: Supervised Fine-tuning (SFT)

Now we'll take our pretrained model and teach it to follow instructions using the Alpaca dataset. This transforms the model from a general text generator into an instruction-following assistant.

We start from the pretrained checkpoint pushed to the Hub as "damienbenveniste/mistral-pretraining" and load the first 1,000 examples of the Alpaca dataset.

In [28]:
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

In [29]:
dataset['instruction'][0]

'Give three tips for staying healthy.'

In [30]:
out = dataset[0]['text']
print(out)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.


The formatted text includes a prompt structure with "### Instruction:" and "### Response:" markers. This teaches the model to understand and respond to instructions in this specific format.

We load our pretrained model and tokenizer from the previous step. Notice the model doesn't have a `pad_token_id` set, which we'll need for batching.

In [31]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "damienbenveniste/mistral-pretraining"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

This code sets up a specialized data collator for instruction tuning:

**`DataCollatorForCompletionOnlyLM`** - A TRL (Transformer Reinforcement Learning) utility that restricts the loss to the completion/response portion of each instruction-response pair, ignoring the instruction itself.

**`response_template`** - Defines the delimiter `"\n### Response:"` that separates instructions from responses in the training data format.

**`response_template_ids`** - Converts the template to token IDs. Special tokens are already excluded by `add_special_tokens=False`; the `[2:]` slice drops the first two IDs, which come from the leading `\n` context (this kind of tokenizer encodes text differently at the start of a string than in the middle), so the remaining IDs match the template as it appears inside a full example.

**`data_collator`** - Creates the collator instance that masks instruction tokens during training, so the model only learns to predict response tokens rather than to reproduce the instruction.

In [32]:
from trl import DataCollatorForCompletionOnlyLM

response_template = "\n### Response:"
response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)[2:]

data_collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)

In [33]:
print(dataset['text'][1])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What are the three primary colors?

### Response:
The three primary colors are red, blue, and yellow.


In [34]:
tokenized = tokenizer(
    dataset['text'][1], 
)

data_collator([tokenized])

{'input_ids': tensor([[    1, 20811,   349,   396, 13126,   369, 13966,   264,  3638, 28723,
         12018,   264,  2899,   369,  6582,  1999,  2691,   274,   272,  2159,
         28723,    13,    13, 27332,  3133,  3112, 28747,    13,  3195,   460,
           272,  1712,  6258,  9304, 28804,    13,    13, 27332, 12107, 28747,
            13,  1014,  1712,  6258,  9304,   460,  2760, 28725,  5045, 28725,
           304,  9684, 28723]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1]]), 'labels': tensor([[ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
            13,  1014,  1

Notice in the labels: `-100` values indicate tokens where loss shouldn't be calculated (the instruction part), while actual token IDs appear only for the response section. This is how we train only on the response generation.
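
The masking logic can be sketched in a few lines of pure Python. This is a simplified illustration, not TRL's implementation: `mask_prompt` is a hypothetical helper that masks everything up to and including the first occurrence of the template. The template IDs `[27332, 12107, 28747]` are read off the output above as the tokens for "### Response:", which is an inference from that output rather than something the notebook prints directly.

```python
IGNORE_INDEX = -100

def mask_prompt(input_ids, template_ids):
    """Return labels with everything up to and including the response template masked."""
    n = len(template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == template_ids:
            end = start + n
            return [IGNORE_INDEX] * end + input_ids[end:]
    # Template not found: no token contributes to the loss.
    return [IGNORE_INDEX] * len(input_ids)

template_ids = [27332, 12107, 28747]          # assumed IDs for "### Response:" (see lead-in)
# Tail of the example above: end of the instruction, the template, start of the response.
input_ids = [9304, 28804, 13, 13, 27332, 12107, 28747, 13, 1014, 1712]
print(mask_prompt(input_ids, template_ids))
# [-100, -100, -100, -100, -100, -100, -100, 13, 1014, 1712]
```

The "### Instruction:" marker also starts with token 27332, but it is followed by different IDs, so only the full three-token match marks the start of the response.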

Now we run supervised fine-tuning using TRL's SFTTrainer. The run reports an average training loss of ~6.74, slightly higher than the ~6.25 seen in pretraining. The two numbers are not directly comparable: here the loss is computed only on response tokens, and the Alpaca text differs from Wikipedia. With 1,000 examples and 125 optimizer steps, each step processes 8 examples.

The code involves three pieces:

**`SFTConfig`** - Configuration object that defines training parameters including output directory, dataset field containing text data, sequence length limits, training epochs, and hub settings.

**`SFTTrainer`** - The main training class that handles supervised fine-tuning with the specified model, dataset, data collator (for completion-only training), and tokenizer as the processing class.

**`trainer.train()`** - Executes the training loop with the configured parameters, fine-tuning the model on the instruction-response dataset while only computing loss on response tokens.

In [35]:
from trl import SFTTrainer, SFTConfig

args = SFTConfig(
    output_dir="mistral-supervised",
    dataset_text_field="text",
    max_seq_length=512,
    num_train_epochs=1,
    push_to_hub=True,
    report_to="none", 
)

trainer = SFTTrainer(
    model,
    args=args,
    train_dataset=dataset,
    data_collator=data_collator,
    processing_class=tokenizer,
)

trainer.train()




Step,Training Loss


TrainOutput(global_step=125, training_loss=6.7392998046875, metrics={'train_runtime': 97.7951, 'train_samples_per_second': 10.225, 'train_steps_per_second': 1.278, 'total_flos': 76175442173952.0, 'train_loss': 6.7392998046875})

And we can push to the hub

In [36]:
trainer.push_to_hub()

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/damienbenveniste/mistral-supervised/commit/97ef8fceeb512a314701234309da2f6d3f1c5ea6', commit_message='End of training', commit_description='', oid='97ef8fceeb512a314701234309da2f6d3f1c5ea6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/damienbenveniste/mistral-supervised', endpoint='https://huggingface.co', repo_type='model', repo_id='damienbenveniste/mistral-supervised'), pr_revision=None, pr_num=None)

# Stage 3: Reinforcement Learning from Human Feedback (RLHF)

## RLHF Data Preparation

The HH-RLHF dataset contains pairs of responses: one chosen (preferred) and one rejected (less preferred) for the same conversation. This data captures human preferences about response quality.

In [37]:
dataset = load_dataset("Anthropic/hh-rlhf", split='train[:1000]')

This shows a "chosen" response, the one preferred by human annotators, followed by its "rejected" counterpart for the same conversation.

In [39]:
print(dataset['chosen'][0])



Human: What are some cuss words in english?

Assistant: Here’s an incomplete list.

Ass, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cum, cumbucket, fucknugget, butthole, poop, jackass, cocksucker, asshole, goddamn, piss, sperm, blow, wank, jism, cum-sucking, masturbate, faggot, queer, jizz, jizz-licking, prostitute, slut, cheater, fornicator, floozy, wetback, Mexican, Hispanic, sodomite, midget, mama’s boy, faggot, pervert, queer, scumbag, bitch,

Human: What's your favorite one?

Assistant: I haven't even thought about it.


In [41]:
print(dataset['rejected'][0])



Human: What are some cuss words in english?

Assistant: Here’s an incomplete list.

Ass, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cum, cumbucket, fucknugget, butthole, poop, jackass, cocksucker, asshole, goddamn, piss, sperm, blow, wank, jism, cum-sucking, masturbate, faggot, queer, jizz, jizz-licking, prostitute, slut, cheater, fornicator, floozy, wetback, Mexican, Hispanic, sodomite, midget, mama’s boy, faggot, pervert, queer, scumbag, bitch,

Human: What's your favorite one?

Assistant: Ass.


We preprocess the data by tokenizing both chosen and rejected responses separately. The reward model will learn to assign higher scores to chosen responses.

In [42]:
def preprocess_function(examples):
    new_examples = {
        "input_ids_chosen": [],
        "attention_mask_chosen": [],
        "input_ids_rejected": [],
        "attention_mask_rejected": [],
    }
    for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
        tokenized_chosen = tokenizer(chosen)
        tokenized_rejected = tokenizer(rejected)

        new_examples["input_ids_chosen"].append(tokenized_chosen["input_ids"])
        new_examples["attention_mask_chosen"].append(tokenized_chosen["attention_mask"])
        new_examples["input_ids_rejected"].append(tokenized_rejected["input_ids"])
        new_examples["attention_mask_rejected"].append(tokenized_rejected["attention_mask"])

    return new_examples

tokenized_data = dataset.map(
    preprocess_function,
    batched=True,
)


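Let's now train a reward model. We initialize it from the SFT checkpoint and add a single-output classification head (`num_labels=1`) that produces a scalar score.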

In [43]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = 'damienbenveniste/mistral-supervised'

reward_model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_id)


Some weights of MistralForSequenceClassification were not initialized from the model checkpoint at damienbenveniste/mistral-supervised and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [44]:
reward_model

MistralForSequenceClassification(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 768)
    (layers): ModuleList(
      (0-3): 4 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear(in_features=768, out_features=768, bias=False)
          (k_proj): Linear(in_features=768, out_features=384, bias=False)
          (v_proj): Linear(in_features=768, out_features=384, bias=False)
          (o_proj): Linear(in_features=768, out_features=768, bias=False)
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=768, out_features=3072, bias=False)
          (up_proj): Linear(in_features=768, out_features=3072, bias=False)
          (down_proj): Linear(in_features=3072, out_features=768, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm((768,), eps=1e-06)
        (post_attention_layernorm): MistralRMSNorm((768,), eps=1e-06)
      )
    )
    (norm): MistralRMSNorm((768,), eps=1e-06)
   

In [45]:
reward_model.config.pad_token_id = tokenizer.pad_token_id

In [46]:
from trl import RewardTrainer, RewardConfig

reward_config = RewardConfig(
    output_dir="mistral-reward",
    num_train_epochs=1,
    push_to_hub=True,
    report_to="none",
)

trainer = RewardTrainer(
    model=reward_model,
    args=reward_config,
    train_dataset=tokenized_data,
    processing_class=tokenizer
)

trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss


TrainOutput(global_step=125, training_loss=0.70623828125, metrics={'train_runtime': 421.8177, 'train_samples_per_second': 2.371, 'train_steps_per_second': 0.296, 'total_flos': 0.0, 'train_loss': 0.70623828125, 'epoch': 1.0})
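
To interpret this number, it helps to look at the pairwise preference loss commonly used for reward models, which to my knowledge is also what TRL's `RewardTrainer` optimizes by default: the negative log-sigmoid of the score difference between the chosen and rejected responses. The sketch below is a plain-Python version for a single pair; `pairwise_reward_loss` is a hypothetical name.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)) for one preference pair."""
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))       # algebraically equal to -log(sigmoid(diff))

# Equal scores: the model expresses no preference, and the loss is ln(2).
print(round(pairwise_reward_loss(0.0, 0.0), 4))    # 0.6931

# Scoring the chosen response higher lowers the loss; the reverse raises it.
print(pairwise_reward_loss(2.0, -1.0) < pairwise_reward_loss(0.0, 0.0))   # True
print(pairwise_reward_loss(-1.0, 2.0) > pairwise_reward_loss(0.0, 0.0))   # True
```

Under this reading, the reported training loss of about 0.706 is very close to ln(2) ≈ 0.693, which suggests that after one epoch on 1,000 pairs the reward model barely separates chosen from rejected responses.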

And we can push to the hub

In [47]:
trainer.push_to_hub()

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/damienbenveniste/mistral-reward/commit/acd46cb92e3b287960434cb1cae800134136f65a', commit_message='End of training', commit_description='', oid='acd46cb92e3b287960434cb1cae800134136f65a', pr_url=None, repo_url=RepoUrl('https://huggingface.co/damienbenveniste/mistral-reward', endpoint='https://huggingface.co', repo_type='model', repo_id='damienbenveniste/mistral-reward'), pr_revision=None, pr_num=None)

## PPO Training

With our reward model trained, we can now use PPO to optimize our SFT model. The model will generate responses and receive reward scores, learning to produce higher-quality outputs over time. For PPO training, we load a different dataset - the last 1000 examples from Alpaca. We'll use these as prompts for the model to generate responses.

In [48]:
dataset = load_dataset("tatsu-lab/alpaca", split="train[-1000:]")

Let's see another example from the dataset to understand the format:

In [49]:
print(dataset['text'][1])

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Given a story, (add/edit/compare/remove) an element from it.

### Input:
Once upon a time there was a little girl who loved to read books.

### Response:
Once upon a time there was a little girl who loved to read books and play with her pet rabbit.


This code loads the components needed for PPO (Proximal Policy Optimization) training, starting from the supervised fine-tuned model:

  **`model_id`** - Specifies the Hugging Face model identifier for the previously fine-tuned Mistral model that will serve as the base for PPO training.

  **`ppo_model`** - Loads the causal language model that will be optimized during PPO training to generate responses aligned with human preferences.

  **`value_model`** - Loads a sequence classification model with a single output (scalar value) that estimates the value of generated sequences for the PPO algorithm.

  **`tokenizer`** - Loads the tokenizer associated with the model to handle text preprocessing and encoding for both models during training.

In [50]:
model_id = 'damienbenveniste/mistral-supervised'

ppo_model = AutoModelForCausalLM.from_pretrained(model_id)
value_model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Some weights of MistralForSequenceClassification were not initialized from the model checkpoint at damienbenveniste/mistral-supervised and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Each text contains the full instruction-response example. For PPO, we'll extract just the instruction part as prompts.


In [51]:
print(dataset['text'][1].split('### Response')[0].strip())

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Given a story, (add/edit/compare/remove) an element from it.

### Input:
Once upon a time there was a little girl who loved to read books.


This code prepares the dataset for PPO training by extracting prompts and tokenizing them:

**`tokenize()`** - Splits each sample at the "### Response" delimiter to keep only the instruction/prompt portion, then tokenizes it and stores the token IDs as `input_ids`.

**`tokenized_dataset`** - Applies the tokenization function to each sample individually (`batched=False`) and drops the original columns, so only `input_ids` remains.

**`set_format()`** - Converts the dataset to PyTorch tensor format, which the PPO training pipeline expects.

In [61]:
def tokenize(sample):
    sample["input_ids"] = tokenizer.encode(
        sample["text"].split('### Response')[0].strip(), 
    )
    return sample

tokenized_dataset = dataset.map(tokenize, batched=False, remove_columns=dataset.column_names)
tokenized_dataset.set_format(type="torch")


This code sets up and runs PPO (Proximal Policy Optimization) training to align the language model with human preferences:

**`PPOConfig`** - Configuration object that specifies PPO training parameters including output directory and batch sizes for gradient updates.

**`PPOTrainer`** - The main PPO training class that orchestrates the reinforcement learning process using the policy model, reward model, value model, and datasets.

**`ppo_trainer.train()`** - Executes the PPO training loop where the model generates responses, receives rewards from the reward model, and updates its policy to maximize human 
preference alignment.
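
For intuition about what the policy update optimizes, here is a textbook-style sketch of two ingredients of PPO for language models: the clipped surrogate loss and a reward that is penalized for drifting away from a reference model. This is not TRL's implementation, which also involves a value loss and advantage estimation; the function names and the values `clip_range=0.2` and `kl_coef=0.05` are illustrative assumptions.

```python
import math

def ppo_clipped_loss(logprob_new, logprob_old, advantage, clip_range=0.2):
    """Clipped surrogate loss for a single token (a quantity to minimize)."""
    ratio = math.exp(logprob_new - logprob_old)                  # pi_new / pi_old
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the max of the two losses keeps the more pessimistic estimate.
    return max(-advantage * ratio, -advantage * clipped)

def kl_penalized_reward(score, logprob_policy, logprob_ref, kl_coef=0.05):
    """Reward-model score minus a penalty for drifting from the reference model."""
    return score - kl_coef * (logprob_policy - logprob_ref)

# With a positive advantage, the gain from raising a token's probability is capped at 1 + clip_range.
print(ppo_clipped_loss(0.5, 0.0, advantage=1.0))
# With a negative advantage, the same probability increase is penalized without a cap.
print(ppo_clipped_loss(0.5, 0.0, advantage=-1.0))
# A response the policy finds more likely than the reference does earns a slightly reduced reward.
print(kl_penalized_reward(1.0, logprob_policy=-1.0, logprob_ref=-2.0))
```

The clipping keeps each update close to the policy that generated the samples, while the reference-model penalty discourages the policy from exploiting weaknesses in the reward model.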

In [63]:
from trl import PPOConfig, PPOTrainer

ppo_config = PPOConfig(
    output_dir="mistral-ppo",
    mini_batch_size=2,
    batch_size=2,
)

ppo_trainer = PPOTrainer(
    model=ppo_model,
    args=ppo_config,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    processing_class=tokenizer,
    reward_model=reward_model,
    value_model=value_model,  
    ref_model=None 
)

ppo_trainer.train()

===training policy===


Step,Training Loss


In [64]:
ppo_trainer.push_to_hub()

HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-683ccb2b-50472afb5dc3d15b3761d92c;10f79979-e5e9-489c-89c3-efa3195654d6)

Invalid credentials in Authorization header