# Generative Supervised Fine-tuning of GPT-2

Now that our GPT-2 model is trained, we need a way to get it to generate what we want.

In the following notebook, we're going to use an approach called "Supervised Fine-tuning" to achieve our goals today.

In essence, we're going to treat each example as a self-contained unit (with the option of something called "packing"), which lets us build "labeled" data.

For this notebook, we're going to be flying quite high up in the levels of abstraction. Take extra care to look into the libraries we're using today!

Let's start by grabbing our dependencies, as always:



In [None]:
!pip install transformers accelerate datasets trl bitsandbytes -qU

## Dataset Curation

We're going to be fine-tuning our model on SQL generation today.

First thing we'll need is a dataset to train on!

We'll use [this](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset today!

First up, let's load it and take a look at what we've got.

- [`load_dataset`](https://huggingface.co/docs/datasets/loading)

In [None]:
from datasets import load_dataset

sql_dataset = load_dataset("b-mc2/sql-create-context")

In [None]:
sql_dataset

DatasetDict({
    train: Dataset({
        features: ['answer', 'context', 'question'],
        num_rows: 78577
    })
})

In [None]:
### Display a Sample Row
print(sql_dataset['train'][0])

{'answer': 'SELECT COUNT(*) FROM head WHERE age > 56', 'context': 'CREATE TABLE head (age INTEGER)', 'question': 'How many heads of the departments are older than 56 ?'}


So, we've got ~78.5K rows of:

- question - a natural language query about the data in the table
- context - the `CREATE TABLE` statement - which gives us important context about the table
- answer - a SQL query that is aligned with both the question and the context.

Let's split our data into `train`, `val`, and `test` datasets.

We can use our `train` and `val` sets to train and evaluate our model during training - and our `test` set to ultimately benchmark the generations of our model!

In [None]:
from datasets import DatasetDict

# Split into 80% train and 20% held out for validation and test
train_test_dataset = sql_dataset['train'].train_test_split(test_size=0.2, seed=42)
# Split the held-out 20% in half: 10% validation, 10% test
test_valid = train_test_dataset['test'].train_test_split(test_size=0.5, seed=42)
# Back to DatasetDict
split_sql_dataset = DatasetDict({
    'train': train_test_dataset['train'],
    'val': test_valid['train'],
    'test': test_valid['test']})

In [None]:
split_sql_dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'context', 'answer'],
        num_rows: 62861
    })
    val: Dataset({
        features: ['question', 'context', 'answer'],
        num_rows: 7858
    })
    test: Dataset({
        features: ['question', 'context', 'answer'],
        num_rows: 7858
    })
})
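The row counts above can be reproduced with a little arithmetic. This is a minimal sketch, assuming the splitter rounds the held-out share up to a whole row and gives the remainder to the training side (which is consistent with the counts shown above). `split_sizes` is a hypothetical helper written for this illustration, not part of `datasets`.

```python
import math

def split_sizes(num_rows, test_fraction):
    """Return (n_train, n_test) for a single train/test split.

    Assumption: the held-out share is rounded up and the rest goes to train.
    """
    n_test = math.ceil(num_rows * test_fraction)
    return num_rows - n_test, n_test

total_rows = 78577  # rows in the original `train` split

# First split: hold out 20% for validation + test.
n_train, n_holdout = split_sizes(total_rows, 0.2)

# Second split: cut the held-out rows in half.
n_val, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_val, n_test)
```

So we end up with roughly an 80/10/10 split.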

### Creating a "Prompt"

Now we need to create a prompt that lets us interact with our model when we want the trained behaviour.

Think of this as a pattern that aligns the model with our desired outputs.

We need a single text prompt, as that is what the `SFTTrainer` we're going to use to fine-tune our model expects.

The basic idea is that we're going to merge the `question`, `context`, and `answer` into a single block of text that shows the model our desired outputs.

Let's look at what that block needs to look like:

```
{bos_token}### Instruction:
{system_message}

### Input:
{input}

### Context:
{context}

### Response:
{response}{eos_token}
```

Let's look at that from a completed prompt perspective to get a bit more information:

```
<|startoftext|>### Instruction:
You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
You must output the SQL query that answers the question.

### Input:
How many locations did the team play at on week 7?

### Context:
CREATE TABLE table_24123547_2 (location VARCHAR, week VARCHAR)

### Response:
SELECT COUNT(location) FROM table_24123547_2 WHERE week = 7<|endoftext|>
```

As you can see, our prompt contains completed examples of our task. We're going to show our model many of these examples over and over again to teach it to produce outputs that are aligned with our goals!

First step, let's create a template that we can call `.format()` on while constructing our prompts.

In [None]:
## CREATE A FORMATTABLE PROMPT TEMPLATE
TEXT2SQL_TRAINING_PROMPT_TEMPLATE = """\
{bos_token}### Instruction:
{system_message}

### Input:
{input}

### Context:
{context}

### Response:
{response}{eos_token}
"""

In [None]:
TEXT2SQL_INFERENCE_PROMPT_TEMPLATE = """\
{bos_token}### Instruction:
{system_message}

### Input:
{input}

### Context:
{context}

### Response:
"""

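To see how the two templates relate, here's a small self-contained sketch. It re-declares both templates under shorter, hypothetical names and uses a shortened system message, so treat it as an illustration rather than the notebook's actual cells. The point is that the inference prompt is exactly the training text with the response (and eos token) cut off.

```python
TRAINING_TEMPLATE = (
    "{bos_token}### Instruction:\n{system_message}\n\n"
    "### Input:\n{input}\n\n"
    "### Context:\n{context}\n\n"
    "### Response:\n{response}{eos_token}\n"
)
INFERENCE_TEMPLATE = (
    "{bos_token}### Instruction:\n{system_message}\n\n"
    "### Input:\n{input}\n\n"
    "### Context:\n{context}\n\n"
    "### Response:\n"
)

# Shared fields; the system message is shortened for this illustration.
fields = {
    "bos_token": "<|startoftext|>",
    "system_message": "You must output the SQL query that answers the question.",
    "input": "How many heads of the departments are older than 56 ?",
    "context": "CREATE TABLE head (age INTEGER)",
}

training_text = TRAINING_TEMPLATE.format(
    response="SELECT COUNT(*) FROM head WHERE age > 56",
    eos_token="<|endoftext|>",
    **fields,
)
inference_text = INFERENCE_TEMPLATE.format(**fields)

# The inference prompt stops exactly where the model should start writing.
completion = training_text[len(inference_text):]
print(repr(completion))
```

During training the model sees the whole block, and at inference time we hand it everything up to `### Response:` and let it fill in the rest.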
Now let's create a function we can map over our dataset to create the full prompt text block.

In [None]:
def create_sql_prompt(sample):
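  # No placeholders here, so a plain string is enough. Note that the second line's
  # leading spaces are part of the string and end up in the prompt.
  SYSTEM_MESSAGE = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
  You must output the SQL query that answers the question."""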

  full_prompt = TEXT2SQL_TRAINING_PROMPT_TEMPLATE.format(
      bos_token = "<|startoftext|>", ### YOUR CODE HERE
      eos_token = "<|endoftext|>", ### YOUR CODE HERE
      system_message = SYSTEM_MESSAGE, ### YOUR CODE HERE
      input = sample["question"], ### YOUR CODE HERE
      context = sample["context"], ### YOUR CODE HERE
      response = sample["answer"], ### YOUR CODE HERE
  )

  return {"text" : full_prompt}

#### Helper Function Begin.

I've created this helper function so we can see how our model is doing by looking at its outputs, rather than only through metrics.

In [None]:
def create_sql_prompt_and_response(sample):
  # Same system message as in the training prompt, including the indented second line.
  SYSTEM_MESSAGE = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.
  You must output the SQL query that answers the question."""

  full_prompt = TEXT2SQL_INFERENCE_PROMPT_TEMPLATE.format(
      bos_token = "<|startoftext|>",
      system_message = SYSTEM_MESSAGE,
      input = sample["question"],
      context = sample["context"]
  )

  ground_truth = sample["answer"]

  return {"full_prompt" : full_prompt, "ground_truth" : ground_truth}

#### Helper Function End.

Let's look at an example of a formatted prompt.

In [None]:
create_sql_prompt(split_sql_dataset["train"][0])

{'text': '<|startoftext|>### Instruction:\nYou are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.\n  You must output the SQL query that answers the question.\n\n### Input:\nWhich Perth has Gold Coast yes, Sydney yes, Melbourne yes, and Adelaide yes?\n\n### Context:\nCREATE TABLE table_name_56 (perth VARCHAR, adelaide VARCHAR, melbourne VARCHAR, gold_coast VARCHAR, sydney VARCHAR)\n\n### Response:\nSELECT perth FROM table_name_56 WHERE gold_coast = "yes" AND sydney = "yes" AND melbourne = "yes" AND adelaide = "yes"<|endoftext|>\n'}

Great!

Now we can map this over our dataset!

- [`DatasetDict.map()`](https://huggingface.co/docs/datasets/process#map)

In [None]:
split_sql_dataset = split_sql_dataset.map(create_sql_prompt) ### YOUR CODE HERE

## Load the Model And Preprocessing

Now for the moment we've all been waiting for...

Loading our model!

Let's use the `AutoModelForCausalLM` and `AutoTokenizer` classes from `transformers` to see just how easy this is.

- [`AutoModelForCausalLM`](https://huggingface.co/docs/transformers/v4.35.0/en/model_doc/auto#transformers.AutoModelForCausalLM)
- [`AutoTokenizer`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer)
- [GPT-2 Model Card](https://huggingface.co/gpt2)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id ="gpt2" ### YOUR MODEL ID HERE

gpt2_base_model = AutoModelForCausalLM.from_pretrained(model_id) ### YOUR MODEL ID HERE ### YOUR CODE HERE

gpt2_tokenizer = AutoTokenizer.from_pretrained(model_id) ### YOUR CODE HERE

We need to make sure our tokenizer has a `pad_token` in order to be able to pad sequences so they're all the same length.

We'll use a little trick here to set our padding token to our eos (end of sequence) token to make training go a little smoother.

In [None]:
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token
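As a rough picture of what padding does, here's a minimal sketch in plain Python, independent of `transformers`. The token ids are made up for illustration, except `50256`, which is GPT-2's eos token id. `pad_batch` is a hypothetical helper, not a library function.

```python
PAD_ID = 50256  # GPT-2's eos token id, reused here as the pad id

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Made-up token ids, for illustration only.
batch = [[10, 11, 12, 13], [20, 21]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

The attention mask is what lets the model tell the real tokens apart from the filler, even though the filler reuses the eos id.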

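We also need to make sure the model's token embeddings are resized to match the tokenizer's vocabulary. If the two were out of sync, we'd face a shape error while training! Since we reused the existing eos token rather than adding a new one, the vocabulary size stays at 50257.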

In [None]:
gpt2_base_model.resize_token_embeddings(len(gpt2_tokenizer))

Embedding(50257, 768)

Now let's use the Hugging Face `pipeline` to see what generation looks like for our untrained model.

In [None]:
from transformers import pipeline, set_seed, GenerationConfig
generator = pipeline('text-generation', model=gpt2_base_model, tokenizer=gpt2_tokenizer)
set_seed(42)

def generate_sample(sample):
  prompt_package = create_sql_prompt_and_response(sample)

  generation_config = GenerationConfig(
      max_new_tokens=50,
      do_sample=True,
      top_k=50,
      temperature=1e-4,
      eos_token_id=gpt2_base_model.config.eos_token_id,
  )

  generation = generator(prompt_package["full_prompt"], generation_config=generation_config)
  print("---------------")
  print("Model Response:")
  print(generation[0]["generated_text"].replace(prompt_package["full_prompt"], ""))
  print("+++++++++++++++")
  print("Ground Truth")
  print(prompt_package["ground_truth"])
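A quick aside on the `temperature=1e-4` setting above: with a temperature that small, sampling behaves almost like always picking the most likely token. Here's a minimal sketch of temperature-scaled softmax in plain Python; the logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens

spread = softmax_with_temperature(logits, 1.0)
near_greedy = softmax_with_temperature(logits, 1e-4)

print([round(p, 3) for p in spread])
print([round(p, 3) for p in near_greedy])
```

At temperature `1.0` the probability mass is shared between the candidates, while at `1e-4` practically all of it lands on the top-scoring token.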

In [None]:
print(split_sql_dataset["test"][0])

In [None]:
generate_sample(split_sql_dataset["test"][0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------
Model Response:

SELECT * FROM table_name_80 WHERE year = '1957';

### Output:

SELECT * FROM table_name_80 WHERE year = '1957';

### Context:

SELECT * FROM table_
+++++++++++++++
Ground Truth
SELECT qual FROM table_name_80 WHERE laps = 200 AND year = "1957"


## Training the Model

Now that we have our model set up, our tokenizer set up, we can finally begin training!

Let's look at our Trainer, and set some hyper-parameters:

- `per_device_train_batch_size` - the batch size on each device, which is how this setting accommodates distributed training - a default we could use is `4`
- `gradient_accumulation_steps` - this is exactly the same as the previous notebook, it's a way to "simulate" a larger batch size by accumulating gradients from scaled losses over multiple iterations before taking a single optimizer step - a default we could use is `4`
- `gradient_checkpointing` - I'll let the authors speak for themselves [here](https://github.com/cybertronai/gradient-checkpointing). In essence: This saves memory at the cost of computational time. - let's set this to `True`
- `max_grad_norm` - this is the value used for gradient clipping, which is a method of keeping exploding gradients in check - let's use `0.3`
- `max_steps` - how many steps will we train for? - this is up to you
- `learning_rate` - how fast should we learn? - let's use `2e-4`
- `save_total_limit` - how many checkpoints of the model will we keep? - a value of `3` should work well
- `logging_steps` - how often we should log - up to you
- `output_dir` - where to save our checkpoints - up to you
- `optim` - which optimizer to use, you'll notice we're using a full precision paged optimizer - this is a performant and stable optimizer - but it uses extra memory - we should use `paged_adamw_32bit`
- `lr_scheduler_type` - we are once again using a cosine scheduler! - we should use `cosine`
- `evaluation_strategy` - we have an evaluation dataset, this defines when we should leverage it during training - we should use `steps`
- `eval_steps` - how many training steps to take between evaluations - up to you
- `warmup_ratio` - the fraction of our `max_steps` spent "warming up" to the full learning rate before we start decaying - a value of `0.3` should work (the cell below uses `0.1`)

- [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.15.0/main_classes/trainer#transformers.TrainingArguments)
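To get a feel for what these numbers mean together, here's a small back-of-the-envelope sketch. It uses the values from the cell below and assumes a single GPU; the variable names simply mirror the argument names.

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1          # assumption: a single GPU
max_steps = 500
warmup_ratio = 0.1
train_rows = 62861       # size of our training split

# Examples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Examples seen over the whole run, and the share of one epoch that represents.
examples_seen = effective_batch_size * max_steps
epochs = examples_seen / train_rows

# Optimizer steps spent ramping the learning rate up.
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(effective_batch_size, examples_seen, round(epochs, 2), warmup_steps)
```

In other words, 500 steps only covers a fraction of the training set, which lines up with the `epoch` value the trainer reports at the end of the run.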

In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir= './model',
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 4,
    gradient_checkpointing = True,
    max_grad_norm = 0.3,
    max_steps = 500,
    learning_rate = 2e-4,
    save_total_limit = 3,
    logging_steps = 20,
    optim = "paged_adamw_32bit",
    lr_scheduler_type = "cosine",
    evaluation_strategy = "steps",
    eval_steps = 50,
    warmup_ratio = 0.1,  # alternative: 0.3
)
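Here's a sketch of the learning rate curve these settings describe: a linear warmup followed by a half-cosine decay. This shows the usual shape of such a schedule under the values above (50 warmup steps out of 500), and the exact schedule the trainer uses may differ in details. `lr_at_step` is a hypothetical helper.

```python
import math

def lr_at_step(step, base_lr=2e-4, max_steps=500, warmup_steps=50):
    """Linear warmup to base_lr, then half-cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 25, 50, 275, 500):
    print(step, f"{lr_at_step(step):.2e}")
```

The learning rate climbs to its peak at the end of warmup, is back down to half of it midway through the decay, and reaches roughly zero at the last step.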



Now, for our `SFTTrainer` AKA "Where the magic happens".

This `SFTTrainer` is going to take our above training arguments, our data, our model and our tokenizer, and train it all for us!

Notice that we're setting `max_seq_length` to the maximum context window of our model - this ensures no example exceeds what the model can handle, since longer examples are truncated to this length!

#### ❓QUESTION❓

What is the maximum input sequence length for GPT-2?

*It's 1,024 tokens.*

In [None]:
trainer = SFTTrainer(
 gpt2_base_model,
 dataset_text_field="text",
 train_dataset=split_sql_dataset["train"],
 eval_dataset=split_sql_dataset["val"],
 tokenizer=gpt2_tokenizer,
 max_seq_length= 1024, ### maximum context size of your model
 args=training_args
)
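The trainer call above doesn't turn on packing, but since it came up at the top of the notebook, here's a rough sketch of the idea in plain Python: join the tokenized examples into one stream (with an eos token after each) and cut it into fixed-size blocks. The token ids are made up and `pack_examples` is a hypothetical helper; the real implementation in `trl` may differ in its details.

```python
EOS_ID = 50256  # GPT-2's eos token id, used as the separator between examples

def pack_examples(tokenized_examples, block_size, eos_id=EOS_ID):
    """Concatenate examples (each followed by eos) and cut fixed-size blocks.

    Leftover tokens that do not fill a final block are dropped in this sketch.
    """
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(eos_id)
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Made-up token ids; a real block_size would be 1024 for GPT-2.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_examples(examples, block_size=4)
print(blocks)
```

The upside is that no compute is spent on padding; the trade-off is that one block can contain pieces of several examples.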

Finally, we can call our `.train()` method and watch it go!

In [None]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


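| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50 | 0.8834 | 0.689868 |
| 100 | 0.6686 | 0.629298 |
| 150 | 0.6338 | 0.597837 |
| 200 | 0.6071 | 0.575745 |
| 250 | 0.5728 | 0.561176 |
| 300 | 0.5633 | 0.549169 |
| 350 | 0.5619 | 0.540825 |
| 400 | 0.5433 | 0.535196 |
| 450 | 0.5372 | 0.532705 |
| 500 | 0.5518 | 0.531909 |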


TrainOutput(global_step=500, training_loss=0.693628791809082, metrics={'train_runtime': 659.9374, 'train_samples_per_second': 12.122, 'train_steps_per_second': 0.758, 'total_flos': 656110416384000.0, 'train_loss': 0.693628791809082, 'epoch': 0.13})
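One common way to read these loss values is to convert them to perplexity, which is the exponential of the cross-entropy loss. This is a minimal sketch, assuming the reported loss is a mean per-token cross-entropy in nats; the three rows are copied from the log above.

```python
import math

# (step, validation loss) pairs copied from the training log above.
validation_log = [(50, 0.689868), (250, 0.561176), (500, 0.531909)]

for step, loss in validation_log:
    perplexity = math.exp(loss)  # cross-entropy in nats -> perplexity
    print(step, round(perplexity, 3))
```

Lower is better here too: a perplexity close to 1 means the model is rarely surprised by the next token in our formatted examples.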

Let's save our fine-tuned model!

In [None]:
trainer.save_model()

## Testing our Model

Now that we have a fine-tuned model, let's see how it did.

In [None]:
ft_gpt2_model = AutoModelForCausalLM.from_pretrained("model")

In [None]:
generator = pipeline('text-generation', model=ft_gpt2_model, tokenizer=gpt2_tokenizer)

In [None]:
generate_sample(split_sql_dataset["test"][0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


---------------
Model Response:
SELECT score FROM table_name_72 WHERE visitor = "atlanta" AND visitor = "atlanta" AND visitor = "atlanta" AND visitor = "atlanta" AND visitor = "atlanta" AND visitor = "atl
+++++++++++++++
Ground Truth
SELECT score FROM table_name_72 WHERE visitor = "atlanta"


That is *significantly* better.

#### ❓QUESTION❓

What methods could we use to validate our SQL outputs?

*-----------------------------------------------------------------------------*

One option is to use the `sqlvalidator` library to check that the `SELECT` statement in the response is well formed. Another is to write a simple validator of your own that searches for the `SELECT` and `FROM` reserved words and for the table name given in the context.
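A third option, different from the two above, is to lean on Python's built-in `sqlite3` module: build the table from the context in a throwaway in-memory database and ask SQLite to plan the query. This is only a sketch with a hypothetical helper name. It checks the query against SQLite's dialect, so it tells us the query parses and refers to tables and columns that exist, not that it answers the question.

```python
import sqlite3

def is_executable_sql(context, query):
    """Return True if `query` compiles against the schema in `context`.

    The schema is created in a throwaway in-memory SQLite database, and the
    query is only planned (EXPLAIN), never run against real data.
    """
    connection = sqlite3.connect(":memory:")
    try:
        connection.executescript(context)
        connection.execute(f"EXPLAIN {query}")
        return True
    except sqlite3.Error:
        return False
    finally:
        connection.close()

context = "CREATE TABLE head (age INTEGER)"
print(is_executable_sql(context, "SELECT COUNT(*) FROM head WHERE age > 56"))
print(is_executable_sql(context, "SELECT * FROM table_"))
```

Running this over the test split would give a rough "share of generations that compile" number to sit alongside the loss.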

#### ❓QUESTION❓

How would you extend this notebook to another use-case?

*------------------------------------------------------------------------------*

We can search for a dataset in instruction-input-answer format, like Alpaca or similar, then adapt our prompt to the new task and maybe adjust some parameters. After that, we run the training and test the result.

For example, we can use a dataset like `iamtarun/python_code_instructions_18k_alpaca` to train our model for python code generation.

Or maybe use the dataset `qwedsacf/grade-school-math-instructions` and try to solve math problems.
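As a sketch of what adapting the prompt could look like, here's a formatting function for Alpaca-style records. The field names `instruction`, `input`, and `output` are an assumption about the dataset's format, so check the dataset card before relying on them. The template and function names are hypothetical too.

```python
ALPACA_STYLE_TEMPLATE = """\
{bos_token}### Instruction:
{instruction}

### Input:
{input}

### Response:
{response}{eos_token}
"""

def create_instruction_prompt(sample):
    """Build a single training text from an instruction/input/output record."""
    text = ALPACA_STYLE_TEMPLATE.format(
        bos_token="<|startoftext|>",
        eos_token="<|endoftext|>",
        instruction=sample["instruction"],
        input=sample["input"],
        response=sample["output"],
    )
    return {"text": text}

# A made-up record in the assumed format.
sample = {
    "instruction": "Write a function that adds two numbers.",
    "input": "",
    "output": "def add(a, b):\n    return a + b",
}
formatted = create_instruction_prompt(sample)
print(formatted["text"])
```

Because it returns a dictionary with a `text` key, it can be mapped over a dataset and fed to the trainer the same way `create_sql_prompt` was.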