In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from torch.utils.data import Dataset
# use the model
from transformers import pipeline, set_seed
import wandb





## Let's Tokenize

***Model***: `GPT2LMHeadModel` is the GPT-2 transformer with a language modeling head on top (a linear layer whose weights are tied to the input embeddings).

***Tokenizer***: A tokenizer is in charge of preparing the inputs for a model.

***PreTrainedTokenizer*** and ***PreTrainedTokenizerFast*** implement the main methods shared by all tokenizers:

- Tokenizing (splitting strings into sub-word token strings), converting token strings to ids and back, and encoding/decoding (i.e., tokenizing and converting to integers).
- Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…).
- Managing special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization. 
    
See the [tokenizer documentation](https://huggingface.co/docs/transformers/v4.37.2/en/main_classes/tokenizer#transformers.PreTrainedTokenizer) for details.
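The three responsibilities listed above can be illustrated with a toy tokenizer. This is a minimal sketch, not the Hugging Face implementation: `ToyTokenizer`, its whitespace splitting, and its tiny vocabulary are made up for illustration, and only the method names are borrowed from the real API.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer, for illustration only."""

    def __init__(self, vocab, unk_token="<unk>"):
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.unk_token = unk_token
        # The unknown token is a special token: make sure it has an id.
        self.add_tokens([unk_token])

    def add_tokens(self, new_tokens):
        """Append unseen tokens to the vocabulary; return how many were added."""
        added = 0
        for tok in new_tokens:
            if tok not in self.token_to_id:
                self.token_to_id[tok] = len(self.token_to_id)
                added += 1
        return added

    def tokenize(self, text):
        """Split a string into token strings (here: on whitespace)."""
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        """Map token strings to integer ids, falling back to the unknown id."""
        unk_id = self.token_to_id[self.unk_token]
        return [self.token_to_id.get(tok, unk_id) for tok in tokens]

    def encode(self, text):
        """Tokenize and convert to ids in one step."""
        return self.convert_tokens_to_ids(self.tokenize(text))

    def decode(self, ids):
        """Map ids back to a string."""
        id_to_token = {i: tok for tok, i in self.token_to_id.items()}
        return " ".join(id_to_token[i] for i in ids)


tok = ToyTokenizer(["hear", "me", "speak"])
ids = tok.encode("hear me speak")
print(ids)                          # [0, 1, 2]
print(tok.decode(ids))              # hear me speak
print(tok.encode("hear me roar"))   # [0, 1, 3]: "roar" maps to <unk>
```

A word-splitting tokenizer like this one loses unseen words to `<unk>`; sub-word schemes such as BPE avoid that.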


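The pre-trained GPT-2 checkpoints come in four sizes:

- GPT-2 Small ('gpt2'): 124 million parameters.
- GPT-2 Medium ('gpt2-medium'): 355 million parameters (often cited as 345 million, the figure from the original release).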
- GPT-2 Large ('gpt2-large'): 774 million parameters.
- GPT-2 XL ('gpt2-xl'): 1.5 billion parameters.
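These sizes can be roughly reproduced from the architecture alone. The sketch below assumes the published GPT-2 configurations (vocabulary of 50257 tokens, 1024 positions, and the layer/width pairs shown in `SIZES`) and counts only trainable weights; `gpt2_param_count` is a helper written for this note, not a library function.

```python
VOCAB_SIZE = 50257   # GPT-2 BPE vocabulary size
N_POSITIONS = 1024   # maximum context length


def gpt2_param_count(n_layer, n_embd):
    """Count trainable parameters of a GPT-2 style model (assumed layout)."""
    d = n_embd
    embeddings = VOCAB_SIZE * d + N_POSITIONS * d   # token + position embeddings
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # expand to 4*d, then project back to d
    layer_norms = 2 * (2 * d)                       # two LayerNorms per block (weight + bias)
    final_layer_norm = 2 * d
    # The LM head shares its weights with the token embeddings, so it adds nothing.
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_layer_norm


# (n_layer, n_embd) for each checkpoint, assumed from the published configs
SIZES = {
    "gpt2": (12, 768),
    "gpt2-medium": (24, 1024),
    "gpt2-large": (36, 1280),
    "gpt2-xl": (48, 1600),
}

for name, (n_layer, n_embd) in SIZES.items():
    print(f"{name}: {gpt2_param_count(n_layer, n_embd) / 1e6:.0f}M")
```

Under these assumptions `gpt2` comes out at 124,439,808 parameters and `gpt2-medium` at about 355 million, slightly above the 345 million figure that is often quoted.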

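***Byte-Pair Encoding (BPE)*** vs ***Word-Level Encoding***

BPE splits text into sub-word units, so it can encode any string, including words it never saw during training. The trade-off is that an individual sub-word may carry little semantic meaning on its own.
Word-level encoding assigns one token per word, which keeps each token semantically meaningful, but it cannot represent unseen words (they typically collapse to a single unknown token) and it needs a much larger vocabulary.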
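To make the contrast concrete, here is a minimal sketch of BPE merge learning next to a word-level lookup. It is a toy: it starts from characters rather than bytes (GPT-2 uses byte-level BPE), and the helper names (`learn_bpe`, `bpe_encode`, `word_level_encode`) and the five-word corpus are made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)   # each word starts as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += freq
        corpus = merged_corpus
    return merges


def bpe_encode(word, merges):
    """Encode a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def word_level_encode(word, vocab, unk_token="<unk>"):
    """Word-level lookup: an unseen word collapses to the unknown token."""
    return [word] if word in vocab else [unk_token]


train_words = ["speak", "speak", "speaks", "speaking", "hear"]
merges = learn_bpe(train_words, num_merges=4)

print(bpe_encode("speaker", merges))                      # ['speak', 'e', 'r']
print(word_level_encode("speaker", set(train_words)))     # ['<unk>']
```

The unseen word "speaker" survives BPE as a known stem plus leftover characters, while the word-level scheme loses it entirely.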


In [2]:
# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)

tokenizer = GPT2Tokenizer.from_pretrained(model_name)


***Data collators*** are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset.

To be able to build batches, data collators may apply some processing (like padding). Some of them (like DataCollatorForLanguageModeling) also apply some random data augmentation (like random masking) on the formed batch.
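For a causal model such as GPT-2 (`mlm=False`), the collator's job reduces to padding and building labels. The following is a minimal sketch of that idea in plain Python, not the library code: `collate_causal_lm` and `IGNORE_INDEX` are names chosen for this note, and it marks padding by position, whereas the real collator works on tensors and identifies padding by token id.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def collate_causal_lm(examples, pad_token_id):
    """Pad a list of token-id lists to equal length and build causal-LM labels."""
    max_len = max(len(ex) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = max_len - len(ex)
        batch["input_ids"].append(ex + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ex) + [0] * n_pad)
        # Labels are a copy of the inputs; padded positions are ignored by the loss.
        batch["labels"].append(ex + [IGNORE_INDEX] * n_pad)
    return batch


batch = collate_causal_lm([[5, 6, 7], [8]], pad_token_id=0)
print(batch["input_ids"])   # [[5, 6, 7], [8, 0, 0]]
print(batch["labels"])      # [[5, 6, 7], [8, -100, -100]]
```

The labels are not shifted here: `GPT2LMHeadModel` shifts them by one position internally when computing the next-token loss.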

In [3]:
# Load your Shakespeare dataset
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="shakespeare_dataset.txt",
    block_size=128,
)

"""
tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) — The tokenizer used for encoding the data.

mlm (bool, optional, defaults to True) — Whether or not to use masked language modeling.
 If set to False, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100). 
 Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.

mlm_probability (float, optional, defaults to 0.15) — The probability with which to (randomly) mask tokens in the input, when mlm is set to True.

pad_to_multiple_of (int, optional) — If set will pad the sequence to a multiple of the provided value.

return_tensors (str) — The type of Tensor to return. Allowable values are “np”, “pt” and “tf”.
"""

# Create data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # No masked language modeling for GPT-2
)




In [4]:
dataset[0]

tensor([ 5962, 22307,    25,   198,  8421,   356,  5120,   597,  2252,    11,
         3285,   502,  2740,    13,   198,   198,  3237,    25,   198,  5248,
          461,    11,  2740,    13,   198,   198,  5962, 22307,    25,   198,
         1639,   389,   477, 12939,  2138,   284,  4656,   621,   284,  1145,
          680,    30,   198,   198,  3237,    25,   198,  4965,  5634,    13,
        12939,    13,   198,   198,  5962, 22307,    25,   198,  5962,    11,
          345,   760,   327,  1872,   385,  1526, 28599,   318,  4039,  4472,
          284,   262,   661,    13,   198,   198,  3237,    25,   198,  1135,
          760,   470,    11,   356,   760,   470,    13,   198,   198,  5962,
        22307,    25,   198,  5756,   514,  1494,   683,    11,   290,   356,
         1183,   423, 11676,   379,   674,   898,  2756,    13,   198,  3792,
          470,   257, 15593,    30,   198,   198,  3237,    25,   198,  2949,
          517,  3375,   319,   470,    26,  1309,   340,   307])

In [5]:
tokenizer.decode(dataset[0])

"First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you know Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us kill him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be"
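The example above is one block of 128 token ids, which is why the decoded text stops mid-sentence. A minimal sketch of how such fixed-size blocks can be cut from one long token stream follows; `chunk_into_blocks` is a hypothetical helper, and dropping the incomplete tail is an assumption about how the dataset class behaves.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a token stream into consecutive blocks, dropping the short tail."""
    return [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


stream = list(range(300))             # stand-in for the tokenized text file
blocks = chunk_into_blocks(stream, block_size=128)
print(len(blocks), len(blocks[0]))    # 2 128
```

Because blocks are cut purely by position, they ignore sentence and speaker boundaries; every block has the same length, so no padding is needed during training.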

For reference, here is an example notebook from Hugging Face on fine-tuning a model (for summarization, but the `Trainer` workflow is the same): [Notebook Link](https://github.com/huggingface/notebooks/blob/main/examples/summarization.ipynb)

In [6]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./fine-tuned-shakespeare",
    overwrite_output_dir=True,
    num_train_epochs=10,  # Adjust the number of epochs based on your needs
    per_device_train_batch_size=4,  # Adjust batch size based on GPU memory
    save_steps=10_000,  # Adjust save steps based on your needs
)


wandb.init(config=training_args)
# Log the model's gradients to Weights & Biases every 2 batches
wandb.watch(model, log_freq=2)


# Create Trainer and fine-tune the model
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: birkan. Use `wandb login --relogin` to force relogin


In [8]:
trainer.train()  # metrics are reported to Weights & Biases (wandb)


{'loss': 3.3508, 'learning_rate': 4.621212121212121e-05, 'epoch': 0.76}
{'loss': 3.2187, 'learning_rate': 4.242424242424243e-05, 'epoch': 1.52}
{'loss': 3.1049, 'learning_rate': 3.8636363636363636e-05, 'epoch': 2.27}
{'loss': 2.9811, 'learning_rate': 3.484848484848485e-05, 'epoch': 3.03}
{'loss': 2.8553, 'learning_rate': 3.106060606060606e-05, 'epoch': 3.79}
{'loss': 2.7834, 'learning_rate': 2.7272727272727273e-05, 'epoch': 4.55}
{'loss': 2.6924, 'learning_rate': 2.3484848484848487e-05, 'epoch': 5.3}
{'loss': 2.6635, 'learning_rate': 1.9696969696969697e-05, 'epoch': 6.06}
{'loss': 2.5593, 'learning_rate': 1.590909090909091e-05, 'epoch': 6.82}
{'loss': 2.5256, 'learning_rate': 1.2121212121212122e-05, 'epoch': 7.58}
{'loss': 2.4717, 'learning_rate': 8.333333333333334e-06, 'epoch': 8.33}
{'loss': 2.4566, 'learning_rate': 4.5454545454545455e-06, 'epoch': 9.09}
{'loss': 2.4181, 'learning_rate': 7.575757575757576e-07, 'epoch': 9.85}

100%|██████████| 6600/6600 [59:32<00:00,  1.85it/s]

{'train_runtime': 3572.4447, 'train_samples_per_second': 7.39, 'train_steps_per_second': 1.847, 'train_loss': 2.7702815061627013, 'epoch': 10.0}





TrainOutput(global_step=6600, training_loss=2.7702815061627013, metrics={'train_runtime': 3572.4447, 'train_samples_per_second': 7.39, 'train_steps_per_second': 1.847, 'train_loss': 2.7702815061627013, 'epoch': 10.0})
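The logged learning rates and epochs can be reproduced by hand, which is a useful sanity check on what the `Trainer` did. The sketch below assumes the defaults that were left untouched in `TrainingArguments`: a base learning rate of 5e-5, linear decay to zero with no warmup, and a log entry every 500 steps. `linear_lr` is a helper written for this note.

```python
import math

BASE_LR = 5e-5                         # assumed: default learning rate
TOTAL_STEPS = 6600                     # global_step reported in the output above
STEPS_PER_EPOCH = TOTAL_STEPS // 10    # 10 epochs were requested


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS):
    """Linear decay from base_lr at step 0 to zero at total_steps (no warmup)."""
    return base_lr * max(0, total_steps - step) / total_steps


for step in (500, 1000, 6500):
    epoch = round(step / STEPS_PER_EPOCH, 2)
    print(f"step={step} epoch={epoch} lr={linear_lr(step):.6e}")

# Perplexity is the exponential of the cross-entropy loss.
first_ppl = math.exp(3.3508)   # first logged loss
last_ppl = math.exp(2.4181)    # last logged loss
print(f"perplexity: {first_ppl:.1f} -> {last_ppl:.1f}")
```

Under these assumptions the values at steps 500 and 6500 line up with the first and last log entries, and the training perplexity falls from roughly 28.5 to roughly 11.2. Note that this is training loss only; no evaluation set was passed to the `Trainer`, so it says nothing about overfitting.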

In [9]:
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
    
set_seed(42)

response_model = generator("Before we proceed any further, hear me speak,", max_length=200, num_return_sequences=1)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [10]:
print(response_model[0]["generated_text"])

Before we proceed any further, hear me speak,
Myself having well begun and my mind's fury
Not so inclined to encounter and take
Exceeding pleasure in its own volition,
My throat was then so mute and open
As from a story, or my body so mute
I had not yet been instructed to speak.
For what satisfaction may I speak?
It cannot be expostulated
In my mind, for the end of my speech
Was not thought of.

DUKE OF YORK:
Come hither, villain!
My throat did use to play in your ear,
A toy most like fancy: but you mean,
My throat; and with one word did tell me
That words in mine ear mean nothing.

DUKE OF YORK:
If they mean nothing, then let me speak.

DUKE OF YORK:
How can you speak to me now, when I am prepared.
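The text differs from run to run unless the seed is fixed, because generation here samples the next token from the model's predicted distribution instead of always taking the most likely one. Below is a minimal sketch of temperature and top-k sampling over a made-up four-token vocabulary. `sample_next_token` and `toy_logits` are hypothetical, and this is not the pipeline's actual decoding code.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token index from raw logits with temperature and optional top-k."""
    scaled = [x / temperature for x in logits]
    # Rank token indices from most to least likely and keep the top_k best.
    candidates = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]
    # Unnormalized softmax weights (subtract the max for numerical stability).
    top = scaled[candidates[0]]
    weights = [math.exp(scaled[i] - top) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]   # made-up scores for a 4-token vocabulary

random.seed(42)
first = [sample_next_token(toy_logits, top_k=3) for _ in range(10)]
random.seed(42)
second = [sample_next_token(toy_logits, top_k=3) for _ in range(10)]
print(first == second)   # True: same seed, same samples
```

Fixing the seed plays the same role as `set_seed(42)` in the cell above: it makes a sampled generation repeatable without making it deterministic in the greedy sense.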



In [11]:
trainer.save_model("outputs/finetuned_shakespeare")
# Save tokenizer
tokenizer.save_pretrained("outputs/finetuned_shakespeare")

Non-default generation parameters: {'max_length': 50, 'do_sample': True}


('outputs/finetuned_shakespeare/tokenizer_config.json',
 'outputs/finetuned_shakespeare/special_tokens_map.json',
 'outputs/finetuned_shakespeare/vocab.json',
 'outputs/finetuned_shakespeare/merges.txt',
 'outputs/finetuned_shakespeare/added_tokens.json')

***Let's load the fine-tuned model and run inference to check that it was saved correctly***

In [13]:
loaded_model = GPT2LMHeadModel.from_pretrained("outputs/finetuned_shakespeare")
loaded_tokenizer = GPT2Tokenizer.from_pretrained("outputs/finetuned_shakespeare")

# Now you can use the loaded model and tokenizer as before
loaded_generator = pipeline('text-generation', model=loaded_model, tokenizer=loaded_tokenizer)

response_model = loaded_generator("Before we proceed any further, hear me speak,", max_length=100, num_return_sequences=1)
print(response_model[0]["generated_text"])



Before we proceed any further, hear me speak,
That we may with a simple mind hear you speak
Some other word or other.

Provost:
I say he is in the prison, sir.

DUKE VINCENTIO:
If you be not, sir, we have reason
To fear your voices: and therefore leave me to your good
Will.

Provost:
It may please your lordship, sir, to have them,
