# 💬 GPT for Instruction Following
This notebook walks through fine-tuning a GPT model to follow instructions, using an instruction dataset that covers varied tasks. It covers data preprocessing, model training, and text generation.

## 📦 Setups and Imports

In [1]:
%%capture
!pip install datasets
!pip uninstall wandb -y

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

## 📊 Data Exploration and Preparation
Loading the dataset and previewing ten random rows to understand the kind of data we're working with.

In [3]:
dataset = load_dataset('hakurei/open-instruct-v1',
                       split='train')
dataset.to_pandas().sample(10)

Generating train split:   0%|          | 0/498813 [00:00<?, ? examples/s]

| index | output | input | instruction |
|---|---|---|---|
| 396880 | class Employee(object):\n def __init__(self... | | Design a class that represents the concept of ... |
| 203751 | function insertData(arr) {\n let sqlQuery = "I... | | Use the following function and SQL query and w... |
| 321318 | The narrator's primary motivation is to experi... | Whenever I walked along the pier, I couldn't r... | Based on the given text, identify the narrator... |
| 317773 | The baker combines the ingredients, places the... | The ingredients are combined by the baker. The... | Rewrite the following paragraph using only act... |
| 86265 | Proper noun | | Detect if the word is a proper noun.\n\nWord: ... |
| 277091 | In the past decade, technology has had a profo... | | Describe the impact of technology on communica... |
| 319838 | As a history professor, I would say that one o... | | Imagine you are a history professor, and a stu... |
| 272972 | The values from Set A that are not common to S... | Set A: 1, 4, 6, 9, 11\nSet B: 3, 7, 9, 11 | Given two sets of data, identify the values fr... |
| 73071 | Computer = Human Brain | | What is an analogy and what is its purpose?\nI... |
| 399671 | class Solution(object):\n def createQueue(s... | | How do you implement a queue using two stacks? |


## 🔀 Shuffling and Splitting the Dataset
Merging the `instruction`, `input`, and `output` fields into a single `prompt` string, then shuffling the dataset, keeping a 15,000-example subset, and splitting it 90/10 into training and test sets.

In [4]:
def preprocess(example):
  example['prompt'] = f"{example['instruction']} {example['input']} {example['output']}"
  return example
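
To make the resulting format concrete, here is a minimal sketch that applies the same concatenation to two hand-written records (hypothetical examples, not rows from the dataset). Note that an empty `input` field leaves two consecutive spaces between the instruction and the output.

```python
def preprocess(example):
    # Same concatenation as the notebook cell above
    example['prompt'] = f"{example['instruction']} {example['input']} {example['output']}"
    return example

# Hypothetical records, written only for illustration
with_input = preprocess({'instruction': 'Translate to French:',
                         'input': 'Good morning',
                         'output': 'Bonjour'})
without_input = preprocess({'instruction': 'Name a primary color.',
                            'input': '',
                            'output': 'Red'})

print(repr(with_input['prompt']))
print(repr(without_input['prompt']))  # contains a double space
```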

In [5]:
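# Build the 'prompt' column and drop the three source columns
dataset = dataset.map(preprocess,
                      remove_columns=['instruction', 'input', 'output'])
# Shuffle with a fixed seed, then keep a 15,000-example subset
dataset = dataset.shuffle(seed=42)
dataset = dataset.select(range(15000))
# Hold out 10% of the subset for evaluation
dataset = dataset.train_test_split(test_size=0.1, seed=42)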

Map:   0%|          | 0/498813 [00:00<?, ? examples/s]

In [6]:
train_dataset = dataset['train']
test_dataset = dataset['test']
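
As a quick sanity check on the split sizes, the arithmetic behind `train_test_split(test_size=0.1)` on the 15,000-example subset can be sketched as follows. The rounding rule used here is an assumption; for these numbers the product is a whole number, so the result does not depend on it.

```python
subset_size = 15_000   # examples kept by dataset.select(range(15000))
test_fraction = 0.1    # test_size passed to train_test_split

# Assumed rounding; 15,000 * 0.1 is exactly 1,500, so it has no effect here
n_test = round(subset_size * test_fraction)
n_train = subset_size - n_test

print(n_train, n_test)
```

These figures match the 13,500 and 1,500 example counts reported when the two splits are tokenized.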

## 🚀 Model Initialization and Tokenization
Setting up the tokenizer and the model. The GPT-2-style tokenizer used by DialoGPT has no padding token by default, so we reuse the end-of-sequence token for padding.

In [7]:
MODEL_NAME = 'microsoft/DialoGPT-medium'

In [8]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

In [9]:
def tokenize_dataset(dataset):
  tokenized_dataset = dataset.map(lambda example: tokenizer(example['prompt'],
                                                            truncation=True,
                                                            max_length=128),
                                  batched=True,
                                  remove_columns=['prompt'])
  return tokenized_dataset

In [10]:
train_dataset = tokenize_dataset(train_dataset)
test_dataset = tokenize_dataset(test_dataset)

Map:   0%|          | 0/13500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [11]:
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

## 🎯 Training the GPT Model
Configuring the training parameters and initiating the training process using our prepared datasets.

In [12]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=False)
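
With `mlm=False`, the collator prepares batches for causal language modeling: it pads each batch to a common length and uses a copy of the input IDs as labels, with padding positions set to `-100` so the loss ignores them. The pure-Python sketch below mimics that behavior on plain lists. The function name `collate_causal_lm` and the toy token IDs are hypothetical, and the real collator works on tensors.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def collate_causal_lm(sequences, pad_token_id):
    """Right-pad token-id lists and build causal-LM labels (illustrative only)."""
    max_len = max(len(seq) for seq in sequences)
    batch = {'input_ids': [], 'attention_mask': [], 'labels': []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids = seq + [pad_token_id] * n_pad
        # 1 marks a real token, 0 marks padding
        attention_mask = [1] * len(seq) + [0] * n_pad
        # Labels copy the inputs; every position holding the pad id is masked
        labels = [IGNORE_INDEX if tok == pad_token_id else tok
                  for tok in input_ids]
        batch['input_ids'].append(input_ids)
        batch['attention_mask'].append(attention_mask)
        batch['labels'].append(labels)
    return batch

# Toy token ids; 50256 is GPT-2's end-of-text id, reused here as the pad id
batch = collate_causal_lm([[5, 6, 7], [8]], pad_token_id=50256)
print(batch['labels'])
```

Because the pad token is the end-of-sequence token in this notebook, masking by ID would also hide any genuine end-of-sequence token that appeared inside a sequence.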

In [13]:
training_args = TrainingArguments(output_dir='models/dialo_gpt',
                                  num_train_epochs=1,
                                  per_device_train_batch_size=14,
                                  per_device_eval_batch_size=14)

In [14]:
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=test_dataset,
                  data_collator=data_collator)

In [15]:
trainer.train()

| Step | Training Loss |
|---|---|
| 500 | 2.8068 |


TrainOutput(global_step=965, training_loss=2.6720094670903496, metrics={'train_runtime': 1370.6242, 'train_samples_per_second': 9.85, 'train_steps_per_second': 0.704, 'total_flos': 3131038209122304.0, 'train_loss': 2.6720094670903496, 'epoch': 1.0})
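
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, which the reported step count suggests but the notebook does not state. The perplexity is derived from the average training loss, so treat it as a rough indicator rather than an evaluation metric.

```python
import math

n_train = 13_500                  # training examples after the 90/10 split
batch_size = 14                   # per_device_train_batch_size
runtime_s = 1370.6242             # train_runtime from TrainOutput
train_loss = 2.6720094670903496   # training_loss from TrainOutput

# One optimizer step per batch; the last batch may be smaller
steps = math.ceil(n_train / batch_size)
samples_per_second = n_train / runtime_s
steps_per_second = steps / runtime_s
# Perplexity is the exponential of the mean cross-entropy loss
perplexity = math.exp(train_loss)

print(steps, round(samples_per_second, 2), round(steps_per_second, 3))
print(perplexity)
```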

## 📝 Text Generation and Application
Finally, we test the fine-tuned model by generating a response to an instruction with beam search and keeping only the first sentence of the answer.

In [34]:
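def generate_text(prompt):
  device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
  model.to(device)

  inputs = tokenizer.encode(prompt,
                            return_tensors='pt').to(device)
  # Beam search over 10 beams; no bigram may appear twice in the output
  outputs = model.generate(inputs,
                           max_length=64,
                           pad_token_id=tokenizer.eos_token_id,
                           no_repeat_ngram_size=2,
                           num_beams=10,
                           early_stopping=True)
  # Decode only the newly generated tokens, skipping the echoed prompt
  completion = tokenizer.decode(outputs[0][inputs.shape[-1]:],
                                skip_special_tokens=True).strip()

  # Keep the first sentence; fall back to the whole completion if no period
  end = completion.find('.')
  return completion[:end + 1] if end != -1 else completion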

In [35]:
generate_text('Is there life outside Earth?')

'Yes, there is life out there.'
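
The `no_repeat_ngram_size=2` setting used in `generate_text` stops any bigram from appearing twice in the output. As a rough illustration of the idea (not the library's actual implementation), the sketch below computes which next tokens would be banned given the tokens generated so far. The helper name `banned_next_tokens` and the word-level tokens are hypothetical.

```python
def banned_next_tokens(generated, ngram_size):
    """Return tokens that would complete an n-gram already seen in `generated`."""
    if ngram_size < 1:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        # If an earlier n-gram starts with the same prefix, its last token is banned
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

tokens = ['the', 'cat', 'sat', 'on', 'the']
banned = banned_next_tokens(tokens, ngram_size=2)
print(banned)
```

Here the bigram `('the', 'cat')` has already occurred, so `'cat'` cannot follow the final `'the'`.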