# 💬 GPT for Instruction Following
This notebook walks through fine-tuning a GPT-style model to follow instructions, using an instruction dataset that covers varied tasks. It covers data preprocessing, model training, and text generation.

## 📦 Setup and Imports

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

## 📊 Data Exploration and Preparation
We load the dataset and preview ten random rows to see what kind of data we're working with. Each row has an `instruction`, an optional `input`, and the expected `output`.

In [186]:
dataset = load_dataset('hakurei/open-instruct-v1',
                       split='train')
dataset.to_pandas().sample(10)

| | output | input | instruction |
|---|---|---|---|
| 457039 | One good way to learn functional programming i... | | What would be a good way to go about learning ... |
| 313763 | Incorrect statement: '...the Melbourne Cricket... | The city of Sydney, located on the east coast ... | Identify the incorrect statement in the given ... |
| 408191 | I can't find any mistakes. | | Find all the mistakes in this paragraph. if yo... |
| 251072 | 1. Before 9th century there is no evidence of ... | Wind-powered machines may have been known earl... | List me down some key aspects about windmills ... |
| 476594 | Some of the health benefits of eating avocados... | | What are some of the health benefits of eating... |
| 21970 | I would reduce my water usage, recycle, compos... | | What activities would you do to help the envir... |
| 281452 | In a future dystopia, a scientist discovers a ... | | Generate a creative writing prompt for a scien... |
| 469529 | The best way to approach a job interview is to... | | What is the best way to approach a job interview. |
| 18066 | The movie is about inner conflict and the powe... | Synopsis: A passionate photographer is conflic... | Describe in a few lines the basic idea behind ... |
| 441262 | A transformer model and reinforcement learning... | | What is the difference between a transformer m... |


## 🔀 Shuffling and Splitting the Dataset
We first merge the `instruction`, `input`, and `output` fields into a single `prompt` string. We then shuffle the dataset, keep a 1,000-example subset, and split it 90/10 into training and test sets.

In [4]:
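def preprocess(example):
  # Join the three fields with single spaces into one training string.
  # An empty 'input' leaves two consecutive spaces in the prompt.
  example['prompt'] = f"{example['instruction']} {example['input']} {example['output']}"
  return example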
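To make the resulting format concrete, the sketch below applies the same f-string to two made-up rows. The helper name `build_prompt` and both rows are hypothetical and not taken from the dataset. It shows that an empty `input` leaves a double space, and that nothing in the string marks where the instruction ends and the response begins.

```python
def build_prompt(instruction, input_text, output):
  # Same concatenation as preprocess(), written as a standalone helper
  return f"{instruction} {input_text} {output}"

# Hypothetical rows, for illustration only
with_input = build_prompt('Translate to French.', 'Good morning', 'Bonjour')
without_input = build_prompt('Name a primary color.', '', 'Red')

print(repr(with_input))     # 'Translate to French. Good morning Bonjour'
print(repr(without_input))  # 'Name a primary color.  Red'
```

A template with explicit separators between the instruction and the response is a common alternative, though this notebook keeps the plain concatenation.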

In [None]:
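# Build the prompt column and drop the original fields
dataset = dataset.map(preprocess,
                      remove_columns=['instruction', 'input', 'output'])
# Shuffle reproducibly, then keep a 1,000-example subset for a quick run
dataset = dataset.shuffle(seed=42)
dataset = dataset.select(range(1000))
# Hold out 10% (100 examples) for evaluation
dataset = dataset.train_test_split(test_size=0.1, seed=42)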

In [6]:
train_dataset = dataset['train']
test_dataset = dataset['test']

## 🚀 Model Initialization and Tokenization
We set up the tokenizer and the model. The GPT-2 tokenizer used by DialoGPT has no padding token, so we reuse the end-of-sequence token for padding. Prompts are truncated to 128 tokens.

In [7]:
MODEL_NAME = 'microsoft/DialoGPT-medium'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

In [9]:
def tokenize_dataset(dataset):
  # With batched=True the function receives a batch: batch['prompt'] is a list of strings
  tokenized_dataset = dataset.map(lambda batch: tokenizer(batch['prompt'],
                                                          truncation=True,
                                                          max_length=128),
                                  batched=True,
                                  remove_columns=['prompt'])
  return tokenized_dataset

In [None]:
train_dataset = tokenize_dataset(train_dataset)
test_dataset = tokenize_dataset(test_dataset)

In [None]:
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

## 🎯 Training the GPT Model
We configure a single epoch with a batch size of 4 and start training on the prepared datasets.

In [12]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=False)
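With `mlm=False`, the collator pads each batch and copies `input_ids` into `labels`, replacing every padding position with `-100` so the loss ignores it (the shift by one token happens inside the model). The pure-Python sketch below mimics that behavior for illustration and is not the library's code. The function name `collate_causal` and the short token ids are made up; `50256` is GPT-2's end-of-text id. Because the pad token is the EOS token here, a genuine EOS in a sequence would be masked as well.

```python
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss
PAD_ID = 50256        # GPT-2's end-of-text id, reused here as the pad id

def collate_causal(batch, pad_id=PAD_ID):
  """Pad sequences to equal length and build causal-LM labels (illustrative)."""
  longest = max(len(seq) for seq in batch)
  input_ids, attention_mask, labels = [], [], []
  for seq in batch:
    n_pad = longest - len(seq)
    padded = seq + [pad_id] * n_pad
    input_ids.append(padded)
    attention_mask.append([1] * len(seq) + [0] * n_pad)
    # Every position equal to pad_id is masked, including a genuine EOS token
    labels.append([IGNORE_INDEX if tok == pad_id else tok for tok in padded])
  return {'input_ids': input_ids,
          'attention_mask': attention_mask,
          'labels': labels}

batch = collate_causal([[10, 11, 12, 50256], [20, 21]])
print(batch['labels'])  # [[10, 11, 12, -100], [20, 21, -100, -100]]
```

Note that the first sequence's trailing EOS is masked even though it is real content, so a model trained this way gets no signal for when to stop.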

In [13]:
training_args = TrainingArguments(output_dir='models/dialo_gpt',
                                  num_train_epochs=1,
                                  per_device_train_batch_size=4,
                                  per_device_eval_batch_size=4)

In [14]:
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=test_dataset,
                  data_collator=data_collator)

In [15]:
trainer.train()


TrainOutput(global_step=225, training_loss=3.36793701171875, metrics={'train_runtime': 5567.1191, 'train_samples_per_second': 0.162, 'train_steps_per_second': 0.04, 'total_flos': 193416430583808.0, 'train_loss': 3.36793701171875, 'epoch': 1.0})
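The reported numbers can be sanity-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, neither of which the notebook states explicitly. The perplexity is derived from the average training loss, so it is only a rough indicator and not an evaluation metric.

```python
import math

subset_size = 1000     # examples kept after select(range(1000))
test_fraction = 0.1    # test_size passed to train_test_split
batch_size = 4         # per_device_train_batch_size
epochs = 1

test_size = round(subset_size * test_fraction)
train_size = subset_size - test_size
total_steps = math.ceil(train_size / batch_size) * epochs

train_loss = 3.36793701171875      # from TrainOutput
perplexity = math.exp(train_loss)  # roughly 29

print(train_size, total_steps)     # 900 225
print(round(perplexity, 1))
```

The 225 steps match `global_step` in the output, and 900 samples over about 5,567 seconds is consistent with the reported 0.162 samples per second.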

## 📝 Text Generation and Application
Finally, we generate responses from the fine-tuned model for a sample instruction. After a single epoch on 900 examples the outputs are fluent, but they tend to loop on one or two sentences.

In [201]:
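def generate_text(prompt):
  # Encode the prompt and place it on the same device as the model
  inputs = tokenizer.encode(prompt,
                            return_tensors='pt').to(model.device)
  # max_length counts the prompt tokens plus the generated tokens
  outputs = model.generate(inputs,
                           max_length=64,
                           pad_token_id=tokenizer.eos_token_id)
  generated = tokenizer.decode(outputs[0],
                               skip_special_tokens=True)

  # Trim to the last complete sentence; keep everything if there is no period
  last_period = generated.rfind('.')
  return generated[:last_period + 1] if last_period != -1 else generated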

In [219]:
generate_text('Tell me a fun fact about the world')

'Tell me a fun fact about the world.  The world is full of surprises. The world is full of surprises. The world is full of surprises. The world is full of surprises. The world is full of surprises. The world is full of surprises. The world is full of surprises.'

In [220]:
generate_text('Tell me a fun fact about the world')

'Tell me a fun fact about the world.  The world is a vast and varied place. The world is a vast and varied place. The world is a vast and varied place. The world is a vast and varied place. The world is a vast and varied place. The world is a vast and varied place.'

In [221]:
generate_text('Tell me a fun fact about the world')

'Tell me a fun fact about the world.  The world is a vast and varied place. It is also a beautiful place. It is also a dangerous place. It is also a beautiful place. It is also a dangerous place. It is also a beautiful place. It is also a dangerous place.'

In [222]:
generate_text('Tell me a fun fact about the world')

'Tell me a fun fact about the world.  The world is a big place. It has many diverse cultures. The world is also a big place. It has many diverse cultures. It has many diverse cultures. It has many diverse cultures. It has many diverse cultures. It has many diverse cultures.'

In [224]:
generate_text('Tell me a fun fact about the world')

'Tell me a fun fact about the world.  The world is full of people who love to play video games. The world is full of people who love to play video games. The world is full of people who love to play video games. The world is full of people who love to play video games.'
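The repetition in these samples can be quantified with a simple measure. The sketch below is one possible diagnostic and not part of the original notebook. The function name `distinct_sentence_ratio` is hypothetical, sentences are split naively on periods, and the sample string is a shortened version of the first output above.

```python
def distinct_sentence_ratio(text):
  """Share of unique sentences among all sentences, split naively on periods."""
  sentences = [s.strip() for s in text.split('.') if s.strip()]
  if not sentences:
    return 0.0
  return len(set(sentences)) / len(sentences)

# Shortened version of the first generated sample
sample = ('Tell me a fun fact about the world.  The world is full of surprises. '
          'The world is full of surprises. The world is full of surprises.')

print(distinct_sentence_ratio(sample))  # 0.5
```

A ratio near 1.0 means few repeated sentences. Generation options such as `no_repeat_ngram_size`, `repetition_penalty`, or `do_sample=True` in `model.generate` are commonly used to reduce this kind of looping, though they are not tried here.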