**Code copied from this Medium post: https://medium.com/geekculture/fine-tune-eleutherai-gpt-neo-to-generate-netflix-movie-descriptions-in-only-47-lines-of-code-40c9b4c32475**

In [1]:
%%time
%%capture
!pip install transformers

CPU times: user 23.3 ms, sys: 12 ms, total: 35.3 ms
Wall time: 7.02 s


In [2]:
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import GPT2Tokenizer, TrainingArguments, Trainer, GPT2LMHeadModel

In [3]:
!nvidia-smi

Fri Dec  2 04:18:41 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
torch.manual_seed(42)

<torch._C.Generator at 0x7fc31c0bb470>

### Loading GPT2-Medium Model from 🤗 Model Hub 

In [5]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium', bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', pad_token='<|pad|>')
model = GPT2LMHeadModel.from_pretrained('gpt2-medium').cuda()
model.resize_token_embeddings(len(tokenizer))

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.


Embedding(50259, 1024)
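The `Embedding(50259, 1024)` output follows from simple arithmetic: GPT-2's vocabulary has 50,257 entries and already contains `<|endoftext|>` (id 50256, as the generation warning later in the notebook confirms), so only the other two special tokens are new. A minimal sketch of that count in plain Python, without calling the tokenizer itself (the variable names are illustrative):

```python
# GPT-2's published vocabulary size; '<|endoftext|>' is already part of it.
BASE_VOCAB_SIZE = 50257
existing_specials = {'<|endoftext|>'}

# Special tokens requested in the from_pretrained() call above.
requested = {
    'bos_token': '<|startoftext|>',
    'eos_token': '<|endoftext|>',
    'pad_token': '<|pad|>',
}

# Only tokens that are not in the vocabulary yet add a new embedding row.
new_tokens = [tok for tok in requested.values() if tok not in existing_specials]
new_vocab_size = BASE_VOCAB_SIZE + len(new_tokens)
print(new_tokens, new_vocab_size)
```

The two new rows start out untrained, which is what the "make sure the associated word embedding are fine-tuned or trained" warning refers to.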

### Configurations

In [18]:
DATA_PATH = '../input/netflix-shows/netflix_titles.csv'
DATA_HEADER = 'description'

OUTPUT_DIR = './results'
LOGGING_DIR = './logs'

EPOCHS = 4

LOGGING_STEPS = 100
SAVE_STEPS = 1000

TRAIN_BATCH_SIZE = 16 
EVAL_BATCH_SIZE = 16

WARMUP_STEPS = 10

WEIGHT_DECAY = 0.05

REPORT_TO = 'none'
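
To get a feel for how `EPOCHS`, `TRAIN_BATCH_SIZE` and `WARMUP_STEPS` interact, here is a rough sketch of the step count and of a linear warmup followed by linear decay, which is, as far as I know, the schedule `Trainer` uses by default. The example size of 1000 training examples and the base learning rate of 5e-5 are assumptions for illustration; neither value is set in this notebook.

```python
import math

def total_training_steps(n_train_examples, batch_size, epochs):
    """Optimizer steps on one device, without gradient accumulation."""
    return math.ceil(n_train_examples / batch_size) * epochs

def linear_warmup_lr(step, total_steps, warmup_steps, base_lr):
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# Hypothetical numbers: 1000 training examples, batch size 16, 4 epochs.
steps = total_training_steps(1000, 16, 4)
for step in (0, 5, 10, steps // 2, steps):
    print(step, linear_warmup_lr(step, steps, warmup_steps=10, base_lr=5e-5))
```

With only 10 warmup steps, the learning rate reaches its peak almost immediately and spends the rest of training decaying.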

In [19]:
descriptions = pd.read_csv(DATA_PATH)[DATA_HEADER]

In [8]:
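# Longest description in tokens, plus 2 so the bos/eos markers added in TrainDataset are not truncated away.
max_length = max(len(tokenizer.encode(description)) for description in descriptions) + 2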

In [9]:
class TrainDataset(Dataset):
    """Tokenizes each description once and stores fixed-length input ids and attention masks."""

    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            # Wrap each description in the bos/eos markers so the model learns where a description starts and ends.
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>', truncation=True,
                                       max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]
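
What `padding="max_length"` and `truncation=True` do to a single example can be sketched in plain Python. This is only an illustration of the idea, not the tokenizer's implementation; the helper name and the small token ids are made up, and it assumes right-side padding.

```python
def pad_to_max_length(token_ids, max_length, pad_id):
    """Truncate or right-pad `token_ids` and build the matching attention mask."""
    ids = token_ids[:max_length]                  # truncation
    n_pad = max_length - len(ids)                 # how many pad slots are needed
    padded = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, attention_mask

# Hypothetical ids: 7, 8, 9 are real tokens, 0 stands in for the pad token id.
print(pad_to_max_length([7, 8, 9], max_length=5, pad_id=0))
```

Every example ends up with the same length, which is why the collator further down can simply `torch.stack` them.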

In [10]:
dataset = TrainDataset(descriptions, tokenizer, max_length=max_length)
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])

In [11]:
# for garbage collection

import gc
gc.collect()

669

In [12]:
torch.cuda.empty_cache()

In [20]:
training_args = TrainingArguments(output_dir=OUTPUT_DIR, num_train_epochs=EPOCHS, logging_steps=LOGGING_STEPS, 
                                  save_steps=SAVE_STEPS, per_device_train_batch_size=TRAIN_BATCH_SIZE, 
                                  per_device_eval_batch_size=EVAL_BATCH_SIZE, warmup_steps=WARMUP_STEPS, 
                                  weight_decay=WEIGHT_DECAY, logging_dir=LOGGING_DIR, report_to=REPORT_TO)


In [21]:
model_trainer = Trainer(model=model,  args=training_args, train_dataset=train_dataset, 
        eval_dataset=val_dataset, data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                                              'attention_mask': torch.stack([f[1] for f in data]),
                                                              'labels': torch.stack([f[0] for f in data])})
model_trainer.train()

Step,Training Loss
100,1.399
200,1.4199
300,1.5071
400,1.5913
500,1.6535
600,1.3435
700,1.3371
800,1.3343
900,1.3359
1000,1.3082
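
One thing worth noting about the collator above: it copies `input_ids` into `labels` unchanged, so the padded positions count towards the loss as well. A common alternative is to replace the label at every padded position with -100, the index the cross-entropy loss ignores. The rule does not depend on PyTorch, so this sketch uses plain lists; the helper name and token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def mask_padding_labels(input_ids, attention_mask):
    """Copy input ids into labels, but ignore positions the attention mask marks as padding."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

# Hypothetical example: two real tokens followed by two pad tokens (id 0 here).
print(mask_padding_labels([5, 6, 0, 0], [1, 1, 0, 0]))
```

In the notebook's collator this would correspond to setting the stacked `labels` tensor to -100 wherever the stacked `attention_mask` is 0. Because `<|pad|>` is a separate token from `<|endoftext|>`, masking the padding would not hide the end-of-text marker from the model.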


In [27]:
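# save_model() writes a directory (config + weights), so this path is a folder despite the .bin suffix.
model_trainer.save_model("/final_model.bin")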

### GPT-Generated Descriptions

In [26]:
generated = tokenizer("<|startoftext|> ", return_tensors="pt").input_ids.cuda()

In [None]:
# fetched_model = GPT2LMHeadModel.from_pretrained("/final_model.bin")

In [28]:
sample_outputs = model.generate(generated, do_sample=True, top_k=50, 
                                max_length=300, top_p=0.95, temperature=1.9, num_return_sequences=10)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
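
To see how `temperature`, `top_k` and `top_p` shape the samples, here is a simplified sketch of the filtering applied to one step's logits. It is not the library's implementation, just the general idea in plain Python; the function name and the toy logits are made up.

```python
import math

def sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before softmax; values above 1 flatten the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids (0 disables the filter).
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i, p in zip(order, probs):
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {i: p / norm for i, p in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for a 4-token vocabulary
print(sampling_distribution(toy_logits, temperature=1.0))
print(sampling_distribution(toy_logits, temperature=1.9))
```

At `temperature=1.9` the most likely token loses a good share of its probability to the rest, which probably explains some of the odd words in the samples below. A value closer to 1.0 would likely give more conservative text.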


In [29]:
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: !" calls them all-affiliated from an interview with Dave Chappelle as comedy's biggest stars ring-in his show.
1:   Kissing enthusiast Carrie Cole brings home husband Michael Cole's handsome but drabbable proposal; an arranged marriage to catch their attention.
2:   This documentary delves deeply into Hollywood's enduring fixation with Emmas sobriquet and the resulting movies like "She's Out" and "Hedwig."
3:  修理者 Hikaru Utagawa is in for a shock when his former best friend becomes mayor of Nagoya until a crucial meeting between �M3 and Akita emerges.
4:  ickleE has the chance to fulfill a long-standing holiday wish and win a priceless set of priceless ornamentals before they topple from the seasons.
5:   Bae Byung Seok is back with all her binaurious flare and more palsy classmates for a fresh run of shenanigans! Join Won Min Lee as well as Jungmyung Seok and Yoo Seung.
6:  ____Six teen vloggers compete in three days of funny and sexy vlog challenges – on stage or through their fr

### Original Descriptions (Random Sample)

In [17]:
pd.options.display.max_colwidth = 1000
descriptions.sample(10)

4970                       Three buddies with big dreams go from underachieving slackers to badass warriors when their posh hotel is taken over by terrorists.
3362         In his first stand-up special, Arsenio Hall discusses getting older, the changing times and culture, social issues and even bothersome baby toes.
5494                                                  Music meets imagination in this inventive animated series about thinking outside the box and having fun.
1688                        Explore an array of unique competitions, from the quirky to the bizarre, and meet their passionate communities in this docuseries.
1349         From his days as a petty thief to becoming head of a drug-trafficking empire, this riveting series charts the life of the infamous Pablo Escobar.
4862        This anime adventure follows the battle between a saint of Athena and an avatar of Hades who's working on a painting that could destroy the world.
2676     A top Israeli agent comes out of reti