Part 2: Transformer (Text generation)

In [3]:
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import GPT2Tokenizer, TrainingArguments, Trainer, GPT2LMHeadModel

2024-05-26 22:19:42.945396: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-26 22:19:42.945523: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-26 22:19:43.083681: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [4]:
torch.manual_seed(42)

<torch._C.Generator at 0x7fb2ffc4c9f0>

In [5]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium', bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', pad_token='<|pad|>')
model = GPT2LMHeadModel.from_pretrained('gpt2-medium').cuda()
model.resize_token_embeddings(len(tokenizer))

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/718 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.52G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Embedding(50259, 1024)

In [6]:
descriptions = pd.read_csv('/kaggle/input/netfilex/netflix_titles.csv')['description']

In [7]:
max_length = max([len(tokenizer.encode(description)) for description in descriptions])

In [8]:
class NetflixDataset(Dataset):
    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for txt in txt_list:
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>', truncation=True,
                                       max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

In [9]:
dataset = NetflixDataset(descriptions, tokenizer, max_length=max_length)
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])

In [10]:
import gc
gc.collect()

83

In [11]:
torch.cuda.empty_cache()

In [12]:
training_args = TrainingArguments(output_dir='./results', num_train_epochs=1, logging_steps=100, save_steps=5000,
                                  per_device_train_batch_size=1, per_device_eval_batch_size=1,
                                  warmup_steps=10, weight_decay=0.05, logging_dir='./logs', report_to = 'none')

In [13]:
Trainer(model=model,  args=training_args, train_dataset=train_dataset, 
        eval_dataset=val_dataset, data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                                              'attention_mask': torch.stack([f[1] for f in data]),
                                                              'labels': torch.stack([f[0] for f in data])}).train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Step,Training Loss
100,4.6796
200,1.9641
300,1.8431
400,1.8249
500,1.8894
600,1.9019
700,1.9372
800,1.8066
900,1.7945
1000,1.8112


TrainOutput(global_step=7926, training_loss=1.8148240352440987, metrics={'train_runtime': 986.5883, 'train_samples_per_second': 8.034, 'train_steps_per_second': 8.034, 'total_flos': 891356768944128.0, 'train_loss': 1.8148240352440987, 'epoch': 1.0})

In [14]:
generated = tokenizer("<|startoftext|> ", return_tensors="pt").input_ids.cuda()

In [15]:
sample_outputs = model.generate(generated, do_sample=True, top_k=50, 
                                max_length=300, top_p=0.95, temperature=1.9, num_return_sequences=20)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [16]:
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0:  ̶Curious how everything turns together as your daughter sets eyes on another childhood home and tries to solve things on his, well, own.
1:   _____ isn‽t always like momma, her son takes an unconventional schoolteacher parenting course to prove the point. Will her tough streak turn deadly into momma s gentle nature? I Can'T Count a Millionaire!
2: ????? is a short film released for release at the 2000 BETA edition of the movie based off and based on a Japanese book based upon this novel by Stephen King.
3:  ̶ ̶ ̶ ̶ ̶ 2 ̶ 6 ̶ 13 ̶.combs vr3r1e begins back on it, and so is his "Unlepiel." See here!
4:  ”I love being a mom. Not an infant or what mommy was supposed to’t be," writes Miranda Hart:
5:  ㅋㅋI've never participated with another boy, and today a man-bond gets sealed off while dealing with three other bachelors. It only shows that they may live in pairs? When the guy becomes her true crush then will his brother-in house make them his house?? I look for him, what's she called?
6