<a href="https://colab.research.google.com/github/cihankaradogan/Training-GPT2-Music-Titles/blob/main/Titles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers==4.5.0

Collecting transformers==4.5.0
  Downloading transformers-4.5.0-py3-none-any.whl (2.1 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Installing collected packages: tokenizers, sacremoses, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.3 transformers-4.5.0


In [None]:
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import GPT2Tokenizer, TrainingArguments, Trainer, GPT2LMHeadModel

In [None]:
torch.manual_seed(42)

<torch._C.Generator at 0x7f7c7e765c30>

### Loading the GPT-2 model from Hugging Face

In [None]:
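# Load the tokenizer and register the special tokens that delimit and pad each line.
tokenizer = GPT2Tokenizer.from_pretrained('mrm8488/GPT-2-finetuned-common_gen', bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', pad_token='<|pad|>')
model = GPT2LMHeadModel.from_pretrained('mrm8488/GPT-2-finetuned-common_gen').cuda()
# Resize the embedding matrix so it covers every token id the tokenizer can now produce.
model.resize_token_embeddings(len(tokenizer))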

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.


Embedding(50258, 768)

### Preparing the dataset

In [None]:
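# read_fwf infers fixed-width column boundaries, so part of a long line can
# spill into a neighbouring column (see row 2 in the output below).
df = pd.read_fwf('/content/alicia-keys.txt')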

In [None]:
df

Unnamed: 0,data,Unnamed: 1
0,"Noise is always loud, there are sirens all aro...",
1,"If I can make it here, I can make it anywhere,...",
2,Seeing my face in lights or my name on marquee...,a pocket full of dreams
3,"Baby, I'm from New York",
4,Concrete jungle where dreams are made of,
...,...,...
2891,I can give you all the things that you wanted ...,
2892,If you will stay with me Every little bit hurts,
2893,Every little bit hurts,
2894,Every little bit hurts,


In [None]:
df = df["data"]

In [None]:
df

0       Noise is always loud, there are sirens all aro...
1       If I can make it here, I can make it anywhere,...
2       Seeing my face in lights or my name on marquee...
3                                 Baby, I'm from New York
4                Concrete jungle where dreams are made of
                              ...                        
2891    I can give you all the things that you wanted ...
2892      If you will stay with me Every little bit hurts
2893                               Every little bit hurts
2894                               Every little bit hurts
2895                               Every little bit hurts
Name: data, Length: 2896, dtype: object

In [None]:
max_length = max([len(tokenizer.encode(sentence)) for sentence in df])

In [None]:
max_length

39

In [None]:
class TitleDataset(Dataset):
    """Tokenizes each line once and stores its input ids and attention mask."""

    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            # Wrap the line in start/end markers, then truncate or pad it to max_length tokens.
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>', truncation=True,
                                       max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Labels are not stored here; the data collator derives them from input_ids.
        return self.input_ids[idx], self.attn_masks[idx]
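
The effect of `truncation=True` with `padding="max_length"` can be sketched on toy data. This is a simplified pure-Python illustration, not the tokenizer's implementation: the ids `BOS`, `EOS`, `PAD` and the helper `pad_or_truncate` are made-up names, and it assumes truncation and padding both happen on the right. It also shows why a `max_length` measured on the raw text, without the two marker tokens, cuts the end marker off the longest line.

```python
# Toy stand-ins for real token ids; these numbers are invented for illustration.
BOS, EOS, PAD = 1, 2, 0


def pad_or_truncate(token_ids, max_length, pad_id=PAD):
    """Right-truncate to max_length, then right-pad; return (ids, attention_mask)."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


longest_line = [11, 12, 13, 14]  # pretend this is the longest tokenized line
short_line = [21, 22]

raw_max = len(longest_line)  # length measured WITHOUT the start/end markers

# With max_length == raw_max the longest line loses its tail and its EOS marker.
ids_cut, mask_cut = pad_or_truncate([BOS] + longest_line + [EOS], raw_max)

# Two extra slots leave room for both markers.
ids_ok, mask_ok = pad_or_truncate([BOS] + longest_line + [EOS], raw_max + 2)
ids_short, mask_short = pad_or_truncate([BOS] + short_line + [EOS], raw_max + 2)

print(ids_cut, mask_cut)
print(ids_ok, mask_ok)
print(ids_short, mask_short)
```

Padded positions get an attention mask of 0, so the model does not attend to them.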

In [None]:
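# max_length was measured on the raw lines; +2 leaves room for the
# <|startoftext|> and <|endoftext|> tokens that TitleDataset adds.
dataset = TitleDataset(df, tokenizer, max_length=max_length + 2)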
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])
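
For the 2,896 lines loaded above, the 90/10 split arithmetic can be checked with a small pure-Python analogue. This is only a sketch: `split_indices` is a hypothetical helper, and it shuffles with Python's `random` module rather than PyTorch's generator, so the actual index assignment will differ from what `random_split` produces.

```python
import random


def split_indices(n_items, train_fraction=0.9, seed=42):
    """Shuffle indices 0..n_items-1 and cut them into train and validation parts."""
    train_size = int(train_fraction * n_items)  # same rounding as the cell above
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return indices[:train_size], indices[train_size:]


train_idx, val_idx = split_indices(2896)
print(len(train_idx), len(val_idx))  # 2606 290
```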


In [None]:
import gc
gc.collect()

95

In [None]:
torch.cuda.empty_cache()

### Training

In [None]:
training_args = TrainingArguments(output_dir='./results', num_train_epochs=15, logging_steps=100, save_steps=1000,
                                  per_device_train_batch_size=32, per_device_eval_batch_size=1,
                                  warmup_steps=10, weight_decay=0.05, logging_dir='./logs', report_to='none')

In [None]:
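def collate(batch):
    # Each item is (input_ids, attention_mask); for causal LM training the labels mirror input_ids.
    input_ids = torch.stack([item[0] for item in batch])
    attention_mask = torch.stack([item[1] for item in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': input_ids}

Trainer(model=model, args=training_args, train_dataset=train_dataset,
        eval_dataset=val_dataset, data_collator=collate).train()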

Step,Training Loss


KeyboardInterrupt: ignored
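
The data collator passed to `Trainer` above reuses `input_ids` as `labels`, so padded positions also contribute to the loss. A common variant, which this notebook does not use, sets the labels at padded positions to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below works on plain lists with made-up token ids; `mask_padding_labels` and `next_token_pairs` are hypothetical helpers for illustration only.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


def next_token_pairs(input_ids, labels):
    """List (current token, target token) pairs after the causal shift by one,
    skipping targets that the loss would ignore."""
    return [(inp, tgt)
            for inp, tgt in zip(input_ids[:-1], labels[1:])
            if tgt != IGNORE_INDEX]


# Toy example: 1 = start marker, 2 = end marker, 0 = padding (invented ids).
input_ids = [1, 21, 22, 2, 0, 0]
attention_mask = [1, 1, 1, 1, 0, 0]

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
print(next_token_pairs(input_ids, labels))
```

With the mask applied, only the real tokens and the end marker serve as prediction targets.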

### GPT-2 Generated Titles

In [None]:
generated = tokenizer("<|startoftext|> ", return_tensors="pt").input_ids.cuda()
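
The generation call in this section combines `temperature`, `top_k`, and `top_p`. The sketch below is a simplified pure-Python illustration of how these three settings reshape a next-token distribution. It is not the library's implementation; the logits are invented, and `softmax` and `top_k_top_p_filter` are hypothetical helpers. It assumes the order temperature, then top-k, then top-p.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, renormalise, then keep the smallest
    prefix whose cumulative probability reaches top_p and renormalise again."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break
    p_total = sum(probs[i] for i in kept)
    return {i: probs[i] / p_total for i in kept}


toy_logits = [4.0, 2.0, 1.0, 0.0, -1.0]  # invented scores for five tokens

probs_cool = softmax(toy_logits, temperature=1.0)
probs_hot = softmax(toy_logits, temperature=1.9)
filtered = top_k_top_p_filter(probs_hot, top_k=3, top_p=0.95)

print([round(p, 3) for p in probs_cool])
print([round(p, 3) for p in probs_hot])
print({i: round(p, 3) for i, p in filtered.items()})
```

At a temperature of 1.9 the most likely token holds a much smaller share of the probability mass than at 1.0, so unlikely tokens are sampled more often.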

In [None]:
# num_return_sequences: how many samples to generate
# max_length: maximum length of each sample in tokens, prompt included

sample_outputs = model.generate(generated, do_sample=True, top_k=50, 
                                max_length=16, top_p=0.95, temperature=1.9, num_return_sequences=500)

for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0: <|startoftext|> ://twspacewritingfw
1: <|startoftext|>  - 'endinghook|newhook {:
2: <|startoftext|> \<|><char>(beginofenc
3: <|startoftext|> -------------- Starting of unexpected text using wildcards
4: <|startoftext|> >"text"="#="#00tag
5: <|startoftext|> >" mov, begin: the space contains
6: <|startoftext|> ˈmystolicitye/ ©
7: <|startoftext|> >@w o(n
8: <|startoftext|> )</font></b> { <sub
9: <|startoftext|> xticket: endnotes with rs
10: <|startoftext|> \<cordologist=listeners></
11: <|startoftext|>  Became famous after a trip as among
12: <|startoftext|>  RFC" started its quest
13: <|startoftext|>  >< br>For decades you were regarded
14: <|startoftext|> 
endofmatch := start
15: <|startoftext|> ________________________ (all white plates for additional seating
16: <|startoftext|> ::EndSection: startofchar
17: <|startoftext|> ________>
18: <|startoftext|> -| addr: beginningofear the|
19: <|startoftext|> )))) }
@
20: <|startoftext|> ":"","linebreak: cantankerous resolution
21: <|starto