In [99]:
import pandas as pd  # aliased as pd, which the cells below rely on

## A. Preparing the data

In [100]:
# set up
# data from: https://www.kaggle.com/mousehead/songlyrics
file_path = './songlyrics/songdata.csv' 

In [101]:
# read data
df = pd.read_csv(file_path)
len(df)  # checks number of rows

57650

In [102]:
df.head()  # prints first 5 rows

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


In [103]:
df['artist'].unique()  # gets all unique values of artists

array(['ABBA', 'Ace Of Base', 'Adam Sandler', 'Adele', 'Aerosmith',
       'Air Supply', 'Aiza Seguerra', 'Alabama', 'Alan Parsons Project',
       'Aled Jones', 'Alice Cooper', 'Alice In Chains', 'Alison Krauss',
       'Allman Brothers Band', 'Alphaville', 'America', 'Amy Grant',
       'Andrea Bocelli', 'Andy Williams', 'Annie', 'Ariana Grande',
       'Ariel Rivera', 'Arlo Guthrie', 'Arrogant Worms', 'Avril Lavigne',
       'Backstreet Boys', 'Barbie', 'Barbra Streisand', 'Beach Boys',
       'The Beatles', 'Beautiful South', 'Beauty And The Beast',
       'Bee Gees', 'Bette Midler', 'Bill Withers', 'Billie Holiday',
       'Billy Joel', 'Bing Crosby', 'Black Sabbath', 'Blur', 'Bob Dylan',
       'Bob Marley', 'Bob Rivers', 'Bob Seger', 'Bon Jovi', 'Boney M.',
       'Bonnie Raitt', 'Bosson', 'Bread', 'Britney Spears',
       'Bruce Springsteen', 'Bruno Mars', 'Bryan White', 'Cake',
       'Carly Simon', 'Carol Banawa', 'Carpenters', 'Cat Stevens',
       'Celine Dion', 'Chaka Khan

I am going to use Modern Talking songs for my fine-tuning, so let's get those.

In [104]:
df = df[df['artist']=='Modern Talking'].copy()  # copy, so later column edits don't act on a slice of the full frame
df['artist'].unique()  # just to check

array(['Modern Talking'], dtype=object)

At this point we have a dataframe with all the Modern Talking songs we want. Let's see how many there actually are.

In [105]:
print('There are lyrics for {} songs'.format(len(df)))

There are lyrics for 144 songs


Seems okay; we just need some separators for line breaks and we should be good to go.

In [107]:
df['text'] = [i.replace('\n', r'\s ') for i in df['text']]  # raw string keeps the backslash literal

In [108]:
df.head()

Unnamed: 0,artist,song,link,text
13163,Modern Talking,10 Seconds To Countdown,/m/modern+talking/10+seconds+to+countdown_2048...,"You are the best, for all of times \s You bea..."
13164,Modern Talking,America,/m/modern+talking/america_20094723.html,"On Monday morning, ocean rain was falling \s ..."
13165,Modern Talking,Blackbird,/m/modern+talking/blackbird_20154356.html,"Just tomorrow, there's another day \s Another..."
13166,Modern Talking,Can't Get Enough,/m/modern+talking/cant+get+enough_20094601.html,"Oh, Juliet Is Crying All The Night \s She Doe..."
13167,Modern Talking,Can't Let You Go,/m/modern+talking/cant+let+you+go_20094740.html,You don't need nobody \s When you're down and...


All done!
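
To see what the separator step does, here is a minimal sketch on a made-up two-line string. The sample text and the names `SEPARATOR` and `flatten_lyrics` are illustrative assumptions, not part of the dataset or the notebook. The point of the step is that every song ends up on a single line, which matters later because the training script is run with `--line_by_line` and so treats each line of the file as one example.

```python
# Marker used in place of newlines; a raw string keeps the backslash literal.
SEPARATOR = r'\s '

def flatten_lyrics(text):
    """Put a whole song on one line by swapping newlines for the marker."""
    return text.replace('\n', SEPARATOR)

sample = "first line \nsecond line"  # stand-in for a real lyric
flat = flatten_lyrics(sample)
print(flat)
```

The result contains no newline characters, only the two-character backslash-s marker where each line break used to be.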

In [None]:
import os
os.makedirs('./data', exist_ok=True)  # makes a directory for file storage; no error if it already exists

In [110]:
# let's leave 5 songs for evaluation
df['text'][:-5].to_csv('./data/train.txt', index=False, header=False)
df['text'][-5:].to_csv('./data/test.txt', index=False, header=False)
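
One thing to be aware of: with its default settings, `to_csv` wraps a field in double quotes when it contains a comma or a quote character, so some lines in the files may carry extra quoting. Below is a hedged sketch of an alternative that writes one plain line per song. The helper names `split_holdout` and `write_lines`, the stand-in data, and the temporary directory are all assumptions made for illustration.

```python
import os
import tempfile

def split_holdout(lines, n_eval=5):
    """Return (train, eval), where eval is the last n_eval items."""
    if not 0 < n_eval < len(lines):
        raise ValueError("n_eval must be between 1 and len(lines) - 1")
    return lines[:-n_eval], lines[-n_eval:]

def write_lines(lines, path):
    """Write one example per line, with no CSV quoting."""
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(line + '\n')

# Stand-in data: 12 fake "songs", each already flattened to one line.
songs = ['song {} \\s line two'.format(i) for i in range(12)]
train, test = split_holdout(songs, n_eval=5)

out_dir = tempfile.mkdtemp()  # temporary directory, only for this sketch
write_lines(train, os.path.join(out_dir, 'train.txt'))
write_lines(test, os.path.join(out_dir, 'test.txt'))
print(len(train), len(test))
```

In the notebook, `songs` would be `df['text'].tolist()` and the output paths would be `./data/train.txt` and `./data/test.txt`.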

## B. Fine tune

Let's train the model to generate some lyrics. Yes, you could rewrite the whole Hugging Face script, but what for? They did a great job with it, so let's use it.

The ! prefix lets Jupyter run bash commands.

In [119]:
!python run_language_modeling.py \
    --output_dir=output \
    --overwrite_output_dir \
    --model_type=gpt2 \
    --model_name_or_path=distilgpt2 \
    --line_by_line \
    --do_train \
    --train_data_file=./data/train.txt \
    --do_eval \
    --per_gpu_train_batch_size=1 \
    --num_train_epochs=10 \
    --eval_data_file=./data/test.txt

04/03/2020 10:33:36 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-config.json from cache at /root/.cache/torch/transformers/eb0f77b3f095880586731f57e2fe19060d71d1036ef8daf727bd97a17fb66a43.a41f80bd12c111d611dcd5546611b7e47c16a0a995f83df2f7b437a20b6849b5
04/03/2020 10:33:36 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "decoder_start_token_id": null,
  "do_sample": false,
  "early_stopping": false,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "length_penalty": 1.0,
  "max_length": 20,
  "min_length": 0,
 

Epoch:  20%|███████▍                             | 2/10 [00:21<01:27, 10.89s/it]
Epoch:  60%|██████████████████████▏              | 6/10 [01:33<01:00, 15.25s/it]

04/03/2020 10:35:39 - INFO - __main__ -   Saving model checkpoint to output/checkpoint-1000

04/03/2020 10:35:55 - INFO - __main__ -   Saving optimizer and scheduler states to output/checkpoint-1000

Epoch: 100%|████████████████████████████████████| 10/10 [02:46<00:00, 16.62s/it]
04/03/2020 10:36:28 - INFO - __main__ -    global_step = 1390, average loss = 1
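
The global_step value in the last log line can be reproduced from the run settings: 139 training lines (144 songs minus the 5 held out), a batch size of 1, and 10 epochs. Here is a minimal sketch of that arithmetic, assuming a single device and no gradient accumulation. `total_steps` is a hypothetical helper, not a function from the training script.

```python
import math

def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps for a run on one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs

n_train = 144 - 5  # songs kept for training after holding out 5
print(total_steps(n_train, batch_size=1, epochs=10))  # 1390, matching global_step in the log
```

The 139 steps per epoch also match the `/139` shown in the per-epoch progress bars.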

## C. A bit of playing around

You can see that I use a slightly adjusted script. It is still the same generation script provided by Hugging Face; all I changed is that the \s tokens we inserted earlier are turned back into line breaks in the output. That's all.
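
The post-processing idea can be sketched as follows. This is a guess at the kind of change involved, not the actual contents of run_generation_formated.py, and `restore_newlines` is a hypothetical name.

```python
import re

def restore_newlines(generated):
    """Turn the literal backslash-s markers back into line breaks."""
    # Match the two-character marker plus any spaces around it.
    return re.sub(r' *\\s *', '\n', generated)

sample = r'first line \s second line \s third'  # stand-in for model output
print(restore_newlines(sample))
```

Stripping the spaces around the marker is a design choice made here so that the restored lines carry no leading or trailing blanks; a plain `str.replace` would work too if that whitespace does not matter.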

In [120]:
!python run_generation_formated.py \
    --model_type=gpt2 \
    --model_name_or_path=./output \
    --prompt "Brother Louie" \
    --repetition_penalty 1 \
    --seed 32 \
    --length 500

04/03/2020 10:38:01 - INFO - transformers.tokenization_utils -   Model name './output' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming './output' is a path, a model identifier, or url to a directory containing tokenizer files.
04/03/2020 10:38:01 - INFO - transformers.tokenization_utils -   Didn't find file ./output/added_tokens.json. We won't load it.
04/03/2020 10:38:01 - INFO - transformers.tokenization_utils -   loading file ./output/vocab.json
04/03/2020 10:38:01 - INFO - transformers.tokenization_utils -   loading file ./output/merges.txt
04/03/2020 10:38:01 - INFO - transformers.tokenization_utils -   loading file None
04/03/2020 10:38:01 - INFO - transformers.tokenization_utils -   loading file ./output/special_tokens_map.json
04/03/2020 10:38:01 - INFO - transformers.tokenization_utils -   loading file ./output/tokenizer_config.json
04/03/2020 10:38:01 - INFO - transformers.configuration_utils -   loading configuration file 