In [11]:
import pandas as pd
import os

## A. Preparing the data

In [2]:
# set up
# data from: https://www.kaggle.com/mousehead/songlyrics
file_path = './songlyrics/songdata.csv' 

In [3]:
# read data
df = pd.read_csv(file_path)
len(df)  # checks number of rows

57650

In [4]:
df.head()  # prints first 5 rows

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


In [5]:
df['artist'].unique()  # gets all unique values of artists

array(['ABBA', 'Ace Of Base', 'Adam Sandler', 'Adele', 'Aerosmith',
       'Air Supply', 'Aiza Seguerra', 'Alabama', 'Alan Parsons Project',
       'Aled Jones', 'Alice Cooper', 'Alice In Chains', 'Alison Krauss',
       'Allman Brothers Band', 'Alphaville', 'America', 'Amy Grant',
       'Andrea Bocelli', 'Andy Williams', 'Annie', 'Ariana Grande',
       'Ariel Rivera', 'Arlo Guthrie', 'Arrogant Worms', 'Avril Lavigne',
       'Backstreet Boys', 'Barbie', 'Barbra Streisand', 'Beach Boys',
       'The Beatles', 'Beautiful South', 'Beauty And The Beast',
       'Bee Gees', 'Bette Midler', 'Bill Withers', 'Billie Holiday',
       'Billy Joel', 'Bing Crosby', 'Black Sabbath', 'Blur', 'Bob Dylan',
       'Bob Marley', 'Bob Rivers', 'Bob Seger', 'Bon Jovi', 'Boney M.',
       'Bonnie Raitt', 'Bosson', 'Bread', 'Britney Spears',
       'Bruce Springsteen', 'Bruno Mars', 'Bryan White', 'Cake',
       'Carly Simon', 'Carol Banawa', 'Carpenters', 'Cat Stevens',
       'Celine Dion', 'Chaka Khan

I am going to use Modern Talking songs for my fine-tuning, so let's get those.

In [6]:
df = df[df['artist'] == 'Modern Talking'].copy()  # copy, so later column edits don't write to a view
df['artist'].unique()  # just to check

array(['Modern Talking'], dtype=object)

At this point we have a dataframe with all the Modern Talking songs we want. Let's see how many there actually are.

In [7]:
print('There are lyrics for {} songs'.format(len(df)))

There are lyrics for 144 songs


Seems okay; we just need some separators for the line breaks and we should be good to go.

In [8]:
df['text'] = [i.replace('\n', '\\s ') for i in df['text']]  # '\\s' is a literal backslash followed by s
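
Why bother with separators at all? With --line_by_line every line of the training file is treated as one example, so each song has to sit on a single line. A minimal sketch of what the replacement does, on a made-up lyric (the string below is a toy example, not taken from the dataset):

```python
# Toy lyric with real line breaks (made up for illustration).
lyric = "first line \nsecond line \nthird line"

# Same replacement as in the cell above: every newline becomes a literal \s marker.
flattened = lyric.replace('\n', '\\s ')

print(flattened)               # first line \s second line \s third line
print(flattened.count('\\s'))  # 2
print('\n' in flattened)       # False
```

The whole song now fits on one line, and the \s markers can be turned back into line breaks after generation.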

In [9]:
df.head()

Unnamed: 0,artist,song,link,text
13163,Modern Talking,10 Seconds To Countdown,/m/modern+talking/10+seconds+to+countdown_2048...,"You are the best, for all of times \s You bea..."
13164,Modern Talking,America,/m/modern+talking/america_20094723.html,"On Monday morning, ocean rain was falling \s ..."
13165,Modern Talking,Blackbird,/m/modern+talking/blackbird_20154356.html,"Just tomorrow, there's another day \s Another..."
13166,Modern Talking,Can't Get Enough,/m/modern+talking/cant+get+enough_20094601.html,"Oh, Juliet Is Crying All The Night \s She Doe..."
13167,Modern Talking,Can't Let You Go,/m/modern+talking/cant+let+you+go_20094740.html,You don't need nobody \s When you're down and...


All done!

In [12]:
os.makedirs('./data', exist_ok=True)  # makes directory for file storage; safe to re-run

In [13]:
# let's leave 5 songs for evaluation
df['text'][:-5].to_csv('./data/train.txt', index=False, header=False)
df['text'][-5:].to_csv('./data/test.txt', index=False, header=False)
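
One thing to keep in mind: to_csv applies CSV quoting, so lines that contain commas may end up wrapped in double quotes in the output files. If that bothers you, a plain-text writer is an alternative. This is only a sketch under my own assumptions: split_and_write is a hypothetical helper, and the songs list is a toy stand-in for df['text'].

```python
import os
import tempfile

def split_and_write(texts, out_dir, n_test=5):
    # Hypothetical helper: the last n_test texts go to test.txt,
    # everything before them to train.txt, one text per line.
    os.makedirs(out_dir, exist_ok=True)
    train, test = texts[:-n_test], texts[-n_test:]
    for name, rows in (('train.txt', train), ('test.txt', test)):
        with open(os.path.join(out_dir, name), 'w', encoding='utf-8') as f:
            f.write('\n'.join(rows) + '\n')
    return len(train), len(test)

# Toy stand-in for df['text']: 144 single-line "songs".
songs = ['song {} \\s la la'.format(i) for i in range(144)]

with tempfile.TemporaryDirectory() as tmp:
    n_train, n_test = split_and_write(songs, tmp)
    with open(os.path.join(tmp, 'train.txt'), encoding='utf-8') as f:
        n_lines = sum(1 for _ in f)

print(n_train, n_test, n_lines)  # 139 5 139
```

With 144 songs and 5 held out, that leaves 139 training lines, one per song.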

## B. Fine tune

Let's train the model for some lyrics generation then. Yes, you could rewrite the whole Hugging Face script, but what for? They did a great job with it, so let's use it.
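
Before launching the run, it can help to estimate how many optimizer steps to expect. A rough sketch, assuming 139 training lines (144 songs minus the 5 held out), a batch size of 1, 10 epochs and no gradient accumulation, which is what the command in this section uses. expected_steps is a hypothetical helper of mine, not part of the script.

```python
import math

def expected_steps(n_examples, batch_size, epochs, grad_accum=1):
    # Assumes the last partial batch of each epoch is kept.
    batches_per_epoch = math.ceil(n_examples / batch_size)
    return (batches_per_epoch // grad_accum) * epochs

total = expected_steps(n_examples=139, batch_size=1, epochs=10)
print(total)         # 1390
print(total // 500)  # 2, i.e. checkpoints at steps 500 and 1000 when saving every 500 steps
```

These figures line up with the global_step = 1390 and checkpoint-500 lines in the training log further down.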

The ! prefix lets Jupyter run bash commands.

In [14]:
!python run_language_modeling.py \
    --output_dir=output \
    --overwrite_output_dir \
    --model_type=gpt2 \
    --model_name_or_path=distilgpt2 \
    --line_by_line \
    --do_train \
    --train_data_file=./data/train.txt \
    --do_eval \
    --per_gpu_train_batch_size=1 \
    --num_train_epochs=10 \
    --eval_data_file=./data/test.txt

04/03/2020 16:16:52 - INFO - filelock -   Lock 139787331353360 acquired on /root/.cache/torch/transformers/eb0f77b3f095880586731f57e2fe19060d71d1036ef8daf727bd97a17fb66a43.a41f80bd12c111d611dcd5546611b7e47c16a0a995f83df2f7b437a20b6849b5.lock
04/03/2020 16:16:52 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-config.json not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpd1arxtvh
Downloading: 100%|██████████████████████████████| 622/622 [00:00<00:00, 282kB/s]
04/03/2020 16:16:52 - INFO - transformers.file_utils -   storing https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-config.json in cache at /root/.cache/torch/transformers/eb0f77b3f095880586731f57e2fe19060d71d1036ef8daf727bd97a17fb66a43.a41f80bd12c111d611dcd5546611b7e47c16a0a995f83df2f7b437a20b6849b5
04/03/2020 16:16:52 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/eb0f77

04/03/2020 16:17:25 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=1024, cache_dir=None, config_name=None, device=device(type='cuda'), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='./data/test.txt', evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=5e-05, line_by_line=True, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=-1, mlm=False, mlm_probability=0.15, model_name_or_path='distilgpt2', model_type='gpt2', n_gpu=1, no_cuda=False, num_train_epochs=10.0, output_dir='output', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=1, save_steps=500, save_total_limit=None, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name=None, train_data_file='./data/train.txt', warmup_steps=0, weight_decay=0.0)
04/03/2020 16:17:25 - INFO - __main__ -   Creating features from dataset f


04/03/2020 16:18:15 - INFO - __main__ -   Saving optimizer and scheduler states to output/checkpoint-500

Epoch:  50%|██████████████████▌                  | 5/10 [01:09<01:09, 13.86s/it]
Iteration: 100%|██████████████████████████████| 139/139 [00:13<00:00, 10.33it/s][A
Epoch: 100%|████████████████████████████████████| 10/10 [02:18<00:00, 13.81s/it]
04/03/2020 16:19:43 - INFO - __main__ -    global_step = 1390, average loss = 1.788906208130953
04/03/2020 16:19:43 - INFO - __main__ -   Saving model checkpoint to output
04/03/2020 16:19:43 - INFO - transformers.configuration_utils -   Configuration saved in output/config.json
04/03/2020 16:19:44 - INFO - transformers.modeling_utils -   Model weights saved in output/pytorch_model.bin
04/03/2020 16:19:44 - INFO - transformers.configuration_utils -   loading configuration file output/config.json
04/03/2020 16:19:44 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "decoder_start_token_id": null,
  "do_sample": false,
  "early_stopping": false,
  "embd_

## C. A bit of playing around

You can see that I use a slightly adjusted script. It is still the same generation script provided by Hugging Face; all I changed is the output formatting, so that the \s tokens we inserted are turned back into new lines. That's all.
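
The adjusted script itself is not shown here, so the following is only a sketch of the kind of post-processing I mean. restore_line_breaks is a hypothetical name, and the input is a toy string rather than real model output.

```python
import re

def restore_line_breaks(generated):
    # Split on the literal backslash-s marker, swallowing the spaces around it,
    # then join the non-empty pieces with real newlines.
    pieces = re.split(r'\s*\\s\s*', generated)
    return '\n'.join(piece for piece in pieces if piece)

sample = 'first line \\s second line \\s third line \\s '
print(restore_line_breaks(sample))
# first line
# second line
# third line
```

Dropping empty pieces means a trailing marker at the end of the generated text does not leave a dangling blank line.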

In [16]:
!python run_generation_formated.py \
    --model_type=gpt2 \
    --model_name_or_path=./output \
    --prompt "Brother Louie" \
    --repetition_penalty 1 \
    --seed 32 \
    --length 500

04/03/2020 16:24:36 - INFO - transformers.tokenization_utils -   Model name './output' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming './output' is a path, a model identifier, or url to a directory containing tokenizer files.
04/03/2020 16:24:36 - INFO - transformers.tokenization_utils -   Didn't find file ./output/added_tokens.json. We won't load it.
04/03/2020 16:24:36 - INFO - transformers.tokenization_utils -   loading file ./output/vocab.json
04/03/2020 16:24:36 - INFO - transformers.tokenization_utils -   loading file ./output/merges.txt
04/03/2020 16:24:36 - INFO - transformers.tokenization_utils -   loading file None
04/03/2020 16:24:36 - INFO - transformers.tokenization_utils -   loading file ./output/special_tokens_map.json
04/03/2020 16:24:36 - INFO - transformers.tokenization_utils -   loading file ./output/tokenizer_config.json
04/03/2020 16:24:36 - INFO - transformers.configuration_utils -   loading configuration file 