
# RNN for text generation


In this exercise, you'll unleash the hidden creativity of your computer, by letting it generate Country songs (yeehaw!). You'll train a character-level RNN-based language model, and use it to generate new songs.


### Special Note

Our Deep Learning course was packed with both theory and practice. In a short time, you've got to learn the basics of deep learning theory and get hands-on experience training and using pretrained DL networks, while learning PyTorch.  
Past exercises required a lot of work, and hopefully gave you a sense of the challenges and difficulties one faces when using deep learning in the real world. While the investment you've made in the course so far is enormous, we strongly encourage you to take a stab at this exercise.

Some songs contain no lyrics (for example, they just contain the text "instrumental"). Others include non-English characters. You'll often need to preprocess your data and make decisions as to what your network should actually get as input (think - how should you treat newline characters?)
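
As a rough illustration of the kind of decision described above, here is a minimal sketch of a filter that drops placeholder or mostly non-ASCII lyrics. The function name `is_usable_lyrics` and both thresholds are hypothetical and not part of the exercise; note that an ASCII check does not detect the language, so it will still let through songs in other Latin-script languages.

```python
import re

def is_usable_lyrics(lyrics, min_chars=50, min_ascii_ratio=0.9):
    """Heuristic filter; the thresholds are assumptions, tune them on your data."""
    text = lyrics.strip()
    # Drop placeholders such as "instrumental" or "[Instrumental]"
    if re.fullmatch(r"\W*instrumental\W*", text, flags=re.IGNORECASE):
        return False
    # Drop entries that are too short to learn anything from
    if len(text) < min_chars:
        return False
    # Drop songs made up mostly of non-ASCII characters
    ascii_chars = sum(ch.isascii() for ch in text)
    return ascii_chars / len(text) >= min_ascii_ratio
```

With pandas, something like `data[data['lyrics'].apply(is_usable_lyrics)]` would keep only the rows that pass the filter.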

More issues will probably pop up while you're working on this task. If you face technical difficulties or find a step in the process that takes too long, please let me know. It would also be great if you shared with the class any code you wrote that speeds up some of the work (for example, a data loader class or a parsed dataset).

## RNN for Text Generation
In this section, we'll use an LSTM to generate new songs. You can pick any genre you like, or just use all genres. You can even try to generate songs in the style of a certain artist - remember that the Metrolyrics dataset contains the author of each song. 

For this, we’ll first train a character-based language model. In class we mostly discussed using RNNs to predict the next word given the previous words, but as we mentioned, RNNs can also be used to learn sequences of characters.
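
To make the character-level setup concrete, here is a minimal sketch of how a text is turned into (context, next character) training pairs. The toy string and the window length `context_size` are chosen only for illustration.

```python
text = "hello\nworld"
context_size = 4  # hypothetical window length, for illustration only

# Each example pairs a window of characters with the character that follows it
pairs = [(text[i:i + context_size], text[i + context_size])
         for i in range(len(text) - context_size)]

for context, next_char in pairs:
    print(repr(context), "->", repr(next_char))
```

Note that the newline is just another character to predict, which is one way a model can learn where lines end.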

First, please go through the [PyTorch tutorial](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html) on generating family names. You can download a .py file or a Jupyter notebook with the entire code of the tutorial.

As a reminder of topics we've discussed in class, see Andrej Karpathy's popular blog post ["The Unreasonable Effectiveness of Recurrent Neural Networks"](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). You are also encouraged to view [this](https://gist.github.com/karpathy/d4dee566867f8291f086) vanilla implementation of a character-level RNN, written in numpy with just 100 lines of code, including the forward and backward passes.  

Other tutorials that might prove useful:
1. http://warmspringwinds.github.io/pytorch/rnns/2018/01/27/learning-to-generate-lyrics-and-music-with-recurrent-neural-networks/
1. https://github.com/mcleonard/pytorch-charRNN
1. https://github.com/spro/practical-pytorch/blob/master/char-rnn-generation/char-rnn-generation.ipynb

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import time
import os
import copy
import numpy as np
import pandas as pd
import re
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from sklearn.model_selection import train_test_split



device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## data preprocessing

In [2]:
# import dataset
parquet_file = 'https://github.com/omriallouche/ydata_deep_learning_2021/blob/master/data/metrolyrics.parquet?raw=true'
data = pd.read_parquet(parquet_file, engine='auto')
data

| | song | year | artist | genre | lyrics | num_chars | sent | num_words |
|---|---|---|---|---|---|---|---|---|
| 204182 | fully-dressed | 2008 | annie | Pop | [HEALY]\n[spoken] This is Bert Healy saying ..... | 1041 | healy spoken this bert healy saying singing he... | 826 |
| 6116 | surrounded-by-hoes | 2006 | 50-cent | Hip-Hop | [Chorus: repeat 2X] Even when I'm tryin to be ... | 1392 | chorus repeat x even i tryin low i recognized ... | 884 |
| 166369 | taste-the-tears-thunderpuss-remix | 2006 | amber | Pop | How could you cause me so much pain?\nAnd leav... | 1113 | how could cause much pain and leave heart rain... | 756 |
| 198416 | the-truth-will-set-me-free | 2006 | glenn-hughes | Rock | In a scarlet vision\nIn a velvet room\nI come ... | 779 | in scarlet vision in velvet room i come decisi... | 583 |
| 127800 | the-last-goodbye | 2008 | aaron-pritchett | Country | Sprintime in Savannah\nIt dont get much pretti... | 881 | sprintime savannah it dont get much prettier b... | 639 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 33205 | give-it-all-up-for-love | 2007 | bananarama | Pop | To all the men I knew before\nOld love letters... | 1159 | to men i knew old love letters drawer mean not... | 712 |
| 194149 | all-i-m-thinking-about-is-you | 2000 | billy-ray-cyrus | Rock | Well it's a twenty-five mile drive from here t... | 1094 | well twenty five mile drive town ther gray ski... | 676 |
| 11649 | bonsoir-mon-amour | 2015 | dalida | Pop | Tu viens de partir pour de longs mois, c'est l... | 455 | tu viens de partir pour de longs mois c est lo... | 426 |
| 252283 | i-m-not-gonna-miss-you | 2014 | glen-campbell | Pop | I'm still here, but yet I'm gone\nI don't play... | 527 | i still yet i gone i play guitar sing songs th... | 344 |


### clean text

In [3]:
def song_processing(song):
  # remove bracketed annotations such as [Chorus] or [spoken]
  song = re.sub( r"\[([^\]]+)\]", '', song)
  song = song.lower()
  # replace every character that is not a letter, one of , . ' ! or a newline with a space
  song = re.sub(r"[^a-zA-Z,.'!\n]", ' ', song)
  song = song.strip()
  return song

data['clean_song'] = data['lyrics'].apply(lambda x: song_processing(x))
# remove empty songs
data = data[data['clean_song'].str.len() > 0]
data['clean_song']

204182    this is bert healy saying ...\n hey, hobo man\...
6116      even when i'm tryin to be on the low, i'm reco...
166369    how could you cause me so much pain \nand leav...
198416    in a scarlet vision\nin a velvet room\ni come ...
127800    sprintime in savannah\nit dont get much pretti...
                                ...                        
33205     to all the men i knew before\nold love letters...
194149    well it's a twenty five mile drive from here t...
11649     tu viens de partir pour de longs mois, c'est l...
252283    i'm still here, but yet i'm gone\ni don't play...
11180     you are beyond any reproach now\nyou are so co...
Name: clean_song, Length: 49971, dtype: object
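
To see what these cleaning steps do to a single line, here is a small worked example that applies the same two regular expressions to a made-up string. The final step, which collapses runs of spaces, is an optional extra that is not part of the cell above.

```python
import re

sample = "[Chorus: repeat 2X] Even when I'm tryin' 2 be low-key!\nYeah"

no_tags = re.sub(r"\[([^\]]+)\]", "", sample)   # drop bracketed annotations
lowered = no_tags.lower()
# digits and '-' are outside the allowed set, so they become spaces
cleaned = re.sub(r"[^a-zA-Z,.'!\n]", " ", lowered).strip()

# Optional extra step (an assumption, not in the notebook): collapse repeated spaces
collapsed = re.sub(r" {2,}", " ", cleaned)

print(repr(cleaned))
print(repr(collapsed))
```

Replacing characters with spaces rather than deleting them keeps words apart ("low-key" becomes "low key"), at the cost of occasional runs of spaces where a digit or symbol stood alone.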

### parameters

In [4]:
BATCH_SIZE = 64
EMBEDDING_SIZE = 100
SEQUENCE_LENGTH = 1023
MAX_LENGTH = 1024

### Char tokenizer

In [5]:
class CharTokenizer:

    def __init__(self, char_to_int, max_length=1024, padding_token='@'):
      self.char_to_int = char_to_int
      self.int_to_char = {v: k for k, v in char_to_int.items()}
      self.padding_token = padding_token
      self.max_length = max_length

    def _pad(self, encoded_text):
      # Right-pad with the padding token id up to max_length, keeping the integer dtype
      num_elements_to_add = self.max_length - encoded_text.size(0)
      padding_value = self.char_to_int[self.padding_token]
      padding_tensor = torch.full((num_elements_to_add,), padding_value, dtype=encoded_text.dtype)
      return torch.cat((encoded_text, padding_tensor))


    def encode(self, text, do_padding=True):
      encoded_text = [self.char_to_int[char] for char in text]
      encoded_text = torch.tensor(encoded_text)
      if do_padding:
        if len(encoded_text) >= self.max_length:
          return encoded_text[:self.max_length]
        return self._pad(encoded_text)
      else:
        return encoded_text
    
    def decode(self, tokens):
      return ''.join([self.int_to_char[token] for token in tokens])



chars = "abcdefghijklmnopqrstuvwxyz .,'!\n"

# padding character 
char_to_int = {'@': 0}

for i, char in enumerate(chars):
    char_to_int[char] = i + 1

# keep only songs with at least MAX_LENGTH characters, so encode() truncates and never pads
songs_df = data[data['clean_song'].str.len() >= MAX_LENGTH]

tokenizer = CharTokenizer(char_to_int, max_length=MAX_LENGTH)
encoded_songs = songs_df['clean_song'].apply(lambda x: tokenizer.encode(x).numpy())

number_of_songs = len(encoded_songs)
encoded_songs = torch.tensor(np.concatenate(encoded_songs.values).reshape(number_of_songs, MAX_LENGTH))

# split to train test
train, test = train_test_split(encoded_songs, test_size=0.1, random_state=42)
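
As a quick sanity check of the vocabulary, here is a minimal round trip using plain dictionaries and lists instead of tensors. The `encode` helper below is a simplified stand-in written for this illustration, not the `CharTokenizer` class itself.

```python
chars = "abcdefghijklmnopqrstuvwxyz .,'!\n"
char_to_int = {'@': 0, **{ch: i + 1 for i, ch in enumerate(chars)}}
int_to_char = {i: ch for ch, i in char_to_int.items()}

def encode(text, max_length):
    """Simplified stand-in: truncate to max_length, then right-pad with the '@' id."""
    ids = [char_to_int[ch] for ch in text][:max_length]
    return ids + [char_to_int['@']] * (max_length - len(ids))

ids = encode("hey, you!", max_length=12)
decoded = ''.join(int_to_char[i] for i in ids)

print(len(char_to_int))  # vocabulary size, including the padding token
print(ids)
print(repr(decoded))
```

The vocabulary has 32 characters plus the padding token, which is why the model later uses 33 outputs.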

### Song dataset

In [6]:
class SongDataset(Dataset):
  def __init__(self, songs, seq_length=128):
    self.songs = songs
    self.seq_length = seq_length
  
  def __len__(self):
    return len(self.songs)
  
  def __getitem__(self, idx):
    song = self.songs[idx]
    input_seq = song[:self.seq_length].reshape(-1,1).int()
    target_seq = song[self.seq_length].long()
    return input_seq, target_seq
  
train_dataset = SongDataset(train, seq_length=SEQUENCE_LENGTH)
validation_dataset = SongDataset(test, seq_length=SEQUENCE_LENGTH)

songs_datasets = {'train': train_dataset,
                  'val': validation_dataset}

dataloaders = {
    'train': torch.utils.data.DataLoader(songs_datasets['train'], batch_size=BATCH_SIZE,
                                             shuffle=False, num_workers=2),
               
    'val': torch.utils.data.DataLoader(songs_datasets['val'], batch_size=BATCH_SIZE,
                                          shuffle=False, num_workers=2)
  }

dataset_sizes = {x: len(songs_datasets[x]) for x in ['train', 'val']}
print('dataset_sizes: ', dataset_sizes)


dataset_sizes:  {'train': 23037, 'val': 2560}
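
To make the shape of one training example explicit, the sketch below repeats the slicing from `__getitem__` on random stand-in data (the random tensor is an assumption used only to keep the snippet self-contained). With `SEQUENCE_LENGTH = 1023` and `MAX_LENGTH = 1024`, each song contributes exactly one example: its first 1023 characters as input and its 1024th character as the target.

```python
import torch

MAX_LENGTH = 1024
SEQUENCE_LENGTH = 1023

# Random stand-in for the encoded songs tensor, shape (number_of_songs, MAX_LENGTH)
songs = torch.randint(1, 33, (4, MAX_LENGTH))

song = songs[0]
input_seq = song[:SEQUENCE_LENGTH].reshape(-1, 1)  # shape (1023, 1)
target = song[SEQUENCE_LENGTH]                     # scalar: the 1024th character id

print(input_seq.shape, target.shape)
```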


## Train

### Training function

In [7]:
def train_model(model, dataloaders, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()

    # Init variables that will save info about the best model
    best_model_wts = copy.deepcopy(model.state_dict())
    best_loss = np.inf

    train_res= np.zeros((2,num_epochs))
    val_res=np.zeros((2,num_epochs))
    dict_res={'train':train_res, 'val':val_res}

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)
        hidden = model.init_hidden(BATCH_SIZE)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                # Set model to training mode (enables dropout)
                model.train()
            else:
                # Set model to evaluation mode (disables dropout); gradient tracking is switched off separately by torch.set_grad_enabled below
                model.eval()

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data
            for inputs, labels in dataloaders[phase]:
                # Prepare the inputs for GPU/CPU
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()
                
                # ===== forward pass ======
                with torch.set_grad_enabled(phase=='train'):
                    # If we're in train mode, we'll track the gradients to allow back-propagation
                    outputs, hidden = model(inputs, hidden)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # ==== backward pass + optimizer step ====
                    # This runs only in the training phase
                    if phase == 'train':
                        loss.backward() # Back-propagate: compute gradients of the loss w.r.t. the parameters
                        optimizer.step() # Update the parameters using the computed gradients
                        hidden = tuple([h.detach() for h in hidden])

                # Collect statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            if phase == 'train':
                # Adjust the learning rate based on the scheduler
                scheduler.step()  

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')

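            # Row 0 stores the loss and row 1 the accuracy of each epoch
            dict_res[phase][0,epoch]=epoch_loss
            dict_res[phase][1,epoch]=epoch_acc.item()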

            # Keep the results of the best model so far
            if phase == 'val' and epoch_loss < best_loss:
                best_loss = epoch_loss
                # deepcopy the model
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print(f'Training complete in {(time_elapsed // 60):.0f}m {(time_elapsed % 60):.0f}s')
    print(f'Best loss {round(best_loss, 4)}')

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model, dict_res

### Model

In [8]:
class CharModel(nn.Module):
    def __init__(self, number_of_outputs, embedding_size=64, hidden_size=128, num_layers=2, dropout_probability=0.1):
        super(CharModel, self).__init__()
        self.num_layers = num_layers
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(number_of_outputs, embedding_size, padding_idx=0)
        self.lstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True, dropout=dropout_probability)

        self.dropout = nn.Dropout(0.3)
        self.linear = nn.Linear(hidden_size, number_of_outputs)
    
    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        hidden = (weight.new(self.num_layers, batch_size, self.hidden_size).zero_(),
                   weight.new(self.num_layers, batch_size, self.hidden_size).zero_())

        return hidden

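    def forward(self, x, hidden):
        # x: (batch, seq_len, 1) character ids -> (batch, seq_len, 1, embedding_size)
        x = self.embedding(x)
        # drop the singleton dimension -> (batch, seq_len, embedding_size)
        x = x.view(x.shape[0], x.shape[1], x.shape[3])
        # NOTE: the `hidden` argument is not passed to the LSTM, so every call starts from a zero state
        x, hidden = self.lstm(x)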
        
        # take only the last output
        x = x[:, -1, :]
       
        # produce output
        x = self.linear(self.dropout(x))
        return x, hidden

number_of_outputs = len(tokenizer.char_to_int)
model = CharModel(number_of_outputs, embedding_size=EMBEDDING_SIZE, hidden_size=512, num_layers=2)
model

CharModel(
  (embedding): Embedding(33, 100, padding_idx=0)
  (lstm): LSTM(100, 512, num_layers=2, batch_first=True, dropout=0.1)
  (dropout): Dropout(p=0.3, inplace=False)
  (linear): Linear(in_features=512, out_features=33, bias=True)
)
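
To follow the tensor shapes through this architecture, here is a small sketch that rebuilds the three layers with the sizes printed above and pushes a random batch through them. The batch size and sequence length are arbitrary values chosen for the illustration, and the input is already two-dimensional, so the `view` step from `forward` is not needed here.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(33, 100, padding_idx=0)
lstm = nn.LSTM(100, 512, num_layers=2, batch_first=True, dropout=0.1)
linear = nn.Linear(512, 33)

tokens = torch.randint(1, 33, (2, 16))   # (batch, seq_len) of character ids
embedded = embedding(tokens)             # (2, 16, 100)
outputs, (h_n, c_n) = lstm(embedded)     # outputs: (2, 16, 512); h_n, c_n: (2, 2, 512)
logits = linear(outputs[:, -1, :])       # (2, 33): one score per character in the vocabulary

print(embedded.shape, outputs.shape, h_n.shape, logits.shape)
```

Only the output at the last time step is fed to the linear layer, so the model produces a single next-character prediction per input sequence.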

In [9]:
# If a GPU is available, make the model use it
torch.cuda.empty_cache()
model = model.to(device)
lr=0.001

criterion = nn.CrossEntropyLoss()
optimizer_ft = optim.Adam(model.parameters(), lr=lr, weight_decay=4e-4)

exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=10, gamma=0.2)

num_epochs = 20

model, dict_res = train_model(model,
                              dataloaders,
                              criterion,
                              optimizer_ft,
                              exp_lr_scheduler,
                              num_epochs=num_epochs)

Epoch 0/19
----------
train Loss: 2.4773 Acc: 0.2954
val Loss: 2.2669 Acc: 0.3348

Epoch 1/19
----------
train Loss: 2.1736 Acc: 0.3632
val Loss: 2.1577 Acc: 0.3621

Epoch 2/19
----------
train Loss: 2.0747 Acc: 0.3863
val Loss: 2.1003 Acc: 0.3832

Epoch 3/19
----------
train Loss: 2.0135 Acc: 0.4020
val Loss: 2.0643 Acc: 0.3879

Epoch 4/19
----------
train Loss: 1.9599 Acc: 0.4116
val Loss: 2.0252 Acc: 0.3914

Epoch 5/19
----------
train Loss: 1.9117 Acc: 0.4221
val Loss: 1.9982 Acc: 0.4066

Epoch 6/19
----------
train Loss: 1.8682 Acc: 0.4354
val Loss: 1.9764 Acc: 0.4156

Epoch 7/19
----------
train Loss: 1.8227 Acc: 0.4479
val Loss: 1.9742 Acc: 0.4133

Epoch 8/19
----------
train Loss: 1.7757 Acc: 0.4610
val Loss: 1.9716 Acc: 0.4145

Epoch 9/19
----------
train Loss: 1.7306 Acc: 0.4748
val Loss: 1.9820 Acc: 0.4160

Epoch 10/19
----------
train Loss: 1.5775 Acc: 0.5176
val Loss: 1.9497 Acc: 0.4289

Epoch 11/19
----------
train Loss: 1.5018 Acc: 0.5395
val Loss: 1.9502 Acc: 0.4383

Ep
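
The drop in training loss between epochs 9 and 10 in the log above coincides with the `StepLR` decay. As a rough sketch, the learning rate that this schedule (`step_size=10`, `gamma=0.2`, starting from `lr=0.001`) gives for each epoch can be computed by hand, assuming `scheduler.step()` is called once at the end of every epoch as in `train_model`:

```python
base_lr = 0.001
step_size = 10
gamma = 0.2
num_epochs = 20

# The learning rate is multiplied by gamma once every step_size epochs
lrs = [base_lr * gamma ** (epoch // step_size) for epoch in range(num_epochs)]

for epoch, lr in enumerate(lrs):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

Epochs 0 to 9 train with 0.001 and epochs 10 to 19 with 0.0002, which is consistent with the faster decrease in training loss right after epoch 9.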

## Generated songs

In [10]:
import torch.nn.functional as F

def generate_song(start_sequence, model, max_length=1024):

  model.eval()
  with torch.no_grad():
      current_seq = start_sequence
      hidden = model.init_hidden(1)
      
      # Generate text character by character
      for _ in range(max_length):
          # Re-encode the full sequence generated so far; the model reads it from scratch on every step
          input_tensor = tokenizer.encode(current_seq, do_padding=False).reshape((1, -1, 1)).int().to(device)
          output, hidden = model(input_tensor, hidden)
          probabilities = F.softmax(output, dim=1).squeeze()
          
          # Sample the next character based on the probabilities
          next_id = torch.multinomial(probabilities, num_samples=1).item()
          next_char = tokenizer.decode([next_id])
          # Append the sampled character to the generated sequence
          current_seq += next_char

  return current_seq
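
The sampling step in `generate_song` can be tried in isolation. Below is a minimal sketch on toy scores for a made-up 3-character vocabulary (the numbers are assumptions for illustration). It contrasts sampling from the softmax distribution, as done above, with always picking the most likely character.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.tensor([[2.0, 1.0, 0.1]])            # toy scores, shape (1, 3)
probabilities = F.softmax(logits, dim=1).squeeze()  # shape (3,), sums to 1

# Sampling: each character is drawn in proportion to its probability
sampled_id = torch.multinomial(probabilities, num_samples=1).item()
# Greedy alternative: always take the most likely character
greedy_id = torch.argmax(probabilities).item()

print(probabilities, sampled_id, greedy_id)
```

Sampling keeps the generated text varied, whereas greedy decoding is deterministic and tends to get stuck repeating the same characters.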
        


In [11]:
start_sequence = tokenizer.decode(test[0][0:128].int().numpy())

print(f"Start sequence: \n{start_sequence}")
generated_song = generate_song(start_sequence, model)
print()
print(f"Generated song: \n{generated_song}")

Start sequence: 
get up out your seat, the grand groove is back
with a beat to sink your teeth in like wolfman jack
i'm bringing in the swing lik

Generated song: 
get up out your seat, the grand groove is back
with a beat to sink your teeth in like wolfman jack
i'm bringing in the swing like an awand
and it rast g
tere the clist the s at matchiri's arouin
 heap that money, in trying, the ray
be voyy a stast a mot about
steating for's for and let use toon bood
oh,, rom take in i loved pryades the reand of a bed
gend your niggat time and tirn that like youq pust crouss no warnin' one waind you get out we ushin' a wishig shits

achin' don't leed me you don't got to every good tfee arving pall
my, i wet you sping worry the  came no my life of the doy
fhave out the moins leant, i'll neving eningin' free on the handy
jon't got got baby bindin' up
your misca, it's a lise
four huf all packs t leoming my smowy  now lat then all i manded been mymarn, the rusterfer why ok you, dieny i'mn time to

In [12]:
start_sequence = tokenizer.decode(test[1][0:128].int().numpy())

print(f"Start sequence: \n{start_sequence}")
generated_song = generate_song(start_sequence, model)
print()
print(f"Generated song: \n{generated_song}")

Start sequence: 
in my early years i hid my tears
and passed my days alone
adrift on an ocean of loneliness
my dreams like nets were thrown
to ca

Generated song: 
in my early years i hid my tears
and passed my days alone
adrift on an ocean of loneliness
my dreams like nets were thrown
to care de the timine at tating nigga fast's spir that shill a caan love, i've alwag but in hur veen your the blocke
thy way it you you wit, way to tome
whyund you meand ang thesh i icm na fed you can know, move your loog and heard out shifhis ,y notgit vuscared up the lieving time, ourine
but i more my hadn the dect
ace teand thing jass
but i loving dow bow dow now in a our sined you  aserof
and your fut i deed eneer daby
and reand, she@in's.ded ut proke me, dows to the dag a buf is hear, noto shatt!as
in
you're got it fleat ain's never gids
i'm whon to ptart dear the myoce to dratear
herus ous could bet down just awarn out 
runsy you're antta be wibde sweed your a is nothing you everywe unleared it pel

In [13]:
start_sequence = tokenizer.decode(test[2][0:128].int().numpy())

print(f"Start sequence: \n{start_sequence}")
generated_song = generate_song(start_sequence, model)
print()
print(f"Generated song: \n{generated_song}")

Start sequence: 
yeah! 
 woo! 
dig, dig, gravedigger
dig, gravedigger, dig
work that shovel with vigor, gravedigger
before rigor mortis sets in, 

Generated song: 
yeah! 
 woo! 
dig, dig, gravedigger
dig, gravedigger, dig
work that shovel with vigor, gravedigger
before rigor mortis sets in, norned you take you they been digga veaws woure not  now not in to all stand out heald the pain for lof me the tron
care
you chorgice if hap lo muor
no in bood your're go and rould that niggas heid in the mand
and here i'd lough wo know o busce
thing is in shigh rouch than loo
hild bust with me 
won't din't gift allin' up really all it know baw
you, sitis take to of
and af a miand fbleave
eve im me pand on what rin' ho
look anoh repmemame inny praugh
and you, were ther werre
kust about been it id you thin' this
not
i get it telvel the may or you fing togath
thing reriseve liaking brack mach
my cow like a doy
in the came the stife it
but we know haud talde and i mant
te look in if miere called so mer