
# RNN for text generation


In this exercise, you'll unleash the hidden creativity of your computer, by letting it generate Country songs (yeehaw!). You'll train a character-level RNN-based language model, and use it to generate new songs.


### Special Note

Our Deep Learning course was packed with both theory and practice. In a short time, you've learned the basics of deep learning theory and gained hands-on experience training and using pretrained DL networks, all while learning PyTorch.  
Past exercises required a lot of work, and hopefully gave you a sense of the challenges and difficulties one faces when using deep learning in the real world. While the investment you've made in the course so far is enormous, we strongly encourage you to take a stab at this exercise. 

Some songs contain no lyrics (for example, they contain only the text "instrumental"). Others include non-English characters. You'll often need to preprocess your data and decide what your network should actually get as input (for instance, how should you treat newline characters?).
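
One possible way to act on these observations is a small filter that rejects placeholder entries and songs dominated by characters outside a chosen alphabet. The sketch below is only an illustration: `is_usable_lyric`, `ALLOWED_CHARS`, and both thresholds are hypothetical names and values, not part of the dataset or the course code, and you may well want different rules.

```python
import string

# Characters the model is allowed to see (an assumption; adjust to taste).
# "\n" is kept so that line breaks survive as ordinary characters.
ALLOWED_CHARS = set(string.ascii_letters + string.digits + string.punctuation + " \n")


def is_usable_lyric(lyric, min_chars=50, max_foreign_ratio=0.05):
    """Heuristically decide whether `lyric` looks like real English lyrics."""
    text = lyric.strip()
    # Reject placeholders such as "instrumental" and other near-empty entries.
    if len(text) < min_chars:
        return False
    # Reject songs where too many characters fall outside the allowed set.
    foreign = sum(ch not in ALLOWED_CHARS for ch in text)
    return foreign / len(text) <= max_foreign_ratio


songs = [
    "instrumental",
    "On a hill, inside a house in Covewell Reach\nStands a man who's feeling very tired",
    "\u00e9\u00e0\u00fc " * 20,  # mostly accented (non-ASCII) characters
]
usable = [is_usable_lyric(song) for song in songs]
print(usable)
```

Only the second entry passes: the first is too short and the third consists mostly of characters outside `ALLOWED_CHARS`.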

More issues will probably pop up while you're working on this task. If you face technical difficulties or find a step in the process that takes too long, please let me know. It would also be great if you shared with the class any code you wrote that speeds up some of the work (for example, a data loader class or a parsed dataset).

## RNN for Text Generation
In this section, we'll use an LSTM to generate new songs. You can pick any genre you like, or just use all genres. You can even try to generate songs in the style of a certain artist: the Metrolyrics dataset lists the artist of each song. 

For this, we’ll first train a character-based language model. In class we mostly discussed using RNNs to predict the next word given the preceding words, but as we mentioned, RNNs can also learn sequences of characters.
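
To make the character-level setup concrete, here is a minimal sketch of how training pairs could be built: each target sequence is the input sequence shifted by one character. The names `char_to_idx`, `idx_to_char`, and `make_char_pairs` and the toy text are assumptions made for illustration, not part of the course material.

```python
text = "hello\nworld"

# Build the character vocabulary. "\n" is kept as an ordinary token so the
# model can learn where lines end.
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}


def make_char_pairs(text, seq_len):
    """Yield (input, target) index lists; the target is the input shifted by one character."""
    encoded = [char_to_idx[ch] for ch in text]
    for start in range(len(encoded) - seq_len):
        x = encoded[start:start + seq_len]
        y = encoded[start + 1:start + seq_len + 1]
        yield x, y


pairs = list(make_char_pairs(text, seq_len=4))
first_x, first_y = pairs[0]
print("".join(idx_to_char[i] for i in first_x))  # input characters
print("".join(idx_to_char[i] for i in first_y))  # same window, one step later
```

At every position the network sees the characters so far and is trained to predict the next one, exactly as in the word-level case but with a much smaller vocabulary.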

First, please go through the [PyTorch tutorial](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html) on generating family names. You can download a .py file or a Jupyter notebook with the entire code of the tutorial. 

As a reminder of topics we've discussed in class, see Andrej Karpathy's popular blog post ["The Unreasonable Effectiveness of Recurrent Neural Networks"](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). You are also encouraged to view [this](https://gist.github.com/karpathy/d4dee566867f8291f086) vanilla implementation of a character-level RNN, written in numpy with just 100 lines of code, including the forward and backward passes.  

Other tutorials that might prove useful:
1. http://warmspringwinds.github.io/pytorch/rnns/2018/01/27/learning-to-generate-lyrics-and-music-with-recurrent-neural-networks/
1. https://github.com/mcleonard/pytorch-charRNN
1. https://github.com/spro/practical-pytorch/blob/master/char-rnn-generation/char-rnn-generation.ipynb

In [2]:
import random
import re
import string

import nltk
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import WordPunctTokenizer
from tqdm.auto import tqdm

In [3]:
df_lyrics = pd.read_parquet('metrolyrics.parquet')
df_lyrics = df_lyrics.reset_index(drop=True)
df_lyrics.head()

|  | song | year | artist | genre | lyrics | num_chars | sent | num_words |
|---|---|---|---|---|---|---|---|---|
| 0 | fully-dressed | 2008 | annie | Pop | [HEALY]\n[spoken] This is Bert Healy saying ..... | 1041 | healy spoken this bert healy saying singing he... | 826 |
| 1 | surrounded-by-hoes | 2006 | 50-cent | Hip-Hop | [Chorus: repeat 2X] Even when I'm tryin to be ... | 1392 | chorus repeat x even i tryin low i recognized ... | 884 |
| 2 | taste-the-tears-thunderpuss-remix | 2006 | amber | Pop | How could you cause me so much pain?\nAnd leav... | 1113 | how could cause much pain and leave heart rain... | 756 |
| 3 | the-truth-will-set-me-free | 2006 | glenn-hughes | Rock | In a scarlet vision\nIn a velvet room\nI come ... | 779 | in scarlet vision in velvet room i come decisi... | 583 |
| 4 | the-last-goodbye | 2008 | aaron-pritchett | Country | Sprintime in Savannah\nIt dont get much pretti... | 881 | sprintime savannah it dont get much prettier b... | 639 |


In [4]:
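def process_text(lyric):
    # Expects lowercased text. Keeps only the letters a-z and spaces;
    # apostrophes are dropped too, so "don't" becomes "dont".
    return re.sub("[^a-z ]", "", lyric)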

In [5]:
text = "\n".join(df_lyrics[df_lyrics.artist=="bee-gees"].lyrics.tolist())
text = text.replace("\n\n", "\n")
text.splitlines()[:10]

['On a hill, inside a house in Covewell Reach',
 "Stands a man who's feeling very tired",
 'Looking at a song he wrote some time ago',
 'Could have made it big inside a Broadway show',
 'Every day I go away and find ideas',
 "Think I'll climb on top of somewhere high",
 "Couldn't I write a song about a man who's dead?",
 "Didn't really know if he was off his head",
 "Ev'rybody knows, that's the way it goes",
 'Too bad for Gilbert Green']

In [6]:
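lyrics = text.lower().split("\n")
# np.unique sorts and deduplicates the lines; [1:] then drops the first entry,
# which is the empty string whenever the text contains a blank line.
lyrics = np.unique(lyrics)[1:].tolist()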

clean_lyrics = [process_text(lyric) for lyric in lyrics]

In [7]:
clean_lyrics[:5], lyrics[:5]

([' each night before we go to sleep',
  'i love you',
  'cause i believed in you',
  'cause i dont wanna feel the pain anymore',
  'cause i know it isnt heaven is it love or hate'],
 [' each night before we go to sleep',
  '"i love you"',
  "'cause i believed in you",
  "'cause i don't wanna feel the pain anymore",
  "'cause i know it isn't heaven is it love or hate"])

In [8]:
def generate_n_grams(words, n_gram_size):
    """Return every run of `n_gram_size` consecutive words in `words`, in order."""
    if n_gram_size <= 0:
        raise ValueError("n_gram_size should be higher than zero!")

    tokens = words.split()
    # Lines shorter than one n-gram are returned unchanged as a single phrase.
    if len(tokens) < n_gram_size:
        return [words]

    return [" ".join(tokens[i:i + n_gram_size])
            for i in range(len(tokens) - n_gram_size + 1)]

In [9]:
generate_n_grams(clean_lyrics[0], 2)

['each night', 'night before', 'before we', 'we go', 'go to', 'to sleep']

In [10]:
n_gram_size = 4
n_grams = [generate_n_grams(lyric, n_gram_size) for lyric in clean_lyrics]
phrases = np.unique(np.array(sum(n_grams, []))).tolist()

distinct_words = np.unique(np.array(" ".join(phrases).split(" ")))
distinct_words_idx = np.arange(distinct_words.size)
word_to_idx = dict(zip(distinct_words.tolist(), distinct_words_idx.tolist()))
idx_to_word = dict(zip(distinct_words_idx.tolist(), distinct_words.tolist()))
vocabulary_size = len(word_to_idx)

In [11]:
vocabulary_size, phrases[:10],phrases[-10:], list(word_to_idx.items())[:10]

(2136,
 ['',
  'a afallin for you',
  'a bad girl your',
  'a beat of a',
  'a bed of leaves',
  'a bit close to',
  'a body to behold',
  'a body you dream',
  'a boy all the',
  'a brave new world'],
 ['youve got nothing to',
  'youve got the best',
  'youve got the first',
  'youve got to be',
  'youve got to find',
  'youve got to live',
  'youve got to wear',
  'youve nothing to hide',
  'youve shown it inside',
  'youve stayed with other'],
 [('', 0),
  ('a', 1),
  ('aah', 2),
  ('able', 3),
  ('aboard', 4),
  ('about', 5),
  ('above', 6),
  ('accused', 7),
  ('ace', 8),
  ('aches', 9)])

In [13]:
x_word = []
y_word = []

for phrase in phrases:
    if (len(phrase.split()) != n_gram_size):
        continue
    
    x_word.append(" ".join(phrase.split()[:-1]))
    y_word.append(" ".join(phrase.split()[1:]))

In [14]:
x_word[:3], y_word[:3]

(['a afallin for', 'a bad girl', 'a beat of'],
 ['afallin for you', 'bad girl your', 'beat of a'])

In [15]:
def get_phrase_idx(phrase):
    return [word_to_idx[word] for word in phrase.split()]

In [16]:
x_idx = np.array([get_phrase_idx(word) for word in x_word])
y_idx = np.array([get_phrase_idx(word) for word in y_word])

In [17]:
x_idx, y_idx, y_idx.shape, x_idx.shape

(array([[   1,   20,  670],
        [   1,  103,  721],
        [   1,  117, 1231],
        ...,
        [2134, 1222, 1874],
        [2134, 1581,  923],
        [2134, 1695, 2066]]),
 array([[  20,  670, 2121],
        [ 103,  721, 2129],
        [ 117, 1231,    1],
        ...,
        [1222, 1874,  836],
        [1581,  923,  907],
        [1695, 2066, 1261]]),
 (11935, 3),
 (11935, 3))

In [18]:
class LSTM(nn.Module):
    def __init__(self, hidden_layers, num_layers, embedding_size, drop_prob, lr):
        super().__init__()
        self.drop_prob = drop_prob
        self.num_layers = num_layers
        self.hidden_layers = hidden_layers
        self.lr = lr
        self.embedded = nn.Embedding(vocabulary_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_layers, num_layers, dropout = drop_prob, batch_first = True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(hidden_layers, vocabulary_size)      
    
    def forward(self, x, hidden):
        # x: (batch, seq_len) word indices -> embedded: (batch, seq_len, embedding_size)
        embedded = self.embedded(x)
        lstm_output, hidden = self.lstm(embedded, hidden)
        # Flatten to (batch * seq_len, hidden size) so every position gets its own row of logits.
        dropout_out = self.dropout(lstm_output).reshape(-1, self.hidden_layers)
        out = self.fc(dropout_out)
        return out, hidden

    def init_hidden(self, batch_size):
        # Zero-initialised (h_0, c_0) with the same device and dtype as the model weights.
        weight = next(self.parameters())
        shape = (self.num_layers, batch_size, self.hidden_layers)
        return (weight.new_zeros(shape), weight.new_zeros(shape))

In [19]:
device = "cuda" if torch.cuda.is_available() else "cpu"
hidden_layers = 256
num_layers = 4
embedding_size = 200
drop_prob = 0.3
lr = 0.001
batch_size = 32

model = LSTM(hidden_layers, num_layers, embedding_size, drop_prob, lr).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr = lr)
loss_func = nn.CrossEntropyLoss()
model.train();

In [20]:
model

LSTM(
  (embedded): Embedding(2136, 200)
  (lstm): LSTM(200, 256, num_layers=4, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=2136, bias=True)
)

In [21]:
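def next_batch(x, y, batch_size):
    # Yield full batches only; a trailing partial batch is skipped because the
    # hidden state is created for exactly `batch_size` sequences.
    # The "+ 1" keeps the last batch when the data size is an exact multiple of batch_size.
    for itr in range(batch_size, x.shape[0] + 1, batch_size):
        batch_x = x[itr - batch_size:itr, :]
        batch_y = y[itr - batch_size:itr, :]
        yield batch_x, batch_y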

In [22]:
def train(num_epochs):
    model.train()  # generate_lyrics() switches to eval mode, so re-enable dropout here
    for epoch in tqdm(range(num_epochs)):
        hidden_layer = model.init_hidden(batch_size)
        for x, y in next_batch(x_idx, y_idx, batch_size):
            inputs = torch.from_numpy(x).long().to(device)
            act = torch.from_numpy(y).long().to(device)
            model.zero_grad()
            # The returned state is discarded: the phrases in consecutive batches are
            # unrelated, so every batch starts from the same zero state.
            output, _ = model(inputs, hidden_layer)
            loss = loss_func(output, act.view(-1))
            loss.backward()
            # Clip the gradient norm to 1 to guard against exploding gradients.
            nn.utils.clip_grad_norm_(model.parameters(), 1)
            optimizer.step()

In [23]:
train(num_epochs = 10)


In [24]:
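def softmax(x, dim=-1, temperature = 1.):
    # Temperature-scaled softmax. torch.softmax is numerically stable and
    # normalises along `dim`, so it also works for batches larger than one.
    return torch.softmax(x / temperature, dim=dim)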

In [25]:
def predict(model, token, hidden_layer):
    # Unknown words fall back to index 0 (the empty-string entry of word_to_idx here),
    # because len(word_to_idx) would be out of range for the embedding layer.
    x = np.array([[word_to_idx.get(token, 0)]])
    inputs = torch.from_numpy(x).type(torch.LongTensor).to(device)
    hidden = tuple([layer.data for layer in hidden_layer])
    out, hidden = model(inputs, hidden)
    prob = softmax(out, dim=1, temperature=np.random.choice([0.5,0.2,0.1], p=[0.5, 0.25, 0.25]))
    prob = prob.detach().cpu().numpy()
    prob = prob.reshape(prob.shape[1],)
    # Greedy decoding: always take the most probable word. The temperature rescales
    # the probabilities but does not change which word ranks first.
    top_tokens = prob.argsort()[-3:][::-1]
    selected_index = top_tokens[0]

    return idx_to_word[selected_index], hidden
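
Because `predict` always returns the single most probable word, the temperature it draws has no effect on the result, and greedy decoding of this kind tends to fall into repetitive loops. A common alternative is to sample the next token from the temperature-scaled distribution. The sketch below shows the idea in plain Python; `sample_with_temperature` is a hypothetical helper, not part of this notebook, and it works on a list of raw scores rather than on the model's tensors.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    # random.choices normalises the weights itself, so no explicit division is needed.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.1]
warm = [sample_with_temperature(logits, temperature=1.0, rng=rng) for _ in range(1000)]
cold = [sample_with_temperature(logits, temperature=0.01, rng=rng) for _ in range(1000)]
print(sorted(set(warm)), sorted(set(cold)))
```

A temperature near zero behaves almost like the greedy choice, while higher temperatures spread the probability mass and produce more varied (and more error-prone) text.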

In [26]:
def generate_lyrics(model, limit_words, start_text):
    model.eval()
    hidden = model.init_hidden(1)
    tokens = start_text.split()
    
    for token in start_text.split():
        curr_token, hidden = predict(model, token, hidden)
    
    tokens.append(curr_token)
    
    for token_num in range(limit_words - 1):
        token, hidden = predict(model, tokens[-1], hidden)
        tokens.append(token)
        
    return " ".join(tokens)

In [27]:
corpus = df_lyrics[df_lyrics.artist=="bee-gees"]
corpus.head()

|  | song | year | artist | genre | lyrics | num_chars | sent | num_words |
|---|---|---|---|---|---|---|---|---|
| 109 | gilbert-green | 2006 | bee-gees | Pop | On a hill, inside a house in Covewell Reach\nS... | 1000 | on hill inside house covewell reach stands man... | 671 |
| 142 | if-i-can-t-have-you-remix | 2007 | bee-gees | Pop | Don't know why\nI'm surviving every lonely day... | 1198 | don know i surviving every lonely day when got... | 674 |
| 486 | cover-you | 2006 | bee-gees | Pop | You could read her lips well as I was able\nSh... | 1086 | you could read lips well i able she taking i l... | 671 |
| 739 | night-fever-remix | 2007 | bee-gees | Pop | Listen to the ground\nThere is movement all ar... | 1503 | listen ground there movement around there some... | 982 |
| 1154 | the-love-of-a-woman | 2006 | bee-gees | Pop | When the day is done , and the night is near\n... | 565 | when day done night near happiness gone and ga... | 365 |


In [28]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
tokens = [w for s in corpus.sent.tolist() for w in WordPunctTokenizer().tokenize(s)]
word_fd = nltk.FreqDist(tokens)
bigram_fd = nltk.FreqDist(nltk.bigrams(tokens))
finder = BigramCollocationFinder(word_fd, bigram_fd)
collocations = finder.score_ngrams(bigram_measures.raw_freq)
collocations = pd.DataFrame(collocations)
collocations.columns = ["col", "score"]

In [29]:
corpus.sample(1).lyrics

45783    Is this your voice I heard\nSpeakin' my name\n...
Name: lyrics, dtype: object

In [30]:
def get_lyric(start_text, limit_lines):
    count_words = list(map(lambda s:s.count(" "), next(iter(corpus.sample(1).lyrics)).split("\n")))
    lines = [generate_lyrics(model, count_words[0], start_text.lower())]

    for i in range(limit_lines - 1):
        start_text = " ".join(next(iter(collocations.sample(1)["col"])))
        lines.append(generate_lyrics(model, count_words[i%len(count_words)], start_text.lower()))

    return "\n".join(lines)

In [31]:
print(get_lyric("this way", 5))

this way i know i know i know i know i
born one girl and fade as you know that i know
break their love and make me
face angel on the love is true and fade
breath body love and make me cry and the love


In [32]:
print(get_lyric("this way", 5))

this way i know i know
crime and i know i know
am i know i know i know
ends i know i know i
wise it not to be a love in the love


In [33]:
print(get_lyric("nebraska", 5))

nebraska i know
soul you know i
sever love and be a love and
town you very alone and i know
it burning to be a love in


### Final Tips
As a final tip, we encourage you to do most of the work on your local machine first. They say that Data Scientists spend 80% of their time cleaning the data and preparing it for training (and 20% complaining about cleaning the data and preparing it). Handling these parts on your local machine usually means you will spend less time complaining. You can switch to the cloud once your code runs and your pipeline is in place, for the actual training using a GPU.  

We also encourage you to use a small subset of the dataset first, so things run smoothly. The Metrolyrics dataset contains over 300k songs. You can start with a much smaller set (even 3,000 songs) and try to train a network based on it. Once everything runs properly, add more data. 
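
As a rough illustration of working on a subset, the sketch below draws a reproducible random sample from a list of songs. `take_subset` and the placeholder song names are hypothetical; the fixed seed is there only so that repeated runs train on the same songs.

```python
import random


def take_subset(songs, n, seed=0):
    """Return up to `n` songs chosen at random, reproducibly for a given seed."""
    if n >= len(songs):
        return list(songs)
    # A dedicated Random instance keeps the sample independent of global random state.
    return random.Random(seed).sample(songs, n)


all_songs = [f"song-{i}" for i in range(300)]  # stand-in for the real lyrics
subset = take_subset(all_songs, 30)
print(len(subset))
```

With the DataFrame loaded in this notebook, `df_lyrics.sample(n=3000, random_state=0)` achieves the same thing directly in pandas.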

Good luck!  

---
#### This exercise was originally written by Dr. Omri Allouche.