# IZ*ONE Lyric Generator

Welcome to the IZ*ONE Lyric Generator notebook! This notebook walks through preprocessing the lyric data and training an LSTM network on it. First, let's import the necessary libraries:

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

Next, we read the .txt file from the data directory using Python's built-in open() function. **Don't forget to set the encoding to 'utf-8' when the lyrics contain non-English characters.**

In [3]:
# read in the lyrics text file
with open('./data/lyrics.txt', encoding='utf-8') as f:
    raw_lyrics = f.read()
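
To see why the encoding matters, here is a small sketch of what happens when UTF-8 bytes are decoded with the wrong codec. The sample phrase comes from the lyric preview; the two alternative codecs (`latin-1` and `ascii`) are only illustrative choices, not what any particular platform would pick.

```python
# A Korean phrase from the lyrics, encoded the way it is stored on disk
toy_text = '아름다운 색'
toy_bytes = toy_text.encode('utf-8')

# Decoding with the right codec round-trips exactly
roundtrip_ok = toy_bytes.decode('utf-8') == toy_text

# A single-byte codec accepts every byte but silently produces garbled text
garbled = toy_bytes.decode('latin-1')

# ASCII cannot represent these bytes at all and raises an error
try:
    toy_bytes.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(roundtrip_ok, garbled == toy_text, ascii_ok)
```

When no encoding is passed, open() falls back to a platform-dependent default, so stating 'utf-8' explicitly keeps the notebook reproducible across machines.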

## 1. Data Statistics

It is always good practice to take a close look at your data before modeling it.

In [4]:
# data statistics
lyrics_per_line = raw_lyrics.split('\n')
word_count_line = [len(line.split()) for line in lyrics_per_line]

print('Total number of lines:', len(lyrics_per_line))
print('Total number of unique words (roughly):', len(set(raw_lyrics.split())))
print('Average number of words in a line:', np.average(word_count_line))
print('The least number of words in a line:', np.min(word_count_line))
print('The most number of words in a line:', np.max(word_count_line))
print()

view_range = 20
print('Lyric preview:')
print('\n'.join(raw_lyrics.split('\n')[:view_range]))

Total number of lines: 1792
Total number of unique words (roughly): 1964
Average number of words in a line: 3.763392857142857
The least number of words in a line: 0
The most number of words in a line: 13

Lyric preview:
Have you ever seen anything?
아름다운 색, 아름다운 색, 아름다운 색
Have you ever seen this color?
아름다운 색, 아름다운 다운 다운 다운
Have you ever seen anything?
아름다운 색, 아름다운 색, 아름다운 색
Have you ever seen this color?
아름다운 색, 아름다운 다운 다운 다운

끌리네 그 누구와도 다르게
변하고 싶어 나
너를 바라보면서 yeah
너를 알아가면서 yeah

상상이 내 감정을 더 움직여
열두 가지 색색깔의 무지개
나는 과연 어떤 색일까
우리 더 빛나게 해볼까

천천히 하나 둘 그리는 하얀 종이 위에


In [12]:
from collections import Counter

counter = Counter(raw_lyrics.split())
sorted_count = sorted(counter, key=counter.get, reverse=True)
print('Top 10 words:', sorted_count[:10])

Top 10 words: ['oh', 'me', '더', '이', 'I', '내꺼', '내', 'da', 'you', 'so']


Another useful thing to check when analyzing the data is which punctuation marks appear in it. In tasks such as sentiment analysis (e.g. predicting whether a review is good or bad), punctuation usually gets little attention and is often removed right away. For a lyric generator, I decided to keep it.

In [5]:
from string import punctuation

def check_punctuations(lyrics):
    """
    Check which punctuation marks appear in the lyrics
    
    # Arguments
        lyrics: String, input lyrics
    
    # Output:
        (flag, punct_list): boolean flag and list of punctuation marks found in the lyrics
    """
    
    # keep only the punctuation marks that occur at least once in the given lyrics
    punct_list = [p for p in punctuation if p in lyrics]
    flag = len(punct_list) > 0

    return (flag, punct_list)

In [6]:
check_punctuations(raw_lyrics)

(True, ['!', "'", '(', ')', ',', '-', '/', '?'])

## 2. Data Preprocessing

This is the most important part of the project. Before feeding the lyrics to the model, we need to transform them into a numeric form the model can work with.

<ul>
    <li><code>create_lookup_tables</code> maps each word to an integer and back, assigning the smallest indices to the most frequent words</li>
    <li><code>create_token_lookup</code> assigns a dedicated token to each punctuation mark so that punctuation is treated as a separate word</li>
</ul>
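
As a rough illustration of what these two steps produce, here is a toy run on a made-up two-line lyric. The `toy_` names and the three-entry token table are assumptions for this sketch only; the real functions are defined in the next cell.

```python
from collections import Counter

toy_raw = "Oh, oh!\nOh my"
# reduced, hypothetical token table (the notebook's table covers more symbols)
toy_tokens = {',': '<COMMA>', '!': '<EXCLAMATION_MARK>', '\n': '<NEW_LINE>'}

# surround each symbol's token with spaces so it becomes its own word
for symbol, token in toy_tokens.items():
    toy_raw = toy_raw.replace(symbol, ' {} '.format(token))

# lowercasing happens after the replacement, so the tokens end up lowercase too
toy_words = toy_raw.lower().split()

# the most frequent word gets index 0; ties keep their first-seen order
toy_counts = Counter(toy_words)
toy_sorted = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: idx for idx, word in enumerate(toy_sorted)}
toy_int_to_vocab = {idx: word for word, idx in toy_vocab_to_int.items()}

toy_encoded = [toy_vocab_to_int[word] for word in toy_words]
print(toy_words)
print(toy_encoded)
```

Because the tokens are lowercased along with the lyrics, any later step that maps tokens back to punctuation has to look for the lowercase form.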

In [35]:
from collections import Counter

def create_lookup_tables(lyrics):
    """
    Creates 2 dictionaries which are lookup tables to store:
    - word to integer
    - integer to word
    
    # Arguments:
        lyrics: List, raw lyrics that split into individual words
    
    # Output:
        Tuple of (vocab_to_int, int_to_vocab)
            vocab_to_int: dictionary which maps word to integer
            int_to_vocab: dictionary which maps integer to word
    """
    
    # creates a word counter by using the Counter class
    word_count = Counter(lyrics)
    # sorts them by the word frequencies in descending order
    sorted_word_count = sorted(word_count, key=word_count.get, reverse=True)
    # creates a dictionary that maps words to indexes
    vocab_to_int = {word: idx for idx, word in enumerate(sorted_word_count)}
    # creates a dictionary that maps indexes back to their respective words
    int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}
    
    return (vocab_to_int, int_to_vocab)

def create_token_lookup():
    """
    Creates a lookup table that maps punctuation marks to tokens
    
    # Arguments:
        None
        
    # Output:
        Dictionary which maps each punctuation mark (and the newline character) to its token
    """
    
    # list of punctuation marks/special characters
    punctuations = ['!', "'", '(', ')', ',', '-', '/', '?', '\n']
    # list of tokens --> THE VALUES HAVE TO BE IN THE SAME ORDER AS THE PUNCTUATION LIST!
    tokens = ['<EXCLAMATION_MARK>', '<SINGLE_QUOTATION_MARK>', '<LEFT_ROUND_BRACKET>', '<RIGHT_ROUND_BRACKET>',
              '<COMMA>', '<HYPHEN>', '<SLASH>', '<QUESTION_MARK>', '<NEW_LINE>']
    
    # pair each punctuation mark with the token at the same position
    punct_token = dict(zip(punctuations, tokens))
        
    return punct_token

In [36]:
# preprocess the data
PADDING = {'PADDING': '<PAD>'} # extra padding token for later when generating lyrics

# create token lookup
token_lookup = create_token_lookup()
# replace punctuations with their respective tokens
for symbol, token in token_lookup.items():
    raw_lyrics = raw_lyrics.replace(symbol, ' {} '.format(token))

tokenized_lyrics = raw_lyrics.lower() # convert lyrics to lower case letters
tokenized_lyrics = tokenized_lyrics.split() # then split them into individual words

# create both dictionaries vocab_to_int and int_to_vocab
vocab_to_int, int_to_vocab = create_lookup_tables(tokenized_lyrics + list(PADDING.values()))
# save the mapped (encoded) lyrics 
encoded_lyrics = [vocab_to_int[word] for word in tokenized_lyrics]

## (Optional) GPU Training

PyTorch can train a model on either the CPU or a GPU. Run the code below to check whether a CUDA-capable GPU is available on your machine.

In [1]:
# check GPU availability
import torch

gpu_availability = torch.cuda.is_available()

if gpu_availability:
    print('GPU Available! Training on:', torch.cuda.get_device_name(0))
else:
    print('No GPU found! Training on CPU...')

GPU Available! Training on: GeForce RTX 2070 SUPER


## 3. Batching and Sequencing

There is one more step before feeding the data to the model: batching and sequencing. Here we slice the data into fixed-length sequences, each paired with the word that follows it, and then group those pairs into batches. To make this easier to understand, let's consider a very simple example:

<code>[Big, brown, fox, jumps, over, the, lazy, dog, and, cat]</code>

With a sequence length of 3, sliding a window over the list one word at a time gives feature/label pairs. The first six are:

<code>features: [Big, brown, fox] | labels: [jumps]
features: [brown, fox, jumps] | labels: [over]
features: [fox, jumps, over] | labels: [the]
features: [jumps, over, the] | labels: [lazy]
features: [over, the, lazy] | labels: [dog]
features: [the, lazy, dog] | labels: [and]<br></code>

That's how we batch data with a specific sequence length!
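
The windowing in this example can be sketched in a few lines of plain Python. This is only an illustration on the toy sentence (the `toy_` names are made up here); the actual implementation with tensors and a DataLoader follows in the next cell.

```python
toy_sentence = ['Big', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'and', 'cat']
toy_seq_len = 3

# every window of toy_seq_len words is a feature; the word right after it is the label
toy_pairs = [(toy_sentence[i:i + toy_seq_len], toy_sentence[i + toy_seq_len])
             for i in range(len(toy_sentence) - toy_seq_len)]

for feats, label in toy_pairs:
    print('features:', feats, '| label:', label)
```

For ten words and a window of three this yields seven pairs: the six listed above plus a final one, `[lazy, dog, and]` with the label `cat`.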

In [39]:
# batching
from torch.utils.data import TensorDataset, DataLoader

def batch_lyric(lyrics, sequence_length, batch_size):
    """
    Batch data within a specific sequence length
    
    # Arguments
        lyrics: List, preprocessed (encoded) lyrics
        sequence_length: Integer, the number of words in each input sequence
        batch_size: Integer, the number of sequences in each batch
    
    # Output
        DataLoader, shuffled batches of (sequence, next word) pairs
    """
    
    features = []
    labels = []
    
    # split lyrics into batches according to the sequence length
    for w in range(len(lyrics)):
        if w+sequence_length < len(lyrics):
            features.append(lyrics[w:w+sequence_length]) # features
            labels.append(lyrics[w+sequence_length]) # labels
            
    # convert them to numpy arrays
    features = np.array(features)
    labels = np.array(labels)
    # convert them to tensors and load them by using DataLoader
    dataset = TensorDataset(torch.from_numpy(features), torch.from_numpy(labels))
    loader = DataLoader(dataset, shuffle=True, batch_size=batch_size)
    
    return loader

It's always good practice to test the implementation before using it on the real training setup.

In [40]:
train_loader = batch_lyric(encoded_lyrics, sequence_length=5, batch_size=10)

train_iter = iter(train_loader)
f, l = next(train_iter)  # the built-in next() works across PyTorch versions
print(f)
print(l)

tensor([[   6,    5,    9,   33,  158],
        [1412,  106, 1413,  339,    0],
        [   7,  109,  110,  205,   59],
        [ 135,   40,    5,   41,  418],
        [ 100,   24,  263,    2,   26],
        [  54,  350,  435,   50,  486],
        [   6,    5,    9,   21,   33],
        [   0,   94,  175,  196,  197],
        [  23,   27,    1,   27,    1],
        [   5,    9,   21,   33,    2]], dtype=torch.int32)
tensor([   7,  507,    0,   24,   26, 1206,    1,    0,   27,    6],
       dtype=torch.int32)


Great! It's working as expected!

## 4. Building Model

Now for the interesting part: building the model. We will use an LSTM (Long Short-Term Memory) network with word embeddings, implemented as a custom class that extends PyTorch's nn.Module.

In [41]:
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, num_layers, dropout):
        """
        Build an LSTM model
        
        # Arguments
            vocab_size: Integer, the number of words in the vocabulary
            output_size: Integer, the number of output classes (here, the vocabulary size)
            embedding_dim: Integer, the size of each word embedding
            hidden_dim: Integer, the size of the LSTM hidden state
            num_layers: Integer, the number of stacked LSTM layers
            dropout: Float, dropout probability applied between LSTM layers
        """
        
        super(Model, self).__init__()
        
        # model hyperparameters
        self.vocab_size = vocab_size
        self.output_size = output_size
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.dropout = dropout
        
        # model layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=num_layers, dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_size)
        
    def forward(self, n_input, hidden):
        """
        Forward propagation
        
        # Arguments
            n_input: Tensor of word indices with shape (batch_size, sequence_length)
            hidden: Tuple of (hidden state, cell state) tensors
        
        # Output
            (out, hidden): scores for the next word with shape (batch_size, output_size), and the updated hidden state
        """
        
        batch = n_input.size(0)
        
        # word embeddings
        embed = self.embedding(n_input)
        # feed to LSTM networks
        l, hidden = self.lstm(embed, hidden)
        # don't forget to call .contiguous() and reshape the tensor
        l = l.contiguous().view(-1, self.hidden_dim)
        # fully connected layer
        out = self.fc(l)
        # reshape the tensor with the number of batches in the front
        out = out.view(batch, -1, self.output_size)
        # take only the output of the last time step
        out = out[:,-1]
        
        return out, hidden
        
    def init_hidden(self, batch_size):
        """
        Initialize hidden state
        
        # Arguments
            batch_size: Integer, the number of sequences in a batch
        
        # Output
            hidden: tuple of zero-initialized (hidden state, cell state) tensors
        """
        
        w = next(self.parameters()).data
        
        if gpu_availability:
            hidden = (w.new(self.num_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      w.new(self.num_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (w.new(self.num_layers, batch_size, self.hidden_dim).zero_(),
                      w.new(self.num_layers, batch_size, self.hidden_dim).zero_())
            
        return hidden
        

In [42]:
# initialize layer hyperparameters
vocab_size = len(vocab_to_int)
output_size = vocab_size
embedding_dim = 300
hidden_dim = 512
num_layers = 2
dropout = 0.5

# initialize the model
model = Model(vocab_size, output_size, embedding_dim, hidden_dim, num_layers, dropout)
print(model)

Model(
  (embedding): Embedding(1731, 300)
  (lstm): LSTM(300, 512, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Linear(in_features=512, out_features=1731, bias=True)
)
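
As a sanity check on the printed architecture, the number of trainable parameters can be estimated by hand. The sketch below assumes the standard PyTorch LSTM layout (four gates per layer, each with input weights, recurrent weights and two bias vectors); the variable names are made up for this illustration.

```python
# sizes copied from the printed model above
n_vocab, n_embed, n_hidden, n_layers = 1731, 300, 512, 2

def lstm_layer_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * hidden_size * (input_size + hidden_size + 2)

embedding_params = n_vocab * n_embed
# the first layer reads embeddings; every further layer reads the previous hidden state
lstm_params = (lstm_layer_params(n_embed, n_hidden)
               + (n_layers - 1) * lstm_layer_params(n_hidden, n_hidden))
fc_params = n_hidden * n_vocab + n_vocab  # weights plus bias

total_params = embedding_params + lstm_params + fc_params
print(total_params)
```

The result should match `sum(p.numel() for p in model.parameters())` if the layout assumption holds; most of the parameters sit in the two LSTM layers.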


In [43]:
def forward_and_back_propagation(model, optimizer, criterion, feat, target, hidden):
    """
    Execute forward and back propagation
    
    # Arguments:
        model: Model, an RNN model
        optimizer: torch.optim, model optimizer
        criterion: loss criterion (cross entropy)
        feat: Tensor, features
        target: Tensor, labels
        hidden: hidden state
    
    # Output:
        (loss, h): Loss and hidden state
    """
    
    # move features and labels to GPU if available
    if gpu_availability:
        model.cuda()
        feat, target = feat.cuda(), target.cuda()
    # hidden states
    h = tuple([a.data for a in hidden])
    # clear out gradients
    model.zero_grad()
    
    # forward propagation
    out, h = model(feat, h)
    # calculate loss
    loss = criterion(out, target)
    # backpropagation
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 5) # clip any large gradients
    optimizer.step()
    
    return loss.item(), h
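
The call to `nn.utils.clip_grad_norm_` above rescales all gradients together when their combined L2 norm exceeds the threshold (5 here). The following plain-Python sketch illustrates that idea on made-up gradient values; it is a simplified stand-in written for this explanation, not PyTorch's own code, and the helper name is hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient lists so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # never scale up: gradients that are already small pass through unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in grad] for grad in grads], total_norm

toy_grads = [[3.0, 4.0], [12.0]]  # combined norm = sqrt(9 + 16 + 144) = 13
toy_clipped, norm_before = clip_by_global_norm(toy_grads, max_norm=5)
norm_after = math.sqrt(sum(g * g for grad in toy_clipped for g in grad))

print(norm_before, norm_after)
```

Because every value is multiplied by the same factor, the direction of the update is preserved and only its size is limited.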

In [44]:
def train(model, batch_size, optimizer, criterion, epochs, show_every=100):
    """
    Train an RNN model on the batches from the global train_loader
    
    # Arguments:
        model: Model, an RNN model
        batch_size: Integer, the number of sequences in each batch
        optimizer: torch.optim, model optimizer
        criterion: loss criterion (cross entropy)
        epochs: Integer, the number of training iterations
        show_every: Integer, show the training progress every n iteration
        
    # Output:
        model: the trained RNN model
    """
    
    losses = []
    
    model.train() # set the model to training mode
    
    for i in range(epochs):
        print('------------ EPOCH', i+1, '------------')
        
        hidden = model.init_hidden(batch_size) # initialize hidden states
        
        for batch, (inp, labels) in enumerate(train_loader, 1):
            # convert tensors to int64; I only needed this on Windows, where the tensors come out as int32
            # the conversion is harmless on other platforms
            inp = inp.to(torch.int64)
            labels = labels.to(torch.int64)
            
            n_batches = len(train_loader.dataset)//batch_size
            if batch > n_batches:
                break
            
            # forward and back prop
            loss, hidden = forward_and_back_propagation(model, optimizer, criterion, inp, labels, hidden)
            losses.append(loss)
            
            if batch % show_every == 0:
                print('Loss:', np.average(losses))
                losses = []
                
    return model

## 5. Model Training

Alright, now we are ready to train the model. Remember that before feeding the data to the model, we need to divide it into sequences and batches.

In [45]:
# batch sequence the lyrics
sequence_length = 10
batch_size = 32

train_loader = batch_lyric(encoded_lyrics, sequence_length, batch_size)

### Setting Hyperparameters

Feel free to try different numbers and combinations of hyperparameters.

In [46]:
# training hyperparameters
epochs = 5
lr = 0.001

# model hyperparameters
vocab_size = len(vocab_to_int)
output_size = vocab_size
embedding_dim = 300
hidden_dim = 512
num_layers = 2
dropout = 0.5

In [47]:
# initialize model and move it to GPU if available
model = Model(vocab_size, output_size, embedding_dim, hidden_dim, num_layers, dropout)
if gpu_availability:
    model.cuda()

# initialize model optimizer and loss function
opt = torch.optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

# train the model
trained_model = train(model, batch_size, opt, criterion, epochs, show_every=100)

------------ EPOCH 1 ------------
Loss: 5.978668212890625
Loss: 5.369683704376221
Loss: 5.0138031768798825
------------ EPOCH 2 ------------
Loss: 4.3513053533072785
Loss: 4.129555222988128
Loss: 4.060740299224854
------------ EPOCH 3 ------------
Loss: 3.3975519554637303
Loss: 3.312002730369568
Loss: 3.2575185334682466
------------ EPOCH 4 ------------
Loss: 2.4020344954784787
Loss: 2.5421368396282196
Loss: 2.491640272140503
------------ EPOCH 5 ------------
Loss: 1.7451489907558833
Loss: 1.7905638825893402
Loss: 1.7745794916152955


Great! The training loss is around 1.77 after 5 epochs.
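
To put that number in context, here is a small back-of-the-envelope calculation. It assumes the loss is the natural-log cross entropy that `nn.CrossEntropyLoss` computes, uses the vocabulary size from the printed model, and the `toy_` names are made up for this sketch.

```python
import math

toy_vocab_size = 1731   # from the printed model
toy_final_loss = 1.77   # approximate training loss reported above

# a model that guesses uniformly over the vocabulary would score ln(vocab size)
uniform_loss = math.log(toy_vocab_size)
# exp(loss) is the perplexity: roughly how many words the model is choosing between
perplexity = math.exp(toy_final_loss)

print(round(uniform_loss, 2), round(perplexity, 2))
```

A uniform guess would score about 7.46, so the first reported loss of roughly 5.98 is already better than chance, and the final loss corresponds to a perplexity of roughly 5.9. This is a training loss, so it says nothing about how the model would do on lyrics it has not seen.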

## 6. Generating New Lyrics

Let's use the trained model to generate some new lyrics.

In [53]:
import torch.nn.functional as F

def generate_lyrics(trained_rnn, start_word, int_to_vocab, punct_token, padding, lyric_length = 100):
    """
    Generate lyrics from the trained RNN model
    
    # Arguments:
        trained_rnn: the trained RNN model
        start_word: Integer, vocabulary index of the word that starts the lyrics
        int_to_vocab: Dictionary which maps index to word
        punct_token: Dictionary which maps punctuation marks to tokens
        padding: Integer, vocabulary index of the padding token (<PAD>)
        lyric_length: Integer, the number of words to generate
        
    # Output:
        generated_lyrics: The final output of generated lyrics
    """
    
    # set model to evaluation mode
    trained_rnn.eval()
    
    # in the beginning the sequence holds only the start word; the rest is padding
    lyric_sequence = np.full((1, sequence_length), padding)
    lyric_sequence[-1][-1] = start_word
    pred = [int_to_vocab[start_word]]
    
    for _ in range(lyric_length):
        # move tensors to GPU if available
        if gpu_availability:
            lyric_sequence = torch.LongTensor(lyric_sequence).cuda()
        else:
            lyric_sequence = torch.LongTensor(lyric_sequence)
        
        # forward propagation
        hidden = trained_rnn.init_hidden(lyric_sequence.size(0))
        output, _ = trained_rnn(lyric_sequence, hidden)
        
        # extract the softmax output
        candidate_lyrics = F.softmax(output, dim=1).data
        # move it to cpu
        if gpu_availability:
            candidate_lyrics = candidate_lyrics.cpu()
        
        # choose top 5 words with the highest probabilities from the softmax output
        top_candidate = 5
        candidate_lyrics, chosen_lyrics = candidate_lyrics.topk(top_candidate)
        # convert tensor to numpy
        chosen_lyrics = chosen_lyrics.numpy().squeeze()
        candidate_lyrics = candidate_lyrics.numpy().squeeze()
        # random factor
        idx = np.random.choice(chosen_lyrics, p=candidate_lyrics/candidate_lyrics.sum())
        
        lyric = int_to_vocab[idx] # convert indexes back to words
        pred.append(lyric)
        
        # move the start word 'pointer' to the next word (generated word from the model) and repeat
        lyric_sequence = np.roll(lyric_sequence.cpu(), -1, 1)
        lyric_sequence[-1][-1] = idx
        
    generated_lyrics = ' '.join(pred)
    
    # convert punctuation tokens back into real punctuation (tokens were lowercased during preprocessing)
    for punct, token in punct_token.items():
        generated_lyrics = generated_lyrics.replace(' ' + token.lower(), punct)
    # remove the extra space left after a newline or an opening bracket
    generated_lyrics = generated_lyrics.replace('\n ', '\n')
    generated_lyrics = generated_lyrics.replace('( ', '(')
    
    return generated_lyrics
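
The core of each generation step is the top-k sampling and the sliding input window. Below is a minimal plain-Python sketch of just those two ideas, with made-up probabilities and indices; `sample_top_k` and the `toy_` names are hypothetical helpers for this illustration, whereas the function above does the same with `topk`, `np.random.choice` and `np.roll`.

```python
import random

def sample_top_k(probs, k, rng=random):
    """Pick an index among the k most probable ones, weighted by probability."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    # random.choices normalizes the weights, like dividing by their sum
    return rng.choices(top, weights=weights, k=1)[0]

toy_probs = [0.05, 0.40, 0.10, 0.30, 0.15]  # made-up softmax output over 5 words
toy_window = [9, 9, 9, 1]                   # 9 stands in for the <PAD> index

toy_rng = random.Random(0)
next_idx = sample_top_k(toy_probs, k=3, rng=toy_rng)

# drop the oldest entry and append the new word, like np.roll followed by assignment
toy_window = toy_window[1:] + [next_idx]
print(next_idx, toy_window)
```

Restricting the choice to the top few candidates keeps unlikely words out of the lyrics, while the random draw among them keeps the output from repeating the single most probable continuation.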

In [58]:
lyrics_len = 200
start_lyric = '시간'

pad_token = PADDING['PADDING']
generated_lyrics = generate_lyrics(trained_model, vocab_to_int[start_lyric], int_to_vocab, token_lookup, vocab_to_int[pad_token], lyrics_len)
print(generated_lyrics)

시간 나의 멈춰 매일 지나가겠죠
나의 모든 순간이 아름답고 눈부셔
영원토록 뜨겁게 지지 않을게
이 모든 계절
나의 모든 계절 매일 화려한 이 무대

난 지금 이대로가 좋아, ooh ooh
창밖의 시선 따윈 필요 없잖아
이 순간 내가 원하는 걸 좀 더 꿈꿀래
내 안에 나를 더 알고 싶어
i' m always curious you(i' m)
i' m so curious, i' m so curious

wow! my rose

i remember
네 맘을 흔들어 담은
발길을 멈춘 그대
i' m your for you
(hold me hold me)
이 모든 순간이 아름답고 원하는 눈부셔
그 날부터 함께 걸어갈게요

혼자라면 할 수 없는 이 노래
(listen to me)
지금 너와 내가 만든 이 무대
이 순간 내가 원하는 걸 좀 더 꿈꿀래
그 안에 나를 더 알고 싶어
(we can feel it)
나와 네 손잡아 게 가
(hold me hold me feel)

now i' m crazy for you, and crazy fallin'

