In [1]:
import warnings
warnings.filterwarnings('ignore')

import random
import numpy as np
import pandas as pd
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.ERROR)

# Data preparation
* Read the downloaded input dataset.

In [2]:
df = pd.read_csv("./data/songdata.csv")

In [3]:
df.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


* Our dataset consists of $57,650$ song lyrics.

In [4]:
df.shape[0]

57650

* We have song lyrics from $643$ artists.

In [5]:
len(df['artist'].unique())

643

* The number of songs from each artist is shown as follows.

In [6]:
df['artist'].value_counts()[:10]

Donna Summer        191
Gordon Lightfoot    189
George Strait       188
Bob Dylan           188
Reba Mcentire       187
Alabama             187
Cher                187
Loretta Lynn        187
Chaka Khan          186
Dean Martin         186
Name: artist, dtype: int64

* On average, we have about $90$ songs per artist.

In [7]:
df['artist'].value_counts().values.mean()

89.65785381026438

* The song lyrics are in the `text` column, so we join all the rows of that column into a single string and store it in a variable called `data`, as follows.

In [8]:
data = ', '.join(df['text'])

* Let's look at the first few lines of a song.

In [9]:
data[:369]

"Look at her face, it's a wonderful face  \nAnd it means something special to me  \nLook at the way that she smiles when she sees me  \nHow lucky can one fellow be?  \n  \nShe's just my kind of girl, she makes me feel fine  \nWho could ever believe that she could be mine?  \nShe's just my kind of girl, without her I'm blue  \nAnd if she ever leaves me what could I do, what co"

* Since we are building a char-level RNN, we will store all the unique characters in our dataset into a variable called `chars`. This is basically our vocabulary.

In [10]:
chars = sorted(list(set(data)))

* Store the vocabulary size in a variable called `vocab_size`.

In [11]:
vocab_size = len(chars)

* Since a neural network only accepts numerical input, we need to convert each character in the vocabulary to a number.
* We map every character in the vocabulary to a unique index. We define a `char_to_ix` dictionary, which maps each character to its index. To get the character back from an index, we also define the `ix_to_char` dictionary, which maps each index to its character.

In [12]:
char_to_ix = {ch:i for i, ch in enumerate(chars)}
ix_to_char = {i:ch for i, ch in enumerate(chars)}

* As you can see in the following code snippet, the character `s` is mapped to index $68$ in the `char_to_ix` dictionary, and index $68$ maps back to `s` in the `ix_to_char` dictionary.

In [13]:
char_to_ix['s']

68

In [14]:
ix_to_char[68]

's'
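
* To make the round trip between characters and indices concrete, here is a minimal sketch on a toy string. The names `toy_data`, `toy_char_to_ix`, `toy_ix_to_char`, `encode`, and `decode` are hypothetical helpers introduced only for illustration; they are not part of the notebook.

```python
# Toy corpus standing in for the lyrics text (an assumption for illustration).
toy_data = "hello world"

# Build the vocabulary and both lookup tables the same way as for the lyrics.
toy_chars = sorted(set(toy_data))
toy_char_to_ix = {ch: i for i, ch in enumerate(toy_chars)}
toy_ix_to_char = {i: ch for i, ch in enumerate(toy_chars)}


def encode(text, mapping):
    """Convert a string into a list of integer indices."""
    return [mapping[ch] for ch in text]


def decode(indices, mapping):
    """Convert a list of integer indices back into a string."""
    return ''.join(mapping[i] for i in indices)


encoded = encode("hello", toy_char_to_ix)
decoded = decode(encoded, toy_ix_to_char)
print(encoded, decoded)
```

Because the two dictionaries are inverses of each other, decoding an encoded string always returns the original string.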

* Once we obtain the character to integer mapping, we use one-hot encoding to represent the input and output in vector form. A one-hot encoded vector is basically a vector full of $0$s, except for a $1$ at a position corresponding to a character index.
* For example, let's suppose that the `vocabSize` is $7$, and the character `z` is at index $4$ in the vocabulary (counting from $0$). Then, the one-hot encoded representation of the character `z` is as follows.

In [15]:
vocabSize = 7
char_index = 4

np.eye(vocabSize)[char_index]

array([0., 0., 0., 0., 1., 0., 0.])

* As you can see, we have a $1$ at the corresponding index of the character, and the rest of the values are $0$s. This is how we convert each character into a one-hot encoded vector.
* In the following code, we define a function called `one_hot_encoder`, which will return the one-hot encoded vectors, given an index of the character.

In [16]:
def one_hot_encoder(index):
    return np.eye(vocab_size)[index]
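
* The same indexing trick also works on a whole sequence of indices, which is how a full input sequence can be encoded in one call. The following is a small sketch under assumed toy values: `toy_vocab_size` and `toy_one_hot_encoder` are hypothetical stand-ins for `vocab_size` and `one_hot_encoder`.

```python
import numpy as np

toy_vocab_size = 5  # hypothetical vocabulary size, for illustration only


def toy_one_hot_encoder(index):
    # Indexing the identity matrix with an integer returns a single row;
    # indexing it with a list of integers returns one row per index.
    return np.eye(toy_vocab_size)[index]


indices = [0, 3, 3, 1]
encoded = toy_one_hot_encoder(indices)     # shape: (len(indices), toy_vocab_size)

# argmax along each row recovers the original indices.
decoded = encoded.argmax(axis=1).tolist()
print(encoded.shape, decoded)
```

Each row of the result is the one-hot vector for one character, so a sequence of length `seq_length` becomes a matrix of shape `[seq_length, vocab_size]`, which matches the shape of the `inputs` and `targets` placeholders.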

# Defining the Network Parameters
* We need to define all the network parameters.

In [17]:
'''Define the number of units in the hidden layer'''
hidden_size = 100

'''Define the length of the input and output sequence'''
seq_length = 25

'''Define the learning rate for gradient descent'''
learning_rate = 1e-1

'''Set the seed value'''
seed_value = 42
tf.set_random_seed(seed_value)
random.seed(seed_value)

# Defining placeholders
* Now, we will define the TensorFlow placeholders. The placeholders for the input and output are as follows.

In [18]:
inputs = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name='inputs')
targets = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name='targets')

* Define the placeholder for the initial hidden state.

In [19]:
init_state = tf.placeholder(shape=[1, hidden_size], dtype=tf.float32, name='state')

* Define an initializer for initializing the weights of the RNN.

In [20]:
initializer = tf.random_normal_initializer(stddev=0.1)

# Defining forward propagation
* Let's define the forward propagation involved in the RNN, which is mathematically given as follows.
  $$\begin{aligned}
    &\mathbf{h}_t = \mathrm{tanh}(\mathbf{U}\mathbf{x}_t + \mathbf{W}\mathbf{h}_{t - 1} + \mathbf{b}_h) \\
    &\widehat{\mathbf{y}}_t = \mathrm{softmax}(\mathbf{V}\mathbf{h}_t + \mathbf{b}_y)
  \end{aligned}$$

In [21]:
with tf.variable_scope('RNN') as scope:
    h_t = init_state
    y_hat = []
    
    for t, x_t in enumerate(tf.split(inputs, seq_length, axis=0)):
        if t > 0:
            scope.reuse_variables()
            
        '''Input to hidden layer weights'''
        U = tf.get_variable('U', [vocab_size, hidden_size], initializer=initializer)
        
        '''Hidden to hidden layer weights'''
        W = tf.get_variable('W', [hidden_size, hidden_size], initializer=initializer)
        
        '''Hidden to output layer weights'''
        V = tf.get_variable('V', [hidden_size, vocab_size], initializer=initializer)
        
        '''Bias for hidden layer'''
        bh = tf.get_variable('bh', [hidden_size], initializer=initializer)
        
        '''Bias for output layer'''
        by = tf.get_variable('by', [vocab_size], initializer=initializer)
        
        h_t = tf.tanh(tf.matmul(x_t, U) + tf.matmul(h_t, W) + bh)
        
        y_hat_t = tf.matmul(h_t, V) + by
        
        y_hat.append(y_hat_t)
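
* To see the recurrence without the TensorFlow graph machinery, the loop above can be mirrored in plain NumPy. This is only a sketch under stated assumptions: the sizes `vocab`, `hidden`, and `steps` are toy values, the biases start at zero (the notebook initializes them randomly), and `rnn_forward` is a hypothetical helper. Like the TensorFlow code, it uses the row-vector convention, so $\mathbf{U}\mathbf{x}_t$ is computed as `x @ U`.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, steps = 6, 4, 3            # toy sizes, for illustration only

# Weights and biases with the same shapes as in the TensorFlow graph.
U = rng.normal(scale=0.1, size=(vocab, hidden))    # input to hidden
W = rng.normal(scale=0.1, size=(hidden, hidden))   # hidden to hidden
V = rng.normal(scale=0.1, size=(hidden, vocab))    # hidden to output
bh = np.zeros(hidden)                              # hidden bias (assumed zero here)
by = np.zeros(vocab)                               # output bias (assumed zero here)


def rnn_forward(xs, h0):
    """Run the RNN over one-hot rows `xs`; return all logits and the last state."""
    h = h0
    logits = []
    for x in xs:
        x = x[None, :]                             # shape (1, vocab)
        h = np.tanh(x @ U + h @ W + bh)            # new hidden state, shape (1, hidden)
        logits.append(h @ V + by)                  # unnormalized scores, shape (1, vocab)
    return np.concatenate(logits, axis=0), h


xs = np.eye(vocab)[[1, 4, 2]]                      # three one-hot encoded characters
logits, h_last = rnn_forward(xs, np.zeros((1, hidden)))
print(logits.shape, h_last.shape)
```

The same weights are reused at every time step, which is what `scope.reuse_variables()` achieves in the TensorFlow version.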

* Apply softmax on the output and get the probabilities.

In [22]:
output_softmax = tf.nn.softmax(y_hat[-1])
outputs = tf.concat(y_hat, axis=0)

* Compute the cross-entropy loss.

In [23]:
loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=targets, logits=outputs))
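
* As a rough illustration of what this loss computes, the following NumPy sketch applies a softmax to each row of logits and averages the negative log-probability of the correct character. The helpers `softmax` and `mean_cross_entropy` and the toy arrays are assumptions made for illustration; they approximate the behaviour of the TensorFlow op rather than reproduce its implementation.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def mean_cross_entropy(labels, logits):
    # Per-row loss is -sum(labels * log(probs)); with one-hot labels this is
    # the negative log-probability assigned to the correct class.
    probs = softmax(logits)
    per_step = -(labels * np.log(probs)).sum(axis=1)
    return per_step.mean()


# Two time steps over a toy vocabulary of three characters.
toy_logits = np.array([[2.0, 1.0, 0.1],
                       [0.0, 0.0, 0.0]])
toy_labels = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])

toy_probs = softmax(toy_logits)
toy_loss = mean_cross_entropy(toy_labels, toy_logits)
print(toy_probs, toy_loss)
```

When all logits in a row are equal, as in the second row, the softmax is uniform and the loss for that step is $\log(3)$, the value expected from a model that guesses at random among three characters.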

* Store the final hidden state of the RNN in `hprev`. We use this hidden state for making predictions.

In [24]:
hprev = h_t

# Defining Backpropagation Through Time
* Now, we will perform backpropagation through time (BPTT), with Adam as our optimizer. We will also perform gradient clipping to avoid the exploding gradients problem.
* Initialize the Adam optimizer.

In [25]:
minimizer = tf.train.AdamOptimizer()  

* Compute the gradients of the loss with the Adam optimizer.

In [26]:
gradients = minimizer.compute_gradients(loss)

* Set the threshold for the gradient clipping.

In [28]:
threshold = tf.constant(5., name='grad_clipping')

* Clip every gradient value whose magnitude exceeds the `threshold`, so that all values lie in the range $[-\text{threshold}, \text{threshold}]$.

In [31]:
clipped_gradients = []

for grad, var in gradients:
    clipped_grad = tf.clip_by_value(grad, -threshold, threshold)
    clipped_gradients.append((clipped_grad, var))
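
* The clipping above is element-wise: each gradient entry is limited independently. The NumPy sketch below shows that behaviour on a made-up gradient vector and, for comparison, clipping by norm, which rescales the whole vector instead. The names `toy_grad`, `toy_threshold`, and `clip_by_norm` are hypothetical and used only for illustration.

```python
import numpy as np

toy_grad = np.array([-12.0, -0.5, 3.0, 40.0])   # made-up gradient values
toy_threshold = 5.0

# Element-wise clipping: each entry is limited to [-threshold, threshold].
clipped_by_value = np.clip(toy_grad, -toy_threshold, toy_threshold)


def clip_by_norm(grad, max_norm):
    # Rescale the whole vector only if its L2 norm exceeds max_norm,
    # which preserves the direction of the gradient.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad


clipped_by_norm = clip_by_norm(toy_grad, toy_threshold)
print(clipped_by_value, clipped_by_norm)
```

Element-wise clipping can change the direction of the gradient because large entries are cut while small ones are kept, whereas clipping by norm keeps the direction and only shortens the vector.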

* Apply the clipped gradients to update the network parameters.

In [32]:
updated_gradients = minimizer.apply_gradients(clipped_gradients)

# Start generating songs
* Start the TensorFlow session and initialize all the variables.

In [34]:
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

# Step by Step explanation
* First, let us understand step by step how the RNN generates song lyrics. The complete code is given at the end of the notebook.
  * Now, we will look at how to generate the song lyrics using an RNN. What should the input and output to the RNN be? How does it learn? What is the training data? Let's see the explanation, along with the code, step by step.
  * We know that in RNNs, the output predicted at a time step $t$ will be sent as the input to the next time step. We need to feed the predicted character from the previous time step as input. So, we prepare our dataset in the same way.
  * For instance, look at the following table. Let's suppose that each row is a different time step. At time step $t = 0$, the RNN predicted a new character `g` as the output. This will be sent as the input to the next time step $t = 1$.
  * However, if you look at the input in the time step $t = 1$, we removed the first character, `o`, from the input and added the newly predicted character `g` at the end of our sequence. Why are we removing the first character from the input? Because we need to maintain the sequence length.
  * Let's suppose that our sequence length is $8$. Adding a newly predicted character to our sequence increases the sequence length to $9$. To avoid this, we remove the first character from the input, while adding the newly predicted character from the previous time step.
  * Similarly, in the output data, we also remove the first character on each time step, because once the network predicts a new character, the sequence length increases. To avoid this, we remove the first character from the output on each time step, as shown in the following table.<br>
    ![](./images/04.00.png)
  * Now we will look at how we can prepare our input and output sequences similar to the preceding table.
    * Define a variable called `pointer`, which points to a character in our dataset. We will set our pointer to $0$, which means it points to the first character.
      ```python
      pointer = 0
      ```