### Required
Once you have selected a data set, you will produce the deliverables listed below and submit them to one of your peers for review. Treat this exercise as an opportunity to produce an analysis that highlights your analytical skills for a senior audience, for example the Chief Data Officer or the Head of Analytics at your company.

Sections required in your report:

* Main objective of the analysis that also specifies whether your model will be focused on a specific type of Deep Learning or Reinforcement Learning algorithm and the benefits that your analysis brings to the business or stakeholders of this data.

* Brief description of the data set you chose, a summary of its attributes, and an outline of what you are trying to accomplish with this analysis.

* Brief summary of data exploration and actions taken for data cleaning or feature engineering.

* Summary of training at least three variations of the Deep Learning model you selected. For example, you can use different clustering techniques or different hyperparameters.

* A paragraph explaining which of your Deep Learning models you recommend as a final model that best fits your needs in terms of accuracy or explainability.

* Summary Key Findings and Insights, which walks your reader through the main findings of your modeling exercise.

* Suggestions for next steps in analyzing this data, which may include suggesting revisiting this model or adding specific data features to achieve a better model.

### Main Objective:
* This report uses a recurrent neural network (RNN) and two of its variants, LSTM and GRU, to learn from a dataset of Metallica song titles and generate new titles in a similar style.

* The main benefit is to generate interest in applying neural networks to areas that are not necessarily commercial.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import tensorflow as tf
import time
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM, SimpleRNN, GRU
from tensorflow.keras.optimizers import RMSprop
import random
import warnings
warnings.filterwarnings("ignore")

### About the data
* The data was collected directly from the official Metallica website, which provides an ordered list of all the songs. The only cleaning required was removing the header from the list.
* The goal is for the network to learn the song titles as a character sequence and predict plausible next characters.

In [2]:
# load the csv with pandas
link = "https://raw.githubusercontent.com/Vicente-Figueroa/Metallica_song_names/main/metallica.csv"
dataframe = pd.read_csv(link, sep='\t', names=['songs'])
input_names = dataframe['songs']
# print the last five entries of the list
input_names[-5:]

201                  Whiplash
202        Whiskey in the Jar
203    White Light/White Heat
204                    Would?
205         You Really Got Me
Name: songs, dtype: object

### Feature engineering
* Since recurrent networks read data sequentially, we pass a fixed-length window of characters as the input sequence X and the character that follows it as the target Y.
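The windowing idea can be sketched on a toy string before applying it to the real corpus. The corpus, the window length of 3, and the helper name `make_pairs` are illustrative assumptions, not part of the notebook.

```python
# Toy corpus: two made-up "names" joined by a newline (illustrative only).
toy_corpus = "one\ntwo"
window = 3  # hypothetical window length; the notebook uses the longest name


def make_pairs(text, window, step=1):
    """Return (input window, next character) pairs from a sliding window."""
    pairs = []
    for start in range(0, len(text) - window, step):
        pairs.append((text[start:start + window], text[start + window]))
    return pairs


pairs = make_pairs(toy_corpus, window)
for x, y in pairs:
    print(repr(x), "->", repr(y))
```

Each pair holds one input window and the single character the network is asked to predict next. The newline separator is part of the windows, which is how the model can learn where a name ends.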

In [3]:
# Join all names into one long lowercase string, separated by newlines
concat_names = '\n'.join(input_names).lower()

# Find all unique characters by using set()
chars = sorted(list(set(concat_names)))
num_chars = len(chars)

# Build translation dictionaries, 'a' -> 0, 0 -> 'a'
char2idx = dict((c, i) for i, c in enumerate(chars))
idx2char = dict((i, c) for i, c in enumerate(chars))


# Use longest name length as our sequence window
max_sequence_length = max([len(name) for name in input_names])

print('Total chars: {}'.format(num_chars))
print('Corpus length:', len(concat_names))
print('Number of names: ', len(input_names))
print('Longest name: ', max_sequence_length)

Total chars: 43
Corpus length: 3236
Number of names:  206
Longest name:  35


In [4]:
sequences = []
next_chars = []
step_length = 1


# Loop over our data and extract pairs of sequences and next chars
for i in range(0, len(concat_names) - max_sequence_length, step_length):
    sequences.append(concat_names[i: i + max_sequence_length])
    next_chars.append(concat_names[i + max_sequence_length])

num_sequences = len(sequences)

print('Number of sequences:', num_sequences)
print('First 10 sequences and next chars:')
for i in range(10):
    print('X=[{}] y=[{}]'.format(sequences[i], next_chars[i]).replace('\n', ' '))

Number of sequences: 3201
First 10 sequences and next chars:
X=[2 x 4 53rd & 3rd ain't my bitch all] y=[ ]
X=[ x 4 53rd & 3rd ain't my bitch all ] y=[d]
X=[x 4 53rd & 3rd ain't my bitch all d] y=[a]
X=[ 4 53rd & 3rd ain't my bitch all da] y=[y]
X=[4 53rd & 3rd ain't my bitch all day] y=[ ]
X=[ 53rd & 3rd ain't my bitch all day ] y=[a]
X=[53rd & 3rd ain't my bitch all day a] y=[n]
X=[3rd & 3rd ain't my bitch all day an] y=[d]
X=[rd & 3rd ain't my bitch all day and] y=[ ]
X=[d & 3rd ain't my bitch all day and ] y=[a]


In [5]:
X = np.zeros((num_sequences, max_sequence_length, num_chars), dtype=bool)
Y = np.zeros((num_sequences, num_chars), dtype=bool)

# One-hot encode every character of each input window and its target character
for i, sequence in enumerate(sequences):
    for j, char in enumerate(sequence):
        X[i, j, char2idx[char]] = 1
    Y[i, char2idx[next_chars[i]]] = 1

print('X shape: {}'.format(X.shape))
print('Y shape: {}'.format(Y.shape))

X shape: (3201, 35, 43)
Y shape: (3201, 43)
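To make the shapes above concrete, here is a minimal sketch of one-hot encoding and decoding on a toy string. It uses plain Python lists instead of NumPy so it runs on its own, and the names `char_to_index`, `index_to_char`, `one_hot`, and `decode` are hypothetical helpers rather than notebook code.

```python
text = "abca"  # toy string, not taken from the dataset

# Vocabulary and lookup tables, built the same way as char2idx / idx2char
vocab = sorted(set(text))
char_to_index = {c: i for i, c in enumerate(vocab)}
index_to_char = {i: c for c, i in char_to_index.items()}


def one_hot(char):
    """Return a list with a single 1 at the position of `char` in the vocabulary."""
    row = [0] * len(vocab)
    row[char_to_index[char]] = 1
    return row


def decode(rows):
    """Map one-hot rows back to the characters they encode."""
    return "".join(index_to_char[row.index(1)] for row in rows)


encoded = [one_hot(c) for c in text]
decoded = decode(encoded)
print(encoded)
print(decoded)
```

In the notebook, `X` stacks 3201 such windows of 35 rows each, with 43 columns per row (one per character in the vocabulary), and `Y` holds one row per window.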


### Summary of training

In [6]:
# First we define the number of epochs and the number of names to generate
epochs = 150
gen_amount = 10

In [7]:
# function to print generated names
def print_names(model, gen_amount):
    # Start sequence generation from end of the input sequence
    sequence = concat_names[-(max_sequence_length - 1):] + '\n'

    new_names = []
    print('{} new names are being generated'.format(gen_amount))

    while len(new_names) < gen_amount:
        # Vectorize sequence for prediction
        x = np.zeros((1, max_sequence_length, num_chars))
        for i, char in enumerate(sequence):
            x[0, i, char2idx[char]] = 1

        # Sample next char from predicted probabilities
        probs = model.predict(x, verbose=0)[0]
        probs /= probs.sum()
        next_idx = np.random.choice(len(probs), p=probs)
        next_char = idx2char[next_idx]
        sequence = sequence[1:] + next_char

        # New line means we have a new name
        if next_char == '\n':
            gen_name = [name for name in sequence.split('\n')][1]

            # Never start a name with two identical chars
            if len(gen_name) > 2 and gen_name[0] == gen_name[1]:
                gen_name = gen_name[1:]

            # Discard all names that are too short
            if len(gen_name) > 2:
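                # Only allow new and unique names.
                # Note: gen_name is lowercase, input_names keeps its original
                # casing and new_names is capitalized, so this check is
                # case-sensitive and can let repeated names through.
                concat_list = np.concatenate((input_names, new_names), axis=0)
                if gen_name not in concat_list:
                    new_names.append(gen_name.capitalize())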

            if 0 == (len(new_names) % (gen_amount/ 10)):
                print('Generated {}'.format(len(new_names)))
            
    return new_names
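The sampling step inside `print_names` can be illustrated in isolation. The sketch below uses the standard library's `random.Random.choices` as a stand-in for `np.random.choice`; the helper `sample_index` and the toy probabilities are assumptions made for illustration. The renormalization mirrors `probs /= probs.sum()`, which is presumably there because floating-point softmax outputs may not sum to exactly 1.

```python
import random


def sample_index(probs, rng):
    """Draw one index at random, with probability proportional to probs."""
    total = sum(probs)
    normalized = [p / total for p in probs]  # same idea as probs /= probs.sum()
    return rng.choices(range(len(normalized)), weights=normalized, k=1)[0]


rng = random.Random(0)       # seeded so repeated runs give the same draws
toy_probs = [0.1, 0.7, 0.2]  # hypothetical model output over three characters

counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_index(toy_probs, rng)] += 1
print(counts)
```

Sampling instead of always taking the most likely character is what lets the generator produce different names on each pass rather than repeating one sequence.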

### First model

In [8]:
latent_dim = 64 
dropout_rate = 0.2

# making the NN
model1 = Sequential()

# Add the SimpleRNN layer 
model1.add(SimpleRNN(latent_dim,
               input_shape=(max_sequence_length, num_chars),
               recurrent_dropout=dropout_rate))

# Output layer dense and softmax activation
model1.add(Dense(units=num_chars, activation='softmax'))

# RMSprop optimizer
optimizer = RMSprop(learning_rate=0.01)
model1.compile(loss='categorical_crossentropy',
              optimizer=optimizer)

model1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 64)                6912      
                                                                 
 dense (Dense)               (None, 43)                2795      
                                                                 
Total params: 9,707
Trainable params: 9,707
Non-trainable params: 0
_________________________________________________________________


In [9]:
batch_size = 32 

start = time.time()
print('Start training for {} epochs'.format(epochs))
history = model1.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=1)
end = time.time()
print('Finished training - time elapsed:', (end - start)/60, 'min')

Start training for 150 epochs
Epoch 1/150
...
Epoch 150/150
Finished training - time elapsed: 1.8656500299771628 min


In [10]:
names = print_names(model1, gen_amount)
print('First {} generated names:'.format(gen_amount))
for name in names[:gen_amount]:
    print(name)

10 new names are being generated
Generated 1
Generated 2
Generated 3
Generated 3
Generated 4
Generated 5
Generated 5
Generated 6
Generated 7
Generated 8
Generated 8
Generated 8
Generated 8
Generated 8
Generated 8
Generated 9
Generated 10
First 10 generated names:
Would?
You really got me
Bssh wubflohi lytatosafb any
F lribltjhlovrd'sysetl
F lribltjhlovrd'sysetl
Rssyidvtsryaly'bg buyin
Pystrolry.elysy
Pystrolry.elysy
Wsrosryscictyeyiy thety toy tf
Thatsrv
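The list above contains repeated entries and two titles that already exist in the dataset. This appears to come from the uniqueness check comparing a lowercase candidate against names stored with different casing. One possible fix, sketched here with a hypothetical helper `is_new_name`, is to normalize both sides before comparing.

```python
def is_new_name(candidate, known_names):
    """True if candidate differs from every known name, ignoring case and outer whitespace."""
    normalized = candidate.strip().lower()
    return normalized not in {name.strip().lower() for name in known_names}


original = ["Would?", "You Really Got Me"]  # the last two titles shown in the dataset tail
generated = []

# Candidates copied from the output above, in the lowercase form the model emits
for candidate in ["would?", "pystrolry.elysy", "pystrolry.elysy", "thatsrv"]:
    if is_new_name(candidate, original + generated):
        generated.append(candidate.capitalize())

print(generated)
```

With this check, the copy of an existing title and the second occurrence of a generated name are both rejected.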


### Second model

In [11]:
latent_dim = 64 
dropout_rate = 0.2

# making the NN
model2 = Sequential()

# Add the LSTM layer 
model2.add(LSTM(latent_dim,
               input_shape=(max_sequence_length, num_chars),
               recurrent_dropout=dropout_rate))

# Output layer dense and softmax activation
model2.add(Dense(units=num_chars, activation='softmax'))

# RMSprop optimizer
optimizer = RMSprop(learning_rate=0.01)
model2.compile(loss='categorical_crossentropy',
              optimizer=optimizer)

model2.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 64)                27648     
                                                                 
 dense_1 (Dense)             (None, 43)                2795      
                                                                 
Total params: 30,443
Trainable params: 30,443
Non-trainable params: 0
_________________________________________________________________


In [12]:
start = time.time()
print('Start training for {} epochs'.format(epochs))
history = model2.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=1)
end = time.time()
print('Finished training - time elapsed:', (end - start)/60, 'min')

Start training for 150 epochs
Epoch 1/150
...
Epoch 150/150
Finished training - time elapsed: 3.8861140966415406 min


In [13]:
names = print_names(model2, gen_amount)
print('First {} generated names:'.format(gen_amount))
for name in names[:gen_amount]:
    print(name)

10 new names are being generated
Generated 1
Generated 2
Generated 3
Generated 4
Generated 5
Generated 6
Generated 7
Generated 8
Generated 9
Generated 10
First 10 generated names:
You really got me
The mon's drman
Mr. soul
Mtv:icon medley
Poor twistead jack built
House the train hop
Shoot pera wit the dumb
Frustration
Frustration
Fuel


### Third model

In [14]:
latent_dim = 64 
dropout_rate = 0.2

# making the NN
model3 = Sequential()

# Add the GRU layer 
model3.add(GRU(latent_dim,
               input_shape=(max_sequence_length, num_chars),
               recurrent_dropout=dropout_rate))

# Output layer dense and softmax activation
model3.add(Dense(units=num_chars, activation='softmax'))

# RMSprop optimizer
optimizer = RMSprop(learning_rate=0.01)
model3.compile(loss='categorical_crossentropy',
              optimizer=optimizer)

model3.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 gru (GRU)                   (None, 64)                20928     
                                                                 
 dense_2 (Dense)             (None, 43)                2795      
                                                                 
Total params: 23,723
Trainable params: 23,723
Non-trainable params: 0
_________________________________________________________________


In [15]:
start = time.time()
print('Start training for {} epochs'.format(epochs))
history = model3.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=1)
end = time.time()
print('Finished training - time elapsed:', (end - start)/60, 'min')

Start training for 150 epochs
Epoch 1/150
...
Epoch 150/150
Finished training - time elapsed: 3.4371578097343445 min


In [16]:
names = print_names(model3, gen_amount)
print('First {} generated names:'.format(gen_amount))
for name in names[:gen_amount]:
    print(name)

10 new names are being generated
Generated 1
Generated 2
Generated 2
Generated 3
Generated 4
Generated 5
Generated 6
Generated 7
Generated 8
Generated 9
Generated 10
First 10 generated names:
You really got me
The farn her
Heat far
Heat far
Heat far
Let the real
Was nowithsnight in the lightning
No lat me
No lat me
Did of the rising


### Recommended model

* I recommend the LSTM model. It has more trainable parameters than the other two networks and reaches a lower training loss, which is reflected in more coherent song names.
* However, the model is still not robust enough to produce consistently coherent names.

#### Key Findings

* The LSTM model reached a loss of 0.4565 after 150 epochs, significantly lower than the SimpleRNN and GRU models trained with the same hyperparameters.

* The LSTM model has 30,443 trainable parameters in total (27,648 of them in the LSTM layer), compared with 23,723 for the GRU model and 9,707 for the SimpleRNN model. This larger capacity appears to contribute to the difference.
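The parameter counts reported by `model.summary()` can be reproduced by hand, which helps explain why the three models differ in size. The helper functions below are illustrative, and the GRU formula assumes the Keras default `reset_after=True`, which keeps two bias vectors per gate.

```python
def simple_rnn_params(input_dim, units):
    # One weight matrix for the input, one for the hidden state, one bias vector
    return units * (input_dim + units + 1)


def lstm_params(input_dim, units):
    # Four gates, each shaped like a SimpleRNN cell
    return 4 * units * (input_dim + units + 1)


def gru_params(input_dim, units):
    # Three gates; assumes reset_after=True, so each gate has two bias vectors
    return 3 * (units * (input_dim + units) + 2 * units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


num_chars, latent_dim = 43, 64  # values used in the notebook
head = dense_params(latent_dim, num_chars)

print("SimpleRNN:", simple_rnn_params(num_chars, latent_dim) + head)
print("LSTM:     ", lstm_params(num_chars, latent_dim) + head)
print("GRU:      ", gru_params(num_chars, latent_dim) + head)
```

The recurrent layer dominates the total in every case, and the LSTM layer has four times the parameters of the SimpleRNN layer because each of its four gates has its own weights.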

### Next Steps

* As a next step, I would like to add an embedding layer and see how the model performs.
* I would also like to stack more recurrent layers, or save the weights from training on another data set and then train a layer to connect the letters and form the song.
* We could also train for more epochs to see whether the loss keeps decreasing, or train on a larger data set to increase the number of sequences the model works with.
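For the embedding idea, the input preparation would change, because a Keras `Embedding` layer takes integer character indices rather than one-hot vectors. The sketch below shows only that encoding step on a toy corpus; the corpus, the window length, and the variable names are assumptions for illustration, and no model is built here.

```python
corpus = "ab\ncab"  # toy corpus, not taken from the dataset
window = 3          # hypothetical window length

vocab = sorted(set(corpus))
char_to_index = {c: i for i, c in enumerate(vocab)}

# Each input is a list of integer ids; each target is a single integer id
X_ids = [[char_to_index[c] for c in corpus[i:i + window]]
         for i in range(len(corpus) - window)]
y_ids = [char_to_index[corpus[i + window]]
         for i in range(len(corpus) - window)]

print(X_ids)
print(y_ids)
```

With integer targets like `y_ids`, the loss would typically switch from `categorical_crossentropy` to `sparse_categorical_crossentropy`.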
