<a href="https://colab.research.google.com/github/ayushs0911/Projects/blob/main/NLP/Drake_Lyrics_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Problem Statement 
**Character-based RNN model** <br>
A text generation model that outputs Drake-style lyrics from any English-language input. 

Given a sequence of characters from the data, the model is trained to predict the next character in the sequence. Longer sequences of text can be generated by calling the model repeatedly.


##Downloading Dataset

In [None]:
!pip install -q kaggle
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d juicobowley/drake-lyrics
!unzip "/content/drake-lyrics.zip" -d "/content/dataset/"

mkdir: cannot create directory ‘/root/.kaggle’: File exists
Downloading drake-lyrics.zip to /content
  0% 0.00/764k [00:00<?, ?B/s]
100% 764k/764k [00:00<00:00, 112MB/s]
Archive:  /content/drake-lyrics.zip
  inflating: /content/dataset/drake_data.csv  
  inflating: /content/dataset/drake_data.json  
  inflating: /content/dataset/drake_lyrics.txt  


In [None]:
text = open('/content/dataset/drake_lyrics.txt', 'rb').read().decode(encoding = 'utf-8')

In [None]:
print(f"Length of text : {len(text)} characters")

Length of text : 791643 characters


In [None]:
text[:100]

'"[Verse]\r\nPut my feelings on ice\r\nAlways been a gem\r\nCertified lover boy, somehow still heartless\r\nH'

Check the unique characters in the file:

In [None]:
vocab = sorted(set(text))
len(vocab)

105

##Processing the text
The `tf.keras.layers.StringLookup` layer can convert each character into a numeric ID. It just needs the text to be split into tokens first. 
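
As a rough mental model, the lookup behaves like a dictionary built from the sorted vocabulary, with index 0 reserved for out-of-vocabulary characters (the `[UNK]` token). The sketch below is plain Python, not the Keras implementation. `build_lookup`, `encode`, and `decode` are hypothetical helper names, and the toy vocabulary is made up for illustration.

```python
def build_lookup(vocab):
    """Map each character to an integer ID, reserving 0 for unknown characters."""
    # IDs start at 1 because 0 is kept for the out-of-vocabulary token.
    return {ch: i + 1 for i, ch in enumerate(vocab)}


def encode(text, char_to_id):
    """Convert a string to a list of IDs; unseen characters map to 0."""
    return [char_to_id.get(ch, 0) for ch in text]


def decode(ids, char_to_id):
    """Invert the mapping; ID 0 becomes the placeholder '[UNK]'."""
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return "".join(id_to_char.get(i, "[UNK]") for i in ids)


toy_vocab = sorted(set("hello world"))  # toy text, not the lyrics file
char_to_id = build_lookup(toy_vocab)
ids = encode("hold", char_to_id)
print(ids)
print(decode(ids, char_to_id))
```

The reserved slot is why the layer's vocabulary ends up one entry larger than `vocab`.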

In [None]:
import tensorflow as tf 
import numpy as np
import os 
import time 
     

In [None]:
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary = list(vocab),
    mask_token = None
)

In [None]:
example_texts = [text[:100]]
chars = tf.strings.unicode_split(example_texts, input_encoding = 'UTF-8')

In [None]:
chars

<tf.RaggedTensor [[b'"', b'[', b'V', b'e', b'r', b's', b'e', b']', b'\r', b'\n', b'P',
  b'u', b't', b' ', b'm', b'y', b' ', b'f', b'e', b'e', b'l', b'i', b'n',
  b'g', b's', b' ', b'o', b'n', b' ', b'i', b'c', b'e', b'\r', b'\n',
  b'A', b'l', b'w', b'a', b'y', b's', b' ', b'b', b'e', b'e', b'n', b' ',
  b'a', b' ', b'g', b'e', b'm', b'\r', b'\n', b'C', b'e', b'r', b't',
  b'i', b'f', b'i', b'e', b'd', b' ', b'l', b'o', b'v', b'e', b'r', b' ',
  b'b', b'o', b'y', b',', b' ', b's', b'o', b'm', b'e', b'h', b'o', b'w',
  b' ', b's', b't', b'i', b'l', b'l', b' ', b'h', b'e', b'a', b'r', b't',
  b'l', b'e', b's', b's', b'\r', b'\n', b'H']]>

In [None]:
ids = ids_from_chars(chars)

In [None]:
ids

<tf.RaggedTensor [[5, 57, 52, 64, 77, 78, 64, 58, 2, 1, 46, 80, 79, 3, 72, 84, 3, 65, 64,
  64, 71, 68, 73, 66, 78, 3, 74, 73, 3, 68, 62, 64, 2, 1, 31, 71, 82, 60,
  84, 78, 3, 61, 64, 64, 73, 3, 60, 3, 66, 64, 72, 2, 1, 33, 64, 77, 79,
  68, 65, 68, 64, 63, 3, 71, 74, 81, 64, 77, 3, 61, 74, 84, 14, 3, 78, 74,
  72, 64, 67, 74, 82, 3, 78, 79, 68, 71, 71, 3, 67, 64, 60, 77, 79, 71,
  64, 78, 78, 2, 1, 38]]>

To invert this representation and recover human-readable strings, create a second `StringLookup` layer with `invert = True`:

In [None]:
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary = ids_from_chars.get_vocabulary(),
    invert = True,
    mask_token = None
)

In [None]:
chars = chars_from_ids(ids)
chars
tf.strings.reduce_join(chars, axis = -1).numpy()

array([b'"[Verse]\r\nPut my feelings on ice\r\nAlways been a gem\r\nCertified lover boy, somehow still heartless\r\nH'],
      dtype=object)

In [None]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis =-1)

##The prediction task

The input to the model is a sequence of characters, and the model is trained to predict the output: the following character at each time step.

Because RNNs maintain an internal state that depends on the previously seen elements, the question at each step is: given all the characters seen so far, what is the next character?

**Create training examples and targets**

- Next, divide the text into example sequences. Each input sequence will contain `seq_length` characters from the text.
- For each input sequence, the corresponding target contains the same length of text, shifted one character to the right.
- So break the text into chunks of `seq_length+1`. For example, if `seq_length` is 4 and the text is "Hello", the input sequence is "Hell" and the target sequence is "ello".

To do this, first use the `tf.data.Dataset.from_tensor_slices` function to convert the text vector into a stream of character indices.
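
The chunk-and-shift idea can be sketched in plain Python before building the `tf.data` pipeline. This is only an illustration on a toy string. `make_examples` is a hypothetical helper, not part of the notebook's pipeline.

```python
def make_examples(ids, seq_length):
    """Cut `ids` into chunks of seq_length + 1 and split each into (input, target)."""
    chunk = seq_length + 1
    examples = []
    # Step through full chunks only; a trailing partial chunk is dropped.
    for start in range(0, len(ids) - chunk + 1, chunk):
        piece = ids[start:start + chunk]
        # The target is the input shifted one position to the right.
        examples.append((piece[:-1], piece[1:]))
    return examples


pairs = make_examples(list("Hello world!"), seq_length=4)
for inp, tgt in pairs:
    print(repr("".join(inp)), "->", repr("".join(tgt)))
```

With `seq_length=4`, the first chunk "Hello" yields the input "Hell" and the target "ello". The leftover characters that do not fill a whole chunk are discarded.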

In [None]:
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids

<tf.Tensor: shape=(791643,), dtype=int64, numpy=array([ 5, 57, 52, ...,  5,  2,  1])>

In [None]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)


In [None]:
for ids in ids_dataset.take(20):
  print(chars_from_ids(ids).numpy().decode('utf-8'))

"
[
V
e
r
s
e
]



P
u
t
 
m
y
 
f
e
e


In [None]:
seq_length = 100

In [None]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder = True)

for seq in sequences.take(1):
  print(seq)
  print(chars_from_ids(seq))

tf.Tensor(
[ 5 57 52 64 77 78 64 58  2  1 46 80 79  3 72 84  3 65 64 64 71 68 73 66
 78  3 74 73  3 68 62 64  2  1 31 71 82 60 84 78  3 61 64 64 73  3 60  3
 66 64 72  2  1 33 64 77 79 68 65 68 64 63  3 71 74 81 64 77  3 61 74 84
 14  3 78 74 72 64 67 74 82  3 78 79 68 71 71  3 67 64 60 77 79 71 64 78
 78  2  1 38 64], shape=(101,), dtype=int64)
tf.Tensor(
[b'"' b'[' b'V' b'e' b'r' b's' b'e' b']' b'\r' b'\n' b'P' b'u' b't' b' '
 b'm' b'y' b' ' b'f' b'e' b'e' b'l' b'i' b'n' b'g' b's' b' ' b'o' b'n'
 b' ' b'i' b'c' b'e' b'\r' b'\n' b'A' b'l' b'w' b'a' b'y' b's' b' ' b'b'
 b'e' b'e' b'n' b' ' b'a' b' ' b'g' b'e' b'm' b'\r' b'\n' b'C' b'e' b'r'
 b't' b'i' b'f' b'i' b'e' b'd' b' ' b'l' b'o' b'v' b'e' b'r' b' ' b'b'
 b'o' b'y' b',' b' ' b's' b'o' b'm' b'e' b'h' b'o' b'w' b' ' b's' b't'
 b'i' b'l' b'l' b' ' b'h' b'e' b'a' b'r' b't' b'l' b'e' b's' b's' b'\r'
 b'\n' b'H' b'e'], shape=(101,), dtype=string)


In [None]:
for seq in sequences.take(5):
  print(text_from_ids(seq).numpy())

b'"[Verse]\r\nPut my feelings on ice\r\nAlways been a gem\r\nCertified lover boy, somehow still heartless\r\nHe'
b'art is only gettin\' colder"\r\n"[Verse]\r\nHands are tied\r\nSomeone\'s in my ear from the other side\r\nTelli'
b"n' me that I should pay you no mind\r\nWanted you to not be with me all night\r\nWanted you to not stay w"
b"ith me all night\r\nI know, you know, who that person is to me\r\nDoesn't really change things\r\n\r\n[Chorus"
b"]\r\nI know you're scared of dating, falling for me\r\nShorty, surely you know me\r\nRight here for you alw"


In [None]:
def split_input_target(sequence):
  input_text = sequence[:-1]
  target_text = sequence[1:]
  return input_text, target_text

In [None]:
# Use names that do not shadow Python's built-in `input`.
input_chars, target_chars = split_input_target(list('Ayush Singh'))
input_chars, target_chars

(['A', 'y', 'u', 's', 'h', ' ', 'S', 'i', 'n', 'g'],
 ['y', 'u', 's', 'h', ' ', 'S', 'i', 'n', 'g', 'h'])

In [None]:
dataset = sequences.map(split_input_target)

In [None]:
for input_example, target_example in dataset.take(1):
  print("Input:", text_from_ids(input_example).numpy())
  print("Target", text_from_ids(target_example).numpy())

Input: b'"[Verse]\r\nPut my feelings on ice\r\nAlways been a gem\r\nCertified lover boy, somehow still heartless\r\nH'
Target b'[Verse]\r\nPut my feelings on ice\r\nAlways been a gem\r\nCertified lover boy, somehow still heartless\r\nHe'


##Creating training batches 
Shuffle the data and pack into batches 

In [None]:
BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = (dataset
           .shuffle(BUFFER_SIZE)
           .batch(BATCH_SIZE, drop_remainder = True)
           .prefetch(tf.data.experimental.AUTOTUNE))
dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
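
As a back-of-the-envelope check, the number of training sequences and batches per epoch can be worked out from the values printed earlier. The sketch below is plain arithmetic and assumes the text length reported above, 791643 characters.

```python
text_length = 791643  # characters in the lyrics file, from the output above
seq_length = 100
batch_size = 64
buffer_size = 10000

# Each example consumes seq_length + 1 characters; the partial tail is dropped.
num_sequences = text_length // (seq_length + 1)
# Batches are also built with drop_remainder=True.
steps_per_epoch = num_sequences // batch_size

print(num_sequences, steps_per_epoch)
print(buffer_size >= num_sequences)
```

Because the shuffle buffer is at least as large as the number of sequences, the shuffle draws from the whole dataset rather than from a sliding window.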

##Build the model 
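The model is defined as a `keras.Model` subclass with three layers:
- `tf.keras.layers.Embedding`: the input layer, a trainable lookup table that maps each character ID to a vector with `embedding_dim` dimensions.
- `tf.keras.layers.GRU`: a recurrent layer with `rnn_units` units.
- `tf.keras.layers.Dense`: the output layer with `vocab_size` outputs. It outputs one logit for each character in the vocabulary. 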

In [None]:
# Size of the vocabulary in the StringLookup layer (includes the [UNK] token)

vocab_size = len(ids_from_chars.get_vocabulary())
embedding_dim = 256
rnn_units = 1024

In [None]:
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units, 
                                   return_sequences = True, 
                                   return_state = True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states = None, return_state = False, training = False):
    x = inputs
    x = self.embedding(x, training = training )
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state = states, 
                         training = training)
    x = self.dense(x, training = training)

    if return_state:
      return x, states
    else:
      return x 

In [None]:
model = MyModel(vocab_size = vocab_size,
                embedding_dim = embedding_dim,
                rnn_units = rnn_units)
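
To get a feel for the model's size, the trainable parameters can be estimated by hand. The sketch below assumes `vocab_size = 106` (the 105 unique characters plus the `[UNK]` token) and the Keras GRU default `reset_after=True`, which gives each of the three gate blocks two bias vectors. The helper functions are hypothetical and exist only for this estimate.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim


def gru_params(input_dim, units):
    # Three weight sets (update gate, reset gate, candidate state), each with
    # an input kernel, a recurrent kernel, and two bias vectors
    # (assumes reset_after=True).
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output.
    return input_dim * output_dim + output_dim


vocab_size, embedding_dim, rnn_units = 106, 256, 1024  # vocab_size is assumed
total = (embedding_params(vocab_size, embedding_dim)
         + gru_params(embedding_dim, rnn_units)
         + dense_params(rnn_units, vocab_size))
print(total)
```

Almost all of the parameters sit in the GRU layer. Calling `model.summary()` after the model has been built on a batch should report the actual figures.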

##Train the model 
At this point the problem can be treated as a standard classification problem: given the previous RNN state and the input at this time step, predict the class of the next character. 

In [None]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits = True)
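
Because the model returns logits, the loss is created with `from_logits = True`. For a single time step, the quantity being computed can be sketched in plain Python as below. This is an illustration only, not the Keras implementation (which also averages over the batch and time steps), and `sparse_crossentropy_from_logits` is a hypothetical helper name.

```python
import math


def sparse_crossentropy_from_logits(logits, target_id):
    """Cross-entropy for one time step: -log softmax(logits)[target_id]."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_id]


vocab_size = 106  # assumed: 105 unique characters plus [UNK]
uniform_logits = [0.0] * vocab_size
baseline = sparse_crossentropy_from_logits(uniform_logits, target_id=5)
print(round(baseline, 3))  # equals log(vocab_size)

# A model that is confident about the correct character has a lower loss.
confident_logits = [0.0] * vocab_size
confident_logits[5] = 10.0
print(sparse_crossentropy_from_logits(confident_logits, target_id=5) < baseline)
```

An untrained model that spreads probability evenly over the vocabulary should start with a loss near `log(vocab_size)`, which is a useful sanity check for the first epoch.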

In [None]:
model.compile(optimizer = 'Adam', 
              loss = loss)


In [None]:
# Checkpoint callback: save the model weights at the end of every epoch
checkpoint_dir = './training_checkpoints'

checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath = checkpoint_prefix,
    save_weights_only = True
)

##Execute the training 

In [None]:
history = model.fit(dataset, epochs = 30,
                    callbacks = [checkpoint_callback])

Epoch 1/30
...
Epoch 30/30


##Generate Text
The simplest way to generate text with this model is to run it in a loop and keep track of the model's internal state as it executes. 

Each time you call the model, you pass in some text and an internal state. The model returns a prediction for the next character and its new state. Pass the prediction and state back in to continue generating text. 

The following class makes a single-step prediction:

In [None]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    print(sparse_mask.shape)
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)
    

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    print(f"1: {predicted_logits.shape}")
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask
    print(predicted_logits.shape)

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
     

In [None]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)


(106,)
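
The sampling step inside `generate_one_step` (temperature scaling, masking, then a categorical draw) can be sketched in plain Python. This is a simplified stand-in for `tf.random.categorical`, not the code the notebook runs. `sample_next_id` and the toy logits are hypothetical, and ID 0 is assumed to be the `[UNK]` token.

```python
import math
import random


def sample_next_id(logits, temperature=1.0, skip_ids=(0,), rng=random):
    """Sample an ID from softmax(logits / temperature), never returning skip_ids."""
    scaled = [z / temperature for z in logits]
    # Masked IDs get -inf, so their probability after softmax is exactly 0.
    for i in skip_ids:
        scaled[i] = -math.inf
    m = max(scaled)
    weights = [math.exp(z - m) for z in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the sketch is repeatable
logits = [5.0, 1.0, 2.0, 0.5]  # toy logits; ID 0 plays the role of [UNK]
samples = [sample_next_id(logits, temperature=0.5, rng=rng) for _ in range(1000)]
print({i: samples.count(i) for i in range(len(logits))})
```

Even though ID 0 has the largest raw logit here, the mask keeps it from ever being sampled. Lower temperatures concentrate the draws on the most likely character, while higher temperatures make the output more varied.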


In [None]:
start = time.time()
states = None
next_char = tf.constant(["Yeah, Usher"])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

Yeah, Usher. Senthereal the paperwation is that 8 2n law as the same spirt
'Cause I got real hitters over new it, reluations
I was 2 ma, with J somethin', state up to me
Everything that I write is either for y'all to getting close
I tried to tell me ""Get it""
Then is this women that I know we here
I'm not acting up, you can find it like that
Only you can do it real girl, wait till the time
Timeling is comin' back around and get ya
It's the ones that you're explaining for me
Like I'm Louing bag, is crossed, that what I might
First turned up, I let that shit sind it, I'm in love with easones
I don't wanna seem like four words
This mother spoken like Kala Pashos the past then I heard it
Niggas know what I'm sayin'?

[Outro: Majid Al Massai]
'amony""
And welcome to my power
I remember when my life every nigga in the tankin'
Cun a Grammy like a motherfucking crass
Look, what you call, you not the belate
Man, she's not the honest, they know it's becked up the very face