<a href="https://colab.research.google.com/github/WuraolaOyewusi/Predict-Yoruba-Hymn-Lyrics-with-Tensorflow/blob/master/Yoruba_hymn_generator_using_TF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Download Dataset
###The dataset contains 10 popular hymns written in the Yoruba language, with their proper tone marks

In [0]:
!wget https://raw.githubusercontent.com/WuraolaOyewusi/Predict-Yoruba-Hymn-Lyrics-with-Tensorflow/master/Ten_Yoruba_Hymns.txt

--2020-03-10 13:45:09--  https://raw.githubusercontent.com/WuraolaOyewusi/Predict-Yoruba-Hymn-Lyrics-with-Tensorflow/master/Ten_Yoruba_Hymns.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7994 (7.8K) [text/plain]
Saving to: ‘Ten_Yoruba_Hymns.txt’


2020-03-10 13:45:10 (127 MB/s) - ‘Ten_Yoruba_Hymns.txt’ saved [7994/7994]



## Load the text file

In [0]:
with open('Ten_Yoruba_Hymns.txt', encoding='utf-8') as f:   # UTF-8 keeps the Yoruba tone marks intact
    data = f.readlines()                             # Data loads as a list of lines
data = ' '.join(data).lower().split('\n')            # Join into one string, lowercase it, and split on newlines

In [0]:
print(len(data))                                     #Check Length of Data
data[0:6]                                            #View data sample

261


['ìsun kan wa tó kún fẹ́jẹ̀',
 ' mo ti ní jésù lọ́rẹ̀',
 ' enìkan nbẹ tó fẹ́ràn wa',
 ' gba ayé mi, olúwa',
 ' olùgbàlà gbóhùn mi',
 ' árẹ̀ mú ọ, ọkàn re pòrurù']
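The leading spaces in the sample above come from joining the lines with `' '` before splitting on `'\n'`. The sketch below reproduces that behaviour on a small in-memory list that stands in for the output of `f.readlines()`. It also shows one possible cleanup step, which is an assumption and not part of the original notebook.

```python
# Hypothetical stand-in for the list returned by f.readlines()
raw_lines = ["Mo ti ní Jésù lọ́rẹ̀\n", "Gba ayé mi, Olúwa\n"]

# Same transformation as in the notebook: join, lowercase, split on newlines
joined = " ".join(raw_lines).lower().split("\n")
print(joined)   # the second entry starts with a space, the last entry is empty

# Optional cleanup (an assumption, not the notebook's code):
# strip surrounding whitespace and drop empty entries
cleaned = [line.strip() for line in joined if line.strip()]
print(cleaned)
```

The Keras `Tokenizer` splits on whitespace and discards empty tokens, so the extra spaces do not change the tokenization that follows.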

##Import Libraries

In [0]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

## Data Preprocessing (Tokenization, Case Folding, Sequencing and Sequence Padding)

In [0]:
tokenizer = Tokenizer()
corpus = data
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
print(total_words)

459
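As a rough illustration of where `total_words` comes from, the sketch below builds a frequency-ordered word index in plain Python. It is a simplification: the real `Tokenizer` also filters punctuation, and the two-line corpus here is hypothetical.

```python
from collections import Counter

# Tiny hypothetical corpus; "mí" and "mi" differ only by a tone mark
corpus = ["olúwa gbà mí", "olúwa mi"]

# Count whitespace-separated words (the real Tokenizer also strips punctuation)
counts = Counter(word for line in corpus for word in line.lower().split())

# The most frequent word gets index 1; index 0 is left free for padding
ordered = sorted(counts.items(), key=lambda item: -item[1])
word_index = {word: i for i, (word, _) in enumerate(ordered, start=1)}

total_words = len(word_index) + 1   # +1 accounts for the padding index 0
print(word_index)
print(total_words)
```

Because words that differ only in their tone marks are distinct strings, each gets its own index, which is why keeping the tone marks in the dataset matters.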


In [0]:
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1,len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
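The loop above turns every line into a set of growing prefixes, each of which later becomes one training example. A minimal sketch of the same idea, using hypothetical token ids:

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

# Hypothetical token ids for one five-word line
example = [109, 110, 3, 36, 12]
for seq in prefix_sequences(example):
    print(seq)
```

A line with `n` tokens yields `n - 1` sequences, and a one-word line yields none.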


##View the sequences after 'pre' padding them with '0' to bring them to equal array length

In [0]:
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len,padding='pre'))
input_sequences

array([[  0,   0,   0, ...,   0, 109, 110],
       [  0,   0,   0, ..., 109, 110,   3],
       [  0,   0,   0, ..., 110,   3,  36],
       ...,
       [  0,   0,   0, ..., 458,  12,  47],
       [  0,   0,   0, ...,  12,  47, 170],
       [  0,   0,   0, ...,  47, 170, 107]], dtype=int32)
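To make the effect of `padding='pre'` concrete, here is a simplified pure-Python stand-in for `pad_sequences`. The helper name `pad_pre` is hypothetical; it only covers left-padding and dropping the oldest tokens when a sequence is longer than `maxlen`.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad each sequence with `value` up to `maxlen` (simplified sketch)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]                       # keep the most recent tokens
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

for row in pad_pre([[109, 110], [109, 110, 3], [109, 110, 3, 36]]):
    print(row)
```

Padding on the left keeps the most recent word in the last column of every row, which is what allows the label to be sliced off uniformly in the next step.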

##Word prediction is treated as a classification task: the next word in a sequence is the label for the words that precede it, so each word in the vocabulary is a label class.

In [0]:
train_data = input_sequences[:,:-1]
labels = input_sequences[:,-1]
labels = tf.keras.utils.to_categorical(labels, num_classes = total_words)
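The split into predictors and one-hot labels can be sketched without TensorFlow. The padded rows and vocabulary size below are hypothetical, and `one_hot` is an illustrative helper rather than the Keras function.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

# Hypothetical pre-padded sequences and vocabulary size
padded = [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]
total_words = 5

predictors = [row[:-1] for row in padded]   # every column except the last
labels = [row[-1] for row in padded]        # the last column is the next word
one_hot_labels = [one_hot(label, total_words) for label in labels]

print(predictors)
print(one_hot_labels)
```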

## The model architecture and hyperparameters are from one of the lessons in the Natural Language Processing with TensorFlow course on Coursera. The architecture works well for my use case.

In [0]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(total_words, 64, input_length=max_sequence_len - 1))
model.add(tf.keras.layers.Bidirectional((tf.keras.layers.LSTM(200))))
model.add(tf.keras.layers.Dense(total_words, activation='softmax'))
model.summary()

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 7, 64)             29376     
_________________________________________________________________
bidirectional (Bidirectional (None, 400)               424000    
_________________________________________________________________
dense (Dense)                (N
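The parameter counts reported for the first two layers can be checked by hand. The sketch below uses the standard formulas for an embedding table and an LSTM layer (four gates, each with input weights, recurrent weights and a bias), with the sizes taken from the code above.

```python
vocab_size = 459      # total_words printed earlier
embedding_dim = 64
lstm_units = 200

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with (input + recurrent + bias) weights per unit
lstm_params_one_direction = 4 * (embedding_dim + lstm_units + 1) * lstm_units

# Bidirectional wraps two independent LSTMs (forward and backward)
bidirectional_params = 2 * lstm_params_one_direction

print(embedding_params)      # 29376
print(bidirectional_params)  # 424000
```

The bidirectional layer's output has 400 features because the 200-unit forward and backward outputs are concatenated.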

##Model Training. By the 46th epoch the accuracy stayed within the same range, but the loss kept decreasing until the 100th epoch.

In [0]:
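adam = tf.keras.optimizers.Adam(learning_rate=0.01)   # `lr` is a deprecated alias of `learning_rate`
# Pass the optimizer instance; the string 'adam' would ignore the learning rate set above
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])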
history = model.fit(train_data,labels,epochs=100,verbose=1)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Train on 1053 samples
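Accuracy can plateau while the loss keeps falling because accuracy only checks whether the top prediction is right, whereas categorical cross-entropy also rewards higher confidence in the right class. The sketch below shows this on two hypothetical predictions for the same label; the helper is a plain-Python illustration, not the Keras implementation.

```python
import math

def categorical_crossentropy(one_hot_label, predicted_probs, eps=1e-7):
    """Cross-entropy for one sample: -sum(t * log(p))."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(one_hot_label, predicted_probs))

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

label = [0.0, 0.0, 1.0, 0.0]              # true class is index 2
uncertain = [0.10, 0.20, 0.60, 0.10]      # hypothetical earlier-epoch output
confident = [0.01, 0.02, 0.96, 0.01]      # hypothetical later-epoch output

loss_uncertain = categorical_crossentropy(label, uncertain)
loss_confident = categorical_crossentropy(label, confident)

# Both predictions pick the correct class, so accuracy is identical,
# but the more confident one has a much lower loss
print(argmax(uncertain), argmax(confident))
print(round(loss_uncertain, 4), round(loss_confident, 4))
```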

##Text Prediction. The seed text that starts the predicted lyrics is preprocessed exactly as the training data was

In [0]:
def generate_hymn(seed_text, next_words):
    """Extend seed_text with predicted Yoruba hymn lyrics.

    seed_text: text that prompts the next-word prediction
    next_words: the number of words to predict
    Returns the seed text followed by the predicted words."""
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
        # predict_classes was removed in TensorFlow 2.6; take the argmax of the softmax output instead
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        # Map the predicted index back to its word (index 0 is padding and has no word)
        output_word = tokenizer.index_word.get(predicted, " ")
        seed_text += " " + output_word
    return seed_text
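The generation loop is a greedy decoder: at each step it picks the single most probable next word and appends it to the seed. The sketch below isolates that loop with a stub in place of the trained network. The vocabulary and `stub_predict` are hypothetical and exist only so the example runs without TensorFlow.

```python
# Hypothetical vocabulary standing in for tokenizer.word_index
word_index = {"olúwa": 1, "gbà": 2, "mí": 3}
index_word = {index: word for word, index in word_index.items()}

def stub_predict(token_ids):
    """Fake softmax output: all probability on the id after the last token."""
    probs = [0.0] * (len(word_index) + 1)          # slot 0 is the padding class
    last = token_ids[-1] if token_ids else 0
    probs[last % len(word_index) + 1] = 1.0
    return probs

def generate(seed_text, next_words):
    for _ in range(next_words):
        # Words outside the vocabulary are skipped
        token_ids = [word_index[w] for w in seed_text.lower().split() if w in word_index]
        probs = stub_predict(token_ids)
        predicted = max(range(len(probs)), key=probs.__getitem__)   # argmax
        seed_text += " " + index_word.get(predicted, "")
    return seed_text

print(generate("olúwa", 2))
```

Because the choice is always the argmax, the same seed always produces the same continuation, and different seeds that reach the same context can converge on identical lines.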


##Generate Yoruba hymn lyrics

In [0]:
generate_hymn('olúwa olúwa gbà',4)

'olúwa olúwa gbà mí ègbè ègbè tán'

In [0]:
seed_text_list = ['olúwa gbà', 'olùgbàlà', 'Ọlọ́run', 'ìṣẹ́gun ni', 'ìyanu mi', 'gbórí', 'ayọ̀ ńbọ̀', 'ìfẹ́', 'ìfẹ́ ọkàn', 'olúwa mi', 'ọ̀rẹ́', 'ọ̀rẹ́ òtítọ́']
for seed in seed_text_list:                          # Each entry is a seed phrase, not necessarily a single word
    print(generate_hymn(seed, 5))

olúwa gbà gbà mí ègbè nù kúrọ̀
olùgbàlà gbóhùn mi ko ṣì gbọ́ràn
Ọlọ́run ọ̀rọ̀ rẹ̀ mo figbàgbọ́ rísun
ìṣẹ́gun ni jà re wò re pòrurù
ìyanu mi ba ti jẹ ní gbèsè
gbórí ọ̀rọ̀ rẹ̀ mo figbàgbọ́ rísun
ayọ̀ ńbọ̀ fún mi titi náà ló
ìfẹ́ rẹ̀ ju t'ìyekan lọ sógo
ìfẹ́ ọkàn kò sì ní tán wa
olúwa mi sí ńké pé o ró
ọ̀rẹ́ ayé nkọ̀ wá sílẹ̀ ní
ọ̀rẹ́ òtítọ́ ayé nkọ̀ wá sílẹ̀ ní


##How to save and load the model with tf.keras

In [0]:
import tensorflow as tf
model.save('./yoruba_hymn_lyrics_predictor_model.h5')
model = tf.keras.models.load_model('./yoruba_hymn_lyrics_predictor_model.h5')
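The `.h5` file stores the model only. Generating lyrics in a fresh session also needs the tokenizer's vocabulary and `max_sequence_len`. One possible way to keep them alongside the model is a small JSON file, sketched below. The file name and the toy vocabulary are hypothetical, and the value 8 is inferred from the input length of 7 shown in the model summary.

```python
import json
import os
import tempfile

# Hypothetical values; in the notebook these would come from
# tokenizer.word_index and max_sequence_len
word_index = {"olúwa": 1, "gbà": 2, "mí": 3}
max_sequence_len = 8

# Hypothetical file name, written to the temp directory for this example
path = os.path.join(tempfile.gettempdir(), "yoruba_hymn_vocab.json")

# ensure_ascii=False keeps the tone-marked words readable in the file
with open(path, "w", encoding="utf-8") as f:
    json.dump({"word_index": word_index, "max_sequence_len": max_sequence_len},
              f, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored["max_sequence_len"])
print(restored["word_index"] == word_index)
```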