# Tongue Twister Dataset
Tongue twisters are one of my personal favourite aspects of linguistics. So when I searched the net and found no dataset of them, I created one.  
While I currently have only ~600 tongue twisters, I believe the collection can easily grow thanks to the excellent community we're a part of.  
This is therefore an open-source dataset that you can contribute to! [Here is the GitHub repository](https://github.com/djthegr8/tongue-twisters) if you want to contribute.


### Special Thanks 
Instead of writing a word-level LSTM notebook from scratch, I have reused the Beginner's Guide to LSTMs by Shivam Bansal, so thanks to them.



## Import the libraries

As the first step, we need to import the required libraries:

In [None]:
# Keras modules for building the LSTM
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential
import tensorflow.keras.utils as ku
# String for getting punctuation
from string import punctuation
# I like to generally import tensorflow, just in case.
import tensorflow as tf
import pandas as pd
import numpy as np
import string, os

# Set seeds for reproducibility (assumes TensorFlow 2.x, where
# tf.random.set_seed replaces the TF 1.x set_random_seed)
tf.random.set_seed(2)
np.random.seed(1)

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

## Load the dataset

Load the dataset of tongue twisters. 

In [None]:
with open('../dataset/Text Formats/database.txt', encoding='utf-8') as f:
    all_headlines = f.read().split('|')  # one tongue twister per '|'-separated entry

## Dataset preparation

### Dataset cleaning 

In the dataset preparation step, we first clean the text by removing punctuation and lower-casing all the words.

In [None]:
def clean_text(txt):
    txt = "".join(v for v in txt if v not in punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

corpus = [clean_text(x) for x in all_headlines]
corpus[:10]

Don't worry, even I can't pronounce all of them without rolling my tongue off
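
To make the cleaning step concrete, here is a small standalone sketch of the same function applied to two sample strings (the sample strings are my own illustrations, not entries from the dataset). Note that non-ASCII characters are dropped, not transliterated.

```python
from string import punctuation

def clean_text(txt):
    # Drop ASCII punctuation, lower-case, then discard any non-ASCII characters.
    txt = "".join(ch for ch in txt if ch not in punctuation).lower()
    return txt.encode("utf8").decode("ascii", "ignore")

# Illustrative inputs only
cleaned = clean_text("Peter Piper picked a peck of pickled peppers!")
accented = clean_text("Café")

print(cleaned)   # peter piper picked a peck of pickled peppers
print(accented)  # caf
```

If accented words matter for your contributions, the second result suggests normalising them before cleaning, since the letter is silently lost.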

### Generating Sequence of N-gram Tokens

Language modelling requires sequential input data: given a sequence of words/tokens, the aim is to predict the next word/token.

The next step is tokenization, the process of extracting tokens (terms/words) from a corpus. Keras has a built-in `Tokenizer` class which can be used to obtain the tokens and their indices in the corpus. After this step, every text document in the dataset is converted into a sequence of tokens.


In [None]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to sequence of tokens 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]
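
To see what the n-gram prefixes look like, here is a plain-Python sketch on a two-line toy corpus. `build_word_index` is a hypothetical stand-in for `Tokenizer.fit_on_texts`: it numbers words in order of first appearance, whereas Keras orders them by frequency, so the actual ids in the notebook will differ.

```python
def build_word_index(lines):
    # Hypothetical stand-in for the Keras Tokenizer: ids start at 1
    # (0 is left free for padding) and follow order of first appearance.
    word_index = {}
    for line in lines:
        for word in line.split():
            word_index.setdefault(word, len(word_index) + 1)
    return word_index

def prefix_sequences(lines, word_index):
    # Every line contributes its prefixes of length 2, 3, ..., len(line).
    sequences = []
    for line in lines:
        ids = [word_index[w] for w in line.split()]
        for i in range(1, len(ids)):
            sequences.append(ids[:i + 1])
    return sequences

toy_corpus = ["she sells sea shells", "sea shells she sells"]
toy_word_index = build_word_index(toy_corpus)
toy_sequences = prefix_sequences(toy_corpus, toy_word_index)

print(toy_word_index)  # {'she': 1, 'sells': 2, 'sea': 3, 'shells': 4}
print(toy_sequences)
# [[1, 2], [1, 2, 3], [1, 2, 3, 4], [3, 4], [3, 4, 1], [3, 4, 1, 2]]
```

Each prefix later becomes one training example: all tokens but the last are the input, and the last token is the word to predict.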

In [None]:
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)
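
Here is a rough plain-Python picture of what the padding cell does, on three toy sequences. `pad_pre` and `one_hot` are hypothetical helpers that mimic `pad_sequences(..., padding='pre')` and `to_categorical` for this simple case only (no truncation is handled).

```python
def pad_pre(sequences, maxlen):
    # Left-pad every sequence with zeros up to maxlen.
    return [[0] * (maxlen - len(s)) + s for s in sequences]

def one_hot(index, num_classes):
    # A vector of zeros with a single 1 at position `index`.
    return [1 if i == index else 0 for i in range(num_classes)]

toy_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
vocab_size = 5  # four words plus the padding id 0

max_len = max(len(s) for s in toy_sequences)
padded = pad_pre(toy_sequences, max_len)

# All tokens but the last are the input; the last token is the label.
toy_predictors = [row[:-1] for row in padded]
toy_labels = [one_hot(row[-1], vocab_size) for row in padded]

print(toy_predictors)  # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
print(toy_labels)      # [[0, 0, 1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1]]
```

Padding on the left keeps the most recent words right next to the position being predicted.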

Perfect, now we have the input vector X and the label vector Y, which can be used for training. Recurrent neural networks have shown good performance in sequence-to-sequence learning and text data applications. Let's look at them in brief.

## LSTMs for Text Generation

![](http://www.shivambansal.com/blog/text-lstm/2.png)

The major layers of the LSTM model are listed below.

1. Input Layer : Takes the sequence of words as input
2. LSTM Layer : Computes the output using LSTM units. The code below uses 50 units in the layer, but this number can be fine-tuned later.
3. Dropout Layer : A regularisation layer which randomly turns off the activations of some neurons in the LSTM layer. It helps prevent overfitting. (Optional Layer)
4. Output Layer : Computes the probability of each possible next word as output

We will run this model for a total of 50 epochs, but this can be experimented with further.

In [None]:
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(50))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))
    opt = tf.keras.optimizers.Adam(learning_rate=0.005)  # `lr` is the deprecated name of this argument
    model.compile(loss='categorical_crossentropy', optimizer=opt)
    
    return model

model = create_model(max_sequence_len, total_words)
model.summary()
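
If you want to sanity-check the parameter counts that `model.summary()` prints, the arithmetic can be sketched as below. This assumes the standard Keras layers with biases enabled, and `vocab_size = 1000` is a made-up number for illustration; the notebook uses `total_words`, which depends on the dataset.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per word id.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit.
    return input_dim * output_dim + output_dim

vocab_size = 1000  # hypothetical; stands in for total_words
embed_dim = 10     # Embedding(total_words, 10, ...)
units = 50         # LSTM(50)

counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "lstm": lstm_params(embed_dim, units),
    "dropout": 0,  # Dropout has no trainable parameters
    "dense": dense_params(units, vocab_size),
}
total = sum(counts.values())

print(counts)  # {'embedding': 10000, 'lstm': 12200, 'dropout': 0, 'dense': 51000}
print(total)   # 73200
```

Only the LSTM count is independent of the vocabulary, so with a real vocabulary most of the parameters sit in the embedding and output layers.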

Let's train our model now. With only ~600 examples, 50 epochs is more than enough to overfit, as you will see.

In [None]:
history = model.fit(predictors, label, epochs=50, verbose=True, batch_size=32)
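
Since overfitting is the concern here, one option is to stop training once the loss stops improving. The sketch below is a simplified plain-Python version of the "patience" rule, in the spirit of the `EarlyStopping` callback imported at the top; it is not the Keras implementation, and the loss values are invented for illustration.

```python
def early_stop_epoch(losses, patience=3, min_delta=0.0):
    # Return the 1-based epoch at which training would stop:
    # the first epoch where the loss has failed to improve on the
    # best value seen so far for `patience` epochs in a row.
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(losses)  # never triggered: train for all epochs

# Invented validation losses: improving at first, then getting worse
val_losses = [2.9, 2.4, 2.1, 2.0, 2.05, 2.2, 2.3, 2.6]
stop_at = early_stop_epoch(val_losses, patience=3)

print(stop_at)  # 7
```

In the notebook itself, one way to try this would be to pass an `EarlyStopping` instance through the `callbacks` argument of `model.fit`, together with `validation_split`, so that a held-out loss is monitored. The cell above trains without validation data, so this is a suggestion and not what was run here.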

In [None]:
# I don't know if i'll be here when it completes, so here i go
model.save('tongue_twister.h5')


## Generating the text

Great, our model is now trained. Next, let's write the function that predicts the next word based on the input words (or seed text). We first tokenize the seed text, pad the sequence, and pass it to the trained model to get the predicted word. Multiple predicted words can be appended together to form the predicted sequence.


In [None]:
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
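        # predict_classes was removed from newer Keras versions;
        # take the argmax of the predicted probabilities instead
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        
        # Look up the predicted word ("" for the padding index 0)
        output_word = tokenizer.index_word.get(predicted, "")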
        seed_text += " "+output_word
    return seed_text.title()
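
The generation loop is easier to follow without the model in the way, so here is a toy sketch of the same greedy idea. `fake_predict` is a hypothetical stand-in for `model.predict` that always puts all the probability on the "next" word of a four-word cycle; the vocabulary is made up as well.

```python
index_word = {1: "she", 2: "sells", 3: "sea", 4: "shells"}
word_index = {word: idx for idx, word in index_word.items()}

def fake_predict(token_ids):
    # Hypothetical stand-in for the trained model: returns a probability
    # vector over ids 0..4 with all mass on the id after the last token.
    last = token_ids[-1] if token_ids else 0
    probs = [0.0] * 5
    probs[last % 4 + 1] = 1.0
    return probs

def argmax(values):
    # Index of the largest value (greedy choice).
    return max(range(len(values)), key=values.__getitem__)

def generate(seed_text, next_words):
    for _ in range(next_words):
        # Unknown words are skipped, as the Keras Tokenizer does by default.
        ids = [word_index[w] for w in seed_text.lower().split() if w in word_index]
        next_id = argmax(fake_predict(ids))
        seed_text += " " + index_word.get(next_id, "")
    return seed_text.title()

result = generate("she", 3)
print(result)  # She Sells Sea Shells
```

Because the choice is always the single most likely word, the same seed text always gives the same output.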

## Some Results
Well, as you can see, the model thinks I'm a dog green 🤣  
Jokes apart, the outputs aren't too (or in any way) sensible, but ***they're not supposed to be!***
We've trained the model to make weird-sounding sentences, and I can't disagree with `Donald Duck Fish Rutter Watch Fried`.

In [None]:
print(generate_text("Dweep Joshipura is", 3, model, max_sequence_len))
print(generate_text("donald duck", 4, model, max_sequence_len))
print(generate_text("india and china", 4, model, max_sequence_len))
print(generate_text("new york", 4, model, max_sequence_len))

## Improvement Ideas 

As we can see, the model's output looks really bad. However, this is due to the lack of data (only ~600 examples).  
The results can be improved in the following ways:
- Adding more data
- Fine-tuning the network architecture
- Fine-tuning the network parameters

Thanks for going through the notebook!                           
I hope this dataset will be helpful to you!