# Text Generation

This notebook shows how to implement a text-generating model in TensorFlow. It's based on sarcastic posts from Reddit. These posts cover a wide variety of subjects, yet they all share the same feature: sarcastic intent. Can you train a model that generates similar comments?

On the machine learning side, the following concepts are covered:

- Data preprocessing
    - stopword removal
    - tokenization
    - creating n-grams
- Recurrent Neural Networks
    - embedding layers
    - LSTM layers
    - fully-connected layers
    - regularization
    - dropout
- Text generation

## Data

The notebook is based on the Sarcasm on Reddit dataset. The dataset was generated by scraping Reddit comments containing the "/s" tag. Redditors often use this tag to indicate that a comment is in jest and not meant to be taken seriously, so it is generally a reliable indicator of sarcastic content.

The dataset contains 1.3 million sarcastic statements, which we will use to build a sarcasm generator. 

To start, [download the data](https://www.kaggle.com/danofer/sarcasm/download) and place it in your working directory. 

In [None]:
import random

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

from utils import (create_n_grams, generate_text, load_text_data,
                   plot_training_progress, unpack_file)

In [None]:
# Configure GPU
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
for device in gpu_devices: 
    tf.config.experimental.set_memory_growth(device, True)

In [None]:
DATA_FILE = "sarcasm.zip"
DATA_DIR = "sarcasm"

unpack_file(DATA_FILE, DATA_DIR)

TRAIN_FILE = DATA_DIR + "/train-balanced-sarcasm.csv"

In [None]:
# Data loading

corpus, labels = load_text_data(TRAIN_FILE, 1, 0)
# Keep only the posts labelled as sarcastic (label == 1)
corpus = [sentence for sentence, label in zip(corpus, labels) if int(label) == 1]

In [None]:
# Let's have a look at sample posts. 

posts_to_show = 15

for i in range(posts_to_show):
    idx = random.randrange(0,len(corpus))
    print("-" + " ".join(corpus[idx]))
    print("")

In [None]:
# Data preprocessing

MAX_WORDS = 1000     # max words in the dictionary
SEQUENCE_LEN = 50    # maximum sequence length
MAX_DOCS = 5000      # number of posts used for training

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

if MAX_WORDS:
    tokenizer.word_index = {e:i for e,i in tokenizer.word_index.items() if i <= MAX_WORDS}

total_words = len(tokenizer.word_index) + 1

predictors, predictands = create_n_grams(corpus, tokenizer, SEQUENCE_LEN, MAX_DOCS)
predictands = tf.keras.utils.to_categorical(predictands, num_classes=total_words, dtype=int)
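
To make the vocabulary cap in the cell above more concrete, here is a minimal sketch in plain Python. It relies on the fact that the Keras `Tokenizer` numbers words by descending frequency, starting at 1, so keeping indices up to `MAX_WORDS` keeps the most frequent words. The helper `build_capped_vocab` and the toy corpus are hypothetical and only illustrate the idea; they are not part of `utils`.

```python
from collections import Counter


def build_capped_vocab(docs, max_words):
    """Rank words by frequency (1 = most frequent) and keep the top max_words.

    Hypothetical helper: mimics the effect of filtering tokenizer.word_index.
    """
    counts = Counter(word for doc in docs for word in doc)
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked[:max_words], start=1)}


# Toy corpus of already-tokenized posts (made up for illustration).
toy_corpus = [
    ["oh", "great", "another", "monday"],
    ["oh", "great", "just", "great"],
    ["what", "a", "great", "idea"],
]

vocab = build_capped_vocab(toy_corpus, max_words=3)

# Index 0 is reserved for padding, hence the "+ 1",
# just like total_words = len(tokenizer.word_index) + 1.
vocab_size = len(vocab) + 1

print(vocab)
print(vocab_size)
```

Words that fall outside the cap simply have no index, so they are skipped when posts are converted to sequences.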

## Recurrent Neural Network

Let's build a recurrent neural network. We will formulate this problem as multiclass classification over the vocabulary. The input is a sequence of words, and the target is the word that ends that sequence. The neural network will then learn which word combinations often appear together. As a result, we will be able to "predict" the next word, given a sequence of previous words.
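
The input/target split described above can be sketched in a few lines. This is not the `create_n_grams` helper from `utils` (its implementation is not shown here); `make_n_grams` is a hypothetical stand-in, and left-padding with zeros is an assumption about how sequences are brought to a common length.

```python
def make_n_grams(token_ids, sequence_len, pad_id=0):
    """Turn one encoded post into (predictor, target) pairs.

    Hypothetical helper: every prefix of the post with at least two tokens
    becomes one example, where the last token is the target.
    """
    pairs = []
    for end in range(2, len(token_ids) + 1):
        # Take the prefix and keep at most sequence_len tokens.
        n_gram = token_ids[:end][-sequence_len:]
        # Left-pad so that all examples have the same length (assumption).
        padded = [pad_id] * (sequence_len - len(n_gram)) + n_gram
        # Everything but the last token is the input, the last token is the class.
        pairs.append((padded[:-1], padded[-1]))
    return pairs


pairs = make_n_grams([4, 7, 2, 9], sequence_len=5)
for predictor, target in pairs:
    print(predictor, "->", target)
```

Each predictor has `sequence_len - 1` entries, which is why the embedding layer below is built with `input_length=SEQUENCE_LEN-1`.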

In [None]:
# Model architecture

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(total_words, 100, input_length=SEQUENCE_LEN-1))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True)))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.LSTM(50))
# Dense expects an integer number of units, so use integer division
model.add(tf.keras.layers.Dense(total_words // 2, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01)))
model.add(tf.keras.layers.Dense(total_words, activation='softmax'))

model.summary()
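
As a sanity check on the summary, the parameter counts can be worked out by hand. The sketch below uses the standard formulas for LSTM and dense layers with biases and assumes `total_words` comes out at 1001 (the 1000 kept words plus the padding index). If your vocabulary is smaller, the numbers will differ.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim, units):
    """One weight per input-unit pair, plus one bias per unit."""
    return input_dim * units + units


total_words = 1001   # assumption: MAX_WORDS = 1000 plus the padding index
embedding_dim = 100
lstm_units = 50

counts = {
    "embedding": total_words * embedding_dim,
    # A bidirectional wrapper holds two independent LSTMs.
    "bidirectional_lstm": 2 * lstm_params(embedding_dim, lstm_units),
    # The second LSTM sees the concatenated forward and backward outputs.
    "lstm": lstm_params(2 * lstm_units, lstm_units),
    "dense_hidden": dense_params(lstm_units, total_words // 2),
    "dense_output": dense_params(total_words // 2, total_words),
}

for name, value in counts.items():
    print(f"{name:>20}: {value:,}")
print(f"{'total':>20}: {sum(counts.values()):,}")
```

Under these assumptions, most of the weights sit in the output layer, because it has to score every word in the vocabulary. The dropout layer adds no parameters.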

In [None]:
# Now it's time to train our model. 

tf.keras.backend.clear_session()

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['acc'])

history = model.fit(predictors, 
                    predictands, 
                    epochs=100, 
                    verbose=1)

In [None]:
# Let's evaluate its training progress

plot_training_progress(history, "acc", plot_validation=False)

Now, let's see how it performs with text generation. Start a sequence below, and let the neural network finish it for you!

In [None]:
seed_text = "I have never seen such a good movie!"
next_words = 15

generate_text(model, tokenizer, seed_text, next_words)

How sarcastic does this sentence sound? Or does it sound like a sentence at all?

Language models are known for taking a really long time to train. After all, you are trying to represent the relationships between all existing words! So take a step back: increase the number of epochs, retrain your model, and try again!
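
While you experiment, it can help to see what a generation loop looks like. The sketch below is one common approach (greedy decoding: always pick the most probable next word) and is not necessarily what `generate_text` from `utils` does. The function `generate_greedy`, the toy vocabulary, and the stand-in predictor `toy_predict` are all hypothetical, so the example runs without a trained model.

```python
def generate_greedy(predict_next, word_index, seed_text, next_words, sequence_len):
    """Extend seed_text one word at a time, always taking the most likely word.

    predict_next: function mapping a padded list of word ids to a list of
                  probabilities, one per vocabulary index (hypothetical interface).
    """
    index_word = {index: word for word, index in word_index.items()}
    words = seed_text.lower().split()
    for _ in range(next_words):
        # Encode known words, keep the most recent ones, and left-pad with zeros.
        ids = [word_index[w] for w in words if w in word_index]
        ids = ids[-(sequence_len - 1):]
        ids = [0] * (sequence_len - 1 - len(ids)) + ids
        probs = predict_next(ids)
        best = max(range(len(probs)), key=probs.__getitem__)
        if best == 0:  # the padding index has no word attached
            break
        words.append(index_word[best])
    return " ".join(words)


# Toy vocabulary and a fake "model" that always predicts the next index in line.
toy_index = {"oh": 1, "great": 2, "another": 3, "monday": 4}


def toy_predict(ids):
    nxt = ids[-1] + 1 if ids[-1] < 4 else 1
    return [1.0 if i == nxt else 0.0 for i in range(5)]


result = generate_greedy(toy_predict, toy_index, "Oh", next_words=3, sequence_len=4)
print(result)
```

With a trained Keras model, `predict_next` could wrap `model.predict` on a batch of one sequence and return the first row of the output. Greedy decoding tends to repeat itself, which is one reason generated comments can sound odd even when training accuracy looks reasonable.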