#  Lab: Generating Text with RNNs

## Overview

A **Recurrent Neural Network** (RNN) is a type of neural network that is well suited to sequential data such as text, time series, financial data, speech, audio, and video.
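
To make the idea of "sequential" processing concrete, here is a minimal sketch of a one-unit vanilla RNN in plain Python. The weights `w_x`, `w_h`, and `b` are arbitrary illustrative values, not anything textgenrnn uses; the point is only that each hidden state depends on the current input and the previous state, so the order of the inputs matters.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit vanilla RNN over a sequence and return every hidden state.

    w_x, w_h and b are made-up illustrative weights (assumptions).
    """
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Same values, different order: the final hidden states differ.
states_a = rnn_forward([1.0, 0.0])
states_b = rnn_forward([0.0, 1.0])
print(states_a)
print(states_b)
```

Real RNN layers use vectors and weight matrices instead of single numbers, but the recurrence has the same shape.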

To run this lab we will use a Python library called **textgenrnn**. With a few lines of code, it lets you train your own text-generating neural network of any size and complexity on any text dataset. textgenrnn is authored by Max Woolf, an Associate Data Scientist at BuzzFeed and former Apple Software QA Engineer.

**textgenrnn** is a Python 3 module on top of Keras/TensorFlow for creating [char-rnns](https://github.com/karpathy/char-rnn), with many cool features:

 - A modern neural network architecture which utilizes new techniques such as attention-weighting and skip-embedding to accelerate training and improve model quality.
 - Train on and generate text at either the character-level or word-level.
 - Configure RNN size, the number of RNN layers, and whether to use bidirectional RNNs.
 - Train on any generic input text file, including large files.
 - Train models on a GPU and then use them to generate text with a CPU.
 - Utilize a powerful CuDNN implementation of RNNs when trained on the GPU, which massively speeds up training time as opposed to typical LSTM implementations.
 - Train the model using contextual labels, allowing it to learn faster and produce better results in some cases.
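
The difference between character-level and word-level training comes down to what counts as one token. The snippet below is a simplified sketch in plain Python; it uses naive whitespace splitting for words, which is an assumption for illustration and not necessarily how textgenrnn tokenizes internally.

```python
text = "We the people"

# Character-level: every character, including spaces, is a token.
char_tokens = list(text)

# Word-level: here, a naive whitespace split (an illustrative simplification).
word_tokens = text.split()

print(len(char_tokens), char_tokens)
print(len(word_tokens), word_tokens)
```

Character-level models have a small vocabulary but must learn spelling; word-level models never misspell a word but need a much larger vocabulary.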

You can find the repository [here](https://github.com/minimaxir/textgenrnn).

## Step 1: Import the Library

In [1]:
# Code to run on Google Colab
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

print ("Running in Google COLAB : ", IN_COLAB)

Running in Google COLAB :  False


In [2]:
## Turning this off for now to prevent error in Colab
# try:
#   # %tensorflow_version only exists in Colab.
#   %tensorflow_version 2.x
# except Exception:
#   pass

import tensorflow as tf
from tensorflow import keras
print (keras.__version__)

2.2.4-tf


In [3]:
## Install textgenRNN

!pip install -q textgenrnn

## Step 2: Build Model

**textgenrnn** uses a default model unless you specify the size and complexity of the neural network through its parameters.

In [4]:
from textgenrnn import textgenrnn

textgen = textgenrnn()

Using TensorFlow backend.
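
As a rough sketch of what specifying the size and complexity could look like, the settings below are passed to `train_from_file` to train a new model instead of the default one. The parameter names (`new_model`, `rnn_size`, `rnn_layers`, `rnn_bidirectional`, `max_length`, `word_level`) are assumed from the textgenrnn documentation and may differ between versions; the values and the helper `train_custom` are hypothetical.

```python
# Hypothetical settings; parameter names are assumed from textgenrnn's docs.
custom_model_cfg = {
    "new_model": True,           # train from scratch instead of the default weights
    "rnn_size": 128,             # number of units per RNN layer
    "rnn_layers": 2,             # number of stacked RNN layers
    "rnn_bidirectional": False,  # read text in one direction only
    "max_length": 40,            # characters of context used to predict the next one
    "word_level": False,         # character-level model
}

def train_custom(textgen, path, num_epochs=1, cfg=custom_model_cfg):
    """Train a fresh model on `path` using the settings in `cfg` (helper is illustrative)."""
    textgen.train_from_file(path, num_epochs=num_epochs, **cfg)
```

With the `textgen` object from the cell above, this would be called as `train_custom(textgen, 'my-text.txt')`, where the file name is a placeholder.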


## Step 3: Train Model

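Select the text to train on by uncommenting one of the `data_location` URLs in the next cell. The options are:
 - 2018 Trump State of the Union Address (enabled by default)
 - 2020 Trump State of the Union Address
 - Collection of Trump tweets (about 20k tweets; training takes a while, roughly 1 hour on CPU)
 - Tiny Shakespeare (1.1 M; about 2 minutes on a Colab GPU, about 30 minutes on an 8-core CPU)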

Select the number of epochs to train for. This example uses 1 epoch; increase the number of epochs to get better results.


In [5]:
## trump state of the union 2018
data_location = 'https://elephantscale-public.s3.amazonaws.com/data/text/state-of-the-unions/2018-Trump.txt'

## trump state of the union 2020
# data_location = 'https://elephantscale-public.s3.amazonaws.com/data/text/state-of-the-unions/2020-Trump.txt'

## trump tweets 
# ~20k tweets, Note : Training will take a while (on CPU = 1 hr)
# data_location = 'https://elephantscale-public.s3.amazonaws.com/data/text/tweets/Trump-tweets.txt'

## tiny shakespeare (1.1 M)
## training time on colab GPU = ~ 2 mins,   CPU ( 8 core, 4GHz) = 30 mins
# data_location = 'https://elephantscale-public.s3.amazonaws.com/data/text/books/tiny-shakespeare.txt'

In [6]:
import os 

data_location_local  = keras.utils.get_file(fname=os.path.basename(data_location),
                                           origin=data_location)
print (data_location_local)

Downloading data from https://elephantscale-public.s3.amazonaws.com/data/text/state-of-the-unions/2018-Trump.txt
/home/ubuntu/.keras/datasets/2018-Trump.txt


In [7]:
%%time

textgen.train_from_file(data_location_local, num_epochs=1)  

648 texts collected.
Training on 30,236 character sequences.
  ...
    to  
  ['...']
Train for 236 steps
Temperature: 0.2
####################






####################
Temperature: 0.5
####################




fix to help me our candidal and the third car court to commune the same country.

####################
Temperature: 1.0
####################
is if Youth to reps

We must salcenced them, we will receiving Underviewed Peterble.

dofbend on the Cyrezeer Sorget, when he loostined yours

CPU times: user 7min 3s, sys: 8min 14s, total: 15min 17s
Wall time: 1min 10s
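
The step count in the log above can be reproduced with a little arithmetic. The sketch below assumes a batch size of 128, which to my knowledge is the textgenrnn default but is not shown in the output, and assumes a partial final batch is dropped.

```python
num_sequences = 30236  # "Training on 30,236 character sequences."
batch_size = 128       # assumption: textgenrnn's default batch size

# Whole batches per epoch; any partial final batch is dropped.
steps_per_epoch = max(num_sequences // batch_size, 1)
print(steps_per_epoch)  # 236, matching "Train for 236 steps"
```

If that assumption holds, each extra epoch adds another 236 steps, so training time grows roughly linearly with `num_epochs`.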


## Step 4: Generate Text

In [8]:
%%time 

print (textgen.generate())

receined to tax claimed For The right and a new hundred and Methlay etc.

None
CPU times: user 23.1 s, sys: 33 s, total: 56.1 s
Wall time: 4.1 s


Change some simple parameters such as **temperature** (the textgenrnn default is 0.5) to get more creative text.
**Temperature** represents the “creativity” of the text: higher values allow the model to make increasingly suboptimal predictions, which gives more varied but less coherent output.
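
A common way to implement temperature is to divide the model's scores by the temperature before turning them into probabilities. The sketch below shows that idea in plain Python with made-up scores; it illustrates the general technique and is not a copy of textgenrnn's sampling code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate next characters.
logits = [2.0, 1.0, 0.1]
for t in (0.2, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all of the probability goes to the top-scoring character, so the output is safe and repetitive. At higher temperature the distribution flattens and less likely characters get sampled more often.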

In [9]:
%%time 

print (textgen.generate(5, temperature=0.9))

 20%|██        | 1/5 [00:03<00:12,  3.21s/it]

Manday was are standards of tonight, Steve Pizzaer, Some of



 40%|████      | 2/5 [00:08<00:11,  3.79s/it]

inteined expanding with the Unitue of the speciest. They pushiluus to people endansed to keep.



 60%|██████    | 3/5 [00:10<00:06,  3.23s/it]

more of the grave for the tree live out



 80%|████████  | 4/5 [00:14<00:03,  3.41s/it]

immigger, dusting versions called the countrial. They will return to expect



100%|██████████| 5/5 [00:15<00:00,  3.19s/it]

alternates, including the booby of

None
CPU times: user 1min 28s, sys: 2min 7s, total: 3min 36s
Wall time: 16 s





## Step 5: Experiment with Different Texts

- You will not get quality generated text 100% of the time, even with a heavily-trained neural network. 
- Results will vary greatly between datasets. Because the pretrained neural network is relatively small, it cannot store as much data.  For best results, use a dataset with at least 2,000-5,000 documents. If a dataset is smaller, you'll need to train it for longer by setting num_epochs higher when calling a training method and/or training a new model from scratch. 
- A GPU is not required to retrain textgenrnn, but it will take much longer to train on a CPU.
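
To check a dataset against the 2,000-5,000 document guideline, you can count its documents before training. The helper below is a sketch that assumes one document per non-empty line, which appears to be how `train_from_file` reads a file (the training log above reports "648 texts collected"). The function name `count_texts` and the sample file are illustrative.

```python
import os
import tempfile

def count_texts(path):
    """Count non-empty lines, assuming each line is one document."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

# Small demonstration on a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first document\n\nsecond document\nthird document\n")
    sample_path = tmp.name

n_texts = count_texts(sample_path)
os.remove(sample_path)

print(n_texts, "texts;", "consider more epochs" if n_texts < 2000 else "size looks fine")
```

In the lab you would call `count_texts(data_location_local)` on the downloaded file instead of the sample.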