# TensorFlow Text Prediction
#### Authors: Alexandria Davis, Donald Dong
## Introduction

A popular problem in data science is predictive text. Whether it is used for typing suggestions, for making swipe keyboards more accurate, or simply for fun, text prediction is a challenge most data scientists attack with a neural net.

This neural net is based on [Karpathy's neural net](https://gist.github.com/karpathy/d4dee566867f8291f086), but it uses TensorFlow to improve the net's speed and object-oriented programming to make the code more readable. The neural nets can also save their progress mid-training, which makes testing and crash recovery easier.

In [None]:
%matplotlib inline
alice_txt = '../data/alice.txt'


from datetime import datetime, timedelta
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt


tf.logging.set_verbosity(tf.logging.WARN)
pd.set_option('display.max_colwidth', -1)

In [None]:
"""
Temp import the source. Will be removed in the final report
"""
import sys
sys.path.append('../')
# Import all dependencies here so we don't have to run every cell
from src.dataset import Batch
from src.dataset import Dataset
from src.text_generator import RNNTextGenerator
from src.model_selector import ModelSelector
from src.time_limit import time_limit

## Reading the data
To prepare our data for the neural net, we first split it into fixed-length sequences. To streamline the process, we used the `Dataset` class to store and manage our input data. This class is responsible for splitting the data into strings of the correct length and for turning them into one-hot encoded arrays that the neural net can work with. We stored the prepared data in `Batch` objects, which have `inputs` and `targets` attributes for our model to use in training.
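
As a rough illustration of the preparation described above, the sketch below splits a string into fixed-length sequences and one-hot encodes them with NumPy. It is not the actual `Dataset` code: the helper name `one_hot_sequences` is hypothetical, and it assumes that each target sequence is the input sequence shifted by one character, as in Karpathy's character-level model.

```python
import numpy as np


def one_hot_sequences(text, seq_length):
    """Split text into fixed-length sequences and one-hot encode them.

    Hypothetical helper for illustration only; not the real Dataset class.
    """
    vocab = sorted(set(text))
    char_to_ix = {ch: i for i, ch in enumerate(vocab)}
    ids = np.array([char_to_ix[ch] for ch in text])

    # Assumption: the target is the input shifted by one character.
    n_seqs = (len(ids) - 1) // seq_length
    inputs = ids[:n_seqs * seq_length].reshape(n_seqs, seq_length)
    targets = ids[1:n_seqs * seq_length + 1].reshape(n_seqs, seq_length)

    # Indexing an identity matrix with integer ids yields one-hot rows.
    eye = np.eye(len(vocab), dtype=np.float32)
    return eye[inputs], eye[targets], vocab


inputs, targets, vocab = one_hot_sequences('hello world', seq_length=5)
print(inputs.shape, targets.shape)
```

Both arrays have the shape `(number of sequences, seq_length, vocabulary size)`, which is the layout a recurrent net typically consumes.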

In [None]:
"""
The real code will be inserted here in the final report
"""
from src.dataset import Batch

In [None]:
"""
The real code will be inserted here in the final report
"""
from src.dataset import Dataset

## Batching the data

In [None]:
"""
The real code will be inserted here in the final report
"""
from test.dataset_test import test_batch

In [None]:
test_batch(alice_txt, 5, 100) # The test passes without any errors

## Build the RNN Text Generator

The text generator itself is stored in the `RNNTextGenerator` class. Among other things, keeping the generator and its TensorFlow session together in one class helps prevent accidental data loss.

The class also internalizes the methods needed to save the model to a file and restore it. This allows for long-term storage and quick retrieval of a model, and it makes it easier to reuse the weights in a model with a differently sized input.

Note that the text generator does not take `Batch` objects when training; it needs to be fed the inputs and targets separately.

In [None]:
"""
The real code will be inserted here in the final report
"""
from src.text_generator import RNNTextGenerator

## Save and restore the model

In [None]:
"""
The real code will be inserted here in the final report
"""
from test.text_generator_test import test_save_restore

In [None]:
test_save_restore(4, 5, 10) # The test passes without any errors

## Collect tensorflow logs

In [None]:
"""
The real code will be inserted here in the final report
"""
from test.text_generator_test import test_log

In [None]:
test_log(4, 10, '../tf_logs') # The test passes without any errors

### *A screenshot from TensorBoard will be inserted here*

## Training the RNN Text Generator
A short amount of training provides us with a model that is capable of forming multiple words and a few phrases, but not much more. 

In [None]:
"""
The real code will be inserted here in the final report
"""
from test.alice_test import test_alice

Let's generate some text! We start from the seed string `'my favorite '`:

In [None]:
scores = test_alice(alice_txt, 'my favorite ')

In [None]:
fig, axes = plt.subplots(figsize=(15, 6), ncols=2)
scores['accuracy'].plot(ax=axes[0], title='Accuracy')
scores['loss'].plot(ax=axes[1], title='Loss')
for ax in axes:
    ax.set(xlabel='Steps')

## Build a Model Selector 

In [None]:
"""
The real code will be inserted here in the final report
"""
from src.model_selector import ModelSelector

In [None]:
"""
The real code will be inserted here in the final report
"""
from test.model_selector_test import test_model_selector

In [None]:
seq_length = 25
dataset = Dataset([alice_txt], seq_length)
params = {
    'rnn_cell': [
        tf.nn.rnn_cell.BasicRNNCell,
        tf.nn.rnn_cell.LSTMCell,
        tf.nn.rnn_cell.GRUCell,
    ],
    'n_neurons': np.arange(1, 1000),
    'activation': [
        tf.nn.relu,
        tf.nn.leaky_relu,
        tf.nn.relu6,
        tf.nn.crelu,
        tf.nn.elu,
        tf.nn.selu,
        tf.nn.softplus,
        tf.nn.softsign,
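        # tf.nn.dropout is left out: it is not an activation function and
        # cannot be called with a single tensor argument.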
        tf.sigmoid,
        tf.tanh,
    ],
    'output_keep_prob': np.linspace(0.5, 1, 100, endpoint=True),
    'optimizer': [
        tf.train.AdamOptimizer,
        tf.train.GradientDescentOptimizer,
    ],
    'learning_rate': np.linspace(0, 1, 10000, endpoint=False),
    'epoch': np.arange(5, 100),
    'batch_size': np.arange(25, 100),
}
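
The candidate lists above are far too large to try exhaustively, so one plausible way to use them is random search: draw one value per hyperparameter, train, and keep the best result. The sketch below shows only the sampling step. It is an assumption about how such a search could work, not the actual `ModelSelector` code, and `sample_params` and `toy_space` are hypothetical names.

```python
import random


def sample_params(param_space, rng=random):
    """Pick one candidate value per hyperparameter, uniformly at random.

    Hypothetical helper; the real ModelSelector may sample differently.
    """
    return {
        name: rng.choice(list(values))
        for name, values in param_space.items()
    }


# A small stand-in for the full search space defined above.
toy_space = {
    'n_neurons': range(1, 1000),
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': range(25, 100),
}
candidate = sample_params(toy_space, random.Random(0))
print(candidate)
```

Each call returns one complete configuration, so repeated calls explore the space without enumerating every combination.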

In [None]:
test_model_selector(dataset, params, 1)

We allow models to train for a restricted amount of time. This is useful for training multiple models overnight, as they do not need to be stopped manually.
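
A time budget like this can be expressed as a generator that keeps yielding until a deadline passes. The sketch below is a minimal version under that assumption. It is not the actual `src.time_limit` code, and the name `time_limit_sketch` is hypothetical. The deadline is only checked between iterations, so an iteration that is already running is never interrupted.

```python
import time
from datetime import datetime, timedelta


def time_limit_sketch(**duration):
    """Yield 0, 1, 2, ... until the time budget has elapsed.

    `duration` accepts the same keyword arguments as `timedelta`,
    for example hours=3. Hypothetical stand-in for `time_limit`.
    """
    deadline = datetime.now() + timedelta(**duration)
    step = 0
    while datetime.now() < deadline:
        yield step
        step += 1


steps = 0
for _ in time_limit_sketch(milliseconds=50):
    time.sleep(0.01)  # stand-in for one round of training
    steps += 1
print('completed', steps, 'steps')
```

Because the check happens between iterations, the loop can overrun the budget by up to the length of one iteration.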

In [None]:
"""
The real code will be inserted here in the final report
"""
from src.time_limit import time_limit

## Select the best model

In [None]:
selector = ModelSelector(dataset, params)

In [None]:
for _ in time_limit(hours=3):
    selector.search()

In [None]:
selector.as_df().head(10)

In [None]:
model = selector.best_model()
start_seq = 'Alice is'
print(start_seq, model.generate(
    dataset,
    start_seq,
    100
))

We then continued training the best model on the same dataset to see how much it improved with further passes over the data.

After each round of training, we paused to test the model by computing its scores and generating a sample text. This information is stored for comparison purposes.

In [None]:
for _ in time_limit(hours=1):
    model.fit(dataset)
    start_seq = 'Alice is '
    print(start_seq, model.generate(
        dataset,
        start_seq,
        100
    ))
    print('-----------------------')