# Text Generation

LSTMs and other RNNs are usually treated as predictive models and used for classification. An RNN can also be used as a generative model that learns from a text and then generates new, related sequences. Generating text is a useful way to study how well a model has learned the data, and generative models can also be used for data augmentation; methods based on Generative Adversarial Networks are widely used for that purpose. In this recipe, we will learn how to develop a generative model using an LSTM on textual data.

## Data 
In this implementation I am using a book from the Project Gutenberg repository, [Gypsy Sorcery and Fortune Telling by Charles Godfrey Leland](https://www.gutenberg.org/ebooks/58465). You can substitute any other text for this book. The book is stored in plain-text format at `data/58465.txt`.

# Importing Requirements

In [8]:
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from torch.autograd import Variable

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Reading the data

In [6]:
# Assuming the file is UTF-8 encoded; adjust the encoding for other texts
with open('data/58465.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [7]:
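# Unique characters in the text; sorted so the mapping is the same on every run
chars = tuple(sorted(set(text)))
# Lookup tables: index -> character and character -> index
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}
# The whole text as an array of integer indices
encoded = np.array([char2int[ch] for ch in text])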
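
To make the character-to-index mapping concrete, here is a small sketch of the same encoding applied to a toy string. The string and the `toy_` names are assumptions made for illustration; they are not part of the recipe.

```python
import numpy as np

# Toy stand-in for the book text (an assumption for illustration only)
toy_text = "hello world"

# Sorted so the mapping is the same on every run
toy_chars = tuple(sorted(set(toy_text)))
toy_int2char = dict(enumerate(toy_chars))
toy_char2int = {ch: ii for ii, ch in toy_int2char.items()}

# Encode the text as indices, then decode it back to characters
toy_encoded = np.array([toy_char2int[ch] for ch in toy_text])
toy_decoded = ''.join(toy_int2char[ii] for ii in toy_encoded)

print(toy_chars)
print(toy_encoded)
print(toy_decoded)
```

Decoding the indices returns the original string, which is a quick sanity check that the two dictionaries are inverses of each other.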

In [0]:
def one_hot_encode(arr, n_labels):
    """
    one-hot encoding the data
    """
    
    # Initialize the encoded array (one row per element of arr)
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    
    # Fill the appropriate elements with ones
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    
    # Finally reshape it to get back to the original array
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    
    return one_hot
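
As a quick check of what a one-hot encoded batch looks like, the sketch below builds the same layout for a toy array by indexing an identity matrix. This is an alternative formulation used only for illustration, with made-up values and an assumed vocabulary size of 4.

```python
import numpy as np

# Toy batch of character indices, shape (2, 3); the values are made up
toy_arr = np.array([[0, 2, 1],
                    [3, 1, 0]])
n_labels = 4  # assumed vocabulary size for this example

# Indexing an identity matrix with the index array yields one row per index,
# which gives a (2, 3, 4) array with a single 1 at each character's index
toy_one_hot = np.eye(n_labels, dtype=np.float32)[toy_arr]

print(toy_one_hot.shape)   # (2, 3, 4)
print(toy_one_hot[0, 1])   # [0. 0. 1. 0.]
```

The last axis has the size of the vocabulary, and taking the argmax along it recovers the original indices.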

In [0]:
def get_batches(arr, n_seqs, n_steps):
    '''Create a generator that returns mini-batches of size
       n_seqs x n_steps from arr.
    '''
    
    batch_size = n_seqs * n_steps
    n_batches = len(arr)//batch_size
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches * batch_size]
    # Reshape into n_seqs rows
    arr = arr.reshape((n_seqs, -1))
    
    for n in range(0, arr.shape[1], n_steps):
        # The features
        x = arr[:, n:n+n_steps]
        # The targets, shifted by one
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+n_steps]
        except IndexError:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y
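
The sketch below walks through the first mini-batch by hand on a toy array, to show how the targets are the inputs shifted by one character. The toy sizes are assumptions chosen for readability.

```python
import numpy as np

toy_arr = np.arange(20)      # stand-in for the encoded text
n_seqs, n_steps = 2, 5       # toy sizes, chosen for readability

# Same reshaping as in get_batches: each row is one contiguous chunk of the text
rows = toy_arr.reshape((n_seqs, -1))
print(rows)
# [[ 0  1  2  3  4  5  6  7  8  9]
#  [10 11 12 13 14 15 16 17 18 19]]

# First window: inputs are columns 0..4, targets are columns 1..5
x = rows[:, 0:n_steps]
y = rows[:, 1:n_steps + 1]
print(x)
print(y)
```

Every target is the character that follows the corresponding input, so the network is trained to predict the next character at each position.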


## Defining the model
- Model definition
- Function for prediction
- Initializing weights
- Initializing hidden states

The network architecture is very simple. It takes the characters, converted to numerical indices and then one-hot encoded, as input. This input is passed to the LSTM, which returns its output along with the hidden state and the cell state. Dropout is applied to the LSTM output, and the outputs for all time steps are then stacked on top of one another. As an example, assume the following values:

1. Batch size = 32
2. Hidden size = 256
3. Sequence length = 100

With these values, the LSTM produces an output of size [batch size, hidden size] at each time step, for a number of time steps equal to the sequence length (100). Because the LSTM is created with `batch_first=True`, the full output has shape [batch size, 100, hidden size]. These 100 outputs are stacked to form a tensor of shape [100 * batch size, hidden size], which is passed to the fully connected layer. The fully connected layer transforms [100 * batch size, hidden size] into [100 * batch size, number of unique characters]. To make this easier to picture, think of the final shape as [32, 100, number of unique characters]. After softmax is applied, each of the 100 positions in each of the 32 sequences holds a probability distribution over the unique characters. The implementation of the model is given below; the prediction function follows in a later section.
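
The shape walk-through above can be checked with a small sketch that pushes a random batch through an LSTM and a linear layer. The vocabulary size of 80 is an assumption (the real value depends on the text), and the layers here are fresh stand-ins, not the trained model.

```python
import torch
import torch.nn.functional as F
from torch import nn

batch_size, seq_len, n_hidden = 32, 100, 256
n_chars = 80  # assumed vocabulary size; the real value depends on the text

lstm = nn.LSTM(n_chars, n_hidden, num_layers=2, batch_first=True)
fc = nn.Linear(n_hidden, n_chars)

# Random tensor standing in for a one-hot encoded batch
x = torch.rand(batch_size, seq_len, n_chars)

with torch.no_grad():
    out, (h, c) = lstm(x)                  # out: [32, 100, 256]
    stacked = out.reshape(-1, n_hidden)    # [3200, 256]
    logits = fc(stacked)                   # [3200, 80]
    probs = F.softmax(logits, dim=1)       # each row sums to 1

print(out.shape, stacked.shape, logits.shape)
print(probs.view(batch_size, seq_len, n_chars).shape)
```

Reshaping the probabilities back to [32, 100, 80] gives one distribution over characters for every position in every sequence of the batch.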



In [0]:
class CharRNN(nn.Module):
    def __init__(self, tokens, n_steps=100, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}
        
        self.dropout = nn.Dropout(drop_prob)
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
        self.fc = nn.Linear(n_hidden, len(self.chars))
        
        self.init_weights()
        
    def forward(self, x, hc):
        ''' Forward pass through the network '''
        
        x, (h, c) = self.lstm(x, hc)
        x = self.dropout(x)
        
        # Stack up LSTM outputs
        x = x.view(x.size()[0]*x.size()[1], self.n_hidden)
        
        x = self.fc(x)
        
        return x, (h, c)
    
    def init_weights(self):
        ''' Initialize weights for fully connected layer '''
        initrange = 0.1
        
        # Set bias tensor to all zeros
        self.fc.bias.data.fill_(0)
        # FC weights as random uniform in [-initrange, initrange]
        self.fc.weight.data.uniform_(-initrange, initrange)
        
    def init_hidden(self, n_seqs):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x n_seqs x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        # new_zeros keeps the dtype and device of the model parameters
        weight = next(self.parameters())
        return (weight.new_zeros(self.n_layers, n_seqs, self.n_hidden),
                weight.new_zeros(self.n_layers, n_seqs, self.n_hidden))

# Training

In [0]:
def train(net, data, epochs=10, n_seqs=10, n_steps=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):
    ''' Train a network 
    
        Arguments
        ---------
        
        net: CharRNN network
        data: text data to train the network
        epochs: Number of epochs to train
        n_seqs: Number of mini-sequences per mini-batch, aka batch size
        n_steps: Number of character steps per mini-batch
        lr: learning rate
        clip: gradient clipping
        val_frac: Fraction of data to hold out for validation
        print_every: Number of steps for printing training and validation loss
    
    '''
    
    net.train()
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    # create training and validation data
    val_idx = int(len(data)*(1-val_frac))
    data, val_data = data[:val_idx], data[val_idx:]
    
    net.to(device)
    
    counter = 0
    train_losses = []
    validation_losses = []
    n_chars = len(net.chars)
    for e in range(epochs):
        h = net.init_hidden(n_seqs)
        for x, y in get_batches(data, n_seqs, n_steps):
            counter += 1
            
            # One-hot encode our data and make them Torch tensors
            x = one_hot_encode(x, n_chars)
            x, y = torch.from_numpy(x), torch.from_numpy(y)
            
            # CrossEntropyLoss expects integer (long) class indices as targets
            inputs, targets = x.to(device), y.long().to(device)

            # Detach the hidden state from its history, otherwise
            # we'd backprop through the entire training history
            h = tuple(each.detach() for each in h)

            net.zero_grad()
            
            output, h = net.forward(inputs, h)
            loss = criterion(output, targets.view(n_seqs*n_steps))

            loss.backward()
            
            # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(net.parameters(), clip)

            opt.step()
            
            if counter % print_every == 0:
                
                # Get validation loss; switch to eval mode so dropout is disabled
                net.eval()
                val_h = net.init_hidden(n_seqs)
                val_losses = []
                # no_grad replaces the removed `volatile=True` flag
                with torch.no_grad():
                    for x, y in get_batches(val_data, n_seqs, n_steps):
                        # One-hot encode our data and make them Torch tensors
                        x = one_hot_encode(x, n_chars)
                        x, y = torch.from_numpy(x), torch.from_numpy(y)
                        inputs, targets = x.to(device), y.long().to(device)

                        output, val_h = net(inputs, val_h)
                        val_loss = criterion(output, targets.view(n_seqs*n_steps))
                        val_losses.append(val_loss.item())

                # Record one training and one validation point per logging step
                train_losses.append(loss.item())
                validation_losses.append(np.mean(val_losses))
                # Back to training mode for the next batches
                net.train()
                
                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))
                
    return train_losses, validation_losses
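
To see what gradient clipping does in isolation, here is a minimal sketch on a toy linear layer with deliberately large gradients. The layer and the scaled loss are assumptions for illustration; only `clip_grad_norm_` itself is taken from the training loop.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_net = nn.Linear(10, 10)   # stand-in for the CharRNN network
clip = 5                      # same value as the default in train()

# Produce deliberately large gradients by scaling the loss
loss = 1000 * toy_net(torch.randn(4, 10)).pow(2).sum()
loss.backward()

# clip_grad_norm_ rescales the gradients in place and returns
# the total gradient norm measured before clipping
norm_before = nn.utils.clip_grad_norm_(toy_net.parameters(), clip)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in toy_net.parameters()))

print(float(norm_before), float(norm_after))
```

The norm after clipping should not exceed `clip`, while the direction of the gradient is preserved.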

In [0]:
net = CharRNN(chars, n_hidden=512, n_layers=2)

In [0]:
n_seqs, n_steps = 128, 100
train_losses, validation_losses = train(net, encoded, epochs=25, n_seqs=n_seqs, n_steps=n_steps, lr=0.001, print_every=10)

# Plotting progress

In [0]:
import matplotlib.pyplot as plt

plt.plot(train_losses, label="Training Loss")
plt.plot(validation_losses, label="Validation Loss")
plt.legend(loc='upper right')
# Losses are recorded at logging steps, not once per epoch
plt.xlabel("Logging step")
plt.ylabel("Loss")
plt.show()


# Predicting


The predict function takes a previously trained network as input, along with other parameters such as a character and the hidden state. The argument `char` acts as a seed: generation is started from a short sequence such as 'the' or 'we', which is fed to the function one character at a time. After receiving these arguments, the predict function performs the following steps:

1. Convert the character to its numerical index by dictionary lookup.
2. Create new variables for the hidden state; otherwise, it would backprop through the entire training history.
3. Pass the one-hot encoded index through the network and get back the probability distribution for the next character, as discussed previously in the network architecture.
4. Keep the top 5 most probable characters, one of which will be chosen as the final candidate.
5. Make a weighted random choice among them, using NumPy's random choice with a non-uniform distribution. Our predictions come from a categorical probability distribution over all the possible characters. We can make the sampled text more reasonable but less variable by only considering the most probable characters. This prevents the network from giving us completely absurd characters while still allowing some noise and randomness in the sampled text.
6. Repeat steps 1-5 until the specified maximum length is reached, then return all the generated characters.

After training the network for 25 epochs, the generated text looks as follows:

> there when a seal of the places a stall as to the Strunk of all seence, into a comman trace the whole the proticle that or taken to trear to the work and reperiously trancelly would hus believed to the cu

It seems the network has learned to place articles like "the" and "a", but it still makes spelling mistakes. Training the network for longer could yield better results. With such a small amount of text, the generated text is unlikely to show much novelty.
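
Steps 4 and 5 can be sketched on their own with NumPy. The probabilities and the six-character vocabulary below are made up for illustration; in the recipe they come from the softmax output of the network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up probabilities over a 6-character vocabulary (assumption for illustration)
toy_chars = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
p = np.array([0.05, 0.40, 0.02, 0.30, 0.03, 0.20])
top_k = 3

# Indices of the top_k most probable characters
top_idx = np.argsort(p)[-top_k:]
# Renormalize so the kept probabilities sum to 1
top_p = p[top_idx] / p[top_idx].sum()

# Weighted random choice restricted to the kept characters
samples = rng.choice(toy_chars[top_idx], size=10, p=top_p)
print(samples)
```

Only 'b', 'd' and 'f' can be drawn here, so unlikely characters are never sampled, while the choice among the likely ones stays random.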

In [0]:
def predict(net, char, h=None, top_k=None):
    ''' 
    Given a character, predict the next character.
    Returns the predicted character and the hidden state.
    '''

    if h is None:
        h = net.init_hidden(1)

    # Look up the character index and one-hot encode it, shape (1, 1, n_chars)
    x = np.array([[net.char2int[char]]])
    x = one_hot_encode(x, len(net.chars))
    inputs = torch.from_numpy(x).to(device)

    # no_grad replaces the removed `volatile=True` flag
    with torch.no_grad():
        # Detach the hidden state from its history
        h = tuple(each.detach() for each in h)
        out, h = net(inputs, h)

        # p = predicted probabilities over all characters
        p = F.softmax(out, dim=1)

    if top_k is None:
        top_ch = np.arange(len(net.chars))
    else:
        # Keep only the top_k most probable characters
        p, top_ch = p.topk(top_k)
        top_ch = top_ch.cpu().numpy().squeeze()

    p = p.cpu().numpy().squeeze()
    # Random choice on the basis of the renormalized weighted distribution
    char = np.random.choice(top_ch, p=p/p.sum())

    return net.int2char[char], h

In [0]:
def sample(net, size, prime='The', top_k=None):
        
    net.to(device)

    net.eval()
    
    # First off, run through the prime characters
    chars = [ch for ch in prime]
    h = net.init_hidden(1)
    for ch in prime:
        char, h = predict(net, ch, h, top_k=top_k)

    chars.append(char)
    
    # Now pass in the previous character and get a new one
    for ii in range(size):
        char, h = predict(net, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)

In [0]:
print(sample(net, 200, prime='the', top_k=5))