#### Importing Necessary Libraries:

In [4]:
import torch
import numpy as np
from torch import nn
import torch.nn.functional as F

## Loading the Data:

In [5]:
with open('C:/Users/Geekquad/rnn_data/anna.txt', 'r') as f:
    text = f.read()

#### Checking out the first 500 characters:

In [6]:
text[:500]

"Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverything was in confusion in the Oblonskys' house. The wife had\ndiscovered that the husband was carrying on an intrigue with a French\ngirl, who had been a governess in their family, and she had announced to\nher husband that she could not go on living in the same house with him.\nThis position of affairs had now lasted three days, and not only the\nhusband and wife themselves, but all the members of their f"

## Tokenization:

In the cells below, I create two dictionaries that convert characters to and from integers.
Encoding the characters as integers makes them easier to use as input to the network.

In [7]:
"""Creating two dictionaries
   1. int2char : which maps integers to characters
   2. char2int : which maps characters to integers"""

chars = tuple(set(text))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

#ENCODING THE TEXT:
encoded = np.array([char2int[ch] for ch in text])

And we can see those same characters from above, encoded as integers.

In [8]:
encoded[:100]

array([77, 42, 11, 58, 19, 22, 30, 20, 38, 62, 62, 62, 56, 11, 58, 58, 35,
       20, 41, 11, 24, 17, 15, 17, 22, 49, 20, 11, 30, 22, 20, 11, 15, 15,
       20, 11, 15, 17, 45, 22, 55, 20, 22, 21, 22, 30, 35, 20, 18, 46, 42,
       11, 58, 58, 35, 20, 41, 11, 24, 17, 15, 35, 20, 17, 49, 20, 18, 46,
       42, 11, 58, 58, 35, 20, 17, 46, 20, 17, 19, 49, 20, 76,  7, 46, 62,
        7, 11, 35, 50, 62, 62, 68, 21, 22, 30, 35, 19, 42, 17, 46])
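
As a quick sanity check, the mapping can be verified by decoding the integers back into text. The sketch below is only illustrative: it uses a short sample string instead of the book, and it sorts the characters so the mapping is reproducible (the cell above relies on the arbitrary order of `set`, so its integer values may differ between runs). The names `sample`, `encode` and `decode` are hypothetical and do not appear in the original notebook.

```python
import numpy as np

sample = "happy families"  # stand-in for the full text (assumption)

# Sorting makes the character-to-integer mapping deterministic
chars = tuple(sorted(set(sample)))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

def encode(s):
    """Map each character of s to its integer index."""
    return np.array([char2int[ch] for ch in s])

def decode(arr):
    """Map an array of integer indices back to a string."""
    return ''.join(int2char[int(i)] for i in arr)

encoded_sample = encode(sample)
print(encoded_sample)
print(decode(encoded_sample))
```

If the round trip returns the original string, the two dictionaries are consistent inverses of each other.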

## Pre-processing the data:

As in our char-RNN, our LSTM expects one-hot encoded input. This means that each character is first converted into an integer (using the dictionary we created) and then into a vector in which only the position at that integer index has the value 1 and every other position is 0.
The one_hot_encode function below does this:

In [9]:
def one_hot_encode(arr, n_labels):
    one_hot = np.zeros((arr.size, n_labels), dtype = np.float32)
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    return one_hot

In [10]:
test_seq = np.array([[3, 5, 1]])
one_hot = one_hot_encode(test_seq, 8)

print(one_hot)

[[[0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 1. 0. 0. 0. 0. 0. 0.]]]
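
One way to confirm that the encoding loses no information is to invert it. The sketch below repeats `one_hot_encode` so that it runs on its own, then recovers the integer codes with `argmax` over the last axis; the names `row_sums` and `recovered` are introduced here for illustration only.

```python
import numpy as np

def one_hot_encode(arr, n_labels):
    # Same function as above, repeated so this cell runs on its own
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1
    return one_hot.reshape((*arr.shape, n_labels))

test_seq = np.array([[3, 5, 1]])
one_hot = one_hot_encode(test_seq, 8)

# Each vector holds a single 1, so every vector sums to 1
row_sums = one_hot.sum(axis=-1)

# argmax over the last axis recovers the original integer codes
recovered = one_hot.argmax(axis=-1)

print(one_hot.shape)   # (1, 3, 8)
print(recovered)       # [[3 5 1]]
```

The output shape is the input shape with one extra trailing dimension of size `n_labels`, which is the layout an LSTM created with `batch_first=True` expects: (batch, sequence, features).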


## Making training mini-batches

To train on this data, we split it into mini-batches. Each mini-batch holds a fixed number of sequences, and each sequence covers a chosen number of steps.

In [13]:
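def get_batches(arr, batch_size, seq_length):
    """Yield (x, y) batches of shape (batch_size, seq_length) from arr."""
    # Characters per batch, and the number of full batches available
    batch_size_total = batch_size*seq_length
    n_batches = len(arr)//batch_size_total
    
    # Keep only enough characters for full batches, one row per sequence
    arr = arr[:n_batches*batch_size_total]
    arr = arr.reshape((batch_size, -1))
    
    for n in range(0, arr.shape[1], seq_length):
        # Inputs
        x = arr[:, n:n+seq_length]
        # Targets: the inputs shifted one step to the left
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+seq_length]
        except IndexError:
            # Last batch: wrap around to the first column
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y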

Now I'll create a batch and check what the batching produces. Here I use a batch size of 8 and 50 sequence steps.

In [14]:
batches = get_batches(encoded, 8, 50)
x, y = next(batches)

In [17]:
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[77 42 11 58 19 22 30 20 38 62]
 [49 76 46 20 19 42 11 19 20 11]
 [22 46 75 20 76 30 20 11 20 41]
 [49 20 19 42 22 20 53 42 17 22]
 [20 49 11  7 20 42 22 30 20 19]
 [53 18 49 49 17 76 46 20 11 46]
 [20 12 46 46 11 20 42 11 75 20]
 [31  8 15 76 46 49 45 35 50 20]]

y
 [[42 11 58 19 22 30 20 38 62 62]
 [76 46 20 19 42 11 19 20 11 19]
 [46 75 20 76 30 20 11 20 41 76]
 [20 19 42 22 20 53 42 17 22 41]
 [49 11  7 20 42 22 30 20 19 22]
 [18 49 49 17 76 46 20 11 46 75]
 [12 46 46 11 20 42 11 75 20 49]
 [ 8 15 76 46 49 45 35 50 20 74]]
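
The relationship between `x` and `y` is easier to see on toy data. The sketch below repeats `get_batches` so that it runs on its own and feeds it the integers 0 to 39 instead of the encoded book; the toy array and the names `toy`, `all_batches`, `first_x`, `first_y` and `last_y` are assumptions made for illustration.

```python
import numpy as np

def get_batches(arr, batch_size, seq_length):
    # Same generator as above, repeated so this cell runs on its own
    batch_size_total = batch_size * seq_length
    n_batches = len(arr) // batch_size_total
    arr = arr[:n_batches * batch_size_total]
    arr = arr.reshape((batch_size, -1))
    for n in range(0, arr.shape[1], seq_length):
        x = arr[:, n:n + seq_length]
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n + seq_length]
        except IndexError:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y

toy = np.arange(40)  # toy data: 0, 1, ..., 39
all_batches = list(get_batches(toy, batch_size=2, seq_length=5))

first_x, first_y = all_batches[0]
last_x, last_y = all_batches[-1]

print(len(all_batches))  # 4
print(first_x)           # rows start at 0 and at 20
print(first_y)           # every target is the input plus one
print(last_y)            # last column wraps around to 0 and 20
```

With 40 values, 2 sequences per batch and 5 steps per sequence, the data is reshaped into 2 rows of 20 values, giving 4 batches. In every batch except the last, each target is simply the next value in the same row. In the last batch there is no next column, so the final target wraps around to the first column of the row.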


## Building the Network:

In [None]:
class CharRNN(nn.Module):
    def __init__(self, tokens, n_hidden=256, n_layers=2, drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}
        
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(n_hidden, len(self.chars))
        
    def forward(self, x, hidden):
        r_output, hidden = self.lstm(x, hidden)
        out = self.dropout(r_output)
        out = out.contiguous().view(-1, self.n_hidden)
        out = self.fc(out)
        return out, hidden
    
    def init_hidden(self, batch_size):
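        # Reuse an existing parameter so the new tensors share its dtype and device
        weight = next(self.parameters()).data
        
        # Zero-initialised hidden state and cell state
        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_())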
        
        return hidden       