## Importing Necessary Libraries:

In [20]:
import torch
import numpy as np
from torch import nn
import torch.nn.functional as F

## Loading the Data:

In [21]:
with open('C:/Users/Geekquad/rnn_data/anna.txt', 'r') as f:
    text = f.read()

#### Checking out the first 500 characters:

In [22]:
text[:500]

"Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverything was in confusion in the Oblonskys' house. The wife had\ndiscovered that the husband was carrying on an intrigue with a French\ngirl, who had been a governess in their family, and she had announced to\nher husband that she could not go on living in the same house with him.\nThis position of affairs had now lasted three days, and not only the\nhusband and wife themselves, but all the members of their f"

## Tokenization:

In the cell below, I create two dictionaries that convert characters to integers and back.
Encoding the characters as integers makes the text easier to use as input to the network.

In [23]:
"""Creating two dictionaries
   1. int2char : which maps integers to characters
   2. char2int : which maps characters to integers"""

chars = tuple(set(text))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

# Encoding the text:
encoded = np.array([char2int[ch] for ch in text])

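We can now see the first 100 characters from above, encoded as integers. Because the mapping is built from a set, the exact integers can differ from one run to the next.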

In [24]:
encoded[:100]

array([60,  5,  3, 80, 48, 25, 13, 78, 23, 50, 50, 50, 24,  3, 80, 80, 79,
       78, 26,  3, 58, 57, 77, 57, 25,  9, 78,  3, 13, 25, 78,  3, 77, 77,
       78,  3, 77, 57, 27, 25, 31, 78, 25, 81, 25, 13, 79, 78, 82, 18,  5,
        3, 80, 80, 79, 78, 26,  3, 58, 57, 77, 79, 78, 57,  9, 78, 82, 18,
        5,  3, 80, 80, 79, 78, 57, 18, 78, 57, 48,  9, 78, 52, 76, 18, 50,
       76,  3, 79,  6, 50, 50, 64, 81, 25, 13, 79, 48,  5, 57, 18])
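
As a quick sanity check on the two dictionaries, the encoding can be reversed to recover the original text. The sketch below is illustrative only: `sample_text` is a hypothetical stand-in for the novel so that it runs without the file, and the `sorted()` call is an added assumption (not in the cell above) that makes the mapping reproducible across runs.

```python
import numpy as np

# Hypothetical stand-in for the full text, so this runs without anna.txt.
sample_text = "Happy families are all alike"

# Same construction as above, except that sorted() is added (an assumption)
# so the character-to-integer mapping is the same on every run.
chars = tuple(sorted(set(sample_text)))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

# Encode the characters as integers, then map the integers back.
encoded_sample = np.array([char2int[ch] for ch in sample_text])
decoded_sample = "".join(int2char[int(ii)] for ii in encoded_sample)

print(decoded_sample == sample_text)
```

If the round trip does not reproduce the input, the two dictionaries are not inverses of each other.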

## Pre-processing the data:

As in our char-RNN, our LSTM expects one-hot encoded input: each character is first converted to an integer (using the dictionary created above) and then to a vector in which only the entry at that integer index is 1 and every other entry is 0.
The one_hot_encode function below does this:

In [30]:
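def one_hot_encode(arr, n_labels):
    # Start with one all-zero row per element of arr
    one_hot = np.zeros((arr.size, n_labels), dtype = np.float32)
    # Set a 1 in each row at the column given by that element's integer value
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1
    # Restore the original shape of arr, with the one-hot axis appended last
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    return one_hot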

In [31]:
test_seq = np.array([[3, 5, 1]])
one_hot = one_hot_encode(test_seq, 8)

print(one_hot)

[[[0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 1. 0. 0. 0. 0. 0. 0.]]]
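
The same function also handles a batch of several sequences, and the integer indices can be recovered from the one-hot array with argmax. The following is a minimal sketch of that check. The function is repeated so the block runs on its own, and `batch` is a made-up example, not data from the text.

```python
import numpy as np

def one_hot_encode(arr, n_labels):
    # Same logic as the function defined above, repeated for self-containment.
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1
    return one_hot.reshape((*arr.shape, n_labels))

# Hypothetical batch: 2 sequences of 3 integer-encoded characters each.
batch = np.array([[3, 5, 1], [0, 7, 2]])
one_hot_batch = one_hot_encode(batch, 8)

# The one-hot axis is appended last: (sequences, steps, labels).
print(one_hot_batch.shape)

# argmax over the last axis returns the position of the 1 in each vector,
# which is the original integer index.
recovered = one_hot_batch.argmax(axis=-1)
print(np.array_equal(recovered, batch))
```

Note that n_labels must be larger than the largest integer in the input; otherwise the indexing step raises an IndexError.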
