Dataset: 

You can use any plain-text (.txt) file: create one yourself or download one from any source, e.g. Project Gutenberg (https://www.gutenberg.org/).

I created the file Rustaveli.txt (the name the code below opens), which you can download here and use:
https://1drv.ms/t/s!AhnVhbVlzYkKgQmnRS7FrC4QrQNM?e=800gvv 

In [1]:
import numpy as np

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
#from torch.distributions.categorical import Categorical

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [2]:
with open('Rustaveli.txt', 'r', encoding="utf8") as fl:
    text=fl.read()
    
start_indx = text.find('Shota Rustaveli')
end_indx = text.find('The End')

text = text[start_indx:end_indx]       # keep only the poem: from the title up to (not including) 'The End'
char_set = set(text)                   # a set keeps only the unique characters
print('Total Length:', len(text))
print('Unique Characters:', len(char_set))

Total Length: 403097
Unique Characters: 68


In [3]:
chars_sorted = sorted(char_set)
char2int = {ch: i for i, ch in enumerate(chars_sorted)}   # character -> integer index (often called the character dictionary)
char_array = np.array(chars_sorted)                       # integer index -> character (reverse lookup)

text_encoded = np.array(
    [char2int[ch] for ch in text],
    dtype=np.int32)                                     # integer code of every character, in text order

print('Text encoded shape: ', text_encoded.shape)

print(text[:15], '     == Encoding ==> ', text_encoded[:15])
# print(text_encoded[15:21], ' == Reverse  ==> ', ''.join(char_array[text_encoded[15:21]]))

Text encoded shape:  (403097,)
Shota Rustaveli      == Encoding ==>  [32 48 55 60 41  2 31 61 59 60 41 62 45 52 49]


In [4]:
for ex in text_encoded[:5]:
    print('{} -> {}'.format(ex, char_array[ex]))

32 -> S
48 -> h
55 -> o
60 -> t
41 -> a
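The mapping above can be checked with a round trip from characters to integers and back. This is a minimal sketch on a toy string; `toy_text` and `int2char` are names introduced here for illustration and are not part of the notebook.

```python
# Toy round trip: characters -> integer codes -> characters.
toy_text = "hello world"

chars_sorted = sorted(set(toy_text))                     # unique characters in a fixed order
char2int = {ch: i for i, ch in enumerate(chars_sorted)}  # character -> index
int2char = dict(enumerate(chars_sorted))                 # index -> character (hypothetical helper)

encoded = [char2int[ch] for ch in toy_text]
decoded = "".join(int2char[i] for i in encoded)

print(encoded)
print(decoded == toy_text)
```

Sorting the character set first keeps the mapping reproducible between runs, which matters when a saved model is reloaded later.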


In [5]:
seq_length = 40
chunk_size = seq_length + 1

text_chunks = [text_encoded[i:i+chunk_size] 
               for i in range(len(text_encoded)-chunk_size+1)]   # +1 so the last full window is included: n - chunk_size + 1 windows in total

for seq in text_chunks[:1]:                     #let's look at the first chunk
    input_seq = seq[:seq_length]                #sequence chunk that we use for prediction of next letter
    target = seq[seq_length]                    #next letter that has to be predicted 
    print(input_seq, ' -> ', target)
    print(repr(''.join(char_array[input_seq])), # repr() shows whitespace such as '\n' explicitly
          ' -> ', repr(''.join(char_array[target])))

[32 48 55 60 41  2 31 61 59 60 41 62 45 52 49  2  1  1 22 54 60 58 55 44
 61 43 60 55 58 65  2 30 61 41 60 58 41 49 54 59]  ->  1
'Shota Rustaveli \n\nIntroductory Quatrains'  ->  '\n'
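The window count used above is easier to see on a tiny input. The sketch below repeats the same slicing on a list of ten integers; `make_chunks` and `toy` are hypothetical names used only for this illustration.

```python
def make_chunks(encoded, seq_length):
    """Return every window of seq_length + 1 consecutive items."""
    chunk_size = seq_length + 1
    return [encoded[i:i + chunk_size]
            for i in range(len(encoded) - chunk_size + 1)]

toy = list(range(10))                 # stand-in for text_encoded
chunks = make_chunks(toy, seq_length=3)

print(len(chunks))                    # number of windows: 10 - 4 + 1
first = chunks[0]
print(first[:-1], '->', first[1:])    # input and its shifted-by-one target
```

By the same formula, the 403097 characters of the text with chunk_size = 41 give 403097 - 41 + 1 = 403057 chunks.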


In [6]:
class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)
    
    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        # input sequence (all characters but the last) and target sequence
        # (the same chunk shifted by one character, i.e. all but the first)
        return text_chunk[:-1].long(), text_chunk[1:].long()
        
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))  # stack into one ndarray first; converting a list of arrays is very slow


In [7]:
# quick check that everything works as expected
for i, (seq, target) in enumerate(seq_dataset):
    print(' Input (x):', repr(''.join(char_array[seq])))
    print('Target (y):', repr(''.join(char_array[target])))
    print()
    if i == 1:
        break
    

 Input (x): 'Shota Rustaveli \n\nIntroductory Quatrains'
Target (y): 'hota Rustaveli \n\nIntroductory Quatrains\n'

 Input (x): 'hota Rustaveli \n\nIntroductory Quatrains\n'
Target (y): 'ota Rustaveli \n\nIntroductory Quatrains\n\n'
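Building every window up front stores each character about 41 times. One possible alternative, sketched here on the assumption that slicing on demand is acceptable, keeps only the encoded text and cuts the window inside `__getitem__`. `LazyTextDataset` is a hypothetical name and not part of the notebook.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class LazyTextDataset(Dataset):
    """Hypothetical variant of TextDataset that slices windows on demand."""
    def __init__(self, text_encoded, seq_length):
        self.data = torch.as_tensor(text_encoded, dtype=torch.long)
        self.seq_length = seq_length

    def __len__(self):
        # same count as the precomputed list: n - (seq_length + 1) + 1
        return len(self.data) - self.seq_length

    def __getitem__(self, idx):
        chunk = self.data[idx:idx + self.seq_length + 1]
        return chunk[:-1], chunk[1:]          # input, target shifted by one

toy = np.arange(10, dtype=np.int32)           # stand-in for text_encoded
ds = LazyTextDataset(toy, seq_length=3)
x, y = ds[0]
print(len(ds), x.tolist(), y.tolist())
```

Such a dataset can be passed to `DataLoader` in the same way as `seq_dataset`.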



In [8]:
batch_size = 64

torch.manual_seed(1)
seq_dl = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
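To see what the loader hands to the training loop, it helps to pull one batch and look at its shape. The sketch below uses random toy tensors in place of the real dataset; `toy_dl`, `inputs` and `targets` are illustrative names only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(1)
# toy stand-in: 100 sequences of length 40 over a 68-character vocabulary
inputs = torch.randint(0, 68, (100, 40))
targets = torch.randint(0, 68, (100, 40))
toy_dl = DataLoader(TensorDataset(inputs, targets),
                    batch_size=64, shuffle=True, drop_last=True)

seq_batch, target_batch = next(iter(toy_dl))
print(seq_batch.shape, target_batch.shape)   # each is (batch_size, seq_length)
print(len(toy_dl))                           # full batches only; the partial last batch is dropped
```

With `drop_last=True` every batch has exactly `batch_size` rows, which matches the fixed-size hidden state created by `init_hidden(batch_size)`.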


In [62]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # lookup table: maps each character index to a dense vector (no one-hot matrix multiplication needed)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, 
                           batch_first=True)
        self.linear = nn.Linear(rnn_hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)             # (batch,) -> (batch, 1, embed_dim): one time step per call
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.linear(out).reshape(out.size(0), -1)  # (batch, 1, vocab_size) -> (batch, vocab_size) logits
        return out, hidden, cell                         # logits plus the updated hidden and cell states

    def init_hidden(self, batch_size):                   # zero states of shape (num_layers=1, batch_size, hidden_size)
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size)
        return hidden.to(device), cell.to(device)        # fresh hidden and cell states on the training device

###------------------ version with GRU and Variable ----------------------------

# from torch.autograd import Variable
# class RNN(nn.Module):
#     def __init__(self, input_size, hidden_size, output_size, n_layers=1):
#         super(RNN, self).__init__()
#         self.input_size = input_size
#         self.hidden_size = hidden_size
#         self.output_size = output_size
#         self.n_layers = n_layers
        
#         self.encoder = nn.Embedding(input_size, hidden_size)
#         self.gru = nn.GRU(hidden_size, hidden_size, n_layers)
#         self.decoder = nn.Linear(hidden_size, output_size)
    
#     def forward(self, input, hidden):
#         input = self.encoder(input.view(1, -1))
#         output, hidden = self.gru(input.view(1, 1, -1), hidden)
#         output = self.decoder(output.view(1, -1))
#         return output, hidden

#     def init_hidden(self):
#         return Variable(torch.zeros(self.n_layers, 1, self.hidden_size))

### ----------------------------------------------------------------------------

vocab_size = len(char_array)
embed_dim = 256
rnn_hidden_size = 512

torch.manual_seed(1)
model = RNN(vocab_size, embed_dim, rnn_hidden_size) 
model = model.to(device)
model

RNN(
  (embedding): Embedding(68, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (linear): Linear(in_features=512, out_features=68, bias=True)
)
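The shapes that flow through one call of `forward` can be traced with standalone layers. This is a sketch with deliberately small, arbitrary sizes (not the 256/512 used above), so it runs quickly on any machine.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
vocab_size, embed_dim, hidden_size, batch = 68, 8, 16, 4   # small sizes for illustration

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
linear = nn.Linear(hidden_size, vocab_size)

x = torch.randint(0, vocab_size, (batch,))        # one character index per sequence
hidden = torch.zeros(1, batch, hidden_size)
cell = torch.zeros(1, batch, hidden_size)

emb = embedding(x).unsqueeze(1)                   # (batch, 1, embed_dim)
out, (hidden, cell) = lstm(emb, (hidden, cell))   # out: (batch, 1, hidden_size)
logits = linear(out).reshape(out.size(0), -1)     # (batch, vocab_size)

print(emb.shape, out.shape, logits.shape, hidden.shape)
```

Each row of `logits` holds one unnormalized score per character of the vocabulary, which is the form `nn.CrossEntropyLoss` expects.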

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

num_epochs = 10000                               # each "epoch" here is one batch update, not a full pass over the data

best = float('inf')

torch.manual_seed(1)

for epoch in range(num_epochs):
    hidden, cell = model.init_hidden(batch_size) # hidden and cell states are reset for every batch; the learned parameters persist
    seq_batch, target_batch = next(iter(seq_dl)) # iter() builds a freshly shuffled iterator each time, so this draws one random batch
    seq_batch = seq_batch.to(device)
    target_batch = target_batch.to(device)
    optimizer.zero_grad()
    loss = 0
    for c in range(seq_length):
        pred, hidden, cell = model(seq_batch[:, c], hidden, cell) 
        loss += loss_fn(pred, target_batch[:, c])
    loss.backward()
    optimizer.step()
    loss = loss.item()/seq_length                # average loss per character
    if epoch % 500 == 0:
        print(f'Epoch {epoch} loss: {loss:.4f}')
        if loss < best:                          # the checkpoint is only considered at these logging steps
            best = loss
            torch.save(model.state_dict(), './model.pth')
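The inner loop feeds one character per call and carries the state forward. An LSTM gives the same outputs when it receives the whole sequence in a single call, which is usually faster. The sketch below checks this with a standalone `nn.LSTM` and small, arbitrary sizes; it is not the notebook's `RNN.forward`, which would need to drop `unsqueeze(1)` to accept full sequences.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)                       # (batch, seq_length, features)
h0 = torch.zeros(1, 4, 16)
c0 = torch.zeros(1, 4, 16)

with torch.no_grad():
    # whole sequence in one call
    out_full, _ = lstm(x, (h0, c0))

    # one time step per call, carrying the state forward
    h, c = h0, c0
    steps = []
    for t in range(x.size(1)):
        out_t, (h, c) = lstm(x[:, t:t + 1, :], (h, c))
        steps.append(out_t)
    out_step = torch.cat(steps, dim=1)

print(torch.allclose(out_full, out_step, atol=1e-5))
```

The two results agree up to floating-point tolerance, so the choice between them is about speed and code structure rather than about what the model computes.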
 

In [78]:
model = RNN(vocab_size, embed_dim, rnn_hidden_size)
model.load_state_dict(torch.load('./model.pth', map_location=device))  # map_location lets a GPU-trained checkpoint load on a CPU-only machine
model.to(device)
model.eval()

RNN(
  (embedding): Embedding(68, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (linear): Linear(in_features=512, out_features=68, bias=True)
)

In [87]:
def evaluate(model, starting_str, 
           len_generated_text=500, 
           temperature=1.0):
    # starting_str is the seed text from which generation continues
    encoded_input = torch.tensor([char2int[s] for s in starting_str])
    encoded_input = torch.reshape(encoded_input, (1, -1))  # shape (1, len(starting_str)): a batch of one sequence

    generated_str = starting_str

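    model.eval()
    hidden, cell = model.init_hidden(1)
    hidden = hidden.to('cpu')                              # generation runs on the CPU; the model is moved there before the call
    cell = cell.to('cpu')
    for c in range(len(starting_str)-1):                   # warm up the state on the seed, all but its last character
        _, hidden, cell = model(encoded_input[:, c].view(1), hidden, cell) 
    
    last_char = encoded_input[:, -1]                       # the last seed character starts the generation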
    for i in range(len_generated_text):
        logits, hidden, cell = model(last_char.view(1), hidden, cell) 
        logits = torch.squeeze(logits, 0)

        # temperature divides the logits before the softmax (it is 1 over the inverse temperature beta);
        # lower values sharpen the distribution, higher values flatten it
        
        ## sampling with raw code; exp() of large logits can overflow, so a softmax would be safer here
        dist = logits.view(-1).div(temperature).exp()
        last_char = torch.multinomial(dist, 1)[0]        # torch.multinomial accepts unnormalized non-negative weights, so no explicit normalization is needed

        ## sampling with Categorical()
        # scaled_logits = logits / temperature          # divide, to match the version above
        # dist = Categorical(logits=scaled_logits)      # a new distribution is built for each next character
        # last_char = dist.sample()                     # one sample per generated character

        generated_str += str(char_array[last_char])
        
    return generated_str


torch.manual_seed(1)
model.to('cpu')
print(evaluate(model, starting_str='peace to world'))

peace to world (or as a brother, they will not take myself, advice it is necessary treachery as sun-like one was a hundred (her in his praise was a far a waiting tears!
 "The maiden sat shedder water, how have senseless the voiced with you. When night the rose-tintless Tariel, say: 'She forsook would meet it? Spoke now faint?
     P’hatman,' Thou went back steel is beloved about. Avt’handil's night, would grieve not Dame in a palanquishing built; you we shoulded here, and at no timble became to battle crepara
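Sampling from `exp(logits / temperature)` gives the same distribution as sampling from `softmax(logits / temperature)`, because `torch.multinomial` treats its input as weights. The softmax form avoids overflow for large logits. A minimal sketch with made-up logits for a four-character vocabulary:

```python
import torch

torch.manual_seed(1)
logits = torch.tensor([2.0, 1.0, 0.1, -1.0])     # made-up scores for a 4-character vocabulary
temperature = 0.5

weights = logits.div(temperature).exp()               # unnormalized, as in evaluate()
probs_from_weights = weights / weights.sum()
probs = torch.softmax(logits / temperature, dim=-1)   # normalized and overflow-safe

print(torch.allclose(probs, probs_from_weights))
next_char = torch.multinomial(probs, 1).item()        # index of the sampled character
print(next_char)
```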


In [88]:
## note how temperature controls randomness: lower values give more conservative, repetitive text

torch.manual_seed(1)
print(evaluate(model, starting_str='peace to world', 
             temperature=0.5))

peace to world (or a revered) pale--earnest them that look on the partridges; what is the man of the same as was a hundrest of the heart, and who are not well as the sun of the heart of adamant. The king sat down and sought the desire calmed by stone and crystal. P’hridon met me alives the seashore, I shall set as the same and panther or a heap; they play as if the sun in my story of my story?" said he. What I show my sadness at the healing of heaven of the seventh folk; they become a net of a mountain and re


In [91]:
torch.manual_seed(1)
print(evaluate(model, starting_str='peace to world', 
             temperature=0.25))

peace to world (or helpers) with the seashore. When they said: "What a tenth and said to me: 'What hath a man arose and song, and the moon should they desire the seashore of the seashore is this world and heard of his days. When they say: 'Selds the Kadjis are the sun (Tariel's) armies to see me. When they said to me: 'What a stand the maiden saw the plains; the maiden sat down and stood consumed to meet me. It is better than a moment he saw a madman, I said: 'Stand!' (gong caress (of the Seas) was a gift, th
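The effect seen in the three samples above can be reproduced on a few numbers. The sketch below uses made-up logits and a hypothetical helper `softmax_with_temperature` to show how the probability of the top-scoring character grows as the temperature drops.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Probabilities proportional to exp(logit / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                             # made-up scores
for t in (1.0, 0.5, 0.25):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature approaches zero the distribution concentrates on the single most likely character, which is why the low-temperature samples repeat phrases such as "the seashore".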
