<a href="https://colab.research.google.com/github/eyyupoglu/Quran-Bible-Shakira/blob/master/word_generation_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup the Environment

In [0]:
# http://pytorch.org/
from os.path import exists
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.4.1-{platform}-linux_x86_64.whl torchvision
import torch

In [2]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


# Imports and Constants

In [0]:
import argparse
import time
import math
import os
import torch
import torch.nn as nn
import torch.onnx
import numpy as np
import pickle

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [0]:
# HYPER PARAMETERS GO HERE
 
emsize = 400
batch_size = 40
bptt = 40
eval_batch_size = 10
vocabulary_size = 8000 # is this still needed?
nhid = 1150
nlayers = 3
dropout = 0.5
wdrop = 0.5
tied = False
clip = 0.25
log_interval = 100
epochs = 10
lr = 20


model_type = 'LSTM'

# Tools-Classes

## Tools

### DropConnect

In [0]:
import torch
from torch.nn import Parameter
from functools import wraps

class WeightDrop(torch.nn.Module):
    def __init__(self, module, weights, dropout=0, variational=False):
        super(WeightDrop, self).__init__()
        self.module = module
        self.weights = weights
        self.dropout = dropout
        self.variational = variational
        self._setup()

    def widget_demagnetizer_y2k_edition(*args, **kwargs):
        # We need to replace flatten_parameters with a nothing function
        # It must be a function rather than a lambda as otherwise pickling explodes
        # We can't write boring code though, so ... WIDGET DEMAGNETIZER Y2K EDITION!
        # (╯°□°）╯︵ ┻━┻
        return

    def _setup(self):
        # Terrible temporary solution to an issue regarding compacting weights re: CUDNN RNN
        if issubclass(type(self.module), torch.nn.RNNBase):
            self.module.flatten_parameters = self.widget_demagnetizer_y2k_edition

        for name_w in self.weights:
            print('Applying weight drop of {} to {}'.format(self.dropout, name_w))
            w = getattr(self.module, name_w)
            del self.module._parameters[name_w]
            self.module.register_parameter(name_w + '_raw', Parameter(w.data))

    def _setweights(self):
        for name_w in self.weights:
            raw_w = getattr(self.module, name_w + '_raw')
            w = None
            if self.variational:
                mask = torch.autograd.Variable(torch.ones(raw_w.size(0), 1))
                if raw_w.is_cuda: mask = mask.cuda()
                mask = torch.nn.functional.dropout(mask, p=self.dropout, training=True)
                w = mask.expand_as(raw_w) * raw_w
            else:
                w = torch.nn.functional.dropout(raw_w, p=self.dropout, training=self.training)
            setattr(self.module, name_w, w)

    def forward(self, *args):
        self._setweights()
        return self.module.forward(*args)
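
To make the idea behind the weight drop concrete, here is a minimal sketch for a single linear map rather than a wrapped recurrent module. `drop_connect_linear` is a hypothetical helper that is not part of this notebook; it only illustrates that dropout is applied to the weights instead of the activations, and that the stored weights stay untouched.

```python
import torch
import torch.nn.functional as F

def drop_connect_linear(x, weight, p, training=True):
    """Hypothetical helper: apply dropout to the weights, not the activations."""
    # F.dropout zeroes each weight with probability p and rescales the
    # survivors by 1 / (1 - p); the stored weight tensor is not modified.
    dropped = F.dropout(weight, p=p, training=training)
    return F.linear(x, dropped)

torch.manual_seed(0)
weight = torch.ones(4, 3)
x = torch.ones(2, 3)

train_out = drop_connect_linear(x, weight, p=0.5, training=True)
eval_out = drop_connect_linear(x, weight, p=0.5, training=False)
```

In evaluation mode the mask is disabled, so `eval_out` is the plain matrix product.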

### Our DropConnect

In [0]:
import torch.nn.functional as F

class Dropout(nn.Module):
    def __init__(self, p=0.5, inplace=False):
        super(Dropout, self).__init__()
        self.inplace = inplace
        self.p = p

    def forward(self, input):
        # nn.Module.__call__ dispatches to forward, so no __call__ override is needed
        return F.dropout(input, self.p, self.training, self.inplace)

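class DropConnect(nn.Module):
    def __init__(self, layer, prob=0.5, **kwargs):
        super(DropConnect, self).__init__()
        self.prob = prob
        self.layer = layer

    def forward(self, x):
        if self.training and 0. < self.prob < 1.:
            # Drop weights for this pass only. Re-wrapping the dropped weights in
            # nn.Parameter would overwrite the trained weights permanently.
            # Assumes `layer` is an nn.Linear.
            weight = F.dropout(self.layer.weight, self.prob, training=True)
            return F.linear(x, weight, self.layer.bias)
        return self.layer(x)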
      



### Dictionary Corpus

In [0]:
import os
import torch

class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]
      
    def lookup_word(self, idx):
        return self.idx2word[idx]

    def lookup_sequence(self, idx_list):
        result = []
        for i in idx_list:
          result.append(self.idx2word[i])
        return result
      
    @staticmethod
    def predict(some_string, test_model, corpus, hidden):
        """
        Predict the next word for `some_string`; returns (word, updated hidden state).

        Example usage:
        --------------
        >>> corpus = Corpus('./gdrive/My Drive/nlp/data/raw/penn-treebank')
        >>> file_path = './gdrive/My Drive/nlp/model_wisse1'
        >>> test_model = torch.load(file_path).to(device)
        >>> word, _ = Dictionary.predict('new york stock', test_model, corpus, test_model.init_hidden(1))
        >>> word
        'exchange'
        >>> word, _ = Dictionary.predict('las ', test_model, corpus, test_model.init_hidden(1))
        >>> word
        'vegas'
        """
        tokens = corpus.tokenize_string(some_string)
        tokens = tokens.unsqueeze(1)
        output, hidden = test_model.forward(tokens, hidden)
        # Exclude <eos>, <unk> and N from the prediction. The indices 24, 26, 27
        # are specific to the Penn Treebank vocabulary. Using -inf guarantees that
        # torch.max never selects them (a score of 0 could still be the maximum).
        output[:,:,24] = float('-inf')
        output[:,:,26] = float('-inf')
        output[:,:,27] = float('-inf')
        _, indices = torch.max(output, 2)
        indices = indices.data.cpu().numpy()
        word_list = corpus.dictionary.lookup_sequence(indices[:,0])
        return word_list[-1], hidden
    
    @staticmethod
    def conditional_predict(some_word, test_model, corpus, prd_length):
      """
      Generate `prd_length` words after `some_word` by calling `predict` repeatedly.

      Example usage:
      --------------
      >>> corpus = Corpus('./gdrive/My Drive/nlp/data/raw/penn-treebank')
      >>> file_path = './gdrive/My Drive/nlp/model_wisse1'
      >>> test_model = torch.load(file_path).to(device)
      >>> Dictionary.conditional_predict('francisco', test_model, corpus, 50)
      "earthquake and the first boston 's recent years ago the company in the company in the first boston 's recent months in the company in the first boston 's recent years ago the company in the company in the company in the company officials in the company officials in the"
      """
      hidden = test_model.init_hidden(1)
      generated_sq = [some_word]
      for i in range(prd_length):
        some_word, hidden = Dictionary.predict(" ".join(generated_sq), test_model, corpus, hidden)
        generated_sq.append(some_word)
      return " ".join(generated_sq)

      
    def __len__(self):
        return len(self.idx2word)


class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'quran_train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'quran_val.txt'))
        self.test = self.tokenize(os.path.join(path, 'quran_test.txt'))

    def tokenize_string(self, s):
        """Tokenizes a string into a tensor of word indices (all words must be in the dictionary)."""
        
        tokens = s.split()
        # Tokenize file content
        ids = torch.LongTensor(len(tokens))
        for i, word in enumerate(tokens):
            ids[i] = self.dictionary.word2idx[word]

        return ids
    
    def tokenize(self, path):
        """Tokenizes a text file."""
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:

            tokens = 0
            for line in f:
                words = line.split() + ['<eos>']
                tokens += len(words)
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            ids = torch.LongTensor(tokens)
            token = 0
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    ids[token] = self.dictionary.word2idx[word]
                    token += 1

        return ids
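
The two passes in `Corpus.tokenize` (first build the vocabulary, then map every word to its index) can be sketched on in-memory text. The sample lines below are made up for illustration, and plain lists stand in for the `LongTensor` used above.

```python
# Standalone sketch of the two-pass tokenization, run on in-memory lines
# instead of a file. The sample lines are hypothetical.
lines = ["in the name of god", "the most merciful"]

# Pass 1: assign each new word the next free index, as Dictionary.add_word does.
word2idx, idx2word = {}, []
for line in lines:
    for word in line.split() + ['<eos>']:
        if word not in word2idx:
            idx2word.append(word)
            word2idx[word] = len(idx2word) - 1

# Pass 2: replace every word (and the end-of-line marker) by its index.
ids = [word2idx[w] for line in lines for w in line.split() + ['<eos>']]

# Mapping back, as Dictionary.lookup_sequence does.
decoded = [idx2word[i] for i in ids]
```

Repeated words such as `the` and `<eos>` reuse the index they received on first sight, which is why the vocabulary is smaller than the token count.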

In [0]:
###############################################################################
# Load data
###############################################################################
corpus = Corpus('./gdrive/My Drive/nlp/data/raw/penn-treebank')

# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# ┌ a g m s ┐
# │ b h n t │
# │ c i o u │
# │ d j p v │
# │ e k q w │
# └ f l r x ┘.
# These columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient
# batch processing.

def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)

train_data = batchify(corpus.train, batch_size)
val_data = batchify(corpus.valid, eval_batch_size)
test_data = batchify(corpus.test, eval_batch_size)
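
The alphabet example from the comment above can be reproduced directly. This is a small sketch that repeats the reshaping of `batchify` without the move to `device`; `batchify_demo` is a name introduced only for this illustration.

```python
import torch

def batchify_demo(data, bsz):
    # Same reshaping as batchify above, without the move to `device`.
    nbatch = data.size(0) // bsz
    data = data.narrow(0, 0, nbatch * bsz)
    return data.view(bsz, -1).t().contiguous()

alphabet = torch.arange(26)           # a=0, b=1, ..., z=25
columns = batchify_demo(alphabet, 4)  # 26 // 4 = 6 rows; 'y' and 'z' are dropped

# Turn each row of indices back into letters for inspection.
letters = [''.join(chr(ord('a') + int(i)) for i in row) for row in columns]
```

Each column holds a contiguous stretch of the sequence, so reading down a column follows the original order.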

In [0]:
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
#     print('repackage hiddem', type(hidden))
    if isinstance(h, torch.Tensor):
      return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)


def get_batch(source, i):
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target


def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(corpus.dictionary)
    if model_type == 'LSTM':
      hidden = model.init_hidden(eval_batch_size)
    else:
      hidden, _ = model.init_hidden(eval_batch_size)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, bptt):
            data, targets = get_batch(data_source, i)
            output, hidden = model(data, hidden)
            output_flat = output.view(-1, ntokens)
            total_loss += len(data) * criterion(output_flat, targets).item()
            hidden = repackage_hidden(hidden)
    return total_loss / (len(data_source) - 1)
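
How `get_batch` pairs inputs with targets is easiest to see on a tiny tensor. The sketch below copies its logic into `get_batch_demo` (a hypothetical name) and takes `bptt` as an argument instead of reading the global.

```python
import torch

def get_batch_demo(source, i, bptt):
    # Same slicing as get_batch above, with bptt passed explicitly.
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].view(-1)
    return data, target

source = torch.arange(10).view(5, 2)  # 5 time steps, 2 columns

# First window: three rows of input, targets are the same rows shifted by one.
data, target = get_batch_demo(source, 0, 3)

# Last window: only one row is left, because the final row can only be a target.
last_data, last_target = get_batch_demo(source, 3, 3)
```

The targets are flattened so that they line up with `output.view(-1, ntokens)` in the loss computation.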


In [0]:
# Utility module to deal with lists of layers, see https://discuss.pytorch.org/t/list-of-nn-module-in-a-nn-module/219/2
class ListModule(nn.Module):
    def __init__(self, *args):
        super(ListModule, self).__init__()
        idx = 0
        for module in args:
            self.add_module(str(idx), module)
            idx += 1

    def __getitem__(self, idx):
        if idx < 0 or idx >= len(self._modules):
            raise IndexError('index {} is out of range'.format(idx))
        it = iter(self._modules.values())
        for i in range(idx):
            next(it)
        return next(it)

    def __iter__(self):
        return iter(self._modules.values())

    def __len__(self):
        return len(self._modules)

## Models

In [11]:

import itertools
import operator
from datetime import datetime
import sys
from torch import FloatTensor
from torch.autograd import Variable
from torch import nn

import torch.nn.functional as F

class RNN_mehmet(nn.Module):
    def __init__(self, word_dim, hidden_dim = 100, activation = 'sigmoid'):
        super(RNN_mehmet, self).__init__()

        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.weights_hh = self.init_weights((hidden_dim, hidden_dim))
        self.weights_xh = self.init_weights((hidden_dim, word_dim))
        self.weights_o = self.init_weights((word_dim, hidden_dim))
        self.activation = getattr(torch, activation)


  
    def init_weights(self, dim):
        return nn.Parameter(torch.FloatTensor(dim[0], dim[1]).uniform_(-np.sqrt(1./dim[0]), np.sqrt(1./dim[1])), requires_grad=True)
      
    
    def init_hidden(self, batch_size, dim):
      
        layer = torch.zeros((1, batch_size, self.hidden_dim),  requires_grad = True)
        if dim > 1:
           layer = (layer.clone(), layer.clone())
        return layer
      
    def step(self, lr):
        for p in self.parameters():
            p.data.add_(-lr, p.grad.data)
      
    def forward_step(self, xt, hidden_t_1):
        
        # calculate left and right terms
        left_term = F.linear(xt, self.weights_xh)
        right_term = F.linear(hidden_t_1, self.weights_hh)
        
        # sum terms
        sum_ = left_term + right_term
        
        # activation for hidden state
        hidden_t = self.activation(sum_)
        
        # calculate output
        output = F.linear(hidden_t, self.weights_o)
        return output, hidden_t

    def forward_propagation(self, x, hidden_t_1):
        # Get sequence length (bptt), batch_size from the input
        bptt, batch_size, _ = x.size()
        output = torch.zeros((bptt, batch_size, self.word_dim)).to(device)
        
        # loop over sequence
        for t in torch.arange(bptt): 
            xt = x[t,:,:]
            output[t], hidden_t_1 = self.forward_step(xt, hidden_t_1)
        return [output, hidden_t_1]
      
    def __call__(self, x, hidden_t_1):
        return self.forward_propagation(x, hidden_t_1)
      

fake_net = RNN_mehmet(emsize, hidden_dim = 130)

#forward_prop
# test forward pass
x = np.random.normal(0, 1, (bptt, batch_size, emsize)).astype('float32')
x = torch.Tensor(torch.from_numpy(x))

hidden = fake_net.init_hidden(batch_size, 1)

y = np.random.normal(0, 1, (bptt * batch_size, emsize)).astype('float32')
y = torch.Tensor(torch.from_numpy(y)).long().to(device)


#example backward and 10 steps

for i in range(10):
    fake_net.zero_grad()
    result = fake_net.forward_propagation(x, hidden)
    output =  result[0].view(-1, emsize)

    loss_fn = nn.MSELoss()
    loss = loss_fn(output, y.float())
    print(loss)
    loss.backward()
    fake_net.step(0.01)



tensor(2.0390, device='cuda:0', grad_fn=<MseLossBackward>)
tensor(2.0299, device='cuda:0', grad_fn=<MseLossBackward>)
tensor(2.0208, device='cuda:0', grad_fn=<MseLossBackward>)
tensor(2.0119, device='cuda:0', grad_fn=<MseLossBackward>)
tensor(2.0030, device='cuda:0', grad_fn=<MseLossBackward>)
tensor(1.9943, device='cuda:0', grad_fn=<MseLossBackward>)
tensor(1.9856, device='cuda:0', grad_fn=<MseLossBackward>)
tensor(1.9770, device='cuda:0', grad_fn=<MseLossBackward>)
tensor(1.9686, device='cuda:0', grad_fn=<MseLossBackward>)
tensor(1.9602, device='cuda:0', grad_fn=<MseLossBackward>)


In [0]:

import itertools
import operator
from datetime import datetime
import sys
from torch import FloatTensor
from torch.autograd import Variable
from torch import nn

import torch.nn.functional as F

class LSTMCustom(nn.Module):
    def __init__(self, word_dim, hidden_dim, nlayers = 1, activation = 'sigmoid'):
        super(LSTMCustom, self).__init__()

        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
   
        # weights for x
        self.weights_xi = self.init_weights((hidden_dim, word_dim))
        self.weights_xo = self.init_weights((hidden_dim, word_dim))
        self.weights_xf = self.init_weights((hidden_dim, word_dim))
        self.weights_xc = self.init_weights((hidden_dim, word_dim))

        # weights for hidden
        self.weights_hi = self.init_weights((hidden_dim, hidden_dim))
        self.weights_ho = self.init_weights((hidden_dim, hidden_dim))
        self.weights_hf = self.init_weights((hidden_dim, hidden_dim))
        self.weights_hc = self.init_weights((hidden_dim, hidden_dim))
        
        self.weights_h_out = self.init_weights((hidden_dim, hidden_dim))
        

        
  
    def init_weights(self, dim):
        return nn.Parameter(torch.FloatTensor(dim[0], dim[1]).uniform_(-np.sqrt(1./dim[0]), np.sqrt(1./dim[1])), requires_grad=True)
  
    def init_gates(self):
        return torch.zeros(self.hidden_dim,  requires_grad = True)
    
    def init_hidden(self, batch_size):
        layer = torch.zeros((1, batch_size, self.hidden_dim),  requires_grad = True)
        return (layer.clone(), layer.clone())
      
    def step(self, lr):
        for p in self.parameters():
            p.data.add_(-lr, p.grad.data)
      
    def forward_step(self, xt, hidden_t_1):
        
        ht_1, ct_1 = hidden_t_1
        gate_i = torch.sigmoid(
            F.linear(xt, self.weights_xi) + 
            F.linear(ht_1, self.weights_hi))
        
        gate_f = torch.sigmoid(
            F.linear(xt, self.weights_xf) + 
            F.linear(ht_1, self.weights_hf))
        
        gate_o = torch.sigmoid(
            F.linear(xt, self.weights_xo) + 
            F.linear(ht_1, self.weights_ho))
        
        new_c = torch.tanh(
            F.linear(xt, self.weights_xc) +
            F.linear(ht_1, self.weights_hc))
        
        ct = gate_f * ct_1 + gate_i * new_c
        
        ht = gate_o * torch.tanh(ct)
        
        output = F.linear(ht, self.weights_h_out)
        
        return output, (ht, ct)

    def forward(self, x, hidden_t_1):
        # Get sequence length (bptt), batch_size from the input
        bptt, batch_size, _ = x.size()
        output = torch.zeros((bptt, batch_size, self.hidden_dim)).to(device)
        
        # loop over sequence
        for t in torch.arange(bptt):
            xt = x[t,:,:]
            output[t], hidden_t_1 = self.forward_step(xt, hidden_t_1)
            
        return [output, hidden_t_1]
      
    def __call__(self, x, hidden_t_1):
        # LSTMCustom defines `forward`, not `forward_propagation`
        return self.forward(x, hidden_t_1)
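
For reference, the equations computed by `forward_step` in `LSTMCustom`, read off from the code above, are given below. There are no bias terms, and $W_{h,\mathrm{out}}$ denotes `weights_h_out`; the symbol names are chosen here for readability and do not appear in the notebook.

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1}) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1}) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1}) \\
\tilde{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t) \\
y_t &= W_{h,\mathrm{out}} h_t
\end{aligned}
$$

The layer returns $y_t$ for every time step together with the final pair $(h_t, c_t)$.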


In [13]:
# New RNN Layer

import torch.nn.functional as F

class RNNCustom(nn.Module):

  def __init__(self, ninput, nhid, activation = 'sigmoid'):
      # Register with nn.Module first so the weights below are tracked as parameters
      super(RNNCustom, self).__init__()
      self.ninput = ninput
      self.nhid = nhid
      self.weights_hh = self.init_weights((nhid, nhid))
      self.weights_xh = self.init_weights((nhid, ninput))
      self.activation = getattr(torch, activation)

  def init_weights(self, dimensions):
      # NOTE: all-zero weights give a constant output; kept from the original draft
      return nn.Parameter(torch.zeros(dimensions, device=device))

  def init_hidden(self):
      layer = torch.zeros((1, batch_size, self.nhid), device=device)
      return (layer, layer.clone())

  def step(self, xt, ht_1):
      # new hidden state: activation of the summed input and recurrent projections
      ht = self.activation(F.linear(xt, self.weights_xh) + F.linear(ht_1, self.weights_hh))
      return ht, ht

  def forward(self, x, hidden):
      # Get sequence length (bptt), batch_size from the input
      bptt, batch_size, _ = x.size()

      # collect one [1, batch_size, nhid] output per time step
      outputs = []

      # only the first entry of the (hidden, hidden) pair is used by a plain RNN
      ht_1, _ = hidden

      # loop over sequence
      for i in range(bptt):
        out, ht_1 = self.step(x[i,:,:], ht_1)
        outputs.append(out)

      # output: [bptt, batch_size, nhid]
      return torch.cat(outputs, 0), (ht_1, ht_1)


# test forward pass
x = np.random.normal(0, 1, (bptt, batch_size, emsize)).astype('float32')
x = torch.from_numpy(x)

# double hidden only necessary for LSTM
hidden_layer_tensor = torch.zeros((1, batch_size, nhid), device=device)
hidden = (hidden_layer_tensor, hidden_layer_tensor)

# reference: built-in RNN (tanh by default, randomly initialised, with biases),
# so the values differ from the custom layer; only the shapes are comparable
rnn_pt = nn.RNN(emsize, nhid)
output_pt, h = rnn_pt(x)
# print('RNN TORCH', output_pt)

rnn_cust = RNNCustom(ninput = emsize, nhid = nhid)
output, h = rnn_cust(x.to(device), hidden)
# print('RNN cust', output)


## RNNModel

In [0]:

class RNNModel(nn.Module):
    """Container module with an encoder, a recurrent module, and a decoder."""

    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5,wdrop = 0.6, tie_weights=False):
        super(RNNModel, self).__init__()
        self.drop = Dropout(dropout)#our dropout
#         self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        
        # quick fix to match up dimensions of internal LSTM layers
        # setting all but last LSTM layer to transform [ninp -> nhid], and last one [nhid -> nhid]
        LSTM_dimensions = [(ninp, nhid) if i == 0 else (nhid, nhid) for i in range(nlayers)]
        self.rnns = [LSTMCustom(*LSTM_dimensions[i]).to(device) for i in range(nlayers)]
                
        
        print('wdrop: ', wdrop)
        self.rnns = [WeightDrop(rnn, ['weights_hc'], dropout=wdrop) for rnn in self.rnns]

#         if rnn_type in ['LSTM', 'GRU']:
#             self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
#         else:
#             try:
#                 nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]
#             except KeyError:
#                 raise ValueError( """An invalid option for `--model` was supplied,
#                                  options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
#             self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)
        self.decoder = nn.Linear( nhid, ntoken )
#         self.dropconnect = DropConnect(self.decoder, dropout)
        # Optionally tie weights as in:
        # "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
        # https://arxiv.org/abs/1608.05859
        # and
        # "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al. 2016)
        # https://arxiv.org/abs/1611.01462
        if tie_weights:
            if nhid != ninp:
                raise ValueError('When using the tied flag, nhid must be equal to emsize')
            self.decoder.weight = self.encoder.weight

        self.init_weights()

        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers
        self.rnns = ListModule(*self.rnns)

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, x, hidden):
        # Send to device explicitly, because the loop over LSTM layers
        # in the Net breaks the normal .to(device) functionality
        x = x.to(device)
        hidden = (hidden[0].to(device), hidden[1].to(device))
        x = self.drop(self.encoder(x))
        for i, rnn in enumerate(self.rnns):
          x, hidden = rnn(x, hidden)
        x = self.drop(x)
        decoded = self.decoder(x.view(x.size(0)*x.size(1), x.size(2)))
#         decoded = self.dropconnect( x.view(x.size(0)*x.size(1), x.size(2)) )
        ext_output = decoded.view(x.size(0), x.size(1), decoded.size(1))
        return ext_output, hidden            


    def init_hidden(self, bsz):
        init_range = 1/np.sqrt(self.nhid)
        weight = next(self.parameters())
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros(1, bsz, self.nhid).uniform_(-init_range, init_range),
                    weight.new_zeros(1, bsz, self.nhid).uniform_(-init_range, init_range))
        else:
            return weight.new_zeros(self.nlayers, bsz, self.nhid).uniform_(-init_range, init_range)

          



# Training

In [15]:
###############################################################################
# Build the model
###############################################################################
ntokens = len(corpus.dictionary)
model = RNNModel('LSTM', ntokens, emsize, nhid, nlayers, dropout, wdrop, tied).to(device)
model.train()
criterion = nn.CrossEntropyLoss()
print(model)

wdrop:  0.5
Applying weight drop of 0.5 to weights_hc
Applying weight drop of 0.5 to weights_hc
Applying weight drop of 0.5 to weights_hc
RNNModel(
  (drop): Dropout()
  (encoder): Embedding(5595, 400)
  (decoder): Linear(in_features=1150, out_features=5595, bias=True)
  (rnns): ListModule(
    (0): WeightDrop(
      (module): LSTMCustom()
    )
    (1): WeightDrop(
      (module): LSTMCustom()
    )
    (2): WeightDrop(
      (module): LSTMCustom()
    )
  )
)


In [0]:
###############################################################################
# Training code
###############################################################################

def train():
    # Turn on training mode which enables dropout.
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(corpus.dictionary)
    if model_type == 'LSTM':
        hidden = model.init_hidden(batch_size)
    else:
        hidden, _ = model.init_hidden(batch_size)
    for batch, i in enumerate(range(0, train_data.size(0), bptt)):
        data, targets = get_batch(train_data, i)

        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        hidden = repackage_hidden(hidden)
        model.zero_grad()
        output, hidden = model(data, hidden)
        loss = criterion(output.view(-1, ntokens), targets)
        loss.backward()

        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        for p in model.parameters():
            try:
              p.data.add_(-lr, p.grad.data)
            except Exception as e:
              print(e)
              pass
        total_loss += loss.item()

        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            try:
              print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
                      'loss {:5.2f} | ppl {:8.2f}'.format(
                  epoch, batch, len(train_data) // bptt, lr,
                  elapsed * 1000 / log_interval, cur_loss, math.exp(cur_loss)))
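            except OverflowError:
              # math.exp overflows when the loss is too large to report a perplexity
              print('| epoch {:3d} | loss {:5.2f} | ppl overflow'.format(epoch, cur_loss))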
            total_loss = 0
            start_time = time.time()

# Loop over epochs.

best_val_loss = None
save_file = './gdrive/My Drive/nlp/model_ep{}_nlayer{}_em{}_nhid{}_bptt{}_tied{}_bs{}_embdrop_{}_clip{}_lr{}'.format(epochs, nlayers, emsize, nhid, bptt, tied, batch_size, dropout, clip, lr)
meta_file = save_file + '.meta'
meta_data = []

# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, epochs+1):
        epoch_start_time = time.time()
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
                'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                           val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:

            with open(save_file, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
            meta = {
                'epoch': epoch,
                'time': (time.time() - epoch_start_time),
                'val_loss': val_loss,
                'val_ppl': math.exp(val_loss),
                'lr': lr,
            }
            meta_data.append(meta)
            with open(meta_file, 'wb') as f:
                pickle.dump(meta_data, f)
            
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
except KeyboardInterrupt:
    valid_file = save_file + '_val_loss{:5.2f}_ppl_{:8.2f}'.format(best_val_loss, math.exp(best_val_loss))
    with open(valid_file, 'w') as f:
      f.write('-----')
    print('-' * 89)
    print('Exiting from training early')

valid_file = save_file + '_val_loss{:5.2f}_ppl_{:8.2f}'.format(best_val_loss, math.exp(best_val_loss))
with open(valid_file, 'w') as f:
  f.write('-----')
   


| epoch   1 |   100/  102 batches | lr 5.00 | ms/batch 782.67 | loss  4.42 | ppl    82.75
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 80.31s | valid loss  4.37 | valid ppl    78.76
-----------------------------------------------------------------------------------------


| epoch   2 |   100/  102 batches | lr 5.00 | ms/batch 780.79 | loss  4.34 | ppl    76.78
-----------------------------------------------------------------------------------------
| end of epoch   2 | time: 80.15s | valid loss  4.32 | valid ppl    74.87
-----------------------------------------------------------------------------------------
| epoch   3 |   100/  102 batches | lr 5.00 | ms/batch 786.49 | loss  4.27 | ppl    71.17
-----------------------------------------------------------------------------------------
| end of epoch   3 | time: 80.73s | valid loss  4.26 | valid ppl    71.08
-----------------------------------------------------------------------------------------
| epoch   4 |   100/  102 batches | lr 5.00 | ms/batch 788.71 | loss  4.19 | ppl    66.17
-----------------------------------------------------------------------------------------
| end of epoch   4 | time: 80.93s | valid loss  4.22 | valid ppl    67.80
----------------------------------------------------------

In [41]:

Dictionary.conditional_predict('there', model, corpus, 15)

'there polytheists enough the falsehood Solomon the smell Solomon the smell Solomon the smell Solomon the'

In [0]:
# Loading model

test_model = torch.load(save_file)
print(test_model)

# load meta
with open(save_file + '.meta', 'rb') as f:
  meta = pickle.load(f)
  print(meta)
  
# number of parameters (untested)
def get_n_params(model):
    pp=0
    for p in list(model.parameters()):
        nn=1
        for s in list(p.size()):
            nn = nn*s
        pp += nn
    return pp

RNNModel(
  (drop): Dropout(p=0.5)
  (encoder): Embedding(10000, 400)
  (rnns): ListModule(
    (0): LSTMCustom()
    (1): LSTMCustom()
    (2): LSTMCustom()
  )
  (decoder): Linear(in_features=1150, out_features=10000, bias=True)
)
[{'epoch': 1, 'time': 500.2183609008789, 'val_loss': 6.094107644355903, 'val_ppl': 443.2383424130543, 'lr': 20}, {'epoch': 2, 'time': 500.0337471961975, 'val_loss': 5.587101821576135, 'val_ppl': 266.960797446221, 'lr': 20}, {'epoch': 3, 'time': 500.3192808628082, 'val_loss': 5.343247472148831, 'val_ppl': 209.1909501949757, 'lr': 20}, {'epoch': 4, 'time': 500.0058431625366, 'val_loss': 5.2122298108117056, 'val_ppl': 183.5027788197245, 'lr': 20}, {'epoch': 5, 'time': 500.48462748527527, 'val_loss': 5.133896831577107, 'val_ppl': 169.67703424361176, 'lr': 20}, {'epoch': 6, 'time': 507.9676570892334, 'val_loss': 4.998721956802627, 'val_ppl': 148.22360183117735, 'lr': 20}, {'epoch': 7, 'time': 500.00479006767273, 'val_loss': 4.900065622168072, 'val_ppl': 134.2985

In [0]:
Dictionary.predict('take major steps N years ago', test_model = test_model, corpus=corpus)

'major steps N years ago <eos>'

# Others

## Dimensionality Analysis

I used the implementation above to trace the tensor dimensions at every stage. This may help you understand how to implement the RNN / LSTM layer.

I also tried looking at: https://www.quora.com/In-LSTM-how-do-you-figure-out-what-size-the-weights-are-supposed-to-be

Also, look here: https://github.com/pytorch/pytorch/blob/v0.3.0/torch/nn/_functions/rnn.py

### Some variables

* bptt = "backpropagation through time", but used here as the sequence length that we feed at once. It lies along dim=0, the 'seq_len' dimension of the LSTM in the original code. Set to 40 in the hyperparameters above.
* bsz = "batch size", the number of sequences processed in parallel, along dim=1. Set to 40 for training and 10 for evaluation above.
* ntokens = len(vocab), total number of different tokens in the data
* len(text) = total number of tokens of the whole text
* nhid = number of values in hidden layer
* emsize = embedding size
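
As a quick arithmetic check of how these quantities interact, the sketch below counts the rows left after `batchify` and the number of `bptt` windows that the loop in `evaluate` visits. The token count is a made-up value, not the size of the actual corpus.

```python
# Hypothetical corpus size, chosen only for illustration.
n_tokens_in_text = 10000
bsz, bptt = 40, 40

# Rows after batchify: the remainder that does not fill a full row is dropped.
rows = n_tokens_in_text // bsz

# Start offsets of the windows, as in `range(0, data_source.size(0) - 1, bptt)`.
starts = list(range(0, rows - 1, bptt))
windows = len(starts)

# The last window is shorter, because the final row can only serve as a target.
last_len = min(bptt, rows - 1 - starts[-1])
```

With these numbers every window except the last has the full length `bptt`.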

### Step by step

1. in **corpus** only the index of every word is kept, so every word goes from dimension *ntokens* to a scalar value:   
      dim token: [ntokens] -> [1]

2. after **batchify(training_data)**, the text is split into *bsz* columns of equal length (leftover tokens are dropped):  
      dim: [len(text)] -> [len(text) // bsz, bsz]

3. after **get_batch(data, i)**, one window of bptt rows is taken across all columns:   
    dim data: [len(text) // bsz, bsz] -> [bptt, bsz]  
    dim target: [len(text) // bsz, bsz] -> [bptt * bsz]
    
4. In the Net  
    1. Input dim: [bptt, bsz]
    2. hidden layer dim: (ht-1, ht): ([1, bsz, nhid], [1, bsz, nhid])
    3. embedding layer dim: [bptt, bsz, emsize]
    4. lstm layer dim: [bptt, bsz, emsize] -> [bptt, bsz, nhid], plus the updated hidden layer tuple from 2
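
The shapes in the steps above can be checked with a few lines of PyTorch. This is only a sketch: the sizes are small stand-ins, and the built-in `nn.LSTM` is used in place of `LSTMCustom`, on the assumption that both map `[bptt, bsz, emsize]` to `[bptt, bsz, nhid]`.

```python
import torch
import torch.nn as nn

# Small stand-in sizes (assumptions); the notebook uses bptt=40, bsz=40, emsize=400, nhid=1150.
bptt, bsz, ntokens, emsize, nhid = 5, 3, 50, 8, 16

tokens = torch.randint(0, ntokens, (bptt, bsz))    # input:     [bptt, bsz]
embedded = nn.Embedding(ntokens, emsize)(tokens)   # embedding: [bptt, bsz, emsize]

lstm = nn.LSTM(emsize, nhid)                       # built-in stand-in for LSTMCustom
hidden = (torch.zeros(1, bsz, nhid), torch.zeros(1, bsz, nhid))
output, (h, c) = lstm(embedded, hidden)            # lstm:      [bptt, bsz, nhid]

# The decoder sees one row per token position, as in RNNModel.forward.
decoded = nn.Linear(nhid, ntokens)(output.reshape(bptt * bsz, nhid))
```

The final reshape back to `[bptt, bsz, ntokens]` in `RNNModel.forward` only regroups these rows.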

## TODO

1. Implement LSTM cell (done)
2. Bias term
3. Think about initialization (done, used AWD paper settings)
4. Multilayer (done)
5. Bi directional 
6. Prediction module (done)
7. Optimization from AWD paper (both original and Custom LSTM)
  - DropConnect (for recurring connections)
  - Variational dropout (all other) 
  - NT-ASGD
  - Varying BPTT length (done)
  - Embedding dropout (done)
  - Weight tying (done)
  - L2 regularization 
8. Convolutional Net

  
  



 