Once we have built the log-linear model, we are not far from a feed-forward neural network (NN).

An NN has a more complicated structure than a log-linear model. Coming from n-grams, this is
a natural path for building up our understanding one block at a time.
We will use PyTorch (or TensorFlow 2.0 if you wish) to build it.

## Prepare data
This time we will handle unknown words with a uniform distribution.

In [2]:
import tarfile
import re

from math import log
from collections import Counter, defaultdict
from itertools import chain

import numpy as np
import pandas
import torch

from processdata import (
    get_vocabs, load_mapping_from_vocabs, generate_mapping_from_vocabs,
    tokenize_num, transform_data, writing_seq_idx,
    START_SYMBOL, STOP_SYMBOL,
)

In [3]:
VOCABS = get_vocabs("data/ngram/train.txt")
UNKNOWN_SEQUENCE_ID = len(VOCABS)

In [4]:
generate_mapping_from_vocabs('data/nn/vocab_mapping.txt', VOCABS)
mapping = load_mapping_from_vocabs('data/nn/vocab_mapping.txt', convert_to_int=False)
LEFT_PAD_SYMBOL = int(mapping['<s>'])
RIGHT_PAD_SYMBOL = int(mapping['</s>'])

for i in ["train", "valid", "test"]:
  writing_seq_idx(f"data/nn/{i}.txt", transform_data(mapping, f"data/ngram/{i}.txt"))

## Build a pipeline
We won't hand-optimize the computation, but we will try to vectorize it
(see SIMD for the underlying idea). Deep learning frameworks can also run the
same vectorized code on accelerators such as GPUs.
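
As a rough illustration of what vectorizing means here, the sketch below computes the same affine scores once with a Python loop over examples and once with a single matrix product. The names and shapes (`X`, `W`, `b`) are made up for the example and are not part of the model built later.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 examples, 3 features each (toy sizes)
W = rng.normal(size=(3, 5))   # weights mapping 3 features to 5 scores
b = rng.normal(size=5)

# One example at a time, as in a per-sequence computation.
looped = np.stack([x @ W + b for x in X])

# The whole batch in a single matrix product.
vectorized = X @ W + b

print(np.allclose(looped, vectorized))
```

Both versions give the same numbers; the second leaves the looping to optimized library code.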

Our previous computation happened once per sequence (two context words for a tri-gram),
and each optimization step is expensive. What if we group sequences into mini-batches?

Here we don't use PyTorch's data loader; we build the pieces ourselves so that every step is easy to follow.

In [5]:
def translate_sentence_to_indexes(
    sequence,
    n
):
  return chain((LEFT_PAD_SYMBOL,) * (n - 1), iter(sequence), (RIGHT_PAD_SYMBOL,) * (n - 1))

def dataloader(filename, batch_size, n):
  def gen_token(l):
    for token in tokenize_num(l):
      try:
        token = int(token)
      except ValueError:
        # anything that is not an integer index is treated as unknown
        token = UNKNOWN_SEQUENCE_ID
      yield token
  
  def gen(n):
    with open(filename, 'r') as f:
      for l in f:
        for s in translate_sentence_to_indexes(gen_token(l), n):
          yield s

  def ngram(sequence, n):
    history = []
    while n > 1:
      try:
        next_item = next(sequence)
      except StopIteration:
        # no more data, terminate the generator
        return
      history.append(next_item)
      n -= 1

    for item in sequence:
      history.append(item)
      yield np.array(history, dtype=np.int64)
      del history[0]

  return ngram(ngram(gen(n), n), batch_size)

In [6]:
next(dataloader('data/nn/train.txt', 10, 3)) # for trigram

array([[15567, 15567, 20957],
       [15567, 20957, 39763],
       [20957, 39763, 28909],
       [39763, 28909,  7625],
       [28909,  7625, 41481],
       [ 7625, 41481,    96],
       [41481,    96, 36480],
       [   96, 36480,  5623],
       [36480,  5623, 32051],
       [ 5623, 32051, 30453]])

In [7]:
next(dataloader('data/nn/train.txt', 10, 2)) # for bigram

array([[15567, 20957],
       [20957, 39763],
       [39763, 28909],
       [28909,  7625],
       [ 7625, 41481],
       [41481,    96],
       [   96, 36480],
       [36480,  5623],
       [ 5623, 32051],
       [32051, 30453]])
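
Note that `dataloader` reuses the sliding-window helper `ngram` for batching as well, so consecutive batches overlap: the second batch is the first one shifted by a single row. If disjoint mini-batches are preferred, one possible alternative is sketched below. The helper `batched` is hypothetical and not part of `processdata`; it uses `itertools.islice` to cut any iterator of n-grams into chunks.

```python
from itertools import islice

import numpy as np

def batched(ngrams, batch_size):
    """Yield disjoint batches stacked from an iterator of equal-length n-grams."""
    ngrams = iter(ngrams)
    while True:
        chunk = list(islice(ngrams, batch_size))
        if not chunk:
            return  # the iterator is exhausted
        yield np.array(chunk, dtype=np.int64)

# Toy n-grams standing in for the output of the inner ngram generator.
toy = (np.array([i, i + 1, i + 2]) for i in range(5))
batches = list(batched(toy, 2))
print([b.shape for b in batches])
```

With this sketch the last batch can be smaller than `batch_size` when the number of n-grams is not a multiple of it.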

## Build NN

Now we are in a good spot. Let's build an NN.

A tri-gram model can be expressed like this:

$$
\begin{aligned}
\boldsymbol{m} &= \operatorname{concat}\left(M_{\cdot, e_{t-2}}, M_{\cdot, e_{t-1}}\right) \\
\boldsymbol{h} &= \tanh\left(W_{mh} \boldsymbol{m} + \boldsymbol{b}_{h}\right) \\
\boldsymbol{s} &= W_{hs} \boldsymbol{h} + \boldsymbol{b}_{s} \\
\boldsymbol{p} &= \operatorname{softmax}(\boldsymbol{s})
\end{aligned}
$$
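
To make the shapes in these equations concrete, here is a minimal sketch of the same forward pass written with raw tensors. All sizes (`vocab`, `emb`, `hid`) and word indices are toy values chosen for illustration, and the parameters are random rather than trained. The table `M` is stored with one row per word, as PyTorch's embedding layer does, so picking a row here plays the role of picking a column of $M$ in the equations.

```python
import torch

torch.manual_seed(0)
vocab, emb, hid = 7, 4, 5            # toy sizes, assumed for the example

M = torch.randn(vocab, emb)          # embedding table, one row per word
W_mh = torch.randn(hid, 2 * emb)     # concatenated context -> hidden layer
b_h = torch.randn(hid)
W_hs = torch.randn(vocab, hid)       # hidden layer -> one score per word
b_s = torch.randn(vocab)

e_prev2, e_prev1 = 3, 1              # indices of e_{t-2} and e_{t-1}

m = torch.cat([M[e_prev2], M[e_prev1]])   # shape (2 * emb,)
h = torch.tanh(W_mh @ m + b_h)            # shape (hid,)
s = W_hs @ h + b_s                        # shape (vocab,)
p = torch.softmax(s, dim=0)               # distribution over the next word

print(p.shape, float(p.sum()))
```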

In [8]:
class Model(torch.nn.Module):
  def __init__(self):
    super().__init__()
    # one extra index so that UNKNOWN_SEQUENCE_ID (= len(VOCABS)) stays in range
    vocab_size = len(VOCABS) + 1
    self.embedding = torch.nn.Embedding(vocab_size, len(VOCABS))
    hidden_size = int(0.3*len(VOCABS)) # arbitrary choice
    self.first = torch.nn.Linear(2*len(VOCABS), hidden_size) # two context words for a tri-gram
    self.tanh = torch.nn.Tanh()
    self.hidden = torch.nn.Linear(hidden_size, vocab_size)
    self.softmax = torch.nn.LogSoftmax(dim=1) # normalize over the vocabulary axis
    self.loss_fn = torch.nn.NLLLoss() # expects log-probabilities

  def forward(self, x):
    # (batch, 2) indexes -> (batch, 2, emb) -> (batch, 2 * emb)
    x = self.embedding(x).reshape(x.shape[0], -1)
    x = self.tanh(self.first(x))
    x = self.hidden(x)
    p = self.softmax(x)
    return p

  def loss(self, p, target):
    return self.loss_fn(p, target)

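def train(dataloader, epoch):
  # the dataloader is a generator, so it can be consumed only once;
  # with epoch > 1 the later epochs see no data unless it is re-created
  data = iter(dataloader)
  model = Model()
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

  for i in range(epoch):
    loss_val = 0.0
    for batch in data:
      batch = torch.from_numpy(batch)
      target = batch[:,2] # last word of each tri-gram
      x = batch[:,:2] # the two preceding words

      optimizer.zero_grad()
      result = model(x) # calling the module runs forward()
      output = model.loss(result, target)
      loss_val += output.item()
      output.backward()
      optimizer.step()
    print(f"loss: {loss_val}") # sum of the per-batch mean losses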


In [None]:
train(dataloader('data/nn/train.txt', 100, 3), 1)
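
The model pairs `LogSoftmax` with `NLLLoss`. As a small sanity check, the sketch below shows on random toy data that this combination gives the same value as `torch.nn.CrossEntropyLoss` applied to the raw scores. The tensor sizes are arbitrary and unrelated to the real vocabulary.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(6, 10)              # 6 examples, 10 candidate words
target = torch.randint(0, 10, (6,))      # index of the true next word

# Two-step version, as used in the model above.
log_p = torch.nn.LogSoftmax(dim=1)(scores)
nll = torch.nn.NLLLoss()(log_p, target)

# One-step version on the raw scores.
ce = torch.nn.CrossEntropyLoss()(scores, target)

print(torch.allclose(nll, ce))
```

Either form works; keeping `LogSoftmax` inside the model means `forward` returns log-probabilities that can be inspected directly.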