In this assignment we will design a neural net language model. The model will learn to predict the next word given the previous three words. The network looks like this:

![Network](misc/proga2.png)

The starter code implements a basic framework for training neural nets with mini-batch gradient descent. Your job is to **write code to complete the implementation of forward and back propagation**. See the README file for a description of the dataset, starter code and how to run it.

The softmax output of unit $i$, given the inputs (logits) $z_j$ to the output layer, is
$$
    y_i = \frac{e^{z_i}}{\sum_{j}e^{z_j}}
$$

Differentiating the softmax gives
$$
    \frac{\partial y_i}{\partial z_i} = y_i (1 - y_i), \qquad \frac{\partial y_i}{\partial z_j} = -y_i y_j \quad (j \neq i)
$$
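As a concrete illustration of the softmax above, here is a minimal Octave sketch. The name `softmax_cols` and the test values are hypothetical and not taken from the starter code. Subtracting each column's maximum before exponentiating is a common numerical safeguard that leaves the result mathematically unchanged.

```octave
% Column-wise softmax: each column of z holds the logits of one case.
% Shifting by the column maximum avoids overflow in exp() without
% changing the result, because the shift cancels in the ratio.
softmax_cols = @(z) exp(z - max(z, [], 1)) ./ sum(exp(z - max(z, [], 1)), 1);

z = [1 1000; 2 1000; 3 1000];   % second column would overflow a naive exp()
y = softmax_cols(z);

disp(sum(y, 1));                % every column sums to 1
```

The second column has three equal logits, so each of its entries comes out as 1/3 even though `exp(1000)` alone would be `Inf`.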

We denote the **cross-entropy** cost function $C$ as:
$$
    C = -\sum_j t_j \log y_j
$$
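Here the targets $t_j$ sum to 1 (a one-hot encoding of the next word). Every output $y_j$ depends on $z_i$, so the chain rule sums over all outputs, and the partial derivative with respect to the input $z_i$ of output unit $i$ is
$$
    \frac{\partial C}{\partial z_i} = \sum_j \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i
$$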
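The result $y_i - t_i$ can be checked numerically. The sketch below compares it against a central finite-difference estimate of the cross-entropy gradient. The logits, the one-hot target and the step size `h` are arbitrary illustrative choices, and none of the names come from the starter code.

```octave
% Softmax and cross-entropy for a single case (column vectors).
softmax = @(z) exp(z - max(z)) / sum(exp(z - max(z)));
cost    = @(z, t) -sum(t .* log(softmax(z)));

z = [0.5; -1.2; 2.0; 0.3];   % arbitrary logits
t = [0; 0; 1; 0];            % one-hot target

analytic = softmax(z) - t;   % the closed form y - t

% Central differences: perturb one logit at a time.
numeric = zeros(size(z));
h = 1e-6;
for i = 1:numel(z)
  e = zeros(size(z));
  e(i) = h;
  numeric(i) = (cost(z + e, t) - cost(z - e, t)) / (2 * h);
end

max_diff = max(abs(analytic - numeric));
fprintf('max |analytic - numeric| = %g\n', max_diff);
```

A check of this kind is a useful habit when writing back propagation by hand: a large discrepancy usually points to a mistake in the derivative rather than in the forward pass.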

In [1]:
load programming_assignment_2/data.mat
fieldnames(data)

ans = 
{
  [1,1] = testData
  [2,1] = trainData
  [3,1] = validData
  [4,1] = vocab
}


`data.vocab` contains the vocabulary of 250 words. The training, validation and
test sets are in `data.trainData`, `data.validData` and `data.testData`, respectively.
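The vocabulary size also fixes a handy reference point for the cross-entropy cost defined earlier: a model that assigns equal probability to all 250 words scores $\log 250$. A small sketch of that arithmetic follows; the target index is arbitrary and the variable names are illustrative.

```octave
vocab_size = 250;                       % size of data.vocab, as stated above
y = ones(vocab_size, 1) / vocab_size;   % uniform prediction over the vocabulary
t = zeros(vocab_size, 1);
t(42) = 1;                              % one-hot target; index 42 is arbitrary

ce = -sum(t .* log(y));                 % cross-entropy of the uniform guess
fprintf('uniform-guess CE = %.3f\n', ce);   % prints 5.521
```

A training log that stays near this value suggests the network is doing no better than guessing uniformly.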

In [2]:
data.vocab(1:10)

ans = 
{
  [1,1] = all
  [1,2] = set
  [1,3] = just
  [1,4] = show
  [1,5] = being
  [1,6] = money
  [1,7] = over
  [1,8] = both
  [1,9] = years
  [1,10] = four
}


In [3]:
data.testData = data.testData';
data.validData = data.validData';
data.trainData = data.trainData';
size(data.trainData)

ans =

   372550        4



In [4]:
data.trainData(1:10, 1:4)

ans =

   28   26   90  144
  184   44  249  117
  183   32   76  122
  117  247  201  186
  223  190  249    6
   42   74   26   32
  242   32  223   32
  223   32  158  144
   74   32  221   32
   42  192   91   68



In [3]:
% Look up the four words of the second training case (row 2 of data.trainData)
data.vocab([184, 44, 249, 117])

ans = 
{
  [1,1] = were
  [1,2] = not
  [1,3] = the
  [1,4] = first
}


After the transpose above, `data.trainData` is a 372,550 × 4 matrix: there are
372,550 training cases with 4 words each. Each entry is an integer index into
the vocabulary, so each row represents a sequence of 4 words. `data.validData`
and `data.testData` have the same layout and contain 46,568 4-grams each.
**All three need to be separated into inputs and targets, and the training set
needs to be split into mini-batches**. The file `load_data.m` provides code for doing that.
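To make the split concrete, here is a minimal sketch on the ten training rows printed earlier. It is not the code in `load_data.m`, whose internals are not shown here; the variable names, the batch size of 5 and the use of cell arrays are assumptions made purely for illustration.

```octave
% First ten rows of data.trainData as printed earlier (one 4-gram per row).
grams = [ 28  26  90 144;
         184  44 249 117;
         183  32  76 122;
         117 247 201 186;
         223 190 249   6;
          42  74  26  32;
         242  32 223  32;
         223  32 158 144;
          74  32 221  32;
          42 192  91  68];

batch_size  = 5;                                   % illustrative value only
num_batches = floor(size(grams, 1) / batch_size);
grams = grams(1:num_batches * batch_size, :);      % drop an incomplete last batch

inputs  = grams(:, 1:3);    % the three context words
targets = grams(:, 4);      % the word to predict

% One cell per mini-batch, each holding batch_size consecutive rows.
rows    = repmat(batch_size, 1, num_batches);
batch_x = mat2cell(inputs,  rows, 3);
batch_t = mat2cell(targets, rows, 1);

disp(batch_x{2});           % inputs of the second mini-batch
disp(batch_t{2}');          % matching targets, shown as a row
```

Keeping inputs and targets in parallel containers means that batch `k` of one always lines up with batch `k` of the other.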

In [2]:
addpath("programming_assignment_2/")

In [4]:
[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);

    load_data at line 16 column 1


In [5]:
%Output with the initial code
model = train(1);

    load_data at line 16 column 1
    train at line 32 column 30
Epoch 1
Batch 100 Train CE 5.521
Batch 200 Train CE 5.521
Batch 300 Train CE 5.521
Batch 400 Train CE 5.521
Batch 500 Train CE 5.521
Batch 600 Train CE 5.521
Batch 700 Train CE 5.521
Batch 800 Train CE 5.521
Batch 900 Train CE 5.521
Batch 1000 Train CE 5.521
Running validation ... Validation CE 5.521
Batch 1100 Train CE 5.521
Batch 1200 Train CE 5.521
Batch 1300 Train CE 5.521
Batch 1400 Train CE 5.521
Batch 1500 Train CE 5.521
Batch 1600 Train CE 5.521
Batch 1700 Train CE 5.521
Batch 1800 Train CE 5.521
Batch 1900 Train CE 5.521
Batch 2000 Train CE 5.521
Running validation ... Validation CE 5.521
Batch 2100 Train CE 5.521
Batch 2200 Train CE 5.521
Batch 2300 Train CE 5.521
Batch 2400 Train CE 5.521
Batch 2500 Train CE 5.521
Batch 2600 Train CE 5.521
Batch 2700 Train CE 5.521
Batch 2800 Train CE 5.521
Batch 2900 Train CE 5.521
Batch 3000 Train CE 5.521
Running validation ... Validation CE 5.521
Batch 3100 Train CE 5.521
B