# Assignment 1:  Sentiment with Deep Neural Networks

Welcome to the first assignment of course 3. In this assignment, you will explore sentiment analysis using deep neural networks. 

In course 1, you implemented logistic regression and Naive Bayes for sentiment analysis. However, if you were to give your old models an example like:

<center> <span style='color:blue'> <b>This movie was almost good.</b> </span> </center>

Your model would have predicted a positive sentiment for that review. However, that sentence has a negative sentiment and indicates that the movie was not good. To solve those kinds of misclassifications, you will write a program that uses deep neural networks to identify sentiment in text. By completing this assignment, you will: 

- Understand how you can build/design a model using layers
- Train a model using a training loop
- Use a binary cross-entropy loss function
- Compute the accuracy of your model
- Predict using your own input

As you can tell, this model follows a similar structure to the one you previously implemented in the second course of this specialization. 
- Indeed most of the deep nets you will be implementing will have a similar structure. The only thing that changes is the model architecture, the inputs, and the outputs. Before starting the assignment, we will introduce you to the Google library `trax` that we use for building and training models.


Now we will show you how to compute the gradient of a certain function `f` by just using `trax.fastmath.grad(f)`. 

- Trax source code can be found on Github: [Trax](https://github.com/google/trax)
- The Trax code also uses the JAX library: [JAX](https://jax.readthedocs.io/en/latest/index.html)


# Part 1:  Import libraries and try out Trax

- Let's import libraries and look at an example of using the Trax library.

In [1]:
import os
import random as rnd

import numpy as np
import trax
from   trax import layers as tl
from   trax import fastmath
from   trax.fastmath import grad
import trax.fastmath.numpy as fnp
from   trax.supervised import training

from utils import get_all_tweets, Layer, load_tweets, process_tweet

#trax.supervised.trainer_lib.init_random_number_generators(31)

In [2]:
a = fnp.array(5.)
display(a)
type(a)



DeviceArray(5., dtype=float32)

jax.interpreters.xla.DeviceArray

In [3]:
def f(x):
    return x**2

In [4]:
print(f'f(a) for a={a}: {f(a)}')

f(a) for a=5.0: 25.0


The gradient (derivative) of function `f` with respect to its input `x` is the derivative of $x^2$.
- The derivative of $x^2$ is $2x$.  
- When x is 5, then $2x=10$.

You can calculate the gradient of a function by using `trax.fastmath.grad(fun=)` and passing in the name of the function.
- In this case the function you want to take the gradient of is `f`.
- The object returned (saved in `grad_f` in this example) is a function that can calculate the gradient of f for a given trax.fastmath.numpy array.

In [5]:
grad_f = grad(fun=f)
type(grad_f)

function

In [6]:
grad_calculation = grad_f(a) # f = x^2 -> d/dx = 2x
display(grad_calculation)

DeviceArray(10., dtype=float32)
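
As a sanity check on the value above, the derivative can also be approximated numerically without Trax. The sketch below is only an illustration: `central_difference` and its step size `h` are hypothetical names introduced here and are not part of the assignment or of the Trax API.

```python
def f(x):
    return x ** 2


def central_difference(func, x, h=1e-4):
    """Approximate d(func)/dx at x with a symmetric finite difference."""
    return (func(x + h) - func(x - h)) / (2 * h)


# For f(x) = x^2 the analytical derivative is 2x, so we expect a value near 10.
approx = central_difference(f, 5.0)
print(approx)
```

The approximation should agree with `grad_f(a)` up to floating-point error, which is a quick way to gain confidence that a gradient function behaves as expected.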

# Part 2:  Importing the data

## 2.1  Loading in the data

Import the data set.  
- You may recognize this from earlier assignments in the specialization.
- Details of the `process_tweet` function are available in the `utils.py` file.

In [7]:
DATA = '../../../../data/twitter_samples'

In [8]:
all_positive_tweets = get_all_tweets(DATA, 'positive')
all_negative_tweets = get_all_tweets(DATA, 'negative')

In [9]:
len(all_positive_tweets), len(all_negative_tweets)

(5000, 5000)

In [10]:
N_TRAIN = 4000
train_pos = all_positive_tweets[:N_TRAIN]
val_pos = all_positive_tweets[N_TRAIN:]
train_neg = all_negative_tweets[:N_TRAIN]
val_neg = all_negative_tweets[N_TRAIN:]

train_x = train_pos + train_neg
val_x = val_pos + val_neg
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
val_y = np.append(np.ones(len(val_pos)), np.zeros(len(val_neg)))

len(train_x), len(val_x)

(8000, 2000)

In [11]:
print('original:', train_pos[0])
print('processed:', process_tweet(train_pos[0]))

original: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
processed: ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


## 2.2  Building the vocabulary

Now build the vocabulary.
- Map each word in each tweet to an integer (an "index"). 
- The following code does this for you, but please read it and understand what it's doing.
- Note that you will build the vocabulary based on the training data. 
- To do so, you will assign an index to every word by iterating over your training set.

The vocabulary will also include some special tokens
- `__PAD__`: padding
- `__</e>__`: end of line
- `__UNK__`: a token representing any word that is not in the vocabulary.

In [12]:
vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2}

i = 3
for tweet in train_x:
    processed = process_tweet(tweet)
    for word in processed:
        if word not in vocab:
            vocab[word] = i
            i += 1
            
len(vocab)            

9111

In [13]:
#display(vocab)

## 2.3  Converting a tweet to a tensor

Write a function that will convert each tweet to a tensor (a list of unique integer IDs representing the processed tweet).
- Note, the returned data type will be a **regular Python `list()`**
    - You won't use TensorFlow in this function
    - You also won't use a numpy array
    - You also won't use trax.fastmath.numpy array
- For words in the tweet that are not in the vocabulary, set them to the unique ID for the token `__UNK__`.

##### Example
Input a tweet:
```CPP
'@happypuppy, is Maria happy?'
```

The `tweet_to_tensor` function will first convert the tweet into a list of tokens (including only relevant words)
```CPP
['maria', 'happi']
```

Then it will convert each word into its unique integer

```CPP
[2, 56]
```
- Notice that the word "maria" is not in the vocabulary, so it is assigned the unique integer associated with the `__UNK__` token, because it is considered "unknown."
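
The lookup with an unknown-token fallback can be sketched on its own, independent of `process_tweet`. In this minimal illustration, `toy_vocab` and `tokens_to_ids` are hypothetical names, and the ID 56 for `'happi'` is simply taken from the example above.

```python
# Toy vocabulary: the three special tokens plus one known word (assumed IDs).
toy_vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2, 'happi': 56}


def tokens_to_ids(tokens, vocab_dict, unk_token='__UNK__'):
    """Map each token to its ID, falling back to the ID of unk_token."""
    unk_id = vocab_dict[unk_token]  # looked up, not hard coded
    return [vocab_dict.get(token, unk_id) for token in tokens]


ids = tokens_to_ids(['maria', 'happi'], toy_vocab)
print(ids)  # [2, 56]
```

Using `dict.get` with a default keeps the fallback in one place, so the function still works if the special tokens are assigned different integers.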

### Exercise 01
**Instructions:** Write a function `tweet_to_tensor` that takes in a tweet and converts it to a list of integers. You can use the `vocab` dictionary you just built to help create the tensor. 

- Use the vocab_dict parameter and not a global variable.
- Do not hard code the integer value for the `__UNK__` token.

In [14]:
def tweet_to_tensor(tweet, vocab_dict, unk_token='__UNK__', verbose=False):
    '''
    Input: 
        tweet - A string containing a tweet
        vocab_dict - The words dictionary
        unk_token - The special string for unknown tokens
        verbose - Print info during runtime
    Output:
        tensor_l - A python list of integer IDs, one per token of the
                   processed tweet
    '''  
    word_l = process_tweet(tweet)
    if verbose:
        print('List of words from the processed tweet:')
        print(word_l)
    tensor_l = []
    unk_ID = vocab_dict[unk_token]
    if verbose:
        print(f'The unique integer ID for the unk_token is {unk_ID}')
    tensor_l = [vocab_dict.get(word, unk_ID) for word in word_l]    
    return tensor_l

In [15]:
print('Actual tweet is\n', val_pos[0])
print(
    '\nTensor of tweet:\n', tweet_to_tensor(val_pos[0], vocab_dict=vocab))

Actual tweet is
 Bro:U wan cut hair anot,ur hair long Liao bo
Me:since ord liao,take it easy lor treat as save $ leave it longer :)
Bro:LOL Sibei xialan

Tensor of tweet:
 [1076, 138, 486, 2365, 755, 8170, 1134, 755, 54, 2, 2688, 801, 2, 2, 355, 609, 2, 3507, 1028, 605, 4579, 9, 1076, 159, 2, 2]


## 2.4  Creating a batch generator

Most of the time in Natural Language Processing, and AI in general, we train on batches of examples. 
- If, instead of training with batches of examples, you were to train a model with one example at a time, it would take a very long time to train the model. 
- You will now build a data generator that takes in the positive/negative tweets and returns a batch of training examples. It returns the model inputs, the targets (positive or negative labels) and the weight for each target (for example, this allows us to treat some examples as more important to get right than others, but commonly these will all be 1.0). 

Once you create the generator, you could include it in a for loop

```CPP
for batch_inputs, batch_targets, batch_example_weights in data_generator:
    ...
```

You can also get a single batch like this:

```CPP
batch_inputs, batch_targets, batch_example_weights = next(data_generator)
```
The generator returns the next batch each time it's called. 
- This generator returns the data in a format (tensors) that you could directly use in your model.
- It returns a triple: the inputs, targets, and loss weights:
    - Inputs is a tensor that contains the batch of tweets we put into the model.
    - Targets is the corresponding batch of labels that we train to generate.
    - Loss weights here are just 1s with the same shape as targets. Next week, you will use them to mask input padding.
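
The padding step that turns a batch of variable-length ID lists into one rectangular tensor can be sketched in isolation. This is only an illustration under simple assumptions: `pad_batch` is a hypothetical helper, the ID lists are made up, and 0 is used as the padding ID to match `__PAD__` in the vocabulary.

```python
import numpy as np


def pad_batch(batch, pad_id=0):
    """Right-pad every ID list in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return np.array([seq + [pad_id] * (max_len - len(seq)) for seq in batch])


# Two "positive" tweets followed by two "negative" tweets (made-up IDs).
batch = [[3, 4, 5], [10, 11], [7], [8, 9]]
inputs = pad_batch(batch)
targets = np.array([1, 1, 0, 0])
example_weights = np.ones_like(targets)

print(inputs.shape)      # (4, 3)
print(example_weights)   # [1 1 1 1]
```

Padding to the longest sequence in the current batch, rather than in the whole data set, keeps each batch as small as possible.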

### Exercise 02
Implement `data_generator`.

In [16]:
def data_generator(
        data_pos, data_neg, batch_size, loop, vocab_dict, shuffle=False):
    '''
    Input: 
      data_pos - Set of positive examples
      data_neg - Set of negative examples
      batch_size - number of samples per batch. Must be even
      loop - True or False
      vocab_dict - The words dictionary
      shuffle - Shuffle the data order
    Yield:
      inputs - Subset of positive and negative examples
      targets - The corresponding labels for the subset
      example_weights - An array specifying the importance of each example  
    '''     
    assert batch_size % 2 == 0
    n_to_take = batch_size // 2
    pos_index = 0
    neg_index = 0
    len_data_pos = len(data_pos)
    len_data_neg = len(data_neg)
    pos_index_lines = list(range(len_data_pos))
    neg_index_lines = list(range(len_data_neg))
    if shuffle:
        rnd.shuffle(pos_index_lines)
        rnd.shuffle(neg_index_lines)        
    stop = False
    while not stop:  
        batch = []
        for i in range(n_to_take):
            if pos_index >= len_data_pos: 
                if not loop:
                    stop = True
                    break              
                pos_index = 0
                if shuffle:
                    rnd.shuffle(pos_index_lines)
            tweet = data_pos[pos_index_lines[pos_index]]
            tensor = tweet_to_tensor(tweet, vocab_dict)
            batch.append(tensor)
            pos_index += 1
        for i in range(n_to_take):            
            if neg_index >= len_data_neg:
                if not loop:
                    stop = True
                    break
                neg_index = 0
                if shuffle:
                    rnd.shuffle(neg_index_lines)
            tweet = data_neg[neg_index_lines[neg_index]]
            tensor = tweet_to_tensor(tweet, vocab_dict)
            batch.append(tensor)
            neg_index += 1
        if stop:
            break
        # pos_index and neg_index were already advanced inside the loops
        # above; incrementing them again here would skip examples.
        max_len = max([len(t) for t in batch]) 
        tensor_pad_l = []
        for tensor in batch:
            n_pad = max_len - len(tensor)
            pad_l = [0] * n_pad
            tensor_pad = tensor + pad_l
            tensor_pad_l.append(tensor_pad)
        inputs = np.array(tensor_pad_l)
        targets = np.array([1] * n_to_take + [0] * n_to_take)
        example_weights = np.ones_like(targets)
        yield inputs, targets, example_weights

In [17]:
# Set the random number generator for the shuffle procedure
rnd.seed(30) 

# Create the training data generator
def train_generator(batch_size, shuffle=False):
    return data_generator(
        train_pos, train_neg, batch_size, True, vocab, shuffle)

# Create the validation data generator
def val_generator(batch_size, shuffle = False):
    return data_generator(
        val_pos, val_neg, batch_size, True, vocab, shuffle)

# Create the test data generator (a single pass over the validation data)
def test_generator(batch_size, shuffle = False):
    return data_generator(
        val_pos, val_neg, batch_size, False, vocab, shuffle)

# Get a batch from the train_generator and inspect.
inputs, targets, example_weights = next(train_generator(4, shuffle=True))

# this will print a list of 4 tensors padded with zeros
print(f'Inputs: {inputs}')
print(f'Targets: {targets}')
print(f'Example Weights: {example_weights}')

Inputs: [[ 349 2019 4471 3218    9    0    0    0    0    0    0]
 [4974  575 2014 1467 5194 3517  143 3517  132  466    9]
 [3780  111  138  591 2946 3989    0    0    0    0    0]
 [ 253 3780    0    0    0    0    0    0    0    0    0]]
Targets: [1 1 0 0]
Example Weights: [1 1 1 1]


In [18]:
# Test the train_generator
# Create a data generator for training data,
# which produces batches of size 4 (for tensors and their respective 
# targets)
tmp_data_gen = train_generator(batch_size = 4)

# Call the data generator to get one batch and its targets
tmp_inputs, tmp_targets, tmp_example_weights = next(tmp_data_gen)

print(f'The inputs shape is {tmp_inputs.shape}')
print(f'The targets shape is {tmp_targets.shape}')
print(f'The example weights shape is {tmp_example_weights.shape}')

for i,t in enumerate(tmp_inputs):
    print(f'input tensor: {t}; target {tmp_targets[i]}; '
          f'example weights {tmp_example_weights[i]}')

The inputs shape is (4, 14)
The targets shape is (4,)
The example weights shape is (4,)
input tensor: [3 4 5 6 7 8 9 0 0 0 0 0 0 0]; target 1; example weights 1
input tensor: [10 11 12 13 14 15 16 17 18 19 20  9 21 22]; target 1; example weights 1
input tensor: [5758 2917 3780    0    0    0    0    0    0    0    0    0    0    0]; target 0; example weights 1
input tensor: [ 868  259 3670 5759  311 4478  575 1241 2783  333 1213 3780    0    0]; target 0; example weights 1


Now that you have your train/val generators, you can just call them and they will return batches of tensors that correspond to your tweets, together with their labels and example weights. Now you can go ahead and start building your neural network. 

# Part 3:  Defining classes

In this part, you will write your own library of layers. It will be very similar
to the one used in Trax and also in Keras and PyTorch. Writing your own small
framework will help you understand how they all work and use them effectively
in the future.

Your framework will be based on the following `Layer` class from utils.py.

```CPP
class Layer(object):
    """ Base class for layers.
    """
      
    # Constructor
    def __init__(self):
        # set weights to None
        self.weights = None

    # The forward propagation should be implemented
    # by subclasses of this Layer class
    def forward(self, x):
        raise NotImplementedError

    # This function initializes the weights
    # based on the input signature and random key,
    # should be implemented by subclasses of this Layer class
    def init_weights_and_state(self, input_signature, random_key):
        pass

    # This initializes and returns the weights, do not override.
    def init(self, input_signature, random_key):
        self.init_weights_and_state(input_signature, random_key)
        return self.weights
 
    # __call__ allows an object of this class
    # to be called like it's a function.
    def __call__(self, x):
        # When this layer object is called, 
        # it calls its forward propagation function
        return self.forward(x)
```

## 3.1  ReLU class
You will now implement the ReLU activation function in a class below.
$$ \mathrm{ReLU}(x) = \mathrm{max}(0,x) $$

### Exercise 03
**Instructions:** Implement the ReLU activation function below. Your function should take in a matrix or vector and it should transform all the negative numbers into 0 while keeping all the positive numbers intact. 

In [19]:
class Relu(Layer):
    """Relu activation function implementation"""
    def forward(self, x):
        '''
        Input: 
            - x (a numpy array): the input
        Output:
            - activation (numpy array): all positive or 0 version of x
        '''
        activation = x.copy()
        activation[activation < 0] = 0
        return activation

In [20]:
# Test your relu function
x = np.array([[-2.0, -1.0, 0.0], [0.0, 1.0, 2.0]], dtype=float)
relu_layer = Relu()
print("Test data is:")
print(x)
print("Output of Relu is:")
print(relu_layer(x))

Test data is:
[[-2. -1.  0.]
 [ 0.  1.  2.]]
Output of Relu is:
[[0. 0. 0.]
 [0. 1. 2.]]
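
For comparison, the same activation can be written in one vectorized call. This is an alternative sketch rather than the expected solution: `relu` is a hypothetical standalone function that relies on `numpy.maximum`, which compares each element with 0 and returns a new array.

```python
import numpy as np


def relu(x):
    """Element-wise max(0, x); the input array is left unmodified."""
    return np.maximum(x, 0)


x = np.array([[-2.0, -1.0, 0.0], [0.0, 1.0, 2.0]])
out = relu(x)
print(out)
```

Because `np.maximum` allocates its own result, there is no need for an explicit `copy()` before zeroing the negative entries.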


## 3.2  Dense class 

### Exercise 04

Implement the forward function of the Dense class. 
- The forward function multiplies the input to the layer (`x`) by the weight matrix (`W`)

$$\mathrm{forward}(\mathbf{x},\mathbf{W}) = \mathbf{xW} $$

- You can use `numpy.dot` to perform the matrix multiplication.

Note that for more efficient code execution, you will use the trax version of math, `fastmath`, which includes a trax version of `numpy` and also `random`.

Implement the weight initializer function `init_weights_and_state`
- Weights are initialized with a random key.
- The second parameter is a tuple for the desired shape of the weights (num_rows, num_cols)
- The num of rows for weights should equal the number of columns in x, because for forward propagation, you will multiply x times weights.

Please use `trax.fastmath.random.normal(key, shape)` to generate random values for the weight matrix. The key difference between this function and the standard `numpy` randomness is the explicit use of random keys, which need to be passed. While it can look tedious at first sight to pass the random key everywhere, you will learn in Course 4 why this is very helpful when implementing some advanced models.
- `key` can be generated by calling `random.get_prng(seed=)` and passing in a number for the `seed`.
- `shape` is a tuple with the desired shape of the weight matrix.
    - The number of rows in the weight matrix should equal the number of columns in the variable `x`.  Since `x` may have 2 dimensions if it represents a single training example (row, col), or three dimensions (batch_size, row, col), get the last dimension from the tuple that holds the dimensions of x.
    - The number of columns in the weight matrix is the number of units chosen for that dense layer.  Look at the `__init__` function to see which variable stores the number of units.
- `dtype` is the data type of the values in the generated matrix; the default is a 32-bit float. Don't set the dtype explicitly (passing `tf.float32` raises an error); just let it use the default value.

Set the standard deviation of the random values to 0.1
- The values generated have a mean of 0 and standard deviation of 1.
- Set the default standard deviation `stdev` to be 0.1 by multiplying each of the values in the weight matrix by 0.1.
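
The shape and scaling rules above can be checked with a short NumPy sketch. Note the assumption: NumPy's `default_rng` is used here as a stand-in for the Trax random key, so the values will differ from those produced by `trax.fastmath.random.normal`; only the shapes and the scaling idea carry over.

```python
import numpy as np

rng = np.random.default_rng(0)  # NumPy stand-in for the Trax random key

x = np.ones((4, 3))   # a batch of 4 examples with 3 features each
n_units = 10          # number of units in the dense layer
init_stdev = 0.1      # desired standard deviation of the initial weights

# Rows of w match the last dimension of x; columns match the number of units.
# Standard normal samples (mean 0, stdev 1) are scaled down to stdev 0.1.
w = rng.normal(size=(x.shape[-1], n_units)) * init_stdev

out = np.dot(x, w)
print(w.shape, out.shape)  # (3, 10) (4, 10)
```

Using `x.shape[-1]` rather than a fixed index keeps the weight shape correct whether `x` has two or three dimensions.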

In [21]:
random = fastmath.random

In [22]:
# See how the trax.fastmath.random.normal function works
tmp_key = random.get_prng(seed=1)
print('The random seed generated by random.get_prng')
display(tmp_key)

print('choose a matrix with 2 rows and 3 columns')
tmp_shape=(2, 3)
display(tmp_shape)

# Generate a weight matrix
# Note that you'll get an error if you try to set dtype to tf.float32, 
# where tf is tensorflow
# Just avoid setting the dtype and allow it to use the default data type
tmp_weight = trax.fastmath.random.normal(key=tmp_key, shape=tmp_shape)

print('Weight matrix generated with a normal distribution with mean 0 and'
      ' stdev of 1')
display(tmp_weight)

The random seed generated by random.get_prng


DeviceArray([0, 1], dtype=uint32)

choose a matrix with 2 rows and 3 columns


(2, 3)

Weight matrix generated with a normal distribution with mean 0 and stdev of 1


DeviceArray([[ 0.957307  , -0.9699291 ,  1.0070664 ],
             [ 0.36619022,  0.17294823,  0.29092228]], dtype=float32)

In [23]:
class Dense(Layer):
    """
    A dense (fully-connected) layer.
    """
    def __init__(self, n_units, init_stdev=0.1):
        super().__init__()  # sets self.weights to None until init() is called
        self._n_units = n_units
        self._init_stdev = init_stdev

    def forward(self, x):
        dense = fnp.dot(x, self.weights)
        return dense

    def init_weights_and_state(self, input_signature, random_key):
        input_shape = input_signature.shape
        # Use the last dimension so that 2-D and 3-D inputs both work
        shape = (input_shape[-1], self._n_units)
        w = trax.fastmath.random.normal(key=random_key, shape=shape)
        w *= self._init_stdev
        self.weights = w
        return self.weights

In [24]:
# Testing your Dense layer 
dense_layer = Dense(n_units=10)      # sets number of units in dense layer
random_key = random.get_prng(seed=0) # sets random seed
z = np.array([[2.0, 7.0, 25.0]])     # input array 

dense_layer.init(z, random_key)
# Returns randomly generated weights
print(f'Weights are\n {dense_layer.weights}')
# Returns the product of the inputs and the weights
print('Forward function output is', dense_layer(z)) 

Weights are
 [[-0.02837108  0.09368162 -0.10050076  0.14165013  0.10543301  0.09108126
  -0.04265672  0.0986188  -0.05575325  0.00153249]
 [-0.20785688  0.0554837   0.09142365  0.05744595  0.07227863  0.01210617
  -0.03237354  0.16234995  0.02450038 -0.13809784]
 [-0.06111237  0.01403724  0.08410042 -0.1094358  -0.10775021 -0.11396459
  -0.05933381 -0.01557652 -0.03832145 -0.11144515]]
Forward function output is [[-3.0395496   0.9266802   2.5414743  -2.050473   -1.9769388  -2.582209
  -1.7952735   0.94427425 -0.8980402  -3.7497487 ]]


## 3.3  Model

Now you will implement a classifier using neural networks. Here is the model architecture you will be implementing. 

Embedding -> Mean -> Dense(2) -> LogSoftmax -> preds

For the model implementation, you will use the Trax layers library `tl`.
Note that the second character of `tl` is the lowercase letter `L`, not the number 1. Trax layers are very similar to the ones you implemented above,
but in addition to trainable weights they also have a non-trainable state.
State is used in layers like batch normalization and for inference; you will learn more about it in course 4.

First, look at the code of the Trax Dense layer and compare to your implementation above.
- [tl.Dense](https://github.com/google/trax/blob/master/trax/layers/core.py#L29): Trax Dense layer implementation

One other important layer that you will use a lot is one that lets you execute one layer after another in sequence.
- [tl.Serial](https://github.com/google/trax/blob/master/trax/layers/combinators.py#L26): Combinator that applies layers serially.  
    - You can pass in the layers as arguments to `Serial`, separated by commas. 
    - For example: `tl.Serial(tl.Embedding(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))`

Please use the `help` function to view documentation for each layer.

- [tl.Embedding](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L113): Layer constructor function for an embedding layer.  
    - `tl.Embedding(vocab_size, d_feature)`.
    - `vocab_size` is the number of unique words in the given vocabulary.
    - `d_feature` is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).

In [25]:
#help(tl.Dense)
#help(tl.Serial)
#help(tl.Embedding)

In [26]:
tmp_embed = tl.Embedding(vocab_size=3, d_feature=2)
display(tmp_embed)

Embedding_3_2

- [tl.Mean](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L276): Calculates means across an axis.  In this case, please choose axis = 1 to get an average embedding vector for each tweet (the average of the embeddings of all the words in that tweet).  
- For example, if the embedding size is 300 and a batch holds 16 tweets padded to 20 tokens each, the embedding layer outputs a tensor of shape (16, 20, 300), and taking the mean along axis=1 yields one 300-element vector per tweet, that is, a tensor of shape (16, 300).
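
The effect of averaging along axis 1 can be traced with plain NumPy. This is a rough sketch of the idea, not how Trax implements it: `embedding_matrix` and `token_ids` are made-up values, and the embedding lookup is emulated with integer indexing.

```python
import numpy as np

vocab_size, d_feature = 6, 3

# Hypothetical embedding matrix: row i is the embedding of token ID i.
embedding_matrix = np.arange(
    vocab_size * d_feature, dtype=float).reshape(vocab_size, d_feature)

# A batch of 2 padded tweets, 4 token IDs each: shape (batch, seq_len).
token_ids = np.array([[1, 2, 0, 0],
                      [3, 4, 5, 0]])

embedded = embedding_matrix[token_ids]   # shape (2, 4, 3)
averaged = embedded.mean(axis=1)         # shape (2, 3): one vector per tweet

print(embedded.shape, averaged.shape)
```

In this sketch the padding positions (ID 0) are included in the average, just like any other token.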

In [27]:
help(tl.Mean)

Help on function Mean in module trax.layers.core:

Mean(axis=-1, keepdims=False)
    Returns a layer that computes mean values using one tensor axis.
    
    `Mean` uses one tensor axis to form groups of values and replaces each group
    with the mean value of that group. The resulting values can either remain
    in their own size 1 axis (`keepdims=True`), or that axis can be removed from
    the overall tensor (default `keepdims=False`), lowering the rank of the
    tensor by one.
    
    Args:
      axis: Axis along which values are grouped for computing a mean.
      keepdims: If `True`, keep the resulting size 1 axis as a separate tensor
          axis; else, remove that axis.



In [28]:
# Given the embedding matrix uses 2 elements for embedding meaning, and 
# has a vocab size of 3: shape = (2, 3)
tmp_embed = np.array([[1, 2, 3], 
                      [4, 5, 6]])
display(np.mean(tmp_embed, axis=0)) # means for each word
display(np.mean(tmp_embed, axis=1)) # means for each semantic value

array([2.5, 3.5, 4.5])

array([2., 5.])

- [tl.LogSoftmax](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L242): Implements log softmax function
- Here, you don't need to set any parameters for `LogSoftmax()`.
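
To see what a log softmax computes, here is a minimal NumPy sketch. It is an illustration and not the Trax implementation: `log_softmax` is a hypothetical function, and the subtraction of the row maximum is a common trick assumed here for numerical stability.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Log of the softmax along the given axis, computed stably."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    log_normalizer = np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))
    return shifted - log_normalizer


logits = np.array([[2.0, 0.0], [-1.0, 3.0]])
log_probs = log_softmax(logits)

# Exponentiating recovers probabilities that sum to 1 for each example.
print(np.exp(log_probs).sum(axis=-1))
```

Since the logarithm is monotonic, the class with the largest log probability is also the class with the largest probability, so predictions can be read off with `argmax`.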

In [29]:
#help(tl.LogSoftmax)

**Online documentation**

- [tl.Dense](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Dense)

- [tl.Serial](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#module-trax.layers.combinators)

- [tl.Embedding](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Embedding)

- [tl.Mean](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Mean)

- [tl.LogSoftmax](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.LogSoftmax)

### Exercise 05
Implement the classifier function. 

In [30]:
def classifier(
        vocab_size=len(vocab), embedding_dim=256, output_dim=2, 
        mode='train'): 
    model = tl.Serial(tl.Embedding(vocab_size=vocab_size, 
                                   d_feature=embedding_dim),
                      tl.Mean(axis=1),
                      tl.Dense(n_units=output_dim),
                      tl.LogSoftmax())
    return model

In [31]:
tmp_model = classifier()
print(type(tmp_model))
display(tmp_model)

<class 'trax.layers.combinators.Serial'>


Serial[
  Embedding_9111_256
  Mean
  Dense_2
  LogSoftmax
]

# Part 4:  Training

To train a model on a task, Trax defines an abstraction [`trax.supervised.training.TrainTask`](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html#trax.supervised.training.TrainTask) which packages the train data, loss and optimizer (among other things) together into an object.

Similarly, to evaluate a model, Trax defines an abstraction [`trax.supervised.training.EvalTask`](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html#trax.supervised.training.EvalTask) which packages the eval data and metrics (among other things) into another object.

The final piece tying things together is the [`trax.supervised.training.Loop`](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html#trax.supervised.training.Loop) abstraction that is a very simple and flexible way to put everything together and train the model, all the while evaluating it and saving checkpoints.
Using `Loop` will save you a lot of code compared to always writing the training loop by hand, like you did in courses 1 and 2. More importantly, you are less likely to have a bug in that code that would ruin your training.

In [32]:
#help(trax.supervised.training.TrainTask)
#help(trax.supervised.training.EvalTask)
#help(trax.supervised.training.Loop)
#help(trax.optimizers)

Notice some available optimizers include:
```CPP
    adafactor
    adam
    momentum
    rms_prop
    sm3
```

## 4.1  Training the model

Now you are going to train your model. 

Let's define the `TrainTask`, `EvalTask` and `Loop` in preparation to train the model.

In [34]:
BATCH_SIZE = 16
ETA = 0.01
rnd.seed(271)

In [40]:
train_task = training.TrainTask(
    labeled_data=train_generator(batch_size=BATCH_SIZE, shuffle=True),
    loss_layer=tl.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adam(ETA),
    n_steps_per_checkpoint=10)
eval_task = training.EvalTask(
    labeled_data=val_generator(batch_size=BATCH_SIZE, shuffle=True),
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()])
model = classifier()

This defines a model trained using [`tl.CrossEntropyLoss`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.metrics.CrossEntropyLoss) optimized with the [`trax.optimizers.Adam`](https://trax-ml.readthedocs.io/en/latest/trax.optimizers.html#trax.optimizers.adam.Adam) optimizer, all the while tracking the accuracy using [`tl.Accuracy`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.metrics.Accuracy) metric. We also track `tl.CrossEntropyLoss` on the validation set.
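
Conceptually, both metrics operate on the log probabilities produced by the model. The NumPy sketch below shows one way to compute a weighted cross-entropy and a weighted accuracy; it is an approximation of the idea under simple assumptions, not the Trax source code, and `cross_entropy`, `accuracy` and the sample values are hypothetical.

```python
import numpy as np


def cross_entropy(log_probs, targets, weights):
    """Weighted mean of the negative log probability of the true class."""
    picked = log_probs[np.arange(len(targets)), targets]
    return -np.sum(picked * weights) / np.sum(weights)


def accuracy(log_probs, targets, weights):
    """Weighted fraction of examples whose argmax matches the target."""
    correct = np.argmax(log_probs, axis=-1) == targets
    return np.sum(correct * weights) / np.sum(weights)


# Made-up predictions for 3 examples over 2 classes (negative, positive).
log_probs = np.log(np.array([[0.2, 0.8], [0.9, 0.1], [0.6, 0.4]]))
targets = np.array([1, 0, 1])
weights = np.ones(3)

loss = cross_entropy(log_probs, targets, weights)
acc = accuracy(log_probs, targets, weights)
print(loss, acc)
```

With all weights equal to 1, as in this assignment, both quantities reduce to plain averages over the batch.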

Now let's make an output directory and train the model.

In [41]:
output_dir = './model'

### Exercise 06
**Instructions:** Implement `train_model` to train the model (`classifier` that you wrote earlier) for the given number of training steps (`n_steps`) using `TrainTask`, `EvalTask` and `Loop`.

In [47]:
def train_model(classifier, train_task, eval_task, n_steps, output_dir):
    '''
    Input: 
        classifier - the model you are building
        train_task - Training task
        eval_task - Evaluation task
        n_steps - the number of training steps
        output_dir - folder to save your files
    Output:
        training_loop - the Trax training loop (it holds the trained model)
    '''
### START CODE HERE (Replace instances of 'None' with your code) ###
    # Newer Trax releases reject `eval_task=` with a TypeError and expect the
    # plural keyword `eval_tasks` (assumed here to accept a list of tasks);
    # older releases take `eval_task=eval_task` instead.
    training_loop = training.Loop(
        classifier, # The learning model
        train_task, # The training task
        eval_tasks=[eval_task], # The evaluation task(s)
        output_dir=output_dir) # The output directory
    training_loop.run(n_steps=n_steps)
### END CODE HERE ###

    # Return the training_loop, since it has the model.
    return training_loop

In [48]:
training_loop = train_model(
    model, train_task, eval_task, 100, output_dir)