# Assignment 1:  Sentiment with Deep Neural Networks

Welcome to the first assignment of course 3. In this assignment, you will explore sentiment analysis using deep neural networks. 

In course 1, you implemented logistic regression and Naive Bayes for sentiment analysis. However, if you were to give your old models an example like:

<center> <span style='color:blue'> <b>This movie was almost good.</b> </span> </center>

Your model would have predicted a positive sentiment for that review. However, that sentence has a negative sentiment and indicates that the movie was not good. To solve those kinds of misclassifications, you will write a program that uses deep neural networks to identify sentiment in text. By completing this assignment, you will: 

- Understand how you can build/design a model using layers
- Train a model using a training loop
- Use a binary cross-entropy loss function
- Compute the accuracy of your model
- Predict using your own input

As you can tell, this model follows a similar structure to the one you previously implemented in the second course of this specialization. 
- Indeed, most of the deep nets you will implement have a similar structure. Only the model architecture, the inputs, and the outputs change. Before starting the assignment, we will introduce you to the Google library `trax`, which we use for building and training models.


Now we will show you how to compute the gradient of a certain function `f` by just using `trax.fastmath.grad(f)`. 

- Trax source code can be found on GitHub: [Trax](https://github.com/google/trax)
- The Trax code also uses the JAX library: [JAX](https://jax.readthedocs.io/en/latest/index.html)


# Part 1:  Import libraries and try out Trax

- Let's import libraries and look at an example of using the Trax library.

In [1]:
import os
import random as rnd

import numpy as np
import trax
from   trax import layers as t
from   trax.fastmath import grad
import trax.fastmath.numpy as fnp

from utils import get_all_tweets, Layer, load_tweets, process_tweet

#trax.supervised.trainer_lib.init_random_number_generators(31)

In [2]:
a = fnp.array(5.)
display(a)
type(a)



DeviceArray(5., dtype=float32)

jax.interpreters.xla.DeviceArray

In [3]:
def f(x):
    return x**2

In [4]:
print(f'f(a) for a={a}: {f(a)}')

f(a) for a=5.0: 25.0


The gradient (derivative) of function `f` with respect to its input `x` is the derivative of $x^2$.
- The derivative of $x^2$ is $2x$.  
- When x is 5, then $2x=10$.
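
As a quick sanity check that does not depend on Trax, you can approximate the same derivative numerically with a central finite difference. This is only an illustrative sketch: the helper name `numerical_derivative` and the step size `h` are assumptions made for this example and are not part of the assignment or of Trax.

```python
def f(x):
    return x ** 2


def numerical_derivative(func, x, h=1e-5):
    """Central-difference approximation of d(func)/dx at x (hypothetical helper)."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Approximate the derivative of x^2 at x = 5; the exact value is 2 * 5 = 10.
approx = numerical_derivative(f, 5.0)
print(approx)
```

The printed value should be very close to 10, which agrees with the analytic result $2x$ at $x=5$.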

You can calculate the gradient of a function by using `trax.fastmath.grad(fun=)` and passing in the name of the function.
- In this case the function you want to take the gradient of is `f`.
- The object returned (saved in `grad_f` in this example) is a function that can calculate the gradient of f for a given trax.fastmath.numpy array.

In [5]:
grad_f = grad(fun=f)
type(grad_f)

function

In [6]:
grad_calculation = grad_f(a) # f = x^2 -> d/dx = 2x
display(grad_calculation)

DeviceArray(10., dtype=float32)

# Part 2:  Importing the data

## 2.1  Loading in the data

Import the data set.  
- You may recognize this from earlier assignments in the specialization.
- Details of the `process_tweet` function are available in the `utils.py` file.

In [7]:
DATA = '../../../../data/twitter_samples'

In [8]:
all_positive_tweets = get_all_tweets(DATA, 'positive')
all_negative_tweets = get_all_tweets(DATA, 'negative')

In [9]:
len(all_positive_tweets), len(all_negative_tweets)

(5000, 5000)

In [10]:
N_TRAIN = 4000
train_pos = all_positive_tweets[:N_TRAIN]
val_pos = all_positive_tweets[N_TRAIN:]
train_neg = all_negative_tweets[:N_TRAIN]
val_neg = all_negative_tweets[N_TRAIN:]

train_x = train_pos + train_neg
val_x = val_pos + val_neg
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
val_y = np.append(np.ones(len(val_pos)), np.zeros(len(val_neg)))

len(train_x), len(val_x)

(8000, 2000)

In [11]:
print('original:', train_pos[0])
print('processed:', process_tweet(train_pos[0]))

original: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
processed: ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


## 2.2  Building the vocabulary

Now build the vocabulary.
- Map each word in each tweet to an integer (an "index"). 
- The following code does this for you, but please read it and understand what it's doing.
- Note that you will build the vocabulary based on the training data. 
- To do so, you will assign an index to every word by iterating over your training set.

The vocabulary will also include some special tokens:
- `__PAD__`: padding
- `__</e>__`: end of line
- `__UNK__`: a token representing any word that is not in the vocabulary.

In [12]:
vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2}

i = 3
for tweet in train_x:
    processed = process_tweet(tweet)
    for word in processed:
        if word not in vocab:
            vocab[word] = i
            i += 1
            
len(vocab)            

9111

In [13]:
#display(vocab)

## 2.3  Converting a tweet to a tensor

Write a function that will convert each tweet to a tensor (a list of unique integer IDs representing the processed tweet).
- Note, the returned data type will be a **regular Python `list()`**
    - You won't use TensorFlow in this function
    - You also won't use a numpy array
    - You also won't use trax.fastmath.numpy array
- For words in the tweet that are not in the vocabulary, set them to the unique ID for the token `__UNK__`.

##### Example
Input a tweet:
```CPP
'@happypuppy, is Maria happy?'
```

`tweet_to_tensor` will first convert the tweet into a list of tokens (including only relevant words):
```CPP
['maria', 'happi']
```

Then it will convert each word into its unique integer

```CPP
[2, 56]
```
- Notice that the word "maria" is not in the vocabulary, so it is assigned the unique integer associated with the `__UNK__` token, because it is considered "unknown."
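
The lookup step in this example can be sketched on its own. The sketch below assumes the tweet has already been tokenized, and it uses a made-up toy vocabulary: `toy_vocab` and `tokens_to_ids` are hypothetical names for illustration, and the ID 56 for `'happi'` is simply taken from the example above.

```python
# Toy vocabulary; the words and IDs are invented for this illustration only.
toy_vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2, 'happi': 56}


def tokens_to_ids(tokens, vocab_dict, unk_token='__UNK__'):
    """Map each token to its ID, falling back to the ID of unk_token."""
    unk_id = vocab_dict[unk_token]  # looked up from the dictionary, not hard-coded
    return [vocab_dict.get(token, unk_id) for token in tokens]


# Tokens as they come out of preprocessing in the example above.
tokens = ['maria', 'happi']
ids = tokens_to_ids(tokens, toy_vocab)
print(ids)  # [2, 56]
```

Using `dict.get` with a default keeps the fallback to `__UNK__` in a single expression and avoids a separate membership test for every word.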

### Exercise 01
**Instructions:** Write a function `tweet_to_tensor` that takes in a tweet and converts it to a list of integers. You can use the `vocab` dictionary you just built to help create the tensor. 

- Use the vocab_dict parameter and not a global variable.
- Do not hard code the integer value for the `__UNK__` token.

In [14]:
def tweet_to_tensor(tweet, vocab_dict, unk_token='__UNK__', verbose=False):
    '''
    Input: 
        tweet - A string containing a tweet
        vocab_dict - The words dictionary
        unk_token - The special string for unknown tokens
        verbose - Print info during runtime
    Output:
        tensor_l - A Python list of integer IDs, one per word in the processed tweet
        
    '''  
    word_l = process_tweet(tweet)
    if verbose:
        print('List of words from the processed tweet:')
        print(word_l)
    tensor_l = []
    unk_ID = vocab_dict[unk_token]
    if verbose:
        print(f'The unique integer ID for the unk_token is {unk_ID}')
    tensor_l = [vocab_dict.get(word, unk_ID) for word in word_l]    
    return tensor_l

In [15]:
print('Actual tweet is\n', val_pos[0])
print(
    '\nTensor of tweet:\n', tweet_to_tensor(val_pos[0], vocab_dict=vocab))

Actual tweet is
 Bro:U wan cut hair anot,ur hair long Liao bo
Me:since ord liao,take it easy lor treat as save $ leave it longer :)
Bro:LOL Sibei xialan

Tensor of tweet:
 [1076, 138, 486, 2365, 755, 8170, 1134, 755, 54, 2, 2688, 801, 2, 2, 355, 609, 2, 3507, 1028, 605, 4579, 9, 1076, 159, 2, 2]


## 2.4  Creating a batch generator

Most of the time in Natural Language Processing, and in AI in general, we train models on batches of examples. 
- If, instead of training with batches of examples, you were to train a model with one example at a time, it would take a very long time to train the model. 
- You will now build a data generator that takes in the positive/negative tweets and returns a batch of training examples. It returns the model inputs, the targets (positive or negative labels), and the weight for each target (for example, this allows us to treat some examples as more important to get right than others, but commonly the weights will all be 1.0). 

Once you create the generator, you could include it in a for loop

```CPP
for batch_inputs, batch_targets, batch_example_weights in data_generator:
    ...
```

You can also get a single batch like this:

```CPP
batch_inputs, batch_targets, batch_example_weights = next(data_generator)
```
The generator returns the next batch each time it's called. 
- This generator returns the data in a format (tensors) that you could directly use in your model.
- It returns a triple: the inputs, targets, and loss weights:
    - Inputs is a tensor that contains the batch of tweets we put into the model.
    - Targets is the corresponding batch of labels that we train the model to predict.
    - Loss weights here are just 1s with the same shape as targets. Next week, you will use them to mask input padding.
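
To make the shape of that triple concrete, here is a minimal sketch of a generator that yields padded inputs, targets, and example weights. It is deliberately simpler than the exercise: `toy_batch_generator` and the toy data are hypothetical, the examples are assumed to be already converted to integer IDs, the sketch makes a single pass without shuffling, and it returns plain Python lists where the assignment's generator returns numpy arrays.

```python
def toy_batch_generator(pos, neg, batch_size):
    """Yield (inputs, targets, example_weights), half positive and half negative.

    pos and neg are lists of examples, each a list of integer IDs.
    Single pass only; an incomplete final batch is dropped.
    """
    assert batch_size % 2 == 0
    half = batch_size // 2
    for start in range(0, min(len(pos), len(neg)) - half + 1, half):
        batch = pos[start:start + half] + neg[start:start + half]
        max_len = max(len(example) for example in batch)
        # Pad every example with 0 (the __PAD__ ID) up to the longest one.
        inputs = [example + [0] * (max_len - len(example)) for example in batch]
        targets = [1] * half + [0] * half
        example_weights = [1] * len(targets)
        yield inputs, targets, example_weights


toy_pos = [[5, 6, 7], [8]]
toy_neg = [[9, 10], [11, 12, 13, 14]]
gen = toy_batch_generator(toy_pos, toy_neg, batch_size=2)

first = next(gen)
print(first)   # ([[5, 6, 7], [9, 10, 0]], [1, 0], [1, 1])
second = next(gen)
print(second)  # ([[8, 0, 0, 0], [11, 12, 13, 14]], [1, 0], [1, 1])
```

Note that padding is computed per batch, so the two batches above have different widths (3 and 4).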

### Exercise 02
Implement `data_generator`.

In [None]:
def data_generator(
        data_pos, data_neg, batch_size, loop, vocab_dict, shuffle=False):
    '''
    Input: 
      data_pos - Set of positive examples
      data_neg - Set of negative examples
      batch_size - number of samples per batch. Must be even
      loop - True or False
      vocab_dict - The words dictionary
      shuffle - Shuffle the data order
    Yield:
      inputs - Subset of positive and negative examples
      targets - The corresponding labels for the subset
      example_weights - An array specifying the importance of each example  
    '''     
    ### START GIVEN CODE ###
    # make sure the batch size is an even number
    # to allow an equal number of positive and negative samples
    assert batch_size % 2 == 0
    
    # Number of positive examples in each batch is half of the batch size
    # same with number of negative examples in each batch
    n_to_take = batch_size // 2
    
    # Use pos_index to walk through the data_pos array
    # same with neg_index and data_neg
    pos_index = 0
    neg_index = 0
    len_data_pos = len(data_pos)
    len_data_neg = len(data_neg)
    
    # Get an array with the data indexes
    pos_index_lines = list(range(len_data_pos))
    neg_index_lines = list(range(len_data_neg))
    
    # shuffle lines if shuffle is set to True
    if shuffle:
        rnd.shuffle(pos_index_lines)
        rnd.shuffle(neg_index_lines)
        
    stop = False
    
    # Loop indefinitely
    while not stop:  
        # create a batch with positive and negative examples
        batch = []
        # First part: Pack n_to_take positive examples
        # Start from pos_index and increment i up to n_to_take
        for i in range(n_to_take):
            # If the positive index goes past the positive dataset length
            if pos_index >= len_data_pos: 
                # If loop is set to False, break once we reach the end of
                # the dataset
                if not loop:
                    stop = True
                    break
                
                # If user wants to keep re-using the data, reset the index
                pos_index = 0
                if shuffle:
                    # Shuffle the index of the positive sample
                    rnd.shuffle(pos_index_lines)
                    
            # get the tweet at pos_index
            tweet = data_pos[pos_index_lines[pos_index]]
            # convert the tweet into tensors of integers representing the
            # processed words
            tensor = tweet_to_tensor(tweet, vocab_dict)
            # append the tensor to the batch list
            batch.append(tensor)
            # Increment pos_index by one
            pos_index += 1
        ### END GIVEN CODE ###
            
        ### START CODE HERE (Replace instances of 'None' with your code)
        # Second part: Pack n_to_take negative examples
        # Using the same batch list, start from neg_index and increment i 
        # up to n_to_take
        for i in range(n_to_take):            
            # If the negative index goes past the negative dataset length
            if neg_index >= len_data_neg:
                # If loop is set to False, break once we reach the end of
                # the dataset
                if not loop:
                    stop = True
                    break
                    
                # If user wants to keep re-using the data, reset the index
                neg_index = 0
                if shuffle:
                    # Shuffle the index of the negative sample
                    rnd.shuffle(neg_index_lines)
                    
            # get the tweet at neg_index
            tweet = data_neg[neg_index_lines[neg_index]]
            # convert the tweet into tensors of integers representing the 
            # processed words
            tensor = tweet_to_tensor(tweet, vocab_dict)
            # append the tensor to the batch list
            batch.append(tensor)
            # Increment neg_index by one
            neg_index += 1
        ### END CODE HERE ###        

        # -------------CONTINUE HERE------------------
        ### START GIVEN CODE ###
        if stop:
            break

        # Update the start index for positive data 
        # so that it's n_to_take positions after the current pos_index
        pos_index += n_to_take
        
        # Update the start index for negative data 
        # so that it's n_to_take positions after the current neg_index
        neg_index += n_to_take
        
        # Get the max tweet length (the length of the longest tweet) 
        # (you will pad all shorter tweets to have this length)
        max_len = max([len(t) for t in batch]) 
        
        
        # Initialize tensor_pad_l, which will 
        # store the padded versions of the tensors
        tensor_pad_l = []
        # Pad shorter tweets with zeros
        for tensor in batch:
### END GIVEN CODE ###

### START CODE HERE (Replace instances of 'None' with your code) ###
            # Get the number of positions to pad for this tensor so that it will be max_len long
            n_pad = max_len - len(tensor)
            
            # Generate a list of zeros, with length n_pad
            pad_l = [0] * n_pad
            
            # concatenate the tensor and the list of padded zeros
            tensor_pad = tensor + pad_l
            
            # append the padded tensor to the list of padded tensors
            tensor_pad_l.append(tensor_pad)

        # convert the list of padded tensors to a numpy array
        # and store this as the model inputs
        inputs = np.array(tensor_pad_l)
  
        # Generate the list of targets for the positive examples (a list of ones)
        # The length is the number of positive examples in the batch
        target_pos = [1] * n_to_take
        
        # Generate the list of targets for the negative examples (a list of zeros)
        # The length is the number of negative examples in the batch
        target_neg = [0] * n_to_take
        
        # Concatenate the positive and negative targets
        target_l = target_pos + target_neg
        
        # Convert the target list into a numpy array
        targets = np.array(target_l)

        # Example weights: Treat all examples as equally important. It should return an np.array. Hint: Use np.ones_like()
        example_weights = np.ones_like(targets)
        

### END CODE HERE ###

### GIVEN CODE ###
        # note we use yield and not return
        yield inputs, targets, example_weights