# Assignment #4
**CE4719: Deep Learning**


*   Spring 2020
*   http://ce.sharif.edu/courses/98-99/2/ce719-1/index.php

**Please pay attention to these notes:**
- The coding parts you have to complete are specified by:
```
    ################################################################################
    # TODO:                                                                        #
    ################################################################################
    pass
    ################################################################################
    #                                 END OF YOUR CODE                             #
    ################################################################################ 
```
- We always recommend discussing assignments in groups. However, each student has to complete all of the questions individually.
- All submitted code will be compared against all other students' code using Stanford MOSS.
- If you have any questions about this assignment, feel free to drop us a line. You may also post your questions on the course's forum page.
- We HIGHLY encourage you to run this notebook on Google Colab.
- **Before starting to work on the assignment, please fill in your name in the next section and remember to RUN the cell.**


In [0]:
#@title Enter your information & "RUN the cell!!"
student_id = "" #@param {type:"string"}
student_name = "" #@param {type:"string"}

print("your student id:", student_id)
print("your name:", student_name)

## 2. Tokenization, Vocabulary, Preprocessing (15 pts)

---
In this problem, you will practice tokenization, creating vocabulary for a corpus, preprocessing data, and processing data using RNNs.

### 2.1

In [0]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.nn.init as init
from torch.nn.utils.rnn import pad_sequence
from torch.nn import RNN, RNNCell, Embedding
from typing import List, Dict
from itertools import chain
from collections import Counter
import json
import random
from pprint import pprint

# nltk will be used to tokenize texts
import nltk  
nltk.download('punkt')

In [0]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(DEVICE)

# Function for setting the random seed for reproducibility
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if DEVICE == torch.device("cuda"):
        torch.cuda.manual_seed(seed)
        torch.backends.cudnn.deterministic = True
        torch.cuda.empty_cache()

### 2.2 Tokenize (1 pts)

Tokenization is the process of splitting a string or text into a list of tokens. It is one of the most common preprocessing steps in natural language processing. The resulting tokens are then passed on to some other form of processing, which in our case will be deep neural networks.
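
To make the idea concrete, here is a minimal sketch of tokenization that uses only a regular expression. It is a simplified stand-in, not NLTK's `word_tokenize`; the function name `simple_tokenize` is hypothetical, and its handling of contractions differs from NLTK's (for example, it splits `youth's` into three tokens, whereas the test below expects `'s` to stay together).

```python
import re
from typing import List


def simple_tokenize(sentence: str) -> List[str]:
    """Hypothetical stand-in tokenizer (not NLTK): lowercase the sentence,
    then split it into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


tokens = simple_tokenize("Shall sum my count, and make my old excuse.")
print(tokens)
```

Punctuation becomes its own token, which is why the comma and the final period appear as separate list entries.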

In [0]:
from nltk import word_tokenize


def tokenize(corpus: List[str]) -> List[List[str]]:
    """ tokenizes the corpus and returns it as a list of list of tokens
    corpus: Input corpus as a list of sentences (each sentence is a string)
    """
    ################################################################################
    # TODO: use nltk word_tokenize to tokenize corpus, as a list of sentences,     #
    # into its constituent words.                                                  #
    # You should first lowercase the characters of sentences (use .lower()         #
    # methods of strings.)                                                         #
    ################################################################################
    pass
    ################################################################################
    #                                 END OF YOUR CODE                             #
    ################################################################################ 

Let's test your implementation with the following function:

In [0]:
def test_tokenize():
    sample_corpus = ["From fairest creatures we desire increase.",
                     "Within thine own bud buriest thy content.",
                     "Thy youth's proud livery so gazed on now.", 
                     "Shall sum my count, and make my old excuse."]

    correct_answer = [['from', 'fairest', 'creatures', 'we', 'desire', 'increase', '.'],
                      ['within', 'thine', 'own', 'bud', 'buriest', 'thy', 'content', '.'],
                      ['thy', 'youth', "'s", 'proud', 'livery', 'so', 'gazed', 'on', 'now', '.'],
                      ['shall', 'sum', 'my', 'count', ',', 'and', 'make', 'my', 'old', 'excuse', '.']]

    assert tokenize(sample_corpus) == correct_answer

    print('passed!')

test_tokenize()

### 2.3 Vocabulary (6 pts)

After tokenizing the corpus, we will find the unique tokens and name them the vocabulary of that corpus. There are a couple of important points to note here:

1. When dealing with a corpus in a sentence-wise manner, we usually mark the beginning and end of each sentence with special tokens (e.g., `'<START>'` and `'<END>'`).

2. To make sequences have the same length, we pad shorter ones with a special token (e.g., `'<PAD>'`).

3. Tokens which will be encountered later and do not exist in our vocab will be replaced by a special token (e.g., `'<UNK>'`).
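
The three points above can be sketched on a toy example. The dictionary `toy_word2id` and the helper `toy_encode` are hypothetical names used only for illustration; the special-token ids are assumed to match the defaults of the `Vocab` class in this notebook.

```python
from typing import Dict, List

# Hypothetical toy vocabulary: four special tokens plus two ordinary words.
toy_word2id: Dict[str, int] = {
    "<PAD>": 0, "<START>": 1, "<END>": 2, "<UNK>": 3,
    "thy": 4, "youth": 5,
}


def toy_encode(sent: List[str], length: int) -> List[int]:
    """Wrap a sentence with start/end markers, pad it to `length`,
    and map every token to its id (unknown tokens map to <UNK>)."""
    wrapped = ["<START>"] + sent + ["<END>"]
    padded = wrapped + ["<PAD>"] * (length - len(wrapped))
    return [toy_word2id.get(tok, toy_word2id["<UNK>"]) for tok in padded]


ids = toy_encode(["thy", "beauty"], length=6)
print(ids)
```

Here `beauty` is not in the toy vocabulary, so it is replaced by the id of `<UNK>`, and the last two positions hold the id of `<PAD>`.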

In [0]:
class Vocab:
    def __init__(self, word2id=None):
        """Constructor of Vocab

        word2id: dictionary that maps tokens to their ids.
        """
        self.pad_token = '<PAD>'
        self.end_token = '<END>'
        self.start_token = '<START>'
        self.unk_token = '<UNK>'

        if word2id is None:
            self.word2id = {self.pad_token: 0,
                            self.start_token: 1,
                            self.end_token: 2,
                            self.unk_token: 3}
            self.size = 4
        else:
            self.word2id = word2id
            self.size = len(self.word2id)

        self.id2word = {v: k for (k, v) in self.word2id.items()}

    def build(self, tokenized_corpus: List[List[str]], size=None, min_freq=None):
        """Builds the vocab from a tokenized corpus.

        tokenized_corpus: corpus as a list of list of tokens (strings)
        size: Final size of (number of unique tokens in) our vocab
        min_freq: minimum frequency
        """
        tokens2freq = Counter(chain(*tokenized_corpus))  # dict that maps unique tokens to their freqs in the corpus
        frequent_tokens = []
        ################################################################################
        # TODO: use tokens2freq to find the size most frequent tokens and save         #
        #       them in frequent_tokens. Remove tokens with a frequency lower than     #
        #       min_freq (i.e. if token's occurrence in the corpus is less than        #
        #       min_freq times, don't put the token in frequent_tokens).               #
        #       If size is None, then use all of the tokens. This also applies to      #
        #       min_freq.                                                              #
        ################################################################################
        pass
        ################################################################################
        #                                 END OF YOUR CODE                             #
        ################################################################################
        # adding tokens to the vocab
        for token in frequent_tokens:
            self.add_token(token)

    def get_token_by_id(self, t_id: int) -> str:
        """Returns the token with the corresponding id in the vocab.
        If the id is not valid, returns None.

        t_id: token id
        """
        return self.id2word.get(t_id, None)

    def get_id_by_token(self, token: str) -> int:
        """Returns the id of the token in the vocab. If the token does not exist,
        returns the id of <UNK> token.

        token: token (as a string) for which the id should be returned.
        """
        return self.word2id.get(token, self.word2id[self.unk_token])

    def add_token(self, token: str):
        """Adds the token to the vocab's data structures
        token: token as a string
        """
        ################################################################################
        # TODO: If the token is not already in the vocab add it to word2id and id2word #
        # Don't forget to update the vocab size afterwards!                            #
        ################################################################################
        pass
        ################################################################################
        #                                 END OF YOUR CODE                             #
        ################################################################################

    def tokens2ids(self, sents):
        """Convert list of words or list of sentences of tokens 
        into list or list of list of indices.

        sents: input sentences as List[List[str]] (multiple sentences) or List[str]
        (single sentence)
        """
        ################################################################################
        # TODO: return a new list where each token is replaced by its id.              #
        # HINT: try to implement each part in one line of code using list comprehension#
        ################################################################################
        if isinstance(sents[0], list):
            pass
        else:
            pass
        ################################################################################
        #                                 END OF YOUR CODE                             #
        ################################################################################
  
    def to_tensor(self, sent: List[str]):
        """Converts a sentence as a list of tokens into a tensor of indices.

        sent: a sentence as a list of strings (tokens)
        """
        ################################################################################
        # TODO: Use self.tokens2ids to get the sentence as a list of indices and wrap a#
        #       tensor around it with dtype of torch.long and device of DEVICE.        #
        ################################################################################
        pass
        ################################################################################
        #                                 END OF YOUR CODE                             #
        ################################################################################

    def pad_sents(self, sents: List[List[str]]) -> List[List[str]]:
        """Pads list of sentences according to the longest sentence.

        sents: sentences as a list of list of tokens (strings).
        """
        sents_padded = []
        ################################################################################
        # TODO: pad shorter sentences by appending pad token to them.                  #
        ################################################################################
        pass
        ################################################################################
        #                                 END OF YOUR CODE                             #
        ################################################################################
        return sents_padded

    def save(self, path: str):
        """Saves the vocab in a json file.

        path: path to save the vocab in
        """
        with open(path, 'w') as f:
            json.dump(self.word2id, f)

    @staticmethod
    def load(path: str):
        """Loads vocab from a json file.

        path: path to load the vocab from
        """
        with open(path, 'r') as f:
            word2id = json.load(f)

        return Vocab(word2id)

Implement the `add_token` and `build` methods. Then, test your implementation with the following function:

In [0]:
def test_build_vocab():
    sample_corpus = ["From fairest creatures we desire increase.",
                     "Within thine own bud buriest thy content.",
                     "Thy youth's proud livery so gazed on now.",
                     "Shall sum my count, and make my old excuse."]

    vocab = Vocab()
    vocab.build(tokenize(sample_corpus))

    assert set(vocab.word2id.keys()) == {'old', '<UNK>', 'on', '<END>', 'gazed', 'now',
                                         'own', 'my', 'proud', 'content', ',', 'desire',
                                         'and', '<PAD>', 'shall', 'from', 'creatures',
                                         'sum', 'fairest', 'youth', "'s", 'make', 'increase',
                                         'excuse', 'we', 'bud', 'thine', '.', '<START>',
                                         'livery', 'thy', 'count', 'buriest', 'within', 'so'}

    return vocab

vocab = test_build_vocab()
print('passed!')

Implement the `pad_sents` method and test it with the following function:

In [0]:
def test_pad():
    vocab = test_build_vocab()

    unpadded_sents = [['to', 'say', 'within', 'thine', 'own', 'deep', 'sunken', 'eyes'], 
                      ['of', 'his', 'self-love', 'to', 'stop', 'posterity',], 
                      ['die', 'single'],
                      ['then', 'beauteous', 'niggard']]

    padded_sents = vocab.pad_sents(unpadded_sents)

    assert padded_sents == [['to', 'say', 'within', 'thine', 'own', 'deep', 'sunken', 'eyes'],
                            ['of', 'his', 'self-love', 'to', 'stop', 'posterity', '<PAD>', '<PAD>'],
                            ['die', 'single', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>'],
                            ['then', 'beauteous', 'niggard', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>']]
    print('passed!')

test_pad()

Implement `tokens2ids` and check your output with the following function (check whether each token is replaced by a correct id):

In [0]:
def test_tokens2id():
    vocab = test_build_vocab()
    sents = [['to', 'say', 'within', 'thine', 'own', 'deep', 'sunken', 'eyes'],
             ['increase', 'his', 'we', 'to', 'stop', 'desire', '<PAD>', '<PAD>'],
             ['die', 'single', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>'],
             ['then', 'proud', 'livery', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>']]

    sent = ['increase', 'proud', 'stop', 'sunken', '<PAD>']

    pprint(vocab.word2id)
    print()
    pprint(vocab.tokens2ids(sents))
    print()
    pprint(vocab.tokens2ids(sent))


test_tokens2id()

Implement the `to_tensor()` method and test it with the following function:

In [0]:
def test_to_tensor():
    sent = ['increase', 'proud', 'stop', 'sunken', '<PAD>']
    vocab = test_build_vocab()

    out = vocab.to_tensor(sent)

    assert torch.is_tensor(out)
    print(out)

test_to_tensor()

### 2.4 Pad & Pack and using them with RNNs in PyTorch (8 pts)

You might want to process a batch of data before and after feeding it to an RNN (instead of directly feeding it in and using its output). In this section, we will look at some of the ways of doing this in PyTorch.

#### 2.4.1 `pad_sequence()` (2 pts)

`pad_sequence` pads variable-length sequences to a common length and stacks them into a single tensor. You can pad either manually (as you did in the previous part) or by using `torch.nn.utils.rnn.pad_sequence()`. <br/>
Now let's use this function to pad some data and then feed it to an RNN. With your knowledge of PyTorch at this point, you should be able to read the [torch.nn documentation](https://pytorch.org/docs/stable/nn.html) and figure out how to do this task:
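
As a warm-up before the exercise, here is a minimal sketch of what `pad_sequence` returns for two toy id sequences. The values are made up for illustration, and no RNN is involved.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two toy sequences of different lengths (made-up token ids).
seqs = [torch.tensor([4, 5, 6]), torch.tensor([7, 8])]

time_first = pad_sequence(seqs)                     # shape (L, N)
batch_first = pad_sequence(seqs, batch_first=True)  # shape (N, L)

print(time_first.shape, batch_first.shape)
print(batch_first)
```

The default layout puts the time dimension first, which is also the layout `torch.nn.RNN` expects unless `batch_first=True` is set. The exercise cell follows.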

In [0]:
from torch.nn.utils.rnn import pad_sequence

def pad_and_feed(data: List[torch.Tensor]) -> torch.Tensor: 
    """Pads shorter sequences in a list of tensors and feeds the batch to an RNN.

    data: data as a list of float tensors, each of shape (seq_len, input_size).
    """ 
    rnn, output = None, None
    ################################################################################
    # TODO: 1) pad the data using PyTorch's pad_sequence.                          #
    #       2) instantiate an RNN using torch.nn.RNN with an appropriate input     #
    #          size and a hidden size of 10.                                       #
    #       3) apply the rnn to the padded data and save its outputs (hidden       #
    #          states) in the output variable.                                     #
    ################################################################################
    pass
    ################################################################################
    #                                 END OF YOUR CODE                             #
    ################################################################################ 
    return output

Now let's test your implementation with the following function (the test is not exhaustive and is just checking shapes):

In [0]:
def test_pad_and_feed():
    np.random.seed(42)
    data = [torch.empty(np.random.randint(0, 10), 100) for _ in range(5)]
    assert pad_and_feed(data).shape == torch.Size([7, 5, 10])
    print('passed!')

test_pad_and_feed()

#### 2.4.2 `pack_sequence()` (2 pts)

After padding, the data will contain many zeros (we often set the id of the pad token to zero) that represent padding. These zeros do not need to be processed by the RNN because they carry no meaningful data; we only use them to build a single tensor from our variable-length sentences. Therefore, feeding padded data directly to the RNN is inefficient.
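
A quick back-of-the-envelope calculation shows how much work padding can waste. The sentence lengths below are hypothetical and chosen only for illustration.

```python
# Hypothetical sentence lengths in one batch.
lengths = [6, 3, 7, 4, 6]

padded_cells = max(lengths) * len(lengths)  # positions in the padded tensor
real_tokens = sum(lengths)                  # positions that hold actual tokens
wasted = padded_cells - real_tokens         # positions that hold only padding

print(f"{wasted} of {padded_cells} positions are padding "
      f"({wasted / padded_cells:.0%})")
```

In this toy batch roughly a quarter of the time steps an RNN would process are padding, and the share grows as the lengths in a batch become more uneven.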

To make the process more efficient, there is a function called `pack_sequence()` that rearranges the data so that the model processes only the useful tokens and not the padding.

Let's see what exactly pack does:

In [0]:
from torch.nn.utils.rnn import pack_sequence

def pack():
    a = torch.zeros((5, 8))
    b = torch.ones((3, 8))
    c = 2 * torch.ones((1, 8))
    ################################################################################
    # TODO:  pack the above three tensors and print the packed output.             #
    ################################################################################
    pass
    ################################################################################
    #                                 END OF YOUR CODE                             #
    ################################################################################ 

pack()

##### **Questions**: 

1) What pattern do you see in the packed data? 

**YOUR ANSWER**: <<< Write your answer here >>>

2) What does `batch_sizes` mean in the packed data? <br/>

**YOUR ANSWER**: <<< Write your answer here >>>

3) How do you think RNN processes packed data?

**YOUR ANSWER**: <<< Write your answer here >>>

Feel free to write your answers in Persian.

#### 2.4.3 `pack_padded_sequence()` and `pad_packed_sequence()` (2 pts)

You will often see two functions used to process data before and after feeding it to RNNs: `pack_padded_sequence()` and `pad_packed_sequence()`.
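
The sketch below shows only the round trip between the two functions, with no RNN in between, on made-up tensors. It assumes the sequences are already sorted from longest to shortest, which `pack_padded_sequence` requires with its default `enforce_sorted=True`.

```python
import torch
from torch.nn.utils.rnn import (
    pad_sequence,
    pack_padded_sequence,
    pad_packed_sequence,
)

# Three made-up sequences with 2 features each, sorted longest first.
seqs = [torch.ones(3, 2), 2 * torch.ones(2, 2), 3 * torch.ones(1, 2)]
lengths = [s.shape[0] for s in seqs]

padded = pad_sequence(seqs)                     # shape (3, 3, 2), zero-padded
packed = pack_padded_sequence(padded, lengths)  # padding positions are dropped
restored, restored_lengths = pad_packed_sequence(packed)

print(restored.shape, restored_lengths)
```

Unpacking restores the zero-padded tensor and also returns the original lengths, so no information is lost by packing.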

In [0]:
# We will import these functions as pack and unpack
from torch.nn.utils.rnn import pack_padded_sequence as pack
from torch.nn.utils.rnn import pad_packed_sequence as unpack

In [0]:
def pad_and_pack(data: List[torch.Tensor]) -> torch.Tensor:
    # Sort longest first, as pack_padded_sequence expects by default
    sorted_data = sorted(data, key=lambda element: element.shape[0], reverse=True)
    lengths = [d.shape[0] for d in sorted_data]
    ################################################################################
    # TODO: 1) pad the data using pad_sequence()                                   #
    #       2) pack the data and feed it to an RNN with a hidden size of 10        #
    #       3) unpack the RNN's outputs (hidden states) and return it              #
    ################################################################################   
    pass
    ################################################################################
    #                                 END OF YOUR CODE                             #
    ################################################################################ 

Let's test your implementation with the following function (the test is not exhaustive and is just checking shapes):

In [0]:
def test_pad_and_pack():
    np.random.seed(42)
    data = [torch.empty(np.random.randint(0, 10), 100) for _ in range(5)]
    outputs = pad_and_pack(data)
    assert outputs.size() == torch.Size([7, 5, 10])
    print('passed!')

test_pad_and_pack()

### 2.5 Embedding

As you have seen in previous parts, a batch of our sentences has shape $(N, L)$, where $N$ is the number of sentences and $L$ is the length of the longest sequence (after padding, all of the sequences have the same length $L$). Each entry of this tensor is the id of a token in our vocab. <br/>

Feeding these integer ids directly to RNNs is a bad idea: the ids are arbitrary labels, so their numeric values and ordering carry no meaning. Instead, we represent each token of our vocabulary with a dense vector. That's where embeddings come in. <br/>

You can instantiate embedding layers in PyTorch by using `torch.nn.Embedding`. If you feed a tensor of vocab indices with shape $(d_1, d_2, \dots, d_k)$ to an embedding layer, it will return a $(d_1, d_2, \dots, d_k, D)$ tensor where $D$ is the embedding dimension of the layer.
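
The shape rule can be checked on a small example. The sizes below (7 embeddings of dimension 4, and an index tensor of shape $(2, 3, 5)$) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Arbitrary sizes for illustration: 7 tokens, 4-dimensional vectors.
layer = nn.Embedding(num_embeddings=7, embedding_dim=4)

# Random token ids with shape (d1, d2, d3) = (2, 3, 5).
ids = torch.randint(0, 7, (2, 3, 5))

with torch.no_grad():
    vectors = layer(ids)

print(vectors.shape)  # one trailing dimension of size D = 4 is appended
```

Each output vector is simply the row of `layer.weight` selected by the corresponding id, so an embedding layer behaves like a trainable lookup table.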

In [0]:
def embed():
    set_seed(40719)
    data = torch.tensor([[1, 2, 4, 4], [1, 3, 0, 0], [2, 0, 0, 0]])
    layer = nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=0)  # what is padding_idx?
    with torch.no_grad():
        print(layer(data))

embed() 

### 2.6 Putting it all together (2 pts)

Now let's put all the stuff we learned above together:

In [0]:
def process_data(data: List[str], vocab: Vocab) -> torch.Tensor:
    outputs = None
    ################################################################################
    # TODO: 1) tokenize the data using your tokenize function.                     #                                     
    #       2) use vocab's pad_sents method to pad the data.                       #
    #       3) replace tokens with their ids in the vocab.                         #
    #       4) instantiate an embedding layer with embedding size of 10 and        #
    #          vocabulary size equal to vocab.size.                                #
    #       5) wrap a tensor around the data and apply the embedding layer to it.  # 
    #       6) pack the data and feed it to an RNN with a hidden size of 5.        #
    #       7) unpack the RNN's outputs (hidden states) and save it in outputs.    #
    ################################################################################ 
    pass
    ################################################################################
    #                                 END OF YOUR CODE                             #
    ################################################################################
    return outputs

Let's test your implementation (the test is not exhaustive and is just checking shapes):

In [0]:
def test_process_data():
    vocab = test_build_vocab()
    sents = ['to say within thine own deep sunken eyes', 
             'of his self-love to stop posterity', 
             'then beauteous niggard',
             'die single']

    outputs = process_data(sents, vocab)
    assert outputs.shape == torch.Size([8, 4, 5])
    print('passed!')

test_process_data()

## Submission

- Check and review your answers. Make sure every cell's output is what you expect.
- Select File > Save.
- To download the notebook, select File > Download .ipynb.
- Create an archive of all notebooks (P1.ipynb, P2.ipynb, and P3.ipynb).