## Homework 4 - Supervised Learning II - MDS Computational Linguistics

### Assignment Topics
- Train a deep learning system for a real-world task
- Recurrent Neural Networks
- Long Short-Term Memory (LSTM)
- Regularization
- Gradient-based optimization
- Very-short answer questions

### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Jupyter (latest)

### Submission Info.
- Due Date: February 8, 2020, 18:00:00 (Vancouver time)

## Getting Started

In [1]:
# all the necessary imports
import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import Dataset, DataLoader
import torchtext
from torchtext.data import Field, LabelField
from torchtext.data import TabularDataset
from torchtext.data import Iterator, BucketIterator

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

# Exercise 1 Application Questions


### 1.1 Yelp review rating prediction 
rubric={accuracy:40}

This exercise is a competition based on a real-world application. 
Millions of people share reviews of businesses on [Yelp.com](https://www.yelp.com/) and the Yelp mobile app every day. These reviews and ratings help other users make choices. We used the [Yelp APIs (application programming interfaces)](https://www.yelp.ca/developers) to collect over 35,000 reviews of 1,000 restaurants in New York City. We split this dataset into an 80\% TRAIN set (28,000 reviews), a 10\% DEV set (3,500 reviews), and a 10\% TEST set (3,500 reviews). Each review has text content and a corresponding label (i.e., a 5-level star rating). The table below shows the class distribution of the TRAIN and DEV sets.

| Rating | # of reviews (Train) | # of reviews (Dev) |
|--------|----------------------|--------------------|
| 1star  | 5,619                | 683                |
| 2star  | 5,616                | 677                |
| 3star  | 5,583                | 713                |
| 4star  | 5,532                | 733                |
| 5star  | 5,650                | 694                |


In the directory `./data/yelp_review/`, we provide the `TRAIN` and `DEV` sets with their labels for your system development. 
Please use the TRAIN and DEV sets to develop a classification system for this task. You can use any model from this course (e.g., linear regression, feed-forward neural network, RNN, GRU, LSTM). We also provide `TEST`, which contains only the review text, for final evaluation. You will use your best trained model to predict the labels of the **`TEST` reviews** and submit your predictions. 

**The performance of your submitted system will be evaluated on its predicted rating labels for the reviews in the TEST set. The macro-averaged F1 score will be the evaluation metric.** 
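As a rough illustration of the metric, the sketch below computes a macro-averaged F1 score in plain Python: the per-class F1 scores are averaged with equal weight. It is not the provided `Scorer.py`; the function name `macro_f1`, the toy label lists, and the choice to score a class with no true or predicted samples as 0 are assumptions made for this example.

```python
from collections import Counter

LABELS = ["1star", "2star", "3star", "4star", "5star"]


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores (hypothetical helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was missed
    f1_scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); assumed to be 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


# Toy data, made up for illustration only.
gold = ["1star", "2star", "3star", "4star", "5star", "5star"]
pred = ["1star", "2star", "3star", "4star", "5star", "1star"]
score = macro_f1(gold, pred)
print("macro F1 = {:.4f}".format(score))  # macro F1 = 0.8667
```

Because every class counts equally, a model that ignores one rating level is penalized even if its overall accuracy is high.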

Your mark for this exercise = $40 \times \left(1 - (F1_{max} - F1_i)\right)$, where $F1_{max}$ is the highest test F1 score achieved across the class and $F1_i$ is the test F1 score you achieved.
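A small worked instance of the marking formula follows. It assumes that F1 scores are expressed as fractions between 0 and 1 (not percentages); the function name `exercise_mark` and the two F1 values are hypothetical.

```python
def exercise_mark(f1_max, f1_i, full_marks=40):
    """Mark = full_marks * (1 - (F1_max - F1_i)), with F1 assumed to lie in [0, 1]."""
    return full_marks * (1 - (f1_max - f1_i))


# Hypothetical scores: the best system in the class reaches 0.65, yours reaches 0.60.
mark = exercise_mark(f1_max=0.65, f1_i=0.60)
print(round(mark, 2))  # 38.0
```

Under this reading, each point of F1 (0.01) below the best system in the class costs 0.4 marks.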


**Development Instruction**

**1. Data and preprocessing: Use `torchtext` to load and pre-process the dataset. Prepare the batch iterators.**

Hints: You should select a tokenizer for your system (e.g., SpaCy English model, whitespace tokenizer, NLTK word tokenizer). 
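To make the preprocessing step concrete, here is a minimal plain-Python sketch of whitespace tokenization, vocabulary building, and mapping tokens to integer ids. It does not use `torchtext`; the helper names, the `<unk>` and `<pad>` tokens, and the toy reviews are assumptions for illustration only.

```python
from collections import Counter


def whitespace_tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_vocab(texts, max_size=5000):
    """Map the max_size most frequent tokens to ids; ids 0 and 1 are reserved."""
    counts = Counter(tok for text in texts for tok in whitespace_tokenize(text))
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize(text, vocab):
    """Convert a text into a list of token ids, using <unk> for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in whitespace_tokenize(text)]


# Toy reviews, made up for illustration.
reviews = ["Great food and great service", "The food was cold"]
vocab = build_vocab(reviews)
ids = numericalize("great pizza", vocab)
print(ids)  # [2, 0]
```

The vocabulary size chosen here determines the number of rows in the embedding layer, so it is one of the hyper-parameters worth tuning.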

**2. Model selection and hyper-parameter tuning. You need to select the architecture you want to use. You may need to search for the best hyper-parameter set to improve your model performance.**

Hints: You can use the DEV set to estimate your model performance. 

There are many possible strategies you could take to improve performance:   

a. Changing the vocabulary size or batch size. Using TF-IDF features, or using a pre-trained word embedding model to initialize your embedding weights (e.g., [google news word2vec](https://code.google.com/archive/p/word2vec/), [GloVe](https://nlp.stanford.edu/projects/glove/), [ELMo](https://allennlp.org/elmo)).

Hint: In our tutorial, we use the embedding layer with randomly initialized weights. If you initialize your embedding layer with word embedding weights from a pre-trained word embedding model from the ones listed above, you may get improvements.

b. Changing model, such as linear regression, feed-forward neural network, RNN, GRUs, LSTM.

c. Changing neural network structure, such as changing hidden dimension size, number of layers, dropout rate, activation function.

d. Changing the training procedure, such as the number of epochs or the learning rate, or adding regularization and momentum (or RMSProp, or Adam).

e. You may find some novel ideas in the state-of-the-art NLP systems [here](http://nlpprogress.com/english/sentiment_analysis.html).
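For strategy (a), one possible way to initialize an embedding layer from pre-trained vectors is `nn.Embedding.from_pretrained`. The sketch below uses a random tensor as a stand-in for real word2vec or GloVe vectors, and the vocabulary size and embedding dimension are arbitrary values chosen for the example.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 6, 4  # toy sizes, assumed for illustration

# Stand-in for real pre-trained vectors: row i is the vector of token id i.
pretrained = torch.randn(vocab_size, emb_dim)

# freeze=False keeps the weights trainable so they can be fine-tuned on the task.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

token_ids = torch.tensor([[1, 2, 5], [0, 3, 4]])  # (batch size, sequence length)
out = embedding(token_ids)
print(out.shape)  # torch.Size([2, 3, 4])
```

In practice, the rows of the weight matrix have to follow the order of your own vocabulary, and words missing from the pre-trained model need some fallback such as random initialization.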

Hint: Because these experiments are computationally demanding, we suggest that you run them on [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb#recent=true). Please read `Colab instructions` for more information about Colab.

**3. Once you have your best model on the DEV set, use it to predict the labels of the `TEST set` and submit your predictions.**

**4. For prediction submission, please read `submission instruction`.**

**Colab Instruction**

[Google Colab](https://colab.research.google.com/notebooks/intro.ipynb#recent=true) will allow you to train your model on a GPU. 

You can follow these steps to use Colab:

1. We provide a new notebook (`lab4_colab.ipynb`) for your experiments on Colab. You should develop your system in `lab4_colab.ipynb` instead of the current Jupyter notebook. 
2. Go to [Google Colab](https://colab.research.google.com).
3. Create an account or log in to your account.
4. Select "UPLOAD" and upload `lab4_colab.ipynb`. Again, please do not upload the current notebook (Lab4.ipynb).
5. Set the hardware: 
**Go to the navigation bar, click Runtime --> Change runtime type --> Hardware accelerator --> Select GPU.**

6. You do not need to install any packages. Google has prepared everything for you.
7. You can find all your generated files in `Files`. You can download your notebook and files.

Suggestions: 
1. You can download the notebook from Colab and overwrite your local version of **`lab4_colab.ipynb`**. 
2. If you train your model on a GPU, make sure your model, inputs, and loss are sent to the GPU using `XXX.to(device)`, where `device` is `cuda`. 
3. To move GPU variables back to the CPU, use `XXX.cpu()`. You can find more related information [here](https://pytorch.org/docs/stable/notes/cuda.html). 
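The device handling described in suggestions 2 and 3 might look like the sketch below. The tiny linear model, the tensor shapes, and the random data are placeholders, and the code falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(50, 5).to(device)            # placeholder model, 5 rating classes
criterion = nn.CrossEntropyLoss().to(device)   # loss function

inputs = torch.randn(32, 50).to(device)        # (batch size, input dimension)
targets = torch.randint(0, 5, (32,)).to(device)

logits = model(inputs)
loss = criterion(logits, targets)

# Bring the predictions back to the CPU before converting them to Python values.
predictions = logits.argmax(dim=1).cpu().tolist()
print(len(predictions))  # 32
```

Every batch produced by the iterator has to be moved to the same device as the model, otherwise PyTorch raises a device mismatch error.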

``Warning``: Running on CPU will be slow. 

**Submission Instruction**

1. In the directory `./data/yelp_review/`, `EXAMPLE_GOLD.txt` and `EXAMPLE_PRED.txt` are examples of gold and prediction files that can be used with the ``Scorer.py`` provided (description below). Your submission should have exactly the same structure as **`EXAMPLE_PRED.txt`** (i.e., each line contains one prediction label, with no column header). This is important.

2. `./data/yelp_review/Scorer.py`

The scoring script (`Scorer.py`) is provided at the root directory of the released data. `Scorer.py` is a Python script that takes two text files containing the true labels and the predicted labels and outputs accuracy, F1 score, precision, and recall. (Note that the evaluation metric is the F1 score.) The scoring script is used to evaluate your TEST predictions.

Please make sure to have the `sklearn` library installed.
```
Usage of the scorer:

    python3 Scorer.py  <gold-file> <pred-file>

In the dataset directory, there are example
gold and prediction files. If they are used with the scorer,
they should produce the following results:

python3 Scorer.py EXAMPLE_GOLD.txt EXAMPLE_PRED.txt

OVERALL SCORES:
MACRO AVERAGE PRECISION SCORE: 20.97 %
MACRO AVERAGE RECALL SCORE: 20.97 %
MACRO AVERAGE F1 SCORE: 20.97 %
OVERALL ACCURACY: 20.97 %
```

**Requirements:**
1. Your submission must have the **same** structure as `EXAMPLE_PRED.txt`. 


2. The prediction label must be in the **original label format** (`i.e., '1star', '2star', '3star', '4star', or '5star'`).

Hint: You may try to generate a synthetic file to test your code. 


3. Put your prediction txt file in this lab directory. The prediction txt file should be named with `<yourfirstname>_<yourlastname>_PRED.txt`. We will use ``Scorer.py`` to evaluate your submission.

Hint: We provide a function `out_prediction` to help you generate the submission file.


In [2]:
def out_prediction(first_name, last_name, prediction_list):
    """
    Write the TEST predictions to <first_name>_<last_name>_PRED.txt in the current directory.

    out_prediction takes three input variables: first_name, last_name, prediction_list
    <first_name>, string, your first name, e.g., Tom
    <last_name>, string, your last name, e.g., Smith
    <prediction_list>, list of strings which includes all your predictions of TEST samples
                        e.g., ['1star','5star','3star']
    """
    # Write one prediction label per line, with no header.
    with open("{}_{}_PRED.txt".format(first_name, last_name), 'w') as output_file:
        for item in prediction_list:
            output_file.write(item + "\n")

An example of using the `out_prediction` function. You can find the file `Tom_Smith_PRED.txt` in your directory.

In [8]:
out_prediction("Tom", "Smith", ['1star','5star','3star'])
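Before submitting, it may help to check that a prediction file satisfies the format requirements above. The following is a minimal sketch of such a check, run on a synthetic file; `check_prediction_file` and its `expected_count` parameter are hypothetical helpers, not part of the provided `Scorer.py`.

```python
import os
import tempfile

VALID_LABELS = {"1star", "2star", "3star", "4star", "5star"}


def check_prediction_file(path, expected_count=None):
    """Return the number of predictions, or raise ValueError on a format problem."""
    with open(path) as f:
        lines = f.read().splitlines()
    for line_no, line in enumerate(lines, start=1):
        # Each line must be exactly one label in the original format.
        if line not in VALID_LABELS:
            raise ValueError("line {}: unexpected label {!r}".format(line_no, line))
    if expected_count is not None and len(lines) != expected_count:
        raise ValueError("expected {} lines, found {}".format(expected_count, len(lines)))
    return len(lines)


# Synthetic file written to a temporary directory, for testing only.
tmp_dir = tempfile.mkdtemp()
synthetic_path = os.path.join(tmp_dir, "Tom_Smith_PRED.txt")
with open(synthetic_path, "w") as f:
    f.write("\n".join(["1star", "5star", "3star"]) + "\n")

n_predictions = check_prediction_file(synthetic_path, expected_count=3)
print(n_predictions)  # 3
```

For the real submission, `expected_count` would be the number of reviews in the TEST set.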

### 1.2 Please clearly describe the system you submitted in 1.1 (i.e., your best model) within 300 words.
rubric={mechanics:5,reasoning:5}

Hints: 
1. Describe all the hyper-parameters of your submitted system. You may follow the structure of Development Instruction. 
2. List the strategies you attempted. Which strategies worked? Which did not?

**Write your answer here.**

### 1.3 Please organize your code in `lab4_colab.ipynb` and only keep the code that you used to train your submitted system in 1.1. Submit `lab4_colab.ipynb`.
rubric={mechanics:5}

# Exercise 2 Short Answer Questions

### 2.1 What are the differences between a parameter and a hyper-parameter?
rubric={reasoning:1}

Hint: Your answer should be a maximum of 2-3 sentences. Short answers are just fine.

**Write your answer here.**

### 2.2 Compute the gradient $\nabla f(x)$ of each of the following functions. You don't need to show your work. (OPTIONAL QUESTION)

rubric={accuracy:4}

The notation for the gradient of a function $f$ is $\nabla f(x)$. 

The input $x$ of function $f$ may be a vector, so the gradient of this function is also a vector.

$x\in \mathbb{R}^n$ means that the input $x$ is an $n$-dimensional vector (i.e., $x$ has $n$ elements).

$\exp(x)$ is the exponential function, i.e., $e^x$.
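As an illustration of this notation (the function $g$ below is made up for this example and is not one of the questions), take $g(x) = x_1^2 x_2 + 3x_2$ with $x \in \mathbb{R}^2$. The gradient collects one partial derivative per element of $x$:

$$
\nabla g(x) =
\begin{bmatrix}
\dfrac{\partial g}{\partial x_1} \\
\dfrac{\partial g}{\partial x_2}
\end{bmatrix}
=
\begin{bmatrix}
2 x_1 x_2 \\
x_1^2 + 3
\end{bmatrix}
$$

The gradient has the same number of elements as the input, here two.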

1. $f_1(x) = \tanh(x_1x_2)$ where $x\in \mathbb{R}^2$;
2. $f_2(x) = Sigmoid(x_1+x_2)$ where $x\in \mathbb{R}^2$;
3. $f_3(x) = \exp(x_1+x_2x_3)$ where $x\in \mathbb{R}^3$;
4. $f_4(x) = \exp(x_1 + x_2^3)$ where $x \in \mathbb{R}^4$

Hint: 
You can find some examples of multivariable derivatives in the file `gradient_examples.pdf`.


**Write your answer here.**

### 2.3 What is “gradient descent”?
rubric={reasoning:2}

**Write your answer here.**

### 2.4 What is the difference between `gradient descent` and `SGD`?
rubric={reasoning:2}

**Write your answer here.**

### 2.4 What is the “vanishing gradient” problem with neural networks based on `Sigmoid` non-linearities?
rubric={reasoning:2}

**Write your answer here.**

### 2.5 Why do we need regularization? What are the differences between `L1-regularization` and `L2-regularization`?
rubric={reasoning:3}

Hint: A short answer is just fine.

**Write your answer here.**

### 2.6 How many outputs does the `LSTM` layer of `torch.nn` return? What does each of them represent? 
rubric={reasoning:3}

**Write your answer here.**

### 2.7 Please describe the size of each of the following tensors and the meaning of each dimension. 
rubric={accuracy:6,reasoning:6}

For example, 

**Question 0.**

If we pass a tensor with size of [32, 22] (batch size, sequence length) to an Embedding layer which is defined as (Embedding layer): Embedding(5002, 300), what size is the output? 

**Answer:**

The size of the output is [32, 22, 300] (batch size, sequence length, embedding dimension).
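The answer to Question 0 can be checked empirically with a few lines of PyTorch. In the sketch below, the random token ids are placeholders; only the layer definition and the input size come from the question.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(5002, 300)        # (vocabulary size, embedding dimension)
batch = torch.randint(0, 5002, (32, 22))   # (batch size, sequence length) of token ids
output = embedding(batch)
print(output.shape)  # torch.Size([32, 22, 300])
```

The same approach of building the layer, passing a dummy tensor, and printing the shape can be used to verify your answers to the questions below.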

1. If we pass a tensor with size of `[32, 64] (batch size, sequence length)` to an Embedding layer which is defined as `(Embedding layer): Embedding(10004, 500)`, what size is the output?


2. If we pass a tensor with size of `[32, 50] (batch size, input dimension)` to a Linear layer which is defined as `(Linear layer): Linear(in_features=50, out_features=128, bias=True)`, what size is the output?


3. If we pass a tensor with size of `[32, 64] (batch size, hidden dimension)` to a Softmax layer which is defined as `(softmax layer): LogSoftmax()`, what size is the output?


4. If we pass a tensor with size of `[32, 64, 300] (sequence length, batch size, embedding dimension)` to an LSTM layer which is defined as `(LSTM layer): LSTM(300, 500, num_layers=2)`, what size is the first output returned?


5. If we pass a tensor with size of `[32, 64, 300] (sequence length, batch size, embedding dimension)` to an LSTM layer which is defined as `(LSTM layer): LSTM(300, 500, num_layers=2, bidirectional=True)`, what size is the second output returned?


6. If we pass a tensor with size of `[32, 64] (batch size, hidden dimension)` to a dropout layer which is defined as `(Dropout layer): Dropout(p=0.5, inplace=False)`, what size is the output?

**Write your answer here...**