# Advanced Machine Learning (2022) Final Report Assignment

Answer Questions 1 to 4 (in either Japanese or English). Submit the report in either PDF (.pdf) or Jupyter Notebook (.ipynb) format.

## Question 1 (50 points)

Consider a convolutional neural network (CNN) that predicts a label $\hat{y} \in \{0, 1\}$ for a given sentence $\boldsymbol{X} \in \mathbb{R}^{d \times T}$. Here, a sentence is represented by a matrix $\boldsymbol{X} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_T)$ consisting of a concatenation of $T$ word embeddings, $\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_T \in \mathbb{R}^d$, where $d$ is the size of word embeddings, and $T$ is the number of words in the sentence.

The following equations define the entire architecture of the CNN.

\begin{align}
\hat{y} &= \begin{cases}
1 & (0.5 < p) \\
0 & (p \leq 0.5)
\end{cases} \\
p &= \sigma(\boldsymbol{v}^\top \boldsymbol{s}) \\
\boldsymbol{s} &= \max(\boldsymbol{c}_1, \dots, \boldsymbol{c}_{T-\delta+1}) \\
\boldsymbol{c}_t &= {\rm ReLU}(\boldsymbol{W} \boldsymbol{x}_{t:t+\delta-1} + \boldsymbol{b}) & (\forall t \in \{1, \dots, T-\delta+1\}) \\
\boldsymbol{x}_{t:t+\delta-1} &= \boldsymbol{x}_{t} \oplus \boldsymbol{x}_{t+1} \oplus \dots \oplus \boldsymbol{x}_{t+\delta-1}
\end{align}

Here:

+ $\boldsymbol{W} \in \mathbb{R}^{m \times \delta d}$, $\boldsymbol{b} \in \mathbb{R}^m, \boldsymbol{v} \in \mathbb{R}^m$ are the model parameters;
+ $m$ denotes the number of output channels of the CNN;
+ $\delta$ denotes the width (kernel size) of the convolution;
+ $\sigma(\cdot)$ denotes the standard sigmoid function;
+ $\max(\cdot)$ denotes the max pooling operation;
+ ${\rm ReLU}(\cdot)$ denotes the ReLU activation function;
+ $\oplus$ denotes the concatenation of vectors.

Setting the hyperparameters $d=3, m=2, \delta=2$, we initialize the model parameters as follows.

\begin{align}
\boldsymbol{W} &= \begin{pmatrix}
-3 & -2 & -1 & -1 & -2 & -3 \\
3 & 2 & 3 & 2 & 3 & 2
\end{pmatrix} \\
\boldsymbol{b} &= \begin{pmatrix}
-0.2 \\ 0.1
\end{pmatrix} \\
\boldsymbol{v} &= \begin{pmatrix}
-1 \\ 2
\end{pmatrix}
\end{align}

Suppose that we give a negative ($y=0$) training instance with the sentence ($T = 5$),

\begin{align}
\boldsymbol{X} &= \begin{pmatrix}
-0.3 & 0 & 0.1 & 0 & 0 \\
-0.2 & -0.1 & 0 & 0.1 & 0 \\
-0.1 & -0.2 & 0.1 & 0 & 0.1
\end{pmatrix} ,
\end{align}
to the CNN model. Answer the following questions.

**(1)** Find the value of the vector $\boldsymbol{x}_{3:4}$.

In [65]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [66]:
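# Set up the hyperparameters, model parameters, and input sentence
d, m, delta = 3, 2, 2  # embedding size, number of output channels, kernel width
W = np.array([[-3.0, -2.0, -1.0, -1.0, -2.0, -3.0], [3.0, 2.0, 3.0, 2.0, 3.0, 2.0]])
b = np.array([[-0.2], [0.1]])
v = np.array([[-1.0], [2.0]])
X = np.array([[-0.3, 0, 0.1, 0, 0], [-0.2, -0.1, 0, 0.1, 0], [-0.1, -0.2, 0.1, 0, 0.1]])
W, b, v, X, d, m, delta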

(array([[-3., -2., -1., -1., -2., -3.],
        [ 3.,  2.,  3.,  2.,  3.,  2.]]),
 array([[-0.2],
        [ 0.1]]),
 array([[-1.],
        [ 2.]]),
 array([[-0.3,  0. ,  0.1,  0. ,  0. ],
        [-0.2, -0.1,  0. ,  0.1,  0. ],
        [-0.1, -0.2,  0.1,  0. ,  0.1]]),
 3,
 2,
 2)

In [67]:
# x_{3:4} = x_3 ⊕ x_4 (columns 2 and 3 with 0-based indexing)
np.concatenate([X[:, 2], X[:, 3]])

array([0.1, 0. , 0.1, 0. , 0.1, 0. ])

**(2)** Find the values of the hidden vectors $\boldsymbol{c}_1, \boldsymbol{c}_2, \boldsymbol{c}_3, \boldsymbol{c}_4$.

In [68]:
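# Compute the hidden vector c_t; t is 0-based here, so c(0) corresponds to c_1
def c(t):
    # Concatenate x_t and x_{t+1} (delta = 2), apply the affine map, then ReLU
    val = (W @ np.concatenate([X[:, t], X[:, t+1]])).reshape(2, 1) + b
    return np.maximum(0, val)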

c(0), c(1), c(2), c(3)

(array([[2.],
        [0.]]),
 array([[0.],
        [0.]]),
 array([[0.],
        [1.]]),
 array([[0. ],
        [0.5]]))

**(3)** Find the value of the vector $\boldsymbol{s}$.


In [111]:
# Max pooling: element-wise maximum over c_1, ..., c_4
s = np.max(np.hstack([c(0), c(1), c(2), c(3)]), axis=1)
s.reshape(2, 1)

array([[2.],
       [1.]])

**(4)** Find the value of $p$.

In [112]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

p = sigmoid(v.T @ s)[0]
p

0.5000000000000002
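
Steps (1) to (4) can also be collected into a single forward pass. The following is a minimal sketch rather than part of the original solution: the function name `forward` and its return layout are assumptions chosen for illustration, and the parameters are re-declared so that the block runs on its own.

```python
import numpy as np

def forward(X, W, b, v, delta):
    """Return (C, s, p): hidden vectors as columns of C, pooled vector s, probability p."""
    d, T = X.shape
    # Each window x_{t:t+delta-1} stacks delta consecutive columns of X into one vector.
    windows = np.stack([X[:, t:t + delta].T.reshape(-1) for t in range(T - delta + 1)])
    C = np.maximum(0.0, W @ windows.T + b)        # ReLU(W x + b), shape (m, T - delta + 1)
    s = C.max(axis=1, keepdims=True)              # max pooling over positions, shape (m, 1)
    p = 1.0 / (1.0 + np.exp(-(v.T @ s).item()))   # sigmoid of the scalar v^T s
    return C, s, p

W = np.array([[-3.0, -2.0, -1.0, -1.0, -2.0, -3.0], [3.0, 2.0, 3.0, 2.0, 3.0, 2.0]])
b = np.array([[-0.2], [0.1]])
v = np.array([[-1.0], [2.0]])
X = np.array([[-0.3, 0, 0.1, 0, 0], [-0.2, -0.1, 0, 0.1, 0], [-0.1, -0.2, 0.1, 0, 0.1]])

C, s, p = forward(X, W, b, v, delta=2)
print(C)            # columns are c_1, ..., c_4
print(s.ravel(), p)
```

Building the windows with a list comprehension keeps the code close to the definition of $\boldsymbol{x}_{t:t+\delta-1}$; for long sentences a strided view would avoid the Python loop.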

**(5)** Write the formula of the binary cross-entropy loss between the correct label $y$ and the probability estimate $p$.

In [113]:
# Note: handling of the edge cases p = 0 and p = 1 is omitted
def BCE(y, p):
    return -(y * np.log(p) + (1-y) * np.log(1-p))
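
Written as a formula, the function above is the standard binary cross-entropy. For a negative instance ($y = 0$) only the second term remains.

\begin{align}
\mathrm{BCE}(y, p) &= -\left\{ y \log p + (1 - y) \log (1 - p) \right\} \\
\mathrm{BCE}(0, p) &= -\log (1 - p)
\end{align}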

**(6)** Compute the loss value by using the formula of (5) for the training instance.

In [114]:
loss = BCE(0, p)
loss

0.6931471805599457
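
The definition in (5) leaves out the edge cases $p = 0$ and $p = 1$, where the logarithm diverges. One common way to handle them is to clip $p$ away from the boundaries before taking the logarithm. The sketch below assumes a helper named `bce_clipped` and a tolerance `eps`, neither of which is part of the original solution.

```python
import numpy as np

def bce_clipped(y, p, eps=1e-12):
    """Binary cross-entropy with p clipped to [eps, 1 - eps] to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(bce_clipped(0, 0.5))  # agrees with the unclipped formula away from the boundaries
print(bce_clipped(0, 1.0))  # large but finite instead of inf
```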

**(7)** Compute the gradient of the loss function with respect to $\boldsymbol{v}$ for the training instance.

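For a negative example ($y = 0$), the chain rule gives the gradient $\displaystyle\frac{\partial BCE(p)}{\partial \boldsymbol{v}}=\frac{1}{1-p}\cdot p(1-p)\cdot \boldsymbol{s} = p\,\boldsymbol{s}$, so it can be computed as follows.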

In [115]:
grad_v = 1/(1-p) * p * (1-p) * s
grad_v.reshape(2, 1)

array([[1. ],
       [0.5]])
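
The chain-rule steps behind this result can be written out explicitly. Here $a = \boldsymbol{v}^\top \boldsymbol{s}$ is a shorthand introduced only for this derivation (it does not appear in the original), and $\boldsymbol{s}$ does not depend on $\boldsymbol{v}$.

\begin{align}
\frac{\partial\, \mathrm{BCE}}{\partial p} &= \frac{1}{1-p} & (y = 0) \\
\frac{\partial p}{\partial a} &= \sigma(a)\,(1-\sigma(a)) = p(1-p) \\
\frac{\partial a}{\partial \boldsymbol{v}} &= \boldsymbol{s} \\
\frac{\partial\, \mathrm{BCE}}{\partial \boldsymbol{v}} &= \frac{1}{1-p} \cdot p(1-p) \cdot \boldsymbol{s} = p\,\boldsymbol{s} = 0.5 \begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0.5 \end{pmatrix}
\end{align}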

**(8)** Compute the gradients of the loss function with respect to $\boldsymbol{W}$ for the training instance.

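As in (7), the gradient is computed by backpropagation. Max pooling passes the gradient only to the window that attains the maximum in each channel ($\boldsymbol{c}_1$ for channel 1 and $\boldsymbol{c}_3$ for channel 2), and ReLU is active at both maxima, so row $i$ of the gradient is $p\,v_i$ times the corresponding window: $\boldsymbol{x}_{1:2}$ for the first row and $\boldsymbol{x}_{3:4}$ for the second.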

In [129]:
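# Windows selected by max pooling: x_{1:2} for channel 1, x_{3:4} for channel 2
x_max = np.array([np.concatenate([X[:, 0], X[:, 1]]), np.concatenate([X[:, 2], X[:, 3]])])
print(x_max)
grad_W = 1/(1-p) * p * (1-p) * v * x_max  # row i = p * v_i * selected window
grad_W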

[[-0.3 -0.2 -0.1  0.  -0.1 -0.2]
 [ 0.1  0.   0.1  0.   0.1  0. ]]


array([[ 0.15,  0.1 ,  0.05, -0.  ,  0.05,  0.1 ],
       [ 0.1 ,  0.  ,  0.1 ,  0.  ,  0.1 ,  0.  ]])
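
Analytic gradients such as those in (7) and (8) can be sanity-checked numerically with central finite differences. The block below is a sketch under stated assumptions: the helper names `loss_fn` and `numerical_grad` and the step size `h` are hypothetical, and the parameters are re-declared so that the block runs on its own.

```python
import numpy as np

W = np.array([[-3.0, -2.0, -1.0, -1.0, -2.0, -3.0], [3.0, 2.0, 3.0, 2.0, 3.0, 2.0]])
b = np.array([[-0.2], [0.1]])
v = np.array([[-1.0], [2.0]])
X = np.array([[-0.3, 0, 0.1, 0, 0], [-0.2, -0.1, 0, 0.1, 0], [-0.1, -0.2, 0.1, 0, 0.1]])

def loss_fn(W, v, y=0, delta=2):
    """Forward pass of the CNN followed by the binary cross-entropy loss."""
    T = X.shape[1]
    windows = np.stack([X[:, t:t + delta].T.reshape(-1) for t in range(T - delta + 1)])
    s = np.maximum(0.0, W @ windows.T + b).max(axis=1, keepdims=True)
    p = 1.0 / (1.0 + np.exp(-(v.T @ s).item()))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def numerical_grad(f, theta, h=1e-6):
    """Central-difference estimate of df/dtheta, one entry at a time."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        plus, minus = theta.copy(), theta.copy()
        plus[idx] += h
        minus[idx] -= h
        grad[idx] = (f(plus) - f(minus)) / (2 * h)
    return grad

num_grad_v = numerical_grad(lambda v_: loss_fn(W, v_), v)
num_grad_W = numerical_grad(lambda W_: loss_fn(W_, v), W)
print(np.round(num_grad_v, 4))
print(np.round(num_grad_W, 4))
```

The check is meaningful here because the maximum in each channel is attained at a single position with a clear margin, so a perturbation of size `h` does not change which window max pooling selects.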

## Question 2 (20 points)

Name two datasets that can be used to evaluate the quality of word embeddings, and describe each dataset from the following perspectives.

+ Brief explanation of the task for the evaluation.
+ Statistics of the dataset (e.g., the number of instances)
+ Measure(s) for evaluating the quality

## Question 3 (20 points)

Explain two reasons why Transformers are superior to Recurrent Neural Networks (RNNs) in sequence-to-sequence tasks such as machine translation.

## Question 4 (10 points)

Implement the code for using a pre-trained **language** model. Show the code and its output as well as the following information:

+ The details of the pre-trained language model, for example,
    + https://huggingface.co/EleutherAI/gpt-j-6B
    + https://huggingface.co/rinna/japanese-gpt-1b
    + https://huggingface.co/facebook/blenderbot-400M-distill
+ The task addressed by the model (e.g., "text generation", "summarization", "chatbot")
