# Basic Attention Operation

As you've learned, attention allows a seq2seq decoder to use information from each encoder step instead of just the final encoder hidden state. In the attention operation, the encoder outputs are weighted based on the decoder hidden state, then combined into one context vector. This vector is then used as input to the decoder to predict the next output step.

In this ungraded lab, you'll implement a basic attention operation as described in [Bahdanau et al. (2014)](https://arxiv.org/abs/1409.0473) using NumPy. I'll describe each of the steps you will be coding.

### Function 1: `softmax`

In [4]:
# Run this first, a bit of setup for the rest of the lab
import numpy as np

def softmax(x, axis=0):
    """ Calculate softmax function for an array x along specified axis
    
        axis=0 calculates softmax across rows which means each column sums to 1 
        axis=1 calculates softmax across columns which means each row sums to 1
    """
    # subtract the max for numerical stability
    x = x - np.expand_dims(np.max(x, axis=axis), axis)
    # calculate the softmax
    return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)
    
    

## Explanation

The code above is a Python implementation of the **Softmax function**, which is widely used in machine learning, particularly in classification problems.

### Overview of Softmax Function

The **Softmax function** converts a vector of raw scores (or "logits") into probabilities. Given an input vector `[z_1, z_2, ..., z_n]`, the Softmax function, $\sigma$, outputs a vector `[p_1, p_2, ..., p_n]` where each `p_i` is defined as:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

### Key Characteristics of the Function

#### 1. **Numerical Stability**
   
In the implementation, the maximum value along the specified axis is subtracted from each element before applying the exponential function. The result is unchanged, because the common factor $e^{-\max}$ cancels between the numerator and the denominator, but every exponent is now at most zero, so `np.exp` cannot overflow.

```python
x = x - np.expand_dims(np.max(x, axis=axis), axis)
```
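
To see why the shift matters, here is a minimal sketch that compares a direct translation of the formula with the shifted version. The helper names `naive_softmax` and `stable_softmax` are made up for this illustration and are not part of the lab.

```python
import numpy as np

def naive_softmax(x):
    # Direct translation of the formula, with no max subtraction (hypothetical helper)
    e = np.exp(x)
    return e / np.sum(e)

def stable_softmax(x):
    # Same formula after shifting by the maximum, as in the lab's softmax
    shifted = x - np.max(x)
    e = np.exp(shifted)
    return e / np.sum(e)

small = np.array([1.0, 2.0, 3.0])
large = small + 1000.0  # same differences between entries, much larger magnitude

# On moderate inputs both versions agree
print(np.allclose(naive_softmax(small), stable_softmax(small)))

# On large inputs exp overflows to inf, and inf / inf yields nan
with np.errstate(over="ignore", invalid="ignore"):
    print(naive_softmax(large))

# The shifted version is unaffected, because only differences between entries matter
print(stable_softmax(large))
```

The last two arrays show the practical difference: the naive version returns `nan` for every entry, while the shifted version returns the same probabilities as for `small`.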

#### 2. **Flexibility with Axis**

The function can compute the Softmax along either axis of a 2D array, depending on the application.

- When `axis=0`, normalization runs down the rows within each column, so the values in each column are transformed into probabilities and sum to 1.
- When `axis=1`, normalization runs along the columns within each row, so the values in each row are converted into probabilities and sum to 1.

#### 3. **Calculation of Probabilities**

The Softmax is calculated by:
   
- Taking the exponential of each element in the input array, $e^{x_i}$.
- Dividing each element by the sum of the exponentials along the specified axis.

```python
return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)
```

### Usage Example

Suppose we have a 2x3 matrix representing raw score outputs from a model, and we want to convert them into probabilities:

```python
z = np.array([[1.0, 2.0, 3.0], 
              [1.0, 2.0, 3.0]])
```

Calculating the Softmax across rows:

```python
softmax_probs = softmax(z, axis=1)
```

`softmax_probs` will be a matrix of the same shape, with each row containing probabilities derived from the respective original raw scores.
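
As a quick check of this example, the sketch below repeats the lab's `softmax` definition so that it runs on its own, then prints the result for both axes. The variable names `row_probs` and `col_probs` are chosen for this illustration only.

```python
import numpy as np

def softmax(x, axis=0):
    # Same definition as in the lab, repeated so this snippet runs on its own
    x = x - np.expand_dims(np.max(x, axis=axis), axis)
    return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis=axis), axis)

z = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 3.0]])

row_probs = softmax(z, axis=1)
print(row_probs.round(4))      # each row is approximately [0.09, 0.2447, 0.6652]
print(row_probs.sum(axis=1))   # each row sums to 1

col_probs = softmax(z, axis=0)
print(col_probs)               # the two rows are identical, so every entry is 0.5
print(col_probs.sum(axis=0))   # each column sums to 1
```

Because both rows of `z` are the same, normalizing over `axis=0` compares two equal scores in every column, which is why each entry becomes 0.5.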

## 1: Calculating alignment scores

The first step is to calculate the alignment scores. This is a measure of similarity between the decoder hidden state and each encoder hidden state. From the paper, this operation looks like

$$
\large e_{ij} = v_a^\top \tanh{\left(W_a s_{i-1} + U_a h_j\right)}
$$

where $W_a \in \mathbb{R}^{n\times m}$, $U_a \in \mathbb{R}^{n \times m}$, and $v_a \in \mathbb{R}^m$
are the weight matrices and **$n$ is the hidden state size**. In practice, this is implemented as a feedforward neural network with two layers, where **$m$ is the size of the layers in the alignment network**. It looks something like:

![alignment model](./images/alignment_model.png)

Here $h_j$ are the encoder hidden states for each input step $j$ and $s_{i - 1}$ is the decoder hidden state of the previous step. The first layer corresponds to $W_a$ and $U_a$, while the second layer corresponds to $v_a$.

To implement this, first concatenate the encoder and decoder hidden states to produce an array with size $K \times 2n$ where $K$ is the number of encoder states/steps. For this, use `np.concatenate` ([docs](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html)). Note that there is only one decoder state so you'll need to reshape it to successfully concatenate the arrays. The easiest way is to use `decoder_state.repeat` ([docs](https://numpy.org/doc/stable/reference/generated/numpy.repeat.html#numpy.repeat)) to match the hidden state array size.
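
The repeat-then-concatenate step can be sketched as follows. The random inputs and the seed are placeholders for illustration, not the lab's test values, and `np.broadcast_to` is included only as a third option for comparison.

```python
import numpy as np

hidden_size, input_length = 16, 5
rng = np.random.default_rng(0)  # arbitrary seed, unrelated to the lab's values
encoder_states = rng.standard_normal((input_length, hidden_size))
decoder_state = rng.standard_normal((1, hidden_size))

# Three ways to copy the single decoder state K times along axis 0
by_repeat = decoder_state.repeat(input_length, axis=0)
by_tile = np.tile(decoder_state, (input_length, 1))
by_broadcast = np.broadcast_to(decoder_state, (input_length, hidden_size))

print(np.array_equal(by_repeat, by_tile))
print(np.array_equal(by_repeat, by_broadcast))

# Encoder states first, repeated decoder state second
inputs = np.concatenate([encoder_states, by_repeat], axis=1)
print(inputs.shape)  # (5, 32), i.e. K x 2n
```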

Then, apply the first layer as a matrix multiplication between the weights and the concatenated input. Use the tanh function to get the activations. Finally, compute the matrix multiplication of the second layer weights and the activations. This returns the alignment scores.
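
It may help to see how a single first-layer matrix relates to the two matrices $W_a$ and $U_a$ in the equation. The sketch below assumes the encoder states come first in the concatenation and uses the row-vector convention (inputs on the left of the weights). Under those assumptions, the top half of the first-layer matrix plays the role of $U_a$ and the bottom half plays the role of $W_a$. The random weights are placeholders.

```python
import numpy as np

hidden_size, attention_size, input_length = 16, 10, 5
rng = np.random.default_rng(1)  # arbitrary seed for illustration
encoder_states = rng.standard_normal((input_length, hidden_size))
decoder_state = rng.standard_normal((1, hidden_size))
layer_1 = rng.standard_normal((2 * hidden_size, attention_size))

# Because the encoder states come first in the concatenation, the top rows of
# layer_1 multiply h_j (role of U_a) and the bottom rows multiply s_{i-1} (role of W_a)
U_a = layer_1[:hidden_size]
W_a = layer_1[hidden_size:]

tiled = np.tile(decoder_state, (input_length, 1))
concatenated = np.concatenate([encoder_states, tiled], axis=1) @ layer_1

# Broadcasting adds the single decoder projection to every encoder row
separate = encoder_states @ U_a + decoder_state @ W_a

print(np.allclose(concatenated, separate))
```

Both routes give the same pre-activation values, so concatenating and multiplying once is just a compact way of computing $W_a s_{i-1} + U_a h_j$ for every $j$.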

### Function 2: `alignment`

In [5]:
hidden_size = 16 # size of hidden state of encoder and decoder
''' 
attention_size(m) is set to 10. This means that the attention vectors (intermediate representations 
used to compute the alignment scores between encoder and decoder states) will be vectors of size 10. 
This is unrelated to the input length (5) or hidden size (16) and is a hyperparameter that could be tuned.
'''
attention_size = 10 # size of the attention "vectors" used to compute the alignment scores
input_length = 5   # length of the input sequence

np.random.seed(42) # set the random seed for reproducibility

# Synthetic vectors used to test
# h_j = encoder hidden states
encoder_states = np.random.randn(input_length, hidden_size) # (5, 16)  5 words, each word has 16 features
# s_i = decoder hidden state
decoder_state = np.random.randn(1, hidden_size) # (1, 16) 1 word, 16 features

# Synthetic weights used to test the alignment function
# layer_1 has shape (2 * hidden_size) x attention_size
# This layer corresponds to W_a and U_a weights
layer_1 = np.random.randn(2*hidden_size, attention_size) # (32, 10) 32 features, 10 attention vectors
# layer_2 has shape attention_size x 1
# This layer corresponds to v_a weights
layer_2 = np.random.randn(attention_size, 1) # (10, 1) 10 attention vectors, 1 score

# Alignment function, shown here fully implemented. A second version is in the solutions at the bottom of the notebook
def alignment(encoder_states, decoder_state):
    ''' 
    - First, concatenate the encoder states and the decoder state along the appropriate axis
    to produce an array with size Kx2n where K is the number of encoder states/steps aka number 
    of words and n is the hidden size. Then, multiply this concatenated array by layer_1 and apply 
    the tanh activation function. Finally, multiply the output of the previous layer by layer_2
    and return the result.
    
    - np.tile(decoder_state, (input_length, 1)) repeats decoder_state 5 times to match encoder_states 
    shape. Note that decoder_state is (1, 16) and encoder_states is (5, 16). We can also use 
    np.repeat(decoder_state, input_length, axis=0) which gives the same result
    '''
    inputs = np.concatenate([encoder_states, np.tile(decoder_state, (input_length, 1))], axis=-1) # (5, 32) 5 words, 32 features
    assert inputs.shape == (input_length, 2*hidden_size)  # assert inputs.shape == (5, 32)
    
    ''' 
    - Matrix multiplication of the concatenated inputs and layer_1, with tanh activation function
    tanh function squashes the values between -1 and 1. Note that we can also use 
    np.matmul(inputs, layer_1) instead of np.dot(inputs, layer_1) which gives same results
    '''
    activations = np.tanh(np.dot(inputs, layer_1)) # (5, 10) 5 words, 10 attention vectors
    assert activations.shape == (input_length, attention_size) # assert activations.shape == (5, 10)
    
    '''
    Matrix multiplication of the activations with layer_2. 
    Remember that you don't need tanh here
    '''
    scores = np.dot(activations, layer_2) # (5, 1) 5 words, 1 score
    assert scores.shape == (input_length, 1) # assert scores.shape == (5, 1)
    
    return scores # (5, 1)

### Explanation: 

Implementing Attention Mechanism for Sequence Alignment

In the provided code snippet, we define and implement a simple attention mechanism for aligning encoder and decoder states in a sequence-to-sequence model. Below, let's dissect the various components and their responsibilities.

```python
hidden_size = 16 
```
This refers to the dimensionality of the hidden states in both the encoder and decoder of a sequence-to-sequence model, indicating that each state in the sequences is represented as a 16-dimensional vector.

```python
attention_size = 10 
```
Here, `attention_size` determines the size of attention vectors, which are intermediate representations used to calculate alignment scores between encoder and decoder states. This value is independent of the input length (5) or hidden size (16) and serves as a tunable hyperparameter.

```python
input_length = 5 
```
`input_length` is the length of the input sequence, which here means the number of words in the input.

Subsequently, synthetic data for encoder states and decoder states, and synthetic weights for attention alignment calculations are generated as follows:

```python
encoder_states = np.random.randn(input_length, hidden_size) 
decoder_state = np.random.randn(1, hidden_size) 
layer_1 = np.random.randn(2*hidden_size, attention_size) 
layer_2 = np.random.randn(attention_size, 1) 
```

**Key Concepts:**
- **encoder_states**: A synthetic set of five encoder hidden states, one 16-dimensional vector per input word.
- **decoder_state**: A single 16-dimensional vector representing the hidden state of the decoder.
- **layer_1**: The weight matrix used to transform the concatenated encoder-decoder states into attention vectors. This corresponds to the $W_a$ and $U_a$ weights in attention mechanism formulations.
- **layer_2**: A second weight matrix that converts attention vectors into attention scores. This corresponds to the $v_a$ weight in attention mechanisms.

**Defining the `alignment` function:**

```python
def alignment(encoder_states, decoder_state):
```
This function performs the following sequence of operations to calculate the attention scores:
1. **Concatenation**: Merges each encoder state with the decoder state. To keep the shapes consistent, the single decoder state is tiled to match the number of encoder states.
2. **Attention Vector Calculation**: Performs a linear transformation on the concatenated states using `layer_1` weights and applies the tanh activation function.
3. **Attention Score Calculation**: Derives the attention scores by multiplying the attention vectors with `layer_2` weights.

**Attention Mechanism Logic:**
- The concatenated encoder-decoder states are first transformed into attention vectors by multiplying them with a weight matrix (`layer_1`) and passing them through a non-linear tanh activation.
- These attention vectors are then used to derive the attention scores by multiplying them with another weight matrix (`layer_2`). These scores represent how much attention the decoder state should pay to each encoder state during the decoding process.
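
The three operations above can also be written without relying on global variables. The following is a minimal sketch of such a variant. The name `alignment_scores` and its extra weight arguments are assumptions made for this illustration, not the lab's interface.

```python
import numpy as np

def alignment_scores(encoder_states, decoder_state, layer_1, layer_2):
    # Hypothetical variant of `alignment` that takes the weights as arguments
    # and infers the number of encoder steps from the input
    num_steps = encoder_states.shape[0]
    tiled = np.repeat(decoder_state, num_steps, axis=0)
    inputs = np.concatenate([encoder_states, tiled], axis=1)   # (K, 2n)
    activations = np.tanh(inputs @ layer_1)                    # (K, m)
    return activations @ layer_2                               # (K, 1)

rng = np.random.default_rng(2)  # arbitrary seed for illustration
n, m = 16, 10
layer_1 = rng.standard_normal((2 * n, m))
layer_2 = rng.standard_normal((m, 1))
decoder_state = rng.standard_normal((1, n))

# The same weights work for any number of encoder steps K
for K in (3, 5, 8):
    enc = rng.standard_normal((K, n))
    print(K, alignment_scores(enc, decoder_state, layer_1, layer_2).shape)
```

Since tanh outputs lie between -1 and 1, each score is bounded in magnitude by the sum of the absolute values in `layer_2`, regardless of how large the hidden states are.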

**General Notes:**
- The attention mechanism allows the model to focus on different parts of the input sequence when producing each element of the output sequence, essentially enabling the model to have a "memory" of the input.
- The choice of `attention_size`, `hidden_size`, and other hyperparameters should be motivated by both the empirical performance on validation data and computational efficiency.

In [6]:
# Run this to test your alignment function
scores = alignment(encoder_states, decoder_state)
print(scores)

[[4.35790943]
 [5.92373433]
 [4.18673175]
 [2.11437202]
 [0.95767155]]


If you implemented the function correctly, you should get these scores:

```python
[[4.35790943]
 [5.92373433]
 [4.18673175]
 [2.11437202]
 [0.95767155]]
```

## 2: Turning alignment into weights

The next step is to calculate the weights from the alignment scores. These weights determine the encoder outputs that are the most important for the decoder output. These weights should be between 0 and 1, and add up to 1. You can use the softmax function (which I've already implemented above) to get these weights from the attention scores. Pass the attention scores vector to the softmax function to get the weights. Mathematically,

$$
\large \alpha_{ij} = \frac{\exp{\left(e_{ij}\right)}}{\sum_{k=1}^K \exp{\left(e_{ik}\right)}}
$$
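
As a small worked example, the sketch below applies this formula to the alignment scores printed by the test cell earlier. It writes the softmax out inline with `keepdims=True` rather than calling the lab's helper, which is a stylistic choice for this illustration.

```python
import numpy as np

# Alignment scores printed by the test cell earlier in the notebook
scores = np.array([[4.35790943],
                   [5.92373433],
                   [4.18673175],
                   [2.11437202],
                   [0.95767155]])

# Softmax over axis 0: shift by the max, exponentiate, normalize
shifted = scores - scores.max(axis=0, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)

print(weights.ravel())
print(weights.sum())            # 1.0 up to floating-point rounding
print(int(weights.argmax()))    # 1, the step with the largest score
```

The ordering of the weights matches the ordering of the scores, but the exponential widens the gaps, so the second encoder step receives most of the weight.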



## 3: Weight the encoder output vectors and sum

The weights tell you the importance of each input word with respect to the decoder state. In this step, you use the weights to modulate the magnitude of the encoder vectors. Words with little importance will be scaled down relative to important words. Multiply each encoder vector by its respective weight to get the alignment vectors, then sum up the weighted alignment vectors to get the context vector. Mathematically,

$$
\large c_i = \sum_{j=1}^K\alpha_{ij} h_{j}
$$
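
The sum can be computed in several equivalent ways. The sketch below compares a literal loop, the broadcasting form, and a single matrix product, using random stand-in weights rather than real attention weights.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed for illustration
K, n = 5, 16
encoder_states = rng.standard_normal((K, n))   # rows are h_1 ... h_K
weights = rng.random((K, 1))
weights = weights / weights.sum()              # stand-in attention weights summing to 1

# Literal form of the equation: scale each h_j by alpha_ij, then add them up
context_loop = np.zeros(n)
for j in range(K):
    context_loop += weights[j, 0] * encoder_states[j]

# Broadcasting form: (K, n) * (K, 1), then sum over the K axis
context_sum = np.sum(encoder_states * weights, axis=0)

# Matrix-product form: (1, K) @ (K, n) -> (1, n)
context_matmul = (weights.T @ encoder_states).ravel()

print(np.allclose(context_loop, context_sum), np.allclose(context_sum, context_matmul))
```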

Implement these steps in the `attention` function below.

### Function 3: `attention`

In [7]:
# Attention function that takes in the encoder states and decoder state and returns the context vector 
def attention(encoder_states, decoder_state):
    """ Example function that calculates attention, returns the context vector 
    
        Arguments:
        encoder_states: nxm numpy array, where n is the number of vectors and m is the vector length
        decoder_state: 1xm numpy array, m is the vector length, must be the same m as encoder_states
    """ 
    
    # First, calculate the alignment scores
    scores = alignment(encoder_states, decoder_state) # (5, 1) 5 words, 1 score
    
    # Then take the softmax of the alignment scores to get a weight distribution
    weights = softmax(scores, axis=0) # (5, 1) 5 words, 1 score values sum to 1
    
    # Multiply each encoder state by its respective weight
    weighted_scores = np.multiply(encoder_states, weights) # (5, 16) 5 words, 16 features
    
    # Sum up weighted alignment vectors to get the context vector and return it
    context = np.sum(weighted_scores, axis=0) # (16,) 16 features
    return context

context_vector = attention(encoder_states, decoder_state)
print(context_vector)

[-0.63514569  0.04917298 -0.43930867 -0.9268003   1.01903919 -0.43181409
  0.13365099 -0.84746874 -0.37572203  0.18279832 -0.90452701  0.17872958
 -0.58015282 -0.58294027 -0.75457577  1.32985756]


If you implemented the `attention` function correctly, the context vector should be

```python
[-0.63514569  0.04917298 -0.43930867 -0.9268003   1.01903919 -0.43181409
  0.13365099 -0.84746874 -0.37572203  0.18279832 -0.90452701  0.17872958
 -0.58015282 -0.58294027 -0.75457577  1.32985756]
```



### Explanation

The provided code snippet illustrates how an attention mechanism operates in the context of a sequence-to-sequence model in NLP. Specifically, the function `attention` computes a context vector based on the encoder states and a given decoder state. Let's break down the code and its methodology.

```python
def attention(encoder_states, decoder_state):
    ...
```
This function takes in:
- `encoder_states`: An $n \times m$ array where $n$ denotes the number of encoder state vectors and $m$ is the dimensionality of each vector.
- `decoder_state`: A $1 \times m$ array representing the current state of the decoder.

The function then proceeds to compute the context vector, which is crucial for determining the decoder's focus on the input sequence during the translation (or sequence generation) process.

**Step 1: Calculate Alignment Scores**
```python
    scores = alignment(encoder_states, decoder_state) # (5, 1) 5 words, 1 score
```
The `alignment` function computes the *alignment scores* by determining how well each encoder state aligns with the current decoder state. A higher score implies a higher degree of alignment or attention.

**Step 2: Compute Attention Weights**
```python
    weights = softmax(scores, axis=0) # (5, 1) 5 words, 1 score values sum to 1
```
Using the softmax function on the alignment scores across the first axis ensures that the scores are normalized and collectively sum to 1, thus forming a probability distribution. These normalized scores, now referred to as *attention weights*, designate how much focus the decoder should assign to each corresponding encoder state.

**Step 3: Calculate Weighted Sum of Encoder States**
```python
    weighted_scores = np.multiply(encoder_states, weights) # (5, 16) 5 words, 16 features
```
Next, each encoder state is multiplied by its respective attention weight, creating a set of weighted encoder states. This step essentially scales each encoder state by the amount of attention it should receive from the decoder.
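
This step relies on NumPy broadcasting, which only works here because the weights keep their trailing axis of length 1. The toy arrays below are invented for illustration and show what happens with and without that axis.

```python
import numpy as np

encoder_states = np.arange(6, dtype=float).reshape(3, 2)   # 3 steps, 2 features
weights_column = np.array([[0.5], [0.3], [0.2]])           # shape (3, 1)

# (3, 2) * (3, 1): each row is scaled by its own weight
print(encoder_states * weights_column)

# A flat (3,) vector does not line up with the (3, 2) array
weights_flat = weights_column.ravel()
try:
    encoder_states * weights_flat
except ValueError as err:
    print("broadcast error:", err)

# Restoring the trailing axis makes the flat vector usable again
print(np.array_equal(encoder_states * weights_flat[:, None],
                     encoder_states * weights_column))
```

This is one reason the alignment function returns scores with shape `(input_length, 1)` instead of a flat vector.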

**Step 4: Derive the Context Vector**
```python
    context = np.sum(weighted_scores, axis=0)
    return context
```
Finally, the function computes the *context vector* by summing the weighted encoder states along the 0-axis (i.e., summing across all the weighted encoder states). This context vector, now representing a weighted sum of all encoder states, is utilized by the decoder to generate the next element in the output sequence, bearing in mind the relevant parts of the input sequence.
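
Two limiting cases give some intuition for what the context vector represents. The sketch below feeds hand-picked scores into steps 2 to 4. The helper `context_from_scores` is hypothetical and bypasses the alignment network so that the scores can be set directly.

```python
import numpy as np

def softmax(x, axis=0):
    # Shifted softmax, equivalent to the lab's version
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def context_from_scores(encoder_states, scores):
    # Hypothetical helper: steps 2 to 4 only, with the scores supplied directly
    weights = softmax(scores, axis=0)
    return np.sum(encoder_states * weights, axis=0)

rng = np.random.default_rng(4)  # arbitrary seed for illustration
encoder_states = rng.standard_normal((5, 16))

# Equal scores give equal weights, so the context is the plain average
uniform = context_from_scores(encoder_states, np.zeros((5, 1)))
print(np.allclose(uniform, encoder_states.mean(axis=0)))

# One dominant score pushes nearly all weight onto that step
peaked_scores = np.zeros((5, 1))
peaked_scores[2, 0] = 50.0
peaked = context_from_scores(encoder_states, peaked_scores)
print(np.allclose(peaked, encoder_states[2]))
```

Real alignment scores fall between these extremes, so the context vector is a weighted average that leans toward the encoder states with the highest scores.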

This breakdown illustrates a typical implementation of attention mechanisms, ensuring that the model can allocate its focus adaptively across different parts of an input sequence when generating each word/token in the output. This is pivotal for handling long sequences and for tasks where different parts of the input are relevant at different stages of the output generation, like in machine translation, text summarization, and more.

### Operational Flow of the Seq2Seq Attention Model

![operational flow of Seq2Seq Attention Model](./images/operational_flow_attention.png)

This diagram shows the step-by-step operations performed in this notebook for the sequence-to-sequence model with attention, from the initial encoding of the input sequence to the final context-aware decoding.

## See below for solutions

```python
# Solution
def alignment(encoder_states, decoder_state):
    # First, concatenate the encoder states and the decoder state.
    inputs = np.concatenate((encoder_states, decoder_state.repeat(input_length, axis=0)), axis=1)
    assert inputs.shape == (input_length, 2*hidden_size)
    
    # Matrix multiplication of the concatenated inputs and the first layer, with tanh activation
    activations = np.tanh(np.matmul(inputs, layer_1))
    assert activations.shape == (input_length, attention_size)
    
    # Matrix multiplication of the activations with the second layer. Remember that you don't need tanh here
    scores = np.matmul(activations, layer_2)
    assert scores.shape == (input_length, 1)
    
    return scores

# Run this to test your alignment function
scores = alignment(encoder_states, decoder_state)
print(scores)
```

```python
# Solution
def attention(encoder_states, decoder_state):
    """ Example function that calculates attention, returns the context vector 
    
        Arguments:
        encoder_states: NxM numpy array, where N is the number of vectors and M is the vector length
        decoder_state: 1xM numpy array, M is the vector length, must be the same M as encoder_states
    """ 
    
    # First, calculate the alignment scores between the decoder state and each encoder state
    scores = alignment(encoder_states, decoder_state)
    
    # Then take the softmax of those scores to get a weight distribution
    weights = softmax(scores)
    
    # Multiply each encoder state by its respective weight
    weighted_scores = encoder_states * weights
    
    # Sum up the weighted encoder states to get the context vector
    context = np.sum(weighted_scores, axis=0)
    
    return context

context_vector = attention(encoder_states, decoder_state)
print(context_vector)
```