The two most commonly used attention functions are additive attention [(cite)](https://arxiv.org/abs/1409.0473) and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
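
The two scoring functions described above can be sketched side by side. This is a minimal NumPy illustration, not a reference implementation: the helper names, the shapes, and the additive parameters `W_q`, `W_k`, and `v` are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # one matrix multiplication
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

def additive_scores(Q, K, W_q, W_k, v):
    # score(q, k) = v^T tanh(W_q q + W_k k): a single hidden layer.
    hidden = np.tanh((Q @ W_q)[:, None, :] + (K @ W_k)[None, :, :])
    return hidden @ v                    # (n_q, n_k)

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 4))
out, weights = scaled_dot_product_attention(Q, K, V)

# Hypothetical additive-attention parameters with a hidden size of 16.
W_q = rng.standard_normal((8, 16))
W_k = rng.standard_normal((8, 16))
v = rng.standard_normal(16)
add_scores = additive_scores(Q, K, W_q, W_k, v)
```

Both variants produce one score per query-key pair. The dot-product version needs only a single matrix product, while the additive version materialises an intermediate hidden tensor of shape `(n_q, n_k, hidden)`.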

                                                                        
While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot-product attention without scaling for larger values of $d_k$ [(cite)](https://arxiv.org/abs/1703.03906). We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$. Then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean $0$ and variance $d_k$. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$, which brings the variance back to $1$.
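
The variance argument can be checked empirically. The sketch below assumes standard normal components (one distribution with mean $0$ and variance $1$); the function name and sample count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_variance(d_k, n_samples=10000):
    # Draw independent q and k with zero-mean, unit-variance components.
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = (q * k).sum(axis=1)
    # Return the variance before and after scaling by 1 / sqrt(d_k).
    return dots.var(), (dots / np.sqrt(d_k)).var()

for d_k in (4, 64, 256):
    raw, scaled = dot_product_variance(d_k)
    print(f"d_k={d_k:4d}  var(q.k)={raw:8.2f}  var(q.k/sqrt(d_k))={scaled:.3f}")
```

The unscaled variance should track $d_k$, while the scaled variance should stay close to $1$ regardless of $d_k$.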

#  What is Attention

Informally, a neural attention mechanism equips a neural network with the ability to focus on a subset of its inputs (or features): it selects specific inputs. Let $x \in \mathbb{R}^{d}$ be an input vector, $z \in \mathbb{R}^{k}$ a feature vector, $a \in [0, 1]^{k}$ an attention vector, $g \in \mathbb{R}^{k}$ an attention glimpse, and $f_{ϕ}(x)$ an attention network with parameters $ϕ$. Typically, attention is implemented as

$$a = f_{ϕ}(x)$$
$$g = a \odot z$$

where $\odot$ denotes element-wise multiplication.
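
As a rough illustration of these two equations, the sketch below uses a single linear layer followed by a sigmoid as the attention network $f_{ϕ}$. That choice, and the names `W` and `b` for the parameters $ϕ$, are assumptions made for the example; the text above does not prescribe a particular network.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def attention_glimpse(x, z, W, b):
    # a = f_phi(x): here a hypothetical linear layer plus a sigmoid,
    # which keeps every entry of a inside [0, 1].
    a = sigmoid(W @ x + b)
    # g is the element-wise product of the attention vector and the features.
    g = a * z
    return a, g

rng = np.random.default_rng(0)
d, k = 6, 4
x = rng.standard_normal(d)        # input vector
z = rng.standard_normal(k)        # feature vector
W = rng.standard_normal((k, d))   # assumed parameters of f_phi
b = np.zeros(k)
a, g = attention_glimpse(x, z, W, b)
```

Because every entry of `a` lies in $[0, 1]$, each feature in `z` is either passed through, attenuated, or suppressed in the glimpse `g`.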

# Reference
1. [Attention in Neural Networks and How to Use It](http://akosiorek.github.io/ml/2017/10/14/visual-attention.html)

# Epoch vs Batch Size vs Iterations
We need terms like **epoch**, **batch size**, and **iteration** only when the dataset is too large to pass to the computer at once, which happens all the time in machine learning. To overcome this problem, we divide the data into smaller chunks, feed them to the computer one at a time, and update the weights of the neural network at the end of every step to fit the data given.
# What is an Epoch
One epoch is when the **ENTIRE** dataset is passed forward and backward through the neural network exactly **ONCE**.

Since one epoch is too big to feed to the computer at once, we divide it into several smaller **batches**.

## Why do we use more than one epoch?

At first it may not make sense that passing the entire dataset through a neural network once is not enough, and that we need to pass the full dataset through the same network multiple times. But keep in mind that we are using a limited dataset, and that we optimise the learning and the fitted curve with gradient descent, which is an iterative process. So updating the weights with a single pass, or one epoch, is not enough.


<img src="images/nums_epoches.png" />

As the graph above shows, when the number of epochs increases, the weights of the neural network are updated more times, and the curve goes from underfitting to optimal to overfitting.


## So, what is the right number of epochs?

Unfortunately, there is no right answer to this question. The answer is **different for different datasets, but you can say that the number of epochs is related to how diverse your data is**. For example, do you have only black cats in your dataset, or is it a much more diverse dataset?

# Batch Size
The batch size is the total number of training examples present in a single batch.

As noted above, you can’t pass the entire dataset into the neural net at once, so you divide the dataset into a number of batches (sets or parts).

# Iterations
The number of iterations is the number of batches needed to complete one epoch.

**Note**: The number of batches is equal to the number of iterations for one epoch.



# Example
Let’s say we have 2000 training examples. **If we divide the dataset of 2000 examples into batches of 500, it takes 4 iterations to complete 1 epoch.** Here the **batch size** is 500 and the number of **iterations** is 4 for one complete epoch.
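
The same arithmetic can be written as a short sketch. The helper names are illustrative, and rounding up with `math.ceil` assumes that a final, smaller batch is still processed when the dataset size is not a multiple of the batch size.

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    # Number of batches (and therefore weight updates) in one epoch.
    return math.ceil(num_examples / batch_size)

def iterate_batches(num_examples, batch_size):
    # Yield (start, end) index ranges that cover the dataset exactly once.
    for start in range(0, num_examples, batch_size):
        yield start, min(start + batch_size, num_examples)

print(iterations_per_epoch(2000, 500))  # 4

num_epochs = 3
for epoch in range(num_epochs):
    for start, end in iterate_batches(2000, 500):
        pass  # one iteration: forward pass, backward pass, weight update
```

With 3 epochs and 4 iterations per epoch, the weights would be updated 12 times in total.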


In [2]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn
seaborn.set_context(context="talk")
%matplotlib inline
class PositionalEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        
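        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        # Use float tensors explicitly so the products below do not rely on
        # implicit integer-to-float promotion.
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions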
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        # Add the encodings for the first x.size(1) positions to the input.
        # `pe` is a registered buffer, so it carries no gradient.
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

In [8]:
pe = PositionalEncoding(16, 0.1)

In [14]:
pe.pe.size()

torch.Size([1, 5000, 16])

In [18]:
table = pe.pe.squeeze(0)

In [19]:
table.size()

torch.Size([5000, 16])

In [27]:
p1 = table[1,:]

In [34]:
p2 = table[2,:]

In [35]:
p3 = table[3,:]

In [36]:
p2

tensor([ 9.0930e-01, -4.1615e-01,  5.9113e-01,  8.0658e-01,  1.9867e-01,
         9.8007e-01,  6.3203e-02,  9.9800e-01,  1.9999e-02,  9.9980e-01,
         6.3245e-03,  9.9998e-01,  2.0000e-03,  1.0000e+00,  6.3246e-04,
         1.0000e+00])

In [37]:
p1

tensor([8.4147e-01, 5.4030e-01, 3.1098e-01, 9.5042e-01, 9.9833e-02, 9.9500e-01,
        3.1618e-02, 9.9950e-01, 9.9998e-03, 9.9995e-01, 3.1623e-03, 9.9999e-01,
        1.0000e-03, 1.0000e+00, 3.1623e-04, 1.0000e+00])

In [38]:
p2 - p1

tensor([ 6.7826e-02, -9.5645e-01,  2.8014e-01, -1.4384e-01,  9.8836e-02,
        -1.4938e-02,  3.1586e-02, -1.4994e-03,  9.9988e-03, -1.4997e-04,
         3.1622e-03, -1.4961e-05,  1.0000e-03, -1.4901e-06,  3.1623e-04,
        -1.1921e-07])

In [39]:
p3 - p2

tensor([-7.6818e-01, -5.7385e-01,  2.2152e-01, -2.2382e-01,  9.6851e-02,
        -2.4730e-02,  3.1523e-02, -2.4973e-03,  9.9968e-03, -2.4998e-04,
         3.1622e-03, -2.5034e-05,  1.0000e-03, -2.5034e-06,  3.1623e-04,
        -2.3842e-07])
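
The differences printed above are not constant from one position to the next, so the encoding is not a fixed step per position. What does stay fixed is the dot product between two encodings at the same offset, since $\sin(a\omega)\sin(b\omega) + \cos(a\omega)\cos(b\omega) = \cos((a-b)\omega)$ for each frequency $\omega$. The NumPy sketch below rebuilds the same table outside the module to check this; the function name `sinusoidal_table` is an assumption for the example.

```python
import math
import numpy as np

def sinusoidal_table(max_len, d_model):
    # Same formula as the PositionalEncoding buffer, without the batch axis.
    position = np.arange(max_len, dtype=np.float64)[:, None]
    div_term = np.exp(np.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(position * div_term)
    table[:, 1::2] = np.cos(position * div_term)
    return table

table = sinusoidal_table(5000, 16)
p1, p2, p3 = table[1], table[2], table[3]

# Consecutive differences vary with position ...
print(np.allclose(p2 - p1, p3 - p2))
# ... but the dot product depends only on the offset between positions.
print(np.isclose(p1 @ p2, p2 @ p3))
```

The first two entries of `p1` are $\sin(1)$ and $\cos(1)$, which matches the `8.4147e-01` and `5.4030e-01` shown in the output above.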

In [1]:
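# NOTE: torchtext.data.Field is the legacy torchtext API (moved to
# torchtext.legacy in 0.9 and removed in 0.12); this cell assumes an
# older torchtext release.
import torchtext
from torchtext.data.utils import get_tokenizer
TEXT = torchtext.data.Field(tokenize=get_tokenizer("basic_english"),
                            init_token='<sos>',
                            eos_token='<eos>',
                            lower=True)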

In [2]:
train_txt, val_txt, test_txt = torchtext.datasets.WikiText2.splits(TEXT)

downloading wikitext-2-v1.zip


.data\wikitext-2\wikitext-2-v1.zip: 100%|██████████████████████| 4.48M/4.48M [00:00<00:00, 13.2MB/s]


extracting


In [12]:
type(train_txt.examples[0].text)

list

