# ACSE Module 8 - Practical - Morning Session 5:

# Time-Series Data

## What is Time-Series Data?
Any data that varies along the time dimension is called time-series data, for example:
- Sales of ice cream per day in the Senior Common Room at Imperial College - univariate
- Average price of the Apple stock by day - univariate
- The closing value of FTSE100 by day - univariate
- Average daily temperatures in Australia - univariate
- The complete text of 1984 by George Orwell
- EEG data - multivariate
- Room occupancy sensor data - multivariate

## What can we do with time-series?
- Forecasting - predict the future!
- Generate text - e.g. autocomplete
- Anomaly detection - e.g. detect a seizure happening/about to happen from EEG data
- Classification - e.g. room occupied/not occupied, sentiment analysis of text

## What makes time-series data special?
- There is a concept of "past" and "future" in time-series data
- In forecasting, each new observation extends the history, so the model usually has to be updated or retrained as new data arrives
- Non-stationary probability distribution of data
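
One practical consequence of the "past" and "future" ordering is that held-out data is usually taken from the end of the series rather than sampled at random. The following is a minimal sketch of that idea; the helper name `chronological_split` and the sales numbers are made up for illustration and are not part of this practical.

```python
def chronological_split(series, test_fraction=0.2):
    """Split a sequence so that every test point comes after every training point."""
    n_test = max(1, int(len(series) * test_fraction))
    return series[:-n_test], series[-n_test:]

# Hypothetical daily ice cream sales (made-up numbers for illustration only)
sales = [12, 15, 14, 18, 21, 19, 25, 27, 24, 30]
train, test = chronological_split(sales, test_fraction=0.2)

print(train)  # the first 8 days
print(test)   # the last 2 days
```

A random shuffle would let the model "see the future" during training, which is why the split keeps the original order.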

Although there is a large number of classical time-series methods that work really well, we are going to focus on Neural Networks in this session. 

The neural networks we have seen so far have no concept of _order_ in their inputs: each fresh input is treated in isolation. In time-series analysis, the _context_ is extremely important, i.e. what came before this input?


## Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) solve this problem by incorporating loops within them. The loop signifies that information can persist from one set of inputs to another. 

<img src="images/RNN-rolled.png" width=200 />

This network accepts input _x<sub>t</sub>_ and provides output _h<sub>t</sub>_ at every step _t_. 

The network can also be seen in the following _unrolled_ visualisation:

<img src="images/RNN-unrolled.png" width=600 />
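
The unrolled view can be sketched in a few lines of PyTorch. This is an illustration under assumptions, not the network built later in this session: the weight names (`W_xh`, `W_hh`, `b_h`), the sizes, and the `tanh` non-linearity of the common "vanilla" RNN formulation are all chosen here for the example.

```python
import torch

torch.manual_seed(0)

input_size, hidden_size, seq_len = 4, 3, 5

# Hypothetical weights for illustration; a real network would learn these
W_xh = torch.randn(input_size, hidden_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)

xs = torch.randn(seq_len, input_size)   # one input vector x_t per step
h = torch.zeros(hidden_size)            # initial hidden state

outputs = []
for x_t in xs:
    # The same weights are reused at every step; only h carries information forward
    h = torch.tanh(x_t @ W_xh + h @ W_hh + b_h)
    outputs.append(h)

hs = torch.stack(outputs)
print(hs.shape)  # torch.Size([5, 3])
```

Each pass through the loop corresponds to one box in the unrolled diagram.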


Simple RNNs, however, suffer from problems of _short-term memory_. 

![](images/LSTM.png)


For example, try to predict the last word in the sentence: **There are so many clouds in the _sky_.**

This is easy for a simple RNN to predict as the necessary context word _clouds_ appeared just two words ago. 

However, look at the following example: **I grew up in France... I speak fluent _French_** 

The distance between the contextual clue word _France_ and the predicted word _French_ could have been arbitrarily long in this text. Simple RNNs have trouble learning to connect such contextual references that appear far away from each other. 

[Let's see if you can differentiate between machine generated text and human written text](http://goopt2.xyz)

## Text data

Textual data can also be seen as a time-series since it varies along a single dimension. There are two distinct ways of looking at this data - every character can be seen as an individual data point, or every word can be seen as a new data point. 
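
As a small illustration of the two views, the same sentence can be split into characters or into words. The sentence and the variable names below are chosen for this example only.

```python
text = "I grew up in France"

char_tokens = list(text)      # every character (including spaces) is a data point
word_tokens = text.split()    # every whitespace-separated word is a data point

# Sequence length and number of distinct symbols for each view
print(len(char_tokens), len(set(char_tokens)))
print(len(word_tokens), len(set(word_tokens)))
```

The character view gives a longer sequence over a small symbol set, while the word view gives a shorter sequence over a vocabulary that grows with the corpus.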

### Character-level

When working with text data at the character level, the advantage is that the set of possible inputs (all possible characters) is relatively small:

In [1]:
# # Download data
# !mkdir data
# !cd data && wget https://raw.githubusercontent.com/acse-2019/ACSE-8/master/Implementation/practical_5/morning/data/names.zip && unzip names.zip

In [2]:
import string
all_letters = string.ascii_letters
n_letters = len(all_letters)
all_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

However, an obvious disadvantage is the risk of _missing the forest for the trees_ , i.e. higher-level patterns might be less obvious when looking at fine-grained data. 

### Objective: Name-origin classifier
Given a dataset of common last names from different languages, classify a new (previously unseen) last name into one of the (seen) languages. 

In [4]:
# Explore dataset
import glob

all_filenames = glob.glob('data/names/*.txt')
print(all_filenames)

['data/names/Czech.txt', 'data/names/German.txt', 'data/names/Arabic.txt', 'data/names/Japanese.txt', 'data/names/Chinese.txt', 'data/names/Vietnamese.txt', 'data/names/Russian.txt', 'data/names/French.txt', 'data/names/Irish.txt', 'data/names/English.txt', 'data/names/Spanish.txt', 'data/names/Greek.txt', 'data/names/Italian.txt', 'data/names/Portuguese.txt', 'data/names/Scottish.txt', 'data/names/Dutch.txt', 'data/names/Korean.txt', 'data/names/Polish.txt']


In [5]:
import unicodedata
import string


# Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427
def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicode_to_ascii('Ślusàrski'))

Slusarski


In [6]:
# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
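def readLines(filename):
    # Use a context manager so the file is closed, and read as UTF-8 since the names contain accented characters
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicode_to_ascii(line) for line in lines]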

for filename in all_filenames:
    category = filename.split('/')[-1].split('.')[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)
print('n_categories =', n_categories)
total_names = sum([len(x) for x in category_lines.values()])
print(total_names)

n_categories = 18
20050


In [7]:
print(category_lines['Italian'][:5])

['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']


## Task 1: Turning names into Tensors

We have loaded the names from the dataset into memory but our neural networks can only deal with Tensors so we need a tensor representation for a name. 

To represent each letter numerically, we could simply use its position in the alphabet, e.g. 1 for a, but this would imply an ordering between the letters - that in some way b > a. We don't want our model to learn such an ordering so we want a _non-ordinal_ representation. A common one is _one-hot encoding_. 

##### One-hot encoding

One-hot encoding generates vectors whose length equals the number of possible outcomes, and which contain zeros everywhere except one location, which has a 1. 
For example, to represent _Gender_, we could use 1 to represent _Male_ and 2 to represent _Female_ , but this would imply that _Female_ > _Male_ . So instead we use the vector [0 1] to represent _Male_ and the vector [1 0] to represent _Female_. 
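
PyTorch also ships a helper for this, `torch.nn.functional.one_hot`, which can serve as a cross-check for a hand-written encoder. The sketch below applies it to a whole name at once; the name and variable names are chosen for this example.

```python
import string
import torch
import torch.nn.functional as F

all_letters = string.ascii_letters

# Integer index of each letter of a name within all_letters
name = "Abate"
indices = torch.tensor([all_letters.index(c) for c in name])

# F.one_hot returns an integer tensor of shape <len(name) x n_letters>
one_hot = F.one_hot(indices, num_classes=len(all_letters)).float()

print(one_hot.shape)        # torch.Size([5, 52])
print(one_hot.sum(dim=1))   # exactly one 1 per row
```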


##### Write a function to convert a letter to a one-hot tensor

In [8]:
import torch

all_letters = string.ascii_letters
n_letters = len(all_letters)
all_letters

def letter_to_tensor(letter):
    tensor = torch.zeros(1, n_letters) # Initialize a torch tensor of zeros of appropriate length
    letter_index = all_letters.find(letter) # Find the location of the letter in *all_letters* above
    tensor[0][letter_index] = 1 # Set the appropriate location to 1 here
    return tensor

In [9]:
# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def line_to_tensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        letter_index = all_letters.find(letter)
        tensor[li][0][letter_index] = 1
    return tensor

In [10]:
print(letter_to_tensor('k'))

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])


In [11]:
print(line_to_tensor('Kukreja').size())

torch.Size([7, 1, 52])


## Task 2: The Network
![](images/RNN-network.png)

This RNN module (from [here](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)) is just 2 linear layers which operate on an input and hidden state, with a LogSoftmax layer after the output. Before continuing, think about whether we want to do the classification at every step here, or whether we are only interested in the output at the end. Complete the code in the following cell to implement the above network.

In [12]:
import torch.nn as nn
from torch.autograd import Variable

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    
    def forward(self, i, hidden):
        combined = torch.cat((i, hidden), 1) # Combine the input and the hidden into a single tensor
        hidden = self.i2h(combined) # Do the Input->Hidden Transform
        output = self.i2o(combined) # Do the Input->Output Transform
        output = self.softmax(output) # Softmax on Output
        return output, hidden

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size, requires_grad=True)

Now let's test this module we created.

In [13]:
n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)

In [14]:
input = letter_to_tensor('A')
hidden = rnn.init_hidden()

output, next_hidden = rnn(input, hidden)
print('output.size =', output.size())

output.size = torch.Size([1, 18])


In [16]:
input = line_to_tensor('Dubrule')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input[0], hidden)
print(output)

tensor([[-2.9042, -2.9045, -2.9411, -2.7595, -3.0002, -2.8847, -2.8267, -3.0088,
         -2.9026, -2.8830, -2.9186, -2.9036, -2.7786, -3.0009, -2.8817, -2.9094,
         -2.8469, -2.8137]], grad_fn=<LogSoftmaxBackward>)


In [17]:
def category_from_output(output):
    top_value, top_index = output.data.topk(1)
    category_index = top_index[0][0]
    return all_categories[category_index], category_index

print(category_from_output(output))

('Japanese', tensor(3))


In [18]:
import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
    line_tensor = line_to_tensor(line)
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)

category = Greek / line = Adamou
category = Spanish / line = Gallo
category = Scottish / line = Scott
category = Russian / line = Viharev
category = Scottish / line = Mcgregor
category = Chinese / line = Xuan
category = Portuguese / line = Mateus
category = French / line = Lambert
category = Dutch / line = Kouman
category = Portuguese / line = Araullo


In [19]:
criterion = nn.NLLLoss()
lr = 0.005 
optimizer = torch.optim.SGD(rnn.parameters(), lr=lr)

[Why](https://github.com/rasbt/stat479-deep-learning-ss19/blob/master/other/pytorch-lossfunc-cheatsheet.md) NLLLoss? 
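
One way to see the connection: the network already ends in a `LogSoftmax` layer, and `NLLLoss` applied to log-probabilities gives the same value as `CrossEntropyLoss` applied to the raw scores. A small check with made-up scores (the sizes and target index are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(1, 18)     # raw scores for 18 hypothetical classes
target = torch.tensor([3])      # index of the correct class

log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, target)
ce = nn.CrossEntropyLoss()(logits, target)

print(torch.allclose(nll, ce))  # True
```

For a single example, the loss is simply the negative log-probability that the network assigned to the correct class.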

## Task 3: The train function

For the training function here, we will implement _true_ Stochastic Gradient Descent, i.e. one example at a time. 

The train function must:
- Reset gradients
- Create a zeroed initial hidden state
- Read each letter in, keeping the hidden state for the next letter
- Compare final output to target
- Back-propagate
- Update the parameters with an optimizer step
- Return the output and loss

In [20]:
def train(category_tensor, line_tensor):
    optimizer.zero_grad()
    hidden = rnn.init_hidden()
    
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)
    loss = criterion(output, category_tensor)
    loss.backward()

    optimizer.step()

    return output, loss.item()

In [21]:
import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000



# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, loss = train(category_tensor, line_tensor)
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = category_from_output(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0

5000 5% (0m 7s) 2.4133 Walentowicz / Russian ✗ (Polish)
10000 10% (0m 14s) 1.8494 Severins / Dutch ✓
15000 15% (0m 23s) 2.3727 Srour / German ✗ (Arabic)
20000 20% (0m 32s) 1.0771 Viroslavsky / Russian ✓
25000 25% (0m 41s) 2.4634 Simon / English ✗ (Irish)
30000 30% (0m 50s) 0.8576 Zelinka / Czech ✓
35000 35% (1m 0s) 0.9887 Sook / Korean ✓
40000 40% (1m 9s) 0.6829 Donovan / Irish ✓
45000 45% (1m 18s) 0.6593 Letsos / Greek ✓
50000 50% (1m 27s) 1.7587 StrakaO / Arabic ✗ (Czech)
55000 55% (1m 36s) 0.8935 Diep / Vietnamese ✓
60000 60% (1m 44s) 3.8970 Hadash / Japanese ✗ (Czech)
65000 65% (1m 53s) 0.8514 Banos / Greek ✓
70000 70% (2m 2s) 0.0587 Tieu / Vietnamese ✓
75000 75% (2m 11s) 0.6497 Polites / Greek ✓
80000 80% (2m 20s) 1.8170 Lucassen / French ✗ (Dutch)
85000 85% (2m 29s) 0.2374 Scarsi / Italian ✓
90000 90% (2m 38s) 0.0230 Patsiorkovsky / Russian ✓
95000 95% (2m 47s) 0.9595 Login / Irish ✓


KeyboardInterrupt: 

In [None]:
# Plot the evolution of the loss function through the training process

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline

plt.figure()
plt.plot(all_losses)

We see the loss function going down through the training iterations - the network is learning.

Next, let's plot the confusion matrix.

In [None]:
# Just return an output given a line
def evaluate(line_tensor):
    hidden = rnn.init_hidden()
    
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)
    
    return output

In [None]:
# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000


# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = category_from_output(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up axes
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

# Force label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

# sphinx_gallery_thumbnail_number = 2
plt.show()

We can see that our network is doing very well with Greek, but very badly with English!
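
To put numbers on what the plot shows, per-language accuracy can be read off the diagonal of the row-normalised matrix. The sketch below uses a small made-up count matrix, not the results of this notebook, and normalises all rows in one step.

```python
import torch

# Hypothetical 3-class count matrix (rows: true class, columns: predicted class)
labels = ["Greek", "English", "Dutch"]
counts = torch.tensor([[90., 5., 5.],
                       [30., 40., 30.],
                       [10., 20., 70.]])

# Row-normalise in one step instead of looping over rows
normalised = counts / counts.sum(dim=1, keepdim=True)

# The diagonal holds the fraction of each true class that was guessed correctly
per_class_accuracy = normalised.diag()
for label, acc in zip(labels, per_class_accuracy):
    print(f"{label}: {acc.item():.2f}")
```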

Running on user input

In [None]:
def predict(input_line, n_predictions=3):
    print('\n> %s' % input_line)
    # Tensors track gradients directly, so no Variable wrapper is needed
    with torch.no_grad():
        output = evaluate(line_to_tensor(input_line))

    # Get top N categories
    topv, topi = output.topk(n_predictions, 1, True)
    predictions = []

    for i in range(n_predictions):
        value = topv[0][i].item()
        category_index = topi[0][i].item()
        print('(%.2f) %s' % (value, all_categories[category_index]))
        predictions.append([value, all_categories[category_index]])

    return predictions

In [None]:
predict("Zhang")

## Objective: Name generator

Now let's modify the previous example to build a name generator instead. First, some utility functions.

## Task 4: The "Generator" Network
This time we use a slightly modified version of the previous network:
![](images/RNN-generator.png)

This version has an added input for the category tensor, where we specify which origin name we would like the network to generate. This category will be another one-hot tensor like the letter input. 

Another change is the addition of a dropout layer before the softmax. This allows for some randomization to increase the sampling variety.
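
Note that `nn.Dropout` only randomises its input while the module is in training mode; in evaluation mode it passes values through unchanged. A minimal sketch of that behaviour, with a made-up input tensor:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(0.1)
x = torch.ones(1, 1000)

dropout.train()                 # training mode: elements are randomly zeroed
y_train = dropout(x)
dropout.eval()                  # evaluation mode: dropout is the identity
y_eval = dropout(x)

print((y_train == 0).sum().item())   # roughly 100 of the 1000 elements are zeroed
print(torch.equal(y_eval, x))        # True
```

In training mode the surviving elements are scaled by 1 / (1 - p), so the expected value of the output matches the input.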

In [None]:
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size

        self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)
        self.o2o = nn.Linear(hidden_size + output_size, output_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        input_combined = torch.cat((category, input, hidden), 1) # Combine category, input and hidden into a single tensor
        hidden = self.i2h(input_combined) # Input to hidden
        output = self.i2o(input_combined) # Input to Output
        output_combined = torch.cat((hidden, output), 1) # output combined
        output = self.o2o(output_combined) # Output to output
        output = self.dropout(output) # Dropout
        output = self.softmax(output) # SoftMax
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

For each timestep (that is, for each letter in a training word) the inputs of the network will be (category, current letter, hidden state) and the outputs will be (next letter, next hidden state). So for each training example, we’ll need the category, a set of input letters, and a set of output/target letters.

Since we are predicting the next letter from the current letter for each timestep, the letter pairs are groups of consecutive letters from the line - e.g. for "ABCD<EOS>" we would create (“A”, “B”), (“B”, “C”), (“C”, “D”), (“D”, “EOS”).
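
The pairing can be sketched in plain Python before moving to tensors. The helper name `letter_pairs` and the string used as the EOS marker are made up for this illustration.

```python
def letter_pairs(line, eos="<EOS>"):
    """Return (input, target) pairs of consecutive letters, ending with the EOS marker."""
    targets = list(line[1:]) + [eos]
    return list(zip(line, targets))

pairs = letter_pairs("ABCD")
print(pairs)
# [('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', '<EOS>')]
```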

In [None]:
# One-hot vector for category
def categoryTensor(category):
    tensor = torch.zeros(1, n_categories)
    tensor[0][all_categories.index(category)] = 1
    return tensor

categoryTensor("Greek")

In [None]:
# LongTensor of second letter to end (EOS) for target
def targetTensor(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]
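    # EOS marker. Note: all_letters has no dedicated EOS slot in this notebook, so
    # index n_letters - 1 is shared with 'Z'; reserving an extra index would avoid the clash
    letter_indexes.append(n_letters - 1)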
    return torch.LongTensor(letter_indexes)

targetTensor("Lombardo")

In [None]:
# Get a random category and random line from that category
def randomTrainingPair():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    return category, line

# Make category, input, and target tensors from a random category, line pair
def randomTrainingExample():
    category, line = randomTrainingPair()
    category_tensor = categoryTensor(category)
    input_line_tensor = line_to_tensor(line)
    target_line_tensor = targetTensor(line)
    return category_tensor, input_line_tensor, target_line_tensor
randomTrainingExample()

Now we define the training function. Note that in contrast to the classification train function, we now use the output of the network at every step instead of only at the end. 

## Task 5: The train function for the generator

In [None]:
def train(rnn, optimizer, criterion, category_tensor, input_line_tensor, target_line_tensor):
    target_line_tensor.unsqueeze_(-1)    
    
    hidden = rnn.initHidden() # Initialise hidden
    optimizer.zero_grad() # Reset gradients

    loss = 0. # Initialise loss

    for i in range(input_line_tensor.size(0)): # Loop over the characters in the input
        output, hidden = rnn(category_tensor, input_line_tensor[i], hidden) #Forward prop one letter at a time
        l = criterion(output, target_line_tensor[i]) #loss function w.r.t expected output (each letter)
        loss += l # Sum losses

    loss.backward() #Backprop
    
    optimizer.step() # Gradient step
    
    return output, loss.item() / input_line_tensor.size(0)

The training loop

In [None]:
criterion = nn.NLLLoss()
lr = 0.0005
rnn = RNN(n_letters, 128, n_letters)
optimizer = torch.optim.SGD(rnn.parameters(), lr=lr)
n_iters = 100000
print_every = 5000
plot_every = 500
all_losses = []
total_loss = 0. # Reset every plot_every iters

start = time.time()


for iter in range(1, n_iters + 1):
    output, loss = train(rnn, optimizer, criterion, *randomTrainingExample())
    total_loss += loss

    if iter % print_every == 0:
        print('%s (%d %d%%) %.4f' % (timeSince(start), iter, iter / n_iters * 100, loss))

    if iter % plot_every == 0:
        all_losses.append(total_loss / plot_every)
        total_loss = 0

Plot the losses to see the progress of training:

In [None]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)

#### Sampling from the network

Now we will write the function that generates names using this network. To sample, we feed in a category (name origin), as well as a first letter as a seed - and ask the network what the next letter is. Next, keeping the category the same, we feed in the second letter (generated by the network in the previous step) as the seed and ask for the next letter, and so on until we get the EOS (end-of-string) token.

In [None]:
max_length = 20

# Sample from a category and starting letter
def sample(category, start_letter='A'):
    with torch.no_grad():  # no need to track history in sampling
        category_tensor = categoryTensor(category)
        # line_to_tensor (defined earlier) gives a <1 x 1 x n_letters> tensor for a single letter
        input = line_to_tensor(start_letter)
        hidden = rnn.initHidden()

        output_name = start_letter

        for i in range(max_length):
            output, hidden = rnn(category_tensor, input[0], hidden)
            topv, topi = output.topk(1)
            topi = topi[0][0]
            if topi == n_letters - 1:  # EOS index predicted: stop generating
                break
            else:
                letter = all_letters[topi]
                output_name += letter
            input = line_to_tensor(letter)

        return output_name

# Get multiple samples from one category and multiple starting letters
def samples(category, start_letters='ABC'):
    for start_letter in start_letters:
        print(sample(category, start_letter))

samples('Russian', 'RUS')
print("***")
samples('German', 'GER')
print("***")
samples('Spanish', 'SPA')


# Recent Advances

The long-term memory in LSTM is [a specific instance](https://arxiv.org/pdf/1601.06733.pdf) of a more generic concept called _Attention_. The concept of Attention was introduced to solve one problem: when doing _Neural Machine Translation_, the next word in the output sentence (in the output language) is not necessarily related to the last (or second-to-last) word in the input sentence (in the input language). Since simple RNNs mostly capture relationships between nearby elements, various styles of attention were tried to teach the model to look at a specific part of the input sentence in order to predict the next output word. [Many of these attention approaches](https://arxiv.org/abs/1409.0473) were successful and today far outperform LSTMs on the above tasks. 

![](images/self-attention.png)

One extremely successful kind of attention is _self-attention_. Here, instead of mapping relationships between an output sequence and an input sequence, we map relationships between the different words of the same sentence. Going down this path, it was realised that the self-attention mechanism is more than just an add-on to RNNs and that it might be possible to build entire networks around self-attention alone. In ["Attention is all you need" (Vaswani et al., 2017)](https://arxiv.org/pdf/1706.03762.pdf), a neural network architecture called the _Transformer_ was introduced that relies on self-attention layers in place of recurrence, and had some other innovations regarding memory. 
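
A single self-attention step can be sketched as scaled dot-product attention, the form used in the Transformer paper. The sizes and the random projection matrices below are assumptions for illustration; in a trained model the projections are learned, and several such "heads" run in parallel.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_model = 4, 8                  # 4 "words", each an 8-dimensional vector
x = torch.randn(seq_len, d_model)

# Hypothetical projection matrices; in a trained model these are learned
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

q, k, v = x @ W_q, x @ W_k, x @ W_v

# Every position scores every position of the same sequence
scores = q @ k.T / math.sqrt(d_model)
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
attended = weights @ v

print(weights.shape, attended.shape)     # torch.Size([4, 4]) torch.Size([4, 8])
```

Row `i` of `weights` says how much word `i` draws on every word in the sentence, regardless of how far apart they are.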

![](images/transformer.png)

In Feb 2019, a company called OpenAI introduced a variation of the transformer called GPT-2 and initially [refused to release](https://slate.com/technology/2019/02/openai-gpt2-text-generating-algorithm-ai-dangerous.html) the full model, citing concerns that it could be misused. This was a text generation model that could generate entire (_fake_) news articles from a prompt of one or a few words - think of it as autocomplete on steroids. They did eventually release it, and it is now available to try online: https://talktotransformer.com

In the name generation example above, it was easy to sample names from the network because the output had a similar shape to the input. But how can we create a network from which we sample higher dimensional data like images?