# Modelling Sequential Data using Recurrent Neural Networks

## Overview

* [Introducing sequential data](#Introducing-sequential-data)
    * [Representing sequences](#Representing-sequences)
    * [The different categories of sequence modelling](#The-different-categories-of-sequence-modelling)
* [RNN for modelling sequences](#RNN-for-modelling-sequences)
    * [Structure and flow of RNNs](#Structure-and-flow-of-RNNs)
    * [Computing activations in RNNs](#Computing-activations-in-RNNs)
    
* [Working with text data in Keras](#Working-with-text-data-in-Keras)
    * [More on one-hot encoding](#More-on-one-hot-encoding) - Contains link to <font color=green>**COLAB NOTEBOOK 01**</font>
    * [Word embeddings](#Word-embeddings) - Contains link to <font color=green>**COLAB NOTEBOOK 02**</font>
* [More on RNNs](#More-on-RNNs)
    * [Vanishing gradient problem](#Vanishing-gradient-problem)
    * [LSTM units](#LSTM-units)
    * [Using simple RNN and LSTM layers in Keras](#Using-simple-RNN-and-LSTM-layers-in-Keras) - Contains link to <font color=green>**COLAB NOTEBOOK 03**</font>
    * [Advanced use of RNNs](#Advanced-use-of-RNNs) - Contains **multiple** links to <font color=green>**COLAB NOTEBOOKS**</font>
    * [Sequence processing with CNN and RNN](#Sequence-processing-with-CNN-and-RNN) - Contains link to <font color=green>**COLAB NOTEBOOK**</font>
* [Summary of RNN](#Summary-of-RNN)

**Artificial Neural Networks in DAT300**

* So far we have used ANN for various tasks
    * Binary classification
    * Multiclass classification
    * Regression
    * CNN for image data

* ANN for analysis of sequential data
    * RNN for text data
    * RNN for time series data

**Classic approach to text classification: bag-of-words model**

<img src="./images/fig_07_00.PNG" />

**Limitations of the bag-of-words model**

* it is not an order-preserving tokenisation method
* the generated tokens are treated as a set, not a sequence
* the general structure of the sentence is lost

**n-grams for bag-of-words model**

* n-grams are a powerful feature engineering tool
* work well for shallow text processing
* models using n-grams are lightweight compared to RNN
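
The loss of word order, and what n-grams recover, can be illustrated with a minimal sketch. It assumes plain whitespace tokenisation, and the helper names `bag_of_words` and `ngrams` are hypothetical, not taken from a library.

```python
from collections import Counter

def bag_of_words(text):
    """Count tokens, ignoring their order (whitespace tokenisation is an assumption)."""
    return Counter(text.lower().split())

def ngrams(text, n=2):
    """Return the list of n-grams (tuples of n consecutive tokens)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "the dog bit the man"
b = "the man bit the dog"

same_bag = bag_of_words(a) == bag_of_words(b)       # word order is lost
same_bigrams = set(ngrams(a)) == set(ngrams(b))     # bigrams keep some local order
print(same_bag, same_bigrams)
```

Both sentences produce the same bag of words although their meaning differs, while their sets of bigrams are different.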

## Introducing sequential data

* So far we have worked with input data that is **independent and identically distributed (IID)**
    * we have $n$ data samples: $x^{(1)}, x^{(2)}, \dots, x^{(n)}$
    * the order in which we use the data for training our machine learning algorithm does not matter
* This assumption is **not valid anymore** when we deal with sequences — by definition, order matters

### Applications of RNN

* **Document classification** and **timeseries classification**, such as identifying the topic of an article or the author of a book
* **Timeseries comparisons**, such as estimating how closely related two documents or two stock tickers are
* **Sequence-to-sequence** learning, such as decoding an English sentence into French
* **Sentiment analysis**, such as classifying the sentiment of tweets or movie reviews as positive or negative
* **Timeseries forecasting**, such as predicting the future weather at a certain location, given recent weather data

### Representing sequences

* We've established that the elements of a sequence are **ordered and not independent** of each other
* We next need to find ways to leverage this valuable information in our machine learning model
* Sequences will be represented as 
    * $(x^{(1)}, x^{(2)}, \dots, x^{(T)})$
    * superscript indices indicate the **order** of the instances
    * **length** of sequence is $T$
    * **In time series**: each sample point $x^{(t)}$ belongs to a particular time $t$ 

#### Example

* Time-series data where both $x$'s and $y$'s naturally follow the order given by the time axis
* Therefore, both $x$'s and $y$'s are sequences


<img src="./images/fig_07_01.PNG" />

Figure from book **Python Machine Learning**, S. Raschka, V. Mirjalili, *Packt Publishing* (2017)

#### RNN vs. other standard neural network models

* Standard neural network models that we have covered so far, such as MLPs (dense NN) and CNNs, **are not capable** of handling the *order* of input samples
* Intuitively, one can say that such models do not have a *memory* of the past seen samples
* In the standard NNs covered so far, samples are passed through the feedforward and backpropagation steps, and the weights are updated *independently* of the order in which the samples are processed

* RNNs, by contrast, are designed for modeling sequences
* RNNs are capable of remembering past information and processing new events accordingly

### The different categories of sequence modelling

* Different types of sequence modeling tasks require **appropriate** models
* If **either** the input or output is a sequence, the data will form **one** of the following three different categories


<img src="./images/fig_07_02.PNG" width=600/>

Figure from book **Python Machine Learning**, S. Raschka, V. Mirjalili, *Packt Publishing* (2017)

* **Many-to-one**
    * *input data*: a sequence 
    * *output data*: a fixed-size vector, not a sequence
    * *example*: **sentiment analysis**, where the input is text-based and the output is a class label

* **One-to-many**: 
    * *input data*: standard format, not a sequence 
    * *output data*: a sequence 
    * *example*: **image captioning**, where the input is an image and the output is an English phrase

<img src="./images/fig_07_02_a_image_captioning.png" width=1300/>

* **Many-to-many**: 
    * *input data*: a sequence
    * *output data*: a sequence 
    * *examples*: 
        * *synchronised*: **video classification**, where each frame in a video is labeled. 
        * *delayed*: **translating** a language into another. For instance, an entire English sentence must be read and processed by a machine before producing its translation into German.

## RNN for modelling sequences

* Introduction of typical RNN structure
* Data flow through one or more hidden layers
* Computation of neuron activations in a typical RNN

### Structure and flow of RNNs

* A standard feedforward neural network and an RNN, shown side by side for comparison

<img src="./images/fig_07_03.PNG" width=600/>

Figure from book **Python Machine Learning**, S. Raschka, V. Mirjalili, *Packt Publishing* (2017)

* Both networks have only one hidden layer
* Generic RNN architecture corresponds to **two** modelling categories where the input data is a **sequence**
    * **many-to-many**, where output $y^{(t)}$ is a sequence
    * **many-to-one**, where only the last element of the output sequence $y^{(t)}$ is used, so the output can be treated as a standard, non-sequential unit

* **Standard feedforward network**: information flows from the input to the hidden layer, and then from the hidden layer to the output layer
* **RNN**: 
    * hidden layer gets its input from both the input layer and the hidden layer from the previous time step
    * flow of information in adjacent time steps in the hidden layer allows the network to have a memory of past events
    * flow of information is usually displayed as a loop, also known as a **recurrent edge** in graph notation, which is how this general architecture got its name
* **NOTE**: In Keras `Sequential` models the input is not added as a layer with weights of its own; only the input shape is declared, and the layers you add are hidden layers and output layers
    

### Single layer RNN

* **Standard neural networks**: a unit in the hidden layer receives only one input, namely the net preactivation associated with the input layer
* **RNN**: units in the hidden layer receive two distinct inputs:
    * **preactivation** from **input** layer
    * **activation** from **same** hidden layer from previous time step $t-1$

<img src="./images/fig_07_04.PNG" width=600/>

Figure from book **Python Machine Learning**, S. Raschka, V. Mirjalili, *Packt Publishing* (2017)

### Multilayer RNN



* **layer=1**: the first hidden layer is represented as $h_{1}^{(t)}$ and gets its input from the data point $x^{(t)}$ and activations from the same hidden layer, but the previous time step $h_{1}^{(t-1)}$
* **layer=2**: the second hidden layer $h_{2}^{(t)}$ receives its inputs (activations) from the hidden units of the layer below at the current time step $(h_{1}^{(t)})$ and its own activations from the same layer at the previous time step $h_{2}^{(t-1)}$

<img src="./images/fig_07_05.PNG" width=700/>

Figure from book **Python Machine Learning**, S. Raschka, V. Mirjalili, *Packt Publishing* (2017)

### Computing activations in RNNs

* Learn how to compute the activations of
    * hidden layer
    * output layer
* Discuss computations for RNN with only one hidden layer
* Same concept applies to multilayer RNN

**Weight matrices in RNNs**

* $W_{xh}$: weight matrix between input $x^{(t)}$ and hidden layer $h$
* $W_{hh}$: weight matrix associated with recurrent edge
* $W_{hy}$: weight matrix between hidden layer $h$ and output layer $y$ 

<img src="./images/fig_07_06.PNG" width=900/>

Figure from book **Python Machine Learning**, S. Raschka, V. Mirjalili, *Packt Publishing* (2017)

**Activation computations of hidden layer units**

* net input: $z_{h}^{(t)}$
* bias vector for hidden units: $\bf{\it{b_{h}}}$
* activation function of hidden layer: $\phi (\cdot)$

<img src="./images/eq_07_00.PNG" width=500/>

<img src="./images/eq_07_01.PNG" width=750/>

<img src="./images/fig_07_06.PNG" width=600/>

**Alternative activation computations of hidden layer units**

* concatenated weight matrix: $W_{h}$

<img src="./images/eq_07_02.PNG" width=300/>

<img src="./images/eq_07_03.PNG" width=600/>

<img src="./images/fig_07_06.PNG" width=600/>

**Activation computations of output units**

* output: $y^{(t)}$

<img src="./images/eq_07_04.PNG" width=500/>

<img src="./images/fig_07_06.PNG" width=600/>

**Graphical illustration of activation computations**

<img src="./images/eq_07_05.PNG" width=1000/>

Figure from book **Python Machine Learning**, S. Raschka, V. Mirjalili, *Packt Publishing* (2017)
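
The activation computations can be sketched in plain Python using the weight matrices $W_{xh}$, $W_{hh}$ and $W_{hy}$ introduced above. This is a minimal sketch, not a reference implementation: the zero initial hidden state, `tanh` as $\phi$, the identity output activation, the output bias `b_y` and the toy weights are all assumptions made for illustration.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a single-layer RNN over the sequence xs.

    Returns the output at every time step and the final hidden state.
    """
    h = [0.0] * len(b_h)                      # h^(0) is assumed to be all zeros
    outputs = []
    for x in xs:
        # net input: z_h^(t) = W_xh x^(t) + W_hh h^(t-1) + b_h
        z_h = [a + b + c for a, b, c in zip(matvec(W_xh, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(z) for z in z_h]       # h^(t) = phi(z_h^(t)), with phi = tanh
        # output: y^(t) = W_hy h^(t) + b_y (identity output activation assumed)
        outputs.append([a + b for a, b in zip(matvec(W_hy, h), b_y)])
    return outputs, h

# Toy weights: 1 input feature, 1 hidden unit, 1 output unit
W_xh, W_hh, W_hy = [[1.0]], [[0.5]], [[1.0]]
b_h, b_y = [0.0], [0.0]

ys_fwd, h_fwd = rnn_forward([[1.0], [0.0]], W_xh, W_hh, W_hy, b_h, b_y)
ys_rev, h_rev = rnn_forward([[0.0], [1.0]], W_xh, W_hh, W_hy, b_h, b_y)
print(ys_fwd[-1], ys_rev[-1])   # the two final outputs differ: order matters
```

Using all of `outputs` corresponds to the many-to-many case, using only `outputs[-1]` to the many-to-one case. In practice these loops are replaced by vectorised library code such as the Keras recurrent layers.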

##  Working with text data in Keras

* You have learned in previous lectures how to carry out **sentiment analysis** with **scikit-learn** and **Keras** using ANN with dense layers
* Now you will learn how to work with and **preprocess** text in **Keras** before analysing it with RNN models built in Keras

### More on one-hot encoding

<font color=green>**COLAB NOTEBOOK 01**</font>: code on how to do [one-hot encoding of words and characters](https://colab.research.google.com/drive/1UTb1k0zyGe5_j4bpPBrJ3bS9yORDORBY?usp=sharing)

### Word embeddings

* One-hot encoding generates large sparse matrices (many zeros)
* A vocabulary of 20 000 words results in a sparse matrix containing 20 000 features 
    * each word is represented by a vector of dimension 20 000
    * each vector contains 19 999 zeros and a single one
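
How quickly one-hot vectors become sparse can be seen in a small sketch. It assumes whitespace tokenisation and two made-up sample sentences; `build_vocabulary` and `one_hot` are hypothetical helper names.

```python
def build_vocabulary(texts):
    """Map every distinct token to an integer index (whitespace tokenisation assumed)."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def one_hot(token, vocab):
    """Return a vector of len(vocab) zeros with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

samples = ["The cat sat on the mat", "The dog ate my homework"]
vocab = build_vocabulary(samples)
vec = one_hot("dog", vocab)
print(len(vocab), vec)
```

Every vector has as many entries as there are distinct words, and all but one entry are zero.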

* A more elegant way of representing words is **embedding**
    * use finite-sized vectors to represent words
    * these finite-sized vectors contain real numbers
* Idea behind embedding
    * embedding is a feature learning technique 
    * embedding automatically learns salient features to represent the words in a dataset
    * `embedding_size << unique_words` is chosen to represent the entire vocabulary as input features
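
An embedding can be pictured as a lookup table with one row of real numbers per word. The sketch below assumes toy sizes and random initial values (in a real model the rows are trained); `embed` and `embed_via_one_hot` are hypothetical names. It also shows that the lookup gives the same result as multiplying a one-hot vector with the table, without ever building the sparse vector.

```python
import random

random.seed(0)
unique_words, embedding_size = 10, 3     # toy sizes; real vocabularies are far larger

# Embedding matrix: one row of real numbers per word (random here, trainable in a model)
E = [[random.uniform(-1, 1) for _ in range(embedding_size)] for _ in range(unique_words)]

def embed(word_index):
    """Embedding lookup: select the row that belongs to the word."""
    return E[word_index]

def embed_via_one_hot(word_index):
    """Equivalent but wasteful: multiply a one-hot vector with the matrix."""
    one_hot = [1.0 if i == word_index else 0.0 for i in range(unique_words)]
    return [
        sum(one_hot[i] * E[i][j] for i in range(unique_words))
        for j in range(embedding_size)
    ]

print(embed(4))
print(embed_via_one_hot(4))
```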

**Advantages of embedding over one-hot encoding**

* A reduction in the dimensionality of the feature space to decrease the effect of the curse of dimensionality
* The extraction of salient features since the embedding layer in a neural network is trainable

<img src="./images/fig_07_09.PNG" width=700/>

Figure from book **Python Machine Learning**, S. Raschka, V. Mirjalili, *Packt Publishing* (2017)

**A toy example of a word-embedding space**

* Four words embedded in a 2D plane: *cat, dog, wolf, tiger*
* There is a **semantic relationship** between the words
* Semantic relationship can be encoded as **geometric transformation**

<img src="./images/fig_07_10.PNG" width=350/>

* **From pet to wild animal** vector
    * from dog to wolf
    * from cat to tiger
* **From canine to feline** vector
    * from dog/wolf to cat/tiger

**A toy example of a word-embedding space**

* Assume cat, dog, wolf and tiger are part of a 50 000-word animal vocabulary we want to work with
* One-hot encoding: each of the four words would result in a 50 000-dimensional vector
* In this embedding example, cat, dog, wolf and tiger can each be represented meaningfully by a 2-dimensional vector
* Word vectors from embedding
    * cat --> [0.7, 0.3]
    * dog --> [0.4, 0.4]
    * wolf  --> [0.4, 0.9]
    * tiger --> [0.8, 0.8]

<img src="./images/fig_07_10.PNG" width=250/>
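
With the four toy vectors listed above, the geometric relationships can be checked numerically. This is a small sketch; `diff` and `cosine` are hypothetical helper names, and cosine similarity is used here as one possible way to compare directions.

```python
import math

embedding = {
    "cat":   [0.7, 0.3],
    "dog":   [0.4, 0.4],
    "wolf":  [0.4, 0.9],
    "tiger": [0.8, 0.8],
}

def diff(a, b):
    """Vector pointing from word a to word b."""
    return [y - x for x, y in zip(embedding[a], embedding[b])]

def cosine(u, v):
    """Cosine similarity: 1 means the two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

pet_to_wild = cosine(diff("dog", "wolf"), diff("cat", "tiger"))
canine_to_feline = cosine(diff("dog", "cat"), diff("wolf", "tiger"))
print(round(pet_to_wild, 3), round(canine_to_feline, 3))
```

Both similarities are close to 1, so "from pet to wild animal" and "from canine to feline" each correspond to roughly one direction in this toy space.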

**Real world embeddings**

Common examples of meaningful geometric transformations are

* "gender" vectors
* "plural" vectors

Examples

* By adding a “female” vector to the vector “king” we obtain the vector “queen”
* By adding a “plural” vector to the vector “king”, we obtain “kings”

**Ways to obtain word embedding**

* **Learn** word embeddings **jointly** with the main task you care about (such as document classification or sentiment prediction). In this setup, you start with random word vectors and then learn word vectors in the same way you learn the weights of a neural network.
* **Load into your model** word embeddings that were **precomputed** using a different machine-learning task than the one you’re trying to solve. These are called *pretrained word embeddings*.

Pretrained word embeddings
 * [GloVe](https://nlp.stanford.edu/projects/glove/)  (Global Vectors for Word Representation)

**Example of how to work with word embeddings in Keras**

<font color=green>**COLAB NOTEBOOK 02**</font>: code on how to use [word embeddings](https://colab.research.google.com/drive/16Mrqyj7NUFMALPf1-BxpYkpzOzCtSYgT?usp=sharing)

## More on RNNs

The following section discusses

* Vanishing gradient problem
* Layers that can handle the vanishing gradient problem
* Advanced use of recurrent networks

### Vanishing gradient problem

<img src="./images/fig_07_11b.png" width=600/>

* Use of backpropagation to update weights **across many** layers 
    * this leads to vanishing gradient problem
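    * root of the problem: the gradient at a given layer is the **product** of the gradient terms of all layers the error signal has **previously** passed through on its way back from the output, so many factors smaller than 1 shrink it towards zero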

Computation of error in hidden layer

<img src="./images/fig_07_11c.png" width=600/>

Computation of gradients for updating weights of hidden layer

<img src="./images/fig_07_11cc.png" width=500/>

#### The sigmoid activation function and its derivative

<img src="./images/fig_07_11d.jpg" width=600/>

#### The hyperbolic tangent activation function and its derivative

<img src="./images/fig_07_11e.jpeg" width=600/>
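
A rough numerical feel for the effect of these derivatives, as a sketch under a simplifying assumption: the weights are ignored and only the activation-function derivatives are multiplied, using the best case where every factor takes its maximum value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_derivative(z):
    return 1.0 - math.tanh(z) ** 2

# The sigmoid derivative peaks at z = 0 with value 0.25
max_derivative = sigmoid_derivative(0.0)

# Best case for sigmoid: each of n layers (or time steps) contributes the maximum factor
for n in (1, 5, 10, 20):
    print(n, max_derivative ** n)

# tanh reaches derivative 1 at z = 0 but drops quickly once the unit saturates
print(tanh_derivative(0.0), tanh_derivative(2.0))
```

Even in this best case the product of sigmoid derivatives shrinks by a factor of 4 per layer or time step.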

### Vanishing gradient problem with RNN

* RNNs should **theoretically** be able to retain at time $t$ information about inputs seen many timesteps before
* In practice, such long-term dependencies are **impossible to learn** because of the **vanishing gradient problem** (more on this on the next slides)
* Therefore simple RNNs are **too simplistic** for real use, and more advanced units are required (like LSTM or GRU units)
* See illustration article [Visualizing memorization in RNNs](https://distill.pub/2019/memorization-in-rnns/) at Distill.pub

<img src="./images/fig_07_11f.png"/>

[Image source @GitHub by Raschka](https://github.com/rasbt/stat479-deep-learning-ss19/blob/master/L14_intro-rnn/L14_intro-rnn-part2_slides.pdf)

<img src="./images/fig_07_11g.png"/>

<img src="./images/fig_07_11h.png"/>

<img src="./images/fig_07_11i.png"/>

<img src="./images/fig_07_11j.png"/>

<img src="./images/fig_07_11k.png"/>

<img src="./images/fig_07_11l.png"/>

**Solution**


Various solutions exist such as

* **Gradient clipping (solves exploding gradients)**
    - set a max value for gradients if they grow too large
    - setting a max value solves only exploding gradient problem
* **Truncated Backpropagation through time (TBPTT)**
    - limit the number of time steps the signal can backpropagate through after each forward pass
    - Example: even if sequence has 100 elements / steps, only backpropagate through fewer steps
* **Long Short-Term Memory (LSTM) units (most common)**
    - Architecture that avoids vanishing / exploding gradient problems
    - Sepp Hochreiter and Jürgen Schmidhuber, Long short-term memory, *Neural computation*, 9, no. 8, (1997): 1735 - 1780
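
Gradient clipping can be sketched in two common variants: clipping each value, or rescaling the whole gradient vector by its norm. The function names and the threshold of 1.0 are assumptions chosen for illustration.

```python
import math

def clip_by_value(gradient, max_value):
    """Limit every gradient component to the range [-max_value, max_value]."""
    return [max(-max_value, min(max_value, g)) for g in gradient]

def clip_by_norm(gradient, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]

exploding = [3.0, -4.0]                      # L2 norm 5
vanishing = [1e-6, -2e-6]                    # already tiny

print(clip_by_value(exploding, 1.0))
print(clip_by_norm(exploding, 1.0))          # direction is kept, length becomes 1
print(clip_by_norm(vanishing, 1.0))          # unchanged: clipping does not help here
```

The last call shows why clipping addresses only exploding gradients: a gradient that is already tiny passes through unchanged.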

### LSTM units

* LSTMs were introduced and **designed to overcome** the vanishing gradient problem
* LSTM adds a way to carry information **across many timesteps**
* In this way LSTM **saves information for later**, thus **preventing older signals** from gradually vanishing during processing by re-injecting past information from previous time steps
* The building block of an LSTM is a **memory** cell, which essentially represents the hidden layer
* Unfolded structure of a **modern** LSTM cell is shown on next slide

<img src="./images/fig_07_12a.png" width=1000/>

<img src="./images/fig_07_12.PNG" width=850/>

* $\bf{C}^{(t-1)}$: cell state at previous time step $t-1$
* $\oplus$: elementwise addition
* $\odot$: elementwise multiplication
* $\bf{h}^{(t-1)}$: hidden units activation at previous time step $t-1$
* $\bf{x}^{(t)}$: input data at current time step $t$

Alternative presentation of LSTM. Note that $g$ here is represented by $\widetilde{C}$.

<img src="./images/fig_07_12_alt.PNG" width=850/>

<img src="./images/fig_07_12b.png" width=1000/>

<img src="./images/fig_07_12c.png" width=1000/>

<img src="./images/fig_07_12d.png" width=1000/>

**Forget gate** $f_{t}$

* Allows the memory cell to reset the cell state $\bf{C}$ to avoid $\bf{C}$ growing indefinitely
* Decides which information is allowed to go through and which information to suppress (controls how much of the old cell value is used in the new cell value)

<img src="./images/fig_07_13.PNG" width=600/>

<img src="./images/fig_07_12e.png" width=1000/>

**Input gate** $i_{t}$ and **input node** $g_{t}$

* Input gate $i_{t}$ and input node $g_{t}$ are responsible for updating the cell state $\bf{C}$

<img src="./images/fig_07_14.PNG" width=600/>

<img src="./images/fig_07_12f.png" width=1000/>

**Computation of cell state** $\bf{C}^{(t)}$ **at time step** $t$

<img src="./images/fig_07_15.PNG" width=500/>

**Output gate** $o_{t}$

* Output gate $o_{t}$ decides how to update the values of hidden units (controls how much of the new cell value is outputted)

<img src="./images/fig_07_16.PNG" width=600/>

<img src="./images/fig_07_12g.png" width=1000/>

**Computation of hidden units activations** $\bf{h}^{(t)}$ **at time step** $t$

<img src="./images/fig_07_17.PNG" width=500/>

<img src="./images/fig_07_12h.png" width=1400/>
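
The gate computations can be combined into one LSTM time step. The following is a minimal plain-Python sketch of the standard formulation (sigmoid gates, `tanh` input node, $C^{(t)} = f_t \odot C^{(t-1)} \oplus i_t \odot g_t$, $h^{(t)} = o_t \odot \tanh(C^{(t)})$). The `params` layout and the toy weights are assumptions; the biases are chosen by hand so that the forget gate is almost open and the input gate almost closed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_x, W_h, b, x, h):
    """Compute W_x x + W_h h + b for one gate (matrices are lists of rows)."""
    return [
        sum(w * v for w, v in zip(rx, x)) + sum(w * v for w, v in zip(rh, h)) + bias
        for rx, rh, bias in zip(W_x, W_h, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to its (W_x, W_h, b)."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]     # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]     # input gate
    g = [math.tanh(z) for z in affine(*params["g"], x, h_prev)]   # input node
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]     # output gate
    # new cell state: keep part of the old state, add part of the candidate
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Toy cell with 1 input and 1 hidden unit (hand-picked, not trained)
params = {
    "f": ([[0.0]], [[0.0]], [10.0]),    # forget gate close to 1: keep the cell state
    "i": ([[0.0]], [[0.0]], [-10.0]),   # input gate close to 0: ignore new input
    "g": ([[1.0]], [[0.0]], [0.0]),
    "o": ([[0.0]], [[0.0]], [10.0]),
}

h, c = [0.0], [1.0]
for _ in range(50):
    h, c = lstm_step([0.5], h, c, params)
print(c)   # still close to the initial cell state after 50 time steps
```

With these settings the cell state is carried across 50 time steps almost unchanged, which is the mechanism that lets an LSTM retain older information.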

**Further reading on LSTM**

* [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/?source=post_page-----37e2f46f1714----------------------)
* [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

### GRU units

* Gated Recurrent Units (GRU)
* GRUs have a simpler architecture than LSTMs 
    * fewer parameters
    * ==> computationally more efficient
    * for some tasks, performance comparable to that of LSTM
* There are reports that GRUs perform better than LSTMs on smaller datasets
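
The difference in parameter count can be estimated with a short sketch. It assumes the textbook formulations: an LSTM has four weight sets (forget gate, input gate, input node, output gate) and a GRU has three (update gate, reset gate, candidate state), each consisting of an input matrix, a recurrent matrix and one bias vector. Library implementations can deviate slightly, for example by using an additional bias vector in the GRU.

```python
def lstm_params(input_dim, units):
    """4 weight sets: forget gate, input gate, input node, output gate."""
    return 4 * (units * input_dim + units * units + units)

def gru_params(input_dim, units):
    """3 weight sets: update gate, reset gate, candidate state (textbook GRU)."""
    return 3 * (units * input_dim + units * units + units)

d, h = 32, 32   # example sizes: 32 input features, 32 recurrent units
print(lstm_params(d, h), gru_params(d, h))
```

Under these assumptions a GRU layer needs three quarters of the parameters of an LSTM layer of the same size.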

### Using simple RNN and LSTM layers in Keras

<font color=green>**COLAB NOTEBOOK 03**</font>: code on how to [use simple RNN and LSTM layers in Keras](https://colab.research.google.com/drive/1d1d4KdN-3_IizrO98L6e3VAwOE11f6mT?usp=sharing)

**Weights in RNN using LSTM units**

<font color=green>**COLAB NOTEBOOK 04**</font>: code on [how to determine LSTM weights](https://colab.research.google.com/drive/14GzWAUpCevvjAVKoJomUS56oxMPnOzo7?usp=sharing)


### Advanced use of RNNs

There are several ways to improve the performance and generalisation power of RNNs.

* **Recurrent dropout** - This is a specific, built-in way to use dropout to fight overfitting in recurrent layers
* **Stacking recurrent layers** - This increases the representational power of the network (at the cost of higher computational loads)
* **Bidirectional recurrent layers** - These present the same information to a recurrent network in different ways, increasing accuracy and mitigating forgetting issues


All three concepts will be discussed by applying them to a **temperature-forecasting** problem.

**A temperature-forecasting problem**

* So far, the only sequence data covered were text data
* In the following examples timeseries data will be used (weather timeseries)
* Weather data recorded at the [Max Planck Institute for Biogeochemistry](https://www.bgc-jena.mpg.de/wetter/) in Jena, Germany
    * 14 different quantities measured (such as air temperature, atmospheric pressure, wind direction, etc.)
    * Recorded every 10 minutes across many years
    * Original data goes back to 2003
    * Data used in examples: 2009 - 2016
* Use data to build model that:
    * takes as input data from recent past
    * predicts air temperature 24 hours in future
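
Turning such a series into training samples can be sketched as a sliding window. The function name `make_windows`, the toy series and the small `lookback` and `delay` values are assumptions for illustration; with one measurement every 10 minutes, a 24-hour horizon corresponds to 144 time steps.

```python
SAMPLES_PER_HOUR = 6                    # one measurement every 10 minutes
DELAY_24H = 24 * SAMPLES_PER_HOUR       # 144 time steps = 24 hours ahead

def make_windows(series, lookback, delay):
    """Return (inputs, target) pairs from a 1D series.

    inputs: `lookback` consecutive past values
    target: the value `delay` time steps after the last input value
    """
    windows = []
    for end in range(lookback, len(series) - delay + 1):
        inputs = series[end - lookback:end]
        target = series[end - 1 + delay]
        windows.append((inputs, target))
    return windows

toy_series = list(range(20))            # stand-in for a temperature column
windows = make_windows(toy_series, lookback=6, delay=3)
print(windows[0])
print(len(windows))
```

For the real data one would pass `delay=DELAY_24H` and a `lookback` covering several days; the order of values inside each window is preserved, which is what the recurrent layer exploits.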

**Some Colab notebooks**

<font color=green>**COLAB NOTEBOOK 05.1**</font>: code on [time series data generator](https://colab.research.google.com/drive/1j1-nT4fdVCUnLUZsoAO5GHhGUH8-1m1F?usp=sharing)

<font color=green>**COLAB NOTEBOOK 05.2**</font>: code on [ANN with dense layers for time series data](https://colab.research.google.com/drive/1zNplubUnorV1othZXe_LmvXkxgqsUcRP?usp=sharing)

<font color=green>**COLAB NOTEBOOK 05.3**</font>: code on [baseline RNN for time series data](https://colab.research.google.com/drive/1nyfR5WXyICwKhFccJ2CmQgjJs4F33DsS?usp=sharing)

<font color=green>**COLAB NOTEBOOK 05.4**</font>: code on [RNN with dropout for time series data](https://colab.research.google.com/drive/1Uw4rCu7WmufOmHTzgu9Fla-zfocS_bug?usp=sharing)

<font color=green>**COLAB NOTEBOOK 05.5**</font>: code on [stacked RNN for time series data](https://colab.research.google.com/drive/1EV1MET240mJlS6_xhzr1pMggNxp391pT?usp=sharing)

<font color=green>**COLAB NOTEBOOK 05.6**</font>: code on [bidirectional RNN](https://colab.research.google.com/drive/1QlbhgsXpaVQtE0sgWS_yP5OKc1K0yY-Z?usp=sharing)

### Sequence processing with CNN and RNN

**Previously learned properties of CNN**

* Perform particularly well on computer vision problems
    * ability to operate **convolutionally**
    * **extracting features** from **local input patches**
    * allowing for **representation modularity** and **data efficiency**

* **Same properties** that make CNNs excel at computer vision also make them **highly relevant to sequence processing**
* Time can be treated as a **spatial dimension**, like the height or width of a 2D image

* 1D convnets can be **competitive** with RNNs on **certain** sequence-processing problems, usually at a **considerably cheaper computational cost**
* **Small** 1D convnets can offer a **fast alternative** to RNNs for **simple tasks** such as **text classification** and **timeseries forecasting**

#### Understanding 1D convolution for sequence data

* 1D convolutions extract local 1D patches (subsequences) from sequences
* Such 1D convolution layers can recognize local patterns in a sequence
* Same input transformation is performed on every patch
* ==> A pattern learned at a certain position in a sentence can later be recognized at a different position
* ==> Makes 1D convnets translation invariant (for temporal translations)
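
The idea that one kernel recognises the same pattern at different positions can be shown with a hand-written 1D convolution. This is a sketch with a fixed, hand-picked kernel (in a convnet the kernel is learned), and `conv1d` is a hypothetical helper, not a library call.

```python
def conv1d(sequence, kernel):
    """Valid 1D convolution (cross-correlation, as used in deep learning libraries)."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    ]

pattern = [1, -1, 1]                      # kernel acting as a detector for this pattern
seq_a = [0, 0, 1, -1, 1, 0, 0, 0, 0]      # pattern starts at position 2
seq_b = [0, 0, 0, 0, 0, 1, -1, 1, 0]      # same pattern starts at position 5

resp_a = conv1d(seq_a, pattern)
resp_b = conv1d(seq_b, pattern)
print(resp_a)
print(resp_b)
```

The strongest response has the same value in both sequences and simply moves with the pattern, so the same weights detect it wherever it occurs.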

Example: 

* a 1D convnet processing **sequences of characters** using **convolution windows** of size 5 should be able to learn **words or word fragments of length 5 or less**
* should be able to **recognise** these words **in any context** in an **input sequence**
* ==> A **character-level** 1D convnet is thus able to learn about **word morphology**

<img src="./images/fig_07_21.png" width=600/>

Figure from book **Deep Learning with Python**, Francois Chollet, *Manning Publications* (2018)

**1D pooling for sequence data**

* Previously learned about 2D pooling operations, such as 2D average pooling and max pooling, used in convnets to spatially downsample image tensors
* 2D pooling operation has a 1D equivalent
    * extracting 1D patches (subsequences) from an input
    * outputting the maximum value (max pooling) or average value (average pooling)
* Just as with 2D convnets, this is used for reducing the length of 1D inputs (subsampling)
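
A minimal sketch of 1D pooling, assuming non-overlapping windows (stride equal to the pool size) and dropping an incomplete window at the end; `pool1d` is a hypothetical helper name.

```python
def pool1d(sequence, pool_size, mode="max"):
    """Downsample a sequence by taking the max or average of each window."""
    if mode == "max":
        reduce = max
    else:
        reduce = lambda window: sum(window) / len(window)
    return [
        reduce(sequence[i:i + pool_size])
        for i in range(0, len(sequence) - pool_size + 1, pool_size)
    ]

seq = [1, 3, 2, 8, 5, 4, 7]
print(pool1d(seq, 2, mode="max"))
print(pool1d(seq, 2, mode="average"))
```

A sequence of length 7 is reduced to length 3 here, which is the subsampling effect described above.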


<font color=green>**COLAB NOTEBOOK**</font>: code on [how to do sequence processing with CNN and RNN](https://colab.research.google.com/drive/1k705LQR6DN8_MEZxndsMXWBa5CHF6idr)

<img src="./images/fig_07_22.png" width=600/>

Figure from book **Deep Learning with Python**, Francois Chollet, *Manning Publications* (2018)

## Summary of RNN

You have learned the following techniques, which are widely applicable to sequence data, from text to timeseries

* How to tokenize text
* What word embeddings are, and how to use them
* What recurrent networks are, and how to use them
* How to stack RNN layers and use bidirectional RNNs to build more-powerful sequence-processing models
* How to use 1D convnets for sequence processing
* How to combine 1D convnets and RNNs to process long sequences

RNNs can be used for

* timeseries regression (“predicting the future”)
* timeseries classification
* anomaly detection in timeseries
* sequence labeling (such as identifying names or dates in sentences)

1D CNN can be used for

* machine translation (sequence-to-sequence convolutional models, like SliceNet)
* document classification
* spelling correction

**Importance of global order**

* If **global order matters** in your sequence data, then it’s preferable to use a **recurrent network** to process it
* This is typically the case for timeseries, where the **recent past** is likely to be **more informative** than the **distant past**

* If **global ordering isn’t fundamentally meaningful**, then 1D convnets will turn out to work at least as well and are cheaper
* This is often the case for text data, where a keyword found at the **beginning** of a sentence is **just as meaningful** as a keyword found **at the end**