# Recurrent Neural Network Guide: A Deep Dive into RNNs

Sequence modeling is the task of building a mathematical model of sequential data and using that model to generate, predict, or classify sequences for a specific application.

Sequential data has three properties:

1. Elements in the sequence can repeat
2. It follows order (contextual arrangement)
3. Length of data varies (potentially infinitely)

Examples of sequential data include:

1. Text and sentences
2. Audio (e.g., speech and music)
3. Motion pictures or videos
4. Time-series data (e.g., stock market data)
5. DNA sequence, protein structures
6. Material composition 
7. Decision-making

Sequence data is difficult to model because of these properties, and it requires a different approach. For instance, a feed-forward network might not model sequential data well, because **sequential data has variable length.** A feed-forward network works well with fixed-size input, and doesn’t take structure into account well.

Convolutional neural networks, on the other hand, were created to process structures, or grids of data, such as an image. They can deal with long sequences of data, but are limited by the fact that **they can’t order the sequence correctly.** 

So, **how do we build deep learning models that can model sequential data?** 

When we process sequential data, we try to model the input sequence itself. Unlike a supervised learning task, where we map an input to an output, in sequence modeling we try to model how **probable** a sequence is.

- Data: $\{x_i\}_{i=1}^N$
- Model: $p(x) = f_\theta(x)$
- Loss (log-likelihood): $L(\theta) = \sum_{i=1}^N \log p(x_i) = \sum_{i=1}^N \log f_\theta(x_i)$
- Optimization: $\theta^* = \arg \max_\theta L(\theta)$


This gives machine learning or deep learning models the ability to generate likely sequences, or to estimate the likelihood of a given sequence. The rest of the process, computing the loss function and optimizing it, remains the same.
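To make the data, model, loss, and optimization steps concrete, here is a minimal sketch in Python. It assumes the simplest possible model, a categorical distribution where $f_\theta(x)$ is just a lookup table `theta[x]`; the toy dataset and the names `log_likelihood`, `theta_mle`, and `theta_uniform` are made up for illustration. For this particular model the maximizer of $L(\theta)$ is known in closed form (the empirical frequencies), so no gradient-based optimizer is needed.

```python
import math
from collections import Counter

def log_likelihood(theta, data):
    """L(theta) = sum_i log p(x_i), with p(x) given by the table theta[x]."""
    return sum(math.log(theta[x]) for x in data)

# Toy dataset of N = 6 observations (made up for illustration).
data = ["a", "b", "a", "a", "c", "a"]

# For a categorical model, L(theta) is maximized by the empirical frequencies.
counts = Counter(data)
theta_mle = {x: c / len(data) for x, c in counts.items()}

# A different valid distribution over the same symbols, for comparison.
theta_uniform = {x: 1 / len(counts) for x in counts}

ll_mle = log_likelihood(theta_mle, data)
ll_uniform = log_likelihood(theta_uniform, data)

print(f"log-likelihood (MLE):     {ll_mle:.4f}")
print(f"log-likelihood (uniform): {ll_uniform:.4f}")
```

The maximum-likelihood parameters score higher on the data than the uniform alternative, which is what $\arg\max_\theta L(\theta)$ asks for. A neural sequence model follows the same recipe, only with a far more expressive $f_\theta$ and an iterative optimizer.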

## How to model sequences: Modeling p(x)

Assuming that the words in a sentence are independent of each other, we can use a corpus to estimate how probable each word in the English language is.

Once we know the probability of each word (from the corpus), we can find the probability of the entire sentence by multiplying the probabilities of the individual words.

For instance, if we were to model the sentence “Cryptocurrency is the next big thing”, then it would look something like this:

p(“Cryptocurrency”)p(“is”)p(“the”)p(“next”)p(“big”)p(“thing”)

The above model can be described in a formula:

$$
p(x) = \prod_{t=1}^T p(x_t)
$$

Each word is indexed by a time step $t = 1, \dots, T$, which describes the position of that word in the sequence.
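The product formula can be sketched in a few lines of Python. This is an illustration under stated assumptions: the twelve-word `corpus` below is invented and stands in for a real corpus, and `sentence_probability` is a hypothetical helper name. Word probabilities are estimated as relative frequencies.

```python
import math
from collections import Counter

# Tiny made-up corpus standing in for a real one (an assumption for illustration).
corpus = "the next big thing is the cryptocurrency and the thing is big".split()

# Estimate p(word) as its relative frequency in the corpus.
counts = Counter(corpus)
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}

def sentence_probability(words, probs):
    """p(x) = prod_t p(x_t), assuming the words are independent."""
    return math.prod(probs[w] for w in words)

sentence = "cryptocurrency is the next big thing".split()
p = sentence_probability(sentence, unigram)

# Reversing the word order multiplies the same factors, so the score is unchanged.
p_reversed = sentence_probability(list(reversed(sentence)), unigram)

print(p, p_reversed)
```

Because multiplication does not care about order, the sentence and its reversal receive the same probability (up to floating-point rounding), which already hints that this model knows nothing about word order.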

But, it turns out that the model described above does not really capture the structure of the sequence. Why?

Because the model ignores word order, and a single word can be more probable than every other word. In our example, the word “the” is more probable than any other word, so the most likely six-word sequence under this model is “the the the the the the”.
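This failure mode is easy to reproduce. The sketch below assumes a hand-written table of word probabilities (the numbers in `unigram` are invented, not measured from any corpus) and a hypothetical helper `most_probable_sequence`. Under the independence assumption, the product is maximized by choosing the single most probable word at every position.

```python
# Made-up word probabilities for illustration; they sum to 1.
unigram = {
    "the": 0.40,
    "is": 0.20,
    "big": 0.15,
    "thing": 0.10,
    "next": 0.10,
    "cryptocurrency": 0.05,
}

def most_probable_sequence(probs, length):
    """With independent words, every position is filled by the same top word."""
    best = max(probs, key=probs.get)
    return [best] * length

print(" ".join(most_probable_sequence(unigram, 6)))
# prints: the the the the the the
```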

## Modeling p(x|context)

We can modify the model by introducing conditional probability: instead of assuming that each word is independent, we assume that each word depends on the words that precede it. The last word of a sequence is then modeled as $p(x_T \mid x_1, \dots, x_{T-1})$.

The same sentence “Cryptocurrency is the next big ______” can now have a range of options to choose from. For example:

| Target | $p(x \mid \text{context})$ |
|--------|----------------------------|
| Stuff  | 0.0002                     |
| Thing  | 0.01                       |
| Coin   | 0.00003                    |


**Essentially, the conditional probability describes how likely each candidate for the next word is, given the words before it.**
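One simple way to estimate such a table is to count which words follow a given context in a corpus. The sketch below makes a strong simplifying assumption that is not part of the model above: the context is truncated to the single previous word (a bigram approximation). The four-sentence `corpus` and the helper `next_word_distribution` are invented for illustration, so the resulting numbers differ from the table.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption for illustration).
corpus = [
    "cryptocurrency is the next big thing",
    "this is the next big thing",
    "the next big step",
    "a big coin",
]

# Count how often each word follows each one-word context.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1

def next_word_distribution(context_word):
    """Estimate p(x | context) by counting, with context = the previous word only."""
    counts = follow_counts[context_word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

dist = next_word_distribution("big")
print(dist)
```

In this toy corpus, "thing" follows "big" in two of four cases, so it receives probability 0.5, while "step" and "coin" receive 0.25 each. Conditioning on the full history instead of one word quickly runs into contexts that never occur in the corpus, which is one motivation for learned models.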

But the above example predicts only one word at a time; to model a whole sequence of words, we need to calculate the joint probability from the conditionals.

$$
p(x) = \prod_{t=1}^T p(x_t \mid x_1, \dots, x_{t-1})
$$
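The chain-rule product above can be sketched in Python as follows. Everything concrete here is an assumption for illustration: the three-word vocabulary, the made-up numbers in `COND`, the `"<s>"` start marker, and the helper names. For brevity, `cond_prob` receives the full history but only looks at the last word; a richer model would use all of it. Log-probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow on long sequences.

```python
import math
from itertools import product

VOCAB = ["the", "big", "thing"]

# Hypothetical table p(next | previous); "<s>" marks the start of a sequence.
# Each row sums to 1. The numbers are made up for illustration.
COND = {
    "<s>":   {"the": 0.80, "big": 0.10, "thing": 0.10},
    "the":   {"the": 0.05, "big": 0.60, "thing": 0.35},
    "big":   {"the": 0.10, "big": 0.10, "thing": 0.80},
    "thing": {"the": 0.50, "big": 0.30, "thing": 0.20},
}

def cond_prob(word, history):
    """p(x_t | x_1, ..., x_{t-1}), simplified to depend on the last word only."""
    prev = history[-1] if history else "<s>"
    return COND[prev][word]

def sequence_log_prob(words):
    """log p(x) = sum_t log p(x_t | x_1, ..., x_{t-1}) (chain rule)."""
    return sum(math.log(cond_prob(w, words[:t])) for t, w in enumerate(words))

p_ordered = math.exp(sequence_log_prob(["the", "big", "thing"]))
p_repeat = math.exp(sequence_log_prob(["the", "the", "the"]))

# Sanity check: the probabilities of all length-3 sequences sum to 1.
total = sum(math.exp(sequence_log_prob(list(s))) for s in product(VOCAB, repeat=3))

print(p_ordered, p_repeat, total)
```

With these numbers, "the big thing" scores 0.8 × 0.6 × 0.8 = 0.384, while "the the the" scores only 0.8 × 0.05 × 0.05 = 0.002, so conditioning on context removes the degenerate repetition that the independent model favored.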

For instance
