# Sentiment Analysis of Movie Reviews Using LSTM

We will turn our attention to recurrent neural networks (RNNs), in particular long short-term memory (LSTM) networks, and how they can be used in sequential problems such as natural language processing (NLP). We will develop and train an LSTM network to predict the sentiment of movie reviews on IMDb.


-----
## Sequential problems in machine learning

Sequential problems are a class of problems in machine learning in which the order of the features presented to the model is important for making predictions. These are commonly encountered in the following scenarios:

- NLP: sentiment analysis, language translation, text prediction
- Time series predictions

Many NLP problems are sequential problems, because the languages that we speak are sequential in nature, and the sequence conveys context and other subtle nuances.
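To make the role of word order concrete, here is a minimal sketch using two made-up sentences (they are illustrative assumptions, not taken from any dataset). Both contain exactly the same words, so an order-free word count cannot tell them apart, while the ordered sequences differ.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count words while ignoring the order in which they appear."""
    return Counter(sentence.lower().split())


# Two hypothetical reviews with the same words in a different order.
a = "the movie was good not bad"
b = "the movie was bad not good"

same_counts = bag_of_words(a) == bag_of_words(b)  # order-free view
same_sequence = a.split() == b.split()            # order-aware view

print("same word counts:", same_counts)
print("same sequence:   ", same_sequence)
```

A model that only sees the counts receives identical input for both sentences, which is why sequential models are attractive for this kind of data.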

Sequential problems also occur naturally in time series problems. Time series problems are common in stock markets.

-----
## NLP and sentiment analysis

NLP is a subfield of artificial intelligence that is concerned with the interaction between computers and human languages.

With the proliferation of deep learning and neural networks in the image classification domain, scientists began to wonder whether the power of neural networks could be applied to NLP. The ability of AI assistants, such as Siri and Alexa, to understand multiple languages spoken in different accents was the result of deep learning and LSTM networks.

Sentiment analysis is also an area of NLP that benefited from the resurgence of deep learning. It is defined as the prediction of the positivity of a text. Most sentiment analysis problems are classification problems (positive, neutral or negative) and not regression problems.

-----
## Why sentiment analysis is difficult

Sentiment analysis is difficult because of the subtle nuances in human language. The same word can often convey a different meaning depending on the context.

Another reason sentiment analysis is difficult is sarcasm.

-----
## RNN (Recurrent neural networks)

The high-level architecture of an RNN is shown in the following diagram:

![diagram](https://i.imgur.com/mIUxURI.png)

We can see that an RNN is a multi-layered neural network. We can break up the raw input, splitting it into time steps.

-----
## What's inside an RNN?

The following diagram depicts the mathematical function inside each layer of an RNN:

![diagram2](https://i.imgur.com/SjCsxjw.png)

The mathematical function of an RNN is simple. Each layer $t$ within an RNN has two inputs:

- The input from the time step $t$
- The hidden state passed from the previous layer $t-1$

Each layer in an RNN simply sums up the two inputs and applies a $\tanh(x)$ function to the sum. It then outputs the result, to be passed as a hidden state to the next layer. More formally, the output hidden state of layer $t$ is:

$$ s_t = \tanh (s_{t-1}+x_t) $$

The tanh function is a good choice as a non-linear transformation of the combination of the current input and the previous hidden state, because it keeps the hidden state bounded between -1 and 1, so the values don't diverge too rapidly, and it is easily differentiable.

Finally, to get the final output from the last layer in the RNN we apply a sigmoid function.

We can see that if we stack these layers together, the final output from an RNN depends on the non-linear combination of the inputs at different time steps.
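The recurrence $s_t = \tanh(s_{t-1} + x_t)$ followed by a final sigmoid can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: the inputs are plain scalars, the initial state `s0` is taken to be zero, and there are no learned weights (a practical RNN would multiply the input and the hidden state by weight matrices before summing). The function name `rnn_forward` and the sample inputs are made up for this example.

```python
import math


def sigmoid(z):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def rnn_forward(inputs, s0=0.0):
    """Apply s_t = tanh(s_{t-1} + x_t) over a sequence of scalar inputs.

    Returns the sigmoid of the last hidden state together with the
    list of all hidden states, one per time step.
    """
    s = s0
    states = []
    for x_t in inputs:
        s = math.tanh(s + x_t)  # combine previous state and current input
        states.append(s)
    return sigmoid(s), states


# Hypothetical scalar inputs, one per time step.
output, states = rnn_forward([0.5, -1.0, 0.25, 2.0])
print("hidden states:", states)
print("final output: ", output)
```

Because each state feeds into the next, the final output depends on every input in the sequence, not only on the last one.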

-----

## Long- and short-term dependencies in RNNs

The architecture of an RNN makes it ideal for handling sequential data. 

Let's first take a look at a short piece of text as our sequential data:

> **THE WEATHER IS HOT TODAY**

We can treat this short sentence as sequential data by breaking it down into five different inputs. 

![diagram](https://i.imgur.com/5uzfiXo.png)
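As a small sketch of this step, the sentence can be split on whitespace so that each word becomes the input at one time step. Splitting on whitespace is an assumption made here for simplicity; real preprocessing pipelines usually apply a more careful tokenizer.

```python
sentence = "THE WEATHER IS HOT TODAY"

# Pair every word with its time step, counting from 1.
time_steps = list(enumerate(sentence.split(), start=1))

for t, word in time_steps:
    print(f"t={t}: {word}")
```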

Suppose that we are building a simple RNN to predict whether it is snowing based on this sequential data. The RNN would look something like this:

![example](https://i.imgur.com/I1w5m9o.png)

The critical piece of information in the sequence is the word hot, at time step 4. Notice that the critical piece of information came just shortly before the final input. We would say that there is a short-term dependency in this sequence. 

Let's take a look at a longer sequence of text.

> "I really liked the movie but i was disappointed in the service and cleanliness of the cinema. The cinema should be better maintained in order to provide a better experience for customers."

Our goal is to predict whether the customer liked the movie. The customer liked the movie but not the cinema, which was the main complaint in the paragraph. The RNN would look like this:

![sequence](https://i.imgur.com/5lBgq42.png)

The critical words **liked the movie** appeared between time steps 3 and 5. Notice that there is a significant gap between the critical time steps and the output time step, as the rest of the text was largely irrelevant to the prediction problem. We say that there is a long-term dependency in this sequence. Unfortunately, RNNs do not work well with long-term dependency sequences. RNNs have a good short-term memory but a bad long-term memory.

-----

## The vanishing gradient problem

The vanishing gradient problem arises when training deep neural networks using gradient-based methods such as backpropagation.

When the loss is propagated backward, the error signal (the gradient of the loss) tends to decrease with each successive layer:

![loss](https://i.imgur.com/1NaXFtc.png)

As a result, by the time the loss is propagated back toward the first few layers, it has already diminished so much that the weights barely change. With such a small signal being propagated backward, it becomes very hard to adjust and train the weights of the first few layers. This phenomenon is known as the *vanishing gradient problem* in machine learning.
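The shrinking effect can be illustrated numerically with the simplified recurrence $s_t = \tanh(s_{t-1} + x_t)$ used earlier. By the chain rule, the derivative of the final state with respect to the initial state is a product of factors $1 - s_t^2$, each of which is at most 1. The sketch below is only an illustration under stated assumptions: scalar states, no weights, and an arbitrary constant input of 0.8 at every step.

```python
import math


def gradient_wrt_first_state(inputs, s0=0.0):
    """Track d s_t / d s_0 for s_t = tanh(s_{t-1} + x_t) via the chain rule.

    Returns the running gradient after each time step.
    """
    s = s0
    grad = 1.0
    history = []
    for x_t in inputs:
        s = math.tanh(s + x_t)
        grad *= 1.0 - s * s  # derivative of tanh at this step
        history.append(grad)
    return history


# Hypothetical sequence: the same input value for 20 time steps.
history = gradient_wrt_first_state([0.8] * 20)

for t in (1, 5, 10, 20):
    print(f"after {t:2d} steps: gradient = {history[t - 1]:.3e}")
```

The gradient shrinks at every step, so the earliest time steps receive almost no training signal from the final output.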

To address this problem, Hochreiter and Schmidhuber proposed a clever variation of the RNN, known as the long short-term memory (LSTM) network.





-----

# The LSTM network

LSTMs are a variation of RNNs, and they solve the long-term dependency problem faced by conventional RNNs.

-----

## What's inside an LSTM network?

LSTMs have the same repeating structure as the RNNs that we have seen previously. However, LSTMs differ in their internal structure.

The following diagram shows a high-level overview of the repeating unit of an LSTM:

![unit](https://i.imgur.com/gSfie8Q.png)

An LSTM differs from a conventional RNN in that it has a cell state, in addition to the hidden state. You can think of the cell state as the current memory of the LSTM. In contrast, the hidden state is the overall memory of the entire LSTM. It contains everything that we have seen so far, both important and unimportant information. 

How does the LSTM regulate the flow of information between the hidden state and the cell state? It does so via three important gates:

- Forget gate
- Input gate
- Output gate

Just like physical gates, the three gates restrict the flow of information from the hidden state to the cell state. 

-----

## Forget gate 

![forget gate](https://i.imgur.com/Bd11o5b.png)

The **Forget gate (f)** forms the first part of the LSTM repeating unit, and its role is to decide how much data we should forget or remember from the previous cell state. It does so by first concatenating the **Previous Hidden State** ($h_{t-1}$) and the current **Input** ($x_t$), then passing the concatenated vector through a sigmoid function. The output of the forget gate, $f$, is as follows:

$$ f = \sigma( \textrm{concatenate}(h_{t-1},x_t)) $$
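A minimal sketch of the forget gate in plain Python follows. One assumption goes beyond the description above: a weight matrix `W_f` and bias `b_f` are applied to the concatenated vector before the sigmoid, as in a standard LSTM, so that the gate has one value per cell-state unit. All numbers and names (`forget_gate`, `W_f`, `b_f`, `C_prev`) are made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forget_gate(h_prev, x_t, W_f, b_f):
    """f = sigmoid(W_f . concatenate(h_prev, x_t) + b_f)."""
    concat = h_prev + x_t  # list concatenation of the two vectors
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_f, b_f)
    ]


# Hypothetical values: hidden size 2, input size 3.
h_prev = [0.1, -0.3]
x_t = [1.0, 0.5, -0.2]
W_f = [[0.2, -0.1, 0.4, 0.0, 0.3],
       [-0.5, 0.3, 0.1, 0.2, -0.4]]
b_f = [0.0, 0.0]
C_prev = [2.0, -1.0]

f = forget_gate(h_prev, x_t, W_f, b_f)

# Each gate value scales how much of the previous cell state is kept.
kept = [f_k * c_k for f_k, c_k in zip(f, C_prev)]
print("forget gate:", f)
print("kept state: ", kept)
```

Since every gate value lies strictly between 0 and 1, the gate can only shrink the previous cell state, never amplify it.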

-----

## Input gate

The next gate is the **Input gate (i)**, which controls how much information to pass to the current cell state.

![input gate](https://i.imgur.com/95MmUo5.png)

Just like the forget gate, the **Input gate (i)** takes as input the concatenation of the **Previous Hidden State** and the current **Input**. It then passes two copies of the concatenated vector through a sigmoid function and a tanh function, before multiplying them together.

The output of the input gate, $i$, is as follows:

$$ i = \sigma( \textrm{concatenate}(h_{t-1},x_t)) \ast \tanh ( \textrm{concatenate}(h_{t-1},x_t)) $$

At this point, we have what is required to compute the current cell state ($C_t$) to be output.

![current state](https://i.imgur.com/jJk48Cz.png)

The current cell state $C_t$ is as follows:

$$ C_t = ( f \ast C_{t-1}) + i $$ 
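The input gate and the cell state update can be sketched together in plain Python. As an assumption beyond the formulas above, each branch gets its own weight matrix and bias (`W_f`, `W_i`, `W_c` and matching biases), as in a standard LSTM. The helper names and all numeric values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W . v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def cell_state_update(h_prev, x_t, C_prev, p):
    """Return (C_t, f, i) with C_t = f * C_prev + i, elementwise."""
    concat = h_prev + x_t
    f = [sigmoid(z) for z in affine(p["W_f"], p["b_f"], concat)]
    gate = [sigmoid(z) for z in affine(p["W_i"], p["b_i"], concat)]
    cand = [math.tanh(z) for z in affine(p["W_c"], p["b_c"], concat)]
    i = [g * c for g, c in zip(gate, cand)]  # sigmoid branch * tanh branch
    C_t = [f_k * c_k + i_k for f_k, c_k, i_k in zip(f, C_prev, i)]
    return C_t, f, i


# Hypothetical parameters: hidden size 2, input size 1.
params = {
    "W_f": [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], "b_f": [0.0, 0.0],
    "W_i": [[-0.4, 0.6, 0.2], [0.1, -0.3, 0.7]], "b_i": [0.0, 0.0],
    "W_c": [[0.9, -0.1, 0.4], [-0.6, 0.2, 0.5]], "b_c": [0.0, 0.0],
}
C_prev = [1.5, -0.5]

C_t, f, i = cell_state_update([0.2, -0.1], [1.0], C_prev, params)
print("forget gate:", f)
print("input gate: ", i)
print("cell state: ", C_t)
```

The forget gate scales the old memory and the input gate adds new information to it, which together produce the new cell state.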

-----

## Output gate

The output gate controls how much information is to be retained in the hidden state.

![output gate](https://i.imgur.com/q8tbuDt.png)

First, we concatenate the **Previous Hidden State** ($h_{t-1}$) and the current **Input** ($x_t$), and pass the result through a sigmoid function. Then, we take the current cell state ($C_t$) and pass it through a tanh function. Finally, we multiply the two, and the product is passed to the next repeating unit as the hidden state ($h_t$). This process is summarized by the following equation:

$$h_t = \sigma( \textrm{concatenate}(h_{t-1},x_t)) \ast \tanh(C_t) $$
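Putting the three gates together, one LSTM repeating unit can be sketched as a single function and applied across a short sequence. This is a rough illustration, not a trained model: the weights are random placeholders drawn with a fixed seed, each gate is assumed to have its own weight matrix and bias (as in a standard LSTM), and the names `init_params` and `lstm_step` are invented for this example.

```python
import math
import random


def sigmoid(z):
    """Logistic function, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W . v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def init_params(hidden_size, input_size, seed=0):
    """Random placeholder weights; a trained network would learn these."""
    rng = random.Random(seed)
    n = hidden_size + input_size
    p = {}
    for name in ("f", "i", "c", "o"):
        p["W_" + name] = [[rng.uniform(-1, 1) for _ in range(n)]
                          for _ in range(hidden_size)]
        p["b_" + name] = [0.0] * hidden_size
    return p


def lstm_step(h_prev, C_prev, x_t, p):
    """One repeating unit: forget gate, input gate, cell update, output gate."""
    concat = h_prev + x_t
    f = [sigmoid(z) for z in affine(p["W_f"], p["b_f"], concat)]
    i = [sigmoid(a) * math.tanh(c)
         for a, c in zip(affine(p["W_i"], p["b_i"], concat),
                         affine(p["W_c"], p["b_c"], concat))]
    C_t = [f_k * c_k + i_k for f_k, c_k, i_k in zip(f, C_prev, i)]
    o = [sigmoid(z) for z in affine(p["W_o"], p["b_o"], concat)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, C_t)]
    return h_t, C_t


hidden_size, input_size = 3, 2
p = init_params(hidden_size, input_size)
h, C = [0.0] * hidden_size, [0.0] * hidden_size

# Hypothetical input sequence with three time steps.
sequence = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
for x_t in sequence:
    h, C = lstm_step(h, C, x_t, p)

print("final hidden state:", h)
print("final cell state:  ", C)
```

The hidden state and the cell state are both carried from one step to the next, which is what distinguishes this unit from the plain RNN layer shown earlier.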