# PreClass
## Converting words to numbers
1. Unique numbers
    - The numbers are arbitrary. They don't capture relationships between words.
2. One-hot encoding
    - Doesn't capture relationships between words.
    - Computationally inefficient: every vector is as long as the vocabulary and almost entirely zeros.
3. Word embeddings (these can capture the relationships between words)
    - TF-IDF
    - Word2Vec
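
The first two approaches can be sketched in a few lines of plain Python. The three-word vocabulary and the helper names (`one_hot`, `dot`) are made up for illustration. The point is that different one-hot vectors always have a dot product of 0, so no word looks more related to one word than to another.

```python
# Toy vocabulary; the words are illustrative and not from any dataset.
vocab = ["king", "queen", "apple"]

# 1. Unique numbers: each word gets an arbitrary integer ID.
word_to_id = {word: idx for idx, word in enumerate(vocab)}

# 2. One-hot encoding: a vector as long as the vocabulary with a single 1.
def one_hot(word):
    vector = [0] * len(vocab)
    vector[word_to_id[word]] = 1
    return vector

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(word_to_id)        # {'king': 0, 'queen': 1, 'apple': 2}
print(one_hot("queen"))  # [0, 1, 0]

# Two different one-hot vectors are always orthogonal, so "king" looks
# exactly as unrelated to "queen" as it does to "apple".
print(dot(one_hot("king"), one_hot("queen")))  # 0
print(dot(one_hot("king"), one_hot("apple")))  # 0
```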
    
### What are Word Embeddings?
Word embeddings are numeric vectors that capture semantic relationships between words. To see how close two words are to each other semantically, the cosine similarity of their word vectors is checked. The closer the value is to 1, the closer the words are semantically.
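
As a minimal sketch, cosine similarity is the dot product of two vectors divided by the product of their lengths. The 3-dimensional vectors below are hand-made for illustration only. Real embeddings are learned from data and usually have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen by hand so that "king" and "queen"
# point in a similar direction and "apple" points elsewhere.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])

print(f"king vs queen: {sim_king_queen:.3f}")  # close to 1
print(f"king vs apple: {sim_king_apple:.3f}")  # much lower
```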

## Recurrent Neural Networks (RNN)
- RNNs are a type of deep learning architecture generally used on sequential data, for example to predict the next step in a sequence.
- The biggest difference between them and other deep learning architectures is that they remember what they have already seen.
- Another difference is that in other neural networks each input is independent of the others, whereas in RNNs the inputs are related to each other.
- **Vanishing** and **exploding** gradients are two frequent issues that arise during RNN backpropagation.
    - In the **vanishing** gradient problem, the gradient values shrink rapidly toward zero and training effectively stops. The model can't capture the relationships between the beginning and the end of long sentences.
    - In the **exploding** gradient problem, the gradient values grow exponentially, the training process becomes unstable, and the model weights can become NaN. The model can't learn anything from the training data.
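
Both problems come from multiplying many factors together, one per time step. The sketch below is a deliberate simplification: real backpropagation through time multiplies matrices, not a single constant, and the factors 0.5 and 1.5 are arbitrary illustrative values.

```python
def gradient_after(steps, factor):
    """Scale a gradient of 1.0 by the same factor once per time step."""
    gradient = 1.0
    for _ in range(steps):
        gradient *= factor
    return gradient

for steps in (10, 50, 100):
    shrinking = gradient_after(steps, 0.5)  # factor < 1: vanishing gradient
    growing = gradient_after(steps, 1.5)    # factor > 1: exploding gradient
    print(f"{steps:>3} steps  vanishing={shrinking:.3e}  exploding={growing:.3e}")
```

Even with these mild factors, the gradient is practically zero or astronomically large after a few dozen steps, which is why long sequences are the hard case.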

### Short-Term Memory Problem:
- Short-term memory is a problem for recurrent neural networks due to the vanishing gradient issue. If the sequence is long enough, they have difficulty carrying information from earlier time steps to later ones. If you're trying to predict something from a paragraph of text, RNNs may leave out essential information from the beginning of the sequence.

- The vanishing gradient problem affects recurrent neural networks during backpropagation. Gradients are the values used to update the weights of a neural network. When a gradient shrinks as it is backpropagated through time, this is known as the vanishing gradient problem. Once a gradient value becomes very small, it no longer contributes much to learning.

- In recurrent neural networks, layers that get a tiny gradient update stop learning. These are generally the earliest layers, which correspond to the earliest time steps. Because these layers don't learn, RNNs can forget what they saw earlier in longer sequences, resulting in **short-term memory**.

- RNNs work very well on short sequences (like short sentences) and have a relatively low computational cost. However, they can't sustain this performance on long sequences.

- **Long Short-Term Memory (LSTM)** and **Gated Recurrent Units (GRU)**, which are improved variants of RNN, are used as a solution to this problem.

## How does an RNN work?
In an RNN, the words are first converted into numeric vectors. The RNN then processes the vectors one at a time, in order. During processing, it passes the previous hidden state from one step to the next. The hidden state acts as the neural network's memory: it stores information from the inputs the network has already seen.
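
The loop described above can be sketched with NumPy. This is a minimal forward pass only, under stated assumptions: the sizes, the parameter names (`W_xh`, `W_hh`, `b_h`), and the random values standing in for word vectors and learned weights are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, hidden_dim = 4, 3  # illustrative sizes

# Hypothetical, randomly initialized parameters (a trained model learns these).
W_xh = rng.normal(scale=0.5, size=(hidden_dim, embedding_dim))  # input -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))     # hidden -> hidden
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """Combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# A "sentence" of 5 word vectors (random stand-ins for real embeddings).
sequence = rng.normal(size=(5, embedding_dim))

h = np.zeros(hidden_dim)   # the memory starts empty
for x_t in sequence:       # process the words one at a time, in order
    h = rnn_step(x_t, h)   # the new hidden state is passed to the next step

print(h.shape)  # (3,)
```

After the loop, `h` summarizes the whole sequence and is what a prediction layer would typically read.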

### Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU)
Internal mechanisms called gates regulate the flow of information in LSTM and GRU. These gates learn which data in a sequence should be kept and which should be discarded. They can then pass important information along long sequences to make predictions.

- Tanh Activations:
    - As vectors pass through a neural network, they are transformed repeatedly by various math operations. Without any regulation, some values could become very large and make other values insignificant. The tanh activation prevents this by squashing all vector values into the range -1 to +1.

- Sigmoid Activations:
    - Sigmoid activations are used in the gates. A sigmoid activation is similar to a tanh activation, but instead of squashing values into the range -1 to 1, it squashes them into the range 0 to 1. Because any number multiplied by 0 equals 0, values multiplied by a gate output near 0 vanish (are forgotten), while values multiplied by a gate output near 1 are kept. This is useful for updating or forgetting data.
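
A quick numeric check of the two ranges, using only the standard library. The input values and the `information` number are arbitrary and only there for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# tanh stays between -1 and 1, sigmoid stays between 0 and 1.
for x in (-5.0, 0.0, 5.0):
    print(f"x={x:+.1f}  tanh={math.tanh(x):+.4f}  sigmoid={sigmoid(x):.4f}")

# A gate multiplies information by a sigmoid output:
# near 0 the information is forgotten, near 1 it is kept.
information = 0.8
print(information * sigmoid(-5.0))  # close to 0: mostly forgotten
print(information * sigmoid(5.0))   # close to 0.8: mostly kept
```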

## 1. Long Short-Term Memory (LSTM)
In an LSTM cell, there are three separate gates (forget gate, input gate, and output gate) that control the flow of information.

- **Forget Gate:** This gate determines which information to keep and which to discard. The **sigmoid** function is used for this process. The closer the value is to 0, the more of the information is forgotten, and the closer it is to 1, the more of it is retained.


- **Input Gate:** The current input together with the previous hidden state is passed through the **sigmoid** function and, separately, through the **tanh** function, and the two results are multiplied. Since the sigmoid output is between 0 and 1, it decides how much of the information from the tanh function is added to the cell state.


- **Cell State:** The cell state carries relevant information throughout the sequence. Thus, earlier information isn't forgotten and the model can make more accurate predictions. You can think of it as the memory of the network. As the cell state travels along the sequence, information is added to or removed from it via the gates. The gates are small neural network layers that decide which information is allowed into the cell state. During training, the gates learn which information is necessary to retain and which to discard.

- **Output Gate:** It determines the hidden state to be passed to the next step. Since the hidden state contains information about previous inputs (words, letters, etc.), it is used for the predictions the model makes. The current input and the previous hidden state are passed through the sigmoid function. Then the current cell state is passed through the tanh function, and the two results are multiplied. The output is the new hidden state.

To summarize briefly, the **forget gate** decides whether to keep carrying the information from previous steps in the cell state. The **input gate** decides whether or not to add a piece of new information to the cell state. The **output gate** determines the next hidden state.
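
Putting the three gates together, one LSTM step can be sketched with NumPy as below. This follows the common textbook formulation and is not tied to any particular library's implementation. The sizes, the dictionary layout of the weights, and the random values standing in for learned parameters and inputs are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # illustrative sizes
concat_dim = hidden_dim + input_dim

# One weight matrix and bias per gate, plus one for the candidate values.
# Random values stand in for learned parameters.
W = {name: rng.normal(scale=0.5, size=(hidden_dim, concat_dim))
     for name in ("forget", "input", "candidate", "output")}
b = {name: np.zeros(hidden_dim) for name in W}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])                       # previous hidden state + current input
    f = sigmoid(W["forget"] @ z + b["forget"])              # forget gate: what to keep from c_prev
    i = sigmoid(W["input"] @ z + b["input"])                # input gate: how much new info to add
    c_tilde = np.tanh(W["candidate"] @ z + b["candidate"])  # candidate values from tanh
    c = f * c_prev + i * c_tilde                            # new cell state
    o = sigmoid(W["output"] @ z + b["output"])              # output gate
    h = o * np.tanh(c)                                      # new hidden state
    return h, c

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # 5 random stand-in word vectors
    h, c = lstm_step(x_t, h, c)

print(h.shape, c.shape)  # (3,) (3,)
```

Note that the cell state `c` is only scaled and added to, never squashed through an activation on its way forward, which is what lets information survive across many steps.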

## 2. Gated Recurrent Units (GRU)
GRU was introduced in 2014 and is a newer RNN variant than LSTM. Its working logic is very similar to that of LSTM. Unlike LSTM, GRU combines the cell state and the hidden state into a single state. It also has only two gates: the update gate and the reset gate.

- **Update Gate:** It decides what information to discard and what new information to include, similar to the forget and input gates in LSTM.

- **Reset Gate:** It decides how much of the information from previous steps should be forgotten.
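
A single GRU step can be sketched in the same spirit. This is one common textbook formulation. Some references swap the roles of `z` and `1 - z` in the last line, and library implementations differ in details, so treat the code as an illustration and not as any library's exact behavior. Sizes, names, and random values are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3  # illustrative sizes
concat_dim = hidden_dim + input_dim

# Two gates plus the candidate state; random values stand in for learned weights.
W = {name: rng.normal(scale=0.5, size=(hidden_dim, concat_dim))
     for name in ("update", "reset", "candidate")}
b = {name: np.zeros(hidden_dim) for name in W}

def gru_step(x_t, h_prev):
    v = np.concatenate([h_prev, x_t])
    z = sigmoid(W["update"] @ v + b["update"])  # update gate
    r = sigmoid(W["reset"] @ v + b["reset"])    # reset gate
    # The reset gate scales the previous hidden state before it enters the candidate.
    v_reset = np.concatenate([r * h_prev, x_t])
    h_tilde = np.tanh(W["candidate"] @ v_reset + b["candidate"])
    # The update gate blends the old state with the candidate state.
    return (1.0 - z) * h_prev + z * h_tilde

h = np.zeros(hidden_dim)  # a single state: no separate cell state
for x_t in rng.normal(size=(5, input_dim)):  # 5 random stand-in word vectors
    h = gru_step(x_t, h)

print(h.shape)  # (3,)
```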

GRU is generally faster to train than LSTM because it has fewer gates, but both LSTM and GRU give very good results. It's up to you to try both and decide.
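
One way to see why GRU tends to be cheaper is to count parameters. The sketch below assumes the textbook formulation, where each gate (and the candidate) has one weight matrix over the concatenated hidden state and input plus one bias vector. Actual library layers may add extra bias terms, so real counts can differ slightly. The layer sizes are arbitrary.

```python
def recurrent_params(input_dim, hidden_dim, num_blocks):
    """Each block has a weight matrix over [hidden, input] plus a bias vector."""
    return num_blocks * (hidden_dim * (hidden_dim + input_dim) + hidden_dim)

input_dim, hidden_dim = 100, 128  # illustrative sizes

lstm_params = recurrent_params(input_dim, hidden_dim, num_blocks=4)  # 3 gates + candidate
gru_params = recurrent_params(input_dim, hidden_dim, num_blocks=3)   # 2 gates + candidate

print(lstm_params, gru_params, gru_params / lstm_params)  # 117248 87936 0.75
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.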

In NLP classification tasks, LSTM and GRU determine which class a text belongs to by learning the keywords and patterns in the text that matter for that class. They learn to give little weight to the other words and phrases.
