#You Will Learn
This course is about making sense of text data using deep learning techniques.
You will learn how to represent text data efficiently in a numeric format that retains both semantic and contextual relationships.
You will learn the mechanisms behind different word embedding models such as skip-gram, CBOW, and GloVe.
Finally, you will learn how to build a model for machine translation using encoders and decoders.



#Word Embeddings
The primary objective of Natural Language Processing, especially in deep learning, is to attain human-like performance in linguistic tasks.
For a machine, a word is just a string, and it makes no difference whether the word is used in one context or another.
For example, humans understand that the words one, two, three, ... belong to one category, while the words I, we, you, ... belong to another.
Hence we need an algorithm that assigns a numeric representation to each word such that words used in similar contexts have similar numeric representations.
These numeric representations are known as word embeddings. In the rest of the course, you will explore different types of word embeddings and the algorithms behind them.


#One Hot Encoding
One naive way of representing a word in numeric format is one-hot encoding.
As shown in the image, all the words in the vocabulary are first stored as a set, and each word is assigned a unique index.
Each word is then represented as a vector in which all the elements are zero except for the element at the word's index, which is one.


![alt text](https://docs-cdn.fresco.me/system/attachments/files/004/719/726/large/8752c5b9f85b3455c09f04a9d3538fd71fee16c5/One_hot_encoding.jpeg)
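The indexing and vector construction described above can be sketched in a few lines of Python. The toy vocabulary and the helper name `one_hot` are assumptions made for illustration; they are not taken from the figure.

```python
# Toy vocabulary (hypothetical example, not taken from the figure).
vocabulary = sorted({"forest", "jungle", "sea", "dolphin"})

# Assign a unique index to each word in the vocabulary.
word_to_index = {word: index for index, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single one at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

for word in vocabulary:
    print(word, one_hot(word))
```

Note that the vector length equals the vocabulary size, so every new word adds one more dimension.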



#One Hot Encoding - Drawback
One major drawback of one-hot encoding is that these numeric representations do not convey any relationship between the words. Each vector stands on its own.
In addition, if the corpus is huge, the one-hot encoded vectors become very high-dimensional and computationally expensive to work with.
Nevertheless, these vectors form the basis on which the rest of the algorithms learn other kinds of word embeddings.

#Word to Vec
Learning good word embeddings is of paramount importance in NLP.
A good word embedding represents a word in as few dimensions as possible while preserving both its semantics and its context in the language.
Each word embedding is a point in a vector space, and the transformation of words into their vector representations is called word to vec (word2vec).


![alt text](https://docs-cdn.fresco.me/system/attachments/files/004/719/701/large/eec1e735d556e8e995fbe5a89446a914640edf1a/Word_to_vec_intution.jpeg)

#Wordtovec - Intuition
Let's say that our neural network has learned to represent the four words forest, jungle, sea, and dolphin in a two-dimensional vector space as (1, 2), (1, 2.5), (5, 6), and (5, 9).
As you can see, the words jungle and forest are close together since they mean the same thing, and the words sea and dolphin are placed fairly close to each other but far away from the first two words.
Though sea and dolphin do not mean the same thing, they are still close because the network might have learned that dolphins mostly live in the sea, since the two words have appeared together in the corpus.
In practice, the dimension of each word embedding is around 300 to 1000.
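To make "close" and "far away" concrete, the short sketch below measures the Euclidean distance between the four example vectors. Euclidean distance is one possible choice made here for illustration; the helper name `distance` is hypothetical.

```python
import math

# Two-dimensional vectors quoted in the text above.
vectors = {
    "forest": (1.0, 2.0),
    "jungle": (1.0, 2.5),
    "sea": (5.0, 6.0),
    "dolphin": (5.0, 9.0),
}

def distance(word_a, word_b):
    """Euclidean distance between the vectors of two words."""
    return math.dist(vectors[word_a], vectors[word_b])

print(distance("forest", "jungle"))  # 0.5
print(distance("sea", "dolphin"))    # 3.0
print(distance("forest", "sea"))     # about 5.66
```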


#Learning Word Embeddings
There are several ways of arriving at word vectors using deep learning. Some of the popular methods are:


1. Continuous Bag of Words (CBOW) model
2. Skip-gram model
3. GloVe model


The rest of the cards in this topic cover the theory behind these models; the next topic covers their practical implementation.



#CBOW
Consider the sentence "Judith feigned a forgotten wallet to evade paying for dinner, proving she had surpassed frugality and become parsimonious."
If you are not sure about the meaning of the final word, parsimonious, you tend to look at the words surrounding it to guess its meaning (or context).
Looking at the words evade, paying, and frugality (avoiding waste), you can guess that parsimonious might mean spending less.
CBOW learns the meaning of a given word (that is, its numeric vector) by looking at a fixed number of words before and after the word of interest. In other words, it learns the context.
The main idea of the CBOW model is to predict a word given its context.



#Skip Gram
The skip-gram model is quite different from the CBOW model.
The idea of the skip-gram model is to predict the context given the word.
For a given word, the skip-gram model tries to predict the most probable words that usually surround it.

#Global Vectors
The GloVe model is slightly different from the word to vec models.
The GloVe model learns word embeddings by looking at the number of times two words have appeared together, which is called their co-occurrence.
It tries to minimize the difference between the similarity of two word vectors and their co-occurrence.

#Initializing Word Embeddings
Before training any neural network, one always starts by initializing the weight and bias parameters to random values.
Likewise, before learning the word embeddings, the embedding vector of each word in the vocabulary is initialized with random values.
The word embeddings look like the table shown in the image. This table is commonly known as a lookup table.
Each word embedding has N dimensions; in the table shown, N = 3.

![alt text](https://docs-cdn.fresco.me/system/attachments/files/004/719/731/large/4b04f527c1e63c1559dc6d4da27453a87a6fb29b/Initializ_word_embedding.jpeg)
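A randomly initialized lookup table can be sketched with NumPy as follows. The vocabulary, the value range, and the seed are assumptions chosen for illustration; only N = 3 comes from the card above.

```python
import numpy as np

vocabulary = ["forest", "jungle", "sea", "dolphin"]  # hypothetical toy vocabulary
embedding_dim = 3                                     # N = 3, as in the table

# One row of random values per word: shape (vocabulary size, N).
rng = np.random.default_rng(seed=0)
lookup_table = rng.uniform(-1.0, 1.0, size=(len(vocabulary), embedding_dim))

# Looking up a word's embedding means selecting its row by index.
word_to_index = {word: i for i, word in enumerate(vocabulary)}
sea_vector = lookup_table[word_to_index["sea"]]

print(lookup_table.shape)  # (4, 3)
print(sea_vector)
```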

#The Context and the Target
Before we employ a neural network to learn word embeddings, it is necessary to fix the context and target words.

Consider the sentence

When you play a game of thrones you win, or you die.

1. Let's say that we are trying to learn the word embedding for the word thrones in the above sentence.

2. Here the target word is thrones, i.e. the word for which we are trying to find the embedding.

3. The context words for the word thrones are game, of, you, and win, i.e. the two words before and after the target word, given a window size of 2.

4. In other words, the words that usually surround the target word become its context words.

5. Words that share similar contexts share similar meanings (or similar word vector representations).
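The window logic from the list above can be sketched as follows. The tokenization (lowercasing and keeping only alphabetic tokens) and the helper name `context_words` are simplifying assumptions.

```python
import re

sentence = "When you play a game of thrones you win, or you die."
# Lowercase and keep only alphabetic tokens (a simplifying assumption).
tokens = re.findall(r"[a-z]+", sentence.lower())

def context_words(tokens, target_index, window_size=2):
    """Return the words within window_size positions of the target word."""
    start = max(0, target_index - window_size)
    before = tokens[start:target_index]
    after = tokens[target_index + 1:target_index + 1 + window_size]
    return before + after

target_index = tokens.index("thrones")
print(context_words(tokens, target_index))  # ['game', 'of', 'you', 'win']
```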



#Sampling
Once we have initialized the lookup table, it's time to sample the target and context words as shown in the image.
They play the same role as the input features and target labels that you see in supervised learning.

![alt text](https://docs-cdn.fresco.me/system/attachments/files/004/719/722/large/1ea7f58f82d822dd57271de5bbebefe5559b51db/sampling.jpeg)
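One way to generate such training samples for the skip-gram model is sketched below. The function name `skipgram_pairs` and the toy input are hypothetical, and the exact layout of the samples in the figure may differ.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (target, context) pairs for every position in the token list."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["a", "game", "of", "thrones"]  # hypothetical toy input
pairs = skipgram_pairs(tokens, window_size=1)
print(pairs)
```

Each pair acts as one training sample: the first word is the input and the second word is the label to predict.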


#The Skip Gram Model
The figure shows the architecture of the skip-gram model.
The embedding matrix is the lookup table, and its weights are taken as the word vectors.



![alt text](https://docs-cdn.fresco.me/system/attachments/files/004/719/702/large/2262b0487001a8ba2fc857690af433fece14b41a/Skip_gram_model.jpeg)



#The Softmax Layer
The forward propagation explained in the previous card can be summarized in one equation:
$$
\hat y_j = \frac{e^{w'_j \cdot w_i}}{\sum_{k=1}^V e^{w'_k \cdot w_i}}
$$

Here $w_i$ is the embedding of the input word, $w'_j$ is the weight vector of node $j$ in the final softmax layer, and $\hat y_j$ is the output of that node.

In other words, $\hat y_j$ is the probability of word $y_j$ being in the context of the word $x_i$.
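The softmax computation can be sketched with NumPy as follows. The matrix names `W_in` and `W_out`, the toy sizes, and the random values are assumptions for illustration; subtracting the maximum score is a common numerical-stability step that does not change the result.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
V, N = 5, 3                      # vocabulary size and embedding length (toy values)

W_in = rng.normal(size=(V, N))   # embedding matrix (lookup table), rows are w_i
W_out = rng.normal(size=(V, N))  # softmax-layer weights, rows are w'_j

def context_probabilities(i):
    """Softmax over the dot products w'_j . w_i for all j in the vocabulary."""
    scores = W_out @ W_in[i]         # one score per vocabulary word, shape (V,)
    scores = scores - scores.max()   # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

y_hat = context_probabilities(2)
print(y_hat, y_hat.sum())
```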



#Extracting Word Embeddings
After several iterations of forward pass -> compute loss -> update weights, the updated weights of the embedding matrix (the initial hidden layer) are the final word embeddings.
The weights of the softmax layer can be discarded.



#Keep in Mind
1. The whole point of training a skip-gram model is not to predict the context word given the target word, but to learn the weights of the embedding matrix.
2. This is true for all the other models that are used to compute word embeddings.


#Skip-Gram Drawback
One main disadvantage of the skip-gram model is that the model updates all the weights for every training sample it sees.
If the vocabulary size is large, this can be computationally expensive.
To address this issue, there is another technique called negative sampling, where only a small fraction of the weights is updated for each training sample.
The next card covers negative sampling in more detail.


#Negative Sampling
In negative sampling, we generate a set of positive and negative samples from the available text, as shown in the figure.
Each sample has a binary target value that says whether or not the two words appear in the same context.
The number of negative samples K can be chosen freely. The larger the corpus, the smaller the value of K that is needed (usually around 5).

![alt text](https://docs-cdn.fresco.me/system/attachments/files/004/719/725/large/5b01dc60d2d8b98a6fabdebe971165e3b304c9c8/Negative_sampling.jpeg)
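The sample generation can be sketched as follows. Drawing the negative words uniformly at random is a simplifying assumption, and the function name `make_samples`, the toy vocabulary, and the seed are hypothetical.

```python
import random

def make_samples(target, context, vocabulary, k, rng):
    """Return one positive sample and k negative samples for a target word.

    Each sample is (target, other_word, label), with label 1 when the pair
    was observed in the text and 0 when other_word was drawn at random.
    """
    samples = [(target, context, 1)]
    candidates = [w for w in vocabulary if w not in (target, context)]
    for negative in rng.sample(candidates, k):
        samples.append((target, negative, 0))
    return samples

vocabulary = ["game", "of", "thrones", "you", "win", "die", "play", "when"]
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = make_samples("thrones", "game", vocabulary, k=3, rng=rng)
print(samples)
```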



#Negative Sampling Architecture
The figure shows the negative sampling model architecture.
The softmax layer of the skip-gram model is replaced by a sigmoid activation.

![alt text](https://docs-cdn.fresco.me/system/attachments/files/004/719/955/large/8e6144728a258b0dd9103e82b9a2b719e2cbebae/Negative_sampling_model.jpeg)


#Negative Sampling - Steps Involved
In contrast to the skip-gram model, negative sampling replaces the final softmax layer of V nodes with a single node that computes a sigmoid activation.
We first compute the dot product of the word vectors of the input sample and then apply the sigmoid activation:
$$ p(y=1 \mid (w_t, w_c)) = \mathrm{sigmoid}(w_t \cdot w_c) $$


Finally, the binary cross-entropy loss is computed and the weights of the target vector $w_t$ are updated.
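A single training step for one sample can be sketched as below. The function name, the toy vectors, and the learning rate are assumptions; the sketch updates only $w_t$, mirroring the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_step(w_t, w_c, label, learning_rate=0.1):
    """One gradient step on w_t for a single (target, context, label) sample."""
    p = sigmoid(np.dot(w_t, w_c))   # p(y = 1 | w_t, w_c)
    loss = -(label * np.log(p) + (1 - label) * np.log(1 - p))  # binary cross entropy
    grad_w_t = (p - label) * w_c    # gradient of the loss with respect to w_t
    return w_t - learning_rate * grad_w_t, loss

w_t = np.array([0.5, -0.2, 0.1])   # toy target vector
w_c = np.array([0.4, 0.3, -0.6])   # toy context vector
new_w_t, loss = negative_sampling_step(w_t, w_c, label=1)
_, new_loss = negative_sampling_step(new_w_t, w_c, label=1)
print(loss, new_loss)
```

Because only the vectors involved in the sample are touched, the cost of a step does not grow with the vocabulary size.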



#CBOW Model
The Continuous Bag of Words model works in the opposite direction to the skip-gram model.
Here, we feed the context words as input and try to predict the target word.


![alt text](https://docs-cdn.fresco.me/system/attachments/files/004/721/977/large/7925a1a5251047f8f8d2cffd7854ccb5ca6ce61d/CBOW_model.jpeg)


The word vectors of the context words are averaged before being passed to the softmax layer.


#CBOW - Steps Involved
Most of the steps in the CBOW model are similar to those of the skip-gram model, except for how we compute the hidden layer output.
First, we pull the word vectors corresponding to the context words from the lookup table.
Once we have obtained the word vectors, their element-wise average is computed.
The averaged vector is then fed to the softmax layer to predict the target word.
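The forward pass described in these steps can be sketched with NumPy. The toy vocabulary, the random weights, and the names `lookup_table`, `softmax_weights`, and `cbow_forward` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
vocabulary = ["a", "game", "of", "thrones", "you", "win"]  # toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocabulary)}
V, N = len(vocabulary), 3

lookup_table = rng.normal(size=(V, N))     # embedding matrix
softmax_weights = rng.normal(size=(V, N))  # softmax-layer weights

def cbow_forward(context):
    """Predict a distribution over target words from a list of context words."""
    vectors = lookup_table[[word_to_index[w] for w in context]]  # one row per context word
    hidden = vectors.mean(axis=0)                                 # element-wise average
    scores = softmax_weights @ hidden
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

probs = cbow_forward(["game", "of", "you", "win"])
print(vocabulary[int(probs.argmax())], probs)
```

With untrained random weights the predicted word is arbitrary; training adjusts the lookup table so that the true target word receives a high probability.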

#Global Vectors
The skip-gram and CBOW models are predictive in nature, i.e. they learn the word embeddings by trying to predict the local context words.
Hence the word to vec approach does not take into account the global context of the word in the whole corpus.
The word representation called Global Vectors (GloVe) takes into account the co-occurrence matrix of words and uses this matrix to arrive at more expressive, dense word vectors.

#Generate Co-Occurrence Matrix
The Global Vector model, also known as the GloVe model, first computes the co-occurrence of each word with every other word in the corpus.

$X_{ij}$ = number of times the word i has appeared in the context of word j.

We build the matrix X having co-occurrence values of all the words in the corpus.
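A minimal sketch of building the co-occurrence counts is shown below. It assumes that "context" means a fixed window around each word and stores the counts in a dictionary keyed by (i, j) instead of a dense matrix; the function name and the toy corpus are hypothetical.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window_size=2):
    """Count how often word i appears within window_size words of word j."""
    counts = defaultdict(int)
    for tokens in sentences:
        for pos, word_j in enumerate(tokens):
            start = max(0, pos - window_size)
            end = min(len(tokens), pos + window_size + 1)
            for other in range(start, end):
                if other != pos:
                    counts[(tokens[other], word_j)] += 1
    return counts

sentences = [["you", "win", "or", "you", "die"]]  # toy corpus
X = cooccurrence_counts(sentences, window_size=1)
print(X[("you", "win")], X[("or", "you")], X[("win", "die")])
```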


#The Objective Function
Once we have the co-occurrence matrix X, we need to decide the vector values for each word in the corpus.
The word vectors should capture useful information about how every pair of words i and j co-occur.
Ideally, the word vectors should satisfy: $$w_i^T \cdot w_j + b_i + b_j = \log X_{ij}$$

$w_i$ and $w_j$ are the word vectors for i and j obtained from the lookup table.
$b_i$ and $b_j$ are scalar bias terms associated with words i and j.

#The Cost Function
The cost function J for the GloVe model is defined by $$J = \sum_{i=1}^V\sum_{j=1}^V f(X_{ij})\left(w_i^T \cdot w_j + b_i + b_j - \log X_{ij}\right)^2$$
The objective is to minimize the cost function J and learn the word embeddings through backpropagation.
The term $f(X_{ij})$ is a weight term defined as: $$f(X_{ij}) = \begin{cases} 0 & \text{if } X_{ij} = 0 \\ 1 & \text{otherwise} \end{cases}$$
The weight is zero for pairs that never co-occur, so the undefined term $\log 0$ never enters the sum.
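The cost can be evaluated with NumPy as in the sketch below. It uses the squared difference, as in the standard GloVe formulation, together with the 0/1 weight defined above; the toy co-occurrence matrix, the vector length, and the function name `glove_cost` are assumptions.

```python
import numpy as np

def glove_cost(W, b, X):
    """Weighted squared error between w_i . w_j + b_i + b_j and log X_ij."""
    weights = (X > 0).astype(float)          # f(X_ij): 0 if X_ij = 0, else 1
    log_X = np.log(np.where(X > 0, X, 1.0))  # placeholder 0 where X_ij = 0 (masked out)
    predictions = W @ W.T + b[:, None] + b[None, :]
    return float(np.sum(weights * (predictions - log_X) ** 2))

rng = np.random.default_rng(seed=0)
X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])              # toy co-occurrence matrix
W = rng.normal(scale=0.1, size=(3, 2))       # one 2-dimensional vector per word
b = np.zeros(3)                              # one bias per word
print(glove_cost(W, b, X))
```

Training would repeatedly adjust `W` and `b` by gradient descent to lower this value.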


#Gensim
Gensim is an open-source library for natural language processing.
With a few lines of code, you can generate word vectors for your own corpus.
Refer to the Gensim documentation for details on the library.

```
from gensim.models import Word2Vec

# Define the training data: a list of tokenized sentences
sentences = [['gensim', 'is', 'billed', 'as', 'a', 'natural', 'language', 'processing', 'package'],
             ['but', 'it', 'is', 'practically', 'much', 'more', 'than', 'that'],
             ['It', 'is', 'a', 'leading', 'and', 'a', 'state', 'of', 'the', 'art', 'package',
              'for', 'processing', 'texts', 'working', 'with', 'word', 'vector', 'models']]

# Train the model (vector_size was named size before Gensim 4.0)
model = Word2Vec(sentences, min_count=1, vector_size=10)

# Summarize the trained model
print(model)

# Summarize the vocabulary (Gensim 4.x; earlier versions used model.wv.vocab)
words = list(model.wv.key_to_index)
print(words)

# Access the vector for one word
print(model.wv['gensim'])

```


#Gensim - Word2Vec
The Gensim library has a built-in class called Word2Vec for working with word to vec models.
```
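from gensim.models import Word2Vec

# sentences is a list of tokenized sentences, as in the previous example;
# min_count=1 keeps rare words, which a very small corpus needs
model = Word2Vec(sentences, min_count=1)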
```

Parameters

1. vector_size: (default 100; named size before Gensim 4.0) The number of dimensions of the embedding, i.e. the length of the dense vector that represents each token (word).
2. window: (default 5) The maximum distance between a target word and the words around it.
3. min_count: (default 5) The minimum count of a word for it to be considered when training the model; words that occur fewer times than this are ignored.
4. workers: (default 3) The number of threads to use while training.
5. sg: (default 0) The training algorithm, either CBOW (0) or skip-gram (1).



#Similarity Metric
Representing words as vectors makes it possible to measure the similarity between two words.
Cosine similarity is a widely used metric for measuring the similarity between vectors.
Cosine similarity is computed as follows:
$$\text{similarity} = \cos\theta = \frac{V_1 \cdot V_2}{||V_1|| \, ||V_2||}$$

$V_1$ and $V_2$ are the two word vectors of interest.
$||V||$ is the square root of the sum of the squared elements of vector V.
$V_1 \cdot V_2$ is the dot product of the two vectors.
The larger the similarity score (the closer it is to 1), the closer the two vectors.
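The formula translates directly into NumPy. The example vectors below are toy values chosen for illustration, and the function name `cosine_similarity` is hypothetical.

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their norms."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

forest = np.array([1.0, 2.0])
jungle = np.array([1.0, 2.5])
opposite = -forest

print(cosine_similarity(forest, jungle))    # close to 1: nearly the same direction
print(cosine_similarity(forest, opposite))  # about -1: opposite direction
```

Cosine similarity depends only on the direction of the vectors, not on their length, and always lies between -1 and 1.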

#Summary
In this topic, you have learned:

1. The theory behind learning word embeddings using the GloVe model.
2. How to use the Gensim library to generate word vectors.
3. How to compute the similarity between word vectors using the cosine similarity metric.

#Machine Translation
Machine translation is one of the fastest-growing applications of deep learning, both in research and in industry.
The main idea of machine translation is to transform one form of sequence data into another form that the user can make sense of.
It is a core technology behind Google Translate, for both voice and text translation.

#Sequence to Sequence Model
For any translation, the input to the model is sequence data, either text or speech.
The output of the model is again sequence data, except that it is in a different language.
Since you are translating one sequence into another, the underlying translators are known as sequence to sequence models.
As you already know, Recurrent Neural Networks are good at handling sequence data, and they form the basic building block of sequence to sequence models.

#Encoders and Decoders
The sequence to sequence model can be broken down into two parts: an Encoder network and a Decoder network.
Let's take the example of a French to English translation model trying to translate the sentence below:
Les petits oiseaux chantent joyeusement -> The little birds are singing happily

In the next cards, let's see how the Encoder and Decoder networks translate the French sentence into its English counterpart.

#Encoder
The Encoder network takes in the input sequence and encodes the whole sequence into a numeric representation.
The final encoding holds all the information required to translate the sentence.
Each word in the input sequence can be given as a vector representation, such as the word vectors you learned about in the previous topics, or as any other numerical representation that uniquely identifies the word.


#Decoder
The Decoder network is another RNN that accepts the encoding generated by the Encoder network and predicts the corresponding English sentence word by word, just like a sequence generator.
Unlike in the Encoder network, the input to the Decoder network at the current time step is the word predicted by the same network at the previous time step.


#Greedy Search
The straightforward approach to selecting words from the decoder output is to pick, at every time step, the word with the maximum probability from the final softmax layer of the decoder network.
This approach is known as greedy search; its drawback is covered in the next card.
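Greedy search can be sketched on a toy decoder. The table below stands in for the decoder's softmax output at each time step; its words and probabilities are invented purely for illustration, and `<end>` is a hypothetical end-of-sentence token.

```python
# Hypothetical next-word probabilities P(word | x, previous words) for one input x.
NEXT_WORD_PROBS = {
    (): {"The": 1.0},
    ("The",): {"small": 0.6, "little": 0.4},
    ("The", "small"): {"birds": 0.55, "bird": 0.45},
    ("The", "little"): {"birds": 0.9, "bird": 0.1},
    ("The", "small", "birds"): {"<end>": 1.0},
    ("The", "small", "bird"): {"<end>": 1.0},
    ("The", "little", "birds"): {"<end>": 1.0},
    ("The", "little", "bird"): {"<end>": 1.0},
}

def greedy_decode():
    """Pick the single most probable word at every time step."""
    prefix, probability = (), 1.0
    while prefix[-1:] != ("<end>",):
        options = NEXT_WORD_PROBS[prefix]
        word = max(options, key=options.get)
        probability *= options[word]
        prefix += (word,)
    return prefix, probability

sentence, probability = greedy_decode()
print(sentence, probability)  # ends with probability of about 0.33
```

In this toy table, the sentence starting with "The little" has an overall probability of about 0.36, but greedy search never explores it because "small" wins at the second time step.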


#Greedy Search - Drawback
The drawback of the greedy search approach is that it cannot guarantee that the model ends up with the best translation.
For example, for the same French input sentence x = Les petits oiseaux chantent joyeusement, the output translation could be
sentence1 - The small birds are singing happily or

sentence2 - The little birds are singing happily

Greedy search might choose the first sentence because small is the most probable word at its time step, even if the second sentence is more meaningful and has the higher overall probability, i.e. $P(\text{sentence2} \mid x) > P(\text{sentence1} \mid x)$.


#Beam Search
Beam search is an approximate search strategy that addresses this issue of greedy search.
Unlike greedy search, beam search keeps the B most probable candidates at each time step before proceeding to the next time step.
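Beam search can be sketched on a toy decoder as follows. The probability table is invented for illustration, `<end>` is a hypothetical end-of-sentence token, and the function name `beam_search` is an assumption. With a beam width of 1 the procedure reduces to greedy search.

```python
# Hypothetical next-word probabilities P(word | x, previous words) for one input x.
NEXT_WORD_PROBS = {
    (): {"The": 1.0},
    ("The",): {"small": 0.6, "little": 0.4},
    ("The", "small"): {"birds": 0.55, "bird": 0.45},
    ("The", "little"): {"birds": 0.9, "bird": 0.1},
    ("The", "small", "birds"): {"<end>": 1.0},
    ("The", "small", "bird"): {"<end>": 1.0},
    ("The", "little", "birds"): {"<end>": 1.0},
    ("The", "little", "bird"): {"<end>": 1.0},
}

def beam_search(beam_width):
    """Keep the beam_width most probable partial translations at each step."""
    beams = [((), 1.0)]
    while any(prefix[-1:] != ("<end>",) for prefix, _ in beams):
        candidates = []
        for prefix, probability in beams:
            if prefix[-1:] == ("<end>",):
                candidates.append((prefix, probability))  # finished, keep as is
                continue
            for word, p in NEXT_WORD_PROBS[prefix].items():
                candidates.append((prefix + (word,), probability * p))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

best_sentence, best_probability = beam_search(beam_width=2)[0]
print(best_sentence, best_probability)  # "The little birds", about 0.36
print(beam_search(beam_width=1)[0])     # "The small birds", about 0.33
```

Keeping two candidates lets the search recover the sentence that starts with the less probable word "little" but has the higher overall probability.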


#Summary
In this topic, you have learned:

1. The concept of encoders and decoders for machine translation.
2. The logic behind the beam search algorithm for selecting the best possible translation from the decoder network.

#Sentiment Classification
This topic is the road map for the next hands-on, where you train a model to classify movie reviews as positive or negative.
You will perform the following steps to build and train the classifier.
1. Perform the necessary preprocessing to transform movie reviews into one-hot encoded data.
2. Initialize the lookup table.
3. Build an RNN model to learn from the input data.
4. Use the output of the RNN to classify the reviews using a sigmoid activation.


You will be using the Keras framework to perform all the above operations.


#Data Preprocessing
When it comes to text data, we first remove stop words, if necessary, and then transform each character or word into a one-hot encoding.
The Keras framework has a built-in class called Tokenizer, which performs tokenization and indexing of each word in the documents.
By default, it also removes special characters such as punctuation.


```
from keras.preprocessing.text import Tokenizer

# Collection of text (the corpus)
docs = ["not good", "climax was awesome !", "really liked the movie", "too lengthy"]

t = Tokenizer()

# Fit the tokenizer on the corpus
t.fit_on_texts(docs)

# Number of documents in the corpus
print(t.document_count)

# Number of occurrences of each word across the corpus
print(t.word_counts)

# Dictionary with each word as key and its unique index as value
print(t.word_index)

# Dictionary with each word as key and the number of documents containing it as value
print(t.word_docs)
```

#Lookup Table
Now each word in the text data is replaced by its respective index.
To train the LSTM model, you do not feed the word indices directly to the LSTM network.
We first initialize a lookup table of shape (vocab_size, vector_length).
We do this with the Keras Embedding class as follows.

```
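from keras.layers import Embedding

vocab_size = 1000    # example value: number of distinct words in the vocabulary
vector_length = 32   # example value: length of each word vector

embedding_layer = Embedding(vocab_size, vector_length)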
```

Note that the embeddings initialized this way are random at first and are updated as you train the network.

#Transform Data
Once you have a unique index for each word in the corpus, the corpus has to be represented as arrays of indices in place of words, as shown below.
```
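word_to_id = {"the": 0, "awesome": 1, "movie": 2, "good": 3, "was": 4}

data = ["the movie was awesome"]
# Replace every word with its index
transformed_data = [[word_to_id[word] for word in text.split()] for text in data]
print(transformed_data)  # [[0, 2, 4, 1]]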
 ```

#Sequence Padding

The length of a movie review is not fixed; it can be very short or very long.
The model may take a very long time to train if the text data is too long.
For all the reviews, we may consider only the first few words, say 500.
If a review has fewer than 500 words, we add zeros at the beginning to bring its length up to 500.
```
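from keras.preprocessing import sequence
max_review_length = 500
# Zeros are added at the beginning; truncating="post" keeps the first 500 indices
padded_data = sequence.pad_sequences(transformed_data, maxlen=max_review_length,
                                     truncating="post")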
```

#Building the Network
Once you are done with the preprocessing, it's time to build a sentiment classifier for movie reviews.

The image shows the outline of the sentiment classifier that you will design in the hands-on.

![alt text](https://docs-cdn.fresco.me/system/attachments/files/004/719/721/large/e3cf9865e35a65bdcf7b5443e00653bc5ca54b17/sentiment_analysis.jpeg)



#Story So Far
In this course, you have learned the following topics:

1. The concept of word embeddings.
2. Different algorithms used to generate word embeddings.
3. How to build an encoder-decoder network using RNNs for machine translation.
4. How to use word vectors for sentiment analysis using an RNN.