<a href="https://colab.research.google.com/github/gcosma/COP509/blob/main/Week3bLearnedEmbedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Lesson 05: Learned Embedding**

**Original Source:** Jason Brownlee, [How to Get Started with Deep Learning for Natural Language Processing](https://machinelearningmastery.com/crash-course-deep-learning-natural-language-processing/), Available from [here](https://machinelearningmastery.com), accessed December 13, 2021.

- Lesson 01: Deep Learning and Natural Language
- Lesson 02: Cleaning Text Data
- Lesson 03: Bag-of-Words Model
- Lesson 04: Word Embedding Representation
- **Lesson 05: Learned Embedding**
- Lesson 06: Classifying Text
- Lesson 07: Movie Review Sentiment Analysis Project

In this lesson, you will discover how to learn a distributed representation of words (a word embedding) as part of fitting a deep learning model.

**Embedding Layer**

Keras offers an Embedding layer that can be used for neural networks on text data.

It requires that the input data be integer encoded so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.
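
To make the integer encoding concrete, here is a minimal plain-Python sketch of the idea. It is not the Keras Tokenizer itself: the helper names `fit_word_index` and `encode` are hypothetical, and the tie-breaking rule (alphabetical) is an assumption made only so the output is deterministic. Indices start at 1 so that 0 stays free for padding, the same convention the Keras Tokenizer follows.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a unique integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically (assumption, for determinism).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def encode(texts, word_index):
    """Replace each known word with its integer; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

docs = ["well done", "good work", "poor work"]
word_index = fit_word_index(docs)
encoded = encode(docs, word_index)
print(word_index)  # {'work': 1, 'done': 2, 'good': 3, 'poor': 4, 'well': 5}
print(encoded)     # [[5, 2], [3, 1], [4, 1]]
```

The vocabulary size passed to an Embedding layer would then be the largest index plus one, to account for the reserved 0.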

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset. You must specify the input_dim, which is the size of the vocabulary, and the output_dim, which is the size of the vector space of the embedding. Optionally, you can also specify the input_length, which is the number of words in each input sequence.

In [None]:
# Do not run: signature template only; ?? marks a value you must supply
layer = Embedding(input_dim, output_dim, input_length=??)

More concretely, the layer below is defined for a vocabulary of 200 words, a distributed representation of 32 dimensions, and an input length of 50 words.



In [None]:
# Do not run: illustrative only (assumes Embedding has been imported from tensorflow.keras.layers)
layer = Embedding(200, 32, input_length=50)
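
Conceptually, an embedding layer is a lookup table with one row per vocabulary index. The NumPy sketch below imitates that lookup for the 200-word, 32-dimension, 50-word configuration above. It is an illustration of the idea rather than the Keras implementation; the random table values and the random document are assumptions standing in for learned weights and real data.

```python
import numpy as np

vocab_size, embed_dim, seq_len = 200, 32, 50

rng = np.random.default_rng(0)
# One row per vocabulary index; arbitrary small random values stand in for weights.
embedding_table = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

# A hypothetical integer-encoded document of 50 word indices.
document = rng.integers(low=0, high=vocab_size, size=seq_len)

# The forward pass of an embedding is row indexing: one vector per word.
embedded = embedding_table[document]
print(embedded.shape)              # (50, 32)
print(embedded.reshape(-1).shape)  # (1600,)
```

Each 50-word document therefore becomes a 50 x 32 matrix, or 1,600 values once flattened for a dense layer.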

**Embedding with Model**

The Embedding layer can be used as the front-end of a deep learning model to provide a rich distributed representation of words, and importantly this representation can be learned as part of training the deep learning model.

For example, the snippet below will define and compile a neural network with an embedding input layer and a dense output layer for a document classification problem.

When the model is trained on examples of padded documents and their associated output labels, both the network weights and the distributed representation will be tuned to the specific data.

In [None]:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Flatten, Embedding
import numpy as np

# define problem
vocab_size = 100
max_length = 32

# define the model
model = Sequential([
    Embedding(vocab_size, 8, input_length=max_length),
    Flatten(),
    Dense(1, activation='sigmoid')
])

# Build the model by running a sample input
sample_input = np.zeros((1, max_length))  # Create a sample batch with one sequence
_ = model(sample_input)  # This builds the model

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# summarize the model
model.summary()
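
The parameter counts reported by the summary can be checked by hand. The short calculation below is a sketch that mirrors the model defined above (vocabulary of 100, 8-dimensional embedding, sequences of 32 words, one sigmoid output); the variable names are only for illustration.

```python
vocab_size = 100
embed_dim = 8
max_length = 32

embedding_params = vocab_size * embed_dim  # one 8-dimensional vector per word
flattened_size = max_length * embed_dim    # Flatten output size per document
dense_params = flattened_size * 1 + 1      # one weight per input plus one bias
total_params = embedding_params + dense_params

print(embedding_params)  # 800
print(flattened_size)    # 256
print(dense_params)      # 257
print(total_params)      # 1057
```

Most of the trainable parameters sit in the embedding table, which is why the vocabulary size and embedding dimension dominate the size of small text models like this one.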

It is also possible to initialize the Embedding layer with pre-trained weights, such as those prepared by Gensim, and to configure the layer so that it is not trainable. This approach can be useful if a very large corpus of text is available to pre-train the word embedding.
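
One common way to prepare such weights is to build a matrix whose row i holds the pre-trained vector for the word encoded as integer i. The sketch below shows that step with NumPy only. The vectors and the `word_index` mapping are made-up, hypothetical values; in practice the vectors would come from a trained model such as one produced by Gensim.

```python
import numpy as np

# Hypothetical pre-trained vectors (made-up numbers for illustration).
pretrained = {
    "good": np.array([0.9, 0.1, 0.0]),
    "poor": np.array([-0.8, 0.2, 0.1]),
}
# Hypothetical word-to-integer mapping from the encoding step (0 = padding).
word_index = {"work": 1, "good": 2, "poor": 3}

embed_dim = 3
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0
embedding_matrix = np.zeros((vocab_size, embed_dim))
for word, idx in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:  # words without a pre-trained vector stay all-zero
        embedding_matrix[idx] = vector

print(embedding_matrix.shape)  # (4, 3)
```

A matrix like this could then be handed to the layer, for example through `embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)` together with `trainable=False`, so that the vectors are kept fixed during training.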

**Your Task**

Your task in this lesson is to design a small document classification problem with 10 documents of one sentence each and associated labels of positive and negative outcomes and to train a network with word embedding on these data. Note that each sentence will need to be padded to the same maximum length prior to training the model using the Keras pad_sequences() function. Bonus points if you load a pre-trained word embedding prepared using Gensim.
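
As a rough picture of what padding does, the sketch below left-pads integer-encoded documents with zeros, which matches the default behaviour of pad_sequences() (padding and truncating at the front, fill value 0). The helper `pad_pre` is a hypothetical stand-in written for illustration, not the Keras function, and the encoded documents are made up.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

encoded_docs = [[5, 2], [3, 1, 4], [7]]
padded_docs = pad_pre(encoded_docs, maxlen=4)
print(padded_docs)  # [[0, 0, 5, 2], [0, 3, 1, 4], [0, 0, 0, 7]]
```

After padding, every document has the same length, so the batch can be stored as a single rectangular array and fed to the Embedding layer.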


**More Information**

- Data Preparation for Variable Length Input Sequences
- How to Use Word Embedding Layers for Deep Learning with Keras

In the next lesson, you will discover how to develop deep learning models for classifying text.