This tutorial introduces word embeddings

It contains complete code to train word embeddings from scratch on a small dataset

# 1) Representing text as numbers

Machine learning models take vectors (arrays of numbers) as input

When working with text, the first thing we must do is come up with a strategy to convert strings to numbers (or to "vectorize" the text) before feeding it to the model

In this section, we will look at three strategies for doing so

1. One-hot encodings
2. Encode each word with a unique number
3. Word embeddings



---



**Frequency-based Embedding**
1.   Count Vector
2.   TF-IDF Vector
3.   Co-Occurrence Vector

**Prediction-based Embedding**
1.   CBOW (Continuous Bag of Words)
2.   Skip-Gram

## 1) One-hot encodings

We might "one-hot" encode each word in our vocabulary

Consider the sentence **"The cat sat on the mat"**

The vocabulary (unique words) in this sentence is (cat, sat, on, the, mat)

To represent each word, we will create a zero vector with length equal to the vocabulary size, then place a one at the index that corresponds to the word

![alt text](https://drive.google.com/uc?id=1MbwyW_fPcqydSSRfqThyPi8xuqjJKHkf)

To create a vector that contains the encoding of the sentence, we can concatenate the one-hot vectors for each word
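
The one-hot scheme described above can be sketched in plain Python. This is only an illustration under a few assumptions: the text is lowercased and split on whitespace, the vocabulary order follows the list given earlier (cat, sat, on, the, mat), and the names `word_to_index`, `one_hot`, and `sentence_encoding` are hypothetical helpers, not part of any library

```python
# Minimal one-hot encoding sketch for the example sentence.
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()  # assumption: lowercase + whitespace tokenization

# Vocabulary in the order used in the text: (cat, sat, on, the, mat).
vocabulary = ["cat", "sat", "on", "the", "mat"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a zero vector of vocabulary length with a one at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


# Encode the sentence by concatenating the one-hot vector of every token.
sentence_encoding = []
for token in tokens:
    sentence_encoding.extend(one_hot(token))

print(one_hot("cat"))          # [1, 0, 0, 0, 0]
print(len(sentence_encoding))  # 30 (6 tokens * 5 vocabulary entries)
```

Note that the concatenated encoding grows with both the sentence length and the vocabulary size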

**This approach is inefficient**

A one-hot encoded vector is sparse (most indices are zero)

With 10,000 words in the vocabulary, one-hot encoding each word means creating a vector in which 99.99% of the elements are zero
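
The 99.99% figure can be checked with a quick sketch. The index 42 below is an arbitrary, hypothetical position chosen only for illustration

```python
vocabulary_size = 10_000

# One-hot vector for a hypothetical word at index 42 (the index is arbitrary).
one_hot_vector = [0] * vocabulary_size
one_hot_vector[42] = 1

# Fraction of entries that are zero: 9,999 out of 10,000.
zero_fraction = one_hot_vector.count(0) / vocabulary_size
print(f"{zero_fraction:.2%} of the elements are zero")  # 99.99% of the elements are zero
```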

## 2) Encode each word with a unique number

**Second approach: Encode each word with a unique number**

We can encode each word using a unique number

Continuing the example above, "The cat sat on the mat", we could assign 1 to "the", 2 to "cat", and so on

The encoded sentence would then be [1, 2, 3, 4, 1, 5]

This approach is more efficient than the sparse one-hot approach, because the vector is now dense
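
An integer encoding of the example sentence might look like the sketch below. It assumes lowercased, whitespace-split tokens and assigns ids in order of first appearance starting from 1. That numbering is just one possible choice, and `word_to_id` is a hypothetical name

```python
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()  # assumption: lowercase + whitespace tokenization

# Assign the next unused integer to each new word, starting from 1.
word_to_id = {}
for token in tokens:
    if token not in word_to_id:
        word_to_id[token] = len(word_to_id) + 1

# Replace every token with its integer id.
encoded = [word_to_id[token] for token in tokens]

print(word_to_id)  # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(encoded)     # [1, 2, 3, 4, 1, 5]
```

Both occurrences of "the" map to the same id, and the encoded sentence has one integer per token instead of one vector per token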

However, the integer encoding is arbitrary: it does not capture any relationship between the words

An integer encoding can also be challenging for a model to interpret

For example, a linear classifier learns a single weight for each feature

Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful

## 3) Word embeddings

Word embeddings give us an efficient, dense representation in which similar words have a similar encoding

With word embeddings, we do not have to specify this encoding by hand

**An embedding** is a dense vector of floating point values (the length of the vector is a parameter you specify)

Instead of being specified manually, the embedding values are trainable parameters

**Trainable parameters:** weights learned by the model during training, in the same way a model learns weights for a dense layer

Word embeddings commonly range from 8 dimensions (for small datasets) up to 1024 dimensions (for large datasets)

A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn

![alt text](https://drive.google.com/uc?id=15PinRU7Q9b6l7_YH2wKAf4yZFP8wLRzR)

In the word embedding diagram above, each word is represented as a 4-dimensional vector of floating point values

Another way to think of an embedding is as a "lookup table"

After these weights have been learned, we can encode each word by looking up the dense vector it corresponds to in the table
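
The lookup-table view can be sketched in plain Python. This is a toy illustration, not a trained model: the random values stand in for weights that would normally be learned, and `embedding_table`, `word_to_index`, and `embed` are hypothetical names. In practice, a deep learning framework's embedding layer would hold and train this table

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

vocabulary = ["cat", "sat", "on", "the", "mat"]
embedding_dim = 4  # matches the 4-dimensional vectors described above

# The "lookup table": one row of floats per word.
# Assumption: random values stand in for weights learned during training.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in vocabulary
]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def embed(word):
    """Look up the dense vector stored for a word."""
    return embedding_table[word_to_index[word]]


# Encode a sentence as a list of dense vectors, one per token.
sentence_vectors = [embed(token) for token in "the cat sat on the mat".split()]
print(len(sentence_vectors), len(sentence_vectors[0]))  # 6 4
```

Each word costs only `embedding_dim` floats regardless of vocabulary size, and both occurrences of "the" retrieve the same row of the table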