
# Embeddings in Keras

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
import numpy as np

We create a simple model that consists of a single embedding layer:

In [None]:
model = Sequential()
model.add(
    Embedding(
        input_dim=10,   # Size of the input vocabulary
        output_dim=4,   # Dimensionality of output vector space
        input_length=2  # Length of each input sequence
    )
)
model.compile(optimizer="adam", loss="mse")

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 2, 4)              40        
                                                                 
Total params: 40
Trainable params: 40
Non-trainable params: 0
_________________________________________________________________


## Example: embedding words with a randomly initialized model

Now we take some integer-indexed input data and embed it with the model:

In [None]:
input_data = np.array([[1, 2]])
pred = model.predict(input_data)
print("Input data shape: ", input_data.shape)
print("Predictions")
print(pred)

Input data shape:  (1, 2)
Predictions
[[[-0.04194342 -0.03313633 -0.00573406  0.03623357]
  [-0.00629203 -0.02213107 -0.0070897   0.01656753]]]


In [None]:
model.layers[0].trainable_weights

[<tf.Variable 'embedding/embeddings:0' shape=(10, 4) dtype=float32, numpy=
 array([[ 0.04158003,  0.04614941,  0.01253562,  0.03269812],
        [-0.04194342, -0.03313633, -0.00573406,  0.03623357],
        [-0.00629203, -0.02213107, -0.0070897 ,  0.01656753],
        [-0.01223986, -0.02374547, -0.04237936, -0.04358562],
        [-0.02551185,  0.01600123,  0.03523505, -0.03985095],
        [-0.04595212,  0.0114981 ,  0.00273025, -0.01366209],
        [ 0.04968718,  0.00606449,  0.02919232,  0.01792708],
        [ 0.00782752,  0.04510366, -0.03712968,  0.00625715],
        [ 0.01315261, -0.01916968,  0.01093953,  0.02169173],
        [ 0.01646555, -0.01309564, -0.02450665, -0.04531604]],
       dtype=float32)>]
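
Comparing the two outputs, the prediction for the input `[[1, 2]]` is exactly rows 1 and 2 of the weight matrix, so the embedding layer acts as a table lookup. Below is a minimal NumPy sketch of that lookup. It copies the first three rows of the weights printed above; the names `weights` and `looked_up` are illustrative and not part of Keras.

```python
import numpy as np

# Rows 0, 1 and 2 of the embedding matrix printed above.
weights = np.array([
    [ 0.04158003,  0.04614941,  0.01253562,  0.03269812],
    [-0.04194342, -0.03313633, -0.00573406,  0.03623357],
    [-0.00629203, -0.02213107, -0.0070897 ,  0.01656753],
], dtype=np.float32)

input_data = np.array([[1, 2]])

# Integer-array indexing picks one row per index, which mirrors
# what the embedding layer returns for this input.
looked_up = weights[input_data]

print(looked_up.shape)  # (1, 2, 4)
print(looked_up)
```

The result has one 4-dimensional vector per input index, matching the `(None, 2, 4)` output shape in the model summary.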

## Training embedding model

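To get word embeddings, we do the following:

1. Split the sentences into words (tokenization).
2. Encode each word as an integer index. Despite its name, the Keras function `one_hot` does not produce one-hot vectors; it hashes each word to an integer between 1 and `vocab_size - 1`.
3. Pad the sequences, if needed, so that they all have the same length.
4. Pass the padded sequences as inputs for model training.
5. In the model, flatten the embedded sequence and apply a dense layer to predict the label.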
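
Steps 1 and 2 can be sketched in plain Python. The helpers `tokenize` and `hash_encode` below are hypothetical and are not the Keras implementation; MD5 is used here only so that the result is reproducible between runs, so the indices will generally differ from those that Keras produces.

```python
import hashlib
import string


def tokenize(text):
    """Lowercase the text, turn punctuation (except apostrophes) into spaces, split on whitespace."""
    punctuation = string.punctuation.replace("'", "")
    table = str.maketrans(punctuation, " " * len(punctuation))
    return text.lower().translate(table).split()


def hash_encode(text, vocab_size):
    """Map each word to an integer in [1, vocab_size - 1]; index 0 stays free for padding."""
    return [
        int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % (vocab_size - 1) + 1
        for word in tokenize(text)
    ]


encoded = [hash_encode(r, 50) for r in ["horrible service", "horrible food!"]]
print(encoded)
```

Because the index depends only on the word, "horrible" receives the same index in both reviews. Two different words may also land on the same index, which is the price of hashing into a small vocabulary.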

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Embedding, Dense

In [None]:
# Define 10 restaurant reviews as data
reviews = [
    "Never coming back!",
    "horrible service",
    "rude waitress",
    "cold food",
    "horrible food!",
    "awesome",
    "awesome services!",
    "rocks",
    "poor work",
    "couldn't have done better",
]

# Labels: 1 is negative and 0 is positive
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

In [None]:
vocab_size = 50
encoded_reviews = [one_hot(review, vocab_size) for review in reviews]
print(f"Encoded reviews: {encoded_reviews}")

Encoded reviews: [[24, 17, 45], [5, 49], [46, 29], [43, 23], [5, 23], [36], [36, 44], [33], [44, 20], [32, 45, 1, 38]]
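
In the output above, repeated words get the same index ("horrible" is 5, "food" is 23), but some different words also share an index. The sketch below pairs each word with its index from that output to list such collisions. It assumes that the words were split on whitespace after lowercasing and dropping `!`, which agrees with the sequence lengths shown.

```python
from collections import defaultdict

reviews = [
    "Never coming back!", "horrible service", "rude waitress", "cold food",
    "horrible food!", "awesome", "awesome services!", "rocks", "poor work",
    "couldn't have done better",
]
# Indices copied from the notebook output above.
encoded_reviews = [
    [24, 17, 45], [5, 49], [46, 29], [43, 23], [5, 23],
    [36], [36, 44], [33], [44, 20], [32, 45, 1, 38],
]

# Collect the set of distinct words that map to each index.
words_by_index = defaultdict(set)
for review, codes in zip(reviews, encoded_reviews):
    words = review.lower().replace("!", "").split()
    for word, code in zip(words, codes):
        words_by_index[code].add(word)

# An index shared by more than one distinct word is a hash collision.
collisions = {
    code: sorted(words)
    for code, words in words_by_index.items()
    if len(words) > 1
}
print(collisions)  # {45: ['back', 'have'], 44: ['poor', 'services']}
```

Colliding words share a single embedding vector, so a larger `vocab_size` makes such clashes less likely at the cost of more parameters.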


In [16]:
max_length = 4
padded_reviews = pad_sequences(
    encoded_reviews, maxlen=max_length, padding="post"
)
print(padded_reviews)

[[24 17 45  0]
 [ 5 49  0  0]
 [46 29  0  0]
 [43 23  0  0]
 [ 5 23  0  0]
 [36  0  0  0]
 [36 44  0  0]
 [33  0  0  0]
 [44 20  0  0]
 [32 45  1 38]]
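
Post-padding can be reproduced with plain NumPy, which makes it explicit that index 0 is reserved for padding. This is a sketch of the behaviour seen above, not the Keras implementation; the encoded reviews are copied from the earlier output.

```python
import numpy as np

encoded_reviews = [
    [24, 17, 45], [5, 49], [46, 29], [43, 23], [5, 23],
    [36], [36, 44], [33], [44, 20], [32, 45, 1, 38],
]
max_length = 4

# Start from an all-zero matrix and copy each sequence into the left
# part of its row; the remaining positions keep the padding value 0.
padded = np.zeros((len(encoded_reviews), max_length), dtype=int)
for row, codes in enumerate(encoded_reviews):
    kept = codes[-max_length:]  # keep the last max_length items of longer sequences
    padded[row, :len(kept)] = kept

print(padded)
```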


The model embeds each word index into an 8-dimensional vector space. The input vocabulary size is `vocab_size` and the input sequence length is `max_length`:

In [17]:
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=8, input_length=max_length),
    Flatten(),
    Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 4, 8)              400       
                                                                 
 flatten (Flatten)           (None, 32)                0         
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
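
The parameter counts in the summary can be checked by hand. The short calculation below uses the values from this notebook (`vocab_size = 50`, 8 output dimensions, `max_length = 4`); the name `embedding_dim` is introduced here only for illustration.

```python
vocab_size = 50
embedding_dim = 8
max_length = 4

# The embedding layer stores one 8-dimensional vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# Flatten has no parameters; it reshapes (4, 8) into a vector of length 32.
flattened_size = max_length * embedding_dim

# The dense layer has one weight per flattened input plus one bias.
dense_params = flattened_size * 1 + 1

print(embedding_params, flattened_size, dense_params, embedding_params + dense_params)
# 400 32 33 433
```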


In [18]:
model.fit(padded_reviews, labels, epochs=100, verbose=0)

<keras.callbacks.History at 0x7f2c35f27f50>

In [20]:
print(model.layers[0].get_weights()[0].shape)

(50, 8)


In [22]:
print(model.layers[0].get_weights()[0][0])

[ 0.1151866   0.11104467  0.15172915 -0.1234385  -0.12647197  0.07070275
 -0.09507612 -0.15376773]
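
Row 0 printed above is the vector for index 0, which is the padding value in `padded_reviews`; the vector for any other word sits in the row given by its index. As a rough sketch of how one padded review flows through the model, the NumPy code below performs the lookup, the flattening, and the sigmoid dense layer. The weights are random stand-ins, not the trained values, and the names `embeddings`, `dense_w`, and `dense_b` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins with the same shapes as the model's weights.
embeddings = rng.uniform(-0.05, 0.05, size=(50, 8))
dense_w = rng.normal(size=(32, 1))
dense_b = np.zeros(1)

# "horrible service" from the padded reviews: indices 5 and 49, then padding.
review = np.array([[5, 49, 0, 0]])

vectors = embeddings[review]        # embedding lookup, shape (1, 4, 8)
flat = vectors.reshape(1, -1)       # flatten, shape (1, 32)
logit = flat @ dense_w + dense_b    # dense layer, shape (1, 1)
prob = 1.0 / (1.0 + np.exp(-logit)) # sigmoid activation

print(vectors.shape, flat.shape, prob.shape)  # (1, 4, 8) (1, 32) (1, 1)
```

Both padding positions pick up the same row 0 of the embedding matrix, so the vector of the padding index also takes part in the prediction.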
