# Word Embedding

1. Word embedding is a core concept in Natural Language Processing (NLP) where words are represented as numerical vectors.
2. This technique is crucial for NLP tasks, enabling machines to understand word meanings and relationships.
Significance of Word Embedding:

3. Word embedding captures semantic relationships, making it a key component in NLP applications like sentiment analysis and machine translation.
4. It reduces dimensionality, improving model performance and facilitating the processing of large text corpora.

### Import relevant libraries

In [109]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding

### Data

In [110]:
reviews = {
    'too good',
    'very nice food',
    'amazing restaurant',
    'very bad',
    
    'just loved it',
    'will go again',
    'horrible food',
    'never go there',
    'poor service',
    'poor quality',
    'very nice',
    'needs improvement'
}
sentiments = np.array([1,1,1,0,1,1,0,0,0,0,1,0])

### convert into one hot vector

1. 'encoded_review' will contain lists of integers, where each integer corresponds to a word in the input text. 
2. The vocabulary size is set to 30, and the function assigns unique integers to the words in the input text within that vocabulary size constraint. 
3. The specific integer assigned to each word is determined by the hashing function used by one_hot

In [111]:
from tensorflow.keras.preprocessing.text import one_hot

vocabulary_size = 50
encoded_review = [one_hot(sentence, vocabulary_size) for sentence in reviews]
encoded_review

[[34, 1],
 [36, 2],
 [5, 9, 10],
 [23, 9, 30],
 [26, 32, 43],
 [34, 1, 2],
 [8, 48],
 [34, 10],
 [10, 31],
 [10, 1],
 [5, 48],
 [14, 46]]

### Padding
1. Some sentences are 3 word long and some ar 4.
2. So we need padding to have uniform size

In [112]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 3
padded_reviews = pad_sequences(encoded_review, maxlen = max_length, padding = 'post')
padded_reviews

array([[34,  1,  0],
       [36,  2,  0],
       [ 5,  9, 10],
       [23,  9, 30],
       [26, 32, 43],
       [34,  1,  2],
       [ 8, 48,  0],
       [34, 10,  0],
       [10, 31,  0],
       [10,  1,  0],
       [ 5, 48,  0],
       [14, 46,  0]])

### Model


In [113]:
embedded_vector_size = 4
train = padded_reviews
targets = sentiments

model = tf.keras.Sequential([
    Embedding(vocabulary_size, embedded_vector_size, input_length = max_length, name = 'embedding'),
    Flatten(),
    Dense(1, activation = 'sigmoid')
])

model.compile('adam', loss = 'binary_crossentropy', metrics =['accuracy'])
model.summary()

Model: "sequential_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 3, 4)              200       
                                                                 
 flatten_12 (Flatten)        (None, 12)                0         
                                                                 
 dense_12 (Dense)            (None, 1)                 13        
                                                                 
Total params: 213
Trainable params: 213
Non-trainable params: 0
_________________________________________________________________


In [114]:
model.fit(train, targets, epochs =30, verbose =2 )

Epoch 1/30


1/1 - 1s - loss: 0.6927 - accuracy: 0.5833 - 610ms/epoch - 610ms/step
Epoch 2/30
1/1 - 0s - loss: 0.6914 - accuracy: 0.5833 - 3ms/epoch - 3ms/step
Epoch 3/30
1/1 - 0s - loss: 0.6901 - accuracy: 0.5833 - 2ms/epoch - 2ms/step
Epoch 4/30
1/1 - 0s - loss: 0.6888 - accuracy: 0.5833 - 2ms/epoch - 2ms/step
Epoch 5/30
1/1 - 0s - loss: 0.6876 - accuracy: 0.5833 - 3ms/epoch - 3ms/step
Epoch 6/30
1/1 - 0s - loss: 0.6863 - accuracy: 0.5833 - 2ms/epoch - 2ms/step
Epoch 7/30
1/1 - 0s - loss: 0.6850 - accuracy: 0.6667 - 3ms/epoch - 3ms/step
Epoch 8/30
1/1 - 0s - loss: 0.6837 - accuracy: 0.6667 - 999us/epoch - 999us/step
Epoch 9/30
1/1 - 0s - loss: 0.6824 - accuracy: 0.6667 - 2ms/epoch - 2ms/step
Epoch 10/30
1/1 - 0s - loss: 0.6811 - accuracy: 0.6667 - 999us/epoch - 999us/step
Epoch 11/30
1/1 - 0s - loss: 0.6799 - accuracy: 0.6667 - 3ms/epoch - 3ms/step
Epoch 12/30
1/1 - 0s - loss: 0.6786 - accuracy: 0.7500 - 3ms/epoch - 3ms/step
Epoch 13/30
1/1 - 0s - loss: 0.6773 - accuracy: 0.7500 - 3ms/epoch - 3ms

<keras.callbacks.History at 0x267fb813b50>

In [115]:
model.evaluate(train,targets)



[0.6536964178085327, 0.9166666865348816]

## Note:
1. This is a fake problem, we are more interested in the word embedding which we will do now
2. we have that data in the Embedding layer- "embedding"

In [116]:
weights = model.get_layer('embedding').get_weights()[0]
len(weights)

50

In [117]:
weights[1]

array([0.05637718, 0.07565457, 0.06916998, 0.01669856], dtype=float32)

In [118]:
weights[48]

array([-0.02278753,  0.0499354 ,  0.0567701 ,  0.07229727], dtype=float32)

- 1-> good
- 48-> nice 
- from one_hot 
- even though good and nice have same meaning the vector are different as we had limited dataset