In [1]:
from tensorflow.keras.preprocessing.text import one_hot

In [2]:
### sentences
sentences = [
    'the glass of milk',
    'the glass of juice',
    'the cup of tea',
    'I am a good boy',
    'I am a good developer',
    'understand the meaning of words',
    'your videos are good',
]

In [3]:
voc_size = 10000
one_hot_repr = [one_hot(sentence,voc_size) for sentence in sentences]

In [4]:
one_hot_repr

[[3596, 5823, 1095, 5957],
 [3596, 5823, 1095, 9597],
 [3596, 2692, 1095, 8483],
 [8473, 3601, 7319, 6783, 8322],
 [8473, 3601, 7319, 6783, 7921],
 [6584, 3596, 9421, 1095, 3357],
 [6814, 1296, 7995, 6783]]

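The one_hot() function from tensorflow.keras.preprocessing.text splits each sentence into words and maps every word to an integer index. Within a run the same word always maps to the same index (note that "the" is 3596 in every sentence above), but the mapping is a hash, so the indices are not guaranteed to be unique: two different words can collide on the same integer.

It does not perform traditional one-hot encoding (where each word is represented as a sparse vector whose size equals the vocabulary size). Instead, it hashes each word to an integer between 1 and voc_size - 1. By default this relies on Python's built-in hash(), which is not stable across interpreter sessions, so the exact numbers may differ from run to run.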

The output will be a list of lists where each inner list contains integer values representing words.
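
To make the hashing behaviour concrete, here is a minimal pure-Python sketch of the same idea. It is an approximation under stated assumptions, not the Keras source: `stable_hash` and `hashed_indices` are hypothetical helper names, MD5 is swapped in for Python's `hash()` so the result is repeatable, and punctuation filtering is left out.

```python
import hashlib

def stable_hash(word):
    """Deterministic stand-in for Python's built-in hash() (assumption: MD5)."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)

def hashed_indices(sentence, n):
    """Map each lower-cased, whitespace-separated word to an index in [1, n - 1]."""
    return [stable_hash(w) % (n - 1) + 1 for w in sentence.lower().split()]

voc_size = 10000
a = hashed_indices("the glass of milk", voc_size)
b = hashed_indices("the glass of juice", voc_size)

# The three shared words get the same indices in both sentences.
print(a[:3] == b[:3])  # True

# With a tiny index space, different words are forced to share indices.
tiny = hashed_indices("understand the meaning of words", 3)
print(tiny)  # five words, but only the values 1 and 2 are possible
```

A larger voc_size makes collisions less likely but does not rule them out.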

In [5]:
from tensorflow.keras.utils import pad_sequences

max_sentence_length = 8
embedded_doc = pad_sequences(one_hot_repr,padding="pre",maxlen=max_sentence_length)
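
What padding="pre" does to the integer lists can be sketched in plain Python. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation; it assumes zeros are added on the left and that overly long sequences keep their last maxlen items.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad every sequence with `value` up to `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # if too long, keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre(
    [[3596, 5823, 1095, 5957], [8473, 3601, 7319, 6783, 8322]],
    maxlen=8,
)
print(rows[0])  # [0, 0, 0, 0, 3596, 5823, 1095, 5957]
print(rows[1])  # [0, 0, 0, 8473, 3601, 7319, 6783, 8322]
```

Every row ends up with the same length, which is what allows the sentences to be stacked into one 2-D array for the model.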

In [6]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential

In [None]:
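dim = 10  # size of each word vector

model = Sequential()
# input_length is kept from the original code; newer Keras releases warn that it is deprecated.
model.add(Embedding(voc_size, dim, input_length=max_sentence_length))
model.compile('adam', 'mse')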



In [8]:
model.summary()

In [11]:
model.predict(embedded_doc)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 18ms/step


array([[[-0.04162744,  0.00332686,  0.04613862, -0.00371896,
          0.0237614 , -0.00667145, -0.00117655, -0.02491373,
          0.01515737,  0.00770762],
        [-0.04162744,  0.00332686,  0.04613862, -0.00371896,
          0.0237614 , -0.00667145, -0.00117655, -0.02491373,
          0.01515737,  0.00770762],
        [-0.04162744,  0.00332686,  0.04613862, -0.00371896,
          0.0237614 , -0.00667145, -0.00117655, -0.02491373,
          0.01515737,  0.00770762],
        [-0.04162744,  0.00332686,  0.04613862, -0.00371896,
          0.0237614 , -0.00667145, -0.00117655, -0.02491373,
          0.01515737,  0.00770762],
        [-0.04967687,  0.02662567,  0.01981434, -0.01477013,
         -0.01244587, -0.01182693,  0.02183693,  0.02754739,
         -0.02321283,  0.00949694],
        [ 0.00925181, -0.02408495, -0.04708297, -0.04156034,
         -0.04287237, -0.02433472,  0.01309354,  0.03424022,
          0.00653262,  0.02684298],
        [-0.0465182 , -0.04596427,  0.02495554, -0.0

array([   0,    0,    0,    0, 3596, 5823, 1095, 5957], dtype=int32)

In [15]:
model.predict(embedded_doc[0])

ValueError: Exception encountered when calling Sequential.call().

Cannot take the length of shape with unknown rank.

Arguments received by Sequential.call():
  • inputs=tf.Tensor(shape=<unknown>, dtype=int32)
  • training=False
  • mask=None

## Are we using One-Hot Encoding or Word2Vec to generate embeddings?

We are NOT using Word2Vec.

We are NOT using traditional one-hot encoding (which would generate sparse vectors).

We are using integer encoding (via one_hot()) followed by an Embedding layer.

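The Embedding layer starts from random weights and, similar to Word2Vec, learns its word vectors during training. The model in this notebook is never trained (fit() is never called), so the vectors returned by predict() above are still the random initial values.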
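
Conceptually, the Embedding layer is a lookup table: a weight matrix with one row per index, from which the rows matching the input integers are selected. The NumPy sketch below imitates that lookup. The random matrix is only a stand-in for the layer's real weights, and the uniform range of ±0.05 is an assumption chosen to resemble the magnitudes in the predict() output.

```python
import numpy as np

rng = np.random.default_rng(0)
voc_size, dim = 10000, 10

# Stand-in for the layer's weight matrix: one row of `dim` values per index.
weights = rng.uniform(-0.05, 0.05, size=(voc_size, dim))

padded = np.array([[0, 0, 0, 0, 3596, 5823, 1095, 5957]])
vectors = weights[padded]  # integer indexing picks one row per token

print(vectors.shape)  # (1, 8, 10)

# All four padding positions point at row 0, so their vectors are identical.
print(np.array_equal(vectors[0, 0], vectors[0, 3]))  # True
```

This would explain why the first four vectors in the predict() output are the same: they all correspond to the padding index 0.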

### Summary
one_hot() does NOT create one-hot vectors; it hashes each word to an integer index (distinct words can occasionally share an index).

Embedding(voc_size, dim) learns dense vector representations (embeddings) for words.

This is not pre-trained Word2Vec; it's a trainable embedding layer initialized randomly.

If you wanted to use pre-trained embeddings, you would need to load pre-trained vectors such as GloVe or Word2Vec, initialize the Embedding layer's weights with them, and set trainable=False so they are not updated during training.
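
A rough sketch of the first step, building the weight matrix, is shown below. Everything in it is illustrative: the `pretrained` dictionary holds made-up 4-dimensional vectors rather than real GloVe or Word2Vec values, and `word_index` is a hypothetical stable word-to-index mapping (the hashed indices from one_hot() would not be a good fit, since they can collide).

```python
import numpy as np

dim = 4

# Made-up "pre-trained" vectors; real ones would be read from a GloVe or Word2Vec file.
pretrained = {
    "glass": np.array([0.1, 0.2, 0.3, 0.4]),
    "milk": np.array([0.5, 0.6, 0.7, 0.8]),
}

# Hypothetical word-to-index mapping; index 0 is reserved for padding.
word_index = {"glass": 1, "milk": 2, "juice": 3}

embedding_matrix = np.zeros((len(word_index) + 1, dim))
for word, idx in word_index.items():
    vec = pretrained.get(word)
    if vec is not None:  # words without a pre-trained vector stay all-zero
        embedding_matrix[idx] = vec

print(embedding_matrix.shape)  # (4, 4)

# The matrix could then seed a frozen layer, for example:
# Embedding(embedding_matrix.shape[0], dim,
#           embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
#           trainable=False)
```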


## Final Answer
one_hot() gives integer encoding, which is NOT a true vector representation.

Word embeddings are necessary because the integers are arbitrary labels: their numeric values carry no meaning and cannot capture relationships between words.

Once trained on a task, the Embedding layer holds dense vector representations in which words used in similar ways tend to end up close to each other in the embedding space.
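
"Close to each other" is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not taken from the model above, whose weights are untrained.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical trained vectors, chosen by hand for this example.
milk = np.array([0.9, 0.1, 0.0])
juice = np.array([0.8, 0.2, 0.1])
developer = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(milk, juice) > cosine_similarity(milk, developer))  # True
```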

## Key Takeaway
💡 One-hot encoding and integer encoding are different. Integer encoding is just an index, whereas word embeddings provide meaningful, dense representations.
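
The difference can be shown on a toy vocabulary of five words. In this sketch the weight matrix is filled with easy-to-read numbers instead of learned values; it illustrates that looking up a row by integer index gives the same result as multiplying a true one-hot vector by the matrix, only without building the sparse vector.

```python
import numpy as np

vocab_size, dim = 5, 3

# Toy weight matrix with readable values: row i is [3i, 3i + 1, 3i + 2].
weights = np.arange(vocab_size * dim, dtype=float).reshape(vocab_size, dim)

index = 2                           # integer encoding: just a position
one_hot_vec = np.zeros(vocab_size)  # true one-hot encoding: a sparse vector
one_hot_vec[index] = 1.0

print(one_hot_vec)            # [0. 0. 1. 0. 0.]
print(weights[index])         # [6. 7. 8.]
print(one_hot_vec @ weights)  # [6. 7. 8.]
```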

