# Text Representation: One-Hot Encoding and Word Embeddings

This Jupyter notebook demonstrates two fundamental techniques for converting text into numerical representations that machine learning models, particularly neural networks, can take as input: one-hot encoding and word embeddings.

## Key Concepts Covered:

1.  **Text Data Preparation**: Defining a list of example sentences to illustrate the process.
2.  **Vocabulary Size Definition**: Establishing a `voc_size` (vocabulary size) which determines the range for unique word representations.
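3.  **One-Hot Encoding (`one_hot`)**:
    * Converting each word in a sentence into an integer index in the range `[1, voc_size - 1]` by hashing the word. Despite its name, Keras's `one_hot` returns integer indices rather than one-hot vectors, and two different words can collide on the same index.
    * This process results in a list of integer sequences, where each integer stands for one word.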
4.  **Padding Sequences (`pad_sequences`)**:
    * Ensuring all integer sequences (sentences) have a uniform length, which is a requirement for neural network inputs.
    * Padding (adding zeros) is applied to shorter sequences to match the `maxlen` (maximum sentence length).
5.  **Word Embedding Representation (`Embedding` layer)**:
    * Introducing TensorFlow/Keras's `Embedding` layer, which learns dense vector representations (embeddings) for words.
    * Unlike one-hot encoding, word embeddings can capture semantic relationships between words once they have been trained, representing each word in a lower-dimensional continuous vector space.
6.  **Building a Simple Embedding Model**:
    * Creating a `Sequential` Keras model containing only an `Embedding` layer.
    * Configuring the `Embedding` layer with `voc_size` (input dimension), `dim` (output dimension/embedding size), and `input_length` (padded sequence length).
    * Compiling the model with a dummy optimizer and loss for demonstration purposes, as the primary goal here is to observe the embedding output.
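7.  **Generating Embeddings**: Using the model's `predict` method to obtain the embedding vectors for the padded input sequences. Because the model is never trained here, these vectors are the layer's randomly initialized weights, not learned embeddings.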

This notebook provides a practical introduction to converting raw text into numerical formats suitable for deep learning, highlighting the transition from sparse one-hot representations to dense word embeddings that become semantically meaningful once trained.
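The link between the two representations can be sketched with plain NumPy: multiplying a one-hot vector by an embedding matrix selects a single row of that matrix, which is why an embedding layer can work from integer indices directly. The toy sizes and the names `toy_vocab_size`, `toy_dim`, and `embedding_matrix` below are illustrative assumptions, not values from this notebook.

```python
import numpy as np

# Hypothetical toy setup: a 5-word vocabulary and 3-dimensional embeddings.
toy_vocab_size = 5
toy_dim = 3

# Stand-in for an embedding matrix: one row per word index.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(toy_vocab_size, toy_dim))

word_index = 2

# Sparse one-hot vector: all zeros except a 1 at the word's index.
one_hot_vector = np.zeros(toy_vocab_size)
one_hot_vector[word_index] = 1.0

# Multiplying the one-hot vector by the matrix selects row `word_index`,
# so a direct row lookup gives the same result without the multiplication.
via_matmul = one_hot_vector @ embedding_matrix
via_lookup = embedding_matrix[word_index]

print(np.allclose(via_matmul, via_lookup))
```

The two results should agree, which is the sense in which an embedding lookup is a compact form of a one-hot encoding followed by a linear layer.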

In [11]:
from tensorflow.keras.preprocessing.text import one_hot

In [10]:
### sentences
# Define a list of example sentences for demonstration
sent = [
    'the glass of milk',
    'the glass of juice',
    'the cup of tea',
    'I am a good boy',
    'I am a good developer',
    'understand the meaning of words',
    'your videos are good',
]

In [12]:
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

In [13]:
## Define the vocabulary size
# Define the size of the vocabulary. This determines the range of integer IDs for one-hot encoding.
# A larger vocabulary size reduces collisions but may not always be necessary for small datasets.
voc_size=10000
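How much a larger `voc_size` helps can be estimated with a birthday-problem calculation. The sketch below assumes each distinct word lands in a bucket uniformly at random, which is an idealization of a real hash function; the helper name `collision_probability` is hypothetical.

```python
def collision_probability(num_words, num_buckets):
    """Probability that at least two of `num_words` distinct words share a bucket,
    assuming each word lands in a bucket uniformly at random."""
    p_no_collision = 1.0
    for i in range(num_words):
        # The (i + 1)-th word must avoid the i buckets already taken.
        p_no_collision *= (num_buckets - i) / num_buckets
    return 1.0 - p_no_collision

# The example sentences contain 19 distinct words after lowercasing.
for buckets in (50, 500, 10000):
    print(buckets, round(collision_probability(19, buckets), 3))
```

Under this assumption, a collision among the 19 words is very likely with 50 buckets and unlikely with 10000, which is the trade-off the comment above describes.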

In [15]:
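### One Hot Representation
# Hash each word of every sentence to an integer index with Keras's `one_hot`.
# Indices fall in [1, voc_size - 1]; two different words can hash to the same index.
# The exact integers can differ between Python sessions because the default hash is not stable across runs.
one_hot_repr = [one_hot(words, voc_size) for words in sent]
one_hot_repr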

[[9984, 1599, 2422, 1287],
 [9984, 1599, 2422, 8590],
 [9984, 6765, 2422, 6548],
 [5044, 2929, 5027, 6590, 2913],
 [5044, 2929, 5027, 6590, 3885],
 [232, 9984, 9164, 2422, 8697],
 [6871, 4875, 8670, 6590]]
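The Keras documentation describes `one_hot` as a wrapper around the hashing trick. A minimal sketch of that idea follows, as an illustration rather than Keras's actual code: the helper names `hashed_index` and `encode_sentence` are hypothetical, MD5 is an assumption chosen so the result is reproducible, and the tokenization is simplified to lowercasing and whitespace splitting. The integers it produces will not match the output above.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1] with a stable hash (sketch)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    # Index 0 is left free so it can serve as the padding value later.
    return int(digest, 16) % (n - 1) + 1

def encode_sentence(sentence, n):
    """Lowercase, split on whitespace, and hash each token."""
    return [hashed_index(token, n) for token in sentence.lower().split()]

demo_sentences = ["the glass of milk", "the glass of juice"]
demo_encoded = [encode_sentence(s, 10000) for s in demo_sentences]
print(demo_encoded)
```

As in the notebook output, the same word always maps to the same integer, so the two sentences share their first three indices and differ only in the last.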

In [5]:
## Word Embedding Representation
# Import necessary Keras layers and utilities for word embeddings
from tensorflow.keras.layers import Embedding
from tensorflow.keras.utils import pad_sequences  # Older versions expose this under tensorflow.keras.preprocessing.sequence
from tensorflow.keras.models import Sequential
import numpy as np  # Used later for np.expand_dims


In [16]:
# Define the desired length for all sequences after padding
sent_length = 8
# Pad the integer sequences produced by `one_hot` to a uniform length
# `padding='pre'` adds zeros to the beginning of shorter sequences.
# `maxlen` specifies the target length.
embedded_docs = pad_sequences(one_hot_repr, padding='pre', maxlen=sent_length)
print(embedded_docs)

[[   0    0    0    0 9984 1599 2422 1287]
 [   0    0    0    0 9984 1599 2422 8590]
 [   0    0    0    0 9984 6765 2422 6548]
 [   0    0    0 5044 2929 5027 6590 2913]
 [   0    0    0 5044 2929 5027 6590 3885]
 [   0    0    0  232 9984 9164 2422 8697]
 [   0    0    0    0 6871 4875 8670 6590]]
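The padding step can be reproduced in a few lines of plain Python. This is a sketch of pre-padding only, assuming `maxlen` is positive; the helper name `pad_pre` is hypothetical and it does not cover the other options of `pad_sequences`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep at most the last `maxlen` items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

example = [[9984, 1599, 2422, 1287], [5044, 2929, 5027, 6590, 2913]]
for row in pad_pre(example, maxlen=8):
    print(row)
```

Each row comes out with length 8, with zeros filling the front of the shorter sentences, matching the layout of the array printed above.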


In [18]:
## feature representation
# Define the dimensionality of the word embeddings (the size of the dense vector for each word)
dim=10

In [None]:
# Initialize a Sequential Keras model
model = Sequential()
# Add an Embedding layer to the model
# `voc_size`: The size of the vocabulary (maximum integer index + 1).
# `dim`: The output dimension of the embedding (size of each word vector).
# `input_length`: The length of input sequences (the `maxlen` used in padding).
#   Note: Keras 3 deprecates `input_length`; there it can simply be omitted.
model.add(Embedding(voc_size, dim, input_length=sent_length))
# Compile the model. For an Embedding layer demonstration, the optimizer and loss
# are not critical as we're primarily interested in the layer's output, not training.
model.compile('adam', 'mse')  # Adam optimizer and mean squared error loss



In [20]:
model.summary()
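What the `Embedding` layer computes in its forward pass can be sketched as a table lookup in NumPy. The weight matrix below is a stand-in with an assumed uniform range, so its numbers will not match the notebook's output; `weights` and `padded_ids` are illustrative names.

```python
import numpy as np

# Assumed values mirroring the notebook: vocabulary size and embedding size.
vocab_size, embed_dim = 10000, 10

# Stand-in for the layer's weight matrix, one row per integer index.
rng = np.random.default_rng(42)
weights = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

# Two padded sentences of length 8 (index 0 is the padding value).
padded_ids = np.array([
    [0, 0, 0, 0, 9984, 1599, 2422, 1287],
    [0, 0, 0, 5044, 2929, 5027, 6590, 2913],
])

# Integer-array indexing replaces every index with its row of `weights`.
vectors = weights[padded_ids]
print(vectors.shape)  # (2, 8, 10)

# All padding positions share row 0, so their vectors are identical.
print(np.array_equal(vectors[0, 0], vectors[0, 3]))
```

This explains why the leading rows of each sentence's output repeat: every padding position maps to the same vector, the one stored for index 0.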

In [32]:
# Use the model to look up the embedding vector of every index in the padded documents.
# The model has not been trained, so these vectors are the layer's random initial weights.
# The output is a 3D array: (number_of_sentences, sequence_length, embedding_dimension)
model.predict(embedded_docs)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step


array([[[ 0.04876981, -0.02584345,  0.01496186, -0.02252179,
         -0.01341717,  0.0327312 , -0.03630286, -0.03725926,
         -0.00585858, -0.01611942],
        [ 0.04876981, -0.02584345,  0.01496186, -0.02252179,
         -0.01341717,  0.0327312 , -0.03630286, -0.03725926,
         -0.00585858, -0.01611942],
        [ 0.04876981, -0.02584345,  0.01496186, -0.02252179,
         -0.01341717,  0.0327312 , -0.03630286, -0.03725926,
         -0.00585858, -0.01611942],
        [ 0.04876981, -0.02584345,  0.01496186, -0.02252179,
         -0.01341717,  0.0327312 , -0.03630286, -0.03725926,
         -0.00585858, -0.01611942],
        [-0.02918469, -0.00735456,  0.00391432, -0.01519922,
         -0.04398937, -0.01802974,  0.00740063,  0.00415504,
         -0.00951606, -0.01654055],
        [ 0.03514184, -0.03198707, -0.03448869, -0.03153343,
          0.03102025, -0.0177858 , -0.03443227, -0.0305138 ,
          0.0246017 , -0.032864  ],
        [ 0.02906818, -0.03571652, -0.03203621,  0.0

In [49]:
# Display the padded integer sequence of the first sentence (the input to the Embedding layer)
[embedded_docs[0]]

[array([   0,    0,    0,    0, 9984, 1599, 2422, 1287], dtype=int32)]

In [52]:
# Generate and display the embedding vectors for the first sentence only.
# `np.expand_dims` adds a batch axis, so the input has shape (1, sequence_length)
# and the output is a 3D array: (1, sequence_length, embedding_dimension)
dims = np.expand_dims(embedded_docs[0], axis=0)
model.predict(dims)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step


array([[[ 0.04876981, -0.02584345,  0.01496186, -0.02252179,
         -0.01341717,  0.0327312 , -0.03630286, -0.03725926,
         -0.00585858, -0.01611942],
        [ 0.04876981, -0.02584345,  0.01496186, -0.02252179,
         -0.01341717,  0.0327312 , -0.03630286, -0.03725926,
         -0.00585858, -0.01611942],
        [ 0.04876981, -0.02584345,  0.01496186, -0.02252179,
         -0.01341717,  0.0327312 , -0.03630286, -0.03725926,
         -0.00585858, -0.01611942],
        [ 0.04876981, -0.02584345,  0.01496186, -0.02252179,
         -0.01341717,  0.0327312 , -0.03630286, -0.03725926,
         -0.00585858, -0.01611942],
        [-0.02918469, -0.00735456,  0.00391432, -0.01519922,
         -0.04398937, -0.01802974,  0.00740063,  0.00415504,
         -0.00951606, -0.01654055],
        [ 0.03514184, -0.03198707, -0.03448869, -0.03153343,
          0.03102025, -0.0177858 , -0.03443227, -0.0305138 ,
          0.0246017 , -0.032864  ],
        [ 0.02906818, -0.03571652, -0.03203621,  0.0