# **Word Embedding and Feature Representation**
---
## **Introduction to Word Embedding**
- **Word Embedding** is a technique used in Natural Language Processing (NLP) to convert words into numeric vectors.
- This conversion is essential for feeding text into neural networks. Just as a network can contain Dense layers, it can contain an **Embedding Layer** dedicated to mapping words to vectors.
- The embedding layer learns a vector for each input word, and the layers that follow operate on those vectors instead of on raw text.

## **Feature Representation**
- **Feature Representation** refers to the vectorized form of words created by embedding techniques.
- Each word in a dataset is converted into a vector of a certain dimension, representing various features or attributes of the word.
  
## **One-Hot Encoding**
- **One-Hot Encoding** is an early and simple method to represent words as vectors.
- In this technique, each word in the vocabulary is represented by a binary vector with a dimension equal to the vocabulary size. 
- For example, if the vocabulary size is 10,000, each word is represented as a vector of length 10,000 with a single `1` at the index corresponding to the word, and all other positions are `0`.
  
  **Example:**
  - For the word "man," if it appears at index 5000 in the vocabulary, the vector is all zeros except for a `1` at position 5000: `[0, 0, ..., 0, 1, 0, ..., 0]`.
  - For the word "boy," if it appears at index 2000, the `1` sits at position 2000 instead: `[0, ..., 0, 1, 0, ..., 0, 0]`.

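- **Sparse Matrix Problem**: One-Hot Encoding produces sparse matrices, since all but one value in each vector is zero. Such high-dimensional inputs waste memory and require many weights, which can encourage overfitting on small datasets, and the vectors themselves encode no relationships between words.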

### **Limitations of One-Hot Encoding**
- **Inefficiency**: The resulting vectors are high-dimensional (equal to the vocabulary size), leading to large, sparse matrices.
- **No Semantic Information**: One-hot vectors do not capture any semantic relationships between words. For instance, the words "man" and "boy" might be semantically similar, but their one-hot representations do not reflect this.
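
Both limitations can be seen in a small NumPy sketch. It reuses the vocabulary size and the indices for "man" and "boy" from the example above; the helper name `one_hot_vector` is introduced here purely for illustration.

```python
import numpy as np

vocab_size = 10_000

def one_hot_vector(index, size):
    """Return a length-`size` vector of zeros with a single 1 at `index`."""
    vec = np.zeros(size, dtype=np.int8)
    vec[index] = 1
    return vec

# Indices taken from the example above: "man" -> 5000, "boy" -> 2000.
man = one_hot_vector(5000, vocab_size)
boy = one_hot_vector(2000, vocab_size)

print(man.shape)        # (10000,): one dimension per vocabulary word
print(int(man.sum()))   # 1: every other entry is zero
print(int(man @ boy))   # 0: two different words are always orthogonal
```

Because the dot product between any two distinct one-hot vectors is zero, "man" looks exactly as unrelated to "boy" as it does to any other word.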

## **Word Embedding: A Solution**
- **Word Embedding** addresses the limitations of one-hot encoding by creating dense vectors that capture semantic relationships between words.
- Unlike one-hot encoding, where the vector length equals the vocabulary size, word embeddings use a fixed, lower-dimensional space to represent words.

  **Example:**
  - The words "man" and "boy" might have vectors that are close to each other in the embedding space, reflecting their semantic similarity.
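
"Close to each other" is usually measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors that are purely hypothetical (real embeddings are learned and have far more dimensions); it only illustrates how the comparison works.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand for illustration only.
embeddings = {
    "man":   [0.90, 0.10, 0.20],
    "boy":   [0.80, 0.20, 0.10],
    "juice": [0.05, 0.90, 0.70],
}

sim_man_boy = cosine_similarity(embeddings["man"], embeddings["boy"])
sim_man_juice = cosine_similarity(embeddings["man"], embeddings["juice"])
print(sim_man_boy > sim_man_juice)  # True for these hand-picked vectors
```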

## **Word2Vec**
- **Word2Vec** is a popular word embedding technique developed by Google.
- **Word2Vec** works by training a neural network on a large corpus of text to predict words based on their context. This results in vectors where semantically similar words have similar representations.
- **Types of Word2Vec**:
  - **Skip-gram**: Trains a model to predict surrounding words given a central word.
  - **Continuous Bag of Words (CBOW)**: Predicts a central word based on its surrounding context.
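
The difference between the two variants is easiest to see in the training examples each one is built from. The sketch below only generates those examples from a token list; it does not train a network, and the function names and the `window` parameter are illustrative choices rather than part of any library.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs: Skip-gram predicts context from center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=1):
    """Return (context_words, center) examples: CBOW predicts center from context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = "the glass of milk".split()
print(skipgram_pairs(tokens))
print(cbow_examples(tokens))
```

With a window of 1, Skip-gram turns "glass" into two separate examples, `("glass", "the")` and `("glass", "of")`, while CBOW produces the single example `(["the", "of"], "glass")`.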

## **Feature Representation in Word Embedding**
- In word embedding, each word is represented by a dense vector of fixed dimensions (e.g., 300 dimensions).
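- **Feature Representation**: Each dimension of the vector can be thought of as a feature, such as **Gender**, **Royalty**, **Age**, or **Food**, and the value in that dimension expresses how strongly the word relates to it. (In trained embeddings the dimensions are learned and are not labelled this neatly; the named features are an intuition aid.)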

  **Example:**
  - For the word "boy":
    - **Gender**: -1 (representing masculine)
    - **Royalty**: 0.01 (low association with royalty)
  - For the word "King":
    - **Gender**: -0.92 (strongly masculine, using the same sign convention as "boy")
    - **Royalty**: 0.95 (high association with royalty)

- The resulting vector captures the word's relationships across these features, creating a more meaningful representation.

## **Training Word Embeddings**
- The word embedding vectors are learned through training on large text corpora.
- Techniques like **Word2Vec** and **GloVe** (Global Vectors for Word Representation) are commonly used for training embeddings.
- **GloVe**, developed at Stanford, is an alternative to Word2Vec: rather than predicting words from local context windows, it fits the vectors to global word co-occurrence statistics gathered over the whole corpus.
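
The "global statistics" GloVe relies on are word co-occurrence counts. The sketch below covers only that counting step on three short sentences, using plain unweighted counts within a window; it is a simplification and does not include GloVe's actual model fitting. The function name and the `window` parameter are assumptions made for this example.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["the glass of milk", "the glass of juice", "the cup of tea"]
counts = cooccurrence_counts(corpus)
print(counts[("glass", "of")])  # 2: the pair occurs in the first two sentences
print(counts[("the", "of")])    # 3: the pair occurs in all three sentences
```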

## **Parameters to Consider**
1. **Vocabulary Size**: The total number of unique words in the dataset.
   - Example: 10,000 words.
2. **Feature Dimension**: The dimensionality of the embedding space.
   - Commonly used dimensions: 100, 200, 300.
   - Widely used pretrained Word2Vec and GloVe vectors are 300-dimensional.
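
Together these two parameters fix the size of the embedding matrix: one row per vocabulary word and one column per feature dimension. The NumPy sketch below uses a randomly initialised matrix as a stand-in for learned weights and shows that looking up a row gives the same result as multiplying a one-hot vector by the matrix. The variable names are illustrative.

```python
import numpy as np

vocab_size = 10_000     # vocabulary size from the example above
embedding_dim = 300     # one of the commonly used dimensions

rng = np.random.default_rng(0)
# Random stand-in for a learned embedding matrix.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim)).astype(np.float32)

print(embedding_matrix.size)  # 3000000 values to learn (vocab_size * embedding_dim)

word_index = 5000
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[word_index] = 1.0

via_matmul = one_hot @ embedding_matrix      # one-hot vector times the matrix
via_lookup = embedding_matrix[word_index]    # direct row lookup
print(np.allclose(via_matmul, via_lookup))   # True: both select the same row
```

This equivalence is why an embedding layer can be implemented as a table lookup instead of a large matrix multiplication.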

## **Practical Implementation**
- Embedding layers using these word embedding techniques are integrated into neural networks.
- The next step involves applying these embeddings in practical tasks, such as training a **Simple RNN** or other deep learning models, to improve performance and handle textual data effectively.

In [1]:
!pip install tensorflow

Collecting keras<2.16,>=2.15.0 (from tensorflow)
  Downloading keras-2.15.0-py3-none-any.whl.metadata (2.4 kB)
Downloading keras-2.15.0-py3-none-any.whl (1.7 MB)
Installing collected packages: keras
  Attempting uninstall: keras
    Found existing installation: keras 3.4.1
    Uninstalling keras-3.4.1:
      Successfully uninstalled keras-3.4.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
Successfully installed keras-2.15.0


In [2]:
!pip install keras



In [3]:
from tensorflow.keras.preprocessing.text import one_hot

2024-08-22 08:23:18.756968: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-22 08:23:18.757114: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-22 08:23:18.928797: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [4]:
### sentences
sent = [
    'the glass of milk',
    'the glass of juice',
    'the cup of tea',
    'I am a good boy',
    'I am a good developer',
    'understand the meaning of words',
    'your videos are good',
]

sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

In [5]:
sent[0:3]

['the glass of milk', 'the glass of juice', 'the cup of tea']

In [6]:
## Define the vocabulary size
voc_size=10000

In [7]:
for words in sent:
    print(words)

the glass of milk
the glass of juice
the cup of tea
I am a good boy
I am a good developer
understand the meaning of words
your videos are good


In [8]:
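# Integer (index) representation: despite its name, one_hot() does not build
# one-hot vectors; it hashes each word to an index in the range [1, voc_size).
one_hot_repr = [one_hot(words, voc_size) for words in sent]
one_hot_repr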

[[8441, 4969, 9646, 3453],
 [8441, 4969, 9646, 774],
 [8441, 5972, 9646, 7629],
 [9360, 8974, 175, 6288, 3316],
 [9360, 8974, 175, 6288, 8527],
 [2944, 8441, 9895, 9646, 2079],
 [7838, 7387, 7107, 6288]]
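
In the output above, the same word always maps to the same index ("the" is 8441 and "of" is 9646 in every sentence), which is what a hashing scheme would produce. The sketch below is a rough pure-Python approximation, assuming the word is lowercased, hashed, and reduced modulo the vocabulary size. It uses `hashlib.md5` for reproducibility, so its indices will not match the numbers printed above, and `hashed_indices` is a name made up for this example.

```python
import hashlib

def hashed_indices(text, n):
    """Map each word to an integer in [1, n) using a stable hash (md5 here)."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

voc_size = 10000
a = hashed_indices("the glass of milk", voc_size)
b = hashed_indices("the glass of juice", voc_size)
print(a)
print(b)
print(a[:3] == b[:3])  # True: shared words receive identical indices
```

Index 0 is never produced, which leaves it free to act as the padding value. Because this is hashing rather than a lookup in a fitted vocabulary, two different words can occasionally collide on the same index.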

In [9]:
# Word Embedding Representation
from tensorflow.keras.models import Sequential
# Sequential: a model class that stacks layers linearly, so a network can be built one layer at a time.

from tensorflow.keras.layers import Embedding
# Embedding: This layer is used to convert categorical data (like words) into dense vectors of fixed size, which are more suitable for input to a neural network.

# Older import path: from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import pad_sequences
# pad_sequences: This utility function pads sequences of varying lengths to ensure that they all have the same length, making them suitable for processing by neural networks.

import numpy as np
# numpy: A fundamental library for numerical computing in Python, used here for efficient array and matrix operations.

The `pad_sequences` function from tensorflow.keras.utils is commonly used to ensure that all sequences in a dataset have the same length. It pads shorter sequences with zeros (or another specified value) and truncates longer sequences to a specified maximum length. This is particularly useful when working with text data, such as for input to Recurrent Neural Networks (RNNs), where all input sequences need to be of the same length.

In [11]:
sent_len = 8
embedded_docs = pad_sequences(one_hot_repr, padding='pre', maxlen = sent_len)
print(embedded_docs)

[[   0    0    0    0 8441 4969 9646 3453]
 [   0    0    0    0 8441 4969 9646  774]
 [   0    0    0    0 8441 5972 9646 7629]
 [   0    0    0 9360 8974  175 6288 3316]
 [   0    0    0 9360 8974  175 6288 8527]
 [   0    0    0 2944 8441 9895 9646 2079]
 [   0    0    0    0 7838 7387 7107 6288]]
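
What `padding='pre'` does to each row can be reproduced in a few lines of plain Python. This is a simplified sketch rather than the library's implementation: the helper `pad_pre` is hypothetical, and it assumes that over-long sequences are cut from the front so the last `maxlen` items are kept.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen`, keeping the last items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep at most the last `maxlen` items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# Two of the index sequences from the output above.
one_hot_repr = [
    [8441, 4969, 9646, 3453],
    [9360, 8974, 175, 6288, 3316],
]
for row in pad_pre(one_hot_repr, maxlen=8):
    print(row)
# [0, 0, 0, 0, 8441, 4969, 9646, 3453]
# [0, 0, 0, 9360, 8974, 175, 6288, 3316]
```

Padding at the front keeps the real words at the end of the sequence, closest to the final time step of a recurrent model.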


In [12]:
# Feature Representation

dim = 10

In [14]:
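model = Sequential()
# Embedding(input_dim, output_dim): a trainable lookup table with voc_size rows and dim columns.
# input_length is accepted by Keras 2.x (installed above); Keras 3 deprecates it.
model.add(Embedding(voc_size, dim, input_length=sent_len))
# Optimizer and loss are placeholders; the model is not trained in this notebook.
model.compile("adam", "mse")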

[Embedding Layer docs](https://keras.io/api/layers/core_layers/embedding/)

In [15]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 8, 10)             100000    
                                                                 
Total params: 100000 (390.62 KB)
Trainable params: 100000 (390.62 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
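
The parameter count and output shape in the summary follow directly from treating the layer as a lookup table. The NumPy sketch below mimics that behaviour with a random weight matrix. The uniform range of ±0.05 is an assumption chosen to resemble the small values the untrained layer prints, and `weights` is only a stand-in for the layer's real weights.

```python
import numpy as np

voc_size, dim, sent_len = 10000, 10, 8

rng = np.random.default_rng(42)
# Stand-in for the layer's weight matrix (one row per vocabulary index).
weights = rng.uniform(-0.05, 0.05, size=(voc_size, dim))

# Two padded rows taken from embedded_docs above.
batch = np.array([
    [0, 0, 0, 0, 8441, 4969, 9646, 3453],
    [0, 0, 0, 9360, 8974, 175, 6288, 3316],
])

output = weights[batch]   # integer indexing looks up one row per position
print(weights.size)       # 100000, matching "Param #" in the summary
print(output.shape)       # (2, 8, 10): (sentences, sent_len, dim)
print(np.array_equal(output[0, 0], output[0, 3]))  # True: padding index 0 shares one row
```

Every padded position holds index 0 and therefore receives the same vector, which is why the first few rows of each prediction are identical.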


In [16]:
model.predict(embedded_docs)



array([[[-0.01924653, -0.03774593,  0.00947177,  0.01011114,
         -0.03010894, -0.02946844,  0.03806892,  0.00112319,
          0.02736178,  0.00875439],
        [-0.01924653, -0.03774593,  0.00947177,  0.01011114,
         -0.03010894, -0.02946844,  0.03806892,  0.00112319,
          0.02736178,  0.00875439],
        [-0.01924653, -0.03774593,  0.00947177,  0.01011114,
         -0.03010894, -0.02946844,  0.03806892,  0.00112319,
          0.02736178,  0.00875439],
        [-0.01924653, -0.03774593,  0.00947177,  0.01011114,
         -0.03010894, -0.02946844,  0.03806892,  0.00112319,
          0.02736178,  0.00875439],
        [-0.03157163, -0.00711571, -0.0286233 ,  0.03488598,
         -0.03657383, -0.01090366,  0.03258919, -0.00120535,
          0.00400615,  0.02875033],
        [-0.02574574, -0.00962896,  0.03425619, -0.00581864,
         -0.02202575,  0.03405695,  0.00866314,  0.02586682,
          0.00402568,  0.03387472],
        [ 0.02461665, -0.00123334, -0.00572918,  0.0

In [17]:
embedded_docs[0]

array([   0,    0,    0,    0, 8441, 4969, 9646, 3453], dtype=int32)

In [18]:
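# embedded_docs[0] is 1-D, so each index is looked up as its own sample; use embedded_docs[:1] for a (1, 8) batch.
model.predict(embedded_docs[0])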



array([[-0.01924653, -0.03774593,  0.00947177,  0.01011114, -0.03010894,
        -0.02946844,  0.03806892,  0.00112319,  0.02736178,  0.00875439],
       [-0.01924653, -0.03774593,  0.00947177,  0.01011114, -0.03010894,
        -0.02946844,  0.03806892,  0.00112319,  0.02736178,  0.00875439],
       [-0.01924653, -0.03774593,  0.00947177,  0.01011114, -0.03010894,
        -0.02946844,  0.03806892,  0.00112319,  0.02736178,  0.00875439],
       [-0.01924653, -0.03774593,  0.00947177,  0.01011114, -0.03010894,
        -0.02946844,  0.03806892,  0.00112319,  0.02736178,  0.00875439],
       [-0.03157163, -0.00711571, -0.0286233 ,  0.03488598, -0.03657383,
        -0.01090366,  0.03258919, -0.00120535,  0.00400615,  0.02875033],
       [-0.02574574, -0.00962896,  0.03425619, -0.00581864, -0.02202575,
         0.03405695,  0.00866314,  0.02586682,  0.00402568,  0.03387472],
       [ 0.02461665, -0.00123334, -0.00572918,  0.04080024, -0.04559186,
         0.02747586, -0.00947763, -0.04452793