# Word Embedding

Word Embedding converts words into meaningful dense vectors so similar words have similar numerical values.

# In simple (layman) language:

It gives each word a smart number code so that words with similar meaning get similar codes.

# Types of Word Embedding

1️⃣ Static Word Embeddings

Each word has one fixed vector, no matter the sentence.

# Popular models:

Word2Vec

GloVe

FastText

2️⃣ Contextual Word Embeddings

Word vector changes depending on the sentence (context).

# Popular models:

BERT

Other Transformer-based models (e.g. GPT)



In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding


In [12]:
# Sample text
texts = ["I love AI", "AI is powerful"]

In [13]:
# Step 1: Tokenization (words → numbers)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)


In [14]:
# Step 2: Padding (same length)
padded = pad_sequences(sequences, maxlen=4, padding='post')

In [15]:
# Vocabulary size (+1 for padding)
vocab_size = len(tokenizer.word_index) + 1


In [16]:
# Step 3: Word Embedding Layer
model = Sequential()
# input_length is omitted: it is deprecated in Keras 3, and the layer takes the length from the input
model.add(Embedding(input_dim=vocab_size, output_dim=8))




In [17]:
embedded_output = model.predict(padded)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 216ms/step


In [18]:
embedded_output.shape

(2, 4, 8)
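
The shape `(2, 4, 8)` reads as 2 sentences, 4 positions per sentence, 8 numbers per position. As a rough sketch of what an embedding layer does internally, the plain-Python lookup below replaces each word index with a row from a table. The `table` values are made-up random numbers, not the weights Keras produced above.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

vocab_size = 6   # 5 words + index 0 reserved for padding
embed_dim = 8    # same as output_dim in the cell above

# Hypothetical lookup table: one row of 8 small random numbers per word index
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

# Padded word indices for "I love AI" and "AI is powerful"
padded = [[2, 3, 1, 0], [1, 4, 5, 0]]

# Embedding = replace every index with its row from the table
embedded = [[table[idx] for idx in sentence] for sentence in padded]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 4, 8)

# "ai" has index 1 in both sentences, so it gets the same vector both times
print(embedded[0][2] == embedded[1][0])  # True
```

Because the layer is only a lookup, the vectors stay random until the model is trained on a task.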

# Word2Vec

Word2Vec is a word embedding technique that converts words into meaningful numerical vectors based on the contexts in which they appear.

# Simple Idea (Layman)

It learns word meaning by looking at surrounding words.

# Example:

king and queen → similar vectors

doctor and nurse → similar vectors

🔁 Two Types of Word2Vec

1️⃣ CBOW (Continuous Bag of Words)
Predicts the center word from its surrounding words.

2️⃣ Skip-gram
Predicts the surrounding words from the center word.
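
To make the difference concrete, here is a minimal sketch of how training examples could be built for each type from one sentence. The helper `training_pairs` and the `window` parameter are illustrative names, not part of any library, and real Word2Vec training adds more steps (such as negative sampling) that are left out here.

```python
def training_pairs(tokens, window=1):
    """Return (cbow_pairs, skipgram_pairs) for one tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Words within `window` positions on either side of the target
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        # CBOW: all context words together -> predict the target word
        cbow.append((context, target))
        # Skip-gram: the target word -> predict each context word separately
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


cbow_pairs, skipgram_pairs = training_pairs(["i", "love", "ai"], window=1)

print(cbow_pairs)
# [(['love'], 'i'), (['i', 'ai'], 'love'), (['love'], 'ai')]

print(skipgram_pairs)
# [('i', 'love'), ('love', 'i'), ('love', 'ai'), ('ai', 'love')]
```

CBOW produces one example per word, while Skip-gram produces one example per (word, neighbour) pair.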

# 🎯 Why It Is Important

Captures meaning

Understands similarity

Small dense vectors (better than One-Hot)

# Core

Word2Vec learns word meaning from context and gives similar words similar numerical vectors.

# ❌ Main Problems of Word2Vec

1️⃣ No context awareness
Same word = same vector, in every sentence

2️⃣ Fails with polysemy (multiple meanings)
One word, many meanings → all meanings share one vector

3️⃣ Needs large data
Small dataset → poor embeddings

4️⃣ Out-of-Vocabulary (OOV) problem
Words not seen during training have no vector

# Word2Vec falls short because it cannot use context and gives one fixed vector to each word.
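
Problems 1, 2 and 4 can be seen with a toy lookup table. This is only a sketch: `static_vectors` holds invented 3-number vectors, and `embed` is a hypothetical helper, but a trained static model behaves the same way at lookup time.

```python
# Hypothetical 3-dimensional vectors; real Word2Vec vectors are learned from data
static_vectors = {
    "river": [0.1, 0.8, 0.3],
    "bank":  [0.5, 0.4, 0.6],
    "money": [0.9, 0.1, 0.7],
}


def embed(sentence):
    """Look up each word; unknown (OOV) words have no vector, so they map to None."""
    return [static_vectors.get(word) for word in sentence.lower().split()]


river_sentence = embed("river bank")
money_sentence = embed("money bank")

# Problems 1 and 2: "bank" gets the identical vector in both sentences
print(river_sentence[1] == money_sentence[1])  # True

# Problem 4: "cryptocurrency" was never seen, so there is nothing to look up
print(embed("cryptocurrency bank"))  # [None, [0.5, 0.4, 0.6]]
```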



# Padding

Padding is adding extra zeros to make all sentences the same length.

# 🧠 Why We Need Padding?

Neural networks need inputs of the same size.
But sentences have different lengths.

# Example:

“I love AI” (3 words)

“AI is very powerful” (4 words)

Lengths are different ❌
The model cannot process them together as one batch.

# After Padding (make same length = 5)

"I love AI"            → [I, love, AI, 0, 0]

"AI is very powerful"  → [AI, is, very, powerful, 0]
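
The idea can be sketched in a few lines of plain Python, using word indices in place of words. The function `pad` is a hypothetical helper written for illustration, and the indices are example values. Unlike this sketch, Keras's `pad_sequences` cuts over-long sequences from the front by default (`truncating='pre'`).

```python
def pad(sequence, maxlen, padding="post", value=0):
    """Pad (or cut) one list of word indices to exactly `maxlen` items."""
    if len(sequence) >= maxlen:
        # Too long: keep the first maxlen items (a simplification)
        return sequence[:maxlen]
    zeros = [value] * (maxlen - len(sequence))
    # 'post' puts the zeros at the end, anything else puts them at the start
    return sequence + zeros if padding == "post" else zeros + sequence


print(pad([2, 3, 1], maxlen=5))                 # [2, 3, 1, 0, 0]
print(pad([1, 4, 5, 6], maxlen=5))              # [1, 4, 5, 6, 0]
print(pad([2, 3, 1], maxlen=5, padding="pre"))  # [0, 0, 2, 3, 1]
```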

In [7]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample sentences
texts = ["I love AI", "AI is very powerful"]

In [8]:
# Step 1: Convert words to numbers (tokenization)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

In [9]:
# Step 2: Apply Padding
padded = pad_sequences(sequences, maxlen=5, padding='post')

# maxlen=5 → fixed length of every sequence

# padding='post' → add zeros at the end

# padding='pre' → add zeros at the beginning (the default)


In [10]:
padded

array([[2, 3, 1, 0, 0],
       [1, 4, 5, 6, 0]], dtype=int32)

# Tokenization (Code)




In [19]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [20]:
# Sample sentences
texts = ["I love AI", "AI is powerful"]

In [21]:
# Create tokenizer
tokenizer = Tokenizer()

In [22]:
# Fit on text (learn vocabulary)
tokenizer.fit_on_texts(texts)


In [23]:
# Convert text to sequences (words → numbers)
sequences = tokenizer.texts_to_sequences(texts)


In [26]:
tokenizer.word_index

{'ai': 1, 'i': 2, 'love': 3, 'is': 4, 'powerful': 5}

In [27]:
sequences

[[2, 3, 1], [1, 4, 5]]
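
To see where these numbers come from, the sketch below rebuilds the same word index by hand: lowercase the text, count the words, and give the most frequent word index 1. It is a simplified stand-in for the Keras `Tokenizer` (for instance, it does not strip punctuation).

```python
from collections import Counter

texts = ["I love AI", "AI is powerful"]

# Lowercase and split on spaces (a simplification: no punctuation handling)
tokenized = [text.lower().split() for text in texts]

# Count how often each word appears across all sentences
counts = Counter(word for sentence in tokenized for word in sentence)

# Most frequent word gets index 1; index 0 stays free for padding.
# sorted() is stable, so words with equal counts keep first-seen order.
ordered = sorted(counts, key=lambda word: -counts[word])
word_index = {word: i for i, word in enumerate(ordered, start=1)}

# Replace every word with its index
sequences = [[word_index[word] for word in sentence] for sentence in tokenized]

print(word_index)  # {'ai': 1, 'i': 2, 'love': 3, 'is': 4, 'powerful': 5}
print(sequences)   # [[2, 3, 1], [1, 4, 5]]
```

"ai" appears twice, so it gets index 1; the other words appear once and are numbered in the order they were first seen.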

# Sequential Modeling (Very Easy)

Sequential modeling means processing data step by step in order, especially for sequence data like text, time series, or speech.

Some data comes in order (sequence), and the order matters.
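
A tiny example of why order matters: the two sentences below use exactly the same words, so a representation that only counts words cannot tell them apart, while one that keeps the order can. This is an illustrative sketch in plain Python, not a model.

```python
from collections import Counter

a = "dog bites man".split()
b = "man bites dog".split()

# Bag of words: only the counts are kept, the order is thrown away
same_bag = Counter(a) == Counter(b)
print(same_bag)  # True  -> the two sentences look identical

# Sequence: the order is kept
same_sequence = a == b
print(same_sequence)  # False -> the two sentences are different
```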



In [28]:
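from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Note: Keras `Sequential` is a linear stack of layers, applied one after another.
# The name does not mean the model understands order in the data; for text or time
# series you would add sequence-aware layers such as Embedding and LSTM.
model = Sequential()
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))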

# Vector Embedding

A vector embedding is the numerical representation of words (or other data) as dense vectors that capture meaning and similarity.

It is like giving each word a smart number list that represents its meaning.

# Example:

king   → [0.25, 0.10, 0.90]  
queen  → [0.27, 0.12, 0.88]  
apple  → [0.90, 0.05, 0.10]

Here:

king and queen → similar vectors (similar meaning ✔)

apple → different vector (different meaning ❌)
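
"Similar" can be measured with cosine similarity, which is close to 1 when two vectors point in the same direction. The sketch below applies it to the three example vectors above; the function `cosine_similarity` is written by hand here for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


king = [0.25, 0.10, 0.90]
queen = [0.27, 0.12, 0.88]
apple = [0.90, 0.05, 0.10]

king_queen = cosine_similarity(king, queen)
king_apple = cosine_similarity(king, apple)

print(round(king_queen, 3))  # very close to 1 -> similar
print(round(king_apple, 3))  # much lower      -> not similar
```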

# Why We Use Vector Embeddings

To convert text into numbers

To capture meaning of words

To find similarity between words

Used in NLP, chatbots, search, translation

# 🔄 Difference from One-Hot Encoding

One-Hot → large, mostly zeros, no meaning

Embedding → small, dense, meaningful vectors
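
The two are closely related: multiplying a one-hot vector by an embedding table simply picks out one row of the table. The sketch below shows this in plain Python. The vocabulary order and the 2-number `embedding_table` values are assumptions made up for the example.

```python
vocab = ["<pad>", "ai", "i", "love", "is", "powerful"]


def one_hot(index, size):
    """A vector of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


# Hypothetical 2-dimensional embedding table, one row per vocabulary entry
embedding_table = [
    [0.0, 0.0],  # <pad>
    [0.9, 0.1],  # ai
    [0.2, 0.7],  # i
    [0.3, 0.8],  # love
    [0.5, 0.5],  # is
    [0.8, 0.2],  # powerful
]

love = vocab.index("love")
hot = one_hot(love, len(vocab))
print(hot)  # [0, 0, 0, 1, 0, 0] -> as long as the vocabulary, mostly zeros

# One-hot vector times the table: every row is multiplied by 0 except row `love`
selected = [sum(h * row[d] for h, row in zip(hot, embedding_table))
            for d in range(2)]
print(selected == embedding_table[love])  # True -> same as a direct row lookup
```

This is why an embedding layer stores only the table and looks rows up by index instead of building one-hot vectors.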

In [29]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

# Sample sentences
texts = ["I love AI", "AI is powerful"]

# 1️⃣ Tokenization (words → numbers)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# 2️⃣ Padding (same length)
padded = pad_sequences(sequences, maxlen=4, padding='post')

# Vocabulary size (+1 because index 0 is reserved for padding)
vocab_size = len(tokenizer.word_index) + 1

# 3️⃣ Create Embedding Model
# input_length is omitted: it is deprecated in Keras 3, and the layer takes the length from the input
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=5))

# 4️⃣ Get Embedded Vectors (untrained, so the values are random)
embedded_output = model.predict(padded)

print("Padded Input:\n", padded)
print("Embedding Shape:", embedded_output.shape)
print("Embedding Output:\n", embedded_output)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 127ms/step
Padded Input:
 [[2 3 1 0]
 [1 4 5 0]]
Embedding Shape: (2, 4, 5)
Embedding Output:
 [[[ 0.02852701  0.02125248  0.01269034 -0.02057981  0.02039136]
  [-0.03472532 -0.01418555 -0.03664702 -0.0132991   0.04659894]
  [-0.03814046 -0.00317582  0.00780234  0.01866535 -0.01538825]
  [ 0.03948298  0.04965678 -0.01951586  0.03381972  0.01196776]]

 [[-0.03814046 -0.00317582  0.00780234  0.01866535 -0.01538825]
  [-0.01458217  0.02955754  0.00091927  0.01096587  0.00520628]
  [-0.01233997  0.01776362  0.03976815  0.02537883  0.01008726]
  [ 0.03948298  0.04965678 -0.01951586  0.03381972  0.01196776]]]




Converts words → numbers

Makes equal length using padding

Converts numbers → dense vectors