# Word Embedding and CBoW (Continuous Bag of Words)

This notebook walks step by step through building a vocabulary, creating context-target pairs, and outlining the CBoW architecture and its training, using a single sample sentence.


## Step 1: Select a Text

Sample: "AI and Data Science have become important and popular topics in Cambodia today."


## Step 2: Tokenize the Text

We split the text into individual tokens using NLTK's `word_tokenize`. Punctuation, such as the final period, becomes a token of its own.


In [1]:
import nltk
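# Uncomment on first run to download the tokenizer data that word_tokenize needs
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').
#nltk.download('punkt')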

sample_text = "AI and Data Science have become important and popular topics in Cambodia today."
tokens = nltk.word_tokenize(sample_text)
print("Tokens:", tokens)

Tokens: ['AI', 'and', 'Data', 'Science', 'have', 'become', 'important', 'and', 'popular', 'topics', 'in', 'Cambodia', 'today', '.']


## Step 3: Build the Vocabulary

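We list all unique tokens and assign each one an index number. Sorting the unique tokens first makes the indices reproducible; because the sort is case-sensitive, punctuation and capitalized words come before lowercase ones.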

**How to build the vocabulary?**

- Extract all unique words from the token list.
- Assign a unique index to each word.


In [2]:
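# Build vocabulary: deduplicate the tokens, then sort them so the indices are reproducible
unique_tokens = sorted(set(tokens))
# Map each unique token to its position in the sorted list
vocab = {word: idx for idx, word in enumerate(unique_tokens)}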
print("Vocabulary:", vocab)

Vocabulary: {'.': 0, 'AI': 1, 'Cambodia': 2, 'Data': 3, 'Science': 4, 'and': 5, 'become': 6, 'have': 7, 'important': 8, 'in': 9, 'popular': 10, 'today': 11, 'topics': 12}
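
With the vocabulary in place, the sentence can be represented as a sequence of indices, which is the form a model actually consumes. The sketch below is illustrative only: the token list is copied from the Step 2 output so the block runs on its own, and `idx_to_word` and `token_ids` are names introduced here, not part of the original notebook.

```python
# Tokens copied from the Step 2 output so this block runs without NLTK.
tokens = ['AI', 'and', 'Data', 'Science', 'have', 'become', 'important',
          'and', 'popular', 'topics', 'in', 'Cambodia', 'today', '.']
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}

# Hypothetical helper: reverse lookup from index back to word.
idx_to_word = {idx: word for word, idx in vocab.items()}

# Encode the sentence as vocabulary indices, then decode it again.
token_ids = [vocab[word] for word in tokens]
decoded = [idx_to_word[i] for i in token_ids]

print("Token ids:", token_ids)
print("Round trip matches:", decoded == tokens)
```

Both occurrences of `and` map to the same index, which is why the vocabulary has 13 entries for 14 tokens.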


## Step 4: Set the Context Window

We choose how many words before and after the target word to use as context. Here, we use a window size of 2, meaning up to two words on each side of the target.

**Why set a context window?**

- The context window determines how much surrounding information is used to predict the target word. A larger window captures broader context, while a smaller window focuses on local context.


In [3]:
window_size = 2
print(f"Context window size: {window_size}")

Context window size: 2
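
To see the effect of the window size concretely, the sketch below collects the context of the token `become` for three different sizes. The helper name `context_for` is an assumption made for this illustration, and the token list is copied from the Step 2 output so the block is self-contained.

```python
# Tokens copied from the Step 2 output so this block runs without NLTK.
tokens = ['AI', 'and', 'Data', 'Science', 'have', 'become', 'important',
          'and', 'popular', 'topics', 'in', 'Cambodia', 'today', '.']

def context_for(tokens, idx, window_size):
    """Return the tokens within window_size positions of tokens[idx]."""
    start = max(0, idx - window_size)               # clip at the start of the sentence
    end = min(len(tokens), idx + window_size + 1)   # clip at the end of the sentence
    return [tokens[i] for i in range(start, end) if i != idx]

target_idx = tokens.index('become')
for size in (1, 2, 3):
    print(f"window_size={size}: {context_for(tokens, target_idx, size)}")
```

Each increase of the window size adds at most one more word on each side of the target.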


## Step 5: Identify Target & Context Words

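We treat each token in turn as the target and collect the tokens within the window on either side as its context. Tokens near the start or end of the sentence have fewer than four context words because the window is clipped at the boundaries.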

**Why take the surrounding words?**

- The surrounding words provide context, helping the model learn the meaning and relationships between words.


In [4]:
# Identify target and context words
target_context_pairs = []
for idx, word in enumerate(tokens):
    # Clip the window at the sentence boundaries
    start = max(0, idx - window_size)
    end = min(len(tokens), idx + window_size + 1)
    # Collect every token in the window except the target itself
    context = [tokens[i] for i in range(start, end) if i != idx]
    target_context_pairs.append((word, context))

for pair in target_context_pairs:
    print(f"Target: {pair[0]}, Context: {pair[1]}")

Target: AI, Context: ['and', 'Data']
Target: and, Context: ['AI', 'Data', 'Science']
Target: Data, Context: ['AI', 'and', 'Science', 'have']
Target: Science, Context: ['and', 'Data', 'have', 'become']
Target: have, Context: ['Data', 'Science', 'become', 'important']
Target: become, Context: ['Science', 'have', 'important', 'and']
Target: important, Context: ['have', 'become', 'and', 'popular']
Target: and, Context: ['become', 'important', 'popular', 'topics']
Target: popular, Context: ['important', 'and', 'topics', 'in']
Target: topics, Context: ['and', 'popular', 'in', 'Cambodia']
Target: in, Context: ['popular', 'topics', 'Cambodia', 'today']
Target: Cambodia, Context: ['topics', 'in', 'today', '.']
Target: today, Context: ['in', 'Cambodia', '.']
Target: ., Context: ['Cambodia', 'today']


## Step 6: Create Input-Output Pairs

For CBoW, the context words are the input and the target word is the output, so each (target, context) pair from Step 5 is reordered as (context, target).


In [5]:
# Create input-output pairs for CBoW by swapping each (target, context) pair
cbow_pairs = [(context, target) for target, context in target_context_pairs]

for pair in cbow_pairs:
    print(f"Input (Context): {pair[0]}, Output (Target): {pair[1]}")

Input (Context): ['and', 'Data'], Output (Target): AI
Input (Context): ['AI', 'Data', 'Science'], Output (Target): and
Input (Context): ['AI', 'and', 'Science', 'have'], Output (Target): Data
Input (Context): ['and', 'Data', 'have', 'become'], Output (Target): Science
Input (Context): ['Data', 'Science', 'become', 'important'], Output (Target): have
Input (Context): ['Science', 'have', 'important', 'and'], Output (Target): become
Input (Context): ['have', 'become', 'and', 'popular'], Output (Target): important
Input (Context): ['become', 'important', 'popular', 'topics'], Output (Target): and
Input (Context): ['important', 'and', 'topics', 'in'], Output (Target): popular
Input (Context): ['and', 'popular', 'in', 'Cambodia'], Output (Target): topics
Input (Context): ['popular', 'topics', 'Cambodia', 'today'], Output (Target): in
Input (Context): ['topics', 'in', 'today', '.'], Output (Target): Cambodia
Input (Context): ['in', 'Cambodia', '.'], Output (Target): today
Input (Context): ['Cambodia', 'today'], Output (Target): .
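
Before these pairs can be fed to a model, the words have to be replaced by their vocabulary indices. The following is a minimal sketch of that conversion, rebuilding the pairs from the token list so that it runs on its own; the name `cbow_index_pairs` is introduced here for illustration.

```python
# Tokens copied from the Step 2 output so this block runs without NLTK.
tokens = ['AI', 'and', 'Data', 'Science', 'have', 'become', 'important',
          'and', 'popular', 'topics', 'in', 'Cambodia', 'today', '.']
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
window_size = 2

# Each entry is (list of context indices, target index).
cbow_index_pairs = []
for idx, word in enumerate(tokens):
    start = max(0, idx - window_size)
    end = min(len(tokens), idx + window_size + 1)
    context_ids = [vocab[tokens[i]] for i in range(start, end) if i != idx]
    cbow_index_pairs.append((context_ids, vocab[word]))

# Show the first three pairs in index form.
for context_ids, target_id in cbow_index_pairs[:3]:
    print(f"Input ids: {context_ids}, Output id: {target_id}")
```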

## Step 7: Draw the CBoW Architecture

Below is a simple flowchart of the CBoW model:

```
Input Layer (Context Words)
        |
        v
   Hidden Layer (Embeddings)
        |
        v
Output Layer (Softmax predicts Target Word)
```

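- **Input Layer:** Takes the context words as input, each represented by its vocabulary index (equivalently, a one-hot vector).
- **Hidden Layer:** Looks up the embedding of each context word and averages them. The weights of this layer are the word embeddings being learned.
- **Output Layer:** Computes a score for every word in the vocabulary and applies softmax to turn the scores into a probability distribution over possible target words.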
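
The three layers can be sketched as a single forward pass in NumPy. This is an illustration under stated assumptions rather than a reference implementation: the embedding size of 5, the random initialization, and the names `W_in` and `W_out` are all choices made here, and the weights are untrained, so the predicted probabilities are close to uniform.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 13   # number of entries in the Step 3 vocabulary
embed_dim = 5     # assumed embedding size, chosen arbitrarily for this sketch

# Assumed parameter names: one embedding row per word, plus output weights.
W_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
W_out = rng.normal(scale=0.1, size=(embed_dim, vocab_size))

# Input layer: indices of the context words 'and' (5) and 'Data' (3).
context_ids = [5, 3]
target_id = 1  # 'AI'

# Hidden layer: average of the context word embeddings.
hidden = W_in[context_ids].mean(axis=0)

# Output layer: one score per vocabulary word, normalized with softmax.
scores = hidden @ W_out
exp_scores = np.exp(scores - scores.max())  # subtract the max for numerical stability
probs = exp_scores / exp_scores.sum()

print("Hidden shape:", hidden.shape)
print("Probabilities sum to:", probs.sum())
print("P(target = 'AI'):", probs[target_id])
```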


## Step 8: Explain the Learning Process

The CBoW model learns to predict the target word from its context using a softmax function at the output layer. During training, **backpropagation** is used to update the weights in the network so that the cross-entropy loss between the predicted distribution and the actual target word decreases, which in turn improves the quality of the word embeddings.

- The model takes context words as input.
- It computes the average of their embeddings in the hidden layer.
- The output layer predicts the target word using softmax.
- Backpropagation adjusts the weights to increase the probability assigned to the actual target word.

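This process is repeated for all input-output pairs over many passes. Given a large enough corpus, words that appear in similar contexts end up with similar embeddings; a single sentence like the one used here only illustrates the mechanics.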
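
The learning process described above can be sketched as a small NumPy training loop with hand-written gradients. Everything beyond the notebook's own data is an assumption made for illustration: the embedding size, learning rate, number of epochs, plain per-pair gradient descent, and the names `W_in`, `W_out`, and `losses`. The token list is copied from the Step 2 output so the block runs on its own.

```python
import numpy as np

tokens = ['AI', 'and', 'Data', 'Science', 'have', 'become', 'important',
          'and', 'popular', 'topics', 'in', 'Cambodia', 'today', '.']
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
window_size, embed_dim, lr, epochs = 2, 8, 0.1, 200  # assumed hyperparameters

# Rebuild the CBoW pairs in index form: (context indices, target index).
pairs = []
for idx, word in enumerate(tokens):
    start = max(0, idx - window_size)
    end = min(len(tokens), idx + window_size + 1)
    pairs.append(([vocab[tokens[i]] for i in range(start, end) if i != idx], vocab[word]))

rng = np.random.default_rng(0)
vocab_size = len(vocab)
W_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))   # word embeddings
W_out = rng.normal(scale=0.1, size=(embed_dim, vocab_size))  # output weights

losses = []
for epoch in range(epochs):
    total_loss = 0.0
    for context_ids, target_id in pairs:
        # Forward pass: average the context embeddings, score, apply softmax.
        hidden = W_in[context_ids].mean(axis=0)
        scores = hidden @ W_out
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        total_loss += -np.log(probs[target_id])  # cross-entropy for this pair

        # Backward pass: gradient of the loss w.r.t. the scores is probs - one_hot(target).
        grad_scores = probs.copy()
        grad_scores[target_id] -= 1.0
        grad_hidden = W_out @ grad_scores  # computed before W_out is updated

        # Gradient descent updates.
        W_out -= lr * np.outer(hidden, grad_scores)
        # Averaging spreads the hidden-layer gradient equally over the context words.
        np.subtract.at(W_in, context_ids, lr * grad_hidden / len(context_ids))
    losses.append(total_loss / len(pairs))

print(f"Mean loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

After training, each row of `W_in` is the learned embedding of the word with that index. A practical implementation would typically rely on a library such as Gensim or PyTorch and a much larger corpus.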
