# Word Tokenization to One‑Hot Tensors (Python + NumPy)

**Goal:** Convert a simple sentence into tokens, map words to indices, build one‑hot vectors, and stack them into a NumPy tensor. The walkthrough is beginner friendly and self‑contained.

The walkthrough stops at step 8 and does not cover embeddings. There is no separate Dependencies section; the only import, NumPy, lives in the first code cell.

## 2) 📝 Sample sentence & tokenization  
**What this does:** Normalizes the text (lowercases it and removes `!` and `.`), then splits on whitespace to get beginner‑friendly word tokens.

In [None]:
# Minimal setup kept inside code (no "Dependencies" section)
import numpy as np

# Our example sentence
sentence = "Hello world! Welcome to NumPy tokenization."

# Basic cleanup: lowercase & remove a couple of punctuation marks
clean = sentence.lower().replace("!", "").replace(".", "")

# Split on whitespace to get tokens
tokens = clean.split()

print("Tokens:", tokens)
# Expected: ['hello', 'world', 'welcome', 'to', 'numpy', 'tokenization']
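The chained `replace` calls above only handle `!` and `.`, so a comma or question mark would stay attached to its word. A minimal sketch of one alternative, assuming it is acceptable to treat every character other than a lowercase letter or digit as a separator (the name `regex_tokens` is illustrative and not part of the walkthrough):

```python
import re

sentence = "Hello world! Welcome to NumPy tokenization."

# Keep runs of lowercase letters and digits; anything else acts as a separator
regex_tokens = re.findall(r"[a-z0-9]+", sentence.lower())

print("Tokens:", regex_tokens)
# ['hello', 'world', 'welcome', 'to', 'numpy', 'tokenization']
```

Note that this pattern also splits on apostrophes, so a word like "don't" would become two tokens. Whether that is desirable depends on the text being processed.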

## 3) 📚 Build a vocabulary  
**What this does:** Collects unique words and fixes an order, giving us a stable set of columns for vectors.

In [None]:
# Unique words (sorted so the order is deterministic)
vocab = sorted(set(tokens))
V = len(vocab)

print("Vocabulary:", vocab)
print("Vocabulary size:", V)
# Example ('to' sorts before 'tokenization' because a prefix comes first):
# Vocabulary: ['hello', 'numpy', 'to', 'tokenization', 'welcome', 'world']
# Vocabulary size: 6
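Sorting is only one way to fix a deterministic order. As a rough sketch, the cell below compares it with first-appearance order on a hypothetical token list that contains repeats (`repeated_tokens` is made up for illustration); `dict.fromkeys` removes duplicates while preserving insertion order.

```python
# Hypothetical token list with repeated words, used only for this comparison
repeated_tokens = ["to", "be", "or", "not", "to", "be"]

# Alphabetical order: independent of where words appear in the text
sorted_vocab = sorted(set(repeated_tokens))

# First-appearance order: dict keys keep insertion order and drop duplicates
ordered_vocab = list(dict.fromkeys(repeated_tokens))

print("Sorted:          ", sorted_vocab)   # ['be', 'not', 'or', 'to']
print("First appearance:", ordered_vocab)  # ['to', 'be', 'or', 'not']
```

Either choice works for one-hot encoding as long as the same order is used everywhere afterwards. Avoid iterating over a bare `set`, since its order is not guaranteed to be stable between runs.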

## 4) 🔗 Create word ↔ index mappings  
**What this does:** Assigns each vocabulary word a unique integer index so we can look up vector positions quickly.

In [None]:
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

print("word2idx:", word2idx)
# Example: {'hello': 0, 'numpy': 1, 'to': 2, 'tokenization': 3, 'welcome': 4, 'world': 5}
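A quick way to convince yourself that the two dictionaries are true inverses is to send every word through both and check that it comes back unchanged. The sketch below rebuilds the mappings from the same tokens so it can run on its own; `round_trip_ok` and `list_matches` are illustrative names.

```python
tokens = ['hello', 'world', 'welcome', 'to', 'numpy', 'tokenization']
vocab = sorted(set(tokens))

word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

# word -> index -> word should return the original word for every entry
round_trip_ok = all(idx2word[word2idx[w]] == w for w in vocab)

# Because indices are 0..V-1 in vocab order, the list itself works as idx2word
list_matches = all(vocab[i] == idx2word[i] for i in range(len(vocab)))

print("Round trip OK:", round_trip_ok)        # True
print("vocab[i] == idx2word[i]:", list_matches)  # True
```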

## 5) 🎯 One‑hot encoding helper  
**What this does:** Defines a tiny function that turns an index into a one‑hot vector (all zeros except a 1 at that index).

In [None]:
def one_hot(idx, vocab_size):
    """
    Create a one-hot vector of length `vocab_size` with a 1 at `idx`.
    """
    vec = np.zeros(vocab_size, dtype=int)
    vec[idx] = 1
    return vec

# Quick sanity check:
print("One-hot for index 0:", one_hot(0, V))
# Example: [1 0 0 0 0 0]
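One subtlety of the helper: NumPy accepts negative indices, so `one_hot(-1, V)` would silently set the last position instead of failing. If that is a concern, a guarded variant could look like the sketch below. `one_hot_checked` is a hypothetical name, not part of the original walkthrough.

```python
import numpy as np

def one_hot_checked(idx, vocab_size):
    """Hypothetical variant of one_hot that rejects out-of-range indices."""
    if not 0 <= idx < vocab_size:
        raise ValueError(f"idx {idx} is outside [0, {vocab_size})")
    vec = np.zeros(vocab_size, dtype=int)
    vec[idx] = 1
    return vec

print(one_hot_checked(2, 6))  # [0 0 1 0 0 0]

try:
    one_hot_checked(-1, 6)
except ValueError as err:
    print("Rejected:", err)
```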

## 6) 🔢 Map tokens → indices  
**What this does:** Converts each token in the sentence into its integer index according to the vocabulary.

In [None]:
token_indices = [word2idx[t] for t in tokens]
print("Token indices:", token_indices)
# Example: [0, 5, 4, 2, 1, 3]
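The lookup `word2idx[t]` raises a `KeyError` for any word that is not in the vocabulary. That cannot happen here, because the vocabulary was built from the same tokens, but it will as soon as you encode new text. One common workaround, sketched below under the assumption that a single placeholder entry is acceptable, is to reserve an index for unknown words. The `"<unk>"` token and the `encode` helper are illustrative additions, not part of the original steps.

```python
tokens = ['hello', 'world', 'welcome', 'to', 'numpy', 'tokenization']

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words
vocab_with_unk = [UNK] + sorted(set(tokens))
word2idx_unk = {word: idx for idx, word in enumerate(vocab_with_unk)}

def encode(words):
    """Map words to indices, falling back to the <unk> index for unseen words."""
    unk_idx = word2idx_unk[UNK]
    return [word2idx_unk.get(w, unk_idx) for w in words]

print(encode(["hello", "python", "world"]))  # [1, 0, 6]
```

Adding the placeholder grows the vocabulary by one, so any one-hot vectors built from this mapping would have length 7 instead of 6.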

## 7) 🧱 Convert indices → one‑hot vectors & stack into a tensor  
**What this does:** Builds a one‑hot vector per token and stacks them into a 2D NumPy array (tensor) with shape `(num_tokens, vocab_size)`.

In [None]:
one_hots = [one_hot(i, V) for i in token_indices]

# Stack rows to form a (num_tokens, vocab_size) tensor
tensor = np.stack(one_hots, axis=0)

print("Shape of tensor:", tensor.shape)
print("Tensor:\n", tensor)
# Example shape: (6, 6)
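The loop-and-stack approach is easy to read, but the same tensor can be built in one step: row `i` of an identity matrix is exactly the one-hot vector for index `i`, so indexing the identity matrix with the list of token indices selects the right rows. A sketch that rebuilds both versions and compares them (`tensor_loop` and `tensor_fast` are illustrative names):

```python
import numpy as np

tokens = ['hello', 'world', 'welcome', 'to', 'numpy', 'tokenization']
vocab = sorted(set(tokens))
V = len(vocab)
word2idx = {word: idx for idx, word in enumerate(vocab)}
token_indices = [word2idx[t] for t in tokens]

# Loop version: one zero vector per token, with a single 1 written in
rows = []
for i in token_indices:
    vec = np.zeros(V, dtype=int)
    vec[i] = 1
    rows.append(vec)
tensor_loop = np.stack(rows, axis=0)

# Vectorized version: pick rows of the V x V identity matrix
tensor_fast = np.eye(V, dtype=int)[token_indices]

print("Same result:", np.array_equal(tensor_loop, tensor_fast))  # True
print("Row sums:", tensor_fast.sum(axis=1))                      # [1 1 1 1 1 1]
```

The row sums are a handy sanity check: every row of a valid one-hot tensor sums to exactly 1.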

## 8) 🔍 Inspecting the result (tokens vs. columns)  
**What this does:** Clarifies how rows/columns line up so you can read the tensor meaningfully.

In [None]:
print("Tokens (row order):", tokens)
print("Vocabulary (column order):", vocab)

# Reading tip:
# - Each ROW corresponds to a token in the original order: ['hello', 'world', 'welcome', 'to', 'numpy', 'tokenization']
# - Each COLUMN corresponds to a vocab entry (alphabetical above), e.g. ['hello','numpy','to','tokenization','welcome','world']
# So the row for 'world' has a 1 in the 'world' column, and 0 elsewhere.
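To make the row/column alignment concrete, you can walk the tensor row by row and use `argmax` to find which column holds the 1. The sketch below rebuilds the tensor so it runs on its own, then decodes it back into the original tokens; `decoded` is an illustrative name.

```python
import numpy as np

tokens = ['hello', 'world', 'welcome', 'to', 'numpy', 'tokenization']
vocab = sorted(set(tokens))
word2idx = {word: idx for idx, word in enumerate(vocab)}
token_indices = [word2idx[t] for t in tokens]

# Build the (num_tokens, vocab_size) one-hot tensor
tensor = np.zeros((len(tokens), len(vocab)), dtype=int)
tensor[np.arange(len(tokens)), token_indices] = 1

# For each row, argmax gives the column of the single 1
for token, row in zip(tokens, tensor):
    col = int(row.argmax())
    print(f"{token:>12} -> column {col} ({vocab[col]})")

# Decoding every row recovers the original token sequence
decoded = [vocab[i] for i in tensor.argmax(axis=1)]
print("Decoded matches tokens:", decoded == tokens)  # True
```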