# AI's Fuel: From Words to Tokens

This notebook tells a small story about turning a sentence into something a computer can use: numbers. We begin with plain text and end with a tensor made of one-hot vectors. Each step is short and focused so that no prior knowledge of tokenization is required.

## Step 1 — Provide the sentence
Edit the cell below to set the sentence to transform. Keep the variable name exactly as `TEXT`.

In [None]:
TEXT = "Tokens are text as numbers!"

## Step 2 — Prepare the words
We normalize the sentence by lowercasing it and removing exclamation marks and periods. Then we split on whitespace to get the words as separate pieces. Any other punctuation, such as a comma, stays attached to its word.

In [None]:
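import numpy as np  # used from Step 5 onward

# Lowercase, then drop "!" and "." so that "numbers!" becomes "numbers"
clean = TEXT.lower().replace("!", "").replace(".", "")
tokens = clean.split()  # split on any run of whitespace
print("Tokens:", tokens)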
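
The two `replace` calls only cover `!` and `.`, so a comma or question mark would stay glued to its word. The following is a minimal sketch of a broader cleanup, assuming it is acceptable to drop every ASCII punctuation character in Python's `string.punctuation`. The names `sample`, `strip_table`, and `sample_tokens` are illustrative and not part of the notebook.

```python
import string

# Hypothetical sample sentence with punctuation the Step 2 cell does not handle
sample = "Tokens, are text as numbers?!"

# Translation table that deletes every ASCII punctuation character
strip_table = str.maketrans("", "", string.punctuation)

clean_sample = sample.lower().translate(strip_table)
sample_tokens = clean_sample.split()
print("Tokens:", sample_tokens)
```

This approach also removes apostrophes and hyphens, so "don't" would become "dont". Whether that is desirable depends on the text being processed.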

## Step 3 — Collect the vocabulary
From the tokens we collect the unique words and sort them alphabetically to fix an order. This ordered list defines the columns of our vectors.

In [None]:
vocab = sorted(set(tokens))
V = len(vocab)
print("Vocabulary:", vocab)
print("Vocabulary size:", V)
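
The default sentence has no repeated words, so its vocabulary is as long as its token list. As a small sketch of what happens otherwise, the hypothetical token list below (not taken from `TEXT`) repeats two words, and the vocabulary ends up shorter than the sentence.

```python
# Hypothetical token list with repeats (not the notebook's TEXT)
demo_tokens = ["the", "cat", "saw", "the", "other", "cat"]

# set() removes duplicates, sorted() fixes an alphabetical order
demo_vocab = sorted(set(demo_tokens))

print("Number of tokens:", len(demo_tokens))
print("Vocabulary size:", len(demo_vocab))
print("Vocabulary:", demo_vocab)
```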

## Step 4 — Give each word an index
Each vocabulary word receives a unique integer, starting at 0 and following the vocabulary order. These integers set the position of the 1 inside each vector.

In [None]:
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}
print("word2idx:", word2idx)
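
Because the vocabulary is an ordered list without duplicates, a word's index is simply its position in that list. The sketch below checks this, assuming the vocabulary that the default `TEXT` produces. It is hard-coded here so the example runs on its own.

```python
# Vocabulary the default TEXT produces in Step 3 (hard-coded for this sketch)
vocab = ["are", "as", "numbers", "text", "tokens"]

word2idx = {word: idx for idx, word in enumerate(vocab)}

# Each index is the word's position in the ordered vocabulary
for word, idx in word2idx.items():
    assert vocab.index(word) == idx

# Indices run from 0 to V - 1 with no gaps
print("Indices:", sorted(word2idx.values()))
```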

## Step 5 — Define one-hot vectors
A one-hot vector has zeros everywhere and a single 1 at the index of the word. The length of the vector equals the vocabulary size.

In [None]:
def one_hot(idx, vocab_size):
    """Return a vector of vocab_size zeros with a single 1 at position idx."""
    vec = np.zeros(vocab_size, dtype=int)
    vec[idx] = 1
    return vec

print("Example one-hot (index 0):", one_hot(0, V))
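
Two properties follow from the definition and are easy to check: the entries of a one-hot vector sum to 1, and the position of the largest entry gives back the index. The sketch below repeats `one_hot` so that it runs on its own, and assumes a vocabulary size of 5, as for the default sentence.

```python
import numpy as np

def one_hot(idx, vocab_size):
    vec = np.zeros(vocab_size, dtype=int)
    vec[idx] = 1
    return vec

vec = one_hot(3, 5)
print("Vector:", vec)
print("Sum:", vec.sum())        # a single 1, so the entries add up to 1
print("Argmax:", vec.argmax())  # the position of the 1 recovers the index

# An index equal to or above the vocabulary size has no slot in the vector
try:
    one_hot(5, 5)
except IndexError as err:
    print("IndexError:", err)
```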

## Step 6 — Translate the sentence into indices
We replace each token in the sentence with its index from the vocabulary.

In [None]:
token_indices = [word2idx[t] for t in tokens]
print("Token indices:", token_indices)
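
One way to check the translation is to decode it: mapping each index back through `idx2word` should return the original tokens. The sketch below assumes the tokens and vocabulary from the default `TEXT`, hard-coded so the example is self-contained. The name `decoded` is illustrative.

```python
# Tokens and vocabulary for the default TEXT (hard-coded for this sketch)
tokens = ["tokens", "are", "text", "as", "numbers"]
vocab = sorted(set(tokens))

word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

token_indices = [word2idx[t] for t in tokens]   # words -> integers
decoded = [idx2word[i] for i in token_indices]  # integers -> words

print("Token indices:", token_indices)
print("Decoded:", decoded)
```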

## Step 7 — Build one-hot vectors and stack them
For each index we create a one-hot vector, then stack the vectors into a 2D array. Its shape is (number of tokens, vocabulary size): each row is a token and each column is a vocabulary entry.

In [None]:
one_hots = [one_hot(i, V) for i in token_indices]
tensor = np.stack(one_hots, axis=0)
print("Tensor shape:", tensor.shape)
print(tensor)
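
The same array can be built without a Python loop. Row `i` of an identity matrix is the one-hot vector for index `i`, so indexing the identity matrix with the list of token indices selects one row per token. This is an alternative sketch, not the notebook's method, and it assumes the indices produced by the default `TEXT`.

```python
import numpy as np

token_indices = [4, 0, 3, 1, 2]  # indices for the default TEXT
V = 5                            # vocabulary size for the default TEXT

# Integer-array indexing picks one identity row per token
tensor = np.eye(V, dtype=int)[token_indices]

print("Tensor shape:", tensor.shape)
print("Ones per row:", tensor.sum(axis=1))

# The column holding the 1 in each row recovers the token indices
recovered = tensor.argmax(axis=1).tolist()
print("Recovered indices:", recovered)
```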

## Step 8 — Read the alignment
The cell below prints the row tokens and the column vocabulary. Each row of the tensor has a single 1 in the column of its matching vocabulary word.

In [None]:
print("Row tokens in order:", tokens)
print("Vocabulary (column order):", vocab)
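
To see the alignment directly, the rows and columns can be printed with their labels. The sketch below is one possible layout, rebuilt from the default sentence so that it runs on its own. The names `width`, `header`, and `lines` are illustrative.

```python
import numpy as np

tokens = ["tokens", "are", "text", "as", "numbers"]
vocab = sorted(set(tokens))
word2idx = {word: idx for idx, word in enumerate(vocab)}
tensor = np.eye(len(vocab), dtype=int)[[word2idx[t] for t in tokens]]

# Column width: longest vocabulary word plus one space of padding
width = max(len(word) for word in vocab) + 1

# Header row: blank corner, then one right-aligned column label per word
header = " " * width + "".join(word.rjust(width) for word in vocab)
lines = [header]

# One labeled row per token, with values right-aligned under the labels
for token, row in zip(tokens, tensor):
    lines.append(token.ljust(width) + "".join(str(x).rjust(width) for x in row))

print("\n".join(lines))
```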