# AI's Fuel: From Text to Tokens to Tensors

This notebook turns a piece of text into something a computer can use: a tensor. We begin with a sentence, split it into tokens, and end with a tensor made of one-hot vectors, one step at a time.

## Step 1 — Provide the text
Edit the cell below to set the text to convert to tokens. Keep the variable name exactly as `TEXT`.

In [6]:
TEXT = "Tokens are text that become numbers that were text"

## Step 2 — Get the tokens
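For simplicity, we tokenize by word and assume the text contains no punctuation. The code below does not strip punctuation, so a comma or period would stay attached to its word. We normalize the text by making it lowercase, then split on whitespace to obtain the words in the sentence.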

In [7]:
clean = TEXT.lower()
tokens = clean.split()
print("Tokens:", tokens)

Tokens: ['tokens', 'are', 'text', 'that', 'become', 'numbers', 'that', 'were', 'text']
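
If your own text does contain punctuation, one possible way to drop it before splitting is sketched below. The helper `tokenize` and the `sample` sentence are illustrative names that are not part of this notebook, and the sketch only removes ASCII punctuation (the characters in `string.punctuation`).

```python
import string

def tokenize(text):
    """Lowercase the text, drop ASCII punctuation, and split on whitespace."""
    # With three arguments, str.maketrans maps every character in the
    # third argument to None, so translate() deletes those characters.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

# Hypothetical sentence with punctuation, for illustration only
sample = "Tokens are text, that become numbers!"
sample_tokens = tokenize(sample)
print("Tokens:", sample_tokens)
```

Note that this also removes apostrophes and hyphens, so a word like "don't" becomes "dont". Whether that is acceptable depends on the text you are working with.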


## Step 3 — Collect the vocabulary
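From the tokens we saw, we gather the unique ones into a set and count them to get the vocabulary size. A Python set has no guaranteed order, so the order printed below (and the indices assigned in Step 4) can differ between runs.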

In [8]:
vocab = set(tokens)
V = len(vocab)
print("Vocabulary:", vocab)
print("Vocabulary size:", V)

Vocabulary: {'become', 'that', 'numbers', 'tokens', 'were', 'text', 'are'}
Vocabulary size: 7
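
If you want the same vocabulary order on every run, a minimal sketch is to sort the unique words. This is an alternative to the cell above, not what the notebook does, and `example_tokens` and `sorted_vocab` are illustrative names. The tokens are copied from the Step 2 output so the snippet runs on its own.

```python
# Tokens copied from the Step 2 output
example_tokens = ['tokens', 'are', 'text', 'that', 'become',
                  'numbers', 'that', 'were', 'text']

# set() removes duplicates; sorted() returns a list in alphabetical
# order, which is the same on every run.
sorted_vocab = sorted(set(example_tokens))
print("Sorted vocabulary:", sorted_vocab)
print("Vocabulary size:", len(sorted_vocab))
```

With a sorted list, each word would get the same index every time, at the cost of indices that differ from the ones shown in this notebook.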


## Step 4 — Get the indices
Each vocabulary word receives a unique number that identifies it. Because the vocabulary contains only unique words, repeated words such as `that` and `text` appear once and get a single index.

In [9]:
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}
print("Vocabulary with index:", word2idx)
print("Index with vocabulary:", idx2word)

Vocabulary with index: {'become': 0, 'that': 1, 'numbers': 2, 'tokens': 3, 'were': 4, 'text': 5, 'are': 6}
Index with vocabulary: {0: 'become', 1: 'that', 2: 'numbers', 3: 'tokens', 4: 'were', 5: 'text', 6: 'are'}


## Step 5 — Define one-hot vectors
A one-hot vector has zeros everywhere except for a single 1 at the index of the word. The length of the vector equals the vocabulary size.

In [10]:
import numpy as np

def one_hot(idx, vocab_size):
    # Start from all zeros, then set the position of the word to 1
    vec = np.zeros(vocab_size, dtype=int)
    vec[idx] = 1
    return vec

# Play with the example_index
example_index = 1
example_word = idx2word[example_index]
print(f"Example one-hot for word '{example_word}' (index {example_index}):", one_hot(example_index, V))

Example one-hot for word 'that' (index 1): [0 1 0 0 0 0 0]


## Step 6 — Translate the sentence into indices
We replace each token in the sentence with its index from the vocabulary.

In [11]:
token_indices = [word2idx[t] for t in tokens]
print("Original sentence:", TEXT)
print("Vocabulary with index:", word2idx)
print("Token indices based on word2idx:", token_indices)

Original sentence: Tokens are text that become numbers that were text
Vocabulary with index: {'become': 0, 'that': 1, 'numbers': 2, 'tokens': 3, 'were': 4, 'text': 5, 'are': 6}
Token indices based on word2idx: [3, 6, 5, 1, 0, 2, 1, 4, 5]
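
As a quick sanity check, the indices can be translated back into words with the reverse mapping. The sketch below copies the mapping printed above so it runs on its own; the `example_` names, `encoded`, and `decoded` are illustrative and not part of the notebook.

```python
# Mapping copied from the output above; your run may assign other indices
example_word2idx = {'become': 0, 'that': 1, 'numbers': 2, 'tokens': 3,
                    'were': 4, 'text': 5, 'are': 6}
example_idx2word = {idx: word for word, idx in example_word2idx.items()}

example_tokens = ['tokens', 'are', 'text', 'that', 'become',
                  'numbers', 'that', 'were', 'text']

# Encode words to indices, then decode the indices back to words
encoded = [example_word2idx[t] for t in example_tokens]
decoded = [example_idx2word[i] for i in encoded]

print("Encoded:", encoded)
print("Decoded:", " ".join(decoded))
```

The decoded sentence is all lowercase: the round trip recovers the tokens, not the original `TEXT`, because lowercasing in Step 2 discards the capitalization.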


## Step 7 — Get the tensor
For every token index we create a one-hot vector, then stack those vectors into an array. The result is our tensor, a 2-D array of shape (number of tokens, vocabulary size): each row corresponds to a token (in sentence order) and each column to a vocabulary word.

In [12]:
one_hots = [one_hot(i, V) for i in token_indices]
tensor = np.stack(one_hots, axis=0)
print("Original sentence:", TEXT)
print("Vocabulary with index:", word2idx)
print("Token indices:", token_indices)
print("Tensor shape:", tensor.shape)
print(tensor)

Original sentence: Tokens are text that become numbers that were text
Vocabulary with index: {'become': 0, 'that': 1, 'numbers': 2, 'tokens': 3, 'were': 4, 'text': 5, 'are': 6}
Token indices: [3, 6, 5, 1, 0, 2, 1, 4, 5]
Tensor shape: (9, 7)
[[0 0 0 1 0 0 0]
 [0 0 0 0 0 0 1]
 [0 0 0 0 0 1 0]
 [0 1 0 0 0 0 0]
 [1 0 0 0 0 0 0]
 [0 0 1 0 0 0 0]
 [0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0]
 [0 0 0 0 0 1 0]]
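
The same tensor can also be built without a loop. The sketch below is an alternative to the cell above, assuming the token indices and vocabulary size printed in this notebook. It relies on the fact that row `i` of an identity matrix is the one-hot vector for index `i`. The names `example_indices`, `vocab_size`, `tensor_eye`, and `recovered` are illustrative.

```python
import numpy as np

# Values copied from the outputs above
example_indices = [3, 6, 5, 1, 0, 2, 1, 4, 5]
vocab_size = 7

# np.eye builds the identity matrix; indexing it with a list of
# indices selects one row (one one-hot vector) per token.
tensor_eye = np.eye(vocab_size, dtype=int)[example_indices]

# argmax along each row finds the position of the single 1,
# which recovers the original token index.
recovered = tensor_eye.argmax(axis=1).tolist()

print("Tensor shape:", tensor_eye.shape)
print("Indices recovered:", recovered == example_indices)
```

Taking `argmax` along each row shows that the tensor loses no information: every row maps back to exactly one vocabulary index.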


---
## Like this notebook?
[<img src="https://raw.githubusercontent.com/primer/octicons/main/icons/mark-github-16.svg" alt="GitHub" width="18" style="vertical-align: text-bottom;"> Data & AI with Forbes](https://github.com/bforbesc/Data-AI-with-Forbes)

If you enjoyed this walkthrough, check out my repo for more technical data-and-AI topics, explained simply and clearly.