Based on https://github.com/lucasvw/micrograd/blob/master/_01_discord_01.ipynb, as a follow-up to https://discord.com/channels/1020383067459821711/1029849849765564528/1056937673241137172.

# Summary

Contrary to what I saw during end-to-end training, indexing is faster than one-hot matrix multiplication on both CPU and GPU when the two operations are timed in isolation (1,000 lookups into a 703×27 weight matrix). Measurements are rounded for readability.

||CPU|GPU|
|-|-|-|
|One-hot|137 µs|11 µs|
|Indexing|6 µs|5 µs|
|Speedup|22x|2x|
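
The speedup row can be recomputed from the unrounded `%%timeit` outputs reported in the cells below. This is a small sketch of that arithmetic; the dictionary layout and the `speedup` helper are illustrative names, not part of the original notebook. Dividing the rounded table entries instead (137 / 6) gives about 22.8, so the figure depends slightly on where the rounding happens.

```python
# Unrounded timings in microseconds, copied from the %%timeit outputs in this notebook.
timings_us = {
    "CPU": {"one_hot": 137.0, "indexing": 6.29},
    "GPU": {"one_hot": 10.7, "indexing": 4.82},
}


def speedup(one_hot_us, indexing_us):
    """Return how many times faster indexing is than the one-hot matmul."""
    return one_hot_us / indexing_us


speedups = {dev: speedup(t["one_hot"], t["indexing"]) for dev, t in timings_us.items()}
for device, factor in speedups.items():
    print(f"{device}: {factor:.1f}x")
# CPU: 21.8x
# GPU: 2.2x
```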


# Setup

In [1]:
import torch
import torch.nn.functional as F

In [2]:
row_dimensions = 703 # Trigram word model input: 1 + 26 + 26*26 contexts (.., .a, [...], .z, aa, [...], zz)
col_dimensions = 27 # Trigram word model output: (., a, [...], z)

In [3]:
# Simulated weight matrix
W = torch.randn([row_dimensions, col_dimensions])

In [4]:
# Simulated X vector: 1000 random integer indices in [0, 703), i.e. 0 to 702 (`high` is exclusive)
X = torch.randint(low=0, high=row_dimensions, size=(1000,))

In [5]:
# One-hot encoded x_enc matrix, shape (1000, 703)
x_enc = F.one_hot(X, num_classes=row_dimensions).float()
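
The comparison only makes sense because both operations produce the same tensor: multiplying a one-hot row by `W` selects exactly one row of `W`, which is what `W[X]` does directly. A minimal check, rebuilding the same tensors as above (the seed and the short names `rows`, `cols`, `via_matmul`, `via_index` are arbitrary choices for this sketch):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only so the sketch is reproducible

rows, cols = 703, 27
W = torch.randn(rows, cols)
X = torch.randint(low=0, high=rows, size=(1000,))
x_enc = F.one_hot(X, num_classes=rows).float()

via_matmul = x_enc @ W  # (1000, 703) @ (703, 27) -> (1000, 27)
via_index = W[X]        # gathers 1000 rows of W  -> (1000, 27)

same_shape = via_matmul.shape == via_index.shape
same_values = torch.allclose(via_matmul, via_index)
print(same_shape, same_values)
```

The matmul touches all 703 × 27 weights for every one of the 1000 inputs, while indexing only reads the 27 values of the selected row, which is a plausible reason for the gap measured below.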

# CPU

In [6]:
%%timeit
x_enc @ W

137 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [7]:
%%timeit
W[X]

6.29 µs ± 3.48 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


# GPU

Move our tensors to the GPU:

In [8]:
x_enc = x_enc.cuda()
W = W.cuda()
X = X.cuda()

In [9]:
%%timeit
a = x_enc @ W

10.7 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [10]:
%%timeit
a = W[X]

4.82 µs ± 2.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
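
One caveat on the GPU cells: CUDA kernels are launched asynchronously, so a plain `%%timeit` can return before the GPU has finished the work, and the GPU figures above may partly reflect launch overhead rather than full execution time. A hedged sketch of a synchronized measurement using `torch.utils.benchmark.Timer`, which handles CUDA synchronization itself, is shown here. The `median_us` helper and the CPU fallback are assumptions added for this example, and no results are claimed for it.

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

# Assumption: fall back to the CPU when no CUDA device is present, so the sketch runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"

rows, cols = 703, 27
W = torch.randn(rows, cols, device=device)
X = torch.randint(low=0, high=rows, size=(1000,), device=device)
x_enc = F.one_hot(X, num_classes=rows).float()


def median_us(stmt):
    """Median run time of `stmt` in microseconds (hypothetical helper for this sketch)."""
    timer = benchmark.Timer(stmt=stmt, globals={"x_enc": x_enc, "W": W, "X": X})
    return timer.blocked_autorange(min_run_time=0.2).median * 1e6


results = {"one_hot": median_us("x_enc @ W"), "indexing": median_us("W[X]")}
for name, us in results.items():
    print(f"{name}: {us:.2f} µs on {device}")
```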
