# Benchmarks

Benchmarks to inform design and architecture decisions for the library.

In [44]:
import gzip
import os
import shutil
import tempfile

import pandas as pd
import torch
import tqdm
from datasets import load_dataset
from transformer_lens import HookedTransformer

from sparse_autoencoder.autoencoder.model import SparseAutoencoder
from sparse_autoencoder.dataset.dataloader import (
    collate_neel_c4_tokenized,
    create_dataloader,
)
from sparse_autoencoder.train.train import pipeline

## Activation Tensor Sizes

It's useful to know both the size of the activations and how much they can be compressed.

In [32]:
# Create a batch of text data
dataset = load_dataset("NeelNanda/c4-code-tokenized-2b", split="train", streaming=True)
first_batch = []
for idx, example in enumerate(dataset):
    # Stop after the first 25 tokenized prompts
    if idx >= 25:
        break
    first_batch.append(example["tokens"])
first_batch = torch.tensor(first_batch)
f"Number of activations to store: {first_batch.numel()}"

'Number of activations to store: 25600'

In [61]:
# Create the activations
src_model = HookedTransformer.from_pretrained("NeelNanda/GELU_1L512W_C4_Code")
logits, cache = src_model.run_with_cache(first_batch)
activations = cache["blocks.0.mlp.hook_post"].half()
number_activations = activations.numel()
size_bytes_activations = number_activations * 2  # Assume float 16
size_mb_activations = f"{size_bytes_activations / (10**6):.2f} MB"
f"With {activations.numel()} features at half precision, the features take up {size_mb_activations} of memory"

Loaded pretrained model NeelNanda/GELU_1L512W_C4_Code into HookedTransformer


'With 52428800 features at half precision, the features take up 104.86 MB of memory'

Next we try compressing on disk. The saving is small (roughly 11% here), so it is probably not worth it:

In [36]:
# Save to temp dir
temp_dir = tempfile.gettempdir()
temp_file = os.path.join(temp_dir, "temp.pt")
temp_file_gz = temp_file + ".gz"
torch.save(activations, temp_file)

# Zip it
with open(temp_file, "rb") as f_in:
    with gzip.open(temp_file_gz, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# Get the file size back
fs_bytes = os.path.getsize(temp_file_gz)
f"Compressed file size is {fs_bytes / (10**6):.2f} MB"

'Compressed file size is 93.09 MB'
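
From the two figures reported above, gzip saves only roughly 11% (93.09 MB against 104.86 MB). The same kind of measurement can be sketched in memory, without a temporary file. This is a minimal sketch: the helper name `gzip_ratio` is hypothetical, and plain `bytes` stand in for a serialized tensor, so the two synthetic payloads are illustrative extremes rather than real activations.

```python
import gzip
import os


def gzip_ratio(payload: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    if not payload:
        raise ValueError("payload must not be empty")
    return len(gzip.compress(payload)) / len(payload)


# Ratio implied by the sizes reported above (MB compressed / MB uncompressed).
reported_ratio = 93.09 / 104.86

# Two illustrative extremes (assumed stand-ins, not real activations):
redundant_ratio = gzip_ratio(bytes(1_000_000))  # all zeros compress very well
noisy_ratio = gzip_ratio(os.urandom(1_000_000))  # random bytes barely compress

print(f"reported: {reported_ratio:.2f}")
print(f"redundant: {redundant_ratio:.4f}, noisy: {noisy_ratio:.4f}")
```

The reported ratio sits much closer to the noisy extreme than to the redundant one, which is consistent with the decision not to compress.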

Now let's estimate the total storage needed for 8 billion activation vectors, each holding 2,048 values at float16:

In [43]:
assumed_n_activation_batches = 8 * (10**9)
assumed_n_activations_per_batch = 2048
uncompressed_size_per_activation = 2  # float16
estimated_size = (
    assumed_n_activation_batches
    * assumed_n_activations_per_batch
    * uncompressed_size_per_activation
)
f"With {assumed_n_activation_batches} batches of {assumed_n_activations_per_batch} activations, \
the estimated size is {estimated_size / (10**12):.2f} TB"

'With 8000000000 batches of 2048 activations, the estimated size is 32.77 TB'
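
Both size estimates so far follow the same arithmetic: number of activation vectors, times values per vector, times bytes per value. A small helper makes that explicit. The function name `estimate_size_bytes` and its defaults are assumptions for illustration (2,048 values per vector and 2 bytes per value, matching the cells above).

```python
def estimate_size_bytes(
    n_vectors: int,
    values_per_vector: int = 2048,
    bytes_per_value: int = 2,  # float16
) -> int:
    """Uncompressed storage needed for n_vectors activation vectors."""
    return n_vectors * values_per_vector * bytes_per_value


# The 25,600-token sample batch measured earlier.
small = estimate_size_bytes(25_600)
# The 8 billion activation vectors assumed for a full run.
large = estimate_size_bytes(8 * 10**9)

print(f"{small / 10**6:.2f} MB")
print(f"{large / 10**12:.2f} TB")
```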

In [60]:
# Calculate the number of activations you can store with different sizes

sizes_gb = [10, 50, 100, 300, 500, 1000]
activations_per_size = [
    i * (10**9) / uncompressed_size_per_activation / assumed_n_activations_per_batch
    for i in sizes_gb
]

table = pd.DataFrame({"Size (GB)": sizes_gb, "Activations": activations_per_size})
table["Activations"] = table["Activations"].apply(
    lambda x: "{:,.0f}".format(x / 10**6) + "M"
)
table

| Size (GB) | Activations |
|-----------|-------------|
| 10        | 2M          |
| 50        | 12M         |
| 100       | 24M         |
| 300       | 73M         |
| 500       | 122M        |
| 1000      | 244M        |


VastAI systems often have quite a lot of disk space (e.g. 300 GB), but available RAM is often smaller
(e.g. 50 GB, of which we need a reasonable amount left over for moving tensors around etc.). This
means that we can store c. 5-10M activations in CPU RAM on a typical instance (sometimes 25M+), or
50-100M on disk. Both seem like plenty!
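
The RAM and disk figures above can be reproduced with a small capacity calculation. This is a sketch under stated assumptions: the function name `max_activations` and the `reserve_fraction` parameter (the share of the budget kept free for moving tensors around) are hypothetical, and the 50% reserve used below is only an illustrative choice.

```python
def max_activations(
    budget_gb: float,
    reserve_fraction: float = 0.0,
    values_per_vector: int = 2048,
    bytes_per_value: int = 2,  # float16
) -> int:
    """Number of whole activation vectors that fit in the usable budget."""
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    usable_bytes = int(budget_gb * 10**9 * (1.0 - reserve_fraction))
    return usable_bytes // (values_per_vector * bytes_per_value)


# 50 GB of RAM with half held back as headroom (assumed reserve).
ram_capacity = max_activations(50, reserve_fraction=0.5)
# 300 GB of disk with nothing held back.
disk_capacity = max_activations(300)

print(f"RAM:  {ram_capacity / 10**6:.1f}M activations")
print(f"Disk: {disk_capacity / 10**6:.1f}M activations")
```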

Note that replenishing a buffer of cached activations once it is half used during training seems
like a lot of pain for what is likely a marginal improvement. In particular, if we also shuffle the
prompts before the forward pass of the source model, the chance of two tokens coming from the
same or nearby prompts is very small.
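
A rough way to put a number on that chance is to ask how likely it is that two activations drawn at random from the buffer share a prompt. The sketch below assumes uniform sampling without replacement and a buffer made of whole prompts that each contribute the same number of tokens. The function name is hypothetical, and the example values (1,024 tokens per prompt, a 10M-activation buffer) are taken from the sample batch and the RAM estimate above.

```python
def same_prompt_pair_probability(n_buffered: int, tokens_per_prompt: int) -> float:
    """Chance that two distinct activations drawn uniformly share a prompt.

    Having picked one activation, tokens_per_prompt - 1 of the remaining
    n_buffered - 1 activations come from the same prompt.
    """
    if n_buffered < 2:
        raise ValueError("need at least two buffered activations")
    if not 1 <= tokens_per_prompt <= n_buffered:
        raise ValueError("tokens_per_prompt must be between 1 and n_buffered")
    return (tokens_per_prompt - 1) / (n_buffered - 1)


pair_probability = same_prompt_pair_probability(
    n_buffered=10_000_000, tokens_per_prompt=1024
)
print(f"{pair_probability:.6f}")
```

This is a per-pair figure only. It does not account for nearby prompts, nor for how many pairs a full training batch contains.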

The conclusion is therefore that we do need some sort of buffer, as we can't easily store the full
set of activations on disk (c. 33 TB by the estimate above), and this buffer can be on disk or in
RAM. Writes to it need to be asynchronous (so they don't block the forward pass), and it needs to
handle multiple simultaneous writes, e.g. from distributed GPUs.
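
The two requirements above (asynchronous storage and multiple simultaneous writers) can be sketched with the standard library alone. This is an illustration under loud assumptions, not the library's implementation: the class name `ActivationBuffer` is hypothetical, plain Python lists stand in for activation tensors, threads stand in for GPUs, and an in-memory list stands in for the real disk or RAM store.

```python
import queue
import threading
from typing import List, Optional


class ActivationBuffer:
    """Hypothetical buffer: producers enqueue, one background thread stores."""

    def __init__(self, max_pending: int = 64) -> None:
        # Bounded queue: producers only wait if the writer falls far behind.
        self._pending: queue.Queue = queue.Queue(maxsize=max_pending)
        self._stored: List[List[float]] = []
        self._lock = threading.Lock()
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def append(self, batch: List[float]) -> None:
        """Hand a batch to the writer thread; safe to call from many threads."""
        self._pending.put(batch)

    def _drain(self) -> None:
        """Move batches from the queue into storage until the sentinel arrives."""
        while True:
            batch: Optional[List[float]] = self._pending.get()
            if batch is None:  # sentinel pushed by close()
                break
            with self._lock:
                self._stored.append(batch)

    def close(self) -> None:
        """Flush everything queued so far, then stop the writer thread."""
        self._pending.put(None)
        self._writer.join()

    def __len__(self) -> int:
        with self._lock:
            return len(self._stored)


def produce(buffer: ActivationBuffer, worker_id: int, n_batches: int) -> None:
    """Stand-in for one GPU's forward passes (assumed, for illustration)."""
    for step in range(n_batches):
        buffer.append([float(worker_id), float(step)])


buffer = ActivationBuffer(max_pending=8)
producers = [
    threading.Thread(target=produce, args=(buffer, worker_id, 100))
    for worker_id in range(4)
]
for thread in producers:
    thread.start()
for thread in producers:
    thread.join()
buffer.close()
print(len(buffer))
```

The bounded queue is a deliberate trade-off in this sketch: it keeps memory use predictable, at the cost of making producers wait if the writer cannot keep up.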

## Dataset Fetching

## Getting Activations (Forward Pass)

## Activations Buffer

## Learning