## Prepare metadata for a model
We combine important metadata about the model from several sources. We want:

- id: unique identifier made up of {repo}_{topk}_{expansion}
- repo: HF repository
- topk: top k latents used
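- expansion: expansion factor, i.e. how many times wider the SAE is than its input (num_latents = d_in × expansion, e.g. 768 × 32 = 24576)
- source_model: HF id of the source model e.g. nomic-ai/nomic-embed-text-v1.5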
- d_in: input dimensions of the embeddings for the SAE: e.g. 768
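
The relationship between these fields can be sketched in a few lines. This is only an illustration, not the notebook's code: `build_meta` is a hypothetical helper, and it assumes the number of latents equals `d_in * expansion`, which matches the values printed further down (768 × 32 = 24576).

```python
def build_meta(repo, topk, expansion, d_in, source_model):
    """Assemble the metadata dict from its parts (hypothetical helper)."""
    sae_id = f"{topk}_{expansion}"
    # Assumption: the SAE width is the input width times the expansion factor.
    num_latents = d_in * expansion
    return {
        "id": f"{repo}_{sae_id}",
        "repo": repo,
        "topk": topk,
        "expansion": expansion,
        "d_in": d_in,
        "num_latents": num_latents,
        "source_model": source_model,
    }


meta_example = build_meta(
    repo="enjalot/sae-nomic-text-v1.5-FineWeb-edu-100BT",
    topk=64,
    expansion=32,
    d_in=768,
    source_model="nomic-ai/nomic-embed-text-v1.5",
)
```

In the notebook itself `d_in` and `num_latents` are read from the loaded model rather than computed, which is the safer choice if a checkpoint ever deviates from this formula.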

We also write a parquet file for the features with the following columns:

- feature: feature id (an index into the SAE decoder weights)
- x: UMAP position x
- y: UMAP position y
- label: feature label
- max_activation: max activation of the feature (seen in the top examples)
- order: an index into an ordering of the features determined by a 1D UMAP

We also prepare the top 10 samples for each feature. These are split into smaller parquet files for dynamic loading in the browser, with the following columns:

- text: text of the sample
- feature: feature id (an index into the SAE decoder weights)
- activation: activation of the feature
- top_indices: indices of the topk features activated on this sample
- top_acts: activation values of the topk features activated
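
`top_indices` and `top_acts` are parallel lists: position `i` in one corresponds to position `i` in the other. A minimal sketch of reading a feature's activation out of one sample row follows. `activation_of` is a hypothetical helper, and the values are the first two entries of the first sample row printed later in this notebook (the indices are stored as floats there, hence the `int` cast).

```python
def activation_of(feature, top_indices, top_acts):
    """Return the activation of `feature` on a sample, or 0.0 if it is not in the top k."""
    for idx, act in zip(top_indices, top_acts):
        if int(idx) == feature:
            return act
    return 0.0


# First two entries of the first sample for feature 0
top_indices = [2006.0, 0.0]
top_acts = [0.2783777713775635, 0.2591576874256134]

act_feature_0 = activation_of(0, top_indices, top_acts)
```

Here the sample's `activation` column (0.259158) is the value found at feature 0's position in `top_acts`, even though feature 2006 fires more strongly on the same text.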

In [18]:
import pandas as pd
import numpy as np
import os
import json
import tqdm
from latentsae.sae import Sae

### Metadata

In [19]:
topk = 64
expansion = 32
repo = "enjalot/sae-nomic-text-v1.5-FineWeb-edu-100BT"
sae_id = f"{topk}_{expansion}"

In [20]:
model = Sae.load_from_hub(repo, sae_id)

Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 41943.04it/s]


In [21]:
name = f"NOMIC_FWEDU_{round(model.num_latents/1000)}k"

In [22]:
meta = {
  "id": f"{repo}_{sae_id}",
  "name": name,
  "repo": repo,
  "topk": topk,
  "expansion": expansion,
  "d_in": model.d_in,
  "num_latents": model.num_latents,
  "source_model": "nomic-ai/nomic-embed-text-v1.5"
}

In [36]:
meta

{'id': 'enjalot/sae-nomic-text-v1.5-FineWeb-edu-100BT_64_32',
 'name': 'NOMIC_FWEDU_25k',
 'repo': 'enjalot/sae-nomic-text-v1.5-FineWeb-edu-100BT',
 'topk': 64,
 'expansion': 32,
 'd_in': 768,
 'num_latents': 24576,
 'source_model': 'nomic-ai/nomic-embed-text-v1.5'}

In [38]:
os.makedirs(f"../models/{name}", exist_ok=True)

with open(f"../models/{name}/metadata.json", "w") as f:
  json.dump(meta, f)

## Features parquet

UMAP coordinates are prepared in the notebook `umap-decoder.ipynb`.  
Labels are generated in `make-labels.ipynb`.  
Samples are downloaded from Modal.

In [66]:
umap_df = pd.read_parquet(f"data/umap-{name}.parquet")
umap_top10_df = pd.read_parquet(f"data/umap-top10-{name}.parquet")
umap_df.head()

Unnamed: 0,feature,x,y
0,0,-0.303497,0.625081
1,1,-0.742769,-0.499167
2,2,-0.374569,-0.63296
3,3,-0.310232,-0.219518
4,4,-0.400855,-0.180536


In [67]:
top10_df = pd.read_parquet(f"data/top10-{name}.parquet")
top10_df.head()

Unnamed: 0,chunk_index,chunk_text,chunk_token_count,id,url,score,dump,embedding,__index_level_0__,top_acts,top_indices,feature,activation
0,0,- simple past tense and past participle of gel...,76,<urn:uuid:ade93a67-14e4-4835-8fb6-7668fe8004db>,http://www.yourdictionary.com/gelled,2.875,CC-MAIN-2015-18,"[0.055502143, 0.09088759, -0.1564934, -0.04632...",229004,"[0.2783777713775635, 0.2591576874256134, 0.113...","[2006.0, 0.0, 793.0, 3718.0, 5826.0, 10367.0, ...",0,0.259158
1,0,"Also found in: Thesaurus, Medical, Acronyms, E...",496,<urn:uuid:ae2f5cf3-6392-4160-a4ee-5b36b3a97e65>,https://www.thefreedictionary.com/gelatin,3.25,CC-MAIN-2019-51,"[0.041935317, 0.10650849, -0.1574557, -0.04301...",81064,"[0.2565605938434601, 0.22474993765354156, 0.13...","[0.0, 2006.0, 5826.0, 793.0, 5945.0, 21665.0, ...",0,0.256561
2,0,"A gel (from the lat. gelu—freezing, cold, ice ...",172,<urn:uuid:c33df90e-c479-40e2-aa9b-ac6a4f0e47d9>,http://medicalxpress.com/tags/hydrogel/sort/li...,3.28125,CC-MAIN-2015-35,"[0.029497456, 0.109683335, -0.18576318, -0.034...",180126,"[0.24542075395584106, 0.09371058642864227, 0.0...","[0.0, 2006.0, 10367.0, 7655.0, 16380.0, 15780....",0,0.245421
3,7,extended periods of time can damage gelatin an...,500,<urn:uuid:16add753-dbb9-44e2-aad0-b9909c16cff5>,http://mlaiskonis.com/2014/06/07/gelatin/,2.578125,CC-MAIN-2014-41,"[0.0612519, 0.09704458, -0.17871378, -0.017184...",147957,"[0.24373508989810944, 0.15119953453540802, 0.1...","[0.0, 5826.0, 21665.0, 12673.0, 16225.0, 4502....",0,0.243735
4,1,##loid gel. gelatin forms a solution of high v...,251,<urn:uuid:acd089d5-9546-4c25-8142-ab2a5e3204cc>,http://www.foodfacts.com/ci/ingredientsoverlay...,3.25,CC-MAIN-2016-36,"[0.060598753, 0.09118352, -0.18788467, -0.0564...",124253,"[0.2427240014076233, 0.1294490396976471, 0.087...","[0.0, 5826.0, 13791.0, 7655.0, 11963.0, 15780....",0,0.242724


In [68]:
max_activation_per_feature = top10_df.groupby('feature')['activation'].max().reset_index()


In [69]:
max_activation_per_feature

Unnamed: 0,feature,activation
0,0,0.259158
1,1,0.125571
2,2,0.041077
3,3,0.221894
4,4,0.031271
...,...,...
24571,24571,0.037978
24572,24572,0.151966
24573,24573,0.077753
24574,24574,0.224355


In [70]:
feature_df = max_activation_per_feature.copy()
feature_df.rename(columns={'activation': 'max_activation'}, inplace=True)


In [71]:
feature_df["x"] = umap_df["x"]
feature_df["y"] = umap_df["y"]
feature_df["top10_x"] = umap_top10_df["x"]
feature_df["top10_y"] = umap_top10_df["y"]
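
The column assignments above align rows by index, so they rely on row `i` of every frame describing feature `i`. If some feature had no rows in `top10_df`, the `groupby` result would skip it and the coordinates would shift onto the wrong features. A more defensive sketch, using toy stand-in frames rather than the notebook's data, joins on the `feature` key instead:

```python
import pandas as pd

# Toy stand-ins for the notebook's frames; feature 1 has no samples,
# so it is missing from the groupby result.
max_act = pd.DataFrame({"feature": [0, 2], "max_activation": [0.26, 0.04]})
umap = pd.DataFrame(
    {"feature": [0, 1, 2], "x": [-0.30, -0.74, -0.37], "y": [0.63, -0.50, -0.63]}
)

# Start from the full feature list so every feature keeps its own coordinates,
# then attach max_activation by key; features without samples get NaN.
joined = umap.merge(max_act, on="feature", how="left")
```

Whether this matters here depends on whether every one of the 24576 features has at least one sample, which the truncated output above does not show.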


In [72]:
label_df = pd.read_parquet(f"data/labels-{name}.parquet")

In [73]:
label_df.head()

Unnamed: 0,feature,label
0,0,properties uses and characteristics of gelatin
1,1,Heathcliff's obsessive love and revenge themes.
2,2,military aviation and emergency response strat...
3,3,presidential veto and legislative process
4,4,Early American political figures and women's r...


In [74]:
len(label_df)

24576

In [75]:
feature_df["label"] = label_df["label"].str.replace("FINAL: ", "", regex=False).str.strip()

In [80]:
# order_df = pd.read_parquet(f"data/1d-order-{name}.parquet")
order_df = pd.read_parquet(f"data/1d-order-{name}-top10.parquet")
# Scale the "order" column from 0 to 1
order = order_df["order"].values
order_min = order.min()
order_max = order.max()
order = (order - order_min) / (order_max - order_min)


feature_df["order"] = order
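
The scaling above is plain min-max normalization. As a small sketch (the `min_max_scale` helper is hypothetical, and the guard for a constant input is an addition the notebook does not need for real data):

```python
import numpy as np


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant array maps to all zeros."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # Avoid dividing by zero when every value is identical.
        return np.zeros_like(values)
    return (values - values.min()) / span


scaled = min_max_scale([3, 7, 5])
```

With the input `[3, 7, 5]` the minimum maps to 0, the maximum to 1, and 5 lands halfway between them.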

In [81]:
# feature_df = pd.read_parquet(f"../web/public/models/{name}/features.parquet")

In [82]:
feature_df.head()

Unnamed: 0,feature,max_activation,x,y,top10_x,top10_y,label,order
0,0,0.259158,-0.303497,0.625081,0.065658,0.372876,properties uses and characteristics of gelatin,0.589799
1,1,0.125571,-0.742769,-0.499167,0.475987,-0.685467,Heathcliff's obsessive love and revenge themes.,0.788589
2,2,0.041077,-0.374569,-0.63296,0.007194,-0.277979,military aviation and emergency response strat...,0.450889
3,3,0.221894,-0.310232,-0.219518,0.134838,-0.860194,presidential veto and legislative process,0.535589
4,4,0.031271,-0.400855,-0.180536,0.437017,-0.489262,Early American political figures and women's r...,0.754275


In [83]:
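# Make sure the output directory exists before writing
os.makedirs(f"../web/public/models/{name}", exist_ok=True)
feature_df.to_parquet(f"../web/public/models/{name}/features.parquet")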

## Sample chunks

We chunk our top10 samples into smaller files so we can load them dynamically in the browser.

In [97]:
t10r_df = top10_df[['id', 'chunk_text', 'url', 'feature', 'activation', 'top_acts', 'top_indices']].copy()
t10r_df.rename(columns={'chunk_text': 'text'}, inplace=True)


In [98]:
t10r_df.head()

Unnamed: 0,id,text,url,feature,activation,top_acts,top_indices
0,<urn:uuid:ade93a67-14e4-4835-8fb6-7668fe8004db>,- simple past tense and past participle of gel...,http://www.yourdictionary.com/gelled,0,0.259158,"[0.2783777713775635, 0.2591576874256134, 0.113...","[2006.0, 0.0, 793.0, 3718.0, 5826.0, 10367.0, ..."
1,<urn:uuid:ae2f5cf3-6392-4160-a4ee-5b36b3a97e65>,"Also found in: Thesaurus, Medical, Acronyms, E...",https://www.thefreedictionary.com/gelatin,0,0.256561,"[0.2565605938434601, 0.22474993765354156, 0.13...","[0.0, 2006.0, 5826.0, 793.0, 5945.0, 21665.0, ..."
2,<urn:uuid:c33df90e-c479-40e2-aa9b-ac6a4f0e47d9>,"A gel (from the lat. gelu—freezing, cold, ice ...",http://medicalxpress.com/tags/hydrogel/sort/li...,0,0.245421,"[0.24542075395584106, 0.09371058642864227, 0.0...","[0.0, 2006.0, 10367.0, 7655.0, 16380.0, 15780...."
3,<urn:uuid:16add753-dbb9-44e2-aad0-b9909c16cff5>,extended periods of time can damage gelatin an...,http://mlaiskonis.com/2014/06/07/gelatin/,0,0.243735,"[0.24373508989810944, 0.15119953453540802, 0.1...","[0.0, 5826.0, 21665.0, 12673.0, 16225.0, 4502...."
4,<urn:uuid:acd089d5-9546-4c25-8142-ab2a5e3204cc>,##loid gel. gelatin forms a solution of high v...,http://www.foodfacts.com/ci/ingredientsoverlay...,0,0.242724,"[0.2427240014076233, 0.1294490396976471, 0.087...","[0.0, 5826.0, 13791.0, 7655.0, 11963.0, 15780...."


In [99]:
num_features_per_file = 100
num_files = (len(feature_df) + num_features_per_file - 1) // num_features_per_file  # Calculate number of files needed
print(num_files)



246


In [101]:
# Sort by the 1D UMAP order so that features adjacent in the ordering share a file
sorted_feature_df = feature_df.sort_values(by='order')
os.makedirs(f"../web/public/models/{name}/samples", exist_ok=True)

for i in tqdm.tqdm(range(0, len(sorted_feature_df), num_features_per_file)):
    chunk = sorted_feature_df.iloc[i:i + num_features_per_file]
    # Keep only the samples that belong to the features in this chunk
    samples_df = t10r_df[t10r_df['feature'].isin(chunk['feature'])]
    samples_df.to_parquet(f"../web/public/models/{name}/samples/chunk_{i // num_features_per_file}.parquet")

100%|██████████| 246/246 [00:04<00:00, 55.69it/s]
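
The chunking logic can be summarized without pandas: a feature's chunk is its rank in the sorted order, integer-divided by the chunk size, and the number of files is the ceiling of features over chunk size. The helpers below are hypothetical and use a toy ordering.

```python
def assign_chunks(features_in_order, per_file=100):
    """Map each feature id to a chunk index based on its rank in the ordering."""
    return {int(f): rank // per_file for rank, f in enumerate(features_in_order)}


def num_chunks(num_features, per_file=100):
    """Ceiling division: number of files needed to hold all features."""
    return (num_features + per_file - 1) // per_file


# Toy ordering of five feature ids, two features per file
mapping = assign_chunks([7, 3, 9, 1, 4], per_file=2)
```

For the real model, `num_chunks(24576)` gives the 246 files reported above; the last file holds the remaining 76 features.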


In [102]:
# Map each feature id to the index of the chunk file that holds its samples
chunk_mapping = {}
for i, feature in enumerate(sorted_feature_df['feature'].values):
  chunk_mapping[int(feature)] = i // num_features_per_file
with open(f"../web/public/models/{name}/chunk_mapping.json", "w") as f:
  json.dump(chunk_mapping, f)
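
A consumer of `chunk_mapping.json` would look up a feature's chunk and then fetch the matching parquet file. One detail worth noting is that JSON object keys are always strings, so the integer keys written above come back as strings. The sketch below round-trips a toy mapping in memory; `sample_path` and the relative path it returns are assumptions for illustration, not the web app's actual code.

```python
import json

# Toy mapping with integer keys, as built in the notebook
chunk_mapping_example = {0: 1, 5: 0}

# json.dumps converts the integer keys to strings
loaded = json.loads(json.dumps(chunk_mapping_example))


def sample_path(feature_id, mapping, name="NOMIC_FWEDU_25k"):
    """Return the (assumed) relative path of the chunk file for a feature."""
    chunk = mapping[str(feature_id)]  # keys are strings after the JSON round trip
    return f"models/{name}/samples/chunk_{chunk}.parquet"


path_for_5 = sample_path(5, loaded)
```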
