## Prepare metadata for a model
We combine important metadata about the model from various sources. We want:

- id: unique identifier made up of {repo}_{topk}_{expansion}
- repo: HF repository
- topk: top k latents used
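- expansion: expansion factor, i.e. the number of SAE latents as a multiple of d_in (e.g. 32, giving 768 × 32 = 24576 latents)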
- source_model: HF id of the source model, e.g. nomic-ai/nomic-embed-text-v1.5
- d_in: input dimensions of the embeddings for the SAE: e.g. 768
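
As a quick sanity check on how these fields relate, here is a minimal sketch in plain Python. It assumes the expansion factor is the ratio of SAE latents to input dimensions, which matches the values reported later in this notebook (768 × 32 = 24576). `build_meta` is a hypothetical helper for illustration and is not part of `latentsae`.

```python
def build_meta(repo: str, topk: int, expansion: int, d_in: int, source_model: str) -> dict:
    """Hypothetical helper: assemble the metadata fields described above."""
    return {
        "id": f"{repo}_{topk}_{expansion}",
        "repo": repo,
        "topk": topk,
        "expansion": expansion,
        "source_model": source_model,
        "d_in": d_in,
        # Assumption: the SAE has d_in * expansion latents.
        "num_latents": d_in * expansion,
    }


meta_example = build_meta(
    repo="enjalot/sae-nomic-text-v1.5-FineWeb-edu-100BT",
    topk=64,
    expansion=32,
    d_in=768,
    source_model="nomic-ai/nomic-embed-text-v1.5",
)
print(meta_example["id"])
print(meta_example["num_latents"])
```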

We also build a parquet file for the features with the following columns:

- feature: feature id (an index into the SAE decoder weights)
- x: UMAP position x
- y: UMAP position y
- label: feature label
- max_activation: max activation of the feature (seen in the top examples)
- order: position of the feature in an ordering determined by a 1D UMAP, min-max scaled to the range 0 to 1
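
The `order` values are rescaled to a fixed range before being stored. A minimal sketch of min-max scaling in plain Python follows. `min_max_scale` is a hypothetical helper, and the guard for a constant input is an assumption added for illustration; the notebook itself does the scaling with NumPy.

```python
def min_max_scale(values):
    """Scale a sequence of numbers linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: map a constant sequence to zeros rather than divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([3, 0, 12, 6])
print(scaled)  # [0.25, 0.0, 1.0, 0.5]
```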

We also prepare the top 10 samples for each feature. These are split into smaller parquet files for dynamic loading in the browser. Each sample has:

- text: text of the sample
- feature: feature id (an index into the SAE decoder weights)
- activation: activation of the feature
- top_indices: indices of the topk features activated on this sample
- top_acts: activation values of the topk features activated
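
`top_indices` and `top_acts` are parallel arrays: position i in one corresponds to position i in the other. In the sample data shown later in this notebook the indices appear as floats (e.g. `19961.0`), so a consumer may want to cast them back to integers. A minimal sketch, using a hypothetical helper name and only the first two values of one sample row:

```python
def active_features(top_indices, top_acts):
    """Pair each active feature id with its activation, strongest first."""
    pairs = [(int(idx), act) for idx, act in zip(top_indices, top_acts)]
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# Illustrative row: just the first two entries of each array.
row = {
    "top_indices": [0.0, 19961.0],
    "top_acts": [0.29729074239730835, 0.23520702123641968],
}
pairs = active_features(row["top_indices"], row["top_acts"])
print(pairs[0])
```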

In [103]:
import pandas as pd
import numpy as np
import os
import json
import tqdm
from latentsae.sae import Sae

Triton not installed, using eager implementation of SAE decoder.


### Metadata

In [104]:
topk = 64
expansion = 32
repo = "enjalot/sae-nomic-text-v1.5-FineWeb-edu-100BT"
sae_id = f"{topk}_{expansion}"

In [105]:
model = Sae.load_from_hub(repo, sae_id)

Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 52428.80it/s]
Dropping extra args {'signed': False}


In [106]:
name = f"NOMIC_FWEDU_{round(model.num_latents/1000)}k"

In [107]:
meta = {
  "id": f"{repo}_{sae_id}",
  "name": name,
  "repo": repo,
  "topk": topk,
  "expansion": expansion,
  "d_in": model.d_in,
  "num_latents": model.num_latents,
  "source_model": "nomic-ai/nomic-embed-text-v1.5"
}

In [108]:
meta

{'id': 'enjalot/sae-nomic-text-v1.5-FineWeb-edu-100BT_64_32',
 'name': 'NOMIC_FWEDU_25k',
 'repo': 'enjalot/sae-nomic-text-v1.5-FineWeb-edu-100BT',
 'topk': 64,
 'expansion': 32,
 'd_in': 768,
 'num_latents': 24576,
 'source_model': 'nomic-ai/nomic-embed-text-v1.5'}

In [109]:
# Create the output directory if it does not exist yet
os.makedirs(f"../models/{name}", exist_ok=True)

with open(f"../models/{name}/metadata.json", "w") as f:
  json.dump(meta, f)

## Features parquet

UMAP coordinates are prepared in the notebook `umap-decoder.ipynb`.   
Labels are generated in `make-labels.ipynb`.  
Samples are downloaded from Modal.

In [110]:
umap_df = pd.read_parquet(f"data/umap-{name}.parquet")
umap_top10_df = pd.read_parquet(f"data/umap-top10-{name}.parquet")
umap_df.head()

Unnamed: 0,x,y
0,-0.303497,0.625081
1,-0.742769,-0.499167
2,-0.374569,-0.63296
3,-0.310232,-0.219518
4,-0.400855,-0.180536


In [113]:
top10_df = pd.read_parquet(f"data/top10_{sae_id}.parquet")
top10_df.head()

Unnamed: 0,chunk_index,chunk_text,chunk_token_count,id,url,score,dump,embedding,top_acts,top_indices,feature,activation
0,0,"2019 Study Abstract\nGenomic imprinting, the m...",191,<urn:uuid:d45d32f3-aee4-464b-a7a7-4659ca6f95a5>,https://desdaughter.com/2019/01/21/genomic-imp...,2.859375,CC-MAIN-2019-18,"[0.070664756, 0.04058804, -0.1678945, -0.04720...","[0.29729074239730835, 0.23520702123641968, 0.1...","[0.0, 19961.0, 19487.0, 3596.0, 9132.0, 16563....",0,0.297291
4973,1,biological function and regulation of imprinte...,500,<urn:uuid:6492a5df-795c-4afc-8b96-adda43d374fe>,http://www.biomedcentral.com/1471-2164/10/144,2.640625,CC-MAIN-2015-11,"[0.08591175, 0.05282476, -0.1623837, -0.038760...","[0.2801821529865265, 0.21019425988197327, 0.11...","[0.0, 19961.0, 12474.0, 18618.0, 5676.0, 18178...",0,0.280182
2426,0,Molecular imprinting is a technique used to cr...,277,<urn:uuid:c4cec9f7-a221-4bac-8872-ad938bbe3b9c>,https://www.advancedsciencenews.com/new-trends...,2.796875,CC-MAIN-2021-04,"[0.06389447, 0.054419804, -0.1866366, -0.05261...","[0.27958399057388306, 0.15807494521141052, 0.1...","[0.0, 21919.0, 18672.0, 3614.0, 13226.0, 15727...",0,0.279584
44507,0,[CLS] imprinting / do not go where the path ma...,500,<urn:uuid:fd9748b7-ad11-4d51-b7d9-b5681c579e36>,https://www.windermeresun.com/2017/08/05/impri...,3.21875,CC-MAIN-2023-40,"[0.045871254, 0.0841982, -0.20583852, -0.07991...","[0.27171218395233154, 0.24269740283489227, 0.1...","[0.0, 6864.0, 8104.0, 3020.0, 15020.0, 8543.0,...",0,0.271712
7496,0,There have been a number of recent insights in...,212,<urn:uuid:40c30498-bed6-4b01-a37e-a2a1b70d80fd>,https://pure.ulster.ac.uk/en/publications/impr...,3.078125,CC-MAIN-2021-10,"[0.06898261, 0.04841869, -0.16367012, -0.05253...","[0.2717033922672272, 0.2493799477815628, 0.117...","[0.0, 19961.0, 21919.0, 14900.0, 22498.0, 1514...",0,0.271703


In [114]:
max_activation_per_feature = top10_df.groupby('feature')['activation'].max().reset_index()


In [115]:
max_activation_per_feature

Unnamed: 0,feature,activation
0,0,0.297291
1,1,0.166444
2,2,0.240715
3,3,0.273029
4,4,0.301342
...,...,...
24571,24571,0.276034
24572,24572,0.273196
24573,24573,0.259888
24574,24574,0.197020


In [116]:
feature_df = max_activation_per_feature.copy()
feature_df.rename(columns={'activation': 'max_activation'}, inplace=True)


In [117]:
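# Column assignment aligns on the index, so this assumes row i of each UMAP frame belongs to feature i
feature_df["x"] = umap_df["x"]
feature_df["y"] = umap_df["y"]
feature_df["top10_x"] = umap_top10_df["x"]
feature_df["top10_y"] = umap_top10_df["y"]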


In [125]:
label_df = pd.read_parquet(f"data/labels-{name}.parquet")

In [126]:
label_df.head()

Unnamed: 0,feature,label
0,0,FINAL: genomic and molecular imprinting concepts
1,1,FINAL: Hildegard of Bingen's spiritual and cre...
2,2,FINAL: Fairchild Semiconductor and Silicon Val...
3,3,FINAL: socialism political ideology economic e...
4,4,FINAL: vowel sound definitions and teaching me...


In [127]:
len(label_df)

24576

In [133]:
feature_df.set_index('feature', inplace=True)
label_df.set_index('feature', inplace=True)

In [134]:
feature_df["label"] = label_df["label"].str.replace("FINAL: ", "", regex=False).str.strip()

In [143]:
feature_df.reset_index(inplace=True)

In [135]:
# order_df = pd.read_parquet(f"data/1d-order-{name}.parquet")
order_df = pd.read_parquet(f"data/1d-order-{name}-top10.parquet")
# Min-max scale the "order" column to the range 0 to 1
order = order_df["order"].values
order_min = order.min()
order_max = order.max()
order = (order - order_min) / (order_max - order_min)


feature_df["order"] = order

In [136]:
# feature_df = pd.read_parquet(f"../web/public/models/{name}/features.parquet")

In [144]:
feature_df.head()

Unnamed: 0,feature,max_activation,x,y,top10_x,top10_y,label,order
0,0,0.297291,-0.303497,0.625081,0.515561,-0.068958,genomic and molecular imprinting concepts,0.835001
1,1,0.166444,-0.742769,-0.499167,-0.758344,0.305733,Hildegard of Bingen's spiritual and creative l...,0.665455
2,2,0.240715,-0.374569,-0.63296,-0.179834,0.175429,Fairchild Semiconductor and Silicon Valley legacy,0.583131
3,3,0.273029,-0.310232,-0.219518,-0.729569,-0.154941,socialism political ideology economic equality,0.346067
4,4,0.301342,-0.400855,-0.180536,-0.344582,-0.491325,vowel sound definitions and teaching methods,0.437009


In [145]:
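os.makedirs(f"../web/public/models/{name}", exist_ok=True)  # ensure the output directory exists before writing
feature_df.to_parquet(f"../web/public/models/{name}/features.parquet")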

## Sample chunks

We split the top 10 samples into smaller files, each covering 100 features that are adjacent in the 1D UMAP order, so the browser can load them on demand.

In [146]:
t10r_df = top10_df[['id', 'chunk_text', 'url', 'feature', 'activation', 'top_acts', 'top_indices']].copy()
t10r_df.rename(columns={'chunk_text': 'text'}, inplace=True)


In [147]:
t10r_df.head()

Unnamed: 0,id,text,url,feature,activation,top_acts,top_indices
0,<urn:uuid:d45d32f3-aee4-464b-a7a7-4659ca6f95a5>,"2019 Study Abstract\nGenomic imprinting, the m...",https://desdaughter.com/2019/01/21/genomic-imp...,0,0.297291,"[0.29729074239730835, 0.23520702123641968, 0.1...","[0.0, 19961.0, 19487.0, 3596.0, 9132.0, 16563...."
4973,<urn:uuid:6492a5df-795c-4afc-8b96-adda43d374fe>,biological function and regulation of imprinte...,http://www.biomedcentral.com/1471-2164/10/144,0,0.280182,"[0.2801821529865265, 0.21019425988197327, 0.11...","[0.0, 19961.0, 12474.0, 18618.0, 5676.0, 18178..."
2426,<urn:uuid:c4cec9f7-a221-4bac-8872-ad938bbe3b9c>,Molecular imprinting is a technique used to cr...,https://www.advancedsciencenews.com/new-trends...,0,0.279584,"[0.27958399057388306, 0.15807494521141052, 0.1...","[0.0, 21919.0, 18672.0, 3614.0, 13226.0, 15727..."
44507,<urn:uuid:fd9748b7-ad11-4d51-b7d9-b5681c579e36>,[CLS] imprinting / do not go where the path ma...,https://www.windermeresun.com/2017/08/05/impri...,0,0.271712,"[0.27171218395233154, 0.24269740283489227, 0.1...","[0.0, 6864.0, 8104.0, 3020.0, 15020.0, 8543.0,..."
7496,<urn:uuid:40c30498-bed6-4b01-a37e-a2a1b70d80fd>,There have been a number of recent insights in...,https://pure.ulster.ac.uk/en/publications/impr...,0,0.271703,"[0.2717033922672272, 0.2493799477815628, 0.117...","[0.0, 19961.0, 21919.0, 14900.0, 22498.0, 1514..."


In [148]:
num_features_per_file = 100
num_files = (len(feature_df) + num_features_per_file - 1) // num_features_per_file  # Calculate number of files needed
print(num_files)



246


In [149]:
sorted_feature_df = feature_df.sort_values(by='order')
os.makedirs(f"../web/public/models/{name}/samples", exist_ok=True)

# Write one parquet file per group of num_features_per_file features, taken in 1D UMAP order
for i in tqdm.tqdm(range(0, len(sorted_feature_df), num_features_per_file)):
    chunk = sorted_feature_df.iloc[i:i + num_features_per_file]
    samples_df = t10r_df[t10r_df['feature'].isin(chunk['feature'])]
    samples_df.to_parquet(f"../web/public/models/{name}/samples/chunk_{i // num_features_per_file}.parquet")

100%|██████████| 246/246 [00:04<00:00, 54.94it/s]


In [150]:
# Map each feature id to the index of the chunk file that holds its samples
chunk_mapping = {}
for i, feature in enumerate(sorted_feature_df['feature'].values):
  chunk_mapping[int(feature)] = i // num_features_per_file
with open(f"../web/public/models/{name}/chunk_mapping.json", "w") as f:
  json.dump(chunk_mapping, f)
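
To illustrate how a client might use these files, the sketch below rebuilds a tiny chunk mapping, round-trips it through JSON, and resolves a feature id to its chunk file. JSON object keys are always strings, so the integer keys written above come back as strings when the file is read. The function names, the `samples` base path, and the setting of 3 features per file are illustrative assumptions, not part of the notebook.

```python
import json


def build_chunk_mapping(features_in_order, per_file):
    """Map each feature id to the index of the chunk file holding its samples."""
    return {int(f): i // per_file for i, f in enumerate(features_in_order)}


def chunk_path(mapping_from_json, feature, base="samples"):
    """Resolve a feature id to its chunk file; keys loaded from JSON are strings."""
    return f"{base}/chunk_{mapping_from_json[str(feature)]}.parquet"


features_in_order = [7, 2, 9, 4, 0, 5, 1]  # feature ids already sorted by "order"
per_file = 3
mapping = build_chunk_mapping(features_in_order, per_file)
num_files = (len(features_in_order) + per_file - 1) // per_file  # ceiling division

loaded = json.loads(json.dumps(mapping))  # int keys become strings here
print(num_files)
print(chunk_path(loaded, 4))
```

Looking up with `str(feature)` keeps the client independent of how the mapping was built, at the cost of one string conversion per lookup.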
