<a href="https://colab.research.google.com/github/anthonyq7/drm_pca_step/blob/main/drm_pca_step.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing dependencies

In [4]:
!pip install datasets transformers torch scikit-learn accelerate huggingface_hub tqdm

import torch
import numpy as np
import matplotlib.pyplot as plt
import tqdm
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification
from accelerate import Accelerator
from sklearn.decomposition import PCA
from datasets import load_dataset
from huggingface_hub import notebook_login



Load PKU-SafeRLHF Dataset

Notable dataset columns: prompt, response_0, response_1, better_response_id

In [5]:
dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF")
train_data = dataset["train"]

print(f"Dataset size: {len(train_data)}")
print(f"Columns: {train_data.column_names}")

Dataset size: 73907
Columns: ['prompt', 'response_0', 'response_1', 'prompt_source', 'response_0_source', 'response_1_source', 'is_response_0_safe', 'is_response_1_safe', 'response_0_harm_category', 'response_1_harm_category', 'response_0_severity_level', 'response_1_severity_level', 'better_response_id', 'safer_response_id', 'response_0_sha256', 'response_1_sha256']


Hugging Face Login

In [6]:
notebook_login()

Load model: Gemma-2B

In [7]:
model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, dtype=torch.bfloat16)

accelerator = Accelerator()
device = accelerator.device

model.to(device)
model.eval()

Some weights of GemmaForSequenceClassification were not initialized from the model checkpoint at google/gemma-2b and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GemmaForSequenceClassification(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): GemmaRMSNorm((2048,), eps=1e-06)
      )
    )
    (norm): GemmaRMS

Get embeddings

In [11]:
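def get_embeddings(text, batch_size=16):
    """Embed a list of strings; returns a (len(text), hidden_size) CPU tensor."""
    embeddings = []

    for i in tqdm.tqdm(range(0, len(text), batch_size)):
        batch_text = text[i:i+batch_size]

        # Pad or truncate every sequence to exactly 512 tokens.
        inputs = tokenizer(
            batch_text,
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=512
        ).to(device)

        with torch.no_grad():
            outputs = model(**inputs, output_hidden_states=True)
            # Final hidden states, shape (batch, 512, hidden_size).
            hidden_states = outputs.hidden_states[-1]
            # Unweighted mean over all 512 positions, padding positions included.
            pooled = hidden_states.mean(dim=1)

        # Move each batch to the CPU so GPU memory does not accumulate.
        embeddings.append(pooled.cpu())

    return torch.cat(embeddings, dim=0)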
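
Because every sequence is padded to 512 tokens, the plain mean over `dim=1` in get_embeddings also averages the hidden states at padding positions. One possible alternative, shown here only as a minimal sketch, is to weight the mean by the attention mask. The helper name `masked_mean_pool` and the toy tensors are assumptions for illustration, not part of the original notebook.

```python
import torch

def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average hidden states over real tokens only (hypothetical helper)."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension.
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)  # guard against rows that are all padding
    return summed / counts

# Toy check: batch of 1, 4 positions, hidden size 2; the last two positions are padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0, 0]])

masked = masked_mean_pool(hidden, mask)  # averages the first two positions only
plain = hidden.mean(dim=1)               # averages all four positions
```

Inside get_embeddings this would correspond to replacing `hidden_states.mean(dim=1)` with `masked_mean_pool(hidden_states, inputs["attention_mask"])`. Doing so changes the embeddings, so the PCA outputs recorded further down would no longer apply as printed.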

Extract and create embeddings from PKU-SafeRLHF

In [12]:
print(f"Processing dataset on {device}...")

# Build the chosen and rejected inputs as "prompt, blank line, response" strings.
texts_chosen = []
texts_rejected = []

for sample in train_data:

    # better_response_id is the index (0 or 1) of the preferred response.
    if sample["better_response_id"] == 0:
        texts_chosen.append(f"{sample['prompt']}\n\n{sample['response_0']}")
        texts_rejected.append(f"{sample['prompt']}\n\n{sample['response_1']}")
    else:
        texts_chosen.append(f"{sample['prompt']}\n\n{sample['response_1']}")
        texts_rejected.append(f"{sample['prompt']}\n\n{sample['response_0']}")

# Embed only the first 2,500 pairs.
embeddings_chosen = get_embeddings(text=texts_chosen[:2500])
embeddings_rejected = get_embeddings(text=texts_rejected[:2500])

Processing dataset on cuda...


100%|██████████| 157/157 [33:11<00:00, 12.69s/it]
100%|██████████| 157/157 [33:11<00:00, 12.68s/it]
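
The pairing rule in the loop above can be checked on a single record without loading the dataset. The sketch below assumes `better_response_id` is always 0 or 1; the helper `build_pair_texts` and the toy sample are made up for illustration.

```python
def build_pair_texts(sample: dict) -> tuple[str, str]:
    """Return (chosen_text, rejected_text) for one sample (hypothetical helper)."""
    better = sample["better_response_id"]  # assumed to be 0 or 1
    chosen = sample[f"response_{better}"]
    rejected = sample[f"response_{1 - better}"]
    prompt = sample["prompt"]
    return f"{prompt}\n\n{chosen}", f"{prompt}\n\n{rejected}"

# Made-up sample, for illustration only (not taken from PKU-SafeRLHF).
toy = {"prompt": "Q?", "response_0": "A0", "response_1": "A1", "better_response_id": 1}
chosen_text, rejected_text = build_pair_texts(toy)
```

Here `response_1` is marked as better, so it ends up in the chosen text and `response_0` in the rejected text.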


Computing embedding differences

z_i = φ(x, y_chosen) - φ(x, y_rejected)

In [15]:
embedding_differences = embeddings_chosen.to(torch.float32).numpy() - embeddings_rejected.to(torch.float32).numpy()
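
The cast to `torch.float32` before `.numpy()` matters because the model runs in bfloat16 and NumPy has no native bfloat16 dtype. A toy sketch of the same subtraction, with made-up values standing in for the pooled embeddings:

```python
import torch

# Toy stand-ins for the pooled embeddings: 2 pairs, hidden size 3, stored in bfloat16.
chosen = torch.tensor([[1.0, 2.0, 0.5], [4.0, 0.0, 1.0]], dtype=torch.bfloat16)
rejected = torch.tensor([[0.5, 2.0, 1.5], [1.0, 1.0, 1.0]], dtype=torch.bfloat16)

# Calling .numpy() directly on a bfloat16 tensor is expected to fail,
# which is why the tensors are upcast to float32 first.
try:
    chosen.numpy()
    direct_conversion_ok = True
except (TypeError, RuntimeError):
    direct_conversion_ok = False

# Row i is z_i = phi(x, y_chosen) - phi(x, y_rejected) for pair i.
z = chosen.to(torch.float32).numpy() - rejected.to(torch.float32).numpy()
```

The result is a float32 NumPy array with one row per preference pair, which is the layout scikit-learn expects.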

Zero-center the embedding differences for PCA and print the shape

In [16]:
embedding_differences = embedding_differences - embedding_differences.mean(axis=0)

print(f"Embedding differences shape: {embedding_differences.shape}")

Embedding differences shape: (2500, 2048)
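
scikit-learn's `PCA` centers each feature internally before the decomposition, so the explicit centering step is harmless but should not change the fitted components. The following is a small sketch on synthetic data (not the real embedding differences) that checks this, using `svd_solver="full"` to keep the comparison deterministic.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the difference matrix, with a deliberately non-zero mean.
Z = rng.normal(loc=5.0, scale=1.0, size=(200, 10))
Z_centered = Z - Z.mean(axis=0)

pca_raw = PCA(n_components=3, svd_solver="full").fit(Z)
pca_centered = PCA(n_components=3, svd_solver="full").fit(Z_centered)

# Compare variance ratios directly and directions up to sign.
same_ratios = np.allclose(pca_raw.explained_variance_ratio_,
                          pca_centered.explained_variance_ratio_)
same_axes = np.allclose(np.abs(pca_raw.components_),
                        np.abs(pca_centered.components_))
print(same_ratios, same_axes)
```

The only visible difference is the stored `mean_` attribute, which holds the column means for the raw input and values near zero for the pre-centered input.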


PCA to get orthogonal basis vectors

In [17]:
n_components = 100  # as in the paper
pca = PCA(n_components=n_components)
pca.fit(embedding_differences)

basis_vectors = pca.components_

print(f"Basis vectors shape: {basis_vectors.shape}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_[:10]}")

np.save("drm_basis_vectors.npy", basis_vectors)
np.save("pca_explained_variance.npy", pca.explained_variance_ratio_)

Basis vectors shape: (100, 2048)
Explained variance ratio: [0.9044079  0.01414219 0.00715477 0.00512298 0.00449972 0.00364158
 0.00323444 0.00285632 0.00266146 0.00219231]
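
To read the printed ratios more easily, it can help to accumulate them. The sketch below simply copies the ten values printed above and sums them; it does not re-run the PCA.

```python
import numpy as np

# First ten explained-variance ratios, copied from the printed output above.
ratios = np.array([
    0.9044079, 0.01414219, 0.00715477, 0.00512298, 0.00449972,
    0.00364158, 0.00323444, 0.00285632, 0.00266146, 0.00219231,
])

cumulative = np.cumsum(ratios)
print(f"PC1 alone: {cumulative[0]:.3f}")
print(f"PC1-PC10:  {cumulative[-1]:.3f}")
```

The first component accounts for about 90.4% of the variance in these differences, and the next nine together add roughly 4.5 percentage points, for about 95.0% in total.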
