In [1]:
import os
import pickle
import time
from pprint import pprint
from typing import Any, Dict, List

import datasets
import kscope
from datasets import Dataset
from tqdm import tqdm
from transformers import AutoTokenizer

In [2]:
# Establish a client connection to the kscope service
client = kscope.Client(gateway_host="llm.cluster.local", gateway_port=3001)
client.model_instances

[{'id': '49e036f3-52f7-462c-91bf-c9450c875c49',
  'name': 'llama2-13b',
  'state': 'ACTIVE'}]

In [3]:
model = client.load_model("llama2-13b")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while model.state != "ACTIVE":
    time.sleep(1)

### Activation Generation (__NOTE__: This takes a very long time to run, roughly an hour or two!)

In this notebook, we extract the activations at the final non-pad token of LLaMA-2 13B for different sets of text inputs. On the back end, this is done by inserting hooks into the model that expose its intermediate latent representations. We vary both the layer we extract from and the input itself to study the effect that these choices have on the performance of a downstream task. To probe these choices, we use a sample of the IMDB sentiment analysis task.

To start, we need to define a configuration for the model generation. Because we only care about the activations of our input, most of the generation settings don't matter here. The only thing we really need to do is set `max_tokens` to 1 so that we don't have to worry about indexing into the right spot in our activation matrix. That is, the activations we care about will simply be the last row of the tensor. For a discussion of the configuration parameters, see [CONFIG_README.md](../../prompting_vector_llms/CONFIG_README.md).

In [4]:
generation_config = {"max_tokens": 1}
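To make the indexing concrete, here is a toy sketch in which a nested list stands in for the activation tensor of a single prompt (one row per position, one column per hidden unit). The values and the helper name `last_position_activation` are made up for illustration. The real tensors are much wider.

```python
from typing import List


def last_position_activation(activation_matrix: List[List[float]]) -> List[float]:
    # With max_tokens set to 1, the row of interest is simply the last one.
    return activation_matrix[-1]


# Toy stand-in: 3 positions, hidden size 2 (illustrative values only).
toy_activations = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
print(last_position_activation(toy_activations))  # [0.5, 0.6]
```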

### Activation Generation 

Activation generation is quite easy. We can use the client to query the remote model and explore its modules. Here, we list the module names of the LLaMA-2 model.

In [5]:
model.module_names

['tok_embeddings',
 'layers',
 'layers.0',
 'layers.0.attention',
 'layers.0.attention.wq',
 'layers.0.attention.wk',
 'layers.0.attention.wv',
 'layers.0.attention.wo',
 'layers.0.feed_forward',
 'layers.0.feed_forward.w1',
 'layers.0.feed_forward.w2',
 'layers.0.feed_forward.w3',
 'layers.0.attention_norm',
 'layers.0.ffn_norm',
 'layers.1',
 'layers.1.attention',
 'layers.1.attention.wq',
 'layers.1.attention.wk',
 'layers.1.attention.wv',
 'layers.1.attention.wo',
 'layers.1.feed_forward',
 'layers.1.feed_forward.w1',
 'layers.1.feed_forward.w2',
 'layers.1.feed_forward.w3',
 'layers.1.attention_norm',
 'layers.1.ffn_norm',
 'layers.2',
 'layers.2.attention',
 'layers.2.attention.wq',
 'layers.2.attention.wk',
 'layers.2.attention.wv',
 'layers.2.attention.wo',
 'layers.2.feed_forward',
 'layers.2.feed_forward.w1',
 'layers.2.feed_forward.w2',
 'layers.2.feed_forward.w3',
 'layers.2.attention_norm',
 'layers.2.ffn_norm',
 'layers.3',
 'layers.3.attention',
 'layers.3.attention.wq',

We can select the module names of interest and pass them into a `get_activations` function alongside our set of prompts.

__NOTE__: Even though `get_activations` below supports a list of module names as its input, it currently does not support retrieving multiple layer activations at the same time for LLaMA-2. If you would like to retrieve multiple layer activations, please perform multiple calls. For example, if you would like activations from two different layers, perform two calls to `get_activations` with the same set of prompts but different module name inputs.
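A minimal sketch of that multi-call pattern follows. The helper name `get_activations_per_layer` is hypothetical (it is not part of kscope). It assumes only that `model.get_activations(prompts, module_names, config)` behaves as it is used in this notebook.

```python
from typing import Any, Dict, List


def get_activations_per_layer(
    model: Any, prompts: List[str], module_names: List[str], config: Dict[str, Any]
) -> Dict[str, Any]:
    # Hypothetical helper: issue one request per module name, since several
    # layers cannot be retrieved in a single call for LLaMA-2.
    results: Dict[str, Any] = {}
    for name in module_names:
        results[name] = model.get_activations(prompts, [name], config)
    return results
```

With the loaded model, this could be called as `get_activations_per_layer(model, prompts, ["layers.0", "layers.1"], generation_config)`, returning one response object per module name.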

In [6]:
prompts = ["Hello World", "Fizz Buzz"]

module_name = "layers.0"

activations = model.get_activations(prompts, [module_name], generation_config)
pprint(activations)

# We sent a batch of 2 prompts to the model.
# So a list of length two is returned containing activations for the requested layer
for activations_single_prompt in activations.activations:
    # For each prompt we extract the activations associated with the target module.
    raw_activations = activations_single_prompt[module_name]
    # The activations should have shape (number of tokens + 1) x (activation size)
    # For example, the requested layer of LLaMA-2 13B has an activation size (hidden dimension) of 5120
    print("Tensor Shape:", raw_activations.shape)

Activations(activations=[{'layers.0': tensor([[ 0.0464,  0.0161,  0.0542,  ...,  0.0118,  0.0105, -0.0007],
        [ 0.0547, -0.0118,  0.0457,  ...,  0.0460, -0.0007, -0.0750],
        [-0.0071, -0.0102,  0.0236,  ...,  0.0019,  0.0247, -0.0515]],
       dtype=torch.float16)}, {'layers.0': tensor([[ 0.0464,  0.0161,  0.0542,  ...,  0.0118,  0.0105, -0.0007],
        [ 0.0225,  0.0150,  0.0426,  ...,  0.0200,  0.0557, -0.0128],
        [ 0.0478,  0.0051, -0.0265,  ..., -0.0123, -0.0550, -0.0129],
        [ 0.0308, -0.0104,  0.0361,  ...,  0.0083,  0.0245,  0.0138],
        [ 0.0372,  0.0225,  0.0079,  ..., -0.0589, -0.0757, -0.0204]],
       dtype=torch.float16)}], logprobs=[[-0.8916411399841309], [-3.503145456314087]], sequences=['!', ','], tokens=[['!'], [',']])
Tensor Shape: torch.Size([3, 5120])
Tensor Shape: torch.Size([5, 5120])


__NOTE__: In the code below, we're going to only use batch sizes of 1 to ensure memory management on the backend doesn't get out of hand and slow the model down.

#### Tokenizer 

For activation retrieval, we need to instantiate a tokenizer to obtain appropriate token indices for our labels. 

__NOTE__: All LLaMA-2 models, regardless of size, use the same tokenizer. However, if you want to use a different type of model, a different tokenizer may be needed.

If you are on the cluster, the tokenizer may be loaded from `/model-weights/Llama-2-7b-hf`. Otherwise, you'll need to download the `config.json`, `tokenizer.json`, `tokenizer.model`, and `tokenizer_config.json` from there to your local machine.

In [7]:
# Tokenizer prepares the input of the model. LLaMA models of all sizes use the same underlying tokenizer
tokenizer = AutoTokenizer.from_pretrained("/model-weights/Llama-2-7b-hf")
# Let's test out how the tokenizer works on an example sentence. Note that the token with ID = 1 is the
# Beginning of sentence token ("BOS")
encoded_tokens = tokenizer.encode("Fizz Buzz")
print(f"Encoded Tokens: {encoded_tokens}")
# If you ever need to move back from token ids, you can use tokenizer.decode or tokenizer.batch_decode
decoded_tokens = [tokenizer.decode(encoded_token) for encoded_token in encoded_tokens]
print(f"Decoded Tokens: {decoded_tokens}")

Encoded Tokens: [1, 383, 4981, 350, 18813]
Decoded Tokens: ['<s>', 'F', 'izz', 'B', 'uzz']


As can be seen above, the second example, "Fizz Buzz", is tokenized as ["F", "izz", "B", "uzz"]. So we receive a tensor with 5 rows (one for each token and a final one for the next token to be generated) and 5120 columns (the hidden dimension of LLaMA-2-13B).

As a proof of concept of the few-shot abilities of LLMs, we'll only use a small training dataset and will only perform validation using a small test subset for compute efficiency.

* Training set: 100 randomly sampled training examples
* Test set: 300 randomly sampled test examples

In [8]:
imdb = datasets.load_dataset("imdb")
train_size = 100
test_size = 300
n_demonstrations = 5

activation_save_path = "./resources/llama2_13b_activations"

small_train_dataset = imdb["train"].shuffle(seed=42).select(range(train_size))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(test_size))
# We're going to experiment with the effect that prompting the model for the task we care about has on the
# performance of a classifier trained on the activations. So we construct demonstrations by selecting a set of
# 5 examples from the shuffled training set. Note that, because the same seed is used, these are also the first
# 5 examples of small_train_dataset.
small_demonstration_set = imdb["train"].shuffle(seed=42).select(range(n_demonstrations))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [9]:
# Split a list into a list of batches (lists) of a fixed size; the last batch may be shorter
def batcher(prompts: List[str], batch_size: int) -> List[List[str]]:
    return [prompts[pos : pos + batch_size] for pos in range(0, len(prompts), batch_size)]
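As a quick illustration of the batching behaviour, the sketch below re-implements the same slicing logic under a hypothetical name, `chunk`, so that it runs on its own. The toy strings stand in for prompts.

```python
from typing import List


def chunk(items: List[str], batch_size: int) -> List[List[str]]:
    # Same slicing logic as the batching helper: consecutive slices of length batch_size.
    return [items[pos : pos + batch_size] for pos in range(0, len(items), batch_size)]


print(chunk(["a", "b", "c", "d", "e"], 2))  # [['a', 'b'], ['c', 'd'], ['e']]
print(chunk(["a", "b", "c"], 1))  # [['a'], ['b'], ['c']]
```

With a batch size of 1, as used later in this notebook, every prompt ends up in its own single-element list.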

In [10]:
# We're running a lot of activation retrievals. Once in a while there is a JSON decoding or Triton error. If that
# happens, we retry the activations request.
def get_activations_with_retries(
    prompts: List[str], layers: List[str], config: Dict[str, Any], retries: int = 10
) -> Any:
    retries_required = 0
    for _ in range(retries):
        try:
            activations = model.get_activations(prompts, layers, config)
            if retries_required > 0:
                print(f"Something went wrong in activation retrieval. Needed {retries_required} retries")
            return activations
        except Exception:
            retries_required += 1
    raise ValueError("Exceeded retry limit. Exiting Process")
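If failures happen because the back end is temporarily overloaded, pausing between attempts may help. Below is a sketch of a generic retry helper with exponential backoff. The names `call_with_backoff`, `base_delay`, and `max_delay` are hypothetical, and this is an alternative pattern rather than what the notebook itself uses.

```python
import time
from typing import Any, Callable


def call_with_backoff(
    fn: Callable[[], Any], retries: int = 10, base_delay: float = 0.5, max_delay: float = 30.0
) -> Any:
    last_error = None
    for attempt in range(retries):
        try:
            return fn()
        except Exception as error:
            last_error = error
            if attempt < retries - 1:
                # Wait base_delay, 2 * base_delay, 4 * base_delay, ... capped at max_delay.
                time.sleep(min(base_delay * (2**attempt), max_delay))
    raise RuntimeError(f"Exceeded retry limit of {retries}") from last_error
```

In this notebook's setting, the call could be wrapped as `call_with_backoff(lambda: model.get_activations(prompts, [module_name], generation_config))`.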

In [11]:
def truncate_input_text(text: str, truncation_length: int) -> str:
    # If the text has more than truncation_length space-separated words, keep only the last truncation_length words
    split_text = text.split(" ")
    if len(split_text) > truncation_length:
        text = " ".join(split_text[-truncation_length:])
    return text
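The short example below shows the effect of this truncation on a made-up review. The function is repeated so that the snippet runs on its own. Note that truncation counts space-separated words rather than model tokens, and that it keeps the end of the text, which is the part closest to the final token whose activation we extract.

```python
def truncate_input_text(text: str, truncation_length: int) -> str:
    # Keep only the last truncation_length space-separated words.
    split_text = text.split(" ")
    if len(split_text) > truncation_length:
        text = " ".join(split_text[-truncation_length:])
    return text


review = "The first act drags but the ending is wonderful"  # made-up example text
print(truncate_input_text(review, 4))  # the ending is wonderful
print(truncate_input_text(review, 50) == review)  # True (short texts are unchanged)
```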

### Raw Text Activations

Let's start by getting the activations associated with the raw review text. Activations for the text coupled with a prompt are generated further below.

In [12]:
def generate_dataset_activations(
    split: str,
    inputs: List[str],
    labels: List[int],
    module_name: str,
    pickle_name: str,
) -> None:
    print("Generating Activations with Prompts: " + split)

    parsed_activations = []
    # Using a batch size of 1 as activation fetching is heavy
    batches = list(batcher(inputs, 1))
    for input_batch in tqdm(batches):
        # Getting activations for each input batch. For an example of how get_activations works, see beginning of this
        # notebook.
        input_batch = [truncate_input_text(text, 256) for text in input_batch]
        raw_activations = get_activations_with_retries(input_batch, [module_name], generation_config).activations
        for raw_activation in raw_activations:
            # We will be performing classification on the last non-pad token of the sequence. This is common
            # practice for auto-regressive models (e.g. OPT, Falcon, LLaMA-2). So we only keep the last row of the
            # activation matrix.
            parsed_activations.append(raw_activation[module_name][-1].float())

    cached_activations = {"activations": parsed_activations, "labels": labels}

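    # Make sure the output directory exists before writing the cache file
    os.makedirs(activation_save_path, exist_ok=True)
    with open(os.path.join(activation_save_path, f"{split}{pickle_name}.pkl"), "wb") as handle:
        pickle.dump(cached_activations, handle, protocol=pickle.HIGHEST_PROTOCOL)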
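For reference, here is a sketch of how a cached file could be read back later. The helper name `load_cached_activations` is hypothetical, and it assumes the `{split}{pickle_name}.pkl` naming and the `{"activations": ..., "labels": ...}` layout used above. The round trip uses a temporary directory and toy values in place of real activations.

```python
import os
import pickle
import tempfile
from typing import Any, Dict


def load_cached_activations(save_dir: str, split: str, pickle_name: str) -> Dict[str, Any]:
    # Mirrors the file naming used when saving: "{split}{pickle_name}.pkl"
    with open(os.path.join(save_dir, f"{split}{pickle_name}.pkl"), "rb") as handle:
        return pickle.load(handle)


# Round-trip demo with toy values standing in for real activations.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_cache = {"activations": [[0.1, 0.2], [0.3, 0.4]], "labels": [0, 1]}
    with open(os.path.join(tmp_dir, "train_activations_demo_10.pkl"), "wb") as handle:
        pickle.dump(toy_cache, handle, protocol=pickle.HIGHEST_PROTOCOL)
    loaded = load_cached_activations(tmp_dir, "train", "_activations_demo_10")
    print(len(loaded["activations"]), loaded["labels"])  # 2 [0, 1]
```

As with any pickle file, only load caches that you created yourself or otherwise trust.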

In [13]:
layer_numbers = ["10", "20", "30", "39"]
module_names = [f"layers.{layer_number}" for layer_number in layer_numbers]

train_labels = small_train_dataset["label"]
test_labels = small_test_dataset["label"]

assert len(module_names) == len(layer_numbers)

for module_name, layer_number in zip(module_names, layer_numbers):
    print(f"Generation Activations for Module Name {module_name}")
    generate_dataset_activations(
        "train", small_train_dataset["text"], train_labels, module_name, f"_activations_demo_{layer_number}"
    )
    generate_dataset_activations(
        "test", small_test_dataset["text"], test_labels, module_name, f"_activations_demo_{layer_number}"
    )

Generation Activations for Module Name layers.10
Generating Activations with Prompts: train
Something went wrong in activation retrieval. Needed 3 retries
Batch 50 Completed
Something went wrong in activation retrieval. Needed 1 retries
Batch 100 Completed
Generating Activations with Prompts: test
Something went wrong in activation retrieval. Needed 1 retries
Something went wrong in activation retrieval. Needed 1 retries
Something went wrong in activation retrieval. Needed 1 retries
Something went wrong in activation retrieval. Needed 1 retries
Batch 50 Completed
Something went wrong in activation retrieval. Needed 1 retries
Something went wrong in activation retrieval. Needed 1 retries
Something went wrong in activation retrieval. Needed 1 retries
Batch 100 Completed
Something went wrong in activation retrieval. Needed 1 retries
Something went wrong in activation retrieval. Needed 1 retries
Something went wrong in activation retrieval. Needed 2 retries
Batch 150 Completed
Something we

### Prompt Conditioned Activations

Now let's generate activations pre-conditioned with an instruction and a few demonstrations.

In [14]:
def create_demonstrations(instruction: str, demonstration_set: Dataset) -> str:
    label_int_to_str = {0: "negative", 1: "positive"}
    demonstration = f"{instruction}"
    demo_texts = demonstration_set["text"]
    demo_labels = demonstration_set["label"]
    for text, label in zip(demo_texts, demo_labels):
        # Truncate long texts by keeping only the last 64 space-separated words (the start of the text is cut)
        text = truncate_input_text(text, 64)
        demonstration = f"{demonstration}\nText: {text}\nSentiment: {label_int_to_str[label]}"
    return f"{demonstration}\n"

In [15]:
def create_prompts(texts: List[str], demonstration: str) -> List[str]:
    truncated_texts = [truncate_input_text(text, 128) for text in texts]
    return [f"{demonstration}Text: {text} The sentiment is" for text in truncated_texts]

Below we show the demonstration structure (based on 5 examples) and what each prompt passed to LLaMA-2 looks like.

In [16]:
demonstration = create_demonstrations("Classify the sentiment of the text.", small_demonstration_set)
print(f"Demonstration:\n{demonstration}")

train_prompts = create_prompts(small_train_dataset["text"], demonstration)
test_prompts = create_prompts(small_test_dataset["text"], demonstration)
print(f"Prompt Example:\n{train_prompts[0]}")

Demonstration:
Classify the sentiment of the text.
Text: like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
Sentiment: positive
Text: favorite song is sung by the King, Hank (bing Crosby) and Sir "Saggy" Sagamore. OVerall a great family movie or even a great Date movie. This is a movie you can watch over and over again. The princess played by Rhonda Fleming is gorgeous. I love this movie!! If you liked Danny Kaye in the Court Jester then you will definitely like this movie.
Sentiment: positive
Text: comprehensible reason for the engagement in South Asia. And for that matter also the reason for every single US-American soldier that was there. Instead, Rambo gets to take r

In [17]:
layer_numbers = ["10", "20", "30", "39"]
module_names = [f"layers.{layer_number}" for layer_number in layer_numbers]

train_labels = small_train_dataset["label"]
test_labels = small_test_dataset["label"]

assert len(module_names) == len(layer_numbers)

for module_name, layer_number in zip(module_names, layer_numbers):
    print(f"Generation Activations for Module Name {module_name}")
    generate_dataset_activations(
        "train", train_prompts, train_labels, module_name, f"_activations_with_prompts_demo_{layer_number}"
    )
    generate_dataset_activations(
        "test",
        test_prompts,
        test_labels,
        module_name,
        f"_activations_with_prompts_demo_{layer_number}",
    )

Generation Activations for Module Name layers.10
Generating Activations with Prompts: train
Batch 50 Completed
Batch 100 Completed
Generating Activations with Prompts: test
Batch 50 Completed
Batch 100 Completed
Batch 150 Completed
Batch 200 Completed
Batch 250 Completed
Batch 300 Completed
Generation Activations for Module Name layers.20
Generating Activations with Prompts: train
Batch 50 Completed
Batch 100 Completed
Generating Activations with Prompts: test
Batch 50 Completed
Batch 100 Completed
Batch 150 Completed
Batch 200 Completed
Batch 250 Completed
Batch 300 Completed
Generation Activations for Module Name layers.30
Generating Activations with Prompts: train
Batch 50 Completed
Batch 100 Completed
Generating Activations with Prompts: test
Batch 50 Completed
Batch 100 Completed
Batch 150 Completed
Batch 200 Completed
Batch 250 Completed
Batch 300 Completed
Generation Activations for Module Name layers.39
Generating Activations with Prompts: train
Batch 50 Completed
Batch 100 Com

With these activations saved, the next step is to train a simple classifier on top of them in order to perform the sentiment classification. This is done in the `train_on_activations.ipynb` notebook.