# Analyzing the semantics of neural activation patterns

In this analysis, I look for relationships between the semantic content of inputs and the neural activation patterns they produce. This preliminary pass reaches few strong conclusions, but it opens several directions for further analysis.

The analysis relies on two data sources: the WordNet module in NLTK (which encodes semantic information about words, such as their "domain" and their "part of speech") and an "AMAP" of the Pythia 70M large language model, generated against a Wikipedia dataset. I assume the reader is familiar with WordNet, so I explain the AMAP first and then the analysis.

An AMAP is an activation map generated by Corentin Kervadec's [SYSIF](https://github.com/CorentinKervadec/SYSIF) library. It can be understood as a two-dimensional matrix in which the columns are the tokens fed to the model, the rows are the neurons of the model, and each value is the activation (i.e. the output of the neuron's activation function) of that neuron in response to that token. So if we want to know how the (arbitrarily chosen) 5th neuron of the 2nd layer fired in response to the token "hat", we look up the column for "hat" and the row for the 5th neuron of the 2nd layer.
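
The lookup described above can be sketched on a toy table. This is only an illustration under stated assumptions: the three tokens and all activation values are invented, `activation_for` is a hypothetical helper (not part of SYSIF), and the `layer_neuron` reading of the `#unit` labels is my interpretation of the table shown later in this notebook.

```python
import pandas as pd

# Toy stand-in for an AMAP: one row per neuron ("unit"), one column per token.
# Tokens and values are invented; '#unit' and '#layer' mimic the metadata
# columns that appear in the real AMAP dataframe.
toy_amap = pd.DataFrame(
    {
        "hat": [0.00, 0.12, 0.40, 0.05],
        "cat": [0.30, 0.00, 0.10, 0.22],
        "the": [0.01, 0.02, 0.00, 0.03],
        "#unit": ["0_0", "0_1", "1_0", "1_1"],
        "#layer": [0, 0, 1, 1],
    }
)


def activation_for(amap, token, unit):
    """Return the activation of one unit (row) in response to one token (column)."""
    row = amap.loc[amap["#unit"] == unit, token]
    return float(row.iloc[0])


# Activation of unit "1_0" in response to the token "hat"
print(activation_for(toy_amap, "hat", "1_0"))
```

Selecting the row by its `#unit` label rather than by position keeps the lookup readable and independent of how the rows happen to be ordered.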

Using this pre-generated AMAP, we can look at the activation patterns for individual words and then look for trends across different types of words. This kind of interpretability analysis is useful because it can show whether the model's activations separate words by semantic category, and at which layers.

Let's dive in.

## Import packages

In [1]:
import torch

import numpy as np
import pandas as pd

from nltk.corpus import wordnet as wn
from sklearn.manifold import TSNE
import plotly.express as px
from transformers import AutoTokenizer

from src.amap.amap import LMamap
from src.data.dataset_loader import load_hf_dataset_with_sampling
from src.model.causal_lm import CausalLanguageModel
from src.utils.init_utils import init_device, init_random_seed



# Load the AMAP

First, we load the AMAP and expose it as a dataframe so that we can query it.

## Specify AMAP information

In [2]:
model_name = "EleutherAI/pythia-70m-deduped"
dataset = "wikipedia,20220301.en,train"
n_samples = 100000
device = "cpu"
load = "/Users/dleybz/Documents/UPF Courses/Research/Neural Activation Patterns/Analysis 0/SYSIF/data"

fp16 = True
window_size = 15
batch_size = 9
window_stride = 15
mode = 'input'

In [3]:
model = CausalLanguageModel(
    model_name,
    device="cuda" if torch.cuda.is_available() else "cpu",
    fast_tkn='opt' not in model_name,  # disable the fast tokenizer for OPT because of a bug
    fp16=fp16)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
amapper = LMamap(model=model,
                    device=device,
                    mode=mode,
                    fp16=fp16)

## Load AMAP

In [42]:
amapper.load(load, dataset, window_size)
amap_df = amapper.get_df_amap()

amap_df.head()

[AMAP] Loading files...
[AMAP] amap-pythia-70m-deduped-wikipedia,20220301.en,train-N100000-1337_position_.pickle loaded!
[AMAP] tokens-count-pythia-70m-deduped-wikipedia,20220301.en,train-N100000-1337_position_.pickle loaded!
[AMAP] Sanity check
[AMAP] Done :-)


Unnamed: 0,<|endoftext|>,<|padding|>,!,"""",#,$,%,&,',(,...,[POS_10],[POS_11],[POS_12],[POS_13],[POS_14],[POS_15],#unit,#mode,#layer,#mean
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.040833,0.039856,0.040161,0.04007,0.039825,0.0,0_0,input,0,0.090454
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.02243,0.022659,0.022705,0.02272,0.022598,0.0,0_1,input,0,0.057739
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.046295,0.045959,0.045166,0.04599,0.045197,0.0,0_2,input,0,0.115784
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.031921,0.031799,0.032288,0.031891,0.031982,0.0,0_3,input,0,0.078979
4,0.0,0.0,0.0,0.0,0.0,0.0,0.62207,0.0,0.0,0.0,...,0.045715,0.045563,0.046082,0.04599,0.045868,0.0,0_4,input,0,0.090149


# Collect semantic information

Next, we collect semantic information. To do this, we take every synset in WordNet and extract three things from it: its first lemma, its first topic domain (if it has one), and its part of speech.

## Get a list of the first lemma for each synset in WordNet

In [6]:
all_synsets = list(wn.all_synsets())

def first_lemma(synset):
    lemmas = synset.lemmas()
    return lemmas[0]

first_lemmas = [first_lemma(synset) for synset in all_synsets]
first_lemma_names = [lemma.name() for lemma in first_lemmas]

## Extract the domain for each synset

In [7]:
def first_domain(synset):
    domains = synset.topic_domains()
    # Use the string 'None' as a placeholder for synsets without a topic domain
    if not domains:
        return 'None'
    return domains[0].name()

first_domains = [first_domain(synset) for synset in all_synsets]

## Extract POS for each synset

In [8]:
all_pos = [synset.pos() for synset in all_synsets]

## Extract an average AMAP for each lemma

Now that we have collected this information, we also need an AMAP for each lemma so that we can compare lemmas by their semantic properties. The complication is that AMAPs contain token-level information, whereas WordNet contains lemma-level information. Originally, I took the naive approach of keeping only the lemmas that map 1:1 to a token in the AMAP, but this meant losing a large portion of the dataset.

Instead, I pass each lemma through the same tokenizer that the Pythia 70M model relies on to get its constituent tokens, and then extract the AMAPs for those tokens from the dataset. By averaging the AMAPs of the constituent tokens, I derive a new AMAP that represents the mean activation of the network in response to the entire lemma.

In [9]:
def lemma_to_tokens(lemma, tokenizer):
    # tokenize() returns a list of token strings, so no tensor output is needed
    pieces = tokenizer.tokenize(lemma)
    # the tokenizer sometimes inserts "Ġ" (its marker for a leading space); strip it before the AMAP lookup
    tokens = [piece.strip("Ġ") for piece in pieces]
    return tokens

def tokens_to_avg_amap(tokens):
    # select one AMAP column per token: shape (n_neurons, n_tokens)
    amaps = amap_df[tokens]
    array = np.array(amaps)
    # average across tokens, leaving one value per neuron
    average_amap = np.mean(array, axis=1)
    return average_amap

def lemmas_to_amaps(lemmas, tokenizer):
    # list of token lists, one per lemma
    token_lol = [lemma_to_tokens(lemma, tokenizer) for lemma in lemmas]
    averages = [tokens_to_avg_amap(tokens) for tokens in token_lol]
    return averages

tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
    cache_dir="./pythia-70m-deduped/step3000",
)

amaps = lemmas_to_amaps(first_lemma_names, tokenizer)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
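
To make the averaging step concrete, here is a minimal sketch on a toy table. The sub-word tokens and activation values are invented, and `average_amap` is a hypothetical stand-in for `tokens_to_avg_amap` that takes the table as an argument instead of reading a global.

```python
import numpy as np
import pandas as pd

# Toy AMAP with three neurons (rows) and three sub-word tokens (columns).
# Tokens and values are invented for illustration only.
toy_amap = pd.DataFrame(
    {
        "ab": [0.0, 0.2, 0.4],
        "ax": [0.2, 0.0, 0.4],
        "ial": [0.4, 0.4, 0.4],
    }
)


def average_amap(amap, tokens):
    """Average the per-token activation columns into one vector per lemma."""
    columns = amap[tokens].to_numpy()  # shape: (n_neurons, n_tokens)
    return columns.mean(axis=1)        # mean over tokens, shape: (n_neurons,)


lemma_vector = average_amap(toy_amap, ["ab", "ax", "ial"])
print(lemma_vector.shape)
```

The result has one entry per neuron regardless of how many tokens the lemma was split into, which is what makes lemmas of different lengths comparable.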


## Combine all of the data into a single dataframe

In [41]:
analyzable_df = pd.DataFrame({
    'synset': [synset.name() for synset in all_synsets],
    'domain': first_domains,
    'lemma': first_lemma_names,
    'pos': all_pos,
    'amap': amaps,
})

analyzable_df.head()

Unnamed: 0,synset,domain,lemma,pos,amap
0,able.a.01,,able,a,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.06223, 0.0, 0.0, 0..."
1,unable.a.01,,unable,a,"[0.0208, 0.0, 0.0, 0.0, 0.0, 0.03111, 0.1488, ..."
2,abaxial.a.01,biology.n.01,abaxial,a,"[0.0, 0.07135, 0.1606, 0.124, 0.0, 0.0, 0.0, 0..."
3,adaxial.a.01,biology.n.01,adaxial,a,"[0.1852, 0.07135, 0.11383, 0.124, 0.0, 0.0, 0...."
4,acroscopic.a.01,botany.n.02,acroscopic,a,"[0.2449, 0.0, 0.1396, 0.0, 0.2805, 0.00624, 0...."


In [21]:
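def get_layer_and_tsne(layer_number, amap_df, analyzable_df):
    # Select the rows (neurons) of amap_df that belong to this layer. A positional
    # boolean mask is used instead of index labels so that the slice stays correct
    # even if amap_df's index is not a plain 0..n-1 range.
    layer_mask = (amap_df['#layer'] == layer_number).to_numpy()
    col_name = 'layer'+str(layer_number)
    analyzable_df[col_name] = analyzable_df['amap'].apply(lambda x: x[layer_mask])

    # stack the per-lemma vectors into a (n_lemmas, n_layer_neurons) array
    amap_lol = [list(array) for array in analyzable_df[col_name]]
    amap_array = np.array(amap_lol)

    # tsne those activations
    X_embedded = TSNE(n_components=2).fit_transform(amap_array)
    analyzable_df['x_'+col_name] = X_embedded[:,0]
    analyzable_df['y_'+col_name] = X_embedded[:,1]

    # analyzable_df is modified in place; it is also returned for convenience
    return analyzable_df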



In [22]:
for i in amap_df['#layer'].unique():
    get_layer_and_tsne(i, amap_df, analyzable_df)
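
The per-layer slicing that feeds t-SNE can be illustrated on toy data. This is a sketch under assumptions: the metadata table, the activation values, and the helper `layer_slice` are all hypothetical, and the t-SNE step itself is left out.

```python
import numpy as np
import pandas as pd

# Toy metadata: four units, two per layer (invented for illustration).
toy_meta = pd.DataFrame(
    {"#unit": ["0_0", "0_1", "1_0", "1_1"], "#layer": [0, 0, 1, 1]}
)

# One activation vector for a single lemma, ordered like the rows of toy_meta.
lemma_vector = np.array([0.1, 0.2, 0.3, 0.4])


def layer_slice(vector, meta, layer_number):
    """Keep only the entries of `vector` whose unit belongs to `layer_number`."""
    mask = (meta["#layer"] == layer_number).to_numpy()  # positional boolean mask
    return vector[mask]


print(layer_slice(lemma_vector, toy_meta, 1))
```

A boolean mask selects by position, so it does not depend on what the dataframe's index labels happen to be.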

In [40]:
analyzable_df.head()

Unnamed: 0,synset,domain,lemma,pos,amap,layer0,x_layer0,y_layer0,layer1,x_layer1,...,y_layer2,layer3,x_layer3,y_layer3,layer4,x_layer4,y_layer4,layer5,x_layer5,y_layer5
0,able.a.01,,able,a,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.06223, 0.0, 0.0, 0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.06223, 0.0, 0.0, 0...",-68.73542,68.479126,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.06223, 0.0, 0.0, 0...",-70.674873,...,68.12088,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.06223, 0.0, 0.0, 0...",14.028213,53.809151,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.06223, 0.0, 0.0, 0...",-63.521675,72.260651,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.06223, 0.0, 0.0, 0...",-68.163849,75.362358
1,unable.a.01,,unable,a,"[0.0208, 0.0, 0.0, 0.0, 0.0, 0.03111, 0.1488, ...","[0.0208, 0.0, 0.0, 0.0, 0.0, 0.03111, 0.1488, ...",-73.160461,60.558235,"[0.0208, 0.0, 0.0, 0.0, 0.0, 0.03111, 0.1488, ...",-74.275551,...,61.125755,"[0.0208, 0.0, 0.0, 0.0, 0.0, 0.03111, 0.1488, ...",57.59734,63.680691,"[0.0208, 0.0, 0.0, 0.0, 0.0, 0.03111, 0.1488, ...",-72.470848,63.897709,"[0.0208, 0.0, 0.0, 0.0, 0.0, 0.03111, 0.1488, ...",-67.214516,59.784946
2,abaxial.a.01,biology.n.01,abaxial,a,"[0.0, 0.07135, 0.1606, 0.124, 0.0, 0.0, 0.0, 0...","[0.0, 0.07135, 0.1606, 0.124, 0.0, 0.0, 0.0, 0...",-47.069584,29.273985,"[0.0, 0.07135, 0.1606, 0.124, 0.0, 0.0, 0.0, 0...",-47.994194,...,32.653339,"[0.0, 0.07135, 0.1606, 0.124, 0.0, 0.0, 0.0, 0...",31.074736,32.598499,"[0.0, 0.07135, 0.1606, 0.124, 0.0, 0.0, 0.0, 0...",-48.047474,34.412838,"[0.0, 0.07135, 0.1606, 0.124, 0.0, 0.0, 0.0, 0...",-47.765808,30.635893
3,adaxial.a.01,biology.n.01,adaxial,a,"[0.1852, 0.07135, 0.11383, 0.124, 0.0, 0.0, 0....","[0.1852, 0.07135, 0.11383, 0.124, 0.0, 0.0, 0....",-47.05476,29.277395,"[0.1852, 0.07135, 0.11383, 0.124, 0.0, 0.0, 0....",-47.983456,...,32.653347,"[0.1852, 0.07135, 0.11383, 0.124, 0.0, 0.0, 0....",31.067787,32.594536,"[0.1852, 0.07135, 0.11383, 0.124, 0.0, 0.0, 0....",-48.034508,34.41267,"[0.1852, 0.07135, 0.11383, 0.124, 0.0, 0.0, 0....",-47.773018,30.641258
4,acroscopic.a.01,botany.n.02,acroscopic,a,"[0.2449, 0.0, 0.1396, 0.0, 0.2805, 0.00624, 0....","[0.2449, 0.0, 0.1396, 0.0, 0.2805, 0.00624, 0....",10.540625,-55.473915,"[0.2449, 0.0, 0.1396, 0.0, 0.2805, 0.00624, 0....",2.830513,...,-51.792801,"[0.2449, 0.0, 0.1396, 0.0, 0.2805, 0.00624, 0....",7.682254,-52.637489,"[0.2449, 0.0, 0.1396, 0.0, 0.2805, 0.00624, 0....",20.343779,-53.89394,"[0.2449, 0.0, 0.1396, 0.0, 0.2805, 0.00624, 0....",-0.582875,-53.012756


In [38]:
def visualize_tsne(dataframe, layer_number, col):
    layer = 'layer'+str(layer_number)
    fig = px.scatter(
        x=dataframe['x_'+layer],
        y=dataframe['y_'+layer],
        color=dataframe[col],
        hover_name=dataframe['lemma'],
    )
    fig.update_layout(
        title="t-SNE visualization of layer "+str(layer_number)+" activations",
        xaxis_title="t-SNE dimension 1",
        yaxis_title="t-SNE dimension 2",
    )
    fig.show()

# keep only synsets that have a topic domain ('None' is the string placeholder set in first_domain)
domain_df = analyzable_df.loc[analyzable_df['domain']!='None']

# plot only the first 5,000 rows
visualize_tsne(domain_df[0:5000], 0, 'domain')

In [39]:
domain_frequency = domain_df.groupby('domain').size()
# keep the 5 most frequent domains
top_domains = domain_frequency.sort_values(ascending=False).head(5)

# top_domains is indexed by domain name, so test membership against its index
in_top = domain_df['domain'].isin(top_domains.index)
top_df = domain_df.loc[in_top]

for i in amap_df['#layer'].unique():
    visualize_tsne(top_df, i, 'domain')
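
The "keep only the most frequent domains" filter can be checked on a small example. This is a sketch with placeholder data: the lemma names are dummies, the domain labels are used only as strings, and `keep_top_domains` is a hypothetical helper that uses `value_counts` in place of the `groupby` above.

```python
import pandas as pd

# Placeholder rows: three, two, and one lemma per domain respectively.
toy_df = pd.DataFrame(
    {
        "lemma": ["lemma_1", "lemma_2", "lemma_3", "lemma_4", "lemma_5", "lemma_6"],
        "domain": [
            "biology.n.01", "biology.n.01", "biology.n.01",
            "botany.n.02", "botany.n.02",
            "music.n.01",
        ],
    }
)


def keep_top_domains(df, n):
    """Return the rows of `df` whose domain is among the `n` most frequent."""
    counts = df["domain"].value_counts()  # sorted by frequency, descending
    top = counts.head(n).index            # the domain names themselves
    return df[df["domain"].isin(top)]


top_df = keep_top_domains(toy_df, 2)
print(sorted(top_df["domain"].unique()))
```

Passing `n` as a parameter keeps the cutoff in one place, which avoids a variable name and a `head()` argument drifting apart.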
