# Mining Transformer self-attention

This notebook trains a DSDM instance (implemented in [src/lib/memory/DSDM.py](https://github.com/dfichiu/ba-thesis/blob/master/src/lib/memory/DSDM.py)) on subsequences constructed by passing each sentence through the pre-trained [BERT base uncased](https://huggingface.co/bert-base-uncased) model and mining the resulting self-attention matrices. Inference is then performed on the in-set inference sentences.

The current experiment trains on 20 articles (10 of which are the articles the inference sentences were taken from) using the full attention landscape (144 heads: 12 layers with 12 heads per layer). The subsequences are sorted first by length and then by chunk score, and a single subsequence per head is committed to memory. Stop words are removed during training (see [Training II](#Training-II): Subsequence construction parameters), but they are kept during inference, where they act as noise. The preprocessing of the sentence during inference is implemented in the `infer` function from the [src/lib/utils/inference.py](https://github.com/dfichiu/ba-thesis/blob/0524e5598786147aefad596641bff0c0a061cd1f/src/lib/utils/inference.py#L190) module.


For longer training runs, consider using the training script [/src/experiments/train_memory.py](https://github.com/dfichiu/ba-thesis/blob/master/src/experiments/train_memory.py).

For a description of the parameters that can be set during training and inference, please refer to the respective sections:
- [Training I](#Training-I): DSDM parameters
- [Training II](#Training-II): Subsequence construction parameters
- [Inference](#Inference)


In [1]:
### Set path for imports. ###
import sys
import os

# Get the absolute path of the parent directory.
parent_dir = os.path.abspath(os.path.join(os.path.dirname("__file__"), ".."))

# Add the parent directory to the system path to be able to import modules from 'lib.'
sys.path.append(parent_dir)

In [2]:
%%capture
import datasets

from datetime import datetime
import ipywidgets as widgets
from IPython.display import HTML, Markdown as md
import itertools

from lib.memory import DSDM
from lib.utils import cleanup, configs, inference, learning, preprocess, utils 

import math
import matplotlib
import matplotlib.pyplot as plt
import networkx as nx
from nltk.corpus import stopwords
import numpy as np
import random

import pandas as pd
import pathlib
import pickle

import string
import seaborn as sns

from transformers import AutoTokenizer, AutoModel

import torch
import torchhd as thd
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F 

from tqdm import tqdm

### Package options ###
## Torch
# Disable gradients.
torch.set_grad_enabled(False)
torch.set_printoptions(threshold=10_000)

[nltk_data] Downloading package punkt to
[nltk_data]     /nfs/home/dfichiu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /nfs/home/dfichiu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /nfs/home/dfichiu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
### Utils ###
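def plot_heatmap(x: np.ndarray, labels: np.ndarray, layer: int, head: int) -> None:
    """Plot an annotated heatmap of the self-attention matrix of one head."""
    plt.figure(figsize=(15, 15))
    sns.heatmap(
        x,
        linewidth=0.5,
        xticklabels=labels,
        yticklabels=labels,
        annot=True,
        fmt='.2f',
    )
    # The layer and head are passed in explicitly instead of being read from globals.
    plt.title(f'Self-attention matrix: layer {layer}, head {head}', fontsize=15)

    plt.show()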

def average_out_and_remove_rows(
    t: torch.Tensor,
    averages_idx: list,
    remove_idx: np.ndarray
) -> torch.Tensor:
    for average_idx in averages_idx:  # The nested lists can have different dimensions.
        # Replace the attention scores of the first token with the average of the token attention scores.
        t[min(average_idx)] = torch.mean(t[average_idx], dim=0, keepdim=True)
    return t[~remove_idx]


def preprocess_attention_scores(
    attention_scores: torch.Tensor,
    averages_idx: list,
    remove_idx: np.ndarray
) -> torch.Tensor:
    """
    Preprocess self-attention matrix.
    
    Average out rows associated with subwords to create entries of reconstructed
    words. Remove punctuation, stop words, and subwords. Apply same procedure to columns by
    transposing the matrix.
    """
    # Remove entries from rows.
    attention_scores = average_out_and_remove_rows(attention_scores, averages_idx, remove_idx)
    # Transpose matrix.
    attention_scores = attention_scores.transpose(0, 1)
    # Remove entries from columns.
    attention_scores = average_out_and_remove_rows(attention_scores, averages_idx, remove_idx)
    # Transpose matrix.
    return attention_scores.transpose(0, 1)
        
    

def backward_pass(G, current_node, left_edge, right_edge, sequence, mean):
    in_nodes = np.array([edge[0] for edge in list(G.in_edges(current_node))])
    in_nodes = in_nodes[(in_nodes > left_edge) & (in_nodes < current_node)]
    for node in in_nodes:
        sequence[node] = 1
        sequences.append(sequence)
        mean += G[node][current_node]['weight']
        means.append(round(mean / (sum(sequence) - 1), 2))
        backward_pass(G, node, left_edge, node, sequence.copy(), mean)
        forward_pass(G, node, left_edge, current_node, sequence.copy(), mean)
        
    return
    
    
def forward_pass(G, current_node, left_edge, right_edge, sequence, mean):
    out_nodes = np.array([edge[1] for edge in list(G.out_edges(current_node))])
    out_nodes = out_nodes[(out_nodes > current_node) & (out_nodes < right_edge)]
    for node in out_nodes:
        sequence[node] = 1
        mean += G[current_node][node]['weight']
        sequences.append(sequence)
        means.append(round(mean / (sum(sequence) - 1), 2))
        backward_pass(G, node, current_node, node, sequence.copy(), mean)
        forward_pass(G, node, node, right_edge, sequence.copy(), mean)
            
    return
    

def construct_sequences(G: nx.DiGraph, n_tokens):
    """Construct subsequences from weighted directed graph."""
    for node in G.nodes():
        sequence = np.zeros(n_tokens)
        mean = 0
        sequence[node] = 1
        #sequences.append(sequence) # Do not allow for 1-token sequences.
        forward_pass(G, node, node, n_tokens, sequence.copy(), mean)

In [4]:
def save_memory(cleanup, memory):
    """Save codebook and memory to file."""
    now = str(datetime.now()).replace(':', "-").replace('.', '-')
    
    if not os.path.exists('memories/method2'):
        os.makedirs('memories/method2')
    if not os.path.exists('cleanups/method2'):
        os.makedirs('cleanups/method2')
        
    with open(f'memories/method2/memory_{now}.pkl', 'wb') as outp:
        pickle.dump(memory, outp, pickle.HIGHEST_PROTOCOL)
    with open(f'cleanups/method2/cleanup_{now}.pkl', 'wb') as outp:
        pickle.dump(cleanup, outp, pickle.HIGHEST_PROTOCOL)

In [5]:
# Load Wikipedia dataset.
# TODO: Split between server and local.
# wiki_dataset = datasets.load_dataset("wikipedia", "20220301.en")['train']
wiki_dataset = datasets.load_dataset(
    "wikipedia",
    "20220301.en",
    cache_dir="/nfs/data/projects/daniela")['train']

Found cached dataset wikipedia (/nfs/data/projects/daniela/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)


  0%|          | 0/1 [00:00<?, ?it/s]

In [6]:
# Set device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set seed.
utils.fix_seed(41)

Using seed: 41

## Training
### Training I

<ins>**DSDM parameters**</ins> 
- `address_size`
- `ema_time_period`, `learning_rate_update`, `normalize` - These parameters shouldn't change, as their values influence whether DSDM aggregates during saving or not;
- `as_threshold`
- `temperature`
- `prune_mode`
- `max_size_address_space`

- `safeguard_bins`
- `bin_score_threshold_type`
- `bin_score_threshold`
 
- `safeguard_chunks`
- `chunk_score_threshold`

For documentation of the DSDM parameters, please refer to the DSDM class in [src/lib/memory/DSDM.py](https://github.com/dfichiu/ba-thesis/blob/master/src/lib/memory/DSDM.py).

In [7]:
### DSDM parameters ###
# These parameters shouldn't change.
address_size = 1000
ema_time_period = 100000
learning_rate_update = 0

normalize = False 

# Attention score threshold
as_threshold = 0.5


temperature = 0.05

# Pruning parameters
prune_mode = None
max_size_address_space = 10

safeguard_bins = True
bin_score_threshold_type = 'static'
bin_score_threshold = 1e-8
 
safeguard_chunks = True
chunk_score_threshold = 0.8

In [8]:
# Initialize codebook, i.e., class that saves token - atomic hypervector associations.
cleanup = cleanup.Cleanup(address_size)

In [9]:
# Load pre-trained BERT base uncased and Wordpiece tokenizer.
model_name = "bert-base-uncased"  # Has 12 layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# The BERT model can process texts of the maximal length of 512 tokens.
MAXIMUM_SEQUENCE_LENGTH = 512

In [10]:
# Initialize DSDM object.
memory = DSDM.DSDM(
    address_size=address_size,
    ema_time_period=ema_time_period,
    learning_rate_update=learning_rate_update,
    temperature=temperature,
    normalize=normalize,
    prune_mode=prune_mode,
    max_size_address_space=max_size_address_space,
    safeguard_bins=safeguard_bins,
    bin_score_threshold_type=bin_score_threshold_type,
    bin_score_threshold=bin_score_threshold,
    safeguard_chunks=safeguard_chunks,
    chunk_score_threshold=chunk_score_threshold,
)

In [11]:
train_size = 10 # Parameter: Number of train articles

train_idx = np.random.randint(0, len(wiki_dataset) - 1000, size=1000000)
# Select train articles.
train_idx = train_idx[:train_size]
# Manually add the articles from which the in-set inference sentences were selected.
train_idx = np.append(np.array([6458629, 6458633, 6458645, 6458648, 6458659, 6458664, 6458665,
   6458667, 6458668, 6458573]), train_idx)

In [12]:
### Not used ###
# Global duplicated addresses counter.
dups_found = 0

def remove_duplicates(memory):
    """Remove duplicate addresses from a DSDM object.
    
    Given a DSDM object, for each address, remove the addresses that have a (cosine) similarity
    higher than 0.95 to it.
    
    Implemented by a global keep mask that is updated for each address using 'and'.
    """
    global dups_found
    global_keep_mask = torch.tensor([True] * len(memory.addresses)).to(device)
    
    for idx, address in enumerate(memory.addresses):
        if global_keep_mask[idx].item():
            cos = torch.nn.CosineSimilarity()
            keep_mask = cos(memory.addresses, address) < 0.95
            # Keep current address.
            keep_mask[idx] = True
            global_keep_mask &= keep_mask

    if global_keep_mask.sum().item() > 0:
        dups_found += 1
        # Remove similar addresses.
        memory.addresses = memory.addresses[global_keep_mask]
        # Remove bins.
        memory.bins = memory.bins[global_keep_mask]
        # Remove chunk scores.
        memory.chunk_scores = memory.chunk_scores[global_keep_mask]

### Training II

<ins>**Subsequence construction parameters**</ins>
    
Regarding subsequence construction, the following parameters/settings can be adjusted in the cell below:
- `remove_stopwords_training`: If `True`, remove stop words during training;
- **Subsequence sorting:** The generated subsequences can be sorted in two ways: first by chunk score and then by length (both in descending order), or the other way around. Currently, the subsequences are sorted first by length and then by chunk score;
- `n_subsequences`: The number of subsequences to save to memory after sorting. Currently, the number is set to 1;
- `layers`: Integer list of the encoder layers to construct subsequences from.

The places in the code where the above settings can be changed are marked by a comment.
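
The two sorting options can be compared on a toy example. This is a small pure-Python sketch, not the pandas-based code of the training cell; the helper `select_subsequences`, its `length_first` flag, and the candidate values are hypothetical.

```python
from typing import List, Tuple

# A candidate is (binary token mask, chunk score).
Candidate = Tuple[Tuple[int, ...], float]


def select_subsequences(
    candidates: List[Candidate],
    n_subsequences: int = 1,
    length_first: bool = True,
) -> List[Candidate]:
    """Sort candidates in descending order and keep the first n_subsequences."""

    def key(candidate: Candidate):
        mask, score = candidate
        length = sum(mask)  # Number of tokens in the subsequence.
        return (length, score) if length_first else (score, length)

    return sorted(candidates, key=key, reverse=True)[:n_subsequences]


candidates = [
    ((1, 1, 0, 0), 0.95),  # Short, high chunk score.
    ((1, 1, 1, 0), 0.70),  # Long, lower chunk score.
    ((0, 1, 1, 1), 0.80),  # Long, medium chunk score.
]

best_by_length = select_subsequences(candidates, 1, length_first=True)
best_by_score = select_subsequences(candidates, 1, length_first=False)
```

With length as the primary key, the longest candidate with the best score among equally long ones is kept; with score as the primary key, the short high-scoring candidate is kept instead.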

In [13]:
remove_stopwords_training = True
# layers = [0]
layers = np.arange(0, 12).tolist()
n_subsequences = 1

In [14]:
### Training ###
for pos, i in enumerate(tqdm(train_idx)):
    # Add article number to DSDM for statistics.
    memory.add_wiki_article(int(i))
    # Get text from article.
    text = wiki_dataset[int(i)]['text']
    
    # Split text into sentences.
    sentences = preprocess.split_text_into_sentences(text)
    
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        if inputs['input_ids'].shape[1] > MAXIMUM_SEQUENCE_LENGTH:
            # If the sentence is longer than the maximum no. of allowed tokens, skip it
            # ('continue' skips only this sentence; 'break' would drop the rest of the article).
            continue
        
        outputs = model(**inputs, output_attentions=True)
        attention_matrix = outputs.attentions
        
        encoding = tokenizer.encode(sentence)
        labels = tokenizer.convert_ids_to_tokens(encoding)

        i = 0
        averages_idx = []
        while i < len(labels) - 1:
            j = i + 1
            average_idx = []
            while labels[j].startswith('#'):
                average_idx.append(j)
                labels[i] += labels[j].replace('#', '')
                j += 1
            if average_idx != []:
                average_idx.append(i)
                averages_idx.append(average_idx)
            i = j
        
        # Construct multiple masks to identify uninformative tokens:
        ## i) subwords: start with '##';
        ## ii) punctuation: identified using 'string.punctuation';
        ## iii) other: uninformative characters that are not part of 'string.punctuation' (here, the en dash);
        ## iv) stop words: identified using 'stopwords' from 'nltk.corpus'.
        # Then apply OR to construct the global mask of uninformative tokens.
        hashtag_idx = np.array([label.startswith("#") for label in labels])
        stopwords_idx = np.array([label in stopwords.words('english') for label in labels])
        punctuation_idx = np.array([label in string.punctuation for label in labels])
        dash_idx = np.array([(len(label) == 1 and ord(label) == 8211) for label in labels])
        # Parameter stop words: Remove or leave in.
                
        # Remove uninformative tokens from sentence
        # by applying global mask.
        remove_idx = hashtag_idx | punctuation_idx | dash_idx
        if remove_stopwords_training:
            remove_idx |= stopwords_idx  
        labels = np.array(labels)[~remove_idx]
        # Remove '[CLS]' and '[SEP]' tokens from sentence tokens.
        labels = labels[1:(len(labels) - 1)]

        for layer in layers:
            for head in range(12):
                head_scores_raw_tensor = attention_matrix[layer][0][head].clone()
                
                # Remove self-attention matrix entries (rows & columns) of uninformative tokens.
                head_scores_raw_tensor = preprocess_attention_scores(head_scores_raw_tensor, averages_idx, remove_idx)

                head_scores_raw = head_scores_raw_tensor.numpy()
                
                # Remove entries (rows & columns) associated with '[CLS]' and '[SEP]' tokens.
                head_scores = head_scores_raw[1:(len(head_scores_raw) - 1), 1:(len(head_scores_raw) - 1)].copy()

                # Zero out entries with an attention weight
                # lower than the attention score threshold.
                head_scores[head_scores < as_threshold] = 0
                
                # Construct graph from matrix.
                G = nx.from_numpy_array(head_scores, create_using=nx.DiGraph())
                
                # Construct subsequences and calculate associated
                # chunk scores (i.e., averages of the associated attention weights).
                # ----
                # sequences: binary vector where the 
                # 1-components indicate the tokens that are part of the subsequence;
                # means: float vector with the chunk scores.
                sequences = []
                means = []
                n_tokens = len(labels)
                construct_sequences(G, n_tokens)
                
                # Construct dataframe from subsequences.
                df = pd.DataFrame(data=[sequences, means]).T.rename(columns={0: 'seq',  1: 'score'})
                    
                if len(df) > 0:
                    # Get subsequence length.
                    df['len'] = df['seq'].map(sum)
                    df['score'] = df['score'].astype('float64')
                    # Parameter: subsequence sorting: Length and then Chunk score
                    df = df.sort_values(by=['len', 'score'], ascending=[False, False]).reset_index(drop=True)
                    top3_df = df.head(n_subsequences) 
                    
                    # Save sequences w/ chunk scores to memory.
                    for i in range(len(top3_df)):
                        # Call 'generate_query' to construct token superposition.
                        memory.save(
                            inference.generate_query(
                                address_size,
                                cleanup,
                                labels[top3_df['seq'][i].astype(bool)]
                            ),
                            top3_df['score'][i]
                        )
        # If prune_mode is set, prune memory.
        memory.prune()
#     if (pos + 1) % 50 == 0:
#         remove_duplicates(memory)

100%|███████████████████████████████████████████| 20/20 [01:13<00:00,  3.68s/it]
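
The thresholding step and the chunk score (the average attention weight along a subsequence's edges) used in the training loop can be illustrated in isolation. The sketch below is a simplification under stated assumptions: it only scores a forward chain of tokens given as an explicit path, whereas the notebook explores the graph recursively with `forward_pass` and `backward_pass`. The function name `chunk_score` and the toy matrix are hypothetical.

```python
import numpy as np


def chunk_score(weights: np.ndarray, path: list, threshold: float):
    """Mean attention weight along consecutive edges of `path`.

    Returns None if the path has no edges or uses an edge whose weight
    falls below the threshold (such an edge would not exist in the graph).
    """
    # Zero out entries below the attention score threshold.
    kept = np.where(weights >= threshold, weights, 0.0)
    edge_weights = [kept[a, b] for a, b in zip(path, path[1:])]
    if not edge_weights or any(w == 0.0 for w in edge_weights):
        return None
    return round(float(np.mean(edge_weights)), 2)


# Toy 3-token attention matrix; entry [i, j] is the weight of the edge i -> j.
weights = np.array([
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.7],
    [0.3, 0.0, 0.6],
])

score_chain = chunk_score(weights, [0, 1, 2], threshold=0.5)  # Edges 0->1 and 1->2 survive.
score_gap = chunk_score(weights, [0, 2], threshold=0.5)       # Edge 0->2 is below the threshold.
```

A higher threshold removes more edges, so fewer and shorter subsequences can be formed from each head.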


In [15]:
#save_memory(cleanup, memory)

In [16]:
inference_sentences_in = [
    """Blaine was reared in a Prohibition home, and while still a young girl, she became a very active participant at temperance meetings, where she won great favor for her songs and recitations.""",
    """In 1910, she was elected to the position of organizer and lecturer of the National WCTU.""",
    """Another feature of her work was the organization of temperance mass-meetings of Sunday-school children, usually preceded by a formal parade.""",
    """With all other games played, a victory over Everton had put United top of the group on nine points.""",
    """The 2022 FA Women's League Cup Final was the 11th final of the FA Women's League Cup, England's secondary cup competition for women's football teams and its primary league cup tournament.""",
    """In 2020 Mico's single 'igare' awarded as the best song of the summer in Kiss Summer Awards.""",
    """She collected the speech and words of Dublin city and donated her collection to the Department of Irish Folklore at University College, Dublin.""",
    """Traditional palyanytsya was baked from yeast dough.""",
    """First, hops were boiled in a pot, which was then poured into a makitra, to which sifted wheat flour was added.""",
    """ Jonathan Holland of ScreenDaily deemed the film to be "superbly directed by Palomero, who seems to have a special gift for seeing the world through children's eyes." """   
]

## Inference
### Concept extraction
<ins>**Parameters:**</ins>
1. `retrieve_mode`, with values `top_k` and `pooling`:

- `top_k` returns the `k` addresses in memory that are most similar (in terms of cosine similarity) to the query, i.e., the superposition of the inference sentence. `k` can be chosen freely; it is currently set to 7. For each address, a dataframe containing the highest similarities between the address and the atomic tokens is returned. The number of atomic vectors returned can be set in the function [inference.get_similarities_to_atomic_set](https://github.com/dfichiu/ba-thesis/blob/master/src/lib/utils/inference.py).

- `pooling` returns the result of the retrieve operation (i.e., a dataframe with the tokens with the highest cosine similarity) when the memory is queried with the superposition of the inference sentence.

2. `remove_stopwords_inference`: If `True`, remove stop words from the inference sentence.
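
The idea behind `top_k` retrieval, ranking addresses by cosine similarity to a query, can be sketched with NumPy. This is an illustration only and not the implementation behind `inference.infer`; the function `top_k_addresses` and the 3-dimensional toy vectors are assumptions.

```python
import numpy as np


def top_k_addresses(addresses: np.ndarray, query: np.ndarray, k: int):
    """Return indices and cosine similarities of the k addresses most similar to the query."""
    norms = np.linalg.norm(addresses, axis=1) * np.linalg.norm(query)
    sims = addresses @ query / norms
    order = np.argsort(-sims)[:k]  # Indices sorted by descending similarity.
    return order, sims[order]


# Toy memory with four 3-dimensional addresses (real addresses have `address_size` components).
addresses = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([2.0, 1.0, 0.0])

top_idx, top_sims = top_k_addresses(addresses, query, k=2)
```

Here the third address (index 2) is the closest match, followed by the first one (index 0).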

In [17]:
retrieve_mode = "top_k"
remove_stopwords_inference = True

In [18]:
### Inference ###
# Get table with token similarities.
retrieved_contents = inference.infer(
    memory.address_size,
    cleanup,
    memory,
    inference_sentences_in,
    retrieve_mode=retrieve_mode,
    k=7,
    remove_stopwords=remove_stopwords_inference,
)

if retrieve_mode == "top_k":
    sims_df = pd.DataFrame(columns=['sentence', 'token', 'similarity']) 
    
    for s, addresses in zip(inference_sentences_in, retrieved_contents):
        display(s)
        out_tables = []
        for a in addresses:
            address_sims_df = inference.get_similarities_to_atomic_set(
                a, cleanup)
            out = widgets.Output()
            with out:
                display(address_sims_df)
            out_tables.append(out)
        display(widgets.HBox(out_tables))
elif retrieve_mode == "pooling":  
    sims_df = pd.DataFrame(columns=['sentence', 'token', 'similarity']) 
      
    for s, c in zip(inference_sentences_in, retrieved_contents):
        sentence_sims_df = inference.get_similarities_to_atomic_set(
            c, cleanup)
        sentence_sims_df['sentence'] = [s] * len(sentence_sims_df)
        sims_df = pd.concat([sims_df, sentence_sims_df])

    sims_df = sims_df.sort_values(['sentence', 'similarity'], ascending=False) \
                     .set_index(['sentence', 'token'])
    
    display(sims_df)
else:  # unrecognized
    pass


### Memory visualization
Visualize 30 randomly selected memory addresses. Here, visualizing an address means recovering the atomic tokens (with their cosine similarities) from its superposition.
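
Recovering tokens from a superposition can be sketched as follows. The example assumes random bipolar (+1/-1) atomic hypervectors and unweighted addition as the bundling operation; the actual codebook and `inference.generate_query` may be configured differently, and the names `codebook`, `cosine`, and `decode` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1000  # Same value as address_size in this notebook.

# Hypothetical codebook: one random bipolar hypervector per token.
vocab = ["church", "village", "football", "league", "cup"]
codebook = {tok: rng.choice([-1.0, 1.0], size=dim) for tok in vocab}


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def decode(superposition: np.ndarray) -> list:
    """Rank all codebook tokens by cosine similarity to the superposition."""
    sims = {tok: round(cosine(superposition, hv), 2) for tok, hv in codebook.items()}
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)


# Bundle two tokens by addition and recover them from the sum.
two_tokens = codebook["church"] + codebook["village"]
ranking = decode(two_tokens)
print(ranking)
```

Under these assumptions, the sum of k nearly orthogonal hypervectors has a cosine similarity of about 1/sqrt(k) to each of its components (roughly 0.71 for two tokens, 0.58 for three, 0.5 for four), while unrelated tokens stay close to 0. This would be consistent with the similarity values in the tables below.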

In [19]:
print(f"Number of existing memory addresses: {len(memory.addresses)}")

Number of existing memory addresses: 2226


In [20]:
print(f"Number of memory expansions: {memory.n_expansions}")

Number of memory expansions: 2226


In [21]:
print(f"Number of memory updates: {memory.n_updates}")

Number of memory updates: 2272


In [22]:
addresses = np.random.randint(0, len(memory.addresses), size=30)

for address in addresses:
    display(md(f"### <ins>Address {address}</ins>"))
    display(md(f"Address **chunk score:** {memory.scores[address][0]}, **bin score:** {memory.scores[address][1]}"))
    address_sims_df = inference.get_similarities_to_atomic_set(
            memory.addresses[address],
            cleanup,
    )
    display(address_sims_df)

### <ins>Address 2162</ins>

Address **chunk score:** 0.64, **bin score:** 0.012557531238070296

index,token,similarity
0,war,1.0
1,showing,0.1
2,foreign,0.09
3,current,0.09
4,mills,0.08
5,resulting,0.08
6,kirk,0.08
7,halle,0.08
8,signatures,0.08
9,involving,0.07


### <ins>Address 1359</ins>

Address **chunk score:** 0.72, **bin score:** 1.8851003660463284

index,token,similarity
0,association,0.51
1,italian,0.5
2,club,0.49
3,football,0.48
4,creole,0.12
5,made,0.11
6,7,0.11
7,season,0.09
8,land,0.09
9,college,0.09


### <ins>Address 190</ins>

Address **chunk score:** 0.54, **bin score:** 4.92499124146728e-06

index,token,similarity
0,pageant,0.7
1,author,0.7
2,bravo,0.1
3,broke,0.09
4,wins,0.09
5,competed,0.09
6,rewarded,0.08
7,1958,0.08
8,february,0.08
9,front,0.08


### <ins>Address 32</ins>

Address **chunk score:** 0.62, **bin score:** 1.172681302058498e-05

index,token,similarity
0,church,0.7
1,village,0.7
2,crust,0.1
3,language,0.09
4,future,0.09
5,5th,0.09
6,national,0.09
7,far,0.09
8,land,0.08
9,february,0.08


### <ins>Address 1017</ins>

Address **chunk score:** 0.65, **bin score:** 0.03156996721273739

index,token,similarity
0,asking,0.72
1,time,0.72
2,players,0.12
3,belief,0.1
4,ranking,0.09
5,marched,0.09
6,links,0.08
7,“,0.08
8,town,0.08
9,former,0.08


### <ins>Address 1279</ins>

Address **chunk score:** 0.9, **bin score:** 0.001829154674406766

index,token,similarity
0,brown,0.68
1,reddish,0.68
2,teams,0.12
3,urged,0.1
4,jonathan,0.09
5,services,0.09
6,independent,0.09
7,tennessee,0.09
8,bedford,0.08
9,tasmania,0.08


### <ins>Address 528</ins>

Address **chunk score:** 0.6, **bin score:** 4.01254919060734e-06

index,token,similarity
0,ц,0.7
1,replaces,0.7
2,double,0.1
3,music,0.1
4,connemara,0.1
5,recognized,0.1
6,spirit,0.09
7,intention,0.09
8,god,0.09
9,reacting,0.08


### <ins>Address 127</ins>

Address **chunk score:** 0.52, **bin score:** 1.000227808091334

index,token,similarity
0,elected,0.7
1,position,0.7
2,brothers,0.1
3,coffee,0.1
4,current,0.09
5,fc,0.09
6,usl,0.09
7,equity,0.09
8,central,0.09
9,wins,0.08


### <ins>Address 1124</ins>

Address **chunk score:** 1.0, **bin score:** 1.9803518019818032

index,token,similarity
0,postponed,0.58
1,december,0.55
2,following,0.55
3,burn,0.11
4,tinged,0.11
5,captured,0.1
6,resolved,0.1
7,beach,0.09
8,body,0.09
9,70th,0.09


### <ins>Address 2118</ins>

Address **chunk score:** 0.64, **bin score:** 4.107751471194021e-07

index,token,similarity
0,births,0.71
1,1999,0.71
2,professional,0.1
3,25th,0.1
4,battled,0.09
5,red,0.09
6,doubles,0.09
7,super,0.09
8,reconnaissance,0.08
9,rtve,0.07


### <ins>Address 1117</ins>

Address **chunk score:** 0.87, **bin score:** 11.66826658515901

index,token,similarity
0,ham,0.7
1,west,0.7
2,new,0.12
3,sam,0.09
4,7th,0.08
5,teenage,0.08
6,share,0.08
7,postponed,0.08
8,ceo,0.08
9,marty,0.08


### <ins>Address 1557</ins>

Address **chunk score:** 0.54, **bin score:** 2.667402067701996e-05

index,token,similarity
0,accomplishments,0.7
1,include,0.7
2,november,0.11
3,cushion,0.1
4,family,0.1
5,state,0.1
6,able,0.09
7,ceo,0.08
8,semi,0.08
9,realized,0.08


### <ins>Address 1868</ins>

Address **chunk score:** 0.86, **bin score:** 9.849129453121108e-07

index,token,similarity
0,tournament,0.69
1,history,0.69
2,return,0.1
3,television,0.09
4,illustrated,0.09
5,god,0.09
6,city,0.09
7,commissioners,0.09
8,settlement,0.09
9,alessia,0.09


### <ins>Address 698</ins>

Address **chunk score:** 0.8, **bin score:** 11.699442257658623

index,token,similarity
0,final,0.59
1,league,0.57
2,cup,0.57
3,winter,0.11
4,creating,0.1
5,allocated,0.1
6,dominance,0.09
7,saw,0.09
8,held,0.09
9,hundreds,0.08


### <ins>Address 963</ins>

Address **chunk score:** 1.0, **bin score:** 2.9989445247850584

index,token,similarity
0,leicester,0.6
1,promoted,0.6
2,newly,0.56
3,football,0.11
4,personal,0.1
5,becoming,0.1
6,deputy,0.09
7,2,0.09
8,header,0.09
9,globe,0.08


### <ins>Address 891</ins>

Address **chunk score:** 0.53, **bin score:** 0.007922498457696103

index,token,similarity
0,feed,0.69
1,ball,0.69
2,union,0.1
3,late,0.09
4,diverse,0.08
5,gal,0.08
6,fails,0.08
7,third,0.08
8,member,0.08
9,organizer,0.08


### <ins>Address 1237</ins>

Address **chunk score:** 0.97, **bin score:** 2.272528525720796e-06

index,token,similarity
0,cards,0.71
1,yellow,0.71
2,later,0.13
3,shot,0.09
4,3,0.09
5,paaltjasker,0.08
6,2002,0.08
7,additional,0.08
8,1913,0.08
9,russo,0.08


### <ins>Address 605</ins>

Address **chunk score:** 0.93, **bin score:** 4.000096902370869

index,token,similarity
0,red,0.68
1,wolves,0.68
2,2003,0.13
3,action,0.11
4,палити,0.09
5,way,0.09
6,two,0.08
7,1st,0.08
8,leicester,0.08
9,miss,0.08


### <ins>Address 1654</ins>

Address **chunk score:** 0.52, **bin score:** 1.2840417339954335e-06

index,token,similarity
0,indigenous,0.71
1,whittier,0.71
2,roof,0.1
3,releasing,0.09
4,1913,0.09
5,rounds,0.09
6,appeal,0.09
7,covid,0.09
8,regiment,0.09
9,previously,0.08


### <ins>Address 1578</ins>

Address **chunk score:** 1.0, **bin score:** 1.0296567149306213

index,token,similarity
0,ms,0.49
1,named,0.45
2,extension,0.45
3,school,0.45
4,harvard,0.43
5,belief,0.1
6,win,0.1
7,hotspur,0.09
8,scoring,0.09
9,rules,0.09


### <ins>Address 412</ins>

Address **chunk score:** 1.0, **bin score:** 1.869423656200233

index,token,similarity
0,shooting,0.52
1,los,0.5
2,locations,0.49
3,included,0.48
4,calmly,0.1
5,screenings,0.1
6,11,0.09
7,governor,0.09
8,author,0.09
9,screenplay,0.08


### <ins>Address 1847</ins>

Address **chunk score:** 1.0, **bin score:** 0.8526459775500381

index,token,similarity
0,represents,0.43
1,team,0.42
2,hockey,0.41
3,national,0.41
4,mexico,0.4
5,field,0.38
6,moved,0.1
7,production,0.1
8,organized,0.1
9,brown,0.09


### <ins>Address 314</ins>

Address **chunk score:** 1.0, **bin score:** 0.8793271417955584

index,token,similarity
0,school,0.48
1,high,0.47
2,st,0.45
3,attended,0.44
4,louis,0.43
5,fernandez,0.1
6,4th,0.09
7,freetown,0.09
8,urged,0.09
9,1998,0.09


### <ins>Address 1052</ins>

Address **chunk score:** 0.64, **bin score:** 0.03726942149717266

index,token,similarity
0,home,0.59
1,tie,0.57
2,another,0.57
3,practitioners,0.11
4,madrid,0.1
5,executives,0.1
6,e,0.09
7,multi,0.09
8,ridge,0.09
9,subsequently,0.08


### <ins>Address 2163</ins>

Address **chunk score:** 0.74, **bin score:** 0.20780543158256887

index,token,similarity
0,battalion,0.57
1,designated,0.56
2,10th,0.56
3,committing,0.1
4,first,0.09
5,tested,0.09
6,member,0.09
7,doubles,0.09
8,alleged,0.09
9,5th,0.09


### <ins>Address 242</ins>

Address **chunk score:** 0.55, **bin score:** 1.999939153749731

index,token,similarity
0,navy,0.72
1,secretary,0.72
2,process,0.09
3,annual,0.09
4,citations,0.09
5,skinner,0.09
6,back,0.09
7,passed,0.08
8,alice,0.08
9,enlist,0.08


### <ins>Address 2102</ins>

Address **chunk score:** 0.57, **bin score:** 3.930315807920692e-06

index,token,similarity
0,team,0.59
1,national,0.57
2,part,0.55
3,yards,0.1
4,boiled,0.09
5,leone,0.09
6,briefly,0.08
7,minute,0.08
8,madrid,0.07
9,oven,0.07


### <ins>Address 1434</ins>

Address **chunk score:** 0.56, **bin score:** 4.00005357405579

index,token,similarity
0,examinations,0.72
1,passed,0.72
2,symbols,0.1
3,mm,0.1
4,person,0.1
5,8th,0.1
6,kept,0.09
7,los,0.09
8,trustees,0.08
9,medical,0.08


### <ins>Address 411</ins>

Address **chunk score:** 0.65, **bin score:** 0.004168421177385473

index,token,similarity
0,locations,0.7
1,included,0.7
2,provides,0.11
3,launched,0.1
4,reed,0.09
5,give,0.08
6,register,0.08
7,appearance,0.08
8,measures,0.08
9,commission,0.07


### <ins>Address 436</ins>

Address **chunk score:** 0.86, **bin score:** 0.0012844261832707093

index,token,similarity
0,films,0.65
1,pregnancy,0.35
2,teenage,0.35
3,upcoming,0.35
4,drama,0.34
5,spanish,0.33
6,e,0.11
7,printed,0.1
8,defeat,0.1
9,fellow,0.1


In [23]:
# import gensim.downloader as api
# from sklearn.manifold import TSNE

In [24]:
#Load pre-trained word embeddings (Word2Vec in this example)
# word_vectors = api.load("word2vec-google-news-300")

In [25]:
# %%capture
# address_embeddings = []
# address_concepts = []
# addresses = []
# bins = []
# chunk_scores = []

# for idx, address in enumerate(memory.addresses):
#     tokens = inference.get_most_similar_HVs(inference.get_similarities_to_atomic_set(address, cleanup))
#     embeddings = [word_vectors[word] for word in tokens if word in word_vectors]
#     if embeddings:
#         addresses.append(idx)
#         bins.append(memory.scores[idx, 1].item())
#         chunk_scores.append(memory.scores[idx, 0].item())
#         address_concepts.append(" ".join(tokens))
#         address_embeddings.append(sum(embeddings) / len(embeddings))

In [26]:
# reduced_embeddings = TSNE(n_components=2, random_state=42, perplexity=2).fit_transform(np.array(address_embeddings))

# df = pd.DataFrame(reduced_embeddings, columns=["Dimension 1", "Dimension 2"])
# df["Address"] = addresses
# df["Chunk"] = address_concepts
# df['Bin'] = bins
# df['Chunk-score'] = chunk_scores

In [27]:
# import plotly.express as px

# fig = px.scatter(
#     df, x="Dimension 1", y="Dimension 2",
#     text="Chunk", hover_data=["Address", "Bin", "Chunk-score"],
#     title="Memory concepts"
# )
# fig.show()