# Sliding window n-gram method

This notebook trains a DSDM instance (implemented in [src/lib/memory/DSDM.py](https://github.com/dfichiu/ba-thesis/blob/master/src/lib/memory/DSDM.py)) using the sliding window n-gram method.


The experiment currently run is the one presented in the 'Experiments' section of the thesis: training on a short passage (with stop words removed) taken from a Guardian article to check whether distance-aware representations are constructed. To track the values of the EMA threshold and the minimum cosine distance (the distance to the best-matching unit, BMU), uncomment the respective lines in the `save` method of the [DSDM class](https://github.com/dfichiu/ba-thesis/blob/master/src/lib/memory/DSDM.py).

For longer training sessions, please use the training script [src/experiments/train_SWNmemory.py](https://github.com/dfichiu/ba-thesis/blob/master/src/experiments/train_SWNmemory.py).

<ins>Note</ins> (Preprocessing): The sliding window n-gram method was developed before the method that mines Transformer self-attention. Initially, we used a preprocessing pipeline based on sklearn's `Pipeline` class. (See [preprocessing](https://github.com/dfichiu/ba-thesis/blob/master/src/lib/utils/preprocess.py).) Therefore, to keep or remove stop words, we comment out or re-enable the respective step in the pipeline.
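
The stop-word toggle described in the note can be sketched in plain Python. This is only an illustration: `STOP_WORDS`, `build_pipeline`, and `run_pipeline` are hypothetical names, the stop-word list is a tiny stand-in, and the real pipeline in `preprocess.py` is built on sklearn and may differ.

```python
from typing import Callable, List

# Hypothetical stand-in for a real stop-word list (assumption, not the notebook's list).
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "on"}

Step = Callable[[List[str]], List[str]]


def lowercase(tokens: List[str]) -> List[str]:
    """Lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def build_pipeline(keep_stopwords: bool) -> List[Step]:
    """Assemble the steps; the removal step is optional, like a commented-out stage."""
    steps: List[Step] = [lowercase]
    if not keep_stopwords:
        steps.append(remove_stopwords)
    return steps


def run_pipeline(steps: List[Step], tokens: List[str]) -> List[str]:
    """Apply each step in order."""
    for step in steps:
        tokens = step(tokens)
    return tokens


tokens = "The earthquake sent stone slabs tumbling to the ground".split()
without_stopwords = run_pipeline(build_pipeline(keep_stopwords=False), tokens)
with_stopwords = run_pipeline(build_pipeline(keep_stopwords=True), tokens)
```

Making the step a flag instead of a commented-out line keeps both variants reproducible from the same code.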

In [1]:
import sys
import os

# Get the absolute path of the parent directory.
parent_dir = os.path.abspath(os.path.join(os.path.dirname("__file__"), ".."))

# Add the parent directory to the system path to be able to import modules from 'lib.'
sys.path.append(parent_dir)

In [2]:
import datasets

from IPython.display import HTML, Markdown as md
import itertools

from lib.memory import DSDM
from lib.utils import cleanup, configs, inference, learning, preprocess, utils 

import math
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import random

import pandas as pd
import pathlib

import torch
import torchhd as thd
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F 

from tqdm import tqdm
# Type checking
import typing

[nltk_data] Downloading package punkt to
[nltk_data]     /nfs/home/dfichiu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /nfs/home/dfichiu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /nfs/home/dfichiu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# Load Wikipedia dataset.
# TODO: Split between server and local.
#wiki_dataset = datasets.load_dataset("wikipedia", "20220301.en")['train']
wiki_dataset = datasets.load_dataset(
    "wikipedia",
    "20220301.en",
    cache_dir="/nfs/data/projects/daniela")['train']

Found cached dataset wikipedia (/nfs/data/projects/daniela/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)


  0%|          | 0/1 [00:00<?, ?it/s]

In [4]:
# Set device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set seed.
utils.fix_seed(41)

Using seed: 41

In [5]:
# Set DSDM hyperparameters.
address_size = 1000

ema_time_period = 50
learning_rate_update = 0.5

temperature = 0.05

normalize = False

# Pruning
prune_mode = None
max_size_address_space = 4000
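
How `ema_time_period` might drive the update-versus-expansion decision can be sketched as follows. This is an assumption-laden illustration, not the logic in `DSDM.py`: the smoothing factor `2 / (time_period + 1)`, the rule "expand when the to-BMU distance exceeds the EMA threshold", and the helper names `ema_update` and `decide` are all hypothetical.

```python
from typing import List


def ema_update(previous_ema: float, new_value: float, time_period: int) -> float:
    """Exponential moving average with the common smoothing factor 2 / (N + 1)."""
    alpha = 2.0 / (time_period + 1)
    return alpha * new_value + (1.0 - alpha) * previous_ema


def decide(min_distance: float, ema_threshold: float) -> str:
    """Assumed rule: a chunk far from its best-matching address creates a new address."""
    return "expand" if min_distance > ema_threshold else "update"


ema = 0.9  # hypothetical starting threshold
decisions: List[str] = []
for distance in [0.95, 0.40, 0.92, 0.10]:  # made-up to-BMU cosine distances
    decisions.append(decide(distance, ema))
    ema = ema_update(ema, distance, time_period=50)
```

With a time period of 50 the threshold moves slowly, so a single unusually close or distant chunk barely shifts it.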

In [6]:
# N-gram size
chunk_sizes = [5]
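
A minimal sketch of what a sliding window of size 5 produces for one tokenized sentence. It assumes a stride of one token and that a sentence shorter than the window forms a single chunk; `sliding_window_ngrams` is a hypothetical helper, not the function used in `lib.utils.learning`.

```python
from typing import List


def sliding_window_ngrams(tokens: List[str], n: int) -> List[List[str]]:
    """Return all contiguous windows of n tokens, moving one token at a time."""
    if len(tokens) < n:
        # Assumption: a short sentence is kept as one (shorter) chunk.
        return [tokens] if tokens else []
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


tokens = ["police", "vans", "ambulances", "fire", "trucks", "going", "centre"]
windows = sliding_window_ngrams(tokens, n=5)
```

Neighbouring windows share four of their five tokens, which is what lets nearby words end up in similar memory addresses.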

In [7]:
# Initialize codebook, i.e., class that saves token - atomic hypervector associations.
cleanup = cleanup.Cleanup(address_size)
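
Conceptually, a codebook maps each token to a fixed random hypervector that is generated the first time the token is seen. The sketch below uses random bipolar vectors and the standard library only; the `Codebook` class and its `add` method are hypothetical and do not mirror the interface of `cleanup.Cleanup`.

```python
import random
from typing import Dict, List


class Codebook:
    """Stores one random bipolar hypervector per known token."""

    def __init__(self, dim: int, seed: int = 0) -> None:
        self.dim = dim
        self.items: Dict[str, List[int]] = {}
        self._rng = random.Random(seed)

    def add(self, token: str) -> List[int]:
        """Return the token's hypervector, generating it only if the token is new."""
        if token not in self.items:
            self.items[token] = [self._rng.choice((-1, 1)) for _ in range(self.dim)]
        return self.items[token]


codebook = Codebook(dim=1000)
for token in ["police", "vans", "police"]:
    codebook.add(token)
```

A repeated token reuses its existing vector, so every occurrence of a word contributes the same atomic representation.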

In [8]:
# Initialize memory.
memory = DSDM.DSDM(
    address_size=address_size,
    ema_time_period=ema_time_period,
    learning_rate_update=learning_rate_update,
    temperature=temperature,
    normalize=normalize,
    prune_mode=prune_mode,
    max_size_address_space=max_size_address_space
) 
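
To give a feel for the `temperature` value, the sketch below turns distances into weights with a softmin. That the memory weights addresses this way is an assumption made here for illustration; `softmin_weights` is a hypothetical helper and the distances are made up.

```python
import math
from typing import List


def softmin_weights(distances: List[float], temperature: float) -> List[float]:
    """Softmax over negative distances: closer addresses get larger weights."""
    scaled = [-d / temperature for d in distances]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


distances = [0.10, 0.20, 0.60]
sharp = softmin_weights(distances, temperature=0.05)
flat = softmin_weights(distances, temperature=1.0)
```

A low temperature such as 0.05 concentrates almost all weight on the closest address, while a temperature of 1.0 spreads it much more evenly.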

In [9]:
# Construct train set (texts) and inference set (sentences; in and out of train set text).
train_size = 10  # Parameter: number of randomly sampled train articles

train_idx = np.random.randint(0, len(wiki_dataset) - 1000, size=1000000)
# Select train articles.
train_idx = train_idx[:train_size]
# Manually add the articles from which the in-set inference sentences were selected.
train_idx = np.append(np.array([6458629, 6458633, 6458645, 6458648, 6458659, 6458664, 6458665,
   6458667, 6458668, 6458573]), train_idx)

# # Text indices
# train_idx = np.random.randint(0, len(wiki_dataset) - 100, size=train_size)
# # Generate and append train articles present in all experiments.
# intest_idx = np.random.randint(len(wiki_dataset) - 100, len(wiki_dataset), size=20)
# _set = list(set(intest_idx))
# intest_idx = np.array(_set)[: len(_set) // 2]
# outtest_idx = np.array(_set)[len(_set) // 2 :]
# train_idx = np.append(train_idx, intest_idx)

# Text indices from which we extract sentences.
# intest_idx = np.random.choice(train_idx, test_size)
# outtest_idx = np.random.choice(np.setdiff1d(np.arange(len(wiki_dataset)), train_idx), test_size)

In [10]:
# inference_sentences_in = []
# inference_sentences_out = []

# for idx_in, idx_out in zip(intest_idx, outtest_idx):
#     # Get sentences.
#     sentences_in = utils.preprocess.split_text_into_sentences(wiki_dataset[int(idx_in)]['text'])
#     sentences_out = utils.preprocess.split_text_into_sentences(wiki_dataset[int(idx_out)]['text'])
    
#     # Get sentence index.
#     sentence_idx_in = int(
#         np.random.randint(
#             0,
#             len(sentences_in),
#             size=1
#         ).item()
#     )
#     sentence_idx_out = int(
#         np.random.randint(
#             0,
#             len(sentences_out),
#             size=1
#         ).item()
#     )

#     # Append sentence to list.
#     inference_sentences_in.append(sentences_in[sentence_idx_in])
#     inference_sentences_out.append(sentences_out[sentence_idx_out])

In [11]:
# Remove duplicates
remove_dups = False

In [12]:
### Remove duplicates ###
dups_found = 0

def remove_duplicates(memory):
    """Remove near-duplicate addresses from a DSDM object.
    
    For each address that is still kept, remove every other address whose (cosine)
    similarity to it is 0.8 or higher.
    
    Implemented with a global keep mask that is updated for each address using a logical 'and'.
    """
    
    global dups_found
    global_keep_mask = torch.tensor([True] * len(memory.addresses)).to(device)
    
    for idx, address in enumerate(memory.addresses):
        if global_keep_mask[idx].item():
            cos = torch.nn.CosineSimilarity()
            keep_mask = cos(memory.addresses, address) < 0.8
            # Keep current address
            keep_mask[idx] = True
            global_keep_mask &= keep_mask

    if global_keep_mask.sum().item() > 0:
        dups_found += len(global_keep_mask) - global_keep_mask.sum().item()
        # Remove similar addresses
        memory.addresses = memory.addresses[global_keep_mask]
        # Remove bins & chunk scores
        memory.scores = memory.scores[global_keep_mask]
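
The masking logic above amounts to a greedy rule: an address survives only if it is not too similar to any address kept before it. A minimal pure-Python sketch of that rule on toy 2-D vectors follows; `cosine` and `greedy_dedupe` are hypothetical helpers, and the vectors are made up.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_dedupe(vectors: List[Sequence[float]], threshold: float) -> List[int]:
    """Return indices of vectors kept; later near-duplicates of kept vectors are dropped."""
    kept: List[int] = []
    for idx, vec in enumerate(vectors):
        if all(cosine(vec, vectors[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.1, 1.0]]
kept = greedy_dedupe(vectors, threshold=0.8)
```

Here the second vector is dropped as a near-duplicate of the first and the fourth as a near-duplicate of the third, so the earliest member of each group is the one that remains.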

In [13]:
### Training ###
for i in tqdm(train_idx):
    text = wiki_dataset[int(i)]['text']
    
    # Piece of text extracted from a Guardian article.
    text = """
    Images showed widespread destruction across Marrakech, a Unesco world heritage site where newer apartment complexes on the edge of the sprawling city border a network of alleyways shaded by historic and ornate buildings.
    The earthquake, its magnitude estimated at 6.8, sent stone slabs tumbling to the ground, creating piles of rubble in the streets as terrified people fled and spent the night on the pavement and in squares, afraid to return to their homes.
    As Saturday dawned, Shonibare described how people rushed to check on their neighbours amid confusion and fear about whether to remain outdoors or shelter from potential aftershocks and the soaring early September temperatures.
    It’s hard to get a sense of how things are being managed. So far we have been seeing police vans, ambulances and fire trucks going to the centre. It seems they are doing the best they can but I do not know if anyone fully knows the extent of the damage their neighbours amid confusion and fear about whether to remain outdoors."""
    
    # Preprocess data. 
    sentences_tokens = preprocess.preprocess_text(text)
    for sentence_tokens in sentences_tokens:
        # Generate atomic HVs for unknown tokens.
        learning.generate_atomic_HVs_from_tokens_and_add_them_to_cleanup(
            memory.address_size,
            cleanup,
            sentence_tokens
        )
        
        # Learning: Construct the chunks of each sentence and save them to memory.
        learning.generate_chunk_representations_and_save_them_to_memory(
            memory.address_size,
            cleanup,
            memory,
            sentence_tokens,
            chunk_sizes=chunk_sizes
        )
    if remove_dups:
        remove_duplicates(memory)
    # Break in order to train only the piece of text.
    break 

  0%|                                                    | 0/20 [00:00<?, ?it/s]


In [14]:
# inference_sentences_in = ['Dagored', 'is an Italian', 'record labels', 'based in Firenze', 'formed', 'in 1998.'] 250, 0.05 temperature

In [15]:
### Old idea: Divide and conquer to improve inference. ###
# def score_partition(input_partition, output_partition):
#     # Note: What if a sentence contains the same word multiple times? This is why using 'set' is a bad idea!
#     set_query = set(preprocess.remove_stopwords(tokens)[0]) 
#     set_content = inference.get_most_similar_HVs(sentence_sims_df, delta_threshold=0.1)

#     set_input = set(input_partition)
#     set_output = set(output_partition)
    
#     score = len(set_input.intersection(set_output)) / len(set_input)
#     return score




# def divide_and_conquer(token_partitions: typing.List[typing.List[str]]):
#     retrieve_mode = "pooling"
    
#     for tp in token_partitions:
#         retrieved_content = inference.infer(
#             memory.address_size,
#             cleanup,
#             memory,
#             [tp],
#             retrieve_mode=retrieve_mode
#         )
#         output_tokens = inference.get_most_similar_HVs(
#             inference.get_similarities_to_atomic_set(
#                 retrieved_contents[0],
#                 cleanup,
#             ),
#             delta_threshold=0.1
#         )
#         score = score_partition(tp, output_tokens)

#     display(score)
#     if score == 1:
#         return tokens
#     else:
#         return max(score, divide_and_conquer())
    

In [16]:
# retrieve_mode = "top_k"

# # Get table with token similarities for each "out-of-train" sentence.
# retrieved_contents = inference.infer(
#     memory.address_size,
#     cleanup,
#     memory,
#     inference_sentences_in,
#     retrieve_mode=retrieve_mode,
#     k=3, #TODO: What if index is out of range?
# )

# if retrieve_mode == "top_k":
#     sims_df = pd.DataFrame(columns=['sentence', 'token', 'similarity']) 
    
#     for s, addresses in zip(inference_sentences_in, retrieved_contents):
#         display(s)
#         for a in addresses:
#             address_sims_df = inference.get_similarities_to_atomic_set(
#                 a, cleanup)
#             display(address_sims_df)
# elif retrieve_mode == "pooling":  
#     sims_df = pd.DataFrame(columns=['sentence', 'token', 'similarity']) 
      
#     for s, c in zip(inference_sentences_in, retrieved_contents):
#         sentence_sims_df = inference.get_similarities_to_atomic_set(
#             c, cleanup)
#         sentence_sims_df['sentence'] = [s] * len(sentence_sims_df)
#         sims_df = pd.concat([sims_df, sentence_sims_df])

#     sims_df = sims_df.sort_values(['sentence', 'similarity'], ascending=False) \
#                      .set_index(['sentence', 'token'])
    
#     display(sims_df)
# else:  # unrecognized
#     pass

## Memory visualization

In [17]:
# addresses = np.random.randint(0, len(memory.addresses), size=30)
# addresses = [244, 245, 246, 247]
addresses = [56, 55, 54, 53, 52, 51]
for address in addresses:
    display(md(f"### Address {address}"))
    address_sims_df = inference.get_similarities_to_atomic_set(
            memory.addresses[address],
            cleanup,
    )
    display(address_sims_df)

### Address 56

| | token | similarity |
|---|---|---|
| 0 | extent | 0.49 |
| 1 | damage | 0.48 |
| 2 | neighbours | 0.46 |
| 3 | amid | 0.36 |
| 4 | confusion | 0.27 |
| 5 | knows | 0.24 |
| 6 | fully | 0.19 |
| 7 | earthquake | 0.08 |
| 8 | rubble | 0.07 |
| 9 | return | 0.05 |


### Address 55

| | token | similarity |
|---|---|---|
| 0 | fully | 0.51 |
| 1 | anyone | 0.5 |
| 2 | knows | 0.47 |
| 3 | extent | 0.42 |
| 4 | know | 0.27 |
| 5 | damage | 0.24 |
| 6 | best | 0.18 |
| 7 | shonibare | 0.07 |
| 8 | ornate | 0.06 |
| 9 | shelter | 0.06 |


### Address 54

| | token | similarity |
|---|---|---|
| 0 | ambulances | 0.53 |
| 1 | trucks | 0.5 |
| 2 | fire | 0.47 |
| 3 | going | 0.38 |
| 4 | centre | 0.28 |
| 5 | vans | 0.28 |
| 6 | police | 0.13 |
| 7 | return | 0.1 |
| 8 | dawned | 0.1 |
| 9 | city | 0.09 |


### Address 53

| | token | similarity |
|---|---|---|
| 0 | vans | 0.5 |
| 1 | ambulances | 0.48 |
| 2 | police | 0.48 |
| 3 | seeing | 0.48 |
| 4 | far | 0.27 |
| 5 | fire | 0.22 |
| 6 | return | 0.09 |
| 7 | buildings | 0.06 |
| 8 | dawned | 0.06 |
| 9 | soaring | 0.05 |


### Address 52

| | token | similarity |
|---|---|---|
| 0 | get | 0.52 |
| 1 | things | 0.47 |
| 2 | hard | 0.44 |
| 3 | sense | 0.44 |
| 4 | managed | 0.25 |
| 5 | ’ | 0.21 |
| 6 | ornate | 0.07 |
| 7 | magnitude | 0.06 |
| 8 | centre | 0.06 |
| 9 | piles | 0.05 |


### Address 51

| | token | similarity |
|---|---|---|
| 0 | aftershocks | 0.47 |
| 1 | early | 0.45 |
| 2 | september | 0.45 |
| 3 | temperatures | 0.43 |
| 4 | soaring | 0.43 |
| 5 | whether | 0.08 |
| 6 | managed | 0.07 |
| 7 | rushed | 0.06 |
| 8 | ground | 0.05 |
| 9 | know | 0.04 |
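
The similarity pattern in these tables can be reproduced in a toy setting. The sketch assumes that a chunk is represented by the element-wise sum (bundle) of its tokens' random bipolar hypervectors and that an address is read out by cosine similarity against every codebook entry. It omits the memory's update step, and all names in it are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence

DIM = 1000
rng = random.Random(0)


def random_hv() -> List[int]:
    """Random bipolar hypervector (assumed encoding)."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


vocabulary = ["aftershocks", "soaring", "early", "september", "temperatures", "police", "vans"]
codebook: Dict[str, List[int]] = {token: random_hv() for token in vocabulary}

# Bundle the five tokens of one chunk into a single address.
chunk = vocabulary[:5]
address = [sum(column) for column in zip(*(codebook[token] for token in chunk))]

# Read the address out by comparing it with every atomic hypervector.
similarities = {token: cosine(address, hv) for token, hv in codebook.items()}
ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
```

For a bundle of five near-orthogonal vectors, each member's expected similarity is about 1/sqrt(5) ≈ 0.45, while unrelated tokens stay close to zero. This is consistent with the five leading tokens of address 51 above.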


In [18]:
print(f"Updates percentage: {round(100 * memory.n_updates / (memory.n_updates + memory.n_expansions), 1)}%")

Updates percentage: 23.0%


In [19]:
print(f"Number of memory updates: {memory.n_updates}")

Number of memory updates: 17


In [20]:
print(f"Number of memory expansions: {memory.n_expansions}")

Number of memory expansions: 57


In [21]:
print(f"Number of existing memory addresses: {len(memory.addresses)}")

Number of existing memory addresses: 57


In [22]:
memory.n_deletions

0

In [23]:
dups_found

0