In [None]:
import os
import numpy as np
import pandas as pd
import math
import pickle
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

from transformers import pipeline
import torch

import nlpsig

from signax import signature

from load_data import data_folder, seed, corpus_df, english_train, corpus_sample_df

# Anomaly Detection task

In natural language processing (NLP) tasks, we often deal with high dimensional streams of data. Neural network architectures known as Transformers have been shown to be very effective in NLP tasks, and we can use them to obtain high dimensional streams of embeddings for words or tokenised text. In this notebook, we look at how path signature techniques can be used to analyse these streams of embeddings. In particular, we perform outlier detection on the path signatures of the embeddings obtained from a pre-trained Transformer.

More concretely, we consider the task of determining whether or not a word is an English word by using the path signature of the stream of _character_ embeddings of the word. That is, each word is represented as a stream of character embeddings. We do this by training a Transformer model (from scratch) on a _masked language modelling_ task (or _Cloze_ task), as described in [[1]](https://arxiv.org/abs/1810.04805), using a corpus of English words, and then using this model to obtain a stream of (character) embeddings for a sample of (English and non-English) words. Path signature techniques are applied to analyse the streams of embeddings, and finally we attempt to detect the non-English words as outliers in the space of path signatures.

Since we are dealing with high dimensional streams of data, we will use a dimension reduction technique to reduce the dimension of the embeddings before computing the path signature. We will look at how we can perform outlier detection on the path signatures of the dimension-reduced embeddings.

The pipeline for this task is as follows:
1. Train a Transformer model on a masked language modelling task using a corpus of English words.
2. Obtain a stream of character embeddings for a sample of English and non-English words using the trained model (note that we ensure that the English words in this sample are not in the training corpus used to pre-train the Transformer).
    - The English words in this sample are our _inlier_ class while the non-English words are our _outlier_ class in this example.
3. Perform dimension reduction on the streams of embeddings.
4. Compute the path signature of the dimension-reduced embeddings.
5. Perform outlier detection on the path signatures to detect the non-English words.

## `nlpsig` library

In this notebook, we illustrate how we can use the [`nlpsig`](https://github.com/datasig-ac-uk/nlpsig) package to utilise transformers in order to obtain streams of high dimensional embeddings, which can then be analysed using path signature techniques.

## Language dataset

In the `data/` folder, we have several text files of words from different languages:
- `wordlist_de.txt`: German words
- `wordlist_en.txt`: English words
- `wordlist_fr.txt`: French words
- `wordlist_it.txt`: Italian words
- `wordlist_pl.txt`: Polish words
- `wordlist_sv.txt`: Swedish words

We additionally have an `alphabet.txt` file which just stores the alphabet characters ('a', 'b', 'c', ...).

The task is to split each word into its individual characters and to obtain an embedding for each of them. We can represent a word by a path of its character embeddings and compute its path signature to use as features in predicting the language to which the word belongs.
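As a minimal sketch of what "a word as a path of character embeddings" means, the snippet below stacks one vector per character into an array of shape `[number of characters, embedding dimension]`. The `embedding_table` here is a hypothetical lookup of random 4-dimensional vectors, used purely for illustration; in this notebook the embeddings come from a Transformer instead.

```python
import numpy as np

# Hypothetical embedding table: one random 4-dimensional vector per letter.
rng = np.random.default_rng(0)
alphabet = "abcdefghijklmnopqrstuvwxyz"
embedding_table = {char: rng.normal(size=4) for char in alphabet}


def word_to_path(word):
    """Stack the embedding of each character into an array of shape [len(word), 4]."""
    return np.stack([embedding_table[char] for char in word])


path = word_to_path("signal")
print(path.shape)  # (6, 4): one row per character, one column per embedding dimension
```

Unlike this static lookup, a Transformer produces embeddings that depend on the surrounding characters, so the same letter can map to different vectors in different words.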

Here we look at obtaining embeddings using a Transformer model.

In [None]:
data_folder = "data"

## Prepare training data and test data

We prepare our data in the `load_data.py` script, so look in there for more details.

Our test data will consist of a sample of 10000 English words and 10000 non-English words (2000 from each of the remaining languages). We will use the remaining English words as our training data to train the Transformer model. We can see that in the original full corpus there are relatively fewer English words than words from the other languages:

In [None]:
corpus_df["language"].value_counts()

We are going to train our language model on English words only, and these are stored in `english_train`:

In [None]:
english_train

To make the dataset a bit more manageable, we just take a sample from each of the languages. In our resulting corpus, we have an equal number of English words (inliers) and non-English words (outliers):

In [None]:
corpus_sample_df["language"].value_counts()

In [None]:
corpus_sample_df.head()

## Training a language model

We want to train a masked language model on our corpus of English words. In particular, we mask out particular letters and ask our model to try to predict the masked letter.
We do this using the `nlpsig.TextEncoder` class, which provides a wrapper around the `transformers` library. The training itself was done in a separate notebook.

## Evaluating trained model

We now evaluate the performance of the model on predicting masked letters in the test dataset. To do this, for each word in our test dataset, we mask each letter on its own and ask the model to predict the masked letter. So for a 5 letter word, we have 5 predictions to make: one for each letter given the other letters.
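To make the masking scheme concrete, here is a small sketch that builds one masked copy of a word per character position. The helper name `mask_each_character` is introduced only for this illustration; the mask token `<mask>` matches the one used in the evaluation function further down.

```python
def mask_each_character(word, mask_token="<mask>"):
    """Return one masked copy of `word` per character position."""
    return [word[:i] + mask_token + word[i + 1:] for i in range(len(word))]


print(mask_each_character("hello"))
# ['<mask>ello', 'h<mask>llo', 'he<mask>lo', 'hel<mask>o', 'hell<mask>']
```

Each of these strings is then passed to the model, and a prediction counts as correct only if filling the mask recovers the original word.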

For our tokenizer, we see that `<mask>` is used as the mask token:

In [None]:
model_name = "english-char-bert"

In [None]:
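# the first encoder holds the English training words; the mixed
# English / non-English sample gets its own encoder later on
text_encoder = nlpsig.TextEncoder(
    df=english_train,
    feature_name="word",
    model_name=model_name,
)
text_encoder.load_pretrained_model()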

In [None]:
text_encoder.tokenizer.special_tokens_map

In [None]:
def compute_masked_character_accuracy(fill_mask, words):
    # for each word, mask one character at a time and check whether the
    # top prediction of the fill-mask pipeline recovers the original word
    was_correct = []
    print(f"Evaluating with {len(words)} words")
    for word in tqdm(words):
        masked_strings = [word[:i] + '<mask>' + word[i+1:] for i in range(len(word))]
        predictions = [fill_mask(masked)[0]['sequence'] for masked in masked_strings]
        was_correct += [pred == word for pred in predictions]

    # fraction of masked characters that were predicted correctly
    acc = np.sum(was_correct) / len(was_correct)
    print(f"Accuracy: {acc}")
    return acc

In [None]:
fill_mask = pipeline("fill-mask",
                     model=model_name,
                     tokenizer=model_name)

compute_masked_character_accuracy(fill_mask, text_encoder.dataset_split["test"]["word"])

We can see that we have a 75% accuracy on this masked language modelling task.

## Obtaining token and word embeddings

There are many ways in which one can get embeddings from the transformer network, since the model exposes the output of every layer of the network. A few ways are:

- Returning the output of a particular hidden layer
    - use `.obtain_embeddings(method = "hidden_layer", layers = l)` where `l` is the layer you want
    - If no layer is requested, it will just give you the second-to-last hidden layer of the transformer network.
- Concatenate the output of several hidden layers
    - use `.obtain_embeddings(method = "concatenate", layers = [l_1, l_2, ...])` where `[l_1, l_2, ...]` is a list of layers you want to concatenate
- Element-wise sum the output of several hidden layers
    - use `.obtain_embeddings(method = "sum" , layers = [l_1, l_2, ...])` where `[l_1, l_2, ...]` is a list of layers you want to sum
- Mean the output of several hidden layers
    - use `.obtain_embeddings(method = "mean" , layers = [l_1, l_2, ...])` where `[l_1, l_2, ...]` is a list of layers you want to mean

Each of these methods will return a 2-dimensional array with dimensions `[token, embedding]`.

If you need a more custom way to obtain embeddings from the hidden layers, you can specify which layers you want and they will be returned as they are (i.e. using `.obtain_embeddings(method = "hidden_layer", layers = [l_1, l_2, ...])` where `[l_1, l_2, ...]` is a list of hidden layers you want). In that case the output is a 3-dimensional array with dimensions `[layer, token, embedding]`, which you would then need to combine yourself so that you have a single embedding for each token.
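The layer-combination options above can be sketched with plain NumPy. The array `hidden_states` below is a hypothetical stand-in for a `[layer, token, embedding]` output (3 layers, 5 tokens, 8 dimensions); it is not produced by `nlpsig` and is only meant to show how each option reduces the layer axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stack of hidden states: 3 layers, 5 tokens, 8 embedding dimensions.
hidden_states = rng.normal(size=(3, 5, 8))

# Concatenate the layers along the embedding axis: [token, layer * embedding].
concatenated = np.concatenate(list(hidden_states), axis=-1)
# Element-wise sum and mean over the layer axis: [token, embedding].
summed = hidden_states.sum(axis=0)
averaged = hidden_states.mean(axis=0)

print(concatenated.shape, summed.shape, averaged.shape)  # (5, 24) (5, 8) (5, 8)
```

Concatenation keeps all the information but multiplies the embedding dimension by the number of layers, whereas summing or averaging keeps the dimension fixed.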

Note that because we loaded a pre-trained model above with `.load_pretrained_model()` (rather than initialising one from a config, which would have random weights), we can obtain token embeddings directly with the `.obtain_embeddings()` method without needing to train the model in this notebook. We will do the same later when obtaining embeddings for the words in `corpus_sample_df`.

In the below, we just obtain the last hidden layer of the network (the 6th one in this case).

In [None]:
# setting the model to CPU (might not be always necessary to run this)
text_encoder.model.to('cpu')
english_token_embeddings = text_encoder.obtain_embeddings(method="hidden_layer", layers=6)

By inspecting the shape of this, we can see that we have a 2-dimensional array with dimensions `[token, embedding]` where the embeddings are 768 dimensional in this network.

In [None]:
english_token_embeddings.shape

Now that we have token embeddings for each text, it is possible to pool these embeddings to obtain an embedding for the full text (in this case, this embedding would represent the word itself). We can use the `.pool_token_embeddings()` method for doing this.

Again, there are several methods and full details can be found in the documentation, but a few are:

- Taking the mean of the token embeddings
    - use `.pool_token_embeddings(method = "mean")`
- Taking the element-wise max of the token embeddings
    - use `.pool_token_embeddings(method = "max")`
- Taking the element-wise sum of the token embeddings
    - use `.pool_token_embeddings(method = "sum")`
- Taking the token-embedding for the CLS token (a special token that is used in some transformers like BERT and RoBERTa)
    - but this is only available to us if we set `skip_special_tokens=False` when tokenizing the text with `.tokenize_text()` method (note by default, this is set to `True` and so we don't have access to this method here)
    - use `.pool_token_embeddings(method = "cls")`
        - note this will produce an error if the CLS token is not available...
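As an illustration of what the mean, max and sum pooling options compute, the sketch below pools a small hand-written array of token embeddings with NumPy. The array `token_embeddings` is a made-up example for a single 4-character word with 3-dimensional embeddings, not output from the model.

```python
import numpy as np

# Hypothetical token embeddings for one 4-character word, 3 dimensions each.
token_embeddings = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 1.0, 0.0],
    [0.0, 5.0, 1.0],
    [0.0, 2.0, 1.0],
])

# Each pooling collapses the token axis and leaves one vector per word.
mean_pooled = token_embeddings.mean(axis=0)  # [1.0, 2.0, 1.0]
max_pooled = token_embeddings.max(axis=0)    # [3.0, 5.0, 2.0]
sum_pooled = token_embeddings.sum(axis=0)    # [4.0, 8.0, 4.0]

print(mean_pooled, max_pooled, sum_pooled)
```

Note that pooling discards the order of the characters, which is precisely the information that the path signature approach in this notebook tries to retain.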

For example, to pool the character embeddings by taking the mean of the token embeddings:

In [None]:
pooled_english_mean = text_encoder.pool_token_embeddings()

Again, we can inspect the shape and we can see that we have embeddings for each of our words:

In [None]:
pooled_english_mean.shape

## Dimension reduction

We can perform dimension reduction with `nlpsig` using the `DimReduce` class. Here, we will use Gaussian Random Projections (implemented using [`scikit-learn`](https://scikit-learn.org/stable/modules/random_projection.html)) by setting `method="gaussian_random_projection"`, but there are other standard methods available:
- UMAP [[3]](https://arxiv.org/abs/1802.03426) (implemented using the [`umap-learn`](https://umap-learn.readthedocs.io/en/latest/api.html))
    - `method="umap"`
- PCA [[4]](http://www.miketipping.com/papers/met-mppca.pdf) (implemented using [`scikit-learn`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html))
    - `method="pca"`
- TSNE [[5]](https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) (implemented using [`scikit-learn`](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html))
    - `method="tsne"`
- Post Processing Algorithm (PPA) with PCA (PPA-PCA) [[6]](https://arxiv.org/abs/1702.01417)
    - `method="ppapca"`
- PPA-PCA-PPA [[7]](https://aclanthology.org/W19-4328/)
    - `method="ppapacppa"`
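To give some intuition for Gaussian random projection, here is a minimal NumPy sketch. It draws a random matrix with entries from a normal distribution with variance `1 / n_components` (the scaling used by scikit-learn's `GaussianRandomProjection`) and multiplies the embeddings by it. The helper names `fit_gaussian_projection` and `project` and the random input arrays are assumptions made for this illustration; this is not the `nlpsig` implementation.

```python
import numpy as np


def fit_gaussian_projection(n_features, n_components, seed=0):
    """Draw a random projection matrix of shape [n_components, n_features]."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(n_components)
    return rng.normal(0.0, scale, size=(n_components, n_features))


def project(embeddings, components):
    """Map embeddings of shape [n, n_features] to shape [n, n_components]."""
    return embeddings @ components.T


rng = np.random.default_rng(1)
train_embeddings = rng.normal(size=(100, 768))  # stand-in for token embeddings
new_embeddings = rng.normal(size=(10, 768))     # stand-in for unseen tokens

components = fit_gaussian_projection(n_features=768, n_components=25)
print(project(train_embeddings, components).shape)  # (100, 25)
print(project(new_embeddings, components).shape)    # (10, 25)
```

The projection matrix does not depend on the data values, only on the input dimension and the seed, which is why the same matrix can be reused to reduce new embeddings consistently.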

In [None]:
reduction = nlpsig.DimReduce(
    method="gaussian_random_projection",
    n_components=25,
)

english_token_embeddings_reduced = reduction.fit_transform(english_token_embeddings, random_state=seed)

In [None]:
english_token_embeddings_reduced.shape

We can save these embeddings for later use:

In [None]:
with open("english_token_embeddings.pkl", 'wb') as f:
    pickle.dump(english_token_embeddings, f)
with open("english_reduced_token_embeddings.pkl", 'wb') as f:
    pickle.dump(english_token_embeddings_reduced, f)

As we have embeddings for each token, we can obtain a path for each word by constructing a path of the token embeddings. To do this, we can use the `PrepareData` class and pass in our tokenized dataframe (the dataframe which has a row for each token in our data, along with the corresponding id of the word it belongs to, saved in the `text_id` column).

We pass in the column which defines the ids, `text_id`, the token embeddings and the dimension-reduced embeddings. A column of labels can also be passed via `label_column`; we do this later for `corpus_sample_df`, which has a `language` column.

In [None]:
english_dataset = nlpsig.PrepareData(
    text_encoder.tokenized_df,
    id_column="text_id",
    embeddings=english_token_embeddings,
    embeddings_reduced=english_token_embeddings_reduced
)

The class concatenates the embeddings and the dimension-reduced embeddings that are passed in at initialisation and stores the result in the `.df` attribute of `english_dataset`.

Here, the columns beginning with `d` denote the dimensions of the dimension reduced transformer embeddings, whereas the columns beginning with `e` denote the dimensions of embeddings obtained from the transformer.

Furthermore, we can see from the printed out information that a `timeline_index` column was added to the dataframe, which is the last column here:

In [None]:
english_dataset.df

We can construct a path by using the `.pad()` method, and the result of this is a multi-dimensional array or tensor (in particular a NumPy array or PyTorch tensor) which can then be used in some downstream task. It is called "pad" because arrays and tensors are rectangular, and if there are cases where there isn't enough data (e.g. if a word only has 3 letters/tokens and we want to make paths of length 4), we "pad" with either the last token embedding (if we set `zero_padding=False`) or with a vector of zeros (if we set `zero_padding=True`).

Here, we construct paths by setting the length of the paths (we call this method `k_last` in the code and we have to specify the length with `k=50`, the maximum sequence length that we used when defining the transformer model).

Alternatively, we can pad every path to the length of the longest word (by setting `method="max"`). The `features` argument allows us to specify which time features we want to keep. Here we don't have any besides the index given by `timeline_index`, and we choose not to standardise it by specifying `standardise_method=[None]`.

The `pad_by` argument specifies that we are padding for each word (as each word is given a particular `text_id` in the tokenized dataframe above). There is an alternative option to construct a path by looking at the history of a particular embedding (i.e. the stream of embeddings that occurred before it), but this is not useful here and we will cover that in another notebook.
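The padding behaviour described above can be sketched as follows. This is a simplified stand-alone illustration, not the `nlpsig` implementation: the function name `pad_path` is hypothetical, it only handles a single word, and the branch that repeats the last row when `zero_padding=False` is an assumption based on the description above (in particular for padding from above).

```python
import numpy as np


def pad_path(embeddings, k, zero_padding=True, pad_from_below=True):
    """Keep the last `k` rows of `embeddings` and pad the result to exactly `k` rows."""
    embeddings = np.asarray(embeddings, dtype=float)[-k:]
    n_missing = k - len(embeddings)
    if n_missing == 0:
        return embeddings
    if zero_padding:
        filler = np.zeros((n_missing, embeddings.shape[1]))
    else:
        # Assumption: repeat the last available row as filler.
        filler = np.repeat(embeddings[-1:], n_missing, axis=0)
    pieces = [embeddings, filler] if pad_from_below else [filler, embeddings]
    return np.vstack(pieces)


# A 3-token word with 2-dimensional embeddings, padded to length 5.
word = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(pad_path(word, k=5))
# [[1. 2.]
#  [3. 4.]
#  [5. 6.]
#  [0. 0.]
#  [0. 0.]]
```

With `pad_from_below=False` the two zero rows would be placed before the first token instead of after the last one.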

In [None]:
path_specifics = {
    "pad_by": "id",
    "zero_padding": True,
    "method": "k_last",
    "k": 50,
    "features": ["timeline_index"],
    "standardise_method": [None],
    "embeddings": "dim_reduced",
    "pad_from_below": True
}

In [None]:
english_word_path = english_dataset.pad(**path_specifics)

In [None]:
english_word_path.shape

In [None]:
len(english_dataset.df["text_id"].unique())

We also store this array as a dataframe in `.df_padded` so that you can see what the columns correspond to, where columns beginning with `e` denote the dimensions of embeddings obtained from the transformer (here we have none as we only requested to keep the dimension reduced embeddings), and columns beginning with `d` denote the dimensions of the dimension reduced transformer embeddings.

We can see for the first word in the dataset (with `text_id==0`), this is a word with 10 letters and we can see how we have padded the word to length 50.

In [None]:
# still has the labels and the ids
english_dataset.df_padded[english_dataset.df_padded["text_id"]==0]

In [None]:
text_encoder.df.iloc[0]

For the padded rows, we give these a label `-1` to denote that they have been added.

Note that for padding, the method pads from below by default, but we can pad from above by setting `pad_from_below=False`.

To obtain a path as a NumPy array, we use the `.get_path()` method which by default keeps the time features and will remove the id and label columns. We make this more explicit by setting `include_features=True` here.

In [None]:
english_word_path = english_dataset.get_path(include_features=True)
english_word_path.shape

In [None]:
english_word_path[0]

## Obtaining path signatures for the english words

We use [`signax`](https://github.com/anh-tong/signax) to compute path signatures, which we compute up to depth 2 here.
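For intuition about what a depth 2 signature contains, here is a minimal NumPy sketch for a single piecewise linear path. Level 1 is the total increment of the path, and level 2 collects the iterated integrals of each pair of channels, so a path with `d` channels has a depth 2 signature with `d + d**2` terms. The function name `signature_depth2` is introduced for this illustration and the ordering of the flattened output is an assumption; it is not a replacement for `signax`.

```python
import numpy as np


def signature_depth2(path):
    """Depth 2 signature of a piecewise linear path of shape [length, channels]."""
    path = np.asarray(path, dtype=float)
    increments = np.diff(path, axis=0)       # one increment per linear segment
    level1 = increments.sum(axis=0)          # total displacement of the path
    # Displacement from the starting point at the beginning of each segment.
    start_offsets = path[:-1] - path[0]
    # Cross terms between earlier movement and the current segment, plus the
    # contribution from within each linear segment.
    level2 = start_offsets.T @ increments + 0.5 * increments.T @ increments
    return np.concatenate([level1, level2.ravel()])


# Move one unit in the first channel, then one unit in the second channel.
path = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
print(signature_depth2(path))
# [1.  1.  0.5 1.  0.  0.5]
```

The two off-diagonal level 2 terms (`1.0` and `0.0`) differ because the path moves in the first channel before the second, which shows how the signature encodes the order of events along the stream.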

In [None]:
sig_depth = 2
english_word_sig = np.array(signature(english_word_path, sig_depth))

In [None]:
english_word_sig.shape

## Obtaining paths and signatures for words in `corpus_sample_df`

Now that we have trained our model and obtained signatures for each word in our corpus of English words, we also want to obtain embeddings for the words in `corpus_sample_df`. Currently, `TextEncoder` only works with the data that is passed in at initialisation and stored in `.df` and `.dataset`, so we need to initialise a new `TextEncoder` object with the `corpus_sample_df` dataframe and also the trained model.

We can then obtain embeddings easily (recall from above we first need to tokenize the text, and then use the `.obtain_embeddings()` and `.pool_token_embeddings()` methods to do this).

In [None]:
text_encoder_2 = nlpsig.TextEncoder(
    df=corpus_sample_df,
    feature_name="word",
    model=text_encoder.model,
    config=text_encoder.config,
    tokenizer=text_encoder.tokenizer,
    data_collator=text_encoder.data_collator
)

Note that since we're just loading in our pretrained model from above, we could also just have passed in the path to the model directly via the `model_name` argument, and use the `.load_pretrained_model()` method which loads in the model, config, tokenizer and data collator that was used. So the below initialisation achieves the same result:

In [None]:
text_encoder_2 = nlpsig.TextEncoder(
    df=corpus_sample_df,
    feature_name="word",
    model_name=model_name
)
text_encoder_2.load_pretrained_model()

In [None]:
text_encoder_2.tokenize_text()

In [None]:
text_encoder_2.tokenized_df

After tokenizing, we can obtain token embeddings and also pool these token embeddings with `.obtain_embeddings()` and `.pool_token_embeddings()` methods available.

In [None]:
token_embeddings = text_encoder_2.obtain_embeddings(method="hidden_layer", layers=6)

In [None]:
token_embeddings.shape

To reduce the embeddings, we want to use the same transform that we used earlier on. Recall that we used Gaussian random projections using the [`scikit-learn`](https://scikit-learn.org/stable/modules/random_projection.html) package. After fitting and transforming with the vectors in `english_token_embeddings`, we stored the `sklearn.random_projection.GaussianRandomProjection` object in `reduction.reducer` which we can use again:

In [None]:
type(reduction.reducer)

We can then transform new data using the `.transform()` method of the `sklearn.random_projection.GaussianRandomProjection` class which will use the same transformation that we fitted to above when applying dimension reduction to the token embeddings for our corpus of english words (in `english_train`).

In [None]:
embeddings_reduced = reduction.reducer.transform(token_embeddings)

In [None]:
embeddings_reduced.shape

Optionally, we can save these embeddings for later:

In [None]:
with open(f"corpus_sample_token_embeddings.pkl",'wb') as f:
    pickle.dump(token_embeddings, f)
with open(f"corpus_sample_reduced_token_embeddings.pkl",'wb') as f:
    pickle.dump(embeddings_reduced, f)

We again obtain paths with the `PrepareData` class, and pass in the tokenized dataframe created in `text_encoder_2`:

In [None]:
text_encoder_2.tokenized_df

In [None]:
corpus_dataset = nlpsig.PrepareData(
    text_encoder_2.tokenized_df,
    id_column="text_id",
    label_column="language",
    embeddings=token_embeddings,
    embeddings_reduced=embeddings_reduced
)

In [None]:
corpus_word_path = corpus_dataset.pad(**path_specifics)

By inspecting the shape of `corpus_word_path`, we see that we have a path for each word and the dimension of the array is `[batch, length of path, channels]`.

In [None]:
corpus_word_path.shape

In [None]:
len(corpus_dataset.df["text_id"].unique())

As before, to obtain a path as an array, we use the `.get_path()` method which by default keeps the time features and will remove the id and label columns from the path that is generated.

In [None]:
word_path = corpus_dataset.get_path(include_features=True)
word_path.shape

In [None]:
word_path[0]

In [None]:
# compute path signatures
corpus_signatures = np.array(signature(word_path, sig_depth))

In [None]:
corpus_signatures.shape

We obtain the signatures for our inliers and outliers:

In [None]:
english_word_indices = corpus_sample_df[corpus_sample_df["language"]=="en"].index
non_english_word_indices = corpus_sample_df[corpus_sample_df["language"]!="en"].index

In [None]:
corpus_sample_df.iloc[english_word_indices]

In [None]:
corpus_sample_df.iloc[non_english_word_indices]

In [None]:
# obtain signatures for english words and non-english words in corpus_sample_df
inlier_signatures = corpus_signatures[english_word_indices]
outlier_signatures = corpus_signatures[non_english_word_indices]

In [None]:
inlier_signatures.shape

In [None]:
outlier_signatures.shape

## Anomaly detection task

To recap the task at hand:
- We trained a language model using a corpus of English words stored in the `english_train` dataframe.
- We have another set of English words (inliers) and some non-English words (outliers) which are stored in the `corpus_sample_df` dataframe.
- We now want to see how we could detect the non-English words efficiently. In particular, we use the following method:
    - For each word in `english_train` and `corpus_sample_df`, we have a vector representation (e.g. we've computed the path signatures for each of them; those for `english_train` are stored in `english_word_sig`).
    - For each word in `corpus_sample_df`, we compute the minimum (Euclidean) distance between its path signature and the path signatures for our corpus of known English words (i.e. each row in `english_word_sig`).
    - We then look at the [ROC curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) to see how well separated the English words are from the non-English words. For good performance, we hope that there is good separation, and so we measure the success of this method using the [ROCAUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html).
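The scoring method above can be sketched on synthetic data as follows. The arrays `reference`, `inliers` and `outliers` are random stand-ins for the signatures, and the helper names `min_distance_scores` and `roc_auc` are introduced only for this illustration. The AUC is computed here directly as the probability that a randomly chosen outlier receives a higher score than a randomly chosen inlier (with ties counted as a half), which is what the area under the ROC curve measures.

```python
import numpy as np


def min_distance_scores(reference, queries):
    """Minimum Euclidean distance from each query row to any reference row."""
    diffs = queries[:, None, :] - reference[None, :, :]
    return np.linalg.norm(diffs, axis=-1).min(axis=1)


def roc_auc(inlier_scores, outlier_scores):
    """Probability that an outlier scores higher than an inlier (ties count half)."""
    inl = np.asarray(inlier_scores)[None, :]
    out = np.asarray(outlier_scores)[:, None]
    return float(np.mean((out > inl) + 0.5 * (out == inl)))


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(200, 5))  # stand-in for known English words
inliers = rng.normal(0.0, 1.0, size=(50, 5))     # unseen points from the same distribution
outliers = rng.normal(4.0, 1.0, size=(50, 5))    # points from a shifted distribution

auc = roc_auc(min_distance_scores(reference, inliers),
              min_distance_scores(reference, outliers))
print(f"AUC = {auc:.2f}")
```

An AUC close to 1 means the outliers sit consistently further from the reference set than the inliers, whereas an AUC of 0.5 means the scores carry no information about the class.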

In [None]:
def plot_roc_curve(inlier_scores, outlier_scores, title=""):
    # concatenate scores and labels
    y_true = np.concatenate([np.zeros(len(inlier_scores)),
                             np.ones(len(outlier_scores))])
    scores = np.concatenate([np.array(inlier_scores),
                             np.array(outlier_scores)])
    
    # compute and plot metrics
    fpr, tpr, threshold = roc_curve(y_true, scores)
    roc_auc = roc_auc_score(y_true, scores)
    
    plt.title(f"Receiver Operating Characteristic {title}")
    plt.plot(fpr, tpr, 'b', label = f"AUC = {round(roc_auc, 2)}")
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
    
    return roc_auc

In [None]:
def compute_min_euclidean_dist(main_corpus, embedding):
    # compute score of individual word
    # convert to NumPy so that NumPy, JAX and (CPU) torch inputs are all handled
    main_corpus = np.asarray(main_corpus)
    embedding = np.asarray(embedding)
    # euclidean distance between the embedding and each row in main_corpus
    # (broadcasting subtracts the embedding from every row)
    distances = np.linalg.norm(main_corpus - embedding, axis=1)
    return distances.min().item()

def compute_scores(main_corpus, inliers, outliers, plot=False, title=""):
    # compute scores for inliers and outliers
    inlier_scores = [compute_min_euclidean_dist(main_corpus=main_corpus,
                                                embedding=embedding)
                     for embedding in tqdm(inliers)]
    outlier_scores = [compute_min_euclidean_dist(main_corpus=main_corpus,
                                                 embedding=embedding)
                      for embedding in tqdm(outliers)]
    if plot:
        return plot_roc_curve(inlier_scores=inlier_scores,
                              outlier_scores=outlier_scores,
                              title=title)
    else:
        return inlier_scores, outlier_scores

In [None]:
compute_scores(main_corpus=english_word_sig,
               inliers=inlier_signatures,
               outliers=outlier_signatures,
               plot=True,
               title="\n(using depth=2 signatures of dimension reduced streams)")

## Using one-hot encodings

Here, we simply construct a path of one-hot encodings of the characters and so the number of channels in the path is 26. We also apply a cumulative sum transformation to the path (which here is padded or truncated to length 20, set by `max_word_len` below).
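To illustrate the effect of the cumulative sum on a one-hot path, the small sketch below encodes the word "abba" and prints the first two channels (the counts for 'a' and 'b'). The helper `one_hot_path` is a simplified stand-in written for this example, assuming lowercase words over the 26-letter alphabet.

```python
import numpy as np


def one_hot_path(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """One row per character, with a single 1 in the column of that character."""
    path = np.zeros((len(word), len(alphabet)))
    for position, char in enumerate(word):
        path[position, alphabet.index(char)] = 1.0
    return path


# After the cumulative sum, row t holds the letter counts of the first t+1 characters.
counts = np.cumsum(one_hot_path("abba"), axis=0)
print(counts[:, :2])
# [[1. 0.]
#  [1. 1.]
#  [1. 2.]
#  [2. 2.]]
```

After this transformation the path moves one unit along the axis of each letter as it appears, so the order of the letters is reflected in the shape of the path rather than only in which rows are non-zero.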

In [None]:
def construct_path(char_seq, alpha_len=26):
    # construct path via one-hot encoding of characters
    n = len(char_seq)
    its = np.zeros(n, np.int64)
    for i in range(n):
        its[i] = ord(char_seq[i]) - 97
    A = np.zeros((n, alpha_len))
    j = 0
    for i in its:
        A[j, i] += 1
        j += 1

    return A

def get_one_hot_paths_from_words(words,
                                 max_word_len,
                                 pad_from_below=True,
                                 alpha_len=26,
                                 cumsum_transform=True):
    # compute path for each word in words
    path = np.array(
        [
            np.vstack(
                [
                    construct_path(word),
                    np.zeros((100 - len(word), alpha_len)),
                ]
            )
            if pad_from_below else
            np.vstack(
                [
                    np.zeros((100 - len(word), alpha_len)),
                    construct_path(word),
                ]
            )
            for word in tqdm(words)
        ]
    )
    if pad_from_below:
        path = path[:, :max_word_len, :alpha_len]
    else:
        path = path[:, -max_word_len:, :alpha_len]
    if cumsum_transform:
        path = np.cumsum(path, axis=1)

    return torch.tensor(path)

In [None]:
english_word_one_hot_paths = get_one_hot_paths_from_words(words=english_train["word"],
                                                          max_word_len=20,
                                                          pad_from_below=True,
                                                          cumsum_transform=True)
corpus_one_hot_paths = get_one_hot_paths_from_words(words=corpus_sample_df["word"],
                                                    max_word_len=20,
                                                    pad_from_below=True,
                                                    cumsum_transform=True)

In [None]:
english_word_one_hot_paths.shape

In [None]:
corpus_one_hot_paths.shape

In [None]:
# compute signatures for english words
english_word_one_hot_signatures = signature(english_word_one_hot_paths, 2)

# compute signatures for inliers and outliers
corpus_one_hot_signatures = signature(corpus_one_hot_paths, 2)
inlier_one_hot_signatures = corpus_one_hot_signatures[english_word_indices]
outlier_one_hot_signatures = corpus_one_hot_signatures[non_english_word_indices]

In [None]:
english_word_one_hot_signatures.shape

In [None]:
inlier_one_hot_signatures.shape

In [None]:
outlier_one_hot_signatures.shape

In [None]:
compute_scores(main_corpus=english_word_one_hot_signatures,
               inliers=inlier_one_hot_signatures,
               outliers=outlier_one_hot_signatures,
               plot=True,
               title="\n(using depth=2 signatures of one-hot encoding streams)")

## Acknowledgements

The computations described in this notebook were performed using the Baskerville Tier 2 HPC service (https://www.baskerville.ac.uk/). Baskerville was funded by the EPSRC and UKRI through the World Class Labs scheme (EP/T022221/1) and the Digital Research Infrastructure programme (EP/W032244/1) and is operated by Advanced Research Computing at the University of Birmingham.

## References

[1] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. _arXiv preprint arXiv:1810.04805_.

[2] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V., 2019. Roberta: A robustly optimized bert pretraining approach. _arXiv preprint arXiv:1907.11692_.

[3] McInnes, L., and Healy, J. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, _arXiv preprint arXiv:1802.03426_.

[4] Tipping, M. E., and Bishop, C. M., 1999. Probabilistic principal component analysis. _Journal of the Royal Statistical Society: Series B (Statistical Methodology)_, 61(3), 611-622.

[5] van der Maaten, L.J.P., and Hinton, G.E., 2008. Visualizing High-Dimensional Data using t-SNE. _Journal of Machine Learning Research_, 9:2579-2605.

[6] Mu, J., Bhat, S., and Viswanath, P., 2017. All-but-the-top: Simple and effective postprocessing for word representations. _arXiv preprint arXiv:1702.01417_.

[7] Raunak, V., Gupta, V., and Metze, F., 2019. Effective dimensionality reduction for word embeddings. In _Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)_, 235–243.