pandora_NLP

Disentangling an LLM. Attempting the unsupervised labelling of latent feature representations inside a neural network.

Deep learning models can be criticised for taking a 'black box' approach to modelling data. Whilst one can judge the accuracy of their outputs, the lack of information available about the process that derives them leads to wariness about some aspects of their use. For example, studying whether discriminatory biases exist deep within a model can be difficult.

In some cases, it might be possible to peek inside this box and understand what latent feature representations (concepts) the algorithm is using to make its inferences. Ideally, this would mean understanding what each network node represents in the latent space.

Bug Or Feature

In machine learning, each row of training data is represented by a feature vector (a list of numbers). Each number in the vector can be thought of as a point along a unique axis/dimension.

As a feature vector passes through a neural network, the number of its dimensions is often altered by each new layer, with each node in that layer representing one dimension. The effect of this is the abstraction of latent features (meaning they cannot be directly observed) - hence the black box problem.
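As a rough illustration (not code from this repository), the NumPy sketch below shows a feature vector being reshaped by two dense layers; the sizes and random weights are arbitrary placeholders.

```python
import numpy as np

# A single feature vector with 512 dimensions (one row of training data).
x = np.random.rand(512)

# Each layer multiplies by a weight matrix, changing the number of dimensions.
# Every output dimension corresponds to one node in that layer.
w1 = np.random.rand(512, 128)   # layer 1: 512 -> 128 nodes
w2 = np.random.rand(128, 32)    # layer 2: 128 -> 32 nodes

h1 = np.tanh(x @ w1)            # 128 latent features
h2 = np.tanh(h1 @ w2)           # 32 latent features, not directly observable

print(h1.shape, h2.shape)       # (128,) (32,)
```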

Embeddings

A document-level NLP embedding is an example of dimensionality reduction performed by an autoencoder: large one-hot encoded vectors representing the individual words in each document are compressed into much smaller feature vectors. Each value in the resulting vector represents a concept learned from the autoencoder's training corpus.

These latent cardinal concepts are learned by the machine as the most efficient way to account for the variance in the training documents, whilst retaining information about how they are entangled. For example, linear relationships are preserved in latent space.


Embeddings As Node Activation States

Embeddings are created by saving feature vectors emerging at the middle layer (i.e., the layer with the fewest nodes) of an autoencoder neural network.
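A minimal Keras sketch of this idea follows; the layer sizes, activations and loss are illustrative assumptions, not the architecture of any real model used in this project.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative sizes only.
input_dim, bottleneck_dim = 10000, 512

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(2048, activation="relu")(inputs)
bottleneck = layers.Dense(bottleneck_dim, activation="linear", name="bottleneck")(encoded)
decoded = layers.Dense(2048, activation="relu")(bottleneck)
outputs = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(documents, documents, epochs=10)   # train to reconstruct the input

# The embedding is simply the activation state of the middle (bottleneck) layer.
encoder = keras.Model(inputs, bottleneck)
# embeddings = encoder.predict(documents)            # shape: (n_documents, 512)
```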

The aim of this project is to describe the abstracted concepts represented by these nodes/dimensions in a Universal Sentence Encoder (USE) large language model, by analysing the contents of documents that trigger the largest activations (both +/-) of a particular node.

The documents used are the 1.2 million first paragraphs of Wikipedia pages in the dataset created by github.com/colurw/wiki_abstracts_NLP.
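For reference, the sketch below shows one way the abstracts could be embedded with the Universal Sentence Encoder via TensorFlow Hub; the file name, batch size and model URL/version are assumptions rather than details taken from the repository.

```python
import numpy as np
import tensorflow_hub as hub

# Hypothetical file of Wikipedia first paragraphs, one per line
# (see github.com/colurw/wiki_abstracts_NLP for the real dataset).
with open("wiki_abstracts.txt", encoding="utf-8") as f:
    abstracts = [line.strip() for line in f if line.strip()]

# The large USE model returns a 512-dimensional embedding per document.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

# Embed in batches to keep memory use reasonable.
batches = [abstracts[i:i + 1000] for i in range(0, len(abstracts), 1000)]
embeddings = np.vstack([embed(batch).numpy() for batch in batches])
print(embeddings.shape)   # (n_documents, 512)
```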

Python Scripts

1_find_extreme_articles

Searches through the embedded data for the articles with the highest (and lowest) values in a particular dimension of their feature vectors. It creates a list recording which dimension each article in the dataset strongly activates, with a placeholder ('-1') if the article is not one of the most extreme articles in any dimension.
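A minimal NumPy sketch of this search; the array file name and the cut-off of 100 articles per extreme are assumptions, not values taken from the script.

```python
import numpy as np

# embeddings: (n_documents, n_dims) array of feature vectors (hypothetical file name).
embeddings = np.load("embeddings.npy")
n_docs, n_dims = embeddings.shape
TOP_K = 100

# For each article, record the dimension it most strongly activates,
# or -1 if it is not among the most extreme articles of any dimension.
strongest_dim = np.full(n_docs, -1, dtype=int)

for dim in range(n_dims):
    top_positive = np.argsort(embeddings[:, dim])[-TOP_K:]   # largest activations
    top_negative = np.argsort(embeddings[:, dim])[:TOP_K]    # most negative activations
    strongest_dim[top_positive] = dim
    strongest_dim[top_negative] = dim
    # An article extreme in more than one dimension keeps the last assignment
    # (a simplification made by this sketch).
```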

2_TF-IDF_keywords

Manipulates the original data to allow a term frequency-inverse document frequency (TF-IDF) calculation, which finds the most significant words (keywords) in the clusters of documents found at the two extremes of each dimension of the embedding.
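A hedged sketch of such a keyword calculation using scikit-learn's TfidfVectorizer; the variable names, dimension index and choice of 20 keywords are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# `abstracts` (list of document strings) and `embeddings` (feature vector array)
# are assumed to come from the earlier sketches.
vectoriser = TfidfVectorizer(stop_words="english", max_features=50000)
tfidf = vectoriser.fit_transform(abstracts)          # (n_documents, n_terms), sparse
terms = np.array(vectoriser.get_feature_names_out())

# Indices of the ~100 articles at the positive extreme of one dimension.
dim = 0
cluster_ids = np.argsort(embeddings[:, dim])[-100:]

# Sum the TF-IDF weights over the cluster; the highest-scoring terms are the keywords.
cluster_scores = np.asarray(tfidf[cluster_ids].sum(axis=0)).ravel()
keywords = terms[np.argsort(cluster_scores)[::-1][:20]]
print(keywords)
```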

3_chatGPT_dimension_labelling

Generates a list of strongly-representative article abstracts for each dimension and uses ChatGPT to infer the two concepts that best represent its positive and negative directions. These concepts, along with the keyword data and article examples, are written to text files.

In an effort to reduce the effects of noise in the data, the keywords found in the top 100 articles for the dimension are compared to those of the top 300 articles. Also, the concept labelling function is called twice per direction (four times per dimension) to compare the top 10 articles to the following 10 articles; a similar/coherent result acts as a (simplistic) validation of any discovered concepts.
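A minimal sketch of the labelling call using the openai Python client; the prompt wording, system prompt and model name are placeholders, not the prompts used in the repository.

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

def label_concept(abstracts, model="gpt-4o-mini"):
    """Ask the model to name the single concept shared by a group of abstracts."""
    prompt = (
        "The following Wikipedia abstracts all strongly activate the same embedding "
        "dimension. In a few words, name the concept they have in common:\n\n"
        + "\n\n".join(abstracts)
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You label clusters of documents with short concept names."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()

# Called twice per direction, so that similar answers act as a rough validation:
# label_a = label_concept(top_ten_abstracts)
# label_b = label_concept(next_ten_abstracts)
```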

"Pandora", Yes? Does It Work?

Often a dimension was found to have a surmisable and 'validated' concept in only one direction. Typically, in these cases, the other direction of the dimension gave much more varied (almost random) results, with its two concept labels coming back as "diverse" or "miscellaneous", or having no obvious similarity to each other. Notably, when opposite ends of a dimension could be labelled with a degree of certainty, their concept labels were unrelated ("Film and entertainment / Performing Arts" vs. "Exploration / Historical Figures and Events") rather than being mutual opposites (e.g. "Childhood / Children's Education" vs. "Death and its various aspects / Death"). This was not always the case, however, and many dimensions did not demonstrate an easily-determinable concept label in either direction.

[EDIT: Without further analysis, I suspect the clustered nature of the articles might be adding a significant bias. An improved sampling technique that accounts for this may well give less coherent results.]

It might be possible to improve on these results by not just searching for the most extreme articles in a given dimension, but instead finding articles whose feature vectors point (in multi-dimensional space) more closely along the axis of interest, as sketched below - i.e., the other numbers in their feature vectors are closer to zero, compared to the (normally distributed) random values found at present.
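A minimal NumPy sketch of this alternative selection, measuring the cosine of the angle between each feature vector and the unit basis vector of the dimension of interest; the function and variable names are illustrative, not the repository's implementation.

```python
import numpy as np

def most_axis_aligned(embeddings, dim, top_k=10):
    """Return indices of articles whose feature vectors point most nearly along
    the chosen axis, rather than merely having an extreme value in it."""
    norms = np.linalg.norm(embeddings, axis=1)
    # Cosine of the angle between each vector and the unit basis vector of `dim`.
    cosines = embeddings[:, dim] / np.where(norms == 0, 1.0, norms)
    aligned_positive = np.argsort(cosines)[-top_k:][::-1]   # smallest angle, + direction
    aligned_negative = np.argsort(cosines)[:top_k]          # smallest angle, - direction
    return aligned_positive, aligned_negative
```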

[EDIT: Articles found in this manner show little coherency (and somewhat large angles to the axis). This might be due to the h-clustering algorithm orienting the new axes towards empty regions, to create more orthogonality with the data.]

The ChatGPT prompt could be optimised, for example by improving the system prompt, by including the results of the keyword analysis, by connecting to a more recent model, or simply by feeding it more than ten articles at a time.
