# Analysis of Embeddings

This notebook explores which patterns dimensionality reduction techniques reveal in the embedding spaces of the proposed embedding models.

To analyze the embedding spaces we use Arize's Phoenix app, which projects the high-dimensional embeddings into a 3-dimensional space using UMAP. We also look at each embedding space in a PCA-decomposed representation to see how much impact the choice of decomposition algorithm has.

We look at the embedding spaces of the following two embedding models:
- [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)

In [None]:
import os
import sys

import pandas as pd
import plotly
import plotly.express as px

from dotenv import load_dotenv

current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)

sys.path.append(parent_dir)

plotly.offline.init_notebook_mode()
load_dotenv()

#### Loading Embeddings
The first step is to load the persisted embeddings from each embedding model. The embedding vectors (i.e. the embedding matrix) were saved in the `data/embeddings/*` folder, one file per split of the initial train, test and validation split.

The following block loads these split up matrices into memory.

In [None]:
DATA_DIR = os.getenv('DATA_DIR', 'data')
EMBEDDING_DATA_DIR = os.path.abspath(os.path.join(parent_dir, DATA_DIR, 'embeddings'))

embeddings = {}

print(f"Reading embeddings from {EMBEDDING_DATA_DIR}")

# Each sub-directory of EMBEDDING_DATA_DIR holds the embeddings of one model.
embedding_model_dirs = [d for d in os.listdir(EMBEDDING_DATA_DIR) if os.path.isdir(os.path.join(EMBEDDING_DATA_DIR, d))]

for model_dir in embedding_model_dirs:
    print(f"- Opening Embeddings from {model_dir}")
    curr_embeddings = {}
    for file in os.listdir(os.path.join(EMBEDDING_DATA_DIR, model_dir)):
        if file.endswith('.pkl'):
            # Use the file name without its extension as the split key.
            filename = file.split('.')[0]
            curr_embeddings[filename] = pd.read_pickle(os.path.join(EMBEDDING_DATA_DIR, model_dir, file))
            print(f"  - Read {file}")
    embeddings[model_dir] = curr_embeddings

#### Loading Data Partitions
To get a full picture of the embedding space we also want to look at the content of a review and how it relates to other reviews in the space. The next code block therefore loads the three partition parquet files (train, test and validation or, in our case, unlabelled, labelled and validation) into memory, including the nominal attributes of each review.

In [None]:
PARTITIONS_DATA_DIR = os.path.abspath(os.path.join(parent_dir, DATA_DIR, 'partitions'))

print(f"Reading partitions data from {PARTITIONS_DATA_DIR}")    

partitions = {}

for file in os.listdir(PARTITIONS_DATA_DIR):
    if file.endswith('.parquet'):
        # Use the file name without its extension as the partition key.
        filename = file.split('.')[0]
        partitions[filename] = pd.read_parquet(os.path.join(PARTITIONS_DATA_DIR, file))
        print(f'- Read {file}')

#### Merging content, title and label to embedding vectors
The goal now is to merge the embedding matrices with the dataframes. This later allows us to pass a dataset into Phoenix that contains each review's content and its embedding vector.

In [None]:
import numpy as np

merged_partitions = {}

for embedding_model in embeddings:
    print(f'Merging partitions for model {embedding_model}')
    merged_list = []
    
    for partition_key, partition_df in partitions.items():
        curr_partition_name = partition_key.split('_')[0]
        matched = False
        
        for embedding_key, embedding_array in embeddings[embedding_model].items():
            if curr_partition_name == embedding_key.split('_')[0]:
                if isinstance(embedding_array, (list, pd.Series)):
                    embedding_array = np.array(embedding_array)
                
                if len(partition_df) == embedding_array.shape[0]:
                    partition_df = partition_df.copy()
                    partition_df['embedding'] = embedding_array.tolist()
                    merged_list.append(partition_df)
                    matched = True
                    print(f"  - Merged {embedding_key} with {partition_key}")
                else:
                    print(f"  - Number of rows do not match for {embedding_key} and {partition_key}")
        
        if not matched:
            print(f"  - No matching embedding found for {partition_key}")
    
    if merged_list:
        merged_partitions[embedding_model] = pd.concat(merged_list, ignore_index=True)
    else:
        print(f"No partitions were merged for model {embedding_model}")

## Visualizing Labeled data
To project the high-dimensional embeddings into a human-readable format we use Arize's Phoenix app, which lets us interactively explore the embedding space projected down into 3 dimensions by UMAP.

Additionally, it is insightful to look at a different dimensionality reduction approach. We therefore wrote the `plot_pca` function, which projects the embedding space onto its first three principal components using PCA.

In [None]:
from sklearn.decomposition import PCA

def break_content(text, length=50):
    lines = []
    while len(text) > length:
        space_index = text.rfind(' ', 0, length)
        if space_index == -1:
            space_index = length
        lines.append(text[:space_index])
        text = text[space_index:].lstrip()
    lines.append(text)
    return '<br>'.join(lines)


def plot_pca(merged, key):
    """Plot the first three principal components of the embeddings stored under `key`."""
    if key not in merged:
        raise ValueError(f"Key {key!r} not found in the merged partitions dictionary.")

    df = merged[key]

    embeddings = np.vstack(df['embedding'].values)
    content = df['content'].apply(lambda x: break_content(x)).values

    pca = PCA(n_components=3)
    reduced_embeddings = pca.fit_transform(embeddings)

    pca_df = pd.DataFrame(reduced_embeddings, columns=['PCA1', 'PCA2', 'PCA3'])
    pca_df['Content'] = content

    fig = px.scatter_3d(pca_df, x='PCA1', y='PCA2', z='PCA3',
                        title=f'PCA of Embedding Vectors for {key}',
                        size_max=5, opacity=0.6, height=800,
                        hover_data={'Content': True})
    fig.update_traces(marker=dict(size=2))
    fig.show()
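
The hover text in the plot is wrapped by `break_content`. As a small illustration of what that helper returns, the sketch below repeats the function from the cell above and applies it to two made-up strings (the inputs and the short `length` values are chosen only for the example):

```python
def break_content(text, length=50):
    """Wrap `text` into lines of at most `length` characters, joined by <br>."""
    lines = []
    while len(text) > length:
        # Break at the last space before the limit, or hard-break if there is none.
        space_index = text.rfind(' ', 0, length)
        if space_index == -1:
            space_index = length
        lines.append(text[:space_index])
        text = text[space_index:].lstrip()
    lines.append(text)
    return '<br>'.join(lines)


# Breaks at word boundaries when spaces are available.
print(break_content("great album with warm vocals", length=12))
# great album<br>with warm<br>vocals

# Falls back to a hard break when a word is longer than the limit.
print(break_content("a" * 15, length=10))
# aaaaaaaaaa<br>aaaaa
```

Plotly renders `<br>` as a line break in hover labels, which keeps long reviews readable.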


### MiniLM Embedding Space
First we take a look at the embedding space of the `mini_lm` embedding model.

Note: to see the projected space in the Phoenix app, click the "text_embedding" link inside the app, which loads the 3-dimensional UMAP projection. Also note that UMAP uses stochastic algorithms to speed up the calculation, so the representation you see may not look the same as the one we describe: **this decomposition approach is non-deterministic**.

In [None]:
from src.px_utils import create_dataset, launch_px

mini_lm_ds = create_dataset('mini_lm', merged_partitions['mini_lm'], merged_partitions['mini_lm']['embedding'], content=merged_partitions['mini_lm']['content'])

px_session = launch_px(mini_lm_ds, None)
px_session.view()

The embeddings are relatively clearly clustered in the space. In this first example we use the embedding vectors of Hugging Face's `all-MiniLM-L6-v2` BERT-style sentence transformer. This sentence transformer was trained on sentence pairs, many of them in Q&A form. The resulting embedding vector therefore describes the semantic content of a sentence, and this is exactly what we see in the embedding space: reviews of the `amazon-polarity` dataset are clustered according to their product niche, for example in the cluster containing music reviews.

But we can also observe other semantic relationships:
- Video game reviews lie between music and book reviews: This axis could perhaps describe interactivity. Music can be enjoyed passively, games involve some interaction between cutscenes, while books capture one's concentration and attention entirely.
- Video game reviews lie opposite tech gadgets and other devices: This axis might describe the degree of virtuality. Games are completely virtual while tech gadgets are physical devices.
- Kids' toys are clustered between games and tech gadgets.
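
Since UMAP does not preserve distances exactly, an "A lies between B and C" reading of the projection can be cross-checked in the original embedding space. The following is a minimal sketch of one way to do that, comparing the cosine similarity of cluster centroids. It assumes a category label per review, which the dataset used here does not provide, so the labels and the 2-dimensional toy vectors below are purely hypothetical.

```python
import numpy as np


def centroid_cosine_similarity(embeddings, labels):
    """Return the sorted label names and the cosine similarity matrix of their centroids."""
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    # Mean embedding per label, then normalise so the dot product is the cosine.
    centroids = np.stack([embeddings[labels == name].mean(axis=0) for name in names])
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return names, centroids @ centroids.T


# Hypothetical toy data: two vectors per made-up category.
toy_embeddings = np.array([
    [1.0, 0.0], [0.9, 0.1],   # "music"
    [0.0, 1.0], [0.1, 0.9],   # "books"
    [0.7, 0.7], [0.6, 0.8],   # "games"
])
toy_labels = ["music", "music", "books", "books", "games", "games"]

names, similarity = centroid_cosine_similarity(toy_embeddings, toy_labels)
for name, row in zip(names, similarity):
    print(name, np.round(row, 2))
```

In the toy data the "games" centroid is more similar to both "music" and "books" than those two are to each other, which is the pattern one would expect if games really sat between them.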

#### PCA of MiniLM
Since Phoenix doesn't allow for a different dimensionality reduction technique, we implement a PCA strategy ourselves. UMAP differs vastly from PCA, so looking at another technique could yield further observations about the embedding space. PCA, unlike UMAP, is deterministic, so the observations made on it are reproducible.

In [None]:
plot_pca(merged_partitions, 'mini_lm')

Compared to the UMAP representation, the PCA reduction shows a triangular embedding space with a cluster emerging at each corner. We can still roughly make out the following four clusters:
- Music albums
- Books
- Movies
- Tech Gadgets

This visualization again supports the claims made in the above analysis: the `all-MiniLM-L6-v2` model clearly succeeds in embedding and clustering the reviews according to their semantic relatedness.
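
When reading a 3-component PCA plot it helps to know how much of the total variance those three components actually capture. The sketch below computes this ratio with NumPy on random toy data (the data and the helper name `explained_variance_ratio` are assumptions for illustration). On a fitted scikit-learn `PCA` object the same quantity is available as `pca.explained_variance_ratio_`.

```python
import numpy as np


def explained_variance_ratio(embeddings, n_components=3):
    """Share of the total variance captured by each of the first principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Squared singular values of the centered data are proportional to the PCA variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()


# Toy data: 10 dimensions, three of which carry most of the variance.
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 10)) * np.array([5.0, 3.0, 2.0] + [0.5] * 7)

ratio = explained_variance_ratio(toy)
print("per component:", np.round(ratio, 3))
print("first three combined:", round(float(ratio.sum()), 3))
```

If the combined ratio is low for the real embeddings, the clusters visible in the plot describe only a small part of the structure in the full space and should be interpreted with care.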

### mpnet-base Embedding Space
Now we look at the embedding space of the `all-mpnet-base-v2` model.

In [None]:
mpnet_base_ds = create_dataset('mpnet_base', merged_partitions['mpnet_base'], merged_partitions['mpnet_base']['embedding'], content=merged_partitions['mpnet_base']['content'])

px_session = launch_px(mpnet_base_ds, None)
px_session.view()

The UMAP projection of `mpnet_base` as seen in Phoenix shows roughly the same clusters as the UMAP projection of the `mini_lm` embedding space.
The main and most obvious observations stay the same as in the previous exploration of the `mini_lm` decomposition:
- One cluster that separates itself from the other points in the space is the music-related cluster
- On the other side of the space many data points seem to be about books

#### PCA of mpnet-base

In [None]:
plot_pca(merged_partitions, 'mpnet_base')

This `mpnet_base` PCA projection shows a decomposed space similar to the PCA of the `mini_lm` embedding model. This observation makes sense because both embedding models were trained with similar BERT-style objectives focused on mapping and clustering the semantic meanings of sentences. Consequently, the decomposed spaces map similar directions of variance onto the principal components.

One possible source of confusion is that, when comparing both 3D PCA plots, principal component #2 seems to be flipped. This does not change the component's meaning, since the principal components derived from PCA are only unique up to a sign flip: the eigenvectors of a covariance matrix (which define the principal components) can point in either direction along the axis they define. Both directions represent the same principal component, just with inverted signs.
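
The sign ambiguity can be verified numerically. The following sketch uses random toy data (not the review embeddings) and shows that an eigenvector `v` of the covariance matrix and its negation `-v` satisfy the same eigenvalue equation, and that projecting onto either one yields mirrored scores with identical variance.

```python
import numpy as np

# Toy data with different spread along each axis.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3)) * np.array([3.0, 2.0, 1.0])
centered = data - data.mean(axis=0)

# eigh returns eigenvalues in ascending order, so the last column
# belongs to the largest eigenvalue, i.e. the first principal component.
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
v = eigenvectors[:, -1]
lam = eigenvalues[-1]

# Both v and -v solve cov @ x = lam * x.
print(np.allclose(cov @ v, lam * v))        # True
print(np.allclose(cov @ (-v), lam * (-v)))  # True

# Projections onto v and -v are mirror images with the same variance.
scores = centered @ v
flipped_scores = centered @ (-v)
print(np.allclose(flipped_scores, -scores))              # True
print(np.isclose(scores.var(), flipped_scores.var()))    # True
```

Which of the two directions an implementation returns depends on its internal sign convention and on the data, which is why the same component can appear mirrored between two plots.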