# Analysis of Embeddings

This notebook explores the patterns that dimensionality reduction techniques can reveal in the proposed weak labeling models and in the embedding models used.

## Loading Weak Labeled data

In [None]:
import os
import sys

import pandas as pd
import plotly
import plotly.express as px
from dotenv import load_dotenv

current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)

sys.path.append(parent_dir)

plotly.offline.init_notebook_mode()
load_dotenv()

In [None]:
DATA_DIR = os.getenv('DATA_DIR', 'data')
EMBEDDING_DATA_DIR = os.path.abspath(os.path.join(parent_dir, DATA_DIR, 'embeddings'))

weak_labelled = {}

print(f"Reading embeddings from {EMBEDDING_DATA_DIR}")

embedding_model_dirs = [d for d in os.listdir(EMBEDDING_DATA_DIR) if os.path.isdir(os.path.join(EMBEDDING_DATA_DIR, d))]
embeddings = {}

# `model_dir` avoids shadowing the built-in `dir`
for model_dir in embedding_model_dirs:
    print(f"- Opening Embeddings from {model_dir}")
    curr_embeddings = {}
    for file in os.listdir(os.path.join(EMBEDDING_DATA_DIR, model_dir)):
        if file.endswith('.pkl'):
            filename = file.split('.')[0]
            curr_embeddings[filename] = pd.read_pickle(os.path.join(EMBEDDING_DATA_DIR, model_dir, file))
            # only report files that were actually loaded
            print(f"  - Read {file}")
    embeddings[model_dir] = curr_embeddings

In [None]:
PARTITIONS_DATA_DIR = os.path.abspath(os.path.join(parent_dir, DATA_DIR, 'partitions'))

print(f"Reading partitions data from {PARTITIONS_DATA_DIR}")

partitions = {}

for file in os.listdir(PARTITIONS_DATA_DIR):
    if file.endswith('.parquet'):
        filename = file.split('.')[0]
        partitions[filename] = pd.read_parquet(os.path.join(PARTITIONS_DATA_DIR, file))
        # only report files that were actually loaded
        print(f'- Read {file}')

## Merging content, title and label to embedding vectors

In [None]:
merged_partitions = {}

for embedding_model in embeddings:
    merged_partitions[embedding_model] = {}
    print(f'For {embedding_model}:')
    for partition in partitions:
        curr_partition_name = partition.split('_')[0]
        embeddings_keys = embeddings[embedding_model].keys()
        
        for embedding_key in embeddings_keys:
            if curr_partition_name == embedding_key.split('_')[0]:
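                # copy so each embedding model gets its own frame; otherwise every
                # model would share (and overwrite) the same 'embedding' column
                partition_data = partitions[partition].copy()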
                embedding_data = embeddings[embedding_model][embedding_key]
                partition_data['embedding'] = embedding_data.tolist()
                
                merged_partitions[embedding_model][partition] = partition_data
                
                print(f"- Merged {embedding_key} with {partition}")
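
The merge above relies on the embedding array and the partition having the same number of rows in the same order. A minimal sketch of a defensive variant follows; the helper name `attach_embeddings` and the demo data are hypothetical and not part of the notebook. It checks the row counts and works on a copy, so the source partition stays untouched.

```python
import numpy as np
import pandas as pd


def attach_embeddings(partition_df, vectors):
    """Return a copy of partition_df with an 'embedding' column (hypothetical helper)."""
    vectors = np.asarray(vectors)
    # Row counts must match, otherwise vectors would be paired with the wrong rows.
    if len(partition_df) != len(vectors):
        raise ValueError(
            f"Row mismatch: {len(partition_df)} rows vs {len(vectors)} embedding vectors"
        )
    merged = partition_df.copy()  # leave the original partition unchanged
    merged['embedding'] = vectors.tolist()  # one list of floats per row
    return merged


# Demo data (made up for illustration only)
demo_partition = pd.DataFrame({
    'title': ['a', 'b'],
    'content': ['first review', 'second review'],
    'label': [0, 1],
})
demo_vectors = np.zeros((2, 4))

merged = attach_embeddings(demo_partition, demo_vectors)
print(merged.columns.tolist())
```

Returning a new frame keeps the result for one embedding model independent of the others, since each call starts from the unmodified partition.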

In [None]:
# concatenate all partitions for each embedding model
for embedding_model in merged_partitions:
    print(f"Concatenating partitions for {embedding_model}")
    partitions_data = pd.concat(merged_partitions[embedding_model].values(), ignore_index=True)
    merged_partitions[embedding_model] = partitions_data

In [None]:
merged_partitions['mini_lm']

## Visualizing Labeled data
To project the high-dimensional embeddings into a human-readable format, we use Arize's Phoenix app, which lets us interactively explore the embedding space projected down into 3 dimensions by UMAP.

Additionally, it might be insightful to look at a different dimension reduction approach. We therefore wrote the `plot_pca` function, which projects the embedding space into three dimensions using PCA.

In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def break_content(text, length=50):
    lines = []
    while len(text) > length:
        space_index = text.rfind(' ', 0, length)
        if space_index == -1:
            space_index = length
        lines.append(text[:space_index])
        text = text[space_index:].lstrip()
    lines.append(text)
    return '<br>'.join(lines)


def plot_pca(weak_labelled, key):
    if key not in weak_labelled:
        raise ValueError(f"File {key} not found in the weak_labelled dictionary.")

    df = weak_labelled[key]

    embeddings = np.vstack(df['embedding'].values)
    content = df['content'].apply(break_content).values

    pca = PCA(n_components=3)
    reduced_embeddings = pca.fit_transform(embeddings)

    pca_df = pd.DataFrame(reduced_embeddings, columns=['PCA1', 'PCA2', 'PCA3'])
    pca_df['Content'] = content

    fig = px.scatter_3d(pca_df, x='PCA1', y='PCA2', z='PCA3',
                        title=f'PCA of Embedding Vectors for {key}',
                        size_max=5, opacity=0.6, height=800,
                        hover_data={'Content': True})
    fig.update_traces(marker=dict(size=2))
    fig.show()
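
For comparison, the wrapping done by `break_content` can be approximated with Python's standard `textwrap` module. This is a sketch of an alternative, not the notebook's implementation; the name `wrap_for_hover` is hypothetical. Like `break_content`, it breaks at spaces where possible and joins the lines with `<br>` so Plotly renders them as separate hover lines.

```python
import textwrap


def wrap_for_hover(text, length=50):
    """Wrap text to at most `length` characters per line, joined with <br> (hypothetical helper)."""
    # textwrap.wrap breaks at whitespace and, by default, splits words longer than `length`
    return '<br>'.join(textwrap.wrap(text, width=length))


example = wrap_for_hover("The quick brown fox jumps over the lazy dog", length=20)
print(example)
# The quick brown fox<br>jumps over the lazy<br>dog
```

One difference worth noting: `textwrap.wrap` collapses whitespace such as newlines inside the text, which may or may not be desired for review content.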


### MiniLM Embedding Space

In [None]:
from src.px_utils import create_dataset, launch_px

knn_key = 'mlp_weak_labeling_weaklabels.parquet'

mini_lm_ds = create_dataset('mini_lm', merged_partitions['mini_lm'], merged_partitions['mini_lm']['embedding'], content=merged_partitions['mini_lm']['content'])

px_session = launch_px(mini_lm_ds, None)
px_session.view()

The embeddings form relatively clear clusters in the projected space. In this first example we use the embedding vectors of Hugging Face's `all-MiniLM-L6-v2` BERT sentence transformer. This sentence transformer was trained on sentence pairs, such as question-and-answer pairs, so the resulting embedding vector describes the semantic content of a sentence. This is exactly what we see in the embedding space: reviews from the `amazon-polarity` dataset cluster according to their product niche, for example in a cluster containing music reviews.

But we can also observe other semantic relationships:
- Video game reviews lie between music and book reviews: this axis could perhaps describe interactivity. Music can be enjoyed passively, games involve some interaction between cutscenes, and books capture one's concentration and attention entirely.
- Video game reviews lie opposite tech gadgets and other devices: this axis might describe how virtual a product is. Games are entirely virtual, while tech gadgets are physical devices.
- Kids' toys are clustered between games and tech gadgets.

Since Phoenix doesn't allow for a different dimension reduction technique, we implement a PCA strategy ourselves. UMAP differs vastly from PCA, so looking at another technique could yield more interesting observations in the embedding space.

In [None]:
plot_pca(merged_partitions, 'mini_lm')

Compared to the UMAP representation, the PCA reduction doesn't seem to show much more separation in labels or semantics. We can still roughly see the following four clusters:
- Music albums
- Books
- Movies
- Tech Gadgets
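
One way to judge how much structure a three-component PCA view can show at all is to look at the share of variance the components retain. The sketch below uses scikit-learn's `explained_variance_ratio_` attribute on random placeholder vectors; the helper name `explained_variance_report` and the demo data are assumptions, and in the notebook the input would be something like `np.vstack(merged_partitions['mini_lm']['embedding'].values)`.

```python
import numpy as np
from sklearn.decomposition import PCA


def explained_variance_report(vectors, n_components=3):
    """Fit PCA and return the per-component and total explained variance ratio (hypothetical helper)."""
    pca = PCA(n_components=n_components)
    pca.fit(vectors)
    ratios = pca.explained_variance_ratio_  # fraction of total variance per component
    return ratios, float(ratios.sum())


# Placeholder data: 200 random vectors with 384 dimensions,
# the embedding size of all-MiniLM-L6-v2. Not real review embeddings.
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(200, 384))

ratios, total = explained_variance_report(demo_vectors)
print("Per component:", ratios)
print("Total retained:", total)
```

If the total is small, most of the variance lives in dimensions the plot cannot show, which would be consistent with PCA displaying less separation than a nonlinear method such as UMAP.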