# Analysis of Embeddings

This notebook explores which patterns dimensionality reduction techniques reveal in the embedding space, with respect to both the proposed weak labeling models and the embedding models used.

## Loading Weak Labeled data

In [None]:
import os
import sys
from dotenv import load_dotenv
import plotly.express as px
import plotly
import pandas as pd

current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)

sys.path.append(parent_dir)

plotly.offline.init_notebook_mode()
load_dotenv()

In [None]:
WL_DATA_DIR = os.getenv('DATA_DIR', 'data')
WL_DATA_DIR = os.path.abspath(os.path.join(parent_dir, WL_DATA_DIR, 'weak_labelled'))

weak_labelled = {}

print(f"Reading weak labelled data from {WL_DATA_DIR}")
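for file in sorted(os.listdir(WL_DATA_DIR)):
    # Skip anything that is not a parquet file (e.g. hidden files or subfolders)
    if not file.endswith('.parquet'):
        continue
    weak_labelled[file] = pd.read_parquet(os.path.join(WL_DATA_DIR, file))
    print(f"- Read {file}")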
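
The cells further down read the `embedding_vec`, `label` and `content` columns from each loaded frame. A small check like the following could catch a file that lacks one of them before any plotting starts. This is only a sketch: `missing_columns` is a hypothetical helper that is not part of the project, and the example frame is made up.

```python
import pandas as pd

# Columns the later cells of this notebook access on every frame.
REQUIRED_COLUMNS = {"embedding_vec", "label", "content"}

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return sorted(required - set(df.columns))

# Made-up frame standing in for one of the weak labelled parquet files.
example = pd.DataFrame({
    "content": ["great album"],
    "label": [1],
    "embedding_vec": [[0.1, 0.2, 0.3]],
})
incomplete = example.drop(columns=["label"])

print(missing_columns(example))     # []
print(missing_columns(incomplete))  # ['label']
```

Applied to the loaded data, the helper would be called once per entry of `weak_labelled`.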

## Visualizing Labeled data
To make the high-dimensional embeddings readable for humans, we use Arize's Phoenix app, which lets us interactively inspect the embedding space projected down to three dimensions by UMAP.

Additionally, it can be insightful to look at a different dimensionality reduction approach. We therefore wrote the `plot_pca` function, which projects the embedding space into three dimensions using PCA.

In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

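def break_content(text, length=50):
    """Wrap text at word boundaries and join the lines with <br> for Plotly hover labels."""
    lines = []
    while len(text) > length:
        # Prefer the last space before the limit; hard-break if there is none
        space_index = text.rfind(' ', 0, length)
        if space_index == -1:
            space_index = length
        lines.append(text[:space_index])
        text = text[space_index:].lstrip()
    lines.append(text)
    return '<br>'.join(lines)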

def plot_pca(weak_labelled, key):
    if key not in weak_labelled:
        raise ValueError(f"File {key} not found in the weak_labelled dictionary.")
    
    df = weak_labelled[key]

    embeddings = np.vstack(df['embedding_vec'].values)
    labels = np.array(df['label'].values)
    content = df['content'].apply(break_content).values

    pca = PCA(n_components=3)
    reduced_embeddings = pca.fit_transform(embeddings)
    
    pca_df = pd.DataFrame(reduced_embeddings, columns=['PCA1', 'PCA2', 'PCA3'])
    pca_df['Label'] = labels
    pca_df['Content'] = content

    fig = px.scatter_3d(pca_df, x='PCA1', y='PCA2', z='PCA3', color='Label', 
                        title=f'PCA of Embedding Vectors for {key}',
                        size_max=5, opacity=0.6, height=800,
                        hover_data={'Content': True})
    fig.update_traces(marker=dict(size=2))
    fig.show()
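
To see what the hover text will look like, the wrapping helper can be tried on its own. The snippet below restates `break_content` so that it runs standalone; the sample review is made up for illustration and is not taken from the dataset.

```python
def break_content(text, length=50):
    """Wrap text at word boundaries and join the lines with <br>."""
    lines = []
    while len(text) > length:
        # Prefer the last space before the limit; hard-break if there is none.
        space_index = text.rfind(' ', 0, length)
        if space_index == -1:
            space_index = length
        lines.append(text[:space_index])
        text = text[space_index:].lstrip()
    lines.append(text)
    return '<br>'.join(lines)

# Made-up review text, used only to demonstrate the wrapping.
sample = "This album is a wonderful mix of jazz and soul that I keep coming back to"
wrapped = break_content(sample, length=30)

for part in wrapped.split('<br>'):
    print(len(part), part)
```

Each printed segment stays within the 30-character limit. A single word longer than `length` would be cut in the middle, since the helper falls back to a hard break when it finds no space.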


### KNN Weak Labels
For this first view of the embedding space, we look at how the KNN model labels the sentiments.

In [None]:
from src.px_utils import create_dataset, launch_px

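# NOTE: despite the variable name, this key points at the MLP weak labels file.
# Swap in the KNN file name here if a KNN parquet file exists in WL_DATA_DIR.
knn_key = 'mlp_weak_labeling_weaklabels.parquet'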

knn_wl_ds = create_dataset('knn', weak_labelled[knn_key], 
                           weak_labelled[knn_key]['embedding_vec'], 
                           weak_labelled[knn_key]['label'])

px_session = launch_px(knn_wl_ds, None)
px_session.view()

The cluster that lies farthest from the rest seems to consist almost solely of **music album reviews** labeled `1`, meaning a positive sentiment.

Otherwise, the labels don't show a specific pattern or clustering. The embedding positions themselves, however, are fairly clearly clustered. We are using the embedding vectors of Hugging Face's `all-MiniLM-L6-v2` BERT-style sentence transformer, which was trained on sentence pairs, among them question-answer pairs. The resulting embedding vector therefore describes the semantic content of a sentence, and this is exactly what we can see in the embedding space: reviews from the `amazon-polarity` dataset are clustered according to their product niche, as already mentioned for the cluster containing music reviews.

But we can also observe other semantic relationships:
- Video game reviews lie between music and book reviews: this axis could perhaps describe interactivity. Music can be enjoyed passively, games offer some interaction between cutscenes, while books capture one's concentration and attention entirely.
- Video game reviews lie opposite tech gadgets and other devices: this axis might describe the degree of virtuality. Games are entirely virtual, while tech gadgets are physical devices.
- Kids' toys are clustered between games and tech gadgets.
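
These positional observations come from visual inspection of the projection. A rough way to check them in the original embedding space would be to compare cluster centroids by cosine similarity. The sketch below assumes a hypothetical product-category label per review, which the weak labelled data does not necessarily contain, and it uses random vectors in place of real embeddings. `centroid_cosine_similarities` is an illustrative helper, not part of the project.

```python
import numpy as np

def centroid_cosine_similarities(embeddings, categories):
    """Return sorted category names and the cosine similarity matrix of their centroids."""
    names = sorted(set(categories))
    categories = np.asarray(categories)
    centroids = np.vstack(
        [embeddings[categories == name].mean(axis=0) for name in names]
    )
    # Normalize the centroids so that a dot product equals the cosine similarity.
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return names, unit @ unit.T

rng = np.random.default_rng(42)

# Hypothetical categories and random 384-dimensional vectors (assumption:
# these stand in for real reviews and their embeddings).
categories = ["music"] * 20 + ["games"] * 20 + ["books"] * 20
embeddings = rng.normal(size=(60, 384))

names, similarities = centroid_cosine_similarities(embeddings, categories)
for i, name in enumerate(names):
    print(name, np.round(similarities[i], 3))
```

On real data, if game reviews do sit between music and book reviews, the games centroid should be more similar to each of the two than they are to one another.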

Since Phoenix doesn't offer a different dimensionality reduction technique, we implemented a PCA projection ourselves. UMAP differs substantially from PCA (UMAP is a nonlinear, neighborhood-based method, while PCA is a linear projection), so a second technique could yield further observations about the embedding space.

In [None]:
plot_pca(weak_labelled, knn_key)

Compared to the UMAP representation, the PCA reduction doesn't seem to show any more separation in labels or semantics. We can still roughly make out the following four clusters:
- Music albums
- Books
- Movies
- Tech Gadgets
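
One way to judge how much a three-component PCA view can show at all is to look at the share of variance the leading components retain. The following is a minimal sketch on synthetic data, assuming 384-dimensional vectors as produced by `all-MiniLM-L6-v2`. It computes PCA through an SVD in NumPy rather than reusing `plot_pca`, and the random matrix only stands in for the real `embedding_vec` column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the embedding matrix: 500 vectors of dimension 384.
# Three directions get extra variance so the leading components stand out.
n_samples, n_dims = 500, 384
embeddings = rng.normal(size=(n_samples, n_dims))
embeddings[:, :3] *= np.array([8.0, 5.0, 3.0])

# PCA via SVD: after centering, the squared singular values are proportional
# to the variance captured by each principal component.
centered = embeddings - embeddings.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

top3 = explained_variance_ratio[:3]
print("Variance share of the first three components:", np.round(top3, 3))
print("Cumulative share:", round(float(top3.sum()), 3))
```

With scikit-learn, a fitted `PCA` object exposes the same quantity as `explained_variance_ratio_`. A low cumulative share would suggest that the 3D scatter plot hides most of the structure in the embedding space.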

### Using a sentence transformer with higher dimensionality

In [None]:
log_reg_key = 'log_reg_weak_labeling_weaklabels.parquet'

log_reg_wl_ds = create_dataset('log_reg', weak_labelled[log_reg_key], 
                           weak_labelled[log_reg_key]['embedding_vec'], 
                           weak_labelled[log_reg_key]['label'])

px_session = launch_px(log_reg_wl_ds, None)
px_session.view()

In [None]:
plot_pca(weak_labelled, log_reg_key)

### Multi-Layer Perceptron Weak Labelling

In [None]:
mlp_key = 'mlp_weak_labeling_weaklabels.parquet'

mlp_wl_ds = create_dataset('mlp_reg', weak_labelled[mlp_key], 
                           weak_labelled[mlp_key]['embedding_vec'], 
                           weak_labelled[mlp_key]['label'])

px_session = launch_px(mlp_wl_ds, None)
px_session.view()

In [None]:
plot_pca(weak_labelled, mlp_key)

### Random Forest Weak Labelling

In [None]:
rf_key = 'rf_weak_labeling_weaklabels.parquet'

rf_wl_ds = create_dataset('rf_reg', weak_labelled[rf_key], 
                           weak_labelled[rf_key]['embedding_vec'], 
                           weak_labelled[rf_key]['label'])

px_session = launch_px(rf_wl_ds, None)
px_session.view()

In [None]:
plot_pca(weak_labelled, rf_key)

### Support Vector Machine Weak Labelling

In [None]:
svm_key = 'svm_weak_labeling_weaklabels.parquet'

svm_wl_ds = create_dataset('svm_reg', weak_labelled[svm_key], 
                           weak_labelled[svm_key]['embedding_vec'], 
                           weak_labelled[svm_key]['label'])

px_session = launch_px(svm_wl_ds, None)
px_session.view()

In [None]:
plot_pca(weak_labelled, svm_key)