# Analysis of Embeddings

This notebook explores which patterns dimensionality reduction techniques reveal in the embeddings produced by the embedding models and labeled by the proposed weak labeling models.

## Loading Weak Labeled data

In [None]:
import os
import sys
from dotenv import load_dotenv

import pandas as pd

current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)

sys.path.append(parent_dir)

load_dotenv()

In [None]:
WL_DATA_DIR = os.getenv('DATA_DIR', 'data')
WL_DATA_DIR = os.path.abspath(os.path.join(parent_dir, WL_DATA_DIR, 'weak_labelled'))

weak_labelled = {}

print(f"Reading weak labelled data from {WL_DATA_DIR}")
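# Load only parquet files so stray files (e.g. .gitkeep) do not break the loop.
for file in sorted(os.listdir(WL_DATA_DIR)):
    if not file.endswith('.parquet'):
        continue
    weak_labelled[file] = pd.read_parquet(os.path.join(WL_DATA_DIR, file))
    print(f"- Read {file}")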
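Before plotting, it can help to confirm that each loaded frame has the columns used later in this notebook (`embedding_vec` and `label`). The following is a minimal sketch, assuming every row stores its embedding as a fixed-length vector; `summarize_weak_labels` is a hypothetical helper, and the toy frame only stands in for the real parquet files:

```python
import numpy as np
import pandas as pd


def summarize_weak_labels(frames):
    """Return one summary row per frame: row count, embedding size, label counts."""
    rows = []
    for name, df in frames.items():
        missing = {"embedding_vec", "label"} - set(df.columns)
        if missing:
            raise KeyError(f"{name} is missing columns: {sorted(missing)}")
        # Stack the per-row vectors into an (n_rows, n_dims) matrix.
        matrix = np.vstack(df["embedding_vec"].tolist())
        rows.append({
            "file": name,
            "n_rows": matrix.shape[0],
            "n_dims": matrix.shape[1],
            "label_counts": df["label"].value_counts().to_dict(),
        })
    return pd.DataFrame(rows)


# Toy stand-in for the real data (assumption: 4-dimensional embeddings).
toy = {
    "toy.parquet": pd.DataFrame({
        "embedding_vec": [np.zeros(4), np.ones(4), np.full(4, 2.0)],
        "label": [0, 1, 1],
    })
}
summary = summarize_weak_labels(toy)
print(summary)
```

In the notebook itself, the call would presumably be `summarize_weak_labels(weak_labelled)`.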

## Visualizing Labeled data
To make the high-dimensional embeddings readable, we use Arize's Phoenix app, which lets us interactively explore the embedding space projected down to three dimensions with UMAP.

It can also be insightful to look at a different dimensionality reduction approach. We therefore wrote the `plot_pca` function, which projects the embedding space into two dimensions using PCA.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.decomposition import PCA
from src.px_utils import create_dataset, launch_px

def plot_pca(weak_labelled, key):
    """Scatter-plot the 2D PCA projection of the embeddings stored under `key`, colored by label."""
    if key not in weak_labelled:
        raise ValueError(f"Key {key} not found in the weak_labelled dictionary.")
    
    df = weak_labelled[key]

    embeddings = np.vstack(df['embedding_vec'].values)
    labels = np.array(df['label'].values)

    pca = PCA(n_components=2)
    reduced_embeddings = pca.fit_transform(embeddings)
    
    plt.figure(figsize=(12, 8))
    pca_df = pd.DataFrame(reduced_embeddings, columns=['PCA1', 'PCA2'])
    pca_df['Label'] = labels
    
    sns.scatterplot(x='PCA1', y='PCA2', hue='Label', palette='viridis', data=pca_df, alpha=0.6)
    plt.suptitle(f'PCA of Embedding Vectors for {key}')
    plt.title(f'Explained Variance: {pca.explained_variance_ratio_}')
    plt.xlabel('PCA Component 1')
    plt.ylabel('PCA Component 2')
    plt.legend(title='Label')
    plt.show()
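
The plot title prints `pca.explained_variance_ratio_`, which states how much of the total variance each of the two plotted components retains. If the two values sum to only a small fraction, the 2D scatter hides most of the structure in the embeddings. A minimal NumPy sketch of that quantity on synthetic data (the helper name and the random data are illustrative and not part of the notebook):

```python
import numpy as np


def pca_explained_variance_ratio(X, n_components=2):
    """Share of total variance captured by each of the first principal components."""
    X_centered = X - X.mean(axis=0)
    # The squared singular values of the centered data give the component variances.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return variances[:n_components] / variances.sum()


# Synthetic stand-in for embeddings: most variance lies along the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) * np.array([5.0, 2.0] + [0.5] * 8)
ratio = pca_explained_variance_ratio(X)
print(ratio, ratio.sum())
```

The same definition underlies scikit-learn's `explained_variance_ratio_`, so the values should agree with `PCA(n_components=2)` fitted on the same data up to floating-point error.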

### KNN Weak Labels

In [None]:
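# Assumed file name, following the naming pattern of the other weak labeling outputs;
# the original cell pointed at the MLP file.
knn_key = 'knn_weak_labeling_weaklabels.parquet'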

knn_wl_ds = create_dataset('knn', weak_labelled[knn_key], 
                           weak_labelled[knn_key]['embedding_vec'], 
                           weak_labelled[knn_key]['label'])

px_session = launch_px(knn_wl_ds, None)
px_session.view()

`Cluster 3` clearly stands out from the other clusters. It consists almost exclusively of music album reviews labeled `1`, i.e. positive sentiment.

Apart from that, the weakly labeled embeddings show no specific pattern in Phoenix's UMAP projection.

Since Phoenix does not offer a different dimensionality reduction technique, we implement a PCA projection ourselves. UMAP is a nonlinear method that emphasizes local neighborhood structure, whereas PCA is a linear projection onto the directions of greatest variance, so a second technique can surface different structure in the embedding space.

In [None]:
plot_pca(weak_labelled, knn_key)

Compared to the UMAP representation, the PCA projection shows more separation between the two labels. The `0`-labeled embeddings form a relatively dense layer above the `1`-labeled embeddings. Neither class is clearly separated, however; this observation only describes a tendency that becomes visible through PCA.
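
One way to put a rough number on this visual tendency is to compare the distance between the two class centroids with the spread inside each class in the 2D projection. The following is only a sketch on synthetic points; `centroid_separation` is a hypothetical helper and was not part of the original analysis:

```python
import numpy as np


def centroid_separation(points, labels):
    """Distance between the two class centroids divided by the mean within-class spread."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    if len(classes) != 2:
        raise ValueError("Expected exactly two classes.")
    groups = [points[labels == c] for c in classes]
    centroids = [g.mean(axis=0) for g in groups]
    # Spread: root mean squared distance of each point to its own centroid.
    spreads = [
        np.sqrt(((g - c) ** 2).sum(axis=1).mean())
        for g, c in zip(groups, centroids)
    ]
    return np.linalg.norm(centroids[0] - centroids[1]) / np.mean(spreads)


# Synthetic 2D points: one fully overlapping set, one with class 1 shifted upward.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
overlapping = rng.normal(size=(400, 2))
shifted = overlapping + np.where(labels[:, None] == 1, [0.0, 4.0], [0.0, 0.0])
print(centroid_separation(overlapping, labels))
print(centroid_separation(shifted, labels))
```

Values well below 1 suggest heavily overlapping classes. In this notebook, the inputs would be the `reduced_embeddings` and `labels` arrays computed inside `plot_pca`.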

### Logistic Regression Weak Labeling

In [None]:
log_reg_key = 'log_reg_weak_labeling_weaklabels.parquet'

log_reg_wl_ds = create_dataset('log_reg', weak_labelled[log_reg_key], 
                           weak_labelled[log_reg_key]['embedding_vec'], 
                           weak_labelled[log_reg_key]['label'])

px_session = launch_px(log_reg_wl_ds, None)
px_session.view()

In [None]:
plot_pca(weak_labelled, log_reg_key)

### Multi-Layer Perceptron Weak Labeling

In [None]:
mlp_key = 'mlp_weak_labeling_weaklabels.parquet'

mlp_wl_ds = create_dataset('mlp_reg', weak_labelled[mlp_key], 
                           weak_labelled[mlp_key]['embedding_vec'], 
                           weak_labelled[mlp_key]['label'])

px_session = launch_px(mlp_wl_ds, None)
px_session.view()

In [None]:
plot_pca(weak_labelled, mlp_key)

### Random Forest Weak Labeling

In [None]:
rf_key = 'rf_weak_labeling_weaklabels.parquet'

rf_wl_ds = create_dataset('rf_reg', weak_labelled[rf_key], 
                           weak_labelled[rf_key]['embedding_vec'], 
                           weak_labelled[rf_key]['label'])

px_session = launch_px(rf_wl_ds, None)
px_session.view()

In [None]:
plot_pca(weak_labelled, rf_key)

### Support Vector Machine Weak Labeling

In [None]:
svm_key = 'svm_weak_labeling_weaklabels.parquet'

svm_wl_ds = create_dataset('svm_reg', weak_labelled[svm_key], 
                           weak_labelled[svm_key]['embedding_vec'], 
                           weak_labelled[svm_key]['label'])

px_session = launch_px(svm_wl_ds, None)
px_session.view()

In [None]:
plot_pca(weak_labelled, svm_key)