# Analysis of Embeddings

## Loading Weakly Labeled data

In [None]:
import pandas as pd

df_knn_wl = pd.read_parquet('../data/weak_labelled/knn_weak_labeling_weaklabels.parquet')
df_knn_wl.head(5)

df_knn_wl['embedding_vec'].loc[0].shape

## Visualizing Labeled data
To make the high-dimensional embeddings human-readable, we use Arize's Phoenix app, which lets us interactively explore the embedding space projected down to 3 dimensions with UMAP.

In [None]:
import os
import sys

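# The notebook is assumed to run from a subdirectory of the project root,
# so the parent directory is added to the import path to make `src` importable.
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)

sys.path.append(parent_dir)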

from src.px_utils import create_dataset, launch_px

test_ds = create_dataset("Weak Labeled Dataset", df_knn_wl, df_knn_wl['embedding_vec'], df_knn_wl['label'])

launch_px(test_ds, None)

`Cluster 3` stands out clearly from the other clusters. It consists almost entirely of music album reviews labeled `1`, meaning positive sentiment.

Otherwise, the weakly labeled embeddings show no specific pattern in Phoenix's UMAP projection.

### Projection with PCA

Since Phoenix doesn't support other dimensionality reduction techniques, we implement a PCA projection ourselves. UMAP is a nonlinear method that focuses on preserving neighborhood structure, whereas PCA is a linear projection onto the directions of greatest variance, so PCA may reveal structure in the embedding space that the UMAP view does not.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Stack the per-row embedding vectors into a 2D array of shape (n_samples, n_features)
# and project them onto the first three principal components.
pca = PCA(n_components=3)
pca_result = pca.fit_transform(np.stack(df_knn_wl['embedding_vec'].values))

df_knn_wl['PC1'] = pca_result[:, 0]
df_knn_wl['PC2'] = pca_result[:, 1]
df_knn_wl['PC3'] = pca_result[:, 2]

# 3D scatter plot of the projection, colored by weak label.
# The low alpha keeps dense regions readable.
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df_knn_wl['PC1'], df_knn_wl['PC2'], df_knn_wl['PC3'], c=df_knn_wl['label'], cmap='viridis', alpha=0.1)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()
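
When reading a 3-component PCA plot, it helps to know how much of the total variance the three components actually retain. The following is a minimal sketch of that check, not part of the original analysis: it uses a synthetic random matrix (`embeddings`, an assumption standing in for the stacked `embedding_vec` column) and reads scikit-learn's `explained_variance_ratio_` attribute from the fitted model.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the stacked embedding matrix
# (assumption: 200 vectors of dimension 32; the real data would come from df_knn_wl).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))

pca = PCA(n_components=3)
projected = pca.fit_transform(embeddings)

# Fraction of the total variance captured by each component,
# and by the three components together.
ratios = pca.explained_variance_ratio_
retained = float(ratios.sum())

print("per component:", np.round(ratios, 3))
print("retained in 3D:", round(retained, 3))
```

If the retained fraction is small, the 3D scatter plot shows only a thin slice of the embedding space, and an absence of visible structure there should be interpreted with caution.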