Using LayerNorm before PCA while performing embedding dimensionality reduction. #2657

Open · Adversarian opened this issue May 18, 2024 · 2 comments

@Adversarian

Hi,
I would like to begin by thanking you for your tremendous work on this library.

I had a question regarding your dimensionality_reduction.py example.

This is where you fit a PCA on top of the embeddings obtained by your model:

import numpy as np
from sklearn.decomposition import PCA

# To determine the PCA matrix, we need some example sentence embeddings.
# Here, we compute the embeddings for 20k random sentences from the AllNLI dataset
pca_train_sentences = nli_sentences[0:20000]
train_embeddings = model.encode(pca_train_sentences, convert_to_numpy=True)

# Compute PCA on the train embeddings matrix
pca = PCA(n_components=new_dimension)
pca.fit(train_embeddings)
pca_comp = np.asarray(pca.components_)

I was wondering whether it would be better to first append a LayerNorm with elementwise_affine=False to the model, so that PCA receives standardized inputs. I've extended sentence-transformers' models.LayerNorm so that it accepts additional args and kwargs for self.norm, and performed this experiment on my own dataset (which, unfortunately, I'm not at liberty to share); it seems to perform better than plain PCA with no LayerNorm.
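
Roughly, the module I appended looks like the sketch below (simplified, with a made-up class name; the real models.LayerNorm in sentence-transformers additionally handles saving/loading):

import torch
from torch import nn


class NonAffineLayerNorm(nn.Module):
    """Minimal sketch of a LayerNorm module that forwards extra kwargs to nn.LayerNorm."""

    def __init__(self, dimension: int, **kwargs):
        super().__init__()
        # elementwise_affine=False -> no learnable scale/shift, so this is a pure
        # per-vector standardization over the embedding dimension.
        self.norm = nn.LayerNorm(dimension, elementwise_affine=False, **kwargs)

    def forward(self, features: dict) -> dict:
        # sentence-transformers modules pass along a features dict that carries
        # the pooled vector under "sentence_embedding".
        features["sentence_embedding"] = self.norm(features["sentence_embedding"])
        return features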

I'm just wondering whether I got lucky with my particular data, or whether this is something to actually consider when performing dimensionality reduction.

Thanks in advance!

@Jakobhenningjensen commented May 22, 2024

You can just apply sklearn's normalize to the un-normalized embeddings:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

pca_train_sentences = nli_sentences[0:20000]
train_embeddings = model.encode(pca_train_sentences, convert_to_numpy=True)
normalized_embeddings = normalize(train_embeddings)  # L2-normalize each embedding row

# Compute PCA on the normalized train embeddings matrix
pca = PCA(n_components=new_dimension)
pca.fit(normalized_embeddings)
pca_comp = np.asarray(pca.components_)

or just pass normalize_embeddings=True to model.encode:

pca_train_sentences = nli_sentences[0:20000]
# normalize_embeddings=True L2-normalizes each embedding to unit length
train_embeddings = model.encode(pca_train_sentences, convert_to_numpy=True, normalize_embeddings=True)

# Compute PCA on the train embeddings matrix
pca = PCA(n_components=new_dimension)
pca.fit(train_embeddings)
pca_comp = np.asarray(pca.components_)

@Adversarian (Author) commented May 22, 2024

@Jakobhenningjensen Thanks for your response!

The idea is to avoid any external preprocessing and have the model perform end-to-end forward passes natively. That's why the example uses a Dense layer initialized with the PCA components instead of calling pca.transform on new inputs every time. This makes your first suggestion undesirable, since it would require running sklearn's normalize on every input at inference time.
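
For context, the example roughly does the following after fitting the PCA (paraphrased from dimensionality_reduction.py; exact variable names may differ):

import torch
from sentence_transformers import models

# Add a Dense layer (no bias, identity activation) whose weights are the PCA
# components, so the projection happens inside the model's own forward pass.
dense = models.Dense(
    in_features=model.get_sentence_embedding_dimension(),
    out_features=new_dimension,
    bias=False,
    activation_function=torch.nn.Identity(),
)
dense.linear.weight = torch.nn.Parameter(torch.tensor(pca_comp))
model.add_module("dense", dense)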

Secondly, LayerNorm and torch.nn.functional.normalize (which is what normalize_embeddings=True applies) do very different things. Since PCA is sensitive to the scale of the data, it's good practice to z-score standardize it before fitting the PCA, which is what LayerNorm with elementwise_affine=False does (or with elementwise_affine=True, provided the affine parameters are never updated during training). torch.nn.functional.normalize, on the other hand, simply divides each tensor by its $L_p$ norm so that all tensors have unit length in $L_p$ space. I'm not sure whether the two are mathematically equivalent from the point of view of PCA; I'm just pointing out the difference.
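
To illustrate the difference, a toy example (not from the thread, just for demonstration):

import torch
import torch.nn.functional as F

x = torch.randn(4, 768) * 3 + 5  # embeddings with non-zero mean and non-unit scale

# LayerNorm without affine parameters: per-vector z-score standardization
# over the embedding dimension (zero mean, unit variance per vector).
layer_norm = torch.nn.LayerNorm(768, elementwise_affine=False)
standardized = layer_norm(x)
print(standardized.mean(dim=-1))  # ~0 for every vector
print(standardized.std(dim=-1))   # ~1 for every vector

# F.normalize: divide each vector by its L2 norm so it has unit length;
# the mean and variance of its components are not controlled.
unit = F.normalize(x, p=2, dim=-1)
print(unit.norm(dim=-1))  # ~1 for every vector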
