
Normalize embeddings when using a custom dense encoder? #1720

Closed
orionw opened this issue Nov 18, 2023 · 3 comments


orionw commented Nov 18, 2023

Hi there!

Thanks for the great work on Pyserini! I had a naive question that I can't seem to find the answer to.

I'd like to use E5 (https://huggingface.co/intfloat/e5-large, though other models are similar), and its authors recommend normalizing the embeddings. I can't find an option in Pyserini for that when doing plain dense retrieval rather than a dense/sparse hybrid. I'd like to use dense encodings only, but make sure the embeddings are normalized so E5 is used properly.

I've been using pyserini.encode so far but don't see an option for this there. Does Pyserini support it?

MXueguang (Member)

Hi @orionw,
The AutoDocumentEncoder has the argument l2_norm for initialisation.

def __init__(self, model_name, tokenizer_name=None, device='cuda:0', pooling='cls', l2_norm=False):

However, that option isn't exposed as a CLI argument in pyserini.encode.
I'll open a pull request to add it.
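In the meantime, the normalization itself is straightforward to apply after encoding. Below is a minimal, self-contained sketch of row-wise L2 normalization with NumPy; the function name `l2_normalize` is just an illustration, not part of Pyserini's API:

```python
import numpy as np

def l2_normalize(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row vector to unit L2 norm.

    A small epsilon guards against division by zero for all-zero rows.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)

# Example: a batch of two embedding vectors.
vectors = np.array([[3.0, 4.0], [0.0, 5.0]])
unit = l2_normalize(vectors)
# Each row now has norm 1, so inner product equals cosine similarity.
```

With unit-norm embeddings, a dot-product index (e.g. FAISS IndexFlatIP) effectively ranks by cosine similarity, which is what E5's recommendation targets.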


orionw commented Nov 21, 2023

Thanks a bunch @MXueguang!

MXueguang (Member)

#1722
