
Normalize embeddings when using a custom dense encoder? #1720

Closed
orionw opened this issue Nov 18, 2023 · 3 comments


orionw commented Nov 18, 2023

Hi there!

Thanks for the great work on Pyserini! I had a naive question that I can't seem to find the answer to.

I'd like to use E5 (https://huggingface.co/intfloat/e5-large, though other models are similar), and its authors recommend normalizing the embeddings. I can't find an option in Pyserini for that when doing plain dense retrieval rather than a dense/sparse hybrid. I'd like to use dense encodings only, but make sure the embeddings are normalized so E5 is used properly.

I've been using pyserini.encode so far but don't see an option for this there. Does Pyserini support it?

MXueguang (Member)

Hi @orionw,
The AutoDocumentEncoder has the argument l2_norm for initialisation.

def __init__(self, model_name, tokenizer_name=None, device='cuda:0', pooling='cls', l2_norm=False):

However, that option isn't exposed as a CLI argument in pyserini.encode.
I'll open a pull request to add it.
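In the meantime, the normalization itself is straightforward to apply after encoding. Below is a minimal, self-contained sketch of row-wise L2 normalization with NumPy; the function name `l2_normalize` is just an illustration, not part of Pyserini's API:

```python
import numpy as np

def l2_normalize(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row vector to unit L2 norm.

    A small epsilon guards against division by zero for all-zero rows.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)

# Example: a batch of two embedding vectors.
vectors = np.array([[3.0, 4.0], [0.0, 5.0]])
unit = l2_normalize(vectors)
# Each row now has norm 1, so inner product equals cosine similarity.
```

With unit-norm embeddings, a dot-product index (e.g. FAISS IndexFlatIP) effectively ranks by cosine similarity, which is what E5's recommendation targets.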


orionw commented Nov 21, 2023

Thanks a bunch @MXueguang!

MXueguang (Member)

#1722
