Weird Behaviour for Finetuning Embeddings #2646

Open
harry7171 opened this issue May 15, 2024 · 4 comments

@harry7171

Hi,

Background:
I am trying to fine-tune the BGE-Large model ('BAAI/bge-large-en-v1.5') on a custom, domain-specific dataset.
Data format: triplets (anchor, positive, negatives); each data point has 1 anchor, 1 positive, and 5 negatives.
Loss used: MultipleNegativesRankingLoss (earlier I tried TripletLoss too).
Warmup steps: 10% of the training data.
Training samples: 13k (each sample = 1 anchor, 1 positive, 5 negatives).
Learning rate: 2e-6 (also tried 3e-5 and 2e-5).
Scheduler: tried linear warmup, cosine, and cosine with hard restarts.

The expectation after fine-tuning is that, during retrieval, the model ranks the positives at the top and the negatives as low as possible. What I observe instead is that the positive pairs are pushed down in rank by the fine-tuned model: pairs that used to appear at the top or within the top 10 now fall outside the top 50.
On top of that, when I check after 1 epoch the results get a little better, but training beyond 1 epoch degrades them again. For example, of about 7k positive pairs that were retrieved at rank 1 for their anchor, only about 3k remain at rank 1 after further training, with the rest dropping below the top 50.

I have been struggling to work out why this happens and how to improve it.

Thanks in advance

@tomaarsen
Collaborator

Hello!

This all sounds quite reasonable. (I'm assuming you're using Sentence Transformers before v3.0 here:) What is the format of your InputExample exactly? Is it this one?

InputExample(texts=["Represent this sentence for searching relevant passages: my query", "my positive", "my negative 1", "my negative 2", "my negative 3", "my negative 4", "my negative 5"])

Note in particular the prompt for the query, which the https://huggingface.co/BAAI/bge-large-en-v1.5 model recommends. Could that be the reason?
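
For reference, a minimal sketch of how those examples could be built with the pre-v3.0 API; to_input_example and my_triplet_rows are just illustrative names, not from your setup:

from sentence_transformers import InputExample

BGE_QUERY_PROMPT = "Represent this sentence for searching relevant passages: "

def to_input_example(anchor, positive, negatives):
    # Prepend the prompt to the query side only; passages are used as-is.
    return InputExample(texts=[BGE_QUERY_PROMPT + anchor, positive, *negatives])

train_examples = [
    to_input_example(row["anchor"], row["positive"], row["negatives"])
    for row in my_triplet_rows  # hypothetical iterable of dicts from your dataset
]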

Other than that, your setup seems like it should work well, and you should have enough training samples to see a meaningful improvement.
Another potential reason is that some models (not embedding models per se) don't finetune as well as others: they get worse for a bit before they "pick back up" and get better. That might be a bit of a stretch here, though.

Lastly, MultipleNegativesRankingLoss uses the provided negatives as well as the "in-batch negatives", i.e. all positives and negatives from other queries in the same batch. If you have a lot of exact overlap across your training samples, then it's possible that a lot of these "in-batch negatives" are actually relevant to your query. In that case, you'll start training with false negatives, which can be bad for performance.
If you have exact duplicates, then you can use NoDuplicatesDataLoader so that no duplicates exist in a batch. Otherwise, you can use GISTEmbedLoss with a simple guide model like all-MiniLM-L6-v2:

MultipleNegativesRankingLoss is similar to this loss, but it does not use a guide model to guide the in-batch negative sample selection. GISTEmbedLoss yields a stronger training signal at the cost of some training overhead.

In essence, GISTEmbedLoss is MultipleNegativesRankingLoss, but it uses a "guide model" to ignore some in-batch negatives if the guide model thinks that similarity(anchor, in-batch-negative) > similarity(anchor, positive). This helps get rid of some false negatives.
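
A rough sketch of how that could be wired up with the pre-v3.0 fit API (model names and hyperparameters are only illustrative, and train_examples is assumed to be your list of InputExample objects):

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import NoDuplicatesDataLoader

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
guide = SentenceTransformer("all-MiniLM-L6-v2")

# NoDuplicatesDataLoader keeps exact duplicates out of the same batch.
train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=16)

# GISTEmbedLoss uses the guide model to discard likely false in-batch negatives.
train_loss = losses.GISTEmbedLoss(model=model, guide=guide)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=int(0.1 * len(train_dataloader)),
)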

  • Tom Aarsen

@harry7171
Author

harry7171 commented May 16, 2024

@tomaarsen thanks for the detailed explanation.

  • For the input format: no, I didn't use the BGE prompt format. I will try with it.
  • Apart from that, I tried GISTEmbedLoss with a guide model as mentioned (2 epochs, batch size 16, data format ('anchor', 'pos', 'neg')) and the results deteriorated. To evaluate, I use a custom function that takes the anchors and positives, builds a vector store, fetches the top-k results for each anchor, records the rank of its positive, and computes MRR from that (we use this approach because the embeddings feed a RAG pipeline, so the positives should rank highest); with that metric, the MRR decreased. Do you suggest any other way to evaluate the performance? (A simplified sketch of this evaluation is included after this list.)
  • I also tried to replicate https://huggingface.co/blog/how-to-train-sentence-transformers, but to my surprise even that deteriorated after 2 epochs. Very strange!
  • I am not sure whether I am making some major blunder, or how to validate that this fine-tuning can work even on sample data. Let me know if you have any examples for a similar case; I can try replicating them.
  • Could it be the embedding model? Should I try another one?
  • Do let me know if you have any ideas about what the issue might be, and I will try them.
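
Roughly what that custom evaluation boils down to, as a simplified sketch (the function and variable names are made up for illustration; it assumes anchors[i] should retrieve positives[i], with all positives used as the corpus):

from sentence_transformers import SentenceTransformer, util

def mean_reciprocal_rank(model_name, anchors, positives, top_k=50):
    model = SentenceTransformer(model_name)
    # If the model was trained with the BGE query prompt, prepend it to the anchors here too.
    corpus_emb = model.encode(positives, convert_to_tensor=True, normalize_embeddings=True)
    query_emb = model.encode(anchors, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)
    reciprocal_ranks = []
    for i, query_hits in enumerate(hits):
        # Rank of the expected positive (corpus_id == i) among the top_k results, if present.
        rank = next((r + 1 for r, hit in enumerate(query_hits) if hit["corpus_id"] == i), None)
        reciprocal_ranks.append(1.0 / rank if rank is not None else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

(As a cross-check, sentence_transformers.evaluation.InformationRetrievalEvaluator also reports MRR@k among other retrieval metrics.)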

Update: I found this similar issue: #2358
Could the learning rate and batch size be the bottleneck here?

Thanks in advance!!

@harry7171
Author

Hi @tomaarsen,
Did you get a chance to look into this?

Thanks in advance

@tomaarsen
Collaborator

Hello!

I'm afraid not, I've been busy with the upcoming v3.0 release. You can try any of the example scripts to see if you can reproduce this somehow, or you can switch to another model. I don't have any other great ideas, but the prompt/input format is quite important, so that might be it.

  • Tom Aarsen
