Update results for Russian models #19
Conversation
@KennethEnevoldsen @Muennighoff Can you merge this?
Thanks for the ping! Looked at a few samples; probably worth discussing these before merging.
...e5-mistral-7b-instruct/07163b72af1488142a360786df853f237b1a3ca1/KinopoiskClassification.json
Quite a large change in accuracy here; why is that?
I was able to reproduce the results for 1.12.75, so I guess it's due to some changes between these versions (1.12.75 -> 1.14.12). And it's not only for me5-small and this task (Georeview) but for almost all models and Classification/Clustering tasks.
Maybe you have a hypothesis about what changes could affect the results?
Found that running MultiLabelClassification tasks first causes the problem. Code to reproduce:
```bash
# mteb==1.14.12
MODEL_NAME="cointegrated/rubert-tiny2"
mteb run \
  -m $MODEL_NAME -l rus --output_folder results \
  --co2_tracker true --verbosity 2 --batch_size 16 \
  -t \
  "SensitiveTopicsClassification" \
  "GeoreviewClassification"
```
With this order, GeoreviewClassification gives 0.408935546875, while a single run gives 0.39638671875. Also checked Retrieval and Reranking tasks and they are OK.
So running the MultiLabelClassification first changes the result? It seems to be an issue with a seed being manipulated.
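A minimal, self-contained sketch of that hypothesis (illustrative only; this is not mteb code, and every name in it is made up): if any step of the evaluation draws from the global NumPy RNG, a task's score depends on whatever consumed that RNG earlier in the same process.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)   # fixed dataset, independent of the global RNG
X = rng.rand(500, 16)
y = (X[:, 0] > 0.5).astype(int)

def evaluate() -> float:
    # Subsample the training set via the *global* RNG -- the problematic part:
    # this draw depends on whatever ran earlier in the same process.
    idx = np.random.choice(400, size=64, replace=False)
    clf = LogisticRegression(max_iter=200).fit(X[idx], y[idx])
    return clf.score(X[400:], y[400:])

np.random.seed(42)
score_alone = evaluate()          # "single run"

np.random.seed(42)
_ = np.random.rand(10)            # a preceding task consumes global RNG state
score_after_other = evaluate()    # same task, same seed, different score

print(score_alone, score_after_other)
```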
Yes, I will run these tasks separately and update the results then.
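For reference, a sketch of that workaround, reusing the model name and flags from the snippet above: launch one fresh process per task, so no global RNG state is shared between tasks.

```python
# Run each task in its own mteb CLI invocation (one process per task).
import subprocess

MODEL_NAME = "cointegrated/rubert-tiny2"
TASKS = ["SensitiveTopicsClassification", "GeoreviewClassification"]

for task in TASKS:
    subprocess.run(
        ["mteb", "run",
         "-m", MODEL_NAME, "-l", "rus",
         "--output_folder", "results",
         "--co2_tracker", "true", "--verbosity", "2", "--batch_size", "16",
         "-t", task],
        check=True,  # stop if any task fails
    )
```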
I will look into this in a PR as well - see if we can get it fixed.
More generally, we should probably try to fix this issue here.
PR here: embeddings-benchmark/mteb#1193
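For anyone following along, the usual shape of such a fix (a generic pattern, not necessarily what that PR implements) is to reset every relevant RNG at the start of each task:

```python
import random

import numpy as np

def set_all_seeds(seed: int = 42) -> None:
    """Reset the global RNGs so each task starts from the same state."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # torch is optional here; random/numpy seeding still applies

# Call set_all_seeds() immediately before evaluating each task.
```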
My results for bge-m3 for comparison: new_results.tar.gz
@KennethEnevoldsen I've updated the affected tasks. It seems Clustering also has to be run separately. Also found that clustering results for sbert/rubert models in v1.12.25 are better, and I reproduced this difference.
Right, so order also matters here. Well, we knew it was a problem given this, but we should definitely get that patched up (though it might be a major version bump).
We've run several experiments (@Samoed) with the bge-m3 model and could not achieve the same results. As mentioned in embeddings-benchmark/mteb#942, this is not a big problem. For now we would like to go with these results and will update them in new versions. Waiting for #25 to be merged.
That works for me as well.
@KennethEnevoldsen I think this PR can be merged. After that, I will update the paths in my branch (or whatever else we decide) |
Merged a3326d8 into embeddings-benchmark:main
This PR updates results for several models to refresh the Russian leaderboard (embeddings-benchmark/leaderboard#26).
Models updated:
Most of the update brings minor changes, but all results now use one version (1.14.12) and kg_co2_emissions is computed (it wasn't before). Results for multilingual MassiveIntent, MassiveScenario and STS22 are not changed. Instruct models now use detailed instructions from embeddings-benchmark/mteb#1163.