Converted VG to hierarchical #694

Merged
merged 20 commits into from
May 24, 2024

Conversation

Contributor

@x-tabdeveloping x-tabdeveloping commented May 14, 2024

Checklist for adding MMTEB dataset

Reason for dataset addition:
VG Clustering was neither hierarchical nor a ClusteringFast task before #656

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

@x-tabdeveloping
Contributor Author

I can't run it for E5 as Ucloud is down today and it would take hours on my computer :')

@x-tabdeveloping
Contributor Author

Also added stratified subsampling code to AbsTask for multilabel problems, as this was missing.
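
For readers unfamiliar with the idea, here is a minimal sketch of one way stratified subsampling can be extended to multilabel data: treat each distinct combination of labels as its own stratum and sample from each stratum in proportion to its size. This is not the code added to AbsTask in this PR; the function name, its parameters, and the default seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_multilabel_subsample(labels, n_samples, seed=42):
    """Return sorted indices of a subsample stratified on label combinations.

    Hypothetical helper for illustration only; not the AbsTask implementation.
    `labels` is a list where each entry is the list of labels for one example.
    """
    n_total = len(labels)
    if n_samples >= n_total:
        return list(range(n_total))

    rng = random.Random(seed)

    # Group example indices by their order-independent set of labels.
    strata = defaultdict(list)
    for idx, row in enumerate(labels):
        strata[frozenset(row)].append(idx)

    chosen = []
    for members in strata.values():
        # Proportional allocation, rounded down (integer arithmetic).
        k = len(members) * n_samples // n_total
        chosen.extend(rng.sample(members, k))

    # Rounding down can leave the sample short; top up at random from the rest.
    chosen_set = set(chosen)
    remaining = [i for i in range(n_total) if i not in chosen_set]
    chosen.extend(rng.sample(remaining, n_samples - len(chosen)))
    return sorted(chosen)
```

Note that when most label combinations occur only once, the proportional allocation rounds down to zero for nearly every stratum and the result is effectively a plain random sample.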
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Looks good. Let us get the results in from the other models as well - otherwise good

mteb/abstasks/AbsTask.py (2 resolved review threads)
@KennethEnevoldsen
Contributor

@x-tabdeveloping let us get this one merged in as well

@x-tabdeveloping
Contributor Author

[15:44] There are currently no machines available to run your job.
[15:44] A smaller machine might give you quicker access to your job.
[15:45] Job has been cancelled

:')

@x-tabdeveloping
Contributor Author

Okay, I got a machine, hallelujah

@x-tabdeveloping
Contributor Author

Since the stratified subsampling doesn't exactly work as expected with multilabel data, I think I will just go with a random sample of 2048 entries.
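
A plain random subsample of this kind can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the helper name and the seed value are assumptions, and only the sample size of 2048 comes from the comment above.

```python
import random


def random_subsample(examples, n_samples=2048, seed=42):
    """Return up to `n_samples` examples chosen uniformly at random.

    Hypothetical helper for illustration; the fixed seed (an assumed value)
    keeps the subsample reproducible across runs.
    """
    if len(examples) <= n_samples:
        return list(examples)
    rng = random.Random(seed)
    # Sample indices without replacement and keep the original ordering.
    indices = sorted(rng.sample(range(len(examples)), n_samples))
    return [examples[i] for i in indices]
```

Sampling indices rather than the examples themselves makes it easy to apply the same selection to parallel columns, such as texts and their label lists.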

@x-tabdeveloping
Contributor Author

@KennethEnevoldsen green light?
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

green light 🟢

@KennethEnevoldsen KennethEnevoldsen enabled auto-merge (squash) May 23, 2024 14:10
@x-tabdeveloping
Contributor Author

Tests are failing because of missing datasets unrelated to my PR. What should I do?

@KennethEnevoldsen
Contributor

Pulling from main should fix it.

@KennethEnevoldsen KennethEnevoldsen merged commit ece878e into main May 24, 2024
7 checks passed
@KennethEnevoldsen KennethEnevoldsen deleted the vg-hierarchical branch May 24, 2024 09:42
dokato pushed a commit to dokato/mteb that referenced this pull request May 24, 2024
* Added hierarchical VG clustering tasks

* Added stratified subsampling for multilabel tasks to AbsTask

* Added stratified subsampling to VG clustering

* Fixed stratified subsampling for multilabel tasks

* fix: Converted VG to AbsTaskClusteringFast

* Added results for paraphrase model

* Removed debugging print statements

* Added 'not specified' license to VGHierarchical

* Added proper license from Norsk Aviskorpus

* Ran linting

* Replaced stratification with just regular subsampling

* fix: fixed subsampling

* Added results for VG

* Added points

* fix: Fixed JSON in 694.jsonl

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
2 participants