MalteseNewsClassification added #546
This is a multilabel dataset. Let's keep it that way, and wait for #440 to merge before proceeding. Metadata looks good otherwise, and still requires linting.
Yup, that makes sense, thanks @isaac-chung, I'll keep an eye on this PR!
It's merged :)
Thanks @isaac-chung, yeah I saw, but as this is quite big, we're still trying to figure out how to do stratified sampling for the multilabel problem, see: #698
@dokato Let's keep this in a separate PR so we can discuss the two things on different threads (this task and then stratified subsampling). It should also be fine considering that a) all of these models will be rerun anyway and b) we already have a function that does this, maybe faulty, but the interface already exists and we can swap it out anytime.
K, done that! Created a new PR in #760. This should be ready @isaac-chung. I "unwiped" it.
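For anyone following the stratified-subsampling question from this thread: below is a minimal sketch of one way to subsample a multilabel dataset while keeping every label represented. It is not the mteb implementation and not the approach from #698 or #760. It is a plain round-robin over per-label pools (simpler than proper iterative stratification), and the function name `multilabel_subsample` and the toy labels are hypothetical.

```python
import random


def multilabel_subsample(label_sets, n_samples, seed=42):
    """Return sorted indices of a subsample drawn round-robin across labels.

    label_sets: one list of labels per example (an example may have several).
    Hypothetical helper for illustration only, not part of mteb.
    """
    rng = random.Random(seed)

    # Build one pool of example indices per label; examples without any
    # label go into a pool keyed by None so they can still be drawn.
    pools = {}
    for idx, labels in enumerate(label_sets):
        for label in (labels or [None]):
            pools.setdefault(label, []).append(idx)
    for pool in pools.values():
        rng.shuffle(pool)

    target = min(n_samples, len(label_sets))
    order = sorted(pools, key=str)  # fixed label order keeps runs reproducible
    chosen = set()
    while len(chosen) < target:
        for label in order:
            pool = pools[label]
            # Skip indices already taken via another of their labels.
            while pool and pool[-1] in chosen:
                pool.pop()
            if pool:
                chosen.add(pool.pop())
                if len(chosen) == target:
                    break
    return sorted(chosen)


# Toy data: "sports" dominates, "culture" appears only once.
label_sets = [
    ["sports"], ["politics"], ["sports", "politics"], ["culture"],
    ["sports"], ["sports"], ["politics"], ["sports"],
]
picked = multilabel_subsample(label_sets, n_samples=3)
covered = {label for i in picked for label in label_sets[i]}
```

With three labels and `n_samples=3`, the first pass draws one example per label, so the rare "culture" example is always kept. The trade-off is that label proportions are flattened rather than preserved, which may or may not be what the benchmark wants.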
Looks good, thanks for adding this!
* MalteseNewsClassification added
* lint fixes
* MalteseNewsClassification as multilabel + WIP stratification added
* results for multilabel classification updated
* Maltese MultiLabelClassification added

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Maltese news classification dataset: https://huggingface.co/datasets/MLRS/maltese_news_categories
As suggested here: #419
Checklist for adding MMTEB dataset
Reason for dataset addition:
* mteb package
* mteb run -m {model_name} -t {task_name} command
* sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
* intfloat/multilingual-e5-small
* self.stratified_subsampling() under dataset_transform()
* make test
* make lint
* 438.jsonl
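As a convenience, here is a small sketch that fills in the `mteb run -m {model_name} -t {task_name}` template from the checklist for the two listed models. The task name is an assumption taken from the PR title; the name registered in the merged code may differ.

```python
import shlex

# Models named in the checklist above.
MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]

# Assumption: task name copied from the PR title, not verified against the code.
TASK_NAME = "MalteseNewsClassification"


def build_commands(models, task_name):
    """Fill the checklist's command template once per model."""
    return [
        shlex.join(["mteb", "run", "-m", model, "-t", task_name])
        for model in models
    ]


commands = build_commands(MODELS, TASK_NAME)
for command in commands:
    print(command)
```

The script only prints the commands; run them in a shell with the mteb package installed.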