Multiple label annotated datasets #455

Open
asparius opened this issue Apr 19, 2024 · 8 comments

Comments

@asparius
Collaborator

Currently, mmteb has two datasets: a Turkish multidomain product review dataset and a Kurdish sentiment dataset. In addition to sentiment labels, these datasets contain categorical annotations such as a domain tag. I am wondering whether we should create a new dataset that includes these categorical annotations. I believe it could be beneficial for low-resource languages. Waiting for your opinions :)

@imenelydiaker
Contributor

If you're talking about Multilabel classification tasks, I think this PR #440 may be interesting 🙂

@asparius
Collaborator Author

It is different from multilabel classification; perhaps I phrased it incorrectly. They are datasets that have both Sentiment and also category labels for the given samples. I have only uploaded their sentiment part. My question is whether I should also add the category classification as a task.

@imenelydiaker
Contributor

I guess you can create a new task using the same dataset and only change the label column? But since it's the same dataset and text, I don't know if changing the classes is relevant to the benchmark?

@KennethEnevoldsen
Contributor

> They are datasets that have both Sentiment and also category labels for the given samples

That seems to me to be multilabel classification (as opposed to multiclass).

@imenelydiaker
Contributor

I think you can frame it as a multilabel task, since the dataset offers 2 columns that can be used as labels. It's just that in a multilabel setting you'll try to predict both labels at the same time, no?

@KennethEnevoldsen
Contributor

It depends on the classifier used. A softmax classifier (equivalent to a one-layer MLP mapping embedding size --> n labels x n label types, e.g. 256 --> 2x3, assuming three binary label categories) would be the same as 3 independent softmax classifiers. However, if you add one more layer to that MLP, the label types share it, so they do influence each other.
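A minimal NumPy sketch of the one-layer case, assuming random weights and inputs and a per-label-type softmax (the names `W`, `n_types`, and `n_classes` are illustrative, not taken from the PR). It only checks that the forward pass of the joint layer factorizes into independent blocks:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
emb_dim, n_types, n_classes = 256, 3, 2   # 256 --> 2x3, three binary label types
X = rng.normal(size=(4, emb_dim))         # 4 hypothetical embeddings
W = rng.normal(size=(emb_dim, n_types * n_classes))

# Joint layer: one matrix multiply, then a softmax per label type.
joint_probs = softmax((X @ W).reshape(len(X), n_types, n_classes))

# Independent classifiers: each label type only uses its own weight columns.
independent_probs = np.stack(
    [softmax(X @ W[:, k * n_classes:(k + 1) * n_classes]) for k in range(n_types)],
    axis=1,
)

print(np.allclose(joint_probs, independent_probs))
```

Because each label type's output depends only on its own slice of `W`, a per-type loss also updates only that slice. A shared hidden layer would break this separation.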

From the PR (which is a WIP, so it might change) it seems like they use:

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html

which states

> This strategy consists of fitting one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification.

So essentially independent classifiers.
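A small sketch of what that means in practice, assuming synthetic features in place of real embeddings and two hypothetical label columns (`sentiment`, `category`). It compares `MultiOutputClassifier` against classifiers fitted separately on each column:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                 # stand-in for text embeddings
sentiment = (X[:, 0] > 0).astype(int)          # hypothetical binary label
category = np.digitize(X[:, 1], [-0.5, 0.5])   # hypothetical 3-class label
Y = np.column_stack([sentiment, category])

# One wrapper that fits a cloned classifier per target column.
multi = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
multi_pred = multi.predict(X)

# The same thing done by hand: one classifier per column, no interaction.
separate = [LogisticRegression(max_iter=1000).fit(X, Y[:, j]) for j in range(Y.shape[1])]
separate_pred = np.column_stack([clf.predict(X) for clf in separate])

print(np.array_equal(multi_pred, separate_pred))
```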

@x-tabdeveloping
Contributor

For this, a ClassifierChain seems more appropriate, as the labels are clearly dependent on each other. Chime in on the discussion at #440. I'm thinking of adding multiple options to the task so that independence does not need to be assumed, but we have to discuss how this is best executed.
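A hedged sketch of the chain idea using scikit-learn's `ClassifierChain`, assuming synthetic features and two hypothetical binary labels where `category` depends on `sentiment`. This is not the implementation from #440, only an illustration of how a chain feeds earlier labels to later classifiers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                  # stand-in for text embeddings
sentiment = (X[:, 0] > 0).astype(int)
# The second label is constructed to depend on the first one.
category = ((X[:, 1] + 1.5 * sentiment) > 0.75).astype(int)
Y = np.column_stack([sentiment, category])

# order=[0, 1]: predict sentiment first, then category given sentiment.
chain = ClassifierChain(
    LogisticRegression(max_iter=1000), order=[0, 1], random_state=0
).fit(X, Y)
pred = chain.predict(X)

# The second classifier sees the original features plus the first label.
n_inputs = [est.coef_.shape[1] for est in chain.estimators_]
print(n_inputs)
```

The order of the chain is a design choice: a label placed later can condition on every label placed before it, but not the other way around.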

@KennethEnevoldsen
Contributor

Ah right, then you might want to add that in as well. By the way, KNN seems to support multilabel natively.
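A brief sketch of that, assuming "KNN" refers to scikit-learn's `KNeighborsClassifier` and using synthetic data with two hypothetical label columns. The classifier accepts a 2D label array directly, with no wrapper:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                  # stand-in for text embeddings
Y = np.column_stack([
    (X[:, 0] > 0).astype(int),                 # hypothetical sentiment label
    (X[:, 1] > 0).astype(int),                 # hypothetical category label
])

# Fit once on all label columns; each column is voted on by the same neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, Y)
pred = knn.predict(X[:5])
print(pred.shape)
```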


4 participants