Multiple label annotated datasets #455

asparius · 2024-04-19T21:20:19Z

Currently, mmteb has two datasets: a Turkish multidomain product review dataset and a Kurdish sentiment dataset. These datasets contain categorical annotations, such as a domain tag, in addition to sentiment labels. I am wondering whether we should create a new dataset that includes these categorical annotations. I believe it could be beneficial for low-resource languages. Waiting for your opinions :)

imenelydiaker · 2024-04-20T17:45:14Z

If you're talking about multilabel classification tasks, I think PR #440 may be interesting 🙂

asparius · 2024-04-23T13:48:55Z

It is different from multilabel classification; perhaps I phrased it incorrectly. These are datasets that have both sentiment and category labels for the given samples. I have only uploaded their sentiment part. My question is whether I should also add the category classification task.

imenelydiaker · 2024-04-23T15:21:28Z

I guess you can create a new task using the same dataset and only change the label column? But since it's the same dataset and text, I don't know if changing the classes is relevant to the benchmark.

KennethEnevoldsen · 2024-04-23T16:11:48Z

> They are datasets that have both Sentiment and also category labels for the given samples

That seems to me to be multilabel classification (as opposed to multiclass).

imenelydiaker · 2024-04-23T16:30:29Z

I think you can frame it as a multilabel task, since the dataset offers 2 columns that can be used as labels. It's just that in a multilabel setting you'll try to predict both classes at the same time, no?

KennethEnevoldsen · 2024-04-23T16:41:00Z

It depends on the classifier used. A softmax classifier (equivalent to a one-layer MLP: embedding size --> n labels x n label types, e.g. 256 --> 2x3, assuming three binary label categories) would be the same as three independent softmax classifiers, as long as the softmax is applied separately per label type. However, if you add a hidden layer to that MLP, the label types share parameters, so the predictions are no longer independent.
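A minimal sketch of the equivalence described above, using only the standard library. It assumes the softmax is applied per label type and uses random weights as a stand-in for a trained head. The names `EMB_SIZE`, `N_LABEL_TYPES`, `N_CLASSES`, and `group` are illustrative and do not come from the PR. It only checks the forward pass.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """One linear layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


random.seed(0)
EMB_SIZE, N_LABEL_TYPES, N_CLASSES = 256, 3, 2  # 256 --> 2x3, as in the comment

x = [random.gauss(0, 1) for _ in range(EMB_SIZE)]  # stand-in embedding
W = [[random.gauss(0, 0.1) for _ in range(EMB_SIZE)]
     for _ in range(N_LABEL_TYPES * N_CLASSES)]
b = [random.gauss(0, 0.1) for _ in range(N_LABEL_TYPES * N_CLASSES)]


def group(t):
    """Slice of output units that belongs to label type t."""
    return slice(t * N_CLASSES, (t + 1) * N_CLASSES)


# Joint head: one linear layer, then a softmax per label type.
logits = linear(W, b, x)
joint = [softmax(logits[group(t)]) for t in range(N_LABEL_TYPES)]

# Independent heads: each label type only uses its own rows of W and b.
separate = [softmax(linear(W[group(t)], b[group(t)], x))
            for t in range(N_LABEL_TYPES)]

print(joint == separate)
```

Because no parameters are shared between the groups, the joint head is just the three separate heads stacked together. A hidden layer in front of the output layer would break this, since all label types would then depend on the same hidden weights.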

From the PR (which is a WIP, so it might change) it seems like they use:

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html

which states

> This strategy consists of fitting one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification.

So essentially independent classifiers.
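A small sketch of what "one classifier per target" means in practice, assuming scikit-learn and NumPy are installed. The data is synthetic, and the `sentiment` and `category` columns are hypothetical stand-ins for the two label columns discussed in this thread. It is not taken from the PR.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))            # stand-in for text embeddings
sentiment = (X[:, 0] > 0).astype(int)     # hypothetical binary sentiment label
category = np.argmax(X[:, 1:4], axis=1)   # hypothetical 3-way category label
Y = np.column_stack([sentiment, category])

# One wrapper that fits a separate clone of the estimator for each column of Y.
wrapped = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
wrapped_pred = wrapped.predict(X)

# The same thing done by hand: one independent classifier per label column.
manual_pred = np.column_stack([
    LogisticRegression(max_iter=1000).fit(X, Y[:, j]).predict(X)
    for j in range(Y.shape[1])
])

print(np.array_equal(wrapped_pred, manual_pred))
```

With a deterministic base estimator, the wrapper and the manual loop should give the same predictions, which is why the result is effectively the same as running two separate classification tasks on the same text.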

x-tabdeveloping · 2024-04-24T11:54:59Z

For this, a ClassifierChain seems more appropriate, as the labels are clearly dependent on each other. Chime in on the discussion at #440. I'm thinking of adding multiple options to the task so that independence does not need to be assumed, but we have to discuss how this is best executed.
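For comparison, a hedged sketch of scikit-learn's `ClassifierChain` on synthetic data. For simplicity it assumes two binary label columns, where the hypothetical `label_b` is constructed to depend on `label_a`. How the task in #440 would expose this is still open, so this only illustrates the mechanism.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)
n_features = 16
X = rng.normal(size=(300, n_features))    # stand-in for text embeddings
label_a = (X[:, 0] > 0).astype(int)       # hypothetical first label
# Hypothetical second label, built so that it depends on the first one.
label_b = ((X[:, 1] + 1.5 * label_a) > 0.75).astype(int)
Y = np.column_stack([label_a, label_b])

# order=[0, 1]: predict label_a first, then feed it as an extra feature
# to the classifier for label_b.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1])
chain.fit(X, Y)
pred = chain.predict(X)

# The second link sees the original features plus one chained label.
print(chain.estimators_[1].coef_.shape)
```

Unlike the independent setup, the second classifier in the chain is trained on the embedding plus the first label, so a dependency between the labels can be picked up. The chain order is a design choice, and a different order can give different results.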

KennethEnevoldsen · 2024-04-24T12:22:15Z

Ah right, then you might want to add that in as well. Btw, KNN seems to support multilabel natively.
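A short sketch of the KNN remark, assuming scikit-learn and NumPy and using synthetic data with hypothetical label columns. `KNeighborsClassifier` accepts a two-dimensional `y` directly, so no wrapper is needed.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))            # stand-in for text embeddings
Y = np.column_stack([
    (X[:, 0] > 0).astype(int),            # hypothetical sentiment label
    np.argmax(X[:, 1:4], axis=1),         # hypothetical category label
])

# y has shape (n_samples, n_outputs); the classifier is fitted on it as-is.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, Y)
pred = knn.predict(X[:10])

print(pred.shape)
```

All outputs are predicted from the same set of neighbours, but as far as I know each output column is still decided by its own vote among those neighbours.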
