Multiple label annotated datasets #455
Comments
If you're talking about multilabel classification tasks, I think PR #440 may be interesting 🙂
It is different from multilabel classification; perhaps I phrased it incorrectly. These are datasets that have both sentiment and category labels for the given samples. I have only uploaded their sentiment part. My question is whether I should also add the category classification task.
I guess you can create a new task using the same dataset and only change the label column? But since it's the same dataset and text, I don't know if changing the classes is relevant to the benchmark?
That seems to me to be multilabel classification (as opposed to multiclass).
I think you can frame it as a multilabel task, since the dataset offers 2 columns that can be used as labels. It's just that in a multilabel setting you'll try to predict both classes at the same time, no?
It depends on the classifier used. A softmax classifier (equivalent to a one-layer MLP: embedding size --> n labels x n label types, e.g. 256 --> 2x3, assuming three binary label categories) would be the same as 3 independent softmax classifiers. However, if you add one more layer to that MLP, the label types do influence each other, because they then share that layer. From the PR (which is a WIP, so it might change) it seems like they use https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html which states
So essentially independent classifiers.
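To make the "independent classifiers" point concrete, here is a minimal sketch, not the PR's actual code. It assumes random vectors as stand-in embeddings and two hypothetical label columns (`y_sentiment`, `y_category`) invented for illustration. It checks that wrapping a `LogisticRegression` in `MultiOutputClassifier` gives the same predictions as fitting one `LogisticRegression` per label column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Stand-in for sentence embeddings (assumption: 200 samples, 16 dimensions).
X = rng.normal(size=(200, 16))

# Two hypothetical label columns derived from random linear projections:
# a 3-class "sentiment" label and a 4-class "category" label.
y_sentiment = (X @ rng.normal(size=(16, 3))).argmax(axis=1)
y_category = (X @ rng.normal(size=(16, 4))).argmax(axis=1)
Y = np.column_stack([y_sentiment, y_category])

# One wrapper that is fitted on both label columns at once.
multi = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
joint_pred = multi.predict(X)

# The same base estimator fitted separately on each label column.
separate_pred = np.column_stack([
    LogisticRegression(max_iter=1000).fit(X, Y[:, j]).predict(X)
    for j in range(Y.shape[1])
])

# True if the wrapper adds nothing beyond per-column fitting.
same = np.array_equal(joint_pred, separate_pred)
print(same)
```

If that holds, a score computed per label column would not change by going through the wrapper, which is why adding the category labels as a separate task and adding them as a second output look the same to this kind of classifier.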
For this a |
Ah right, then you might want to add that in as well. Btw, KNN seems to support multilabel natively.
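As a quick illustration of the KNN remark, here is a small sketch assuming scikit-learn's `KNeighborsClassifier` and synthetic data. The features and the three binary labels are made up for the example. The classifier accepts a 2D binary indicator matrix as `y` directly, with no wrapper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-in embeddings (assumption: 100 samples, 8 dimensions).
X = rng.normal(size=(100, 8))

# Hypothetical multilabel targets: 3 binary labels per sample,
# stored as a binary indicator matrix of shape (n_samples, n_labels).
Y = (X[:, :3] > 0).astype(int)

# KNN is fitted on the 2D target directly.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, Y)

# Predictions come back with one column per label.
pred = knn.predict(X[:4])
print(pred.shape)
```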
Currently, mmteb has two datasets: a Turkish multidomain product review dataset and a Kurdish sentiment dataset. These datasets contain categorical annotations, such as a domain tag, in addition to sentiment labels. I am wondering whether we should create a new dataset that includes these categorical annotations. I believe it could be beneficial for low-resource languages. Waiting for your opinions :)