[datasets] Load Rubrix dataset as HF dataset #422
Comments
Implementing this and giving it more thorough thought, I propose to extend the API a bit and tackle the two use cases (sharing via the HF Hub, and training with transformers) separately.

**Sharing via HF Hub**

This API is basically what I discussed with Paco and what I am implementing right now. Writing tests, I already encountered some edge cases I hadn't thought of, and I have the feeling that there are many more (especially taking into account the other tasks, TokenClassification and Text2Text). So the main idea of this use case is:

```python
dataset = rb.load(..., format="datasets")
dataset.push_to_hub("Recognai/my_dataset")
# or
dataset = dataset.map(<whatever they feel like doing today>)
```
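For context: once pushed, anyone could pull the dataset back with the standard `datasets` API. A minimal sketch, assuming the snippet above actually pushed to a `Recognai/my_dataset` Hub repo:

```python
from datasets import load_dataset

# Pull the shared Rubrix dataset back from the Hugging Face Hub;
# push_to_hub creates a "train" split by default
dataset = load_dataset("Recognai/my_dataset", split="train")
```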
**Training with transformers**

Here I think it would be nice to have something that really prepares your dataset for training with transformers, so that in principle you could do:

```python
data = rb.load(...)
dataset = rb.utils.create_dataset_for_transformers(data)  # this should work for every output type of `rb.load(format=...)`
dataset = dataset.map(tokenizer)
transformers.Trainer(..., train_dataset=dataset, ...)
```

Ideally, this would work for every task. It would simplify our fine-tune transformers tutorial (the first one), and, currently doing the NER tutorial with Leire, I think this would be a huge help, especially for beginners.
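To make the proposal more tangible, here is a minimal sketch of what such a helper could do for text classification. `create_dataset_for_transformers` is only proposed above, so everything here is illustrative; it assumes fully annotated records exposing `inputs["text"]` and a string `annotation`:

```python
import datasets

def create_dataset_for_transformers(records):
    """Illustrative only: turn annotated text classification records
    into a Dataset with the standard `text`/`label` columns."""
    # Build a ClassLabel feature from the annotation strings
    labels = sorted({rec.annotation for rec in records})
    features = datasets.Features({
        "text": datasets.Value("string"),
        "label": datasets.ClassLabel(names=labels),
    })
    return datasets.Dataset.from_dict(
        {
            "text": [rec.inputs["text"] for rec in records],
            "label": [labels.index(rec.annotation) for rec in records],
        },
        features=features,
    )
```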
@dvsrepo @frascuchon Thoughts? My idea would be to provide a PR for the first use case as soon as possible, and then start working on the second one.
Thanks @dcfidalgo for the thorough discussion, I understand what you describe. (While writing this I kept changing my mind, so please read it as thinking-out-loud comments, and read until the end.)

As a first thought, I think some of the things you describe could be solved properly in order to have better compatibility with the HF format. One example is obtaining the labels in the dataset (which I think could be obtained via metrics/aggregations from Elastic). For predictions, I don't think we need, or that it is safe, to use ClassLabels, as this is outside the standard HF format for text classification/token classification.

I wouldn't mind having two separate methods for (1) outputting a Dataset (to use or to push), and (2) preparing for training (which you could push too). My main concern is that if we don't use ClassLabel for the labels, we lose compatibility with the standard HF format. Is there a middle ground where we at least keep some minimal compatibility for the input/output, or is it just too messy?

Another question: is your suggested approach (not changing the schema much during load) better for recreating a Rubrix dataset from a shared one? It might be that if we do the ClassLabel thing, it becomes more difficult to recreate a Rubrix dataset, so maybe your suggestion enables that round trip.
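To illustrate the trade-off in code (my framing, not from the thread): plain string labels map back to Rubrix records trivially, while `ClassLabel` matches the standard HF training format but stores integers that need decoding on the way back:

```python
import datasets

# Plain strings: trivial to turn back into Rubrix records
plain = datasets.Features({
    "text": datasets.Value("string"),
    "label": datasets.Value("string"),
})

# ClassLabel: standard for HF training, but labels become integers ...
encoded = datasets.Features({
    "text": datasets.Value("string"),
    "label": datasets.ClassLabel(names=["negative", "positive"]),
})

# ... so recreating a Rubrix dataset needs an extra decoding step
print(encoded["label"].int2str(1))  # -> "positive"
```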
My concern with this mix of the two use cases is that we end up with a format that is kind of a Frankenstein, and not easy to deal with in either use case. I was definitely thinking about the flow you describe. In my opinion, if we want to share datasets for interoperability with other libraries, they should be minimal and follow the most standard format, which often is a format optimized for training. So your last paragraph pretty much nails my point...
OK, so on my side: go ahead and implement your suggested approach. Just as a brief note, I think interoperability with the Hub will be equally important, if not more so, once we have Rubrix Cloud. I see the Hub as the reproducibility/versioning layer (at least for non-on-prem installations) we discussed some time ago with regard to snapshots.
* feat: add Dataset class
* test: add dataset tests
* feat: add meaningful error message to TaskType
* feat: add to/from datasets for text classification
* test: add dataset fixtures
* feat: implement pandas to text classification
* feat: add token classification support
* test: add token classification tests
* test: use singlelabel_textclassification_records
* chore: small improvements
* refactor: switch to class implementation
* chore: import in init
* test: add missing tests plus minor fixes
* chore: add future warning about as_pandas
* chore: more integrations in the library
* fix: wrong import
* chore: add new test dependency
* test: add test for tasktype
* feat: ignore not supported columns
* test: add tests for read_pandas/datasets
* docs: put type hints only in description, it becomes too messy
* test: improve tests
* docs: curate docstrings
* docs: add datasets to python reference
* fix: return None's instead of empty dicts for metrics
* docs: add dataset guide
* docs: increase contrast
* Update docs/guides/datasets_in_the_client.ipynb (five commits)
* test: add general log/load tests for all allowed input types of rb.load
* fix: remove append

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>
(cherry picked from commit 14c8087)
We need to add an integration with the Hugging Face datasets library. This will make it really easy to (1) train transformers, and (2) upload datasets to the Hub, avoiding the boilerplate and potential errors. Plus, it comes with really interesting features for batch processing, etc.
I see several options:

1. Add a `dataset` return type for `rb.load()`.
2. Define a method `to_hf_dataset()`. Not sure what the input would be, though. Maybe the dataset name, but then we might want to add a query, etc. So maybe records, but it's kind of inefficient and ugly to do `rb.load` and then `to_hf_dataset`.

Any other options?
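To make the two options concrete, here is how the call patterns might look; both signatures are hypothetical at this point, sketched only from the descriptions above:

```python
# Option 1 (hypothetical): a format/return-type switch on rb.load
dataset = rb.load("my_dataset", format="datasets")  # -> datasets.Dataset

# Option 2 (hypothetical): load records first, convert explicitly
records = rb.load("my_dataset")
dataset = rb.to_hf_dataset(records)  # -> datasets.Dataset
```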
Internally this will:
I think we can easily support all tasks. See a dirty example here:
https://rubrix.readthedocs.io/en/master/tutorials/08-error_analysis_using_loss.html#5.-Sharing-the-dataset-in-the-Hugging-Face-Hub
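The linked tutorial does this conversion by hand; roughly, a sketch of that flow (assuming `rb.load` returns a pandas DataFrame, which was its default output at the time, and `Recognai/my_dataset` as the target Hub repo):

```python
import rubrix as rb
from datasets import Dataset

# Load the annotated records from Rubrix as a pandas DataFrame
df = rb.load("my_dataset")

# Convert to a Hugging Face Dataset and share it on the Hub
dataset = Dataset.from_pandas(df)
dataset.push_to_hub("Recognai/my_dataset")
```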
I would include this in the next release