
[datasets] Load Rubrix dataset as HF dataset #422

Closed
dvsrepo opened this issue Oct 5, 2021 · 4 comments · Fixed by #1109

@dvsrepo (Member) commented Oct 5, 2021

We need to add an integration with the Hugging Face Datasets library. This will make it really easy to (1) train transformers and (2) upload datasets to the Hub, avoiding boilerplate and potential errors. Plus, it comes with really interesting features for batch processing, etc.

I see several options:

  1. Add a dataset return type for rb.load()

  2. Define a method to_hf_dataset(). Not sure what the input would be, though. Maybe the dataset name, but then we might want to add a query, etc., so maybe records, but it's kind of inefficient and ugly to do rb.load and then to_hf_dataset (both call shapes are sketched below).

Any other options?
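For illustration, a minimal sketch of the two call shapes (both the format argument and to_hf_dataset are proposals here, not an existing API):

import rubrix as rb

# Option 1: a return-type switch on rb.load (hypothetical `format` argument)
dataset = rb.load("my_dataset", format="datasets")

# Option 2: a dedicated conversion method (hypothetical, input type still to be decided)
records = rb.load("my_dataset")
dataset = to_hf_dataset(records)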

Internally this will:

  1. Use Features (ClassLabel, Value, Sequence, etc.) to define the schema of the dataset
  2. Create the Dataset via one of the Dataset.from_* constructors (see the sketch below)
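A minimal sketch of those two steps for text classification (the column and label names are made up for illustration):

from datasets import ClassLabel, Dataset, Features, Value

# 1. Define the schema with Features
features = Features(
    {
        "text": Value("string"),
        "label": ClassLabel(names=["HAM", "SPAM"]),
    }
)

# 2. Build the Dataset from plain Python data (ClassLabel stores ints: 0=HAM, 1=SPAM)
dataset = Dataset.from_dict(
    {"text": ["hello world", "free money!!!"], "label": [0, 1]},
    features=features,
)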

I think we can easily support all tasks. See a dirty example here:

https://rubrix.readthedocs.io/en/master/tutorials/08-error_analysis_using_loss.html#5.-Sharing-the-dataset-in-the-Hugging-Face-Hub

I would include this in the next release.

@dvsrepo dvsrepo added type: enhancement Indicates new feature requests good first issue Indicates a good issue for first-time contributors status: help wanted Indicates that a maintainer wants help on an issue or pull request labels Oct 5, 2021
@dvsrepo dvsrepo changed the title Add support or example to share a Rubrix dataset into the Hugging Face hub Add support or example to share a Rubrix dataset in the Hugging Face hub Oct 6, 2021
@dvsrepo dvsrepo changed the title Add support or example to share a Rubrix dataset in the Hugging Face hub [datasets] Integration with hf datasets Dec 30, 2021
@frascuchon frascuchon added this to To do in Release via automation Jan 14, 2022
@frascuchon frascuchon removed good first issue Indicates a good issue for first-time contributors status: help wanted Indicates that a maintainer wants help on an issue or pull request labels Jan 14, 2022
@dvsrepo dvsrepo changed the title [datasets] Integration with hf datasets [datasets] Load Rubrix dataset as HF dataset Jan 14, 2022
@frascuchon frascuchon moved this from To do to TODO Release in Release Jan 17, 2022
@dcfidalgo (Contributor) commented
While implementing this and giving it more thorough thought, I propose extending the API a bit and basically tackling the two use cases (sharing via the HF Hub, and training with transformers) separately.

Sharing via HF Hub

This API is basically what I discussed with Paco and what I am implementing right now: rb.load gets a new argument format that is either rubrix, pandas, or datasets, and defines the return type. The problem here is that if I want to make use of the datasets.ClassLabel feature, the code becomes really bloated:

  • counting all labels in prediction/annotation
  • dealing with a lot of None values
  • casting new types to the dataset columns (see the sketch after this list)
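A rough sketch of that post-processing, assuming the loaded Dataset has a string annotation column (the column name is an assumption about the Rubrix data model):

def encode_annotations(dataset):
    # Records with a None annotation cannot be encoded as a ClassLabel
    dataset = dataset.filter(lambda r: r["annotation"] is not None)
    # Cast the string column to a ClassLabel feature (label names are inferred from the data)
    return dataset.class_encode_column("annotation")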

Writing tests, I already encountered some edge cases I hadn't thought of, and I have the feeling that there are many more (especially taking into account the other tasks, TokenClassification and Text2Text).

Also, the ClassLabel feature is particularly helpful when using the dataset for training with transformers, but if you want to train with the output of rb.load(format='datasets'), you still have to do a lot more: remove non-annotated records, deduplicate, rename columns, ...

So I think the output of rb.load(format='datasets') should be a datasets.Dataset with only the necessary transformations of the original data model, namely transforming tuples/lists with mixed types into dicts (("HAM", 0.5) -> {"label": "HAM", "score": 0.5}, or ("PER", 0, 5) -> {"label": "PER", "start": 0, "end": 5}).
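A minimal sketch of that transformation (the helper names are made up):

def prediction_to_dict(pred):
    # ("HAM", 0.5) -> {"label": "HAM", "score": 0.5}
    label, score = pred
    return {"label": label, "score": score}

def entity_to_dict(entity):
    # ("PER", 0, 5) -> {"label": "PER", "start": 0, "end": 5}
    label, start, end = entity
    return {"label": label, "start": start, "end": end}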

So the main idea of rb.load(format='datasets') would be: sharing Rubrix datasets via the HF Hub, or simply being more comfortable using a Dataset for the pre-/post-processing of the data.

dataset = rb.load(..., format='datasets')
dataset.push_to_hub("Recognai/my_dataset")
# or
dataset = dataset.map(<whatever they feel like doing today>)

Training with transformers

Here I think it would be nice to have something that really prepares your dataset for training with transformers, so that in principle you could do:

data = rb.load(...)
dataset = rb.utils.create_dataset_for_transformers(data)  # this should work for every output type of `rb.load(format=...)`
dataset = dataset.map(lambda x: tokenizer(x["text"]), batched=True)  # assuming a 'text' column
transformers.Trainer(..., train_dataset=dataset, ...)

Ideally, this would work for every task. It would simplify our fine-tune-transformers tutorial (the first one), and, currently doing the NER tutorial with Leire, I think this would be a huge help, especially for beginners.

rb.utils.create_dataset_for_transformers could help to (see the sketch after this list):

  • remove non-annotated records
  • optionally deduplicate
  • use the ClassLabel feature for the labels
  • prepare classification labels also in the case of multi-label classification problems
  • transform entities to NER tags for TokenClassification records (for Leire this would have been extremely valuable)
  • rename the columns accordingly
  • ...
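A very rough sketch of what such a helper could do for single-label text classification (nothing here is implemented yet; the column names are assumptions):

def create_dataset_for_transformers(dataset):
    # Remove non-annotated records
    dataset = dataset.filter(lambda r: r["annotation"] is not None)
    # Rename the annotation column to what transformers expects
    dataset = dataset.rename_column("annotation", "label")
    # Use the ClassLabel feature for the labels
    return dataset.class_encode_column("label")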

Not sure if rb.utils is a good place for this method, but eventually it could go into a Dataset class that Paco and I were talking about.

@dvsrepo @frascuchon Thoughts? My idea would be to provide a PR for the first use case as soon as possible, and then start working on the second one.

@dvsrepo (Member, Author) commented Jan 25, 2022

Thanks @dcfidalgo for the thorough discussion.

I understand what you describe.

(While writing I keep changing my mind, so please read this as thinking-out-loud comments, and read until the end.)

As a first thought, I think some of the things you describe could be solved properly in order to have better compatibility with the HF format. One example is obtaining the labels in the dataset (which I think could be obtained via metrics/aggregations from Elastic). For predictions, I don't think we need, or that it is safe, to use ClassLabels, as this is outside the standard HF format for text classification/token classification.

I wouldn't mind having two separate methods for (1) outputting a Dataset (to use it or to push it), and (2) preparing for training (which you could push too). My main concern is that if we don't use ClassLabel for the annotation, we're losing interoperability and things like the Dataset preview, which enable much more reuse. We would just be pushing our own format, which is not easily reusable for others outside Rubrix. In other words, if we upload a dataset to the Hub with the tag text-classification, it should have a minimal compatibility with text-classification datasets.

Is there a middle ground where we at least keep some minimal compatibility for the input/output, or is it just too messy with None or missing annotation values? My understanding is that we're going to deal with the ClassLabel step anyway in the utils method, so I want to understand if the problem is missing values in the "full" dataset or if there are other problems.

Another question: maybe your suggested approach is better for the following use case:

  1. I rb.load a dataset and push it to the Hub.
  2. I load the dataset from the Hub and log it to a new dataset.

Is your suggested approach (not changing the schema much during load) better for the above use case? It might be that if we do the ClassLabel thing, it becomes more difficult to recreate a Rubrix dataset.

So maybe your suggestion enables the following (a sketch of this flow follows the list):

  1. Rubrix format inside a Dataset, pushed to the Hub as a rubrix-dataset, which can easily be used to recreate a Rubrix dataset after loading it from the Hub.
  2. Utils for building a clean training set, which can be used to (1) ease training within Rubrix workflows, and (2) push a clean training set to the Hub that can be used by other frameworks.
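A hedged sketch of that round trip (rb.read_datasets is hypothetical here, standing in for "rebuild Rubrix records from a Dataset"):

import rubrix as rb
from datasets import load_dataset

# 1. Rubrix format inside a Dataset, pushed to the Hub
dataset = rb.load("my_dataset", format="datasets")
dataset.push_to_hub("Recognai/my_dataset")

# 2. Load the Hub copy and log it to a new Rubrix dataset
hub_copy = load_dataset("Recognai/my_dataset", split="train")
records = rb.read_datasets(hub_copy, task="TextClassification")  # hypothetical helper
rb.log(records, name="my_dataset_copy")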

@dcfidalgo (Contributor) commented Jan 25, 2022

Is there a middle ground where we at least keep some minimal compatibility for the input/output, or is it just too messy with None or missing annotation values? My understanding is that we're going to deal with the ClassLabel step anyway in the utils method, so I want to understand if the problem is missing values in the "full" dataset or if there are other problems.

My concern with mixing the two use cases is that we end up with a format that is kind of a Frankenstein and not easy to deal with in either use case:

  • for logging it to Rubrix you still need to convert ClassLabels back to strings and potentially rename columns;
  • for training you still need to do a lot more (see above).

I was definitely thinking about your described flow of

  1. I rb.load a dataset and push it to the Hub.
  2. I load the dataset from the Hub and log it to a new dataset.

and thought that maybe rb.log could even accept a string that points to a dataset on the Hub. But then I thought that sharing datasets via the Hub could also become obsolete once Rubrix Cloud arrives ...

In my opinion, if we want to share datasets for interoperability with other libraries, they should be minimal and follow the most standard format, which is often a format optimized for training.

So your last paragraph pretty much nails my point...

So maybe your suggestion enables the following:

  1. Rubrix format inside a Dataset, pushed to the Hub as a rubrix-dataset, which can easily be used to recreate a Rubrix dataset after loading it from the Hub.
  2. Utils for building a clean training set, which can be used to (1) ease training within Rubrix workflows, and (2) push a clean training set to the Hub that can be used by other frameworks.

@dvsrepo (Member, Author) commented Jan 25, 2022

OK, so on my side, go ahead and implement your suggested approach. Just as a brief note, I think interoperability with the Hub will be equally important, if not more so, once we have Rubrix Cloud. I see the Hub as the reproducibility/versioning layer (at least for non-on-prem installations) that we discussed some time ago with regard to snapshots.

@frascuchon frascuchon moved this from Planified to In progress in Release Feb 9, 2022
@frascuchon frascuchon moved this from In progress to Review in Release Feb 11, 2022
dcfidalgo pushed a commit that referenced this issue Feb 14, 2022
* feat: add Dataset class

* test: add dataset tests

* feat: add meaningful error message to TaskType

* feat: add to/from datasets for text classification

* test: add dataset fixtures

* feat: implement pandas to text classification

* feat: add token classification support

* test: add token classification tests

* test: use singlelabel_textclassification_records

* chore: small improvements

* refactor: switch to class implementation

* chore: import in init

* test: add missing tests plus minor fixes

* chore: add future warning about as_pandas

* chore: more integrations in the library

* fix: wrong import

* chore: add new test dependency

* test: add test for tasktype

* feat: ignore not supported columns

* test: add tests for read_pandas/datasets

* docs: put type hints only in description, it becomes too messy

* test: improve tests

* docs: curate docstrings

* docs: add datasets to python reference

* fix: return None's instead of empty dicts for metrics

* docs: add dataset guide

* docs: increase contrast

* Update docs/guides/datasets_in_the_client.ipynb

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>

* Update docs/guides/datasets_in_the_client.ipynb

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>

* Update docs/guides/datasets_in_the_client.ipynb

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>

* Update docs/guides/datasets_in_the_client.ipynb

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>

* Update docs/guides/datasets_in_the_client.ipynb

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>

* test: add general log/load tests for all allowed input types of rb.load

* fix: remove append

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>
@frascuchon frascuchon moved this from Review to Ready to DEV QA in Release Feb 14, 2022
dcfidalgo pushed a commit that referenced this issue Feb 16, 2022
* docs: update tutorial with new dataset feature

* docs: add admonition

* docs: rename file

* docs: add subsections
@frascuchon frascuchon moved this from Ready to DEV QA to Approved DEV QA in Release Feb 17, 2022
@frascuchon frascuchon moved this from Approved DEV QA to Ready to Release QA in Release Feb 17, 2022
frascuchon pushed a commit that referenced this issue Feb 17, 2022
(cherry picked from commit 14c8087)
frascuchon pushed a commit that referenced this issue Feb 17, 2022
(cherry picked from commit 56a4a72)
@dcfidalgo dcfidalgo moved this from Ready to Release QA to Approved Release QA in Release Feb 17, 2022
frascuchon pushed a commit that referenced this issue Feb 18, 2022
…es (#1177)

Closes #1176

* fix: empty dict -> None, they fail when pushing the dataset to the HF hub

* test: add tests

* test: extend tests

* feat: allow Nones as input for metadata

* test: add push_to_hub test

* test: add env variable to CI

(cherry picked from commit c9ec36a)
frascuchon pushed a commit that referenced this issue Feb 19, 2022
Closes #1186

* fix: prevent None to "None"

* test: add tests

(cherry picked from commit 5891d2b)