
[datasets] Load Rubrix dataset as HF dataset #422

Closed
dvsrepo opened this issue Oct 5, 2021 · 4 comments · Fixed by #1109

@dvsrepo (Member) commented Oct 5, 2021

We need to add an integration with the Hugging Face Datasets library. This will make it really easy to (1) train transformers and (2) upload datasets to the Hub, avoiding boilerplate and potential errors. Plus, it comes with really interesting features for batch processing, etc.

I see several options:

  1. Add a dataset return type for rb.load()

  2. Define a method to_hf_dataset(). Not sure what the input would be, though. Maybe the dataset name, but then we might want to add a query, etc., so maybe records, but it's kind of inefficient and ugly to do rb.load and then to_hf_dataset (both call shapes are sketched below).

Any other options?
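For illustration, a minimal sketch of the two call shapes (both the format argument and to_hf_dataset are proposals here, not an existing API):

import rubrix as rb

# Option 1: a return-type switch on rb.load (hypothetical `format` argument)
dataset = rb.load("my_dataset", format="datasets")

# Option 2: a dedicated conversion method (hypothetical, input type still to be decided)
records = rb.load("my_dataset")
dataset = to_hf_dataset(records)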

Internally this will:

  1. Use Features (ClassLabel, Value, Sequence, etc.) to define the schema of the dataset
  2. Create the Dataset via one of the Dataset.from_* constructors (see the sketch below)
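A minimal sketch of those two steps for text classification (the column and label names are made up for illustration):

from datasets import ClassLabel, Dataset, Features, Value

# 1. Define the schema with Features
features = Features(
    {
        "text": Value("string"),
        "label": ClassLabel(names=["HAM", "SPAM"]),
    }
)

# 2. Build the Dataset from plain Python data (ClassLabel stores ints: 0=HAM, 1=SPAM)
dataset = Dataset.from_dict(
    {"text": ["hello world", "free money!!!"], "label": [0, 1]},
    features=features,
)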

I think we can easily support all tasks. See a dirty example here:

https://rubrix.readthedocs.io/en/master/tutorials/08-error_analysis_using_loss.html#5.-Sharing-the-dataset-in-the-Hugging-Face-Hub

I would include this in the next release.

@dvsrepo dvsrepo added type: enhancement Indicates new feature requests good first issue Indicates a good issue for first-time contributors status: help wanted Indicates that a maintainer wants help on an issue or pull request labels Oct 5, 2021
@dvsrepo dvsrepo changed the title Add support or example to share a Rubrix dataset into the Hugging Face hub Add support or example to share a Rubrix dataset in the Hugging Face hub Oct 6, 2021
@dvsrepo dvsrepo changed the title Add support or example to share a Rubrix dataset in the Hugging Face hub [datasets] Integration with hf datasets Dec 30, 2021
@frascuchon frascuchon added this to To do in Release via automation Jan 14, 2022
@frascuchon frascuchon removed good first issue Indicates a good issue for first-time contributors status: help wanted Indicates that a maintainer wants help on an issue or pull request labels Jan 14, 2022
@dvsrepo dvsrepo changed the title [datasets] Integration with hf datasets [datasets] Load Rubrix dataset as HF dataset Jan 14, 2022
@frascuchon frascuchon moved this from To do to TODO Release in Release Jan 17, 2022
@dcfidalgo (Contributor) commented
While implementing this and giving it more thorough thought, I propose extending the API a bit and basically tackling the two use cases (sharing via the HF Hub, and training with transformers) separately.

Sharing via HF Hub

This API is basically what I discussed with Paco and what I am implementing right now: rb.load gets a new argument format that is either rubrix, pandas, or datasets, and defines the return type. The problem here is that if I want to make use of the datasets.ClassLabel feature, the code becomes really bloated:

  • counting all labels in prediction/annotation
  • dealing with a lot of None values
  • casting new types to the dataset columns (see the sketch after this list)
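A rough sketch of that post-processing, assuming the loaded Dataset has a string annotation column (the column name is an assumption about the Rubrix data model):

def encode_annotations(dataset):
    # Records with a None annotation cannot be encoded as a ClassLabel
    dataset = dataset.filter(lambda r: r["annotation"] is not None)
    # Cast the string column to a ClassLabel feature (label names are inferred from the data)
    return dataset.class_encode_column("annotation")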

Writing tests, I already encountered some edge cases I hadn't thought of, and I have the feeling that there are many more (especially taking into account the other tasks, TokenClassification and Text2Text).

Also, the ClassLabel feature is particularly helpful when using the dataset for training with transformers, but if you want to train with the output of rb.load(format='datasets'), you still have to do a lot more: remove non-annotated records, deduplicate, rename columns, ...

So I think the output of rb.load(format='datasets') should be a datasets.Dataset with only the necessary transformations of the original data model, namely transforming tuples/lists with mixed types into dicts (("HAM", 0.5) -> {"label": "HAM", "score": 0.5}, or ("PER", 0, 5) -> {"label": "PER", "start": 0, "end": 5}).
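A minimal sketch of that transformation (the helper names are made up):

def prediction_to_dict(pred):
    # ("HAM", 0.5) -> {"label": "HAM", "score": 0.5}
    label, score = pred
    return {"label": label, "score": score}

def entity_to_dict(entity):
    # ("PER", 0, 5) -> {"label": "PER", "start": 0, "end": 5}
    label, start, end = entity
    return {"label": label, "start": start, "end": end}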

So the main idea of rb.load(format='datasets') would be: sharing Rubrix datasets via the HF Hub, or simply being more comfortable using a Dataset for the pre-/post-processing of the data.

dataset = rb.load(..., format='datasets')
dataset.push_to_hub("Recognai/my_dataset")
# or
dataset = dataset.map(<whatever they feel like doing today>)

Training with transformers

Here I think it would be nice to have something that really prepares your dataset for training with transformers, so that in principle you could do:

data = rb.load(...)
dataset = rb.utils.create_dataset_for_transformers(data)  # this should work for every output type of `rb.load(format=...)`
dataset = dataset.map(lambda x: tokenizer(x["text"]), batched=True)  # assuming a 'text' column
transformers.Trainer(..., train_dataset=dataset, ...)

Ideally, this would work for every task. It would simplify our fine-tune-transformers tutorial (the first one), and, currently doing the NER tutorial with Leire, I think this would be a huge help, especially for beginners.

rb.utils.create_dataset_for_transformers could help to (see the sketch after this list):

  • remove non-annotated records
  • optionally deduplicate
  • use the ClassLabel feature for the labels
  • prepare classification labels also in the case of multi-label classification problems
  • transform entities to NER tags for TokenClassification records (for Leire this would have been extremely valuable)
  • rename the columns accordingly
  • ...
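A very rough sketch of what such a helper could do for single-label text classification (nothing here is implemented yet; the column names are assumptions):

def create_dataset_for_transformers(dataset):
    # Remove non-annotated records
    dataset = dataset.filter(lambda r: r["annotation"] is not None)
    # Rename the annotation column to what transformers expects
    dataset = dataset.rename_column("annotation", "label")
    # Use the ClassLabel feature for the labels
    return dataset.class_encode_column("label")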

Not sure if rb.utils is a good place for this method, but eventually it could go into a Dataset class that Paco and I were talking about.

@dvsrepo @frascuchon Thoughts? My idea would be to provide a PR for the first use case as soon as possible, and then start working on the second one.

@dvsrepo (Member, Author) commented Jan 25, 2022

Thanks @dcfidalgo for the thorough discussion.

I understand what you describe.

(While writing I keep changing my mind, so please read this as thinking-out-loud comments, and read until the end.)

As a first thought, I think some of the things you describe could be solved properly in order to have better compatibility with the HF format. One example is obtaining the labels in the dataset (which I think could be obtained via metrics/aggregations from Elastic). For predictions, I don't think we need, or that it is safe, to use ClassLabels, as this is outside the standard HF format for text classification/token classification.

I wouldn't mind having two separate methods for (1) outputting a Dataset (to use it or to push it), and (2) preparing for training (which you could push too). My main concern is that if we don't use ClassLabel for the annotation, we're losing interoperability and things like the Dataset preview, which enable much more reuse. We would just be pushing our own format, which is not easily reusable for others outside Rubrix. In other words, if we upload a dataset to the Hub with the tag text-classification, it should have a minimal compatibility with text-classification datasets.

Is there a middle ground where we at least keep some minimal compatibility for the input/output, or is it just too messy with None or missing annotation values? My understanding is that we're going to deal with the ClassLabel step anyway in the utils method, so I want to understand if the problem is missing values in the "full" dataset or if there are other problems.

Another question: maybe your suggested approach is better for the following use case:

  1. I rb.load a dataset and push it to the Hub.
  2. I load the dataset from the Hub and log it to a new dataset.

Is your suggested approach (not changing the schema much during load) better for the above use case? It might be that if we do the ClassLabel thing, it becomes more difficult to recreate a Rubrix dataset.

So maybe your suggestion enables the following (a sketch of this flow follows the list):

  1. Rubrix format inside a Dataset, pushed to the Hub as a rubrix-dataset, which can easily be used to recreate a Rubrix dataset after loading it from the Hub.
  2. Utils for building a clean training set, which can be used to (1) ease training within Rubrix workflows, and (2) push a clean training set to the Hub that can be used by other frameworks.
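A hedged sketch of that round trip (rb.read_datasets is hypothetical here, standing in for "rebuild Rubrix records from a Dataset"):

import rubrix as rb
from datasets import load_dataset

# 1. Rubrix format inside a Dataset, pushed to the Hub
dataset = rb.load("my_dataset", format="datasets")
dataset.push_to_hub("Recognai/my_dataset")

# 2. Load the Hub copy and log it to a new Rubrix dataset
hub_copy = load_dataset("Recognai/my_dataset", split="train")
records = rb.read_datasets(hub_copy, task="TextClassification")  # hypothetical helper
rb.log(records, name="my_dataset_copy")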

@dcfidalgo (Contributor) commented Jan 25, 2022

Is there a middle ground where we at least keep some minimal compatibility for the input/output, or is it just too messy with None or missing annotation values? My understanding is that we're going to deal with the ClassLabel step anyway in the utils method, so I want to understand if the problem is missing values in the "full" dataset or if there are other problems.

My concern with mixing the two use cases is that we end up with a format that is kind of a Frankenstein and not easy to deal with in either use case:

  • for logging it to Rubrix you still need to convert ClassLabels back to strings and potentially rename columns;
  • for training you still need to do a lot more (see above).

I was definitely thinking about your described flow of

  1. I rb.load a dataset and push it to the Hub.
  2. I load the dataset from the Hub and log it to a new dataset.

and thought that maybe rb.log could even accept a string that points to a dataset on the Hub. But then I thought that sharing datasets via the Hub could also become obsolete once Rubrix Cloud arrives ...

In my opinion, if we want to share datasets for interoperability with other libraries, they should be minimal and follow the most standard format, which is often a format optimized for training.

So your last paragraph pretty much nails my point...

So maybe your suggestion enables the following:

  1. Rubrix format inside a Dataset, pushed to the Hub as a rubrix-dataset, which can easily be used to recreate a Rubrix dataset after loading it from the Hub.
  2. Utils for building a clean training set, which can be used to (1) ease training within Rubrix workflows, and (2) push a clean training set to the Hub that can be used by other frameworks.

@dvsrepo (Member, Author) commented Jan 25, 2022

OK, so on my side, go ahead and implement your suggested approach. Just as a brief note, I think interoperability with the Hub will be equally important, if not more so, once we have Rubrix Cloud. I see the Hub as the reproducibility/versioning layer (at least for non-on-prem installations) that we discussed some time ago with regard to snapshots.

@frascuchon frascuchon moved this from Planified to In progress in Release Feb 9, 2022
@frascuchon frascuchon moved this from In progress to Review in Release Feb 11, 2022
dcfidalgo pushed a commit that referenced this issue Feb 14, 2022
* feat: add Dataset class

* test: add dataset tests

* feat: add meaningful error message to TaskType

* feat: add to/from datasets for text classification

* test: add dataset fixtures

* feat: implement pandas to text classification

* feat: add token classification support

* test: add token classification tests

* test: use singlelabel_textclassification_records

* chore: small improvements

* refactor: switch to class implementation

* chore: import in init

* test: add missing tests plus minor fixes

* chore: add future warning about as_pandas

* chore: more integrations in the library

* fix: wrong import

* chore: add new test dependency

* test: add test for tasktype

* feat: ignore not supported columns

* test: add tests for read_pandas/datasets

* docs: put type hints only in description, it becomes too messy

* test: improve tests

* docs: curate docstrings

* docs: add datasets to python reference

* fix: return None's instead of empty dicts for metrics

* docs: add dataset guide

* docs: increase contrast

* Update docs/guides/datasets_in_the_client.ipynb

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>

* Update docs/guides/datasets_in_the_client.ipynb

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>

* Update docs/guides/datasets_in_the_client.ipynb

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>

* Update docs/guides/datasets_in_the_client.ipynb

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>

* Update docs/guides/datasets_in_the_client.ipynb

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>

* test: add general log/load tests for all allowed input types of rb.load

* fix: remove append

Co-authored-by: Daniel Vila Suero <daniel@recogn.ai>
@frascuchon frascuchon moved this from Review to Ready to DEV QA in Release Feb 14, 2022
dcfidalgo pushed a commit that referenced this issue Feb 16, 2022
* docs: update tutorial with new dataset feature

* docs: add admonition

* docs: rename file

* docs: add subsections
@frascuchon frascuchon moved this from Ready to DEV QA to Approved DEV QA in Release Feb 17, 2022
@frascuchon frascuchon moved this from Approved DEV QA to Ready to Release QA in Release Feb 17, 2022
frascuchon pushed a commit that referenced this issue Feb 17, 2022
(cherry picked from commit 14c8087)
frascuchon pushed a commit that referenced this issue Feb 17, 2022
(cherry picked from commit 56a4a72)
@dcfidalgo dcfidalgo moved this from Ready to Release QA to Approved Release QA in Release Feb 17, 2022
frascuchon pushed a commit that referenced this issue Feb 18, 2022
…es (#1177)

Closes #1176

* fix: empty dict -> None, they fail when pushing the dataset to the HF hub

* test: add tests

* test: extend tests

* feat: allow Nones as input for metadata

* test: add push_to_hub test

* test: add env variable to CI

(cherry picked from commit c9ec36a)
frascuchon pushed a commit that referenced this issue Feb 19, 2022
Closes #1186

* fix: prevent None to "None"

* test: add tests

(cherry picked from commit 5891d2b)