feat(datasets): simplify load flow from hf datasets with no rb format #1234

Merged
merged 10 commits into from
Mar 14, 2022

frascuchon
Member

@frascuchon frascuchon commented Mar 8, 2022

With this PR you can convert hf datasets to rubrix datasets as follows:

import datasets
import rubrix as rb
amazon_reviews = datasets.load_dataset("amazon_reviews_multi", "es", split="train")
rb_ds = rb.DatasetForTextClassification.from_datasets(
    amazon_reviews, 
    inputs=["review_title","review_body"], 
    annotation="stars"
)
# Here you can just log the dataset
rb.log(rb_ds, name="amazon-reviews")

We can define a similar approach for metadata, for the text field in the other tasks, or for tags in token classification.
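
As a rough illustration of what such a column mapping does, the sketch below turns plain row dicts into record-like dicts. The function name `rows_to_records` and its `metadata` parameter are hypothetical and only stand in for the idea discussed here; this is not the rubrix implementation.

```python
from typing import Any, Dict, Iterable, List, Optional, Sequence


def rows_to_records(
    rows: Iterable[Dict[str, Any]],
    inputs: Sequence[str],
    annotation: Optional[str] = None,
    metadata: Optional[Sequence[str]] = None,
) -> List[Dict[str, Any]]:
    """Map dataset columns onto record fields (hypothetical helper)."""
    records = []
    for row in rows:
        # Every column listed in `inputs` becomes one named input of the record.
        record: Dict[str, Any] = {"inputs": {col: row[col] for col in inputs}}
        if annotation is not None:
            record["annotation"] = row[annotation]
        if metadata:
            record["metadata"] = {col: row[col] for col in metadata}
        records.append(record)
    return records


rows = [{"review_title": "Bien", "review_body": "Funciona", "stars": 4, "language": "es"}]
records = rows_to_records(
    rows,
    inputs=["review_title", "review_body"],
    annotation="stars",
    metadata=["language"],
)
```

The same shape would extend to other tasks by swapping `inputs` for a single `text` column, or for `tokens` and `tags` columns.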

@dcfidalgo @dvsrepo any suggestions?

TODO:

  • Include tests for text2text
  • Fix failing tests once hf bug is fixed:
import datasets
ds = datasets.Dataset.from_dict({ "a": [None, [1,2], None, [3,4]]})
ds.map(lambda x: { "b": "b"}).to_pandas() # None values in "a" become empty lists

See this PR huggingface/datasets#3676
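
Until the upstream behavior changes, one defensive option is to treat empty list-like values in optional columns as missing when reading rows back. A minimal sketch, assuming a hypothetical helper `empty_to_none` working on plain Python rows; it cannot tell a genuinely empty list from a converted None, so it only fits columns where an empty list carries no meaning.

```python
from collections.abc import Sized
from typing import Any, Dict, Iterable


def empty_to_none(row: Dict[str, Any], optional_columns: Iterable[str]) -> Dict[str, Any]:
    """Replace empty list-like values in the given columns with None (hypothetical helper)."""
    optional = set(optional_columns)
    cleaned = {}
    for key, value in row.items():
        is_empty_container = (
            key in optional
            and isinstance(value, Sized)
            and not isinstance(value, (str, bytes))  # keep empty strings untouched
            and len(value) == 0
        )
        cleaned[key] = None if is_empty_container else value
    return cleaned


row = {"a": [], "b": "b", "c": [1, 2], "d": ""}
cleaned = empty_to_none(row, optional_columns=["a", "c", "d"])
```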

@codecov

codecov bot commented Mar 8, 2022

Codecov Report

Merging #1234 (52421b5) into master (0935db7) will decrease coverage by 0.38%.
The diff coverage is 76.31%.

@@            Coverage Diff             @@
##           master    #1234      +/-   ##
==========================================
- Coverage   94.68%   94.29%   -0.39%     
==========================================
  Files         127      127              
  Lines        5470     5573     +103     
==========================================
+ Hits         5179     5255      +76     
- Misses        291      318      +27     
Flag Coverage Δ
pytest 94.29% <76.31%> (-0.39%) ⬇️

Impacted Files Coverage Δ
src/rubrix/client/models.py 88.88% <48.38%> (-8.01%) ⬇️
src/rubrix/client/datasets.py 95.46% <86.74%> (-2.74%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7bf10d4...52421b5.

@dvsrepo
Member

dvsrepo commented Mar 9, 2022

This is looking good!

For text it should be straightforward; for metadata, there are some datasets which contain JSON objects in some fields.

For token classification we need at least tokens and tags and then generate stand-off annotations with that, right?
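
A minimal sketch of that conversion, assuming IOB2-style tags (`B-`, `I-`, `O`) and a text rebuilt by joining tokens with single spaces. The function name `bio_to_spans` and the `(label, start, end)` tuple layout are illustrative assumptions, not the rubrix API.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (label, char_start, char_end), end exclusive


def bio_to_spans(tokens: List[str], tags: List[str]) -> Tuple[str, List[Span]]:
    """Build stand-off character spans from parallel token and tag lists."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Span] = []
    current: Optional[list] = None  # [label, start, end] of the entity being built
    cursor = 0  # character offset of the current token in the joined text

    for token, tag in zip(tokens, tags):
        start, end = cursor, cursor + len(token)
        cursor = end + 1  # skip the single space assumed between tokens
        prefix, _, label = tag.partition("-")

        # "I-X" right after an open X entity extends that entity.
        if prefix == "I" and current is not None and current[0] == label:
            current[2] = end
            continue
        # Any other tag closes the open entity, if there is one.
        if current is not None:
            spans.append((current[0], current[1], current[2]))
            current = None
        # "B-X" opens an entity; a stray "I-X" is treated as an opening tag too.
        if prefix in ("B", "I") and label:
            current = [label, start, end]

    if current is not None:
        spans.append((current[0], current[1], current[2]))
    return " ".join(tokens), spans


text, spans = bio_to_spans(
    ["Ada", "Lovelace", "visited", "London"],
    ["B-PER", "I-PER", "O", "B-LOC"],
)
```

Each span indexes into the rebuilt text, so `text[start:end]` gives back the entity surface form.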

@dcfidalgo
Contributor

Nice shortcut, will definitely be helpful. Just as a side note, for token classification datasets on the Hub, often only tokens are provided, so we must reconstruct the text input for the TokenClassificationRecords.

@frascuchon
Member Author

Nice shortcut, will definitely be helpful. Just as a side note, for token classification datasets on the Hub, often only tokens are provided, so we must reconstruct the text input for the TokenClassificationRecords.

Yes, we can rebuild the text field from the tokens field by assuming a single whitespace between tokens.
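
A small sketch of that reconstruction, assuming exactly one space between tokens. The helper name `text_from_tokens` is hypothetical; it also returns each token's character offsets so that tags can later be mapped to char-level spans.

```python
from typing import List, Tuple


def text_from_tokens(tokens: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Join tokens with single spaces and return the text plus per-token offsets."""
    offsets = []
    cursor = 0
    for token in tokens:
        offsets.append((cursor, cursor + len(token)))
        cursor += len(token) + 1  # one separating space after each token
    return " ".join(tokens), offsets


text, offsets = text_from_tokens(["Hello", ",", "world", "!"])
```

The reconstruction is lossy: the example yields "Hello , world !" rather than the "Hello, world!" a detokenizer would produce, but the offsets stay consistent with the rebuilt text, which is what span annotations need.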

@frascuchon frascuchon changed the title feat(datasets): simplify load flow from hf datasetswith no rb format feat(datasets): simplify load flow from hf datasets with no rb format Mar 9, 2022
@frascuchon frascuchon marked this pull request as ready for review March 9, 2022 23:24
@frascuchon
Member Author

For token classification we need at least tokens and tags and then generate stand-off annotations with that, right

Agreed, although we could also move that logic into the creation of a token classification record:

record = TokenClassificationRecord(tokens=..., tags=...)

@frascuchon frascuchon added this to In progress in Release via automation Mar 14, 2022
@frascuchon frascuchon self-assigned this Mar 14, 2022
@frascuchon frascuchon force-pushed the refactor/easy-hf-dataset-mappings branch from 384411a to 4f1b89c Compare March 14, 2022 09:49
@frascuchon
Member Author

frascuchon commented Mar 14, 2022

I've included a workaround to fix the test. The token classification model will try to build the text field from the tokens if no char-level span is provided.

But in general, operating with None annotation/prediction lists could still raise errors related to the issue huggingface/datasets#3676.

Anyway, we can go ahead with this feature even if the issue is not resolved, since most hf datasets will provide annotations.

dcfidalgo
dcfidalgo previously approved these changes Mar 14, 2022
Contributor

@dcfidalgo dcfidalgo left a comment

💯 Very nice!

Release automation moved this from In progress to Review Mar 14, 2022
@frascuchon frascuchon merged commit a64476b into master Mar 14, 2022
Release automation moved this from Review to Ready to DEV QA Mar 14, 2022
@frascuchon frascuchon deleted the refactor/easy-hf-dataset-mappings branch March 14, 2022 22:12
@frascuchon frascuchon moved this from Ready to DEV QA to Ready to Release QA in Release Mar 25, 2022
frascuchon added a commit that referenced this pull request Mar 25, 2022
…#1234)

* fix: optional search_keywords

* feat(datasets): simplify load flow from hf datasetswith no rb format

* feat(token-class): allow create record with tags list

* feat: mapping shortcut

* chore: adjust datasets mappings

* chore: better messages

* feat: parse shorcut for text2text

* test: skip dataset

* refactor: build text from tokens if possible for NER records

* test: fix tests

(cherry picked from commit a64476b)
@frascuchon frascuchon moved this from Ready to Release QA to Approved Release QA in Release Mar 28, 2022
frascuchon added a commit that referenced this pull request Mar 28, 2022
…#1234)

* fix: optional search_keywords

* feat(datasets): simplify load flow from hf datasetswith no rb format

* feat(token-class): allow create record with tags list

* feat: mapping shortcut

* chore: adjust datasets mappings

* chore: better messages

* feat: parse shorcut for text2text

* test: skip dataset

* refactor: build text from tokens if possible for NER records

* test: fix tests

(cherry picked from commit a64476b)

fix(ner): build entities from tags (#1327)

* fix(ner): parse ner tags

- Parse entities from `U` tags
- Assume `I` or `L` as start token if not found before

* test: add missing tests

(cherry picked from commit aac62fc)
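
The two rules in the commit message above can be sketched roughly as follows. This is an illustrative reading, not the code from #1327: the function name `bilou_to_token_spans` is hypothetical, spans are expressed as token index ranges, and emitting entities that never reach an `L` tag is an extra assumption of this sketch.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (label, first_token, last_token_exclusive)


def bilou_to_token_spans(tags: List[str]) -> List[Span]:
    """Group BILOU tags into token ranges, tolerating missing B tags."""
    spans: List[Span] = []
    label: Optional[str] = None  # label of the entity currently open, if any
    start = 0  # index of the first token of the open entity

    def close(end: int) -> None:
        nonlocal label
        if label is not None:
            spans.append((label, start, end))
            label = None

    for i, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        if tag_label and prefix in ("B", "U"):
            close(i)
            label, start = tag_label, i
            if prefix == "U":  # single-token entity
                close(i + 1)
        elif tag_label and prefix in ("I", "L"):
            if label != tag_label:
                # No matching start seen before: treat this token as the start.
                close(i)
                label, start = tag_label, i
            if prefix == "L":
                close(i + 1)
        else:  # "O" or anything unrecognized ends the open entity
            close(i)
    close(len(tags))
    return spans


spans = bilou_to_token_spans(["B-PER", "L-PER", "O", "U-LOC", "I-ORG", "L-ORG"])
```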
frascuchon added a commit that referenced this pull request Mar 30, 2022