feat(datasets): simplify load flow from hf datasets with no rb format #1234

frascuchon · 2022-03-08T16:00:11Z

With this PR you can convert hf datasets to rubrix datasets as follows:

import datasets
import rubrix as rb

amazon_reviews = datasets.load_dataset("amazon_reviews_multi", "es", split="train")
rb_ds = rb.DatasetForTextClassification.from_datasets(
    amazon_reviews,
    inputs=["review_title", "review_body"],
    annotation="stars",
)
# Here you can just log the dataset
rb.log(rb_ds, name="amazon-reviews")

We can define a similar approach for metadata, for text in the other tasks, or for tags in token classification.
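
To make the column mapping concrete, here is a minimal sketch in plain Python of what the `inputs`/`annotation` shortcut amounts to. It is not the rubrix implementation: the helper name `rows_to_records`, the dict-shaped records, and the choice to stringify labels are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List


def rows_to_records(
    rows: Iterable[Dict[str, Any]],
    inputs: List[str],
    annotation: str,
) -> List[Dict[str, Any]]:
    """Map plain dataset rows to text-classification-like records (hypothetical helper)."""
    records = []
    for row in rows:
        label = row.get(annotation)
        records.append(
            {
                # Keep only the columns selected as model inputs.
                "inputs": {column: row[column] for column in inputs},
                # Assumption: labels are stored as strings; a missing label stays None.
                "annotation": None if label is None else str(label),
            }
        )
    return records


rows = [
    {"review_title": "Great", "review_body": "Works as expected", "stars": 5},
    {"review_title": "Bad", "review_body": "Broke after a week", "stars": None},
]
records = rows_to_records(rows, inputs=["review_title", "review_body"], annotation="stars")
```

The same pattern would extend to a `metadata` or `text` argument by selecting other columns from each row.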

@dcfidalgo @dvsrepo any suggestion?

TODO:

Include tests for text2text
Fix failing tests once hf bug is fixed:

import datasets

ds = datasets.Dataset.from_dict({"a": [None, [1, 2], None, [3, 4]]})
ds.map(lambda x: {"b": "b"}).to_pandas()  # None values in "a" become empty lists

See this PR huggingface/datasets#3676

codecov · 2022-03-08T16:14:29Z

Codecov Report

Merging #1234 (52421b5) into master (0935db7) will decrease coverage by 0.38%.
The diff coverage is 76.31%.

@@            Coverage Diff             @@
##           master    #1234      +/-   ##
==========================================
- Coverage   94.68%   94.29%   -0.39%     
==========================================
  Files         127      127              
  Lines        5470     5573     +103     
==========================================
+ Hits         5179     5255      +76     
- Misses        291      318      +27

Flag	Coverage Δ
pytest	`94.29% <76.31%> (-0.39%)`	⬇️

Impacted Files	Coverage Δ
src/rubrix/client/models.py	`88.88% <48.38%> (-8.01%)`	⬇️
src/rubrix/client/datasets.py	`95.46% <86.74%> (-2.74%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7bf10d4...52421b5.

dvsrepo · 2022-03-09T09:53:59Z

This is looking good!

For text it should be straightforward. For metadata, some datasets contain JSON objects in some fields.

For token classification we need at least tokens and tags and then generate stand-off annotations with that, right?
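
As a rough illustration of that step, the sketch below turns a tag sequence into token-level entity spans. It assumes BIO-style tags such as `B-PER`/`I-PER`/`O`, which the thread does not specify, and the function name `bio_tags_to_spans` is hypothetical rather than part of rubrix.

```python
from typing import List, Optional, Tuple


def bio_tags_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (label, start_token, end_token_exclusive) for each entity in a BIO sequence."""
    spans: List[Tuple[str, int, int]] = []
    label: Optional[str] = None
    start = 0
    for index, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        # An I- tag with the same label continues the currently open entity.
        continues = prefix == "I" and tag_label == label
        if label is not None and not continues:
            spans.append((label, start, index))
            label = None
        # Open a new entity on B-, or on a stray I- when nothing is open.
        if prefix in ("B", "I") and label is None:
            label, start = tag_label, index
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


spans = bio_tags_to_spans(["B-PER", "I-PER", "O", "B-LOC", "O"])
```

Token-level spans like these still need token character offsets before they become stand-off annotations over the text.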

dcfidalgo · 2022-03-09T10:57:39Z

Nice shortcut, will definitely be helpful. Just as a side note, for token classification datasets on the Hub, often only tokens are provided, so we must reconstruct the text input for the TokenClassificationRecords.

frascuchon · 2022-03-09T12:38:15Z

Nice shortcut, will definitely be helpful. Just as a side note, for token classification datasets on the Hub, often only tokens are provided, so we must reconstruct the text input for the TokenClassificationRecords.

Yes, we can rebuild the text field from the tokens field by assuming white space between tokens.
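
A minimal sketch of that idea, assuming exactly one space between consecutive tokens; the helper name `text_from_tokens` is hypothetical and this is not necessarily how the record model does it. It also returns each token's character offsets, which is what char-level spans would be built from.

```python
from typing import List, Tuple


def text_from_tokens(tokens: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Join tokens with single spaces and return the text plus (start, end) offsets per token."""
    offsets = []
    cursor = 0
    for token in tokens:
        offsets.append((cursor, cursor + len(token)))
        # Advance past the token and the single separating space.
        cursor += len(token) + 1
    return " ".join(tokens), offsets


tokens = ["Paris", "is", "nice", "."]
text, offsets = text_from_tokens(tokens)
```

Note that the rebuilt text will not match the original wherever the source had no space between tokens, for example before punctuation.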

frascuchon · 2022-03-09T23:27:07Z

For token classification we need at least tokens and tags and then generate stand-off annotations with that, right

Agreed. I would also move that logic into the creation of a token classification record:

record = TokenClassificationRecord(tokens=..., tags=...)

frascuchon · 2022-03-14T11:23:42Z

I've included a workaround to fix the test. The token classification model will try to build the text field from tokens if no char-level span is provided.

In general, though, working with annotation/prediction lists that are None can still raise errors because of huggingface/datasets#3676.

Anyway, we can go ahead with this feature even if that issue is not resolved, since most hf datasets will provide annotations.

dcfidalgo

💯 Very nice!

…#1234)
* fix: optional search_keywords
* feat(datasets): simplify load flow from hf datasetswith no rb format
* feat(token-class): allow create record with tags list
* feat: mapping shortcut
* chore: adjust datasets mappings
* chore: better messages
* feat: parse shorcut for text2text
* test: skip dataset
* refactor: build text from tokens if possible for NER records
* test: fix tests
(cherry picked from commit a64476b)

…#1234)
* fix: optional search_keywords
* feat(datasets): simplify load flow from hf datasetswith no rb format
* feat(token-class): allow create record with tags list
* feat: mapping shortcut
* chore: adjust datasets mappings
* chore: better messages
* feat: parse shorcut for text2text
* test: skip dataset
* refactor: build text from tokens if possible for NER records
* test: fix tests
(cherry picked from commit a64476b)
fix(ner): build entities from tags (#1327)
* fix(ner): parse ner tags
  - Parse entities from `U` tags
  - Assume `I` or `L` as start token if not found before
* test: add missing tests
(cherry picked from commit aac62fc)

frascuchon requested a review from dcfidalgo March 8, 2022 16:00

frascuchon changed the title ~~feat(datasets): simplify load flow from hf datasetswith no rb format~~ feat(datasets): simplify load flow from hf datasets with no rb format Mar 9, 2022

frascuchon marked this pull request as ready for review March 9, 2022 23:24

frascuchon added this to In progress in Release via automation Mar 14, 2022

frascuchon self-assigned this Mar 14, 2022

frascuchon added 8 commits March 14, 2022 10:49

fix: optional search_keywords

0940a76

feat(datasets): simplify load flow from hf datasetswith no rb format

4a27a8b

feat(token-class): allow create record with tags list

6996140

feat: mapping shortcut

2dad7d9

chore: adjust datasets mappings

34f226f

chore: better messages

6c64791

feat: parse shorcut for text2text

b2ebc34

test: skip dataset

4f1b89c

frascuchon force-pushed the refactor/easy-hf-dataset-mappings branch from 384411a to 4f1b89c Compare March 14, 2022 09:49

refactor: build text from tokens if possible for NER records

88be31e

dcfidalgo previously approved these changes Mar 14, 2022

Release automation moved this from In progress to Review Mar 14, 2022

test: fix tests

52421b5

frascuchon dismissed dcfidalgo’s stale review via 52421b5 March 14, 2022 15:04

frascuchon merged commit a64476b into master Mar 14, 2022

Release automation moved this from Review to Ready to DEV QA Mar 14, 2022

frascuchon deleted the refactor/easy-hf-dataset-mappings branch March 14, 2022 22:12

frascuchon moved this from Ready to DEV QA to Ready to Release QA in Release Mar 25, 2022

frascuchon moved this from Ready to Release QA to Approved Release QA in Release Mar 28, 2022
