feat(datasets): simplify load flow from hf datasets with no rb format #1234
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1234 +/- ##
==========================================
- Coverage 94.68% 94.29% -0.39%
==========================================
Files 127 127
Lines 5470 5573 +103
==========================================
+ Hits 5179 5255 +76
- Misses 291 318 +27
This is looking good! For token classification we need at least
Nice shortcut, it will definitely be helpful. Just as a side note: for token classification datasets on the Hub, often only the tokens are provided, so we must reconstruct the text field.
Yes, we can recompose the text field from the tokens field by assuming white spaces between tokens.
Agree with this. Although it would also mean taking that logic into the creation of a token classification record: `record = TokenClassificationRecord(tokens=..., tags=...)`
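The recomposition idea above can be sketched as follows. This is an illustrative snippet, not the code in this PR: `text_from_tokens` is a hypothetical helper that joins tokens with single spaces (the assumption discussed here) and also returns the char-level offsets a token classification record would need.

```python
from typing import List, Tuple

def text_from_tokens(tokens: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Rebuild a text string from tokens, assuming a single white space
    between consecutive tokens, and return char-level (start, end)
    offsets for each token."""
    text, offsets, pos = "", [], 0
    for i, token in enumerate(tokens):
        if i > 0:
            text += " "
            pos += 1
        offsets.append((pos, pos + len(token)))
        text += token
        pos += len(token)
    return text, offsets

text, offsets = text_from_tokens(["Hello", "world", "!"])
# text == "Hello world !", offsets == [(0, 5), (6, 11), (12, 13)]
```

Note the caveat from the discussion: whitespace joining is only an approximation of the original text, so offsets computed this way may not match the source document exactly.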
384411a to 4f1b89c
I've included a workaround to fix the test: the token classification model will try to build the text field from tokens if no char-level span is provided. Anyway, we can go ahead with this feature even if the issue is not resolved, since most HF datasets will provide annotations.
💯 Very nice!
…#1234)
* fix: optional search_keywords
* feat(datasets): simplify load flow from hf datasets with no rb format
* feat(token-class): allow create record with tags list
* feat: mapping shortcut
* chore: adjust datasets mappings
* chore: better messages
* feat: parse shortcut for text2text
* test: skip dataset
* refactor: build text from tokens if possible for NER records
* test: fix tests
(cherry picked from commit a64476b)
…#1234)
* fix: optional search_keywords
* feat(datasets): simplify load flow from hf datasets with no rb format
* feat(token-class): allow create record with tags list
* feat: mapping shortcut
* chore: adjust datasets mappings
* chore: better messages
* feat: parse shortcut for text2text
* test: skip dataset
* refactor: build text from tokens if possible for NER records
* test: fix tests
(cherry picked from commit a64476b)

fix(ner): build entities from tags (#1327)
* fix(ner): parse ner tags
  - Parse entities from `U` tags
  - Assume `I` or `L` as start token if not found before
* test: add missing tests
(cherry picked from commit aac62fc)
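The `fix(ner)` commit above describes parsing entities from BILOU-style tags, including single-token `U` tags and treating a dangling `I`/`L` as the start of a new entity. A rough sketch of that lenient parsing, not the actual Rubrix implementation, could look like this:

```python
from typing import List, Tuple

def tags_to_entities(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Parse BILOU-style NER tags into (label, start_token, end_token) spans.
    - 'U-<label>' yields a single-token entity.
    - An 'I-' or 'L-' tag with no open entity (or a label mismatch) starts
      a new entity, mirroring the lenient behavior in the commit message."""
    entities, start, label = [], None, None

    def close(end):
        nonlocal start, label
        if start is not None:
            entities.append((label, start, end))
        start, label = None, None

    for i, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "O":
            close(i)
        elif prefix == "U":
            close(i)
            entities.append((tag_label, i, i + 1))
        elif prefix == "B" or start is None or tag_label != label:
            close(i)
            start, label = i, tag_label
            if prefix == "L":
                close(i + 1)
        elif prefix == "L":
            close(i + 1)
        # a plain 'I' continuing the current entity needs no action
    close(len(tags))
    return entities

tags_to_entities(["B-PER", "I-PER", "O", "U-LOC", "I-ORG"])
# → [("PER", 0, 2), ("LOC", 3, 4), ("ORG", 4, 5)]
```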
With this PR you can convert HF datasets to Rubrix datasets as follows:
We can define a similar approach for `metadata` or `text` for the other tasks, or `tags` for token classification. @dcfidalgo @dvsrepo any suggestion?
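The mapping shortcut being discussed could work roughly like the following sketch. This is an assumption about the idea, not the PR's code: `map_columns` and the choice to collect unmapped columns under `metadata` are hypothetical.

```python
def map_columns(row: dict, mapping: dict) -> dict:
    """Rename HF dataset columns to record fields via a mapping such as
    {"sentence": "text", "ner_tags": "tags"}. Columns not in the mapping
    are collected under "metadata" (an illustrative design choice)."""
    record = {"metadata": {}}
    for column, value in row.items():
        if column in mapping:
            record[mapping[column]] = value
        else:
            record["metadata"][column] = value
    return record

map_columns({"sentence": "Hello", "source": "conll"}, {"sentence": "text"})
# → {"metadata": {"source": "conll"}, "text": "Hello"}
```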
TODO:
See this PR huggingface/datasets#3676