Feat: Add SpanMarker Argilla Trainer for NER #2693
…tps://github.com/argilla-io/argilla into feat/span_marker_trainer
Required for SpanMarker evaluation
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## feat/2658-add-argilla-training-module-for-openai #2693 +/- ##
====================================================================================
+ Coverage 92.26% 92.34% +0.08%
====================================================================================
Files 171 172 +1
Lines 9029 9124 +95
====================================================================================
+ Hits 8331 8426 +95
Misses 698 698
... and 1 file with indirect coverage changes
Failing test originates from the upstream: #2691
Co-authored-by: David Berenstein <david.m.berenstein@gmail.com>
Merged commit f891a47 into argilla-io:feat/2658-add-argilla-training-module-for-openai
## [1.7.0](v1.6.0...v1.7.0)

### Added

- add `max_retries` and `num_threads` parameters to `rg.log` to run data logging request concurrently with backoff retry policy. See [#2458](#2458) and [#2533](#2533)
- `rg.load` accepts `include_vectors` and `include_metrics` when loading data. Closes [#2398](#2398)
- Added `settings` param to `prepare_for_training` ([#2689](#2689))
- Added `prepare_for_training` for `openai` ([#2658](#2658))
- Added `ArgillaOpenAITrainer` ([#2659](#2659))
- Added `ArgillaSpanMarkerTrainer` for Named Entity Recognition ([#2693](#2693))
- Added `ArgillaTrainer` CLI support. Closes ([#2809](#2809))

### Changed

- Argilla quickstart image dependencies are externalized into `quickstart.requirements.txt`. See [#2666](#2666)
- bulk endpoints will upsert data when record `id` is present. Closes [#2535](#2535)
- moved from `click` to `typer` CLI support. Closes ([#2815](#2815))
- Argilla server docker image is built with PostgreSQL support. Closes [#2686](#2686)
- The `rg.log` computes all batches and raise an error for all failed batches.
- The default batch size for `rg.log` is now 100.

### Fixed

- `argilla.training` bugfixes and unification ([#2665](#2665))
- Resolved several small bugs in the `ArgillaTrainer`.

### Deprecated

- The `rg.log_async` function is deprecated and will be removed in next minor release.
Hello!
Pull Request overview
Details
The SpanMarker Argilla trainer is based on the Transformers Trainer, as SpanMarker is tightly implemented on top of transformers. However, we don't need to do the tokenization, data collation or evaluation on the Argilla side, unlike with Transformers. This makes the SpanMarker Argilla trainer relatively small.
Usage
First, we need an annotated dataset:
And then we can use the new Trainer to train with this dataset:
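As a rough illustration of this step, here is a minimal sketch rather than the snippet from this PR. It assumes that the Argilla client is already configured, that a token classification dataset with the given name exists, and that `ArgillaTrainer.update_config` forwards these keys to SpanMarker and the Transformers training arguments. The helper names `memory_friendly_overrides` and `train_span_marker`, the batch size of 4 and `train_size=0.8` are hypothetical choices made for the example.

```python
from typing import Any, Dict


def memory_friendly_overrides(bf16_supported: bool, low_memory: bool) -> Dict[str, Any]:
    """Build config overrides that mirror the memory tips in this section."""
    overrides: Dict[str, Any] = {
        # Enable exactly one mixed-precision mode: bf16 when the hardware
        # supports it, fp16 otherwise.
        "bf16": bf16_supported,
        "fp16": not bf16_supported,
    }
    if low_memory:
        # Illustrative values: a smaller batch and a shorter maximum length.
        overrides["per_device_train_batch_size"] = 4
        overrides["model_max_length"] = 256
    return overrides


def train_span_marker(dataset_name: str, output_dir: str, **overrides: Any) -> None:
    """Train on an annotated Argilla dataset (hypothetical wrapper)."""
    # Imported lazily so this module can be loaded without Argilla installed.
    from argilla.training import ArgillaTrainer

    # Assumption: "span_marker" is the framework name registered by this PR.
    trainer = ArgillaTrainer(name=dataset_name, framework="span_marker", train_size=0.8)
    trainer.update_config(**overrides)
    trainer.train(output_dir=output_dir)
```

A call might then look like `train_span_marker("my_ner_dataset", "tmp_span_marker", **memory_friendly_overrides(bf16_supported=False, low_memory=True))`, where the dataset name and output directory are placeholders.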
(You can use lower batch sizes or `model_max_length=256` if you have memory issues. You can also use `fp16` instead of `bf16` if you get an error.)
This produces the following logs:
Click to see the logs
In short, I trained to 0.939 eval F1 on CoNLL03 in 5 minutes.
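For readers unfamiliar with the metric, the sketch below shows how an entity-level F1 score is commonly computed for NER, assuming exact-match scoring in which a predicted span counts only if its boundaries and label both match a gold span. Whether the evaluation in this PR applies exactly this definition is an assumption, and the spans below are invented toy data, not CoNLL03 output.

```python
from typing import Set, Tuple

# (sentence index, start token, end token, label)
Span = Tuple[int, int, int, str]


def entity_f1(gold: Set[Span], predicted: Set[Span]) -> float:
    """Exact-match entity-level F1 over sets of labelled spans."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data: the second prediction has the right boundaries but the wrong label.
gold = {(0, 0, 2, "PER"), (0, 5, 6, "LOC"), (1, 3, 4, "ORG")}
predicted = {(0, 0, 2, "PER"), (0, 5, 6, "ORG"), (1, 3, 4, "ORG")}
score = entity_f1(gold, predicted)  # 2 of 3 spans match, so F1 is 2/3
```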
Type of change
How Has This Been Tested
Tests still need to be written. I'll be working on this, but I'm publishing the PR as a draft so that it's already available for review.