# chore: bump version to 0.14.0 (#334)
**Coverage Report** — Files without new missing coverage: 264 files skipped due to complete coverage. Coverage success: total of 98.00% is above 97.98% 🎉
Force-pushed from `e52545f` to `adb0c54`.
## Changelog

### Added
- `edsnlp.package` command
- `edsnlp.load` now correctly takes `disable`, `enable` and `exclude` parameters into account
- `python -m edsnlp.evaluate` script to evaluate a model on a dataset
- `eds.split` pipe to split a document into multiple documents based on a splitting pattern (useful for training)
- The `converter` argument of `edsnlp.data.read/from_...` can now be a list of converters instead of a single converter
- `edsnlp.train` script and API
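For example, the load-time pipe selection above can be used as follows. This is a minimal sketch: the model and pipe names are placeholders, and the comments assume the usual spaCy `disable`/`exclude` semantics.

```python
import edsnlp

# "my_model" is a placeholder for any installed EDS-NLP model package.
nlp = edsnlp.load(
    "my_model",
    disable=["eds.negation"],    # loaded, but skipped when the pipeline runs
    exclude=["eds.hypothesis"],  # not loaded at all
)
```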
### Changed
- `eds.span_context_getter`'s parameter `context_sents` is no longer optional and must be explicitly set to 0 to disable sentence context
- The `batch_size` argument of `Pipeline` is deprecated and is not used anymore. Use the `batch_size` argument of `stream.map_pipeline` instead.
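Migrating off the deprecated `Pipeline` batch size might look like the sketch below, assuming `edsnlp.data.from_iterable` accepts raw texts; the model name and batch size are illustrative.

```python
import edsnlp

nlp = edsnlp.load("my_model")  # placeholder model name

# Before (deprecated): the batch size was set on the Pipeline itself.
# After: set it where the pipeline is mapped onto a stream.
texts = ["Patient admis pour douleur thoracique.", "Pas de fièvre."]
docs = list(
    edsnlp.data.from_iterable(texts).map_pipeline(nlp, batch_size=32)
)
```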
### Fixed
- Fixed worker shutdown in the `multiprocessing` backend. This prevents some executions from hanging indefinitely at the end of the processing.
- The torch tensor sharing strategy used by the workers of the `multiprocessing` backend can now be configured. This is useful when the system is running out of file descriptors and `ulimit -n` is not an option. The torch sharing strategy can also be set via the `TORCH_SHARING_STRATEGY` environment variable (default is `file_descriptor`; consider using `file_system` if you encounter issues).
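Since `TORCH_SHARING_STRATEGY` is read from the environment, the workaround can be applied without touching the pipeline code; a minimal sketch:

```python
import os

# Must be set before the multiprocessing workers are spawned.
# Default is "file_descriptor"; "file_system" helps when the process
# runs out of file descriptors and `ulimit -n` cannot be raised.
os.environ["TORCH_SHARING_STRATEGY"] = "file_system"
```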
### Data API changes
- `LazyCollection` objects are now called `Stream` objects
- By default, the `multiprocessing` backend now preserves the order of the input data. To disable this and improve performance, use `deterministic=False` in the `set_processing` method (see the combined sketch after this list)
- 🚀 Parallelized GPU inference throughput improvements!
- The `.map_batches`, `.map_pipeline` and `.map_gpu` methods now support a specific `batch_size` and batching function, instead of having a single batch size for all pipes
- Readers now have a `loop` parameter to cycle over the data indefinitely (useful for training)
- Readers now have a `shuffle` parameter to shuffle the data before iterating over it
- In `multiprocessing` mode, file-based readers now read the data in the workers (this was an option before)
- We now support two new special batch sizes, `"fragment"` (for parquet datasets, the rows of a full parquet fragment per batch) and `"dataset"` (all the rows of the dataset in a single batch). These are also compatible with batched writers such as parquet, where each input fragment can be processed and mapped to a single matching output fragment.
- 💥 Breaking change: a `map` function returning a list or a generator won't be automatically flattened anymore. Use `flatten()` to flatten the output if needed. This shouldn't change the behavior for most users, since most writers (`to_pandas`, `to_polars`, `to_parquet`, ...) still flatten the output
- 💥 Breaking change: the `chunk_size` and `sort_chunks` parameters are now deprecated: to sort data before applying a transformation, use `.map_batches(custom_sort_fn, batch_size=...)`
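A combined sketch of the stream API changes above. The file name, the `"omop"` converter choice, the shuffle unit and the batch size are illustrative assumptions; `loop`, `shuffle`, per-stage `batch_size`, `flatten()` and `deterministic` are the knobs named in this list.

```python
import edsnlp

nlp = edsnlp.load("my_model")  # placeholder model name

stream = edsnlp.data.read_parquet(
    "notes.parquet",
    converter="omop",
    shuffle="dataset",  # shuffle before iterating (unit is an assumption)
    loop=True,          # cycle over the data indefinitely
)

# Each map_* stage can now take its own batch size.
stream = stream.map_pipeline(nlp, batch_size=32)

# map() output is no longer flattened automatically: be explicit.
stream = stream.map(lambda doc: list(doc.ents)).flatten()

# Input order is preserved by default; trade determinism for throughput.
stream = stream.set_processing(backend="multiprocessing", deterministic=False)

for entity in stream:
    print(entity)
    break  # loop=True cycles forever, so stop after the first item
```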
### Training API changes
- We now provide a training script, `python -m edsnlp.train --config config.cfg`, that should fit many use cases. Check out the docs!
- In particular, we do not require PyTorch's `DataLoader` for training and can rely solely on the EDS-NLP stream/data API, which is better suited for large streamable datasets and dynamic preprocessing (i.e., a different result each time we apply a noised preprocessing op on a sample).
- Each trainable component can now provide a `stats` field in its `preprocess` output to log info about the sample (number of words, tokens, spans, ...); see the sketch after this list
- Support for multi-GPU training via Hugging Face `accelerate`, with the EDS-NLP `Stream` API taking the `WORLD_SIZE` and `LOCAL_RANK` environment variables into account
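What such a `stats` field might look like: a hypothetical sketch of a trainable component's `preprocess` output, where only the `stats` key itself comes from the notes above and the other keys are illustrative.

```python
from typing import Any, Dict

from spacy.tokens import Doc


def preprocess(doc: Doc) -> Dict[str, Any]:
    """Illustrative `preprocess` output for one training sample."""
    words = [token.text for token in doc]
    return {
        "words": words,
        "spans": [(span.start, span.end, span.label_) for span in doc.ents],
        # `stats` entries describe the sample and can be logged during training
        "stats": {
            "n_words": len(words),
            "n_spans": len(doc.ents),
        },
    }
```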