0.2.2
Major Changes
Parallelization of Data Preprocessing 🚀
Data preprocessing via the Processor is now fast while maintaining a low memory footprint. Previously, parallelization via multiprocessing caused serious memory issues on larger datasets (e.g. for language model fine-tuning). Now we run each small chunk through the whole processor (-> Samples -> Featurization -> Dataset ...). Multiprocessing is handled by the DataSilo, which simplifies the implementation.
With this new approach we can still easily inspect & debug all important transformations for a chunk, but only keep the resulting dataset in memory once a process has finished with a chunk.
Multilabel classification
We now also support multilabel classification. Prepare your data by simply setting multilabel=True in the TextClassificationProcessor and use the new MultiLabelTextClassificationHead for your model.
=> See an example here
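Conceptually, multilabel classification means each document can carry several labels at once. A common way to encode that (a minimal sketch of the idea, not FARM's internal code) is a multi-hot vector over the label list:

```python
def multi_hot(labels, label_list):
    # Encode a set of active labels as a 0/1 vector over the full label list,
    # so a single document can activate several classes at once.
    return [1 if label in labels else 0 for label in label_list]

label_list = ["urgent", "billing", "tech_support"]
print(multi_hot(["urgent", "tech_support"], label_list))  # [1, 0, 1]
```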
Concept of Tasks
To further simplify multi-task learning we added the concept of "tasks". With this you can now use one TextClassificationProcessor
to preprocess data for multiple tasks (e.g. using two columns in your CSV for classification).
Example:
- Add the tasks to the Processor:
processor = TextClassificationProcessor(...)
news_categories = ["Sports", "Tech", "Politics", "Business", "Society"]
publisher = ["cnn", "nytimes","wsj"]
processor.add_task(name="category", label_list=news_categories, metric="acc", label_column_name="category_label")
processor.add_task(name="publisher", label_list=publisher, metric="acc", label_column_name="publisher_label")
- Link the data to the right PredictionHead by supplying the task name at initialization:
category_head = MultiLabelTextClassificationHead(layer_dims=[768, 5], task_name="category")
publisher_head = MultiLabelTextClassificationHead(layer_dims=[768, 3], task_name="publisher")
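Conceptually, each task name routes its label column of a CSV row to the matching head. A hypothetical, framework-free sketch of that routing (names illustrative, not FARM internals):

```python
# Each task maps one CSV column to the labels for one prediction head.
tasks = {
    "category":  {"label_column_name": "category_label"},
    "publisher": {"label_column_name": "publisher_label"},
}

def labels_per_task(row, tasks):
    # Pull each task's label out of a single CSV row by its column name.
    return {name: row[cfg["label_column_name"]] for name, cfg in tasks.items()}

row = {"text": "Apple unveils new iPhone",
       "category_label": "Tech",
       "publisher_label": "cnn"}
print(labels_per_task(row, tasks))  # {'category': 'Tech', 'publisher': 'cnn'}
```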
Update to transformers 2.0
We are happy to see how Hugging Face's repository is growing and how they took another major step with the new 2.0 release. Since their collection of language models is awesome, we will continue building upon their language models and tokenizers. However, we will keep following a different philosophy for all other components (data processing, training, inference, deployment ...) to improve usability, allow multi-task learning, and simplify usage in industry.
Modelling:
- [enhancement] Add Multilabel Classification (#89)
- [enhancement] Add PredictionHead for Regression task (#50)
- [enhancement] Introduce concept of "tasks" to support of multitask training using multiple heads of the same type (e.g. for multiple text classification tasks) (#75)
- [enhancement] Update dependency to transformers 2.0 (#106)
- [bug] Fix TypeError: classification_report() got an unexpected keyword argument 'target_names' (#93)
- [bug] Fix issue with class weights (#82)
Data Handling:
- [enhancement] Chunkwise multiprocessing to reduce memory footprint in preprocessing large datasets (#88)
- [bug] Threading Error upon building Data Silo (#90)
- [bug] Multiprocessing causes data preprocessing to crash (#110)
- [bug] Multiprocessing Error with PyTorch Version 1.2.0 (#97)
- [bug] Windows fixes (#109)
Inference:
- [enhancement] Remove excessive uncalled-for warnings when using the Inferencer (#104)
- [enhancement] Get probability distribution over all classes in Inference mode (#102)
- [enhancement] Add InferenceProcessor (#72)
- [bug] Fix classification report bug with binary doc classification
Other:
- [enhancement] Add more tests (#108)
- [enhancement] Do logging within run_experiment() (#37)
- [enhancement] Improved logging (#82, #87, #105)
- [bug] Fix custom vocab for bert-base-cased (#108)
Thanks to all contributors: @tripl3a, @busyxin, @AhmedIdr, @jinnerbichler, @Timoeller, @tanaysoni, @brandenchan, @tholor
👩‍🌾 Happy FARMing!