
0.2.2


Major Changes

Parallelization of Data Preprocessing 🚀

Data preprocessing via the Processor is now fast while maintaining a low memory footprint. Previously, parallelization via multiprocessing caused serious memory issues on larger datasets (e.g. for language model fine-tuning). Now we run one small chunk at a time through the whole processor (-> Samples -> Featurization -> Dataset ...). The multiprocessing is handled by the DataSilo, which simplifies the implementation.

With this new approach we can still easily inspect & debug all important transformations for a chunk, but we only keep the resulting dataset in memory once a process has finished its chunk.
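
For illustration, here is a minimal, self-contained sketch of that chunking pattern in plain Python (not FARM's internal code; the pipeline steps are stand-ins):

    from multiprocessing import Pool

    def process_chunk(chunk):
        # stand-ins for FARM's dicts -> Samples -> Featurization -> Dataset steps
        samples = [text.lower() for text in chunk]       # "Samples"
        features = [len(s.split()) for s in samples]     # "Featurization"
        return features                                  # per-chunk "Dataset"

    def make_chunks(items, size):
        # yield consecutive slices of the input list
        for i in range(0, len(items), size):
            yield items[i:i + size]

    if __name__ == "__main__":
        raw_texts = ["document number %d" % i for i in range(10000)]
        with Pool(processes=4) as pool:
            per_chunk = pool.map(process_chunk, make_chunks(raw_texts, 1000))
        # only the small per-chunk results are kept; concatenate them at the end
        dataset = [f for chunk in per_chunk for f in chunk]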

Multilabel classification

We now also support multilabel classification. Prepare your data by simply setting multilabel=True in the TextClassificationProcessor and use the new MultiLabelTextClassificationHead for your model.
=> See an example here
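
A hedged sketch of the setup (multilabel=True and MultiLabelTextClassificationHead come from this release; the tokenizer choice, label names, paths and remaining argument values are illustrative assumptions, so check the FARM examples for the full signatures):

    from transformers import BertTokenizer
    from farm.data_handler.processor import TextClassificationProcessor
    from farm.modeling.prediction_head import MultiLabelTextClassificationHead

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    labels = ["toxic", "insult", "threat"]            # hypothetical label set

    processor = TextClassificationProcessor(
        tokenizer=tokenizer,
        max_seq_len=128,
        data_dir="data/my_multilabel_task",           # hypothetical path
        label_list=labels,
        metric="acc",
        multilabel=True,                              # switch preprocessing to multilabel
    )

    # one output unit per label; labels are predicted independently of each other
    prediction_head = MultiLabelTextClassificationHead(layer_dims=[768, len(labels)])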

Concept of Tasks

To further simplify multi-task learning we added the concept of "tasks". With this you can now use one TextClassificationProcessor to preprocess data for multiple tasks (e.g. using two columns in your CSV for classification).
Example:

  1. Add the tasks to the Processor:
    processor = TextClassificationProcessor(...)

    news_categories = ["Sports", "Tech", "Politics", "Business", "Society"]
    publisher = ["cnn", "nytimes","wsj"]

    processor.add_task(name="category", label_list=news_categories, metric="acc", label_column_name="category_label")
    processor.add_task(name="publisher", label_list=publisher, metric="acc", label_column_name="publisher_label")
  2. Link the data to the right PredictionHead by supplying the task name at initialization:
    category_head = MultiLabelTextClassificationHead(layer_dims=[768, 5], task_name="category")
    publisher_head = MultiLabelTextClassificationHead(layer_dims=[768, 3], task_name="publisher")
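
To complete the picture, here is a hedged sketch of wiring both heads into a single model (assuming FARM's AdaptiveModel and Bert wrappers; the argument values are illustrative):

    from farm.modeling.adaptive_model import AdaptiveModel
    from farm.modeling.language_model import Bert

    language_model = Bert.load("bert-base-cased")
    model = AdaptiveModel(
        language_model=language_model,
        prediction_heads=[category_head, publisher_head],
        embeds_dropout_prob=0.1,
        lm_output_types=["per_sequence", "per_sequence"],  # one entry per head
        device="cpu",
    )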

Update to transformers 2.0

We are happy to see how Hugging Face's repository is growing and how they made another major step with the new 2.0 release. Since their collection of language models is awesome, we will continue building upon their language models and tokenizers. However, we will keep following a different philosophy for all other components (data processing, training, inference, deployment ...) to improve usability, allow multi-task learning and simplify usage in industry.


Modelling:

  • [enhancement] Add Multilabel Classification (#89)
  • [enhancement] Add PredictionHead for Regression task (#50)
  • [enhancement] Introduce concept of "tasks" to support multitask training using multiple heads of the same type (e.g. for multiple text classification tasks) (#75)
  • [enhancement] Update dependency to transformers 2.0 (#106)
  • [bug] TypeError: classification_report() got an unexpected keyword argument 'target_names' (#93)
  • [bug] Fix issue with class weights (#82)

Data Handling:

  • [enhancement] Chunkwise multiprocessing to reduce memory footprint in preprocessing large datasets (#88)
  • [bug] Threading Error upon building Data Silo (#90)
  • [bug] Multiprocessing causes data preprocessing to crash (#110, #102)
  • [bug] Multiprocessing Error with PyTorch Version 1.2.0 (#97)
  • [bug] Windows fixes (#109)

Inference:

  • [enhancement] Excessive uncalled-for warnings when using the Inferencer (#104)
  • [enhancement] Get probability distribution over all classes in inference mode (#102); a hedged usage sketch follows this list
  • [enhancement] Add InferenceProcessor (#72)
  • [bug] Fix classification report bug with binary doc classification
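
Related to the probability-distribution enhancement above, a hedged sketch of running inference (constructor and method names varied across early FARM versions, so treat this as illustrative and check the repository's inference examples):

    from farm.infer import Inferencer

    # load a model directory produced by a FARM training run (hypothetical path)
    model = Inferencer.load("saved_models/my_model")

    basic_texts = [{"text": "FARM 0.2.2 adds multilabel classification."}]
    result = model.inference_from_dicts(dicts=basic_texts)
    # with #102, predictions carry scores over all classes, not just the top one
    print(result)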

Other:

  • [enhancement] Add more tests (#108)
  • [enhancement] Do logging within run_experiment() (#37)
  • [enhancement] Improved logging (#82, #87, #105)
  • [bug] fix custom vocab for bert-base-cased (#108)

Thanks to all contributors: @tripl3a, @busyxin, @AhmedIdr, @jinnerbichler, @Timoeller, @tanaysoni, @brandenchan, @tholor

👩‍🌾 Happy FARMing!