
data pipeline

The data pipeline provides the main functionalities: data packing, data sampling (datasets), augmentation and data loading. This part is crucial for training efficiency, because the speed of data loading is the bottleneck in many training cases. To this end, our data pipeline has the following key features:

  • adapts easily to any annotation format: just write your own annotation parser
  • does not rely on any training framework such as PyTorch or MXNet, which makes porting easy
  • supports multiple dataset types for highly efficient data access
  • adopts libjpeg-turbo for very fast JPEG decoding
  • uses multi-threading to take full advantage of multi-core CPUs

Annotation Parser

Different datasets come with different annotation formats. COCO-style is the convention, so you usually have to convert your annotations to COCO-style. Here, you don't have to do that. A parser is the interpreter that understands an annotation format and generates samples for datasets. In general, you write your own parser inherited from the base parser. We have implemented three parsers, for WIDERFACE, TT100K and COCO, for your reference.
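
The sketch below only illustrates the general idea of turning a custom annotation file into sample dicts; the class name, method name and the 'bboxes' key are assumptions for illustration, so follow the real base parser and the existing WIDERFACE/TT100K/COCO parsers in the code for the actual interface.

```python
# A minimal parser sketch (illustrative only; follow the real base parser in the repo).
import os

class MyAnnotationParser:
    def __init__(self, annotation_file, image_root):
        self._annotation_file = annotation_file
        self._image_root = image_root

    def generate_sample(self):
        # One line per object: "image_name,x1,y1,x2,y2" -- replace with your own format.
        with open(self._annotation_file) as f:
            for line in f:
                image_name, x1, y1, x2, y2 = line.strip().split(',')
                sample = dict()
                sample['image_path'] = os.path.join(self._image_root, image_name)
                sample['bboxes'] = [[float(x1), float(y1), float(x2), float(y2)]]
                yield sample
```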

Dataset

Our dataset behaves much like the Dataset class in PyTorch. Generally, there are three dataset types:

  • disk-based dataset
  • memory-based compressed dataset
  • memory-based uncompressed dataset

disk-based dataset

This kind of dataset only keeps image paths for subsequent data access. To obtain the final image, two steps are performed: 1) read the raw compressed bytes into memory (disk I/O); 2) decode the bytes to recover the final image. Obviously, the whole process is time-consuming. The advantage is that this kind of dataset is memory-friendly.

How to build a disk-based dataset: make sure that the key 'image_path' is set in the samples generated by your parser. See tt100k_parser.py for an example.
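
As a concrete illustration of the rule above (only 'image_path' matters here; the 'bboxes' key is an assumption standing in for whatever labels your parser produces):

```python
# Disk-based sample: only the path is stored; reading and decoding happen later.
sample = dict()
sample['image_path'] = '/path/to/images/000001.jpg'
sample['bboxes'] = [[10.0, 20.0, 110.0, 220.0]]  # labels shown for completeness
```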

memory-based compressed dataset

This kind of dataset keeps the compressed data bytes of each sample. To recover the final image, only one step is needed: decompress the bytes (in most cases, the image is compressed in JPEG format). Disk I/O is avoided, so the efficiency is much better than that of the disk-based dataset. However, you have to make sure that the memory can hold all the compressed bytes of the dataset.

How to build a memory-based compressed dataset: make sure that the key 'image_bytes' is filled in the samples generated by your parser. See widerface_parser.py for an example.
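
A rough illustration of what the parser does differently in this case (again, only 'image_bytes' is the required key):

```python
# Memory-based compressed sample: the raw JPEG bytes are read once up front,
# so only decoding remains at training time.
with open('/path/to/images/000001.jpg', 'rb') as f:
    image_bytes = f.read()

sample = dict()
sample['image_bytes'] = image_bytes
```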

memory-based uncompressed dataset

This kind of dataset stores the original decoded images, so disk I/O and decompression are both avoided. But you need a very large amount of memory to hold all the data, which is infeasible in most cases.

How to build a memory-based uncompressed dataset: make sure that the key 'image' is filled in all samples generated by your parser.
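
For comparison, a sketch of the uncompressed case; OpenCV is used here purely for illustration and is not necessarily the decoding backend used in the repo:

```python
# Memory-based uncompressed sample: the image is decoded up front and the
# resulting array is stored directly in the sample.
import cv2

image = cv2.imread('/path/to/images/000001.jpg', cv2.IMREAD_COLOR)  # H x W x 3 uint8

sample = dict()
sample['image'] = image
```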

In summary, the first two kinds of dataset are used most often. Of course, you can create a hybrid dataset to suit your needs; all you have to do is write a proper parser to generate the samples.

Packing Dataset

Once you have finished the parser, the next step is to pack your dataset. Packing is very simple; just read the code for reference.
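
Conceptually, packing just runs the parser over the whole dataset and serializes the resulting samples. The sketch below is only that concept; the parser class comes from the earlier sketch and the pickle format is an assumption, not the repo's actual packing code:

```python
# Conceptual packing sketch: collect all samples from the parser and serialize them.
import pickle

parser = MyAnnotationParser('annotations.txt', '/path/to/images')  # sketch class from above
samples = list(parser.generate_sample())

with open('packed_dataset.pkl', 'wb') as f:
    pickle.dump(samples, f)
```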

Sampler

In the data pipeline, samplers are crucial. We have two types of samplers: dataset sampler and region sampler.

dataset sampler

A dataset sampler decides the order in which samples are sent to the training process. Its function is easy to understand, so please check the code for reference. If you have more complex ideas, you can implement a new dataset sampler.
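
As a simple illustration, a shuffling dataset sampler could look like the sketch below; the interface is an assumption, so check the actual sampler classes in the code:

```python
# A minimal dataset sampler sketch: yield sample indices in shuffled order.
import random

class RandomDatasetSampler:
    def __init__(self, num_samples, shuffle=True):
        self._indices = list(range(num_samples))
        self._shuffle = shuffle

    def __iter__(self):
        if self._shuffle:
            random.shuffle(self._indices)
        for index in self._indices:
            yield index
```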

region sampler

A region sampler determines which part of the image is cropped for use. This operation has a great impact on the data distribution, and an appropriate region sampler can boost performance noticeably. Some may interpret region samplers as a form of data augmentation, and that's right. Here, we combine region samplers with dataset samplers to build a more logical data pipeline. We have implemented some region samplers for your reference. In most cases, RandomBBoxCropRegionSampler is enough. Of course, you can also write your own region samplers.
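
To make the idea concrete, here is a heavily simplified sketch in the spirit of a random-crop region sampler; it only picks a random fixed-size window, while the actual RandomBBoxCropRegionSampler in the code presumably also takes the ground-truth boxes into account, and the interface below is an assumption:

```python
# A toy region sampler: return an (x, y, w, h) crop region at a random location.
import random

class SimpleRandomCropRegionSampler:
    def __init__(self, crop_size):
        self._crop_size = crop_size

    def sample(self, image_height, image_width):
        crop_w = min(self._crop_size, image_width)
        crop_h = min(self._crop_size, image_height)
        x = random.randint(0, image_width - crop_w)
        y = random.randint(0, image_height - crop_h)
        return x, y, crop_w, crop_h
```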

Augmentation

We rely purely on albumentations for data augmentation. In our code, some simple augmentations are employed, such as horizontal flip and image normalization. Compose these augmentation operators to form your augmentation pipeline. Here, we only provide two sets of pipelines: COCO train/val pipelines and WIDERFACE train/val pipelines. In TT100K_augmentation_pipeline.py, we create an augmentation pipeline for TT100K in an independent .py file; it also serves as an example of how to write your own augmentation pipeline.
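
For example, a small training pipeline composed with albumentations might look like this; the operators and parameters below are only illustrative, the ones actually used are in the pipelines in the code:

```python
# Compose a simple albumentations pipeline with bbox support.
import albumentations as A

train_pipeline = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ],
    bbox_params=A.BboxParams(format='pascal_voc', label_fields=['category_ids'])
)

# Usage: transformed = train_pipeline(image=image, bboxes=bboxes, category_ids=category_ids)
```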

Data Loader

We implement a multi-threading data loader for fast batch producing. (In fact, we tried multi-processing and found it slower than multi-threading.) The data loader combines all the modules above to prepare data batches. num_workers must be set to a proper value, say 4 to 10, according to the number of physical CPU cores.
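
The snippet below is only a toy illustration of the multi-threading idea using Python's ThreadPoolExecutor; the real data loader in the code wires the dataset, samplers and augmentation pipeline together and is more elaborate:

```python
# Toy multi-threaded batch producer.
from concurrent.futures import ThreadPoolExecutor

def load_one(index):
    # read / decode / augment a single sample here
    return index  # placeholder for the prepared sample

def make_batches(indices, batch_size, num_workers=4):
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        batch = []
        for sample in pool.map(load_one, indices):
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        # note: an incomplete last batch is simply dropped in this toy version
```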

Tips

  • You can put any information that you need during the whole training process into sample objects by adding key-value pairs, as in the small example below. Be aware that Sample has some reserved keys.
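
A tiny illustration (the extra keys below are made up; the reserved keys include at least 'image_path', 'image_bytes' and 'image' mentioned above, so check the Sample implementation for the full list):

```python
# Attach any extra information you need during training as ordinary key-value pairs.
sample['difficulty'] = 'hard'        # made-up key, used by your own training logic
sample['source_camera'] = 'cam_01'   # made-up key
```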