Image decoder for data_pipeline #107

am831 · 2023-10-17T22:42:49Z

What does this PR do? Please describe:
Implements image_decoder for loading image datasets as part of the data_pipeline, using the libpng and libjpeg libraries. PNG and JPEG decoding is exposed through the ImageDecoder class.
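
As a rough illustration of how a single entry point can serve both formats, the two file types can be told apart from their leading bytes. The sketch below is not taken from this PR's C++ implementation; `sniff_image_format` is a hypothetical helper, and only the PNG signature and the JPEG start-of-image marker are standard facts.

```python
from typing import Optional

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# JPEG streams start with the SOI marker (FF D8) followed by another marker (FF ..).
JPEG_SOI = b"\xff\xd8\xff"


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return "png", "jpeg", or None based on the leading bytes (hypothetical helper)."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SOI):
        return "jpeg"
    return None
```

A check like this only inspects the header; a truncated or corrupted file can still pass it and fail later during decoding.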

You can use the following script to test it manually:

from dataclasses import dataclass
from pathlib import Path

import torch

from fairseq2.data import FileMapper
from fairseq2.data.data_pipeline import DataPipeline
from fairseq2.data.image import ImageDecoder
from fairseq2.data.text import StrSplitter, read_text
from fairseq2.typing import DataType, Device



@dataclass
class DataContext:
    data_file: Path
    """The pathname of the test TSV data file."""

    img_field: str
    """The string field corresponding to the relative path of the image file."""

    img_root_dir: Path
    """The pathname of the directory under which image files are stored."""

    device: Device
    """The device on which to run inference."""

    dtype: DataType
    """The dtype with which to run inference."""


def build_data_pipeline(ctx: DataContext) -> DataPipeline:
    # TODO: This will soon be auto-tuned. Right now it is hand-tuned for devfair.
    n_parallel = 4

    # Open TSV, skip the header line, split into fields, and return three fields
    # only.
    split_tsv = StrSplitter(
        # We assume the tsv file has these 3 fields.
        names=["id", ctx.img_field, "raw_target_text"], indices=[0, 1, 2]
    )

    pipeline_builder = read_text(ctx.data_file, rtrim=True).skip(1).map(split_tsv)

    # Memory map image files and cache up to 10 files.
    map_file = FileMapper(root_dir=ctx.img_root_dir, cached_fd_count=10)

    pipeline_builder.map(map_file, selector=ctx.img_field, num_parallel_calls=n_parallel)

    # Decode mmap'ed image using libpng or libjpeg.
    decode_img = ImageDecoder()

    pipeline_builder.map(
        [decode_img],
        selector=f"{ctx.img_field}.data",
        num_parallel_calls=n_parallel,
    )

    # Build and return the data pipeline.
    return pipeline_builder.and_return()


def run_pipeline(ctx: DataContext) -> None:
    """Iterate through the specified TSV file and print each decoded example."""
    # Build a simple pipeline that just reads a single TSV file.
    pipeline = build_data_pipeline(ctx)

    # Iterate through each example in the TSV file until CTRL-C.
    for example in pipeline:
        print(example)


if __name__ == "__main__":
    # fmt: off
    ctx = DataContext(
        # TODO: Update these three fields.
        data_file=Path("<path_to_tsv_file_with_relative_img_paths>"),
        img_field="img_file",
        img_root_dir=Path("<path_to_parent_directory_of_img_files>"),
        device=torch.device("cpu"),
        dtype=torch.float32,
    )
    # fmt: on

    run_pipeline(ctx)
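
To fill in the two placeholder paths above, a small throwaway dataset can be generated with the standard library alone. This is a minimal sketch, not part of the PR: `make_png` and `write_test_dataset` are hypothetical helpers, the file names and caption text are made up, and the TSV layout (one header line, then `id`, `img_file`, `raw_target_text` columns) is assumed from the script above.

```python
import struct
import tempfile
import zlib
from pathlib import Path


def _png_chunk(kind: bytes, payload: bytes) -> bytes:
    """Frame one PNG chunk: length, type, payload, CRC-32 over type + payload."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def make_png(width: int, height: int, rgb: tuple) -> bytes:
    """Encode a solid-color, 8-bit RGB, non-interlaced PNG (hypothetical helper)."""
    # IHDR: width, height, bit depth 8, color type 2 (RGB), then
    # compression method 0, filter method 0, interlace method 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Every scanline starts with a filter-type byte; 0 means "no filter".
    row = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(row * height)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", idat)
        + _png_chunk(b"IEND", b"")
    )


def write_test_dataset(root: Path, n_images: int = 3) -> Path:
    """Write n_images tiny PNGs under root/images plus a TSV that lists them."""
    img_dir = root / "images"
    img_dir.mkdir(parents=True, exist_ok=True)
    # Header line; the script above skips it with .skip(1).
    lines = ["id\timg_file\traw_target_text"]
    for i in range(n_images):
        name = f"img_{i}.png"
        (img_dir / name).write_bytes(make_png(4, 4, (40 * i, 80, 160)))
        lines.append(f"{i}\t{name}\tplaceholder caption {i}")
    tsv = root / "data.tsv"
    tsv.write_text("\n".join(lines) + "\n")
    return tsv


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    tsv_path = write_test_dataset(root)
    print("data_file    =", tsv_path)
    print("img_root_dir =", root / "images")
```

The printed paths can then be pasted into `data_file` and `img_root_dir`. Only PNG files are produced here; exercising the libjpeg path would need real JPEG files.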

Check list:

Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
Did you read the contributor guideline?
Did you make sure that your PR does only one thing instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests?
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

cbalioglu

LGTM!

am831 and others added 20 commits October 9, 2023 15:36

image decoder header file and constructor

0f5a8cf

Merge branch 'facebookresearch:main' into img_processing

8932880

png.cc and png.h

8bc7f63

Merge branch 'facebookresearch:main' into img_processing

4cef991

Merge branch 'img_processing' of https://github.com/am831/fairseq2 in…

d92ea92

…to img_processing

Merge branch 'facebookresearch:main' into img_processing

6b94093

remove png files and import libpng methods

a499b18

image decoder

794c34a

Merge branch 'img_processing' of https://github.com/am831/fairseq2 in…

ca13683

…to img_processing

python binding

f751cc8

png decoder progress and py binding

1a7b52a

build py binding

87b02c7

update init

10bcc14

Merge branch 'facebookresearch:main' into img_processing

d6ecdbb

png decoder progress

5e36386

png decoder progress

3f4dc22

remove print statements

59f6799

populate tensor object with png data

9ab69a3

remove unnecessary include

2cfeca3

Merge branch 'facebookresearch:main' into img_processing

23b0243

facebook-github-bot added the CLA Signed label Oct 17, 2023

am831 and others added 9 commits October 18, 2023 09:55

Merge branch 'facebookresearch:main' into img_processing

8a48b9b

Merge branch 'facebookresearch:main' into img_processing

8d12ed8

more efficient use of memory for populating tensor

740366a

Merge branch 'facebookresearch:main' into img_processing

e3c9635

Merge branch 'img_processing' of https://github.com/am831/fairseq2 in…

3853c64

…to img_processing

use get_raw_mutable_storage instead

1fc8e45

add png to build configs, support float32 tensor

18f3c8e

fix lint

4b59f2d

fix lint

c6cdb4b

am831 added 18 commits December 6, 2023 10:09

change package name

9f14881

pass non const ptr to libjpeg

ddbde92

change package name

cc34f61

fix typo

8ae6b3a

download libjpeg 1.5 and use const data_ptr

41ec5ac

libjpeg 1.5

3c2105d

fix typo

40faa0f

add init

781689b

use setjmp/longjmp for error handling

c8c1c63

new classes for resource management

98f33e3

include new classes in cmakelists

fa325d0

fix include order

df235cb

fix double free and seg fault

3c011f7

unit test for corrupted images

3d88336

more descriptive test name

715a59b

fix lint

a435ac6

fix lint

d14936d

clang tidy

ca2ac55

cbalioglu force-pushed the img_processing branch 6 times, most recently from e3c7a37 to 246c321 Compare December 6, 2023 20:51

Update CMake

14aa70a

cbalioglu force-pushed the img_processing branch from 246c321 to 14aa70a Compare December 6, 2023 20:58

am831 added 2 commits December 6, 2023 17:57

Merge branch 'img_processing' of https://github.com/am831/fairseq2 in…

41c0f90

…to img_processing

resize test images

c981e86

cbalioglu approved these changes Dec 7, 2023

cbalioglu merged commit 6dda4a3 into facebookresearch:main Dec 7, 2023
18 checks passed
