Image decoder for data_pipeline #107

Merged
cbalioglu merged 125 commits into facebookresearch:main on Dec 7, 2023

@am831 commented Oct 17, 2023

What does this PR do? Please describe:
Implements image_decoder, which loads an image dataset as part of the data_pipeline using the libpng and libjpeg libraries. PNG and JPEG decoding is exposed through the ImageDecoder class.
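
As background on how one decoder entry point can serve both formats: PNG and JPEG data can be told apart by their leading signature bytes. The sketch below is only an illustration in pure Python. `sniff_image_format` is a hypothetical helper, not part of fairseq2, and the dispatch actually implemented in this PR may work differently.

```python
# Hypothetical helper for illustration only; not a fairseq2 API.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG file signature
JPEG_PREFIX = b"\xff\xd8\xff"         # JPEG start-of-image marker plus next marker byte


def sniff_image_format(data: bytes) -> str:
    """Return "png" or "jpeg" based on the leading bytes of `data`."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_PREFIX):
        return "jpeg"
    raise ValueError("Data is neither PNG nor JPEG.")
```

A check like this only inspects the header, so a file that passes it can still fail during decoding if its body is truncated or corrupt.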

You can use this example to manually test it:

from dataclasses import dataclass
from pathlib import Path

import torch

from fairseq2.data import FileMapper
from fairseq2.data.data_pipeline import DataPipeline
from fairseq2.data.image import ImageDecoder
from fairseq2.data.text import StrSplitter, read_text
from fairseq2.typing import DataType, Device


@dataclass
class DataContext:
    data_file: Path
    """The pathname of the test TSV data file."""

    img_field: str
    """The string field corresponding to the relative path of the image file."""

    img_root_dir: Path
    """The pathname of the directory under which image files are stored."""

    device: Device
    """The device on which to run inference."""

    dtype: DataType
    """The dtype with which to run inference."""


def build_data_pipeline(ctx: DataContext) -> DataPipeline:
    # TODO: This will soon be auto-tuned. Right now hand-tuned for devfair.
    n_parallel = 4

    # Open TSV, skip the header line, split into fields, and return three fields
    # only.
    split_tsv = StrSplitter(
        # We assume the tsv file has these 3 fields.
        names=["id", ctx.img_field, "raw_target_text"], indices=[0, 1, 2]
    )

    pipeline_builder = read_text(ctx.data_file, rtrim=True).skip(1).map(split_tsv)

    # Memory map image files and cache up to 10 files.
    map_file = FileMapper(root_dir=ctx.img_root_dir, cached_fd_count=10)

    pipeline_builder.map(map_file, selector=ctx.img_field, num_parallel_calls=n_parallel)

    # Decode mmap'ed image using libpng or libjpeg.
    decode_img = ImageDecoder()

    pipeline_builder.map(
        [decode_img],
        selector=f"{ctx.img_field}.data",
        num_parallel_calls=n_parallel,
    )

    # Build and return the data pipeline.
    return pipeline_builder.and_return()


def run_pipeline(ctx: DataContext) -> None:
    """Iterate through the specified TSV file and print each decoded example."""
    # Build a simple pipeline that just reads a single TSV file.
    pipeline = build_data_pipeline(ctx)

    # Iterate through each example in the TSV file and print it.
    for example in pipeline:
        print(example)

if __name__ == "__main__":
    # fmt: off
    ctx = DataContext(
        # TODO: Update these three fields.
        data_file=Path("<path_to_tsv_file_with_relative_img_paths>"),
        img_field="img_file",
        img_root_dir=Path("<path_to_parent_directory_of_img_files>"),
        device=torch.device("cpu"),
        dtype=torch.float32,
    )
    # fmt: on

    run_pipeline(ctx)

Check list:

  • Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • Did you read the contributor guideline?
  • Did you make sure that your PR does only one thing instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

@facebook-github-bot added the CLA Signed label Oct 17, 2023
@cbalioglu force-pushed the img_processing branch 6 times, most recently from e3c7a37 to 246c321 December 6, 2023 20:51
@cbalioglu left a comment

LGTM!

@cbalioglu merged commit 6dda4a3 into facebookresearch:main Dec 7, 2023
18 checks passed

3 participants