ARROW-17318: [C++][Dataset] Support async streaming interface for getting fragments in Dataset #13804

Merged (3 commits) on Sep 20, 2022

Commits on Aug 10, 2022

  1. MINOR: [C++] Create a header with forward declarations for AsyncGenerator
    
    The original `async_generator.h` is quite heavy to include,
    although many users need only a tiny fraction of what the
    header contains.
    
    Provide a forward-declarations header (`async_generator_fwd.h`)
    for the basic `AsyncGenerator` and other generator classes,
    so that users do not need to include the entire
    `async_generator.h` when, for example, they only need the
    `AsyncGenerator` typedef.
    
    Signed-off-by: Pavel Solodovnikov <pavel.al.solodovnikov@gmail.com>
    ManManson committed Aug 10, 2022
    f0a6eba
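
The forward-declaration idea in this commit can be illustrated with a minimal single-file sketch. Everything in the `sketch` namespace is a hypothetical stand-in, not the contents of `async_generator_fwd.h`: it assumes a generator is modeled as a callable returning a future, and shows that an interface can name such an alias while the future type is still only declared.

```cpp
#include <functional>
#include <utility>

// --- Part 1: what a lightweight "*_fwd.h" header might contain ---
namespace sketch {

template <typename T>
class Future;  // declared only; no definition is needed to name the alias

// Assumed shape for this sketch: each call yields a future for the next item.
template <typename T>
using AsyncGenerator = std::function<Future<T>()>;

// An interface can mention the alias without pulling in the heavy header.
struct Source {
  virtual ~Source() = default;
  virtual AsyncGenerator<int> Numbers() = 0;
};

}  // namespace sketch

// --- Part 2: what the heavy header or a .cc file would provide ---
namespace sketch {

// Trivial stand-in future that is already completed when constructed.
template <typename T>
class Future {
 public:
  explicit Future(T value) : value_(std::move(value)) {}
  const T& result() const { return value_; }

 private:
  T value_;
};

// Hypothetical implementation: yields 0, 1, 2, ... on successive calls.
struct Counter : Source {
  AsyncGenerator<int> Numbers() override {
    int next = 0;
    return [next]() mutable { return Future<int>(next++); };
  }
};

}  // namespace sketch
```

Only code that creates or consumes futures needs Part 2; code that merely declares functions taking or returning `AsyncGenerator<T>` compiles against Part 1 alone.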

Commits on Sep 3, 2022

  1. ARROW-17318: [C++][Dataset] Support async streaming interface for getting fragments in Dataset
    
    Add `GetFragmentsAsync()` and `GetFragmentsAsyncImpl()`
    functions to the generic `Dataset` interface, which
    allow fragments to be produced in a streamed fashion.
    
    This is one of the prerequisites for making
    `FileSystemDataset` support lazy fragment
    processing, which, in turn, can be used to start
    scan operations without waiting for the entire
    dataset to be discovered.
    
    To ease the transition to an async
    implementation in the `Dataset`/`AsyncScanner` code,
    a default implementation of `GetFragmentsAsyncImpl()`
    is provided (iterating over `GetFragmentsImpl()`
    via a `BackgroundGenerator` and transferring results
    back to a given executor).
    
    Tests: unit(release)
    
    Signed-off-by: Pavel Solodovnikov <pavel.al.solodovnikov@gmail.com>
    ManManson committed Sep 3, 2022
    5ce9a18
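
The default-implementation idea described in this commit (adapting a synchronous fragment iterator into an async generator) could look roughly like the sketch below. It is a simplified, single-threaded model under stated assumptions: `Fragment`, `Future`, `Executor`, `VectorDataset`, `CollectAsync` and all signatures are hypothetical stand-ins, and the queue-based executor replaces the background thread and `BackgroundGenerator` that the commit message mentions.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

using Fragment = std::string;  // stand-in for a real fragment object

// Minimal future: a shared slot that is filled in later.
template <typename T>
class Future {
 public:
  Future() : slot_(std::make_shared<std::optional<T>>()) {}
  bool is_finished() const { return slot_->has_value(); }
  const T& result() const { assert(is_finished()); return **slot_; }
  void MarkFinished(T value) const { *slot_ = std::move(value); }

 private:
  std::shared_ptr<std::optional<T>> slot_;
};

// Queue-based executor: tasks run only when the owner calls RunAll().
class Executor {
 public:
  void Spawn(std::function<void()> task) { tasks_.push_back(std::move(task)); }
  void RunAll() {
    while (!tasks_.empty()) {
      auto task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
    }
  }

 private:
  std::deque<std::function<void()>> tasks_;
};

// In both aliases, std::nullopt marks the end of the stream.
using FragmentIterator = std::function<std::optional<Fragment>()>;
using FragmentGenerator = std::function<Future<std::optional<Fragment>>()>;

class Dataset {
 public:
  virtual ~Dataset() = default;
  FragmentIterator GetFragments() { return GetFragmentsImpl(); }
  FragmentGenerator GetFragmentsAsync(Executor* executor) {
    return GetFragmentsAsyncImpl(executor);
  }

 protected:
  virtual FragmentIterator GetFragmentsImpl() = 0;

  // Default adapter: each generator call schedules one pull from the
  // synchronous iterator and completes the returned future with the result.
  virtual FragmentGenerator GetFragmentsAsyncImpl(Executor* executor) {
    auto it = std::make_shared<FragmentIterator>(GetFragmentsImpl());
    return [it, executor]() {
      Future<std::optional<Fragment>> fut;
      executor->Spawn([it, fut]() { fut.MarkFinished((*it)()); });
      return fut;
    };
  }
};

// Hypothetical dataset that only implements the synchronous path.
class VectorDataset : public Dataset {
 public:
  explicit VectorDataset(std::vector<Fragment> fragments)
      : fragments_(std::make_shared<std::vector<Fragment>>(std::move(fragments))) {}

 protected:
  FragmentIterator GetFragmentsImpl() override {
    auto fragments = fragments_;
    std::size_t pos = 0;
    return [fragments, pos]() mutable -> std::optional<Fragment> {
      if (pos == fragments->size()) return std::nullopt;
      return (*fragments)[pos++];
    };
  }

 private:
  std::shared_ptr<std::vector<Fragment>> fragments_;
};

// Drains the async generator, running the executor after each request.
inline std::vector<Fragment> CollectAsync(Dataset& dataset, Executor& executor) {
  std::vector<Fragment> out;
  auto gen = dataset.GetFragmentsAsync(&executor);
  while (true) {
    auto fut = gen();
    executor.RunAll();
    if (!fut.result().has_value()) break;
    out.push_back(*fut.result());
  }
  return out;
}

}  // namespace sketch
```

A subclass such as `VectorDataset` gets the async path for free, while a dataset that can discover fragments lazily would override `GetFragmentsAsyncImpl()` instead. The sketch relies on tasks running in submission order; an adapter running on real threads would also need to serialize access to the shared iterator.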

Commits on Sep 7, 2022

  1. ARROW-17138: [C++][Dataset] Add basic tests for GetFragments and `GetFragmentsAsync`
    
    Provide two basic tests for the `Dataset::GetFragments` and
    `Dataset::GetFragmentsAsync` interfaces, using
    `InMemoryDataset` for testing purposes.
    
    There was a helper function, `AssertDatasetFragmentsEqual()`,
    for testing the `GetFragments()` method, but it was unused until now,
    meaning that `Dataset` was missing part of its
    test coverage for `GetFragments()`.
    
    This is fixed now by using this helper function in a
    simple test case, `TestInMemoryDataset::GetFragmentsSync`.
    
    An analogous helper, `AssertDatasetAsyncFragmentsEqual`, is
    introduced to iterate the dataset via `Dataset::GetFragmentsAsync()`
    and is used in the `TestInMemoryDataset::GetFragmentsAsync` test case.
    
    Also, I encountered a bug in
    `DatasetFixtureMixin::AssertFragmentEquals`, which caused the
    `GetFragmentsSync` test case to fail:
    
    The underlying fragment scanner always expected to drain the
    provided batch generator completely, but this is not the case in the
    `GetFragmentsSync` and `GetFragmentsAsync` test cases, where each
    fragment in the dataset is composed of a single batch from the source
    batch generator.
    Hence, `AssertFragmentEquals()` should be called
    as many times as there are fragments in the dataset, each time
    advancing the batch generator position by a single batch.
    However, `AssertFragmentEquals()` would fail immediately after the
    first iteration, because the batch generator is not exhausted yet.
    
    This bug is also fixed in this patch.
    
    Tests: unit(release)
    
    Signed-off-by: Pavel Solodovnikov <pavel.al.solodovnikov@gmail.com>
    ManManson committed Sep 7, 2022
    b47679f
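
The helper bug described in this commit can be reproduced in miniature. The sketch below is not the Arrow test code: `BatchReader`, `FragmentEquals`, `DatasetFragmentsEqual` and the `ensure_drained` flag are hypothetical names chosen for illustration. The point is that a per-fragment check must consume only that fragment's batches, and the "source is exhausted" check belongs after the last fragment.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

namespace sketch {

using Batch = int;                    // stand-in for a record batch
using Fragment = std::vector<Batch>;  // a fragment is a run of batches

// Source of expected batches, shared by all per-fragment checks.
class BatchReader {
 public:
  explicit BatchReader(std::vector<Batch> batches) : batches_(std::move(batches)) {}

  // Writes the next batch to *out, or returns false at the end.
  bool Next(Batch* out) {
    assert(out != nullptr);
    if (pos_ == batches_.size()) return false;
    *out = batches_[pos_++];
    return true;
  }

  bool exhausted() const { return pos_ == batches_.size(); }

 private:
  std::vector<Batch> batches_;
  std::size_t pos_ = 0;
};

// Compares one fragment against the next batches of the expected source.
// With ensure_drained == true this models the old behavior, which also
// required the source to be fully consumed after a single fragment.
inline bool FragmentEquals(BatchReader* expected, const Fragment& fragment,
                           bool ensure_drained) {
  for (Batch actual : fragment) {
    Batch want;
    if (!expected->Next(&want) || want != actual) return false;
  }
  return !ensure_drained || expected->exhausted();
}

// Checks every fragment in turn, then verifies exhaustion once at the end.
inline bool DatasetFragmentsEqual(BatchReader* expected,
                                  const std::vector<Fragment>& fragments) {
  for (const Fragment& fragment : fragments) {
    if (!FragmentEquals(expected, fragment, /*ensure_drained=*/false)) return false;
  }
  return expected->exhausted();
}

}  // namespace sketch
```

With three single-batch fragments, calling `FragmentEquals(..., /*ensure_drained=*/true)` on the first fragment returns false because two batches remain, which mirrors the failure described above; `DatasetFragmentsEqual` accepts the same input.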