Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Expose FileSystemFactoryOptions #30771

Closed
asfimport opened this issue Jan 7, 2022 · 4 comments
Closed

[R] Expose FileSystemFactoryOptions #30771

asfimport opened this issue Jan 7, 2022 · 4 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Jan 7, 2022

ARROW-4406 notes that:

Currently reading parquet files generated by Hadoop (EMR) from S3 fails with "ValueError: >Found files in an intermediate directory" because of the _$folder$ empty files.

This was fixed in the pyarrow but R still has this issue.

The R side does not seem to have something similar to:

  def _should_silently_exclude(self, file_name):
    return (file_name.endswith('.crc') or # Checksums
            file_name.endswith('_$folder$') or # HDFS directories in S3
            file_name.startswith('.') or # Hidden files starting with .
            file_name.startswith('_') or # Hidden files starting with _
            file_name in EXCLUDED_PARQUET_PATHS)

Reporter: Bob Rudis
Assignee: Neal Richardson / @nealrichardson

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-15280. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Dewey Dunnington / @paleolimbot:
Thank you for reporting this!

Where this lives in Python:

def _should_silently_exclude(self, file_name):
return (file_name.endswith('.crc') or # Checksums
file_name.endswith('_$folder$') or # HDFS directories in S3
file_name.startswith('.') or # Hidden files starting with .
file_name.startswith('_') or # Hidden files starting with _
file_name in EXCLUDED_PARQUET_PATHS)

We don' have something similar to this approach in R, but it looks like these are exposed in the C++ API as well:

struct FileSystemFactoryOptions {
/// Either an explicit Partitioning or a PartitioningFactory to discover one.
///
/// If a factory is provided, it will be used to infer a schema for partition fields
/// based on file and directory paths then construct a Partitioning. The default
/// is a Partitioning which will yield no partition information.
///
/// The (explicit or discovered) partitioning will be applied to discovered files
/// and the resulting partition information embedded in the Dataset.
PartitioningOrFactory partitioning{Partitioning::Default()};
/// For the purposes of applying the partitioning, paths will be stripped
/// of the partition_base_dir. Files not matching the partition_base_dir
/// prefix will be skipped for partition discovery. The ignored files will still
/// be part of the Dataset, but will not have partition information.
///
/// Example:
/// partition_base_dir = "/dataset";
///
/// - "/dataset/US/sales.csv" -> "US/sales.csv" will be given to the partitioning
///
/// - "/home/john/late_sales.csv" -> Will be ignored for partition discovery.
///
/// This is useful for partitioning which parses directory when ordering
/// is important, e.g. DirectoryPartitioning.
std::string partition_base_dir;
/// Invalid files (via selector or explicitly) will be excluded by checking
/// with the FileFormat::IsSupported method. This will incur IO for each files
/// in a serial and single threaded fashion. Disabling this feature will skip the
/// IO, but unsupported files may be present in the Dataset
/// (resulting in an error at scan time).
bool exclude_invalid_files = false;
/// When discovering from a Selector (and not from an explicit file list), ignore
/// files and directories matching any of these prefixes.
///
/// Example (with selector = "/dataset/**"):
/// selector_ignore_prefixes = {"_", ".DS_STORE" };
///
/// - "/dataset/data.csv" -> not ignored
/// - "/dataset/_metadata" -> ignored
/// - "/dataset/.DS_STORE" -> ignored
/// - "/dataset/_hidden/dat" -> ignored
/// - "/dataset/nested/.DS_STORE" -> ignored
std::vector<std::string> selector_ignore_prefixes = {
".",
"_",
};
};

...which we could wire up to R here:

arrow/r/src/dataset.cpp

Lines 171 to 172 in 03219e2

// TODO(fsaintjacques): Make options configurable
auto options = ds::FileSystemFactoryOptions{};

and here:

...) {

@asfimport
Copy link
Collaborator Author

Neal Richardson / @nealrichardson:
A couple of observations as I start to look into this:

  • The code in pyarrow where this is mentioned looks to be in the legacy ParquetDataset code path, which predates the Dataset API we use in R. I believe this is being deprecated (right @jorisvandenbossche?). So it's not a drop-in fix for R.
  • There is a TODO to expose the FileSystemFactoryOptions in R, which we should do here. That struct has a selector_ignore_prefixes option, but since this _$folder$ is a suffix, it won't help. In fact, there are some misleading C++ tests (such as
    TEST_F(FileSystemDatasetFactoryTest, OptionsIgnoredDefaultPrefixes) {
    // When constructing a factory from a FileSelector,
    // `selector_ignore_prefixes` governs which files are filtered out.
    selector_.recursive = true;
    MakeFactory({
    fs::File("."),
    fs::File("_"),
    fs::File("_$folder$/dat"),
    fs::File("_SUCCESS"),
    fs::File("not_ignored_by_default"),
    fs::File("not_ignored_by_default_either/dat"),
    });
    AssertFinishWithPaths({"not_ignored_by_default", "not_ignored_by_default_either/dat"});
    }
    ) that include the _$folder$ string but not as a suffix like this. Perhaps we should add an ignore_suffixes option, or some other interface for filtering like this (cc @bkietz)
  • However, FileSystemFactoryOptions also has an exclude_invalid_files option, default false, and if true should successfully exclude _$folder$ files because they are empty. So exposing FileSystemFactoryOptions in R will allow you to flip this and that should do the trick.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:

The code in pyarrow where this is mentioned looks to be in the legacy ParquetDataset code path, which predates the Dataset API we use in R. I believe this is being deprecated (right Joris Van den Bossche?). So it's not a drop-in fix for R.

Yes, that's correct (so this is actually also an issue for python)

Perhaps we should add an ignore_suffixes option, or some other interface for filtering like this (cc Ben Kietzman)

I was thinking exactly the same. Based on the python snippet from the legacy code, it is clear that it's ignoring some files both based on prefixes and suffixes.

It might be possible to have some more advanced / smarter (callback based?) filename filter option, but I suppose that a simpler prefix+suffix ignore options will cover almost all use cases?

@asfimport
Copy link
Collaborator Author

Neal Richardson / @nealrichardson:
Issue resolved by pull request 13171
#13171

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants