New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-767: [C++] Filesystem abstraction #4225
Conversation
Submitted for discussion. There are some open questions here:
My answers:
|
/// Move / rename a file or directory. | ||
/// | ||
/// The destination will be replaced if it exists. | ||
virtual Status Move(const std::string& src, const std::string& dest) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another small nit: If we go with strings as an abstraction would it make more sense to use string_view as the api?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in my opinion, string_view
is useful for performance-critical APIs, which is not the case here. Implicit conversion between string_view
and std::string
does not seem to work on all compilers, which would make the API more cumbersome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think std::string
is OK
std::vector<FileStats>* out) = 0; | ||
|
||
/// Create a folder and subfolders. | ||
virtual Status CreateFolder(const std::string& path, bool recursive = true) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are you throughts on the implementation for this, when the underlying store doesn't support folders explicitly (e.g. S3)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should perhaps emulate them using /
as a standard delimiter (perhaps configurable when instantiating the S3 filesystem object).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like TensorFlow handles this by creating an ersatz file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conversation has been had many times over!
Specifically for S3, there are two ideas of what a folder might mean, in the absence of a real posix-like hierarchy: the simplest, that create-folder is a no-op, and you only ever infer folders if they contain things; and that a empty key which ends with '/' is to be considered an empty folder (but it could later morph to a file if data is written). The latter is the convention used by the S3 console.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I've never used S3, so I can't make a choice myself. The question would be: how do people usually use buckets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Listing files with the S3 API allows for a delimiter/prefix mechanism, you say how you want to define the directory name separators (always "/") and what prefix you want to list, and you get back a list of keys with that prefix and no more delimiters and a list of "common prefixes" of keys with that prefix and more delimiters. That acts a lot like a posix list directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so basically using /
as a delimiter to simulate directories in S3 is an appropriate approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using / as a delimiter to simulate directories in S3
Definitely, but it needs to be done with care, mindful that it isn't truly like that (and this is the case for other key-value stores too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but that's more of a quality of implementation issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good testing, of course; but an outstanding issue on s3fs has been whether mkdir(path)
followed by exists(path)
should necessarily always return True. It is debatable!
/// Delete a file. | ||
virtual Status DeleteFile(const std::string& path) = 0; | ||
/// Delete many files. | ||
virtual Status DeleteFiles(const std::vector<string>& path) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a concrete use-case for batch delete? I would imagine most file systems don't support transactional type deletes, so it might pay to expose this at a higher level instead of the core filesystem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect something like deleting a Parquet dataset with a large number of files/partitions?
The intention is not to allow transactional deletes but rather to avoid a roundtrip per each individual delete (which may hurt a lot if accessing a remote filesystem with 100+ms latencies).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, certainly, some file-systems will provide shortcuts for bulk operations or various types, and they will be very important in many use cases. In general, it would be exceptionally useful if file-systems provide glob and directory-tree-based operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need globbing at the filesystem level or can we give the client a complete list of directory contents and do the filtering on the client side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to make the default implementation of this call DeleteFile
on each path. Are there any filesystems that do support multiple-deletes in a single function call / RPC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the server can do glob, rather than having to issue many list-dir commands and filter ourselves, this would apply to that too. I'm not sure that any of the services you are thinking about here do - the only one I can think of is SSH.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that recursively listing, in the current proposal, is possible with GetTargetStats(const Selector& select, ...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that Arrow should be concerned with deleting upstream files. At least in this iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pragmatically, deleting files can be useful for testing :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fsaintjacques we need to be able to delete files to roll back incomplete writes
std::vector<FileStats>* out) = 0; | ||
|
||
/// Create a folder and subfolders. | ||
virtual Status CreateFolder(const std::string& path, bool recursive = true) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like TensorFlow handles this by creating an ersatz file
/// Delete a file. | ||
virtual Status DeleteFile(const std::string& path) = 0; | ||
/// Delete many files. | ||
virtual Status DeleteFiles(const std::vector<string>& path) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to make the default implementation of this call DeleteFile
on each path. Are there any filesystems that do support multiple-deletes in a single function call / RPC?
/// Move / rename a file or directory. | ||
/// | ||
/// The destination will be replaced if it exists. | ||
virtual Status Move(const std::string& src, const std::string& dest) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could call this RenameFile
(as TF does) instead of Move
. I note that Move
operations on directories may not be natively supported and may have to be emulated for e.g. S3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. Move
implies that the destination could be clobbered. Rename
also implies you're staying in the same directory.
/// Move / rename a file or directory. | ||
/// | ||
/// The destination will be replaced if it exists. | ||
virtual Status Move(const std::string& src, const std::string& dest) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think std::string
is OK
I expect you want to surpass and replace that, which would be fine with me (except I have have probably too many biases/ideas on how it should function...) Note that all fsspec implementations are all currently co-derived from pyarrow file-system, so they do work with the current pyarrow stack just fine. I notice that you are going exclusively with tensorF naming convention here, which is not the case for the old pyarrow file-system class, so there is room for confusion. |
I'm actually not really looking at the TensorFlow naming, just using whatever feels reasonable to me :-) (the |
From Wes's comment at some point, many of the methods are also aliased: |
@pitrou I think we should take a minimalist approach for the time being to address the use cases we have in front of us (reading and writing datasets in a variety of formats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou I am not quite sure I understand the problem statement - what is the problem that you want to solve, that STL is not solving? (https://en.cppreference.com/w/cpp/filesystem)
@aregm Two things:
|
@aregm this is discussed in the Filesystems section in https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing which has been discussed on the mailing list |
Then why not to expand the standard filsystem library instead of creating another abstraction layer? TensorFlow was designed before C++17, and with a different goal in mind. |
It is not only concrete library, but it is also a standard API, and we are talking about the API, not implementation. The implementation in Linux is a concrete library. Basing on the standard one allows to easily derive from it and reuse whatever is needed on AWS C++ SDK, HDFS API for remote, and fallback to the STL when local. This will preserve clean and familiar API and keep LSP working, which makes things simpler, and we will not add another API.
You can't avoid going 17. It's there, and there is no decent compiler, that does not support it. So this is not a case. |
|
@aregm there's a number of details in our platform design that you may be glossing over a bit. We've defined all of our own abstractions for memory management (e.g. At this point redesigning around |
indeed, but there are many of these to draw on. From my (biased) point of view, the Filesystem libraryAPI is incomplete (my reference, as always) and contains a lot that is unnecessary (blocks and links and such) |
Wes, can you, please, point to the design and strategy docs, because there are lots of silent assumptions that you have in mind with other guys, that I am not aware of, as they implicitly assumed. One of which is the requirements for filesystem - I completely understand why memory management should be custom, Arrow is a framework for building various in-memory data analytical and database systems, which implies optimized and efficient memory management. A thing you cannot get from stdlibc++. What are the requirements or the design philosophy for the FS - I do not know, so assume that do not add more entities without need is the right strategy. So any design doc is appreciated. I'll just stack them and add links to the maillist and Confluence so everyone will be aware. |
/// Any symlink is automatically dereferenced, recursively | ||
virtual Status GetTargetStats(const std::string& path, FileStats* out); | ||
/// Same, for many targets at once. | ||
virtual Status GetTargetStats(const std::vector<std::string>& paths, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the behavior on partial failures? Return nothing with status? How does the consumer know which file failed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on what failures you have in mind. A non-existent file returns successfully with the NonExistent
file type. An IO error (e.g. network error to the S3 server) will probably translate into an error Status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment is more regarding partial failures, say you want stats for [a, b, c, d] and returns [stat_a, stat_b, FAIL, stat_d]. What should be the returned array, and how do you know which of the input failed.
Usually, the return type would be Array<Result<FileStat, FileError>>
, but your current signature is Result<Array<FileStat>, FileError>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, can you give an example of a partial failure that's not an exceptional condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that, for the purposes of this API, most errors can be modelled as NonExistent
.
For example, in the POSIX stat() docs, the errors EACCESS, ELOOP, ENAMETOOLONG, ENOENT, ENOTDIR can be mapped to NonExistent
.
We're left with the errors EIO and EOVERFLOW (plus ENOMEM on Linux), which should be truly exceptional, and EBADF and EINVAL, which point to bugs in the code.
/// Delete a file. | ||
virtual Status DeleteFile(const std::string& path) = 0; | ||
/// Delete many files. | ||
virtual Status DeleteFiles(const std::vector<string>& path) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that Arrow should be concerned with deleting upstream files. At least in this iteration.
@aregm I don't have detailed design documents about our IO platform -- there are 3+ years worth of JIRA issues and pull requests (and associated discussions), you'll have to review the changelog / history of the project to get a more nuanced idea about how we've arrived at this point. Broadly speaking, this is our approach in this project:
The STL is much more general purpose and not aware of the abstractions we've developed and specialized requirements around zero-copy semantics that we have. I do not think it is viable for our purposes. If you disagree with the project's overall approach to memory management, IO, and interacting with remote data please start a mailing list discussion about it. |
b285cfe
to
4723fd6
Compare
I've added a RandomAccessFile-returning API. I think demanding a memory-mapped file is premature optimization. |
At least S3 does: https://docs.aws.amazon.com/AmazonS3/latest/API/multiobjectdeleteapi.html |
Side note: I would be inclined to create a |
@pitrou let us leave the memory-mapped file question for a future PR where we can discuss it in more detail |
Other question still: is it a good idea to have compression as part of this API? Do we benefit by allowing filesystem-specific implementations of compression, or does it just create more work for filesystem implementors? |
5d258e1
to
200e09f
Compare
I've started a mock filesystem implementation (in-memory) to help validate the API. |
bbdc679
to
9810082
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks reasonable as a first pass on the API. I left a few minor comments, but I think this can be merged relatively soon and we can move on to creating some implementations
Status FileSystem::DeleteFiles(const std::vector<std::string>& paths) { | ||
Status st = Status::OK(); | ||
for (const auto& path : paths) { | ||
st &= DeleteFile(path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this short-circuit on the first failure? If multiple paths fail then some of the error state will get clobbered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The aim here is that the user does not have to retry deleting selectively the other files.
// A system clock time point expressed as a 64-bit (or more) number of | ||
// nanoseconds since the epoch. | ||
using TimePoint = | ||
std::chrono::time_point<std::chrono::system_clock, std::chrono::nanoseconds>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there any filesystems that provide nanosecond-level information (curious)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently ext4
does.
|
||
// The file type. | ||
FileType type() const { return type_; } | ||
void type(FileType type) { type_ = type; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it should be SetType
? TBH, I think the Google style guide is not very helpful for such accessor methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_$PROP
is the naming style used by Protocol Buffers at least. I agree the guidance is hazy
|
||
// The full file path in the filesystem. | ||
std::string path() const { return path_; } | ||
void path(const std::string& path) { path_ = path; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_path
? And same per other attrs below
struct ARROW_EXPORT Selector { | ||
// The directory in which to select files. | ||
// If the path exists but doesn't point to a directory, this should be an error. | ||
std::string base_dir; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you thinking about some kind of wildcard API at some point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the idea, though it depends whether it's useful to have filesystem-specific implementations.
/// Move / rename a file or directory. | ||
/// | ||
/// The destination will be replaced if it exists. | ||
// XXX separate MoveFile / MoveDir ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's OK to be both for now. In HDFS I wrote Rename
for both directories and files
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.h#L139
Some filesystems may have to stat the path to determine what it is, where some don't have to, so I think having a single API makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough.
Note that "stat first then choose the API" is vulnerable to race conditions. But for our purposes I'm not sure we care about such issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we can leave it to the particular filesystem implementation to decide what is most appropriate to do
cpp/src/arrow/filesystem/mockfs.h
Outdated
|
||
// XXX It's not very practical to have to explicitly declare inheritance | ||
// of default overrides. | ||
using FileSystem::OpenOutputStream; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I'm not sure without thinking about it what a better way would be though (to have a with-flags and flagless version of these methods)
@@ -161,6 +161,8 @@ class ARROW_EXPORT BufferReader : public RandomAccessFile { | |||
std::shared_ptr<Buffer> buffer() const { return buffer_; } | |||
|
|||
protected: | |||
inline Status CheckClosed() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does inline
have any effect here (curious)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hoping that it incites the compiler to inline the implementation into its callers... @fsaintjacques am I right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if inline
will have the desired effect, but if it is marked inline
then this function should be private, not protected: if BufferReader
is subclassed in another translation unit and the subclass calls CheckClosed
then we'll get a linker error when the inline function isn't defined
@wesm Do you think it's useful to keep the auto-(de)compression option and have each filesystem implementation handle it? |
I'm concerned the auto-decompression could be a bit too magical. It might be better to instantiate the stream decompressor explicitly |
Ok. So the |
Hm. Perhaps it's better to omit the Open* methods with flags for now, then |
9810082
to
16dfd34
Compare
Ok, I addressed the review comments and a few XXXs and TODOs as well. |
16dfd34
to
5ee4481
Compare
5ee4481
to
a40ba52
Compare
Will merge once CI passes. |
Add a
FileSystem
interface plus aMockFileSystem
implementation (only keeping data in memory).