ARROW-6720: [C++] Add FileSystemFactory and HadoopFileSystem under arrow::fs #5717
Conversation
@pitrou would you have time to review? (I think you've been working on file system stuff?)
I'm working on the related PR (currently still a draft): #5709
Cool! Thanks, @bkietz. The reason I am doing this is that we want to add a Parquet reader/writer Java API in Arrow, and the C++ part of the code is in PR #5719.
Thanks for doing this, it looks like a good start. Please add some test cases.
///
/// \param[out] options Made FileSystemOptions
/// \return Status
static Status Make(std::shared_ptr<FileSystemOptions>* options);
Since this is just a struct, please simplify by providing a default constructor (or just default member initializers) instead of a factory function. Probably better: make a constructor which requires the path (since a reasonable default doesn't exist for that)
///
/// \param[in] use_hdfs3 a switch for using hdfs or hdfs3 lib
/// \param[in] replication replication number of hdfs
void ConfigureHDFSOptions(bool use_hdfs3, int replication = 1);
I don't think this mutator is necessary; IMHO
opts.use_hdfs3 = true;
opts.replication = 2;
is more expressive than
opts.ConfigureHDFSOptions(true, 2);
since the options are specified by name.
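Putting the review suggestions together (default member initializers, a constructor that requires the path, and plain named fields instead of mutators), the options type could look roughly like this. This is a hypothetical sketch for illustration, not the code in this PR:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical sketch combining the review suggestions above: a plain struct
// with default member initializers and a constructor requiring the path
// (which has no reasonable default). Field names follow the PR discussion.
struct FileSystemOptions {
  std::string full_path;
  bool use_hdfs3 = false;  // use the hdfs3 lib instead of libhdfs
  int replication = 1;     // HDFS replication factor

  explicit FileSystemOptions(std::string path) : full_path(std::move(path)) {}
};
```

Callers then set options by name (`opts.use_hdfs3 = true;`), which is the expressiveness the reviewer is after.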
actually, instead of duplicating use_hdfs3 in both FileSystemOptions and HdfsOptions, it would probably be better if FileSystemOptions simply contained an instance of HdfsOptions (required iff fs_type is inferred as hdfs)
~Impl() {}

Status Open(std::shared_ptr<FileSystemOptions> options) {
  FileSystemType fs_type = GetFileSystemType(options->full_path);
If fs_type is inferred from the path, I think this inference should be folded into PathInfo::Make. (since you use fs_type below, keep it as a member of PathInfo)
// TODO: check if CreateDir will fail when there is Dir.
RETURN_NOT_OK(local_fs->CreateDir(path_info->directory));
fs_ = std::make_shared<SubTreeFileSystem>(path_info->directory, local_fs);
} break;
Suggested change:
- } break;
+ break;
+ }
}

FileSystemType GetFileSystemType(const std::string& path) {
  if (path.find("hdfs") != std::string::npos) {
This needs to be far more rigorous. For example, this would infer /home/bkietz/data/hdfs_perf/dat.par as hdfs. Look at arrow/util/uri.h
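To make the failure mode concrete: a substring search matches anywhere in the path, while scheme-based inference only looks at the part before `://`. A minimal illustration of the idea (this is not the arrow/util/uri.h API, and treating a bare path as scheme "file" is an assumption of this sketch):

```cpp
#include <cassert>
#include <string>

// Illustration of why substring matching is too loose: infer the filesystem
// from the URI scheme (the text before "://") instead of searching the whole
// path. The real code should use the parser in arrow/util/uri.h.
std::string GetUriScheme(const std::string& path) {
  auto pos = path.find("://");
  if (pos == std::string::npos) return "file";  // bare path: treat as local
  return path.substr(0, pos);
}
```

With this rule, `/home/bkietz/data/hdfs_perf/dat.par` is classified as a local path even though it contains the substring "hdfs".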
/// \class FileSystemFactory
/// \brief FileSystemFactory to create FileSystems.
class ARROW_EXPORT FileSystemFactory {
I don't think it's necessary to add a pimpl'd class for this; I think it's fine to just declare
Status MakeFileSystem(const FileSystemOptions& options, std::shared_ptr<FileSystem>* fs);
void FileSystemOptions::ConfigurePath(const std::string& path) { this->full_path = path; }

enum struct FileSystemType { HDFS, LOCAL, S3 };
Suggested change:
- enum struct FileSystemType { HDFS, LOCAL, S3 };
+ enum class FileSystemType { HDFS, LOCAL, S3 };
  return Status::OK();
}

static Status ParseLOCAL(const std::string& full_path,
Code style: this name should be ParseLocal. I see that you're following the enumeration; if you like, you could use a template to follow it more closely:
template <FileSystemType>
static Status Parse(const std::string& full_path, std::shared_ptr<PathInfo>* path_info);
template <>
static Status Parse<FileSystemType::LOCAL>(const std::string& full_path, std::shared_ptr<PathInfo>* path_info) {
// LOCAL parse impl
}
cpp/src/arrow/CMakeLists.txt (Outdated)
@@ -276,6 +276,10 @@ if(ARROW_FILESYSTEM)
   list(APPEND ARROW_SRCS filesystem/s3fs.cc)
 endif()
+
+if(ARROW_HDFS)
+  list(APPEND ARROW_SRCS filesystem/hdfs.cc filesystem/filesystem_utils.cc)
If FileSystemFactory is not explicitly specific to HDFS then it should be compiled even if ARROW_HDFS is not set (and should not break the build in that case; this will probably require some #ifdef blocks in filesystem_utils.cc). If it is specific to HDFS then it should be refactored to make that explicit.
Ideally, I'd like high-level filesystem configuration to be URI-based. URI representation for remote and local paths is quite widespread, and it makes things easier for bindings. See e.g. https://docs.dask.org/en/latest/remote-data-services.html For example, you could have a URI such as …
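As an illustration of the URI-based idea (this is not Arrow's API; #5770 later added real query-string support to the Uri parser, as noted further down in the thread), filesystem options can ride in the URI's query string and be recovered as a key/value map:

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative sketch only: split a URI's query string (the part after '?')
// into a key/value option map, so a single string like
// "hdfs://host:9000/path?replication=2" can carry filesystem options.
// In Arrow proper, arrow/util/uri.h should do this parsing.
std::map<std::string, std::string> ParseQueryOptions(const std::string& uri) {
  std::map<std::string, std::string> options;
  auto qpos = uri.find('?');
  if (qpos == std::string::npos) return options;  // no query string
  std::string query = uri.substr(qpos + 1);
  size_t start = 0;
  while (start < query.size()) {
    size_t amp = query.find('&', start);
    if (amp == std::string::npos) amp = query.size();
    std::string pair = query.substr(start, amp - start);
    auto eq = pair.find('=');
    if (eq != std::string::npos) {
      options[pair.substr(0, eq)] = pair.substr(eq + 1);
    }
    start = amp + 1;
  }
  return options;
}
```

Bindings then only ever pass one string, and each filesystem pulls out the options it understands.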
Sorry, my mistake.
Hi @pitrou, I see — I can add a URI-based path parser, thanks. The reason I put Hdfs and Filesystem_utils in the same PR is that the ultimate goal is to enable a Parquet Java reader/writer like the one in #5522; this PR is a stepping stone to that, so the FileSystemFactory is the more important part for me here. I hope that's OK with you.
No problem, and thanks for all your review; I will fix those tomorrow.
The options and factory here are not OK as designed. Choosing HDFS should certainly not be a boolean flag. As I said, using a URI mechanism instead should be more flexible and easier for users.
The use_hdfs3 flag here is not for choosing HDFS; it's more like an option telling the HDFS filesystem to use the hdfs3 lib, and it will be fixed once we use URI-based paths.
Hi, @bkietz @pitrou
It's worked on in #5756.
/// \class PathInfo
/// \brief A class used to parse URI-based path.
class PathInfo {
Is it necessary to make PathInfo public? It's only constructible from a string, so it seems that it should be an implementation detail of arrow::fs::MakeFileSystem(). If you want to unit test PathInfo separately, you can declare it in an internal header (for example, filesystem/util_internal.h) so that it can be included by tests as well.
I made PathInfo public because, after getting the std::shared_ptr instance, we also need to use path_info->GetFileName() and path_info->GetDirectoryName() to see which file the URI-based path asks for; take filesystem_utils_test#133 as an example.
It seems that only the filesystem factory and unit tests will need to make that query, so PathInfo can be moved to an internal header.
@bkietz, oh, I used the unit tests to indicate that later on, whoever uses the FileSystemFactory will need PathInfo. If that isn't clear, do you mind taking a look at my following PR #5719? It needs PathInfo to open a parquet input file and output stream; check this link to the lines of code.
@bkietz, as I explained before, I had submitted a huge PR #5522 and @emkornfield asked me to split it into smaller ones, so I submitted two PRs, #5717 and #5719, for the unified file system API and the parquet JNI bridge. That is why I need this PathInfo class to be public; hope this makes sense to you!
@xuechendi Your only usage of it outside the file system factory in that PR is extracting the directory and file name in the JNI bridge. If you need that to be public then you could expose (for example) static std::string HadoopFileSystem::GetPath(const std::string& uri) as a separate helper function and keep PathInfo private.
Hi, @bkietz, I am not sure I get what you proposed. I thought HadoopFileSystem is a class for filesystem operations, such as create dir, delete dir, open read file and open write file, so we should not store information like a file path inside HadoopFileSystem — am I right?
And if I expose a function called GetFilePath inside HadoopFileSystem, then I still need to expose one inside s3fs and localfs when I need to open files by path, so why don't we just put this as a public PathInfo class inside filesystem_utils.h?
Do you mind discussing this over IM? I feel like we can only discuss this once a day, which makes the conversation very slow. My gmail account is xuechendi@gmail.com
If the only use case for the PathInfo class is to retrieve file paths from URIs then that doesn't warrant a public class. You're right that it applies to more than HDFS; maybe you could make it a static member of Uri:
Result<std::string> Uri::GetPath(const std::string& uri) {
Uri parsed;
RETURN_NOT_OK(parsed.Parse(uri));
return parsed.path();
}
@bkietz , I see, I’ll add it to URI, thanks
namespace fs {
namespace internal {

#if !defined _WIN32
Suggested change:
- #if !defined _WIN32
+ #ifndef _WIN32
Codecov Report
@@ Coverage Diff @@
## master #5717 +/- ##
==========================================
+ Coverage 88.99% 89.42% +0.42%
==========================================
Files 1006 820 -186
Lines 137241 122141 -15100
Branches 1501 0 -1501
==========================================
- Hits 122143 109225 -12918
+ Misses 14733 12916 -1817
+ Partials 365 0 -365
Continue to review full report at Codecov.
Hi, @bkietz, I've fixed the code according to your review, thanks.
@xuechendi Once #5770 is merged, you can use it to avoid parsing query strings yourself...
@xuechendi It's merged now.
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
1. Make MakeFileSystem a namespace function
2. Use uri.h to parse the path string
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Thank you @xuechendi. This is much better already!
class HadoopFileSystem::Impl {
 public:
  explicit Impl(HdfsOptions options) : options_(std::move(options)) {}
  ~Impl() { auto status = Close(); }
Shouldn't you log errors here?
Yes, I should, but since this is a destructor, I am not sure of the preferred way — should I throw an exception here? Or do you have any other suggestions?
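On the destructor question: the usual answer is to never throw from a destructor (that risks std::terminate); instead, swallow the failing status and log it — Arrow has logging macros in arrow/util/logging.h for this. A toy sketch of the shape, with MiniStatus standing in for arrow::Status and std::cerr standing in for ARROW_LOG(ERROR):

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Sketch of destructor error handling: the failing Close() status is
// formatted and logged rather than thrown. MiniStatus is a stand-in for
// arrow::Status for this self-contained example.
struct MiniStatus {
  bool ok;
  std::string message;
};

// Separate formatting helper so the behavior is easy to test.
std::string FormatCloseError(const MiniStatus& st) {
  return st.ok ? "" : "Error closing HDFS connection: " + st.message;
}

struct Impl {
  MiniStatus Close() { return {false, "connection already closed"}; }

  ~Impl() {
    MiniStatus st = Close();
    if (!st.ok) {
      std::cerr << FormatCloseError(st) << std::endl;  // log, never throw
    }
  }
};
```

If callers need a chance to handle the error themselves, the conventional pattern is to also expose Close() publicly so they can call it before destruction and check the returned Status.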
cpp/src/arrow/filesystem/hdfs.cc (Outdated)
RETURN_NOT_OK(client_->GetChildren(select.base_dir, &file_list));
for (auto file : file_list) {
  FileStats stat;
  RETURN_NOT_OK(GetTargetStats(file, &stat));
It seems you are not handling the allow_non_existent, recursive and max_recursion members in Selector.
This is fixed in the latest commit.
cpp/src/arrow/filesystem/hdfs.cc (Outdated)
std::vector<std::string> file_list;
RETURN_NOT_OK(client_->GetChildren(path, &file_list));
for (auto file : file_list) {
  RETURN_NOT_OK(client_->Delete(path + file));
What if file is a directory? DeleteDirContents is supposed to be recursive.
The Delete function in io/hdfs.cc is able to delete the input path whether it is a file or a directory, and I will change the recursive flag to true.
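To pin down the contract under discussion: DeleteDirContents removes everything under the directory, recursively, but leaves the directory itself in place. The HDFS client isn't usable in a snippet, so here is the same contract expressed with std::filesystem as a stand-in (names and structure are for illustration only):

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>

// Illustration (using std::filesystem, not the HDFS client) of the contract
// DeleteDirContents must honor: each child entry is removed recursively,
// while the directory itself survives, empty.
namespace stdfs = std::filesystem;

void DeleteDirContents(const stdfs::path& dir) {
  for (const auto& entry : stdfs::directory_iterator(dir)) {
    stdfs::remove_all(entry.path());  // recursive for subdirectories, single remove for files
  }
}

// Demo helper: builds a small tree, clears it, and reports whether the
// directory is still present but empty afterwards.
bool DemoDeleteDirContents() {
  stdfs::path dir = stdfs::temp_directory_path() / "ddc_demo";
  stdfs::create_directories(dir / "child" / "nested");
  std::ofstream(dir / "file.txt") << "data";
  DeleteDirContents(dir);
  bool ok = stdfs::exists(dir) && stdfs::is_empty(dir);
  stdfs::remove_all(dir);  // clean up the demo directory
  return ok;
}
```

The key difference from a flat loop over GetChildren is the recursive removal of subdirectory trees, which is what the reviewer is pointing at.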
}

void HdfsOptions::ConfigureHdfsUser(const std::string& user_name) {
  if (!user_name.empty()) {
Why this?
My idea is that if we don't need to specify a username, we leave it as it originally is instead of changing it to ""; that way, if hdfs_config has some assumption such as "if user is null, then user is XXX", I won't override that.
fs_type_ = GetFileSystemTypeFromString(uri.scheme());
switch (fs_type_) {
  case FileSystemType::HDFS:
    options_.emplace("host_name", uri.host());
I think all the HDFS specifics (options, etc.) should be moved to hdfs.h / hdfs.cc.
For example you could have:
class ARROW_EXPORT HadoopFileSystem : public FileSystem {
 public:
  // ...
  /// Create a HadoopFileSystem instance from the given options.
  static Status Make(const HdfsOptions& options, std::shared_ptr<HadoopFileSystem>* out);
  /// Create a HadoopFileSystem instance from the given URI.
  static Status Make(const Uri& uri, std::shared_ptr<HadoopFileSystem>* out);
};
(and similarly LocalFileSystem::Make)
I did it that way because S3Options is configured outside S3FileSystem, so I was thinking to do the same thing here.
S3Options is configured in s3fs.h / s3fs.cc. Similarly, HDFS options should be configured in hdfs.h / hdfs.cc.
I am a little confused: if S3Options is configured internally in s3fs.cc, why does s3fs.h mark both S3Options and the Make function which takes S3Options as input as ARROW_EXPORT? Doesn't that mean users should create and configure an S3Options instance first and then pass it to Make as a parameter? Can you give me a more specific example of what you would like to see?
@pitrou, I've fixed the issues you mentioned and passed the CI tests. One remaining issue is that I don't return an error in the HadoopFileSystem destructor — maybe you can give me some advice?
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi I'm working on a new PR to replace this one.
@pitrou, I am confused — does that mean this PR won't be merged? I put quite an effort into this; may I know the reason?
Don't worry @xuechendi, I'm basing my work on yours. It's just a way of getting to completion quicker ;-)
@pitrou, I see, thanks!
We previously worked on JIRA ARROW-6720, PR #5522, to add a Parquet Reader/Writer Java API to Arrow. According to previous comments, I split that PR into three smaller ones.
For this PR, I added a new filesystem_util under arrow/filesystem to provide a FileSystemFactory, and also wrapped the hdfs code under arrow/io into arrow/filesystem, so the FileSystemFactory will open a filesystem from its specified path and return a unified filesystem instance.
Signed-off-by: Chendi Xue chendi.xue@intel.com