Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6720: [C++] Add FileSystemFactory and HadoopFileSystem under arrow::fs #5717

Closed
wants to merge 3 commits into from

Conversation

xuechendi
Copy link

@xuechendi xuechendi commented Oct 23, 2019

We previously work on JIRA Arrow-6720, PR#5522 to add a Parquet Reader/Writer Java API to Arrow
According to previous comments, I splitted that PR into three small ones.

For this PR, I added a new filesystem_util under arrow/filesystem to provide a filesystemfactory here. And also wrapped hdfs codes under arrow/io to arrow/filesystem, so filesystemfactory will open filesystem by its specified path and then return a unified filesystem instance.

Signed-off-by: Chendi Xue chendi.xue@intel.com

@github-actions
Copy link

@xuechendi xuechendi force-pushed the wip_upstream_filesystem branch 6 times, most recently from 93ed5d2 to 0393435 Compare October 23, 2019 13:21
@fsaintjacques fsaintjacques changed the title ARROW-6720:[C++]Add FileSystemFactory and HadoopFileSystem under arrow fs ARROW-6720:[C++] Add FileSystemFactory and HadoopFileSystem under arrow fs Oct 23, 2019
@emkornfield emkornfield changed the title ARROW-6720:[C++] Add FileSystemFactory and HadoopFileSystem under arrow fs ARROW-6720: [C++] Add FileSystemFactory and HadoopFileSystem under arrow fs Oct 24, 2019
@emkornfield
Copy link
Contributor

@pitrou would you have time to review? (I think you've been working on file system stuff?)

@xuechendi xuechendi changed the title ARROW-6720: [C++] Add FileSystemFactory and HadoopFileSystem under arrow fs ARROW-6720: [C++] Add FileSystemFactory and HadoopFileSystem under arrow::fs Oct 24, 2019
@bkietz
Copy link
Member

bkietz commented Oct 24, 2019

I'm working on the related PR (currently still a draft): #5709
I'll review this

@bkietz bkietz assigned bkietz and unassigned bkietz Oct 24, 2019
@bkietz bkietz self-requested a review October 24, 2019 13:33
@xuechendi
Copy link
Author

I'm working on the related PR (currently still a draft): #5709
I'll review this

Cool! thanks, @bkietz . The reason I am doing this is that we want to add a parquet reader/writer java api in arrow, and cpp part of codes are in PR #5719 .
When doing this, we need to add some unified API so we can read either from localfs or hdfs, that is why I wrapped hdfs under arrow/io into arrow/filesystem and also added a filesystemfactory.
I hope we can work on this together and get this PR or some similiar APIs ready as soon as possible, so parquet and also orc jni wrapper can take use of this.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this, it looks like a good start. Please add some test cases

///
/// \param[out] options Made FileSystemOptions
/// \return Status
static Status Make(std::shared_ptr<FileSystemOptions>* options);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is just a struct, please simplify by providing a default constructor (or just default member initializers) instead of a factory function. Probably better: make a constructor which requires the path (since a reasonable default doesn't exist for that)

///
/// \param[in] use_hdfs3 a switch for using hdfs or hdfs3 lib
/// \param[in] replication replication number of hdfs
void ConfigureHDFSOptions(bool use_hdfs3, int replication = 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this mutator is necessary; IMHO

opts.use_hdfs3 = true;
opts.replication = 2;

is more expressive than

opts.ConfigureHDFSOptions(true, 2);

since the options are specified by name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, instead of duplicating use_hdfs3 in both FileSystemOptions and HdfsOptions, it would probably be better FileSystemOptions could simply contain an instance of HdfsOptions (required iff fs_type is inferred as hdfs)

~Impl() {}

Status Open(std::shared_ptr<FileSystemOptions> options) {
FileSystemType fs_type = GetFileSystemType(options->full_path);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If fs_type is inferred from the path, I think this inference should be folded into PathInfo::Make. (since you use fs_type below, keep it as a member of PathInfo)

// TODO: check if CreateDir will fail when there is Dir.
RETURN_NOT_OK(local_fs->CreateDir(path_info->directory));
fs_ = std::make_shared<SubTreeFileSystem>(path_info->directory, local_fs);
} break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} break;
break;
}

}

FileSystemType GetFileSystemType(const std::string& path) {
if (path.find("hdfs") != std::string::npos) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be far more rigorous. For example, this would infer /home/bkietz/data/hdfs_perf/dat.par as hdfs. Look at arrow/util/uri.h


/// \class FileSystemFactory
/// \brief FileSystemFactory to create FileSystems.
class ARROW_EXPORT FileSystemFactory {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary to add a pimpl'd class for this; I think it's fine to just declare

Status MakeFileSystem(const FileSystemOptions& options, std::shared_ptr<FileSystem>* fs);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou 's comment would make this even simpler:

Status MakeFileSystem(const std::string& uri, std::shared_ptr<FileSystem>* fs);


void FileSystemOptions::ConfigurePath(const std::string& path) { this->full_path = path; }

enum struct FileSystemType { HDFS, LOCAL, S3 };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
enum struct FileSystemType { HDFS, LOCAL, S3 };
enum class FileSystemType { HDFS, LOCAL, S3 };

return Status::OK();
}

static Status ParseLOCAL(const std::string& full_path,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style: this name should be ParseLocal. I see that you're following the enumeration; if you like you could use a template to follow it more closely:

template <FileSystemType>
static Status Parse(const std::string& full_path, std::shared_ptr<PathInfo>* path_info);

template <>
static Status Parse<FileSystemType::LOCAL>(const std::string& full_path, std::shared_ptr<PathInfo>* path_info) {
  // LOCAL parse impl
}

@@ -276,6 +276,10 @@ if(ARROW_FILESYSTEM)
list(APPEND ARROW_SRCS filesystem/s3fs.cc)
endif()

if(ARROW_HDFS)
list(APPEND ARROW_SRCS filesystem/hdfs.cc filesystem/filesystem_utils.cc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If FileSystemFactory is not explicitly specific to hdfs then it should be compiled even if ARROW_HDFS is not set (and should not break the build in that case; this will probably require some #ifdef blocks in filesystem_utils.cc). If it is specific to HDFS then it should be refactored to make that explicit

@pitrou
Copy link
Member

pitrou commented Oct 24, 2019

I'd rather see the FileSystemFactory and FileSystemOptions stuff removed from this PR.

Ideally, I'd like high-level filesystem configuration to be URI-based. URI representation for remote and local paths is quite widespread, and it makes things easier for bindings. See e.g. https://docs.dask.org/en/latest/remote-data-services.html

For example, you could have a URI such as hdfs://host/some/path?replication=3 which would hold necessary information.

@bkietz bkietz closed this Oct 24, 2019
@bkietz bkietz reopened this Oct 24, 2019
@bkietz
Copy link
Member

bkietz commented Oct 24, 2019

sorry, my mistake.

@xuechendi
Copy link
Author

I'd rather see the FileSystemFactory and FileSystemOptions stuff removed from this PR.

Ideally, I'd like high-level filesystem configuration to be URI-based. URI representation for remote and local paths is quite widespread, and it makes things easier for bindings. See e.g. https://docs.dask.org/en/latest/remote-data-services.html

For example, you could have a URI such as hdfs:///some/path?replication=3 which would hold necessary information.

hi, @pitrou , I see, I can add a URI-based path parser, thanks. And the reason why I put Hdfs and Filesystem_utils in the same PR is because the ultimate goal of doing this is to enable a parquet java reader/writer like the one #5522 did, this is more like a step stone to that, so filesystemfactory is more important to me in this PR. Hope that will be OK to you.

@xuechendi
Copy link
Author

sorry, my mistake.

No problem, and thanks for all you review, I will fix those tomorrow.

@pitrou
Copy link
Member

pitrou commented Oct 24, 2019

The options and factory here are not ok as designed. Choosing HDFS should certainly not be a boolean flag. As I said, using a URI mechanism instead should be more flexible and easier for users.

@xuechendi
Copy link
Author

xuechendi commented Oct 24, 2019 via email

@xuechendi
Copy link
Author

xuechendi commented Oct 25, 2019

Hi, @bkietz @pitrou
I submitted a new commit to remove FilesystemOptions and using URI-based path to pass options.
For URI parser, I used a regex to split URI into four part: host/port/path/options. Then handle each part respectively.
Here is an test example:

======
Original string is hdfs://vsr602:9000/path/to/ws_sold_date_sk=2452642/part-00196-5adaa592-957b-475d-95f7-64881c8e0c68.c000.snappy.parquet?replication=1&use_hdfs3=1
Type is HDFS
Host is vsr602
port is 9000
path is /path/to/ws_sold_date_sk=2452642/part-00196-5adaa592-957b-475d-95f7-64881c8e0c68.c000.snappy.parquet
replication: 1
use_hdfs3: 1

======
Original string is hdfs://192.168.0.1:9000/path/to/ws_sold_date_sk=2452642/part-00196-5adaa592-957b-475d-95f7-64881c8e0c68.c000.snappy.parquet?replication=1&use_hdfs3=1
Type is HDFS
Host is 192.168.0.1
port is 9000
path is /path/to/ws_sold_date_sk=2452642/part-00196-5adaa592-957b-475d-95f7-64881c8e0c68.c000.snappy.parquet
replication: 1
use_hdfs3: 1

======
Original string is file:///path/to/ws_sold_date_sk=2452642/part-00196-5adaa592-957b-475d-95f7-64881c8e0c68.c000.snappy.parquet
Type is LOCALFS
path is /path/to/ws_sold_date_sk=2452642/part-00196-5adaa592-957b-475d-95f7-64881c8e0c68.c000.snappy.parquet

======
Original string is /path/to/ws_sold_date_sk=2452642/part-00196-5adaa592-957b-475d-95f7-64881c8e0c68.c000.snappy.parquet
Type is LOCALFS
path is /path/to/ws_sold_date_sk=2452642/part-00196-5adaa592-957b-475d-95f7-64881c8e0c68.c000.snappy.parquet

@kou
Copy link
Member

kou commented Oct 30, 2019

It's worked on #5756.
We can ignore it in this PR.

@xuechendi xuechendi closed this Oct 30, 2019
@xuechendi xuechendi reopened this Oct 30, 2019
@xuechendi
Copy link
Author

@wesm @pitrou @bkietz ,hi, guys, tests are added and codes passed all checks, thanks

cpp/src/arrow/filesystem/filesystem_utils.h Outdated Show resolved Hide resolved

/// \class PathInfo
/// \brief A class used to parse URI-based path.
class PathInfo {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to make PathInfo public? it's only constructible from a string, so it seems that it should be an implementation detail of arrow::fs::MakeFileSystem(). If you want to unit test PathInfo separately, you can declare it in an internal header (for example, filesystem/util_internal.h) so that it can be included by tests as well

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made PathInfo public is because after we getting the std::shared_ptr instance, we will also need to use path_info->GetFileName() and path_info-> GetDirectoryName() to see what file is asked in the URI-based path, take filesystem_utils_test#133 as an example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that only the filesystem factory and unit tests will need to make that query, so PathInfo can be moved to an internal header.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkietz, oh, I aimed to use unit tests to indicate that later on who used file systemfactory will need Pathinfo
If that doesn’t make clear to you, do you mind to take a look of my following pr #5719 ? It will need pathinfo to open a parquet input file and output stream check this link to line of codes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkietz , as I explained before, I had submitted a huge pr #5522 before and @emkornfield asked me to make the pr into smaller ones , so I submitted two PR #5717 and #5719 for file system unified api and parquet jni bridge, that is why I needed this pathinfo class as a public one, hope this makes sense to you!

Copy link
Member

@bkietz bkietz Nov 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xuechendi Your only usage of it outside file system factory in that PR is extracting the directory and file name in the jni bridge. If you need that to be public then you could expose (for example) static std::string HadoopFileSystem::GetPath(const std::string& uri) as a separate helper function and keep PathInfo private

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi, @bkietz , I am not sure if I get what you proposed, I thought hadoopfilesystem is a class for filesystem operations, such as create dir, delete dir, open read file and open write file, so we should not store infomations like filepath inside hadoopfilesystem, am I right?
And if I expose a function called getFilePath inside hadoopfilesystem, then I still need to expose one inside s3fs and localfs when I need to open files through path, why don't we just put this as an public PathInfo class inside filesystem_utils.h?
Do you mind if we can discuss on this in an IM? I felt like we can only discuss on this one time a day, which make the conversion very slow. my gmail account is xuechendi@gmail.com

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the only use case for the PathInfo class is to retrieve file paths from uris then that doesn't warrant a public class. You're right that it applies to more than HDFS, maybe you could make it a static member of Uri:

Result<std::string> Uri::GetPath(const std::string& uri) {
  Uri parsed;
  RETURN_NOT_OK(parsed.Parse(uri));
  return parsed.path();
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkietz , I see, I’ll add it to URI, thanks

cpp/src/arrow/filesystem/filesystem_utils.h Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/hdfs.cc Outdated Show resolved Hide resolved
namespace fs {
namespace internal {

#if !defined _WIN32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#if !defined _WIN32
#ifndef _WIN32

cpp/src/arrow/filesystem/hdfs_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/filesystem_utils.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/filesystem_utils.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/filesystem_utils.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/hdfs.cc Outdated Show resolved Hide resolved
@wesm
Copy link
Member

wesm commented Oct 31, 2019

I'll let @bkietz and @pitrou look after this, if you need me to take a look at something please let me know

@codecov-io
Copy link

Codecov Report

Merging #5717 into master will increase coverage by 0.42%.
The diff coverage is 54.39%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5717      +/-   ##
==========================================
+ Coverage   88.99%   89.42%   +0.42%     
==========================================
  Files        1006      820     -186     
  Lines      137241   122141   -15100     
  Branches     1501        0    -1501     
==========================================
- Hits       122143   109225   -12918     
+ Misses      14733    12916    -1817     
+ Partials      365        0     -365
Impacted Files Coverage Δ
cpp/src/arrow/filesystem/hdfs.h 100% <100%> (ø)
cpp/src/arrow/filesystem/filesystem_utils.h 100% <100%> (ø)
cpp/src/arrow/filesystem/hdfs.cc 22.06% <22.06%> (ø)
cpp/src/arrow/filesystem/hdfs_test.cc 48.57% <48.57%> (ø)
cpp/src/arrow/filesystem/filesystem_utils.cc 71.96% <71.96%> (ø)
cpp/src/arrow/filesystem/filesystem_utils_test.cc 98.3% <98.3%> (ø)
cpp/src/arrow/json/converter.cc 90.05% <0%> (-1.76%) ⬇️
cpp/src/arrow/json/chunked_builder.cc 80% <0%> (-1.67%) ⬇️
cpp/src/arrow/util/thread_pool_test.cc 97.66% <0%> (-0.94%) ⬇️
go/arrow/ipc/writer.go
... and 198 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d2ed30a...dd317d7. Read the comment docs.

@xuechendi
Copy link
Author

Hi, @bkietz , I've fixed codes according to your review, thanks

@pitrou
Copy link
Member

pitrou commented Nov 4, 2019

@xuechendi Once #5770 is merged, you can use it to avoid parsing query strings yourself...

@xuechendi
Copy link
Author

@pitrou, I see, when do you think #5770 will be merged, and do you think this PR need to wait until #5770 being merged?

@pitrou
Copy link
Member

pitrou commented Nov 5, 2019

@xuechendi It's merged now.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
1. Make MakeFileSystem as a namespace function
2. Use uri.h to parse path string

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi
Copy link
Author

@bkietz @pitrou , I’ve modified codes to use uri.h for path string parsing and made pathinfo private.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @xuechendi. This is much better already!

cpp/src/arrow/filesystem/CMakeLists.txt Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/filesystem_utils.h Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/filesystem_utils.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/filesystem_utils.h Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/filesystem_utils.cc Outdated Show resolved Hide resolved
class HadoopFileSystem::Impl {
public:
explicit Impl(HdfsOptions options) : options_(std::move(options)) {}
~Impl() { auto status = Close(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you log errors here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I should, but since this is a destruction method, I am not sure the preferred way, should I throw an exception here? Or any other suggestions?

RETURN_NOT_OK(client_->GetChildren(select.base_dir, &file_list));
for (auto file : file_list) {
FileStats stat;
RETURN_NOT_OK(GetTargetStats(file, &stat));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you are not handling the allow_non_existent, recursive and max_recursion members in Selector.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fixed in latest commit

std::vector<std::string> file_list;
RETURN_NOT_OK(client_->GetChildren(path, &file_list));
for (auto file : file_list) {
RETURN_NOT_OK(client_->Delete(path + file));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if file is a directory? DeleteDirContents is supposed to be recursive.

Copy link
Author

@xuechendi xuechendi Nov 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete Function in io/hdfs.cc is able to delete the input path either is file or directory, and I will change the recursive flag to true.

}

void HdfsOptions::ConfigureHdfsUser(const std::string& user_name) {
if (!user_name.empty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this?

Copy link
Author

@xuechendi xuechendi Nov 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea is if we don't need to specify username, then leave it as originally is instead of changing it to "", so if there is some assumption for hdfs_config, for example, if user is null, then use is XXX, I won't make override that.

fs_type_ = GetFileSystemTypeFromString(uri.scheme());
switch (fs_type_) {
case FileSystemType::HDFS:
options_.emplace("host_name", uri.host());
Copy link
Member

@pitrou pitrou Nov 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all the HDFS specifics (options, etc.) should be moved to hfds.h / hdfs.cc.
For example you could have:

class ARROW_EXPORT HadoopFileSystem : public FileSystem {
 public:
  // ...

  /// Create a HdfsFileSystem instance from the given options.
  static Status Make(const HdfsOptions& options, std::shared_ptr<HadoopFileSystem>* out);
  /// Create a HdfsFileSystem instance from the given URI.
  static Status Make(const Uri& uri, std::shared_ptr<HadoopFileSystem>* out);
};

(and similarly LocalFileSystem::Make)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did as that way is because S3Options is configured outside S3FileSystem, so I am thinking to do the same thing here.

Copy link
Member

@pitrou pitrou Nov 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

S3Options is configured in s3fs.h / s3fs.cc. Similarly, HDFS options should be configured in hdfs.h / hdfs.cc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused, if S3Options is configured internally in s3fs.cc why s3fs.h makes both S3Options and Make function which uses S3Options as input ARROW_EXPORT? Doesn’t that means users should create and configure S3Options instance firstly then pass to Make function as parameter? Can you give me a more specific example of how you would like to see ?

@xuechendi
Copy link
Author

@pitrou, I’ve fixed issues you mentioned and passed CI tests, one remaining one is I didn’t return error in the hadoopfilesystem destruction method, maybe you can give me some advice ?

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@pitrou
Copy link
Member

pitrou commented Nov 12, 2019

@xuechendi I'm working on a new PR to replace this one.

@xuechendi
Copy link
Author

@pitrou , I am confused, Does that means this PR won’t be merged? I took quite an effort on this, may I know the reason?

@pitrou
Copy link
Member

pitrou commented Nov 12, 2019

Don't worry @xuechendi, I'm basing my work on yours. It's just a way of getting to completion quicker ;-)

@xuechendi
Copy link
Author

@pitrou, I see, thanks!

@xuechendi
Copy link
Author

This PR has been moved to #5820 by @pitrou .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants