Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ballista does not support external file systems #10

Closed
ZhangqyTJ opened this issue Dec 8, 2021 · 6 comments · Fixed by #290
Closed

Ballista does not support external file systems #10

ZhangqyTJ opened this issue Dec 8, 2021 · 6 comments · Fixed by #290
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@ZhangqyTJ
Copy link

ZhangqyTJ commented Dec 8, 2021

I added the s3 (minio_store) module in datafusion/src/datasource/object_store, and registered the minio_store in benchmarks/tpch.rs through the register_object_store() method of ExecutionContext. But when I start the Scheduler and Executor, and then run "cargo run --bin tpch --release****", the data in minio cannot be read.
After checking the code, I found that LocalFileSystem is used directly at ballista/rust/core/src/serde/physical_plan/from_proto.rs(789) and ballista/rust/core/src/serde/logical_plan/from_proto.rs(201), so I modified these two codes to minio_store and it ran successfully.
How to make Ballista support external file system?

The project address after I added minio_store
https://github.com/ZhangqyTJ/arrow-datafusion.git
Modify the code before running
ballista/rust/core/src/serde/physical_plan/from_proto.rs(789)
ballista/rust/core/src/serde/logical_plan/from_proto.rs(201)
Run command
To run the scheduler from source:

cd $ARROW_HOME/ballista/rust/scheduler
RUST_LOG=info cargo run --release

By default the scheduler will bind to 0.0.0.0 and listen on port 50050.

To run the executor from source:

cd $ARROW_HOME/ballista/rust/executor
RUST_LOG=info cargo run --release

To run the benchmarks:

    cargo run --bin tpch --release benchmark ballista --host localhost --port 50050 --query 1 --partitions 1 --path s3://test1/tpch_tbl/cutdata --format tbl --storage-type minio --endpoint 192.168.75.81:9091 --username minioadmin --password minioadmin --bucket test1 
@ZhangqyTJ ZhangqyTJ added the bug Something isn't working label Dec 8, 2021
@houqp
Copy link
Member

houqp commented Dec 8, 2021

Hi @ZhangqyTJ are you right that right now ballista is hardcoded to use local file system object store. Supporting remote file system is in our roadmap and we have all the infrastructure in place to execute on this now thanks to the object store abstraction. As you already mentioned, the main change that needs to be introduced is the plan ser/de layer. You are more than welcome to work on this if you would like to. I don't think anyone is actively working on this at the moment.

@houqp
Copy link
Member

houqp commented Dec 8, 2021

We had some prior discussions around this subject in apache/datafusion#1072

@houqp houqp added enhancement New feature or request help wanted Extra attention is needed and removed bug Something isn't working labels Dec 8, 2021
@ZhangqyTJ
Copy link
Author

I will try to solve this issue

@ZhangqyTJ
Copy link
Author

I think I have resolved this issue:
(1) Add the'object_store_str' field in ListingTableScanNode to store the serialized object_store
(2) Serialize object_store in to_proto.rs, deserialize in from_proto.rs, and perform type matching through ObjectStoreEnum
But I found a new issue. When I tested reading minio's .tbl data, it was successful, but reading the .parquet file gave an error: ParquetError("Parquet error: underlying IO error: failed to fill whole buffer")'. I don't know the reason for the error

@houqp
Copy link
Member

houqp commented Dec 14, 2021

Could you share a reproducible code example with us? It would be really hard for us to locate the problem with a generic error message :)

@ZhangqyTJ
Copy link
Author

ZhangqyTJ commented Dec 14, 2021

Project address: https://github.com/ZhangqyTJ/arrow-datafusion.git
Test code file: datafusion/src/physical_plan/file_format/parquet.rs
Test function: test_minio_parquet()
Steps:
(1) Build a minio service: https://docs.min.io/docs/minio-quickstart-guide.html
(2) Create a bucket in minio:
Access with browser http://ip:port (default user name and password: minioadmin)
click 'Object Orowser' -> 'Create bucket'
(3) Modify the parameters in test_minio_parquet(): s3_options, data_rows
(4) Execute test_minio_parquet()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants