
[Ruby] Add support for loading table by Arrow Dataset #18794

Closed
asfimport opened this issue Aug 20, 2021 · 22 comments

asfimport commented Aug 20, 2021

Reporter: Kouhei Sutou / @kou
Assignee: Kouhei Sutou / @kou

Note: This issue was originally created as ARROW-13687. Please see the migration documentation for further details.

Kouhei Sutou / @kou:
Issue resolved by pull request #10970

Kanstantsin Ilchanka / @simpl1g:
@kou can you please provide an example of how to read a file/folder from S3? I can't find anything related to S3 in the Ruby code, and I would like to add it to the examples.

Also, doing this causes a segfault on version 5.0.0:

Arrow::S3FileSystem.new

Kouhei Sutou / @kou:
https://slide.rabbit-shocker.org/authors/kou/rubykaigi-takeout-2021/?page=32


Arrow::Table.load("s3://bucket/path.arrow")

Kouhei Sutou / @kou:
If you want to test this with MinIO, you need to add an endpoint_override=http://127.0.0.1:... query string, such as Arrow::Table.load("s3://bucket/path.arrow?endpoint_override=http://...").
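For example, a minimal sketch of testing against a local MinIO server; the bucket name, the port 9000, and the minioadmin/minioadmin credentials are placeholders (MinIO defaults), and the user:password credential syntax in the URI is discussed later in this thread:

require "arrow-dataset"

# Hypothetical local MinIO endpoint; credentials go in the user:password
# part of the URI, the endpoint in the endpoint_override query parameter.
table = Arrow::Table.load(
  "s3://minioadmin:minioadmin@example-bucket/path.arrow?endpoint_override=http://127.0.0.1:9000"
)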

Kanstantsin Ilchanka / @simpl1g:
I tried to run it, but I got an error:

ruby-2.7.4/gems/gobject-introspection-3.4.9/lib/gobject-introspection/loader.rb:616:in `invoke': [file-system-dataset-factory][set-file-system-uri]: NotImplemented: Got S3 URI but Arrow compiled without S3 support (Arrow::Error::NotImplemented)

Kouhei Sutou / @kou:
You need Apache Arrow C++ built with S3 support enabled, as the error message says.
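If it helps, here is a rough way to check from Ruby whether the installed Arrow C++ has S3 support. This is just a sketch that relies on the Arrow::Error::NotImplemented error above; the bucket name is a placeholder:

require "arrow-dataset"

begin
  # With an S3-less build this fails while resolving the filesystem from
  # the URI, raising Arrow::Error::NotImplemented before any network access.
  Arrow::Table.load("s3://example-bucket/example.arrow")
rescue Arrow::Error::NotImplemented => error
  puts "Built without S3 support: #{error.message}"
rescue Arrow::Error => error
  # Any other Arrow error (missing bucket, credentials, ...) means S3
  # support itself is compiled in.
  puts "S3 support is available; the dummy request failed with: #{error.message}"
end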

Kanstantsin Ilchanka / @simpl1g:
So I can't use the Homebrew version and should build from source? As far as I understand, I need to have aws-sdk-cpp available and the ARROW_S3 option enabled, is that correct? Or is there some documentation I can read?

Kouhei Sutou / @kou:
You can use the Homebrew version once you send a pull request to https://github.com/homebrew/homebrew-core to enable S3 support.

The following patch will work:


diff --git a/Formula/apache-arrow.rb b/Formula/apache-arrow.rb
index 4983e512380..14870af7b7f 100644
--- a/Formula/apache-arrow.rb
+++ b/Formula/apache-arrow.rb
@@ -15,6 +15,7 @@ class ApacheArrow < Formula
     sha256 cellar: :any_skip_relocation, x86_64_linux:  "ca1305cb5335250312a597d489717bb8066ef1ac5343ab9dfd59349b3eadfdf5"
   end
 
+  depends_on "aws-sdk-cpp"
   depends_on "boost" => :build
   depends_on "cmake" => :build
   depends_on "llvm" => :build
@@ -55,6 +56,7 @@ class ApacheArrow < Formula
       -DARROW_PLASMA=ON
       -DARROW_PROTOBUF_USE_SHARED=ON
       -DARROW_PYTHON=ON
+      -DARROW_S3=ON
       -DARROW_WITH_BZ2=ON
       -DARROW_WITH_ZLIB=ON
       -DARROW_WITH_ZSTD=ON

Kanstantsin Ilchanka / @simpl1g:
I built apache-arrow locally and it works with S3, thanks! However, is it possible to update the brew formula without updating its version? Is it safe to use --force?

Questions:

  • How can I pass access_token/secret_key so that I can access private files?

  • How can I read/write by partitions?

Also, I did some testing; here are the problems that I found:

  • Speed is very slow. It took almost 1 hour to download a 400 MB file through S3, compared to 15 seconds via plain Net::HTTP. For small files the difference is not so huge. Here is a benchmark with a small file. Maybe it is somehow connected to the fact that I tested with a custom brew build?

    require 'arrow-dataset'
    require 'net/http'
    require 'benchmark/ips'

    s3_uri = URI("s3://simpl1g-example/correct.csv")
    http_uri = URI("https://simpl1g-example.s3.eu-central-1.amazonaws.com/correct.csv")

    Benchmark.ips do |x|
      x.report('S3') { Arrow::Table.load(s3_uri) }
      x.report('Http') { Arrow::Table.load(Arrow::Buffer.new(Net::HTTP.get(http_uri)), format: :csv) }
      x.compare!
    end

    Comparison:
            Http:        9.6 i/s
              S3:        4.9 i/s - 1.97x  slower
  • Not sure if it is a real problem, but I can't cancel downloading big objects; the process is stuck until the download finishes, which now takes hours (I guess because of the slow read), and I can only kill -9 the process.

    Arrow::Table.load(URI("s3://big-parquet-file.parquet"))

  • Doing an S3 call doesn't work the same as for a local file. I have a TSV file with a .csv extension. Parsing the local file works fine; on S3 it fails:

    # Works fine
    Arrow::Table.load("file.csv", delimiter: "\t")
    Arrow::Table.load("file.csv", format: :tsv)

    # Fails
    Arrow::Table.load(URI("s3://simpl1g-example/file.csv"), delimiter: "\t")
    gobject-introspection-3.4.9/lib/gobject-introspection/loader.rb:616:in `invoke': [file-system-dataset-factory][finish]: Invalid: Error creating dataset. Could not read schema from 'simpl1g-example/file.csv': Could not open CSV input source 'simpl1g-example/file.csv': Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: 6	18	iPhone9,2	1635840547. Is this a 'csv' file? (Arrow::Error::Invalid)

    Arrow::Table.load(URI("s3://simpl1g-example/file.csv"), format: :tsv)
    Traceback (most recent call last):
    	23: from bin/console:5:in `<main>'
    	 6: from (irb):22:in `<main>'
    	 5: from /Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-6.0.0/lib/arrow/table.rb:29:in `load'
    	 4: from /Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-6.0.0/lib/arrow/table-loader.rb:24:in `load'
    	 3: from /Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-6.0.0/lib/arrow/table-loader.rb:56:in `load'
    	 2: from /Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-dataset-6.0.0/lib/arrow-dataset/arrow-table-loadable.rb:35:in `load_from_uri'
    	 1: from /Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-dataset-6.0.0/lib/arrow-dataset/arrow-table-loadable.rb:39:in `internal_load_from_uri'
    /Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-dataset-6.0.0/lib/arrow-dataset/file-format.rb:39:in `resolve': undefined method `[]' for nil:NilClass (NoMethodError)

Kouhei Sutou / @kou:
The Homebrew formula supports a revision field for updating within the same version:


diff --git a/Formula/apache-arrow.rb b/Formula/apache-arrow.rb
index 4983e512380..901692d5a35 100644
--- a/Formula/apache-arrow.rb
+++ b/Formula/apache-arrow.rb
@@ -5,6 +5,7 @@ class ApacheArrow < Formula
   mirror "https://archive.apache.org/dist/arrow/arrow-6.0.0/apache-arrow-6.0.0.tar.gz"
   sha256 "69d268f9e82d3ebef595ad1bdc83d4cb02b20c181946a68631f6645d7c1f7a90"
   license "Apache-2.0"
+  revision 1
   head "https://github.com/apache/arrow.git", branch: "master"
 
   bottle do

Could you send the changes you tried to Homebrew?

Kouhei Sutou / @kou:
Thanks.

How can I pass access_token/secret_key so that I can access private files?

You can use the general user/password syntax in the URI: s3://#{USER}:#{PASSWORD}@bucket/path
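For example, a sketch of that; the environment variable names and the bucket/path are placeholders, and keys containing URI-reserved characters such as / need percent-encoding, as discussed later in this thread:

require "arrow-dataset"

# Hypothetical credentials; the S3 filesystem reads the access key ID and
# secret access key from the user:password part of the URI.
access_key = ENV["AWS_ACCESS_KEY_ID"]
secret_key = ENV["AWS_SECRET_ACCESS_KEY"]
table = Arrow::Table.load(URI("s3://#{access_key}:#{secret_key}@example-bucket/data.arrow"))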

Kouhei Sutou / @kou:

How can I read/write by partitions?

Convenient APIs (Arrow::Table#save/Arrow::Table.load) for them aren't implemented yet. Could you open a Jira issue for this?

Kanstantsin Ilchanka / @simpl1g:

 Convenient APIs (Arrow::Table#save/Arrow::Table.load) for them aren't implemented yet. Could you open a Jira issue for this?

https://issues.apache.org/jira/browse/ARROW-14604

Kanstantsin Ilchanka / @simpl1g:

You can use the general user/password syntax in the URI: s3://#{USER}:#{PASSWORD}@bucket/path

I'm not sure that this is valid for S3; it requires a different kind of authentication.

I expect something like PyArrow's S3FileSystem (cdef class S3FileSystem(FileSystem)):

s3_fs = Arrow::S3FileSystem.new(access_key: 'key', secret_key: 'key', region: 'region')
table = Arrow::Table.load(URI("s3://bucket/path"), filesystem: s3_fs)

Kouhei Sutou / @kou:
Did you try it?

S3FileSystem uses the user/password information in the URI if it is present:

const auto username = uri.username();
if (!username.empty()) {
  options.ConfigureAccessKey(username, uri.password());
} else {
  options.ConfigureDefaultCredentials();
}

Kanstantsin Ilchanka / @simpl1g:
Thanks, your code helped. It should be s3://#{ACCESS_KEY}:#{SECRET_KEY}@bucket/path.

However, the secret key can contain /, for example, and then it is an error; I have such a case in one of our production buckets:

URI("s3://acces_key:secret/key@bucket/file.csv")
# rfc3986_parser.rb:67:in `split': bad URI(is not URI?)

Kouhei Sutou / @kou:
You can use percent-encoding for it. You can use CGI.escape in cgi/util for it: URI("s3://#{CGI.escape("access_key")}:#{CGI.escape("secret/key")}@bucket/file.csv")
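Put together as a runnable sketch (the keys shown are placeholders):

require "cgi/util"
require "arrow-dataset"

# CGI.escape percent-encodes characters such as "/" so the credentials
# stay valid inside the URI.
access_key = CGI.escape("access_key")
secret_key = CGI.escape("secret/key")
table = Arrow::Table.load(URI("s3://#{access_key}:#{secret_key}@bucket/file.csv"))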

Kouhei Sutou / @kou:

Speed is very slow.

If you use "s3://..." locally, the initialization process takes a long time because of a timeout. The AWS_EC2_METADATA_DISABLED=true environment variable will help you.

See also: aws/aws-sdk-cpp#1410
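For example, one way to set it from Ruby itself (a sketch; it assumes the variable is set before the first S3 access, when the AWS SDK client is initialized):

# Disable the EC2 instance-metadata lookup that otherwise times out when
# running outside of EC2; this must happen before the first S3 request.
ENV["AWS_EC2_METADATA_DISABLED"] = "true"

require "arrow-dataset"
table = Arrow::Table.load("s3://bucket/path.arrow")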

Kanstantsin Ilchanka / @simpl1g:
Thanks, it worked, though it is not very intuitive. I'll add an example to the README.

Kanstantsin Ilchanka / @simpl1g:
Have you tried to run the benchmark script? Can you reproduce the performance problems?
