[Rust] Question and Request for Examples of Array Operations #22940

Closed
asfimport opened this issue Sep 17, 2019 · 4 comments

@asfimport

Hi all, thank you for your excellent work on Arrow.

While going through the read_csv example for the Rust Arrow implementation ( https://github.com/apache/arrow/blob/master/rust/arrow/examples/read_csv.rs ), along with the generated Rustdocs and the unit tests, I could not tell what the intended usage is for operations such as filtering and masking over Arrays.

One particular use-case I'm interested in is finding all values x in an Array such that x >= N. I came across arrow::compute::array_ops::filter, which seems to be similar to what I want, although it expects a mask to already be constructed before performing the filter operation. It was also not easy to find in the documentation, leading me to believe this might not be idiomatic usage.

More generally, is the expectation for Arrays on the Rust side that they are just simple data abstractions, without exposing higher-order methods such as filtering/masking? Is the intent to leave that to users? If I missed some piece of documentation, please let me know. For my use-case I ended up trying something like:

let column = batch.column(0).as_any().downcast_ref::<Float64Array>().unwrap();
let mut builder = BooleanBuilder::new(batch.num_rows());
let N = 5.0;
for i in 0..batch.num_rows() {
   // value(i) returns the f64 itself, so there is nothing to unwrap.
   // Use >= to match the stated condition x >= N.
   builder.append_value(column.value(i) >= N).unwrap();
}

let mask = builder.finish();
// filter takes the mask by reference and returns a Result.
let filtered_column = filter(column, &mask).unwrap();

If possible, could you provide examples of intended usage of Arrays? Thank you!

Reporter: [DELETED]

Note: This issue was originally created as ARROW-6583. Please see the migration documentation for further details.

@asfimport

Andy Grove / @andygrove:
Hi Arthur,

Your approach looks functionally correct, but rather than building a boolean array and then calling the filter method, it might be simpler and more efficient to build the new Float64Array directly in your code. It depends on your use case, I guess.
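
A minimal sketch of the two approaches side by side. It deliberately avoids the arrow crate, whose builder signatures have changed between releases, and instead models a nullable Float64 column as a slice of `Option<f64>`. The names `build_mask`, `filter_with_mask` and `filter_direct` are hypothetical and not part of Arrow.

```rust
/// Two-pass approach, step 1: build a boolean mask.
/// A null (None) never satisfies the predicate, so its mask entry is false.
fn build_mask(column: &[Option<f64>], n: f64) -> Vec<bool> {
    column.iter().map(|v| matches!(v, Some(x) if *x >= n)).collect()
}

/// Two-pass approach, step 2: keep the entries whose mask entry is true.
fn filter_with_mask(column: &[Option<f64>], mask: &[bool]) -> Vec<Option<f64>> {
    column
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}

/// Single-pass approach: push matching values straight into the output.
fn filter_direct(column: &[Option<f64>], n: f64) -> Vec<Option<f64>> {
    column
        .iter()
        .copied()
        .filter(|v| matches!(v, Some(x) if *x >= n))
        .collect()
}

fn main() {
    let column = [Some(1.0), None, Some(5.0), Some(7.5)];
    let mask = build_mask(&column, 5.0);
    let two_pass = filter_with_mask(&column, &mask);
    let one_pass = filter_direct(&column, 5.0);
    // Both approaches keep the same values.
    assert_eq!(two_pass, one_pass);
    println!("{:?}", one_pass);
}
```

The single-pass version skips the intermediate mask allocation. The mask version is the better fit when the same mask has to be applied to several columns of a batch.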

You might also be interested in looking at the code in rust/arrow/src/compute/kernels/comparison.rs  where there are methods that take advantage of SIMD for comparing arrays (but not for comparing arrays to literals yet). For example we have >= implemented with this method:

/// Perform `left >= right` operation on two arrays. Non-null values are greater than null
/// values.
pub fn gt_eq<T>(
    left: &PrimitiveArray<T>,
    right: &PrimitiveArray<T>,
) -> Result<BooleanArray>
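
To make the element-wise semantics concrete, here is a small std-only model of such a kernel. It is not Arrow's implementation: `gt_eq_model` and `gt_eq_scalar_model` are hypothetical names, columns are modeled as `Option<f64>` slices, the null ordering simply follows the doc comment quoted above (an assumption about the real kernel's behavior), and the SIMD path is left out. The second function shows one way to compare against a literal when only an array-to-array kernel is available: repeat the literal into a column of matching length.

```rust
/// Element-wise `left >= right` over two equal-length nullable columns.
/// Following the quoted doc comment, a non-null value is treated as greater
/// than a null one; two nulls compare as equal (assumption of this sketch).
fn gt_eq_model(left: &[Option<f64>], right: &[Option<f64>]) -> Result<Vec<bool>, String> {
    if left.len() != right.len() {
        return Err(format!("length mismatch: {} vs {}", left.len(), right.len()));
    }
    Ok(left
        .iter()
        .zip(right)
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => a >= b,
            (Some(_), None) => true,
            (None, Some(_)) => false,
            (None, None) => true,
        })
        .collect())
}

/// Compare a column against a literal by broadcasting the literal first.
fn gt_eq_scalar_model(left: &[Option<f64>], literal: f64) -> Result<Vec<bool>, String> {
    let right = vec![Some(literal); left.len()];
    gt_eq_model(left, &right)
}

fn main() {
    let column = [Some(3.0), None, Some(5.0), Some(8.0)];
    let mask = gt_eq_scalar_model(&column, 5.0).unwrap();
    assert_eq!(mask, vec![false, false, true, true]);
}
```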

To answer your last question, I would say the goals of the Rust project are:

  1. Allow interop with other Arrow implementations

  2. Provide efficient compute kernels for various operations (some basic ones exist already but I think more will be added over time)

In addition to the core Arrow implementation in Rust, there is also the DataFusion crate, which is implementing a SQL query engine using Arrow, supporting query execution against CSV and Parquet files.

I hope that helps.

@asfimport

[DELETED]:
@andygrove thanks for the response, it certainly helps. I'll take a look inside the compute kernels, and if I come up with something reusable, I may send over a PR following the contributor guidelines.

I have also looked over DataFusion. I don't have really complex queries, mostly I just need to: traverse a column at a time from a RecordBatch, compute some summary statistics, and return offsets that subdivide the Array to do some further computation. The ability to represent missing values is one of the reasons I chose to build on this library.
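
For the column-at-a-time statistics described above, a sketch along these lines may be a starting point. It is a std-only model rather than Arrow code: the column is an `Option<f64>` slice standing in for a nullable Float64 array, and `Summary` and `summarize` are hypothetical names. Missing values are skipped and counted separately. It covers only the summary-statistics step.

```rust
#[derive(Debug, PartialEq)]
struct Summary {
    count: usize,      // number of non-null values
    null_count: usize, // number of missing values
    min: f64,
    max: f64,
    mean: f64,
}

/// Summarize one nullable column, ignoring missing values.
/// Returns None when the column has no non-null values.
fn summarize(column: &[Option<f64>]) -> Option<Summary> {
    let mut count = 0usize;
    let mut sum = 0.0;
    let mut min = f64::INFINITY;
    let mut max = f64::NEG_INFINITY;
    // flatten() yields only the Some(..) entries.
    for v in column.iter().flatten() {
        count += 1;
        sum += v;
        min = min.min(*v);
        max = max.max(*v);
    }
    if count == 0 {
        return None;
    }
    Some(Summary {
        count,
        null_count: column.len() - count,
        min,
        max,
        mean: sum / count as f64,
    })
}

fn main() {
    let column = [Some(2.0), None, Some(4.0), Some(6.0)];
    let s = summarize(&column).unwrap();
    assert_eq!((s.count, s.null_count), (3, 1));
    assert_eq!((s.min, s.max, s.mean), (2.0, 6.0, 4.0));
}
```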

I believe the statement of goals is very helpful, and I suggest perhaps placing it in the README or generated docs in case another user might benefit from it.

I'll return to the mailing list for further questions if need be. Also thanks to Wes as well for earlier direction.

@asfimport

Neville Dipale / @nevi-me:
Hi [~arthur_mac], I'll add some more examples or write up some educational blog posts when we approach 1.0. If you've had more questions or suggestions since then that relate to your original question, please feel free to add them here (or on the user mailing list).

@asfimport

Andrew Lamb / @alamb:
Migrated to github: apache/arrow-rs#52
