Implement RecordBatch::slice() to slice RecordBatches #460

Closed
alamb opened this issue Jun 16, 2021 · 2 comments · Fixed by #490
Labels
arrow (Changes to the arrow crate) · enhancement (Any new improvement worthy of an entry in the changelog) · good first issue (Good for newcomers)

Comments

alamb (Contributor) commented Jun 16, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When implementing an operator that needed to check for partitions across RecordBatch boundaries (in https://github.com/influxdata/influxdb_iox/pull/1733), I found myself wanting to slice the record batch so that I could keep part of it while sending the rest through.

In addition, I can imagine an operator like RepartitionExec that might want to break large batches into smaller batches when repartitioning.

Describe the solution you'd like
Implement the following function

impl RecordBatch {
    ...
    /// Return a new RecordBatch where each column is sliced
    /// according to `offset` and `length`
    pub fn slice(&self, offset: usize, length: usize) -> Result<RecordBatch> {
    }
}
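
For illustration, a caller that wants to keep the rows before a partition boundary and forward the rest (as in the motivating PR) might use the proposed API like the hypothetical helper below; the helper name and the Result-returning signature are assumptions based on the proposal above, not an existing API:

    use arrow::error::Result as ArrowResult;
    use arrow::record_batch::RecordBatch;

    /// Hypothetical helper: split `batch` into rows `[0, row)` and `[row, num_rows())`,
    /// assuming the `Result`-returning `slice` proposed above.
    fn split_at(batch: &RecordBatch, row: usize) -> ArrowResult<(RecordBatch, RecordBatch)> {
        let head = batch.slice(0, row)?;
        let tail = batch.slice(row, batch.num_rows() - row)?;
        Ok((head, tail))
    }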

There is already an Array::slice method (https://github.com/apache/arrow-rs/blob/master/arrow/src/array/array.rs#L83-L98), so to slice a RecordBatch you can simply apply that function to each column.

Here is a partial implementation (it shows the analogous column-by-column concatenation rather than slicing), from https://github.com/influxdata/influxdb_iox/pull/1733:

    fn concat_record_batches(batches: &[RecordBatch]) -> ArrowResult<RecordBatch> {
        assert!(!batches.is_empty());

        // concatenate them column by column
        let schema = batches[0].schema();
        let num_columns = batches[0].columns().len();
        let new_columns = (0..num_columns)
            .map(|column_index| {
                let old_columns: Vec<_> = batches
                    .iter()
                    .map(|batch| batch.column(column_index).as_ref())
                    .collect();
                arrow::compute::concat(&old_columns)
            })
            .collect::<ArrowResult<Vec<_>>>()?;

        RecordBatch::try_new(schema, new_columns)
    }
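
A slice implementation could follow the same column-by-column pattern, delegating to the existing Array::slice. The sketch below is only an illustration of the approach described above; the free-function name and the Result return type are assumptions, not the final API:

    use arrow::array::{Array, ArrayRef};
    use arrow::error::Result as ArrowResult;
    use arrow::record_batch::RecordBatch;

    /// Sketch: slice every column with `Array::slice` and rebuild the
    /// batch against the original schema.
    fn slice_record_batch(
        batch: &RecordBatch,
        offset: usize,
        length: usize,
    ) -> ArrowResult<RecordBatch> {
        let sliced_columns: Vec<ArrayRef> = batch
            .columns()
            .iter()
            .map(|column| column.slice(offset, length))
            .collect();
        RecordBatch::try_new(batch.schema(), sliced_columns)
    }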

Describe alternatives you've considered

RecordBatch::split is proposed in #343, but slice may be more general (and work better with the existing Rust API ecosystem).

Additional context
Add any other context or screenshots about the feature request here.

alamb added the good first issue, arrow, and enhancement labels on Jun 16, 2021
b41sh (Contributor) commented Jun 18, 2021

Can I take this one?

alamb (Contributor, Author) commented Jun 19, 2021

@b41sh yes please! I will assign it to you
