Add Builder interface for adding Arrays to record batches #210

Closed
alamb opened this issue Apr 26, 2021 · 0 comments · Fixed by #7
Labels
arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog)

Comments

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-12411

Use case:

While writing tests (both in IOx and in DataFusion) where I need a single RecordBatch, I often find myself doing something like this:

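        use std::sync::Arc;
        use arrow::array::{ArrayRef, Float64Array, Int64Array};
        use arrow::datatypes::{DataType as ArrowDataType, Field as ArrowField, Schema};
        use arrow::record_batch::RecordBatch;

        let schema = Arc::new(Schema::new(vec![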
            ArrowField::new("float_field", ArrowDataType::Float64, true),
            ArrowField::new("time", ArrowDataType::Int64, true),
        ]));

        let float_array: ArrayRef = Arc::new(Float64Array::from(vec![10.1, 20.1, 30.1, 40.1]));
        let timestamp_array: ArrayRef = Arc::new(Int64Array::from(vec![1000, 2000, 3000, 4000]));

        let batch = RecordBatch::try_new(schema, vec![float_array, timestamp_array])
            .expect("created new record batch");

This is annoying because the information that float_field is a float is encoded both in the Schema and in the Float64Array.

I would much rather be able to construct RecordBatches in a builder style, which avoids the redundancy and reduces the amount of typing:


        let float_array: ArrayRef = Arc::new(Float64Array::from(vec![10.1, 20.1, 30.1, 40.1]));
        let timestamp_array: ArrayRef = Arc::new(Int64Array::from(vec![1000, 2000, 3000, 4000]));

        let batch = RecordBatch::empty()
          .append("float_field", float_array).unwrap()
          .append("time", timestamp_array).unwrap();

The proposal is to add a method to RecordBatch like:

impl RecordBatch {
...
  fn append(self, field_name: &str, field_values: ArrayRef) -> Result<Self>
}

This would append a field named field_name to the current schema, returning an error if field_name is already present.

The nullability of the field would be set based on the actual null count of field_values.
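
To make the intended semantics concrete, here is a minimal sketch that uses only the standard library. Column, DataType, Field and Batch are hypothetical stand-ins for Arrow's arrays, data types, fields and RecordBatch; they are not the arrow crate's types. The length check is an added assumption, on the grounds that every column in a batch must have the same number of rows.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow data type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Float64,
    Int64,
}

/// Hypothetical stand-in for an Arrow array: a typed column with optional values.
#[derive(Debug, Clone, PartialEq)]
pub enum Column {
    Float64(Vec<Option<f64>>),
    Int64(Vec<Option<i64>>),
}

impl Column {
    pub fn len(&self) -> usize {
        match self {
            Column::Float64(v) => v.len(),
            Column::Int64(v) => v.len(),
        }
    }

    pub fn null_count(&self) -> usize {
        match self {
            Column::Float64(v) => v.iter().filter(|x| x.is_none()).count(),
            Column::Int64(v) => v.iter().filter(|x| x.is_none()).count(),
        }
    }

    pub fn data_type(&self) -> DataType {
        match self {
            Column::Float64(_) => DataType::Float64,
            Column::Int64(_) => DataType::Int64,
        }
    }
}

/// Hypothetical stand-in for a schema field.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

/// Hypothetical stand-in for RecordBatch: a schema plus one column per field.
#[derive(Debug, Default)]
pub struct Batch {
    pub fields: Vec<Field>,
    pub columns: Vec<Arc<Column>>,
}

impl Batch {
    /// Start with no fields and no columns.
    pub fn empty() -> Self {
        Self::default()
    }

    /// Append a column, deriving the field's type and nullability from the values.
    pub fn append(mut self, field_name: &str, field_values: Arc<Column>) -> Result<Self, String> {
        // Reject duplicate field names, as the proposal describes.
        if self.fields.iter().any(|f| f.name == field_name) {
            return Err(format!("field '{}' is already present", field_name));
        }
        // Assumption: all columns in a batch must have the same length.
        if let Some(first) = self.columns.first() {
            if first.len() != field_values.len() {
                return Err(format!(
                    "field '{}' has {} rows, expected {}",
                    field_name,
                    field_values.len(),
                    first.len()
                ));
            }
        }
        self.fields.push(Field {
            name: field_name.to_string(),
            data_type: field_values.data_type(),
            // Nullable only if the values actually contain nulls.
            nullable: field_values.null_count() > 0,
        });
        self.columns.push(field_values);
        Ok(self)
    }
}
```

Because append takes self by value and returns Result<Self>, calls chain in the builder style shown earlier, and the data type is stated only once, by the values themselves.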
