
Add Schema::project and RecordBatch::project functions #1033

Merged 4 commits into apache:master on Dec 20, 2021

Conversation

@hntd187 (Contributor) commented Dec 12, 2021

Allow Schema and RecordBatch to project schemas on specific columns, returning a new schema with those columns only.

Which issue does this PR close?

Closes #1014.

Rationale for this change

See #1014: a lot of code can be simplified, and this also fixes silent bugs in metadata handling.

What changes are included in this PR?

Two methods, on Schema and on RecordBatch, that allow projecting them onto a subset of their columns.
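
For illustration, a minimal usage sketch (untested; it assumes the &[usize] signature the PR eventually settled on, and the example data is made up):

    use std::sync::Arc;

    use arrow::array::{ArrayRef, Int32Array, StringArray};
    use arrow::record_batch::RecordBatch;

    fn example() -> arrow::error::Result<()> {
        // Build a small batch with three columns.
        let a: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
        let b: ArrayRef = Arc::new(StringArray::from(vec!["x", "y", "z"]));
        let c: ArrayRef = Arc::new(StringArray::from(vec!["d", "e", "f"]));
        let batch = RecordBatch::try_from_iter(vec![("a", a), ("b", b), ("c", c)])?;

        // Project just the schema ...
        let projected_schema = batch.schema().project(&[0, 2])?;
        assert_eq!(projected_schema.fields().len(), 2);

        // ... or the whole batch, columns included.
        let projected_batch = batch.project(&[0, 2])?;
        assert_eq!(projected_batch.num_columns(), 2);
        Ok(())
    }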

Are there any user-facing changes?

@github-actions bot added the arrow (Changes to the arrow crate) label on Dec 12, 2021
@alamb (Contributor) left a comment

Thank you @hntd187 ❤️

This is a great start

Comment on lines 94 to 98
let mut new_fields = vec![];
for i in indices {
    let f = self.fields[i].clone();
    new_fields.push(f);
}
Contributor

I think as written:

  1. This will panic! if the index is not in bounds.
  2. It is not "idiomatic rust style" (which to me means avoiding mut), though this is far less important.

How about something such as (untested):

Suggested change
let mut new_fields = vec![];
for i in indices {
    let f = self.fields[i].clone();
    new_fields.push(f);
}
let new_fields = indices
    .into_iter()
    .map(|i| {
        self.fields
            .get(i)
            .map(|f| f.clone())
            .ok_or_else(|| {
                ArrowError::SchemaError(format!(
                    "project index {} out of bounds, max field {}",
                    i,
                    self.fields().len()
                ))
            })
    })
    .collect::<Result<Vec<_>>>()?;

Note the use of https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get to avoid fields[i], and then the somewhat confusing use of the turbofish .collect::<Result<Vec<_>>>() -- it took me quite a while to get used to that pattern.
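
For anyone new to that pattern, a small standalone sketch (not from this PR; it uses std's two-parameter Result, whereas arrow's Result alias fixes the error type so only one type parameter appears):

    fn parse_all(inputs: &[&str]) -> Result<Vec<i32>, std::num::ParseIntError> {
        inputs
            .iter()
            // Each item maps to a Result; collecting into Result<Vec<_>, _>
            // stops at the first Err and returns it unchanged.
            .map(|s| s.parse::<i32>())
            .collect::<Result<Vec<_>, _>>()
    }

    fn main() {
        assert_eq!(parse_all(&["1", "2", "3"]), Ok(vec![1, 2, 3]));
        assert!(parse_all(&["1", "oops", "3"]).is_err());
    }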

Contributor Author

Yeah, that seems good to me. The for loop was the first thing that popped into my head, but I can't think of any reason it's better than yours.

@alamb (Contributor) Dec 14, 2021

I think the for loop is what one would write in other languages like C/C++, Java, Go, etc. :) It is certainly what I wrote when I started learning Rust.

Then I realized that a big part of how Rust avoids bounds checks while still being safe is the use of the functional style.

        assert_eq!(projected.fields()[0].name(), "name");
        assert_eq!(projected.fields()[1].name(), "priority");
        assert_eq!(projected.metadata.get("meta").unwrap(), "data")
    }
Contributor

Related to the above -- I recommend a test for handling an index that is out of bounds -- like schema.project([2, 3])
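
Something along these lines, for example (an untested sketch; it assumes the final &[usize] signature and that out-of-range indices surface as an Err rather than a panic):

    #[test]
    fn project_out_of_bounds() {
        let schema = Schema::new(vec![
            Field::new("name", DataType::Utf8, false),
            Field::new("priority", DataType::Int32, false),
        ]);

        // Index 3 is past the last field, so the projection should
        // return an error instead of panicking.
        assert!(schema.project(&[2, 3]).is_err());
    }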

Contributor Author

Sure, will do

@@ -175,6 +175,12 @@ impl RecordBatch {
        self.schema.clone()
    }

    /// Projects the schema onto the specified columns
    pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> {
Contributor

The intent of this method was to project the RecordBatch rather than just the schema:

A signature like this:

Suggested change
pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> {
pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<RecordBatch> {

(so we would also have to project the columns as well as the schema)
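
A rough, untested sketch of the shape that could take (it reuses the indices for both the schema and the columns, which is where the Clone bound discussed further down comes from; it is not the exact committed code):

    /// Projects the RecordBatch onto the specified columns
    pub fn project(
        &self,
        indices: impl IntoIterator<Item = usize> + Clone,
    ) -> Result<RecordBatch> {
        // Project the schema first; out-of-range indices error here, so
        // indexing the columns below cannot panic for a valid batch.
        let projected_schema = self.schema.project(indices.clone())?;

        // Then pick the matching columns in the same order.
        let projected_columns = indices
            .into_iter()
            .map(|i| self.columns[i].clone())
            .collect::<Vec<_>>();

        RecordBatch::try_new(SchemaRef::new(projected_schema), projected_columns)
    }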

Contributor Author

Ahh, I thought this part was a bit too easy. Okay, I'll update to reflect that.

@hntd187 (Contributor Author) commented Dec 14, 2021

@alamb Regarding impl IntoIterator<Item=usize>: I wanted to reuse this for the schema projection as well, so I had to add impl IntoIterator<Item=usize> + Clone to it for RecordBatch. This doesn't seem immediately right to me since the two methods end up with different arguments, but it works.

@alamb (Contributor) commented Dec 14, 2021

> @alamb Regarding impl IntoIterator<Item=usize>: I wanted to reuse this for the schema projection as well, so I had to add impl IntoIterator<Item=usize> + Clone to it for RecordBatch. This doesn't seem immediately right to me since the two methods end up with different arguments, but it works.

It looks like the new code may not yet have been pushed to GitHub.

@codecov-commenter commented Dec 14, 2021

Codecov Report

Merging #1033 (8ade651) into master (239cba1) will decrease coverage by 0.06%.
The diff coverage is 89.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1033      +/-   ##
==========================================
- Coverage   82.31%   82.25%   -0.07%     
==========================================
  Files         168      168              
  Lines       49031    49197     +166     
==========================================
+ Hits        40360    40465     +105     
- Misses       8671     8732      +61     
Impacted Files Coverage Δ
arrow/src/record_batch.rs 91.97% <80.00%> (-0.68%) ⬇️
arrow/src/datatypes/schema.rs 72.95% <94.44%> (+6.28%) ⬆️
arrow/src/datatypes/native.rs 66.66% <0.00%> (-6.25%) ⬇️
parquet/src/arrow/record_reader.rs 92.77% <0.00%> (-0.96%) ⬇️
arrow/src/array/ord.rs 67.15% <0.00%> (-0.50%) ⬇️
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (-0.46%) ⬇️
arrow/src/util/integration_util.rs 68.66% <0.00%> (-0.42%) ⬇️
arrow/src/array/data.rs 80.85% <0.00%> (-0.32%) ⬇️
arrow/src/array/array.rs 85.25% <0.00%> (-0.21%) ⬇️
arrow/src/compute/kernels/partition.rs 97.45% <0.00%> (-0.21%) ⬇️
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 239cba1...8ade651. Read the comment docs.

@alamb (Contributor) left a comment

Thanks for sticking with this @hntd187


        RecordBatch::try_new(SchemaRef::new(projected_schema), batch_fields)
    }

Contributor

How about some tests?

Perhaps something like

    #[test]
    fn project() {
        let a: ArrayRef = Arc::new(Int32Array::from(vec![
            Some(1),
            None,
            Some(3),
        ]));
        let b: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "c"]));
        let c: ArrayRef = Arc::new(StringArray::from(vec!["d", "e", "f"]));

        let record_batch = RecordBatch::try_from_iter(vec![("a", a.clone()), ("b", b.clone()), ("c", c.clone())])
            .expect("valid conversion");

        let expected = RecordBatch::try_from_iter(vec![("a", a), ("c", c)])
            .expect("valid conversion");

        assert_eq!(expected, record_batch.project(&vec![0, 2]).unwrap());
    }

        &self,
        indices: impl IntoIterator<Item = usize> + Clone,
    ) -> Result<RecordBatch> {
        let projected_schema = self.schema.project(indices.clone())?;
Contributor

I see now why you needed to make the iterator Clone, which is kind of annoying 🤔

@@ -87,6 +87,24 @@ impl Schema {
        Self { fields, metadata }
    }

    /// Returns a new schema with only the specified columns in the new schema
    /// This carries metadata from the parent schema over as well
    pub fn project(&self, indices: impl IntoIterator<Item = usize>) -> Result<Schema> {
Contributor

I know I did something different in the ticket, but I think this interface is kind of annoying.

Namely, I couldn't pass in &vec![1, 2]

   --> arrow/src/datatypes/schema.rs:405:40
    |
405 |         let projected: Schema = schema.project(&vec![0, 2]).unwrap();
    |                                        ^^^^^^^ expected `&{integer}`, found `usize`

What would you think about being less fancy and changing this (and RecordBatch) to something like:

    pub fn project(&self, indices: &[usize]) -> Result<Schema> {

Which would then avoid the need for the clone on RecordBatch::project as well
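
For illustration, an untested sketch of Schema::project with the less generic signature (roughly what the PR switched to in a later commit; details are approximate):

    /// Returns a new schema containing only the specified columns;
    /// metadata is carried over from the parent schema.
    pub fn project(&self, indices: &[usize]) -> Result<Schema> {
        let new_fields = indices
            .iter()
            .map(|i| {
                self.fields.get(*i).cloned().ok_or_else(|| {
                    ArrowError::SchemaError(format!(
                        "project index {} out of bounds, max field {}",
                        i,
                        self.fields().len()
                    ))
                })
            })
            .collect::<Result<Vec<_>>>()?;

        Ok(Self::new_with_metadata(new_fields, self.metadata.clone()))
    }

Call sites can then pass a slice or a borrowed Vec directly, e.g. schema.project(&[0, 2]) or schema.project(&vec![0, 2]), and RecordBatch::project no longer needs the Clone bound on its indices.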

@alamb (Contributor) left a comment

Looks good -- thank you @hntd187

@alamb (Contributor) commented Dec 20, 2021

@hntd187 there were some fmt and clippy errors on this PR; I have pushed a fix in 8ade651

@alamb changed the title from "Projection on Schema and RecordBatch" to "Add Schema::project and RecordBatch::project functions" on Dec 20, 2021
@alamb added the enhancement (Any new improvement worthy of an entry in the changelog) label on Dec 20, 2021
@alamb merged commit f3e452c into apache:master on Dec 20, 2021
@hntd187 (Contributor Author) commented Dec 20, 2021

> @hntd187 there were some fmt and clippy errors on this PR; I have pushed a fix in 8ade651

Oh, thank you very much, I appreciate that!

alamb added a commit that referenced this pull request Dec 21, 2021
* Allow Schema and RecordBatch to project schemas on specific columns returning a new schema with those columns only

* Addressing PR updates and adding a test for out of range projection

* switch to &[usize]

* fix: clippy and fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb added a commit that referenced this pull request Dec 22, 2021
* Allow Schema and RecordBatch to project schemas on specific columns returning a new schema with those columns only

* Addressing PR updates and adding a test for out of range projection

* switch to &[usize]

* fix: clippy and fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Stephen Carman <hntd187@users.noreply.github.com>
Labels
arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog)

Development
Successfully merging this pull request may close these issues: Add Schema::project and RecordBatch project function to project / select a subset of columns

3 participants