
Add Schema::project and RecordBatch::project functions #1033

Merged 4 commits into apache:master on Dec 20, 2021

Conversation

@hntd187 (Contributor) commented Dec 12, 2021

Allow Schema and RecordBatch to project schemas on specific columns, returning a new schema with those columns only.

Which issue does this PR close?

Closes #1014.

Rationale for this change

See #1014: a lot of code can be simplified, and this also fixes silent bugs in metadata handling.

What changes are included in this PR?

Two methods, on Schema and on RecordBatch, that allow projecting them onto a subset of their columns.
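
For illustration, a minimal usage sketch (untested; it assumes the &[usize] signature the PR eventually settled on, and the example data is made up):

    use std::sync::Arc;

    use arrow::array::{ArrayRef, Int32Array, StringArray};
    use arrow::record_batch::RecordBatch;

    fn example() -> arrow::error::Result<()> {
        // Build a small batch with three columns.
        let a: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
        let b: ArrayRef = Arc::new(StringArray::from(vec!["x", "y", "z"]));
        let c: ArrayRef = Arc::new(StringArray::from(vec!["d", "e", "f"]));
        let batch = RecordBatch::try_from_iter(vec![("a", a), ("b", b), ("c", c)])?;

        // Project just the schema ...
        let projected_schema = batch.schema().project(&[0, 2])?;
        assert_eq!(projected_schema.fields().len(), 2);

        // ... or the whole batch, columns included.
        let projected_batch = batch.project(&[0, 2])?;
        assert_eq!(projected_batch.num_columns(), 2);
        Ok(())
    }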

Are there any user-facing changes?

@github-actions bot added the arrow (Changes to the arrow crate) label on Dec 12, 2021
@alamb (Contributor) left a comment

Thank you @hntd187 ❤️

This is a great start

Comment on lines 94 to 98
let mut new_fields = vec![];
for i in indices {
    let f = self.fields[i].clone();
    new_fields.push(f);
}
Contributor

I think as written:

  1. This will panic! if the index is not in bounds.
  2. It is not "idiomatic rust style" (which to me means avoiding mut), though this is far less important.

How about something such as (untested):

Suggested change
let mut new_fields = vec![];
for i in indices {
    let f = self.fields[i].clone();
    new_fields.push(f);
}
let new_fields = indices
    .into_iter()
    .map(|i| {
        self.fields
            .get(i)
            .map(|f| f.clone())
            .ok_or_else(|| {
                ArrowError::SchemaError(format!(
                    "project index {} out of bounds, max field {}",
                    i,
                    self.fields().len()
                ))
            })
    })
    .collect::<Result<Vec<_>>>()?;

Note the use of https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get to avoid fields[i], and then the somewhat confusing use of the turbofish .collect::<Result<Vec<_>>>() -- it took me quite a while to get used to that pattern.
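
For anyone new to that pattern, a small standalone sketch (not from this PR; it uses std's two-parameter Result, whereas arrow's Result alias fixes the error type so only one type parameter appears):

    fn parse_all(inputs: &[&str]) -> Result<Vec<i32>, std::num::ParseIntError> {
        inputs
            .iter()
            // Each item maps to a Result; collecting into Result<Vec<_>, _>
            // stops at the first Err and returns it unchanged.
            .map(|s| s.parse::<i32>())
            .collect::<Result<Vec<_>, _>>()
    }

    fn main() {
        assert_eq!(parse_all(&["1", "2", "3"]), Ok(vec![1, 2, 3]));
        assert!(parse_all(&["1", "oops", "3"]).is_err());
    }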

Contributor Author

Yeah, that seems good to me. The for loop was the first thing that popped into my head, but I can't think of any reason it's better than yours.

@alamb (Contributor) Dec 14, 2021

I think the for loop is what one would write in other languages like C/C++, Java, Go, etc. :) It is certainly what I wrote when I started learning Rust.

Then I realized that a big part of how Rust avoids bounds checks while still being safe is the use of the functional style.

        assert_eq!(projected.fields()[0].name(), "name");
        assert_eq!(projected.fields()[1].name(), "priority");
        assert_eq!(projected.metadata.get("meta").unwrap(), "data")
    }
Contributor

Related to the above -- I recommend a test for handling an index that is out of bounds -- like schema.project([2, 3])
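
Something along these lines, for example (an untested sketch; it assumes the final &[usize] signature and that out-of-range indices surface as an Err rather than a panic):

    #[test]
    fn project_out_of_bounds() {
        let schema = Schema::new(vec![
            Field::new("name", DataType::Utf8, false),
            Field::new("priority", DataType::Int32, false),
        ]);

        // Index 3 is past the last field, so the projection should
        // return an error instead of panicking.
        assert!(schema.project(&[2, 3]).is_err());
    }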

Contributor Author

Sure, will do

@@ -175,6 +175,12 @@ impl RecordBatch {
        self.schema.clone()
    }

    /// Projects the schema onto the specified columns
    pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> {
Contributor

The intent of this method was to project the RecordBatch rather than just the schema:

A signature like this:

Suggested change
pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<Schema> {
pub fn project(&self, indices: impl IntoIterator<Item=usize>) -> Result<RecordBatch> {

(so we would also have to project the columns as well as the schema)
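
A rough, untested sketch of the shape that could take (it reuses the indices for both the schema and the columns, which is where the Clone bound discussed further down comes from; it is not the exact committed code):

    /// Projects the RecordBatch onto the specified columns
    pub fn project(
        &self,
        indices: impl IntoIterator<Item = usize> + Clone,
    ) -> Result<RecordBatch> {
        // Project the schema first; out-of-range indices error here, so
        // indexing the columns below cannot panic for a valid batch.
        let projected_schema = self.schema.project(indices.clone())?;

        // Then pick the matching columns in the same order.
        let projected_columns = indices
            .into_iter()
            .map(|i| self.columns[i].clone())
            .collect::<Vec<_>>();

        RecordBatch::try_new(SchemaRef::new(projected_schema), projected_columns)
    }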

Contributor Author

Ahh, I thought this part was a bit too easy. Okay, I'll update to reflect that.

@hntd187 (Contributor Author) commented Dec 14, 2021

@alamb Regarding impl IntoIterator<Item=usize>: I wanted to reuse this for the schema projection as well, so I had to add impl IntoIterator<Item=usize> + Clone to it for RecordBatch. This doesn't seem immediately right to me since the two methods end up with different arguments, but it works.

@alamb (Contributor) commented Dec 14, 2021

> @alamb Regarding impl IntoIterator<Item=usize>: I wanted to reuse this for the schema projection as well, so I had to add impl IntoIterator<Item=usize> + Clone to it for RecordBatch. This doesn't seem immediately right to me since the two methods end up with different arguments, but it works.

It looks like the new code may not yet have been pushed to GitHub.

@codecov-commenter commented Dec 14, 2021

Codecov Report

Merging #1033 (8ade651) into master (239cba1) will decrease coverage by 0.06%.
The diff coverage is 89.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1033      +/-   ##
==========================================
- Coverage   82.31%   82.25%   -0.07%     
==========================================
  Files         168      168              
  Lines       49031    49197     +166     
==========================================
+ Hits        40360    40465     +105     
- Misses       8671     8732      +61     
Impacted Files Coverage Δ
arrow/src/record_batch.rs 91.97% <80.00%> (-0.68%) ⬇️
arrow/src/datatypes/schema.rs 72.95% <94.44%> (+6.28%) ⬆️
arrow/src/datatypes/native.rs 66.66% <0.00%> (-6.25%) ⬇️
parquet/src/arrow/record_reader.rs 92.77% <0.00%> (-0.96%) ⬇️
arrow/src/array/ord.rs 67.15% <0.00%> (-0.50%) ⬇️
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (-0.46%) ⬇️
arrow/src/util/integration_util.rs 68.66% <0.00%> (-0.42%) ⬇️
arrow/src/array/data.rs 80.85% <0.00%> (-0.32%) ⬇️
arrow/src/array/array.rs 85.25% <0.00%> (-0.21%) ⬇️
arrow/src/compute/kernels/partition.rs 97.45% <0.00%> (-0.21%) ⬇️
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 239cba1...8ade651. Read the comment docs.

@alamb (Contributor) left a comment

Thanks for sticking with this @hntd187


        RecordBatch::try_new(SchemaRef::new(projected_schema), batch_fields)
    }

Contributor

How about some tests?

Perhaps something like

    #[test]
    fn project() {
        let a: ArrayRef = Arc::new(Int32Array::from(vec![
            Some(1),
            None,
            Some(3),
        ]));
        let b: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "c"]));
        let c: ArrayRef = Arc::new(StringArray::from(vec!["d", "e", "f"]));

        let record_batch = RecordBatch::try_from_iter(vec![("a", a.clone()), ("b", b.clone()), ("c", c.clone())])
            .expect("valid conversion");

        let expected = RecordBatch::try_from_iter(vec![("a", a), ("c", c)])
            .expect("valid conversion");

        assert_eq!(expected, record_batch.project(&vec![0, 2]).unwrap());
    }

        &self,
        indices: impl IntoIterator<Item = usize> + Clone,
    ) -> Result<RecordBatch> {
        let projected_schema = self.schema.project(indices.clone())?;
Contributor

I see now why you needed to make the iterator Clone, which is kind of annoying 🤔

@@ -87,6 +87,24 @@ impl Schema {
        Self { fields, metadata }
    }

    /// Returns a new schema with only the specified columns in the new schema
    /// This carries metadata from the parent schema over as well
    pub fn project(&self, indices: impl IntoIterator<Item = usize>) -> Result<Schema> {
Contributor

I know I did something different in the ticket, but I think this interface is kind of annoying.

Namely, I couldn't pass in &vec![1, 2]

   --> arrow/src/datatypes/schema.rs:405:40
    |
405 |         let projected: Schema = schema.project(&vec![0, 2]).unwrap();
    |                                        ^^^^^^^ expected `&{integer}`, found `usize`

What would you think about being less fancy and changing this (and RecordBatch) to something like:

    pub fn project(&self, indices: &[usize]) -> Result<Schema> {

Which would then avoid the need for the clone on RecordBatch::project as well
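
For illustration, an untested sketch of Schema::project with the less generic signature (roughly what the PR switched to in a later commit; details are approximate):

    /// Returns a new schema containing only the specified columns;
    /// metadata is carried over from the parent schema.
    pub fn project(&self, indices: &[usize]) -> Result<Schema> {
        let new_fields = indices
            .iter()
            .map(|i| {
                self.fields.get(*i).cloned().ok_or_else(|| {
                    ArrowError::SchemaError(format!(
                        "project index {} out of bounds, max field {}",
                        i,
                        self.fields().len()
                    ))
                })
            })
            .collect::<Result<Vec<_>>>()?;

        Ok(Self::new_with_metadata(new_fields, self.metadata.clone()))
    }

Call sites can then pass a slice or a borrowed Vec directly, e.g. schema.project(&[0, 2]) or schema.project(&vec![0, 2]), and RecordBatch::project no longer needs the Clone bound on its indices.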

@alamb (Contributor) left a comment

Looks good -- thank you @hntd187

@alamb (Contributor) commented Dec 20, 2021

@hntd187 there were some fmt and clippy errors on this PR; I have pushed a fix in 8ade651

@alamb changed the title from "Projection on Schema and RecordBatch" to "Add Schema::project and RecordBatch::project functions" on Dec 20, 2021
@alamb added the enhancement (Any new improvement worthy of an entry in the changelog) label on Dec 20, 2021
@alamb merged commit f3e452c into apache:master on Dec 20, 2021
@hntd187 (Contributor Author) commented Dec 20, 2021

> @hntd187 there were some fmt and clippy errors on this PR; I have pushed a fix in 8ade651

Oh, thank you very much, I appreciate that!

alamb added a commit that referenced this pull request Dec 21, 2021
* Allow Schema and RecordBatch to project schemas on specific columns returning a new schema with those columns only

* Addressing PR updates and adding a test for out of range projection

* switch to &[usize]

* fix: clippy and fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb added a commit that referenced this pull request Dec 22, 2021
* Allow Schema and RecordBatch to project schemas on specific columns returning a new schema with those columns only

* Addressing PR updates and adding a test for out of range projection

* switch to &[usize]

* fix: clippy and fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Stephen Carman <hntd187@users.noreply.github.com>
Labels
arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog)

Development
Successfully merging this pull request may close these issues: Add Schema::project and RecordBatch project function to project / select a subset of columns

3 participants