-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementing transpose and cross_tab for the InMemory table. #3919
Conversation
81edc32
to
08590b7
Compare
Session with Ned. First part of crosstab. Functional transpose prototype. Not memory efficient though. First go at transpose.
Supports the agreed naming rules from discussion with Ned. Doesn't do the cross tab yet...
Supports the agreed naming rules from discussion with Ned. Doesn't do the cross tab yet...
Uses aggregate methods.
Move transpose to Java and add some tests.
Add some tests for reasonably happy paths.
One still to sort.
08590b7
to
ef7ced6
Compare
@@ -819,6 +819,77 @@ type Table | |||
on_problems.attach_problems_before problems <| | |||
self.updated_context_and_columns new_ctx new_columns | |||
|
|||
## Returns a new table with a chosen subset of columns left unchanged and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are using pivot
so much to call this transpose
method, shall we maybe add ALIAS pivot
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add to the design and update as part of the next task.
## Avoid unused arguments warning. We cannot rename arguments to `_`, | ||
because we need to keep the API consistent with the in-memory table. | ||
_ = [id_fields, name_field, value_field, on_problems] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw. this makes me think if we should consider some proper solution for it like an @unused
annotation?
java_id = id_columns.map .java_column | ||
|
||
id_columns.each k-> unique.make_unique k.name | ||
result = Table.Table_Data (Java_Table.transpose java_id.to_array java_data.to_array (unique.make_unique name_field) (unique.make_unique value_field)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result = Table.Table_Data (Java_Table.transpose java_id.to_array java_data.to_array (unique.make_unique name_field) (unique.make_unique value_field)) | |
result = Table.Table_Data (Java_Table.transpose java_id java_data (unique.make_unique name_field) (unique.make_unique value_field)) |
I think with the Vector being a builtin and implementing the Truffle hasArrayElements API, we do not need the to_array
calls. At least I've been not using them in some new code and it was all working OK. Unless you are using them for some specific reason that I did not consider? If yes please let me know, as I probably should adapt my usages too, as we should be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stuck with what was in the Table code already - we can go through and remove the calls but should also probably think about using List on the other side as then it can proxy. This is calling on to a T[] so the Vector is copied anyway I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right! I was using List<>
in my case indeed. Well, then we can also keep it as-is, seems fine.
c = aggregate_column.column | ||
prefix + " " + (if include_column then c.name else "") | ||
|
||
## Utility function to check if all same column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment doesn't say more than the method name itself, it's also unclear why Count
can be ignored. Would be good to clarify in the doc comment what this method really does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update the doc in the next task along with some tests for this.
IntStream.range(0, groupingColumns.length) | ||
.forEach( | ||
i -> storage[i] = Builder.getForType(groupingColumns[i].getStorage().getType(), size)); | ||
IntStream.range(0, nameIndex.locs.size()) | ||
.forEach( | ||
i -> { | ||
int offset = groupingColumns.length + i * aggregates.length; | ||
IntStream.range(0, aggregates.length) | ||
.forEach( | ||
j -> storage[offset + j] = Builder.getForType(aggregates[j].getType(), size)); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for using streams here instead of regular for loops?
(Not saying not to, but in similar code I'm writing I'm using for-loops and so I want to clarify what is our desired style)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found it clearer for this case but am happy with either style.
if without_count.length < 2 then True else | ||
column = without_count.first.column.name | ||
without_count.all c->(c.column.name == column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's usually easier to follow if we have a check that simply call .is_empty
. Plus .length
call can seem a bit spurious from the performance POV (depending on the implementation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the performance of length
and is_empty
basically the same? Under the hood is_empty = length == 0
IIRC for Vector anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. That's a bummer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? Vector.length
should be a very fast operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general every .length
on any collection should be a very fast operation.
But it is spurious when you only want to check if it is empty or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm I guess it may not necessarily be fast on a List
(whereas is_empty
will still be fast there). But my point is that in case of Vector it just boils down to reading its property. And that there's just no other way to check if it's empty than to check that length
property in that case. But of course for other collections (like List
) there can be a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But my point is that in case of Vector it just boils down to reading its property
Sure, no disagreement on that. My comment was more generic since I didn't even check if without_count
is a Vector
or List
or something else. I just knew that it was some collection.
The discussion about is_empty
implementation is a slightly orthogonal subject not relevant for this PR.
Going back to my comment, my preference is just that when I see length
being compared to some number I have to make a mental note why this number is used. If I see is_empty
it is immediately obvious what the author had in mind. This is just a minor nit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The discussion about
is_empty
implementation is a slightly orthogonal subject not relevant for this PR.
Yep, definitely :)
Going back to my comment, my preference is just that when I see length being compared to some number I have to make a mental note why this number is used. If I see is_empty it is immediately obvious what the author had in mind. This is just a minor nit.
I agree 100% here
Pull Request Description
Important Notes
Only basic tests have been implemented. Error and warning tests will be added as a follow up task.
Checklist
Please include the following checklist in your PR:
Scala,
Java,
and
Rust
style guides.
./run ide build
and./run ide watch
.