Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing transpose and cross_tab for the InMemory table. #3919

Merged
merged 14 commits into from
Nov 30, 2022

Conversation

jdunkerley
Copy link
Member

@jdunkerley jdunkerley commented Nov 25, 2022

Pull Request Description

  • Adds transpose and cross_tab to the In-Memory table.
  • Cross Tab is built on top of aggregate and hence allows for expressions and has same error trapping as in aggregate.

Important Notes

Only basic tests have been implemented. Error and warning tests will be added as a follow up task.

Checklist

Please include the following checklist in your PR:

  • The documentation has been updated if necessary.
  • All code conforms to the
    Scala,
    Java,
    and
    Rust
    style guides.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed: Enso GUI was tested when built using BOTH
      ./run ide build and ./run ide watch.

@jdunkerley jdunkerley marked this pull request as ready for review November 29, 2022 07:06
Session with Ned.

First part of crosstab.

Functional transpose prototype.
Not memory efficient though.

First go at transpose.
Supports the agreed naming rules from discussion with Ned.
Doesn't do the cross tab yet...
Supports the agreed naming rules from discussion with Ned.
Doesn't do the cross tab yet...
Uses aggregate methods.
Move transpose to Java and add some tests.
Add some tests for reasonably happy paths.
One still to sort.
@@ -819,6 +819,77 @@ type Table
on_problems.attach_problems_before problems <|
self.updated_context_and_columns new_ctx new_columns

## Returns a new table with a chosen subset of columns left unchanged and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are using pivot so much to call this transpose method, shall we maybe add ALIAS pivot?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add to the design and update as part of the next task.

Comment on lines +850 to +852
## Avoid unused arguments warning. We cannot rename arguments to `_`,
because we need to keep the API consistent with the in-memory table.
_ = [id_fields, name_field, value_field, on_problems]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw. this makes me think if we should consider some proper solution for it like an @unused annotation?

java_id = id_columns.map .java_column

id_columns.each k-> unique.make_unique k.name
result = Table.Table_Data (Java_Table.transpose java_id.to_array java_data.to_array (unique.make_unique name_field) (unique.make_unique value_field))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
result = Table.Table_Data (Java_Table.transpose java_id.to_array java_data.to_array (unique.make_unique name_field) (unique.make_unique value_field))
result = Table.Table_Data (Java_Table.transpose java_id java_data (unique.make_unique name_field) (unique.make_unique value_field))

I think with the Vector being a builtin and implementing the Truffle hasArrayElements API, we do not need the to_array calls. At least I've been not using them in some new code and it was all working OK. Unless you are using them for some specific reason that I did not consider? If yes please let me know, as I probably should adapt my usages too, as we should be consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stuck with what was in the Table code already - we can go through and remove the calls but should also probably think about using List on the other side as then it can proxy. This is calling on to a T[] so the Vector is copied anyway I believe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right! I was using List<> in my case indeed. Well, then we can also keep it as-is, seems fine.

c = aggregate_column.column
prefix + " " + (if include_column then c.name else "")

## Utility function to check if all same column
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't say more than the method name itself, it's also unclear why Count can be ignored. Would be good to clarify in the doc comment what this method really does.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update the doc in the next task along with some tests for this.

Comment on lines +109 to +119
IntStream.range(0, groupingColumns.length)
.forEach(
i -> storage[i] = Builder.getForType(groupingColumns[i].getStorage().getType(), size));
IntStream.range(0, nameIndex.locs.size())
.forEach(
i -> {
int offset = groupingColumns.length + i * aggregates.length;
IntStream.range(0, aggregates.length)
.forEach(
j -> storage[offset + j] = Builder.getForType(aggregates[j].getType(), size));
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for using streams here instead of regular for loops?

(Not saying not to, but in similar code I'm writing I'm using for-loops and so I want to clarify what is our desired style)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found it clearer for this case but am happy with either style.

@jdunkerley jdunkerley added the CI: Ready to merge This PR is eligible for automatic merge label Nov 29, 2022
Comment on lines +113 to +115
if without_count.length < 2 then True else
column = without_count.first.column.name
without_count.all c->(c.column.name == column)
Copy link
Contributor

@hubertp hubertp Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's usually easier to follow if we have a check that simply call .is_empty. Plus .length call can seem a bit spurious from the performance POV (depending on the implementation).

Copy link
Member

@radeusgd radeusgd Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the performance of length and is_empty basically the same? Under the hood is_empty = length == 0 IIRC for Vector anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. That's a bummer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? Vector.length should be a very fast operation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general every .length on any collection should be a very fast operation.
But it is spurious when you only want to check if it is empty or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I guess it may not necessarily be fast on a List (whereas is_empty will still be fast there). But my point is that in case of Vector it just boils down to reading its property. And that there's just no other way to check if it's empty than to check that length property in that case. But of course for other collections (like List) there can be a difference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But my point is that in case of Vector it just boils down to reading its property

Sure, no disagreement on that. My comment was more generic since I didn't even check if without_count is a Vector or List or something else. I just knew that it was some collection.
The discussion about is_empty implementation is a slightly orthogonal subject not relevant for this PR.

Going back to my comment, my preference is just that when I see length being compared to some number I have to make a mental note why this number is used. If I see is_empty it is immediately obvious what the author had in mind. This is just a minor nit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion about is_empty implementation is a slightly orthogonal subject not relevant for this PR.

Yep, definitely :)

Going back to my comment, my preference is just that when I see length being compared to some number I have to make a mental note why this number is used. If I see is_empty it is immediately obvious what the author had in mind. This is just a minor nit.

I agree 100% here

@mergify mergify bot merged commit 4518f83 into develop Nov 30, 2022
@mergify mergify bot deleted the wip/jd/table-pivot-183867400 branch November 30, 2022 01:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants