Implementing transpose and cross_tab for the InMemory table. #3919

jdunkerley · 2022-11-25T17:56:32Z

Pull Request Description

Adds transpose and cross_tab to the In-Memory table.
Cross Tab is built on top of aggregate and hence allows for expressions and has same error trapping as in aggregate.

Important Notes

Only basic tests have been implemented. Error and warning tests will be added as a follow up task.

Checklist

Please include the following checklist in your PR:

The documentation has been updated if necessary.
All code conforms to the
Scala,
Java,
and
Rust
style guides.
All code has been tested:
- Unit tests have been written where possible.
- If GUI codebase was changed: Enso GUI was tested when built using BOTH
  ./run ide build and ./run ide watch.

Session with Ned. First part of crosstab. Functional transpose prototype. Not memory efficient though. First go at transpose.

Supports the agreed naming rules from discussion with Ned. Doesn't do the cross tab yet...

Uses aggregate methods.

Move transpose to Java and add some tests.

Add some tests for reasonably happy paths.

One still to sort.

radeusgd · 2022-11-29T12:01:56Z

distribution/lib/Standard/Database/0.0.0-dev/src/Data/Table.enso

@@ -819,6 +819,77 @@ type Table
            on_problems.attach_problems_before problems <|
                self.updated_context_and_columns new_ctx new_columns

+    ## Returns a new table with a chosen subset of columns left unchanged and


If we are using pivot so much to call this transpose method, shall we maybe add ALIAS pivot?

Will add to the design and update as part of the next task.

radeusgd · 2022-11-29T12:02:44Z

distribution/lib/Standard/Database/0.0.0-dev/src/Data/Table.enso

+        ## Avoid unused arguments warning. We cannot rename arguments to `_`,
+           because we need to keep the API consistent with the in-memory table.
+        _ = [id_fields, name_field, value_field, on_problems]


Btw. this makes me think if we should consider some proper solution for it like an @unused annotation?

radeusgd · 2022-11-29T12:05:43Z

distribution/lib/Standard/Table/0.0.0-dev/src/Data/Table.enso

+        java_id = id_columns.map .java_column
+
+        id_columns.each k-> unique.make_unique k.name
+        result = Table.Table_Data (Java_Table.transpose java_id.to_array java_data.to_array (unique.make_unique name_field) (unique.make_unique value_field))


Suggested change

result = Table.Table_Data (Java_Table.transpose java_id.to_array java_data.to_array (unique.make_unique name_field) (unique.make_unique value_field))

result = Table.Table_Data (Java_Table.transpose java_id java_data (unique.make_unique name_field) (unique.make_unique value_field))

I think with the Vector being a builtin and implementing the Truffle hasArrayElements API, we do not need the to_array calls. At least I've been not using them in some new code and it was all working OK. Unless you are using them for some specific reason that I did not consider? If yes please let me know, as I probably should adapt my usages too, as we should be consistent.

I stuck with what was in the Table code already - we can go through and remove the calls but should also probably think about using List on the other side as then it can proxy. This is calling on to a T[] so the Vector is copied anyway I believe.

Right! I was using List<> in my case indeed. Well, then we can also keep it as-is, seems fine.

radeusgd · 2022-11-29T12:38:33Z

distribution/lib/Standard/Table/0.0.0-dev/src/Internal/Aggregate_Column_Helper.enso

+            c = aggregate_column.column
+            prefix + " " + (if include_column then c.name else "")
+
+## Utility function to check if all same column


This comment doesn't say more than the method name itself, it's also unclear why Count can be ignored. Would be good to clarify in the doc comment what this method really does.

Will update the doc in the next task along with some tests for this.

radeusgd · 2022-11-29T12:41:31Z

std-bits/table/src/main/java/org/enso/table/data/index/MultiValueIndex.java

+    IntStream.range(0, groupingColumns.length)
+        .forEach(
+            i -> storage[i] = Builder.getForType(groupingColumns[i].getStorage().getType(), size));
+    IntStream.range(0, nameIndex.locs.size())
+        .forEach(
+            i -> {
+              int offset = groupingColumns.length + i * aggregates.length;
+              IntStream.range(0, aggregates.length)
+                  .forEach(
+                      j -> storage[offset + j] = Builder.getForType(aggregates[j].getType(), size));
+            });


Any reason for using streams here instead of regular for loops?

(Not saying not to, but in similar code I'm writing I'm using for-loops and so I want to clarify what is our desired style)

Found it clearer for this case but am happy with either style.

hubertp · 2022-11-29T15:28:12Z

distribution/lib/Standard/Table/0.0.0-dev/src/Internal/Aggregate_Column_Helper.enso

+    if without_count.length < 2 then True else
+        column = without_count.first.column.name
+        without_count.all c->(c.column.name == column)


It's usually easier to follow if we have a check that simply call .is_empty. Plus .length call can seem a bit spurious from the performance POV (depending on the implementation).

Isn't the performance of length and is_empty basically the same? Under the hood is_empty = length == 0 IIRC for Vector anyway.

Oh. That's a bummer.

Why? Vector.length should be a very fast operation

In general every .length on any collection should be a very fast operation.
But it is spurious when you only want to check if it is empty or not.

Hm I guess it may not necessarily be fast on a List (whereas is_empty will still be fast there). But my point is that in case of Vector it just boils down to reading its property. And that there's just no other way to check if it's empty than to check that length property in that case. But of course for other collections (like List) there can be a difference.

But my point is that in case of Vector it just boils down to reading its property

Sure, no disagreement on that. My comment was more generic since I didn't even check if without_count is a Vector or List or something else. I just knew that it was some collection.
The discussion about is_empty implementation is a slightly orthogonal subject not relevant for this PR.

Going back to my comment, my preference is just that when I see length being compared to some number I have to make a mental note why this number is used. If I see is_empty it is immediately obvious what the author had in mind. This is just a minor nit.

The discussion about is_empty implementation is a slightly orthogonal subject not relevant for this PR.

Yep, definitely :)

Going back to my comment, my preference is just that when I see length being compared to some number I have to make a mental note why this number is used. If I see is_empty it is immediately obvious what the author had in mind. This is just a minor nit.

I agree 100% here

jdunkerley marked this pull request as ready for review November 29, 2022 07:06

jdunkerley requested a review from radeusgd as a code owner November 29, 2022 07:06

jdunkerley force-pushed the wip/jd/table-pivot-183867400 branch from 81edc32 to 08590b7 Compare November 29, 2022 07:06

jdunkerley added 11 commits November 29, 2022 10:56

Transpose design work.

b5042ea

Session with Ned. First part of crosstab. Functional transpose prototype. Not memory efficient though. First go at transpose.

Working towards a cross-tab implementation.

760a8bf

Supports the agreed naming rules from discussion with Ned. Doesn't do the cross tab yet...

Working towards a cross-tab implementation.

0fb83b0

Supports the agreed naming rules from discussion with Ned. Doesn't do the cross tab yet...

Functional cross-tab.

eba7cad

Uses aggregate methods.

Remove some WIP ideas.

6e2e8be

Fix issue with expressions.

13a40d6

Changelog.

85260f8

Move transpose to Java and add some tests.

Java format.

6f0809f

Sort out transpose.

f68bdd3

Add some tests for reasonably happy paths.

Tests for cross-tab.

151b22a

One still to sort.

Fix bad name_field test.

ef7ced6

jdunkerley force-pushed the wip/jd/table-pivot-183867400 branch from 08590b7 to ef7ced6 Compare November 29, 2022 10:57

jdunkerley requested a review from hubertp November 29, 2022 11:07

radeusgd approved these changes Nov 29, 2022

View reviewed changes

jdunkerley added the CI: Ready to merge This PR is eligible for automatic merge label Nov 29, 2022

Merge branch 'develop' into wip/jd/table-pivot-183867400

37542dc

hubertp approved these changes Nov 29, 2022

View reviewed changes

mergify bot added 2 commits November 29, 2022 19:39

Merge branch 'develop' into wip/jd/table-pivot-183867400

4ddc891

Merge branch 'develop' into wip/jd/table-pivot-183867400

0b3af3f

mergify bot merged commit 4518f83 into develop Nov 30, 2022

mergify bot deleted the wip/jd/table-pivot-183867400 branch November 30, 2022 01:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementing transpose and cross_tab for the InMemory table. #3919

Implementing transpose and cross_tab for the InMemory table. #3919

jdunkerley commented Nov 25, 2022 •

edited

Loading

radeusgd Nov 29, 2022

jdunkerley Nov 29, 2022

radeusgd Nov 29, 2022

radeusgd Nov 29, 2022

jdunkerley Nov 29, 2022

radeusgd Nov 29, 2022

radeusgd Nov 29, 2022

jdunkerley Nov 29, 2022

radeusgd Nov 29, 2022

jdunkerley Nov 29, 2022

hubertp Nov 29, 2022 •

edited

Loading

radeusgd Nov 29, 2022 •

edited

Loading

hubertp Nov 29, 2022

radeusgd Nov 29, 2022

hubertp Nov 29, 2022

radeusgd Nov 29, 2022

hubertp Nov 29, 2022

radeusgd Nov 29, 2022

	result = Table.Table_Data (Java_Table.transpose java_id.to_array java_data.to_array (unique.make_unique name_field) (unique.make_unique value_field))
	result = Table.Table_Data (Java_Table.transpose java_id java_data (unique.make_unique name_field) (unique.make_unique value_field))

Implementing transpose and cross_tab for the InMemory table. #3919

Implementing transpose and cross_tab for the InMemory table. #3919

Conversation

jdunkerley commented Nov 25, 2022 • edited Loading

Pull Request Description

Important Notes

Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hubertp Nov 29, 2022 • edited Loading

Choose a reason for hiding this comment

radeusgd Nov 29, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jdunkerley commented Nov 25, 2022 •

edited

Loading

hubertp Nov 29, 2022 •

edited

Loading

radeusgd Nov 29, 2022 •

edited

Loading