
Add size-limited strings and varying bit-width integer Value_Types to in-memory backend and check for ArithmeticOverflow in LongStorage #7557

Merged
merged 40 commits into develop from wip/radeusgd/5159-new-inmemory-value-types
Aug 22, 2023

Conversation

radeusgd
Member

@radeusgd radeusgd commented Aug 10, 2023

Pull Request Description

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the Scala, Java, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If the GUI codebase was changed, the GUI was tested when built using ./run ide build.

@radeusgd
Member Author

So I was not completely sure what to do with the case of arithmetic overflow.

We essentially have two cases: 64-bit integers and smaller types.

For 64-bit overflow we cannot do much, as we don't have a bigger type yet. The current behaviour is the standard wrap-around ('modulo') overflow, but I think that is wrong as it can silently corrupt data. So instead I replace the overflowing value with Nothing and attach a warning.
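
To illustrate, this kind of check maps naturally onto Java's Math.addExact. A minimal sketch of the behaviour described above, with hypothetical names rather than the actual LongStorage code:

```java
// Minimal sketch: checked addition where an overflowing result becomes a
// missing value plus a warning. Hypothetical names; the real LongStorage
// implementation may differ.
import java.util.BitSet;
import java.util.List;

final class CheckedAddSketch {
    static long[] add(long[] a, long[] b, BitSet isNothing, List<String> warnings) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            try {
                out[i] = Math.addExact(a[i], b[i]); // throws ArithmeticException on 64-bit overflow
            } catch (ArithmeticException e) {
                isNothing.set(i);                   // the result becomes Nothing
                warnings.add("Arithmetic overflow in row " + i);
            }
        }
        return out;
    }
}
```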

Now what if I add two 16-bit values together and they overflow the 16-bit type? There I have more options - I could widen the storage type to 32 or 64 bits so the result fits. However, there is no clear heuristic for when to do so. Do I widen 'on demand'? That would make the result type unpredictable - I start with two 16-bit columns and can end up with anything.

Or do I always widen? Maybe all operations should just return 64-bit integers? But then we very quickly 'lose' the smaller bit-width: even if I start with small columns, they get upcast as soon as any operation is performed on them, and if I want them small again I have to cast them back.

I think always up-converting to 64-bit has its merits: it is simpler to implement, and we will hit overflows far less often, while the user can always re-cast afterwards. The downside is that the smaller types become 'second-class citizens', because we immediately 'escape' them.

What do you think @jdunkerley @GregoryTravis?

I'm sure that for 64-bit overflows we want warnings and so on. But for the smaller types - do we also check for overflow, or do we up-cast them to 64 bits on every operation?

I imagine I will implement the 64-bit logic first as it's simpler, but I'm wondering what to do with these smaller types.

@GregoryTravis
Contributor

> What do you think @jdunkerley @GregoryTravis?

I agree that widening to 64 bit by default is the better move.

The most common use case is that the original data uses a narrow integer type, but after it's read, the user doesn't need it to stay that way.

The second most common use case is that the user needs to compute something with narrow types and write it back to a column with a narrow type. In that case, they'll get a clear warning/error that they tried to write 64-bit integers to a narrow column, and they'll have to cast.

@radeusgd
Member Author

radeusgd commented Aug 10, 2023

> > What do you think @jdunkerley @GregoryTravis?
>
> I agree that widening to 64 bit by default is the better move.
>
> The most common use case is that the original data uses a narrow integer type, but after it's read, the user doesn't need it to stay that way.
>
> The second most common use case is that the user needs to compute something with narrow types and write it back to a column with a narrow type. In that case, they'll get a clear warning/error that they tried to write 64-bit integers to a narrow column, and they'll have to cast.

I think you are right. I will amend the tests tomorrow.

I amended the tests in commit 66508b1, and then in abfbf0b I made even % promote to 64 bits. That is not strictly necessary (there is no chance of overflow with %), but it makes the code simpler and the semantics more consistent - I thought that having a single operation which does not promote might actually be more confusing. Keeping % narrow would not serve much practical purpose anyway, because you hardly ever perform only % - it is usually surrounded by other operations that would promote to 64 bits regardless.
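
As a plain-Java aside (separate from the Enso code): the remainder of a division always fits in the operand type, even in the Long.MIN_VALUE % -1 edge case, whereas addition genuinely can leave the 64-bit range:

```java
// Illustration only: % cannot overflow a long, while + can (and addExact reports it).
public class ModuloVsAdd {
    public static void main(String[] args) {
        System.out.println(Long.MIN_VALUE % -1L);               // 0 - no overflow possible for %
        try {
            System.out.println(Math.addExact(Long.MAX_VALUE, 1L)); // overflows the 64-bit range
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}
```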

Comment on lines 851 to 868
if non_trivial_types_supported then
src = source_table_builder [["X", [1, 2, 3]], ["Y", ["a", "xyz", "abcdefghijkl"]], ["Z", ["a", "pqrst", "abcdefghijkl"]]]
## TODO [RW] figure out what semantics we want here; I think the current one may be OK but it is going to
be slightly painful, so IMO an auto-conversion could be useful. We could make it so that we do
auto-conversion (cast), but in a more strict mode such that if anything does not fit
(even just string padding required) we fail hard and tell the user to fix this.
Test.specify "fails if the target type is more restrictive than source" <|
result = src.update_database_table dest update_action=Update_Action.Insert key_columns=[]
result.should_fail_with Column_Type_Mismatch
but_maybe="I'm not sure if we want this automatic restriction. If anything, we should probably report situations like abcdefghijkl being truncated."
Test.specify "should warn if the target type is more restrictive than source and truncation may occur" pending=but_maybe <|
result = src.update_database_table dest update_action=Update_Action.Insert key_columns=[]
IO.println (Problems.get_attached_warnings result)

result.column_names . should_equal ["X", "Y", "Z"]
result.at "X" . to_vector . should_contain_the_same_elements_as [1, 2, 3]
result.at "Y" . to_vector . should_contain_the_same_elements_as ["a", "xyz", "abc"]
result.at "Z" . to_vector . should_contain_the_same_elements_as ["a ", "pqrst", "abcde"]
Member Author

This is not really about this PR, so I'm not going to block the PR on it.

But I just realised that if the target table has e.g. 16-bit integer columns, a table created in Enso will use 64-bit integers by default. So, as the test above shows, the upload will raise an error and require the user to cast.

I'm wondering if it would not be better to try casting automatically: if the cast produces any warnings, fail hard and ask the user to resolve the situation; but if the values fit, we could perform the conversion automatically for convenience.

That will only work in-memory, though, where we can easily check whether the cast succeeded without warnings. In the Database backend we would risk losing data, so there we certainly cannot do this automatic conversion.

@jdunkerley do you think such a convenience auto-cast on upload is worth it? If so, I'd appreciate a ticket being created for it - or just let me know and I'll create one once I'm back from vacation.
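
A rough sketch of what such a strict auto-cast could look like (illustrative Java with hypothetical names; the real upload logic is Enso code and may end up quite different):

```java
// Sketch of the proposed strict auto-cast on upload: attempt the narrowing
// conversion, and if *any* value would be changed (e.g. truncated), fail hard
// instead of converting silently.
import java.util.ArrayList;
import java.util.List;

final class StrictAutoCastSketch {
    static List<String> castToVarchar(List<String> values, int maxLength) {
        List<String> problems = new ArrayList<>();
        for (String v : values) {
            if (v.length() > maxLength) {
                problems.add("'" + v + "' does not fit in a VARCHAR(" + maxLength + ") column");
            }
        }
        if (!problems.isEmpty()) {
            // Strict mode: abort and ask the user to cast (or fix the data) explicitly.
            throw new IllegalStateException(String.join("; ", problems));
        }
        return values; // every value fits - the conversion is lossless, so it is safe to proceed
    }
}
```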

Contributor

I think this is worth doing; if the user has a target table with a narrow type, they are likely trying to write data that they believe will fit, so this is a common case.

@radeusgd radeusgd marked this pull request as ready for review August 11, 2023 12:20
@radeusgd radeusgd self-assigned this Aug 11, 2023

@radeusgd radeusgd force-pushed the wip/radeusgd/5159-new-inmemory-value-types branch from b710f25 to 90f44a9 Compare August 21, 2023 09:52
@radeusgd
Member Author

radeusgd commented Aug 21, 2023

I amended the default iteration counts as I was seeing insufficient warmup leading to inconsistent benchmark results. Results afterwards:

| Scenario                     | develop  | this PR  |
| ---------------------------- | -------- | -------- |
| Addition                     | 3.331 ms | 2.880 ms |
| Addition with Overflow       | 2.874 ms | 7.619 ms |
| Multiplication               | 2.872 ms | 2.844 ms |
| Multiplication with Overflow | 2.796 ms | 7.487 ms |

We can see that as long as there is no overflow, both approaches have comparable performance. Surprisingly, for + the addExact variant was even slightly faster, but I suspect that is just insufficient warmup of the develop variant - I had already extended the warmup (without the extended warmup, results were closer to 8 ms for the non-overflowing cases and 19 ms for the overflowing ones), and I imagine that with even more warmup both results would converge. This suggests that on the 'happy path' the overflow check adds no, or close to no, overhead (in fact, both measurements here came out faster).

Of course, in the overflowing case the new approach is slower. In my setup 20% of the rows overflow, and with that the slowdown is about 2.6x. The exact cost will depend on the percentage of rows that overflow (<1% will likely add little overhead, whereas 100% will increase it significantly) and on other factors, such as the current stack depth, which influences how costly it is to throw the exception. Still, we are comparing essentially 'incorrect' behaviour (the develop variant just lets the values wrap around) with an error-catching one that simply has more work to do, so it is expected to be slower.
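
A rough standalone sketch of this cost model (not the Enso benchmark harness, and the numbers will differ): the happy path is a single addExact per row, while every overflowing row additionally pays for throwing and catching an ArithmeticException.

```java
// Illustration only: compare rows that fit against rows that overflow (~20%,
// mirroring the benchmark setup above). Not the Enso benchmark harness.
public class OverflowCostSketch {
    public static void main(String[] args) {
        int n = 1_000_000;
        long[] a = new long[n], b = new long[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = (i % 5 == 0) ? Long.MAX_VALUE : 1L; // ~20% of rows overflow
        }
        for (int round = 0; round < 10; round++) {      // crude warmup + measurement
            long start = System.nanoTime();
            long sink = 0;
            int overflows = 0;
            for (int i = 0; i < n; i++) {
                try {
                    sink += Math.addExact(a[i], b[i]);
                } catch (ArithmeticException e) {
                    overflows++;                        // such a row would become Nothing + warning
                }
            }
            System.out.printf("round %d: %.3f ms (overflows=%d, sink=%d)%n",
                round, (System.nanoTime() - start) / 1e6, overflows, sink);
        }
    }
}
```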

Raw results

develop:

Found 4 cases to execute
Benchmarking 'Column_Arithmetic_1000000.Plus_Fitting' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15194.5505 ms
Warmup invocations: 1663
Warmup avg time:    9.021 ms
Measurement duration:    15187.6392 ms
Measurement invocations: 4503
Measurement avg time:    3.331 ms
Benchmark 'Column_Arithmetic_1000000.Plus_Fitting' finished in 30391.999 ms
Benchmarking 'Column_Arithmetic_1000000.Plus_Overflowing' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15161.867 ms
Warmup invocations: 3680
Warmup avg time:    4.077 ms
Measurement duration:    15204.4607 ms
Measurement invocations: 5220
Measurement avg time:    2.874 ms
Benchmark 'Column_Arithmetic_1000000.Plus_Overflowing' finished in 30368.108 ms
Benchmarking 'Column_Arithmetic_1000000.Multiply_Fitting' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15153.7378 ms
Warmup invocations: 3618
Warmup avg time:    4.147 ms
Measurement duration:    15173.2304 ms
Measurement invocations: 5224
Measurement avg time:    2.872 ms
Benchmark 'Column_Arithmetic_1000000.Multiply_Fitting' finished in 30328.21 ms
Benchmarking 'Column_Arithmetic_1000000.Multiply_Overflowing' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15093.7514 ms
Warmup invocations: 4072
Warmup avg time:    3.684 ms
Measurement duration:    15135.0264 ms
Measurement invocations: 5366
Measurement avg time:    2.796 ms
Benchmark 'Column_Arithmetic_1000000.Multiply_Overflowing' finished in 30229.871 ms

this PR:

Found 4 cases to execute
Benchmarking 'Column_Arithmetic_1000000.Plus_Fitting' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15197.3365 ms
Warmup invocations: 2184
Warmup avg time:    6.869 ms
Measurement duration:    15191.5517 ms
Measurement invocations: 5208
Measurement avg time:    2.88 ms
Benchmark 'Column_Arithmetic_1000000.Plus_Fitting' finished in 30398.177 ms
Benchmarking 'Column_Arithmetic_1000000.Plus_Overflowing' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15036.2704 ms
Warmup invocations: 863
Warmup avg time:    17.389 ms
Measurement duration:    15029.3346 ms
Measurement invocations: 1969
Measurement avg time:    7.619 ms
Benchmark 'Column_Arithmetic_1000000.Plus_Overflowing' finished in 30067.19 ms
Benchmarking 'Column_Arithmetic_1000000.Multiply_Fitting' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15137.2899 ms
Warmup invocations: 3695
Warmup avg time:    4.06 ms
Measurement duration:    15163.2409 ms
Measurement invocations: 5274
Measurement avg time:    2.844 ms
Benchmark 'Column_Arithmetic_1000000.Multiply_Fitting' finished in 30302.24 ms
Benchmarking 'Column_Arithmetic_1000000.Multiply_Overflowing' with configuration: [warmup={3 iterations, 5 seconds each}, measurement={3 iterations, 5 seconds each}]
Warmup duration:    15015.5373 ms
Warmup invocations: 1148
Warmup avg time:    13.069 ms
Measurement duration:    15027.4779 ms
Measurement invocations: 2004
Measurement avg time:    7.487 ms
Benchmark 'Column_Arithmetic_1000000.Multiply_Overflowing' finished in 30044.145 ms

@radeusgd radeusgd force-pushed the wip/radeusgd/5159-new-inmemory-value-types branch from a8cf52f to 2659f97 Compare August 22, 2023 09:47
@jdunkerley jdunkerley linked an issue Aug 22, 2023 that may be closed by this pull request
@radeusgd radeusgd changed the title Add size-limited strings and varying bit-width integer Value_Types to in-memory backend Add size-limited strings and varying bit-width integer Value_Types to in-memory backend and check for ArithmeticOverflow in LongStorage Aug 22, 2023
@mergify mergify bot merged commit 2385f5b into develop Aug 22, 2023
24 checks passed
@mergify mergify bot deleted the wip/radeusgd/5159-new-inmemory-value-types branch August 22, 2023 18:10
Labels
CI: Ready to merge This PR is eligible for automatic merge
3 participants