Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing operations on Mixed types #7368

Merged
merged 34 commits into from
Jul 25, 2023
Merged

Conversation

radeusgd
Copy link
Member

@radeusgd radeusgd commented Jul 21, 2023

Pull Request Description

  • Fixes Table type checks now allow Mixed columns if all values satisfy a more precise type, but the Storage vectorized operations are not prepared for it #7231
  • Cleans up vectorized operations to distinguish unary and binary operations.
  • Introduces MixedStorage which may pretend to be a more specialized storage on demand.
  • Ensures that operations request a more specialized storage on right-hand side to ensure compatibility with reported inferred storage type.
  • Ensures that a dataflow error returned by an Enso callback in Java is propagated as a polyglot exception and can be caught back in Enso
  • Tests for comparison of Mixed storages with each other and other types
  • Started using Set for Filter_Condition.Is_In for better performance.
  • Migrated Column.map and Column.zip to use the Java-to-Enso callbacks.
    • This does not forward warnings. IMO we should not be losing them. We can switch and add a ticket to fix the warnings, but that would be a regression (current implementation handles them correctly). Instead, we should first gain some ability to work with warnings in polyglot. I created a ticket to get this figured out Migrating Column.map and Column.zip to Java-to-Enso callbacks #7371
  • Trying to avoid conversions when calling Enso functions from Java.

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed, the GUI was tested when built using ./run ide build.

@radeusgd radeusgd added the CI: No changelog needed Do not require a changelog entry for this PR. label Jul 21, 2023
@radeusgd radeusgd self-assigned this Jul 21, 2023
@radeusgd radeusgd force-pushed the wip/radeusgd/7231-fix-mixed-ops branch from 3e1779a to 0850d2e Compare July 21, 2023 18:29
@@ -989,9 +1003,9 @@ type Column
result = case self.value_type of
Value_Type.Char _ _ -> self.is_empty
Value_Type.Float _ ->
if treat_nans_as_blank then self.is_nothing.iif True (self.internal_is_nan on_missing=False) else self.is_nothing
if treat_nans_as_blank then self.is_nothing.iif True self.internal_is_nan else self.is_nothing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this computing self.is_nothing twice?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation!! Yes, apparently it is.
Well, if it's a LongStorage, then it is a O(1) operation, because it just re-uses the bitset used to mark missing values. But I guess better to cache it just to be sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sorry, I'm reading the code again and realise it's not the case.

See how it's bracketed:

if treat_nans_as_blank 
	then self.is_nothing.iif True self.internal_is_nan 
	else self.is_nothing

So we can see that in each branch, is_nothing is computed once.

If only we had multiline if to make stuff like this more readable...


Only thing we could optimize here is to compute self.is_nothing.iif True self.internal_is_nan in a single pass, by creating a variant of internal_is_nan that will return True for null values. But that seems out of scope of this PR.

@radeusgd
Copy link
Member Author

On occassion of limitting when the convertPolyglotValue helper is invoked (as the #7270 benchmarks showed that it accounts for a significant part of the runtime of various operations), I tried checking if it is even "still" needed.

After reminding myself of why it has been added in the first place, I conclude that it is needed and it will be needed whenever we are accepting date-time values, at least as long as we keep the architecture like the one we have currently.

  1. The Graal Polyglot architecture does not auto-convert our Enso date-time values (EnsoDate, EnsoDateTime, EnsoTimeOfDay) into their Java counterparts, like it does for primitive types (converting Enso Text to Java String) and this is rather unlikely to change as it seems to be done on purpose.

The Graal polyglot values essentially can report isDate, isTime and isZone, specifying if they hold a date, a time-of-day and a timezone. A single value can hold any combination of these three. Graal cannot autoconvert our EnsoDateTime value (which reports true to all three) to ZonedDateTime. Technically, it probably could - but it does not do it. I hypothesize it may be because there may exsit different date-time implementations between languages, so there is no guarantee that such conversion would be lossless (in Enso its essentially a wrapper, so it would be, but that is not guaranteed in any language) and so it is not certain if it should be performed - so Graal leaves that decision to the consumer of the value.

  1. Our current architecture stores date-time values in in-memory Table Columns as Java values (LocalDate, LocalTime and ZonedDateTime). So whenever we get an Object/Value from Enso, we need to inspect if it should be coerced to the date, because as noted in (1) it will not be auto-converted - and that is what convertPolyglotValue is doing.

If we wanted to get rid of convertPolyglotValue, we'd need to somehow change the architecture. Probably to be more Graal-friendly and instead store date/time columns as Object[] or Value[] and then convert to dates more 'lazily'. To get the date/time information from the value in Java we'd have to do the kind of conversion that convertPolyglotValue does anyway at some point and it is unclear whether doing it more lazily would be any better.

Alternatively, hypothetically, to make it 'faster' we could try getting rid of EnsoDate et al. and using raw LocalDate and other Java values in Enso - then the conversion would be necessary. That is of course not a great idea, since the whole EnsoDate wrappers were done for a reason, to get a more user friendly and Enso-centric API for Date. It would also not fully solve the problem, because we can still get date-time values into our columns that come from other languages (e.g. JS, Python).

All-in-all I don't think we can avoid convertPolyglotValue in general (if I'm missing something here and it is possible, I'd be happy to be proved wrong 🙂). What we can do is try running it only if it is really needed - when we are building columns of number where any value that is not a number will be rejected anyway, calling the conversion will be unnecessary. So we only need to call it if the value may be a date and won't be rejected if it is a date.

@radeusgd radeusgd marked this pull request as ready for review July 22, 2023 21:58
@radeusgd radeusgd force-pushed the wip/radeusgd/7231-fix-mixed-ops branch 4 times, most recently from 4f8c924 to c556b0d Compare July 25, 2023 09:38
@radeusgd radeusgd force-pushed the wip/radeusgd/7231-fix-mixed-ops branch from c556b0d to 930b3bb Compare July 25, 2023 14:07
Copy link
Member

@jdunkerley jdunkerley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and great to have this more sorted.
One slight name sort out please.

Comment on lines +197 to +198
set = Set.from_vector values
set.contains
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't be the implementation of contains within Vector?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one, but using set will be faster.

This way here we construct the Set (O(N)) operation once, and save that set in the closure of the generated function, so then we can use the quick (~O(1), hashing-based) set checks.

Config supports_case_sensitive_columns=True order_by=True natural_ordering=False case_insensitive_ordering=True order_by_unicode_normalization_by_default=False case_insensitive_ascii_only=False take_drop=True allows_mixed_type_comparisons=True supports_unicode_normalization=False is_nan_and_nothing_distinct=True distinct_returns_first_row_from_group_if_ordered=True date_time=True fixed_length_text_columns=False supports_decimal_type=False supports_time_duration=False supports_nanoseconds_in_time=False
- supports_mixed_columns: Specifies if the backend supports mixed-type
columns.
Config supports_case_sensitive_columns=True order_by=True natural_ordering=False case_insensitive_ordering=True order_by_unicode_normalization_by_default=False case_insensitive_ascii_only=False take_drop=True allows_mixed_type_comparisons=True supports_unicode_normalization=False is_nan_and_nothing_distinct=True distinct_returns_first_row_from_group_if_ordered=True date_time=True fixed_length_text_columns=False supports_decimal_type=False supports_time_duration=False supports_nanoseconds_in_time=False supports_mixed_columns=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is getting very long...
I suggest we drop supports_ from names. Not necessarily in this PR!

@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Jul 25, 2023
@mergify mergify bot merged commit 4b5a2e2 into develop Jul 25, 2023
24 of 25 checks passed
@mergify mergify bot deleted the wip/radeusgd/7231-fix-mixed-ops branch July 25, 2023 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: No changelog needed Do not require a changelog entry for this PR. CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
3 participants