[FLINK-17135][python][tests] Fix the test testPandasFunctionMixedWithGeneralPythonFunction to make it more stable #11771

dianfu · 2020-04-16T09:13:32Z

What is the purpose of the change

This pull request changes the test testPandasFunctionMixedWithGeneralPythonFunction a bit to work around the bug introduced in CALCITE-3149.

Brief change log

Update the test testPandasFunctionMixedWithGeneralPythonFunction

Verifying this change

This change added tests and can be verified as follows:

Updated tests testPandasFunctionMixedWithGeneralPythonFunction

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (no)
If yes, how is the feature documented? (not applicable)

flinkbot · 2020-04-16T09:16:39Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit d5143e5 (Thu Apr 16 09:16:39 UTC 2020)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

_{Mention the bot in a comment to re-run the automated checks.}

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into to Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2020-04-16T09:53:09Z

CI report:

a602c4a Travis: CANCELED Azure: FAILURE
a04d3b0 UNKNOWN

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

dianfu · 2020-04-16T13:41:50Z

The blink planner tests have passed in my personal azure pipelines: https://dev.azure.com/dianfu/58b46dfa-b96b-4b3f-a321-ce7644c4503b/_apis/build/builds/29/logs/108

hequn8128

@dianfu Thanks a lot for the fix. Looks good to me.
@dawidwys Do you want to take a look at this?

dawidwys · 2020-04-20T06:45:24Z

I don't have any comments to the changed code. I don't see though how it can fix the described issue. Can you describe how it fixes anything? I am sure this will be helpful for anyone trying to understand the change in the future.

I think either in the commit message or even here in the PR would be fine. Thanks.

dianfu · 2020-04-20T07:10:56Z

It's a good idea to add description in the PR commit message and will do that. For this issue itself, I have already explained in the JIRA. Let me try to explain it in the PR again:
It's a bug introduced by CALCITE-3149. It ensures that the RelDataType cache (stored in DATATYPE_CACHE) in RelDataTypeFactoryImpl could be garbage collected. However, the equality check still checks the object reference instead using object.equals. It may cause issues in cases that the cache DATATYPE_CACHE has been garbage collected and at the same time there are still references to the old RelDataType. During debugging this problem, I have saw there are RelDataTypes which are not in the DATATYPE_CACHE cache and this is the root cause of the test failure reported in this JIRA. Theoretically speaking, all the SQL jobs have chances to encounter this bug. We need to fix in the calcite eventually. @danny0405 is already helping on that. Regarding to the fix in this PR, it just adjust the test case a bit to make it doesn't encounter into the bug.

dawidwys · 2020-04-20T07:27:26Z

I understood the underlying issue. I don't get though how this PR works the issue around. How do the adjustments make it less probable?

dianfu · 2020-04-20T08:16:16Z

There are two caches in RelDataTypeFactoryImpl: KEY2TYPE_CACHE and DATATYPE_CACHE. KEY2TYPE_CACHE caches the mapping of Key(consists of field names and field types, etc) to RelDataType and can be used for the canonization of row types per my understanding. DATATYPE_CACHE caches the RelDataType instances.

PythonCalcSplitRule will split a Calc RelNode which contains both non-vectorized Python UDF and vectorized Python UDF into two Calc RelNodes.

For the failure test case, the output type of the bottom Calc consists of two fields (f0: INTEGER, f1: INTEGER), let's call it row_type_0. This row type is already available in the cache (generated by other test cases, it's held in variable KEY2TYPE_CACHE) and so it will hit the cache when constructing this row type. However, during debugging, I found that the INTEGER type referenced by row_type_0 is already cleaned up from the cache DATATYPE_CACHE. Then when constructing the RexProgram for the top Calc, it creates another INTEGER type and failure happens.

To work around this problem, we adjust the test case a bit to make the output row type of the bottom Calc consisting of three fields instead of two fields to make the cache hit fail. It seems a little hack, however, it did could solve this problem. I'm glad to try if there is more elegant way to address this problem wholly in Flink which could avoid this problem thoroughly. Do you have any suggestions? Glad to hear!

dianfu · 2020-04-20T08:29:53Z

May be we can revert the change of calcite-3149 in Flink? Does that makes sense for you?

dawidwys · 2020-04-20T08:29:54Z

Thanks for the explanation @dianfu. Really helpful. Now I understand the change. Personally I don't have a better idea from the top of my head. Let's see what @danny0405 comes up with.

+1 from my side.

…GeneralPythonFunction to make it more stable There are two caches in RelDataTypeFactoryImpl: KEY2TYPE_CACHE and DATATYPE_CACHE. KEY2TYPE_CACHE caches the mapping of Key(consists of field names and field types, etc) to RelDataType and can be used for the canonization of row types. DATATYPE_CACHE caches the RelDataType instances. PythonCalcSplitRule will split a Calc RelNode which contains both non-vectorized Python UDF and vectorized Python UDF into two Calc RelNodes. For the test case testPandasFunctionMixedWithGeneralPythonFunction, the output type of the bottom Calc consists of two fields (f0: INTEGER, f1: INTEGER), let's call it row_type_0. This row type is already available in the cache (generated by other test cases, held in variable KEY2TYPE_CACHE) and so it will hit the cache when constructing this row type for the bottle Calc. However, during debugging, I found that the INTEGER type referenced by row_type_0 is already cleaned up from the cache DATATYPE_CACHE. Then when constructing the RexProgram for the top Calc, it creates another INTEGER type and failure happens. To work around this problem, we adjust the test case a bit to make the output row type of the bottom Calc consisting of three fields instead of two fields to make the cache hit fail.

dianfu · 2020-04-20T12:57:03Z

Not sure why the tests still not triggered in PR. It has already passed in my private travis and azure pipelines.
travis: https://travis-ci.org/github/dianfu/flink/builds/677165020?utm_medium=notification&utm_source=email
azure pipeline: https://dev.azure.com/dianfu/Flink/_build/results?buildId=34&view=results

danny0405

Not sure if the issue is fixed if there is another CALC with same type of the changed CALC, maybe we should LOG an TODO and issue ID here and revert the change if we fix that from the CALCITE side.

dianfu · 2020-04-21T02:27:38Z

@danny0405 Thanks a lot for your comments. Personally I think there is no need to revert this change even after the issue is fixed in calcite. The changed test case is to test the functionality of PythonCalcSplitRule and it's only updated a bit to work around the calcite bug, however, the test case itself is still a valid test case for its test purpose.

danny0405 · 2020-04-21T02:32:57Z

Thanks, i think it is ready for merge.

hequn8128 · 2020-04-21T02:34:32Z

@dawidwys @danny0405 Thanks a lot for your double-check on this.

Merging...

rmetzger added the review=description? label Apr 16, 2020

rmetzger added component=API/Python component=Tests labels Apr 16, 2020

hequn8128 self-assigned this Apr 17, 2020

hequn8128 reviewed Apr 17, 2020

View reviewed changes

dianfu force-pushed the FLINK-17135 branch from d5143e5 to a602c4a Compare April 20, 2020 08:47

dianfu force-pushed the FLINK-17135 branch from a602c4a to a04d3b0 Compare April 20, 2020 09:36

danny0405 approved these changes Apr 21, 2020

View reviewed changes

hequn8128 merged commit 34105c7 into apache:master Apr 21, 2020

dianfu deleted the FLINK-17135 branch June 10, 2020 03:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FLINK-17135][python][tests] Fix the test testPandasFunctionMixedWithGeneralPythonFunction to make it more stable #11771

[FLINK-17135][python][tests] Fix the test testPandasFunctionMixedWithGeneralPythonFunction to make it more stable #11771

dianfu commented Apr 16, 2020

flinkbot commented Apr 16, 2020

flinkbot commented Apr 16, 2020 •

edited

Loading

dianfu commented Apr 16, 2020

hequn8128 left a comment

dawidwys commented Apr 20, 2020

dianfu commented Apr 20, 2020 •

edited

Loading

dawidwys commented Apr 20, 2020 •

edited

Loading

dianfu commented Apr 20, 2020

dianfu commented Apr 20, 2020

dawidwys commented Apr 20, 2020

dianfu commented Apr 20, 2020

danny0405 left a comment

dianfu commented Apr 21, 2020

danny0405 commented Apr 21, 2020

hequn8128 commented Apr 21, 2020

[FLINK-17135][python][tests] Fix the test testPandasFunctionMixedWithGeneralPythonFunction to make it more stable #11771

[FLINK-17135][python][tests] Fix the test testPandasFunctionMixedWithGeneralPythonFunction to make it more stable #11771

Conversation

dianfu commented Apr 16, 2020

What is the purpose of the change

Brief change log

Verifying this change

Does this pull request potentially affect one of the following parts:

Documentation

flinkbot commented Apr 16, 2020

Automated Checks

Review Progress

flinkbot commented Apr 16, 2020 • edited Loading

CI report:

dianfu commented Apr 16, 2020

hequn8128 left a comment

Choose a reason for hiding this comment

dawidwys commented Apr 20, 2020

dianfu commented Apr 20, 2020 • edited Loading

dawidwys commented Apr 20, 2020 • edited Loading

dianfu commented Apr 20, 2020

dianfu commented Apr 20, 2020

dawidwys commented Apr 20, 2020

dianfu commented Apr 20, 2020

danny0405 left a comment

Choose a reason for hiding this comment

dianfu commented Apr 21, 2020

danny0405 commented Apr 21, 2020

hequn8128 commented Apr 21, 2020

flinkbot commented Apr 16, 2020 •

edited

Loading

dianfu commented Apr 20, 2020 •

edited

Loading

dawidwys commented Apr 20, 2020 •

edited

Loading