Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-17135][python][tests] Fix the test testPandasFunctionMixedWithGeneralPythonFunction to make it more stable #11771

Merged
merged 1 commit into from
Apr 21, 2020

Conversation

dianfu
Copy link
Contributor

@dianfu dianfu commented Apr 16, 2020

What is the purpose of the change

This pull request changes the test testPandasFunctionMixedWithGeneralPythonFunction a bit to work around the bug introduced in CALCITE-3149.

Brief change log

  • Update the test testPandasFunctionMixedWithGeneralPythonFunction

Verifying this change

This change added tests and can be verified as follows:

  • Updated tests testPandasFunctionMixedWithGeneralPythonFunction

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@flinkbot
Copy link
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit d5143e5 (Thu Apr 16 09:16:39 UTC 2020)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Apr 16, 2020

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@dianfu
Copy link
Contributor Author

dianfu commented Apr 16, 2020

The blink planner tests have passed in my personal azure pipelines: https://dev.azure.com/dianfu/58b46dfa-b96b-4b3f-a321-ce7644c4503b/_apis/build/builds/29/logs/108

@hequn8128 hequn8128 self-assigned this Apr 17, 2020
Copy link
Contributor

@hequn8128 hequn8128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dianfu Thanks a lot for the fix. Looks good to me.
@dawidwys Do you want to take a look at this?

@dawidwys
Copy link
Contributor

I don't have any comments to the changed code. I don't see though how it can fix the described issue. Can you describe how it fixes anything? I am sure this will be helpful for anyone trying to understand the change in the future.

I think either in the commit message or even here in the PR would be fine. Thanks.

@dianfu
Copy link
Contributor Author

dianfu commented Apr 20, 2020

It's a good idea to add description in the PR commit message and will do that. For this issue itself, I have already explained in the JIRA. Let me try to explain it in the PR again:
It's a bug introduced by CALCITE-3149. It ensures that the RelDataType cache (stored in DATATYPE_CACHE) in RelDataTypeFactoryImpl could be garbage collected. However, the equality check still checks the object reference instead using object.equals. It may cause issues in cases that the cache DATATYPE_CACHE has been garbage collected and at the same time there are still references to the old RelDataType. During debugging this problem, I have saw there are RelDataTypes which are not in the DATATYPE_CACHE cache and this is the root cause of the test failure reported in this JIRA. Theoretically speaking, all the SQL jobs have chances to encounter this bug. We need to fix in the calcite eventually. @danny0405 is already helping on that. Regarding to the fix in this PR, it just adjust the test case a bit to make it doesn't encounter into the bug.

@dawidwys
Copy link
Contributor

dawidwys commented Apr 20, 2020

I understood the underlying issue. I don't get though how this PR works the issue around. How do the adjustments make it less probable?

@dianfu
Copy link
Contributor Author

dianfu commented Apr 20, 2020

There are two caches in RelDataTypeFactoryImpl: KEY2TYPE_CACHE and DATATYPE_CACHE. KEY2TYPE_CACHE caches the mapping of Key(consists of field names and field types, etc) to RelDataType and can be used for the canonization of row types per my understanding. DATATYPE_CACHE caches the RelDataType instances.

PythonCalcSplitRule will split a Calc RelNode which contains both non-vectorized Python UDF and vectorized Python UDF into two Calc RelNodes.

For the failure test case, the output type of the bottom Calc consists of two fields (f0: INTEGER, f1: INTEGER), let's call it row_type_0. This row type is already available in the cache (generated by other test cases, it's held in variable KEY2TYPE_CACHE) and so it will hit the cache when constructing this row type. However, during debugging, I found that the INTEGER type referenced by row_type_0 is already cleaned up from the cache DATATYPE_CACHE. Then when constructing the RexProgram for the top Calc, it creates another INTEGER type and failure happens.

To work around this problem, we adjust the test case a bit to make the output row type of the bottom Calc consisting of three fields instead of two fields to make the cache hit fail. It seems a little hack, however, it did could solve this problem. I'm glad to try if there is more elegant way to address this problem wholly in Flink which could avoid this problem thoroughly. Do you have any suggestions? Glad to hear!

@dianfu
Copy link
Contributor Author

dianfu commented Apr 20, 2020

May be we can revert the change of calcite-3149 in Flink? Does that makes sense for you?

@dawidwys
Copy link
Contributor

Thanks for the explanation @dianfu. Really helpful. Now I understand the change. Personally I don't have a better idea from the top of my head. Let's see what @danny0405 comes up with.

+1 from my side.

…GeneralPythonFunction to make it more stable

There are two caches in RelDataTypeFactoryImpl: KEY2TYPE_CACHE and
DATATYPE_CACHE. KEY2TYPE_CACHE caches the mapping of Key(consists of
field names and field types, etc) to RelDataType and can be used for the
canonization of row types. DATATYPE_CACHE caches the RelDataType instances.

PythonCalcSplitRule will split a Calc RelNode which contains both
non-vectorized Python UDF and vectorized Python UDF into two Calc RelNodes.

For the test case testPandasFunctionMixedWithGeneralPythonFunction,
the output type of the bottom Calc consists of two fields (f0: INTEGER, f1: INTEGER),
let's call it row_type_0. This row type is already available in the cache
(generated by other test cases, held in variable KEY2TYPE_CACHE) and so it will hit the
cache when constructing this row type for the bottle Calc. However, during debugging, I
found that the INTEGER type referenced by row_type_0 is already cleaned
up from the cache DATATYPE_CACHE. Then when constructing the RexProgram
for the top Calc, it creates another INTEGER type and failure happens.

To work around this problem, we adjust the test case a bit to make the
output row type of the bottom Calc consisting of three fields instead of
two fields to make the cache hit fail.
@dianfu
Copy link
Contributor Author

dianfu commented Apr 20, 2020

Not sure why the tests still not triggered in PR. It has already passed in my private travis and azure pipelines.
travis: https://travis-ci.org/github/dianfu/flink/builds/677165020?utm_medium=notification&utm_source=email
azure pipeline: https://dev.azure.com/dianfu/Flink/_build/results?buildId=34&view=results

Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the issue is fixed if there is another CALC with same type of the changed CALC, maybe we should LOG an TODO and issue ID here and revert the change if we fix that from the CALCITE side.

@dianfu
Copy link
Contributor Author

dianfu commented Apr 21, 2020

@danny0405 Thanks a lot for your comments. Personally I think there is no need to revert this change even after the issue is fixed in calcite. The changed test case is to test the functionality of PythonCalcSplitRule and it's only updated a bit to work around the calcite bug, however, the test case itself is still a valid test case for its test purpose.

@danny0405
Copy link
Contributor

Thanks, i think it is ready for merge.

@hequn8128
Copy link
Contributor

@dawidwys @danny0405 Thanks a lot for your double-check on this.

Merging...

@hequn8128 hequn8128 merged commit 34105c7 into apache:master Apr 21, 2020
@dianfu dianfu deleted the FLINK-17135 branch June 10, 2020 03:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants