Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-23020][table-planner] Use FlinkDefaultRelMetadataProvider when accessing FlinkRelMetadataQuery from a different thread #20655

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

swtwsk
Copy link

@swtwsk swtwsk commented Aug 22, 2022

What is the purpose of the change

FlinkRelMetadataQuery depends on Apache Calcite RelMetadataQuery's THREAD_PROVIDERS field, which is a ThreadLocal. FlinkRelOptClusterFactory (implicitly) sets THREAD_PROVIDERS to the constant FlinkDefaultRelMetadataProvider instance. However, due to the fact that THREAD_PROVIDERS is an instance of a ThreadLocal, when a different thread tries to access RelMetadataQuery::instance, NullPointerException gets thrown. This change ensures that RelMetadataQuery has a reference to FlinkDefaultRelMetadataProvider regardless of the thread it gets accessed by.

Brief change log

  • Use FlinkDefaultRelMetadataProvider when accessing FlinkRelMetadataQuery from a different thread.

Verifying this change

This change added tests and can be verified as follows:

  • Added multithreaded test FlinkRelMetadataQueryThreadingTest
  • Manually verified the change by running a simple PyFlink job inside Jupyter notebook, interrupting kernel execution (sending SIGINT to py4j/Flink instance), and running the job again (basically running the example described in the JIRA ticket)

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): don't know
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

… accessing FlinkRelMetadataQuery from a different thread
@flinkbot
Copy link
Collaborator

flinkbot commented Aug 22, 2022

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@fapaul
Copy link

fapaul commented Feb 22, 2023

@snuyanzin do you have time to help with this PR? In general, I am missing a lot of nuances in Flink's table API, so would be good if someone with more knowledge can take a look at this change.

From my side, moving towards making the TableEnvironment thread-safe is a good idea, but I can easily miss something here.

cc @twalthr

@snuyanzin
Copy link
Contributor

snuyanzin commented Feb 24, 2023

I read through the jira issue and from what I see there I feel a lack of test reproducing jira issue. Can we have such here to be sure that this is the actual root cause?

I'm asking since i was even fail to see any logs/traces in issue. So no idea what's NPE is discussing there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants