
[BEAM-8618] Tear down unused DoFns periodically in Python SDK harness. #10655

Merged: 2 commits into apache:master on Feb 27, 2020

Conversation

@sunjincheng121 (Member):

Tear down unused DoFns periodically in Python SDK harness.

[Post-Commit and Pre-Commit test status badge tables (master branch) omitted: Go, Java, Python, and XLang SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners, plus the Java, Python, Go, and Website pre-commit jobs.]
See .test-infra/jenkins/README for the trigger phrases, status, and links of all Jenkins jobs.

@sunjincheng121 (Member, Author):

R: @robertwb @lukecwik

@sunjincheng121 (Member, Author):

R: @mxm @lukecwik

@@ -280,6 +283,7 @@ def get(self, instruction_id, bundle_descriptor_id):
    try:
      # pop() is threadsafe
      processor = self.cached_bundle_processors[bundle_descriptor_id].pop()
      self.last_access_time[bundle_descriptor_id] = time.time()
    except IndexError:
Contributor:

This won't update the access time when we first create the processor in the except block.

Member Author:

This is expected. last_access_time represents the last time the cached_bundle_processors for a given bundle_descriptor was accessed. If it exceeds the time limit, the remaining bundle processors cached in cached_bundle_processors will be shut down. What are your thoughts?

Contributor:

In my eyes, the first time it is accessed is when it is created.

Member Author:

The strategy here is that bundle processors which remain unused after a certain amount of time will be shut down. When a bundle processor is created in the except block, there are no cached (unused) bundle processors. Bundle processors only become unused once they are added to the cached bundle processors list. What about renaming last_access_time to cached_bundle_processors_last_access_time to make this clearer?

Contributor:

I still do not understand. The method is named get, so we access the bundle processor regardless of whether we create it or not. It is cached either way, whether it is created and added to the cache or retrieved from the cache.

Logically, you might want to update the time when putting the processor into the cache. That would be in release.

What is the advantage of updating the time here? It should be sufficient to update it in release, directly before putting it back.

Member Author:

Hmm, if a bundle processor is retrieved from the cache, there is a high possibility that the remaining cached bundle processors will be needed in the future, and so the last access time is updated.

If the bundle processor is newly created, it means that the cached bundle processor list is empty. This is the main reason the last access time is only updated when a bundle processor is retrieved from the cache. However, I think it does no harm to update the last access time in both cases if it makes the code more readable.

What do you think?

Contributor:

> If the bundle processor is newly created, it means that the cached bundle processor list is empty. This is the main reason the last access time is only updated when a bundle processor is retrieved from the cache.

Consider the case where we just have a single bundle processor. When we call get for the first time, we won't update the last-used time. However, every time we retrieve it afterwards, we will update the time, but the list of cached bundle processors will remain empty.

I think we should either (1) always update the last-used timestamp in get, regardless of creation or (2) update it only on release.

I'm leaning towards (2) because, while a bundle processor is in use, it can't be removed anyway. We update the timestamp when we put it back in release.

@sunjincheng121 (Member, Author), Jan 31, 2020:

Regarding the single bundle processor case, it doesn't hurt to update the time, since the cached bundle processors list is empty. However, when there are multiple bundle processors, updating the time here refreshes it for the remaining cached bundle processors and so improves the cache hit rate. I think this is the main difference between solutions (1) and (2). That said, I'm fine with either, as I think both of them work. I will update the PR according to solution (2) if you favor it based on your experience.
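(For illustration, a minimal sketch of solution (2) might look like the following. The class and method names are illustrative assumptions and do not reproduce the exact code merged in this PR; the key point is that the timestamp is refreshed only in release(), since an in-use processor cannot be torn down anyway.)

import collections
import time


class BundleProcessorCacheSketch(object):
  """Illustrative sketch only, not the actual Beam implementation."""

  def __init__(self):
    # Maps bundle_descriptor_id -> list of idle (cached) processors.
    self.cached_bundle_processors = collections.defaultdict(list)
    # Maps bundle_descriptor_id -> time the cache was last touched.
    self.last_access_times = collections.defaultdict(float)

  def get(self, bundle_descriptor_id, create_fn):
    try:
      # pop() is thread-safe; reuse a cached processor when available.
      return self.cached_bundle_processors[bundle_descriptor_id].pop()
    except IndexError:
      # Cache miss: create a new processor. The timestamp is not updated
      # here because the cache for this descriptor is empty anyway.
      return create_fn()

  def release(self, bundle_descriptor_id, processor):
    # Solution (2): refresh the timestamp when the processor becomes idle.
    self.last_access_times[bundle_descriptor_id] = time.time()
    self.cached_bundle_processors[bundle_descriptor_id].append(processor)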

Three further review threads on sdks/python/apache_beam/runners/worker/sdk_worker.py were marked outdated and resolved.
@sunjincheng121 (Member, Author):

Thanks for the review and valuable comments. I have addressed the comments and would appreciate it if you could have another look :) @mxm

@mxm (Contributor) left a comment:

LGTM. Thank you for addressing the comments.

      for descriptor_id, last_access_time in \
          self.cached_bundle_processors_last_access_time.items():
        if time.time() - last_access_time > \
            DEFAULT_BUNDLE_PROCESSOR_CACHE_THRESHOLD_S:
Contributor:

nit: you could surround the condition with parentheses, which would make it easier to read, e.g.

       if (time.time() - last_access_time >
                DEFAULT_BUNDLE_PROCESSOR_CACHE_THRESHOLD_S):

@@ -69,6 +71,8 @@
# 5 minutes * 60 seconds * 1000 millis * 1000 micros * 1000 nanoseconds
DEFAULT_LOG_LULL_TIMEOUT_NS = 5 * 60 * 1000 * 1000 * 1000

DEFAULT_BUNDLE_PROCESSOR_CACHE_THRESHOLD_S = 60
Contributor:
Suggested change:
- DEFAULT_BUNDLE_PROCESSOR_CACHE_THRESHOLD_S = 60
+ DEFAULT_BUNDLE_PROCESSOR_CACHE_SHUTDOWN_THRESHOLD_S = 60

  def _schedule_periodic_shutdown(self):
    def shutdown_inactive_bundle_processors():
      for descriptor_id, last_access_time in \
          self.cached_bundle_processors_last_access_time.items():
Contributor:

Same here, use parentheses.

Member Author:

It seems that parentheses are not supported in the for loop?

Member:

It should; there are many examples of for loops using parentheses in the code base already.

Member Author:

Ah, I have added a local variable, so this problem should no longer occur. Does that make sense to you?
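(For context, a rough sketch of how such a periodic shutdown could be wired up, using a local variable instead of the backslash continuations discussed above and a self-rescheduling threading.Timer. The class name, the shutdown() call on the processor, and the 60-second threshold are assumptions rather than the exact merged code.)

import threading
import time

DEFAULT_BUNDLE_PROCESSOR_CACHE_SHUTDOWN_THRESHOLD_S = 60


class PeriodicShutdownSketch(object):
  """Illustrative sketch only, not the code merged in this PR."""

  def __init__(self):
    self.cached_bundle_processors = {}  # descriptor_id -> [processor, ...]
    self.cached_bundle_processors_last_access_time = {}  # descriptor_id -> ts
    self._periodic_shutdown = None

  def _schedule_periodic_shutdown(self):
    def shutdown_inactive_bundle_processors():
      # Bind items() to a local variable so the for-loop header fits on one
      # line without backslash continuations.
      access_times = list(
          self.cached_bundle_processors_last_access_time.items())
      for descriptor_id, last_access_time in access_times:
        if (time.time() - last_access_time >
            DEFAULT_BUNDLE_PROCESSOR_CACHE_SHUTDOWN_THRESHOLD_S):
          # Tear down and drop every idle processor for this descriptor.
          for processor in self.cached_bundle_processors.pop(
              descriptor_id, []):
            processor.shutdown()
          self.cached_bundle_processors_last_access_time.pop(
              descriptor_id, None)
      # Re-arm the timer so the check keeps running periodically.
      self._periodic_shutdown = threading.Timer(
          DEFAULT_BUNDLE_PROCESSOR_CACHE_SHUTDOWN_THRESHOLD_S,
          shutdown_inactive_bundle_processors)
      self._periodic_shutdown.daemon = True
      self._periodic_shutdown.start()

    shutdown_inactive_bundle_processors()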

@kamilwu (Contributor) commented Feb 6, 2020:

Just to let you know that we've just introduced a Python autoformatter. Your merge conflict might be a result of this.
You can find instructions on how to run the autoformatter at https://cwiki.apache.org/confluence/display/BEAM/Python+Tips, in the Formatting section.
Sorry for the inconvenience.

@sunjincheng121 (Member, Author):

I have squashed the commits and rebased the code.

@sunjincheng121 (Member, Author):

R: @lukecwik @mxm

@mxm (Contributor) commented Feb 24, 2020:

retest this please

@mxm (Contributor) left a review comment:

Linting needs to be fixed. See comment from @kamilwu.

@sunjincheng121 (Member, Author):

Run PythonLint PreCommit

@mxm (Contributor) commented Feb 25, 2020:

08:17:21 > Task :sdks:python:test-suites:tox:py2:lintPy27_3
08:17:21 ************* Module apache_beam.runners.worker.sdk_worker
08:17:21 W:372,14: dict.items referenced when not iterating (dict-items-not-iterating)

https://builds.apache.org/job/beam_PreCommit_PythonLint_Phrase/70/
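
(For reference, dict-items-not-iterating is pylint's Python 2/3 compatibility check that fires when dict.items() is referenced outside of direct iteration, for example when it is bound to a local variable. A common way to satisfy it, though not necessarily the exact change made in this PR, is to materialize the items explicitly:)

# Example data; the real code iterates over the per-descriptor access times.
last_access_times = {'descriptor-1': 0.0, 'descriptor-2': 42.0}

# Binding list(...) instead of the bare items() keeps the behaviour identical
# on Python 2 and Python 3 and satisfies dict-items-not-iterating.
access_times = list(last_access_times.items())
for descriptor_id, last_access_time in access_times:
  print(descriptor_id, last_access_time)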

@sunjincheng121 (Member, Author):

Thanks for the info @mxm, I have fixed the style issue and updated the PR :)

@robertwb merged commit 4a25aa0 into apache:master on Feb 27, 2020.