[BEAM-2732] Starting refactor of state tracking in Python #4375
robertwb merged 5 commits into apache:master
…e Python-only state sampler full functionality.
r: @robertwb
Python tests passing.
Run Python PostCommit
robertwb left a comment:
Sorry for the delay, here are some initial comments.
    from apache_beam.utils.counters import CounterName

    class StateSamplerSlowTest(unittest.TestCase):
It would be better to run the same test on both implementations rather than have a copy. (Possibly conditionally test those parts that aren't implemented of course.)
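One way to run the same tests against both implementations is a shared mixin with one thin subclass per implementation. The sketch below is only an illustration: `SlowSampler`, `FastSampler`, `supports_sampling` and `current_state` are hypothetical stand-ins, not names from the Beam code base.

```python
import unittest


class SlowSampler(object):
    """Hypothetical stand-in for a pure-Python sampler."""
    supports_sampling = False

    def current_state(self):
        return 'idle'


class FastSampler(SlowSampler):
    """Hypothetical stand-in for a compiled sampler."""
    supports_sampling = True


class SamplerTestMixin(object):
    """Tests shared by every implementation; subclasses set sampler_class."""
    sampler_class = None

    def test_initial_state(self):
        self.assertEqual(self.sampler_class().current_state(), 'idle')

    def test_sampling(self):
        # Conditionally skip the parts an implementation does not provide.
        if not self.sampler_class.supports_sampling:
            self.skipTest('sampling is not implemented by this sampler')
        self.assertTrue(self.sampler_class.supports_sampling)


class SlowSamplerTest(SamplerTestMixin, unittest.TestCase):
    sampler_class = SlowSampler


class FastSamplerTest(SamplerTestMixin, unittest.TestCase):
    sampler_class = FastSampler
```

Because the mixin does not inherit from `unittest.TestCase`, the test loader only collects the two concrete subclasses.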
    DEFAULT_SAMPLING_PERIOD_MS = 200

    class StateSampler(object):
If we're going to provide a full implementation here, let's make a common base class and put anything that can be shared there. (Possibly the default sampling period as well.)
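A minimal sketch of what such a base class could look like, assuming the shared pieces are the prefix handling, the sampling period and the started/finished flags. `StateSamplerBase`, `PythonStateSampler` and `_start_sampling` are hypothetical names chosen for illustration, and `endswith` is used here in place of the indexing shown in the diff so that an empty prefix does not raise.

```python
DEFAULT_SAMPLING_PERIOD_MS = 200


class StateSamplerBase(object):
    """Hypothetical base class holding what both implementations share."""

    def __init__(self, prefix, sampling_period_ms=DEFAULT_SAMPLING_PERIOD_MS):
        # Drop a trailing dash from the prefix, as the diff above does.
        self.prefix = prefix[:-1] if prefix.endswith('-') else prefix
        self.sampling_period_ms = sampling_period_ms
        self.started = False
        self.finished = False

    def start(self):
        self.started = True
        self._start_sampling()

    def stop(self):
        self.finished = True

    def _start_sampling(self):
        # Each implementation supplies its own sampling mechanism.
        raise NotImplementedError


class PythonStateSampler(StateSamplerBase):
    """Hypothetical pure-Python implementation."""

    def _start_sampling(self):
        pass  # A real implementation would start a sampling thread here.
```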
    # the worker.
    # We stop using prefixes with included dash.
    self.prefix = prefix[:-1] if prefix[-1] == '-' else prefix
    EXECUTION_STATE_SAMPLERS.set_sampler(self)
The act of creating a StateSampler should not modify global state.
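To illustrate the point, the sketch below keeps the constructor free of side effects and makes registration an explicit step taken by the caller. `SamplerRegistry`, `Sampler`, `register` and `current` are hypothetical names, not the API that was eventually merged.

```python
class SamplerRegistry(object):
    """Hypothetical holder for the currently active sampler."""

    def __init__(self):
        self._current = None

    def register(self, sampler):
        self._current = sampler

    def current(self):
        return self._current


class Sampler(object):
    """Hypothetical sampler whose constructor touches no shared state."""

    def __init__(self, prefix):
        self.prefix = prefix


registry = SamplerRegistry()
sampler = Sampler('stage')
# Nothing is registered until the caller asks for it.
assert registry.current() is None
registry.register(sampler)
assert registry.current() is sampler
```

With this split, tests can build as many samplers as they like without affecting each other.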
    self._current_sampler = sampler

    EXECUTION_STATE_SAMPLERS = ExecutionStateSamplers()
I'd really like to avoid more proliferation of global variables. This is somewhat needed for counters (as the user invokes counter operations without access to the underlying state) but we should be able to pass this state around explicitly.
Run Python PostCommit
pabloem left a comment:
Refactored the change. Let me know what you think.
    from apache_beam.utils.counters import CounterName

    class StateSamplerSlowTest(unittest.TestCase):
robertwb wrote:
It would be better to run the same test on both implementations rather than have a copy. (Possibly conditionally test those parts that aren't implemented of course.)
Acknowledged.
    DEFAULT_SAMPLING_PERIOD_MS = 200

    class StateSampler(object):
robertwb wrote:
If we're going to provide a full implementation here, let's make a common base class and put anything that can be shared there. (Possibly the default sampling period as well.)
I've restructured it into a class hierarchy. Let me know what you think.
    # the worker.
    # We stop using prefixes with included dash.
    self.prefix = prefix[:-1] if prefix[-1] == '-' else prefix
    EXECUTION_STATE_SAMPLERS.set_sampler(self)
robertwb wrote:
The act of creating a StateSampler should not modify global state.
Added a register function.
    self._current_sampler = sampler

    EXECUTION_STATE_SAMPLERS = ExecutionStateSamplers()
robertwb wrote:
I'd really like to avoid more proliferation of global variables. This is somewhat needed for counters (as the user invokes counter operations without access to the underlying state) but we should be able to pass this state around explicitly.
I'll document that we want to pass a tracker around as much as possible, but the aim of this global variable is to be able to remove the global state within MetricsEnvironment and PerThreadLoggingContext, and have this be the canonical global state.
How does that sound?
Run Python PostCommit
@robertwb - I've created a class hierarchy. Passes Python PostCommit and PreCommit.
    from apache_beam.runners.worker.statesampler import DEFAULT_SAMPLING_PERIOD_MS
    except ImportError:
    DEFAULT_SAMPLING_PERIOD_MS = 0
    DEFAULT_SAMPLING_PERIOD_MS = 0
Nit: single assignment is easier to reason about. Put this into an else clause.
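One reading of this suggestion is that each name should be assigned exactly once on every code path, with the success case handled in the `else` clause of the `try` statement. The sketch below shows that shape with a hypothetical helper, `load_sampling_period`, which takes a module name as a parameter; it is not the code that was merged.

```python
import importlib


def load_sampling_period(module_name):
    """Return the sampling period exported by module_name, or 0 if absent."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Import failed: the only assignment on this path.
        period_ms = 0
    else:
        # Import succeeded: the only assignment on this path.
        period_ms = module.DEFAULT_SAMPLING_PERIOD_MS
    return period_ms
```

Keeping the attribute lookup in the `else` clause also means an unrelated error raised there is not mistaken for a failed import.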
    self.counter_factory = counter_factory
    self.sampling_period_ms = sampling_period_ms
    def __init__(self, *args):
    #TODO(pabloem): Figure out how to pass arguments without errors.
What errors were you getting?
    cdef public int64_t state_transition_count
    cdef int64_t time_since_transition
    cdef int64_t _time_since_transition
Why this change? Or should we be changing all the others as well for consistency?
    @@ -190,60 +158,28 @@ cdef class StateSampler(object):
    if self.started and not self.finished:
Is this method also common to both?
    def __str__(self):
    return '%s' % self._str_internal()
    return '<CounterName<%s> at %s>' % (self._str_internal(), hex(id(self)))
__str__ defaults to __repr__; if you want them to be the same, remove this one.
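The fallback is standard Python behavior: when a class defines only `__repr__`, `str()` uses it. A small sketch with a hypothetical class, `CounterNameSketch`, that mimics the format string from the diff:

```python
class CounterNameSketch(object):
    """Hypothetical class that defines __repr__ but not __str__."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return '<CounterNameSketch<%s> at %s>' % (self.name, hex(id(self)))


c = CounterNameSketch('elements')
# str() falls back to __repr__ because no __str__ is defined.
assert str(c) == repr(c)
```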
    # pylint: disable=global-variable-not-assigned
    global statesampler
    from apache_beam.runners.worker import statesampler
    # pylint: disable=unused-variable
You could also put this method in a
    @classmethod
    def setUpClass(cls):
        ...
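As a rough sketch of that suggestion, the import can happen once per test class and be stored on the class, which avoids the `global` statement. The module name below is the standard library's `json`, used purely as a stand-in for the sampler module; `ImportOnceTest` and `module_name` are hypothetical.

```python
import importlib
import unittest


class ImportOnceTest(unittest.TestCase):
    # Stand-in module name; the real test would import the sampler module.
    module_name = 'json'

    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class, before any test method.
        cls.module = importlib.import_module(cls.module_name)

    def test_module_is_available(self):
        self.assertTrue(hasattr(self.module, 'dumps'))
```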
    self._current_sampler = sampler

    EXECUTION_STATE_SAMPLERS = ExecutionStateSamplers()
Please get rid of this unused global state.
Let's not add the global variable until it's actually used, in which case we can weigh the pros and cons.
    from apache_beam.runners.worker.statesampler import DEFAULT_SAMPLING_PERIOD_MS
    except ImportError:
    DEFAULT_SAMPLING_PERIOD_MS = 0
    DEFAULT_SAMPLING_PERIOD_MS = 0
robertwb wrote:
Nit: single assignment is easier to reason about. Put this into an else clause.
Done.
    @@ -190,60 +158,28 @@ cdef class StateSampler(object):
    if self.started and not self.finished:
robertwb wrote:
Is this method also common to both?
You're right. Done.
    def __str__(self):
    return '%s' % self._str_internal()
    return '<CounterName<%s> at %s>' % (self._str_internal(), hex(id(self)))
robertwb wrote:
__str__ defaults to __repr__, if you want them to be the same remove this one.
Done.
    cdef public int64_t state_transition_count
    cdef int64_t time_since_transition
    cdef int64_t _time_since_transition
robertwb wrote:
Why this change? Or should we be changing all the others as well for consistency?
I had thought this was only C-accessible. I've kept it consistent.
    self._current_sampler = sampler

    EXECUTION_STATE_SAMPLERS = ExecutionStateSamplers()
robertwb wrote:
Please get rid of this unused global state.
I plan to use the global state to track metrics and logging (and eventually remove the old style metrics context). I can add this now, or on the next PR. What do you think?
Next PR: https://github.com/apache/beam/pull/4387/files#diff-89072ff532dd15a0af899957de4c26f3R152
    self.counter_factory = counter_factory
    self.sampling_period_ms = sampling_period_ms
    def __init__(self, *args):
    #TODO(pabloem): Figure out how to pass arguments without errors.
robertwb wrote:
What errors were you getting?
Hmm, this is fixed now.
Force-pushed from 7323851 to 5c30712.
Run Python PostCommit
    from apache_beam.runners.worker import statesampler
    # pylint: disable=unused-variable
    from apache_beam.runners.worker import statesampler_fast
    cls.slow_sampler = False
Actually, instead of duplicating the logic, perhaps just query state_sampler.FAST_SAMPLER directly below.
Force-pushed from 17b329e to d3d0cf6.
Fixed, and done.
Thanks. Non-Java runs are green. Merging.
Also giving the Python-only state sampler full functionality.
The goal for BEAM-2732 is to refactor the context trackers in the Python SDK so that they will all use the same mechanism.
Currently, Metrics, Logging and StateSampler keep their own contexts. BEAM-2732 aims to have all of them rely on the logic in StateSampler to keep their context (this is already the case in Java).
This PR does the following: