Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-2732][BEAM-4028] Logging relies on StateSampler for context #5356

Merged
merged 1 commit into from Jul 9, 2018

Conversation

pabloem
Copy link
Member

@pabloem pabloem commented May 14, 2018

The Logging module will no longer implement its own context tracking for step and stage names.

Because it no longer plays part in high-performance operations, I'm removing it and its cython annotations.

Passing PostCommit: https://builds.apache.org/job/beam_PostCommit_Python_Verify/4985/

@pabloem
Copy link
Member Author

pabloem commented May 14, 2018

Run Python Dataflow ValidatesRunner

@pabloem
Copy link
Member Author

pabloem commented May 14, 2018

Run Python PostCommit

1 similar comment
@pabloem
Copy link
Member Author

pabloem commented May 15, 2018

Run Python PostCommit

@pabloem
Copy link
Member Author

pabloem commented May 15, 2018

Run Python PostCommit

@pabloem
Copy link
Member Author

pabloem commented May 17, 2018

Run Python PostCommit

@pabloem
Copy link
Member Author

pabloem commented May 18, 2018

Run Python Dataflow ValidatesRunner

@pabloem pabloem changed the title [DO NOT REVIEW] Logging relies on StateSampler for context [BEAM-2732] Logging relies on StateSampler for context Jun 4, 2018
@pabloem pabloem changed the title [BEAM-2732] Logging relies on StateSampler for context [BEAM-2732][BEAM-4028] Logging relies on StateSampler for context Jun 4, 2018
@pabloem
Copy link
Member Author

pabloem commented Jun 4, 2018

This unifies context management in Python, which simplifies further feature work, and also expands the use of NameContext, which should improve the separation of runner and sdk harness.

@pabloem pabloem force-pushed the ss-logging branch 2 times, most recently from cead665 to 84f8fb6 Compare June 26, 2018 18:56
@pabloem
Copy link
Member Author

pabloem commented Jun 26, 2018

Run Python Dataflow ValidatesRunner

@pabloem
Copy link
Member Author

pabloem commented Jun 26, 2018

Run Python PreCommit

1 similar comment
@pabloem
Copy link
Member Author

pabloem commented Jun 27, 2018

Run Python PreCommit

@pabloem
Copy link
Member Author

pabloem commented Jun 27, 2018

Run Python PreCommit

@pabloem
Copy link
Member Author

pabloem commented Jun 27, 2018

Run Python Dataflow ValidatesRunner

1 similar comment
@pabloem
Copy link
Member Author

pabloem commented Jun 28, 2018

Run Python Dataflow ValidatesRunner

@pabloem
Copy link
Member Author

pabloem commented Jun 28, 2018

r: @charlesccychen
This unifies context management in Python, so that logging will use the state sampler to retrieve its current state. Also, NameContext is used more widely to improve step name management.

Copy link
Contributor

@charlesccychen charlesccychen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Pablo! This is a great cleanup. Can you also run Robert's benchmarks here with Cython, with and without this change: #4741?

self.operations = operations
self.stage_name = stage_name
# TODO(BEAM-4028): Remove arguments other than name_contexts.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this obsolete? The Jira is still open.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Not obsolete. Good catch!

@@ -539,19 +529,14 @@ def __init__(self,
windowing: windowing properties of the output PCollection(s)
tagged_receivers: a dict of tag name to Receiver objects
step_name: the name of this step
logging_context: a LoggingContext object
logging_context: DEPRECATED
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a JIRA to remove this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added BEAM-4728.

self.windowed_coder = windowed_coder
self.windowed_coder_impl = windowed_coder.get_impl()
self.step_name = step_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this deletion intentional? If so, can you add a comment / JIRA reference to clean up step_name in the arguments?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of my goal with BEAM-4028. The step name is meant to only be retrievable through the name context.

@@ -49,7 +48,7 @@ def get_data(self):
per_thread_worker_data = _PerThreadWorkerData()


class PerThreadLoggingContext(LoggingContext):
class PerThreadLoggingContext(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we able to get rid of this class entirely? It looks like you removed the only usage in operations.py.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is used internally at google, so we need to remove it from there first.

@@ -34,7 +34,6 @@ class _PerThreadWorkerData(threading.local):

def __init__(self):
super(_PerThreadWorkerData, self).__init__()
# TODO(robertwb): Consider starting with an initial (ignored) ~20 elements
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental deletion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logging context will be removed once it's no longer useful (after it's removed from Google code) so optimizations should not be considered anymore. I'll remove it ASAP as part of BEAM-4728

@@ -62,12 +64,22 @@ def __init__(self, prefix, counter_factory,
sampling_period_ms=DEFAULT_SAMPLING_PERIOD_MS):
self.states_by_name = {}
self._prefix = prefix
self._counter_factory = counter_factory
self._counter_factory = counter_factory or CounterFactory()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the second branch is only used by tests. Can we have the tests pass empty CounterFactory()s instead of adding this optional behavior in the actual code?

Copy link
Member Author

@pabloem pabloem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Charles!

@@ -49,7 +48,7 @@ def get_data(self):
per_thread_worker_data = _PerThreadWorkerData()


class PerThreadLoggingContext(LoggingContext):
class PerThreadLoggingContext(object):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is used internally at google, so we need to remove it from there first.

@@ -539,19 +529,14 @@ def __init__(self,
windowing: windowing properties of the output PCollection(s)
tagged_receivers: a dict of tag name to Receiver objects
step_name: the name of this step
logging_context: a LoggingContext object
logging_context: DEPRECATED
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added BEAM-4728.

self.windowed_coder = windowed_coder
self.windowed_coder_impl = windowed_coder.get_impl()
self.step_name = step_name
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of my goal with BEAM-4028. The step name is meant to only be retrievable through the name context.

@@ -34,7 +34,6 @@ class _PerThreadWorkerData(threading.local):

def __init__(self):
super(_PerThreadWorkerData, self).__init__()
# TODO(robertwb): Consider starting with an initial (ignored) ~20 elements
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logging context will be removed once it's no longer useful (after it's removed from Google code) so optimizations should not be considered anymore. I'll remove it ASAP as part of BEAM-4728

self.operations = operations
self.stage_name = stage_name
# TODO(BEAM-4028): Remove arguments other than name_contexts.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Not obsolete. Good catch!

@pabloem
Copy link
Member Author

pabloem commented Jul 3, 2018

Also removed the extra path in statesampler for CounterFactory

@pabloem
Copy link
Member Author

pabloem commented Jul 3, 2018

Run Python Dataflow ValidatesRunner

@pabloem
Copy link
Member Author

pabloem commented Jul 3, 2018

Results of the map_fn_microbenchmark. For some (i'd think floating-point-related) reason, it gives per-element cost as zero, but if you observe row-per-row, the cost with changes is slightly lower than the cost on master.

On master:

     1 element  0.798268 sec
  1001 elements 1.01165 sec
  2001 elements 1.05419 sec
  3001 elements 1.1398 sec
  4001 elements 1.37623 sec
  5001 elements 1.47872 sec
  6001 elements 1.68769 sec
  7001 elements 1.68809 sec
  8001 elements 1.8503 sec
  9001 elements 2.0606 sec
Fixed cost   0.8104043092640963
Per-element  0.0
R^2          0.9845043457059202

With these changes:

     1 element  0.796835 sec
  1001 elements 0.952501 sec
  2001 elements 1.01314 sec
  3001 elements 1.17117 sec
  4001 elements 1.31416 sec
  5001 elements 1.3791 sec
  6001 elements 1.54986 sec
  7001 elements 1.68663 sec
  8001 elements 1.71841 sec
  9001 elements 1.93632 sec
Fixed cost   0.8011857992764676
Per-element  0.0
R^2          0.9922532858766798

@charlesccychen
Copy link
Contributor

Thanks Pablo! This LGTM. Can you rebase? It looks like this is a great performance improvement too, cutting down processing overhead.

@pabloem
Copy link
Member Author

pabloem commented Jul 9, 2018

Run Python Dataflow ValidatesRunner

@pabloem
Copy link
Member Author

pabloem commented Jul 9, 2018

Tests passing. I'll squash and merge this after lunch.

@pabloem
Copy link
Member Author

pabloem commented Jul 9, 2018

Squashed commits and resolved conflicts. Reruning tests.

@pabloem
Copy link
Member Author

pabloem commented Jul 9, 2018

Run Python Dataflow ValidatesRunner

@pabloem pabloem merged commit f063b15 into apache:master Jul 9, 2018
@pabloem pabloem deleted the ss-logging branch July 9, 2018 19:46
@pabloem
Copy link
Member Author

pabloem commented Jul 9, 2018

Merged. Thanks @charlesccychen for reviewing the large-ish change : )

@pabloem
Copy link
Member Author

pabloem commented Aug 15, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants