Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable logging for all components, remove print statements #80

Merged
merged 1 commit into from
Jan 20, 2020

Conversation

kevin-bates
Copy link
Member

@kevin-bates kevin-bates commented Jan 14, 2020

This change enables a logger for each module of yarn-api-client.
By default, nothing will be logged unless the calling application has
configured logging. Once configured, entries will be logged relative
to the configured log level. The calling application is also responsible
or specifying the appropriate log formats, etc.

An accessor method, get_logger(logger_name), has been created in case we
need to add configuration for the loggers. In those cases, we'd probably
configure the logger "yarn_api_client", so that each of the submodule
loggers would inherit from that logger. Likewise, applications can
configure "yarn_api_client" themselves and those changes will be picked
up by the individual loggers.

The previous INFO log statement has been changed to DEBUG and I've
added timing results for the request. In addition, the previous print
statements have been replaced with calls to apiLogger().warning().

Fixes: #73

@dimon222
Copy link
Collaborator

I have next notes in mind

  1. global is somewhat antipattern, and usually direct road to undesired behaviour. I see the point in trying to have only one single logger instantiated, but I'm not sure if its the best way to implement it. I would suggest one way to instantiate it and simply import it to each working module.
  2. Environment variables might be good, but its typically discouraged for usage of loggers. I might assume the only real practical usage would be for main.py for essentially CLI (but even there it can be implemented via verbosity level argument?). In the rest python community typically overrides loglevels, formats, and date formats using native python logging handlers.
  3. Having just one-named logger might help in our current situation, but might be not flexible enough. Instead, I suggest to use __name__ based loggers which are essentially nested like yarn_api_client.application_master. Why? It allows flexible controls for end user to not just change nested levels logger handling, but also has full-pledged inheritence of controls. So, if you want to control all loggers, you can change yarn_api_client, it will automatically apply it to nested levels too.
    Might be interesting to read -https://fangpenlin.com/posts/2012/08/26/good-logging-practice-in-python/

I have slightly adjusted version with mentioned stuff. Let me know what you think.
dimon222@c1cdcf3

@kevin-bates
Copy link
Member Author

Thank you for the excellent feedback!

What I delivered was an evolution of odd behaviors I encountered - probably due to do the unnecessary. I agree with your suggestions and your branch looks good. Using a logger per module makes sense and I had recently discovered the hierarchical nature - similar to log4j.

Since the logger(s) can be accessed by "hosted applications", they can adjust the formatting also as you suggest, so I would agree we can drop the env-based approach.

Thanks. I'll revisit this first thing tomorrow and will update the title to WIP now.

@kevin-bates kevin-bates changed the title Enable logging for all components, remove print statements [WIP] Enable logging for all components, remove print statements Jan 14, 2020
@kevin-bates
Copy link
Member Author

Hmm - looks like because there are handlers set for every logger access, we now get 5 entries for each log statement (this is produced using the submit code attached to #75):

[I 2020-01-14 06:56:27.551 yarn_api_client.base] Getting resource manager endpoint from config: /etc/hadoop/conf/yarn-site.xml
[I 2020-01-14 06:56:27.551 yarn_api_client.base] Getting resource manager endpoint from config: /etc/hadoop/conf/yarn-site.xml
[I 2020-01-14 06:56:27.551 yarn_api_client.base] Getting resource manager endpoint from config: /etc/hadoop/conf/yarn-site.xml
[I 2020-01-14 06:56:27.551 yarn_api_client.base] Getting resource manager endpoint from config: /etc/hadoop/conf/yarn-site.xml
[I 2020-01-14 06:56:27.551 yarn_api_client.base] Getting resource manager endpoint from config: /etc/hadoop/conf/yarn-site.xml
Getting application id...
[D 2020-01-14 06:56:27.581 yarn_api_client.base] 'POST' request against endpoint 'http://yarn-eg-node-1.fyre.ibm.com:8088/ws/v1/cluster/apps/new-application' took 9.849 ms
[D 2020-01-14 06:56:27.581 yarn_api_client.base] 'POST' request against endpoint 'http://yarn-eg-node-1.fyre.ibm.com:8088/ws/v1/cluster/apps/new-application' took 9.849 ms
[D 2020-01-14 06:56:27.581 yarn_api_client.base] 'POST' request against endpoint 'http://yarn-eg-node-1.fyre.ibm.com:8088/ws/v1/cluster/apps/new-application' took 9.849 ms
[D 2020-01-14 06:56:27.581 yarn_api_client.base] 'POST' request against endpoint 'http://yarn-eg-node-1.fyre.ibm.com:8088/ws/v1/cluster/apps/new-application' took 9.849 ms
[D 2020-01-14 06:56:27.581 yarn_api_client.base] 'POST' request against endpoint 'http://yarn-eg-node-1.fyre.ibm.com:8088/ws/v1/cluster/apps/new-application' took 9.849 ms

I think we need to check if there are handlers defined and, if not, only create one then. I'll give this a shot.

@kevin-bates
Copy link
Member Author

@dimon222 - I've experimented with enabling formatting from the calling application and have greatly simplified the logger acquisition logic. The previous attempt had a couple issues.

  1. Multiple handlers were getting created (one for each module) and associated with the same logger (because __name__ was always 'yarn_api_client.base').
  2. Because __name__ was always the same value, it defeated the purpose and flexibility of the hierarchical nature of loggers.

If I add a method similar to the following to the hosting applications that use the api, I get the desired formatting (relative to the application)...

def configure_yarn_api_client_logger():
    logger = logging.getLogger("yarn_api_client")
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        my_app_log_format = "[%(levelname)1.1s %(asctime)s.%(msecs).03d %(name)s] %(message)s"
        my_app_date_format = "%Y-%m-%d %H:%M:%S"        
        _log_handler = logging.StreamHandler()
        _log_handler.setFormatter(logging.Formatter(fmt=my_app_log_format, datefmt=my_app_date_format))
        logger.addHandler(_log_handler)
        logger.propagate = False

which then produces the formatting the developer wants for their application.

P.S., I've added you as a co-author on this PR. If you're uncomfortable with that, please let me know and I'll amend the PR.

@kevin-bates kevin-bates changed the title [WIP] Enable logging for all components, remove print statements Enable logging for all components, remove print statements Jan 14, 2020
@dimon222
Copy link
Collaborator

dimon222 commented Jan 15, 2020

I've briefly checked. Looks good.

Only some things, but I'm not sure of their importance

  1. If user has full control of it, does it even make sense to manually change loglevel of urllib3?
    https://github.com/toidi/hadoop-yarn-api-python-client/pull/80/files#diff-ee2241d0b6c7e4665240a34f5947d144R18
    User might potentially want a different loglevel for urllib, and it might override his setup.
  2. Logger can be using arguments for formatting this way utilizing "copy by reference" instead of duplicating value in memory, but since we don't expect a lot of big data values formatted, its probably not a big deal. Sadly, the simple usage of keyword would require writing our own formatters or some workarounds (https://stackoverflow.com/questions/13131400/logging-variable-data-with-new-format-string)

@kevin-bates
Copy link
Member Author

kevin-bates commented Jan 15, 2020

If we user has full control of it, does it even make sense to manually change loglevel of urllib3?

Understood. Without this we’re going to get essentially debug statements for every request. That said, I’m happy to handle this from my calling application.

I’ll remove this first thing tomorrow. I’m away from my keyboard until tomorrow.

Logger can be using arguments for formatting this way utilizing "copy by reference" instead of duplicating value in memory

Good to know. Let’s punt on that for now.

@dimon222
Copy link
Collaborator

dimon222 commented Jan 15, 2020

We could also consider this kind of workaround to make sure we initialize logger with proper hierarchy:

def getLogger(logger_name):
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        my_app_log_format = "[%(levelname)1.1s %(asctime)s.%(msecs).03d %(name)s] %(message)s"
        my_app_date_format = "%Y-%m-%d %H:%M:%S"        
        _log_handler = logging.StreamHandler()
        _log_handler.setFormatter(logging.Formatter(fmt=my_app_log_format, datefmt=my_app_date_format))
        logger.addHandler(_log_handler)
        logger.propagate = False
   return logger

In each of modules we would call log = getLogger(__name__) where essentially current classname in hierrachy would be passed as argument.

@kevin-bates
Copy link
Member Author

I think deferring the formatting to the calling application is the right thing to do. In my two applications, I configure "yarn_api_client" and the hierarchy of the sub loggers use that format.

I do see merit in having a general getLogger(logger_name) method in case we want to make changes, so I think I'll put that back in, but I'd rather not have it setup a handler with specific formatting at this time. I'll have to do some more experimenting, particularly relative to the level, to see what such a method would need to include.

This change enables a logger for each module of yarn-api-client.
By default, nothing will be logged unless the calling application has
configured logging.  Once configured, entries will be logged relative
to the configured log level. The calling application is also responsible
for specifying the appropriate log formats, etc.

An accessor method, get_logger(logger_name), has been created in case we
need to add configuration for the loggers.  In those cases, we'd probably
configure the logger "yarn_api_client", so that each of the submodule
loggers would inherit from that logger.  Likewise, applications can
configure "yarn_api_client" themselves and those changes will be picked
up by the individual loggers.

The previous INFO log statement has been changed to DEBUG and I've
added timing results for the request. In addition, the previous print
statements have been replaced with calls to apiLogger().warning().

Fixes: gateway-experiments#73

Co-authored-by: Dmitry Romanenko <Dmitry@Romanenko.in>
@kevin-bates
Copy link
Member Author

Ok - I've gone ahead with what I hope is sufficient for the time being. I've removed the URLLIB setting (leaving that to the calling applications) and created an accessor in case we ever want to add configuration ourselves.

This has been much more challenging to get right than I had originally anticipated. @dimon222 - I really appreciate all your help and hope you're happy with the solution as well.

@dimon222
Copy link
Collaborator

Looks good for me

@lresende
Copy link
Collaborator

Overall, it looks good, and I only have one suggestion. Looks like we have removed the logger property from the base class, which caused the side effect of every module, even the ones that inherit from base class, to have to instantiate a global log. I would like to suggest that we leave the logger property on the base class, and that would call the new get_logger function to initialize the log and that property would be used in the child classes, and the get_logger would be used only in the modules that have helper functions that don't inherit from base class.

Thoughts ?

@kevin-bates
Copy link
Member Author

I think if we were to adopt an approach of using a fixed logger name (e.g., "yarn_api_client"), then using a class instance property to access the logger for instances and a "global" accessor for "helper" functions would be okay, although I prefer to have every log call "look" the same. When (re)introducing the class instance property approach, you're going to have some methods use self.logger.debug() while others use log.debug() (for example).

If we want to retain the notion of the hierarchical nature of loggers, where "yarn_api_client" represents our "root" (and configuration occurs against that level) and the "leaf" portion (e.g., "yarn_api_client.hadoop_conf") represents the location from which the log statement was issued, then defining the property on the base class (as before), will lead to all class instances using a logger name of "yarn_api_client.base", irrespective of where the class instance performed the log statement. Of course, you could address this by passing __name__ as a parameter to the logger property and "touch" the logger property within the constructor of class instance. But again, even in that case, you'd get log statements taking place in base.py with logger names of yarn_api_client.resource_manager while other statements, logged on behalf of the instance but logged in helper methods, listed as yarn_api_client.hadoop_conf - which seems inconsistent to me. In addition, the calling conventions would vary depending on where the log statement occurs. For class-based methods, you'd have self.logger(__name__).debug() (even though __name__ is only used during its initialization) while helper methods would use log.debug().

Based on this, I'm not sure what using a class instance property buys you other than not using a global accessor in files which consist solely of class methods. And it seems to introduce inconsistencies at both the calling and logging levels - unless we were to forgo the hierarchical nature of loggers and use a fixed logger name, at the expense of some finer grained control capabilities.

One thing is for sure about this logging exercise, it surfaces philosophies that are important to developers and it's great that we can have these discussions. I've just presented what I believe are counter-arguments to using an instance property along with a global accessor, but am willing to revisit this if others agree that's how we should proceed. If so, please propose how we'd address some of the inconsistencies or reasons why they shouldn't be a concern. Thank you.

@kevin-bates
Copy link
Member Author

FWIW, I just happened to look at urllib3 since "urllib3.connectionpool" is something we (in Enterprise Gateway) adjust the logging level for, and found they use a global log variable in each file.

@dimon222
Copy link
Collaborator

I think global accessor is no big harm and somewhat adopted approach across python community (what obviously doesn't always mean correct). However, it will help in using loggers for writing integration & unit tests since we don't have to reimplement the wheel for test classes. Also, same logger won't be created 20 times if u want to communicate to 20 different clusters.

@lresende
Copy link
Collaborator

Ok, based on the comments above it LGTM.

@lresende lresende merged commit 159f692 into gateway-experiments:master Jan 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix up logging and (remove) print statements
3 participants