
[AIRFLOW-6047] Simplify the logging configuration template #6644

Merged
merged 3 commits into from
Nov 27, 2019

Conversation

mik-laj
Member

@mik-laj mik-laj commented Nov 22, 2019

There are several problems in this file:

  • The configuration is stored in the REMOTE_HANDLERS dictionary, and in the next step the items are retrieved by key. There is no reason this cannot be done with plain variables; the dictionary only increases the complexity of the file.
  • Variables are always created as global even when a smaller scope could be used.

I also created a dedicated section for the remote logging configuration.
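The refactor described above can be sketched as follows; the values and handler contents are illustrative, not the exact Airflow code.

```python
# Before (sketch): every backend's handlers are built up front in one global
# dict, then selected by key. Handler contents here are illustrative.
REMOTE_HANDLERS = {
    's3': {'task': {'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler'}},
    'gcs': {'task': {'class': 'airflow.utils.log.gcs_task_handler.GCSTaskHandler'}},
}
handlers_before = REMOTE_HANDLERS['s3']

# After (sketch): each branch builds only the configuration it needs, in a
# plainly named variable; nothing is constructed for unused backends.
remote_base_log_folder = 's3://my-bucket/logs'  # hypothetical value
if remote_base_log_folder.startswith('s3://'):
    S3_REMOTE_HANDLERS = {
        'task': {'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler'},
    }
    handlers_after = S3_REMOTE_HANDLERS

print(handlers_before == handlers_after)  # True: same selected config, less machinery
```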

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-6047
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards-incompatible changes, please leave a note in Updating.md so we can assign it to an appropriate release

@mik-laj
Member Author

mik-laj commented Nov 22, 2019

CC: @XD-DENG, @ashb, @serkef

@mik-laj mik-laj requested review from ashb and XD-DENG November 22, 2019 22:52
@mik-laj
Member Author

mik-laj commented Nov 22, 2019

This is important to me because I want to add a new integration that will require many more configuration options.

@codecov-io

codecov-io commented Nov 23, 2019

Codecov Report

Merging #6644 into master will decrease coverage by 0.32%.
The diff coverage is 41.37%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6644      +/-   ##
==========================================
- Coverage   83.82%   83.49%   -0.33%     
==========================================
  Files         672      672              
  Lines       37594    37600       +6     
==========================================
- Hits        31512    31395     -117     
- Misses       6082     6205     +123
Impacted Files Coverage Δ
airflow/config_templates/airflow_local_settings.py 63.82% <41.37%> (-16.66%) ⬇️
airflow/kubernetes/volume_mount.py 44.44% <0%> (-55.56%) ⬇️
airflow/kubernetes/volume.py 52.94% <0%> (-47.06%) ⬇️
airflow/kubernetes/pod_launcher.py 45.25% <0%> (-46.72%) ⬇️
airflow/kubernetes/refresh_config.py 50.98% <0%> (-23.53%) ⬇️
...rflow/contrib/operators/kubernetes_pod_operator.py 78.2% <0%> (-20.52%) ⬇️
airflow/configuration.py 89.13% <0%> (-3.63%) ⬇️
airflow/utils/dag_processing.py 58.48% <0%> (+0.32%) ⬆️
airflow/jobs/local_task_job.py 90% <0%> (+5%) ⬆️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e51e1c7...241a408.

@XD-DENG
Member

XD-DENG commented Nov 23, 2019

Hi @mik-laj, sure, I will take a look.

Meanwhile, I'm curious what new integration you're planning. If you already have a clear idea, it may be useful for this review as well. Cheers.

@mik-laj
Member Author

mik-laj commented Nov 23, 2019

I am working on a direct integration with Google Stackdriver Logging. A WIP version is available on my company's fork:
https://github.com/PolideaInternal/airflow/pull/478/files

In the end, I did not expose all possible options in the Airflow configuration, but more advanced users have access through the logging_config_class option.

Member

@potiuk potiuk left a comment


I like this approach - getting the configurations where they are actually used makes perfect sense.

However, maybe we have an opportunity to simplify the whole logging configuration here, @ashb @XD-DENG @mik-laj? (It might be a different PR, but I wanted to discuss it here since this one touches it.)

I think airflow_local_settings used to have a different purpose. You were supposed to be able to define your own configuration here and treat it as a "template".

For example in some places in the code I found:

    To define a pod mutation hook, add a ``airflow_local_settings`` module
    to your PYTHONPATH that defines this ``pod_mutation_hook`` function.
    It receives a ``Pod`` object and can alter it where needed.

The SETTINGS_FILE_POLICY_WITH_DUNDER_ALL, SETTINGS_FILE_POLICY and SETTINGS_FILE_POD_MUTATION_HOOK are not documented other than in some not-very-good code comments.

But the module documentation says "Airflow logging settings" (and it seems to be true, looking at the file).

The settings are loaded via the "import_local_settings" method, and theoretically you can modify it and configure it yourself. However, pretty much all the code in airflow_local_settings is already configurable via conf. So it's a weird mixture of "conf-driven" settings and flexible Python code that might be changed in your own copy of local_settings. Most of the VARIABLES in airflow_local_settings do not need to be overridable, as they are already configurable.

I think much better solution might be this:

  • turn "airflow local settings" into "logging_settings" and have it simply imported like everything else, statically rather than dynamically.
  • have another "local_settings" importable dynamically from a specified folder.
  • have a 'closed' list of settings that are overridable in the local configuration file. But then this file should not contain the whole configuration, just the overridden VARIABLES.

WDYT?

@mik-laj
Member Author

mik-laj commented Nov 24, 2019

My dream would be for the whole configuration to be expressed as Python code.
Then in the HOME directory we would have settings.py with the following content:

from airflow.settings import *

And the airflow.settings file would contain default value definitions that could be overridden in settings.py, e.g.:

ENABLE_AWESOME_FEATURE = True

This would allow much greater freedom and create more complex objects, e.g. dictionaries.

This is how Django is configured.
https://docs.djangoproject.com/en/2.2/topics/settings/
https://github.com/joke2k/django-environ

But all in all, if we drop the import from the first snippet and move the file to the core, we get your idea, so I agree. We can move this configuration file to the core and introduce another way to override the settings.
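A minimal sketch of the Django-style override pattern described above; the module and setting names are illustrative, and plain dicts stand in for the star-import.

```python
# Core defaults (sketch of a hypothetical airflow.settings module).
core_defaults = {
    'ENABLE_AWESOME_FEATURE': True,
    'LOG_LEVEL': 'INFO',
}

# A user's settings.py would start with `from airflow.settings import *`,
# which copies every default into its namespace; a dict copy simulates that
# here, and the user then overrides only what they need.
user_settings = dict(core_defaults)
user_settings['LOG_LEVEL'] = 'DEBUG'

print(user_settings['ENABLE_AWESOME_FEATURE'])  # True (inherited default)
print(user_settings['LOG_LEVEL'])               # DEBUG (user override)
```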

Member

@XD-DENG XD-DENG left a comment


Setting the more future-oriented discussion aside, please find my 2 cents below and let me know if it makes sense to you.

    REMOTE_LOGGING = conf.getboolean('core', 'remote_logging')

    if REMOTE_LOGGING and REMOTE_BASE_LOG_FOLDER.startswith('s3://'):
        S3_REMOTE_HANDLERS = {
            'task': {
                'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
                'formatter': 'airflow',
Member


formatter, base_log_folder, and filename_template are shared and are always the same across the different handlers.
I wonder if it would be a good idea to have a task template (in which formatter, base_log_folder, and filename_template are defined) and to create each *_REMOTE_HANDLERS from this template.

This would avoid duplicated lines and make potential future updates to formatter/base_log_folder/filename_template easier.
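The template idea might look like the sketch below; the base dict and folder values are hypothetical, and ** unpacking merges the shared fields into each backend-specific handler.

```python
# Hypothetical shared template holding the fields common to all task handlers.
BASE_TASK_HANDLER = {
    'formatter': 'airflow',
    'base_log_folder': '/tmp/logs',               # illustrative value
    'filename_template': '{{ ti.dag_id }}.log',   # illustrative value
}

# Each backend merges the template and adds only its own fields.
S3_REMOTE_HANDLERS = {
    'task': {
        **BASE_TASK_HANDLER,
        'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
        's3_log_folder': 's3://my-bucket/logs',   # illustrative value
    }
}

print(S3_REMOTE_HANDLERS['task']['formatter'])  # airflow (inherited from the template)
```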

Copy link
Member Author


This is not common to all handlers, so it would be problematic. My Stackdriver handler contains the following configuration:
https://github.com/PolideaInternal/airflow/blob/e2511a74bfdd3824845ae037e4a50de127c223d6/airflow/config_templates/airflow_local_settings.py

    gcp_conn_id = conf.get('core', 'REMOTE_LOG_CONN_ID', fallback=None)
    # stackdriver:///airflow-tasks => airflow-tasks
    REMOTE_BASE_LOG_FOLDER = urlparse(REMOTE_BASE_LOG_FOLDER).path[1:]
    STACKDRIVER_REMOTE_HANDLERS = {
        'task': {
            'class': 'airflow.utils.log.stackdriver_task_handler.StackdriverTaskHandler',
            'formatter': 'airflow',
            'name': REMOTE_BASE_LOG_FOLDER,
            'gcp_conn_id': gcp_conn_id
        }
    }

    DEFAULT_LOGGING_CONFIG['handlers'].update(STACKDRIVER_REMOTE_HANDLERS)

I'm also afraid that pulling out only part of the configuration into a separate variable will make it harder to understand. This is not classic code that must follow DRY rules to avoid problems. This is a configuration file where each block has a different purpose. The blocks look similar, but each has its own separate role. Above all, this file should be easy to understand and to adapt to the specific case of our users.

Member


Sure I think it's ok.

        DEFAULT_LOGGING_CONFIG['handlers'].update(REMOTE_HANDLERS['wasb'])
    elif REMOTE_LOGGING and ELASTICSEARCH_HOST:
        DEFAULT_LOGGING_CONFIG['handlers'].update(REMOTE_HANDLERS['elasticsearch'])
        DEFAULT_LOGGING_CONFIG['handlers'].update(ELASTIC_REMOTE_HANDLERS)
Member


Do you think it's worthwhile to add an else: at the end? Inside it we can handle cases where the user configures something incorrectly (e.g. makes a typo in REMOTE_BASE_LOG_FOLDER so that it starts with something like an upper-case "S3:" or "hs://").

Member Author


I added an else statement at the end. Does it look good to you?

Member


A quite minor thing, with no difference in terms of logic: but why don't we write it in the way below?

if REMOTE_LOGGING:
    if REMOTE_BASE_LOG_FOLDER.startswith('s3://'):
        ...
    elif REMOTE_BASE_LOG_FOLDER.startswith('gs://'):
        ...
    ...
    else:
        raise AirflowException("...")

Member Author


I didn't want to increase the level of indentation, but it will be clearer to understand, so I made the change.

@serkef
Contributor

serkef commented Nov 25, 2019

I like the approach of the changes. Would it make sense to add logging messages (to the root logger) about the configuration of the logger? That way one could "verify" that their settings are being applied.

@mik-laj
Member Author

mik-laj commented Nov 25, 2019

@serkef Airflow runs a lot of processes. Each process initializes the logger from scratch, so displaying the full configuration would be problematic.
The configuration loading process contains detailed messages that allow you to trace its behavior:
https://github.com/apache/airflow/blob/master/airflow/logging_config.py#L33-L72
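Independent of Airflow, a single process can verify that a dictConfig-style configuration was applied by inspecting the configured logger rather than logging the whole config; the config below is a minimal illustrative example, not Airflow's actual DEFAULT_LOGGING_CONFIG.

```python
import logging
import logging.config

# Minimal illustrative dictConfig (not Airflow's real configuration).
LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
    },
    'loggers': {
        'airflow.task': {'handlers': ['console'], 'level': 'INFO'},
    },
}

logging.config.dictConfig(LOGGING_CONFIG)

# Inspect the logger to confirm the settings took effect.
task_logger = logging.getLogger('airflow.task')
print(task_logger.level == logging.INFO)                                        # True
print(any(isinstance(h, logging.StreamHandler) for h in task_logger.handlers))  # True
```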

@mik-laj
Member Author

mik-laj commented Nov 26, 2019

@mik-laj
Member Author

mik-laj commented Nov 27, 2019

@potiuk Do you have any other comments related to scope of this PR?

@potiuk
Member

potiuk commented Nov 27, 2019

No more comments. It looks good. Just wanted to see if we can think about some future changes.

@mik-laj mik-laj merged commit e82008d into apache:master Nov 27, 2019
potiuk pushed a commit that referenced this pull request Nov 29, 2019
eladkal pushed a commit to eladkal/airflow that referenced this pull request Dec 2, 2019
kaxil pushed a commit that referenced this pull request Dec 12, 2019
kaxil pushed a commit to astronomer/airflow that referenced this pull request Jul 16, 2020
kaxil added a commit to astronomer/airflow that referenced this pull request Jul 16, 2020
5 participants