
[AIRFLOW-XXX] GSoD: How to make DAGs production ready #6515

Merged
merged 30 commits into apache:master on Nov 25, 2019

Conversation

@KKcorps (Contributor) commented Nov 7, 2019

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-XXX
    • In case you are fixing a typo in the documentation, you can prepend your commit with [AIRFLOW-XXX]; code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards-incompatible changes, please leave a note in Updating.md so we can assign it to an appropriate release

@KKcorps requested a review from kaxil on November 7, 2019
@potiuk (Member) commented Nov 8, 2019

@KKcorps I think one of the important things to mention in this document is database access during parsing time, and especially avoiding the use of Airflow Variables in DAGs (they are still OK to use in the "execute" method).

This is a known problem that people complain about a lot: the scheduler opens and closes a lot of connections to the database, because every time the file is parsed and a variable is reached, a database connection is opened and a query is executed.

I think there are lots of examples around that use Variables in DAGs, but this is not really a good practice, and I think this is a perfect place to mention it.

I believe environment variables are a better way to share common configuration.

@kaxil @ashb @mik-laj - what are your thoughts about it?
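
For illustration, a minimal sketch of the anti-pattern and the suggested alternative (the variable and environment-variable names are hypothetical):

```python
import os

# Anti-pattern (parse-time DB access): every time the scheduler parses this
# file, the following lines would open a DB connection and run a query.
#   from airflow.models import Variable
#   deploy_env = Variable.get("deploy_env")

# Suggested alternative: read from the process environment instead;
# no database round-trip at parse time.
deploy_env = os.environ.get("DEPLOY_ENV", "dev")
```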

@KKcorps (Contributor, Author) commented Nov 8, 2019

@potiuk Yes, I am all for it. Personally, I have also faced this issue. I have mentioned in the documentation that you shouldn't write a lot of code outside the DAG because of frequent parsing, but I'll elaborate on that point to include this as well.

@potiuk (Member) commented Nov 8, 2019

Great, @KKcorps. It would also be great to search some of the existing documentation and see whether there are contradictions between the recommended practices and the examples (I believe there are a few places where those bad practices are shown as examples :). And we can fix them together.

@kaxil (Member) commented Nov 8, 2019

> @KKcorps I think one of the important things to mention in this document is database access during parsing time [...] (quoting @potiuk's comment above)

Agree. So using lots of Airflow Variables in the file, outside of task code, will cause many DB connections.

It is fine to use them in a deferred way in a Jinja-templated field. For use outside task code, use environment variables.
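
A minimal sketch of the deferred pattern (the DAG, task, and variable names are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG("variable_usage_sketch",
         start_date=datetime(2019, 11, 1),
         schedule_interval=None) as dag:
    # "{{ var.value.my_config }}" is rendered only when the task executes,
    # so parsing this file never touches the metadata database.
    print_config = BashOperator(
        task_id="print_config",
        bash_command="echo {{ var.value.my_config }}",
    )
```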

@KKcorps (Contributor, Author) commented Nov 8, 2019

I was also thinking of actually placing this doc in the main section and naming it "Best Practices" or something similar.

@kaxil (Member) commented Nov 8, 2019

> I was also thinking of actually placing this doc in the main section and naming it "Best Practices" or something similar.

I like that idea

@potiuk (Member) commented Nov 8, 2019

> I was also thinking of actually placing this doc in the main section and naming it "Best Practices" or something similar.

I like it too :)

@KKcorps (Contributor, Author) commented Nov 13, 2019

@kaxil @potiuk
What is the difference between processor_poll_interval and min_file_process_interval?
Which of these controls how frequently a DAG is parsed?
In my opinion it should be min_file_process_interval, but it has a default value of 0 in default-airflow.cfg.
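
For reference, both options live in the ``[scheduler]`` section of ``airflow.cfg``. A sketch with what I believe are the 1.10-era defaults:

```ini
[scheduler]
# Seconds the scheduler sleeps between its processing loops.
processor_poll_interval = 1

# Minimum seconds between re-parses of the same DAG file;
# 0 means files are re-parsed as often as the processor can manage.
min_file_process_interval = 0
```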

Let's see what precautions you need to take.


Backend
(Member) Suggested change:

- Backend
+ Database backend
Logging
--------

If you are using disposable nodes in your cluster, configure the log storage to be a distributed file system such as ``S3`` or ``GCS``.
(Member) Airflow can also be used with other tools that do not provide a file system, e.g. Stackdriver Logging (not publicly available yet), Elasticsearch, or Amazon CloudWatch.

(Member) Agreed.
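
A sketch of the corresponding configuration (the bucket name and connection id are hypothetical; in 1.10 the remote-logging options live under ``[core]``):

```ini
[core]
# Ship task logs to durable remote storage so they survive disposable nodes.
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs
remote_log_conn_id = my_s3_conn
```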

Communication
--------------

Airflow executes tasks of a DAG in different directories, which can even be present
(Member) The directory bit isn't true, but the issue here is, as you mention, that tasks can be executed on different machines.

And even when using the LocalExecutor, storing files on local disk can make retries harder (especially if another task might have deleted the file in the meantime).

Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files.

The tasks should also not store any authentication parameters such as passwords or tokens inside them.
Always use :ref:`Connections <concepts-connections>` to store data securely in Airflow backend and retrieve them using a unique connection id.
(Member) Unfortunately we can't be so bold as to say "Always" -- not every system is supported, so "Where at all possible" might be the best we can say.

on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`.
Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes.

Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files.
(Member) Please expand this to include something like "and then push a path to the remote file in XCom to use in downstream tasks".
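
A minimal sketch of that suggestion (the bucket path, task ids, and XCom key are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def upload(**context):
    # ...upload the large artifact to remote storage, then share only its path.
    remote_path = "s3://my-bucket/artifacts/job.jar"
    context["ti"].xcom_push(key="jar_path", value=remote_path)

def process(**context):
    # The downstream task pulls the path, not the file itself.
    jar_path = context["ti"].xcom_pull(task_ids="upload_jar", key="jar_path")
    print("Would fetch and run:", jar_path)

with DAG("xcom_path_sketch",
         start_date=datetime(2019, 11, 1),
         schedule_interval=None) as dag:
    upload_jar = PythonOperator(task_id="upload_jar",
                                python_callable=upload,
                                provide_context=True)
    process_jar = PythonOperator(task_id="process_jar",
                                 python_callable=process,
                                 provide_context=True)
    upload_jar >> process_jar
```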

Multi-Node Cluster
-------------------

Airflow uses :class:`airflow.executors.sequential_executor.SequentialExecutor` by default. It works fine in most cases. However, by its nature, the user is limited to executing at most
(Member) I wouldn't say in most cases. Even on a local laptop, LocalExecutor works much better.

(Member) The problem with SequentialExecutor is that it pauses the scheduler when it runs a task, so even for a single-node production deployment, SequentialExecutor is not a good choice.
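
Switching executors is a one-line change in ``airflow.cfg`` (a sketch; note that ``LocalExecutor`` needs a real database backend such as Postgres or MySQL rather than SQLite):

```ini
[core]
# SequentialExecutor (the default) pauses the scheduler while a task runs;
# LocalExecutor is a better choice even on a single machine.
executor = LocalExecutor
```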

KKcorps and others added 2 commits November 21, 2019 19:54
Co-Authored-By: Ash Berlin-Taylor <ash_github@firemirror.com>
Co-Authored-By: Kamil Breguła <mik-laj@users.noreply.github.com>
@ashb (Member) left a comment

Almost there now. Looking good though!

You should treat tasks in Airflow as equivalent to transactions in a database. This implies that you should never produce
incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task.

Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run.
(Member) Can, if configured. The default is not to retry, though.
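
A minimal sketch of opting in to retries (the names and values are illustrative; because a retried task runs again from scratch, it must be idempotent):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "retries": 2,                        # the default is 0: no retries
    "retry_delay": timedelta(minutes=5),
}

with DAG("retry_sketch",
         default_args=default_args,
         start_date=datetime(2019, 11, 1),
         schedule_interval=None) as dag:
    do_work = BashOperator(task_id="do_work",
                           bash_command="echo idempotent work")
```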

KKcorps and others added 3 commits November 23, 2019 09:06
Co-Authored-By: Ash Berlin-Taylor <ash_github@firemirror.com>
@ashb (Member) commented Nov 25, 2019

@kaxil PTAL.

@kaxil (Member) left a comment

Just some minor changes. Looks solid overall. Good work @KKcorps

Co-Authored-By: Kaxil Naik <kaxilnaik@gmail.com>
@kaxil kaxil merged commit c7c0a53 into apache:master Nov 25, 2019
@kaxil (Member) commented Nov 25, 2019

Thanks @KKcorps

kaxil pushed a commit that referenced this pull request Dec 17, 2019
kaxil added a commit that referenced this pull request Dec 17, 2019
kaxil pushed a commit that referenced this pull request Dec 17, 2019
kaxil added a commit that referenced this pull request Dec 17, 2019
kaxil pushed a commit that referenced this pull request Dec 18, 2019
ashb pushed a commit that referenced this pull request Dec 19, 2019