
[AIRFLOW-XXX] GSoD: How to make DAGs production ready #6515

Merged: 30 commits, Nov 25, 2019

Conversation

KKcorps (Contributor) commented Nov 7, 2019

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-XXX
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards-incompatible changes, please leave a note in Updating.md so we can assign it to an appropriate release
KKcorps (Contributor, Author) commented Nov 7, 2019 (comment minimized)

docs/howto/dags-in-production.rst: several review threads (outdated, resolved)
KKcorps requested a review from kaxil Nov 7, 2019
potiuk (Member) commented Nov 8, 2019

@KKcorps I think one of the important things to mention in this document is database access during parsing time, and especially avoiding the use of Airflow Variables in DAG definitions (they are still fine to use inside the "execute" method).

This is a known problem that people complain about a lot: the scheduler opens and closes a lot of connections to the database, because every time the file is parsed and a variable is accessed, a database connection is opened and a query executed.

I think there are lots of examples around that use variables in DAGs, but this is not really a good practice, and this is a perfect place to mention it.

I believe environment variables are a better way to share common configuration.

@kaxil @ashb @mik-laj - what are your thoughts about it?
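The pattern potiuk describes could be sketched like this (a minimal illustration, not text from the PR; `DATA_BUCKET` and its default are hypothetical names):

```python
import os

# Parse-time Variable access (shown only as a comment) would hit the Airflow
# metadata database on every scheduler parse of this file:
#
#   from airflow.models import Variable
#   BUCKET = Variable.get("data_bucket")   # one DB connection per parse!
#
# Reading an environment variable instead costs no database round-trip at
# parse time, so it is a cheaper way to share common configuration.
BUCKET = os.environ.get("DATA_BUCKET", "s3://example-bucket")
```

The environment variable can then be set once per deployment (e.g. in the scheduler's and workers' environment) instead of being fetched from the database on every parse.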

KKcorps (Contributor, Author) commented Nov 8, 2019

@potiuk Yes, I am all for it. Personally, I have faced this issue as well. I have mentioned in the documentation that you shouldn't write a lot of code outside the DAG because of frequent parsing, but I'll elaborate on that point to include this too.

potiuk (Member) commented Nov 8, 2019

Great @KKcorps. It would also be good to search the existing documentation and check whether the recommended practices contradict the examples (I believe there are a few places where these bad practices are shown as examples :). And we can fix them together.

kaxil (Contributor) commented Nov 8, 2019

> I think one of the important things to mention in this document is database access during parsing time [...] I believe environment variables are a better way to share common configuration.

Agree. Using lots of Airflow Variables in the file outside of task code will cause many DB connections.

It is fine to use them in a deferred way in a Jinja-templated field. For use outside task code, use environment variables.

KKcorps (Contributor, Author) commented Nov 8, 2019

I was also thinking of actually placing this doc in the main section and naming it Best Practices or something similar.

kaxil (Contributor) commented Nov 8, 2019

> I was also thinking of actually placing this doc in the main section and naming it Best Practices or something similar.

I like that idea

potiuk (Member) commented Nov 8, 2019

> I was also thinking of actually placing this doc in the main section and naming it Best Practices or something similar.

I like it too :)

KKcorps (Contributor, Author) commented Nov 13, 2019

@kaxil @potiuk
What is the difference between processor_poll_interval and min_file_process_interval?
Which of these controls how frequently a DAG is parsed?
In my opinion it should be min_file_process_interval, but it has a default value of 0 in default-airflow.cfg.

docs/best-practices.rst Outdated Show resolved Hide resolved
Let's see what precautions you need to take.


Backend

mik-laj (Member), Nov 18, 2019 - suggested change:

Backend -> Database backend
Logging
--------

If you are using disposable nodes in your cluster, configure the log storage to be a distributed file system such as ``S3`` or ``GCS``.

mik-laj (Member) commented Nov 18, 2019

Airflow can also be used with other tools that do not provide a file system, e.g. Stackdriver Logging (not publicly available yet), Elasticsearch, or Amazon CloudWatch.

ashb (Member) commented Nov 21, 2019

Agreed.

Communication
--------------

Airflow executes tasks of a DAG in different directories, which can even be present

ashb (Member) commented Nov 21, 2019

The directory bit isn't true, but the issue here is, as you mention, that tasks can be executed on different machines.

And even if using the LocalExecutor, storing files on local disk can make retries harder (especially if another task might have deleted the file in the meantime).

Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files.

The tasks should also not store any authentication parameters such as passwords or token inside them.
Always use :ref:`Connections <concepts-connections>` to store data securely in Airflow backend and retrieve them using a unique connection id.

ashb (Member) commented Nov 21, 2019

Unfortunately we can't be so bold as to say "Always" -- not every system is supported, so "Where at all possible" might be the best we can say.
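The Connections advice under review could be illustrated with a small stand-in sketch (not from the PR; in a real DAG the record would come from `BaseHook.get_connection(conn_id)`, and all names below are hypothetical):

```python
from collections import namedtuple

# Stand-in for Airflow's secured connection backend: real task code would call
# BaseHook.get_connection("my_postgres") instead of building this by hand.
Connection = namedtuple("Connection", "login password host port schema")

_FAKE_BACKEND = {
    "my_postgres": Connection("etl_user", "s3cr3t", "db.internal", 5432, "warehouse"),
}

def get_connection(conn_id: str) -> Connection:
    """Look up a connection by its id, as Airflow's backend would."""
    return _FAKE_BACKEND[conn_id]

def build_dsn(conn_id: str) -> str:
    # Task code only ever references the connection id, never raw credentials;
    # the password stays out of the DAG file (and out of this DSN).
    c = get_connection(conn_id)
    return f"postgresql://{c.login}@{c.host}:{c.port}/{c.schema}"

dsn = build_dsn("my_postgres")
```

The point of the pattern is that rotating a password only requires updating the stored connection, not editing any DAG.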

on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`.
Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes.

Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files.

ashb (Member) commented Nov 21, 2019

Please expand this to include something like "and then push a path to the remote file in Xcom to use in downstream tasks"
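The "push a path, not the payload" pattern ashb asks for could be sketched as follows (illustrative only; the dict stands in for S3/HDFS, and all names are hypothetical):

```python
# Upstream task uploads the large artifact to durable storage and shares only
# its *path*; in Airflow, a value returned from a task callable is pushed to
# XCom automatically, and downstream tasks pull it with xcom_pull().
REMOTE_STORE = {}  # stand-in for S3/HDFS

def upload_task() -> str:
    data = b"large,payload\n1,2\n"
    path = "s3://example-bucket/run-1/output.csv"
    REMOTE_STORE[path] = data        # the "upload"
    return path                      # small message -> goes through XCom

def download_task(path: str) -> bytes:
    # In Airflow the path would arrive via xcom_pull(task_ids="upload_task").
    return REMOTE_STORE[path]

path = upload_task()
payload = download_task(path)
```

XCom then only ever carries a short string, while the bulky data moves through storage built for it.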

Multi-Node Cluster
-------------------

Airflow uses :class:`airflow.executors.sequential_executor.SequentialExecutor` by default. It works fine in most cases. However, by its nature, the user is limited to executing at most

ashb (Member) commented Nov 21, 2019

I wouldn't say in most cases. Even on a local laptop LocalExecutor works much better.

ashb (Member) commented Nov 21, 2019

The problem with SequentialExecutor is that it pauses the scheduler while it runs a task, so even for a single-node production deployment SequentialExecutor is not a good choice.

Logging
--------

If you are using disposable nodes in your cluster, configure the log storage to be a distributed file system such as ``S3`` or ``GCS``.

ashb (Member) commented Nov 21, 2019

Agreed.
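In Airflow 1.10 the remote logging settings live in the ``[core]`` section of ``airflow.cfg``; a sketch of the configuration being discussed (the bucket path and connection id are hypothetical):

```ini
[core]
# Ship task logs to durable storage so they survive disposable worker nodes.
remote_logging = True
remote_base_log_folder = s3://example-log-bucket/airflow/logs
remote_log_conn_id = my_s3_conn
```

With this in place the webserver reads task logs from the remote folder, so logs remain available after a worker node is recycled.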

KKcorps and others added 2 commits Nov 21, 2019
Co-Authored-By: Ash Berlin-Taylor <ash_github@firemirror.com>
Co-Authored-By: Kamil Breguła <mik-laj@users.noreply.github.com>
@KKcorps KKcorps force-pushed the KKcorps:dag-in-prod branch from d8b60aa to 2007915 Nov 21, 2019
potiuk approved these changes Nov 21, 2019

ashb (Member) left a comment

Almost there now. Looking good though!

You should treat tasks in Airflow as equivalent to transactions in a database. This implies that you should never produce
incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task.

Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run.

ashb (Member) commented Nov 22, 2019

Can, if configured. Default is to not retry though.
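The "can, if configured" point could be sketched like this (illustrative values, not from the PR):

```python
from datetime import timedelta

# Retries are opt-in: by default a failed task is not retried. Enabling them
# is exactly why tasks must be idempotent -- a task may run more than once
# and should produce the same outcome each time.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

# Passed to the DAG, e.g. DAG("my_dag", default_args=default_args, ...),
# these apply to every task unless overridden on an individual operator.
```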

KKcorps and others added 3 commits Nov 23, 2019
Co-Authored-By: Ash Berlin-Taylor <ash_github@firemirror.com>
ashb approved these changes Nov 25, 2019

ashb (Member) commented Nov 25, 2019

@kaxil PTAL.

kaxil approved these changes Nov 25, 2019

kaxil (Contributor) left a comment

Just some minor changes. Looks solid overall. Good work @KKcorps

Co-Authored-By: Kaxil Naik <kaxilnaik@gmail.com>
@kaxil kaxil merged commit c7c0a53 into apache:master Nov 25, 2019
1 check was pending: continuous-integration/travis-ci/pr (The Travis CI build is in progress)
kaxil (Contributor) commented Nov 25, 2019

Thanks @KKcorps

kaxil added several commits referencing this pull request on Dec 17-18, 2019 (cherry picked from commit c7c0a53); ashb added a commit referencing it Dec 19, 2019.
5 participants