[AIRFLOW-XXX] GSoD: How to make DAGs production ready #6515
Conversation
@KKcorps I think one of the important things to mention in this document is database access during parsing time, and especially avoiding the use of Airflow Variables in DAG files (they are still OK to use in the "execute" method). This is a known problem that people complain about a lot: the scheduler opens and closes a lot of connections to the database, because every time the file is parsed and a Variable is accessed, a database connection is opened and a query executed. There are lots of examples around that use Variables in DAGs, but this is not really a good practice, and I think this is a perfect place to mention it. I believe environment variables are a better way to share common configuration.
@potiuk Yes, I am all for it. Personally, I also faced this issue. I have mentioned in the documentation that you shouldn't write a lot of code outside the DAG because of frequent parsing, but I'll elaborate on that point to include this as well.
Great @KKcorps. It could also be worthwhile to search the existing documentation and check for contradictions between the practices and the examples (I believe there are a few places where those bad practices are shown as examples :). And we can fix them together.
Agreed. Using lots of Airflow Variables in the file, outside of task code, will cause many DB connections. It is fine to use them in a deferred way in a Jinja-templated field. For configuration needed outside of task code, use environment variables.
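A minimal sketch of the three cases discussed above (the DAG id, the ``FOO`` environment variable, and the ``foo`` Variable are hypothetical; the templated case assumes a Variable named ``foo`` exists at run time):

```python
import os

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

# Bad: runs on every scheduler parse of this file and opens a new
# database connection each time.
# from airflow.models import Variable
# foo = Variable.get("foo")

# Better: environment variables are read without touching the database.
foo = os.environ.get("FOO", "default")

with DAG("variables_example", schedule_interval=None, start_date=days_ago(1)) as dag:
    # Also fine: the Jinja template defers the Variable lookup to
    # execution time, so parsing the file stays cheap.
    print_foo = BashOperator(
        task_id="print_foo",
        bash_command="echo {{ var.value.foo }}",
    )
```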
I was also thinking to actually place this doc in the main section and name it Best Practices or something similar.
I like that idea
I like it too :)
docs/best-practices.rst
Outdated
Let's see what precautions you need to take.

Backend
Suggested change: Backend → Database backend
docs/best-practices.rst
Outdated
Logging
--------

If you are using disposable nodes in your cluster, configure the log storage to be a distributed file system such as ``S3`` or ``GCS``.
Airflow can also be used with other tools that do not provide a file system, e.g. Stackdriver Logging (not publicly available yet), Elasticsearch, or Amazon CloudWatch.
Agreed.
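For reference, remote log storage in Airflow 1.10 is enabled through configuration along these lines (the bucket path and connection id below are placeholders, and exact keys can vary between versions):

```ini
[core]
# Ship task logs to remote storage so they survive disposable nodes.
remote_logging = True
remote_base_log_folder = s3://my-bucket/airflow/logs
remote_log_conn_id = my_aws_conn
```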
docs/best-practices.rst
Outdated
Communication
--------------

Airflow executes tasks of a DAG in different directories, which can even be present
The directory bit isn't true, but the issue here is, as you mention, that tasks can be executed on different machines.
And even if using the LocalExecutor, storing files on local disk can make retries harder (especially if another task might have deleted the file in the meantime).
docs/best-practices.rst
Outdated
Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files.

The tasks should also not store any authentication parameters such as passwords or token inside them.
Always use :ref:`Connections <concepts-connections>` to store data securely in Airflow backend and retrieve them using a unique connection id.
Unfortunately we can't be so bold as to say "Always" -- not every system is supported, so "Where at all possible" might be the best we can say.
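A small sketch of the recommended pattern, where possible (the connection id is hypothetical and would be created beforehand via the UI or CLI; the lookup happens inside the callable so it runs at execution time, not parse time):

```python
from airflow.hooks.base_hook import BaseHook


def query_warehouse(**context):
    # The password never appears in the DAG file; it is resolved at
    # execution time from the Airflow backend by its connection id.
    conn = BaseHook.get_connection("my_warehouse_conn")
    print("connecting to %s as %s" % (conn.host, conn.login))
```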
docs/best-practices.rst
Outdated
on different servers in case you are using :doc:`Kubernetes executor <../executor/kubernetes>` or :doc:`Celery executor <../executor/celery>`.
Therefore, you should not store any file or config in the local filesystem — for example, a task that downloads the JAR file that the next task executes.

Always use XCom to communicate small messages between tasks or S3/HDFS to communicate large messages/files.
Please expand this to include something like "and then push a path to the remote file in Xcom to use in downstream tasks"
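Something like the following sketch would illustrate that pattern (the bucket and paths are hypothetical, and the actual upload/download calls are elided):

```python
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago


def produce(**context):
    # Upload the large artifact to shared storage (upload call elided),
    # then push only its remote path as a small XCom message.
    remote_path = "s3://my-bucket/artifacts/app.jar"
    context["ti"].xcom_push(key="jar_path", value=remote_path)


def consume(**context):
    # Pull the path from XCom and fetch the file from shared storage,
    # which works no matter which worker the task lands on.
    jar_path = context["ti"].xcom_pull(task_ids="produce", key="jar_path")
    print("downloading and executing %s" % jar_path)


with DAG("xcom_example", schedule_interval=None, start_date=days_ago(1)) as dag:
    t1 = PythonOperator(task_id="produce", python_callable=produce,
                        provide_context=True)
    t2 = PythonOperator(task_id="consume", python_callable=consume,
                        provide_context=True)
    t1 >> t2
```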
docs/best-practices.rst
Outdated
Multi-Node Cluster
-------------------

Airflow uses :class:`airflow.executors.sequential_executor.SequentialExecutor` by default. It works fine in most cases. However, by its nature, the user is limited to executing at most
I wouldn't say in most cases. Even on a local laptop LocalExecutor works much better.
The problem with SequentialExecutor is it pauses the scheduler when it runs a task, so even for a single-node production deploy SequentialExecutor is not a good choice.
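Switching away from it is a one-line configuration change, sketched below (note that LocalExecutor requires a non-SQLite metadata database such as MySQL or Postgres):

```ini
[core]
# LocalExecutor runs tasks in parallel and does not block the scheduler.
executor = LocalExecutor
```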
Co-Authored-By: Kamil Breguła <mik-laj@users.noreply.github.com>
Co-Authored-By: Ash Berlin-Taylor <ash_github@firemirror.com>
Almost there now. Looking good though!
docs/best-practices.rst
Outdated
You should treat tasks in Airflow equivalent to transactions in a database. It implies that you should never produce
incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task.

Airflow retries a task if it fails. Thus, the tasks should produce the same outcome on every re-run.
It can, if configured. The default is not to retry, though.
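A sketch covering both points: retries configured explicitly since they are off by default, and output published atomically so re-runs are safe (the paths and the upload/move helpers are hypothetical, storage-specific calls):

```python
from datetime import timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

# Retries are opt-in; without this, a failed task is simply marked failed.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}


def write_partition(**context):
    ds = context["ds"]  # the execution date keys the output partition
    staging = "s3://my-bucket/staging/%s" % ds  # hypothetical paths
    final = "s3://my-bucket/data/%s" % ds
    # Write the full result to a staging path first, then publish it with
    # a single move: a failure mid-task never leaves partial data behind,
    # and a re-run for the same date just overwrites the same partition.
    # upload(staging, rows); move(staging, final)  # storage-specific calls


with DAG(
    "idempotent_example",
    default_args=default_args,
    schedule_interval="@daily",
    start_date=days_ago(2),
) as dag:
    PythonOperator(
        task_id="write_partition",
        python_callable=write_partition,
        provide_context=True,
    )
```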
Co-Authored-By: Ash Berlin-Taylor <ash_github@firemirror.com>
@kaxil PTAL.
Just some minor changes. Looks solid overall. Good work @KKcorps
Co-Authored-By: Kaxil Naik <kaxilnaik@gmail.com>
Thanks @KKcorps
(cherry picked from commit c7c0a53)