[AIRFLOW-5946] Store source code in db #7217
Congratulations on your first Pull Request and welcome to the Apache Airflow community! Here are some useful points:
Apache Airflow is a community-driven project and together we are making it better 🚀. In case of doubts contact the developers at:
I'm not sure if storing this source code in DagModel is a good idea. One file can contain many DAG definitions. This file can be very large. I think it is worth introducing support for saving DAGs, but using a different data model. I will try to prepare a document that will describe the proposed solutions.
This will still be required in several cases:
I think that this feature should be optional. Many instances of Airflow do not need to store the source code in a database, and doing so can have a negative impact on their performance.
Hi @anitakar, thank you for your PR. There is an open Jira issue and I am working on it: https://issues.apache.org/jira/browse/AIRFLOW-5946 as a part of DAG Serialization. We still need to render templates, which we are trying to solve now; the PR for it is open: #6788. I am going to raise PRs for Code View this week.
Schema database
The considerations focus on the collection table and the further diagrams will only be the following tables:
Current schema
Schema changes proposed by anitakar
Anita suggests adding a new
My proposition
I think we should add new
Migration script for PostgreSQL
create table dag_file
(
fileloc varchar(2000) not null,
fileloc_hash integer not null,
last_updated timestamp with time zone not null,
source_code BYTEA NOT NULL,
PRIMARY KEY (fileloc, fileloc_hash)
);
alter table dag add fileloc_hash integer not null DEFAULT 0;
alter table dag alter column fileloc_hash drop default;
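As a rough illustration of why a separate per-file table avoids duplication, the sketch below mirrors the proposed dag_file table in an in-memory SQLite database. SQLite stands in for PostgreSQL here, so the column types are approximations. The simplified dag table with its dag_id column, the sample path, and the hash value 42 are assumptions made only for this example.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dag_file (
        fileloc TEXT NOT NULL,
        fileloc_hash INTEGER NOT NULL,
        last_updated TEXT NOT NULL,
        source_code BLOB NOT NULL,
        PRIMARY KEY (fileloc, fileloc_hash)
    );
    -- Simplified stand-in for the existing dag table (dag_id is assumed).
    CREATE TABLE dag (
        dag_id TEXT PRIMARY KEY,
        fileloc_hash INTEGER NOT NULL
    );
    """
)

fileloc = "/dags/example.py"  # hypothetical path
fileloc_hash = 42  # placeholder; a real value would be derived from fileloc
source = b"# two DAG definitions live in this one file\n"

# The file's source is stored exactly once...
conn.execute(
    "INSERT INTO dag_file VALUES (?, ?, ?, ?)",
    (fileloc, fileloc_hash, datetime.now(timezone.utc).isoformat(), source),
)
# ...while both DAGs defined in that file point at it through the hash.
conn.executemany(
    "INSERT INTO dag VALUES (?, ?)",
    [("dag_a", fileloc_hash), ("dag_b", fileloc_hash)],
)


def code_for_dag(dag_id: str) -> bytes:
    """Look up a DAG's source code through its file hash."""
    row = conn.execute(
        "SELECT f.source_code FROM dag d "
        "JOIN dag_file f ON f.fileloc_hash = d.fileloc_hash "
        "WHERE d.dag_id = ?",
        (dag_id,),
    ).fetchone()
    return row[0]


stored_files = conn.execute("SELECT COUNT(*) FROM dag_file").fetchone()[0]
```

Both DAGs resolve to the same source bytes while dag_file holds a single row, which is the saving this design is aiming for.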
Yup, that is exactly what we discussed initially as mentioned in https://issues.apache.org/jira/browse/AIRFLOW-5946 :) To make Webserver not need DAG Files we need to find a way to get Code to display in Code View.
I also lean towards Kamil's proposal, and the PR will be available soon too.
Hi Anita, I would appreciate it if you could hold this PR until I get the TaskInstance and template rendering PR in.
Sure. Kamil's database design makes much more sense. We avoid storing the same file's source code multiple times if the file contains multiple DAGs. I kind of jumped in with this PR without creating a bug or seeing what is happening in the community when it comes to DAG serialization. @kaxil I shall wait for your commit then.
Thanks Anita, appreciate it.
How about calling the table dag_source? It's possible that in the future we'd have an API endpoint to submit a DAG, at which point there is no "file".
Codecov Report
@@ Coverage Diff @@
## master #7217 +/- ##
==========================================
- Coverage 86.93% 86.93% -0.01%
==========================================
Files 909 910 +1
Lines 43975 44066 +91
==========================================
+ Hits 38229 38308 +79
- Misses 5746 5758 +12
LGTM although I have 1 comment:
I would love this to have the following:
store_dag_code should default to "True" if store_serialized_dags=True, as the general assumption with Serialized DAGs is that the webserver doesn't have access to DAG Files.
If someone wants to read code from DAG Files even with Serialized DAGs, they should need to set store_dag_code=False.
Currently, even if store_serialized_dags is set to True, it would still not store DagCode to DB.
airflow/config_templates/config.yml
@@ -335,6 +335,15 @@
type: string
example: ~
default: "True"
- name: store_dag_code
Let's move this to live next to store_serialized_dags as these two settings are related.
I have moved it just below min_serialized_dag_update_interval because it seemed to me that it should stay just below store_serialized_dags. But I have no strong opinion here.
airflow/config_templates/config.yml
Whether to persist DAG files code in DB.
If set to True, Webserver reads file contents from DB instead of
trying to access files in a DAG folder.
version_added: 2.0.0
type: string
example: ~
default: "False"
To address Kaxil's comment I think we can do this
- Whether to persist DAG files code in DB.
- If set to True, Webserver reads file contents from DB instead of
- trying to access files in a DAG folder.
- version_added: 2.0.0
- type: string
- example: ~
- default: "False"
+ Whether to persist DAG files code in DB.
+ If set to True, Webserver reads file contents from DB instead of
+ trying to access files in a DAG folder. Defaults to same as the
+ store_serialized_dags setting
+ version_added: 2.0.0
+ type: string
+ example: ~
+ default: "%(store_serialized_dags)s"
This will use the "Basic interpolation" built in to config parser https://docs.python.org/3/library/configparser.html#configparser.BasicInterpolation
Example:
[Paths]
home_dir: /Users
my_dir: %(home_dir)s/lumberjack
my_pictures: %(my_dir)s/Pictures
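To see how that interpolation would behave for the two settings discussed here, the following is a minimal sketch using only the standard-library configparser. The `[core]` section name and the inline config text are assumptions for the example; this is not Airflow's own configuration loader.

```python
import configparser

# Hypothetical config text: store_dag_code takes its value from
# store_serialized_dags through Basic interpolation.
CONFIG_TEXT = """
[core]
store_serialized_dags = True
store_dag_code = %(store_serialized_dags)s
"""


def load(text: str) -> configparser.ConfigParser:
    """Parse config text; BasicInterpolation is ConfigParser's default."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


# With no explicit value, store_dag_code follows store_serialized_dags.
parser = load(CONFIG_TEXT)
inherited = parser.getboolean("core", "store_dag_code")

# Setting the option explicitly replaces the interpolated default.
override = load(CONFIG_TEXT)
override.set("core", "store_dag_code", "False")
overridden = override.getboolean("core", "store_dag_code")

print(inherited, overridden)
```

The interpolated value only acts as a default: once store_dag_code is set explicitly, it no longer tracks store_serialized_dags.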
Thanks. I was too quick to commit your first suggestion, so I could no longer accept it directly. But I have added it manually.
… the dag_code table and is queried from here when the Code view is opened for the DAG. The webserver no longer needs access to the dags folder in the shared filesystem. Co-Authored-By: Kaxil Naik <kaxilnaik@gmail.com> Co-Authored-By: Kamil Breguła <mik-laj@users.noreply.github.com>
Co-Authored-By: Kaxil Naik <kaxilnaik@gmail.com>
Waiting for the CI to complete and pass :)
Good work @anitakar 🎉
🎉 🐈
🎉
import hashlib
# Only 7 bytes because MySQL BigInteger can hold only 8 bytes (signed).
return struct.unpack('>Q', hashlib.sha1(
    full_filepath.encode('utf-8')).digest()[-8:])[0] >> 8
@anitakar Did you mean for this construct to ignore the least-significant byte?
Location: /home/ash/airflow/dags/example.py
SHA1 (hex) 0x78288e229ef15e32a7a32e1fd123d9fc60b5eae0
fileloc_hash: 58867689281598954
fileloc_hash hex: d123d9fc60b5ea
i.e. it ignores/removes the e0 from the end.
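The arithmetic in this comment can be reproduced with a short standalone sketch. The function names below are chosen for the example and are not necessarily the ones used in the PR. The digest is the SHA1 value quoted above, so nothing here depends on re-hashing the path.

```python
import hashlib
import struct


def hash_from_digest(digest: bytes) -> int:
    """Read the last 8 digest bytes as a big-endian unsigned 64-bit
    integer, then shift right by 8 bits, discarding the lowest byte."""
    return struct.unpack(">Q", digest[-8:])[0] >> 8


def fileloc_hash(full_filepath: str) -> int:
    """Apply the same construct as the reviewed snippet to a file path."""
    digest = hashlib.sha1(full_filepath.encode("utf-8")).digest()
    return hash_from_digest(digest)


# SHA1 digest quoted in the review comment.
quoted_digest = bytes.fromhex("78288e229ef15e32a7a32e1fd123d9fc60b5eae0")
value = hash_from_digest(quoted_digest)

print(value)       # 58867689281598954
print(hex(value))  # 0xd123d9fc60b5ea
```

The last 8 bytes of the digest are d123d9fc60b5eae0. The right shift drops the trailing e0, which leaves a 7-byte value that always fits in a signed 64-bit column.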
In all honesty, I wanted to use all 8 bytes of the MySQL BigInteger.
I used a different construct before, but it did not work with Python 2, so I changed it to use Python's struct module. I should have used signed long long instead of unsigned long long, though.
By the time I noticed, the code was too close to release, and 7 bytes is enough anyway (https://cloud.google.com/composer/docs/release-notes).
Figured it was something like that, just wanted to check if there was a deeper reason. Thanks.
Store DAG's source code in the dag_code table
The web server no longer needs access to the dags folder in the shared file system.
The DAG's code can be shown from the dag_code table.
Enabled only if the store_dag_code flag is set to true.
https://issues.apache.org/jira/browse/AIRFLOW-5946