Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-5946] Store source code in db #7217

Merged
merged 2 commits into from
Mar 13, 2020
Merged

Conversation

anitakar
Copy link
Contributor

@anitakar anitakar commented Jan 20, 2020

Store DAG's source code in the dag_code table

The web server no longer needs an access to the dags folder in the shared file system.
DAG's code can be shown from dag_code table.
Enabled only if store_dag_code flag is set to true.

https://issues.apache.org/jira/browse/AIRFLOW-5946

@boring-cyborg boring-cyborg bot added area:serialization area:webserver Webserver related Issues labels Jan 20, 2020
@boring-cyborg
Copy link

boring-cyborg bot commented Jan 20, 2020

Congratulations on your first Pull Request and welcome to the Apache Airflow community!
If you have any issues or are unsure about any anything please check our
Contribution Guide (https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)

Here are some useful points:

  • Pay attention to the quality of your code (flake8, pylint and type annotations). Our pre-commits
    will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory).
    Adding a new operator? Check this short [guide](https://github
    .com/apache/airflow/blob/master/docs/howto/custom-operator.rst) Consider adding an example DAG that shows
    how users should use it.
  • Consider using Breeze environment for testing
    locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from
    Committers.

Apache Airflow is a community-driven project and together we are making it better 🚀.

In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://apache-airflow-slack.herokuapp.com/

@mik-laj
Copy link
Member

mik-laj commented Jan 20, 2020

I'm not sure if storing this source code in DagModel is a good idea. One file can contain many DAG definitions. This file can be very large. I think it is worth introducing support for saving DAGs, but using a different data model. I will try to prepare a document that will describe the proposed solutions.

@mik-laj
Copy link
Member

mik-laj commented Jan 20, 2020

The webserver no longer needs access to the dags folder in the shared filesystem.

This will still be required in several cases:

  • To get the DAG structure, if DAG serialization is turned off
  • Rendered Template tab will still have to parse Python file as it needs all the details like
    the execution date and even the data passed by the upstream task using Xcom.

I think that this option should be optional. Many instances of Airflow do not need to store the source code in a database. This can have a negative impact on their performance.

@kaxil
Copy link
Member

kaxil commented Jan 20, 2020

Hi @anitakar, thank you for your PR. There is a open Jira issue and I am working on it: https://issues.apache.org/jira/browse/AIRFLOW-5946 as a part of Dag Serialization.

We still need to render templates, which we are trying to solve now, the PR is open for it: #6788

I am going to raise PRs for Code View this week.

@mik-laj
Copy link
Member

mik-laj commented Jan 20, 2020

Schema database

The considerations focus on the collection table and the further diagrams will only be the following tables:

  • dag
  • serialized_dag

#Current schema

Screenshot 2020-01-20 at 16 20 41

Schema changes proposed by anitakar

Screenshot 2020-01-20 at 16 20 41

Anita suggests adding a new source_code field in the serialzed_dag table

My proposition

Screenshot 2020-01-20 at 16 56 23

I think, we should add new dag_file table to avoid duplication of source code. New table have fileloc and fileloc_hash as primary key. The dag table contains only the fileloc field, but I think it would also be helpful to add fileloc_hash. I also used the blob type, because we don't need to process this code by text functions in the database.

Migration script for PostgresSQL

create table dag_file
(
        fileloc varchar(2000) not null,
	fileloc_hash integer not null,	
	last_updated timestamp with time zone not null,
	source_code BYTEA NOT NULL,
	PRIMARY KEY (fileloc, fileloc_hash)
);

alter table dag add fileloc_hash integer not null DEFAULT 0;
alter table dag alter column fileloc_hash drop default;

@kaxil
Copy link
Member

kaxil commented Jan 20, 2020

Yup, that is exactly what we discussed initially as mentioned in https://issues.apache.org/jira/browse/AIRFLOW-5946 :)

To make Webserver not need DAG Files we need to find a way to get Code to display in Code View.

  1. Store in lazy-loaded column in SerializedDag table
  2. Save in a new table with DAG_id and store versions as well. Add a limit of last 10 versions (configurable). This is just needed by Code View so not a problem if we store in New table

I am also more towards Kamil's side and the PR will be available soon too

@anitakar anitakar changed the title [AIRFLOW-NNNN] Store DAG's source code in the serialized_dag table [WIP] [AIRFLOW-NNNN] Store DAG's source code in the serialized_dag table Jan 20, 2020
@boring-cyborg boring-cyborg bot added area:dev-tools area:logging area:Scheduler including HA (high availability) scheduler provider:google Google (including GCP) related issues labels Jan 21, 2020
@kaxil
Copy link
Member

kaxil commented Jan 21, 2020

Hi Anita, I would appreciate if you can hold on to this PR before I get the TaskInstance and template rendering PR in

@anitakar
Copy link
Contributor Author

Sure. Kamil's database design makes much more sense. We avoid storing the same file source code multiple times, if file contains multiple DAGs.

I kind of jumped in with this PR without creating a bug or seeing what is happening in community when it comes to DAG serialization. @kaxil I shall wait for your commit then.

@kaxil
Copy link
Member

kaxil commented Jan 21, 2020

Sure. Kamil's database design makes much more sense. We avoid storing the same file source code multiple times, if file contains multiple DAGs.

I kind of jumped in with this PR without creating a bug or seeing what is happening in community when it comes to DAG serialization. @kaxil I shall wait for your commit then.

Thanks Anita, appreciate it.

@ashb
Copy link
Member

ashb commented Jan 21, 2020

How calling the table dag_source - it's possible in the future that we'd have an API endpoint to submit a dag, at which point there is no "file"

@anitakar anitakar force-pushed the dag_code_column branch 4 times, most recently from 615aefc to 099559a Compare January 27, 2020 13:46
@anitakar anitakar force-pushed the dag_code_column branch 4 times, most recently from a2e6b2a to 901700f Compare January 29, 2020 10:47
@codecov-io
Copy link

codecov-io commented Jan 29, 2020

Codecov Report

Merging #7217 into master will decrease coverage by 0.00%.
The diff coverage is 90.40%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7217      +/-   ##
==========================================
- Coverage   86.93%   86.93%   -0.01%     
==========================================
  Files         909      910       +1     
  Lines       43975    44066      +91     
==========================================
+ Hits        38229    38308      +79     
- Misses       5746     5758      +12     
Impacted Files Coverage Δ
airflow/www/utils.py 79.39% <ø> (-0.99%) ⬇️
airflow/utils/dag_processing.py 88.02% <33.33%> (-0.70%) ⬇️
airflow/api/common/experimental/get_code.py 72.72% <60.00%> (-4.20%) ⬇️
airflow/models/dag.py 91.21% <66.66%> (-0.10%) ⬇️
airflow/models/dagcode.py 92.47% <92.47%> (ø)
airflow/api/common/experimental/__init__.py 92.00% <100.00%> (-0.60%) ⬇️
airflow/exceptions.py 100.00% <100.00%> (ø)
airflow/models/serialized_dag.py 92.40% <100.00%> (-0.28%) ⬇️
airflow/utils/file.py 88.88% <100.00%> (+1.05%) ⬆️
airflow/www/views.py 76.27% <100.00%> (+0.01%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c997cab...a7fbf91. Read the comment docs.

Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM although I have 1 comment:

I would love this to have the following:

  • store_dag_code should default to "True" if store_serialized_dags=True as the general assumption with
    Serialized DAGs is that the webserver doesn't have access to DAG Files
  • If someone wants to read code from DAG Files even with Serialized DAGs, they should be needed to set store_dag_code=False.

Currently, even though if store_serialized_dags is set toTrue, it would still not store DagCode to DB.

@@ -335,6 +335,15 @@
type: string
example: ~
default: "True"
- name: store_dag_code
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this to live next to store_serialized_dags as these two settings are related.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have moved it just below min_serialized_dag_update_interval because it seemed to me that it should stay just below store_serialized_dags. But I have no strong opinion here.

Comment on lines 340 to 346
Whether to persist DAG files code in DB.
If set to True, Webserver reads file contents from DB instead of
trying to access files in a DAG folder.
version_added: 2.0.0
type: string
example: ~
default: "False"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To address Kaxil's comment I think we can do this

Suggested change
Whether to persist DAG files code in DB.
If set to True, Webserver reads file contents from DB instead of
trying to access files in a DAG folder.
version_added: 2.0.0
type: string
example: ~
default: "False"
Whether to persist DAG files code in DB.
If set to True, Webserver reads file contents from DB instead of
trying to access files in a DAG folder. Defaults to same as the
store_serialized_dags setting
version_added: 2.0.0
type: string
example: ~
default: "%(store_serialized_dags)s"

This will use the "Basic interpolation" built in to config parser https://docs.python.org/3/library/configparser.html#configparser.BasicInterpolation

Example:

[Paths]
home_dir: /Users
my_dir: %(home_dir)s/lumberjack
my_pictures: %(my_dir)s/Pictures

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I was too quick with committing your first suggestion so that I could not accept your first suggestion anymore. But I have manually added it.

… the dag_code table and is queried from here when the Code view is opened for the DAG. The webserver no longer needs access to the dags folder in the shared filesystem.

Co-Authored-By: Kaxil Naik <kaxilnaik@gmail.com>
Co-Authored-By: Kamil Breguła <mik-laj@users.noreply.github.com>
@anitakar anitakar force-pushed the dag_code_column branch 2 times, most recently from 1682db2 to 4acf5bb Compare March 13, 2020 14:46
Co-Authored-By: Kaxil Naik <kaxilnaik@gmail.com>
@kaxil
Copy link
Member

kaxil commented Mar 13, 2020

Waiting for the CI to complete and pass :)

@kaxil kaxil merged commit e146518 into apache:master Mar 13, 2020
@kaxil
Copy link
Member

kaxil commented Mar 13, 2020

Good work @anitakar 🎉

@mik-laj
Copy link
Member

mik-laj commented Mar 13, 2020

🎉 🐈

@potiuk
Copy link
Member

potiuk commented Mar 13, 2020

🎉

import hashlib
# Only 7 bytes because MySQL BigInteger can hold only 8 bytes (signed).
return struct.unpack('>Q', hashlib.sha1(
full_filepath.encode('utf-8')).digest()[-8:])[0] >> 8
Copy link
Member

@ashb ashb Apr 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anitakar Did you mean for this construct to ignore the least-significant byte?

Location: /home/ash/airflow/dags/example.py
SHA1 (hex) 0x78288e229ef15e32a7a32e1fd123d9fc60b5eae0
fileloc_hash: 58867689281598954
fileloc_hash hex: d123d9fc60b5ea

i.e. it ignores/removes the e0 from the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In all honesty I wanted to use all 8 bytes of mysql biginteger.
I used another construct before but it was not working with python 2, so I have changed it to use python struct. But I should have used signed long long instead of unsigned long long.
When I have noted it the code was too close to releasing and 7 bytes is enough anyway (https://cloud.google.com/composer/docs/release-notes).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figured it was something like that, just wanted to check if there was a deeper reason. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:dev-tools area:logging area:Scheduler including HA (high availability) scheduler area:serialization area:webserver Webserver related Issues provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants