Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add airflow upgrade_check command (MVP) #9276

Closed
wants to merge 1 commit into from

Conversation

mik-laj
Copy link
Member

@mik-laj mik-laj commented Jun 13, 2020

This is a DRAFT and I still have to complete the missing documentation and unit tests.

I wanted to prepare a minimal valuable product for the upgrade_check command. Only one rule is implemented, but I think that's enough to show the feature. In the next steps, we can think about what rules we need to create, create tickets, and then work on the next rules by the whole community.

Recently, we have a lot of new beginner contributors and I think we can try to engage them for this task. Most rules will be small and independent, so they will be great candidates for good-first-issue.

An example of the result of this command is presented in the termianlshot below
Screenshot 2020-06-13 at 12 58 38

Example JSON:

[
  {
    "rule": "ConnTypeIsNotNullableRule",
    "message": "Connection<id=''\", conn_id=AA> have empty conn_type field."
  }
]

Part of: #8765


Make sure to mark the boxes below before creating PR: [x]

  • Description above provides context of the change
  • Unit tests coverage for changes (not needed for documentation changes)
  • Target Github ISSUE in description if exists
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@mik-laj mik-laj marked this pull request as draft June 13, 2020 11:13
@mik-laj mik-laj changed the title Add airflow upgrade-check command (MVP) Add airflow upgrade_check command (MVP) Jun 13, 2020
@mik-laj mik-laj force-pushed the upgrade-check-mvp branch 2 times, most recently from e3f9d5f to 97654ae Compare June 13, 2020 11:49
@potiuk
Copy link
Member

potiuk commented Jun 13, 2020

I like the idea, but we have to make sure that this tool is usable without installing airflow 2.0 first.

The usage pattern will be

A) run the upgrade check on airflow 1.10

B) fix all the problems

C) re-run it

d) only afterwards install airflow 2.0 over the 1.10

So I do not think it's a good idea to have it as airflow 2.0 CLI. I t should likelly be a separate package or even a standalone script that you can run without upgrading airflow to 2.0

@mik-laj
Copy link
Member Author

mik-laj commented Jun 13, 2020

@potiuk You're right. This complicates the situation a bit, but it's worth doing. I thought that we could in the future backport the airflow.upgrade package to the latest Airflow 1.10 release, but that would not allow us to test these changes on a regular basis with the latest Airflow 1.10. The new package is a better idea. I was too much inspired by Ash's original proposal - #8765
We need a little more flexibility.

@ashb
Copy link
Member

ashb commented Jun 14, 2020

A nice framework

@ashb
Copy link
Member

ashb commented Jun 14, 2020

We should possibly just so this directly on 1.10 branches though.

The way I said we would do upgrades was "install latest 1.10 version, fix warnings, then install 2.0”. Doing a separate package seems overkill, we can just add more rules as we work them out.

Many we can check by capturing the existing DeprecationWarnings we already emit.

from airflow.upgrade.rules.base_rule import BaseRule


class BaseFormatter(metaclass=ABCMeta):
Copy link
Member

@ashb ashb Jun 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please can we not have one class per file - this isn't Java.

All of these formatter classes would live in the same module - airflow/upgrade/formatters.py

# invalid_connections = session.query(Connection).filter(Connection.conn_type.is_(None))
invalid_connections = [Connection(conn_id="AA")]
return (
f'Connection<id={quote(conn.id)}", conn_id={conn.conn_id}> have empty conn_type field.'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{conn.id!r} seems more natural than shell quoting function.

Though conn.id is an integer, so quoting it is not needed. Did you mean to do this on conn_id instead?

@potiuk
Copy link
Member

potiuk commented Jun 15, 2020

The way I said we would do upgrades was "install latest 1.10 version, fix warnings, then install 2.0”. Doing a separate package seems overkill, we can just add more rules as we work them out.

This make a silent assumption that you always want to migrate to latest 1.10 and then to 2.0 which we might, or might not require - depending on how (and if) we are going to migrate the database. I do not think this is a hard prerequisite (is it @ashb ?) to migrate to latest 1.10 first?

I think it will needlessly complicate the migration path - especially if (which I presume is the case) w are not going to migrate the database.

"""A long description explaining the problem in detail. This can be an entry from UPDATING.md file."""
pass

def check(self) -> List[str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return a list of Problems instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem stores references to the rule, so the problem must be initiated as follows: Problem(rule=rule, message=message). It seemed too talkative to me right now. This code will be repeated in every rule. Now the rules are a little easier to write. We can add support for returning it in the future when we need it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it longer and will come back with another way to store the reference to the rule. However, this is a detail. I will wait until we set other high-level requirements.

@ashb
Copy link
Member

ashb commented Jun 15, 2020

I think it will needlessly complicate the migration path - especially if (which I presume is the case) w are not going to migrate the database.

What do you mean about "not migrate the database"? As in you expect people to start 2.0 with a clean/empty metadata DB?

@ashb
Copy link
Member

ashb commented Jun 15, 2020

This make a silent assumption that you always want to migrate to latest 1.10 and then to 2.0

I wasn't planning on making it a silent assumption but the official upgrade path:

  1. Update to latest 1.10.x (likely 1.10.12?)
  2. Run upgrade check and apply changes (install required backport providers etc.)
  3. Test this in staging->production etc.
  4. Upgrade staging+test
  5. Upgrade prod.

This approach (latest 1.10.x without warnings, then major upgrade) is based loosely off how Django handles upgrades and deprecation, https://docs.djangoproject.com/en/3.0/howto/upgrade-version/

@potiuk
Copy link
Member

potiuk commented Jun 15, 2020

What do you mean about "not migrate the database"? As in you expect people to start 2.0 with a clean/empty metadata DB?

I do not know what are the plans for now I do not know if it's been discussed - the migrations are long time separated between the databases in 1.10 and 2.0 so I am not sure if there is a clear path (yet) on how to migrate from a version of 1.10 to 2.0.

There are few paths that might be taken for that. I see at least three:

  1. migrate the database on a live instance

  2. create a new schema and migrate some of the data if you cannot do live DB migration for whatever reason/

  3. Clean-cut - start with Airflow 2.0 from the scratch having both - old and new version of Airflow running for some time keeping the old installation as "historical".

I think all of those scenarios are possible and valid, and only 1) requires migrating to the latest Airflow before switching to 2.0.

There are pros and cons for all those scenarios and I am not sure we are ready to make a decision on how to do it without checking with our users and possibly listening to how they would run the migration in their environments.

I think we should discuss which of the migration paths we are going to support from the community point of view.

@ashb
Copy link
Member

ashb commented Jun 15, 2020

If we don't give users the ability to test changes without jumping straight to 2.0 then a lot of bigger companies are going to be very reluctant to upgrade. (It's hard enough to get them to upgrade minor Airflow versions.)

I do not know what are the plans for now I do not know if it's been discussed - the migrations are long time separated between the databases in 1.10 and 2.0 so I am not sure if there is a clear path (yet) on how to migrate from a version of 1.10 to 2.0.

It is a hard requirement that it is possible to update DB from 1.10 to 2.0. Resetting the DB means a very likely chance that Airflow would then go and re-run months or years worth of tasks.

@ashb
Copy link
Member

ashb commented Jun 15, 2020

I am willing to do the work necessary to make it possible to upgrade a DB in place. I feel very strongly that not allowing it would be a disaster for uptake of 2.0.

@potiuk
Copy link
Member

potiuk commented Jun 15, 2020

If we don't give users the ability to test changes without jumping straight to 2.0 then a lot of bigger companies are going to be very reluctant to upgrade. (It's hard enough to get them to upgrade minor Airflow versions.)

Absolutely all those scenarios should include testing. But I think there will be many companies that will not like to make two migrations - 1.10.x whatever they have now - 1.10.12 -> 2.0.0., they would rather migrate to 2.0 straight. Of course, we might choose to not support this path - as community. But I do not think it should be taken as granted without discussion and hearing the user's opinion on that.

I do not know what are the plans, for now, I do not know if it's been discussed - the migrations are long time separated between the databases in 1.10 and 2.0 so I am not sure if there is a clear path (yet) on how to migrate from a version of 1.10 to 2.0.

It is a hard requirement that it is possible to update DB from 1.10 to 2.0. Resetting the DB means a very likely chance that Airflow would then go and re-run months or years worth of tasks.

I agree "clean-cut' might not be the best way of doing it, yet I imagine scenarios where this might be the path taken by our users. I think it's worth starting the discussion on the devlist a least, as this might be a viable option for some of the customers (no capacity for that from my side though currently).

I do not think we had an opportunity to migrate the "major" largely backwards in-compatible version yet, so hearing what the users think about this might be a good way to make good decisions on it.

@potiuk
Copy link
Member

potiuk commented Jun 15, 2020

I am willing to do the work necessary to make it possible to upgrade a DB in place. I feel very strongly that not allowing it would be a disaster for uptake of 2.0.

Absolutely. This is a must - we have to definitely support that. I have no doubt whatsoever that we should do it. I't just the question if a) we also support scenario with clean cut b) how we migrate the DB for users which are not on the latest (1.10.12?) version. I do not have ready answers yet. I am just asking questions for now as I see different options are possible.

@ashb
Copy link
Member

ashb commented Jun 17, 2020

https://docs.google.com/presentation/d/1lye_qklHkrucVrMyCzApdBSBSsnjkg3oGGlBsjlQT2I/edit?ts=5eb32894#slide=id.g84d35adb5c_0_14 (slide 60 if that link doesn't take you there) was where I talked about my proposed upgrade path.

(None of these discussions affect the code in the PR, just where we put it in the end.)

@potiuk
Copy link
Member

potiuk commented Jun 18, 2020

Yeah. I had some thoughts and discussed with some of the stakeholders as well (from the Composer team where lots of the Airflow users are), and I think we all agree that going through latest 1.10.* release makes most sense.

@mik-laj
Copy link
Member Author

mik-laj commented Jun 18, 2020

If no one is against it, I will set the target of this PR to v1-10-test when I return to work on this change. I will try to come back quickly so that other people can add their rule based on it.

@stale
Copy link

stale bot commented Aug 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 2, 2020
@stale stale bot closed this Aug 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:CLI area:plugins stale Stale PRs per the .github/workflows/stale.yml policy file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants