Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove MSSQL support from Airflow core #35868

Conversation

jscheffl
Copy link
Contributor

@jscheffl jscheffl commented Nov 26, 2023

PR to remove MSSQL support from Airflow - as discussed and voted in https://lists.apache.org/thread/pgcgmhf6560k8jbsmz8nlyoxosvltph2

This PR removes MSSQL from:

  • Core Metadatabase
  • (Historic) Migration scripts
  • Documentation

It does NOT remove the provider package, so no mis-understanding: Apache Airflow further will "work" with MSSQL and the provider package and features are un-touched. All Operators keep in place. Just the metadata support is reduced to sqlite (dev only), Postgres, MySQL.
Note that for this also the build of MSSQL client code is kept inside the Docker as Provider packages rely on this availability.

Preconditions before merge:

@boring-cyborg boring-cyborg bot added area:CLI area:dev-tools area:production-image Production image improvements and fixes area:Scheduler Scheduler or dag parsing Issues area:serialization labels Nov 26, 2023
@potiuk
Copy link
Member

potiuk commented Nov 27, 2023

Should work after rebase

@jscheffl jscheffl added this to the Airflow 2.8.0 milestone Nov 27, 2023
Dockerfile Outdated Show resolved Hide resolved
Dockerfile Outdated Show resolved Hide resolved
Dockerfile Outdated Show resolved Hide resolved
@jscheffl
Copy link
Contributor Author

Thanks @potiuk for the analysis and the hist to history. My main "mis-understanding" was that I did not know that there are multiple installs for the driver. Was wondering if this is a leftover (=garbage). But understood it is a "feature".

I understand (now) that mssql is needed in the docker for the provider package, which is fail. reverted the last two (functional) commits, leaving the driver and the install options for the docker build scripts like before.

@jscheffl jscheffl force-pushed the feature/remove-mssql-support-from-airflow-core branch from 278baa1 to f321c84 Compare November 28, 2023 06:15
@jscheffl jscheffl requested a review from potiuk November 28, 2023 08:31
@potiuk
Copy link
Member

potiuk commented Nov 28, 2023

Thanks @potiuk for the analysis and the hist to history. My main "mis-understanding" was that I did not know that there are multiple installs for the driver. Was wondering if this is a leftover (=garbage). But understood it is a "feature".

I understand (now) that mssql is needed in the docker for the provider package, which is fail. reverted the last two (functional) commits, leaving the driver and the install options for the docker build scripts like before.

Yeah. It is not clear indeed :). As @Taragolis mentioned in slack, likely we should split the mssql_sh installation in two.

Just to add a bit of context here.

The Dockerfiles of ours (especially the production one) are built with "users can build their own custom, optimized builds files using them". There is the whole set of documentation and set of arguments users can use if they want to build their own, even more optimized image using our Dockerfile if they want. Currently we pre-built two images:

  • the regular one (AKA "reference" image - where we add libraries needed, Postgres, MySQL, MSSQL clients and most popular (we think) providers

  • the "slim" image - which contains all database clients and airflow - without providers (except the preinstalled ones)

But the users can simply get our Dockerfile and build their own image using build-args https://airflow.apache.org/docs/docker-stack/build.html#customizing-the-image

For now both MSSQL clients we install (pymssql and msodbc) are controlled by a single arg when you want to build custom image with or without it:

https://airflow.apache.org/docs/docker-stack/build-arg-ref.html#image-optimization-options

INSTALL_MSSQL_CLIENT | true | Whether MsSQL client should be installed

But likely we should split that into two variables.

Another small comment/context. For development purpose you can very easily build your own, custom PROD image using these. Without even looking at the variable / build image docs.

The breeze prod-image build command have all of those parameters nicely explained and documented and they are even auto-completeable - see "advanced customisation optins". You can even run --dry-run with selected parameters and it will spit-out the right docker build command with the right build args used converted from the nicely documented breeze prod-image build auto-completeable and documented flags.

image

@@ -139,11 +139,11 @@ def upgrade():
type_=mysql.TIMESTAMP(fsp=6),
)
else:
# sqlite and mssql datetime are fine as is. Therefore, not converting
Copy link
Member

@potiuk potiuk Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not really sure if we want to remove mssql in all the migrations from the past.

Maybe yes, Maybe no. I have arguments for both.

The arguments for "removing MSSQL From migrations":

I am leaning towards removing them (so basically what this PR does), but I am a bit torn on that one. On one hand I would love to constraint and limit the "mssql" users to only use Airflow 2.7.3 and never, ever use airflow >= 2.8.0 for anything MSSQL. On the other hand it feels a little weird. I'd love other's opinion - especially those who worked on the migrations in the past: @ephraimbuddy @dstandish @ashb @kaxil @eladkal

The argument for "leaving MSSQL in migrations":

It does not really looks like it is necessary to remove them and does not cost us much. It's a bit sunken cost already and we are not going to maintain those migrations anyway. <aybe at some point in time we might want to merge them somehow and get a new "starting point" migration in the future, but I am not sure what is the value in removing mssql from the old migrations.

It feels a bit weird to change the past migrations that are already marked as "I run this migration already" by alembic. I know it SHOULD NOT matter when you have Postgres/MySQL, but it feels a bit wrong to modify the migration scripts, especially that it does not hurt us to leave them there (we generally do not touch/maintain migrations too often).

I see two potential prolems with removing the migrations for mssql from those migrations:

  • there is a risk we remove something here accidentally by missing it in review - there are a lot of files to be reviewed and lots of conditions, there is a slight risk we will miss it. But with careful review we can mitigate it.

  • there is a side-effect on --from-ref/--to-ref, --from-version/--to-version. Since we are still supporting MSSQL for airlfow < 2.8.0, it would be nice if Airflow 2.8+ still supports migrations for older versions of airflow even for MSSQL.

I think (but I might be wrong) - just leaving the mssql in migration code will just allow it to be run and it SHOULD work in general. So users will still be able - for example - to run airflow db migrate --to-version 2.7.3 using Airflow 2.8, 2.9 etc. There will be no need to run those migration with Airflow 2.7.3 strictly.

My current thinking

But I am bit torn on that one. It also relates to the discussion we had on #35861 and what we want to tell our users who run MSSQL now.

So far my points there at least were:

  • let's constraint the set of tool and environment as much as we can to reduce variability of environment
  • let's provide a tool/script that they can use - but also modify if needed, so that it's not part of "airflow" itself

So if we follow and agree that line of thought - then yes, removing MSSQL from Airflow 2.8.0 and not allowing the MSSQL users to run --to-version 2.7.3 feels about right. On the other hand - if we allow for variability, and letting the users to do the migrations in various ways, it's ok to leave it in.

I wonder what you think?

Copy link
Contributor Author

@jscheffl jscheffl Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% on your argument line - and I really had the same thoughts, just did not write it like this.

Mainly I started removing migrations when I cleaned (future) dead code in airflow/migrations/utils.py and then saw migration scripts failing because common functions missing. Migration code is not 100% separated.

As we are shortly before the release and we need more days in discussing/debating about migrations I'd also be OK to split the PR in two. (with a minimum risk that I need to leave some migration dependency dead code behind).
And still if somebody want to play with MSSQL there is the 2.7.3 release available, from there still you could up-and downgrade. Leaving traces behind if 2.8.0 does not support MSSQL anyway is just some leftover cleanup. And I would not like to keep all MSSQL dependencies in the future (2.9++?) just because in ancient history we supported MSSQL - rather clean it soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waiting to what others say. I think we do not have to rush it. Spoke to @ephraimbuddy today and RC is planned 11th of December. I think we have quite some time to discuss, decide, iterate and merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main risk I see is this PR is quire large, if we merge last-minute, there is no buffer to see if we had a unforeseen side-effect. But I am okay if you see it relaxed :-D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. The problem with size here is that it's "all" or "nothing" in case of migrations (and those are the big ones) . All the rest is "2 minutes" review :).

So yeah, agree we should - I think - quickly - get to conclusions on how we should do it and start reviewing/merging what we agreed about.

CC: everyone else who migh have an opinion :) . Maybe also worth to send it to devlist to raise "importance" of it - might be in the response of the vote/consensus we had about MSSQL removal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand it feels a little weird

Does it? even before we started experimental support we accepted changes to migration scripts to better support MsSQL. It doesn't cost us anything other than few min of review.
i think we should keep it and also accept PRs improving/fixing issues related to MsSQL.

The removal of MsSQL support from our side just means that we do not test Airflow deployment against MsSQL backend. It doesn't mean that it won't work it just mean that we don't guarantee it.

Copy link
Member

@potiuk potiuk Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. It was a "little" weird. But I could go either way.

The removal of MsSQL support from our side just means that we do not test Airflow deployment against MsSQL backend. It doesn't mean that it won't work it just mean that we don't guarantee it.

I think we need a a bit more than that.

There are two potential scenarios for our MSSQL current users:

  1. Scenario 1 - we do not hard/stop mssql users for migration and it will maybe work and maybe not

If we leave it in "maybe it will work" then our users will continue using airflow until it breaks and willl come to us when it will stop (inevitably) - the whole point is that the "mssql-specific" code in Airflow slows us down, so getting rid of it is a good idea and also it's likely in a few months time, it will catastrophically break MSSQL installation - say somoene already moved to 2.9.0 - still using MsSQL and then it will break in 2.10.0. Now. they have no way to go back (becuase migration scripts will also break for example) and we have a user who has broken installation that they can do nothing about.

And they will complain to us, will take our time, will cry for help and workarounds regardless what is the status of MSSQL - and most of all they will loose the trust in Airflow as a good solution for them because it won't work for them for possibly days when they will try to recover somehow. And this all will be happening 2 years from now when nobody here will even remember what kind of changes they were there implmented and likely many of people who are now in Airflow won't even be around.

And this is actually one of the reasons we want to remove MSSQL - we do not WANT to get issues about MSSQL for 2.8+

Is it their fault? Technically yes - they have not read the release notes (I have no stats but there is quite a number of people who don't).

  1. Scenario 2 - we forbid MSSQL and remove MSSQL code straight away

We can explicitly forbid migration to 2.8.0 for MSSQL users. that will be much cleaner solution. We will provide them the migration script on 2.7.3 and how to migrate to other database. The users will see it immediatley when they will attempt to migrate - and they will have two choices: a) stay with Airflow 2.7.3 longer b) migrate to other DB.

I think this is far more plausible scenario - both for our users and maintainers (including future ones) of Airflow. The cut is clean, and we know already that (some) people do not read release notes so basically - if we allow MSSQL users to migrate to 2.8.0, then it's all but guaranteed that some of our users will actually get into troubles sooner or later. We are basically accepting the fact that some percentage of our users, rather than stopped from migration - will migrate and eventually get in the troubles.

So I'd personally be for even saying:

if mssql => raise Error ("Migrate to Postgres or MySQL").

@potiuk
Copy link
Member

potiuk commented Nov 28, 2023

I understand (now) that mssql is needed in the docker for the provider package, which is fail. reverted the last two (functional) commits, leaving the driver and the install options for the docker build scripts like before.

So there are more changes that are potentially backwards-incompatible coming, removing mssql odbc library, seems like not a big issue.

I don't know if this "breaking" is really a problem, users still could follow building the dependencies on their own. It is breaking apart a non-promised dependency (I think we never promised which dependencies we ship per default, was just a dependency because we had it in code). So in my eyes making a note is OK. Else also for other python depts we would never be able to remove any. And as one other MSSQL client driver is there, it would mainly (in 90% of cases) just need to switch the connection URL (if not custom python code)?

@jscheffl jscheffl added the area:dev-env CI, pre-commit, pylint and other changes that do not change the behavior of the final code label Nov 28, 2023
@ephraimbuddy
Copy link
Contributor

I'm not confortable having this in 2.8.0. Let's leave it for 2.9.0 because the change is big and close to the release time.
Also, I prefer we don't remove the mssql codes but instead let users know by warning that this is no longer supported if they are using mssql. I understand @potiuk concerns about removing it this way but I feel this approach is harsh. We can start by warning users and then total removal. My opinion though.

@jscheffl
Copy link
Contributor Author

jscheffl commented Dec 7, 2023

I'm not confortable having this in 2.8.0. Let's leave it for 2.9.0 because the change is big and close to the release time. Also, I prefer we don't remove the mssql codes but instead let users know by warning that this is no longer supported if they are using mssql. I understand @potiuk concerns about removing it this way but I feel this approach is harsh. We can start by warning users and then total removal. My opinion though.

I understand the change is large, this is a reason. Raised this a bit late, unfortunately review was stale for a number of days :-(

Anyway, admins should have been warned already, there was a post in the docs already, see: https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html#setting-up-a-mssql-database

Anyway... so let put this PR on pause and merge it for 2.9.0...

@potiuk
Copy link
Member

potiuk commented Dec 7, 2023

Anyway... so let put this PR on pause and merge it for 2.9.0...

Well. I think we have all the building-blocks in place. Yes it might meen few people doing some out-of-hours stuff - but for a good rason. I think if we miss the window of opportunity with announcements and "potentially breaking" changes, then we won't be abe to do much (due to cherry-pickability of some fixes). We have the script to migrate, we can put it separately (few minutes tomorrow mid-day) - and we are done. IMHO - the risk is small.

@potiuk
Copy link
Member

potiuk commented Dec 29, 2023

Hey @jscheffl -> do you need my help / actions for that one? I am happy to create the separate repo, if needed and we could already arrange the PR in the way to be "ready to marge" when we start thinking about 2.9.0 - we want to get it all ready and prepared quite some time before 2.9.0 so that we do not miss another window of opportunity for MSSQL removal :).

@jscheffl
Copy link
Contributor Author

@potiuk Yes, before re-basing another 10 times, I'd like to get this off the plate and merged. So your advise and prep help might help:

  • I assume all feedback consistently points that the migration tool shall not be in the Airflow mono repo. If there would be any kind of other GIT repo that you can create or point to to bring Add a migration script tooling support to migrate off mssql #35861 to then this would be a good pre-requisite. This PR needs to point to migration tool in docs. I do not have a strong opinion where to store the migration tool but woul dbe better also not pointing to my personal account.
  • This PR is quite large, the debate around migration script clean is not settled. I would be OK to revert. Maybe separating into at least two PRs might make discussion easier. (PR1: Removal of support form core + breeze / PR2: Clean up of leftovers in migration tooling)
  • Development advise in regards: Would it be OK to merge this to main "soon" or for better patch maintenance of 2.8-line do we need to keep this removal as a PR open for longer? Otherwise nobody really made a review, and I don't want all the efforts being dropped/wasted as actually I had positive intention as a follow-up of the vote to clean-up MSSQL but missing support for review here.

@potiuk
Copy link
Member

potiuk commented Dec 29, 2023

@potiuk Yes, before re-basing another 10 times, I'd like to get this off the plate and merged. So your advise and prep help might help:

  • I assume all feedback consistently points that the migration tool shall not be in the Airflow mono repo. If there would be any kind of other GIT repo that you can create or point to to bring Add a migration script tooling support to migrate off mssql #35861 to then this would be a good pre-requisite. This PR needs to point to migration tool in docs. I do not have a strong opinion where to store the migration tool but woul dbe better also not pointing to my personal account.

I created https://github.com/apache/airflow-mssql-migration . In about 5 minutes permissions should be set there and you should be able to initiate it an push the migration tooling there (even directly without PR/Approval). We can setup branch protection later when it settles down, for now it's ok to just push the changes there (but if you prefer to open PRs it's also fine - happy to review them).

  • This PR is quite large, the debate around migration script clean is not settled. I would be OK to revert. Maybe separating into at least two PRs might make discussion easier. (PR1: Removal of support form core + breeze / PR2: Clean up of leftovers in migration tooling)

Yeah. I would rather do it in reverse:

  • Create migration tooling outside (it does not have to be final version - just enough to be "working" - we can iterate and improve it later

  • Add link to it for people who want to migrate in PR 1 (we will keep updating it when we know exactly which version we should migrate to and maybe get some clarification about migration process

  • Only after we have this set up as base, we should have PR 2 with removing the support. There is a bit of problem potentially for cherry-picking, but I think the probability of difficult change that we would like to cherry-pick that will break MSSQL is very low and we will still have mssql tests running in v2-8-test so as long as @ephraimbuddy is ok with it and we commit to quickly fixing those kind of issues (or not-cherry-picking them in the first place) we can merge it even now.

  • Development advise in regards: Would it be OK to merge this to main "soon" or for better patch maintenance of 2.8-line do we need to keep this removal as a PR open for longer? Otherwise nobody really made a review, and I don't want all the efforts being dropped/wasted as actually I had positive intention as a follow-up of the vote to clean-up MSSQL but missing support for review here.

Yeah. See above, I think with just dropping MSSQL the chance of breaking cherry-picking is rather low (and in that case we would likely not cherry-pick it anyway - as I see it only happening for new features). But I'd love to hear what others say as well (especially @ephraimbuddy ).

@jscheffl
Copy link
Contributor Author

Closing this PR as the attempt is now super-seeded with PRs:

Leaving the cleanup of migration scripts "open for now"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:CLI area:dev-env CI, pre-commit, pylint and other changes that do not change the behavior of the final code area:dev-tools area:production-image Production image improvements and fixes area:Scheduler Scheduler or dag parsing Issues area:serialization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants