Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider splitting Google Provider #15933

Open
potiuk opened this issue May 19, 2021 · 28 comments
Open

Consider splitting Google Provider #15933

potiuk opened this issue May 19, 2021 · 28 comments
Labels
kind:feature Feature Requests provider:google Google (including GCP) related issues

Comments

@potiuk
Copy link
Member

potiuk commented May 19, 2021

The Google provider is huge. It's by far the biggest provider out there and it has a number of sometimes more closely coupled and sometimes not coupled at all (leveldb) operators.

We should consider splitting the Google operator into smaller pieces, but that would involve some more decisions:

  • what to do with common part
  • what should be the split ? Per service? Per cloud ?
  • how to manage dependencies between different provider
  • how to manage transition and backwards compatibility

This is more of opening the discussion than making decisions yet.

@potiuk potiuk added the kind:feature Feature Requests label May 19, 2021
@potiuk
Copy link
Member Author

potiuk commented May 19, 2021

@anitakar
Copy link
Contributor

anitakar commented May 19, 2021 via email

@anitakar
Copy link
Contributor

anitakar commented May 19, 2021 via email

@dstandish
Copy link
Contributor

dstandish commented May 20, 2021

This came up in part because leveldb hooks depend on plyvel library which requires that the machine has leveldb installed.

So, prior to #15770, on mac if you want to install google provider (e.g. because you want to use some GCP hook) then you also have to install leveldb (which has nothing to do with GCP), and if you do not have leveldb installed, then the google provider install will fail.

To resolve this, what we ended up doing was making plyvel an optional dependency of google provider (#15770) so you have to install ...google[leveldb] to get plyvel (and this will only work if your system has leveldb installed). This also required that we make plyvel a "core" dependency of airflow, since CI requires it in order to fully import the google provider.

(Come to think of it perhaps CI could be amended to do ...provider-x[ALL] and then we would not have to add it as a core dep...)

Rather than making leveldb a google provider extra, I think a cleaner solution would be if leveldb was made a distinct provider, either google-leveldb or just leveldb. In the case of leveldb, there's no interaction with GCP hooks -- AFAIK it's completely independent. And this way plyvel would not need to be an extra, and would not need to be a airflow core dep.

The issue with leveldb was my main concern because being forced to brew install this system dependency is a bad user experience. And we have fixed the issue in a way, though it can be debated what the best solution is (i.e. sep provider vs extras). I'm not aware if there are other similar problems within the google provider, or other reasons that there may be to separate certain components of the provider

@turbaszek
Copy link
Member

I think a cleaner solution would be if leveldb was made a distinct provider

+1 to that, I think we had similar discussion about salesforce, tableau and slack (Salesforce have bought Tableau and Slack):
https://lists.apache.org/thread.html/rb1828485df78df7649000ff586a749ab685635f606b394386e1be808%40%3Cdev.airflow.apache.org%3E

Also with companies like google it would be nice if they would be able to take care of the providers (even if that would mean extracting them from core). In that way they would be able to split it in whatever seems best.

@turbaszek turbaszek added the provider:google Google (including GCP) related issues label May 21, 2021
@kaxil
Copy link
Member

kaxil commented May 21, 2021

Agree with @turbaszek -- The best judge of the split is the company itself -- so according to me that would best. However, if Google does not want to do that, we should separate leveldb to a distinct provider like @dstandish suggested

@anitakar
Copy link
Contributor

anitakar commented May 21, 2021 via email

@eladkal
Copy link
Contributor

eladkal commented May 22, 2021

+1 to that, I think we had similar discussion about salesforce, tableau and slack (Salesforce have bought Tableau and Slack)

Thats true but there may be a different case here as leveldb is under Google github account. So the users of leveldb are very aware that this is a google service unlike Tableau/Slack & Salesforce relationship.

@potiuk
Copy link
Member Author

potiuk commented May 22, 2021

Yeah. I see the case @eladkal. But I kind of see how Plyvel is very different from all other Google "core" components. Which cloud/ads/marketing_platform might be really "core" google business, where Plyvel is more of a "side" thing and could be fully separated out. I think it is much easier to separate out plyvel as a provider than splitting google "core" one.

I am quite torn on this one. I see the benefits of keeping all those as single "provider" - on the other hand the benefits of splitting it are "potentially good" but it makes them also difficult to work together when installed with all the dependencies - it will be easy for example to run in a situation where "cloud" dependencies are different (and conflicting) with "ads" dependencies. I think until we figure out (possibly) some way of separating the dependencies out between tasks, this might be a difficult one to tackle and rather than "origin" (google), the "independence" of a package might be more important.

I think the "extras" approach where plyvel is "optional" google extra dependency is maybe not perfect, but it kind of join both worlds. On one hand we have them together by "ownership", on the other hand we have just a problem of "environment" set properly for plyvel to work (so plyvel db libs installed). I think the current approach is kinda sustainable and does not introduce too much of complexity.

As usually with "compromise" solutions, it's not perfect but maybe it's "good enough"? As long as it works, it could be "OK". I am also quite OK to separate just plyvel as @anitakar suggested.

@anitakar
Copy link
Contributor

anitakar commented May 22, 2021 via email

@potiuk
Copy link
Member Author

potiuk commented May 24, 2021

I am a little confused though. I see plyvel on the list of requirements

Yeah. It will be released in the next wave of providers. And I also think the [plyvel] extra is a good option especially that this is really just a dependency. The plyvel provider is still there it will just fail with missing import if it's not installed with extra (same as a number of other cases, where we have cross-provider dependency.

Or maybe, they are happy that somebody has fixed
those dependencies so that they match each other?

Yeah. I am a bit torn myself. I see a value in having the "google" provider to get "everything-google". It it 'biggish" but on the other hand, it's also a kind of "good" to promote the whole platform (no need to install anything else to use "another" Google operator - it's the same with AWS (though there are far less number of of operators) and a bit different in Microsoft as there are really different "domains" there (winrm, mssql, azure).

Actually I like the idea of the poll from @anitakar ! Just a thought... Maybe we can use Airflow Summit to run a number of mini-pols where we could prepare a list of such questions and make people more engaged? That could be one of a number of questions that we could ask. I will raise it to Airflow Summit organizers :)

@anitakar
Copy link
Contributor

+1 to the polls on Airflow Summit.

@kurtqq
Copy link
Contributor

kurtqq commented Jul 30, 2021

I think the google provider became more trouble than worth.
It installs so many unneeded packages. I think it must be divided

@uranusjr
Copy link
Member

Did the poll actually happen? (I wasn’t following closely and may have missed it.)

@potiuk
Copy link
Member Author

potiuk commented Jul 30, 2021 via email

@kurtqq
Copy link
Contributor

kurtqq commented Sep 20, 2021

Kurtqq - would you be willing to help with that if we start the effort?

yes I might be able to help but what exactly are the tasks?

@kurtqq
Copy link
Contributor

kurtqq commented Nov 15, 2021

hi @potiuk @dstandish is this planned?
i can help with some of the semi automated work mentioned.

@potiuk
Copy link
Member Author

potiuk commented Nov 15, 2021

No - not planned maybe you would like to take a lead on it and itntroduce it simply (though it will likely mean quite some thinking on how to do backwards compatibility).

@uranusjr
Copy link
Member

To be more specific, backward compatibility should be conceptually simple—we just need to keep publishing an aggregated Google provider that depends on all the Google providers—but building an infrastructure to build this aggregated provider can be more challenging.

@potiuk
Copy link
Member Author

potiuk commented Nov 16, 2021

In general yes, but there are a few watch outs there - first of all most Google Providers depends on the common code - Base Hooks, common decorators, utils. And this means that there should also be a a 'google-common' package most likely and then there is a complexity of version dependencies between all those. When you take into account all the scenarios With various versions o those and potential cross dependencies between the split providers, this is its own small dependency-hell. The more I think of it, the more i see this one will introduce far more problems than it solves.

@eladkal
Copy link
Contributor

eladkal commented Nov 16, 2021

@potiuk I would say that common should not change that often and if it does it probably won't add additional dependencies- it's mostly code that you already need for the service it's just being used by another service so it's in a higher hierarchy. Why can't the provider always work with the latest common version ?

@potiuk
Copy link
Member Author

potiuk commented Nov 16, 2021

I think should is the key problem here. The initial split is quite easy, but what would happen after few releases is scary like hell. "Keep backwards compatibiliity of commons" in this case is quite a bit of wishful thinking and tt's far easier said than done and it requires constant vigiliance and fixes of accidental incompatibilities that will eventually creep-in leading to uncontrollable growth in complexity.

The problem is when people start refactoring and add code that will break the compatibility accidentally or when you want to do a refactor that will improve the common code but introduces compatibilty isses..

We already saw examples of that with db_api: see the comment here: https://github.com/apache/airflow/blob/main/airflow/hooks/dbapi.py#L45
This is just one class with few methods and I personally recall at least 3 cases where there were almost merged (or even merged) changes that would break the compatibility of existing, released providers (in all compatible versions) accidentally:
Splitting Google provider is the same but order or magnitude worse potentially as there is much more common code than that one class.

People making changes to such common code (and often even reviewwers/maintainers) might not realise the consequences of their changes on already released packages. Some changes will accidentally break compatibilities even if you are careful. It basically requires that all the relased google-providers are FUTURE compatible with alll the released version of the "common" package.

Unless you have full test suite that can handle various cases, and make sure that the "common package" will work with ALL already released and compatible providers that people have, thiere is no way to "make sure" it's the case. And even if we can add such a test suite (which is possible to some extent just very costly on maintenance and running), this prevents you from doing more "bold" refactorings - which is generally very bad side-effect of such approach. I think ability to refector code is crucial to maintainability.

For example now, we have such a test suite for all providers - we make sure that they import without warnings on Airflow 2.1 in our CI. But this is not a full guarantee they will work with Airflow 2.1 - this test is just a "smoke" test - but it already caught at least 2 cases of seemingly "innocent" change that would make all such providers stop working on 2.1.

Of course you can also introduce "breaking" changes in the common code (and release 2.0 package), but then this inevitably leads to one of two things:

  1. you also release all dependent packages with "breaking" realease that is effectively equivalent to releasing a new "google provider" release today.

  2. you have to maintain compatibility in all the dependent packages (thy shoudl work with both 1.0 and 2.0 of the commons) which leads to messy code and will break eventually as you add more changes. It can only be maintained for short time and eventually it leads to 1) - you have to say at some point of time "provider google-x.N" only works with "commons-2.0" and above. Just maintaining those dependencies is a pain and you require dedicated people who would keep an eye on those dependencies.

The question here is what is more costly (and for whom):

a) complexity of maintenance of compatibilities between different versions of common packages and released "google" packages, with potential ability for the user to upgrade only some parts

or

b) complexity for the users who have to adapt to potentially more frequently handling breaking changes with new "full gogle provider" release

I tnink the a) one is something that will grow more and more complex for maintainers over time, where b) is kinda "stable" - it requires some regular effort from the users but in a long run it is esiear to handle by them.

Also a) has one very uncomfortable - for open-source project at least - property. I immediately imagine many issues opened by the users "i want to install google-ads-6 and google-gcs-3 because this and that and they do not work together because they require different "commons". Just conversations about that and explaining what can and cannot be done will take a good chunk of time for maintainers who will know how it works. Explaining that in docs will be next to impossible I am afraid. Right now we avoid all those conversations by releasing a single google provider. The conversation is very simple: "if you want to use google ads which were added in provider 5.0 you need to also adapt all other google properties to the version that is there". End of story. By having multiple providers/versions, the amount of user stories here grows exponentially large.

Also I think in a long term you will not avoid the "breaking all" releases anyway. Users will have to do it anyway - only it will cost them more because they will have to do it much less frequently (counter-intuitively). There is an old saying that if migration is painful - just do it more often. The "split providers" approach leads to potentially less frequent, but more painful upgrades to our users + adds effort on maintenance of it on committers. The "single provider" means that you potentially must do more frequent but less paindul upgrades (say every month when you upgrade google provider). But there is no solution with "no pain".

I am not saying all this is impossible - technically splitting the google provider is possible. I just think that the person (or rather grouop) who commits doing it realises the consequences and takes the burden of organising it in the way that we do not land in "dependency" hell. I personally currently have not enough courage to commit to it to be honest.

@potiuk
Copy link
Member Author

potiuk commented Nov 16, 2021

And just one more comment here:

It also depends on the "Extent" of the split of course. If we are talking for example about extracting the "plyvel" out of Google provider - by all means let's do it. Adding Plyvel to google provider was simply a mistake - it does not use any common code and conceptually it is very different than other integrations.
Possibly same could be done to extract "ads" - they also use very little (or no) common code.

But for all the "cloud" and "marketing_platform" I'd say there are far too many things in common that separating them out is worth the cost.

@eladkal
Copy link
Contributor

eladkal commented Nov 16, 2021

To my perspective it's not the same case as compatibility issue with Airflow core (like the dbApi hook)
Even now we have providers that has dependencies between each other - Postgres and AWS (for redshift), any transfer operator between two providers etc.. The problem that you are worried about indeed needs attention but I'm happy with providing the answer "keep it aligned and up to date".

I checked the google common code:
https://github.com/apache/airflow/commits/main/airflow/providers/google/common
In the past year it had only 3 changes but yeah - I get the point that it's risky.

I'm very OK with the alternative suggested. For me the goal is just to reduce the amount of services that installed directly with the google provider. The more services we can set as optional the better. After we do that we can revisit further splitting if needed.

So the action items to resolve this issue are:

  • converting leveldb to be a google provider extra dependency.
  • converting ads to be a google provider extra dependency.

@potiuk
Copy link
Member Author

potiuk commented Nov 16, 2021

I hope I am not exaggerating, but my experience tells me different story (but maybe, of course when handled differently it can be made simple). It's just I've seen what happened with provider split and how much time after the split it took to sort out all the various cases/incompatibilities and cross-version problems (and build processes around maintaining those).

But if there is someone who would like to take the lead on the split and work out the protections and make sure the cost of maintenance is low - I will certainly not be opposing it, I am super happy to help!. And happy to answer any questions and help to build and test all the infrastructure.

It will not be "split and done". My point here is that this is a long-haul and requires to work out the process of maintenance, managing to respond to user's questions, document it properly, make frequent decisions on whether to accept and release changes that are "possibly" incompatible and how to version those separate providers and core.

Raealistically speaking - once you split you should expect half a year of monthly releases and get various user stories in that will be addressed, then you will be past the initiaal problems, and within the year or two you might expect breaking changes in Google API that wil require some backwards-incompatible changes of all such providers.

And it needs committment to lead not only the split but also what happens after, including communication with the users, sorting out potential incompatibilities, and handling/reviewing multiple PRs with care and focus on compatibility and potentiallybreaking changes.

That's it. For me it's scary to make such committment, but maybe someone would like to take that on?

@potiuk
Copy link
Member Author

potiuk commented Nov 16, 2021

Even now we have providers that has dependencies between each other - Postgres and AWS (for redshift), any transfer operator between two providers etc.. The problem that you are worried about indeed needs attention but I'm happy with providing the answer "keep it aligned and up to date".

Just to comment on that part - those cross-provider dependencies are (I believe) way simpler. They use very stable (by definition) APIs that are shared between providers now (Hooks). That's about it. And we already had 1 case where there was a breaking change which required to document it in the changelog, add cross-provider-versioned-dependency with quite some communication overhead.

What we are talking about in Google is a deep integration with a number of common functionalities revolving mostly about authentication but also some "common" functionality that is shared and expectation is (and current state) that if we implement it for one service, it is implemented for all of them at once. This is far closer to the DBAPI issues than cross-provider dependencies IMHO.

Just to give a realistic example and where it would create problems. An example of "breaking" change I am talking here was adding "impersonation chain" to GoogleBaseHook and a number of "services" #9915.

This was a functionality added acrross pretty much all google services (it was done before we released Airflow 2 and split providers) and if we did not have all those changes in one release, the risk of introducing problems when implementing it in fully backwards compatible way in the common part was high.

You'd either:

  1. make all providers depend on new version of 'commons-2.0' (which is equivalent to releasing the "google provider" now.

or

  1. in most "split" providers you'd have to handle the case where you have "old commons 1.0" without impersonation and "new commons 2.0" with it (and test it with both versions for all providers that use them!)

Case 2) is an absolute nightmare from the code/maintenance point of view. You cannot rely on having new "common" code (it has to work with already released commons-1.0). So basically what you have to do is to properly handle the "if impersonation available" code in all the providers in similar way. The terrible part of it is tha you have to copy the if-ed code to all split providers that use impersonation - because you can only rely on the common code that is out there in commons 1.0.

There are yet a number of authentication features we do not have and at some point we might want to (or need to) add them.

Surely - we could make it a "breaking" change and release all providers only depending on the new commons then. Those changes are not "often" but IMHO those are super painful. This is because in this case our users completely loose the benefits of using "separately released" providers.

Users who will be used to just upgrading one "split" provider at a time will complain with exactly this thing: "I am using gcs 2.0" but in order to use this new feature from ads 4.0 I have to upgrade to gcs 5.0 because I have to upgrade "commons" to 2.0 and it is not compatible with my version. I still want to continue using gcs 2.0 and get the feature from ads 4.0 - how can I do it?" .

IMHO it's much better to have our users get "used" to upgrading all the google stuff together rather than risk this kind of problems.

But once again - if someone woudl like to take a lead on that, I am happy to help to set it up :).

@rodrigo-fss
Copy link

rodrigo-fss commented May 25, 2022

Hello guys
any updates? I'm still struggling with the leveldb requirement to install the google provider and I can see that it wasn't split apart yet. Do you have any recommendations?

plyvel/_plyvel.cpp:703:10: fatal error: leveldb/db.h: No such file or directory
#14 3.011         703 | #include "leveldb/db.h"

Airflow 2.0.2
Python 3.9.6

Thanks!

@potiuk
Copy link
Member Author

potiuk commented May 26, 2022

@rodrigo-fss -> Plyvel is not a requirement for Google Provider any more - it's optional dependency - you do not have to install it. But in any case you need to upgrade to 2.1+ as google providers are already not compatible with 2.0.* line so you are using a really old version of those and even if we split providers, you will have to upgrade anyway to use that.

So upgreding to Airflow 2.1+ (Ideallly 2.3+) is the RIGHT solution to your problem, not waiting for splitting the provider.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:feature Feature Requests provider:google Google (including GCP) related issues
Projects
None yet
Development

No branches or pull requests

9 participants