
Have destination schema update with source schema update #1967

Closed
vchallier opened this issue Feb 5, 2021 · 13 comments
Labels
team/compose type/enhancement New feature or request

Comments

vchallier commented Feb 5, 2021

Tell us about the problem you're trying to solve

We're planning a migration from an Azure MySQL instance to BigQuery. There are about 50 tables in the source datasets. Our client is a growing company and will regularly add fields to the source tables. I understand Airbyte does not update the target table when the source table's schema changes.

Describe the solution you’d like

I would like Airbyte to handle changes in the source tables and update the target tables accordingly. Ideally this could be done automatically. Another solution, closer to what we have today, would be: when running a load step (with Airflow, for example), Airbyte would return a specific response when the source schema has changed. This leaves the user with the option to trigger a full reload through an API call upon receiving such a response.

Describe the alternative you’ve considered or used

Fivetran handles these changes. As an alternative, we designed a custom setup that runs a full reload whenever there is a change in the schema. This is a "brute force" solution; we know there are smarter ways to do this, such as being able to understand what the changes actually are.

Additional context

No other context; really looking forward to seeing where Airbyte will go in the upcoming months!


vchallier added the type/enhancement (New feature or request) label on Feb 5, 2021
cgardens (Contributor) commented Feb 8, 2021

Thanks for creating this issue. This is definitely on our roadmap; we will update you once we have a better sense of when we will tackle it.

Ideally this could be done automatically.

I would be curious to get your opinion on which behavior you would prefer. Say Airbyte detects that a column that used to have the type int now has the type string. Would you prefer that:

  1. Airbyte automatically changes the column in the destination to be of type string. (I believe this is Fivetran's behavior: rename the old column, create a new column with the "wider" type, and copy the contents of the old column into the new one.)
  2. Airbyte blocks further replication for that table or source until you confirm that the schema changes are appropriate. Once you confirm, the schema migration is handled entirely by Airbyte (same behavior as option 1).

One of my concerns with 100% automatic schema changes is that if they are ever made in error, they can be hard to fix. For example, if someone is trying to push data of the wrong type into a column, that sounds like it could be a mistake to me. Would you prefer the purely automatic approach (option 1), or having the sanity check that the schema migration is what you expect (option 2)?
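
For readers skimming this later, here is a minimal sketch of the "wider type" migration described in option 1. It is not Airbyte's actual implementation; the table/column names, the DB-API connection, and the exact SQL dialect are assumptions and would differ per destination warehouse:

```python
# Hypothetical sketch of the option-1 migration (not Airbyte's real code):
# rename the old column, add a new column with the wider type, then copy the
# old values across with a cast. `conn` is any DB-API 2.0 connection.
def widen_column(conn, table: str, column: str, new_type: str) -> None:
    old = f"{column}__old"
    cur = conn.cursor()
    # 1. Keep the original data under a temporary name.
    cur.execute(f"ALTER TABLE {table} RENAME COLUMN {column} TO {old}")
    # 2. Re-create the column with the wider type (e.g. INT -> STRING).
    cur.execute(f"ALTER TABLE {table} ADD COLUMN {column} {new_type}")
    # 3. Copy the old values, casting them into the new type.
    cur.execute(f"UPDATE {table} SET {column} = CAST({old} AS {new_type})")
    conn.commit()
```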

@vchallier (Author)

Thanks for the answer!

Having these sanity checks would be great. I believe every column type change / new column should be validated by the ingestion owner. As you said, option 1 sounds like option 2 with every change automatically validated.

The ability to summarise the changes in a concise way would be great; it could look like a list:

  • [New Column] column1, type INT
  • [New Column] column2, type STRING
  • [Deleted Column] column3
  • [Modify Column] column4, type INT to type FLOAT
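
As an illustration, here is a minimal sketch of how such a summary could be computed from two schema snapshots, where each snapshot is a simple column-name-to-type mapping; the function and dictionaries below are hypothetical, not an Airbyte API:

```python
# Hypothetical sketch: diff two schema snapshots (column name -> type)
# and render the kind of change summary listed above.
def summarize_schema_changes(old: dict, new: dict) -> list:
    changes = []
    for col, col_type in new.items():
        if col not in old:
            changes.append(f"[New Column] {col}, type {col_type}")
        elif old[col] != col_type:
            changes.append(f"[Modify Column] {col}, type {old[col]} to type {col_type}")
    for col in old:
        if col not in new:
            changes.append(f"[Deleted Column] {col}")
    return changes

# Example:
# summarize_schema_changes(
#     {"column3": "STRING", "column4": "INT"},
#     {"column1": "INT", "column2": "STRING", "column4": "FLOAT"},
# )
# -> ["[New Column] column1, type INT", "[New Column] column2, type STRING",
#     "[Modify Column] column4, type INT to type FLOAT", "[Deleted Column] column3"]
```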

Let me know if you want to develop the reasoning further!

@cgardens (Contributor)

Thanks, this answer is super helpful. We will be in touch as we start work on this and if we want to run anything else by you.

@aizenshtat

I would also be super interested in this solution. It is very critical for us, as our internally developed software changes very frequently.

@msardana94

Very excited for this feature! This will definitely make our lives easier. I have a few comments/questions in addition to what has already been mentioned:

  1. I think it would be helpful to have a configurable option to simply ignore schema changes (as Airbyte does now).
  2. The other thing to note for option 2 is that it can be problematic for teams where internal app backend updates are made without prior notice to the data engineering team. What happens if the DE team doesn't act on the schema change notification right away?

In general I agree that a 100% automated solution may have unintended consequences.

@blake-enyart

@cgardens, is there a public roadmap somewhere that outlines when this functionality might be prioritized?

@naphattheerawat

As April 2022 is fast approaching, could I get an update on this thread about the roadmap for this feature?
Do we have a clear timeline or ETA yet for when it will be prioritized or released?
Will it support PostgreSQL and BigQuery as well? 🥺

grishick added the team/destinations (Destinations team's backlog) label on Sep 27, 2022
@vaishali-cm

Any news on this feature?

@florian-ernst-alan

It's surprising to see Airbyte so late on this feature while openly comparing itself to all alternatives as the best tool on the market.

In a distributed team dealing with large volumes of data (with engineering and data teams working separately), you can't expect us to manually refresh the changed tables:

  • Some changes might not be detected
  • Some tables are too big to be reimported efficiently

At the very least, Airbyte should add new columns when new fields are added - there's no debate between automatic and semi-automatic here.

For column data type changes or columns being dropped, I agree we could debate about the best implementation, but I would at least expect an alert.

Seeing this issue open for 2+ years while being "definitely on the roadmap" is worrying and doesn't convince us to move to Airbyte, despite it seemingly being a great product otherwise.

@mvgijssel

I’m contemplating moving our Postgres-to-Snowflake pipeline from Meltano to Airbyte, but this is a deal breaker, as the source schema changes almost daily (new columns and tables are added). Are there alternatives, or ways to prioritize this?

bleonard added the frozen (Not being actively worked on) label on Mar 22, 2024
@nataliekwong (Contributor)

This has been a supported feature for a few quarters now, but we did not come back and update this issue.

You can read more about it in our blog post, find more details in our docs, or enable it via our API.
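
For anyone finding this thread later, here is a hedged sketch of enabling schema propagation per connection through the API; the endpoint path, the `nonBreakingSchemaUpdatesBehavior` field, and its value are assumptions on my part, so please check the current API reference before using them:

```python
import requests

# Hypothetical sketch (verify against the current Airbyte API reference):
# update a connection so that non-breaking source schema changes are
# propagated automatically to the destination.
AIRBYTE_API = "https://api.airbyte.com/v1"   # assumed base URL
CONNECTION_ID = "<your-connection-id>"
API_TOKEN = "<your-api-token>"

resp = requests.patch(
    f"{AIRBYTE_API}/connections/{CONNECTION_ID}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"nonBreakingSchemaUpdatesBehavior": "propagate_fully"},  # assumed field/value
)
resp.raise_for_status()
print(resp.json())
```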

I'll close this issue now, but if you have any further questions, feel free to tag me and ask! For more about what we are currently working on, you can visit our public roadmap here: https://go.airbyte.com/roadmap.

@mvgijssel

@nataliekwong thanks for your comment! I guess I'm a bit confused about the documentation for the Postgres source (https://docs.airbyte.com/integrations/sources/postgres/postgres-troubleshooting#cdc-requirements), which still states that schema changes are not propagated.

Are schema changes not supported for this particular extractor? Or is the documentation simply out-of-date? 🤔

@nataliekwong (Contributor)

Hey @mvgijssel, good catch! It looks like the docs are not up to date; thanks for helping to identify that! I'll look internally for help updating the docs.

Schema propagation with Postgres CDC is supported.
