Database Sources (Postgres) implements Progress Bar protocol - AirbyteEstimateTraceMessage #19199

Closed
bleonard opened this issue Nov 9, 2022 · 4 comments · Fixed by #20783
@bleonard
Contributor

bleonard commented Nov 9, 2022

Implement the AirbyteEstimateTraceMessage messages described in #18875

Here is an example PR for another source: #19197

The goal is to emit the predicted total number of rows/bytes in the sync (per stream).
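
As a rough illustration of what a per-stream estimate might look like on the wire, here is a minimal sketch that builds the message as a plain dict. The field names (`row_estimate`, `byte_estimate`, the `ESTIMATE` trace type, the `STREAM` estimate type) reflect my reading of the proposal in #18875 and should be treated as assumptions; `estimate_trace_message` is a hypothetical helper, not part of any Airbyte library.

```python
import time
from typing import Optional


def estimate_trace_message(
    stream_name: str,
    row_estimate: int,
    namespace: Optional[str] = None,
    byte_estimate: Optional[int] = None,
    emitted_at_ms: Optional[float] = None,
) -> dict:
    """Build one per-stream estimate as a plain dict (field names are assumptions)."""
    estimate = {"name": stream_name, "type": "STREAM", "row_estimate": row_estimate}
    # Optional fields are left out entirely when the source cannot provide them.
    if namespace is not None:
        estimate["namespace"] = namespace
    if byte_estimate is not None:
        estimate["byte_estimate"] = byte_estimate
    return {
        "type": "TRACE",
        "trace": {
            "type": "ESTIMATE",
            # Milliseconds since the epoch; overridable so callers can pin it.
            "emitted_at": emitted_at_ms if emitted_at_ms is not None else time.time() * 1000,
            "estimate": estimate,
        },
    }
```

A source would serialize one such dict per stream (for example with `json.dumps`) and write it to stdout alongside its record messages.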

@evantahler evantahler changed the title Database Sources (Postgres) implements Progress Bar protocol Database Sources (Postgres) implements Progress Bar protocol - AirbyteEstimateTraceMessage Nov 9, 2022
@bleonard
Contributor Author

We can't do bytes, but we can do rows for Postgres Standard.

So in the AbstractJdbcSource, we'll run SELECT COUNT(*) FROM table WHERE cursor > value (dropping the WHERE clause if the sync is not incremental or there is no cursor) and output the result as the row_count.
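
The query selection described above could be sketched roughly as follows. This is an illustration only, not the AbstractJdbcSource implementation (which is Java): `build_count_query` and `quote_ident` are hypothetical helpers, and the `%s` placeholder assumes a driver that uses that parameter style.

```python
from typing import Optional, Tuple


def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, escaping any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_count_query(
    table: str,
    cursor_field: Optional[str] = None,
    cursor_value: Optional[object] = None,
) -> Tuple[str, tuple]:
    """Return (sql, params) for the row-count estimate of one stream."""
    sql = f"SELECT COUNT(*) FROM {quote_ident(table)}"
    # Only filter when the sync is incremental AND a cursor value has been saved.
    if cursor_field is not None and cursor_value is not None:
        return f"{sql} WHERE {quote_ident(cursor_field)} > %s", (cursor_value,)
    # Full refresh, or first incremental run: count the whole table.
    return sql, ()
```

Passing the cursor value as a bound parameter rather than interpolating it keeps the sketch safe for arbitrary cursor values.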

It's not clear to me at the moment, @evantahler, whether we do this for all streams at the beginning or for each stream as we start to work on it. Let's assume at the beginning, so the estimate can cover the full length of the whole sync.

To support CDC, we can talk about updating the protocol to take a percentage value.

@evantahler
Contributor

Why can't you get an estimate of the size of a Postgres table?

  • There are fancy ways to compute the size of the rows in question directly
  • Or, you could get a count of all rows in the table (say, 10,000), ask Postgres for the fast table size (select pg_relation_size('table_name')), and divide to get an average row size... and then multiply that by the number of rows you are sending (considering the offset)
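
The second approach is plain arithmetic, sketched below. `estimate_bytes` is a hypothetical helper; its inputs are assumed to come from `pg_relation_size`, a full-table count, and the count of rows past the cursor.

```python
def estimate_bytes(table_size_bytes: int, total_rows: int, rows_to_sync: int) -> int:
    """Estimate bytes to sync as (average row size) * (rows being sent)."""
    if total_rows <= 0:
        # An empty table has no meaningful average row size.
        return 0
    avg_row_bytes = table_size_bytes / total_rows
    return int(round(avg_row_bytes * rows_to_sync))
```

For example, a 1,000,000-byte table with 10,000 rows averages 100 bytes per row, so sending 2,500 of those rows gives an estimate of 250,000 bytes. The on-disk size includes page overhead and dead tuples, so this is only an approximation of the bytes actually emitted.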

Either way, the feature is implemented such that:

  • you don't have to send the estimates at the beginning - you can send them whenever
  • It's better to send the estimates at the beginning because we won't display the progress bar or time-remaining estimate until we get an estimate for every stream that's being synced.
    • The estimates are based on rows remaining, rather than bytes remaining
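
To make the gating rule concrete, here is a small sketch of how a consumer might decide whether it can show progress. It is not the Airbyte platform's code; `sync_progress` and its arguments are hypothetical, and treating an all-zero estimate as "no progress to show" is my own assumption.

```python
from typing import Dict, Iterable, Optional


def sync_progress(
    streams: Iterable[str],
    row_estimates: Dict[str, int],
    rows_emitted: Dict[str, int],
) -> Optional[float]:
    """Return progress in [0, 1], or None if the bar cannot be shown yet."""
    streams = list(streams)
    # No bar until every stream in the sync has reported an estimate.
    if any(s not in row_estimates for s in streams):
        return None
    total = sum(row_estimates[s] for s in streams)
    if total == 0:
        return None
    # Progress is row-based; cap each stream so a low estimate cannot exceed 100%.
    done = sum(min(rows_emitted.get(s, 0), row_estimates[s]) for s in streams)
    return done / total
```

Under this sketch, a stream that reports an estimate of 0 rows still counts as "estimated", so it does not block the bar for the other streams.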

👍 on future protocol updates

@bleonard
Contributor Author

Cool. Maybe we can send bytes then. We'll look into it, but it does look like we want to do it per stream at the beginning. I'm curious what happens if it turns out we don't have access to a random stream. Let's say we want to then do that one last. Do we still have to send some number of rows (0) to make it show the progress bar? @evantahler

@evantahler
Contributor

evantahler commented Nov 16, 2022

Do we still have to send some number of rows (0) to make it show the progress bar?

I think so
