Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AirbyteLib: Progress Printer #34588

Merged
merged 60 commits into from
Jan 30, 2024
Merged

AirbyteLib: Progress Printer #34588

merged 60 commits into from
Jan 30, 2024

Conversation

aaronsteers
Copy link
Collaborator

@aaronsteers aaronsteers commented Jan 28, 2024

Note:

A sync operation can take 5 minutes, 30 minutes, or multiple hours. This PR adds a self-updating progress-print when run from IPython.

As much as possible, I try to not impact performance.

Sample output below:

This is the runtime of extracting just the 'issues' stream for the repo 'airbytehq/airbyte', using the GitHub source:

image

What's nice about this is that I can presume this workload is probably throttled by GitHub's API rate limits, and I can see at a glance that the write to the SQL table was <2 minutes, while the extraction from source was 24 minutes. Running with 2 or 3 auth tokens to rotate (which I think this source supports) would make an interesting experiment and possible demo content.

It's also nice that this is updated while the sync is running, so I don't need to wonder if the code got stuck. If the user has an estimate of how many rows they expect, then they can also deduce a % progress - even though we can't programmatically detect that.

@octavia-squidington-iii octavia-squidington-iii removed the area/connectors Connector related issues label Jan 28, 2024
Copy link
Contributor

@flash1293 flash1293 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this! Left two suggestions, but they are incremental improvements, not blockers.

from rich.markdown import Markdown as RichMarkdown


try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like these small QOL checks, but could we also add an option for this like pinecone does?

https://github.com/pinecone-io/pinecone-python-client/blob/61c19ad5a6a70321db44189dd90e603703670869/pinecone/data/index.py#L154

Some environments don't handle this kind of using the console very well and it's nice to be able to mute it for a clean output in a more "production" setting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm definitely on board for this. The refactoring would be significant at this stage though, specifically because the progress messages are currently sent from a few different places in the codebase. While I'm not able to do so in this iteration, I definitely think this is worth doing and will probably get to it down the road.

# This is some math to make updates adaptive to the scale of records read.
# We want to update the display more often when the count is low, and less
# often when the count is high.
updated_period = min(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it would make sense to throttle these updates based on the elapsed time.

The current approach works well in a lot of scenarios, but gets weird for edge cases like:

  • Records are read really fast in the beginning, then get slow (e.g. because it's reading a second, slower stream) - it would mean it would update really rarely, so the user can't see the slow progress on the second stream well
  • Records are read glacially slow (like one per second). It takes a long time for stuff to show up in the first place

By just updating once a second or so, it's lively without any risk of overloading the terminal

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good feedback. It's not obvious here, but there's also a time-based throttle inside update_display(). In this layer I'm trying to keep as fast as possible performance, so just the math-based check. Probably we could add a time-elapsed check, but this codepath may be called literally thousands of times per second, so I'm wary to change this further right now. Happy to keep iterating after merging though.

Base automatically changed from aj/airbyte-lib/install-failure-handling to master January 30, 2024 06:07
@aaronsteers aaronsteers enabled auto-merge (squash) January 30, 2024 06:10
@aaronsteers aaronsteers merged commit f35c2a6 into master Jan 30, 2024
18 checks passed
@aaronsteers aaronsteers deleted the aj/airbyte-lib/progress-print branch January 30, 2024 06:39
clnoll pushed a commit that referenced this pull request Jan 30, 2024
jbfbell pushed a commit that referenced this pull request Feb 1, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 21, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants