-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
airbyte-lib: Track streams in cache #34517
airbyte-lib: Track streams in cache #34517
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
…ib-streams-internal-table
This reverts commit 9145b25.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's looking great. Some feedback inline 👍
stream_name = Column(String, primary_key=True) | ||
source_name = Column(String) | ||
table_name = Column(String) | ||
catalog_metadata = Column(String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wdyt of using a composite three-column primary key here?
stream_name = Column(String, primary_key=True) | |
source_name = Column(String) | |
table_name = Column(String) | |
catalog_metadata = Column(String) | |
stream_name = Column(String, primary_key=True) | |
source_name = Column(String, primary_key=True) | |
table_name = Column(String, primary_key=True) | |
catalog_metadata = Column(String) |
The implications would be:
- Dupes are allowed for different sources using the same stream name.
- Dupes are allowed for the same stream name and same source name - iif table suffix and/or suffix are different.
- If a user configures the cache with different prefix/suffix settings, they'll basically be writing to a different namespace within the cache.
- Meaning, I could initialize my cache with "_test1" suffix, "_test2" suffix, etc., AirbyteLib would treat these as different namespaces with different tablesets.
- When we get to supporting multiple sources in the same cache, we might want a dynamic default table prefix like
'{source_name}_'
. But we can tackle that later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem I'm seeing with that is that the cache api is stream name only based.
For example, you have the following entries in the _airbytelib_streams table:
stream | source | table |
---|---|---|
my_stream | source1 | source1_my_stream |
my_stream | source1 | source1_prefixed_my_stream |
If you do cache["my_stream"]
, then what table are you getting? For this prefix case, we could decide based on the table name, by comparing the table name the cache would choose for a stream name with the table name stored in the _airbytelib_streams table. I'm going to implement this.
But what about this case:
stream | source | table |
---|---|---|
my_stream | source1 | source1_my_stream |
my_stream | source2 | source2_my_stream |
cache["my_stream"]
could be either of those, and there is no way to decide. Changing this to cache["source1_my_stream"]
raises lots of new questions - it just doesn't feel like the right approach to me, this edge case shouldn't make it harder for the "regular" case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, in summary, we should actually use the table name as primary key, because that's what needs to be unique for the current feature set.
For multi-source, I think it's OK to merge the streams and overwrite / raise in the collision case - after all, this is also what would happen in case of regular destinations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Yeah, I think this sounds like a great path forward. 👍
…ib-streams-internal-table
…ib-streams-internal-table
@aaronsteers I adjusted the PR as discussed above. This is the behavior now:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks for the comments and related updates. Feel free to merge when ready.
stream_name = Column(String) | ||
source_name = Column(String) | ||
table_name = Column(String, primary_key=True) | ||
catalog_metadata = Column(String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this approach to the primary key. 👍👍
stream_name = Column(String, primary_key=True) | ||
source_name = Column(String) | ||
table_name = Column(String) | ||
catalog_metadata = Column(String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Yeah, I think this sounds like a great path forward. 👍
This PR introduces an internal table to track which streams got synced to a cache.
This allows to read from a cache even if no source is available:
This can be used for situations where data is written to a cache in one script, reading from it in another.
Conceptually, the cache remembers what it stores in a persistent way not tied to python heap memory.
Behavior
The cache already kept a
source_catalog
around with all streams that was read from. This information is now saved to an internal table when changed and loaded from the table when the cache object is created. New streams are merged into the existing catalog by the stream name, replacing existing stream definitions in case of collisions.The internal table
_airbytelib_streams
is added to the same schema as all other tables.When accessing streams from the cache object, all stream names that are in the catalog and also have an existing final table are available.
When accessing streams for a
ReadResult
, only the streams that got synced in that read are available, not all of them.Implementation details
As it's available already, this PR is using sqlalchemy ORM capabilities to create the table in the cache and manage its data.
Unclear points
Table name vs. stream name
A lot of the logic in the cache class is hinging on the stream name. However, the mapping from stream name to table name depends on the configuration (e.g. via the configured prefix). This leaves an edge case that could lead to confusing behavior:
This could be "fixed" by leveraging the table name column in the internal streams table, but it makes the world more complex than justified IMHO.
Different sources with identical stream names
As the interface of the cache is built around the stream name, there's no straightforward way to support multiple sources with identical stream names. I think it's OK to not handle this situation, just wanted to bring it up.