Add Primary Key to CDK. #3105
bunch of suggestions and one question
@@ -64,6 +64,14 @@ def read_records(
         This method should be overridden by subclasses to read records based on the inputs
         """

+    @abstractmethod
+    def _get_source_defined_primary_keys(
- `_` indicates that a method is private, so it's a little idiosyncratic to call this `_get...`.
- Also, can we remove the `source_defined`? This is a source, so `source_defined` is redundant.
- This should be a property because in many cases this is hardcoded, so having to create a whole method for it is verbose.
- Should this return `None` by default instead of being an abstractmethod?
Also, right now a source has to define a list of lists of strings. This supports four mutually exclusive cases:
- single flat primary key, e.g. `[["id"]]`
- single nested primary key, e.g. `[["id_parent", "id_field"]]`
- multiple flat primary keys, e.g. `[["id1"], ["id2"]]`
- multiple nested primary keys, e.g. `[["id_parent1", "id_field1"], ["id_parent2", "id_field2"]]`
My guess is that case 1 is by far the most common. So if I'm someone creating a source and I have to implement this method, I need to understand why this is a list of lists. Most people will think of the primary key as just a string, or maybe a list of strings in the composite case.
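To make the shapes concrete, here is a rough sketch of how a list-of-lists key could be resolved against a record, assuming each inner list is a path to one key field. `extract_primary_key` and the sample record are hypothetical and only for illustration; they are not part of the CDK.

```python
from typing import Any, List, Mapping, Tuple


def extract_primary_key(record: Mapping[str, Any], key_paths: List[List[str]]) -> Tuple[Any, ...]:
    """Hypothetical helper: follow each path into the record and collect the values."""
    values = []
    for path in key_paths:
        node: Any = record
        for field in path:
            node = node[field]  # descend one level per path element
        values.append(node)
    return tuple(values)


# Made-up record used only to exercise the different key shapes.
record = {"id": 7, "region": "eu", "owner": {"id": 42}}

flat_single = extract_primary_key(record, [["id"]])
nested_single = extract_primary_key(record, [["owner", "id"]])
flat_multiple = extract_primary_key(record, [["id"], ["region"]])
```

Under this reading, the outer list enumerates the components of the key and each inner list walks into nested fields, which is why even the simplest key ends up doubly wrapped.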
So why don't we have the following:
from abc import ABC, abstractmethod
from typing import List, Union

class Stream(ABC):
    @property
    @abstractmethod
    def primary_key(self) -> Union[str, List[str], List[List[str]]]:
        """
        Return a string for a single primary key, a list of strings for a composite primary key, or a list of lists of strings for a composite primary key consisting of nested fields.
        """

    def _wrapped_primary_key(self) -> List[List[str]]:
        keys = self.primary_key  # primary_key is a property, so it is read without parentheses
        if isinstance(keys, str):
            return [[keys]]
        elif isinstance(keys, list):
            wrapped_key = []
            for component in keys:
                if isinstance(component, str):
                    wrapped_key.append([component])
                elif isinstance(component, list):
                    wrapped_key.append(component)
                else:
                    raise ValueError("each primary key component must be a str or a list of str")
            return wrapped_key
        else:
            raise ValueError("primary_key must be a str or a list")
This way the most common case can just stick to using a string like they're used to, and if someone wants to go deeper there's nothing stopping them.
This is definitely a little more involved than usual but it will make for easier DX and readability in the majority case. WDYT?
If we do it this way we also won't need the helper method in http.py for wrapping the PK
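For illustration, the wrapping idea can be tried as a standalone function. This is only a sketch of the proposal: `wrap_primary_key` and the `PrimaryKey` alias are hypothetical names, not something the CDK is known to expose.

```python
from typing import List, Union

# Hypothetical alias for the three shapes accepted by the proposal.
PrimaryKey = Union[str, List[str], List[List[str]]]


def wrap_primary_key(key: PrimaryKey) -> List[List[str]]:
    """Normalize any accepted shape into a list of lists of strings."""
    if isinstance(key, str):
        return [[key]]
    if isinstance(key, list):
        wrapped = []
        for component in key:
            if isinstance(component, str):
                wrapped.append([component])  # flat field becomes a one-element path
            elif isinstance(component, list):
                wrapped.append(component)  # already a path into nested fields
            else:
                raise ValueError("each primary key component must be a str or a list of str")
        return wrapped
    raise ValueError("primary_key must be a str or a list")


single = wrap_primary_key("id")
composite = wrap_primary_key(["id1", "id2"])
nested = wrap_primary_key([["id_parent", "id_field"]])
```

A connector author would only ever write the left-hand form; the normalized list of lists stays an internal detail.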
Spoke offline.
Agreed this approach provides sane defaults while exposing all underlying functionality.
Also agreed to remove all the `source-defined` bits of the variable names.
airbyte-integrations/bases/base-python/base_python/cdk/streams/core.py
Made the changes as we discussed. How does this look, @sherifnada?
small comments and one breaking change that will need to be removed but LGTM
airbyte-integrations/bases/base-python/base_python/catalog_helpers.py
@@ -102,12 +107,38 @@ def cursor_field(self) -> Union[str, List[str]]:
         return []

     @property
-    def source_defined_cursor(self) -> bool:
+    def cursor(self) -> bool:
Can we keep this as `source_defined_cursor`? This one makes sense because the user actually can input a cursor, and this just toggles that.
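A rough sketch of what the flag toggles, assuming the usual Airbyte semantics in which a source-defined cursor takes precedence over any cursor field supplied by the user. `resolve_cursor_field` and its parameters are hypothetical names, not CDK API.

```python
from typing import List, Optional


def resolve_cursor_field(
    source_defined_cursor: bool,
    default_cursor_field: List[str],
    user_cursor_field: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: pick the cursor field that a sync would use."""
    if source_defined_cursor:
        # The source owns the cursor, so user input is ignored.
        return default_cursor_field
    # Otherwise prefer the user's choice and fall back to the stream default.
    return user_cursor_field or default_cursor_field


locked = resolve_cursor_field(True, ["updated_at"], ["created_at"])
overridden = resolve_cursor_field(False, ["updated_at"], ["created_at"])
fallback = resolve_cursor_field(False, ["updated_at"])
```

Seen this way, the `source_defined` prefix carries meaning for the cursor in a way it does not for the primary key.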
@@ -37,7 +37,8 @@ class HttpStream(Stream, ABC):
     Base abstract class for an Airbyte Stream using the HTTP protocol. Basic building block for users building an Airbyte source for a HTTP API.
     """

-    source_defined_cursor = True  # Most HTTP streams use a source defined cursor (i.e: the user can't configure it like on a SQL table)
+    cursor = True  # Most HTTP streams use a source defined cursor (i.e: the user can't configure it like on a SQL table)
+    primary_key = ""  # Change this to the field this stream should use as a primary key. Use a list if the key should be formed from multiple fields.
why set this to an empty string? I think either we make the base property not abstract and return None, or require users to choose a primary key. I'm kind of on the side of making it abstract and not overriding it here because almost all data has a primary key, so simply asking the user about the primary key will result in a much higher rate of connectors declaring primary keys.
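The two options can be compared in a small sketch. The class names below (`StrictStream`, `LenientStream`, `Users`, `Forgetful`, `Events`) are hypothetical stand-ins, not CDK classes; the point is only how Python's `abc` machinery behaves in each case.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Union


class StrictStream(ABC):
    """Option A: abstract property, so every subclass must declare a key."""

    @property
    @abstractmethod
    def primary_key(self) -> Union[str, List[str]]:
        ...


class LenientStream(ABC):
    """Option B: concrete property that defaults to None."""

    @property
    def primary_key(self) -> Optional[Union[str, List[str]]]:
        return None


class Users(StrictStream):
    primary_key = "id"  # a plain class attribute satisfies the abstract property


class Forgetful(StrictStream):
    pass  # never declares a primary key


class Events(LenientStream):
    pass  # silently inherits the None default


try:
    Forgetful()
    strict_error = None
except TypeError as exc:
    # Instantiation fails because primary_key is still abstract.
    strict_error = str(exc)
```

With the abstract version the omission surfaces as soon as the stream is instantiated, whereas the `None` default lets a connector ship without ever considering its primary key.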
good point.
…/core.py Co-authored-by: Sherif A. Nada <snadalive@gmail.com>
@sherifnada Holding off on publishing this base python version and updating all the downstream consumers since we are refactoring this. Will do so in a separate PR if we need to before the refactor goes out.
What
Allow CDK users to set primary keys in the Streams.
Closes: #2766.
How
- Add an abstract method, `_get_source_defined_primary_keys`, in the base `stream` class. Modify the `as_airbyte_stream` function to use the abstract method when setting `source_defined_primary_keys`.
- Add a `source_defined_primary_keys` property on `HttpStream` that is a list of strings. Implement the abstract primary keys method in `HttpStream`, converting the list of keys to a list of lists of keys. I felt that reducing the interface to a list of strings lets users avoid the confusion of a list of lists of strings and think only about passing in all the response fields required for the primary key.
Recommended reading order
1. core.py
2. http.py