Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Primary Key to CDK. #3105

Merged
merged 15 commits into from
Apr 30, 2021
Merged

Add Primary Key to CDK. #3105

merged 15 commits into from
Apr 30, 2021

Conversation

davinchia
Copy link
Contributor

@davinchia davinchia commented Apr 28, 2021

What

Allow CDK users to set primary keys in the Streams.

Closes: #2766.

How

  • Make primary key an abstract internal method, _get_source_defined_primary_keys, in the base stream class. Modify the as_airbyte_stream function to use the abstract method when setting source_defined_primary_keys.
  • Create a source_defined_primary_keys property on HttpStream that is a list of strings. Implement the abstract primary keys method in HttpStream converting the list of keys to a list of list of keys. I felt reducing the interface to a list of strings allows users to avoid the confusion of a list of list of strings and only thinking about passing in all the response' fields required for the primary key.

Recommended reading order

  1. core.py
  2. http.py

@davinchia davinchia changed the title Add hooks for primary key in base Source interface. Add more speciali… Add Primary Key to CDK. Apr 28, 2021
@davinchia davinchia marked this pull request as ready for review April 28, 2021 06:53
@davinchia davinchia removed the request for review from michel-tricot April 28, 2021 06:53
Copy link
Contributor

@sherifnada sherifnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bunch of suggestions and one question

@@ -64,6 +64,14 @@ def read_records(
This method should be overridden by subclasses to read records based on the inputs
"""

@abstractmethod
def _get_source_defined_primary_keys(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. _ indicates that a method is private, so it's a little idiosyncratic to call this _get....
  2. Also, can we remove the source_defined? this is a source, so source_defined is redundant
  3. This should be a property because in many cases this is hardcoded, so having to create a whole method for it is verbose
  4. Should this return None by default instead of being an abstractmethod?

Also, right now a source has to define a list of list of strings. This supports four mutex cases:

  1. single flat primary key e.g [["id"]]
  2. single nested primary key e.g: `[["id_parent", "id_field"]]
  3. multiple flat primary keys [["id1"], ["id2"]]
  4. multiple nested primary keys [["id_parent1", "id_field1"], ["id_parent2", "id_field2"]]

My guess is that no 1 is by far the most common. So if I'm someone creating a source and I have to implement this method, I need to understand why this is a list of lists. Most people will think of the primary key as just a string, or maybe a list of strings in the composite case.

So why don't we have the following:


@property
@abstractmethod
def primary_key() -> Union[str, List[str], List[List[str]]]:
   """
   Return string if single primary key, list of strings if composite primary key, list of list of strings if composite primary key consisting of nested fields
   """ 

def _wrapped_primary_key() -> List[List[str]]:
   if isinstance(self.primary_key(), str): 
       return [[self.primary_key()]]
   elif isinstance(self.primary_key(), list):
       wrapped_key = []
       for component in self.primary_key():
           if isinstance(component, str):
             wrapped_key.append([component])
           elif isinstance(component, list):
              wrapped_key.append(component)
           else: 
              # error
    else: 
         # error must be a list or str

this way the most common case can just stick to using string like they're used to and if someone wants to go deeper there's nothing stopping them.

This is definitely a little more involved than usual but it will make for easier DX and readability in the majority case. WDYT?

If we do it this way we also won't need the helper method in http.py for wrapping the PK

Copy link
Contributor Author

@davinchia davinchia Apr 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spoke offline.

Agreed this approach provides sane defaults while exposing all underlying functionality.

Also agreed to remove all the source-defined bits of the variable names.

@davinchia
Copy link
Contributor Author

davinchia commented Apr 29, 2021

Made the changes as we discussed. How does this look @sherifnada ?

Copy link
Contributor

@sherifnada sherifnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments and one breaking change that will need to be removed but LGTM

@@ -102,12 +107,38 @@ def cursor_field(self) -> Union[str, List[str]]:
return []

@property
def source_defined_cursor(self) -> bool:
def cursor(self) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can either keep this as source_defined_cursor? this one makes sense because the user actually can input a cursor, and this just toggles that

@@ -37,7 +37,8 @@ class HttpStream(Stream, ABC):
Base abstract class for an Airbyte Stream using the HTTP protocol. Basic building block for users building an Airbyte source for a HTTP API.
"""

source_defined_cursor = True # Most HTTP streams use a source defined cursor (i.e: the user can't configure it like on a SQL table)
cursor = True # Most HTTP streams use a source defined cursor (i.e: the user can't configure it like on a SQL table)
primary_key = "" # Change this to the field this stream should use as a primary key. Use a list if the key should be formed from multiple fields.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why set this to an empty string? I think either we make the base property not abstract and return None, or require users to choose a primary key. I'm kind of on the side of making it abstract and not overriding it here because almost all data has a primary key, so simply asking the user about the primary key will result in a much higher rate of connectors declaring primary keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point.

@davinchia
Copy link
Contributor Author

@sherifnada holding off on publishing this base python version and updating all the down stream consumers since we are refactoring this. will do so in a separate PR if we need to do so before the refactor goes out.

@davinchia davinchia merged commit fc00d36 into master Apr 30, 2021
@davinchia davinchia deleted the davinchia/santa-primary-key branch April 30, 2021 06:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

santa: primary keys
2 participants