
Fix comparing columns in tables to the current stream keys. #94

Conversation

mirelagrigoras

A "key" value, as it comes in the stream_buffer object, may or may not contain uppercase letters, but in Redshift, which is based on Postgres, columns are always created with lowercase letters.
As a consequence, in order to check that in the existing table there is a column that corresponds to a certain key and has the same type as the one provided for the key in the current schema, we must check that the type of the key is the same as the column that is mapped to this key.
E.g.: For a key called 'orderID', the new column that will be created in a Redshift table will be "orderid".Thus, in the metadata object the column 'orderid' will be mapped to the 'orderID' key.
To check that both the column in the table and the schema type for that key have the same sql type, we call the json_schema_to_sql_type method as it follows:
-current_table_schema['schema']['properties']['orderid']
-stream_buffer.schema['properties']['orderID'].
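A minimal sketch of that check, using the names from this description; the wrapper function and the stand-in type mapping below are hypothetical illustrations, not the project's actual code:

```python
# Stand-in for the project's json_schema_to_sql_type; the real mapping is
# richer, this is only enough to make the example run.
def json_schema_to_sql_type(schema):
    json_type = schema['type'][0] if isinstance(schema['type'], list) else schema['type']
    return {'integer': 'BIGINT', 'number': 'DOUBLE PRECISION'}.get(json_type, 'VARCHAR')

# Hypothetical helper illustrating the comparison described above.
def key_has_matching_column(current_table_schema, stream_schema, key):
    # Redshift folds unquoted identifiers to lowercase, so the column
    # backing 'orderID' is named 'orderid'.
    column = key.lower()
    table_properties = current_table_schema['schema']['properties']
    if column not in table_properties:
        return False
    # Compare the SQL type of the existing (lowercased) column against the
    # SQL type derived from the incoming schema for the original key.
    return (json_schema_to_sql_type(table_properties[column])
            == json_schema_to_sql_type(stream_schema['properties'][key]))
```

Called, e.g., as key_has_matching_column(current_table_schema, stream_buffer.schema, 'orderID').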

@AlexanderMann (Collaborator) left a comment


@mirelagrigoras I took a stab at this as well before seeing your work in here. This is great! Also, good catch with the key_properties being passed into persist_csv.

I don't see any failing tests in here, though, and we'll need those for future-proofing. If you want to check out my PR and steal the tests, feel free:

#95

I'll leave it up to you whether you want to merge the PRs, or review mine and we can try to get that one through!
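A rough sketch of the kind of regression test in question, assuming a pytest-style suite; the schemas and the json_schema_to_sql_type stand-in shown earlier are illustrative, not the project's real fixtures:

```python
def test_key_type_check_survives_redshift_lowercasing():
    # Column as it exists in Redshift (lowercased) vs. the key as the tap sends it.
    table_properties = {'orderid': {'type': ['integer']}}
    stream_properties = {'orderID': {'type': ['integer']}}

    key = 'orderID'
    assert (json_schema_to_sql_type(table_properties[key.lower()])
            == json_schema_to_sql_type(stream_properties[key]))
```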

```diff
@@ -459,7 +463,7 @@ def persist_csv_rows(self,

         update_sql = self._get_update_sql(remote_schema['name'],
                                           temp_table_name,
-                                          remote_schema['key_properties'],
+                                          [key.lower() for key in remote_schema['key_properties']],
```

I think you also have an issue with other values being canonicalized to _ by the canonicalize_identifier code. For your persist_csv code you'll need to do the same thing you did on line 142.
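As a rough illustration of that concern (the real canonicalize_identifier may differ; this behaviour is an assumption):

```python
import re

# Assumed behaviour of identifier canonicalization: fold to lowercase and
# replace anything outside [a-z0-9_] with an underscore, so lowercasing a
# key alone may not reproduce the stored column name.
def canonicalize_identifier(name):
    return re.sub(r'[^a-z0-9_]', '_', name.lower())

canonicalize_identifier('order-ID')  # -> 'order_id'
canonicalize_identifier('orderID')   # -> 'orderid'
```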

```diff
@@ -125,6 +124,8 @@ def write_batch(self, stream_buffer):
                 (stream_buffer.stream,),
                 stream_buffer.stream)

+        table_metadata = self._get_table_metadata(cur,
```

This should probably be moved down to around line 140, since it's only used there. Doing that better scopes the variable so it can't accidentally be used in more places.

@mirelagrigoras (Author) commented Feb 8, 2019 via email

@AlexanderMann (Collaborator)

@mirelagrigoras sounds good. It looks like @awm33 has already approved the other PR. I'll go ahead and merge and get releases cut so that we can get Redshift updated with the same changes.
