
Conversation

@JavierLopezT
Contributor

In Snowflake, everything is uppercase. In get_pandas_df this can be inconvenient: sometimes you need to access certain columns, and you don't want to type their names in uppercase. This is especially inconvenient if you are migrating from Redshift or another DW to Snowflake, because you have to change all your pandas code.

I aim to solve this with this PR. If you set a boolean flag to True in the extra field of your Snowflake connection, all the columns returned by get_pandas_df will be lowercased automatically.
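
For illustration, the effect I'm after is roughly the following (lowercase_columns is the flag name I have in mind for the connection extra; the exact name may change):

import pandas as pd

# Snowflake returns uppercase column names for unquoted identifiers
df = pd.DataFrame({"FIRST_NAME": ["ada"], "LAST_NAME": ["lovelace"]})
# With the flag enabled, the hook would lowercase the columns before returning the DataFrame
df.columns = df.columns.str.lower()
print(list(df.columns))  # ['first_name', 'last_name']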

I guess this needs tests, but I have no idea how to begin. Could anyone help me, please? Thanks

@github-actions

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@dstandish
Contributor

It's not true that everything in Snowflake is uppercase. Snowflake resolves unquoted identifiers as uppercase, but quoted identifiers it leaves as-is.

You can do create table "MyCrazyTable" ... and then when you do select * from mycrazytable it will fail.

So this would be no problem if you never quote identifiers in your code. However, if that's not the case, it could cause problems.

Perhaps a better example is this:
create temp table _my_table ("col1" int, col1 int);

With your change you might run into trouble: after lowercasing, both columns end up with the same name.
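
To make that concrete, roughly what happens if you pull both columns into pandas and then lowercase (a sketch, not a claim about the exact PR code):

import pandas as pd

# Snowflake resolves the unquoted col1 as COL1 and keeps the quoted "col1" as-is,
# so the result set has two distinct columns
df = pd.DataFrame([[1, 2]], columns=["col1", "COL1"])
df.columns = df.columns.str.lower()
print(df["col1"])  # both columns now share the label "col1", so this selects two columns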

@dstandish
Contributor

dstandish commented Dec 18, 2020

OK, but I see now you made it optional :)

So maybe that's OK.

Contributor

@dstandish left a comment


OK, so I have a different concern with this change.

You are adding a global config option lowercase_columns (global in the context of the hook).

But this only applies to the narrow case of the get_pandas_df method; e.g. it does not apply to get_records, or to fetchall on a cursor.

So I would vote not to approve this change.

Since this isn't a Snowflake parameter, and can't be applied uniformly, I think it's the kind of change best left to be implemented in your own subclass, not something in the Airflow version.

People will not easily discover the functionality, and will find that it has only limited use, so it's not worth it IMO.

@JavierLopezT
Contributor Author

JavierLopezT commented Dec 31, 2020

> OK, so I have a different concern with this change.
>
> You are adding a global config option lowercase_columns (global in the context of the hook).
>
> But this only applies to the narrow case of the get_pandas_df method; e.g. it does not apply to get_records, or to fetchall on a cursor.
>
> So I would vote not to approve this change.
>
> Since this isn't a Snowflake parameter, and can't be applied uniformly, I think it's the kind of change best left to be implemented in your own subclass, not something in the Airflow version.
>
> People will not easily discover the functionality, and will find that it has only limited use, so it's not worth it IMO.

I agree that it was not ideal to have it as a global config option for just one method. However, I do think that this can be useful for more people, since it's pretty common to migrate from other databases to Snowflake and this will make things easier.

Thus, I have moved the argument from the connection to the get_pandas_df method and added a little bit of documentation. I also agree that this could easily stay hidden from users, so I would be happy to write more documentation somewhere else (though I don't know where). WDYT?
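
On the testing question, I'm thinking of something roughly like the following (the lowercase_columns argument and the patch targets are just my assumptions about how the updated method could be exercised, not the final test):

import pandas as pd
from unittest import mock

from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

# Patch what get_pandas_df relies on so no real Snowflake connection is needed;
# assumes the base DbApiHook implementation delegates to pandas.io.sql.read_sql
@mock.patch("airflow.providers.snowflake.hooks.snowflake.SnowflakeHook.get_conn")
@mock.patch("pandas.io.sql.read_sql")
def test_get_pandas_df_lowercase_columns(mock_read_sql, mock_get_conn):
    mock_read_sql.return_value = pd.DataFrame(columns=["COL_A", "COL_B"])
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    df = hook.get_pandas_df("SELECT 1", lowercase_columns=True)
    assert list(df.columns) == ["col_a", "col_b"]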

@dstandish
Contributor

dstandish commented Dec 31, 2020

Let me caveat that I am not a committer and can't merge this anyway, so my opinion is not decisive. I just took a look because I have experience with Snowflake and can help review.

And let me also say that I totally get the impulse to add this: as someone who has used Snowflake for a couple of years, I was also annoyed by the weird handling of case.

And also, let me share that I know what it's like to try to contribute or propose something and get a "meh".

That all having been said, I just don't think this should be added. It's too niche, and it's not worth the complexity and confusion it adds. True, it's a tiny amount of complexity, but it's another configurable thing, and it is nonstandard: it is not part of the Snowflake API, not part of the pandas API, and not supported in other hook methods. Will people even discover the feature?

I think it is best to simply accept Snowflake as it is. get_pandas_df returns the naming that Snowflake gives it: if columns are lowercase and quoted in the system, they'll be lowercase in pandas. The same is true for fetchall, or for querying from an IDE or the web UI for that matter.

Snowflake currently has a parameter you can set to make quoted identifiers case-insensitive; when enabled, even quoted identifiers are resolved as uppercase. I think the best solution would be for Snowflake to extend this so that quoted identifiers could be case-insensitive and always rendered as lowercase. But that's an issue for another team :)
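
For reference, the parameter I mean is (I believe) QUOTED_IDENTIFIERS_IGNORE_CASE. Setting it at session level via the hook would look something like this (a sketch; note that each hook.run opens its own session, so in practice you would set it at account or user level):

from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
# With this enabled, quoted identifiers are also treated as case-insensitive
# (and resolved as uppercase), which is the behaviour described above
hook.run("ALTER SESSION SET QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE")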

Anyway, just my 2 cents. Others very well may feel differently.

In Apache language my vote would be -0: "I won't get in the way, but I'd rather we didn't do this."

@potiuk
Member

potiuk commented Jan 5, 2021

There is a new version of the Snowflake Python connector coming, which we will incorporate very quickly, as it solves some really bad behaviour of the current connector. I think we should take another look at the issue when the new version is released.

I also have high hopes we can get much better cooperation between Snowflake and the open-source community using their connectors. They seem to realise that this is important for them.

@eladkal
Contributor

eladkal commented Feb 3, 2021

I tend to agree with @dstandish.
I also don't think this issue is specific to Snowflake.
I have ETLs where I do something similar with other databases. Nothing prevents you from doing:

from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

def snflk_func(**kwargs):
    snflk = SnowflakeHook('snowflake_default')
    df = snflk.get_pandas_df(sql=kwargs['sql'])
    if kwargs['lowercase_columns']:
        df.columns = df.columns.str.lower()
    # load your df to anywhere you need.

op = PythonOperator(task_id='snflk_task',
                    python_callable=snflk_func,
                    op_kwargs={'sql': 'SELECT * FROM my_table',  # your query here
                               'lowercase_columns': True},
                    dag=dag,  # your DAG object
                    )

When you use hooks, either you wrap them in a Python function or you inherit from them and override get_pandas_df. In both cases it's very easy to handle the problem you presented.

If this PR proceeds forward, my only comment is: at least don't call it lowercase_columns. It's preferable to call it something like column_case that accepts lowercase as one option. I, for example, convert to snakecase in my code, so if this is accepted it's better to keep the option open for extensions.
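
For example, a subclass that keeps the option open-ended could look roughly like this (column_case and the transforms are just illustrative names, not a proposed API):

import re

from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

class MySnowflakeHook(SnowflakeHook):
    def get_pandas_df(self, sql, parameters=None, column_case=None, **kwargs):
        df = super().get_pandas_df(sql, parameters=parameters, **kwargs)
        if column_case == "lowercase":
            df.columns = df.columns.str.lower()
        elif column_case == "snakecase":
            # e.g. "ORDER ID" -> "order_id"
            df.columns = [re.sub(r"\W+", "_", col).lower() for col in df.columns]
        return df

Then MySnowflakeHook('snowflake_default').get_pandas_df(sql, column_case='snakecase') does the conversion, and you are free to add more cases later.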

Like @dstandish, I'm -0 for this PR, but that is just me.

@JavierLopezT
Contributor Author

> There is a new version of the Snowflake Python connector coming, which we will incorporate very quickly, as it solves some really bad behaviour of the current connector. I think we should take another look at the issue when the new version is released.
>
> I also have high hopes we can get much better cooperation between Snowflake and the open-source community using their connectors. They seem to realise that this is important for them.

Hi Jarek. Any news about this?

@turbaszek requested a review from potiuk February 24, 2021 18:52
@potiuk
Member

potiuk commented Feb 24, 2021

Yeah. The new version of the Python connector is out and we switched to it. But after looking closely, I also think we should not merge this one; similarly to Daniel, this is -0 for me. Since everything we do is Python, overriding get_pandas_df in your own hook is almost as easy as passing a parameter, and it gives you much better flexibility.

Also, I think that if Snowflake has capitalised names, the best you can do is to keep it that way throughout the whole "data" journey, to avoid confusion; even for cases where you have to do any kind of matching or lineage analysis, it simply makes sense to keep it consistent. Adding an "optional parameter" seems harmless, but I think we should avoid doing so when the option is "niche" and you can achieve it almost as easily by other means.

I think we are all for closing the PR. Are you convinced @JavierLopezT?

@turbaszek added the provider:snowflake label Feb 24, 2021
@JavierLopezT
Contributor Author

> Yeah. The new version of the Python connector is out and we switched to it. But after looking closely, I also think we should not merge this one; similarly to Daniel, this is -0 for me. Since everything we do is Python, overriding get_pandas_df in your own hook is almost as easy as passing a parameter, and it gives you much better flexibility.
>
> Also, I think that if Snowflake has capitalised names, the best you can do is to keep it that way throughout the whole "data" journey, to avoid confusion; even for cases where you have to do any kind of matching or lineage analysis, it simply makes sense to keep it consistent. Adding an "optional parameter" seems harmless, but I think we should avoid doing so when the option is "niche" and you can achieve it almost as easily by other means.
>
> I think we are all for closing the PR. Are you convinced @JavierLopezT?

Sure. I'll close it myself

@JavierLopezT deleted the snowflake_pandas branch April 1, 2021 08:43