
Conversation

@JavierLopezT
Contributor

In Snowflake, everything is uppercase. In get_pandas_df this can be inconvenient: sometimes you need to access certain columns, and you don't want to type their names in uppercase. This is especially inconvenient if you are migrating from Redshift or another DW to Snowflake, because you have to change all your pandas code.

I aim to solve this with this PR. If you set a boolean flag to True in the extra field of your Snowflake connection, all the columns returned by get_pandas_df will be lowercased automatically.
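
For illustration, the effect I'm after is roughly the following (lowercase_columns is the flag name I have in mind for the connection extra; the exact name may change):

import pandas as pd

# Snowflake returns uppercase column names for unquoted identifiers
df = pd.DataFrame({"FIRST_NAME": ["ada"], "LAST_NAME": ["lovelace"]})
# With the flag enabled, the hook would lowercase the columns before returning the DataFrame
df.columns = df.columns.str.lower()
print(list(df.columns))  # ['first_name', 'last_name']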

I guess this needs tests, but I have no idea how to begin. Could anyone help me, please? Thanks

@github-actions

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@dstandish
Contributor

It's not true that everything in Snowflake is uppercase. Snowflake resolves unquoted identifiers as uppercase, but quoted identifiers it leaves as-is.

You can do create table "MyCrazyTable" ... and then when you do select * from mycrazytable it will fail.

So this would be no problem if you never quote identifiers in your code. However, if that's not the case, it could cause problems.

Perhaps a better example is this:
create temp table _my_table ("col1" int, col1 int);

With your change you might run into trouble: after lowercasing, both columns end up with the same name.
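
To make that concrete, roughly what happens if you pull both columns into pandas and then lowercase (a sketch, not a claim about the exact PR code):

import pandas as pd

# Snowflake resolves the unquoted col1 as COL1 and keeps the quoted "col1" as-is,
# so the result set has two distinct columns
df = pd.DataFrame([[1, 2]], columns=["col1", "COL1"])
df.columns = df.columns.str.lower()
print(df["col1"])  # both columns now share the label "col1", so this selects two columns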

@dstandish
Contributor

dstandish commented Dec 18, 2020

OK, but I see now you made it optional :)

So maybe that's OK.

Contributor

@dstandish left a comment


OK, so I have a different concern with this change.

You are adding a global config option lowercase_columns (global in the context of the hook).

But this only applies to the narrow case of the get_pandas_df method; e.g. it does not apply to get_records, or to fetchall on a cursor.

So I would vote not to approve this change.

Since this isn't a Snowflake parameter, and can't be applied uniformly, I think it's the kind of change best left to be implemented in your own subclass, not something in the Airflow version.

People will not easily discover the functionality, and will find that it has only limited use, so it's not worth it IMO.

@JavierLopezT
Contributor Author

JavierLopezT commented Dec 31, 2020

> OK, so I have a different concern with this change.
>
> You are adding a global config option lowercase_columns (global in the context of the hook).
>
> But this only applies to the narrow case of the get_pandas_df method; e.g. it does not apply to get_records, or to fetchall on a cursor.
>
> So I would vote not to approve this change.
>
> Since this isn't a Snowflake parameter, and can't be applied uniformly, I think it's the kind of change best left to be implemented in your own subclass, not something in the Airflow version.
>
> People will not easily discover the functionality, and will find that it has only limited use, so it's not worth it IMO.

I agree that it was not ideal to have it as a global config option for just one method. However, I do think that this can be useful for more people, since it's pretty common to migrate from other databases to Snowflake and this will make things easier.

Thus, I have moved the argument from the connection to the get_pandas_df method and added a little bit of documentation. I also agree that this could easily stay hidden from users, so I would be happy to write more documentation somewhere else (though I don't know where). WDYT?
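
On the testing question, I'm thinking of something roughly like the following (the lowercase_columns argument and the patch targets are just my assumptions about how the updated method could be exercised, not the final test):

import pandas as pd
from unittest import mock

from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

# Patch what get_pandas_df relies on so no real Snowflake connection is needed;
# assumes the base DbApiHook implementation delegates to pandas.io.sql.read_sql
@mock.patch("airflow.providers.snowflake.hooks.snowflake.SnowflakeHook.get_conn")
@mock.patch("pandas.io.sql.read_sql")
def test_get_pandas_df_lowercase_columns(mock_read_sql, mock_get_conn):
    mock_read_sql.return_value = pd.DataFrame(columns=["COL_A", "COL_B"])
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    df = hook.get_pandas_df("SELECT 1", lowercase_columns=True)
    assert list(df.columns) == ["col_a", "col_b"]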

@dstandish
Contributor

dstandish commented Dec 31, 2020

Let me caveat that I am not a committer and can't merge this anyway, so my opinion is not decisive. I just took a look because I have experience with Snowflake and can help review.

And let me also say that I totally get the impulse to add this: as someone who has used Snowflake for a couple of years, I was also annoyed by the weird handling of case.

And also, let me share that I know what it's like to try to contribute or propose something and get a "meh".

That all having been said, I just don't think this should be added. It's too niche, and it's not worth the complexity and confusion it adds. True, it's a tiny amount of complexity, but it's another configurable thing, and it is nonstandard: it is not part of the Snowflake API, not part of the pandas API, and not supported in other hook methods. Will people even discover the feature?

I think it is best to simply accept Snowflake as it is. get_pandas_df returns the naming that Snowflake gives it: if columns are lowercase and quoted in the system, they'll be lowercase in pandas. The same is true for fetchall, or for querying from an IDE or the web UI for that matter.

Snowflake currently has a parameter you can set to make quoted identifiers case-insensitive; when enabled, even quoted identifiers are resolved as uppercase. I think the best solution would be for Snowflake to extend this so that quoted identifiers could be case-insensitive and always rendered as lowercase. But that's an issue for another team :)
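
For reference, the parameter I mean is (I believe) QUOTED_IDENTIFIERS_IGNORE_CASE. Setting it at session level via the hook would look something like this (a sketch; note that each hook.run opens its own session, so in practice you would set it at account or user level):

from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
# With this enabled, quoted identifiers are also treated as case-insensitive
# (and resolved as uppercase), which is the behaviour described above
hook.run("ALTER SESSION SET QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE")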

Anyway, just my 2 cents. Others very well may feel differently.

In Apache language my vote would be -0: "I won't get in the way, but I'd rather we didn't do this."

@potiuk
Member

potiuk commented Jan 5, 2021

There is a new version of the Snowflake Python connector coming, which we will incorporate very quickly, as it solves some really bad behaviour of the current connector. I think we should take another look at the issue when the new version is released.

I also have high hopes we can get much better cooperation between Snowflake and the open-source community using their connectors. They seem to realise that this is important for them.

@eladkal
Contributor

eladkal commented Feb 3, 2021

I tend to agree with @dstandish.
I also don't think this issue is specific to Snowflake.
I have ETLs where I do something similar with other databases. Nothing prevents you from doing:

from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

def snflk_func(**kwargs):
    snflk = SnowflakeHook('snowflake_default')
    df = snflk.get_pandas_df(sql=kwargs['sql'])
    if kwargs['lowercase_columns']:
        df.columns = df.columns.str.lower()
    # load your df to anywhere you need.

op = PythonOperator(task_id='snflk_task',
                    python_callable=snflk_func,
                    op_kwargs={'sql': 'SELECT * FROM my_table',  # your query here
                               'lowercase_columns': True},
                    dag=dag,  # your DAG object
                    )

When you use hooks, either you wrap them in a Python function or you inherit from them and override get_pandas_df. In both cases it's very easy to handle the problem you presented.

If this PR proceeds forward, my only comment is: at least don't call it lowercase_columns. It's preferable to call it something like column_case that accepts lowercase as one option. I, for example, convert to snakecase in my code, so if this is accepted it's better to keep the option open for extensions.
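
For example, a subclass that keeps the option open-ended could look roughly like this (column_case and the transforms are just illustrative names, not a proposed API):

import re

from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

class MySnowflakeHook(SnowflakeHook):
    def get_pandas_df(self, sql, parameters=None, column_case=None, **kwargs):
        df = super().get_pandas_df(sql, parameters=parameters, **kwargs)
        if column_case == "lowercase":
            df.columns = df.columns.str.lower()
        elif column_case == "snakecase":
            # e.g. "ORDER ID" -> "order_id"
            df.columns = [re.sub(r"\W+", "_", col).lower() for col in df.columns]
        return df

Then MySnowflakeHook('snowflake_default').get_pandas_df(sql, column_case='snakecase') does the conversion, and you are free to add more cases later.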

Like @dstandish, I'm -0 for this PR, but that is just me.

@JavierLopezT
Contributor Author

> There is a new version of the Snowflake Python connector coming, which we will incorporate very quickly, as it solves some really bad behaviour of the current connector. I think we should take another look at the issue when the new version is released.
>
> I also have high hopes we can get much better cooperation between Snowflake and the open-source community using their connectors. They seem to realise that this is important for them.

Hi Jarek. Any news about this?

@turbaszek requested a review from potiuk February 24, 2021 18:52
@potiuk
Member

potiuk commented Feb 24, 2021

Yeah. The new version of the Python connector is out and we switched to it. But after looking closely, I also think we should not merge this one; similarly to Daniel, this is -0 for me. Since everything we do is Python, overriding get_pandas_df in your own hook is almost as easy as passing a parameter, and it gives you much better flexibility.

Also, I think that if Snowflake has capitalised names, the best you can do is to keep it that way throughout the whole "data" journey, to avoid confusion; even for cases where you have to do any kind of matching or lineage analysis, it simply makes sense to keep it consistent. Adding an "optional parameter" seems harmless, but I think we should avoid doing so when the option is "niche" and you can achieve it almost as easily by other means.

I think we are all for closing the PR. Are you convinced @JavierLopezT?

@turbaszek added the provider:snowflake label Feb 24, 2021
@JavierLopezT
Contributor Author

> Yeah. The new version of the Python connector is out and we switched to it. But after looking closely, I also think we should not merge this one; similarly to Daniel, this is -0 for me. Since everything we do is Python, overriding get_pandas_df in your own hook is almost as easy as passing a parameter, and it gives you much better flexibility.
>
> Also, I think that if Snowflake has capitalised names, the best you can do is to keep it that way throughout the whole "data" journey, to avoid confusion; even for cases where you have to do any kind of matching or lineage analysis, it simply makes sense to keep it consistent. Adding an "optional parameter" seems harmless, but I think we should avoid doing so when the option is "niche" and you can achieve it almost as easily by other means.
>
> I think we are all for closing the PR. Are you convinced @JavierLopezT?

Sure. I'll close it myself

@JavierLopezT deleted the snowflake_pandas branch April 1, 2021 08:43