Is your feature request related to a problem? Please describe.
When exporting a dataframe to parquet with wr.s3.to_parquet, it's possible to also create a table in the Glue catalog. It's very handy: I use it when my dataframe is ready to be exported, after some processing steps.
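For reference, the basic pattern looks roughly like this (a minimal sketch: the bucket, database, and table names are hypothetical, and the call needs AWS credentials and awswrangler installed to actually run):

```python
import pandas as pd

def export_with_catalog(df: pd.DataFrame, path: str, database: str, table: str) -> None:
    """Write df to S3 as parquet and create/register the matching Glue table."""
    import awswrangler as wr  # imported lazily so the rest of the sketch loads without AWS
    wr.s3.to_parquet(
        df=df,
        path=path,        # e.g. "s3://my-bucket/my-dataset/" (hypothetical)
        dataset=True,     # dataset mode is what enables the Glue catalog integration
        database=database,
        table=table,
    )

df = pd.DataFrame({"id": [1, 2], "value": [10.0, None]})
# export_with_catalog(df, "s3://my-bucket/my-dataset/", "my_db", "my_table")  # needs AWS credentials
```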
However, I generally have several dataframes to process and export: I loop over several CSV files, read each into a dataframe, process it, and export it to parquet. The first time I call the to_parquet function, the Glue table is created. Subsequent calls don't touch Glue but still export the df to parquet (and, correct me if I'm wrong, they also throw an exception if the dataframe's schema isn't compatible with the schema already in the table). Passing new/updated parameters in those later calls won't update the parameters of the Glue table.
Describe the solution you'd like
I would like to be able to update the parameters of the Glue table, for example to refresh the record count of the dataset, store the number of NaN values, or track similar metadata.
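A sketch of the kind of parameters I have in mind, computed from the dataframe itself (plain pandas; Glue table parameters are string-to-string key/value pairs, so values need to be serialized):

```python
import pandas as pd

def build_table_parameters(df: pd.DataFrame) -> dict:
    """Derive table parameters from a dataframe; Glue stores them as str -> str."""
    return {
        "record_count": str(len(df)),
        "nan_count": str(int(df.isna().sum().sum())),
    }

df = pd.DataFrame({"a": [1, None, 3], "b": [None, "x", "y"]})
build_table_parameters(df)
# -> {"record_count": "3", "nan_count": "2"}
```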
Issue #198 is somewhat related to my problem, but here I don't want to update the schema of the table, just the parameters.
I'm not sure if there is a safe/elegant way to do this.
I could simply export the files to parquet, store all the information I want to put in the parameters, and then create the Glue table manually. But I think I would lose the schema checking, and it would also complicate the workflow a bit.
What do you guys think?
Oh, this was fast! Thank you very much. I'll test 1.1.2 in the next few days and let you know if I find any problems. (I think there was a hiccup with the documentation page at https://aws-data-wrangler.readthedocs.io/en/latest/api.html: the upsert_table_parameters function is listed twice.)
You can use it independently to fetch the current values, calculate the new ones, and then update them in the Glue Catalog.
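That fetch/recompute/update cycle could look roughly like this (a sketch: the merge step is plain Python, while the two wr.catalog calls need AWS access, and the database/table names are placeholders):

```python
def merged_parameters(current: dict, new_rows: int) -> dict:
    """Pure merge step: bump the running record count, keep everything else."""
    params = dict(current)
    params["record_count"] = str(int(params.get("record_count", "0")) + new_rows)
    return params

def refresh_record_count(database: str, table: str, new_rows: int) -> dict:
    import awswrangler as wr  # deferred import so merged_parameters runs without AWS
    current = wr.catalog.get_table_parameters(database=database, table=table)
    updated = merged_parameters(current, new_rows)
    wr.catalog.upsert_table_parameters(parameters=updated, database=database, table=table)
    return updated

merged_parameters({"record_count": "10", "owner": "me"}, 5)
# -> {"record_count": "15", "owner": "me"}
```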
Regarding wr.s3.to_parquet(mode="append"): we also included a wr.catalog.upsert_table_parameters call under the hood, so new parameters passed to the function will be upserted into the Catalog automatically.
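Concretely, that means parameters passed on each append call are upserted as part of the write (a sketch with hypothetical file, bucket, and table names; it needs AWS credentials and awswrangler to actually run):

```python
import pandas as pd

def append_with_parameters(df: pd.DataFrame, path: str, database: str, table: str) -> None:
    import awswrangler as wr  # imported lazily so the sketch loads without AWS installed
    wr.s3.to_parquet(
        df=df,
        path=path,
        dataset=True,
        database=database,
        table=table,
        mode="append",
        # per the comment above, these are now upserted into the Glue table parameters
        parameters={"record_count": str(len(df))},
    )

# for csv_file in ["a.csv", "b.csv"]:  # hypothetical input files
#     append_with_parameters(pd.read_csv(csv_file), "s3://my-bucket/my-dataset/", "my_db", "my_table")
```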