Update table parameters (metadata) in Glue catalog #224

JPFrancoia · 2020-05-08T09:38:22Z

Is your feature request related to a problem? Please describe.

When exporting a dataframe to parquet, it's possible to also create a table in the Glue catalog, like this:

        wr.s3.to_parquet(
            df=df,
            path=my_path,
            index=False,
            dataset=True,
            mode="append",
            database=clean_database,
            table=my_table,
            description=DESCRIPTION,
            parameters=DATASET_TAGS,
            columns_comments=columns_comments,
        )

It's very handy: I use it once my dataframe is ready to be exported, after some processing steps.

However, I generally have several dataframes to process and export: I loop over several CSV files, read each one into a dataframe, process it, and then export it to parquet. The first time I call to_parquet, the Glue table is created. Subsequent calls don't change anything in Glue, but still export the dataframe to parquet (and, correct me if I'm wrong, also throw an exception if the dataframe's schema isn't compatible with the schema already in the table). Passing new or updated parameters in those subsequent calls doesn't update the parameters of the Glue table.

Describe the solution you'd like

I would like to be able to update the parameters of the Glue table. For example, I'd like to keep a record count for the dataset up to date, or count the number of NaN values, or track something else.

Issue #198 is somewhat related to my problem, but here I don't want to update the schema of the table, just the parameters.

I'm not sure if there is a safe/elegant way to do this.

I could simply export the files to parquet, store all the information I want to put in the parameters, and then create the Glue table manually. But I think I would lose the schema checking, and it would also complicate the workflow a bit.

What do you guys think?


JPFrancoia · 2020-05-08T21:18:01Z

Oh, this was fast! Thank you very much. I'll test 1.1.2 in the next few days and let you know if I find any problems! (I think there was a hiccup with the documentation page, https://aws-data-wrangler.readthedocs.io/en/latest/api.html: the upsert_table_parameters function is displayed twice.)

igorborgest · 2020-05-08T21:23:05Z

@JPFrancoia I've just fixed the docs! Thanks!

A quick overview of the changes:

We added three new functions under wr.catalog to address this.

You can use them independently to fetch the current values, calculate the new ones, and then update them in the Glue Catalog.
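A minimal sketch of that fetch, calculate, update cycle, assuming `wr.catalog.get_table_parameters` returns the table's parameters as a dict of strings and `wr.catalog.upsert_table_parameters` accepts `parameters`, `database` and `table`. The `records_count` key and both helper names are hypothetical, chosen only for illustration:

```python
from typing import Dict


def bump_record_count(
    current: Dict[str, str], new_rows: int, key: str = "records_count"
) -> Dict[str, str]:
    """Return only the parameter that needs to be upserted.

    `key` is a hypothetical parameter name. Glue table parameters are
    string-to-string, so the counter is parsed and re-serialised.
    """
    previous = int(current.get(key, "0"))
    return {key: str(previous + new_rows)}


def update_record_count(database: str, table: str, new_rows: int) -> Dict[str, str]:
    """Fetch the current parameters, compute the new count, upsert it."""
    # Imported lazily so bump_record_count can be used without AWS access.
    import awswrangler as wr

    current = wr.catalog.get_table_parameters(database=database, table=table)
    changes = bump_record_count(current, new_rows)
    wr.catalog.upsert_table_parameters(
        parameters=changes, database=database, table=table
    )
    return changes
```

Keeping the calculation in a pure function means it can be checked locally; only `update_record_count` needs AWS credentials and an existing table.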

Regarding wr.s3.to_parquet(mode="append"), we also included a wr.catalog.upsert_table_parameters call under the hood, so new parameters passed to the function will be upserted into the Catalog automatically.
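For the loop described in the original post, this could look roughly like the sketch below. It assumes the behaviour described above (parameters passed with mode="append" are upserted), that the table starts empty so the running totals begin at zero, and uses hypothetical parameter keys (`records_count`, `nan_count`) and helper names:

```python
from typing import Dict, Iterable


def count_parameters(total_rows: int, total_nan: int) -> Dict[str, str]:
    """Build the parameters dict; the keys are hypothetical and values must be strings."""
    return {"records_count": str(total_rows), "nan_count": str(total_nan)}


def export_all(
    csv_paths: Iterable[str], path: str, database: str, table: str
) -> Dict[str, str]:
    """Append every CSV to the dataset, refreshing the table parameters each time."""
    # Imported lazily so count_parameters can be used without these packages.
    import awswrangler as wr
    import pandas as pd

    total_rows = 0
    total_nan = 0
    parameters = count_parameters(total_rows, total_nan)
    for csv_path in csv_paths:
        df = pd.read_csv(csv_path)
        total_rows += len(df)
        total_nan += int(df.isna().sum().sum())
        parameters = count_parameters(total_rows, total_nan)
        wr.s3.to_parquet(
            df=df,
            path=path,
            index=False,
            dataset=True,
            mode="append",
            database=database,
            table=table,
            parameters=parameters,  # upserted into the Glue table on each call
        )
    return parameters
```

If the table already holds data from earlier runs, the starting totals would have to be read from the catalog first rather than assumed to be zero.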

JPFrancoia added the feature label May 8, 2020

igorborgest self-assigned this May 8, 2020

igorborgest added the WIP Work in progress label May 8, 2020

igorborgest added a commit that referenced this issue May 8, 2020

Add get_table_parameters, upsert_table_parameters, upsert_table_param…

a87867a

…eters. #224

igorborgest added the micro release Will be addressed in the next micro release label May 8, 2020

igorborgest closed this as completed May 8, 2020

igorborgest removed the WIP Work in progress label May 8, 2020
