Update table parameters (metadata) in Glue catalog #224

Closed
JPFrancoia opened this issue May 8, 2020 · 2 comments
Labels: feature, micro release (Will be addressed in the next micro release)

Comments

@JPFrancoia
Contributor

Is your feature request related to a problem? Please describe.

When exporting a dataframe to parquet, it's possible to also create a table in the Glue catalog, like this:

        wr.s3.to_parquet(
            df=df,
            path=my_path,
            index=False,
            dataset=True,
            mode="append",
            database=clean_database,
            table=my_table,
            description=DESCRIPTION,
            parameters=DATASET_TAGS,
            columns_comments=columns_comments,
        )

It's very handy, I use it when my dataframe is ready to be exported, after some processing steps.

However, I generally have several dataframes to process and export: I loop over several CSV files, read each one into a dataframe, process it, and export it to parquet. The first call to to_parquet creates the Glue table. Subsequent calls don't change anything in Glue but still export the dataframe to parquet (and, correct me if I'm wrong, also throw an exception if the dataframe's schema isn't compatible with the schema already in the table). Passing new or updated parameters in those later calls doesn't update the parameters of the Glue table.

Describe the solution you'd like

I would like to be able to update the parameters of the Glue table, for example to refresh the record count of the dataset, to store the number of NaN values, or to track something else.

Issue #198 is somewhat related to my problem, but here I don't want to update the schema of the table, just the parameters.

I'm not sure if there is a safe/elegant way to do this.

I could simply export the files to parquet, store all the information I want to put in the parameters, and then create the Glue table manually. But I think I would lose the schema checking, and it would also complicate the workflow a bit.

What do you guys think?

@igorborgest igorborgest self-assigned this May 8, 2020
@igorborgest igorborgest added the WIP Work in progress label May 8, 2020
@igorborgest igorborgest added the micro release Will be addressed in the next micro release label May 8, 2020
@JPFrancoia
Contributor Author

JPFrancoia commented May 8, 2020

Oh, this was fast! Thank you very much. I'll test 1.1.2 in the next few days and let you know if I find any problems. (I think there was a hiccup with the documentation page, https://aws-data-wrangler.readthedocs.io/en/latest/api.html: the upsert_table_parameters function is displayed twice.)

@igorborgest
Contributor

@JPFrancoia I've just fixed the docs! Thanks!

A quick overview of the changes:

We added three new functions under wr.catalog to address this.

You can use them independently to fetch the current values, calculate the new ones, and then update them in the Glue Catalog.
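
A minimal sketch of that fetch, calculate, update cycle. Only wr.catalog.upsert_table_parameters is named in this thread; wr.catalog.get_table_parameters and the keyword arguments used here are assumptions to check against the API docs, and the records_count key and both helper names are hypothetical:

```python
from typing import Dict


def bump_record_count(
    current: Dict[str, str], new_rows: int, key: str = "records_count"
) -> Dict[str, str]:
    """Return the parameters to upsert after appending `new_rows` rows.

    Glue stores table parameters as strings, so the counter is parsed
    and serialised again. `records_count` is a hypothetical key name.
    """
    previous = int(current.get(key, "0"))
    return {key: str(previous + new_rows)}


def update_record_count(database: str, table: str, new_rows: int) -> Dict[str, str]:
    """Fetch, recompute, and upsert the counter (needs AWS credentials)."""
    # Imported lazily so bump_record_count can be used without awswrangler.
    import awswrangler as wr

    # Assumed call: read the parameters currently stored on the table.
    current = wr.catalog.get_table_parameters(database=database, table=table)
    # Only the recomputed key is sent; an upsert should leave other keys alone.
    return wr.catalog.upsert_table_parameters(
        parameters=bump_record_count(current, new_rows),
        database=database,
        table=table,
    )
```

Keeping the calculation in a pure function makes it easy to check without touching the catalog.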

Regarding wr.s3.to_parquet(mode="append"), we also included a wr.catalog.upsert_table_parameters call under the hood, so new parameters passed to the function will be upserted in the Catalog automatically.
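
For the looping workflow described in the issue, this could look roughly like the sketch below. The helper names and the records_count / nan_count keys are hypothetical, and the totals only cover the files processed in the current run, not rows already in the dataset:

```python
from typing import Dict, Iterable


def dataset_tags(total_rows: int, total_nan: int) -> Dict[str, str]:
    """Serialise running statistics as Glue table parameters (string values)."""
    return {"records_count": str(total_rows), "nan_count": str(total_nan)}


def export_all(csv_paths: Iterable[str], path: str, database: str, table: str) -> None:
    """Append each CSV to the dataset, passing refreshed parameters on every call."""
    # Lazy imports: this function needs pandas, awswrangler, and AWS credentials.
    import awswrangler as wr
    import pandas as pd

    total_rows = 0
    total_nan = 0
    for csv_path in csv_paths:
        df = pd.read_csv(csv_path)
        total_rows += len(df)
        total_nan += int(df.isna().sum().sum())
        # With mode="append", the parameters passed here are expected to be
        # upserted into the existing Glue table, per the comment above.
        wr.s3.to_parquet(
            df=df,
            path=path,
            index=False,
            dataset=True,
            mode="append",
            database=database,
            table=table,
            parameters=dataset_tags(total_rows, total_nan),
        )
```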

@igorborgest igorborgest removed the WIP Work in progress label May 8, 2020