-
Notifications
You must be signed in to change notification settings - Fork 722
Description
Is your idea related to a problem? Please describe.
In order to speed up the performance of a data flow that I am building, I am creating persisted tables with chosen partitions and buckets to optimize the join on a defined set of columns. AWS Data Wrangler does not support this use case to my understanding, so in the current implementation I am defining the CTAS statement manually and run it through pyathena.
After running the flow, I iterate through the list of created tables to drop them both from Athena/Glue and S3, since they are only created to speed up the flow in the spirit of temp tables.
Describe the solution you'd like
It would be great to have a function focused on creating a table as CTAS, that is focused on creating the table, without sending back the data as awswrangler.athena.read_sql_query does it. Parameters could be very similar to the parameters of awswrangler.athena.read_sql_query (e.g. sql, database, ctas_*) without those that are specific on how the data is send back to Python, since this would not be necessary. As a return argument, it should return the database and name of the table.
In addition, it would be extremely helpful if Data Wrangler could support the cleanup once the dataflow is finished. In that case the implementation is less obvious, since my impression was that usually Data Wrangler does not instantiate. One possible solution could be having the function in the prior paragraph return a kind of object that could be handed over as a parameter in a cleanup function, ideally as a list to manage the cleanup of all temp tables at once.