PySpark schema registration in Glue #29

Closed
radcheb opened this issue Sep 20, 2019 · 4 comments
radcheb commented Sep 20, 2019

With Pandas, it's possible to write a DataFrame to S3 as Parquet and have it registered in Glue so it can be queried from Athena.
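
For context, the Pandas flow I'm referring to looks roughly like this (a minimal sketch; the `pandas.to_parquet` argument names are from my reading of the current awswrangler API, and the bucket/database names are placeholders):

import awswrangler
import pandas as pd

# Plain Pandas DataFrame (simple column types only)
df = pd.DataFrame({
    "year": [2019, 2019],
    "month": [9, 9],
    "value": [1.0, 2.0],
})

session = awswrangler.Session()
# Writes Parquet files to S3 and registers the table/partitions in Glue,
# so the data is immediately queryable from Athena.
session.pandas.to_parquet(
    dataframe=df,
    database="my_database",              # placeholder Glue database
    path="s3://my-bucket/my-table/",     # placeholder S3 path
    partition_cols=["year", "month"],
)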

It would be nice to support the same feature for PySpark.
PySpark provides richer types (Array, Map, ...) that are mostly supported in Athena and Glue but not in Pandas, so converting a PySpark DataFrame to Pandas and writing it with awswrangler corrupts the nested types (arrays, ...) and does not work.

It would be nice to be able to register a PySpark DataFrame schema in Glue directly, and moreover to register the partitions of a partitioned PySpark DataFrame as Glue partitions.
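
To illustrate the kind of schema I'd like to register directly (a small sketch with made-up column names; round-tripping such a frame through Pandas just yields untyped `object` columns, so the array/map typing is lost):

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, ArrayType, MapType)

spark = SparkSession.builder.getOrCreate()

# Nested types that Athena/Glue can represent but Pandas cannot
schema = StructType([
    StructField("id", StringType()),
    StructField("tags", ArrayType(StringType())),                    # array<string>
    StructField("counters", MapType(StringType(), IntegerType())),   # map<string,int>
])
df = spark.createDataFrame([("a", ["x", "y"], {"clicks": 3})], schema=schema)

df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- tags: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- counters: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: integer (valueContainsNull = true)

# Converting to Pandas degrades the nested columns to generic "object" dtype
print(df.toPandas().dtypes)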

@igorborgest igorborgest self-assigned this Sep 20, 2019
@igorborgest igorborgest added the enhancement New feature or request label Sep 20, 2019
@igorborgest
Contributor

Hi @radcheb,

Makes sense, it would be nice to have the ability to register Athena tables based on the PySpark DataFrame metadata. We will implement that!

@igorborgest
Contributor

Hi @radcheb,

Done! It will be in version 0.0.4, which will be released later today.

Issue #29

Register a Glue table from a DataFrame stored on S3

# Writing with PySpark
dataframe.write \
        .mode("overwrite") \
        .format("parquet") \
        .partitionBy(["year", "month"]) \
        .save(compression="gzip", path="s3://...")
import awswrangler

session = awswrangler.Session(spark_session=spark)
# Registering the table in the Glue catalog
session.spark.create_glue_table(dataframe=dataframe,
                                file_format="parquet",
                                partition_by=["year", "month"],
                                path="s3://...",
                                compression="gzip",
                                database="my_database")

Load partitions on Athena/Glue table (repair table)

session = awswrangler.Session()
session.athena.repair_table(database="db_name", table="tbl_name")

@radcheb
Author

radcheb commented Sep 21, 2019

Hi @igorborgest,
Great, I'm eager to test the new version.

@igorborgest
Contributor

@radcheb, closing this issue. Please open another one if you find a bug or an improvement opportunity.

Thank you!
