PySpark schema registration in Glue #29

Closed
radcheb opened this issue Sep 20, 2019 · 4 comments
radcheb commented Sep 20, 2019

With Pandas, it's possible to write a DataFrame to S3 as Parquet and have it registered in Glue so it can be queried from Athena.
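
For context, the Pandas flow I'm referring to looks roughly like this (a minimal sketch; the `pandas.to_parquet` argument names are from my reading of the current awswrangler API, and the bucket/database names are placeholders):

import awswrangler
import pandas as pd

# Plain Pandas DataFrame (simple column types only)
df = pd.DataFrame({
    "year": [2019, 2019],
    "month": [9, 9],
    "value": [1.0, 2.0],
})

session = awswrangler.Session()
# Writes Parquet files to S3 and registers the table/partitions in Glue,
# so the data is immediately queryable from Athena.
session.pandas.to_parquet(
    dataframe=df,
    database="my_database",              # placeholder Glue database
    path="s3://my-bucket/my-table/",     # placeholder S3 path
    partition_cols=["year", "month"],
)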

It would be nice to support the same feature for PySpark.
PySpark provides richer types (Array, Map, ...) that are mostly supported in Athena and Glue but not in Pandas, so converting a PySpark DataFrame to Pandas and writing it with awswrangler corrupts the nested types (arrays, ...) and does not work.

It would be nice to be able to register a PySpark DataFrame schema in Glue directly, and moreover to register the partitions of a partitioned PySpark DataFrame as Glue partitions.
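
To illustrate the kind of schema I'd like to register directly (a small sketch with made-up column names; round-tripping such a frame through Pandas just yields untyped `object` columns, so the array/map typing is lost):

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, ArrayType, MapType)

spark = SparkSession.builder.getOrCreate()

# Nested types that Athena/Glue can represent but Pandas cannot
schema = StructType([
    StructField("id", StringType()),
    StructField("tags", ArrayType(StringType())),                    # array<string>
    StructField("counters", MapType(StringType(), IntegerType())),   # map<string,int>
])
df = spark.createDataFrame([("a", ["x", "y"], {"clicks": 3})], schema=schema)

df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- tags: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- counters: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: integer (valueContainsNull = true)

# Converting to Pandas degrades the nested columns to generic "object" dtype
print(df.toPandas().dtypes)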

@igorborgest igorborgest self-assigned this Sep 20, 2019
@igorborgest igorborgest added the enhancement New feature or request label Sep 20, 2019
@igorborgest
Contributor

Hi @radcheb,

Makes sense, it would be nice to have the ability to register Athena tables based on the PySpark DataFrame metadata. We will implement that!

@igorborgest
Contributor

Hi @radcheb,

Done! It will be in version 0.0.4, which will be released later today.

Issue #29

Register a Glue table from a DataFrame stored on S3

# Writing with PySpark
dataframe.write \
        .mode("overwrite") \
        .format("parquet") \
        .partitionBy(["year", "month"]) \
        .save(compression="gzip", path="s3://...")
import awswrangler

session = awswrangler.Session(spark_session=spark)
# Registering the table in the Glue catalog
session.spark.create_glue_table(dataframe=dataframe,
                                file_format="parquet",
                                partition_by=["year", "month"],
                                path="s3://...",
                                compression="gzip",
                                database="my_database")

Load partitions on Athena/Glue table (repair table)

session = awswrangler.Session()
session.athena.repair_table(database="db_name", table="tbl_name")

@radcheb
Author

radcheb commented Sep 21, 2019

Hi @igorborgest,
Great, I'm eager to test the new version.

@igorborgest
Contributor

@radcheb, closing this issue. Please open another one if you find a bug or an improvement opportunity.

Thank you!
