PySpark schema registration in Glue #29
Labels
enhancement
New feature or request
Comments
Hi @radcheb, that makes sense. It would be nice to have the ability to register Athena tables based on PySpark DataFrame metadata. We will implement it!
Hi @radcheb, done! It will be in version 0.0.4, which will be released later today.

Register a Glue table from a DataFrame stored on S3:

```python
# Writing with PySpark
dataframe.write \
    .mode("overwrite") \
    .format("parquet") \
    .partitionBy(["year", "month"]) \
    .save(compression="gzip", path="s3://...")

# Registering
session = awswrangler.Session(spark_session=spark)
session.spark.create_glue_table(dataframe=dataframe,
                                file_format="parquet",
                                partition_by=["year", "month"],
                                path="s3://...",
                                compression="gzip",
                                database="my_database")
```

Load partitions on the Athena/Glue table (repair table):

```python
session = awswrangler.Session()
session.athena.repair_table(database="db_name", table="tbl_name")
```
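For context, registering a Spark DataFrame in Glue essentially amounts to translating each Spark column type into its Athena/Glue DDL equivalent. A minimal sketch of such a mapping follows; the function and table names here are hypothetical illustrations, not awswrangler's actual implementation:

```python
# Illustrative sketch only: map Spark simpleString type names to Athena/Glue
# DDL type names. Not awswrangler's real code.
_SPARK_TO_ATHENA = {
    "string": "string",
    "int": "int",
    "integer": "int",
    "long": "bigint",
    "bigint": "bigint",
    "float": "float",
    "double": "double",
    "boolean": "boolean",
    "timestamp": "timestamp",
    "date": "date",
}

def spark_type_to_athena(simple_string: str) -> str:
    """Translate a Spark simpleString type into an Athena DDL type.

    Nested types share the same angle-bracket syntax in Spark SQL and
    Athena; this sketch recurses into array element types and passes
    map/struct types through unchanged.
    """
    s = simple_string.strip().lower()
    if s.startswith("array<") and s.endswith(">"):
        inner = s[len("array<"):-1]
        return f"array<{spark_type_to_athena(inner)}>"
    if (s.startswith("map<") or s.startswith("struct<")) and s.endswith(">"):
        return s  # keep as-is in this sketch
    return _SPARK_TO_ATHENA.get(s, s)
```

For example, `spark_type_to_athena("array<long>")` returns `"array<bigint>"`, since Spark's `long` is Athena's `bigint`.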
Hi @igorborgest
@radcheb, closing this issue. Please open another if you find any bug or improvement opportunity. Thank you!
With Pandas, it's possible to write a Parquet DataFrame to S3 and have it registered in Glue so it can be queried by Athena.
It would be nice to support the same feature for PySpark.
PySpark provides richer types (Array, Map, ...) which are mostly supported in Athena and Glue but not in Pandas, so converting a PySpark DataFrame to Pandas and writing it with awswrangler corrupts the nested types (arrays, ...) and will not work.
It would be nice to be able to register a PySpark DataFrame schema in Glue, and moreover to register the partitions of a partitioned PySpark DataFrame as Glue partitions.