Description
I'm trying to create/update a table in the Glue catalog with the following snippet:
# ...some code fetching csv files from a bucket...
for file in valid_files:
    df = wr.pandas.read_csv(path=file)
    # This is needed: if a column has a trailing "\r", the call to metadata_to_glue crashes
    df.columns = [c.strip() for c in df.columns]
    # Call without serde in extra_args
    wr.glue.metadata_to_glue(
        df,
        BUCKET_SCAN + SUB_PATH,
        valid_files,
        "csv",
        database=DATABASE,
        table="my_table_20200129",
        # extra_args={"serde": "LazySimpleSerDe"},
        preserve_index=False,
    )
I get the following error:
Traceback (most recent call last):
  File "qof_scripts/crawler.py", line 72, in <module>
    preserve_index=False,
  File "/Users/jpfrancoia/.local/share/virtualenvs/test_aws_lake-KkaPCkQ0/lib/python3.7/site-packages/awswrangler/glue.py", line 114, in metadata_to_glue
    columns_comments=columns_comments)
  File "/Users/jpfrancoia/.local/share/virtualenvs/test_aws_lake-KkaPCkQ0/lib/python3.7/site-packages/awswrangler/glue.py", line 182, in create_table
    extra_args=extra_args)
  File "/Users/jpfrancoia/.local/share/virtualenvs/test_aws_lake-KkaPCkQ0/lib/python3.7/site-packages/awswrangler/glue.py", line 313, in csv_table_definition
    raise InvalidSerDe(f"{serde} in not in the valid SerDe list.")
awswrangler.exceptions.InvalidSerDe: None in not in the valid SerDe list
I tracked the issue down to this line: https://github.com/awslabs/aws-data-wrangler/blob/d50b214274583eb6dd2cbc1c6c54c60f9f87035c/awswrangler/glue.py#L295
Basically, the serde is taken from the extra_args parameter:
    serde = extra_args.get("serde")
But serde is None if it isn't provided in the extra_args dict, and the rest of the function raises InvalidSerDe unless serde is OpenCSVSerDe or LazySimpleSerDe: https://github.com/awslabs/aws-data-wrangler/blob/d50b214274583eb6dd2cbc1c6c54c60f9f87035c/awswrangler/glue.py#L313
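The failure mode can be reproduced without touching AWS. The sketch below is not the library's code: pick_serde and the local InvalidSerDe class are hypothetical stand-ins that mimic only the lookup and the validation described above.

```python
class InvalidSerDe(Exception):
    """Local stand-in for awswrangler.exceptions.InvalidSerDe."""


# The two SerDe names mentioned in this issue.
VALID_SERDES = ("OpenCSVSerDe", "LazySimpleSerDe")


def pick_serde(extra_args: dict) -> str:
    """Hypothetical helper: look up the serde, then validate it."""
    # dict.get returns None when the key is missing.
    serde = extra_args.get("serde")
    if serde not in VALID_SERDES:
        # Same message text as in the traceback above.
        raise InvalidSerDe(f"{serde} in not in the valid SerDe list.")
    return serde


if __name__ == "__main__":
    try:
        pick_serde({})  # no "serde" key, as in the failing call
    except InvalidSerDe as exc:
        print(f"InvalidSerDe: {exc}")
```

Passing {"serde": "LazySimpleSerDe"} to the same helper returns normally, which matches the commented-out extra_args line in the snippet at the top.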
I think this is a bug. In the current setting, the extra_args parameter is an Optional[dict], but the method csv_table_definition can't run without the serde being set.
This can be solved by defaulting to a serde if serde isn't provided. I'll make a PR.
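A minimal sketch of what such a default could look like, under stated assumptions: resolve_serde and DEFAULT_SERDE are hypothetical names, the choice of OpenCSVSerDe as the default is arbitrary (nothing above says which one should be used), and InvalidSerDe is a local stand-in for the library's exception.

```python
from typing import Optional

# The two SerDe names mentioned in this issue.
VALID_SERDES = ("OpenCSVSerDe", "LazySimpleSerDe")
# Assumption: the default is picked arbitrarily for this sketch.
DEFAULT_SERDE = "OpenCSVSerDe"


class InvalidSerDe(Exception):
    """Local stand-in for awswrangler.exceptions.InvalidSerDe."""


def resolve_serde(extra_args: Optional[dict] = None) -> str:
    """Return the requested serde, or DEFAULT_SERDE when none is given."""
    # Tolerate extra_args=None as well as a dict without a "serde" key.
    serde = (extra_args or {}).get("serde")
    if serde is None:
        serde = DEFAULT_SERDE
    # An explicitly requested but unknown serde is still rejected.
    if serde not in VALID_SERDES:
        raise InvalidSerDe(f"{serde} is not in the valid SerDe list.")
    return serde


if __name__ == "__main__":
    print(resolve_serde())                              # falls back to the default
    print(resolve_serde({"serde": "LazySimpleSerDe"}))  # explicit choice is kept
```

Falling back only when the value is missing keeps the validation meaningful: a typo in the serde name still raises instead of being silently replaced by the default.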