Pyarrow schema registration in Glue #32
Comments
to support By the way, it looks like it does not support
from pyarrow import ListType  # needed for the isinstance check below

def _pyarrow2athena(glue, ftype):
    # All-null columns have Arrow type 'null'; fall back to string
    if str(ftype) == 'null':
        return 'string'
    # Lists map recursively to Athena array<...>
    if isinstance(ftype, ListType):
        return f'array<{_pyarrow2athena(glue, ftype.value_type)}>'
    return glue.type_pyarrow2athena(str(ftype))

schema = []
partition_cols_schema = []
for field in pyarrow_schema:
    name = field.name
    # field.type list is not supported by glue.type_pyarrow2athena
    athena_type = _pyarrow2athena(glue, field.type)
    if partition_cols is None or name not in partition_cols:
        schema.append((name, athena_type))
    else:
        partition_cols_schema.append((name, athena_type))
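The recursive mapping above can also be sketched without any dependency on the library, working on Arrow type strings only. This is a minimal illustration, not the project's implementation: the `PRIMITIVE_TYPES` table and the names `arrow_str_to_athena` and `split_schema` are hypothetical, and it assumes list types are printed as `list<item: T>`.

```python
# Hypothetical lookup table: Arrow type string -> Athena type name.
# The project's real mapping lives in glue.type_pyarrow2athena and may differ.
PRIMITIVE_TYPES = {
    "bool": "boolean",
    "int32": "int",
    "int64": "bigint",
    "float": "float",
    "double": "double",
    "string": "string",
    "null": "string",  # all-null columns fall back to string, as in the snippet above
}


def arrow_str_to_athena(type_str):
    """Convert an Arrow type string such as 'list<item: int64>' to an Athena type."""
    type_str = type_str.strip()
    if type_str.startswith("list<") and type_str.endswith(">"):
        # Drop the 'list<' prefix and '>' suffix, then the child field name ('item: ').
        inner = type_str[len("list<"):-1]
        value_type = inner.partition(": ")[2] or inner
        return f"array<{arrow_str_to_athena(value_type)}>"
    if type_str not in PRIMITIVE_TYPES:
        raise ValueError(f"Unsupported Arrow type: {type_str}")
    return PRIMITIVE_TYPES[type_str]


def split_schema(fields, partition_cols=None):
    """Split (name, arrow_type_str) pairs into regular and partition columns."""
    partition_cols = partition_cols or []
    schema, partition_schema = [], []
    for name, type_str in fields:
        target = partition_schema if name in partition_cols else schema
        target.append((name, arrow_str_to_athena(type_str)))
    return schema, partition_schema


fields = [
    ("id", "int64"),
    ("tags", "list<item: string>"),
    ("empty", "null"),
    ("year", "int32"),
]
schema, partition_schema = split_schema(fields, partition_cols=["year"])
print(schema)
print(partition_schema)
```

Raising on unknown types rather than silently returning string is a design choice of this sketch; it makes unsupported nested types visible instead of registering them incorrectly.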
Hey @nicolasdaviaud, thank you, great points. 1. List support: 2. Pyarrow integration: |
Thanks for looking into it. Did you have a chance to look into empty columns (cf my point above)? They can end up as |
We don't have CI yet, but you could run my branch with: https://github.com/awslabs/aws-data-wrangler.git
cd aws-data-wrangler
python3.6 -m venv venv
pip install -e .
...
About the null columns:
I'm leaning toward the second option. What do you think? |
If I understand correctly, the second option allows you to pass a schema override. It could have been an easy way to squeeze in the |
PR #33 has been updated to allow casting of data types passed as arguments... Unfortunately, Pyarrow has no type aliases for nested types. Ref: https://github.com/apache/arrow/blob/apache-arrow-0.14.1/python/pyarrow/types.pxi#L1684 Maybe in the future we could prioritize a change to accept the data type objects themselves instead of the aliases. But for now, I think this is enough to close the issue. Thanks! |
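The idea of a per-column cast override could look roughly like the following. This is only a sketch of the concept discussed here, not the API added in PR #33: the function name `apply_casts`, the `cast_columns` argument, and the use of plain type-name strings are all assumptions made for illustration.

```python
def apply_casts(inferred_schema, cast_columns=None):
    """Return a copy of the schema with user-supplied type overrides applied.

    inferred_schema: list of (column_name, type_name) pairs.
    cast_columns: optional dict mapping column_name -> type_name (hypothetical argument).
    """
    cast_columns = dict(cast_columns or {})
    known = {name for name, _ in inferred_schema}
    unknown = set(cast_columns) - known
    if unknown:
        # Fail early on typos instead of silently ignoring an override.
        raise KeyError(f"Cannot cast unknown columns: {sorted(unknown)}")
    return [(name, cast_columns.get(name, dtype)) for name, dtype in inferred_schema]


# An all-null column was inferred as string; the caller knows it should be double.
inferred = [("id", "bigint"), ("empty", "string")]
final_schema = apply_casts(inferred, {"empty": "double"})
print(final_schema)
```

Because the override is keyed by column name and uses flat type names, it covers the empty-column case but, as noted above, not nested types, for which no alias exists.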
thanks! |
Issue is similar to #29 but for Pyarrow.
Pyarrow supports richer types than pandas, in our case ListArray, which translates to array<int> in Glue.
The current implementation requires going through pandas, which stores it in an object column, which then gets added as string to the schema.
Looking at the code, it looks like we reconstruct the Pyarrow schema anyway, and it might be as simple as exposing this entry point alongside the pandas one.