Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pyarrow schema registration in Glue #32

Closed
nicolasdaviaud opened this issue Sep 25, 2019 · 7 comments
Closed

Pyarrow schema registration in Glue #32

nicolasdaviaud opened this issue Sep 25, 2019 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@nicolasdaviaud
Copy link

Issue is similar to #29 but for Pyarrow.
Pyarrow supports richer types than pandas, in our case ListArray, which translates to array<int> in Glue.
The current implementation requires to go through pandas, which stores it in an object column which then gets added as string to the schema.
Looking at the code, it looks like we reconstruct the Pyarrow schema anyway, and it might be as simple as expose this entry point as well as pandas.

@nicolasdaviaud
Copy link
Author

to support ListType in Pyarrow schemas, it might be as simple as replacing lines 286 to 293 in awswrangler.glue by the following snippet.

By the way, it looks like it does not support DataType(null) either (which can happen if the whole column is empty). We decided to cast it to string but it is arbitrary

def _pyarrow2athena(glue, ftype):
    if str(ftype) == 'null':
        return 'string'
    if isinstance(ftype, ListType):
        return f'array<{_pyarrow2athena(glue, ftype.value_type)}>'
    return glue.type_pyarrow2athena(str(ftype))

    schema = []
    partition_cols_schema = []
    for field in pyarrow_schema:
        name = field.name
        # field.type list is not supported by glue.type_pyarrow2athena
        athena_type = _pyarrow2athena(glue, field.type)
        if partition_cols is None or name not in partition_cols:
            schema.append((name, athena_type))
        else:
            partition_cols_schema.append((name, athena_type))

@igorborgest igorborgest self-assigned this Sep 25, 2019
@igorborgest igorborgest added the enhancement New feature or request label Sep 25, 2019
@igorborgest
Copy link
Contributor

Hey @nicolasdaviaud,

Thank you, great points arrived.
I think that we can split this issue in two different improvements:

1. List support:
Here we can solve the issue surgically, with some mechanism similar to the code snippet sent by you. I'm already working on that and I think this is enough to close this thread by now.

2. Pyarrow integration:
You've brought two good arguments: Superior data types and an existing code base that would be reused.
But is this enough to start a new exposed interface for AWS Data Wrangler?
Are the community really eager to use Pyarrow Table in their ETLs directly?
I just feel that we should resolve your issue with the point 1 and than focus on Pandas and Spark interfaces while we still have a lot of work to do on both.

@nicolasdaviaud
Copy link
Author

Thanks for looking into it.
I guess we could live with the first point (tbc). Is there an easy way for me to run against your branch? I couldn't see any CI links attached to the PR.

Did you have a chance to look into empty columns (cf my point above)? They can end up as DataType(null) which I don't think your PR supports.

@igorborgest
Copy link
Contributor

@nicolasdaviaud

We don't have CI yet. But you could run with my branch with:

https://github.com/awslabs/aws-data-wrangler.git
cd aws-data-wrangler
python3.6 -m venv venv
pip install -e .
...

About the null columns:
This is really tough, independent of the technology, we can't infer type from a entire null column. You will have the same problem with a Pyarrow interface where you will need to explicit say the type of the column before instantiate your Pyarrow's Table.
Trying to overcome that we have two different approaches by now:

  1. Consider nulls as strings
    The simplest one, and can works great for "one shot" table creation.
    But could be a silent bomb for incremental loads of data. Imagine a ETL that runs daily and starts with a null column and at some time the non-null values start to appear... Probably we will have two types of data mixed in the same column causing troubles.

  2. Receive the column type as argument for an specific column:
    Certainly this is less convenient for the user, but is a more explicit approach and guarantee that no data type will be guess.
    We already are doing something similar for the Int64 data types not yet supported on Pyarrow: https://github.com/awslabs/aws-data-wrangler/blob/master/testing/test_awswrangler/test_pandas.py#L246

I'm pretty tending for the second option, what do you think?

@nicolasdaviaud
Copy link
Author

If I understand it correctly, second option allows you to pass a schema override. If could have been an easy way to squeeze in the list<int> and could prove useful again a in the future, I'm for it.
You might only want to check that the types are Athena-valid (if it is possible, given deep recursive types such as struct and arrays). At least checking it's all lower case

@igorborgest
Copy link
Contributor

PR #33 updated to allow casting of data types as arguments... Unfortunately Pyarrow hasn't type alias for nested types.

Ref: https://github.com/apache/arrow/blob/apache-arrow-0.14.1/python/pyarrow/types.pxi#L1684

Maybe in the future we could prioritize the change to accept the data type objects itself instead of the alias. But by now, I think that it is enough to close this issue.

Thank!

@nicolasdaviaud
Copy link
Author

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants