Pyarrow schema registration in Glue #32

nicolasdaviaud · 2019-09-25T09:12:16Z

Issue is similar to #29 but for Pyarrow.
Pyarrow supports richer types than pandas, in our case ListArray, which translates to array<int> in Glue.
The current implementation requires to go through pandas, which stores it in an object column which then gets added as string to the schema.
Looking at the code, it looks like we reconstruct the Pyarrow schema anyway, and it might be as simple as expose this entry point as well as pandas.

The text was updated successfully, but these errors were encountered:

nicolasdaviaud · 2019-09-25T10:36:34Z

to support ListType in Pyarrow schemas, it might be as simple as replacing lines 286 to 293 in awswrangler.glue by the following snippet.

By the way, it looks like it does not support DataType(null) either (which can happen if the whole column is empty). We decided to cast it to string but it is arbitrary

def _pyarrow2athena(glue, ftype):
    if str(ftype) == 'null':
        return 'string'
    if isinstance(ftype, ListType):
        return f'array<{_pyarrow2athena(glue, ftype.value_type)}>'
    return glue.type_pyarrow2athena(str(ftype))

    schema = []
    partition_cols_schema = []
    for field in pyarrow_schema:
        name = field.name
        # field.type list is not supported by glue.type_pyarrow2athena
        athena_type = _pyarrow2athena(glue, field.type)
        if partition_cols is None or name not in partition_cols:
            schema.append((name, athena_type))
        else:
            partition_cols_schema.append((name, athena_type))

igorborgest · 2019-09-25T20:57:41Z

Hey @nicolasdaviaud,

Thank you, great points arrived.
I think that we can split this issue in two different improvements:

1. List support:
Here we can solve the issue surgically, with some mechanism similar to the code snippet sent by you. I'm already working on that and I think this is enough to close this thread by now.

2. Pyarrow integration:
You've brought two good arguments: Superior data types and an existing code base that would be reused.
But is this enough to start a new exposed interface for AWS Data Wrangler?
Are the community really eager to use Pyarrow Table in their ETLs directly?
I just feel that we should resolve your issue with the point 1 and than focus on Pandas and Spark interfaces while we still have a lot of work to do on both.

nicolasdaviaud · 2019-09-26T08:33:04Z

Thanks for looking into it.
I guess we could live with the first point (tbc). Is there an easy way for me to run against your branch? I couldn't see any CI links attached to the PR.

Did you have a chance to look into empty columns (cf my point above)? They can end up as DataType(null) which I don't think your PR supports.

igorborgest · 2019-09-26T10:35:55Z

@nicolasdaviaud

We don't have CI yet. But you could run with my branch with:

https://github.com/awslabs/aws-data-wrangler.git
cd aws-data-wrangler
python3.6 -m venv venv
pip install -e .
...

About the null columns:
This is really tough, independent of the technology, we can't infer type from a entire null column. You will have the same problem with a Pyarrow interface where you will need to explicit say the type of the column before instantiate your Pyarrow's Table.
Trying to overcome that we have two different approaches by now:

Consider nulls as strings
The simplest one, and can works great for "one shot" table creation.
But could be a silent bomb for incremental loads of data. Imagine a ETL that runs daily and starts with a null column and at some time the non-null values start to appear... Probably we will have two types of data mixed in the same column causing troubles.
Receive the column type as argument for an specific column:
Certainly this is less convenient for the user, but is a more explicit approach and guarantee that no data type will be guess.
We already are doing something similar for the Int64 data types not yet supported on Pyarrow: https://github.com/awslabs/aws-data-wrangler/blob/master/testing/test_awswrangler/test_pandas.py#L246

I'm pretty tending for the second option, what do you think?

nicolasdaviaud · 2019-09-26T12:03:53Z

If I understand it correctly, second option allows you to pass a schema override. If could have been an easy way to squeeze in the list<int> and could prove useful again a in the future, I'm for it.
You might only want to check that the types are Athena-valid (if it is possible, given deep recursive types such as struct and arrays). At least checking it's all lower case

igorborgest · 2019-09-27T00:28:22Z

PR #33 updated to allow casting of data types as arguments... Unfortunately Pyarrow hasn't type alias for nested types.

Ref: https://github.com/apache/arrow/blob/apache-arrow-0.14.1/python/pyarrow/types.pxi#L1684

Maybe in the future we could prioritize the change to accept the data type objects itself instead of the alias. But by now, I think that it is enough to close this issue.

Thank!

nicolasdaviaud · 2019-09-27T04:14:40Z

thanks!

igorborgest self-assigned this Sep 25, 2019

igorborgest added the enhancement New feature or request label Sep 25, 2019

igorborgest mentioned this issue Sep 25, 2019

Adding support for lists to Pandas #33

Merged

igorborgest closed this as completed Sep 27, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pyarrow schema registration in Glue #32

Pyarrow schema registration in Glue #32

nicolasdaviaud commented Sep 25, 2019

nicolasdaviaud commented Sep 25, 2019

igorborgest commented Sep 25, 2019

nicolasdaviaud commented Sep 26, 2019

igorborgest commented Sep 26, 2019

nicolasdaviaud commented Sep 26, 2019

igorborgest commented Sep 27, 2019

nicolasdaviaud commented Sep 27, 2019

Pyarrow schema registration in Glue #32

Pyarrow schema registration in Glue #32

Comments

nicolasdaviaud commented Sep 25, 2019

nicolasdaviaud commented Sep 25, 2019

igorborgest commented Sep 25, 2019

nicolasdaviaud commented Sep 26, 2019

igorborgest commented Sep 26, 2019

nicolasdaviaud commented Sep 26, 2019

igorborgest commented Sep 27, 2019

nicolasdaviaud commented Sep 27, 2019