Add basic arrow schema conversion #194

Closed
wants to merge 2 commits

Conversation

danielcweeks
Contributor

There are a number of things that still need to be addressed; we can create follow-up issues for them:

  • Dictionaries are part of the Arrow Schema definition, but they should be built from the parquet dictionaries. This probably means the conversion needs to happen while processing the row group metadata so that the parquet dictionaries can be captured.

  • This converts maps to list<struct<key, value>>, but that is probably not ideal for a columnar Spark representation (or for other engines); see the sketch after this list.
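
For reference, here is a minimal sketch of the list<struct<key, value>> shape this conversion produces for a map<string, int> column. It is not code from this PR; it only uses the Arrow pojo classes, and the class and field names are made up for illustration:

import java.util.Arrays;
import java.util.Collections;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class MapAsListSketch {
  // Hypothetical helper: builds the Arrow field for a map<string, int> column
  // represented as list<struct<key, value>>, the shape described above.
  static Field mapAsList(String name) {
    Field key = new Field("key",
        new FieldType(false, new ArrowType.Utf8(), null), Collections.emptyList());
    Field value = new Field("value",
        FieldType.nullable(new ArrowType.Int(32, true)), Collections.emptyList());
    Field entries = new Field("entries",
        new FieldType(false, new ArrowType.Struct(), null), Arrays.asList(key, value));
    return new Field(name,
        FieldType.nullable(new ArrowType.List()), Collections.singletonList(entries));
  }
}

This only sketches the schema-level Field; whether engines would prefer a different map layout is the open question above.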

import org.apache.iceberg.types.Types.TimestampType;
import org.junit.Test;

import static org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID.Bool;
Contributor

For any new module we make, we should probably start with Baseline. Here, Baseline will complain about the static imports, but it might make sense to add exceptions for them in checkstyle.xml.

Contributor

I'd be more comfortable importing org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID and then referring to ArrowTypeID.Date, etc.
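
For illustration, the suggestion amounts to something like the following hypothetical test snippet (not from this PR), importing the enclosing ArrowTypeID enum once and qualifying each constant through it:

import static org.junit.Assert.assertEquals;

import java.util.Collections;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.junit.Test;

public class TestArrowTypeIdImports {
  @Test
  public void testBoolTypeId() {
    Field flag = new Field("flag",
        FieldType.nullable(new ArrowType.Bool()), Collections.emptyList());
    // qualify through the enum rather than statically importing Bool
    assertEquals(ArrowTypeID.Bool, flag.getType().getTypeID());
  }
}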

Contributor Author

This doesn't seem to be currently enforced. Was there a change to the checkstyle to relax this convention or do you still prefer the proposed import changes?

@mccheah
Contributor

mccheah commented Jun 8, 2019

The build is failing. Is this ready for review as a standalone item, or did you want to do more here? What's next for this feature moving forward?

@xhochy
Member

xhochy commented Jun 10, 2019

Dictionaries are part of the Arrow Schema definition, but should be built based off the parquet dictionaries.

@danielcweeks This actually changed in Arrow master as it led to a lot of problems. With the changes in Arrow master, reading row groups with Dictionaries should be much simpler.

@danielcweeks
Contributor Author

@mccheah We're going to start investing in building out the arrow read path in a couple of weeks, so I'll try to address the checkstyle issues and we'll build on this from there.

@mccheah
Contributor

mccheah commented Jul 3, 2019

Ping on this - are we continuing to make progress here?

@danielcweeks
Contributor Author

@mccheah thanks for the ping. I've rebased and addressed the checkstyle issues.

We are actively working toward an arrow read path, so even though this isn't used yet, it would probably help to have it reviewed and included to keep the commits smaller.

@rdblue
Contributor

rdblue commented Jul 5, 2019

I'm not sure that it's a good idea to add Arrow to iceberg-core until there is more functionality there. Could we keep this as a PR until we know more about what the Arrow integration will look like? It would make sense as an add-on module like iceberg-parquet or iceberg-orc.

@danielcweeks
Contributor Author

@rdblue Keeping this as a PR is fair, and it does seem that arrow would make more sense as a separate module.

@mccheah
Contributor

mccheah commented Jul 8, 2019

Is there a world where we can keep this feature behind a feature flag so that development can proceed forward with reasonably sized diffs and merges?

@rdblue
Contributor

rdblue commented Jul 8, 2019

@mccheah, we do intend to move forward with reasonably sized merges. For this one, I'm not sure it is worth updating because we don't really know yet what the module structure for Arrow will be, and that will depend on the implementation. We can definitely update this to be in its own iceberg-arrow module and add the Arrow dependency to core, but since this PR is mostly to demonstrate the conversion, I wasn't sure that was worth spending time on.

@mccheah
Contributor

mccheah commented Jul 8, 2019

I think it would be appropriate to clarify what the plan is for integrating Arrow with Iceberg and the data format integrations. I'll follow up on #9.

@rdblue
Contributor

rdblue commented Jul 31, 2019

I'm going to close this because it is in the vectorized read branch and will be merged with that work. For status, there is a vectorized read milestone.

rdblue closed this Jul 31, 2019