Add basic arrow schema conversion #194

Closed
wants to merge 2 commits

Conversation

danielcweeks
Contributor

There are a number of things that still need to be addressed; we can create follow-up issues for them:

  • Dictionaries are part of the Arrow Schema definition, but they should be built from the parquet dictionaries. This probably means the conversion needs to happen while processing the row group metadata so that the parquet dictionaries can be captured.

  • This converts maps to list<struct<key, value>>, but that is probably not ideal for a columnar Spark representation (or for other engines); see the sketch after this list.
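
For reference, here is a minimal sketch of the list<struct<key, value>> shape this conversion produces for a map<string, int> column. It is not code from this PR; it only uses the Arrow pojo classes, and the class and field names are made up for illustration:

import java.util.Arrays;
import java.util.Collections;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class MapAsListSketch {
  // Hypothetical helper: builds the Arrow field for a map<string, int> column
  // represented as list<struct<key, value>>, the shape described above.
  static Field mapAsList(String name) {
    Field key = new Field("key",
        new FieldType(false, new ArrowType.Utf8(), null), Collections.emptyList());
    Field value = new Field("value",
        FieldType.nullable(new ArrowType.Int(32, true)), Collections.emptyList());
    Field entries = new Field("entries",
        new FieldType(false, new ArrowType.Struct(), null), Arrays.asList(key, value));
    return new Field(name,
        FieldType.nullable(new ArrowType.List()), Collections.singletonList(entries));
  }
}

This only sketches the schema-level Field; whether engines would prefer a different map layout is the open question above.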

import org.apache.iceberg.types.Types.TimestampType;
import org.junit.Test;

import static org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID.Bool;
Contributor

For any new module we make, we should probably start with Baseline. Here, Baseline will complain about the static imports, but it might make sense to add exceptions for them in checkstyle.xml.

Contributor

I'd be more comfortable importing org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID and then referring to ArrowTypeID.Date, etc.
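
For illustration, the suggestion amounts to something like the following hypothetical test snippet (not from this PR), importing the enclosing ArrowTypeID enum once and qualifying each constant through it:

import static org.junit.Assert.assertEquals;

import java.util.Collections;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.junit.Test;

public class TestArrowTypeIdImports {
  @Test
  public void testBoolTypeId() {
    Field flag = new Field("flag",
        FieldType.nullable(new ArrowType.Bool()), Collections.emptyList());
    // qualify through the enum rather than statically importing Bool
    assertEquals(ArrowTypeID.Bool, flag.getType().getTypeID());
  }
}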

Contributor Author

This doesn't seem to be currently enforced. Was there a change to the checkstyle to relax this convention or do you still prefer the proposed import changes?

@mccheah
Contributor

mccheah commented Jun 8, 2019

The build is failing. Is this ready for review as a standalone item, or did you want to do more here? What's next for this feature moving forward?

@xhochy
Member

xhochy commented Jun 10, 2019

Dictionaries are part of the Arrow Schema definition, but should be built based off the parquet dictionaries.

@danielcweeks This actually changed in Arrow master as it led to a lot of problems. With the changes in Arrow master, reading row groups with Dictionaries should be much simpler.

@danielcweeks
Contributor Author

@mccheah We're going to start investing in building out the arrow read path in a couple of weeks, so I'll try to address the checkstyle issues and we'll build on this from there.

@mccheah
Contributor

mccheah commented Jul 3, 2019

Ping on this - are we continuing to make progress here?

@danielcweeks
Contributor Author

@mccheah thanks for the ping. I've rebased and addressed the checkstyle issues.

We are actively working toward an arrow read path, so even though this isn't used yet, it would probably help to have it reviewed and included to keep the commits smaller.

@rdblue
Contributor

rdblue commented Jul 5, 2019

I'm not sure that it's a good idea to add Arrow to iceberg-core until there is more functionality there. Could we keep this as a PR until we know more about what the Arrow integration will look like? It would make sense as an add-on module like iceberg-parquet or iceberg-orc.

@danielcweeks
Contributor Author

@rdblue Keeping this as a PR is fair, and it does seem that arrow would make more sense as a separate module.

@mccheah
Contributor

mccheah commented Jul 8, 2019

Is there a world where we can keep this feature behind a feature flag so that development can proceed forward with reasonably sized diffs and merges?

@rdblue
Contributor

rdblue commented Jul 8, 2019

@mccheah, we do intend to move forward with reasonably sized merges. For this one, I'm not sure it is worth updating because we don't really know yet what the module structure for Arrow will be, and that will depend on the implementation. We can definitely update this to be in its own iceberg-arrow module and add the Arrow dependency to core, but since this PR is mostly to demonstrate the conversion, I wasn't sure that was worth spending time on.

@mccheah
Contributor

mccheah commented Jul 8, 2019

I think it would be appropriate to clarify what the plan is for integrating Arrow with Iceberg and the data format integrations. I'll follow up on #9.

@rdblue
Contributor

rdblue commented Jul 31, 2019

I'm going to close this because it is in the vectorized read branch and will be merged with that work. For status, there is a vectorized read milestone.

rdblue closed this Jul 31, 2019