Add basic arrow schema conversion #194
Conversation
import org.apache.iceberg.types.Types.TimestampType;
import org.junit.Test;

import static org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID.Bool;
For any new module we make, we should probably start with Baseline. Here, Baseline will complain about static imports, but it might make sense to add exceptions for them in the checkstyle.xml.
I'd be more comfortable with importing org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID, and then addressing by ArrowTypeID.Date, etc.
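The qualified-reference style suggested here can be sketched as follows (the class and method names are hypothetical, not part of the PR; only the `ArrowTypeID` import is from Arrow itself):

```java
// Import the ArrowTypeID enum once, then refer to its constants as
// ArrowTypeID.Bool, ArrowTypeID.Date, etc., instead of statically
// importing each constant individually.
import org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID;

public class ImportStyleSketch {
  // Example use of a qualified constant; the method is illustrative only.
  public static ArrowTypeID booleanTypeId() {
    return ArrowTypeID.Bool;
  }

  public static void main(String[] args) {
    System.out.println(booleanTypeId());
    System.out.println(ArrowTypeID.Date);
  }
}
```

This keeps each use site explicit about which enum the constant belongs to, which is what checkstyle rules that ban static imports are usually after.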
This doesn't seem to be currently enforced. Was there a change to the checkstyle to relax this convention or do you still prefer the proposed import changes?
Build is failing - is this ready for review as a standalone item or did you want to do more here? What's next for this feature moving forward?
@danielcweeks This actually changed in Arrow master, as it led to a lot of problems. With the changes in Arrow master, reading row groups with dictionaries should be much simpler.
@mccheah We're going to start investing in building out the Arrow read path starting in a couple of weeks, so I'll try to address the checkstyle issues and we'll build on this from there.
Ping on this - are we continuing to make progress here?
Force-pushed from 465db5a to 72e3485
@mccheah thanks for the ping. I've rebased and addressed the checkstyle issues. We are actively working toward an Arrow read path, so despite this not being used yet, it would probably help to have this reviewed and included to keep the commits smaller.
I'm not sure that it's a good idea to add Arrow to
@rdblue Keeping this as a PR is fair, and it does seem that Arrow would make more sense as a separate module.
Is there a world where we can keep this feature behind a feature flag so that development can proceed forward with reasonably sized diffs and merges?
@mccheah, we do intend to move forward with reasonably sized merges. For this, I'm not sure that it is worth updating, because we don't really know what the module structure will be for Arrow yet, and that will depend on the implementation. We can definitely update this to be in its own
I think it would be appropriate to clarify what the plan is for integrating Arrow with Iceberg and the data format integrations. I'll follow up on #9. |
I'm going to close this because it is in the vectorized read branch and will be merged with that work. For status, there is a vectorized read milestone. |
There are a number of things that still need to be addressed, and we can create follow-up issues for them:
Dictionaries are part of the Arrow schema definition, but they should be built from the Parquet dictionaries. This probably means that conversion needs to happen while processing the row group metadata, in order to capture the Parquet dictionaries.
This converts maps to list&lt;struct&lt;key, value&gt;&gt;, but that is probably not ideal for a columnar Spark representation (or other engines).
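The first follow-up point can be sketched with Arrow's pojo classes: a dictionary-encoded field carries a `DictionaryEncoding` in its `FieldType`. This is only an illustration of where the encoding would live, not the PR's code; the dictionary id used here is made up, whereas a real conversion would assign ids while walking the Parquet row-group metadata as described above.

```java
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class DictionaryFieldSketch {
  // Build a dictionary-encoded Utf8 field. The dictionary id (1L) and the
  // 32-bit signed index type are arbitrary choices for this sketch.
  public static Field dictionaryEncodedUtf8(String name) {
    DictionaryEncoding encoding =
        new DictionaryEncoding(1L, false, new ArrowType.Int(32, true));
    return new Field(name,
        new FieldType(true, new ArrowType.Utf8(), encoding), null);
  }

  public static void main(String[] args) {
    Field field = dictionaryEncodedUtf8("category");
    System.out.println(field.getDictionary().getId());
  }
}
```

The point of the follow-up issue is that this encoding cannot be produced by schema conversion alone: which columns are dictionary-encoded is only known from the Parquet row-group metadata.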
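The second point refers to the shape the converter currently emits for map types. A minimal sketch of that shape using Arrow's pojo classes (the field names `key`, `value`, and `entries` are assumptions following common Arrow convention, not taken from the PR):

```java
import java.util.Arrays;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class MapShapeSketch {
  // Represent a map<string, int> as list<struct<key, value>>, the shape
  // this conversion currently produces for map types.
  public static Field mapAsListOfStructs(String name) {
    Field key = new Field("key",
        FieldType.notNullable(new ArrowType.Utf8()), null);
    Field value = new Field("value",
        FieldType.nullable(new ArrowType.Int(32, true)), null);
    Field entries = new Field("entries",
        FieldType.notNullable(new ArrowType.Struct()),
        Arrays.asList(key, value));
    return new Field(name,
        FieldType.nullable(new ArrowType.List()),
        Arrays.asList(entries));
  }

  public static void main(String[] args) {
    System.out.println(mapAsListOfStructs("my_map"));
  }
}
```

Whether engines like Spark can consume maps efficiently in this list-of-structs form is exactly the open question the follow-up issue would track.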