-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-1474: [JAVA] Java ValueVector hierarchy Refactor (Implementation Phase 2) #1203
ARROW-1474: [JAVA] Java ValueVector hierarchy Refactor (Implementation Phase 2) #1203
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @siddharthteotia! Does the logger creation need to be in each concrete vector class? What is the reason for having the ArrowReader in each vector class? does this need to be there at all?
} | ||
|
||
private StringBuilder getAsStringBuilderHelper(int index) { | ||
final int startIndex = index * 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to index * TYPE_WIDTH
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
final int startIndex = index * 8; | ||
|
||
final int days = valueBuffer.getInt(startIndex); | ||
int millis = valueBuffer.getInt(startIndex + 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a MILLISECOND_OFFSET = 4
and use that here so it gives some context on what the calculation is for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
final int offsetIndex = index * TYPE_WIDTH; | ||
BitVectorHelper.setValidityBitToOne(validityBuffer, index); | ||
valueBuffer.setInt(offsetIndex, days); | ||
valueBuffer.setInt((offsetIndex + 4), milliseconds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here for millisecond offset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
|
||
private void setValue(int index, int value) { | ||
valueBuffer.setShort(index * TYPE_WIDTH, value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is some kind of cast needed for value which is an int? Why do we need to have double the set methods to allow setting an int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is probably no cast needed. I just thought that since this vector type really doesn't have 4 byte ints, we should ideally have a method set(index, short value). In the existing code on master, we don't have this. We just have set (int index, int value)
So set(index, int value) internally just takes 2 bytes from the int and calls setShort on the ArrowBuf. From an API perspective, it looked less intuitive to me since the class description tells user that the vector stores 2 byte values. So I introduced another method and still kept the original one.
What do you think? Should we not have set(int index, short value)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have just setValue(int index, short value)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess having a setValue(int index, int value)
is fine as a convenient method . Although, do we want any overflow checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just have setValue(int index, short value)
if the other is just convenience. Otherwise we have to think about things like overflow like @icexelloss pointed out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my .02: given java's penchant for upcasting, I'm inclined to include both short and int signatures. I know some of these things were also introduced because of various runtime code generation patterns that we've used. @siddharthteotia, any sense whether we rely on this pattern? (int set)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jacques-n , Iet me dig into the code and find out what is our usage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we stand on this? TBD or decided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep both the methods. Keeping them is probably no harm but removing them is likely going to affect the downstream use in our run time generated code.
BitVectorHelper.setValidityBit(validityBuffer, index, 0); | ||
} | ||
|
||
public void set(int index, int isSet, long valueField ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you call valueField
just value
to be consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done.
valueBuffer.setChar(index * TYPE_WIDTH, value); | ||
} | ||
|
||
private void setValue(int index, char value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we support setting a value as a char
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reasoning over here was similar to why I introduced another method for setting "short" value. I think this needs some discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be short
instead? This is a 2 byte value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as before, we should try to keep the support value types to minimum. I'm not sure that having char
really helps much?
*/ | ||
public class NullableUInt4Vector extends BaseNullableFixedWidthVector { | ||
private static final org.slf4j.Logger logger = | ||
org.slf4j.LoggerFactory.getLogger(NullableIntVector.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NullableUInt4Vector.class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
*/ | ||
public class NullableUInt8Vector extends BaseNullableFixedWidthVector { | ||
private static final org.slf4j.Logger logger = | ||
org.slf4j.LoggerFactory.getLogger(NullableBigIntVector.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NullableUInt8Vector.class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -0,0 +1,299 @@ | |||
/******************************************************************************* | |||
|
|||
* Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe minor but I usually sometimes see a different format for the license:
/**
* Licensed to the Apache Software Foundation (ASF) under one
* ...
*\
It would be good to be consistent, but maybe that would be better to address separately after all this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay will look into it at the end.
*/ | ||
public class NullableDateDayVector extends BaseNullableFixedWidthVector { | ||
private static final org.slf4j.Logger logger = | ||
org.slf4j.LoggerFactory.getLogger(NullableIntVector.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be NullableDateDayVector.class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I have addressed most of your general comments. Few points on remaining comments (1) Regarding Logger - I had added this as a TODO in my previous patch. The thing is that we really don't log much. We only log in the base class during realloc and alloc functions where there is a chance of catching memory related exceptions. So I have been contemplating if we really need logging. It's more of an unnecessary heap overhead. The reason I initialize the specific logger in each subclass is because when the super class methods dump out log messages we can see which exact vector type the messages correspond to. But again, since we barely log messages, I think we are better off not having any logging at all. You can take a look at BaseNullableFixedWidthVector and see what you think. Can we afford no logging? 60a2ebd#diff-dddca025d8d6792d8776d3c59ce508f7R270 (2) Regarding FieldReader -- I think you are right. When we are working with a vector type, we have enough information available to create the reader on demand as opposed to carrying the FieldReader object inside each vector. Is this what you were suggesting? However, we may have to see the impact of changes. This is definitely doable but we will have to refactor code in map, list and union vectors where when they read nested scalar vectors, they can no longer make a call to getReader(). |
All scalar types refactored and new implementation is ready -- builds fine. Corresponding Legacy vectors are also ready. |
return new TransferImpl((NullableDecimalVector)to); | ||
} | ||
|
||
private class TransferImpl implements TransferPair { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These TransferImpl class looks very similar for different vectors. Can they be refactored out as a single class?
* integer values which could be null. A validity buffer (bit vector) is | ||
* maintained to track which elements in the vector are null. | ||
*/ | ||
public class NullableUInt2Vector extends BaseNullableFixedWidthVector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between this and NullableSmallIntVector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think we should remove all the UInt vectors. This was a task we talked about doing a year ago but just didn't get it done. They aren't complete and aren't tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sweet. Just curious what's the initial intention for those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that is probably a good idea. I have been wondering what is the difference between Int4Vector and UInt4Vector since the latter really doesn't implement unsigned semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for removing them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the problem with removing them is that other implementations may send unsigned integer data. We've already been integration testing this: https://github.com/apache/arrow/blob/master/integration/integration_test.py#L745
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to mark the unsigned int vectors in Java "buggy" but they will have to get implemented properly at some point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess we are not removing them ?
Keeping them as is and will open JIRA(s) for correctly implementing and testing UINT1, UINT2, UINT4, UINT8. I can open JIRAs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, great, thank you =)
@@ -260,10 +260,10 @@ private void readVector(Field field, FieldVector vector) throws JsonParseExcepti | |||
((ListVector) vector).getMutator().setLastSet(count); | |||
break; | |||
case VARBINARY: | |||
((NullableVarBinaryVector) vector).getMutator().setLastSet(count - 1); | |||
((NullableVarBinaryVector)vector).setLastSet(count - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
Should we add whitespace after closing )
in casting? This is specified in
https://google.github.io/styleguide/javaguide.html#s4.6.2-horizontal-whitespace
item 5
* timestamp values which could be null. A validity buffer (bit vector) is | ||
* maintained to track which elements in the vector are null. | ||
*/ | ||
public class NullableTimeStampMilliVector extends BaseNullableFixedWidthVector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we want to consolidate all Timestamp vectors into one class later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what that buys. Definitely seems like something we should address afterwards (but before a release--better not to break things twice)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we can address this later as long as it's before release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In C++ we have a single arrow::TimestampArray
but applications will generally branch based on the unit metadata and look at the int64 values. I can't really say what's the better option for Java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, in Java we want to avoid having extra branches at the cell level. For long I agree all could be the same. So maybe having a TimestampVector that takes a constructor provided unit type and then have subclassed vectors that hold the old per unit hierarchy could give a good combination. Both would have the get and set with int64/long. If you want to avoid the getValue() conditional, you go with the sub-hiearchy. If you want a generic interface for getValue(), use the base TimestampVector which does an internal conditional on how to interpret the value. Keeps @icexelloss happy but also allows Dremio folk to use the specific interfaces in runtime generated code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I think this is a Java-specific design issue then related to the JIT -- in C/C++ we have to branch earlier than the cell level into vectorized branch-free code paths so having a single container makes things simpler (since you can dispatch to code that handles "some kind of timestamp").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something we should do as part of the ongoing refactor patch? Earlier @jacques-n mentioned doing it afterwards probably once the new infrastructure is stable? I am fine either way. Just confirming..
Thanks @siddharthteotia. I went through the change and left a few comments. |
@BryanCutler and @siddharthteotia , the reason the vector holders a reference to ArrowReader is that there are several cases where you don't want to constantly be recreating the reader and at the time, this was the easiest place to maintain it. Not sure if this still the case but it is something that should be reviewed before removing. |
@siddharthteotia , regarding the logger can you specific vector logs by moving to the base class and initializing the logger with the class name as a string, e.g. |
Thanks for the thorough review @jacques-n , @BryanCutler , @icexelloss , @wesm . I am in the process of addressing comments as we are reaching consensus. Meanwhile, I am trying to prioritize stability of tests. Right now we have 2 related failures in ComplexWriter, Promotable Writer and lots of failures in TestJsonFile since getFieldInnerVectors is no longer applicable. I am addressing the former ones as of now. The recent commit has refactored complex types -- LIST, FIXED SIZE LIST, MAP, UNION along with corresponding Legacy types and code changes in the callers to make things work. |
@BryanCutler, for the logger we discussed that internally (@siddharthteotia and I) and should have posted the thoughts here. Basically, the downside is additional heap bloat since each instance has to hold a reference to the logger (as opposed to be a constant). We thought it wasn't worth the cost. |
} | ||
} | ||
|
||
public void setSafe(int index, int isSet, long valueField ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valueField -> value?
BitVectorHelper.setValidityBit(validityBuffer, index, 0); | ||
} | ||
|
||
public void set(int index, int isSet, long valueField ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valueField -> value?
@siddharthteotia I also found the new files are 3-space indented. The current files are 2-space indented. Please fix the indentation? |
Is this style issue being checked in Travis CI? |
Probably not. I fixed some checkstyles warning but not all of them: So we cannot failure the build for checkstyles yet. |
I opened https://issues.apache.org/jira/browse/ARROW-1688. We must keep our code clean |
@jacques-n , @BryanCutler , @icexelloss
Remaining:
|
} | ||
else if (bufferType.equals(OFFSET)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: combine with line above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to run checkstyle on the new vector classes before merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we ignore style until the change set are finalized and just fix it all at once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, definitely. Once the patch is ready for merge to java-vector-refactor branch, we can check that. Or once we plan to push both patches from java-vector-refactor to master, then we can do and address the failures at one go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's address that and any other cosmetic changes once we plan to push the two patches from java-vector-refactor branch to master
for (int i = 0; i < bufferValueCount; i++) { | ||
if (vectorType.equals(DATA) && | ||
(vector.getMinorType() == Types.MinorType.VARCHAR || | ||
vector.getMinorType() == Types.MinorType.VARBINARY)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: indentation seems off in this if statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if (bufferType.equals(TYPE)) { | ||
generator.writeNumber(buffer.getByte(index * NullableTinyIntVector.TYPE_WIDTH)); | ||
} | ||
else if (bufferType.equals(OFFSET)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: combine with line above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I looked at JSONfile reader/writer and looks good from what I can tell so far, just some minor style comments |
The latest commit addresses some recent review comments and removes the inner vectors from LIST. I guess we are left with Implementing Timestamp vectors as suggested above -- #1203 (comment) |
All tests run clean with latest commit. |
Thanks @siddharthteotia. I will find some time to review this tomorrow. |
* limitations under the License. | ||
*/ | ||
<@pp.dropOutputFile /> | ||
<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/UnionVector.java" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in the new vector class space. @siddharthteotia, what are your thoughts on:
-
Do we want to continue to codegen union vector in the new vector package?
-
Are there any API changes comparing new/legacy union vectors?
-
Do we want to release support for union vector in new vector classes in 0.8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see this is acutally generate LegacyUnionVector.java
- the doc is incorrect.
To answer myself, it seems:
- Yes
- (sid can you answer this?)
- Yes it's compatible with the C++ union. No otherwise.
@jacques-n , @BryanCutler , @icexelloss The latest commit addresses the changes suggested w.r.t timestamp vector hierarchy. The concrete timestamp classes now have only holder specific methods. Adjusted license headers for consistency. All tests run clean. |
@@ -363,12 +361,12 @@ public void copyValueSafe(int from, int to) { | |||
|
|||
@Override | |||
public Accessor getAccessor() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we stand on these? Remove before 0.8 release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment here that indicates why we have dummy getAccessor() and getMutator() interfaces https://github.com/apache/arrow/pull/1203/files/fb5768fd086a979af8cc8d8795935d191f231678#diff-b5610c173675c3c2707593d7968ee29eR259
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry the link is not valid any more, can you paste a new link?
public void setValueCount(int valueCount) { } | ||
|
||
public Object getObject(int index) { return null; } | ||
public int getNullCount() { return 0; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. Why is this returning 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As explained here -- #1203 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be UnsupportedOperationException
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is resolved.
public final static String DATA_VECTOR_NAME = "$data$"; | ||
|
||
protected final UInt4Vector offsets; | ||
public final static byte OFFSET_WIDTH = 4; | ||
protected ArrowBuf offsetBuffer; | ||
protected FieldVector vector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename this to dataVector
?
@@ -62,9 +62,6 @@ public static FixedSizeListVector empty(String name, int size, BufferAllocator a | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class still uses BitVector
as validity. If we are not changing this in this PR, maybe leaving a TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am currently doing it.
return true; | ||
} | ||
|
||
private void allocateValidityBuffer(final long size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be in base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok this is fine.
/* | ||
* transfer the validity. | ||
*/ | ||
private void splitAndTransferValidityBuffer(int startIndex, int length, ListVector target) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to myself:
Read this carefully.
@@ -60,9 +60,6 @@ public static NullableMapVector empty(String name, BufferAllocator allocator) { | |||
|
|||
private final List<BufferBacked> innerVectors; | |||
|
|||
private final Accessor accessor; | |||
private final Mutator mutator; | |||
|
|||
// deprecated, use FieldType or static constructor instead | |||
@Deprecated | |||
public NullableMapVector(String name, BufferAllocator allocator, CallBack callBack) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to keep both MapVector
and NullableMapVector
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss when we are removing all non-nullable vectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
track in ARROW-1710
@@ -175,7 +142,7 @@ public boolean read(VectorSchemaRoot root) throws IOException { | |||
{ | |||
for (Field field : root.getSchema().getFields()) { | |||
FieldVector vector = root.getVector(field.getName()); | |||
readVector(field, vector); | |||
readFromJsonIntoVector(field, vector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to myself: Think about this.
@siddharthteotia I went through the change and left some comments. I haven't read all your reply yet but I will do it later. (Have to go now) |
fb5768f
to
b83d874
Compare
@jacques-n , @BryanCutler , @icexelloss
I will review the code tonight primarily for adding JavaDocs, comments and any review comments that we haven't got a closure on. |
I have added some new tests too but will file follow-up JIRAs to improve our test suite going forward. |
A few high level comments:
|
@siddharthteotia For future reference, I usually prefer not to squash commit during PR - this makes it hard to track incremental changes. We can squash commit when merging. |
W.r.t introduction of some static interfaces for JsonFileWriter/Reader The introduction of couple of static interfaces is not absolutely necessary. They are written for better readability in JsonFileReader's gigantic switch block when it parses Json and writes to the vector (and its inner vectors). Since now we no longer have inner vectors, we obviously couldn't leverage the same code. The JsonFileReader had to be changed to specifically write to different buffers (TYPE, VALIDITY, OFFSET, DATA) for a particular vector. Also it has to allocate the buffer and appropriately set writer index before calling loadFieldBuffers. This is something that was needed for every case in switch block here. Once I did this, the code looked pretty messy and ugly. So I moved all the logic private to vectors and made them as part of static interfaces. On the other hand, in JsonFileWriter we were reading from vector (and its inner vectors) and writing out Json data. Again, since there are no inner vectors, all operation had to be transformed to work at the buffer level -- for writing the contents of each inner buffer. Also, the old code of JsonFileWriter stated a TODO that it was not handling each type. The new code handles all types. If the general preference is to not introduce static interfaces in vector APIs, I am fine with removing them and moving all logic into JSon code itself. The javadocs already indicate that external use of these APIs is not recommended. W.r.t introduction of some new APIs in ValueVector Note that top level interface is still ValueVector even though hierarchy underneath has changed. So there are still non-nullable vectors extending ValueVector, implementing mutator/accessor interfaces etc. So I introduced APIs like getNullCount(), getValueCount(), setValueCount(), getObject() for the new nullable vectors. Once we remove non-nullable vectors and expose mutator/accessor functions as direct get/set in ValueVector, we can get rid of these APIs too. User is free to call such methods on vectors since internally they delegate the call to corresponding mutator/accessor operation for non-nullable vectors and for nullable vectors we already have the new implementation. For legacy vectors, it doesn't really matter since each operation is just a pass-through to new code. There aren't any placeholder interfaces anywhere. Each vector (nullable or non-nullable) has a concrete implementation of all such interfaces as prescribed by ValueVector. Correctness is not affected anywhere. We should be able to do the simple cleanup once we remove non-nullable vectors. If we are not planning to remove non-nullable vectors then we should just remove mutator/accessor from them and expose all the get/set APIs directly just like we have done for other nullable and complex vectors. That will also allow us to do simple cleanup. Whatever we decide to do with non-nullable scalar vectors, we should do soon to make the entire java Vector code under ValueVector hierarchy consistent. Right now the nullable scalars and complex types are consistent -- none of them have inner vectors and none of them support mutator/accessor based access. Either we should do the same thing with non nullable vectors or remove them all together. The latter is preferable. |
Looking into Travis CI failures -- build (with tests) runs clean locally |
We're really running into a weakness of GitHub code reviews. My understanding is that Dremio uses Gerrit for code reviews (like Kudu, Impala, and lots of Google projects) and so the squashing is a key part of the Gerrit workflow. But it works pretty poorly for GitHub, where having a string of new commits is better (although the GitHub UI is terrible for reviewing incremental diffs) I would really like to have the option of doing large Arrow code reviews on Gerrit. It can be a bit challenging to do (because Gerrit can fall out of sync) unless you have 100% of your reviews hosted on there, and Gerrit is quite a bit of process for some users. I hope that we find a way to do this in the future. |
@siddharthteotia , thanks for the explanation. W.r.t introduction of some static interfaces for JsonFileWriter/Reader In that case, I am sort of OK with leaving these methods as public static but document clearly those methods are not part of public API and should not be used (IMHO "not recommended" is not strong enough, I would probably say "shouldn't") and refactor that later, before or after 0.8 release. What do other people think? W.r.t introduction of some new APIs in ValueVector
@siddharthteotia, let me see if I understand this correctly: getNullCount(), getValueCount(), setValueCount(), getObject() are a part of the new vector API and we will keep them going forward. I think I saw a version of the BaseValueVector that these methods are returning bogus values and got confused. It seems to be correct row. I probably just saw a wip version. |
@jacques-n , @BryanCutler , @icexelloss The latest commit has good javadocs for each and every new function in all vector types. How do you feel about merging this patch to java-vector-refactor branch? I believe merging into master will require proper formal sign off. We are going to kickstart testing these in Dremio after cherry-picking these two patches from java-vector-refactor branch and making the necessary code changes. cc @wesm |
I'm on board with merging to the refactor branch whenever you all agree |
I'm onboard with merge. We can still continue to address comments post merge as necessary. Big patches and github don't mix... |
Thanks @siddharthteotia , I agree to merge to refactor branch and continue there. This is getting a little to big to focus on any one change. W.r.t introduction of some static interfaces for JsonFileWriter/Reader I don't really like having these static |
I agree we can merge this. I will open Jiras to track major unresolved discussion . |
Implementation of all scalar types and complex types with corresponding legacy versions. Closes #1203
Merged in 612b970, thanks @siddharthteotia and everyone for reviewing! |
Implementation of all scalar types and complex types with corresponding legacy versions. Closes apache#1203
Implementation of all scalar types and complex types with corresponding legacy versions. Closes #1203
Implementation of all scalar types and complex types with corresponding legacy versions.
cc @jacques-n , @BryanCutler , @icexelloss