Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-1474: [JAVA] Java ValueVector hierarchy Refactor (Implementation Phase 2) #1203

Closed

Conversation

siddharthteotia
Copy link
Contributor

@siddharthteotia siddharthteotia commented Oct 16, 2017

Implementation of all scalar types and complex types with corresponding legacy versions.

cc @jacques-n , @BryanCutler , @icexelloss

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @siddharthteotia! Does the logger creation need to be in each concrete vector class? What is the reason for having the ArrowReader in each vector class? does this need to be there at all?

}

private StringBuilder getAsStringBuilderHelper(int index) {
final int startIndex = index * 8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to index * TYPE_WIDTH

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

final int startIndex = index * 8;

final int days = valueBuffer.getInt(startIndex);
int millis = valueBuffer.getInt(startIndex + 4);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a MILLISECOND_OFFSET = 4 and use that here so it gives some context on what the calculation is for

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

final int offsetIndex = index * TYPE_WIDTH;
BitVectorHelper.setValidityBitToOne(validityBuffer, index);
valueBuffer.setInt(offsetIndex, days);
valueBuffer.setInt((offsetIndex + 4), milliseconds);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here for millisecond offset

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.



private void setValue(int index, int value) {
valueBuffer.setShort(index * TYPE_WIDTH, value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is some kind of cast needed for value which is an int? Why do we need to have double the set methods to allow setting an int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanCutler

There is probably no cast needed. I just thought that since this vector type really doesn't have 4 byte ints, we should ideally have a method set(index, short value). In the existing code on master, we don't have this. We just have set (int index, int value)

So set(index, int value) internally just takes 2 bytes from the int and calls setShort on the ArrowBuf. From an API perspective, it looked less intuitive to me since the class description tells user that the vector stores 2 byte values. So I introduced another method and still kept the original one.

What do you think? Should we not have set(int index, short value)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have just setValue(int index, short value)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess having a setValue(int index, int value)is fine as a convenient method . Although, do we want any overflow checks?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just have setValue(int index, short value) if the other is just convenience. Otherwise we have to think about things like overflow like @icexelloss pointed out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my .02: given java's penchant for upcasting, I'm inclined to include both short and int signatures. I know some of these things were also introduced because of various runtime code generation patterns that we've used. @siddharthteotia, any sense whether we rely on this pattern? (int set)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacques-n , Iet me dig into the code and find out what is our usage

Copy link
Contributor

@icexelloss icexelloss Oct 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we stand on this? TBD or decided?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep both the methods. Keeping them is probably no harm but removing them is likely going to affect the downstream use in our run time generated code.

BitVectorHelper.setValidityBit(validityBuffer, index, 0);
}

public void set(int index, int isSet, long valueField ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you call valueField just value to be consistent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done.

valueBuffer.setChar(index * TYPE_WIDTH, value);
}

private void setValue(int index, char value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we support setting a value as a char?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning over here was similar to why I introduced another method for setting "short" value. I think this needs some discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be short instead? This is a 2 byte value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before, we should try to keep the support value types to minimum. I'm not sure that having char really helps much?

*/
public class NullableUInt4Vector extends BaseNullableFixedWidthVector {
private static final org.slf4j.Logger logger =
org.slf4j.LoggerFactory.getLogger(NullableIntVector.class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullableUInt4Vector.class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*/
public class NullableUInt8Vector extends BaseNullableFixedWidthVector {
private static final org.slf4j.Logger logger =
org.slf4j.LoggerFactory.getLogger(NullableBigIntVector.class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullableUInt8Vector.class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -0,0 +1,299 @@
/*******************************************************************************

* Licensed to the Apache Software Foundation (ASF) under one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe minor but I usually sometimes see a different format for the license:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * ...
 *\

It would be good to be consistent, but maybe that would be better to address separately after all this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay will look into it at the end.

*/
public class NullableDateDayVector extends BaseNullableFixedWidthVector {
private static final org.slf4j.Logger logger =
org.slf4j.LoggerFactory.getLogger(NullableIntVector.class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be NullableDateDayVector.class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@siddharthteotia
Copy link
Contributor Author

@BryanCutler,

I have addressed most of your general comments. Few points on remaining comments

(1) Regarding Logger - I had added this as a TODO in my previous patch. The thing is that we really don't log much. We only log in the base class during realloc and alloc functions where there is a chance of catching memory related exceptions. So I have been contemplating if we really need logging. It's more of an unnecessary heap overhead.

The reason I initialize the specific logger in each subclass is because when the super class methods dump out log messages we can see which exact vector type the messages correspond to.

But again, since we barely log messages, I think we are better off not having any logging at all. You can take a look at BaseNullableFixedWidthVector and see what you think. Can we afford no logging?

60a2ebd#diff-dddca025d8d6792d8776d3c59ce508f7R270

(2) Regarding FieldReader -- I think you are right. When we are working with a vector type, we have enough information available to create the reader on demand as opposed to carrying the FieldReader object inside each vector. Is this what you were suggesting?

However, we may have to see the impact of changes. This is definitely doable but we will have to refactor code in map, list and union vectors where when they read nested scalar vectors, they can no longer make a call to getReader().

@siddharthteotia
Copy link
Contributor Author

All scalar types refactored and new implementation is ready -- builds fine. Corresponding Legacy vectors are also ready.
Testing has issues w.r.t mutator/accessor in LIST, MAP, UNION. Next step is to refactor them.

return new TransferImpl((NullableDecimalVector)to);
}

private class TransferImpl implements TransferPair {
Copy link
Contributor

@icexelloss icexelloss Oct 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These TransferImpl class looks very similar for different vectors. Can they be refactored out as a single class?

* integer values which could be null. A validity buffer (bit vector) is
* maintained to track which elements in the vector are null.
*/
public class NullableUInt2Vector extends BaseNullableFixedWidthVector {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between this and NullableSmallIntVector?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think we should remove all the UInt vectors. This was a task we talked about doing a year ago but just didn't get it done. They aren't complete and aren't tested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sweet. Just curious what's the initial intention for those?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is probably a good idea. I have been wondering what is the difference between Int4Vector and UInt4Vector since the latter really doesn't implement unsigned semantics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for removing them

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the problem with removing them is that other implementations may send unsigned integer data. We've already been integration testing this: https://github.com/apache/arrow/blob/master/integration/integration_test.py#L745

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to mark the unsigned int vectors in Java "buggy" but they will have to get implemented properly at some point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess we are not removing them ?

Keeping them as is and will open JIRA(s) for correctly implementing and testing UINT1, UINT2, UINT4, UINT8. I can open JIRAs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, great, thank you =)

@@ -260,10 +260,10 @@ private void readVector(Field field, FieldVector vector) throws JsonParseExcepti
((ListVector) vector).getMutator().setLastSet(count);
break;
case VARBINARY:
((NullableVarBinaryVector) vector).getMutator().setLastSet(count - 1);
((NullableVarBinaryVector)vector).setLastSet(count - 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:
Should we add whitespace after closing ) in casting? This is specified in
https://google.github.io/styleguide/javaguide.html#s4.6.2-horizontal-whitespace
item 5

* timestamp values which could be null. A validity buffer (bit vector) is
* maintained to track which elements in the vector are null.
*/
public class NullableTimeStampMilliVector extends BaseNullableFixedWidthVector {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we want to consolidate all Timestamp vectors into one class later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what that buys. Definitely seems like something we should address afterwards (but before a release--better not to break things twice)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can address this later as long as it's before release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In C++ we have a single arrow::TimestampArray but applications will generally branch based on the unit metadata and look at the int64 values. I can't really say what's the better option for Java

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, in Java we want to avoid having extra branches at the cell level. For long I agree all could be the same. So maybe having a TimestampVector that takes a constructor provided unit type and then have subclassed vectors that hold the old per unit hierarchy could give a good combination. Both would have the get and set with int64/long. If you want to avoid the getValue() conditional, you go with the sub-hiearchy. If you want a generic interface for getValue(), use the base TimestampVector which does an internal conditional on how to interpret the value. Keeps @icexelloss happy but also allows Dremio folk to use the specific interfaces in runtime generated code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I think this is a Java-specific design issue then related to the JIT -- in C/C++ we have to branch earlier than the cell level into vectorized branch-free code paths so having a single container makes things simpler (since you can dispatch to code that handles "some kind of timestamp").

Copy link
Contributor Author

@siddharthteotia siddharthteotia Oct 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something we should do as part of the ongoing refactor patch? Earlier @jacques-n mentioned doing it afterwards probably once the new infrastructure is stable? I am fine either way. Just confirming..

@icexelloss
Copy link
Contributor

Thanks @siddharthteotia. I went through the change and left a few comments.

@jacques-n
Copy link
Contributor

@BryanCutler and @siddharthteotia , the reason the vector holders a reference to ArrowReader is that there are several cases where you don't want to constantly be recreating the reader and at the time, this was the easiest place to maintain it. Not sure if this still the case but it is something that should be reviewed before removing.

@BryanCutler
Copy link
Member

BryanCutler commented Oct 17, 2017

@siddharthteotia , regarding the logger can you specific vector logs by moving to the base class and initializing the logger with the class name as a string, e.g. getLogger(this.getClass().getCanonicalName());?

@siddharthteotia
Copy link
Contributor Author

Thanks for the thorough review @jacques-n , @BryanCutler , @icexelloss , @wesm . I am in the process of addressing comments as we are reaching consensus. Meanwhile, I am trying to prioritize stability of tests.

Right now we have 2 related failures in ComplexWriter, Promotable Writer and lots of failures in TestJsonFile since getFieldInnerVectors is no longer applicable. I am addressing the former ones as of now.

The recent commit has refactored complex types -- LIST, FIXED SIZE LIST, MAP, UNION along with corresponding Legacy types and code changes in the callers to make things work.

@jacques-n
Copy link
Contributor

@BryanCutler, for the logger we discussed that internally (@siddharthteotia and I) and should have posted the thoughts here. Basically, the downside is additional heap bloat since each instance has to hold a reference to the logger (as opposed to be a constant). We thought it wasn't worth the cost.

}
}

public void setSafe(int index, int isSet, long valueField ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valueField -> value?

BitVectorHelper.setValidityBit(validityBuffer, index, 0);
}

public void set(int index, int isSet, long valueField ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valueField -> value?

@icexelloss
Copy link
Contributor

@siddharthteotia I also found the new files are 3-space indented. The current files are 2-space indented. Please fix the indentation?

@wesm
Copy link
Member

wesm commented Oct 18, 2017

Is this style issue being checked in Travis CI?

@icexelloss
Copy link
Contributor

Probably not. I fixed some checkstyles warning but not all of them:
#930 (comment)

So we cannot failure the build for checkstyles yet.

@wesm
Copy link
Member

wesm commented Oct 19, 2017

I opened https://issues.apache.org/jira/browse/ARROW-1688. We must keep our code clean

@siddharthteotia
Copy link
Contributor Author

siddharthteotia commented Oct 19, 2017

@jacques-n , @BryanCutler , @icexelloss

  • Addressed issues in JsonFileReader and JsonFileWriter. some work was needed here because getFieldInnerVectors is not applicable anymore. I have introduced static methods in vectors as helper routines for reading from and writing into JSON.

  • Removed logger from all vectors.

  • Addressed failures in tests. Down to one failure now.

Remaining:

  • I haven't yet removed the inner vectors from List. Only mutator/accessor were removed. Will do the remaining tomorrow.

  • Checkstyle issues

  • keeping both (int and actual) signatures for small types (SMALLINT, TINYINT etc)

  • TimeStamp hierarchy refactoring per the suggestions given earlier.

}
else if (bufferType.equals(OFFSET)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: combine with line above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to run checkstyle on the new vector classes before merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we ignore style until the change set are finalized and just fix it all at once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, definitely. Once the patch is ready for merge to java-vector-refactor branch, we can check that. Or once we plan to push both patches from java-vector-refactor to master, then we can do and address the failures at one go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's address that and any other cosmetic changes once we plan to push the two patches from java-vector-refactor branch to master

for (int i = 0; i < bufferValueCount; i++) {
if (vectorType.equals(DATA) &&
(vector.getMinorType() == Types.MinorType.VARCHAR ||
vector.getMinorType() == Types.MinorType.VARBINARY)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: indentation seems off in this if statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if (bufferType.equals(TYPE)) {
generator.writeNumber(buffer.getByte(index * NullableTinyIntVector.TYPE_WIDTH));
}
else if (bufferType.equals(OFFSET)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: combine with line above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@BryanCutler
Copy link
Member

I looked at JSONfile reader/writer and looks good from what I can tell so far, just some minor style comments

@siddharthteotia
Copy link
Contributor Author

The latest commit addresses some recent review comments and removes the inner vectors from LIST. I guess we are left with Implementing Timestamp vectors as suggested above -- #1203 (comment)
I will get going with it unless there are other specific concerns (modulo indentation etc)

@siddharthteotia
Copy link
Contributor Author

All tests run clean with latest commit.

@icexelloss
Copy link
Contributor

Thanks @siddharthteotia. I will find some time to review this tomorrow.

* limitations under the License.
*/
<@pp.dropOutputFile />
<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/UnionVector.java" />
Copy link
Contributor

@icexelloss icexelloss Oct 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in the new vector class space. @siddharthteotia, what are your thoughts on:

  • Do we want to continue to codegen union vector in the new vector package?

  • Are there any API changes comparing new/legacy union vectors?

  • Do we want to release support for union vector in new vector classes in 0.8?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see this is acutally generate LegacyUnionVector.java - the doc is incorrect.

To answer myself, it seems:

  • Yes
  • (sid can you answer this?)
  • Yes it's compatible with the C++ union. No otherwise.

@siddharthteotia
Copy link
Contributor Author

@jacques-n , @BryanCutler , @icexelloss

The latest commit addresses the changes suggested w.r.t timestamp vector hierarchy. The concrete timestamp classes now have only holder specific methods.

Adjusted license headers for consistency.

All tests run clean.

@@ -363,12 +361,12 @@ public void copyValueSafe(int from, int to) {

@Override
public Accessor getAccessor() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we stand on these? Remove before 0.8 release?

Copy link
Contributor Author

@siddharthteotia siddharthteotia Oct 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment here that indicates why we have dummy getAccessor() and getMutator() interfaces https://github.com/apache/arrow/pull/1203/files/fb5768fd086a979af8cc8d8795935d191f231678#diff-b5610c173675c3c2707593d7968ee29eR259

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry the link is not valid any more, can you paste a new link?

public void setValueCount(int valueCount) { }

public Object getObject(int index) { return null; }
public int getNullCount() { return 0; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. Why is this returning 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As explained here -- #1203 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be UnsupportedOperationException?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is resolved.

public final static String DATA_VECTOR_NAME = "$data$";

protected final UInt4Vector offsets;
public final static byte OFFSET_WIDTH = 4;
protected ArrowBuf offsetBuffer;
protected FieldVector vector;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename this to dataVector?

@@ -62,9 +62,6 @@ public static FixedSizeListVector empty(String name, int size, BufferAllocator a

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class still uses BitVector as validity. If we are not changing this in this PR, maybe leaving a TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am currently doing it.

return true;
}

private void allocateValidityBuffer(final long size) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be in base class?

Copy link
Contributor Author

@siddharthteotia siddharthteotia Oct 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this is fine.

/*
* transfer the validity.
*/
private void splitAndTransferValidityBuffer(int startIndex, int length, ListVector target) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to myself:

Read this carefully.

@@ -60,9 +60,6 @@ public static NullableMapVector empty(String name, BufferAllocator allocator) {

private final List<BufferBacked> innerVectors;

private final Accessor accessor;
private final Mutator mutator;

// deprecated, use FieldType or static constructor instead
@Deprecated
public NullableMapVector(String name, BufferAllocator allocator, CallBack callBack) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep both MapVector and NullableMapVector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss when we are removing all non-nullable vectors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

track in ARROW-1710

@@ -175,7 +142,7 @@ public boolean read(VectorSchemaRoot root) throws IOException {
{
for (Field field : root.getSchema().getFields()) {
FieldVector vector = root.getVector(field.getName());
readVector(field, vector);
readFromJsonIntoVector(field, vector);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to myself: Think about this.

@icexelloss
Copy link
Contributor

@siddharthteotia I went through the change and left some comments. I haven't read all your reply yet but I will do it later. (Have to go now)

@siddharthteotia
Copy link
Contributor Author

@jacques-n , @BryanCutler , @icexelloss

  • All code changes done.

  • Tests run clean with latest commit (note all commits are now squashed)

I will review the code tonight primarily for adding JavaDocs, comments and any review comments that we haven't got a closure on.

@siddharthteotia
Copy link
Contributor Author

I have added some new tests too but will file follow-up JIRAs to improve our test suite going forward.

@icexelloss
Copy link
Contributor

A few high level comments:

  • There are a few functions that seems to be placeholder for the interface - for instance getNullCount that only returns 0. I suppose these functions are not going to be removed in 0.8? If that's the case, I think we should make it clear to the user that these functions will not work and shouldn't be called. Having bogus implementation could be confusing to user.

  • API added in vectors classes that are for json reader/writer. I think those shouldn't be a part of the public API. Also I am not sure why we need to make such changes, maybe @siddharthteotia can help clarify?

@icexelloss
Copy link
Contributor

@siddharthteotia For future reference, I usually prefer not to squash commit during PR - this makes it hard to track incremental changes. We can squash commit when merging.

@siddharthteotia
Copy link
Contributor Author

siddharthteotia commented Oct 23, 2017

W.r.t introduction of some static interfaces for JsonFileWriter/Reader

The introduction of couple of static interfaces is not absolutely necessary. They are written for better readability in JsonFileReader's gigantic switch block when it parses Json and writes to the vector (and its inner vectors). Since now we no longer have inner vectors, we obviously couldn't leverage the same code. The JsonFileReader had to be changed to specifically write to different buffers (TYPE, VALIDITY, OFFSET, DATA) for a particular vector. Also it has to allocate the buffer and appropriately set writer index before calling loadFieldBuffers. This is something that was needed for every case in switch block here. Once I did this, the code looked pretty messy and ugly. So I moved all the logic private to vectors and made them as part of static interfaces.

On the other hand, in JsonFileWriter we were reading from vector (and its inner vectors) and writing out Json data. Again, since there are no inner vectors, all operation had to be transformed to work at the buffer level -- for writing the contents of each inner buffer. Also, the old code of JsonFileWriter stated a TODO that it was not handling each type. The new code handles all types.

If the general preference is to not introduce static interfaces in vector APIs, I am fine with removing them and moving all logic into JSon code itself. The javadocs already indicate that external use of these APIs is not recommended.

W.r.t introduction of some new APIs in ValueVector

Note that top level interface is still ValueVector even though hierarchy underneath has changed. So there are still non-nullable vectors extending ValueVector, implementing mutator/accessor interfaces etc.

So I introduced APIs like getNullCount(), getValueCount(), setValueCount(), getObject() for the new nullable vectors. Once we remove non-nullable vectors and expose mutator/accessor functions as direct get/set in ValueVector, we can get rid of these APIs too.

User is free to call such methods on vectors since internally they delegate the call to corresponding mutator/accessor operation for non-nullable vectors and for nullable vectors we already have the new implementation. For legacy vectors, it doesn't really matter since each operation is just a pass-through to new code.

There aren't any placeholder interfaces anywhere. Each vector (nullable or non-nullable) has a concrete implementation of all such interfaces as prescribed by ValueVector. Correctness is not affected anywhere. We should be able to do the simple cleanup once we remove non-nullable vectors. If we are not planning to remove non-nullable vectors then we should just remove mutator/accessor from them and expose all the get/set APIs directly just like we have done for other nullable and complex vectors. That will also allow us to do simple cleanup.

Whatever we decide to do with non-nullable scalar vectors, we should do soon to make the entire java Vector code under ValueVector hierarchy consistent.

Right now the nullable scalars and complex types are consistent -- none of them have inner vectors and none of them support mutator/accessor based access. Either we should do the same thing with non nullable vectors or remove them all together. The latter is preferable.

@siddharthteotia
Copy link
Contributor Author

Looking into Travis CI failures -- build (with tests) runs clean locally

@wesm
Copy link
Member

wesm commented Oct 23, 2017

For future reference, I usually prefer not to squash commit during PR - this makes it hard to track incremental changes. We can squash commit when merging.

We're really running into a weakness of GitHub code reviews. My understanding is that Dremio uses Gerrit for code reviews (like Kudu, Impala, and lots of Google projects) and so the squashing is a key part of the Gerrit workflow. But it works pretty poorly for GitHub, where having a string of new commits is better (although the GitHub UI is terrible for reviewing incremental diffs)

I would really like to have the option of doing large Arrow code reviews on Gerrit. It can be a bit challenging to do (because Gerrit can fall out of sync) unless you have 100% of your reviews hosted on there, and Gerrit is quite a bit of process for some users. I hope that we find a way to do this in the future.

@icexelloss
Copy link
Contributor

icexelloss commented Oct 23, 2017

@siddharthteotia , thanks for the explanation.

W.r.t introduction of some static interfaces for JsonFileWriter/Reader
I see, so the reason is inner vectors are removed and therefore the json reader/writer doesn't work anymore.

In that case, I am sort of OK with leaving these methods as public static but document clearly those methods are not part of public API and should not be used (IMHO "not recommended" is not strong enough, I would probably say "shouldn't") and refactor that later, before or after 0.8 release.

What do other people think?

W.r.t introduction of some new APIs in ValueVector

So I introduced APIs like getNullCount(), getValueCount(), setValueCount(), getObject() for the new nullable vectors. Once we remove non-nullable vectors and expose mutator/accessor functions as direct get/set in ValueVector, we can get rid of these APIs too.

User is free to call such methods on vectors

@siddharthteotia, let me see if I understand this correctly:

getNullCount(), getValueCount(), setValueCount(), getObject() are a part of the new vector API and we will keep them going forward.

I think I saw a version of the BaseValueVector that these methods are returning bogus values and got confused. It seems to be correct row. I probably just saw a wip version.

@siddharthteotia
Copy link
Contributor Author

siddharthteotia commented Oct 23, 2017

@jacques-n , @BryanCutler , @icexelloss

The latest commit has good javadocs for each and every new function in all vector types.

How do you feel about merging this patch to java-vector-refactor branch? I believe merging into master will require proper formal sign off.

We are going to kickstart testing these in Dremio after cherry-picking these two patches from java-vector-refactor branch and making the necessary code changes.

cc @wesm

@wesm
Copy link
Member

wesm commented Oct 23, 2017

I'm on board with merging to the refactor branch whenever you all agree

@siddharthteotia siddharthteotia changed the title ARROW-1474:[WIP] Java Vector Refactor (Implementation Phase 2) ARROW-1474: [JAVA] Java ValueVector hierarchy Refactor (Implementation Phase 2) Oct 23, 2017
@jacques-n
Copy link
Contributor

I'm onboard with merge. We can still continue to address comments post merge as necessary. Big patches and github don't mix...

@BryanCutler
Copy link
Member

Thanks @siddharthteotia , I agree to merge to refactor branch and continue there. This is getting a little to big to focus on any one change.

W.r.t introduction of some static interfaces for JsonFileWriter/Reader

I don't really like having these static set methods in the vector public apis if they shouldn't be used outside of JSON reader/writer, but lets continue this discussion after merging this

@icexelloss
Copy link
Contributor

I agree we can merge this. I will open Jiras to track major unresolved discussion .

wesm pushed a commit that referenced this pull request Oct 24, 2017
Implementation of all scalar types and complex types with corresponding legacy
versions.

Closes #1203
@wesm
Copy link
Member

wesm commented Oct 24, 2017

Merged in 612b970, thanks @siddharthteotia and everyone for reviewing!

@wesm wesm closed this Oct 24, 2017
siddharthteotia added a commit to siddharthteotia/arrow that referenced this pull request Nov 14, 2017
Implementation of all scalar types and complex types with corresponding legacy
versions.

Closes apache#1203
wesm pushed a commit that referenced this pull request Nov 15, 2017
Implementation of all scalar types and complex types with corresponding legacy
versions.

Closes #1203
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants