ARROW-16005: [Java] Fix ArrayConsumer when using ArrowVectorIterator #12692

tom-s-powell · 2022-03-22T16:17:05Z

Fixes https://issues.apache.org/jira/browse/ARROW-16005.

tom-s-powell · 2022-03-22T16:18:33Z

java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/ArrowVectorIterator.java

+      for (int i = 1; i <= consumers.length; i++) {
+        ArrowType arrowType = config.getJdbcToArrowTypeConverter()
+                .apply(new JdbcFieldInfo(resultSet.getMetaData(), i));
+        consumers[i - 1] = JdbcToArrowUtils.getConsumer(


Because ArrayConsumer requires a FieldVector to be passed, I've opted for lazily initialising the consumers after the first VectorSchemaRoot is created.

https://github.com/apache/arrow/pull/12692/files#diff-f812c76a565e7c56500943f512b8498487209b15ed036d404d703854841df3d0R152 will update the vector in the consumer on subsequent iterations.

tom-s-powell · 2022-03-22T16:19:37Z

java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/consumer/ArrayConsumer.java

@@ -90,13 +97,12 @@ public void consume(ResultSet resultSet) throws SQLException, IOException {
        int count = 0;
        try (ResultSet rs = array.getResultSet()) {
          while (rs.next()) {
-            ensureInnerVectorCapacity(innerVectorIndex + count + 1);


I couldn't work out the purpose of this innerVectorIndex. It never seemed to be reset, nor did it seem to be used when consuming results.

It's because of how a ListVector is laid out in memory. The list [[1, 2], [], [3, 4, 5]] is represented as the child vector [1, 2, 3, 4, 5] and the offsets [0, 2, 2, 5]. ensureInnerVectorCapacity is resizing the child vector, so when we call consume for the last element, we want to ensure the child vector has enough capacity for the current elements, along with all the previous elements, and it looks like that's what innerVectorIndex is tracking.

In other words, when we call consume for what will be [3, 4, 5], we need to ensure the child vector has space for at least 3, 4, 5, … elements, not 1, 2, 3, … elements.
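The layout described above can be sketched in plain Java. This is only a model of the idea, not Arrow code: `ListLayoutSketch`, `flatten` and `offsets` are hypothetical names introduced for illustration, and no Arrow classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java model of the child-plus-offsets layout (hypothetical, no Arrow classes). */
final class ListLayoutSketch {

  /** Concatenates the nested lists, in order, into one child list. */
  static List<Integer> flatten(List<List<Integer>> lists) {
    List<Integer> child = new ArrayList<>();
    for (List<Integer> list : lists) {
      child.addAll(list);
    }
    return child;
  }

  /** List i occupies child positions offsets[i] (inclusive) to offsets[i + 1] (exclusive). */
  static int[] offsets(List<List<Integer>> lists) {
    int[] offsets = new int[lists.size() + 1];
    for (int i = 0; i < lists.size(); i++) {
      offsets[i + 1] = offsets[i] + lists.get(i).size();
    }
    return offsets;
  }

  public static void main(String[] args) {
    List<List<Integer>> lists = Arrays.asList(
        Arrays.asList(1, 2),
        Collections.<Integer>emptyList(),
        Arrays.asList(3, 4, 5));
    System.out.println(flatten(lists));                  // [1, 2, 3, 4, 5]
    System.out.println(Arrays.toString(offsets(lists))); // [0, 2, 2, 5]
  }
}
```

In this model, two child slots are already in use before [3, 4, 5] is consumed, so a capacity check of the form `used + count + 1` asks for 3, 4 and then 5 slots as `count` goes 0, 1, 2.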

I'm still not entirely convinced we can remove innerVectorIndex here?

I've reverted the change to innerVectorIndex. The only thing I've done is reset it to 0 when we reset the inner vector.

I've added unit tests for arrays which cover the two issues when reusing VectorSchemaRoot:

NPE from ArrowVectorIterator.

ArrayConsumer not resetting the delegate consumer.

tom-s-powell · 2022-03-22T16:25:38Z

java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/consumer/ArrayConsumer.java

+  public void resetValueVector(ListVector vector) {
+    super.resetValueVector(vector);
+
+    FieldVector childVector = vector.getDataVector();


When VectorSchemaRoot is reused in ArrowVectorIterator, we currently hit the issue that the currentIndex here is reset to 0 but is never updated in the delegate consumer. As such, subsequent iterations will result in null array values because the ListVector (and data vector) is reset https://github.com/apache/arrow/pull/12692/files#diff-f812c76a565e7c56500943f512b8498487209b15ed036d404d703854841df3d0R150.

For example, if you have a batch size of 2 and a ResultSet with 4 rows, the second iteration will be writing values into index 0 and 1 in the ListVector but the offsets for those in the data vector will be pointing at null values (because it was reset) and the values written to the data vector will be at larger indexes.
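The batch-size-2 scenario above can be reproduced with a small plain-Java model. It is a simplified sketch, not the actual ArrayConsumer: `StaleDelegateSketch` and all of its fields and methods are hypothetical, the child buffer is a fixed-size array, and every row is assumed to hold a two-element array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of a list consumer whose delegate keeps its own write index. */
final class StaleDelegateSketch {
  private static final int BATCH_SIZE = 2;

  private Integer[] child = new Integer[16];       // stands in for the data vector
  private int[] offsets = new int[BATCH_SIZE + 1]; // stands in for the list offsets
  private int listIndex = 0;                       // next slot in the list vector
  private int delegateIndex = 0;                   // next slot the delegate writes to
  private final boolean resetDelegate;

  StaleDelegateSketch(boolean resetDelegate) {
    this.resetDelegate = resetDelegate;
  }

  /** Models reusing the root: buffers are cleared and the list index restarts at 0. */
  void resetBatch() {
    child = new Integer[16];
    offsets = new int[BATCH_SIZE + 1];
    listIndex = 0;
    if (resetDelegate) {
      delegateIndex = 0; // the step that is missing in the buggy variant
    }
  }

  /** Consumes one row: the offsets advance from 0, the delegate writes at its own index. */
  void consume(int... values) {
    int start = offsets[listIndex];
    for (int v : values) {
      child[delegateIndex++] = v;
    }
    offsets[++listIndex] = start + values.length;
  }

  /** Reads list i back through the offsets. */
  List<Integer> read(int i) {
    return new ArrayList<>(Arrays.asList(child).subList(offsets[i], offsets[i + 1]));
  }

  /** Four rows, batch size 2: returns the first list of the second batch. */
  static List<Integer> firstListOfSecondBatch(boolean resetDelegate) {
    StaleDelegateSketch s = new StaleDelegateSketch(resetDelegate);
    s.consume(1, 2);
    s.consume(3, 4);
    s.resetBatch();
    s.consume(5, 6);
    s.consume(7, 8);
    return s.read(0);
  }

  public static void main(String[] args) {
    System.out.println(firstListOfSecondBatch(false)); // [null, null]
    System.out.println(firstListOfSecondBatch(true));  // [5, 6]
  }
}
```

Without the delegate reset, the second batch's values land at child positions 4 to 7 while the offsets point at positions 0 to 3, which matches the null array values described above.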

github-actions · 2022-03-22T16:25:54Z

https://issues.apache.org/jira/browse/ARROW-16005

github-actions · 2022-03-22T16:25:56Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

lidavidm

Thanks for fixing this!

Is there a unit test we can add to cover this?

lidavidm · 2022-03-25T19:24:20Z

java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/ArrowVectorIterator.java

    return root;
  }

+  private void ensureInitialized(VectorSchemaRoot root) throws SQLException {
+    if (!initialized) {


It seems this isn't right if !config.isReuseVectorSchemaRoot() because we'll have recreated the root. Since ensureInitialized is only ever called when creating a new root, I don't think we need to guard this with initialized?

~~Ah yes good point!~~

Actually, it's fine as it's currently written. After creating a new VectorSchemaRoot we'll call load and this will call JdbcConsumer#resetValueVector passing the vectors of the new VectorSchemaRoot and obtaining a reference to them (https://github.com/apache/arrow/pull/12692/files#diff-f812c76a565e7c56500943f512b8498487209b15ed036d404d703854841df3d0R161). This will happen before we consume data.

Actually, sorry, I see we have to recreate the delegate consumer, as it references the child vector, so it does make sense to initialize each time.
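The point about the delegate holding a reference to the child vector can be illustrated with a minimal plain-Java sketch. None of these types are Arrow classes: `RebindSketch`, `Root`, `Child`, `Delegate` and `ListConsumer` are hypothetical stand-ins, and the `rebindDelegate` flag exists only to compare the two behaviours.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model: a delegate that keeps a reference to one root's child vector. */
final class RebindSketch {

  static final class Child {
    final List<Integer> values = new ArrayList<>();
  }

  static final class Root {
    final Child child = new Child();
  }

  static final class Delegate {
    private final Child target;

    Delegate(Child target) {
      this.target = target;
    }

    void write(int value) {
      target.values.add(value);
    }
  }

  static final class ListConsumer {
    private Delegate delegate;

    ListConsumer(Root root) {
      this.delegate = new Delegate(root.child);
    }

    /** Called when a new root is loaded; optionally rebinds the delegate to the new child. */
    void resetValueVector(Root root, boolean rebindDelegate) {
      if (rebindDelegate) {
        delegate = new Delegate(root.child);
      }
    }

    void consume(int value) {
      delegate.write(value);
    }
  }

  /** Creates a second root, consumes one value, and returns what the second root holds. */
  static List<Integer> valuesInNewRoot(boolean rebindDelegate) {
    Root first = new Root();
    ListConsumer consumer = new ListConsumer(first);
    Root second = new Root();
    consumer.resetValueVector(second, rebindDelegate);
    consumer.consume(42);
    return second.child.values;
  }

  public static void main(String[] args) {
    System.out.println(valuesInNewRoot(false)); // [] because the write went to the first root
    System.out.println(valuesInNewRoot(true));  // [42]
  }
}
```

In this model, resetting only the outer consumer is not enough when the root is recreated, because the delegate still writes into the old child.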

lidavidm · 2022-03-25T19:28:09Z

java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/consumer/ArrayConsumer.java

@@ -90,13 +97,12 @@ public void consume(ResultSet resultSet) throws SQLException, IOException {
        int count = 0;
        try (ResultSet rs = array.getResultSet()) {
          while (rs.next()) {
-            ensureInnerVectorCapacity(innerVectorIndex + count + 1);


It's because of how a ListVector is laid out in memory. The list [[1, 2], [], [3, 4, 5]] is represented as the child vector [1, 2, 3, 4, 5] and the offsets [0, 2, 2, 5]. ensureInnerVectorCapacity is resizing the child vector, so when we call consume for the last element, we want to ensure the child vector has enough capacity for the current elements, along with all the previous elements, and it looks like that's what innerVectorIndex is tracking.

In other words, when we call consume for what will be [3, 4, 5], we need to ensure the child vector has space for at least 3, 4, 5, … elements, not 1, 2, 3, … elements.

lidavidm

Thanks for plugging away at this. As mentioned it would be good to see a unit test to cover this too (I guess with both reusing/not reusing the VectorSchemaRoot)

@toddfarmer would you like to take a glance here as well?

lidavidm · 2022-05-24T18:45:39Z

java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/consumer/ArrayConsumer.java

@@ -90,13 +97,12 @@ public void consume(ResultSet resultSet) throws SQLException, IOException {
        int count = 0;
        try (ResultSet rs = array.getResultSet()) {
          while (rs.next()) {
-            ensureInnerVectorCapacity(innerVectorIndex + count + 1);


I'm still not entirely convinced we can remove innerVectorIndex here?

tom-s-powell · 2022-05-25T10:59:20Z

I've updated the unit tests for arrow-jdbc to include array types.

tom-s-powell · 2022-05-25T11:52:24Z

java/adapter/jdbc/src/test/java/org/apache/arrow/adapter/jdbc/Table.java

@@ -204,6 +219,11 @@ public void setRowCount(int rowCount) {
    this.rowCount = rowCount;
  }

+  @Override
+  public String toString() {
+    return "Table{name='" + name + "', type='" + type + "'}";


I've added this for debugging purposes; it's helpful in the ParameterizedTest to easily see which test file is failing. I updated the table names to match the YML file name.

tom-s-powell · 2022-05-25T11:53:19Z

...apter/jdbc/src/test/java/org/apache/arrow/adapter/jdbc/h2/JdbcToArrowVectorIteratorTest.java

        "[101, 102, 103]",
        "[104, null, null]",
        "[107, 108, 109]",
        "[110]"
    };
+    String[] expectedArrayColValues = {


Added this test case for arrays specifically to test the fix to ArrayConsumer when resetting the delegate. Without the reset you end up reading null values.

lidavidm

Sorry for the delay.

Thanks for adding the tests and refactoring everything! I left some minor questions but I think this is ready.

lidavidm · 2022-05-31T15:19:09Z

java/adapter/jdbc/src/test/java/org/apache/arrow/adapter/jdbc/JdbcToArrowTestHelper.java

+      }
+    }
+    return valueArr;
+  }


It feels like this could be reused from Table.getListValues? Especially as this seems to also handle nulls

Refactored so they can be reused

lidavidm · 2022-05-31T15:22:23Z

...apter/jdbc/src/test/java/org/apache/arrow/adapter/jdbc/h2/JdbcToArrowVectorIteratorTest.java

@@ -119,7 +134,10 @@ public void testVectorSchemaRootReuse() throws SQLException, IOException {
      assertNotNull(cur);

      // verify the first column, with may contain nulls.
-      assertEquals(expectedColValues[batchCount], cur.getVector(0).toString());
+      assertEquals(expectedIntColValues[batchCount], cur.getVector(0).toString());


Is there a better way to assert equality than the string representation (perhaps with getObject as done elsewhere)? This is a little brittle

Switched to use the assertIntVectorValues and assertListVectorValues

tom-s-powell · 2022-06-01T08:38:31Z

@lidavidm this should be good to merge

Fix ArrayConsumer when using ArrowVectorIterator.

85c4a9f

tom-s-powell commented Mar 22, 2022

github-actions bot added the Component: Java label Mar 22, 2022

lidavidm reviewed Mar 25, 2022

iamthomaspowell added 3 commits May 24, 2022 12:51

Merge branch 'master' into tp/fix-jdbc-array-iterator

0881e91

Remove ensure initialized check.

849bcfe

Revert change.

1cd3e04

lidavidm reviewed May 24, 2022

iamthomaspowell added 2 commits May 25, 2022 11:26

Add tests for arrays in JDBC adapter.

9fec1ea

Revert change to ArrayConsumer.

d2f78d6

iamthomaspowell added 2 commits May 25, 2022 12:46

Add test for arrays when reusing VectorSchemaRoot.

c0d90bf

Reset innerVectorIndex to 0.

5f919a2

tom-s-powell commented May 25, 2022

Minor tidy up.

cee2fef

lidavidm approved these changes May 31, 2022

iamthomaspowell added 2 commits May 31, 2022 19:57

Code review comments.

390a9f4

Fix checkstyle.

eff943c

lidavidm approved these changes Jun 1, 2022

lidavidm merged commit e866789 into apache:master Jun 1, 2022

lwhite1 mentioned this pull request Aug 2, 2022

Version 9.0.0 release blog post apache/arrow-site#227

Merged
