@liyafan82
Contributor

In the current design of the JDBC adapter, vector schema roots cannot be reused: a new vector schema root is created and released for each batch.

This can cause performance problems, because in many scenarios the client code only reads the data in the vector schema root. In those scenarios the same root can be reused in the following cycle: populate data -> client uses data -> populate data -> ...

The current design has another problem. Most of the time it keeps two alternating vector schema roots in memory, which wastes a lot of memory, especially for large batches.

We solve both problems by providing a flag in the config that allows the user to reuse the vector schema roots.

  public boolean hasNext() {
-   return nextBatch != null;
+   try {
+     return !resultSet.isAfterLast();
Contributor

Is this guaranteed to be implemented by most JDBC providers?

Contributor Author

I think so.
isAfterLast is a public method of the java.sql.ResultSet interface (https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html#isAfterLast()), so every conforming implementation is supposed to support it.
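
For reference, the ResultSet Javadoc notes that support for isAfterLast is optional for result sets of type TYPE_FORWARD_ONLY, and that it returns false when the result set contains no rows. If that turns out to matter, one possible alternative is a look-ahead that relies only on next(). The sketch below is an assumption-laden illustration, not code from this PR: LookAheadDemo, LookAheadCursor, fakeResultSet, and countRows are hypothetical names, and the fake ResultSet is a java.lang.reflect.Proxy stand-in so the example runs without a database.

```java
import java.lang.reflect.Proxy;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.NoSuchElementException;

// Hypothetical illustration; none of these names come from the Arrow JDBC adapter.
final class LookAheadDemo {

  /** Tracks row availability using only ResultSet.next(). */
  static final class LookAheadCursor {
    private final ResultSet rs;
    private boolean peeked; // true once next() has been called for the upcoming row
    private boolean hasRow; // result of that next() call

    LookAheadCursor(ResultSet rs) {
      this.rs = rs;
    }

    /** Returns true if another row exists; calls next() at most once per row. */
    boolean hasNext() throws SQLException {
      if (!peeked) {
        hasRow = rs.next();
        peeked = true;
      }
      return hasRow;
    }

    /** Consumes the peeked row; the cursor is already positioned on it. */
    void advance() throws SQLException {
      if (!hasNext()) {
        throw new NoSuchElementException("no more rows");
      }
      peeked = false;
    }
  }

  /** Stand-in ResultSet with the given number of rows; only next() is supported. */
  static ResultSet fakeResultSet(int rows) {
    int[] position = {0};
    return (ResultSet) Proxy.newProxyInstance(
        LookAheadDemo.class.getClassLoader(),
        new Class<?>[] {ResultSet.class},
        (proxy, method, args) -> {
          if (method.getName().equals("next")) {
            return ++position[0] <= rows;
          }
          throw new SQLFeatureNotSupportedException(method.getName());
        });
  }

  /** Counts rows via the look-ahead cursor, wrapping SQLException for brevity. */
  static int countRows(ResultSet rs) {
    try {
      LookAheadCursor cursor = new LookAheadCursor(rs);
      int count = 0;
      while (cursor.hasNext()) {
        cursor.advance();
        count++;
      }
      return count;
    } catch (SQLException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(countRows(fakeResultSet(3)));
    System.out.println(countRows(fakeResultSet(0)));
  }
}
```

The trade-off with this approach is that hasNext() moves the cursor, so the caller has to finish reading the current row before asking whether another one exists.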

VectorSchemaRoot returned = nextBatch;
try {
load(createVectorSchemaRoot());
VectorSchemaRoot ret = config.isReuseVectorSchemaRoot() ? nextBatch : createVectorSchemaRoot();
Contributor

Does it make sense to factor this out into a method that takes the config, instead of repeating the ternary logic in a few places?

Contributor Author

I checked the code and found that the ternary logic is used twice. However, the logic in the two places is opposite:

  1. In initialize(), a new vector schema root is created if the reuse flag is enabled.
  2. In next(), a new vector schema root is created if the reuse flag is disabled.

So there is no common logic here?
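
To make the two call sites concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the adapter's code: ReuseSketch, Batch, BatchIterator, and allocationsFor are hypothetical stand-ins (Batch plays the role of a vector schema root), and the row source is just a counter. The only parts taken from this discussion are the two inverted conditions in initialize() and next().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the iterator under review; none of these are Arrow classes.
final class ReuseSketch {

  /** Stand-in for a batch container such as a vector schema root. */
  static final class Batch {
    final List<Integer> rows = new ArrayList<>();
  }

  static final class BatchIterator {
    private final int totalRows;
    private final int batchSize;
    private final boolean reuse;
    private int produced = 0;
    private Batch nextBatch;
    int allocations = 0; // counts how many containers were created

    BatchIterator(int totalRows, int batchSize, boolean reuse) {
      this.totalRows = totalRows;
      this.batchSize = batchSize;
      this.reuse = reuse;
      initialize();
    }

    private Batch createBatch() {
      allocations++;
      return new Batch();
    }

    // Site 1: allocate up front only when the reuse flag is enabled.
    private void initialize() {
      nextBatch = reuse ? createBatch() : null;
    }

    boolean hasNext() {
      return produced < totalRows;
    }

    // Site 2: allocate per call only when the reuse flag is disabled.
    Batch next() {
      Batch ret = reuse ? nextBatch : createBatch();
      load(ret);
      return ret;
    }

    /** Overwrites the target with the next batchSize rows (or fewer at the end). */
    private void load(Batch target) {
      target.rows.clear();
      for (int i = 0; i < batchSize && produced < totalRows; i++) {
        target.rows.add(produced++);
      }
    }
  }

  /** Drains a 10-row iterator in batches of 4 and reports the allocation count. */
  static int allocationsFor(boolean reuse) {
    BatchIterator it = new BatchIterator(10, 4, reuse);
    while (it.hasNext()) {
      it.next();
    }
    return it.allocations;
  }

  public static void main(String[] args) {
    System.out.println("reuse=true  allocations: " + allocationsFor(true));
    System.out.println("reuse=false allocations: " + allocationsFor(false));
  }
}
```

In this sketch, load() clears the container before refilling it, so with reuse enabled a caller that wants to keep a batch has to copy it before calling next() again.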

final int targetRows = 600000;
ResultSet rs = new FakeResultSet(targetRows);
try (ArrowVectorIterator iter = JdbcToArrow.sqlToArrowVectorIterator(rs, allocator)) {
JdbcToArrowConfig config = new JdbcToArrowConfigBuilder(allocator, JdbcToArrowUtils.getUtcCalendar(), false)
Contributor

Please add a parameter doc for the new false literal.

Contributor Author

Sounds good. Parameter doc added for this line of code, and also added for some other places.

Contributor

@emkornfield left a comment

Should there be a test that asserts the VectorSchemaRoot is actually reused when the value is set to true?

@liyafan82
Contributor Author

Should there be a test that asserts the VectorSchemaRoot is actually reused when the value is set to true?

Good suggestion. I've added JdbcToArrowVectorIteratorTest#testVectorSchemaRootReuse for this.
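
The kind of check being asked for boils down to comparing object identity across batches. The sketch below shows one way such an assertion could be written. It is not the body of testVectorSchemaRootReuse: ReuseAssertionSketch, distinctInstances, and repeat are hypothetical helpers, and StringBuilder merely stands in for a batch container.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical helpers for illustration; not taken from the Arrow test suite.
final class ReuseAssertionSketch {

  /** Counts distinct object instances handed out by the iterator (identity, not equals). */
  static <T> int distinctInstances(Iterator<T> batches) {
    Set<T> seen = Collections.newSetFromMap(new IdentityHashMap<>());
    while (batches.hasNext()) {
      seen.add(batches.next());
    }
    return seen.size();
  }

  /** Iterator that yields `count` items, asking the supplier for each one. */
  static <T> Iterator<T> repeat(int count, Supplier<T> source) {
    return new Iterator<T>() {
      private int remaining = count;

      @Override
      public boolean hasNext() {
        return remaining > 0;
      }

      @Override
      public T next() {
        if (remaining <= 0) {
          throw new NoSuchElementException();
        }
        remaining--;
        return source.get();
      }
    };
  }

  public static void main(String[] args) {
    StringBuilder shared = new StringBuilder();
    // Reuse enabled: every batch is the same container.
    System.out.println(distinctInstances(repeat(3, () -> shared)));
    // Reuse disabled: a fresh container per batch.
    System.out.println(distinctInstances(repeat(3, () -> new StringBuilder())));
  }
}
```

Holding the returned objects in an identity-based set keeps them reachable for the duration of the check, so two different batches cannot be mistaken for one.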

@emkornfield
Contributor

+1 thank you.

ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
Closes apache#10983 from liyafan82/fly_0824_jd

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025