Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37841: [Java] Dictionary decoding not using the compression factory from the ArrowReader #38371

Merged
merged 12 commits into from
Feb 1, 2024

Conversation

vibhatha
Copy link
Collaborator

@vibhatha vibhatha commented Oct 20, 2023

Rationale for this change

This PR addresses #37841.

What changes are included in this PR?

Adding compression-based write and read for Dictionary data.

Are these changes tested?

Yes.

Are there any user-facing changes?

No

@github-actions
Copy link

⚠️ GitHub issue #37841 has been automatically assigned in GitHub to PR creator.

@vibhatha vibhatha marked this pull request as ready for review November 1, 2023 08:16
@vibhatha vibhatha marked this pull request as draft November 1, 2023 08:16
@vibhatha vibhatha marked this pull request as ready for review November 14, 2023 16:58
Copy link
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, @vibhatha !

Comment on lines 69 to 73
private CompressionCodec.Factory compressionFactory;

private CompressionUtil.CodecType codecType;

private Optional<Integer> compressionLevel;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be marked private final and grouped with the other private final vars?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, way neater that way.

Comment on lines 146 to 145
VectorUnloader unloader = new VectorUnloader(dictRoot, /*includeNullCount*/ true,
this.compressionLevel.isPresent() ?
this.compressionFactory.createCodec(this.codecType, this.compressionLevel.get()) :
this.compressionFactory.createCodec(this.codecType),
/*alignBuffers*/ true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a helper function for creating a codec?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use getCodec() here, too!

dictionaryVector1.close();
allocator.close();
}

@Test
public void testArrowFileZstdRoundTrip() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Somewhat unrelated to the issue, but should we parameterize the tests to use all the different types of compression? See TestCompressionCodec.java as an example.

DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
provider.put(dictionary1);

final BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can get rid of these in favor of this.allocator if you keep the @Before/@After functions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. I refactored the code to use this approach.


@Before
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the @Before/@After functionality isn't used for all @Tests, it might be better to move this to a helper function instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use @BeforeEach so that the allocator and root is reset for each test? It might make the tests slower, but not sure if it's better to have a fresh allocator for each test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think that is better and safe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unaddressed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw @lidavidm

I updated the JUnit annotations for consistency and compatibility. The @beforeeach annotation from JUnit 5 was being used in conjunction with @test from JUnit 4, causing the setup method not to run as expected before each test.

Furthermore do we need to make the usage of JUnit consistent across tests?

Copy link
Collaborator Author

@vibhatha vibhatha Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change okay?

}

@Test
public void testArrowFileZstdRoundTripWithDictionary() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we still need the original test function testArrowFileZstdRoundTrip? Is this new test case possibly testing the same code path + dictionaries?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the same path + dictionaries, I refactored and reorganized the test cases, does it make sense or useful? I think my previous code had duplication.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Nov 14, 2023
@vibhatha
Copy link
Collaborator Author

@danepitkin Thanks a lot for the review comments, I will address them.

Copy link
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! The tests look very clean now. left a few small additional comments.

Comment on lines 146 to 145
VectorUnloader unloader = new VectorUnloader(dictRoot, /*includeNullCount*/ true,
this.compressionLevel.isPresent() ?
this.compressionFactory.createCodec(this.codecType, this.compressionLevel.get()) :
this.compressionFactory.createCodec(this.codecType),
/*alignBuffers*/ true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use getCodec() here, too!


@Before
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use @BeforeEach so that the allocator and root is reset for each test? It might make the tests slower, but not sure if it's better to have a fresh allocator for each test.

Comment on lines 108 to 114
if (expectSuccess) {
Assert.assertTrue(reader.loadNextBatch());
Assert.assertTrue(root.equals(reader.getVectorSchemaRoot()));
Assert.assertFalse(reader.loadNextBatch());
} else {
Exception exception = Assert.assertThrows(IllegalArgumentException.class, reader::loadNextBatch);
Assert.assertEquals(expectedErrorMessage, exception.getMessage());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional nit: I think the readArrowFile() and readArrowStream() functions are not needed. While they improve code duplication, I think they also decrease code readability. Personally, I like to quickly see what is asserted in the test functions themselves. I do like how you've separated out other functionality into their own functions like createAndWriteArrowFile().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had doubts after the cleanup 😄
Let me update the PR.

@vibhatha
Copy link
Collaborator Author

@danepitkin I am not sure if this one is practical though 🤔

@danepitkin
Copy link
Member

@danepitkin I am not sure if this one is practical though 🤔

I'm okay with leaving it as-is if there are issues with using @BeforeEach instead of @Before.

Overall, LGTM! I think it's ready for final review from a committer. Excellent job.

@vibhatha
Copy link
Collaborator Author

@lidavidm appreciate your feedback.

@@ -174,6 +178,12 @@ public long bytesWritten() {
return out.getCurrentPosition();
}

private CompressionCodec getCodec() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we should be able to initialize the codec once and reuse it in the constructor, rather than add all the new fields?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm an issue was filed here: #39222
Also, I can work on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: why did we pull this out? It's called only once. Why are we adding a bunch of new fields? We don't use them.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Nov 27, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 13, 2023
@vibhatha
Copy link
Collaborator Author

@lidavidm I updated the PR. Appreciate another round of reviews.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vibhatha can you file a followup to add a new integration test to cover this scenario?

Map<String, String> metaData, IpcOption option, CompressionCodec.Factory compressionFactory,
CompressionUtil.CodecType codecType, Optional<Integer> compressionLevel) {
super(root, provider, out, option, compressionFactory, codecType, compressionLevel);
Map<String, String> metaData, IpcOption option, CompressionCodec codec) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this removing a public constructor?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm for my clarification.

Maybe I have misunderstood your comment here.

I thought it would be rather cleaner to pass the CompressionCodec rather than passing all the other parameters which make this object.

Did you mean something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can delegate constructors without removing public ones (which breaks API).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm does the updated change make sense?


@Before
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unaddressed.

@github-actions github-actions bot added awaiting review Awaiting review awaiting changes Awaiting changes awaiting committer review Awaiting committer review and removed awaiting change review Awaiting change review awaiting review Awaiting review awaiting changes Awaiting changes labels Dec 13, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Feb 1, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 1, 2024
@vibhatha
Copy link
Collaborator Author

vibhatha commented Feb 1, 2024

@github-actions crossbow submit java

Copy link

github-actions bot commented Feb 1, 2024

Revision: 907195a

Submitted crossbow builds: ursacomputing/crossbow @ actions-27dab1a4d2

Task Status
java-jars GitHub Actions
verify-rc-source-java-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-java-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-java-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-java-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-java-macos-amd64 GitHub Actions

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Feb 1, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 1, 2024
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Feb 1, 2024
@lidavidm lidavidm merged commit f9b7ac2 into apache:main Feb 1, 2024
14 of 15 checks passed
@lidavidm lidavidm removed the awaiting merge Awaiting merge label Feb 1, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f9b7ac2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…factory from the ArrowReader (apache#38371)

### Rationale for this change

This PR addresses apache#37841. 

### What changes are included in this PR?

Adding compression-based write and read for Dictionary data. 

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No
* Closes: apache#37841

Lead-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@gmail.com>
Co-authored-by: vibhatha <vibhatha@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…factory from the ArrowReader (apache#38371)

### Rationale for this change

This PR addresses apache#37841. 

### What changes are included in this PR?

Adding compression-based write and read for Dictionary data. 

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No
* Closes: apache#37841

Lead-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@gmail.com>
Co-authored-by: vibhatha <vibhatha@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
…factory from the ArrowReader (apache#38371)

### Rationale for this change

This PR addresses apache#37841. 

### What changes are included in this PR?

Adding compression-based write and read for Dictionary data. 

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No
* Closes: apache#37841

Lead-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@gmail.com>
Co-authored-by: vibhatha <vibhatha@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Java] Dictionary decoding not using the compression factory from the ArrowReader
3 participants