[C++][Java][IPC] Java reader cannot read compressed file created by C++ writer #34432

Closed
wgtmac opened this issue Mar 3, 2023 · 12 comments · Fixed by #34580

Comments

@wgtmac
Member

wgtmac commented Mar 3, 2023

Describe the bug, including details regarding any error messages, version, and platform.

To reproduce the issue, use the C++ code below to write and read an Arrow IPC file. I am using version 11.0.0. Make sure arrow_testing is linked so that the ArrayFromJSON utility is available.

#include "arrow/io/api.h"
#include "arrow/filesystem/localfs.h"
#include "arrow/ipc/feather.h"
#include "arrow/ipc/writer.h"
#include "arrow/table.h"
#include "arrow/testing/gtest_util.h"  // ArrayFromJSON
#include "arrow/util/compression.h"    // arrow::util::Codec

#include <iostream>
#include <memory>
#include <string>

int main(int argc, char** argv) {
  // Change this to SNAPPY, GZIP, or ZSTD to reproduce the cases listed below.
  auto codec_type = ::arrow::Compression::UNCOMPRESSED;
  std::string path = "/tmp/test.arrow";
  auto schema = arrow::schema({arrow::field("a", arrow::int32()), arrow::field("b", arrow::utf8())});
  auto a = arrow::ArrayFromJSON(arrow::int32(), "[1, 2, 3]");
  auto b = arrow::ArrayFromJSON(arrow::utf8(), R"(["a", "b", "c"])");
  auto table = arrow::Table::Make(schema, {a, b});
  auto out = arrow::io::FileOutputStream::Open(path).MoveValueUnsafe();
  ::arrow::ipc::IpcWriteOptions options{
      .codec = arrow::util::Codec::Create(codec_type).MoveValueUnsafe(),
      .metadata_version = arrow::ipc::MetadataVersion::V5,
  };
  auto writer = ::arrow::ipc::MakeFileWriter(std::move(out), schema, options).MoveValueUnsafe();
  writer->WriteTable(*table).ok();
  writer->Close().ok();

  auto fs = arrow::fs::LocalFileSystem();
  auto in = fs.OpenInputFile(path).MoveValueUnsafe();
  auto reader = arrow::ipc::feather::Reader::Open(in).MoveValueUnsafe();
  std::shared_ptr<arrow::Table> read_table;
  if (reader->Read(&read_table).ok()) {
    std::cout << read_table->ToString() << std::endl;
  } else {
    std::cerr << "Failed to read table" << std::endl;
  }
  return 0;
}

On the Java side, use the code below:

package test.arrow;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel;
import org.apache.commons.io.IOUtils;

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ArrowReaderTest {

  public static void main(String[] args) throws IOException {
    InputStream inputStream = new DataInputStream(new FileInputStream("/tmp/test.arrow"));
    RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
    SeekableInMemoryByteChannel channel = new SeekableInMemoryByteChannel
      (IOUtils.toByteArray(inputStream));
    try (ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
      while (reader.loadNextBatch()) {
        System.out.println(reader.getVectorSchemaRoot().contentToTSVString());
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

The behavior varies with codec_type:

  1. UNCOMPRESSED: Both the C++ and Java readers read the data correctly.
  2. SNAPPY: Neither the C++ nor the Java reader can read any data. (Same for GZIP.)
  3. ZSTD: The C++ reader reads the data correctly, but the Java reader throws:
java.lang.NegativeArraySizeException: -16
	at org.apache.arrow.vector.VarCharVector.get(VarCharVector.java:115)
	at org.apache.arrow.vector.VarCharVector.getObject(VarCharVector.java:127)
	at org.apache.arrow.vector.VarCharVector.getObject(VarCharVector.java:38)
	at org.apache.arrow.vector.VectorSchemaRoot.contentToTSVString(VectorSchemaRoot.java:282)
	at org.apache.arrow.vector.ipc.TestRoundTrip.testReadFile(TestRoundTrip.java:537)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.junit.runners.Suite.runChild(Suite.java:128)
	at org.junit.runners.Suite.runChild(Suite.java:27)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
	at com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38)
	at com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11)
	at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35)
	at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235)
	at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54)

Component(s)

C++, Java

@wgtmac
Member Author

wgtmac commented Mar 3, 2023

@lidavidm Do you have any idea?

@wgtmac
Member Author

wgtmac commented Mar 3, 2023

It seems related to #33688

@lidavidm
Member

lidavidm commented Mar 3, 2023

Java-zstd seems like a bug.

C++ shouldn't let you specify snappy or gzip - that's also a bug.

@wgtmac
Member Author

wgtmac commented Mar 3, 2023

> Java-zstd seems like a bug.
>
> C++ shouldn't let you specify snappy or gzip - that's also a bug.

I can work around it by not using compression for now.

The fix on the C++ side should be trivial. Will someone fix the Java-zstd bug?

@lidavidm
Member

lidavidm commented Mar 3, 2023

@lwhite1 @davisusanibar

@wgtmac
Member Author

wgtmac commented Mar 14, 2023

@lwhite1 @davisusanibar

Do we have a plan to fix this in the next release v12.0.0?

@lidavidm
Member

I don't have time in the near future, unfortunately. A PR would be welcome. (I was hoping David or Larry could help, but it seems they're busy.)

@wgtmac
Member Author

wgtmac commented Mar 15, 2023

> I don't have time in the near future, unfortunately. A PR would be welcome. (I was hoping David or Larry could help, but it seems they're busy.)

I tried to debug a little bit and found that the offset buffer of VarCharVector has some strange values when decompressed. In the meantime, I happened to see #15194 which deals with the problem in the inverse direction. Not sure if it can solve this issue.
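
One possible reading of those strange offset values, offered as a sketch rather than a confirmed diagnosis: the Arrow IPC format prefixes each compressed buffer with its uncompressed length as an 8-byte little-endian integer. If a reader skips decompression and interprets those bytes as int32 offsets, a 16-byte offset buffer (four offsets for three strings) would give offsets[0] = 16 and offsets[1] = 0, so the first string's length would come out as -16, which is consistent with the exception in the report. The class and method names below are hypothetical and use only java.nio; none of this is Arrow code, and the 16-byte buffer size is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical model of how a variable-width vector derives element lengths
// from an int32 offset buffer. Not Arrow code.
final class OffsetMisreadSketch {

  // Length of element `index` = offsets[index + 1] - offsets[index].
  static int elementLength(ByteBuffer offsets, int index) {
    int start = offsets.getInt(index * 4);
    int end = offsets.getInt((index + 1) * 4);
    return end - start;
  }

  // Correctly decoded offsets for ["a", "b", "c"]: 0, 1, 2, 3.
  static ByteBuffer plainOffsets() {
    ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
    for (int i = 0; i < 4; i++) {
      buf.putInt(i * 4, i);
    }
    return buf;
  }

  // Start of a compressed buffer read without decompression: the first
  // 8 bytes hold the uncompressed length (assumed to be 16 here). The
  // compressed payload that would follow is left as zeros because only
  // the first two int32 slots matter for element 0.
  static ByteBuffer compressedBodyPrefix() {
    ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
    buf.putLong(0, 16L);
    return buf;
  }

  public static void main(String[] args) {
    System.out.println(elementLength(plainOffsets(), 0));         // prints 1
    System.out.println(elementLength(compressedBodyPrefix(), 0)); // prints -16
  }
}
```

Allocating a byte array with the second result would raise NegativeArraySizeException: -16 in plain Java, which is why the failure shows up only when the string value is materialized.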

@wgtmac
Member Author

wgtmac commented Mar 16, 2023

I finally figured out that the two-argument constructors of ArrowFileReader use NoCompressionCodec.Factory.INSTANCE, so the reader cannot decompress a compressed file. To read one, add a dependency on arrow-compression and use the constructors that take a CompressionCodec.Factory:

  public ArrowFileReader(
      SeekableReadChannel in, BufferAllocator allocator, CompressionCodec.Factory compressionFactory) {
    super(allocator, compressionFactory);
    this.in = in;
  }

  public ArrowFileReader(
      SeekableByteChannel in, BufferAllocator allocator, CompressionCodec.Factory compressionFactory) {
    this(new SeekableReadChannel(in), allocator, compressionFactory);
  }

  public ArrowFileReader(SeekableReadChannel in, BufferAllocator allocator) {
    this(in, allocator, NoCompressionCodec.Factory.INSTANCE);
  }

  public ArrowFileReader(SeekableByteChannel in, BufferAllocator allocator) {
    this(new SeekableReadChannel(in), allocator);
  }

@wgtmac
Member Author

wgtmac commented Mar 16, 2023

Please feel free to close this issue if there is no further action. @lidavidm
I think we can throw from ArrowFileReader if it detects a compressed file but NoCompressionCodec.Factory.INSTANCE is supplied.
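
The fail-fast idea can be sketched in isolation. The following is a minimal model with hypothetical names (FailFastCodecSketch, CodecType, CodecFactory, PASS_THROUGH_ONLY); it is not Arrow's implementation, only an illustration of a factory that refuses any codec type it cannot handle instead of silently passing compressed bytes through.

```java
// Hypothetical model of a codec factory that rejects unsupported codec
// types up front. Not Arrow code.
final class FailFastCodecSketch {

  enum CodecType { NO_COMPRESSION, LZ4_FRAME, ZSTD }

  interface Codec {
    byte[] decompress(byte[] body);
  }

  interface CodecFactory {
    Codec createCodec(CodecType type);
  }

  // A factory that only knows the pass-through codec. Any other request
  // raises an exception rather than returning a codec that does nothing.
  static final CodecFactory PASS_THROUGH_ONLY = type -> {
    if (type != CodecType.NO_COMPRESSION) {
      throw new IllegalArgumentException(
          "No codec available for " + type + "; supply a compression-capable factory");
    }
    return body -> body;
  };

  // Reports whether the factory accepts the requested codec type.
  static String describe(CodecFactory factory, CodecType type) {
    try {
      factory.createCodec(type);
      return "ok";
    } catch (IllegalArgumentException e) {
      return "rejected: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(describe(PASS_THROUGH_ONLY, CodecType.NO_COMPRESSION));
    System.out.println(describe(PASS_THROUGH_ONLY, CodecType.ZSTD));
  }
}
```

With a check like this, a compressed file read through the default factory would fail with a clear message at codec lookup instead of producing corrupt offsets later.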

@davisusanibar
Contributor

My apologies for jumping in late.

Closely related to #33384

In addition, there is a PR for cookbooks, so I'll get to work on the remaining tasks so we can have a recipe for this use case.

@lidavidm lidavidm added this to the 12.0.0 milestone Mar 16, 2023
lidavidm pushed a commit that referenced this issue Mar 16, 2023
…#34580)

### Rationale for this change

`NoCompressionCodec` does not complain about an unsupported codec type. `ArrowFileReader` uses `NoCompressionCodec` by default and fails to decompress a compressed arrow file.

### What changes are included in this PR?

`NoCompressionCodec` throws if unsupported codec type has been requested.

### Are these changes tested?

Make sure all tests pass.

### Are there any user-facing changes?

No.
* Closes: #34432

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
@lidavidm
Member

I filed #34592 to track the C++ side of this
