
[Java] Dictionary decoding not using the compression factory from the ArrowReader #37841

Closed · Fixed by #38371

freakyzoidberg opened this issue on Sep 23, 2023 · 2 comments

I am trying to decode, in Java, records generated in Go (simple types + dictionaries) using ZSTD compression (Arrow 13.0.0).

Although this works fine for the simple types, I get this error when decoding dictionaries:

java.lang.IllegalArgumentException: Please add arrow-compression module to use CommonsCompressionFactory for ZSTD
	at org.apache.arrow.vector.compression.NoCompressionCodec$Factory.createCodec(NoCompressionCodec.java:69)
	at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:82)
	at org.apache.arrow.vector.ipc.ArrowReader.load(ArrowReader.java:256)
	at org.apache.arrow.vector.ipc.ArrowReader.loadDictionary(ArrowReader.java:247)
	at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:167)
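
What seems to happen (a sketch inferred from the stack trace, not copied from the Arrow source): the dictionary-loading path in ArrowReader builds its VectorLoader without the factory that was passed to the reader.

// Inferred from the trace above (Arrow 13.0.0); the construction details
// are an assumption. ArrowReader.loadDictionary -> ArrowReader.load ends
// up with a VectorLoader that has no compression factory:
VectorLoader loader = new VectorLoader(root);
// ...which is equivalent to
// new VectorLoader(root, NoCompressionCodec.Factory.INSTANCE),
// so when the dictionary batch body is ZSTD-compressed,
// NoCompressionCodec.Factory.createCodec throws the
// IllegalArgumentException quoted above.
loader.load(dictionaryBatch.getDictionary());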

The Go part is essentially:

dtyp := &arrow.DictionaryType{
	IndexType: arrow.PrimitiveTypes.Int8,
	ValueType: arrow.BinaryTypes.LargeString,
}
bldrDictString := arrowarray.NewDictionaryBuilder(memory.DefaultAllocator, dtyp)
defer bldrDictString.Release()

// AppendString returns an error; checked here for completeness.
if err := bldrDictString.(*arrowarray.BinaryDictionaryBuilder).AppendString("foo"); err != nil {
	// handle error
}

columnTypes := make([]arrow.Field, 0, 1)
columnArrays := make([]arrow.Array, 0, 1)

columnArrays = append(columnArrays, bldrDictString.NewArray())
// k.key, nulls, and size come from surrounding code elided here.
columnTypes = append(columnTypes, arrow.Field{Name: k.key, Type: dtyp, Nullable: nulls.Any()})

schema := arrow.NewSchema(columnTypes, nil)
rec := arrowarray.NewRecord(schema, columnArrays, int64(size))

var buf bytes.Buffer
writer := ipc.NewWriter(&buf, ipc.WithSchema(schema), ipc.WithZstd())
if err := writer.Write(rec); err != nil {
	// handle error
}
if err := writer.Close(); err != nil {
	// handle error
}

And the Java side:

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.arrow.compression.CommonsCompressionFactory;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

try (ArrowStreamReader reader =
         new ArrowStreamReader(
             // format.getArrow() comes from surrounding code elided here
             new ByteArrayInputStream(format.getArrow().toByteArray()),
             bufferAllocator,
             CommonsCompressionFactory.INSTANCE)) {
  reader.loadNextBatch();
  ...
} catch (IOException e) {
  throw new RuntimeException(e);
}

I was able to stop it from throwing by making the VectorLoader used when loading the dictionary use the compression factory defined in the reader (it currently defaults to NoCompression).

See this change; note that I was not able to make it fail using the Java Arrow tests.
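
A minimal sketch of that idea, assuming ArrowReader keeps the factory it was constructed with in a compressionFactory field (names such as loadDictionaryBatch are illustrative, not the exact patch that was merged as #38371):

import java.util.Collections;

import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.message.ArrowDictionaryBatch;

// Inside ArrowReader: load the dictionary batch with the reader's own
// compression factory instead of the implicit no-compression default.
private void loadDictionaryBatch(ArrowDictionaryBatch dictionaryBatch, FieldVector vector) {
  VectorSchemaRoot root =
      new VectorSchemaRoot(
          Collections.singletonList(vector.getField()),
          Collections.singletonList(vector),
          0);
  // was: new VectorLoader(root), which defaults to NoCompression
  VectorLoader loader = new VectorLoader(root, this.compressionFactory);
  loader.load(dictionaryBatch.getDictionary());
}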

I am probably doing something wrong, and I am also wondering whether dictionaries are compressed the same way by the Go and Java writers, which could explain why the Java test is not failing.

Anyhow, unless I am doing something wrong, this looks like a bug.

Thanks!

Component(s)

Java

lidavidm (Member) commented:

CC @davisusanibar @vibhatha

vibhatha (Collaborator) commented:

@lidavidm I will take a look.

lidavidm pushed a commit that referenced this issue Feb 1, 2024
GH-37841: [Java] Dictionary decoding not using the compression factory from the ArrowReader (#38371)

### Rationale for this change

This PR addresses #37841. 

### What changes are included in this PR?

Adding compression-based write and read for Dictionary data. 

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No
* Closes: #37841

Lead-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@gmail.com>
Co-authored-by: vibhatha <vibhatha@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
lidavidm added this to the 16.0.0 milestone on Feb 1, 2024
dgreiss, zanmato1984, and thisisnic later pushed the same commit to their forks of apache/arrow, referencing this issue.