
[Java] Dictionary decoding not using the compression factory from the ArrowReader #37841

Closed · Fixed by #38371

freakyzoidberg opened this issue on Sep 23, 2023 · 2 comments

I am trying to decode, in Java, records generated in Go (simple types + dictionaries) using ZSTD compression (Arrow 13.0.0).

Although this works fine for the simple types, I get this error when decoding dictionaries:

java.lang.IllegalArgumentException: Please add arrow-compression module to use CommonsCompressionFactory for ZSTD
	at org.apache.arrow.vector.compression.NoCompressionCodec$Factory.createCodec(NoCompressionCodec.java:69)
	at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:82)
	at org.apache.arrow.vector.ipc.ArrowReader.load(ArrowReader.java:256)
	at org.apache.arrow.vector.ipc.ArrowReader.loadDictionary(ArrowReader.java:247)
	at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:167)
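
What seems to happen (a sketch inferred from the stack trace, not copied from the Arrow source): the dictionary-loading path in ArrowReader builds its VectorLoader without the factory that was passed to the reader.

// Inferred from the trace above (Arrow 13.0.0); the construction details
// are an assumption. ArrowReader.loadDictionary -> ArrowReader.load ends
// up with a VectorLoader that has no compression factory:
VectorLoader loader = new VectorLoader(root);
// ...which is equivalent to
// new VectorLoader(root, NoCompressionCodec.Factory.INSTANCE),
// so when the dictionary batch body is ZSTD-compressed,
// NoCompressionCodec.Factory.createCodec throws the
// IllegalArgumentException quoted above.
loader.load(dictionaryBatch.getDictionary());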

The Go part is essentially:

dtyp := &arrow.DictionaryType{
	IndexType: arrow.PrimitiveTypes.Int8,
	ValueType: arrow.BinaryTypes.LargeString,
}
bldrDictString := arrowarray.NewDictionaryBuilder(memory.DefaultAllocator, dtyp)
defer bldrDictString.Release()

// AppendString returns an error; checked here for completeness.
if err := bldrDictString.(*arrowarray.BinaryDictionaryBuilder).AppendString("foo"); err != nil {
	// handle error
}

columnTypes := make([]arrow.Field, 0, 1)
columnArrays := make([]arrow.Array, 0, 1)

columnArrays = append(columnArrays, bldrDictString.NewArray())
// k.key, nulls, and size come from surrounding code elided here.
columnTypes = append(columnTypes, arrow.Field{Name: k.key, Type: dtyp, Nullable: nulls.Any()})

schema := arrow.NewSchema(columnTypes, nil)
rec := arrowarray.NewRecord(schema, columnArrays, int64(size))

var buf bytes.Buffer
writer := ipc.NewWriter(&buf, ipc.WithSchema(schema), ipc.WithZstd())
if err := writer.Write(rec); err != nil {
	// handle error
}
if err := writer.Close(); err != nil {
	// handle error
}

And the Java side:

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.arrow.compression.CommonsCompressionFactory;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

try (ArrowStreamReader reader =
         new ArrowStreamReader(
             // format.getArrow() comes from surrounding code elided here
             new ByteArrayInputStream(format.getArrow().toByteArray()),
             bufferAllocator,
             CommonsCompressionFactory.INSTANCE)) {
  reader.loadNextBatch();
  ...
} catch (IOException e) {
  throw new RuntimeException(e);
}

I was able to stop it from throwing by making the VectorLoader used when loading the dictionary use the compression factory defined in the reader (it currently defaults to NoCompression).

See this change; note that I was not able to make it fail using the Java Arrow tests.
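
A minimal sketch of that idea, assuming ArrowReader keeps the factory it was constructed with in a compressionFactory field (names such as loadDictionaryBatch are illustrative, not the exact patch that was merged as #38371):

import java.util.Collections;

import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.message.ArrowDictionaryBatch;

// Inside ArrowReader: load the dictionary batch with the reader's own
// compression factory instead of the implicit no-compression default.
private void loadDictionaryBatch(ArrowDictionaryBatch dictionaryBatch, FieldVector vector) {
  VectorSchemaRoot root =
      new VectorSchemaRoot(
          Collections.singletonList(vector.getField()),
          Collections.singletonList(vector),
          0);
  // was: new VectorLoader(root), which defaults to NoCompression
  VectorLoader loader = new VectorLoader(root, this.compressionFactory);
  loader.load(dictionaryBatch.getDictionary());
}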

I am probably doing something wrong, and I am also wondering whether dictionaries are compressed the same way by the Go and Java writers, which could explain why the Java test is not failing.

Anyhow, unless I am doing something wrong, this looks like a bug.

Thanks!

Component(s)

Java

lidavidm (Member) commented:

CC @davisusanibar @vibhatha

vibhatha (Collaborator) commented:

@lidavidm I will take a look.

lidavidm pushed a commit that referenced this issue Feb 1, 2024
GH-37841: [Java] Dictionary decoding not using the compression factory from the ArrowReader (#38371)

### Rationale for this change

This PR addresses #37841. 

### What changes are included in this PR?

Adding compression-based write and read for Dictionary data. 

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No
* Closes: #37841

Lead-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@gmail.com>
Co-authored-by: vibhatha <vibhatha@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
lidavidm added this to the 16.0.0 milestone on Feb 1, 2024
dgreiss, zanmato1984, and thisisnic later pushed the same commit to their forks of apache/arrow, referencing this issue.