Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Go][Parquet] Panic when writing a ListOf(DeltaBinaryPacked field) with no data #39309

Closed
ewjoachim opened this issue Dec 19, 2023 · 6 comments · Fixed by #39347
Closed

[Go][Parquet] Panic when writing a ListOf(DeltaBinaryPacked field) with no data #39309

ewjoachim opened this issue Dec 19, 2023 · 6 comments · Fixed by #39347

Comments

@ewjoachim
Copy link

ewjoachim commented Dec 19, 2023

Describe the bug, including details regarding any error messages, version, and platform.

It's been quite hard to understand how to write code for using arrow in go, but I believe I've managed to get it working. One of the thing that still stumbles me is the following scenario (I did my best to make it minimal, so it's a bit strange, but as long as you don't think I'm doing something that should be "forbidden" altogether, then consider it's the short form of my use case).

package main

import (
	"os"

	"github.com/apache/arrow/go/v14/arrow"
	"github.com/apache/arrow/go/v14/arrow/array"
	"github.com/apache/arrow/go/v14/arrow/memory"
	"github.com/apache/arrow/go/v14/parquet"
	"github.com/apache/arrow/go/v14/parquet/pqarrow"
)

func main() {
	schema := arrow.NewSchema(
		[]arrow.Field{
			{Name: "ts", Type: arrow.ListOf(arrow.PrimitiveTypes.Uint64)},
		}, nil)

	builder := array.NewRecordBuilder(memory.DefaultAllocator, schema)

	listBuilder := builder.Field(0).(*array.ListBuilder)
	listBuilder.Append(true)

	arrowRec := builder.NewRecord()

	f, err := os.CreateTemp(".", "test.parquet")
	if err != nil {
		panic(err)
	}

	fileWriter, err := pqarrow.NewFileWriter(
		arrowRec.Schema(),
		f,
		parquet.NewWriterProperties(
			parquet.WithDictionaryFor("ts.list.element", false),
			parquet.WithEncodingFor("ts.list.element", parquet.Encodings.DeltaBinaryPacked),
		),
		pqarrow.DefaultWriterProps(),
	)
	parquetWriter := fileWriter
	if err != nil {
		panic(err)
	}
	parquetWriter.WriteBuffered(arrowRec)
	arrowRec.Release()

}

(I have no idea whether this is the right way to build a parquet with the lib, feel free to point me to better ways if there are)

So in human words: I'm creating an Arrow schema containing a list of uints. Then I write a single empty list. I indicate that parquet should write the uints as DeltaBinaryPacked. Finally, I write the parquet.

Without the Encoding part, this works, but when I specify DeltaBinaryPacked, I get a panic.

Failure is in BitWriter.Written, but the BitWriter pointer is nil.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xc6a3fc]

goroutine 1 [running]:
github.com/apache/arrow/go/v14/parquet/internal/utils.(*BitWriter).Written(0x0)
	/home/joachim/go/pkg/mod/github.com/apache/arrow/go/v14@v14.0.2/parquet/internal/utils/bit_writer.go:109 +0x1c
github.com/apache/arrow/go/v14/parquet/internal/encoding.(*deltaBitPackEncoder).EstimatedDataEncodedSize(0xc0002be900)
	/home/joachim/go/pkg/mod/github.com/apache/arrow/go/v14@v14.0.2/parquet/internal/encoding/delta_bit_packing.go:461 +0x2c
github.com/apache/arrow/go/v14/parquet/file.(*columnWriter).commitWriteAndCheckPageLimit(0xc0003d8000, 0x1, 0x0)
	/home/joachim/go/pkg/mod/github.com/apache/arrow/go/v14@v14.0.2/parquet/file/column_writer.go:255 +0x86
github.com/apache/arrow/go/v14/parquet/file.(*Int64ColumnChunkWriter).WriteBatchSpaced.func1(0x0, 0x1)
	/home/joachim/go/pkg/mod/github.com/apache/arrow/go/v14@v14.0.2/parquet/file/column_writer_types.gen.go:350 +0x412
github.com/apache/arrow/go/v14/parquet/file.doBatches(0x1, 0x400, 0xc0002d00b0)
	/home/joachim/go/pkg/mod/github.com/apache/arrow/go/v14@v14.0.2/parquet/file/column_writer.go:631 +0xf5
github.com/apache/arrow/go/v14/parquet/file.(*Int64ColumnChunkWriter).WriteBatchSpaced(0xc0003d8000, {0x0, 0x0, 0x0}, {0xc00036c3c0, 0x1, 0x80}, {0xc00036c500, 0x1, 0x80}, ...)
	/home/joachim/go/pkg/mod/github.com/apache/arrow/go/v14@v14.0.2/parquet/file/column_writer_types.gen.go:336 +0x247
github.com/apache/arrow/go/v14/parquet/pqarrow.writeDenseArrow(0xc000346730, {0x17c89b0, 0xc0003d8000}, {0x17c57a0, 0xc00033ba80}, {0xc00036c3c0, 0x1, 0x80}, {0xc00036c500, 0x1, ...}, ...)
	/home/joachim/go/pkg/mod/github.com/apache/arrow/go/v14@v14.0.2/parquet/pqarrow/encode_arrow.go:436 +0x57a9
github.com/apache/arrow/go/v14/parquet/pqarrow.WriteArrowToColumn({0x17b4180, 0xc00037a870}, {0x17c89b0, 0xc0003d8000}, {0x17c57a0, 0xc00033ba80}, {0xc00036c3c0, 0x1, 0x80}, {0xc00036c500, ...}, ...)
	/home/joachim/go/pkg/mod/github.com/apache/arrow/go/v14@v14.0.2/parquet/pqarrow/encode_arrow.go:232 +0x3cf
github.com/apache/arrow/go/v14/parquet/pqarrow.(*ArrowColumnWriter).Write(0xc0002d1860, {0x17b4180, 0xc00037a870})
	/home/joachim/go/pkg/mod/github.com/apache/arrow/go/v14@v14.0.2/parquet/pqarrow/encode_arrow.go:193 +0x832
github.com/apache/arrow/go/v14/parquet/pqarrow.(*FileWriter).WriteColumnChunked(0xc0003d4000, 0xc00033ba00, 0x0, 0x1)
	/home/joachim/go/pkg/mod/github.com/apache/arrow/go/v14@v14.0.2/parquet/pqarrow/file_writer.go:313 +0x27a
github.com/apache/arrow/go/v14/parquet/pqarrow.(*FileWriter).WriteColumnData(0xc0003d4000, {0x17c5aa0, 0xc0003466e0})
	/home/joachim/go/pkg/mod/github.com/apache/arrow/go/v14@v14.0.2/parquet/pqarrow/file_writer.go:322 +0x17d
github.com/apache/arrow/go/v14/parquet/pqarrow.(*FileWriter).WriteBuffered(0xc0003d4000, {0x17c3ae8, 0xc00037a4e0})
	/home/joachim/go/pkg/mod/github.com/apache/arrow/go/v14@v14.0.2/parquet/pqarrow/file_writer.go:186 +0x83b
main.main()
	/home/joachim/src/goplayground/play/main.go:66 +0x472

Runs on

v14, supposedly on master, but I new enough in go that I'm not really sure it's really master and not the latest stable v14 tag, go version go1.21.4 linux/amd64, ubuntu 22.04 LTS

Component(s)

Go, Parquet

@ewjoachim
Copy link
Author

(In case this helps, I've found #37102 which is somewhat close but different, and that was fixed. Maybe the issue lies somewhere close)

@zeroshade
Copy link
Member

The issue is likely similar given the stack trace you're seeing.

(I have no idea whether this is the right way to build a parquet with the lib, feel free to point me to better ways if there are)

That's absolutely the correct way to do it, with the exception that you're missing a defer fileWriter.Close() 😄

I'll take a look and see if i can figure out the issue here. With some luck I'll have a fix before we do the code freeze for v15

@zeroshade
Copy link
Member

@ewjoachim please have a look at the linked PR and verify that it fixes things for you. Thanks!

@ewjoachim
Copy link
Author

If my understanding of go mod is not absolutely bananas, then I can confirm that it works with the branch :D

Thanks a lot for your reactivity !

(At some point in the near future, I may submit other bugs I've found but I'll need to dive deeper and make sure the problem is real and comes from arrow. And even if we'll need to wait for the following release, it will not be that much of a problem)

@ewjoachim
Copy link
Author

ewjoachim commented Jan 1, 2024

(as a follow-up, I cleared my stack of other issues and none of them lead to opening new issues here)

@mapleFU
Copy link
Member

mapleFU commented Jan 2, 2024

I'll waiting for Matt to merge this patch

zeroshade added a commit that referenced this issue Jan 8, 2024
…39347)

### Rationale for this change
If using the DeltaBinaryPacked encoding, we end up with a nil pointer dereference if we end up with an empty column.

### What changes are included in this PR?
Add a nil check in `EstimatedDataEncodedSize` for the base `deltaBitPackEncoder`. This should only ever occur if we have an empty column with this encoding when closing a row group.

### Are these changes tested?
Yes a unit test was added to verify the fix.

* Closes: #39309

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@zeroshade zeroshade added this to the 16.0.0 milestone Jan 8, 2024
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…ked (apache#39347)

### Rationale for this change
If using the DeltaBinaryPacked encoding, we end up with a nil pointer dereference if we end up with an empty column.

### What changes are included in this PR?
Add a nil check in `EstimatedDataEncodedSize` for the base `deltaBitPackEncoder`. This should only ever occur if we have an empty column with this encoding when closing a row group.

### Are these changes tested?
Yes a unit test was added to verify the fix.

* Closes: apache#39309

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…ked (apache#39347)

### Rationale for this change
If using the DeltaBinaryPacked encoding, we end up with a nil pointer dereference if we end up with an empty column.

### What changes are included in this PR?
Add a nil check in `EstimatedDataEncodedSize` for the base `deltaBitPackEncoder`. This should only ever occur if we have an empty column with this encoding when closing a row group.

### Are these changes tested?
Yes a unit test was added to verify the fix.

* Closes: apache#39309

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…ked (apache#39347)

### Rationale for this change
If using the DeltaBinaryPacked encoding, we end up with a nil pointer dereference if we end up with an empty column.

### What changes are included in this PR?
Add a nil check in `EstimatedDataEncodedSize` for the base `deltaBitPackEncoder`. This should only ever occur if we have an empty column with this encoding when closing a row group.

### Are these changes tested?
Yes a unit test was added to verify the fix.

* Closes: apache#39309

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants