Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Go][Parquet] RLE problem with ''(blank)" #34603

Closed
allinux opened this issue Mar 17, 2023 · 2 comments · Fixed by #34709
Closed

[Go][Parquet] RLE problem with ''(blank)" #34603

allinux opened this issue Mar 17, 2023 · 2 comments · Fixed by #34709
Assignees
Milestone

Comments

@allinux
Copy link

allinux commented Mar 17, 2023

Describe the bug, including details regarding any error messages, version, and platform.

There is a bug where '' is loaded as \x00 when WithDictionaryDefault of parquet.WriterProperties is set to true when all data in the column is ''.
If processed as false, it is normally loaded as ''.
It seems to occur during RLE processing, but need to check.

sample_parquet.zip contains src.parquet and dest.parquet files, src.parquet is a file with '' and dest.parquet is a file reloaded with WithDictionaryDefault set to true.

  • python code for check
In [1]: import pandas as pd
In [2]: src = pd.read_parquet('src.parquet')
In [3]: src
Out[3]: 
  reserved1
0          
1          
2          
3          
4          
5          

In [4]: src.iloc[0].to_dict()
Out[4]: {'reserved1': ''}

In [5]: dest = pd.read_parquet('dest.parquet')

In [6]: dest
Out[6]: 
  reserved1
0         
1         
2         
3         
4         
5         

In [7]: dest.iloc[0].to_dict()
Out[7]: {'reserved1': '\x00'}

sample_parquet.zip

Component(s)

Go

@allinux allinux changed the title RLE problem with ''(blank) [GO] RLE problem with ''(blank) Mar 18, 2023
@zeroshade zeroshade changed the title [GO] RLE problem with ''(blank) [Go][Parquet] RLE problem with ''(blank)" Mar 21, 2023
@zeroshade
Copy link
Member

@allinux can you provide the code you are using to perform the writes?

@allinux
Copy link
Author

allinux commented Mar 23, 2023

@zeroshade
Thanks for the commet.
Please see sample code below.
(For the file src.parquet, please refer to the previously attached sample_parquet.zip.)

package main

import (
	"context"
	"os"

	"github.com/apache/arrow/go/v12/arrow/memory"
	"github.com/apache/arrow/go/v12/parquet"
	"github.com/apache/arrow/go/v12/parquet/file"
	"github.com/apache/arrow/go/v12/parquet/pqarrow"
)

func main() {
	mem := memory.NewCheckedAllocator(memory.DefaultAllocator)

	src, _ := file.OpenParquetFile("src.parquet", false, file.WithReadProps(parquet.NewReaderProperties(mem)))
	defer src.Close()

	schema := src.MetaData().Schema

	arrowRdr, _ := pqarrow.NewFileReader(src, pqarrow.ArrowReadProperties{}, mem)

	arrowSchema, _ := pqarrow.FromParquet(schema, &pqarrow.ArrowReadProperties{}, src.MetaData().KeyValueMetadata())

	props := parquet.NewWriterProperties(
		parquet.WithDictionaryDefault(true), // \x00
	)

	dest, _ := os.OpenFile("dest.parquet", os.O_CREATE|os.O_APPEND|os.O_RDWR|os.O_TRUNC, 0644)
	arrowWriter, _ := pqarrow.NewFileWriter(arrowSchema, dest, props, pqarrow.DefaultWriterProps())
	defer arrowWriter.Close()
	
	arrowWriter.NewRowGroup()
	for i, numRowGroups := 0, src.NumRowGroups(); i < numRowGroups; i++ {
		rowGroupReader := arrowRdr.RowGroup(i)
		for i, numColumns := 0, schema.NumColumns(); i < numColumns; i++ {
			datas, _ := rowGroupReader.Column(i).Read(context.Background())
			if datas.Len() > 0 {
				for _, ar := range datas.Chunks() {
					arrowWriter.WriteColumnData(ar)
				}
			}
			datas.Release()
		}
	}
}

zeroshade added a commit to zeroshade/arrow that referenced this issue Mar 23, 2023
zeroshade added a commit that referenced this issue Mar 31, 2023
…#34709)

### Rationale for this change
Writing a dictionary encoded column consisting of empty strings ended up writing values that consisted of a string with a NUL character in them rather than actually writing an empty string. This fixes that issue and also cleans the code up a little bit in doing so.

### Are these changes tested?
A unit test is added to test for the behavior.

### Are there any user-facing changes?
Users who wrote dictionary ByteArray or FixedLenByteArray columns that contained empty strings will see this fixed when it comes to handling those empty strings rather than having written strings containing a single NUL character (`\x00`). 

* Closes: #34603

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@zeroshade zeroshade added this to the 12.0.0 milestone Mar 31, 2023
minyoung pushed a commit to minyoung/arrow that referenced this issue May 1, 2023
…trings (apache#34709)

Writing a dictionary encoded column consisting of empty strings ended up writing values that consisted of a string with a NUL character in them rather than actually writing an empty string. This fixes that issue and also cleans the code up a little bit in doing so.

A unit test is added to test for the behavior.

Users who wrote dictionary ByteArray or FixedLenByteArray columns that contained empty strings will see this fixed when it comes to handling those empty strings rather than having written strings containing a single NUL character (`\x00`).

* Closes: apache#34603

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…trings (apache#34709)

### Rationale for this change
Writing a dictionary encoded column consisting of empty strings ended up writing values that consisted of a string with a NUL character in them rather than actually writing an empty string. This fixes that issue and also cleans the code up a little bit in doing so.

### Are these changes tested?
A unit test is added to test for the behavior.

### Are there any user-facing changes?
Users who wrote dictionary ByteArray or FixedLenByteArray columns that contained empty strings will see this fixed when it comes to handling those empty strings rather than having written strings containing a single NUL character (`\x00`). 

* Closes: apache#34603

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants