Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to read values from (Dictionary) Builders #35711

Closed
metalmatze opened this issue May 22, 2023 · 1 comment · Fixed by #35744
Closed

Add ability to read values from (Dictionary) Builders #35711

metalmatze opened this issue May 22, 2023 · 1 comment · Fixed by #35744

Comments

@metalmatze
Copy link
Contributor

Describe the enhancement requested

We are currently rewriting some of our components to use Arrow and when appending data to the record, we want to basically deduplicate/aggregate the data before adding new rows. For that, we either need to keep track of the previously appended data outside of Arrow, or (and that's what we want to enhance) we read back the data from the underlying builders.

I was able to find several occurrences of reading values from builders.

Next, to potentially adding the *Builder Value(i int) * functions to some types we would mostly be interested in reading back values from a *BinaryDictionaryBuilder.
I've managed to read the indices back from the builders, but then failed and stopped trying to read the values from the underlying BinaryMemoTable.

It would be super helpful to read back those strings from the Dictionary's BinaryMemoTable. Is this something we can somehow add? Is this something within the scope of this library?

diff --git a/go/arrow/array/dictionary.go b/go/arrow/array/dictionary.go
index 2409e296c..622dece04 100644
--- a/go/arrow/array/dictionary.go
+++ b/go/arrow/array/dictionary.go

@@ -1208,6 +1209,54 @@ func (b *BinaryDictionaryBuilder) InsertStringDictValues(arr *String) (err error
 	return
 }
 
+func (b *BinaryDictionaryBuilder) GetValueIndex(i int) int {
+	switch b := b.idxBuilder.Builder.(type) {
+	case *Int64Builder:
+		return int(b.Value(i))
+	case *Uint64Builder:
+		return int(b.Value(i))
+	case *Float64Builder:
+		return int(b.Value(i))
+	case *Int32Builder:
+		return int(b.Value(i))
+	case *Uint32Builder:
+		return int(b.Value(i))
+	case *Float32Builder:
+		return int(b.Value(i))
+	case *Int16Builder:
+		return int(b.Value(i))
+	case *Uint16Builder:
+		return int(b.Value(i))
+	case *Int8Builder:
+		return int(b.Value(i))
+	case *Uint8Builder:
+		return int(b.Value(i))
+	case *TimestampBuilder:
+		return int(b.Value(i))
+	case *Time32Builder:
+		return int(b.Value(i))
+	case *Time64Builder:
+		return int(b.Value(i))
+	case *Date32Builder:
+		return int(b.Value(i))
+	case *Date64Builder:
+		return int(b.Value(i))
+	case *DurationBuilder:
+		return int(b.Value(i))
+	default:
+		return -1
+	}
+}
+
+func (b *BinaryDictionaryBuilder) Value(i int) []byte {
+	return []byte{}
+}
+
+func (b *BinaryDictionaryBuilder) ValueStr(i int) string {
+	//b.memoTable.(*hashing.BinaryMemoTable).
+	return ""
+}
+
 type FixedSizeBinaryDictionaryBuilder struct {
 	dictionaryBuilder
 	byteWidth int

Component(s)

Go

@zeroshade
Copy link
Member

@metalmatze this seems like a good idea (though for the dictionary builder getting an index you only have to handle the integral types, int8/16/32/64 and uint8/16/32/64, as all the other types are not valid dictionary index types).

For a binary memo table, since it maintains a binary builder internally and only stores the dictionary indices in the hash table as the values, it should be pretty easy to add a Value method which takes in an index and returns the value by calling b.builder.Value(idx). The other types would be a bit harder and would likely require maintaining some sort of reverse mapping.

Would you be willing to make a PR for this? At least for the binary dictionary types since that would be pretty easy to implement as suggested above.

zeroshade pushed a commit that referenced this issue May 30, 2023
…rs (#35744)

### Rationale for this change

See #35711 

### What changes are included in this PR?

Adding `*BinaryDictionaryBuilder.GetValueIndex(i int) int` to get the value's index in the dictionary. Then two methods to get a `[]byte` and `string` results. `Value(i int) []byte` and `ValueStr(i int) string`.

### Are these changes tested?

So far I have added very little testing. Mostly some sort of integration test to read back values from the BinaryDictionaryBuilder. We should definitely add some test for the `*BinaryMemoTable) Value(i int) []byte`. 

### Are there any user-facing changes?

Just additions, no breaking API changes. 

cc @ zeroshade 
* Closes: #35711

Authored-by: Matthias Loibl <mail@matthiasloibl.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@zeroshade zeroshade added this to the 13.0.0 milestone May 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants