Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gcgarbage collector support for StringViewArray and BinaryViewArray #5513

Closed
Tracked by #5374
alamb opened this issue Mar 14, 2024 · 3 comments · Fixed by #5885 · May be fixed by #5707
Closed
Tracked by #5374

Add gcgarbage collector support for StringViewArray and BinaryViewArray #5513

alamb opened this issue Mar 14, 2024 · 3 comments · Fixed by #5885 · May be fixed by #5707
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Mar 14, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is part of the larger project to implement StringViewArray -- see #5374

In #5481 we added support for StringViewArray and ByteViewArray.

This ticket tracks adding a gc method to StringViewArray and ByteViewArray

After calling filter or take on a StringViewArray or ByteViewArray the backing variable length buffer may be much larger than necessary to store the results

So before an array may look like the following with significant "garbage" space

                                       ┌──────┐                 
                                       │......│                 
                                       │......│                 
┌────────────────────┐       ┌ ─ ─ ─ ▶ │Data1 │   Large buffer  
│       View 1       │─ ─ ─ ─          │......│  with data that 
├────────────────────┤                 │......│ is not referred 
│       View 2       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or 
└────────────────────┘                 │......│      View 2     
                                       │......│                 
   2 views, refer to                   │......│                 
  small portions of a                  └──────┘                 
     large buffer                                               
                                                                
                                                                

After GC it should look like

┌────────────────────┐                 ┌─────┐    After gc, only 
│       View 1       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│     data that is  
├────────────────────┤       ┌ ─ ─ ─ ▶ │Data2│    pointed to by  
│       View 2       │─ ─ ─ ─          └─────┘     the views is  
└────────────────────┘                                 left      
                                                                 
                                                                 
        2 views                                                  
                                                                 
                                                                 
                                                                 

Describe the solution you'd like
I would like to add a method called StringViewArray::gc (and ByteViewArray::gc) that will compact

I expect users of the arrow crates to invoke this function, not any of the arrow kernels themselves

Describe alternatives you've considered

We could also add the gc functionality as its own standalone kernel (e.g. kernels::gc) rather than a method on the array.

Additional context
This GC is what is described in https://pola.rs/posts/polars-string-type/

What I consider the biggest downside is that we have to do garbage collection. When we gather/filter from an array with allocated long strings, we might keep strings alive that are not used anymore. This requires us to use some heuristics to determine when we will do a garbage collection pass on the string column. And because they are heuristics, sometimes they will be unneeded.

@alamb
Copy link
Contributor Author

alamb commented May 6, 2024

BTW @RinChanNOWWW has implemented StringViewArray / BinaryViewArray --> StringArray / BinaryArray in #5704

I think the same basic pattern can be applied to implement gc (aka copy each view value into a destination buffer)

@alamb
Copy link
Contributor Author

alamb commented May 9, 2024

Another potential idea came up in discord today which was to also implement some way of "interning" strings (aka track which strings have been seen before and remove duplicates)

https://discord.com/channels/885562378132000778/885562378132000781/1238127788788285521

The arrow dictionary array builder does this already so we could borrow its implementation

@XiangpengHao
Copy link
Contributor

(sorry for jumping into this from nowhere)
I'm trying to push this forward by summarizing the todo items:

  • Implement a dumb GC: collect the values into new buffers through view type builders. It's up to the user to call GC and ensure the call to GC is beneficial, i.e., the benefits of GC (smaller memory footprint) are larger than the overhead (recreating stuff). Users typically should only call GC after slice or filter (i.e., a significant decrease in cardinality).
  • Discuss and justify a compact check API -- if the users want to be smart and only GC when the view array is sparse.
  • As discussed in Add gcgarbage collector support for StringViewArray and BinaryViewArray #5513 (comment), we can be smart when constructing string arrays by having a hash table to deduplicate the strings. I believe this no only applies to view arrays but also byte arrays.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
2 participants