-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Ruby] Add option to keep/merge join keys in Table#join #15287
Comments
How about If both of For merging, how about |
|
I think it will work as long as C++ join() is correct :-) It will be ok to take These are current implementation. left = Arrow::Table.new(KEY: ["A", "B", nil], X1: [1, 2, 3])
right = Arrow::Table.new(KEY: ["A", "C", nil], X2: [4, 5, 6])
left.join(right, :KEY, type: :left_outer)
#=>
#<Arrow::Table:0x116f5d158 ptr=0x7fa61e50bc50>
KEY X1 KEY X2
0 A 1 A 4
1 B 2 (null) (null)
2 (null) 3 (null) (null)
left.join(right, :KEY, type: :right_outer)
#=>
#<Arrow::Table:0x116f51bf0 ptr=0x7fa61e50bd40>
KEY X1 KEY X2
0 A 1 A 4
1 (null) (null) C 5
2 (null) (null) (null) 6
left.join(right, :KEY, type: :full_outer)
#=>
#<Arrow::Table:0x116f18378 ptr=0x7fa61e06f0e0>
KEY X1 KEY X2
0 A 1 A 4
1 B 2 (null) (null)
2 (null) 3 (null) (null)
3 (null) (null) C 5
4 (null) (null) (null) 6
left.join(right, :KEY, type: :inner)
#=>
#<Arrow::Table:0x11a048128 ptr=0x7fa61d5edd10>
KEY X1 KEY X2
0 A 1 A 4
left.join(right, :KEY, type: :left_semi)
#=>
#<Arrow::Table:0x11a0ea748 ptr=0x7fa61d531cb0>
KEY X1
0 A 1
left.join(right, :KEY, type: :right_semi)
#=>
#<Arrow::Table:0x11a0e2c00 ptr=0x7fa61d581190>
KEY X2
0 A 4
left.join(right, :KEY, type: :left_anti)
#=>
#<Arrow::Table:0x11a0cbb40 ptr=0x7fa61e3dde50>
KEY X1
0 B 2
1 (null) 3
left.join(right, :KEY, type: :right_anti)
#=>
#<Arrow::Table:0x11a0c56f0 ptr=0x7fa61e3ddee0>
KEY X2
0 C 5
1 (null) 6 |
Is it better to ignore Are default values |
OK. Then, how about
Yes but we can change the default |
One thing to consider is that it is possible for the left and right side to have a non-key column that has the same name. For example:
|
@westonpace I think non-key common columns ( left = Arrow::Table.new(id: [1, 2, 3], region_id: [7, 8, 9], description: %w[A B C])
right = Arrow::Table.new(id: [1, 2, 4], description: %w[D E F])
puts left.join(right, :id, type: :full_outer) # current implementation (1)
puts left.join(right, [:id], type: :full_outer) # is same as this (2)
#=>
id region_id description id description
0 1 7 A 1 D
1 2 8 B 2 E
2 3 9 C (null) (null)
3 (null) (null) (null) 4 F
puts left.join(right, :id, type: :full_outer) # this will merge key (3)
#=>
id region_id description description
0 1 7 A D
1 2 8 B E
2 3 9 C (null)
3 4 (null) (null) F @kou Is above same as your thought? |
I will add candidates for natural keys ( puts left.join(right, keys=nil, type: :full_outer) # if keep them (4)
#=>
id region_id description id description
0 1 7 A (null) (null)
1 2 8 B (null) (null)
2 3 9 C (null) (null)
3 (null) (null) (null) 1 D
4 (null) (null) (null) 2 E
5 (null) (null) (null) 4 F
#=> or merge them if they are natural keys. (5)
id region_id description
0 1 7 A
1 2 8 B
2 3 9 C
3 1 (null) D
4 2 (null) E
5 4 (null) F |
Yes. |
I didn't know We can choose (5) because it's same behavior as SQL's |
Thank you! I will try to make PR with (2), (3) and (5). |
I think that (2) is needless. I think that the current implementation already implements (2). |
That's correct. Sorry for the confusion. |
# Rationale for this change Current implementation is always preserve column of join key. It is convenient if columns are merged. # What changes are included in this PR? - Columns from left and right are marged if; - Join key is a String or a Symbol (<= incompatible Change) - Join key is nil (natural join) (<= change in unreleased feature) - New options `left_suffix=""` and `right_suffix=""` are introduced. - If it is empty (by default), join key(s) do not change. - If it is not empty, the suffix is appended to join key(s). # Are these changes tested? Yes. # Are there any user-facing changes? There are incompatible change when join key is a String or a Symbol. * Closes: #15287 Lead-authored-by: Hirokazu SUZUKI <heronshoes877@gmail.com> Co-authored-by: Sutou Kouhei <kou@clear-code.com> Co-authored-by: Sutou Kouhei <kou@cozmixng.org> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Describe the enhancement requested
Target method
Arrow::Table#join
Proposed feature
Add an option to select whether to keep or to merge columns in the join result.
Impact of this request
Current behavior is always keep every columns. But it will be convenient to set merge as a default.
For example, default in dplyr is
keep = FALSE
(Do not preserve both join keys).Reference of this request
dplyr-join in R.
https://github.com/apache/arrow/blob/master/r/R/dplyr-join.R
Join keys are always merged in RedAmber (without an option).
https://github.com/heronshoes/red_amber/blob/main/lib/red_amber/data_frame_combinable.rb#L836-L842
Component(s)
Ruby
The text was updated successfully, but these errors were encountered: