Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ruby] Add option to keep/merge join keys in Table#join #15287

Closed
heronshoes opened this issue Jan 10, 2023 · 13 comments · Fixed by #33654
Closed

[Ruby] Add option to keep/merge join keys in Table#join #15287

heronshoes opened this issue Jan 10, 2023 · 13 comments · Fixed by #33654

Comments

@heronshoes
Copy link
Contributor

Describe the enhancement requested

Target method

Arrow::Table#join

Proposed feature

Add an option to select whether to keep or to merge columns in the join result.

Impact of this request

Current behavior is always keep every columns. But it will be convenient to set merge as a default.

For example, default in dplyr is keep = FALSE (Do not preserve both join keys).

Reference of this request

dplyr-join in R.

https://github.com/apache/arrow/blob/master/r/R/dplyr-join.R

Join keys are always merged in RedAmber (without an option).

https://github.com/heronshoes/red_amber/blob/main/lib/red_amber/data_frame_combinable.rb#L836-L842

Component(s)

Ruby

@kou
Copy link
Member

kou commented Jan 11, 2023

How about left_suffix=nil, right_suffix=nil like pandas?
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

If both of left_suffix and right_suffix are nil, we don't return duplicated keys.
If any of left_suffix and right_suffix are not nil, we return both keys in left and right with suffix.

For merging, how about merge_key_values=true? If both left_suffix and right_suffix are nil and merge_key_values is true, fill NULL values in a left key with values in a right key. (Hmm. I'm not sure this is suitable behavior. If left key has multiple NULL values and outer join is used, can we distinct NULL values in left key and NULL values caused by join?)

@kou
Copy link
Member

kou commented Jan 11, 2023

left_suffix="", right_suffix="" that generates the same name columns may be better to keep backward compatibility...

@heronshoes
Copy link
Contributor Author

heronshoes commented Jan 11, 2023

I think it will work as long as C++ join() is correct :-) It will be ok to take or to merge.

These are current implementation.

left = Arrow::Table.new(KEY: ["A", "B", nil], X1: [1, 2, 3])
right = Arrow::Table.new(KEY: ["A", "C", nil], X2: [4, 5, 6])

left.join(right, :KEY, type: :left_outer)
#=>
#<Arrow::Table:0x116f5d158 ptr=0x7fa61e50bc50>
	KEY	X1	KEY	    X2
0	A  	 1	A  	     4
1	B  	 2	(null)	(null)
2	(null)	 3	(null)	(null)

left.join(right, :KEY, type: :right_outer)
#=>
#<Arrow::Table:0x116f51bf0 ptr=0x7fa61e50bd40>
	KEY	    X1	KEY	X2
0	A  	     1	A  	 4
1	(null)	(null)	C  	 5
2	(null)	(null)	(null)	 6

left.join(right, :KEY, type: :full_outer)
#=>
#<Arrow::Table:0x116f18378 ptr=0x7fa61e06f0e0>
	KEY	    X1	KEY	    X2
0	A  	     1	A  	     4
1	B  	     2	(null)	(null)
2	(null)	     3	(null)	(null)
3	(null)	(null)	C  	     5
4	(null)	(null)	(null)	     6

left.join(right, :KEY, type: :inner)
#=>
#<Arrow::Table:0x11a048128 ptr=0x7fa61d5edd10>
	KEY	X1	KEY	X2
0	A  	 1	A  	 4

left.join(right, :KEY, type: :left_semi)
#=>
#<Arrow::Table:0x11a0ea748 ptr=0x7fa61d531cb0>
	KEY	X1
0	A  	 1

left.join(right, :KEY, type: :right_semi)
#=>
#<Arrow::Table:0x11a0e2c00 ptr=0x7fa61d581190>
	KEY	X2
0	A  	 4

left.join(right, :KEY, type: :left_anti)
#=>
#<Arrow::Table:0x11a0cbb40 ptr=0x7fa61e3dde50>
	KEY	X1
0	B  	 2
1	(null)	 3

left.join(right, :KEY, type: :right_anti)
#=>
#<Arrow::Table:0x11a0c56f0 ptr=0x7fa61e3ddee0>
	KEY	X2
0	C  	 5
1	(null)	 6

@heronshoes
Copy link
Contributor Author

heronshoes commented Jan 11, 2023

Is it better to ignore left_suffix and right_suffix when merge_key_values is true?
I think only one key is used when column is merged. It may be left_key and do not need renaming.

Are default values left_suffix="", right_suffix="", merge_key_values=false to ensure backward compatibility?

@kou
Copy link
Member

kou commented Jan 11, 2023

Is it better to ignore left_suffix and right_suffix when merge_key_values is true?
I think only one key is used when column is merged. It may be left_key and do not need renaming.

OK. Then, how about table.join(right, :KEY) merges KEY values instead of adding new merge_key_values option? (table.join(right, [:KEY]) and table.join(right, {left: :KEY, right: :KEY}) don't merge.)

Are default values left_suffix="", right_suffix="", merge_key_values=false to ensure backward compatibility?

Yes but we can change the default keys. If column_names & right.column_names only have 1 element, we can use (column_names & right.column_names)[0] to merge key values by default. Because keys=nil isn't released yet. (We don't need to think about backward compatibility.)

@westonpace
Copy link
Member

One thing to consider is that it is possible for the left and right side to have a non-key column that has the same name. For example:

CREATE TABLE products (
    id int,
    region_id int,
    description string,
); 
CREATE TABLE regions (
    id int,
    description string
); 
SELECT * FROM products JOIN regions ON id;

@heronshoes
Copy link
Contributor Author

heronshoes commented Jan 12, 2023

@westonpace I think non-key common columns (description) will be preserved. Arrow's Table allows duplicated column names.

left  = Arrow::Table.new(id: [1, 2, 3], region_id: [7, 8, 9], description: %w[A B C])
right = Arrow::Table.new(id: [1, 2, 4],                       description: %w[D E F])

puts left.join(right, :id, type: :full_outer)   # current implementation (1)
puts left.join(right, [:id], type: :full_outer) # is same as this        (2)
#=>
	    id	region_id	description	    id	description
0	     1	        7	A          	     1	D          
1	     2	        8	B          	     2	E          
2	     3	        9	C          	(null)	     (null)
3	(null)	   (null)	     (null)	     4	F

puts left.join(right, :id, type: :full_outer) # this will merge key      (3)
#=>
	id	region_id	description	description
0	 1	        7	A          	D          
1	 2	        8	B          	E          
2	 3	        9	C          	     (null)
3	 4	   (null)	     (null)	F

@kou Is above same as your thought?

@heronshoes
Copy link
Contributor Author

I will add candidates for natural keys (keys=nil) case.

puts left.join(right, keys=nil, type: :full_outer)  # if keep them      (4)
#=>
	    id	region_id	description	    id	description
0	     1	        7	A          	(null)	     (null)
1	     2	        8	B          	(null)	     (null)
2	     3	        9	C          	(null)	     (null)
3	(null)	   (null)	     (null)	     1	D          
4	(null)	   (null)	     (null)	     2	E          
5	(null)	   (null)	     (null)	     4	F

#=> or merge them if they are natural keys.      (5)
	id	region_id	description
0	 1	        7	A          
1	 2	        8	B          
2	 3	        9	C          
3	 1	   (null)	D          
4	 2	   (null)	E          
5	 4	   (null)	F

@kou
Copy link
Member

kou commented Jan 12, 2023

@kou Is above same as your thought?

Yes.

@kou
Copy link
Member

kou commented Jan 12, 2023

I will add candidates for natural keys (keys=nil) case.

I didn't know NATURAL JOIN in SQL!

We can choose (5) because it's same behavior as SQL's NATURAL JOIN.

@heronshoes
Copy link
Contributor Author

Thank you! I will try to make PR with (2), (3) and (5).

@kou
Copy link
Member

kou commented Jan 13, 2023

I think that (2) is needless. I think that the current implementation already implements (2).

heronshoes added a commit to heronshoes/arrow that referenced this issue Jan 13, 2023
@heronshoes
Copy link
Contributor Author

That's correct. Sorry for the confusion.

heronshoes added a commit to heronshoes/arrow that referenced this issue Jan 14, 2023
heronshoes added a commit to heronshoes/arrow that referenced this issue Jan 16, 2023
kou added a commit that referenced this issue Jan 17, 2023
# Rationale for this change

Current implementation is always preserve column of join key. It is convenient if columns are merged.

# What changes are included in this PR?

- Columns from left and right are marged if;
  - Join key is a String or a Symbol (<= incompatible Change)
  - Join key is nil (natural join) (<= change in unreleased feature)
- New options `left_suffix=""` and `right_suffix=""` are introduced.
  - If it is empty (by default), join key(s) do not change.
  - If it is not empty, the suffix is appended to join key(s).

# Are these changes tested?

Yes.

# Are there any user-facing changes?

There are incompatible change when join key is a String or a Symbol.

* Closes: #15287

Lead-authored-by: Hirokazu SUZUKI <heronshoes877@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@kou kou added this to the 12.0.0 milestone Jan 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants