dgd-contributor commented Jun 10, 2021

What changes were proposed in this pull request?

When a Dataset is self-joined with joinWith, the inner join appears to be executed as a Cartesian join: every pair of rows is returned, regardless of the join condition.

To reproduce the issue:

val df = spark.range(0,3)
df.joinWith(df, df("id") === df("id")).show()

Before this pull request, the result is:
+---+---+
| _1| _2|
+---+---+
|  0|  0|
|  0|  1|
|  0|  2|
|  1|  0|
|  1|  1|
|  1|  2|
|  2|  0|
|  2|  1|
|  2|  2|
+---+---+

The expected result is:
+---+---+
| _1| _2|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
+---+---+
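
The difference between the two tables can be modeled without Spark. The following is a minimal sketch in plain Scala, not Spark's implementation: `SelfJoinSketch`, `innerJoin`, `before`, and `expected` are hypothetical names introduced here, and it assumes the likely cause is that both sides of `df("id") === df("id")` refer to the same column, so the condition holds for every pair of rows.

```scala
object SelfJoinSketch {
  // Stand-in for spark.range(0, 3): a single "id" column holding 0, 1, 2.
  val ids: Seq[Long] = Seq(0L, 1L, 2L)

  // Nested-loop inner join: keep every (left, right) pair that satisfies cond.
  def innerJoin(left: Seq[Long], right: Seq[Long])(
      cond: (Long, Long) => Boolean): Seq[(Long, Long)] =
    for { l <- left; r <- right if cond(l, r) } yield (l, r)

  // Assumption: a condition that compares a column with itself is always
  // true, so the join keeps all 3 x 3 = 9 pairs (the "before" table).
  val before: Seq[(Long, Long)] = innerJoin(ids, ids)((_, _) => true)

  // Intended meaning: compare the left copy's id with the right copy's id,
  // which keeps only the matching pairs (the "expected" table).
  val expected: Seq[(Long, Long)] = innerJoin(ids, ids)((l, r) => l == r)
}
```

Evaluating `SelfJoinSketch.before` gives the nine pairs of the first table, and `SelfJoinSketch.expected` gives the three matching pairs of the second.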

Why are the changes needed?

Correctness: the self-join currently returns a Cartesian product instead of only the matching rows.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A test was added.

@AmplabJenkins

Can one of the admins verify this patch?

github-actions bot added the SQL label Jun 10, 2021
dgd-contributor changed the title from "[WIP][SPARK-35652][SQL] joinWith on two table generated from same one" to "[SPARK-35652][SQL] joinWith on two table generated from same one" Jun 11, 2021
@dgd-contributor
Author

Link SPARK-6231

@dgd-contributor
Author

cc @rxin @marmbrus @cloud-fan @MaxGekk
Could you review this? Thanks!

@cloud-fan
Contributor

cloud-fan commented Jun 11, 2021

thanks, merging to master/3.1!

cloud-fan closed this in 6e1aa15 Jun 11, 2021
cloud-fan pushed a commit that referenced this pull request Jun 11, 2021
Closes #32863 from dgd-contributor/SPARK-35652_join_and_joinWith_in_seft_joining.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6e1aa15)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Contributor

@dgd-contributor can you open a backport pr for 3.0? thanks!

@dgd-contributor
Author

> @dgd-contributor can you open a backport pr for 3.0? thanks!

OK, on the way.

dgd-contributor pushed a commit to dgd-contributor/spark that referenced this pull request Jun 14, 2021
Closes apache#32863 from dgd-contributor/SPARK-35652_join_and_joinWith_in_seft_joining.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jun 17, 2021
Closes #32863 from dgd-contributor/SPARK-35652_join_and_joinWith_in_seft_joining.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Closes #32899 from dgd-contributor/branch-3.0.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
Closes apache#32863 from dgd-contributor/SPARK-35652_join_and_joinWith_in_seft_joining.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6e1aa15)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
Closes apache#32863 from dgd-contributor/SPARK-35652_join_and_joinWith_in_seft_joining.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6e1aa15)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>