- Author: Ben Du
- Date: 2020-06-17 15:24:04
- Title: Inner Join of Spark DataFrames
- Slug: spark-dataframe-inner-join
- Category: Computer Science
- Tags: Computer Science, Spark, DataFrame, inner join, big data

https://spark.apache.org/docs/latest/sql-programming-guide.html

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions

Good ways to avoid duplicate columns after a join:

1. Select only the required columns before joining.

2. Rename columns (the join columns in particular) before joining.
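
Both approaches can be sketched as follows. This is a minimal illustration rather than a definitive recipe: it assumes a local `SparkSession` (a notebook or `spark-shell` usually provides one already), and the name `upload_duration` is made up for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

val left = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10)).toDF("name", "date", "duration")
val right = Seq(("alice", 100, 1), ("bob", 23, 2)).toDF("name", "upload", "duration")

// 1. Keep only the columns needed from the right side before joining.
val slim = left.join(right.select("name", "upload"), Seq("name"))

// 2. Rename the clashing non-key column (hypothetical new name) so both values
//    survive under distinct names.
val renamed = left.join(right.withColumnRenamed("duration", "upload_duration"), Seq("name"))

println(slim.columns.mkString(", "))
println(renamed.columns.mkString(", "))
```

Either way, every column in the result has a unique name, so later `select` calls by string name stay unambiguous.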

## Same Names in Both Tables

In [1]:
val left = Seq(
    ("bob")
).toDF
left.show

+-----+
|value|
+-----+
|  bob|
+-----+



left = [value: string]


[value: string]

In [1]:
val left = Seq(
    ("bob", "2015-01-13", 4), 
    ("alice", "2015-04-23",10)
).toDF("name","date","duration")
left.show()

+-----+----------+--------+
| name|      date|duration|
+-----+----------+--------+
|  bob|2015-01-13|       4|
|alice|2015-04-23|      10|
+-----+----------+--------+


llist = List((bob,2015-01-13,4), (alice,2015-04-23,10))
left = [name: string, date: string ... 1 more field]


[name: string, date: string ... 1 more field]

In [3]:
val right = Seq(("alice", 100),("bob", 23)).toDF("name","upload")
right.show()

+-----+------+
| name|upload|
+-----+------+
|alice|   100|
|  bob|    23|
+-----+------+



Duplicate columns appear if you use an expression as the join condition!

In [9]:
val df = left.join(right, left.col("name") === right.col("name"))
df.show()

+-----+----------+--------+-----+------+
| name|      date|duration| name|upload|
+-----+----------+--------+-----+------+
|  bob|2015-01-13|       4|  bob|    23|
|alice|2015-04-23|      10|alice|   100|
+-----+----------+--------+-----+------+



In [24]:
val df = left.join(right, left("name") === right("name"))
df.show()

+-----+----------+--------+-----+------+
| name|      date|duration| name|upload|
+-----+----------+--------+-----+------+
|  bob|2015-01-13|       4|  bob|    23|
|alice|2015-04-23|      10|alice|   100|
+-----+----------+--------+-----+------+
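
If you need to keep the expression-based join condition, one option is to drop the right-hand copy of the join column afterwards. The sketch below assumes a local `SparkSession` and the two-column `right` DataFrame defined above; it is an illustration, not the only way to do this.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("join-drop-sketch").getOrCreate()
import spark.implicits._

val left = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10)).toDF("name", "date", "duration")
val right = Seq(("alice", 100), ("bob", 23)).toDF("name", "upload")

// Expression join: both `name` columns are kept.
val joined = left.join(right, left("name") === right("name"))

// Drop the copy that came from `right` by passing its Column object.
val deduped = joined.drop(right("name"))

println(joined.columns.mkString(", "))
println(deduped.columns.mkString(", "))
```

Passing a `Column` object (rather than the string `"name"`) to `drop` tells Spark which of the two same-named columns to remove.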



Joining on column names (a single string or a Seq of strings) avoids duplicating the join column.

In [11]:
val df = left.join(right, Seq("name"))
df.show()

+-----+----------+--------+------+
| name|      date|duration|upload|
+-----+----------+--------+------+
|  bob|2015-01-13|       4|    23|
|alice|2015-04-23|      10|   100|
+-----+----------+--------+------+



In [12]:
val df = left.join(right, "name")
df.show()

+-----+----------+--------+------+
| name|      date|duration|upload|
+-----+----------+--------+------+
|  bob|2015-01-13|       4|    23|
|alice|2015-04-23|      10|   100|
+-----+----------+--------+------+



## Same Columns Not in Join

In [6]:
val llist = Seq(
    ("bob", "2015-01-13", 4), 
    ("alice", "2015-04-23", 10)
)
val left = llist.toDF("name", "date", "duration")
left.show()

+-----+----------+--------+
| name|      date|duration|
+-----+----------+--------+
|  bob|2015-01-13|       4|
|alice|2015-04-23|      10|
+-----+----------+--------+



In [5]:
val right = Seq(
    ("alice", 100, 1),
    ("bob", 23, 2)
).toDF("name", "upload", "duration")
right.show()

+-----+------+--------+
| name|upload|duration|
+-----+------+--------+
|alice|   100|       1|
|  bob|    23|       2|
+-----+------+--------+



Join the 2 DataFrames by the `name` column. 
Duplicate columns appear because the `duration` column is in both DataFrames.

In [8]:
val df = left.join(right, "name")
df.show

+-----+----------+--------+------+--------+
| name|      date|duration|upload|duration|
+-----+----------+--------+------+--------+
|  bob|2015-01-13|       4|    23|       2|
|alice|2015-04-23|      10|   100|       1|
+-----+----------+--------+------+--------+



In [13]:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.
    builder().
    appName("Spark SQL basic example").
    config("spark.some.config.option", "some-value").
    getOrCreate()
import sparkSession.implicits._

Selecting via string names works for non-duplicate columns.
An exception will be thrown if you select a duplicate column using its string name.

In [21]:
val df = left.alias("l").join(right.alias("r"), "name").select("name", "date")
df.show

+-----+----------+
| name|      date|
+-----+----------+
|  bob|2015-01-13|
|alice|2015-04-23|
+-----+----------+
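
The failure case can be sketched as follows, wrapping the ambiguous `select` in `scala.util.Try` so the snippet keeps running. It assumes a local `SparkSession` and the same sample data as above; the exact exception message differs between Spark versions, so none is shown.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("ambiguous-select-sketch").getOrCreate()
import spark.implicits._

val left = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10)).toDF("name", "date", "duration")
val right = Seq(("alice", 100, 1), ("bob", 23, 2)).toDF("name", "upload", "duration")

val joined = left.join(right, Seq("name"))

// `duration` exists on both sides, so selecting it by name cannot be resolved.
val ambiguous = Try(joined.select("duration"))
println(ambiguous.isFailure)

// Selecting through the Column object of the original DataFrame is unambiguous.
val fromLeft = joined.select(left("duration"))
fromLeft.show()
```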



Selecting via `Column` objects taken from the original DataFrames avoids the ambiguity.

In [19]:
val df = left.join(right, "name").select(left("name"), left("date"), left("duration"))
df.show

+-----+----------+--------+
| name|      date|duration|
+-----+----------+--------+
|  bob|2015-01-13|       4|
|alice|2015-04-23|      10|
+-----+----------+--------+



Using table aliases is probably the most convenient way (in terms of syntax).
Similar to SQL, 
you don't have to specify the table when there's no ambiguity. 

In [23]:
val df = left.alias("l").join(right.alias("r"), "name").
    select($"name", $"date", $"l.duration", $"upload")
df.show

+-----+----------+--------+------+
| name|      date|duration|upload|
+-----+----------+--------+------+
|  bob|2015-01-13|       4|    23|
|alice|2015-04-23|      10|   100|
+-----+----------+--------+------+



## Star in Select

Notice that `*` can be used to select all columns from a table.

In [4]:
val df = left.alias("l").join(right.alias("r"), "name").
    select($"l.*")
df.show

+-----+----------+--------+
| name|      date|duration|
+-----+----------+--------+
|  bob|2015-01-13|       4|
|alice|2015-04-23|      10|
+-----+----------+--------+



In [6]:
val df = left.alias("l").join(right.alias("r"), "name").
    select("l.*")
df.show

+-----+----------+--------+
| name|      date|duration|
+-----+----------+--------+
|  bob|2015-01-13|       4|
|alice|2015-04-23|      10|
+-----+----------+--------+



## Different Names for Joining

If you only need an inner join, 
it is suggested that you rename the join columns so they have the same name in both DataFrames.
Joining on that shared name then gives you

1. a minimal number of columns
2. no duplicate join columns

In [26]:
val left = Seq(
    ("bob", "2015-01-13", 4), 
    ("alice", "2015-04-23",10)
).toDF("name","date","duration")
left.show()

+-----+----------+--------+
| name|      date|duration|
+-----+----------+--------+
|  bob|2015-01-13|       4|
|alice|2015-04-23|      10|
+-----+----------+--------+



In [28]:
val right = Seq(
    ("alice", 100, 1),
    ("bob", 23, 2)
).toDF("nm", "upload", "duration")
right.show()

+-----+------+--------+
|   nm|upload|duration|
+-----+------+--------+
|alice|   100|       1|
|  bob|    23|       2|
+-----+------+--------+



In [30]:
left.join(right, $"name" === $"nm").show

+-----+----------+--------+-----+------+--------+
| name|      date|duration|   nm|upload|duration|
+-----+----------+--------+-----+------+--------+
|  bob|2015-01-13|       4|  bob|    23|       2|
|alice|2015-04-23|      10|alice|   100|       1|
+-----+----------+--------+-----+------+--------+
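
The renaming suggestion can be sketched as follows: rename `nm` to `name` on the right side, then join on the shared name. This assumes a local `SparkSession` and the same sample data as above; the variable name `renamedRight` is only illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this sketch.
val spark = SparkSession.builder().master("local[*]").appName("rename-join-sketch").getOrCreate()
import spark.implicits._

val left = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10)).toDF("name", "date", "duration")
val right = Seq(("alice", 100, 1), ("bob", 23, 2)).toDF("nm", "upload", "duration")

// Give the join column the same name on both sides, then join on that name.
val renamedRight = right.withColumnRenamed("nm", "name")
val joined = left.join(renamedRight, Seq("name"))

println(joined.columns.mkString(", "))
joined.show()
```

The join column now appears only once. Note that `duration` still appears twice here because it is not a join column; it can be handled with one of the techniques from the earlier sections (selecting only the required columns, renaming, or table aliases).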

