# Chapter 5: Join

[**5.1 Join**](#5.1-Join)   
[**5.2 Join Types**](#5.2-Join-Types)   
[**5.2.1 Inner Join**](#5.2.1-Inner-Join)   
[**5.2.2 Left Outer Join**](#5.2.2-Left-Outer-Join)   
[**5.2.3 Left Semi Join**](#5.2.3-Left-Semi-Join)   
[**5.2.4 Left Anti Join**](#5.2.4-Left-Anti-Join)   
[**5.2.5 Right Outer Join**](#5.2.5-Right-Outer-Join)   
[**5.2.6 Outer Join**](#5.2.6-Outer-Join)   
[**5.2.7 Natural Join**](#5.2.7-Natural-Join)   
[**5.2.8 Cross Join**](#5.2.8-Cross-Join)   
[**5.3 Complex Data Types Join**](#5.3-Complex-Data-Types-Join)   
[**5.4 Duplicate Columns in Join**](#5.4-Duplicate-Columns-in-Join)     
[**5.5 Optimization and Performance Tuning**](#5.5-Optimization-and-Performance-Tuning)  

#### 5.1 Join
`Join` is used to combine two or more datasets based on common key(s) between them. Relational databases are designed around normalization $^1$, which splits related data across tables, so joins are needed to bring it back together. Joins are heavily used in data transformation during ETL/ELT processes. If you are not familiar with joins, check out the [join in SQL chapter](https://github.com/analyticstensor/sql/blob/master/Chapter_4_Filtering_Join_Subquery/Course_Materials/sql-chapter_4.pdf). 

In Spark, the `join` operator is used to join two DataFrames. The operator and its parameters are described below:

Method:   
**join**(other, on=None, how=None)   
Parameters:   
**other**: DataFrame to be joined.      
**on**: column name or list of column names for the join condition. All the column names must exist in both DataFrames. It also accepts a join expression (or a list of expressions), which determines whether two rows should be joined or not.     
**how**: type of join, which determines what records will be in the result set. Must be one of `inner`, `left`, `left_outer`, `left_semi`, `left_anti`, `right`, `right_outer`, `outer`, `full`, or `cross`. The default value is `inner`.

For Example:  

`employees.join(dept_emp, employees.emp_no == dept_emp.emp_no, 'left')`

`employees.join(dept_emp, 'emp_no', 'left')`

`employees.join(dept_emp, 'emp_no')`

`employees.join(dept_emp, ['emp_no', 'dept_id'])`

`joinCond = [employees.emp_no == dept_emp.emp_no, employees.dept_id == dept_emp.dept_id]`   
`employees.join(dept_emp, joinCond, 'inner')`


The join condition is optional. It can be given either as part of the join operator or in a filter operator (i.e. `where` or `filter`) applied after the join.   
For example, the following inner joins return the same rows:   
`employees.join(dept_emp, employees.emp_no == dept_emp.emp_no, 'inner')`   
`employees.join(dept_emp).where(employees.emp_no == dept_emp.emp_no)`   
`employees.join(dept_emp).filter(employees.emp_no == dept_emp.emp_no)`

#### 5.2 Join Types
* **Inner Join**: Keeps rows from both the left and right DataFrames that have matching keys.
* **Left Outer Join**: Keeps all rows from the left DataFrame, whether or not they have matching keys in the right DataFrame, together with the matching rows from the right DataFrame for the join keys/condition.
* **Left Semi Join**: Keeps only the rows from the left DataFrame that have matching keys in the right DataFrame. Similar to `in`.
* **Left Anti Join**: Keeps only the rows from the left DataFrame that don't have matching keys in the right DataFrame. Similar to `not in`. 
* **Right Outer Join**: Keeps all rows from the right DataFrame, whether or not they have matching keys in the left DataFrame, together with the matching rows from the left DataFrame for the join keys/condition.
* **Full Outer Join**: Combination of Left Outer and Right Outer Join. 
* **Natural Join**: Performs a join by implicitly matching the columns that have the same name in both DataFrames.
* **Cross Join**: Performs a Cartesian join, where every row from the left DataFrame is paired with every row from the right DataFrame.

Table 5.2: Join Types

| Join Types | Spark SQL | DataFrame |
| --------- | ----------- | --------- |
| Inner Join | INNER | inner |
| Left Outer Join | LEFT OUTER | left, leftouter |
| Left Semi Join | LEFT SEMI | leftsemi |
| Left Anti Join | LEFT ANTI | leftanti |
| Right Outer Join | RIGHT OUTER | right, rightouter |
| Outer Join | FULL OUTER | outer, full, fullouter |
| Natural Join | NATURAL (may be combined with INNER, LEFT OUTER, RIGHT OUTER, or FULL OUTER) | no dedicated `how` value |
| Cross Join | CROSS | cross |

#### 5.2.1 Inner Join
`Inner Join` compares the join keys of the left and right DataFrames and keeps only the records whose keys match in both. Records with unmatched keys are left out of the resulting DataFrame. `inner` is the default join type, so it is used whenever no join type is specified.
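As a rough mental model, the behavior described above can be sketched in plain Python. This is not how Spark executes a join; it only mimics the result. The rows are a small sample in the shape of `employees` and `dept_emp`, and employee 99999 and the helper `inner_join` are made up for illustration.

```python
# Plain-Python sketch of inner-join semantics (not Spark's implementation).
employees = [
    {"emp_no": 10206, "first_name": "Alassane"},
    {"emp_no": 10623, "first_name": "Aleksander"},
    {"emp_no": 99999, "first_name": "Nobody"},  # hypothetical: no dept_emp row
]
dept_emp = [
    {"emp_no": 10206, "dept_no": "d005"},
    {"emp_no": 10623, "dept_no": "d005"},
    {"emp_no": 10623, "dept_no": "d008"},
]


def inner_join(left, right, key):
    """Return one merged row per (left, right) pair whose key values are equal."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


result = inner_join(employees, dept_emp, "emp_no")
for row in result:
    print(row)
```

Employee 10623 appears twice because it matches two `dept_emp` rows, and the unmatched employee is dropped. The same one-to-many effect is visible in the Spark outputs in this section.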

*Read Config File*

In [1]:
from script import conf

config_file = 'db_properties.ini'
config_section = 'mysql'
read_prop = conf.ReadProperties(config_file, config_section)

*Create Spark Session*

In [2]:
from pyspark.sql import SparkSession

# Create Spark Session
spark = SparkSession.builder \
    .master("local") \
    .appName("Chapter 5 Join") \
    .getOrCreate()

*Read MySQL employees table into Spark DataFrame*

In [3]:
# Read employees table
employees = spark.read.jdbc(url = read_prop.get_properties()['url'], table = 'employees', properties = read_prop.get_properties())

*Read MySQL dept_emp table into Spark DataFrame*

In [4]:
# Read dept_emp table
dept_emp = spark.read.jdbc(url = read_prop.get_properties()['url'], table = 'dept_emp', properties = read_prop.get_properties())

*Print Schema for employees and dept_emp DataFrame*

In [5]:
employees.printSchema()
dept_emp.printSchema()

root
 |-- emp_no: integer (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)

root
 |-- emp_no: integer (nullable = true)
 |-- dept_no: string (nullable = true)
 |-- from_date: date (nullable = true)
 |-- to_date: date (nullable = true)



**Inner Join with single join key**

In [6]:
# join employees and dept_emp in emp_no

employees.join(dept_emp, 'emp_no').show(10)

+------+----------+----------+---------+------+----------+-------+----------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|dept_no| from_date|   to_date|
+------+----------+----------+---------+------+----------+-------+----------+----------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19|   d005|1988-04-19|9999-01-01|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24|   d003|1990-11-02|1997-07-16|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07|   d005|1992-01-15|1996-02-11|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07|   d008|1996-02-11|9999-01-01|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26|   d007|1990-12-26|2000-01-24|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26|   d009|2000-01-24|9999-01-01|
| 11033|1957-03-01|   Shushma|     Bahk|     F|1990-10-02|   d005|1991-03-14|9999-01-01|
| 11141|1957-08-20|   Vasiliy|Kermarrec|     F|1989-12-28|   d005|1993-02-12|9999-01-01|
| 11317|1954-07-24|  

In [7]:
# join employees and dept_emp in emp_no

employees.join(dept_emp, 'emp_no', 'inner').show(10)

+------+----------+----------+---------+------+----------+-------+----------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|dept_no| from_date|   to_date|
+------+----------+----------+---------+------+----------+-------+----------+----------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19|   d005|1988-04-19|9999-01-01|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24|   d003|1990-11-02|1997-07-16|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07|   d005|1992-01-15|1996-02-11|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07|   d008|1996-02-11|9999-01-01|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26|   d007|1990-12-26|2000-01-24|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26|   d009|2000-01-24|9999-01-01|
| 11033|1957-03-01|   Shushma|     Bahk|     F|1990-10-02|   d005|1991-03-14|9999-01-01|
| 11141|1957-08-20|   Vasiliy|Kermarrec|     F|1989-12-28|   d005|1993-02-12|9999-01-01|
| 11317|1954-07-24|  

In [8]:
# join employees and dept_emp in emp_no

employees.join(dept_emp, employees.emp_no == dept_emp.emp_no, 'inner').show(10)

+------+----------+----------+---------+------+----------+------+-------+----------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_no|dept_no| from_date|   to_date|
+------+----------+----------+---------+------+----------+------+-------+----------+----------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19| 10206|   d005|1988-04-19|9999-01-01|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24| 10362|   d003|1990-11-02|1997-07-16|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d005|1992-01-15|1996-02-11|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d008|1996-02-11|9999-01-01|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d007|1990-12-26|2000-01-24|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d009|2000-01-24|9999-01-01|
| 11033|1957-03-01|   Shushma|     Bahk|     F|1990-10-02| 11033|   d005|1991-03-14|9999-01-01|
| 11141|1957-08-20|   Vasiliy|Kermarrec|

In [9]:
# join employees and dept_emp in emp_no

joinExpr = employees["emp_no"] == dept_emp["emp_no"]
employees.join(dept_emp, joinExpr, 'inner').show(10)

+------+----------+----------+---------+------+----------+------+-------+----------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_no|dept_no| from_date|   to_date|
+------+----------+----------+---------+------+----------+------+-------+----------+----------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19| 10206|   d005|1988-04-19|9999-01-01|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24| 10362|   d003|1990-11-02|1997-07-16|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d005|1992-01-15|1996-02-11|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d008|1996-02-11|9999-01-01|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d007|1990-12-26|2000-01-24|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d009|2000-01-24|9999-01-01|
| 11033|1957-03-01|   Shushma|     Bahk|     F|1990-10-02| 11033|   d005|1991-03-14|9999-01-01|
| 11141|1957-08-20|   Vasiliy|Kermarrec|

In [10]:
# join employees and dept_emp in emp_no

joinExpr = employees.emp_no == dept_emp.emp_no
employees.join(dept_emp, joinExpr, 'inner').show(10)

+------+----------+----------+---------+------+----------+------+-------+----------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_no|dept_no| from_date|   to_date|
+------+----------+----------+---------+------+----------+------+-------+----------+----------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19| 10206|   d005|1988-04-19|9999-01-01|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24| 10362|   d003|1990-11-02|1997-07-16|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d005|1992-01-15|1996-02-11|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d008|1996-02-11|9999-01-01|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d007|1990-12-26|2000-01-24|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d009|2000-01-24|9999-01-01|
| 11033|1957-03-01|   Shushma|     Bahk|     F|1990-10-02| 11033|   d005|1991-03-14|9999-01-01|
| 11141|1957-08-20|   Vasiliy|Kermarrec|

In [16]:
# join employees and dept_emp in emp_no

joinExpr = employees["emp_no"] == dept_emp["emp_no"]
joinType = "inner"
employees.join(dept_emp, joinExpr, joinType).show(10)

+------+----------+----------+---------+------+----------+------+-------+----------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_no|dept_no| from_date|   to_date|
+------+----------+----------+---------+------+----------+------+-------+----------+----------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19| 10206|   d005|1988-04-19|9999-01-01|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24| 10362|   d003|1990-11-02|1997-07-16|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d005|1992-01-15|1996-02-11|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d008|1996-02-11|9999-01-01|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d007|1990-12-26|2000-01-24|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d009|2000-01-24|9999-01-01|
| 11033|1957-03-01|   Shushma|     Bahk|     F|1990-10-02| 11033|   d005|1991-03-14|9999-01-01|
| 11141|1957-08-20|   Vasiliy|Kermarrec|

**Inner Join with multiple join keys having same column name**

In [11]:
# join dept_emp and dept_manager on emp_no and dept_no

# Read dept_manager table
dept_manager = spark.read.jdbc(url = read_prop.get_properties()['url'], table = 'dept_manager', properties = read_prop.get_properties())

dept_emp.join(dept_manager, ['emp_no', 'dept_no']).show(10)

+------+-------+----------+----------+----------+----------+
|emp_no|dept_no| from_date|   to_date| from_date|   to_date|
+------+-------+----------+----------+----------+----------+
|111692|   d009|1985-01-01|9999-01-01|1985-01-01|1988-10-17|
|110022|   d001|1985-01-01|9999-01-01|1985-01-01|1991-10-01|
|110039|   d001|1986-04-12|9999-01-01|1991-10-01|9999-01-01|
|111133|   d007|1986-12-30|9999-01-01|1991-03-07|9999-01-01|
|110567|   d005|1986-10-21|9999-01-01|1992-04-25|9999-01-01|
|110344|   d004|1985-11-22|9999-01-01|1988-09-09|1992-08-02|
|110228|   d003|1985-08-04|9999-01-01|1992-03-21|9999-01-01|
|110420|   d004|1992-02-05|9999-01-01|1996-08-30|9999-01-01|
|110800|   d006|1986-08-12|9999-01-01|1991-09-12|1994-06-28|
|110854|   d006|1989-06-09|9999-01-01|1994-06-28|9999-01-01|
+------+-------+----------+----------+----------+----------+
only showing top 10 rows



**Inner Join with multiple join keys having different column name**

* By creating the list of join condition.

In [13]:
# join dept_emp and dept_manager on emp_no and on department_number (dept_emp) = dept_no (dept_manager)

from pyspark.sql.functions import col

# create new column department_number in dept_emp DataFrame
dept_emp = dept_emp.withColumn("department_number", col("dept_no"))

# display top 10 records
dept_emp.show(10)

joinExpr = [dept_emp.emp_no == dept_manager.emp_no, dept_emp.department_number == dept_manager.dept_no]
joinType = "inner"

dept_emp.join(dept_manager, joinExpr, joinType).show(10)

+------+-------+----------+----------+-----------------+
|emp_no|dept_no| from_date|   to_date|department_number|
+------+-------+----------+----------+-----------------+
| 10001|   d005|1986-06-26|9999-01-01|             d005|
| 10002|   d007|1996-08-03|9999-01-01|             d007|
| 10003|   d004|1995-12-03|9999-01-01|             d004|
| 10004|   d004|1986-12-01|9999-01-01|             d004|
| 10005|   d003|1989-09-12|9999-01-01|             d003|
| 10006|   d005|1990-08-05|9999-01-01|             d005|
| 10007|   d008|1989-02-10|9999-01-01|             d008|
| 10008|   d005|1998-03-11|2000-07-31|             d005|
| 10009|   d006|1985-02-18|9999-01-01|             d006|
| 10010|   d004|1996-11-24|2000-06-26|             d004|
+------+-------+----------+----------+-----------------+
only showing top 10 rows

+------+-------+----------+----------+-----------------+------+-------+----------+----------+
|emp_no|dept_no| from_date|   to_date|department_number|emp_no|dept_no| from_date|

In [15]:
joinExpr = [dept_emp["emp_no"] == dept_manager["emp_no"], dept_emp["department_number"] == dept_manager["dept_no"]]
joinType = "inner"

dept_emp.join(dept_manager, joinExpr, joinType).show(10)

join1 = dept_emp.join(dept_manager, joinExpr, joinType)
join2 = join1.join(employees, employees["emp_no"] == dept_emp["emp_no"], joinType)

+------+-------+----------+----------+-----------------+------+-------+----------+----------+
|emp_no|dept_no| from_date|   to_date|department_number|emp_no|dept_no| from_date|   to_date|
+------+-------+----------+----------+-----------------+------+-------+----------+----------+
|111692|   d009|1985-01-01|9999-01-01|             d009|111692|   d009|1985-01-01|1988-10-17|
|110022|   d001|1985-01-01|9999-01-01|             d001|110022|   d001|1985-01-01|1991-10-01|
|110039|   d001|1986-04-12|9999-01-01|             d001|110039|   d001|1991-10-01|9999-01-01|
|111133|   d007|1986-12-30|9999-01-01|             d007|111133|   d007|1991-03-07|9999-01-01|
|110567|   d005|1986-10-21|9999-01-01|             d005|110567|   d005|1992-04-25|9999-01-01|
|110344|   d004|1985-11-22|9999-01-01|             d004|110344|   d004|1988-09-09|1992-08-02|
|110228|   d003|1985-08-04|9999-01-01|             d003|110228|   d003|1992-03-21|9999-01-01|
|110420|   d004|1992-02-05|9999-01-01|             d004|1104

In [14]:
joinExpr = [dept_emp["emp_no"] == dept_manager["emp_no"],
            dept_emp["department_number"] == dept_manager["dept_no"]
           ]
joinExpr2 = [employees["emp_no"] == dept_emp["emp_no"]]
joinType = "inner"

dept_emp.join(dept_manager, joinExpr, joinType).show(10)
dfAll = dept_emp.join(dept_manager, joinExpr, joinType).join(employees, joinExpr2, joinType)
dfAll.show(10)

+------+-------+----------+----------+-----------------+------+-------+----------+----------+
|emp_no|dept_no| from_date|   to_date|department_number|emp_no|dept_no| from_date|   to_date|
+------+-------+----------+----------+-----------------+------+-------+----------+----------+
|111692|   d009|1985-01-01|9999-01-01|             d009|111692|   d009|1985-01-01|1988-10-17|
|110022|   d001|1985-01-01|9999-01-01|             d001|110022|   d001|1985-01-01|1991-10-01|
|110039|   d001|1986-04-12|9999-01-01|             d001|110039|   d001|1991-10-01|9999-01-01|
|111133|   d007|1986-12-30|9999-01-01|             d007|111133|   d007|1991-03-07|9999-01-01|
|110567|   d005|1986-10-21|9999-01-01|             d005|110567|   d005|1992-04-25|9999-01-01|
|110344|   d004|1985-11-22|9999-01-01|             d004|110344|   d004|1988-09-09|1992-08-02|
|110228|   d003|1985-08-04|9999-01-01|             d003|110228|   d003|1992-03-21|9999-01-01|
|110420|   d004|1992-02-05|9999-01-01|             d004|1104

**Inner Join and Filter**

In [15]:
from pyspark.sql.functions import col

employees.join(dept_emp, 'emp_no','inner').where(col("first_name") == 'Georgi').show(10)

+------+----------+----------+---------+------+----------+-------+----------+----------+-----------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|dept_no| from_date|   to_date|department_number|
+------+----------+----------+---------+------+----------+-------+----------+----------+-----------------+
|285977|1960-09-26|    Georgi|     Gide|     M|1986-10-17|   d004|1986-10-17|9999-01-01|             d004|
|205890|1954-09-28|    Georgi|Chenoweth|     F|1987-06-27|   d007|1987-06-27|9999-01-01|             d007|
|497592|1959-04-22|    Georgi|   Ariola|     M|1986-03-27|   d007|1986-03-27|1995-10-05|             d007|
| 85237|1957-04-05|    Georgi| Chartres|     M|1985-08-30|   d005|1985-08-30|9999-01-01|             d005|
|420295|1952-09-13|    Georgi|Kuhnemann|     M|1989-11-07|   d002|1994-12-12|9999-01-01|             d002|
|493075|1953-07-19|    Georgi| Kaminger|     M|1987-01-22|   d009|1987-01-22|1995-07-13|             d009|
| 22787|1953-02-10|    Georgi|   Hebe

**Inner Join, Filter and Distinct**

In [16]:
employees.join(dept_emp, 'emp_no','inner')\
        .where(col("first_name") == 'Georgi')\
        .distinct().show(10)

+------+----------+----------+---------+------+----------+-------+----------+----------+-----------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|dept_no| from_date|   to_date|department_number|
+------+----------+----------+---------+------+----------+-------+----------+----------+-----------------+
|285977|1960-09-26|    Georgi|     Gide|     M|1986-10-17|   d004|1986-10-17|9999-01-01|             d004|
|205890|1954-09-28|    Georgi|Chenoweth|     F|1987-06-27|   d007|1987-06-27|9999-01-01|             d007|
|497592|1959-04-22|    Georgi|   Ariola|     M|1986-03-27|   d007|1986-03-27|1995-10-05|             d007|
| 85237|1957-04-05|    Georgi| Chartres|     M|1985-08-30|   d005|1985-08-30|9999-01-01|             d005|
|420295|1952-09-13|    Georgi|Kuhnemann|     M|1989-11-07|   d002|1994-12-12|9999-01-01|             d002|
|493075|1953-07-19|    Georgi| Kaminger|     M|1987-01-22|   d009|1987-01-22|1995-07-13|             d009|
| 22787|1953-02-10|    Georgi|   Hebe

In [17]:
# example with select statement
employees.join(dept_emp, 'emp_no','inner')\
        .where(col("first_name") == 'Georgi')\
        .select("first_name", "gender")\
        .distinct().show(10)

+----------+------+
|first_name|gender|
+----------+------+
|    Georgi|     F|
|    Georgi|     M|
+----------+------+



**Inner Join, Filter, Distinct, and Sort**

In [30]:
employees.join(dept_emp, 'emp_no','inner')\
    .where(col("first_name") == 'Georgi')\
    .distinct()\
    .orderBy("emp_no")\
    .show(10)

+------+----------+----------+-----------+------+----------+-------+----------+----------+-----------------+
|emp_no|birth_date|first_name|  last_name|gender| hire_date|dept_no| from_date|   to_date|department_number|
+------+----------+----------+-----------+------+----------+-------+----------+----------+-----------------+
| 10001|1953-09-02|    Georgi|    Facello|     M|1986-06-26|   d005|1986-06-26|9999-01-01|             d005|
| 10909|1954-11-11|    Georgi|    Atchley|     M|1985-04-21|   d005|1993-12-26|9999-01-01|             d005|
| 11029|1962-07-12|    Georgi|   Itzfeldt|     M|1992-12-27|   d007|1992-12-27|9999-01-01|             d007|
| 11430|1957-01-23|    Georgi|    Klassen|     M|1996-02-27|   d007|1997-07-06|9999-01-01|             d007|
| 12157|1960-03-30|    Georgi|    Barinka|     M|1985-06-04|   d003|1985-06-04|9999-01-01|             d003|
| 15220|1957-08-03|    Georgi|  Panienski|     F|1995-07-23|   d004|1995-07-23|9999-01-01|             d004|
| 15660|1956-01-13|

**Inner Join, Filter, and Grouping**

In [39]:
employees.join(dept_emp, 'emp_no','inner')\
    .where(col("first_name") == 'Georgi')\
    .groupBy("department_number").count()\
    .show()

+-----------------+-----+
|department_number|count|
+-----------------+-----+
|             d005|   67|
|             d009|   22|
|             d003|   18|
|             d001|   14|
|             d007|   45|
|             d004|   64|
|             d002|   14|
|             d006|   19|
|             d008|   15|
+-----------------+-----+



#### 5.2.2 Left Outer Join
`Left Outer Join` compares the join keys of the left and right DataFrames and keeps all the records from the left DataFrame together with the matching records from the right DataFrame. Where a left record has no matching key in the right DataFrame, the right-side columns are filled with `null`.
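The `null` filling can be illustrated with a minimal plain-Python sketch, where `None` stands in for `null`. It is not Spark's implementation, and the unmatched employee 99999 and the helper `left_outer_join` are assumptions made for the example.

```python
# Plain-Python sketch of left-outer-join semantics (not Spark's implementation).
employees = [
    {"emp_no": 10206, "first_name": "Alassane"},
    {"emp_no": 99999, "first_name": "Nobody"},  # hypothetical row with no match
]
dept_emp = [
    {"emp_no": 10206, "dept_no": "d005"},
]


def left_outer_join(left, right, key):
    """Keep every left row; fill right-side columns with None when nothing matches."""
    right_cols = [c for c in right[0] if c != key] if right else []
    joined = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            joined.extend({**l, **r} for r in matches)
        else:
            joined.append({**l, **dict.fromkeys(right_cols)})
    return joined


result = left_outer_join(employees, dept_emp, "emp_no")
for row in result:
    print(row)
```

Both employees survive the join; the one without a department row simply carries `None` in `dept_no`.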

In [18]:
# left outer join employees and dept_emp in emp_no

joinExpr = employees["emp_no"] == dept_emp["emp_no"]
joinType = "left_outer"
employees.join(dept_emp, joinExpr, joinType).show(10)

+------+----------+----------+---------+------+----------+------+-------+----------+----------+-----------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_no|dept_no| from_date|   to_date|department_number|
+------+----------+----------+---------+------+----------+------+-------+----------+----------+-----------------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19| 10206|   d005|1988-04-19|9999-01-01|             d005|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24| 10362|   d003|1990-11-02|1997-07-16|             d003|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d005|1992-01-15|1996-02-11|             d005|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d008|1996-02-11|9999-01-01|             d008|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d007|1990-12-26|2000-01-24|             d007|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d009|2000-01-24|9999

#### 5.2.3 Left Semi Join
`Left Semi Join` compares the join keys of the left and right DataFrames and keeps only the records from the left DataFrame that have matching keys in the right DataFrame. It doesn't include any columns from the right DataFrame. It is similar to `IN` and `EXISTS` in SQL.
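One way to see the difference from an inner join is a small plain-Python sketch. It only models the result, not Spark's execution; employee 99999 and the helper `left_semi_join` are hypothetical.

```python
# Plain-Python sketch of left-semi-join semantics (not Spark's implementation).
employees = [
    {"emp_no": 10206, "first_name": "Alassane"},
    {"emp_no": 10623, "first_name": "Aleksander"},
    {"emp_no": 99999, "first_name": "Nobody"},  # hypothetical: no dept_emp row
]
dept_emp = [
    {"emp_no": 10206, "dept_no": "d005"},
    {"emp_no": 10623, "dept_no": "d005"},
    {"emp_no": 10623, "dept_no": "d008"},
]


def left_semi_join(left, right, key):
    """Keep each left row at most once if its key exists on the right side."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] in right_keys]


result = left_semi_join(employees, dept_emp, "emp_no")
for row in result:
    print(row)
```

Unlike the inner join, employee 10623 appears only once even though it has two `dept_emp` rows, and no `dept_no` column is added. The right side acts purely as a filter.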

In [19]:
#  left semi join employees and dept_emp in emp_no

joinExpr = employees["emp_no"] == dept_emp["emp_no"]
joinType = "left_semi"
employees.join(dept_emp, joinExpr, joinType).show(10)

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26|
| 11033|1957-03-01|   Shushma|     Bahk|     F|1990-10-02|
| 11141|1957-08-20|   Vasiliy|Kermarrec|     F|1989-12-28|
| 11317|1954-07-24|  Shigeaki| Hagimont|     F|1989-12-21|
| 11458|1958-08-09|     Stevo|Chenoweth|     F|1985-10-06|
| 11748|1953-03-07|    Lihong| Massonet|     M|1992-12-20|
| 11858|1962-11-21|   Slavian|     Baik|     M|1988-11-12|
+------+----------+----------+---------+------+----------+
only showing top 10 rows



#### 5.2.4 Left Anti Join
`Left Anti Join` compares the join keys of the left and right DataFrames and keeps only the records from the left DataFrame that don't have matching keys in the right DataFrame. Basically, it is the reverse of a left semi join. It doesn't include any columns from the right DataFrame. It is similar to `NOT IN` and `NOT EXISTS` in SQL.
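A minimal plain-Python sketch, under the same caveat that it models only the result and not Spark's execution, shows that the anti join returns exactly the left rows a semi join would drop. The unmatched employee 99999 and both helper names are made up for illustration.

```python
# Plain-Python sketch of left-anti-join semantics (not Spark's implementation).
employees = [
    {"emp_no": 10206, "first_name": "Alassane"},
    {"emp_no": 99999, "first_name": "Nobody"},  # hypothetical: no dept_emp row
]
dept_emp = [
    {"emp_no": 10206, "dept_no": "d005"},
]


def left_semi_join(left, right, key):
    """Left rows whose key exists on the right side."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] in right_keys]


def left_anti_join(left, right, key):
    """Left rows whose key does NOT exist on the right side."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] not in right_keys]


matched = left_semi_join(employees, dept_emp, "emp_no")
unmatched = left_anti_join(employees, dept_emp, "emp_no")
print(unmatched)

# Together, the semi and anti results account for every left row exactly once.
print(len(matched) + len(unmatched) == len(employees))
```

An empty anti-join result, as in the Spark output for this section, is what one would expect when every left row has at least one match on the right.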

In [20]:
#  left anti join employees and dept_emp in emp_no

joinExpr = employees["emp_no"] == dept_emp["emp_no"]
joinType = "left_anti"
employees.join(dept_emp, joinExpr, joinType).show(10)

+------+----------+----------+---------+------+---------+
|emp_no|birth_date|first_name|last_name|gender|hire_date|
+------+----------+----------+---------+------+---------+
+------+----------+----------+---------+------+---------+



#### 5.2.5 Right Outer Join
`Right Outer Join` compares the join keys of the left and right DataFrames and keeps all the records from the right DataFrame together with the matching records from the left DataFrame. Where a right record has no matching key in the left DataFrame, the left-side columns are filled with `null`.
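A right outer join can be thought of as a left outer join with the two sides swapped. The plain-Python sketch below illustrates that idea only; it is not Spark's implementation, and the department row for `emp_no` 88888 and the helper names are hypothetical.

```python
# Plain-Python sketch: right outer join as a left outer join with sides swapped.
employees = [
    {"emp_no": 10206, "first_name": "Alassane"},
]
dept_emp = [
    {"emp_no": 10206, "dept_no": "d005"},
    {"emp_no": 88888, "dept_no": "d001"},  # hypothetical: no employees row
]


def left_outer_join(left, right, key):
    """Keep every left row; fill right-side columns with None when nothing matches."""
    right_cols = [c for c in right[0] if c != key] if right else []
    joined = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            joined.extend({**l, **r} for r in matches)
        else:
            joined.append({**l, **dict.fromkeys(right_cols)})
    return joined


def right_outer_join(left, right, key):
    """Keep every right row by swapping the roles of the two inputs."""
    return left_outer_join(right, left, key)


result = right_outer_join(employees, dept_emp, "emp_no")
for row in result:
    print(row)
```

Every `dept_emp` row is kept, and the row without an employee carries `None` in `first_name`.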

In [21]:
#  right outer join employees and dept_emp in emp_no

joinExpr = employees["emp_no"] == dept_emp["emp_no"]
joinType = "right_outer"
employees.join(dept_emp, joinExpr, joinType).show(10)

+------+----------+----------+---------+------+----------+------+-------+----------+----------+-----------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_no|dept_no| from_date|   to_date|department_number|
+------+----------+----------+---------+------+----------+------+-------+----------+----------+-----------------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19| 10206|   d005|1988-04-19|9999-01-01|             d005|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24| 10362|   d003|1990-11-02|1997-07-16|             d003|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d005|1992-01-15|1996-02-11|             d005|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d008|1996-02-11|9999-01-01|             d008|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d007|1990-12-26|2000-01-24|             d007|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d009|2000-01-24|9999

#### 5.2.6 Outer Join
`Outer Join` compares the join keys of the left and right DataFrames and keeps all the records from both DataFrames, whether their keys match or not. For records without a matching key, the columns of the other side are filled with `null`. It is a combination of left and right outer join and is sometimes referred to as `Full Join`.
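The combination of both outer joins can be sketched in plain Python as matched pairs, plus left-only rows, plus right-only rows. This is an illustrative model rather than Spark's implementation: it keeps a single key column for brevity, and the unmatched rows (99999 and 88888) and the helper `full_outer_join` are assumptions.

```python
# Plain-Python sketch of full-outer-join semantics (not Spark's implementation).
employees = [
    {"emp_no": 10206, "first_name": "Alassane"},
    {"emp_no": 99999, "first_name": "Nobody"},  # hypothetical: left-only row
]
dept_emp = [
    {"emp_no": 10206, "dept_no": "d005"},
    {"emp_no": 88888, "dept_no": "d001"},  # hypothetical: right-only row
]


def full_outer_join(left, right, key):
    """Matched pairs, then unmatched left rows, then unmatched right rows."""
    left_cols = [c for c in left[0] if c != key] if left else []
    right_cols = [c for c in right[0] if c != key] if right else []
    joined = []
    matched_right = set()
    for l in left:
        hits = [i for i, r in enumerate(right) if r[key] == l[key]]
        if hits:
            for i in hits:
                matched_right.add(i)
                joined.append({**l, **right[i]})
        else:
            joined.append({**l, **dict.fromkeys(right_cols)})
    for i, r in enumerate(right):
        if i not in matched_right:
            joined.append({**dict.fromkeys(left_cols), **r})
    return joined


result = full_outer_join(employees, dept_emp, "emp_no")
for row in result:
    print(row)
```

No row from either input is lost: one matched row, one row with `dept_no` set to `None`, and one row with `first_name` set to `None`.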

In [22]:
# outer join employees and dept_emp in emp_no

joinExpr = employees["emp_no"] == dept_emp["emp_no"]
joinType = "outer"
employees.join(dept_emp, joinExpr, joinType).show(10)

+------+----------+----------+---------+------+----------+------+-------+----------+----------+-----------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_no|dept_no| from_date|   to_date|department_number|
+------+----------+----------+---------+------+----------+------+-------+----------+----------+-----------------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19| 10206|   d005|1988-04-19|9999-01-01|             d005|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24| 10362|   d003|1990-11-02|1997-07-16|             d003|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d005|1992-01-15|1996-02-11|             d005|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d008|1996-02-11|9999-01-01|             d008|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d007|1990-12-26|2000-01-24|             d007|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d009|2000-01-24|9999

#### 5.2.7 Natural Join
`Natural Join` doesn't need explicit join keys; it implicitly joins on the columns that have the same name in both DataFrames. A join type such as left, right, or outer can also be specified.
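The implicit key selection can be modeled in plain Python. The sketch below is an illustration of the matching rule and not Spark code; the helper `natural_join` is hypothetical, and the sample rows follow the shapes of the tables used in this chapter.

```python
# Plain-Python sketch of natural-join key selection (not Spark's implementation).
def natural_join(left, right):
    """Join on every column name the two inputs share (assumes non-empty inputs)."""
    common = [c for c in left[0] if c in right[0]]
    joined = [
        {**l, **r}
        for l in left
        for r in right
        if all(l[c] == r[c] for c in common)
    ]
    return common, joined


employees = [{"emp_no": 10206, "first_name": "Alassane"}]
dept_emp = [{"emp_no": 10206, "dept_no": "d005"}]
common, joined = natural_join(employees, dept_emp)
print(common, joined)

# Both inputs here share four column names, so all four become join keys.
dept_emp_rows = [
    {"emp_no": 111692, "dept_no": "d009",
     "from_date": "1985-01-01", "to_date": "9999-01-01"},
]
dept_manager_rows = [
    {"emp_no": 111692, "dept_no": "d009",
     "from_date": "1985-01-01", "to_date": "1988-10-17"},
]
common2, joined2 = natural_join(dept_emp_rows, dept_manager_rows)
print(common2, joined2)
```

The second call returns no rows: `emp_no` and `dept_no` match, but `to_date` is also a shared column and differs. This is the usual reason to prefer explicit join keys when tables share column names that are not meant to be keys.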

In [28]:
# natural join employees and dept_emp


#### 5.2.8 Cross Join
`Cross Join` doesn't need any join keys. It pairs each record in the left DataFrame with each record in the right DataFrame to generate the result set. It is the Cartesian product of the left and right DataFrames.
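The size of a Cartesian product is easy to underestimate. A small plain-Python sketch using `itertools.product` (a stand-in for the idea, not for Spark's `crossJoin`) shows that the row count is the product of the two input sizes. The sample rows are a tiny subset chosen for illustration.

```python
# Plain-Python sketch of a Cartesian product (not Spark's implementation).
from itertools import product

employees = [
    {"emp_no": 10001, "first_name": "Georgi"},
    {"emp_no": 10206, "first_name": "Alassane"},
]
dept_emp = [
    {"emp_no": 10001, "dept_no": "d005"},
    {"emp_no": 10002, "dept_no": "d007"},
    {"emp_no": 10003, "dept_no": "d004"},
]

# Every left row is paired with every right row; no key comparison happens.
cross = list(product(employees, dept_emp))
for emp, dept in cross:
    print(emp["emp_no"], emp["first_name"], dept["emp_no"], dept["dept_no"])

print(len(cross) == len(employees) * len(dept_emp))
```

With 2 and 3 input rows the result has 6 rows. On real tables the count grows multiplicatively, so cross joins are best limited to small inputs or combined with a filter.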

In [23]:
# cross join employees and dept_emp

employees.crossJoin(dept_emp).show(20)

+------+----------+----------+---------+------+----------+------+-------+----------+----------+-----------------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_no|dept_no| from_date|   to_date|department_number|
+------+----------+----------+---------+------+----------+------+-------+----------+----------+-----------------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26| 10001|   d005|1986-06-26|9999-01-01|             d005|
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26| 10002|   d007|1996-08-03|9999-01-01|             d007|
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26| 10003|   d004|1995-12-03|9999-01-01|             d004|
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26| 10004|   d004|1986-12-01|9999-01-01|             d004|
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26| 10005|   d003|1989-09-12|9999-01-01|             d003|
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26| 10006|   d005|1990-08-05|9999

#### 5.3 Complex Data Types Join
Joining on complex data types isn't much more difficult. Any expression is a valid join expression as long as it returns a Boolean type.
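To make the "any Boolean expression" point concrete, here is a minimal plain-Python sketch of a join driven by an arbitrary predicate, using array membership as the condition. It models the result only and is not Spark's implementation; the helper `join_on` is hypothetical, and the rows are a subset of the `user` and `skill` data used in this section.

```python
# Plain-Python sketch of a join on an arbitrary Boolean predicate.
users = [
    {"id": 1, "first": "John", "skills": [10, 20, 30]},
    {"id": 3, "first": "Patrick", "skills": [40, 60, 80]},
    {"id": 5, "first": "Micheal", "skills": [0]},
]
skills = [
    {"skill_id": 10, "course": "Java"},
    {"skill_id": 20, "course": "Python"},
    {"skill_id": 30, "course": "Scala"},
    {"skill_id": 40, "course": "Spark"},
    {"skill_id": 50, "course": "Bash"},
    {"skill_id": 60, "course": "Cloud"},
]


def join_on(left, right, predicate):
    """Inner join that keeps each (left, right) pair for which predicate is True."""
    return [{**l, **r} for l in left for r in right if predicate(l, r)]


# The predicate plays the role of array_contains(skills, skill_id).
result = join_on(users, skills, lambda u, s: s["skill_id"] in u["skills"])
pairs = [(r["first"], r["course"]) for r in result]
print(pairs)
```

Skill ids with no entry in `skills` (80 and 0 here) produce no rows, so a user whose array contains only unknown ids disappears from an inner join.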

In [24]:
schema_user = "id int, first string, last string, skills array<int>"

dataset_user = [[1, "John", "Warthorn", [10, 20 ,30]],
       [2, "Harry","Joy", [10]],
       [3, "Patrick", "Roy", [40, 60, 80]],
       [4, "Bicky", "Boss", [20]],
       [5, "Micheal","Todd", [0]],
       [6, "Bobby", "Home", [10, 20, 30, 40, 50, 60]]
      ]
# create a DataFrame using the schema defined above
user = spark.createDataFrame(dataset_user, schema_user)
user.printSchema()
user.show(20, False)

root
 |-- id: integer (nullable = true)
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: integer (containsNull = true)

+---+-------+--------+------------------------+
|id |first  |last    |skills                  |
+---+-------+--------+------------------------+
|1  |John   |Warthorn|[10, 20, 30]            |
|2  |Harry  |Joy     |[10]                    |
|3  |Patrick|Roy     |[40, 60, 80]            |
|4  |Bicky  |Boss    |[20]                    |
|5  |Micheal|Todd    |[0]                     |
|6  |Bobby  |Home    |[10, 20, 30, 40, 50, 60]|
+---+-------+--------+------------------------+



In [25]:
schema_skill = "skill_id int, course string"

dataset_skill = [
    [10, "Java"],
    [20, "Python"],
    [30, "Scala"],
    [40, "Spark"],
    [50, "Bash"],
    [60, "Cloud"],
]
# create a DataFrame using the schema defined above
skill = spark.createDataFrame(dataset_skill, schema_skill)
skill.printSchema()  # printSchema() prints the schema itself and returns None
skill.show(20, False)

root
 |-- skill_id: integer (nullable = true)
 |-- course: string (nullable = true)

+--------+------+
|skill_id|course|
+--------+------+
|10      |Java  |
|20      |Python|
|30      |Scala |
|40      |Spark |
|50      |Bash  |
|60      |Cloud |
+--------+------+



In [27]:
from pyspark.sql.functions import expr

# join each user to every skill whose skill_id appears in the user's skills array
user.join(skill, expr("array_contains(skills, skill_id)"), "inner").show(20, False)

+---+-------+--------+------------------------+--------+------+
|id |first  |last    |skills                  |skill_id|course|
+---+-------+--------+------------------------+--------+------+
|1  |John   |Warthorn|[10, 20, 30]            |10      |Java  |
|1  |John   |Warthorn|[10, 20, 30]            |20      |Python|
|1  |John   |Warthorn|[10, 20, 30]            |30      |Scala |
|2  |Harry  |Joy     |[10]                    |10      |Java  |
|3  |Patrick|Roy     |[40, 60, 80]            |40      |Spark |
|3  |Patrick|Roy     |[40, 60, 80]            |60      |Cloud |
|4  |Bicky  |Boss    |[20]                    |20      |Python|
|6  |Bobby  |Home    |[10, 20, 30, 40, 50, 60]|10      |Java  |
|6  |Bobby  |Home    |[10, 20, 30, 40, 50, 60]|20      |Python|
|6  |Bobby  |Home    |[10, 20, 30, 40, 50, 60]|30      |Scala |
|6  |Bobby  |Home    |[10, 20, 30, 40, 50, 60]|40      |Spark |
|6  |Bobby  |Home    |[10, 20, 30, 40, 50, 60]|50      |Bash  |
|6  |Bobby  |Home    |[10, 20, 30, 40, 5

#### 5.4 Duplicate Columns in Join

When joining DataFrames, both DataFrames may contain columns with the same name. Duplicate column names occur in two cases:    
1. Both DataFrames have the same column name, it is used in the join expression, and the join does not remove the key column from one of the DataFrames.
2. Both DataFrames have the same column name, but it is not used in the join expression.

[Additional Resources](https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html)$^2$

**Methods for handling duplicate column names in join**  


**Method I**  
If the join key has the same column name in both DataFrames, pass a string or a list of column names instead of a Boolean expression. The join then keeps only one copy of the key column.  
Join with Boolean expression:  
`employees.join(dept_emp, employees.emp_no == dept_emp.emp_no, 'inner').show(10)`  
Join with string:  
`employees.join(dept_emp, "emp_no").show(10)`

In [28]:
# join with Boolean Expression
employees.join(dept_emp, employees.emp_no == dept_emp.emp_no, 'inner').show(10)

+------+----------+----------+---------+------+----------+------+-------+----------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|emp_no|dept_no| from_date|   to_date|
+------+----------+----------+---------+------+----------+------+-------+----------+----------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19| 10206|   d005|1988-04-19|9999-01-01|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24| 10362|   d003|1990-11-02|1997-07-16|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d005|1992-01-15|1996-02-11|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07| 10623|   d008|1996-02-11|9999-01-01|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d007|1990-12-26|2000-01-24|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26| 10817|   d009|2000-01-24|9999-01-01|
| 11033|1957-03-01|   Shushma|     Bahk|     F|1990-10-02| 11033|   d005|1991-03-14|9999-01-01|
| 11141|1957-08-20|   Vasiliy|Kermarrec|

In [29]:
# Join with String
employees.join(dept_emp, "emp_no").show(10)

+------+----------+----------+---------+------+----------+-------+----------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|dept_no| from_date|   to_date|
+------+----------+----------+---------+------+----------+-------+----------+----------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19|   d005|1988-04-19|9999-01-01|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24|   d003|1990-11-02|1997-07-16|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07|   d005|1992-01-15|1996-02-11|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07|   d008|1996-02-11|9999-01-01|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26|   d007|1990-12-26|2000-01-24|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26|   d009|2000-01-24|9999-01-01|
| 11033|1957-03-01|   Shushma|     Bahk|     F|1990-10-02|   d005|1991-03-14|9999-01-01|
| 11141|1957-08-20|   Vasiliy|Kermarrec|     F|1989-12-28|   d005|1993-02-12|9999-01-01|
| 11317|1954-07-24|  

**Method II**   
Drop the duplicate column after the join, either by column name alone or by DataFrame and column name. In example-1 below, only the column name is given, so both `emp_no` columns are dropped. In example-2 below, the column is qualified with its DataFrame, so only the column from that DataFrame is dropped.  
`employees.join(dept_emp, employees.emp_no == dept_emp.emp_no, 'inner').drop("emp_no").show(10)`  
`employees.join(dept_emp, employees.emp_no == dept_emp.emp_no, 'inner').drop(dept_emp.emp_no).show(10)`  

In [35]:
# example-1: with only the column name, both emp_no columns are dropped
employees.join(dept_emp, employees.emp_no == dept_emp.emp_no, 'inner').drop("emp_no").show(10)

# example-2: with the DataFrame and column name, only that DataFrame's column is dropped
employees.join(dept_emp, employees.emp_no == dept_emp.emp_no, 'inner').drop(dept_emp.emp_no).show(10)

+------+----------+----------+---------+------+----------+-------+----------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|dept_no| from_date|   to_date|
+------+----------+----------+---------+------+----------+-------+----------+----------+
| 10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19|   d005|1988-04-19|9999-01-01|
| 10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24|   d003|1990-11-02|1997-07-16|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07|   d005|1992-01-15|1996-02-11|
| 10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07|   d008|1996-02-11|9999-01-01|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26|   d007|1990-12-26|2000-01-24|
| 10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26|   d009|2000-01-24|9999-01-01|
| 11033|1957-03-01|   Shushma|     Bahk|     F|1990-10-02|   d005|1991-03-14|9999-01-01|
| 11141|1957-08-20|   Vasiliy|Kermarrec|     F|1989-12-28|   d005|1993-02-12|9999-01-01|
| 11317|1954-07-24|  

**Method III**   
Rename the column before the join.  
`employees = employees.withColumnRenamed("emp_no", "employees_no")`  
`employees.join(dept_emp, employees.employees_no == dept_emp.emp_no, 'inner').drop("emp_no").show(10)`

In [41]:
# rename the key column in employees, join, then drop the emp_no column that comes from dept_emp

employees = employees.withColumnRenamed("emp_no", "employees_no")
employees.printSchema()
employees.join(dept_emp, employees.employees_no == dept_emp.emp_no, 'inner').drop("emp_no").show(10)

root
 |-- employees_no: integer (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)

+------------+----------+----------+---------+------+----------+-------+----------+----------+
|employees_no|birth_date|first_name|last_name|gender| hire_date|dept_no| from_date|   to_date|
+------------+----------+----------+---------+------+----------+-------+----------+----------+
|       10206|1960-09-19|  Alassane|  Iwayama|     F|1988-04-19|   d005|1988-04-19|9999-01-01|
|       10362|1963-09-16|   Shalesh|  dAstous|     M|1988-08-24|   d003|1990-11-02|1997-07-16|
|       10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07|   d005|1992-01-15|1996-02-11|
|       10623|1953-07-11|Aleksander|   Danlos|     F|1987-03-07|   d008|1996-02-11|9999-01-01|
|       10817|1958-10-02|       Uri|  Rullman|     F|1990-12-26|   d007|1990-12-26|2

#### 5.5 Optimization and Performance Tuning

Optimization and performance tuning are very important when joining large DataFrames. There are several ways to optimize a join operation, but first we need to understand how Spark performs a join internally. The two topics to look at are:   
* Node-to-Node Communication
* Per-Node Communication

**Execution Plan**  
The execution plan shows the physical plan Spark will use for the join. We can display it with the `explain()` method.

In [39]:
employees.printSchema()
employees.join(dept_emp, employees.employees_no == dept_emp.emp_no, 'inner').explain()
dept_emp.join(employees, employees.employees_no == dept_emp.emp_no, 'inner').explain()

root
 |-- employees_no: integer (nullable = true)
 |-- birth_date: date (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: date (nullable = true)

== Physical Plan ==
*(5) SortMergeJoin [employees_no#1369], [emp_no#12], Inner
:- *(2) Sort [employees_no#1369 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(employees_no#1369, 200), true, [id=#842]
:     +- *(1) Project [emp_no#0 AS employees_no#1369, birth_date#1, first_name#2, last_name#3, gender#4, hire_date#5]
:        +- *(1) Scan JDBCRelation(employees) [numPartitions=1] [gender#4,emp_no#0,hire_date#5,birth_date#1,first_name#2,last_name#3] PushedFilters: [*IsNotNull(emp_no)], ReadSchema: struct<gender:string,emp_no:int,hire_date:date,birth_date:date,first_name:string,last_name:string>
+- *(4) Sort [emp_no#12 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(emp_no#12, 200), true, [id=#848]
      +- *(3) Scan JDB

**Cluster Configuration**  
We can tune the Spark cluster by changing its default configuration parameters. Refer to the link below:  
[Spark Configuration](https://spark.apache.org/docs/latest/configuration.html)$^3$

**Partitioning DataFrame**  
Partitioning a DataFrame is important when the dataset is large. It helps to:  
* Split the data into multiple partitions.
* Spread the data across multiple nodes.

**Tuning job**  
Tuning helps to optimize the query and make the code run faster. To optimize:  
* Change the join order.
* Use broadcast join hints.

[Spark Performance Tuning](https://spark.apache.org/docs/latest/sql-performance-tuning.html)  

**Spark SQL Configurations**  
For optimizing or debugging SQL code, we can set Spark configuration properties. The table below shows some of the properties used for tuning queries.

Table: 5.1 Spark SQL configurations

| Property | Default | Description |
|:---------------:|:---------------:|:---------------:|
| spark.sql.inMemoryColumnarStorage.compressed | true | When set to true, Spark SQL automatically selects a compression codec for each column based on statistics of the data. |
| spark.sql.inMemoryColumnarStorage.batchSize | 10000 | Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OutOfMemoryErrors (OOMs) when caching data. |
| spark.sql.files.maxPartitionBytes | 134217728 (128 MB) | The maximum number of bytes to pack into a single partition when reading files. |  
| spark.sql.broadcastTimeout | 300 | Timeout in seconds for the broadcast wait time in broadcast joins. |
| spark.sql.files.openCostInBytes | 10485760 (10 MB) | The estimated cost to open a file, measured by the number of bytes that could be scanned in the same time. This is used when putting multiple files into a partition. It is better to overestimate; that way the partitions with small files will be faster than partitions with bigger files (which is scheduled first). |
| spark.sql.autoBroadcastJoinThreshold | 10485760 (10 MB) | Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. You can disable broadcasting by setting this value to -1. Note that currently statistics are supported only for Hive Metastore tables for which the command `ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan` has been run. |
| spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. |

**References**

$^{1}$ https://en.wikipedia.org/wiki/Database_normalization#Normal_forms   
$^{2}$ https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html   
$^{3}$ https://spark.apache.org/docs/latest/configuration.html