## Joining Tables - Outer

Let us understand how to perform outer joins using Spark SQL. There are 3 different types of outer joins.

Let us start the Spark context for this notebook so that we can execute the code provided.

In [1]:
```
val username = System.getProperty("user.name")
```

```
username = itv002461
```

In [2]:
```
import org.apache.spark.sql.SparkSession

// Resolve the OS user so the warehouse directory and application name are per user
val username = System.getProperty("user.name")
// spark.ui.port=0 lets Spark pick a free port for its UI
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Basic Transformations").
    master("yarn").
    getOrCreate
```

```
username = itv002461
spark = org.apache.spark.sql.SparkSession@356b7339
```

If you prefer the command line, you can launch Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* LEFT OUTER JOIN - Get all the records from both datasets that satisfy the JOIN condition, along with the records that are in the left side table but not in the right side table. The `OUTER` keyword is optional, so `LEFT JOIN` means the same thing.
* RIGHT OUTER JOIN - Get all the records from both datasets that satisfy the JOIN condition, along with the records that are in the right side table but not in the left side table.
* FULL OUTER JOIN - The union of the left and right outer join results: the matching records, plus the unmatched records from both tables.
* When we perform an outer join (let's say a left outer join), we will see the following.
  * When the join condition is satisfied, we get the values from both tables.
  * If there are rows in the left side table for which there are no corresponding rows in the right side table, all the projected columns from the right side table will be null.
* Here are some examples of outer joins.
  * Get all the orders where there are no corresponding order items.
  * Get all the order items where there are no corresponding orders.
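
The join types listed above can be tried on a couple of tiny in-memory tables. The following is a minimal sketch, not part of the original notebook: the `toy_orders` and `toy_order_items` views, their rows, and the `outerJoin` helper are hypothetical, and a local session is assumed (inside the notebook, `getOrCreate` returns the session that is already running).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session for the sketch; inside the notebook,
// getOrCreate simply returns the session that is already running.
val spark = SparkSession.builder.
    master("local[1]").
    appName("outer-join-sketch").
    getOrCreate()
import spark.implicits._

// Hypothetical toy data (not rows from the retail database).
// Order 2 has no items; the item pointing at order 3 has no order.
Seq((1, "CLOSED"), (2, "COMPLETE")).
    toDF("order_id", "order_status").
    createOrReplaceTempView("toy_orders")
Seq((1, 299.98), (3, 49.98)).
    toDF("order_item_order_id", "order_item_subtotal").
    createOrReplaceTempView("toy_order_items")

// Run the same query shape with each outer join type.
def outerJoin(joinType: String) = spark.sql(s"""
SELECT o.order_id, o.order_status,
    oi.order_item_order_id, oi.order_item_subtotal
FROM toy_orders o $joinType JOIN toy_order_items oi
    ON o.order_id = oi.order_item_order_id
""")

val leftRows = outerJoin("LEFT OUTER")
val rightRows = outerJoin("RIGHT OUTER")
val fullRows = outerJoin("FULL OUTER")

leftRows.show()   // orders 1 and 2; the item columns are null for order 2
rightRows.show()  // items of orders 1 and 3; the order columns are null for the item of order 3
fullRows.show()   // the match, the unmatched order and the unmatched item
```

With this toy data the left and right outer joins return two rows each, while the full outer join returns three, because it keeps the unmatched rows from both sides.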

In [5]:
```
%%sql
use itv002461_retail
```



In [6]:
```
%%sql

SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_order_id,
    oi.order_item_subtotal
FROM orders o LEFT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
LIMIT 10
```

```
+--------+--------------------+---------------+-------------------+-------------------+
|order_id|          order_date|   order_status|order_item_order_id|order_item_subtotal|
+--------+--------------------+---------------+-------------------+-------------------+
|   34565|2014-02-23 00:00:...|       COMPLETE|               null|               null|
|   34566|2014-02-23 00:00:...|PENDING_PAYMENT|              34566|             179.97|
|   34566|2014-02-23 00:00:...|PENDING_PAYMENT|              34566|              250.0|
|   34567|2014-02-23 00:00:...|SUSPECTED_FRAUD|               null|               null|
|   34568|2014-02-23 00:00:...|       COMPLETE|               null|               null|
|   34569|2014-02-23 00:00:...|       COMPLETE|               null|               null|
|   34570|2014-02-23 00:00:...|         CLOSED|              34570|              49.98|
|   34570|2014-02-23 00:00:...|         CLOSED|              34570|             119.97|
...
```

In [7]:
```
%%sql

SELECT count(1)
FROM orders o LEFT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
```

```
+--------+
|count(1)|
+--------+
|  183650|
+--------+
```



In [8]:
```
%%sql

SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_order_id,
    oi.order_item_subtotal
FROM orders o LEFT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL
LIMIT 10
```

```
+--------+--------------------+---------------+-------------------+-------------------+
|order_id|          order_date|   order_status|order_item_order_id|order_item_subtotal|
+--------+--------------------+---------------+-------------------+-------------------+
|       3|2013-07-25 00:00:...|       COMPLETE|               null|               null|
|       6|2013-07-25 00:00:...|       COMPLETE|               null|               null|
|      22|2013-07-25 00:00:...|       COMPLETE|               null|               null|
|      26|2013-07-25 00:00:...|       COMPLETE|               null|               null|
|      32|2013-07-25 00:00:...|       COMPLETE|               null|               null|
|      40|2013-07-25 00:00:...|PENDING_PAYMENT|               null|               null|
|      47|2013-07-25 00:00:...|PENDING_PAYMENT|               null|               null|
|      53|2013-07-25 00:00:...|     PROCESSING|               null|               null|
...
```

In [9]:
```
%%sql

SELECT count(1)
FROM orders o LEFT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL
```

```
+--------+
|count(1)|
+--------+
|   11452|
+--------+
```



In [10]:
```
%%sql

SELECT count(1)
FROM orders o LEFT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL
    AND o.order_status IN ('COMPLETE', 'CLOSED')
```

```
+--------+
|count(1)|
+--------+
|    5189|
+--------+
```



In [11]:
```
%%sql

SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_order_id,
    oi.order_item_subtotal
FROM orders o RIGHT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
LIMIT 10
```

```
+--------+--------------------+------------+-------------------+-------------------+
|order_id|          order_date|order_status|order_item_order_id|order_item_subtotal|
+--------+--------------------+------------+-------------------+-------------------+
|   35210|2014-02-27 00:00:...|     ON_HOLD|              35210|             199.92|
|   35211|2014-02-27 00:00:...|    COMPLETE|              35211|             239.96|
|   35212|2014-02-27 00:00:...|  PROCESSING|              35212|              49.98|
|   35212|2014-02-27 00:00:...|  PROCESSING|              35212|             299.97|
|   35212|2014-02-27 00:00:...|  PROCESSING|              35212|              249.9|
|   35212|2014-02-27 00:00:...|  PROCESSING|              35212|              49.98|
|   35212|2014-02-27 00:00:...|  PROCESSING|              35212|             149.94|
|   35213|2014-02-27 00:00:...|      CLOSED|              35213|             239.96|
...
```

In [12]:
```
%%sql

SELECT count(1)
FROM orders o RIGHT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
```

```
+--------+
|count(1)|
+--------+
|  172198|
+--------+
```



In [13]:
```
%%sql

SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_order_id,
    oi.order_item_subtotal
FROM orders o RIGHT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_id IS NULL
LIMIT 10
```

```
+--------+----------+------------+-------------------+-------------------+
|order_id|order_date|order_status|order_item_order_id|order_item_subtotal|
+--------+----------+------------+-------------------+-------------------+
+--------+----------+------------+-------------------+-------------------+
```
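
The counts reported so far are consistent with each other, which is a useful sanity check on outer join results. The empty result above means that every order item has a matching order, so the right outer join contains only matched rows, and the left outer join should contain those matched rows plus the orders without items. The following sketch simply redoes that arithmetic in Scala with the numbers copied from the outputs above; the variable names are illustrative, and the full outer join count is derived rather than taken from a query that was run.

```scala
// Counts copied from the query outputs above.
val leftOuterCount = 183650L     // orders LEFT OUTER JOIN order_items
val ordersWithoutItems = 11452L  // left outer join rows where oi.order_item_order_id IS NULL
val rightOuterCount = 172198L    // orders RIGHT OUTER JOIN order_items
val itemsWithoutOrders = 0L      // right outer join rows where o.order_id IS NULL (empty result)

// Rows that satisfy the join condition appear in both outer joins.
val matchedRows = rightOuterCount - itemsWithoutOrders

// Left outer join = matched rows + unmatched rows from the left table.
assert(matchedRows + ordersWithoutItems == leftOuterCount)

// Derived, not executed in this notebook: a FULL OUTER JOIN adds the
// unmatched rows from both sides to the matched rows.
val expectedFullOuterCount = matchedRows + ordersWithoutItems + itemsWithoutOrders
println(expectedFullOuterCount)  // 183650
```

Because no order item is missing its order, a full outer join of these two tables would be expected to return the same number of rows as the left outer join.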



The same queries can also be run from Python or Scala by passing them to `spark.sql`.

In [14]:
```
spark.sql("""
SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_order_id,
    oi.order_item_subtotal
FROM orders o LEFT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
""").show()
```

```
+--------+--------------------+---------------+-------------------+-------------------+
|order_id|          order_date|   order_status|order_item_order_id|order_item_subtotal|
+--------+--------------------+---------------+-------------------+-------------------+
|       1|2013-07-25 00:00:...|         CLOSED|                  1|             299.98|
|       2|2013-07-25 00:00:...|PENDING_PAYMENT|                  2|             129.99|
|       2|2013-07-25 00:00:...|PENDING_PAYMENT|                  2|              250.0|
|       2|2013-07-25 00:00:...|PENDING_PAYMENT|                  2|             199.99|
|       3|2013-07-25 00:00:...|       COMPLETE|               null|               null|
|       4|2013-07-25 00:00:...|         CLOSED|                  4|             199.92|
|       4|2013-07-25 00:00:...|         CLOSED|                  4|              150.0|
|       4|2013-07-25 00:00:...|         CLOSED|                  4|             299.95|
...
```

In [15]:
```
spark.sql("""
SELECT count(1)
FROM orders o LEFT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
""").show()
```

```
+--------+
|count(1)|
+--------+
|  183650|
+--------+
```



In [16]:
```
spark.sql("""
SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_order_id,
    oi.order_item_subtotal
FROM orders o LEFT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL
""").show()
```

```
+--------+--------------------+---------------+-------------------+-------------------+
|order_id|          order_date|   order_status|order_item_order_id|order_item_subtotal|
+--------+--------------------+---------------+-------------------+-------------------+
|       3|2013-07-25 00:00:...|       COMPLETE|               null|               null|
|       6|2013-07-25 00:00:...|       COMPLETE|               null|               null|
|      22|2013-07-25 00:00:...|       COMPLETE|               null|               null|
|      26|2013-07-25 00:00:...|       COMPLETE|               null|               null|
|      32|2013-07-25 00:00:...|       COMPLETE|               null|               null|
|      40|2013-07-25 00:00:...|PENDING_PAYMENT|               null|               null|
|      47|2013-07-25 00:00:...|PENDING_PAYMENT|               null|               null|
|      53|2013-07-25 00:00:...|     PROCESSING|               null|               null|
...
```

In [24]:
```
spark.sql("""
SELECT count(1)
FROM orders o LEFT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL
""").show()
```

```
+--------+
|count(1)|
+--------+
|   11452|
+--------+
```



In [18]:
```
spark.sql("""
SELECT count(1)
FROM orders o LEFT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL
    AND o.order_status IN ('COMPLETE', 'CLOSED')
""").show()
```

```
+--------+
|count(1)|
+--------+
|    5189|
+--------+
```



In [19]:
```
spark.sql("""
SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_order_id,
    oi.order_item_subtotal
FROM orders o RIGHT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
""").show()
```

```
+--------+--------------------+---------------+-------------------+-------------------+
|order_id|          order_date|   order_status|order_item_order_id|order_item_subtotal|
+--------+--------------------+---------------+-------------------+-------------------+
|       1|2013-07-25 00:00:...|         CLOSED|                  1|             299.98|
|       2|2013-07-25 00:00:...|PENDING_PAYMENT|                  2|             199.99|
|       2|2013-07-25 00:00:...|PENDING_PAYMENT|                  2|              250.0|
|       2|2013-07-25 00:00:...|PENDING_PAYMENT|                  2|             129.99|
|       4|2013-07-25 00:00:...|         CLOSED|                  4|              49.98|
|       4|2013-07-25 00:00:...|         CLOSED|                  4|             299.95|
|       4|2013-07-25 00:00:...|         CLOSED|                  4|              150.0|
|       4|2013-07-25 00:00:...|         CLOSED|                  4|             199.92|
...
```

In [25]:
```
spark.sql("""
SELECT count(1)
FROM orders o RIGHT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
""").show()
```

In [21]:
```
spark.sql("""
SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_order_id,
    oi.order_item_subtotal
FROM orders o RIGHT OUTER JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_id IS NULL
""").show()
```

```
+--------+----------+------------+-------------------+-------------------+
|order_id|order_date|order_status|order_item_order_id|order_item_subtotal|
+--------+----------+------------+-------------------+-------------------+
+--------+----------+------------+-------------------+-------------------+
```
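
Since `spark.sql` returns a DataFrame, the result of an outer join can also be kept in a variable and inspected programmatically instead of only being displayed with `show()`. The following is a minimal sketch under stated assumptions: the `toy_orders` and `toy_order_items` views and their rows are hypothetical stand-ins for the retail tables, and a local session is assumed (inside the notebook, `getOrCreate` returns the session that is already running).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in the notebook the existing one is reused.
val spark = SparkSession.builder.
    master("local[1]").
    appName("outer-join-result-sketch").
    getOrCreate()
import spark.implicits._

// Hypothetical toy data: order 1 has two items, orders 2 and 3 have none.
Seq((1, "CLOSED"), (2, "COMPLETE"), (3, "COMPLETE")).
    toDF("order_id", "order_status").
    createOrReplaceTempView("toy_orders")
Seq((1, 299.98), (1, 49.98)).
    toDF("order_item_order_id", "order_item_subtotal").
    createOrReplaceTempView("toy_order_items")

// Keep the join result as a DataFrame instead of calling show() directly.
val joined = spark.sql("""
SELECT o.order_id, o.order_status,
    oi.order_item_order_id, oi.order_item_subtotal
FROM toy_orders o LEFT OUTER JOIN toy_order_items oi
    ON o.order_id = oi.order_item_order_id
""")

// Total rows of the left outer join, and the rows with no matching item.
val totalRows = joined.count()
val ordersWithoutItems = joined.filter($"order_item_order_id".isNull).count()

println(s"total = $totalRows, orders without items = $ordersWithoutItems")
```

Filtering the DataFrame on a null right side column plays the same role as the `WHERE oi.order_item_order_id IS NULL` clause used in the queries above.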

