## Joining Tables - Inner

Let us understand how to join data from multiple tables.

Let us start the Spark context for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [1]:
val username = System.getProperty("user.name")

username = itv002461



In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Basic Transformations").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@5d0e100f



If you are going to use CLIs, you can launch Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* **Spark SQL** supports ANSI style joins (**JOIN with ON**).
* There are different types of joins.
  * INNER JOIN - Get all the records from both datasets that satisfy the JOIN condition. Records without a match on the other side are dropped.
  * OUTER JOIN - We will get into the details as part of the next topic.
* Example for INNER JOIN:

```
SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
LIMIT 10
```
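To make the INNER JOIN semantics concrete, here is a minimal sketch in plain Scala (no Spark required) that mimics the query above with in-memory collections. The case classes and sample rows are hypothetical, chosen only for illustration: order 3 has no items and the item for order 9 has no matching order, so neither should appear in the result.

```scala
// Hypothetical in-memory rows; names and values are made up for illustration.
case class Order(orderId: Int, orderStatus: String)
case class OrderItem(orderItemOrderId: Int, orderItemSubtotal: Double)

val orders = Seq(
  Order(1, "CLOSED"),
  Order(2, "PENDING_PAYMENT"),
  Order(3, "COMPLETE")            // no items: dropped by the inner join
)
val orderItems = Seq(
  OrderItem(1, 299.98),
  OrderItem(2, 129.99),
  OrderItem(2, 250.0),
  OrderItem(9, 50.0)              // no matching order: dropped by the inner join
)

// Inner join: keep only the (order, item) pairs that satisfy the join condition.
val joined = for {
  o <- orders
  oi <- orderItems
  if o.orderId == oi.orderItemOrderId
} yield (o.orderId, o.orderStatus, oi.orderItemSubtotal)

joined.foreach(println)
```

Only three pairs survive, all belonging to orders 1 and 2. An order with several items contributes one output row per item.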

* We can join more than 2 tables in one query. Here is what it looks like.

```
SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
    JOIN products p
    ON p.product_id = oi.order_item_product_id
LIMIT 10
```
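As a rough illustration of how chaining a second JOIN narrows the result, the sketch below repeats the idea with three plain Scala collections. Everything here is hypothetical (the case classes, the sample rows, and the `productName` field are assumptions, not the actual table definitions); the point is only that a row must satisfy both join conditions to survive.

```scala
// Hypothetical rows for illustration only; productName is an assumed column.
case class OrderRow(orderId: Int, orderStatus: String)
case class ItemRow(orderId: Int, productId: Int, subtotal: Double)
case class ProductRow(productId: Int, productName: String)

val orderRows = Seq(OrderRow(1, "CLOSED"), OrderRow(2, "COMPLETE"))
val itemRows = Seq(
  ItemRow(1, 10, 299.98),
  ItemRow(2, 20, 129.99),
  ItemRow(2, 99, 250.0)           // product 99 does not exist: dropped by the second join
)
val productRows = Seq(ProductRow(10, "Product A"), ProductRow(20, "Product B"))

// Each generator with a guard plays the role of one JOIN ... ON clause.
val threeWay = for {
  o <- orderRows
  oi <- itemRows if o.orderId == oi.orderId
  p <- productRows if p.productId == oi.productId
} yield (o.orderId, o.orderStatus, oi.subtotal, p.productName)

threeWay.foreach(println)
```

The item that references the missing product matches an order but not a product, so it is excluded from the final result.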

* If we have to apply additional filters, it is recommended to use the WHERE clause. The ON clause should only have join conditions.
* We can have non-equi join conditions (conditions other than equality) as well, but they are not used that often.
* Here are some examples for INNER JOIN:
  * Get order id, date, status and item revenue for all order items.
  * Get order id, date, status and item revenue for all order items of orders where order status is either COMPLETE or CLOSED.
  * Get order id, date, status and item revenue for all order items of orders where order status is either COMPLETE or CLOSED and the order was placed in January 2014.
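The split between ON (join conditions) and WHERE (filters), as well as a non-equi join, can be sketched as follows. This is only an illustration under stated assumptions: `demo_order_items` and `demo_price_bands` are hypothetical temporary views with made-up rows, not tables from the retail database, and `getOrCreate` reuses the notebook's session when one is already running.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the active session if there is one; otherwise starts a local session.
val spark = SparkSession.builder.
    master("local[*]").
    appName("Non-equi join sketch").
    getOrCreate()
import spark.implicits._

// Hypothetical data registered as temporary views (names are assumptions).
Seq((1, 49.98), (2, 199.99), (3, 399.98)).
    toDF("order_item_id", "order_item_subtotal").
    createOrReplaceTempView("demo_order_items")

Seq(("low", 0.0, 100.0), ("mid", 100.0, 300.0), ("high", 300.0, 1000.0)).
    toDF("band", "min_subtotal", "max_subtotal").
    createOrReplaceTempView("demo_price_bands")

// ON holds the (non-equi) join condition; WHERE holds the additional filter.
val banded = spark.sql("""
SELECT oi.order_item_id,
    oi.order_item_subtotal,
    pb.band
FROM demo_order_items oi JOIN demo_price_bands pb
    ON oi.order_item_subtotal >= pb.min_subtotal
    AND oi.order_item_subtotal < pb.max_subtotal
WHERE oi.order_item_subtotal > 100
""")

banded.show()
```

Here the range comparison decides which band each item joins to, while the WHERE clause removes the item priced below 100 after the match is defined. Keeping the two roles separate makes the query easier to read.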

In [5]:
%%sql

USE itv002461_retail

++
||
++
++



In [6]:
%%sql

SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
LIMIT 10



+--------+--------------------+---------------+-------------------+
|order_id|          order_date|   order_status|order_item_subtotal|
+--------+--------------------+---------------+-------------------+
|   34566|2014-02-23 00:00:...|PENDING_PAYMENT|             179.97|
|   34566|2014-02-23 00:00:...|PENDING_PAYMENT|              250.0|
|   34570|2014-02-23 00:00:...|         CLOSED|              49.98|
|   34570|2014-02-23 00:00:...|         CLOSED|             119.97|
|   34570|2014-02-23 00:00:...|         CLOSED|              200.0|
|   34570|2014-02-23 00:00:...|         CLOSED|             239.96|
|   34571|2014-02-23 00:00:...|         CLOSED|             399.98|
|   34571|2014-02-23 00:00:...|         CLOSED|             299.98|
|   34571|2014-02-23 00:00:...|         CLOSED|              59.99|
|   34571|2014-02-23 00:00:...|         CLOSED|             299.98|
+--------+--------------------+---------------+-------------------+



In [7]:
%%sql

SELECT count(1)
FROM orders

+--------+
|count(1)|
+--------+
|   68883|
+--------+



In [8]:
%%sql

SELECT count(1)
FROM order_items

+--------+
|count(1)|
+--------+
|  172198|
+--------+



In [9]:
%%sql

SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
LIMIT 10



+--------+--------------------+---------------+-------------------+
|order_id|          order_date|   order_status|order_item_subtotal|
+--------+--------------------+---------------+-------------------+
|       1|2013-07-25 00:00:...|         CLOSED|             299.98|
|       2|2013-07-25 00:00:...|PENDING_PAYMENT|             129.99|
|       2|2013-07-25 00:00:...|PENDING_PAYMENT|              250.0|
|       2|2013-07-25 00:00:...|PENDING_PAYMENT|             199.99|
|       4|2013-07-25 00:00:...|         CLOSED|             199.92|
|       4|2013-07-25 00:00:...|         CLOSED|              150.0|
|       4|2013-07-25 00:00:...|         CLOSED|             299.95|
|       4|2013-07-25 00:00:...|         CLOSED|              49.98|
|       5|2013-07-25 00:00:...|       COMPLETE|             129.99|
|       5|2013-07-25 00:00:...|       COMPLETE|             299.98|
+--------+--------------------+---------------+-------------------+



In [10]:
%%sql

SELECT count(1)
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id

+--------+
|count(1)|
+--------+
|  172198|
+--------+



In [11]:
%%sql

SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
LIMIT 10



+--------+--------------------+------------+-------------------+
|order_id|          order_date|order_status|order_item_subtotal|
+--------+--------------------+------------+-------------------+
|       1|2013-07-25 00:00:...|      CLOSED|             299.98|
|       4|2013-07-25 00:00:...|      CLOSED|             199.92|
|       4|2013-07-25 00:00:...|      CLOSED|              150.0|
|       4|2013-07-25 00:00:...|      CLOSED|             299.95|
|       4|2013-07-25 00:00:...|      CLOSED|              49.98|
|       5|2013-07-25 00:00:...|    COMPLETE|             129.99|
|       5|2013-07-25 00:00:...|    COMPLETE|             299.98|
|       5|2013-07-25 00:00:...|    COMPLETE|              99.96|
|       5|2013-07-25 00:00:...|    COMPLETE|             299.95|
|       5|2013-07-25 00:00:...|    COMPLETE|             299.98|
+--------+--------------------+------------+-------------------+



In [12]:
%%sql

SELECT count(1)
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')

+--------+
|count(1)|
+--------+
|   75408|
+--------+



In [13]:
%%sql

SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
LIMIT 10



+--------+--------------------+------------+-------------------+
|order_id|          order_date|order_status|order_item_subtotal|
+--------+--------------------+------------+-------------------+
|   61898|2014-01-01 00:00:...|    COMPLETE|               50.0|
|   61898|2014-01-01 00:00:...|    COMPLETE|             199.92|
|   61904|2014-01-01 00:00:...|    COMPLETE|              500.0|
|   61907|2014-01-01 00:00:...|    COMPLETE|             299.98|
|   61907|2014-01-01 00:00:...|    COMPLETE|             399.96|
|   61907|2014-01-01 00:00:...|    COMPLETE|             199.99|
|   61907|2014-01-01 00:00:...|    COMPLETE|             399.98|
|   61908|2014-01-01 00:00:...|    COMPLETE|             129.99|
|   61910|2014-01-01 00:00:...|    COMPLETE|             119.98|
|   61910|2014-01-01 00:00:...|    COMPLETE|             399.98|
+--------+--------------------+------------+-------------------+



In [14]:
%%sql

SELECT count(1)
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'

+--------+
|count(1)|
+--------+
|    6198|
+--------+



* We can also run the same queries using `spark.sql` with Python or Scala.

In [15]:
spark.sql("""
SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
""").show()

+--------+--------------------+---------------+-------------------+
|order_id|          order_date|   order_status|order_item_subtotal|
+--------+--------------------+---------------+-------------------+
|       1|2013-07-25 00:00:...|         CLOSED|             299.98|
|       2|2013-07-25 00:00:...|PENDING_PAYMENT|             129.99|
|       2|2013-07-25 00:00:...|PENDING_PAYMENT|              250.0|
|       2|2013-07-25 00:00:...|PENDING_PAYMENT|             199.99|
|       4|2013-07-25 00:00:...|         CLOSED|             199.92|
|       4|2013-07-25 00:00:...|         CLOSED|              150.0|
|       4|2013-07-25 00:00:...|         CLOSED|             299.95|
|       4|2013-07-25 00:00:...|         CLOSED|              49.98|
|       5|2013-07-25 00:00:...|       COMPLETE|             129.99|
|       5|2013-07-25 00:00:...|       COMPLETE|             299.98|
|       5|2013-07-25 00:00:...|       COMPLETE|              99.96|
|       5|2013-07-25 00:00:...|       COMPLETE| 

In [16]:
spark.sql("""
SELECT count(1)
FROM orders
""").show()

+--------+
|count(1)|
+--------+
|   68883|
+--------+



In [17]:
spark.sql("""
SELECT count(1)
FROM order_items
""").show()

+--------+
|count(1)|
+--------+
|  172198|
+--------+



In [19]:
spark.sql("""
SELECT count(1)
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
""").show()

+--------+
|count(1)|
+--------+
|  172198|
+--------+



In [20]:
spark.sql("""
SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
""").show()

+--------+--------------------+------------+-------------------+
|order_id|          order_date|order_status|order_item_subtotal|
+--------+--------------------+------------+-------------------+
|       1|2013-07-25 00:00:...|      CLOSED|             299.98|
|       4|2013-07-25 00:00:...|      CLOSED|             199.92|
|       4|2013-07-25 00:00:...|      CLOSED|              150.0|
|       4|2013-07-25 00:00:...|      CLOSED|             299.95|
|       4|2013-07-25 00:00:...|      CLOSED|              49.98|
|       5|2013-07-25 00:00:...|    COMPLETE|             129.99|
|       5|2013-07-25 00:00:...|    COMPLETE|             299.98|
|       5|2013-07-25 00:00:...|    COMPLETE|              99.96|
|       5|2013-07-25 00:00:...|    COMPLETE|             299.95|
|       5|2013-07-25 00:00:...|    COMPLETE|             299.98|
|       7|2013-07-25 00:00:...|    COMPLETE|              79.95|
|       7|2013-07-25 00:00:...|    COMPLETE|             299.98|
|       7|2013-07-25 00:0

In [21]:
spark.sql("""
SELECT count(1)
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
""").show()

+--------+
|count(1)|
+--------+
|   75408|
+--------+



In [22]:
spark.sql("""
SELECT o.order_id,
    o.order_date,
    o.order_status,
    oi.order_item_subtotal
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
""").show()

+--------+--------------------+------------+-------------------+
|order_id|          order_date|order_status|order_item_subtotal|
+--------+--------------------+------------+-------------------+
|   25882|2014-01-01 00:00:...|    COMPLETE|             399.98|
|   25882|2014-01-01 00:00:...|    COMPLETE|              79.98|
|   25882|2014-01-01 00:00:...|    COMPLETE|              100.0|
|   25882|2014-01-01 00:00:...|    COMPLETE|             299.97|
|   25888|2014-01-01 00:00:...|    COMPLETE|             299.98|
|   25889|2014-01-01 00:00:...|    COMPLETE|              19.99|
|   25889|2014-01-01 00:00:...|    COMPLETE|              99.96|
|   25891|2014-01-01 00:00:...|      CLOSED|             119.97|
|   25891|2014-01-01 00:00:...|      CLOSED|               50.0|
|   25891|2014-01-01 00:00:...|      CLOSED|              150.0|
|   25895|2014-01-01 00:00:...|    COMPLETE|             199.99|
|   25895|2014-01-01 00:00:...|    COMPLETE|             399.98|
|   25897|2014-01-01 00:0

In [23]:
spark.sql("""
SELECT count(1)
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
""").show()

+--------+
|count(1)|
+--------+
|    6198|
+--------+

