## Conclusion - Final Solution

Let us review the final solution for our problem statement, **daily_product_revenue**.

Let us start the Spark context for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [1]:
val username = System.getProperty("user.name")

username = itv002461


itv002461

In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Basic Transformations").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@4e594cfa


org.apache.spark.sql.SparkSession@4e594cfa

If you prefer to work from a CLI, you can run Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

Here are the steps to arrive at the solution.

* Prepare tables
  * Create tables
  * Load the data into tables
* Project the fields we are interested in.
  * order_date
  * order_item_product_id
  * product_revenue
* As the fields come from multiple tables, join orders and order_items on the order id, then filter for orders whose status is COMPLETE or CLOSED.
* Group the data by order_date and order_item_product_id, then sum order_item_subtotal to get product_revenue.
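Before running this against the tables, the join, filter, group and aggregate logic can be sketched with plain Scala collections. This is only an illustration, not the Spark implementation: the case classes, the object name `DailyProductRevenueSketch`, and the sample records are all hypothetical, and the sketch assumes each `order_id` appears only once in orders.

```scala
object DailyProductRevenueSketch {
  // Hypothetical record types mirroring only the columns the query needs.
  case class Order(orderId: Int, orderDate: String, orderStatus: String)
  case class OrderItem(orderItemOrderId: Int, orderItemProductId: Int, orderItemSubtotal: Double)

  // Made-up sample data, used purely for illustration.
  val sampleOrders: Seq[Order] = Seq(
    Order(1, "2013-07-25", "COMPLETE"),
    Order(2, "2013-07-25", "PENDING"),
    Order(3, "2013-07-26", "CLOSED")
  )
  val sampleOrderItems: Seq[OrderItem] = Seq(
    OrderItem(1, 1004, 100.5),
    OrderItem(1, 1004, 50.25),
    OrderItem(1, 957, 20.0),
    OrderItem(2, 1004, 999.0),
    OrderItem(3, 957, 10.0)
  )

  def dailyProductRevenue(
      orders: Seq[Order],
      orderItems: Seq[OrderItem]
  ): Map[(String, Int), Double] = {
    // WHERE order_status IN ('COMPLETE', 'CLOSED'), indexed by order_id for the join.
    val eligibleOrderDates: Map[Int, String] = orders
      .filter(o => Set("COMPLETE", "CLOSED").contains(o.orderStatus))
      .map(o => o.orderId -> o.orderDate)
      .toMap

    orderItems
      // Inner join: items whose order is missing or filtered out are dropped.
      .flatMap { oi =>
        eligibleOrderDates
          .get(oi.orderItemOrderId)
          .map(date => ((date, oi.orderItemProductId), oi.orderItemSubtotal))
      }
      // GROUP BY order_date, order_item_product_id
      .groupBy(_._1)
      // round(sum(order_item_subtotal), 2)
      .map { case (key, rows) =>
        val total = BigDecimal(rows.map(_._2).sum)
          .setScale(2, BigDecimal.RoundingMode.HALF_UP)
        key -> total.toDouble
      }
  }

  def main(args: Array[String]): Unit = {
    // Print one line per (order_date, order_item_product_id) pair.
    dailyProductRevenue(sampleOrders, sampleOrderItems)
      .toSeq
      .sortBy(_._1)
      .foreach { case ((date, productId), revenue) =>
        println(s"$date\t$productId\t$revenue")
      }
  }
}
```

Filtering orders before the join keeps the lookup map small; in Spark SQL the optimizer normally makes this kind of decision for us, so the query can be written in whichever order reads best.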

In [3]:
%%sql

DROP DATABASE IF EXISTS itv002461_retail CASCADE

++
||
++
++



In [4]:
%%sql

CREATE DATABASE IF NOT EXISTS itv002461_retail

++
||
++
++



In [10]:
%%sql

USE itv002461_retail

++
||
++
++



In [11]:
%%sql

SHOW tables

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+



In [12]:
%%sql

CREATE TABLE orders (
    order_id INT,
    order_date STRING,
    order_customer_id INT,
    order_status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

++
||
++
++



In [13]:
%%sql

LOAD DATA LOCAL INPATH '/data/retail_db/orders' INTO TABLE orders

++
||
++
++



In [14]:
%%sql 

CREATE TABLE order_items (
    order_item_id INT,
    order_item_order_id INT,
    order_item_product_id INT,
    order_item_quantity INT,
    order_item_subtotal FLOAT,
    order_item_product_price FLOAT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

++
||
++
++



In [15]:
%%sql

LOAD DATA LOCAL INPATH '/data/retail_db/order_items' INTO TABLE order_items

++
||
++
++



In [16]:
%%sql

SELECT o.order_date,
    oi.order_item_product_id,
    round(sum(oi.order_item_subtotal), 2) AS product_revenue
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date,
    oi.order_item_product_id

+--------------------+---------------------+---------------+
|          order_date|order_item_product_id|product_revenue|
+--------------------+---------------------+---------------+
|2013-07-27 00:00:...|                  703|          39.98|
|2013-07-29 00:00:...|                  793|          44.97|
|2013-08-12 00:00:...|                  627|         3199.2|
|2013-08-15 00:00:...|                  926|          15.99|
|2013-09-04 00:00:...|                  957|        3599.76|
|2013-09-07 00:00:...|                  235|         104.97|
|2013-09-17 00:00:...|                  792|          14.99|
|2013-09-25 00:00:...|                   44|         239.96|
|2013-09-27 00:00:...|                  276|          31.99|
|2013-10-04 00:00:...|                  792|          44.97|
+--------------------+---------------------+---------------+
only showing top 10 rows



In [17]:
%%sql

SELECT o.order_date,
    oi.order_item_product_id,
    round(sum(oi.order_item_subtotal), 2) AS product_revenue
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date,
    oi.order_item_product_id
ORDER BY o.order_date,
    product_revenue DESC

+--------------------+---------------------+---------------+
|          order_date|order_item_product_id|product_revenue|
+--------------------+---------------------+---------------+
|2013-07-25 00:00:...|                 1004|        5599.72|
|2013-07-25 00:00:...|                  191|        5099.49|
|2013-07-25 00:00:...|                  957|         4499.7|
|2013-07-25 00:00:...|                  365|        3359.44|
|2013-07-25 00:00:...|                 1073|        2999.85|
|2013-07-25 00:00:...|                 1014|        2798.88|
|2013-07-25 00:00:...|                  403|        1949.85|
|2013-07-25 00:00:...|                  502|         1650.0|
|2013-07-25 00:00:...|                  627|        1079.73|
|2013-07-25 00:00:...|                  226|         599.99|
+--------------------+---------------------+---------------+
only showing top 10 rows



* Using Spark SQL with Python or Scala: the same statements can be submitted through `spark.sql`.

In [18]:
spark.sql("DROP DATABASE IF EXISTS itv002461_retail CASCADE")

[]

In [22]:
spark.sql("CREATE DATABASE IF NOT EXISTS itv002461_retail")

[]

In [23]:
spark.sql("USE itv002461_retail")

[]

In [24]:
spark.sql("SHOW tables").show()

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+



In [25]:
spark.sql("""
CREATE TABLE orders (
    order_id INT,
    order_date STRING,
    order_customer_id INT,
    order_status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")

[]

In [26]:
spark.sql("""
LOAD DATA LOCAL INPATH '/data/retail_db/orders' 
INTO TABLE orders
""")

[]

In [27]:
spark.sql("""
CREATE TABLE order_items (
    order_item_id INT,
    order_item_order_id INT,
    order_item_product_id INT,
    order_item_quantity INT,
    order_item_subtotal FLOAT,
    order_item_product_price FLOAT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")

[]

In [28]:
spark.sql("""
LOAD DATA LOCAL INPATH '/data/retail_db/order_items' 
INTO TABLE order_items
""")

[]

In [29]:
spark.sql("""
SELECT o.order_date,
    oi.order_item_product_id,
    round(sum(oi.order_item_subtotal), 2) AS product_revenue
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date,
    oi.order_item_product_id
""").show()

+--------------------+---------------------+---------------+
|          order_date|order_item_product_id|product_revenue|
+--------------------+---------------------+---------------+
|2014-03-28 00:00:...|                  793|          59.96|
|2014-04-09 00:00:...|                  191|        6599.34|
|2014-04-10 00:00:...|                  775|           9.99|
|2014-04-15 00:00:...|                  116|         404.91|
|2014-05-03 00:00:...|                  172|          120.0|
|2014-05-24 00:00:...|                  249|         164.91|
|2014-05-29 00:00:...|                 1014|        2698.92|
|2014-06-01 00:00:...|                 1073|        3199.84|
|2014-06-06 00:00:...|                  810|          39.98|
|2014-06-08 00:00:...|                  792|          89.94|
|2014-06-30 00:00:...|                  502|         4000.0|
|2014-07-01 00:00:...|                  926|          31.98|
|2014-07-02 00:00:...|                  793|          14.99|
|2013-08-12 00:00:...|  

In [30]:
spark.sql("""
SELECT o.order_date,
    oi.order_item_product_id,
    round(sum(oi.order_item_subtotal), 2) AS product_revenue
FROM orders o JOIN order_items oi
    ON o.order_id = oi.order_item_order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date,
    oi.order_item_product_id
ORDER BY o.order_date,
    product_revenue DESC
""").show()

+--------------------+---------------------+---------------+
|          order_date|order_item_product_id|product_revenue|
+--------------------+---------------------+---------------+
|2013-07-25 00:00:...|                 1004|        5599.72|
|2013-07-25 00:00:...|                  191|        5099.49|
|2013-07-25 00:00:...|                  957|         4499.7|
|2013-07-25 00:00:...|                  365|        3359.44|
|2013-07-25 00:00:...|                 1073|        2999.85|
|2013-07-25 00:00:...|                 1014|        2798.88|
|2013-07-25 00:00:...|                  403|        1949.85|
|2013-07-25 00:00:...|                  502|         1650.0|
|2013-07-25 00:00:...|                  627|        1079.73|
|2013-07-25 00:00:...|                  226|         599.99|
|2013-07-25 00:00:...|                   24|         319.96|
|2013-07-25 00:00:...|                  821|         207.96|
|2013-07-25 00:00:...|                  625|         199.99|
|2013-07-25 00:00:...|  