
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Exercise #6 - Business Questions

In our last exercise, we are going to execute various joins across our four tables (**`orders`**, **`line_items`**, **`sales_reps`** and **`products`**) to answer basic business questions.

This exercise is broken up into 4 steps:
* Exercise 6.A - Use Database
* Exercise 6.B - Question #1
* Exercise 6.C - Question #2
* Exercise 6.D - Question #3

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Setup Exercise #6</h2>

To get started, we first need to configure your Registration ID and then run the setup notebook.

### Setup - Registration ID

In the next command, please update the variable **`registration_id`** with the Registration ID you received when you signed up for this project.

For more information, see [Registration ID]($./Registration ID)

In [0]:
registration_id = "3339094"

### Setup - Run the exercise setup

Run the following cell to set up this exercise, declaring exercise-specific variables and functions.

In [0]:
%run ./_includes/Setup-Exercise-06

Variable/Function,Value,Description
username,dakota.murdock@wavicledata.com,This is the email address that you signed into Databricks with
working_dir,dbfs:/user/dakota.murdock@wavicledata.com/dbacademy/developer-foundations-capstone,This is the directory in which all work should be conducted
user_db,dbacademy_dakota_murdock_wavicledata_com_developer_foundations_capstone,The name of the database you will use for this project.
orders_table,orders,The name of the orders table.
products_table,products,The name of the products table.


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #6.A - Use Database</h2>

Each notebook uses a different Spark session and will initially use the **`default`** database.

As in the previous exercise, we can avoid contention over commonly named tables by using our user-specific database.

**In this step you will need to:**
* Use the database identified by the variable **`user_db`** so that any tables created in this notebook are **NOT** added to the **`default`** database

### Implement Exercise #6.A

Implement your solution in the following cell:

In [0]:
spark.sql(f"USE {user_db};")

Out[13]: DataFrame[]

### Reality Check #6.A
Run the following command to ensure that you are on track:

In [0]:
reality_check_06_a()

Points,Test,Result
1,Using DBR 9.1 & Proper Cluster Configuration,
1,Valid Registration ID,
1,The current database is dbacademy_dakota_murdock_wavicledata_com_developer_foundations_capstone,
1,"Expected 195,718 orders (20 new)",
1,"Expected 1,175,967 records (97 new)",
1,Expected 12 products,
1,Expected 93 sales reps,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #6.B - Question #1</h2>
## How many orders were shipped to each state?

**In this step you will need to:**
* Aggregate the orders by **`shipping_address_state`**
* Count the number of records for each state
* Sort the results by **`count`**, in descending order
* Save the results to the temporary view **`question_1_results`**, identified by the variable **`question_1_results_table`**
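
The aggregation steps above can be tried on a small scale before running them against the full table. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for Spark SQL; the six rows are made up for illustration, and only the two column names are taken from the exercise.

```python
import sqlite3

# In-memory database with a toy orders table (hypothetical rows, not the real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, shipping_address_state TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("o1", "CA"), ("o2", "CA"), ("o3", "NC"), ("o4", "CA"), ("o5", "NC"), ("o6", "MT")],
)

# Group by state, count the records in each group, and order by that count, largest first.
state_counts = conn.execute(
    """
    SELECT shipping_address_state, COUNT(*) AS count
    FROM orders
    GROUP BY shipping_address_state
    ORDER BY count DESC
    """
).fetchall()

print(state_counts)
```

In Spark SQL the same query shape applies, with one caveat: a total ordering requires `ORDER BY`, because `SORT BY` only orders rows within each partition.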

### Implement Exercise #6.B

Implement your solution in the following cell:

In [0]:
# ORDER BY gives a total ordering; SORT BY would only order rows within each partition.
df = spark.sql("""SELECT shipping_address_state, count(order_id) AS count
               FROM orders
               GROUP BY shipping_address_state
               ORDER BY count DESC""")

df.createOrReplaceTempView(question_1_results_table)

### Reality Check #6.B
Run the following command to ensure that you are on track:

In [0]:
reality_check_06_b()

Points,Test,Result
1,"The table ""question_1_results"" exists",
1,"The table ""question_1_results"" is a temp view",
1,"Schema contains the column ""shipping_address_state"".",
1,"Schema contains the column ""count"".",
1,Expected the first state to be CA,
1,"Expected the first count to be 44,025",
1,Expected the last state to be MT,
1,Expected the last count to be 403,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #6.C - Question #2</h2>
## What is the average, minimum and maximum sale price for green products sold to North Carolina where the Sales Rep submitted an invalid Social Security Number (SSN)?

**In this step you will need to:**
* Execute a join across all four tables:
  * **`orders`**, identified by the variable **`orders_table`**
  * **`line_items`**, identified by the variable **`line_items_table`**
  * **`products`**, identified by the variable **`products_table`**
  * **`sales_reps`**, identified by the variable **`sales_reps_table`**
* Limit the result to only green products (**`color`**).
* Limit the result to orders shipped to North Carolina (**`shipping_address_state`**)
* Limit the result to sales reps that initially submitted an improperly formatted SSN (**`_error_ssn_format`**)
* Calculate the average, minimum and maximum of **`product_sold_price`** - do not rename these columns after computing.
* Save the results to the temporary view **`question_2_results`**, identified by the variable **`question_2_results_table`**
* The temporary view should have the following three columns: **`avg(product_sold_price)`**, **`min(product_sold_price)`**, **`max(product_sold_price)`**
* Collect the results to the driver
* Assign the average, minimum, and maximum values to the following local variables; these variables will be passed to the reality check for validation.
 * **`ex_avg`** - the local variable holding the average value
 * **`ex_min`** - the local variable holding the minimum value
 * **`ex_max`** - the local variable holding the maximum value
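
The join, the three filters, and the extraction into local variables can be illustrated on toy data. This is a sketch only: it uses Python's built-in `sqlite3` module in place of Spark SQL, every row is invented, and the flag `_error_ssn_format` is stored as `1`/`0` here because SQLite has no boolean type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy versions of the four tables, reduced to the columns the question needs.
conn.executescript(
    """
    CREATE TABLE orders (order_id TEXT, sales_rep_id TEXT, shipping_address_state TEXT);
    CREATE TABLE line_items (order_id TEXT, product_id TEXT, product_sold_price REAL);
    CREATE TABLE products (product_id TEXT, color TEXT);
    CREATE TABLE sales_reps (sales_rep_id TEXT, _error_ssn_format INTEGER);

    INSERT INTO orders VALUES ('o1', 'r1', 'NC'), ('o2', 'r2', 'NC'), ('o3', 'r1', 'CA');
    INSERT INTO line_items VALUES
        ('o1', 'p1', 90.0), ('o1', 'p2', 50.0), ('o1', 'p1', 100.0),
        ('o2', 'p1', 95.0), ('o3', 'p1', 110.0);
    INSERT INTO products VALUES ('p1', 'green'), ('p2', 'red');
    INSERT INTO sales_reps VALUES ('r1', 1), ('r2', 0);
    """
)

# Join all four tables, apply the three filters, and aggregate into a single row.
query = """
    SELECT AVG(li.product_sold_price), MIN(li.product_sold_price), MAX(li.product_sold_price)
    FROM line_items AS li
    INNER JOIN orders AS o ON li.order_id = o.order_id
    INNER JOIN products AS p ON li.product_id = p.product_id
    INNER JOIN sales_reps AS r ON o.sales_rep_id = r.sales_rep_id
    WHERE p.color = 'green'
      AND o.shipping_address_state = 'NC'
      AND r._error_ssn_format = 1
"""

# The aggregate yields exactly one row, so it can be unpacked directly.
ex_avg, ex_min, ex_max = conn.execute(query).fetchone()
print(ex_avg, ex_min, ex_max)
```

Only the two green line items on order `o1` survive all three filters, which makes the result easy to check by hand.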

### Implement Exercise #6.C

Implement your solution in the following cell:

In [0]:
# Join the four tables, then keep green products shipped to NC by reps flagged for SSN format.
df = spark.sql(f"""SELECT avg(product_sold_price), min(product_sold_price), max(product_sold_price)
                    FROM {line_items_table}
                    INNER JOIN {orders_table} ON {line_items_table}.order_id = {orders_table}.order_id
                    INNER JOIN {products_table} ON {line_items_table}.product_id = {products_table}.product_id
                    INNER JOIN {sales_reps_table} ON {orders_table}.sales_rep_id = {sales_reps_table}.sales_rep_id
                    WHERE {products_table}.color = 'green'
                      AND {orders_table}.shipping_address_state = 'NC'
                      AND {sales_reps_table}._error_ssn_format = true
""")

df.createOrReplaceTempView(question_2_results_table)

# Collect the single aggregate row to the driver once and read all three values from it.
row = df.collect()[0]

ex_avg = row["avg(product_sold_price)"]
ex_min = row["min(product_sold_price)"]
ex_max = row["max(product_sold_price)"]

### Reality Check #6.C
Run the following command to ensure that you are on track:

In [0]:
reality_check_06_c(ex_avg, ex_min, ex_max)

Points,Test,Result
1,"The table ""question_2_results"" exists",
1,"The table ""question_2_results"" is a temp view",
1,"Schema contains the column ""avg(product_sold_price)"".",
1,"Schema contains the column ""min(product_sold_price)"".",
1,"Schema contains the column ""max(product_sold_price)"".",
1,Expected the temp view's average to be 96.902253,
1,Expected the temp view's minimum to be 85.79,
1,Expected the temp view's maximum to be 113.43,
1,Expected the extracted average to be 96.902253,
1,Expected the extracted minimum to be 85.79,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #6.D - Question #3</h2>
## What is the first and last name of the top earning sales rep based on net sales?

For this scenario...
* The top earning sales rep will be identified as the individual producing the largest profit.
* Profit is defined as the difference between **`product_sold_price`** and **`price`** which is then<br/>
  multiplied by **`product_quantity`** as seen in **(product_sold_price - price) * product_quantity**

**In this step you will need to:**
* Execute a join across all four tables:
  * **`orders`**, identified by the variable **`orders_table`**
  * **`line_items`**, identified by the variable **`line_items_table`**
  * **`products`**, identified by the variable **`products_table`**
  * **`sales_reps`**, identified by the variable **`sales_reps_table`**
* Calculate the profit for each line item of an order as described above.
* Aggregate the results by the sales reps' first and last names and then sum each rep's total profit.
* Reduce the dataset to a single row for the sales rep with the largest profit.
* Save the results to the temporary view **`question_3_results`**, identified by the variable **`question_3_results_table`**
* The temporary view should have the following three columns: 
  * **`sales_rep_first_name`** - the first column by which to aggregate
  * **`sales_rep_last_name`** - the second column by which to aggregate
  * **`sum(total_profit)`** - the summation of the column **`total_profit`**
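
The profit formula and the reduction to a single top rep can be worked through by hand on a few rows. The sketch below is plain Python rather than Spark; the rep names and numbers are hypothetical and stand in for the output of the four-table join.

```python
from collections import defaultdict

# Hypothetical joined rows:
# (sales_rep_first_name, sales_rep_last_name, product_sold_price, price, product_quantity)
joined_rows = [
    ("Ada", "Lee", 120.0, 100.0, 2),  # profit: (120 - 100) * 2 = 40
    ("Ada", "Lee", 80.0, 100.0, 1),   # profit: (80 - 100) * 1 = -20 (sold at a loss)
    ("Bo", "Kim", 110.0, 100.0, 3),   # profit: (110 - 100) * 3 = 30
]

# Sum the per-line-item profit for each (first name, last name) pair.
profit_by_rep = defaultdict(float)
for first_name, last_name, sold_price, price, quantity in joined_rows:
    profit_by_rep[(first_name, last_name)] += (sold_price - price) * quantity

# Reduce to the single rep with the largest total profit.
top_rep, top_profit = max(profit_by_rep.items(), key=lambda item: item[1])
print(top_rep, top_profit)
```

Note that a line item sold below its list price contributes a negative profit, so the rep with the single largest sale is not necessarily the top earner.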

### Implement Exercise #6.D

Implement your solution in the following cell:

In [0]:
# Import under a namespace: a wildcard import would shadow Python built-ins such as sum, min, max and round.
from pyspark.sql import functions as F

df = (spark.sql(f"""SELECT product_sold_price, product_quantity, {products_table}.price,
                          {sales_reps_table}.sales_rep_first_name, {sales_reps_table}.sales_rep_last_name
                   FROM {line_items_table}
                   INNER JOIN {orders_table} ON {line_items_table}.order_id = {orders_table}.order_id
                   INNER JOIN {products_table} ON {line_items_table}.product_id = {products_table}.product_id
                   INNER JOIN {sales_reps_table} ON {orders_table}.sales_rep_id = {sales_reps_table}.sales_rep_id
                """)
      # Profit per line item, as defined above.
      .withColumn("total_profit", (F.col("product_sold_price") - F.col("price")) * F.col("product_quantity"))
      .groupBy("sales_rep_first_name", "sales_rep_last_name")
      .agg(F.sum("total_profit"))  # The default column name is sum(total_profit).
      .orderBy(F.col("sum(total_profit)").desc())
      .limit(1)
     )

df.createOrReplaceTempView(question_3_results_table)

### Reality Check #6.D
Run the following command to ensure that you are on track:

In [0]:
reality_check_06_d()

PYTHON ERROR Invalid argument, not a string or column: 1636992872404.4348 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636992872936.6887 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636992872971.6099 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636992872991.823 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636992911785.75 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636992911805.5986 of type <class 'float'>. For column literals, use 'lit

Points,Test,Result
1,"The table ""question_3_results"" exists",
1,"The table ""question_3_results"" is a temp view",
1,"Schema contains the column ""sales_rep_first_name"".",
1,"Schema contains the column ""sales_rep_last_name"".",
1,Expected 1 record,
1,Expected the first name to be River,
1,Expected the last name to be Spears,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #6 - Final Check</h2>

Run the following command to make sure this exercise is complete:

In [0]:
reality_check_06_final()

PYTHON ERROR Invalid argument, not a string or column: 1636992911881.4907 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636992911915.724 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636992911932.0671 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636992911947.1167 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636992911962.1936 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636992911978.3254 of type <class 'float'>. For column literals, use 'l

Points,Test,Result
1,Reality Check 06.A passed,
1,Reality Check 06.B passed,
1,Reality Check 06.C passed,
1,Reality Check 06.D passed,


&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>