
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Exercise #4 - XML Ingestion, Products Table

The products being sold by our sales reps are itemized in an XML document which we will need to load.

Unlike CSV, JSON, Parquet, & Delta, support for XML is not included with the default distribution of Apache Spark.

Before we can load the XML document, we need additional support for a **`DataFrameReader`** that can process XML files.

Once the **spark-xml** library is installed on our cluster, we can load our XML document and proceed with our other transformations.

This exercise is broken up into 4 steps:
* Exercise 4.A - Use Database
* Exercise 4.B - Install Library
* Exercise 4.C - Load Products
* Exercise 4.D - Load ProductLineItems

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Setup Exercise #4</h2>

To get started, we first need to configure your Registration ID and then run the setup notebook.

### Setup - Registration ID

In the next command, please update the variable **`registration_id`** with the Registration ID you received when you signed up for this project.

For more information, see [Registration ID]($./Registration ID)

In [0]:
registration_id = "3203488"

### Setup - Run the exercise setup

Run the following cell to set up this exercise, declaring exercise-specific variables and functions.

In [0]:
%run ./_includes/Setup-Exercise-04

Variable/Function,Value,Description
username,andrew.barry@infinitive.com,This is the email address that you signed into Databricks with
working_dir,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone,This is the directory in which all work should be conducted
user_db,dbacademy_andrew_barry_infinitive_com_db,The name of the database you will use for this project.
products_table,products,The name of the products table.
products_xml_path,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/products/products.xml,The location of the product's XML file


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #4.A - Use Database</h2>

Each notebook uses a different Spark session and will initially use the **`default`** database.

As in the previous exercise, we can avoid contention over commonly named tables by using our user-specific database.

**In this step you will need to:**
* Use the database identified by the variable **`user_db`** so that any tables created in this notebook are **NOT** added to the **`default`** database

### Implement Exercise #4.A

Implement your solution in the following cell:

In [0]:
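# Build the USE statement from the user-specific database name (no trailing semicolon is needed)
use_query = "USE {}".format(user_db)

# spark.sql is the SparkSession entry point; sqlContext is the legacy equivalent
spark.sql(use_query)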
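
Because the database name is derived from an email address, it can be worth checking the name before interpolating it into SQL. The following is a minimal sketch, not part of the course material: `build_use_statement` is a hypothetical helper, and the allowed character set is an assumption based on the database name shown in the setup output.

```python
import re


def build_use_statement(database: str) -> str:
    """Return a USE statement with the database name quoted in backticks.

    Hypothetical helper: the allowed character set is an assumption.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", database):
        raise ValueError("Unexpected characters in database name: {!r}".format(database))
    # Backticks quote an identifier in Spark SQL
    return "USE `{}`".format(database)


statement = build_use_statement("dbacademy_andrew_barry_infinitive_com_db")
print(statement)
```

In a notebook, the resulting string could be passed to `spark.sql(...)`, and `spark.catalog.currentDatabase()` can then be used to confirm which database is active.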

### Reality Check #4.A
Run the following command to ensure that you are on track:

In [0]:
reality_check_04_a()

Points,Test,Result
1,"Using DBR 7.3 LTS, with 8 cores",
1,Valid Registration ID,
1,The current database is dbacademy_andrew_barry_infinitive_com_db,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #4.B - Install Library</h2>

**In this step you will need to:**
* Register the **spark-xml** library - edit your cluster configuration and then from the **Libraries** tab, install the following library:
  * Type: **Maven**
  * Coordinates: **com.databricks:spark-xml_2.12:0.10.0**
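
A Maven coordinate has the form `groupId:artifactId:version`, and Scala libraries conventionally append the Scala binary version to the artifact name (here `_2.12`). That suffix should match the Scala version of the cluster runtime. The sketch below only illustrates how the coordinate breaks down; `MavenCoordinate` and `parse_coordinate` are hypothetical names and are not part of any Databricks API.

```python
from typing import NamedTuple


class MavenCoordinate(NamedTuple):
    group_id: str
    artifact_id: str
    scala_version: str  # empty string if the artifact has no _<scala> suffix
    version: str


def parse_coordinate(coordinate: str) -> MavenCoordinate:
    """Split 'group:artifact_scala:version' into its parts."""
    group_id, artifact, version = coordinate.split(":")
    # rpartition returns ('', '', artifact) when there is no underscore
    name, sep, scala_version = artifact.rpartition("_")
    if not sep:
        name, scala_version = artifact, ""
    return MavenCoordinate(group_id, name, scala_version, version)


coord = parse_coordinate("com.databricks:spark-xml_2.12:0.10.0")
print(coord)
```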

If you are unfamiliar with this process, more information can be found in the <a href="https://docs.databricks.com/libraries/cluster-libraries.html" target="_blank">Cluster libraries documentation</a>.

Once the library is installed, run the following reality check to confirm proper installation.<br/>
Note: You may need to restart the cluster after installing the library for your changes to take effect.

In [0]:
reality_check_04_b()

Points,Test,Result
1,Successfully installed the spark-xml library,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #4.C - Load Products</h2>

With the **spark-xml** library installed, ingesting an XML document works like ingesting any other dataset; only the format name and a few XML-specific options differ.

**In this step you will need to:**
* Load the XML document using the following parameters:
  * Format: **xml**
  * Options:
    * **`rootTag`** = **`products`** - identifies the root tag in the XML document, in our case this is "products"
    * **`rowTag`** = **`product`** - identifies the tag of each row under the root tag, in our case this is "product"
    * **`inferSchema`** = **`True`** - the file is small and this is a one-shot operation, so inferring the schema will save us some time
  * File Path: specified by the variable **`products_xml_path`**
  
* Update the schema to conform to the following specification:
  * **`product_id`**:**`string`**
  * **`color`**:**`string`**
  * **`model_name`**:**`string`**
  * **`model_number`**:**`string`**
  * **`base_price`**:**`double`**
  * **`color_adj`**:**`double`**
  * **`size_adj`**:**`double`**
  * **`price`**:**`double`**
  * **`size`**:**`string`**

* Exclude any records for which a **`price`** was not included - these represent products that are not yet available for sale.
* Load the dataset to the managed delta table **`products`** (identified by the variable **`products_table`**)
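
To make the role of **`rowTag`** and the target schema more concrete, here is a small stand-in that uses Python's standard `xml.etree.ElementTree` instead of spark-xml. The sample document is hypothetical: its layout is inferred from the column names that appear in this notebook's output (spark-xml prefixes XML attributes with `_` by default), and the way a missing price is represented in the real file is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real products.xml may be laid out differently.
SAMPLE_XML = """
<products>
  <product product_id="7a41323a-560f-4e34-aba6-995e2325f95e">
    <color>red</color>
    <model_name>Movio Classic</model_name>
    <model_number>HM1K-BT-S-R</model_number>
    <price base_price="95.32" color_adj="1.0" size_adj="0.9">
      <usd>85.788</usd>
    </price>
    <size>small</size>
  </product>
  <product product_id="00000000-0000-0000-0000-000000000000">
    <color>red</color>
    <model_name>Unreleased Model</model_name>
    <model_number>XX-0000</model_number>
    <size>small</size>
  </product>
</products>
"""


def to_float(value):
    """Convert text to float, passing None through unchanged."""
    return float(value) if value is not None else None


def to_row(product):
    """Flatten one <product> element into the target column layout."""
    price = product.find("price")
    attrs = price.attrib if price is not None else {}
    usd = price.findtext("usd") if price is not None else None
    return {
        "product_id": product.get("product_id"),
        "color": product.findtext("color"),
        "model_name": product.findtext("model_name"),
        "model_number": product.findtext("model_number"),
        "base_price": to_float(attrs.get("base_price")),
        "color_adj": to_float(attrs.get("color_adj")),
        "size_adj": to_float(attrs.get("size_adj")),
        "price": to_float(usd),
        "size": product.findtext("size"),
    }


root = ET.fromstring(SAMPLE_XML.strip())
# Each <product> element plays the role of one row, as rowTag does in spark-xml
rows = [to_row(p) for p in root.iter("product")]
# Drop products without a price (not yet available for sale)
available = [r for r in rows if r["price"] is not None]
print(available)
```

The same three ideas carry over to the Spark solution: one row per row tag, nested price fields promoted to top-level columns, and rows without a price filtered out.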

### Implement Exercise #4.C

Implement your solution in the following cell:

In [0]:
df = (spark.read
  .format("xml")
  .option("rootTag", "products")
  .option("rowTag", "product")
  .option("inferSchema", True)
  .load(products_xml_path)
)
display(df)

_product_id,color,model_name,model_number,price,size
7a41323a-560f-4e34-aba6-995e2325f95e,red,Movio Classic,HM1K-BT-S-R,"List(95.32, 1.0, 0.9, 85.788)",small
bc93ed89-bb15-4e46-a110-a5878e46ccf6,green,Movio Classic,HM1K-BT-S-G,"List(95.32, 1.0, 0.9, 85.788)",small
ec15ba1d-53b6-44b0-8a22-1e498485f1b8,blue,Movio Classic,HM1K-BT-S-B,"List(95.32, 1.0, 0.9, 85.788)",small
a8fbcfea-4352-4c5a-af8b-c8623258b4f8,white,Movio Classic,HM1K-BT-S-W,"List(95.32, 1.1, 0.9, 94.3668)",small
95cbadca-cf90-4b8a-a134-2976f6ba6df8,red,Movio Classic,HM1K-BT-M-R,"List(95.32, 1.0, 0.95, 90.55399999999999)",medium
e26839a2-44fd-4003-a06b-faf6a2dff077,green,Movio Classic,HM1K-BT-M-G,"List(95.32, 1.0, 0.95, 90.55399999999999)",medium
699fcfe8-ce60-42c9-9d0f-728df3e48d70,blue,Movio Classic,HM1K-BT-M-B,"List(95.32, 1.0, 0.95, 90.55399999999999)",medium
e672483e-57a8-434a-bc42-ecf827c8a8d4,white,Movio Classic,HM1K-BT-M-W,"List(95.32, 1.1, 0.95, 99.6094)",medium
8d809e13-fdc5-4d15-9271-953750f6d592,red,Movio Classic,HM1K-BT-L-R,"List(95.32, 1.0, 1.0, 95.32)",large
668b2c1f-d76e-4bf0-82bb-c7d5776524a4,green,Movio Classic,HM1K-BT-L-G,"List(95.32, 1.0, 1.0, 95.32)",large


In [0]:
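from pyspark.sql.functions import col

df = (df
  # spark-xml prefixes XML attributes with "_"; expose the ID under its final name
  .withColumn("product_id", col("_product_id"))
  # Promote the fields of the nested price struct to top-level double columns
  .withColumn("base_price", col("price._base_price").cast("double"))
  .withColumn("color_adj", col("price._color_adj").cast("double"))
  .withColumn("size_adj", col("price._size_adj").cast("double"))
  # Replace the struct with its usd value last, after the struct fields have been read
  .withColumn("price", col("price.usd").cast("double"))
  .drop("_product_id")
  # Keep only products that have a price, i.e. those available for sale
  .filter(col("price").isNotNull())
)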
  
display(df)

color,model_name,model_number,price,size,product_id,base_price,color_adj,size_adj
red,Movio Classic,HM1K-BT-S-R,85.788,small,7a41323a-560f-4e34-aba6-995e2325f95e,95.32,1.0,0.9
green,Movio Classic,HM1K-BT-S-G,85.788,small,bc93ed89-bb15-4e46-a110-a5878e46ccf6,95.32,1.0,0.9
blue,Movio Classic,HM1K-BT-S-B,85.788,small,ec15ba1d-53b6-44b0-8a22-1e498485f1b8,95.32,1.0,0.9
white,Movio Classic,HM1K-BT-S-W,94.3668,small,a8fbcfea-4352-4c5a-af8b-c8623258b4f8,95.32,1.1,0.9
red,Movio Classic,HM1K-BT-M-R,90.554,medium,95cbadca-cf90-4b8a-a134-2976f6ba6df8,95.32,1.0,0.95
green,Movio Classic,HM1K-BT-M-G,90.554,medium,e26839a2-44fd-4003-a06b-faf6a2dff077,95.32,1.0,0.95
blue,Movio Classic,HM1K-BT-M-B,90.554,medium,699fcfe8-ce60-42c9-9d0f-728df3e48d70,95.32,1.0,0.95
white,Movio Classic,HM1K-BT-M-W,99.6094,medium,e672483e-57a8-434a-bc42-ecf827c8a8d4,95.32,1.1,0.95
red,Movio Classic,HM1K-BT-L-R,95.32,large,8d809e13-fdc5-4d15-9271-953750f6d592,95.32,1.0,1.0
green,Movio Classic,HM1K-BT-L-G,95.32,large,668b2c1f-d76e-4bf0-82bb-c7d5776524a4,95.32,1.0,1.0
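
The sample rows above suggest that `price` equals `base_price * color_adj * size_adj`. This relationship is inferred from the displayed data, not stated by the exercise, so treat it as an assumption. A quick sanity check in plain Python, using a tolerance because the values are floating-point doubles:

```python
import math

# (base_price, color_adj, size_adj, price) copied from the displayed rows
samples = [
    (95.32, 1.0, 0.9, 85.788),
    (95.32, 1.1, 0.9, 94.3668),
    (95.32, 1.0, 0.95, 90.554),
    (95.32, 1.1, 0.95, 99.6094),
    (95.32, 1.0, 1.0, 95.32),
]


def price_matches(base_price, color_adj, size_adj, price):
    """Compare the computed price to the listed price within a relative tolerance."""
    expected = base_price * color_adj * size_adj
    return math.isclose(expected, price, rel_tol=1e-9)


mismatches = [s for s in samples if not price_matches(*s)]
print(mismatches)
```

Comparing with a tolerance rather than `==` matters here: the raw struct output earlier in this exercise shows a value such as 90.55399999999999 where the flattened table displays 90.554.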


In [0]:
df.printSchema()

df.write.format("delta").mode("overwrite").saveAsTable(products_table)

### Reality Check #4.C
Run the following command to ensure that you are on track:

In [0]:
reality_check_04_c()

Points,Test,Result
1,The current database is dbacademy_andrew_barry_infinitive_com_db,
1,The table products exists,
1,The table products is a managed table,
1,Using the Delta file format,
1,Schema is valid,
1,Expected 12 records,
1,Sample A of color_adj (valid values),
1,Sample B of color_adj (valid values),
1,Sample A of size_adj (valid values),
1,Sample B of size_adj (valid values),


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #4 - Final Check</h2>

Run the following command to make sure this exercise is complete:

In [0]:
reality_check_04_final()

Points,Test,Result
1,Reality Check 04.A passed,
1,Reality Check 04.B passed,
1,Reality Check 04.C passed,
