
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Exercise #4 - XML Ingestion, Products Table

The products being sold by our sales reps are itemized in an XML document which we will need to load.

Unlike CSV, JSON, Parquet, & Delta, support for XML is not included with the default distribution of Apache Spark.

Before we can load the XML document, we need additional support for a **`DataFrameReader`** that can process XML files.

Once the **spark-xml** library is installed on our cluster, we can load our XML document and proceed with our other transformations.

This exercise is broken up into 4 steps:
* Exercise 4.A - Use Database
* Exercise 4.B - Install Library
* Exercise 4.C - Load Products
* Exercise 4.D - Load ProductLineItems

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Setup Exercise #4</h2>
To get started, run the following cell to set up this exercise, declaring exercise-specific variables and functions.

In [0]:
%run ./_includes/Setup-Exercise-04

| Variable/Function | Value | Description |
|---|---|---|
| username | cenz.wong@ekimetrics.com | This is the email address that you signed into Databricks with |
| working_dir | dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone | This is the directory in which all work should be conducted |
| user_db | dbacademy_cenz_wong_ekimetrics_com_developer_foundations_capstone | The name of the database you will use for this project. |
| products_table | products | The name of the products table. |
| products_xml_path | dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/raw/products/products.xml | The location of the product's XML file |


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #4.A - Use Database</h2>

Each notebook uses a different Spark session and will initially use the **`default`** database.

As in the previous exercise, we can avoid contention over commonly named tables by using our user-specific database.

**In this step you will need to:**
* Use the database identified by the variable **`user_db`** so that any tables created in this notebook are **NOT** added to the **`default`** database
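
A minimal sketch of the statements this step calls for, assuming **`user_db`** holds an ordinary database name. The helper name `database_statements` and the backtick quoting are illustrative additions rather than part of the course material; on a cluster, each string would be passed to `spark.sql`.

```python
def database_statements(db_name: str) -> list:
    """Build the SQL needed to create (if missing) and switch to a database."""
    # Backticks guard against characters that are not valid in a bare
    # identifier; a backtick inside the name is escaped by doubling it.
    quoted = "`{}`".format(db_name.replace("`", "``"))
    return [
        "CREATE DATABASE IF NOT EXISTS {}".format(quoted),
        "USE {}".format(quoted),
    ]


# Hypothetical database name, for illustration only.
for statement in database_statements("my_capstone_db"):
    print(statement)  # on a cluster: spark.sql(statement)
```

Creating the database first keeps the `USE` statement from failing when the notebook is run on a fresh workspace.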

### Implement Exercise #4.A

Implement your solution in the following cell:

In [0]:
# Spark Hive table operations
spark.sql("CREATE DATABASE IF NOT EXISTS {}".format(user_db))
spark.sql("USE {}".format(user_db))

Out[12]: DataFrame[]

### Reality Check #4.A
Run the following command to ensure that you are on track:

In [0]:
reality_check_04_a()

Points,Test,Result
1,Using DBR 9.1 & Proper Cluster Configuration,
1,Valid Registration ID,
1,The current database is dbacademy_cenz_wong_ekimetrics_com_developer_foundations_capstone,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #4.B - Install Library</h2>

**In this step you will need to:**
* Register the **spark-xml** library - edit your cluster configuration and then from the **Libraries** tab, install the following library:
  * Type: **Maven**
  * Coordinates: **com.databricks:spark-xml_2.12:0.10.0**

If you are unfamiliar with this process, more information can be found in the <a href="https://docs.databricks.com/libraries/cluster-libraries.html" target="_blank">Cluster libraries documentation</a>.
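
The Maven coordinate packs three pieces of information into one string: group, artifact, and version, with the artifact name carrying the Scala version it was built for as a suffix. The sketch below only splits the string to make those parts visible; `parse_maven_coordinate` is a hypothetical helper and does not contact any repository.

```python
from typing import NamedTuple


class MavenCoordinate(NamedTuple):
    group: str
    artifact: str
    scala_version: str
    version: str


def parse_maven_coordinate(coordinate: str) -> MavenCoordinate:
    """Split 'group:artifact_scalaVersion:version' into its parts."""
    group, artifact, version = coordinate.split(":")
    # Scala libraries conventionally append "_<scala binary version>"
    # to the artifact name; fall back to an empty suffix if it is absent.
    name, separator, scala_version = artifact.rpartition("_")
    if not separator:
        name, scala_version = artifact, ""
    return MavenCoordinate(group, name, scala_version, version)


print(parse_maven_coordinate("com.databricks:spark-xml_2.12:0.10.0"))
```

The Scala suffix (`2.12` here) should match the Scala version of the cluster's runtime, which is worth checking if the library installs but the reader cannot be found.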

Once the library is installed, run the following reality check to confirm proper installation.<br/>
Note: You may need to restart the cluster after installing the library for your changes to take effect.

In [0]:
reality_check_04_b()

Points,Test,Result
1,Successfully installed the spark-xml library,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #4.C - Load Products</h2>

With the **spark-xml** library installed, ingesting an XML document works like ingesting any other dataset, apart from a few XML-specific options.

**In this step you will need to:**
* Load the XML document using the following parameters:
  * Format: **xml**
  * Options:
    * **`rootTag`** = **`products`** - identifies the root tag in the XML document, in our case this is "products"
    * **`rowTag`** = **`product`** - identifies the tag of each row under the root tag, in our case this is "product"
    * **`inferSchema`** = **`True`** - the file is small and this is a one-shot operation, so inferring the schema will save us some time
  * File Path: specified by the variable **`products_xml_path`**
  
* Update the schema to conform to the following specification:
  * **`product_id`**:**`string`**
  * **`color`**:**`string`**
  * **`model_name`**:**`string`**
  * **`model_number`**:**`string`**
  * **`base_price`**:**`double`**
  * **`color_adj`**:**`double`**
  * **`size_adj`**:**`double`**
  * **`price`**:**`double`**
  * **`size`**:**`string`**

* Exclude any records for which a **`price`** was not included - these represent products that are not yet available for sale.
* Load the dataset to the managed Delta table **`products`** (identified by the variable **`products_table`**)
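
To make the row tag and the missing-price rule concrete without a cluster, the sketch below parses a tiny sample with Python's standard library instead of **spark-xml**. The sample document is hypothetical: its layout is guessed from the column names used in the solution for this step, and the real `products.xml` may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row sample; the second product has no <price> element.
SAMPLE = """
<products>
  <product product_id="P-001">
    <color>red</color>
    <model_name>Example</model_name>
    <model_number>EX-1</model_number>
    <price base_price="100.0" color_adj="1.5" size_adj="0.5">
      <usd>102.0</usd>
    </price>
    <size>M</size>
  </product>
  <product product_id="P-002">
    <color>blue</color>
    <model_name>Example</model_name>
    <model_number>EX-2</model_number>
    <size>L</size>
  </product>
</products>
"""


def read_rows(xml_text: str, row_tag: str) -> list:
    """Return one flat dict per <row_tag> element, skipping rows without a price."""
    rows = []
    for element in ET.fromstring(xml_text.strip()).iter(row_tag):
        price = element.find("price")
        if price is None:
            continue  # no price: the product is not yet available for sale
        row = {"product_id": element.get("product_id")}
        for tag in ("color", "model_name", "model_number", "size"):
            row[tag] = element.findtext(tag)
        for attribute in ("base_price", "color_adj", "size_adj"):
            row[attribute] = float(price.get(attribute))
        row["price"] = float(price.findtext("usd"))
        rows.append(row)
    return rows


rows = read_rows(SAMPLE, "product")
print(rows)
```

Each `<product>` element becomes one row, attributes and child elements become columns, and the product without a price is dropped, which mirrors what the Spark pipeline in this step has to achieve.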

### Implement Exercise #4.C

Implement your solution in the following cell:

In [0]:
# https://github.com/databricks/spark-xml#python-api

# One row per <product>; drop rows with no price (not yet for sale)
df = (spark.read.format('xml')
      .options(rootTag='products', rowTag='product', inferSchema=True)
      .load(products_xml_path))
df = df.na.drop(subset=['price'])


In [0]:
# https://stackoverflow.com/questions/63757221/how-to-flatten-json-file-in-pyspark

from pyspark.sql.types import StructType
from pyspark.sql.functions import col


# return a list of all (possibly nested) fields to select, within a given schema
def flatten(schema, prefix: str = ""):
    # return a list of sub-items to select, within a given field
    def field_items(field):
        name = f'{prefix}.{field.name}' if prefix else field.name
        if isinstance(field.dataType, StructType):
            return flatten(field.dataType, name)
        else:
            return [col(name)]
    return [item for field in schema.fields for item in field_items(field)]
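
The recursion in `flatten` can be easier to follow on plain data. The sketch below is an analogue over nested dicts, not Spark code: `flatten_names` and the sample schema are hypothetical, and it returns dotted names where the cell above returns `col(...)` objects.

```python
def flatten_names(schema: dict, prefix: str = "") -> list:
    """Return dotted paths to every leaf field of a nested dict schema."""
    names = []
    for field, dtype in schema.items():
        name = f"{prefix}.{field}" if prefix else field
        if isinstance(dtype, dict):
            # A nested dict plays the role of a StructType: recurse into it.
            names.extend(flatten_names(dtype, name))
        else:
            names.append(name)
    return names


# Hypothetical schema with one nested struct.
nested = {
    "_product_id": "string",
    "color": "string",
    "price": {"_base_price": "double", "usd": "double"},
}
print(flatten_names(nested))
# → ['_product_id', 'color', 'price._base_price', 'price.usd']
```

When Spark selects a column by a dotted path such as `price.usd`, the resulting column is named after the last segment (`usd`), which is why the rename step refers to the short names.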

In [0]:
flattened = flatten(df.schema)

In [0]:
df2 = df.select(*flattened)

In [0]:
df3 = df2.withColumnRenamed('_product_id', 'product_id')\
    .withColumnRenamed('_base_price', 'base_price')\
    .withColumnRenamed('_color_adj', 'color_adj')\
    .withColumnRenamed('_size_adj', 'size_adj')\
    .withColumnRenamed('usd', 'price')
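
Before writing the table, it can help to compare the resulting column types with the specification for this step. The sketch below is one way to do that, assuming the input has the shape of `DataFrame.dtypes` (a list of `(name, type)` string pairs); `schema_mismatches` is a hypothetical helper and the sample input is made up.

```python
EXPECTED_SCHEMA = {
    "product_id": "string",
    "color": "string",
    "model_name": "string",
    "model_number": "string",
    "base_price": "double",
    "color_adj": "double",
    "size_adj": "double",
    "price": "double",
    "size": "string",
}


def schema_mismatches(dtypes, expected=EXPECTED_SCHEMA) -> list:
    """Compare (name, type) pairs with the expected spec and list the differences."""
    actual = dict(dtypes)
    problems = []
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(f"{name}: expected {dtype}, found {actual[name]}")
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


# Made-up example: every column matches except product_id, inferred as bigint.
sample_dtypes = [(name, dtype) for name, dtype in EXPECTED_SCHEMA.items()]
sample_dtypes[0] = ("product_id", "bigint")
print(schema_mismatches(sample_dtypes))
```

On a cluster the call would be `schema_mismatches(df3.dtypes)`; any column it reports could then be cast to the expected type before the write.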

In [0]:
# Load the dataset to the managed Delta table products (identified by the variable products_table)

df3.write.option("overwriteSchema", "true").saveAsTable(products_table, mode="overwrite")

### Reality Check #4.C
Run the following command to ensure that you are on track:

In [0]:
reality_check_04_c()

Points,Test,Result
1,The current database is dbacademy_cenz_wong_ekimetrics_com_developer_foundations_capstone,
1,The table products exists,
1,The table products is a managed table,
1,Using the Delta file format,
1,Schema is valid,
1,Expected 12 records,
1,Sample A of color_adj (valid values),
1,Sample B of color_adj (valid values),
1,Sample A of size_adj (valid values),
1,Sample B of size_adj (valid values),


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #4 - Final Check</h2>

Run the following command to make sure this exercise is complete:

In [0]:
reality_check_04_final()

Wrote 17 bytes.


Points,Test,Result
1,Reality Check 04.A passed,
1,Reality Check 04.B passed,
1,Reality Check 04.C passed,


&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>