
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Exercise #3 - Create Fact & Dim Tables

Now that the three years of orders are combined into a single dataset, we can begin the process of transforming the data.

Each record actually contains four sub-datasets:
* The order itself, which aggregates the other three datasets.
* The line items of each order, which include the price and quantity of each specific item.
* The sales rep placing the order.
* The customer placing the order - for the sake of simplicity, we will **not** break this dataset out and will leave it as part of the order.

Next, we want to extract all of that data into its respective datasets (except the customer data).

In other words, we want to normalize the data, in this case, to reduce data duplication.
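
To make the idea of normalization concrete, here is a minimal plain-Python sketch (no Spark involved) that splits flat rows into the three target datasets. The sample rows and the helper `split_records` are hypothetical and exist only for illustration; only the column names come from this exercise.

```python
# Hypothetical flat rows: one row per product ordered (all values are made up).
flat_rows = [
    {"order_id": "o1", "sales_rep_id": "r1", "sales_rep_first_name": "Ann",
     "product_id": "p1", "product_quantity": "2"},
    {"order_id": "o1", "sales_rep_id": "r1", "sales_rep_first_name": "Ann",
     "product_id": "p2", "product_quantity": "1"},
    {"order_id": "o2", "sales_rep_id": "r1", "sales_rep_first_name": "Ann",
     "product_id": "p1", "product_quantity": "5"},
]

def split_records(rows):
    """Split flat rows into sales reps, orders, and line items (illustrative helper)."""
    sales_reps, orders, line_items = {}, {}, []
    for row in rows:
        # Keyed dicts keep exactly one entry per sales rep and per order.
        sales_reps[row["sales_rep_id"]] = {
            "sales_rep_id": row["sales_rep_id"],
            "sales_rep_first_name": row["sales_rep_first_name"],
        }
        orders[row["order_id"]] = {
            "order_id": row["order_id"],
            "sales_rep_id": row["sales_rep_id"],
        }
        # Every flat row corresponds to exactly one line item.
        line_items.append({
            "order_id": row["order_id"],
            "product_id": row["product_id"],
            "product_quantity": int(row["product_quantity"]),
        })
    return list(sales_reps.values()), list(orders.values()), line_items

sales_reps, orders, line_items = split_records(flat_rows)
print(len(sales_reps), len(orders), len(line_items))  # 1 2 3
```

Three flat rows collapse into one sales rep and two orders, while the line items keep one entry per row. The rest of this exercise removes the same kind of duplication at scale.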

This exercise is broken up into 5 steps:
* Exercise 3.A - Create & Use Database
* Exercise 3.B - Load & Cache Batch Orders
* Exercise 3.C - Extract Sales Reps
* Exercise 3.D - Extract Orders
* Exercise 3.E - Extract Line Items

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Setup Exercise #3</h2>

To get started, run the following cell to set up this exercise, declaring exercise-specific variables and functions.

In [0]:
%run ./_includes/Setup-Exercise-03

Variable/Function,Value,Description
username,cenz.wong@ekimetrics.com,This is the email address that you signed into Databricks with
working_dir,dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone,This is the directory in which all work should be conducted
user_db,dbacademy_cenz_wong_ekimetrics_com_developer_foundations_capstone,The name of the database you will use for this project.
batch_source_path,dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/batch_orders_dirty.delta,"The location of the combined, raw, batch of orders."
orders_table,orders,The name of the orders table.


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #3.A - Create &amp; Use Database</h2>

By using a specific database, we can avoid contention over commonly named tables that may be in use by other users of the workspace.

**In this step you will need to:**
* Create the database identified by the variable **`user_db`**
* Use the database identified by the variable **`user_db`** so that any tables created in this notebook are **NOT** added to the **`default`** database

**Special Notes**
* Do not hard-code the database name - in some scenarios this will result in validation errors.
* For assistance with the SQL command to create a database, see <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-database.html" target="_blank">CREATE DATABASE</a> on the Databricks docs website.
* For assistance with the SQL command to use a database, see <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-usedb.html" target="_blank">USE DATABASE</a> on the Databricks docs website.
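
One way to avoid hard-coding the database name is to build both statements from a variable. The sketch below is plain Python; the helper `database_statements` and the placeholder name `my_capstone_db` are hypothetical, and wrapping the name in backticks is an assumption that guards against unusual characters in it. In the notebook, each resulting string could be passed to `spark.sql(...)`.

```python
def database_statements(db_name: str) -> list:
    """Return the CREATE and USE statements for db_name (illustrative helper)."""
    # Assumes the name itself contains no backticks.
    quoted = "`{}`".format(db_name)
    return [
        "CREATE DATABASE IF NOT EXISTS {}".format(quoted),
        "USE {}".format(quoted),
    ]

# "my_capstone_db" is a placeholder; in the notebook the value would come from user_db.
statements = database_statements("my_capstone_db")
for statement in statements:
    print(statement)
```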

### Implement Exercise #3.A

Implement your solution in the following cell:

In [0]:
# %sql
# CREATE DATABASE IF NOT EXISTS dbacademy_cenz_wong_ekimetrics_com_developer_foundations_capstone;
# USE dbacademy_cenz_wong_ekimetrics_com_developer_foundations_capstone;

In [0]:
# Spark Hive table operations
spark.sql("CREATE DATABASE IF NOT EXISTS {}".format(user_db))
spark.sql("USE {}".format(user_db))

Out[57]: DataFrame[]

### Reality Check #3.A
Run the following command to ensure that you are on track:

In [0]:
reality_check_03_a()

Points,Test,Result
1,Using DBR 9.1 & Proper Cluster Configuration,
1,Valid Registration ID,
1,The current database is dbacademy_cenz_wong_ekimetrics_com_developer_foundations_capstone,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #3.B - Load &amp; Cache Batch Orders</h2>

Next, we need to load the batch orders from the previous exercise and then cache them in preparation for transforming the data later in this exercise.

**In this step you will need to:**
* Load the delta dataset we created in the previous exercise, identified by the variable **`batch_source_path`**.
* Using that same dataset, create a temporary view identified by the variable **`batch_temp_view`**.
* Cache the temporary view.

### Implement Exercise #3.B

Implement your solution in the following cell:

In [0]:
batch_source_df = spark.read.format('delta').load(batch_source_path)
batch_source_df.createOrReplaceTempView(batch_temp_view)
batch_source_df.cache()

Out[59]: DataFrame[submitted_at: string, order_id: string, customer_id: string, sales_rep_id: string, sales_rep_ssn: string, sales_rep_first_name: string, sales_rep_last_name: string, sales_rep_address: string, sales_rep_city: string, sales_rep_state: string, sales_rep_zip: string, shipping_address_attention: string, shipping_address_address: string, shipping_address_city: string, shipping_address_state: string, shipping_address_zip: string, product_id: string, product_quantity: string, product_sold_price: string, ingest_file_name: string, ingested_at: timestamp]

### Reality Check #3.B
Run the following command to ensure that you are on track:

In [0]:
reality_check_03_b()

Points,Test,Result
1,The current database is dbacademy_cenz_wong_ekimetrics_com_developer_foundations_capstone,
1,The table batched_orders exists,
1,The table batched_orders is a temp view,
1,The table batched_orders is cached,
1,"Expected 1,175,870 records",


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #3.C - Extract Sales Reps</h2>

Our batched orders from Exercise #2 contain thousands of orders, and every order includes the name, SSN, address, and other information about the sales rep who made it.

We can use this data to create a table of just our sales reps.

If you consider that we have only ~100 sales reps, but thousands of orders, we are going to have a lot of duplicate data in this space.

Also unique to this dataset is the fact that social security numbers were not always sanitized: sometimes they were formatted with hyphens and sometimes they were not. This is something we will have to address here.
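
The SSN cleanup can be sketched in plain Python before it is expressed in Spark. The helper `clean_ssn` is hypothetical and the sample values are made up rather than taken from the dataset. It returns both the numeric SSN and a flag recording whether the raw value contained a hyphen.

```python
def clean_ssn(raw):
    """Return (ssn_as_int, had_hyphen) for a raw SSN string; None stays None."""
    if raw is None:
        return None, False
    had_hyphen = "-" in raw
    # Removing hyphens is harmless when there are none.
    return int(raw.replace("-", "")), had_hyphen

# Made-up sample values, not taken from the dataset.
print(clean_ssn("123-45-6789"))  # (123456789, True)
print(clean_ssn("123456789"))    # (123456789, False)
print(clean_ssn(None))           # (None, False)
```

One consequence of storing the SSN as a number is that any leading zeros are dropped, so the value would need to be zero-padded again if it is ever displayed.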

**In this step you will need to:**
* Load the table **`batched_orders`** (identified by the variable **`batch_temp_view`**)
* The SSNs have formatting errors that we want to track - add the **`boolean`** column **`_error_ssn_format`** - for any case where **`sales_rep_ssn`** has a hyphen in it, set this value to **`true`**, otherwise **`false`**
* Convert various columns from their string representation to the specified type:
  * The column **`sales_rep_ssn`** should be represented as a **`Long`** (Note: You will have to first clean the column by removing extraneous hyphens in some records)
  * The column **`sales_rep_zip`** should be represented as an **`Integer`**
* Remove the columns not directly related to the sales-rep record:
  * Unrelated ID columns: **`submitted_at`**, **`order_id`**, **`customer_id`**
  * Shipping address columns: **`shipping_address_attention`**, **`shipping_address_address`**, **`shipping_address_city`**, **`shipping_address_state`**, **`shipping_address_zip`**
  * Product columns: **`product_id`**, **`product_quantity`**, **`product_sold_price`**
* Because there is one record per product ordered (many products per order), not to mention one sales rep placing many orders (many orders per sales rep), there will be duplicate records for our sales reps. Remove all duplicate records, making sure to exclude **`ingest_file_name`** and **`ingested_at`** from the evaluation of duplicate records
* Load the dataset to the managed delta table **`sales_rep_scd`** (identified by the variable **`sales_reps_table`**)

**Additional Requirements:**<br/>
The schema for the **`sales_rep_scd`** table must be:
* **`sales_rep_id`**:**`string`**
* **`sales_rep_ssn`**:**`long`**
* **`sales_rep_first_name`**:**`string`**
* **`sales_rep_last_name`**:**`string`**
* **`sales_rep_address`**:**`string`**
* **`sales_rep_city`**:**`string`**
* **`sales_rep_state`**:**`string`**
* **`sales_rep_zip`**:**`integer`**
* **`ingest_file_name`**:**`string`**
* **`ingested_at`**:**`timestamp`**
* **`_error_ssn_format`**:**`boolean`**
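
The duplicate-removal rule can be sketched in plain Python: two records count as duplicates when they agree on every column except `ingest_file_name` and `ingested_at`. The helper `drop_duplicates_ignoring` and the sample rows are hypothetical. The exercise does not say which of the duplicates should survive; this sketch simply keeps the first one it encounters.

```python
IGNORED = {"ingest_file_name", "ingested_at"}

def drop_duplicates_ignoring(rows, ignored=IGNORED):
    """Keep the first row for each distinct combination of non-ignored columns."""
    seen = set()
    unique = []
    for row in rows:
        # Build a hashable key from every column that takes part in the comparison.
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignored))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up rows: the first two differ only in an ignored column.
rows = [
    {"sales_rep_id": "r1", "sales_rep_zip": 74546, "ingest_file_name": "2017.txt"},
    {"sales_rep_id": "r1", "sales_rep_zip": 74546, "ingest_file_name": "2018.csv"},
    {"sales_rep_id": "r2", "sales_rep_zip": 93774, "ingest_file_name": "2018.csv"},
]
unique_rows = drop_duplicates_ignoring(rows)
print([r["sales_rep_id"] for r in unique_rows])  # ['r1', 'r2']
```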

### Implement Exercise #3.C

Implement your solution in the following cell:

In [0]:
batch_temp = spark.read.table(batch_temp_view)

In [0]:
# The SSNs have formatting errors that we want to track - add the boolean column _error_ssn_format - for any case where sales_rep_ssn has a hyphen in it, set this value to true, otherwise false

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

@udf(returnType=BooleanType())
def toSalesRepBool(line: str) -> bool:
  # True only when the SSN is present and contains a hyphen; nulls count as "no error"
  if line is not None:
    if "-" in line:
      return True
  return False

batch_temp01 = batch_temp.withColumn('_error_ssn_format', toSalesRepBool(batch_temp.sales_rep_ssn))



In [0]:
# Convert various columns from their string representation to the specified type:
# The column sales_rep_ssn should be represented as a Long (Note: You will have to first clean the column by removing extraneous hyphens in some records)
# The column sales_rep_zip should be represented as an Integer

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType, IntegerType

@udf(returnType=LongType())
def sales_rep_ssn_to_long(line: str) -> int:
  # Strip any hyphens before parsing; nulls pass through unchanged
  if line is not None:
    return int(line.replace("-", ""))
  return None

batch_temp02 = batch_temp01.withColumn('sales_rep_ssn', sales_rep_ssn_to_long(batch_temp.sales_rep_ssn))

In [0]:
@udf(returnType=IntegerType())
def sales_rep_zip_to_int(line: str) -> int:
  # Parse the zip code string as an integer; nulls pass through unchanged
  if line is not None:
      return int(line)
  return None

batch_temp03 = batch_temp02.withColumn('sales_rep_zip', sales_rep_zip_to_int(batch_temp.sales_rep_zip))

In [0]:
# Remove the columns not directly related to the sales-rep record:
# Unrelated ID columns: submitted_at, order_id, customer_id
# Shipping address columns: shipping_address_attention, shipping_address_address, shipping_address_city, shipping_address_state, shipping_address_zip
# Product columns: product_id, product_quantity, product_sold_price

batch_temp04 = batch_temp03.drop(
  "submitted_at", "order_id", "customer_id",
  "shipping_address_attention", "shipping_address_address", "shipping_address_city", "shipping_address_state", "shipping_address_zip",
  "product_id", "product_quantity", "product_sold_price"
)


In [0]:
# Remove all duplicate records, making sure to exclude ingest_file_name and ingested_at from the evaluation of duplicate records

batch_final = batch_temp04.dropDuplicates([c for c in batch_temp04.columns if c not in {'ingest_file_name', 'ingested_at'}])

In [0]:
batch_final.write.saveAsTable(sales_reps_table, mode="overwrite")

### Reality Check #3.C
Run the following command to ensure that you are on track:

In [0]:
reality_check_03_c()

Points,Test,Result
1,The current database is dbacademy_cenz_wong_ekimetrics_com_developer_foundations_capstone,
1,The table sales_reps exists,
1,The table sales_reps is a managed table,
1,Using the Delta file format,
1,Schema is valid,
1,Expected 93 records,
1,Expected _error_ssn_format record count to be 17,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #3.D - Extract Orders</h2>

Our batched orders from Exercise #2 contain one line per product, meaning there are multiple records per order.

The goal of this step is to extract just the order details (excluding the sales rep and line items).

**In this step you will need to:**
* Load the table **`batched_orders`** (identified by the variable **`batch_temp_view`**)
* Convert various columns from their string representation to the specified type:
  * The column **`submitted_at`** is a "unix epoch" (number of seconds since 1970-01-01 00:00:00 UTC) and should be represented as a **`Timestamp`**
  * The column **`shipping_address_zip`** should be represented as an **`Integer`**
* Remove the columns not directly related to the order record:
  * Sales reps columns: **`sales_rep_ssn`**, **`sales_rep_first_name`**, **`sales_rep_last_name`**, **`sales_rep_address`**, **`sales_rep_city`**, **`sales_rep_state`**, **`sales_rep_zip`**
  * Product columns: **`product_id`**, **`product_quantity`**, **`product_sold_price`**
* Because there is one record per product ordered (many products per order), there will be duplicate records for each order. Remove all duplicate records, making sure to exclude **`ingest_file_name`** and **`ingested_at`** from the evaluation of duplicate records
* Add the column **`submitted_yyyy_mm`** which is a **`string`** derived from **`submitted_at`** and is formatted as "**yyyy-MM**".
* Load the dataset to the managed delta table **`orders`** (identified by the variable **`orders_table`**)
  * In this case, the data must also be partitioned by **`submitted_yyyy_mm`**

**Additional Requirements:**
* The schema for the **`orders`** table must be:
  * **`submitted_at:timestamp`**
  * **`submitted_yyyy_mm`** using the format "**yyyy-MM**"
  * **`order_id:string`**
  * **`customer_id:string`**
  * **`sales_rep_id:string`**
  * **`shipping_address_attention:string`**
  * **`shipping_address_address:string`**
  * **`shipping_address_city:string`**
  * **`shipping_address_state:string`**
  * **`shipping_address_zip:integer`**
  * **`ingest_file_name:string`**
  * **`ingested_at:timestamp`**
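
The two date-related conversions can be sketched in plain Python with the standard `datetime` module. The helpers `parse_submitted_at` and `to_yyyy_mm` are hypothetical, and the epoch value below was chosen for illustration rather than taken from the dataset. The `strftime` pattern `%Y-%m` plays the role of the "yyyy-MM" format named above.

```python
from datetime import datetime, timezone

def parse_submitted_at(raw):
    """Convert a unix-epoch string (seconds) to a UTC datetime; None stays None."""
    if raw is None:
        return None
    return datetime.fromtimestamp(int(raw), tz=timezone.utc)

def to_yyyy_mm(ts):
    """Format a datetime as year and zero-padded month, e.g. 2019-07."""
    return None if ts is None else ts.strftime("%Y-%m")

# Illustrative epoch value (seconds since 1970-01-01 00:00:00 UTC).
submitted_at = parse_submitted_at("1564329600")
print(submitted_at.isoformat())  # 2019-07-28T16:00:00+00:00
print(to_yyyy_mm(submitted_at))  # 2019-07
```

Because the partition column is derived from the timestamp, it is worth deriving it only after the timestamp has been parsed successfully; a null timestamp yields a null partition value in this sketch.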

### Implement Exercise #3.D

Implement your solution in the following cell:

In [0]:
batch_temp = spark.read.table(batch_temp_view)

In [0]:
import datetime

datetime.datetime.now()

Out[70]: datetime.datetime(2022, 1, 6, 3, 18, 3, 965834)

In [0]:
# Convert various columns from their string representation to the specified type:
# The column submitted_at is a "unix epoch" (number of seconds since 1970-01-01 00:00:00 UTC) and should be represented as a Timestamp
# The column shipping_address_zip should be represented as an Integer

from pyspark.sql.functions import from_unixtime, to_timestamp, udf
from pyspark.sql.types import IntegerType

# from_unixtime() renders the epoch seconds as a 'yyyy-MM-dd HH:mm:ss' string,
# which to_timestamp() then parses into a timestamp
batch_temp01 = batch_temp.withColumn('submitted_at', to_timestamp(from_unixtime('submitted_at'), 'yyyy-MM-dd HH:mm:ss'))

@udf(returnType=IntegerType())
def shipping_address_zip_to_int(line: str) -> int:
  # Parse the zip code string as an integer; nulls pass through unchanged
  if line is not None:
      return int(line)
  return None

batch_temp02 = batch_temp01.withColumn('shipping_address_zip', shipping_address_zip_to_int(batch_temp.shipping_address_zip))

In [0]:
batch_temp03 = batch_temp02.drop(
  "sales_rep_ssn", "sales_rep_first_name", "sales_rep_last_name", "sales_rep_address", "sales_rep_city", "sales_rep_state", "sales_rep_zip", 
  "product_id", "product_quantity", "product_sold_price"
)

In [0]:
batch_final = batch_temp03.dropDuplicates([c for c in batch_temp03.columns if c not in {'ingest_file_name', 'ingested_at'}])

In [0]:
# Add the column submitted_yyyy_mm which is a string derived from submitted_at and is formatted as "yyyy-MM".
from pyspark.sql.functions import date_format

batch_final_yyyy_mm = batch_final.select("*", date_format(batch_final.submitted_at, "yyyy-MM").alias("submitted_yyyy_mm"))

In [0]:
# Load the dataset to the managed delta table orders (identified by the variable orders_table)
# In this case, the data must also be partitioned by submitted_yyyy_mm

batch_final_yyyy_mm.write.saveAsTable(orders_table, mode="overwrite", partitionBy="submitted_yyyy_mm")

In [0]:
batch_final_yyyy_mm.display()
batch_final_yyyy_mm.printSchema()

root
 |-- submitted_at: timestamp (nullable = true)
 |-- order_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- sales_rep_id: string (nullable = true)
 |-- shipping_address_attention: string (nullable = true)
 |-- shipping_address_address: string (nullable = true)
 |-- shipping_address_city: string (nullable = true)
 |-- shipping_address_state: string (nullable = true)
 |-- shipping_address_zip: integer (nullable = true)
 |-- ingest_file_name: string (nullable = true)
 |-- ingested_at: timestamp (nullable = true)
 |-- submitted_yyyy_mm: string (nullable = true)



submitted_at,order_id,customer_id,sales_rep_id,shipping_address_attention,shipping_address_address,shipping_address_city,shipping_address_state,shipping_address_zip,ingest_file_name,ingested_at,submitted_yyyy_mm
2019-07-28T16:00:00.000+0000,0000cdba-66ef-4cf9-a07e-7da7da6b6dcc,1f346308-f3b5-412a-9022-c0cda208debc,92f62533-fec8-49c6-8661-f463bb6e095d,Mary Workman,942 N Wren Street,Broken Arrow,OK,74546,dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/raw/orders/batch/2019.csv,2022-01-06T03:05:17.820+0000,2019-07
2017-07-22T13:00:00.000+0000,0003ecad-6827-4ad3-a80d-fbeb1569d2cb,00b54fb1-a8cf-471d-869b-e368a77a5f2f,4a888d8f-afe0-4ba3-b76d-08f25e0dea27,Waverly Estrada,898 Dalecroft Trail W,Rancho Cucamonga,CA,93774,dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/raw/orders/batch/2017.txt,2022-01-06T02:59:41.232+0000,2017-07
2018-08-27T10:00:00.000+0000,000410cd-e8b5-4137-815b-255d7c9828cf,b79db4ce-e911-4e10-a87f-baca8869862c,761b3d8b-a96c-42fa-ba22-a0017621cbea,Alexander Graves,736 N Hampton Lane,Oceanside,CA,95979,dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/raw/orders/batch/2018.csv,2022-01-06T03:02:26.836+0000,2018-08
2017-05-01T17:00:00.000+0000,0008c4b7-3eab-4805-95f3-497c7e11f27a,fb514f0e-8535-4cb6-acdb-424c732f0062,af0f3842-846d-4e3f-9763-049978937827,Davis Garner,109 Wales Plaza E,Roseville,CA,92280,dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/raw/orders/batch/2017.txt,2022-01-06T02:59:41.232+0000,2017-05
2019-01-28T06:00:00.000+0000,000a765a-38fe-4e19-93e6-2dc552f84eef,f07e0e28-803a-492d-b6b3-e5382044e1ae,4898862a-68d3-43e5-8627-caf6933bdaec,Tanner Gross,456 Ravenel Court,Gainesville,FL,32936,dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/raw/orders/batch/2019.csv,2022-01-06T03:05:17.820+0000,2019-01
2019-04-04T06:00:00.000+0000,0013c783-52d3-4b6b-b6a4-0f985c4abacd,c3faf37b-f2f8-4b95-88d4-0208a69b4ba6,4afd2259-f7e8-4d2a-8643-33d68c4b5424,Malachi Crane,298 Burbank Lane E,Coral Springs,FL,33930,dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/raw/orders/batch/2019.csv,2022-01-06T03:05:17.820+0000,2019-04
2017-07-20T17:00:00.000+0000,0015677d-b05b-49c7-a9b0-d53abd852b49,80592e9d-b11e-487f-9565-0b7f9af10f9c,95a919b4-5c80-4242-bf0e-66e3443ee504,Saoirse Lee,395 Rockville Place,Lewisville,TX,78629,dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/raw/orders/batch/2017.txt,2022-01-06T02:59:41.232+0000,2017-07
2019-11-05T18:00:00.000+0000,001606e2-84a0-4ba7-9576-84f62d162767,eb151412-bf17-4a44-8c99-3191d7363389,6433aa8c-b6dd-47fd-afe4-c44b2f3b46d6,Averi Haynes,311 Rebusmen Road N,Columbus,GA,31198,dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/raw/orders/batch/2019.csv,2022-01-06T03:05:17.820+0000,2019-11
2018-04-25T16:00:00.000+0000,00168146-ed15-4d18-98b6-d0a65e418035,d1422979-3a3c-42c5-aefc-e0d124452f4c,de37844d-425e-47d0-a996-f02d070449ff,Gerald Alvarado,379 Deskin Lane,Renton,WA,99204,dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/raw/orders/batch/2018.csv,2022-01-06T03:02:26.836+0000,2018-04
2017-12-20T09:00:00.000+0000,0016ba14-4689-46b6-a781-afbbeb27e8f2,d7a79fe6-fc42-465f-a34f-e665eb25d34f,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,Giana Foley,3 E Whitney Way,Westminster,CO,81043,dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/raw/orders/batch/2017.txt,2022-01-06T02:59:41.232+0000,2017-12


### Reality Check #3.D
Run the following command to ensure that you are on track:

In [0]:
reality_check_03_d()

Points,Test,Result
1,The current database is dbacademy_cenz_wong_ekimetrics_com_developer_foundations_capstone,
1,The table orders exists,
1,The table orders is a managed table,
1,Using the Delta file format,
1,Schema is valid,
1,"Expected 195,698 records",
1,Non-null (properly parsed) submitted_at,
1,Partitioned by submitted_yyyy_mm,
1,Found 36 partitions,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #3.E - Extract Line Items</h2>

Now that we have extracted sales reps and orders, we next want to extract the specific line items of each order.

**In this step you will need to:**
* Load the table **`batched_orders`** (identified by the variable **`batch_temp_view`**)
* Retain the following columns (see schema below)
  * The correlating ID columns: **`order_id`** and **`product_id`**
  * The two product-specific columns: **`product_quantity`** and **`product_sold_price`**
  * The two ingest columns: **`ingest_file_name`** and **`ingested_at`**
* Convert various columns from their string representation to the specified type:
  * The column **`product_quantity`** should be represented as an **`Integer`**
  * The column **`product_sold_price`** should be represented as a **`Decimal`** with two decimal places as in **`decimal(10,2)`**
* Load the dataset to the managed delta table **`line_items`** (identified by the variable **`line_items_table`**)

**Additional Requirements:**
* The schema for the **`line_items`** table must be:
  * **`order_id`**:**`string`**
  * **`product_id`**:**`string`**
  * **`product_quantity`**:**`integer`**
  * **`product_sold_price`**:**`decimal(10,2)`**
  * **`ingest_file_name`**:**`string`**
  * **`ingested_at`**:**`timestamp`**
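
The two type conversions can be sketched in plain Python with the standard `decimal` module. The helpers `to_quantity` and `to_price` are hypothetical and the sample values are made up. The half-up rounding mode is an assumption about how values with more than two decimal places should be handled, and the sketch does not enforce the overall precision of 10 digits.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_quantity(raw):
    """Parse a quantity string as an integer; None stays None."""
    return None if raw is None else int(raw)

def to_price(raw):
    """Parse a price string as a Decimal with exactly two decimal places."""
    if raw is None:
        return None
    # Assumption: round half up when the input has more than two decimal places.
    return Decimal(raw).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Made-up sample values.
print(to_quantity("3"))   # 3
print(to_price("19.5"))   # 19.50
print(to_price("7.125"))  # 7.13
```

A decimal type is preferable to a floating-point type for prices because it represents values such as 19.50 exactly.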

### Implement Exercise #3.E

Implement your solution in the following cell:

In [0]:
batch_temp = spark.read.table(batch_temp_view)

In [0]:
batch_temp01 = batch_temp.select("order_id", "product_id","product_quantity", "product_sold_price", "ingest_file_name", "ingested_at")

In [0]:
# Convert various columns from their string representation to the specified type:
# The column product_quantity should be represented as an Integer
# The column product_sold_price should be represented as a Decimal with two decimal places as in decimal(10,2)

from pyspark.sql.types import IntegerType, DecimalType

batch_temp02 = batch_temp01.withColumn('product_quantity', batch_temp.product_quantity.cast(IntegerType()))
batch_temp02 = batch_temp02.withColumn('product_sold_price', batch_temp.product_sold_price.cast(DecimalType(10,2)))

In [0]:
# Load the dataset to the managed delta table line_items (identified by the variable line_items_table)

batch_temp02.write.saveAsTable(line_items_table, mode="overwrite")


In [0]:
batch_temp02.schema

Out[82]: StructType(List(StructField(order_id,StringType,true),StructField(product_id,StringType,true),StructField(product_quantity,IntegerType,true),StructField(product_sold_price,DecimalType(10,2),true),StructField(ingest_file_name,StringType,true),StructField(ingested_at,TimestampType,true)))

### Reality Check #3.E
Run the following command to ensure that you are on track:

In [0]:
reality_check_03_e()

Points,Test,Result
1,The current database is dbacademy_cenz_wong_ekimetrics_com_developer_foundations_capstone,
1,The table line_items exists,
1,The table line_items is a managed table,
1,Using the Delta file format,
1,Schema is valid,
1,"Expected 1,175,870 records",


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #3 - Final Check</h2>

Run the following command to make sure this exercise is complete:

In [0]:
reality_check_03_final()

Wrote 17 bytes.


Points,Test,Result
1,Reality Check 03.A passed,
1,Reality Check 03.B passed,
1,Reality Check 03.C passed,
1,Reality Check 03.D passed,
1,Reality Check 03.E passed,


&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>