
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Exercise #2 - Batch Ingestion

In this exercise you will ingest three batches of orders, one each for 2017, 2018, and 2019.

As each batch is ingested, we are going to append it to a new Delta table, unifying all the datasets into one single dataset.

Each year, different individuals and different standards were used, resulting in datasets that vary slightly:
* In 2017 the backup was written as fixed-width text files
* In 2018 the backup was written as tab-separated text files
* In 2019 the backup was written as "standard" comma-separated text files, but the format of the column names was changed

Our only goal here is to unify all the datasets while tracking the source of each record (ingested file name and ingested timestamp) should additional problems arise.

Because we are only concerned with ingestion at this stage, the majority of the columns will be ingested as simple strings. Future exercises will address this issue (and others) with various transformations.

As you progress, several "reality checks" will be provided to help ensure that you are on track. Simply run the corresponding command after implementing each solution.

This exercise is broken up into 3 steps:
* Exercise 2.A - Ingest Fixed-Width File
* Exercise 2.B - Ingest Tab-Separated File
* Exercise 2.C - Ingest Comma-Separated File

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Setup Exercise #2</h2>

To get started, we first need to configure your Registration ID and then run the setup notebook.

### Setup - Registration ID

In the next command, please update the variable **`registration_id`** with the Registration ID you received when you signed up for this project.

For more information, see [Registration ID]($./Registration ID)

In [0]:
registration_id = "3203488"

### Setup - Run the exercise setup

Run the following cell to set up this exercise, declaring exercise-specific variables and functions.

In [0]:
%run ./_includes/Setup-Exercise-02

Variable/Function,Value,Description
username,andrew.barry@infinitive.com,This is the email address that you signed into Databricks with
working_dir,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone,This is the directory in which all work should be conducted
batch_2017_path,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,The path to the 2017 batch of orders
batch_2018_path,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,The path to the 2018 batch of orders
batch_2019_path,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,The path to the 2019 batch of orders


Run the following cell to preview a list of the files you will be processing in this exercise.

In [0]:
files = dbutils.fs.ls(working_dir) # List all the files
display(files)                                           # Display the list of files

path,name,size
dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/,raw/,0


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #2.A - Ingest Fixed-Width File</h2>

**In this step you will need to:**
1. Use the variable **`batch_2017_path`** and **`dbutils.fs.head`** to investigate the 2017 batch file, if needed.
2. Configure a **`DataFrameReader`** to ingest the text file identified by **`batch_2017_path`** - this should provide one record per line, with a single column named **`value`**
3. Using the information in **`fixed_width_column_defs`** (or the dictionary itself) use the **`value`** column to extract each new column of the appropriate length.<br/>
  * The dictionary's key is the column name
  * The first element in the dictionary's value is the starting position of that column's data
  * The second element in the dictionary's value is the length of that column's data
4. Once you are done with the **`value`** column, remove it.
5. For each new column created in step #3, remove any leading whitespace
  * The introduction of \[leading\] white space should be expected when extracting fixed-width values out of the **`value`** column.
6. For each new column created in step #3, replace all empty strings with **`null`**.
  * After trimming white space, any column for which a value was not specified in the original dataset should result in an empty string.
7. Add a new column, **`ingest_file_name`**, which is the name of the file from which the data was read.
  * This should not be hard coded.
  * For the proper function, see the <a href="https://spark.apache.org/docs/latest/api/python/index.html" target="_blank">pyspark.sql.functions</a> module
8. Add a new column, **`ingested_at`**, which is a timestamp of when the data was ingested as a DataFrame.
  * This should not be hard coded.
  * For the proper function, see the <a href="https://spark.apache.org/docs/latest/api/python/index.html" target="_blank">pyspark.sql.functions</a> module
9. Write the corresponding **`DataFrame`** in the "delta" format to the location specified by **`batch_target_path`**
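
Steps 3 through 6 can be tried on a single line in plain Python before moving to Spark. The sketch below is only an illustration: the two-column layout and the sample line are made up, and `extract_fields` is a hypothetical helper, not part of the exercise. It assumes that start positions are 1-based, as the values in `fixed_width_column_defs` suggest.

```python
from typing import Dict, Optional, Tuple

# Hypothetical two-column layout: name -> (1-based start, length)
sample_defs: Dict[str, Tuple[int, int]] = {
    "code": (1, 5),
    "note": (6, 8),
}

def extract_fields(line: str, defs: Dict[str, Tuple[int, int]]) -> Dict[str, Optional[str]]:
    """Slice each field out of a fixed-width line, trim it, and map '' to None."""
    record: Dict[str, Optional[str]] = {}
    for name, (start, length) in defs.items():
        raw = line[start - 1 : start - 1 + length]   # 1-based start -> 0-based slice
        trimmed = raw.strip()                         # step 5: remove the padding
        record[name] = trimmed if trimmed else None   # step 6: empty string -> None
    return record

# "code" is right-aligned in 5 characters; "note" is 8 blanks (no value supplied)
sample_line = "42".rjust(5) + " " * 8
print(extract_fields(sample_line, sample_defs))  # {'code': '42', 'note': None}
```

In Spark the same three operations are expressed per column rather than per line, but the order (slice, trim, then replace empty strings) stays the same.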

**Special Notes:**
* It is possible to use the dictionary **`fixed_width_column_defs`** and programmatically extract <br/>
  each column, but it is also perfectly OK to hard code this step and extract one column at a time.
* The **`SparkSession`** is already provided to you as an instance of **`spark`**.
* The classes/methods that you will need for this exercise include:
  * **`pyspark.sql.DataFrameReader`** to ingest data
  * **`pyspark.sql.DataFrameWriter`** to write data
  * **`pyspark.sql.Column`** to transform data
  * Various functions from the **`pyspark.sql.functions`** module
  * Various transformations and actions from **`pyspark.sql.DataFrame`**
* The following methods can be used to investigate and manipulate the Databricks File System (DBFS)
  * **`dbutils.fs.ls(..)`** for listing files
  * **`dbutils.fs.rm(..)`** for removing files
  * **`dbutils.fs.head(..)`** to view the first N bytes of a file
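
One possible way to take the programmatic route is to generate one SQL expression per dictionary entry and pass them all to `DataFrame.selectExpr`. The following is a sketch under assumptions, not the course's reference solution: `build_select_exprs` and `extract_columns` are hypothetical names, `sample_defs` is a three-entry stand-in for `fixed_width_column_defs`, and `raw_df` is assumed to be a DataFrame with a single `value` column.

```python
from typing import Dict, List, Tuple

# Stand-in for fixed_width_column_defs: name -> (1-based start, length)
sample_defs: Dict[str, Tuple[int, int]] = {
    "submitted_at": (1, 15),
    "order_id": (16, 40),
    "customer_id": (56, 40),
}

def build_select_exprs(defs: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return one Spark SQL expression per column: slice, trim, '' -> NULL."""
    return [
        f"nullif(trim(substring(value, {start}, {length})), '') AS {name}"
        for name, (start, length) in defs.items()
    ]

def extract_columns(raw_df, defs: Dict[str, Tuple[int, int]]):
    """Apply the generated expressions and add the two tracking columns.

    Selecting only the new columns also drops the original `value` column.
    """
    return raw_df.selectExpr(
        *build_select_exprs(defs),
        "input_file_name() AS ingest_file_name",
        "current_timestamp() AS ingested_at",
    )

# Only the expression strings are printed here; no Spark session is needed for that
for expr in build_select_exprs(sample_defs):
    print(expr)
```

In the notebook, `extract_columns(spark.read.text(batch_2017_path), fixed_width_column_defs)` would be the intended call. Hard-coding one `withColumn` per column is equally acceptable.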

**Additional Requirements:**
* The unified batch dataset must be written to disk in the "delta" format
* The schema for the unified batch dataset must be:
  * **`submitted_at`**:**`string`**
  * **`order_id`**:**`string`**
  * **`customer_id`**:**`string`**
  * **`sales_rep_id`**:**`string`**
  * **`sales_rep_ssn`**:**`string`**
  * **`sales_rep_first_name`**:**`string`**
  * **`sales_rep_last_name`**:**`string`**
  * **`sales_rep_address`**:**`string`**
  * **`sales_rep_city`**:**`string`**
  * **`sales_rep_state`**:**`string`**
  * **`sales_rep_zip`**:**`string`**
  * **`shipping_address_attention`**:**`string`**
  * **`shipping_address_address`**:**`string`**
  * **`shipping_address_city`**:**`string`**
  * **`shipping_address_state`**:**`string`**
  * **`shipping_address_zip`**:**`string`**
  * **`product_id`**:**`string`**
  * **`product_quantity`**:**`string`**
  * **`product_sold_price`**:**`string`**
  * **`ingest_file_name`**:**`string`**
  * **`ingested_at`**:**`timestamp`**
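
Before writing, a DataFrame can be compared against this schema by diffing its `dtypes` (a list of `(name, type)` pairs) against the expected list. The helper below is a hypothetical convenience, not one of the exercise's reality checks, and the `actual` list is made up to show a mismatch. Note that this sketch also compares column order, which may be stricter than the exercise requires.

```python
# The 20 string columns from the list above, in order
string_columns = [
    "submitted_at", "order_id", "customer_id",
    "sales_rep_id", "sales_rep_ssn", "sales_rep_first_name", "sales_rep_last_name",
    "sales_rep_address", "sales_rep_city", "sales_rep_state", "sales_rep_zip",
    "shipping_address_attention", "shipping_address_address", "shipping_address_city",
    "shipping_address_state", "shipping_address_zip",
    "product_id", "product_quantity", "product_sold_price",
    "ingest_file_name",
]
expected_schema = [(name, "string") for name in string_columns] + [("ingested_at", "timestamp")]

def schema_mismatches(actual, expected):
    """List readable differences between two (name, type) lists, position by position."""
    problems = []
    if len(actual) != len(expected):
        problems.append(f"expected {len(expected)} columns, found {len(actual)}")
    for (a_name, a_type), (e_name, e_type) in zip(actual, expected):
        if (a_name, a_type) != (e_name, e_type):
            problems.append(f"expected {e_name}:{e_type}, found {a_name}:{a_type}")
    return problems

# Made-up example: the last column was left as a string instead of a timestamp
actual = expected_schema[:-1] + [("ingested_at", "string")]
print(schema_mismatches(actual, expected_schema))
# ['expected ingested_at:timestamp, found ingested_at:string']
```

In the notebook the call would look like `schema_mismatches(df_2017.dtypes, expected_schema)`, where an empty list means the names, order, and types line up.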

### Fixed-Width Meta Data 

The following dictionary is provided for reference and/or implementation<br/>
(depending on which strategy you choose to employ).

Run the following cell to instantiate it.

In [0]:
fixed_width_column_defs = {
  "submitted_at": (1, 15),
  "order_id": (16, 40),
  "customer_id": (56, 40),
  "sales_rep_id": (96, 40),
  "sales_rep_ssn": (136, 15),
  "sales_rep_first_name": (151, 15),
  "sales_rep_last_name": (166, 15),
  "sales_rep_address": (181, 40),
  "sales_rep_city": (221, 20),
  "sales_rep_state": (241, 2),
  "sales_rep_zip": (243, 5),
  "shipping_address_attention": (248, 30),
  "shipping_address_address": (278, 40),
  "shipping_address_city": (318, 20),
  "shipping_address_state": (338, 2),
  "shipping_address_zip": (340, 5),
  "product_id": (345, 40),
  "product_quantity": (385, 5),
  "product_sold_price": (390, 20)
}
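
Because each entry is a `(start, length)` pair, the definitions can be checked for gaps or overlaps: each column should begin where the previous one ends. The check below is a small sketch, and the helper name `find_gaps` is hypothetical. It repeats the dictionary so that it runs on its own, and it assumes the entries are listed in positional order with 1-based starts.

```python
fixed_width_column_defs = {
    "submitted_at": (1, 15),
    "order_id": (16, 40),
    "customer_id": (56, 40),
    "sales_rep_id": (96, 40),
    "sales_rep_ssn": (136, 15),
    "sales_rep_first_name": (151, 15),
    "sales_rep_last_name": (166, 15),
    "sales_rep_address": (181, 40),
    "sales_rep_city": (221, 20),
    "sales_rep_state": (241, 2),
    "sales_rep_zip": (243, 5),
    "shipping_address_attention": (248, 30),
    "shipping_address_address": (278, 40),
    "shipping_address_city": (318, 20),
    "shipping_address_state": (338, 2),
    "shipping_address_zip": (340, 5),
    "product_id": (345, 40),
    "product_quantity": (385, 5),
    "product_sold_price": (390, 20),
}

def find_gaps(defs):
    """Return (column, expected_start, actual_start) for every column that
    does not begin immediately after the previous one."""
    issues = []
    expected_start = 1  # positions are 1-based
    for name, (start, length) in defs.items():
        if start != expected_start:
            issues.append((name, expected_start, start))
        expected_start = start + length
    return issues

# Width implied by the last column: its start plus its length, minus one
last_start, last_length = list(fixed_width_column_defs.values())[-1]
record_width = last_start + last_length - 1

print(find_gaps(fixed_width_column_defs))  # []
print(record_width)                        # 409
```

An empty list means the columns are contiguous, and the definitions imply a record width of 409 characters.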

### Implement Exercise #2.A

Implement your solution in the following cell:

In [0]:
dbutils.fs.head(batch_2017_path)

In [0]:
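# spark.read is already a DataFrameReader, so no extra import is needed.
# .text() yields one row per line, in a single string column named "value".
raw_2017_df = spark.read.text(batch_2017_path)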

display(raw_2017_df)

value
1504263600 0002589b-d84c-467b-a7b1-de4342812f75 2ac6fe34-26c8-4760-945c-f8b35eb12795 09e2ca9b-a241-4f63-a6a1-5bf63aeeb870 446912278 Cayson Wiggins 607 S Woodridge Drive VacavilleCA95851 Lena May 677 Red Hill Road W ChicagoIL61729 7a41323a-560f-4e34-aba6-995e2325f95e 300 87.50
1504263600 0002589b-d84c-467b-a7b1-de4342812f75 2ac6fe34-26c8-4760-945c-f8b35eb12795 09e2ca9b-a241-4f63-a6a1-5bf63aeeb870 446912278 Cayson Wiggins 607 S Woodridge Drive VacavilleCA95851 Lena May 677 Red Hill Road W ChicagoIL61729 8d809e13-fdc5-4d15-9271-953750f6d592 800 97.23
1504263600 0002589b-d84c-467b-a7b1-de4342812f75 2ac6fe34-26c8-4760-945c-f8b35eb12795 09e2ca9b-a241-4f63-a6a1-5bf63aeeb870 446912278 Cayson Wiggins 607 S Woodridge Drive VacavilleCA95851 Lena May 677 Red Hill Road W ChicagoIL61729 95cbadca-cf90-4b8a-a134-2976f6ba6df8 800 92.37
1504263600 0002589b-d84c-467b-a7b1-de4342812f75 2ac6fe34-26c8-4760-945c-f8b35eb12795 09e2ca9b-a241-4f63-a6a1-5bf63aeeb870 446912278 Cayson Wiggins 607 S Woodridge Drive VacavilleCA95851 Lena May 677 Red Hill Road W ChicagoIL61729 a8fbcfea-4352-4c5a-af8b-c8623258b4f8 200 96.25
1504263600 0002589b-d84c-467b-a7b1-de4342812f75 2ac6fe34-26c8-4760-945c-f8b35eb12795 09e2ca9b-a241-4f63-a6a1-5bf63aeeb870 446912278 Cayson Wiggins 607 S Woodridge Drive VacavilleCA95851 Lena May 677 Red Hill Road W ChicagoIL61729 a990d79b-4957-42fc-8e42-20ceb1fd1259 600 106.95
1504263600 0002589b-d84c-467b-a7b1-de4342812f75 2ac6fe34-26c8-4760-945c-f8b35eb12795 09e2ca9b-a241-4f63-a6a1-5bf63aeeb870 446912278 Cayson Wiggins 607 S Woodridge Drive VacavilleCA95851 Lena May 677 Red Hill Road W ChicagoIL61729 bc93ed89-bb15-4e46-a110-a5878e46ccf6 200 87.50
1504263600 0002589b-d84c-467b-a7b1-de4342812f75 2ac6fe34-26c8-4760-945c-f8b35eb12795 09e2ca9b-a241-4f63-a6a1-5bf63aeeb870 446912278 Cayson Wiggins 607 S Woodridge Drive VacavilleCA95851 Lena May 677 Red Hill Road W ChicagoIL61729 e26839a2-44fd-4003-a06b-faf6a2dff077 100 92.37
1504263600 0002589b-d84c-467b-a7b1-de4342812f75 2ac6fe34-26c8-4760-945c-f8b35eb12795 09e2ca9b-a241-4f63-a6a1-5bf63aeeb870 446912278 Cayson Wiggins 607 S Woodridge Drive VacavilleCA95851 Lena May 677 Red Hill Road W ChicagoIL61729 e672483e-57a8-434a-bc42-ecf827c8a8d4 100 101.60
1494810000 0003a3e6-a9f0-49ac-a0c5-75e5d5b149e1 992384ad-ba2c-449f-8304-34617cbe1148 9ef74d02-fe6d-42f3-b638-29e67bbfa20e 337-30-1919 Ruby Sanford 141 Bosie Run N RichmondCA90041 Misael Fuller 214 W Golden Grove Drive AnaheimCA90224 7a41323a-560f-4e34-aba6-995e2325f95e 800 96.94
1494810000 0003a3e6-a9f0-49ac-a0c5-75e5d5b149e1 992384ad-ba2c-449f-8304-34617cbe1148 9ef74d02-fe6d-42f3-b638-29e67bbfa20e 337-30-1919 Ruby Sanford 141 Bosie Run N RichmondCA90041 Misael Fuller 214 W Golden Grove Drive AnaheimCA90224 8d809e13-fdc5-4d15-9271-953750f6d592 600 107.71


In [0]:
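# Import only the functions used in this notebook; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col, current_timestamp, input_file_name, trim, when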

df_2017 = (raw_2017_df
  .withColumn("submitted_at", trim(raw_2017_df.value.substr(1,15)))
  .withColumn("order_id", trim(raw_2017_df.value.substr(16,40)))
  .withColumn("customer_id", trim(raw_2017_df.value.substr(56,40)))
  .withColumn("sales_rep_id", trim(raw_2017_df.value.substr(96,40)))
  .withColumn("sales_rep_ssn", trim(raw_2017_df.value.substr(136,15)))
  .withColumn("sales_rep_first_name", trim(raw_2017_df.value.substr(151,15)))
  .withColumn("sales_rep_last_name", trim(raw_2017_df.value.substr(166,15)))
  .withColumn("sales_rep_address", trim(raw_2017_df.value.substr(181,40)))
  .withColumn("sales_rep_city", trim(raw_2017_df.value.substr(221,20)))
  .withColumn("sales_rep_state", trim(raw_2017_df.value.substr(241,2)))
  .withColumn("sales_rep_zip", trim(raw_2017_df.value.substr(243,5)))
  .withColumn("shipping_address_attention", trim(raw_2017_df.value.substr(248,30)))
  .withColumn("shipping_address_address", trim(raw_2017_df.value.substr(278,40)))
  .withColumn("shipping_address_city", trim(raw_2017_df.value.substr(318,20)))
  .withColumn("shipping_address_state", trim(raw_2017_df.value.substr(338,2)))
  .withColumn("shipping_address_zip", trim(raw_2017_df.value.substr(340,5)))
  .withColumn("product_id", trim(raw_2017_df.value.substr(345,40)))
  .withColumn("product_quantity", trim(raw_2017_df.value.substr(385,5)))
  .withColumn("product_sold_price", trim(raw_2017_df.value.substr(390,20)))
  .withColumn("ingest_file_name", input_file_name())
  .withColumn("ingested_at", current_timestamp())
  .drop("value")
)

display(df_2017)

submitted_at,order_id,customer_id,sales_rep_id,sales_rep_ssn,sales_rep_first_name,sales_rep_last_name,sales_rep_address,sales_rep_city,sales_rep_state,sales_rep_zip,shipping_address_attention,shipping_address_address,shipping_address_city,shipping_address_state,shipping_address_zip,product_id,product_quantity,product_sold_price,ingest_file_name,ingested_at
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,7a41323a-560f-4e34-aba6-995e2325f95e,300,87.5,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:13.068+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,8d809e13-fdc5-4d15-9271-953750f6d592,800,97.23,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:13.068+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,95cbadca-cf90-4b8a-a134-2976f6ba6df8,800,92.37,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:13.068+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,a8fbcfea-4352-4c5a-af8b-c8623258b4f8,200,96.25,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:13.068+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,a990d79b-4957-42fc-8e42-20ceb1fd1259,600,106.95,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:13.068+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,bc93ed89-bb15-4e46-a110-a5878e46ccf6,200,87.5,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:13.068+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,e26839a2-44fd-4003-a06b-faf6a2dff077,100,92.37,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:13.068+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,e672483e-57a8-434a-bc42-ecf827c8a8d4,100,101.6,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:13.068+0000
1494810000,0003a3e6-a9f0-49ac-a0c5-75e5d5b149e1,992384ad-ba2c-449f-8304-34617cbe1148,9ef74d02-fe6d-42f3-b638-29e67bbfa20e,337-30-1919,Ruby,Sanford,141 Bosie Run N,Richmond,CA,90041,Misael Fuller,214 W Golden Grove Drive,Anaheim,CA,90224,7a41323a-560f-4e34-aba6-995e2325f95e,800,96.94,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:13.068+0000
1494810000,0003a3e6-a9f0-49ac-a0c5-75e5d5b149e1,992384ad-ba2c-449f-8304-34617cbe1148,9ef74d02-fe6d-42f3-b638-29e67bbfa20e,337-30-1919,Ruby,Sanford,141 Bosie Run N,Richmond,CA,90041,Misael Fuller,214 W Golden Grove Drive,Anaheim,CA,90224,8d809e13-fdc5-4d15-9271-953750f6d592,600,107.71,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:13.068+0000


In [0]:
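# Replace empty strings with null in every string column (ingested_at is a timestamp, so it is skipped)
for col_name in df_2017.columns:
  if col_name != "ingested_at":
    df_2017 = df_2017.withColumn(col_name, when(col(col_name) != "", col(col_name)).otherwise(None))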
  
display(df_2017)

submitted_at,order_id,customer_id,sales_rep_id,sales_rep_ssn,sales_rep_first_name,sales_rep_last_name,sales_rep_address,sales_rep_city,sales_rep_state,sales_rep_zip,shipping_address_attention,shipping_address_address,shipping_address_city,shipping_address_state,shipping_address_zip,product_id,product_quantity,product_sold_price,ingest_file_name,ingested_at
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,7a41323a-560f-4e34-aba6-995e2325f95e,300,87.5,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:16.641+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,8d809e13-fdc5-4d15-9271-953750f6d592,800,97.23,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:16.641+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,95cbadca-cf90-4b8a-a134-2976f6ba6df8,800,92.37,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:16.641+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,a8fbcfea-4352-4c5a-af8b-c8623258b4f8,200,96.25,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:16.641+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,a990d79b-4957-42fc-8e42-20ceb1fd1259,600,106.95,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:16.641+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,bc93ed89-bb15-4e46-a110-a5878e46ccf6,200,87.5,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:16.641+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,e26839a2-44fd-4003-a06b-faf6a2dff077,100,92.37,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:16.641+0000
1504263600,0002589b-d84c-467b-a7b1-de4342812f75,2ac6fe34-26c8-4760-945c-f8b35eb12795,09e2ca9b-a241-4f63-a6a1-5bf63aeeb870,446912278,Cayson,Wiggins,607 S Woodridge Drive,Vacaville,CA,95851,Lena May,677 Red Hill Road W,Chicago,IL,61729,e672483e-57a8-434a-bc42-ecf827c8a8d4,100,101.6,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:16.641+0000
1494810000,0003a3e6-a9f0-49ac-a0c5-75e5d5b149e1,992384ad-ba2c-449f-8304-34617cbe1148,9ef74d02-fe6d-42f3-b638-29e67bbfa20e,337-30-1919,Ruby,Sanford,141 Bosie Run N,Richmond,CA,90041,Misael Fuller,214 W Golden Grove Drive,Anaheim,CA,90224,7a41323a-560f-4e34-aba6-995e2325f95e,800,96.94,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:16.641+0000
1494810000,0003a3e6-a9f0-49ac-a0c5-75e5d5b149e1,992384ad-ba2c-449f-8304-34617cbe1148,9ef74d02-fe6d-42f3-b638-29e67bbfa20e,337-30-1919,Ruby,Sanford,141 Bosie Run N,Richmond,CA,90041,Misael Fuller,214 W Golden Grove Drive,Anaheim,CA,90224,8d809e13-fdc5-4d15-9271-953750f6d592,600,107.71,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2021-09-17T22:34:16.641+0000


In [0]:
dbutils.fs.ls(working_dir)

In [0]:
(df_2017.write
  .format("delta")
  .mode("overwrite")
  .save(batch_target_path)
)
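
The first batch creates the Delta table, while the later batches are meant to be appended to it. A small wrapper that makes the save mode explicit could look like the sketch below. `save_batch` is a hypothetical helper, `df` is assumed to be a Spark DataFrame, and `path` is assumed to be a location such as `batch_target_path`.

```python
ALLOWED_MODES = ("overwrite", "append")

def save_batch(df, path, mode="append"):
    """Write `df` to `path` in the Delta format using an explicit save mode."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}, got {mode!r}")
    df.write.format("delta").mode(mode).save(path)

# Intended usage in the notebook (not executed here):
#   save_batch(df_2017, batch_target_path, mode="overwrite")  # creates or replaces the table
#   save_batch(df_2018, batch_target_path)                    # appends to the existing table
```

Defaulting to `"append"` is a design choice for this sketch: re-running a later batch then adds duplicate rows instead of silently replacing the earlier years, so a fresh start calls for re-running the 2017 step with `"overwrite"` first.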

### Reality Check #2.A
Run the following command to ensure that you are on track:

In [0]:
reality_check_02_a()

Points,Test,Result
1,"Using DBR 7.3 LTS, with 8 cores",
1,Valid Registration ID,
1,Target directory exists,
1,Using the Delta file format,
1,Found at least one Parquet part-file,
1,Schema is valid,
1,"Expected 391,266 records",
1,"No leading or trailing whitespace in column values, need to trim",
1,"No empty strings in column values, should be the SQL value null",
1,"No ""null"" strings for column values, should be the SQL value null",


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #2.B - Ingest Tab-Separated File</h2>

**In this step you will need to:**
1. Use the variable **`batch_2018_path`** and **`dbutils.fs.head`** to investigate the 2018 batch file, if needed.
2. Configure a **`DataFrameReader`** to ingest the tab-separated file identified by **`batch_2018_path`**
3. Add a new column, **`ingest_file_name`**, which is the name of the file from which the data was read - note this should not be hard coded.
4. Add a new column, **`ingested_at`**, which is a timestamp of when the data was ingested as a DataFrame - note this should not be hard coded.
5. **Append** the corresponding **`DataFrame`** to the previously created datasets specified by **`batch_target_path`**

**Additional Requirements**
* Any **"null"** strings in the CSV file should be replaced with the SQL value **null**
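
This rule can be illustrated without Spark by running Python's `csv` module on a made-up two-row sample: fields are split on tabs, and the literal text `null` is mapped to `None`. The helper name `read_tsv_with_nulls` and the sample data are hypothetical. When reading with Spark, the CSV reader's `nullValue` option (for example `.option("nullValue", "null")`) is one way to get the same effect at read time; replacing the strings in a later transformation works as well.

```python
import csv
import io

# Made-up sample in the same shape as a tab-separated file with a header row
sample = (
    "order_id\tsales_rep_ssn\tproduct_quantity\n"
    "A1\tnull\t800\n"
    "A2\t000-00-0000\tnull\n"
)

def read_tsv_with_nulls(text, null_token="null"):
    """Parse tab-separated text into dicts, mapping `null_token` to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {key: (None if value == null_token else value) for key, value in row.items()}
        for row in reader
    ]

rows = read_tsv_with_nulls(sample)
print(rows[0])  # {'order_id': 'A1', 'sales_rep_ssn': None, 'product_quantity': '800'}
```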

### Implement Exercise #2.B

Implement your solution in the following cell:

In [0]:
dbutils.fs.head(batch_2018_path)

In [0]:
raw_2018_df = (spark.read
  .option("header", True)
  .option("sep", "\t")
  .csv(batch_2018_path)
)

display(raw_2018_df)

submitted_at,order_id,customer_id,sales_rep_id,sales_rep_ssn,sales_rep_first_name,sales_rep_last_name,sales_rep_address,sales_rep_city,sales_rep_state,sales_rep_zip,shipping_address_attention,shipping_address_address,shipping_address_city,shipping_address_state,shipping_address_zip,product_id,product_quantity,product_sold_price
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,668b2c1f-d76e-4bf0-82bb-c7d5776524a4,800,111.52
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,699fcfe8-ce60-42c9-9d0f-728df3e48d70,800,105.95
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,7a41323a-560f-4e34-aba6-995e2325f95e,200,100.37
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,8d809e13-fdc5-4d15-9271-953750f6d592,900,111.52
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,95cbadca-cf90-4b8a-a134-2976f6ba6df8,700,105.95
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,a8fbcfea-4352-4c5a-af8b-c8623258b4f8,900,110.41
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,bc93ed89-bb15-4e46-a110-a5878e46ccf6,700,100.37
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,e26839a2-44fd-4003-a06b-faf6a2dff077,900,105.95
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,e672483e-57a8-434a-bc42-ecf827c8a8d4,200,116.54
1537167600,0000f6b0-ab09-4df0-818a-3fc500f683a3,da4cb7c5-062b-4cda-b3af-5f9a711cfced,9ef74d02-fe6d-42f3-b638-29e67bbfa20e,337-30-1919,Ruby,Sanford,141 Bosie Run N,Richmond,CA,90041,Novah Mclean,903 Merry Road,Daly City,CA,95569,e26839a2-44fd-4003-a06b-faf6a2dff077,400,102.33


In [0]:
df_2018 = raw_2018_df

# Trim first, then turn empty strings and literal "null" strings into SQL null.
# Comparing before trimming would let whitespace-only values through as empty strings.
for col_name in df_2018.columns:
  trimmed = trim(col(col_name))
  df_2018 = df_2018.withColumn(col_name, when((trimmed != "") & (trimmed != "null"), trimmed).otherwise(None))
  
display(df_2018)

submitted_at,order_id,customer_id,sales_rep_id,sales_rep_ssn,sales_rep_first_name,sales_rep_last_name,sales_rep_address,sales_rep_city,sales_rep_state,sales_rep_zip,shipping_address_attention,shipping_address_address,shipping_address_city,shipping_address_state,shipping_address_zip,product_id,product_quantity,product_sold_price
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,668b2c1f-d76e-4bf0-82bb-c7d5776524a4,800,111.52
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,699fcfe8-ce60-42c9-9d0f-728df3e48d70,800,105.95
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,7a41323a-560f-4e34-aba6-995e2325f95e,200,100.37
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,8d809e13-fdc5-4d15-9271-953750f6d592,900,111.52
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,95cbadca-cf90-4b8a-a134-2976f6ba6df8,700,105.95
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,a8fbcfea-4352-4c5a-af8b-c8623258b4f8,900,110.41
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,bc93ed89-bb15-4e46-a110-a5878e46ccf6,700,100.37
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,e26839a2-44fd-4003-a06b-faf6a2dff077,900,105.95
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,e672483e-57a8-434a-bc42-ecf827c8a8d4,200,116.54
1537167600,0000f6b0-ab09-4df0-818a-3fc500f683a3,da4cb7c5-062b-4cda-b3af-5f9a711cfced,9ef74d02-fe6d-42f3-b638-29e67bbfa20e,337-30-1919,Ruby,Sanford,141 Bosie Run N,Richmond,CA,90041,Novah Mclean,903 Merry Road,Daly City,CA,95569,e26839a2-44fd-4003-a06b-faf6a2dff077,400,102.33


In [0]:
df_2018 = (df_2018
  .withColumn("ingest_file_name", input_file_name())
  .withColumn("ingested_at", current_timestamp())
)

display(df_2018)

submitted_at,order_id,customer_id,sales_rep_id,sales_rep_ssn,sales_rep_first_name,sales_rep_last_name,sales_rep_address,sales_rep_city,sales_rep_state,sales_rep_zip,shipping_address_attention,shipping_address_address,shipping_address_city,shipping_address_state,shipping_address_zip,product_id,product_quantity,product_sold_price,ingest_file_name,ingested_at
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,668b2c1f-d76e-4bf0-82bb-c7d5776524a4,800,111.52,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,2021-09-17T22:37:35.248+0000
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,699fcfe8-ce60-42c9-9d0f-728df3e48d70,800,105.95,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,2021-09-17T22:37:35.248+0000
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,7a41323a-560f-4e34-aba6-995e2325f95e,200,100.37,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,2021-09-17T22:37:35.248+0000
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,8d809e13-fdc5-4d15-9271-953750f6d592,900,111.52,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,2021-09-17T22:37:35.248+0000
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,95cbadca-cf90-4b8a-a134-2976f6ba6df8,700,105.95,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,2021-09-17T22:37:35.248+0000
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,a8fbcfea-4352-4c5a-af8b-c8623258b4f8,900,110.41,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,2021-09-17T22:37:35.248+0000
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,bc93ed89-bb15-4e46-a110-a5878e46ccf6,700,100.37,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,2021-09-17T22:37:35.248+0000
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,e26839a2-44fd-4003-a06b-faf6a2dff077,900,105.95,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,2021-09-17T22:37:35.248+0000
1543078800,000011c4-6881-496f-a775-047f81320642,7ff1172c-6b11-48f2-9647-47fb04302c93,64748eb1-3898-49bd-8861-dd49de1c6adf,304-60-9930,Niklaus,Knox,PO Box 529,Grand Prairie,TX,76484,Arden Mcdonald,35 Ortez Lane W,Santa Ana,CA,94268,e672483e-57a8-434a-bc42-ecf827c8a8d4,200,116.54,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,2021-09-17T22:37:35.248+0000
1537167600,0000f6b0-ab09-4df0-818a-3fc500f683a3,da4cb7c5-062b-4cda-b3af-5f9a711cfced,9ef74d02-fe6d-42f3-b638-29e67bbfa20e,337-30-1919,Ruby,Sanford,141 Bosie Run N,Richmond,CA,90041,Novah Mclean,903 Merry Road,Daly City,CA,95569,e26839a2-44fd-4003-a06b-faf6a2dff077,400,102.33,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,2021-09-17T22:37:35.248+0000


In [0]:
(df_2018.write
  .format("delta")
  .mode("append")
  .save(batch_target_path)
)

### Reality Check #2.B
Run the following command to ensure that you are on track:

In [0]:
reality_check_02_b()

Points,Test,Result
1,Target directory exists,
1,Using the Delta file format,
1,Found at least one Parquet part-file,
1,Schema is valid,
1,"Expected 784,647 records",
1,"No leading or trailing whitespace in column values, need to trim",
1,"No empty strings in column values, should be the SQL value null",
1,"No ""null"" strings for column values, should be the SQL value null",
1,Ingest file names are valid for 2018,
1,Ingest date is valid for 2018,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #2.C - Ingest Comma-Separated File</h2>

**In this step you will need to:**
1. Use the variable **`batch_2019_path`** and **`dbutils.fs.head`** to investigate the 2019 batch file, if needed.
2. Configure a **`DataFrameReader`** to ingest the comma-separated file identified by **`batch_2019_path`**
3. Add a new column, **`ingest_file_name`**, which is the name of the file from which the data was read - note this should not be hard coded.
4. Add a new column, **`ingested_at`**, which is a timestamp of when the data was ingested as a DataFrame - note this should not be hard coded.
5. **Append** the corresponding **`DataFrame`** to the previously created dataset specified by **`batch_target_path`**<br/>
   Note: The column names in this dataset must be updated to conform to the schema defined for Exercise #2.A - there are several strategies for this:
   * Provide a schema that alters the names upon ingestion
   * Manually rename one column at a time
   * Use **`fixed_width_column_defs`** to programmatically rename one column at a time
   * Use transformations found in the **`DataFrame`** class to rename all columns in one operation
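
As a minimal sketch of the last strategy, the camelCase names in the 2019 header can be converted to the snake_case names used in Exercise #2.A with a small helper. The function name **`to_snake_case`** is hypothetical (it is not part of the course materials), and the sketch assumes every 2019 column is simple camelCase with no acronyms or digits:

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a camelCase column name to snake_case (hypothetical helper)."""
    # Insert "_" before every uppercase letter that is not at the start, then lowercase
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# A few of the column names as they appear in the 2019 file header
camel_names = ["submittedAt", "orderId", "salesRepFirstName", "shippingAddressZip", "productSoldPrice"]
snake_names = [to_snake_case(n) for n in camel_names]
print(snake_names)
```

With a helper like this, a call such as `df.toDF(*[to_snake_case(c) for c in df.columns])` would rename every column in one operation instead of listing each name by hand.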

**Additional Requirements**
* Any **"null"** strings in the CSV file should be replaced with the SQL value **null**
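
The reality check for this step also looks for leading or trailing whitespace and for empty strings, so it helps to state the full cleaning rule for a single value. The following is a plain-Python illustration of that rule, not Spark code. The name **`normalize`** is hypothetical, and Python's `strip()` removes all whitespace, which may differ slightly from Spark's `trim`:

```python
from typing import Optional

def normalize(value: Optional[str]) -> Optional[str]:
    """Trim a raw string and map empty or "null" text to None (hypothetical helper)."""
    if value is None:
        return None
    trimmed = value.strip()
    # Compare after trimming so that "  " and " null " are caught as well
    return None if trimmed in ("", "null") else trimmed

print(normalize("  Tampa "))  # Tampa
print(normalize(" null "))    # None
print(normalize("   "))       # None
```

The order matters: comparing before trimming would let a whitespace-only value through as an empty string. Spark's CSV reader also has a **`nullValue`** option that could map **"null"** strings at read time, assuming the values carry no surrounding whitespace.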

### Implement Exercise #2.C

Implement your solution in the following cell:

In [0]:
dbutils.fs.head(batch_2019_path)

In [0]:
raw_2019_df = (spark.read
  .option("header", True)
  .option("sep", ",")
  .csv(batch_2019_path)
)

display(raw_2019_df)

submittedAt,orderId,customerId,salesRepId,salesRepSsn,salesRepFirstName,salesRepLastName,salesRepAddress,salesRepCity,salesRepState,salesRepZip,shippingAddressAttention,shippingAddressAddress,shippingAddressCity,shippingAddressState,shippingAddressZip,productId,productQuantity,productSoldPrice
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,668b2c1f-d76e-4bf0-82bb-c7d5776524a4,700,97.23
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,699fcfe8-ce60-42c9-9d0f-728df3e48d70,200,92.37
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,7a41323a-560f-4e34-aba6-995e2325f95e,100,87.5
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,7b547a10-e804-48e1-ad90-1f946cee659c,600,97.23
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,8d809e13-fdc5-4d15-9271-953750f6d592,1000,97.23
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,95cbadca-cf90-4b8a-a134-2976f6ba6df8,600,92.37
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,a8fbcfea-4352-4c5a-af8b-c8623258b4f8,1000,96.25
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,a990d79b-4957-42fc-8e42-20ceb1fd1259,100,106.95
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,bc93ed89-bb15-4e46-a110-a5878e46ccf6,800,87.5
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,e26839a2-44fd-4003-a06b-faf6a2dff077,200,92.37


In [0]:
from pyspark.sql.functions import col, trim, when

df_2019 = raw_2019_df

# Trim first, then null out empty and "null" strings; when() without otherwise() yields null
for col_name in df_2019.columns:
  trimmed = trim(col(col_name))
  df_2019 = df_2019.withColumn(col_name, when((trimmed != "") & (trimmed != "null"), trimmed))

display(df_2019)

submittedAt,orderId,customerId,salesRepId,salesRepSsn,salesRepFirstName,salesRepLastName,salesRepAddress,salesRepCity,salesRepState,salesRepZip,shippingAddressAttention,shippingAddressAddress,shippingAddressCity,shippingAddressState,shippingAddressZip,productId,productQuantity,productSoldPrice
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,668b2c1f-d76e-4bf0-82bb-c7d5776524a4,700,97.23
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,699fcfe8-ce60-42c9-9d0f-728df3e48d70,200,92.37
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,7a41323a-560f-4e34-aba6-995e2325f95e,100,87.5
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,7b547a10-e804-48e1-ad90-1f946cee659c,600,97.23
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,8d809e13-fdc5-4d15-9271-953750f6d592,1000,97.23
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,95cbadca-cf90-4b8a-a134-2976f6ba6df8,600,92.37
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,a8fbcfea-4352-4c5a-af8b-c8623258b4f8,1000,96.25
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,a990d79b-4957-42fc-8e42-20ceb1fd1259,100,106.95
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,bc93ed89-bb15-4e46-a110-a5878e46ccf6,800,87.5
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,e26839a2-44fd-4003-a06b-faf6a2dff077,200,92.37


In [0]:
df_2019 = (df_2019
  .withColumn("ingest_file_name", input_file_name())
  .withColumn("ingested_at", current_timestamp())
)

In [0]:
df_2019 = df_2019.toDF(
  "submitted_at", "order_id", "customer_id", "sales_rep_id", "sales_rep_ssn", "sales_rep_first_name",
  "sales_rep_last_name", "sales_rep_address", "sales_rep_city", "sales_rep_state", "sales_rep_zip",
  "shipping_address_attention", "shipping_address_address", "shipping_address_city", "shipping_address_state",
  "shipping_address_zip", "product_id", "product_quantity", "product_sold_price", "ingest_file_name", "ingested_at"
)

display(df_2019)

submitted_at,order_id,customer_id,sales_rep_id,sales_rep_ssn,sales_rep_first_name,sales_rep_last_name,sales_rep_address,sales_rep_city,sales_rep_state,sales_rep_zip,shipping_address_attention,shipping_address_address,shipping_address_city,shipping_address_state,shipping_address_zip,product_id,product_quantity,product_sold_price,ingest_file_name,ingested_at
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,668b2c1f-d76e-4bf0-82bb-c7d5776524a4,700,97.23,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,2021-09-17T22:40:32.520+0000
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,699fcfe8-ce60-42c9-9d0f-728df3e48d70,200,92.37,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,2021-09-17T22:40:32.520+0000
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,7a41323a-560f-4e34-aba6-995e2325f95e,100,87.5,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,2021-09-17T22:40:32.520+0000
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,7b547a10-e804-48e1-ad90-1f946cee659c,600,97.23,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,2021-09-17T22:40:32.520+0000
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,8d809e13-fdc5-4d15-9271-953750f6d592,1000,97.23,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,2021-09-17T22:40:32.520+0000
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,95cbadca-cf90-4b8a-a134-2976f6ba6df8,600,92.37,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,2021-09-17T22:40:32.520+0000
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,a8fbcfea-4352-4c5a-af8b-c8623258b4f8,1000,96.25,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,2021-09-17T22:40:32.520+0000
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,a990d79b-4957-42fc-8e42-20ceb1fd1259,100,106.95,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,2021-09-17T22:40:32.520+0000
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,bc93ed89-bb15-4e46-a110-a5878e46ccf6,800,87.5,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,2021-09-17T22:40:32.520+0000
1565456400,00003356-11bf-41e7-9dc6-5eccf38a22ff,58ca2143-1cab-4d7e-a94e-90ddbce4a762,9ddf6d48-6314-48eb-8107-f98dd3d7bd46,675-33-6807,Barrett,Crosby,PO Box 95,Chandler,AZ,85623,Sage Leblanc,845 Berkshire Place,Tampa,FL,32260,e26839a2-44fd-4003-a06b-faf6a2dff077,200,92.37,dbfs:/user/andrew.barry@infinitive.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,2021-09-17T22:40:32.520+0000


In [0]:
(df_2019.write
  .format("delta")
  .mode("append")
  .save(batch_target_path)
)

### Reality Check #2.C
Run the following command to ensure that you are on track:

In [0]:
reality_check_02_c()

Points,Test,Result
1,Target directory exists,
1,Using the Delta file format,
1,Found at least one Parquet part-file,
1,Schema is valid,
1,"Expected 1,175,870 records",
1,"No leading or trailing whitespace in column values, need to trim",
1,"No empty strings in column values, should be the SQL value null",
1,"No ""null"" strings for column values, should be the SQL value null",
1,Ingest file names are valid for 2019,
1,Ingest date is valid for 2019,
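
Because each batch is appended to the same Delta table, the expected record counts in the reality checks are cumulative. A quick calculation, using only the totals quoted in Reality Checks #2.B and #2.C, shows how many rows the 2019 batch should contribute:

```python
# Expected cumulative totals, taken from the reality check tables
expected_after_2018 = 784_647
expected_after_2019 = 1_175_870

# Rows the 2019 append should add to the table
rows_from_2019 = expected_after_2019 - expected_after_2018
print(rows_from_2019)  # 391223
```

If the table count is off by roughly this amount, one possible cause is that the 2019 append cell was skipped or run more than once.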


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #2 - Final Check</h2>

Run the following command to make sure this exercise is complete:

In [0]:
reality_check_02_final()

Points,Test,Result
1,Reality Check 02.A passed,
1,Reality Check 02.B passed,
1,Reality Check 02.C passed,
1,Target directory exists,
1,Using the Delta file format,
1,Found at least one Parquet part-file,
1,Schema is valid,
1,"Expected 1,175,870 records",
1,"No leading or trailing whitespace in column values, need to trim",
1,"No empty strings in column values, should be the SQL value null",
