
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Exercise #2 - Batch Ingestion

In this exercise you will ingest three batches of orders, one each for 2017, 2018, and 2019.

As each batch is ingested, we are going to append it to a new Delta table, unifying all the datasets into one single dataset.

Each year, different individuals and different standards were used, resulting in datasets that vary slightly:
* In 2017 the backup was written as fixed-width text files
* In 2018 the backup was written as tab-separated text files
* In 2019 the backup was written as "standard" comma-separated text files, but the format of the column names was changed

Our only goal here is to unify all the datasets while tracking the source of each record (ingested file name and ingested timestamp) should additional problems arise.

Because we are only concerned with ingestion at this stage, most columns will be ingested as simple strings. Future exercises will address this issue (and others) with various transformations.

As you progress, several "reality checks" will be provided to help ensure that you are on track. Simply run the corresponding command after implementing each solution.

This exercise is broken up into 3 steps:
* Exercise 2.A - Ingest Fixed-Width File
* Exercise 2.B - Ingest Tab-Separated File
* Exercise 2.C - Ingest Comma-Separated File

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Setup Exercise #2</h2>

To get started, we first need to configure your Registration ID and then run the setup notebook.

### Setup - Registration ID

In the next command, please update the variable **`registration_id`** with the Registration ID you received when you signed up for this project.

For more information, see [Registration ID]($./Registration ID)

In [0]:
# TODO
registration_id = "3339094"

### Setup - Run the exercise setup

Run the following cell to set up this exercise, declaring exercise-specific variables and functions.

In [0]:
%run ./_includes/Setup-Exercise-02

Variable/Function,Value,Description
username,dakota.murdock@wavicledata.com,This is the email address that you signed into Databricks with
working_dir,dbfs:/user/dakota.murdock@wavicledata.com/dbacademy/developer-foundations-capstone,This is the directory in which all work should be conducted
batch_2017_path,dbfs:/user/dakota.murdock@wavicledata.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,The path to the 2017 batch of orders
batch_2018_path,dbfs:/user/dakota.murdock@wavicledata.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,The path to the 2018 batch of orders
batch_2019_path,dbfs:/user/dakota.murdock@wavicledata.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,The path to the 2019 batch of orders


Run the following cell to preview a list of the files you will be processing in this exercise.

In [0]:
files = dbutils.fs.ls(f"{working_dir}/raw/orders/batch") # List all the files
display(files)                                           # Display the list of files

path,name,size
dbfs:/user/dakota.murdock@wavicledata.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2017.txt,2017.txt,160422092
dbfs:/user/dakota.murdock@wavicledata.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2018.csv,2018.csv,110777063
dbfs:/user/dakota.murdock@wavicledata.com/dbacademy/developer-foundations-capstone/raw/orders/batch/2019.csv,2019.csv,125058494


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #2.A - Ingest Fixed-Width File</h2>

**In this step you will need to:**
1. Use the variable **`batch_2017_path`**, and **`dbutils.fs.head`** to investigate the 2017 batch file, if needed.
2. Configure a **`DataFrameReader`** to ingest the text file identified by **`batch_2017_path`** - this should provide one record per line, with a single column named **`value`**
3. Using the information in **`fixed_width_column_defs`** (or the dictionary itself) use the **`value`** column to extract each new column of the appropriate length.<br/>
  * The dictionary's key is the column name
  * The first element in the dictionary's value is the starting position of that column's data
  * The second element in the dictionary's value is the length of that column's data
4. Once you are done with the **`value`** column, remove it.
5. For each new column created in step #3, remove any leading whitespace
  * The introduction of \[leading\] white space should be expected when extracting fixed-width values out of the **`value`** column.
6. For each new column created in step #3, replace all empty strings with **`null`**.
  * After trimming white space, any column for which a value was not specified in the original dataset should result in an empty string.
7. Add a new column, **`ingest_file_name`**, which is the name of the file from which the data was read.
  * This should not be hard coded.
  * For the proper function, see the <a href="https://spark.apache.org/docs/latest/api/python/index.html" target="_blank">pyspark.sql.functions</a> module
8. Add a new column, **`ingested_at`**, which is a timestamp of when the data was ingested as a DataFrame.
  * This should not be hard coded.
  * For the proper function, see the <a href="https://spark.apache.org/docs/latest/api/python/index.html" target="_blank">pyspark.sql.functions</a> module
9. Write the corresponding **`DataFrame`** in the "delta" format to the location specified by **`batch_target_path`**
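
Steps 3 through 6 can be illustrated on a single line of text in plain Python, without Spark. This is only a sketch: the two-column `sample_defs` layout and the sample record are made up for illustration, and the exercise itself applies the same idea to the **`value`** column with Spark column functions. The start positions are treated as 1-based, which matches how Spark's `Column.substr` counts.

```python
from typing import Dict, Optional, Tuple

# Hypothetical two-column layout for illustration: name -> (1-based start, length)
sample_defs: Dict[str, Tuple[int, int]] = {
    "order_id": (1, 8),
    "state": (9, 4),
}

def parse_fixed_width(line: str, defs: Dict[str, Tuple[int, int]]) -> Dict[str, Optional[str]]:
    """Slice one fixed-width record into named fields."""
    record: Dict[str, Optional[str]] = {}
    for name, (start, length) in defs.items():
        # Convert the 1-based start position into a 0-based Python slice
        raw = line[start - 1 : start - 1 + length]
        # Step 5: remove leading whitespace introduced by the padding
        value = raw.lstrip()
        # Step 6: a field that was blank in the source becomes None (SQL null)
        record[name] = value if value != "" else None
    return record

# Made-up record: "order_id" is left-padded to 8 characters, "state" is blank
sample_line = "   A1234    "
parsed = parse_fixed_width(sample_line, sample_defs)
print(parsed)  # {'order_id': 'A1234', 'state': None}
```

Selecting only the extracted fields, as this function does, is also what makes step 4 (dropping **`value`**) implicit.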

**Special Notes:**
* It is possible to use the dictionary **`fixed_width_column_defs`** and programmatically extract <br/>
  each column, but it is also perfectly OK to hard code this step and extract one column at a time.
* The **`SparkSession`** is already provided to you as an instance of **`spark`**.
* The classes/methods that you will need for this exercise include:
  * **`pyspark.sql.DataFrameReader`** to ingest data
  * **`pyspark.sql.DataFrameWriter`** to write data
  * **`pyspark.sql.Column`** to transform data
  * Various functions from the **`pyspark.sql.functions`** module
  * Various transformations and actions from **`pyspark.sql.DataFrame`**
* The following methods can be used to investigate and manipulate the Databricks File System (DBFS)
  * **`dbutils.fs.ls(..)`** for listing files
  * **`dbutils.fs.rm(..)`** for removing files
  * **`dbutils.fs.head(..)`** to view the first N bytes of a file

**Additional Requirements:**
* The unified batch dataset must be written to disk in the "delta" format
* The schema for the unified batch dataset must be:
  * **`submitted_at`**:**`string`**
  * **`order_id`**:**`string`**
  * **`customer_id`**:**`string`**
  * **`sales_rep_id`**:**`string`**
  * **`sales_rep_ssn`**:**`string`**
  * **`sales_rep_first_name`**:**`string`**
  * **`sales_rep_last_name`**:**`string`**
  * **`sales_rep_address`**:**`string`**
  * **`sales_rep_city`**:**`string`**
  * **`sales_rep_state`**:**`string`**
  * **`sales_rep_zip`**:**`string`**
  * **`shipping_address_attention`**:**`string`**
  * **`shipping_address_address`**:**`string`**
  * **`shipping_address_city`**:**`string`**
  * **`shipping_address_state`**:**`string`**
  * **`shipping_address_zip`**:**`string`**
  * **`product_id`**:**`string`**
  * **`product_quantity`**:**`string`**
  * **`product_sold_price`**:**`string`**
  * **`ingest_file_name`**:**`string`**
  * **`ingested_at`**:**`timestamp`**
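
One way to check a result against this schema is to compare `(name, type)` pairs, which is the shape Spark returns from `DataFrame.dtypes`. The helper below is a hypothetical sketch (the name `schema_problems` is not part of the course materials) and makes no claim about how the reality checks validate the schema.

```python
from typing import List, Tuple

# Column names from the list above that must be of type string
string_columns = [
    "submitted_at", "order_id", "customer_id",
    "sales_rep_id", "sales_rep_ssn", "sales_rep_first_name", "sales_rep_last_name",
    "sales_rep_address", "sales_rep_city", "sales_rep_state", "sales_rep_zip",
    "shipping_address_attention", "shipping_address_address", "shipping_address_city",
    "shipping_address_state", "shipping_address_zip",
    "product_id", "product_quantity", "product_sold_price",
    "ingest_file_name",
]
expected_dtypes: List[Tuple[str, str]] = [(name, "string") for name in string_columns]
expected_dtypes.append(("ingested_at", "timestamp"))

def schema_problems(actual: List[Tuple[str, str]],
                    expected: List[Tuple[str, str]] = expected_dtypes) -> List[str]:
    """Return human-readable differences between two (name, type) lists."""
    actual_map = dict(actual)
    expected_map = dict(expected)
    problems = []
    for name, dtype in expected:
        if name not in actual_map:
            problems.append(f"missing column: {name}")
        elif actual_map[name] != dtype:
            problems.append(f"{name}: expected {dtype}, found {actual_map[name]}")
    for name, _ in actual:
        if name not in expected_map:
            problems.append(f"unexpected column: {name}")
    return problems

# A schema where ingested_at was accidentally written as a string
wrong = expected_dtypes[:-1] + [("ingested_at", "string")]
print(schema_problems(wrong))  # ['ingested_at: expected timestamp, found string']
```

In the notebook, the `actual` argument could come from `spark.read.format("delta").load(batch_target_path).dtypes` once the table has been written.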

### Fixed-Width Metadata

The following dictionary is provided for reference and/or implementation<br/>
(depending on which strategy you choose to employ).

Run the following cell to instantiate it.

In [0]:
fixed_width_column_defs = {
  "submitted_at": (1, 15),
  "order_id": (16, 40),
  "customer_id": (56, 40),
  "sales_rep_id": (96, 40),
  "sales_rep_ssn": (136, 15),
  "sales_rep_first_name": (151, 15),
  "sales_rep_last_name": (166, 15),
  "sales_rep_address": (181, 40),
  "sales_rep_city": (221, 20),
  "sales_rep_state": (241, 2),
  "sales_rep_zip": (243, 5),
  "shipping_address_attention": (248, 30),
  "shipping_address_address": (278, 40),
  "shipping_address_city": (318, 20),
  "shipping_address_state": (338, 2),
  "shipping_address_zip": (340, 5),
  "product_id": (345, 40),
  "product_quantity": (385, 5),
  "product_sold_price": (390, 20)
}

### Implement Exercise #2.A

Implement your solution in the following cell:

In [0]:
dbutils.fs.ls(batch_target_path)

Out[15]: [FileInfo(path='dbfs:/user/dakota.murdock@wavicledata.com/dbacademy/developer-foundations-capstone/batch_orders_dirty.delta/_delta_log/', name='_delta_log/', size=0),
 FileInfo(path='dbfs:/user/dakota.murdock@wavicledata.com/dbacademy/developer-foundations-capstone/batch_orders_dirty.delta/part-00000-01c9595e-b2a3-4cdc-81c0-2d61fec61873-c000.snappy.parquet', name='part-00000-01c9595e-b2a3-4cdc-81c0-2d61fec61873-c000.snappy.parquet', size=1141902),
 FileInfo(path='dbfs:/user/dakota.murdock@wavicledata.com/dbacademy/developer-foundations-capstone/batch_orders_dirty.delta/part-00000-05c02f90-154b-4f55-aaf1-7d1526f67df6-c000.snappy.parquet', name='part-00000-05c02f90-154b-4f55-aaf1-7d1526f67df6-c000.snappy.parquet', size=1141902),
 FileInfo(path='dbfs:/user/dakota.murdock@wavicledata.com/dbacademy/developer-foundations-capstone/batch_orders_dirty.delta/part-00000-3a04c528-621d-4171-8710-b375fd6f22dc-c000.snappy.parquet', name='part-00000-3a04c528-621d-4171-8710-b375fd6f22dc-c000.s

In [0]:
# dbutils.fs.head(batch_2017_path)
from pyspark.sql.functions import col, current_timestamp, input_file_name, ltrim

# Read the 2017 file as text: one record per line, in a single column named "value"
df_2017 = spark.read.format("text").load(batch_2017_path)

# Build one Column expression per fixed-width field from the provided dictionary:
# take the substring at (start, length), strip leading whitespace, and name the result
column_exprs = [
    ltrim(col("value").substr(start, length)).alias(name)
    for name, (start, length) in fixed_width_column_defs.items()
]

# Selecting only the extracted columns also drops the original "value" column
df_2017_columns = df_2017.select(*column_exprs)

# Replace empty strings with null
df_2017_columns = df_2017_columns.na.replace("", None)

# Add columns containing the source file name and the time of ingestion
df_2017_columns = (df_2017_columns
    .withColumn("ingest_file_name", input_file_name())
    .withColumn("ingested_at", current_timestamp()))

# Write the DataFrame to the Delta table
df_2017_columns.write.format("delta").mode("overwrite").save(batch_target_path)

### Reality Check #2.A
Run the following command to ensure that you are on track:

In [0]:
reality_check_02_a()

PYTHON ERROR Invalid argument, not a string or column: 1636990662483.615 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990662510.216 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990662792.8596 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990663025.7673 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990663251.448 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990663435.6226 of type <class 'float'>. For column literals, use 'lit

Points,Test,Result
1,Using DBR 9.1 & Proper Cluster Configuration,
1,Valid Registration ID,
1,Target directory exists,
1,Using the Delta file format,
1,Found at least one Parquet part-file,
1,Schema is valid,
1,"Expected 391,266 records",
1,"No leading or trailing whitespace in column values, need to trim",
1,"No empty strings in column values, should be the SQL value null",
1,"No ""null"" strings for column values, should be the SQL value null",


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #2.B - Ingest Tab-Separated File</h2>

**In this step you will need to:**
1. Use the variable **`batch_2018_path`**, and **`dbutils.fs.head`** to investigate the 2018 batch file, if needed.
2. Configure a **`DataFrameReader`** to ingest the tab-separated file identified by **`batch_2018_path`**
3. Add a new column, **`ingest_file_name`**, which is the name of the file from which the data was read - note this should not be hard coded.
4. Add a new column, **`ingested_at`**, which is a timestamp of when the data was ingested as a DataFrame - note this should not be hard coded.
5. **Append** the corresponding **`DataFrame`** to the previously created dataset specified by **`batch_target_path`**

**Additional Requirements**
* Any **"null"** strings in the CSV file should be replaced with the SQL value **null**
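
The effect of this requirement can be shown on a tiny tab-separated sample in plain Python. The sample text and the helper name `read_tsv_with_nulls` are made up for illustration; in Spark, the same result may be reached either by replacing values after the read or by setting the CSV reader's `nullValue` option.

```python
import csv
import io

# Made-up sample: a header line followed by two tab-separated records
sample_tsv = "order_id\tproduct_quantity\nA1\tnull\nA2\t3\n"

def read_tsv_with_nulls(text, null_token="null"):
    """Parse tab-separated text, mapping the literal null token to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {key: (None if value == null_token else value) for key, value in row.items()}
        for row in reader
    ]

rows = read_tsv_with_nulls(sample_tsv)
print(rows)
# [{'order_id': 'A1', 'product_quantity': None}, {'order_id': 'A2', 'product_quantity': '3'}]
```

Note that only an exact match on the token is replaced; a value such as `"nullable"` would be left untouched.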

### Implement Exercise #2.B

Implement your solution in the following cell:

In [0]:
# Read in the tsv file into a dataframe
df_2018 = spark.read.option("sep", '\t').option("header", True).format('csv').load(batch_2018_path)

# Replace csv "null" string with SQL value null
df_2018 = df_2018.na.replace('null', None)

# Add columns containing the source file name and the time it was ingested at
df_2018 = df_2018.withColumn("ingest_file_name", input_file_name()).withColumn("ingested_at", current_timestamp())

# Write the dataframe to delta table
df_2018.write.format("delta").mode("append").save(batch_target_path)

### Reality Check #2.B
Run the following command to ensure that you are on track:

In [0]:
reality_check_02_b()

PYTHON ERROR Invalid argument, not a string or column: 1636990796158.4683 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990796341.9612 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990796525.808 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990796630.7593 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990798836.0396 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990838017.1326 of type <class 'float'>. For column literals, use 'l

Points,Test,Result
1,Target directory exists,
1,Using the Delta file format,
1,Found at least one Parquet part-file,
1,Schema is valid,
1,"Expected 784,647 records",
1,"No leading or trailing whitespace in column values, need to trim",
1,"No empty strings in column values, should be the SQL value null",
1,"No ""null"" strings for column values, should be the SQL value null",
1,Ingest file names are valid for 2018,
1,Ingest date is valid for 2018,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #2.C - Ingest Comma-Separated File</h2>

**In this step you will need to:**
1. Use the variable **`batch_2019_path`**, and **`dbutils.fs.head`** to investigate the 2019 batch file, if needed.
2. Configure a **`DataFrameReader`** to ingest the comma-separated file identified by **`batch_2019_path`**
3. Add a new column, **`ingest_file_name`**, which is the name of the file from which the data was read - note this should not be hard coded.
4. Add a new column, **`ingested_at`**, which is a timestamp of when the data was ingested as a DataFrame - note this should not be hard coded.
5. **Append** the corresponding **`DataFrame`** to the previously created dataset specified by **`batch_target_path`**<br/>
   Note: The column names in this dataset must be updated to conform to the schema defined for Exercise #2.A - there are several strategies for this:
   * Provide a schema that alters the names upon ingestion
   * Manually rename one column at a time
   * Use **`fixed_width_column_defs`** to programmatically rename one column at a time
   * Use transformations found in the **`DataFrame`** class to rename all columns in one operation
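
The programmatic strategies above all reduce to pairing each 2019 column name with its target name. The sketch below pairs them by position, which assumes the 2019 file lists its columns in the same order as the earlier batches; that assumption should be confirmed with **`dbutils.fs.head`**. The `source_names` shown are hypothetical, since the actual 2019 header is not reproduced here.

```python
# Target names, in order (the first three keys of fixed_width_column_defs)
target_names = ["submitted_at", "order_id", "customer_id"]

# Hypothetical 2019-style header names, used only for illustration
source_names = ["submittedAt", "orderId", "customerId"]

def positional_rename_map(source, target):
    """Pair old and new column names by position, refusing mismatched lengths."""
    if len(source) != len(target):
        raise ValueError(f"expected {len(target)} columns, found {len(source)}")
    return dict(zip(source, target))

rename_map = positional_rename_map(source_names, target_names)
print(rename_map)
# {'submittedAt': 'submitted_at', 'orderId': 'order_id', 'customerId': 'customer_id'}
```

In Spark, the one-operation variant corresponds to `DataFrame.toDF(*target_names)`, while the one-at-a-time variant corresponds to calling `withColumnRenamed(old, new)` for each pair in the map.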

**Additional Requirements**
* Any **"null"** strings in the CSV file should be replaced with the SQL value **null**<br/>

### Implement Exercise #2.C

Implement your solution in the following cell:

In [0]:
from pyspark.sql.types import StringType, StructType, StructField

# Defining schema based on 2017 and 2018 column names for import of 2019 data
defined_schema = StructType([
    StructField("submitted_at", StringType(), True),
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("sales_rep_id", StringType(), True),
    StructField("sales_rep_ssn", StringType(), True),
    StructField("sales_rep_first_name", StringType(), True),
    StructField("sales_rep_last_name", StringType(), True),
    StructField("sales_rep_address", StringType(), True),
    StructField("sales_rep_city", StringType(), True),
    StructField("sales_rep_state", StringType(), True),
    StructField("sales_rep_zip", StringType(), True),
    StructField("shipping_address_attention", StringType(), True),
    StructField("shipping_address_address", StringType(), True),
    StructField("shipping_address_city", StringType(), True),
    StructField("shipping_address_state", StringType(), True),
    StructField("shipping_address_zip", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("product_quantity", StringType(), True),
    StructField("product_sold_price", StringType(), True)
])

# Read in the csv file into a dataframe
df_2019 = spark.read.option("header", True).schema(defined_schema).format('csv').load(batch_2019_path)

# Replace csv "null" string with SQL value null
df_2019 = df_2019.na.replace('null', None)

# Add columns containing the source file name and the time it was ingested at
df_2019 = df_2019.withColumn("ingest_file_name", input_file_name()).withColumn("ingested_at", current_timestamp())

# Write the dataframe to delta table
df_2019.write.format("delta").mode("append").save(batch_target_path)

### Reality Check #2.C
Run the following command to ensure that you are on track:

In [0]:
reality_check_02_c()

PYTHON ERROR Invalid argument, not a string or column: 1636990958776.9055 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990958966.8657 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990959149.929 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990959245.3945 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636990961417.6409 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636991017013.8633 of type <class 'float'>. For column literals, use 'l

Points,Test,Result
1,Target directory exists,
1,Using the Delta file format,
1,Found at least one Parquet part-file,
1,Schema is valid,
1,"Expected 1,175,870 records",
1,"No leading or trailing whitespace in column values, need to trim",
1,"No empty strings in column values, should be the SQL value null",
1,"No ""null"" strings for column values, should be the SQL value null",
1,Ingest file names are valid for 2019,
1,Ingest date is valid for 2019,


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #2 - Final Check</h2>

Run the following command to make sure this exercise is complete:

In [0]:
reality_check_02_final()

PYTHON ERROR Invalid argument, not a string or column: 1636991147297.8127 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636991147312.6528 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636991147326.318 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636991147506.9524 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636991147715.3125 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
PYTHON ERROR Invalid argument, not a string or column: 1636991147892.5483 of type <class 'float'>. For column literals, use 'l

Points,Test,Result
1,Reality Check 02.A passed,
1,Reality Check 02.B passed,
1,Reality Check 02.C passed,
1,Target directory exists,
1,Using the Delta file format,
1,Found at least one Parquet part-file,
1,Schema is valid,
1,"Expected 1,175,870 records",
1,"No leading or trailing whitespace in column values, need to trim",
1,"No empty strings in column values, should be the SQL value null",


&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>