
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Exercise #1 - Project Overview & Dataset Install

The capstone project aims to assess rudimentary skills as they relate to the Apache Spark and DataFrame APIs.

The approach taken here assumes that you are familiar with and have some experience with the following entities:
* **`SparkContext`**
* **`SparkSession`**
* **`DataFrame`**
* **`DataFrameReader`**
* **`DataFrameWriter`**
* The various functions found in the module **`pyspark.sql.functions`**

Throughout this project, you will be given specific instructions and it is our expectation that you will be able to complete these instructions drawing on your existing knowledge as well as other sources such as the <a href="https://spark.apache.org/docs/latest/api.html" target="_blank">Spark API Documentation</a>.

After reviewing the project, the datasets and the various exercises, we will install the<br/>
datasets into your Databricks workspace so that you may proceed with this capstone project.

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Project Overview</h2>
* The Project - an introduction to this project
* The Data - an introduction to this project's datasets
* The Exercises - an overview of the various exercises in this project

### The Project

The idea behind this project is to ingest data from a purchasing system and load it into a data lake for further analysis. 

Each exercise is broken up into smaller steps, or milestones.

After every milestone, we have provided a "reality check" to help ensure that you are progressing as expected.

Please note: because each exercise builds on the previous one, it is essential to complete each exercise and ensure that its "reality checks" pass before moving on to the next exercise.

In the last exercise of this project, we will use the data we loaded to answer some simple business questions.

### The Data
The raw data comes in three forms:

1. Orders that were processed in 2017, 2018 and 2019.
  * For each year a separate batch (or backup) of that year's orders was produced
  * The formats of the three files are similar, but the files were not produced in exactly the same way:
    * 2017 is in a fixed-width text file format
    * 2018 is a tab-separated text file
    * 2019 is a comma-separated text file
  * Each order consists of four main data points:
    0. The order - the highest level aggregate
    0. The line items - the individual products purchased in the order
    0. The sales rep - the person placing the order
    0. The customer - the person who purchased the items and where they were shipped.
  * All three batches are consistent in that there is one record per line item, which creates a significant amount of duplicated data across orders, reps and customers.
  * All entities are generally referenced by an ID, such as order_id, customer_id, etc.
  
2. All products to be sold by this company (SKUs) are represented in a single XML file

3. In 2020, the company switched systems and now lands a single JSON file in cloud storage for every order received.
  * These orders are simplified versions of the batched data from 2017-2019 and include only the order's details, the line items, and the correlating ids
  * The sales reps' data is no longer represented in conjunction with an order
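
The three batch formats described above differ only in how fields are delimited, so they can be parsed into the same record shape. The following is a minimal sketch in plain Python, not the project's solution: the column names, the fixed column widths, and the presence of a header row in the delimited files are all assumptions made for illustration, since this overview does not specify them.

```python
import csv
import io

# Hypothetical column layout: the real files' columns and widths are not
# specified in this overview.
COLUMNS = ["order_id", "customer_id", "product_id"]
FIXED_WIDTHS = [8, 8, 8]  # assumed character width of each column


def parse_fixed_width(text):
    """Slice each non-empty line into fields using the assumed widths."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields, start = [], 0
        for width in FIXED_WIDTHS:
            fields.append(line[start:start + width].strip())
            start += width
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def parse_delimited(text, delimiter):
    """Parse delimited text whose first line is assumed to be a header."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Made-up sample data, one record per line item.
sample_2017 = "O-1     C-7     P-100   \nO-1     C-7     P-200   \n"
sample_2018 = "order_id\tcustomer_id\tproduct_id\nO-2\tC-8\tP-100\n"
sample_2019 = "order_id,customer_id,product_id\nO-3,C-9,P-300\n"

unified = (
    parse_fixed_width(sample_2017)
    + parse_delimited(sample_2018, "\t")
    + parse_delimited(sample_2019, ",")
)
print(len(unified))  # 4
```

In the exercises themselves, the same idea would be expressed with a `DataFrameReader` rather than with hand-written parsing.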

### The Exercises

* In **Exercise #1** (this notebook), we introduce the registration procedure, the installation of our datasets and the reality checks meant to aid you in your progress through this capstone project.

* In **Exercise #2**, we will ingest the batch data for 2017-2019 and combine it into a single dataset for future processing.

* In **Exercise #3**, we will take the unified batch data from **Exercise #2**, clean it, and extract it into three new datasets: Orders, Line Items and Sales Reps. The customer data, for the sake of simplicity, will not be broken out and will be left with the orders.

* In **Exercise #4**, we will ingest the XML document containing all the products, and combine it with the Line Items to create yet another dataset, Product Line Items.

* In **Exercise #5**, we will begin processing the stream of orders for 2020, appending that stream of data to the existing datasets as necessary.

* In **Exercise #6**, we will use all of our new datasets to answer a handful of business questions.

* In **Exercise #7**, we provide final instructions for submitting your capstone project.
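
To make the idea behind **Exercise #3** concrete, the sketch below splits a handful of denormalized records (one per line item) into Orders, Line Items and Sales Reps by projecting onto a subset of columns and dropping duplicates. It uses plain Python so that it runs anywhere. The field names and sample values are hypothetical, and the helper `distinct()` is invented for this illustration. With the DataFrame API, the equivalent operations would be `select()` followed by `dropDuplicates()`.

```python
# Hypothetical denormalized records, one per line item, as in the batch files.
records = [
    {"order_id": "O-1", "customer_id": "C-7", "sales_rep_id": "R-1",
     "sales_rep_name": "Ana", "product_id": "P-100", "quantity": 2},
    {"order_id": "O-1", "customer_id": "C-7", "sales_rep_id": "R-1",
     "sales_rep_name": "Ana", "product_id": "P-200", "quantity": 1},
    {"order_id": "O-2", "customer_id": "C-8", "sales_rep_id": "R-1",
     "sales_rep_name": "Ana", "product_id": "P-100", "quantity": 5},
]


def distinct(rows, keys):
    """Project rows onto `keys` and drop duplicates, keeping first-seen order."""
    seen = {}
    for row in rows:
        projected = tuple(row[key] for key in keys)
        seen.setdefault(projected, dict(zip(keys, projected)))
    return list(seen.values())


# Customer data stays with the orders, as described for Exercise #3.
orders = distinct(records, ["order_id", "customer_id", "sales_rep_id"])
sales_reps = distinct(records, ["sales_rep_id", "sales_rep_name"])
line_items = distinct(records, ["order_id", "product_id", "quantity"])

print(len(orders), len(sales_reps), len(line_items))  # 2 1 3
```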

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #1 - Install Datasets</h2>

The datasets for this project are stored in a public object store.

They need to be downloaded and installed into your Databricks workspace before proceeding with this project.

But before doing that, we need to configure a cluster appropriate for this project.

**In this step you will need to:**
1. Configure the cluster (see specific instructions below)
2. Attach this notebook to your cluster
3. Specify your Registration ID
4. Run the setup notebook for this exercise
5. Install the datasets
6. Run the reality check to verify the datasets were correctly installed

Note: These steps represent the basic pattern used by each exercise in this capstone project.

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Setup Exercise #1</h2>

To get started, we first need to configure your Registration ID and then run the setup notebook.

### Setup - Create A Cluster

#### Databricks Community Edition

This Capstone project was designed to work with Databricks Runtime Version (DBR) 9.1 LTS and the Databricks Community Edition's (CE) default cluster configuration. 

When working in CE, start a default cluster, specify **DBR 9.1 LTS**, and then proceed with the next step. 

#### Other than Community Edition (MSA, AWS or GCP)

This capstone project was designed to work with a small, single-node cluster when not using CE. When configuring your cluster, please specify the following:

* DBR: **9.1 LTS** 
* Cluster Mode: **Single Node**
* Node Type: 
  * for Microsoft Azure - **Standard_E4ds_v4**
  * for Amazon Web Services - **i3.xlarge** 
  * for Google Cloud Platform - **n1-highmem-4** 

Please feel free to use the Community Edition if the recommended node types are not available.
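
For readers who prefer to script cluster creation instead of using the UI, the settings above can be captured as a specification dictionary. This is only a sketch under stated assumptions: the field names follow the Databricks Clusters API as we understand it, the runtime key `9.1.x-scala2.12` is an assumed identifier for DBR 9.1 LTS, and the cluster name and the `build_cluster_spec()` helper are hypothetical. Verify each value against your own workspace before relying on it.

```python
# Node types recommended in this notebook, keyed by a hypothetical cloud label.
NODE_TYPES = {
    "azure": "Standard_E4ds_v4",
    "aws": "i3.xlarge",
    "gcp": "n1-highmem-4",
}


def build_cluster_spec(cloud, name="capstone-single-node"):
    """Return a dict describing a single-node DBR 9.1 LTS cluster."""
    if cloud not in NODE_TYPES:
        raise ValueError(f"Unknown cloud: {cloud!r}")
    return {
        "cluster_name": name,  # hypothetical name
        "spark_version": "9.1.x-scala2.12",  # assumed key for DBR 9.1 LTS
        "node_type_id": NODE_TYPES[cloud],
        "num_workers": 0,  # single node: driver only, no workers
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
    }


spec = build_cluster_spec("aws")
print(spec["node_type_id"])  # i3.xlarge
```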

### Setup - Run the exercise setup

Run the following cell to set up this exercise, declaring exercise-specific variables and functions.

In [0]:
%run ./_includes/Setup-Exercise-01

| Variable/Function | Description |
|---|---|
| `install_datasets()` | A utility function for installing datasets into the current workspace. |
| `reality_check_install()` | A utility function for validating the install process. |


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise #1 - Install Datasets</h2>

Simply run the following command to install the capstone's datasets into your workspace.

In [0]:
# At any time during this project, you can reinstall the source datasets
# by setting reinstall=True. These datasets will not be automatically
# reinstalled when this notebook is rerun, so as to save you time.
install_datasets(reinstall=False)


The source directory for this dataset is
wasbs://courseware@dbacademy.blob.core.windows.net/developer-foundations-capstone/v01/

Skipping install of existing dataset to
dbfs:/dbacademy/cenz.wong@ekimetrics.com/developer-foundations-capstone/raw
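
The "Skipping install of existing dataset" message above reflects an idempotent install: the copy runs only when the target is missing or when `reinstall=True`. The sketch below illustrates that pattern on the local file system. It is a hypothetical stand-in, not the course's implementation of `install_datasets()`, which copies from cloud storage into DBFS and may behave differently in its details.

```python
import shutil
import tempfile
from pathlib import Path


def install_datasets_sketch(source_dir, target_dir, reinstall=False):
    """Copy source_dir to target_dir unless the target already exists.

    Hypothetical stand-in for install_datasets(); returns "skipped" or
    "installed" so the caller can see which path was taken.
    """
    source, target = Path(source_dir), Path(target_dir)
    if target.exists() and not reinstall:
        return "skipped"
    if target.exists():
        shutil.rmtree(target)  # remove the old copy before reinstalling
    shutil.copytree(source, target)
    return "installed"


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny placeholder source tree to copy from.
    source = Path(tmp) / "source"
    (source / "orders").mkdir(parents=True)
    (source / "orders" / "placeholder.txt").write_text("placeholder")
    target = Path(tmp) / "raw"

    first = install_datasets_sketch(source, target)
    second = install_datasets_sketch(source, target)
    third = install_datasets_sketch(source, target, reinstall=True)

print(first, second, third)  # installed skipped installed
```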


### Reality Check #1
Run the following command to ensure that you are on track:

In [0]:
reality_check_install()

Wrote 17 bytes.


| Points | Test | Result |
|---|---|---|
| 1 | Using DBR 9.1 & Proper Cluster Configuration | |
| 1 | Valid Registration ID | |
| 1 | Expected 3 files, found 3 in / | |
| 1 | Expected 2 or more files, found 3 in /_meta | |
| 1 | Expected 2 files, found 2 in /products | |
| 1 | Expected 2 files, found 2 in /orders | |
| 1 | Expected 3 files, found 3 in /orders/batch | |
| 1 | Expected 20 files, found 20 in /orders/stream | |
| 1 | All datasets were installed succesfully! | |
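
The file-count tests in the table above can be approximated with a small helper that counts the entries directly under a directory and compares that count with an expectation. This is a sketch of the general idea only: `check_file_count()` is a hypothetical helper, the placeholder file names are invented, and the actual `reality_check_install()` may validate more than is shown here.

```python
import tempfile
from pathlib import Path


def check_file_count(root, relative_path, expected, at_least=False):
    """Return (passed, message) for the entries directly under a path."""
    found = len(list((Path(root) / relative_path).iterdir()))
    passed = found >= expected if at_least else found == expected
    qualifier = " or more" if at_least else ""
    message = (
        f"Expected {expected}{qualifier} files, "
        f"found {found} in /{relative_path}"
    )
    return passed, message


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the directory layout implied by the table, with empty files.
    layout = {"_meta": 3, "products": 2, "orders/batch": 3, "orders/stream": 20}
    for folder, count in layout.items():
        directory = Path(tmp) / folder
        directory.mkdir(parents=True)
        for i in range(count):
            (directory / f"placeholder-{i}").write_text("")

    results = [
        check_file_count(tmp, "", 3),
        check_file_count(tmp, "_meta", 2, at_least=True),
        check_file_count(tmp, "products", 2),
        check_file_count(tmp, "orders", 2),
        check_file_count(tmp, "orders/batch", 3),
        check_file_count(tmp, "orders/stream", 20),
    ]

for passed, message in results:
    print("PASS" if passed else "FAIL", message)
```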


&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>