In [4]:
%run "./Includes/Classroom-Setup"

### Getting Started

You will notice that throughout this course, there is a lot of context switching between PySpark, Scala and SQL.

This is because:
* `read` and `write` operations are performed on DataFrames using PySpark or Scala
* table creation and queries are performed directly on Databricks Delta tables using SQL

The `%run` cell at the top of this notebook configures our "classroom."

Set up relevant paths.

In [8]:
inputPath = "/mnt/training/online_retail/data-001/data.csv"

parquetDataPath  = workingDir + "/customer-data/"
deltaDataPath    = workingDir + "/customer-data-delta/"

###  READ CSV Data

Read the data into a DataFrame. We supply the schema.

Partition on `Country` because there are only a few unique countries and because we will use `Country` as a predicate in a `WHERE` clause.

More information on table partitioning is contained in the links at the bottom of this notebook.
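
Spark's `partitionBy` writes each distinct value of the partition column into its own `Country=<value>` subdirectory, which is why a `WHERE Country = ...` predicate can skip unrelated files. The following is a minimal plain-Python sketch of that layout, not Spark itself: the sample rows, the CSV part files and the helper names `write_partitioned` and `count_where` are all made up for illustration.

```python
import csv
import os
import tempfile
from collections import defaultdict

# Hypothetical sample rows; the real data comes from inputPath.
rows = [
    {"InvoiceNo": "1", "Quantity": "6", "Country": "Sweden"},
    {"InvoiceNo": "2", "Quantity": "2", "Country": "France"},
    {"InvoiceNo": "3", "Quantity": "4", "Country": "Sweden"},
]


def write_partitioned(rows, base_dir, partition_col):
    """Write rows into one `<col>=<value>` subdirectory per distinct value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)
    for value, group in groups.items():
        part_dir = os.path.join(base_dir, "{}={}".format(partition_col, value))
        os.makedirs(part_dir, exist_ok=True)
        # The partition value lives in the directory name, not in the file.
        fields = [name for name in group[0] if name != partition_col]
        with open(os.path.join(part_dir, "part-0.csv"), "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)


def count_where(base_dir, partition_col, value):
    """Count rows for one partition value by opening only its directory."""
    part_dir = os.path.join(base_dir, "{}={}".format(partition_col, value))
    if not os.path.isdir(part_dir):
        return 0
    total = 0
    for name in os.listdir(part_dir):
        with open(os.path.join(part_dir, name), newline="") as fh:
            total += sum(1 for _ in csv.DictReader(fh))
    return total


base_dir = tempfile.mkdtemp()
write_partitioned(rows, base_dir, "Country")
partition_dirs = sorted(os.listdir(base_dir))
sweden_count = count_where(base_dir, "Country", "Sweden")
```

Here `count_where` never opens the `Country=France` directory. Skipping whole directories like this is what makes a low-cardinality column that appears in predicates a reasonable partition key.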

In [10]:
inputSchema = "InvoiceNo STRING, StockCode STRING, Description STRING, Quantity INT, InvoiceDate STRING, UnitPrice DOUBLE, CustomerID INT, Country STRING"

rawDF = (spark.read 
  .option("header", "true")
  .schema(inputSchema)
  .csv(inputPath) 
)

###  WRITE to Parquet and Databricks Delta

Use `overwrite` mode so that re-running the cell replaces the existing data instead of failing because the path already exists.

In [12]:
# write using Parquet format
(rawDF.write
  .mode("overwrite")
  .format("parquet")
  .partitionBy("Country")
  .save(parquetDataPath) )

In [13]:
# write using Databricks Delta format
(rawDF.write
  .mode("overwrite")
  .format("delta")
  .partitionBy("Country")
  .save(deltaDataPath) )

### CREATE Statement Using Non-Databricks Delta Pipeline

Create a table called `customer_data` using `parquet` out of the above data.

In [15]:
spark.sql("""
    CREATE TABLE IF NOT EXISTS customer_data 
    USING parquet 
    OPTIONS (path = '{}')
  """.format(parquetDataPath))

Perform a simple `count` query to verify the number of records.

### Why 0 records? 

The cause is <b>schema on read</b>: a schema is applied to the data as it is read from its storage location, rather than as it is written to that location.

In the traditional data lake architecture (including our pre-Databricks Delta pipeline), 
 * The data backing the table **`customer_data`** is located in **`parquetDataPath`** (which you can see below).
 * The metadata backing the table **`customer_data`** (the schema, partitioning info and other table properties) is stored elsewhere.
  - This is called the **metastore**.

Suppose we add more data to **`parquetDataPath`**:
 * Then, we need to run a separate step for the metastore to become aware of this.
 * We use the **`MSCK REPAIR TABLE`** command. 
 * **`MSCK`** stands for "**M**eta**S**tore **C**hec**K**", modeled after Unix **`FSCK`** (**F**ile **S**ystem **C**hec**K**)

Schema on read is explained in more detail <a href="https://stackoverflow.com/a/11764519/53495#" target="_blank">in this Stack Overflow answer</a>.
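
The repair step described above could be sketched in PySpark as follows. This is a hedged illustration rather than the course's own cell: the helper name `repair_and_count` is hypothetical, and `spark` is assumed to be the active `SparkSession` that Databricks notebooks provide.

```python
def repair_and_count(spark, table_name):
    """Sync the metastore with the partitions on disk, then count the rows.

    `spark` is assumed to be an active SparkSession.
    """
    # Register partition directories that exist on disk but are not yet
    # known to the metastore.
    spark.sql("MSCK REPAIR TABLE {}".format(table_name))
    # Count again now that the metastore knows about the partitions.
    return spark.sql("SELECT count(*) FROM {}".format(table_name)).first()[0]
```

In this notebook the call would look like `repair_and_count(spark, "customer_data")`.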

In [18]:
print(parquetDataPath)

After using `MSCK REPAIR TABLE`, the count is correct.

### CREATE Statement Using Databricks Delta Pipeline

Create a table called `<database-name>.customer_data_delta` using `DELTA` from the data at `<path-to-data>`, which here is `deltaDataPath`.

The notation is:
> `CREATE TABLE IF NOT EXISTS <database-name>.customer_data_delta` <br>
  `USING DELTA` <br>
  `LOCATION <path-to-data> ` <br>
  
Then, perform SQL queries on the table you just created.
> `SELECT count(*) FROM <database-name>.customer_data_delta`

Notice how you do not have to specify a schema or partition info here:
* Databricks Delta stores schema and partition info in the `_delta_log` directory.
* Spark reads the schema from that log for the data sitting in `<path-to-data>`, so you do not have to declare it.

In [21]:
spark.sql("""
  CREATE TABLE IF NOT EXISTS customer_data_delta 
  USING DELTA 
  LOCATION '{}' 
""".format(deltaDataPath))

Perform a simple `count` query to verify the number of records.

Notice that the count is correct right away; there is no need to worry about table repairs.

## A New Notation

There is also a more compact notation, one where you do not have to create a table explicitly.

Simply specify `delta.` along with the path to your Databricks Delta directory (in backticks!) directly in the SQL query.
* The dot in ```delta.`<path>` ``` means "Spark, recognize `<path>` as a Databricks Delta directory"

> ```SELECT count(*) FROM delta.`<path-to-Delta-data>` ```

We will use this notation extensively throughout the rest of the course.

In your own work, you may choose either notation:
* Sometimes, SQL queries are more readable than DataFrame queries.

Make sure you use BACKTICKS in the statement ``` delta.`<path-to-Delta-data>` ``` .
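
For comparison, here is a minimal sketch of the same count written both as a path-based SQL query and as a DataFrame query. The helper names `count_with_sql` and `count_with_dataframe` are hypothetical, and `spark` is assumed to be an active `SparkSession`.

```python
def count_with_sql(spark, delta_path):
    """Count rows using the path-based SQL notation (note the backticks)."""
    query = "SELECT count(*) FROM delta.`{}`".format(delta_path)
    return spark.sql(query).first()[0]


def count_with_dataframe(spark, delta_path):
    """Count rows by loading the same directory through the DataFrame reader."""
    return spark.read.format("delta").load(delta_path).count()
```

Both functions read the same Databricks Delta directory, so for a given path they should return the same number.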

In [24]:
sqlCmd = "SELECT count(*) FROM delta.`{}` ".format(deltaDataPath)
display(spark.sql(sqlCmd))

count(1)
65499


##  The Transaction Log (Metadata)
Databricks Delta stores the schema, partitioning info and other table properties in the same place as the data:
 * The schema and partition info are located in the `00000000000000000000.json` file under the `_delta_log` directory as shown below.
 * Subsequent `write` operations create additional `json` files.
 * In addition to the schema, the `json` file(s) contain information such as
   - Which files were added.
   - Which files were removed.
   - Transaction IDs.
 * Each Delta table should correspond to a unique `_delta_log` directory.
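
To make the contents of a commit file more concrete, the sketch below parses a hand-written sample in the general shape of a Delta commit: one JSON action per line, with the schema stored as a JSON string inside the `metaData` action. The sample values and the helper name `summarize_commit` are illustrative assumptions, not output from this notebook.

```python
import json

# Hand-written sample commit: one JSON action per line (illustrative values).
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"metaData": {
        "id": "example-table-id",
        "format": {"provider": "parquet", "options": {}},
        # The schema itself is stored as a JSON-encoded string.
        "schemaString": json.dumps({"type": "struct", "fields": [
            {"name": "InvoiceNo", "type": "string", "nullable": True, "metadata": {}},
            {"name": "Country", "type": "string", "nullable": True, "metadata": {}},
        ]}),
        "partitionColumns": ["Country"],
    }}),
    json.dumps({"add": {"path": "Country=Sweden/part-00000.snappy.parquet",
                        "partitionValues": {"Country": "Sweden"},
                        "size": 1024, "dataChange": True}}),
    json.dumps({"add": {"path": "Country=France/part-00001.snappy.parquet",
                        "partitionValues": {"Country": "France"},
                        "size": 2048, "dataChange": True}}),
])


def summarize_commit(text):
    """Collect the schema, partition columns and file actions from one commit."""
    summary = {"columns": [], "partition_columns": [], "added": [], "removed": []}
    for line in text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "metaData" in action:
            meta = action["metaData"]
            schema = json.loads(meta["schemaString"])
            summary["columns"] = [field["name"] for field in schema["fields"]]
            summary["partition_columns"] = meta["partitionColumns"]
        elif "add" in action:
            summary["added"].append(action["add"]["path"])
        elif "remove" in action:
            summary["removed"].append(action["remove"]["path"])
    return summary


summary = summarize_commit(sample_commit)
```

The text of a real commit file could be passed to `summarize_commit` in the same way, provided the whole file was read and not just a truncated prefix.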

In [26]:
dbutils.fs.head(deltaDataPath + "/_delta_log/00000000000000000000.json")

Metadata is displayed through `DESCRIBE DETAIL <tableName>`.

As long as we have some data in place already for a Databricks Delta table, we can infer schema.

In [28]:
sqlCmd = "DESCRIBE DETAIL delta.`{}` ".format(deltaDataPath)
display(spark.sql(sqlCmd))

format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion
delta,74d08d98-36cc-43b9-aeff-b4f5993c0a27,,,dbfs:/user/jose.manuel.bustos.munoz@everis.com/delta/delta_02_create_psp/customer-data-delta,2020-04-15T08:39:35.766+0000,2020-04-15T08:39:52.000+0000,List(Country),37,636918,Map(),1,2


## Converting Parquet Workloads to Databricks Delta

A Databricks Delta workload is defined by the presence of the `_delta_log` directory containing metadata files.

Given a generic Parquet-based data lake, converting to Databricks Delta is quite straightforward.

Suppose our Parquet-based data lake is found under `/data-pipeline`.

To convert it to Databricks Delta, simply do

> ```CONVERT TO DELTA parquet.`/data-pipeline` ``` <br>
  ```[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)] ```
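
From Python, the statement above could be assembled and run through `spark.sql`. The sketch below only builds the statement string. The helper name `convert_to_delta_sql` is hypothetical, and partitioning `/data-pipeline` by `Country` is an assumption made for the example.

```python
def convert_to_delta_sql(parquet_path, partition_cols=None):
    """Build a CONVERT TO DELTA statement.

    partition_cols is a list of (name, type) pairs; omit it for data
    that is not partitioned.
    """
    statement = "CONVERT TO DELTA parquet.`{}`".format(parquet_path)
    if partition_cols:
        cols = ", ".join("{} {}".format(name, col_type)
                         for name, col_type in partition_cols)
        statement += " PARTITIONED BY ({})".format(cols)
    return statement


# Assumed example: the data lake under /data-pipeline is partitioned by Country.
stmt = convert_to_delta_sql("/data-pipeline", [("Country", "STRING")])
```

With an active session, `spark.sql(stmt)` would then perform the conversion in place.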


# LAB

## Step 1

Read the data at `outdoorSmallPath` into the DataFrame `inventoryDF`, using `inputSchema`.

Use appropriate options, given that this is a CSV file.

In [32]:
# TODO
outdoorSmallPath = "/mnt/training/online_retail/outdoor-products/outdoor-products-small.csv"
inputSchema = "InvoiceNo STRING, StockCode STRING, Description STRING, Quantity INT, InvoiceDate STRING, UnitPrice DOUBLE, CustomerID INT, Country STRING"

inventoryDF = (spark
  .read        
  .option("header", "true")
  .schema(inputSchema)  
  .csv(outdoorSmallPath)   
)

In [33]:
# TEST - Run this cell to test your solution.
inventoryCount = inventoryDF.count()

dbTest("Delta-02-schemas", 99999, inventoryCount)

print("Tests passed!")

## Step 2

Write data to a Databricks path `inventoryDataPath = workingDir + "/inventory-data/"` 
* Make sure to set the `format` to `delta`
* Use overwrite mode 
* Partition by `Country`

In [35]:
# TODO
inventoryDataPath = workingDir + "/inventory-data/"

(inventoryDF
  .write
  .mode("overwrite")
  .format("delta")
  .partitionBy("Country")
  .save(inventoryDataPath)
)

In [36]:
# TEST - Run this cell to test your solution.
try:
  tableNotEmpty = spark.sql("SELECT count(*) FROM delta.`{}` ".format(inventoryDataPath)).first()[0] > 0
except Exception:
  tableNotEmpty = False
  
dbTest("Delta-02-inventoryTableExists", True, tableNotEmpty)  

print("Tests passed!")

## Step 3

Count the number of records found under `inventoryDataPath` where the `Country` is `Sweden`.

In [38]:
count = spark.sql("SELECT count(*) as total FROM delta.`{}` WHERE Country='Sweden'".format(inventoryDataPath)).first()[0]

In [39]:
# TEST - Run this cell to test your solution.
dbTest("Delta-L2-inventoryDataDelta-count", 2925, count)
print("Tests passed!")

In [41]:
%run "./Includes/Classroom-Cleanup"