# Delta Architecture: Beyond Lambda Architecture

<img src="https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-logo-whitebackground.png" width=200/>

This companion notebook walks through a Delta Lake example using the Lending Club data.
* This notebook has been tested with *DBR 5.4 ML Beta, Python 3*

## The Data

The data used is public data from Lending Club. It includes all funded loans from 2012 to 2017. Each loan includes information provided by the applicant as well as the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. For a full description of the fields, see the data dictionary available [here](https://resources.lendingclub.com/LCDataDictionary.xlsx).


![Loan_Data](https://preview.ibb.co/d3tQ4R/Screen_Shot_2018_02_02_at_11_21_51_PM.png)

https://www.kaggle.com/wendykan/lending-club-loan-data

## ![Delta Lake Tiny Logo](https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-tiny-logo.png) Delta Lake

An optimization layer atop blob storage that provides reliability (i.e. ACID compliance) and low latency for streaming and batch data pipelines.

## Import Data and create pre-Delta Lake Table
* This will create a lot of small Parquet files emulating the typical small file problem that occurs with streaming or highly transactional data

In [5]:
# -----------------------------------------------
# Uncomment and run if this folder does not exist
# -----------------------------------------------
# Configure location of loanstats_2012_2017.parquet
lspq_path = "/databricks-datasets/samples/lending_club/parquet/"

# Read loanstats_2012_2017.parquet
data = spark.read.parquet(lspq_path)

# Reduce the amount of data (to run on DBCE)
(loan_stats, loan_stats_rest) = data.randomSplit([0.01, 0.99], seed=123)

# Select only the columns needed
loan_stats = loan_stats.select("addr_state", "loan_status")

# Create loan by state
loan_by_state = loan_stats.groupBy("addr_state").count()

# Create table
loan_by_state.createOrReplaceTempView("loan_by_state")

# Display loans by state
display(loan_by_state)

addr_state,count
AZ,336
SC,172
LA,171
MN,261
NJ,540
DC,40
OR,182
VA,424
RI,65
WY,31


## ![Delta Lake Tiny Logo](https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-tiny-logo.png) Easily Convert Parquet to Delta Lake format
With Delta Lake, you can easily transform your Parquet data into Delta Lake format.

In [7]:
# Configure Delta Lake Silver Path
DELTALAKE_SILVER_PATH = "/ml/loan_by_state_delta"

# Remove folder if it exists
dbutils.fs.rm(DELTALAKE_SILVER_PATH, recurse=True)

In [8]:
%sql 
-- Note: this example creates a new table (CTAS) instead of converting in place; adjust this code if an in-place import is needed
DROP TABLE IF EXISTS loan_by_state_delta;

CREATE TABLE loan_by_state_delta
USING delta
LOCATION '/ml/loan_by_state_delta'
AS SELECT * FROM loan_by_state;

-- View Delta Lake table
SELECT * FROM loan_by_state_delta

addr_state,count
OK,136
DC,40
WI,200
ID,15
MD,369
GA,503
AR,104
NC,396
AK,32
UT,77
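
The cell above creates a new Delta Lake table with `CREATE TABLE ... AS SELECT` rather than converting existing Parquet files in place. For an in-place conversion, Delta Lake provides a `CONVERT TO DELTA` SQL command. The following is a minimal sketch only: the helper name `convert_to_delta_sql` and the path `/tmp/example/loan_parquet` are hypothetical and not part of this notebook, and the statement needs a writable, unpartitioned Parquet directory to succeed.

```python
def convert_to_delta_sql(parquet_path: str) -> str:
    """Build a CONVERT TO DELTA statement for an unpartitioned Parquet directory."""
    return f"CONVERT TO DELTA parquet.`{parquet_path}`"

# Hypothetical writable location holding plain Parquet files (an assumption)
statement = convert_to_delta_sql("/tmp/example/loan_parquet")
print(statement)

# On a cluster where Delta Lake is available, the statement could be run with:
# spark.sql(statement)
```

Building the statement separately from running it keeps the sketch usable outside a cluster; a partitioned directory would additionally need a `PARTITIONED BY` clause.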


In [9]:
%sql 
DESCRIBE DETAIL delta.`/ml/loan_by_state_delta`

format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion
delta,53667fe3-935b-4453-b76b-c2404fbba01f,,,dbfs:/ml/loan_by_state_delta,2020-03-05T17:15:28.921+0000,2020-03-05T17:15:38.000+0000,List(),46,30507,Map(),1,2
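
The `numFiles` and `sizeInBytes` columns above give a quick view of the small file problem. As a rough back-of-the-envelope check in plain Python, using only the two values from that output:

```python
# Values copied from the DESCRIBE DETAIL output above
num_files = 46
size_in_bytes = 30507

# Average size of a single data file in the table, in bytes
avg_file_size = size_in_bytes / num_files
print(f"{avg_file_size:.1f} bytes per file")
```

An average well under 1 KB per file means most of the read cost goes to per-file overhead rather than to the data itself.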


## Stop the notebook before the streaming cell, in case of a "run all"

In [11]:
dbutils.notebook.exit("stop") 

stop

In [12]:
%sh 
ls -lt /dbfs/ml/loan_by_state_delta/

In [13]:
%sh 
ls -lt /dbfs/ml/loan_by_state_delta/_delta_log/

## ![Delta Lake Tiny Logo](https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-tiny-logo.png) Unified Batch and Streaming Source and Sink

These cells showcase streaming and batch concurrent queries (inserts and reads)
* This notebook will run an `INSERT` every 10s against our `loan_by_state_delta` table
* We will run two streaming queries concurrently against this data
* Note, you can also use `writeStream` but this version is easier to run in DBCE

In [15]:
# Read the insertion of data
spark.sql("set spark.sql.shuffle.partitions = 1")
spark.sql("set spark.databricks.delta.snapshotPartitions = 1")

loan_by_state_readStream = spark.readStream.format("delta").load(DELTALAKE_SILVER_PATH)
loan_by_state_readStream.createOrReplaceTempView("loan_by_state_readStream")

In [16]:
%sql
select addr_state, sum(`count`) as loans from loan_by_state_readStream group by addr_state

addr_state,loans
MI,351
IL,580
WY,31
AR,104
OR,182
SC,172
AZ,336
IA,3285
MN,261
SD,30


**Wait** until the stream is up and running before executing the code below

In [18]:
import random
import os
from pyspark.sql.functions import *
from pyspark.sql.types import *


def random_checkpoint_dir(): 
  return "/tmp/loan_by_state_delta/chkpt/%s" % str(random.randint(0, 10000))

states = ["IA", "WA"]

@udf(returnType=StringType())
def random_state():
  return str(random.choice(states))

# Function to start a streaming query that generates random rows and appends them to the table at table_path (written in table_format)
def generate_and_append_data_stream(table_format, table_path):

  stream_data = (spark.readStream.format("rate").option("rowsPerSecond", 1).load() 
    .withColumn("addr_state", random_state())
    #.withColumn("count", lit(45))
    .withColumn("count", lit(45).cast("long"))                 
    .select("addr_state", "count")
  )

  query = (stream_data.writeStream 
    .format(table_format) 
    .option("checkpointLocation", random_checkpoint_dir()) 
    .trigger(processingTime = "10 seconds") 
    .start(table_path))

  return query

# Function to stop all streaming queries 
def stop_all_streams():
  # Stop all the streams
  print("Stopping all streams")
  for s in spark.streams.active:
    s.stop()
  print("Stopped all streams")
  print("Deleting checkpoints")  
  dbutils.fs.rm("/tmp/loan_by_state_delta/chkpt/", True)
  print("Deleted checkpoints")
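
Each streaming query needs its own checkpoint location, and `random.randint(0, 10000)` can in principle return the same value for two queries. One possible alternative, shown here only as a sketch, derives the directory from a UUID instead; `unique_checkpoint_dir` and `CHECKPOINT_ROOT` are illustrative names that do not appear in the notebook.

```python
import uuid

# Same root directory that stop_all_streams() cleans up
CHECKPOINT_ROOT = "/tmp/loan_by_state_delta/chkpt"

def unique_checkpoint_dir(root: str = CHECKPOINT_ROOT) -> str:
    # uuid4 makes a collision between two queries practically impossible
    return f"{root}/{uuid.uuid4().hex}"

first = unique_checkpoint_dir()
second = unique_checkpoint_dir()
print(first != second)
```

Because the directories still live under the same root, the cleanup in `stop_all_streams()` would remove them unchanged.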

In [19]:
stream_query_2 = generate_and_append_data_stream(table_format = "delta", table_path = DELTALAKE_SILVER_PATH)

In [20]:
stream_query_3 = generate_and_append_data_stream(table_format = "delta", table_path = DELTALAKE_SILVER_PATH)

In [21]:
%sql
select addr_state, sum(`count`) from loan_by_state_delta group by addr_state

addr_state,sum(count)
WA,4476
IA,3510
MN,261
NJ,540
DC,40
OR,182
VA,424
RI,65
NH,71
MI,351


In [22]:
stop_all_streams()

**Note**: Once the previous cell has finished and the state of Iowa is fully populated in the map rendered by the streaming query on `loan_by_state_readStream` above, click *Cancel* in that cell to stop the `readStream`.

In [24]:
%sh 
ls -lt /dbfs/ml/loan_by_state_delta/_delta_log/

In [25]:
%sh 
head /dbfs/ml/loan_by_state_delta/_delta_log/00000000000000000014.json

In [26]:
%sh 
head /dbfs/ml/loan_by_state_delta/_delta_log/00000000000000000013.json
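
Each commit file in `_delta_log` is named after its table version, zero-padded to 20 digits, and holds one JSON action per line (for example `commitInfo`, `add`, `remove`, `metaData`, or `protocol`). The sketch below shows one way to summarize a commit in plain Python. The helper names are illustrative, and the sample lines are made up for demonstration; they are not taken from this table's log.

```python
import json
from collections import Counter

def commit_file_name(version: int) -> str:
    """Return the log file name for a given table version."""
    return f"{version:020d}.json"

def summarize_commit(lines):
    """Count the action types found in the lines of one commit file."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        # The top-level key of each line names the action type
        counts.update(action.keys())
    return dict(counts)

# Made-up sample content (an assumption, not real output from this notebook)
sample = [
    '{"commitInfo": {"operation": "STREAMING UPDATE"}}',
    '{"add": {"path": "part-00000.snappy.parquet", "size": 700, "dataChange": true}}',
    '{"add": {"path": "part-00001.snappy.parquet", "size": 710, "dataChange": true}}',
]

print(commit_file_name(14))
print(summarize_commit(sample))
```

To apply this to a real commit, the lines could be read from one of the files listed above, for example with `open("/dbfs/ml/loan_by_state_delta/_delta_log/" + commit_file_name(14))`.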