# Chapter 9: Building Reliable Data Lakes with Apache Spark
Christoph Windheuser    
May, 2022   
Python examples of chapter 9 (page 265 ff) in the book *Learning Spark*

To use Delta Lake functionality in PySpark, add the following line to the `spark-defaults.conf` file in the `$SPARK_HOME/conf` directory:

```spark.jars.packages io.delta:delta-core_2.12:1.1.0```

To add more JAR packages, separate them with commas **without spaces**.
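The comma rule can be illustrated with a short Python sketch that assembles the configuration line from a list of Maven coordinates. The second coordinate (`spark-avro`) is only an assumed example of an additional package; it is not needed for this chapter.

```python
# Maven coordinates to load at startup. Only the first one is needed for this
# chapter; the second is a hypothetical extra package used for illustration.
packages = [
    "io.delta:delta-core_2.12:1.1.0",
    "org.apache.spark:spark-avro_2.12:3.2.1",
]

# Join with a bare comma: no space may follow the separator.
config_line = "spark.jars.packages " + ",".join(packages)
print(config_line)
```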

If `$SPARK_HOME/conf` contains no `spark-defaults.conf` but does contain `spark-defaults.conf.template`, copy the template to `spark-defaults.conf` and make the changes in the new file.
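The copy-and-edit step can also be scripted. The following is a minimal sketch, not an official Spark tool: the helper names `ensure_spark_defaults` and `add_config_line` are hypothetical, and starting from an empty file when no template exists is an assumption.

```python
import shutil
from pathlib import Path


def ensure_spark_defaults(conf_dir):
    """Create spark-defaults.conf from the template if it does not exist yet.

    Returns the path to spark-defaults.conf. An existing file is left untouched.
    """
    conf_dir = Path(conf_dir)
    target = conf_dir / "spark-defaults.conf"
    template = conf_dir / "spark-defaults.conf.template"
    if not target.exists():
        if template.exists():
            shutil.copy(template, target)
        else:
            # Assumption: start from an empty file when no template exists.
            target.touch()
    return target


def add_config_line(conf_file, line):
    """Append a configuration line unless the file already contains it."""
    conf_file = Path(conf_file)
    text = conf_file.read_text()
    if line in text.splitlines():
        return
    # Make sure the new entry starts on its own line.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with conf_file.open("a") as f:
        f.write(prefix + line + "\n")
```

Assuming `SPARK_HOME` is set, a call such as `add_config_line(ensure_spark_defaults(Path(os.environ["SPARK_HOME"]) / "conf"), "spark.jars.packages io.delta:delta-core_2.12:1.1.0")` would add the line once and leave the file unchanged on later runs.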


In [1]:
# Import required Python Spark libraries
import pyspark
from pyspark.sql import SparkSession  # imported explicitly; the cell below uses it
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from delta import *
from delta.tables import *


## Run Delta Lake from Jupyter Notebooks:
See Stack Overflow: https://stackoverflow.com/questions/57740693/how-to-refer-deltalake-tables-in-jupyter-notebook-using-pyspark

In [2]:
# create a SparkSession
# This requires access to the internet. If executed offline, an error is thrown

builder = (SparkSession
           .builder
           .appName("Chapter_9")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

spark = configure_spark_with_delta_pip(builder).getOrCreate()


In [3]:
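# Location of the Parquet source data
sourcePath = "../DB_Spark/LearningSparkV2/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet"

# Target location of the Delta table
deltaPath  = "/tmp/spark/loans_delta"

# Read the Parquet file and write it back out in Delta format
(spark.read.format("parquet").load(sourcePath)
    .write.format("delta").save(deltaPath))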
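Besides the Parquet data files, the write creates a `_delta_log` directory containing JSON commit files, with one action per line. The following sketch counts the actions in each commit using only the standard library. The helper name `summarize_delta_log` is hypothetical, and the sketch assumes the table lives on the local file system and ignores any Parquet checkpoint files.

```python
import json
from pathlib import Path


def summarize_delta_log(table_path):
    """Count the actions recorded in each JSON commit file of a Delta table.

    Returns a dict mapping each commit file name to a dict of action counts.
    """
    log_dir = Path(table_path) / "_delta_log"
    summary = {}
    for commit_file in sorted(log_dir.glob("*.json")):
        counts = {}
        for line in commit_file.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            # Each line holds one action, keyed by its type (e.g. "add").
            for action_type in action:
                counts[action_type] = counts.get(action_type, 0) + 1
        summary[commit_file.name] = counts
    return summary
```

Calling `summarize_delta_log(deltaPath)` after the cell above should list the first commit with its `add` actions, one per data file written.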

In [4]:
# Register the Delta table as a temporary view so it can be queried with SQL
spark.read.format("delta").load(deltaPath).createOrReplaceTempView("loans_delta")


In [5]:
spark.sql("SELECT count(*) FROM loans_delta").show()

+--------+
|count(1)|
+--------+
|   14705|
+--------+



In [6]:
# Delete the delta tables on the file system:
!rm /tmp/spark/loans_delta/*.parquet
!rm -rf /tmp/spark/loans_delta/_delta_log
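On systems without a Unix shell, the same cleanup can be done from Python. This is a minimal sketch with a hypothetical helper name, `remove_delta_table`. Unlike the two `rm` commands above, it removes the whole table directory, including any hidden checksum files Spark may have written next to the data files.

```python
import shutil
from pathlib import Path


def remove_delta_table(table_path):
    """Delete a Delta table directory, including data files and _delta_log.

    Returns True if something was removed, False if the path did not exist.
    """
    path = Path(table_path)
    if not path.exists():
        return False
    shutil.rmtree(path)
    return True
```

For the table in this notebook, the call would be `remove_delta_table("/tmp/spark/loans_delta")`.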
