# Chapter 4
Christoph Windheuser    
May, 2022   
Python examples for Chapter 4 of the book *Learning Spark*


In [1]:
# Import required python spark libraries
import findspark
import pyspark

from pyspark.conf import SparkConf
from pyspark.context import SparkContext

from pyspark.sql.types import *
from pyspark.sql.functions import col, expr, when, concat, lit, avg, desc
from pyspark.sql import SparkSession
from pyspark.sql import Row


In [2]:
# Connect Jupyter Notebook with the Spark application and get the Spark Context.
# getOrCreate() reuses a context that is already running (e.g. in a PySpark shell)
# instead of raising "ValueError: Cannot run multiple SparkContexts at once".
findspark.init()
sc = SparkContext.getOrCreate(SparkConf().setAppName("chapter_4"))

In [3]:
# Create a SparkSession

spark = (SparkSession
         .builder
         .enableHiveSupport()
         .config("spark.sql.catalogImplementation", "hive")
         .appName("Chapter_4_Examples")
         .getOrCreate())


In [None]:
csv_file = "data/departuredelays.csv"

df = (spark.read.format("csv")
      .option("inferSchema", "true")
      .option("header", "true")
      .load(csv_file))


In [None]:
df.show(5)

# Create a View
Create the View from a DataFrame:

In [None]:
df.createOrReplaceTempView("us_delay_flights_view")

Views can also be created with SQL from other tables or views.   
In the following example, we create a view that contains only the flights departing from San Francisco (SFO):

In [None]:
spark.sql("""
            CREATE OR REPLACE TEMP VIEW us_delay_flights_SFO_view
            AS SELECT date, delay, distance, origin, destination from us_delay_flights_view
            WHERE origin = 'SFO'
            """)

In [None]:
spark.sql("SELECT * FROM us_delay_flights_SFO_view").show(10)
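
The same pattern generalizes to any origin airport. The helper below is a minimal sketch and not part of the book: the names `origin_view_sql` and `create_origin_view` are hypothetical, and it assumes that `us_delay_flights_view` already exists in the session passed in as `spark`. Because the airport code is pasted into the SQL text, it is validated first.

```python
import re


def origin_view_sql(origin: str) -> str:
    """Build the CREATE VIEW statement for one origin airport (hypothetical helper)."""
    # Accept only three-letter uppercase codes, since the value ends up in SQL text
    if not re.fullmatch(r"[A-Z]{3}", origin):
        raise ValueError(f"not a three-letter airport code: {origin!r}")
    return (
        f"CREATE OR REPLACE TEMP VIEW us_delay_flights_{origin}_view AS "
        "SELECT date, delay, distance, origin, destination "
        "FROM us_delay_flights_view "
        f"WHERE origin = '{origin}'"
    )


def create_origin_view(spark, origin: str) -> str:
    """Create the view in the given SparkSession and return the view name."""
    spark.sql(origin_view_sql(origin))
    return f"us_delay_flights_{origin}_view"
```

For example, `create_origin_view(spark, "ORD")` would create `us_delay_flights_ORD_view`, which can then be queried like the SFO view.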

## Using the view (like a table) in SQL


Show flights from SFO with a distance of more than 1000 miles, ordered by descending distance. Show the first 10 results:

In [None]:
spark.sql("""SELECT distance, origin, destination
          FROM us_delay_flights_SFO_view WHERE distance > 1000
          ORDER BY distance DESC""").show(10)

Instead of `spark.sql`, the same query can be executed with the DataFrame API and shows the same result:

In [None]:
# The SQL query above reads from the SFO view, so filter on the origin here as well
(df.select("distance", "origin", "destination")
   .where("origin = 'SFO'")
   .where("distance > 1000")
   .orderBy("distance", ascending=False).show(10))

Find all flights from San Francisco (SFO) to Chicago (ORD) with a delay of more than two hours:

In [None]:
spark.sql("""SELECT date, delay, origin, destination
          FROM us_delay_flights_view
          WHERE delay > 120 AND ORIGIN = 'SFO' AND DESTINATION = 'ORD'
          ORDER BY delay DESC""").show(10)

Label the flights by the delay they experienced. Add a new human-readable column called `Flight_Delays` that contains the labels to the query result:

In [None]:
# Upper bounds are inclusive so that delays of exactly 60, 120 or 360 minutes
# get a delay label instead of falling through to 'Early'
spark.sql("""SELECT delay, origin, destination,
          CASE
              WHEN delay > 360 THEN 'Very Long Delays'
              WHEN delay > 120 AND delay <= 360 THEN 'Long Delay'
              WHEN delay > 60 AND delay <= 120 THEN 'Short Delay'
              WHEN delay > 0 AND delay <= 60 THEN 'Tolerable Delay'
              WHEN delay = 0 THEN 'No Delay'
              ELSE 'Early'
          END AS Flight_Delays
          FROM us_delay_flights_view
          ORDER BY origin, delay DESC""").show(10)
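
Bucket boundaries are easy to get wrong in such a ladder: a delay of exactly 60, 120 or 360 minutes should land in one of the delay buckets, not in the `ELSE` branch. The plain-Python mirror below is a sketch for checking the boundaries outside Spark (the function name `label_delay` is hypothetical). Because the conditions are tested top-down, each branch needs only a lower bound.

```python
def label_delay(delay: int) -> str:
    """Map a delay in minutes to a label; checks run top-down like a SQL CASE."""
    if delay > 360:
        return "Very Long Delays"
    if delay > 120:   # 121..360
        return "Long Delay"
    if delay > 60:    # 61..120
        return "Short Delay"
    if delay > 0:     # 1..60
        return "Tolerable Delay"
    if delay == 0:
        return "No Delay"
    return "Early"    # negative delay: the flight left ahead of schedule


# Print the label for values on and around the bucket boundaries
for minutes in (-5, 0, 60, 120, 360, 361):
    print(minutes, label_delay(minutes))
```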

# Creating SQL Tables
(Chapter 4, page 89 ff)

## Create a Database
Create a database called `learn_spark_db`:

In [None]:
spark.sql("CREATE DATABASE learn_spark_db")

In [None]:
spark.sql("USE learn_spark_db")
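
Running `CREATE DATABASE` a second time fails because the database already exists, which is easy to hit when re-running a notebook. A sketch of an idempotent variant using `IF NOT EXISTS` follows; the helper name `use_database` is hypothetical, and `spark` is assumed to be the active SparkSession.

```python
def use_database(spark, name: str) -> None:
    """Create the database if it is missing, then make it the current one."""
    # Restrict the name to a plain identifier, since it is interpolated into SQL
    if not name.isidentifier():
        raise ValueError(f"invalid database name: {name!r}")
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {name}")
    spark.sql(f"USE {name}")
```

Calling `use_database(spark, "learn_spark_db")` has the same effect as the two cells above, but can be repeated safely.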

Spark creates a directory `learn_spark_db.db` in the location given by `spark.sql.warehouse.dir` to store the tables of the new database.     
The value of `spark.sql.warehouse.dir` can be retrieved with:

In [None]:
print(spark.conf.get("spark.sql.warehouse.dir"))
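
Assuming the default layout described above, the directory of a database can be derived from the warehouse setting. The helper `database_dir` is hypothetical and only does string handling: it does not check that the directory exists, and the warehouse path in the example call is a made-up value.

```python
def database_dir(warehouse_dir: str, db_name: str) -> str:
    """Return the default directory for the managed tables of a database."""
    # By default Spark normalizes database names to lower case
    return f"{warehouse_dir.rstrip('/')}/{db_name.lower()}.db"


# Example with an illustrative warehouse location
print(database_dir("file:/home/user/spark-warehouse", "learn_spark_db"))
# file:/home/user/spark-warehouse/learn_spark_db.db
```

In the notebook, the first argument would be `spark.conf.get("spark.sql.warehouse.dir")`.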

# Create a managed table

For a managed table, Spark manages both the metadata (in the Hive metastore) and the data (in the warehouse directory). When the table is dropped, both the data and the metadata are deleted.   
To create a managed table, no external path for the data is specified.    
Create a managed table with the SQL API:

In [None]:
spark.sql("""CREATE TABLE managed_us_delay_flights_tbl
             USING CSV
             AS SELECT * FROM us_delay_flights_view""")


In [None]:
spark.sql("SELECT * FROM managed_us_delay_flights_tbl").show(10)


Let's drop the table and create it again with the DataFrame API:

In [None]:
spark.sql("DROP TABLE managed_us_delay_flights_tbl")

In [None]:
csv_file = "data/departuredelays.csv"
schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"
# header=True keeps the header line of the CSV file from being read as a data row
flights_df = spark.read.csv(csv_file, schema=schema, header=True)


In [None]:
flights_df.show(3)

In [None]:
flights_df.write.saveAsTable("managed_us_delay_flights_tbl")

In [None]:
spark.sql("SELECT * FROM managed_us_delay_flights_tbl").show(3)

# Viewing the Metadata of databases and tables
The methods of `spark.catalog` show the metadata that Spark keeps about databases, tables, and columns:

In [None]:
spark.catalog.listDatabases()

In [None]:
spark.catalog.listTables()

In [None]:
spark.catalog.listColumns("managed_us_delay_flights_tbl")
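
Each entry returned by `listTables()` carries a `tableType` field, which is a quick way to tell managed tables (`MANAGED`), unmanaged tables (`EXTERNAL`), and temporary views (`TEMPORARY`) apart. A small sketch, with the hypothetical helper name `table_types` and `spark` assumed to be the active SparkSession:

```python
def table_types(spark, db_name=None) -> dict:
    """Map each table or view name to the type reported by the catalog."""
    if db_name:
        tables = spark.catalog.listTables(db_name)
    else:
        # Without an argument, the current database is listed
        tables = spark.catalog.listTables()
    return {t.name: t.tableType for t in tables}
```

`table_types(spark)` returns a plain dictionary, so a single table can be checked with `table_types(spark).get("managed_us_delay_flights_tbl")`.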

Now, let's drop the table again:

In [None]:
spark.sql("DROP TABLE managed_us_delay_flights_tbl")

Finally, let's drop the database `learn_spark_db` and all the tables in it:

In [None]:
spark.sql("DROP DATABASE learn_spark_db CASCADE")

# Creating an unmanaged table

Let's first create and use a database:

In [None]:
spark.sql("CREATE DATABASE learn_spark_db")

In [None]:
spark.sql("USE learn_spark_db")

Now let's create the table `us_delay_flights_tbl` with an SQL command.   
In contrast to creating a managed table, here we specify the source of the data for the table.    
Spark will manage the metadata, but not the data of the table.   
If the table is dropped, only the metadata is deleted, not the data file.

In [None]:
spark.sql("""CREATE TABLE us_delay_flights_tbl
             (date STRING, delay INT, distance INT, origin STRING, destination STRING)
             USING CSV OPTIONS (PATH '/Users/cwi/Dev/LearningSpark/data/departuredelays.csv')""")


In [None]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(3)

Let's drop the table again:

In [None]:
spark.sql("DROP TABLE us_delay_flights_tbl")
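
To verify on a local file system that dropping an unmanaged table leaves the data in place, check that the source file still exists after the drop. This is a sketch only: the helper name `drop_table_and_check_data` is hypothetical, and the check assumes the data path is a local path rather than an HDFS or cloud location.

```python
import os


def drop_table_and_check_data(spark, table: str, data_path: str) -> bool:
    """Drop a table, then report whether its data file is still on disk."""
    spark.sql(f"DROP TABLE IF EXISTS {table}")
    # For an unmanaged table this is expected to be True after the drop
    return os.path.exists(data_path)
```

The call would take the table name and the same path that was passed in `OPTIONS (PATH ...)` when the table was created.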

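Now create the same table with the DataFrame API.   
*(Without a `path` option, `saveAsTable` creates a* ***managed*** *table; specifying the `path` option is what makes the table unmanaged.)*

In [None]:
csv_file = "data/departuredelays.csv"
schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"
# header=True keeps the header line of the CSV file from being read as a data row
flights_df = spark.read.csv(csv_file, schema=schema, header=True)


In [None]:
flights_df.show(10)

In [None]:
# The path below is only an example location; adjust it to your environment
(flights_df.write
    .option("path", "/tmp/data/us_flights_delay")
    .saveAsTable("us_delay_flights_tbl"))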


In [None]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(3)

And dropping it again:

In [None]:
spark.sql("DROP TABLE us_delay_flights_tbl")

And at the end, let's drop the whole database:

In [None]:
spark.sql("DROP DATABASE learn_spark_db CASCADE")

# DataFrameReader
(page 94 ff)

The DataFrameReader is the generic interface for reading data in different formats into a DataFrame. The following examples use sample data from the repository https://github.com/databricks/LearningSparkV2. To run these examples, clone the GitHub repo with `git clone https://github.com/databricks/LearningSparkV2.git`
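
All readers in this section follow the same pattern: choose a format, set options, and load a path. The sketch below wraps that pattern in one function; the name `read_data` is hypothetical, and `spark` is assumed to be the active SparkSession.

```python
def read_data(spark, fmt: str, path: str, **options):
    """Load data with the generic DataFrameReader pattern: format, options, load."""
    return spark.read.format(fmt).options(**options).load(path)
```

For example, `read_data(spark, "csv", "data/departuredelays.csv", header="true", inferSchema="true")` corresponds to the CSV read at the beginning of this notebook.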


## Reading a parquet file

In [None]:
datafile = """../DB_Spark/LearningSparkV2/databricks-datasets/\
learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet"""

df = spark.read.format("parquet").load(datafile)

In [None]:
df.show(10)

## Reading a CSV file

In [None]:
datafile = """../DB_Spark/LearningSparkV2/databricks-datasets/\
learning-spark-v2/flights/summary-data/csv/*"""

df2 = (spark.read.format("csv")
       .option("inferSchema", "true")
       .option("header", "true")
       .option("mode", "PERMISSIVE")
       .load(datafile))


In [None]:
df2.show(10)

## Reading a JSON File

In [None]:
datafile = """../DB_Spark/LearningSparkV2/databricks-datasets/\
learning-spark-v2/flights/summary-data/json/*"""

df3 = (spark.read.format("json")
       .load(datafile))

In [None]:
df3.show(10)

## Reading and Writing Avro Files

The spark-avro module is external and not included in `spark-submit` or `spark-shell` by default (see https://spark.apache.org/docs/latest/sql-data-sources-avro.html).  
Therefore the spark-avro package has to be specified:
```
spark-submit --packages org.apache.spark:spark-avro_2.12:3.2.1 <python_file.py>
```
See the example in the file `chapter_04_avro.py`. Run the file in a terminal with:
```
spark-submit --packages org.apache.spark:spark-avro_2.12:3.2.1 chapter_04_avro.py
```
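
In a notebook, the same package can be requested through the `spark.jars.packages` configuration when the session is built. This is a sketch under assumptions: the helper names `avro_package` and `with_avro` are hypothetical, the versions have to match the installed Spark and Scala build, and the setting only takes effect if it is applied before the first SparkSession (and its JVM) is started.

```python
def avro_package(spark_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate of the spark-avro module."""
    return f"org.apache.spark:spark-avro_{scala_version}:{spark_version}"


def with_avro(builder, spark_version: str, scala_version: str = "2.12"):
    """Add spark-avro to a SparkSession builder via spark.jars.packages."""
    return builder.config("spark.jars.packages",
                          avro_package(spark_version, scala_version))


print(avro_package("3.2.1"))  # org.apache.spark:spark-avro_2.12:3.2.1
```

A session could then be created with `with_avro(SparkSession.builder.appName("Chapter_4_Examples"), "3.2.1").getOrCreate()`, after restarting the notebook kernel if a session is already running.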


### Trying to read an Avro file in the notebook

In [4]:
# Specify an existing Avro file in the repository:
datafile = """../DB_Spark/LearningSparkV2/databricks-datasets/\
learning-spark-v2/flights/summary-data/avro/*"""

# Read the Avro file. This only works if the spark-avro package
# is available to the Spark session; otherwise the load fails.
df_avro = spark.read.format("avro").load(datafile)


## Reading image files

In [None]:
from pyspark.ml import image

image_dir = """../DB_Spark/LearningSparkV2/databricks-datasets/\
learning-spark-v2/cctvVideos/train_images/"""

df_images = spark.read.format("image").load(image_dir)


In [None]:
df_images.printSchema()

In [None]:
df_images.select("image.height", "image.width", "image.nChannels", "image.mode", "label").show(10, truncate=False)

# DataFrameWriter
(page 96 ff)

## Writing a parquet file

In [None]:
(df.write.format("parquet")
    .mode("overwrite")
    .option("compression", "snappy")
    .save("data/temp/parquet"))


## Writing a JSON file

In [None]:
(df.write.format("json")
    .mode("overwrite")
    .save("data/temp/json"))


## Writing a CSV file

In [None]:
(df.write.format("csv")
    .mode("overwrite")
    .save("data/temp/csv"))
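
Each of the `save()` calls above writes a directory rather than a single file: Spark stores the data in one or more `part-*` files next to marker files such as `_SUCCESS`. The helper below is a sketch for inspecting such an output directory on a local file system; the name `list_part_files` is hypothetical.

```python
import os


def list_part_files(out_dir: str) -> list:
    """Return the data files Spark wrote into an output directory, sorted by name."""
    # Marker files (_SUCCESS) and hidden checksum files (.*.crc) are skipped
    return sorted(f for f in os.listdir(out_dir) if f.startswith("part-"))
```

For example, `list_part_files("data/temp/csv")` lists the CSV part files written by the cell above.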


## Writing an Avro File

See the example Python file `chapter_04_avro.py` in this repo.