# Chapter 3:
Christoph Windheuser    
April 10, 2022   
Python examples of chapter 3 in the book *Learning Spark*

In [1]:
# Import required python spark libraries
import findspark
import pyspark

from pyspark.sql.types import *
from pyspark.sql.functions import col, expr, when, concat, lit, avg
from pyspark.sql import SparkSession
from pyspark.sql import Row


In [2]:
# Connect Jupyter Notebook with the Spark application and create Spark Context
findspark.init()
sc = pyspark.SparkContext(appName="chapter_3")


In [3]:
#create a SparkSession
spark = (SparkSession
       .builder
       .appName("Example-3_6")
       .getOrCreate())


## Example page 45 ff: RDDs vs. DataFrames
We want to solve a simple data analytics task.    
We have the following data points of persons and their age:
* Brooke: 20
* Denny: 31
* Jules: 30
* TD: 35
* Brooke: 25

Be aware that there are two Brookes with different ages.   
The task is to group the data points by name and compute the average age for each name.

First we solve it with an RDD (Resilient Distributed Dataset).


In [6]:
# Create the RDD containing the data
dataRDD = sc.parallelize([("Brooke", 20), ("Denny", 31),
                          ("Jules", 30), ("TD", 35), ("Brooke", 25)])


In [7]:
# Calculate the average age per name
agesRDD = (dataRDD
          .map(lambda x: (x[0], (x[1],1)))
          .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
          .map(lambda x: (x[0], x[1][0] / x[1][1])))


In [8]:
# Show results
print (agesRDD.take(4))

[('Brooke', 22.5), ('TD', 35.0), ('Jules', 30.0), ('Denny', 31.0)]
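To make the three RDD steps above easier to follow, here is a minimal plain-Python sketch of what `map`, `reduceByKey`, and the final `map` compute. It runs without Spark and is only an illustration: the variable names (`pairs`, `grouped`, `sums`, `averages`) are made up for this sketch, and the explicit grouping step stands in for the shuffle that Spark performs across partitions.

```python
from collections import defaultdict
from functools import reduce

data = [("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)]

# Step 1 (map): pair every age with a count of 1 -> (name, (age, 1))
pairs = [(name, (age, 1)) for name, age in data]

# Step 2 (reduceByKey): collect the values per name, then add up ages and counts
grouped = defaultdict(list)
for name, value in pairs:
    grouped[name].append(value)

sums = {
    name: reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), values)
    for name, values in grouped.items()
}

# Step 3 (map): divide the summed ages by the count to get the average
averages = {name: total / count for name, (total, count) in sums.items()}

print(sums["Brooke"])  # (45, 2)
print(averages)        # {'Brooke': 22.5, 'Denny': 31.0, 'Jules': 30.0, 'TD': 35.0}
```

Carrying the `(sum, count)` pair through the reduction is what lets the average be computed in a single pass: averages themselves cannot be merged pairwise, but sums and counts can.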


Now we solve the same task with Spark's high-level DSL (domain-specific language) operators, called from Python in our case. We use Spark's DataFrame API to tell Spark *what to do* instead of *how to do it*, as in the previous code with RDDs.

In [9]:
# Create a DataFrame
data_df = spark.createDataFrame([("Brooke", 20), ("Denny", 31),
                          ("Jules", 30), ("TD", 35), ("Brooke", 25)], ["name", "age"])


In [10]:
# Group by names, aggregate their ages and average the age
avg_df = data_df.groupBy("name").agg(avg("age"))

In [11]:
# Show the results
avg_df.show()

+------+--------+
|  name|avg(age)|
+------+--------+
|Brooke|    22.5|
| Denny|    31.0|
| Jules|    30.0|
|    TD|    35.0|
+------+--------+



## Example page 51 ff: Define a schema
There are two possibilities to define a schema in Spark:
1. define it programmatically
2. define it with a *Data Definition Language (DDL)* string

In [12]:
# 1. defining a schema programmatically
schema = StructType([StructField("author", StringType(), False),
                     StructField("title", StringType(), False),
                     StructField("pages", IntegerType(), False)])


In [13]:
print (schema)

StructType(List(StructField(author,StringType,false),StructField(title,StringType,false),StructField(pages,IntegerType,false)))


In [14]:
# 2. defining the schema with a DDL:
schema = "author STRING, title STRING, pages INT"

In [15]:
print (schema)

author STRING, title STRING, pages INT
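A DDL string carries the same information as the list of `StructField` objects: a name and a type per field. The plain-Python sketch below makes that correspondence visible. The helper `parse_simple_ddl` is hypothetical and written only for this illustration; it is not Spark's parser and handles only flat `name TYPE` pairs, not nested types such as `ARRAY<STRING>`. The `nullable` value of `True` reflects that fields declared this way are nullable.

```python
def parse_simple_ddl(ddl):
    """Split a flat DDL string such as "name STRING, age INT" into field descriptions."""
    fields = []
    for part in ddl.split(","):
        # Each part looks like "author STRING": a field name followed by a type name
        name, type_name = part.split()
        fields.append({"name": name, "type": type_name.upper(), "nullable": True})
    return fields


ddl_schema = "author STRING, title STRING, pages INT"
fields = parse_simple_ddl(ddl_schema)

for field in fields:
    print(field)
```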


### Define a more complex DataFrame with a schema

In [16]:
# Define the schema programmatically or with a DDL:

schema2_prg = (StructType([
   StructField("Id", IntegerType(), False),
   StructField("First", StringType(), False),
   StructField("Last", StringType(), False),
   StructField("Url", StringType(), False),
   StructField("Published", StringType(), False),
   StructField("Hits", IntegerType(), False),
   StructField("Campaigns", ArrayType(StringType()), False)]))

# Define the same schema as a DDL string. With this DDL string, every field is nullable (nullable = True)
schema2_ddl = "`Id` INT,`First` STRING,`Last` STRING,`Url` STRING,`Published` STRING,`Hits` INT,`Campaigns` ARRAY<STRING>"


In [17]:
# create the data
data = [[1, "Jules", "Damji", "https://tinyurl.1", "1/4/2016", 4535, ["twitter", "LinkedIn"]],
       [2, "Brooke","Wenig","https://tinyurl.2", "5/5/2018", 8908, ["twitter", "LinkedIn"]],
       [3, "Denny", "Lee", "https://tinyurl.3","6/7/2019",7659, ["web", "twitter", "FB", "LinkedIn"]],
       [4, "Tathagata", "Das","https://tinyurl.4", "5/12/2018", 10568, ["twitter", "FB"]],
       [5, "Matei","Zaharia", "https://tinyurl.5", "5/14/2014", 40578, ["web", "twitter", "FB", "LinkedIn"]],
       [6, "Reynold", "Xin", "https://tinyurl.6", "3/2/2015", 25568, ["twitter", "LinkedIn"]]
      ]


In [18]:
# create a DataFrame using one of the schemas defined above and the data:

# Uncomment one of the following lines - both show the same result:
# blogs_df = spark.createDataFrame(data, schema2_ddl)
blogs_df = spark.createDataFrame(data, schema2_prg)

# show the DataFrame:
blogs_df.show()


+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+



In [19]:
# print the schema used by Spark to process the DataFrame
# (printSchema() prints directly and returns None, so it is not wrapped in print())
blogs_df.printSchema()


root
 |-- Id: integer (nullable = false)
 |-- First: string (nullable = false)
 |-- Last: string (nullable = false)
 |-- Url: string (nullable = false)
 |-- Published: string (nullable = false)
 |-- Hits: integer (nullable = false)
 |-- Campaigns: array (nullable = false)
 |    |-- element: string (containsNull = true)


In [20]:
blogs_df.schema

StructType(List(StructField(Id,IntegerType,false),StructField(First,StringType,false),StructField(Last,StringType,false),StructField(Url,StringType,false),StructField(Published,StringType,false),StructField(Hits,IntegerType,false),StructField(Campaigns,ArrayType(StringType,true),false)))

## Example page 53 ff: Creating a DataFrame by reading a JSON file

In [21]:
jsonFile = "data/blogs.json"
blogs2_df =  spark.read.schema(schema2_ddl).json(jsonFile)

In [22]:
blogs2_df.show()

+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+



In [23]:
blogs2_df.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- First: string (nullable = true)
 |-- Last: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- Published: string (nullable = true)
 |-- Hits: integer (nullable = true)
 |-- Campaigns: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [24]:
blogs2_df.schema

StructType(List(StructField(Id,IntegerType,true),StructField(First,StringType,true),StructField(Last,StringType,true),StructField(Url,StringType,true),StructField(Published,StringType,true),StructField(Hits,IntegerType,true),StructField(Campaigns,ArrayType(StringType,true),true)))
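
By default, `spark.read.json` expects one JSON object per line (JSON Lines), not a single JSON array spanning the whole file. The plain-Python sketch below shows that layout with two shortened records from the table above. The in-memory `sample` text is an assumption made for illustration and is not the actual content of `data/blogs.json`.

```python
import io
import json

# Two records in JSON Lines layout: one complete JSON object per line
sample = io.StringIO(
    '{"Id": 1, "First": "Jules", "Last": "Damji", "Hits": 4535, "Campaigns": ["twitter", "LinkedIn"]}\n'
    '{"Id": 2, "First": "Brooke", "Last": "Wenig", "Hits": 8908, "Campaigns": ["twitter", "LinkedIn"]}\n'
)

# Parse each non-empty line on its own, as a line-oriented reader would
records = [json.loads(line) for line in sample if line.strip()]

for record in records:
    print(record["Id"], record["First"], record["Hits"])
```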

## Examples page 54 ff: Columns and Expressions

Show the columns of a DataFrame:

In [26]:
blogs2_df.columns

['Id', 'First', 'Last', 'Url', 'Published', 'Hits', 'Campaigns']

Access a particular column:

In [29]:
blogs2_df["id"]

Column<'id'>

Use an expression to calculate a certain value of a column:

In [30]:
blogs2_df.select(expr("Hits * 2")).show()

+----------+
|(Hits * 2)|
+----------+
|      9070|
|     17816|
|     15318|
|     21136|
|     81156|
|     51136|
+----------+



Add a new column based on an expression:

In [31]:
blogs2_df.withColumn("Big Hitters", (expr("Hits > 10000"))).show()

+---+---------+-------+-----------------+---------+-----+--------------------+-----------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|Big Hitters|
+---+---------+-------+-----------------+---------+-----+--------------------+-----------+
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|      false|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|      false|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|      false|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|       true|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|       true|
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|       true|
+---+---------+-------+-----------------+---------+-----+--------------------+-----------+
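
Note that `withColumn` does not change `blogs2_df`; it returns a new DataFrame that contains the extra column. A minimal plain-Python sketch of the same idea, using a list of dictionaries in place of a DataFrame (the helper `with_column` is hypothetical and only mirrors this behaviour for illustration):

```python
rows = [
    {"First": "Denny", "Hits": 7659},
    {"First": "Tathagata", "Hits": 10568},
    {"First": "Matei", "Hits": 40578},
]


def with_column(rows, name, func):
    """Return new rows with an added column; the input rows stay unchanged."""
    return [{**row, name: func(row)} for row in rows]


# Derive a boolean column from an expression on an existing column
big_hitters = with_column(rows, "Big Hitters", lambda row: row["Hits"] > 10000)

for row in big_hitters:
    print(row)

# The original rows were not modified
print("Big Hitters" in rows[0])  # False
```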



Create a new column by concatenating three columns and show the new column:

In [38]:
blogs2_df.withColumn("AuthorsId", (concat(expr("First"), expr("Last"), expr("Id")))).select(col("AuthorsId")).show()


+-------------+
|    AuthorsId|
+-------------+
|  JulesDamji1|
| BrookeWenig2|
|    DennyLee3|
|TathagataDas4|
|MateiZaharia5|
|  ReynoldXin6|
+-------------+



The following three statements return the same result, showing that for a plain column name, `expr("Hits")`, `col("Hits")`, and the string `"Hits"` are interchangeable:

In [39]:
blogs2_df.select(expr("Hits")).show()

+-----+
| Hits|
+-----+
| 4535|
| 8908|
| 7659|
|10568|
|40578|
|25568|
+-----+



In [40]:
blogs2_df.select(col("Hits")).show()

+-----+
| Hits|
+-----+
| 4535|
| 8908|
| 7659|
|10568|
|40578|
|25568|
+-----+



In [41]:
blogs2_df.select("Hits").show()

+-----+
| Hits|
+-----+
| 4535|
| 8908|
| 7659|
|10568|
|40578|
|25568|
+-----+



Sort by column "Id" in descending order. Note that passing col("Id") or the string "Id" gives the same result:

In [53]:
blogs2_df.sort(col("Id"), ascending=False).show()

+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+



In [56]:
blogs2_df.sort("Id", ascending=False).show()

+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+



## Rows (page 57 ff)

In [58]:
from pyspark.sql import Row

blog_row = Row(6, "Reynold", "Xin", "https://tinyurl.6", 255568, "3/2/2015", ["twitter", "LinkedIn"])


In [59]:
blog_row

<Row(6, 'Reynold', 'Xin', 'https://tinyurl.6', 255568, '3/2/2015', ['twitter', 'LinkedIn'])>

In [62]:
# access individual items of the row using the index:
blog_row[1]

'Reynold'
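
A PySpark `Row` is a subclass of Python's `tuple`, which is why index access works. The sketch below uses `collections.namedtuple` as a stand-in to show positional and named access side by side. `BlogRow` and its field names are assumptions chosen for this illustration and are not part of the Spark API.

```python
from collections import namedtuple

# A tuple type with named fields, standing in for a Row with named columns
BlogRow = namedtuple(
    "BlogRow", ["Id", "First", "Last", "Url", "Hits", "Published", "Campaigns"]
)

blog = BlogRow(6, "Reynold", "Xin", "https://tinyurl.6", 255568, "3/2/2015",
               ["twitter", "LinkedIn"])

print(blog[1])      # access by position
print(blog.First)   # access by field name
print(len(blog))    # number of fields
```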

## Create DataFrames out of rows

In [63]:
rows = [Row("Matei Zaharia", "CA"), Row("Reynold Xin", "MA")]
authors_df = spark.createDataFrame(rows, ["Authors", "State"])
authors_df.show()


+-------------+-----+
|      Authors|State|
+-------------+-----+
|Matei Zaharia|   CA|
|  Reynold Xin|   MA|
+-------------+-----+



## End-to-End example: San Francisco Fire Department Calls (page 58 ff)
See the extra Jupyter Notebook `chapter_03_SF_Fire_Calls.ipynb`


## Typed Objects, Untyped Objects and Generic Rows (page 69 ff)
Because Spark Datasets are strongly typed objects, they only make sense in Scala and Java, not in Python or R. We therefore implement the examples in this section with Spark DataFrames in Python.
https://databricks.com/blog/2016/03/28/how-to-process-iot-device-json-data-using-apache-spark-datasets-and-dataframes.html

Python Notebook to create the IoT Data on GitHub:
https://github.com/dmatrix/examples/blob/master/spark/databricks/notebooks/py/sql_device_provisioning.ipynb

In [2]:
row = Row(350, True, "Learning Spark 2E", None)

In [9]:
# Try row[0], row[1], row[2], row[3]: 
row[0]


350

## End-to-End example: IoT Data (page 71 ff)
See the extra Jupyter Notebook `chapter_03_IoT_Data.ipynb`
