### **In this notebook we talk about**

* Static data frames.
* Streaming data frames.

In [1]:
# PYSPARK INITIALIZATION
import os
import sys

APP_NAME = 'pyspark_python'
MASTER = 'local[*]'

In [2]:
from pyspark import SparkConf
from pyspark.sql import SparkSession


conf = SparkConf().setAppName(APP_NAME)
conf = conf.setMaster(MASTER)
spark = SparkSession.builder.config(conf = conf).getOrCreate()
sc = spark.sparkContext

In [3]:

flightData2015 = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("../data/2015-summary.csv") 

In [4]:
flightData2015.show()

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Romania|   15|
|       United States|            Croatia|    1|
|       United States|            Ireland|  344|
|               Egypt|      United States|   15|
|       United States|              India|   62|
|       United States|          Singapore|    1|
|       United States|            Grenada|   62|
|          Costa Rica|      United States|  588|
|             Senegal|      United States|   40|
|             Moldova|      United States|    1|
|       United States|       Sint Maarten|  325|
|       United States|   Marshall Islands|   39|
|              Guyana|      United States|   64|
|               Malta|      United States|    1|
|            Anguilla|      United States|   41|
|             Bolivia|      United States|   30|
|       United States|           Paraguay|    6|
|             Algeri

In [5]:
flightData2015.toPandas().head()

   DEST_COUNTRY_NAME  ORIGIN_COUNTRY_NAME  count
0      United States              Romania     15
1      United States              Croatia      1
2      United States              Ireland    344
3              Egypt        United States     15
4      United States                India     62


## EXPLAIN 
* Nothing happens to the data when we call sort because it is just a transformation. However, by looking at the explain plan we can see that Spark is building up a plan for how it will execute the sort across the cluster. We can call explain on any DataFrame object to see the DataFrame’s lineage (that is, how Spark will execute the query).

* Note that sorting our data is a wide transformation, because rows have to be compared with one another.

In [6]:
flightData2015.sort("count").explain()

== Physical Plan ==
*(2) Sort [count#12 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#12 ASC NULLS FIRST, 200)
   +- *(1) FileScan csv [DEST_COUNTRY_NAME#10,ORIGIN_COUNTRY_NAME#11,count#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/erikapat/Dropbox/PRUEBAS_DATA_SCIENCE/SPARK/GIT_SPARK-PRACTICE-NOTE..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>
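
The plan reads from the bottom up: a CSV scan, an Exchange that range-partitions the rows by count, and a Sort inside each partition. The idea can be sketched in plain Python as below. This is only an illustration: the function name `range_partition_sort` and the hard-coded boundaries are assumptions made for the example (Spark chooses its own boundaries), and the counts are taken from the first rows shown earlier.

```python
from bisect import bisect_right


def range_partition_sort(values, boundaries):
    """Route each value to a range partition, then sort every partition locally."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for value in values:
        # Every row must be compared against the boundaries: this is the shuffle.
        partitions[bisect_right(boundaries, value)].append(value)
    return [sorted(partition) for partition in partitions]


counts = [15, 1, 344, 15, 62, 1, 62, 588, 40, 1]
parts = range_partition_sort(counts, boundaries=[10, 100])

# Reading the partitions in order yields a globally sorted result.
flattened = [value for partition in parts for value in partition]
print(parts)
```

Because the partitions cover disjoint, ordered ranges, no further merge step is needed after the local sorts.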


By default, when we perform a shuffle, Spark outputs two hundred shuffle partitions. Below we lower this value, trying 1, 5, 20 and 100 partitions, to reduce the number of output partitions from the shuffle.

Go ahead and experiment with different values and see the number of partitions yourself. On larger datasets you should see drastically different run times; on this small dataset the differences are slight. Remember that you can monitor job progress by navigating to the Spark UI on port 4040 to see the physical and logical execution characteristics of our jobs.

In [7]:
import datetime
# Record the start time so we can measure the run time
timestart = datetime.datetime.now()
spark.conf.set("spark.sql.shuffle.partitions", "1")
flightData2015.sort("count").take(2)

# Compute the elapsed time of the run
timeend = datetime.datetime.now()
timedelta = round((timeend - timestart).total_seconds(), 2)
print("Time taken to run the sort: " + str(timedelta) + " seconds")

Time taken to run the sort: 0.13 seconds


In [8]:
import datetime
# Record the start time so we can measure the run time
timestart = datetime.datetime.now()
spark.conf.set("spark.sql.shuffle.partitions", "5")
flightData2015.sort("count").take(2)

# Compute the elapsed time of the run
timeend = datetime.datetime.now()
timedelta = round((timeend - timestart).total_seconds(), 2)
print("Time taken to run the sort: " + str(timedelta) + " seconds")

Time taken to run the sort: 0.07 seconds


In [9]:
import datetime
# Record the start time so we can measure the run time
timestart = datetime.datetime.now()
spark.conf.set("spark.sql.shuffle.partitions", "20")
flightData2015.sort("count").take(2)

# Compute the elapsed time of the run
timeend = datetime.datetime.now()
timedelta = round((timeend - timestart).total_seconds(), 2)
print("Time taken to run the sort: " + str(timedelta) + " seconds")

Time taken to run the sort: 0.07 seconds


In [10]:
import datetime
# Record the start time so we can measure the run time
timestart = datetime.datetime.now()
spark.conf.set("spark.sql.shuffle.partitions", "100")
flightData2015.sort("count").take(2)

# Compute the elapsed time of the run
timeend = datetime.datetime.now()
timedelta = round((timeend - timestart).total_seconds(), 2)
print("Time taken to run the sort: " + str(timedelta) + " seconds")

Time taken to run the sort: 0.07 seconds
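
The four timing cells above repeat the same pattern, so it may be convenient to wrap it in a small helper. The sketch below is one possible way to do that; the names `timed` and `run_query` are hypothetical, and `run_query` uses a stand-in workload so that the example runs without a Spark session.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = round(time.perf_counter() - start, 2)


def run_query(partitions):
    # Stand-in workload (an assumption) so the sketch runs without Spark.
    # In the notebook, this body would set spark.sql.shuffle.partitions to
    # `partitions` and then call flightData2015.sort("count").take(2).
    return sorted(range(100_000), reverse=True)[:2]


timings = {}
for partitions in ("1", "5", "20", "100"):
    with timed(partitions, timings):
        run_query(partitions)

for partitions, seconds in timings.items():
    print(f"{partitions} shuffle partitions: {seconds} seconds")
```

A single run per setting is noisy; repeating each measurement a few times would give a fairer comparison.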


# DataFrames and SQL

There is no performance difference between writing SQL queries and writing DataFrame code; both
"compile" to the same underlying plan.

**Any DataFrame can be made into a table or view with one simple method call**

In [11]:
flightData2015.createOrReplaceTempView("flight_data_2015")

To execute a SQL query, we’ll use the spark.sql function (remember, spark is our SparkSession variable), which conveniently returns a new DataFrame. While it may seem a bit circular that a SQL query against a DataFrame returns another DataFrame, it’s actually quite powerful.

In [12]:
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")
sqlWay.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 100)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/erikapat/Dropbox/PRUEBAS_DATA_SCIENCE/SPARK/GIT_SPARK-PRACTICE-NOTE..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>


In [13]:
sqlWay.toPandas().head()

   DEST_COUNTRY_NAME  count(1)
0             Panama         1
1         Cape Verde         1
2          Hong Kong         1
3           Anguilla         1
4             Russia         1


In [14]:
dataFrameWay = flightData2015\
.groupBy("DEST_COUNTRY_NAME")\
.count()
dataFrameWay.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 100)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/erikapat/Dropbox/PRUEBAS_DATA_SCIENCE/SPARK/GIT_SPARK-PRACTICE-NOTE..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>
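
Both plans show the same two-phase aggregation: a partial count per input partition, an Exchange that hash-partitions by destination, and a final count. A rough plain-Python sketch of that shape is given below. The function name `two_phase_count`, the sample rows, and the use of Python's built-in `hash` are assumptions made for illustration; Spark uses its own hash function and partitioning.

```python
from collections import Counter


def two_phase_count(partitions, num_shuffle_partitions=2):
    """Count rows per DEST_COUNTRY_NAME in the partial/exchange/final shape."""
    # Phase 1 (partial_count): each input partition counts its own rows.
    partials = [
        Counter(row["DEST_COUNTRY_NAME"] for row in partition)
        for partition in partitions
    ]

    # Exchange (hashpartitioning): send each key's partial count to one partition.
    shuffled = [[] for _ in range(num_shuffle_partitions)]
    for partial in partials:
        for country, n in partial.items():
            shuffled[hash(country) % num_shuffle_partitions].append((country, n))

    # Phase 2 (count): sum the partial counts inside each shuffle partition.
    result = {}
    for part in shuffled:
        totals = Counter()
        for country, n in part:
            totals[country] += n
        result.update(totals)
    return result


partitions = [
    [{"DEST_COUNTRY_NAME": "United States"}, {"DEST_COUNTRY_NAME": "Egypt"}],
    [{"DEST_COUNTRY_NAME": "United States"}, {"DEST_COUNTRY_NAME": "Senegal"}],
]
counts = two_phase_count(partitions)
print(counts)
```

Only the small partial counts cross the shuffle, not the raw rows, which is why the partial step is worthwhile.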


In [15]:
dataFrameWay.toPandas().head()

   DEST_COUNTRY_NAME  count
0             Panama      1
1         Cape Verde      1
2          Hong Kong      1
3           Anguilla      1
4             Russia      1


In [16]:
from pyspark.sql.functions import desc
flightData2015\
.groupBy("DEST_COUNTRY_NAME")\
.sum("count")\
.withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total"))\
.limit(5)\
.collect()

[Row(DEST_COUNTRY_NAME='United States', destination_total=411352),
 Row(DEST_COUNTRY_NAME='Canada', destination_total=8399),
 Row(DEST_COUNTRY_NAME='Mexico', destination_total=7140),
 Row(DEST_COUNTRY_NAME='United Kingdom', destination_total=2025),
 Row(DEST_COUNTRY_NAME='Japan', destination_total=1548)]
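
The same top-destinations aggregation can also be written as a SQL statement. To keep the example runnable without Spark, the sketch below uses Python's built-in `sqlite3` module as a stand-in engine and loads only a handful of the rows shown earlier, so its totals differ from the full-dataset result above. A similar statement could presumably be passed to spark.sql against the flight_data_2015 view, although identifier quoting may differ between engines.

```python
import sqlite3

# A few rows copied from the flightData2015.show() output above.
rows = [
    ("United States", "Romania", 15),
    ("United States", "Croatia", 1),
    ("United States", "Ireland", 344),
    ("Egypt", "United States", 15),
    ("United States", "India", 62),
    ("Costa Rica", "United States", 588),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight_data_2015 "
    '(DEST_COUNTRY_NAME TEXT, ORIGIN_COUNTRY_NAME TEXT, "count" INTEGER)'
)
conn.executemany("INSERT INTO flight_data_2015 VALUES (?, ?, ?)", rows)

# Group by destination, sum the counts, and keep the five largest totals.
top_destinations = conn.execute(
    """
    SELECT DEST_COUNTRY_NAME, SUM("count") AS destination_total
    FROM flight_data_2015
    GROUP BY DEST_COUNTRY_NAME
    ORDER BY destination_total DESC
    LIMIT 5
    """
).fetchall()
conn.close()
print(top_destinations)
```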

## Structured Streaming

Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2. It lets you take the same operations that you perform in batch mode using Spark’s structured APIs and run them in a streaming fashion.

This can reduce latency and allow for incremental processing. The best thing about Structured Streaming is that it lets you quickly get value out of streaming systems with virtually no code changes.

It is also easy to reason about, because you can write your batch job as a prototype and then convert it to a streaming job.

All of this works by incrementally processing the data as it arrives.

Let’s walk through a simple example of how easy it is to get started with Structured Streaming. For this we will use a retail dataset, one that has specific dates and times for us to use. We will use the "by-day" set of files, where one file represents one day of data.

We put the data in this format to simulate data being produced in a consistent and regular manner by a different process. This is retail data, so imagine that the files are being produced by retail stores and sent to a location where our Structured Streaming job will read them.

In [17]:
staticDataFrame = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("../data/retail-data/by-day/*.csv")
staticDataFrame.createOrReplaceTempView("retail_data")
staticSchema = staticDataFrame.schema

Since we’re working with time series data, it’s worth mentioning how we might go about grouping and aggregating our data. In this example we’ll look at the days on which a given customer (identified by CustomerId) makes a large purchase. To do so, let’s add a total cost column and see on which days a customer spent the most.

In [18]:
from pyspark.sql.functions import window, column, desc, col
staticDataFrame\
.selectExpr(
"CustomerId",
"(UnitPrice * Quantity) as total_cost" ,
"InvoiceDate" )\
.groupBy(
col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
.sum("total_cost")\
.show(5)

+----------+--------------------+-------------------+
|CustomerId|              window|    sum(total_cost)|
+----------+--------------------+-------------------+
|   16057.0|[2011-12-05 01:00...|              -37.6|
|   14126.0|[2011-11-29 01:00...|  643.6300000000001|
|   13500.0|[2011-11-16 01:00...|  497.9700000000001|
|   16253.0|[2011-11-08 01:00...|-18.240000000000002|
|   17160.0|[2011-11-08 01:00...|  516.8499999999999|
+----------+--------------------+-------------------+
only showing top 5 rows
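
To make the grouping concrete, here is a plain-Python sketch of a one-day tumbling window followed by a sum per (customer, window) pair. The rows are hypothetical and the helper names `day_window` and `total_cost_per_customer_day` are assumptions for this example. Note that the sketch aligns windows to midnight, whereas the windows printed above start at 01:00.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def day_window(ts):
    """Return the (start, end) of the one-day tumbling window containing ts."""
    start = datetime(ts.year, ts.month, ts.day)
    return (start, start + timedelta(days=1))


def total_cost_per_customer_day(rows):
    """Sum UnitPrice * Quantity for every (CustomerId, window) pair."""
    totals = defaultdict(float)
    for customer_id, unit_price, quantity, invoice_date in rows:
        totals[(customer_id, day_window(invoice_date))] += unit_price * quantity
    return dict(totals)


# Hypothetical rows: (CustomerId, UnitPrice, Quantity, InvoiceDate).
rows = [
    (14126.0, 2.5, 4, datetime(2011, 11, 29, 9, 30)),
    (14126.0, 1.0, 3, datetime(2011, 11, 29, 17, 5)),
    (14126.0, 2.0, 1, datetime(2011, 11, 30, 8, 0)),
]
totals = total_cost_per_customer_day(rows)
for (customer_id, (start, end)), total in totals.items():
    print(customer_id, start, end, total)
```

Every timestamp falls into exactly one window, which is what distinguishes a tumbling window from a sliding one.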



That’s the static DataFrame version; there shouldn’t be any big surprises in there if you’re familiar with the syntax. Now that we’ve seen how that works, let’s take a look at the streaming code. You’ll notice that very little actually changes about our code. The biggest change is that we use readStream instead of read. You’ll also notice the maxFilesPerTrigger option, which simply specifies the number of files we should read in at once. This makes our demonstration more "streaming"; in a production scenario it would be omitted.

Since you’re likely running this in local mode, it’s good practice to set the number of shuffle partitions to something that better fits local mode. This configuration simply specifies the number of partitions that should be created after a shuffle. By default the value is two hundred, but since there aren’t many executors on this machine it’s worth reducing it to five.

`spark.conf.set("spark.sql.shuffle.partitions", "5")`

In [19]:
streamingDataFrame = spark.readStream\
.schema(staticSchema)\
.option("maxFilesPerTrigger", 1)\
.format("csv")\
.option("header", "true")\
.load("../data/retail-data/by-day/*.csv")

In [20]:
streamingDataFrame.isStreaming

True

Now we can see the DataFrame is streaming.

Let’s set up the same business logic as in the previous DataFrame manipulation; we’ll perform a summation in the process.

In [21]:
purchaseByCustomerPerHour = streamingDataFrame\
.selectExpr(
"CustomerId",
"(UnitPrice * Quantity) as total_cost" ,
"InvoiceDate" )\
.groupBy(
col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
.sum("total_cost")

This is still a lazy operation, so we will need to call a streaming action to start the execution of this data flow.

## NOTE
Before kicking off the stream, we will set a small optimization that will allow this to run better on a single machine.
This simply limits the number of output partitions after a shuffle.

In [22]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

Streaming actions are a bit different from our conventional static actions because we’re going to be populating data somewhere instead of just calling something like count (which doesn’t make any sense on a stream anyway). The action we will use will output to an in-memory table that is updated after each trigger. In this case, each trigger is based on an individual file (the read option that we set). Spark will mutate the data in the in-memory table such that we always have the highest value as specified in our aggregation above.

In [23]:

purchaseByCustomerPerHour.writeStream.format("memory").queryName("customer_purchases").outputMode("complete").start()

<pyspark.sql.streaming.StreamingQuery at 0x12021e890>

Once we start the stream, we can run queries against the stream to debug what our result will look like if we were to
write this out to a production sink.

In [24]:
spark.sql("""
SELECT *
FROM customer_purchases
ORDER BY `sum(total_cost)` DESC
""")\
.show(5)

+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   12415.0|[2011-03-03 01:00...|          16558.14|
|      null|[2011-03-03 01:00...| 3538.750000000001|
|   17416.0|[2011-03-03 01:00...|           2114.71|
|   18102.0|[2011-03-03 01:00...|            1396.0|
|   16709.0|[2011-03-03 01:00...|1120.5300000000002|
+----------+--------------------+------------------+
only showing top 5 rows
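
The trigger-by-trigger behaviour of the in-memory table can be imitated in plain Python. The sketch below is a simplified model, not Spark's implementation: `run_triggers` is a hypothetical name, each inner list stands for one file (one trigger, as with maxFilesPerTrigger set to 1), and after every trigger the whole sorted result is re-emitted, as in "complete" output mode.

```python
from collections import defaultdict


def run_triggers(files):
    """Process one file per trigger and snapshot the full result table each time."""
    state = defaultdict(float)  # running sum per key, kept across triggers
    snapshots = []
    for rows in files:
        for key, cost in rows:
            state[key] += cost
        # Complete mode: emit the entire aggregated table, largest total first.
        table = sorted(state.items(), key=lambda item: item[1], reverse=True)
        snapshots.append(table)
    return snapshots


# Hypothetical (customer, total_cost) rows, one inner list per daily file.
files = [
    [("A", 10.0), ("B", 5.0)],
    [("B", 20.0)],
]
snapshots = run_triggers(files)
for number, table in enumerate(snapshots, start=1):
    print(f"after trigger {number}: {table}")
```

The ranking changes between the two snapshots even though no earlier row was reprocessed, because only the running state is updated.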



You’ll notice that as we read in more data, the composition of our table changes! With each file the results may or may not change, depending on the data. Naturally, since we’re grouping by customer, we hope to see the top customer purchase amounts increase over time (and they do for a period of time!). Another option is simply to write the results out to the console.

In [25]:
purchaseByCustomerPerHour.writeStream.format("console").queryName("customer_purchases_2").outputMode("complete").start()

<pyspark.sql.streaming.StreamingQuery at 0x1201ddb90>

Neither of these streaming methods should be used in production, but they do make for a convenient demonstration of Structured Streaming’s power. Notice how this window is built on event time as well, not the time at which Spark processes the data. This was one of the shortcomings of Spark Streaming that Structured Streaming has resolved.

## Machine Learning and Advanced Analytics
Another popular aspect of Spark is its ability to perform large scale machine learning with a built-in library of machine
learning algorithms called MLlib. MLlib allows for preprocessing, munging, training of models, and making predictions
at scale on data. You can even use models trained in MLlib to make predictions in Structured Streaming. Spark
provides a sophisticated machine learning API for performing a variety of machine learning tasks, from classification
to regression, clustering to deep learning. To demonstrate this functionality, we will perform some basic clustering on
our data using a common algorithm called K-Means.

In [26]:
staticDataFrame.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



Machine learning algorithms in MLlib require data to be represented as numerical values. Our current data is
represented by a variety of different types including timestamps, integers, and strings. Therefore we need to transform
this data into some numerical representation. In this instance, we will use several DataFrame transformations to
manipulate our date data.

In [27]:
from pyspark.sql.functions import date_format, col
preppedDataFrame = staticDataFrame\
.na.fill(0)\
.withColumn("day_of_week", date_format(col("InvoiceDate"), "EEEE"))\
.coalesce(5)
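
For readers who want to see what this preparation does row by row, here is a plain-Python sketch. It assumes, as in Spark, that filling with a number only affects numeric columns and that the "EEEE" pattern yields the full day name. The helper name `prep_row`, the `NUMERIC_COLUMNS` tuple, and the sample row are made up for illustration.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
NUMERIC_COLUMNS = ("Quantity", "UnitPrice", "CustomerID")


def prep_row(row):
    """Fill missing numeric values with 0 and add a day_of_week column."""
    prepped = dict(row)
    for name in NUMERIC_COLUMNS:
        if prepped.get(name) is None:
            prepped[name] = 0
    # datetime.weekday() returns 0 for Monday, matching DAY_NAMES.
    prepped["day_of_week"] = DAY_NAMES[prepped["InvoiceDate"].weekday()]
    return prepped


# Hypothetical row with a missing customer and a missing description.
row = {
    "Description": None,
    "Quantity": 6,
    "InvoiceDate": datetime(2011, 7, 1, 8, 26),
    "UnitPrice": 2.55,
    "CustomerID": None,
}
prepped = prep_row(row)
print(prepped)
```

The string column stays empty, so any later step that needs it would have to handle the missing value separately.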

In [28]:
#preppedDataFrame.toPandas().head()
type(preppedDataFrame)

pyspark.sql.dataframe.DataFrame

We also need to split our data into training and test sets. In this instance we do this manually, by the date on which a purchase occurred; however, we could also leverage MLlib’s transformation APIs to create training and test sets via train validation splits or cross validation.

In [29]:
trainDataFrame = preppedDataFrame.where("InvoiceDate < '2011-07-01'")
testDataFrame = preppedDataFrame.where("InvoiceDate >= '2011-07-01'")

We have now split the prepared data into a training set and a test set. Since this is a time-series dataset, we split by an arbitrary date in the dataset. While this may not be the optimal split for training and testing, for the purposes of this example it will work just fine. The counts below show that this splits our dataset roughly in half.

In [30]:
trainDataFrame.count()

245903

In [31]:
testDataFrame.count()

296006
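
The date-based split and the resulting proportions can be checked with a short plain-Python sketch. The helper `split_by_date` and the sample rows are hypothetical; the two counts are the ones printed above.

```python
from datetime import datetime

CUTOFF = datetime(2011, 7, 1)


def split_by_date(rows, cutoff=CUTOFF):
    """Rows strictly before the cutoff go to training, the rest to test."""
    train = [row for row in rows if row["InvoiceDate"] < cutoff]
    test = [row for row in rows if row["InvoiceDate"] >= cutoff]
    return train, test


# Hypothetical rows on either side of the cutoff, including the boundary itself.
rows = [
    {"InvoiceNo": "a", "InvoiceDate": datetime(2011, 6, 30, 23, 59)},
    {"InvoiceNo": "b", "InvoiceDate": datetime(2011, 7, 1, 0, 0)},
    {"InvoiceNo": "c", "InvoiceDate": datetime(2011, 12, 5, 10, 0)},
]
train, test = split_by_date(rows)

# Share of the notebook's rows that ended up in the training set.
train_share = 245903 / (245903 + 296006)
print(f"training share: {train_share:.3f}")
```

Splitting on time rather than at random keeps later purchases out of the training set, which mirrors how the model would be used on future data.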

In [32]:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer()\
.setInputCol("day_of_week")\
.setOutputCol("day_of_week_index")

In [33]:
indexer

StringIndexer_0a21e90e831a

This will turn our days of the week into corresponding numerical values. For example, Spark may represent Saturday as 6 and Monday as 1. However, with this numbering scheme we are implicitly stating that Saturday is greater than Monday (by pure numerical value). This is obviously incorrect. Therefore we need to use a OneHotEncoder to encode each of these values as its own column. These boolean flags state whether that day of the week is the relevant day of the week.

## OneHotEncoder

In [34]:
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder()\
.setInputCol("day_of_week_index")\
.setOutputCol("day_of_week_encoded")
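
The effect of indexing followed by one-hot encoding can be sketched in plain Python. This is an approximation under stated assumptions: labels are indexed by descending frequency (StringIndexer's default ordering), the last category is dropped (OneHotEncoder's default), and dense lists are used where Spark would produce sparse vectors. The names `fit_index` and `one_hot` and the sample days are hypothetical.

```python
from collections import Counter


def fit_index(labels):
    """Map each label to an index, most frequent label first."""
    ranked = Counter(labels).most_common()
    return {label: position for position, (label, _) in enumerate(ranked)}


def one_hot(index, num_labels, drop_last=True):
    """Return a 0/1 vector; with drop_last the final label becomes all zeros."""
    size = num_labels - 1 if drop_last else num_labels
    return [1.0 if position == index else 0.0 for position in range(size)]


days = ["Monday", "Saturday", "Monday", "Friday", "Monday", "Saturday"]
index = fit_index(days)
encoded = {day: one_hot(position, len(index)) for day, position in index.items()}
for day, vector in encoded.items():
    print(day, index[day], vector)
```

No vector is numerically "greater" than another in a meaningful way, which removes the artificial ordering introduced by the index.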

Each of these will result in a set of columns that we will "assemble" into a vector. All machine learning algorithms in
Spark take as input a Vector type, which must be a set of numerical values.

In [35]:
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler()\
.setInputCols(["UnitPrice", "Quantity", "day_of_week_encoded"])\
.setOutputCol("features")

We can see that we have 3 key features, the price, the quantity, and the day of week. Now we’ll set this up into a
pipeline so any future data we need to transform can go through the exact same process.
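
Conceptually, assembling amounts to concatenating the scalar columns and the encoded vector into one flat feature vector. The sketch below illustrates this in plain Python; the function name `assemble`, the sample values, and the six-element day encoding are assumptions chosen for the example, since the real vector size depends on how many distinct days the data contains.

```python
def assemble(unit_price, quantity, day_of_week_encoded):
    """Concatenate two scalars and an encoded vector into one feature vector."""
    return [float(unit_price), float(quantity),
            *(float(flag) for flag in day_of_week_encoded)]


# Hypothetical row: price 2.55, quantity 6, third day flag set.
features = assemble(2.55, 6, [0, 0, 1, 0, 0, 0])
print(features)
```

So the "3 key features" occupy more than three positions in the final vector once the day of week has been expanded.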

In [36]:
from pyspark.ml import Pipeline
transformationPipeline = Pipeline()\
.setStages([indexer, encoder, vectorAssembler])

In [37]:
fittedPipeline = transformationPipeline.fit(trainDataFrame)

Preparing for training is a two-step process. We first need to fit our transformers to this dataset. Basically, our StringIndexer needs to know how many unique values there are to be indexed. Once those exist, encoding is easy, but Spark must look at all the distinct values in the column to be indexed in order to store those values later on.

Once we have fit the training data, we are ready to take that fitted pipeline and use it to transform all of our data in a consistent and repeatable way.

In [38]:
transformedTraining = fittedPipeline.transform(trainDataFrame)
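
The fit-then-transform pattern can be illustrated with a tiny plain-Python stand-in. The classes `IndexerSketch` and `IndexerModelSketch` are hypothetical and, for simplicity, index labels alphabetically rather than by frequency; the point is only that the mapping is learned once from the training rows and then reused unchanged on any other data.

```python
class IndexerModelSketch:
    """Fitted stage: holds the label-to-index mapping learned during fit."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, rows):
        # A label never seen during fit raises KeyError in this sketch.
        return [dict(row, day_of_week_index=self.mapping[row["day_of_week"]])
                for row in rows]


class IndexerSketch:
    """Unfitted stage: fit() scans the data and returns a fitted model."""

    def fit(self, rows):
        labels = sorted({row["day_of_week"] for row in rows})
        return IndexerModelSketch({label: i for i, label in enumerate(labels)})


train_rows = [{"day_of_week": "Monday"}, {"day_of_week": "Friday"}]
test_rows = [{"day_of_week": "Monday"}]

model = IndexerSketch().fit(train_rows)
transformed_train = model.transform(train_rows)
transformed_test = model.transform(test_rows)
print(transformed_train, transformed_test)
```

Because the test rows go through the mapping learned on the training rows, "Monday" receives the same index in both sets.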

At this point, it’s worth mentioning that we could have included our model training in our pipeline. We chose not to
in order to demonstrate a use case for caching the data.

In [39]:
transformedTraining.cache()

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string, day_of_week: string, day_of_week_index: double, day_of_week_encoded: vector, features: vector]

In [40]:
from pyspark.ml.clustering import KMeans
kmeans = KMeans()\
.setK(20)\
.setSeed(1)

In Spark, training machine learning models is a two phase process. 

* First we initialize an untrained model, 
* then we train it. 

There are always two types for every algorithm in MLlib’s DataFrame API. 
They follow the naming pattern of **Algorithm for the untrained version** and **AlgorithmModel for the trained version**. In our case, these are KMeans and KMeansModel.

In [41]:
kmModel = kmeans.fit(transformedTraining)

We can see the resulting cost at this point. It is quite high, most likely because we did not properly scale or transform our data.

In [42]:
# K-means cost: sum of squared distances of points to their nearest center.
kmModel.computeCost(transformedTraining)

103503481.10517502

In [43]:
transformedTest = fittedPipeline.transform(testDataFrame)
kmModel.computeCost(transformedTest)

548689531.0782777
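
The cost reported above is the sum of squared distances from each point to its nearest cluster center, so it grows with both the number of rows and the magnitude of unscaled features. A minimal plain-Python sketch of that quantity follows; the helper names, centers, and points are made up for illustration, and only the two totals and row counts at the end come from the notebook output.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical centers and points.
centers = [(0.0, 0.0), (10.0, 10.0)]
train_points = [(1.0, 0.0), (9.0, 10.0)]
test_points = train_points + [(0.0, 2.0), (10.0, 13.0)]

train_cost = kmeans_cost(train_points, centers)
test_cost = kmeans_cost(test_points, centers)

# Dividing the notebook's totals by the row counts gives a per-row cost,
# which is easier to compare across sets of different sizes.
train_cost_per_row = 103503481.10517502 / 245903
test_cost_per_row = 548689531.0782777 / 296006
print(train_cost, test_cost, train_cost_per_row, test_cost_per_row)
```

Comparing per-row values rather than raw totals avoids mistaking a larger dataset for a worse fit.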