<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Databricks Learning" style="width: 600px; height: 240px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Touching Spark - Part 2

## Getting Started

Let's start again by creating a new SparkSession.

Note that we are working in a notebook environment, where each notebook runs its own Spark session.

In [None]:
%load_ext autotime

In [None]:
import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io

from pyspark.sql.functions import *
from pyspark.sql.types import *

baseUri = "s3a://quantia-master/training/"

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)
qcutils.init_spark_session(spark)

spark

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Dataset and DataFrame Overview

Datasets and DataFrames are built on top of RDDs and represent a higher-level abstraction.

A Dataset is an immutable distributed collection of data. It has been available since Spark 1.6 and combines the benefits of RDDs (strong typing and lambda functions) with the benefits of Spark SQL’s optimized execution engine.
The Dataset API is **only** available in Scala and Java.

`DataFrame` is a type alias of `Dataset[Row]` (from a Java/Scala perspective). It is organized into named columns and is conceptually equivalent to a table in a relational database or a data frame in R/Python.
The DataFrame API is available in Scala, Java, Python, and R.

Since Spark 2.x, the DataFrame and Dataset APIs have been merged, unifying data processing capabilities across libraries.

### Dataset/DataFrame API Benefits
* Compile-time type safety and strong typing: errors are caught at compile time rather than at runtime
<img src=https://www.quantiaconsulting.com/logos/img/sql-vs-dataframes-vs-datasets-type-safety-spectrum.png width="450">

<br/>

* High-level abstraction of structured and semi-structured data: the Dataset/DataFrame APIs create a structured view of your semi-structured data (JSON source -> DataFrame object)

* Easy-to-use APIs: `agg`, `select`, `sum`, `filter`, or `groupBy` vs `map`, `flatMap` or `reduceByKey`

* Performance and Optimization:
<img src=https://www.quantiaconsulting.com/logos/img/memory-usage-when-caching-datasets-vs-rdds.png width="500">

The next section uses the [**Pageviews By Seconds** dataset](https://dumps.wikimedia.org/other/pagecounts-raw/).

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Data Ingestion

The first basic operation in a data project is ingesting data.

The SparkSession entry point exposes `read`, which lets you create DataFrames directly from external sources.

Spark offers readers for the most common structured data sources:
* Structured text files (csv, tsv, etc.)
* Big data formats (parquet, orc, etc.)
* Semi-structured data (json)
* Relational sources (JDBC)
* ...

Most of these readers offer options that automate basic steps during ingestion.

Let's see an example:

In [None]:
schema = StructType(
  [
    StructField("timestamp", StringType(), False),
    StructField("site", StringType(), False),
    StructField("requests", IntegerType(), False)
  ]
)

fileName = baseUri + "wikipedia_pageviews_by_second.tsv"

pageviewsDF = (spark.read
  .option("header", "true")
  .option("sep", "\t")
  .schema(schema)
  .csv(fileName)
)

pageviewsDF

In the previous read operation we tell the reader that:
* The separator is `\t` -> `.option("sep", "\t")`
* The file has a header -> `.option("header", "true")`
* The schema for the columns is the one we described above -> `.schema(schema)`
    * Alternatively, we could ask the system to guess the schema -> `.option("inferSchema", "true")`

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Transformations & Actions (From DataFrame Perspective)

### Transformations

Transformations have the following key characteristics:
* always return a `DataFrame` (or a `Dataset[Row]` in the case of Java/Scala).
* never modify their input: a `DataFrame` instance cannot be altered once it's instantiated (further optimizations are still possible)
* are classified as either a Wide or Narrow operation
* in **Scala & Java** come in two flavors: Typed & Untyped

### Actions

In contrast to transformations, actions either return a result or write to disk. For example:
* The number of records in the case of `count()`
* An array of objects in the case of `collect()` or `take(n)`

Here is a list of the most important actions:

| Method | Return | Description |
|--------|--------|-------------|
| `collect()` | Collection | Returns an array that contains all of the Rows in this DataFrame. |
| `count()` | Long | Returns the number of rows in the DataFrame. |
| `first()` | Row | Returns the first row. |
| `foreachPartition(f)` | - | Applies a function f to each partition of this DataFrame. |
| `head()` | Row | Returns the first row. |
| `show(..)` | - | Displays the top rows (20 by default) of the DataFrame in tabular form. |
| `take(n)` | Collection | Returns the first n rows in the DataFrame. |
| `toLocalIterator()` | Iterator | Returns an iterator that contains all of the Rows in this DataFrame. |
| `...` | | |


**Note:** The list of transformations and actions varies significantly between languages, mostly because Java & Scala are strictly typed languages, whereas Python & R are loosely typed.

Let's now put some transformations and actions to work.

The 255 MB pageviews file is currently in our object store, which means each time you scan through it, your Spark cluster has to read the 255 MB of data remotely over the network.

This time let's perform a `filter` transformation followed by a `count` to trigger a job.

Use the `count()` action to scan the entire 255 MB file and count how many records (rows) pass the filter:

In [None]:
fpv = pageviewsDF.filter('site=="mobile"')
fpv.count()

Rerun the cell several times and take note of the average execution time.

Every time we re-run these operations, it goes all the way back to the original data store.

This requires pulling all the data across the network for every execution.

In many/most cases, this network IO is the most expensive part of a job.

### cache()

We limit the overhead by caching the data on the executors.

As with an RDD, the `cache(..)` operation doesn't do anything other than mark a `DataFrame` as cacheable.

It is not technically a transformation or an action and, in order to actually cache the data, Spark has to process every single record.

A very common method for materializing the cache is to execute a `count()`.


In [None]:
pageviewsDF.cache()
pageviewsDF.count()

The last `count()` will take a little longer than normal.

It has to do the work of materializing the cache on top of counting the records.

Now that `pageviewsDF` is cached **AND** the cache has been materialized, run the query again and compare its execution time to the one above.

In [None]:
fpv = pageviewsDF.filter('site=="mobile"')
fpv.count()

Faster, right?

We are no longer making network calls: all of our data is now stored in RAM on the executors.

You can also use SQL directly!

In [None]:
pageviewsDF.createOrReplaceTempView("pageviews")

In [None]:
# spark.sql(..) only builds a DataFrame; show() is the action that runs the query
spark.sql("""
SELECT count(*) AS n
FROM pageviews
WHERE site = 'mobile'
""").show()

**Note** Unlike with RDDs, there are several ways to clear the cache:
  * Remove each cached DataFrame one by one, which is fairly problematic -> `unpersist()`
  * Restart the cluster, which takes a fair while to come back online
  * Just blow the entire cache away, which will affect every user on the cluster!! -> `spark.catalog.clearCache()`

In [None]:
spark.catalog.clearCache()

### A Little Step More

#### Why is Laziness So Important?

As with RDDs, Spark evaluates DataFrame transformations lazily too. But why?

Laziness is a common pattern in functional programming.
It has a number of benefits:
* Not forced to load all data at step #1
  * Technically impossible with **REALLY** large datasets.
* Easier to parallelize operations
  * N different transformations can be processed on a single data element, on a single thread, on a single machine.
* Most importantly, it allows the framework to **automatically apply various optimizations**

We will see this in action at the end of this notebook.
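The first two benefits can be illustrated without Spark at all. The sketch below is only an analogy built on plain Python generators, not how Spark is implemented: the function names and the input list are made up, and a `calls` log records when each step actually runs.

```python
# Log of the work that has actually been performed, in order.
calls = []

def parse(lines):
    """Lazy 'transformation': converts each line to an int on demand."""
    for line in lines:
        calls.append(("parse", line))
        yield int(line)

def keep_even(numbers):
    """Lazy 'transformation': keeps only even numbers on demand."""
    for n in numbers:
        calls.append(("filter", n))
        if n % 2 == 0:
            yield n

# Building the pipeline does no work at all.
pipeline = keep_even(parse(["1", "2", "3", "4"]))
calls_before_action = list(calls)

# The 'action': asking for one result pulls elements through both steps,
# one element at a time, and stops as soon as a result is available.
first_even = next(pipeline)

print(calls_before_action)
print(first_even, calls)
```

Nothing is parsed until a result is requested, each element flows through both steps before the next one is read, and the elements `"3"` and `"4"` are never touched.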

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Data Preparation 

As you can see, DataFrames ease data ingestion and data exploration.

Moreover, the tabular data structure makes it easy to manipulate and change the original format of the data.

**Note** As with an RDD, a DataFrame is immutable once instantiated. In order to change the schema or the content in a persistent way, you have to re-assign the DataFrame.

Examples of the most common data manipulations are:
* Change the data structure by adding/renaming/deleting columns
* Manipulate date and time

Let's play with more complex data.

In [None]:
bikeSharingDF = (spark.read
                .option("header", True)
                .option("inferSchema", True)
                .csv(baseUri+"bikeSharing.csv"))

bikeSharingDF

In [None]:
bikeSharingDF.count()

What are we seeing here?

A pretty large file, with 17 columns and thousands of rows.

This is an open-source dataset that contains the number of active bikes per hour, together with accurate information about the date, the weather and the season.

Based on your needs, it can be useful to add/remove columns, filter rows, or change the data type of a column.

Let's start by taking a look at the column names and data types.

In [None]:
bikeSharingDF.printSchema()

### New Columns and Column Name

The name of a column is an important starting point for understanding the data.

The method `withColumnRenamed(...)` changes the name of an existing column.

In [None]:
bikeSharingDF.withColumnRenamed("hr", "hour")

We just changed the name of the `hr` column to `hour`, but remember that a DataFrame is immutable! You have to re-assign it.

You can re-assign it to the same variable, but you will lose the reference to the original one.

In [None]:
bikeSharingDF = bikeSharingDF.withColumnRenamed("hr", "hour")
bikeSharingDF.printSchema()

### DataType

Spark did a great job during the ingestion in inferring the right datatype, but what if we need to change the datatype of a column?

We can perform a `cast`. 

A cast can change the datatype of a column.

Let's start from a simple example:

In [None]:
bikeSharingDF = bikeSharingDF.withColumn("instant", col("instant").cast(StringType()))
bikeSharingDF.printSchema()

Now the `instant` column is of type `String`.

We used the `withColumn()` method, which lets you add a new column to a DataFrame by specifying an expression that computes its values.

**Note** Because we used an existing column name in `withColumn()`, the new column replaces the old one.

### A Little Step More

#### DateTime Manipulation

Date and time manipulation is a very common step in a data preparation pipeline.

Let's start with an example. In our DataFrame, we want to create a single datetime column that combines the day (now in the `dteday` column) with the hour (now in the `hour` column). How can we do that?

Spark offers a wide range of functions to manipulate date and time.

At this moment we have:
* `dteday`: timestamp column
* `hour`: integer column

Let's start by creating a new column that merges the two pieces of information.

Any suggestions?

**HINT** Unix timestamp










In [None]:
bikeSharingDF.withColumn("dt", from_unixtime(unix_timestamp('dteday') + col("hour") * 3600))
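The arithmetic behind this solution is easier to follow on a single value. The sketch below reproduces it with the Python standard library only; the date and the hour are made-up sample values, and UTC is used so the result does not depend on the local time zone (Spark's `unix_timestamp` and `from_unixtime` work in the session time zone instead).

```python
from datetime import datetime, timezone

# Hypothetical values standing in for one row's `dteday` and `hour`.
day = datetime(2011, 1, 1, tzinfo=timezone.utc)  # midnight of the given day
hour = 5

# Step 1 (unix_timestamp): convert the day to seconds since the epoch.
day_seconds = int(day.timestamp())

# Step 2: add the hour, expressed in seconds.
epoch_seconds = day_seconds + hour * 3600

# Step 3 (from_unixtime): convert the seconds back into a datetime.
merged = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
merged_text = merged.strftime("%Y-%m-%d %H:%M:%S")

print(merged_text)
```

Adding a fixed 3600 seconds per hour is simple, but in a time zone with daylight saving time it can be off by an hour on the days the clock changes.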

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Catalyst Optimizer

![Catalyst](https://www.quantiaconsulting.com/logos/img/catalyst-diagram.png)

## Optimized Logical Plan

The Catalyst Optimizer **rewrites our code**.

In the next section we will see **two examples** involving the rewriting of our filters.

The first is an **innocent mistake** that almost every new Spark developer makes.

The second "mistake" is... well... **really bad**, but Spark can fix it.

### Example #1: Innocent Mistake

I don't want any project that starts with **en.zero**.

There are **better ways of doing this**, as in it can be done with a single condition.

But we will make **8 passes** on the data **with 8 different filters**.

After every individual pass, we will **go back over the remaining dataset** to filter out the next set of records.

In [None]:
parquetFile = baseUri + "wikipedia_pagecount.parquet"

allDF = spark.read.parquet(parquetFile)

pass1 = allDF.filter( col("project") != "en.zero")
pass2 = pass1.filter( col("project") != "en.zero.n")
pass3 = pass2.filter( col("project") != "en.zero.s")
pass4 = pass3.filter( col("project") != "en.zero.d")
pass5 = pass4.filter( col("project") != "en.zero.voy")
pass6 = pass5.filter( col("project") != "en.zero.b")
pass7 = pass6.filter( col("project") != "en.zero.v")
pass8 = pass7.filter( col("project") != "en.zero.q")

print("Pass 1: {0:,}".format( pass1.count() ))
print("Pass 2: {0:,}".format( pass2.count() ))
print("Pass 3: {0:,}".format( pass3.count() ))
print("Pass 4: {0:,}".format( pass4.count() ))
print("Pass 5: {0:,}".format( pass5.count() ))
print("Pass 6: {0:,}".format( pass6.count() ))
print("Pass 7: {0:,}".format( pass7.count() ))
print("Pass 8: {0:,}".format( pass8.count() ))

**Logically**, the code above is the same as the code below.

The only real difference is that we are **not asking for a count** after every filter.

In [None]:
innocentDF = (spark.read.parquet(parquetFile)
  .filter( col("project") != "en.zero")
  .filter( col("project") != "en.zero.n")
  .filter( col("project") != "en.zero.s")
  .filter( col("project") != "en.zero.d")
  .filter( col("project") != "en.zero.voy")
  .filter( col("project") != "en.zero.b")
  .filter( col("project") != "en.zero.v")
  .filter( col("project") != "en.zero.q")
)
print("Final Count: {0:,}".format( innocentDF.count() ))

We don't even have to execute the code to see what is **logically** or **physically** taking place under the hood.

Here we can use the `explain(..)` command.

In [None]:
innocentDF.explain(True)

Of course, if we were to write this the correct way the first time (ignoring the fact that there are better methods), it would look something like this:

In [None]:
betterDF = (spark.read.parquet(parquetFile)
  .filter( (col("project").isNotNull()) &
           (col("project") != "en.zero") & 
           (col("project") != "en.zero.n") & 
           (col("project") != "en.zero.s") & 
           (col("project") != "en.zero.d") & 
           (col("project") != "en.zero.voy") & 
           (col("project") != "en.zero.b") & 
           (col("project") != "en.zero.v") & 
           (col("project") != "en.zero.q")
        )
)

print("Final: {0:,}".format( betterDF.count() ))

betterDF.explain(True)

In [None]:
parquetFile = baseUri + "wikipedia_pagecount.parquet"

allDF = spark.read.parquet(parquetFile)

pass1 = allDF.filter( col("project") != "en.zero")
pass2 = pass1.filter( col("project") != "en.zero.n")
pass3 = pass2.filter( col("project") != "en.zero.s")
pass4 = pass3.filter( col("project") != "en.zero.d")
pass5 = pass4.filter( col("project") != "en.zero.voy")
pass6 = pass5.filter( col("project") != "en.zero.b")
pass7 = pass6.filter( col("project") != "en.zero.v")
pass8 = pass7.filter( col("project") != "en.zero.q")

print("Final: {0:,}".format( pass8.count() ))

pass8.explain(True)

### Example #2: Bad Programmer

This time we are going to do something **REALLY** bad...

Even if the optimizer combines these filters into a single filter, **we still have five different tests** for any row whose `project` column doesn't have the value "whatever".

In [None]:
stupidDF = (spark.read.parquet(parquetFile)
  .filter( col("project") != "whatever")
  .filter( col("project") != "whatever")
  .filter( col("project") != "whatever")
  .filter( col("project") != "whatever")
  .filter( col("project") != "whatever")
)

stupidDF.explain(True)

**Note** `explain(..)` is not the only way to get access to this level of detail...
We can also see it in the **Spark UI**.

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png)  Take Home Messages

* The DataFrame API makes Spark "accessible" and reduces the complexity of Spark applications
* Caching a DataFrame avoids re-reading the data from the remote store on every action
* Laziness is important for optimization (especially in a complex pipeline)

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Let's Get Dirty

Let's put what we've just learned about DataFrames to work with [Lab2a](./lab/02a-touching-spark-part2-lab.ipynb) and [Lab2b](./lab/02b-touching-spark-part2-lab.ipynb)

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.