<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Databricks Learning" style="width: 600px; height: 240px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Touching Spark - Part 3

## Getting Started

Let's start again by creating a new SparkSession.

Note that we are working in a notebook environment: each notebook runs its own Spark session.

In [None]:
%load_ext autotime

In [None]:
import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io
from pyspark.sql.functions import *
from pyspark.sql.types import *

baseUri = "s3a://quantia-master/training/"

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)
qcutils.init_spark_session(spark)

sc = spark.sparkContext

spark

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Partitioning

A Partition is a logical chunk of a large data set.

Very often the data we are processing is separated into logical partitions (i.e., payments from the same country, ads displayed for a given cookie, etc.). In Spark, the partitions can be distributed among nodes.

Partitioning plays a key role in parallelization. Spark can run one concurrent task for every partition of an RDD/DataFrame/Dataset, up to the number of cores in the cluster.

The next section uses the [**Pageviews By Seconds** data set](https://dumps.wikimedia.org/other/pagecounts-raw/).

In [None]:
schema = StructType(
  [
    StructField("timestamp", StringType(), False),
    StructField("site", StringType(), False),
    StructField("requests", IntegerType(), False)
  ]
)

fileName = baseUri + "wikipedia_pageviews_by_second.tsv"

initialDF = (spark.read
  .option("header", "true")
  .option("sep", "\t")
  .schema(schema)
  .csv(fileName)
)

initialDF

We can see below that our data consists of...
* when the record was created
* the site (mobile or desktop) 
* and the number of requests

For every second of the day, there are two records: one for the mobile version of the site and one for the desktop version.

## Partitions vs Slots
### Slots/Cores

In most cases, if you created your cluster, you should know how many cores you have.

However, to check programmatically, you can use `SparkContext.defaultParallelism`.

For more information, see the doc <a href="https://spark.apache.org/docs/latest/configuration.html#execution-behavior" target="_blank">Spark Configuration, Execution Behavior</a>
> For operations like parallelize with no parent RDDs, it depends on the cluster manager:
> * Local mode: number of cores on the local machine
> * Mesos fine grained mode: 8
> * **Others: total number of cores on all executor nodes or 2, whichever is larger**

**Note:** In the Spark API the term **core** means a thread available for parallel execution. In the next sections, we will refer to it as a **slot** to avoid confusion with the number of cores in the underlying CPU(s).

In [None]:
cores = spark.sparkContext.defaultParallelism

print("You have {} cores, or slots.".format(cores))

### Partitions

* The second half of this question is: how many partitions of data do I have?
* That raises a follow-up question: why do I have that many?

If our goal is to process all our data (say 1M records) in parallel, we need to divide that data up.

If I have 8 **slots** for parallel execution, it would stand to reason that I want 1M / 8 or 125,000 records per partition.

Let's start answering the question. The code below:
* takes the `initialDF`
* converts it to an `RDD`
* and then asks the `RDD` for the number of partitions

In [None]:
partitions = initialDF.rdd.getNumPartitions()
print("Partitions: {0:,}".format( partitions ))

* It is **NOT** a coincidence that we have **8 slots** and **8 partitions**.
* Starting with Spark 2.0, a lot of optimizations have been added to the readers.
* Namely, the readers look at **the number of slots** and the **size of the data**, and make a best guess at how many partitions **should be created**.
* You can actually double the size of the data several times over and Spark will still read it in as **only 8 partitions**.
* Eventually the data gets so big that Spark forgoes this optimization and reads it in as more partitions, 10 in this case.
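The "best guess" described above can be approximated in a few lines of plain Python. This is only a rough sketch: it assumes a single splittable file and the default values of `spark.sql.files.maxPartitionBytes` (128MB) and `spark.sql.files.openCostInBytes` (4MB). The helper `estimate_read_partitions` is hypothetical, not a Spark API, and the real reader does more work (for example, packing several small files into one partition).

```python
import math

MB = 1024 * 1024

def estimate_read_partitions(file_bytes, slots,
                             max_partition_bytes=128 * MB,
                             open_cost_bytes=4 * MB):
    """Approximate partition count for ONE splittable file (simplified model)."""
    # Spread the bytes (plus a per-file open cost) evenly over the slots...
    bytes_per_slot = (file_bytes + open_cost_bytes) // slots
    # ...but never go above the max split size or below the open cost.
    split_bytes = min(max_partition_bytes, max(open_cost_bytes, bytes_per_slot))
    return math.ceil(file_bytes / split_bytes)

# Keep doubling the file size and watch the estimated partition count.
for size_mb in (100, 200, 400, 800, 2000):
    count = estimate_read_partitions(size_mb * MB, slots=8)
    print(size_mb, "MB ->", count, "partitions")
```

In this model the count stays at 8 while the file grows from 100MB to 800MB, and only increases once the 128MB cap on the split size takes over.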

But 8 partitions and 8 slots is just too easy.
  * Let's read in another copy of this same data.
  * A parquet file that was saved in 9 partitions.
  * This gives us an excuse to reason about the **relationship between slots and partitions**

In [None]:
# Create the alternate DataFrame. No explicit schema is needed:
# Parquet files store their schema, so reading it is really cheap.
alternateDF = (spark.read
  .parquet(baseUri + "wikipedia_pageviews_by_second_9.parquet")
)

print("Partitions: {0:,}".format( alternateDF.rdd.getNumPartitions() ))

Now that we have 9 partitions we have to ask...

What is going to happen when I perform an action like `count()` **with 8 slots and 9 partitions?**

In [None]:
alternateDF.count()

**Question #1:** Is it OK to let my code continue to run this way?

**Question #2:** What if it was a **REALLY** big file that read in as **200 partitions** and we had **256 slots**?

**Question #3:** What if it was a **REALLY** big file that read in as **200 partitions** and we had only **8 slots**, how long would it take compared to a dataset that has only 8 partitions?

**Question #4:** Given the previous example (**200 partitions** vs **8 slots**) what are our options (given that we cannot increase the number of partitions)?

### Use Every Slot/Core

With very few exceptions, you always want the number of partitions to be **a multiple of the number of slots**.

That way **every slot is used**.

That is, every slot is assigned a task.

With 9 partitions & 8 slots we have just guaranteed that our **job will take 2x** as long as it needs to.
* 10 seconds, for example, to process the first 8 partitions.
* Then, as soon as one of the first 8 is done, another 10 seconds to process the last partition while the other seven slots sit idle.

With 5 partitions & 8 slots we are **under-utilizing three of the eight slots**.
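The arithmetic behind these examples can be sketched as follows. It is a simplified model, assuming every task takes the same time (10 seconds, as in the example above) and that tasks run in waves of at most one task per slot. The function `schedule` is a hypothetical helper written for illustration.

```python
import math

def schedule(partitions, slots, seconds_per_task=10):
    """Model equal-sized tasks running in waves of at most `slots` tasks."""
    waves = math.ceil(partitions / slots)
    # Slots left idle during the final wave (0 when the division is exact).
    idle_in_last_wave = waves * slots - partitions
    return waves, waves * seconds_per_task, idle_in_last_wave

for partitions in (8, 9, 5, 16, 200):
    waves, seconds, idle = schedule(partitions, slots=8)
    print(f"{partitions:>3} partitions: {waves} wave(s), ~{seconds}s, "
          f"{idle} idle slot(s) in the last wave")
```

With 8 slots, 9 partitions need two waves and leave seven slots idle in the second one, while 16 or 200 partitions keep every slot busy in every wave.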

### More or Less Partitions?

As a **general guideline**, it is advised that each partition (when cached) is roughly 200MB.
* Size on disk is not a good gauge. For example...
* CSV files are large on disk but small in RAM: consider the string "12345", which takes 5 bytes as UTF-8 text (10 bytes as UTF-16), compared to the integer 12345, which takes only 4 bytes.
* Parquet files are highly compressed on disk but uncompressed in RAM.
* In a relational database... well... who knows?

The **200 comes from** the real-world experience of engineers and is **based largely on efficiency** rather than on resource limitations.

On an executor with a reduced amount of RAM you might need to lower that.

For example, at 8 partitions (corresponding to our max number of slots) & 200MB per partition
* That will use roughly **1.6GB** (8 x 200MB)
* If you have transformations that balloon the data size (such as Natural Language Processing) you are sure to run into problems.

**Question:** If I read in my data and it comes in as 10 partitions should I...
* reduce my partitions down to 8 (1x number of slots)
* or increase my partitions up to 16 (2x number of slots)

**Answer:** It depends on the size of each partition
* Read the data in. 
* Cache it. 
* Look at the size per partition.
* If you are near or over 200MB consider increasing the number of partitions.
* If you are under 200MB consider decreasing the number of partitions.

The goal will **ALWAYS** be to use as few partitions as possible while maintaining at least 1 x number-of-slots.
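One way to turn this guideline into a number is sketched below. It assumes you have already measured the total cached size (for example in the Storage tab of the Spark UI) and that roughly 200MB per partition is the target. The helper `suggest_partitions` is hypothetical and only encodes the rule of thumb from this section.

```python
import math

def suggest_partitions(cached_mb, slots, target_mb=200):
    """Smallest multiple of `slots` keeping each partition at or under `target_mb`."""
    # Minimum number of partitions needed to respect the size target.
    needed = max(1, math.ceil(cached_mb / target_mb))
    # Round up to a whole number of waves so that every slot gets a task.
    return math.ceil(needed / slots) * slots

print(suggest_partitions(cached_mb=1_000, slots=8))  # 5 needed, rounded up to 8
print(suggest_partitions(cached_mb=3_000, slots=8))  # 15 needed, rounded up to 16
```

The result is always at least 1 x the number of slots, and it only grows when the partitions would otherwise exceed the size target.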

## repartition(n) or coalesce(n)

We have two operations that can help address this problem: `repartition(n)` and `coalesce(n)`.

If you look at the API docs, `coalesce(n)` is described like this:
> Returns a new Dataset that has exactly numPartitions partitions, when fewer partitions are requested.<br/>
> If a larger number of partitions is requested, it will stay at the current number of partitions.

If you look at the API docs, `repartition(n)` is described like this:
> Returns a new Dataset that has exactly numPartitions partitions.

The key differences between the two are
* `coalesce(n)` is a **narrow** transformation and can only be used to reduce the number of partitions.
* `repartition(n)` is a **wide** transformation and can be used to reduce or increase the number of partitions.

So, if I'm increasing the number of partitions I have only one choice: `repartition(n)`

If I'm reducing the number of partitions I can use either one, so how do I decide?
* First off, `coalesce(n)` is a **narrow** transformation and performs better because it avoids a shuffle.
* However, `coalesce(n)` cannot guarantee even **distribution of records** across all partitions.
* For example, with `coalesce(n)` you might end up with **a few partitions containing 80%** of all the data.
* On the other hand, `repartition(n)` will give us a relatively **uniform distribution**.
* And `repartition(n)` is a **wide** transformation meaning we have the added cost of a **shuffle operation**.
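The difference in distribution can be illustrated with a toy model in plain Python. It is not Spark's actual algorithm: `toy_coalesce` and `toy_repartition` are hypothetical helpers that only mimic the two behaviours described above, merging whole partitions without moving records between groups versus dealing every record out again.

```python
def toy_coalesce(partitions, n):
    """Merge whole input partitions into n groups; no record leaves its group."""
    n = min(n, len(partitions))  # like coalesce, never increase the count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def toy_repartition(partitions, n):
    """Deal every record out round-robin, which moves data between partitions."""
    out = [[] for _ in range(n)]
    records = [r for part in partitions for r in part]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out

# Four skewed input partitions holding 80, 10, 5 and 5 records.
skewed = [list(range(size)) for size in (80, 10, 5, 5)]

print([len(p) for p in toy_coalesce(skewed, 2)])     # skew survives the merge
print([len(p) for p in toy_repartition(skewed, 2)])  # records are spread evenly
```

In the toy model, merging keeps 90 of the 100 records in one partition, while redistributing them yields two partitions of 50 records each, at the price of moving almost every record.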

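In our case, we want to go from 9 partitions down to 8. Either operation would work; we use `repartition(n)` here because it gives a uniform distribution of records across the 8 slots, at the cost of a shuffle.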

In [None]:
repartitionedDF = alternateDF.repartition(8)

print("Partitions: {0:,}".format( repartitionedDF.rdd.getNumPartitions() ))

## Cache, Again?

We just balanced the number of partitions to the number of slots.

Depending on the size of the data and the number of partitions, the shuffle operation can be fairly expensive (though necessary).

Let's cache the result of the `repartition(n)` call.
* Or more specifically, let's mark it for caching.
* The actual caching occurs later, once an action is performed.
* Or you could just execute a count to force materialization of the cache.

## spark.sql.shuffle.partitions

The next problem has to do with a side effect of certain **wide** transformations.

So far, we haven't hit any **wide** transformations other than `repartition(n)`
* But eventually we will... 
* Let's illustrate the problem that we will **eventually** hit
* We can do this by simply sorting our data.

In [None]:
(repartitionedDF
  .orderBy(col("timestamp"), col("site")) # sort the data
  .foreach(lambda x: None)               # literally does nothing except trigger a job
)

### Quick Detour
Something isn't right here...
* We only executed one action.
* But two jobs were triggered.
* If we look at the physical plan we can see the reason for the extra job.
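* The answer lies in the step **Exchange rangepartitioning**: to sort globally, Spark first samples the data to work out the range boundaries, and that sampling runs as a job of its own.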

In [None]:
(repartitionedDF
  .orderBy(col("timestamp"), col("site"))
  .explain()
)
print("-"*80)

(repartitionedDF
  .orderBy(col("timestamp"), col("site"))
  .limit(3000000)
  .explain()
)
print("-"*80)

In [None]:
(repartitionedDF
  .orderBy(col("timestamp"), col("site")) # sort the data
  .limit(100000)                          # keep only the first 100,000 rows
  .foreach(lambda x: None)                # literally does nothing except trigger a job
)
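Why would range partitioning need an extra pass over the data? The toy model below, in plain Python, may help: before any record can be routed to its range, the boundaries between ranges have to be estimated from a sample. This is only an illustration of the general technique, with hypothetical names and made-up keys; Spark's implementation differs in its details.

```python
import bisect
import random

def range_boundaries(sample, n):
    """Pick n - 1 cut points that split the sorted sample into n similar ranges."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // n] for i in range(1, n)]

random.seed(42)
keys = [random.randint(0, 86_399) for _ in range(10_000)]  # e.g. seconds of a day

# Pass 1 (the "extra" job): sample the keys to estimate the boundaries.
boundaries = range_boundaries(random.sample(keys, 500), n=8)

# Pass 2 (the sort itself): route each key to its range, then sort locally.
partitions = [[] for _ in range(8)]
for key in keys:
    partitions[bisect.bisect_right(boundaries, key)].append(key)
partitions = [sorted(p) for p in partitions]

# Concatenating the partitions in order gives a globally sorted result.
flattened = [k for p in partitions for k in p]
print(flattened == sorted(keys))
print([len(p) for p in partitions])
```

Because the boundaries come from a sample, the partitions end up similar in size but not identical.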

### The Real Problem

Back to the original issue...
* Rerun the original job (below).
* Take a look at the second job.
* Look at the 3rd Stage.
* Notice that it has 200 partitions!
* And this is our problem.

In [None]:
funkyDF = (repartitionedDF
  .orderBy(col("timestamp"), col("site")) # sorts the data
)                                         #
funkyDF.foreach(lambda x: None)           # literally does nothing except trigger a job

The problem is the number of partitions we ended up with.

Besides looking at the number of tasks in the final stage, we can simply print out the number of partitions

In [None]:
print("Partitions: {0:,}".format( funkyDF.rdd.getNumPartitions() ))

The engineers building Apache Spark chose a default value, 200, for the number of partitions produced by a shuffle.

After all our work to determine the right number of partitions they go and undo it on us.

The value 200 is actually based on practical experience, attempting to account for the most common scenarios to date.

Work is being done to intelligently determine this new value but that is still in progress.

For now, we can tweak it with the configuration value `spark.sql.shuffle.partitions`

We can see below that it is actually configured for 200 partitions

In [None]:
spark.conf.get("spark.sql.shuffle.partitions")

We can change the config setting with the following command

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "8")

Now, if we re-run our query, we will see that we end up with the 8 partitions we want post-shuffle.

In [None]:
betterDF = (repartitionedDF
  .orderBy(col("timestamp"), col("site")) # sort the data
)                                         #
betterDF.foreach(lambda x: None)          # literally does nothing except trigger a job

print("Partitions: {0:,}".format( betterDF.rdd.getNumPartitions() ))

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png)  Take Home Messages

* Partitioning is vital in distributed computing and Big Data frameworks.
* The number of partitions should be related to the number of slots/cores of the system.
* A good trade-off between the number of partitions and their size improves execution performance.
    * 200MB is, empirically, a good partition size.
    * Try to have as few partitions as possible (while keeping a multiple of the number of slots).
* Spark's default configuration may not fit your needs.
    * The default of 200 shuffle partitions may not fit; it really depends on the use case.

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.