## Spark introduction

The rate of execution per node in Spark remains quite uniform as the cluster grows, which indicates very good scaling performance.  
Spark Streaming handles near-real-time processing and comes in two APIs: the original Spark Streaming (DStreams) and Structured Streaming.

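Spark runs on top of a cluster manager, and there are three main options: Spark's built-in standalone manager, YARN (Hadoop's resource scheduler), and Mesos (a general-purpose cluster manager that allocates resources to tasks).  
RDDs (Resilient Distributed Datasets) are the main programming abstraction.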

### Spark SQL
Queries are written in SQL or HiveQL (the SQL-like language of Apache Hive, which traditionally compiles to MapReduce jobs). You can combine SQL with complex analytics in the same program.


### Spark Streaming
It sits on top of Spark Core, alongside the other libraries, and has a dedicated API. It combines well with batch computation and interactive queries.  
If a node goes down, Spark can recompute the lost work.


### Cluster managers
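The driver mostly coordinates: it gives instructions, and the code you write gets executed on worker nodes that may be anywhere in the cluster.  
The JVM works much like an interpreter: it executes bytecode rather than native machine code.  
The SparkContext is accessed through the SparkSession (`spark.sparkContext`).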

Spark Notebook and Apache Zeppelin are notebook applications for working with Spark.  

Running the application's driver inside YARN's Application Master (cluster deploy mode) gives you somewhat more fault tolerance.

## RDDs

A partition can contain duplicate elements. The data inside is immutable.

### Transformation of partitions

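`coalesce(n)` reduces the number of partitions without a shuffle, while `repartition(n)` can increase or decrease it but always performs a full shuffle.  
When coalescing, it is best to pick a target that divides the current number of partitions evenly, so that every new partition merges the same number of old ones.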
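
To see why an integer divisor helps, here is a plain-Python toy model of merging whole partitions without moving individual records. The `coalesce` function below is a hypothetical stand-in written for these notes, not Spark's implementation (Spark's grouping also takes data locality into account), and it assumes eight equally sized input partitions.

```python
def coalesce(partitions, n):
    """Toy model: merge whole partitions into n contiguous groups, no shuffle."""
    if not 1 <= n <= len(partitions):
        raise ValueError("n must be between 1 and the current partition count")
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Old partition i is moved wholesale into group i * n // len(partitions).
        merged[i * n // len(partitions)].extend(part)
    return merged


# Eight partitions holding 10 records each.
partitions = [list(range(i * 10, (i + 1) * 10)) for i in range(8)]

even = coalesce(partitions, 4)    # 8 / 4 = 2 old partitions per new one
uneven = coalesce(partitions, 3)  # 8 / 3 is not an integer

even_sizes = [len(p) for p in even]
uneven_sizes = [len(p) for p in uneven]
print(even_sizes)    # [20, 20, 20, 20]
print(uneven_sizes)  # [30, 30, 20]
```

With a target of 4 every new partition has the same size; with a target of 3 one partition ends up smaller than the others.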

### Key-values RDDs

Key-value RDDs hold `(key, value)` pairs, so they have a basic structure rather than being a schemaless collection of rows. **For key-based operations, records with the same key have to end up in the same partition**!
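
A minimal plain-Python sketch of hash partitioning shows how records with the same key are routed to the same partition. The helper names (`partition_index`, `partition_by_key`) and the sample data are made up for illustration, and `zlib.crc32` is used only to get a hash that is stable between runs; Spark uses its own hash function.

```python
import zlib
from collections import defaultdict


def partition_index(key, num_partitions):
    # crc32 is stable across Python runs (the built-in hash() of str is salted).
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_by_key(pairs, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_index(key, num_partitions)].append((key, value))
    return partitions


pairs = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("bob", 5)]
partitions = partition_by_key(pairs, 3)

# For every key, collect the set of partitions in which it occurs.
locations = defaultdict(set)
for idx, part in enumerate(partitions):
    for key, _ in part:
        locations[key].add(idx)

# Each key maps to exactly one partition index.
print(dict(locations))
```

Because the partition index depends only on the key, an aggregation per key can then run inside a single partition without looking at the others.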

### Side effects of partitioning

The `filter` transformation can result in unbalanced partitions in the output RDD.  
Imagine data like {name: persons_data}. With `map`, Spark discards the existing partitioning, because it cannot know whether your function left the keys unchanged, so a later key-based operation has to partition the data again. Use `mapValues` instead: Spark then knows that the keys stay the same and keeps the partitioning.
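
The reasoning can be sketched in plain Python, without Spark. The helpers below (`map_records`, `map_values`, `placement_is_valid`) are hypothetical names for this illustration, and integer user ids replace the names from the example so that hashing is deterministic. A function applied to whole records may rewrite the key and break the key-to-partition placement, while a function applied only to values cannot.

```python
def partition_index(key, num_partitions):
    # Small non-negative ints hash to themselves, so this is deterministic.
    return hash(key) % num_partitions


def partition_by_key(pairs, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_index(key, num_partitions)].append((key, value))
    return partitions


def placement_is_valid(partitions):
    """True if every record sits in the partition its key hashes to."""
    n = len(partitions)
    return all(
        partition_index(key, n) == idx
        for idx, part in enumerate(partitions)
        for key, _ in part
    )


def map_records(partitions, fn):
    # Like map: fn sees the whole (key, value) record and may change the key.
    return [[fn(record) for record in part] for part in partitions]


def map_values(partitions, fn):
    # Like mapValues: fn only sees the value, so the key cannot change.
    return [[(key, fn(value)) for key, value in part] for part in partitions]


pairs = [(user_id, {"age": 20 + user_id}) for user_id in range(6)]
partitions = partition_by_key(pairs, 3)

shifted = map_records(partitions, lambda kv: (kv[0] + 1, kv[1]))
older = map_values(partitions, lambda person: {**person, "age": person["age"] + 1})

print(placement_is_valid(partitions))  # True
print(placement_is_valid(shifted))     # False: keys moved, placement is stale
print(placement_is_valid(older))       # True: keys untouched
```

Spark cannot inspect an arbitrary function passed to `map`, so it has to assume the worst case shown by `shifted`.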

## RDDs vs DataFrames vs Datasets

DataFrames may have limited expressiveness; complex transformations are better expressed by RDDs.  
In PySpark, User Defined Functions are executed in a child Python process rather than inside the Spark executor: moving the data back and forth between the executor and the Python process takes time (the same holds for Python functions passed to RDD operations).  
DataFrames can compress data effectively.

A DataFrame is a special kind of Dataset (a Dataset of rows). With typed Datasets, Spark detects type errors at compile time, before the job runs.

```python
from pyspark.sql import SparkSession

# Build (or reuse) the session, the entry point to Spark SQL.
spark = (SparkSession.builder.appName("Python Spark SQL basic example")
         .config("spark.some.config.option", "some-value").getOrCreate())

# The SparkContext is reached through the session.
sc = spark.sparkContext
```

In [None]:
sc.t