#### RDD
An RDD (resilient distributed dataset) is the most basic data abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.
- Most basic data structure in Spark
- Fault tolerant
- Parallel
- Immutable
- Low-level transformation API (not recommended for most workloads; prefer DataFrames)

In [1]:
val r = sc.parallelize(Array(1, 2, 3, 4, 5)) 

Intitializing Scala interpreter ...

Spark Web UI available at http://ytuegemtincimbp:4040
SparkContext available as 'sc' (version = 2.4.4, master = local[*], app id = local-1574019237303)
SparkSession available as 'spark'


r: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:25


In [2]:
val filteredList = r.filter(k => k > 3).collect.toList

filteredList: List[Int] = List(4, 5)


#### DataFrame
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python, but with more optimizations under the hood.
- Structured data with named columns
- Built on top of the RDD API
- Usually faster than equivalent RDD code, thanks to the Catalyst optimizer
- Easy to use
- High-level transformation API
- Supports the SQL API
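
The SQL support mentioned in the list can be sketched as follows. This is a minimal example, assuming a local session; the view name `numbers` and the app name are made up for illustration. In the notebook `spark` already exists, so the builder lines can be dropped.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; getOrCreate reuses one if it already exists.
val spark = SparkSession.builder().master("local[*]").appName("sql-api-sketch").getOrCreate()
import spark.implicits._

// A one-column DataFrame; the column is named "value" by default.
val numbers = Seq(1, 2, 3, 4, 5).toDF()

// Register a temporary view (hypothetical name) so SQL can reference it.
numbers.createOrReplaceTempView("numbers")

// The same filter as where(col("value") > 4), expressed in SQL.
val viaSql = spark.sql("SELECT value FROM numbers WHERE value > 4")
viaSql.show()
```

Both forms go through the same optimizer, so the choice between SQL text and the DataFrame API is mostly a matter of style.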

In [3]:
val r = sc.parallelize(Array(1, 2, 3, 4, 5)) 
val df = r.toDF
val filteredDF = df.where(col("value") > 4)

r: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:27
df: org.apache.spark.sql.DataFrame = [value: int]
filteredDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]


In [4]:
filteredDF.show

+-----+
|value|
+-----+
|    5|
+-----+



In [5]:
val df = spark.read.option("header", "True").csv("file:///tmp/example")

df: org.apache.spark.sql.DataFrame = [cluster: string, number: string]


In [7]:
df.show(5)

+-------+------+
|cluster|number|
+-------+------+
|      c|   862|
|      d|   225|
|      e|   524|
|      d|   643|
|      b|   628|
+-------+------+
only showing top 5 rows



In [8]:
val g = df.groupBy(col("cluster")).agg(max(col("number")).alias("m"))

g: org.apache.spark.sql.DataFrame = [cluster: string, m: string]


In [10]:
g.show

+-------+---+
|cluster|  m|
+-------+---+
|      e|999|
|      d|999|
|      c|999|
|      b|999|
|      a|999|
+-------+---+
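
Because the CSV was read without a schema, `number` is a string column, so the `max` above compares values lexicographically (for example, "999" sorts above "1000"). A minimal sketch of a numeric aggregation follows, assuming the column holds integer text; the sample rows are made up for illustration and are not the contents of the original file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max}

// Assumption: a local session; in the notebook `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("numeric-max-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the CSV contents: both columns are strings.
val raw = Seq(("a", "999"), ("a", "1000"), ("b", "42")).toDF("cluster", "number")

// String max: values are compared character by character.
val asString = raw.groupBy(col("cluster")).agg(max(col("number")).alias("m"))

// Numeric max: cast the column to int before aggregating.
val asInt = raw.groupBy(col("cluster")).agg(max(col("number").cast("int")).alias("m"))
```

Alternatively, passing `option("inferSchema", "true")` to the reader lets Spark infer numeric types, at the cost of an extra pass over the data.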



In [11]:
g.repartition(1).write.csv("file:///tmp/output")

#### Join in Spark

In [25]:
val a = spark.read.option("header", "true").csv("file:///tmp/adata/").as("a")
val b = spark.read.option("header", "true").csv("file:///tmp/bdata/").as("b")

a: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, name: string]
b: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, age: string]


In [26]:
val c = a.join(b, col("a.id") === col("b.id")).select("a.id", "name", "age")

c: org.apache.spark.sql.DataFrame = [id: string, name: string ... 1 more field]


In [27]:
c.show(5)

+---+----+---+
| id|name|age|
+---+----+---+
|752| Max| 39|
|752| Max| 37|
|736|John| 36|
|677|Ivan| 30|
|677|Ivan| 36|
+---+----+---+
only showing top 5 rows



In [30]:
val crossA = a.crossJoin(b)

crossA: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]


In [31]:
crossA.show(5)

+---+----+---+---+
| id|name| id|age|
+---+----+---+---+
|752| Max|752| 37|
|752| Max|736| 36|
|752| Max|677| 36|
|752| Max| 14| 44|
|752| Max|657| 29|
+---+----+---+---+
only showing top 5 rows



Spark can execute a join with several strategies:
- Broadcast hash join
- Sort-merge join
- Shuffle hash join

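**Broadcast hash join**:
- Usually the fastest strategy when one side is small, because the larger side is not shuffled
- Suitable if one of the datasets fits in memory on every executor
- Builds a hash table from the smaller dataset for key lookups
- spark.sql.autoBroadcastJoinThreshold sets the maximum size (in bytes) of a table that is broadcast automatically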
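
The threshold and the explicit hint can be sketched as below. This is a minimal example under stated assumptions: the tables, column names, and the 50 MB value are hypothetical, and a local session is created only to keep the snippet self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Assumption: a local session; in the notebook `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("broadcast-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for a large and a small table.
val orders = Seq((1, 25.0), (2, 10.0), (1, 7.5)).toDF("user_id", "amount")
val users  = Seq((1, "Max"), (2, "Ivan")).toDF("id", "name")

// Raise the automatic threshold to 50 MB (the value is in bytes; -1 disables auto-broadcast).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)

// Or ask for the small side to be broadcast explicitly, regardless of the threshold.
val joined = orders.join(broadcast(users), orders("user_id") === users("id"))

// Print the physical plan to check which join strategy was chosen.
joined.explain()
```

The hint is useful when Spark's size estimate for the small side is missing or too pessimistic.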

**Sort-merge join**:
- May be slower than a broadcast hash join
- Suitable if neither dataset fits in memory
- Sorts both datasets on the join key before merging them
- spark.sql.join.preferSortMergeJoin is true by default
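
The sort-then-merge idea can be illustrated in plain Scala. This is a conceptual sketch only, not Spark's implementation; the function name `sortMergeJoin` and the sample rows are hypothetical.

```scala
// Conceptual sort-merge join on integer keys (inner join).
def sortMergeJoin[A, B](left: Seq[(Int, A)], right: Seq[(Int, B)]): Seq[(Int, A, B)] = {
  // Step 1: sort both sides by the join key.
  val l = left.sortBy(_._1).toVector
  val r = right.sortBy(_._1).toVector
  val out = Vector.newBuilder[(Int, A, B)]
  var i = 0
  var j = 0
  // Step 2: walk both sorted sides in lockstep.
  while (i < l.length && j < r.length) {
    val lk = l(i)._1
    val rk = r(j)._1
    if (lk < rk) i += 1
    else if (lk > rk) j += 1
    else {
      // Find the run of equal keys on each side, then emit every pairing.
      val iEnd = l.indexWhere(_._1 != lk, i) match { case -1 => l.length; case n => n }
      val jEnd = r.indexWhere(_._1 != rk, j) match { case -1 => r.length; case n => n }
      for (x <- i until iEnd; y <- j until jEnd) out += ((lk, l(x)._2, r(y)._2))
      i = iEnd
      j = jEnd
    }
  }
  out.result()
}

// Hypothetical sample rows.
val names = Seq((752, "Max"), (677, "Ivan"), (736, "John"))
val ages  = Seq((677, 30), (752, 39), (752, 37), (677, 36))
val merged = sortMergeJoin(names, ages)
```

Because each side is only scanned once after sorting, neither side has to be held in a hash table, which is why this strategy copes with inputs larger than memory.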

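**Shuffle hash join**:
- May be slower than a broadcast hash join
- May be faster than a sort-merge join
- Works like a broadcast hash join applied per partition: both sides are shuffled by the join key, then a hash table is built for each partition
- Requires spark.sql.join.preferSortMergeJoin=false, and spark.sql.autoBroadcastJoinThreshold may need tuning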
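
The per-partition idea can be illustrated in plain Scala. This is a conceptual sketch only, not Spark's implementation; the function name `shuffleHashJoin`, the partition count, and the sample rows are hypothetical.

```scala
// Conceptual shuffle hash join on integer keys (inner join).
def shuffleHashJoin[A, B](left: Seq[(Int, A)], right: Seq[(Int, B)], numPartitions: Int): Seq[(Int, A, B)] = {
  // "Shuffle": route each row to a partition derived from its key, so equal keys meet.
  def partition[T](rows: Seq[(Int, T)]): Map[Int, Seq[(Int, T)]] =
    rows.groupBy { case (k, _) => Math.floorMod(k, numPartitions) }

  val leftParts = partition(left)
  val rightParts = partition(right)

  (0 until numPartitions).flatMap { p =>
    // Build a hash table from the left rows of this partition only.
    val table: Map[Int, Seq[A]] =
      leftParts.getOrElse(p, Seq.empty).groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    // Probe it with the right rows that landed in the same partition.
    for {
      (k, b) <- rightParts.getOrElse(p, Seq.empty)
      a <- table.getOrElse(k, Seq.empty)
    } yield (k, a, b)
  }
}

// Hypothetical sample rows.
val userRows  = Seq((1, "Max"), (2, "Ivan"))
val orderRows = Seq((1, 25.0), (2, 10.0), (1, 7.5))
val hashed = shuffleHashJoin(userRows, orderRows, numPartitions = 2)
```

Each hash table only covers one partition's share of the keys, so the build side does not need to fit in memory as a whole.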

**Benchmark**

The benchmark joins two tables:

```
+--------+------------+---------+
| table  | records    | columns |
+--------+------------+---------+
| users  | 1 million  | 4       |
+--------+------------+---------+
| orders | 10 million | 4       |
+--------+------------+---------+
```

Here are the results:

```
+----------------+----------+-------------+
| type           | time (s) | peak memory |
+----------------+----------+-------------+
| Broadcast Hash | 20       | 424 MB      |
+----------------+----------+-------------+
| Sort-Merge     | 8        | 4.2 MB      |
+----------------+----------+-------------+
| Shuffle Hash   | 34       | 850 MB      |
+----------------+----------+-------------+
```