# Big Data
Databricks + Spark  
2020.03.06

In a notebook cell, render Markdown by starting the cell with `%md`.

## Tasks

Register at Databricks and use a Community Edition instance. It's free; sign up [here](https://community.cloud.databricks.com/).

Create a notebook, attach a cluster to it, add a file to the cluster, then run some Actions, Transformations, and Functions on it.

**Note - I believe a cluster is shut down after 2 hours of idling. You can clone it and reattach the notebook to the clone, though.**

## Overview

### Apache Spark

The creators of [Apache Spark](https://en.wikipedia.org/wiki/Apache_Spark) went on to found Databricks.

> Apache Spark is a cluster computing platform designed to be *fast and general purpose*. 

> 1. **Speed**, Spark extends the popular MapReduce model to efficiently support more types of computations, including **interactive queries and stream processing**. Speed is important in processing large datasets, as it means the difference between exploring data interactively and waiting minutes or hours. One of the main features Spark offers for speed is the ability to run **computations in memory**...

> 2. **Generality**, Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including **batch applications, iterative algorithms, interactive queries, and streaming. By supporting these workloads in the same engine, Spark makes it easy and inexpensive to combine different processing types, which is often necessary in production data analysis pipelines. In addition, it reduces the management burden of maintaining separate tools.**

> 3. **Highly accessible**, offering **simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries**. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.

[Learning Spark](https://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624). Page 1. Note - this book was written by 4 Databricks employees/founders.

### Databricks

[Databricks](https://en.wikipedia.org/wiki/Databricks) is a company founded by the original Apache Spark creators, and grew out of the AMPLab project at Berkeley which was involved in creating Spark. 

> Databricks develops a **web-based platform for working with Spark**, that provides **automated cluster management and IPython-style notebooks**. In addition to building the Databricks platform, the company is co-organizing **massive open online courses**.

They have a huge trove of videos on [YouTube](https://www.youtube.com/channel/UC3q8O3Bh2Le8Rj1-Q-_UUbA/videos).

#### Spark Basics

Every Spark application consists of a **driver program that runs the user’s main function and executes various parallel operations on a cluster**. The main abstraction Spark provides is a resilient distributed dataset, **RDD, which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel**. 

**RDDs are created by starting with a file in the Hadoop file system, and transforming it**. Users may also ask Spark to **persist an RDD in memory**, allowing it to be reused efficiently across parallel operations. Finally, **RDDs automatically recover from node failures**.

A second abstraction in Spark is **shared variables that can be used in parallel operations**. By default, when Spark runs a function in parallel as a set of tasks on different nodes, **it ships a copy of each variable used in the function to each task**. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: **broadcast variables, which can be used to cache a value in memory on all nodes**, and **accumulators, which are variables that are only “added” to, such as counters and sums**.

[Source](https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview)
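To make the copy-versus-share distinction concrete, here is a plain-Python sketch that needs no Spark. The names `partitions`, `counter`, and `run_task` are made up for illustration; it only models the behaviour described above and is not how Spark implements it.

```python
import copy

partitions = [[1, 2], [3, 4], [5]]  # hypothetical data split across three tasks

# Ordinary variable: each task works on its own copy, so its updates are lost.
counter = {"seen": 0}

def run_task(partition, shipped):
    """Model of a task: it mutates its private copy of the shipped variable."""
    shipped["seen"] += len(partition)
    return shipped["seen"]

for part in partitions:
    run_task(part, copy.deepcopy(counter))  # a copy is shipped to each task

# Accumulator-style: tasks only report amounts to add; the driver merges them.
updates = [len(part) for part in partitions]
accumulated = sum(updates)
```

After this runs, `counter["seen"]` is still 0 on the "driver", while `accumulated` is 5. In PySpark the corresponding calls are `sc.broadcast(value)` and `sc.accumulator(0)`.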

#### RDDs

Resilient distributed dataset (RDD) - a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs:

1 - Parallelizing an existing collection in your driver program. Parallelized collections are created by calling `sc.parallelize` on an existing iterable or collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

```python
# sc is the SparkContext; Databricks notebooks create it for you
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
```

2 - Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, etc. This method takes a file URI and reads it as a collection of lines.

```python
distFile = sc.textFile("data.txt")
```

[Source](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds)

#### RDD Operations

RDDs support two types of operations:

1. Transformations - create a new dataset from an existing one.
2. Actions - return a value to the driver program after running a computation on the dataset.

For example, **map is a transformation** that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, **reduce is an action** that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

**Transformations are lazy, in that they do not compute their results right away**. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.

By default, **each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory** using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
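The laziness and recomputation described above can be modelled in a few lines of plain Python. `LazyDataset` is a hypothetical toy class, not a Spark API; it only records `map` functions and runs them when `collect()` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, funcs=()):
        self._source = source        # base data (a list here, a file in Spark)
        self._funcs = tuple(funcs)   # recorded map functions, not yet applied
        self._persist = False        # set by persist()
        self._cached = None          # filled in by the first action if persisted

    def map(self, func):
        # Transformation: returns a new dataset and computes nothing.
        return LazyDataset(self._source, self._funcs + (func,))

    def persist(self):
        self._persist = True
        return self

    def collect(self):
        # Action: this is where the recorded functions actually run.
        if self._cached is not None:
            return self._cached
        result = list(self._source)
        for func in self._funcs:
            result = [func(x) for x in result]
        if self._persist:
            self._cached = result
        return result


calls = []

def triple(x):
    calls.append(x)  # record every invocation so the work can be counted
    return x * 3

lazy = LazyDataset([1, 2, 3]).map(triple)
calls_before_action = len(calls)    # map() alone ran nothing
lazy.collect()
lazy.collect()
calls_without_persist = len(calls)  # recomputed for each action

calls.clear()
kept = LazyDataset([1, 2, 3]).map(triple).persist()
kept.collect()
kept.collect()
calls_with_persist = len(calls)     # second action reused the stored result
```

In this toy, `triple` runs 0 times before the first action, 6 times across two actions without `persist()`, and 3 times with it.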

## Part 1 - [RDD Actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions)

Actions return a value to the driver program after running a computation on the dataset.

### count()

Return the number of elements in the dataset.

In [5]:
# Create an RDD 
data_variable = [9, 10, 5, 1, 2]
rdd = sc.parallelize(data_variable)
rdd.count()

### reduce(func)

Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).  

The function should be commutative and associative so that it can be computed correctly in parallel. Example of commutative and associative function:  
a + b = b + a and a + (b + c) = (a + b) + c

In [7]:
# Create an RDD
data_variable = [9, 10, 5, 1, 2]
rdd = sc.parallelize(data_variable)

# test reduce(func) API 
rdd.reduce(lambda a, b: a + b)

In [8]:
# Create an RDD
data_variable = [9, 10, 5, 1, 2]
rdd = sc.parallelize(data_variable)

# test reduce(func) API 
rdd.reduce(lambda a, b: a if a > b else b)
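Why the function needs to be commutative and associative: Spark reduces each partition separately and then combines the partial results. The sketch below imitates that in plain Python with a hypothetical helper, `reduce_in_partitions`, that splits the list into contiguous chunks. Real Spark decides the partitioning itself, so treat the chunking here as an assumption.

```python
from functools import reduce

data = [9, 10, 5, 1, 2]

def reduce_in_partitions(values, func, num_partitions):
    """Reduce each contiguous chunk first, then reduce the partial results."""
    size = -(-len(values) // num_partitions)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)

add = lambda a, b: a + b
subtract = lambda a, b: a - b

# Same data, same function, different number of partitions.
sums = {n: reduce_in_partitions(data, add, n) for n in (1, 2, 3)}
diffs = {n: reduce_in_partitions(data, subtract, n) for n in (1, 2, 3)}
```

Addition gives 27 for every partition count, while subtraction gives -9, -5, and -7 for 1, 2, and 3 partitions. With a function like subtraction, the answer would depend on how the cluster happened to split the data.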

### collect()

Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

In [10]:
# Create an RDD
data_variable = [9, 10, 5, 1, 2]
rdd = sc.parallelize(data_variable)

# test collect() API 
rdd.collect()

### take(n)
Return an array with the first n elements of the dataset.

In [12]:
# Create an RDD
data_variable = [9, 10, 5, 1, 2]
rdd = sc.parallelize(data_variable)

# test take(n) API
rdd.take(2)

### saveAsTextFile(path)

Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

In [14]:
# Create an RDD
data = [9, 10, 5, 1, 2]
rdd = sc.parallelize(data)

# test saveAsTextFile() API
# rdd.saveAsTextFile("/tmp/file1.txt")
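Note that the `path` argument names a directory, not a single file: each partition is written as its own `part-NNNNN` file inside it. The following plain-Python sketch imitates that layout; `save_as_text_files` is a hypothetical helper, and the two-partition split is an assumption.

```python
import os
import tempfile

def save_as_text_files(partitions, path):
    """Write one part-NNNNN file per partition into the directory `path`."""
    os.makedirs(path)  # raises if the target already exists
    for index, partition in enumerate(partitions):
        part_name = os.path.join(path, f"part-{index:05d}")
        with open(part_name, "w") as handle:
            for element in partition:
                handle.write(f"{element}\n")  # str() of each element, one per line

# Use a fresh temporary directory so the example cleans up easily.
out_dir = os.path.join(tempfile.mkdtemp(), "file1.txt")
save_as_text_files([[9, 10], [5, 1, 2]], out_dir)
written = sorted(os.listdir(out_dir))
```

Here `written` is `['part-00000', 'part-00001']`, even though the target was called `file1.txt`. As far as I know, real Spark output usually also contains a `_SUCCESS` marker file, and the call fails if the directory already exists.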

## Part 2 - [RDD Transformations](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations)

Transformations create a new dataset from an existing one.

### map(func)
Return a new distributed dataset formed by passing each element of the source through a function func.

In [16]:
# Create an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# apply map(func) transformation to the RDD
rdd1 = rdd.map(lambda x: x * 3)

# show results of the new rdd
rdd1.collect()

### flatMap(func)

Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item; in Python, any iterable such as a list).

In [18]:
# Create RDD using sample data
rdd = sc.parallelize([1, 2, 3, 4, 5])

# apply flatMap(func) transformation to the RDD
rdd2 = rdd.flatMap(lambda x: [x, x * 3])

# show results of a new rdd
rdd2.collect()
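The difference from `map` is easiest to see side by side. This is a plain-Python imitation of the two transformations using the same function as the cell above; it runs without Spark.

```python
from itertools import chain

data = [1, 2, 3, 4, 5]
func = lambda x: [x, x * 3]

# map keeps one output per input, so the result is a list of lists.
mapped = [func(x) for x in data]

# flatMap concatenates the per-element outputs into one flat sequence.
flat_mapped = list(chain.from_iterable(func(x) for x in data))
```

`mapped` is `[[1, 3], [2, 6], [3, 9], [4, 12], [5, 15]]`, while `flat_mapped` is `[1, 3, 2, 6, 3, 9, 4, 12, 5, 15]`.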

### filter(func)
Return a new dataset formed by selecting those elements of the source on which func returns true.

In [20]:
# Create RDD using sample data
rdd = sc.parallelize([1, 2, 3, 4, 5])

# apply filter(func) transformation to the RDD
rdd.filter(lambda x: x % 2 == 0).collect()

### Key-value pairs

A key/value RDD is an RDD whose elements are pairs of a key and a value. Each element should be a tuple such as `(1, 2)`; you can then apply key-value pair operations, for example `join()`, `groupByKey()`, or `reduceByKey()`.

In [22]:
# Setup the textFile RDD to read the README.md file
# Note: this is lazy
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")

# split each line of readme file to words first, and then make a tuple of (word, 1)
textFile.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).take(1)

### reduceByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

In [24]:
# Setup the textFile RDD to read the README.md file
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")

# split each line of readme file to words first, and then make a tuple of word, 1
rdd_key = textFile.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))

# reduce by key

rdd_key.reduceByKey(lambda x, y: x + y).take(3)
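What `reduceByKey` computes can be written out in plain Python. `reduce_by_key` is a hypothetical helper and the input line is made up; the sketch models the result only, not the distributed execution.

```python
def reduce_by_key(pairs, func):
    """Fold the values of each key together with func."""
    merged = {}
    for key, value in pairs:
        # First value for a key is kept as-is; later ones are folded in.
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

line = "to be or not to be"  # hypothetical input line
pairs = [(word, 1) for word in line.split(" ")]
counts = reduce_by_key(pairs, lambda x, y: x + y)
```

`counts` is `[('to', 2), ('be', 2), ('or', 1), ('not', 1)]`. Spark makes no promise about the order of the keys, so `take(3)` on the real RDD may return any three of them.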

### union

Return a new dataset that contains the union of the elements in the source dataset and the argument.

In [26]:
# create some rdds
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = rdd1.map(lambda x: x * 2)

# combine these rdds with a union
rdd1.union(rdd2).collect()

### groupByKey()
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.

In [28]:
# Setup the textFile RDD to read the README.md file
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")

# split each line of the readme file to words, and then make a tuple of word, 1.
rdd_key = textFile.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))

# group by key
rdd_gp = rdd_key.groupByKey()

for (key, value) in rdd_gp.take(5):
  print(key, sum(value))
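The performance remark above comes down to how many records have to move between nodes. Below is a rough plain-Python model, assuming two partitions of made-up `(word, 1)` pairs; the helper names are hypothetical. `groupByKey` ships every pair, while `reduceByKey` can combine within each partition first.

```python
# Hypothetical (word, 1) pairs already split across two partitions.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

def shuffled_by_group(parts):
    """groupByKey model: every pair is shuffled unchanged."""
    return [pair for part in parts for pair in part]

def shuffled_by_reduce(parts, func):
    """reduceByKey model: combine inside each partition before shuffling."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    return shuffled

def merge(pairs, func):
    """Final reduce step applied to whatever was shuffled."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals

add = lambda x, y: x + y
group_records = shuffled_by_group(partitions)
reduce_records = shuffled_by_reduce(partitions, add)
```

Both routes end with the same totals (`{'a': 4, 'b': 3}` after `merge`), but the grouping route moves 7 records and the reducing route moves 4. On a real dataset with many repeated keys the gap is much larger.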

### join
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

In [30]:
# create two RDDs
rdd1 = sc.parallelize([('rock', 1), ('paper', 2), ('scissor', 1), ('hammer', 3)])
rdd2 = sc.parallelize([('hammer', 2), ('paper', 3), ('water', 1), ('fire', 3)])

# perform left outer join
rdd2.leftOuterJoin(rdd1).collect()
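To see what to expect from the left outer join, here is the same logic in plain Python on the same two lists. `left_outer_join` is a hypothetical helper that models the result; Spark may return the pairs in a different order.

```python
def left_outer_join(left, right):
    """Pair each left (key, v) with every matching right value, or None."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    joined = []
    for key, value in left:
        # Keys missing on the right still appear, paired with None.
        for other in lookup.get(key, [None]):
            joined.append((key, (value, other)))
    return joined

rdd1_data = [('rock', 1), ('paper', 2), ('scissor', 1), ('hammer', 3)]
rdd2_data = [('hammer', 2), ('paper', 3), ('water', 1), ('fire', 3)]

# Mirrors rdd2.leftOuterJoin(rdd1): rdd2 is the left side.
result = left_outer_join(rdd2_data, rdd1_data)
```

Every key from the left side (`rdd2`) is kept: `result` is `[('hammer', (2, 3)), ('paper', (3, 2)), ('water', (1, None)), ('fire', (3, None))]`. Keys that exist only in `rdd1` (`rock`, `scissor`) are dropped.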

### stats()

Return the count, mean, standard deviation, max and min of the RDD's elements in one operation.

In [32]:
# stats() is an action: it returns count, mean, std dev, max and min in one operation
sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9]).stats()
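As a cross-check on the numbers, the same summary can be computed with the standard library. As far as I know, the `stdev` reported by `stats()` is the population standard deviation (divide by n); the sample version (divide by n - 1) is included for comparison.

```python
import math
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "stdev": statistics.pstdev(values),        # population: divide by n
    "sample_stdev": statistics.stdev(values),  # sample: divide by n - 1
    "max": max(values),
    "min": min(values),
}
```

For these nine values the mean is 5, the population standard deviation is about 2.582, and the sample standard deviation is about 2.739.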

### sample(withReplacement, fraction, seed=None)

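Sample a fraction `fraction` of the data, with or without replacement, using a given random number generator seed.

Parameters:
- `withReplacement` - whether elements can be sampled multiple times (replaced when sampled out)
- `fraction` - expected size of the sample as a fraction of this RDD's size. Without replacement: the probability that each element is chosen, must be in [0, 1]. With replacement: the expected number of times each element is chosen, must be >= 0.
- `seed` - seed for the random number generator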

In [34]:
# Setup the textFile RDD to read the README.md file
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")

# split each line of the README file into words
rdd_key = textFile.flatMap(lambda x: x.split(' '))
rdd_key.sample(False, 0.02, 3).collect()
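One thing that is easy to miss: `fraction` is an expected size, not an exact one. The plain-Python sketch below models sampling without replacement as an independent coin flip per element. `bernoulli_sample` and the word list are made up for illustration, and it will not reproduce the elements Spark picks for the same seed.

```python
import random

def bernoulli_sample(elements, fraction, seed=None):
    """Keep each element independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in elements if rng.random() < fraction]

# Hypothetical stand-in for the words of the README.
words = [f"word{i}" for i in range(1000)]

first = bernoulli_sample(words, 0.02, seed=3)
second = bernoulli_sample(words, 0.02, seed=3)
```

With 1000 words and `fraction=0.02` the sample holds roughly 20 elements, but the exact count varies with the seed. Reusing the same seed gives the same sample, which is what makes a sampled run repeatable.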

### Functions 
Create a function and use it for a transformation.

In [36]:
# create a function that labels a line "Small" (under 15 characters) or "Large"

def strLenType(line):
  if len(line) < 15:
    return "Small"
  else:
    return "Large"

# Setup the textFile RDD to read the README.md file
textFile = sc.textFile("/databricks-datasets/samples/docs/README.md")

# label each line of the README by its length
textFile.map(lambda x: strLenType(x)).take(5)

## Part 3 - Put several things together

### Load a file
Read an uploaded file into an RDD of lines and look at the first few.

In [38]:
# File location
file = "/FileStore/tables/cities.txt"
cities = sc.textFile(file)
cities.take(3)

Remove the header row, create a key value pair and reduce by key.

In [40]:
# drop the header row
header = cities.first()
cities1 = cities.filter(lambda row: row != header)
print(cities1.take(4))  # show the first 4 data rows

# split each row into fields, then create a (state, 1) pair for each record
cities2 = cities1.map(lambda row: row.split(","))
cities_key = cities2.map(lambda row: (row[9], 1))  # assumes index 9 holds the state

# count the records per state
cities_key.reduceByKey(lambda x, y: x + y).take(10)
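One caveat with `row.split(",")`: it breaks when a field contains a quoted comma. The sketch below shows the same header-drop and count-by-key steps in plain Python on a tiny made-up file. The two-column layout and `STATE_INDEX` are assumptions, since the real `cities.txt` layout is not shown here.

```python
import csv
from collections import Counter

# Hypothetical file contents, one string per line.
lines = [
    'city,state',
    '"Washington, D.C.",DC',
    'Salem,OR',
    'Portland,OR',
]

header = lines[0]
rows = [line for line in lines if line != header]  # same filter as the cell above

naive_fields = rows[0].split(",")  # splits inside the quoted city name
parsed = list(csv.reader(rows))    # the csv module respects the quotes

STATE_INDEX = 1  # assumption for this toy data; the cell above uses row[9]
state_counts = Counter(row[STATE_INDEX] for row in parsed)
```

The naive split turns the first row into three fields instead of two, while `csv.reader` returns `['Washington, D.C.', 'DC']`. If the real file has quoted fields, reading it with `spark.read.csv(path, header=True)` and working with the resulting DataFrame is probably the safer route.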