# Scalable Data Science Fundamentals

#### Storage Options

**SQL** - well-established open standard, fast index access, high data normalization, costly, difficult to scale, schema changes require DDL

**noSQL** - dynamic schema, linearly scalable, low storage cost, no data normalization/integrity constraints, less established, slower than SQL

**ObjectStorage** - schema-less, linearly scalable, cheap

In general, SQL is suitable for small amounts of data that require a stable schema. For larger amounts of data with high ingestion rates or frequent schema changes, noSQL or ObjectStorage is more appropriate.
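
The schema trade-off above can be sketched with Python's standard library alone. This is a minimal illustration, not a benchmark: the in-memory SQLite table stands in for a SQL store, a list of JSON strings stands in for a schema-less document store, and the `readings` table and its field names are made up for the example.

```python
import json
import sqlite3

# Relational side: the schema is fixed up front with DDL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, temperature REAL)")
conn.execute("INSERT INTO readings VALUES (1, 21.5)")

# Adding a field means changing the schema with another DDL statement.
conn.execute("ALTER TABLE readings ADD COLUMN humidity REAL")
conn.execute("INSERT INTO readings VALUES (2, 19.0, 0.45)")
sql_rows = conn.execute(
    "SELECT sensor_id, temperature, humidity FROM readings ORDER BY sensor_id"
).fetchall()
conn.close()

# Document side: each record carries its own structure, so a new field
# needs no DDL, but nothing enforces that records agree with each other.
documents = [
    json.dumps({"sensor_id": 1, "temperature": 21.5}),
    json.dumps({"sensor_id": 2, "temperature": 19.0, "humidity": 0.45}),
]
doc_rows = [json.loads(d) for d in documents]

print(sql_rows)
print(doc_rows)
```

The row inserted before the `ALTER TABLE` comes back with a null `humidity`, while the first JSON document simply has no such key. That is the practical difference between an enforced schema and a dynamic one.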

#### ApacheSpark

ApacheSpark handles the parallelization of distributed data and processing across many compute (“worker”) nodes. While the underlying execution engine in ApacheSpark is implemented in Scala on top of a Java Virtual Machine (JVM), it has connectors for multiple programming languages including Python, R, Java and Scala. The various languages come with their own advantages and disadvantages, with Python and R falling on the easier-to-learn side of the spectrum, at the cost of performance.

Multiple JVM instances can work in tandem on a single worker node, with the general rule of one JVM per CPU core. For example, a cluster with 100 nodes, 4 CPUs per node, 16 CPU cores per CPU and 4 hyperthreads per core could run 100 × 4 × 16 × 4 = 25,600 threads in parallel. Storage can either be attached over a fast network connection (the off-node storage approach), or hard drives can be connected directly to the worker nodes (the Just a Bunch Of Disks, or JBOD, approach). The second approach requires an additional software component, the Hadoop Distributed File System (HDFS), to combine the disparate storage capacities and present them as one virtual file system.
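
The thread count in the example is a plain product of the cluster dimensions. A small sketch of the arithmetic, with variable names chosen only for illustration:

```python
# Cluster dimensions taken from the example above.
nodes = 100
cpus_per_node = 4
cores_per_cpu = 16
threads_per_core = 4

# Physical cores across the whole cluster.
total_cores = nodes * cpus_per_node * cores_per_cpu

# Hardware threads that could run at the same time.
total_threads = total_cores * threads_per_core

print(total_cores, total_threads)
```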

A **Resilient Distributed Dataset (RDD)** is a distributed, immutable collection that resides in the main memory of the worker nodes. RDDs are lazy: data is not read from the underlying storage system, and transformations are not computed, until a result is actually needed.

Distribute data across Spark nodes: `rdd = sc.parallelize(range(100))`

Trigger the execution of a Spark job which is divided into individual tasks that are executed in parallel across the cluster: `rdd.count()`

View the first ten elements of the RDD: `rdd.take(10)`

Copy the contents of the RDD to the local ApacheSpark driver JVM: `rdd.collect()` (be careful with large datasets, as this can crash the driver JVM by exceeding its memory capacity)
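
The lazy behaviour described above can be illustrated without a cluster. The sketch below is only an analogy built on Python generators, not Spark's actual mechanism: `read_records` is a hypothetical stand-in for a storage system, and the `reads` list records how many elements were really fetched.

```python
from itertools import islice


def read_records(log):
    """Yield records one at a time, noting every read in `log`."""
    for i in range(100):
        log.append(i)
        yield i


reads = []

# Building the pipeline is like defining a transformation: nothing is read yet.
pipeline = (x + 1 for x in read_records(reads))
reads_before = len(reads)

# Asking for ten results is like an action such as take(10):
# only the elements that are needed get read.
first_ten = list(islice(pipeline, 10))
reads_after = len(reads)

print(reads_before, reads_after, first_ten)
```

Consuming the whole generator into a list would be the analogue of `rdd.collect()`: every element is read and held in local memory at once.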

#### Functional Programming (FP)

FP is rooted in lambda calculus, in which computations are expressed as anonymous functions. Scala is a comparatively recent language with strong FP support, joining the likes of Haskell, while also supporting procedural and object-oriented programming. ApacheSpark parallelizes computations by shipping such functions to the worker nodes, where they are applied to the distributed data.


**Add 1 to each element of a list:**

```
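# `sc` is the SparkContext, typically predefined in notebook environments
rdd = sc.parallelize(range(100))
rdd.map(lambda x: x+1).take(10)  # apply the lambda to every element, return the first ten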
```

**Sum elements of a list:**

```
# Combine the elements pairwise with the lambda; sums the integers 1 to 100
sc.parallelize(range(1,101)).reduce(lambda a,b: a+b)
```
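
For comparison, the same two operations can be written with plain Python's built-in `map` and `functools.reduce`. This is a local, single-process sketch meant to show that the lambdas are identical to the ones passed to the RDD. It does not involve Spark at all.

```python
from functools import reduce

# Add 1 to each element, then keep the first ten results.
plus_one = list(map(lambda x: x + 1, range(100)))[:10]

# Sum the integers 1 to 100 by combining them pairwise.
total = reduce(lambda a, b: a + b, range(1, 101))

print(plus_one)
print(total)
```

One difference matters when moving to a cluster: Spark's `reduce` combines partial results from different partitions, so the function passed to it should be associative and commutative, as addition is here.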

#### ApacheSparkSQL

ApacheSparkSQL wraps the RDD with a DataFrame object, abstracting the RDD API into a more familiar relational interface. A query against a DataFrame is parsed into an abstract syntax tree, which the Catalyst optimizer transforms into an optimized logical plan and then into a physical execution plan. The result is high-performance code that is also more intuitive to write.