# Spark Overview

## Who, what, when, where, why, how?

* framework / platform for dealing with big data
* big data: the 4 Vs
    * velocity
    * volume
    * variety
    * veracity
* spark streaming -- databricks
* when data is too big to fit in memory on one machine
    * memory vs storage: RAM vs disk (HDD / SSD)
* when would I use spark? When it's already set up
* alternatives: hadoop, dask; spark is the most popular
* how do we access spark? client libraries

## Spark Architecture (Sparkitecture)

* written in scala, runs on the JVM (Java Virtual Machine)
* client libraries that talk to the running spark instance
    * pyspark
    * SparkR / sparklyr (R)
    * ends up as the same spark code
    * Spark SQL
* Computers in a Spark Cluster
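    * **Driver**: your laptop (or whichever machine runs the spark *application*); it plans the work and sends tasks to the executors
    * **Cluster Manager / Master**: a machine that organizes everything: it tracks the cluster's resources and hands out executors
    * **Executors**: computers that "do the work": they run the tasks the driver sends them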
* Local Mode: everything on one machine; what we'll use for this course
    * less common in practice
    * spark code for a full cluster and local mode is exactly the same
    * useful for data that fits in storage, but not memory
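
To make the driver / executor split concrete, here is a minimal sketch in plain Python (not Spark). It assumes a thread pool can stand in for the executors, and the names `make_partitions` and `partial_sum` are made up for this illustration. The "driver" code splits the data into partitions, hands them to the workers, and combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(data, n_partitions):
    """Split data into n_partitions contiguous, roughly equal chunks."""
    size, extra = divmod(len(data), n_partitions)
    partitions, start = [], 0
    for i in range(n_partitions):
        # the first `extra` partitions get one additional element
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def partial_sum(partition):
    """The work a single 'executor' does on one partition."""
    return sum(partition)


# "driver" side: split the data, distribute the work, combine the results
data = list(range(1, 101))
partitions = make_partitions(data, 4)

# 2 worker threads play the role of 2 executors (an analogy only)
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_results = list(pool.map(partial_sum, partitions))

total = sum(partial_results)
print(partial_results, total)
```

In local mode the same idea applies, except that the driver and the executors all live on one machine.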

In [1]:
import numpy as np
import pandas as pd

In [7]:
n = 10_000_000

df = pd.DataFrame({
    'x1': np.random.randn(n),
    'x2': np.random.randn(n),
    'x3': np.random.randn(n),
    'x4': np.random.randn(n),
    'x5': np.random.choice(list('abcdef'), n),
    'x6': np.random.choice(list('abcdef'), n),
})

df.info()
df.to_csv('demo.csv')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   x1      float64
 1   x2      float64
 2   x3      float64
 3   x4      float64
 4   x5      object 
 5   x6      object 
dtypes: float64(4), object(2)
memory usage: 457.8+ MB


In [8]:
!ls -lh demo.csv

-rw-r--r-- 1 zach staff 863M Aug  3 14:58 demo.csv
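
The file on disk is bigger than the dataframe in memory, and either one could outgrow RAM as `n` increases. As a rough sketch of working with data that lives in storage rather than memory, the plain-Python example below (not Spark) streams a CSV one row at a time and keeps only running totals. The file `tiny_demo.csv` and its contents are made up for this illustration; it is a small stand-in, not the real `demo.csv`.

```python
import csv
import os
import tempfile
from collections import Counter

# write a tiny stand-in file (hypothetical data, 4 rows)
path = os.path.join(tempfile.mkdtemp(), 'tiny_demo.csv')
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['x1', 'x5'])
    writer.writerows([(1.0, 'a'), (2.0, 'b'), (3.0, 'a'), (6.0, 'c')])

# stream the file: only one row is held in memory at a time
n_rows = 0
running_total = 0.0
counts = Counter()
with open(path, newline='') as f:
    for row in csv.DictReader(f):
        n_rows += 1
        running_total += float(row['x1'])
        counts[row['x5']] += 1

mean_x1 = running_total / n_rows
print(n_rows, mean_x1, dict(counts))
```

This only works for computations that can be built up row by row. Spark handles the general case, and spreads the work across partitions.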


## Parallel Work + Spark Dataframes

* spark does work in *parallel*, meaning multiple things are happening at once
* faster at scale, but some upfront overhead
* two levels: *executors* and *partitions*
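* spark dataframes: abstract away everything above
    * similar to a pandas dataframe, but with important differences!
    * lazy! nothing runs until a result is actually needed
        * this lets spark reorder (and optimize) operations before running them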
* **transformations** and **actions**
    * actions start a job, transformations are lazy
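
The transformation / action split can be sketched with a toy class in plain Python. `LazyRows` is a hypothetical stand-in written for this illustration, not the Spark API: `map` and `filter` only record a step in a plan, while `collect` and `count` actually run the plan.

```python
class LazyRows:
    """Toy lazy collection (hypothetical; not the Spark API)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded (kind, function) steps, not yet run

    # transformations: return a new object with a longer plan, do no work
    def map(self, func):
        return LazyRows(self._rows, self._plan + (('map', func),))

    def filter(self, predicate):
        return LazyRows(self._rows, self._plan + (('filter', predicate),))

    def _run(self):
        """Apply every recorded step to every row, in order."""
        for row in self._rows:
            keep = True
            for kind, func in self._plan:
                if kind == 'filter':
                    if not func(row):
                        keep = False
                        break
                else:
                    row = func(row)
            if keep:
                yield row

    # actions: these trigger the actual work
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every time `double` really runs


def double(x):
    calls.append(x)
    return x * 2


pipeline = LazyRows(range(10)).map(double).filter(lambda x: x > 10)
work_before_action = len(calls)  # still 0: only a plan exists so far

result = pipeline.collect()  # the action runs the whole plan
print(work_before_action, result, len(calls))
```

Because the plan is just data until an action runs, a real engine is free to inspect and rearrange it first. This toy version does not attempt that.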

In [13]:
import itertools as it

In [15]:
list(it.combinations(list('abcde'), 2))

[('a', 'b'),
 ('a', 'c'),
 ('a', 'd'),
 ('a', 'e'),
 ('b', 'c'),
 ('b', 'd'),
 ('b', 'e'),
 ('c', 'd'),
 ('c', 'e'),
 ('d', 'e')]

In [16]:
it.combinations(list('abcdefghijklmnopqrstuvwxyz'), 10)

<itertools.combinations at 0x7fd8605a1ef0>
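
The second call returns immediately because `it.combinations` is lazy: it hands back an iterator and computes nothing until values are requested. A small sketch of pulling just a few values out of it, using only the standard library (`it.islice` and `math.comb`):

```python
import itertools as it
import math

letters = list('abcdefghijklmnopqrstuvwxyz')

# building the iterator is instant: no combinations exist yet
combos = it.combinations(letters, 10)

# ask for only the first 3; the rest are never computed
first_three = list(it.islice(combos, 3))

# how many combinations there would be if we materialized them all
n_total = math.comb(len(letters), 10)

print(first_three[0])
print(n_total)
```

Calling `list(...)` on the full iterator would force every one of those combinations into memory, which is the same kind of cost a spark action pays when it finally runs a job.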

## Lessons Covered

1. Overview
2. Env Setup
3. Spark API: learning how to use spark dataframes
4. Data Wrangling
5. Data Exploration