# Welcome to Apache Spark

![](images/spark-logo-trademark.png)

# Architecture

<div class="row">
    <div class="col-md-6"><img src="images/cluster-overview.png"></div>
    <div class="col-md-6">A Spark program consists of a <span class="text-primary">driver application</span> and <span class="text-success">worker programs</span>.
    <ul>
        <li>Workers run on different machines in a cluster, or as local threads.</li>
        <li>Data is distributed among workers.</li>
    </ul>
    </div>
</div> 

## Spark Context

The `SparkContext` is the entry point to the cluster: it holds the configuration and connection information needed to run Spark code.

In [1]:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('lecture-lyon2').setMaster('local[*]')
sc = SparkContext.getOrCreate(conf=conf)

sc

# Resilient Distributed Dataset

A partitioned collection of objects spread across a cluster, stored in memory or on disk.

Image of a RDD

* Lowest-level data abstraction in Spark
* Immutable, tracks lineage

There are three ways to create an RDD:

* by parallelizing an existing collection

In [6]:
rdd = sc.parallelize(range(10))
rdd

PythonRDD[10] at RDD at PythonRDD.scala:48

* from files in a storage system

In [12]:
titanic = sc.textFile('data/titanic.csv')
titanic

data/titanic.csv MapPartitionsRDD[15] at textFile at <unknown>:0

* by transforming another RDD

In [13]:
rdd.map(lambda number: number * 2)

PythonRDD[16] at RDD at PythonRDD.scala:48

## Working with RDDs

Let's create an RDD from a range of numbers, split into 4 partitions, and experiment with it.

In [26]:
rdd = sc.parallelize(range(12), 4)
rdd.cache()

PythonRDD[27] at RDD at PythonRDD.scala:48
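
To make "partitioned" concrete, here is a plain-Python sketch (no Spark required) of how 12 elements could be cut into 4 contiguous slices. The helper `split_into_partitions` is hypothetical and assumes simple even slicing; on a live context, `rdd.glom().collect()` shows the real content of each partition.

```python
def split_into_partitions(data, num_partitions):
    """Cut `data` into contiguous, near-equal slices (illustrative model only)."""
    items = list(data)
    size = len(items)
    # Slice boundaries: partition i covers items[bounds[i]:bounds[i + 1]].
    bounds = [i * size // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


partitions = split_into_partitions(range(12), 4)
print(partitions)  # -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
```

Each inner list stands for the data held by one worker task.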

<h1 class="text-danger">Remember!</h1>

* An RDD is immutable
* An RDD is evaluated lazily
* An RDD only tracks its lineage, so it can recompute itself when needed

In [14]:
print(rdd)              # prints only info on RDD, no evaluation
print(rdd.take(3))      # specific methods to gather data back to driver
print(rdd.map(lambda num: num + 1).toDebugString())  # check RDD lineage

PythonRDD[3] at RDD at PythonRDD.scala:48
[0, 1, 2]
b'(4) PythonRDD[11] at RDD at PythonRDD.scala:48 []\n |  PythonRDD[3] at RDD at PythonRDD.scala:48 []\n |      CachedPartitions: 1; MemorySize: 134.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B\n |  ParallelCollectionRDD[2] at parallelize at PythonRDD.scala:480 []'


## Spark operations

Spark operations come in two types: transformations and actions.

* Transformations are lazy _(not computed immediately)_
* Only an action on an RDD triggers the execution of the transformations it depends on.
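
Lazy evaluation can be felt in plain Python too. The sketch below is only an analogy, not how Spark works internally: Python's built-in `map` plays the role of a transformation and `list` the role of an action. The `traced_double` function and the `log` list are illustrative names.

```python
log = []


def traced_double(num):
    log.append(num)  # record every time the function really runs
    return num * 2


numbers = range(5)

# "Transformation": builds a lazy pipeline, nothing is computed yet.
doubled = map(traced_double, numbers)
calls_before_action = len(log)

# "Action": consuming the pipeline forces the computation.
result = list(doubled)
calls_after_action = len(log)

print(calls_before_action, calls_after_action, result)  # -> 0 5 [0, 2, 4, 6, 8]
```

In Spark the same idea holds at cluster scale: `rdd.map(...)` only records a step in the lineage, and a call such as `collect()` or `count()` runs it.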

![](images/spark-operations.png)

## Transformations

Transformations shape your dataset: each one returns a new RDD and leaves the original untouched.

### Filter

Return a new RDD containing only the elements that satisfy a predicate.

In [15]:
rdd

PythonRDD[3] at RDD at PythonRDD.scala:48
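
The cell above only displays `rdd`. As a hedged illustration of what a call such as `rdd.filter(lambda num: num % 2 == 0)` does, here is a plain-Python model; the partition layout is assumed for the example, and the predicate `is_even` is an illustrative name.

```python
# Assumed layout of range(12) over 4 partitions (for illustration only).
partitions = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]


def is_even(num):
    return num % 2 == 0


# Model of filter: the predicate is applied independently inside each partition.
filtered = [[num for num in part if is_even(num)] for part in partitions]
print(filtered)  # -> [[0, 2], [4], [6, 8], [10]]

# What collect() would bring back to the driver.
collected = [num for part in filtered for num in part]
print(collected)  # -> [0, 2, 4, 6, 8, 10]
```

Note that the number of partitions is unchanged; only their sizes shrink.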

### Map

Return a new RDD by applying a function to each element of this RDD.

In [16]:
rdd

PythonRDD[3] at RDD at PythonRDD.scala:48
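
Again the cell above only displays `rdd`. A call such as `rdd.map(lambda num: num * num)` produces exactly one output element per input element. The plain-Python model below assumes the same illustrative partition layout as before; `square` is a made-up example function.

```python
# Assumed layout of range(12) over 4 partitions (for illustration only).
partitions = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]


def square(num):
    return num * num


# Model of map: one output per input, partition sizes are preserved.
mapped = [[square(num) for num in part] for part in partitions]
print(mapped)  # -> [[0, 1, 4], [9, 16, 25], [36, 49, 64], [81, 100, 121]]
```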

### FlatMap

Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

In [25]:
rdd.map(lambda num: range(num)).take(3)

[range(0, 0), range(0, 1), range(0, 2)]
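
The cell above uses `map`, so each number becomes one `range` object. With `rdd.flatMap(lambda num: range(num))` the ranges would be flattened into a single sequence of numbers. A plain-Python comparison of the two behaviours, with `expand` as an illustrative helper:

```python
from itertools import chain

numbers = [0, 1, 2, 3]


def expand(num):
    return range(num)


# map: one result (here a list) per input element.
mapped = [list(expand(num)) for num in numbers]
print(mapped)  # -> [[], [0], [0, 1], [0, 1, 2]]

# flatMap: zero or more results per input element, all merged together.
flat_mapped = list(chain.from_iterable(expand(num) for num in numbers))
print(flat_mapped)  # -> [0, 0, 1, 0, 1, 2]
```

An input that maps to an empty sequence, like `0` here, simply disappears from the flattened result.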

### Distinct

Return a new RDD containing the distinct elements in this RDD.

In [21]:
rdd.map(lambda num: 0 if num % 2 == 0 else 1).distinct().collect()

[0, 1]

## Actions

Actions trigger the computation: they run the pending transformations and return a result to the driver.

### Collect / take

`collect()` returns a list that contains all of the elements in this RDD; `take(n)` returns only the first `n` elements.

Note: `collect()` should only be used if the resulting list is expected to be small, as all the data is loaded into the driver’s memory.

In [27]:
rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

### Count

Return the number of elements in this RDD.

### Reduce

Reduces the elements of this RDD using the specified commutative and associative binary operator. Each partition is reduced locally first, and the partial results are then combined.
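
Why must the operator be commutative and associative? The sketch below models a reduce that works partition by partition; `partitioned_reduce` is a hypothetical helper written for this example, not a Spark function. Addition gives the same answer whatever the partitioning, while subtraction does not.

```python
from functools import reduce
from operator import add, sub


def partitioned_reduce(partitions, op):
    """Reduce each non-empty partition locally, then combine the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


one_part = [list(range(12))]
four_parts = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

sum_one = partitioned_reduce(one_part, add)
sum_four = partitioned_reduce(four_parts, add)
print(sum_one, sum_four)  # -> 66 66

diff_one = partitioned_reduce(one_part, sub)
diff_four = partitioned_reduce(four_parts, sub)
print(diff_one, diff_four)  # -> -66 24
```

With an operator like subtraction, the result depends on how the data happens to be split, which is exactly what you do not want on a cluster.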

## Key-value transformations

* Key/value RDDs are commonly used to perform aggregations, and often we will do some initial ETL (extract, transform, and load) to get our data into a key/value format. 

* Key/value RDDs expose new operations (e.g., counting up reviews for each product, grouping together data with the same key, and grouping together two different RDDs).

### ReduceByKey

Merge the values for each key using an associative and commutative reduce function.
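
As a sketch of the semantics, here is a plain-Python model of `reduceByKey` applied to the "count reviews per product" case mentioned earlier. The `reduce_by_key` helper and the `reviews` data are invented for illustration; with Spark the call would look like `pairs.reduceByKey(add)`.

```python
from operator import add


def reduce_by_key(pairs, func):
    """Merge all values that share a key using `func` (illustrative model)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical (product, 1) pairs, one per review.
reviews = [("book", 1), ("lamp", 1), ("book", 1), ("pen", 1), ("book", 1)]

counts = reduce_by_key(reviews, add)
print(counts)  # -> {'book': 3, 'lamp': 1, 'pen': 1}
```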

### Join

Return an RDD containing all pairs of elements with matching keys in self and other.
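
The shape of the result is easier to see on a small example. Below is a plain-Python model of an inner join on keys; `inner_join`, `prices` and `stock` are hypothetical names and data. Each match is returned as `(key, (left_value, right_value))`, which is also the shape `left.join(right)` produces in PySpark, although Spark does not guarantee the order of the results.

```python
def inner_join(left, right):
    """Pair up values of `left` and `right` that share a key (illustrative model)."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


prices = [("book", 12), ("lamp", 30), ("pen", 2)]
stock = [("book", 5), ("pen", 100), ("pen", 40), ("desk", 1)]

joined = inner_join(prices, stock)
print(joined)  # -> [('book', (12, 5)), ('pen', (2, 100)), ('pen', (2, 40))]
```

Keys present on only one side (`lamp`, `desk`) are dropped, and a key that appears twice on one side yields two pairs.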

## Wordcount!
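
Word count chains three of the operations seen so far. The sketch below models them in plain Python on two made-up lines of text; on an RDD of lines, the usual PySpark form would be `lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(add)`.

```python
from collections import Counter

# Hypothetical input; with Spark this would come from sc.textFile(...).
lines = ["to be or not to be", "that is the question"]

# flatMap step: split every line into words, flattened into one list.
words = [word for line in lines for word in line.split()]

# map step: turn each word into a (word, 1) pair.
pairs = [(word, 1) for word in words]

# reduceByKey step: add up the ones for each word.
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(counts["to"], counts["be"], counts["question"])  # -> 2 2 1
```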

# RDD conclusion

RDDs are Spark's low-level API.

In [5]:
sc.stop()

# Higher-level APIs

Spark ships with higher-level libraries built on top of its core engine.

<img src="images/spark-stack.png" class="img-responsive center-block"></img>

# SparkSQL

* Structured
* Optimization

## SparkSession

`SparkSession` is the official entry point for the SQL and DataFrame APIs.

In [1]:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('lecture-lyon2').setMaster('local[*]')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark

## DataFrames

A newer way to interact with structured data.

## Unified data source interaction

## Catalyst optimization

# Machine Learning

Two parts:

* MLlib: RDD-based API
* ML: DataFrame-based API

# Spark Streaming

# GraphX

GraphX is Spark's graph-processing component.

Image of graph.

# Unified engine

Spark's main contribution is to enable previously disparate cluster workloads to be composed within a single engine.

# Conclusion

In [2]:
spark.stop()

## That's it folks