# Welcome to Apache Spark

![](images/spark-logo-trademark.png)

# Architecture

<div class="row">
    <div class="col-md-6"><img src="images/cluster-overview.png"></div>
    <div class="col-md-6">A Spark program consists of a <span class="text-primary">driver application</span> and <span class="text-success">worker programs</span>.
    <ul>
        <li>Worker nodes run on different machines in a cluster, or in local threads.</li>
        <li>Data is distributed among workers.</li>
    </ul>
    </div>
</div> 

## Spark Context

The `SparkContext` is the driver's connection to the cluster: it holds the configuration needed to run Spark code.

In [7]:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('lecture-lyon2').setMaster('local[*]')
sc = SparkContext.getOrCreate(conf=conf)

sc

# Resilient Distributed Dataset

A partitioned collection of objects spread across a cluster, stored in memory or on disk.

Image of a RDD

* Lowest-level data abstraction in Spark
* Immutable, tracks lineage

3 ways of creating an RDD

* by parallelizing an existing collection

In [6]:
rdd = sc.parallelize(range(10))
rdd

PythonRDD[10] at RDD at PythonRDD.scala:48

* from files in a storage system

In [12]:
titanic = sc.textFile('data/titanic.csv')
titanic

data/titanic.csv MapPartitionsRDD[15] at textFile at <unknown>:0

* by transforming another RDD

In [13]:
rdd.map(lambda number: number * 2)

PythonRDD[16] at RDD at PythonRDD.scala:48

## Working with RDDs

Let's create an RDD from a list of numbers, and play with it.

In [16]:
rdd = sc.parallelize(range(100))

<h1 class="text-danger">Remember!</h1>

* An RDD is immutable
* An RDD is evaluated lazily
* An RDD only tracks its lineage, so it can reconstruct itself

Lazy evaluation

In [15]:
# lazy evaluation: displaying the RDD shows its description, not its contents
rdd

PythonRDD[18] at RDD at PythonRDD.scala:48

## Spark operations

Two types of operations: transformations and actions.

* Transformations are lazy _(not computed immediately)_
* Only an action on an RDD triggers the execution of all the transformations it depends on.

Image


Why ? Explain

## Transformations

## Actions

## Key-value transformations

# RDD conclusion

Low-level API

In [5]:
sc.stop()

# Higher-level APIs

Spark ships with higher-level libraries built on top of its core engine.

<img src="images/spark-stack.png" class="img-responsive center-block">

# SparkSQL

* Structured data
* Query optimization

## DataFrames

New way to interact with

In [1]:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('lecture-lyon2').setMaster('local[*]')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark

In [2]:
spark.stop()

## Unified data source interaction

## Catalyst optimization

# Machine Learning

Two parts:

* MLlib (`spark.mllib`): RDD-based API
* ML (`spark.ml`): DataFrame-based API

# Spark Streaming

# GraphX

Spark's graph-processing component.

Image of graph.

# Unified engine

Spark's main contribution is to let previously disparate cluster workloads be composed within a single engine.

# Conclusion