# RDD Programming Guide

## Overview

At a high level, every Spark application consists of a _driver program_ that runs the user's `main` function and executes various _parallel operations_ on a cluster. The main abstraction Spark provides is a _resilient distributed dataset_ (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing collection in the driver program, and transforming it. Users may also ask Spark to _persist_ an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

A second abstraction in Spark is _shared variables_ that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program.
Spark supports two types of shared variables: _broadcast variables_, which can be used to cache a value in memory on all nodes, and _accumulators_, which are variables that are only "added" to, such as counters and sums.

In [1]:
from pyspark import SparkContext, SparkConf

appName = "rdd programming guide"
master = "local[1]"
conf = SparkConf().setAppName(appName).setMaster(master)


The `appName` parameter is a name for your application to show on the cluster UI. `master` is a Spark, Mesos or YARN cluster URL, or a special "local" string to run in local mode. In practice, when running on a cluster, you will not want to hardcode `master` in the program, but rather launch the application with `spark-submit` and receive it there. However, for local testing and unit tests, you can pass "local" to run Spark in-process.

Note that when running inside the `pyspark` shell, a `SparkContext` already exists, so we need to use `SparkContext.getOrCreate(conf)` instead of constructing a new one. Otherwise we get an error such as:

`Cannot run multiple SparkContexts at once; existing SparkContext(app=PySparkShell, master=local[*])`

In [2]:
sc = SparkContext.getOrCreate(conf=conf)

## Resilient Distributed Datasets (RDDs)

Spark revolves around the concept of a _resilient distributed dataset_ (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: _parallelizing_ an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

### Parallelized Collections

Parallelized collections are created by calling `SparkContext`'s `parallelize` method on an existing iterable or collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:

In [3]:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

Once created, the distributed dataset (`distData`) can be operated on in parallel. For example, we can call `distData.reduce(lambda a, b: a + b)` to add up the elements of the list. We describe operations on distributed datasets later on.

One important parameter for parallelized collections is the number of _partitions_ to cut the dataset into. Spark runs one task for each partition. Typically, you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 10)`). Note: some places in the code use the term _slices_ (a synonym for partitions) to maintain backward compatibility.

### External Datasets

PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles