# Welcome to Spark

The Spark **R**esilient **D**istributed **D**ataset API lets us specify data processing operations that, when executed, are distributed across multiple workers.

A Resilient Distributed Dataset (RDD) represents an immutable collection of elements that can be operated on in parallel.
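To build some intuition for "operated on in parallel", here is a rough local analogy written in plain Python rather than Spark. It assumes a thread pool can stand in for cluster workers, and the helper names `split_into_partitions` and `parallel_filter` are hypothetical, not part of PySpark. Real Spark partitions data across machines and handles scheduling and failures, none of which is modelled here.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous, roughly equal chunks (hypothetical helper)."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def parallel_filter(items, predicate, num_partitions=2):
    """Filter each partition on its own worker thread, then combine the results in order."""
    partitions = split_into_partitions(items, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        filtered = list(
            pool.map(lambda part: [x for x in part if predicate(x)], partitions)
        )
    # A new list is built; the input collection is never modified.
    return [x for part in filtered for x in part]


numbers = list(range(10))
evens = parallel_filter(numbers, lambda n: n % 2 == 0, num_partitions=3)
```

Note that `numbers` is left untouched and `evens` is a new list, which loosely mirrors how an operation on an immutable RDD yields a new RDD instead of changing the original.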

We interact with the Spark cluster by submitting queries via the Python `SparkContext` object.

![Spark cluster overview](https://spark.apache.org/docs/latest/img/cluster-overview.png)

Many RDD methods, known as transformations, are lazy. For example, the `filter` method itself doesn't perform any processing and is very fast to call; it only records the operation to be applied. Calling `collect`, however, is an action: it submits a `Job` to the cluster and waits for it to complete before returning the results to Python.
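The lazy behaviour described above can be sketched with a toy model in plain Python. This is only an illustration under simplifying assumptions, not how Spark is implemented: the `LazyDataset` class and the `is_short` helper are hypothetical names, and everything runs in a single local process.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical; not part of PySpark)."""

    def __init__(self, items, steps=()):
        self._items = tuple(items)
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Record the step and return a new dataset; no element is examined yet.
        return LazyDataset(self._items, self._steps + (("filter", predicate),))

    def map(self, func):
        # Same idea: remember the function, defer the work.
        return LazyDataset(self._items, self._steps + (("map", func),))

    def collect(self):
        # Only now are the recorded steps applied, in order.
        results = list(self._items)
        for kind, func in self._steps:
            if kind == "filter":
                results = [x for x in results if func(x)]
            else:
                results = [func(x) for x in results]
        return results


calls = []  # records every element the predicate is asked about


def is_short(text):
    calls.append(text)
    return len(text) < 5


words = LazyDataset(["spark", "rdd", "lazy", "collect"])
short_words = words.filter(is_short)   # returns immediately
calls_before_collect = len(calls)      # the predicate has not run yet
result = short_words.collect()         # the predicate runs here
```

In this sketch `calls_before_collect` is `0`, and the predicate is only invoked once `collect` is called, which is the same split between describing work and running it that the `filter` and `collect` RDD methods follow.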

## RDD Examples

- [spark_context.parallelize](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.parallelize.html#pyspark.SparkContext.parallelize) - Form an RDD from a Python list.
- [filter](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.filter.html#pyspark.RDD.filter) - Return a new RDD containing only the elements that satisfy a predicate.
- [collect](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.collect.html#pyspark.RDD.collect) - Return a list that contains all of the elements in this RDD.

In [None]:
from pyspark.context import SparkContext

spark_context = SparkContext.getOrCreate()
rdd = spark_context.parallelize([
    "The quick brown fox jumps over the lazy dog",
    "Waltz, bad nymph, for quick jigs vex",
    "Pack my box with five dozen liquor jugs"
])

# Which pangrams are shorter than 40 characters?
rdd.filter(lambda pangram: len(pangram) < 40).collect()

- [map](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.map.html#pyspark.RDD.map) - Return a new RDD by applying a function to each element of this RDD.

In [None]:
# Reverse the characters of each pangram.
rdd.map(lambda pangram: pangram[::-1]).collect()

- [count](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.count.html#pyspark.RDD.count) - Return the number of elements in this RDD.

In [None]:
# How many pangrams contain 'fox' or 'box'?
rdd.filter(lambda pangram: 'fox' in pangram or 'box' in pangram).count()

## Exercises


1. Convert the existing pangrams from the `rdd` variable to a list of title case pangrams.


2. From the `rdd` variable, generate a list of pangrams that contain only alphabetic and space characters. In other words, exclude pangrams containing punctuation.


3. From the `rdd` variable, generate a list of pangrams sorted from longest to shortest.
    <details>
      <summary>Hint</summary>
      Look at the sortBy RDD method in the API docs; see Resources below.
    </details>


4. Imagine `rdd` contains thousands of pangrams, but you'd like to look at just two of them.
    <details>
      <summary>Hint</summary>
      Search for "take" RDD methods in the API docs; see Resources below.
    </details>

## Resources
- [Spark RDD API docs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html#pyspark.RDD)