**Import Spark Config & Context**

In [1]:
from pyspark import SparkConf, SparkContext

**Initialize a local SparkContext**

In [2]:
conf = SparkConf().setMaster('local').setAppName('SparkApp')
sc = SparkContext(conf = conf)

Use `textFile` to "load" a text file. Nothing is actually read at this point: `textFile` only returns an RDD that describes the data, and the file is not touched until an action runs.

In [7]:
moby_dick_rdd = sc.textFile('../data/moby_dick.txt')

Put the "loaded" file through a *chain of transformations* (`flatMap`, `map` and `reduceByKey`). The outcome is a completely new RDD of type `PipelinedRDD`.

RDDs are immutable, and the original file is never modified. Each transformation instead returns a new RDD. Only Spark actions trigger the computation, and they return ordinary values to the driver (or write output) rather than RDDs.

In [8]:
pipelined_rdd = moby_dick_rdd.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda x,y: x + y)
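
To make clear what the three transformations compute, here is a minimal plain-Python sketch of the same word count on a tiny in-memory list. It does not use Spark: the helper name `word_count` and the sample lines are hypothetical, and unlike the RDD version it is neither distributed nor lazy.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Plain-Python stand-in for flatMap -> map -> reduceByKey (no Spark involved)."""
    # flatMap: split every line and flatten the pieces into one stream of words
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: pair each word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: add up the counts of identical words
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Hypothetical sample input, standing in for lines of the text file
sample_lines = ["call me Ishmael", "call me"]
counts = word_count(sample_lines)
print(counts)
```

Note that `split(" ")` returns empty strings wherever a line contains consecutive spaces, so the empty string can appear as a "word" in the result. Calling `split()` without an argument avoids this.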

In [9]:
type(pipelined_rdd)

pyspark.rdd.PipelinedRDD

In [10]:
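def output(x):
    # x is a (word, count) pair; print only the word
    # (on Python 3, encode() makes this print a bytes literal such as b'...')
    print(x[0].encode('utf-8'))

pipelined_rdd.foreach(output)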

The `foreach` action triggers the previously defined transformations and, in local mode, prints the words from Moby Dick to the console. **This** is the very moment when all of the logic gets executed.

**Spark is lazy.**  ;-)
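
The same deferred behaviour can be imitated in plain Python with a generator. This is only an analogy, not how Spark is implemented: the names `traced_upper` and `lazy_pipeline` are hypothetical. Building the generator plays the role of chaining transformations, and consuming it plays the role of an action.

```python
log = []  # records when work is actually done


def traced_upper(word):
    # Remember every call so we can see when the function really runs
    log.append(word)
    return word.upper()


words = ["call", "me", "Ishmael"]

# Building the generator is like chaining transformations: nothing runs yet
lazy_pipeline = (traced_upper(w) for w in words)
calls_before = len(log)

# Consuming it is like calling an action: now every element is processed
result = list(lazy_pipeline)
calls_after = len(log)

print(calls_before, calls_after, result)
```

Here `calls_before` is 0 and `calls_after` is 3: the function is not called when the pipeline is defined, only when its results are requested.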