Uncomment this cell and run it once if pyspark is not already installed

In [1]:

#!pip install pyspark

Sometimes we need to tell Spark which Python interpreter to use, via the PYSPARK_PYTHON environment variable

In [2]:
# import os
# os.environ['PYSPARK_PYTHON'] = 'python'
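
One common choice, sketched here as an assumption rather than a requirement, is to point Spark at the same interpreter that is running this notebook, using `sys.executable` instead of a hard-coded name. Whether this is needed at all depends on your environment, and it has to happen before the SparkContext is created.

```python
import os
import sys

# Assumption: Spark's Python workers should use the interpreter running this notebook.
# This must be set before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = sys.executable

print("Spark will use:", os.environ["PYSPARK_PYTHON"])
```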

In [3]:
import pyspark
from pyspark import SparkContext

We configure our SparkContext at the start, e.g. with configuration options that are passed to all worker/executor nodes

In [4]:
# NOTE - we're running in 'local' context,
# however this could be changed later to use a resource/cluster manager, e.g. YARN:
# conf = pyspark.SparkConf().setMaster('yarn').setAppName('YarnSparkTest')

conf = pyspark.SparkConf().setMaster('local[*]')\
                          .setAppName('LocalSparkTest')

sc = SparkContext(conf=conf)

In [5]:
sc

# right-click on the "Spark UI" link and "Open in another tab" - this shows you what this Spark App is doing

Read the Macbeth file from the local filesystem and count the words on each line

In [6]:
# Update if the data file is somewhere else relative to the notebook
FILEPATH = "data/Macbeth.txt"

In [7]:
# Note - the file is not actually read here. textFile is a "lazy" operation; the file is read only when an action needs it
# Note - for the same reason, a wrong path would not raise an error here; it would only surface once an action runs
localFileRdd = sc.textFile(FILEPATH)
localFileRdd

data/Macbeth.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0
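
The lazy behaviour can be illustrated without Spark. The sketch below is a plain-Python analogy using generators, not Spark's actual mechanism: `lazy_lines` and the deliberately wrong path are hypothetical names introduced only for this illustration. The bad path goes unnoticed until something consumes the data.

```python
def lazy_lines(path):
    # Generator function: nothing in this body runs until the first item is requested
    with open(path) as handle:
        for line in handle:
            yield line

# Hypothetical path that is assumed not to exist
missing_path = "no_such_dir/Macbeth.tx"

lines = lazy_lines(missing_path)                 # no error: the file is not opened yet
counts = (len(line.split()) for line in lines)   # still no error: another lazy step

error_seen = False
try:
    total = sum(counts)                          # the "action": the file is opened only now
except FileNotFoundError as exc:
    error_seen = True
    print("Error surfaced only at the action:", exc)
```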

### Perform a transform on the RDD

In this case we pass each row (line of text) through a function that splits the line on whitespace and returns the number of resulting words

In [8]:
# still nothing done, this is another "lazy" operation
wordsPerLineRdd = localFileRdd.map(lambda line: len(line.split()))

# cache() marks this RDD to be kept once it has been computed,
# so later operations reuse it instead of re-running the entire read/transform
# Note: this is also "lazy"; nothing is cached until the first action runs
wordsPerLineRdd.cache()

PythonRDD[2] at RDD at PythonRDD.scala:53
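
To see what the function passed to `map` does to a single line, here is a minimal sketch in plain Python. `words_in_line` and the sample lines are hypothetical and not taken from the data file. It also shows why `split()` is called without an argument: blank lines count as 0 words, whereas splitting on a literal space would count them as 1.

```python
def words_in_line(line):
    # str.split() with no argument splits on any run of whitespace
    # and returns an empty list for a blank line
    return len(line.split())

# Hypothetical sample lines for illustration only
sample_lines = ["the quick brown fox", "", "  padded   with   spaces  ", "tab\tseparated  words"]
counts = [words_in_line(line) for line in sample_lines]
print(counts)  # [4, 0, 3, 3]

# Splitting on a literal space treats a blank line as one (empty) element
blank_with_space_split = len("".split(" "))
print(blank_with_space_split)  # 1
```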

In [9]:
# Now that we want an aggregated number to print, all of the above operations actually run
# Note that the filename might have been "wrong" all along; the file is only read at this point
print("Total word count: ", wordsPerLineRdd.sum())
print("Average words per line: ", wordsPerLineRdd.mean())

Total word count:  18088
Average words per line:  4.4095563139931695
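
As a quick sanity check on these two figures: the mean is the total divided by the number of lines, so the line count can be recovered from the printed values. This is only arithmetic on the output above. Calling `wordsPerLineRdd.count()` should give the same number, although that call is not run in this notebook.

```python
total_words = 18088                  # "Total word count" printed above
mean_words = 4.4095563139931695      # "Average words per line" printed above

# mean = total / number_of_lines, so number_of_lines = total / mean
implied_lines = round(total_words / mean_words)
print(implied_lines)  # 4102
```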


In [10]:
# Gather the first 5 rows from the cluster
wordsPerLineRdd.take(5)

[2, 5, 0, 6, 2]

In [11]:
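# Collect the entire transformed RDD from the cluster
# (fine for a small file like this; on large datasets everything is pulled into the driver's memory)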
wordsPerLineRdd.collect()

[2,
 5,
 0,
 6,
 2,
 6,
 6,
 0,
 2,
 4,
 6,
 0,
 2,
 8,
 0,
 2,
 3,
 0,
 2,
 3,
 0,
 2,
 5,
 0,
 2,
 3,
 0,
 2,
 2,
 0,
 2,
 1,
 0,
 1,
 7,
 7,
 0,
 1,
 0,
 6,
 0,
 13,
 1,
 8,
 8,
 3,
 0,
 1,
 4,
 8,
 6,
 9,
 5,
 0,
 1,
 3,
 8,
 7,
 8,
 5,
 7,
 6,
 7,
 9,
 7,
 6,
 5,
 7,
 5,
 9,
 10,
 7,
 0,
 1,
 5,
 0,
 1,
 7,
 6,
 9,
 7,
 7,
 8,
 6,
 8,
 4,
 0,
 1,
 3,
 5,
 0,
 1,
 1,
 8,
 9,
 8,
 6,
 8,
 4,
 3,
 9,
 0,
 1,
 9,
 9,
 0,
 3,
 0,
 3,
 0,
 2,
 0,
 1,
 5,
 0,
 1,
 11,
 6,
 0,
 1,
 4,
 0,
 1,
 5,
 0,
 1,
 4,
 7,
 7,
 3,
 6,
 8,
 7,
 4,
 7,
 7,
 5,
 0,
 1,
 2,
 0,
 1,
 2,
 6,
 9,
 7,
 7,
 0,
 1,
 8,
 8,
 7,
 0,
 1,
 4,
 0,
 1,
 8,
 0,
 1,
 0,
 6,
 0,
 5,
 2,
 5,
 0,
 2,
 2,
 0,
 2,
 3,
 0,
 2,
 8,
 6,
 4,
 7,
 9,
 7,
 7,
 7,
 0,
 2,
 5,
 0,
 2,
 2,
 0,
 2,
 3,
 0,
 2,
 6,
 6,
 6,
 4,
 7,
 6,
 5,
 6,
 5,
 6,
 6,
 5,
 4,
 0,
 2,
 4,
 0,
 2,
 6,
 6,
 0,
 2,
 0,
 2,
 4,
 3,
 0,
 1,
 6,
 6,
 5,
 7,
 7,
 5,
 0,
 4,
 0,
 1,
 10,
 0,
 1,
 9,
 8,
 9,
 10,
 9,
 8,
 8,
 8,
 4,
 0,
 1,
 7,
 0,
 2,
 9,

If running in a Hadoop cluster, we may write wordsPerLineRdd to a directory in HDFS

In [12]:
# If running in Hadoop, it may make sense to save the output RDD to a new text file
#wordsPerLineRdd.saveAsTextFile("hdfs://hadoop-master:9000/user/ec2-user/macbethWordsPerLine")