In [1]:
import pyspark

The following line calls the <code>SparkContext()</code> constructor, which initializes a Spark session and returns an object that encapsulates everything you need to "talk" to a Spark cluster. The convention is to name that object <code>sc</code>, and that is what you will find in examples in the Spark documentation and around the web.

In [2]:
sc = pyspark.SparkContext()

Now that we have our Spark session initialized on the <code>sc</code> object, we are ready to create RDDs. There are two ways to create an RDD and have Spark partition and distribute data across the cluster. Let's look at the first one - the <code>parallelize</code> method:

In [3]:
# Let's create an RDD containing a small list with integers for elements:

some_numbers = [1,2,3,4,5,6,7,8,9,10]

my_first_rdd = sc.parallelize(some_numbers)

In [None]:
my_first_rdd

What just happened here?

Spark took our list of integers and broke it down into several chunks, called **Partitions**. Each of these partitions can be operated on independently from each other by Executors, enabling Spark to "divide and conquer" and perform computations on your data in parallel!

In [6]:
# Let's see how many partitions Spark broke our list of numbers into
my_first_rdd.getNumPartitions()

40

In [7]:
# Let's see what's in these partitions:

my_first_rdd.glom().collect()

[[],
 [],
 [],
 [1],
 [],
 [],
 [],
 [2],
 [],
 [],
 [],
 [3],
 [],
 [],
 [],
 [4],
 [],
 [],
 [],
 [5],
 [],
 [],
 [],
 [6],
 [],
 [],
 [],
 [7],
 [],
 [],
 [],
 [8],
 [],
 [],
 [],
 [9],
 [],
 [],
 [],
 [10]]

The number of Partitions is one of the most important parameters of a Spark program. Split your data into too few partitions and Spark cannot do as much work in parallel as your Cluster hardware allows; split it into too many and you end up with empty partitions, or lose the benefit of parallelism again by forcing Executors to work through lots of very small tasks one after another.
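
To see why 10 elements spread over 40 partitions leaves most of them empty, here is a plain-Python sketch that cuts a list into evenly sized slices. The helper <code>slice_evenly</code> is hypothetical, and its index arithmetic is an assumption chosen because it reproduces the <code>glom()</code> output above; it is not taken from Spark's documentation.

```python
def slice_evenly(data, num_partitions):
    """Cut `data` into `num_partitions` contiguous slices of near-equal size."""
    n = len(data)
    return [
        data[(i * n) // num_partitions : ((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]

some_numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

forty = slice_evenly(some_numbers, 40)   # more slices than elements
ten = slice_evenly(some_numbers, 10)     # exactly one element per slice

# Count how many of the 40 slices ended up holding no data at all.
empty_in_forty = sum(1 for part in forty if not part)
print(empty_in_forty)
print(ten)
```

With 40 slices, three out of every four hold nothing, so any task scheduled for them is pure overhead; with 10 slices every task has exactly one element to work on.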

We will dive deeper into this topic on Day 2 of the workshop. For now, let's set the number of partitions to 10:

In [4]:
my_first_rdd_repartitioned = my_first_rdd.repartition(10)
my_first_rdd_repartitioned.getNumPartitions()

10

The RDD API has two main types of methods: **Transformations** and **Actions**. In a nutshell, Transformations are operations carried out on RDDs that return other RDDs. Actions are operations carried out on RDDs that do not return other RDDs. In the cell above, <code>repartition</code> is a Transformation. <code>getNumPartitions</code> returns a plain number rather than an RDD, so it falls on the Action side of that split (strictly speaking, it only reads the RDD's metadata and does not launch a job the way Actions such as <code>collect</code> do). Let's look at a few more examples to see what that means in practice:

In [5]:
# Our first meaningful transformation to our RDD: add 1 to each element

my_first_rdd_repartitioned.map(lambda element : element+1)

PythonRDD[6] at RDD at PythonRDD.scala:48

The <code>map</code> method applies a function to each element of each partition of an RDD. The output above tells us that this returned another RDD. Can we get its contents back from the cluster?

In [13]:
# The collect() method brings the contents of an RDD from the cluster back to the driver

my_first_rdd_repartitioned.collect()

[7, 4, 8, 9, 1, 10, 5, 2, 6, 3]


The numbers are shuffled (we will talk more about why and how on Day 2), but this is still our list of integers from 1 to 10. We applied a transformation to our RDD, which created another RDD, but we kept no reference to this new RDD!

In [12]:
# RDDs are immutable! Our transformation actually created another RDD we had no way to refer to on the Driver!

my_second_rdd = my_first_rdd.map(lambda element : element+1)

my_second_rdd.collect()

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

By creating new RDDs with each Transformation, Spark actually provides a type of fault-tolerance! It records these transformations in a DAG (the RDD's lineage), so if an entire node, or an Executor inside a node, fails, Spark can recompute the lost pieces of your RDDs from that lineage and your work isn't lost.

In [11]:
# Spark preserves RDD lineage to automatically recompute them if they are lost!

my_second_rdd.toDebugString()

b'(40) PythonRDD[9] at RDD at PythonRDD.scala:48 []\n |   ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:480 []'

Now wait a minute... if Spark creates RDDs at every Transformation and Spark keeps things in memory... won't you quickly run out of memory by applying Transformations to RDDs?

The answer is: no! Spark performs "Lazy-Evaluation". This means all Spark does is record your transformations in a DAG without actually computing anything or using up any extra memory until an **Action** is called on an RDD!
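
As an analogy only (this is plain Python, not Spark), the built-in <code>map</code> and <code>filter</code> are also lazy: they describe a pipeline without touching any element until something consumes the result. The sketch below records every call to a hypothetical <code>add_one</code> function to show that no work happens until <code>sum</code> plays the role of the Action.

```python
calls = []  # records every element add_one is actually called with

def add_one(x):
    calls.append(x)
    return x + 1

numbers = range(1, 11)

# Building the pipeline only describes the work; no element is processed yet.
pipeline = map(lambda x: x + 2,
               filter(lambda x: x % 2 == 0,
                      map(add_one, numbers)))
calls_before = len(calls)

# Consuming the pipeline forces the whole chain to run, like an Action would.
total = sum(pipeline)
calls_after = len(calls)

print(calls_before, calls_after, total)
```

The chain mirrors the Transformations used in the next cell (add 1, keep the even numbers, add 2), so the total matches what <code>reduce</code> returns there.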

Let's get a feeling of this concept by applying a long chain of Transformations to an RDD and timing it...

In [5]:
# Spark performs Lazy-Evaluation: no transformation
# actually gets computed until an "action" is called on an RDD

%time my_third_rdd = my_first_rdd_repartitioned.map(lambda element : element+1).\
filter(lambda element : element % 2 ==0).map(lambda element : element+2)

CPU times: user 38 µs, sys: 0 ns, total: 38 µs
Wall time: 42.2 µs


... it ran almost instantly! Now let's call an Action on this RDD and time it:

In [11]:
# The "reduce" method is an "action". For a complete list of actions,
# see: https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

%time my_third_rdd.reduce(lambda a,b : a+b)

CPU times: user 8.99 ms, sys: 1.99 ms, total: 11 ms
Wall time: 919 ms


40

And here is some good news for those of you who can't get used to the "lambda function" syntax. This also works just fine:

In [12]:
def add_numbers(a,b):
    return a+b

%time my_third_rdd.reduce(add_numbers)

CPU times: user 8.28 ms, sys: 445 µs, total: 8.72 ms
Wall time: 255 ms


40

RDDs are a pretty powerful concept and if you take anything home from this workshop let it be this: RDDs are a simple way of performing **Data Parallelism**.

In other words, you can write your code almost the exact same way you would in a serial program (i.e., not parallel) and the "parallel" part simply means your code will run against different chunks of your data at the same time. 

All you need to do most of the time is wrap your usual code with one or more RDD API methods and be aware of the nature of the elements in your Partitions so you pick the right method. Once you've done that, Spark takes care of performing Data Parallelism for you!

Here is a slightly more difficult example - let's use Spark to multiply each element of a numpy array by a random number!

What makes this more difficult? Now we are doing Data Parallelism not on a native Python object like before (a list), but on an object defined by a non-native library: numpy.

We start by creating this object: a 1-d array of 100 elements.

In [13]:
import numpy as np

an_object = np.linspace(0,1,100)

ModuleNotFoundError: No module named 'numpy'

In [None]:
an_object

In [None]:
my_new_rdd = sc.parallelize(an_object)

Now you might be tempted to do like we did before and just do what you would do on your own workstation without Spark:

In [None]:
my_new_rdd.map(lambda element : element * np.random.rand()).collect()

Depending on your setup, this may fail when you are running on a Cluster (as opposed to running Spark on a single computer). Why? Well, you imported the <code>numpy</code> library on the Driver, but it is the Executors that actually run your function... so <code>numpy</code> must be available to the Executors too! One way to make that explicit is to import it inside the function they run:

In [None]:
def multiply_by_random(x):
    import numpy as np
    
    return x * np.random.rand()

In [None]:
my_new_rdd.map(lambda element : multiply_by_random(element)).collect()

Alright! This seems to have worked... but is it the best way to go about doing this? Remember, the <code>map</code> method applies whatever function you pass to it to every single element of each partition!

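Does that mean we are importing <code>numpy</code> 100 times in this example? In a sense, yes: the <code>import</code> statement runs once per element. Python caches a module after its first import, so the repeats are cheap, but it is still work we have no need to repeat.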

This is a good segue into another very useful Transformation in the RDD API:

In [None]:
def partition_multiply_by_random(x):
    import numpy as np
    
    output = [element * np.random.rand() for element in x]
    
    return output

In [None]:
my_new_rdd.mapPartitions\
(lambda partition : partition_multiply_by_random(partition)).\
collect()

The <code>mapPartitions</code> method applies whatever function you pass to it to each **Partition** as a whole, with one caveat: your function receives an iterator over the elements of the Partition and must return an iterable of results. So in practice, this method also applies your function to the elements of a Partition, but it gives you more flexibility to do things like importing libraries only once per partition... or anything else that you don't need done repeatedly for each element of a partition.
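
A partition function can also be written as a generator, which keeps the "set up once, then loop" shape explicit and avoids building the whole output list in memory. The sketch below is an illustration under assumptions: <code>scale_partition</code> and its <code>seed</code> parameter are hypothetical, and the standard library's <code>random</code> stands in for <code>numpy</code> so the example runs without extra installs.

```python
import random

def scale_partition(partition, seed=None):
    """Multiply every element of one partition by a random number in [0, 1)."""
    rng = random.Random(seed)        # per-partition setup: runs once per partition
    for element in partition:        # `partition` is an iterator over the elements
        yield element * rng.random()

# Called directly on an ordinary iterator, the way mapPartitions would call it:
scaled = list(scale_partition(iter([1.0, 2.0, 3.0]), seed=0))
print(scaled)
```

On an RDD this would be used as <code>my_new_rdd.mapPartitions(scale_partition)</code>; the function is called with one argument per partition, so <code>seed</code> keeps its default.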

***An important thing to note here:*** if your code imports libraries, you need to make sure they are installed on every node of your cluster! Generally that means asking your system administrator to do it for you...

We will talk more about options for handling your code's dependencies on a Spark cluster on Day 2!

## Hands-on Guided Example 1 - NASA's Website Log Analysis

So far we've used toy examples to introduce the RDD API along with a few of its Transformations and Actions. Now let's look at a more real-life example: let's wrangle a fairly big "semi-structured" file and turn it into something a Data Scientist would be ready to work with. In fact, let's ask a few Data Science-y questions of this data and use Spark itself to answer them while we are at it!

This example file is a standard Apache web server log. More specifically, it contains a month's worth of requests to NASA's website, from the distant year of 1995, combined into one fairly big file.

This log contains the following information:

1. The IP Address or the DNS name performing a request
2. A time stamp of the form: "dd/Mon/YYYY:hh:mm:ss Timezone"
3. The request type (HTTP verb), the resource being requested and the Protocol used
4. The code returned by the server (200 OK, 404 Not Found etc...)
5. The Size of the resource being requested

We will use the <code>textFile</code> method to read in this file. This, like the <code>parallelize</code> method, turns the data inside this file into an RDD. There are two **important things** you need to know about this method:

1. In a real-life Spark Cluster, the location of the file (the argument you will pass to <code>textFile</code>) must be visible/accessible to all nodes of the Cluster. In practice, a lot of the time this location will be a path on a Hadoop Distributed File System (HDFS), but this can be any Network File System, or a location mounted on all nodes, or Amazon S3... as long as it's visible and accessible on all nodes!

2. This method turns **each line** of the input file into an element in a Partition. So ***no matter what the format of the file is*** - when it gets turned into an RDD, **each line** (as delimited by a newline a.k.a. "\n") becomes an element.

Without further ado... let's dive into it!

In [3]:
nasa_logs = sc.textFile('../../data/NASA_access_log_Jul95.gz')

The first step in any data problem is to look at the data to get a sense of what we are dealing with. The RDD API has the <code>take</code> Action, which brings a number of elements (remember, an element here is a line of the original file) back to the Driver so we can see them. The important thing here is to be careful not to bring too many elements back to the Driver and exceed its memory capacity!

In [20]:
nasa_logs.take(5)

['199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
 '199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085',
 'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
 '199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179']

Another good practice is to find out how many elements we have to get a sense of what we are dealing with. The RDD API has the <code>count</code> method for that:

In [None]:
nasa_logs.count()

Now that we can see what the data looks like, a reasonable first step seems to be to split the data on the " " (space) character:

In [4]:
nasa_logs.map(lambda line : line.split(" ")).take(5)

[['199.72.81.55',
  '-',
  '-',
  '[01/Jul/1995:00:00:01',
  '-0400]',
  '"GET',
  '/history/apollo/',
  'HTTP/1.0"',
  '200',
  '6245'],
 ['unicomp6.unicomp.net',
  '-',
  '-',
  '[01/Jul/1995:00:00:06',
  '-0400]',
  '"GET',
  '/shuttle/countdown/',
  'HTTP/1.0"',
  '200',
  '3985'],
 ['199.120.110.21',
  '-',
  '-',
  '[01/Jul/1995:00:00:09',
  '-0400]',
  '"GET',
  '/shuttle/missions/sts-73/mission-sts-73.html',
  'HTTP/1.0"',
  '200',
  '4085'],
 ['burger.letters.com',
  '-',
  '-',
  '[01/Jul/1995:00:00:11',
  '-0400]',
  '"GET',
  '/shuttle/countdown/liftoff.html',
  'HTTP/1.0"',
  '304',
  '0'],
 ['199.120.110.21',
  '-',
  '-',
  '[01/Jul/1995:00:00:11',
  '-0400]',
  '"GET',
  '/shuttle/missions/sts-73/sts-73-patch-small.gif',
  'HTTP/1.0"',
  '200',
  '4179']]

Next, for the sake of this example, let's say we are not interested in lines where there is data missing. In other words, we are only interested in lines that have all 10 elements. We will use the <code>filter</code> method to filter any lines that don't have all 10 elements out of our RDD:

In [9]:
nasa_logs.map(lambda line : line.split(" ")).\
filter(lambda line : len(line)==10).count()

Web server logs like this are called 'semi-structured' for a reason: we can be pretty sure that every line will be formatted the same way. This means every element in each of our Partitions looks pretty much the same after our first step. We can be confident that the same unwanted characters ended up inside the elements of all partitions of our RDD. So our next step takes care of removing them:

In [4]:
replacement_dict = {"[":'',"]":'',"\"":''}

nasa_logs_structured = nasa_logs.map(lambda line : line.split(" ")).\
filter(lambda line : len(line)==10).\
map(lambda line : [element.translate(str.maketrans(replacement_dict)) \
                   for element in line])
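
As an alternative to splitting on spaces and then stripping characters, the same ten fields can be pulled out with one regular expression. This is a sketch, not the workshop's method: <code>LOG_PATTERN</code> and <code>parse_log_line</code> are hypothetical names, and the pattern assumes every line follows the layout seen in the <code>take(5)</code> output above.

```python
import re

LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) '      # origin and the two "-" placeholder fields
    r'\[(\S+) ([^\]\s]+)\] '    # timestamp and timezone, brackets dropped
    r'"(\S+) (.*) (\S+)" '      # method, resource (may contain spaces), protocol
    r'(\d{3}) (\S+)$'           # status code and size ("-" when absent)
)

def parse_log_line(line):
    """Return the 10 fields of a log line as a list, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return list(match.groups()) if match else None

sample = ('199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
          '"GET /history/apollo/ HTTP/1.0" 200 6245')
print(parse_log_line(sample))
```

On the RDD this could be applied as <code>nasa_logs.map(parse_log_line).filter(lambda fields : fields is not None)</code>. One difference from the split-based version: a resource containing a space is kept as a single field instead of causing the line to be dropped.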

You might be asking yourself whether using the <code>take</code> method all the time to check if we are doing things right is the best practice... and the answer is no. Every time you call it, you are computing a new RDD and thus having the Spark Cluster do work for you. In real life you will rarely have a Cluster all to yourself, so you should expect your computations to get queued and to compete for resources with other users. In this scenario, minimizing the number of times you move things back and forth between the Driver and the Executors is a good idea.

So in practice, one approach would be to use the RDD API method <code>sample</code> to extract a sample of your data to examine in the driver and figure out what you need to do before farming out computations to the cluster. The <code>take</code> method also works here, but getting a random sample instead of the first N elements of your RDD is almost always a better plan.

In [None]:
# Make sure you know how much data 0.01% of your dataset is!
# It might look like a small fraction, but in the Big Data world
# even that might be too much for your local computer!

local_sample = nasa_logs.sample(withReplacement=False,fraction=0.0001).collect()

print(local_sample)

Ok, so now each element of our RDD has the following fields: IP/NAME_OF_ORIGIN, two placeholder fields (both "-" in the lines shown above), DATE/TIME, TIMEZONE, REQUEST_METHOD, RESOURCE_REQUESTED, PROTOCOL, STATUS_CODE, SIZE_OF_RESOURCE

That looks pretty much like a CSV (or a Dataframe) a Data Scientist could work with!

We can now go ahead and save this data somewhere your Data Science team can go get it. For now, we will save this as a CSV file - we will talk about writing directly to a Relational DB or Data Warehouse on Day 2.

Unfortunately, the RDD API does not have a method to write CSVs directly: we will have to add the commas and make it look like a CSV before saving it: 

In [11]:
def CSVfy(rdd_element):
  return ','.join(str(element) for element in rdd_element)

nasa_logs_structured.map(CSVfy).take(5)

['199.72.81.55,-,-,01/Jul/1995:00:00:01,-0400,GET,/history/apollo/,HTTP/1.0,200,6245',
 'unicomp6.unicomp.net,-,-,01/Jul/1995:00:00:06,-0400,GET,/shuttle/countdown/,HTTP/1.0,200,3985',
 '199.120.110.21,-,-,01/Jul/1995:00:00:09,-0400,GET,/shuttle/missions/sts-73/mission-sts-73.html,HTTP/1.0,200,4085',
 'burger.letters.com,-,-,01/Jul/1995:00:00:11,-0400,GET,/shuttle/countdown/liftoff.html,HTTP/1.0,304,0',
 '199.120.110.21,-,-,01/Jul/1995:00:00:11,-0400,GET,/shuttle/missions/sts-73/sts-73-patch-small.gif,HTTP/1.0,200,4179']

In [None]:
csv_to_be_saved = nasa_logs_structured.map(CSVfy)

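# saveAsTextFile writes a directory of part files (one per partition) at this path.
# Without a codec the parts are plain text, so ask for gzip to match the ".gz" name.
csv_to_be_saved.saveAsTextFile('../../data/nasa_logs.csv.gz',
                               compressionCodecClass='org.apache.hadoop.io.compress.GzipCodec')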

The <code>saveAsTextFile</code> method has the same caveats as its cousin <code>textFile</code>: the path where you save your data must be visible and accessible on all nodes of the cluster. As before, typically this will be a location on a Hadoop DFS. 

If you don't want to save this on whatever Distributed File System your Spark Cluster was configured to store things, you can always use the <code>collect</code> method of your RDD to bring your data over to the Driver, and then just save it to your local file system using your favourite library/function. Then again, the point of having a Spark Cluster is to deal with huge amounts of data that don't necessarily fit in your regular workstation...

You may also be thinking right now "how come Spark doesn't have something like a 'to_csv' method to write CSVs directly?", while pointing out that what we did above would certainly fail if there happened to be any commas **inside the elements** of our RDD. 

You would be right. 

It turns out Spark **does** have an easier method to create CSVs, one that handles escaping characters, quotes, commas and every other annoying thing we have to deal with when working with CSVs. This is part of the SparkSQL API though and we will talk about it on Day 2!
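
In the meantime, if you want to stay within the RDD API, one way to harden <code>CSVfy</code> is to let Python's standard <code>csv</code> module do the quoting. The sketch below uses a hypothetical helper name, <code>to_csv_line</code>, and returns a single line without a trailing newline so that it could be passed to <code>map</code> in the same way <code>CSVfy</code> was.

```python
import csv
import io

def to_csv_line(fields):
    """Join fields into one CSV line, quoting any field that needs it."""
    buffer = io.StringIO()
    # lineterminator='' keeps the result to a single line with no trailing newline
    csv.writer(buffer, lineterminator='').writerow(fields)
    return buffer.getvalue()

print(to_csv_line(['199.72.81.55', 'GET', '/history/apollo/', 200]))
print(to_csv_line(['a,b', 'say "hi"', 'plain']))
```

Fields that contain a comma or a double quote come back wrapped in quotes, with embedded quotes doubled, while ordinary fields are left untouched.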

But enough about CSVs! Let's take advantage of our now-structured dataset and see if we can do a bit of Data Science using the RDD API directly! Let's find out where most requests to the NASA webserver came from in our dataset.

To do this, let's go full Hadoop and do a little bit of Map-Reduce: 

In [12]:
# Take each line of our structured log and return a Key-Value Pair

nasa_logs_structured.map(lambda line : (line[0],1) ).take(5)

[('199.72.81.55', 1),
 ('unicomp6.unicomp.net', 1),
 ('199.120.110.21', 1),
 ('burger.letters.com', 1),
 ('199.120.110.21', 1)]

In [6]:
# Unlike "reduce", "reduceByKey" is not an Action!

nasa_logs_structured.map(lambda line : (line[0],1) ).reduceByKey(lambda a,b : a+b).map(lambda kv_pair : (kv_pair[1],kv_pair[0])).sortByKey(ascending=False).take(5)

[(17572, 'piweba3y.prodigy.com'),
 (11591, 'piweba4y.prodigy.com'),
 (9868, 'piweba1y.prodigy.com'),
 (7852, 'alyssa.prodigy.com'),
 (7573, 'siltb10.orl.mmc.com')]
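
If <code>reduceByKey</code> still feels opaque, here is what it computes, written as a plain-Python sketch on a single machine. The helper <code>reduce_by_key</code> is hypothetical and only mimics the result, not how Spark distributes the work; the input pairs are the five shown in the first output above.

```python
def reduce_by_key(pairs, func):
    """Combine the values of equal keys with `func`, one pair at a time."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined

pairs = [
    ('199.72.81.55', 1),
    ('unicomp6.unicomp.net', 1),
    ('199.120.110.21', 1),
    ('burger.letters.com', 1),
    ('199.120.110.21', 1),
]

counts = reduce_by_key(pairs, lambda a, b: a + b)

# Sort by count, largest first, like the swap-then-sortByKey step does.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])
```

In Spark the combining happens within each partition first and then across partitions, which is why the function you pass to <code>reduceByKey</code> should be associative and commutative, as addition is.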

## Exercise 1 - When Did NASA's Server Serve The Most Data?

Now you try! Take our structured log file RDD <code>nasa_logs_structured</code> and find out on which timestamp NASA's webserver registered the highest amount of data served. If you are looking for a challenge, try figuring out on which **day** there was the highest amount of data served!

HINT: Some requests don't return any data, so there is no amount on the logs, i.e., the amount is "-".

HINT2: All elements on our structured version of the log are Strings... 

In [22]:
nasa_logs_structured.persist()

PythonRDD[48] at RDD at PythonRDD.scala:48

In [24]:
nasa_logs_structured.is_cached

True

In [20]:
nasa_logs_structured.map(lambda line : (line[3],int(line[9].replace('-','0'))) ).reduceByKey(lambda a,b : a+b).map(lambda kv_pair : (kv_pair[1],kv_pair[0])).sortByKey(ascending=False).take(5)

[(6875029, '07/Jul/1995:14:03:32'),
 (3161433, '07/Jul/1995:10:28:56'),
 (3160666, '14/Jul/1995:09:11:29'),
 (3155499, '09/Jul/1995:09:22:14'),
 (3102848, '03/Jul/1995:12:30:07')]
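
For the **day** challenge, one possible approach is to shorten each timestamp to its date before reducing. The two helpers below are hypothetical names for illustration; they assume the timestamp and size formats seen in the outputs above.

```python
def day_of(timestamp):
    """Keep only the date part: '07/Jul/1995:14:03:32' -> '07/Jul/1995'."""
    return timestamp.split(':', 1)[0]

def bytes_served(size_field):
    """Turn the size column into an int, counting '-' as zero bytes."""
    return 0 if size_field == '-' else int(size_field)

print(day_of('07/Jul/1995:14:03:32'), bytes_served('6245'), bytes_served('-'))
```

Swapping these into the first <code>map</code> of the solution above, as <code>(day_of(line[3]), bytes_served(line[9]))</code>, would make <code>reduceByKey</code> add up the bytes per day instead of per second.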

## Exercise 2 - What is the Resource With the Most Unique Request Origins?

Can you find out what NASA resource had the most unique visitors/requestors in our dataset?

HINT: The <code>distinct</code> method does exactly what its name suggests


In [6]:
nasa_logs_structured.map(lambda line : (line[0],line[6])).distinct().map(lambda line : (line[1],1)).reduceByKey(lambda a,b : a+b).map(lambda kv_pair : (kv_pair[1],kv_pair[0])).sortByKey(ascending=False).take(5)

[(49583, '/images/NASA-logosmall.gif'),
 (49049, '/images/KSC-logosmall.gif'),
 (29729, '/images/MOSAIC-logosmall.gif'),
 (29490, '/images/USA-logosmall.gif'),
 (29244, '/images/WORLD-logosmall.gif')]

## Exercise 3 - Word count

If we take the element containing NASA's website resource names and replace the "/"s and "."s with " "s, we sort of get words. How many words do we get, and which are the most frequent? Write a word count program to find the most frequent words and how many unique words there are.

HINT: The DAG for the word count program is on the slide deck!
HINT2: Use the <code>count</code> method for the unique words part.

In [11]:
words = nasa_logs_structured.map(lambda line : line[6].replace('/',' ').replace('.',' '))

In [12]:
words.flatMap(lambda line: line.split(" ")).map(lambda word : (word, 1)).reduceByKey(lambda a,b : a+b).take(5)

[('', 2024051),
 ('countdown', 184637),
 ('mission-sts-73', 2327),
 ('ksclogo-medium', 58615),
 ('facts', 8619)]
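
The empty string at the top of that output comes from splitting text that starts with, or repeats, a separator. A possible refinement, sketched here with a hypothetical <code>tokenize</code> helper, is to split on both separators at once and drop the empty pieces.

```python
import re

def tokenize(resource):
    """Split a resource path on '/' and '.', dropping the empty pieces."""
    return [word for word in re.split(r'[/.]', resource) if word]

print(tokenize('/shuttle/countdown/liftoff.html'))
```

It could be used as <code>nasa_logs_structured.flatMap(lambda line : tokenize(line[6]))</code>; chaining <code>distinct()</code> and <code>count()</code> onto that would then give the number of unique words.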

## Hands-on Guided Example 2 - A Night at the Museum

The RDD API is very powerful, but on its own it has some serious limitations. Ironically, one of its biggest limitations is how awkward it is to use on structured data... like CSV files.

We caught a glimpse of that in the NASA website example, but now let's look at a real-life CSV to illustrate this and introduce the SparkSQL API - an even more powerful API for which the RDD API works as a beautiful complement.

The file below contains data about all pieces owned/maintained by the Metropolitan Museum of Art in New York City. As we've seen before, the RDD API only allows us to load it as a plain text file:

In [13]:
museum_data = sc.textFile('../../data/MetObjects.csv.gz')

In [14]:
museum_data.take(5)

['Object Number,Is Highlight,Is Timeline Work,Is Public Domain,Object ID,Gallery Number,Department,AccessionYear,Object Name,Title,Culture,Period,Dynasty,Reign,Portfolio,Constiuent ID,Artist Role,Artist Prefix,Artist Display Name,Artist Display Bio,Artist Suffix,Artist Alpha Sort,Artist Nationality,Artist Begin Date,Artist End Date,Artist Gender,Artist ULAN URL,Artist Wikidata URL,Object Date,Object Begin Date,Object End Date,Medium,Dimensions,Credit Line,Geography Type,City,State,County,Country,Region,Subregion,Locale,Locus,Excavation,River,Classification,Rights and Reproduction,Link Resource,Object Wikidata URL,Metadata Date,Repository,Tags,Tags AAT URL,Tags Wikidata URL',
 '1979.486.1,False,False,False,1,,The American Wing,1979,Coin,One-dollar Liberty Head Coin,,,,,,16429,Maker,,James Barton Longacre,"American, Delaware County, Pennsylvania 1794–1869 Philadelphia, Pennsylvania",,"Longacre, James Barton",American,1794      ,1869      ,,http://vocab.getty.edu/page/ulan/500011409,,1853

In [15]:
museum_data.count()

587818

In [None]:
museum_data_split = museum_data.map(lambda line : line.split(","))

In [None]:
museum_data_split.take(1)
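
Splitting on commas like this miscounts fields whenever a quoted value contains a comma, as the artist bio in the first data row above does. The sketch below compares the naive split with Python's standard <code>csv</code> module on an excerpt of that row; the variable names are illustrative only.

```python
import csv

# Excerpt of the first data row shown above (five real columns).
row = ('James Barton Longacre,'
       '"American, Delaware County, Pennsylvania 1794–1869 Philadelphia, Pennsylvania",'
       ',"Longacre, James Barton",American')

naive_fields = row.split(',')            # breaks quoted values apart at their commas
parsed_fields = next(csv.reader([row]))  # respects the quotes

print(len(naive_fields), len(parsed_fields))
print(parsed_fields[1])
```

A per-line fix along these lines would still treat the header row as data and would not cope with a quoted field that spans several lines, which is part of the motivation for the reader used next.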

In [None]:
from pyspark.sql import SQLContext

In [None]:
sqlContext = SQLContext(sc)

In [None]:
museum_dataframe = sqlContext.read.options(header='true').csv('../../data/MetObjects.csv.gz')

In [None]:
museum_dataframe

In [None]:
museum_dataframe.head(1)