<h1 align = "center"> Spark Fundamentals 1 - Introduction to Spark</h1>
<h2 align = "center"> Getting Started</h2>
<h4 align = "center"> April 11, 2017 </h4>
<br align = "left">

![](http://spark.apache.org/images/spark-logo.png) ![](https://upload.wikimedia.org/wikipedia/commons/f/f8/Python_logo_and_wordmark.svg)

## 1. Get the data

In [1]:
!wget https://ibm.box.com/shared/static/1c65hfqjxyxpdkts42oab8i8mzxbpvc8.zip

--2017-04-12 05:06:39--  https://ibm.box.com/shared/static/1c65hfqjxyxpdkts42oab8i8mzxbpvc8.zip
Resolving ibm.box.com (ibm.box.com)... 107.152.27.197
Connecting to ibm.box.com (ibm.box.com)|107.152.27.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/shared/static/1c65hfqjxyxpdkts42oab8i8mzxbpvc8.zip [following]
--2017-04-12 05:06:39--  https://ibm.ent.box.com/shared/static/1c65hfqjxyxpdkts42oab8i8mzxbpvc8.zip
Resolving ibm.ent.box.com (ibm.ent.box.com)... 107.152.26.211
Connecting to ibm.ent.box.com (ibm.ent.box.com)|107.152.26.211|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://public.boxcloud.com/d/1/lF6oUONnSZlKtI2og4oQXr_Qh_zu0XHWUP0I4BUu3DPVMZ3MGZPF3K9Rg_M6apQl1eYtSZyopmdaA0QubL5CaQoW5-WI47RdOUGNtys-lgEToifrZLx_ph6AiGoyCEp_KBDhUg9yxIQ71xRLtBPANGg7TDPru3VMdcgafYkUdDbDs7Gons5TcwkUR5z9IuV1sJ--6zMuuapHDbeU0I658u4zNr5uPl15xNiDx3j6i6pBAEiKfLhFU0mzuxDLgTAKHh_oqhfMF9IaAzCHr9XVgWvLr

In [2]:
!unzip -o -d /resources 1c65hfqjxyxpdkts42oab8i8mzxbpvc8.zip

Archive:  1c65hfqjxyxpdkts42oab8i8mzxbpvc8.zip
  inflating: /resources/LabData/.DS_Store  
  inflating: /resources/__MACOSX/LabData/._.DS_Store  
  inflating: /resources/LabData/followers.txt  
  inflating: /resources/__MACOSX/LabData/._followers.txt  
  inflating: /resources/LabData/notebook.log  
  inflating: /resources/__MACOSX/LabData/._notebook.log  
  inflating: /resources/LabData/nyctaxi.csv  
  inflating: /resources/__MACOSX/LabData/._nyctaxi.csv  
  inflating: /resources/LabData/nyctaxi100.csv  
  inflating: /resources/__MACOSX/LabData/._nyctaxi100.csv  
  inflating: /resources/LabData/nyctaxisub.csv  
  inflating: /resources/__MACOSX/LabData/._nyctaxisub.csv  
  inflating: /resources/LabData/nycweather.csv  
  inflating: /resources/__MACOSX/LabData/._nycweather.csv  
  inflating: /resources/LabData/pom.xml  
  inflating: /resources/__MACOSX/LabData/._pom.xml  
  inflating: /resources/LabData/README.md  
  inflating: /resources/__MACOSX/LabData/._README.md  
  inflating: /reso

In [3]:
!ls -1 /resources/LabData/

followers.txt
notebook.log
nyctaxi100.csv
nyctaxi.csv
nyctaxisub.csv
nycweather.csv
pom.xml
README.md
taxistreams.py
users.txt


## 2. Starting with Spark

In [4]:
sc.version

u'1.6.0'

In [5]:
readme = sc.textFile("/resources/LabData/README.md")

Let's perform some RDD actions on this text file. Count the number of items in the RDD using this command:


In [6]:
readme.count()

98

In [7]:
readme.first()

u'# Apache Spark'

Now let’s try a transformation. Use the filter transformation to return a new RDD with a subset of the items in the file.

In [8]:
linesWithSpark = readme.filter(lambda line: "spark" in line).count()

In [9]:
linesWithSpark

11

In [10]:
linesWithSpark = readme.filter(lambda line: "Spark" in line)
readme.filter(lambda line: "Spark" in line).count()

18
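The two counts above differ (11 versus 18) because Python's `in` test on strings is case-sensitive. The sketch below shows the same predicates in plain Python, without Spark, on a few made-up lines; the `lines` list is a hypothetical stand-in for the readme, not its real contents.

```python
# Hypothetical sample lines standing in for README.md (not the real file).
lines = [
    "# Apache Spark",
    "Spark is a fast and general cluster computing system.",
    "See http://spark.apache.org/ for documentation.",
    "Build with Maven.",
]

# Same predicates as the notebook cells above; "in" is case-sensitive.
lower_matches = [line for line in lines if "spark" in line]
upper_matches = [line for line in lines if "Spark" in line]

# A case-insensitive variant lowercases each line before testing it.
any_case_matches = [line for line in lines if "spark" in line.lower()]

print("lowercase only: %d" % len(lower_matches))
print("capitalized only: %d" % len(upper_matches))
print("either case: %d" % len(any_case_matches))
```

The same lambda, `lambda line: "spark" in line.lower()`, could be passed to `readme.filter` if a case-insensitive match is what you want.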

RDDs can be used for more complex computations. For example, the following finds the number of words in the line of the readme file that has the most words:

In [11]:
readme.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

14

- The first part, map, converts each line to an integer: the number of words in that line.
- The second part, reduce, compares those integers pairwise and keeps the larger one, yielding the highest word count of any line.

The arguments to map and reduce are Python anonymous functions (lambdas), but you can also pass any top-level Python function.

In [12]:
# Note: this definition shadows Python's built-in max() for the rest of the session.
def max(a, b):
    if a > b:
        return a
    else:
        return b

In [13]:
readme.map(lambda line: len(line.split())).reduce(max)

14
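The map-then-reduce pattern above returns the largest word count, not the line that has it. As a minimal plain-Python sketch (using `functools.reduce` in place of the RDD method, and hypothetical sample lines), the following mirrors the same two steps and then shows one way to keep the line as well, by carrying `(count, line)` pairs through the reduce.

```python
from functools import reduce

# Hypothetical sample lines standing in for README.md (not the real file).
lines = [
    "# Apache Spark",
    "Spark is a fast and general cluster computing system.",
    "",
    "Build with Maven.",
]

# map step: one word count per line
word_counts = [len(line.split()) for line in lines]

# reduce step: compare pairwise and keep the larger value
most_words = reduce(lambda a, b: a if a > b else b, word_counts)

# To recover the line itself, reduce over (count, line) pairs instead.
pairs = [(len(line.split()), line) for line in lines]
longest = reduce(lambda a, b: a if a[0] >= b[0] else b, pairs)

print(most_words)
print(longest[1])
```

The same idea should carry over to the RDD: map each line to a `(count, line)` tuple and pass a reduce function that compares the first element.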

Spark can express the MapReduce data flow pattern. We can use this to do a word count on the readme file.

In [14]:
wordCounts = readme.flatMap(lambda line: line.split())\
                   .map(lambda word: (word, 1))\
                   .reduceByKey(lambda a, b: a + b)

In [15]:
wordCounts.take(10)

[(u'when', 1),
 (u'R,', 1),
 (u'including', 3),
 (u'computation', 1),
 (u'using:', 1),
 (u'guidance', 3),
 (u'Scala,', 1),
 (u'environment', 1),
 (u'only', 1),
 (u'rich', 1)]

Here we combined the flatMap, map, and the reduceByKey functions to do a word count of each word in the readme file.
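To make the three stages concrete, here is a rough plain-Python equivalent that runs without Spark. The two input lines are made up for illustration, and the dictionary loop only imitates what reduceByKey does on a single machine.

```python
from itertools import chain

# Hypothetical input lines (not from the readme).
lines = ["to be or not", "to be"]

# flatMap: split every line and flatten the results into one list of words
words = list(chain.from_iterable(line.split() for line in lines))

# map: pair each word with the count 1
pairs = [(word, 1) for word in words]

# reduceByKey: add together the values that share the same key
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(sorted(counts.items()))
```

Using map instead of flatMap in the first stage would produce a list of lists (one list of words per line), which is why flatMap is needed before pairing each word with 1.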

To collect the word counts, use the collect action.

#### Note that collect() brings all of the data to the driver node. For a small dataset this is acceptable, but for a large dataset it can cause an out-of-memory error. Use collect() for testing only. The safer approach is take(n), e.g. print wordCounts.take(10)
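The difference between the two approaches can be pictured with a plain-Python generator: materializing everything is like collect, while pulling only the first n items is like take(n). This is only an analogy under simplifying assumptions (a single machine, and a hypothetical `records` generator standing in for a large dataset); it does not model how Spark schedules work across partitions.

```python
from itertools import islice

def records(total=10**6):
    # Hypothetical stand-in for a large dataset, produced one item at a time.
    for i in range(total):
        yield ("word%d" % i, 1)

# Analogous to take(5): only five items are ever produced and held in memory.
first_five = list(islice(records(), 5))

# Analogous to collect(): every item is produced and held in memory at once.
# Kept deliberately small here; with a truly large dataset this is the risky call.
everything = list(records(total=1000))

print(first_five)
print(len(everything))
```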

In [16]:
wc = readme.flatMap(lambda line: line.split())\
           .map(lambda word: (word, 1))\
           .reduceByKey(lambda a, b: a + b)

In [17]:
wc.take(5)

[(u'when', 1),
 (u'R,', 1),
 (u'including', 3),
 (u'computation', 1),
 (u'using:', 1)]

In [18]:
# swap (word, count) to (count, word) so that sortByKey sorts by word frequency
swap = lambda x: (x[1], x[0])
wc_swap = wc.map(swap)

In [19]:
wc_swap.take(5)

[(1, u'when'),
 (1, u'R,'),
 (3, u'including'),
 (1, u'computation'),
 (1, u'using:')]

In [20]:
# sort by key (the count) in descending order (ascending=False), into 1 partition
freq = wc_swap.sortByKey(False, 1)

In [21]:
freq.take(5)

[(21, u'the'), (14, u'Spark'), (14, u'to'), (12, u'for'), (10, u'and')]

In [22]:
wordCounts.reduce(lambda a, b: a if (a[1] > b[1]) else b)

(u'the', 21)
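The swap-then-sort steps above amount to ordering the pairs by their count. As a minimal plain-Python sketch of the same idea, the following sorts by the second element directly, so no swap is needed, and uses `heapq.nlargest` when only the top few entries matter. The sample pairs are taken from the outputs shown above, in an arbitrary order.

```python
import heapq

# A few (word, count) pairs copied from the notebook outputs above.
word_counts = [
    ("when", 1), ("including", 3), ("the", 21), ("Spark", 14),
    ("to", 14), ("for", 12), ("and", 10),
]

# Full sort by count, descending; ties keep their original relative order.
by_freq = sorted(word_counts, key=lambda kv: kv[1], reverse=True)

# Only the three most frequent words, without sorting everything.
top3 = heapq.nlargest(3, word_counts, key=lambda kv: kv[1])

print(by_freq[0])
print(top3)
```

PySpark's RDD API has comparable calls that accept a key function, such as `sortBy`, `top`, and `takeOrdered`, which can replace the manual swap; check the documentation for your Spark version before relying on them.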

## 3. Using Spark caching

Spark caching can be used to pull data sets into a cluster-wide in-memory cache. This is very useful when the same data is accessed repeatedly, such as when querying a small “hot” dataset or running an iterative algorithm.

As a simple example, let’s time repeated count() calls on our linesWithSpark dataset, then mark the dataset to be cached and time them again. Remember that cache() is lazy, like a transformation: nothing is stored until an action such as count() is called. The first action after cache() populates the cache and later actions read from it, so you should notice a small increase in speed.

In [23]:
linesWithSpark = readme.filter(lambda line: "Spark" in line)
print linesWithSpark.count()

18


In [24]:
from timeit import Timer
def count():
    return linesWithSpark.count()
t = Timer(lambda: count())

In [25]:
print t.timeit(number=50)

4.07555699348


In [26]:
linesWithSpark.cache()
print t.timeit(number=50)

3.63598179817


It may seem silly to cache such a small file: here the 50 timed runs only dropped from about 4.08 to about 3.64 seconds. The same calls work unchanged on larger data sets spread across tens or hundreds of nodes, where actions that run against the cache can perform significantly better.
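The lazy behavior of cache() can be pictured with a toy class. `LazyDataset` below is hypothetical and is not part of Spark; it simply recomputes its contents on every action until caching is requested, and counts how often the underlying computation really runs.

```python
class LazyDataset(object):
    """Hypothetical toy class (not a Spark API) that imitates lazy caching."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the data
        self._use_cache = False
        self._cached = None
        self.computations = 0     # how many times compute() actually ran

    def cache(self):
        # Only sets a flag; nothing is computed or stored yet.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data   # the first action after cache() fills the cache
        return data

    def count(self):
        return len(self._materialize())


# Hypothetical sample lines standing in for README.md.
lines = ["# Apache Spark", "Spark is fast", "Build with Maven"]
ds = LazyDataset(lambda: [line for line in lines if "Spark" in line])

ds.count()
ds.count()
uncached_runs = ds.computations                # each count recomputed the filter

ds.cache()
ds.count()
ds.count()
ds.count()
cached_runs = ds.computations - uncached_runs  # only the first count recomputed

print(uncached_runs)
print(cached_runs)
```

In this sketch the two uncached counts each rerun the filter, while the three counts after cache() trigger a single recomputation, which is the effect the timings above measure on a real RDD.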