<h1 align = "center"> Spark Fundamentals 1 - Introduction to Spark</h1>
<h2 align = "center"> Getting Started</h2>
<h4 align = "center"> January 11, 2016 </h4>
<br align = "left">

**Related free online courses:**  
- [Spark Fundamentals II](http://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals-ii/)  
- [Data Analysis using R](https://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-data-analysis-using-r/)  
- [Big Data Fundamentals](http://bigdatauniversity.com/bdu-wp/bdu-course/big-data-fundamentals/)  

<img src="http://spark.apache.org/images/spark-logo.png" height="100" align="left">
<img src="https://upload.wikimedia.org/wikipedia/commons/f/f8/Python_logo_and_wordmark.svg" height="95" align="left">
<img src="https://upload.wikimedia.org/wikipedia/en/8/85/Scala_logo.png" height="85" align="left">

## Welcome to Spark Fundamentals - Introduction to Spark. Spark is built around speed and ease of use. In these labs, you will see for yourself how easy it is to get started with Spark.

Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset, or RDD. In a subsequent lab exercise, you will learn more about the details of RDDs. RDDs have actions, which return values, and transformations, which return pointers to new RDDs.
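
The difference between transformations and actions can be sketched in plain Python, without Spark. The `LazyDataset` class below is a hypothetical stand-in invented for this illustration and is not how Spark is implemented. It only shows the idea that a transformation records work and returns a new dataset, while an action runs the recorded work and returns a value.

```python
class LazyDataset(object):
    """Hypothetical stand-in for an RDD; not part of Spark."""

    def __init__(self, source, steps=()):
        self._source = list(source)  # the underlying items
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    # Transformations: record the step and return a NEW dataset.
    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def _run(self):
        # Apply the recorded steps in order, only when an action needs data.
        # Inside a method, filter/map refer to the Python built-ins.
        items = iter(self._source)
        for kind, func in self._steps:
            items = filter(func, items) if kind == "filter" else map(func, items)
        return items

    # Actions: run the recorded steps and return a plain value.
    def count(self):
        return sum(1 for _ in self._run())

    def first(self):
        return next(iter(self._run()))


lines = LazyDataset(["# Apache Spark", "", "Spark is fast", "See the docs"])
with_spark = lines.filter(lambda line: "Spark" in line)  # nothing runs yet
spark_count = with_spark.count()  # action: the filter is applied here
first_line = lines.first()        # action on the original dataset
print(spark_count)
print(first_line)
```

Note that `lines` is unchanged by the call to `filter`; the filtered view is a separate object, which mirrors how a transformation returns a pointer to a new RDD.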

This set of labs uses Data Scientist Workbench to provide an interactive environment to develop applications and analyze data. It is available in either Scala or Python shells. Scala runs on the Java VM and is thus a good way to use existing Java libraries. In this lab exercise, we will set up our environment in preparation for the later labs.

After completing this set of hands-on labs, you should be able to:

* Start the Spark shell with Scala and Python
* Perform basic RDD actions and transformations
* Use caching to speed up repeated operations


### Using this notebook - Please ensure you have viewed the Data Scientist Workbench tutorial on Big Data University before proceeding.

This is an interactive environment where you can show your code through cells, and documentation through markdown.

Look at the top right corner. Do you see "Python 2"? This indicates that you are running Python in this notebook.

**To run a cell:** Ctrl + Enter

**To run a cell and go to the next cell:** Shift + Enter

### Try creating a new cell below.

**To create a new cell:** In the menu, go to _"Insert" > "Insert Cell Below"_. Or, click outside of a cell, and press "a" (insert cell above) or "b" (insert cell below).

# Lab Setup

Run the following cells to get the lab data.

In [1]:
!wget https://ibm.box.com/shared/static/1c65hfqjxyxpdkts42oab8i8mzxbpvc8.zip

--2017-04-01 08:36:14--  https://ibm.box.com/shared/static/1c65hfqjxyxpdkts42oab8i8mzxbpvc8.zip
Resolving ibm.box.com (ibm.box.com)... 107.152.27.197
Connecting to ibm.box.com (ibm.box.com)|107.152.27.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/shared/static/1c65hfqjxyxpdkts42oab8i8mzxbpvc8.zip [following]
--2017-04-01 08:36:14--  https://ibm.ent.box.com/shared/static/1c65hfqjxyxpdkts42oab8i8mzxbpvc8.zip
Resolving ibm.ent.box.com (ibm.ent.box.com)... 107.152.24.211, 107.152.25.211
Connecting to ibm.ent.box.com (ibm.ent.box.com)|107.152.24.211|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://public.boxcloud.com/d/1/WcvwXyfryOqkNCspWbXn8LtBJ6FBz2lqu322fswxZybdOATQQS6b_hAi7AOH6TufU5Lp1eqnqZEJPU67gutm7l6zjX78et8HB3Ehvemcwm5jmEcpEjJ7FWcBDqIcK_-0objwf_K8WLo9DZagBmBTBq2s9_r84wPB_chGB2EHMCPIGOfxkM7Df2VQH6WmM2B42D2E6-XlaEjzrCJRkPRIMyPBG3JbMriH67OC0sMnFr_XVvXqZ_JDqR859L-DBs3wM_CELnuYN

In [2]:
!unzip -o -d /resources 1c65hfqjxyxpdkts42oab8i8mzxbpvc8.zip

Archive:  1c65hfqjxyxpdkts42oab8i8mzxbpvc8.zip
  inflating: /resources/LabData/.DS_Store  
  inflating: /resources/__MACOSX/LabData/._.DS_Store  
  inflating: /resources/LabData/followers.txt  
  inflating: /resources/__MACOSX/LabData/._followers.txt  
  inflating: /resources/LabData/notebook.log  
  inflating: /resources/__MACOSX/LabData/._notebook.log  
  inflating: /resources/LabData/nyctaxi.csv  
  inflating: /resources/__MACOSX/LabData/._nyctaxi.csv  
  inflating: /resources/LabData/nyctaxi100.csv  
  inflating: /resources/__MACOSX/LabData/._nyctaxi100.csv  
  inflating: /resources/LabData/nyctaxisub.csv  
  inflating: /resources/__MACOSX/LabData/._nyctaxisub.csv  
  inflating: /resources/LabData/nycweather.csv  
  inflating: /resources/__MACOSX/LabData/._nycweather.csv  
  inflating: /resources/LabData/pom.xml  
  inflating: /resources/__MACOSX/LabData/._pom.xml  
  inflating: /resources/LabData/README.md  
  inflating: /resources/__MACOSX/LabData/._README.md  
  inflating: /reso

In [3]:
!ls -1 /resources/LabData/

followers.txt
notebook.log
nyctaxi100.csv
nyctaxi.csv
nyctaxisub.csv
nycweather.csv
pom.xml
README.md
taxistreams.py
users.txt


You should have the following files:
    
* followers.txt
* notebook.log
* nyctaxi100.csv
* nyctaxi.csv
* nyctaxisub.csv
* nycweather.csv
* pom.xml
* README.md
* taxistreams.py
* users.txt

### Starting with Spark

The notebooks provide code assist. For example, type in "sc." followed by the Tab key to get the list of options associated with the Spark context:

In [None]:
sc.
    

To run a command as code, simply select the cell you want to run and either:

* Click the play button in the toolbar above
* Press "_Shift+Enter_"

Let's run a basic command and check the version of Spark running:

In [4]:
sc.version

u'1.6.0'

You can get files into your workbench in three ways:

1. Drag and drop a file from your file explorer onto the browser. This will upload the file to your workbench.
2. Enter the URL of a file on the internet into the text field in the upper right of the screen.
3. Run code (such as wget) to download a file into your notebook.

Download the following file by pasting the link in the search field at the top right of the page and pressing ENTER:

https://raw.githubusercontent.com/apache/spark/master/README.md

You should see the file show up in _Recent Data_. Because the zip file already contains a file named README.md, the download should appear as README0.md. You can delete this file now by clicking the twistie on README0.md and choosing delete.

Highlight the text string between the double quotes in the cell below, then click the twistie next to "README.md" in _Recent Data_ and select _Insert Path_ to paste the path. Then run the code in the cell.


In [6]:
readme = sc.textFile("/resources/LabData/README.md")

Let’s perform some RDD actions on this text file. Count the number of items in the RDD using this command:

In [7]:
readme.count()

98

You should see that this RDD action returned a value of 98.

Let’s run another action. Run this command to find the first item in the RDD:

In [8]:
readme.first()

u'# Apache Spark'

Now let’s try a transformation. Use the filter transformation to return a new RDD with a subset of the items in the file. Type in this command:

In [11]:
linesWithSpark = readme.filter(lambda line: "Spark" in line)

You can even chain together transformations and actions. To find out how many lines contain the word “Spark”, type in:

In [12]:
linesWithSpark = readme.filter(lambda line: "Spark" in line)
readme.filter(lambda line: "Spark" in line).count()

18

# More on RDD Operations

This section builds upon the previous section. In this section, you will see that RDDs can be used for more complex computations. You will find the number of words in the longest line of the README file.

Run the following cell.

In [13]:
readme.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

14

There are two parts to this. The first maps each line to an integer value, the number of words in that line. In the second part, reduce is called to find the largest of those word counts. The arguments to map and reduce are Python anonymous functions (lambdas), but you can use any top-level Python function. In the next step, you’ll define a max function to illustrate this feature.

Define the max function. You will need to type this in:

In [14]:
def max(a, b):
    # Note: this shadows Python's built-in max() for the rest of the notebook.
    if a > b:
        return a
    else:
        return b


Now run the following with the max function:

In [15]:
readme.map(lambda line: len(line.split())).reduce(max)

14
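
The map-then-reduce pattern used in the cells above can also be tried on an ordinary Python list, which makes each stage easy to inspect. This is only a local sketch: `sample_lines` is made-up data rather than the lab's README, and `functools.reduce` stands in for the RDD `reduce` action. The last statement shows one way to recover the longest line itself instead of just its word count.

```python
from functools import reduce

# Made-up sample data standing in for the lines of a text file.
sample_lines = [
    "# Apache Spark",
    "Spark is a fast and general cluster computing system",
    "",
    "It provides high-level APIs",
]

# "map" step: one integer per line, the number of whitespace-separated words.
words_per_line = [len(line.split()) for line in sample_lines]

# "reduce" step: combine the integers pairwise, keeping the larger each time.
longest = reduce(lambda a, b: a if a > b else b, words_per_line)

# Variation: reduce over the lines themselves to keep the longest line.
longest_line = reduce(
    lambda a, b: a if len(a.split()) >= len(b.split()) else b, sample_lines)

print(words_per_line)
print(longest)
print(longest_line)
```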

Spark supports the MapReduce data flow pattern. We can use this to do a word count on the README file.

In [17]:
wordCounts = readme.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)

Here we combined the flatMap, map, and the reduceByKey functions to do a word count of each word in the readme file.
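
To see what each of the three stages contributes, here is a plain-Python walk-through on two made-up lines. It is a local analogy under simplifying assumptions (a single machine, a small in-memory list), not Spark code; the variable names are chosen only for this illustration.

```python
from collections import defaultdict

# Made-up sample data standing in for the lines of a text file.
sample_lines = ["Spark runs on Hadoop", "Spark runs locally"]

# flatMap: split every line and flatten the results into one list of words.
words = [word for line in sample_lines for word in line.split()]

# map: pair each word with an initial count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey: merge the values of pairs that share a key using a + b.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] = counts[word] + n
word_counts = dict(counts)

print(words)
print(word_counts)
```

The important difference from `map` alone is the flattening step: mapping `split` over the lines would give one list per line, whereas the word count needs a single flat collection of words.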

To collect the word counts, use the collect action.

#### Note that the collect function brings all of the data into the driver node. For a small dataset this is acceptable, but for a large dataset it can cause an Out Of Memory error. It is recommended to use collect() for testing only. The safer approach is to use the take() function, e.g. print wordCounts.take(n)
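
The reasoning behind preferring a bounded preview can be illustrated in plain Python. The generator below is a hypothetical stand-in for a large dataset, and `itertools.islice` plays the role of a take-style preview; this is an analogy only and says nothing about how Spark schedules work across a cluster.

```python
from itertools import islice

produced = []  # records which items were actually generated


def generate_records(limit=10**6):
    # Hypothetical stand-in for a large dataset of (word, count) pairs.
    for i in range(limit):
        produced.append(i)
        yield ("word%d" % i, 1)


# Analogue of a bounded preview: stop after three items.
preview = list(islice(generate_records(), 3))

print(preview)
print(len(produced))  # only the previewed items were ever generated
```

Materializing the whole generator with `list(generate_records())` would instead build a million-element list in memory, which is the local counterpart of the driver-side memory pressure described above.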

In [18]:
wordCounts.collect()

[(u'when', 1),
 (u'R,', 1),
 (u'including', 3),
 (u'computation', 1),
 (u'using:', 1),
 (u'guidance', 3),
 (u'Scala,', 1),
 (u'environment', 1),
 (u'only', 1),
 (u'rich', 1),
 (u'Apache', 1),
 (u'sc.parallelize(range(1000)).count()', 1),
 (u'Building', 1),
 (u'guide,', 1),
 (u'return', 2),
 (u'Please', 3),
 (u'Try', 1),
 (u'not', 1),
 (u'Spark', 14),
 (u'scala>', 1),
 (u'Note', 1),
 (u'cluster.', 1),
 (u'./bin/pyspark', 1),
 (u'have', 1),
 (u'params', 1),
 (u'through', 1),
 (u'GraphX', 1),
 (u'[run', 1),
 (u'abbreviated', 1),
 (u'[project', 2),
 (u'##', 8),
 (u'library', 1),
 (u'see', 1),
 (u'"local"', 1),
 (u'[Apache', 1),
 (u'will', 1),
 (u'#', 1),
 (u'processing,', 1),
 (u'for', 12),
 (u'[building', 1),
 (u'provides', 1),
 (u'print', 1),
 (u'supports', 2),
 (u'built,', 1),
 (u'[params]`.', 1),
 (u'available', 1),
 (u'run', 7),
 (u'tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).',
  1),
 (u'This', 2),
 (u'Hadoop,', 2),
 (u'Tests', 1),
 (u'example:', 

### <span style="color: red">YOUR TURN:</span> 

#### In the cell below, determine the most frequent word in the README and how many times it was used.

In [31]:
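#YOUR CODE BELOW
# Option 1: a single reduce that keeps the (word, count) pair with the larger count.
wordCounts.reduce(lambda a, b: a if (a[1] > b[1]) else b)
# Option 2: swap each pair to (count, word), sort by count descending, and take the top 5.
wordCounts.map(lambda kv: (kv[1], kv[0])).sortByKey(False).take(5)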


[(21, u'the'), (14, u'Spark'), (14, u'to'), (12, u'for'), (10, u'and')]

Highlight text field for answer:

<input type="text" size="80" value="wordCounts.reduce(lambda a, b: a if (a[1] > b[1]) else b)" style="color: white">

## Using Spark caching

In this short section, you’ll see how Spark caching can be used to pull data sets into a cluster-wide in-memory cache. This is very useful for repeatedly accessed data, such as when querying a small “hot” dataset or running an iterative algorithm. Both Python and Scala use the same commands.

As a simple example, let’s mark our linesWithSpark dataset to be cached and then invoke the first count operation to tell Spark to cache it. Remember that lazy operations such as cache() are not processed until an action like count() is called. Once you run the second count() operation, you should notice a small increase in speed.


In [None]:
print linesWithSpark.count()

In [None]:
from timeit import Timer
def count():
    return linesWithSpark.count()
t = Timer(lambda: count())

In [None]:
print t.timeit(number=50)

In [None]:
linesWithSpark.cache()
print t.timeit(number=50)

It may seem silly to cache such a small file, but the same approach works for larger data sets spread across tens or hundreds of nodes. The second linesWithSpark.count() action runs against the cache and would perform significantly better for large datasets.
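
The effect of caching can be sketched without a cluster by counting how often the source data is re-scanned. The `CachingDataset` class below is a hypothetical illustration written for this lab text, not Spark's implementation. It assumes a filtered dataset whose `cache()` call only sets a flag, with the cache being filled by the next action.

```python
class CachingDataset(object):
    """Hypothetical stand-in for a filtered RDD; not Spark's implementation."""

    def __init__(self, source_lines, predicate):
        self._source_lines = source_lines
        self._predicate = predicate
        self._cache_requested = False
        self._cached = None
        self.scans = 0  # how many times the source was re-read

    def cache(self):
        # Lazy: only marks the dataset; nothing is computed here.
        self._cache_requested = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached  # served from the cache, no scan
        self.scans += 1
        result = [line for line in self._source_lines if self._predicate(line)]
        if self._cache_requested:
            self._cached = result  # the first action after cache() fills it
        return result

    def count(self):
        return len(self._compute())


ds = CachingDataset(["# Apache Spark", "", "Spark is fast"],
                    lambda line: "Spark" in line)

ds.count()
ds.count()
scans_without_cache = ds.scans  # every action re-scanned the source

ds.cache()                      # marks only; no scan happens here
scans_after_mark = ds.scans

ds.count()                      # scans once more and fills the cache
ds.count()                      # answered from the cache
scans_with_cache = ds.scans

print(scans_without_cache)
print(scans_after_mark)
print(scans_with_cache)
```

The scan counter rises on every uncached action but only once after the dataset is marked, which is the behaviour the timing cells above are meant to expose.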

### Summary
Having completed this exercise, you should now be able to log in to your environment and use the Spark shell to run simple actions and transformations in Scala and/or Python. You understand that Spark caching can be used to cache large datasets, and that subsequent operations on them will use the data in the cache rather than re-fetching it from HDFS.

The next labs will show you RDD operations in more detail. The labs are available in both Scala and Python; you can do either or both.

<h1 align="center" style="font-family: Monaco;">Continue on "<a href="/api/v1/resources/Spark%20Fundamentals%201%20-%20PythonRDD.ipynb">Spark Fundamentals 1 - PythonRDD.ipynb</a>"</h1>
<h1 align="center" style="font-family: Monaco;">Continue on "<a href="/api/v1/resources/Spark%20Fundamentals%201%20-%20ScalaRDD.ipynb">Spark Fundamentals 1 - ScalaRDD.ipynb</a>"</h1>
