#Hello Spark!!!
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

<img src='https://github.com/carloapp2/SparkPOT/blob/master/spark.png?raw=true' width="80%" height="80%"></img>

This notebook will show you some basic concepts to start working with Apache Spark including:

- Understanding Spark Context
- Creating Resilient Distributed Datasets (RDD)
- Performing Data Transformations
- Loading Data Files to use with Spark

####Tool Tips:
- Notice the navigation and command buttons at the top of the notebook. Press the Play and Stop buttons to execute code and interrupt execution.
- Notice that each cell has a type (Markdown, Code, etc.). This cell is a Markdown cell, which simply displays formatted information, whereas a Code cell lets you execute code against Spark.
- Notice that each cell has a designation, for example In [n]:, where n is the cell's execution number. When you see In [*]:, the cell is still executing.
- To see all methods available for an object, use the Tab key. For example, type "sc." and press Tab, and a drop-down list will appear.
- To execute the code in the active cell, press the Play button at the top, or use the shortcut keys Shift-Enter or Ctrl-Enter.

###Spark Driver and Worker Programs
A Spark application consists of a driver program and worker programs. Worker programs run on cluster nodes or in local threads. RDDs are distributed across workers.

<img src='https://github.com/carloapp2/SparkPOT/blob/master/Spark%20Architecture.png?raw=true' width="80%" height="80%"></img>
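As a rough illustration of the driver/worker split, here is a plain-Python sketch (not Spark code) in which a "driver" cuts a list into partitions, hands each partition to a local worker thread, and combines the partial results. The helper names `partition` and `worker_sum`, and the choice of two workers, are assumptions made for this example only.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split data into num_partitions slices (hypothetical helper)."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def worker_sum(chunk):
    """Work done by one worker on its own partition."""
    return sum(chunk)

data = list(range(1, 11))
partitions = partition(data, 2)

# The "driver" sends one partition to each worker thread and collects the results
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_sums = list(pool.map(worker_sum, partitions))

# The "driver" combines the partial results into the final answer
total = sum(partial_sums)
print(partitions)
print(partial_sums, total)
```

Spark does the same kind of splitting and combining for you, across machines rather than threads.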

###Python Spark (PySpark)
We are using PySpark, the Python programming interface to Spark.
PySpark provides an easy-to-use programming abstraction and parallel runtime.

###Spark Context
The Apache Spark driver application uses a context object as its programming interface to the cluster. This object is known as the Spark Context, and it is available from Python, Scala and Java. The SparkContext object tells Spark how and where to access a cluster.<br>
<font color="red">This lab uses IBM's fully managed cloud-based notebook environment, so the Spark Context is predefined for you.</font><br>

In other environments you would need to pick an interpreter (e.g. pyspark for Python) and create a Spark Config object to initialize a Spark Context. <br>

Example:<br>
from pyspark import SparkContext, SparkConf<br>
conf = SparkConf().setAppName(appName).setMaster(master)<br>
sc = SparkContext(conf=conf)<br>

In [1]:
#Execute Spark Context to see if it's active in the cluster
sc

<pyspark.context.SparkContext at 0x7f2aa0020490>

In [2]:
#Execute to get the version of the Spark driver application

#Note: There are different versions of Spark, which support additional 
#functionality such as DataFrames, Streaming and Machine Learning.
sc.version

u'1.6.0'

##Resilient Distributed Datasets

Apache Spark uses an abstraction for working with data called RDDs - Resilient Distributed Datasets. An RDD is an immutable, fault-tolerant collection of elements that can be operated on in parallel. In Apache Spark all work is expressed by creating new RDDs, transforming existing RDDs, or using RDDs to compute results. When working with RDDs, the Spark driver application automatically distributes the work across your cluster.

####You can construct RDDs by parallelizing existing Python collections (lists), by transforming existing RDDs, or by loading files from HDFS or any other storage system.

###Understanding Lazy Evaluations...
RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Transformations are lazy: they do not initiate execution on the cluster. Instead, each transformation is recorded in a Directed Acyclic Graph (DAG), which Spark uses to optimize execution on the cluster when an action is called.
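Lazy evaluation can be hard to picture, so here is a minimal plain-Python analogy (not Spark code) using a generator expression. Building the generator plays the role of a transformation, and asking it for a value plays the role of an action. The `executed` list and the `double` function are assumptions introduced only to make the work visible.

```python
executed = []  # records every element that actually gets processed

def double(n):
    executed.append(n)
    return n * 2

numbers = [1, 2, 3, 4, 5]

# "Transformation": builds a lazy pipeline, double() has not been called yet
doubled = (double(n) for n in numbers)
calls_before_action = len(executed)

# "Action": asking for a value forces just enough work to produce it
first_value = next(doubled)
calls_after_action = len(executed)

print(calls_before_action, first_value, calls_after_action)
```

No element is processed until a value is requested, and then only as many as are needed. Spark's DAG is far more elaborate than a generator, but the idea of deferring work until a result is required is the same.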

In [3]:
#Create an RDD from Python collection of numbers

#Create a Python collection
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

#Place the collection into an rdd called x_nbr_rdd using parallelize
x_nbr_rdd = sc.parallelize(x)

####Notice no return occurs with sc.parallelize()
This means sc.parallelize() didn't compute a result. Like a transformation, it is lazy: Spark only recorded how to create the RDD.

In [4]:
#Execute an Action and return the first element
x_nbr_rdd.first()

1

####Notice you get a return on .first()
This means .first() is an Action. Spark executed all transformations to compute the result of .first().

In [5]:
#Execute an Action and take the first 5 elements
x_nbr_rdd.take(5)

[1, 2, 3, 4, 5]

In [6]:
#Create an RDD from Python collection of strings

#Create a Python collection
y = ["Hello Human", "My Name is Spark"]

#Place the string value into an rdd called y_str_rdd
y_str_rdd = sc.parallelize(y)

#Return the first two values in your RDD - Action
y_str_rdd.take(2)

['Hello Human', 'My Name is Spark']

##Data Transformations
As you can see, you created the string "Hello Human", parallelized it into an RDD, and got it back as the first element. If you want to work with a corpus of words and analyze or filter individual words, you need to map each word into its own RDD element.

###Some common Transformation Functions

- <b>map(func):</b> return a new distributed dataset formed by passing each element of the source through a function func
- <b>filter(func):</b> return a new dataset formed by selecting those elements of the source on which func returns true
- <b>distinct([numTasks]):</b> return a new dataset that contains the distinct elements of the source dataset
- <b>flatMap(func):</b> similar to map, but each input item can be mapped to 0 or more output items (so func should return a sequence rather than a single item)
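To get a feel for what each of these functions produces, the sketch below mimics them on an ordinary Python list. This is only an analogy under simple assumptions: the `lines` sample data is made up, nothing here runs on a cluster, and the real RDD methods return new RDDs rather than lists.

```python
lines = ["a b", "b c", "a b"]

# map: exactly one output per input element
mapped = [line.upper() for line in lines]

# filter: keep only the elements for which the test is true
filtered = [line for line in lines if "a" in line]

# distinct: drop duplicates (sorted here only to get a stable order)
distinct = sorted(set(lines))

# flatMap: each input may yield zero or more outputs, flattened into one list
flat_mapped = [word for line in lines for word in line.split(" ")]

print(mapped)
print(filtered)
print(distinct)
print(flat_mapped)
```

Note how map keeps three elements while the flatMap analogue produces six, one per word.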

In [7]:
#Create a Python list named Words and an RDD named words_rd
Words = ["Hello Humman. I'm Apache Spark and I love running analysis on data."]
words_rd = sc.parallelize(Words)
words_rd.first()

"Hello Humman. I'm Apache Spark and I love running analysis on data."

###Review: Python lambda Functions

- Small anonymous functions (not bound to a name) 

- <b>lambda a , b : a + b</b> returns the sum of its two arguments

- Can use lambda functions wherever function objects are required
- Restricted to a single expression
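The points above can be tried out in plain Python without Spark. In this small sketch the names `add`, `add_named`, `words`, and `by_length` are illustrative only.

```python
# A lambda bound to a name, and the equivalent named function
add = lambda a, b: a + b

def add_named(a, b):
    return a + b

# A lambda passed where a function object is required (the sort key)
words = ["Spark", "I", "Apache"]
by_length = sorted(words, key=lambda w: len(w))

print(add(2, 3), add_named(2, 3))
print(by_length)
```

Spark's map, filter and flatMap accept function objects in the same way that `sorted` accepts its `key` argument.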

In [8]:
#map returns one list of words per input line, so the result is nested
Words_rd2 = words_rd.map(lambda line: line.split(" "))
Words_rd2.first()

['Hello',
 'Humman.',
 "I'm",
 'Apache',
 'Spark',
 'and',
 'I',
 'love',
 'running',
 'analysis',
 'on',
 'data.']

In [9]:
#Transform RDD Words into new RDD split on Space character.
words_rd2 = words_rd.flatMap(lambda line: line.split(" "))
words_rd2.take(3)

['Hello', 'Humman.', "I'm"]

###Filter Function
The filter command creates a new RDD from another RDD based on a filter criterion.

filter syntax is .filter(lambda line: "Filter Criteria Value" in line)

Hint: Use a simple Python print command to add a string to your Spark results and to run multiple actions in a single cell.

In [10]:
#Count number of "Hello" words

#Create a new RDD words_rd3 for all "Hello" words in corpus of words 
words_rd3 = words_rd2.filter(lambda line: "Hello" in line)

#Print the matched word and the count of values in the new RDD, which represents the number of "Hello" words in the corpus
print("The count of words " + str(words_rd3.first()))
print("Is: " + str(words_rd3.count()))

The count of words Hello
Is: 1
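One detail worth knowing: the filter in the cell above uses Python's `in` operator on strings, which is a substring test rather than an exact word match. The plain-Python sketch below shows the difference. The `words` list is made-up sample data, not taken from the corpus.

```python
words = ["Hello", "Hello,", "Hellos", "hello", "Spark"]

# Substring test, as used in the filter above
contains_hello = [w for w in words if "Hello" in w]

# Exact match, for comparison
equals_hello = [w for w in words if w == "Hello"]

print(contains_hello)
print(equals_hello)
```

If you need exact matches in Spark, the same comparison can be used inside the lambda passed to filter.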


###Computations
You can use Python and Spark functions to perform basic analytics on your data. Let's investigate how to sum a couple of elements in a string.

In [11]:
#Create a Python list holding one comma-separated string of numbers
X = ["1,2,3,4,5,6,7,8,9,10"]

#Note: The numbers are inside one quoted string, so they form a single element.
#Create an RDD
y_rd = sc.parallelize(X)

#Add the 3rd and 10th values (3 + 10)
Sum_rd = y_rd.map(lambda y: y.split(",")).\
map(lambda y: (int(y[2])+int(y[9])))

#Note: Notice \ to break the line for code clarity
#Note: Notice we used elements 2 and 9 because list indexes start at 0
#Return Sum Value
Sum_rd.first()

13
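For comparison, here is what the two map steps do to a single element, written as plain Python without Spark. The variable names are illustrative, and the extra `total` line is an addition that sums every field rather than just two.

```python
row = "1,2,3,4,5,6,7,8,9,10"

# First map step: split the string into a list of string fields
fields = row.split(",")

# Second map step: convert the 3rd and 10th fields to int and add them
third_plus_tenth = int(fields[2]) + int(fields[9])

# Summing every field instead of just two
total = sum(int(f) for f in fields)

print(third_plus_tenth, total)
```

The `int()` conversion matters: without it, `+` would concatenate the two strings instead of adding the numbers.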

##Creating an RDD from a data file

Apache Spark can access many data sources (files, HDFS, APIs, relational data sources, etc.). These files need to be accessible by your Spark cluster.

We will use wget to pull the Apache Spark README.md file into your fully managed Spark cluster.

In [12]:
!rm README.md* -f
!wget https://github.com/carloapp2/SparkPOT/blob/master/README.md

--2016-05-16 16:11:19--  https://github.com/carloapp2/SparkPOT/blob/master/README.md
Resolving github.com (github.com)... 192.30.252.122
Connecting to github.com (github.com)|192.30.252.122|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'README.md'

    [ <=>                                   ] 40,242      --.-K/s   in 0.1s    

2016-05-16 16:11:19 (369 KB/s) - 'README.md' saved [40242]



###Use SparkContext textFile to convert a text file to an RDD.

**NOTE**<BR>
The file used is the "README.md" 

In [13]:
#Put Data file into RDD
textfile_rdd = sc.textFile("README.md")
textfile_rdd.count()

596

In [14]:
#Create a new RDD of all lines that contain "Spark" in the text file
Spark_rdd = textfile_rdd.filter(lambda line: "Spark" in line)

#Print the count of elements in the new RDD
#Note: count() returns the number of lines containing "Spark"
print("The file README.md has the word SPARK in it " + str(Spark_rdd.count()) + ' Times.')

The file README.md has the word SPARK in it 49 Times.
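Keep in mind that filter followed by count() counts matching lines, not individual occurrences of the word. A line that mentions "Spark" twice is counted once. The plain-Python sketch below shows the difference on a few made-up sample lines. They are not taken from README.md.

```python
sample_lines = [
    "Spark is fast. Spark is general-purpose.",
    "This line does not mention it.",
    "PySpark is the Python API for Spark.",
]

# Number of lines that contain the substring (what filter + count computes)
matching_lines = sum(1 for line in sample_lines if "Spark" in line)

# Total number of occurrences of the substring across all lines
total_occurrences = sum(line.count("Spark") for line in sample_lines)

print(matching_lines, total_occurrences)
```

To count occurrences in Spark you could, for instance, split lines into words with flatMap before filtering, as was done with words_rd2 earlier in this notebook.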
