# Lab 1 - Hello Spark
This lab introduces you to Apache Spark.  It is written in Python and runs in IBM's Data Science Experience environment through a Jupyter notebook.  While you work, keep the [Apache Spark Documentation](http://spark.apache.org/docs/latest/programming-guide.html) open for reference.

## Step 1 - Working with Spark Context
Step 1.1 - Invoke the Spark context, sc.  Its version attribute returns the running version of Apache Spark.<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp;sc.version

In [1]:
#Step 1 - Check spark version
sc.version

u'1.6.0'

## Step 2 - Working with Resilient Distributed Datasets

Step 2.1 - Create an RDD with numbers 1 to 10

Type: <br>
&nbsp;&nbsp;&nbsp;&nbsp;x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]<br>
&nbsp;&nbsp;&nbsp;&nbsp;x_nbr_rdd = sc.parallelize(x)<br>

In [2]:
#Step 2.1 - Create RDD of numbers 1-10
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  #One could also write x = range(1,11)
x_nbr_rdd = sc.parallelize(x)

Step 2.2 - Return the first element<br><br>
Type: <br>
&nbsp;&nbsp;&nbsp;&nbsp;x_nbr_rdd.first()

In [3]:
#Step 2.2 - Return first element
x_nbr_rdd.first()

1

Step 2.3 - Return an array of the first five elements<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp;x_nbr_rdd.take(5)

In [4]:
#Step 2.3 - Return an array of the first five elements
x_nbr_rdd.take(5)

[1, 2, 3, 4, 5]

Step 2.4 - Perform a map transformation to increment each element of the RDD.  The map function creates a new RDD by applying the function passed as its argument to each element.  For more information, see [Transformations](http://spark.apache.org/docs/latest/programming-guide.html#transformations).<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp;x_nbr_rdd_2 = x_nbr_rdd.map(lambda x: x+1)

In [5]:
#Step 2.4 - Write your map function
x_nbr_rdd_2=x_nbr_rdd.map(lambda x: x+1)  #It's not required to be x.  (lambda a: a+1) would also work

Step 2.5 - Note that there was no result for step 2.4.  Why was this?  Take a look at all the elements of the new RDD.<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp; x_nbr_rdd_2.collect()   

In [6]:
#Step 2.5 - Check out the elements of the new RDD. Warning: Be careful with this in real life! Collect returns everything!
x_nbr_rdd_2.collect()

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
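
Step 2.4 printed nothing because map is a transformation: Spark records it and defers the work until an action such as collect() asks for results. The sketch below imitates that behavior locally with a plain Python generator. It does not use Spark, and the names `increment` and `calls` are illustrative assumptions, not part of the lab.

```python
# Local analogy for lazy evaluation (plain Python, no Spark required).
calls = []  # records every time the function actually runs

def increment(n):
    calls.append(n)
    return n + 1

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Like a transformation: describes the work but does not run it yet.
lazy_result = (increment(n) for n in numbers)
calls_before = len(calls)  # nothing has been computed so far

# Like an action: consuming the generator forces the computation.
materialized = list(lazy_result)
calls_after = len(calls)

print("calls before: %d, calls after: %d" % (calls_before, calls_after))
print(materialized)
```

The function runs only when the results are requested, which is roughly why the map cell produced no output while the collect cell did.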

Step 2.6 - Create a new RDD with one string "Hello Spark" and print it.<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp; y = ['Hello Spark']<br>
&nbsp;&nbsp;&nbsp;&nbsp; y_str_rdd = sc.parallelize(y)<br>
&nbsp;&nbsp;&nbsp;&nbsp; y_str_rdd.first()<br>

In [7]:
#Step 2.6 - Create String RDD, Extract first line
y = ['Hello Spark'] #Remember that parallelize takes an iterable such as a list.  If you provide only a string, it will iterate through the characters of the string
y_str_rdd=sc.parallelize(y)
y_str_rdd.first()

'Hello Spark'
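
The comment in step 2.6 about passing a bare string can be checked without Spark, since it follows from how Python iterates over lists and strings. This is a minimal local sketch; the variable names are illustrative assumptions.

```python
# Plain-Python illustration of what gets iterated (no Spark required).
as_list = ['Hello Spark']    # one element: the whole string
as_string = 'Hello Spark'    # iterating a string yields its characters

elements_from_list = list(as_list)
elements_from_string = list(as_string)

print(len(elements_from_list))     # number of elements when wrapped in a list
print(len(elements_from_string))   # number of elements when passed as a bare string
print(elements_from_string[:5])
```

Wrapping the string in a list keeps it as a single entry, which is what the step intends.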

Step 2.7 - Create a third RDD with several strings.<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp; z = ['IBM Data Science Experience is built for enterprise-scale deployment.', "Manage your data, your analytical assets, and your projects in a secured cloud environment." , "When you create an account in the IBM Data Science Experience, we deploy for you a Spark as a Service instance to power your analysis and 5 GB of IBM Object Storage to store your data."]<br>
&nbsp;&nbsp;&nbsp;&nbsp; z_str_rdd = sc.parallelize(z)<br>
&nbsp;&nbsp;&nbsp;&nbsp; z_str_rdd.first()

In [8]:
#Step 2.7 - Create String RDD with many lines / entries, Extract first line
z = ['IBM Data Science Experience is built for enterprise-scale deployment.', "Manage your data, your analytical assets, and your projects in a secured cloud environment.", "When you create an account in the IBM Data Science Experience, we deploy for you a Spark as a Service instance to power your analysis and 5 GB of IBM Object Storage to store your data."]
z_str_rdd = sc.parallelize(z)
z_str_rdd.first()

'IBM Data Science Experience is built for enterprise-scale deployment.'

Step 2.8 - Count the number of entries in this RDD.<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp; z_str_rdd.count()

In [9]:
#Step 2.8 - Count the number of entries in the RDD
z_str_rdd.count()

3

Step 2.9 - Inspect the elements of this RDD.<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp;z_str_rdd.collect()

In [10]:
#Step 2.9 - Show all the entries in the RDD
z_str_rdd.collect()

['IBM Data Science Experience is built for enterprise-scale deployment.',
 'Manage your data, your analytical assets, and your projects in a secured cloud environment.',
 'When you create an account in the IBM Data Science Experience, we deploy for you a Spark as a Service instance to power your analysis and 5 GB of IBM Object Storage to store your data.']

Step 2.10 - Split all the entries in the RDD on the spaces.  Then print it out.  Pay careful attention to the new format.<br><br>
Type: <br>
&nbsp;&nbsp;&nbsp;&nbsp;z_str_rdd_split = z_str_rdd.map(lambda line: line.split(" "))<br>
&nbsp;&nbsp;&nbsp;&nbsp;z_str_rdd_split.collect()

In [11]:
#Step 2.10 - Perform a map transformation to split all entries in the RDD
#Check out the entries in the new RDD
z_str_rdd_split = z_str_rdd.map(lambda line: line.split(" "))
z_str_rdd_split.collect()

[['IBM',
  'Data',
  'Science',
  'Experience',
  'is',
  'built',
  'for',
  'enterprise-scale',
  'deployment.'],
 ['Manage',
  'your',
  'data,',
  'your',
  'analytical',
  'assets,',
  'and',
  'your',
  'projects',
  'in',
  'a',
  'secured',
  'cloud',
  'environment.'],
 ['When',
  'you',
  'create',
  'an',
  'account',
  'in',
  'the',
  'IBM',
  'Data',
  'Science',
  'Experience,',
  'we',
  'deploy',
  'for',
  'you',
  'a',
  'Spark',
  'as',
  'a',
  'Service',
  'instance',
  'to',
  'power',
  'your',
  'analysis',
  'and',
  '5',
  'GB',
  'of',
  'IBM',
  'Object',
  'Storage',
  'to',
  'store',
  'your',
  'data.']]

Step 2.11 - Explore a new transformation: flatMap <br>
flatMap applies the function to each element of the RDD and then "flattens" the results, so each input element produces 0 or more output elements.<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp; z_str_rdd_split_flatmap = z_str_rdd.flatMap(lambda line: line.split(" "))<br>
&nbsp;&nbsp;&nbsp;&nbsp; z_str_rdd_split_flatmap.collect()

In [12]:
#Step 2.11 - Learn the difference between two transformations: map and flatMap.
z_str_rdd_split_flatmap = z_str_rdd.flatMap(lambda line: line.split(" "))
z_str_rdd_split_flatmap.collect()


['IBM',
 'Data',
 'Science',
 'Experience',
 'is',
 'built',
 'for',
 'enterprise-scale',
 'deployment.',
 'Manage',
 'your',
 'data,',
 'your',
 'analytical',
 'assets,',
 'and',
 'your',
 'projects',
 'in',
 'a',
 'secured',
 'cloud',
 'environment.',
 'When',
 'you',
 'create',
 'an',
 'account',
 'in',
 'the',
 'IBM',
 'Data',
 'Science',
 'Experience,',
 'we',
 'deploy',
 'for',
 'you',
 'a',
 'Spark',
 'as',
 'a',
 'Service',
 'instance',
 'to',
 'power',
 'your',
 'analysis',
 'and',
 '5',
 'GB',
 'of',
 'IBM',
 'Object',
 'Storage',
 'to',
 'store',
 'your',
 'data.']
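
The difference between the nested output of step 2.10 and the flat output of step 2.11 can be reproduced locally. The sketch below uses plain Python lists and `itertools.chain` as stand-ins for map and flatMap. It is an analogy rather than Spark code, and the two sample lines are made up for illustration.

```python
from itertools import chain

# Hypothetical sample lines, not taken from the lab data.
lines = ["Hello Spark", "Spark is fast"]

# map-like: one output per input, so each line becomes its own list of words.
mapped = [line.split(" ") for line in lines]

# flatMap-like: same function, but the per-line lists are concatenated.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

The map-like result keeps one entry per line, while the flatMap-like result has one entry per word, which is the shape needed for counting words.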

Step 2.12 - Augment each entry in the previous RDD with the number 1 to create pairs (tuples).  The first element of each tuple is the word and the second element is the number 1.  This is a common step in performing a count.<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp; countWords = z_str_rdd_split_flatmap.map(lambda word:(word,1))<br>
&nbsp;&nbsp;&nbsp;&nbsp; countWords.collect()

In [13]:
#Step 2.12 - Create pairs or tuple RDD and print it.
countWords = z_str_rdd_split_flatmap.map(lambda word:(word,1))
countWords.collect()

[('IBM', 1),
 ('Data', 1),
 ('Science', 1),
 ('Experience', 1),
 ('is', 1),
 ('built', 1),
 ('for', 1),
 ('enterprise-scale', 1),
 ('deployment.', 1),
 ('Manage', 1),
 ('your', 1),
 ('data,', 1),
 ('your', 1),
 ('analytical', 1),
 ('assets,', 1),
 ('and', 1),
 ('your', 1),
 ('projects', 1),
 ('in', 1),
 ('a', 1),
 ('secured', 1),
 ('cloud', 1),
 ('environment.', 1),
 ('When', 1),
 ('you', 1),
 ('create', 1),
 ('an', 1),
 ('account', 1),
 ('in', 1),
 ('the', 1),
 ('IBM', 1),
 ('Data', 1),
 ('Science', 1),
 ('Experience,', 1),
 ('we', 1),
 ('deploy', 1),
 ('for', 1),
 ('you', 1),
 ('a', 1),
 ('Spark', 1),
 ('as', 1),
 ('a', 1),
 ('Service', 1),
 ('instance', 1),
 ('to', 1),
 ('power', 1),
 ('your', 1),
 ('analysis', 1),
 ('and', 1),
 ('5', 1),
 ('GB', 1),
 ('of', 1),
 ('IBM', 1),
 ('Object', 1),
 ('Storage', 1),
 ('to', 1),
 ('store', 1),
 ('your', 1),
 ('data.', 1)]

Step 2.13 - We now have what is known as a [Pair RDD](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions). Each entry in the RDD has a KEY and a VALUE.<br>
The KEY is the word (IBM, Data, Science, ...) and the VALUE is the number 1.<br>
We can now AGGREGATE this RDD by summing all the values BY KEY.<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp;countWords2 = countWords.reduceByKey(lambda x,y: x+y)<br>
&nbsp;&nbsp;&nbsp;&nbsp;countWords2.collect()<br>

In [14]:
#Step 2.13 - Check out the results of the aggregation
countWords2 = countWords.reduceByKey(lambda x,y: x+y)
countWords2.collect()

[('and', 2),
 ('enterprise-scale', 1),
 ('Service', 1),
 ('is', 1),
 ('Storage', 1),
 ('assets,', 1),
 ('as', 1),
 ('cloud', 1),
 ('for', 2),
 ('secured', 1),
 ('5', 1),
 ('you', 2),
 ('Data', 2),
 ('store', 1),
 ('we', 1),
 ('power', 1),
 ('When', 1),
 ('Experience', 1),
 ('Spark', 1),
 ('projects', 1),
 ('a', 3),
 ('account', 1),
 ('analysis', 1),
 ('deployment.', 1),
 ('analytical', 1),
 ('the', 1),
 ('create', 1),
 ('data,', 1),
 ('data.', 1),
 ('IBM', 3),
 ('instance', 1),
 ('Science', 2),
 ('Experience,', 1),
 ('Manage', 1),
 ('an', 1),
 ('to', 2),
 ('environment.', 1),
 ('Object', 1),
 ('GB', 1),
 ('deploy', 1),
 ('in', 2),
 ('of', 1),
 ('your', 5),
 ('built', 1)]
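
To see what the aggregation in step 2.13 does to each key, here is a small local sketch. `reduce_by_key` is a hypothetical helper written for this illustration; it is not Spark's implementation, which combines values across partitions, but it applies the same two-argument function to the values of each key.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func (hypothetical local helper)."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            totals[key] = func(totals[key], value)
        else:
            totals[key] = value
    return totals

# A fragment of the second sentence from step 2.7, split on spaces.
words = "your data your analytical assets and your projects".split(" ")
pairs = [(word, 1) for word in words]

counts = reduce_by_key(pairs, lambda x, y: x + y)
print(sorted(counts.items()))
```

The result is sorted here only to make it easy to read. The output of collect() above is not in input order because Spark does not guarantee an ordering after reduceByKey.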

## Step 3 - Reading a file and counting words

Step 3.1 - Download the Apache Spark README.md file from GitHub.  The ! prefix runs a shell command from the notebook.<br><br>
Type:<br>

&nbsp;&nbsp;&nbsp;&nbsp;!rm README.md* -f<br>
&nbsp;&nbsp;&nbsp;&nbsp;!wget https://raw.githubusercontent.com/apache/spark/master/README.md<br>


In [15]:
#Step 3.1 - Pull data file into workbench

!rm README.md* -f
!wget https://raw.githubusercontent.com/apache/spark/master/README.md

--2016-09-15 18:24:20--  https://raw.githubusercontent.com/apache/spark/master/README.md
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3828 (3.7K) [text/plain]
Saving to: 'README.md'


2016-09-15 18:24:20 (42.8 MB/s) - 'README.md' saved [3828/3828]



Step 3.2 - Create an RDD by reading the file from the local filesystem.  See the [textFile documentation](http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=textfile#pyspark.SparkContext.textFile).  Print the count to check that the read was successful.<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp;textfile_rdd = sc.textFile("README.md")<br>
&nbsp;&nbsp;&nbsp;&nbsp;textfile_rdd.count()<br>

In [16]:
#Step 3.2 - Create RDD from data file
textfile_rdd = sc.textFile("README.md")
textfile_rdd.count() #You should see 99 lines (the count may differ if the README has changed since this run)

99

Step 3.3 - Keep only the lines that contain "Spark".  This is done with the [filter](http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=filter#pyspark.RDD.filter) transformation.  Python's 'in' operator tests whether a string contains a substring.<br>
We will also take a look at the first line in the newly filtered RDD.<br><br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp;Spark_lines = textfile_rdd.filter(lambda line: "Spark" in line)<br>
&nbsp;&nbsp;&nbsp;&nbsp;Spark_lines.first()<br>

In [17]:
#Step 3.3 - Filter for only lines with word Spark
Spark_lines = textfile_rdd.filter(lambda line: "Spark" in line)
Spark_lines.first()

u'# Apache Spark'

Step 3.4 - Count the number of entries in this filtered RDD and print the result as a concatenated string.<br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp;print("The file README.md has " + str(Spark_lines.count()) +<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" of " + str(textfile_rdd.count()) +<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" lines with the word Spark in it.")<br>

In [18]:
#Step 3.4 - Count the number of lines
print("The file README.md has " + str(Spark_lines.count()) +
      " of " + str(textfile_rdd.count()) +
      " lines with the word Spark in it.")

The file README.md has 19 of 99 lines with the word Spark in it.


Step 3.5 - Now count the number of times the word Spark appears in the original text, not just the number of lines that contain it.  <br>
Instructions:<br>
Looking back at previous exercises, you will need to: <br>
1 - Execute a flatMap transformation on the filtered RDD Spark_lines and split on spaces.<br>
2 - Keep only the tokens that are exactly the word Spark<br>
3 - Count all instances<br>
4 - Print the total count

In [19]:
#Step 3.5 - Count the number of tokens that are exactly "Spark"
spark_flat=Spark_lines.flatMap(lambda x: x.split(' '))
spark_filtered = spark_flat.filter(lambda word: word == "Spark")
print(spark_filtered.count())
#Why is this count different? It doesn't include cases where Spark is not its own token, such as apache.spark, because we split only on whitespace.


15
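
The gap between the line count in step 3.4 and the token count in step 3.5 comes from what each filter matches. The sketch below shows the three ways of counting on a few made-up lines. These are hypothetical examples, not the actual README contents, and the code is plain Python rather than Spark.

```python
# Hypothetical sample lines for illustration only.
lines = [
    "# Apache Spark",
    "Spark is a fast engine, and Spark, unlike others, is general.",
    "Run ./bin/run-example SparkPi",
    "This line does not mention it.",
]

# Lines that contain the substring "Spark" (the kind of count in step 3.4).
matching_lines = [line for line in lines if "Spark" in line]

# Space-separated tokens that are exactly "Spark" (the kind of count in step 3.5).
tokens = [token for line in matching_lines for token in line.split(" ")]
exact_tokens = [token for token in tokens if token == "Spark"]

# Every occurrence of the substring, including "Spark," and "SparkPi".
substring_hits = sum(line.count("Spark") for line in lines)

print(len(matching_lines))
print(len(exact_tokens))
print(substring_hits)
```

An exact-match filter misses tokens with attached punctuation or longer names, so the three counts can all differ on the same text.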


## Step 4 - Perform analysis on a data file
This part is a little more open-ended, and there are a few ways to complete it.  Scroll up to previous examples for some guidance.  You will download a data file, transform the data, and then average the prices.  The data file is a sample of tech stock prices over six days. <br>

Data Location: https://raw.githubusercontent.com/JosephKambourakisIBM/SparkPoT/master/StockPrices.csv<br>
The data file is a CSV.<br>
Here is a sample of the file:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"IBM","159.720001" ,"159.399994" ,"158.880005","159.539993", "159.550003", "160.350006"

In [20]:
#Step 4.1 - Delete the file if it exists, download a new copy and load it into an RDD
!rm -f StockPrices.csv*
!wget https://raw.githubusercontent.com/JosephKambourakisIBM/SparkPoT/master/StockPrices.csv
SP = sc.textFile('StockPrices.csv')

--2016-09-15 18:24:21--  https://raw.githubusercontent.com/JosephKambourakisIBM/SparkPoT/master/StockPrices.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 244 [text/plain]
Saving to: 'StockPrices.csv'


2016-09-15 18:24:21 (73.2 MB/s) - 'StockPrices.csv' saved [244/244]



In [21]:
#Step 4.2 - Transform the data to extract the stock ticker symbol and the first five prices.
SP_mapped = SP.map(lambda line: line.split(','))
SP_Keyed = SP_mapped.map(lambda x: (x[0], float(x[1]), float(x[2]), float(x[3]), float(x[4]), float(x[5])))
SP_Keyed.collect()

[(u'IBM', 159.720001, 159.399994, 158.880005, 159.539993, 159.550003),
 (u'MSFT', 58.099998, 57.889999, 57.459999, 57.59, 57.669998),
 (u'AAPL', 106.82, 106.0, 106.099998, 106.730003, 107.730003),
 (u'ORCL', 41.310001, 41.310001, 41.220001, 41.16, 41.25)]

In [22]:
#Step 4.3 - Compute the average of the five extracted prices for each symbol and print the results.
SP_Mean = SP_Keyed.map(lambda x: (x[0], (x[1]+x[2]+ x[3]+ x[4]+x[5])/5)) #There are other ways to do this such as importing a mean function from numpy
SP_Mean.collect()

#Using a mean function
#import numpy
#SP_Mean = SP_Keyed.map(lambda x: (x[0], numpy.mean([x[1],x[2], x[3], x[4], x[5]]))) 
#SP_Mean.collect()


[(u'IBM', 159.4179992),
 (u'MSFT', 57.7419988),
 (u'AAPL', 106.6760008),
 (u'ORCL', 41.2500006)]
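
Step 4.3 hard-codes five price positions. One alternative, sketched here under the assumption that every field after the symbol parses as a float, is a function that averages however many price fields a row contains. `mean_by_symbol` is a hypothetical helper, and the sample row is built from the five IBM prices shown in the step 4.2 output.

```python
def mean_by_symbol(row):
    """Return (symbol, mean of all price fields) for one parsed CSV row."""
    symbol = row[0]
    prices = [float(field) for field in row[1:]]
    return (symbol, sum(prices) / len(prices))

# One row as it would look after line.split(','), using the five IBM prices above.
sample_row = "IBM,159.720001,159.399994,158.880005,159.539993,159.550003".split(",")

symbol, mean_price = mean_by_symbol(sample_row)
print(symbol)
print(round(mean_price, 4))
```

In the lab, a function like this could be passed to map in place of the lambda, for example `SP_mapped.map(mean_by_symbol)`; that call is not run here. Note that it averages every price field in the row, whereas step 4.3 averages only the five prices it extracted.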