# PySpark Introduction #

At a high level, a Spark job follows these steps:

* The dataset is loaded and partitioned across the cluster.
* A mapping operation is applied to each partition in the distributed environment (`map()`).
* The per-partition results are combined to produce the final result (`reduce()`).
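
The three steps above can be mimicked in plain Python to make the data flow concrete. This is only a single-machine analogy, not Spark code: `split_into_partitions` is a hypothetical helper, and the choice of 4 partitions and the squaring function are assumptions made for illustration.

```python
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks (hypothetical helper)."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


numbers = list(range(10))

# Step 1: load the dataset and partition it
partitions = split_into_partitions(numbers, 4)

# Step 2: map each partition independently (a cluster would do this in parallel)
mapped = [[x ** 2 for x in part] for part in partitions]

# Step 3: reduce inside each partition, then combine the partial results
partial_sums = [reduce(lambda a, b: a + b, part, 0) for part in mapped]
total = reduce(lambda a, b: a + b, partial_sums, 0)

print(partitions)  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
print(total)       # 285
```

Because addition is associative, reducing each partition first and then combining the partial sums gives the same answer as reducing the whole list at once.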

### Get spark configuration ###

In [1]:
# Spark config
sc._conf.getAll()

[(u'spark.rdd.compress', u'True'),
 (u'spark.master', u'yarn-client'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.yarn.isPython', u'true'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.executor.cores', u'2'),
 (u'spark.app.name', u'PySparkShell')]

### Resilient Distributed Dataset ###

In [2]:
# Create a Resilient Distributed Dataset (RDD)
numbers = range(10)
numbers_rdd = sc.parallelize(numbers)
# Evaluating the RDD does not print its contents, because the data is split across multiple partitions.
# The default number of partitions comes from spark.default.parallelism, which depends on the cluster
# manager and the available cores. In our case each executor has 2 cores and the RDD has 4 partitions.
numbers_rdd

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:423

In [3]:
# Print all the values of the RDD object
numbers_rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [5]:
# Print a specific number of values from RDD
numbers_rdd.take(3)

[0, 1, 2]

In [6]:
# Read a text file from HDFS and print its first line (each newline-delimited line becomes one RDD element)
sc.textFile("hdfs:///datasets/hadoop_git_readme.txt").first()

u'For the latest information about Hadoop, please visit our website at:'

In [7]:
# Read in a text file from the local filesystem
sc.textFile("file:///home/vagrant/datasets/hadoop_git_readme.txt").first()

u'For the latest information about Hadoop, please visit our website at:'

In [8]:
# Save the content of the RDD in HDFS
numbers_rdd.saveAsTextFile("hdfs:///tmp/numbers_1_10.txt")

In [9]:
# List the contents of the output directory that the RDD was saved to
!hdfs dfs -ls /tmp/numbers_1_10.txt
'''
Spark writes one file per partition, just as MapReduce writes one file per reducer.
'''

Found 5 items
-rw-r--r--   1 vagrant supergroup          0 2016-10-06 21:44 /tmp/numbers_1_10.txt/_SUCCESS
-rw-r--r--   1 vagrant supergroup          4 2016-10-06 21:44 /tmp/numbers_1_10.txt/part-00000
-rw-r--r--   1 vagrant supergroup          4 2016-10-06 21:44 /tmp/numbers_1_10.txt/part-00001
-rw-r--r--   1 vagrant supergroup          4 2016-10-06 21:44 /tmp/numbers_1_10.txt/part-00002
-rw-r--r--   1 vagrant supergroup          8 2016-10-06 21:44 /tmp/numbers_1_10.txt/part-00003


In [10]:
'''
By using coalesce() we can reduce the number of partitions of an RDD to a number we specify.
'''
# Collapse the RDD into a single partition so that saving it produces one output file.
numbers_rdd.coalesce(1).saveAsTextFile("hdfs:///tmp/numbers_1_10_one_file.txt")

In [11]:
# List the contents of the output directory that the RDD was saved to
!hdfs dfs -ls /tmp/numbers_1_10_one_file.txt
'''
The data was written from a single partition, so there is only one part file.
'''

Found 2 items
-rw-r--r--   1 vagrant supergroup          0 2016-10-06 21:49 /tmp/numbers_1_10_one_file.txt/_SUCCESS
-rw-r--r--   1 vagrant supergroup         20 2016-10-06 21:49 /tmp/numbers_1_10_one_file.txt/part-00000
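
To make the effect of `coalesce` more tangible, here is a toy model in plain Python. It is not how Spark implements `coalesce`; the `coalesce` function below and the partition layout are assumptions for illustration. It captures two properties of the real method when called without a shuffle: whole partitions are merged rather than split, and the partition count is never increased.

```python
def coalesce(partitions, num_partitions):
    """Toy model: merge existing partitions into at most num_partitions groups."""
    if num_partitions >= len(partitions):
        # Asking for more partitions than exist leaves the layout unchanged
        return [list(p) for p in partitions]
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # Map each source partition to a target index, keeping neighbours together
        merged[i * num_partitions // len(partitions)].extend(part)
    return merged


# Assumed layout of the ten numbers across four partitions
partitions = [[0, 1], [2, 3], [4, 5], [6, 7, 8, 9]]

one = coalesce(partitions, 1)
two = coalesce(partitions, 2)
eight = coalesce(partitions, 8)

print(one)         # [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
print(two)         # [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(len(eight))  # 4
```

To increase the number of partitions in Spark, `repartition(numPartitions)` is the usual choice, since it performs a full shuffle.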


### Example of transformations and actions to show Lazy behavior ###

In [12]:
'''
Our objective in this example is to square the values contained in an RDD and then sum them up.

Mapping: This is done using map(), a transformation, which is lazily evaluated.
Reducing: This is done using reduce(), an action, which triggers the computation and is therefore not lazy.
'''
# Step 1: Mapping
# Define the function that returns the square of its input argument
def sq(x):
    return x**2

In [13]:
# Get the square of each element in the RDD using map()
# map() is lazy and only returns a new RDD; an action such as collect() runs the computation and returns the values
numbers_rdd.map(sq).collect()

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [14]:
# Do the same using a lambda function
numbers_rdd.map(lambda x: x**2).collect()

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [15]:
# Step 2: Reducing using reduce()
numbers_rdd.map(lambda x: x**2).reduce(lambda a,b: a+b)

285
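
The lazy behavior can be observed without a cluster by using a Python generator as a stand-in for a transformation. This is an analogy only, not Spark code: the `calls` list is a hypothetical counter added to show when `sq` actually runs, and `sum()` plays the role of the action.

```python
calls = []


def sq(x):
    calls.append(x)  # record each invocation so we can see when work happens
    return x ** 2


numbers = range(10)

# "Transformation": the generator only describes the work; sq has not run yet
squared = (sq(x) for x in numbers)
calls_before_action = len(calls)

# "Action": consuming the generator forces sq to run on every element
total = sum(squared)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, total)  # 0 10 285
```

No call to `sq` is recorded until the result is actually requested, which mirrors how `map()` on an RDD does nothing until an action such as `collect()` or `reduce()` is invoked.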

### Example introducing key-value pairs ###

In [16]:
'''
Our objective in this example is to find the sums of the odd and even numbers in our RDD separately.
'''
# Step 1: Define the function
def tag(x):
    return 'even' if x%2 == 0 else 'odd'

In [17]:
# Build (key, value) pairs by tagging each number as 'even' or 'odd'
numbers_rdd.map(lambda x: (tag(x), x) ).collect()

[('even', 0),
 ('odd', 1),
 ('even', 2),
 ('odd', 3),
 ('even', 4),
 ('odd', 5),
 ('even', 6),
 ('odd', 7),
 ('even', 8),
 ('odd', 9)]

In [19]:
'''
To get the sums of the odd and even numbers, we use reduceByKey(fxn).

It has 2 steps:
- Groups the values of the input RDD by key.
- Applies the reduce fxn to the values of each group.
'''
numbers_rdd.map(lambda x: (tag(x), x) ) \
           .reduceByKey(lambda a,b: a + b) \
           .collect()

[('even', 20), ('odd', 25)]
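
The two steps of `reduceByKey` can be sketched in plain Python as follows. This is a single-machine illustration, not Spark's implementation: `reduce_by_key` is a hypothetical helper, and it ignores partitioning and shuffling entirely.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, fxn):
    """Hypothetical helper: group values by key, then reduce each group."""
    # Step 1: aggregate the values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Step 2: apply the reduce function to the values of each group
    return {key: reduce(fxn, values) for key, values in groups.items()}


def tag(x):
    return 'even' if x % 2 == 0 else 'odd'


pairs = [(tag(x), x) for x in range(10)]
result = reduce_by_key(pairs, lambda a, b: a + b)

print(sorted(result.items()))  # [('even', 20), ('odd', 25)]
```

On a cluster, Spark also combines values within each partition before shuffling them, which is why the reduce function should be associative and commutative.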

## Appendix ##

Some useful transformations in Spark are:

* map(fxn)
* flatMap(fxn)
* filter(fxn)
* sample(withReplacement, fraction, seed)
* distinct()
* coalesce(numPartitions)
* repartition(numPartitions)
* groupByKey()
* reduceByKey(fxn)
* sortByKey(ascending)
* union(otherRDD)
* intersection(otherRDD)
* join(otherRDD) [leftOuterJoin, rightOuterJoin, fullOuterJoin]
* cartesian(otherRDD)

Some useful actions in Spark are:

* reduce(fxn)
* count()
* countByKey()
* collect()
* first()
* take(n)
* takeSample(withReplacement, n, seed)
* takeOrdered(n, key)
* saveAsTextFile(path)

Useful methods that are neither transformations nor actions:

* cache()
* persist(storageLevel) [storage level can be memory, disk or both]
* unpersist()
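
As a rough illustration of why `cache()` matters, the sketch below models a dataset that recomputes its values on every action unless it has been marked as cached. `LazyDataset` and `compute_count` are hypothetical names invented for this example; the class is a plain-Python stand-in and does not reflect how Spark stores cached partitions.

```python
class LazyDataset(object):
    """Hypothetical stand-in for an RDD: recomputes on every action unless cached."""

    def __init__(self, source, fxn):
        self.source = source
        self.fxn = fxn
        self.compute_count = 0  # how many times the full computation ran
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only marks the dataset; the data is stored on the first action
        self._use_cache = True
        return self

    def unpersist(self):
        self._use_cache = False
        self._cached = None
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return list(self._cached)
        self.compute_count += 1
        result = [self.fxn(x) for x in self.source]
        if self._use_cache:
            self._cached = result
        return list(result)


uncached = LazyDataset(range(10), lambda x: x ** 2)
uncached.collect()
uncached.collect()

cached = LazyDataset(range(10), lambda x: x ** 2).cache()
cached.collect()
cached.collect()

print(uncached.compute_count, cached.compute_count)  # 2 1
```

The uncached dataset repeats the computation for each action, while the cached one computes once and reuses the stored result until `unpersist()` is called.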