# Machine Learning At Scale

Data Analytics and Machine Learning at Scale 

---
__Name:__  *Dr. James G. Shanahan*   
__Email:__  *James.Shanahan @ gmail.com*   
__Quiz:__  Secondary Sorts

# Secondary Sort in Spark (Total Sort)

__This notebook provides examples of secondary sorts in Spark:__

* Roll your own [see below]
* Via the repartitionAndSortWithinPartitions transformation [see below and the [pySpark Manual](http://spark.apache.org/docs/latest/api/python/pyspark.html)]
* DataFrames [explore by yourself]
---
__It also shows how to find the maximum value of an RDD:__
* Find the maximum value of an RDD


# Please first choose which Spark cluster backs this notebook to get your SC/sqlContext

* Back this notebook by Spark that is running on your local machine in a Container world
* Back this notebook by Spark that is running an EMR Cluster (note one has to read and write data from/to S3 to run Spark jobs on EMR)
* Back this notebook by Spark that is running on your local machine natively

### Run the next cell if you wish to launch a Spark cluster on your local machine in a Container world and back this notebook by that cluster

In [1]:
import os
import sys 
import pyspark
from pyspark.sql import SQLContext

# We can give a name to our app (to find it in Spark WebUI) and configure execution mode
# In this case, it is local multicore execution with "local[*]"
app_name = "example-logs"
master = "local[*]"
conf = pyspark.SparkConf().setAppName(app_name).setMaster(master)
sc = pyspark.SparkContext(conf=conf)
sqlContext = SQLContext(sc)
print(sc)
print(sqlContext)

<pyspark.context.SparkContext object at 0x7f94b006a750>
<pyspark.sql.context.SQLContext object at 0x7f94a25eecd0>


### Run the next cell if you wish to back this notebook by an EMR cluster that is already up and running

In [None]:
import os
import sys 
# First, we initialize the Spark environment
import findspark
findspark.init('/usr/lib/spark')

import pyspark
from pyspark.sql import SQLContext

# We can give a name to our app (to find it in Spark WebUI) and configure execution mode
# In this case, it is local multicore execution with "local[*]"
app_name = "example-logs"
master = "local[*]"
conf = pyspark.SparkConf().setAppName(app_name).setMaster(master)
sc = pyspark.SparkContext(conf=conf)
sqlContext = SQLContext(sc)
print(sc)
print(sqlContext)

### Run the next cell if you wish to launch a Spark cluster on your local machine in NATIVE mode and back this notebook by that cluster

In [1]:
import os
import sys #current as of 9/26/2015

# Point SPARK_HOME at the local Spark installation (adjust this path for your machine)
spark_home = os.environ['SPARK_HOME'] = '/Users/jshanahan/Dropbox/Lectures-UC-Berkeley-ML-Class-2015/spark-1.6.1-bin-hadoop2.6/'
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0,os.path.join(spark_home,'python'))
sys.path.insert(0,os.path.join(spark_home,'python/lib/py4j-0.9-src.zip'))

# First, we initialize the Spark environment

import findspark
#findspark.init()

import pyspark
from pyspark.sql import SQLContext

# We can give a name to our app (to find it in Spark WebUI) and configure execution mode
# In this case, it is local multicore execution with "local[*]"
app_name = "example-logs"
master = "local[*]"
conf = pyspark.SparkConf().setAppName(app_name).setMaster(master)
sc = pyspark.SparkContext(conf=conf)
sqlContext = SQLContext(sc)


print(sc)
print(sqlContext)


# Import some libraries to work with dates
import dateutil.parser
import dateutil.relativedelta as dateutil_rd

<pyspark.context.SparkContext object at 0x1050241d0>
<pyspark.sql.context.SQLContext object at 0x100575210>


# Save some data

In [2]:
%%writefile text.txt
group1 bar 1
group1 zoo 26
group1 noo 2
group1 foo 1
group3 labs 1
group2 quxx 1
group2 axxx 1
group2 st#ff 1
group3 #funky 1

Overwriting text.txt


In [3]:
def myOwnHash(x):
    return hash(x[0])
def readData(line):
    x = line.split(" ")
    return x[:2],x[2:]
text_file = sc.textFile('text.txt')
rdd = text_file.map(readData)
rdd_partitioned = rdd.partitionBy(4,myOwnHash)

In [4]:
rdd.getNumPartitions()

2

In [5]:
rdd_partitioned.getNumPartitions()

4
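
As a rough model of what partitionBy does with myOwnHash, each record is routed to partition `partitionFunc(key) % numPartitions`. The sketch below reproduces that arithmetic in plain Python, without Spark. The helper name `assign_partition` is hypothetical and is not part of the PySpark API.

```python
def my_own_hash(key):
    # Same idea as myOwnHash above: hash only the first part of the key (the group)
    return hash(key[0])

def assign_partition(key, num_partitions, partition_func):
    # Hypothetical helper: models the partition index derived for a key
    return partition_func(key) % num_partitions

keys = [['group1', 'bar'], ['group1', 'zoo'], ['group2', 'quxx'], ['group3', 'labs']]
assignments = [(k, assign_partition(k, 4, my_own_hash)) for k in keys]
for key, index in assignments:
    print(key, '->', index)
```

Records that share a group always land in the same partition. The actual indices depend on `hash()`, which for strings can differ between Python processes on Python 3 unless PYTHONHASHSEED is fixed, so the partition numbers need not match the outputs recorded in this notebook.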

## Glom in Spark RDD
glom() returns an RDD created by coalescing all elements within each partition into an array.


In [6]:
rdd.glom().collect()

[[([u'group1', u'bar'], [u'1']),
  ([u'group1', u'zoo'], [u'26']),
  ([u'group1', u'noo'], [u'2']),
  ([u'group1', u'foo'], [u'1']),
  ([u'group3', u'labs'], [u'1'])],
 [([u'group2', u'quxx'], [u'1']),
  ([u'group2', u'axxx'], [u'1']),
  ([u'group2', u'st#ff'], [u'1']),
  ([u'group3', u'#funky'], [u'1'])]]

In [7]:
# NOTE: partitionBy only routes records to partitions by the hash of the first
# part of the key (the group); records within a partition are NOT sorted
rdd_partitioned.glom().collect()

# In the output below:
# group1 is in partition 1
# group3 is in partition 3 (partition 2 is empty)
# group2 is in partition 4


[[([u'group1', u'bar'], [u'1']),
  ([u'group1', u'zoo'], [u'26']),
  ([u'group1', u'noo'], [u'2']),
  ([u'group1', u'foo'], [u'1'])],
 [],
 [([u'group3', u'labs'], [u'1']), ([u'group3', u'#funky'], [u'1'])],
 [([u'group2', u'quxx'], [u'1']),
  ([u'group2', u'axxx'], [u'1']),
  ([u'group2', u'st#ff'], [u'1'])]]

In [8]:
# NOTE: collect() returns the partitions concatenated in order; within each
# partition (aka group) the records are still in arrival order, not sorted
rdd_partitioned.collect()

[([u'group1', u'bar'], [u'1']),
 ([u'group1', u'zoo'], [u'26']),
 ([u'group1', u'noo'], [u'2']),
 ([u'group1', u'foo'], [u'1']),
 ([u'group3', u'labs'], [u'1']),
 ([u'group3', u'#funky'], [u'1']),
 ([u'group2', u'quxx'], [u'1']),
 ([u'group2', u'axxx'], [u'1']),
 ([u'group2', u'st#ff'], [u'1'])]
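
One way to "roll your own" secondary sort on top of partitionBy is to sort each partition explicitly with mapPartitions. The sketch below is a minimal version under the assumption that a single partition fits in memory. The function name `sort_partition` is hypothetical, and the check runs in plain Python on the group1 records shown above.

```python
def sort_partition(records):
    """Sort one partition's ([group, name], value) records by (group, name).

    `records` is any iterable of pairs, which is what mapPartitions
    passes to the function for each partition.
    """
    return iter(sorted(records, key=lambda kv: (kv[0][0], kv[0][1])))

# Plain-Python check using the group1 partition from the output above
partition = [(['group1', 'bar'], ['1']),
             (['group1', 'zoo'], ['26']),
             (['group1', 'noo'], ['2']),
             (['group1', 'foo'], ['1'])]
sorted_partition = list(sort_partition(partition))
print([kv[0][1] for kv in sorted_partition])  # ['bar', 'foo', 'noo', 'zoo']

# With Spark (assuming rdd_partitioned from the cells above):
# rdd_partitioned.mapPartitions(sort_partition).glom().collect()
```

Because the whole partition is materialized by `sorted`, this approach suits small partitions; pushing the sort into the shuffle is the job of repartitionAndSortWithinPartitions.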

# Secondary Sort via the repartitionAndSortWithinPartitions transformation
Another important capability to be aware of is the repartitionAndSortWithinPartitions transformation. It’s a transformation that sounds arcane, but seems to come up in all sorts of strange situations. This transformation pushes sorting down into the shuffle machinery, where large amounts of data can be spilled efficiently and sorting can be combined with other operations.
For example, Apache Hive on Spark uses this transformation inside its join implementation. It also acts as a vital building block in the secondary sort pattern, in which you want to both group records by key and then, when iterating over the values that correspond to a key, have them show up in a particular order. This issue comes up in algorithms that need to group events by user and then analyze the events for each user based on the order they occurred in time. Taking advantage of repartitionAndSortWithinPartitions to do secondary sort currently requires a bit of legwork on the part of the user, but SPARK-3655 will simplify things vastly.

In [9]:
# repartitionAndSortWithinPartitions(numPartitions=None, partitionFunc=portable_hash,
#                                    ascending=True, keyfunc=lambda x: x)
# Repartition the RDD according to the given partitioner and, within each resulting partition,
# sort records by their keys.

rdd = sc.parallelize([(0, 10), (0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
# Even keys go to partition 0 and odd keys to partition 1;
# records are sorted by key only, not by value
rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
rdd2.glom().collect()  # show the records partition by partition

[[(0, 10), (0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]

In [27]:
rdd2.collect()  

[(0, 10), (0, 5), (0, 8), (2, 6), (1, 3), (3, 8), (3, 8)]
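
The semantics of the transformation can be modeled in a few lines of plain Python: bucket the records with the partition function, then sort each bucket by key. This is only a sketch of the behavior, not how Spark implements it (Spark sorts inside the shuffle and can spill to disk). The function name `repartition_and_sort` is hypothetical.

```python
def repartition_and_sort(records, num_partitions, partition_func,
                         ascending=True, keyfunc=lambda k: k):
    """Plain-Python model: bucket (key, value) records, then sort each bucket by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_func(key) % num_partitions].append((key, value))
    return [sorted(bucket, key=lambda kv: keyfunc(kv[0]), reverse=not ascending)
            for bucket in partitions]

data = [(0, 10), (0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)]
model = repartition_and_sort(data, 2, lambda x: x % 2)
print(model)
# [[(0, 10), (0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]
```

Only the key takes part in the sort, which is why (0, 10) can appear before (0, 5). To order the values as well, the value has to be moved into the key, which is the idea behind the composite-key example that follows.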

# Another example of secondary sort  via repartitionAndSortWithinPartitions
### k1,k2,v1,v2 : partition by k1 and sort by k2

In [8]:
%%writefile ss.txt
1,3,a,b
2,5,a,c
1,4,a,f
3,4,d,c
2,1,f,a
1,1,e,r
2,4,o,1
3,2,d,c

Writing ss.txt


In [17]:
def read_data(line):
    d = line.split(',')
    return [int(d[0]),int(d[1])],[d[2],d[3]]
dataRDD = sc.textFile("ss.txt").map(read_data).cache()
print(dataRDD.collect())

[([1, 3], [u'a', u'b']), ([2, 5], [u'a', u'c']), ([1, 4], [u'a', u'f']), ([3, 4], [u'd', u'c']), ([2, 1], [u'f', u'a']), ([1, 1], [u'e', u'r']), ([2, 4], [u'o', u'1']), ([3, 2], [u'd', u'c'])]


In [21]:
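# Partition on k1 (the first key field) and sort within each partition on k2 (the second)
ssdata = dataRDD.repartitionAndSortWithinPartitions(numPartitions=3,
                                                    partitionFunc=lambda x: x[0] % 3,
                                                    keyfunc=lambda x: x[1])
# dataRDD itself is unchanged; the transformation returns a new RDD (ssdata)
print(dataRDD.collect())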

[([1, 3], [u'a', u'b']), ([2, 5], [u'a', u'c']), ([1, 4], [u'a', u'f']), ([3, 4], [u'd', u'c']), ([2, 1], [u'f', u'a']), ([1, 1], [u'e', u'r']), ([2, 4], [u'o', u'1']), ([3, 2], [u'd', u'c'])]


In [22]:
# Check data by partition

In [23]:
ssdata.glom().collect()

[[([3, 2], [u'd', u'c']), ([3, 4], [u'd', u'c'])],
 [([1, 1], [u'e', u'r']), ([1, 3], [u'a', u'b']), ([1, 4], [u'a', u'f'])],
 [([2, 1], [u'f', u'a']), ([2, 4], [u'o', u'1']), ([2, 5], [u'a', u'c'])]]
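
Once each partition is sorted, the values for a group can be consumed in k2 order by walking the partition with `itertools.groupby`. The sketch below assumes that records sharing k1 are adjacent within a partition. The function name `values_by_group` is hypothetical, and the check runs in plain Python on the k1 = 1 partition shown above.

```python
from itertools import groupby

def values_by_group(records):
    """Yield (k1, [(k2, value), ...]) for one partition of ([k1, k2], value) records.

    Assumes records that share k1 are adjacent and already ordered by k2.
    """
    for k1, group in groupby(records, key=lambda kv: kv[0][0]):
        yield k1, [(kv[0][1], kv[1]) for kv in group]

# Plain-Python check using the k1 = 1 partition from the output above
partition = [([1, 1], ['e', 'r']), ([1, 3], ['a', 'b']), ([1, 4], ['a', 'f'])]
grouped = list(values_by_group(partition))
print(grouped)
# [(1, [(1, ['e', 'r']), (3, ['a', 'b']), (4, ['a', 'f'])])]

# With Spark (assuming ssdata from the cell above):
# ssdata.mapPartitions(values_by_group).collect()
```

The adjacency assumption holds here because each of the three partitions contains a single k1. If several k1 values shared a partition, sorting on k2 alone would interleave them; sorting on the full key, for example `keyfunc=lambda x: (x[0], x[1])`, would keep each group contiguous.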

# Find the maximum value of an RDD.
glom() returns an RDD created by coalescing all elements within each partition into an array.

For example, the maximum value of an RDD can be computed (in Scala) with:

val maxValue = dataRDD.reduce(_ max _)

This compares values pairwise across the whole RDD.
With glom(), the same computation can be written as two explicit steps:
1. Find the maximum in each partition
2. Compare the per-partition maxima to get the final maximum value.

val maxValue = dataRDD.glom().map((row: Array[Double]) => row.max).reduce(_ max _)
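
The two steps can also be sketched in Python. The function below works on a list of partitions, the shape that `glom().collect()` returns, so it runs without Spark. The names `max_via_partitions` and `numbers_rdd` are hypothetical. Empty partitions are skipped because `max([])` raises a ValueError.

```python
from functools import reduce

def max_via_partitions(partitions):
    """Two-step maximum over a list of partitions (lists), as glom() would expose them."""
    # Step 1: maximum of each non-empty partition
    local_maxima = [max(p) for p in partitions if p]
    # Step 2: combine the per-partition maxima
    return reduce(max, local_maxima)

partitions = [[3.0, 9.5, 1.0], [], [7.25, 2.0]]
print(max_via_partitions(partitions))  # 9.5

# With Spark (assuming an RDD of numbers named numbers_rdd):
# numbers_rdd.glom().filter(lambda p: len(p) > 0).map(max).reduce(max)
```

If every partition is empty, `reduce` has nothing to combine and raises a TypeError, so callers that may see an empty RDD would need to handle that case separately.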


Reference:
http://blog.madhukaraphatak.com/glom-in-spark/