<br><br><br>
<span style="color:red;font-size:60px">Spark Transformations</span>

<br><br><br>
<span style="color:green;font-size:xx-large">Map/Reduce in Spark</span>
<br><br>

<li>Hadoop MapReduce moves data in and out of memory between steps</li>
<li>Spark uses RDDs (minimal data) and keeps data in memory, which is faster</li>
<li>Spark provides powerful <span style="color:blue">by key</span> support for transformations</li>

<br><br><br>
<span style="color:blue;font-size:x-large">Map and flatMap</span>
<br><br>
<li>map and flatMap work like Scala's map and flatMap</li>
<li>Except lazily: nothing is computed until an action (such as collect) is called</li>
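A minimal sketch of that laziness, assuming a local Spark session (the notebook's existing session is reused if there is one). The accumulator name `calls` and the sample data are illustrative only. The function passed to map runs only once an action is called.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the notebook's session if there is one, otherwise start a local one.
val spark = SparkSession.builder().master("local[*]").appName("lazy-sketch").getOrCreate()
val sc = spark.sparkContext

// Accumulator used only to observe how often the mapped function runs.
val calls = sc.longAccumulator("calls")

val doubled = sc.parallelize(Seq(1, 2, 3)).map { x => calls.add(1); x * 2 }
val callsBefore: Long = calls.value   // the transformation has only been recorded so far

val out = doubled.collect()           // the action triggers the computation
val callsAfter: Long = calls.value
```

Comparing `callsBefore` with `callsAfter` shows that the work happens at collect, not at map.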

In [1]:
val x = sc.parallelize(Array(Array("Scala"),Array("Spark")))
val t1 = x.flatMap(e => e)
val t2 = x.map(e=>e)

Intitializing Scala interpreter ...

Spark Web UI available at http://dyn-209-2-224-60.dyn.columbia.edu:4042
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1665090532157)
SparkSession available as 'spark'


x: org.apache.spark.rdd.RDD[Array[String]] = ParallelCollectionRDD[0] at parallelize at <console>:24
t1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at flatMap at <console>:25
t2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26


In [2]:
val r1 = t1.collect // flatMap maps and then flattens
val r2 = t2.collect // map only maps

r1: Array[String] = Array(Scala, Spark)
r2: Array[Array[String]] = Array(Array(Scala), Array(Spark))


<br><br><br>
<span style="color:blue;font-size:x-large">reduce and reduceByKey</span>
<br><br>
<li><span style="color:green">reduce</span> is an <span style="color:red">action</span> and returns a <b>value (not an RDD)</b>. It works like Scala's reduce, except that it operates on an RDD</li>
<li><span style="color:green">reduceByKey</span> is a <span style="color:red">transformation</span> and returns an <b>RDD (not a value)</b>. It works on (key,value) pairs, applies a reduce function to the values of each key independently, and returns an RDD of (key,result) pairs</li>
<li>The key in reduceByKey is <span style="color:green">implicit</span>: the reduce function only sees values</li>
<li>reduceByKey returns a <span style="color:green">ShuffledRDD</span>; the shuffle operation is done for you automatically</li>

In [3]:

val x = sc.parallelize(Array(1,2,3,4,5,6,7,8))
x.reduce(_+_)

x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:25
res0: Int = 36


In [10]:
val x = sc.parallelize(Array(('A',6),('B',2),('A',1),('X',4),('B',17))) // array of two-element tuples: (k,v)
val y = x.reduceByKey((a,b) => a + b)


x: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:25
y: org.apache.spark.rdd.RDD[(Char, Int)] = ShuffledRDD[11] at reduceByKey at <console>:26


In [11]:
// Equivalent shorthand: val y = x.reduceByKey(_+_)
y.collect

res5: Array[(Char, Int)] = Array((X,4), (A,7), (B,19))


<br><br><br>
<span style="color:blue;font-size:x-large">Spark Word Count Example</span>
<br><br>
<li>We'll redo our word count in Spark</li>
<li>Note that in (key,value) terms, the key is a word and the value is its count</li>

In [12]:
val text = sc.textFile("shakespeare.txt")
val words = text.flatMap(line => line.split(" "))
val fwords = words.filter(l => l.length == 4)  // Keep only 4-character tokens
val word_map = fwords.map(word => (word,1))  // Construct a (key,value) paired RDD, as required by reduceByKey
val result = word_map.reduceByKey((a,b) => a + b)


text: org.apache.spark.rdd.RDD[String] = shakespeare.txt MapPartitionsRDD[13] at textFile at <console>:24
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14] at flatMap at <console>:25
fwords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at filter at <console>:26
word_map: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[16] at map at <console>:27
result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[17] at reduceByKey at <console>:28


In [13]:
result.collect

res6: Array[(String, Int)] = Array((PETO,2), (ban;,1), (jowl,1), (joy.,20), (bone,7), (pate,19), (Trow,2), (shot,30), (fawn,11), (ENDS,1), (woe!,6), (are.,55), (dole,4), (NYM,,5), (hoof,1), (Oyes,1), ("Tis,2), (Wake,3), (been,639), (bout,6), (Eve,,3), (rots,1), (iiii,1), (jade,6), (ope,,2), (toe,,2), (lad-,3), (dim.,2), (fowl,7), (nigh,6), (file,16), (so's,1), (O's!,1), (tang,2), (dive,6), (tune,32), (see?,12), (tips,1), (one,,115), (yet,,102), (cart,5), (fit;,5), (plus,2), (mild,19), (shut,45), (morn,19), (Why,,745), (1601,2), (rubs,5), (feu.,1), (Hers,2), (box;,2), (Chop,1), (aims,3), (Iago,17), (snip,2), (knot,18), (rail,22), (dead,170), (men,,116), (robs,6), (Alla,2), ((out,1), (map,,1), (ago!,1), (feil,1), (thus,399), (mad,,43), (dine,19), (hit?,1), (py'r,1), (them,1305), (iron,30)...
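The output above contains tokens such as `joy.` and `Why,` because splitting on a single space leaves punctuation attached to the words. A hedged sketch of one way to normalize tokens before counting, assuming a local Spark session; the two input lines are illustrative and are not taken from shakespeare.txt. Note that `\W+` also splits on apostrophes, which may not be what you want for contractions.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the notebook's session if there is one, otherwise start a local one.
val spark = SparkSession.builder().master("local[*]").appName("wordcount-sketch").getOrCreate()
val sc = spark.sparkContext

// Illustrative input; the notebook reads shakespeare.txt instead.
val lines = sc.parallelize(Seq("To be, or not to be!", "That is the question."))

val counts = lines
  .flatMap(line => line.toLowerCase.split("\\W+"))  // split on runs of non-word characters
  .filter(_.nonEmpty)                               // drop empty tokens caused by leading punctuation
  .map(word => (word, 1))                           // (key,value) pairs for reduceByKey
  .reduceByKey(_ + _)
  .collectAsMap()                                   // small result, so bring it to the driver as a Map
```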


<br><br><br>
<span style="color:blue;font-size:x-large">Comparing Scala, Spark, and Python</span>
<img src="scala spark python.png">

<br><br><br>
<span style="color:blue;font-size:x-large">Try this!</span>
<br><br>
<span style="font-size:large">
Given a set of quiz scores, use map and reduceByKey to calculate the total score for each student in the class.
</span>

In [15]:
val x = Array(("John","Q1",10),
              ("Jill","Q1",8),
              ("John","Q2",3),
              ("Jill","Q2",9))
val y = sc.parallelize(x)



x: Array[(String, String, Int)] = Array((John,Q1,10), (Jill,Q1,8), (John,Q2,3), (Jill,Q2,9))
y: org.apache.spark.rdd.RDD[(String, String, Int)] = ParallelCollectionRDD[18] at parallelize at <console>:29


In [20]:
val total_scores_by_student = y.map(x=>(x._1,x._3)).reduceByKey((a,b)=>a+b).collect

total_scores_by_student: Array[(String, Int)] = Array((John,13), (Jill,17))


<br><br><br>
<span style="color:blue;font-size:x-large">Example: NYC 311 Data</span>
<br><br>

In [2]:
!wc nyc_311_2022_clean.csv

 8835630 90365104 1628192091 nyc_311_2022_clean.csv



In [3]:
!wc -l nyc_311_2022_clean.csv

 8835630 nyc_311_2022_clean.csv



In [4]:
val NYC_Data_Path = "nyc_311_2022_clean.csv"


NYC_Data_Path: String = nyc_311_2022_clean.csv


In [5]:
val raw_data = sc.textFile(NYC_Data_Path)
println(raw_data.partitions.length)
println(raw_data.count)

49
8835630


raw_data: org.apache.spark.rdd.RDD[String] = nyc_311_2022_clean.csv MapPartitionsRDD[1] at textFile at <console>:25


In [6]:
raw_data.take(2).foreach(println)

Created Date,Closed Date,Agency,Agency Name,Complaint Type,Incident Zip,Borough,Latitude,Longitude,processing_time,processing_days
2020-01-07 14:09:00,2020-01-13 11:20:00,DSNY,Department of Sanitation,Electronics Waste Appointment,11692,QUEENS,40.58993519447414,-73.78942049765358,5 days 21:11:00,5.882638888888889


<span style="color:blue;font-size:x-large">Remove the header row from the data</span>
<br>
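<li><span style="color:red">mapPartitions</span> lets us apply a function to each partition. mapPartitionsWithIndex also passes the partition's index to that function, along with an iterator over the partition's elements. The header is the first line of partition 0, so we drop one element there</li>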
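A small sketch of the same header-dropping idea on in-memory data, assuming a local Spark session. The four sample rows and the choice of two partitions are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the notebook's session if there is one, otherwise start a local one.
val spark = SparkSession.builder().master("local[*]").appName("header-sketch").getOrCreate()
val sc = spark.sparkContext

// Illustrative four-line "file" in two partitions; the header is the first line of partition 0.
val rows = sc.parallelize(Seq("name,score", "John,10", "Jill,8", "John,3"), 2)

val noHeader = rows.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter   // skip one line in the first partition only
}.collect()
```

Only partition 0 is touched, so the other partitions pass through without any per-line comparison.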



In [7]:
val raw_data_nohead = raw_data.mapPartitionsWithIndex{ (idx,iter) => if (idx==0) iter.drop(1) else iter}
val r1 = raw_data.count
val r2 = raw_data_nohead.count

raw_data_nohead: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at mapPartitionsWithIndex at <console>:24
r1: Long = 8835630
r2: Long = 8835629


<br><br>
<span style="color:green;font-size:xx-large">Complaints by agency</span>
<br><br>
<li>Let's calculate the number of complaints by agency</li>
<li><span style="color:red">countByKey</span> is an action: given an RDD of (key,value) pairs, it returns a Map of counts by key</li>
<li>Construct an (agency,processing time) paired RDD and apply countByKey (the values are ignored; only the keys are counted)</li>

In [8]:
raw_data_nohead
    .map(l=>l.split(","))
    .map(t => (t(2),t(10).toDouble))
    .take(2)


res2: Array[(String, Double)] = Array((DSNY,5.882638888888889), (DSNY,4.070833333333334))


In [9]:
raw_data_nohead
    .map(l=>l.split(","))
    .map(t => (t(2),t(10).toDouble))
    .countByKey

res3: scala.collection.Map[String,Long] = Map(DOB -> 273229, DOT -> 583951, DHS -> 59416, OFFICE OF TECHNOLOGY AND INNOVATION -> 2, OSE -> 3, DOF -> 3042, MAYOR’S OFFICE OF SPECIAL ENFORCEMENT -> 57511, DOE -> 3672, DPR -> 248101, DOITT -> 503, EDC -> 11897, TLC -> 6035, DFTA -> 6, DCA -> 12623, DEP -> 227815, HPD -> 1233934, DSNY -> 1527321, DOHMH -> 151211, FDNY -> 1, NYPD -> 4435356)


In [42]:
raw_data_nohead
    .map(l=>l.split(","))
    .map(t => (t(2),t(10).toDouble))
    .countByKey.foreach(println)

(DOB,273229)
(DOT,583951)
(DHS,59416)
(OFFICE OF TECHNOLOGY AND INNOVATION,2)
(OSE,3)
(DOF,3042)
(MAYOR’S OFFICE OF SPECIAL ENFORCEMENT,57511)
(DOE,3672)
(DPR,248101)
(DOITT,503)
(EDC,11897)
(TLC,6035)
(DFTA,6)
(DCA,12623)
(DEP,227815)
(HPD,1233934)
(DSNY,1527321)
(DOHMH,151211)
(FDNY,1)
(NYPD,4435356)


<br><br>
<span style="color:green;font-size:xx-large">Total Processing time by agency</span>
<br><br>
<li>We'll use <span style="color:red">reduceByKey</span> for this</li>
<li>and convert the processing time from days to hours (multiply by 24)</li>

In [10]:
val proc_time_by_agency = raw_data_nohead
                            .map(l=>l.split(","))
                            .map(t => (t(2),t(10).toDouble))
                            .reduceByKey((a,b)=>a+b)
                            .map(t => (t._1,t._2*24))

proc_time_by_agency: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[12] at map at <console>:28


In [11]:
%%time
proc_time_by_agency.collect

Time: 2.509887933731079 seconds.



res4: Array[(String, Double)] = Array((DOE,3828935.48), (DOF,1442816.484444445), (OFFICE OF TECHNOLOGY AND INNOVATION,37.346111111111114), (HPD,3.9216531026861185E8), (MAYOR’S OFFICE OF SPECIAL ENFORCEMENT,1.517341014472222E7), (OSE,9.106944444444444), (DOT,2.031404898344442E8), (DCA,943223.69), (FDNY,9651.465555555555), (DSNY,2.5603643572305554E8), (EDC,1.5989842325000033E7), (DOHMH,5.58771262333334E7), (TLC,7775306.0655555595), (DHS,1793047.605277778), (NYPD,3.6147025671944425E7), (DOITT,342750.3583333331), (DEP,2.737430160000001E7), (DPR,3.929524421725001E8), (DFTA,1928.2644444444445), (DOB,2.564403408769446E8))


<br><br>
<span style="color:blue;font-size:large">Spark tries to keep the data in memory</span>
<p>
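    <li>If we run collect again, it should run faster: Spark can reuse the shuffle output written during the first run instead of recomputing it. An RDD is held in memory across actions only if it is explicitly cached with cache or persist</li>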
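A hedged sketch of asking Spark to keep an RDD in memory with cache, assuming a local Spark session. The RDD `totals` is an illustrative stand-in for `proc_time_by_agency`, and actual timings will vary by machine.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Reuse the notebook's session if there is one, otherwise start a local one.
val spark = SparkSession.builder().master("local[*]").appName("cache-sketch").getOrCreate()
val sc = spark.sparkContext

// Illustrative pair RDD; in the notebook this would be proc_time_by_agency.
val totals = sc.parallelize(1 to 1000).map(x => (x % 10, x)).reduceByKey(_ + _)

totals.cache()                      // mark for in-memory storage, same as persist(StorageLevel.MEMORY_ONLY)
val level = totals.getStorageLevel

val first = totals.collect()        // the first action computes the RDD and stores its partitions
val second = totals.collect()       // later actions can read the stored partitions

totals.unpersist()                  // release the memory when the RDD is no longer needed
```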

In [12]:
%%time
proc_time_by_agency.collect
// second run of the same action

Time: 0.1536400318145752 seconds.



res5: Array[(String, Double)] = Array((DOE,3828935.48), (DOF,1442816.484444445), (OFFICE OF TECHNOLOGY AND INNOVATION,37.346111111111114), (HPD,3.9216531026861185E8), (MAYOR’S OFFICE OF SPECIAL ENFORCEMENT,1.517341014472222E7), (OSE,9.106944444444444), (DOT,2.031404898344442E8), (DCA,943223.69), (FDNY,9651.465555555555), (DSNY,2.5603643572305554E8), (EDC,1.5989842325000033E7), (DOHMH,5.58771262333334E7), (TLC,7775306.0655555595), (DHS,1793047.605277778), (NYPD,3.6147025671944425E7), (DOITT,342750.3583333331), (DEP,2.737430160000001E7), (DPR,3.929524421725001E8), (DFTA,1928.2644444444445), (DOB,2.564403408769446E8))


<br><br>
<span style="color:blue;font-size:large">Formatting using Scala match</span>
<p>
    <li>formatted print of processing time by agency</li>

In [14]:
proc_time_by_agency.collect.foreach(t=>t match { case(a,b) => println(f"$a%s\t$b%1.2f")})

DOE	3828935.48
DOF	1442816.48
OFFICE OF TECHNOLOGY AND INNOVATION	37.35
HPD	392165310.27
MAYOR’S OFFICE OF SPECIAL ENFORCEMENT	15173410.14
OSE	9.11
DOT	203140489.83
DCA	943223.69
FDNY	9651.47
DSNY	256036435.72
EDC	15989842.33
DOHMH	55877126.23
TLC	7775306.07
DHS	1793047.61
NYPD	36147025.67
DOITT	342750.36
DEP	27374301.60
DPR	392952442.17
DFTA	1928.26
DOB	256440340.88


In [32]:
proc_time_by_agency.collect.foreach(t=>t match { case(a,b) => println(f"$a\t$b%1.2f")})

DOE	3828935.48
DOF	1442816.48
OFFICE OF TECHNOLOGY AND INNOVATION	37.35
HPD	392165310.27
MAYOR’S OFFICE OF SPECIAL ENFORCEMENT	15173410.14
OSE	9.11
DOT	203140489.83
DCA	943223.69
FDNY	9651.47
DSNY	256036435.72
EDC	15989842.33
DOHMH	55877126.23
TLC	7775306.07
DHS	1793047.61
NYPD	36147025.67
DOITT	342750.36
DEP	27374301.60
DPR	392952442.17
DFTA	1928.26
DOB	256440340.88


In [50]:
proc_time_by_agency.foreach(t=>t match { case(a,b) => {val c = a.slice(0,10); println(f"$c%-10s$b%12.2f")}})

OSE               9.11
DOF         1442816.48
OFFICE OF        37.35
HPD       392165310.27
DOT       203140489.83
MAYOR’S  15173410.14
DOE         3828935.48
DCA          943223.69
DFTA           1928.26
FDNY           9651.47
TLC         7775306.07
DHS         1793047.61
NYPD       36147025.67
EDC        15989842.33
DOHMH      55877126.23
DOITT        342750.36
DEP        27374301.60
DPR       392952442.17
DSNY      256036435.72
DOB       256440340.88


<br><br>
<span style="color:blue;font-size:large">Accessing keys and values</span>
<p>
<li>In a (key,value) pair RDD, Spark treats the first element of each tuple as the key and the second as the value</li>
<li>The method <span style="color:blue">keys</span> returns an RDD containing the keys</li>
<li>The method <span style="color:blue">values</span> returns an RDD containing the values</li>
<li>Note that the by-key functions only work on tuples of size 2</li>
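A short sketch of reshaping wider tuples into pairs so that the by-key functions apply, assuming a local Spark session. The quiz data repeats the earlier exercise; the names `byStudentQuiz`, `ks`, and `vs` are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the notebook's session if there is one, otherwise start a local one.
val spark = SparkSession.builder().master("local[*]").appName("pairs-sketch").getOrCreate()
val sc = spark.sparkContext

val quizzes = sc.parallelize(Seq(("John", "Q1", 10), ("Jill", "Q1", 8), ("John", "Q2", 3), ("Jill", "Q2", 9)))

// A 3-tuple is not a pair, so reshape it; here the key is itself a (student, quiz) tuple.
val byStudentQuiz = quizzes.map { case (name, quiz, score) => ((name, quiz), score) }
val ks = byStudentQuiz.keys.collect()     // the (student, quiz) keys
val vs = byStudentQuiz.values.collect()   // the scores

// Choosing a different key changes what the by-key functions group on.
val perStudent = quizzes
  .map { case (name, _, score) => (name, score) }
  .reduceByKey(_ + _)
  .collectAsMap()
```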

In [56]:
val agency_time_map = raw_data_nohead
                            .map(l=>l.split(","))
                            .map(t => (t(2),t(10).toDouble))
val keys = agency_time_map.keys
val values = agency_time_map.values

agency_time_map: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[17] at map at <console>:26
keys: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[18] at keys at <console>:27
values: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[19] at values at <console>:28


In [48]:
keys.take(3)

res32: Array[String] = Array(DSNY, DSNY, DSNY)


In [52]:
values.take(3)

res36: Array[Double] = Array(5.882638888888889, 4.070833333333334, 1.3104166666666668)


<br><br><br>
<span style="color:green;font-size:xx-large">aggregate and aggregateByKey</span>
<p>
<li><span style="color:blue">aggregate</span>: a two-stage reducer. In the first stage, a reduce function is applied to each partition separately. In the second stage, a second function combines the results across all partitions</li>
        <p>
<li><span style="color:blue">aggregateByKey</span>: similar to aggregate but is applied to each key separately</li>
<p>
<li>Both functions take three arguments</li>
    <p>
<ol>
    <li>an initial value for the accumulator</li>
    <li>a function that folds each element of a partition into that partition's accumulator</li>
    <li>a function that merges the accumulators from different partitions</li>
</ol>
<li>Let's compute the average across all students and the average for each student</li>
<li>aggregate for the average for all students</li>
<li>aggregateByKey for the average for each student</li>
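Because the accumulator can have a different type from the elements, aggregate can track a running total and a count together. A minimal sketch, assuming a local Spark session; the four scores and the two partitions are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the notebook's session if there is one, otherwise start a local one.
val spark = SparkSession.builder().master("local[*]").appName("aggregate-sketch").getOrCreate()
val sc = spark.sparkContext

val scores = sc.parallelize(Seq(74, 92, 66, 89), 2)

// The accumulator is a (total, count) pair, so one pass yields both numbers.
val (total, n) = scores.aggregate((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),    // within a partition: add the score, bump the count
  (a, b) => (a._1 + b._1, a._2 + b._2)     // across partitions: add totals and counts
)
val average = total.toDouble / n
```

This avoids the separate count action that a plain total would need to turn into an average.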

In [53]:
val grades = sc.parallelize(List(("Jack",74),("Jill",92),("Jiahou",66),("Jahangir",89),("Jack",54),("Jahangir",99),
                 ("Jill",87),("Jack",76),("Jiahou",95),("Jill",67),("Jahangir",84),("Jack",93),
                 ("Jill",98),("Jahangir",89),("Jiahou",71),("Jack",65),("Jack",80),("Jill",99)))

//f1 accumulates scores in a single partition and returns an Int (total score)
//f1's arguments are therefore an Int (the accumulator) and (String, Int) (the data pairs)
def f1= (accu:Int, v:(String,Int)) => accu + v._2 

//f2 accumulates the result from across all partitions
//f2's arguments are an Int (the accumulator) and an Int (the accumulated value in each partition)
def f2= (accu1:Int,accu2:Int) => accu1 + accu2

val result = grades.aggregate(0)(f1,f2)
println("total scores: " + result)
println("average score: " + result.toDouble/grades.count)

total scores: 1478
average score: 82.11111111111111


grades: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[71] at parallelize at <console>:25
f1: (Int, (String, Int)) => Int
f2: (Int, Int) => Int
result: Int = 1478


In [54]:
grades.partitions.length

res38: Int = 8


<br><br><br>
<span style="color:blue;font-size:x-large">aggregateByKey</span>
<p>
<li>Works just like aggregate but the "by key" part makes the key implicit</li>
<li>f2 is the same as in the aggregate function (but there will be an implicit key)</li>
<li>f1 no longer has a tuple argument (because the key is implicit)</li>
        <li>aggregateByKey works on PairRDDs</li>     
    <p>
        To calculate averages by key we must track the sum and the count for each key

<h4>Calculating totals by student</h4>

In [55]:
val grades = sc.parallelize(List(("Jack",74),("Jill",92),("Jiahou",66),("Jahangir",89),("Jack",54),("Jahangir",99),
                 ("Jill",87),("Jack",76),("Jiahou",95),("Jill",67),("Jahangir",84),("Jack",93),
                 ("Jill",98),("Jahangir",89),("Jiahou",71),("Jack",65),("Jack",80),("Jill",99)))

def f1 = (accu:Int, v:Int) => accu + v //f1 automatically looks for the value element for each key
def f2 = (accu1:Int,accu2:Int) => accu1 + accu2 //f2 accumulates across partitions for each key separately
val result = grades.aggregateByKey(0)(f1,f2) // 0 is the initial accumulator for each new key
result.collect

grades: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[72] at parallelize at <console>:28
f1: (Int, Int) => Int
f2: (Int, Int) => Int
result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[73] at aggregateByKey at <console>:34
res39: Array[(String, Int)] = Array((Jahangir,361), (Jiahou,232), (Jill,443), (Jack,442))


<h4>Calculating averages by student</h4>
<li>We'll need to calculate totals as well as counts for each student</li>

In [57]:
val grades = sc.parallelize(List(("Jack",74),("Jill",92),("Jiahou",66),("Jahangir",89),("Jack",54),("Jahangir",99),
                 ("Jill",87),("Jack",76),("Jiahou",95),("Jill",67),("Jahangir",84),("Jack",93),
                 ("Jill",98),("Jahangir",89),("Jiahou",71),("Jack",65),("Jack",80),("Jill",99)))

//The f1 accumulator tracks both the total and the count
def f1= (accu:(Int,Int), v:Int) => (accu._1 + v,accu._2+1) //f1 automatically looks for the value element

//f2 adds up the totals and the counts from each partition into a tuple
def f2= (accu1:(Int,Int),accu2:(Int,Int)) => (accu1._1 + accu2._1,accu1._2+accu2._2) //f2 accumulates across partitions for each key separately

//Divide the total by the count for each key to get averages
val result = grades.aggregateByKey((0,0))(f1,f2)
    .map(r=>(r._1,r._2._1*1.0/r._2._2))

result.collect

grades: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[77] at parallelize at <console>:28
f1: ((Int, Int), Int) => (Int, Int)
f2: ((Int, Int), (Int, Int)) => (Int, Int)
result: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[79] at map at <console>:40
res41: Array[(String, Double)] = Array((Jahangir,90.25), (Jiahou,77.33333333333333), (Jill,88.6), (Jack,73.66666666666667))


In [58]:
grades.aggregateByKey((0,0))(f1,f2)

res42: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[80] at aggregateByKey at <console>:28


<br><br><br>
<span style="color:green;font-size:xx-large">groupByKey</span>
<p>
<li>Groups the data by key: each key is paired with an Iterable of all its values</li>
<li>Similar to Python's group by</li>


In [51]:
val grades = sc.parallelize(List(("Jack",74),("Jill",92),("Jiahou",66),("Jahangir",89),("Jack",54),("Jahangir",99),
                 ("Jill",87),("Jack",76),("Jiahou",95),("Jill",67),("Jahangir",84),("Jack",93),
                 ("Jill",98),("Jahangir",89),("Jiahou",71),("Jack",65),("Jack",80),("Jill",99)))
val temp = grades.groupByKey
val total_scores = temp.map(l => (l._1,l._2.reduce((x,y)=>x+y)))
total_scores.collect


grades: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[13] at parallelize at <console>:24
temp: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[14] at groupByKey at <console>:27
total_scores: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[15] at map at <console>:28
res43: Array[(String, Int)] = Array((Jahangir,361), (Jiahou,232), (Jill,443), (Jack,442))


In [2]:
temp.collect

res1: Array[(String, Iterable[Int])] = Array((Jahangir,CompactBuffer(89, 99, 84, 89)), (Jiahou,CompactBuffer(66, 95, 71)), (Jill,CompactBuffer(92, 87, 67, 98, 99)), (Jack,CompactBuffer(74, 54, 76, 93, 65, 80)))


In [None]:
// CompactBuffer: Spark's internal append-only buffer, similar to a Scala ArrayBuffer

In [5]:
val x = Array(Array(1,2),Array(3,4))
val y = sc.parallelize(x)
y.collect

x: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4))
y: org.apache.spark.rdd.RDD[Array[Int]] = ParallelCollectionRDD[3] at parallelize at <console>:25
res4: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4))


<br><br><br>
<span style="color:green;font-size:xx-large">combineByKey</span>
<p>
    <li>groupBy and groupByKey operations are expensive</li>
    <li>because grouping needs the entire data set and can't be broken up at the partition level</li>
    <li><span style="color:red">combineByKey</span> combines data at the partition level (by key) and then combines the results across partitions (also by key)</li>
    <p>
        
<li>combineByKey takes three function arguments</li>
<li><b>Combiner</b>: creates an accumulator (e.g. (1,value)) for each unseen key in a partition</li>
<li><b>Merger</b>: merges the values of "seen keys" into the accumulator in a partition</li>
<li><b>Merge Combiner</b>: merges the accumulators of the same key across partitions</li>
<li>A more general version of aggregateByKey (the initial value is replaced by a function)</li>
<li><b>Notice that the key is implicit in the entire operation!</b></li>

<img src="combineByKey.png">

<br><br><span style="color:blue;font-size:x-large">Average by key using combineByKey</span>
<br>
<li><span style="color:red">combiner</span>: a function that initializes the accumulator in a single partition. The combiner is called when combineByKey sees a key for the first time</li>
<li><span style="color:red">merger</span>: a function that updates the accumulator</li>
<li><span style="color:red">mergeAndCombiner</span>: a function that merges the accumulators for the same key from two partitions</li>

In [None]:
// The combiner and merger run within a partition, so this stage needs no shuffle

<img src="combiner.png">

In [52]:
//Initializes a new key to a count of 1 and a total of the value of the new key
val combiner = (x: Double) => (1,x) 



combiner: Double => (Int, Double) = $Lambda$3772/0x00000008013f2840@7a5d4dd4


<img src="merger.png">

In [61]:
//update the accumulator by adding 1 to the count and adding the new value to the running total (for a key)
val merger = (x: (Int, Double),y: Double) => {
    val (c,acc) = x
    (c+1, acc + y)
}




merger: ((Int, Double), Double) => (Int, Double) = $Lambda$3828/0x00000008013bd840@17e290cb


<img src="combiner_and_merger.png">

<img src="merge_and_combiner.png">

In [62]:
val mergeAndCombiner = (x1: (Int, Double), x2: (Int, Double)) => {
    val (c1, acc1) = x1
    val (c2, acc2) = x2
    (c1+c2,acc1+acc2)
}

mergeAndCombiner: ((Int, Double), (Int, Double)) => (Int, Double) = $Lambda$3829/0x00000008013bc040@43ca9172


In [63]:
val combiner = (x: Double) => (1,x) 
val merger = (x: (Int, Double),y: Double) => {
    val (c,acc) = x
    (c+1, acc + y)
}
val mergeAndCombiner = (x1: (Int, Double), x2: (Int, Double)) => {
    val (c1, acc1) = x1
    val (c2, acc2) = x2
    (c1+c2,acc1+acc2)
}
agency_time_map.combineByKey(combiner,merger,mergeAndCombiner).collect

res47: Array[(String, (Int, Double))] = Array((DOE,(3672,159538.97833333333)), (DOF,(3042,60117.35351851855)), (OFFICE OF TECHNOLOGY AND INNOVATION,(2,1.556087962962963)), (HPD,(1233934,1.634022126119216E7)), (MAYOR’S OFFICE OF SPECIAL ENFORCEMENT,(57511,632225.4226967591)), (OSE,(3,0.3794560185185185)), (DOT,(583951,8464187.076435175)), (DCA,(12623,39300.98708333333)), (FDNY,(1,402.1443981481481)), (DSNY,(1527321,1.066818482179398E7)), (EDC,(11897,666243.4302083347)), (DOHMH,(151211,2328213.5930555584)), (TLC,(6035,323971.086064815)), (DHS,(59416,74710.31688657409)), (NYPD,(4435356,1506126.069664351)), (DOITT,(503,14281.264930555546)), (DEP,(227815,1140595.9000000004)), (DPR,(248101,1.637301842385417E7)), (DFTA,(6,80.34435185185185)), (DOB,(273229,1.0685014203206025E7)))


<br><br>
<span style="color:blue;font-size:large">Calculating the average</span>
<br>

In [65]:
agency_time_map
    .combineByKey(combiner,merger,mergeAndCombiner) //Get counts and totals
    .map(t => (t._1,t._2._2.toDouble/t._2._1))   //average by dividing totals by counts

res49: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[25] at map at <console>:31


In [70]:
agency_time_map
    .combineByKey(combiner,merger,mergeAndCombiner) //Get counts and totals
    .map(t => (t._1,t._2._2/t._2._1))   //average by dividing totals by counts
    .collect

res54: Array[(String, Double)] = Array((DOE,43.447434186637615), (DOF,19.76244362870432), (OFFICE OF TECHNOLOGY AND INNOVATION,0.7780439814814815), (HPD,13.242378653308977), (MAYOR’S OFFICE OF SPECIAL ENFORCEMENT,10.993121710572918), (OSE,0.12648533950617283), (DOT,14.494687185115147), (DCA,3.1134426905912487), (FDNY,402.1443981481481), (DSNY,6.984900241530092), (EDC,56.00096076391819), (DOHMH,15.397117888616293), (TLC,53.682035801957745), (DHS,1.257410746037668), (NYPD,0.33957275800732817), (DOITT,28.392176800309237), (DEP,5.006676030990059), (DPR,65.99335925229713), (DFTA,13.390725308641976), (DOB,39.10644259286542))


In [69]:
%%time
agency_time_map
    .combineByKey(combiner,merger,mergeAndCombiner) //Get counts and totals
    .map(t => (t._1,t._2._2.toDouble/t._2._1))   //average by dividing totals by counts
    .collect

Time: 2.0663950443267822 seconds.



res53: Array[(String, Double)] = Array((DOE,43.447434186637615), (DOF,19.76244362870432), (OFFICE OF TECHNOLOGY AND INNOVATION,0.7780439814814815), (HPD,13.242378653308977), (MAYOR’S OFFICE OF SPECIAL ENFORCEMENT,10.993121710572918), (OSE,0.12648533950617283), (DOT,14.494687185115147), (DCA,3.1134426905912487), (FDNY,402.1443981481481), (DSNY,6.984900241530092), (EDC,56.00096076391819), (DOHMH,15.397117888616293), (TLC,53.682035801957745), (DHS,1.257410746037668), (NYPD,0.33957275800732817), (DOITT,28.392176800309237), (DEP,5.006676030990059), (DPR,65.99335925229713), (DFTA,13.390725308641976), (DOB,39.10644259286542))


In [71]:
val resultRDD = agency_time_map
    .combineByKey(combiner,merger,mergeAndCombiner) //Get counts and totals
    .map(t => (t._1,t._2._2.toDouble/t._2._1))

resultRDD: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[37] at map at <console>:29


In [72]:
%%time
resultRDD.collect

Time: 2.3064072132110596 seconds.



res55: Array[(String, Double)] = Array((DOE,43.447434186637615), (DOF,19.76244362870432), (OFFICE OF TECHNOLOGY AND INNOVATION,0.7780439814814815), (HPD,13.242378653308977), (MAYOR’S OFFICE OF SPECIAL ENFORCEMENT,10.993121710572918), (OSE,0.12648533950617283), (DOT,14.494687185115147), (DCA,3.1134426905912487), (FDNY,402.1443981481481), (DSNY,6.984900241530092), (EDC,56.00096076391819), (DOHMH,15.397117888616293), (TLC,53.682035801957745), (DHS,1.257410746037668), (NYPD,0.33957275800732817), (DOITT,28.392176800309237), (DEP,5.006676030990059), (DPR,65.99335925229713), (DFTA,13.390725308641976), (DOB,39.10644259286542))


<br><br><br>
<span style="color:green;font-size:xx-large">folding in Spark</span>
<p>
    <li><b>Important:</b> these are Spark API functions that share their names with Scala functions. They are not the Scala functions!</li>


<span style="color:blue;font-size:large">fold</span>
<p>
    <li>fold is an <span style="color:red">action</span> on an RDD</li>
    <li>similar to Scala's fold (initial value + function)</li>
    <p>
        <span style="color:red">Example</span>: find a student with the highest score in the class

In [26]:
val grades = sc.parallelize(List(("Jack",74),("Jill",92),("Jiahou",66),("Jahangir",89),("Jack",54),("Jahangir",99),
                 ("Jill",87),("Jack",76),("Jiahou",95),("Jill",67),("Jahangir",84),("Jack",93),
                 ("Jill",98),("Jahangir",89),("Jiahou",71),("Jack",65),("Jack",80),("Jill",99)))

val start = ("xyz",0)
val highest = grades.fold(start)((acc,score) => {
    if (acc._2 <= score._2) score else acc
})



grades: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[26] at parallelize at <console>:26
start: (String, Int) = (xyz,0)
highest: (String, Int) = (Jahangir,99)
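One difference from Scala's fold is worth a sketch: Spark applies the initial value once in every partition and once more when merging the partition results, so it should be neutral for the operation. In the example above, ("xyz",0) is safe because no real score is below 0. The snippet below assumes a local Spark session; the numbers and the two partitions are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Reuse the notebook's session if there is one, otherwise start a local one.
val spark = SparkSession.builder().master("local[*]").appName("fold-sketch").getOrCreate()
val sc = spark.sparkContext

val nums = sc.parallelize(Seq(1, 2, 3), 2)   // two partitions, chosen for illustration

val neutral = nums.fold(0)(_ + _)    // 0 is neutral for addition, so this is the plain sum
val skewed = nums.fold(10)(_ + _)    // 10 is added once per partition and once in the final merge
```

With two partitions, `skewed` is the sum plus three times the initial value, which is why a non-neutral start gives results that depend on the partition count.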



<span style="color:blue;font-size:large">foldByKey</span>
<p>
    <li>Like fold, but it folds the values of each key separately and returns an RDD (it is a transformation, not an action)</li>
    <li>Example: What is the highest score for each student</li>
    <li>Note that, below, the key is implicit</li>

In [27]:
grades.foldByKey(0)((acc,score) => if (acc < score) score else acc)

res15: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[27] at foldByKey at <console>:26


In [28]:
grades.foldByKey(0)((acc,score) => if (acc < score) score else acc).collect

res16: Array[(String, Int)] = Array((Jahangir,99), (Jiahou,95), (Jill,99), (Jack,93))


<br><br><br>
<span style="color:green;font-size:xx-large">foldByKey vs reduceByKey vs combineByKey</span>
<p>
    <li>The underlying implementation of the three byKey operations is more or less the same</li>
    <li>reduceByKey calls combineByKey. Use combineByKey when you want more control over what happens within partitions</li>
    <li>Use reduceByKey when you want simple code and extra control over what happens within partitions wouldn't help</li>
    <li>foldByKey also calls combineByKey and is essentially reduceByKey with the ability to set the initial value. Use foldByKey when you want to set the initial value</li>


In [29]:
grades.reduceByKey((a,b) => if (a<b) b else a).collect

res17: Array[(String, Int)] = Array((Jahangir,99), (Jiahou,95), (Jill,99), (Jack,93))


In [30]:
grades.combineByKey(((a: Int) => a),((a: Int, b: Int) => if (a<b) b else a),((a: Int,b: Int) => if (a<b) b else a)).collect

res18: Array[(String, Int)] = Array((Jahangir,99), (Jiahou,95), (Jill,99), (Jack,93))
