<span style="color:red;font-size:50px">MapReduce</span>
<li><span style="color:blue">MapReduce</span> is a parallel and distributed programming paradigm used to process large datasets</li>
<li>In MapReduce, data is often organized in the form of (key, value) pairs</li>
<li>And then processed to produce “by key” results</li>
<li><b>Note that MapReduce is a programming paradigm, not a programming operation.</b> Each platform implements MapReduce in its own way</li>
<li>In practice, programmers combine map and reduce functions to simulate the MapReduce operation</li>


In [1]:
//used in functional programming

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.0.149:4041
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1664235452977)
SparkSession available as 'spark'


<span style="color:green;font-size:xx-large">The MapReduce process</span>
<li>The data is either already distributed (as in Hadoop) or the controller distributes the data across workers</li>
<li>A three-stage process is applied:
    <ul>
        <li><span style="color:red">Map stage</span>: Each worker applies a map function to each data item</li>
        <li><span style="color:red">Shuffle stage</span>: The controller then reorganizes the data so that all items with the same key are on the same machine (as far as is possible)</li>
        <li><span style="color:red">Reduce stage</span>: Finally, each worker applies a function to all the values of a single key to produce a result</li>
    </ul>
</li>

<span style="color:green;font-size:x-large">Example: MapReduce to get the count of each word in a document</span>
<img src="wordcount.png">
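The three stages can be imitated on a single machine with ordinary Scala collections. This is only a minimal sketch: the sentence `doc` is made up for illustration, and `groupBy` stands in for the shuffle, which on a real cluster moves data between machines.

```scala
// Hypothetical sample document; not taken from the notebook.
val doc = "the cat saw the dog and the dog saw the cat"

// Map stage: emit a (word, 1) pair for every word.
val mapped: Array[(String, Int)] = doc.split(" ").map(w => (w, 1))

// Shuffle stage (simulated locally): bring all pairs with the same key together.
val shuffled: Map[String, Array[(String, Int)]] = mapped.groupBy(_._1)

// Reduce stage: combine the values of each key into a single count.
val counts: Map[String, Int] =
  shuffled.map { case (word, pairs) => (word, pairs.map(_._2).reduce(_ + _)) }

println(counts)
```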

<span style="color:green;font-size:xx-large">map</span><p>
<li>An operation that maps a function to a range of values and produces a new list of results<p>

$ map(f,[x_{1},x_{2},x_{3},\ldots,x_{n}]) \longrightarrow [f(x_{1}),f(x_{2}),f(x_{3}),\ldots,f(x_{n})] $
<p>
<li>map applies a function to an iterable</li>
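As a small illustration of the definition above, here is map applied to a list of integers. The function `f` and the list `xs` are arbitrary choices made for this sketch.

```scala
// f is an arbitrary example function; any one-argument function would do.
val f = (x: Int) => x * x
val xs = List(1, 2, 3, 4)

// map builds a new list: List(f(1), f(2), f(3), f(4))
val ys = xs.map(f)
println(ys) // prints List(1, 4, 9, 16)
```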
    

<span style="color:green;font-size:xx-large">Example:</span>
<li>An array of tuples containing income information</li>
<li>map a tax calculation function to get the tax owed by each individual</li>

<span style="color:green;font-size:xx-large">Aside: Scala tuples</span>
<li>Immutable collections that allow for mixed data types</li>
<li>Elements are accessed by ._1, ._2, etc</li>
<li>Important: Elements are not accessed by ._0!</li>
<li>Tuples are data workhorses. We'll see a lot of them!</li>
<li>In the example below: $(String, String, Int, Int)$ is a tuple definition</li>

In [2]:
// A NumPy array stores elements of a single type, so any element can be reached by a numeric index.
// A Scala tuple can mix types, so each element is a separate, statically typed field (_1, _2, ...) instead of an indexed slot.

In [5]:
val x = (1,22.3,"John")

x: (Int, Double, String) = (1,22.3,John)


In [4]:
// x._1=2

<console>: 24: error: reassignment to val

In [4]:
// x._0 //error: value _0 is not a member of (Int, Double, String)

In [5]:
x._1

res3: Int = 1


In [6]:
x._2

res4: Double = 22.3


In [7]:
val demographics = Array(("John","New York",350000,32),("Jill","Boston",432000,22)) 

demographics: Array[(String, String, Int, Int)] = Array((John,New York,350000,32), (Jill,Boston,432000,22))


In [8]:
demographics(1)._1

res5: String = Jill


In [9]:
println(demographics(0))
println("The first person is: " + demographics(0)._1 + " who is " +demographics(0)._4 +" years old")

(John,New York,350000,32)
The first person is: John who is 32 years old


In [6]:
// x.map(t=>println(t))

<console>: 26: error: value map is not a member of (Int, Double, String)
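A tuple has no map method, but its elements can still be reached in other ways. The sketch below shows two options that are part of the standard library: destructuring the tuple into named values, and `productIterator`, which walks the elements as `Any` because their types differ. The names `id`, `amount` and `name` are chosen here for illustration.

```scala
val x = (1, 22.3, "John")

// Destructure the tuple into three named values.
val (id, amount, name) = x

// productIterator yields each element typed as Any.
val asStrings = x.productIterator.map(_.toString).toList
println(asStrings) // prints List(1, 22.3, John)
```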

<span style="color:green;font-size:large">map the print function to each element of the Array</span>
<p>
    <li>Use the <span style="color:blue">map</span> function</li>
    <li>Or, as an alternative, the <span style="color:blue">foreach</span> function</li>

In [10]:
demographics.map(t=>println("The first person is: " + t._1 + " who is " + t._4 + " years old"))
// map returns a collection with the same number of elements as the original; here each element is () because println returns Unit

The first person is: John who is 32 years old
The first person is: Jill who is 22 years old


res7: Array[Unit] = Array((), ())


In [11]:
demographics.foreach(t=>println("The first person is: " + t._1 + " who is " + t._4 + " years old")) 
// foreach returns Unit; it is used for side effects such as printing

The first person is: John who is 32 years old
The first person is: Jill who is 22 years old


In [12]:
demographics.foreach(println)

(John,New York,350000,32)
(Jill,Boston,432000,22)


<span style="color:green;font-size:x-large">Calculating tax using the map function</span>

In [9]:
val income = Array(("John",212343),("Jack",179343),("Jill",231222),("Qing",500222),
                   ("Overbrook",23123),("Savitri",923111),
                   ("Barton",723000),("Olafur",290000))
// Flat tax: 20% of every income
val taxes = income.map(i => (i._1,i._2*0.2))
// Tiered tax: the rate is chosen by income level and applied to the whole income
val progressive_taxes = income.map(v => if (v._2 > 500000.0) (v._1,v._2*.5)
                                        else if (v._2 > 200000.0) (v._1,v._2*.2)
                                        else (v._1,v._2*.1))

income: Array[(String, Int)] = Array((John,212343), (Jack,179343), (Jill,231222), (Qing,500222), (Overbrook,23123), (Savitri,923111), (Barton,723000), (Olafur,290000))
taxes: Array[(String, Double)] = Array((John,42468.600000000006), (Jack,35868.6), (Jill,46244.4), (Qing,100044.40000000001), (Overbrook,4624.6), (Savitri,184622.2), (Barton,144600.0), (Olafur,58000.0))
progressive_taxes: Array[(String, Double)] = Array((John,42468.600000000006), (Jack,17934.3), (Jill,46244.4), (Qing,250111.0), (Overbrook,2312.3), (Savitri,461555.5), (Barton,361500.0), (Olafur,58000.0))
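The same tiered calculation can be written with a pattern match, which names the tuple fields instead of using ._1 and ._2. This is a sketch on a shortened copy of the income data; the helper `rate` is hypothetical and simply encodes the three rates used above.

```scala
val income = Array(("John", 212343), ("Jack", 179343), ("Jill", 231222))

// Hypothetical helper: picks the rate that is applied to the whole income.
def rate(amount: Int): Double =
  if (amount > 500000) 0.5
  else if (amount > 200000) 0.2
  else 0.1

// case (name, amount) unpacks each tuple into named parts.
val progressiveTaxes = income.map { case (name, amount) => (name, amount * rate(amount)) }
progressiveTaxes.foreach(println)
```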


<span style="color:red;font-size:40px">Filter</span>
<li>In a MapReduce operation, map sets up the data for the reduce operation</li>
<li>For example, by creating (word, 1) tuples in the word count example</li>
<li>The “1” serves as a placeholder for the count</li>
<li>MapReduce often uses an intermediate “filter” operation to reduce the size of the data BEFORE the shuffle operation</li>
<li>the shuffle operation is computationally expensive</li>
<li>So removing unnecessary data before shuffling helps reduce computational time</li>

<li><span style="color:red">filter</span>: applies a <span style="color:blue">predicate function</span> (one that returns true or false) to each element of a collection and keeps only the elements for which it returns true</li>
<li>Example: an Array of individuals who pay more than 50000 in taxes</li>


In [14]:
income.map(v => (v._1,v._2*0.2)).filter(t=>t._2>50000.0)

res10: Array[(String, Double)] = Array((Qing,100044.40000000001), (Savitri,184622.2), (Barton,144600.0), (Olafur,58000.0))
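A map followed by a filter can also be expressed in one pass with `collect`, which takes a partial function: the guard plays the role of the predicate and the right-hand side plays the role of the map. A minimal sketch on a shortened copy of the income data:

```scala
val income = Array(("John", 212343), ("Qing", 500222), ("Overbrook", 23123), ("Olafur", 290000))

// Keep only people whose 20% tax exceeds 50,000, and emit (name, tax) for them.
val highTaxes = income.collect {
  case (name, amount) if amount * 0.2 > 50000.0 => (name, amount * 0.2)
}
highTaxes.foreach(println)
```

Whether this is clearer than separate map and filter calls is a matter of taste; the separate calls mirror the MapReduce stages more directly.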


<span style="color:red;font-size:40px">Reduce</span>
<li>takes a function of two arguments and applies it to successive pairs of elements in a collection to produce a single output</li>
<li>reduce initializes the first argument (the accumulator) to the first value in the series</li>
<li>and then combines the accumulator with each successive element in the series using the function</li>
<li>Example: total taxes paid by individuals who pay more than $50,000 in taxes</li>

In [15]:
income
    .map(v => (v._1,v._2*0.2)) // create (name, tax) pairs: Array[(String,Double)]
    .filter(t=>t._2>50000.0) // keep only tuples with tax > 50,000: Array[(String,Double)]
    .map(t => t._2) // drop the name from each tuple: Array[Double]
    .reduce((a,b) => a+b) // sum the taxes: Double. a (the accumulator) and b have the same type

res11: Double = 487266.60000000003


In [11]:
Array(("Qing",100044.40000000001), ("Savitri",184622.2), ("Barton",144600.0), ("Olafur",58000.0))
    .reduce((a,b)=>("Total Tax",a._2+b._2))

res2: (String, Double) = (Total Tax,487266.60000000003)


In [17]:
Array[Int]().reduce((a,b)=>a+b) // fails: an empty collection has no first element to initialize the accumulator

java.lang.UnsupportedOperationException:  empty.reduceLeft
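When a collection may be empty, two standard alternatives avoid the exception: `fold`, which takes an explicit starting value, and `reduceOption`, which returns an Option instead of throwing. A short sketch:

```scala
val empty = Array[Int]()

// fold starts from the supplied value, so an empty input simply returns it.
val total = empty.fold(0)(_ + _)

// reduceOption returns None for an empty input and Some(result) otherwise.
val maybeTotal = empty.reduceOption(_ + _)
val someTotal = Array(1, 2, 3).reduceOption(_ + _)

println(total)      // prints 0
println(maybeTotal) // prints None
println(someTotal)  // prints Some(6)
```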

<span style="color:red;font-size:40px">Why do we need map, filter, reduce?</span><p>
<li>When dealing with large datasets</li>
<ul>
<li>map and filter can be applied <b>in parallel</b> to different segments of the data</li>
<li>Each map and filter function computes some set of values</li>
<li>The results of each computation can be collected by the reduce operation</li>
    </ul>
<li>Without these functions, analyzing big data in parallel would be much harder!</li>
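The idea can be sketched without a cluster by splitting the data into chunks and treating each chunk as if it lived on its own worker. The chunk size of 3 is an arbitrary assumption, and integer incomes are summed (rather than taxes) so that the result does not depend on floating-point rounding order.

```scala
val income = Array(("John",212343),("Jack",179343),("Jill",231222),("Qing",500222),
                   ("Overbrook",23123),("Savitri",923111),("Barton",723000),("Olafur",290000))

// Pretend each group of 3 records is held by a different worker (hypothetical partitioning).
val partitions = income.grouped(3).toList

// Each "worker" maps and filters its own partition independently and produces a partial sum.
val partialSums = partitions.map(part => part.map(_._2).filter(_ > 200000).sum)

// The reduce step combines the partial results into one value.
val total = partialSums.reduce(_ + _)

// Processing the whole array in one place gives the same answer.
val direct = income.map(_._2).filter(_ > 200000).sum
println(total == direct) // prints true
```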

<span style="color:green;font-size:40px">Example: Count the frequency of each 3-letter word in a document</span>

In [1]:
val text = """My heart aches and a drowsy numbness pains My sense as though of hemlock I had drunk
Or emptied some dull opiate to the drains One minute past and Lethe-wards had sunk
Tis not through envy of thy happy lot But being too happy in thine happiness
That thou light-winged Dryad of the trees
In some melodious plot
Of beechen green, and shadows numberless
Singest of summer in full-throated ease"""


text: String =
My heart aches and a drowsy numbness pains My sense as though of hemlock I had drunk
Or emptied some dull opiate to the drains One minute past and Lethe-wards had sunk
Tis not through envy of thy happy lot But being too happy in thine happiness
That thou light-winged Dryad of the trees
In some melodious plot
Of beechen green, and shadows numberless
Singest of summer in full-throated ease


<span style="color:green;font-size:x-large">Clean up the text and extract words</span>
<li><span style="color:blue">split</span> splits a string into an array around a separator (the separator is interpreted as a regular expression)</li>
<li><span style="color:blue">trim</span> removes leading and trailing whitespace, including \n characters (it does not remove whitespace inside the string)</li>
<li><span style="color:red">Note:</span> in Scala, you can omit the () when calling a method that takes no arguments, as with trim and toLowerCase below</li>

In [13]:
val x = "Jack,Jill,Hill"
x.split(",")

x: String = Jack,Jill,Hill
res3: Array[String] = Array(Jack, Jill, Hill)


In [14]:
val x = "Jack ,Jill,Hill" //Note the extra space after Jack
x.split(",")(0).trim


x: String = Jack ,Jill,Hill
res4: String = Jack


<span style="color:green;font-size:large">Split and trim the text</span><p>
<li>convert the text to lower case</li>
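<li>split it at spaces (line breaks are not spaces, so the last word of one line stays joined to the first word of the next)</li>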
<li>use map to apply trim to each word in the text</li>

In [15]:
val a = text.toLowerCase.split(" ")
val t = a.map(i=>i.trim)
//val t = a.map(_.trim)

a: Array[String] =
Array(my, heart, aches, and, a, drowsy, numbness, pains, my, sense, as, though, of, hemlock, i, had, drunk
or, emptied, some, dull, opiate, to, the, drains, one, minute, past, and, lethe-wards, had, sunk
tis, not, through, envy, of, thy, happy, lot, but, being, too, happy, in, thine, happiness
that, thou, light-winged, dryad, of, the, trees
in, some, melodious, plot
of, beechen, green,, and, shadows, numberless
singest, of, summer, in, full-throated, ease)
t: Array[String] =
Array(my, heart, aches, and, a, drowsy, numbness, pains, my, sense, as, though, of, hemlock, i, had, drunk
or, emptied, some, dull, opiate, to, the, drains, one, minute, past, and, lethe-wards, had, sunk
tis, not, through, envy, of, thy, happy, lot, but, being, too, happy, in, thine, happiness
tha...
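In the output above, some tokens still contain a line break, because `split(" ")` only splits at spaces and trim only strips whitespace at the ends of a string. One possible remedy, shown here on a made-up `sample` string rather than the poem, is to split on the regular expression `\\s+` (any run of whitespace). Applying it to the notebook's text would change the later counts (for example, `tis` would become a separate 3-letter word), so treat this as a sketch.

```scala
// Hypothetical two-line sample with a line break between "drunk" and "or".
val sample = "had drunk\nor emptied"

// Splitting on a single space leaves "drunk\nor" as one token.
val bySpace = sample.split(" ")

// Splitting on any run of whitespace separates all four words.
val byWhitespace = sample.split("\\s+")

println(bySpace.length)      // prints 3
println(byWhitespace.length) // prints 4
```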


<span style="color:green;font-size:large">Create (word,1) tuples</span>


In [16]:
val m = t.map(i=>(i,1))
//val m = t.map((_,1))

m: Array[(String, Int)] =
Array((my,1), (heart,1), (aches,1), (and,1), (a,1), (drowsy,1), (numbness,1), (pains,1), (my,1), (sense,1), (as,1), (though,1), (of,1), (hemlock,1), (i,1), (had,1), (drunk
or,1), (emptied,1), (some,1), (dull,1), (opiate,1), (to,1), (the,1), (drains,1), (one,1), (minute,1), (past,1), (and,1), (lethe-wards,1), (had,1), (sunk
tis,1), (not,1), (through,1), (envy,1), (of,1), (thy,1), (happy,1), (lot,1), (but,1), (being,1), (too,1), (happy,1), (in,1), (thine,1), (happiness
that,1), (thou,1), (light-winged,1), (dryad,1), (of,1), (the,1), (trees
in,1), (some,1), (melodious,1), (plot
of,1), (beechen,1), (green,,1), (and,1), (shadows,1), (numberless
singest,1), (of,1), (summer,1), (in,1), (full-throated,1), (ease,1))


<span style="color:green;font-size:large">Notice that we're creating actual objects for a, t and m</span>
<li>That's not necessary</li>
<li>We can use lazy val to defer the intermediate computations</li>
<li>And only force evaluation when necessary</li>

In [17]:
lazy val a = text.toLowerCase.split(" ")
lazy val t = a.map(w => w.trim)
lazy val m = t.map(w => (w,1))

a: Array[String] = <lazy>
t: Array[String] = <lazy>
m: Array[(String, Int)] = <lazy>
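To see that a lazy val really is computed only on first use, the sketch below sets a flag inside the initializer. The flag `evaluated` and the short sample string are assumptions made for this illustration.

```scala
// Flag flipped by the initializer so we can observe when it runs.
var evaluated = false

lazy val words = { evaluated = true; "my heart aches".split(" ") }

val before = evaluated // false: the initializer has not run yet
val n = words.length   // first access forces evaluation
val after = evaluated  // true: the initializer has now run

println(s"before=$before, n=$n, after=$after")
```

Once evaluated, the result is cached, so later accesses to `words` do not run the initializer again.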


In [18]:
m

res5: Array[(String, Int)] =
Array((my,1), (heart,1), (aches,1), (and,1), (a,1), (drowsy,1), (numbness,1), (pains,1), (my,1), (sense,1), (as,1), (though,1), (of,1), (hemlock,1), (i,1), (had,1), (drunk
or,1), (emptied,1), (some,1), (dull,1), (opiate,1), (to,1), (the,1), (drains,1), (one,1), (minute,1), (past,1), (and,1), (lethe-wards,1), (had,1), (sunk
tis,1), (not,1), (through,1), (envy,1), (of,1), (thy,1), (happy,1), (lot,1), (but,1), (being,1), (too,1), (happy,1), (in,1), (thine,1), (happiness
that,1), (thou,1), (light-winged,1), (dryad,1), (of,1), (the,1), (trees
in,1), (some,1), (melodious,1), (plot
of,1), (beechen,1), (green,,1), (and,1), (shadows,1), (numberless
singest,1), (of,1), (summer,1), (in,1), (full-throated,1), (ease,1))


<span style="color:green;font-size:large">Apply the filter</span>


In [19]:
m.filter(t => t._1.length() == 3) 

res6: Array[(String, Int)] = Array((and,1), (had,1), (the,1), (one,1), (and,1), (had,1), (not,1), (thy,1), (lot,1), (but,1), (too,1), (the,1), (and,1))


<li>When an anonymous function uses its single argument exactly once, Scala lets us write _ in place of the argument</li>
<li>We can simplify the anonymous function as follows</li>


In [20]:
lazy val f = m.filter(_._1.length() == 3)

f: Array[(String, Int)] = <lazy>


<span style="color:green;font-size:large">Apply groupBy to group instances of a word together</span>
<li>groupBy creates a Scala <span style="color:blue">Map object</span></li>
<li>Scala <span style="color:blue">Map objects</span> hold key -> value pairs</li>


In [21]:
f.groupBy(_._1)

res7: scala.collection.immutable.Map[String,Array[(String, Int)]] = Map(too -> Array((too,1)), thy -> Array((thy,1)), but -> Array((but,1)), had -> Array((had,1), (had,1)), not -> Array((not,1)), lot -> Array((lot,1)), and -> Array((and,1), (and,1), (and,1)), one -> Array((one,1)), the -> Array((the,1), (the,1)))


In [22]:
lazy val gp = f.groupBy(_._1)



gp: scala.collection.immutable.Map[String,Array[(String, Int)]] = <lazy>


In [23]:
gp("too")

res8: Array[(String, Int)] = Array((too,1))


<span style="color:green;font-size:large">reduce the grouped data</span>
<li>Add up the 1's in the value part of the map</li>
<li>And return (word, count) pairs</li>

In [46]:
//gp.map(w => (w._1,w._2.map(_=>1).reduce((a,b)=>a+b)))
gp.map(w => (w._1,w._2.map(_=>1)))
//.reduce(_+_)

res30: scala.collection.immutable.Map[String,Array[Int]] = Map(too -> Array(1), thy -> Array(1), but -> Array(1), had -> Array(1, 1), not -> Array(1), lot -> Array(1), and -> Array(1, 1, 1), one -> Array(1), the -> Array(1, 1))


In [None]:
val result = gp.map(w => (w._1,w._2.map(_=>1).reduce(_+_)))
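Because every tuple in a group already carries a 1 as its second element, an alternative is to sum those placeholders directly instead of mapping each tuple to 1 again. A sketch on a small hand-made array of (word, 1) pairs, not the notebook's data:

```scala
// Hypothetical (word, 1) pairs standing in for the filtered data.
val pairs = Array(("and", 1), ("had", 1), ("the", 1), ("and", 1), ("had", 1), ("and", 1))

val grouped = pairs.groupBy(_._1)

// For each word, add up the placeholder counts stored in the tuples.
val counts = grouped.map { case (word, ps) => (word, ps.map(_._2).sum) }
println(counts)
```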

<span style="color:green;font-size:large">Putting it all together in a function</span>

In [None]:
text
    .toLowerCase
    .split(" ")
    .map(w=>(w.trim,1))
    .filter(_._1.length() == 3)
    .groupBy(_._1)
    .map(w => (w._1,w._2.map(_=>1).reduce(_+_)))

In [10]:
def count_words(text: String) = 
    text
        .toLowerCase
        .split(" ")
        .map(w=>(w.trim,1))
        .filter(_._1.length() == 3)
        .groupBy(_._1)
        //.mapValues(w=>w.map(_=>1).reduce((a,b)=>a+b))
        .map(w => (w._1,w._2.map(_=>1).reduce(_+_)))

count_words: (text: String)scala.collection.immutable.Map[String,Int]


In [11]:
count_words(text)

res4: scala.collection.immutable.Map[String,Int] = Map(too -> 1, thy -> 1, but -> 1, had -> 2, not -> 1, lot -> 1, and -> 3, one -> 1, the -> 2)
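The function can be generalized so that the word length is a parameter. The variant below is hypothetical (`countWordsOfLength` is a name invented here); it also splits on any whitespace and filters before creating the tuples, so its results can differ from count_words on text that contains line breaks.

```scala
// Hypothetical variant of count_words: word length n is a parameter,
// and the text is split on any run of whitespace, including newlines.
def countWordsOfLength(text: String, n: Int): Map[String, Int] =
  text.toLowerCase
    .split("\\s+")
    .filter(_.length == n)  // filter early, before building tuples
    .map(w => (w, 1))
    .groupBy(_._1)
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

val sampleCounts = countWordsOfLength("The cat and the dog\nand the owl", 3)
println(sampleCounts)
```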


<span style="color:green;font-size:x-large">Comparison between scala and python word count</span>
<img src="scala vs python word count.png">

<span style="color:red;font-size:xx-large">Another example</span><p>
Given an array of tuples containing names of students and their scores on a test, return the average score for that test

In [None]:
//Return the average score on a test using map/reduce
val scores = Array(("Jack",95),("Jill",78),("Qing",99),("Olafur",87),("Ludovica",65))
val average = scores.map(t=>t._2).reduce(_+_).toDouble/scores.length
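The version above relies on scores.length, which is cheap for a local array. When the data is spread across workers, a common approach is to carry the count along with the sum, so that a single reduce produces both. A minimal sketch of that idea using the same scores:

```scala
val scores = Array(("Jack",95),("Jill",78),("Qing",99),("Olafur",87),("Ludovica",65))

// Map each record to (score, 1), then add sums and counts pairwise.
val (total, count) = scores
  .map(t => (t._2, 1))
  .reduce((a, b) => (a._1 + b._1, a._2 + b._2))

// Convert to Double before dividing to avoid integer division.
val average = total.toDouble / count
println(average) // prints 84.8
```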

In [None]:
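5/3 // integer division: evaluates to 1, which is why the sum above is converted with toDouble before dividing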