# Processing Collections using Map Reduce APIs

As part of this class, we have covered the following topics:

* Operations on Set
* Understanding reduce
* Aggregate functions such as sum, min, and max
* Revisiting groupBy
* Sorting data using sorted and sortBy

### myReduce using loops

In [2]:
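def myReduce(c: List[Int], agg: (Int, Int) => Int) = {
  // Like the built-in reduce, this is not defined for an empty list
  require(c.nonEmpty, "myReduce requires a non-empty list")
  // Seed the accumulator with the first element
  var total = c.head
  // Combine each remaining element with the running result
  for (i <- c.tail) {
    total = agg(total, i)
  }
  total
}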

myReduce: (c: List[Int], agg: (Int, Int) => Int)Int
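
As a quick check, myReduce can be called with different aggregation functions and compared against the built-in reduce, sum, min and max. This is only a sketch: the list `nums` and the result names are made up for illustration, and the function is repeated so that the cell runs on its own.

```scala
// Same loop-based reduce as above, repeated so this cell runs on its own
def myReduce(c: List[Int], agg: (Int, Int) => Int): Int = {
  var total = c.head
  for (i <- c.tail) {
    total = agg(total, i)
  }
  total
}

// Hypothetical sample data, not taken from retail_db
val nums = List(4, 1, 7, 3)

// The aggregation function decides what the reduce computes
val sumOfNums = myReduce(nums, (a, b) => a + b)
val smallest = myReduce(nums, (a, b) => if (a < b) a else b)
val largest = myReduce(nums, (a, b) => if (a > b) a else b)

// The built-in APIs should agree with the loop version
val sameAsBuiltIn =
  sumOfNums == nums.reduce(_ + _) && sumOfNums == nums.sum &&
  smallest == nums.min && largest == nums.max
```

Only the function passed as agg changes between the three calls; the loop itself stays the same.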


### Sorting Data using sorted

* sorted sorts the data in the natural order of the elements in the collection
* The element type must have an implicit Ordering available (Int and String have one by default)

In [3]:
val l = List(1, 2, 5, 6, 2, 3, 1)
l.sorted

l = List(1, 2, 5, 6, 2, 3, 1)


List(1, 1, 2, 2, 3, 5, 6)
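
The role of the implicit Ordering becomes clearer when one is passed explicitly or defined for a custom type. The following is a minimal sketch; the case class `Item` and its fields are hypothetical and are not part of the retail_db data.

```scala
val l = List(1, 2, 5, 6, 2, 3, 1)

// Int has an implicit Ordering, so sorted needs no arguments
val ascending = l.sorted

// Passing a reversed Ordering explicitly sorts in descending order
val descending = l.sorted(Ordering[Int].reverse)

// A case class has no Ordering by default (hypothetical type for illustration)
case class Item(id: Int, name: String)

// Defining an implicit Ordering makes sorted work on List[Item]
implicit val itemOrdering: Ordering[Item] = Ordering.by((i: Item) => i.id)

val items = List(Item(3, "c"), Item(1, "a"), Item(2, "b"))
val itemsById = items.sorted
```

Without the implicit itemOrdering in scope, the call to items.sorted would not compile.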

### Sorting Data using sortBy

Problem Statement: Sort the orders data by order customer id (the 3rd field in each record)

In [5]:
val orders = scala.io.Source.
  fromFile("/data/retail_db/orders/part-00000").getLines().toList
// sortBy sorts the records by the key extracted from each one,
// here the customer id (3rd field) converted to Int
orders.
  sortBy(k => k.split(",")(2).toInt)

orders = List(1,2013-07-25 00:00:00.0,11599,CLOSED, 2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT, 3,2013-07-25 00:00:00.0,12111,COMPLETE, 4,2013-07-25 00:00:00.0,8827,CLOSED, 5,2013-07-25 00:00:00.0,11318,COMPLETE, 6,2013-07-25 00:00:00.0,7130,COMPLETE, 7,2013-07-25 00:00:00.0,4530,COMPLETE, 8,2013-07-25 00:00:00.0,2911,PROCESSING, 9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT, 10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT, 11,2013-07-25 00:00:00.0,918,PAYMENT_REVIEW, 12,2013-07-25 00:00:00.0,1837,CLOSED, 13,2013-07-25 00:00:00.0,9149,PENDING_PAYMENT, 14,2013-07-25 00:00:00.0,9842,PROCESSING, 15,2013-07-25 00:00:00.0,2568,COMPLETE, 16,2013-07-25 00:00:00.0,7276,PENDING_PAYMENT, 17,2013-07-25 00:00:00.0,2667,COMPLETE, 18,2013-07-25 00:00:00.0,1205,CLOSED, 19,2013-07-25 00:00:...


List(1,2013-07-25 00:00:00.0,11599,CLOSED, 2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT, 3,2013-07-25 00:00:00.0,12111,COMPLETE, 4,2013-07-25 00:00:00.0,8827,CLOSED, 5,2013-07-25 00:00:00.0,11318,COMPLETE, 6,2013-07-25 00:00:00.0,7130,COMPLETE, 7,2013-07-25 00:00:00.0,4530,COMPLETE, 8,2013-07-25 00:00:00.0,2911,PROCESSING, 9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT, 10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT, 11,2013-07-25 00:00:00.0,918,PAYMENT_REVIEW, 12,2013-07-25 00:00:00.0,1837,CLOSED, 13,2013-07-25 00:00:00.0,9149,PENDING_PAYMENT, 14,2013-07-25 00:00:00.0,9842,PROCESSING, 15,2013-07-25 00:00:00.0,2568,COMPLETE, 16,2013-07-25 00:00:00.0,7276,PENDING_PAYMENT, 17,2013-07-25 00:00:00.0,2667,COMPLETE, 18,2013-07-25 00:00:00.0,1205,CLOSED, 19,2013-07-25 00:00:...
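
To see the effect of sortBy without reading the file, the sketch below uses the first four records from the orders output above. It also shows two common variations, which are assumptions about what is useful rather than part of the original notebook: negating a numeric key for descending order, and returning a tuple to sort on more than one field.

```scala
// First four records copied from the orders output above
val sampleOrders = List(
  "1,2013-07-25 00:00:00.0,11599,CLOSED",
  "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
  "3,2013-07-25 00:00:00.0,12111,COMPLETE",
  "4,2013-07-25 00:00:00.0,8827,CLOSED"
)

// Ascending by customer id (3rd field), compared as Int rather than String
val byCustomer = sampleOrders.sortBy(o => o.split(",")(2).toInt)

// Descending by customer id: negate the numeric key
val byCustomerDesc = sampleOrders.sortBy(o => -o.split(",")(2).toInt)

// Composite key: status ascending, then customer id descending
val byStatusThenCustomer = sampleOrders.sortBy { o =>
  val fields = o.split(",")
  (fields(3), -fields(2).toInt)
}

// Order ids in each sorted result, for easy inspection
val idsByCustomer = byCustomer.map(_.split(",")(0))
val idsByCustomerDesc = byCustomerDesc.map(_.split(",")(0))
val idsByStatusThenCustomer = byStatusThenCustomer.map(_.split(",")(0))
```

Converting the key with toInt matters: sorted as strings, "11599" would come before "256".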

### Exercises

* Sort data by product price in descending order
    * Location: /data/retail_db/products/part-00000
    * Price is the 5th element in each record
    * Filter out the record with product_id 685
* Sort data by product category id in ascending order
    * Location: /data/retail_db/products/part-00000
    * Category id is the 2nd element in each record
* Sort data in ascending order by category id and in descending order by product price
    * Location: /data/retail_db/products/part-00000
    * Category id is the 2nd element and product price is the 5th element
    * Filter out the record with product_id 685
* Compute order revenue for each order id and sort data in descending order by order revenue
    * Location: /data/retail_db/order_items/part-00000
    * Order id is the 2nd element and order item subtotal is the 5th element
    * First compute revenue for each order id and then sort the data in descending order by revenue
    * Output should have only order_id and computed revenue
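
One possible shape for the last exercise is sketched below, combining groupBy, sum and sortBy. It is not the only way to solve it. The records in `sampleItems` are made up for illustration and only follow the field positions described above (order id 2nd, subtotal 5th); against the real file, the list would instead be read from /data/retail_db/order_items/part-00000.

```scala
// Hypothetical order_items records: order id is the 2nd field, subtotal the 5th
val sampleItems = List(
  "1,1,957,1,300.0,300.0",
  "2,2,1073,1,200.0,200.0",
  "3,2,502,5,250.0,50.0",
  "4,4,403,1,130.0,130.0",
  "5,4,897,2,50.0,25.0"
)

// Extract (order_id, subtotal) from each record
val pairs = sampleItems.map { rec =>
  val f = rec.split(",")
  (f(1).toInt, f(4).toDouble)
}

// Group the pairs by order id and sum the subtotals to get revenue per order
val revenuePerOrder = pairs
  .groupBy(_._1)
  .map { case (orderId, items) => (orderId, items.map(_._2).sum) }
  .toList

// Sort in descending order by revenue by negating the key
val sortedByRevenue = revenuePerOrder.sortBy { case (_, revenue) => -revenue }
```

On Scala 2.13 or later the compiler may warn about the implicit ordering for Double; passing Ordering.Double.TotalOrdering explicitly to sortBy is one way to avoid the warning.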