# 02 MapReduce
__Math 3280 - Data Mining__ : Snow College : Dr. Michael E. Olson

-----

In this chapter, we look at the computer requirements when dealing with big data.

When dealing with large computations, such as large-scale models, one computer will not be enough.
> As an undergraduate, I created a model of air pollution in North Salt Lake. It was a 24-hour model covering a 100-km^2 area. It took well over 1 hour on my computer to get the results. Imagine how much more time it would have taken as a 3-Dimensional model covering the entire planet... By the time my computer finished a forecast model, the event being forecasted would have happened weeks ago.

To handle large computations, we use __parallel processing__, where several processors are linked together and work on parts of the problem simultaneously, so the calculation completes in far less time.
* Each processor is called a __node__.
* The collection of nodes is called a __supercomputer__.
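
As a rough illustration of splitting one computation into parts, the sketch below sums squares over several ranges at once. It is only an analogy: the worker threads stand in for nodes, and the names `partial_sum` and `parallel_sum_of_squares` are made up for this example. In standard Python, threads do not speed up CPU-bound work, so the point here is the division of labor, not the timing.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(bounds):
    """Sum i*i over one slice of the full range (the work of one 'node')."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def parallel_sum_of_squares(n, workers=4):
    """Split range(n) into about `workers` slices and combine the partial results."""
    if n <= 0:
        return 0
    step = -(-n // workers)  # ceiling division so every index is covered
    parts = [(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, parts))


total = parallel_sum_of_squares(1000)
print(total)
```

Each slice is independent of the others, which is what lets the pieces run at the same time and be combined at the end.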

In data science, however, we are not only dealing with large computations, but with large amounts of data as well. 
* For example, large-scale Web services, such as Google or Amazon, are continually dealing with large amounts of data and customer interactions.

To handle this, we use not only the processors on each node, but the storage space as well. 
* These systems are known as __computing clusters__.
* The software to manage the data and queries is a __distributed file system__.

## Cluster Computing and the Distributed File System
Each node is installed in a __rack__. There are often 8-64 nodes in a rack. The nodes in a rack are connected by a local network, typically gigabit Ethernet.

Several racks are then connected by another level of network or a switch. Because this inter-rack network carries traffic between all of the racks, it needs more bandwidth than the network inside any single rack. We will learn about how these are used soon. First, let's look at the hardware challenges.

All hardware eventually fails. With heavy usage, it will fail faster. 
* In large-scale services, one node can last about 3 years (a little more than 1000 days)
* If I have a cluster of 1000 nodes, that means that on average, about 1 node will fail every day
* A company the size of Google may run on the order of a million nodes, which would mean about 1000 node failures every day
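
The arithmetic behind these estimates can be sketched in a few lines. This assumes each node lasts about 1000 days and that failures are spread evenly over that lifetime, which is a simplification. The function name `expected_failures_per_day` is made up for this example.

```python
def expected_failures_per_day(num_nodes, node_lifetime_days=1000.0):
    """Average failures per day, assuming failures are spread evenly over the lifetime."""
    return num_nodes / node_lifetime_days


for n in (1, 1_000, 1_000_000):
    rate = expected_failures_per_day(n)
    print(f"{n:>9,} nodes -> about {rate:,.3f} failures per day")
```

The failure rate grows in direct proportion to the number of nodes, so a failure that is rare on one machine becomes a daily event on a large cluster.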

With so many failures, we have to ensure that neither the data nor a running calculation is disrupted when a node fails. There are two requirements for this:
1. Files must be stored redundantly
2. Computations must be divided into smaller tasks
    * If one task fails, then only that one task needs to be restarted, not the entire program
    

### The Distributed File System Organization
A distributed file system (DFS) works by dividing the data file into separate pieces and copying them.
1. Files are divided into __chunks__, typically 64 MB
    * Size can be determined by the user
2. The chunks of a file are spread across different nodes
3. Each chunk is then replicated, perhaps 3 times, with each copy saved on a different node
    * Number of copies can be determined by the user
    * The nodes holding the copies should be on different racks so copies aren't all lost if a rack fails
4. A __master node__ (or name node) tracks the location of all chunks so retrieval is simplified
    * The master node is also replicated
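
The steps above can be sketched in Python. This is a toy model under stated assumptions, not how any real DFS is implemented: the rack and node names are invented, and the round-robin placement rule in `place_replicas` is just one simple way to put each copy on a different rack.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical chunk size
REPLICAS = 3                   # number of copies of each chunk


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return the (start, end) byte range of each chunk of the file."""
    return [(start, min(start + chunk_size, file_size))
            for start in range(0, file_size, chunk_size)]


def place_replicas(num_chunks, racks, replicas=REPLICAS):
    """Build the master node's table: chunk index -> nodes holding a copy.

    Assumed rule: copy r of chunk c goes to rack (c + r) mod R, so the
    copies of one chunk always land on different racks.
    """
    rack_names = sorted(racks)
    if replicas > len(rack_names):
        raise ValueError("need at least as many racks as replicas")
    table = {}
    for chunk in range(num_chunks):
        nodes = []
        for r in range(replicas):
            members = racks[rack_names[(chunk + r) % len(rack_names)]]
            nodes.append(members[chunk % len(members)])
        table[chunk] = nodes
    return table


# Hypothetical cluster: four racks with one or two nodes each.
racks = {"rack_a": ["a1", "a2"], "rack_b": ["b1", "b2"],
         "rack_c": ["c1", "c2"], "rack_d": ["d1"]}
chunks = split_into_chunks(200 * 1024 * 1024)  # a 200 MB file
table = place_replicas(len(chunks), racks)
for chunk, nodes in table.items():
    print(chunk, chunks[chunk], nodes)
```

The `table` dictionary plays the role of the master node: given a chunk index, it says which nodes to ask for a copy.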

A DFS is often used when,
* individual files are large (terabytes), and
* files are rarely updated

There is no need to distribute files that are small. And if a file is updated frequently, keeping every chunk and every copy consistent becomes very complicated. So, this may be a good system for data on a global scale, but it wouldn't work well for data such as Amazon's inventory and prices, which change daily.

## MapReduce
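__MapReduce__ is a style of computing that runs on top of a DFS: the DFS stores the chunks, and MapReduce schedules the computation on the nodes that hold them. Several systems provide the storage layer, the computing layer, or both:
* (GFS) Google File System - The original DFS, used with Google's own MapReduce implementation
* (HDFS) Hadoop DFS - Open-source, distributed by the Apache Software Foundation together with Hadoop MapReduce
* Spark - An open-source computing engine (also from the Apache Software Foundation) that generalizes MapReduce and can read from HDFS
* Colossus - An improved version of GFS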

The MapReduce process only involves two functions: *Map* and *Reduce*. The process is as follows. We'll follow the process with an example of counting the number of words.
1. Each *Map task* is given one or more chunks from the DFS and turns them into key-value pairs
    * Each map task looks for the words $w_1$, $w_2$, etc.
    * The key-value pair would be ($w_1$, 1), ($w_2$, 1), etc.
    * The result is a list of all key-value pairs ($w$, 1) for all documents
2. A __master controller__ sorts these key-value pairs, groups them by key, and assigns each key to a *Reduce task*
    * All word pairs are sorted as ($w_1$, 1), ($w_1$, 1), ($w_1$, 1), ... , ($w_1$, 1), ($w_2$, 1), ($w_2$, 1), ... ($w_2$, 1), ($w_3$, 1), ...
    * All pairs with $w_1$ are given to one reduce task, $w_2$ to another, etc.
      * Input to ReduceTask1: ($w_1$, [1,1,1,1])
      * Input to ReduceTask2: ($w_2$, [1,1,1,1,1])
      * Input to ReduceTask3: ($w_3$, [1,1]), ($w_4$, [1,1,1])
      * ...
3. *Reduce tasks* work with one key at a time, combining the values associated with that key in some way
    * For word counting, add all the values together
      * Output from ReduceTask1: ($w_1$, 4)
      * Output from ReduceTask2: ($w_2$, 5)
      * Output from ReduceTask3: ($w_3$, 2), ($w_4$, 3)
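
The word-count example can be sketched on a single machine in plain Python. This is only an illustration of the three steps, not a distributed implementation: the function names `map_task`, `group_by_key`, and `reduce_task` and the sample chunks are made up for this example, and "words" are simply whitespace-separated, lowercased tokens.

```python
from collections import defaultdict


def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    return [(word, 1) for word in chunk.lower().split()]


def group_by_key(pairs):
    """Master controller: sort the pairs and collect the values for each key."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: combine the values for one key, here by adding them."""
    return (key, sum(values))


chunks = ["the cat sat", "the dog sat", "the end"]  # stand-ins for DFS chunks

pairs = [pair for chunk in chunks for pair in map_task(chunk)]
grouped = group_by_key(pairs)
counts = dict(reduce_task(key, values) for key, values in grouped.items())

print(grouped["the"])  # [1, 1, 1]
print(counts)          # {'cat': 1, 'dog': 1, 'end': 1, 'sat': 2, 'the': 3}
```

In a real cluster, each call to `map_task` would run on the node holding that chunk, and each key's `reduce_task` could run on a different node. Only the grouping step needs to see pairs from every map task.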
      
What happens if a node fails in the middle?
* Best case scenario: only a single map task or reduce task needs to be restarted
* Worst case scenario: the node on which the Master is executing fails, and the entire MapReduce job needs to be restarted.
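
The best case can be sketched as a retry loop: when one task fails, only that task is run again. This is a simplified illustration. The helper `run_with_restart`, the task names, and the simulated failure are all made up for this example, and a real system would restart the task on a different node.

```python
def run_with_restart(tasks, run_task, max_attempts=3):
    """Run every task; if one raises RuntimeError, rerun only that task."""
    results, attempts = {}, {}
    for task_id, payload in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = run_task(task_id, payload)
                break  # this task is done; the other tasks are untouched
            except RuntimeError:
                continue  # simulated node failure: try this task again
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, attempts


already_failed = set()


def flaky_word_count(task_id, text):
    """Count words, but fail the first time task 'map-1' is run."""
    if task_id == "map-1" and task_id not in already_failed:
        already_failed.add(task_id)
        raise RuntimeError("simulated node failure")
    return len(text.split())


tasks = {"map-0": "a b c", "map-1": "d e", "map-2": "f"}
results, attempts = run_with_restart(tasks, flaky_word_count)
print(results)   # {'map-0': 3, 'map-1': 2, 'map-2': 1}
print(attempts)  # {'map-0': 1, 'map-1': 2, 'map-2': 1}
```

Only `map-1` needed a second attempt. This is why the computation is divided into small tasks in the first place.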

## Algorithms using MapReduce
* Matrix-Vector Multiplication


### Matrix-Vector Multiplication with small vectors

### Matrix-Vector Multiplication with large vectors

### Relational Algebra

## Homework
1. Exercise 2.2.1 (a,b)