> **Jupyter slideshow:** This notebook can be displayed as slides. To view it as a slideshow in your browser, type the following in the console:


> `> jupyter nbconvert [this_notebook.ipynb] --to slides --post serve`


> To toggle off the slideshow cell formatting, click the `CellToolbar` button, then `View --> Cell Toolbar --> None`.

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Intro to Big Data

_Authors: Dave Yerrington (SF)_

---

![](https://snag.gy/SZOEv2.jpg)

### Student Pre-Work
*Before this lesson, you should already be able to:*
- Run Python scripts from the UNIX shell.
- Recall how the `cat` and `sort` UNIX commands work.
- Download the VM [here](https://www.dropbox.com/s/egzz6129w90okzf/GA%20DSI%20bigdata%200.9.ova?dl=0).

### Learning Objectives
*By the end of this lesson, you will be able to:*
- Recognize big data problems.
- Explain how the MapReduce algorithm works.
- Understand the difference between high performance computing and cloud computing.
- Describe the divide and conquer strategy.
- Perform a MapReduce on a single node using Python.

### Lesson Guide
- [Introduction](#intro)
- [What is Big Data?](#big-data)
- [High Performance Computing (HPC)](#hpc)
- [Cloud Computing](#cloud)
- [Parallelism](#parallelism)
- [Divide and Conquer](#dc)
- [MapReduce](#mapreduce)
- [MapReduce: Key-Value Pairs](#kv-pairs)
- [Guided Practice: Word Count on Paper](#guided-practice)
    - [Simple MapReduce](#simple)
- [Combiners](#combiners)
- [MapReduce in Python](#python)
    - [`mapper.py`](#mapper)
    - [`reducer.py`](#reducer)
    - [Running the Code in Terminal](#terminal)
- [Independent Practice](#ind-practice)
- [Conclusion](#conclusion)
- [Additional Resources](#resources)

<a name="intro"></a>
## Introduction
---

This lesson identifies some major trends in the field of big data and data infrastructure, including common tools and problems you may encounter working as a data scientist. 

It's time to take the tools you've learned to a new level by increasing the size of the data sets you can tackle.


<img src="https://snag.gy/mDzP4d.jpg" style="height: 300px">


## What Do You Think Big Data Is?

> **Big data is a hot topic. It refers to techniques and tools that allow us to store, process, and analyze large scale (multi-terabyte) data sets.**

## Can You Think of Any Data Sets That Would Be Considered Big Data?

- Facebook social graphs.
- Netflix movie preferences.
- Large recommender systems.
- The activity of visitors to a website.
- Customer activity in a retail store (e.g., Target).

## What Challenges Exist With Such Large Amounts of Data?

- Processing time.
- Cost.
- Architecture maintenance and set up.
- Difficulty in visualization.

<a name="big-data"></a>
## What is Big Data?
---

Big data is a term used for data that exceed the processing capacity of typical databases. We need a big data analytics team when data sets are large and growing quickly and we want to uncover hidden patterns, find unknown correlations, and build models.

**There are three main features in big data (the three Vs):**
- **Volume**: Large amounts of data.
- **Variety**: Different types of structured, unstructured, and multi-structured data.
- **Velocity**: Data that arrive quickly and need to be analyzed quickly.

**David Yerrington's Fourth V (An Unofficial Big Data Tenet):**
- **Value**: It's important to assess the business value of predictions, and understanding the underpinnings of cost versus benefit is even more essential in the context of big data. It's easy to misunderstand the three Vs without looking at the bigger picture and connecting the value of the business cases involved.

![3v](./assets/images/3vbigdata.png)

<a id='hpc'></a>
## High Performance Computing (HPC)
---

Supercomputers, or HPCs, are very expensive, powerful calculators used by researchers to solve complicated math problems.

![supercomputer](./assets/images/supercomputer.png)


## Can You Think of Advantages and Disadvantages of HPC Configurations?

**Advantages:**
- Can perform very complex calculations.
- Centrally controlled.
- Useful for research and complicated math problems.

**Disadvantages:**
- Expensive.
- Difficult to maintain (self-managed or managed hosting both incur operations overhead).
- Scalability is bounded (before big data, this would be medium data).

<a id='cloud'></a>
## Cloud Computing
---

Instead of using one huge machine, what if we bought a bunch of commodity machines?

> *Note: Commodity hardware is a term used in operations to describe mixed-server hardware, but it can also refer to the basic machines you would use in an office.*

![Commodity hardware](https://snag.gy/fNYgt0.jpg)<center>*Actual AWS Datacenter*</center>

**Can you think of advantages and disadvantages of this configuration?**


**Advantages:**
- Relatively cheaper.
- Easier to maintain (as a user of the cloud system).
- Scalability is unbounded (just add more nodes to the cluster).
- A variety of turnkey solutions are available through cloud providers.

**Disadvantages:**
- Complex infrastructure. 
- Subject matter expertise required to leverage lower-level resources within the infrastructure.
- Mainly tailored for parallelizable problems.
- Relatively small CPU power at the lowest level.
- More input/output between machines.

The term big data refers to the cloud computing case in which commodity hardware with unlimited scalability is used to solve highly parallelizable problems.

## How Do You Think Many Computers Process Data?

**How does this contrast with how you perform analysis on your laptop?**

<a id='parallelism'></a>
## Parallelism
---

The conceptual foundation of big data processing is the idea that a problem can be computed by multiple machines in pieces simultaneously. Many resources are being used in parallel with each other.

![](https://snag.gy/MknIN6.jpg)

- Running multiple instances to process data.
- Data can be subset and solved iteratively.
- Sub-solutions can be solved independently.
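
As a minimal sketch of this idea, the snippet below splits a list of numbers into chunks and hands each chunk to a separate worker. It uses threads from Python's standard `concurrent.futures` module as a stand-in for separate machines; the helper names (`chunk`, `partial_sum`) and the chunk size are illustrative assumptions, not part of this lesson's code.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def partial_sum(numbers):
    """Solve one sub-problem: add up a single chunk."""
    return sum(numbers)


data = list(range(1, 101))      # the numbers 1..100
chunks = chunk(data, 25)        # four independent sub-problems

# Each chunk is handed to a worker; no worker needs another worker's result.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(partial_sum, chunks))

# Combine the sub-solutions into the final answer.
total = sum(partial_results)
print(partial_results)  # [325, 950, 1575, 2200]
print(total)            # 5050
```

The key property is that `partial_sum` never looks outside its own chunk, which is what allows the pieces to be processed at the same time.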

<a id='dc'></a>
## Divide and Conquer
---

<img src="https://snag.gy/xh2mJA.jpg">

The divide and conquer strategy is a fundamental algorithmic technique for solving a task. Its steps are:

1) Split the task into subtasks.
2) Solve these subtasks independently.
3) Recombine the subtask results into a final result.

For a problem to be suitable for the divide and conquer approach, you must be able to break it into smaller independent subtasks. Many processes are suitable for this strategy, but there are also plenty that aren't.
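
One classic illustration of these three steps is merge sort, chosen here only as an example (it is not part of this lesson's code). The sketch below splits a list in half, sorts each half independently, and recombines the sorted halves.

```python
def merge(left, right):
    """Recombine two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One list is exhausted; append whatever remains of the other.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Sort a list using the three divide and conquer steps."""
    if len(items) <= 1:
        return list(items)              # 0 or 1 items: already sorted
    middle = len(items) // 2
    left = merge_sort(items[:middle])   # 1) split, 2) solve independently
    right = merge_sort(items[middle:])
    return merge(left, right)           # 3) recombine


print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```

Sorting the left half never depends on the right half, so the two recursive calls are independent subtasks in the sense described above.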

<a id='mapreduce'></a>
## MapReduce

---

<img src="https://snag.gy/XBgCOs.jpg">

**MapReduce** is a two-phase divide and conquer algorithm initially invented and publicized by Google in 2004. It involves splitting a problem into subtasks and processing these subtasks in parallel. MapReduce has two phases:

1) The **mapper** phase.
2) The **reducer** phase.

In the **mapper phase**, data are split into chunks and the same computation is performed on each, while in the **reducer phase**, data are aggregated back to produce a final result.

MapReduce uses a functional programming paradigm. The data processing primitives are mappers and reducers.

- **Mappers**: Filter and transform data.
- **Reducers**: Aggregate results.

The functional paradigm is good for describing how to solve a problem but not for describing data manipulations (e.g., relational joins).
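
Python's built-in `map` and `filter`, together with `functools.reduce`, mirror these primitives on a single machine. The sketch below is only an analogy under that assumption: it runs in one process, and the sample words are made up for illustration.

```python
from functools import reduce

words = ["big", "data", "is", "big"]

# Map: apply the same transformation to every item independently.
lengths = list(map(len, words))

# Filter: a mapper can also drop items it does not care about.
long_words = list(filter(lambda w: len(w) > 2, words))

# Reduce: fold the mapped values into a single aggregate.
total_length = reduce(lambda a, b: a + b, lengths, 0)

print(lengths)       # [3, 4, 2, 3]
print(long_words)    # ['big', 'data', 'big']
print(total_length)  # 12
```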

<a id='kv-pairs'></a>
## MapReduce: Key-Value Pairs

---

<img src="https://snag.gy/k2FCar.jpg">

Data are passed through the various phases of a **MapReduce pipeline** as key-value pairs.

**What Python data structures could be used to implement a key-value pair?**



- A **dictionary**.
- A **tuple** of two elements.
- A **list** of two elements.
- A named **tuple**.

To understand MapReduce, you need to always keep in mind that data are flowing through a pipeline as key-value pairs.
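
The short sketch below shows how some of the structures listed above might hold a key-value pair. The variable names and sample values are hypothetical and only serve as illustration.

```python
from collections import namedtuple

# A plain tuple of two elements: (key, value).
pair_as_tuple = ("mapreduce", 1)

# A named tuple gives the two positions readable names.
Pair = namedtuple("Pair", ["key", "value"])
pair_as_namedtuple = Pair(key="mapreduce", value=1)

# A list of pairs can hold the same key more than once,
# which is what a mapper's output looks like.
mapper_output = [("is", 1), ("a", 1), ("is", 1)]

# A dictionary holds exactly one value per key,
# which suits the final, aggregated counts.
final_counts = {"is": 2, "a": 1}

key, value = pair_as_tuple  # tuples unpack directly
print(key, value)
print(pair_as_namedtuple.key, pair_as_namedtuple.value)
```

A design point worth noting: a dictionary cannot hold the same key twice, so it fits aggregated results better than raw mapper output.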


<a name="guided-practice"></a>
## Guided Practice: Word Count on Paper
---

Let's perform a simple MapReduce in class. Our task is to find the 10 most common words in the paragraph below.

    1:  MapReduce is a programming model for large-scale distributed data processing.
    3:  It is inspired by the map function and the reduce function of the functional
    4:  programming languages such as Lisp, Haskell, or Python. One of the most
    5:  important features of MapReduce is that it allows us to hide the low-level
    6:  implementation such as message passing or synchronization from users and
    7:  allows to split a problem into many partitions. This is a great way to make
    8:  trivial parallelization of data processing without any need for
    9:  communication between the partitions.
    10: MapReduce became mainstream because of Apache Hadoop, which is an open-
    11: source framework that was derived from Google's MapReduce paper.
    12: MapReduce allows us to process massive amounts of data in a distributed
    13: cluster. In fact, there are many implementations of the MapReduce
    14: programming model. Some of them are shown in the following list. It is
    15: important to say that MapReduce is not an algorithm; it is just a part
    16: of a high performance infrastructure that provides a lightweight
    17: way to run a program in a lot of parallel machines.
    18:                From: Practical Data Analysis, Hector Cuesta, 2013


### Simple MapReduce

**Instructions:**
- Students will perform the mapper function.
- Instructor will perform the reducer function.

Each student will be assigned one line of text. You'll have to produce a list of key-value pairs `(word, 1)` to provide to the instructor. 

**Check:** What preprocessing should you perform on your tokens in order to improve the results?

Example: The first line will produce this list:

    (MapReduce, 1)
    (is, 1)
    (a, 1)
    (programming, 1)
    (model, 1)
    (for, 1)
    (large-scale, 1)
    (distributed, 1)
    (data, 1)
    (processing, 1)

Your instructor will then sort the key-value pairs, add up the `1`s for each word, and produce the counts.

**Check:** What additional operation did the instructor perform in order to complete the aggregation?


> ***Answer**: Ignore punctuation and convert all words to lowercase.*

---

> *Instructor Notes:*
*1) If there are more than 18 students, group the students to obtain 18 groups.*
*2) If there are fewer than 18 students, give each student more than one line so that all of the lines are processed.*
*3) Make sure that students hand in a list of key-value pairs where the key is the word and the value is `1`.*
*4) There's no need to actually perform the count. Here are the expected results:*
>
>        ('of', 10)
>        ('a', 9)
>        ('is', 8)
>        ('the', 8)
>        ('mapreduce', 7)
>        ('to', 6)
>        ('that', 4)
>        ('it', 4)
>        ('in', 4)
>        ('data', 4)

---

> *The instructor had to shuffle the key-value pairs handed in by the students in order to find the common key and add up the corresponding values.*
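
The paper exercise can be sketched in a few lines of Python. This is an illustrative sketch rather than the lesson's reference code: the function names (`mapper`, `shuffle`, `reducer`) are assumptions, only the first two lines of the paragraph are used, and the preprocessing is limited to lowercasing and stripping surrounding punctuation.

```python
import string
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair per token, lowercased and without
    leading or trailing punctuation."""
    pairs = []
    for token in line.split():
        word = token.strip(string.punctuation).lower()
        if word:
            pairs.append((word, 1))
    return pairs


def shuffle(pairs):
    """Group the values that share a key."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reducer(groups):
    """Add up the 1s collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


lines = [
    "MapReduce is a programming model for large-scale distributed data processing.",
    "It is inspired by the map function and the reduce function of the functional",
]

# Each "student" maps one line; all pairs then go to the "instructor".
all_pairs = [pair for line in lines for pair in mapper(line)]
counts = reducer(shuffle(all_pairs))

print(counts["is"], counts["the"], counts["function"])  # 2 3 2
```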


<a id='combiners'></a>
## Combiners
---

Combiners are intermediate reducers that are performed at the node level in a multi-node architecture.

![](https://snag.gy/lFYfoC.jpg)

When data are really large, we can distribute them to several mappers running on different machines. Sending a long list of `(word, 1)` pairs to the reducer node isn’t efficient. Instead, we can first aggregate at the mapper node level and send only the partial result to the reducer. This is possible because the aggregation used here, a sum, is associative and commutative: partial sums can be added in any grouping and order without changing the total.
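
A small sketch of local aggregation, assuming three hypothetical mapper nodes with made-up output, shows that combining first gives the same totals as reducing every raw pair in one place. The `combine` helper is an illustrative name, not part of any framework.

```python
from collections import Counter

# Three hypothetical mapper nodes, each holding its own (word, 1) pairs.
node_outputs = [
    [("data", 1), ("is", 1), ("data", 1)],
    [("is", 1), ("big", 1)],
    [("data", 1), ("big", 1), ("big", 1)],
]


def combine(pairs):
    """Local aggregation: collapse (word, count) pairs into counts."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return local


# Each node sends one small Counter instead of a long list of pairs.
combined = [combine(pairs) for pairs in node_outputs]

# The reducer adds up the partial counts from every node.
final = Counter()
for partial in combined:
    final += partial

# Reducing every raw pair in one place gives the same totals.
direct = combine([pair for pairs in node_outputs for pair in pairs])
print(final == direct)  # True
```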

**Let's repeat the previous exercise with a small change:**
1) Divide the class into three groups. In each group, one student will be the combiner and the others will be the mappers.
2) Let's split the text into three parts and give each group one part.
3) Mapper students produce a list of `(word, 1)` pairs for each line they receive and hand the list to the combiner.
4) Combiner students sort the lists and sum the counts for words that appear in each list.
5) Finally, combiner students hand their list of counts to the instructor, who will combine the intermediate sums and produce the final result.

**Check:** What changed?

Congratulations! You just performed a MapReduce sum.

**Check:** Can you think of other aggregation tasks that can be parallelized in this way?


> *Less message passing to the instructor.*

---

> *Answer:*
*- Count, sum, and average.*
*- Grep, sort, and inverted index.*
*- Graph traversals and some ML algorithms.*


<a name="python"></a>
## MapReduce in Python
---

Now that we've performed a MapReduce by hand, let's try it in Python. Below, you’ll find the code for a simple mapper and reducer that calculate word count.

Let's look at them in detail.

<a id='mapper'></a>
### `mapper.py`


```python
# mapper.py:
import sys

# Read text from standard input, one line at a time.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    # Emit a tab-separated "word<TAB>1" pair for every word.
    for word in words:
        print('%s\t%s' % (word, 1))
```

**Check:** What kind of input does `mapper.py` expect?

**Check:** What kind of output does `mapper.py` produce?

<a id='reducer'></a>
### `reducer.py`

```python
# reducer.py:
import sys

current_word = None
current_count = 0

# Input comes from STDIN as tab-separated "word<TAB>count" lines.
for line in sys.stdin:
    line = line.strip()

    # Skip lines that are not a word/count pair or whose count is not a number.
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        continue

    # This IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it's passed to the reducer.
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Don't forget to output the last word if necessary.
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
```

**Check:** What kind of input does `reducer.py` expect?

**Check:** What kind of output does `reducer.py` produce?

<a id='terminal'></a>
### Running the Code in Terminal

**You can find `mapper.py`, `reducer.py`, and some text input files in the `code` directory.**

This code can be run using the following command from your terminal:

```bash
cat <input-file> | python mapper.py | sort -k1,1 | python reducer.py
```

**Check:** Can you explain what each of the four steps in the pipeline accomplishes?

- **Cat**: Reads the file and streams it line by line.
- **Mapper**: Splits each line into words and emits a tab-separated `word`, `1` pair for each one.
- **Sort**: Shuffles the mapper output to sort it by key so that counting is easier.
- **Reducer**: Aggregates by word.
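
The four steps can also be mimicked inside a single Python process, which may help when reasoning about what each stage hands to the next. This is a sketch under assumptions: the generator functions below are stand-ins written for illustration (they are not the files in the `code` directory), and the sample text is made up.

```python
from itertools import groupby
from operator import itemgetter


def cat(text):
    """Stand-in for `cat`: yield the input one line at a time."""
    for line in text.splitlines():
        yield line


def mapper(lines):
    """Stand-in for mapper.py: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)


def reducer(sorted_pairs):
    """Stand-in for reducer.py: sum the counts of each run of equal keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


text = "to be or not to be\nthat is the question"

# sorted(...) plays the role of `sort -k1,1`: equal keys end up adjacent.
pairs = sorted(mapper(cat(text)), key=itemgetter(0))
for word, count in reducer(pairs):
    print('%s\t%s' % (word, count))
```

Like `reducer.py`, `groupby` only merges adjacent equal keys, so skipping the sort step would split a word's count across several output lines.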

**Check:** Can you figure out how our previous example *could* be represented in the diagram below?
![map reduce word count](./assets/images/word_count_dataflow.jpg)

<a name="ind-practice"></a>
## Independent Practice
---

Now that you have a basic word counter set up in Python, try to do some of the following:

1) Process a much larger text file of your choice.
    - For example, use a page from Wikipedia or a blog article. If you're really ambitious, you can take books from Project Gutenberg.
2) Explore how the execution time scales with file size.
3) Read [this article](http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html) to discover some powerful shell tricks. Learning to use the shell will save you time munging data on your file system.
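
For the second item, one possible starting point is to time a word count on synthetic inputs of increasing size. The sketch below is an assumption-laden illustration: it counts words in memory with `collections.Counter` rather than running the shell pipeline, and the sample sentence and sizes are arbitrary. Measured times will vary by machine.

```python
import time
from collections import Counter


def word_count(text):
    """Count whitespace-separated words in a string."""
    return Counter(text.split())


base = "the quick brown fox jumps over the lazy dog\n"

timings = []
for copies in (1000, 10000, 100000):
    text = base * copies                 # synthetic input of growing size
    start = time.perf_counter()
    counts = word_count(text)
    elapsed = time.perf_counter() - start
    timings.append((len(text), elapsed))
    print('%d characters\t%.4f s' % (len(text), elapsed))
```

Plotting the pairs in `timings` would show how the running time grows as the input gets larger.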

<a name="conclusion"></a>
## Conclusion
---

In this lesson, we've learned about big data and the MapReduce process. MapReduce is an algorithm that works well for aggregations on very large data sets.

**Check:** Now that you know how big data works, can you think of some more specific business applications?

**Examples:**

- For processing log files to find security breaches.
- For processing medical records to assess spending.
- For processing news articles to decide on investments.

<a id='resources'></a>
### Additional Resources

---

- [Top 500 supercomputers](http://www.top500.org/lists/).
- [Google MapReduce paper](http://research.google.com/archive/mapreduce.html).