# Python Higher Order Functions

## Practice Problems: MapReduce

These problems provide practice with the MapReduce programming paradigm.

We will complete the tasks using the accompanying *mapreduce* package (provided as **mapreduce.py**), running one or more MapReduce "jobs" per task. Please download **mapreduce.py** from Blackboard and place it in the same folder as your notebook.

For each such job (run with mr.run()), you are expected to supply a mapper and, if needed, a reducer. Below are sample usages of the package:

```python
    # Run on input1 using your mapper1 and reducer1 function
    output = list(mr.run(input1, mapper1, reducer1))

    # Run on input2 using only your mapper2, no reduce phase
    output = list(mr.run(input2, mapper2))
    
    # Run on input3 using 2 nested MapReduce jobs
    output = list(mr.run(mr.run(input3, mapper3, reducer3), mapper4, reducer4))
```
    
Please note that the output of mr.run() is always a **generator**. You have to convert it to a list if you'd like to view, index, or print it.
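The practical consequence of getting a generator back can be seen with an ordinary Python generator, independent of the package. In this minimal sketch, `squares` is a hypothetical stand-in for the result of mr.run(); the same behavior is assumed, not verified, for the package itself.

```python
def squares(n):
    # Hypothetical stand-in for mr.run(): yields results lazily, one at a time
    for i in range(n):
        yield i * i

gen = squares(4)

# A generator cannot be indexed directly
try:
    gen[0]
    indexable = True
except TypeError:
    indexable = False

first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # already exhausted, so nothing is left

print(indexable)    # False
print(first_pass)   # [0, 1, 4, 9]
print(second_pass)  # []
```

Because a generator can be consumed only once, it is usually safest to convert the result to a list a single time and reuse that list.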

The tasks below also include those from Homework 3, but here we use MapReduce instead of Python's general higher-order functions.

You will need the **book.txt** and **citibike.csv** files from Blackboard.

In [3]:
import csv
import mapreduce as mr

## Warm-up

Here is another concrete example, "Word Count", using the package. Assume we have a text file named *book.txt*. Our task is to count the frequency of each word in this document and print the 10 most frequent. For illustration purposes, we count only the first 1000 lines of the book.

In [14]:
with open('book.txt', 'r') as fi:
    lines = [line.strip() for i,line in enumerate(fi) if i<1000] # After this, 'lines' stores a list of 1000 text lines

def mapper(line):
    for word in line.strip().split(' '): # split the line on single spaces; empty strings from repeated spaces are skipped below
        if len(word)>0:
            yield (word.lower(), 1)
    
def reducer(k2v2): # k2v2 is a (key, values) pair: a word and the counts the mapper emitted for it
    word, list_of_ones = k2v2
    return (word, sum(list_of_ones))

wCounts = list(mr.run(lines, mapper, reducer))
sortedCounts = sorted(wCounts, key=lambda x: -x[1])
print(sortedCounts[:10])

[('the', 411), ('of', 337), ('and', 250), ('a', 184), ('or', 162), ('with', 102), ('to', 101), ('in', 90), ('on', 69), ('as', 58)]
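
To make the phases behind this example concrete, here is a minimal sketch of what a single job might do: apply the mapper to every record, group the emitted values by key, then apply the reducer to each `(key, values)` pair. `run_sketch`, `wc_mapper`, and `wc_reducer` are hypothetical names written for illustration; this is not the code in **mapreduce.py**, which may differ in output ordering and other details.

```python
from collections import defaultdict

def run_sketch(records, mapper, reducer):
    # Hypothetical stand-in, NOT the real mapreduce.py implementation
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may yield any number of (key, value) pairs
        for key, value in mapper(record):
            # Shuffle phase: collect the values that share a key
            groups[key].append(value)
    for key in sorted(groups):
        # Reduce phase: one call per (key, list_of_values) pair
        yield reducer((key, groups[key]))

def wc_mapper(line):
    for word in line.split():
        yield (word.lower(), 1)

def wc_reducer(k2v2):
    word, list_of_ones = k2v2
    return (word, sum(list_of_ones))

sample = ['the cat and the hat', 'The end']
counts = list(run_sketch(sample, wc_mapper, wc_reducer))
print(counts)
# [('and', 1), ('cat', 1), ('end', 1), ('hat', 1), ('the', 3)]
```

Sorting the keys in the sketch only makes the output deterministic; the ranking by frequency still has to be done afterwards, as with `sorted` in the cell above.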
