# <u><p style="text-align: center;">MapReduce</p></u>

### Contents of this notebook 
* Combining map and reduce to operate on data
* MapReduce operations in Python

### Background

`map` and `reduce` operations are combined into a workflow for data transformations called **MapReduce**. As the name suggests, in this workflow a function is first mapped over the data. The mapped results then go through a `reduce` operation to arrive at a single result. This workflow allows a high degree of parallelization: the `map` step handles each element independently, and the `reduce` step can combine partial results in parallel when the reducing function is associative. This makes the workflow suitable for working with large amounts of data. 
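
As a minimal sketch of this workflow, assuming a small made-up list of numbers (the names `numbers`, `square` and `add` are illustrative), the sum of squares can be computed by mapping and then reducing. Because addition is associative, the mapped values can also be split into chunks that are reduced separately and combined afterwards, which is the property that lets the reduce step run in parallel:

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]    # hypothetical sample data

def square(x):
    return x * x

def add(x, y):
    return x + y

# map step: apply the function to every element
squares = list(map(square, numbers))

# reduce step: combine all mapped values into a single result
sum_of_squares = reduce(add, squares)
print(sum_of_squares)

# same result, reducing two chunks separately and then combining the partial sums
chunks = [squares[:3], squares[3:]]
partial_sums = [reduce(add, chunk) for chunk in chunks]
chunked_sum = reduce(add, partial_sums)
print(chunked_sum)
```

Here the chunks are processed one after the other; in a real parallel setting each chunk could be handed to a separate worker.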

### Code examples
In this section we are going to see some examples of how to combine the `map` and `reduce` functions.  
***Example 1:*** applies the `map` and `reduce` functions to find the most recent year in a list of dates.  
***Example 2:*** applies the `map` and `reduce` functions to pre-process a given dataset of chickens and find the chicken with the longest gait.  
***Example 3:*** applies the `map` and `reduce` functions to calculate the net weight of potato crates, as well as the average net weight per crate.  

#### Example 1: Finding the most recent year
Different datasets often come with different representations for dates. A common representation is to store dates as `dd-mm-yyyy`. In our first example, we are going to use MapReduce to process dates in this format in order to find the most recent year in the dataset. 

Below is our list of dates:

In [None]:
dates = ['9-1-2004',
        '19-5-1986',
        '27-10-2018',
        '5-4-2021',
        '16-8-1936']

We define a function that extracts the year from our date:

In [None]:
def extract_year(date):
    # retain the last 4 characters of the date string and convert them to a number,
    # so that years are compared numerically rather than as text
    return int(date[-4:])

print(extract_year('27-10-2018'))

Now we can proceed with the `map` and `reduce` operations. First, we use `map` to apply our function `extract_year` to the list of dates. We get a new list (stored in the variable `years`) of all the years from the dates. Next, we apply `reduce` to this list of years with the `max` function to get the highest value in the list, which is the most recent year.

In [None]:
from functools import reduce

years = list(map(extract_year, dates))
most_recent_year = reduce(max, years)
print(most_recent_year)
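
A small variation, sketched under the assumption that a year might not always be exactly four characters long: split each date on `-` and convert the last field to an integer. The helper name `extract_year_as_int` and the list `sample_dates` are hypothetical.

```python
from functools import reduce

def extract_year_as_int(date):
    # the year is the last '-'-separated field of a dd-mm-yyyy string
    return int(date.split('-')[-1])

sample_dates = ['9-1-2004', '1-1-986', '5-4-2021']    # made-up dates, one with a 3-digit year

numeric_years = list(map(extract_year_as_int, sample_dates))
latest_year = reduce(max, numeric_years)
print(latest_year)
```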

#### Example 2: Chicken gait measurements
In our next example, we have some chicken gait length measurements. The measurements are recorded in millimeters, and we want to convert them to meters. After the conversion, we want to find the chicken with the longest gait.

Our chicken entries are stored in a list of dictionaries. Each dictionary holds the name and the gait length for one chicken:

In [None]:
chickens = [{'name': 'Cesar', 'gait_length': 50 },
            {'name': 'Daisy', 'gait_length': 30 },
            {'name': 'Coolio', 'gait_length': 10 },
            {'name': 'Puff', 'gait_length': 70},
            {'name': 'Cocoa', 'gait_length': 40}]

We define a function that converts the gait from millimeters to meters for one chicken entry. The function returns a new dictionary. In the new dictionary, the same name is stored again, but the gait length is divided by 1000 so that it is represented in meters. Note that the function performs this operation on only one chicken, not on the whole list. We need simple functions that can be repeated many times, not complicated functions that need to be performed on the entire dataset.

In [None]:
def mm_to_m(chicken):
    
    chicken_dict = {'name': chicken['name'], 'gait_length': chicken['gait_length'] / 1000}
    
    return chicken_dict

We test this function:

In [None]:
print(mm_to_m({'name': 'Cesar', 'gait_length': 50 }))

We also define a function that compares the gait of two chicken entries:

In [None]:
def max_gait(chicken_1, chicken_2):
    
    if chicken_1['gait_length'] > chicken_2['gait_length']:
        longest_gait_chicken = chicken_1
    else:
        longest_gait_chicken = chicken_2
    
    return longest_gait_chicken

print(max_gait({'name': 'Daisy', 'gait_length': 30 }, {'name': 'Cocoa', 'gait_length': 40}))

Finally, we will combine the `map` and `reduce` operations to find the chicken with the longest gait. First, we map our conversion function `mm_to_m` to the whole list of chickens, which results in a new list of dictionaries that now has the right unit. Then we reduce the result by applying the `max_gait` function to the new list. 

As a note, there was no real need to convert the gait from millimeters to meters, but usually we need to perform some kind of pre-processing on the data in order to be able to use certain functions. Therefore we included this step in the example.

In [None]:
converted_entries = list(map(mm_to_m, chickens))
longest_gait_chicken = reduce(max_gait, converted_entries)
print(longest_gait_chicken)
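
For comparison, the same two steps can be sketched with `lambda` expressions instead of named functions. The list `sample_chickens` is a made-up two-entry dataset so that the block runs on its own, and the built-in `max` with a `key` argument is shown only as a shortcut for this particular reduction.

```python
from functools import reduce

sample_chickens = [{'name': 'Daisy', 'gait_length': 30},
                   {'name': 'Puff', 'gait_length': 70}]    # hypothetical data, in mm

# map step: build a new dictionary per chicken with the gait in meters
in_meters = list(map(lambda c: {'name': c['name'], 'gait_length': c['gait_length'] / 1000},
                     sample_chickens))

# reduce step: of every pair of chickens, keep the one with the longer gait
longest = reduce(lambda a, b: a if a['gait_length'] > b['gait_length'] else b, in_meters)
print(longest)

# built-in shortcut for the same reduction
longest_builtin = max(in_meters, key=lambda c: c['gait_length'])
print(longest_builtin)
```

Named functions such as `mm_to_m` and `max_gait` are easier to test on a single entry, which is why the example above uses them.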

#### Example 3: Weighing potato crates
In our final example, we are given the gross weight of some potato crates, and we want to calculate the net weight of the potatoes as well as the average net weight per crate. We have a dataset that contains the gross weights in kilograms.

In [None]:
gross_weights = [30, 32, 28, 28.7, 31.2]    # in kg

To calculate the net weight we need to subtract the crate weight from each gross weight. We define a function for this, and then map this function to our list of gross weights, which gives us a new list of net weights:

In [None]:
def calculate_net_weight(gross_weight):
    return gross_weight - 1         # we assume each crate weighs 1 kg and subtract it from the gross weight

net_weights = list(map(calculate_net_weight, gross_weights))

To calculate the average net weight, we first need to take the sum of all net weights. For this, we use the `reduce` function. Then we use the `len` function to get the number of crates, and divide the sum by it to obtain the average net weight per crate:

In [None]:
def add(x, y):
    return x + y

total_net_weight = reduce(add, net_weights)
average_net_weight = total_net_weight / len(net_weights)
print(average_net_weight)
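
The sum and the count can also be obtained in a single reduce step, which can be useful when the number of items is not known up front. The sketch below is one way to do this; the names `to_pair` and `add_pairs` and the list `sample_net_weights` are illustrative.

```python
from functools import reduce

sample_net_weights = [29, 31, 27, 30]    # hypothetical net weights in kg

def to_pair(weight):
    # map step: pair every weight with a count of 1
    return (weight, 1)

def add_pairs(pair_1, pair_2):
    # reduce step: add the weights and the counts element-wise
    return (pair_1[0] + pair_2[0], pair_1[1] + pair_2[1])

total, count = reduce(add_pairs, map(to_pair, sample_net_weights))
print(total / count)
```

Since `add_pairs` is associative, partial (sum, count) pairs from separate chunks of data could be combined in the same way.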

<span style="display:none" id="question1">W3sicXVlc3Rpb24iOiAiV2hpY2ggb2YgdGhlIGZvbGxvd2luZyBzdGF0ZW1lbnRzIGFyZSBjb3JyZWN0IGZvciBtYXBSZWR1Y2U/IiwgInR5cGUiOiAibXVsdGlwbGVfY2hvaWNlIiwgImFuc3dlcnMiOiBbeyJjb2RlIjogIml0IGJyZWFrcyBkb3duIHRoZSAgICAgIHdvcmtsb2FkIGludG8gc21hbGxlciAgIHBpZWNlcyBvZiB3b3JrIHRoYXQgY2FuIGJlIG9wZXJhdGVkIG9uIGluICAgICAgIHBhcmFsbGVsIiwgImNvcnJlY3QiOiB0cnVlfSwgeyJjb2RlIjogIml0IGNhbm5vdCBiZSB1c2VkIGZvciAgIHNtYWxsIGRhdGFzZXRzIiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIkl0IGlzIHN1aXRhYmxlIGZvciBib3RoIHNtYWxsIGFuZCBiaWcgZGF0YXNldHMifSwgeyJjb2RlIjogIml0IGlzIHN1aXRhYmxlIG9ubHkgZm9yIG51bWVyaWNhbCBkYXRhc2V0cyIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJJdCBpcyBub3QgbGltaXRlZCB0byBudW1lcmljYWwgZGF0YSJ9XX1d</span>

## Quiz

#### Q1:

In [None]:
from jupyterquiz import display_quiz

display_quiz("#question1")

### Further reading

* [MapReduce](https://en.wikipedia.org/wiki/MapReduce)