# Big Data - Advanced Map-Reduce Exercises

In this notebook we are going to look at MapReduce jobs. We will run through the MapReduce logic entirely in Python so you can see what is happening at each step.

IMPORTANT: This notebook exists only in the container, and you will lose any changes when you close the container!
To keep a copy for reference, go to the menu bar above, select File->Download as->Notebook, and save it to your laptop.

First we will move the articles file into HDFS. We will use both ways of running Bash in Jupyter notebooks: the ! prefix and the %%bash cell magic. To begin, check that the file articles.part exists in the container by running the code below:

In [None]:
!ls

# Python mapper and reducer

In this example we will simulate the map and reduce phases entirely in Python. This allows us to run MapReduce jobs quickly without having to wait for the distributed filesystem.

We will see Python code that reads articles and uses tuples to hold key-value pairs (this makes them easier for us to sort).

The next cell will pull the head and tail of our data to local files for us to work with. This allows us to simulate that we have a large file split into two parts.

In [None]:
%%bash
cat articles.part | head -n 5 > part1.txt
cat articles.part | tail -n 2 > part2.txt

We have split our articles file into two parts to represent how large files get split in the HDFS file system.

Use bash to check that the two new files have been added.

Check the contents of one of the files using bash. Make sure the file contains Wikipedia articles. Try to understand how the data is structured: each article is a single 'line' in the file, and it starts with an ID number followed by a tab and then the full article text.
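
As a minimal sketch of that line layout, the snippet below splits one line on its first tab only. The sample line is made up for illustration and is not taken from articles.part.

```python
# A made-up line in the same layout as the data: ID, a tab, then the article text
sample_line = "12\tAnarchism is a political philosophy.\tIt has many forms.\n"

# strip() drops the trailing newline; maxsplit=1 splits on the first tab only,
# so any tabs inside the article text are left untouched
article_id, text = sample_line.strip().split("\t", 1)

print(article_id)
print(text)
```

Splitting only once matters because the article text itself may contain tab characters.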

Let's define three simple Python functions to replicate the functionality we need: a mapper, a sort, and a reducer.

Make sure you understand what each of these is doing.

The mapper uses a regular expression to split the lines into words.
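
To see what this kind of splitting does, here is a small sketch that applies the same pattern to a made-up sentence and then strips the remaining punctuation. The variable names are chosen for illustration only.

```python
import re
import string

sample_text = "Hello I'm Bob. Hello Bob, I'm Greg."

# Split on runs of whitespace, also consuming any non-word characters
# (such as '.' or ',') that touch the whitespace
tokens = re.split(r"\W*\s+\W*", sample_text)
print(tokens)

# Punctuation not next to whitespace (the apostrophes, the final '.') survives
# the split, so it is removed in a second step
remove_punctuation = str.maketrans('', '', string.punctuation)
cleaned = [token.lower().translate(remove_punctuation) for token in tokens]
print(cleaned)
```

Note that the apostrophe in "I'm" is removed in the second step, so the word is counted as "im".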

In [None]:
import string
import re

In [None]:
def word_count_mapper(wiki_lines):
    # List to store the key-value tuples the mapper will return
    key_value_pairs = []

    # Iterate through lines in the input file
    for line in wiki_lines:
        # Separate the article ID and the text into separate variables
        article_id, text = (line.strip()).split('\t', 1)

        # Split the text into words using a regular expression
        # (raw string so the backslashes reach the regex engine unchanged)
        words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)

        # For each word, add a key-value pair to the list, where the key is the word and the value is always 1
        for word in words:
            # Remove punctuation and make the word lower case
            word_stripped = word.lower().translate(str.maketrans('', '', string.punctuation))

            # Skip tokens that are empty once punctuation is removed
            if not word_stripped:
                continue

            # Add word to results
            key_value_pairs.append((word_stripped, 1))

    return key_value_pairs

Let's apply our mapper to a simple example to check what it is doing:

In [None]:
map_result = word_count_mapper(["ArticleID\tHello I'm Bob. Hello Bob, I'm Greg."])
map_result

In [None]:
def shuffle_sort(multiple_key_value_pair_lists):
    sorted_kvp = []

    # We need to sort all key value pairs together from multiple lists of key value pairs
    for kvp_list in multiple_key_value_pair_lists:

        #Add each key value pair to a single list
        for kvp in kvp_list:
            sorted_kvp.append(kvp)

    # Sort the list; the key argument says we want to use the first part of the tuple (the word) as the sorting criterion
    sorted_kvp.sort(key=lambda tup: tup[0])

    return sorted_kvp

Let's test the shuffle and sort phase:

In [None]:
sorted_result = shuffle_sort([map_result])
sorted_result

In [None]:
def reducer(key_value_pairs):
    reduced_key_value_pairs = []
    current_key = None
    word_sum = 0

    # Here we iterate through all the key-value pairs. If the key is the same as the previous one we just add to the count;
    # if not, we add the previous word and its final count to the list the function will return
    total_count = len(key_value_pairs)
    for i in range(total_count):
        key, count = key_value_pairs[i]
        # if the new key is different to the previous (and not None) it is added to our final list and the wordcount reset
        if current_key != key:
            if current_key:
                reduced_key_value_pairs.append((current_key, word_sum))
            word_sum = 0
            current_key = key

        #Add to the counter for the current word
        count = int(count)
        word_sum += count

        #Add the last word
        if i + 1 == total_count:
            reduced_key_value_pairs.append((key, word_sum))

    return reduced_key_value_pairs

Finally, let's see what the reducer does:

In [None]:
reducer(sorted_result)

There we have our word count result.

Would this reducer work if the input list wasn't sorted?
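
One way to explore this question is the sketch below. It uses a hypothetical helper, adjacent_key_reduce, built on itertools.groupby, which (like the reducer above) only merges keys that sit next to each other. It is an illustration under that assumption, not a replacement for the reducer in this notebook.

```python
from itertools import groupby

def adjacent_key_reduce(pairs):
    # Hypothetical helper: sums the values over each run of adjacent equal keys
    return [(key, sum(value for _, value in group))
            for key, group in groupby(pairs, key=lambda kv: kv[0])]

unsorted_pairs = [("hello", 1), ("bob", 1), ("hello", 1)]

# Without sorting, "hello" appears in two separate runs and is reported twice
unsorted_counts = adjacent_key_reduce(unsorted_pairs)
print(unsorted_counts)

# After sorting, equal keys are adjacent and each word gets a single total
sorted_counts = adjacent_key_reduce(sorted(unsorted_pairs))
print(sorted_counts)
```

This is why the shuffle and sort phase has to run between the map and reduce phases.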

## Running map reduce

Let's put it all together and simulate the MapReduce process with Python:

In [None]:
# Import data into the variables we will work with
part1 = open("part1.txt")
part2 = open("part2.txt")
file_parts = [part1, part2]

# Map phase, the mapper function will run on multiple file parts
mapper_result = []

for part in file_parts:
    mapper_result.append(word_count_mapper(part))

# Sort phase running on mapper_result (which is a list of lists of key value pairs)
sort_result = shuffle_sort(mapper_result)

# Reduce phase on sorted results
reduce_result = reducer(sort_result)

# Here we sort the results of the reducer so we can find the most common word
reduce_result.sort(key=lambda tup: tup[1])

# Most common word tuple
reduce_result[-1]

Predictably, 'the' is the most common word in our example. Use Python to see the top 20 words in the variable above.
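
As a hint, here is one possible approach, shown on a small made-up list rather than on the real result. The names sample_counts and top_n are illustrative; with the notebook's data you would use reduce_result and 20.

```python
# Made-up (word, count) pairs standing in for the reducer output
sample_counts = [("bob", 2), ("the", 9), ("greg", 1), ("of", 5), ("hello", 3)]

top_n = 3  # use 20 for the exercise

# Sort by count, largest first, and keep the first top_n entries
top_words = sorted(sample_counts, key=lambda tup: tup[1], reverse=True)[:top_n]
print(top_words)
```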

# Map reduce exercise

For this example we are interested in the distribution of the number of words in each sentence that has more than one word (sentences with one word in this dataset are mostly HTML formatting).

Write a new simple Python mapper and reducer to accomplish this task. Note that we will use the same sort function, so please return a list of key-value pairs as tuples from your mapper. Take a look at the simple Python mapper above.

From your result we want to find:

the most words in a sentence,

the average sentence length in number of words (the mean sentence length),

and the most common sentence length in number of words.

Keep this in mind.
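
To make the three targets concrete, the sketch below computes them from a made-up distribution. It assumes, purely for illustration, that the reducer output is a list of (sentence length, number of sentences) pairs; your own mapper and reducer may be organised differently.

```python
# Made-up distribution: (sentence length in words, number of sentences of that length)
length_counts = [(2, 3), (3, 5), (4, 2)]

# The most words in a sentence
longest = max(length for length, _ in length_counts)

# The mean sentence length: total words divided by total sentences
total_sentences = sum(count for _, count in length_counts)
total_words = sum(length * count for length, count in length_counts)
mean_length = total_words / total_sentences

# The most common sentence length and how many sentences have it
most_common_length, most_common_count = max(length_counts, key=lambda kv: kv[1])

print(longest, mean_length, most_common_length, most_common_count)
```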

If you are having trouble, try running each function separately on a simple text to see if the output looks like what you would expect.

## Find the most words in a sentence

Fill in the code below to create a mapper and reducer that will find the maximum number of words in a sentence:

In [None]:
def sentence_mapper(wiki_lines):
    # List to store the key-value tuples the mapper will return
    key_value_pairs = []

    # Iterate through lines in the input file
    for line in wiki_lines:
        # Separate the article ID and the text into separate variables
        article_id, text = (line.strip()).split('\t', 1)

        # Split the text into sentences
        sentences = (text).split('.')

        # YOUR CODE HERE.....
        # Remember to remove sentences with fewer than 2 words


    return key_value_pairs

Test your mapper on the simple example below:

In [None]:
sentence_map_result = sentence_mapper(["ArticleID\tHello I'm Bob. Hello Bob, I'm Greg. Nice hair Greg."])
sentence_map_result

In [None]:
def sentence_reducer(key_value_pairs):
    reduced_key_value_pairs = []
    current_key = None
    word_sum = 0

    # Here we iterate through all the key-value pairs. If the key is the same as the previous one we just add to the count;
    # if not, we add the previous key and its final count to the list the function will return
    total_count = len(key_value_pairs)
    for i in range(total_count):
        # YOUR CODE HERE ......
        pass  # placeholder so the cell runs before you add your code

    return reduced_key_value_pairs


Test your reducer below:

In [None]:
sentence_reducer(shuffle_sort([sentence_map_result]))

Let's test your mapper and reducer on the file parts below.


In [None]:
# Import data into the variables we will work with
part1 = open("part1.txt")
part2 = open("part2.txt")
file_parts = [part1, part2]

# Map phase, running on all file parts
mapper_result_s = []

for part in file_parts:
    mapper_result_s.append(sentence_mapper(part))

# Sort phase running on mapper_result
sort_result_s = shuffle_sort(mapper_result_s)

# Reduce phase on sorted results
reduce_result_s = sentence_reducer(sort_result_s)


From your result can you find:

the most words in a sentence

If needed adjust your mapper and reducer to find the following results:

the fewest words in a sentence (should be more than 1)

the most common sentence length in number of words and how many sentences have this length

the average sentence length in number of words

Say we were just interested in the average number of words in a sentence. Can we just have a map job that calculates the average for each part of the file and then a reduce job that takes the average of these map results? Why / Why not?

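No. In general, the average of the per-part averages does not equal the global average, because the parts can contain different numbers of sentences and a plain average of averages gives every part the same weight.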
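
A small worked example with made-up sentence lengths shows the difference. The two lists stand in for two file parts of unequal size.

```python
def mean(values):
    return sum(values) / len(values)

part_a = [2, 4]              # sentence lengths in a small part
part_b = [10, 10, 10, 10]    # sentence lengths in a larger part

# Averaging the two per-part averages: (3 + 10) / 2
average_of_averages = mean([mean(part_a), mean(part_b)])

# Averaging over every sentence: 46 words / 6 sentences
global_average = mean(part_a + part_b)

print(average_of_averages, global_average)
```

The two values differ because the small part counts as much as the large part in the first calculation.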

Do you think you have minimised the amount of information passed to the reducer in each case? Have you performed as much of the computation as possible where the data is, in the mapper?

This question is just to check that you are not passing the full sentence to the reducer.
For a number of the questions we could reduce the information passed to the reducer. For example, for average words per sentence, the only information the reducer needs is the total number of words and the total number of sentences in each part.
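
A minimal sketch of that idea follows, using made-up sentence lengths. The helper names part_totals and combine_totals are hypothetical and are not part of the exercise code above.

```python
def part_totals(sentence_lengths):
    # Map side: summarise a whole part as one (total_words, total_sentences) pair
    return (sum(sentence_lengths), len(sentence_lengths))

def combine_totals(totals):
    # Reduce side: add up the per-part totals, then divide once at the end
    total_words = sum(words for words, _ in totals)
    total_sentences = sum(sentences for _, sentences in totals)
    return total_words / total_sentences

part_a = [2, 4]              # sentence lengths in one part
part_b = [10, 10, 10, 10]    # sentence lengths in another part

mean_length = combine_totals([part_totals(part_a), part_totals(part_b)])
print(mean_length)
```

Each part sends only two numbers to the reducer, and the result matches the average taken over all sentences.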