In the last file, we learned about MapReduce and implemented our own MapReduce framework. Here is the workflow for using it:

![image.png](attachment:image.png)

No matter what framework we use for MapReduce, as a user, we always need to provide the mapper and reducer functions. In this file, we'll practice designing these functions.

The `map_reduce()` function from the last file will be available. It takes four arguments as input:

![image.png](attachment:image.png)

In this file, we'll practice using MapReduce to analyze English words. Specifically, we'll use the `english_words.txt` file, which contains a list of 370,103 English words. We gathered this file from this [GitHub repository](https://github.com/matthewreagan/WebstersEnglishDictionary/blob/master/dictionary.json).

Before we do that, let's review how to use the `map_reduce()` function from the last file. We'll use it to calculate the minimum value of a list of numbers. We can do this by using the `min()` built-in function as both the mapper and the reducer.

In [1]:
import math
from multiprocessing import Pool
import functools

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)

**Task**

![image.png](attachment:image.png)

**Answer**

In [2]:
values = [98, 63, 55, 80, 45, 51, 91, 64, 65, 48, 48, 92, 76, 99, 57, 42, 79, 61, 63, 49]
min_value = map_reduce(values, 4, min, min)
print(min_value)

42
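
To see why `min()` works as both the mapper and the reducer, it can help to look at the intermediate per-chunk results. The sketch below is a sequential stand-in for `map_reduce()` (it applies the mapper with a list comprehension rather than a process pool), using the same chunking rule as `make_chunks()`. The names `chunk_minimums` and `overall_minimum` are only illustrative.

```python
import functools
import math

def make_chunks(data, num_chunks):
    # Same chunking rule as the make_chunks() helper above.
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

values = [98, 63, 55, 80, 45, 51, 91, 64, 65, 48, 48, 92, 76, 99, 57, 42, 79, 61, 63, 49]

chunks = make_chunks(values, 4)

# Map step (run sequentially here): one minimum per chunk.
chunk_minimums = [min(chunk) for chunk in chunks]

# Reduce step: combine the per-chunk minimums two at a time.
overall_minimum = functools.reduce(min, chunk_minimums)

print(chunk_minimums)   # [45, 48, 48, 42]
print(overall_minimum)  # 42
```

The minimum of the per-chunk minimums is the minimum of the whole list, which is why the same function can serve in both roles.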


Now we'll use MapReduce to calculate the length of the longest word in the English language.

Imagine that the file only contained the following words:

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Let's implement this.

**Task**

![image.png](attachment:image.png)

**Answer**

In [3]:
with open("english_words.txt") as f:
    words = [word.strip() for word in f]

def map_max_length(words_chunk):
    return max(len(word) for word in words_chunk)

In [None]:
max_len = map_reduce(words, 1, map_max_length, max)
print(max_len)
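
As a small illustration of the same idea, here is a sketch that runs the mapper over a few hand-made chunks and reduces the results with `max()`. The words in `sample_chunks` are made up for this example and are not taken from `english_words.txt`. The mapper is applied sequentially instead of through a process pool.

```python
import functools

def map_max_length(words_chunk):
    # Length of the longest word in one chunk.
    return max(len(word) for word in words_chunk)

# Hypothetical chunks, standing in for what make_chunks() would produce.
sample_chunks = [["data", "engineering"], ["map", "reduce"], ["python"]]

chunk_results = [map_max_length(chunk) for chunk in sample_chunks]
max_len = functools.reduce(max, chunk_results)

print(chunk_results)  # [11, 6, 6]
print(max_len)        # 11
```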

Above we used MapReduce to calculate the length of the longest English word. We saw that it has twenty-one characters!

Aren't we curious to see what it is? Let's write another MapReduce program that calculates the actual word instead of the length.

![image.png](attachment:image.png)

As we can see in the diagram above, the result of each chunk now is a string, not a number.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Let's implement this.

**Task**

![image.png](attachment:image.png)

**Answer**

In [None]:
def map_max_len_str(words_chunk):
    return max(words_chunk, key=len) 

def reduce_max_len_str(word1, word2):
    return map_max_len_str([word1, word2])


max_len_str = map_reduce(words, 1, map_max_len_str, reduce_max_len_str)
print(max_len_str)
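
One detail worth keeping in mind: when several words share the maximum length, `max()` with `key=len` returns the first one it encounters. The sketch below shows this on a small made-up sample (the words and chunks are assumptions for illustration), again applying the mapper sequentially.

```python
import functools

def map_max_len_str(words_chunk):
    return max(words_chunk, key=len)

def reduce_max_len_str(word1, word2):
    # On a tie, max() keeps the first argument.
    return max(word1, word2, key=len)

# Hypothetical chunks where every longest word has four characters.
sample_chunks = [["pear", "fig"], ["kiwi", "plum"]]

chunk_results = [map_max_len_str(chunk) for chunk in sample_chunks]
longest = functools.reduce(reduce_max_len_str, chunk_results)

print(chunk_results)  # ['pear', 'kiwi']
print(longest)        # pear
```

So if the word list contains several words of maximal length, the result is the one that appears earliest in the data.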

Above, we calculated the longest English word in our list, and the result was `chromophotolithograph`. If we do a Google search for the longest English word, we actually find another result: `pneumonoultramicroscopicsilicovolcanoconiosis`.

According to Wikipedia, the latter is a made-up word, but it is in the Oxford dictionary. How can we use MapReduce to find out if a given word is in the list of English words we're using in this file?

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

The target string (the one we're looking for) is in a variable named `target`. Note that it isn't passed as an argument to either the mapper or the reducer. They can access it because it is defined outside both functions.

![image.png](attachment:image.png)

**Answer**

In [None]:
target = "pneumonoultramicroscopicsilicovolcanoconiosis"

def map_contains(words_chunk):
    return target in words_chunk

def reduce_contains(contains1, contains2):
    return contains1 or contains2

is_contained = map_reduce(words, 1, map_contains, reduce_contains)
print(is_contained)
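
If we'd rather not rely on a variable defined outside the mapper, one possible alternative is to give the mapper an explicit `target` parameter and bind it with `functools.partial`. This is a sketch of a different design, not the solution used above. The two-parameter `map_contains()` and the sample chunks are assumptions made for this example, and the mapper is applied sequentially here.

```python
import functools

def map_contains(words_chunk, target):
    # True if the target word appears in this chunk.
    return target in words_chunk

def reduce_contains(contains1, contains2):
    return contains1 or contains2

# Hypothetical chunks, standing in for chunks of the real word list.
sample_chunks = [["apple", "banana"], ["cherry", "date"]]

# Bind the target so the mapper takes a single argument (the chunk).
mapper = functools.partial(map_contains, target="cherry")

chunk_results = [mapper(chunk) for chunk in sample_chunks]
is_contained = functools.reduce(reduce_contains, chunk_results)

print(chunk_results)  # [False, True]
print(is_contained)   # True
```

Because the bound mapper still takes one argument, it has the same shape as the mappers used elsewhere in this file.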

![image.png](attachment:image.png)

This is similar to what we did in the Introduction to MapReduce file to count skills in data engineering job postings. This time, we're counting characters rather than words. The mapper function will create a frequency table for the provided chunk of words.

Then, the reducer will take two frequency tables and merge them by adding together the counts for each character. In this case, the two dictionaries will not necessarily have the same set of keys because some characters might not occur in one of the chunks.

![image.png](attachment:image.png)
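
To make the merge step concrete, here is a minimal sketch of combining two frequency tables whose keys only partly overlap. It uses `collections.Counter` from the standard library, which is a swapped-in technique for illustration rather than the approach asked for in the task. The two small dictionaries are made up.

```python
from collections import Counter

# Hypothetical frequency tables from two chunks.
freq1 = {"a": 2, "b": 1}
freq2 = {"b": 3, "c": 1}

# Adding two Counters sums the counts key by key;
# keys present in only one table keep their count.
merged = dict(Counter(freq1) + Counter(freq2))

print(merged)  # {'a': 2, 'b': 4, 'c': 1}
```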

**Task**

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Answer**

In [None]:
def map_char_count(words_chunk):
    char_freq = {}
    for word in words_chunk:
        for c in word:
            if c not in char_freq:
                char_freq[c] = 0
            char_freq[c] += 1
    return char_freq

def reduce_char_count(freq1, freq2):
    for c in freq2:
        if c in freq1:
            freq1[c] += freq2[c]
        else:
            freq1[c] = freq2[c]
    return freq1

char_freq = map_reduce(words, 1, map_char_count, reduce_char_count)
print(char_freq)

Above, we used MapReduce to count the frequency of characters in English words. If the output makes it hard to see which character is the most frequent, we can get the answer by using the `operator` module, like this:

![image.png](attachment:image.png)
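
One way to do this kind of lookup is sketched below, using `operator.itemgetter` as the key for `max()`. The frequency table `sample_freq` is a small made-up stand-in for `char_freq`, so the numbers are not real results.

```python
import operator

# Hypothetical frequency table, standing in for char_freq.
sample_freq = {"e": 5, "t": 3, "a": 4}

# items() yields (character, count) pairs; itemgetter(1) selects the count.
most_frequent_char, count = max(sample_freq.items(), key=operator.itemgetter(1))

print(most_frequent_char, count)  # e 5
```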

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [None]:
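def map_average(words_chunk):
    # Divide by the total number of words (the global `words` list), not the chunk size,
    # so that adding the chunk results together gives the overall average.
    return sum(len(word) for word in words_chunk) / len(words)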

def reduce_average(res1, res2):
    return res1 + res2
    
average_word_len = map_reduce(words, 1, map_average, reduce_average)
print(average_word_len)
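
The mapper above works because it divides by the total number of words, which it reads from the global `words` list. A possible alternative that avoids this dependency is to have each chunk return a `(total_length, word_count)` pair and divide only once at the end. This is a sketch of a different design, with hypothetical function names and made-up sample chunks, applied sequentially.

```python
import functools

def map_sum_count(words_chunk):
    # Return (sum of word lengths, number of words) for one chunk.
    return (sum(len(word) for word in words_chunk), len(words_chunk))

def reduce_sum_count(res1, res2):
    # Add the two pairs component by component.
    return (res1[0] + res2[0], res1[1] + res2[1])

# Hypothetical chunks of unequal size.
sample_chunks = [["a", "bb", "ccc"], ["dddd"]]

chunk_results = [map_sum_count(chunk) for chunk in sample_chunks]
total_len, total_count = functools.reduce(reduce_sum_count, chunk_results)
average = total_len / total_count

print(chunk_results)  # [(6, 3), (4, 1)]
print(average)        # 2.5
```

Note that averaging the per-chunk averages directly would give the wrong answer when chunks have different sizes, which is why the counts are carried along.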

Throughout this file, we've practiced using MapReduce to analyze English words. Let's finish this file with a challenge.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [None]:
def map_adjacent(words_chunk):
    adj_freq = {}
    for word in words_chunk:
        for i in range(len(word) - 1):
            seq = word[i] + word[i + 1]
            if seq not in adj_freq:
                adj_freq[seq] = 0
            adj_freq[seq] += 1
    return adj_freq


def reduce_adjacent(freq1, freq2):
    for seq in freq2:
        if seq in freq1:
            freq1[seq] += freq2[seq]
        else:
            freq1[seq] = freq2[seq]
    return freq1
    
pair_freq = map_reduce(words, 1, map_adjacent, reduce_adjacent)
unique_pairs = [seq for seq in pair_freq if pair_freq[seq] == 1]

print(unique_pairs)

Great success! Above, we calculated all pairs of adjacent characters that occur only once across the whole word list. We only calculated the pairs of characters, not the actual words where they occur. It would be a good exercise to modify the solution so that we get the actual words.
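
One possible way to approach that exercise is to store, for each pair, both its count and the first word in which it was seen. The sketch below is only one option, with hypothetical function names and a tiny made-up word list. The mapper is applied sequentially rather than through a process pool.

```python
import functools

def map_adjacent_words(words_chunk):
    # Map each pair to (count, first word where the pair was seen).
    pair_info = {}
    for word in words_chunk:
        for i in range(len(word) - 1):
            seq = word[i:i + 2]
            count, first_word = pair_info.get(seq, (0, word))
            pair_info[seq] = (count + 1, first_word)
    return pair_info

def reduce_adjacent_words(info1, info2):
    # Merge into a copy so neither input is modified.
    merged = dict(info1)
    for seq, (count, word) in info2.items():
        if seq in merged:
            prev_count, prev_word = merged[seq]
            merged[seq] = (prev_count + count, prev_word)
        else:
            merged[seq] = (count, word)
    return merged

# Hypothetical chunks, standing in for chunks of the real word list.
sample_chunks = [["cat", "cab"], ["tab"]]

chunk_results = [map_adjacent_words(chunk) for chunk in sample_chunks]
pair_info = functools.reduce(reduce_adjacent_words, chunk_results)

# Keep only the pairs seen exactly once, along with their word.
unique_pair_words = {seq: word for seq, (count, word) in pair_info.items() if count == 1}

print(unique_pair_words)  # {'at': 'cat', 'ta': 'tab'}
```

For pairs with a count of one, the stored word is necessarily the only word that contains the pair.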

If we're curious, here are the words corresponding to each unique pair of characters:

![image.png](attachment:image.png)