# The idea behind MapReduce is a generalization of what we implemented in the Process Pool Executors lesson:

1. Divide: divide the data into chunks.
2. Map: use parallel processing to process each chunk.
3. Reduce: combine the individual chunk results into a global result.

The goal of this lesson is to implement this workflow as a generic framework where a user provides some data, a map function, and a reduce function, and the data is processed automatically. The parallel processing will occur during the map stage.


## We provide you with the make_chunks function.

1. Rename the df argument to data.

2. On the first and second lines of the function, replace df.shape[0] with len(data).

3. On the second line of the function, replace df[i:i+chunk_size] by data[i:i+chunk_size].

4. Test the new implementation by calling it with arguments [1, 2, 3, 4, 5, 6] and 3. Assign the result to a variable named chunks.

Inspect the chunks variable. Its value should be [[1, 2], [3, 4], [5, 6]]

In [2]:
import math

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

chunks = make_chunks([1, 2, 3, 4, 5, 6],3)
chunks

[[1, 2], [3, 4], [5, 6]]
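
Because the chunk size is rounded up, the chunks are not always the same size, and the function can return fewer chunks than requested. A small sketch of both cases (the input lists are made up for illustration):

```python
import math

def make_chunks(data, num_chunks):
    # Round up so that every element lands in some chunk.
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

# 7 elements in 3 chunks: chunk_size is 3, so the last chunk is shorter.
uneven = make_chunks([1, 2, 3, 4, 5, 6, 7], 3)
print(uneven)  # [[1, 2, 3], [4, 5, 6], [7]]

# 4 elements in 3 chunks: chunk_size is 2, so only 2 chunks come back.
fewer = make_chunks([1, 2, 3, 4], 3)
print(fewer)   # [[1, 2], [3, 4]]
```

This matters later because the mapper and reducer should not assume equally sized chunks.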

## Let's assume that we want to calculate the maximum value in a list of numbers using MapReduce. 

The first step is to use the make_chunks() function from the previous screen to break the list into smaller chunks. 

The next step is to implement a map function that, given a chunk, calculates the answer for that chunk. In this case, we can use the max() built-in function as our mapper function.
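
As a minimal sequential sketch of these two steps (no parallelism yet; the `numbers` list is a made-up example, and `make_chunks()` is repeated so the snippet runs on its own):

```python
import math

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

numbers = [3, 8, 1, 9, 4, 7]  # illustrative data, not from the lesson

# Divide: break the list into three chunks.
chunks = make_chunks(numbers, 3)

# Map: apply the mapper (here the max() built-in) to each chunk in turn.
chunk_maxima = [max(chunk) for chunk in chunks]

print(chunks)        # [[3, 8], [1, 9], [4, 7]]
print(chunk_maxima)  # [8, 9, 7]
```

The parallel version replaces the list comprehension with calls that run in separate processes; the result per chunk is the same.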



### To make it easier to use, we can transform it into a function. The function has three arguments:

1. mapper: the function that we want to apply to each chunk. In the example above, it would be the max() built-in function.

2. data: the data.

3. num_processes: the number of processes to use.

The function implementation is the same as the code above, except that we replaced the max() built-in function with the provided mapper argument:


### We've provided you with the map_parallel() function and a list of numbers, values, that contains the same numbers as the diagram above.

1. Call the map_parallel() function using the max() built-in function as mapper, the values list as data, and 5 processes. Assign the result to a variable named results.

2. Inspect the value of results, and compare it with the diagram. It should be the same five numbers.


In [4]:
import concurrent.futures

def map_parallel(mapper, data, num_processes):
    chunks = make_chunks(data, num_processes)
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_processes) as executor:
        futures = [executor.submit(mapper, chunk) for chunk in chunks]
    return [future.result() for future in futures]

values = [1, 4, 5, 2, 7, 21,     \
          31, 41, 3, 40, 5, 14,  \
          9, 32, 12, 18, 1, 30,  \
          6, 19, 23, 35, 12, 13, \
          0, 12, 42, 41, 11, 9]

# Write code here
results = map_parallel(max,values,5)
results

[21, 41, 32, 35, 42]

### We will learn a simpler way of doing this by using the Pool object from the multiprocessing module.

We can use the **Pool** object to create and run a group of processes. We can initialize a **Pool** object by providing the number of processes that we want to run, like this:

```
from multiprocessing import Pool
pool = Pool(num_processes)
```

If the number of processes is not set, it defaults to **os.cpu_count()**, which is the number of CPUs in the machine. We can check that value like this:

```
import os
print(os.cpu_count())
```

We can use the Pool.map() method to execute a function on chunks of data in parallel:
1. The first argument is the function we want to apply.
2. The second argument is an iterable containing the arguments. The function is called once for each element.

In our case, the second argument is a list that contains all chunks of data.

Note that after calling the **Pool.map()** method, we call the **Pool.close()** and **Pool.join()** methods. The **Pool.close()** method prevents any more tasks from being submitted to the pool, and we must call it before we can join the pool. As before, the **Pool.join()** method makes the main program wait for all worker processes to finish before it continues executing.

We've provided you with the **values** list from the previous screen. We already broke it into six chunks using the **make_chunks()** function.

1. Import the **Pool** object from the **multiprocessing** module.

2. Use the **Pool()** constructor to build a pool with six processes. Assign the instance to a variable named **pool**.

3. Use the **Pool.map()** method to apply the **max** function to **chunks**. Assign the result to a variable named **results**.

4. Use the **Pool.close()** method to close the process pool.

5. Use the **Pool.join()** method to wait for the processes to finish executing.

In [8]:
values = [1, 4, 5, 2, 7, 21,     \
          31, 41, 3, 40, 5, 14,  \
          9, 32, 12, 18, 1, 30,  \
          6, 19, 23, 35, 12, 13, \
          0, 12, 42, 41, 11, 9]

chunks = make_chunks(values, 6)

from multiprocessing import Pool
pool = Pool(6)

results = pool.map(max, chunks)
pool.close()
pool.join()
print(results)

[7, 41, 32, 30, 35, 42]


In reality, the **Pool.map()** method blocks until all tasks have finished and their results are available, so we don't actually need to call the Pool.join() method. However, it is still important to call the Pool.close() method so that the worker processes can shut down once the work is done. Alternatively, we can create the pool in a with statement, which shuts the pool down automatically when the block ends.

### We've provided you with the values list from the previous screen. We already broke it into six chunks using the make_chunks() function.

* Use a context manager (**with** statement) to create a **Pool** instance with six processes. Create it with the name **pool**.

* Inside the context manager, use the **Pool.map()** method to apply the **max** function to **chunks**. Assign the result to a variable named **results**.


In [19]:
values = [1, 4, 5, 2, 7, 21,     \
          31, 41, 3, 40, 5, 14,  \
          9, 32, 12, 18, 1, 30,  \
          6, 19, 23, 35, 12, 13, \
          0, 12, 42, 41, 11, 9]

chunks = make_chunks(values, 6)

# Write code here

with Pool(6) as pool:
    results = pool.map(max, chunks)
print(results)

[7, 41, 32, 30, 35, 42]


### We need a way to combine the results of each chunk into a global result for the original data.

For this, we can use the reduce function from the functools module. The functools.reduce() function takes two arguments:

1. A reducer function.
2. An iterable (a list for example) on which we want to apply the reducer function.

The **functools.reduce()** function starts by applying the provided function to the first two elements of the iterable. Then it applies the function to that result and the third element, then to the new result and the fourth element, and so on until a single result remains.

For example, imagine that the reducer function is the following add() function that adds two numbers:

```
def add(x, y):
    return x + y
```

Then **functools.reduce(add, [5, 7, 3, 9, 4, 6])** will add all values in the provided list by applying the **add()** function between the previous result and the next number in the list:

```
import functools
print(functools.reduce(add, [5, 7, 3, 9, 4, 6]))
```
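
To see the intermediate results that reduce passes along, one option is `itertools.accumulate()`, which applies the same two-argument function but keeps every running result. This is only a sketch for inspection; the lesson itself uses `functools.reduce()`.

```python
import functools
import itertools

def add(x, y):
    return x + y

numbers = [5, 7, 3, 9, 4, 6]

# Every running result: 5, 5+7, 12+3, 15+9, 24+4, 28+6.
running = list(itertools.accumulate(numbers, add))
print(running)  # [5, 12, 15, 24, 28, 34]

# reduce() returns only the last of those values.
total = functools.reduce(add, numbers)
print(total)    # 34
```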

### We've provided you with the values list from the previous screen.

1. Import the functools module.

2. Use the functools.reduce() function to calculate the maximum value of values. Assign the result to a variable named max_value.


In [12]:
values = [1, 4, 5, 2, 7, 21,     \
          31, 41, 3, 40, 5, 14,  \
          9, 32, 12, 18, 1, 30,  \
          6, 19, 23, 35, 12, 13, \
          0, 12, 42, 41, 11, 9]

# Write code here

import functools
max_value = functools.reduce(max, values)
max_value

42

### We now have everything we need to implement our own MapReduce framework. Let's review the steps and determine which function we should use for each step.

Let's begin with the case where we want to calculate the maximum value of a list of numbers. The workflow is the following:

1. Divide the data into chunks using the make_chunks() function.
2. Use the Pool.map() method with the max function to calculate the maximum value of each chunk.
3. Use the functools.reduce() function with the max function to combine the results of all chunks and identify the maximum value.

The list of numbers is in a variable named **data**. We've provided the variable **num_processes** with the number of processes to use.

1. Use the **make_chunks()** function to break the **data** into **num_processes** chunks. Assign the result to a variable named **chunks**.

2. Use a context manager to create a **Pool** with **num_processes** processes. Use it with the name **pool**.

3. Use the **Pool.map()** method to apply the **max()** built-in function to **chunks**. Assign the result to a variable named **chunk_results**.

4. Use the **functools.reduce()** function to apply the **max()** built-in function to **chunk_results**. Assign the result to a variable named **overall_result**.

5. Inspect the value of overall_result to ensure it is the maximum value of data.


In [20]:
data = [1, 4, 5, 2, 7, 21,     \
        31, 41, 3, 40, 5, 14,  \
        9, 32, 12, 18, 1, 30,  \
        6, 19, 23, 35, 12, 13, \
        0, 12, 42, 41, 11, 9]

num_processes = 5

# Write code here
chunks = make_chunks(data, num_processes)

with Pool(num_processes) as pool:
    chunk_results = pool.map(max, chunks)

overall_result = functools.reduce(max, chunk_results)
overall_result

42

We're going to finalize our work and create a map_reduce() function that we can use to apply MapReduce to any dataset. Here's the information a user needs to provide:

1. The dataset itself.
2. The number of processes to use (in other words, the number of chunks).
3. The function that applies to each chunk. We call this the "mapper function".
4. The function that combines the results of each chunk. We call this the "reducer function".

This means that the map_reduce() function will have the following signature:

```
def map_reduce(data, num_processes, mapper, reducer):
    # Implementation goes here
```

The implementation will be very similar to what we did in the previous cell. The difference is that instead of applying the max() built-in function, we will apply the provided mapper and reducer functions.

The first step is to break the data into num_processes chunks:

```
def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
```

Then we need to create a pool and use the Pool.map() method to apply the mapper function to each chunk of data:

```
def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
```

Finally, we use the functools.reduce() function to combine the chunk_results into a global result and return it:

```
def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)
```
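
When debugging a mapper or reducer, it can help to run the same workflow without any processes. The following is a sketch of a sequential stand-in, not part of the lesson's framework: `map_reduce_sequential` is a hypothetical helper that swaps `Pool.map()` for the built-in `map()`.

```python
import functools
import math

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce_sequential(data, num_chunks, mapper, reducer):
    # Hypothetical helper: same steps as map_reduce(), but in one process.
    chunks = make_chunks(data, num_chunks)
    chunk_results = list(map(mapper, chunks))  # stand-in for pool.map()
    return functools.reduce(reducer, chunk_results)

data = [1, 4, 5, 2, 7, 21,
        31, 41, 3, 40, 5, 14,
        9, 32, 12, 18, 1, 30,
        6, 19, 23, 35, 12, 13,
        0, 12, 42, 41, 11, 9]

result = map_reduce_sequential(data, 4, max, max)
print(result)  # 42
```

If the sequential and parallel versions disagree, the mapper or reducer probably depends on something that differs between processes.
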

The list of numbers is in a variable named data. We've provided the map_reduce() function that we developed in this lesson.

* Use the map_reduce() function to calculate the maximum value of data. Use 4 processes. As before, the mapper and reducer functions are both the max() built-in function.

* Assign the result from the previous step to a variable named max_value.


In [23]:
import math
from multiprocessing import Pool
import functools

data = [1, 4, 5, 2, 7, 21,     \
        31, 41, 3, 40, 5, 14,  \
        9, 32, 12, 18, 1, 30,  \
        6, 19, 23, 35, 12, 13, \
        0, 12, 43, 41, 11, 9]

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)

# Write code here
max_value = map_reduce(data, 4, max, max)
print(max_value)

43


## We've read the data into the job_postings and skills DataFrames. Your goal is to implement the mapper() and reducer() functions and apply the map_reduce() function. This time, we'll break the skills DataFrame into chunks.

1. Define a function named mapper() with a single argument named skill_chunk.

2. Implement the mapper() function by doing the following:
    1. Initializing an empty dictionary named frequency.
    2. For each skill_name in skill_chunk["Name"], count the number of occurrences of skill_name in the job_postings DataFrame (note that job_postings is defined outside the function and is not passed as an argument).
    3. Assign the result to frequency[skill_name].
    4. After the for loop, return the frequency dictionary.

3. Define a function named reducer() with two arguments freq_chunk1 and freq_chunk2. These arguments will be the frequency tables of two chunks. This function should merge the results.

4. Implement the reducer() function by using the dict.update() method to merge freq_chunk1 and freq_chunk2 into a single dictionary. Return the merged result.

5. Call the map_reduce() with arguments: skills, 4, mapper, and reducer. Assign the result to skill_freq.
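
One detail worth seeing in isolation: `dict.update()` overwrites values for keys that appear in both dictionaries rather than adding them. That is fine here under the assumption that each skill name appears in only one chunk. The sketch below uses made-up counts and a variant of the reducer that copies its first argument instead of modifying it in place.

```python
def reducer(freq_chunk1, freq_chunk2):
    # Copy first so that neither input dictionary is modified.
    merged = dict(freq_chunk1)
    merged.update(freq_chunk2)
    return merged

# Disjoint keys (the expected case): the result is the union.
disjoint = reducer({"python": 10, "sql": 7}, {"spark": 3})
print(disjoint)  # {'python': 10, 'sql': 7, 'spark': 3}

# Overlapping keys: the second value replaces the first, it is not summed.
overlap = reducer({"sql": 7}, {"sql": 2})
print(overlap)   # {'sql': 2}
```

When the same key can occur in several chunks, the reducer needs to add the counts instead.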


In [27]:
from multiprocessing import Pool
import functools

import pandas as pd
job_postings = pd.read_csv("DataEngineer.csv")
job_postings["Job Description"] = job_postings["Job Description"].str.lower()
skills = pd.read_csv("Skills.csv")

def mapper(skill_chunk):
    frequency = {}
    for skill_name in skill_chunk["Name"]:
        frequency[skill_name] = job_postings["Job Description"].str.count(skill_name).sum()
    return frequency

def reducer(freq_chunk1, freq_chunk2):
    freq_chunk1.update(freq_chunk2)
    return freq_chunk1

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)

#skill_freq = map_reduce(skills, 4, mapper, reducer)

In [28]:
import pandas as pd
job_postings = pd.read_csv("DataEngineer.csv")
job_postings["Job Description"] = job_postings["Job Description"].str.lower()
skills = pd.read_csv("Skills.csv")

# Write code here
def mapper(jobs_chunk):
    frequency = {}
    for skill_name in skills["Name"]:
        frequency[skill_name] = jobs_chunk["Job Description"].str.count(skill_name).sum()
    return frequency

def reducer(freq_chunk1, freq_chunk2):
    merged = {}
    for skill in freq_chunk1:
        merged[skill] = freq_chunk1[skill] + freq_chunk2[skill]
    return merged

#skill_freq = map_reduce(job_postings, 4, mapper, reducer)

### In the Introduction to MapReduce lesson, we learned about MapReduce and implemented our own MapReduce framework. Here is the workflow for using it:

1. Implement a mapper function that takes as input a single chunk of data. This function should process the data chunk and return the processed data.

2. Implement a reducer function that takes as input two results from the mapper function and combines them into a single result.

3. Provide both functions to the MapReduce framework. The data will then be divided into chunks and processed in parallel automatically.


The map_reduce() function from the Introduction to MapReduce lesson will be available on all screens. It takes four arguments as input:

1. The data
2. The number of processes
3. The mapper function
4. The reducer function

### We've provided a list of numbers named values.

1. Use the map_reduce() function to calculate the minimum value in the values list using 4 processes. Assign the result to a variable named min_value.

In [30]:
from multiprocessing import Pool
import functools

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)

values = [98, 63, 55, 80, 45, 51, 91, 64, 65, 48, 48, 92, 76, 99, 57, 42, 79, 61, 63, 49]

min_value = map_reduce(values, 4, min, min)
print(min_value)

42


### We'll use MapReduce to calculate the length of the longest word in the English language.

#### We've read the word_list.txt file into a list named words.

1. Define a function named map_max_length with a single argument named words_chunk.

2. Implement the map_max_length() function so that it returns the length of the longest string in the words_chunk list.

3. Use the map_reduce() function to calculate the length of the longest string in words using 4 processes. Assign the result to a variable named max_len.

4. Print the value of max_len to see the length of the longest English word.

In [46]:
# import math
# from multiprocessing import Pool
# import functools

# with open("english_words.txt") as f:
#     words = [word.strip() for word in f.readlines()]

# def map_max_length(words_chunk):
#     return max([len(word) for word in words_chunk])

# def make_chunks(data, num_chunks):
#     chunk_size = math.ceil(len(data) / num_chunks)
#     return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

# if __name__ == "__main__":
#     __spec__ = None
#     def map_reduce(data, num_processes, mapper, reducer):
#         chunks = make_chunks(data, num_processes)
#         with Pool(num_processes) as pool:
#             chunk_results = pool.map(mapper, chunks)
#         return functools.reduce(reducer, chunk_results)

#     max_len = map_reduce(words, 4, map_max_length, max)
#     print(max_len)

%run longest_english_word.py

21


### Let's write another MapReduce program that calculates the actual word instead of the length.

To calculate the longest string in a list, we can actually use the max() built-in function. From the documentation, we see that the max() function accepts an argument called key. We can use this argument to specify how to compare the values in the list.

The key argument should be a function that takes one element and returns a value that can be compared, such as a number. This function is applied to each element, and the element with the highest key value is returned. By using the len() built-in function as the key argument, we get the longest string:

```
max_str = max(["science", "programming", "database", "python"], key=len)
print(max_str)

programming
```
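
When several elements share the highest key value, `max()` returns the first one it encounters. The sketch below shows that tie behavior next to one possible reducer that follows the same rule (the tied word list is made up for illustration):

```python
words = ["science", "programming", "database", "python"]
longest = max(words, key=len)
print(longest)  # programming

# Three candidates, two of them with length 4: the first one wins.
tied = max(["data", "code", "sql"], key=len)
print(tied)     # data

def reduce_max_len_str(word1, word2):
    # Keep word1 on ties, matching the behavior of max().
    return word1 if len(word1) >= len(word2) else word2

print(reduce_max_len_str("data", "code"))  # data
print(reduce_max_len_str("sql", "code"))   # code
```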

### We've read the word_list.txt file into a list named words.

1. Define a function named map_max_len_str with a single argument named words_chunk.

2. Implement the map_max_len_str() function so that it returns the longest string in words_chunk. You can use the max() built-in function with the key keyword argument to implement it.

3. Define a function reduce_max_len_str with two arguments word1 and word2.

4. Implement the reduce_max_len_str() function so that it returns word1 if its length is greater than or equal to the length of word2. Otherwise, return word2.

5. Use the map_reduce() function to calculate the longest string in words using 4 processes. Assign the result to a variable named max_len_str.

6. Print the value of max_len_str to see the longest English word.


In [47]:
# import math
# from multiprocessing import Pool
# import functools

# with open("english_words.txt") as f:
#     words = [word.strip() for word in f.readlines()]
    
# def map_max_len_str(words_chunk):
#     return max(words_chunk, key=len) 

# def reduce_max_len_str(word1, word2):
#     return map_max_len_str([word1, word2])

# def make_chunks(data, num_chunks):
#     chunk_size = math.ceil(len(data) / num_chunks)
#     return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

# if __name__ == "__main__":
#     __spec__ = None
#     def map_reduce(data, num_processes, mapper, reducer):
#         chunks = make_chunks(data, num_processes)
#         with Pool(num_processes) as pool:
#             chunk_results = pool.map(mapper, chunks)
#         return functools.reduce(reducer, chunk_results)

#     max_len_str = map_reduce(words, 4, map_max_len_str, reduce_max_len_str)
#     print(max_len_str)
    
%run longest_english_string.py

chromophotolithograph


### How can we use MapReduce to find out if a given word is in the list of English words we're using in this lesson?

For each chunk of words, we can use the in operator to check whether the word is in the chunk. This means that the mapper function will map each chunk into a Boolean value: True if the word is in the chunk and False otherwise.

Then, the reducer function needs to combine these results in a way that tells us if the word is in any of the chunks. We can do this by using the or Boolean operator to combine the results of the two chunks. Given any number of Boolean values, the logical or of all those values will be true if at least one of them is true.
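
A minimal sequential sketch of this idea, using three hand-made chunks and a made-up target instead of the real word list:

```python
import functools

target = "python"  # illustrative target, defined outside the functions

def map_contains(words_chunk):
    return target in words_chunk

def reduce_contains(contains1, contains2):
    return contains1 or contains2

chunks = [["data", "science"], ["python", "sql"], ["pandas"]]

# Map: one Boolean per chunk.
chunk_results = [map_contains(chunk) for chunk in chunks]
print(chunk_results)  # [False, True, False]

# Reduce: True as soon as any chunk reported True.
is_contained = functools.reduce(reduce_contains, chunk_results)
print(is_contained)   # True
```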

### We've read the word_list.txt file into a list named words. The target string (the one we're looking for) is in a variable named target. Note that target is not passed as an argument to either the mapper or the reducer. They can access it because it is defined outside the functions.

1. Define a function named map_contains with a single argument named words_chunk.

2. Implement the map_contains() function so that it returns True if the words_chunk contains the string in target.

3. Define a function reduce_contains with two Boolean arguments contains1 and contains2.

4. Implement the reduce_contains function so that it returns True if at least one of contains1 or contains2 is true. Otherwise, the function should return False.

5. Use the map_reduce() function to check whether the words list contains the word pneumonoultramicroscopicsilicovolcanoconiosis, which is stored in the variable target. Assign the result to a variable named is_contained.

6. Print the value of is_contained.

In [49]:
# import math
# from multiprocessing import Pool
# import functools

# with open("english_words.txt") as f:
#     words = [word.strip() for word in f.readlines()]

# target = "pneumonoultramicroscopicsilicovolcanoconiosis"

# def map_contains(words_chunk):
#     return target in words_chunk
    
# def reduce_contains(contains1, contains2):
#     return contains1 or contains2

# def make_chunks(data, num_chunks):
#     chunk_size = math.ceil(len(data) / num_chunks)
#     return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

# if __name__ == "__main__":
#     __spec__ = None
#     def map_reduce(data, num_processes, mapper, reducer):
#         chunks = make_chunks(data, num_processes)
#         with Pool(num_processes) as pool:
#             chunk_results = pool.map(mapper, chunks)
#         return functools.reduce(reducer, chunk_results)

#     is_contained = map_reduce(words, 4,map_contains, reduce_contains)
#     print(is_contained)

%run processing_data_w_map_reduce.py

False


Let's continue practicing MapReduce by counting the frequency of the characters throughout the entire list of words. The idea is to see which letters from the English alphabet appear most frequently. For example, we want to count how many times the letter a appears, how many times the letter b appears, and so on.

If the list only had the words data and science then the result would be:

```
{
    'd': 1, 
    'a': 2, 
    't': 1, 
    's': 1, 
    'c': 2, 
    'i': 1, 
    'e': 2, 
    'n': 1
}
```

We're counting characters rather than words. The mapper function will create a frequency table for the provided chunk of words.

Then, the reducer will take two frequency tables and merge them by adding together the counts of each letter. In this case, the two dictionaries will not necessarily have the same set of keys because some characters might not occur in one of the chunks.

For example, if the result of one chunk is {'a': 3, 'b': 2} and the result of the other chunk is {'b': 2, 'c': 2} then the reduce result must be:

```
{
    'a': 3,
    'b': 4,
    'c': 2
}
```
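
As an alternative to writing the counting and merging loops by hand, the standard library's `collections.Counter` can do both. This is a sketch of a different technique from the dictionary-based steps in the exercise, shown here only to check the two examples above:

```python
from collections import Counter

def map_char_count(words_chunk):
    # Count every character across all words in the chunk.
    return Counter("".join(words_chunk))

def reduce_char_count(freq1, freq2):
    # Adding two Counters sums the counts key by key.
    return freq1 + freq2

small = map_char_count(["data", "science"])
print(dict(small))
# {'d': 1, 'a': 2, 't': 1, 's': 1, 'c': 2, 'i': 1, 'e': 2, 'n': 1}

merged = reduce_char_count(Counter({"a": 3, "b": 2}), Counter({"b": 2, "c": 2}))
print(dict(merged))  # {'a': 3, 'b': 4, 'c': 2}
```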

### We've read the word_list.txt file into a list named words.

1. Define a function named map_char_count with a single argument named words_chunk.

2. Implement the map_char_count() function by following these steps:
    1. Initialize an empty dictionary in a variable named char_freq.
    2. Use a for loop over words_chunk with the variable word.
    3. Inside, use a second for loop over word with variable c. This loop will iterate over all characters in word.
    4. Use the in operator to check whether c is in char_freq. If it isn't, set char_freq[c] to 0.
    5. Inside the second for loop but outside of the if statement, increment the value of char_freq[c] by 1.
    6. At the end, outside of the two for loops, return the value of char_freq.

3. Define a function named reduce_char_count with two arguments named freq1 and freq2. These will be two dictionaries with character frequencies of two chunks of words.

4. Implement the reduce_char_count() function by following these steps:
    1. Use a for loop to iterate over all keys in freq2 using variable c.
    2. If c is in freq1, add freq2[c] to freq1[c].
    3. Otherwise, set the value of freq1[c] to freq2[c].
    4. Outside of the for loop, return the value of freq1.

5. Use the map_reduce() function to build a character frequency table of the words list using 4 processes. Assign the result to a variable named char_freq.

6. Print the value of char_freq and see which English character appears most frequently.

In [50]:
# import math
# from multiprocessing import Pool
# import functools

# with open("english_words.txt") as f:
#     words = [word.strip() for word in f.readlines()]
    
# def map_char_count(words_chunk):
#     char_freq = {}
#     for word in words_chunk:
#         for c in word:
#             if c not in char_freq:
#                 char_freq[c] = 0
#             char_freq[c] += 1
#     return char_freq

# def reduce_char_count(freq1, freq2):
#     for c in freq2:
#         if c in freq1:
#             freq1[c] += freq2[c]
#         else:
#             freq1[c] = freq2[c]
#     return freq1

# def make_chunks(data, num_chunks):
#     chunk_size = math.ceil(len(data) / num_chunks)
#     return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

# if __name__ == "__main__":
#     __spec__ = None
#     def map_reduce(data, num_processes, mapper, reducer):
#         chunks = make_chunks(data, num_processes)
#         with Pool(num_processes) as pool:
#             chunk_results = pool.map(mapper, chunks)
#         return functools.reduce(reducer, chunk_results)

#     char_freq = map_reduce(words, 4, map_char_count, reduce_char_count)
#     print(char_freq)
    
%run char_freq.py

{'a': 71683, 'm': 26264, 'r': 61289, 'o': 59303, 'n': 55947, 'i': 73051, 'c': 38519, 'l': 47576, 'b': 15500, 't': 59007, 'e': 90720, 's': 50422, 'u': 30414, 'k': 5278, 'd': 24568, 'f': 10137, 'y': 18011, 'g': 18161, 'h': 21742, 'v': 8166, 'x': 2701, 'w': 5511, 'p': 27149, 'j': 1250, 'q': 1656, 'z': 2722}


Now let's calculate the average length of English words.

The average length is the sum of the word lengths divided by the number of words. For example, for four words with lengths 4, 9, 11, and 7:

```
(4 + 9 + 11 + 7) / 4 = 31 / 4 = 7.75
```

We might think to implement the mapper function by calculating the average of the given chunk. However, the chunks don't necessarily all have the same size, so the average of the chunk averages is generally not the global average, and the reducer, which merges two results at a time, would need to know the chunk sizes to combine them correctly.

A simpler solution is to make mappers calculate the sum of the lengths of all words in that chunk divided by the total number of words in the entire word list. In this way, the reducer function can simply add the results to get the overall average.
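
The difference shows up as soon as the chunks are uneven. The sketch below uses four made-up words with lengths 4, 9, 11, and 7, split by hand into chunks of three and one; `chunk_average` is a hypothetical helper for the naive approach.

```python
import functools

words = ["data", "databases", "programming", "science"]  # lengths 4, 9, 11, 7
chunks = [words[:3], words[3:]]                          # deliberately uneven

def chunk_average(words_chunk):
    # Naive mapper: average within the chunk only.
    return sum(len(word) for word in words_chunk) / len(words_chunk)

def map_average(words_chunk):
    # Mapper from the text: chunk sum divided by the TOTAL number of words.
    return sum(len(word) for word in words_chunk) / len(words)

def reduce_average(res1, res2):
    return res1 + res2

true_average = sum(len(word) for word in words) / len(words)
naive = sum(chunk_average(chunk) for chunk in chunks) / len(chunks)
combined = functools.reduce(reduce_average, [map_average(chunk) for chunk in chunks])

print(true_average)  # 7.75
print(naive)         # 7.5  (average of 8.0 and 7.0)
print(combined)      # 7.75 (6.0 + 1.75)
```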

### We've read the word_list.txt file into a list named words.

1. Define a function named map_average with a single argument named words_chunk.

2. Implement the map_average() function with the following steps:
    1. Calculate the sum of the lengths of all words in words_chunk.
    2. Divide the result by the length of words. The words variable is available from the outside.

3. Define a function named reduce_average with two arguments named res1 and res2.

4. Implement the reduce_average() function so that it returns the value of res1 + res2.

5. Use the map_reduce() function to calculate the average length of the words in the words list using 4 processes. Assign the result to a variable named average_word_len.

6. Print the value of average_word_len and see the average length of the English words.

In [51]:
# import math
# from multiprocessing import Pool
# import functools

# with open("english_words.txt") as f:
#     words = [word.strip() for word in f.readlines()]
    
# def map_average(words_chunk):
#     length = 0
#     for word in words_chunk:
#         length += len(word)
#     return length / len(words)

# def reduce_average(res1, res2):
#     return res1 + res2

# def make_chunks(data, num_chunks):
#     chunk_size = math.ceil(len(data) / num_chunks)
#     return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

# if __name__ == "__main__":
#     __spec__ = None
#     def map_reduce(data, num_processes, mapper, reducer):
#         chunks = make_chunks(data, num_processes)
#         with Pool(num_processes) as pool:
#             chunk_results = pool.map(mapper, chunks)
#         return functools.reduce(reducer, chunk_results)

#     average_word_len = map_reduce(words, 4, map_average, reduce_average)
#     print(average_word_len)
    
%run avg_word_length.py

8.622276685612974


### We are interested in pairs of characters that appear next to each other the least often. For example, consider the word science. The pairs of characters that occur next to each other are as follows:

```
sc
ci
ie
en
nc
ce
```

We can calculate them using a for loop, like so:
```
word = "science"
for i in range(len(word) - 1):
    seq = word[i] + word[i + 1]
    print(seq)
```

The goal is to find which pairs of characters occur next to each other in only one word.

For this, we will use MapReduce to build a frequency table of all pairs of consecutive characters in all words of the dataset. The pairs that occur only once will be the ones with a frequency equal to one.

The mapper function will build the frequency table of each chunk, and the reducer function will merge the frequency tables in the same way as we did before.
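
A compact sequential sketch of the whole idea on two made-up words, each treated as its own chunk. It uses `zip()` and `collections.Counter` rather than the index loop and plain dictionaries, so it is an alternative formulation and not the exercise's expected solution:

```python
import functools
from collections import Counter

def adjacent_pairs(word):
    # Pair each character with the one that follows it.
    return [a + b for a, b in zip(word, word[1:])]

def map_adjacent(words_chunk):
    return Counter(pair for word in words_chunk for pair in adjacent_pairs(word))

def reduce_adjacent(freq1, freq2):
    return freq1 + freq2

print(adjacent_pairs("science"))  # ['sc', 'ci', 'ie', 'en', 'nc', 'ce']

chunks = [["science"], ["scent"]]  # illustrative chunks
pair_freq = functools.reduce(reduce_adjacent, [map_adjacent(c) for c in chunks])
unique_pairs = [pair for pair, count in pair_freq.items() if count == 1]
print(unique_pairs)  # ['ci', 'ie', 'nc', 'nt']
```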

### We've read the word_list.txt file into a list named words.

1. Define a function named map_adjacent with a single argument named words_chunk.

2. Implement the map_adjacent() function so that it calculates a frequency table of the pairs of adjacent characters of all words in words_chunk.

3. Define a function named reduce_adjacent with two arguments named freq1 and freq2.

4. Implement the reduce_adjacent() function so that it returns a dictionary that results from merging freq1 and freq2.

5. Use the map_reduce() function to build a frequency table of all pairs of consecutive characters in the words list using 4 processes. Assign the result to a variable named pair_freq.

6. Use a for loop to create a list unique_pairs with all keys from pair_freq whose value is one.

In [53]:
# import math
# from multiprocessing import Pool
# import functools

# with open("english_words.txt") as f:
#     words = [word.strip() for word in f.readlines()]
    
# def map_adjacent(words_chunk):
#     adj_freq = {}
#     for word in words_chunk:
#         for i in range(len(word) - 1):
#             seq = word[i] + word[i + 1]
#             if seq not in adj_freq:
#                 adj_freq[seq] = 0
#             adj_freq[seq] += 1
#     return adj_freq


# def reduce_adjacent(freq1, freq2):
#     for seq in freq2:
#         if seq in freq1:
#             freq1[seq] += freq2[seq]
#         else:
#             freq1[seq] = freq2[seq]
#     return freq1

# def make_chunks(data, num_chunks):
#     chunk_size = math.ceil(len(data) / num_chunks)
#     return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

# if __name__ == "__main__":
#     __spec__ = None
#     def map_reduce(data, num_processes, mapper, reducer):
#         chunks = make_chunks(data, num_processes)
#         with Pool(num_processes) as pool:
#             chunk_results = pool.map(mapper, chunks)
#         return functools.reduce(reducer, chunk_results)

#     pair_freq = map_reduce(words, 4, map_adjacent, reduce_adjacent)
#     unique_pairs = [seq for seq in pair_freq if pair_freq[seq] == 1]
#     print(unique_pairs)
    
%run rare_adj_chars.py

['zt', 'zp', 'xk', 'wq', 'vn', 'gv', 'cg', 'zg', 'fj', 'cf', 'jr', 'vh', 'jh', 'gj', 'jy', 'pv', 'cw', 'zq', 'zm', 'yq', 'rx', 'zv', 'vd', 'vl']
