# Multiple Tests Below
Here are several tests that show how the engine works.  
All code was tested on an AWS m4.2xlarge (8 CPUs and 32 GB of RAM) with Python 3. It also ran successfully on Python 2, so the code appears to be compatible with both Python 2 and Python 3.

In [1]:
# first make the output directory
!mkdir -p /home/ubuntu/cluster_results

## Setup
__User-Defined Function (UDF)__: The user writes a function and passes it as the MCE argument ```function_to_process```. It must accept the following mandatory arguments: ```input_file_name, cluster_output_dir, cluster_id, n_cpus, cpu_id```. The MCE passes these as keyword arguments into the UDF, so your UDF must accept them but doesn't have to use them. In UDF Example 0 below, ```save_a_string``` just writes a few strings to a file. Notice that the UDF must import any libraries it uses.  
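
To make the signature requirement concrete, here is a minimal sketch of a UDF skeleton. The function name ```count_lines_udf``` and its ```prefix``` argument are hypothetical (they are not part of MCE), and the function is called directly by keyword here rather than through the engine. Returning the output path is only for this demonstration.

```python
import os
import tempfile


def count_lines_udf(
    prefix,  # UDF-specific argument; with MCE it would come from function_kwargs_dict
    # mandatory arguments; the UDF must accept them even if it ignores some
    input_file_name, cluster_output_dir, cluster_id, n_cpus, cpu_id,
):
    import os  # the UDF imports everything it uses

    with open(input_file_name) as in_:
        n_lines = sum(1 for _ in in_)
    output_file_name = os.path.join(
        cluster_output_dir,
        'cluster{}_cpu{}_{}'.format(cluster_id, cpu_id, os.path.basename(input_file_name)),
    )
    with open(output_file_name, 'w') as out_:
        out_.write('{}{}\n'.format(prefix, n_lines))
    return output_file_name


# Build a small input file in a temporary directory and call the UDF by keyword.
work_dir = tempfile.mkdtemp()
sample_file = os.path.join(work_dir, 'sample.tmp')
with open(sample_file, 'w') as f:
    f.write('a\nb\nc\n')

result_path = count_lines_udf(
    prefix='lines=',
    input_file_name=sample_file,
    cluster_output_dir=work_dir,
    cluster_id=0, n_cpus=1, cpu_id=0,
)
with open(result_path) as f:
    print(f.read().strip())  # lines=3
```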

__MCE arguments__: This must be a dictionary of the keyword arguments you pass into the MCE instance. It must have the following arguments (all of which you will see in the examples):  
* cluster_job_name: whatever you would like to name the cluster job. It becomes part of the final output directory name, so use a string without spaces or unusual characters. 
* n_cpus_list: the number of CPUs for each cluster, with the largest cluster first; for example: [5, 2, 2]. You should rarely make a cluster with 1 CPU because that defeats the purpose of multi-CPU map-reduce.
* ram_limit_in_GB: threshold for total RAM usage across all clusters. If the clusters use more RAM than this, MCE automatically kills the cluster that is using the most RAM.   
* wait_time_in_seconds: the number of seconds between RAM profiling checks, which is also how often MCE checks whether any cluster is available to receive the next file. I just use 1 second, so it profiles every second. 
* input_file_names: a list of file names. Absolute paths are preferred, as relative paths are not guaranteed to work. 
* output_parent_dir: the parent directory that will contain the final output directory. An absolute path is preferred, and this directory must already exist.
* function_to_process: your UDF goes here.
* function_kwargs_dict: a dictionary of arguments specific to your UDF. If your UDF takes no arguments beyond the mandatory ones, put an empty dictionary here.  
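
The constraints in the list above can be checked before starting any cluster. Here is a minimal sketch of such a check. ```check_mce_args``` is a hypothetical helper, not part of MCE, and the regular expression is only my reading of "no spaces or unusual characters".

```python
import os
import re
import tempfile

REQUIRED_KEYS = [
    'cluster_job_name', 'n_cpus_list', 'ram_limit_in_GB', 'wait_time_in_seconds',
    'input_file_names', 'output_parent_dir', 'function_to_process', 'function_kwargs_dict',
]


def check_mce_args(mce_args):
    """Return a list of problems found in an MCE argument dictionary (hypothetical helper)."""
    problems = ['missing argument: ' + key for key in REQUIRED_KEYS if key not in mce_args]
    if problems:
        return problems
    # Assumption: letters, digits, underscores and hyphens are the "safe" characters.
    if not re.match(r'^[A-Za-z0-9_-]+$', mce_args['cluster_job_name']):
        problems.append('cluster_job_name has spaces or unusual characters')
    n_cpus_list = mce_args['n_cpus_list']
    if not n_cpus_list or n_cpus_list[0] != max(n_cpus_list):
        problems.append('n_cpus_list should start with the largest cluster')
    if not os.path.isdir(mce_args['output_parent_dir']):
        problems.append('output_parent_dir does not exist')
    relative = [f for f in mce_args['input_file_names'] if not os.path.isabs(f)]
    if relative:
        problems.append('{} input file name(s) are relative paths'.format(len(relative)))
    if not callable(mce_args['function_to_process']):
        problems.append('function_to_process is not callable')
    if not isinstance(mce_args['function_kwargs_dict'], dict):
        problems.append('function_kwargs_dict is not a dictionary')
    return problems


parent_dir = tempfile.mkdtemp()
good_args = {
    'cluster_job_name': 'demo_job_',
    'n_cpus_list': [4, 3, 2],
    'ram_limit_in_GB': 20.0,
    'wait_time_in_seconds': 1,
    'input_file_names': [os.path.join(parent_dir, '0.tmp')],
    'output_parent_dir': parent_dir,
    'function_to_process': len,  # placeholder callable standing in for a real UDF
    'function_kwargs_dict': {},
}
bad_args = dict(good_args, cluster_job_name='demo job', n_cpus_list=[2, 4])

print(check_mce_args(good_args))  # no problems found
print(check_mce_args(bad_args))   # job name and CPU order are flagged
```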

__MCE instance__: The instance of the MultipleClusterEngine class which you instantiate with your mce arguments dictionary. To run the whole thing, just call the main() method, which will call the following methods: 
* create_cluster_output_dir(): create the final output directory, which is based on your output_parent_dir + cluster_job_name + a uniquely incremented number
* start_all_clusters(): starts all the clusters with the number of CPUs you set
* run_clusters(): doles out the files to be processed by the clusters. Most of the runtime should be here
* kill_all_clusters(): kills all the clusters when the processing is completed. 

## Debugging: 
If MCE.main() finishes without error, the clusters are killed normally and everything is fine. Sometimes, if your UDF fails or something unexpected happens, kill_all_clusters() won't work. In that case, use ```htop``` or ```ps aux``` to see whether the clusters are still running. For the following example, to manually kill the clusters:
```bash
# kill each of the 3 clusters since n_cpus_list has length 3
!ipcluster stop --profile save_to_string_0
!ipcluster stop --profile save_to_string_1
!ipcluster stop --profile save_to_string_2

```
Also, once a file has been sent to a cluster, you cannot stop it from inside Python. Hence, you have to manually kill the cluster to get the job to stop.
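
If you have many clusters, typing one stop command per cluster gets tedious. Here is a small sketch that generates the same commands as the snippet above. ```ipcluster_stop_commands``` is a hypothetical helper, and the profile prefix is copied from that snippet; I am assuming the profile names are the prefix followed by the cluster number.

```python
def ipcluster_stop_commands(profile_prefix, n_clusters):
    """Build one 'ipcluster stop' command per cluster (hypothetical helper)."""
    return ['ipcluster stop --profile {}{}'.format(profile_prefix, cluster_number)
            for cluster_number in range(n_clusters)]


# n_cpus_list has length 3 in the example, so there are 3 clusters to stop.
commands = ipcluster_stop_commands('save_to_string_', 3)
for command in commands:
    print(command)
    # To actually run it: import subprocess; subprocess.call(command.split())
```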

## Results
__```MCE.cluster_output_dir```__: This is the final output directory which is composed of ```output_parent_dir``` + ```cluster_job_name``` + a uniquely incremented number. Hence, it is safe to run the same code with same configurations, as the output will be written to a different final output directory. Moreover, you will know which is the latest run, as the directory will have the highest number.  
__logging__: MCE will generate 3 logs which always include time stamps: ```status.log```, ```ram_usage.log```, ```failure.log```. 
* ```status.log``` tells you when the clusters started, which files were sent to which clusters, whether any clusters were killed prematurely, when the clusters were killed after processing completed, and how many files appear to have been successfully processed on how many surviving clusters. If certain files failed to process (for example, if one invoice file has a different delimiter than the others, your function will fail on it), then the number of files successfully processed will be less than the length of ```input_file_names```. If a cluster was killed prematurely (probably for exceeding the RAM limit), then the number of surviving clusters will be less than the length of ```n_cpus_list```. For a quick sanity check, see whether the number of files successfully processed equals the length of your input file list, which would indicate that everything worked smoothly. Of course, please also do a separate sanity check and look at your output files to make sure things actually worked correctly. 
* ```ram_usage.log```: Every ```wait_time_in_seconds```, the RAM usage for each cluster (and the sum across all clusters) is recorded. If RAM usage exceeds ```ram_limit_in_GB```, MCE automatically kills the cluster using the most RAM and records here the cluster number and which file it was processing. 
* ```failure.log```: If a function failed on a file, then the cluster number, error type, and file name will be recorded here. Also, if RAM usage exceeds ```ram_limit_in_GB```, a message will be recorded here too. If everything worked out perfectly, this file will be empty. The goal is to have this file be empty.  
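
Since the latest run has the highest number and the goal is an empty ```failure.log```, both checks are easy to script. Here is a minimal sketch. The helper names are hypothetical, and I am assuming that ```failure.log``` sits directly inside the final output directory; adjust the path if your logs live elsewhere.

```python
import os
import re
import tempfile


def latest_run_dir(output_parent_dir, cluster_job_name):
    """Return the run directory with the highest number, or None (hypothetical helper)."""
    pattern = re.compile(re.escape(cluster_job_name) + r'(\d+)$')
    runs = []
    for name in os.listdir(output_parent_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_parent_dir, name)):
            runs.append((int(match.group(1)), name))  # compare numerically, not as text
    if not runs:
        return None
    return os.path.join(output_parent_dir, max(runs)[1])


def failure_log_is_empty(run_dir):
    """True when failure.log has no content. Assumes the log is inside run_dir."""
    return os.path.getsize(os.path.join(run_dir, 'failure.log')) == 0


# Demonstration on made-up directories instead of a real MCE run.
parent_dir = tempfile.mkdtemp()
for run_number in (0, 1, 10):
    os.mkdir(os.path.join(parent_dir, 'demo_job_{}'.format(run_number)))
newest = latest_run_dir(parent_dir, 'demo_job_')
open(os.path.join(newest, 'failure.log'), 'w').close()  # create an empty log

print(os.path.basename(newest))      # demo_job_10
print(failure_log_is_empty(newest))  # True
```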

__```MCE.async_results_dict```__: basically the instance attribute that records the history of everything MCE did

## UDF Example 0

In [2]:
%%time
from multiple_cluster_engine import MultipleClusterEngine
import os

# create some fake files
!mkdir -p fake_data
for i in range(10):
    !echo {i + 10} > fake_data/{i}.tmp
    

def save_a_string(string_saved_to_file, # actual args used in this function
    # mandatory args, you can choose not to use them but function has to take them in
     input_file_name, cluster_output_dir, cluster_id, n_cpus, cpu_id  
):
    import os # function has to import all the libraries it uses
    with open(input_file_name, 'r') as in_:
        text = ''.join(in_.readlines())        
    output_file_name = '_'.join(['cluster' + str(cluster_id), 'cpu' + str(cpu_id), input_file_name.split('/')[-1]])
    output_file_name = os.path.join(cluster_output_dir, output_file_name)
    with open(output_file_name, 'w') as out_:
        out_.write('My cluster_id is {}\n'.format(cluster_id))
        out_.write('The number of CPUs in this cluster is {}\n'.format(n_cpus))
        out_.write('My CPU_id is {}\n'.format(cpu_id))
        out_.write('My string is: {}\n'.format(string_saved_to_file))


mce_args = {
    'input_file_names': sorted([os.path.abspath('fake_data/' + file_name) 
        for file_name in os.listdir('fake_data')]), # gets absolute file paths
    'output_parent_dir': '/home/ubuntu/cluster_results', # absolute path preferred, directory has to already exist
    'function_to_process': save_a_string,
    'function_kwargs_dict': {'string_saved_to_file': 'pee-a-boo!'}, # function arguments here
    'cluster_job_name': 'save_a_silly_string_', # no spaces as it will be part of directory name
    'n_cpus_list': [4, 3, 2], # 1st cluster is always the largest or equal to the other clusters
    'ram_limit_in_GB': 20.0,
    'wait_time_in_seconds': 1
    }

mce0 = MultipleClusterEngine(**mce_args)
mce0.main()

100%|██████████| 10/10 [00:10<00:00,  1.01s/it]


CPU times: user 380 ms, sys: 36.7 ms, total: 417 ms
Wall time: 1min


In [3]:
mce0.async_results_dict # contains the details of each AsyncResultObject such as time to complete, etc.

defaultdict(list,
            {0: [<AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>],
             1: [<AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>],
             2: [<AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>]})

## Explanation of Results Specifically for this Function
__```mce0.cluster_output_dir```__: Since ```output_parent_dir``` is set to ```/home/ubuntu/cluster_results``` (which must already exist) and ```cluster_job_name``` is set to ```save_a_silly_string_```, the final output directory is ```/home/ubuntu/cluster_results/save_a_silly_string_``` + a uniquely incremented number. The first time you run this code, final output directory will be ```/home/ubuntu/cluster_results/save_a_silly_string_0```. The second time you run this code, final output directory will be ```/home/ubuntu/cluster_results/save_a_silly_string_1```.  
__logging__:  if you want to see a log file update in real time, you can type ```tail -f status.log```
* ```status.log```: has start and kill time of clusters. Shows which file went to which cluster. We inputted 10 files and asked for 3 clusters. Last line says "Appears that 10 files were successfully processed using 3 surviving clusters in 0.18333333333333332 minutes". Looks like we are good.
* ```ram_usage.log```: Every 1 second (```wait_time_in_seconds```), RAM usage for each cluster (and the sum to get entire MCE RAM usage) is recorded. Our RAM usage never got too high. If you wish, you can write a script to extract the timestamps and RAM usage to make a cool plot!    
* ```failure.log```: Empty. That's what we want!  
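
The quick sanity check on the last line of ```status.log``` can be automated. Here is a minimal sketch that parses the summary line quoted above. ```parse_status_summary``` is a hypothetical helper, and it uses a search rather than a full match because the real log line also carries a timestamp.

```python
import re

SUMMARY_PATTERN = re.compile(
    r'Appears that (\d+) files were successfully processed '
    r'using (\d+) surviving clusters')


def parse_status_summary(last_line):
    """Return (n_files_processed, n_surviving_clusters), or None if the line does not match."""
    match = SUMMARY_PATTERN.search(last_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


last_line = ('Appears that 10 files were successfully processed '
             'using 3 surviving clusters in 0.18333333333333332 minutes')
n_files_ok, n_clusters_alive = parse_status_summary(last_line)

# Compare against what we asked for: 10 input files and 3 clusters.
print(n_files_ok == 10 and n_clusters_alive == 3)  # True
```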

__```mce0.async_results_dict```__: basically the history of everything MCE did. For example:
```python
history = mce0.async_results_dict[0][-1] # get the history of cluster 0 and the last file sent to it.
history.done() # returns True
history.elapsed # returns 0.81518 seconds
history.successful() # True since function worked. It will return False if there was an error
```
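
To look at the whole history at once rather than one entry at a time, you can loop over the dictionary. Here is a minimal sketch. ```FakeAsyncResult``` is a stand-in that mimics only the ```done()```, ```successful()``` and ```elapsed``` members used above, so the sketch runs without a cluster; ```summarize``` is a hypothetical helper.

```python
from collections import defaultdict


class FakeAsyncResult(object):
    """Stand-in exposing only the members used in the snippet above."""

    def __init__(self, elapsed):
        self.elapsed = elapsed

    def done(self):
        return True

    def successful(self):
        return True


def summarize(async_results_dict):
    """Map cluster number to (number of finished files, total elapsed seconds)."""
    summary = {}
    for cluster_id, results in async_results_dict.items():
        finished = [r for r in results if r.done() and r.successful()]
        summary[cluster_id] = (len(finished), sum(r.elapsed for r in finished))
    return summary


fake_history = defaultdict(list)
fake_history[0].extend([FakeAsyncResult(0.5), FakeAsyncResult(1.5)])
fake_history[1].append(FakeAsyncResult(1.0))

summary = summarize(fake_history)
for cluster_id in sorted(summary):
    print(cluster_id, summary[cluster_id])
```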

## UDF Example 1

In [4]:
%%time
from multiple_cluster_engine import MultipleClusterEngine
import os

# create some fake files
!mkdir -p fake_data
for i in range(10):
    !echo {i + 10} > fake_data/{i}.tmp


def error_func1( # this function takes no real args that are not mandatory
    # mandatory args, you can choose not to use them but function has to take them in
     input_file_name, cluster_output_dir, cluster_id, n_cpus, cpu_id):
    1 / 0
    

mce_args = {
    'input_file_names': sorted([os.path.abspath('fake_data/' + file_name)
        for file_name in os.listdir('fake_data')]), # gets absolute file paths    
    'output_parent_dir': '/home/ubuntu/cluster_results', # use absolute path since it's safer, has to already exist
    'function_to_process': error_func1,
    'function_kwargs_dict': {}, # error_func1() takes no additional arguments
    'cluster_job_name': 'error_one_', # no spaces as it will be part of directory name
    'n_cpus_list': [4, 3, 2], # 1st cluster is always the largest or equal to the other clusters
    'ram_limit_in_GB': 20.0,
    'wait_time_in_seconds': 1
    }

mce1 = MultipleClusterEngine(**mce_args)
mce1.main()

100%|██████████| 10/10 [00:10<00:00,  1.01s/it]


CPU times: user 338 ms, sys: 60.3 ms, total: 398 ms
Wall time: 1min


In [5]:
mce1.async_results_dict # when a function fails on a file, its async_result history is removed

defaultdict(list, {0: [], 1: [], 2: []})

## Explanation of Results Specifically for this Function
__```mce1.cluster_output_dir```__: Since ```output_parent_dir``` is set to ```/home/ubuntu/cluster_results``` (which must already exist) and ```cluster_job_name``` is set to ```error_one_```, the final output directory is ```/home/ubuntu/cluster_results/error_one_0``` the first time you run this code.  
__logging__:  
* ```status.log```: Last line: "Appears that 0 files were successfully processed using 3 surviving clusters in 0.18333333333333332 minutes". This is correct: 0 files were processed because our UDF always fails. Fortunately, all 3 clusters survived. 
* ```ram_usage.log```: Every 1 second (```wait_time_in_seconds```), RAM usage for each cluster (and the sum across all clusters) is recorded. Our RAM usage never got too high.
* ```failure.log```: Shows which file was sent to which cluster when the failure occurred. It reports that the function hit a ```ZeroDivisionError```, which is exactly what we expect from this UDF (```error_func1()``` divides by zero).  

__```mce1.async_results_dict```__: basically the history of everything MCE did. When a function fails on a file, the engine deletes the history for that file--I had to do it this way for the engine to work. That's why ```mce1.async_results_dict``` is equal to ```defaultdict(list, {0: [], 1: [], 2: []})```
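
Because the failed history is deleted, the quickest way to see the actual error is often to call the UDF directly, outside the engine. Here is a minimal sketch with dummy values for the mandatory arguments; the dummy values are arbitrary and the input file is never opened by this UDF.

```python
def error_func1(input_file_name, cluster_output_dir, cluster_id, n_cpus, cpu_id):
    1 / 0


dummy_kwargs = {
    'input_file_name': 'fake_data/0.tmp',  # never opened by this UDF
    'cluster_output_dir': '.',
    'cluster_id': 0,
    'n_cpus': 1,
    'cpu_id': 0,
}

try:
    error_func1(**dummy_kwargs)
    error_name = None
except Exception as error:
    error_name = type(error).__name__

print(error_name)  # ZeroDivisionError
```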

## UDF Example 2

In [6]:
%%time
from multiple_cluster_engine import MultipleClusterEngine
import os

# create some fake files
!mkdir -p fake_data
for i in range(10):
    !echo {i + 10} > fake_data/{i}.tmp


def error_func2( # this function takes no real args that are not mandatory
    # mandatory args, you can choose not to use them but function has to take them in 
     input_file_name, cluster_output_dir, cluster_id, n_cpus, cpu_id):
    '1' + 2
    
mce_args = {
    'input_file_names': sorted([os.path.abspath('fake_data/' + file_name) 
        for file_name in os.listdir('fake_data')]), # gets absolute file paths
    'output_parent_dir': '/home/ubuntu/cluster_results', # use absolute path since it's safer, has to already exist
    'function_to_process': error_func2,
    'function_kwargs_dict': {}, # error_func2() takes no additional arguments    
    'cluster_job_name': 'error_two_', # no spaces as it will be part of directory name
    'n_cpus_list': [4, 3, 2], # 1st cluster is always the largest or equal to the other clusters
    'ram_limit_in_GB': 20.0,
    'wait_time_in_seconds': 1
    }

mce2 = MultipleClusterEngine(**mce_args)
mce2.main()

100%|██████████| 10/10 [00:10<00:00,  1.01s/it]


CPU times: user 354 ms, sys: 45.4 ms, total: 399 ms
Wall time: 1min


In [7]:
mce2.async_results_dict # when a function fails on a file, its async_result history is removed

defaultdict(list, {0: [], 1: [], 2: []})

## Explanation of Results Specifically for this Function
__```mce2.cluster_output_dir```__: the final output directory is ```/home/ubuntu/cluster_results/error_two_0``` the first time you run this code.  
__logging__:  
* ```status.log```: Last line: "Appears that 0 files were successfully processed using 3 surviving clusters in 0.18333333333333332 minutes". This is correct: 0 files were processed because our UDF always fails. Fortunately, all 3 clusters survived. 
* ```ram_usage.log```: Every 1 second, RAM usage for each cluster (and the sum across all clusters) is recorded. Our RAM usage never got too high.
* ```failure.log```: Shows which file was sent to which cluster when the failure occurred. It reports that the function hit a ```TypeError```, which is exactly what we expect from this UDF (```error_func2()``` adds a string and an integer).  

__```mce2.async_results_dict```__: basically the history of everything MCE did. When a function fails on a file, the engine deletes the history for that file. That's why ```mce2.async_results_dict``` is equal to ```defaultdict(list, {0: [], 1: [], 2: []})```

## UDF Example 3

In [8]:
%%time
from multiple_cluster_engine import MultipleClusterEngine
import os

# create some fake files
!mkdir -p fake_data
for i in range(10):
    !echo {i + 10} > fake_data/{i}.tmp

    
def exceed_memory_limit( # this function takes no real args that are not mandatory
    # mandatory args, you can choose not to use them but function has to take them in 
     input_file_name, cluster_output_dir, cluster_id, n_cpus, cpu_id):
    import time
    time.sleep(5)
    if cluster_id % 2: # if cluster_id is odd, increase RAM usage until it is killed
        exceed_ram_limit = []
        current_value = 0
        while True:
            exceed_ram_limit.append(current_value)
            current_value += 1
    return cluster_id
        
    
mce_args = {
    'input_file_names': sorted([os.path.abspath('fake_data/' + file_name) 
        for file_name in os.listdir('fake_data')]), # gets absolute file paths
    'output_parent_dir': '/home/ubuntu/cluster_results', # use absolute path since it's safer, has to already exist
    'function_to_process': exceed_memory_limit,
    'function_kwargs_dict': {}, # exceed_memory_limit()takes no additional arguments
    'cluster_job_name': 'exceed_memory_', # no spaces as it will be part of directory name
    'n_cpus_list': [4, 3, 2, 1, 1, 1], # 1st cluster is always the largest or equal to the other clusters
    'ram_limit_in_GB': 20.0,
    'wait_time_in_seconds': 1
    }

mce3 = MultipleClusterEngine(**mce_args)
mce3.main()

100%|██████████| 10/10 [00:13<00:00,  1.31s/it]


CPU times: user 354 ms, sys: 125 ms, total: 478 ms
Wall time: 2min 23s


In [None]:
mce3.async_results_dict # when a function fails due to exceeding memory limit, its async_result history is removed

defaultdict(list,
            {0: [<AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>],
             2: [<AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>],
             4: [<AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>]})

## Explanation of Results Specifically for this Function
__```mce3.cluster_output_dir```__: the final output directory is ```/home/ubuntu/cluster_results/exceed_memory_0``` the first time you run this code.  
__logging__:  
* ```status.log```: Last line: "Appears that 7 files were successfully processed using 3 surviving clusters in 1.0166666666666666 minutes". We gave 10 files and 7 were correctly processed. This is what we expected, since ```exceed_memory_limit()``` is specifically designed to increase RAM usage indefinitely on odd-numbered clusters. We started with 6 clusters and ended up with 3. Our RAM profiler and ```ram_limit_in_GB``` worked together to automatically kill clusters 1, 3, and 5. For example, one line says:  "Killing cluster job exceed\_memory\_'s 1th cluster with CPU processes: [4126, 4128, 4130] due to exceeding RAM limit"
* ```ram_usage.log```: Every 1 second, RAM usage for each cluster (and the sum across all clusters) is recorded. It shows that when RAM usage gets too high, a cluster is killed. For example, the last line is "Killing cluster job exceed\_memory\_'s 5th cluster due to exceeding RAM limit"
* ```failure.log```: Records which clusters hit the memory limit while running ```exceed_memory_limit()``` and which files they were processing. These files would have to be processed again.

__```mce3.async_results_dict```__: basically the history of everything MCE did. In the dictionary values, there are only 7 elements in the lists representing the 7 files that were successfully processed. 3 files were sent to clusters that were killed prematurely.
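
When your UDF writes output files (unlike ```exceed_memory_limit()```, which writes nothing), one way to find the files that need reprocessing is to look for input files with no matching output. Here is a minimal sketch. ```files_without_output``` is a hypothetical helper, and it assumes output names end with an underscore followed by the original file name, as in ```save_a_string```. It cannot detect partial output from a cluster that was killed midway, so it is only a first pass.

```python
import os
import tempfile
from glob import glob


def files_without_output(input_file_names, cluster_output_dir):
    """Return the input files that have no output file at all (hypothetical helper)."""
    missing = []
    for file_name in input_file_names:
        base_file_name = os.path.basename(file_name)
        # Assumption: output files are named <anything>_<original file name>.
        pattern = os.path.join(cluster_output_dir, '*_' + base_file_name)
        if not glob(pattern):
            missing.append(file_name)
    return missing


# Demonstration on made-up output files instead of a real MCE run.
output_dir = tempfile.mkdtemp()
for output_name in ('cluster0_cpu0_0.tmp', 'cluster2_cpu1_11.tmp'):
    open(os.path.join(output_dir, output_name), 'w').close()

inputs = ['/data/0.tmp', '/data/1.tmp', '/data/11.tmp']
to_rerun = files_without_output(inputs, output_dir)
print(to_rerun)  # ['/data/1.tmp']
```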

## UDF Example 4

In [None]:
%%time
from multiple_cluster_engine import MultipleClusterEngine
import os

# create some fake files
!mkdir -p fake_data
for i in range(10):
    !echo {i + 10} > fake_data/{i}.tmp


def exceed_memory_limit_all( # this function takes no real args that are not mandatory
    # mandatory args, you can choose not to use them but function has to take them in 
     input_file_name, cluster_output_dir, cluster_id, n_cpus, cpu_id):
    import time
    time.sleep(cluster_id * cpu_id)
    exceed_ram_limit = []
    current_value = 0
    while True:
        exceed_ram_limit.append(current_value)
        current_value += 1
    return cluster_id
        
    
mce_args = {
    'input_file_names': sorted([os.path.abspath('fake_data/' + file_name) 
        for file_name in os.listdir('fake_data')]), # gets absolute file paths
    'output_parent_dir': '/home/ubuntu/cluster_results', # use absolute path since it's safer, has to already exist
    'function_to_process': exceed_memory_limit_all,
    'function_kwargs_dict': {}, # exceed_memory_limit_all() takes no additional arguments
    'cluster_job_name': 'exceed_memory_all_', # no spaces as it will be part of directory name
    'n_cpus_list': [4, 3, 2, 1, 1, 1], # 1st cluster is always the largest or equal to the other clusters
    'ram_limit_in_GB': 20.0,
    'wait_time_in_seconds': 1
    }

mce4 = MultipleClusterEngine(**mce_args)
mce4.main() # I'm expecting an Exception here due to killing all clusters

In [None]:
mce4.async_results_dict # when a function fails due to exceeding memory limit, its async_result history is removed

## Explanation of Results Specifically for this Function
__```mce4.cluster_output_dir```__: the final output directory is ```/home/ubuntu/cluster_results/exceed_memory_all_1``` the first time you run this code. Weird. It should be ```exceed_memory_all_0```. Must be a small bug. 

__logging__:  
* ```status.log```: Last line: " All clusters have been killed prematurely (probably due to exceeding RAM limit)". We started 6 clusters, but all of them were killed because of our function. I coded the engine to raise an Exception in that case. This is what we expected. 
* ```ram_usage.log```: Every 1 second, RAM usage for each cluster (and the sum across all clusters) is recorded. It shows that when RAM usage gets too high, a cluster is killed. You can see that, one by one, all the clusters are eventually killed.  
* ```failure.log```: Records which clusters hit the memory limit while running ```exceed_memory_limit_all()``` and which files they were processing. These files would have to be processed again. All remaining files that were never sent to a cluster would also have to be processed again. Do a careful sanity check here to see whether any of the resulting output files make sense and were processed correctly.

__```mce4.async_results_dict```__: Since none of the files were processed, they were all removed from the dictionary values. Whenever a cluster hits the RAM limit and is killed prematurely, the cluster is removed from the defaultdict (i.e., its cluster number is popped from the defaultdict). Hence, this defaultdict is completely empty, without any keys. The drawback is that if, say, cluster 4 processed 5 files correctly and then hit the memory limit on the 6th file, the history for all 6 files in ```mce4.async_results_dict``` would be deleted (```mce4.async_results_dict.pop(4)``` is called when the cluster is killed prematurely).
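
The drawback described above is easy to reproduce with a plain ```defaultdict```. In this sketch, strings stand in for the real result objects, and the cluster numbers are made up.

```python
from collections import defaultdict

history = defaultdict(list)
for file_number in range(5):
    history[4].append('result for file {}'.format(file_number))  # 5 good files on cluster 4
history[0].append('result for file 5')

# This is the call the text says happens when cluster 4 is killed prematurely.
removed = history.pop(4)

print(len(removed))     # 5
print(4 in history)     # False
print(sorted(history))  # [0]
# Careful: reading history[4] after the pop would silently re-create an empty list,
# because that is how defaultdict handles missing keys. Use "4 in history" to test.
```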

## UDF Example 5: Actual Map-Reduce Example

In [1]:
%%time
# create some fake data
import pandas as pd
import numpy as np

!mkdir fake_data

for i in range(10):
    userID = np.random.randint(1000, size=100000)
    invoice_payment = np.random.randint(100000, size=100000)

    df = pd.DataFrame({'userID': userID, 'invoice_payment': invoice_payment})
    df.to_csv('fake_data/{}.tmp'.format(i), index=False, header=True)

mkdir: cannot create directory ‘fake_data’: File exists
CPU times: user 2.94 s, sys: 83.9 ms, total: 3.02 s
Wall time: 3.12 s


In [2]:
def user_invoice_total(
    # mandatory args, you can choose not to use them but function has to take them in
     input_file_name, cluster_output_dir, cluster_id, n_cpus, cpu_id  
):
    import os # function has to import all the libraries it uses
    from collections import defaultdict
    import gc
    import pandas as pd
    
    def mapper(input_file_name, n_cpus, cpu_id):
        data_dict = defaultdict(list)
        with open(input_file_name, 'r') as in_:
            header = next(in_).strip().split(',') # don't forget to strip newline
            userID_index = header.index('userID') # safe way to determine column number
            invoice_payment_index = header.index('invoice_payment') # safe way to determine column number
            for line in in_:
                line = line.strip().split(',')
                userID, invoice_payment = int(line[userID_index]), int(line[invoice_payment_index])
                if hash(userID) % n_cpus == cpu_id:
                    data_dict[userID].append(invoice_payment)
        return data_dict
    
    def reducer(data_dict, cluster_output_dir, cluster_id, cpu_id):
        results_dict = {}
        for userID in sorted(data_dict):
            data = data_dict.pop(userID) # reduces data_dict size
            results_dict[userID] = [pd.DataFrame(data).sum()[0]] # only create a DataFrame for 1 userID at a time
            
        output_file_name = '_'.join(['cluster' + str(cluster_id), 
             'cpu' + str(cpu_id)])
        # I use double underscore to denote end cluster/CPU id and start of original file name
        output_file_name += '__' + input_file_name.split('/')[-1]
        output_file_name = os.path.join(cluster_output_dir, output_file_name)
        pd.DataFrame(results_dict).T.to_csv(output_file_name, header=False)

    
    data_dict = mapper(input_file_name, n_cpus, cpu_id)
    gc.collect() # garbage collector might reduce RAM usage; quick to run
    reducer(data_dict, cluster_output_dir, cluster_id, cpu_id)
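
The line ```hash(userID) % n_cpus == cpu_id``` in the mapper is what splits the work: every CPU in a cluster reads the whole file but keeps only its own share of the userIDs. Here is a small sketch of that partitioning on made-up IDs. ```assigned_cpu``` is a hypothetical helper that is not used by the UDF itself.

```python
def assigned_cpu(userID, n_cpus):
    """Return the cpu_id that keeps this userID, using the same rule as the mapper."""
    return hash(userID) % n_cpus


n_cpus = 3
user_ids = list(range(12))
partitions = {
    cpu_id: [u for u in user_ids if assigned_cpu(u, n_cpus) == cpu_id]
    for cpu_id in range(n_cpus)
}

for cpu_id in sorted(partitions):
    print(cpu_id, partitions[cpu_id])

# Every userID lands on exactly one CPU, so no payment is counted twice or dropped.
all_assigned = sorted(u for ids in partitions.values() for u in ids)
print(all_assigned == user_ids)  # True
```

This works reliably here because the mapper casts userID to ```int```, and small non-negative integers hash to themselves in CPython. If your key were a string, note that Python 3 randomizes string hashes per process by default, so separate processes could disagree on which CPU owns a key.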

In [3]:
%%time
from multiple_cluster_engine import MultipleClusterEngine
import os


# You have the choice to include all MCE arguments or just the most important ones.
# I recommend that you put all of them to gain full control of the engine and not use the default parameters.
all_mce_args = {
    'cluster_job_name': 'get_user_invoice_total_', # no spaces as it will be part of directory name
    'n_cpus_list': [4, 3, 2], # 1st cluster is always the largest or equal to the other clusters
    'ram_limit_in_GB': 20.0,
    'wait_time_in_seconds': 1,
    'input_file_names': sorted([os.path.abspath('fake_data/' + file_name) 
        for file_name in os.listdir('fake_data')]), # gets absolute file paths
    'output_parent_dir': '/home/ubuntu/cluster_results', # use absolute path since it's safer, has to already exist
    'function_to_process': user_invoice_total,
    'function_kwargs_dict': {} # user_invoice_total() takes no additional arguments    
    }

"""
minimum_required_mce_args = {
    'input_file_names': sorted([os.path.abspath('fake_data/' + file_name) 
        for file_name in os.listdir('fake_data')]), # gets absolute file paths
    'output_parent_dir': '/home/ubuntu/cluster_results', # use absolute path since it's safer, has to already exist
    'function_to_process': user_invoice_total,
    'function_kwargs_dict': {} # user_invoice_total() takes no additional arguments    
}
"""
mce5 = MultipleClusterEngine(**all_mce_args)
mce5.main()

100%|██████████| 10/10 [00:10<00:00,  1.01s/it]


CPU times: user 353 ms, sys: 28.1 ms, total: 381 ms
Wall time: 59.5 s


In [4]:
mce5.async_results_dict

defaultdict(list,
            {0: [<AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>],
             1: [<AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>],
             2: [<AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>,
              <AsyncMapResult: <lambda>:finished>]})

### Optional: Concatenating each cluster's CPU results together

In [5]:
# This code concatenates the output files from all the CPUs in a cluster.
# It is not part of MCE, but you can easily add it to the end of your MCE script.
# It requires the MCE object and ASSUMES that you use a double underscore in your file name to 
# separate the cluster/CPU ID from the original file name. For example, cluster1_cpu2__0.tmp
# If you use this concatenation code, don't write headers in each CPU output file, or
# you'll end up with repeated header lines
import os
from glob import glob


concatenated_dir = os.path.join(mce5.cluster_output_dir, 'concatenated_files') # hard-coded folder name
os.makedirs(concatenated_dir)

for file_name in mce5.input_file_names:
    base_file_name = os.path.basename(file_name)
    with open(os.path.join(concatenated_dir, base_file_name), 'w') as output_file:
        for input_file in sorted(glob(os.path.join(mce5.cluster_output_dir, '*__' + base_file_name))): # double underscore denotes beginning of original file name
            with open(input_file) as input_file_:
                for line in input_file_:
                    output_file.write(line)

### How I Debug my UDF when there is an error

In [6]:
%%time
# Basically, I put print statements in my UDF until the function seems to work.
# You can also have the UDF print how long the mapping and reducing steps took to see if 
# you are IO constrained by the mapper, or CPU constrained by the reducer.

!mkdir -p test_output
user_invoice_total(**{
    'input_file_name': 'fake_data/0.tmp',
    'cluster_output_dir': '/home/ubuntu/multiple-cluster-engine-in-python3/test_output',
    'cluster_id': 2,
    'n_cpus': 2,
    'cpu_id': 0,
    })

CPU times: user 361 ms, sys: 4.13 ms, total: 365 ms
Wall time: 467 ms
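
To see whether the mapping or the reducing step dominates, you can time each step separately. Here is a minimal sketch. ```timed```, ```toy_mapper``` and ```toy_reducer``` are hypothetical stand-ins for the steps inside your own UDF, and the rows are made up.

```python
import time


def timed(step, *args, **kwargs):
    """Run one step and return (result, elapsed seconds)."""
    start = time.time()
    result = step(*args, **kwargs)
    return result, time.time() - start


def toy_mapper(rows):
    # Group payments by user, like the mapper in user_invoice_total.
    grouped = {}
    for user_id, payment in rows:
        grouped.setdefault(user_id, []).append(payment)
    return grouped


def toy_reducer(grouped):
    # Sum the payments for each user.
    return {user_id: sum(payments) for user_id, payments in grouped.items()}


rows = [(1, 10), (2, 5), (1, 7)]
grouped, map_seconds = timed(toy_mapper, rows)
totals, reduce_seconds = timed(toy_reducer, grouped)

print('map: {:.6f}s, reduce: {:.6f}s'.format(map_seconds, reduce_seconds))
print(totals == {1: 17, 2: 5})  # True
```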


### Check if the resulting output files are correct

In [7]:
%%time
# on raw, unconcatenated files
input_file_names = sorted([os.path.abspath('fake_data/' + file_name) for file_name in os.listdir('fake_data')])
for file_name in input_file_names:
    base_file_name = os.path.basename(file_name)
    original_data_manually_processed = pd.read_csv(file_name)
    original_data_manually_processed = pd.DataFrame(original_data_manually_processed.groupby('userID')['invoice_payment'].sum())
    original_data_manually_processed = original_data_manually_processed.reset_index().sort_values('userID')
        
    mce_processed_data = []
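    for file_ in glob('/home/ubuntu/cluster_results/get_user_invoice_total_1/*{}'.format(base_file_name)): # hard-coded final output directory; change it to match your run, because if glob() finds no files, pd.concat() raises "ValueError: No objects to concatenate"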
        mce_processed_data.append(pd.read_csv(file_, header=None))
    mce_processed_data = pd.concat(mce_processed_data)
    mce_processed_data.sort_values(0, inplace=True)
    mce_processed_data.reset_index(inplace=True, drop=True)
    mce_processed_data.columns = ['userID', 'invoice_payment']
    
    
    print('MCE results match manual extraction for file: {}: {}'.format(base_file_name,
        pd.DataFrame.equals(original_data_manually_processed, mce_processed_data)))

ValueError: No objects to concatenate

In [8]:
# OPTIONAL: on concatenated files, need MCE object
for file_name in sorted(mce5.input_file_names):
    base_file_name = os.path.basename(file_name) # file_name is already an absolute path

    original_data_manually_processed = pd.read_csv(file_name)
    original_data_manually_processed = pd.DataFrame(original_data_manually_processed.groupby('userID')['invoice_payment'].sum())
    original_data_manually_processed = original_data_manually_processed.reset_index().sort_values('userID')

    mce_processed_data = pd.read_csv(os.path.join(mce5.cluster_output_dir, 'concatenated_files', base_file_name), header=None)
    mce_processed_data.sort_values(0, inplace=True)
    mce_processed_data.reset_index(inplace=True, drop=True)
    mce_processed_data.columns = ['userID', 'invoice_payment']    

    print('MCE results match manual extraction for file: {}: {}'.format(base_file_name,
        pd.DataFrame.equals(original_data_manually_processed, mce_processed_data)))

MCE results match manual extraction for file: 0.tmp: True
MCE results match manual extraction for file: 1.tmp: True
MCE results match manual extraction for file: 2.tmp: True
MCE results match manual extraction for file: 3.tmp: True
MCE results match manual extraction for file: 4.tmp: True
MCE results match manual extraction for file: 5.tmp: True
MCE results match manual extraction for file: 6.tmp: True
MCE results match manual extraction for file: 7.tmp: True
MCE results match manual extraction for file: 8.tmp: True
MCE results match manual extraction for file: 9.tmp: True


## Explanation of Results Specifically for this Function
__```mce5.cluster_output_dir```__: the final output directory is ```/home/ubuntu/cluster_results/get_user_invoice_total_1``` the first time you run this code. Weird. It should be ```get_user_invoice_total_0```. Must be a small bug. 

__logging__:  
* ```status.log```: Last line: "Appears that 10 files were successfully processed using 3 surviving clusters in 1.8 minutes". We sent in 10 files and 10 files were successfully processed. We created 3 clusters and 3 survived to the end. Great!
* ```ram_usage.log```: Every 1 second, RAM usage for each cluster (and the sum to get all clusters) is recorded. RAM never gets very high. 
* ```failure.log```: Empty. Perfect! 

__```mce5.async_results_dict```__: basically the history of everything MCE did. For example:
```python
history0 = mce5.async_results_dict[0][-1] # get the history of cluster 0 and the last file sent to it.
history1 = mce5.async_results_dict[2][-1] # get the history of cluster 2 and the last file sent to it.
print(history0.elapsed, history1.elapsed) # prints 17.952412 30.098238. This makes sense because cluster 0 had 4 CPUs and cluster 2 had 2 CPUs. More CPUs means faster processing times.  
```