### Multiple Cluster Engine (MCE) Usage
Hello! Thank you for considering using the Multiple Cluster Engine.  
Here's when you WANT to use this engine:
* When you want to perform map-reduce on data that is constrained to 1 large machine (lots of RAM and CPUs), probably because of security concerns about migrating your data to other machines.  

Do NOT use this engine:
* If you have access to traditional map-reduce (like AWS Elastic Map Reduce) AND your dataset is incredibly massive, then regular map-reduce is probably faster if you throw lots of machines at it. If you have the money, just use regular map-reduce.
* If you want to asynchronously perform one task on lots of files and the task doesn't require map-reduce. For example, if you simply want to copy lots of big files from 1 directory to another at the same time, then just use the ipyparallel load balanced view. Basically you want the multiprocessing module but don't want to write multiprocessing syntax, because the multiprocessing module has some annoying limitations. Whatever you do using the multiprocessing module can be done as easily or easier using ipyparallel. You can do something like this instead, which creates 1 cluster of 10 CPUs where the CPUs do NOT talk to each other:
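```python
import time
import ipyparallel as ipp

# Run this in a Jupyter notebook: lines starting with "!" are shell commands
!ipcluster start --n=10 --profile=my_favorite_cluster --daemonize
time.sleep(10) # it takes a few seconds for the shell command to start up the cluster
client = ipp.Client(profile='my_favorite_cluster') # profile must be passed as a keyword argument
load_balanced_view = client.load_balanced_view()
# map_async zips its sequences, so every extra argument must be a sequence as long as your_files_here
async_result = load_balanced_view.map_async(your_function_here, your_files_here, other_argument_here_if_you_have_any)
async_result.wait() # block until every file is processed; stopping the cluster earlier would abort the work
# with async_result, you can see how much time has elapsed for each file, whether there was an
# error, and other interesting statistics. Basically async_result records the history of what the cluster is doing
!ipcluster stop --profile=my_favorite_cluster
```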

### Motivation for Using the Multiple Cluster Engine
The motivation is that the MCE is quite fast for large datasets, as it performs all calculations in RAM. A traditional map-reduce job, by contrast, saves data to disk after the mapping step, sorts and shuffles it, and loads it back into RAM. Traditional map-reduce is also slower in that the reducing step cannot start until the entire mapping step is completed. Any operation that goes through disk instead of RAM will be much slower. The MCE overcomes this hurdle in that each CPU in a cluster is effectively isolated from the others and performs BOTH the mapping and reducing steps, and thus isn't dependent on the completion of other CPUs in its cluster.  
__CAVEAT:__ With enough money for many machines and with a dataset large enough, traditional map-reduce can still be faster. Basically, 20 small machines might be faster than 1 large machine--and it may even be possible that 20 small machines are cheaper if runtime is significantly reduced. 

### How the Multiple Cluster Engine Works
The Multiple Cluster Engine allows vertical scaling on a large machine in place of horizontal scaling where you use multiple small machines. Here is the original paper, "Multi-Cluster Engine: A Pragmatic Approach to Big Data" (https://docs.google.com/document/d/17-ItaCOXHbSqa2YmykpU4_PyC8--8bdMhEBGKjEURQo/edit), describing our approach to performing map-reduce to extract features from large datasets.  
A high-level explanation of MCE: Each CPU is effectively a complete map-reduce job on its PARTITION of the user IDs. Each cluster composed of CPUs extracts features for ALL the user IDs for the input file. Multiple clusters can extract features for multiple files simultaneously.  

Suppose we create a cluster with 7 CPUs (labeled CPU0, CPU1, ..., CPU6). Each CPU will read all the data for 1 month. 
* Mapping stage: For each line, each CPU hashes the user ID, takes the result mod 7, and determines whether that user ID belongs to that CPU (meaning the remainder == CPU ID). Using this scheme, we can guarantee that for a specific user ID, all the invoice data for that month will go to only 1 CPU. For each CPU, the data it keeps is stored in a defaultdict where the key is the user ID and the value is a list of invoice data. Use defaultdict(list), so you can simply append entries for each user ID.
* Reducing stage: create a pandas dataframe for only 1 user ID's data at a time and calculate the features you care about. After getting the results, store the results into a separate dictionary and immediately dump the raw data in order to reduce RAM usage. Once you iterate through all the user IDs, then save the results to file. You're done!
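The two stages above can be sketched in a few lines. This is a minimal illustration, not the code used in production: the `user_id,amount` record layout, the feature names, and the use of `zlib.crc32` as the hash are all assumptions. A process-independent hash matters here because Python 3 salts the built-in `hash()` of strings per process by default, so separate CPU processes would not agree on who owns a user ID. Plain dictionaries stand in for the pandas dataframe to keep the sketch dependency-free.

```python
import zlib
from collections import defaultdict


def owns_user(user_id, cpu_id, n_cpus):
    """Return True if this CPU is responsible for user_id.

    zlib.crc32 is an assumption made for this sketch: it returns the same
    value in every process, unlike Python 3's salted built-in hash().
    """
    return zlib.crc32(user_id.encode("utf-8")) % n_cpus == cpu_id


def map_stage(lines, cpu_id, n_cpus):
    """Keep only this CPU's user IDs, grouped as user ID -> list of amounts."""
    invoices_by_user = defaultdict(list)
    for line in lines:
        user_id, amount = line.strip().split(",")
        if owns_user(user_id, cpu_id, n_cpus):
            invoices_by_user[user_id].append(float(amount))
    return invoices_by_user


def reduce_stage(invoices_by_user):
    """Compute features one user at a time, dropping raw data as we go."""
    features = {}
    for user_id in list(invoices_by_user):
        amounts = invoices_by_user.pop(user_id)  # release this user's raw data
        features[user_id] = {"n_invoices": len(amounts), "total": sum(amounts)}
    return features


# Hypothetical "user_id,amount" records standing in for one month of invoices
lines = ["alice,10.0", "bob,5.0", "alice,2.5", "carol,7.0", "bob,1.0"]
n_cpus = 7
results = {}
for cpu_id in range(n_cpus):  # in MCE these run in parallel, one per CPU
    results.update(reduce_stage(map_stage(lines, cpu_id, n_cpus)))
```

Because every user ID maps to exactly one CPU, merging the per-CPU results never produces conflicting entries for the same user.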

1 cluster works on 1 file at a time. MCE abstracts this further by creating multiple clusters, so you can extract features from multiple files asynchronously at the same time. The next file is sent to the next available cluster.  

### Practical Advice about the Multiple Cluster Engine
* MCE setup assumes that the clusters are ordered from most to fewest CPUs (e.g. cluster0: 5 CPUs, cluster1: 4 CPUs, cluster2: 3 CPUs) and that the list of input files is ordered from largest to smallest (e.g. 60 GB, 40 GB, 20 GB). The reason is that in MCE, cluster0 receives files from the left (largest) end of the list, and all other clusters receive files from the right (smallest) end; when MCE finishes all the files, the last file to be processed will actually be somewhere in the middle of the input list. The idea behind the double-ended queue is to load balance the RAM. As a counterexample, if you queued up files from smallest to largest, the large files at the end would be loaded into RAM simultaneously and could exhaust it. Luckily, the MCE constantly profiles RAM and kills the cluster using the most RAM once total usage exceeds the user's preset RAM limit, so the machine should not run out of RAM. Nonetheless, killing a cluster means you are operating with fewer clusters.  
* To minimize run time, you trade off three resources: CPU, RAM, and storage.
* More CPUs in a single cluster means that for 1 file the reducing step will be faster (fewer user IDs per CPU) but the mapping step will still take the same time (every CPU will read the entire file). However, more CPUs in 1 cluster means you can't put the CPUs on other clusters. More clusters means processing more files simultaneously. However, more clusters also mean using more RAM as each file will be loaded into RAM.  
* You don't want more CPUs in the MCE than you have logical CPUs. For example, if you have 24 CPUs on your VM, then the MCE should use at most 22 or 23 CPUs--leave 1 or 2 for the remaining processes like the OS. 
* Although it may seem inefficient to have every CPU in 1 cluster read the same file independently, it turns out that map-reduce is constrained neither by read time nor by loading the data into the defaultdict in RAM. The bulk of the time is actually spent performing the calculations. If you are concerned about read time and performing unnecessary hashes on user ID, then you can write a Multiple Cluster job that first partitions the data based on user ID and then writes that to disk. Basically, you do a 1-time map-reduce job where the mapping step collects all the data for a specific user ID and the reducing step just writes each user ID's data block back to file. The benefit is that the data is now effectively already shuffled and sorted, so you don't need to load in all the data any more. For future map-reduce jobs, you can just load in 1 user ID data block at a time, perform the calculation, append to some results file, immediately delete the raw data from RAM, and repeat with the next user ID. The tradeoff is that you use more SSD space, which is not cheap. Of course, a further optimization is to log how long the mapping (reading data) and reducing (performing calculations) steps take, so you know how many CPUs to put on a cluster. 
* The hashing of user ID is meant to distribute user IDs randomly across CPUs. An interesting property is that the number of invoices per user ID exhibits a power law distribution. Ideally, each CPU will have an equal number of user IDs that have lots of invoices. The only downside is that there are a handful of user IDs that have incredible numbers of invoices: Santander bank has ~13% of all invoices. Hence, the CPU that is assigned Santander bank is the slowest CPU, as it probably has to load more into its RAM during mapping and ultimately create a very large dataframe for reducing. You will see that this CPU slows down the entire cluster, as the cluster cannot move on to the next file until the results for every user ID in its input file are complete. The wait time when other CPUs sit idle while 1 CPU is processing Santander is not trivial. In my opinion, this problem is the biggest thing that slows down the entire MCE job. Ideally you want each CPU in a cluster to finish processing its user IDs around the same time. A resolution is to partition Santander and other known user IDs with an inordinately large number of invoices from the rest of the dataset, but __ONLY__ if the MCE job is too slow for you. As Donald Knuth put it: premature optimization is the root of all evil. 
* After the mapping step and before the reducing step, run gc.collect() to release some memory. It helps sometimes and is quick to run.
* The transition between Python 3 and 2 isn't very large: mostly the print function vs the print statement. The main issue will be in the invoice data, where the text is encoded with the Windows encoding (cp-1252) (which had problems for reasons I forget), so we had to convert to UTF or ISO Latin 8859. Hence, for the function you put into MCE, you just have to be careful with Python 3's strings, which are Unicode, and you'll have to figure out how to make them play nice with cp-1252. A practical example is Spanish characters with tildes or accents that, at least in Python 2, caused issues. (Perhaps it was that cp-1252 strings were not hashable.)  
* From observation, it appears that you don't want to get too close to 100% RAM utilization, as all operations take significantly longer. I am not sure why this happens. My guess is to keep peak RAM utilization at around 80%; just a guess. 
* You want to use an SSD when you use this engine. The reason is that an HDD is significantly slower at reading data, and it probably can't read from different files simultaneously very well. 
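The double-ended dispatch described in the first bullet can be illustrated as follows. This is a sketch under stated assumptions: the engine itself tracks two counters into a list rather than using `collections.deque`, and the file names and the order in which clusters become free are made up for the example.

```python
from collections import deque


def dispatch_order(file_names, cluster_requests):
    """Return (cluster_id, file_name) pairs in the order files are handed out.

    file_names must be sorted from largest to smallest. cluster_requests is
    the sequence of cluster IDs asking for work, in the order they become free.
    """
    pending = deque(file_names)
    assignments = []
    for cluster_id in cluster_requests:
        if not pending:
            break
        if cluster_id == 0:
            file_name = pending.popleft()  # largest remaining file
        else:
            file_name = pending.pop()  # smallest remaining file
        assignments.append((cluster_id, file_name))
    return assignments


files = ["60GB.csv", "40GB.csv", "30GB.csv", "20GB.csv", "10GB.csv"]
requests = [0, 1, 2, 1, 2]  # hypothetical order in which clusters become free
for cluster_id, file_name in dispatch_order(files, requests):
    print(cluster_id, file_name)
```

In this run the last file handed out is `40GB.csv`, which sits in the middle of the input list, so the largest files are never all in RAM at the same moment.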

Cool Built-In Features:
* Profiles RAM for each cluster (and sums to get the total RAM for the entire MCE) every ```wait_time_in_seconds``` and saves to ```ram_usage.log```. If the total RAM of all clusters exceeds the user's preset RAM limit, then MCE will automatically kill the cluster using the most RAM (in practice, probably cluster0) AND write to ```failure.log``` which cluster got killed and which file it was working on at the time. 
* If the function put into MCE fails (like TypeError, ZeroDivisionError, etc.), then that exception will be written into ```failure.log``` along with which cluster raised it on which file. This is useful because the invoice data is frankly not always standardized--sometimes the delimiters change, sometimes the headers change, etc. Basically, the ```failure.log``` file tells you what went wrong on which file. However, to debug your function, you have to run the function without MCE but pass in the arguments that MCE would have passed in, in order to reproduce the error as well as determine which line the code failed on. 
* After setting up MCE, the main() method will create the final output directory, start all the clusters, run the clusters (sending files to them as they become available), and then kill all the clusters at the end. Ideally everything is self-contained. Occasionally, if there is an error and MCE stops running, you have to go to ```htop```, check whether any cluster processes are still alive, and manually kill them using ```ipcluster stop --profile=cluster_name_here```
* Each time you run MCE, MCE will create a new final output directory. Hence, you can safely run the same code multiple times and not overwrite your previous output. 
* Written and tested to be Python 3 and 2 compliant. In terms of RAM, Python 2 is a bit greedy in that it doesn't release all unused memory to the OS. The benefit of Python 3 is that it releases more (but not all) unused memory back to the OS. For example, if you create a large list in Python 2 and delete it, the amount of free RAM is lower than before. If you do the same in Python 3, the amount of free RAM afterwards is higher than in Python 2. This is important because when processing large files in MCE, each CPU will still hold some RAM after it has finished processing its file, despite having no variables in its namespace. You want the smallest RAM footprint when each CPU finishes what it's doing, so you can have more clusters. 
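The RAM-limit rule from the first feature can be sketched as a small pure function. This is a simplified illustration with made-up numbers: the engine measures per-cluster RAM with `psutil` and then stops the chosen cluster, whereas the function name `cluster_to_kill` and the usage figures below are hypothetical.

```python
def cluster_to_kill(ram_use_by_cluster_gb, ram_limit_gb):
    """Return the ID of the cluster to kill, or None if usage is within the limit."""
    if sum(ram_use_by_cluster_gb.values()) <= ram_limit_gb:
        return None
    # Pick the single cluster currently holding the most RAM
    return max(ram_use_by_cluster_gb, key=ram_use_by_cluster_gb.get)


usage = {0: 48.0, 1: 21.5, 2: 9.0}  # hypothetical GB of RAM per cluster
print(cluster_to_kill(usage, ram_limit_gb=100))
print(cluster_to_kill(usage, ram_limit_gb=64))
```

Only one cluster is chosen per check, which matches the engine's behavior of killing at most 1 cluster every ```wait_time_in_seconds```.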

### Area for Future Improvements
* Allow a plain queue instead of a double-ended queue for the input file list. The motivation is that you may want the results from some input files before others in a predictable order. This should actually be quite easy.
* Currently, if a cluster is running smoothly, it will never kill itself. If RAM accumulates because of a memory leak, it would help to have a method that kills and restarts the cluster. This should be easy. Personally, I don't know if memory leaks will be a big issue, so I didn't build this feature.
* If a cluster is killed, then perhaps in the future, turn that cluster back on--basically self-regeneration. This will be quite difficult for lots of reasons. For practical purposes, I do not see this as a feature worth building, as many things can go wrong. 

### Additional Resources
* My quick, 5-minute tutorial of ipyparallel to cover all the basics: https://github.com/eugeneh101/ipyparallel-vs-MRJob/blob/master/ipyparallel_tutorial.ipynb  
* A more extensive coverage of ipyparallel at https://github.com/DaanVanHauwermeiren/ipyparallel-tutorial
* I drew inspiration for MCE from MRJob, a library that makes map-reduce easier using OOP. Here's my MRJob example done in less than 10 lines of code: https://github.com/eugeneh101/ipyparallel-vs-MRJob/blob/master/ipyparallel_example_vs_mrjob.ipynb
* This MRJob tutorial shows how you can deploy map-reduce on AWS EMR using very little code. The only caveat is that I was able to use MRJob locally but never figured out how to set up the AWS EMR version correctly. https://github.com/donnemartin/data-science-ipython-notebooks/blob/master/mapreduce/mapreduce-python.ipynb

In [None]:
# write shell script for configuration and installation
# unittest with a mapper/reducer?

I just use the Jupyter Notebook magic function (%%file) to save the code to the .py file. Basically, the code cell below functioned as my Sublime editor (with syntax highlighting and tab completion).

In [5]:
%%file multiple_cluster_engine.py
import ipyparallel as ipp
from collections import defaultdict
import os
import psutil
import time
from datetime import datetime
from tqdm import tqdm
import copy
import itertools

import logging # logging can create duplicate entries if you don't reload logging
try:
    from importlib import reload # Python 3
except ImportError: # Python 2 reload is a builtin
    pass

class MultipleClusterEngine(object):
    def __init__(self, 
                 cluster_job_name, 
                 n_cpus_list, 
                 ram_limit_in_GB,
                 wait_time_in_seconds,
                 input_file_names,
                 output_parent_dir,
                 function_to_process,
                 function_kwargs_dict): # always put function args in a dictionary
        reload(logging)
        self.cluster_job_name = cluster_job_name
        self.n_cpus_list = n_cpus_list
        self.ram_limit_in_GB = ram_limit_in_GB
        self.wait_time_in_seconds = wait_time_in_seconds
        self.output_parent_dir = output_parent_dir
        self.input_file_names = input_file_names
        self.function_to_process = lambda kwargs: function_to_process(**kwargs)
        self.function_kwargs_dict = function_kwargs_dict
        
        assert cluster_job_name, "Needs cluster name"
        assert len(n_cpus_list) > 0, "Needs the number of CPUs per cluster"
        assert os.path.isdir(self.output_parent_dir), "Output directory doesn't exist"
        assert len(self.input_file_names) > 0, "Need input files"

        # used by engine
        self.cluster_dict = {}
        self.load_balanced_view_dict = {}
        self.async_results_dict = defaultdict(list) # collects all the async_results
        self.file_to_cluster_order_dict = defaultdict(list) # remembers which file is sent to which cluster
        self.cluster_indexes = None
        self.logger_status = None
        self.logger_failure = None
        self.start_time = None
        self.end_time = None
        self.cluster_output_dir = None
        self.cluster_RAM_use_dict = {}
        self.cluster_pid_dict = {}
        self.n_unsuccessful_files = 0

    def create_cluster_output_dir(self):
        """Creates the folder which all the results will be saved to based on
        the output_parent_dir and the cluster job name and an incremented number"""
        subdirs = [name for name in os.listdir(self.output_parent_dir) if 
                   os.path.isdir(os.path.join(self.output_parent_dir, name))]
        existing_results_dir = []
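        for subdir in subdirs:
            if not subdir.startswith(self.cluster_job_name):
                continue
            try: # slice off the prefix; str.strip() removes a set of characters, not a prefix
                existing_results_dir.append(int(subdir[len(self.cluster_job_name):]))
            except ValueError:
                pass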
        dir_index = max(existing_results_dir) + 1 if existing_results_dir else 0
        self.cluster_output_dir = os.path.join(self.output_parent_dir, 
                                               self.cluster_job_name + str(dir_index))
        os.makedirs(self.cluster_output_dir)
                
    def create_logger(self, logger_name, log_file):
        """Create logger objects"""
        l = logging.getLogger(logger_name)
        fileHandler = logging.FileHandler(log_file)
        l.addHandler(fileHandler)
        l.setLevel(logging.INFO)
    
    def activate_logger(self):
        """Create a logger for status, failure, and RAM usage updates"""
        self.create_logger('status', os.path.join(self.cluster_output_dir, "status.log"))
        self.create_logger('failure', os.path.join(self.cluster_output_dir, "failure.log"))
        self.create_logger('ram_usage', os.path.join(self.cluster_output_dir, "ram_usage.log"))
        self.logger_status = logging.getLogger('status')
        self.logger_status.propagate = False
        self.logger_failure = logging.getLogger('failure')
        self.logger_failure.propagate = False
        self.logger_ram_usage = logging.getLogger('ram_usage')
        self.logger_ram_usage.propagate = False
        
    def profile_memory_for_cluster(self, cluster_id): # RAM in GB
        """Given a cluster ID, determines how much RAM (in GB) that cluster is using,
        which is just the sum of the RAM attached to the CPUs in the cluster"""
        return sum(psutil.Process(pid).memory_info().rss for 
                   pid in self.cluster_pid_dict[cluster_id]) / 1e9
        
    def profile_memory_for_all_clusters(self):
        """Determines the RAM for all the clusters"""
        self.cluster_RAM_use_dict.clear()
        for jth_cluster in sorted(self.load_balanced_view_dict):
            self.cluster_RAM_use_dict[jth_cluster] = self.profile_memory_for_cluster(jth_cluster)           
            self.logger_ram_usage.info('{}: {}th cluster uses {} GB of RAM'.format(
                datetime.now(), jth_cluster, self.cluster_RAM_use_dict[jth_cluster]))
        self.logger_ram_usage.info('{}: All clusters use {} GB of RAM'.format(
                datetime.now(), sum(self.cluster_RAM_use_dict.values())))
        
    def clear_memory_on_cluster(self, cluster_id): # not as effective as imagined
        """Ideally clears out the RAM when a cluster successfully processes a file.
        In practice, it is hard to say how effective this is in clearing the RAM.
        Fortunately, running this method is very fast"""
        import gc
        self.cluster_dict[cluster_id][:].apply_async(gc.collect)
        
    def start_cluster(self, n_cpus, cluster_id):
        """Given a cluster ID, start the cluster with that ID and then attach
        to the cluster and then store the CPU PIDs cluster_pid_dict and then
        log it in the status.log file
        """
        self.logger_status.info("{}: \tAttempting to start cluster job "
            "{}'s {}th cluster with {} CPUs".format(datetime.now(), 
            self.cluster_job_name, cluster_id, n_cpus))
        os.system("ipcluster start --n={} --profile={}{} --daemonize".format(
            n_cpus, self.cluster_job_name, cluster_id)) # should deprecate to use a safer OS call

        attempt_ctr = 0 
        while attempt_ctr < 3: # Attempt to connect to client 3 times
            time.sleep(10) # hard coded
            try:
                cluster = ipp.Client(profile='{}{}'.format(self.cluster_job_name, cluster_id))
            except ipp.error.TimeoutError:
                attempt_ctr += 1
            else:
                self.cluster_pid_dict[cluster_id] = cluster[:].apply_async(os.getpid).get()
                self.logger_status.info(('{}: \t\tCPU processes ready for action'
                    ': {}').format(datetime.now(), self.cluster_pid_dict[cluster_id]))
                return cluster
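            # if there is any other error other than TimeoutError, then the error will be raised
        # all attempts timed out: fail loudly instead of silently returning None
        raise RuntimeError("Could not connect to cluster {}{} after 3 attempts".format(
            self.cluster_job_name, cluster_id))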
            
    def start_all_clusters(self):
        """Starts all the clusters specified and also writes updates to status.log"""
        self.activate_logger()
        self.logger_status.info('{}: Starting Multiple Cluster Engine'.format(datetime.now()))
        for cluster_id, n_cpus in enumerate(self.n_cpus_list):
            self.cluster_dict[cluster_id] = self.start_cluster(n_cpus, cluster_id)
            self.load_balanced_view_dict[cluster_id] = self.cluster_dict[cluster_id].load_balanced_view()            
        self.start_time = datetime.now()
        self.logger_status.info('{}: All clusters started at {}'.format(datetime.now(), self.start_time))
        self.cluster_indexes = itertools.cycle(sorted(self.load_balanced_view_dict))
        
    def kill_cluster(self, cluster_id):
        """Given a cluster ID, kills that cluster and write update to status.log and
        update cluster_indexes to know which clusters are remaining"""
        self.logger_status.info(("{}: \tAttempting to kill cluster job {}'s {}th "
            "cluster with CPU processes: {}").format(datetime.now(), 
            self.cluster_job_name, cluster_id, self.cluster_pid_dict[cluster_id]))
        self.load_balanced_view_dict.pop(cluster_id)
        # cluster.purge_everything() # sometimes this line takes forever
        self.cluster_dict[cluster_id].close()
        self.cluster_dict.pop(cluster_id)
        os.system('ipcluster stop --profile={}{}'.format(self.cluster_job_name, cluster_id))
        self.logger_status.info('{}: \t\tCluster successfully killed'.format(datetime.now()))
        self.cluster_indexes = itertools.cycle(sorted(self.load_balanced_view_dict))
        time.sleep(5) # hard-coded
        
    def kill_all_clusters(self):
        """Kills all clusters that are remaining and writes updates to status.log"""
        self.end_time = datetime.now()
        n_surviving_clusters = len(self.cluster_dict)
        self.logger_status.info('{}: Killing all remaining clusters'.format(datetime.now()))
        for cluster_id in sorted(self.cluster_dict):
            self.kill_cluster(cluster_id)
        self.logger_status.info('{}: All clusters have been killed'.format(datetime.now()))
        self.logger_status.info('{}: Multiple Cluster Engine shut down at {}'.format(
            datetime.now(), self.end_time))
        self.logger_status.info(("{}: Appears that {} files were successfully "
            "processed using {} surviving clusters in {} minutes").format(
            datetime.now(), len(self.input_file_names) - self.n_unsuccessful_files, 
            n_surviving_clusters, # total_seconds() also counts whole days, unlike .seconds
            (self.end_time - self.start_time).total_seconds() / 60.0))
        logging.shutdown()
        
    def early_kill_cluster(self, cluster_id):
        """Given a cluster ID, kills that cluster and writes which file that cluster
        was working on to failure.log and updates status and RAM logs. An early cluster
        kill happens when total RAM usage exceeds the limit set by the user"""
        self.logger_failure.info(("{}: Killing cluster job {}'s {}th cluster which "
            "was processing file {} due to exceeding RAM limit").format(
            datetime.now(), self.cluster_job_name, cluster_id, 
            self.file_to_cluster_order_dict[cluster_id][-1]))
        self.logger_ram_usage.info(("{}: Killing cluster job {}'s {}th cluster due "
            "to exceeding RAM limit").format(datetime.now(), self.cluster_job_name, cluster_id))        
        self.logger_status.info(("{}: Killing cluster job {}'s {}th cluster with CPU "
            "processes: {} due to exceeding RAM limit").format(datetime.now(), 
            self.cluster_job_name, cluster_id, self.cluster_pid_dict[cluster_id]))
        self.cluster_dict[cluster_id].close()
        os.system('ipcluster stop --profile={}{}'.format(self.cluster_job_name, cluster_id))
        self.load_balanced_view_dict.pop(cluster_id)
        self.cluster_dict.pop(cluster_id)
        self.async_results_dict.pop(cluster_id)
        self.n_unsuccessful_files += 1
        
    def kill_cluster_if_ram_limit_exceeded(self): 
        """Checks if total RAM used exceeds limit set by user. If so, kill the 
        cluster that uses the most RAM and update some logs. Only kills at 
        max 1 cluster per method call"""
        if sum(self.cluster_RAM_use_dict.values()) > self.ram_limit_in_GB:
            cluster_id = sorted(self.cluster_RAM_use_dict, 
                                 key=self.cluster_RAM_use_dict.get, reverse=True)[0]
            self.early_kill_cluster(cluster_id)
            self.cluster_indexes = itertools.cycle(sorted(self.load_balanced_view_dict))
        if len(self.load_balanced_view_dict) == 0:
            self.logger_failure.info(("{}: All clusters have been killed prematurely "
                "(probably due to exceeding RAM limit), so it would be a good idea"
                " to determine which files, if any, successfully processed"
                 ).format(datetime.now()))
            self.logger_status.info(("{}: All clusters have been killed prematurely "
                "(probably due to exceeding RAM limit)").format(datetime.now()))
            raise Exception('All clusters have been killed prematurely')
    
    def check_if_function_in_cluster_failed(self, cluster_id):
        """Given a cluster ID, checks if the most recent file (sent to
        that cluster) successfully processed or not. If there was an error
        in processing, then write the error to failure.log and remove
        its async_result history"""
        if self.async_results_dict[cluster_id] == []: # cluster just started, so it
            return # doesn't have any files sent to the cluster yet
        else:
            exception = self.async_results_dict[cluster_id][-1].exception()
            if exception:
                self.logger_failure.info(("{}: {}th cluster has error {} on "
                    "file {}").format(datetime.now(), cluster_id, exception.args[0], 
                    self.file_to_cluster_order_dict[cluster_id][-1]))
                self.async_results_dict[cluster_id].pop()
                self.n_unsuccessful_files += 1

    def create_kwargs_dict_list(self, input_file_name, cluster_id, n_cpus):
        """Packages up the arguments into a dictionary to be sent to each cluster. 
        There are function arguments as well as cluster arguments (cluster ID and 
        number of CPUs in that cluster). If the cluster has n CPUs, then create
        a list of n copies of this kwargs dictionary"""
        function_kwargs_dict = copy.deepcopy(self.function_kwargs_dict)
        function_kwargs_dict.update({'input_file_name': input_file_name,
                                    'cluster_output_dir': self.cluster_output_dir,
                                    'cluster_id': cluster_id,
                                    'n_cpus': n_cpus})
        function_kwargs_dict_list = []
        for cpu_id in range(n_cpus):
            function_kwargs_dict_list.append(copy.deepcopy(function_kwargs_dict))
            function_kwargs_dict_list[cpu_id]['cpu_id'] = cpu_id
        return function_kwargs_dict_list
                
    def run_clusters(self):
        """Iterates through all the files. You want to order your files from largest
        to smallest. The reason is that the largest files (file0, file1, file2, etc)
        will be sent to the largest cluster/cluster with the most CPUs (which 
        we will call your first cluster). For each file, create a kwargs dictionary to 
        be sent to the cluster, write to status.log which file is going to which
        cluster, and store the async_result in async_results_dict. You can also
        inspect the order of files sent to which clusters by checking 
        file_to_cluster_order_dict after the engine has finished processing
        all the files. For every wait_time_in_seconds, check if a cluster is
        finished processing its current file and available to send the next
        file. In addition, for every wait_time_in_seconds, check if RAM usage
        exceeds limit set by user. If so, kill the cluster using the most RAM"""
        small_file_ctr = 1 # effectively a deque (double-ended queue) scheme
        big_file_ctr = 0
        for ith_file in tqdm(range(len(self.input_file_names))):
            while True:
                time.sleep(self.wait_time_in_seconds)
                self.profile_memory_for_all_clusters()
                self.kill_cluster_if_ram_limit_exceeded()
                jth_cluster = next(self.cluster_indexes)                
                if (not self.async_results_dict[jth_cluster][-1:]
                        or self.async_results_dict[jth_cluster][-1].done()): # check if cluster j is available                       
                    self.clear_memory_on_cluster(jth_cluster)
                    self.check_if_function_in_cluster_failed(jth_cluster) # check if previous file failed to process
                    if jth_cluster == 0: # send large files to large cluster (which ALWAYS has id == 0)
                        index = big_file_ctr
                        big_file_ctr += 1
                    else: # send small files to small clusters (which ALWAYS have id > 0)
                        index = -small_file_ctr
                        small_file_ctr += 1
                                                                                   
                    kwargs_dict_list = self.create_kwargs_dict_list(
                        self.input_file_names[index],
                        jth_cluster, 
                        len(self.cluster_dict[jth_cluster].ids))                    
                    
                    async_result = self.load_balanced_view_dict[jth_cluster].map_async(
                        self.function_to_process, kwargs_dict_list)                                              
                    self.async_results_dict[jth_cluster].append(async_result)
                    self.file_to_cluster_order_dict[jth_cluster].append(self.input_file_names[index])
                    # write status to file--it will only have start times, no end times
                    self.logger_status.info(("{}: {} is the {}th file and is sent to "
                        "{}th cluster for processing").format(datetime.now(),
                        self.input_file_names[index], ith_file, jth_cluster))
                    break # break out of inner loop to determine if other clusters are available

        while not all(async_results[-1].done() # wait for all clusters to finish
                      for async_results in self.async_results_dict.values()
                      if async_results): # skip clusters that never received a file
            time.sleep(self.wait_time_in_seconds)
            self.profile_memory_for_all_clusters()
            self.kill_cluster_if_ram_limit_exceeded()
                
        for jth_cluster in sorted(self.load_balanced_view_dict): # every surviving cluster once
            self.check_if_function_in_cluster_failed(jth_cluster) # check if last file failed to process
        # async_results_dict; save to disk for later inspection? determine whether results takes too much RAM
        
    def main(self):
        """Runs the entire thing"""
        self.create_cluster_output_dir()
        self.start_all_clusters()
        self.run_clusters()
        self.kill_all_clusters()

Overwriting multiple_cluster_engine.py
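To show how a function plugs into the engine, here is a hedged sketch. The engine calls `function_to_process` with the keyword arguments built in `create_kwargs_dict_list` (`input_file_name`, `cluster_output_dir`, `cluster_id`, `n_cpus`, `cpu_id`) plus whatever is in `function_kwargs_dict`. Everything else below is hypothetical: the function `count_lines_for_cpu`, its `skip_header` option, the output file naming, and the parameter values in `run_engine`. The last lines call the function directly, without MCE, which is the debugging approach described earlier.

```python
import os
import tempfile


def count_lines_for_cpu(input_file_name, cluster_output_dir, cluster_id,
                        n_cpus, cpu_id, skip_header=False):
    """Hypothetical function_to_process: counts the lines this CPU owns."""
    import os  # imported inside so the function is self-sufficient on a remote engine
    n_lines = 0
    with open(input_file_name) as f:
        if skip_header:
            next(f, None)
        for line_number, _ in enumerate(f):
            if line_number % n_cpus == cpu_id:  # toy partitioning by line number
                n_lines += 1
    output_name = "{}_cluster{}_cpu{}.txt".format(
        os.path.basename(input_file_name), cluster_id, cpu_id)
    output_path = os.path.join(cluster_output_dir, output_name)
    with open(output_path, "w") as f:
        f.write(str(n_lines))
    return output_path


def run_engine(input_file_names, output_parent_dir):
    """Wire the function into MCE (needs ipyparallel and multiple_cluster_engine.py)."""
    from multiple_cluster_engine import MultipleClusterEngine
    engine = MultipleClusterEngine(
        cluster_job_name="line_count_",  # illustrative values throughout
        n_cpus_list=[3, 2],              # most CPUs first
        ram_limit_in_GB=8,
        wait_time_in_seconds=5,
        input_file_names=input_file_names,  # largest file first
        output_parent_dir=output_parent_dir,
        function_to_process=count_lines_for_cpu,
        function_kwargs_dict={"skip_header": False})
    engine.main()


# Debugging without MCE: pass the same kwargs the engine would have sent
tmp_dir = tempfile.mkdtemp()
sample_file = os.path.join(tmp_dir, "sample.csv")
with open(sample_file, "w") as f:
    f.write("a\nb\nc\nd\ne\n")
output_path = count_lines_for_cpu(input_file_name=sample_file,
                                  cluster_output_dir=tmp_dir,
                                  cluster_id=0, n_cpus=2, cpu_id=0)
```

Running the function directly like this gives a normal traceback with line numbers, which ```failure.log``` alone does not provide.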
