cs416 mapreduce nikolay feldman, janelle barcia
Instructions for running this program: Upon calling 'make', this program should run with the following command line structure (all options and arguments are required): ./mapred -a [wordcount, sort] -i [procs, threads] -m num_maps -r num_reduces infile outfile
Description of the framework: After error checking the user inputs, fork() and execlp() are called on split.sh to split the input file into map number of files. After splitting the input file, main calls sort or wordcount depending on the user's -a option choice. When wordcount is called, a function by the name of "wc_mapreduce" is called and is given the program's arguments. For each input file partition, the file's content and a few other struct fields are passed as arguments to a newly created thread. There will be num_maps number of threads (specified by the -m option) and equal number of file partitions. Each thread calls a function "wc_map" and the main thread blocks until all child threads finish. The map worker threads traverse the each line of its partition's file contents and psuhes any words (delimited by any non-alphabet letter) into a vector that each thread pushes their map pairs into. When pushing to this vector, each thread synchronizes and push one at a time thanks to the help of semaphores. When each thread ends, the main thread picks up and sorts all the keywords so identical keywords come one after the other. Then for each reduce thread create, a subset of the keyword/value pairs gets passed to each reduce thread to count up. The threads then add the result pair into a vector that all reduce threads have access to. When a thread finishes its assigned work, it signals the main thread to assign the next available keyword subset. Work is continued until all keyword pairs are reduced. For reduce, we used a combination of semaphores, mutex, and conditional variables (just to get some experience with everything, not just semaphores). Finally, the main thread sorts the keywords and writes them to the specified output file.
When sort is called, a function called 'sort_mapreduce' takes in the user arguments (in the form of an Arg struct contained within the header). A loop first to put splited files put each line into a SORT_MapStruct struct and then a thread is created to pass this struct to the sort_map method, this runs the number of map times denoted by the user. In sort_map called with each thread, the list of lines is sorted. Upon the thread joining back to the sort_mapreduce method, this sorted list is stored in a vecture structure called all_maps in SORT_ReduceStruct, that will be passed to the sort_reduce (which is run with only one thread). sort_reduce gets all of the individually sorted maps stored in the all_maps vector, it reduces them by finding the lowest of the first indexes of all the maps. The lowest of all the maps is put into a final_sorted vector of SORT_Nodes and is then removed for the original SORT_ReduceStruct. This continues untill all maps in all_maps are empty. The final_sorted vector is written to the output file.
Testcases: For sort multiple tests were conducted using the provided bash script (randomnums.sh) to create data sets to sort. For word count, we tested using multiple lengthy books found on Project Gutenberg. We also tried error cases such empty files, invalid amount of arguments, negative amount of workers, and a case when there are more partitions of the input file than there are actual lines in the file.
Difficulties/problems faced: One of the main problems was figuring out how reducing would work for the integer sort problem. If there was 1 thread, it was easy. If there were multiple threads, that's what tricked us. We started this over the spring break, before the FAQ that said only one reduce thread will actually do the work for integer sort. Another poblems faced was thinking of ways to reduce sort in an efficieny manner. We soon realized that if we had more time, we would change sort from ascending to descending order so that when each of the first elements is removed, all elements don't have to shift forward. This way we would use pop_back() instead of erase() and front(), and would overall be more efficient.