New README for public consumption.
|@@ -1,14 +1,3 @@|
|-Linear Programming Final Project|
|+An experimental linear program solver, plpsolve is capable of solving any feasible general linear program. There are currently some numerical stability issues that probably won't go away for a while. This is because the main focus of this project is to explore parallelization methods for general form simplex and not to implement the fastest or most-correct solver out there.|
|-For my final project I analyzed the parallelizability of a linear program solver. The solver in question is based on the pivoting version of general form simplex. It is able to initialize dictionaries with both unbounded variables and infeasible constraints. As well, it implements Bland's Rule and a separate rule that selectively enables and disables Bland's Rule to ensure termination. Lastly, several optimizations were added, such as keeping a running account of the values of the objective function and the constraints. This meant that the values didn't have to be re-calculated during each pivot.|
|-Two parallelization methods were considered: data-parallel and task-parallel. The data-parallel method simply required instrumenting the only hot loop that contains a large number of instructions with an OpenMP parallel-for pragma. This resulted in an overall slowdown of the application, as the loop was still fairly lightweight and each thread was accessing data that resided in the same cache line. With larger problem sizes this method might be more useful, but more gain would be obtained from transitioning to revised simplex and parallelizing a loop that is part of that algorithm.|
|-I also developed a task-parallel algorithm that I was unable to implement in time for this assignment. This algorithm starts by creating a worker thread for each processing element in the system and pinning the thread to a specific core. Each worker has a local priority queue in which it keeps work for it to process. Each work unit consists of a pointer to a dictionary and a list of all possible entering and leaving pairs. The priority of a work unit is the number of possible entering and leaving pairs, with fewer pairs indicating higher priority. This decision is meant to decrease memory usage, as dictionaries with fewer pairs will get completed and subsequently freed first. When a worker discovers that there is no work in its local work queue, it will select the neighbor with the most work in its queue and steal the last dictionary from the lowest priority level. The selected work unit should be the first one added to this priority level, meaning that it has probably been pushed out of the cache due to pressure from the lower priority levels. This will hopefully reduce the amount of cache coherency protocol traffic generated during the work-stealing phase.|
|-When a worker removes a work unit they take the initial dictionary and perform the specified pivot on it, producing a new dictionary. A global Bloom filter is then consulted, using the labels of the basis variables as the key, to see if this dictionary has been encountered before. If it has it is thrown away and the worker continues with the next entering/leaving pair in the work unit. If the dictionary hasn't been encountered before the worker will generate a list of all entering and leaving pairs, attach them to the dictionary, and add them to their local priority queue as a new work unit. If the generated dictionary is final a global flag is set to indicate this, along with a pointer to the final dictionary, and the worker then frees all remaining dictionaries on its work queue. The other workers check to see if a final dictionary has been seen before starting a work unit, and if a final dictionary has been seen they will free their remaining work units and await the start of another round of simplex.|
|-This algorithm is started by a manager thread placing the initial feasible dictionary, along with the corresponding entering and leaving variable pairs, into the queue of one of the workers. The manager then waits for a worker to find the final dictionary and returns it once it has been found.|
|-This algorithm allocates a new dictionary on each pivot, and therefore its performance will be dependent on the memory allocator used. To judge this impact I intend to replace the standard library malloc implementation with three other allocators that are designed specifically for multi-threaded applications. These allocators are StreamFlow, tcmalloc, and jemalloc. Lastly, to evaluate the scalability of the two parallelization methods I plan to plot the speedup versus the serial implementation over a range of worker threads and problem sizes. My prediction is that the task-parallel algorithm will perform better overall.|
|+In reality this code shouldn't be used to solve actual problems, but is probably a good example of how to code general-form simplex (including initialization) and then how to parallelize it in different ways.|
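The removed README text describes two data structures for the task-parallel design: a per-worker priority queue in which work units with fewer entering/leaving pairs come first, and a global Bloom filter keyed on the labels of the basis variables. The following is a minimal, single-threaded Python sketch of those two ideas only. It is not taken from plpsolve: the names `BloomFilter`, `WorkQueue`, and `basis_key`, the filter size, the number of hashes, and the use of SHA-256 are all assumptions made for illustration, and threading, work stealing, and the pivot itself are left out.

```python
import hashlib
import heapq
from itertools import count


class BloomFilter:
    """Hypothetical fixed-size Bloom filter (sizes chosen arbitrarily)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive one bit position per hash from a salted SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def basis_key(basis_labels):
    # Sort the labels so the same basis always yields the same key.
    return ",".join(sorted(basis_labels))


class WorkQueue:
    """Hypothetical local queue: fewer pivot pairs means higher priority."""

    def __init__(self):
        self._heap = []
        self._tie = count()  # insertion counter breaks ties between equal priorities

    def push(self, basis_labels, pairs):
        heapq.heappush(self._heap,
                       (len(pairs), next(self._tie), basis_labels, pairs))

    def pop(self):
        _, _, basis_labels, pairs = heapq.heappop(self._heap)
        return basis_labels, pairs

    def __len__(self):
        return len(self._heap)


# Usage: two work units; the one with a single pair is processed first.
queue = WorkQueue()
queue.push(["x1", "x2"], [("x3", "x1"), ("x4", "x2"), ("x3", "x2")])
queue.push(["x3", "x4"], [("x1", "x3")])
first_basis, first_pairs = queue.pop()

# Usage: record a visited basis, then look it up with the labels reordered.
seen = BloomFilter()
seen.add(basis_key(["x2", "x1"]))
already_seen = basis_key(["x1", "x2"]) in seen
```

One property of this choice is worth keeping in mind: a Bloom filter can return false positives, so a lookup may occasionally report a basis as visited when it was not, and a design that discards on a hit would then drop a dictionary it had never explored. A real multi-threaded version would also need atomic or locked updates to the shared filter.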