MapReduce Transactions #22

Open
apavlo opened this Issue Jan 11, 2012 · 2 comments

apavlo commented Jan 11, 2012

This project is for improving H-Store's current implementation of MapReduce transactions.

  1. Add support for non-blocking Reduce execution. Currently, the Reduce phase of a MapReduce transaction is executed like the Map phase: as multiple single-partition transactions that block all other transactions from executing on each partition. This is unnecessary because Reduce only operates on the data generated by Map (i.e., it is not allowed to access the database). Thus, this task is to explore different ways to execute the Reduce jobs without blocking the main execution threads. We will implement two different approaches for processing this work. The client will be able to specify at runtime which technique to use to process the transaction.
    • The Reduce jobs will be offloaded to the MapReduceHelperThread for execution. The MapReduceHelperThread will execute the Reduce job for each partition serially.
    • The Reduce jobs will be entered into a special queue at each partition that contains miscellaneous work that the PartitionExecutor can execute whenever it is blocked and needs something to do (e.g., while waiting for a TransactionWorkResult from a distributed query). A special method will allow the system to process a subset of the ReduceInput data (e.g., the list of values for just one key) and quickly return to check whether the thing that it was blocked on has arrived. This can either be implemented as a special thread that the PartitionExecutor can quickly restart/block, or just by using the PartitionExecutor's own thread (the former is likely easier, but I am not sure of the CPU cache implications).
  2. Add support for dispatching Map jobs as asynchronous single-partition transactions. Currently, each MapReduce transaction is invoked as a distributed transaction that blocks all partitions. Although the PartitionExecutor will continue to execute non-MapReduce single-partition transactions, its partition is blocked from executing any non-MapReduce distributed transaction. We should reduce the initialization step and have each new MapReduce request get executed immediately.
    • Need to double-check whether the Map-job executing at the base partition releases all of the locks in the cluster when it completes. I'm not sure if it does that now.
  3. Conduct experiments that compare distributed transactions versus MapReduce transactions.
    • Execute an entirely single-partitioned workload and measure the drop in performance when executing the distributed/MapReduce transactions.
    • Compare the latency of distributed versus MapReduce transactions.
    • Research metrics for determining the accuracy of "fuzzy queries". This will allow us to measure how different the results are when consistency is ignored. This is actually a bit tricky to do because we can't execute the distributed transaction and the MR transaction at the exact same time, so we will need to think about how we can do this deterministically.
  4. If the fuzzy query metrics in the experiments described above are significantly different, then we will want to implement the ability to dynamically enable snapshot isolation at run time for Map jobs. When a MapReduce transaction request arrives, the system will turn on timestamp versioning for tuples. Any tuple that is modified while this is enabled is appended to the end of the table and tagged with a timestamp (to save space, the timestamp can be a simple counter). Since Map jobs read tuples but never update them, I think that this is the only versioning support that we need. We cannot use the HyPer approach of forking the JVM because that still does not guarantee the isolation level that we need. See Serializable Isolation for Snapshot Databases for a basic overview of this technique.
  5. Improve support in the query planner for multi-aggregate queries.
SELECT ol_number, SUM(ol_amount), AVG(ol_quantity)
  FROM order_line, item
 WHERE order_line.ol_i_id = item.i_id
 GROUP BY ol_number
 ORDER BY ol_number
  6. Optional: Add support for writing out Map results to disk. We need to think about this a bit, but we may run into the problem where a Map job generates more output than we want to keep in memory. Instead, we could serialize the VoltTable out to disk. The more I think about this, however, the more I think it's unnecessary.
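
The per-partition queue idea in task 1 could be sketched roughly as follows. This is only a sketch of the cooperative-draining technique, not H-Store's actual API: `CooperativeReducer`, `ReduceUnit`, and `drainWhileBlocked` are hypothetical names, and the per-key SUM stands in for an arbitrary Reduce function. The key point is that each unit of work covers just one key's values, so the blocked executor can re-check its awaited event between units.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: Reduce input is split into per-key work units that
// sit in a side queue at the partition. While the executor is blocked
// waiting on a remote result, it drains one unit at a time and re-checks
// the condition it is blocked on between units.
public class CooperativeReducer {
    // One work unit: the list of values for a single key.
    static final class ReduceUnit {
        final String key;
        final List<Integer> values;
        ReduceUnit(String key, List<Integer> values) {
            this.key = key;
            this.values = values;
        }
    }

    private final Queue<ReduceUnit> pending = new ArrayDeque<>();
    private final Map<String, Integer> output = new TreeMap<>();

    public void submit(String key, List<Integer> values) {
        pending.add(new ReduceUnit(key, values));
    }

    // Called by the executor while it is blocked. Processes units until
    // either the awaited event arrives or the queue is empty; returns the
    // number of units processed in this call.
    public int drainWhileBlocked(BooleanSupplier eventArrived) {
        int processed = 0;
        while (!eventArrived.getAsBoolean()) {
            ReduceUnit unit = pending.poll();
            if (unit == null) break;          // no Reduce work left
            int sum = 0;                      // example reduce fn: SUM per key
            for (int v : unit.values) sum += v;
            output.put(unit.key, sum);
            processed++;
        }
        return processed;
    }

    public Map<String, Integer> results() { return output; }
}
```

In the real system the `BooleanSupplier` would test whether the blocked-on TransactionWorkResult has arrived, so the executor never sits idle but also never delays a distributed query once its response shows up.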

@ghost ghost assigned xinjiacs Jan 11, 2012

@xinjiacs xinjiacs closed this Jan 20, 2012

@xinjiacs xinjiacs reopened this Jan 20, 2012

Collaborator

xinjiacs commented Mar 14, 2012

Progress so far:

  1. Finished both the blocking and non-blocking ways to execute the Reduce phase.
  2. Finished the asynchronous way to execute the Map phase.
  3. Finished Python scripts for running H-Store on Amazon EC2 for latency and throughput performance tests.
  4. H-Store can now execute natural join queries; wrote a JoinAgg stored procedure to test this.

What's next:

  1. Fix the original distributed query plan optimizer to push down the aggregates.
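
The aggregate push-down item could work along these lines. This is an illustrative sketch, not H-Store's planner code (`Partial`, `localPartial`, and `globalAvg` are hypothetical names): it shows why an AVG must be decomposed into SUM and COUNT before being pushed down to the partitions, since combining per-partition averages directly gives the wrong answer whenever partitions hold different row counts.

```java
import java.util.List;

// Hypothetical sketch of aggregate push-down for a distributed AVG:
// each partition computes a (sum, count) partial locally, and the
// coordinator combines the partials into the final average.
public class AggregatePushDown {
    // Partial aggregate computed locally on one partition.
    static final class Partial {
        final long sum;
        final long count;
        Partial(long sum, long count) { this.sum = sum; this.count = count; }
    }

    static Partial localPartial(List<Integer> rows) {
        long s = 0;
        for (int v : rows) s += v;
        return new Partial(s, rows.size());
    }

    // Correct: combine pushed-down (sum, count) pairs, then divide once.
    static double globalAvg(List<Partial> partials) {
        long s = 0, c = 0;
        for (Partial p : partials) { s += p.sum; c += p.count; }
        return (double) s / c;
    }

    // Broken alternative: averaging the per-partition averages weights
    // every partition equally regardless of how many rows it holds.
    static double naiveAvgOfAvgs(List<Partial> partials) {
        double total = 0;
        for (Partial p : partials) total += (double) p.sum / p.count;
        return total / partials.size();
    }
}
```

For example, with rows [1, 2, 3] on one partition and [10] on another, the true average is 16/4 = 4.0, while averaging the two partition averages gives (2 + 10)/2 = 6.0; the optimizer therefore has to push down SUM and COUNT (and similarly rewrite AVG in multi-aggregate queries) rather than AVG itself.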
Collaborator

xinjiacs commented Apr 8, 2012

  1. The original distributed query plan optimizer can now push down the aggregates.
  2. Added a Wikipedia benchmark (still working on this part).
  3. Added other distributed queries and other MapReduce stored procedures.
