Find file
Fetching contributors…
Cannot retrieve contributors at this time
28 lines (19 sloc) 2.04 KB
The beginnings of a library to perform sorting of large datasets outside the scope of a map-reduce jobs.
Please set hadoop.version and hadoop.path in to point to your version of
Once commoncrawl-mergeutils.jar has been built, you can execute a org.commoncrawl.hadoop.mergeutils.MergeSortSpillWriterUnitTest via the bin/ script.
The luancher runs the command in the background. You can monitor progress via either ./logs/<<ClassName>>.log for LOG output, or ./logs/<<ClassName>>_run.log for stdout output.
The main classes of interest are MergeSortSpillWriter and SequenceFileMerger. MergeSortSpillWriter can be fed unsorted records via the SpillRecord API. It will
internally buffer records until a configurable threshold is reached, and then sort the intermediate records and spill them to a temp sequence file. This will continue
until the close method is called. Close will trigger the class to spill the final set of records and then feed the Part files to SequenceFileMerger. SequenceFileMergerwill
perform a merge-sort of the records and spill them to a configurable output SpillWriter. To optimize the sort, one should specify a RawKeyValueComparator or to squeeze even more
performance use the OptimizedKeyGeneratorAndComparator class to generate a long key value from key,value pairs or a long key + buffer secondary key from a key,value pair.
The MergeSortSpillWriter and SequenceFileMerger classes have been used in production to sort very large recordsets (100s of millions of records), but one should be
aware that the library is a work in progress. The combiner code in SequenceFileMerge should be avoided for now.
Upcoming features include:
1. The ability to spill Raw Records via MergeSortSpillWriter.
2. Native (C++) quick-sort support when an optimized comparator/generator is used.
3. TFileReader and TFileSpillWriter class additions.
4. Flushing out of the Combiner/Reducer implementation in the SequenceFileMerger class.
5. Making the quick-sort mutli-threaded, and parallelizing the merge-sort.
Feedback is welcome :-)