Skip to content

External Sorting algorithm for Databases having constrained storage hierarchy

License

Notifications You must be signed in to change notification settings

divy9881/External-Sorting-for-Databases

Repository files navigation

External-Sorting-for-Databases

External Sorting algorithm for Databases having constrained storage hierarchy

Group Members

Individual Contributions

  • Divy: Cache-size mini runs, Device-optimized page sizes, Spilling memory-to-SSD, Spilling from SSD to disk, Graceful degradation, Optimized merge patterns, Testing and Memory Leak Check
  • Sahil: Tournament trees, Offset-value coding, Minimum count of row & column comparisons, Optimized merge patterns, Large-size records, Testing and Memory Leak Check
  • Devaki: Tournament trees, Offset-value coding, Large-size records
  • Manaswini: Verification

Techniques Implemented by our submission and the corresponding Source Files and Lines

  • Tournament trees: File Tree.cpp @ Line 196

  • Offset-value coding: File DataRecord.cpp @ Line 122

  • Minimum count of row & column comparisons

  • Cache-size mini runs: File SortRecords.cpp @ Line 26

  • Device-optimized page sizes: File SortRecords.cpp @ Line 81 and Line 136

  • Spilling memory-to-SSD: File SortRecords.cpp @ Line 65

  • Spilling from SSD to disk: File SortRecords.cpp @ Line 69 and Line 125

  • Graceful degradation: File SortRecords.cpp @ Line 72, Line 74 and Line 151

    • Into merging
    • Beyond one merge step
  • Optimized merge patterns: File SortRecords.cpp @ Line 150 and Line 151

  • Verifying: File Iterator.cpp @ Line 69 and Line 84

    • sets of rows & values: File Iterator.cpp @ Line 84
    • sort order: File Iterator.cpp @ Line 69
  • Replacement selection?

  • Run size > memory size?

  • Variable-size records?

  • Compression?

  • Prefix truncation?

  • Quicksort

Reasons we chose to implement the specific subset of techniques

  • Tournament-tree priority queue was used in order to achieve high fan-in for merging our sorted run inputs of records and less number of comparisons than a standard tree-of-winners
  • Offset-value coding was used to achieve minimum column value comparisons
  • Cache-size mini runs were used to be able to fit the sort inputs, for tournament-tree, in the cache. This enabled us to leverage the low-latency accesses when there are cache hits
  • Device-optimized page sizes were used in order to being cognizant about the access-profile(latency, bandwidth) of various devices in the storage hierarchy. For SSD, we used 8KB(100 MB/s * 0.1 ms ~ 10KB) and for HDD, we used 1MB(100 MB/s * 10 ms ~ 1MB)
  • We achieved graceful-degradation by spilling cache-size runs from cache to memory, spilling memory-size runs from memory to SSD and spilling SSD-size runs from SSD to HDD
  • Also HDD-page size(1MB) sorted runs were written to SSD prior to actually merging runs on the HDD. This is to leverage low-latency accesses of flash drives(SSD)
  • Sort-order, set of rows and their values were verified as part of sorting the input records. This is to verify the correctness and integrity of our sort algorihthm

Project's state

  • The implementation of the External-Sort is complete with all of the techniques which were expected from us as part of the course project
  • The sort was tested against 1KB size records and with 12M number of records(although it takes ~1hr to complete the sort, for this particular test-case)
  • The sort algorithm was tested against valgrind to check for any memory leaks introduced while developing. The codebase does not have any memory leaks, from the latest leak-report on the most recent code version

How to run our programs

  • To run our program, first compile the source code using following command, under External-Sort directory
$ cd External-Sort
$ make ExternalSort.exe
  • After compiling the source code, to execute the External Sort with custom arguments, run following command inside External-Sort directory
# Where,
# "-c" gives the total number of records
# "-s" is the individual record size
# "-o" is the trace of your program run
$ ./ExternalSort.exe -c 120 -s 10 -o trace0.txt
  • The program creates three directories on the completion of the sort algorithm:

    • input: This directory consist of the input table which has records generated by the random-generator in arbitrary order
    • output: This directory consist of the output table which has records from input table but in a sorted order, sorted using our sort algorithm
    • trace: This directory consists of trace files generated from the sort. The trace file consists of logs related SSD and HDD device accesses. And the logs related to sort state machine
  • In order to remove all the generated binaries, executables, and the utility directories mentioned above, run the following command

$ make clean

Initial Setup

$ docker run -it --privileged -v $pwd/External-Sort:/External-Sort ubuntu bash
$ apt-get update
$ apt-get install build-essential make g++ vim sudo valgrind -y
$ cd ./External-Sort

Generating ExternalSort.exe and Running the Executable

$ make ExternalSort.exe

# Where,
# "-c" gives the total number of records
# "-s" is the individual record size
# "-o" is the trace of your program run
$ ./ExternalSort.exe -c 120 -s 1000 -o trace0.txt

Run Valgrind for a Leak Check

# Creates a log file `valgrind` inside the External-Sort directory
$ valgrind --track-origins=yes --log-file="/External-Sort/valgrind" --leak-check=yes ./ExternalSort.exe -c 120 -s 1000 -o trace0.txt

About

External Sorting algorithm for Databases having constrained storage hierarchy

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •