
gpussjoin

Set Similarity joins in the GPGPU paradigm

This work is based on the filter-verification framework developed by Mann (http://ssjoin.dbresearch.uni-salzburg.at/).

We employ the GPGPU paradigm to accelerate exact set similarity joins. The CPU is responsible for candidate generation (the filtering phase), while verification is delegated to the GPU. Because CPU and GPU execution overlap, on large datasets where billions of candidate pairs are verified there is a 2.6x speedup over the sequential implementation.
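To make the verification phase concrete, here is a minimal CPU-side sketch of what verifying one candidate pair under a Jaccard threshold involves. It is not the repository's GPU kernel. The names `Record`, `required_overlap` and `verify_pair` are hypothetical, and the sketch assumes each record is stored as an ascending list of integer tokens.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: a record is a set of integer tokens kept in ascending order.
using Record = std::vector<std::uint32_t>;

// Smallest overlap o for which Jaccard(r, s) >= t holds:
//   o / (|r| + |s| - o) >= t   <=>   o >= t / (1 + t) * (|r| + |s|).
// The small epsilon guards against floating-point results like 2.0000000001.
std::size_t required_overlap(std::size_t len_r, std::size_t len_s, double t) {
    const double bound = t / (1.0 + t) * static_cast<double>(len_r + len_s);
    return static_cast<std::size_t>(std::ceil(bound - 1e-9));
}

// Merge-style intersection of two sorted records with early termination.
bool verify_pair(const Record& r, const Record& s, double t) {
    const std::size_t need = required_overlap(r.size(), s.size(), t);
    std::size_t i = 0, j = 0, overlap = 0;
    while (i < r.size() && j < s.size()) {
        // Stop once the remaining tokens can no longer reach the required overlap.
        const std::size_t remaining = std::min(r.size() - i, s.size() - j);
        if (overlap + remaining < need) return false;
        if (r[i] == s[j]) { ++overlap; ++i; ++j; }
        else if (r[i] < s[j]) { ++i; }
        else { ++j; }
    }
    return overlap >= need;
}
```

Each candidate pair is verified independently of all others, which is what makes this step a natural fit for one GPU thread (or thread group) per pair while the CPU keeps producing candidates.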

Dependencies

Building

./build.sh (Release|Debug) SM_ARCH
./build.sh Release 61 # builds an executable for Compute Capability 6.1

Arguments & Execution

  --algorithm arg       algorithm to use (allpairs, ppjoin, groupjoin)
  --threshold arg       Jaccard threshold
  --input arg           file, each line a record
  --threads arg         number of threads per block
  --devmemory arg       device memory to use (e.g. 512M, 4GB)
  --scenario arg        gpu scenario to execute (1, 2, 3)
./set_sim_join --algorithm allpairs --threshold 0.9 --input ~/datasets/dblp.txt --devmemory 1G --scenario 3
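The --devmemory values shown above (512M, 4GB, 1G) combine a number with a size suffix. The following is a rough sketch of how such a string could be turned into a byte count. `parse_mem_size` is a hypothetical helper, not the program's actual parser, and it assumes binary multiples (1K = 1024 bytes) and does not check for overflow.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: parses "512M", "4GB", "1G", ... into bytes.
// Returns false if the text has no leading digits or an unknown suffix.
// Assumption: suffixes denote binary multiples (K = 2^10, M = 2^20, G = 2^30).
bool parse_mem_size(const std::string& text, std::uint64_t& bytes) {
    std::size_t pos = 0;
    std::uint64_t value = 0;
    while (pos < text.size() && std::isdigit(static_cast<unsigned char>(text[pos]))) {
        value = value * 10 + static_cast<std::uint64_t>(text[pos] - '0');
        ++pos;
    }
    if (pos == 0) return false;  // no number at the start

    // Normalise the suffix to upper case so "4gb" and "4GB" behave the same.
    std::string suffix = text.substr(pos);
    for (char& c : suffix) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }

    if (suffix.empty() || suffix == "B") { bytes = value; return true; }
    if (suffix == "K" || suffix == "KB") { bytes = value << 10; return true; }
    if (suffix == "M" || suffix == "MB") { bytes = value << 20; return true; }
    if (suffix == "G" || suffix == "GB") { bytes = value << 30; return true; }
    return false;  // unrecognised suffix
}
```

Under these assumptions, "1G" in the example command corresponds to 1073741824 bytes of device memory.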

The datasets and the preprocessing scripts can be found at http://ssjoin.dbresearch.uni-salzburg.at/datasets.html
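As an illustration of the "each line a record" input format, the sketch below reads records from a stream. It assumes that, after preprocessing, every line holds whitespace-separated non-negative integer tokens. That layout is an assumption for illustration, and `Record` and `read_records` are hypothetical names rather than the repository's own input code.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Assumption: one record per line, tokens are whitespace-separated integers.
using Record = std::vector<std::uint32_t>;

std::vector<Record> read_records(std::istream& in) {
    std::vector<Record> records;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream tokens(line);
        Record rec;
        std::uint32_t tok;
        while (tokens >> tok) {
            rec.push_back(tok);
        }
        // Blank lines carry no record and are skipped.
        if (!rec.empty()) {
            records.push_back(std::move(rec));
        }
    }
    return records;
}
```

Passing a `std::ifstream` opened on the dataset file works the same way as the string stream used for testing, since both derive from `std::istream`.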