Set Similarity Search Benchmarks

Benchmark data sets for set similarity search algorithms.

| Data set | Note | Number of sets | Number of tokens | File size | Papers |
|---|---|---|---|---|---|
| BMS-POS (Source) | A set is a purchase in a shop; a token is a product category in that purchase | 515,597 | 1,657 | 3.8 MB | 1 |
| Kosarak (Source) | A set is a user; a token is a link clicked by the user | 990,002 | 41,270 | 13 MB | 1 |
| Flickr | A set is a photo; a token is a tag or a word from the title | 1,680,490 | 810,660 | 29 MB | 1,4 |
| Netflix (Source) | A set is a user; a token is a movie rated by the user | 480,189 | 17,770 | 166 MB | 1 |
| Orkut (Source) | A set is a user; a token is a group membership of the user | 1,853,285 | 15,293,693 | 378 MB | 1 |
| Canada-US-UK Open Data (Query Benchmark 1k, Query Benchmark 10k, Query Benchmark 100k) | A set is a table column; a token is a data value | 745,414 | 562,320,456 | 2.52 GB | 2 |
| WDC Web Table 2015, English Relational-Only (Query Benchmark 100, Query Benchmark 1k, Query Benchmark 10k) | A set is a table column; a token is a data value | 163,510,917 | 184,644,583 | 4.32 GB | 2,3 |

All data sets follow the same format:

  • Compressed using gzip.
  • The first line of the main file is <number of sets> <number of tokens>, optionally followed by a third number, <sum of all set sizes>.
  • Every other line is <set size>\t<1>,<2>,<3>,..., where \t is a tab separator and <1>, <2>, and so on are the tokens of that set.
  • All tokens are integers, transformed from the original strings using a global ascending frequency order.
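
The format above can be read with a few lines of Python. The following is a minimal sketch, not part of the benchmark itself: the function names (parse_header, parse_set_line, read_sets) are hypothetical, and it assumes that the numbers on the first line are separated by whitespace and that each set line carries exactly <set size> tokens.

```python
import gzip
from typing import Iterator, List, Optional, Tuple


def parse_header(line: str) -> Tuple[int, int, Optional[int]]:
    """Parse '<number of sets> <number of tokens> [<sum of all set sizes>]'.

    Assumption: the fields are separated by whitespace.
    """
    fields = [int(x) for x in line.split()]
    if len(fields) not in (2, 3):
        raise ValueError(f"unexpected header line: {line!r}")
    total_size = fields[2] if len(fields) == 3 else None
    return fields[0], fields[1], total_size


def parse_set_line(line: str) -> List[int]:
    """Parse '<set size>\\t<1>,<2>,<3>,...' into a list of integer tokens."""
    size_field, _, token_field = line.rstrip("\r\n").partition("\t")
    tokens = [int(t) for t in token_field.split(",") if t]
    # Sanity check (assumption): the declared size matches the token count.
    if len(tokens) != int(size_field):
        raise ValueError(f"declared size {size_field} != {len(tokens)} tokens")
    return tokens


def read_sets(path: str) -> Iterator[List[int]]:
    """Yield one list of tokens per set from a gzip-compressed data file."""
    with gzip.open(path, "rt") as f:
        parse_header(next(f))  # skip (and validate) the header line
        for line in f:
            if line.strip():  # tolerate a trailing blank line
                yield parse_set_line(line)
```

Because read_sets is a generator, the larger files can be streamed set by set instead of being loaded into memory at once. For example, `for tokens in read_sets("some-data-set.gz"): ...` iterates over the sets in file order, where the file name is a placeholder.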

Papers on set similarity search that use the above data sets:

  1. An Empirical Evaluation of Set Similarity Join Techniques, VLDB 2016
  2. LSH Ensemble: Internet Scale Domain Search, VLDB 2016
  3. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes, SIGMOD 2019 (To Appear)
  4. Spatio-textual similarity joins, VLDB 2012

