Spark hands-on exercise for the lecture Distributed Data Analytics.
- Discovers all unary inclusion dependencies in a given dataset (within and between all tables).
- The input dataset may consist of multiple tables.
- Evaluate the program using the TPCH dataset provided on this website.
- The inclusion dependency discovery is based on the paper Scaling Out the Discovery of Inclusion Dependencies (Kruse, Papenbrock, Naumann, 2015).
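
The core idea of the referenced approach can be sketched on plain Scala collections: collect (value, column) cells, group them by value into attribute sets, derive candidate lists per column, and intersect them. This is a minimal illustration and not the exercise's actual Spark implementation. The `Table` type, the `discover` function, and the sample data are hypothetical, and column names are assumed to be unique across tables (as they are in TPCH).

```scala
object UnaryIndSketch {
  // Hypothetical in-memory table: column names plus rows of string values.
  final case class Table(columns: Seq[String], rows: Seq[Seq[String]])

  // Returns, for each dependent column, the set of columns that contain all of its values.
  def discover(tables: Seq[Table]): Map[String, Set[String]] = {
    // 1. Split every row into (value, column) cells.
    val cells: Seq[(String, String)] = for {
      table <- tables
      row <- table.rows
      (value, column) <- row.zip(table.columns)
    } yield (value, column)

    // 2. Group by value: each value yields the set of columns it occurs in.
    val attributeSets: Set[Set[String]] =
      cells.groupBy(_._1).values.map(_.map(_._2).toSet).toSet

    // 3. Every column in an attribute set can only be included in the other columns of that set.
    val inclusionLists: Seq[(String, Set[String])] = for {
      attributes <- attributeSets.toSeq
      dependent <- attributes.toSeq
    } yield (dependent, attributes - dependent)

    // 4. Intersect the candidate sets per dependent column; what survives holds for every value.
    inclusionLists
      .groupBy(_._1)
      .map { case (dependent, lists) => dependent -> lists.map(_._2).reduce(_ intersect _) }
      .filter(_._2.nonEmpty)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data, for illustration only.
    val orders = Table(Seq("O_ID", "O_CUST"), Seq(Seq("1", "a"), Seq("2", "b")))
    val customers = Table(Seq("C_ID"), Seq(Seq("a"), Seq("b"), Seq("c")))
    for ((dependent, referenced) <- discover(Seq(orders, customers)))
      println(s"$dependent is included in ${referenced.mkString(", ")}")
  }
}
```

In the sample data, every value of `O_CUST` also occurs in `C_ID`, so that is the only dependency reported. In a Spark version, the grouping steps would presumably map to distributed group-by or reduce operations.
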
- Build a fat JAR using `sbt assembly`.
- Run the main method with the following program arguments:
  - `--path <path to folder>`: Path to the folder containing the dataset CSV files. Optional, defaults to `./TPCH`.
  - `--paths <fileA,fileB,fileC>`: Direct paths to the dataset files, separated by commas. Optional, defaults to the `--path` argument.
  - `--cores <number of cores>`: Number of local cores to use. Optional, defaults to `4`.
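
As a rough sketch of how the program arguments listed above could be handled, the following parser applies the documented defaults. It is not taken from the exercise's code: `Config`, `parse`, and the error handling are hypothetical, and an unset `paths` is assumed to mean "use the files in the `path` folder".

```scala
object ArgsSketch {
  // Hypothetical configuration holder with the documented defaults.
  final case class Config(
      path: String = "./TPCH",            // folder with the dataset CSV files
      paths: Option[List[String]] = None, // None: fall back to the files in `path` (assumption)
      cores: Int = 4                      // number of local cores
  )

  // Consumes flag/value pairs from left to right; a later flag overrides an earlier one.
  @annotation.tailrec
  def parse(args: List[String], config: Config = Config()): Config = args match {
    case "--path" :: value :: rest  => parse(rest, config.copy(path = value))
    case "--paths" :: value :: rest => parse(rest, config.copy(paths = Some(value.split(",").toList)))
    case "--cores" :: value :: rest => parse(rest, config.copy(cores = value.toInt))
    case Nil                        => config
    case other :: _ =>
      throw new IllegalArgumentException(s"Unknown or incomplete argument: $other")
  }
}
```

For example, `ArgsSketch.parse(List("--cores", "8"))` keeps the default folder `./TPCH` and only changes the core count.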