Spark Examples

Spark hands-on exercise for the lecture Distributed Data Analytics.

Task

Discover all unary inclusion dependencies in a given dataset (within and between all tables).
The input dataset may consist of multiple tables.
Evaluate the program using the TPCH dataset provided on this website.
The inclusion dependency discovery is based on the paper Scaling Out the Discovery of Inclusion Dependencies (Kruse, Papenbrock, Naumann, 2015).
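
The core idea of the discovery can be illustrated with a minimal sketch. It is not the code from this repository: it uses plain Scala collections instead of Spark so it runs standalone, and the names IndSketch, Table and discoverUnaryInds are hypothetical. It assumes column names are unique across all tables and that each table is already loaded as a map from column name to values. The steps (emit cells, group by value, build candidates, intersect) follow the general approach described in the paper; in a Spark program each step would correspond to a distributed transformation.

```scala
object IndSketch {
  // Hypothetical in-memory stand-in for a CSV table: column name -> column values.
  type Table = Map[String, Seq[String]]

  // Returns dependent column -> set of columns that contain all of its values.
  def discoverUnaryInds(tables: Seq[Table]): Map[String, Set[String]] = {
    // 1. Emit one (value, column) cell per distinct value of every column.
    val cells: Seq[(String, String)] = for {
      table <- tables
      (column, values) <- table.toSeq
      value <- values.distinct
    } yield (value, column)

    // 2. Group by value: for each value, the set of columns containing it.
    val attributeSets: Iterable[Set[String]] =
      cells.groupBy(_._1).values.map(_.map(_._2).toSet)

    // 3. Every column in an attribute set can only be included in the
    //    other columns of that same set.
    val candidates: Iterable[(String, Set[String])] =
      attributeSets.flatMap(set => set.map(col => col -> (set - col)))

    // 4. Intersect the candidate sets per dependent column and drop
    //    columns that are not included in any other column.
    candidates
      .groupBy(_._1)
      .map { case (dep, refs) => dep -> refs.map(_._2).reduce(_ intersect _) }
      .filter(_._2.nonEmpty)
  }
}
```

For example, with an illustrative orders table holding o_custkey values 1 and 2 and a customer table holding c_custkey values 1, 2 and 3, the sketch reports that o_custkey is included in c_custkey but not the reverse. Columns without any values never produce a cell and are therefore not reported.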

Build a fat JAR using sbt assembly.
Run the main method with the following program arguments:
- --path <path to folder> - Path to the folder containing the dataset CSV files. Optional, defaults to ./TPCH.
- --paths <fileA,fileB,fileC> - Direct paths to the dataset files, separated by commas. Optional; if omitted, the files in the --path folder are used.
- --cores <number of cores> - Number of local cores to use. Optional, defaults to 4.
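
The arguments above could be handled roughly as follows. This is a sketch under assumptions, not the parser used in this repository: ArgsSketch, Config, parse and inputFiles are hypothetical names, and only the flag names and defaults are taken from the list above.

```scala
import java.io.File

object ArgsSketch {
  // Defaults mirror the documented ones: ./TPCH and 4 cores.
  final case class Config(
      path: String = "./TPCH",
      paths: Seq[String] = Seq.empty,
      cores: Int = 4)

  // Consumes the arguments pairwise as "--flag value".
  @scala.annotation.tailrec
  def parse(args: List[String], config: Config = Config()): Config = args match {
    case "--path" :: value :: rest =>
      parse(rest, config.copy(path = value))
    case "--paths" :: value :: rest =>
      parse(rest, config.copy(paths = value.split(",").map(_.trim).toSeq))
    case "--cores" :: value :: rest =>
      parse(rest, config.copy(cores = value.toInt))
    case Nil =>
      config
    case unknown :: _ =>
      throw new IllegalArgumentException(s"Unknown or incomplete argument: $unknown")
  }

  // Explicit --paths win; otherwise fall back to the CSV files in --path.
  def inputFiles(config: Config): Seq[String] =
    if (config.paths.nonEmpty) config.paths
    else Option(new File(config.path).listFiles())
      .map(_.toSeq)
      .getOrElse(Seq.empty)
      .filter(_.getName.endsWith(".csv"))
      .map(_.getPath)
      .sorted
}
```

With this sketch, ArgsSketch.parse(List("--cores", "8")) keeps the default folder ./TPCH and sets the core count to 8. In a local Spark setup the core count would typically end up in the master string, for example local[8], though how this repository wires it up is not shown here.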
