SoK: Efficient Privacy-preserving Clustering
This repository contains the source code for the PETS'21 paper HMS+21 by Aditya Hegde, Helen Möllering, Thomas Schneider, and Hossein Yalame.
A brief description of the subdirectories in the codebase is given below. The README in each subdirectory provides more information on compilation and usage.
he_meanshift: An implementation of the HE-Meanshift protocol presented by Cheon et al. in CKP19.
hc_protocols: An implementation of the hierarchical clustering protocols of Meng et al. in MPO19.
utils: Scripts to automate simple tasks and aid in analysis.
data: A sample dataset to use as input. See the Datasets section for more details.
Building the Project
All dependencies required to compile and run the project are available in the Docker image. To use Docker, run the following:
docker pull adishegde/sok-ppcluster:latest
docker run -it adishegde/sok-ppcluster:latest
To build the Docker image locally, run the following:
docker build -t sokppcluster .
docker run -it sokppcluster
We observed that the build process requires at least 4 GB of RAM, which must be explicitly allocated to Docker on Windows and macOS.
The code is written in C++17 and uses
hc_protocols implementations have different external dependencies and can be built separately using the instructions given in their respective READMEs.
Datasets
The datasets we use for evaluating clustering quality are available in the public GitHub repository gagolews/clustering_benchmarks_v1.
While the above repository provides datasets in text format saved as .gz files, the C++ benchmark programs require the input dataset to be in a NumPy file format. The utils/transform_data.py program can be used to convert a .gz file into that format. Please refer to the README in the utils directory for usage information.
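To illustrate the kind of conversion involved, here is a minimal sketch. It is not the actual utils/transform_data.py script. It assumes the .gz file holds a whitespace-separated numeric matrix and that the target is a .npy file; both are assumptions made for illustration. The function name gz_text_to_npy and the file names are hypothetical.

```python
import gzip
import os
import tempfile

import numpy as np


def gz_text_to_npy(gz_path, npy_path):
    """Load a gzip-compressed text matrix and save it as a .npy file."""
    # np.loadtxt transparently decompresses paths ending in .gz.
    # ndmin=2 keeps a 2-D shape even when there is a single attribute.
    data = np.loadtxt(gz_path, ndmin=2)
    np.save(npy_path, data)
    return data


# Build a tiny toy input so the sketch runs on its own.
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "toy.data.gz")  # hypothetical file name
npy_path = os.path.join(workdir, "toy.npy")     # hypothetical file name
with gzip.open(gz_path, "wt") as f:
    f.write("1.0 2.0\n3.0 4.0\n5.0 6.0\n")

converted = gz_text_to_npy(gz_path, npy_path)
reloaded = np.load(npy_path)
print(reloaded.shape)  # (3, 2)
```

For real inputs, the README in the utils directory remains the authoritative reference for the script's options and output format.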
A sample dataset in the above formats, along with the corresponding ground truth, is available in the data directory. It was created using scikit-learn's make_blobs function and consists of 128 data records, each with 1 attribute, forming 2 clusters.
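A dataset of the same shape can be generated along the following lines. This is a sketch only: the random_state value, the output file names, and the choice to save both text and .npy copies are assumptions, not the settings used to produce the bundled sample.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import make_blobs

# 128 records, 1 attribute, 2 clusters, matching the sample's description.
# random_state=0 is an arbitrary choice made here for reproducibility.
points, labels = make_blobs(
    n_samples=128, n_features=1, centers=2, random_state=0
)

out_dir = tempfile.mkdtemp()  # hypothetical output location
data_gz_path = os.path.join(out_dir, "sample.data.gz")      # hypothetical name
labels_gz_path = os.path.join(out_dir, "sample.labels.gz")  # hypothetical name
data_npy_path = os.path.join(out_dir, "sample.npy")         # hypothetical name

# np.savetxt compresses automatically when the file name ends in .gz.
np.savetxt(data_gz_path, points)
np.savetxt(labels_gz_path, labels, fmt="%d")
np.save(data_npy_path, points)

print(points.shape, sorted(set(labels.tolist())))
```

The labels array serves as the ground truth: each entry is the index of the cluster from which the corresponding record was drawn.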