This repository contains an example application of SMCQL developed by Johes Bater, a Multi-Party Computation (MPC) protocol for secure database queries. If you use this experiment for research, please consider citing the original paper [1].
This code is for research and demonstration purposes only and is not suitable for use in production.
In SMCQL, clients can query data held by two organisations in their own private databases according to a public schema, via a broker assumed to be honest. The query response is returned as if all data were contained in a single unified database, but SMCQL guarantees that the broker learns nothing about data held by each organisation individually beyond what can be deduced by the result of the query, and that the organisations do not learn anything about each other's data.
The computation works roughly as follows. The broker decomposes the client's SQL query in a specific way, compiles bytecode according to this decomposition to enable Secure Multi-Party Computation (SMC, or MPC), and runs this code over SSH on the remote systems. The bytecode enables the remote sites to perform SMC and return the result to the broker. Details can be found in the paper [1].
Two different setups are provided:
- Local: A single docker container contains a broker and two 'remote' databases (all simulated on localhost). This example is close to the one provided in the SMCQL repository.
- Remote: Three docker containers, one containing the honest broker, and the other two containing a database to be queried by the broker. Each database (one for the broker, and one for each site) and the corresponding data are entirely separate, emulating a real use case.
The sample-conf
directory contains versions of the conf
directory
from the SMCQL repo for each entity in the experiment: e.g.,
./sample-conf/local/
is the configuration directory conf
from the
original repo (used for the local experiment in this repo), with
anything irrelevant for the experiment removed: in the SMCQL repo, the
conf
directory contains all data for both test sites, but in
'reality' this information should not be accessible to both sites and
the broker.
This experiment has only been tested on a 64-bit Linux system.
docker-compose
must be installed.
In the remote experiment, the scripts set up a docker bridge network with subnet 172.123.0.0/24. If this needs to be changed (e.g. because you already have a docker network with this subnet) the following files need updating:
- ./docker/docker-compose-remote.yml
- ./sample-conf/remote/broker/setup.remote
- ./sample-conf/remote/broker/connections/localhost
- ./sample-conf/remote/broker/connections/remote-hosts
All databases must share the same schema, defined in
conf/workload/testDB/test_schema.sql
. Databases in the SMCQL
experiments are:
smcql_testDB
: The broker's local database.smcql_testDB_site1
: Site 1's database, which is accessed by the broker at runtime.smcql_testDB_site2
: Site 2's database, which is accessed by the broker at runtime.
The databases are created via the
conf/workload/testDB/create_test_dbs.sh
script that runs as part of
the setup.sh
script. It performs the following processing:
- Executes
test_schema.sql
to create the same set of (empty) tables insmcql_testDB
,smcql_testDB_site1
andsmcql_testDB_site2
. More specifically, thetest_schema.sql
file defines:- Table structure (i.e. column headers and types).
- Permissions on the rows as public, protected or private (see Queries for more details). If a row's permissions are not set in this file, they default to 'private' and cannot be queried.
- Populates the
smcql_testDB_site1
andsmcql_testDB_site2
databases using CSV files in thesample-conf/local/workload/testDB/1
andsample-conf/local/workload/testDB/2
directories for the local experiment, andsample-conf/remote/site1/workload/testDB/1
andsample-conf/remote/site1/workload/testDB/2
for the remote experiment. - Executes
setup_test_registries.sql
to populate 'query tables' that are used for runtime queries. Their structure must be defined in thetest_schema.sql
file since the broker must know their structure, but their content (for each site) is populated in thecreate-test-dbs.sh
script. The reason for creating tables for specific queries in this file is to minimise the amount of computation that is done in SMC (which is several of orders of magnitude more expensive than computing in the clear). For example,WHERE
clauses are not supported as SMCQL queries because they can generally be avoided by locally refining the tables on which the SMCQL queries will be run.
NOTE It is mandated that there be a 'site' table in each of the
site databases, containing a single, unique site_id
entry. This
table is used internally by SMCQL. Additionally, where tables owned
by different sites are aggregated during SQL queries later (e.g. where
selecting all rows from either site where some predicate holds), the
tables must contain a site
column.
The databases can be inspected by entering the container (e.g. with
docker exec -it smcql_site1 bash
), changing to the postgres
user
and running psql
:
sudo su - postgres
psql -d smcql_testDB_site1
> \dt
> SELECT * FROM <table name>
By default, queries are stored in the
sample-conf/remote/broker/workload/sql
directory (or the
sample-conf/local/worksload/sql
directory for the local experiment).
Queries can be made to obtain information on:
- Public attributes;
- Aggregated values on protected attributes provided the compiler decides there is sufficient privacy in revealing the information;
- Nothing that depends on private attributes.
See notes in subsection above on the contraints on SMCQL queries.
Run ./scripts/build.sh
from the repo root to build the docker images.
(Note that running the start.sh
script will initiate the build if and
only if the images have not already been built.) The build script builds
two images:
smcql
: Used to create containers for the local experiment and for the broker in the remote experiment.smcql_remote
: Used to create containers for the remote sites in the remote experiment.
The remote sites essentially just need to run a Java Runtime Environment and a host a database, whereas the broker also needs to be able to 'compile' SQL queries into SMC executables.
Inside the container, there is a user called smcql
(who has sudo
privileges). In the broker container, the home directory of the SMCQL
user contains a copy of the SMCQL repository.
All commands below should be run from within the repository's root
directory, i.e. /home/smcql/smcql
, from either the local container
(in the local experiment) or the broker container (in the remote
experiment).
The smcql
user is also a superuser (in both container types) for the
purpose of managing databases.
For simplicity (and because this is just a test experiment), the
smcql
user has password smcql
wherever a password is required for
the user (e.g. in the postgres databases).
To run this experiment, perform the following steps:
- Run
./scripts/start.sh local
. - Run
docker exec -it smcql_local bash
to enter the container. - Run
./build_and_execute.sh ./conf/workload/sql/test_query_1.sql testDB1 testDB2
to run the querytest_query_1.sql
.
The docker-compose-remote.yml
file creates three containers:
smcql_broker
: The honest brokersmcql_site1
: Site 1 hosting its databasesmcql_site2
: Site 2 hosting its database
SSH access is set up from the broker to the remote sites in the
start.sh
script.
To run this experiment, perform the following steps:
- Run
./scripts/start.sh remote
. - Run
docker exec -it smcql_broker bash
to enter the broker container. - Run
./build_and_execute.sh ./conf/workload/sql/test_query_1.sql testDB1 testDB2
to run the querytest_query_1.sql
.
The setup.sh
script must be run in each of the three containers
(located at /home/smcql/smcql
) following any changes to:
- The database schema defined in
sample-conf/*/workload/testDB/test_schema.sql
(which must be the same for all entities); - The dynamic query tables constructed in
setup_test_registries.sql
; or - The CSV files containing table data.
- The 'Auth Cancel' error is due to invalid SSH keys. You can try
regenerating them in the broker and copying them manually to each
site using
ssh-copy-id
. - When running an SMCQL query, if you get an 'array out of bounds'
exception, then you have requested an SQL query that SMCQL cannot
parse. For example, this could be because it contains a
WHERE
clause (see notes in the Databases section on how to get around this)
The SMCQL repo says that for a remote version, the testDB1
option
should be replaced with remoteDB1
(and similarly testDB1
with
remoteDB2
). The choice between testDB1
and remoteDB1
apparently
only indicates to the software whether it should read setup.local
or
setup.remote
, and in particular does NOT correspond to the names of
the connections in the files under the connections
directory -- the
files in those directories should always refer to testDB1
and
testDB2
. The remoteDB1
value on the command line tells the
software to use the remote-hosts
file to find the configuration,
instead of the localhosts
file. However, it appears that rewriting
the setup.local
testDB1
and testDB2
connection configuration
with remote information is the only way to force the software to use
remote databases -- the configuration from the setup.remote
file
does not seem to be read in correctly even when the command line
arguments state to use remoteDB1
and remoteDB2
. This is tracked
as Issue #4 in the SMCQL
repo.
In other words, the files are used as follows:
- In the local experiment:
sample-conf/broker/connections/localhosts
tells the broker connection information for the local site hosts.
- In the remote experiment:
sample-conf/broker/connections/localhosts
tells the broker connection information for the remote site hosts
sample-conf/broker/connections/remote-hosts
is not used.
[1] SMCQL: Secure Query Processing for Private Data Networks, Johes Bater, Gregory Elliott, Craig Eggen, Satyender Goel, Abel N Kho, and Jennie Rogers, Proceedings for VLDB Endow, 10(7) pp.673-684.