# Secure Data Disclosure on Kubernetes: Deployment and Server Administration

This notebook shows, step by step, how a data owner can set up the service on a Kubernetes cluster, add their data, and make it available to selected users.

## Deploying the service

### Building the server image
The SDD service consists of a FastAPI server and a MongoDB database that stores state and administration data. The database image is public, but the server image must first be built and pushed to a registry.

NOTE: For now, the server configuration file is copied into the server container at build time. This is neither practical nor safe, since the configuration file contains passwords and secrets, and it will be updated in future versions. The file that is copied is `config/example_config.yaml`. To change the server configuration, one has to edit this file, then rebuild and push the server image.

In [None]:
# !docker login (=> use personal token from dockerhub, has to be done only once)

!cd .. && docker build --target sdd_server_prod -t <your_registry>/sdd_server_prod:latest .
!cd .. && docker push <your_registry>/sdd_server_prod:latest

# Start of DEMO

In [1]:
URL = 'https://sdd-demo.lab.sspcloud.fr/'

### Deploying the service Helm chart
We use a Helm chart to deploy the service on a Kubernetes cluster. The sdd-server chart is located at `deploy/helm/charts/sdd_server`, so let us change our working directory to this location.

In [90]:
import os
os.chdir('../deploy/helm/charts/sdd_server')

The `values.yaml` file contains all the configuration values for the service. We must now set the `image.repository` field to the repository we pushed the server image to. One can also change the URL at which the service is published with `ingress.hosts[0].host` (or disable this feature by setting `ingress.enabled` to `false`).

    => Update `values.yaml` file
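
As an alternative to editing `values.yaml` by hand, Helm's `--set` flag can override individual values at install time. The sketch below only builds such a command line; the registry and host names are placeholders, not values from this deployment.

```python
import shlex

def helm_install_command(release, chart_dir=".", values_file="values.yaml", overrides=None):
    """Build a `helm install` argument list with optional --set overrides."""
    command = ["helm", "install", "-f", values_file]
    for key, value in (overrides or {}).items():
        command += ["--set", f"{key}={value}"]
    command += [release, chart_dir]
    return command

overrides = {
    "image.repository": "registry.example.com/sdd_server_prod",  # placeholder registry
    "ingress.hosts[0].host": "sdd.example.com",                   # placeholder host
}
command = helm_install_command("sdd-service", overrides=overrides)

# shlex.join quotes arguments (such as the bracketed key) for safe pasting into a shell.
print(shlex.join(command))
```

Values given with `--set` take precedence over those in the file passed with `-f`, so the file can keep the defaults.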

As previously stated, the service is made up of a server and a MongoDB database. Before installing the chart, we must thus first download that dependency.

In [91]:
!helm dependency update

Saving 1 charts
Downloading mongodb from repo oci://registry-1.docker.io/bitnamicharts
Pulled: registry-1.docker.io/bitnamicharts/mongodb:13.18.1
Digest: sha256:f3b2a691537260044746bc4a8898e9ae68e8c29864639737b6da920f99aebe97
Deleting outdated charts


Now the chart is ready to be installed, so let the magic happen!

In [92]:
!helm install -f values.yaml sdd-service .

NAME: sdd-service
LAST DEPLOYED: Mon Sep 18 07:49:24 2023
NAMESPACE: user-paulineml
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
  https://sdd-demo.lab.sspcloud.fr/


The installation notes show the URL at which the server is exposed. One can have a look at the API documentation by visiting `<server_url>/docs`.

One can also check whether the service started without errors by running `kubectl get all` and by inspecting the server logs with `kubectl logs <server-pod-name>`.
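
The documentation page can also serve as a rough reachability check from Python. The following is a minimal sketch using only the standard library; the helper names `docs_url` and `is_up` are made up for this example, and a 200 response on `/docs` only shows that the server answers, not that the database is reachable.

```python
import urllib.error
import urllib.request

def docs_url(server_url):
    """Return the documentation URL for a given server URL."""
    return server_url.rstrip("/") + "/docs"

def is_up(server_url, timeout=5.0):
    """Return True if the documentation page answers with HTTP 200."""
    try:
        with urllib.request.urlopen(docs_url(server_url), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False

print(docs_url("https://sdd-demo.lab.sspcloud.fr/"))
```

Calling `is_up(URL)` with the `URL` defined earlier would then return `True` once the ingress and the server pod are ready.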

## Administering the service by accessing MongoDB

Let's switch directory again and move to the administration script.

In [93]:
import os
os.chdir('../../../../src/')

Let's add a formatting function to have more readable outputs.

In [94]:
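from ast import literal_eval
import subprocess

def run(command, to_dict=False):
    """Run a mongodb_admin.py subcommand, then print or parse its output."""
    command = f"python mongodb_admin.py {command}"
    completed_process = subprocess.run(command, shell=True, text=True, capture_output=True)
    # Show the error stream on failure instead of silently returning nothing.
    if completed_process.returncode != 0:
        print(completed_process.stderr)
    output = completed_process.stdout
    if to_dict:
        # The script prints Python literals, so parse them back into objects.
        return literal_eval(output)
    # Turn escaped "\n" sequences into real line breaks for readability.
    print(output.rstrip('\n').replace(r'\n', '\n'))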

We should now have the required environment to interact with the admin database.

### Preparing the database

You can list all the options offered by the administration script by running the command `python mongodb_admin.py --help`. We will go through each of them in the rest of the notebook.

In [95]:
run("--help") # !python mongodb_admin.py --help

usage: MongoDB administration script for the user database [-h]
                                                           {add_user,add_user_with_budget,del_user,add_dataset_to_user,del_dataset_to_user,set_budget_field,set_may_query,show_user,create_users_collection,add_dataset,add_datasets,drop_collection,show_collection}
                                                           ...

options:
  -h, --help            show this help message and exit

subcommands:
  {add_user,add_user_with_budget,del_user,add_dataset_to_user,del_dataset_to_user,set_budget_field,set_may_query,show_user,create_users_collection,add_dataset,add_datasets,drop_collection,show_collection}
                        user database administration operations
    add_user            add user to users collection
    add_user_with_budget
                        add user with budget to users collection
    del_user            delete user from users collection
    add_dataset_to_user
                        add dataset w

Let's first make sure the database is empty and in a clean state.

In [96]:
run("drop_collection --collection datasets")  # !python mongodb_admin.py drop_collection --collection datasets

Deleted collection datasets.


In [97]:
run("drop_collection --collection metadata")  # !python mongodb_admin.py drop_collection --collection metadata

Deleted collection metadata.


In [98]:
run("drop_collection --collection users")     # !python mongodb_admin.py drop_collection --collection users

Deleted collection users.


### Datasets (add and drop)

We first need to set the dataset meta-information. For each dataset, two pieces of information are required:
- the type of database in which the dataset is stored
- a path to the metadata of the dataset (stored as a yaml file).

Metadata are required to later run queries on the dataset. The server expects the metadata in the [SmartnoiseSQL dictionary format](https://docs.smartnoise.org/sql/metadata.html#dictionary-format), which describes, among other things, all the available columns, their types and their bounds (see the Smartnoise page for more details). The metadata must also be stored in a `yaml` file.

This information (dataset name, database type and metadata path) is stored in the `datasets` collection. Then, for each dataset, the metadata is read from its `yaml` file and stored in a collection named `metadata`.

We then check that there is indeed no data in the dataset and metadata collections yet:

In [99]:
run("show_collection --collection datasets")

[]


In [100]:
run("show_collection --collection metadata")

[]


#### Add one dataset

We can add **one dataset** by giving its name, its database type and the path to its metadata file:

In [101]:
run("add_dataset -d PENGUIN -db CONSTANT_PATH_DB -mp collections/metadata/penguin_metadata.yaml")

Added dataset PENGUIN with database CONSTANT_PATH_DB and metadata from collections/metadata/penguin_metadata.yaml.


We can now see the Penguin dataset in the `datasets` and `metadata` collections:

In [102]:
run("show_collection --collection datasets", to_dict = True)

[{'dataset_name': 'PENGUIN',
  'database_type': 'CONSTANT_PATH_DB',
  'metadata_path': 'collections/metadata/penguin_metadata.yaml'}]

In [103]:
run("show_collection --collection metadata", to_dict = True)

[{'PENGUIN': {'': {'Schema': {'Table': {'max_ids': 1,
      'row_privacy': True,
      'species': {'type': 'string'},
      'island': {'type': 'string'},
      'bill_length_mm': {'type': 'float', 'lower': 30.0, 'upper': 65.0},
      'bill_depth_mm': {'type': 'float', 'lower': 13.0, 'upper': 23.0},
      'flipper_length_mm': {'type': 'float', 'lower': 150.0, 'upper': 250.0},
      'body_mass_g': {'type': 'float', 'lower': 2000.0, 'upper': 7000.0},
      'sex': {'type': 'string'}}}},
   'engine': 'csv'}}]
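
Since the bounds of numeric columns matter for later queries, it can be useful to check a metadata dictionary before loading it. The sketch below assumes the nested layout shown in the output above (empty key, then schema, then table, then columns); the function name is hypothetical, and the sample deliberately omits one `upper` bound to show what the check reports.

```python
def columns_missing_bounds(metadata):
    """Return the numeric columns that lack a 'lower' or 'upper' bound."""
    missing = []
    for schema in metadata[""].values():        # e.g. the 'Schema' level
        for table in schema.values():           # e.g. the 'Table' level
            for name, spec in table.items():
                if not isinstance(spec, dict):  # skip table options such as max_ids
                    continue
                if spec.get("type") in ("int", "float"):
                    if "lower" not in spec or "upper" not in spec:
                        missing.append(name)
    return missing

# Shortened sample; body_mass_g has no 'upper' bound on purpose.
sample = {
    "": {"Schema": {"Table": {
        "max_ids": 1,
        "row_privacy": True,
        "species": {"type": "string"},
        "bill_length_mm": {"type": "float", "lower": 30.0, "upper": 65.0},
        "body_mass_g": {"type": "float", "lower": 2000.0},
    }}},
    "engine": "csv",
}
print(columns_missing_bounds(sample))  # prints ['body_mass_g']
```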

#### Add multiple datasets

Alternatively, we can pass the path to a yaml file that contains all this information and add **multiple datasets** in one command:

In [104]:
run("add_datasets --path collections/dataset_collection.yaml")

Added datasets collection from yaml at collections/dataset_collection.yaml. 
Added metadata of IRIS dataset. 
Added metadata of PENGUIN dataset. 
Added metadata of RANDOM dataset. 


Let's look at the whole `datasets` collection:

In [105]:
run("show_collection --collection datasets", to_dict = True)

[{'dataset_name': 'IRIS',
  'database_type': 'CONSTANT_PATH_DB',
  'dataset_path': 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv',
  'metadata_path': 'collections/metadata/iris_metadata.yaml'},
 {'dataset_name': 'PENGUIN',
  'database_type': 'CONSTANT_PATH_DB',
  'dataset_path': 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv',
  'metadata_path': 'collections/metadata/penguin_metadata.yaml'},
 {'dataset_name': 'RANDOM',
  'database_type': 'S3Database',
  's3_bucket': 'my_bucket',
  's3_key': 'folder/my_data.csv',
  'metadata_path': 'collections/metadata/penguin_metadata.yaml'}]
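
The listing suggests that each database type comes with its own location fields. The sketch below checks entries against such a rule. The per-type requirements are an assumption read off this listing, not a documented schema (the entry created earlier with `add_dataset` had no `dataset_path`, for instance), so treat the table as illustrative.

```python
# Assumed field requirements, inferred from the listing above.
COMMON_FIELDS = {"dataset_name", "database_type", "metadata_path"}
TYPE_FIELDS = {
    "CONSTANT_PATH_DB": {"dataset_path"},
    "S3Database": {"s3_bucket", "s3_key"},
}

def missing_fields(entry):
    """Return the sorted list of expected fields absent from a dataset entry."""
    required = COMMON_FIELDS | TYPE_FIELDS.get(entry.get("database_type"), set())
    return sorted(required - set(entry))

entries = [
    {"dataset_name": "RANDOM", "database_type": "S3Database",
     "s3_bucket": "my_bucket", "s3_key": "folder/my_data.csv",
     "metadata_path": "collections/metadata/penguin_metadata.yaml"},
    {"dataset_name": "PENGUIN", "database_type": "CONSTANT_PATH_DB",
     "metadata_path": "collections/metadata/penguin_metadata.yaml"},
]
for entry in entries:
    print(entry["dataset_name"], missing_fields(entry))
```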

Finally, let's have a look at the stored metadata:

In [106]:
run("show_collection --collection metadata", to_dict = True)

[{'IRIS': {'': {'Schema': {'Table': {'max_ids': 1,
      'petal_length': {'type': 'float', 'lower': 0.5, 'upper': 10.0},
      'petal_width': {'type': 'float', 'lower': 0.05, 'upper': 5.0},
      'row_privacy': True,
      'sepal_length': {'type': 'float', 'lower': 2.0, 'upper': 10.0},
      'sepal_width': {'type': 'float', 'lower': 1.0, 'upper': 6.0},
      'species': {'type': 'string'}}}},
   'engine': 'csv'}},
 {'PENGUIN': {'': {'Schema': {'Table': {'max_ids': 1,
      'row_privacy': True,
      'species': {'type': 'string'},
      'island': {'type': 'string'},
      'bill_length_mm': {'type': 'float', 'lower': 30.0, 'upper': 65.0},
      'bill_depth_mm': {'type': 'float', 'lower': 13.0, 'upper': 23.0},
      'flipper_length_mm': {'type': 'float', 'lower': 150.0, 'upper': 250.0},
      'body_mass_g': {'type': 'float', 'lower': 2000.0, 'upper': 7000.0},
      'sex': {'type': 'string'}}}},
   'engine': 'csv'}},
 {'RANDOM': {'': {'Schema': {'Table': {'max_ids': 1,
      'row_privacy': 

### Users

#### Adding users
Let's see which users are already loaded:

In [107]:
run("show_collection --collection users")

[]


And now let's add a few users.

In [108]:
run("add_user_with_budget --user 'Mrs. Daisy' --dataset 'IRIS' --epsilon 10.0 --delta 0.001")

Added access to user Mrs. Daisy with dataset IRIS, budget epsilon 10.0 and delta 0.001.


In [109]:
run("add_user_with_budget --user 'Dr. Antartica' --dataset 'PENGUIN' --epsilon 10.0 --delta 0.002")

Added access to user Dr. Antartica with dataset PENGUIN, budget epsilon 10.0 and delta 0.002.


In [22]:
run("add_user_with_budget --user 'Lord McFreeze' --dataset 'PENGUIN' --epsilon 10.0 --delta 0.001")

Added access to user Lord McFreeze with dataset PENGUIN, budget epsilon 10.0 and delta 0.001.


In [23]:
run("add_dataset_to_user --user 'Lord McFreeze' --dataset 'IRIS' --epsilon 5.0 --delta 0.005")

Added access to dataset IRIS to user Lord McFreeze with budget epsilon 5.0 and delta 0.005.
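
When user names come from a list rather than being typed by hand, quoting them correctly for the shell matters because the `run` helper uses `shell=True`. The following sketch builds the argument string with `shlex.quote`; the helper name is made up for this example, and only the flags already used above are assumed.

```python
import shlex

def add_user_with_budget_args(user, dataset, epsilon, delta):
    """Build a shell-safe argument string for the add_user_with_budget subcommand."""
    return (
        f"add_user_with_budget --user {shlex.quote(user)} "
        f"--dataset {shlex.quote(dataset)} --epsilon {epsilon} --delta {delta}"
    )

users = [
    ("Mrs. Daisy", "IRIS", 10.0, 0.001),
    ("Dr. Antartica", "PENGUIN", 10.0, 0.002),
]
for user, dataset, epsilon, delta in users:
    print(add_user_with_budget_args(user, dataset, epsilon, delta))
```

Each resulting string can be passed to `run(...)` in place of the hand-written commands, and names containing apostrophes or other special characters are escaped automatically.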


We can also modify the total budget of an existing user:

In [24]:
run("set_budget_field --user 'Dr. Antartica' --dataset 'PENGUIN' --field initial_epsilon --value 20.0")

Set budget of Dr. Antartica for dataset PENGUIN of initial_epsilon to 20.0.


Let's see the current state of the database:

In [110]:
run("show_collection --collection users", to_dict = True)

[{'user_name': 'Mrs. Daisy',
  'may_query': True,
  'datasets_list': [{'dataset_name': 'IRIS',
    'initial_epsilon': 10.0,
    'initial_delta': 0.001,
    'total_spent_epsilon': 0.0,
    'total_spent_delta': 0.0}]},
 {'user_name': 'Dr. Antartica',
  'may_query': True,
  'datasets_list': [{'dataset_name': 'PENGUIN',
    'initial_epsilon': 10.0,
    'initial_delta': 0.002,
    'total_spent_epsilon': 0.0,
    'total_spent_delta': 0.0}]}]

Do not hesitate to re-run this command after every other command to ensure that everything runs as expected.

#### Removing users

Suppose a user is suspected of misusing the service. While this is being investigated, you can suspend his right to query:

In [26]:
run("set_may_query --user 'Lord McFreeze' --value False")

Set user Lord McFreeze may query.


Now he won't be able to run any query (unless you re-run the command with `--value True`).
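
To make the effect of the flag concrete, here is a minimal sketch of the kind of check such a flag enables, written against the user document layout shown by `show_collection`. It is not the server's actual logic, and the function name is hypothetical.

```python
def is_allowed(user, dataset_name):
    """Return True if the user may query and has an entry for the dataset."""
    if not user.get("may_query", False):
        return False
    return any(
        dataset["dataset_name"] == dataset_name
        for dataset in user.get("datasets_list", [])
    )

# Illustrative document mirroring the structure of the users collection.
suspended = {
    "user_name": "Lord McFreeze",
    "may_query": False,
    "datasets_list": [{"dataset_name": "PENGUIN"}, {"dataset_name": "IRIS"}],
}
print(is_allowed(suspended, "PENGUIN"))  # prints False
```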

A few days have passed and the investigation reveals that he was aiming to do unethical research. You can remove his access to a dataset with:

In [28]:
run("del_dataset_to_user --user 'Lord McFreeze' --dataset 'PENGUIN'")

Remove access to dataset PENGUIN from user Lord McFreeze.


Or delete him completely from the database:

In [29]:
run("del_user --user 'Lord McFreeze'")

Deleted user Lord McFreeze.


Let's see the resulting users:

In [30]:
run("show_collection --collection users", to_dict = True)

[{'user_name': 'Mrs. Daisy',
  'may_query': True,
  'datasets_list': [{'dataset_name': 'IRIS',
    'initial_epsilon': 10.0,
    'initial_delta': 0.001,
    'total_spent_epsilon': 0.0,
    'total_spent_delta': 0.0}]},
 {'user_name': 'Dr. Antartica',
  'may_query': True,
  'datasets_list': [{'dataset_name': 'PENGUIN',
    'initial_epsilon': 20.0,
    'initial_delta': 0.002,
    'total_spent_epsilon': 0.0,
    'total_spent_delta': 0.0}]}]

### Finally, many users can actually be loaded directly from a single file

Let's delete the existing user collection first:

In [111]:
run("drop_collection --collection users")
run("show_collection --collection users")

Deleted collection users.
[]


We add the data based on a yaml file:

In [112]:
run("create_users_collection --path collections/user_collection.yaml")

Added user data from yaml at collections/user_collection.yaml.


And let's see the resulting collection:

In [113]:
run("show_collection --collection users", to_dict = True)

[{'user_name': 'Alice',
  'may_query': True,
  'datasets_list': [{'dataset_name': 'IRIS',
    'initial_epsilon': 10,
    'initial_delta': 0.0001,
    'total_spent_epsilon': 1,
    'total_spent_delta': 1e-06},
   {'dataset_name': 'PENGUIN',
    'initial_epsilon': 5,
    'initial_delta': 0.0005,
    'total_spent_epsilon': 0.2,
    'total_spent_delta': 1e-07}]},
 {'user_name': 'Dr. Antartica',
  'may_query': True,
  'datasets_list': [{'dataset_name': 'PENGUIN',
    'initial_epsilon': 45,
    'initial_delta': 0.005,
    'total_spent_epsilon': 0,
    'total_spent_delta': 0}]},
 {'user_name': 'Bob',
  'may_query': True,
  'datasets_list': [{'dataset_name': 'IRIS',
    'initial_epsilon': 10,
    'initial_delta': 0.0001,
    'total_spent_epsilon': 0,
    'total_spent_delta': 0}]}]
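
Each dataset entry stores both the initial and the spent budget, so the remaining budget can be derived by subtraction. The sketch below does this for one user document copied from the output above; the function name is an assumption made for this example.

```python
def remaining_budget(user):
    """Map each dataset name to its (remaining epsilon, remaining delta)."""
    return {
        dataset["dataset_name"]: (
            dataset["initial_epsilon"] - dataset["total_spent_epsilon"],
            dataset["initial_delta"] - dataset["total_spent_delta"],
        )
        for dataset in user["datasets_list"]
    }

alice = {
    "user_name": "Alice",
    "may_query": True,
    "datasets_list": [
        {"dataset_name": "IRIS", "initial_epsilon": 10, "initial_delta": 0.0001,
         "total_spent_epsilon": 1, "total_spent_delta": 1e-06},
        {"dataset_name": "PENGUIN", "initial_epsilon": 5, "initial_delta": 0.0005,
         "total_spent_epsilon": 0.2, "total_spent_delta": 1e-07},
    ],
}
for name, (epsilon, delta) in remaining_budget(alice).items():
    print(f"{name}: epsilon={epsilon:g}, delta={delta:g}")
```

With `to_dict=True`, the list returned by `run("show_collection --collection users", to_dict=True)` could be passed through this function one user at a time.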

## Archives of queries

In [115]:
run("show_collection --collection queries_archives", to_dict = True)

[{'user_name': 'Dr. Antartica',
  'dataset_name': 'PENGUIN',
  'epsilon': 0.6000000000000001,
  'delta': 1.4999949999983109e-05,
  'query': 'SELECT AVG(bill_length_mm) AS avg_bill_length_mm FROM Schema.Table'},
 {'user_name': 'Dr. Antartica',
  'dataset_name': 'PENGUIN',
  'epsilon': 2.0,
  'delta': 1.4999949999983109e-05,
  'query': 'SELECT COUNT(*) AS nb_penguin, STD(bill_length_mm) AS std_bill_length_mm FROM Schema.Table'},
 {'user_name': 'Dr. Antartica',
  'dataset_name': 'PENGUIN',
  'epsilon': 40.0,
  'delta': 0.00014999500000001387,
  'query': 'SELECT COUNT(*) AS nb_penguin,          species,         AVG(bill_length_mm) AS avg_bill_length_mm,         STD(bill_length_mm) AS std_bill_length_mm         FROM Schema.Table  GROUP BY species'}]
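
The archive makes it possible to audit how much budget each user has consumed per dataset. The sketch below simply adds up the archived `epsilon` and `delta` values; whether the server accounts for spending by plain summation is an assumption, and the queries are left out of the sample for brevity.

```python
from collections import defaultdict

def spent_per_user_dataset(archives):
    """Sum archived epsilon and delta for each (user, dataset) pair."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for entry in archives:
        key = (entry["user_name"], entry["dataset_name"])
        totals[key][0] += entry["epsilon"]
        totals[key][1] += entry["delta"]
    return {key: tuple(value) for key, value in totals.items()}

# Values copied from the archive listing above (query text omitted).
archives = [
    {"user_name": "Dr. Antartica", "dataset_name": "PENGUIN",
     "epsilon": 0.6000000000000001, "delta": 1.4999949999983109e-05},
    {"user_name": "Dr. Antartica", "dataset_name": "PENGUIN",
     "epsilon": 2.0, "delta": 1.4999949999983109e-05},
    {"user_name": "Dr. Antartica", "dataset_name": "PENGUIN",
     "epsilon": 40.0, "delta": 0.00014999500000001387},
]
for (user, dataset), (epsilon, delta) in spent_per_user_dataset(archives).items():
    print(f"{user} / {dataset}: epsilon={epsilon:.2f}, delta={delta:.2e}")
```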

## Stopping the service: Let's not do it right now!

To tear down the service, we simply execute the command `helm uninstall sdd-service`:

In [116]:
!helm uninstall sdd-service

release "sdd-service" uninstalled
