# Introduction to Smartnoise-SQL

[Smartnoise-SQL](https://docs.smartnoise.org/sql/index.html) is a Python library for running differentially private SQL queries.

SmartNoise is intended for scenarios where the analyst is trusted by the data owner.

## Step 1: Install the Library

Smartnoise-sql is available on PyPI and can be installed with `pip`. We will use version 1.0.6, the latest version of the library at the time of writing.

In [2]:
!pip install smartnoise-sql==1.0.6

Defaulting to user installation because normal site-packages is not writeable
Collecting smartnoise-sql==1.0.6
  Downloading smartnoise_sql-1.0.6-py3-none-any.whl.metadata (9.6 kB)
Collecting antlr4-python3-runtime==4.9.3 (from smartnoise-sql==1.0.6)
  Downloading antlr4-python3-runtime-4.9.3.tar.gz (117 kB)
  Preparing metadata (setup.py) ... done
Collecting graphviz<1.0,>=0.17 (from smartnoise-sql==1.0.6)
  Downloading graphviz-0.21-py3-none-any.whl.metadata (12 kB)
Collecting opendp<0.13.0,>=0.8.0 (from smartnoise-sql==1.0.6)
  Downloading opendp-0.12.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.1 kB)
Collecting sqlalchemy<3.0.0,>=2.0.0 (from smartnoise-sql==1.0.6)
  Downloading sqlalchemy-2.0.43-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting deprecated (from opendp<0.13.0,>=0.8.0->smartnoise-sql==1.0.6)
  Downloading Deprecated-1.2.18-py2.py3-none-any.whl.metadata (5.7 kB)
Collecting wrapt<2,>=1.10 (from

For this notebook, we will also use the `pandas` library, one of the main Python libraries for working with tabular data. We also install it via `pip`.

In [6]:
!pip install pandas==2.2.3

Defaulting to user installation because normal site-packages is not writeable
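
After installation, it can be useful to confirm that the pinned versions are the ones actually visible to the notebook kernel. The following is a minimal sketch using only the standard library; the helper name `installed_version` is an assumption made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Versions pinned in the install cells of this notebook.
pinned = {"smartnoise-sql": "1.0.6", "pandas": "2.2.3"}

for package, expected in pinned.items():
    found = installed_version(package)
    status = "ok" if found == expected else f"expected {expected}"
    print(f"{package}: {found} ({status})")
```

If a package is reported as `None`, the kernel probably runs in a different environment than the one `pip` installed into.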


## Step 2: Load and Prepare Data

In this notebook, we will work with the [penguin dataset](https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv) from [seaborn datasets](https://github.com/mwaskom/seaborn-data).
We load the dataset with pandas into a DataFrame `df`.

In [8]:
import pandas as pd

In [9]:
path_to_data = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(path_to_data)

We can look at the first rows of the dataframe to get to know the data:

In [10]:
df.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |


We see that there are 7 columns: 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g' and 'sex' with various data types.

## Step 3: Prepare Analysis with Smartnoise-SQL

Before running a query, `smartnoise-sql` requires a reader object [(see doc here)](https://docs.smartnoise.org/sql/api/index.html#snsql.connect.from_df). When working with a pandas DataFrame, the reader is created with `from_df`, which takes three parameters:
- df: The Pandas DataFrame to be queried (which we loaded in step 2)
- privacy: A Privacy object with the desired privacy parameters (we instantiate it in step 3.1)
- metadata: The metadata describing the data source (we instantiate it in step 3.2)

### Step 3.1: Privacy object

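The `Privacy` object [(see doc)](https://docs.smartnoise.org/sql/api/index.html#privacy) sets the privacy budget used by queries. The budget is specified with an $\epsilon$ and a $\delta$, as in approximate differential privacy: a mechanism $M$ is $(\epsilon, \delta)$-differentially private if, for all datasets $S$ and $S'$ that differ in a single record and for every set of outputs $O$,
$$
\Pr[M(S) \in O] \leq e^{\epsilon} \Pr[M(S') \in O] + \delta
$$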

We select $\epsilon=0.1$ and $\delta=0.0001$.
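
To make the definition concrete, here is a minimal sketch, independent of `smartnoise-sql`, that checks the inequality numerically for a Laplace mechanism applied to a count. The function names and the example counts are illustrative assumptions. The Laplace mechanism satisfies the definition with $\delta = 0$, so the ratio of output densities on two neighbouring datasets should never exceed $e^{\epsilon}$.

```python
import math

EPSILON = 0.1


def laplace_pdf(x, mean, scale):
    """Density of a Laplace distribution centred on `mean`."""
    return math.exp(-abs(x - mean) / scale) / (2 * scale)


def worst_density_ratio(count_s, count_s_prime, epsilon, outputs):
    """Largest ratio between the output densities on two neighbouring datasets."""
    # A count changes by at most 1 when one record is added or removed,
    # so the Laplace noise scale is 1 / epsilon.
    scale = 1.0 / epsilon
    return max(
        laplace_pdf(o, count_s, scale) / laplace_pdf(o, count_s_prime, scale)
        for o in outputs
    )


# Hypothetical neighbouring datasets: the true count is 100 on S and 99 on S'.
outputs = [50 + 0.5 * i for i in range(201)]
ratio = worst_density_ratio(100, 99, EPSILON, outputs)

# The small tolerance only absorbs floating-point rounding.
print(ratio <= math.exp(EPSILON) + 1e-9)
```

A smaller $\epsilon$ forces the two output distributions closer together, which means more noise and stronger privacy.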

Optionally, the `Privacy` object can also represent desired accuracy bounds or specify mechanisms for certain statistics, but this is out of scope for this notebook.

In [1]:
from snsql import Privacy

In [18]:
# TODO: fill epsilon and delta values
# EPSILON = ...
# DELTA = ...

# Correction
EPSILON = 0.1
DELTA = 1/10000

In [19]:
privacy = Privacy(epsilon=EPSILON, delta=DELTA)

### Step 3.2: Prepare the metadata

Next, we prepare the metadata. The expected format is explained [here](https://docs.smartnoise.org/sql/metadata.html#metadata) in the `smartnoise-sql` documentation. It can be provided in different formats, such as an external `yaml` file or a dictionary. In this notebook, we will use the [dictionary format](https://docs.smartnoise.org/sql/metadata.html#dictionary-format).

There are `Table Options` and `Column Options`. 
- `Table Options` apply to the whole table and can further configure queries. They have predetermined default values and should only be overridden with caution. For now, we will keep the defaults.
- `Column Options` are compulsory and describe the table column by column.
    - Each column must have exactly the same name in the metadata as in the table.
    - Each column needs a `type`, which indicates the type of all values in the column (one of 'int', 'float', 'str', 'boolean', or 'datetime').
    - Numeric columns ('int', 'float') should additionally have `lower` and `upper` bounds, meaning the minimum and maximum theoretically possible values for the column. These bounds matter because they are used to compute the sensitivity and hence to calibrate the differentially private noise.
    - Optionally, a boolean `nullable` can be set to `False` if the user knows that the column contains no `null` values. By default it is `True`, meaning that the column may contain `null` values.
    - Other options are possible but won't be treated in this notebook.
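
The role of `lower` and `upper` can be illustrated with a rough sketch. It assumes a Laplace mechanism applied to a clamped sum, where adding or removing one row changes the result by at most $\max(|lower|, |upper|)$. The helper names are hypothetical, and the mechanisms actually selected by `smartnoise-sql` may differ.

```python
def clamp(value, lower, upper):
    """Force a value into [lower, upper], as done before a bounded aggregation."""
    return max(lower, min(upper, value))


def sum_sensitivity(lower, upper):
    """Largest change one row can cause in a sum of clamped values."""
    return max(abs(lower), abs(upper))


def laplace_scale(lower, upper, epsilon):
    """Scale of the Laplace noise for a clamped sum (assumed mechanism)."""
    return sum_sensitivity(lower, upper) / epsilon


# Bounds used in this notebook for body_mass_g versus deliberately loose bounds.
tight = laplace_scale(2000.0, 7000.0, epsilon=0.1)
loose = laplace_scale(0.0, 100000.0, epsilon=0.1)

print(f"noise scale with tight bounds: {tight:.0f}")
print(f"noise scale with loose bounds: {loose:.0f}")
```

Under these assumptions, wider bounds translate directly into more noise, which is why the bounds should be as tight as domain knowledge allows without being derived from the private data itself.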

We look at the dataset again to determine the types:

In [17]:
df.head(1)

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |


In [26]:
species_col = {'type': 'str', 'nullable': False}
island_col = {'type': 'str', 'nullable': False}
bill_length_col = {'type': 'float', 'lower': 30.0, 'upper': 65.0}
bill_depth_col = {'type': 'float', 'lower': 13.0, 'upper': 23.0}
flipper_length_col = {'type': 'float', 'lower': 150.0, 'upper': 250.0}

In [27]:
# TODO: Fill in the body_mass_g and sex column metadata, knowing that these species of penguins typically weigh between 2000.0 and 7000.0 grams. We cannot say for sure that there are no nulls in these columns.
# body_mass_g_col = ...
# sex_col = ...

# Correction
body_mass_g_col = {'type': 'float', 'lower': 2000.0, 'upper': 7000.0}
sex_col = {'type': 'str'}

In [28]:
# 'str' denotes a character string and 'float' a decimal number.
metadata_columns = {
    'species': species_col,
    'island': island_col,
    'bill_length_mm': bill_length_col,
    'bill_depth_mm': bill_depth_col,
    'flipper_length_mm': flipper_length_col,
    'body_mass_g': body_mass_g_col,
    'sex': sex_col, 
}
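
Because the column names must match exactly and numeric columns need bounds, a quick sanity check before building the reader can catch typos early. The sketch below is not part of `smartnoise-sql`; `check_metadata` is a hypothetical helper that only encodes the rules listed in this step.

```python
def check_metadata(column_names, metadata_columns):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    for name in column_names:
        if name not in metadata_columns:
            problems.append(f"'{name}' is in the table but not in the metadata")
    for name, options in metadata_columns.items():
        if name not in column_names:
            problems.append(f"'{name}' is in the metadata but not in the table")
        if "type" not in options:
            problems.append(f"'{name}' has no type")
        elif options["type"] in ("int", "float"):
            if "lower" not in options or "upper" not in options:
                problems.append(f"'{name}' is numeric but lacks lower/upper bounds")
            elif options["lower"] > options["upper"]:
                problems.append(f"'{name}' has lower > upper")
    return problems


# Hypothetical two-column example with a misspelled name and a missing bound.
example_columns = ["species", "body_mass_g"]
example_metadata = {
    "species": {"type": "str", "nullable": False},
    "body_mas_g": {"type": "float", "lower": 2000.0},
}

for problem in check_metadata(example_columns, example_metadata):
    print(problem)
```

In this notebook, one would call `check_metadata(list(df.columns), metadata_columns)` and expect an empty list.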

### Step 3.3: Instantiate the reader

All arguments are now available to create the reader object mentioned at the beginning of step 3.

In [None]:
# TODO: Instantiate the reader object
from snsql import from_df

reader = from_df(df=..., metadata=..., privacy=privacy)

## Step 4: Differentially Private Dataset Query