# Introduction to DiffPrivLib

[DiffPrivLib](https://diffprivlib.readthedocs.io/en/latest/index.html) is a Python library dedicated to differential privacy and machine learning. It is based on the `scikit-learn` library.

Some other [introduction notebooks](https://github.com/IBM/differential-privacy-library/tree/main/notebooks) are available directly in the official library repository.

## Step 1: Install the Library

DiffPrivLib is available on PyPI and can be installed with the `pip` command. We will use the latest version of the library to date: version 0.6.6.

In [3]:
!pip install diffprivlib==0.6.6

Defaulting to user installation because normal site-packages is not writeable
Collecting diffprivlib==0.6.6
  Downloading diffprivlib-0.6.6-py3-none-any.whl.metadata (10 kB)
Downloading diffprivlib-0.6.6-py3-none-any.whl (176 kB)
Installing collected packages: diffprivlib
Successfully installed diffprivlib-0.6.6
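
To confirm which version ended up in the environment, a minimal check using only the standard library might look like the sketch below. The helper name `installed_version` is ours and is not part of DiffPrivLib.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Should report 0.6.6 if the pip command above succeeded in this environment.
print("diffprivlib version:", installed_version("diffprivlib"))
```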


## Step 2: Load and Prepare Data

### Load penguin dataset

In this notebook, we will work with the [penguin dataset](https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv) from [seaborn datasets](https://github.com/mwaskom/seaborn-data).
We load the dataset with pandas into a dataframe `df`.

In [6]:
import pandas as pd

In [15]:
path_to_data = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(path_to_data)

We can look at the first rows of the dataframe to get to know the data:

In [16]:
df.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |


### Handle null values

DiffPrivLib does not allow null values, so we have to remove or convert them. For simplicity, we will just drop the rows with null values from the dataset.

In [17]:
print(f"{df.shape[0]} rows before dropping nulls")
df = df.dropna()
print(f"{df.shape[0]} rows after dropping nulls")

344 rows before dropping nulls
333 rows after dropping nulls
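
Before dropping rows, it can help to see which columns actually contain nulls. The sketch below uses a small hand-made dataframe (`toy_df`, with hypothetical values that only mimic some penguin columns) so that it runs without downloading anything; `isna`, `sum` and `dropna` are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same kind of gaps as the penguin dataset.
toy_df = pd.DataFrame(
    {
        "bill_length_mm": [39.1, np.nan, 40.3],
        "body_mass_g": [3750.0, np.nan, 3250.0],
        "sex": ["MALE", None, "FEMALE"],
    }
)

# Number of null values in each column.
null_counts = toy_df.isna().sum()
print(null_counts)

# Drop every row that has at least one null value.
toy_clean = toy_df.dropna()
print(f"{len(toy_df)} rows before, {len(toy_clean)} rows after")
```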


### Encode columns for Machine Learning

In the following analysis, we will use the `sex` column as a feature column. We encode the `MALE` and `FEMALE` strings as numbers that the models will be able to use.

In [33]:
df["sex"] = df["sex"].map({"MALE": 0, "FEMALE": 1})

In [34]:
df.head(2)

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | 0 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | 1 |
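
`Series.map` with a dictionary turns any value that is missing from the dictionary into `NaN`, which would reintroduce nulls. A small defensive check, sketched here on a hypothetical toy series containing one unexpected label:

```python
import pandas as pd

# Hypothetical toy column with an unexpected label ("UNKNOWN").
toy_sex = pd.Series(["MALE", "FEMALE", "UNKNOWN"])
encoding = {"MALE": 0, "FEMALE": 1}

encoded = toy_sex.map(encoding)

# Labels absent from the mapping become NaN after map().
unmapped = toy_sex[encoded.isna()].unique().tolist()
print("unmapped labels:", unmapped)
```

If the list is not empty, the mapping should be extended or the affected rows handled before training.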


## Step 3: Logistic Regression with DiffPrivLib

We want to predict a penguin's species based on bill length, bill depth, flipper length, body mass and sex. Since the target is a category, we will use a logistic regression.

We first split the data between features and target (to predict).

In [35]:
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex']
target_columns = ['species']

In [36]:
feature_data = df[feature_columns]
label_data = df[target_columns]

We then split the data into a training set and a testing set with the [train_test_split function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#train-test-split) from scikit-learn.

In [37]:
from sklearn.model_selection import train_test_split

In [38]:
TEST_SIZE = 0.2
RANDOM_STATE = 1 

x_train, x_test, y_train, y_test = train_test_split(
    feature_data,
    label_data,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
)
y_train = y_train.to_numpy().ravel()
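
With 333 rows and `test_size=0.2`, scikit-learn rounds the test set up, which gives 67 test rows and 266 training rows. The sketch below checks this on placeholder arrays (not the penguin data) and also shows that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the cleaned dataset.
X = np.arange(333 * 2).reshape(333, 2)
y = np.arange(333)

first = train_test_split(X, y, test_size=0.2, random_state=1)
second = train_test_split(X, y, test_size=0.2, random_state=1)

x_tr, x_te, y_tr, y_te = first
print(x_tr.shape, x_te.shape)  # (266, 2) (67, 2)

# Same seed, same split: every returned array matches its counterpart.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True
```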

Then we define the logistic regression pipeline (see the [documentation](https://diffprivlib.readthedocs.io/en/latest/modules/models.html#logistic-regression)):

In [39]:
from sklearn.pipeline import Pipeline
from diffprivlib import models

In [44]:
dpl_pipeline = Pipeline([
    ('classifier', models.LogisticRegression(epsilon=1.0))
])

And fit it on the training set:

In [45]:
dpl_pipeline = dpl_pipeline.fit(x_train, y_train)



We get a `PrivacyLeakWarning` because we did not specify the `data_norm` parameter.

Differential privacy mechanisms need to know how much one individual's record can change the model. This depends on the sensitivity of the loss function, which in turn depends on the size of the feature vectors. `data_norm` is that bound: the maximum L2 norm of any single row (feature vector) in the dataset.

If it is not specified, DiffPrivLib will infer it from the training data. This may leak information about the dataset (e.g. what the maximum value was), hence the `PrivacyLeakWarning`. To avoid that, we should decide on `data_norm` based on domain knowledge before looking at the data.
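
To make the definition concrete, the sketch below computes the L2 norm of each row of a tiny hypothetical feature matrix with NumPy. The maximum of these norms is the kind of data-dependent value that is inferred when `data_norm` is omitted. The `clip_rows` helper is our own illustration of what enforcing a fixed bound can look like; it is an assumption for exposition, not DiffPrivLib's code.

```python
import numpy as np

# Hypothetical rows: bill length, bill depth, flipper length, body mass, sex.
X = np.array(
    [
        [39.1, 18.7, 181.0, 3750.0, 0.0],
        [46.5, 14.8, 217.0, 5200.0, 1.0],
    ]
)

# L2 norm of every row (axis=1 computes one norm per row).
row_norms = np.linalg.norm(X, axis=1)
print("row norms:", row_norms)
print("max row norm (data-dependent):", row_norms.max())


def clip_rows(X, bound):
    """Rescale rows whose L2 norm exceeds `bound` (assumes no all-zero rows)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.minimum(1.0, bound / norms)
    return X * scale


clipped = clip_rows(X, bound=4000.0)
print("norms after clipping:", np.linalg.norm(clipped, axis=1))
```

Note how the body mass column dominates the norm, so the bound is driven almost entirely by the feature with the largest scale.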

From common knowledge (without looking at the data), we know that:
- bill length $\in [30.0, 65.0]$,
- bill depth $\in [13.0, 23.0]$,
- flipper length $\in [150.0, 250.0]$,
- body mass $\in [2000.0, 7000.0]$,
- sex $\in [0, 1]$.

Formally, for a row $x = (x_1, \ldots, x_d)$, its L2 norm is $\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_d^2}$.

We use the feature sensitivities to set the bound on the L2 norm of a row $x$:

$$
\text{data norm} = \sqrt{
(\text{bill length sens})^2 +
(\text{bill depth sens})^2 +
(\text{flipper length sens})^2 +
(\text{body mass sens})^2 +
(\text{sex sens})^2
}
$$

where the sensitivity of each feature is defined as $\text{sens} = \text{max} - \text{min}$.

Writing these bounds in a metadata dictionary, we can then compute the `data_norm`:

In [None]:
import numpy as np

In [46]:
bounds = {
    'bill_length_mm': {'lower': 30.0, 'upper': 65.0},
    'bill_depth_mm': {'lower': 13.0, 'upper': 23.0},
    'flipper_length_mm': {'lower': 150.0, 'upper': 250.0},
    'body_mass_g': {'lower': 2000.0, 'upper': 7000.0},
    'sex': {'lower': 0.0, 'upper': 1.0},
}

In [51]:
# TODO: compute the data_norm
#sensitivities = ...
#data_norm = ...

# Correction
# Sensitivity of each feature: upper bound minus lower bound
sensitivities = [v['upper'] - v['lower'] for v in bounds.values()]
# data_norm: square root of the sum of squared sensitivities
data_norm = np.sqrt(sum(s**2 for s in sensitivities))
print(f"data_norm = {data_norm:.2f}")

data_norm = 5001.13
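
One caveat on the range-based value: $\sqrt{\sum_i (\text{max}_i - \text{min}_i)^2}$ bounds the norm of a row only after each feature has been shifted by its lower bound. For unshifted features, a bound that always holds uses the largest absolute value of each feature instead. The sketch below compares the two using the bounds listed earlier (bill length from 30 to 65, and so on); the variable names are ours.

```python
import math

# Feature bounds from domain knowledge, as (lower, upper) pairs.
feature_bounds = {
    "bill_length_mm": (30.0, 65.0),
    "bill_depth_mm": (13.0, 23.0),
    "flipper_length_mm": (150.0, 250.0),
    "body_mass_g": (2000.0, 7000.0),
    "sex": (0.0, 1.0),
}

# Range-based value: bounds the norm of (x - lower).
range_norm = math.sqrt(sum((hi - lo) ** 2 for lo, hi in feature_bounds.values()))

# Largest-magnitude value: bounds the norm of x itself for unshifted rows.
max_abs_norm = math.sqrt(
    sum(max(abs(lo), abs(hi)) ** 2 for lo, hi in feature_bounds.values())
)

print(f"range-based norm: {range_norm:.2f}")
print(f"max-abs norm:     {max_abs_norm:.2f}")
```

If rows are passed to the model unshifted, the larger value is the conservative choice; the range-based value fits data that has first been shifted by the lower bounds.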
