# Applying l-diversity to data minimization for a trained regression ML model

In this tutorial I will show how to use the l_diversity.py and privacy_grid.py modules from the new apt/security package.

These modules are intended as enhancements to the minimization.py framework described in goldsteen2022data.pdf.

## Setup

In [1]:
!git clone https://github.com/cmalvr/ai-privacy-toolkit.git

Cloning into 'ai-privacy-toolkit'...
remote: Enumerating objects: 1528, done.
remote: Counting objects: 100% (595/595), done.
remote: Compressing objects: 100% (225/225), done.
remote: Total 1528 (delta 490), reused 375 (delta 370), pack-reused 933 (from 2)
Receiving objects: 100% (1528/1528), 1.70 MiB | 6.53 MiB/s, done.
Resolving deltas: 100% (1035/1035), done.


In [2]:
! python --version

Python 3.11.11


In [3]:
%cd /content/ai-privacy-toolkit/
import sys
sys.path.append('/content/ai-privacy-toolkit/apt')
print(sys.path)

/content/ai-privacy-toolkit
['/content', '/env/python', '/usr/lib/python311.zip', '/usr/lib/python3.11', '/usr/lib/python3.11/lib-dynload', '', '/usr/local/lib/python3.11/dist-packages', '/usr/lib/python3/dist-packages', '/usr/local/lib/python3.11/dist-packages/IPython/extensions', '/root/.ipython', '/content/ai-privacy-toolkit/apt']


In [4]:
! pip install -r /content/ai-privacy-toolkit/requirements.txt;

Collecting numpy==1.24.2 (from -r /content/ai-privacy-toolkit/requirements.txt (line 1))
  Downloading numpy-1.24.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting pandas==1.1.05 (from -r /content/ai-privacy-toolkit/requirements.txt (line 2))
  Downloading pandas-1.1.5.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting scipy==1.10.1 (from -r /content/ai-privacy-toolkit/requirements.txt (line 3))
  Downloading scipy-1.10.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
Collecting scikit-learn<=1.1.3,>=0.22.2 (from -r /content/ai

In [5]:
! pip install adversarial-robustness-toolbox

Collecting adversarial-robustness-toolbox
  Downloading adversarial_robustness_toolbox-1.19.1-py3-none-any.whl.metadata (11 kB)
Downloading adversarial_robustness_toolbox-1.19.1-py3-none-any.whl (1.7 MB)
Installing collected packages: adversarial-robustness-toolbox
Successfully installed adversarial-robustness-toolbox-1.19.1


## Load data
The QI list defines the quasi-identifiers, and sensitive_attribute names the sensitive column.

In [6]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
X, y = dataset.data, dataset.target
features = list(dataset.feature_names)  # All features from the dataset
QI = ["mean texture", "mean perimeter", "mean smoothness"]
sensitive_attribute = "mean radius"
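
To make the two roles concrete: records that share the same quasi-identifier values form a group, k-anonymity asks that every group hold at least k records, and (distinct) l-diversity asks that every group hold at least l different values of the sensitive attribute. The sketch below checks both on toy records. It is independent of the apt/security modules; the function names and the toy data are hypothetical. It also counts raw distinct values, which would not be meaningful for a continuous column such as mean radius without binning; how l_diversity.py handles that case is not shown in this tutorial.

```python
from collections import defaultdict


def diversity_per_group(records, quasi_identifiers, sensitive_attribute):
    """Map each quasi-identifier combination to (group size, distinct sensitive values)."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record[sensitive_attribute])
    return {key: (len(values), len(set(values))) for key, values in groups.items()}


def satisfies_k_l(records, quasi_identifiers, sensitive_attribute, k, l):
    """True if every group has at least k records and at least l distinct sensitive values."""
    stats = diversity_per_group(records, quasi_identifiers, sensitive_attribute)
    return all(size >= k and distinct >= l for size, distinct in stats.values())


# Hypothetical toy records, not taken from the breast cancer dataset.
toy_records = [
    {"age_band": "30-39", "zip_prefix": "021", "diagnosis": "A"},
    {"age_band": "30-39", "zip_prefix": "021", "diagnosis": "B"},
    {"age_band": "40-49", "zip_prefix": "021", "diagnosis": "A"},
    {"age_band": "40-49", "zip_prefix": "021", "diagnosis": "A"},
]
toy_qi = ["age_band", "zip_prefix"]

toy_stats = diversity_per_group(toy_records, toy_qi, "diagnosis")
print(toy_stats)
print(satisfies_k_l(toy_records, toy_qi, "diagnosis", k=2, l=1))
print(satisfies_k_l(toy_records, toy_qi, "diagnosis", k=2, l=2))
```

Both toy groups have two records, so k=2 holds, but the second group contains a single diagnosis value, so the data is 1-diverse and not 2-diverse.
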

In [7]:
print(features)

['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']


## Train DecisionTreeRegressor model

In [8]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.tree import DecisionTreeRegressor

# Split the dataset to train the baseline model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=14)

# Train a DecisionTreeRegressor as our baseline model.
model = DecisionTreeRegressor(random_state=10, min_samples_split=2)
model.fit(X_train, y_train)

# Measure baseline accuracy (R2 score).
print('Base model accuracy (R2 score): ', model.score(X_test, y_test))

Base model accuracy (R2 score):  0.579424475598187
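
For a scikit-learn regressor, `score` returns the coefficient of determination R², not the fraction of correct predictions. The breast cancer target is a binary 0/1 label, so the two numbers can differ noticeably. The sketch below computes R² by hand on made-up labels and predictions to show the gap; `r2_score_manual` and the toy lists are illustrative names, not part of the toolkit.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up binary labels and predictions with exactly one mistake.
toy_true = [1, 0, 1, 1, 0, 1, 0, 1]
toy_pred = [1, 0, 1, 0, 0, 1, 0, 1]

toy_r2 = r2_score_manual(toy_true, toy_pred)
toy_fraction_correct = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

print("R2 score:", toy_r2)
print("Fraction correct:", toy_fraction_correct)
```

With one wrong prediction out of eight, the fraction correct stays high while R² drops much further, which is worth keeping in mind when reading the "accuracy" values in the rest of this tutorial.
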


## Data pre-processing


In [9]:
from apt.utils.datasets import ArrayDataset

# In a real-world scenario, the model is pre-trained and we no longer have access to the original training data.
# Here, we assume the test set is all we have.
# Create a DataFrame for the test data.
df_test = pd.DataFrame(X_test, columns=features)
df_test['target'] = y_test

# Wrap the test data into an ArrayDataset.
test_dataset = ArrayDataset(df_test.drop(columns=['target']), df_test['target'])
test_dataset.features_names = list(df_test.columns.drop('target'))

## l-diversity and grid-search: Application



In [10]:
from apt.security.privacy_grid import grid_search_privacy
from apt.security.privacy_grid import display_grid_search_results
import contextlib, io


# Set the desired target accuracy for the minimizer.
target_accuracy = 0.6

# Run the grid search, redirecting stdout to silence the minimizer.
with contextlib.redirect_stdout(io.StringIO()):
    results = grid_search_privacy(
        dataset=test_dataset,
        sensitive_attribute=sensitive_attribute,  # "mean radius", defined in the Load data cell
        quasi_identifiers=QI,  # defined in the Load data cell
        model=model,
        features=features,
        target_accuracy=target_accuracy,
        k_min=2, k_max=100,   # Adjust the range as needed.
        l_min=1
    )
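
The internals of `grid_search_privacy` are not shown in this tutorial. As a rough mental model only, a grid search over (k, l) can be sketched as a loop that evaluates every combination and records a deletion ratio and an accuracy for each, followed by a filter and sort step. Everything below is hypothetical: `toy_evaluate` returns made-up numbers in place of the real anonymize-and-score step, and the result keys are illustrative rather than the toolkit's actual result format.

```python
from itertools import product


def toy_evaluate(k, l):
    """Hypothetical stand-in for anonymizing and scoring; returns made-up numbers."""
    deletion_ratio = min(1.0, (k + l) / 200)
    accuracy = max(0.0, 0.9 - deletion_ratio / 2)
    return deletion_ratio, accuracy


def grid_search_sketch(k_values, l_values, evaluate):
    """Evaluate every (k, l) pair and collect one result record per pair."""
    results = []
    for k, l in product(k_values, l_values):
        deletion_ratio, accuracy = evaluate(k, l)
        results.append(
            {"k": k, "l": l, "deletion_ratio": deletion_ratio, "accuracy": accuracy}
        )
    return results


def filter_and_sort(results, min_accuracy, sort_by):
    """Keep records at or above min_accuracy, ordered by the chosen key (ascending)."""
    kept = [r for r in results if r["accuracy"] >= min_accuracy]
    return sorted(kept, key=lambda r: r[sort_by])


sketch_results = grid_search_sketch(range(2, 6), range(1, 4), toy_evaluate)
best = filter_and_sort(sketch_results, min_accuracy=0.886, sort_by="deletion_ratio")

for record in best:
    print(record)
```

Separating the search from the filtering step means the expensive loop runs once, and the same results can then be inspected under different accuracy thresholds or sort keys.
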

In [11]:
# Example display of the results.
display_grid_search_results(results, min_accuracy=0.7, sort_by="deletion_ratio")

Parameters: k=83, l=79
Deletion Ratio: 0.291
Accuracy on minimized data: 0.700
Generalizations: {'ranges': {'mean texture': [], 'mean perimeter': [91.09000015258789], 'mean smoothness': []}, 'categories': {}, 'untouched': ['worst texture', 'mean compactness', 'fractal dimension error', 'worst area', 'mean concave points', 'worst fractal dimension', 'worst perimeter', 'worst smoothness', 'perimeter error', 'mean fractal dimension', 'compactness error', 'concave points error', 'radius error', 'worst concavity', 'worst symmetry', 'worst concave points', 'area error', 'mean area', 'concavity error', 'mean concavity', 'worst radius', 'mean symmetry', 'symmetry error', 'mean radius', 'texture error', 'worst compactness', 'smoothness error'], 'category_representatives': {}, 'range_representatives': {'mean texture': [], 'mean perimeter': [91.09000015258789], 'mean smoothness': []}}
----------------------------------------
Parameters: k=83, l=80
Deletion Ratio: 0.291
Accuracy on minimized data: