<a href="https://colab.research.google.com/github/dbenayoun/IASD/blob/main/DP3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Case Study - Health Care dataset

In this notebook, we'll explore the possibilties for data privacy on a new dataset.
Your challenge is that you are working with a health care provider, who would like to do the "machine learning" on this dataset to figure out if there are preventative measures that can be taken so fewer patients are seen in the hospital for related care or so that their visits are shorter, to do so you might want to predict the **has_diabetes** according to the features at hand. The goal is that more potentially affected patients are given access to primary care physicians and regular medication or visits that can keep them out of the hostpital for long stays.

## Part One: Determining What's Useful and What's Sensitive

- Data completeness
- Potential sensitive columns
- Potential useful features
- What columns should we use?
- Which ones should we remove?
- Are there columns which we should protect but not remove?

For each, we need some justification or thought!

In [27]:
%matplotlib inline
import pandas as pd
!mkdir data
!wget https://www.lamsade.dauphine.fr/~averine/ProjetIA/data/health_data.csv -P data
df = pd.read_csv('data/health_data.csv')

mkdir: cannot create directory ‘data’: File exists
--2024-10-14 20:28:23--  https://www.lamsade.dauphine.fr/~averine/ProjetIA/data/health_data.csv
Resolving www.lamsade.dauphine.fr (www.lamsade.dauphine.fr)... 193.48.71.250
Connecting to www.lamsade.dauphine.fr (www.lamsade.dauphine.fr)|193.48.71.250|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 100189 (98K) [text/csv]
Saving to: ‘data/health_data.csv.4’


2024-10-14 20:28:24 (362 KB/s) - ‘data/health_data.csv.4’ saved [100189/100189]



In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   admitted_ts            1000 non-null   object
 1   age                    1000 non-null   int64 
 2   ambulance_call         1000 non-null   int64 
 3   blood_sugar_reading    1000 non-null   int64 
 4   days_since_last_visit  1000 non-null   int64 
 5   has_diabetes           1000 non-null   int64 
 6   hospital               1000 non-null   object
 7   hours_hospitalized     1000 non-null   int64 
 8   hydration_level        1000 non-null   int64 
 9   id                     1000 non-null   int64 
 10  insulin                1000 non-null   int64 
 11  marital_status         1000 non-null   object
 12  no_primary_dr          1000 non-null   bool  
 13  patient_name           1000 non-null   object
 14  private_insurance      1000 non-null   int64 
 15  released_sameday      

In [13]:
df.head()

Unnamed: 0,admitted_ts,age,ambulance_call,blood_sugar_reading,days_since_last_visit,has_diabetes,hospital,hours_hospitalized,hydration_level,id,insulin,marital_status,no_primary_dr,patient_name,private_insurance,released_sameday,ssn,symptom_code
0,2018-05-09 12:06:28,49,1,108,99,1,district,15,6,1000,1,single,False,Rachel Shelton,0,0,743-97-4081,4
1,2018-05-12 10:02:55,82,1,70,100,1,general,22,1,1001,1,married,False,Barbara Medina,0,0,698-10-2230,3
2,2018-05-13 12:25:17,71,1,100,78,1,northern,1,4,1002,1,no_answer,True,Kaitlyn Daniels,0,1,540-83-4297,2
3,2018-05-14 12:20:08,87,0,113,72,1,general,22,4,1003,1,no_answer,False,William Reyes,1,0,282-96-8755,0
4,2018-05-17 08:35:23,53,1,93,80,1,district,17,8,1004,0,no_answer,True,Eric Booth,0,0,130-25-8918,8


### Summary of Recommendations


| Column                | Action               | Justification                                                                                     |
|-----------------------|----------------------|---------------------------------------------------------------------------------------------------|
| `patient_name`        | Remove               | Direct PII; high privacy risk                                                                      |
| `ssn`                 | Remove               | Direct PII; high privacy risk                                                                      |
| `id`                  | Remove               | Direct PII; high privacy risk                                                                      |
| `admitted_ts`         | Remove               | Irrelevant for diabetes prediction; no added value                                                 |
| `hospital`            | Remove               | Indirect PII; Not relevant to the prediction of diabetes                  |
| `marital_status`      | Remove               | Indirect PII; Not relevant to the prediction of diabetes |
| `private_insurance`   | Remove               | Indirect PII; Not relevant to the prediction of diabetes                                                  |
| `age`                 | **Use**                  | Sensitive ; Relevant for diabetes prediction                                                                   |
| `ambulance_call`      | Use                  | Indicates severity and health status                                                               |
| `blood_sugar_reading` | Use                  | Direct indicator of diabetes                                                                       |
| `days_since_last_visit`| **Use**               | Sensitive ; Reflects healthcare engagement and disease management                                              |
| `has_diabetes`        | Use (Target)         | Core target variable for predictive modeling                                                       |
| `hydration_level`     | Use                  | Related to overall health status                                                                   |
| `insulin`             | Use                  | Indicates diabetes treatment and management                                                        |
| `no_primary_dr`       | Remove                  |Irrelevant for diabetes prediction; no added value                                                 |
| `symptom_code`        | Use                  | Provides specific health indicators relevant to diabetes                                           |
| `hours_hospitalized`  | Remove           | Unavailable at prediction time; outcome feature                                                    |
| `released_sameday`    | Remove           | Unavailable at prediction time; outcome feature                                                    |




## Part Two: Determining the Approach for Protecting the Columns

You are the database manager at the health care provider asked to prepare the data to send to a machine learning consultant who will help give you a more detailed analysis. The consultant has signed all the necessary NDAs, but you have instructions to keep the private or potentially sensitive data to a minimum.

Based on the scenario and what you learned

- What methods will be most effective?
- Have you considered potential data leakage within the *non-sensitive* columns?
- Is there other sensitive or secret data we should address?

### Privacy Techniques Overview

#### k-Anonymity
- **Description**: A good starting point for preventing re-identification through quasi-identifiers. It ensures basic anonymity but doesn't protect against attribute disclosure.

#### l-Diversity
- **Description**: Useful if your concern is that knowing a group of people share the same quasi-identifiers could still reveal sensitive information (e.g., all have high `blood_sugar_reading`).

#### t-Closeness
- **Description**: Best when you want to ensure that sensitive attributes within any group reflect the overall dataset distribution, further reducing the risk of sensitive attribute disclosure.

---

### Differential Privacy Overview

##### Definition
A mechanism is considered differentially private if an observer cannot determine whether a particular individual's data is included in the input dataset, even if they have access to the output of the computation. This is typically achieved by adding carefully calibrated noise to the results.


## Part Three: Implement Data Protection for the Dataset

Now it's time to code! Feel free to utilize code from the previous notebooks to implement protection of at least two of the columns you chose as sensitive.

- What was difficult to decide and implement?
- How might this relate to real problems in machine learning with sensitive data?



In [30]:
# Delete unused columns
df_ = df.copy()
df_.drop(columns=['id', 'patient_name', 'ssn', 'admitted_ts', 'hospital', 'marital_status', 'no_primary_dr', 'released_sameday', 'hours_hospitalized', 'private_insurance'], inplace=True)
df_

Unnamed: 0,age,ambulance_call,blood_sugar_reading,days_since_last_visit,has_diabetes,hydration_level,insulin,symptom_code
0,49,1,108,99,1,6,1,4
1,82,1,70,100,1,1,1,3
2,71,1,100,78,1,4,1,2
3,87,0,113,72,1,4,1,0
4,53,1,93,80,1,8,0,8
...,...,...,...,...,...,...,...,...
995,75,0,83,38,0,9,1,2
996,73,0,77,39,0,1,0,8
997,29,1,109,63,1,2,1,7
998,23,0,75,37,0,4,0,2


In [31]:
# Function for differential privacy

import numpy as np

def process_value(value, p, q):
    """
    :param value: The value to apply the differentially private scheme to.
    :param     p: The probability of returning a random value instead of the true one
    :param     q: The probability of returning **0** when generating a random value
    :    returns: A new, differentially private value
    """
    if np.random.rand() < (1-p):
      # True value with probability 1-p
      return value
    else:
      # Random value with probability p
      if np.random.rand() < q:
        # In that case, 0 with probability q
        return 0
      else:
        # In that case, 1 with probability 1-q
        return 1

In [33]:
# Create the DP dataframe (part 1)

p=0.5
q=0.5

df_dp = df_.copy()
df_dp['has_diabetes']=df_dp['has_diabetes'].apply(lambda x: process_value(x, p, q))
df_dp['insulin']=df_dp['insulin'].apply(lambda x: process_value(x, p, q))

df_dp

Unnamed: 0,age,ambulance_call,blood_sugar_reading,days_since_last_visit,has_diabetes,hydration_level,insulin,symptom_code
0,49,1,108,99,1,6,1,4
1,82,1,70,100,0,1,1,3
2,71,1,100,78,1,4,1,2
3,87,0,113,72,1,4,1,0
4,53,1,93,80,0,8,0,8
...,...,...,...,...,...,...,...,...
995,75,0,83,38,0,9,0,2
996,73,0,77,39,1,1,0,8
997,29,1,109,63,1,2,1,7
998,23,0,75,37,0,4,0,2


In [34]:
# Function for Laplace Mechanism

def simlap(n, b):
    """
    :param     n: The number of samples to draw
    :param     b: Scale of the distribution.
    :    returns: a vector of size n with noises sample from laplace dristribution.
    """
    simlaps = []
    for j in range(n):
      # location parameter is 0
      simlap_trial = np.random.laplace(0, b)
      simlaps.append(simlap_trial)

    return np.array(simlaps)

In [35]:
# Create the DP dataframe (part 2)

epsilon=1

df_dp['hydration_level'] = df_dp['hydration_level'] /(df_dp['hydration_level'].max() - df_dp['hydration_level'].min())

noise_blood_arr = simlap(df_dp['blood_sugar_reading'].count(), epsilon/(df_dp['blood_sugar_reading'].max()-df_dp['blood_sugar_reading'].min()))
noise_hydratation_arr = simlap(df_dp['hydration_level'].count(), epsilon/(df_dp['hydration_level'].max()-df_dp['hydration_level'].min()))

noise_laplace_df = pd.DataFrame({'noise_blood': noise_blood_arr, 'noise_hydratation': noise_hydratation_arr})

df_dp['blood_sugar_reading'] = df_dp['blood_sugar_reading'] + noise_laplace_df['noise_blood']
df_dp['hydration_level'] = df_dp['hydration_level'] + noise_laplace_df['noise_hydratation']

df_dp

Unnamed: 0,age,ambulance_call,blood_sugar_reading,days_since_last_visit,has_diabetes,hydration_level,insulin,symptom_code
0,49,1,107.964763,99,1,0.924393,1,4
1,82,1,69.991918,100,0,1.843413,1,3
2,71,1,100.000126,78,1,0.825785,1,2
3,87,0,112.996335,72,1,0.463042,1,0
4,53,1,93.006470,80,0,0.100699,0,8
...,...,...,...,...,...,...,...,...
995,75,0,83.018558,38,0,1.195850,0,2
996,73,0,77.007466,39,1,-0.347090,0,8
997,29,1,109.009137,63,1,-0.309524,1,7
998,23,0,75.005953,37,0,0.333412,0,2


In [36]:
def is_k_anonymous(df, partition, sensitive_column=None, k=3):
    """
    :param               df: The dataframe on which to check the partition.
    :param        partition: The partition of the dataframe to check.
    :param sensitive_column: The name of the sensitive column
    :param                k: The desired k
    :returns               : True if the partition is valid according to our k-anonymity criteria, False otherwise.
    """

    return len(partition) >= k

## Bonus: machine learning from anonymized Dataset

Build a model that predicts the the diabetes statue of a new patient based on the anonymized database from part three. Compare with non anonymized dataset. (try to get more than 0.65 accuracy)

In [None]:
# We can remove here additionnal data not relevant for that target

col_supp=['admitted_ts', 'ambulance_call', 'days_since_last_visit', 'hospital', 'hours_hospitalized', 'no_primary_dr', 'released_sameday']

df.drop(columns=col_supp, inplace=True)
df_dp.drop(columns=col_supp, inplace=True)

In [None]:
# Conversion des données catégorielles
df_dum = pd.get_dummies(df['marital_status'])
df.drop(['marital_status'], axis=1, inplace=True)
df = pd.concat([df, df_dum], axis=1)

df_dp_dum = pd.get_dummies(df_dp['marital_status'])
df_dp.drop(['marital_status'], axis=1, inplace=True)
df_dp = pd.concat([df_dp, df_dp_dum], axis=1)

In [None]:
df_dp

Unnamed: 0,age,blood_sugar_reading,has_diabetes,hydration_level,insulin,divorced,married,no_answer,single
0,49,107.994165,1,1.215643,0,False,False,False,True
1,82,69.999778,0,0.229856,1,False,True,False,False
2,71,100.002622,1,-1.274001,1,False,False,True,False
3,87,113.015305,1,-0.068719,0,False,False,True,False
4,53,92.985818,1,1.522896,0,False,False,True,False
...,...,...,...,...,...,...,...,...,...
995,75,83.007236,0,-4.056139,1,True,False,False,False
996,73,76.996807,0,-0.249246,0,False,False,True,False
997,29,109.003183,1,0.284913,1,False,True,False,False
998,23,75.003168,0,-3.048329,0,False,False,False,True


In [None]:
# Machine learning on df_dp

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier  # K plus proche voisin
from sklearn.svm import SVC                         # Support vector machine
from sklearn.tree import DecisionTreeClassifier     # Arbre de décision

from sklearn.metrics import accuracy_score

x = df_dp.drop('has_diabetes', axis=1)
y = df_dp['has_diabetes']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# K plus proche voisin
model_knn = KNeighborsClassifier()
model_knn.fit(x_train, y_train)
y_predict_knn = model_knn.predict(x_test)
accuracy_knn = accuracy_score(y_test, y_predict_knn)
print('Accuracy KNN: ', accuracy_knn)

# Support vector machine
model_svc = SVC()
model_svc.fit(x_train, y_train)
y_predict_svc = model_svc.predict(x_test)
accuracy_svc = accuracy_score(y_test, y_predict_svc)
print('Accuracy SVC: ', accuracy_svc)

# Arbre de décision
model_dt = DecisionTreeClassifier()
model_dt.fit(x_train, y_train)
y_predict_dt = model_dt.predict(x_test)
accuracy_dt = accuracy_score(y_test, y_predict_dt)
print('Accuracy DT: ', accuracy_dt)

Accuracy KNN:  0.5666666666666667
Accuracy SVC:  0.6166666666666667
Accuracy DT:  0.5266666666666666


65% accuracy is reached bu SVC (support vector machine)

In [None]:
# Machine learning on the original dataframe (df)

x = df.drop('has_diabetes', axis=1)
y = df['has_diabetes']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Support vector machine
model_svc = SVC()
model_svc.fit(x_train, y_train)
y_predict_svc = model_svc.predict(x_test)
accuracy_svc = accuracy_score(y_test, y_predict_svc)
print('Accuracy SVC: ', accuracy_svc)

Accuracy SVC:  0.75


Accuracy is better on the data without differential privacy