# Differential Privacy for Data Sets

Next, we'll work with differentially private copies of data sets and see how that changes what we can learn from them.

We will use the [Data Synthesizer](http://demo.dataresponsibly.com/synthesizer/) tool from NYU to generate differentially private data sets.

In [1]:
from IPython.display import display
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score, confusion_matrix, make_scorer, precision_recall_fscore_support, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler
import io
import requests

import numpy as np
import pandas as pd

## Simple learning example

First, train a basic classifier on a familiar dataset.

In [14]:
df_adult_real = pd.read_csv('adult_reduced.csv')
df_adult_real.head()

Unnamed: 0,age,education,sex,relationship,marital-status,income
0,39,Bachelors,Male,Not-in-family,Never-married,<=50K
1,50,Bachelors,Male,Husband,Married-civ-spouse,<=50K
2,38,HS-grad,Male,Not-in-family,Divorced,<=50K
3,53,11th,Male,Husband,Married-civ-spouse,<=50K
4,28,Bachelors,Female,Wife,Married-civ-spouse,<=50K


In [15]:

def adult_lr_prep(df):
    """One-hot encode the categorical columns and make a stratified train/test split."""
    y_all = df["income"].values
    # drop the target without modifying the caller's DataFrame
    X_all = df.drop(columns="income")
    X_all = pd.get_dummies(X_all, columns=["sex", "education", "marital-status", "relationship"])
    # hold out 25% of the rows, preserving the class balance in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X_all, y_all, test_size=0.25, stratify=y_all)

    return X_train, X_test, y_train, y_test

In [16]:

X_train_real, X_test_real, y_train_real, y_test_real = adult_lr_prep(df_adult_real)
X_train_real.head()

Unnamed: 0,age,sex_Female,sex_Male,education_10th,education_11th,education_12th,education_1st-4th,education_5th-6th,education_7th-8th,education_9th,...,marital-status_Married-spouse-absent,marital-status_Never-married,marital-status_Separated,marital-status_Widowed,relationship_Husband,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife
6309,32,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
6449,51,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
29594,81,0,1,0,0,0,1,0,0,0,...,0,0,0,1,0,0,1,0,0,0
7909,31,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
9746,42,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [19]:

clf = LogisticRegression()
clf_fit_real = clf.fit(X_train_real, y_train_real)
y_pred_real = clf_fit_real.predict(X_test_real)
accuracy_score(y_test_real, y_pred_real)



0.8209065225402284
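
An accuracy score is easier to interpret next to a majority-class baseline and a confusion matrix. The sketch below shows one way to compute both. The labels are made up for illustration and simply stand in for `y_test_real` and `y_pred_real`; the numbers say nothing about the adult data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (hypothetical), standing in for y_test_real and y_pred_real
y_true = np.array(["<=50K"] * 6 + [">50K"] * 2)
y_pred = np.array(["<=50K"] * 5 + [">50K"] * 3)

# Majority-class baseline: always predict the most frequent label
labels, counts = np.unique(y_true, return_counts=True)
majority_label = labels[np.argmax(counts)]
baseline_acc = accuracy_score(y_true, np.full(len(y_true), majority_label))

# Accuracy of the toy predictions and where the errors fall
model_acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=["<=50K", ">50K"])

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
print(cm)  # rows are true labels, columns are predicted labels
```

On the real split you would pass `y_test_real` and `y_pred_real` in place of the toy arrays.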

## Impact of DP on learning

Next, your goal is to evaluate how differentially private synthetic data affects your ability to learn. To do this:

1. Generate a dataset with the Data Synthesizer.
1. Load that data.
1. Train a classifier on the private data.
1. Evaluate on the real data.


Notes:
- The code below is incomplete; you need to fill in the places marked `#TODO`.
- When you generate synthetic data, generate the same number of samples as above, or you will have to adjust the code below.
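
The train-on-synthetic, evaluate-on-real workflow can be sketched on toy data as follows. This is a minimal illustration, not the exercise solution: both frames are hypothetical and neither comes from the Data Synthesizer, and the `encode` helper is an assumption introduced here. One thing it highlights is that a synthetic data set may be missing a category that the real data has, in which case `pd.get_dummies` produces different columns for the two frames and they need to be aligned before predicting.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy frames; the "synthetic" one has no "Masters" rows
df_real = pd.DataFrame({
    "age": [25, 38, 52, 46, 29, 61],
    "sex": ["Male", "Female", "Male", "Female", "Female", "Male"],
    "education": ["HS-grad", "Bachelors", "Masters", "Bachelors", "HS-grad", "Masters"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})
df_synth = pd.DataFrame({
    "age": [27, 41, 55, 33, 58, 24],
    "sex": ["Female", "Male", "Male", "Female", "Female", "Male"],
    "education": ["HS-grad", "Bachelors", "Bachelors", "HS-grad", "Bachelors", "HS-grad"],
    "income": ["<=50K", ">50K", ">50K", "<=50K", ">50K", "<=50K"],
})

def encode(df):
    """Hypothetical helper: one-hot encode the features, return (X, y)."""
    # dtype=int keeps the indicator columns as 0/1 integers
    X = pd.get_dummies(df.drop(columns="income"),
                       columns=["sex", "education"], dtype=int)
    return X, df["income"].values

X_real, y_real = encode(df_real)
X_synth, y_synth = encode(df_synth)

# Give the synthetic features the same columns, in the same order, as the real ones;
# categories absent from the synthetic data become all-zero columns
X_synth = X_synth.reindex(columns=X_real.columns, fill_value=0)

# Train on synthetic data only, then score on both data sets
clf_synth = LogisticRegression(max_iter=1000).fit(X_synth, y_synth)
acc_synth = accuracy_score(y_synth, clf_synth.predict(X_synth))
acc_real = accuracy_score(y_real, clf_synth.predict(X_real))
print(f"accuracy on synthetic data: {acc_synth:.3f}")
print(f"accuracy on real data:      {acc_real:.3f}")
```

With data this small the scores are not meaningful; the point is the order of operations and the column alignment step.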

In [None]:
df_adult_dp = pd.read_csv(#TODO)
X_train_dp, X_test_dp, y_train_dp, y_test_dp, = adult_lr_prep(#TODO)
X_train_dp.head()

In [None]:
# fit a fresh estimator so the model trained on the real data is not overwritten
clf_fit_dp = LogisticRegression().fit(X_train_dp, y_train_dp)

# check the fit on the private data first
y_pred_dp = clf_fit_dp.predict(#TODO)
accuracy_score(#TODO, y_pred_dp)

In [None]:
# now see how private training impacts real data performance
y_pred_dp_real = clf_fit_dp.predict(#TODO)
accuracy_score(#TODO, y_pred_dp_real)

Now, repeat the abo