# General Description
This data was extracted from the census bureau database found at
https://www2.census.gov/programs-surveys/acs/data/pums/2019/1-Year/ (1 year census data), and https://www2.census.gov/programs-surveys/acs/data/pums/2019/5-Year/ (5 year census data).  
Donor: Bargav Jayaraman and Zihao Su, University of Virginia. e-mail: zs3pv@virginia.edu for questions.  
The records were filtered to keep only those with AGEP>16 (age over 16 years) and WKHP>0 (usual hours worked per week in the past 12 months greater than 0).  
The 1 year census data has 1685316 records and the 5 year census data has 8199834 records. The data is split into features and labels. For the data processing methods and the data dictionary, see the `Features` and `Label` sections. For how to import and use the data, see the `Usage` section. For benchmarks of models trained on the data, see the `Benchmarks` section. For how to reproduce the data, see the `Steps to Reproduce Data` section.

# Features
Found in `census_features.p` or `census_features.csv`, the features contain 13 columns, whose values are extracted from the raw census data and processed. Details of each column are listed below in order, along with the processing methods and data dictionary.
- `AGEP`: age in years. Only records with AGEP>16 are included in the dataset. 
- `COW`: class of worker. The values are shifted down by 1, so they start at 0.
	* 0: 'Private For-Profit'
	* 1: 'Private Non-Profit'
	* 2: 'Local Govt'
	* 3: 'State Govt'
	* 4: 'Federal Govt'
	* 5: 'Self-Employed Other'
	* 6: 'Self-Employed Own'
	* 7: 'Unpaid Job'
	* 8: 'Unemployed'
- `SCHL`: education attainment. The values are shifted down by 1, so they start at 0.
	* 0: 'No schooling completed'
	* 1: 'Nursery school, preschool'
	* 2: 'Kindergarten'
	* 3: 'Grade 1'
	* 4: 'Grade 2'
	* 5: 'Grade 3'
	* 6: 'Grade 4'
	* 7: 'Grade 5' 
	* 8: 'Grade 6'
	* 9: 'Grade 7'
	* 10: 'Grade 8'
	* 11: 'Grade 9'
	* 12: 'Grade 10'
	* 13: 'Grade 11'
	* 14: '12th grade - no diploma'
	* 15: 'Regular high school diploma'
	* 16: 'GED or alternative credential'
	* 17: 'Some college, but less than 1 year'
	* 18: '1 or more years of college credit, no degree'
	* 19: 'Associates degree'
	* 20: 'Bachelors degree'
	* 21: 'Masters degree'
	* 22: 'Professional degree beyond a bachelors degree'
	* 23: 'Doctorate degree'
- `MAR`: marital status. The values are shifted down by 1, so they start at 0.
	* 0: 'Married'
	* 1: 'Widowed'
	* 2: 'Divorced'
	* 3: 'Separated'
	* 4: 'Never married'
- `RAC1P`: recoded detailed race code. Alaskan Native and American Indians are combined into a "Native American" class. The values are then replaced by their ranks of the sorted distinct values (in ascending order). Thus, the values start at 0 and are consecutive integers.
	* 0: 'White'
	* 1: 'Black'
	* 2: 'Native American'
	* 3: 'Asian'
	* 4: 'Pacific Islander'
	* 5: 'Some Other Race'
	* 6: 'Two or More Races'
- `SEX`: sex. The values are shifted down by 1, so they start at 0.
	* 0: 'Male'
	* 1: 'Female'
- `DREM`: cognitive difficulty. The 'no' value is changed from 2 to 0, so the values start at 0.
	* 0: 'No'
	* 1: 'Yes'
- `DPHY`: ambulatory difficulty. The 'no' value is changed from 2 to 0, so the values start at 0.
	* 0: 'No'
	* 1: 'Yes'
- `DEAR`: hearing difficulty. The 'no' value is changed from 2 to 0, so the values start at 0.
	* 0: 'No'
	* 1: 'Yes'
- `DEYE`: vision difficulty. The 'no' value is changed from 2 to 0, so the values start at 0.
	* 0: 'No'
	* 1: 'Yes'
- `WKHP`: usual hours worked per week in the past 12 months. Only records with WKHP>0 are included in the dataset.
- `WAOB`: world area of birth. The values are shifted down by 1, so they start at 0.
	* 0: 'US state'
	* 1: 'PR and US Island Areas'
	* 2: 'Latin America'
	* 3: 'Asia'
	* 4: 'Europe'
	* 5: 'Africa'
	* 6: 'Northern America'
	* 7: 'Oceania and at Sea'
- `ST`: state code. The values are replaced by their ranks of the sorted distinct values (in ascending order). Thus, the values start at 0 and are consecutive integers.
	* 0: 'Alabama/AL'
	* 1: 'Alaska/AK'
	* 2: 'Arizona/AZ'
	* 3: 'Arkansas/AR'
	* 4: 'California/CA'
	* 5: 'Colorado/CO'
	* 6: 'Connecticut/CT'
	* 7: 'Delaware/DE'
	* 8: 'District of Columbia/DC'
	* 9: 'Florida/FL'
	* 10: 'Georgia/GA'
	* 11: 'Hawaii/HI'
	* 12: 'Idaho/ID'
	* 13: 'Illinois/IL'
	* 14: 'Indiana/IN'
	* 15: 'Iowa/IA'
	* 16: 'Kansas/KS'
	* 17: 'Kentucky/KY'
	* 18: 'Louisiana/LA'
	* 19: 'Maine/ME'
	* 20: 'Maryland/MD'
	* 21: 'Massachusetts/MA'
	* 22: 'Michigan/MI'
	* 23: 'Minnesota/MN'
	* 24: 'Mississippi/MS'
	* 25: 'Missouri/MO'
	* 26: 'Montana/MT'
	* 27: 'Nebraska/NE'
	* 28: 'Nevada/NV'
	* 29: 'New Hampshire/NH'
	* 30: 'New Jersey/NJ'
	* 31: 'New Mexico/NM'
	* 32: 'New York/NY'
	* 33: 'North Carolina/NC'
	* 34: 'North Dakota/ND'
	* 35: 'Ohio/OH'
	* 36: 'Oklahoma/OK'
	* 37: 'Oregon/OR'
	* 38: 'Pennsylvania/PA'
	* 39: 'Rhode Island/RI'
	* 40: 'South Carolina/SC'
	* 41: 'South Dakota/SD'
	* 42: 'Tennessee/TN'
	* 43: 'Texas/TX'
	* 44: 'Utah/UT'
	* 45: 'Vermont/VT'
	* 46: 'Virginia/VA'
	* 47: 'Washington/WA'
	* 48: 'West Virginia/WV'
	* 49: 'Wisconsin/WI'
	* 50: 'Wyoming/WY'
	* 51: 'Puerto Rico/PR'
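
The rank replacement used for `RAC1P` and `ST` can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: `rank_encode` is a hypothetical helper, and the sample values are a few raw state codes chosen for illustration (assuming the raw `ST` values are FIPS state codes, which are not consecutive).

```python
import numpy as np

def rank_encode(codes):
    """Replace each code by its rank among the sorted distinct codes (hypothetical helper)."""
    # np.unique returns the sorted distinct values and, with return_inverse=True,
    # the index of each input value within that sorted array, which is its rank.
    distinct, ranks = np.unique(np.asarray(codes), return_inverse=True)
    return ranks, distinct

# Toy sample of raw state codes (assumed FIPS): 1 = Alabama, 2 = Alaska,
# 4 = Arizona, 72 = Puerto Rico. Code 3 is unused, so the raw codes have gaps.
raw_st = [4, 1, 72, 2, 4]
ranks, distinct = rank_encode(raw_st)
print(ranks)     # consecutive integers starting at 0
print(distinct)  # the sorted distinct raw codes
```

With only four distinct codes in this toy sample, Puerto Rico gets rank 3. Over the full data set, where all 52 codes are present, it gets rank 51 as listed above.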

After all the column values are processed as described above, all values are divided by the maximum value of their respective column, which normalizes the values to be between 0 and 1. Therefore, the actual values stored in the data files **DO NOT** correspond directly to the data dictionary above. 
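
As a minimal sketch of this normalization (not the authors' preprocessing code), the example below divides each column of a small made-up matrix by its column maximum and then reverses the step. `normalize_by_column_max` is a hypothetical helper, and the sketch assumes every column has a maximum greater than 0.

```python
import numpy as np

def normalize_by_column_max(values):
    """Divide each column by its maximum (hypothetical helper).

    Returns the normalized array and the per-column maxima, which are
    needed later to map the values back to the data dictionary.
    """
    values = np.asarray(values, dtype=np.float64)
    col_max = values.max(axis=0)  # assumes every column maximum is > 0
    return values / col_max, col_max

# Made-up rows with three columns: AGEP, COW (0-8), SEX (0-1).
raw = np.array([[20, 0, 1],
                [45, 8, 0],
                [90, 4, 1]])

normalized, col_max = normalize_by_column_max(raw)
print(normalized)  # every value now lies between 0 and 1

# Reversing: multiply by the maxima and round away floating-point error.
recovered = np.around(normalized * col_max)
print(recovered)
```

Rounding after the multiplication matters because the division and multiplication are done in floating point and may not return exact integers.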
 

# Label
Found in `census_labels.p` or `census_labels.csv`, the labels contain a single column indicating whether a person's total income is at least \$50,000. The value is 1 for records with a total income of \$50,000 or more (PINCP>=50000, where PINCP is the person's total income) and 0 for the remaining records.
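
The record filter from the `General Description` and the label rule above can be sketched together on a few made-up records. This is an illustration under those stated conditions only, not the authors' preprocessing code, and the toy values are invented.

```python
import numpy as np

# Made-up records with columns AGEP, WKHP, PINCP.
records = np.array([
    [16, 20,  8000],   # dropped: AGEP is not > 16
    [30, 40, 50000],   # kept, label 1 (exactly 50,000 counts as "at least")
    [45,  0, 70000],   # dropped: WKHP is not > 0
    [52, 35, 49999],   # kept, label 0
])
agep, wkhp, pincp = records.T

# Keep only records with AGEP > 16 and WKHP > 0.
keep = (agep > 16) & (wkhp > 0)

# Label is 1 when PINCP >= 50000, else 0.
labels = (pincp[keep] >= 50000).astype(np.int32)
print(keep)
print(labels)
```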

# Usage
## Pickle files
Prerequisites: [Python 3.8](https://www.python.org/downloads/release/python-380/), [pickle](https://docs.python.org/3/library/pickle.html) (part of the Python standard library), [numpy](https://pypi.org/project/numpy/).  
To load a `.p` file, use: `data = pickle.load(open(<DATA_PATH>, 'rb'))`.  
Sample usage code is shown below. The download commands use `gdown` and the notebook shell prefix `!`.

Download the 1 year data:

```
!gdown --id 1oMaHqFJutp0RmAG8pHrmF1k_2H2vAegE
!gdown --id 1aDxPdt8iB4zUc7joWuieSQpRIhXcKPI2
!gdown --id 13UvEGw-I0Ylu5o-9uDLHCU-A3KAL7MiJ
```

Output:

```
Downloading...
From: https://drive.google.com/uc?id=1oMaHqFJutp0RmAG8pHrmF1k_2H2vAegE
To: /content/census_labels.p
100% 13.5M/13.5M [00:00<00:00, 81.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1aDxPdt8iB4zUc7joWuieSQpRIhXcKPI2
To: /content/census_feature_desc.p
100% 2.52k/2.52k [00:00<00:00, 1.62MB/s]
Downloading...
From: https://drive.google.com/uc?id=13UvEGw-I0Ylu5o-9uDLHCU-A3KAL7MiJ
To: /content/census_features.p
100% 175M/175M [00:01<00:00, 128MB/s]
```


Alternatively, download the 5 year data:

```
!gdown --id 1jGw9TnsdC8nXxiCCK46mqNbIRkWOZpHU
!gdown --id 1M7ms22gfdE1W1GecrWIaghWCgKOMY8lI
!gdown --id 1L-X-nplPyISi85W8YuI-m6eB2pdpMeoG
```

Read the features data:

```python
import pickle
import numpy as np

X = pickle.load(open('./census_features.p', 'rb'))
X = np.array(X, dtype=np.float32)
print(f"Dimension of X: {X.shape}")
print(X[:3])
```

Note that each row of the features matrix has 13 values, which correspond, in order, to the columns listed in the `Features` section.

Next, read the labels data:

```python
y = pickle.load(open('./census_labels.p', 'rb'))
y = np.array(y, dtype=np.int32)
print(f"Length of y: {len(y)}")
print(y[:20])
```

Output:

```
Length of y: 1685316
[0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0]
```


Read other helpful information:

```python
attribute_idx, attribute_dict, max_attr_vals = pickle.load(open('./census_feature_desc.p', 'rb'))
```


`attribute_idx` maps each categorical feature name to its column index in the features matrix.

```python
print(attribute_idx)
```

Output:

```
{'DREM': 6, 'DPHY': 7, 'DEAR': 8, 'DEYE': 9, 'SEX': 5, 'COW': 1, 'MAR': 3, 'RAC1P': 4, 'WAOB': 11, 'SCHL': 2}
```


`attribute_dict` contains the data dictionary for each categorical feature. 

```python
print(attribute_dict)
```

Output (truncated):

```
{6: {0: 'No Cognitive Difficulty', 1: 'Cognitive Difficulty'}, 7: {0: 'No Ambulatory Difficulty', 1: 'Ambulatory Difficulty'}, 8: {0: 'No Hearing Difficulty', 1: 'Hearing Difficulty'}, 9: {0: 'No Vision Difficulty', 1: 'Vision Difficulty'}, 5: {0: 'Male', 1: 'Female'}, 1: {0: 'Private For-Profit', 1: 'Private Non-Profit', 2: 'Local Govt', 3: 'State Govt', 4: 'Federal Govt', 5: 'Self-Employed Other', 6: 'Self-Employed Own', 7: 'Unpaid Job', 8: 'Unemployed'}, 3: {0: 'Married', 1: 'Widowed', 2: 'Divorced', 3: 'Separated', 4: 'Never married'}, 4: {0: 'White', 1: 'Black', 2: 'Native American', 3: 'Asian', 4: 'Pacific Islander', 5: 'Some Other Race', 6: 'Two or More Races'}, 11: {0: 'US state', 1: 'PR and US Island Areas', 2: 'Latin America', 3: 'Asia', 4: 'Europe', 5: 'Africa', 6: 'Northern America', 7: 'Oceania and at Sea'}, 13: {0: 'Alabama/AL', 1: 'Alaska/AK', 2: 'Arizona/AZ', 3: 'Arkansas/AR', 4: 'California/CA', 5: 'Colorado/CO', 6: 'Connecticut/CT', 7: 'Delaware/DE', 8: 'District of C
```

`max_attr_vals` contains the maximum value for each column before normalization.

```python
print(max_attr_vals)
```

You can reverse the normalization and get values that correspond to the data dictionary:

```python
reverse_normalization = np.around(np.array(X)*max_attr_vals)
print(reverse_normalization)
```

Output:

```
[[20.  0. 18. ... 30.  0.  7.]
 [30.  3. 15. ... 40.  0.  7.]
 [44.  3. 15. ... 42.  0.  7.]
 ...
 [46.  0. 20. ... 40.  0.  6.]
 [63.  1. 21. ... 45.  0.  6.]
 [61.  0. 17. ... 45.  0.  6.]]
```
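
Once the normalization is reversed, the category codes can be turned into readable labels with `attribute_idx` and `attribute_dict`. The sketch below is illustrative only: `decode_row` is a hypothetical helper, the two dictionaries are short excerpts of the printed values above (with `attribute_dict` assumed to be keyed by column index, as that output suggests), and the row is made up.

```python
# Excerpts of the printed attribute_idx and attribute_dict (two features only).
attribute_idx = {'SEX': 5, 'MAR': 3}
attribute_dict = {
    5: {0: 'Male', 1: 'Female'},
    3: {0: 'Married', 1: 'Widowed', 2: 'Divorced', 3: 'Separated', 4: 'Never married'},
}

def decode_row(row, attribute_idx, attribute_dict):
    """Map the categorical codes of one de-normalized row to labels (hypothetical helper)."""
    decoded = {}
    for name, col in attribute_idx.items():
        code = int(row[col])  # de-normalized values are floats such as 4.0
        decoded[name] = attribute_dict[col][code]
    return decoded

# Made-up de-normalized row with 13 values in the `Features` column order.
row = [44.0, 3.0, 15.0, 4.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 42.0, 0.0, 7.0]
decoded = decode_row(row, attribute_idx, attribute_dict)
print(decoded)
```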


## CSV files
Prerequisites: [Python 3.8](https://www.python.org/downloads/release/python-380/), [numpy](https://pypi.org/project/numpy/).  
To load a `.csv` file, use: `data = numpy.genfromtxt(<DATA_PATH>,delimiter=',')`.  
Sample usage code is shown below:

Download the 1 year data:

```
!gdown --id 1FS25Lwn-0qgV2sPzvkmL_HdUYZ_HJRgy
!gdown --id 1d2dYbwK9CjRgh89ISCdcYLdUWG0lfDtc
```

Output:

```
Downloading...
From: https://drive.google.com/uc?id=1Ih4bHHhe012KFAhwWCZkLgnpHCMEiqn5
To: /content/census_features.csv
100% 548M/548M [00:03<00:00, 139MB/s]
```


Alternatively, download the 5 year data:

```
!gdown --id 1n7O0x2uRdWhWJY4GPhWBxS1osIk5npPQ
!gdown --id 1s45dppmjCv56hM6aFX4CTDPRz7I3tIED
```

```python
import numpy as np
```

Read the features data:

```python
X = np.genfromtxt('./census_features.csv', delimiter=',')
print(f"Dimension of X: {X.shape}")
print(X[:3])
```

Output:

```
Dimension of X: (1685316, 13)
```


Note that each row of the features matrix has 13 values, which correspond, in order, to the columns listed in the `Features` section.

Read the labels data:

```python
y = np.genfromtxt('./census_labels.csv', delimiter=',')
print(f"Length of y: {len(y)}")
print(y[:20])
```

Output:

```
Length of y: 1685316
[0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
```


# Benchmarks
We benchmarked the data by training various classifiers and reporting their train and test accuracy, as demonstrated below. The data was split randomly into roughly 2/3 for training and 1/3 for testing.

## 1 year data results  
Percentage of records with > \$50,000 total person's income: 58.1%   
Percentage of records with <= \$50,000 total person's income: 41.9%   

### Membership Inference Benchmarks  
We tested 4 membership inference attacks (Yeom, Shokri, Merlin and Morgan) against NN models trained on the 1-year census data (with a test-to-train ratio of 0.5), under various differential privacy settings. Each setting was run 5 times, and the benchmark results below are averages over those runs.
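
The tables below report PPV and advantage without defining them. The sketch below assumes the usual definitions: PPV (positive predictive value) is the fraction of records flagged as members that really are members, and advantage is the membership advantage of Yeom et al., that is, true positive rate minus false positive rate. `ppv_and_advantage` is a hypothetical helper, the toy data is made up, and returning a PPV of 0 when nothing is flagged is a convention chosen here, not necessarily the one used for the benchmarks.

```python
import numpy as np

def ppv_and_advantage(is_member, predicted_member):
    """Compute PPV and membership advantage (hypothetical helper, assumed definitions)."""
    is_member = np.asarray(is_member, dtype=bool)
    predicted_member = np.asarray(predicted_member, dtype=bool)

    tp = np.sum(predicted_member & is_member)    # members flagged as members
    fp = np.sum(predicted_member & ~is_member)   # non-members flagged as members

    # PPV = TP / (TP + FP); defined as 0 here when no record is flagged.
    ppv = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Advantage = true positive rate - false positive rate.
    advantage = tp / is_member.sum() - fp / (~is_member).sum()
    return float(ppv), float(advantage)

# Toy evaluation set with a test-to-train ratio of 0.5:
# 4 members (training records) and 2 non-members (test records).
is_member = [1, 1, 1, 1, 0, 0]

# A trivial attack that flags every record as a member.
ppv, advantage = ppv_and_advantage(is_member, [1, 1, 1, 1, 1, 1])
print(ppv, advantage)
```

With a test-to-train ratio of 0.5 there are twice as many members as non-members, so an attack that flags every record reaches a PPV of about 0.667 with zero advantage. Under these assumed definitions, that is a useful reference point when reading PPV values near 0.67 in the tables.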

**Training and Test Accuracy overview**  
For NN models without differential privacy

|           | No privacy |
|-----------|------------|
| Train acc | 0.83162    |
| Test acc  | 0.74172    |

For NN models with Gaussian differential privacy (GDP) ('eps' stands for epsilon, the privacy budget)

|           | eps=0.1 | eps=1.0 | eps=10.0 | eps=100.0 |
|-----------|---------|---------|----------|-----------|
| Train acc | 0.68624 | 0.74272 | 0.76174  | 0.77200   |
| Test acc  | 0.68576 | 0.73864 | 0.75652  | 0.75576   |

For NN models with Renyi differential privacy (RDP) ('eps' stands for epsilon, the privacy budget)

|           | eps=0.1 | eps=1.0 | eps=10.0 | eps=100.0 |
|-----------|---------|---------|----------|-----------|
| Train acc | 0.66876 | 0.74158 | 0.75350  | 0.77472   |
| Test acc  | 0.66596 | 0.73672 | 0.74584  | 0.75744   |

**Attack results for NN with no privacy:**

|           | Yeom           | Shokri         | Merlin         | Morgan         |
|-----------|----------------|----------------|----------------|----------------|
| PPV       | 0.6952 &plusmn; 0.0039 | 0.4084 &plusmn; 0.3335 | 0.6895 &plusmn; 0.0036 | 0.7284 &plusmn; 0.0106 |
| Advantage | 0.0894 &plusmn; 0.0060 | 0.0152 &plusmn; 0.0159 | 0.0488 &plusmn; 0.0063 | 0.0113 &plusmn; 0.0014 |

**Attack results for NN with GDP:**  
For epsilon=0.1:

|           | Yeom            | Shokri         | Merlin         | Morgan         |
|-----------|-----------------|----------------|----------------|----------------|
| PPV       | 0.6625 &plusmn; 0.0045  | 0.5353 &plusmn; 0.2677 | 0.6725 &plusmn; 0.0109 | 0.6960 &plusmn; 0.0275 |
| Advantage | -0.0029 &plusmn; 0.0039 | 0.0022 &plusmn; 0.0050 | 0.0003 &plusmn; 0.0076 | 0.0010 &plusmn; 0.0006 |

For epsilon=1.0:

|           | Yeom            | Shokri         | Merlin          | Morgan          |
|-----------|-----------------|----------------|-----------------|-----------------|
| PPV       | 0.6683 &plusmn; 0.0011  | 0.5374 &plusmn; 0.2687 | 0.6689 &plusmn; 0.0331  | 0.6216 &plusmn; 0.0685  |
| Advantage | 0.0057 &plusmn; 0.0040  | 0.0065 &plusmn; 0.0035 | -0.0043 &plusmn; 0.0056 | -0.0003 &plusmn; 0.0020 |

For epsilon=10.0:

|           | Yeom            | Shokri         | Merlin          | Morgan          |
|-----------|-----------------|----------------|-----------------|-----------------|
| PPV       | 0.6686 &plusmn; 0.0010  | 0.4037 &plusmn; 0.3296 | 0.6595 &plusmn; 0.0106  | 0.6581 &plusmn; 0.0159  |
| Advantage | 0.0054 &plusmn; 0.0023  | 0.0061 &plusmn; 0.0052 | -0.0024 &plusmn; 0.0028 | -0.0009 &plusmn; 0.0024 |

For epsilon=100.0:

|           | Yeom            | Shokri         | Merlin         | Morgan         |
|-----------|-----------------|----------------|----------------|----------------|
| PPV       | 0.6708 &plusmn; 0.0013  | 0.6717 &plusmn; 0.0025 | 0.6646 &plusmn; 0.0086 | 0.6660 &plusmn; 0.0281 |
| Advantage | 0.0139 &plusmn; 0.0013  | 0.0090 &plusmn; 0.0026 | 0.0029 &plusmn; 0.0069 | 0.0027 &plusmn; 0.0032 |


**Attack results for NN with RDP:**  
For epsilon=0.1:

|           | Yeom             | Shokri         | Merlin          | Morgan         |
|-----------|------------------|----------------|-----------------|----------------|
| PPV       | 0.6672 &plusmn; 0.0031   | 0.6722 &plusmn; 0.0033 | 0.6832 &plusmn; 0.0368  | 0.7322 &plusmn; 0.1341 |
| Advantage | -0.0007 &plusmn; 0.0035  | 0.0083 &plusmn; 0.0021 | -0.0015 &plusmn; 0.0066 | 0.0006 &plusmn; 0.0009 |

For epsilon=1.0:

|           | Yeom            | Shokri         | Merlin         | Morgan         |
|-----------|-----------------|----------------|----------------|----------------|
| PPV       | 0.6687 &plusmn; 0.0021  | 0.5368 &plusmn; 0.2684 | 0.6620 &plusmn; 0.0158 | 0.6434 &plusmn; 0.0605 |
| Advantage | 0.0015 &plusmn; 0.0017  | 0.0060 &plusmn; 0.0030 | 0.0014 &plusmn; 0.0041 | 0.0005 &plusmn; 0.0023 |

For epsilon=10.0:

|           | Yeom            | Shokri         | Merlin          | Morgan         |
|-----------|-----------------|----------------|-----------------|----------------|
| PPV       | 0.6706 &plusmn; 0.0019  | 0.4027 &plusmn; 0.3288 | 0.6658 &plusmn; 0.0015  | 0.6836 &plusmn; 0.0242 |
| Advantage | 0.0103 &plusmn; 0.0028  | 0.0046 &plusmn; 0.0056 | -0.0029 &plusmn; 0.0045 | 0.0018 &plusmn; 0.0021 |

For epsilon=100.0:

|           | Yeom            | Shokri         | Merlin          | Morgan         |
|-----------|-----------------|----------------|-----------------|----------------|
| PPV       | 0.6723 &plusmn; 0.0004  | 0.6705 &plusmn; 0.0027 | 0.6661 &plusmn; 0.0025  | 0.6799 &plusmn; 0.0126 |
| Advantage | 0.0177 &plusmn; 0.0010  | 0.0095 &plusmn; 0.0031 | -0.0013 &plusmn; 0.0055 | 0.0029 &plusmn; 0.0029 |







## 5 year data results
Percentage of records with > \$50,000 total person's income: 61.5%  
Percentage of records with <= \$50,000 total person's income: 38.5%  


The classifier benchmarks below use the pickle files. The outputs shown were produced with the 1 year data (1685316 records).

First, download the 1 year data:

```
!gdown --id 1oMaHqFJutp0RmAG8pHrmF1k_2H2vAegE
!gdown --id 1aDxPdt8iB4zUc7joWuieSQpRIhXcKPI2
!gdown --id 13UvEGw-I0Ylu5o-9uDLHCU-A3KAL7MiJ
```

Alternatively, download the 5 year data:

```
!gdown --id 1jGw9TnsdC8nXxiCCK46mqNbIRkWOZpHU
!gdown --id 1M7ms22gfdE1W1GecrWIaghWCgKOMY8lI
!gdown --id 1L-X-nplPyISi85W8YuI-m6eB2pdpMeoG
```

Next, split the data into test and train data.

```python
import pickle
import numpy as np
from sklearn.model_selection import train_test_split

x = pickle.load(open('./census_features.p', 'rb'))
y = pickle.load(open('./census_labels.p', 'rb'))
x = np.array(x, dtype=np.float32)
y = np.array(y, dtype=np.int32)

print(x.shape, len(y))

# Random split: about 2/3 for training and 1/3 for testing.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
```

Output:

```
(1685316, 13) 1685316
```


Next, explore various classifier benchmarks

Multinomial Naive Bayes

```python
# https://scikit-learn.org/stable/modules/naive_bayes.html
from sklearn.naive_bayes import MultinomialNB
nbClassifier = MultinomialNB()
nbClassifier.fit(X_train, y_train)

y_pred = nbClassifier.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = nbClassifier.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")
```

Output:

```
Train accuracy: 0.672298281644513
Test accuracy: 0.6727351188068075
```


Gaussian Naive Bayes

```python
# https://scikit-learn.org/stable/modules/naive_bayes.html
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(X_train, y_train)

y_pred = gnb.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = gnb.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")
```

Output:

```
Train accuracy: 0.6788969863464998
Test accuracy: 0.6791362120272226
```


Logistic Regression

```python
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression().fit(X_train, y_train)

y_pred = logistic.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = logistic.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")
```

Output:

```
Train accuracy: 0.7539899093220541
Test accuracy: 0.7535399304150822
```


K-Nearest Neighbors (1) (did not run to completion after 30 minutes)

```python
# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(X_train, y_train)

y_pred = neigh.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = neigh.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")
```

K-Nearest Neighbors (3) (did not run to completion after 30 minutes)

```python
# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)

y_pred = neigh.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = neigh.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")
```

Support Vector Machine (did not run to completion after 30 minutes)

```python
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
svc = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svc.fit(X_train, y_train)

y_pred = svc.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = svc.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")
```

Decision Tree

```python
# https://scikit-learn.org/stable/modules/tree.html
from sklearn import tree
decisionTree = tree.DecisionTreeClassifier()
decisionTree = decisionTree.fit(X_train, y_train)

y_pred = decisionTree.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = decisionTree.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")
```

Output:

```
Train accuracy: 0.9406178569752232
Test accuracy: 0.7165376558693171
```


Random Forest

```python
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth=2, random_state=0)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = rf.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")
```

Output:

```
Train accuracy: 0.7156649937431421
Test accuracy: 0.717889796909135
```


Multi-layer Perceptron

```python
# https://scikit-learn.org/stable/modules/neural_networks_supervised.html
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
mlp.fit(X_train, y_train)

y_pred = mlp.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = mlp.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")
```

Output:

```
Train accuracy: 0.7533983196373236
Test accuracy: 0.7540811464429881
```


Stochastic Gradient Descent Classifier

```python
# https://scikit-learn.org/stable/modules/sgd.html
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(loss="hinge", penalty="l2", max_iter=100)
sgd.fit(X_train, y_train)

y_pred = sgd.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = sgd.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")
```

Output:

```
Train accuracy: 0.7535462170585062
Test accuracy: 0.7541782416772302
```


Ridge Classifier

```python
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier
from sklearn.linear_model import RidgeClassifier
rc = RidgeClassifier().fit(X_train, y_train)

y_pred = rc.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = rc.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")
```

Output:

```
Train accuracy: 0.7508938052235243
Test accuracy: 0.751477555717381
```


Gaussian Process Classifier (crashed due to limited RAM)

```python
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

kernel = 1.0 * RBF(1.0)
gpc = GaussianProcessClassifier(kernel=kernel, random_state=0).fit(X_train, y_train)

y_pred = gpc.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = gpc.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")
```

# Steps to Reproduce Data
The code used to produce the dataset can be found in https://github.com/bargavj/EvaluatingDPML. Below are the steps to reproduce the dataset.
1. Assuming Python 3 is installed and the terminal is at the project directory, install the prerequisites by running `pip3 install -r requirements.txt`.
2. To crawl census data, first create a new folder named `dataset`, then `cd` to the `extra` folder and run `python3 crawl_census_data.py`. The data will be downloaded to the `dataset/census/` folder. By default, the 1 year census data from 2019 will be downloaded. To download the 5 year census data, run `python3 crawl_census_data.py --target_census_data='5year'` instead.
3. Next, to obtain the preprocessed census data, run `python3 preprocess_dataset.py census --preprocess=1`. The preprocessed data will be stored in the `dataset` folder as well.
4. To run a benchmark test of training NN models without performing attacks, go to the `evaluating_dpml` folder and run `python3 main.py census --save_data=1` to prepare the data, then run `python3 main.py census --target_model='nn' --target_l2_ratio=1e-4 --benchmark=1`.
5. To reproduce the benchmark results of the membership inference attacks, go to the `improved_mi` folder and run `./run_experiments.sh census` (note that this step may take over a day to finish). Then run `python3 interpret_results.py census --gamma=0.5 --plot='benchmark'` to obtain the results. To explore benchmark results with other test-to-train ratios, try changing the `--gamma` argument value to 0.1, 1, 2, or 10.