# Empirical Privacy Sketch
Ideally, we can show that even a reasonably strong attack cannot infer the protected values from observable metadata with useful precision.

Protected values:
- number of revoked VCs
- number of valid VCs

Observable values that could serve as metadata inputs for the attack:
- total data size in bytes
- number of filter layers
- data size of first filter in bytes
- number of set bits in first filter
- entropy
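The notebook does not say how the `entropy` feature is computed. As a rough sanity check, the sketch below assumes it is the binary Shannon entropy of the ratio "set bits / first-filter size", taking both numbers exactly as they are logged by the data-generation cell. This is only a guess: it roughly reproduces the logged values but not exactly, so the actual computation in `generateData` may differ. `binary_entropy` and `first_filter_entropy` are hypothetical names, not functions from `generateData`.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli(p) source."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a constant source carries no information
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def first_filter_entropy(set_bits: float, filter_size: float) -> float:
    """Assumed definition: entropy of the fraction of set positions."""
    return binary_entropy(set_bits / filter_size)


# Feature row logged for "Data point 0", in the order of the list above:
# [total size, layers, first-filter size, set bits, entropy]
row = [3098400.0, 9.0, 2756224.0, 1094846.0, 0.9692996851833051]

estimate = first_filter_entropy(set_bits=row[3], filter_size=row[2])
print(f"recomputed: {estimate:.4f}  logged: {row[4]:.4f}")
```

If the assumption holds, the entropy feature is a deterministic function of two other features and adds no independent information for the attacker.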

Generate the data in a separate cell so that the dataset can be reused as long as the filter construction does not change:

In [1]:
import generateData

print("Generating data...")
d = generateData.generate_data(4_000)
print("Done generating data.")

Generating data...
Data point 0
[3098400.0, 9.0, 2756224.0, 1094846.0, 0.9692996851833051]
Data point 100
[6442168.0, 6.0, 6269624.0, 2677241.0, 0.9845743577258368]
Data point 200
[517888.0, 5.0, 482312.0, 169989.0, 0.9361926641308145]
Data point 300
[5859744.0, 5.0, 5734640.0, 2360977.0, 0.9773844519650865]
Data point 400
[9696800.0, 3.0, 9668328.0, 3396079.0, 0.9351846149827274]
Data point 500
[803096.0, 5.0, 711232.0, 230642.0, 0.9089497988156943]
Data point 600
[131536.0, 6.0, 13872.0, 4717.0, 0.9233527636629439]
Data point 700
[3831416.0, 7.0, 3743536.0, 1602510.0, 0.9850178486866383]
Data point 800
[6461832.0, 7.0, 6225272.0, 1797091.0, 0.8670018552794225]
Data point 900
[6113704.0, 10.0, 5706080.0, 1891435.0, 0.9164246693696896]
Data point 1000
[2800696.0, 8.0, 2663936.0, 807589.0, 0.8851062474135105]
Data point 1100
[7866512.0, 10.0, 7530696.0, 2362756.0, 0.8974518783687273]
Data point 1200
[6979088.0, 7.0, 6886784.0, 2524686.0, 0.9480219430062853]
Data point 1300
[6372776.0, 5

Train and evaluate regressors on the data generated above:

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeRegressor


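def test_regressor(regressor, X_train, X_test, y_train, y_test):
    """Cross-validate on the training split, then report hold-out metrics."""
    # 5-fold cross-validation; the default score for regressors is R^2.
    scores = cross_validate(regressor, X_train, y_train, cv=5)
    print("Cross-validation Score Mean:", scores["test_score"].mean())
    # Refit on the whole training split and evaluate on the held-out test split.
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    print("MAE: ", mean_absolute_error(y_test, y_pred))
    # MAPE is reported as a fraction, so 1.0 corresponds to 100 %.
    print("MAPE: ", mean_absolute_percentage_error(y_test, y_pred))
    print("R2: ", regressor.score(X_test, y_test))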


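def build_and_eval_regressor(X, y):
    """Fit three regressors on a 60/40 train/test split and print their metrics."""
    # The split is not seeded, so the reported numbers vary between runs.
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6)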
    print("Random Forest Regressor")
    reg1 = RandomForestRegressor(random_state=1)
    test_regressor(reg1, X_train, X_test, y_train, y_test)
    print("Linear Regressor")
    reg2 = LinearRegression()
    test_regressor(reg2, X_train, X_test, y_train, y_test)
    print("Decision Tree Regressor")
    reg3 = DecisionTreeRegressor()
    test_regressor(reg3, X_train, X_test, y_train, y_test)


print("Training and evaluating...")
# d[0] holds the feature rows (X), d[1] the regression targets (y).
build_and_eval_regressor(d[0], d[1])
print("Done.")

Training and evaluating...
Random Forest Regressor
Cross-validation Score Mean: 0.5161336219057666
MAE:  133636.47329999993
MAPE:  1.135435479837047
R2:  0.5141068892306577
Linear Regressor
Cross-validation Score Mean: 0.5593904909932272
MAE:  137774.70517053796
MAPE:  1.451016933322735
R2:  0.5504139542205628
Decision Tree Regressor
Cross-validation Score Mean: 0.10092806905309262
MAE:  167209.5284375
MAPE:  1.3644251029152012
R2:  0.17988965343975932
Done.
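
To judge whether these errors point to a weak attack, it can help to compare them with a trivial predictor that always outputs the mean of the training targets. The sketch below computes MAE, MAPE, and R2 for such a baseline in plain Python. The target values are made-up illustration numbers, not data from this notebook, and `mean_baseline_metrics` is a hypothetical helper.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (1.0 == 100 %)."""
    # Small true values inflate this metric because they sit in the denominator.
    return mean(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_baseline_metrics(y_train, y_test):
    """Score a predictor that always guesses the training mean."""
    guess = mean(y_train)
    y_pred = [guess] * len(y_test)
    return {
        "MAE": mae(y_test, y_pred),
        "MAPE": mape(y_test, y_pred),
        "R2": r2(y_test, y_pred),
    }


# Hypothetical targets, e.g. numbers of revoked VCs (illustration only).
y_train = [10_000, 50_000, 200_000, 400_000, 900_000]
y_test = [20_000, 300_000, 700_000]

baseline = mean_baseline_metrics(y_train, y_test)
for name, value in baseline.items():
    print(f"{name}: {value:.4f}")
```

A MAPE above 1, as in the results above, means the average relative error exceeds 100 % of the true value. An R2 near 0.5, however, means the regressor still explains roughly half of the target variance, so the gap to the mean baseline (R2 close to 0) is a more direct indication of how much the metadata leaks.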
