# Empirical Privacy Sketch
Ideally, we can show that even a reasonably good attack cannot infer the protected values with any useful precision.

Protected values:
- number of revoked VCs
- number of valid VCs

Values that could be good metadata inputs:
- total datasize in bytes
- number of filter layers
- datasize of first filter in bytes
- number of set bits in first filter
- entropy
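
As a rough illustration of how the five metadata inputs above could be derived, here is a minimal sketch. It assumes the filter is available as a list of `bytes` objects, one per layer; that representation and the names `extract_features` and `binary_entropy` are hypothetical and not taken from `generateData`. The entropy definition is also a guess: the source does not define it, although the sample rows printed during data generation are close to the binary entropy of the first filter's fill ratio when its size is read as a bit count.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def extract_features(filters):
    """Build the metadata vector from a list of per-layer byte strings (assumed layout)."""
    first = filters[0]
    total_bytes = sum(len(layer) for layer in filters)
    # Count the 1-bits in every byte of the first layer.
    set_bits = sum(bin(byte).count("1") for byte in first)
    fill_ratio = set_bits / (8 * len(first))
    return [
        float(total_bytes),          # total data size in bytes
        float(len(filters)),         # number of filter layers
        float(len(first)),           # size of the first filter in bytes
        float(set_bits),             # number of set bits in the first filter
        binary_entropy(fill_ratio),  # assumed entropy definition
    ]
```

The order of the returned values mirrors the list above, so a vector like this could serve directly as one row of the attack's input matrix.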

Generate the data in a separate cell so that the dataset can be reused whenever the filter construction does not change:

In [3]:
import generateData
import pickle
import time

print("Generating data...")
d = generateData.generate_data(10_000)
print("Done generating data.")
with open(f"training-data-{time.time_ns()}.pkl", "wb") as outp:
    pickle.dump(d, outp, pickle.HIGHEST_PROTOCOL)
print("Done saving data.")

Generating data...
Data point 0
[460120.0, 6.0, 304200.0, 109667.0, 0.9430426058987464]
Data point 100
[668288.0, 5.0, 654984.0, 247429.0, 0.9564204950486079]
Data point 200
[753744.0, 8.0, 621456.0, 193414.0, 0.8945482222836709]
Data point 300
[449512.0, 6.0, 267728.0, 93457.0, 0.9331637104401018]
Data point 400
[450464.0, 6.0, 330440.0, 122463.0, 0.9510813011993982]
Data point 500
[587360.0, 8.0, 263984.0, 98251.0, 0.9522672875743439]
Data point 600
[147000.0, 3.0, 142432.0, 55586.0, 0.9648564066089804]
Data point 700
[500064.0, 7.0, 276408.0, 97665.0, 0.9369411326182266]
Data point 800
[891680.0, 4.0, 876888.0, 349883.0, 0.9703485743752509]
Data point 900
[715128.0, 6.0, 649944.0, 189034.0, 0.8697755156126021]
Data point 1000
[530248.0, 7.0, 249248.0, 108498.0, 0.9878460894138499]
Data point 1100
[352104.0, 9.0, 143768.0, 50296.0, 0.9337870853718151]
Data point 1200
[120288.0, 10.0, 42232.0, 15481.0, 0.947563641526374]
Data point 1300
[438592.0, 6.0, 353200.0, 108310.0, 0.8892072049
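
Because the generation cell writes a new timestamped file on every run, reusing a dataset means picking one of the saved files. A minimal sketch for loading the most recently written one follows; the helper name `load_latest_dataset` is hypothetical, and selecting by modification time is an assumption about what "latest" should mean here.

```python
import glob
import os
import pickle


def load_latest_dataset(pattern="training-data-*.pkl"):
    """Return (path, data) for the most recently modified file matching pattern."""
    paths = glob.glob(pattern)
    if not paths:
        raise FileNotFoundError(f"no dataset matches {pattern!r}")
    latest = max(paths, key=os.path.getmtime)
    with open(latest, "rb") as inp:
        return latest, pickle.load(inp)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.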

Train and evaluate regressors on the previously generated data:

In [8]:
import pickle

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeRegressor


def test_regressor(regressor, X_train, X_test, y_train, y_test):
    """Cross-validate, fit, and report test-set errors; y columns are (revoked, valid)."""
    # For regressors, cross_validate uses the estimator's default score, which is R2.
    scores = cross_validate(regressor, X_train, y_train, cv=5)
    print("Cross-validation Score Mean:", scores["test_score"].mean())
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    print("MAE: ", mean_absolute_error(y_test, y_pred))
    print("MAE (revoked): ", mean_absolute_error(y_test[:,0], y_pred[:,0]))
    print("MAE (valid): ", mean_absolute_error(y_test[:,1], y_pred[:,1]))
    # MAPE is returned as a fraction (1.0 means 100 %), not as a percentage.
    print("MAPE: ", mean_absolute_percentage_error(y_test, y_pred))
    print("MAPE (revoked): ", mean_absolute_percentage_error(y_test[:,0], y_pred[:,0]))
    print("MAPE (valid): ", mean_absolute_percentage_error(y_test[:,1], y_pred[:,1]))
    print("R2: ", regressor.score(X_test, y_test))


def build_and_eval_regressor(X, y):
    # 60/40 train/test split; no random_state is set, so results vary between runs.
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6)
    regressors = [
        ("Random Forest Regressor", RandomForestRegressor(random_state=1)),
        ("Linear Regressor", LinearRegression()),
        ("Decision Tree Regressor", DecisionTreeRegressor()),
    ]
    for name, regressor in regressors:
        print("--------------------------")
        print(name)
        print("--------------------------")
        test_regressor(regressor, X_train, X_test, y_train, y_test)

with open("training-data-varyingSizeFPR-10k.pkl", "rb") as inp:
    d = pickle.load(inp)

print("Training and evaluating...")
build_and_eval_regressor(d[0], d[1])
print("Done.")

Training and evaluating...
--------------------------
Random Forest Regressor
--------------------------
Cross-validation Score Mean: 0.7775020633160044
MAE:  70568.3207662501
MAE (revoked):  3556.9444750000002
MAE (valid):  137579.69705750002
MAPE:  0.9477878269226165
MAPE (revoked):  0.1402500976365758
MAPE (valid):  1.7553255562086578
R2:  0.781082884260595
--------------------------
Linear Regressor
--------------------------
Cross-validation Score Mean: 0.6881115074593677
MAE:  89097.78693858719
MAE (revoked):  6669.561899201507
MAE (valid):  171526.0119779728
MAPE:  2.1437634266500742
MAPE (revoked):  1.4572076330494843
MAPE (valid):  2.8303192202506597
R2:  0.6842959961271748
--------------------------
Decision Tree Regressor
--------------------------
Cross-validation Score Mean: 0.5811757866595408
MAE:  92743.88225
MAE (revoked):  5015.2945
MAE (valid):  180472.47
MAPE:  1.0455097343101287
MAPE (revoked):  0.19961290205548887
MAPE (valid):  1.89140656656477
R2:  0.574099439028
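
To judge whether these errors indicate leakage, it helps to compare them with an attacker who ignores the metadata entirely. The sketch below computes the per-target MAE of a predictor that always outputs the training mean. The function name `mean_baseline_mae` is hypothetical, and the column order (revoked, valid) is assumed from the labels printed by `test_regressor`.

```python
from statistics import fmean


def mean_baseline_mae(y_train, y_test):
    """Per-target MAE of a predictor that always returns the training-set mean."""
    n_targets = len(y_train[0])
    # Mean of each target column over the training rows.
    means = [fmean(row[j] for row in y_train) for j in range(n_targets)]
    # Average absolute deviation of the test rows from those means.
    return [
        fmean(abs(row[j] - means[j]) for row in y_test)
        for j in range(n_targets)
    ]
```

If a regressor's "MAE (revoked)" or "MAE (valid)" is not clearly below the corresponding baseline value, the metadata gave the attack little usable information about that target. The function accepts nested lists as well as the row-iterable arrays returned by `train_test_split`.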