# Empirical Privacy Sketch
Ideally, we can show that even a reasonably strong inference attack cannot estimate the protected values with useful precision.

Protected values:
- number of revoked VCs
- number of valid VCs

Observable metadata that could serve as attack inputs:
- total datasize in bytes
- number of filter layers
- datasize of first filter in bytes
- number of set bits in first filter
- entropy
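
The last three features can be derived from the raw filter alone. The following is a minimal sketch, assuming the first filter is available as a `bytes` object and that "entropy" means the binary Shannon entropy of the fraction of set bits. `filter_features` is a hypothetical helper and not part of `generateData`. The sample values logged further down appear numerically consistent with this entropy definition if the logged filter size is counted in bits, but the notebook does not state how the value is computed.

```python
import math


def filter_features(filter_bytes):
    """Return (size in bytes, number of set bits, binary entropy) for a raw filter.

    Assumption: entropy is H(p) = -p*log2(p) - (1-p)*log2(1-p),
    where p is the fraction of bits that are set.
    """
    n_bits = len(filter_bytes) * 8
    set_bits = sum(bin(byte).count("1") for byte in filter_bytes)
    p = set_bits / n_bits if n_bits else 0.0
    if p in (0.0, 1.0):
        entropy = 0.0  # an all-zero or all-one filter carries no per-bit uncertainty
    else:
        entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return len(filter_bytes), set_bits, entropy
```

For example, `filter_features(b"\x0f\xf0")` describes a two-byte filter in which exactly half of the 16 bits are set, so the entropy reaches its maximum of 1.0.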

Generate the data in a separate cell so that the dataset can be reused whenever the filter construction does not change:

In [1]:
import generateData
import pickle
import time

print("Generating data...")
d = generateData.generate_data(10_000)
print("Done generating data.")
with open(f"training-data-{time.time_ns()}.pkl", "wb") as outp:
    pickle.dump(d, outp, pickle.HIGHEST_PROTOCOL)
print("Done saving data.")

Generating data...
Data point 0
[979336.0, 3.0, 971128.0, 362516.0, 0.9531491903876095]
Data point 100
[831032.0, 5.0, 814392.0, 339899.0, 0.9801906789357169]
Data point 200
[545960.0, 3.0, 506304.0, 216607.0, 0.9848919303685655]
Data point 300
[375528.0, 3.0, 369392.0, 126036.0, 0.9259048661913027]
Data point 400
[1375208.0, 3.0, 1368192.0, 470871.0, 0.9287249900996528]
Data point 500
[1092976.0, 3.0, 1081600.0, 378726.0, 0.9341865640182512]
Data point 600
[736648.0, 3.0, 735096.0, 248939.0, 0.9234908589220723]
Data point 700
[1144856.0, 3.0, 1132384.0, 447389.0, 0.9679896515820658]
Data point 800
[282520.0, 4.0, 265336.0, 104314.0, 0.9667361682003808]
Data point 900
[208440.0, 4.0, 204400.0, 84195.0, 0.9774277351539489]
Data point 1000
[630840.0, 3.0, 623680.0, 227127.0, 0.9460620785623917]
Data point 1100
[661848.0, 4.0, 631672.0, 265304.0, 0.9814353739634925]
Data point 1200
[1169248.0, 4.0, 1132120.0, 495227.0, 0.9886662390003487]
Data point 1300
[1122584.0, 4.0, 1113256.0, 399109
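
Because the generation cell writes a new timestamped file on every run, reusing a dataset means finding a saved file again. One way to do that is sketched below; `load_latest_dataset` is a hypothetical helper, and picking the newest file by modification time is an assumption about the intended workflow rather than something the notebook does.

```python
import pickle
from pathlib import Path


def load_latest_dataset(directory=".", pattern="training-data-*.pkl"):
    """Load the most recently modified pickle matching `pattern` (hypothetical helper)."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        raise FileNotFoundError(f"no file matching {pattern!r} in {directory!r}")
    # Choose by modification time rather than by name, so renamed files still work.
    newest = max(candidates, key=lambda p: p.stat().st_mtime)
    with open(newest, "rb") as inp:
        return newest, pickle.load(inp)
```

A call such as `path, d = load_latest_dataset()` would then make `d[0]` and `d[1]` available to later cells. Since `pickle.load` can execute arbitrary code, this should only be pointed at files you generated yourself.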

Train and evaluate regressors on a previously generated dataset:

In [2]:
import pickle

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeRegressor


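def test_regressor(regressor, X_train, X_test, y_train, y_test):
    """Cross-validate on the training split, then fit and report held-out MAE, MAPE, and R2."""
    # With no `scoring` argument, cross_validate uses the estimator's default scorer (R2 for regressors).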
    scores = cross_validate(regressor, X_train, y_train, cv=5)
    print("Cross-validation Score Mean:", scores["test_score"].mean())
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    print("MAE: ", mean_absolute_error(y_test, y_pred))
    print("MAPE: ", mean_absolute_percentage_error(y_test, y_pred))
    print("R2: ", regressor.score(X_test, y_test))


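def build_and_eval_regressor(X, y):
    """Train and evaluate three regressors on the same 60/40 train/test split."""
    # The split is not seeded, so the reported scores vary from run to run.
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6)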
    print("Random Forest Regressor")
    reg1 = RandomForestRegressor(random_state=1)
    test_regressor(reg1, X_train, X_test, y_train, y_test)
    print("Linear Regressor")
    reg2 = LinearRegression()
    test_regressor(reg2, X_train, X_test, y_train, y_test)
    print("Decision Tree Regressor")
    reg3 = DecisionTreeRegressor()
    test_regressor(reg3, X_train, X_test, y_train, y_test)

# Load a previously saved dataset: d[0] is passed as the features X, d[1] as the targets y.
with open("training-data-varyingSize-10k.pkl", "rb") as inp:
    d = pickle.load(inp)

print("Training and evaluating...")
build_and_eval_regressor(d[0], d[1])
print("Done.")

Training and evaluating...
Random Forest Regressor
Cross-validation Score Mean: 0.66060798739812
MAE:  90296.7500537499
MAPE:  1.1281708816452312
R2:  0.7032674322445638
Linear Regressor
Cross-validation Score Mean: 0.836452059131633
MAE:  62666.940743234605
MAPE:  0.6163280645793884
R2:  0.8434691058265088
Decision Tree Regressor
Cross-validation Score Mean: 0.41553206212907134
MAE:  113939.76025
MAPE:  1.095053699763456
R2:  0.474192623270798
Done.
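
When reading these numbers, note that scikit-learn reports MAPE as a fraction, not a percentage, so a value of 1.13 corresponds to an average relative error of about 113%. The pure-Python sketch below mirrors the two error metrics on toy values to show how they are computed and how a single small target value can push MAPE above 1 while MAE stays moderate. The function names `mae` and `mape` and all numbers are illustrative only, and whether small targets explain the results above has not been checked here.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred, eps=2.220446049250313e-16):
    """Mean absolute percentage error as a fraction (1.0 means 100%).

    Each error is divided by max(|y_true|, eps), so small targets weigh heavily.
    """
    return sum(
        abs(t - p) / max(abs(t), eps) for t, p in zip(y_true, y_pred)
    ) / len(y_true)


# Toy values: the same absolute error of 50 is a 500% miss on the small target.
print(mae([10, 1000], [60, 1000]))   # (50 + 0) / 2
print(mape([10, 1000], [60, 1000]))  # (50/10 + 0/1000) / 2
```

For an attacker, the R2 score describes how much of the variance in the target is explained, whereas MAE and MAPE describe how far off an individual estimate is likely to be, so the two should be read together when judging how precise the inference is.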
