# Empirical Privacy Sketch
Ideally, we can show that even a reasonably strong inference attack cannot recover the protected values from observable metadata with useful precision.
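
One way to make "cannot infer" measurable is to compare an attack model's R² on held-out data against a trivial baseline that always predicts the training mean. The sketch below is an illustration only: the helper name `attack_vs_baseline`, the use of `DummyRegressor`, and the synthetic data are assumptions, not part of this notebook's pipeline. An attack whose R² stays near the baseline (at or below 0) has learned essentially nothing about the target.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def attack_vs_baseline(X, y, seed=0):
    """Return (attack R^2, baseline R^2) measured on a held-out 20% split.

    Hypothetical helper: it mirrors the random-forest attack used in this
    notebook and adds a predict-the-mean baseline for comparison.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.8, random_state=seed
    )
    attack = RandomForestRegressor(random_state=seed).fit(X_train, y_train)
    baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
    return attack.score(X_test, y_test), baseline.score(X_test, y_test)


# Synthetic stand-in data (NOT the notebook's data): the features are
# generated independently of the target, so no attack should succeed.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = rng.normal(size=500)

attack_r2, baseline_r2 = attack_vs_baseline(X_demo, y_demo)
print(f"attack R2={attack_r2:.3f}, baseline R2={baseline_r2:.3f}")
```

An R² close to 1 would instead mean the metadata leaks the protected value almost completely.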

Protected values:
- number of revoked VCs
- number of valid VCs

Observable metadata that could serve as attack inputs:
- total data size in bytes
- number of filter layers
- data size of the first filter in bytes
- number of set bits in the first filter
- entropy
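
As a rough illustration of how such a feature vector might be assembled, the sketch below derives the five values from a list of filters given as `bytes` objects. Everything here is an assumption made for the example: the function names `binary_entropy` and `extract_features`, the representation of filters as `bytes`, and in particular the reading of "entropy" as the binary Shannon entropy of the first filter's set-bit fraction. The actual `generateData` module may compute these differently.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def extract_features(filter_layers):
    """Build a five-element feature vector from a non-empty list of filters.

    Hypothetical helper: each filter is assumed to be a bytes object.
    """
    first = filter_layers[0]
    set_bits = sum(bin(byte).count("1") for byte in first)
    total_bits = len(first) * 8
    fill_ratio = set_bits / total_bits if total_bits else 0.0
    return [
        float(sum(len(layer) for layer in filter_layers)),  # total data size in bytes
        float(len(filter_layers)),                          # number of filter layers
        float(len(first)),                                  # first filter size in bytes
        float(set_bits),                                    # set bits in first filter
        binary_entropy(fill_ratio),                         # entropy (assumed definition)
    ]


# Toy input: a 4-byte first layer with half its bits set, plus a 2-byte layer.
layers = [bytes([0b10101010] * 4), bytes([0b00000001] * 2)]
features = extract_features(layers)
print(features)  # [6.0, 2.0, 4.0, 16.0, 1.0]
```

The ordering matches the list above, so each position in the vector corresponds to one listed input.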


In [1]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import generateData


def build_and_eval_regressor(X, y):
    """Train a random forest on 80% of the data and return R^2 on the held-out 20%."""
    # Fix the split seed as well as the model seed so the score is reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.8, random_state=1
    )
    regressor = RandomForestRegressor(random_state=1)
    regressor.fit(X_train, y_train)
    return regressor.score(X_test, y_test)


print("Generating data...")
# d[0] is passed as the feature matrix X, d[1] as the target y to predict.
d = generateData.generate_data(10_000)
print("Training and evaluating...")
r_two_score = build_and_eval_regressor(d[0], d[1])
print('Done. R2 Score', r_two_score)

Generating data...
Data point 10
[40192.0, 3.0, 37008.0, 16610.0, 0.9921979410796613]
Data point 0
[474128.0, 3.0, 470264.0, 211498.0, 0.9926820740283162]
Data point 20
[389744.0, 3.0, 387768.0, 174488.0, 0.9927472566498068]
Data point 30
[364992.0, 3.0, 362936.0, 163087.0, 0.9925631711937974]
Data point 40
[593344.0, 4.0, 589856.0, 265051.0, 0.9925703432285395]
Data point 50
[13752.0, 3.0, 13184.0, 5892.0, 0.9911729940327305]
Data point 60
[505160.0, 3.0, 502768.0, 226520.0, 0.9929151634545441]
Data point 70
[701384.0, 4.0, 695624.0, 312949.0, 0.9927283547600467]
Data point 90
[104192.0, 3.0, 102160.0, 45978.0, 0.9927097375551366]
Data point 80
[1025848.0, 3.0, 1022048.0, 459686.0, 0.9926993693628177]
Data point 100
[497472.0, 3.0, 492760.0, 221494.0, 0.9926109299258883]
Data point 110
[95552.0, 3.0, 93800.0, 42236.0, 0.9927658152981493]
Data point 120
[23968.0, 3.0, 22344.0, 10053.0, 0.99237367237704]
Data point 140
[594760.0, 3.0, 592656.0, 266477.0, 0.9926533037885932]
Data point 1