# Empirical Privacy Sketch
Ideally, we can show that even a reasonably strong attack cannot infer the protected values with useful precision.

Protected values:
- number of revoked VCs
- number of valid VCs

Observable metadata that could serve as attack inputs (features):
- total datasize in bytes
- number of filter layers
- datasize of first filter in bytes
- number of set bits in first filter
- entropy
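
As an illustration of how such a feature vector could be assembled, here is a minimal sketch. It assumes a filter cascade is available as a list of raw `bytes` objects, one per layer, and it assumes "entropy" means the binary entropy of the first filter's set-bit fraction. Both are assumptions for illustration: `extract_features` and `binary_entropy` are hypothetical helpers, not part of `generateData`, and the notebook's actual entropy definition may differ.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli(p) source."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def extract_features(layers: list[bytes]) -> list[float]:
    """Build the five metadata features from a list of filter layers.

    Assumption: each layer is the raw byte content of one filter.
    """
    first = layers[0]
    set_bits = sum(bin(byte).count("1") for byte in first)
    total_bits = 8 * len(first)
    density = set_bits / total_bits if total_bits else 0.0
    return [
        float(sum(len(layer) for layer in layers)),  # total data size in bytes
        float(len(layers)),                          # number of filter layers
        float(len(first)),                           # size of first filter in bytes
        float(set_bits),                             # set bits in first filter
        binary_entropy(density),                     # assumed entropy definition
    ]


# Tiny two-layer example: the first layer has 8 of its 16 bits set.
example = [bytes([0b10101010, 0b11110000]), bytes([0b00000001])]
features = extract_features(example)
print(features)
```

The feature order follows the list above, so each printed vector can be read position by position.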

Generate the data in a separate cell so that the dataset can be reused whenever the filter construction does not change:

In [3]:
import generateData

print("Generating data...")
d = generateData.generate_data(10_000)
print("Done generating data.")

Generating data...
Data point 10
[508688.0, 4.0, 502928.0, 226208.0, 0.9926945745681195]
Data point 0
[1101096.0, 3.0, 1094696.0, 492239.0, 0.9926674866496299]
Data point 20
[800104.0, 3.0, 796784.0, 358008.0, 0.9925645463032424]
Data point 30
[1099168.0, 3.0, 1094200.0, 492413.0, 0.9927729194148185]
Data point 40
[266864.0, 4.0, 263952.0, 118693.0, 0.9926486826941883]
Data point 50
[980752.0, 5.0, 972416.0, 437150.0, 0.9926350568329845]
Data point 70
[65744.0, 3.0, 63672.0, 28585.0, 0.9923308457824405]
Data point 60
[1091512.0, 4.0, 1083616.0, 487350.0, 0.9926924732678728]
Data point 80
[869392.0, 3.0, 863960.0, 388571.0, 0.9926938802221095]
Data point 100
[102192.0, 4.0, 98120.0, 44194.0, 0.9928074405289511]
Data point 90
[989000.0, 4.0, 981488.0, 441416.0, 0.9926909194629836]
Data point 110
[245768.0, 4.0, 240696.0, 108225.0, 0.9926330766903284]
Data point 120
[434048.0, 3.0, 428408.0, 192530.0, 0.9925823779881995]
Data point 130
[204552.0, 3.0, 202824.0, 91108.0, 0.9924984549569373

Train and evaluate regression models on the data generated above:

In [20]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeRegressor


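def test_regressor(regressor, X_train, X_test, y_train, y_test):
    """Cross-validate on the training split, then report hold-out metrics."""
    # 5-fold cross-validation; the default score for regressors is R^2.
    scores = cross_validate(regressor, X_train, y_train, cv=5)
    print("Cross-validation Score Mean:", scores["test_score"].mean())
    # Refit on the full training split and evaluate on the held-out test split.
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    print("MAE: ", mean_absolute_error(y_test, y_pred))
    # scikit-learn returns MAPE as a fraction (1.0 means 100%), not a percentage.
    print("MAPE: ", mean_absolute_percentage_error(y_test, y_pred))
    print("R2: ", regressor.score(X_test, y_test))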


def build_and_eval_regressor(X, y):
    """Fit three regressors that predict the protected values y from metadata X."""
    # 60/40 train/test split; no random_state is set, so results vary between runs.
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6)
    print("Random Forest Regressor")
    reg1 = RandomForestRegressor(random_state=1)
    test_regressor(reg1, X_train, X_test, y_train, y_test)
    print("Linear Regressor")
    reg2 = LinearRegression()
    test_regressor(reg2, X_train, X_test, y_train, y_test)
    print("Decision Tree Regressor")
    reg3 = DecisionTreeRegressor()
    test_regressor(reg3, X_train, X_test, y_train, y_test)


print("Training and evaluating...")
build_and_eval_regressor(d[0], d[1])
print("Done.")

Training and evaluating...
Random Forest Regressor
Cross-validation Score Mean: 0.5227571750617626
MAE:  12653.76752499999
MAPE:  9.58125652607974
R2:  0.4831154547047316
Linear Regressor
Cross-validation Score Mean: 0.9536376486547591
MAE:  3062.8817133505445
MAPE:  6.704561104361035
R2:  0.9554730514750853
Decision Tree Regressor
Cross-validation Score Mean: 0.26084049772700985
MAE:  15205.54375
MAPE:  11.394349764461225
R2:  0.22909069062383192
Done.
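
Two points help when reading these numbers. First, scikit-learn's `mean_absolute_percentage_error` returns a fraction, not a percentage, so a MAPE of 6.7 corresponds to an average relative error of roughly 670%. A few small true values can inflate it even when R2 is high, which may explain the linear regressor's result, although that is not verified here. Second, "the attack cannot infer anything" is easier to judge against a trivial baseline that ignores the features. The sketch below illustrates both points on synthetic stand-in data, not on the notebook's dataset; the names `X_demo` and `y_demo` and the noise level are assumptions made for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# MAPE is returned as a fraction: a 10% error on every sample gives 0.1.
mape_even = mean_absolute_percentage_error([100.0, 200.0], [110.0, 180.0])
# One small true value dominates: relative errors of 10.0 and 0.0 average to 5.0.
mape_small = mean_absolute_percentage_error([1.0, 100.0], [11.0, 100.0])
print("MAPE (even errors):", mape_even)
print("MAPE (one small target):", mape_small)

# Synthetic stand-in data (NOT the notebook's dataset): one size-like feature
# and a noisy linear target playing the role of a protected count.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1_000_000, size=(500, 1))
y_demo = 0.1 * X_demo[:, 0] + rng.normal(0, 5_000, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.6, random_state=0
)

# Baseline: always predict the training mean, ignoring all features.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
attack = LinearRegression().fit(X_tr, y_tr)

baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))
attack_mae = mean_absolute_error(y_te, attack.predict(X_te))
print("Baseline MAE:", baseline_mae)
print("Attack MAE:  ", attack_mae)
print("Baseline R2: ", baseline.score(X_te, y_te))
```

If an attack's MAE is not clearly below the baseline's, the metadata gives the attacker little beyond the prior distribution of the protected values. A large gap, as in this synthetic case, indicates leakage.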
