# Empirical Privacy Sketch
Ideally, we can show that even a reasonably good attack cannot infer the protected values with any useful precision.

Protected values:
- number of revoked VCs
- number of valid VCs

Metadata values that could serve as attack inputs:
- total data size in bytes
- number of filter layers
- data size of the first filter in bytes
- number of set bits in first filter
- entropy
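
As a rough illustration of how these five inputs might be derived from a filter cascade, here is a minimal sketch. It assumes the cascade is available as a list of `bytes` objects (one per layer) and that "entropy" means the binary Shannon entropy of the set-bit fraction in the first filter. Both are assumptions, and `extract_features` and `binary_entropy` are hypothetical helpers, not part of `generateData`.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) source."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def extract_features(layers):
    """Return the five metadata features for a cascade.

    `layers` is assumed to be a non-empty list of bytes objects,
    one per filter layer (a hypothetical representation).
    """
    first = layers[0]
    total_size = sum(len(layer) for layer in layers)      # bytes
    set_bits = sum(bin(byte).count("1") for byte in first)
    total_bits = 8 * len(first)
    fill_ratio = set_bits / total_bits if total_bits else 0.0
    return [
        float(total_size),           # total data size in bytes
        float(len(layers)),          # number of filter layers
        float(len(first)),           # size of the first filter in bytes
        float(set_bits),             # set bits in the first filter
        binary_entropy(fill_ratio),  # assumed entropy definition
    ]


# Toy cascade: a half-full first layer and a one-byte second layer.
example = [bytes([0b11110000, 0b00001111]), bytes([0xFF])]
print(extract_features(example))  # → [3.0, 2.0, 2.0, 8.0, 1.0]
```

Under this reading, the entropy is a function of the first filter's size and set-bit count alone, so it may add little information beyond those two inputs.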

Generate the data in a separate cell so that the dataset can be reused whenever the filter construction does not change:

In [1]:
import generateData

print("Generating data...")
d = generateData.generate_data(10_000)
print("Done generating data.")

Generating data...
Data point 0
[157808.0, 3.0, 157184.0, 62228.0, 0.9683989045273371]
Data point 100
[579448.0, 3.0, 576504.0, 229131.0, 0.9694123558891725]
Data point 200
[244432.0, 3.0, 242712.0, 96207.0, 0.9687322139307177]
Data point 300
[468424.0, 3.0, 465792.0, 184940.0, 0.9691623786601333]
Data point 400
[1008816.0, 3.0, 1006256.0, 399269.0, 0.9690246226079609]
Data point 500
[212816.0, 3.0, 212576.0, 84561.0, 0.9695725749786135]
Data point 600
[904144.0, 4.0, 901720.0, 357959.0, 0.9691356927725572]
Data point 700
[1102152.0, 3.0, 1099600.0, 436616.0, 0.9691957125241686]
Data point 800
[355792.0, 3.0, 353800.0, 140825.0, 0.969746920265127]
Data point 900
[813104.0, 3.0, 809880.0, 321361.0, 0.9690294210511929]
Data point 1000
[1294944.0, 3.0, 1291808.0, 512809.0, 0.9691386664855841]
Data point 1100
[832336.0, 3.0, 830880.0, 329593.0, 0.9689565250670249]
Data point 1200
[614000.0, 3.0, 611328.0, 242750.0, 0.9691956287253127]
Data point 1300
[796992.0, 3.0, 794504.0, 315304.0, 0.9

Train and evaluate regression models on the data generated above:

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeRegressor


def test_regressor(regressor, X_train, X_test, y_train, y_test):
    """Cross-validate on the training split, then fit and report test metrics."""
    # Without a `scoring` argument, cross_validate uses the estimator's own
    # score method, which is R^2 for scikit-learn regressors.
    scores = cross_validate(regressor, X_train, y_train, cv=5)
    print("Cross-validation Score Mean:", scores["test_score"].mean())
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    print("MAE: ", mean_absolute_error(y_test, y_pred))
    # MAPE is reported as a fraction, so 1.0 corresponds to 100 %.
    print("MAPE: ", mean_absolute_percentage_error(y_test, y_pred))
    print("R2: ", regressor.score(X_test, y_test))


def build_and_eval_regressor(X, y):
    """Evaluate three regressors on the same 60/40 train/test split."""
    # No random_state is set here, so the split differs between runs.
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6)
    print("Random Forest Regressor")
    reg1 = RandomForestRegressor(random_state=1)
    test_regressor(reg1, X_train, X_test, y_train, y_test)
    print("Linear Regressor")
    reg2 = LinearRegression()
    test_regressor(reg2, X_train, X_test, y_train, y_test)
    print("Decision Tree Regressor")
    reg3 = DecisionTreeRegressor()
    test_regressor(reg3, X_train, X_test, y_train, y_test)


print("Training and evaluating...")
# d[0] is passed as the feature matrix X, d[1] as the target y.
build_and_eval_regressor(d[0], d[1])
print("Done.")

Training and evaluating...
Random Forest Regressor
Cross-validation Score Mean: 0.49286363864050164
MAE:  11932.524649999972
MAPE:  1.1821936643044475
R2:  0.5290423196248111
Linear Regressor
Cross-validation Score Mean: 0.9250918104425349
MAE:  4230.1397392450335
MAPE:  0.5001697180455158
R2:  0.92850899108731
Decision Tree Regressor
Cross-validation Score Mean: 0.182453230323749
MAE:  14038.617250000001
MAPE:  0.8635927259405446
R2:  0.2699644869427583
Done.
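
To judge whether scores like these indicate a meaningful attack, it can help to compare each model against a trivial baseline that always predicts the training mean. The following is a minimal sketch using scikit-learn's `DummyRegressor`. The arrays are synthetic stand-ins invented for illustration, not the notebook's dataset, and the linear relationship between feature and target is an assumption made only so the example runs on its own.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): one size-like feature and a
# target that depends on it linearly, plus noise.
X = rng.uniform(1e5, 1e6, size=(500, 1))
y = 0.05 * X[:, 0] + rng.normal(0.0, 5_000.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0
)

# Baseline: ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
# Attack model: any regressor from the cell above could be used instead.
attack = LinearRegression().fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
attack_mae = mean_absolute_error(y_test, attack.predict(X_test))

print("Baseline MAE:", baseline_mae)
print("Attack MAE:  ", attack_mae)
print("MAE ratio (attack / baseline):", attack_mae / baseline_mae)
```

A ratio close to 1 would suggest the metadata gives the attacker little beyond guessing the average, while a ratio well below 1 would suggest the metadata leaks information about the protected value.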
