# Naive Bayes with differential privacy

We start by importing the required libraries and modules and collecting the data that we need from the [Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult).

In [13]:
import numpy as np
from sklearn.naive_bayes import GaussianNB

import diffprivlib.models as dp

X_train = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=(0, 4, 10, 11, 12), delimiter=",")

y_train = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=14, dtype=str, delimiter=",")

Let's also collect the test data from Adult to test our models once they're trained.

In [14]:
X_test = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                        usecols=(0, 4, 10, 11, 12), delimiter=",", skiprows=1)

y_test = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                        usecols=14, dtype=str, delimiter=",", skiprows=1)
# Must trim trailing period "." from label
y_test = np.array([a[:-1] for a in y_test])

## Naive Bayes with no privacy

To begin, let's first train a regular (non-private) naive Bayes classifier, and test its accuracy.

In [15]:
nonprivate_clf = GaussianNB()
nonprivate_clf.fit(X_train, y_train)

In [16]:
print("Non-private test accuracy: %.2f%%" % 
     (nonprivate_clf.score(X_test, y_test) * 100))

Non-private test accuracy: 79.64%


## Differentially private naive Bayes classification

Using the `GaussianNB` class in diffprivlib's `models` module, we can train a naive Bayes classifier while satisfying differential privacy.

If we don't specify any parameters, the model defaults to `epsilon = 1` and selects the feature bounds from the data. This raises a warning when `.fit()` is first called, because bounds derived from the data leak additional privacy. To ensure no additional privacy loss, we should pass the bounds as an argument and choose them independently of the data (i.e. using domain knowledge).
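To make the point about data-independent bounds concrete, here is a minimal sketch in plain Python. It is not diffprivlib's internal code; the helper `clip_to_bounds` and the example ages are hypothetical. The idea is that the range is fixed in advance from domain knowledge and values are clipped into it, so the range itself reveals nothing about the records.

```python
def clip_to_bounds(values, lower, upper):
    """Clip each value into [lower, upper]; the bounds are fixed in advance."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return [min(max(v, lower), upper) for v in values]


# Hypothetical ages. The bounds (17, 100) come from domain knowledge,
# not from min()/max() of the data, so they leak nothing about it.
ages = [15, 23, 47, 102]
clipped = clip_to_bounds(ages, lower=17, upper=100)
print(clipped)  # [17, 23, 47, 100]
```

Had the bounds been computed as `min(ages)` and `max(ages)`, the two extreme records would be exposed directly, which is the extra leakage the warning refers to.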

In [17]:
dp_clf = dp.GaussianNB(random_state=0)

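Differential privacy introduces randomness, so the test accuracy changes from one fit to the next unless the seed is fixed, as it is here with `random_state=0`. If you remove `random_state` and re-evaluate the next cell, expect the accuracy to vary around the value shown below rather than match it exactly.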

In [18]:
dp_clf.fit(X_train, y_train)

print("Differentially private test accuracy (epsilon=%.2f): %.2f%%" % 
      (dp_clf.epsilon, dp_clf.score(X_test, y_test) * 100))

Differentially private test accuracy (epsilon=1.00): 79.93%




By setting `epsilon=float("inf")` we get an identical model to the non-private naive Bayes classifier.

In [19]:
dp_clf = dp.GaussianNB(epsilon=float("inf"), bounds=(-1e5, 1e5))
dp_clf.fit(X_train, y_train)

print("Agreement between non-private and differentially private (epsilon=inf) classifiers: %.2f%%" % 
     (dp_clf.score(X_test, nonprivate_clf.predict(X_test)) * 100))

Agreement between non-private and differentially private (epsilon=inf) classifiers: 100.00%
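One way to read this result: in a Laplace-style mechanism the noise scale is the sensitivity divided by `epsilon`, so an infinite `epsilon` gives a scale of zero and the exact value is released. The sketch below illustrates this in plain Python under that assumption. The function `laplace_noise` and the numbers are hypothetical, and diffprivlib may handle the infinite case differently internally.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample; a zero scale means no noise at all."""
    if scale == 0.0:
        return 0.0
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


rng = random.Random(0)
sensitivity = 1.0                    # hypothetical query sensitivity
scale = sensitivity / float("inf")   # evaluates to 0.0
released = 42.0 + laplace_noise(scale, rng)
print(scale, released)               # 0.0 42.0
```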


## Changing `epsilon`

On this occasion, we're going to specify the `bounds` parameter as a tuple of two lists: the lower bounds and the upper bounds of the range in which we expect each feature to lie.

In [20]:
bounds = ([17, 1, 0, 0, 1], [100, 16, 100000, 4500, 100])

We will also specify a value for `epsilon`. High `epsilon` (i.e. greater than 1) gives better and more consistent accuracy, but less privacy. Small `epsilon` (i.e. less than 1) gives better privacy but worse and less consistent accuracy.
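The trade-off can be seen on a much simpler statistic than a full classifier. The following sketch, in plain Python, releases a differentially private mean with the Laplace mechanism and measures the average error at two values of `epsilon`. The functions `laplace_noise`, `private_mean` and `mean_abs_error`, as well as the synthetic ages, are assumptions made for illustration and do not reproduce how diffprivlib trains `GaussianNB`.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Epsilon-DP mean of values clipped to [lower, upper] (n assumed public)."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Replacing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
values = [rng.uniform(17, 100) for _ in range(1000)]  # synthetic ages
true_mean = sum(values) / len(values)


def mean_abs_error(epsilon, trials=2000):
    """Average absolute error of the private mean over repeated releases."""
    errors = [abs(private_mean(values, 17, 100, epsilon, rng) - true_mean)
              for _ in range(trials)]
    return sum(errors) / trials


error_small_eps = mean_abs_error(0.1)
error_large_eps = mean_abs_error(10.0)
print(f"epsilon=0.1:  mean |error| = {error_small_eps:.4f}")
print(f"epsilon=10.0: mean |error| = {error_large_eps:.4f}")
```

Because the noise scale is inversely proportional to `epsilon`, the error at `epsilon=0.1` should come out roughly a hundred times larger than at `epsilon=10.0`.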

In [21]:
dp_clf2 = dp.GaussianNB(epsilon=0.1, bounds=bounds, random_state=0)

dp_clf2.fit(X_train, y_train)

In [22]:
print("Differentially private test accuracy (epsilon=%.2f): %.2f%%" % 
     (dp_clf2.epsilon, dp_clf2.score(X_test, y_test) * 100))

Differentially private test accuracy (epsilon=0.10): 78.42%



---

### **Assessment of the results**

| Model version                     | Epsilon | Accuracy (%) |
| --------------------------------- | ------- | ------------ |
| **Non-private GaussianNB**        | n/a     | 79.64        |
| **DP GaussianNB (epsilon = 1.0)** | 1.0     | 79.93        |
| **DP GaussianNB (epsilon = 0.1)** | 0.1     | 78.42        |

#### What is good here?

* **Very small drop in accuracy** after applying differential privacy. This suggests that **Gaussian naive Bayes is robust** to the added noise on this dataset.
* Even with a **low epsilon (0.1)**, which means strong privacy, the accuracy stays **above 78%**, which is a strong result.

#### What does this mean?

* The DP version with `epsilon=1.0` even performs slightly **better** than the non-private version (79.93% versus 79.64%). This is most likely chance caused by the noise (see the explanation below).
* With a **lower epsilon (0.1)** there is a slight drop: about 1.5 percentage points relative to `epsilon=1.0`, and about 1.2 relative to the non-private model. The performance is still good, which means the model **remains usable with strong privacy protection**.
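A likely reason for this robustness is the size of the dataset: the noise added to an average shrinks as the number of records grows. The sketch below computes the Laplace noise scale `(upper - lower) / (n * epsilon)` for a bounded mean of the age feature, using the age bounds 17 to 100 and `epsilon = 0.1`. The function name is hypothetical, the training set size of 32,561 records is an assumption, and the calculation is a simplification that ignores how diffprivlib splits the privacy budget across features, means and variances.

```python
def laplace_scale_for_mean(lower, upper, n_records, epsilon):
    """Laplace scale for an epsilon-DP mean of n values bounded in [lower, upper]."""
    if epsilon <= 0 or n_records <= 0:
        raise ValueError("epsilon and n_records must be positive")
    return (upper - lower) / (n_records * epsilon)


EPSILON = 0.1
# 32561 is the assumed size of the Adult training set; the smaller
# values are hypothetical dataset sizes shown for comparison.
scales = {n: laplace_scale_for_mean(17, 100, n, EPSILON)
          for n in (100, 1000, 32561)}
for n, scale in scales.items():
    print(f"n={n:>6}: noise scale on mean age = {scale:.4f} years")
```

With only 100 records the noise scale is several years, while at the size of the Adult training set it is roughly 0.025 years, which fits the small accuracy drop seen in the table.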

---

### Explanation: "If you re-evaluate this cell, the test accuracy will change..."

This means:

> Every time you call `.fit()` on a DP model, **random noise is added** to guarantee privacy. As a result, the outcome (the model parameters) **changes slightly each time**, unless the seed is fixed with `random_state`. That is why the **accuracy varies** slightly from run to run.

For example:

```python
# No random_state here, so each fit draws fresh noise
dp_clf = dp.GaussianNB()
dp_clf.fit(X_train, y_train)
dp_clf.score(X_test, y_test)
```

→ If you run this **twice**, you might get 79.93% and then 80.12%, and so on.
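The role of the seed can be shown in isolation with a small plain-Python sketch. The function `noisy_count` is hypothetical and only mimics the behaviour of a seeded versus unseeded noise source; it is not how diffprivlib implements `random_state`.

```python
import random


def noisy_count(true_count, epsilon, seed=None):
    """Return true_count plus Laplace(1/epsilon) noise; seed fixes the draw."""
    rng = random.Random(seed)  # seed=None: seeded from OS entropy or the clock
    # Difference of two exponentials with rate epsilon is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


seeded_a = noisy_count(100, epsilon=1.0, seed=0)
seeded_b = noisy_count(100, epsilon=1.0, seed=0)
unseeded_a = noisy_count(100, epsilon=1.0)
unseeded_b = noisy_count(100, epsilon=1.0)

print(seeded_a == seeded_b)      # True: same seed, same noise
print(unseeded_a == unseeded_b)  # almost surely False
```

A fixed seed is convenient for reproducible demonstrations, but it should not be used when the released values are meant to protect real data, since predictable noise offers no protection.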

---

### Explanation: "Changing epsilon"

The **`epsilon`** parameter determines the degree of privacy protection. This means:

* **Larger epsilon (e.g. `1.0`, `10.0`)**:

  * Less privacy
  * Less noise → **better accuracy**
* **Smaller epsilon (e.g. `0.1`, `0.01`)**:

  * More privacy
  * More noise → **lower accuracy** (and less consistent results)

```python
dp_clf = dp.GaussianNB(epsilon=0.1)  # High privacy
dp_clf = dp.GaussianNB(epsilon=10.0) # Low privacy, better performance
```

---

### Summary

* **Your results are good**: even with a low `epsilon`, the model remains strong.
* **Naive Bayes with DP** is a strong choice for simple datasets such as the Adult dataset.
* The choice of `epsilon` is a **balance between privacy and performance**.

