## Adversarial Example Detection

### Attribute-Steered Model
#### Attribute witness extraction
* Intersection of __attribute substitution__ and __attribute preservation__.
* New model is created by:
  * __Neuron weakening:__ weaken the non-witness neurons
  * __Neuron strengthening:__ strengthen the witness neurons

* Detection (??)
  * what does false positive on benign input mean?
  * incorrect classification to a class, is it classification with trigger?

### Neural network invariant checking
* Allow correct behaviors and forbid malicious behaviors (e.g., assert certain _behaviors_)
* __Value invariants:__ possible value distribution for each neuron
* __Provenance invariants:__ possible delta between the values [_pattern_] of two layers of neurons
* Examples (??)
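
As one possible example of a value invariant, the sketch below records the range each neuron takes on benign inputs and flags any input whose activations leave that range. Using a plain min/max range as a stand-in for the "value distribution" is an assumption made for illustration, and `fit_value_invariants`, `violates_invariants`, and the activation values are hypothetical.

```python
def fit_value_invariants(benign_activations):
    """Record the [min, max] range seen for each neuron on benign inputs."""
    n_neurons = len(benign_activations[0])
    lows = [min(row[i] for row in benign_activations) for i in range(n_neurons)]
    highs = [max(row[i] for row in benign_activations) for i in range(n_neurons)]
    return lows, highs


def violates_invariants(activation, invariants, tolerance=0.0):
    """Return True if any neuron value falls outside its benign range."""
    lows, highs = invariants
    return any(
        a < lo - tolerance or a > hi + tolerance
        for a, lo, hi in zip(activation, lows, highs)
    )


# Made-up activations of three neurons on three benign inputs.
benign = [[0.1, 0.9, 0.3], [0.2, 0.8, 0.4], [0.15, 0.85, 0.35]]
invariants = fit_value_invariants(benign)

print(violates_invariants([0.12, 0.88, 0.33], invariants))  # all neurons in range
print(violates_invariants([0.12, 2.50, 0.33], invariants))  # second neuron out of range
```

A provenance invariant could be sketched the same way by fitting ranges on the difference between the activations of two layers rather than on the activations themselves.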

## Adversarial Sample Defenses
### Gradient Masking
* Hide or destroy the gradient (so that all gradient based attacks fail)
* Defense techniques:
  * __Distillation defense:__ changes the scale of the last hidden layer
  * __Input preprocessing:__ transforms input images by resizing, cropping, or discretizing pixels
  * __Defense GAN:__ uses a GAN to transform perturbed images into clean images

#### Evading gradient masking
* Approximate gradients
* Hiding or breaking gradients makes the loss surface zig-zaggy. During the backward pass, replace any function whose gradient is hard to use with a surrogate that has a well-behaved gradient
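
The gradient-replacement idea above (often called BPDA, Backward Pass Differentiable Approximation) can be sketched on a one-dimensional toy problem. The quantizing preprocessor, the squared-error loss, and the identity surrogate are all assumptions chosen for illustration. The true derivative of `quantize` is zero almost everywhere, so a plain gradient attack would not move, while the surrogate gradient still makes progress.

```python
def quantize(x, levels=8):
    """Non-differentiable preprocessing: its true derivative is 0 almost everywhere."""
    return round(x * levels) / levels


def loss(z, target):
    """Squared distance between the preprocessed input and the attacker's target."""
    return (z - target) ** 2


def bpda_gradient(x, target):
    """Forward pass uses quantize; backward pass pretends quantize(x) = x."""
    z = quantize(x)              # real forward pass through the defense
    dloss_dz = 2 * (z - target)  # exact gradient of the loss w.r.t. z
    dz_dx = 1.0                  # assumed surrogate derivative (identity)
    return dloss_dz * dz_dx


x, target = 0.1, 0.75
for _ in range(50):
    x -= 0.1 * bpda_gradient(x, target)  # gradient descent with the surrogate

print(quantize(x), loss(quantize(x), target))
```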

### Certified Adversarial Robustness
* The model outputs a prediction together with a certificate that the prediction stays constant within an $\ell_2$ ball around the input
* Randomized smoothing leads to smoother decision boundaries
  * Smooth $f$ into a new classifier $g$ (the “smoothed classifier”) as follows:
$$g(x) = \text{the most probable prediction by } f \text{ of random Gaussian corruptions of } x$$
* Large random perturbations “drown out” small adversarial perturbations (??)
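
The definition of $g$ can be approximated by Monte Carlo sampling: corrupt $x$ with Gaussian noise many times, classify each copy with $f$, and return the majority vote. The sketch below only shows the prediction step and does not compute a certificate. `base_classifier`, `sigma`, and `n_samples` are hypothetical choices.

```python
import random
from collections import Counter


def base_classifier(x):
    """Hypothetical base classifier f: class 1 if the coordinates sum to a positive value."""
    return 1 if sum(x) > 0 else 0


def smoothed_predict(f, x, sigma=0.5, n_samples=1000, seed=0):
    """Monte Carlo estimate of g(x): majority vote of f over Gaussian corruptions of x."""
    rng = random.Random(seed)
    votes = Counter(
        f([xi + rng.gauss(0.0, sigma) for xi in x]) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]


x = [1.0, 1.0]
x_perturbed = [0.9, 1.1]  # a small perturbation of x

print(smoothed_predict(base_classifier, x))
print(smoothed_predict(base_classifier, x_perturbed))
```

Because the noise scale is large relative to the perturbation, both inputs produce almost the same vote distribution, which is the intuition behind "drowning out" small adversarial changes.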

## Confidentiality and Privacy Threats in ML
### Model Inversion
* Extract sensitive inputs by leveraging knowledge about the model structure and some information about an individual or object

#### Type 1
* For example, if one of the features is sensitive and binary, the attacker can infer it by setting it to 0 or 1 and checking which value reproduces the output label $y$
* What if $y$ does not change on changing feature?
* White box or black box? White box: the attacker has information about the model parameters. Black box: the attacker can vary the sensitive feature and observe the output, but cannot see the parameters.
* Formally:
  * infer $x_n$ given $f$, $x_1, x_2, ..., x_{n-1}$ and $y$ (??) where $x_n \in \{v_1, v_2, ..., v_s\}$
  * compute $y_j = f(x_1, x_2, ..., x_{n-1}, v_j)$ for each $j$
  * output $v_j$ that maximizes:
  $$dist(y, y_j) \times P(v_j | x_1, x_2, ..., x_{n-1})$$
  * what is $y$? (??)
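
The enumeration above can be sketched as follows. Here $y$ is taken to be the model output observed for the target individual, and the $dist$ term is implemented as an agreement score that is larger when $y_j$ matches $y$. Both readings are assumptions, as are the toy model `f`, the `agreement` function, and the fixed `prior` table standing in for $P(v_j | x_1, ..., x_{n-1})$.

```python
def invert_feature(f, known, y, candidates, prior, agreement):
    """Return the candidate value v_j with the highest score for the unknown feature."""
    best_value, best_score = None, float("-inf")
    for v in candidates:
        y_j = f(known + [v])  # model output if the hidden feature were v
        score = agreement(y, y_j) * prior[v]
        if score > best_score:
            best_value, best_score = v, score
    return best_value


def f(x):
    """Hypothetical model: predicts 1 when a weighted sum crosses a threshold."""
    return 1 if 0.5 * x[0] + 0.5 * x[1] + 2.0 * x[2] > 1.5 else 0


def agreement(y, y_j):
    """Assumed score: high when the candidate output matches the observed output."""
    return 0.9 if y == y_j else 0.1


prior = {0: 0.7, 1: 0.3}  # assumed prior over the sensitive feature

guess = invert_feature(f, known=[1, 0], y=1, candidates=[0, 1],
                       prior=prior, agreement=agreement)
print(guess)
```

If $y$ does not change with the feature, every candidate gets the same agreement score and the guess falls back to the most likely value under the prior.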

#### Type 2
* Given $f$ and $y$, infer $X$
* Use gradient descent to search for input $X$ which maximizes probability of $y$
* White box or black box?
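
A minimal white-box sketch of this search, assuming the attacker knows the weights of a logistic regression model and that every feature lies in $[0, 1]$. It runs gradient ascent on $\log P(y = 1 | X)$, which is equivalent to gradient descent on the negative log-probability. `WEIGHTS`, `BIAS`, and the step size are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical white-box model: logistic regression with known parameters.
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = -0.5


def prob_y1(x):
    """P(y = 1 | x) under the known model."""
    return sigmoid(sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS)


def invert(x0, steps=200, lr=0.1):
    """Gradient ascent on log P(y = 1 | x), keeping each feature in [0, 1]."""
    x = list(x0)
    for _ in range(steps):
        p = prob_y1(x)
        # d/dx_i log(sigmoid(w.x + b)) = (1 - p) * w_i
        x = [min(1.0, max(0.0, xi + lr * (1.0 - p) * w))
             for xi, w in zip(x, WEIGHTS)]
    return x


start = [0.5, 0.5, 0.5]
recovered = invert(start)
print(prob_y1(start), prob_y1(recovered))
print(recovered)
```

In a black-box setting the analytic gradient would not be available and would have to be estimated from queries, for example by finite differences.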

### Model Extraction
* Learn a close approximation of the model $f$ using as few queries to the model as possible
* For example, a logistic regression model can be reduced to a linear equation in $n+1$ unknowns (the $n$ weights and the bias)
* __Extraction attack:__ learn model architecture or parameters
* __Oracle attack:__ construct a substitute model
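
The logistic regression example can be sketched as an equation-solving attack. It assumes the prediction API returns the confidence score rather than only the label, so that applying the logit turns each query into one linear equation. Querying the origin and the $n$ unit vectors is one convenient choice of $n+1$ queries. `victim` and its secret parameters are hypothetical.

```python
import math


def logit(p):
    """Inverse of the sigmoid: recovers w.x + b from a confidence score."""
    return math.log(p / (1.0 - p))


def extract_logistic_regression(query, n_features):
    """Recover (weights, bias) with n + 1 queries: the origin plus each unit vector."""
    bias = logit(query([0.0] * n_features))
    weights = []
    for i in range(n_features):
        e_i = [1.0 if j == i else 0.0 for j in range(n_features)]
        weights.append(logit(query(e_i)) - bias)  # logit(f(e_i)) = w_i + b
    return weights, bias


# Hypothetical victim model; the attacker only calls `victim`, never reads these.
_SECRET_W, _SECRET_B = [1.5, -2.0, 0.25], 0.7


def victim(x):
    z = sum(w * xi for w, xi in zip(_SECRET_W, x)) + _SECRET_B
    return 1.0 / (1.0 + math.exp(-z))


weights, bias = extract_logistic_regression(victim, n_features=3)
print(weights, bias)
```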

### Membership Inference
* Given an input $x$ and black-box access to the model $f$, determine if $x \in D$, meaning whether $x$ was part of the training data (or distribution (??))
* Privacy concern?
  * If $x$ was used to train a medical model, then determining that $x \in D$ can reveal whether an individual has a health issue

#### Attack stages
1. Development of shadow dataset
  * Goal: develop a dataset $D'$ which closely emulates the original dataset $D$
  * Techniques: statistics-based, query-based, active learning, region-based
2. Generation of attack model training dataset
  * Takes input from shadow dataset $D'$ as $(x', y')$ and outputs a probability vector $p = (p_1, p_2, ..., p_k)$ and a binary label indicating "in" or "out"
  * Partition $D'$ into $D_1, D_2, ..., D_s$
  * $\forall j,$ train shadow model $f_j$ on $D_j$; its outputs are labeled "in" for records in $D_j$ and "out" for records in $D' \setminus D_j$
  * Obtain attack training data, $p = (p_1, p_2, ..., p_k)$ and label "in" or "out"
3. Training and deployment of membership inference attack model
  * Given input of probability vector, return "in" or "out"

## Differential Privacy
* Guarantees:
  * Raw data will not be viewed
  * Output will have distortions

### Confidence Interval
* Range of values in which we are fairly sure (say, $x\%$) the true value lies
$$\overline{X} \pm Z\dfrac{s}{\sqrt{n}}$$
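
A short numeric sketch of the formula, assuming a normal approximation with $Z = 1.96$ (roughly a 95% interval) and a made-up sample.

```python
import math
import statistics


def confidence_interval(sample, z=1.96):
    """Return (lower, upper) bounds for the mean using mean +/- z * s / sqrt(n)."""
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation
    margin = z * s / math.sqrt(len(sample))
    return mean - margin, mean + margin


low, high = confidence_interval([2, 4, 4, 4, 5, 5, 7, 9])
print(low, high)
```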

### Standard DP
* The analyst sends a query to a piece of software called the _DP guard_
* Guard sends the query to the DB or model to retrieve the output
* Guard adds __noise__ to the output (in order to protect the confidentiality of the individual whose data was accessed from DB) and sends back the response to the analyst

### Local DP
* Each user anonymizes their own data and sends it to the aggregator
* Aggregator doesn't have access to the real data
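
One common way to realize local DP for a single yes/no attribute is randomized response. The notes do not name this technique, so it is used here purely as an illustration. Each user reports the true bit with probability $e^\epsilon / (e^\epsilon + 1)$ and the flipped bit otherwise. The aggregator sees only the noisy reports and debiases their average. The population and `epsilon` are made up.

```python
import math
import random


def randomize(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p_truth else 1 - bit


def estimate_fraction(reports, epsilon):
    """Debias the aggregator's view to estimate the true fraction of 1s."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    # E[observed] = (1 - p) + true_fraction * (2p - 1)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000  # true fraction of 1s is 0.3
reports = [randomize(b, epsilon=1.0, rng=rng) for b in true_bits]

print(estimate_fraction(reports, epsilon=1.0))
```

The estimate is close to the true fraction but not exact, and the error grows as `epsilon` shrinks or the number of users falls.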

#### Advantages / disadvantages
* Local DP is less prone to data leaks, as the aggregator does not have access to the real data
* Might be less accurate (?)

### Formalism
* Adding or removing one individual's record $i$ from $D_I$ should barely change the probability of obtaining the result $R$. $A = 1$ is the ideal case, in which both probabilities are identical; the further $A$ is from 1, the more the two results can deviate.
$$\dfrac{P(Q(D_I) = R)}{P(Q(D_{I \pm i}) = R)} \leq A$$

* Setting $A = e^\epsilon$:
$$\dfrac{P(R \mid D_I)}{P(R \mid D_{I \pm i})} \leq e^\epsilon$$

#### Global sensitivity
* $F(D) = X$ is a deterministic, non-privatized function over dataset $D$ which returns $X$, a vector of $k$ real numbers.
* Global sensitivity $\Delta F$ is the largest $L_1$ distance (sum of the absolute per-coordinate differences) between $F(D_1)$ and $F(D_2)$ over all datasets $D_1$ and $D_2$ differing by at most one element:
$$\Delta F = \max_{D_1, D_2} \left\| F(D_1) - F(D_2) \right\|_{1}$$
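
For small domains the definition can be checked by brute force. The sketch below assumes each record is an integer in $[0, 10]$, datasets have exactly three records, and "differing by at most one element" means one record is replaced. All three are assumptions for illustration. It compares a sum query with a counting query.

```python
from itertools import product


def l1_distance(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def global_sensitivity(F, domain, size):
    """Max L1 change in F over all datasets of `size` records and single-record replacements."""
    worst = 0.0
    for d1 in product(domain, repeat=size):
        for i, v in product(range(size), domain):
            d2 = d1[:i] + (v,) + d1[i + 1:]  # neighbor: differs in at most one record
            worst = max(worst, l1_distance(F(d1), F(d2)))
    return worst


domain = range(0, 11)  # assumed: each record is an integer in [0, 10]

sum_sensitivity = global_sensitivity(lambda d: [sum(d)], domain, size=3)
count_sensitivity = global_sensitivity(
    lambda d: [sum(1 for r in d if r > 5)], domain, size=3
)
print(sum_sensitivity, count_sensitivity)
```

The sum query can change by the full width of the value range, while the counting query changes by at most one, which is why counting queries need comparatively little noise.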

#### Noise adding mechanism
* Privatizing by adding noise from Laplace distribution:
$$P(R = x \mid D \text{ is true world}) = \dfrac{\epsilon}{2 \Delta F} \exp\left(-\dfrac{\left| x - F(D) \right| \epsilon}{\Delta F}\right)$$
* Laplace ($\epsilon$-differential privacy), where $\tilde{F}$ is the privatized version of $F$:
$$\tilde{F}(D) = F(D) + \mathrm{Lap}\left(\dfrac{\Delta F}{\epsilon}\right)$$
* Exponential ($\epsilon$-differential privacy)
  * Works with both numeric and categorical data
  * Releases the identity of the element with the maximum noisy score, not the score itself (??)
* Gaussian ($(\epsilon, \delta)$-differential privacy)
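
A minimal sketch of the Laplace mechanism for a counting query, assuming sensitivity 1 and a made-up true count of 42. Laplace noise is sampled as the difference of two exponential draws, which is one standard way to generate it. The last lines check numerically that the output densities for two neighboring counts differ by a factor of at most $e^\epsilon$.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Lap(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise of scale sensitivity / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)


def laplace_pdf(x, center, scale):
    """Density of the mechanism's output when the true answer is `center`."""
    return math.exp(-abs(x - center) / scale) / (2.0 * scale)


rng = random.Random(0)
epsilon, sensitivity = 0.5, 1.0  # a counting query has sensitivity 1
print(laplace_mechanism(42, sensitivity, epsilon, rng))

# Density ratio for neighboring true counts 42 and 43 stays within e^epsilon.
scale = sensitivity / epsilon
ratios = [laplace_pdf(x, 42, scale) / laplace_pdf(x, 43, scale)
          for x in range(30, 56)]
print(max(ratios) <= math.exp(epsilon) + 1e-9)
```

A smaller `epsilon` gives a larger noise scale, so privacy improves while accuracy drops.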