How many secrets do you have?
This is a quick post, ruminating on the difference between two flavors of differential privacy in common use. I'm a fan of one, and the other weirds me out a bit. I think I have started to get a handle on why that is.
Without belaboring the motivation, differential privacy is a property of randomized computations whose inputs are multisets. It says that if you add or remove an element from the input, the distribution over outputs shouldn't change very much. Specifically, we say that a computation
M has (
del)-differential privacy when for any two input datasets
B with symmetric difference one, and for any set
S of outputs,
Pr[M(A) in S] <= exp(eps) * Pr[M(B) in S] + del
The intent is that because the presence or absence of a record has such a small impact on the output distribution, one shouldn't be able to infer much more about your record than one could if you hadn't participated. More generally, participation shouldn't make any set of unpleasant outputs (described by some set
S) substantially more likely, so why not participate?
The original formulation of differential privacy used
del = 0, meaning there was only a multiplicative term, and no additive term. We occasionally call this "pure" differential privacy.
How to set
The extension to non-zero
del happened so that one could analyze mechanisms that were mostly differentially private, but could occasionally screw up and do something unpleasant. Ideally
del should be really small, but "just how small?" wasn't something that people thought very hard about.
One opinion is that
del should be much smaller than one over the number of people in the input dataset, because otherwise the mechanism that randomly releases a record in the dataset would be (
del)-differentially private. Of course, we aren't excited about that mechanism, because it guarantees a privacy violation. Disappointingly, I think this rule-of-thumb stuck mostly because folks thought about this one attack, and for some reason believed that defeating it was sufficient.
I don't think that choosing
del based on the number of people in the dataset is the right approach. I'm going to try and argue this by constructing a mechanism with good (eps, del) guarantees, but which always releases concrete information about all participants.
Aside: I don't like privacy guarantees that depends on the number of people in the dataset; there could be more people than you expected, or fewer. Consider the mechanism that hashes each record in the dataset, and if all hashes are the same publishes the input dataset. Theoretically, it seems very unlikely that this mechanism would ever release the dataset (
del ~ 1/256^|data|, say), but would you want to be the first person to contribute your data?
I'm sure you are all familiar with Shirley Jackson's "The Lottery", a short story about a town that has a tradition of picking a random town member to stone to death, because .. literature or something. We read it in 10th grade, not so much for pleasure as much as a round of "how many triangles are in this picture" with "religious symbolism" in place of "triangles".
Thusly motivated, let's make a differentially private lottery! The idea is that we have a large known population, and their names will be the output of the mechanism. If we just run it, we get a totally random name of someone to stone to death, which is fair but still uncomfortable. To make it more interesting, let's imagine that a few folks get to vote; each voter gets to write down a name of someone they would like to save. Here are two ways we could do that:
Each vote makes a person a factor of
1/exp(eps)less likely to be chosen. If lots of people vote for you, you end up very unlikely to be chosen.
This mechanism has (
0)-differential privacy, or "pure" differential privacy. Any outcome is almost equally likely to happen with and without your vote.
Each vote makes a person ineligible. The "winner" is picked uniformly at random from people with no votes.
This mechanism has (
del)-differential privacy, where
delis roughly one over the population, if the number of voters is small relative to the population. You might say "what a dumb scenario", much like readers said of "The Lottery".
The first mechanism has pure differential privacy, and no one can really tell much about how you participated. The second mechanism has neither property.
Let's imagine before the lottery that you went to each member of the population and said "I'm a voter, and if you give me $1,000 I'll protect you". That's pretty scummy, because of course you can't do that for everyone, but your scumminess isn't the issue here. You pulled in a bunch of money, and have each person in the town independently thinking you will save them. You then write your own name down, because you've been told differential privacy will protect your data (and, because you are scummy).
The second mechanism, 100% of the time, guarantees that someone is going to be pissed at you. We don't learn what data you contributed (or even if you contributed anything), but we always learn something concrete about your data. We always get incontrovertible evidence of some value that your data wasn't. You will probably even have to give the winner their $1,000 back. This isn't a problem that pure differential privacy has.
The problem here is that, even with just one voter in the dataset (you, say), that person can have an arbitrarily large number of secrets. Here the voter has as many secrets as there are people in the population, each of the form "my data isn't actually your name". Perhaps you think that these are lame secrets, but you then enter the murky territory of valuating other people's secrets. How many secrets worth protecting can a person have? I have no clue, nor I suspect do most folks using (
del)-differential privacy folks care about this sort of example, it means that we need better guidance on how to chose
del. At the least (most?), we must pick
del to be vanishingly small with respect to one over the size of the domain of input records. If there are 1,000,000 names you might write down, we need
del to be much smaller than 1/1,000,000. Otherwise, we risk guaranteeing disclosure of one of the secrets of the form "your data was not X". (Intuitively, it feels like
del doesn't need to be much smaller than that, though).
So how large are common domains of input records? If we have a database of just bits, the domain is quite small: two! On the other hand, if we have a database of "strings people typed into Google", it .. well there really isn't a reasonable bound is there? Of course you can make something up, but let's agree that you are just making something up. How many web searches might you want to pretend you've done (e.g. when reviewer #2 is flabbergasted that you didn't)? How many distinct geo-locations might you want to pretend to have visited (e.g. when you've told your boss or spouse you've literally looked everywhere)? How many things don't you want people to know aren't true about your health, your finances, your private life?
You could ask "are the only bad examples about liars?" Maybe. Although, we could change the lottery example to have no secret payments and deals, but instead when the person is named and realizes you could have saved them but didn't, they 100% burn your house down. This is appreciably different from the (
0)-dp mechanism where you can always hug the winner with teary eyes and say "I tried", and then go chill in your comfy house. One mechanism gives you plausible deniability and one absolutely does not.
del)-differential privacy is a bit weirder than I think most people give it credit for. I'm still not comfortable with it, nor with the fact that so many other people seem to be as comfortable as they are.