Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do we include NULL contributors in AID seed? #21

Closed
edongashi opened this issue Feb 10, 2021 · 12 comments
Closed

Do we include NULL contributors in AID seed? #21

edongashi opened this issue Feb 10, 2021 · 12 comments
Labels
question Further information is requested

Comments

@edongashi
Copy link
Member

If an AID contributes with NULL, do we include that AID in the seed material?

@edongashi edongashi added the question Further information is requested label Feb 10, 2021
@cristianberneanu
Copy link
Collaborator

After thinking about it a bit more, I think they should be included in the seed (but they should be ignored during aggregation).

NULL values were ignored during aggregation in the previous system, are ignored in the reference implementation and are also ignored in standard SQL aggregators.

But the seed is a property of the bucket, so any encountered AIDs have to contribute to it. If we find a way to compute the seed while we digest data, then we can drop NULL handling from the aggregators, simplifying stuff a bit more.

@sebastian
Copy link
Collaborator

On the assumption that it's the value contributed by the AID that's null, and not the AID itself, then I think it should be:

  • In count(*) then yes. The null value is counted and hence the AID should also be used in the seed material
  • For aggregates where null-values are ignored the AID does not contribute to the seed for such a value (such as count(price) for a price of null)

@sebastian
Copy link
Collaborator

But the seed is a property of the bucket, so any encountered AIDs have to contribute to it.

Why should any encountered AID contribute to the seed, if that AID doesn't otherwise contribute to the bucket?
I think exactly the "The seed is a property of the bucket" statement rings true, but I would read that as an argument for excluding the AID from the seed in the case where it didn't contribute to the aggregate.

@edongashi
Copy link
Member Author

edongashi commented Feb 10, 2021

Why should any encountered AID contribute to the seed, if that AID doesn't otherwise contribute to the bucket?

diffix_lcf(aid) has no knowledge of what's happening in aggregates.

Hmm, let's consider this scenario:

There are 100 rows of shape (aid, col) in a bucket where AIDs are unique in interval [1...100].
Let's suppose 99 rows have col set to NULL and we want to calculate count(col).
Should we suppress this bucket? diffix_lcf(aid) says no because it has no idea that col is NULL...

@cristianberneanu
Copy link
Collaborator

As much as I would like to use this opportunity to get rid of some code, I still think it should affect the seed.
Even if it has no effect on the aggregate, it still helps the bucket pass LCF.

@sebastian
Copy link
Collaborator

sebastian commented Feb 11, 2021

There are 100 rows of shape (aid, col) in a bucket where AIDs are unique in interval [1...100].
Let's suppose 99 rows have col set to NULL and we want to calculate count(col).
Should we suppress this bucket? diffix_lcf(aid) says no because it has no idea that col is NULL...

It passes the low count filter (and all AIDs contribute to the seed for the low count filter), but we don't produce an aggregate, because we have insufficient data for the aggregate.
We can still output some value such as null, <insufficient data for an aggregate count> (assuming the query was SELECT col, count(...)).

@cristianberneanu
Copy link
Collaborator

The main question is do we need to have the same seed for all aggregators (including LCF) or not.
Previously, the seed was computed separately and per bucket. We don't have this option now, so maybe we don't have to keep the same design, unless it causes a vulnerability.

@yoid2000
Copy link
Contributor

There are a number of different things being discussed here, so I'm a bit confused. The set of questions seem to be:

  1. Should a NULL AID value contribute to the seed?
  2. Should a NULL AID value be counted as a distinct user when counting the number of distinct users?
  3. Do we need to have the same seed for all aggregators (including LCF) or not?

Regarding 1, what would cause an AID to be NULL?

Regarding 2, can we avoid this question by always knowing what the actual AID is?

Regarding 3, this question doesn't arise for Publish AFAIK, and I think it is premature to ask it for the other variants.

In fact, none of this really matters for Publish...

@edongashi
Copy link
Member Author

Should a NULL AID value contribute to the seed?

The question is not for NULL AID but for NULL contribution coming from a (non-null) AID.

In the example below, which AIDs will be used for the AID noise seed for count(col): 1,2,4 or 1,2,3,4?

aid col
1 'a'
2 'b'
3 NULL
4 'c'

@yoid2000
Copy link
Contributor

Ah, I thought you meant the AID itself was NULL, not the aggregate.

My intuition is that the AIDs with NULL contribution to the aggregator should be included in the seed as well as the LCF computation.

And I can't think of an attack that would exploit this.

And I presume it is simpler to just include the AID in all cases (no special cases to deal with NULL).

So let's go with including all AIDs regardless of contribution to the aggregator.

@cristianberneanu
Copy link
Collaborator

And I presume it is simpler to just include the AID in all cases (no special cases to deal with NULL).

Actually, it is easier to exclude NULL contributions (simpler to ignore stuff sooner, rather than later).
But code is already written to include them in the extension, so maybe simpler to leave it so? Then we need to update the reference implementation to do the same.

@cristianberneanu
Copy link
Collaborator

Closing this as NULL contributions are already included in the seed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants