
Implement Imperial College London attack #35

Closed
yoid2000 opened this issue Jan 29, 2019 · 23 comments

@yoid2000
Contributor

Researchers from Imperial College London published an attack on Diffix here: https://arxiv.org/abs/1804.06752. This task is to implement that attack. The criterion for the attack is singling out. You can see an example of a singling-out attack at https://github.com/gda-score/code/blob/master/attacks/dumbList_SingOut.py.

In this attack, the attacker finds a set of column values where:

  1. The set of values isolates a user, and
  2. For each subset consisting of all but one column, the set of values has enough users to exceed low count filtering.

For example, consider a set of columns C and values V. Condition 1 says that the true answer to the following query is 1:

SELECT count(DISTINCT uid)
FROM table
WHERE C1 = V1 AND C2 = V2 AND C3 = V3 AND C4 = V4

We refer to the user who is isolated in the above query as the victim.

Condition 2 says that the answers to the following queries must exceed the low count filtering threshold:

SELECT count(DISTINCT uid)
FROM table
WHERE C1 = V1 AND C2 = V2 AND C3 = V3

SELECT count(DISTINCT uid)
FROM table
WHERE C1 = V1 AND C2 = V2 AND C4 = V4

SELECT count(DISTINCT uid)
FROM table
WHERE C1 = V1 AND C3 = V3 AND C4 = V4

SELECT count(DISTINCT uid)
FROM table
WHERE C2 = V2 AND C3 = V3 AND C4 = V4

The above example is for four different column/value attributes, but in general there may be 3 or more such attributes.
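When the attacker has the true data, the two conditions can be checked directly. A minimal Python sketch, assuming the data is available as a list of per-row dicts; all function and parameter names here are illustrative, not taken from the actual attack code:

```python
from itertools import combinations

def find_attackable_sets(rows, columns, uid_col="uid", n_attrs=4, threshold=6):
    """Search raw data for column/value sets meeting conditions 1 and 2.

    rows: list of dicts mapping column name -> value, one dict per
    table row. Illustrative helper, not from the attack code.
    """
    def distinct_uids(conds):
        return {r[uid_col] for r in rows
                if all(r[c] == v for c, v in conds)}

    results = []
    for cols in combinations(columns, n_attrs):
        # candidate value sets: only combinations that actually occur
        for values in {tuple(r[c] for c in cols) for r in rows}:
            conds = list(zip(cols, values))
            # Condition 1: the full set isolates exactly one user
            if len(distinct_uids(conds)) != 1:
                continue
            # Condition 2: every all-but-one subset clears the low count filter
            if all(len(distinct_uids([x for x in conds if x != c])) > threshold
                   for c in conds):
                results.append(conds)
    return results
```

This brute-force search is quadratic in rows per candidate set, so it is only a sketch of the logic, not something to run on a large table as-is.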

Once a set of conditions like this is found, then you can make an attack on an additional unknown attribute, call it Cu = Vu. The goal here is to discover if the victim has this attribute or not. If you discover that the victim has this attribute, then the victim is singled out.

The attack, using the above attributes as the example, proceeds as follows.

For each attribute C1/V1 through C4/V4, make the following queries (in this case assuming the C1/V1 attribute):

SELECT count(DISTINCT uid)
FROM table
WHERE C2 = V2 AND C3 = V3 AND C4 = V4 AND Cu = Vu

SELECT count(DISTINCT uid)
FROM table
WHERE C1 <> V1 AND C2 = V2 AND C3 = V3 AND C4 = V4 AND Cu = Vu

The first query Qi is the inclusive query, because it includes the victim in the answer if the victim has attribute Cu = Vu. The second query Qe is the exclusive query, because the condition C1 <> V1 excludes the victim from the answer.

Now if we compute Qi-Qe, then the expected answer is 1 if the victim has the unknown attribute (and therefore is included in the first query) and 0 if the victim does not have the attribute.

Now we can repeat the two queries, but this time negating the unknown attribute:

SELECT count(DISTINCT uid)
FROM table
WHERE C2 = V2 AND C3 = V3 AND C4 = V4 AND Cu <> Vu

SELECT count(DISTINCT uid)
FROM table
WHERE C1 <> V1 AND C2 = V2 AND C3 = V3 AND C4 = V4 AND Cu <> Vu

In this case, the first query Qi includes the victim if the victim does not have the attribute. If we compute 1-Qi+Qe, then the expected answer is 1 if the victim has the unknown attribute and 0 otherwise.

If we generate these two computations for each attribute, then we'll have (in this case) a total of 8 computations. Of course, each of these computations has zero-mean noise, but if we take the average of all the computations, then we'll have higher confidence in the answer.
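The averaging step can be sketched as follows; `noisy_counts` and the helper name are illustrative rather than taken from the attack code:

```python
def estimate_has_attribute(noisy_counts):
    """Combine the per-attribute Qi/Qe answers into one estimate.

    noisy_counts: list of (qi, qe, negated) tuples, one per query pair,
    where `negated` marks the pairs that used Cu <> Vu. Each estimate is
    qi - qe for the direct pairs and 1 - qi + qe for the negated pairs;
    both have expected value 1 if the victim has the unknown attribute
    and 0 otherwise. Illustrative helper, not from the attack code.
    """
    estimates = [(1 - qi + qe) if negated else (qi - qe)
                 for qi, qe, negated in noisy_counts]
    return sum(estimates) / len(estimates)
```

A natural decision rule (the one adopted later in this thread) is to claim the victim has the attribute when the average exceeds 0.5.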

Given the above, to execute the attack, use askKnowledge() to learn the true values of the attributes for 3, 4, and 5 columns. For example:

SELECT C1, C2, C3, C4, count(DISTINCT uid)
FROM table
GROUP BY 1,2,3,4

With this information, search for cases where the above conditions apply. You can use 6 as the threshold for how many users are needed for condition 2. If you find such a case, then launch the attack using askAttack() for all values for some set of unknown columns Cu. Do this for unknown columns Cu that have relatively few distinct values.
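The knowledge queries can be generated mechanically for each column combination. A small sketch (the table and column names are placeholders):

```python
def knowledge_query(table, cols):
    """Build one askKnowledge()-style query for a column combination.
    Illustrative helper; table and column names are placeholders."""
    positions = ", ".join(str(i + 1) for i in range(len(cols)))
    return (f"SELECT {', '.join(cols)}, count(DISTINCT uid)\n"
            f"FROM {table}\n"
            f"GROUP BY {positions}")
```

Running this over all 3-, 4-, and 5-column combinations yields the true counts needed to search for attackable cases.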

@yoid2000
Contributor Author

yoid2000 commented Feb 6, 2019

I've written a short article explaining how to write an attack. It is here: https://www.gda-score.org/quick-guide-to-writing-attacks/

@ku294714
Contributor

ku294714 commented Mar 8, 2019

Hello Sir,

I am not able to connect to anon DB. However, I tried through the web interface and I could connect there.

"cloakBankingAnon": {
    "host": "attack.aircloak.com",
    "port": 8432,
    "dbname": "gda_banking",
    "user": "ankitdixit3004",
    "password": "secret(assigned by you)",
    "type": "aircloak"
}

This gives an authentication failed error.

Error message:

Exception in thread Thread-4:
Traceback (most recent call last):
File "/anaconda3/envs/attacks/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/anaconda3/envs/attacks/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "../common/gdaScore.py", line 1032, in _dbWorker
conn = psycopg2.connect(connStr)
File "/anaconda3/envs/attacks/lib/python3.6/site-packages/psycopg2/__init__.py", line 130, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL: Authentication failed!

@yoid2000
Contributor Author

yoid2000 commented Mar 9, 2019 via email

@ku294714
Contributor

ku294714 commented Mar 9, 2019

Yes Sir.

@yoid2000
Contributor Author

yoid2000 commented Mar 26, 2019

@ku294714

I got the following query working. This particular query does not do what we want, but the idea here is to join together queries that look at N-1 columns at a time (with more than 6 distinct uids), and then join that with a query that looks at all N columns but requires exactly 1 distinct uid.

Maybe you could play around with this concept and see what you come up with....

select t3.frequency as frequency,
       t3.acct_district_id as acct_district_id,
       t3.disp_type as disp_type,
       count(*)
from
(
    select t1.frequency as frequency,
           t1.acct_district_id as acct_district_id,
           t1.disp_type as disp_type
    from
        (select frequency, acct_district_id, '1234'::text as disp_type
        from accounts
        group by 1,2
        having count(distinct uid) > 6) t1
    join
        (select frequency, '1234'::text as acct_district_id, disp_type
        from accounts
        group by 1,3
        having count(distinct uid) > 6) t2
    on t1.frequency = t2.frequency
) t3
join 
    (select '1234'::text as frequency, acct_district_id, disp_type
    from accounts
    group by 2,3
    having count(distinct uid) > 6) t4
on t3.frequency = t4.frequency or
   t3.acct_district_id = t4.acct_district_id
group by 1,2,3

@ku294714
Contributor

ku294714 commented Mar 26, 2019 via email

@yoid2000
Contributor Author

@ku294714

Now I have a single query that finds all cases where both conditions (1 and 2) are true. Here is what it looks like for three columns:

select all_three.frequency, all_three.disp_type, all_three.acct_district_id
from (select freq_acct.frequency as frequency,
             freq_acct.acct_district_id as acct_district_id,
             freq_disp.disp_type as disp_type
      from
            (select frequency, acct_district_id
            from accounts
            group by 1,2
            having count(distinct uid) > 6) freq_acct
        join
            (select frequency, disp_type
            from accounts
            group by 1,2
            having count(distinct uid) > 6) freq_disp
        on freq_acct.frequency = freq_disp.frequency
        join 
            (select acct_district_id, disp_type
            from accounts
            group by 1,2
            having count(distinct uid) > 6) acct_disp
        on freq_disp.disp_type = acct_disp.disp_type and
           freq_acct.acct_district_id = acct_disp.acct_district_id
    ) pairs
join
    (select frequency, disp_type, acct_district_id
     from accounts
     group by 1,2,3
     having count(distinct uid) = 1) all_three
on pairs.frequency = all_three.frequency and
   pairs.acct_district_id = all_three.acct_district_id and
   pairs.disp_type = all_three.disp_type

Here, pairs produces a table where all three possible column pairs have distinct uid counts > 6, and all_three produces a table where the three column values together have only one distinct uid.

To make pairs, we join three tables (freq_acct, freq_disp, and acct_disp), each of which selects rows where the pair of columns has more than 6 users.

Though I haven't tried it, I'm sure this can easily be generalized to X columns.
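One way to do that generalization is to generate the SQL mechanically: one (N-1)-column subquery per subset with the > 6 condition, plus the all-N-column subquery with the = 1 condition, joined together. The sketch below joins each subset subquery directly to the all-column subquery instead of chaining the subsets pairwise as above (the result is the same), and uses column names rather than positions in GROUP BY; these choices and all names are my own, not from the thread's code:

```python
from itertools import combinations

def build_isolating_query(table, cols, threshold=6):
    """Generalize the pairs/all_three query to N columns.

    One subquery per (N-1)-column subset demands more than `threshold`
    distinct uids; the full-column subquery demands exactly one distinct
    uid; joining them keeps only value sets meeting both conditions.
    Illustrative sketch; table/column names are placeholders.
    """
    full = (f"(select {', '.join(cols)} from {table} "
            f"group by {', '.join(cols)} "
            f"having count(distinct uid) = 1) full_set")
    joins = []
    for i, subset in enumerate(combinations(cols, len(cols) - 1)):
        sub = (f"(select {', '.join(subset)} from {table} "
               f"group by {', '.join(subset)} "
               f"having count(distinct uid) > {threshold}) s{i}")
        on = " and ".join(f"s{i}.{c} = full_set.{c}" for c in subset)
        joins.append(f"join {sub} on {on}")
    return (f"select {', '.join('full_set.' + c for c in cols)}\n"
            f"from {full}\n" + "\n".join(joins))
```

For the three-column example this emits a query equivalent to the one above: three pair subqueries with the > 6 condition, each joined to the all_three-style subquery with the = 1 condition.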

@ku294714
Contributor

ku294714 commented Mar 28, 2019 via email

@ku294714
Contributor

Hello Sir,

As of today, the progress that I have made is:

  1. Getting the users isolated based on combinations of 3 columns (all combinations included) -- already discussed with you -- no issues in this.

  2. Used the getPublicColValues function to get the values for the unknown columns -- discussed this last time -- now there is no performance issue, and it gets executed for all the other columns except the prior columns.

  3. You asked me to check the approach on the claim measure. For the claim I am doing the following:
    - Computing Qi-Qe (the counts of distinct users returned when including and excluding the attribute, respectively).
    - Computing 1-Qi+Qe with the unknown attribute negated, in a similar fashion.
    - Taking the average of all six values, corresponding to the 3 known attributes; if it is greater than 0.5, I claim the respective Cu and Vu (unknown attribute).

There is only one value per Vu across all 6 queries (3 for the known attributes while including the unknown attribute Cu = Vu, and 3 with the unknown attribute negated). Hence, there is no need to compare probabilities, I guess.
I also checked the corresponding paper, and I think this makes sense.

Please let me know if I need to make any other changes, or whether this approach is fine.

Thanks.

@yoid2000
Contributor Author

yoid2000 commented Apr 16, 2019 via email

@ku294714
Contributor

ku294714 commented Apr 16, 2019 via email

@yoid2000
Contributor Author

yoid2000 commented Apr 17, 2019 via email

@ku294714
Contributor

Sir,

As per our last discussion, you told me to generalize the query for all DBs by using hashing on the 'uid' column to filter the records.

This is the query that I made:

select all_three.frequency, all_three.disp_type, all_three.acct_district_id
from (select freq_acct.frequency as frequency,
             freq_acct.acct_district_id as acct_district_id,
             freq_disp.disp_type as disp_type
      from
            (select frequency, acct_district_id
            from accounts
            where
            --birthdate < '1970-01-01'
            left(md5(uid::TEXT),1) ~ '^[d]'
            group by 1,2
            having count(distinct uid) > 6) freq_acct
        join
            (select frequency, disp_type
            from accounts
            where
            --birthdate < '1970-01-01'
            left(md5(uid::TEXT),1) ~ '^[d]'
            group by 1,2
            having count(distinct uid) > 6) freq_disp
        on freq_acct.frequency = freq_disp.frequency
        join
            (select acct_district_id, disp_type
            from accounts
            where
            --birthdate < '1970-01-01'
            left(md5(uid::TEXT),1) ~ '^[d]'
            group by 1,2
            having count(distinct uid) > 6) acct_disp
        on freq_disp.disp_type = acct_disp.disp_type and
           freq_acct.acct_district_id = acct_disp.acct_district_id
    ) pairs
join
    (select frequency, disp_type, acct_district_id
     from accounts
     where
     --birthdate < '1970-01-01'
     left(md5(uid::TEXT),1) ~ '^[d]'
     group by 1,2,3
     having count(distinct uid) = 1) all_three
on pairs.frequency = all_three.frequency and
   pairs.acct_district_id = all_three.acct_district_id and
   pairs.disp_type = all_three.disp_type

However, previously I was executing this query on the full accounts table and was able to find 3 victim users. With the hash filter it no longer finds any victim users.

My guess is that if we filter records like this, the query won't be able to find any victim users. So I was thinking of doing the following:

  1. If the table has fewer than 10,000 records, use all the records to find the victim.
  2. Otherwise, use the above hash filter to reduce the records and then find the victim user.

We can discuss more on this.

@yoid2000
Contributor Author

Sorry, I made a mistake. The cloak does not implement md5() or any other hashing function.

Instead, the cloak offers a special function sample_users x%, where x is a number in the sequence [..., 0.1, 0.2, 0.5, 1, 2, 5, ...]. This will select the appropriate percentage of random users.

An example of a query using sample_users is:

select count(*)
from accounts
sample_users 1%

You can play with this on the cloak browser interface so that you understand it.

For non-aircloak DBs, however, you'll have to use md5.
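For the non-Aircloak case, the md5 predicate used earlier (left(md5(uid::text),1) ~ '^[d]') can also be mirrored client-side when deciding which uids fall in the sample. A sketch, assuming integer uids so that uid::text and Python's str(uid) agree:

```python
import hashlib

def in_sample(uid, hex_digits="d"):
    """Emulate the Postgres predicate left(md5(uid::text),1) ~ '^[d]'.

    A user is kept iff the first hex digit of md5(uid) is in hex_digits;
    each digit selects roughly 1/16 of users, so widening hex_digits
    (e.g. "d4") grows the sample. Illustrative helper, not attack code.
    """
    return hashlib.md5(str(uid).encode()).hexdigest()[0] in hex_digits
```

Because md5 is deterministic, the same users are selected on every run, which keeps the victim search reproducible, unlike a random sample.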

@ku294714
Contributor

Hello Sir,

I have uploaded the imperial_attack.py file and results files (except census_persons table).
As we discussed, the results are restricted to a subset of victim users and a few columns, just so that you can have a look at them.

Here are the changes I have made in the file:
To find a victim user:

  1. Don't attack certain columns if they have many distinct values (using getPublicColValues).
  2. For a large number of columns, shuffle the combinations list and take the first 100 combinations -- to reduce the number of combinations.
  3. Used md5 hashing to reduce the number of rows (if there are more than 10,000).

To attack the victim user(Attack Phase):

  1. The number of victim users that I get for a few tables is more than 300, so for now I am taking only 4 victim users.
  2. Also, I have reduced the number of columns (Cu) to attack by using getPublicColValues, and apart from that I am only attacking 6 columns for now.

I can increase the number of victim users and also attack all columns once you are satisfied with the subset results.

@ku294714
Contributor

Uploaded the other result files also (census_persons table)

@yoid2000
Contributor Author

The attack should work on the pseudonymization tables, but your scores show that the attack isn't working. Also, some of your no_anon scores show that the attack doesn't work in cases where it should.

Can you look into why this is?

PF

@ku294714
Contributor

Hello sir,
Comments for the above queries:

'The attack should work on the pseudonymization tables, but your scores show that the attack isn't working' --- As I told you yesterday, no victim users are found in these tables, hence there is basically no attack. If no victim users are found, I can skip generating that file and instead print an error message saying so.
'Also, some of your no_anon scores show that the attack doesn't work when they should.' --- I will check on this and get back to you.

@yoid2000
Contributor Author

yoid2000 commented Jun 26, 2019 via email

@yoid2000
Contributor Author

What is the status of this work?

@ku294714
Contributor

Sir, there are a few bugs in my code, and I am resolving them. It works fine for smaller tables, so I am checking the transactions table. Meanwhile, I am trying to install PyCharm on my office system so that I can execute my code on the full table.

@ku294714
Contributor

Hello Sir,

I have uploaded the results of the 'no-anon' and 'diffix' DBs (I had discussed with you the issue of the graphs not working). Can you please check whether they are fine? The k-anon tables are also almost done; I will upload them as soon as possible.

@yoid2000
Contributor Author

yoid2000 commented Mar 2, 2020

No need now. The USENIX paper is published, and the authors themselves show that the attack doesn't work.

@yoid2000 yoid2000 closed this as completed Mar 2, 2020