# INFO 3402 – Class 21: Data re-identification

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)  

In [1]:
import numpy as np
import pandas as pd

We're going to use some fake data made by the excellent folks over at [Mockaroo](https://mockaroo.com/schemas).

In [2]:
full_df = pd.read_csv('mockaroo.csv')
full_df.head()

Unnamed: 0,id,first_name,last_name,email,age,gender,race,ip_address,city,postalcode,state,country,company,department,username,password
0,866-53-0512,Alberta,Hadingham,ahadingham0@yahoo.co.jp,39,Female,Iroquois,207.70.124.222,New York City,10090,New York,United States,Talane,Outdoors,ahadingham0,Dr5ZlsjZxY
1,741-10-1556,Casar,Napthine,cnapthine1@state.tx.us,33,Male,Iroquois,119.86.253.72,Dayton,45470,Ohio,United States,Devpoint,Jewelery,cnapthine1,Fv0v9ptRCIId
2,396-49-8046,Creight,Donneely,cdonneely2@php.net,47,Male,Latin American Indian,178.84.127.82,Houston,77050,Texas,United States,Gabtune,Toys,cdonneely2,Myd5Vi
3,225-31-6513,Alick,Pattingson,apattingson3@ovh.net,34,Male,Mexican,169.161.82.159,Bakersfield,93305,California,United States,Yambee,Beauty,apattingson3,TSqaWPwGIhx
4,591-20-0261,Lorrie,Wiz,lwiz4@wix.com,31,Male,Navajo,89.143.226.220,Troy,48098,Michigan,United States,Leexo,Home,lwiz4,Cmgky6wm3Q


There's a lot of obviously identifying information in here:

* **id**
* **first_name**
* **last_name**
* **email**
* **ip_address**
* **username**

Suppose we wanted to release an "anonymized" dataset containing only gender, city, postalcode, company, and department, similar to the FEVS government survey data from before.

In [3]:
anonymized_df = full_df[['gender','city','postalcode','company','department']]
anonymized_df.head()

Unnamed: 0,gender,city,postalcode,company,department
0,Female,New York City,10090,Talane,Outdoors
1,Male,Dayton,45470,Devpoint,Jewelery
2,Male,Houston,77050,Gabtune,Toys
3,Male,Bakersfield,93305,Yambee,Beauty
4,Male,Troy,48098,Leexo,Home
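One way to gauge how "anonymous" a release like this is: check how many rows share each combination of the released columns. The sketch below is only illustrative. It assumes a tiny hand-built frame (`toy_anonymized`, with rows copied from outputs shown in this notebook) instead of the full `mockaroo.csv`, and the helper names `smallest_group` and `unique_rows` are hypothetical.

```python
import pandas as pd

# Toy rows copied from outputs shown in this notebook (not the full file).
toy_anonymized = pd.DataFrame(
    [
        ("Female", "New York City", 10090, "Talane", "Outdoors"),
        ("Male", "Dayton", 45470, "Devpoint", "Jewelery"),
        ("Female", "Dayton", 45470, "Linkbridge", "Automotive"),
        ("Female", "Dayton", 45470, "Centidel", "Baby"),
    ],
    columns=["gender", "city", "postalcode", "company", "department"],
)


def smallest_group(df, columns):
    """Size of the smallest group of rows sharing the same values in `columns`."""
    return int(df.groupby(columns).size().min())


def unique_rows(df, columns):
    """Rows whose combination of values in `columns` appears exactly once."""
    return df[~df.duplicated(subset=columns, keep=False)]


location_cols = ["city", "postalcode"]
all_cols = list(toy_anonymized.columns)

print(smallest_group(toy_anonymized, location_cols))
print(len(unique_rows(toy_anonymized, location_cols)))
print(len(unique_rows(toy_anonymized, all_cols)))
```

In the toy frame only the New York City row is unique on location alone, but all four rows become unique once gender, company, and department are included. A smallest group of size 1 means at least one person stands alone in the release.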


But public records (*e.g.*, voter records) also include information such as name, age, city, postalcode, and state.

In [4]:
public_df = full_df[['first_name','last_name','age','city','postalcode','state']]
public_df.head()

Unnamed: 0,first_name,last_name,age,city,postalcode,state
0,Alberta,Hadingham,39,New York City,10090,New York
1,Casar,Napthine,33,Dayton,45470,Ohio
2,Creight,Donneely,47,Houston,77050,Texas
3,Alick,Pattingson,34,Bakersfield,93305,California
4,Lorrie,Wiz,31,Troy,48098,Michigan


Now imagine that you're an engineer or scientist with access to private behavioral data from orders, surveys, log data, *etc.*, including ip_address, email, gender, race, and username alongside city, postalcode, and state.

In [5]:
private_df = full_df[['ip_address','email','gender','race','username','city','postalcode','state']]
private_df.head()

Unnamed: 0,ip_address,email,gender,race,username,city,postalcode,state
0,207.70.124.222,ahadingham0@yahoo.co.jp,Female,Iroquois,ahadingham0,New York City,10090,New York
1,119.86.253.72,cnapthine1@state.tx.us,Male,Iroquois,cnapthine1,Dayton,45470,Ohio
2,178.84.127.82,cdonneely2@php.net,Male,Latin American Indian,cdonneely2,Houston,77050,Texas
3,169.161.82.159,apattingson3@ovh.net,Male,Mexican,apattingson3,Bakersfield,93305,California
4,89.143.226.220,lwiz4@wix.com,Male,Navajo,lwiz4,Troy,48098,Michigan


## Task 1

Using only the data in `anonymized_df` and `public_df`, how many users in `full_df` can you uniquely identify?

In [6]:
anonymized_df.head(2)

Unnamed: 0,gender,city,postalcode,company,department
0,Female,New York City,10090,Talane,Outdoors
1,Male,Dayton,45470,Devpoint,Jewelery


In [7]:
public_df.head(2)

Unnamed: 0,first_name,last_name,age,city,postalcode,state
0,Alberta,Hadingham,39,New York City,10090,New York
1,Casar,Napthine,33,Dayton,45470,Ohio


In [10]:
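# Join the two tables on the location columns they share. City and postalcode
# are not unique keys, so every anonymized row in a postalcode pairs with every
# public record in that same postalcode.
reidentified_df = pd.merge(left = anonymized_df,
                           right = public_df,
                           on = ['city','postalcode'],
                           how = 'inner'
                          )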

reidentified_df.head(10)

Unnamed: 0,gender,city,postalcode,company,department,first_name,last_name,age,state
0,Female,New York City,10090,Talane,Outdoors,Alberta,Hadingham,39,New York
1,Male,Dayton,45470,Devpoint,Jewelery,Casar,Napthine,33,Ohio
2,Male,Dayton,45470,Devpoint,Jewelery,Jonell,Whitcher,35,Ohio
3,Male,Dayton,45470,Devpoint,Jewelery,Magdaia,Mulliner,32,Ohio
4,Female,Dayton,45470,Linkbridge,Automotive,Casar,Napthine,33,Ohio
5,Female,Dayton,45470,Linkbridge,Automotive,Jonell,Whitcher,35,Ohio
6,Female,Dayton,45470,Linkbridge,Automotive,Magdaia,Mulliner,32,Ohio
7,Female,Dayton,45470,Centidel,Baby,Casar,Napthine,33,Ohio
8,Female,Dayton,45470,Centidel,Baby,Jonell,Whitcher,35,Ohio
9,Female,Dayton,45470,Centidel,Baby,Magdaia,Mulliner,32,Ohio
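The Dayton rows in this output show why the raw merge overcounts: three anonymized rows pair with three public records, giving nine candidate matches and no way to tell which is right. One possible way to count only unambiguous matches is to keep key combinations that occur exactly once in each table before joining. This is a minimal sketch rather than the intended solution. The toy frames reuse rows visible in the outputs above, and `unique_matches` is a hypothetical helper name.

```python
import pandas as pd

# Toy rows copied from outputs shown in this notebook (not the full file).
toy_anonymized = pd.DataFrame(
    [
        ("Female", "New York City", 10090, "Talane", "Outdoors"),
        ("Male", "Dayton", 45470, "Devpoint", "Jewelery"),
        ("Female", "Dayton", 45470, "Linkbridge", "Automotive"),
        ("Female", "Dayton", 45470, "Centidel", "Baby"),
    ],
    columns=["gender", "city", "postalcode", "company", "department"],
)
toy_public = pd.DataFrame(
    [
        ("Alberta", "Hadingham", 39, "New York City", 10090, "New York"),
        ("Casar", "Napthine", 33, "Dayton", 45470, "Ohio"),
        ("Jonell", "Whitcher", 35, "Dayton", 45470, "Ohio"),
        ("Magdaia", "Mulliner", 32, "Dayton", 45470, "Ohio"),
    ],
    columns=["first_name", "last_name", "age", "city", "postalcode", "state"],
)


def unique_matches(left, right, keys):
    """Inner-join only rows whose key combination occurs once on each side."""
    left_once = left[~left.duplicated(subset=keys, keep=False)]
    right_once = right[~right.duplicated(subset=keys, keep=False)]
    # validate raises MergeError if the join is not one-to-one after filtering.
    return left_once.merge(right_once, on=keys, how="inner", validate="one_to_one")


keys = ["city", "postalcode"]
all_pairs = pd.merge(toy_anonymized, toy_public, on=keys, how="inner")
matches = unique_matches(toy_anonymized, toy_public, keys)

print(len(all_pairs), len(matches))
print(matches[["first_name", "last_name", "company", "department"]])
```

On the toy frames the plain inner join yields 10 rows, but only the New York City row is an unambiguous match. Adding more shared columns to `keys`, where the two tables have them, would typically shrink the ambiguous groups.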


## Task 2

Using only the data in `private_df` and `public_df`, how many users in `full_df` can you uniquely identify?