## Step-1: Create a Dummy Database

Here, we create a dummy database of 5000 binary values (0s and 1s).

In [1]:
# Import Dependencies
import torch

# Set seed for reproducibility
torch.manual_seed(101)

# Number of values in our Database
num_values = 5000

# Create the Database where '1' represents a value > 0.5 and '0' otherwise
db = torch.rand(num_values)
db = (db > 0.5).int()
db

tensor([0, 0, 0,  ..., 1, 1, 0], dtype=torch.int32)

In [2]:
db.shape

torch.Size([5000])

## Step-2: Generate Parallel Databases

Let us say we have a database of 5000 random values, each either 0 or 1. If we remove one value from this database and keep the remaining 4999 values as a copy, we get a parallel database with 4999 values.

If we repeat this process for each of the 5000 positions, we get 5000 parallel databases with 4999 values each, where the i-th parallel database is missing the value at index i.

In [3]:
# Let's take the first 5 values from the Database
db[0:5]

tensor([0, 0, 0, 1, 0], dtype=torch.int32)

In [4]:
# Let us say we want to remove the value at index 2
remove_idx = 2

# The 5 values above change to the following after removing the
# value at index 2
# db[0:2] => takes values at idx 0 and 1
# db[3:] => takes values from idx 3 till the end
torch.cat((db[0:2], db[3:]))[0:5]

tensor([0, 0, 1, 0, 0], dtype=torch.int32)

We can generalize the above expression.

Given the index "remove_idx" of the value to remove from the Database, the remaining Database is:

**torch.cat((db[0:remove_idx], db[remove_idx+1:]))**

### Function to Create Parallel Databases

In [5]:
# Function to Create Parallel Databases
def get_parallel_db(db, remove_idx):
    return torch.cat((db[0:remove_idx], db[remove_idx+1:]))

In [6]:
# Test the function
# Remove the first value from the database
get_parallel_db(db, 0)[0:5]

tensor([0, 0, 1, 0, 0], dtype=torch.int32)

In [7]:
# Check the Shape of the new Database
get_parallel_db(db, 0).shape

torch.Size([4999])

In [8]:
# Function to get Parallel Databases
def get_parallel_dbs(db):
    # List to contain Parallel Databases
    parallel_dbs = list()
    
    # Iterate over all values in the Database to create new Databases
    for i in range(db.shape[0]):
        parallel_dbs.append(get_parallel_db(db, i))
    
    return parallel_dbs

In [9]:
# Get all Parallel Databases
pdbs = get_parallel_dbs(db)
len(pdbs)

5000

### Cleaning up the code and Writing the Functions in a Class

In [10]:
class createDatabase:
    # Init Function
    def __init__(self, num_values):
        self.num_values = num_values

    # Function to Create a Parallel Database by removing an element
    def get_parallel_db(self, db, remove_idx):
        return torch.cat((db[0:remove_idx], db[remove_idx+1:]))

    # Function to get list of Parallel Databases
    def get_parallel_dbs(self, db):
        # List to contain Parallel Databases
        parallel_dbs = list()
        # Iterate over all values in the Database to create new Databases
        for i in range(db.shape[0]):
            parallel_dbs.append(self.get_parallel_db(db, i))
        return parallel_dbs

    # Function to Create a Database and its Parallel Databases
    def create_db_and_parallels(self):
        # Create a new Database
        db = torch.rand(self.num_values)
        db = (db > 0.5).int()
        # Get all Parallel Databases
        pdbs = self.get_parallel_dbs(db)
        return db, pdbs

In [11]:
# Create a Database along with its Parallel Databases
database = createDatabase(20)
db, pdbs = database.create_db_and_parallels()

In [12]:
# Check the sample database
db

tensor([0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
       dtype=torch.int32)

In [13]:
# Check the sample parallel databases
pdbs

[tensor([1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
        dtype=torch.int32),
 tensor([0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
        dtype=torch.int32),
 tensor([0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
        dtype=torch.int32),
 tensor([0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
        dtype=torch.int32),
 tensor([0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
        dtype=torch.int32),
 tensor([0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
        dtype=torch.int32),
 tensor([0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
        dtype=torch.int32),
 tensor([0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
        dtype=torch.int32),
 tensor([0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
        dtype=torch.int32),
 tensor([0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
        dtype=torch.int32),
 tensor([0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,

## Step-3: Evaluate the Differential Privacy of a Function

We want to query our database and evaluate whether the result of the query leaks "private" information. To do so, we check whether the output of a query changes when we remove someone from the database. If the output changes, the query output depends on the private information of the person who was removed. If the output does not change after removing the person, the query has not revealed any private information about them.

Specifically, we want to measure the maximum amount by which the query output changes when someone is removed. To evaluate how much privacy is leaked, we iterate over each person in the database, remove them, and measure the difference between the query output on the resulting parallel database and the query output on the entire database.
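
The maximum change described above is commonly called the L1 sensitivity of the query. As a sketch, writing $q$ for the query, $D$ for the full database, and $D_{-i}$ for the parallel database with entry $i$ removed (this notation is introduced here for illustration only), it can be written as:

$$
\Delta q = \max_{i} \left| q(D) - q(D_{-i}) \right|
$$

Note that this is the empirical version for one fixed database: the maximum is taken only over single removals from $D$.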

In [14]:
# Function to Query the Database
def query(db):
    return db.sum()

In [15]:
db

tensor([0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
       dtype=torch.int32)

In [16]:
# Query Output on Original Database
query(db)

tensor(12)

In [17]:
# Check the Query on the Parallel Database Values
for i in range(len(pdbs)):
    print("Database: {0}\t Query Output: {1}\n".format(i, query(pdbs[i])))

Database: 0	 Query Output: 12

Database: 1	 Query Output: 11

Database: 2	 Query Output: 12

Database: 3	 Query Output: 12

Database: 4	 Query Output: 12

Database: 5	 Query Output: 11

Database: 6	 Query Output: 12

Database: 7	 Query Output: 11

Database: 8	 Query Output: 11

Database: 9	 Query Output: 11

Database: 10	 Query Output: 11

Database: 11	 Query Output: 12

Database: 12	 Query Output: 11

Database: 13	 Query Output: 11

Database: 14	 Query Output: 11

Database: 15	 Query Output: 11

Database: 16	 Query Output: 12

Database: 17	 Query Output: 11

Database: 18	 Query Output: 12

Database: 19	 Query Output: 11



So, what do we see above? When we remove a person from the original database "db", the output of the query can change. This means that the output of the query is conditioned directly on the information of many people in this database: for every person whose value is 1, the query on the parallel database (with that person removed) returns 11 instead of the 12 returned for the original database.

In [18]:
# Let's see how much the query output for each parallel database deviates
# from that of the original database

# Query Output for Original Database
full_db_result = query(db)

# Variable to store the max_deviation
sensitivity = 0

# Query Output for Parallel Databases
for pdb in pdbs:
    pdb_result = query(pdb)
    
    # Find deviation in outputs using L1 Norm
    db_distance = torch.abs(pdb_result - full_db_result)
    
    # Get the maximum deviation/distance
    if (db_distance > sensitivity):
        sensitivity = db_distance

In [19]:
# Maximum Deviation/Sensitivity/L1 Sensitivity
# Sensitivity: The max amount that the query changes when removing an individual from the database
sensitivity

tensor(1)

We get a very important result from above. Since our database contains binary values, removing any one value can change the sum by at most 1 - 0 = 1. This is confirmed by the value of "sensitivity" above. This also holds true if we have 5000 values instead of just 20.
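
This claim can be checked without relying on a particular database size. The following is a minimal sketch in plain Python (lists instead of tensors, so it runs without PyTorch); the helper names `brute_force_sum_sensitivity` and `sum_sensitivity` are made up for this illustration. Removing entry i changes the sum by exactly that entry's value, so for non-negative values the sensitivity of the sum query equals the largest value in the database.

```python
import random

def brute_force_sum_sensitivity(db):
    """Remove each entry in turn and record the largest change in the sum."""
    full = sum(db)
    return max(abs(full - sum(db[:i] + db[i + 1:])) for i in range(len(db)))

def sum_sensitivity(db):
    """Shortcut: removing entry i changes the sum by exactly db[i],
    so for non-negative values the largest change is max(db)."""
    return max(db)

random.seed(101)
for n in (20, 500):
    db = [random.randint(0, 1) for _ in range(n)]
    # The brute-force loop and the shortcut agree for every size
    assert brute_force_sum_sensitivity(db) == sum_sensitivity(db)
    print(n, sum_sensitivity(db))
```

For a binary database the result is 1 whenever at least one entry is 1, regardless of how many entries there are.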

In [20]:
# Sensitivity Function
def sensitivity(query, n_entries=1000):
    # Create the Database and Initialize all Parallel Databases
    database = createDatabase(n_entries)
    db, pdbs = database.create_db_and_parallels()

    # Sensitivity
    max_distance = 0
    
    # Query Output for Original Database
    full_db_result = query(db)
    
    # Calculate Sensitivity
    for pdb in pdbs:
        pdb_result = query(pdb)
    
        # Find deviation in outputs using L1 Norm
        db_distance = torch.abs(pdb_result - full_db_result)

        # Get the maximum deviation/distance
        if (db_distance > max_distance):
            max_distance = db_distance
    
    return max_distance

In [21]:
# Define the Query
def query(db):
    return db.sum()

In [22]:
# Test the Function
sensitivity(query)

tensor(1)

In [23]:
# Define Another Query
def query(db):
    return db.float().mean()

In [24]:
# Test the Function
sensitivity(query)

tensor(0.0005)
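
The value of roughly 0.0005 for the mean query follows from simple arithmetic. As a sketch (plain Python, with a hypothetical helper `mean_sensitivity` and an assumed binary database holding exactly `s` ones among `n` entries): removing a 1 changes the mean by (n - s) / (n (n - 1)), and removing a 0 changes it by s / (n (n - 1)). With n = 1000 and about half the entries equal to 1, both are close to 0.0005.

```python
def mean_sensitivity(n, s):
    """Largest change in the mean when one entry is removed from a
    binary database with n entries, s of which are 1 (requires n >= 2)."""
    changes = []
    if s > 0:
        # Removing a 1: |s/n - (s-1)/(n-1)| = (n - s) / (n * (n - 1))
        changes.append((n - s) / (n * (n - 1)))
    if s < n:
        # Removing a 0: |s/n - s/(n-1)| = s / (n * (n - 1))
        changes.append(s / (n * (n - 1)))
    return max(changes)

# Cross-check the formula against a direct leave-one-out computation
n, s = 1000, 500
db = [1] * s + [0] * (n - s)
full_mean = sum(db) / n
direct = max(abs(full_mean - sum(db[:i] + db[i + 1:]) / (n - 1)) for i in range(n))
assert abs(direct - mean_sensitivity(n, s)) < 1e-12
print(mean_sensitivity(n, s))
```

The sensitivity of the mean shrinks roughly as 1/n, which is why it is so much smaller than the sensitivity of the sum.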

### L1 Sensitivity for Threshold

In [25]:
# Threshold Sensitivity Query
def query(db, threshold=5):
    return (db.sum() > threshold).float()

In [26]:
# Create 10 random databases of 10 entries each and compute the sensitivity for each
for i in range(10):
    print(sensitivity(query, 10))

0
0
tensor(1.)
0
tensor(1.)
0
tensor(1.)
tensor(1.)
0
0
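
The mix of 0 and 1 in this output has a simple explanation: removing a person can only flip the threshold result when the sum sits exactly one above the threshold, so that removing a 1 drops it to the threshold. A minimal sketch in plain Python (the helpers `threshold_query` and `threshold_sensitivity` are hypothetical and use lists instead of tensors):

```python
def threshold_query(db, threshold=5):
    """Return 1.0 if the sum exceeds the threshold, else 0.0."""
    return float(sum(db) > threshold)

def threshold_sensitivity(db, threshold=5):
    """Largest change in the threshold query when one entry is removed."""
    full = threshold_query(db, threshold)
    return max(
        abs(full - threshold_query(db[:i] + db[i + 1:], threshold))
        for i in range(len(db))
    )

# Sum of 6 is exactly one above the threshold: removing a 1 flips the result
print(threshold_sensitivity([1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))  # 1.0
# Sum of 7 stays above the threshold after any single removal
print(threshold_sensitivity([1, 1, 1, 1, 1, 1, 1, 0, 0, 0]))  # 0.0
# Sum of 5 never exceeds the threshold
print(threshold_sensitivity([1, 1, 1, 1, 1, 0, 0, 0, 0, 0]))  # 0.0
```

Because each run draws a fresh random database of 10 entries, the sum lands on exactly 6 only some of the time, which matches the varying output above.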


## Step-4: Performing a Differencing Attack on the Database

**Aim:** 

Construct a database and then demonstrate how we can use two different sum queries to expose the value of the person represented by row 10 in the database.

This attack works as follows:

1. We start with a base database, say, the original database.
2. We get a parallel database by removing one person from the original database.
3. Then, if we take the difference between the sums of the two databases, we can figure out the value of the missing person.
4. We can also perform the Differencing Attack using the mean as well as the threshold queries.

Example:

1. SELECT count(*) from my_cancer_database;
2. SELECT count(*) from my_cancer_database WHERE person_name!="john doe";

By comparing the results of the above two SQL queries, we can tell whether john doe is in the cancer database, i.e. whether he had cancer or not.
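
To make the two SQL queries concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The single-column table layout, the helper functions `count_all` and `count_without`, and all names other than "john doe" are invented for illustration; the sketch also assumes that every row of `my_cancer_database` is a person with cancer and that names are unique.

```python
import sqlite3

# In-memory database with a hypothetical single-column table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_cancer_database (person_name TEXT)")
conn.executemany(
    "INSERT INTO my_cancer_database (person_name) VALUES (?)",
    [("john doe",), ("jane roe",), ("alex poe",)],
)

def count_all(conn):
    """First query: count every row in the table."""
    return conn.execute("SELECT count(*) FROM my_cancer_database").fetchone()[0]

def count_without(conn, name):
    """Second query: count every row except the targeted person."""
    return conn.execute(
        "SELECT count(*) FROM my_cancer_database WHERE person_name != ?",
        (name,),
    ).fetchone()[0]

# A difference of 1 means the person is in the table; 0 means they are not
print(count_all(conn) - count_without(conn, "john doe"))    # 1
print(count_all(conn) - count_without(conn, "mary major"))  # 0
```

Neither query returns a name on its own, yet the difference between the two counts reveals one person's membership.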

In [27]:
# Create a Database with 100 rows
# Create the Database and Initialize all Parallel Databases
database = createDatabase(100)
db, _ = database.create_db_and_parallels()

In [28]:
# Original Database
db

tensor([0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
        1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1,
        1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
        1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0], dtype=torch.int32)

In [29]:
# Create a Parallel Database by removing the value at index 10
pdb = get_parallel_db(db, remove_idx=10)
pdb

tensor([0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1,
        0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1,
        1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,
        1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0], dtype=torch.int32)

In [30]:
# Original Value
db[10]

tensor(0, dtype=torch.int32)

In [31]:
# Differencing Attack using Sum Query
sum(db) - sum(pdb)

tensor(0, dtype=torch.int32)

In [32]:
# Differencing Attack using Mean Query
(sum(db) / len(db)) - (sum(pdb) / len(pdb))

tensor(0, dtype=torch.int32)

In [33]:
# Differencing Attack using Threshold Query
(sum(db) > 49).float() - (sum(pdb) > 49).float()

tensor(0.)
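
All three differences are 0 here because the removed value db[10] is 0. For the sum query the difference is the removed value itself; for the mean query the value has to be reconstructed from the two means and the two database sizes. A minimal sketch in plain Python (lists and floating-point means; the helper `recover_from_means` and the short 12-entry database are hypothetical):

```python
def recover_from_means(mean_full, n, mean_parallel):
    """Recover the removed value from the mean of the full database (n entries)
    and the mean of the parallel database (n - 1 entries)."""
    # n * mean_full is the full sum; (n - 1) * mean_parallel is the sum without the person
    return round(n * mean_full - (n - 1) * mean_parallel)

db = [0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
for idx in (10, 11):
    pdb = db[:idx] + db[idx + 1:]
    by_sum = sum(db) - sum(pdb)
    by_mean = recover_from_means(sum(db) / len(db), len(db), sum(pdb) / len(pdb))
    # Both attacks recover the removed value exactly
    assert by_sum == by_mean == db[idx]
    print(idx, db[idx], by_sum, by_mean)
```

The threshold query is a weaker attack: its difference is non-zero only when the removed value is 1 and the sum sits exactly one above the threshold.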