## **Project 1: Generate Parallel Databases (Refer Section1: Differential Privacy)**
Key to the definition of differenital privacy is the ability to ask the question "When querying a database, if I removed someone from the database, would the output of the query be any different?". Thus, in order to check this, we must construct what we term "parallel databases" which are simply databases with one entry removed.

In this first project, I want you to create a list of every parallel database to the one currently contained in the "db" variable. Then, I want you to create a function which both:
* creates the initial database (db)
* creates all parallel databases

## Lesson: Towards Evaluating The Differential Privacy of a Function
Intuitively, we want to be able to query our database and evaluate whether or not the result of the query is leaking "private" information. As mentioned previously, this is about evaluating whether the output of a query changes when we remove someone from the database. Specifically, we want to evaluate the maximum amount the query changes when someone is removed (maximum over all possible people who could be removed). So, in order to **evaluate how much privacy is leaked**, we're going to iterate over each person in the database and measure the difference in the output of the query relative to when we query the entire database.

Just for the sake of argument, let's make our first "database query" a simple sum. Aka, we're going to count the number of 1s in the database.

In [2]:
# Database

import torch

num_entries = 5000

db = torch.rand(num_entries) > 0.5

db

# Create a function - To Remove Index
def get_parallel_db(db, remove_index):
    return torch.cat((db[0:remove_index],db[remove_index+1:]))



In [3]:
# Remove no that does not exist 
# it will return the entire database size
get_parallel_db(db,5423).shape

torch.Size([5000])

In [4]:
# Create Function 2 to iterate through db and create parallel db

def get_parallel_dbs(db):
    parallel_dbs = list()
    for i in range(len(db)):
        pdb = get_parallel_db(db,i)
        parallel_dbs.append(pdb)
    return parallel_dbs

pdbs = get_parallel_dbs(db)

In [5]:
# Function 3 to 
def create_db_and_parallels(num_entries):
    db = torch.rand(num_entries) > 0.5
    pdb = get_parallel_dbs(db)
    
    return db, pdbs

In [6]:
db , pdbs = create_db_and_parallels(20)

In [7]:
db

tensor([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
       dtype=torch.uint8)

In [7]:
# Function to Query the database
def query(db):
    return db.sum()


In [8]:
full_db_result = query(db)

In [9]:
query(pdbs[6])

tensor(2471)

In [10]:
# iterate over every paralllel database
max_distance = 0 
for pdb in pdbs:
    pdb_result = query(pdb)
    
    # comapre paralled db and full db
    db_distance = torch.abs(pdb_result - full_db_result)
    
    if(db_distance > max_distance):
        max_distance = db_distance
        

In [11]:
max_distance

tensor(2461)

## Sensitivity
The maximum amount that the query changes when removing an individual from the database

## **Project 2 - Evaluating the Privacy of a Function**
In the last section, we measured the difference between each parallel db's query result and the query result for the entire database and then calculated the max value (which was 1). This value is called "sensitivity", and it corresponds to the function we chose for the query. Namely, the "sum" query will always have a sensitivity of exactly 1. However, we can also calculate sensitivity for other functions as well.

Let's try to calculate sensitivity for the "mean" function.
#### Create a single function called **sensitivity** accepts (query ,n_entires)
 *  Initialize a database of correct size
 *  Initialize all prallel database
 *  Run the Query over all the databases
 *  Correclty calculate sensitity
 *  Return the sensitivity

In [12]:
#function sensitivity
def sensitivity(query, n_entries = 1000):
    #Initialize a database of correct size and all parallel database
    db , pdbs = create_db_and_parallels(n_entries)
    full_db_result = query(db)
    #Run the Query over all the databases
    max_distance = 0 
    for pdb in pdbs:
        pdb_result = query(pdb)
        # comapre paralled db and full db
        db_distance = torch.abs(pdb_result - full_db_result)
        if(db_distance > max_distance):
            max_distance = db_distance
    #Return the sensitivity
    return max_distance
    

In [13]:
# Function to Query the database 
def query(db):
    return db.float().mean()

In [14]:
#n_entries = 1000 Run 1
sensitivity(query)

tensor(0.0079)

In [15]:
#n_entries = 1000 Run 2
sensitivity(query)

tensor(0.0249)

In [16]:
#n_entries = 1000 Run 2
sensitivity(query)

tensor(0.0239)

In [19]:
# n_entries = 100 Run 1
sensitivity(query)

tensor(0.0343)

In [20]:
# n_entries = 100 Run 2
sensitivity(query)

tensor(0.0259)

In [21]:
# n_entries = 100 Run 3
sensitivity(query)

tensor(0.0543)

In [31]:
db, pdbs = create_db_and_parallels(20)

In [32]:
db

tensor([1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1],
       dtype=torch.uint8)

Sensitivity is WAY lower.
**Note** the intuition here. $"Sensitivity"$ is measuring how sensitive the output of the query is to a person being removed from the database. For a simple sum, this is always 1, but for the mean, removing a person is going to change the result of the query by rougly 1 divided by the size of the database (which is much smaller). Thus, "mean" is a VASTLY less "sensitive" function (query) than SUM.


## **Project 3 : Calculate L1 Sensitivity For Threshold**
In this first project, I want you to calculate the sensitivty for the "threshold" function.

* First create a Query() function ,where you compute the sum over the database (i.e. **sum(db)**) and return whether that sum is greater than a certain threshold.
* Create 10 database of size 10, then query each databse threshold of 5 and calculate the sensitivity of the function.
* Finally, re-initialize the database 10 times and calculate the sensitivity each time.
* Print out the sensitivity of each database

In [1]:
# try this project here!

In [35]:
def query(db,threshold=5):
    return (db.sum() > threshold).float()

In [39]:
db, pdbs = create_db_and_parallels(10)
query(db)

tensor(1.)

In [37]:
for i in range(10):
    sensitivity_function = sensitivity(query ,n_entries = 10)
    print(sensitivity_function)

0
tensor(1.)
0
0
tensor(1.)
0
tensor(1.)
0
0
tensor(1.)


## Lesson: A Basic Differencing Attack
Sadly none of the functions we've looked at so far are differentially private (despite them having varying levels of sensitivity). The most basic type of attack can be done as follows.

Let's say we wanted to figure out a specific person's value in the database. All we would have to do is query for the sum of the entire database and then the sum of the entire database without that person!

## **Project 4: Perform a Differencing Attack on Row 10**
In this project, I want you to construct a database and then demonstrate how you can use two different sum queries to explose the value of the person represented by row 10 in the database (note, you'll need to use a database with at least 10 rows)

In [52]:
# Intiakize the database
db, _ = create_db_and_parallels(100)

In [53]:
db

tensor([1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1,
        1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1,
        0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0,
        1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1,
        0, 0, 0, 1], dtype=torch.uint8)

In [54]:
db[10]

tensor(1, dtype=torch.uint8)

In [56]:
# Intialize the parallel database with Row 10 missing
pdbs = get_parallel_db(db,remove_index=10)

In [67]:
sum(db)

tensor(49, dtype=torch.uint8)

In [68]:
sum(pdbs)

tensor(48, dtype=torch.uint8)

In [58]:
# Diffrencing attack using sum query
sum(db) - sum(pdbs)

tensor(1, dtype=torch.uint8)

In [66]:
pdbs.shape

torch.Size([99])

In [60]:
# Diffrencing attack using mean query
(sum(db).float()/len(db)) - (sum(pdbs).float()/len(pdbs))

tensor(0.0052)

In [61]:
# Diffrencing attack using mean query
(db.float().mean() - pdbs.float().mean())

tensor(0.0052)

In [64]:
# Diffrencing attack using threshold
(sum(db).float() > 49) - (sum(pdbs).float()  > 49)


tensor(0, dtype=torch.uint8)