
## **Project: Varying Amounts of Noise**
In this project, **augment the randomized response query from the previous project to allow for varying amounts of randomness to be added**. Specifically, I want you to bias the first coin flip to be higher or lower and then run the same experiment.

Note - this one is a bit trickier than you might expect.
* Add a new parameter to the query function. It will now accept the database and a noise parameter, which is a percentage.
* Properly rebalance the result of the query given this adjustable parameter.
* You need to both adjust the likelihood of the first coin flip AND the de-skewing at the end (where we create the "augmented_result" variable).
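
The de-skewing step follows from the expected value of the noisy mean. As a sketch, assume the noise parameter $p$ is the probability that the first coin flip replaces an answer with a fair second coin flip (the symbols $p$, $\mu$ for the true mean, and $s$ for the skewed mean are introduced here only for this derivation):

$$
\mathbb{E}[s] = (1 - p)\,\mu + p \cdot \tfrac{1}{2}
\quad\Longrightarrow\quad
\mu \approx \frac{s - \tfrac{1}{2}\,p}{1 - p}
$$

With $p = 0.5$ this reduces to $2s - 0.5$, which is the correction used in the previous project.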

In [1]:
# Database

import torch

num_entries = 5000

db = torch.rand(num_entries) > 0.5

db

# Create a function - To Remove Index
def get_parallel_db(db, remove_index):
    return torch.cat((db[0:remove_index],db[remove_index+1:]))

# Create Function 2 to iterate through db and create parallel db

def get_parallel_dbs(db):
    parallel_dbs = list()
    for i in range(len(db)):
        pdb = get_parallel_db(db,i)
        parallel_dbs.append(pdb)
    return parallel_dbs

# Function 3 to create a database and all of its parallel databases
def create_db_and_parallels(num_entries):
    db = torch.rand(num_entries) > 0.5
    pdbs = get_parallel_dbs(db)
    
    return db, pdbs

# Function 4: sensitivity of a query
def sensitivity(query, n_entries = 1000):
    # Initialize a database of the correct size and all of its parallel databases
    db, pdbs = create_db_and_parallels(n_entries)
    full_db_result = query(db)
    # Run the query over all the parallel databases
    max_distance = 0
    for pdb in pdbs:
        pdb_result = query(pdb)
        # Compare the parallel db result with the full db result
        db_distance = torch.abs(pdb_result - full_db_result)
        if db_distance > max_distance:
            max_distance = db_distance
    # Return the sensitivity
    return max_distance
    

In [2]:
true_result = torch.mean(db.float())
true_result


tensor(0.4988)

In [3]:
# original local data (the actual values from people)
db

tensor([0, 0, 1,  ..., 1, 0, 1], dtype=torch.uint8)

In [4]:
db, pdbs = create_db_and_parallels(100)

In [5]:
# Pack all in the query
def query(db):
    true_result = torch.mean(db.float())
    first_coin_flip = (torch.rand(len(db)) > 0.5).float()
    second_coin_flip = (torch.rand(len(db)) > 0.5).float()
    augmented_db = (db.float() * first_coin_flip) + (1 - first_coin_flip) * second_coin_flip
    db_result = torch.mean(augmented_db.float()) * 2 - 0.5 
    
    return db_result, true_result


In [8]:
# Project: add varying amounts of noise

In [21]:
# Pack all in the query
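def query(db, noise):
    # noise is the probability (0 <= noise < 1) that an entry is
    # replaced by a fair coin flip, e.g. noise = 0.2
    true_result = torch.mean(db.float())  # kept for reference; not returned
    # 1 = keep the honest answer (probability 1 - noise), 0 = randomize it
    first_coin_flip = (torch.rand(len(db)) > noise).float()
    second_coin_flip = (torch.rand(len(db)) > 0.5).float()

    augmented_db = (db.float() * first_coin_flip) + (1 - first_coin_flip) * second_coin_flip

    # skewed mean: roughly (1 - noise) * true mean + noise * 0.5
    sk_result = augmented_db.float().mean()
    # de-skew; same as ((sk_result / noise) - 0.5) * noise / (1 - noise),
    # but without dividing by zero when noise = 0
    private_results = (sk_result - 0.5 * noise) / (1 - noise)

    # returns (skewed mean, de-skewed estimate), in that order
    return sk_result, private_results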
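

To compare the private estimate against the true mean directly, both values have to be returned together. The following is a minimal, dependency-free sketch in plain Python (using `random` instead of `torch` so it runs in isolation); the name `randomized_response_mean` and its `rng` argument are hypothetical and not part of the notebook.

```python
import random

def randomized_response_mean(db, noise, rng):
    """Hypothetical helper: return (de-skewed estimate, true mean) for 0/1 answers."""
    if not 0.0 <= noise < 1.0:
        raise ValueError("noise must be in [0, 1)")
    augmented = []
    for answer in db:
        if rng.random() < noise:
            # With probability `noise`, replace the answer with a fair coin flip.
            augmented.append(1 if rng.random() < 0.5 else 0)
        else:
            # Otherwise report the honest answer.
            augmented.append(answer)
    sk_result = sum(augmented) / len(augmented)
    # Undo the skew: E[sk_result] = (1 - noise) * true_mean + noise * 0.5
    private_result = (sk_result - 0.5 * noise) / (1 - noise)
    true_result = sum(db) / len(db)
    return private_result, true_result

rng = random.Random(0)  # fixed seed so the run is repeatable
db = [1 if rng.random() > 0.5 else 0 for _ in range(10000)]
private_result, true_result = randomized_response_mean(db, noise=0.2, rng=rng)
print("de-skewed estimate:", private_result)
print("true mean:", true_result)
```

With 10,000 entries and `noise=0.2`, the two printed values should land close to each other, though the exact figures depend on the seed.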


In [6]:
db, pdbs = create_db_and_parallels(10)
# Uses the original one-argument query from In [5]
private_results, true_results = query(db)
print("with noise:" + str(private_results))
print("without noise:" + str(true_results))

with noise:tensor(0.7000)
without noise:tensor(0.5000)


In [18]:
private_results = ((sk_result / noise) - 0.5) * noise / (1 - noise)

In [19]:
sk_result

tensor(0.6000)

In [20]:
private_results 

tensor(0.6250)

In [22]:
db, pdbs = create_db_and_parallels(10)
# query returns (skewed mean, de-skewed estimate), in that order
skewed_result, private_result = query(db, noise=0.2)
print("skewed mean:" + str(skewed_result))
print("de-skewed estimate:" + str(private_result))

skewed mean:tensor(0.3000)
de-skewed estimate:tensor(0.2500)


In [23]:
db, pdbs = create_db_and_parallels(100)
# query returns (skewed mean, de-skewed estimate), in that order
skewed_result, private_result = query(db, noise=0.1)
print("skewed mean:" + str(skewed_result))
print("de-skewed estimate:" + str(private_result))

skewed mean:tensor(0.5600)
de-skewed estimate:tensor(0.5667)


In [24]:
db, pdbs = create_db_and_parallels(100)
# query returns (skewed mean, de-skewed estimate), in that order
skewed_result, private_result = query(db, noise=0.2)
print("skewed mean:" + str(skewed_result))
print("de-skewed estimate:" + str(private_result))

skewed mean:tensor(0.5700)
de-skewed estimate:tensor(0.5875)


In [25]:
db, pdbs = create_db_and_parallels(100)
# query returns (skewed mean, de-skewed estimate), in that order
skewed_result, private_result = query(db, noise=0.3)
print("skewed mean:" + str(skewed_result))
print("de-skewed estimate:" + str(private_result))

skewed mean:tensor(0.5000)
de-skewed estimate:tensor(0.5000)


In [26]:
db, pdbs = create_db_and_parallels(100)
# query returns (skewed mean, de-skewed estimate), in that order
skewed_result, private_result = query(db, noise=0.4)
print("skewed mean:" + str(skewed_result))
print("de-skewed estimate:" + str(private_result))

skewed mean:tensor(0.5300)
de-skewed estimate:tensor(0.5500)


In [27]:
db, pdbs = create_db_and_parallels(100)
# query returns (skewed mean, de-skewed estimate), in that order
skewed_result, private_result = query(db, noise=0.5)
print("skewed mean:" + str(skewed_result))
print("de-skewed estimate:" + str(private_result))

skewed mean:tensor(0.5900)
de-skewed estimate:tensor(0.6800)


The size of the dataset determines how much noise (that is, how much privacy protection) you can add for the individuals inside the dataset.
The more private data you have access to, the easier it is to protect the privacy of the people involved.

In [28]:
db, pdbs = create_db_and_parallels(1000)
# query returns (skewed mean, de-skewed estimate), in that order
skewed_result, private_result = query(db, noise=0.5)
print("skewed mean:" + str(skewed_result))
print("de-skewed estimate:" + str(private_result))

skewed mean:tensor(0.5170)
de-skewed estimate:tensor(0.5340)


In [29]:
db, pdbs = create_db_and_parallels(10000)
# query returns (skewed mean, de-skewed estimate), in that order
skewed_result, private_result = query(db, noise=0.5)
print("skewed mean:" + str(skewed_result))
print("de-skewed estimate:" + str(private_result))

skewed mean:tensor(0.5038)
de-skewed estimate:tensor(0.5076)


**The larger the dataset, the more noise you can add while still getting an accurate result.**

In [31]:
db, pdbs = create_db_and_parallels(10000)
# query returns (skewed mean, de-skewed estimate), in that order
skewed_result, private_result = query(db, noise=0.8)
print("skewed mean:" + str(skewed_result))
print("de-skewed estimate:" + str(private_result))

skewed mean:tensor(0.5000)
de-skewed estimate:tensor(0.5000)
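
A single run per dataset size can be lucky or unlucky, so the trend is easier to see when the error is averaged over several runs. The sketch below is a dependency-free illustration in plain Python (not the notebook's `torch` code); the helpers `private_mean` and `mean_abs_error`, the choice of 30 trials, and the fixed seed are all assumptions made for this example.

```python
import random

def private_mean(db, noise, rng):
    """Hypothetical helper: de-skewed randomized-response mean of 0/1 answers."""
    total = 0
    for answer in db:
        if rng.random() < noise:
            # Replace the answer with a fair coin flip.
            total += 1 if rng.random() < 0.5 else 0
        else:
            # Keep the honest answer.
            total += answer
    sk_result = total / len(db)
    return (sk_result - 0.5 * noise) / (1 - noise)

def mean_abs_error(n_entries, noise, trials, rng):
    """Average |private estimate - true mean| over fresh random databases."""
    total_error = 0.0
    for _ in range(trials):
        db = [1 if rng.random() > 0.5 else 0 for _ in range(n_entries)]
        true_mean = sum(db) / n_entries
        total_error += abs(private_mean(db, noise, rng) - true_mean)
    return total_error / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
errors = {n: mean_abs_error(n, noise=0.5, trials=30, rng=rng)
          for n in (100, 1000, 10000)}
for n, err in errors.items():
    print(f"n={n:>5}  mean |private - true| = {err:.4f}")
```

The average error should drop as `n` grows, which is the pattern the bolded statement describes.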


The way this works is that differential privacy (DP) acts like a very complex kind of filter: it looks for information that is consistent across multiple different individuals. With perfect DP there would be no information leakage. In theory, we could block out any information that is unique to a participant in the dataset and only let through information that is consistent across multiple different people.

But it still allows us to look for repeating statistical information inside the dataset.
* If we have a small dataset, the odds of finding the same statistical pattern twice are low.
* If we have a large dataset, it becomes easy to learn about patterns, because you have more data points to compare with each other when looking for similar statistical information.
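
The effect of dataset size can be made roughly quantitative. As a sketch, assume each of the $n$ entries is randomized independently with noise level $p$ (the symbols $n$, $p$, $s$ for the skewed mean, and $\hat{\mu}$ for the de-skewed estimate are introduced here for illustration). Each noisy entry is a 0/1 value with variance at most $\tfrac{1}{4}$, so:

$$
\operatorname{Var}(\hat{\mu}) = \frac{\operatorname{Var}(s)}{(1 - p)^2} \le \frac{1}{4\,n\,(1 - p)^2}
\qquad\Longrightarrow\qquad
\operatorname{sd}(\hat{\mu}) \le \frac{1}{2\sqrt{n}\,(1 - p)}
$$

The error therefore shrinks like $1/\sqrt{n}$ as the dataset grows and blows up as $p$ approaches 1. For example, with $n = 10000$ and $p = 0.5$ the bound on the standard deviation is $0.01$.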