## **Local Differential Privacy**

In local differential privacy, each individual in a collection adds noise to their own data before sending it to the statistical database. 
Everything that reaches the database is therefore already noised, so the protection happens at the local level.

**How much noise to add?** 
Well, it depends. Before that, remember that differential privacy always requires some form of randomness or noise added to the query to protect against things like a "differencing attack".
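To make the differencing attack concrete, here is a minimal sketch using plain Python lists. The names `db`, `target_index`, and `sum_query` are illustrative assumptions, not part of the original notebook. It shows how an attacker who can run an exact sum query with and without one person recovers that person's value.

```python
# Hypothetical toy database: one 0/1 value per person.
db = [0, 1, 1, 0, 1]

def sum_query(entries):
    """Exact (non-private) sum query."""
    return sum(entries)

# The attacker queries the full database and a copy with one person removed.
target_index = 2
db_without_target = db[:target_index] + db[target_index + 1:]

# The difference between the two exact answers is the target's private value.
leaked_value = sum_query(db) - sum_query(db_without_target)
print(leaked_value)
```

Because the query is exact, nothing hides the removed person's contribution; adding randomness is what breaks this subtraction.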

## **Project: Local Differential Privacy**
As you can see, the basic sum query is not differentially private at all! In truth, differential privacy always requires a form of randomness added to the query. Let me show you what I mean.

#### **Randomized Response (Local Differential Privacy)**
"Randomized Response is a technique used in social science to learn about the high-level trends for taboo behavior."

Let's say I have a group of people I wish to survey about a very taboo behavior which I think they will lie about (say, I want to know if they have ever committed a certain kind of crime). I'm not a policeman, I'm just trying to collect statistics to understand the higher level trend in society. So, how do we do this? One technique is to **add randomness to each person's response by giving each person the following instructions (assuming I'm asking a simple yes/no question):**

* Flip a coin 2 times.
* If the first coin flip is heads, answer honestly
* If the first coin flip is tails, answer according to the second coin flip (heads for yes, tails for no)!
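The instructions above can be sketched for a single respondent as follows. This is a minimal illustration using only Python's standard library; the function name `randomized_response` and the `rng` parameter are assumptions made for this example, not names from the notebook.

```python
import random

def randomized_response(true_answer, rng=random):
    """Return the answer one respondent reports under the two-coin protocol."""
    first_flip_heads = rng.random() < 0.5
    if first_flip_heads:
        return true_answer           # first flip heads: answer honestly
    second_flip_heads = rng.random() < 0.5
    return second_flip_heads         # first flip tails: heads -> yes, tails -> no

# A respondent whose true answer is "yes" should report "yes" about 75% of the
# time: 50% from answering honestly plus 25% from the random second flip.
rng = random.Random(0)
reports = [randomized_response(True, rng) for _ in range(10_000)]
print(sum(reports) / len(reports))
```

No single report reveals the respondent's true answer, since any "yes" could have come from the second coin flip.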

Thus, each person is now protected with **"plausible deniability"**. If they answer "Yes" to the question "have you committed X crime?", then it might be because they actually did, or it might be because they are answering according to a random coin flip. Each person has a high degree of protection. Furthermore, we can recover the underlying statistics with some accuracy, as the **"true statistics"** are simply **averaged with a 50% probability**. Thus, if we collect a bunch of samples and it turns out that 60% of people answer yes, then we know that the TRUE distribution is actually centered around 70%, because 70% averaged with 50% (a coin flip) is 60%, which is the result we obtained.
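The averaging argument can be written out as a short derivation. The symbols are introduced here for illustration only: $p$ is the true fraction of "yes" answers and $\hat{p}$ is the fraction of "yes" answers actually observed.

$$
\mathbb{E}[\hat{p}] = \tfrac{1}{2}\,p + \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{p}{2} + \tfrac{1}{4}
\quad\Longrightarrow\quad
p \approx 2\,\hat{p} - \tfrac{1}{2}
$$

Plugging in the example, $\hat{p} = 0.6$ gives $p \approx 2(0.6) - 0.5 = 0.7$, which matches the 70% figure.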

However, it should be noted that, especially when we only have a few samples, this comes at the **cost of accuracy**. This tradeoff **exists across all of Differential Privacy**. The greater the privacy protection (plausible deniability) the less accurate the results.

Let's implement this local DP for our database from before!

In [2]:
# Database

import torch

num_entries = 5000

db = torch.rand(num_entries) > 0.5

db

# Return a copy of db with the entry at remove_index removed
def get_parallel_db(db, remove_index):
    return torch.cat((db[0:remove_index],db[remove_index+1:]))

# Iterate through db and build every parallel db (one per removed entry)

def get_parallel_dbs(db):
    parallel_dbs = list()
    for i in range(len(db)):
        pdb = get_parallel_db(db,i)
        parallel_dbs.append(pdb)
    return parallel_dbs

# Create a random db together with all of its parallel dbs
def create_db_and_parallels(num_entries):
    db = torch.rand(num_entries) > 0.5
    pdbs = get_parallel_dbs(db)
    
    return db, pdbs

# Empirical sensitivity of a query: the largest change caused by removing one entry
def sensitivity(query, n_entries = 1000):
    # Initialize a database of the correct size and all of its parallel databases
    db , pdbs = create_db_and_parallels(n_entries)
    full_db_result = query(db)
    #Run the Query over all the databases
    max_distance = 0 
    for pdb in pdbs:
        pdb_result = query(pdb)
        # compare the parallel db result with the full db result
        db_distance = torch.abs(pdb_result - full_db_result)
        if(db_distance > max_distance):
            max_distance = db_distance
    #Return the sensitivity
    return max_distance
    

In [4]:
true_result = torch.mean(db.float())
true_result


tensor(0.4500)

In [5]:
# original local data, i.e. the actual values from people
db

tensor([0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
        1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
        0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1,
        1, 1, 0, 0], dtype=torch.uint8)

In [3]:
db, pdbs = create_db_and_parallels(100)

In [6]:
# First coin flip: decides whether we keep the value that is actually in the database
# or replace it with a value generated at random according to the second
# coin flip
first_coin_flip = (torch.rand(len(db)) > 0.5).float()
second_coin_flip = (torch.rand(len(db)) > 0.5).float()
first_coin_flip

tensor([0., 1., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0.,
        0., 1., 0., 1., 0., 0., 1., 1., 0., 0., 0., 1., 1., 1., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1.,
        1., 1., 1., 1., 0., 0., 1., 0., 1., 1., 1., 0., 1., 0., 1., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 1., 0., 0., 1., 0., 1., 1., 0.,
        1., 1., 0., 0., 1., 1., 1., 0., 0., 0.])

In [7]:
second_coin_flip

tensor([1., 1., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 1.,
        1., 0., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0.,
        0., 0., 1., 0., 1., 1., 1., 0., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0.,
        1., 1., 1., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1., 1., 1., 1., 0., 1.,
        1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 1., 0., 0., 1., 1., 0., 1., 0.])

We can now create our noisy database (also called the synthetic or augmented database). Half of the time, when the first coin flip is 1, we keep the true database value. We can do this by **multiplying the first coin flip by the database** (first_coin_flip acts like a mask).

In [8]:
# Keep the true value wherever the first coin flip is 1
db.float() * first_coin_flip

tensor([0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 1., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1.,
        1., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 1., 1., 0., 0., 0.])

Sometimes we need to **lie**, that is, answer randomly. One minus our first coin flip marks all the places where we will actually choose randomly. We fill those places in by multiplying that mask by the second coin flip.

In [9]:
(1 - first_coin_flip) * second_coin_flip

tensor([1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1.,
        1., 0., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1.,
        1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 1., 0.])

In [10]:
# New augmented db, which is differentially private: combine the two parts
augmented_db = (db.float() * first_coin_flip) + (1 - first_coin_flip) * second_coin_flip

In [11]:
# Remember: half of the values are the true answers,
# while the other half come from second_coin_flip and have a mean of about 0.5.
# This skews the output of our query towards 0.5,
# so we de-skew the mean of the augmented db to estimate the true mean.
db_result = torch.mean(augmented_db.float()) * 2 - 0.5


tensor(0.4200)

In [13]:
# Pack everything into a single query that returns the private and the true result
def query(db):
    true_result = torch.mean(db.float())
    first_coin_flip = (torch.rand(len(db)) > 0.5).float()
    second_coin_flip = (torch.rand(len(db)) > 0.5).float()
    augmented_db = (db.float() * first_coin_flip) + (1 - first_coin_flip) * second_coin_flip
    db_result = torch.mean(augmented_db.float()) * 2 - 0.5 
    
    return db_result, true_result


In [16]:
db, pdbs = create_db_and_parallels(10)
private_results , true_results = query(db)
print("with noise:" + str(private_results))
print("without noise:" + str(true_results))

with noise:tensor(0.7000)
without noise:tensor(0.4000)


In [17]:
db, pdbs = create_db_and_parallels(100)
private_results , true_results = query(db)
print("with noise:" + str(private_results))
print("without noise:" + str(true_results))

with noise:tensor(0.6800)
without noise:tensor(0.4800)


In [18]:
db, pdbs = create_db_and_parallels(1000)
private_results , true_results = query(db)
print("with noise:" + str(private_results))
print("without noise:" + str(true_results))

with noise:tensor(0.4380)
without noise:tensor(0.4870)


Remember: whenever we add noise to the distribution, we are corrupting it, so the statistics of the query are going to be sensitive to this noise. However, the more data points we have, the more this noise tends to average out. It tends not to affect the output of the query because, across a large number of people, the noise sometimes makes the result higher than it should be and sometimes lower. On average, the result is still centered around the mean of the true data distribution.
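The averaging-out effect can be checked with a small simulation. This is a rough sketch that uses only Python's standard library rather than torch; the helper `private_mean`, the trial count, and the database sizes are assumptions chosen for illustration. It measures the average gap between the de-skewed private result and the true mean for several database sizes.

```python
import random

def private_mean(db, rng):
    """De-skewed randomized-response estimate of the mean of a 0/1 list."""
    noisy = []
    for value in db:
        if rng.random() < 0.5:                 # first flip heads: honest answer
            noisy.append(value)
        else:                                  # first flip tails: random answer
            noisy.append(1 if rng.random() < 0.5 else 0)
    return 2 * (sum(noisy) / len(noisy)) - 0.5

rng = random.Random(0)
average_errors = {}
for n in (10, 100, 1000, 10000):
    errors = []
    for _ in range(100):                       # repeat to smooth out luck
        db = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
        true_mean = sum(db) / n
        errors.append(abs(private_mean(db, rng) - true_mean))
    average_errors[n] = sum(errors) / len(errors)
    print(n, round(average_errors[n], 4))
```

The printed error should shrink as the number of entries grows, which is the same pattern seen in the three notebook runs with 10, 100, and 1000 entries.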

Local DP is data hungry: in order to noise the dataset, you are adding a ton of noise to every single entry, so you need many data points before that noise averages out.