## Part 3: Sneak Peek into What's Next

### About PyDP
The PyDP package provides access to differential privacy algorithms in Python. This example uses version 1.0 of the package, which has the following limitations:

1. It supports only [Google's Differential Privacy library](https://github.com/google/differential-privacy).
2. It supports only the Laplace noise generation technique.
3. It supports only integer and floating-point values.
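
To give a feel for what the Laplace technique in item 2 does, here is a minimal sketch of a Laplace-noised bounded sum in plain Python. It is not PyDP's or Google's implementation (a production library uses a hardened noise sampler). The function name `laplace_bounded_sum` and the choice of sensitivity `max(|lower|, |upper|)` are assumptions made for illustration.

```python
import random


def laplace_bounded_sum(values, epsilon, lower, upper, rng=None):
    """Illustrative Laplace-noised sum; not the PyDP implementation."""
    rng = rng or random.Random()
    # Clamp every value into [lower, upper] so one record has bounded influence.
    clamped = [min(max(v, lower), upper) for v in values]
    # Assumption: each record contributes once, so the sensitivity of the sum
    # is the largest magnitude a single clamped value can take.
    sensitivity = max(abs(lower), abs(upper))
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return sum(clamped) + noise


if __name__ == "__main__":
    data = [3.0, 120.5, 999.0]  # hypothetical values; 3.0 and 999.0 get clamped
    print(laplace_bounded_sum(data, epsilon=1, lower=5, upper=250))
```

A smaller `epsilon` gives a larger noise scale, which is the usual privacy versus accuracy trade-off.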

### What's in this Tutorial
This tutorial demonstrates what's new in PyDP and how it can be used in a distributed system.

In [36]:
!pip install python-dp # installing PyDP



In [37]:
import pydp as dp # by convention our package is to be imported as dp (for Differential Privacy!)
from pydp.algorithms.laplacian import BoundedSum, BoundedMean, Count, Max
import pandas as pd
import statistics # for calculating mean without applying differential privacy


### Fetching and Loading the Data

In [38]:
# get the carrots data from our public GitHub repo
url1 = 'https://raw.githubusercontent.com/OpenMined/PyDP/dev/examples/Tutorial%204-Launch_demo/data/01.csv'
df1 = pd.read_csv(url1,sep=",", engine = "python")

In [39]:
url2 = 'https://raw.githubusercontent.com/OpenMined/PyDP/dev/examples/Tutorial%204-Launch_demo/data/02.csv'
df2 = pd.read_csv(url2,sep=",", engine = "python")

In [40]:
url3 = 'https://raw.githubusercontent.com/OpenMined/PyDP/dev/examples/Tutorial%204-Launch_demo/data/03.csv'
df3 = pd.read_csv(url3,sep=",", engine = "python")

In [41]:
url4 = 'https://raw.githubusercontent.com/OpenMined/PyDP/dev/examples/Tutorial%204-Launch_demo/data/04.csv'
df4 = pd.read_csv(url4,sep=",", engine = "python")

In [42]:
url5 = 'https://raw.githubusercontent.com/OpenMined/PyDP/dev/examples/Tutorial%204-Launch_demo/data/05.csv'
df5 = pd.read_csv(url5,sep=",", engine = "python")

#### Combining all the data into a single DataFrame

In [43]:
combined_df_temp = [df1, df2, df3, df4, df5]
original_dataset = pd.concat(combined_df_temp)

The size of the combined dataset: 

In [44]:
original_dataset.shape

(5000, 6)

In [45]:
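# Non-private sum of the full dataset, kept for reference
sum_original_dataset = round(sum(original_dataset['sales_amount'].to_list()), 2)
# Differentially private sum of the full dataset, computed in a single call
dp_sum_og = round(BoundedSum(epsilon=1, lower_bound=5, upper_bound=250, dtype='float').quick_result(original_dataset['sales_amount'].to_list()), 2)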

#### Querying on Partial Data

Consider a case where you receive a stream of data from a distributed database and want to report a partial result each time new data arrives.
The more data you receive, the better your picture of the whole dataset becomes, but in this setting you still have to return a result every time a new batch arrives.

To support this, PyDP lets you spend only part of your privacy budget on each result, through the `privacy_budget` argument.
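
As a rough illustration of the trade-off, the sketch below shows how the Laplace noise scale grows when only a fraction of the budget is spent. It assumes that the budget fraction multiplies epsilon and that the sensitivity of the bounded sum is `max(|lower_bound|, |upper_bound|)`. Both are simplifications for illustration, and `laplace_scale` is a hypothetical helper, not a PyDP function.

```python
def laplace_scale(lower_bound, upper_bound, epsilon, budget_fraction=1.0):
    """Noise scale of a Laplace-noised bounded sum (illustrative only)."""
    if not 0 < budget_fraction <= 1:
        raise ValueError("budget_fraction must be in (0, 1]")
    # Assumption: one contribution per record, so the sensitivity is the
    # largest magnitude allowed by the bounds.
    sensitivity = max(abs(lower_bound), abs(upper_bound))
    # Assumption: spending a fraction of the budget shrinks the effective epsilon.
    effective_epsilon = epsilon * budget_fraction
    return sensitivity / effective_epsilon


if __name__ == "__main__":
    for fraction in (0.3, 0.7, 1.0):
        scale = laplace_scale(5, 250, epsilon=1, budget_fraction=fraction)
        print(f"budget fraction {fraction}: noise scale {scale:.2f}")
```

Under these assumptions, a smaller fraction gives a larger scale, so a result released with less budget is expected to be noisier.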



In [46]:
partial_dp_obj = BoundedSum(epsilon= 1, lower_bound =  5, upper_bound = 250, dtype ='float')

We combine the first 3000 records into one stream and the remaining 2000 records into another.

In [47]:
new_df_1 = pd.concat([df1, df2, df3])
new_df_2 = pd.concat([df4, df5])
print(new_df_1.shape,new_df_2.shape)

(3000, 6) (2000, 6)


In [48]:
partial_dp_obj.add_entries(new_df_1['sales_amount'].to_list()) # adding the first 3000 records

In [49]:
partial_dp_obj.privacy_budget_left()

1.0

In [50]:
partial_sum_dp = round(partial_dp_obj.result(privacy_budget=0.3), 2) # using only 30% of available privacy budget 
print(partial_sum_dp)

381196.24


In [51]:
actual_partial_sum = round(sum(new_df_1['sales_amount'].to_list()), 2)
print(actual_partial_sum)

383911.03


In [52]:
print("Difference in sum for first 3000 records which used only 30% privacy budget= {}".format(round(abs(actual_partial_sum - partial_sum_dp), 2)))

Difference in sum for first 3000 records which used only 30% privacy budget= 2714.79


In [53]:
partial_dp_obj.privacy_budget_left()

0.7

In [54]:
partial_dp_obj.add_entries(new_df_2['sales_amount'].to_list()) # adding the remaining 2000 records to the list
partial_total_sum = round(partial_dp_obj.result(), 2)
print(partial_total_sum)

637449.44


In [55]:
partial_dp_obj.privacy_budget_left() # we have used up all the budget available to us

0.0
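
The budget readings above (1.0, then 0.7, then 0.0) can be mimicked with a small bookkeeping class. This is a hypothetical sketch of the accounting only: `BudgetTracker` and `spend` are made-up names and say nothing about how PyDP tracks the budget internally.

```python
class BudgetTracker:
    """Tracks the remaining fraction of a privacy budget (illustrative only)."""

    def __init__(self):
        self.remaining = 1.0

    def spend(self, fraction=None):
        """Spend `fraction` of the total budget, or everything left if None."""
        if fraction is None:
            fraction = self.remaining
        if fraction <= 0 or fraction > self.remaining + 1e-12:
            raise ValueError("not enough privacy budget left")
        # Round to avoid floating-point drift in the displayed remainder.
        self.remaining = round(max(self.remaining - fraction, 0.0), 12)
        return fraction


if __name__ == "__main__":
    tracker = BudgetTracker()
    print(tracker.remaining)  # full budget available
    tracker.spend(0.3)        # first partial result
    print(tracker.remaining)
    tracker.spend()           # final result uses whatever is left
    print(tracker.remaining)
```

Once the remaining budget reaches zero, any further `spend` call in this sketch raises an error, which mirrors the idea that no more results can be released from the same object.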

In [58]:
def sum_og_dataset(budget):
    '''
    Sample Function to calculate BoundedSum on the whole dataset with budget as specified
    '''
    dp_sum_original_dataset = BoundedSum(epsilon= 1, lower_bound =  5, upper_bound = 250, dtype ='float')
    dp_sum_original_dataset.reset()
    dp_sum_original_dataset.add_entries(original_dataset['sales_amount'].to_list())
    return round(dp_sum_original_dataset.result(budget), 2)


In [59]:
print("Actual Sum: {}".format(sum_original_dataset))
print("Sum from the previous run with privacy budget 1.0: {}".format(dp_sum_og))
print("Sum when using privacy_budget as 0.7 on the whole dataset together: {}".format(sum_og_dataset(budget=0.7)))
print("Sum from this run with privacy budget 0.7 on split dataset: {}".format(partial_total_sum))


Actual Sum: 636594.59
Sum from the previous run with privacy budget 1.0: 636568.44
Sum when using privacy_budget as 0.7 on the whole dataset together: 636587.08
Sum from this run with privacy budget 0.7 on split dataset: 637449.44
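
To make the comparison easier to read, the absolute error of each differentially private sum can be computed from the values printed above. The snippet below only reuses those printed numbers; the dictionary keys are labels chosen here for illustration.

```python
actual_sum = 636594.59  # "Actual Sum" from the output above

dp_results = {
    "budget 1.0, whole dataset": 636568.44,
    "budget 0.7, whole dataset": 636587.08,
    "budget 0.7, split dataset": 637449.44,
}

# Absolute error of each private sum relative to the true sum.
abs_errors = {
    label: round(abs(value - actual_sum), 2) for label, value in dp_results.items()
}

for label, error in abs_errors.items():
    print(f"{label}: absolute error {error}")
```

Because the noise is random, these errors change from run to run, so a single run is not enough to rank the three settings.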


## What's Ahead:
PyDP GitHub: https://github.com/OpenMined/PyDP

Documentation: http://pydp.readthedocs.org/

Join Us: #lib_pydp on Slack: http://openmined.slack.com/


Contact me:
- Twitter: https://twitter.com/chinmayshah899
- LinkedIn: https://www.linkedin.com/in/chinmayshah99/