### Exploring Cape Python Policy with Spark and Cape Core

This Jupyter Notebook accompanies our [Medium Post on Getting Started with Cape Core](https://medium.com/dropoutlabs/cape-core-privacy-and-data-science-working-together-d25a55526506). To follow along, [download the example dataset](https://capeprivacy.com/example-dataset/) and put it in a relative folder called `data` (or update the file path below). You will also need to [download the policy file](https://github.com/capeprivacy/cape-python/blob/master/examples/policy/iot_example_policy.yaml) and put it in a relative folder called `policy`, or ensure you have Cape Python installed locally and change the path to use the copy in the `examples` folder.

You will also need a local or deployed instance of [Cape Core](https://github.com/capeprivacy/cape) running, along with a generated API token.

In [1]:
spark

In [2]:
import cape_dataframes as cape_df

In [4]:
df = spark.read.csv('../data/iot_example.csv', header=True)

In [5]:
df.sample(0.1).limit(5).show()

+-------------------+---------------+-----------+---------+--------------------+------+------+
|          timestamp|       username|temperature|heartrate|               build|latest|  note|
+-------------------+---------------+-----------+---------+--------------------+------+------+
|2017-01-01T12:18:39|       moonjuan|         26|       76|22989085-e6fe-eae...|     1|   n/a|
|2017-01-01T12:22:52|           ylee|         29|       73|ff29e7ab-934f-f7b...|     0|  test|
|2017-01-01T12:32:20|    alicecampos|         29|       76|547ed6d5-0e12-4c2...|     0|  test|
|2017-01-01T12:36:40|   stevenmiller|         26|       64|e12b053c-d772-c94...|     0|update|
|2017-01-01T12:40:26|robinsongabriel|         17|       80|f0bfb52c-b805-cd1...|     1|   n/a|
+-------------------+---------------+-----------+---------+--------------------+------+------+



### Privacy Concerns

This dataset contains mock data from wearable devices, and we are concerned about the privacy of the individuals in it. Because the analysis is time-series based, we want to keep the ability to see an individual's data change over time, while still providing some basic privacy protections for our exploratory data analysis and later model development.

The following policy file provides these protections:

- [Tokenization](https://docs.capeprivacy.com/libraries/cape-python/transformations/#tokenizer) of the username column with a maximum token length of 10 and a key defined in the file.
- [Date Truncation](https://docs.capeprivacy.com/libraries/cape-python/transformations/#date-truncation) for the timestamp column, removing the minutes and seconds but keeping the year, month, day and hour.
- [Redaction](https://docs.capeprivacy.com/libraries/cape-python/redactions) of the build column, which reveals information about the device it was built on. In Cape, redaction drops the matching data, so this will change the shape of your dataframe.

In [6]:
!cat ../policy/iot_example_policy.yaml

label: iot_dataset_policy
version: 1
rules:
  - match:
      name: username
    actions:
      - transform:
          type: "tokenizer"
          max_token_len: 10
          key: "Please change this :)"
  - match:
      name: timestamp
    actions:
      - transform:
          type: "date-truncation"
          frequency: "hour"
  - match:
      name: build
    actions:
      - transform:
          type: "column-redact"
          columns: ["build"] 
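
To make the three rules concrete, here is a rough sketch in plain Python of what each one does to a single record. It is not Cape's implementation: the keyed hash (HMAC-SHA256) is an assumption standing in for the tokenizer, and the helper names `tokenize`, `truncate_to_hour`, `redact` and `apply_sketch` are hypothetical.

```python
import hashlib
import hmac
from datetime import datetime

KEY = b"Please change this :)"  # same placeholder key as in the policy file


def tokenize(value, key=KEY, max_token_len=10):
    """Keyed hash of the value, cut to max_token_len hex characters.

    Assumption: HMAC-SHA256 stands in for Cape's tokenizer here.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:max_token_len]


def truncate_to_hour(timestamp):
    """Zero out minutes and seconds, keeping year, month, day and hour."""
    parsed = datetime.fromisoformat(timestamp)
    return parsed.replace(minute=0, second=0, microsecond=0).isoformat()


def redact(record, columns):
    """Return a copy of the record without the redacted columns."""
    return {k: v for k, v in record.items() if k not in columns}


def apply_sketch(record):
    """Apply the three rules in the spirit of the policy file."""
    out = redact(record, ["build"])
    out["username"] = tokenize(out["username"])
    out["timestamp"] = truncate_to_hour(out["timestamp"])
    return out


# One record from the sample shown earlier (build value as displayed).
row = {
    "timestamp": "2017-01-01T12:18:39",
    "username": "moonjuan",
    "temperature": "26",
    "heartrate": "76",
    "build": "22989085-e6fe-eae...",
    "latest": "1",
    "note": "n/a",
}
caped_row = apply_sketch(row)
print(caped_row)
```

Because the token depends only on the key and the input, the same username always maps to the same token, which is what keeps an individual's readings linkable over time.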


### With Cape Core

If you are using Cape Core and have a project set up and registered with the above policy, as well as an API token, you can use the following code to download the policy from the Cape Coordinator.

In [3]:
c = cape_df.Client("http://localhost:8080")
c.login("INSERT YOUR CAPE TOKEN HERE")
policy = c.get_policy("first-project")

### Apply the parsed policy

To apply the parsed policy, call the `apply_policy` function on your dataframe and sample the results.

In [6]:
caped_df = cape_df.apply_policy(policy, df)



In [7]:
caped_df.sample(0.1).limit(5).show()

+----------+----------+-----------+---------+------+--------+
| timestamp|  username|temperature|heartrate|latest|    note|
+----------+----------+-----------+---------+------+--------+
|2017-01-01|1763f4313b|         22|       83|     1|  update|
|2017-01-01|d0c44f5675|         12|       77|     0|    wake|
|2017-01-01|0a89db1e39|         12|       78|     1|interval|
|2017-01-01|26594010f3|         29|       76|     0|    test|
|2017-01-01|37db75f0f1|         12|       71|     0|   sleep|
+----------+----------+-----------+---------+------+--------+
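
Before passing the data on, it can be worth confirming that the policy did what you expected. The following is a minimal sketch of such a check; `check_caped` is a hypothetical helper, not part of Cape, and it only looks at the redacted column and the token length.

```python
def check_caped(columns, rows, redacted=("build",),
                token_column="username", max_token_len=10):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # A redacted column should no longer appear in the schema.
    for name in redacted:
        if name in columns:
            problems.append(f"column {name!r} was not redacted")
    # Tokens should never be longer than the policy's max_token_len.
    for row in rows:
        token = row.get(token_column, "")
        if len(token) > max_token_len:
            problems.append(f"token {token!r} exceeds {max_token_len} characters")
    return problems


# Column names and two tokens copied from the sampled output above.
columns = ["timestamp", "username", "temperature", "heartrate", "latest", "note"]
rows = [{"username": "1763f4313b"}, {"username": "d0c44f5675"}]
print(check_caped(columns, rows))
```

With a Spark DataFrame, the inputs could come from `caped_df.columns` and `[r.asDict() for r in caped_df.limit(100).collect()]`.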



### Send it to Sink

Now it's time to send our caped DataFrame to our clean sink or use it in a Spark task (for example, for analytics, EDA or machine learning).

Note: You'll need to edit the database details below (or specify where you'd like the dataframe to be written).

In [None]:
caped_df.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .save()
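
If you would rather not hard-code credentials in the notebook, one option is to read the connection details from environment variables. This is a sketch under assumptions: the `CAPE_SINK_*` variable names and the helpers `jdbc_options` and `write_to_sink` are hypothetical, so adapt them to your own setup.

```python
import os


def jdbc_options(env=os.environ):
    """Collect JDBC options from environment variables.

    The CAPE_SINK_* names are hypothetical; pick names that suit your setup.
    """
    required = {
        "url": "CAPE_SINK_URL",
        "dbtable": "CAPE_SINK_TABLE",
        "user": "CAPE_SINK_USER",
        "password": "CAPE_SINK_PASSWORD",
    }
    # Fail early with a clear message if anything is missing.
    missing = [var for var in required.values() if var not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {option: env[var] for option, var in required.items()}


def write_to_sink(df, options):
    """Write a Spark DataFrame over JDBC using the collected options."""
    df.write.format("jdbc").options(**options).save()
```

In the notebook this would be called as `write_to_sink(caped_df, jdbc_options())`.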