# Creating a Pseudonymized PII Lookup Table

In this lesson we'll create a pseudonymized key for storing potentially sensitive user data.

Our approach in this notebook is fairly straightforward; some industries may require more elaborate de-identification to guarantee privacy.

<img src="https://files.training.databricks.com/images/ade/ADE_arch_user_lookup.png" width="60%" />

## Learning Objectives
By the end of this lesson, students will be able to:
- Describe the purpose of "salting" before hashing
- Apply salted hashing to sensitive data for pseudonymization
- Use Auto Loader to process incremental inserts

## Setup
Begin by running the following cell to set up relevant databases and paths.

In [0]:
%run ../Includes/Classroom-Setup-6.1

## Auto Load Bronze Data

The following cell defines and executes logic to incrementally ingest data into the **`registered_users`** table with Auto Loader.

This logic is configured to process a single file each time a batch is triggered (currently every 10 seconds).

Executing this cell will start an always-on stream that slowly ingests new files as they arrive.

In [0]:
query = (spark.readStream
              .schema("device_id LONG, mac_address STRING, registration_timestamp DOUBLE, user_id LONG")
              .format("cloudFiles")
              .option("cloudFiles.format", "json")
              .option("cloudFiles.maxFilesPerTrigger", 1)
              .load(DA.paths.raw_user_reg)
              .writeStream
              .option("checkpointLocation", f"{DA.paths.checkpoints}/registered_users")
              .trigger(processingTime="10 seconds")
              .table("registered_users"))

DA.block_until_stream_is_ready(query)

Before moving on with this lesson, we need to:
1. Stop the existing stream
2. Drop the table we created
3. Clear the checkpoint directory

In [0]:
query.stop()
query.awaitTermination()
spark.sql("DROP TABLE IF EXISTS registered_users")
dbutils.fs.rm(f"{DA.paths.checkpoints}/registered_users", True)

Use the cell below to refactor the above query into a function that processes new files as a single incremental triggered batch.

To do this:
* Remove the option limiting the number of files processed per trigger (this is ignored when executing a batch anyway)
* Change the trigger type to "availableNow"
* Make sure to add **`.awaitTermination()`** to the end of your query to block execution of the next cell until the batch has completed

In [0]:
# TODO
def ingest_user_reg():
    pass

Use your function below.

**NOTE**: This triggered batch reuses the checkpoint of the always-on stream defined earlier. If that stream were still running, starting this batch would automatically cause it to fail; default behavior allows the newer query to succeed and errors the older query.

In [0]:
ingest_user_reg()
display(spark.table("registered_users"))

Another custom class was initialized in the setup script to land a new batch of data in our source directory. 

Execute the following cell and note the difference in the total number of rows in our table.

In [0]:
DA.user_reg_stream.load()

ingest_user_reg()
display(spark.table("registered_users"))

## Add a Salt to Natural Key
We'll start by defining a salt, here in plain text. We'll combine this salt with our natural key, **`user_id`**, before applying a hash function to generate a pseudonymized key.

Salting before hashing is very important because it makes dictionary attacks that reverse the hash much more expensive. To demonstrate, try reversing the following SHA-256 hash of `Secrets123` by searching for its hash using Google: `FCF730B6D95236ECD3C9FC2D92D7B6B2BB061514961AEC041D6C7A7192F592E4`, which brings you to this [link](https://hashtoolkit.com/decrypt-sha256-hash/fcf730b6d95236ecd3c9fc2d92d7b6b2bb061514961aec041d6c7a7192f592e4). Next, try hashing `Secrets123:BEANS` [here](https://passwordsgenerator.net/sha256-hash-generator/) and perform a similar search. Notice how adding the salt `BEANS` makes the hash much harder to look up.
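
The same comparison can be run locally. The following is a minimal sketch using Python's standard `hashlib` module; the helper name `sha256_hex` is our own and is not part of the course setup. Note that `hexdigest()` returns lowercase hex, so compare case-insensitively against the uppercase hash shown above.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as lowercase hex."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


password = "Secrets123"
salt = "BEANS"

# Hash the value on its own, then with the salt appended after a colon,
# matching the `Secrets123:BEANS` string used in the exercise above.
unsalted = sha256_hex(password)
salted = sha256_hex(f"{password}:{salt}")

print("unsalted:", unsalted)
print("salted:  ", salted)
```

The two digests have nothing in common, so a precomputed table of hashes for common strings is of no use against the salted value unless the attacker also knows the salt.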

For greater security, we could upload the salt as a secret using the Databricks <a href="https://docs.databricks.com/security/secrets/secrets.html" target="_blank">Secrets</a> utility.

In [0]:
salt = 'BEANS'
spark.conf.set("da.salt", salt)

In [0]:
# # If using the Databricks secrets store, here's how you'd read it...
# salt = dbutils.secrets.get(scope="DA-ADE3.03", key="salt")
# salt

Preview what your new key will look like.

In [0]:
%sql
SELECT *, sha2(concat(user_id,"${da.salt}"), 256) AS alt_id
FROM registered_users

## Register a SQL UDF

Create a SQL user-defined function to register this logic to the current database under the name **`salted_hash`**. 

This will allow this logic to be called by any user with appropriate permissions on this function. 

Make sure your UDF accepts one parameter, a **`STRING`**, and returns a **`STRING`**. You can access the configured salt value by using the expression `"${da.salt}"`.

For more information, see the <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html" target="_blank">CREATE FUNCTION</a> statement in the SQL UDFs docs.

In [0]:
%sql
-- TODO
CREATE FUNCTION salted_hash <FILL-IN>

If your SQL UDF is defined correctly, the assert statement below should run without error.

In [0]:
# Check your work
set_a = spark.sql(f"SELECT sha2(concat('USER123', '{salt}'), 256) alt_id").collect()
set_b = spark.sql("SELECT salted_hash('USER123') alt_id").collect()
assert set_a == set_b, "The 'salted_hash' function is returning the wrong result."
print("All tests passed.")

Note that it is theoretically possible to link the original key and pseudo-ID if the hash function and the salt are known.

Here, we use this method to add a layer of obfuscation; in production, you may wish to have a much more sophisticated hashing method.
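
To make the risk concrete, here is a minimal sketch in plain Python (no Spark required) of how a pseudo-ID could be linked back to its key once the salt leaks. It assumes the key is hashed as its decimal string followed by the salt, which is how we expect `sha2(concat(user_id, salt), 256)` to behave; the function names, the example ID `12345`, and the candidate range are hypothetical.

```python
import hashlib


def salted_hash(user_id: int, salt: str) -> str:
    # Assumed equivalent of sha2(concat(user_id, salt), 256):
    # the ID is rendered as a decimal string and the salt is appended.
    return hashlib.sha256(f"{user_id}{salt}".encode("utf-8")).hexdigest()


def recover_user_id(alt_id: str, salt: str, candidate_ids):
    """Try every candidate ID until one hashes to the given pseudo-ID."""
    for candidate in candidate_ids:
        if salted_hash(candidate, salt) == alt_id:
            return candidate
    return None


leaked_salt = "BEANS"
target_alt_id = salted_hash(12345, leaked_salt)  # hypothetical pseudo-ID

# With the salt known, a small numeric key space is searched in well under a second.
recovered = recover_user_id(target_alt_id, leaked_salt, range(100_000))
print(recovered)  # 12345
```

Because numeric IDs come from a small, predictable space, the salt is the only thing standing between the pseudo-ID and the original key, which is why it should be stored as a secret rather than in plain text.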

## Register Target Table
The logic below creates the **`user_lookup`** table. In the next notebook, we'll use the **`alt_id`** pseudo-ID stored here as the sole link to user PII.

By controlling access to the link between our **`alt_id`** and other natural keys, we'll be able to prevent linking PII to other user data throughout our system.
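
The idea can be sketched in plain Python with a couple of made-up records. Everything here is hypothetical sample data: the MAC addresses, the `events` list, and its `event` field are invented for illustration, and `salted_hash` is a local stand-in for the SQL UDF.

```python
import hashlib

SALT = "BEANS"


def salted_hash(user_id: int, salt: str = SALT) -> str:
    # Local stand-in for the SQL UDF: hash the decimal ID followed by the salt.
    return hashlib.sha256(f"{user_id}{salt}".encode("utf-8")).hexdigest()


# Hypothetical registration records with the same columns as registered_users.
registered_users = [
    {"device_id": 1, "mac_address": "00:00:00:00:00:01", "user_id": 101},
    {"device_id": 2, "mac_address": "00:00:00:00:00:02", "user_id": 102},
]

# Restricted lookup table: the only place alt_id and natural keys appear together.
user_lookup = [{"alt_id": salted_hash(r["user_id"]), **r} for r in registered_users]

# Downstream data carries only the pseudo-ID, never the natural keys.
events = [{"alt_id": salted_hash(101), "event": "login"}]

# Only someone with access to user_lookup can join back to user_id.
by_alt_id = {row["alt_id"]: row for row in user_lookup}
linked = [{**e, "user_id": by_alt_id[e["alt_id"]]["user_id"]} for e in events]
print(linked[0]["user_id"])  # 101
```

Anyone who can read `events` but not `user_lookup` sees only opaque pseudo-IDs, so access control on the lookup table decides who can re-identify users.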

In [0]:
%sql
CREATE TABLE IF NOT EXISTS user_lookup (alt_id string, device_id long, mac_address string, user_id long)
USING DELTA 
LOCATION '${da.paths.working_dir}/user_lookup'

## Define a Function for Processing Incremental Batches

The cell below includes the setting for the correct checkpoint path.

Define a function that applies the SQL UDF registered above to the **`user_id`** column of the **`registered_users`** table to create your **`alt_id`**.

Make sure you include all the necessary columns for the target **`user_lookup`** table. Configure your logic to execute as a triggered incremental batch.

In [0]:
# TODO
def load_user_lookup():
    (spark.readStream
        <FILL-IN>
        .option("checkpointLocation", f"{DA.paths.checkpoints}/user_lookup")
        <FILL-IN>
    )

Use your function below and display the results.

In [0]:
load_user_lookup()

display(spark.table("user_lookup"))

Process another batch of data below to confirm that the incremental processing is working through the entire pipeline.

In [0]:
DA.user_reg_stream.load()

ingest_user_reg()
load_user_lookup()

display(spark.table("user_lookup"))

The code below ingests all the remaining records to put 100 total users in the **`user_lookup`** table.

In [0]:
DA.user_reg_stream.load(continuous=True)

ingest_user_reg()
load_user_lookup()

display(spark.table("user_lookup"))

We'll apply this same hashing method to process and store PII data in the next lesson.

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
DA.cleanup()