
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


# PII Data Security

In this demo, you will learn how to:

* Handle PII data security with **Pseudonymization and Anonymization**

You will also:
* Generate and trigger a Delta Live Tables pipeline that manages both processes
* Explore the resulting DAG
* Land a new batch of data

## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

  - In the drop-down, select **More**.

  - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

## A. Classroom Setup

Run the following cell to configure your working environment for this course. It will also set your default catalog to your unique catalog name and the schema to your specific schema name shown below using the `USE` statements.


```
USE CATALOG your-catalog;
USE SCHEMA your-catalog.pii_data;
```

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically reference the information needed to run the course.

In [0]:
%run ./Includes/Classroom-Setup-1.2

Run the code below to view your current default catalog and schema. Confirm that they match the catalog and schema names shown in the `USE` statements above.

In [0]:
%sql
SELECT current_catalog(), current_schema()


## B. Generate and Trigger DLT Pipeline
Run the cell below to auto-generate your DLT pipeline using the provided configuration values.

After creation, the pipeline will run. The initial run will take a few minutes while a cluster is provisioned.

In [0]:
# Generate and Configure Pipeline
DA.generate_pipeline(
    pipeline_name="1.2_PII_Data_Security",
    use_schema = 'pii_data',
    notebooks_folder='Pipeline', 
    pipeline_notebooks=[
        'DP 1.2.1 - Pseudonymized PII Lookup Table',
        'DP 1.2.2 - Anonymized Users Age'
        ],
    use_configuration = {
        'user_reg_source':DA.paths.stream_source.user_reg,
        'daily_user_events_source':DA.paths.stream_source.daily,
        'lookup_catalog': DA.catalog_name
        }
    )

# Trigger the pipeline
DA.start_pipeline()


#### Pipeline Overview

This Delta Live Tables pipeline is based on two notebooks located in the "Pipeline" folder:

- [DP 1.2.1 - Pseudonymized PII Lookup Table]($./Pipeline/DP 1.2.1 - Pseudonymized PII Lookup Table): Shows how to ingest and stream **registered_users** data and apply two **Pseudonymization** techniques:
  - Hashing
  - Tokenization

- [DP 1.2.2 - Anonymized Users Age]($./Pipeline/DP 1.2.2 - Anonymized Users Age): Shows how to ingest and stream **user_events_raw** data into the **users_bronze** table and apply **Binning Anonymization** to users' ages in the materialized view **user_age_bins**.

### B1. Open the DLT Pipeline

In the left navigation bar, complete the following to open your DLT pipeline:

1. Right-click on Pipelines and select *Open in New Tab*.

2. Find and select your DLT pipeline named **your_catalog_name_1.2_PII_Data_Security**.

3. Leave the DLT pipeline page open and continue to the next steps.

4. Once the pipeline completes, its execution graph (DAG) should look like this:

![PII Data Security DLT Pipeline DAG](./Includes/images/piidata_security_dag.png)

## C. Pseudonymization

As a recap:

- Replaces an original data point with a pseudonym that allows later re-identification
- Only authorized users have access to the keys, hashes, or lookup tables needed for re-identification
- Protects datasets at the record level, for example for machine learning
- Under the GDPR, a pseudonym is still considered personal data
- Two main pseudonymization methods: hashing and tokenization


[DP 1.2.1 - Pseudonymized PII Lookup Table]($./Pipeline/DP 1.2.1 - Pseudonymized PII Lookup Table): Ingests and streams **registered_users** data and applies two **Pseudonymization** techniques. The notebook:
  1. Creates the **registered_users** table from the source JSON files with PII.

  1. Applies hashing in the table **user_lookup_hashed**.

  1. Applies tokenization in the tables **registered_users_tokens** and **user_lookup_tokenized**.


#### Pseudonymization section in DAG

![Pseudonymization DAG](./Includes/images/pii_data_security_pseudo_dag.png)

### C1. Preview the registered_users Table

The table **registered_users** will be our source for the ingested users, where we'll apply *Pseudonymization* and *Anonymization*. 

Run the cell below to view the original source data. Notice that none of it has been pseudonymized or anonymized yet.


In [0]:
%sql
SELECT
    user_id,
    device_id,
    mac_address
FROM registered_users 
LIMIT 5;


### C2. Option 1 - Hashing

Objectives:

- Apply SHA or other hashes to all PII.
- Add a random string "salt" to values before hashing.
- Databricks secrets can be leveraged for obfuscating the salt value.
- This leads to a slight increase in data size.
- Some operations may be less efficient.

In our pipeline, we leverage the **registered_users** table and apply hashing to the **user_id** column using a salt value of *BEANS*, creating a column **alt_id** in the **user_lookup_hashed** table.

See the cell below for the results and compare both the **user_id** and **alt_id** columns.

**NOTE:** The **user_id** column should be removed after processing. It is kept for demo purposes.

In [0]:
%sql
SELECT 
  alt_id,
  user_id,
  device_id,
  mac_address 
FROM user_lookup_hashed
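
The salted hashing described above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: it assumes SHA-256 and a salt appended to the value, neither of which is confirmed by this notebook, and the `user_id` values are made up.

```python
import hashlib

# Demo salt from the text above. In practice the salt should come from a
# secret store rather than being hard-coded.
SALT = "BEANS"


def salted_hash(value: str, salt: str = SALT) -> str:
    """Return the SHA-256 hex digest of the value with the salt appended.

    Assumption: SHA-256 and "value + salt" ordering are illustrative choices.
    """
    return hashlib.sha256(f"{value}{salt}".encode("utf-8")).hexdigest()


# Hypothetical user_id values, not taken from the course data.
for user_id in ["12345", "12346"]:
    print(user_id, "->", salted_hash(user_id))
```

The same input and salt always produce the same digest, which is what allows the hashed column to be joined across tables without exposing the original value.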

### C3. Option 2 - Tokenization

**Tokenization** objectives:

- Converts all PII to keys.
- Values are stored in a secure lookup table.
- Slow to write, but fast to read.
- De-identified data is stored in fewer bytes.

Similar to the previous step, our pipeline leverages the **registered_users** table. This time, the pipeline creates a new table called **registered_users_tokens** to store the relationship between the generated token (using the [uuid function](https://docs.databricks.com/en/sql/language-manual/functions/uuid.html)) and the **user_id** column.

See the token column generated for each **user_id** in the **registered_users_tokens** table below.



In [0]:
%sql
SELECT * 
FROM registered_users_tokens

Now we can use the **registered_users_tokens** table to create a new lookup table with a tokenized **user_id** column, stored in the **user_lookup_tokenized** table.

In [0]:
%sql
SELECT 
  alt_id as Tokenized,
  device_id,
  mac_address, 
  registration_timestamp 
FROM user_lookup_tokenized
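
To make the token and lookup-table relationship concrete, here is a small in-memory sketch of tokenization. The `TokenVault` class and the sample `user_id` are hypothetical and only stand in for the secure lookup table; the pipeline itself uses Delta tables and the SQL `uuid` function.

```python
import uuid


class TokenVault:
    """In-memory stand-in for a secure lookup table (hypothetical helper)."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        # Reuse an existing token so one value always maps to one key.
        if value not in self._token_by_value:
            token = str(uuid.uuid4())  # random, not derived from the value
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token):
        # Only callers with access to the vault can re-identify a record.
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("12345")  # hypothetical user_id
print(token, "->", vault.detokenize(token))
```

Unlike a hash, the token carries no information about the original value, so re-identification is only possible through the lookup table.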

## D. Anonymization

As a recap:

- Protects entire datasets (tables, databases, or entire data catalogs), mostly for Business Intelligence
- Personal data is irreversibly altered so that a data subject can no longer be identified, directly or indirectly
- Real-world scenarios usually combine more than one technique
- Two main anonymization methods: data suppression and generalization
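
As a rough illustration of the two methods named above, the sketch below suppresses a direct identifier and generalizes two other fields. The record is invented; only the column names are borrowed from the **users_bronze** table used later in this notebook, and the choice of which fields to suppress or generalize is an assumption made for the example.

```python
from datetime import date

# Hypothetical record; the values are invented for illustration.
record = {
    "user_id": "12345",
    "dob": date(1990, 5, 17),
    "gender": "F",
    "city": "Austin",
    "state": "TX",
}

# Direct identifiers that are dropped entirely (data suppression).
SUPPRESSED_COLUMNS = {"user_id"}


def anonymize(row):
    """Return a new dict with identifiers suppressed and details generalized."""
    out = {k: v for k, v in row.items() if k not in SUPPRESSED_COLUMNS}
    out["birth_year"] = out.pop("dob").year  # generalize: full date -> year
    out.pop("city")  # generalize location: keep only the state
    return out


print(anonymize(record))
```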


[DP 1.2.2 - Anonymized Users Age]($./Pipeline/DP 1.2.2 - Anonymized Users Age): Shows how to ingest and stream **user_events_raw** data into the **users_bronze** table and apply **Binning Anonymization** to users' ages in the materialized view **user_age_bins**.

#### Anonymization section in DAG

![Anonymization DAG](./Includes/images/piidata_security_anon_dag.png)

### D1. Explore the Date Lookup and User Events Raw tables

- The **date_lookup** table is used for the **date** and **week_part** association. It is joined with the **user_events_raw** data to identify which **week_part** the **Date of Birth (DOB)** belongs to. 
  - For example: (date) 2020-07-02 = (week_part) 2020-27.

In [0]:
%sql
SELECT * 
FROM date_lookup
LIMIT 5;
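
The date to **week_part** mapping can be reproduced with a few lines of Python. This sketch assumes ISO week numbering, which matches the single example given above (2020-07-02 maps to 2020-27); the convention actually used to build **date_lookup** is not stated in this notebook.

```python
from datetime import date


def week_part(d: date) -> str:
    """Format a date as '<ISO year>-<ISO week>' (assumed convention)."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year}-{iso_week:02d}"


print(week_part(date(2020, 7, 2)))  # 2020-27
```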

- The **user_events_raw** table represents the ingested user event data in JSON format, which is later unpacked and filtered to retrieve only user information.

In [0]:
%sql
SELECT 
  string(key), 
  string(value)
FROM user_events_raw
LIMIT 5;
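
The query above casts **key** and **value** to strings before displaying them. A minimal sketch of the same unpacking step in Python follows, assuming a binary key and a binary JSON value; the payload fields and values are hypothetical and only loosely modeled on the **users_bronze** columns.

```python
import json

# Hypothetical raw event: binary key plus a binary JSON payload.
raw_event = {
    "key": b"12345",
    "value": b'{"user_id": "12345", "dob": "1990-05-17", "state": "TX"}',
}


def unpack(event):
    """Decode the key and parse the JSON payload of one raw event."""
    key = event["key"].decode("utf-8")
    payload = json.loads(event["value"].decode("utf-8"))
    return key, payload


key, payload = unpack(raw_event)
print(key, payload["dob"], payload["state"])
```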

### D2. Users Bronze

The table **users_bronze** is our focus and will be our source for the ingested user information, where we'll apply **Binning Anonymization** to the **Date of Birth (DOB)**.



In [0]:
%sql
SELECT 
  user_id,
  dob,
  gender,
  city,
  state 
FROM users_bronze

### D3. User Age Bins

The **user_age_bins** table shows the results of the **binning anonymization**. Check the **age** column and the range provided for each user.


In [0]:
%sql
SELECT * 
FROM user_age_bins
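
Binning replaces an exact value with the range it falls into. The sketch below derives an age from a date of birth and maps it to a bin; the bin edges and labels are hypothetical and may differ from the ranges the pipeline produces in **user_age_bins**.

```python
from datetime import date

# Hypothetical bin edges; each bin covers [lower, upper).
EDGES = [18, 25, 35, 45, 55, 65]


def age_on(dob: date, today: date) -> int:
    """Age in whole years; subtract one if the birthday has not occurred yet."""
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))


def age_bin(age: int) -> str:
    """Map an exact age to a coarse range label."""
    if age < EDGES[0]:
        return f"under {EDGES[0]}"
    for lower, upper in zip(EDGES, EDGES[1:]):
        if age < upper:
            return f"{lower}-{upper - 1}"
    return f"{EDGES[-1]}+"


age = age_on(date(1990, 5, 17), date(2025, 5, 16))
print(age, "->", age_bin(age))
```

Wider bins give stronger protection at the cost of analytical detail, so the edges are a design choice that depends on how the data will be used.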

## E. Land New Data and Trigger the Pipeline

Run the cell below to land more data in the source directory, then navigate to the Delta Live Tables UI and manually trigger a pipeline update.

As we continue through the course, you can return to this notebook and use the method provided below to land new data. Running this entire notebook again will delete the underlying data files for both the source data and your DLT Pipeline and enable you to start over.

In [0]:
## Load files into (your catalog -> pii_data -> volumes -> pii -> stream_source -> daily)
DA.load(copy_from=DA.paths.stream_source.daily_working_dir,
        copy_to=DA.paths.stream_source.daily,
        n=4)


&copy; 2025 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>