**12 Days of Demos**
# 🎅 Secure, Reusable Unity Catalog Functions for Santa's Workshop 🎄

Santa receives **millions of letters** from children worldwide, and they contain sensitive information: personal identifiers such as names, plus the cities and provinces where the children live. The North Pole Modernization Office (NPMO) uses **Unity Catalog Functions** to create governed, reusable data tools that:

* **Protect PII** - Automatic masking of sensitive data
* **Enable safe queries** - Governed access patterns
* **Track lineage** - Unity Catalog monitors all usage
* **Are reusable everywhere** - SQL, Python, dashboards, and applications

Security isn't just policy at the North Pole - it's built into the data platform.

---

### 🦌 Step 1: Configuration

Before you begin, update the configuration below to match your environment.

The default values point to the demo dataset, but you can customize:
* **Catalog name** - Your Unity Catalog catalog
* **Schema name** - Where your raw data and processed results are stored
* **Volume name** - The Unity Catalog volume that holds the raw files

👇 **Update the cell below with your values, then run it!**

In [0]:
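# TODO: Optionally update these values for your environment
TARGET_CATALOG = "main"
TARGET_SCHEMA = "dbrx_12daysofdemos"
TARGET_VOLUME = "raw_data_volume"

# Short names used by the cells below
CATALOG = TARGET_CATALOG
SCHEMA = TARGET_SCHEMA
# Assumption: the raw letters table name is not given in this notebook;
# replace this value with the table that holds your raw letter data
SOURCE_TABLE = "holiday_letters"

print("✅ Configuration loaded.")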

In [0]:
# Derived names (no need to change)
full_schema = f"{CATALOG}.{SCHEMA}"
source_table = f"{full_schema}.{SOURCE_TABLE}"

In [0]:
# Set up the catalog and schema
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {full_schema}")
spark.sql(f"USE SCHEMA {SCHEMA}")

print(f"✅ Using catalog: {CATALOG}")
print(f"✅ Using schema: {full_schema}")
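
The cells above interpolate `CATALOG` and `SCHEMA` straight into SQL text. As a hedged sketch (the helpers `quote_identifier` and `qualified_name` are hypothetical and not part of the original notebook), configuration values could be checked and backtick-quoted before they reach `spark.sql`, so that a typo or stray character fails early:

```python
import re

# Hypothetical helpers, not part of the original notebook.
# Assumption: catalog and schema names contain only letters, digits,
# underscores, and hyphens.
_IDENTIFIER = re.compile(r"[A-Za-z0-9_-]+")


def quote_identifier(name: str) -> str:
    """Wrap `name` in backticks, or raise if it has unexpected characters."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Unexpected characters in identifier: {name!r}")
    return f"`{name}`"


def qualified_name(*parts: str) -> str:
    """Join identifier parts into a dotted, backtick-quoted name."""
    return ".".join(quote_identifier(part) for part in parts)


print(qualified_name("main", "dbrx_12daysofdemos"))
```

The quoted result can be dropped into the same f-strings used in this notebook, for example `f"CREATE SCHEMA IF NOT EXISTS {qualified_name(CATALOG, SCHEMA)}"`.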

### 📊 Step 2: Explore the Raw Letter Data

Let's look at our raw letter data to understand why we need governance.

This raw data is NOT safe to share with:

* External dashboards
* Demo environments
* Third-party applications

In [0]:
# RAW DATA - Contains visible PII!
print("⚠️ Raw data with visible PII - NOT safe to share!\n")

df = spark.sql(f"""
SELECT
  name,
  city,
  province,
  LEFT(letter, 150) AS letter_preview,
  gifts
FROM {source_table}
LIMIT 10
""")
display(df)

### 🔒 Step 3a: Building PII Protection Functions

What Are UC Scalar Functions?
Scalar functions take input values and return a single output value. Perfect for:

* Data masking (hiding sensitive parts)
* Data transformation (formatting, cleaning)
* Calculations (derived values)

Once created, these functions can be used everywhere - SQL, Python, dashboards!

**Example: `mask_name()` - Name Anonymization**

Children's names require special protection:

* Child Privacy Laws - Extra protection for minors
* Anonymization - Enable analytics without identifying individuals

How It Works:

* Input: Emma
* Output: E**a
* Shows first and last character only
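
The rule above can be sketched in plain Python to pin down the expected outputs before running anything on a cluster. This is only an illustrative mirror of the SQL function defined in this step: the name `mask_name_py` is hypothetical, and Python's `strip()` removes all surrounding whitespace where SQL `TRIM` removes spaces only.

```python
from typing import Optional


def mask_name_py(name: Optional[str]) -> Optional[str]:
    """Pure-Python mirror of the masking rule (hypothetical helper)."""
    # Missing or blank names have nothing to show, so return None
    if name is None or name.strip() == "":
        return None
    trimmed = name.strip()
    # Names of one or two characters are masked completely
    if len(trimmed) <= 2:
        return "*" * len(trimmed)
    # Otherwise keep the first and last character and mask the middle
    return trimmed[0] + "*" * (len(trimmed) - 2) + trimmed[-1]


for sample in ["Emma", "Alexander", "Jo"]:
    print(sample, "->", mask_name_py(sample))
```

Note the edge cases: names of one or two characters are fully masked, and blank or missing names come back as `None` (NULL in SQL).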

In [0]:
# Create the mask_name() function
spark.sql("""
CREATE OR REPLACE FUNCTION mask_name(name STRING)
RETURNS STRING
COMMENT 'Masks personal names for privacy - shows first and last character only'
RETURN
  CASE
    WHEN name IS NULL OR TRIM(name) = '' THEN NULL
    WHEN LENGTH(TRIM(name)) <= 2 THEN REPEAT('*', LENGTH(TRIM(name)))
    ELSE CONCAT(
      SUBSTRING(TRIM(name), 1, 1),
      REPEAT('*', LENGTH(TRIM(name)) - 2),
      SUBSTRING(TRIM(name), -1)
    )
  END
""")

print("✅ Created function: mask_name()")

In [0]:
# Test mask_name()
print("Testing mask_name() function:\n")

df = spark.sql("""
SELECT
  'Emma' AS original_name,
  mask_name('Emma') AS masked_name
UNION ALL
SELECT
  'Alexander' AS original_name,
  mask_name('Alexander') AS masked_name
UNION ALL
SELECT
  'Jo' AS original_name,
  mask_name('Jo') AS masked_name
""")
display(df)

### ➕ Step 3b: Aggregate Functions for Governed Queries

Users often need to answer questions like:

* "How many letters from Ontario?"
* "What's the most popular gift in Quebec?"

The Solution: Create functions that return governed, pre-aggregated data.

**Example: `get_province_summary()` - Governed Stats**

How It Works
* **Input:** Province name (e.g., 'Ontario'), matched case-insensitively
* **Output:** JSON with the letter count, the number of distinct cities, and up to 10 sample gifts

Why JSON Output?
* **Structured** - Consistent format every time
* **Controlled** - Only returns approved fields
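
Because the function returns a JSON string, callers parse it before using individual fields. A minimal sketch, assuming a hand-written payload whose values are made up for illustration; only the field names (`province`, `total_letters`, `unique_cities`, `sample_gifts`) come from the function definition in this step:

```python
import json

# Illustrative payload only: the values are invented, the field names follow
# the STRUCT defined in get_province_summary().
payload = (
    '{"province": "Ontario", "total_letters": 3, '
    '"unique_cities": 2, "sample_gifts": ["bicycle", "LEGO set"]}'
)

summary = json.loads(payload)

# Only the approved fields are present, so downstream code can rely on them
expected_fields = {"province", "total_letters", "unique_cities", "sample_gifts"}
assert set(summary) == expected_fields

print(
    f"{summary['province']}: {summary['total_letters']} letters "
    f"from {summary['unique_cities']} cities"
)
```

Checking the key set up front is one way to notice early if the function's output shape ever changes.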

In [0]:
# Create the get_province_summary() function
spark.sql(f"""
CREATE OR REPLACE FUNCTION get_province_summary(province_name STRING)
RETURNS STRING
COMMENT 'Returns JSON summary of letters from a specific province'
RETURN (
  SELECT TO_JSON(
    STRUCT(
      province_name AS province,
      COUNT(*) AS total_letters,
      COUNT(DISTINCT city) AS unique_cities,
      SLICE(COLLECT_SET(SUBSTRING(gifts, 1, 50)), 1, 10) AS sample_gifts
    )
  )
  FROM {source_table}
  WHERE UPPER(province) = UPPER(province_name)  -- case-insensitive match
)
""")

print("✅ Created function: get_province_summary()")

In [0]:
# Test get_province_summary()
print("🧪 Testing get_province_summary() function:\n")

df = spark.sql("SELECT get_province_summary('Ontario') AS ontario_stats")
display(df)

In [0]:
# Compare multiple provinces
print("🗺️ Province Comparison:\n")

df = spark.sql("""
SELECT 'Ontario' AS province, get_province_summary('Ontario') AS stats
UNION ALL
SELECT 'Quebec' AS province, get_province_summary('Quebec') AS stats
UNION ALL
SELECT 'British Columbia' AS province, get_province_summary('British Columbia') AS stats
""")
display(df)

### ⏺️ Step 3c: Table-Valued Functions for Safe Search

What Are Table-Valued Functions?

Unlike scalar functions (which return one value), **table-valued functions** return **entire result sets**.

Perfect for:
* Search functionality (find matching records)
* Filtered views (subset of data)
* Complex queries with automatic masking

**Example: `search_letters()` - Safe Keyword Search**

How It Works
* **Input:** Keyword to search (e.g., 'bicycle')
* **Output:** Results with AUTOMATICALLY MASKED names!

Built-In Safety
* Names are masked using our `mask_name()` function
* Letter previews have names replaced with masked versions
* Limited to 10 results per search
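
The search behaviour described above can be sketched over in-memory rows. This is a rough Python illustration, not the function itself: `search_letters_py`, `mask`, and the sample rows are hypothetical, and the sketch uses a literal string replacement for the name inside the letter.

```python
from typing import Dict, List, Optional


def mask(name: Optional[str]) -> Optional[str]:
    """First/last-character masking, mirroring the rule used by mask_name()."""
    if name is None or name.strip() == "":
        return None
    trimmed = name.strip()
    if len(trimmed) <= 2:
        return "*" * len(trimmed)
    return trimmed[0] + "*" * (len(trimmed) - 2) + trimmed[-1]


def search_letters_py(
    rows: List[Dict[str, str]], keyword: str, limit: int = 10
) -> List[Dict[str, Optional[str]]]:
    """Case-insensitive keyword search that only returns masked names."""
    needle = keyword.upper()
    results: List[Dict[str, Optional[str]]] = []
    for row in rows:
        letter = row.get("letter") or ""
        gifts = row.get("gifts") or ""
        # Keep rows where the keyword appears in the letter or the gift list
        if needle not in letter.upper() and needle not in gifts.upper():
            continue
        masked = mask(row.get("name"))
        # Replace the name inside the letter, then cut a 200-character preview
        preview = letter.replace(row["name"], masked) if masked else letter
        results.append(
            {
                "masked_name": masked,
                "city": row.get("city"),
                "province": row.get("province"),
                "letter_preview": preview[:200],
            }
        )
        if len(results) >= limit:
            break
    return results


# Made-up sample rows for illustration only
sample_rows = [
    {"name": "Emma", "city": "Toronto", "province": "Ontario",
     "letter": "Dear Santa, my name is Emma and I would like a bicycle.",
     "gifts": "bicycle"},
    {"name": "Jo", "city": "Montreal", "province": "Quebec",
     "letter": "Hi Santa, Jo here. LEGO please!", "gifts": "LEGO"},
]

for hit in search_letters_py(sample_rows, "bicycle"):
    print(hit)
```

The caller never sees the raw `name` column: masking happens inside the function, which is the property the table-valued function is meant to provide.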

In [0]:
# Create the search_letters() table-valued function
spark.sql(f"""
CREATE OR REPLACE FUNCTION search_letters(keyword STRING)
RETURNS TABLE(
  masked_name STRING,
  city STRING,
  province STRING,
  letter_preview STRING
)
COMMENT 'Searches letters by keyword and returns results with masked names'
RETURN
  SELECT
    mask_name(name) AS masked_name,
    city,
    province,
    -- Literal replacement: the name is data, not a regex pattern
    SUBSTRING(REPLACE(letter, name, mask_name(name)), 1, 200) AS letter_preview
  FROM {source_table}
  WHERE UPPER(letter) LIKE CONCAT('%', UPPER(keyword), '%')
     OR UPPER(gifts) LIKE CONCAT('%', UPPER(keyword), '%')
  LIMIT 10
""")

print("✅ Created function: search_letters()")

In [0]:
# Test search_letters() - Bicycles
print("🔍 Searching for: 'bicycle'\n")

df = spark.sql("SELECT * FROM search_letters('bicycle')")
display(df)

In [0]:
# Test search_letters() - LEGO
print("🔍 Searching for: 'LEGO'\n")

df = spark.sql("SELECT * FROM search_letters('LEGO')")
display(df)

In [0]:
# Test search_letters() - Nintendo
print("🔍 Searching for: 'Nintendo'\n")

df = spark.sql("SELECT * FROM search_letters('Nintendo')")
display(df)

### 🎯 Step 4: Putting It All Together

Verify that all three functions were created.

In [0]:
# List all functions we created
print("📋 UC Functions in our schema:\n")

df = spark.sql("SHOW USER FUNCTIONS")
display(df)

**🔒 Create a Governed View**

Combine our masking functions into a **reusable view** that returns masked names every time it is queried!
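
One thing to watch when replacing names inside free text: a name is data, not a pattern. The sketch below uses Python's `re` module as a stand-in for a regex-based replacement and compares it with a literal replacement. The letter text and the name `J.J.` are made up for illustration.

```python
import re

# Made-up example values
letter = "Hi Santa, I am J.J. and my friend is JAJB!"
name = "J.J."    # hypothetical name containing regex metacharacters
masked = "J**."  # first and last character kept, middle masked

# Regex replacement: each "." matches any character, so "JAJB" is replaced too
as_pattern = re.sub(name, masked, letter)

# Literal replacement: only the exact text "J.J." is replaced
as_literal = letter.replace(name, masked)

# Escaping the name gives the same literal behaviour through a regex API
as_escaped = re.sub(re.escape(name), masked, letter)

print(as_pattern)
print(as_literal)
print(as_escaped)
```

A literal replacement, or an escaped pattern, keeps the masking limited to the exact name stored in the row.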

In [0]:
# Create a masked view using our functions
masked_view = f"{full_schema}.holiday_letters_masked"

spark.sql(f"""
CREATE OR REPLACE VIEW {masked_view} AS
SELECT
  mask_name(name) AS child_name,
  city,
  province,
  -- Literal replacement: the name is data, not a regex pattern
  REPLACE(letter, name, mask_name(name)) AS letter,
  gifts
FROM {source_table}
""")

print(f"✅ Created governed view: {masked_view}")

In [0]:
# Query the masked view
print("🔒 Data from governed view (names masked):\n")

df = spark.sql(f"SELECT * FROM {masked_view} LIMIT 10")
display(df)

**📊 Analytics with Governed Data**

Run analytics on the masked view - names are already masked, so results are safer to share!

In [0]:
# Letters by province (safe to share!)
print("📊 Letters by Province:\n")

df = spark.sql(f"""
SELECT
  province,
  COUNT(*) AS letter_count,
  COUNT(DISTINCT city) AS unique_cities
FROM {masked_view}
GROUP BY province
ORDER BY letter_count DESC
""")
display(df)

In [0]:
# Top gift requests (anonymized)
print("🎁 Top Gift Requests (with anonymized requesters):\n")

df = spark.sql(f"""
SELECT
  gifts,
  COUNT(*) AS request_count,
  SLICE(COLLECT_LIST(child_name), 1, 5) AS sample_requesters
FROM {masked_view}
WHERE gifts IS NOT NULL
GROUP BY gifts
ORDER BY request_count DESC
LIMIT 15
""")
display(df)