# 🔐 Unity Catalog in Databricks

Unity Catalog is **Databricks' unified governance solution** for data, files, and AI assets in the Lakehouse. It centralizes access control, auditing, data discovery, and data lineage across **multiple Databricks workspaces** and **cloud providers**.

---

## 🧱 Architecture: 3-Level Namespace

Unity Catalog introduces a standardized namespace: **[catalog].[schema].[table]**

- **Catalog**: Top-level container in the namespace; a metastore holds one or more catalogs
- **Schema**: Equivalent to a database
- **Table/View**: Data objects stored in the schema

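As a minimal sketch of how the three-level namespace is used in queries (the names `sales_catalog`, `retail_schema`, and `transactions` are placeholders borrowed from the examples further down, not objects that necessarily exist in your workspace):

```sql
-- Fully qualified reference: catalog.schema.table
SELECT * FROM sales_catalog.retail_schema.transactions LIMIT 10;

-- Or set a default catalog and schema for the session, then use the short name
USE CATALOG sales_catalog;
USE SCHEMA retail_schema;
SELECT * FROM transactions LIMIT 10;

-- Check what unqualified names currently resolve against
SELECT current_catalog(), current_schema();
```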
---

## ✅ Key Benefits of Unity Catalog

| Feature | Description |
|--------|-------------|
| 🔐 Fine-Grained Access Control | Manage access at **catalog, schema, table, column, and row levels** |
| 📁 Unified Governance | Controls access to **tables, files (Volumes), ML models, notebooks, dashboards** |
| 🏢 Multi-Workspace Support | Share and control data across **multiple workspaces** in the same metastore |
| 📜 Built-in Data Lineage | Automatically captures and visualizes **data flow and dependencies** |
| 🔄 Delta Sharing | Share data with external consumers via **open protocol** (no Databricks required) |
| 📈 Central Auditing | Track **who accessed what, when, and how** via logs |
| 🔍 Data Discovery | Provides a **data catalog interface** for finding and understanding datasets |
| 🌐 IAM Integration | Maps **cloud-native identities** (Azure AD, AWS IAM, Google IAM) to Unity roles |
| 🧩 Row/Column-Level Security | Enforce detailed security policies dynamically at runtime |

---

## 🆚 Hive Metastore vs Unity Catalog

| Feature | Hive Metastore | Unity Catalog |
|--------|----------------|---------------|
| Access Control | Cluster-based | Centralized, fine-grained |
| Lineage | ❌ No | ✅ Yes |
| Auditing | ❌ Manual | ✅ Built-in |
| Cross-Workspace | ❌ No | ✅ Yes |
| Delta Sharing | ❌ No | ✅ Yes |
| Cloud Support | Single cloud | Multi-cloud (AWS, Azure, GCP) |
| Govern Files | ❌ No | ✅ Yes (via Volumes) |

---

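## 📜 SQL Permissions Example

```sql
-- Names below resolve against the current catalog (and schema);
-- qualify them as catalog.schema.object if needed.

-- Grant table access
GRANT SELECT ON TABLE finance.transactions TO `analyst_role`;

-- Grant schema usage (USE SCHEMA replaces the legacy USAGE privilege in Unity Catalog)
GRANT USE SCHEMA ON SCHEMA finance TO `finance_team`;

-- Grant read access to files in a volume (volumes use READ VOLUME / WRITE VOLUME)
GRANT READ VOLUME ON VOLUME raw_data TO `etl_engineer`;
```
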
# 📚 Types of Catalogs in Unity Catalog
---
| Type | Description |
| --- | --- |
| **Standard Catalog** | Default catalog type; its metadata is registered in the Unity Catalog metastore and its managed data lives in a managed storage location. |
| **Foreign Catalog** | Virtual catalog that mirrors an **external system** (e.g., Snowflake, AWS Glue). Metadata is read-only and reflects the external system's schema. |
| **Shared Catalog** | A catalog that is **shared** with your organization by another Databricks account using **Delta Sharing**. |

---

## 1. Standard Catalog
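- Managed directly by Unity Catalog; data for managed tables and volumes is stored in the managed storage location (cloud object storage such as S3, ADLS, or GCS) configured for the metastore, catalog, or schema.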

### Use Case:
- Used for native Databricks tables, views, volumes, models, etc.
``` sql 

-- create catalog
CREATE CATALOG IF NOT EXISTS sales_catalog
COMMENT 'Standard Unity Catalog for sales data';

-- create schema 
CREATE SCHEMA IF NOT EXISTS sales_catalog.retail_schema
COMMENT 'Retail transactions and lookup tables';

-- create table 
CREATE TABLE sales_catalog.retail_schema.transactions (
  transaction_id STRING,
  customer_id STRING,
  amount DOUBLE,
  transaction_date DATE
)
COMMENT 'Retail transaction records';

-- select table
SELECT * FROM sales_catalog.retail_schema.transactions;

-- grant permission 
GRANT SELECT, INSERT 
ON TABLE sales_catalog.retail_schema.transactions 
TO `analyst_role`;

-- drop table 
DROP TABLE IF EXISTS sales_catalog.retail_schema.transactions;

-- drop schema or catalog 
DROP SCHEMA IF EXISTS sales_catalog.retail_schema CASCADE;

DROP CATALOG IF EXISTS sales_catalog CASCADE;


```
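To confirm that the objects and grants above look the way you expect, a few inspection statements can help. This is a sketch that assumes it is run before the `DROP` statements at the end of the block above; the `REVOKE` is included only to show how a grant is undone.

```sql
-- List what exists under the catalog and schema
SHOW SCHEMAS IN sales_catalog;
SHOW TABLES IN sales_catalog.retail_schema;

-- See which principals hold which privileges on the table
SHOW GRANTS ON TABLE sales_catalog.retail_schema.transactions;

-- Take back a privilege that is no longer needed
REVOKE INSERT ON TABLE sales_catalog.retail_schema.transactions FROM `analyst_role`;
```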
---

## 2. Foreign Catalog
- References data and metadata that live in an external system (through Lakehouse Federation), for example:
  - AWS Glue Data Catalog
  - Relational databases such as PostgreSQL, MySQL, or SQL Server
  - Snowflake (read-only)

### Use Case:
- For read-only access to existing external metadata without duplicating it.
- Central governance across hybrid data platforms.

```sql 

CREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>
OPTIONS [(database '<database-name>') | (catalog '<external-catalog-name>')];

SELECT * FROM glue_catalog.external_db.customers;
```
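The `<connection-name>` in the syntax above has to exist first. As a hedged sketch of the full flow for a PostgreSQL source: the connection name `pg_conn`, the catalog name `pg_sales`, and the secret scope and keys (`federation`, `pg-user`, `pg-password`) are all hypothetical, and the host and database remain placeholders to fill in.

```sql
-- Step 1: register how to reach the external system (credentials come from a secret scope)
CREATE CONNECTION IF NOT EXISTS pg_conn TYPE postgresql
OPTIONS (
  host '<hostname>',
  port '5432',
  user secret('federation', 'pg-user'),
  password secret('federation', 'pg-password')
);

-- Step 2: mirror one external database as a foreign catalog
CREATE FOREIGN CATALOG IF NOT EXISTS pg_sales
USING CONNECTION pg_conn
OPTIONS (database '<database-name>');

-- Step 3: browse it like any other catalog
SHOW SCHEMAS IN pg_sales;
```

Keeping credentials in a secret scope rather than inline avoids exposing them in notebooks or query history.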

### How to create a foreign catalog in Databricks: 
- https://www.youtube.com/watch?v=fjxjK-jvRng

---
## 3. Shared Catalog
- Appears when another Databricks account shares data with you via Delta Sharing.

### Use Case:
- Consuming shared data securely without needing raw file access.
- Used across organizations or business units.
``` sql

CREATE CATALOG [IF NOT EXISTS] <catalog-name>
USING SHARE <provider-name>.<share-name>
[ COMMENT <comment> ];


SELECT * FROM shared_data_provider.sales.monthly_revenue;
```
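On the recipient side, the provider and share names can be discovered before the catalog is created. A minimal sketch, assuming a provider named `acme_provider` that exposes a share named `sales_share` (both names are hypothetical):

```sql
-- Which providers have shared data with this metastore?
SHOW PROVIDERS;

-- Which shares does a given provider expose?
SHOW SHARES IN PROVIDER acme_provider;

-- Mount the share as a read-only catalog
CREATE CATALOG IF NOT EXISTS acme_sales
USING SHARE acme_provider.sales_share
COMMENT 'Read-only catalog backed by a Delta Sharing share';

-- Browse the shared schemas
SHOW SCHEMAS IN acme_sales;
```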
----
## 🔄 Summary Table

| Type         | Source                          | Write Support | Editable | Typical Use Case                       |
| ------------ | ------------------------------- | ------------- | -------- | -------------------------------------- |
| **Standard** | Unity Catalog Metastore         | ✅ Yes         | ✅ Yes    | Native Databricks data governance      |
| **Foreign**  | External metastore (e.g., Glue) | ❌ No          | ❌ No     | Read-only access to external metadata  |
| **Shared**   | External UC via Delta Sharing   | ❌ No          | ❌ No     | Access shared data from other accounts |

----

# Create a Catalog on an External Location
- Set up a Unity Catalog-managed catalog that stores data in a custom ADLS Gen2 path via an external location.

---
## 1. Create Storage Credential (using a managed identity)
``` sql 

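-- Illustrative syntax: check your workspace's SQL reference before running.
-- Storage credentials are commonly created through Catalog Explorer,
-- the Databricks CLI, or the REST API rather than SQL.
CREATE STORAGE CREDENTIAL adls_cred
WITH AZURE_MANAGED_IDENTITY '<client-id>'
-- Or use a service principal:
-- WITH AZURE_SERVICE_PRINCIPAL (
--     CLIENT_ID '<client-id>',
--     CLIENT_SECRET '<client-secret>'
-- )
-- TENANT_ID '<tenant-id>'
COMMENT 'ADLS Gen2 access for Unity Catalog external location';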

```
---

## 2. Create External Location
```sql

CREATE EXTERNAL LOCATION adls_ext_location
URL 'abfss://<container-name>@<account-name>.dfs.core.windows.net/uc-catalogs/sales_catalog/'
WITH STORAGE CREDENTIAL adls_cred
COMMENT 'External location for Unity Catalog catalog in ADLS Gen2';

```
----
## 3. Grant Permissions on the External Location
``` sql 

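-- File-level access for the ETL role
GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION adls_ext_location TO `etl_engineer_role`;

-- Needed by whoever creates a catalog or schema with a MANAGED LOCATION under this path
GRANT CREATE MANAGED STORAGE ON EXTERNAL LOCATION adls_ext_location TO `etl_engineer_role`;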

```
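Before building a catalog on top of the external location, it can be worth checking that the location is registered, that the grants landed, and that the credential can actually reach the storage path. A sketch using the names from the steps above; the URL placeholders are the same ones used in step 2:

```sql
-- Confirm the location is registered and inspect its URL and credential
SHOW EXTERNAL LOCATIONS;
DESCRIBE EXTERNAL LOCATION adls_ext_location;

-- Confirm who can do what on it
SHOW GRANTS ON EXTERNAL LOCATION adls_ext_location;

-- Smoke test: list the path through the storage credential
LIST 'abfss://<container-name>@<account-name>.dfs.core.windows.net/uc-catalogs/sales_catalog/';
```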
---

## 4. Create the Catalog Using the External Location
```sql

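-- The managed location must fall under a registered external location
-- (here, the URL of adls_ext_location from step 2).
CREATE CATALOG sales_catalog
MANAGED LOCATION 'abfss://<container-name>@<account-name>.dfs.core.windows.net/uc-catalogs/sales_catalog/'
COMMENT 'Catalog with managed storage in ADLS Gen2';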

CREATE SCHEMA sales_catalog.transactions_schema;

CREATE TABLE sales_catalog.transactions_schema.orders (
  order_id STRING,
  customer_id STRING,
  order_date DATE,
  total_amount DOUBLE
);

```
---
## 5. Create a Schema with a Managed Location Different from the Catalog's
``` sql

-- This path must also be covered by a registered external location
CREATE SCHEMA sales_catalog.txn_schema
MANAGED LOCATION 'abfss://bronze@myadlsacc.dfs.core.windows.net/otherpath/txns/';

```
---
## 6. Create an External Table
``` sql

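-- Assumes sales_catalog.raw_schema already exists and that the LOCATION below
-- is covered by a registered external location.
CREATE TABLE sales_catalog.raw_schema.orders_external (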
    order_id STRING,
    customer_id STRING,
    order_date DATE,
    total_amount DOUBLE
)
USING DELTA
LOCATION 'abfss://bronze@yourstorageacc.dfs.core.windows.net/externaltables/orders/';
```
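To see where Unity Catalog actually placed things after the steps above, the `DESCRIBE ... EXTENDED` variants report storage locations and table types. A sketch using the object names from steps 4 to 6; the output should show the managed table under the catalog's or schema's managed location and the external table at its explicit `LOCATION`. Dropping an external table removes only the metadata and leaves the files in place.

```sql
-- Managed location configured for the catalog and for the schema override
DESCRIBE CATALOG EXTENDED sales_catalog;
DESCRIBE SCHEMA EXTENDED sales_catalog.txn_schema;

-- Managed table: Type is MANAGED, location is chosen by Unity Catalog
DESCRIBE TABLE EXTENDED sales_catalog.transactions_schema.orders;

-- External table: Type is EXTERNAL, location is the path given in CREATE TABLE
DESCRIBE TABLE EXTENDED sales_catalog.raw_schema.orders_external;
```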
----


# 📦 What is a Volume in Databricks Unity Catalog?
- A Volume in Databricks Unity Catalog is a governed, file-based storage space where you can store and manage files and directories (such as CSVs, JSON, images, and PDFs) within a schema, under a catalog.
- Think of a Volume as a governed data lake folder with access control and audit logging, like a managed folder inside Unity Catalog.

## 🧱 Structure
- A volume is defined using the 3-level namespace: **[catalog].[schema].[volume]**
---
## ✅ Use Cases for Volumes
| Use Case                       | Description                         |
| ------------------------------ | ----------------------------------- |
| Store semi-structured data     | CSV, JSON, XML, etc. for processing |
| Store binary files             | PDFs, ZIPs, images, audio, video    |
| Store ML artifacts             | Models, embeddings, configs         |
| Staging area for ingestion     | Drop zone for raw files before ETL  |
| Source files for tables        | Hold raw files that are later loaded into tables (volume paths cannot overlap table locations) |
---

## 🧾 How to Create a Volume
```sql

CREATE VOLUME main.shared_assets.docs
COMMENT 'Stores unstructured document files';
```
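The statement above creates a managed volume. As a sketch of two related operations: creating an external volume over an existing storage path, and reading files back through the `/Volumes/<catalog>/<schema>/<volume>/` path. The volume name `landing` and the storage URL are hypothetical, and the query assumes CSV files with a header row have been uploaded to `docs`.

```sql
-- External volume: governs files that already live at a path
-- covered by a registered external location
CREATE EXTERNAL VOLUME IF NOT EXISTS main.shared_assets.landing
LOCATION 'abfss://<container-name>@<account-name>.dfs.core.windows.net/landing/'
COMMENT 'Drop zone for raw files';

-- Browse files in a volume by path
LIST '/Volumes/main/shared_assets/docs/';

-- Query files directly without first creating a table
SELECT *
FROM read_files(
  '/Volumes/main/shared_assets/docs/',
  format => 'csv',
  header => true
)
LIMIT 10;
```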

## 🔐 Governance
- Volumes support RBAC (Role-Based Access Control), auditing, and Unity Catalog permissions, just like tables.
```sql

-- Volumes use READ VOLUME / WRITE VOLUME (READ FILES / WRITE FILES apply to external locations)
GRANT READ VOLUME, WRITE VOLUME
ON VOLUME main.shared_assets.images
TO `ml_team`;
```
## 🔄 Deleting a Volume
```sql

DROP VOLUME main.shared_assets.docs;
```

---
## 🔍 Volume vs External Location vs Table
| Feature            | Volume                       | External Location      | Table (Managed)         |
| ------------------ | ---------------------------- | ---------------------- | ----------------------- |
| Stores             | Files                        | Files                  | Structured tabular data |
| Access control     | Unity Catalog RBAC           | Unity Catalog RBAC     | Unity Catalog RBAC      |
| Governance         | ✅ Yes                        | ✅ Yes                  | ✅ Yes                   |
| Structured queries | ❌ No (need to load into DF)  | ❌ No (same)            | ✅ Yes (via SQL)         |
| Best for           | Semi-structured/unstructured | Data lake integrations | SQL-first datasets      |
---




# 📍 What is Lineage in Databricks?
- Data lineage shows:
  - Where data came from (source tables/files)
  - How it was transformed (queries, notebooks, jobs)
  - Where it was written to (target tables or files)
- It works across SQL, Python, Delta Live Tables, dbt, and jobs.
---
## 🔧 Supported Sources for Lineage
| Tool                    | Lineage Captured?          |
| ----------------------- | -------------------------- |
| SQL (Notebook or Query) | ✅ Yes                      |
| Python (Spark APIs)     | ✅ Yes                      |
| Delta Live Tables       | ✅ Yes                      |
| Jobs & Workflows        | ✅ Yes                      |
| dbt in Databricks       | ✅ Yes (native integration) |
| Pandas / raw file ops   | ❌ No (not tracked)         |
---
## Lineage System Tables
```sql

-- Table-level lineage events
SELECT * FROM system.access.table_lineage;
-- Column-level lineage events
SELECT * FROM system.access.column_lineage;
```
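In practice these tables are usually filtered rather than scanned whole. The sketch below asks two typical questions: what is downstream of a table, and what feeds a column. The table name and the `amount` column are placeholders from earlier examples, and the column names used here (`source_table_full_name`, `target_table_full_name`, `entity_type`, `created_by`, `event_date`, `source_column_name`, `target_column_name`) reflect the system table schema as I understand it; run `DESCRIBE TABLE system.access.table_lineage` to confirm them in your workspace.

```sql
-- Which tables were written from this source in the last 30 days, and by what?
SELECT DISTINCT
  target_table_full_name,
  entity_type,
  created_by
FROM system.access.table_lineage
WHERE source_table_full_name = 'sales_catalog.retail_schema.transactions'
  AND target_table_full_name IS NOT NULL
  AND event_date >= date_sub(current_date(), 30);

-- Which upstream columns feed a given target column?
SELECT DISTINCT
  source_table_full_name,
  source_column_name
FROM system.access.column_lineage
WHERE target_table_full_name = 'sales_catalog.retail_schema.transactions'
  AND target_column_name = 'amount';
```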

# 📘 What is Delta Sharing?
- Delta Sharing is an open protocol for secure data sharing. It allows you to share live data (like Delta tables or views) with external partners, other Databricks workspaces, or non-Databricks clients, without copying data.
---
## 🔄 Sharing Models
| Sharing Type                 | Use Case                                                 | Consumer / Mechanism |
| ---------------------------- | -------------------------------------------------------- | -------------------- |
| **Open Sharing**             | Share with external tools (e.g., pandas, Power BI, etc.) | External recipients  |
| **Databricks-to-Databricks** | Share between workspaces/accounts                        | Other Unity Catalog  |
| **Internal Sharing**         | Share across teams in the same workspace (plain grants, no share needed) | Unity Catalog RBAC |
---

## 🔐 Delta Sharing Roles
| Role                | Description                                          |
| ------------------- | ---------------------------------------------------- |
| **Provider**        | You (the data owner) who shares data                 |
| **Recipient**       | External party (can be Databricks or non-Databricks) |
| **Share**           | Logical object grouping tables/views to be shared    |
| **Recipient Token** | Secure access credential for recipients              |

---

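The provider, share, and recipient roles above map onto a handful of SQL statements on the provider side. A minimal sketch for open sharing, assuming a share named `sales_share` and a recipient named `partner_co` (both hypothetical) and reusing the `transactions` table from earlier examples:

```sql
-- Provider: create the share and add a table to it
CREATE SHARE IF NOT EXISTS sales_share
COMMENT 'Monthly sales data for partners';

ALTER SHARE sales_share
ADD TABLE sales_catalog.retail_schema.transactions;

-- Provider: create a recipient (no sharing identifier given, so a token is issued)
CREATE RECIPIENT IF NOT EXISTS partner_co
COMMENT 'External partner using open sharing';

-- Provider: give the recipient read access to the share
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_co;

-- Provider: retrieve the activation details and review what the share contains
DESCRIBE RECIPIENT partner_co;
SHOW ALL IN SHARE sales_share;
```
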
# 🛡️ Data Masking in Databricks (Unity Catalog)
- Data masking in Databricks is implemented using fine-grained access control features in Unity Catalog, including:
  - Dynamic views (fine-grained access control)
  - Row-level security (RLS) and column-level access control (CLAC)
  - Column masks (native masking applied with `SET MASK`; availability depends on your workspace and runtime)

## ✅ Option 1: Column Masking Using Dynamic Views (Common Pattern)
- You create a view that applies masking logic based on the user's group or role.
```sql

CREATE OR REPLACE VIEW secure_view AS
SELECT
  user_id,
  CASE
    WHEN current_user() IN ('admin@databricks.com') THEN email
    ELSE '***MASKED***'
  END AS email
FROM main.sales.customers;
```
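The view above keys on a single user name. A group-based variant is usually easier to maintain, since access changes by editing group membership instead of the view. A sketch, where the view name `customers_masked` and the groups `pii_readers` and `analysts` are hypothetical:

```sql
-- Unmask the email only for members of an account-level group
CREATE OR REPLACE VIEW main.sales.customers_masked AS
SELECT
  user_id,
  CASE
    WHEN is_account_group_member('pii_readers') THEN email
    ELSE '***MASKED***'
  END AS email
FROM main.sales.customers;

-- Grant access to the view only; consumers should not hold SELECT on the base table
GRANT SELECT ON VIEW main.sales.customers_masked TO `analysts`;
```
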
## ✅ Option 2: Unity Catalog Column Masking (Preview or GA)
- If you're using Unity Catalog column-level security, you can apply policy-based masking.
```sql

-- Step 1: Create a masking function (a SQL UDF) that returns the masked value
CREATE OR REPLACE FUNCTION main.hr.mask_email(email STRING)
RETURNS STRING
RETURN CASE
    WHEN is_account_group_member('privileged_users') THEN email
    ELSE '***MASKED***'
  END;

-- Step 2: Apply the mask to a column
ALTER TABLE main.hr.employees
ALTER COLUMN email
SET MASK main.hr.mask_email;
```
## ✅ Option 3: Use Row-Level Security + Masking
- You can combine RLS + masking for more complex access control:
```sql

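CREATE OR REPLACE VIEW secure_employees AS
SELECT
  * EXCEPT (ssn),  -- exclude the raw column so only the masked value is exposed
  CASE
    WHEN is_member('hr_admin') THEN ssn
    ELSE NULL
  END AS masked_ssn
FROM main.hr.employees
-- Row-level rule: everyone sees HR rows; the executive team sees all rows
WHERE department = 'HR' OR is_member('executive_team');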
```
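The same row-level rule can also be attached to the table itself with a row filter, so it applies however the table is queried rather than only through a view. A sketch under the assumption that row filters are available in your workspace; the function name `hr_row_filter` is hypothetical:

```sql
-- A row filter is a SQL UDF returning BOOLEAN; rows where it returns TRUE stay visible
CREATE OR REPLACE FUNCTION main.hr.hr_row_filter(department STRING)
RETURNS BOOLEAN
RETURN department = 'HR' OR is_account_group_member('executive_team');

-- Attach the filter, passing the table's department column as the argument
ALTER TABLE main.hr.employees
SET ROW FILTER main.hr.hr_row_filter ON (department);

-- Remove the filter again if needed
ALTER TABLE main.hr.employees DROP ROW FILTER;
```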