# Lab 2: Direct Lake with Big Data - Billion Row Analytics

## Lab Overview

This lab demonstrates Direct Lake's **enterprise-scale capabilities** by working with **billion-row datasets**. You'll learn how Direct Lake handles massive data volumes, create OneLake shortcuts for cross-workspace data access, and understand when Direct Lake falls back to SQL Endpoint mode.

### What You'll Build

**Workshop Flow:**
```
1. Setup Big Data Environment
   ↓
2. Create OneLake Shortcuts
   ↓
3. Build Billion-Row Semantic Model
   ↓
4. Performance Testing & Tracing
   ↓
5. Analyze Fallback Scenarios
   ↓
6. Optimize for Scale
```

### Key Concepts
- **OneLake Shortcuts**: Access data across workspaces without duplication
- **Direct Lake Guardrails**: Understanding billion-row table limits
- **Fallback Behavior**: When Direct Lake falls back to DirectQuery through the SQL Endpoint
- **Column Temperature**: Memory management for large datasets

### Learning Objectives
By completing this lab, you'll be able to:
- ✅ Create OneLake shortcuts for cross-workspace data access
- ✅ Build semantic models with billion-row fact tables
- ✅ Monitor Direct Lake performance with advanced tracing
- ✅ Understand and troubleshoot fallback scenarios
- ✅ Optimize queries for massive datasets

### Dataset Scale
| Table | Rows | Purpose |
|:------|:-----|:--------|
| **fact_myevents_1bln** | 1 billion | Standard fact table |
| **fact_myevents_1bln_no_vorder** | 1 billion | Performance comparison without V-Order |
| **fact_myevents_2bln** | 2 billion | Stress test limits |
| **fact_myevents_1bln_partitioned_datekey** | 1 billion | Optimized with partitioning |
| **dim_Date** | ~3,650 | Date dimension |
| **dim_Geography** | ~200 | Geography dimension |

**Estimated Time**: 60-90 minutes  
**Prerequisites**: Lab 1 completion, access to Big Data workspace

---

## 1. Install Required Libraries

Install Semantic Link Labs, the library this lab uses for OneLake shortcut management and semantic model configuration.

Working with billion-row tables introduces unique challenges:
- **Memory management** becomes critical
- **Query optimization** requires detailed tracing
- **Fallback behavior** needs monitoring
- **Cross-workspace access** requires shortcut expertise

**Expected outcome**: Library installed and ready for enterprise-scale analytics and performance monitoring.

In [None]:
%pip install -q --disable-pip-version-check semantic-link-labs

## 2. Import Libraries and Set Variables

Import required libraries and configure environment variables for big data workspace and region-aware data source selection.

In [None]:
import sempy_labs as labs
from sempy import fabric
import sempy
import pandas
import json
import time

LakehouseName = "BigData"
SemanticModelName = f"{LakehouseName}_model"

capacity_name = labs.get_capacity_name()

Shortcut_LakehouseName = "BigDemoDB"
Shortcut_WorkspaceName = "DL Labs - Data [North Central US]"
if capacity_name == "FabConUS8-P1":
    Shortcut_WorkspaceName = "DL Labs - Data [West US 3]"


## 3. Create Lakehouse for Big Data

Create a lightweight lakehouse that will use OneLake shortcuts to access billion-row tables without data duplication.

#### Traditional Copy Approach (❌ Expensive):
```
Source Workspace        Target Workspace
├── BigDemoDB           ├── BigData
│   ├── 1bln_table ──── │   ├── 1bln_table (COPY - Expensive!)
│   └── 2bln_table ──── │   └── 2bln_table (COPY - Expensive!)
```

#### OneLake Shortcut Approach (✅ Recommended):
```
Source Workspace        Target Workspace  
├── BigDemoDB           ├── BigData
│   ├── 1bln_table ←─── │   ├── 1bln_table (SHORTCUT - Efficient!)
│   └── 2bln_table ←─── │   └── 2bln_table (SHORTCUT - Efficient!)
```

### Expected Infrastructure
After execution, you'll have:
- ✅ **Empty lakehouse** named "BigData" ready for shortcuts
- ✅ **Workspace identifiers** for cross-workspace operations
- ✅ **Foundation** for billion-row data access

🎯 **Infrastructure checkpoint**: Lakehouse ready for big data shortcuts!

In [None]:
lakehouses=labs.list_lakehouses()["Lakehouse Name"]
if LakehouseName in lakehouses.values:
    lakehouseId = notebookutils.lakehouse.getWithProperties(LakehouseName)["id"]
else:
    lakehouseId = fabric.create_lakehouse(LakehouseName)

workspaceId = notebookutils.lakehouse.getWithProperties(LakehouseName)["workspaceId"]
workspaceName = sempy.fabric.resolve_workspace_name(workspaceId)
print(f"WorkspaceId = {workspaceId}, LakehouseID = {lakehouseId}, Workspace Name = {workspaceName}")

## 4. Create OneLake Shortcuts for Big Data Access

### Understanding OneLake Shortcuts
**OneLake shortcuts** are virtual connections that provide seamless access to data across Fabric workspaces without physical data movement:

#### Shortcut Benefits for Big Data:
- **📊 Instant access**: No waiting for billion-row data copies
- **💰 Zero storage cost**: Reference data without duplication  
- **🔄 Real-time consistency**: Always access current data
- **🛡️ Security inheritance**: Maintains source permissions
- **⚡ Performance**: Direct Delta Lake access

### Shortcut Cleanup Strategy
```python
for index, row in labs.lakehouse.list_shortcuts(lakehouse=LakehouseName).iterrows():
    labs.lakehouse.delete_shortcut(shortcut_name=row["Shortcut Name"], lakehouse=LakehouseName)
```
- **🧹 Clean slate**: Removes existing shortcuts to prevent conflicts
- **🔄 Idempotent**: Allows multiple runs without issues
- **✅ Consistent**: Ensures predictable shortcut configuration

### Big Data Tables Being Connected

#### Fact Tables (Billion+ Rows):
| Shortcut Name | Purpose | Rows | Special Features |
|---------------|---------|------|------------------|
| **fact_myevents_1bln** | Standard billion-row analytics | ~1B | V-Order optimized |
| **fact_myevents_1bln_no_vorder** | Performance comparison | ~1B | Without V-Order |
| **fact_myevents_1bln_partitioned_datekey** | Partitioning benefits | ~1B | Date-key partitioned |
| **fact_myevents_2bln** | Stress testing limits | ~2B | Maximum scale test |

#### Dimension Tables:
| Shortcut Name | Purpose | Rows | Relationship Target |
|---------------|---------|------|-------------------|
| **dim_Date** | Time dimension | ~3,650 | DateKey relationships |
| **dim_Geography** | Location dimension | ~200 | GeographyID relationships |

### OneLake Shortcut Creation Process
```python
labs.lakehouse.create_shortcut_onelake(
    table_name="fact_myevents_1bln",
    source_lakehouse=Shortcut_LakehouseName,
    source_workspace=Shortcut_WorkspaceName,
    destination_lakehouse=LakehouseName
)
```

#### Parameters Explained:
- **`table_name`**: The exact table name in the source lakehouse
- **`source_lakehouse`**: "BigDemoDB" - contains the billion-row datasets
- **`source_workspace`**: Region-specific workspace for optimal performance
- **`destination_lakehouse`**: "BigData" - our local workspace lakehouse
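
The same call is repeated once per table, so it can be driven from a list of table names. The sketch below assumes a hypothetical helper, `create_shortcuts`, that receives the creation function as an argument. In the notebook you would pass `labs.lakehouse.create_shortcut_onelake`; here a stand-in recorder (`fake_create`, also hypothetical) keeps the example runnable outside Fabric.

```python
from typing import Callable, Iterable, List

# Table names taken from the tables listed above.
TABLES = [
    "fact_myevents_1bln",
    "fact_myevents_1bln_no_vorder",
    "fact_myevents_1bln_partitioned_datekey",
    "fact_myevents_2bln",
    "dim_Date",
    "dim_Geography",
]

def create_shortcuts(table_names: Iterable[str], create_fn: Callable[..., None],
                     source_lakehouse: str, source_workspace: str,
                     destination_lakehouse: str) -> List[str]:
    """Call create_fn once per table and return the names that were processed."""
    created = []
    for name in table_names:
        create_fn(
            table_name=name,
            source_lakehouse=source_lakehouse,
            source_workspace=source_workspace,
            destination_lakehouse=destination_lakehouse,
        )
        created.append(name)
    return created

# Stand-in for labs.lakehouse.create_shortcut_onelake (a recorder, not a Fabric call).
calls = []
def fake_create(**kwargs):
    calls.append(kwargs)

created = create_shortcuts(TABLES, fake_create, "BigDemoDB",
                           "DL Labs - Data [North Central US]", "BigData")
print(f"{len(created)} shortcuts requested")
```

Keeping the table names in one list means adding or removing a shortcut is a one-line change.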

### Expected Outcome
After execution, you'll see:
```
Deleted shortcut existing_table_1    # Cleanup messages
Deleted shortcut existing_table_2
Adding shortcuts complete.           # Success confirmation
```

Your lakehouse now has **virtual access** to billions of rows without consuming local storage!

🎯 **Big data access checkpoint**: Billion-row tables accessible via zero-copy shortcuts!

In [None]:
#1. Remove any existing shortcuts
for index, row in labs.lakehouse.list_shortcuts(lakehouse=LakehouseName).iterrows():
    labs.lakehouse.delete_shortcut(shortcut_name=row["Shortcut Name"],lakehouse=LakehouseName)
    print(f"Deleted shortcut {row['Shortcut Name']}")

#2. Creates correct shortcuts
labs.lakehouse.create_shortcut_onelake(table_name="fact_myevents_1bln"                      ,source_lakehouse=Shortcut_LakehouseName,source_workspace=Shortcut_WorkspaceName,destination_lakehouse=LakehouseName)
labs.lakehouse.create_shortcut_onelake(table_name="fact_myevents_1bln_no_vorder"            ,source_lakehouse=Shortcut_LakehouseName,source_workspace=Shortcut_WorkspaceName,destination_lakehouse=LakehouseName)
labs.lakehouse.create_shortcut_onelake(table_name="fact_myevents_1bln_partitioned_datekey"  ,source_lakehouse=Shortcut_LakehouseName,source_workspace=Shortcut_WorkspaceName,destination_lakehouse=LakehouseName)
labs.lakehouse.create_shortcut_onelake(table_name="fact_myevents_2bln"                      ,source_lakehouse=Shortcut_LakehouseName,source_workspace=Shortcut_WorkspaceName,destination_lakehouse=LakehouseName)
labs.lakehouse.create_shortcut_onelake(table_name="dim_Date"                                ,source_lakehouse=Shortcut_LakehouseName,source_workspace=Shortcut_WorkspaceName,destination_lakehouse=LakehouseName)
labs.lakehouse.create_shortcut_onelake(table_name="dim_Geography"                           ,source_lakehouse=Shortcut_LakehouseName,source_workspace=Shortcut_WorkspaceName,destination_lakehouse=LakehouseName)

print('Adding shortcuts complete.')

## 5. Synchronize Big Data Table Metadata

Forces a lakehouse metadata refresh so the SQL Endpoint recognizes the billion-row shortcuts and their schemas.

#### Sync Process:
1. **🔍 Endpoint lookup**: Read the lakehouse properties to find its SQL Endpoint ID
2. **🔄 Trigger sync**: Send a REST API call that refreshes lakehouse metadata and table schemas
3. **📊 Progress Monitoring**: Poll batch status every second until success
4. **✅ Completion Validation**: Confirm all billion-row tables are properly cataloged

#### Why REST API vs. Automatic?
- **⏱️ Timing control**: Force immediate sync rather than waiting for background processes
- **🔍 Visibility**: Real-time progress monitoring for large metadata operations
- **🛡️ Reliability**: Guaranteed completion before proceeding to model creation
- **🔄 Repeatability**: Can be re-run if any step fails

#### Big Data Considerations:
- **Longer sync times**: Billion-row tables require more metadata processing
- **Resource usage**: Background jobs may consume more compute during sync
- **Cross-workspace complexity**: Shortcut metadata requires additional validation

**Expected behavior**: Periodic "running" status updates followed by "success" for complete metadata sync.
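
A polling loop like this is safer with an upper time limit, so a sync that never reports success cannot block the notebook. The following is a minimal sketch under stated assumptions: `poll_until_success` and its parameters are hypothetical names, and the status source is simulated with an iterator rather than the real REST endpoint.

```python
import time
from typing import Callable

def poll_until_success(get_state: Callable[[], str], interval_s: float = 1.0,
                       timeout_s: float = 600.0, sleep=time.sleep,
                       clock=time.monotonic) -> int:
    """Poll get_state until it returns "success" and return the number of polls.

    Raises TimeoutError if the deadline passes before success is reported.
    """
    deadline = clock() + timeout_s
    polls = 0
    while True:
        state = get_state()
        polls += 1
        print(f"Metadata refresh : {state}")
        if state == "success":
            return polls
        if clock() >= deadline:
            raise TimeoutError(f"no success within {timeout_s}s (last state: {state})")
        sleep(interval_s)

# Simulated status sequence standing in for the batch status endpoint.
states = iter(["running", "running", "success"])
polls = poll_until_success(lambda: next(states), sleep=lambda s: None)
print(f"Completed after {polls} polls")
```

Passing `sleep` and `clock` as arguments is only a testing convenience; in the notebook the defaults would be used.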

In [None]:
##https://medium.com/@sqltidy/delays-in-the-automatically-generated-schema-in-the-sql-analytics-endpoint-of-the-lakehouse-b01c7633035d

def triggerMetadataRefresh():
    client = fabric.FabricRestClient()
    response = client.get(f"/v1/workspaces/{workspaceId}/lakehouses/{lakehouseId}")
    sqlendpoint = response.json()['properties']['sqlEndpointProperties']['id']

    # trigger sync
    uri = f"/v1.0/myorg/lhdatamarts/{sqlendpoint}"
    payload = {"commands":[{"$type":"MetadataRefreshExternalCommand"}]}
    response = client.post(uri,json= payload)
    batchId = response.json()['batchId']

    # Monitor progress: poll once per second until the batch reports success.
    # The 10-minute cap is an assumed safety limit so a stuck sync cannot loop forever.
    statusuri = f"/v1.0/myorg/lhdatamarts/{sqlendpoint}/batches/{batchId}"
    deadline = time.time() + 600
    while True:
        progressState = client.get(statusuri).json()['progressState']
        print(f"Metadata refresh : {progressState}")
        if progressState == "success":
            break
        if time.time() > deadline:
            raise TimeoutError(f"Metadata refresh did not succeed within 10 minutes (last state: {progressState})")
        time.sleep(1)

    print('Metadata refresh complete')

triggerMetadataRefresh()

## 6. Create Big Data Direct Lake Semantic Model

Creates semantic model with automatic discovery of billion-row shortcut tables.

In [None]:
Discovers lakehouse tables and creates semantic model with retry logic for robust big data deployment.

## 7. Configure Relationships for Big Data Analytics

Establishes efficient many-to-one relationships between billion-row facts and dimension tables.
- **⚡ Index optimization**: Proper relationships enable Direct Lake query optimization
- **🧠 Memory planning**: Relationship columns will be loaded into memory during queries

### Multi-Fact Table Relationships
This model demonstrates **multiple billion-row fact tables** sharing the same dimensions:

```
    dim_Date ────┬──── fact_myevents_1bln (1B rows)
                 ├──── fact_myevents_2bln (2B rows)  
                 └──── fact_myevents_1bln_partitioned_datekey (1B rows)
                 
    dim_Geography ┬──── fact_myevents_1bln (1B rows)
                  ├──── fact_myevents_2bln (2B rows)
                  └──── fact_myevents_1bln_partitioned_datekey (1B rows)
```

### Relationship Configuration Details

#### Date Relationships (Time Intelligence):
- **fact_myevents_1bln.DateKey → dim_Date.DateKey**
- **fact_myevents_2bln.DateKey → dim_Date.DateKey**  
- **fact_myevents_1bln_partitioned_datekey.DateKey → dim_Date.DateKey**

**Performance Impact**: DateKey is typically an integer with good cardinality distribution, making it efficient for billion-row joins.

#### Geography Relationships (Dimensional Analysis):
- **fact_myevents_1bln.GeographyID → dim_Geography.GeographyID**
- **fact_myevents_2bln.GeographyID → dim_Geography.GeographyID**
- **fact_myevents_1bln_partitioned_datekey.GeographyID → dim_Geography.GeographyID**

**Performance Impact**: GeographyID has lower cardinality (~200 values), making it very efficient for filtering and grouping.
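
The six relationships are the cross product of three fact tables and two shared dimension keys, so they can be generated rather than typed out. This is a sketch only: `relationship_specs` is a hypothetical helper, and the dictionary keys mirror the keyword arguments the lab passes to `tom.add_relationship`.

```python
from itertools import product

FACT_TABLES = [
    "fact_myevents_1bln",
    "fact_myevents_2bln",
    "fact_myevents_1bln_partitioned_datekey",
]
# (dimension table, shared key column)
DIMENSIONS = [("dim_Date", "DateKey"), ("dim_Geography", "GeographyID")]

def relationship_specs(fact_tables, dimensions):
    """Build one many-to-one relationship spec per fact/dimension pair."""
    return [
        {
            "from_table": fact,
            "from_column": key,
            "to_table": dim,
            "to_column": key,
            "from_cardinality": "Many",
            "to_cardinality": "One",
        }
        for fact, (dim, key) in product(fact_tables, dimensions)
    ]

specs = relationship_specs(FACT_TABLES, DIMENSIONS)
for spec in specs:
    print(f"{spec['from_table']}.{spec['from_column']} -> {spec['to_table']}.{spec['to_column']}")
```

Inside a `labs.tom.connect_semantic_model(...)` block, each spec could then be applied with `tom.add_relationship(**spec)`.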

### Big Data Relationship Best Practices

#### ✅ **Efficient Patterns**:
- **Many-to-One cardinality**: Billion fact rows to thousands of dimension rows
- **Integer keys**: Faster joins and smaller memory footprint
- **Indexed columns**: Keys that benefit from Delta Lake optimization

#### ❌ **Patterns to Avoid**:
- **Many-to-Many relationships**: Expensive with billion-row tables
- **String-based keys**: Larger memory footprint and slower joins
- **High-cardinality dimensions**: Can trigger fallback to SQL Endpoint

### Memory and Performance Implications
When relationships are used in queries:
- **Key columns loaded**: DateKey and GeographyID columns from fact tables enter memory
- **Join optimization**: Direct Lake optimizes join execution paths
- **Filter propagation**: Efficient filtering from dimensions to billion-row facts

**Expected outcome**: Six relationships configured enabling efficient cross-table analysis across multiple billion-row fact tables.

🎯 **Big data relationships checkpoint**: Multi-billion row fact tables connected efficiently to shared dimensions!

In [None]:
completedOK:bool=False
while not completedOK:
    try:
        with labs.tom.connect_semantic_model(dataset=SemanticModelName, readonly=False) as tom:
            #1. Remove any existing relationships (iterate over a copy so removal does not invalidate the iterator)
            for r in list(tom.model.Relationships):
                tom.model.Relationships.Remove(r)

            #2. Creates correct relationships
            tom.add_relationship(from_table="fact_myevents_1bln"                    , from_column="DateKey"     , to_table="dim_Date"       , to_column="DateKey"       , from_cardinality="Many" , to_cardinality="One")
            tom.add_relationship(from_table="fact_myevents_1bln"                    , from_column="GeographyID" , to_table="dim_Geography"  , to_column="GeographyID"   , from_cardinality="Many" , to_cardinality="One")

            tom.add_relationship(from_table="fact_myevents_2bln"                    , from_column="DateKey"     , to_table="dim_Date"       , to_column="DateKey"       , from_cardinality="Many" , to_cardinality="One")
            tom.add_relationship(from_table="fact_myevents_2bln"                    , from_column="GeographyID" , to_table="dim_Geography"  , to_column="GeographyID"   , from_cardinality="Many" , to_cardinality="One")

            tom.add_relationship(from_table="fact_myevents_1bln_partitioned_datekey", from_column="DateKey"     , to_table="dim_Date"       , to_column="DateKey"       , from_cardinality="Many" , to_cardinality="One")
            tom.add_relationship(from_table="fact_myevents_1bln_partitioned_datekey", from_column="GeographyID" , to_table="dim_Geography"  , to_column="GeographyID"   , from_cardinality="Many" , to_cardinality="One")
            completedOK=True
    except Exception as e:
        print(f'Error adding relationships ({e})... trying again.')
        time.sleep(3)

print('done')

## 8. Create Performance-Optimized Measures for Big Data

Creates strategic measures for billion-row performance comparison and scale analysis.

In [None]:
Removes existing measures and adds optimized measures for 1B and 2B row performance comparison.

## 9. Configure Date Intelligence for Big Data Analytics

### Time Intelligence with Billion-Row Datasets
**Date tables** become even more critical when working with billion-row fact tables because:

- **📅 Time-based filtering**: Essential for managing large dataset query performance
- **📊 Aggregation efficiency**: Date hierarchies enable efficient drill-down on large datasets
- **⚡ Partition alignment**: Date partitioning often aligns with time intelligence patterns
- **🎯 User experience**: Time filters help users focus on relevant data subsets

### Big Data Date Table Configuration
```python
tom.mark_as_date_table(table_name="dim_Date", column_name="DateKey")
```

#### Why DateKey Instead of Date?
- **🔗 Relationship compatibility**: Matches the integer keys in billion-row fact tables
- **⚡ Performance optimization**: Integer joins are faster than datetime joins with large datasets
- **💾 Storage efficiency**: Smaller memory footprint for relationship columns
- **📊 Index optimization**: Integer keys benefit more from Delta Lake indexing

### Time Intelligence Impact on Big Data Queries

#### Efficient Time Filtering:
With proper date table configuration, queries like:
```dax
CALCULATE([Sum of Sales (2bln)], DATESINPERIOD(dim_Date[DateKey], TODAY(), -1, YEAR))
```
Can efficiently filter billion-row datasets to specific time periods.

#### Performance Benefits:
- **Partition elimination**: Date-based filters can eliminate entire partitions
- **Memory reduction**: Only relevant time periods loaded into memory
- **Query optimization**: Direct Lake can optimize date-based predicates

### Big Data Time Intelligence Scenarios

#### Typical Use Cases:
| Time Period | Impact on 2B Row Table | Performance Benefit |
|-------------|------------------------|-------------------|
| **Last 30 days** | ~60M rows (3%) | 97% data elimination |
| **Current year** | ~730M rows (37%) | 63% data elimination |
| **Year-over-year** | ~1.46B rows (73%) | Comparative analysis |
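
Estimates like these come from simple proportional arithmetic. The sketch below assumes rows are spread evenly across days at roughly 2 million rows per day, which is the rate implied by the current-year figure; `rows_after_date_filter` and `DAYS_IN_TABLE` are hypothetical names, and real event data is rarely this uniform.

```python
def rows_after_date_filter(total_rows: int, days_in_table: int, days_in_filter: int) -> int:
    """Estimate rows left after a date filter, assuming an even spread across days."""
    days = min(days_in_filter, days_in_table)
    return total_rows * days // days_in_table

TOTAL_ROWS = 2_000_000_000
DAYS_IN_TABLE = 1_000  # assumption: about 2 million rows per day

estimates = {}
for label, days in [("Last 30 days", 30), ("Current year", 365), ("Year-over-year", 730)]:
    rows = rows_after_date_filter(TOTAL_ROWS, DAYS_IN_TABLE, days)
    share = rows / TOTAL_ROWS
    estimates[label] = rows
    print(f"{label}: {rows:,} rows ({share:.1%} kept, {1 - share:.1%} eliminated)")
```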

### Expected Configuration Benefits
Once the date table is marked:
- ✅ **Time intelligence functions** available for billion-row analysis
- ✅ **Efficient filtering** on large fact tables
- ✅ **Optimized memory usage** through time-based data elimination
- ✅ **Better user experience** with familiar date hierarchies

🎯 **Big data time intelligence checkpoint**: Date table optimized for efficient billion-row time-based analysis!

In [None]:
completedOK:bool=False
while not completedOK:
    try:
        with labs.tom.connect_semantic_model(dataset=SemanticModelName, readonly=False) as tom:
            tom.mark_as_date_table(table_name="dim_Date",column_name="DateKey")
            completedOK=True
    except Exception as e:
        print(f'Error with date table ({e})... trying again.')
        time.sleep(3)

print('done')

## 10. Configure Logical Sorting for Big Data Visualizations

### Sorting Importance in Big Data Contexts
With **billion-row datasets**, proper sorting becomes crucial for user experience because:

- **📊 Large result sets**: Time-based reports may return thousands of rows
- **🎯 Pattern recognition**: Logical ordering helps identify trends in big data
- **⚡ Performance impact**: Proper sorting can leverage existing table ordering
- **👥 User adoption**: Intuitive ordering reduces confusion with large datasets

### Big Data Dimension Sorting Configuration

#### Month Name Sorting:
```python
tom.set_sort_by_column(table_name="dim_Date", column_name="MonthName", sort_by_column="Month")
```
- **Display column**: MonthName ("January", "February", "March"...)
- **Sort column**: Month (1, 2, 3...)
- **Big data impact**: When aggregating billion rows by month, results display logically

#### Weekday Sorting:
```python
tom.set_sort_by_column(table_name="dim_Date", column_name="WeekDayName", sort_by_column="Weekday")
```
- **Display column**: WeekDayName ("Sunday", "Monday", "Tuesday"...)
- **Sort column**: Weekday (1, 2, 3...)
- **Big data impact**: Weekly aggregations of large datasets display in intuitive order
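
To see why a sort-by column matters, compare plain alphabetical ordering of month names with ordering by the numeric companion column. This is a plain-Python illustration of the idea, not a call into the semantic model.

```python
# (MonthName, Month) pairs as they would appear in a date dimension.
MONTHS = [
    ("January", 1), ("February", 2), ("March", 3), ("April", 4),
    ("May", 5), ("June", 6), ("July", 7), ("August", 8),
    ("September", 9), ("October", 10), ("November", 11), ("December", 12),
]

# Default text sort: alphabetical, which scrambles the calendar.
alphabetical = sorted(name for name, _ in MONTHS)

# Sort-by-column behaviour: order the display column by its numeric companion.
chronological = [name for name, _ in sorted(MONTHS, key=lambda pair: pair[1])]

print(alphabetical[:3])    # ['April', 'August', 'December']
print(chronological[:3])   # ['January', 'February', 'March']
```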

### Performance Considerations for Big Data Sorting

#### Memory Impact:
- **Sort columns loaded**: Both display and sort-by columns enter memory during queries
- **Relationship efficiency**: Sorting columns are often used in time-based relationships
- **Query optimization**: Proper sorting can leverage Delta Lake file ordering

#### User Experience with Large Results:
| Aggregation Level | Result Size | Sorting Benefit |
|------------------|-----------------|---------|
| **Monthly aggregation** | 12-60 rows | Chronological progression |
| **Daily aggregation** | 365-1,095 rows | Calendar order clarity |
| **Hourly aggregation** | 8,760+ rows | Time series continuity |

### JSON Verification Output
The code displays the complete table structure showing:
- ✅ **Sort relationships** properly configured
- ✅ **Column metadata** including sort-by properties
- ✅ **Performance settings** for large-scale operations

**Expected outcome**: Date dimension columns configured for logical sorting, ensuring intuitive ordering in billion-row aggregation results.

🎯 **Big data UX checkpoint**: Sorting optimized for large-scale analytics user experience!

In [None]:
completedOK:bool=False
while not completedOK:
    try:
        tom = labs.tom.TOMWrapper(dataset=SemanticModelName, workspace=workspaceName, readonly=False)
        tom.set_sort_by_column(table_name="dim_Date",column_name="MonthName"       ,sort_by_column="Month")
        tom.set_sort_by_column(table_name="dim_Date",column_name="WeekDayName"     ,sort_by_column="Weekday")
        tom.model.SaveChanges()

        # Show BIM data for the dim_Date table
        for i, t in enumerate(tom.model.Tables):
            if t.Name=="dim_Date":
                bim = json.dumps(tom.get_bim()["model"]["tables"][i],indent=4)
                print(bim)
        # Mark success only after the whole block has run
        completedOK=True
    except Exception as e:
        print(f'Error with sort by cols ({e})... trying again.')
        time.sleep(3)

print('done')

## 11. Optimize Big Data Model by Hiding Fact Table Columns

Hides all fact table columns to prevent accidental memory overload from billion-row column access.

In [None]:
completedOK:bool=False
while not completedOK:
    try:
        i:int=0
        for t in tom.model.Tables:
            if t.Name in ["fact_myevents_1bln","fact_myevents_2bln","fact_myevents_1bln_partitioned_datekey"]:
                for c in t.Columns:
                    c.IsHidden=True

                bim = json.dumps(tom.get_bim()["model"]["tables"][i],indent=4)
                print(bim)
            i=i+1
        tom.model.SaveChanges()
        completedOK=True
    except Exception as e:
        print(f'Error with hiding cols ({e})... trying again.')
        time.sleep(3)

print('done')

## 12. Refresh Big Data Model and Validate Configuration

Refreshes semantic model and validates proper configuration for billion-row analytics.

### Big Data Model Refresh Considerations
Refreshing a semantic model with **billion-row tables** requires special attention to:

- **⏱️ Extended timing**: Billion-row metadata validation takes longer
- **🧠 Memory planning**: Initial refresh prepares memory allocation strategies
- **🔗 Shortcut validation**: Ensures cross-workspace connections are stable
- **🛡️ Guardrail checking**: Validates tables remain within Direct Lake limits

### Why Refresh is More Critical for Big Data

#### Validation Processes:
- **Schema consistency**: Ensures billion-row table schemas are properly detected
- **Relationship integrity**: Validates joins work correctly with large datasets
- **Memory allocation**: Prepares Direct Lake for potential large column loads
- **Performance optimization**: Updates query execution plans for big data scenarios

#### Big Data Refresh Challenges:
| Challenge | Impact | Mitigation |
|-----------|---------|------------|
| **Metadata complexity** | Longer refresh times | Patient retry logic |
| **Cross-workspace shortcuts** | Connection validation delays | Automatic metadata sync |
| **Memory preparation** | Initial allocation overhead | Staged loading approach |

### Enhanced Error Handling for Big Data
```python
reframeOK = False
while not reframeOK:
    try:
        result = labs.refresh_semantic_model(dataset=SemanticModelName)
        reframeOK = True
    except Exception:
        triggerMetadataRefresh()  # Re-sync billion-row table metadata
        time.sleep(3)             # Allow background processes to complete

#### Why Robust Error Handling Matters:
- **Billion-row complexity**: More opportunities for transient failures
- **Resource coordination**: Multiple services must coordinate for large datasets
- **Network dependencies**: Cross-workspace shortcuts may have connectivity issues
- **Memory allocation**: Initial memory planning may require multiple attempts
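
The retry pattern used throughout this lab loops until success. A bounded variant gives up after a fixed number of attempts and surfaces the last error instead of spinning indefinitely. The sketch below is one possible shape for it: `retry`, `max_attempts`, and `on_failure` are hypothetical names, and the refresh is simulated so the example runs outside Fabric.

```python
import time

def retry(action, on_failure=None, max_attempts=5, delay_s=3.0, sleep=time.sleep):
    """Run action until it succeeds or max_attempts is reached, then return its result."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:  # not a bare except, so a keyboard interrupt still stops the loop
            print(f"Attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise
            if on_failure is not None:
                on_failure()  # in this lab, a metadata re-sync would go here
            sleep(delay_s)

# Simulated refresh that fails twice before succeeding.
outcomes = iter([RuntimeError("metadata not ready"), RuntimeError("still syncing"), "ok"])

def flaky_refresh():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

resyncs = []
result = retry(flaky_refresh, on_failure=lambda: resyncs.append(1), sleep=lambda s: None)
print(result, len(resyncs))
```

In the notebook, `action` would wrap the `labs.refresh_semantic_model` call and `on_failure` would be `triggerMetadataRefresh`.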

### Post-Refresh Validation
After successful refresh, the model is ready for:
- ✅ **Billion-row queries**: Can handle large-scale aggregations
- ✅ **Cross-table analysis**: Relationships function with massive datasets
- ✅ **Memory-efficient operation**: Optimized for large column loading
- ✅ **Fallback monitoring**: Ready for guardrail limit testing

**Expected outcome**: "Custom Semantic Model reframe OK" confirms your big data model is ready for billion-row analytics!

🎯 **Big data model readiness checkpoint**: Billion-row semantic model fully configured and validated!

In [None]:
reframeOK:bool=False
while not reframeOK:
    try:
        result:pandas.DataFrame = labs.refresh_semantic_model(dataset=SemanticModelName)
        reframeOK=True
    except Exception as e:
        print(f'Error with reframe ({e})... trying again.')
        triggerMetadataRefresh()
        time.sleep(3)

print('Custom Semantic Model reframe OK')

## 13. Advanced Performance Tracing for Big Data Analytics

Creates enhanced tracing function for comprehensive big data performance monitoring and analysis.

### Why Advanced Tracing is Essential for Big Data
With **billion-row datasets**, understanding query execution becomes critical for:

- **🔍 Fallback detection**: When and why Direct Lake falls back to SQL Endpoint
- **⏱️ Performance analysis**: Detailed timing breakdown for large-scale operations
- **🧠 Memory monitoring**: Track column loading patterns during billion-row queries
- **🛡️ Guardrail monitoring**: Understand limit breaches and their impacts

### Enhanced Tracing Capabilities

#### DMV (Dynamic Management Views) Analysis:
The `runDMV()` function provides deep insights into:
- **Column temperature**: Which billion-row columns are "HOT", "WARM", or "COLD"
- **Memory residency**: What's currently loaded vs. on-disk
- **Dictionary sizes**: Compression effectiveness for large columns
- **Access patterns**: When columns were last used in queries

#### Advanced Query Tracing:
The `runQueryWithTrace()` function provides:
- **Server-side timing**: Detailed execution breakdown
- **Storage engine events**: Direct Lake vs. SQL Endpoint execution paths  
- **Memory operations**: Column loading and caching behavior
- **Fallback triggers**: Exact reasons for SQL Endpoint fallback

### Big Data Tracing Function Parameters

#### Flexible Analysis Options:
```python
runQueryWithTrace(
    expr: str,                    # DAX query to execute
    workspaceName: str,           # Workspace context
    SemanticModelName: str,       # Target model
    Result: bool = True,          # Show query results
    Trace: bool = True,           # Show server timings
    DMV: bool = True,             # Show column states
    ClearCache: bool = True       # Start with clean memory
)
```

#### Why Each Parameter Matters for Big Data:

##### **`ClearCache=True`**:
- **Clean baseline**: Ensures accurate measurement of billion-row column loading
- **Reproducible tests**: Consistent starting point for performance analysis
- **Memory reset**: Clears any previously loaded large columns

##### **`Trace=True`**:
- **Execution path visibility**: See if query uses Direct Lake or falls back
- **Performance bottlenecks**: Identify slow operations with large datasets
- **Resource usage**: Understand memory and CPU impact

##### **`DMV=True`**:
- **Before/after comparison**: See impact of billion-row queries on column states
- **Memory evolution**: Track which columns become "HOT" during execution
- **Optimization insights**: Identify opportunities for performance tuning

### Event Filtering for Big Data
```python
def filter_func(e):
    if e.EventSubclass.ToString() == "VertiPaqScanInternal":
        return False  # Filter out excessive internal scanning events
    return True
```

This filtering prevents **trace log flooding** when working with billion-row tables, focusing on meaningful events.
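
The predicate can be checked without a live trace by feeding it stand-in event objects. In this sketch `FakeEvent` and `FakeSubclass` are hypothetical classes that imitate only the `EventSubclass.ToString()` call the filter relies on; the event names are sample values.

```python
class FakeSubclass:
    """Minimal stand-in for the enum value exposed on a trace event."""
    def __init__(self, name):
        self._name = name

    def ToString(self):
        return self._name

class FakeEvent:
    """Stand-in trace event carrying only an EventSubclass attribute."""
    def __init__(self, subclass_name):
        self.EventSubclass = FakeSubclass(subclass_name)

def filter_func(e):
    # Same predicate as above: drop the high-volume internal scan events.
    return e.EventSubclass.ToString() != "VertiPaqScanInternal"

sample = ["VertiPaqScan", "VertiPaqScanInternal", "DAXQuery", "VertiPaqScanInternal"]
events = [FakeEvent(name) for name in sample]
kept = [e.EventSubclass.ToString() for e in events if filter_func(e)]
print(kept)  # ['VertiPaqScan', 'DAXQuery']
```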

### Expected Tracing Benefits
With these functions, you can:
- ✅ **Monitor fallback scenarios**: Understand when billion-row queries trigger SQL Endpoint
- ✅ **Analyze memory patterns**: See how large columns are loaded and cached
- ✅ **Compare performance**: Benchmark different query approaches on big data
- ✅ **Optimize queries**: Identify the most efficient patterns for large datasets

🎯 **Big data monitoring checkpoint**: Advanced tracing tools ready for billion-row performance analysis!

In [None]:
import warnings
from Microsoft.AnalysisServices.Tabular import TraceEventArgs
from typing import Dict, List, Optional, Callable

def runDMV():
    df = sempy.fabric.evaluate_dax(
        dataset=SemanticModelName, 
        dax_string="""
        
        SELECT 
            MEASURE_GROUP_NAME AS [TABLE],
            ATTRIBUTE_NAME AS [COLUMN],
            DATATYPE ,
            DICTIONARY_SIZE 		    AS SIZE ,
            DICTIONARY_ISPAGEABLE 		AS PAGEABLE ,
            DICTIONARY_ISRESIDENT		AS RESIDENT ,
            DICTIONARY_TEMPERATURE		AS TEMPERATURE,
            DICTIONARY_LAST_ACCESSED	AS LASTACCESSED 
        FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS 
        ORDER BY 
            [DICTIONARY_TEMPERATURE] DESC
        
        """)
    display(df)

def filter_func(e):
    # Drop the high-volume internal scan events; keep everything else
    return e.EventSubclass.ToString() != "VertiPaqScanInternal"

# define events to trace and their corresponding columns
def runQueryWithTrace (expr:str,workspaceName:str,SemanticModelName:str,Result:Optional[bool]=True,Trace:Optional[bool]=True,DMV:Optional[bool]=True,ClearCache:Optional[bool]=True) -> pandas.DataFrame :
    event_schema = fabric.Trace.get_default_query_trace_schema()
    event_schema.update({"ExecutionMetrics":["EventClass","TextData"]})
    del event_schema['VertiPaqSEQueryBegin']
    del event_schema['VertiPaqSEQueryCacheMatch']
    del event_schema['DirectQueryBegin']

    warnings.filterwarnings("ignore")

    if ClearCache:
        labs.clear_cache(SemanticModelName)

    with fabric.create_trace_connection(SemanticModelName,workspaceName) as trace_connection:
        # create trace on server with specified events
        with trace_connection.create_trace(
            event_schema=event_schema, 
            name="Simple Query Trace",
            filter_predicate=filter_func,
            stop_event="QueryEnd"
            ) as trace:

            trace.start()

            df=sempy.fabric.evaluate_dax(
                dataset=SemanticModelName, 
                dax_string=expr)

            if Result:
                displayHTML(f"<H2>####### DAX QUERY RESULT #######</H2>")
                display(df)

            # Wait 5 seconds for trace data to arrive
            time.sleep(5)

            # stop Trace and collect logs
            final_trace_logs = trace.stop()

    if Trace:
        displayHTML(f"<H2>####### SERVER TIMINGS #######</H2>")
        display(final_trace_logs)
    
    if DMV:
        displayHTML(f"<H2>####### SHOW DMV RESULTS #######</H2>")
        runDMV()
    
    return final_trace_logs



## 14. Validate Big Data Model Configuration

Validates model configuration and Direct Lake mode operation for billion-row tables.

### Big Data Model Validation Strategy
Before executing billion-row queries, validate that your model is properly configured for large-scale operations:

#### TABLETRAITS() for Big Data Validation:
This function becomes especially important for billion-row tables because it reveals:
- **Storage mode confirmation**: Ensures tables are actually in Direct Lake mode
- **Partition information**: Shows how billion-row tables are organized
- **File size details**: Validates files are within Direct Lake limits
- **Compression statistics**: Shows V-Order optimization effectiveness

#### Expected TABLETRAITS() Results for Big Data:
| Table | Storage Mode | Row Count | File Size | Partitions |
|-------|-------------|-----------|-----------|------------|
| **fact_myevents_1bln** | DirectLake | ~1B | Multiple files <1GB each | Multiple |
| **fact_myevents_2bln** | DirectLake | ~2B | Multiple files <1GB each | Multiple |
| **fact_myevents_1bln_partitioned_datekey** | DirectLake | ~1B | Date-partitioned files | By DateKey |

### Direct Lake Guardrails for Big Data
The guardrails query becomes critical for billion-row scenarios:

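#### Key Guardrails to Monitor:
Guardrails depend on the capacity SKU. `get_direct_lake_guardrails()` lists them per SKU:
- **Parquet files per table**: Maximum number of Parquet files behind a Delta table
- **Row groups per table**: Maximum number of row groups across those files
- **Rows per table**: Maximum row count, the limit most relevant to billion-row tables
- **Max model size on disk**: Total size of the model's data in OneLake
- **Max memory**: A resource limit rather than a fallback trigger; exceeding it typically causes column paging instead of fallback

#### Potential Guardrail Violations:
| Guardrail | Billion-Row Risk | Impact |
|-----------|-----------------|---------|
| **Rows per table** | 1B and 2B fact tables on smaller SKUs | Fallback to DirectQuery via the SQL Endpoint |
| **Parquet files / row groups per table** | Many small files from frequent writes | Fallback to DirectQuery until the table is compacted |
| **Max memory** | Several large columns loaded at once | Columns are paged in and out, so queries slow down |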
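
Checking a table against the guardrails amounts to comparing its statistics with the limits for your capacity SKU. The sketch below is a minimal illustration that assumes the guardrails include per-table limits on rows, Parquet files, and row groups. `TableStats`, `Guardrails`, `exceeded_guardrails`, and every numeric limit are hypothetical placeholders; substitute the values that `get_direct_lake_guardrails()` reports for your SKU.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Statistics for one Delta table (hypothetical container)."""
    rows: int
    parquet_files: int
    row_groups: int


@dataclass
class Guardrails:
    """Per-table limits for one capacity SKU (placeholder values only)."""
    max_rows: int
    max_parquet_files: int
    max_row_groups: int


def exceeded_guardrails(stats: TableStats, limits: Guardrails) -> list:
    """Return the names of the limits that the table exceeds."""
    checks = {
        "rows": (stats.rows, limits.max_rows),
        "parquet_files": (stats.parquet_files, limits.max_parquet_files),
        "row_groups": (stats.row_groups, limits.max_row_groups),
    }
    return [name for name, (actual, limit) in checks.items() if actual > limit]


# Illustrative numbers, not authoritative SKU limits
limits = Guardrails(max_rows=1_500_000_000, max_parquet_files=5_000, max_row_groups=5_000)
two_billion = TableStats(rows=2_000_000_000, parquet_files=400, row_groups=2_000)
print(exceeded_guardrails(two_billion, limits))
```

An empty list means the table stays within every limit that was checked; any name in the list marks a candidate fallback trigger worth confirming in the trace.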

### Fallback Detection Strategy
If TABLETRAITS() reports a storage mode other than **DirectLake** for any billion-row table (fallback uses DirectQuery through the SQL Endpoint, not import mode):
- 🚨 **Guardrail breach**: Table exceeded Direct Lake limits
- 🔄 **Automatic fallback**: System protected against performance issues
- 🛠️ **Optimization needed**: Consider partitioning or data reduction

**Expected validation outcome**: All tables show DirectLake mode, confirming readiness for billion-row analytics.

🎯 **Big data validation checkpoint**: Model configuration confirmed ready for large-scale Direct Lake operations!

In [None]:
df=sempy.fabric.evaluate_dax(
    dataset=SemanticModelName, 
    dax_string="""
    
    evaluate tabletraits()
    
    """)
display(df)

In [None]:
df=labs.directlake.get_direct_lake_guardrails()
display(df)

## 15. Establish Big Data Performance Baseline

Establishes performance baseline for billion-row analytics before executing large-scale queries.

### Baseline Importance for Billion-Row Analytics
Establishing a **performance baseline** before executing billion-row queries is crucial because:

- **📊 Memory footprint**: See initial state before large columns are loaded
- **🌡️ Temperature tracking**: All columns should start "COLD" for accurate measurement
- **🧠 Capacity planning**: Understand available memory before billion-row operations
- **📈 Change detection**: Measure the exact impact of large-scale queries
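
Change detection can be sketched as a diff between two DMV snapshots. The example assumes each snapshot has been reduced to a dictionary keyed by `(table, column)` with a temperature label. The `changed_columns` helper, the snapshot shape, and the sample column list are hypothetical, since the output format of `runDMV()` is not shown here.

```python
def changed_columns(before: dict, after: dict) -> dict:
    """Map each column whose state changed to its (before, after) pair."""
    changes = {}
    for key, new_state in after.items():
        old_state = before.get(key)  # None if the column was missing from the baseline
        if old_state != new_state:
            changes[key] = (old_state, new_state)
    return changes


# Hypothetical snapshots taken before and after a query
baseline = {
    ("fact_myevents_1bln", "DateKey"): "COLD",
    ("fact_myevents_1bln", "Quantity_ThisYear"): "COLD",
    ("fact_myevents_1bln", "GeographyID"): "COLD",
}
after_query = {
    ("fact_myevents_1bln", "DateKey"): "HOT",
    ("fact_myevents_1bln", "Quantity_ThisYear"): "HOT",
    ("fact_myevents_1bln", "GeographyID"): "COLD",
}

for (table, column), (old, new) in changed_columns(baseline, after_query).items():
    print(f"{table}[{column}]: {old} -> {new}")
```

Only the columns the query actually touched should appear in the diff, which is why a clean all-cold baseline makes the result easier to read.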

### Expected Baseline for Big Data Tables

#### Initial Column States:
All billion-row table columns should show:
- **❄️ TEMPERATURE**: "COLD" (not accessed yet)
- **🚫 RESIDENT**: "FALSE" (not in memory)
- **📅 LASTACCESSED**: Null or very old timestamps
- **📏 SIZE**: Actual dictionary compression sizes

#### Critical Columns to Monitor:
| Table | Column | Expected Size | Importance |
|-------|--------|---------------|------------|
| **fact_myevents_1bln** | Quantity_ThisYear | Large (1B values) | Primary measure column |
| **fact_myevents_1bln** | DateKey | Medium (date range) | Relationship key |
| **fact_myevents_2bln** | Quantity_ThisYear | Very Large (2B values) | Stress test column |
| **fact_myevents_2bln** | GeographyID | Small (low cardinality) | Efficient dimension key |

### Memory Capacity Analysis
The baseline DMV reveals:
- **Available memory**: How much memory is available for column loading
- **Current usage**: Memory already consumed by model metadata
- **Compression ratios**: How effectively billion-row columns are compressed
- **Cardinality patterns**: Which columns have high vs. low unique value counts

### Baseline Insights for Performance Planning

#### Memory Allocation Estimates:
- **1B row column**: May require 2-8GB depending on data type and compression
- **2B row column**: May require 4-16GB, potentially triggering fallback
- **Date keys**: Usually very efficient due to limited range
- **Geography IDs**: Minimal memory due to low cardinality
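
These ranges follow from simple arithmetic: rows times bytes per value, divided by a compression ratio. The sketch below assumes 8-byte numeric values and compression ratios between 1x and 4x. Both figures are illustrative assumptions rather than measurements from this lab, and `estimate_column_gb` is a hypothetical helper.

```python
def estimate_column_gb(rows: int, bytes_per_value: int = 8, compression_ratio: float = 1.0) -> float:
    """Rough in-memory size of one column in GB (1 GB = 10**9 bytes)."""
    return rows * bytes_per_value / compression_ratio / 1e9


for rows in (1_000_000_000, 2_000_000_000):
    worst = estimate_column_gb(rows)                        # no compression at all
    best = estimate_column_gb(rows, compression_ratio=4.0)  # assumed 4x compression
    print(f"{rows:,} rows: {best:.0f}-{worst:.0f} GB")
```

Real columns can compress far better than 4x when cardinality is low, which is why date keys and geography IDs need so little memory.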

#### Performance Prediction:
Based on baseline dictionary sizes:
- **Small dictionaries** (<100MB): Will load quickly into memory
- **Medium dictionaries** (100MB-1GB): Moderate loading time
- **Large dictionaries** (>1GB): May trigger fallback to SQL Endpoint

**Expected baseline outcome**: Complete view of billion-row table column states before any query execution, establishing foundation for performance analysis.

🎯 **Big data baseline checkpoint**: Billion-row performance measurement foundation established!

In [None]:
runDMV()

## 16. Execute Billion-Row Analytics with Performance Monitoring

Executes comprehensive billion-row analytics with detailed performance monitoring and stress testing.

### Big Data Query Execution Strategy
This section demonstrates **real-world billion-row analytics** while monitoring Direct Lake behavior and potential fallback scenarios.

### Performance Analysis Framework
Each query provides comprehensive analysis through:
- **📊 Query results**: Actual business insights from billion-row datasets
- **⏱️ Server timing traces**: Detailed execution path and performance metrics
- **🧠 Memory impact analysis**: See which columns become "HOT" and enter memory
- **🛡️ Fallback detection**: Monitor if/when queries fall back to SQL Endpoint

### Query Progression Strategy

#### Graduated Scale Testing:
1. **1 billion rows**: Establish baseline performance
2. **2 billion rows**: Test Direct Lake limits
3. **Combined analysis**: Multi-table billion-row queries

This progression helps identify **performance thresholds** and **fallback triggers**.

#### 16.1 Baseline Performance: 1 Billion Row Analytics

### Query Analysis: 1 Billion Row Table
This query establishes **baseline performance** for billion-row Direct Lake analytics:

```dax
EVALUATE
SUMMARIZECOLUMNS(
    dim_Date[FirstDateofMonth],
    "Count of Transactions", COUNTROWS(fact_myevents_1bln),
    "Sum of Sales", [Sum of Sales (1bln)]
)
ORDER BY [FirstDateofMonth]
```

#### Query Component Analysis:

##### **SUMMARIZECOLUMNS()** with 1B Rows:
- **Cross-table aggregation**: Joins 1B fact rows with date dimension
- **Memory impact**: Will load DateKey column from billion-row table
- **Performance**: Tests Direct Lake's ability to handle large-scale aggregations

##### **COUNTROWS(fact_myevents_1bln)**:
- **Row counting**: Counts the fact rows that fall in each month
- **Optimization**: Because the count is grouped by month, the engine scans the compressed DateKey column rather than reading a single row count from table metadata
- **Expected performance**: Fast due to columnar storage optimization

##### **[Sum of Sales (1bln)]**:
- **Measure evaluation**: Aggregates Quantity_ThisYear from 1B rows
- **Memory loading**: Will bring billion-row numeric column into memory
- **Critical test**: Primary test of Direct Lake's billion-row aggregation capability

### Expected Performance Characteristics

#### Direct Lake Success Scenario:
- **Execution time**: 5-30 seconds depending on memory and data distribution
- **Memory usage**: DateKey + Quantity_ThisYear columns loaded
- **Result accuracy**: Exact aggregation across all billion rows
- **Temperature change**: Related columns become "HOT"

#### Potential Fallback Scenarios:
- **Guardrail breach**: If the table's row count, Parquet file count, or row group count exceeds the capacity's guardrails
- **Memory pressure**: If the Quantity_ThisYear column is too large for available memory, expect paging and slower queries
- **Resource contention**: If multiple users are querying the model simultaneously

### What to Look For

#### Success Indicators:
- ✅ **Query completes successfully** with reasonable performance
- ✅ **Server traces show DirectLake** execution path
- ✅ **DMV shows columns loaded** with "HOT" temperature
- ✅ **Results show aggregated data** by month

#### Performance Insights:
- 📊 **Monthly aggregation results**: Business insights from billion-row dataset
- ⏱️ **Execution timing**: Baseline for subsequent larger queries
- 🧠 **Memory impact**: See exact memory footprint of billion-row columns
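
To reuse the execution time as a baseline, you can pull the `QueryEnd` duration out of each trace and compare runs. This is a minimal sketch that assumes the trace log has been converted to a list of records with `Event Class` and `Duration` (milliseconds) keys. Those column names, the sample values, and the helpers `query_duration_ms` and `slowdown` are assumptions for illustration.

```python
from typing import Optional


def query_duration_ms(events: list) -> Optional[float]:
    """Return the duration of the first QueryEnd event, or None if absent."""
    for event in events:
        if event.get("Event Class") == "QueryEnd":
            return event.get("Duration")
    return None


def slowdown(baseline_ms: float, current_ms: float) -> float:
    """How many times slower the current run is than the baseline."""
    return current_ms / baseline_ms


# Hypothetical trace rows; with pandas, trace_df.to_dict("records") yields this shape
run_1bln = [
    {"Event Class": "VertiPaqSEQueryEnd", "Duration": 4200},
    {"Event Class": "QueryEnd", "Duration": 6000},
]
run_2bln = [{"Event Class": "QueryEnd", "Duration": 15000}]

ratio = slowdown(query_duration_ms(run_1bln), query_duration_ms(run_2bln))
print(f"2B run took {ratio:.1f}x the 1B baseline")
```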

**Expected outcome**: Successful billion-row aggregation demonstrating Direct Lake's capability with large-scale datasets.

In [None]:
df = runQueryWithTrace("""
    
    EVALUATE
        SUMMARIZECOLUMNS(
               
                dim_Date[FirstDateofMonth] ,
                "Count of Transactions" , COUNTROWS(fact_myevents_1bln) ,
                "Sum of Sales" , [Sum of Sales (1bln)] 
        )
        ORDER BY [FirstDateofMonth]

""",workspaceName,SemanticModelName)

#### 16.2 Stress Test: 2 Billion Row Analytics

### Pushing Direct Lake to Its Limits
This query tests **Direct Lake's maximum scale capabilities** with 2 billion rows:

```dax
EVALUATE
SUMMARIZECOLUMNS(
    dim_Date[FirstDateofMonth],
    "Count of Transactions", COUNTROWS(fact_myevents_2bln),
    "Sum of Sales", [Sum of Sales (2bln)]
)
ORDER BY [FirstDateofMonth]
```

#### Critical Scale Factors:

##### **2 Billion Row Impact**:
- **Memory pressure**: Quantity_ThisYear column from 2B rows may be 8-16GB
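- **Guardrail testing**: 2B rows may exceed the rows-per-table guardrail of smaller capacity SKUs, which forces fallback regardless of available memory; compare against the output of `get_direct_lake_guardrails()`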
- **Fallback probability**: Higher chance of SQL Endpoint fallback

##### **Performance Comparison**:
| Metric | 1B Rows | 2B Rows | Expected Impact |
|--------|---------|---------|-----------------|
| **Memory usage** | 4-8GB | 8-16GB | 2x increase |
| **Execution time** | 5-30s | 10-60s or fallback | Variable |
| **Fallback risk** | Low | Moderate-High | Depends on capacity |

### Expected Scenarios

#### Scenario 1: Direct Lake Success (Best Case)
- **Memory sufficient**: Available memory can handle 2B-row column
- **Performance**: Slower than 1B but still reasonable (up to about 60 seconds)
- **Traces show**: DirectLake execution path maintained
- **DMV impact**: Large memory allocation for Quantity_ThisYear column

#### Scenario 2: SQL Endpoint Fallback (Common Case)
- **Guardrail breach**: The 2B-row table exceeds a limit of the capacity, such as rows per table
- **Automatic fallback**: Direct Lake gracefully falls back to SQL Endpoint
- **Performance**: May be slower but still functional
- **Traces show**: Fallback events and SQL execution

#### Scenario 3: Resource Limits (Stress Case)
- **Capacity constraints**: Multiple factors trigger protective fallback
- **Error handling**: System protects against memory exhaustion
- **Learning opportunity**: Understand Direct Lake boundaries

### Fallback Analysis Opportunity
This query is **designed to explore fallback behavior**:

#### Key Learning Points:
- **When fallback occurs**: Exact conditions that trigger SQL Endpoint usage
- **Performance impact**: How fallback affects query execution time
- **User experience**: Fallback is transparent to end users
- **Resource protection**: How Fabric protects against memory exhaustion

### Monitoring Strategy
With the `DMV=False` parameter, `runQueryWithTrace` skips the DMV output:
- **Focus on traces**: Detailed execution path analysis
- **Reduced output**: Cleaner focus on performance characteristics
- **Fallback detection**: Clear visibility into execution mode changes
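
One way to read the execution mode from a trace is to look at which storage engine events it contains. The sketch below assumes that `DirectQueryEnd` and `VertiPaqSEQueryEnd` remain in the trace schema after `runQueryWithTrace` removes the corresponding begin events, and that the event class names have been extracted into a plain list. `classify_execution_path` is a hypothetical helper, not part of Semantic Link.

```python
def classify_execution_path(event_classes: list) -> str:
    """Classify a traced query by the storage engine events it produced."""
    # Any DirectQuery event means the query went through the SQL endpoint
    if any(name.startswith("DirectQuery") for name in event_classes):
        return "DirectQuery fallback"
    # VertiPaq storage engine events indicate in-memory Direct Lake execution
    if any(name.startswith("VertiPaqSE") for name in event_classes):
        return "Direct Lake"
    return "Unknown (no storage engine events captured)"


print(classify_execution_path(["QueryBegin", "VertiPaqSEQueryEnd", "QueryEnd"]))
print(classify_execution_path(["QueryBegin", "DirectQueryEnd", "QueryEnd"]))
```

A query answered entirely from cache may emit neither kind of event, which is why the helper keeps an explicit "Unknown" result instead of guessing.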

**Expected outcome**: Either successful 2-billion row Direct Lake aggregation or educational fallback scenario demonstrating system limits and protection mechanisms.

In [None]:
df = runQueryWithTrace("""

    EVALUATE
        SUMMARIZECOLUMNS(
                dim_Date[FirstDateofMonth] ,
                "Count of Transactions" , COUNTROWS(fact_myevents_2bln) ,
                "Sum of Sales" , [Sum of Sales (2bln)]
        )
        ORDER BY [FirstDateofMonth]

""",workspaceName,SemanticModelName,DMV=False)

#### 16.3 Ultimate Stress Test: Multi-Billion Row Cross-Table Analysis

### Maximum Scale Multi-Table Query
This advanced query represents the **ultimate Direct Lake stress test** by combining both billion-row tables in a single analysis:

```dax
EVALUATE
SUMMARIZECOLUMNS(
    dim_Date[FirstDateofMonth],
    "Count of Transactions", COUNTROWS(fact_myevents_1bln),
    "Sum of Sales (1bln)", [Sum of Sales (1bln)],
    "Sum of Sales (2bln)", [Sum of Sales (2bln)]
)
ORDER BY [FirstDateofMonth]
```

#### Multi-Table Complexity Analysis:

##### **Memory Amplification**:
- **1B + 2B columns**: Potentially 12-24GB total memory requirement
- **Concurrent loading**: Both Quantity_ThisYear columns may load simultaneously
- **Relationship columns**: Multiple DateKey columns from fact tables

##### **Cross-Table Performance**:
- **Parallel processing**: Direct Lake may optimize multi-table access
- **Memory coordination**: Smart loading of shared dimension data
- **Resource management**: Advanced memory allocation strategies

### Expected Execution Scenarios

#### Scenario 1: Direct Lake Optimization Success
- **Intelligent caching**: Shared dimensions loaded once
- **Parallel aggregation**: Both fact tables processed efficiently
- **Memory management**: Optimal allocation across multiple large columns
- **Performance**: Reasonable execution time despite scale

#### Scenario 2: Selective Fallback
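- **Partial fallback**: Fallback is decided per query, so this query may fall back because it touches the 2B table even though 1B-only queries stay in Direct Lake
- **Mixed execution**: Compare this trace with the 16.1 trace to see both execution paths
- **Performance variation**: Different execution paths for different queries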

#### Scenario 3: Complete Fallback
- **System protection**: Total memory requirements exceed capacity
- **Full SQL Endpoint**: All tables processed via SQL Analytics Endpoint
- **Consistent results**: Same analytical output via different execution path

### Business Value Demonstration

#### Comparative Analysis Capability:
This query demonstrates **real-world business scenarios**:
- **Trend comparison**: Compare current (1B) vs. historical (2B) data patterns
- **Scale analysis**: Understand business growth impact on performance
- **Performance baselines**: Establish benchmarks for production workloads

#### Enterprise Readiness Validation:
- **Combined workload**: Approximates a report that reads several large tables in one query; true multi-user concurrency needs a separate test
- **Resource scaling**: Validates Direct Lake behavior under pressure
- **Fallback reliability**: Confirms transparent degradation when needed

### Advanced Monitoring Insights

#### Memory Orchestration:
Watch for:
- **Load sequencing**: Order in which large columns are loaded
- **Memory sharing**: How dimension tables are cached across fact table queries
- **Resource coordination**: System-level memory management decisions

#### Performance Patterns:
- **Execution parallelism**: Whether fact tables are processed concurrently
- **Optimization strategies**: How Direct Lake handles multi-billion row scenarios
- **Fallback coordination**: If fallback occurs, how it's managed across tables

**Expected outcome**: Ultimate demonstration of Direct Lake's enterprise-scale capabilities or intelligent fallback behavior when approaching system limits.

🎯 **Maximum scale checkpoint**: Multi-billion row analytics showcasing Direct Lake's ultimate capabilities!

In [None]:
df = runQueryWithTrace("""

    EVALUATE
        SUMMARIZECOLUMNS(
                dim_Date[FirstDateofMonth] ,
                "Count of Transactions" , COUNTROWS(fact_myevents_1bln) ,
                "Sum of Sales (1bln)" , [Sum of Sales (1bln)] ,
                "Sum of Sales (2bln)" , [Sum of Sales (2bln)]
        )
        ORDER BY [FirstDateofMonth]

""",workspaceName,SemanticModelName,DMV=False)

## 17. Lab 2 Completion and Big Data Insights

### Congratulations! Big Data Mastery Achieved 🎉

You've successfully completed the most challenging Direct Lake lab, working with **billion-row datasets** and pushing the technology to its limits.

#### 🚀 **What You've Accomplished**:
- ✅ **Cross-workspace data access** via OneLake shortcuts
- ✅ **Billion-row semantic model** creation and configuration
- ✅ **Advanced performance monitoring** with tracing and DMVs
- ✅ **Fallback scenario analysis** understanding Direct Lake limits
- ✅ **Multi-table billion-row analytics** stress testing

### Key Direct Lake Big Data Learnings

#### 📊 **Scale Capabilities**:
- **Direct Lake can handle billion-row tables** when properly configured
- **OneLake shortcuts enable zero-copy big data access** across workspaces
- **Intelligent fallback protects against memory exhaustion** while maintaining functionality
- **Performance monitoring tools** provide deep insights into large-scale operations

#### 🛡️ **Guardrails Understanding**:
- **Memory limits protect system stability** while maximizing performance
- **Automatic fallback to SQL Endpoint** ensures query reliability
- **Column temperature tracking** shows memory usage patterns
- **Cross-workspace shortcuts** maintain performance with proper configuration

#### ⚡ **Performance Insights**:
- **V-Order optimization** significantly improves billion-row query performance
- **Partitioning strategies** can enhance large-scale analytics
- **Memory management** becomes critical at enterprise scale
- **Query design patterns** impact Direct Lake vs. fallback behavior

### Real-World Applications

#### Enterprise Scenarios You're Now Ready For:
- **📈 Historical trend analysis** across years of transactional data
- **🌍 Global analytics** combining data from multiple regions/workspaces
- **📊 Real-time dashboards** over massive operational datasets
- **🔍 Detailed forensic analysis** of billion-row audit logs

### Next Steps in Your Big Data Journey

#### 🎯 **Immediate Exploration**:
- Experiment with different query patterns to understand fallback triggers
- Create Power BI reports using your billion-row semantic model
- Test concurrent user scenarios to understand multi-user performance

#### 📚 **Advanced Learning Path**:
- **Lab 3**: Delta table analysis and optimization techniques
- **Lab 4**: Deep dive into fallback behaviors and troubleshooting
- **Lab 5**: Framing and refresh strategies for big data
- **Labs 6-7**: Performance optimization techniques for billion-row scenarios

### Resource Cleanup Importance
Stopping the Spark session after big data operations matters because it:
- **💰 Releases the notebook's Spark compute**, which otherwise keeps consuming capacity until the session times out
- **🧹 Frees notebook-side memory**; columns already loaded into the semantic model stay resident until the engine evicts them
- **✅ Closes the session's open connections** cleanly

### Big Data Direct Lake Mastery Certificate 🏆
You now understand:
- ✅ **Enterprise-scale Direct Lake** capabilities and limitations
- ✅ **Performance monitoring** for billion-row scenarios
- ✅ **Fallback behavior** and system protection mechanisms
- ✅ **Cross-workspace big data** architecture with OneLake shortcuts

**Ready for production big data analytics with Direct Lake!** 🚀

In [None]:
mssparkutils.session.stop()