# Lab 2: Direct Lake with Big Data - Billion Row Analytics

## Lab Overview

This lab demonstrates Direct Lake's **enterprise-scale capabilities** by working with **billion-row datasets**. You'll learn how Direct Lake handles massive data volumes, create OneLake shortcuts for cross-workspace data access, and understand when Direct Lake falls back to DirectQuery through the SQL analytics endpoint.

### What You'll Build

**Workshop Flow:**
```
1. Setup Big Data Environment
   ↓
2. Create OneLake Shortcuts
   ↓
3. Build Billion-Row Semantic Model
   ↓
4. Performance Testing & Tracing
   ↓
5. Analyze Fallback Scenarios
   ↓
6. Optimize for Scale
```

### Key Concepts
- **OneLake Shortcuts**: Access data across workspaces without duplication
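- **Direct Lake Guardrails**: Capacity-dependent limits (such as rows per table) that decide whether a table can be served in Direct Lake mode
- **Fallback Behavior**: When Direct Lake falls back to DirectQuery through the SQL analytics endpoint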
- **Column Temperature**: Memory management for large datasets

### Learning Objectives
By completing this lab, you'll be able to:
- ✅ Create OneLake shortcuts for cross-workspace data access
- ✅ Build semantic models with billion-row fact tables
- ✅ Monitor Direct Lake performance with advanced tracing
- ✅ Understand and troubleshoot fallback scenarios
- ✅ Optimize queries for massive datasets

### Dataset Scale
| Table | Rows | Purpose |
|:------|:-----|:--------|
| **fact_myevents_1bln** | 1 billion | Standard fact table |
| **fact_myevents_1bln_no_vorder** | 1 billion | Fact table written without V-Order |
| **fact_myevents_2bln** | 2 billion | Stress test limits |
| **fact_myevents_1bln_partitioned_datekey** | 1 billion | Optimized with partitioning |
| **dim_Date** | ~3,650 | Date dimension |
| **dim_Geography** | ~200 | Geography dimension |

**Estimated Time**: 60-90 minutes  
**Prerequisites**: Lab 1 completion, access to Big Data workspace

---

## 1. Install Required Libraries

Install Semantic Link Labs (`semantic-link-labs`), which provides the lakehouse, shortcut, and Direct Lake helper functions used throughout this lab.

In [None]:
%pip install -q --disable-pip-version-check semantic-link-labs

## 2. Import Libraries and Set Variables

Import required libraries and configure environment variables for big data workspace and region-aware data source selection.

In [None]:
import sempy_labs as labs
from sempy import fabric
import sempy
import pandas
import json
import time

LakehouseName = "BigData"
SemanticModelName = f"{LakehouseName}_model"

capacity_name = labs.get_capacity_name()

Shortcut_LakehouseName = "BigDemoDB"
Shortcut_WorkspaceName = "DL Labs - Data [North Central US]"
if capacity_name == "FabConUS8-P1":
    Shortcut_WorkspaceName = "DL Labs - Data [West US 3]"
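
If more regional capacities are added later, the `if` override above can be written as a lookup table. The sketch below uses only the capacity and workspace names from this cell; the function name `pick_shortcut_workspace` and the constant names are made up for illustration.

```python
# Default source workspace, used when a capacity has no explicit override.
DEFAULT_SHORTCUT_WORKSPACE = "DL Labs - Data [North Central US]"

# Capacity name -> source workspace. Only the entry from the cell above is known;
# further entries would be added here.
WORKSPACE_BY_CAPACITY = {
    "FabConUS8-P1": "DL Labs - Data [West US 3]",
}

def pick_shortcut_workspace(capacity: str) -> str:
    """Return the shortcut source workspace for a capacity, or the default."""
    return WORKSPACE_BY_CAPACITY.get(capacity, DEFAULT_SHORTCUT_WORKSPACE)

print(pick_shortcut_workspace("FabConUS8-P1"))
print(pick_shortcut_workspace("AnyOtherCapacity"))
```

In the notebook this would be called as `pick_shortcut_workspace(capacity_name)`.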


## 3. Create Lakehouse for Big Data

Create a lightweight lakehouse that will use OneLake shortcuts to access billion-row tables without data duplication.

In [None]:
lakehouses=labs.list_lakehouses()["Lakehouse Name"]
if LakehouseName in lakehouses.values:
    lakehouseId = notebookutils.lakehouse.getWithProperties(LakehouseName)["id"]
else:
    lakehouseId = fabric.create_lakehouse(LakehouseName)

workspaceId = notebookutils.lakehouse.getWithProperties(LakehouseName)["workspaceId"]
workspaceName = sempy.fabric.resolve_workspace_name(workspaceId)
print(f"WorkspaceId = {workspaceId}, LakehouseID = {lakehouseId}, Workspace Name = {workspaceName}")

## 4. Create OneLake Shortcuts for Big Data Access

Creates shortcuts to billion-row fact tables and dimension tables across workspaces using OneLake shortcut functionality.

In [None]:
#1. Remove any existing shortcuts
for index, row in labs.lakehouse.list_shortcuts(lakehouse=LakehouseName).iterrows():
    labs.lakehouse.delete_shortcut(shortcut_name=row["Shortcut Name"],lakehouse=LakehouseName)
    print(f"Deleted shortcut {row['Shortcut Name']}")

#2. Creates correct shortcuts
labs.lakehouse.create_shortcut_onelake(table_name="fact_myevents_1bln"                      ,source_lakehouse=Shortcut_LakehouseName,source_workspace=Shortcut_WorkspaceName,destination_lakehouse=LakehouseName)
labs.lakehouse.create_shortcut_onelake(table_name="fact_myevents_1bln_no_vorder"            ,source_lakehouse=Shortcut_LakehouseName,source_workspace=Shortcut_WorkspaceName,destination_lakehouse=LakehouseName)
labs.lakehouse.create_shortcut_onelake(table_name="fact_myevents_1bln_partitioned_datekey"  ,source_lakehouse=Shortcut_LakehouseName,source_workspace=Shortcut_WorkspaceName,destination_lakehouse=LakehouseName)
labs.lakehouse.create_shortcut_onelake(table_name="fact_myevents_2bln"                      ,source_lakehouse=Shortcut_LakehouseName,source_workspace=Shortcut_WorkspaceName,destination_lakehouse=LakehouseName)
labs.lakehouse.create_shortcut_onelake(table_name="dim_Date"                                ,source_lakehouse=Shortcut_LakehouseName,source_workspace=Shortcut_WorkspaceName,destination_lakehouse=LakehouseName)
labs.lakehouse.create_shortcut_onelake(table_name="dim_Geography"                           ,source_lakehouse=Shortcut_LakehouseName,source_workspace=Shortcut_WorkspaceName,destination_lakehouse=LakehouseName)

print('Adding shortcuts complete.')
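
The cell above deletes every shortcut and recreates all six. A gentler alternative is to compare the existing shortcut names with the desired ones and touch only the difference. The sketch below shows just that comparison in plain Python; `plan_shortcut_changes` and the sample `existing` list are hypothetical, and the actual delete and create calls are left to the `labs.lakehouse` functions already used in this cell.

```python
def plan_shortcut_changes(existing, desired):
    """Return (to_delete, to_create) so only differing shortcuts are touched."""
    existing_set, desired_set = set(existing), set(desired)
    to_delete = sorted(existing_set - desired_set)
    to_create = sorted(desired_set - existing_set)
    return to_delete, to_create

# Table names taken from the shortcut cell above.
DESIRED = [
    "fact_myevents_1bln",
    "fact_myevents_1bln_no_vorder",
    "fact_myevents_1bln_partitioned_datekey",
    "fact_myevents_2bln",
    "dim_Date",
    "dim_Geography",
]

# Hypothetical current state of the lakehouse, for illustration only.
existing = ["fact_myevents_1bln", "dim_Date", "old_table"]

to_delete, to_create = plan_shortcut_changes(existing, DESIRED)
print("delete:", to_delete)
print("create:", to_create)
```

This compares names only, so a shortcut that has the right name but points at the wrong source workspace would not be detected. Deleting and recreating everything, as the lab does, avoids that case.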

## 5. Synchronize Big Data Table Metadata

Forces a lakehouse metadata refresh so that the SQL analytics endpoint recognizes the billion-row shortcuts and their schemas.

1. **Endpoint Lookup**: Read the lakehouse's SQL endpoint ID from the Fabric REST API
2. **Sync Trigger**: Post a metadata refresh command to that endpoint
3. **📊 Progress Monitoring**: Poll batch status every second until success
4. **✅ Completion Validation**: Confirm all billion-row tables are properly cataloged

#### Why REST API vs. Automatic?
- **⏱️ Timing control**: Force immediate sync rather than waiting for background processes
- **🔍 Visibility**: Real-time progress monitoring for large metadata operations
- **🛡️ Reliability**: Guaranteed completion before proceeding to model creation
- **🔄 Repeatability**: Can be re-run if any step fails

#### Big Data Considerations:
- **Longer sync times**: Billion-row tables require more metadata processing
- **Resource usage**: Background jobs may consume more compute during sync
- **Cross-workspace complexity**: Shortcut metadata requires additional validation

**Expected behavior**: Periodic "running" status updates followed by "success" for complete metadata sync.
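
A polling loop of this kind is safer with a deadline and an explicit failure check, so that a stuck or failed batch does not block the notebook forever. The following is a minimal sketch in plain Python, not the lab's implementation. The status source is injected as a callable, and the failure state names in `TERMINAL_FAILURES` are assumptions, since the states this endpoint can report are not listed here.

```python
import time

# Assumed names for terminal failure states; adjust to what the endpoint returns.
TERMINAL_FAILURES = {"failure", "failed", "cancelled"}

def wait_for_success(get_state, timeout_s=300.0, interval_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns 'success', a failure state, or time runs out."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        print(f"Metadata refresh : {state}")
        if state == "success":
            return state
        if state in TERMINAL_FAILURES:
            raise RuntimeError(f"Metadata refresh ended in state '{state}'")
        if clock() >= deadline:
            raise TimeoutError(f"Metadata refresh still '{state}' after {timeout_s}s")
        sleep(interval_s)

# Demo with a canned sequence of states instead of a REST call.
demo_states = iter(["inProgress", "inProgress", "success"])
wait_for_success(lambda: next(demo_states), sleep=lambda s: None)
```

In the notebook, `get_state` would wrap the batch status request and return its `progressState` value.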

In [None]:
# Reference: https://medium.com/@sqltidy/delays-in-the-automatically-generated-schema-in-the-sql-analytics-endpoint-of-the-lakehouse-b01c7633035d

def triggerMetadataRefresh(max_polls:int=600):
    # Look up the SQL endpoint that belongs to this lakehouse
    client = fabric.FabricRestClient()
    response = client.get(f"/v1/workspaces/{workspaceId}/lakehouses/{lakehouseId}")
    sqlendpoint = response.json()['properties']['sqlEndpointProperties']['id']

    # Trigger sync
    uri = f"/v1.0/myorg/lhdatamarts/{sqlendpoint}"
    payload = {"commands":[{"$type":"MetadataRefreshExternalCommand"}]}
    response = client.post(uri,json= payload)
    batchId = response.json()['batchId']

    # Monitor progress: poll once per second and give up after max_polls attempts,
    # so a batch that never reaches "success" cannot hang the notebook
    statusuri = f"/v1.0/myorg/lhdatamarts/{sqlendpoint}/batches/{batchId}"
    progressState = None
    for _ in range(max_polls):
        progressState = client.get(statusuri).json()['progressState']
        print(f"Metadata refresh : {progressState}")
        if progressState == "success":
            break
        time.sleep(1)
    else:
        raise TimeoutError(f"Metadata refresh still '{progressState}' after {max_polls} polls")

    print('Metadata refresh complete')

triggerMetadataRefresh()

## 6. Create Big Data Direct Lake Semantic Model

Creates semantic model with automatic discovery of billion-row shortcut tables.

In [None]:
#1. Generate list of ALL table names from lakehouse to add to Semantic Model
lakehouseTables:list = labs.lakehouse.get_lakehouse_tables(lakehouse=LakehouseName)["Table Name"].tolist()

completedOK:bool=False
while not completedOK:
    try:
        #2 Create the semantic model only if it does not already exist
        if sempy.fabric.list_items().query(f"`Display Name`=='{SemanticModelName}' & Type=='SemanticModel'").shape[0] ==0:
            labs.directlake.generate_direct_lake_semantic_model(dataset=SemanticModelName,lakehouse_tables=lakehouseTables,workspace=workspaceName,lakehouse=lakehouseId,refresh=False,overwrite=True)
        # Leave the loop whether the model was just created or already existed
        completedOK=True
    except Exception as e:
        print(f'Error creating model ({e})... trying again.')
        time.sleep(3)
        triggerMetadataRefresh()

print('done')
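
The `while not completedOK` pattern used in this and the following cells retries until it succeeds, which can loop forever on a permanent error. A bounded variant could look like the sketch below. The helper name `retry` and its parameters are illustrative and not part of Semantic Link Labs.

```python
import time

def retry(action, attempts=5, delay_s=3.0, on_failure=None, sleep=time.sleep):
    """Call action() until it succeeds or the attempt budget is used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            print(f"Attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                if on_failure is not None:
                    on_failure()  # e.g. trigger a metadata refresh before retrying
                sleep(delay_s)
    raise RuntimeError(f"Gave up after {attempts} attempts") from last_error

# Demo: an action that fails twice before succeeding.
state = {"calls": 0}

def flaky_action():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ValueError("not ready yet")
    return "created"

print(retry(flaky_action, sleep=lambda s: None))
```

In this lab, `action` would wrap the model creation call and `on_failure` could be `triggerMetadataRefresh`.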

## 7. Configure Relationships for Big Data Analytics

Establishes star schema relationships between multiple billion-row fact tables and shared dimension tables.

In [None]:
completedOK:bool=False
while not completedOK:
    try:
        with labs.tom.connect_semantic_model(dataset=SemanticModelName, readonly=False) as tom:
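            #1. Remove any existing relationships (iterate over a copy, because removing
            #   items while enumerating the live collection invalidates the enumerator)
            for r in list(tom.model.Relationships):
                tom.model.Relationships.Remove(r)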

            #2. Creates correct relationships
            tom.add_relationship(from_table="fact_myevents_1bln"                    , from_column="DateKey"     , to_table="dim_Date"       , to_column="DateKey"       , from_cardinality="Many" , to_cardinality="One")
            tom.add_relationship(from_table="fact_myevents_1bln"                    , from_column="GeographyID" , to_table="dim_Geography"  , to_column="GeographyID"   , from_cardinality="Many" , to_cardinality="One")

            tom.add_relationship(from_table="fact_myevents_2bln"                    , from_column="DateKey"     , to_table="dim_Date"       , to_column="DateKey"       , from_cardinality="Many" , to_cardinality="One")
            tom.add_relationship(from_table="fact_myevents_2bln"                    , from_column="GeographyID" , to_table="dim_Geography"  , to_column="GeographyID"   , from_cardinality="Many" , to_cardinality="One")

            tom.add_relationship(from_table="fact_myevents_1bln_partitioned_datekey", from_column="DateKey"     , to_table="dim_Date"       , to_column="DateKey"       , from_cardinality="Many" , to_cardinality="One")
            tom.add_relationship(from_table="fact_myevents_1bln_partitioned_datekey", from_column="GeographyID" , to_table="dim_Geography"  , to_column="GeographyID"   , from_cardinality="Many" , to_cardinality="One")
            completedOK=True
    except:
        print('Error adding relationships... trying again.')
        time.sleep(3)

print('done')
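
The six relationships follow one pattern: every fact table joins to every dimension on a column that has the same name on both sides. As a sketch, the argument sets can be generated instead of typed out. Table and column names come from the cell above; `relationship_specs` is a hypothetical helper.

```python
from itertools import product

FACT_TABLES = [
    "fact_myevents_1bln",
    "fact_myevents_2bln",
    "fact_myevents_1bln_partitioned_datekey",
]

# Dimension table -> key column shared with the fact tables.
DIMENSION_KEYS = {"dim_Date": "DateKey", "dim_Geography": "GeographyID"}

def relationship_specs(fact_tables, dimension_keys):
    """Build keyword arguments for one many-to-one relationship per fact/dimension pair."""
    return [
        dict(from_table=fact, from_column=key,
             to_table=dim, to_column=key,
             from_cardinality="Many", to_cardinality="One")
        for fact, (dim, key) in product(fact_tables, dimension_keys.items())
    ]

specs = relationship_specs(FACT_TABLES, DIMENSION_KEYS)
for spec in specs:
    print(spec["from_table"], "->", spec["to_table"], "on", spec["from_column"])
```

Inside the `with labs.tom.connect_semantic_model(...)` block, each spec could then be passed as `tom.add_relationship(**spec)`.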

## 8. Create Performance-Optimized Measures for Big Data

Creates strategic measures for billion-row performance comparison and scale analysis.

In [None]:
completedOK:bool=False
while not completedOK:
    try:
        with labs.tom.connect_semantic_model(dataset=SemanticModelName, readonly=False) as tom:
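            #1. Remove any existing measures (iterate over a copy so that removal
            #   does not invalidate the enumerator; print the name before removing)
            for t in tom.model.Tables:
                for m in list(t.Measures):
                    print(f"Removing measure {m.Name}")
                    tom.remove_object(m)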

            tom.add_measure(table_name="fact_myevents_2bln",measure_name="Sum of Sales (2bln)",expression="SUM(fact_myevents_2bln[Quantity_ThisYear])",format_string="#,0")
            tom.add_measure(table_name="fact_myevents_1bln",measure_name="Sum of Sales (1bln)",expression="SUM(fact_myevents_1bln[Quantity_ThisYear])",format_string="#,0")
            completedOK=True
    except:
        print('Error adding measures... trying again.')
        time.sleep(3)

print('done')

## 9. Configure Date Intelligence for Big Data Analytics

Marks dim_Date as date table to enable time intelligence functions for efficient billion-row time-based filtering.

In [None]:
completedOK:bool=False
while not completedOK:
    try:
        with labs.tom.connect_semantic_model(dataset=SemanticModelName, readonly=False) as tom:
            tom.mark_as_date_table(table_name="dim_Date",column_name="DateKey")
            completedOK=True
    except:
        print('Error with date table... trying again.')
        time.sleep(3)

print('done')

## 10. Configure Logical Sorting for Big Data Visualizations

Sets logical column sorting for date dimensions to ensure proper chronological ordering in billion-row aggregation results.

In [None]:
completedOK:bool=False
while not completedOK:
    try:
        tom = labs.tom.TOMWrapper(dataset=SemanticModelName, workspace=workspaceName, readonly=False)
        tom.set_sort_by_column(table_name="dim_Date",column_name="MonthName"       ,sort_by_column="Month")
        tom.set_sort_by_column(table_name="dim_Date",column_name="WeekDayName"     ,sort_by_column="Weekday")
        tom.model.SaveChanges()

        #Show BIM data for dim_Date table
        i:int=0
        for t in tom.model.Tables:
            if t.Name=="dim_Date":
                bim = json.dumps(tom.get_bim()["model"]["tables"][i],indent=4)
                print(bim)
            i=i+1
        # Mark success once, after the loop, rather than on every iteration
        completedOK=True
    except:
        print('Error with sort by cols... trying again.')
        time.sleep(3)

print('done')
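
The cell above finds the `dim_Date` entry by counting positions, which assumes the BIM table list is in the same order as `tom.model.Tables`. Looking the table up by its `name` property avoids that assumption. The sketch below works on a plain dictionary shaped like the `tom.get_bim()` output. The sample data and the `find_table` helper are illustrative.

```python
import json

def find_table(bim: dict, name: str) -> dict:
    """Return the table definition with the given name from a BIM dictionary."""
    for table in bim["model"]["tables"]:
        if table.get("name") == name:
            return table
    raise KeyError(f"Table '{name}' not found in model")

# Tiny stand-in for the real BIM, for illustration only.
sample_bim = {
    "model": {
        "tables": [
            {"name": "dim_Geography", "columns": []},
            {"name": "dim_Date", "columns": [{"name": "MonthName", "sortByColumn": "Month"}]},
        ]
    }
}

print(json.dumps(find_table(sample_bim, "dim_Date"), indent=4))
```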

## 11. Optimize Big Data Model by Hiding Fact Table Columns

Hides all fact table columns to prevent accidental memory overload from billion-row column access.

In [None]:
completedOK:bool=False
while not completedOK:
    try:
        # Reuses the `tom` wrapper opened in step 10, so run that cell first
        i:int=0
        for t in tom.model.Tables:
            if t.Name in ["fact_myevents_1bln","fact_myevents_2bln","fact_myevents_1bln_partitioned_datekey"]:
                for c in t.Columns:
                    c.IsHidden=True

                bim = json.dumps(tom.get_bim()["model"]["tables"][i],indent=4)
                print(bim)
            i=i+1
        tom.model.SaveChanges()
        completedOK=True
    except:
        print('Error with hiding cols... trying again.')
        time.sleep(3)

print('done')

## 12. Refresh Big Data Model and Validate Configuration

Refreshes semantic model and validates proper configuration for billion-row analytics with enhanced error handling.

In [None]:
reframeOK:bool=False
while not reframeOK:
    try:
        result:pandas.DataFrame = labs.refresh_semantic_model(dataset=SemanticModelName)
        reframeOK=True
    except:
        print('Error with reframe... trying again.')
        triggerMetadataRefresh()
        time.sleep(3)

print('Custom Semantic Model reframe OK')

## 13. Advanced Performance Tracing for Big Data Analytics

Creates enhanced tracing function for comprehensive big data performance monitoring and fallback detection.

In [None]:
import warnings
from Microsoft.AnalysisServices.Tabular import TraceEventArgs
from typing import Dict, List, Optional, Callable

def runDMV():
    df = sempy.fabric.evaluate_dax(
        dataset=SemanticModelName, 
        dax_string="""
        
        SELECT 
            MEASURE_GROUP_NAME AS [TABLE],
            ATTRIBUTE_NAME AS [COLUMN],
            DATATYPE ,
            DICTIONARY_SIZE 		    AS SIZE ,
            DICTIONARY_ISPAGEABLE 		AS PAGEABLE ,
            DICTIONARY_ISRESIDENT		AS RESIDENT ,
            DICTIONARY_TEMPERATURE		AS TEMPERATURE,
            DICTIONARY_LAST_ACCESSED	AS LASTACCESSED 
        FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS 
        ORDER BY 
            [DICTIONARY_TEMPERATURE] DESC
        
        """)
    display(df)

def filter_func(e):
    # Keep every trace event except internal VertiPaq scans
    return e.EventSubclass.ToString() != "VertiPaqScanInternal"

def runQueryWithTrace (expr:str,workspaceName:str,SemanticModelName:str,Result:Optional[bool]=True,Trace:Optional[bool]=True,DMV:Optional[bool]=True,ClearCache:Optional[bool]=True) -> pandas.DataFrame :
    # Define events to trace and their corresponding columns
    event_schema = fabric.Trace.get_default_query_trace_schema()
    event_schema.update({"ExecutionMetrics":["EventClass","TextData"]})
    del event_schema['VertiPaqSEQueryBegin']
    del event_schema['VertiPaqSEQueryCacheMatch']
    del event_schema['DirectQueryBegin']

    # Suppress library warnings so the trace output stays readable
    warnings.filterwarnings("ignore")

    WorkspaceName = workspaceName

    if ClearCache:
        labs.clear_cache(SemanticModelName)

    with fabric.create_trace_connection(SemanticModelName,WorkspaceName) as trace_connection:
        # create trace on server with specified events
        with trace_connection.create_trace(
            event_schema=event_schema, 
            name="Simple Query Trace",
            filter_predicate=filter_func,
            stop_event="QueryEnd"
            ) as trace:

            trace.start()

            df=sempy.fabric.evaluate_dax(
                dataset=SemanticModelName, 
                dax_string=expr)

            if Result:
                displayHTML(f"<H2>####### DAX QUERY RESULT #######</H2>")
                display(df)

            # Wait 5 seconds for trace data to arrive
            time.sleep(5)

            # stop Trace and collect logs
            final_trace_logs = trace.stop()

    if Trace:
        displayHTML(f"<H2>####### SERVER TIMINGS #######</H2>")
        display(final_trace_logs)
    
    if DMV:
        displayHTML(f"<H2>####### SHOW DMV RESULTS #######</H2>")
        runDMV()
    
    return final_trace_logs



## 14. Validate Big Data Model Configuration

Validates model configuration and Direct Lake mode operation for billion-row tables using TABLETRAITS and guardrails.

In [None]:
df=sempy.fabric.evaluate_dax(
    dataset=SemanticModelName, 
    dax_string="""
    
    evaluate tabletraits()
    
    """)
display(df)

In [None]:
df=labs.directlake.get_direct_lake_guardrails()
display(df)
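
Guardrails are easiest to reason about when compared directly with the table sizes in this lab. The sketch below flags tables whose row count exceeds a rows-per-table limit. The row counts are the nominal figures from the Dataset Scale table. `ILLUSTRATIVE_LIMIT` is an assumed value for demonstration only, so read the real limit for your capacity from the guardrails output above.

```python
def tables_over_row_limit(row_counts: dict, max_rows_per_table: int) -> list:
    """Return the names of tables whose row count exceeds the given limit."""
    return sorted(name for name, rows in row_counts.items()
                  if rows > max_rows_per_table)

# Nominal row counts from the Dataset Scale table at the top of this lab.
ROW_COUNTS = {
    "fact_myevents_1bln": 1_000_000_000,
    "fact_myevents_2bln": 2_000_000_000,
    "fact_myevents_1bln_partitioned_datekey": 1_000_000_000,
}

# Assumption for illustration only; not read from any capacity.
ILLUSTRATIVE_LIMIT = 1_500_000_000

print(tables_over_row_limit(ROW_COUNTS, ILLUSTRATIVE_LIMIT))
```

A table that appears in this list is a candidate for fallback, which is what the 2 billion row query later in the lab is designed to probe.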

## 15. Establish Big Data Performance Baseline

Establishes performance baseline for billion-row analytics by capturing initial column states and memory usage.

In [None]:
runDMV()
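
To compare the baseline with later runs, it helps to reduce the DMV output to a few numbers. The sketch below summarizes a list of row dictionaries using the column aliases from `runDMV` (`TABLE`, `COLUMN`, `SIZE`, `RESIDENT`). The sample rows are invented, and the exact column labels returned by `evaluate_dax` are an assumption that should be checked against the real output.

```python
def summarize_residency(rows: list) -> dict:
    """Count resident columns and total their reported dictionary size."""
    resident = [r for r in rows if r["RESIDENT"]]
    return {
        "total_columns": len(rows),
        "resident_columns": len(resident),
        "resident_dictionary_size": sum(r["SIZE"] for r in resident),
    }

# Invented sample rows shaped like the DMV result, for illustration only.
sample_rows = [
    {"TABLE": "dim_Date", "COLUMN": "DateKey", "SIZE": 1200, "RESIDENT": True},
    {"TABLE": "fact_myevents_1bln", "COLUMN": "DateKey", "SIZE": 5000, "RESIDENT": True},
    {"TABLE": "fact_myevents_1bln", "COLUMN": "Quantity_ThisYear", "SIZE": 0, "RESIDENT": False},
]

print(summarize_residency(sample_rows))
```

With a pandas DataFrame, the rows could be obtained with `df.to_dict("records")`, provided `runDMV` is changed to return the DataFrame instead of only displaying it.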

## 16. Execute Billion-Row Analytics with Performance Monitoring

Executes comprehensive billion-row analytics with detailed performance monitoring and stress testing.

#### 16.1 Baseline Performance: 1 Billion Row Analytics

Executes baseline query against 1 billion row table to establish Direct Lake performance characteristics.

In [None]:
df = runQueryWithTrace("""
    
    EVALUATE
        SUMMARIZECOLUMNS(
               
                dim_Date[FirstDateofMonth] ,
                "Count of Transactions" , COUNTROWS(fact_myevents_1bln) ,
                "Sum of Sales" , [Sum of Sales (1bln)] 
        )
        ORDER BY [FirstDateofMonth]

""",workspaceName,SemanticModelName)
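
The test queries in this section differ only in the fact table and the measure. A small template function keeps them consistent. This is a sketch: `monthly_summary_query` is a hypothetical helper, and the table, column, and measure names are the ones defined earlier in the lab.

```python
def monthly_summary_query(fact_table: str, measure: str) -> str:
    """Build the monthly row-count and sales query for one fact table."""
    return f"""
EVALUATE
    SUMMARIZECOLUMNS(
        dim_Date[FirstDateofMonth],
        "Count of Transactions", COUNTROWS({fact_table}),
        "Sum of Sales", [{measure}]
    )
    ORDER BY [FirstDateofMonth]
"""

print(monthly_summary_query("fact_myevents_1bln", "Sum of Sales (1bln)"))
```

The returned string can be passed as the first argument of `runQueryWithTrace`.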

#### 16.2 Stress Test: 2 Billion Row Analytics

Tests Direct Lake limits with 2 billion row query to explore fallback behavior and performance boundaries.

In [None]:
df = runQueryWithTrace("""

    EVALUATE
        SUMMARIZECOLUMNS(
                dim_Date[FirstDateofMonth] ,
                "Count of Transactions" , COUNTROWS(fact_myevents_2bln) ,
                "Sum of Sales" , [Sum of Sales (2bln)]
        )
        ORDER BY [FirstDateofMonth]

""",workspaceName,SemanticModelName,DMV=False)
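
When a Direct Lake query falls back, it is served through DirectQuery, so DirectQuery events in the trace are the signal to look for. The sketch below filters a list of trace rows for such events. It assumes the trace schema still contains a DirectQuery event (only `DirectQueryBegin` is removed in `runQueryWithTrace`) and that the event name is in a column labelled `Event Class`. Both are assumptions to verify against the real trace output, and the sample rows are invented.

```python
def fallback_events(trace_rows: list, event_key: str = "Event Class") -> list:
    """Return trace rows whose event class indicates a DirectQuery operation."""
    return [row for row in trace_rows
            if str(row.get(event_key, "")).startswith("DirectQuery")]

# Invented trace rows, for illustration only.
sample_trace = [
    {"Event Class": "QueryBegin"},
    {"Event Class": "VertiPaqSEQueryEnd"},
    {"Event Class": "DirectQueryEnd"},
    {"Event Class": "QueryEnd"},
]

hits = fallback_events(sample_trace)
print("Fallback detected" if hits else "Served in Direct Lake mode", len(hits))
```

With the DataFrame returned by `runQueryWithTrace`, the rows could be obtained with `df.to_dict("records")`.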

#### 16.3 Ultimate Stress Test: Multi-Billion Row Cross-Table Analysis

Runs a single query that aggregates both the 1 billion and 2 billion row fact tables, grouped by month.

**Expected outcome**: The query either completes in Direct Lake mode or falls back to DirectQuery if a guardrail is exceeded. The trace output shows which path was taken.

🎯 **Maximum scale checkpoint**: This is the largest query in the lab, touching roughly 3 billion rows across two fact tables.

In [None]:
df = runQueryWithTrace("""

    EVALUATE
        SUMMARIZECOLUMNS(
                dim_Date[FirstDateofMonth] ,
                "Count of Transactions" , COUNTROWS(fact_myevents_1bln) ,
                "Sum of Sales (1bln)" , [Sum of Sales (1bln)] ,
                "Sum of Sales (2bln)" , [Sum of Sales (2bln)]
        )
        ORDER BY [FirstDateofMonth]

""",workspaceName,SemanticModelName,DMV=False)

## 17. Lab 2 Completion and Big Data Insights

### Congratulations! Big Data Mastery Achieved 🎉

You've successfully completed the most challenging Direct Lake lab, working with **billion-row datasets** and pushing the technology to its limits.

#### 🚀 **What You've Accomplished**:
- ✅ **Cross-workspace data access** via OneLake shortcuts
- ✅ **Billion-row semantic model** creation and configuration
- ✅ **Advanced performance monitoring** with tracing and DMVs
- ✅ **Fallback scenario analysis** understanding Direct Lake limits
- ✅ **Multi-table billion-row analytics** stress testing

### Key Direct Lake Big Data Learnings

#### 📊 **Scale Capabilities**:
- **Direct Lake can handle billion-row tables** when properly configured
- **OneLake shortcuts enable zero-copy big data access** across workspaces
- **Intelligent fallback protects against memory exhaustion** while maintaining functionality
- **Performance monitoring tools** provide deep insights into large-scale operations

#### 🛡️ **Guardrails Understanding**:
- **Memory limits protect system stability** while maximizing performance
- **Automatic fallback to DirectQuery** through the SQL analytics endpoint keeps queries answerable when Direct Lake cannot serve them
- **Column temperature tracking** shows memory usage patterns
- **Cross-workspace shortcuts** maintain performance with proper configuration

#### ⚡ **Performance Insights**:
- **V-Order optimization** can improve billion-row query performance (compare `fact_myevents_1bln` with `fact_myevents_1bln_no_vorder` to measure it)
- **Partitioning strategies** can enhance large-scale analytics
- **Memory management** becomes critical at enterprise scale
- **Query design patterns** impact Direct Lake vs. fallback behavior

### Real-World Applications

#### Enterprise Scenarios You're Now Ready For:
- **📈 Historical trend analysis** across years of transactional data
- **🌍 Global analytics** combining data from multiple regions/workspaces
- **📊 Real-time dashboards** over massive operational datasets
- **🔍 Detailed forensic analysis** of billion-row audit logs

### Next Steps in Your Big Data Journey

#### 🎯 **Immediate Exploration**:
- Experiment with different query patterns to understand fallback triggers
- Create Power BI reports using your billion-row semantic model
- Test concurrent user scenarios to understand multi-user performance

#### 📚 **Advanced Learning Path**:
- **Lab 3**: Delta table analysis and optimization techniques
- **Lab 4**: Deep dive into fallback behaviors and troubleshooting
- **Lab 5**: Framing and refresh strategies for big data
- **Lab 6-7**: Performance optimization techniques for billion-row scenarios

### Resource Cleanup Importance
Stopping the Spark session at the end of the lab:
- **💰 Releases the notebook's compute** back to the capacity
- **🧹 Frees memory held by the session**, such as query results returned to the notebook
- **ℹ️ Does not unload the semantic model**: column dictionaries live in the semantic model's memory, not in the Spark session

### Big Data Direct Lake Mastery Certificate 🏆
You now understand:
- ✅ **Enterprise-scale Direct Lake** capabilities and limitations
- ✅ **Performance monitoring** for billion-row scenarios
- ✅ **Fallback behavior** and system protection mechanisms
- ✅ **Cross-workspace big data** architecture with OneLake shortcuts

**Ready for production big data analytics with Direct Lake!** 🚀

In [None]:
# notebookutils replaces the older mssparkutils alias and is already used earlier in this lab
notebookutils.session.stop()