# Lab 1: Create Direct Lake Semantic Model

## Lab Overview

This lab teaches you how to create a **Direct Lake semantic model** from scratch using Microsoft Fabric. You'll learn the complete workflow from data loading to model creation and validation.

### What You'll Build
```mermaid
graph LR
    A[Lakehouse] --> B[Load Adventure Works Data]
    B --> C[Create Semantic Model]
    C --> D[Add Relationships]
    D --> E[Create Measures]
    E --> F[Test & Validate]
```

### Key Concepts
- **Direct Lake**: Query data directly from Delta Lake without imports
- **Adventure Works**: Sample business dataset with customers, products, and sales
- **Semantic Model**: Business logic layer with relationships and measures

### Learning Objectives
By completing this lab, you'll be able to:
- ✅ Set up a lakehouse and load sample data
- ✅ Create a Direct Lake semantic model programmatically  
- ✅ Define table relationships and business measures
- ✅ Validate model performance and behavior

**Estimated Time**: 30-45 minutes

---

## 1. Install Required Libraries

Install Semantic Link Labs to enable Direct Lake model creation and management capabilities.

In [None]:
%pip install -q --disable-pip-version-check semantic-link-labs

## 2. Import Libraries and Set Variables

Import required libraries and define key variables for the lakehouse and semantic model names.

In [None]:
import sempy_labs as labs
from sempy import fabric
import sempy
import pandas
import json
import time

LakehouseName = "AdventureWorks"
SemanticModelName = f"{LakehouseName}_model"

## 3. Create or Connect to Lakehouse

Check if the AdventureWorks lakehouse exists, create it if needed, and retrieve workspace identifiers.

In [None]:
lakehouses=labs.list_lakehouses()["Lakehouse Name"]
if LakehouseName in lakehouses.values:
    lakehouseId = notebookutils.lakehouse.getWithProperties(LakehouseName)["id"]
else:
    lakehouseId = fabric.create_lakehouse(LakehouseName)

workspaceId = notebookutils.lakehouse.getWithProperties(LakehouseName)["workspaceId"]
workspaceName = sempy.fabric.resolve_workspace_name(workspaceId)
print(f"WorkspaceId = {workspaceId}, LakehouseID = {lakehouseId}, Workspace Name = {workspaceName}")

## 4. Load Adventure Works Sample Data

Load four Adventure Works tables (Customer, Date, Product, Sales) into the lakehouse using region-aware data sources.

**Tables being loaded:**
- DimCustomer (~18K customers)  
- DimDate (2K+ dates)
- DimProduct (~600 products)
- FactInternetSales (~60K sales records)

**Expected output:**
```
Loaded DimCustomer
Loaded DimDate  
Loaded DimProduct
Loaded FactInternetSales
Done
```

### Behind the Scenes
- Data is stored in **Delta format** for ACID compliance
- **Overwrite mode** ensures clean data for the workshop
- **OneLake integration** provides seamless cross-workspace data access
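
The load cell below writes each table to a OneLake path of the form `abfss://<workspaceId>@onelake.dfs.fabric.microsoft.com/<lakehouseId>/Tables/<table>`. As a minimal sketch, the path construction can be isolated in a small helper; the name `onelake_table_path` is hypothetical and not part of any Fabric library.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_table_path(workspace_id: str, lakehouse_id: str, table: str) -> str:
    """Build the abfss:// URI of a table in a lakehouse's Tables folder.

    Hypothetical helper: it only formats the string pattern used in this lab.
    """
    parts = {"workspace_id": workspace_id, "lakehouse_id": lakehouse_id, "table": table}
    for name, value in parts.items():
        # Reject empty values and embedded slashes, which would change the path shape
        if not value or "/" in value:
            raise ValueError(f"{name} must be a non-empty value without '/'")
    return f"abfss://{workspace_id}@{ONELAKE_HOST}/{lakehouse_id}/Tables/{table}"


print(onelake_table_path("my-workspace-id", "my-lakehouse-id", "DimDate"))
```

Keeping the pattern in one place makes it easier to spot a wrong workspace or lakehouse ID before Spark reports a less readable path error.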

🎯 **Success indicator**: All four "Loaded" messages followed by "Done"

In [None]:
capacity_name = labs.get_capacity_name()

def loadDataToLakehouse(fromTable: str, toTable: str):
    """
    Copy one Delta table from the shared sample lakehouse into this lab's lakehouse.
    
    Args:
        fromTable: Source table name to read from
        toTable: Target table name to write to
    """
    try:
        # Get lakehouse properties once and reuse
        lakehouse_props = notebookutils.lakehouse.getWithProperties(LakehouseName)
        workspaceId = lakehouse_props["workspaceId"]
        lakehouseId = lakehouse_props["id"]

        # Region-aware connection string selection
        if capacity_name == "FabConUS8-P1":  # West US 3
            conn_str = "abfss://b1d61bbe-de20-4d3a-8075-b8e2eaacb868@onelake.dfs.fabric.microsoft.com/631e45c0-1243-4f42-920a-56bfe6ecdd6d/Tables"
        else:  # North Central US (default)
            conn_str = "abfss://16cf855f-3bf4-4312-a7a1-ccf5cb6a0121@onelake.dfs.fabric.microsoft.com/99ed86df-13d1-4008-a7f6-5768e53f4f85/Tables"

        # Read the source Delta table
        source_df = spark.read.format("delta").load(f"{conn_str}/{fromTable}")

        # Overwrite the target table (data and schema) so reruns start clean.
        # No cache() is needed because the DataFrame is read once and written once.
        (source_df
         .write
         .format("delta")
         .mode("overwrite")
         .option("overwriteSchema", "true")
         .save(f"abfss://{workspaceId}@onelake.dfs.fabric.microsoft.com/{lakehouseId}/Tables/{toTable}"))
        
        print(f"Loaded {toTable}")
        
    except Exception as e:
        print(f"Error loading {toTable}: {e}")
        raise

# Load all tables with proper error handling
tables_to_load = [
    ("DimCustomer", "DimCustomer"),
    ("DimDate", "DimDate"),
    ("DimProduct", "DimProduct"),
    ("FactInternetSales", "FactInternetSales")
]

for from_table, to_table in tables_to_load:
    loadDataToLakehouse(from_table, to_table)

print("Done")

## 5. Trigger Lakehouse Metadata Synchronization

### Understanding the Challenge
When data is loaded into a lakehouse, there can be a **delay** before the SQL Analytics Endpoint recognizes the new table schemas. This is because:

- **Lakehouse storage** and **SQL endpoint** operate on different systems
- **Schema discovery** happens asynchronously in the background
- **Delta table metadata** needs to be synchronized across services

### What This Code Solves
The `triggerMetadataRefresh` function forces immediate synchronization by:

#### Step-by-Step Process:
1. **🔗 Creates API client**: Establishes connection to Fabric REST APIs
2. **🔍 Finds SQL endpoint**: Retrieves the SQL Analytics Endpoint ID for our lakehouse
3. **🚀 Triggers refresh**: Sends a metadata refresh command to synchronize schemas
4. **⏱️ Monitors progress**: Polls the refresh status until completion
5. **✅ Confirms success**: Reports when synchronization is complete
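
The monitoring step is a polling loop. A minimal, library-independent sketch of that pattern is shown here with a timeout so the loop cannot run forever. The helper name `wait_for_state`, its parameters, and the `"failure"` state used in the usage note are assumptions for illustration; only `"running"` and `"success"` appear in this lab's output.

```python
import time
from typing import Callable, Iterable


def wait_for_state(get_state: Callable[[], str],
                   success: str = "success",
                   failures: Iterable[str] = (),
                   timeout_s: float = 300.0,
                   interval_s: float = 1.0) -> str:
    """Call get_state() repeatedly until it returns the success state.

    Raises RuntimeError on a known failure state and TimeoutError when the
    deadline passes without success.
    """
    failure_states = set(failures)
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == success:
            return state
        if state in failure_states:
            raise RuntimeError(f"Operation ended in state '{state}'")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{state}' after {timeout_s} seconds")
        time.sleep(interval_s)


# Simulated status endpoint: two "running" responses, then "success"
responses = iter(["running", "running", "success"])
print(wait_for_state(lambda: next(responses), interval_s=0))
```

In the lab, `get_state` would wrap the status request for the batch, and `failures` could list whatever terminal error states the service reports (for example a hypothetical `"failure"`).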

### Technical Details
- **API endpoint**: Uses Fabric's REST API for programmatic control
- **Batch processing**: Operations are queued and tracked with batch IDs
- **Progress states**: Monitors transitions from "running" → "success"
- **Error handling**: The function itself only polls the batch status; later cells call it again whenever model creation or refresh fails

### Expected Output
You'll see real-time progress updates:
```
Metadata refresh : running
Metadata refresh : running  
Metadata refresh : success
Metadata refresh complete
```

### Why This Matters for Direct Lake
- **Schema accuracy**: Ensures Direct Lake models can read current table structures
- **Column metadata**: Synchronizes data types, constraints, and relationships
- **Performance optimization**: Enables proper query planning and execution

🎯 **Critical for success**: This step ensures your semantic model will have accurate table metadata!

In [None]:
# Reference: https://medium.com/@sqltidy/delays-in-the-automatically-generated-schema-in-the-sql-analytics-endpoint-of-the-lakehouse-b01c7633035d

def triggerMetadataRefresh():
    client = fabric.FabricRestClient()
    response = client.get(f"/v1/workspaces/{workspaceId}/lakehouses/{lakehouseId}")
    sqlendpoint = response.json()['properties']['sqlEndpointProperties']['id']

    # trigger sync
    uri = f"/v1.0/myorg/lhdatamarts/{sqlendpoint}"
    payload = {"commands":[{"$type":"MetadataRefreshExternalCommand"}]}
    response = client.post(uri,json= payload)
    batchId = response.json()['batchId']

    # Monitor progress: poll the batch status until it reports success
    statusuri = f"/v1.0/myorg/lhdatamarts/{sqlendpoint}/batches/{batchId}"
    deadline = time.time() + 300  # assumed 5-minute limit so a failed batch cannot loop forever
    while True:
        progressState = client.get(statusuri).json()['progressState']
        print(f"Metadata refresh : {progressState}")
        if progressState == "success":
            break
        if time.time() > deadline:
            raise TimeoutError(f"Metadata refresh did not succeed (last state: {progressState})")
        time.sleep(1)

    print('Metadata refresh complete')

triggerMetadataRefresh()

## 6. Create Direct Lake Semantic Model

### Understanding Semantic Models
A **semantic model** in Microsoft Fabric is the bridge between raw data and business insights. It provides:

- **📊 Business context**: Meaningful names, descriptions, and formatting
- **🔗 Relationships**: How tables connect for accurate cross-table analysis  
- **📈 Measures**: Pre-calculated business metrics using DAX
- **🎯 User experience**: Optimized for self-service analytics

### Direct Lake vs. Other Storage Modes

| Storage Mode | Data Location | Performance | Real-time | Memory Usage |
|--------------|---------------|-------------|-----------|--------------|
| **Direct Lake** | Lakehouse Delta tables | Excellent | ✅ Near real-time (after framing) | On demand (only queried columns are loaded) |
| Import | Semantic model cache | Fast | ❌ Scheduled refresh | High |
| DirectQuery | Source system | Variable | ✅ Real-time | Minimal |

### Code Walkthrough

#### 1. **Table Discovery**
```python
lakehouseTables = labs.lakehouse.get_lakehouse_tables(lakehouse=LakehouseName)["Table Name"]
```
- Automatically discovers all tables in our lakehouse
- Creates a list of table names for model inclusion

#### 2. **Duplicate Check**
```python
if sempy.fabric.list_items().query(f"`Display Name`=='{LakehouseName}_model'...").shape[0] ==0:
```
- Prevents creating duplicate models with the same name
- Uses pandas query syntax for filtering

#### 3. **Model Generation**
```python
labs.directlake.generate_direct_lake_semantic_model(...)
```
- **`dataset`**: Name of the semantic model to create
- **`lakehouse_tables`**: List of tables to include
- **`workspace`**: Target workspace for the model
- **`lakehouse`**: Source lakehouse ID
- **`refresh=False`**: Don't trigger refresh during creation
- **`overwrite=True`**: Replace existing model if found

#### 4. **Error Handling**
The code includes retry logic because:
- **Concurrent operations**: Multiple users might access the same resources
- **Metadata dependencies**: Tables must be fully synchronized
- **API rate limiting**: Fabric services may need brief delays
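
The retry loops in this lab repeat until the operation succeeds. A bounded variant is sketched here as a generic helper, under the assumption that giving up after a fixed number of attempts is acceptable; `retry` and `before_retry` are hypothetical names, not part of Semantic Link Labs.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T],
          attempts: int = 5,
          delay_s: float = 3.0,
          before_retry: Optional[Callable[[], None]] = None) -> T:
    """Run action(); on failure wait, run an optional recovery hook, and try again."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # a bare "except:" would also swallow KeyboardInterrupt
            last_error = error
            print(f"Attempt {attempt}/{attempts} failed: {error}")
            if attempt < attempts:
                time.sleep(delay_s)
                if before_retry is not None:
                    before_retry()
    raise RuntimeError(f"Gave up after {attempts} attempts") from last_error
```

In this lab, `action` would wrap the model-creation call and `before_retry` would be `triggerMetadataRefresh`, so each retry starts from freshly synchronized metadata.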

### Expected Outcome
- ✅ **New semantic model** created in your workspace
- ✅ **All four tables** automatically added with proper Direct Lake configuration
- ✅ **Ready for customization** with relationships, measures, and formatting

🎯 **Success indicator**: "Semantic model created OK" message confirms your model is ready!

In [None]:
from sempy import fabric

#1. Generate list of ALL table names from lakehouse to add to Semantic Model
lakehouseTables:list = labs.lakehouse.get_lakehouse_tables(lakehouse=LakehouseName)["Table Name"].tolist()

completedOK:bool=False
while not completedOK:
    try:
        #2 Create the semantic model only if it does not already exist
        if sempy.fabric.list_items().query(f"`Display Name`=='{SemanticModelName}' & Type=='SemanticModel'").shape[0] ==0:
            labs.directlake.generate_direct_lake_semantic_model(dataset=SemanticModelName,lakehouse_tables=lakehouseTables,workspace=workspaceName,lakehouse=lakehouseId,refresh=False,overwrite=True)
        # Leave the loop whether the model was just created or already existed;
        # setting this inside the "if" would loop forever when the model exists
        completedOK=True
    except Exception as e:
        print(f'Error creating model ({e})... trying again.')
        time.sleep(3)
        triggerMetadataRefresh()

print('Semantic model created OK')

## 7. Configure Table Relationships

### Why Relationships Matter in Data Modeling
**Relationships** are the foundation of accurate business intelligence. They define how tables connect and enable:

- **🔍 Cross-table filtering**: When you filter customers, see their related sales
- **📊 Accurate aggregations**: Prevent double-counting and incorrect totals
- **🧭 Intuitive navigation**: Users can explore data naturally across dimensions
- **⚡ Optimized queries**: Fabric can optimize query execution paths

### Star Schema Design Pattern
Our Adventure Works model follows the **star schema** pattern:

```
       DimCustomer ──┐
                     │
       DimDate ──────┼──── FactInternetSales (CENTER)
                     │
       DimProduct ───┘
```

### Relationship Configuration Details

#### 1. **Date Relationship**
```
FactInternetSales.OrderDateKey → DimDate.DateKey (Many-to-One)
```
- **Business meaning**: Each sale happens on one specific date
- **Cardinality**: Many sales can occur on the same date
- **Enables**: Time-based analysis (monthly/yearly trends)

#### 2. **Customer Relationship** 
```
FactInternetSales.CustomerKey → DimCustomer.CustomerKey (Many-to-One)
```
- **Business meaning**: Each sale belongs to one customer
- **Cardinality**: Customers can have multiple sales
- **Enables**: Customer segmentation and analysis

#### 3. **Product Relationship**
```
FactInternetSales.ProductKey → DimProduct.ProductKey (Many-to-One)
```
- **Business meaning**: Each sale involves one product
- **Cardinality**: Products can be sold multiple times
- **Enables**: Product performance analysis
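
A many-to-one relationship only behaves as described when the key on the "one" side is unique and every fact key has a matching dimension row. The following sketch checks both conditions on plain Python rows; the function name `check_many_to_one` and the sample rows are hypothetical and serve only to illustrate the rule.

```python
from collections import Counter


def check_many_to_one(fact_rows, fact_key, dim_rows, dim_key):
    """Report dimension keys that repeat and fact keys with no dimension match."""
    dim_counts = Counter(row[dim_key] for row in dim_rows)
    duplicate_dim_keys = sorted(key for key, count in dim_counts.items() if count > 1)
    fact_keys = {row[fact_key] for row in fact_rows}
    orphan_fact_keys = sorted(fact_keys - set(dim_counts))
    return {"duplicate_dim_keys": duplicate_dim_keys, "orphan_fact_keys": orphan_fact_keys}


# Illustrative rows only: key 2 repeats in the dimension, key 3 has no dimension row
sales = [{"CustomerKey": 1}, {"CustomerKey": 1}, {"CustomerKey": 3}]
customers = [{"CustomerKey": 1}, {"CustomerKey": 2}, {"CustomerKey": 2}]
print(check_many_to_one(sales, "CustomerKey", customers, "CustomerKey"))
```

Both lists being empty is the condition under which the "Many" and "One" cardinalities configured in this section match the data.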

### Technical Implementation

#### Clean Slate Approach
```python
for r in list(tom.model.Relationships):  # snapshot, so removal does not disturb the iteration
    tom.model.Relationships.Remove(r)
```
- Removes any existing relationships to ensure clean configuration
- Prevents conflicts from previous model iterations

#### TOM (Tabular Object Model) Connection
- **`readonly=False`**: Enables write operations on the model
- **Context manager**: Automatically handles connection cleanup
- **Save on exit**: Changes made inside the `with` block are saved to the model when the block exits

### Expected Outcome
- ✅ **Three relationships** established following star schema best practices
- ✅ **Proper cardinality** configured for accurate data analysis
- ✅ **Foundation ready** for meaningful cross-table queries

🎯 **Data modeling checkpoint**: Your model now understands how tables connect!

In [None]:
completedOK:bool=False
while not completedOK:
    try:
        with labs.tom.connect_semantic_model(dataset=SemanticModelName, readonly=False) as tom:
            #1. Remove any existing relationships (iterate over a snapshot so removal is safe)
            for r in list(tom.model.Relationships):
                tom.model.Relationships.Remove(r)

            #2. Creates correct relationships
            tom.add_relationship(from_table="FactInternetSales", from_column="OrderDateKey" , to_table="DimDate"    , to_column="DateKey"       , from_cardinality="Many" , to_cardinality="One")
            tom.add_relationship(from_table="FactInternetSales", from_column="CustomerKey"  , to_table="DimCustomer", to_column="CustomerKey"   , from_cardinality="Many" , to_cardinality="One")
            tom.add_relationship(from_table="FactInternetSales", from_column="ProductKey"   , to_table="DimProduct" , to_column="ProductKey"    , from_cardinality="Many" , to_cardinality="One")
            completedOK=True
    except Exception as e:
        print(f'Error adding relationships ({e})... trying again.')
        time.sleep(3)

print('done')


## 8. Add Business Intelligence Measures

### Understanding DAX Measures
**Measures** are the calculated fields that provide business insights. Unlike calculated columns, measures:

- **🔄 Calculate dynamically**: Values change based on filter context
- **📊 Aggregate data**: Sum, count, average across filtered datasets  
- **💰 Represent KPIs**: Revenue, growth rates, conversion metrics
- **🎯 Guide decisions**: Turn raw data into actionable insights

### Measure Design Principles

#### 1. **Measure Cleanup Strategy**
```python
for m in list(t.Measures):  # snapshot, so removal does not disturb the iteration
    tom.remove_object(m)
```
- **Clean slate approach**: Removes existing measures to avoid conflicts
- **Iterative development**: Allows for multiple runs during development
- **Consistency**: Ensures predictable measure configuration

#### 2. **Business Metrics Implementation**

##### 📈 **Sum of Sales**
```dax
SUM(FactInternetSales[SalesAmount])
```
- **Purpose**: Total revenue across filtered context
- **Format**: Currency with thousands separators `$#,0.###############`
- **Business use**: Revenue reporting, target tracking

##### 📊 **Count of Sales** 
```dax
COUNTROWS(FactInternetSales)
```
- **Purpose**: Number of transactions
- **Format**: Integer with thousands separators `#,0`
- **Business use**: Volume analysis, conversion tracking

### DAX Function Deep Dive

#### SUM() vs SUMX()
- **`SUM()`**: Aggregates a single column efficiently
- **`SUMX()`**: Row-by-row iteration (use when calculations needed per row)

#### COUNTROWS() vs COUNT()
- **`COUNTROWS()`**: Counts all rows (including blanks)
- **`COUNT()`**: Counts non-blank values in a specific column
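
The difference between these functions is easiest to see on a handful of rows. The sketch below is a plain-Python analogy, not DAX itself: the helper names (`dax_sum`, `dax_sumx`, `dax_countrows`, `dax_count`), the `UnitPrice` column, and the sample rows are all illustrative assumptions.

```python
# Illustrative rows; None plays the role of a blank value
rows = [
    {"Quantity": 2, "UnitPrice": 10.0, "SalesAmount": 20.0},
    {"Quantity": 1, "UnitPrice": 5.0, "SalesAmount": None},
    {"Quantity": 3, "UnitPrice": 4.0, "SalesAmount": 12.0},
]


def dax_sum(rows, column):
    """Like SUM: add up one column, ignoring blanks."""
    return sum(row[column] for row in rows if row[column] is not None)


def dax_sumx(rows, expression):
    """Like SUMX: evaluate an expression for each row, then add the results."""
    return sum(expression(row) for row in rows)


def dax_countrows(rows):
    """Like COUNTROWS: count every row, blank or not."""
    return len(rows)


def dax_count(rows, column):
    """Like COUNT: count only the non-blank values of one column."""
    return sum(1 for row in rows if row[column] is not None)


print(dax_sum(rows, "SalesAmount"))                                # 32.0
print(dax_sumx(rows, lambda r: r["Quantity"] * r["UnitPrice"]))    # 37.0
print(dax_countrows(rows))                                         # 3
print(dax_count(rows, "SalesAmount"))                              # 2
```

The blank `SalesAmount` in the second row is why the row count and the column count differ, and why the per-row product can differ from the stored total.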

### Formatting Standards
| Format String | Example Output | Use Case |
|---------------|----------------|----------|
| `$#,0.##` | $1,234.56 | Currency values |
| `#,0` | 1,234 | Whole numbers |
| `0.00%` | 12.34% | Percentages |

### Expected Outcome
- ✅ **Sum of Sales**: Currency-formatted revenue measure
- ✅ **Count of Sales**: Integer-formatted transaction count  
- ✅ **Clean implementation**: No duplicate or conflicting measures
- ✅ **Ready for reporting**: Measures available in visualization tools

🎯 **Business intelligence checkpoint**: Your model now calculates key business metrics!

In [None]:
completedOK:bool=False
while not completedOK:
    try:
        with labs.tom.connect_semantic_model(dataset=SemanticModelName, readonly=False) as tom:
            #1. Remove any existing measures (iterate over a snapshot so removal is safe)
            for t in tom.model.Tables:
                for m in list(t.Measures):
                    measureName = m.Name  # read the name before the object is removed
                    tom.remove_object(m)
                    print(f"[{measureName}] measure removed")

            #2. Add measures (the raw string keeps the backslash that escapes "$" in the format string)
            tom.add_measure(table_name="FactInternetSales" ,measure_name="Sum of Sales",expression="SUM(FactInternetSales[SalesAmount])",format_string=r"\$#,0.###############;(\$#,0.###############);\$#,0.###############")
            tom.add_measure(table_name="FactInternetSales" ,measure_name="Count of Sales",expression="COUNTROWS(FactInternetSales)",format_string="#,0")
            completedOK=True
    except Exception as e:
        print(f'Error adding measures ({e})... trying again.')
        time.sleep(3)

print('done')

## 9. Configure Date Table for Time Intelligence

### Why Mark a Date Table?
**Date tables** are special in business intelligence because they enable **time intelligence** functions. Marking a table as a date table:

- **🕐 Enables DAX time functions**: TOTALYTD, SAMEPERIODLASTYEAR, DATESADD
- **📅 Improves auto-formatting**: Dates display properly in visuals
- **🔄 Supports relative filtering**: "Last 30 days", "This quarter", "Year over year"
- **⚡ Optimizes performance**: Fabric optimizes queries involving date operations

### Date Table Requirements
For a table to be marked as a date table, it must have:

- ✅ **Unique date column**: No duplicate dates
- ✅ **Continuous date range**: No gaps in the date sequence
- ✅ **Date data type**: Proper datetime or date column type
- ✅ **Complete coverage**: Spans the full range of fact table dates
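
The first two requirements can be checked before marking the table. This is a minimal sketch over a plain list of `datetime.date` values; the function name `check_date_column` is hypothetical, and in the lab the values would come from the `Date` column of DimDate.

```python
from datetime import date, timedelta


def check_date_column(dates):
    """Return a list of problems that would disqualify a date column."""
    problems = []
    if any(d is None for d in dates):
        problems.append("contains blanks")
    present = sorted(d for d in dates if d is not None)
    unique = set(present)
    if len(unique) != len(present):
        problems.append("contains duplicates")
    if present:
        # A gap-free range has exactly one distinct date per calendar day
        expected_days = (present[-1] - present[0]).days + 1
        if len(unique) != expected_days:
            problems.append("has gaps")
    return problems


start = date(2024, 1, 1)
contiguous = [start + timedelta(days=i) for i in range(5)]
print(check_date_column(contiguous))                       # []
print(check_date_column(contiguous[:2] + contiguous[3:]))  # ['has gaps']
```

An empty list means the column satisfies the uniqueness and continuity requirements; coverage of the fact table's date range still has to be compared separately.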

### Our DimDate Table Structure
The Adventure Works DimDate table includes:

| Column | Purpose | Example |
|--------|---------|---------|
| **Date** | Primary date column | 2023-01-01 |
| DateKey | Integer date key | 20230101 |
| MonthName | Month display name | January |
| MonthNumberOfYear | Month number | 1 |
| DayOfWeek | Weekday name | Sunday |
| DayNumberOfWeek | Weekday number | 1 |

### Technical Implementation
```python
tom.mark_as_date_table(table_name="DimDate", column_name="Date")
```

#### Parameters Explained:
- **`table_name="DimDate"`**: The table containing our date dimension
- **`column_name="Date"`**: The primary date column (must be Date/DateTime type)

### What Happens Behind the Scenes
When you mark a date table, Fabric:

1. **🔍 Validates structure**: Checks for unique, continuous dates
2. **🏷️ Sets metadata**: Marks the table with special date table properties
3. **🔑 Identifies the date column**: Records which column holds the unique dates that time intelligence functions rely on
4. **🔧 Enables functions**: Unlocks DAX time intelligence capabilities

### Time Intelligence Unlocked! 🚀
Once marked, you can create measures like:
```dax
Sales YTD = TOTALYTD([Sum of Sales], DimDate[Date])
Sales Last Year = CALCULATE([Sum of Sales], SAMEPERIODLASTYEAR(DimDate[Date]))
Sales Growth = [Sum of Sales] - [Sales Last Year]
```
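
To make the year-to-date idea concrete, here is a plain-Python analogy of what `TOTALYTD` computes for a single "as of" date: the sum of everything from January 1 of that year through that date. The function name `total_ytd` and the sample figures are illustrative assumptions, not lab data.

```python
from datetime import date

# Illustrative daily totals keyed by date
daily_sales = {
    date(2023, 12, 31): 50.0,
    date(2024, 1, 1): 100.0,
    date(2024, 1, 15): 200.0,
    date(2024, 2, 1): 300.0,
}


def total_ytd(sales, as_of):
    """Sum sales from January 1 of as_of's year through as_of, inclusive."""
    year_start = date(as_of.year, 1, 1)
    return sum(amount for day, amount in sales.items() if year_start <= day <= as_of)


print(total_ytd(daily_sales, date(2024, 1, 31)))  # 300.0
print(total_ytd(daily_sales, date(2024, 2, 1)))   # 600.0
```

The December 2023 sale is excluded from both results because the running total resets at the start of each year.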

### Expected Outcome
- ✅ **DimDate marked** as official date table
- ✅ **Time intelligence enabled** for DAX calculations
- ✅ **Foundation ready** for advanced date-based analysis

🎯 **Time intelligence checkpoint**: Your model now supports sophisticated date calculations!

In [None]:
completedOK:bool=False
while not completedOK:
    try:
        with labs.tom.connect_semantic_model(dataset=SemanticModelName, readonly=False) as tom:
            tom.mark_as_date_table(table_name="DimDate",column_name="Date")
            completedOK=True
    except Exception as e:
        print(f'Error with date table ({e})... trying again.')
        time.sleep(3)

print('done')

## 10. Configure Column Sorting for Better User Experience

### The Sorting Challenge in BI
When users create reports, they expect logical sorting behavior:

- **❌ Default alphabetical**: "April, August, December, February, January..."
- **✅ Business logical**: "January, February, March, April, May..."

Without proper sorting configuration, month names sort alphabetically, which confuses users and makes reports hard to interpret.

### Sort By Column Concept
**Sort by column** allows you to sort one column using the values from another column:

| Display Column | Sort By Column | Why? |
|----------------|----------------|------|
| MonthName | MonthNumberOfYear | Sort "January" by 1, "February" by 2 |
| DayOfWeek | DayNumberOfWeek | Sort "Sunday" by 1, "Monday" by 2 |
| Product Size | SizeOrder | Sort "Small" by 1, "Medium" by 2, "Large" by 3 |
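
The same idea can be sketched in plain Python: sort the display values by looking up a second column. The helper name `sort_by_column` and the in-memory `dim_date` rows are illustrative; they mirror the MonthName and MonthNumberOfYear columns used in this lab.

```python
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Stand-in for DimDate: one row per month with its display name and sort key
dim_date = [{"MonthName": name, "MonthNumberOfYear": number}
            for number, name in enumerate(MONTHS, start=1)]


def sort_by_column(values, rows, display_column, sort_column):
    """Sort display values using the matching value from another column."""
    order = {row[display_column]: row[sort_column] for row in rows}
    return sorted(values, key=lambda value: order[value])


labels = ["April", "August", "December", "February", "January"]
print(sorted(labels))  # alphabetical: April, August, December, February, January
print(sort_by_column(labels, dim_date, "MonthName", "MonthNumberOfYear"))
# chronological: January, February, April, August, December
```

This works only because each display value maps to exactly one sort value, which is also what a sort-by column requires in the model.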

### Our Configuration

#### 1. **Month Sorting**
```python
tom.set_sort_by_column(table_name="DimDate", column_name="MonthName", sort_by_column="MonthNumberOfYear")
```
- **Display**: "January", "February", "March"...
- **Sort by**: 1, 2, 3...
- **Result**: Chronological month order in visuals

#### 2. **Weekday Sorting**
```python  
tom.set_sort_by_column(table_name="DimDate", column_name="DayOfWeek", sort_by_column="DayNumberOfWeek")
```
- **Display**: "Sunday", "Monday", "Tuesday"...
- **Sort by**: 1, 2, 3...
- **Result**: Logical weekday progression

### Technical Implementation Details

#### JSON Output Inspection
The code displays the table's JSON structure to verify:
- ✅ **Sort by relationships** are properly configured
- ✅ **Column metadata** is correctly set
- ✅ **Model structure** matches expectations

#### BIM (Business Intelligence Model) Format
The output shows the table definition in BIM format, which includes:
- Column definitions and data types
- Sort by column relationships  
- Display formatting rules
- Performance optimization settings

### User Experience Impact

#### Before Sort Configuration:
```
Chart shows: Apr, Aug, Dec, Feb, Jan, Jul, Jun...
Users think: "This makes no sense!"
```

#### After Sort Configuration:
```
Chart shows: Jan, Feb, Mar, Apr, May, Jun, Jul...
Users think: "Perfect! This is what I expected."
```

### Expected Outcome
- ✅ **MonthName column** sorts chronologically (Jan → Dec)
- ✅ **DayOfWeek column** sorts logically (Sun → Sat) 
- ✅ **JSON structure** displayed for verification
- ✅ **Enhanced UX** for report consumers

🎯 **User experience checkpoint**: Your model now provides intuitive sorting behavior!

In [None]:
import json
tom = labs.tom.TOMWrapper(dataset=SemanticModelName, workspace=workspaceName, readonly=False)
tom.set_sort_by_column(table_name="DimDate",column_name="MonthName"       ,sort_by_column="MonthNumberOfYear")
tom.set_sort_by_column(table_name="DimDate",column_name="DayOfWeek"       ,sort_by_column="DayNumberOfWeek")
tom.model.SaveChanges()

i:int=0
for t in tom.model.Tables:
    if t.Name=="DimDate":
        bim = json.dumps(tom.get_bim()["model"]["tables"][i],indent=4)
        print(bim)
    i=i+1

## 11. Optimize Model by Hiding Technical Columns

### Why Hide Columns in Fact Tables?
**Fact tables** contain both business-relevant and technical columns. Hiding technical columns improves the user experience by:

- **🎯 Reducing complexity**: Users see only meaningful business columns
- **🚫 Preventing errors**: Technical keys shouldn't be used in reports directly
- **📊 Promoting measures**: Guides users toward proper aggregated values
- **⚡ Improving performance**: Reduces metadata that client tools need to process

### Column Visibility Strategy

#### FactInternetSales Column Analysis:
| Column Type | Examples | Visibility | Reason |
|-------------|----------|------------|---------|
| **Foreign Keys** | CustomerKey, ProductKey | 🔒 Hidden | Use relationships instead |
| **Technical IDs** | OrderDateKey | 🔒 Hidden | Use Date dimension |
| **Raw Values** | SalesAmount, Quantity | 🔒 Hidden | Use measures instead |
| **Calculated Fields** | [Sum of Sales] measure | 👁️ Visible | Proper aggregation |

### The Column Hiding Process

#### 1. **Iterate Through Tables**
```python
for t in tom.model.Tables:
    if t.Name in ["FactInternetSales"]:
```
- Targets specific fact tables (extensible to multiple tables)
- Preserves dimension table columns for filtering and grouping

#### 2. **Hide All Columns**
```python
for c in t.Columns:
    c.IsHidden = True
```
- Sets the `IsHidden` property to `True` for each column
- Columns remain in the model but don't appear in field lists

#### 3. **JSON Verification**
```python
bim = json.dumps(tom.get_bim()["model"]["tables"][i], indent=4)
```
- Displays the table structure for verification
- Shows the `IsHidden` property for each column
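
Reading the full JSON by eye gets tedious for wide tables. The sketch below summarizes visibility from a table definition shaped like the BIM output. It assumes the JSON uses the camelCase key `isHidden` and omits it for visible columns; the sample dictionary is hand-written for illustration and is not real lab output.

```python
import json

# Hand-written sample shaped like one entry of bim["model"]["tables"] (assumed layout)
sample_table = json.loads("""
{
    "name": "FactInternetSales",
    "columns": [
        {"name": "CustomerKey", "dataType": "int64", "isHidden": true},
        {"name": "ProductKey", "dataType": "int64", "isHidden": true},
        {"name": "SalesAmount", "dataType": "double"}
    ]
}
""")


def visible_columns(table):
    """Return the names of columns that are not flagged as hidden."""
    return [column["name"] for column in table.get("columns", [])
            if not column.get("isHidden", False)]


print(visible_columns(sample_table))  # ['SalesAmount']
```

After the hiding step has run, the equivalent check against the real table definition should return an empty list for FactInternetSales.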

### Best Practices for Column Visibility

#### Always Hide in Fact Tables:
- ✅ **Surrogate keys**: CustomerKey, ProductKey, DateKey
- ✅ **Raw numeric values**: Use measures instead
- ✅ **Technical timestamps**: Created dates, modified dates

#### Keep Visible in Dimension Tables:
- ✅ **Descriptive attributes**: Customer names, product categories
- ✅ **Natural keys**: Account numbers, product codes
- ✅ **Date components**: Year, month, quarter (from date table)

### User Experience Impact

#### Before Hiding:
```
Field List shows:
├── FactInternetSales
│   ├── CustomerKey         ← Confusing
│   ├── ProductKey          ← Confusing  
│   ├── OrderDateKey        ← Confusing
│   ├── SalesAmount         ← Misleading
│   └── Quantity            ← Misleading
```

#### After Hiding:
```
Field List shows:
├── FactInternetSales
│   ├── Sum of Sales        ← Clear measure
│   └── Count of Sales      ← Clear measure
├── DimCustomer
│   ├── CustomerName        ← Useful for grouping
│   └── CustomerCity        ← Useful for filtering
```

### Expected Outcome
- ✅ **All FactInternetSales columns** hidden from user interface
- ✅ **Measures remain visible** for proper aggregation
- ✅ **Dimension columns visible** for filtering and grouping
- ✅ **JSON structure** displayed showing `"isHidden": true` for each hidden column

🎯 **Model optimization checkpoint**: Your model now guides users toward correct analysis patterns!

In [None]:
i:int=0
for t in tom.model.Tables:
    if t.Name in ["FactInternetSales"]:
        for c in t.Columns:
            c.IsHidden=True

        bim = json.dumps(tom.get_bim()["model"]["tables"][i],indent=4)
        print(bim)
    i=i+1
    
tom.model.SaveChanges()

## 12. Refresh Semantic Model to Apply Configuration Changes

### Understanding Semantic Model Refresh
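**Refreshing** a Direct Lake semantic model (also called *framing* or *reframing*) points the model at the current version of each Delta table in the lakehouse. The metadata changes from the previous steps were already saved when each TOM connection closed or `SaveChanges()` ran. Because the model was created with `refresh=False`, this refresh is what makes the model ready to query. This process:

- **💾 Updates table references**: Records which Delta table version each model table reads
- **🔄 Synchronizes with lakehouse**: Ensures Direct Lake connections are active
- **⚡ Stays lightweight**: Only metadata is updated; no data is copied into the model
- **✅ Validates configuration**: Surfaces errors if a table or column cannot be resolved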

### Why Refresh is Critical for Direct Lake
Direct Lake models have unique refresh requirements:

#### Framing vs. Data Refresh:
| Refresh Type | Purpose | When Needed | Duration |
|--------------|---------|-------------|----------|
| **Framing (Direct Lake)** | Point the model at the latest Delta table versions | After model creation or new data loads | Typically seconds |
| **Data (Import mode)** | Copy source data into the model | Not needed for Direct Lake | N/A |

#### What Happens During Refresh:
1. **🔍 Validates structure**: Checks that model tables and columns match the lakehouse tables
2. **🔗 Tests lakehouse connections**: Ensures Direct Lake paths are accessible  
3. **📊 Updates metadata**: Records the Delta table versions the model will read
4. **⚡ Defers data loading**: Column data is loaded into memory later, when queries first need it

### Error Handling Strategy

#### Retry Logic Implementation:
```python
reframeOK:bool = False
while not reframeOK:
    try:
        result = labs.refresh_semantic_model(dataset=SemanticModelName)
        reframeOK = True
    except Exception as e:
        print(f'Error with reframe ({e})... trying again.')
        triggerMetadataRefresh()
        time.sleep(3)
```

#### Why Retry Logic is Necessary:
- **🔄 Async operations**: Lakehouse metadata may still be syncing
- **⏱️ Timing dependencies**: Model changes need coordination across services
- **🛡️ Transient issues**: Network or service delays can cause temporary failures
- **🔧 Metadata dependencies**: Table schemas must be fully synchronized

#### Recovery Actions:
1. **Trigger metadata refresh**: Re-sync lakehouse table information
2. **Wait period**: Allow background processes to complete
3. **Retry operation**: Attempt the refresh again

### Expected Outcomes

#### Success Indicators:
- ✅ **"Custom Semantic Model reframe OK"**: Confirms successful refresh
- ✅ **No error messages**: All configurations applied without issues
- ✅ **Model ready**: Available for querying and report creation

#### What Gets Applied:
- ✅ **Table relationships**: Star schema connections active
- ✅ **DAX measures**: Business calculations available
- ✅ **Date table**: Time intelligence functions enabled
- ✅ **Column sorting**: Intuitive ordering in visuals
- ✅ **Column visibility**: Optimized field lists for users

### Fabric Integration Benefits
The refresh process leverages Fabric's integrated architecture:
- **OneLake integration**: Direct access to lakehouse data
- **Unified metastore**: Consistent metadata across services  
- **Intelligent caching**: Optimized for real-time scenarios

🎯 **Model readiness checkpoint**: Your Direct Lake model is now fully configured and ready for business use!

In [None]:
reframeOK:bool=False
while not reframeOK:
    try:
        result = labs.refresh_semantic_model(dataset=SemanticModelName)
        reframeOK=True
    except Exception as e:
        # Catch Exception rather than using a bare except, so Ctrl+C still stops the loop
        print(f'Error with reframe ({e})... trying again.')
        triggerMetadataRefresh()
        time.sleep(3)

print('Custom Semantic Model reframe OK')

## 13. Create Monitoring Functions for Direct Lake Analysis

### Understanding Dynamic Management Views (DMVs)
**DMVs** are special system tables that provide insights into how your Direct Lake model operates. They reveal:

- **🧠 Memory usage**: Which columns are loaded into memory
- **🌡️ Column temperature**: How frequently columns are accessed ("hot" vs "cold")
- **💾 Storage details**: Compression ratios and data types
- **⚡ Performance metrics**: Query patterns and optimization opportunities

### The Storage Columns DMV
Our `runDMV()` function queries `$SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS` to show:

| Column | Purpose | Example Values |
|--------|---------|----------------|
| **TABLE** | Table name | DimCustomer, FactInternetSales |
| **COLUMN** | Column name | CustomerKey, SalesAmount |
| **DATATYPE** | Storage data type | Int64, Double, String |
| **SIZE** | Dictionary size | 1024, 4096 (bytes) |
| **PAGEABLE** | Can be paged to disk | TRUE, FALSE |
| **RESIDENT** | Currently in memory | TRUE, FALSE |
| **TEMPERATURE** | Access frequency score | Numeric; higher means accessed more often |
| **LASTACCESSED** | Last access time | 2024-01-15 14:30:00 |

### Temperature-Based Optimization
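**Column temperature** is a numeric score that rises as a column is queried more often, and it is crucial for Direct Lake performance. The HOT, WARM, and COLD labels below are informal bands, not values returned by the DMV: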

#### 🔥 **HOT Columns**:
- Frequently accessed in queries
- Kept in memory for fast access
- Examples: Date columns, key measures

#### 🌡️ **WARM Columns**:
- Occasionally accessed
- May be paged in/out of memory
- Examples: Customer attributes used in some reports

#### ❄️ **COLD Columns**:
- Rarely or never accessed
- Kept on disk to save memory
- Examples: Technical columns, unused attributes
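
If the temperature arrives as a numeric score (higher meaning more frequent or more recent access), the HOT/WARM/COLD labels used in this lab could be derived with simple thresholds. The sketch below assumes that numeric form; the cut-off values are hypothetical and chosen only for illustration.

```python
from typing import Optional

# Hypothetical cut-offs for illustration only; they are not defined by Direct Lake.
HOT_THRESHOLD = 10.0
WARM_THRESHOLD = 1.0


def temperature_label(temperature: Optional[float]) -> str:
    """Map a numeric temperature score to an informal HOT/WARM/COLD label."""
    if temperature is None or temperature < WARM_THRESHOLD:
        return "COLD"  # never or rarely touched
    if temperature < HOT_THRESHOLD:
        return "WARM"
    return "HOT"


# Made-up scores for three columns from the Adventure Works model
scores = {"SalesAmount": 42.5, "CustomerName": 3.2, "CustomerKey": None}
labels = {column: temperature_label(score) for column, score in scores.items()}
print(labels)
```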

### Monitoring Strategy Benefits

#### Before Query Execution:
- 📊 **Baseline measurement**: See initial column states
- 🧠 **Memory footprint**: Understand current memory usage
- 📈 **Performance baseline**: Establish starting point

#### After Query Execution:
- 🔍 **Query impact analysis**: See which columns became "hot"
- 💾 **Memory changes**: Track new columns loaded
- ⚡ **Optimization insights**: Identify performance patterns
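
The before/after comparison can be sketched in plain Python. This assumes each DMV snapshot is available as a list of row dictionaries keyed by the column aliases used in this lab (for a pandas DataFrame, `df.to_dict("records")` produces that shape); `newly_resident` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def newly_resident(before: List[Row], after: List[Row]) -> List[Tuple[str, str]]:
    """Return (table, column) pairs that were not resident before but are after."""
    was_resident = {
        (str(row["TABLE"]), str(row["COLUMN"])) for row in before if row["RESIDENT"]
    }
    return sorted(
        (str(row["TABLE"]), str(row["COLUMN"]))
        for row in after
        if row["RESIDENT"]
        and (str(row["TABLE"]), str(row["COLUMN"])) not in was_resident
    )


# Illustrative snapshots, not real DMV output
before = [
    {"TABLE": "DimDate", "COLUMN": "MonthName", "RESIDENT": False},
    {"TABLE": "FactInternetSales", "COLUMN": "SalesAmount", "RESIDENT": False},
    {"TABLE": "DimCustomer", "COLUMN": "CustomerKey", "RESIDENT": False},
]
after = [
    {"TABLE": "DimDate", "COLUMN": "MonthName", "RESIDENT": True},
    {"TABLE": "FactInternetSales", "COLUMN": "SalesAmount", "RESIDENT": True},
    {"TABLE": "DimCustomer", "COLUMN": "CustomerKey", "RESIDENT": False},
]
changed = newly_resident(before, after)
print(changed)
```

Using it with the lab's function would require `runDMV()` to return its DataFrame rather than only displaying it.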

### Function Implementation Details

#### Import Requirements:
```python
import warnings  # Handle warning messages
import time      # Timing operations
from Microsoft.AnalysisServices.Tabular import TraceEventArgs  # Tracing events
from typing import Dict, List, Optional, Callable              # Type hints
```

#### DMV Query Structure:
The DMV query uses a SQL-like syntax rather than DAX, but it is still submitted through `evaluate_dax()`. It relies on:
- `$SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS`: Column-level storage information
- `ORDER BY DICTIONARY_TEMPERATURE DESC`: Shows hottest columns first

### Expected Output
The function will display a table showing:
```
TABLE               COLUMN          DATATYPE  SIZE  TEMPERATURE
FactInternetSales   SalesAmount     Double    2048  HOT
DimDate             Date            DateTime  1024  HOT
DimCustomer         CustomerName    String    4096  WARM
...
```

🎯 **Monitoring foundation**: You now have tools to understand and optimize Direct Lake performance!

In [None]:
import warnings
import time
from Microsoft.AnalysisServices.Tabular import TraceEventArgs
from typing import Dict, List, Optional, Callable

def runDMV():
    """Display column-level storage details for the model, hottest columns first."""
    df = sempy.fabric.evaluate_dax(
        dataset=SemanticModelName,
        dax_string="""
        SELECT
            MEASURE_GROUP_NAME       AS [TABLE],
            ATTRIBUTE_NAME           AS [COLUMN],
            DATATYPE,
            DICTIONARY_SIZE          AS SIZE,
            DICTIONARY_ISPAGEABLE    AS PAGEABLE,
            DICTIONARY_ISRESIDENT    AS RESIDENT,
            DICTIONARY_TEMPERATURE   AS TEMPERATURE,
            DICTIONARY_LAST_ACCESSED AS LASTACCESSED
        FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS
        ORDER BY
            [DICTIONARY_TEMPERATURE] DESC
        """)
    display(df)

## 14. Explore Direct Lake Capabilities with DAX Functions

### Understanding TABLETRAITS()
**TABLETRAITS()** is a special DAX function that reveals the internal characteristics of your Direct Lake tables. It provides insights into:

- **📊 Storage mode**: Confirms Direct Lake configuration
- **🔗 Data source**: Shows lakehouse connection details  
- **📈 Table properties**: Size, partitioning, compression
- **⚡ Performance hints**: Optimization opportunities

### What TABLETRAITS() Reveals

#### Key Information Returned:
| Property | Description | Example Values |
|----------|-------------|----------------|
| **Table Name** | Name of the table | DimCustomer, FactInternetSales |
| **Storage Mode** | How data is stored | DirectLake, Import, DirectQuery |
| **Data Source** | Source location | OneLake path, SQL connection |
| **Partition Count** | Number of partitions | 1, 4, 12 |
| **Row Count** | Estimated rows | 18,484 (customers), 60,398 (sales) |
| **Size (MB)** | Storage footprint | 2.1 MB, 15.7 MB |

### Direct Lake Guardrails
The second query retrieves the **Direct Lake guardrails**: the per-capacity (Fabric SKU) limits a model must stay within to keep running in Direct Lake mode.

#### Common Guardrails Include:
- **📏 Parquet files per table**: Maximum number of Parquet files behind a single Delta table
- **📦 Row groups per table**: Maximum number of row groups across those files
- **📊 Rows per table**: Maximum row count for a single table
- **💾 Model size on disk/OneLake**: Maximum total size of the model's data
- **🧠 Max memory**: Memory available for loading columns

### Why These Queries Matter

#### Model Validation:
```dax
EVALUATE TABLETRAITS()
```
Confirms that your model is properly configured as Direct Lake and shows the connection to your lakehouse.

#### Performance Planning:
```python
labs.directlake.get_direct_lake_guardrails()
```
Shows the limits you need to stay within for optimal performance.
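
Once the limits are known, a table's statistics can be compared against them. The sketch below is a minimal illustration: the metric names and limit values are hypothetical placeholders, not values returned by `get_direct_lake_guardrails()`.

```python
from typing import Dict, List


def guardrail_violations(stats: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """List every metric in stats that exceeds its limit."""
    return [
        f"{name}: {value} exceeds limit {limits[name]}"
        for name, value in stats.items()
        if name in limits and value > limits[name]
    ]


# Hypothetical limits and table statistics, for illustration only
limits = {"rows_per_table": 300_000_000, "parquet_files_per_table": 1_000}
stats = {"rows_per_table": 60_398, "parquet_files_per_table": 1_250}
violations = guardrail_violations(stats, limits)
print(violations)
```

An empty list means every checked metric is within its limit.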

### Expected Output Examples

#### TABLETRAITS() Sample Results:
```
TableName           StorageMode   DataSource                    RowCount
DimCustomer         DirectLake    OneLake://workspace/lake...   18,484
DimDate             DirectLake    OneLake://workspace/lake...   2,556  
DimProduct          DirectLake    OneLake://workspace/lake...   606
FactInternetSales   DirectLake    OneLake://workspace/lake...   60,398
```

#### Guardrails Sample Results (illustrative layout and values):
```
Guardrail                    Current Value    Limit        Status
Max File Size               145 MB           1 GB         ✅ OK
Max Rows Per Table          60,398           100M         ✅ OK  
Available Memory            2.1 GB           8 GB         ✅ OK
Max Column Cardinality      18,484           1.6M         ✅ OK
```

### Troubleshooting with TABLETRAITS()
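If a table shows a storage mode other than **DirectLake**:
- ❌ **Not running in Direct Lake**: The table is not being served from Delta files; note that when Direct Lake cannot serve a query it falls back to DirectQuery mode, not Import mode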
- 🔍 **Check guardrails**: Verify limits aren't exceeded
- 🔧 **Review configuration**: Ensure proper lakehouse connections

🎯 **Model verification checkpoint**: Confirm your Direct Lake configuration is working correctly!

In [None]:
df=sempy.fabric.evaluate_dax(
    dataset=SemanticModelName, 
    dax_string="""
    
    evaluate tabletraits()
    
    """)
display(df)

In [None]:
df=labs.directlake.get_direct_lake_guardrails()
display(df)

## 15. Establish Performance Baseline with DMV Analysis

### The Importance of Baseline Measurement
Before executing any business queries, it's crucial to establish a **performance baseline**. This initial DMV run shows:

- **🧠 Initial memory state**: Which columns are already loaded
- **📊 Starting temperatures**: Current "hot", "warm", and "cold" column states
- **💾 Memory footprint**: Baseline memory usage before query execution
- **🎯 Optimization opportunities**: Identify columns that might need attention

### What You'll Observe in the Baseline

#### Expected Initial State:
Most columns should show:
- **❄️ TEMPERATURE**: "COLD" (not recently accessed)
- **🚫 RESIDENT**: "FALSE" (not currently in memory)  
- **📅 LASTACCESSED**: Older timestamps or null values
- **📏 SIZE**: Actual dictionary sizes for each column

#### Key Columns to Watch:
| Table | Column | Expected State | Why |
|-------|--------|----------------|-----|
| **FactInternetSales** | SalesAmount | COLD | Main measure column |
| **DimDate** | Date | COLD | Primary date column |
| **DimCustomer** | CustomerKey | COLD | Relationship key |
| **DimProduct** | ProductKey | COLD | Relationship key |

### DMV Column Analysis

#### Understanding the Output:
```
TABLE               COLUMN          DATATYPE  SIZE   PAGEABLE  RESIDENT  TEMPERATURE
FactInternetSales   OrderDateKey    Int64     246    TRUE      FALSE     COLD
FactInternetSales   CustomerKey     Int64     7244   TRUE      FALSE     COLD  
FactInternetSales   ProductKey      Int64     1064   TRUE      FALSE     COLD
FactInternetSales   SalesAmount     Double    8192   TRUE      FALSE     COLD
DimCustomer         CustomerKey     Int64     7244   TRUE      FALSE     COLD
DimDate             DateKey         Int64     1024   TRUE      FALSE     COLD
```

### Storage Insights from DMV

#### Data Type Optimization:
- **Int64**: Efficient for keys and identifiers
- **Double**: Floating-point storage for numeric values such as SalesAmount (a fixed decimal type is more exact for currency)
- **String**: Variable size for text fields

#### Memory Management:
- **PAGEABLE=TRUE**: Column can be moved between memory and disk
- **RESIDENT=FALSE**: Currently stored on disk, not in memory
- **SIZE**: Dictionary compression size (smaller = better compression)
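
These columns can be combined into a quick per-table summary. The sketch below sums the dictionary size of resident columns only, assuming DMV rows are available as dictionaries keyed by the aliases used in this lab; `resident_size_by_table` is a hypothetical helper, and the sizes are taken from the sample output above.

```python
from collections import defaultdict
from typing import Dict, List


def resident_size_by_table(rows: List[Dict[str, object]]) -> Dict[str, int]:
    """Sum SIZE per table, counting only columns currently resident in memory."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        if row["RESIDENT"]:
            totals[str(row["TABLE"])] += int(row["SIZE"])
    return dict(totals)


# Illustrative rows; RESIDENT flags are made up for the example
rows = [
    {"TABLE": "FactInternetSales", "COLUMN": "SalesAmount", "SIZE": 8192, "RESIDENT": True},
    {"TABLE": "FactInternetSales", "COLUMN": "OrderDateKey", "SIZE": 246, "RESIDENT": True},
    {"TABLE": "DimCustomer", "COLUMN": "CustomerKey", "SIZE": 7244, "RESIDENT": False},
]
summary = resident_size_by_table(rows)
print(summary)
```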

### Baseline Benefits for Learning

#### Performance Comparison:
1. **📊 Before query**: All columns COLD and not resident
2. **🔥 After query**: Used columns become HOT and resident
3. **📈 Delta analysis**: See exact impact of specific queries

#### Memory Usage Tracking:
- Compare memory usage before and after queries
- Understand which columns consume the most memory
- Identify optimization opportunities

### Expected Outcome
You'll see a comprehensive table showing:
- ✅ **All model columns** with their current storage states
- ✅ **Temperature baseline** (mostly COLD initially)
- ✅ **Memory footprint** before any business queries
- ✅ **Performance foundation** for comparison analysis

🎯 **Performance baseline established**: Ready to analyze query impact on Direct Lake behavior!

In [None]:
runDMV()

## 16. Execute Business Query and Analyze Direct Lake Performance

### The Complete Performance Analysis Workflow
This final section demonstrates the full Direct Lake performance analysis cycle:

1. **🧹 Clear cache**: Start with clean memory state
2. **📊 Execute business query**: Run meaningful DAX analysis  
3. **🔍 Analyze impact**: See how query execution affects column states

### Cache Clearing Strategy
```python
labs.clear_cache(SemanticModelName)
```

#### Why Clear Cache First?
- **🧠 Clean memory state**: Removes any previously loaded columns
- **📊 Accurate measurement**: Ensures we see true query impact
- **🔄 Consistent testing**: Provides repeatable performance analysis
- **⚡ Real-world simulation**: Mimics first-time query execution
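
To put numbers on the first-run versus repeat-run difference, a query can be wrapped in a small timer. This is a sketch with a stand-in workload; `timed` is a hypothetical helper, and in the notebook the lambda would wrap the `evaluate_dax` call instead.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(action: Callable[[], T]) -> Tuple[T, float]:
    """Run action() and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


# Stand-in workload; replace with the DAX query call to time a real run
result, seconds = timed(lambda: sum(range(1_000)))
print(f"result={result}, elapsed={seconds:.6f}s")
```

Timing the same query once right after clearing the cache and once more immediately afterwards gives a rough cold-versus-warm comparison.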

### Business Query Analysis

#### DAX Query Breakdown:
```dax
EVALUATE
SUMMARIZECOLUMNS(
    DimDate[MonthName],
    "Count of Transactions", COUNTROWS(FactInternetSales),
    "Sum of Sales", [Sum of Sales]
)
ORDER BY [MonthName]
```

#### Query Components Explained:

##### 📅 **SUMMARIZECOLUMNS()**:
- **Purpose**: Creates cross-table aggregations efficiently
- **Performance**: Optimized for Direct Lake scenarios
- **Flexibility**: Handles multiple measures and dimensions

##### 🗓️ **DimDate[MonthName]**:
- **Role**: Grouping dimension (shows data by month)
- **Sort behavior**: The sort-by-column setting (MonthNumberOfYear) controls ordering in report visuals; `ORDER BY` in a DAX query sorts by the column's own values
- **Expected impact**: Will make DimDate columns "HOT"

##### 📊 **"Count of Transactions"**: 
- **Formula**: `COUNTROWS(FactInternetSales)`
- **Purpose**: Shows transaction volume per month
- **Expected impact**: Will touch FactInternetSales, loading only the columns the query needs rather than the whole table

##### 💰 **"Sum of Sales"**:
- **Formula**: Our custom `[Sum of Sales]` measure
- **Purpose**: Shows revenue by month  
- **Expected impact**: Will make SalesAmount column "HOT"

##### 📈 **ORDER BY [MonthName]**:
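- **Behavior**: Sorts the result by the `MonthName` text values
- **Result**: Alphabetical month order (April, August, December...), because `ORDER BY` in a DAX query does not apply the model's sort-by-column setting
- **Chronological order**: Add `DimDate[MonthNumberOfYear]` to the query and order by it instead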
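
The difference between sorting month names as text and sorting them by a companion number column (the idea behind sort-by-column) can be seen in plain Python. This is only an analogy for the ordering logic, not DAX.

```python
# Month names paired with their calendar position, like MonthName / MonthNumberOfYear
months = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
month_number = {name: number for number, name in enumerate(months, start=1)}

# Sorting the names as text gives alphabetical order
alphabetical = sorted(months)

# Sorting by the companion number restores calendar order
chronological = sorted(alphabetical, key=month_number.get)

print(alphabetical[:3])   # ['April', 'August', 'December']
print(chronological[:3])  # ['January', 'February', 'March']
```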

### Expected Business Results
The query should return one row per month, shaped like this (illustrative values; `ORDER BY [MonthName]` sorts the names as text):
```
MonthName    Count of Transactions    Sum of Sales
April        1,234                   $456,789.12
August       1,567                   $567,890.23
December     1,890                   $678,901.34
...
```

### Performance Impact Analysis

#### After Query Execution - Expected Changes:
| Table | Column | Before | After | Why |
|-------|--------|--------|-------|-----|
| **DimDate** | MonthName | COLD | HOT | Used for grouping |
| **DimDate** | MonthNumberOfYear | COLD | HOT | Sort-by column for MonthName (may be loaded alongside it) |
| **FactInternetSales** | SalesAmount | COLD | HOT | Used in Sum measure |
| **FactInternetSales** | OrderDateKey | COLD | HOT | Used for relationship |

#### Memory Usage Patterns:
- **🔥 HOT columns**: Recently accessed, kept in memory
- **📈 RESIDENT=TRUE**: Columns now loaded in memory
- **⏰ LASTACCESSED**: Updated to current timestamp
- **💾 Memory increase**: Overall model memory footprint grows

### Learning Objectives Achieved

#### Direct Lake Behavior Understanding:
- ✅ **Real-time data access**: No import delay, immediate results
- ✅ **Intelligent caching**: Only needed columns loaded into memory
- ✅ **Performance optimization**: Subsequent queries using same columns will be faster
- ✅ **Resource efficiency**: Unused columns remain on disk

#### Performance Monitoring Mastery:
- ✅ **Before/after analysis**: Clear view of query impact
- ✅ **Memory optimization**: Understanding of column temperature
- ✅ **Cache behavior**: How Direct Lake manages memory
- ✅ **Query planning**: Insights for future optimization

🎯 **Workshop completion**: You've successfully created, configured, and analyzed a production-ready Direct Lake semantic model!

In [None]:
labs.clear_cache(SemanticModelName)

df=sempy.fabric.evaluate_dax(
    dataset=SemanticModelName, 
    dax_string="""
    
    EVALUATE
        SUMMARIZECOLUMNS(
               
                DimDate[MonthName] ,
                "Count of Transactions" , COUNTROWS(FactInternetSales) ,
                "Sum of Sales" , [Sum of Sales] 
        )
        ORDER BY [MonthName]
    """)
display(df)

runDMV()

## 17. Clean Up Resources and Session Conclusion

### Workshop Summary 🎉
Congratulations! You have successfully completed Lab 1 and built a comprehensive Direct Lake semantic model. Here's what you accomplished:

#### ✅ **Infrastructure Setup**
- Created a lakehouse with proper configuration
- Loaded Adventure Works sample data (4 tables, 80K+ rows)
- Configured metadata synchronization

#### ✅ **Model Development**  
- Built a Direct Lake semantic model from lakehouse tables
- Established star schema relationships (3 relationships)
- Created business measures with proper DAX and formatting

#### ✅ **User Experience Optimization**
- Configured date table for time intelligence
- Set logical column sorting for better visuals
- Optimized column visibility for end users

#### ✅ **Performance Analysis**
- Implemented DMV monitoring for performance insights
- Analyzed query execution impact on memory usage  
- Established baseline and post-query performance comparison

### Key Direct Lake Concepts Learned

#### 🔄 **Real-time Analytics**
Your model provides immediate access to lakehouse data without import delays or scheduled refreshes.

#### ⚡ **Intelligent Memory Management**
Direct Lake automatically loads only the columns needed for your queries, optimizing both performance and resource usage.

#### 📊 **Enterprise-Ready Design**
The star schema design with proper relationships, measures, and formatting provides a foundation for scalable business intelligence.

### Next Steps in Your Direct Lake Journey

#### 🚀 **Immediate Actions**:
- Explore the model in Power BI Desktop or Fabric
- Create reports using the measures and relationships you built
- Experiment with different DAX queries to see performance patterns

#### 📈 **Advanced Learning**:
- **Lab 2**: Scale to larger datasets and understand big data scenarios
- **Lab 3**: Analyze Delta table structure and optimization
- **Lab 4**: Explore fallback behaviors and troubleshooting

#### 🛠️ **Production Considerations**:
- Security and access control for lakehouse data
- Monitoring and alerting for model performance
- Governance and lifecycle management

### Resource Cleanup Importance
The following command stops the Spark session to:
- **💰 Save costs**: Release compute resources
- **🧹 Clean memory**: Free up cluster resources for other users
- **✅ Best practice**: Proper session management in Fabric notebooks

### Final Thoughts
Direct Lake represents a paradigm shift in analytics, providing the **real-time capabilities of DirectQuery** with the **performance benefits of Import mode**. You now have hands-on experience with this powerful technology!

🎯 **Ready for the next lab?** Let's explore Direct Lake with big data scenarios!

In [None]:
# notebookutils is the current name for mssparkutils and matches the earlier cells
notebookutils.session.stop()