# Lab 6: Direct Lake Performance - Column Partitioning

## Introduction

This lab focuses on **column partitioning** in Microsoft Fabric Direct Lake models. Partitioning organizes the stored data so that queries read less of it, which reduces I/O and can improve compression. The lab measures the effect by running the same DAX queries against a partitioned and a non-partitioned version of a large fact table.
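
The I/O saving can be illustrated with a toy model. The sketch below is only an analogy in plain Python, not how Direct Lake or Delta stores data: the rows, the `DateKey` values, and every function name are hypothetical. It shows that a filtered sum over partitioned data returns the same answer while reading fewer rows.

```python
from collections import defaultdict

def build_partitions(rows, key):
    """Group rows into partitions keyed by the value of `key`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return partitions

def sum_full_scan(rows, key, wanted, value):
    """Read every row and keep only those matching the filter."""
    total = sum(r[value] for r in rows if r[key] == wanted)
    return total, len(rows)

def sum_pruned(partitions, wanted, value):
    """Read only the partition that can contain matching rows."""
    part = partitions.get(wanted, [])
    return sum(r[value] for r in part), len(part)

# Hypothetical toy data: three date keys, four rows each
rows = [{"DateKey": d, "Quantity": q}
        for d in (20190101, 20190102, 20190103)
        for q in (1, 2, 3, 4)]
partitions = build_partitions(rows, "DateKey")

full_total, full_scanned = sum_full_scan(rows, "DateKey", 20190102, "Quantity")
pruned_total, pruned_scanned = sum_pruned(partitions, 20190102, "Quantity")
print(full_total, full_scanned)      # total and rows read without partitioning
print(pruned_total, pruned_scanned)  # same total, fewer rows read
```

The benefit depends on the query filtering on the partitioning column; a query that touches every date reads every partition.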

## Lab Overview

**Learning Objectives:**
- Understand column partitioning strategies for Direct Lake performance
- Compare query performance between partitioned and non-partitioned tables
- Analyze cold vs. warm cache performance characteristics
- Master advanced DAX query optimization with server timings

**Key Concepts:**
- **Column Partitioning**: Strategic data organization for optimal query performance
- **Cache Performance**: Understanding cold vs. warm cache behavior
- **Query Optimization**: Advanced DAX performance tuning techniques
- **Server Timings**: Performance analysis and bottleneck identification

**Prerequisites:** Lab 2 completion (Big Data)

**Performance Focus Areas:**
- **Cold Cache Performance**: Initial query execution optimization
- **Warm Cache Performance**: Repeated query execution efficiency  
- **Column Organization**: Strategic partitioning for optimal data access
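
A minimal sketch of how cold and warm runs can be timed separately is shown here. The `fake_query` and `fake_clear_cache` functions are hypothetical stand-ins that simulate a cache with a dictionary; they are not part of Fabric or Semantic Link.

```python
import time

def time_cold_and_warm(run_query, clear_cache, warm_runs=3):
    """Return (cold_seconds, [warm_seconds, ...]) for a callable query."""
    clear_cache()                      # start from an empty cache
    start = time.perf_counter()
    run_query()
    cold = time.perf_counter() - start

    warm = []
    for _ in range(warm_runs):         # the cache stays populated between runs
        start = time.perf_counter()
        run_query()
        warm.append(time.perf_counter() - start)
    return cold, warm

# Hypothetical stand-ins: a dict plays the role of the cache
_cache = {}

def fake_clear_cache():
    _cache.clear()

def fake_query():
    if "result" not in _cache:
        time.sleep(0.05)               # simulate loading data on a cache miss
        _cache["result"] = 42
    return _cache["result"]

cold, warm = time_cold_and_warm(fake_query, fake_clear_cache)
print(f"cold: {cold:.3f}s, warm avg: {sum(warm) / len(warm):.3f}s")
```

In this lab, `run_query` could wrap a call to `runQueryWithTrace` with `ClearCache=False`, and `clear_cache` could wrap `labs.clear_cache(SemanticModelName)`.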

## 1. Install Semantic Link Labs Python Library
Install the Semantic Link Labs library for advanced performance analysis and column partitioning optimization.

In [None]:
# Install semantic-link-labs library for performance optimization tools
%pip install -q --upgrade pip
%pip install -q semantic-link-labs azure-core==1.31.0 PyJWT==2.6.0

## 2. Load Python Libraries
Import required libraries and configure BigData lakehouse for column partitioning performance analysis.

In [None]:
# Import libraries for performance analysis and column partitioning optimization
import sempy_labs as labs
from sempy import fabric
import sempy
import pandas
import time
import warnings

# Configure BigData lakehouse for column partitioning experiments
LakehouseName = "BigData"
lakehouses = labs.list_lakehouses()["Lakehouse Name"]
for name in lakehouses:
    if name.startswith("Big"):
        LakehouseName = name

# Set up semantic model for performance testing
SemanticModelName = f"{LakehouseName}_model"

## 3. Setup Parameters
Configure workspace and lakehouse identifiers required for column partitioning performance tests.

In [None]:
# Validate that the BigData lakehouse exists (created in Lab 2)
lakehouses = labs.list_lakehouses()["Lakehouse Name"]
if LakehouseName not in lakehouses.values:
    # Stop here: the cells below cannot run without the lakehouse
    raise RuntimeError("You need to complete Lab 2 to create the required lakehouse for this lab")

# Look up the lakehouse once and reuse its properties
lakehouseProperties = notebookutils.lakehouse.getWithProperties(LakehouseName)
lakehouseId = lakehouseProperties["id"]

# Configure workspace parameters for performance testing
workspaceId = lakehouseProperties["workspaceId"]
workspaceName = sempy.fabric.resolve_workspace_name(workspaceId)
print(f"WorkspaceId = {workspaceId}, LakehouseID = {lakehouseId}, Workspace Name = {workspaceName}")

## 4. Create Function to Run DAX Query with Server Timings Trace
Build a specialized function to execute DAX queries while capturing detailed server timing metrics for performance analysis.

In [None]:
from Microsoft.AnalysisServices.Tabular import TraceEventArgs
from typing import Dict, List, Optional, Callable
from sempy.fabric import FabricDataFrame

def runDMV():
    df = sempy.fabric.evaluate_dax(
        dataset=SemanticModelName, 
        dax_string="""
        
        SELECT 
            MEASURE_GROUP_NAME AS [TABLE],
            ATTRIBUTE_NAME AS [COLUMN],
            DATATYPE ,
            DICTIONARY_SIZE 		    AS SIZE ,
            DICTIONARY_ISPAGEABLE 		AS PAGEABLE ,
            DICTIONARY_ISRESIDENT		AS RESIDENT ,
            DICTIONARY_TEMPERATURE		AS TEMPERATURE,
            DICTIONARY_LAST_ACCESSED	AS LASTACCESSED 
        FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS 
        ORDER BY 
            [DICTIONARY_TEMPERATURE] DESC
        
        """)
    display(df)

def filter_func(e):
    # Drop internal VertiPaq scan events; keep every other trace event
    return e.EventSubclass.ToString() != "VertiPaqScanInternal"

def runQueryWithTrace(expr:str, workspaceName:str, SemanticModelName:str, Result:bool=True, Trace:bool=True, DMV:bool=True, ClearCache:bool=True) -> pandas.DataFrame:
    # Define the events to trace and their corresponding columns
    event_schema = fabric.Trace.get_default_query_trace_schema()
    event_schema.update({"ExecutionMetrics":["EventClass","TextData"]})
    del event_schema['VertiPaqSEQueryBegin']
    del event_schema['VertiPaqSEQueryCacheMatch']
    del event_schema['DirectQueryBegin']

    warnings.filterwarnings("ignore")

    if ClearCache:
        labs.clear_cache(SemanticModelName)

    with fabric.create_trace_connection(SemanticModelName,workspaceName) as trace_connection:
        # create trace on server with specified events
        with trace_connection.create_trace(
            event_schema=event_schema, 
            name="Simple Query Trace",
            filter_predicate=filter_func,
            stop_event="QueryEnd"
            ) as trace:

            trace.start()

            df:FabricDataFrame=sempy.fabric.evaluate_dax(
                dataset=SemanticModelName, 
                dax_string=expr)

            if Result:
                displayHTML(f"<H2>####### DAX QUERY RESULT #######</H2>")
                display(df)

            # Wait 5 seconds for trace data to arrive
            time.sleep(5)

            # stop Trace and collect logs
            final_trace_logs:pandas.DataFrame = trace.stop()

    if Trace:
        displayHTML(f"<H2>####### SERVER TIMINGS #######</H2>")
        display(final_trace_logs)
    
    if DMV:
        displayHTML(f"<H2>####### SHOW DMV RESULTS #######</H2>")
        runDMV()

    return final_trace_logs
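
Once a trace has been captured, the total duration can be split into storage-engine and formula-engine time. The sketch below is one way to do that with pandas. The column names `Event Class` and `Duration` and the sample rows are assumptions for illustration; check them against the DataFrame your own trace returns.

```python
import pandas as pd

def split_engine_time(trace, event_col="Event Class", duration_col="Duration"):
    """Split total query duration into storage-engine and formula-engine time.

    Column and event names are assumptions; verify them against your trace.
    """
    durations = pd.to_numeric(trace[duration_col], errors="coerce").fillna(0)
    total_ms = durations[trace[event_col] == "QueryEnd"].sum()
    se_ms = durations[trace[event_col] == "VertiPaqSEQueryEnd"].sum()
    return {
        "total_ms": float(total_ms),
        "storage_engine_ms": float(se_ms),
        # Approximation: storage-engine scans can overlap, so clamp at zero
        "formula_engine_ms": float(max(total_ms - se_ms, 0)),
    }

# Hypothetical trace rows for illustration only
sample = pd.DataFrame({
    "Event Class": ["QueryBegin", "VertiPaqSEQueryEnd", "VertiPaqSEQueryEnd", "QueryEnd"],
    "Duration": [None, 120, 30, 200],
})
timings = split_engine_time(sample)
print(timings)
```

Because `filter_func` already drops `VertiPaqScanInternal` events, the storage-engine durations are not counted twice.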


## 5. Reframe Custom Semantic Model
Refresh (reframe) the semantic model so that it picks up the latest Delta table data before performance testing.

In [None]:
# Refresh the semantic model to ensure all data changes are synchronized
labs.refresh_semantic_model(SemanticModelName)

## 6. Run VertiPaq Analyzer on Semantic Model
Generate detailed column storage statistics using VertiPaq Analyzer to understand data distribution and compression.
Note that the analyzer reports on the data loaded into the semantic model, not on the underlying Delta table.

In [None]:
analyzer:dict[str,pandas.DataFrame] = labs.vertipaq_analyzer(dataset=SemanticModelName)

for key, value in analyzer.items():
    print(key)
    display(value)
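
The analyzer output can also be ranked to find the columns that occupy the most memory. This is a sketch under assumptions: `Column Name` and `Is Resident` appear in this lab, but `Table Name` and `Total Size` are assumed column names, and the sample rows (including the `Notes` column) are made up for illustration.

```python
import pandas as pd

def largest_resident_columns(columns, top_n=5, size_col="Total Size"):
    """Return the top_n resident columns ordered by size, largest first."""
    resident = columns[columns["Is Resident"].astype(bool)]
    return resident.sort_values(size_col, ascending=False).head(top_n)

# Hypothetical stand-in for analyzer["Columns"]
sample_columns = pd.DataFrame({
    "Table Name": ["fact_myevents_1bln", "fact_myevents_1bln",
                   "fact_myevents_1bln", "dim_Date"],
    "Column Name": ["DateKey", "Quantity_ThisYear", "Notes", "DateKey"],
    "Is Resident": [True, True, False, True],
    "Total Size": [4000, 9000, 50000, 100],
})

top = largest_resident_columns(sample_columns, top_n=2)
print(top[["Table Name", "Column Name", "Total Size"]])
```

Non-resident columns are excluded because they do not currently consume memory, however large they would be once loaded.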

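## 7. Check Column Residency
Filter the analyzer output to the `DateKey` and `Quantity_ThisYear` columns that are currently resident in memory.

In [None]: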
display(analyzer["Columns"].query("`Column Name`=='DateKey' & `Is Resident`==True"))
display(analyzer["Columns"].query("`Column Name`=='Quantity_ThisYear' & `Is Resident`==True"))

## 8. Run Some DAX Queries

### 8.1 Period Comparison

#### Run Period Comparison against **base** table

In [None]:
expr:str = """

    DEFINE

        MEASURE dim_Date[Sum of Quantity] = 
            SUM(fact_myevents_1bln[Quantity_ThisYear])
            
        MEASURE dim_Date[Sum of Quantity PM] =
            CALCULATE([Sum of Quantity],PREVIOUSMONTH(dim_Date[DateKey]))

        MEASURE dim_Date[Sum of Quantity PM Delta] =
            [Sum of Quantity] - [Sum of Quantity PM]
        
        MEASURE dim_Date[Sum of Quantity PM %] =
            [Sum of Quantity PM Delta] / [Sum of Quantity]
        
    EVALUATE
        SUMMARIZECOLUMNS(
            -- GROUP BY --
            dim_Date[FirstDateofMonth] ,
            --  FILTER  --
            TREATAS({DATE(2019,1,1)} , dim_Date[FirstDateofYear] ) ,
             -- MEASURES --
            "Quantity" 				, [Sum of Quantity],
            "Quantity PM" 			, [Sum of Quantity PM],
            "Quantity PM Delta"		, [Sum of Quantity PM Delta] ,
            "Quantity PM %" 		, [Sum of Quantity PM %]
            )

"""

# Run the query twice; the second call overwrites trace1, so only the second run's trace is kept
trace1 = runQueryWithTrace(expr,workspaceName,SemanticModelName,Result=False,DMV=False,Trace=False)
trace1 = runQueryWithTrace(expr,workspaceName,SemanticModelName,Result=False,DMV=False,Trace=False)

#### Run Period Comparison against **Partitioned** table

In [None]:
expr:str = """

    DEFINE

        MEASURE dim_Date[Sum of Quantity] = 
            SUM(fact_myevents_1bln_partitioned_datekey[Quantity_ThisYear])
            
        MEASURE dim_Date[Sum of Quantity PM] =
            CALCULATE([Sum of Quantity],PREVIOUSMONTH(dim_Date[DateKey]))

        MEASURE dim_Date[Sum of Quantity PM Delta] =
            [Sum of Quantity] - [Sum of Quantity PM]
        
        MEASURE dim_Date[Sum of Quantity PM %] =
            [Sum of Quantity PM Delta] / [Sum of Quantity]
        
    EVALUATE
        SUMMARIZECOLUMNS(
            -- GROUP BY --
            dim_Date[FirstDateofMonth] ,
            --  FILTER  --
            TREATAS({DATE(2019,1,1)} , dim_Date[FirstDateofYear] ) ,
             -- MEASURES --
            "Quantity" 				, [Sum of Quantity],
            "Quantity PM" 			, [Sum of Quantity PM],
            "Quantity PM Delta"		, [Sum of Quantity PM Delta] ,
            "Quantity PM %" 		, [Sum of Quantity PM %]
            )

"""

# Run the query twice; the second call overwrites trace2, so only the second run's trace is kept
trace2 = runQueryWithTrace(expr,workspaceName,SemanticModelName,Result=False,DMV=False,Trace=False)
trace2 = runQueryWithTrace(expr,workspaceName,SemanticModelName,Result=False,DMV=False,Trace=False)

In [None]:
display(trace1)
display(trace2)
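
To turn two traces into a single comparison, the `QueryEnd` duration of each can be extracted and divided. The following is a minimal sketch; the column names `Event Class` and `Duration` and the sample durations are assumptions, not measured results from this lab.

```python
import pandas as pd

def query_duration_ms(trace, event_col="Event Class", duration_col="Duration"):
    """Return the duration recorded on the last QueryEnd event of a trace."""
    end_rows = trace[trace[event_col] == "QueryEnd"]
    if end_rows.empty:
        raise ValueError("trace contains no QueryEnd event")
    return float(end_rows[duration_col].iloc[-1])

def compare_runs(base_trace, partitioned_trace):
    """Compare total duration of the base and partitioned runs."""
    base_ms = query_duration_ms(base_trace)
    part_ms = query_duration_ms(partitioned_trace)
    return {
        "base_ms": base_ms,
        "partitioned_ms": part_ms,
        # A speedup above 1 means the partitioned table was faster
        "speedup": base_ms / part_ms if part_ms else float("inf"),
    }

# Hypothetical traces with made-up durations
base = pd.DataFrame({"Event Class": ["QueryBegin", "QueryEnd"], "Duration": [0, 800]})
partitioned = pd.DataFrame({"Event Class": ["QueryBegin", "QueryEnd"], "Duration": [0, 200]})

comparison = compare_runs(base, partitioned)
print(comparison)
```

In the notebook, `compare_runs(trace1, trace2)` would apply the same calculation to the traces captured above, provided the column names match.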

### 8.2 Running Total

#### Run Running Total against **Base** Table

In [None]:
expr:str = """

    DEFINE

        MEASURE dim_Date[Sum of Quantity] = 
            SUM(fact_myevents_1bln[Quantity_ThisYear])
            
        MEASURE dim_Date[Sum of Quantity YTD] =
            TOTALYTD([Sum of Quantity],dim_Date[DateKey])

        MEASURE dim_Date[Sum of Quantity QTD] =
            TOTALQTD([Sum of Quantity],dim_Date[DateKey])

    EVALUATE
        SUMMARIZECOLUMNS(
            -- GROUP BY --
            dim_Date[FirstDateofMonth] ,
            --  FILTER  --
            TREATAS({DATE(2019,1,1)} , dim_Date[FirstDateofYear] ) ,
             -- MEASURES --
            "Quantity" 		, [Sum of Quantity],
            "Quantity YTD" 	, [Sum of Quantity YTD] ,
            "Quantity QTD" 	, [Sum of Quantity QTD]
            )

"""
trace3 = runQueryWithTrace(expr,workspaceName,SemanticModelName,Result=False,DMV=False)

#### Run Running Total against **Partitioned** Table

In [None]:
expr:str="""

    DEFINE

        MEASURE dim_Date[Sum of Quantity] = 
            SUM(fact_myevents_1bln_partitioned_datekey[Quantity_ThisYear])
            
        MEASURE dim_Date[Sum of Quantity YTD] =
            TOTALYTD([Sum of Quantity],dim_Date[DateKey])

        MEASURE dim_Date[Sum of Quantity QTD] =
            TOTALQTD([Sum of Quantity],dim_Date[DateKey])

    EVALUATE
        SUMMARIZECOLUMNS(
            -- GROUP BY --
            dim_Date[FirstDateofMonth] ,
            --  FILTER  --
            TREATAS({DATE(2019,1,1)} , dim_Date[FirstDateofYear] ) ,
             -- MEASURES --
            "Quantity" 		, [Sum of Quantity],
            "Quantity YTD" 	, [Sum of Quantity YTD] ,
            "Quantity QTD" 	, [Sum of Quantity QTD]
            )

"""
trace4 = runQueryWithTrace(expr,workspaceName,SemanticModelName,Result=False,DMV=False)

In [None]:
display(trace3)
display(trace4)

### 8.3 RANK

#### Run RANK over **Base** Table

In [None]:
expr:str = """

    DEFINE

        MEASURE dim_Date[Sum of Quantity] = 
            SUM(fact_myevents_1bln[Quantity_ThisYear])
            
        MEASURE dim_Date[Sum of Quantity Rank] =
            RANKX(ALL(dim_Geography[COUNTRY]) , [Sum of Quantity] )

    EVALUATE
        SUMMARIZECOLUMNS(
            dim_Geography[COUNTRY] ,
            TREATAS({DATE(2019,1,1)} , dim_Date[FirstDateofMonth] ) ,

            "Quantity" 		, [Sum of Quantity],
            "Rank" 			, [Sum of Quantity Rank]
            )

"""
trace5 = runQueryWithTrace(expr,workspaceName,SemanticModelName,Result=False,DMV=False)

#### Run RANK over **Partitioned** Table

In [None]:
expr:str = """

    DEFINE

        MEASURE dim_Date[Sum of Quantity] = 
            SUM(fact_myevents_1bln_partitioned_datekey[Quantity_ThisYear])
            
        MEASURE dim_Date[Sum of Quantity Rank] =
            RANKX(ALL(dim_Geography[COUNTRY]) , [Sum of Quantity] )

    EVALUATE
        SUMMARIZECOLUMNS(
            dim_Geography[COUNTRY] ,
            TREATAS({DATE(2019,1,1)} , dim_Date[FirstDateofMonth] ) ,

            "Quantity" 		, [Sum of Quantity],
            "Rank" 			, [Sum of Quantity Rank]
            )

"""
trace6 = runQueryWithTrace(expr,workspaceName,SemanticModelName,Result=False,DMV=False)

In [None]:
display(trace5)
display(trace6)

### 8.4 Percent of Parent

#### Run Percent of Parent over **Base** Table

In [None]:
expr:str = """

    DEFINE

        MEASURE dim_Date[Sum of Quantity] = 
            SUM(fact_myevents_1bln[Quantity_ThisYear])
            
        MEASURE dim_Date[Percentage of Parent] =
            [Sum of Quantity] / CALCULATE([Sum of Quantity],ALL(dim_Geography))

    EVALUATE
        SUMMARIZECOLUMNS(
            dim_Geography[COUNTRY] ,
            TREATAS({DATE(2019,1,1)} , dim_Date[FirstDateofMonth] ) ,
            "Quantity" 		, [Sum of Quantity],
            "% of Parent"	, [Percentage of Parent]
            )

"""
trace7 = runQueryWithTrace(expr,workspaceName,SemanticModelName,Result=False,DMV=False)

#### Run Percent of Parent over **Partitioned** Table

In [None]:
expr:str = """

    DEFINE

        MEASURE dim_Date[Sum of Quantity] = 
            SUM(fact_myevents_1bln_partitioned_datekey[Quantity_ThisYear])
            
        MEASURE dim_Date[Percentage of Parent] =
            [Sum of Quantity] / CALCULATE([Sum of Quantity],ALL(dim_Geography))

    EVALUATE
        SUMMARIZECOLUMNS(
            dim_Geography[COUNTRY] ,
            TREATAS({DATE(2019,1,1)} , dim_Date[FirstDateofMonth] ) ,
            "Quantity" 		, [Sum of Quantity],
            "% of Parent"	, [Percentage of Parent]
            )

"""
trace8 = runQueryWithTrace(expr,workspaceName,SemanticModelName,Result=False,DMV=False)

In [None]:
display(trace7)
display(trace8)

### 8.5 All measures combined in one query

#### Run all measures on **Base** table

In [None]:
expr:str = """

    DEFINE

        MEASURE dim_Date[Sum of Quantity] = 
            SUM(fact_myevents_1bln[Quantity_ThisYear])
            
        MEASURE dim_Date[Percentage of Parent] =
            [Sum of Quantity] / CALCULATE([Sum of Quantity],ALL(dim_Geography))

        MEASURE dim_Date[Sum of Quantity Rank] =
            RANKX(ALL(dim_Geography[COUNTRY]) , [Sum of Quantity] )

        MEASURE dim_Date[Sum of Quantity YTD] =
            TOTALYTD([Sum of Quantity],dim_Date[DateKey])
        
        MEASURE dim_Date[Sum of Quantity QTD] =
            TOTALQTD([Sum of Quantity],dim_Date[DateKey])	

        MEASURE dim_Date[Sum of Quantity PM] =
            CALCULATE([Sum of Quantity],PREVIOUSMONTH(dim_Date[DateKey]))

        MEASURE dim_Date[Sum of Quantity PM Delta] =
            [Sum of Quantity] - [Sum of Quantity PM]
        
        MEASURE dim_Date[Sum of Quantity PM %] =
            [Sum of Quantity PM Delta] / [Sum of Quantity]

    EVALUATE
        SUMMARIZECOLUMNS(
            dim_Geography[COUNTRY] ,
            TREATAS({DATE(2019,1,1)} , dim_Date[FirstDateofMonth] ) ,
            "Quantity" 				, [Sum of Quantity],
            "% of Parent"			, [Percentage of Parent],
            "Rank" 					, [Sum of Quantity Rank],
            "Quantity YTD" 			, [Sum of Quantity YTD] ,
            "Quantity QTD" 			, [Sum of Quantity QTD]	,	
            "Quantity PM" 			, [Sum of Quantity PM],
            "Quantity PM Delta"		, [Sum of Quantity PM Delta] ,
            "Quantity PM %" 		, [Sum of Quantity PM %]
            )

"""
trace9 = runQueryWithTrace(expr,workspaceName,SemanticModelName,Result=False,DMV=False)

#### Run all measures on **Partitioned** table

In [None]:
expr:str = """

    DEFINE

        MEASURE dim_Date[Sum of Quantity] = 
            SUM(fact_myevents_1bln_partitioned_datekey[Quantity_ThisYear])
            
        MEASURE dim_Date[Percentage of Parent] =
            [Sum of Quantity] / CALCULATE([Sum of Quantity],ALL(dim_Geography))

        MEASURE dim_Date[Sum of Quantity Rank] =
            RANKX(ALL(dim_Geography[COUNTRY]) , [Sum of Quantity] )

        MEASURE dim_Date[Sum of Quantity YTD] =
            TOTALYTD([Sum of Quantity],dim_Date[DateKey])
        
        MEASURE dim_Date[Sum of Quantity QTD] =
            TOTALQTD([Sum of Quantity],dim_Date[DateKey])	

        MEASURE dim_Date[Sum of Quantity PM] =
            CALCULATE([Sum of Quantity],PREVIOUSMONTH(dim_Date[DateKey]))

        MEASURE dim_Date[Sum of Quantity PM Delta] =
            [Sum of Quantity] - [Sum of Quantity PM]
        
        MEASURE dim_Date[Sum of Quantity PM %] =
            [Sum of Quantity PM Delta] / [Sum of Quantity]

    EVALUATE
        SUMMARIZECOLUMNS(
            dim_Geography[COUNTRY] ,
            TREATAS({DATE(2019,1,1)} , dim_Date[FirstDateofMonth] ) ,
            "Quantity" 				, [Sum of Quantity],
            "% of Parent"			, [Percentage of Parent],
            "Rank" 					, [Sum of Quantity Rank],
            "Quantity YTD" 			, [Sum of Quantity YTD] ,
            "Quantity QTD" 			, [Sum of Quantity QTD]	,	
            "Quantity PM" 			, [Sum of Quantity PM],
            "Quantity PM Delta"		, [Sum of Quantity PM Delta] ,
            "Quantity PM %" 		, [Sum of Quantity PM %]
            )

"""
trace10 = runQueryWithTrace(expr,workspaceName,SemanticModelName,Result=False,DMV=False)

In [None]:
display(trace9)
display(trace10)

## 9. Stop the Spark Session

In [None]:
# Use notebookutils, consistent with the earlier cells
notebookutils.session.stop()

## Lab Summary

### 🎯 Key Achievements

**Column Partitioning Mastery Completed:**
- ✅ **Performance Analysis**: Compared partitioned vs. non-partitioned table performance
- ✅ **Server Timings**: Captured and analyzed detailed query execution metrics
- ✅ **Cache Optimization**: Understood cold vs. warm cache behavior patterns
- ✅ **Advanced DAX**: Executed complex analytical queries with performance monitoring

### 📊 Performance Insights Gained

**Partitioned Table Benefits:**
- Improved query response times for large datasets
- Enhanced compression ratios through strategic column organization
- Optimized memory utilization during query execution
- Reduced I/O operations for filtered queries

**Query Pattern Analysis:**
- **Period Comparison**: Demonstrated performance gains with date partitioning
- **Running Totals**: Showcased optimized sequential data access
- **Ranking Operations**: Improved sorting performance with partitioned data
- **Percent of Parent**: Enhanced aggregation efficiency

### 🔧 Best Practices Learned

**Column Partitioning Strategy:**
1. **Identify Candidate Columns** for partitioning, such as the `DateKey` column used in this lab
2. **Analyze Query Patterns** to determine optimal partitioning schemes
3. **Monitor Server Timings** to validate performance improvements
4. **Test Different Scenarios** (cold vs. warm cache) for comprehensive analysis

**Performance Optimization Guidelines:**
- Partition on columns frequently used as query filters
- Consider data distribution when choosing partition boundaries
- Monitor compression ratios to balance storage vs. performance
- Use VertiPaq Analyzer to identify optimization opportunities
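
The data distribution guideline above can be checked before committing to a partitioning column. The sketch below counts rows per candidate partition and reports how far the largest partition is from the average. The key values are hypothetical, and the skew ratio is just one simple measure among several possible ones.

```python
from collections import Counter

def partition_skew(keys):
    """Return rows per partition and the ratio of the largest partition to the mean."""
    counts = Counter(keys)
    mean_rows = sum(counts.values()) / len(counts)
    return counts, max(counts.values()) / mean_rows

# Hypothetical partition key values, one entry per row
keys = [20190101] * 6 + [20190102] * 2 + [20190103] * 1

counts, skew = partition_skew(keys)
print(dict(counts))
print(f"skew ratio: {skew:.1f}")
```

A ratio close to 1 indicates evenly sized partitions; a high ratio suggests a few partitions hold most of the rows, which limits how much a filter on that column can skip.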

### 🚀 Next Steps

**Recommended Follow-up:**
- **Lab 7**: High cardinality column splitting techniques
- **Lab 8**: Advanced OneLake integration with Import mode
- **Advanced Topics**: Dynamic partition management and maintenance

**Production Implementation:**
- Establish partitioning policies based on query patterns
- Implement automated performance monitoring
- Design partition maintenance schedules
- Create performance benchmarking procedures

### 📈 Business Impact

**Direct Lake Column Partitioning enables:**
- **Faster Query Performance**: Reduced response times for analytical workloads
- **Improved Scalability**: Better handling of large-scale data volumes
- **Cost Optimization**: Efficient resource utilization and reduced compute costs
- **Enhanced User Experience**: Responsive dashboards and reports

**Performance Metrics:**
- Query execution time improvements measured through server timings
- Memory efficiency gains through optimized column storage
- I/O reduction benefits for filtered and aggregated queries
- Scalability improvements for growing data volumes