# Lab 3: Delta Table Analysis and Optimization

## Lab Overview

This lab explores the **Delta Analyzer** tool for deep insights into Delta Lake table structure and optimization. You'll analyze billion-row tables to understand how different storage configurations impact Direct Lake performance and discover optimization opportunities.

### What You'll Build

**Workshop Flow:**
```
1. Setup Analysis Environment
   ↓
2. Analyze Small Dimension Tables
   ↓
3. Compare V-Order vs Standard Storage
   ↓
4. Examine Partitioned Tables
   ↓
5. Analyze Billion-Row Performance
   ↓
6. Generate Optimization Recommendations
```

### Key Concepts
- **Delta Analyzer**: Tool for examining Delta Lake table internals
- **V-Order Optimization**: Microsoft's write-time Parquet optimization (sorting, row group distribution, dictionary encoding, and compression) that speeds up reads
- **Row Group Analysis**: Understanding data clustering and organization
- **Partitioning Impact**: How partitioning affects query performance

### Learning Objectives
By completing this lab, you'll be able to:
- ✅ Use Delta Analyzer to examine table structure and optimization
- ✅ Compare V-Order vs standard storage performance characteristics
- ✅ Analyze partitioning benefits for large tables
- ✅ Identify optimization opportunities for Direct Lake performance
- ✅ Generate actionable recommendations for table improvements

### Analysis Targets
| Table | Rows | Features | Focus |
|:------|:-----|:---------|:------|
| **dim_Date** | ~3.6K | Small dimension | Baseline optimization |
| **fact_myevents_1bln** | 1B | V-Order optimized | Standard optimization |
| **fact_myevents_1bln_no_vorder** | 1B | No V-Order | Performance comparison |
| **fact_myevents_1bln_partitioned_datekey** | 1B | Date partitioned | Partitioning benefits |
| **fact_myevents_2bln** | 2B | Maximum scale | Scale impact analysis |

**Estimated Time**: 45-60 minutes  
**Prerequisites**: Lab 2 completion (BigData lakehouse with billion-row tables)

---

### Why Delta Analysis Matters for Direct Lake Performance

Delta Analyzer reveals physical storage characteristics that directly impact Direct Lake performance and enables data-driven optimization decisions.

## 1. Install Required Libraries

Install Semantic Link Labs with specialized Delta Analyzer capabilities for examining table structure and optimization opportunities.

In [None]:
%pip install -q --disable-pip-version-check semantic-link-labs

## 2. Configure Libraries and Delta Analysis Environment

Import the required libraries and set up a custom Delta Analyzer function for comprehensive table analysis.

In [None]:
import sempy_labs as labs
from sempy import fabric
import sempy
import pandas
import time

# Default lakehouse name; replaced by the last lakehouse whose name starts with "Big"
LakehouseName = "BigData"
lakehouses = labs.list_lakehouses()["Lakehouse Name"]
for name in lakehouses:
    if name.startswith("Big"):
        LakehouseName = name

SemanticModelName = f"{LakehouseName}_model"

def myDeltaAnalyzer(table_name: str, skip_cardinality: bool = True) -> dict:
    """Run Delta Analyzer on one table, display each result, and return the results keyed by category."""
    analyzer: dict = labs.delta_analyzer(lakehouse=LakehouseName, table_name=table_name, skip_cardinality=skip_cardinality)
    for key, value in analyzer.items():
        # displayHTML and display are notebook built-ins
        displayHTML(f"<H2>#### {key} ({table_name}) ####</H2>")
        display(value)
    return analyzer

## 3. Validate Lakehouse Environment and Prerequisites

Validates that Lab 2 was completed and confirms access to the big data lakehouse for Delta analysis.

In [None]:
lakehouses = labs.list_lakehouses()["Lakehouse Name"]
if LakehouseName not in lakehouses.values:
    # Stop here: every later cell depends on the lakehouse created in Lab 2
    raise RuntimeError("You need to complete Lab 2 to create the required lakehouse for this lab")

# Fetch the lakehouse properties once and reuse them
lakehouseProperties = notebookutils.lakehouse.getWithProperties(LakehouseName)
lakehouseId = lakehouseProperties["id"]
workspaceId = lakehouseProperties["workspaceId"]
workspaceName = sempy.fabric.resolve_workspace_name(workspaceId)
print(f"WorkspaceId = {workspaceId}, LakehouseID = {lakehouseId}, Workspace Name = {workspaceName}")

## 4. Inventory Available Tables for Delta Analysis

Lists all available tables in the big data lakehouse to catalog analysis targets.

In [None]:
labs.lakehouse.get_lakehouse_tables(lakehouse=LakehouseName)

## 5. Baseline Analysis: Small Dimension Table (dim_Date)

Analyzes optimized dimension table structure to establish baseline for Delta Lake optimization patterns.

In [None]:
# Small table: cardinality calculation is cheap, so include it
df1: dict[str, pandas.DataFrame] = myDeltaAnalyzer("dim_Date", skip_cardinality=False)

## 6. V-Order Optimization Analysis: 1 Billion Row Table

Analyzes V-Order optimization effectiveness on a billion-row table, focusing on compression and the storage layout that drives read performance.

In [None]:
df2: dict[str, pandas.DataFrame] = myDeltaAnalyzer("fact_myevents_1bln")

## 7. Performance Comparison: 1 Billion Rows Without V-Order

Analyzes the billion-row table written without V-Order so that its storage characteristics can be compared with the V-Order version.

In [None]:
df3: dict[str, pandas.DataFrame] = myDeltaAnalyzer("fact_myevents_1bln_no_vorder")

## 8. Partitioning Strategy Analysis: Date-Key Organization Benefits

Examines billion-row table partitioned by DateKey to understand partitioning impact on Direct Lake performance.

### Date-Key Partitioning Analysis
```python
myDeltaAnalyzer("fact_myevents_1bln_partitioned_datekey")
```

#### Partitioning Strategy Examination:

##### **File Organization by Date**:
- **Partition structure**: Files organized by DateKey values
- **File distribution**: How billion rows are distributed across date partitions
- **Size consistency**: Whether partitions have similar sizes
- **Storage efficiency**: Impact of partitioning on compression

##### **Performance Optimization Opportunities**:
- **Query patterns**: How date-based filters benefit from partitioning
- **Memory loading**: Selective partition loading for Direct Lake
- **Parallel processing**: Concurrent partition access capabilities

### Expected Partitioning Insights

#### Structural Analysis:
| Aspect | Partitioned Table | Non-Partitioned | Benefit |
|--------|------------------|------------------|---------|
| **File count** | Many small files per partition | Fewer large files | Better granularity |
| **Query filtering** | Partition elimination | Full table scan | Faster queries |
| **Memory loading** | Selective partition loading | Entire columns | Reduced memory |
| **Parallel processing** | Per-partition processing | Limited parallelism | Better throughput |

#### Direct Lake Performance Impact:
- **Memory management**: Load only relevant date ranges
- **Query optimization**: Skip irrelevant partitions automatically
- **Fallback avoidance**: Reduced memory pressure through selective loading
- **User experience**: Faster response times for date-filtered queries

### Partitioning Trade-offs

#### Advantages:
- ✅ **Query performance**: Dramatic improvement for date-filtered queries
- ✅ **Memory efficiency**: Selective data loading
- ✅ **Parallel processing**: Better resource utilization
- ✅ **Maintenance efficiency**: Easier data lifecycle management

#### Considerations:
- ⚠️ **Small file overhead**: Many small files can impact metadata operations
- ⚠️ **Non-partition queries**: Cross-partition queries may be slower
- ⚠️ **Storage overhead**: Slightly more metadata storage required

### Analysis Validation
The Delta analyzer will reveal:
- **Partition distribution**: How evenly data is distributed across partitions
- **File size patterns**: Whether partitions are appropriately sized
- **Compression effectiveness**: How partitioning affects compression ratios
- **Storage organization**: Physical layout optimization for query patterns

**Expected outcome**: Comprehensive analysis of partitioning benefits for billion-row Direct Lake tables, demonstrating query performance optimization and memory efficiency gains.
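
To make "how evenly data is distributed across partitions" concrete, here is a minimal sketch of a partition profile. The DataFrame is synthetic, and the column names `partition` and `file_size_bytes` are hypothetical stand-ins: the actual "Parquet Files" output may name its columns differently, so adjust them before applying the idea to `df4`.

```python
import pandas as pd

# Synthetic per-file listing; column names are hypothetical placeholders.
files = pd.DataFrame({
    "partition": ["20240101", "20240101", "20240102", "20240103", "20240103", "20240103"],
    "file_size_bytes": [400, 600, 1000, 100, 150, 250],
})

def partition_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate file count and total bytes per partition."""
    profile = (
        df.groupby("partition")["file_size_bytes"]
        .agg(file_count="count", total_bytes="sum")
        .reset_index()
    )
    # Share of the table's bytes held by each partition
    profile["share"] = profile["total_bytes"] / profile["total_bytes"].sum()
    return profile

profile = partition_profile(files)
# Skew ratio: largest partition divided by the mean partition size (1.0 means perfectly even)
skew_ratio = profile["total_bytes"].max() / profile["total_bytes"].mean()
print(profile)
print(f"skew ratio: {skew_ratio:.2f}")
```

A skew ratio well above 1.0, or partitions with many files but few bytes, would point to the small-file overhead listed under Considerations.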

In [None]:
df4: dict[str, pandas.DataFrame] = myDeltaAnalyzer("fact_myevents_1bln_partitioned_datekey")

## 9. Maximum Scale Analysis: 2 Billion Row Table Limits

Analyzes the 2-billion-row table to understand how Direct Lake behaves near its scale limits.

### 2 Billion Row Storage Analysis
```python
myDeltaAnalyzer("fact_myevents_2bln")  # Maximum scale analysis
```

#### Critical Analysis Areas:

##### **File Size Distribution at Scale**:
- **Individual file sizes**: Whether files remain within 1GB Direct Lake limits
- **File count scaling**: How doubling rows affects file organization
- **Storage efficiency**: Compression effectiveness at maximum scale

##### **Memory Impact Projection**:
- **Dictionary sizes**: Column dictionary memory requirements
- **Cardinality scaling**: How unique values scale with data volume
- **Memory pressure indicators**: Signs of approaching Direct Lake limits

### Expected Scale Impact Analysis

#### Storage Scaling Patterns:
| Metric | 1B Rows | 2B Rows | Scaling Factor |
|--------|---------|---------|----------------|
| **Total storage** | Baseline | ~2x | Linear scaling |
| **File count** | Baseline | ~2x | Proportional increase |
| **Individual file sizes** | <1GB | Monitor limits | Critical threshold |
| **Compression ratio** | Optimized | Maintain efficiency | Consistency check |

#### Direct Lake Guardrail Implications:
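- **Memory limits**: 2B-row columns may approach the capacity's memory limit
- **Row count limits**: Rows per table is itself a guardrail, and its value depends on the capacity SKU
- **File limits**: Keep individual files under the 1GB target used in this lab; guardrails also cap the number of Parquet files and row groups per table
- **Cardinality**: High-cardinality columns enlarge dictionaries and push memory use toward the limit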
- **Performance thresholds**: Query execution time considerations
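
A guardrail check can be sketched as a comparison of table statistics against per-capacity limits. Everything numeric below is an assumption for illustration: `HYPOTHETICAL_LIMITS` holds placeholder values, not official guardrail numbers, and the table statistics are made up rather than read from Delta Analyzer. Look up the real limits for your capacity SKU before relying on a check like this.

```python
from dataclasses import dataclass

@dataclass
class TableStats:
    rows: int
    parquet_files: int
    row_groups: int

# Placeholder limits for illustration only; not official guardrail values.
HYPOTHETICAL_LIMITS = TableStats(rows=3_000_000_000, parquet_files=5_000, row_groups=5_000)

def guardrail_report(stats: TableStats, limits: TableStats) -> dict[str, float]:
    """Return each metric as a fraction of its limit (values above 1.0 exceed the limit)."""
    return {
        "rows": stats.rows / limits.rows,
        "parquet_files": stats.parquet_files / limits.parquet_files,
        "row_groups": stats.row_groups / limits.row_groups,
    }

# Made-up figures standing in for values you would read from the Summary output.
table = TableStats(rows=2_000_000_000, parquet_files=400, row_groups=2_500)
report = guardrail_report(table, HYPOTHETICAL_LIMITS)
for metric, used in report.items():
    flag = "OVER" if used > 1.0 else ("WARN" if used > 0.8 else "ok")
    print(f"{metric}: {used:.0%} of limit [{flag}]")
```

Reporting the fraction used, rather than a pass/fail flag alone, shows how much headroom remains before a table would trigger fallback.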

### Extreme Scale Insights

#### Performance Characteristics:
- **Loading behavior**: How Direct Lake handles extremely large columns
- **Fallback triggers**: Conditions that force fallback to DirectQuery through the SQL analytics endpoint
- **Memory optimization**: System memory management at scale
- **Query patterns**: Which query types work efficiently at this scale

#### Optimization Strategies:
- **Partitioning necessity**: Whether extreme scale requires partitioning
- **Compression importance**: How much V-Order matters at maximum scale
- **Memory planning**: Capacity requirements for production deployment
- **Fallback planning**: When to expect and plan for queries served through the SQL analytics endpoint

### Real-World Production Guidance
This analysis provides practical insights for:
- **Capacity planning**: Understanding hardware requirements for large datasets
- **Architecture decisions**: When to partition, optimize, or plan for fallback
- **Performance expectations**: Realistic performance goals for extreme-scale scenarios
- **Cost planning**: Storage and compute costs for maximum-scale Direct Lake implementations

**Expected outcome**: Comprehensive understanding of Direct Lake behavior at extreme scale, including storage optimization requirements and fallback scenario planning for production deployments.

In [None]:
df5: dict[str, pandas.DataFrame] = myDeltaAnalyzer("fact_myevents_2bln")

## 10. Understanding Delta Analyzer Output Categories

Interprets comprehensive Delta analyzer results to understand optimization opportunities and performance patterns.

### Analysis Category Selection Strategy

#### For Different Table Types:
| Table Type | Primary Focus | Key Categories |
|------------|---------------|----------------|
| **Small dimensions** | Complete analysis | All categories including Columns |
| **Large facts** | Structure and optimization | Summary, Parquet Files, Row Groups |
| **Partitioned tables** | File distribution | Parquet Files, Row Groups |
| **Optimization comparison** | Compression analysis | Summary, Column Chunks |

#### Performance Considerations:
- **Small tables**: All categories provide valuable insights without performance impact
- **Billion-row tables**: Focus on structural analysis and keep the default `skip_cardinality=True` in `myDeltaAnalyzer` to avoid expensive cardinality calculations
- **Comparison studies**: Use consistent categories across tables for meaningful comparison

**Expected outcome**: Clear understanding of each analysis category's purpose and value for different optimization scenarios.
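
The selection table above can be expressed as a small lookup, sketched here under assumptions: the table-type keys (`small_dimension`, `large_fact`, and so on) and the helper `select_categories` are hypothetical names, and the stand-in result uses empty DataFrames instead of real analyzer output. The category names are the ones listed in this lab's comparison cell.

```python
import pandas as pd

# Category names as listed in this lab; the table-type keys are hypothetical labels.
FOCUS_CATEGORIES = {
    "small_dimension": ["Summary", "Parquet Files", "Row Groups", "Column Chunks", "Columns"],
    "large_fact": ["Summary", "Parquet Files", "Row Groups"],
    "partitioned": ["Parquet Files", "Row Groups"],
    "optimization_comparison": ["Summary", "Column Chunks"],
}

def select_categories(result: dict, table_type: str) -> dict:
    """Keep only the analyzer outputs relevant to the given table type."""
    wanted = FOCUS_CATEGORIES[table_type]
    return {name: result[name] for name in wanted if name in result}

# Stand-in for an analyzer result: empty frames keyed by category name.
fake_result = {name: pd.DataFrame() for name in FOCUS_CATEGORIES["small_dimension"]}
selected = select_categories(fake_result, "large_fact")
print(list(selected))
```

With real results, a call such as `select_categories(df2, "large_fact")` would narrow the output of the billion-row analysis to the structural categories.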

In [None]:
# List the output categories returned by Delta Analyzer
for key in df1:
    print(key)

## 11. Comparative Analysis: Cross-Table Optimization Study

### Multi-Table Comparison Strategy
Combining Delta Analyzer results from all five tables enables **comprehensive optimization comparison** across different storage strategies and scales:

#### Comparison Dimensions:
- **Scale impact**: How data volume affects optimization (1B vs 2B rows)
- **V-Order benefits**: Quantifying optimization effectiveness
- **Partitioning benefits**: Understanding partitioning impact on storage
- **Size scaling**: How storage characteristics scale with data volume

### Comparative Analysis Framework

#### Table Comparison Matrix:
| Table | Scale | V-Order | Partitioned | Analysis Focus |
|-------|-------|---------|-------------|----------------|
| **dim_Date** | Small | ✅ | ❌ | Baseline reference |
| **fact_myevents_1bln** | 1B | ✅ | ❌ | Standard optimization |
| **fact_myevents_1bln_no_vorder** | 1B | ❌ | ❌ | Optimization comparison |
| **fact_myevents_1bln_partitioned_datekey** | 1B | ✅ | ✅ | Partitioning benefits |
| **fact_myevents_2bln** | 2B | ✅ | ❌ | Maximum scale |

#### Key Comparison Metrics:
- **Storage efficiency**: Total storage size and compression ratios
- **File organization**: File count and size distribution
- **Performance indicators**: Metrics that predict Direct Lake performance
- **Optimization impact**: Quantifiable benefits of different strategies

### Visualization and Insights Generation

#### Summary Table Consolidation:
The combined dataframe enables powerful comparative analysis:

```python
delta_analyzer_output_table = "Summary"  # Focus on high-level comparison
```

#### Why Start with Summary Analysis:
- **Overview perspective**: High-level metrics across all tables
- **Key performance indicators**: Critical metrics for Direct Lake optimization
- **Trend identification**: Patterns across different optimization strategies
- **Decision support**: Data-driven optimization recommendations

#### Visualization Benefits:
- **Side-by-side comparison**: Direct comparison of optimization strategies
- **Scale impact visualization**: How data volume affects storage characteristics
- **Optimization quantification**: Measurable benefits of V-Order and partitioning
- **Performance prediction**: Indicators for Direct Lake behavior

### Expected Comparative Insights

#### V-Order Impact Quantification:
- **Storage reduction**: Measurable compression improvement
- **File efficiency**: Better file size distribution
- **Performance prediction**: Indicators for faster Direct Lake loading
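
Storage reduction is simple arithmetic once you have the two table sizes: (baseline − optimized) / baseline. The sketch below uses made-up byte counts purely to show the calculation; substitute the totals reported for `fact_myevents_1bln_no_vorder` and `fact_myevents_1bln`.

```python
def storage_reduction(baseline_bytes: int, optimized_bytes: int) -> float:
    """Fractional size reduction of the optimized table relative to the baseline."""
    if baseline_bytes <= 0:
        raise ValueError("baseline_bytes must be positive")
    return (baseline_bytes - optimized_bytes) / baseline_bytes

# Made-up sizes for illustration; read the real totals from each table's Summary output.
no_vorder_bytes = 50_000_000_000
vorder_bytes = 35_000_000_000
reduction = storage_reduction(no_vorder_bytes, vorder_bytes)
print(f"storage reduction: {reduction:.1%}")
```

A negative result would mean the supposedly optimized table is larger than the baseline, which is worth investigating before drawing conclusions.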

#### Partitioning Benefits Analysis:
- **File distribution**: How partitioning affects file organization
- **Query optimization potential**: Partition elimination opportunities
- **Memory efficiency**: Selective loading benefits

#### Scale Impact Understanding:
- **Linear scaling**: How storage scales with data volume
- **Performance thresholds**: Where optimization becomes critical
- **Guardrail proximity**: Approaching Direct Lake limits

**Expected outcome**: Comprehensive comparative analysis revealing optimization strategies' effectiveness and providing data-driven guidance for Direct Lake performance tuning.
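
When combining per-table results, it helps to label each row with its source table in case the frames do not already carry an identifier. This is a minimal sketch using synthetic one-row frames; the columns `metric_a` and `metric_b` are hypothetical placeholders, not the actual Summary columns.

```python
import pandas as pd

# Synthetic one-row summaries; column names are hypothetical placeholders.
summaries = {
    "fact_myevents_1bln": pd.DataFrame({"metric_a": [10], "metric_b": [1.5]}),
    "fact_myevents_1bln_no_vorder": pd.DataFrame({"metric_a": [10], "metric_b": [2.5]}),
}

# Passing a dict to concat uses its keys as an outer index level;
# reset_index then turns that level into an ordinary "table" column.
combined = (
    pd.concat(summaries, names=["table", "row"])
    .reset_index(level="table")
    .reset_index(drop=True)
)
print(combined)
```

With the lab's results, the dictionary could map each table name to `df2["Summary"]`, `df3["Summary"]`, and so on, so that identical rows from different tables stay distinguishable.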

In [None]:
delta_analyzer_output_table = "Summary"  #  "Summary","Parquet Files","Row Groups","Column Chunks","Columns"

df_combined = pandas.concat([
        df1[delta_analyzer_output_table],
        df2[delta_analyzer_output_table],
        df3[delta_analyzer_output_table],
        df4[delta_analyzer_output_table],
        df5[delta_analyzer_output_table],
        ]
        ).drop_duplicates().reset_index(drop=True)
display(df_combined)

## 12. Lab 3 Completion: Delta Optimization Mastery

### Congratulations! Delta Lake Analysis Expert 🎉

You've successfully completed a comprehensive **Delta Lake optimization analysis**, gaining deep insights into storage patterns that directly impact Direct Lake performance.

#### 🔍 **Analysis Achievements**:
- ✅ **Multi-scale analysis**: From 3.6K to 2B row tables
- ✅ **Optimization comparison**: V-Order vs. standard storage quantification
- ✅ **Partitioning evaluation**: Understanding partitioning benefits and trade-offs
- ✅ **Performance prediction**: Storage patterns that optimize Direct Lake behavior
- ✅ **Comparative insights**: Data-driven optimization recommendations

### Key Delta Lake Optimization Learnings

#### 📊 **Storage Optimization Impact**:
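- **V-Order provides measurable benefits**, with storage reduction typically in the 20-40% range; confirm the figure for this dataset by comparing the Summary sizes of `fact_myevents_1bln` and `fact_myevents_1bln_no_vorder`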
- **Partitioning enables selective loading** reducing memory requirements
- **File organization directly impacts** Direct Lake loading performance
- **Compression strategies become critical** at billion-row scale

#### 🎯 **Direct Lake Performance Guidance**:
- **File size distribution** affects memory loading efficiency
- **Column compression ratios** predict memory requirements
- **Row group organization** impacts query performance
- **Storage patterns indicate** fallback probability

#### 🛡️ **Production Optimization Strategy**:
- **V-Order optimization** is essential for large-scale Direct Lake
- **Partitioning strategies** should align with query patterns
- **File size monitoring** prevents Direct Lake guardrail violations
- **Compression analysis** guides memory capacity planning

### Real-World Applications

#### Enterprise Optimization Scenarios:
- **📈 Performance tuning**: Use Delta analysis to optimize existing tables
- **🏗️ Architecture planning**: Choose optimization strategies based on data characteristics
- **💰 Cost optimization**: Balance storage costs with performance requirements
- **🔄 Maintenance strategies**: Monitor storage patterns over time

### Advanced Optimization Techniques Discovered

#### V-Order Benefits Quantified:
- **Storage efficiency**: Measurable compression improvements
- **Memory optimization**: Reduced Direct Lake memory footprint
- **Query acceleration**: Better predicate pushdown and filtering

#### Partitioning Strategy Validation:
- **Partition elimination**: Quantified benefits for date-filtered queries
- **Memory efficiency**: Selective partition loading capabilities
- **Performance trade-offs**: Understanding when partitioning helps vs. hurts

### Next Steps in Your Direct Lake Journey

#### 🚀 **Immediate Applications**:
- Apply Delta analysis to your own datasets
- Implement V-Order optimization on critical tables
- Design partitioning strategies based on query patterns

#### 📚 **Advanced Learning Path**:
- **Lab 4**: Explore fallback behaviors and troubleshooting
- **Lab 5**: Understand framing and refresh optimization
- **Lab 6-7**: Advanced performance tuning techniques

### Resource Cleanup and Best Practices
Stopping the Spark session properly:
- **💰 Releases analysis compute resources**
- **🧹 Cleans up temporary analysis data**
- **✅ Ensures proper resource management**

### Delta Lake Optimization Certification 🏆
You have now mastered:
- ✅ **Storage analysis**: Deep understanding of Delta Lake file organization
- ✅ **Performance optimization**: V-Order and partitioning strategies
- ✅ **Direct Lake tuning**: Storage patterns that optimize memory usage
- ✅ **Comparative analysis**: Data-driven optimization decision making

🎯 **Ready for the next lab?** Let's explore Direct Lake fallback behavior and protection mechanisms!

---

## Lab Summary

### What You Accomplished
In this lab, you mastered **Delta Lake table analysis and optimization** for Direct Lake performance:

- ✅ **Comprehensive Analysis**: Analyzed Delta Lake table structures across billion-row datasets
- ✅ **V-Order Optimization**: Understood compression and performance benefits of V-Order optimization
- ✅ **Partitioning Strategy**: Evaluated partitioning effectiveness for query pruning and performance
- ✅ **Scale Validation**: Tested Direct Lake limits with massive 2 billion row datasets
- ✅ **Performance Comparison**: Compared optimization techniques across different table configurations
- ✅ **Optimization Insights**: Identified key factors for Delta Lake table optimization

### Architecture Overview

**Delta Lake Analysis and Optimization Flow:**
```
Delta Tables → Structure Analysis → Optimization Assessment → Performance Insights
     ↓              ↓                    ↓                    ↓
Raw Data    → File Organization → V-Order Benefits → Compression Ratios
Partitions  → Distribution Analysis → Query Pruning → Performance Gains
Row Groups  → Column Chunks → Storage Efficiency → Memory Optimization
     ↓              ↓                    ↓                    ↓
Optimized Tables → Enhanced Performance → Efficient Queries → Business Value
```

### Key Takeaways

- **V-Order Optimization**: Provides significant compression and query performance benefits for large tables
- **Partitioning Strategy**: Date-based partitioning enables effective query pruning for time-series data
- **File Organization**: Optimal file sizes and row group distribution improve Direct Lake performance
- **Scale Understanding**: Billion-row tables can operate effectively within Direct Lake guardrails when the capacity is sized for them
- **Optimization Identification**: Delta analyzer reveals specific optimization opportunities

### Technical Skills Gained

- **Delta Analyzer Expertise**: Advanced skills in analyzing Delta Lake table structures
- **Optimization Assessment**: Ability to evaluate and recommend table optimization strategies
- **Performance Analysis**: Understanding of how table structure impacts Direct Lake performance
- **Scale Planning**: Capability to plan and validate large-scale Delta Lake implementations

### Next Steps

**Continue to Lab 4** to learn about:
- Direct Lake fallback behavior and protection mechanisms
- Understanding when and why fallback to DirectQuery through the SQL analytics endpoint occurs
- Configuring optimal fallback modes for production environments

---

Stops the Spark session to release compute resources.

In [None]:
# notebookutils replaces the older mssparkutils alias and is already used earlier in this lab
notebookutils.session.stop()