# Enhanced Analytics-Ready Features with SQL Semantics in Memgraph

This notebook explores the analytics-ready optimization features, now extended with **SQL semantics metadata** that migration agents and automated code generation rely on.

## 🆕 Enhanced for Migration Agents
- **SQL Semantics Materialized Views**: Pre-computed JOIN relationships and column aliases
- **Migration-Ready Performance**: Sub-second access to complex SQL analysis
- **Platform-Specific Insights**: Organized data for Spark, dbt, Pandas code generation
- **Categories Table Validation**: Verify the critical table extraction improvements

## What We'll Learn
- How to access SQL semantics from materialized views
- Performance comparison: Enhanced views vs raw SQL parsing
- Migration agent data consumption patterns
- JOIN relationship analysis for automated code generation

In [None]:
# Setup
import mgclient
import pandas as pd
import json
import time
import matplotlib.pyplot as plt
from datetime import datetime

# Connect to Memgraph
mg = mgclient.connect(host='localhost', port=7687, username='', password='')
print("✅ Connected to Memgraph")

def execute_query(query, description=None, time_it=False):
    """Execute a Cypher query and return results as DataFrame."""
    if description:
        print(f"\n🔍 {description}")
        if not time_it:
            print(f"Query: {query}")
        print("-" * 50)
    
    # perf_counter is monotonic and has higher resolution than time.time()
    start_time = time.perf_counter() if time_it else None
    
    cursor = mg.cursor()
    cursor.execute(query)
    results = cursor.fetchall()
    
    if time_it:
        execution_time = (time.perf_counter() - start_time) * 1000  # Convert to milliseconds
        print(f"⏱️ Execution time: {execution_time:.2f}ms")
    
    if results:
        columns = [desc.name for desc in cursor.description] if cursor.description else ['result']
        df = pd.DataFrame(results, columns=columns)
        print(f"Found {len(df)} results")
        return df, execution_time if time_it else None
    else:
        print("No results found.")
        return pd.DataFrame(), execution_time if time_it else None

def pretty_print_json(data, max_length=1000):
    """Pretty print JSON data."""
    if isinstance(data, str):
        try:
            data = json.loads(data)
        except json.JSONDecodeError:
            print(data[:max_length] + "..." if len(data) > max_length else data)
            return
    
    json_str = json.dumps(data, indent=2)
    if len(json_str) > max_length:
        json_str = json_str[:max_length] + "\n...truncated..."
    print(json_str)

## Part 1: Verify Analytics-Ready Status

In [None]:
# Check if the graph is analytics-ready
metadata_df, _ = execute_query(
    """MATCH (m:Node {node_type: 'graph_metadata'})
       RETURN m.name as metadata_name,
              m.id as metadata_id,
              m.properties as properties""",
    "Checking analytics-ready status"
)

if not metadata_df.empty:
    metadata = metadata_df.iloc[0]
    
    # Parse the properties JSON
    try:
        props = json.loads(metadata['properties']) if isinstance(metadata['properties'], str) else metadata['properties']
        ready_status = props.get('ready_for_applications', 'unknown')
        optimization_level = props.get('optimization_level', 'unknown')
        use_cases = props.get('supported_use_cases', [])
    except (json.JSONDecodeError, AttributeError, TypeError):
        ready_status = 'unknown'
        optimization_level = 'unknown'
        use_cases = []
    
    print(f"\n🎉 Graph Analytics Status:")
    print(f"  ✅ Ready for Applications: {ready_status}")
    print(f"  🚀 Optimization Level: {optimization_level}")
    
    if use_cases:
        print(f"  🎯 Supported Use Cases:")
        for use_case in use_cases:
            print(f"    • {use_case.replace('_', ' ').title()}")
else:
    print("❌ Graph is not analytics-ready. No metadata found.")
    print("Run: METAZCODE_DB_BACKEND=memgraph metazcode full --path ssis_northwind")

display(metadata_df)


In [None]:
# Get detailed metadata information
if not metadata_df.empty:
    detailed_metadata_df, _ = execute_query(
        """MATCH (m:Node {node_type: 'graph_metadata'})
           RETURN m.properties as full_metadata""",
        "Detailed graph metadata"
    )
    
    if not detailed_metadata_df.empty:
        metadata_props = detailed_metadata_df.iloc[0]['full_metadata']
        print("\n📊 Complete Graph Metadata:")
        pretty_print_json(metadata_props)

## Part 2: Explore Materialized Views

In [None]:
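# List all available materialized views
views_df, _ = execute_query(
    """MATCH (v:Node {node_type: 'materialized_view'})
       RETURN v.name as view_name,
              v.id as view_id,
              v.properties as properties
       ORDER BY v.name""",
    "Available materialized views"
)

def parse_properties(raw):
    """Decode a JSON-encoded properties value; pass decoded values through."""
    if isinstance(raw, str):
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            return {}
    return raw if raw is not None else {}

if not views_df.empty:
    print(f"\n📋 Found {len(views_df)} Materialized Views:")
    for _, view in views_df.iterrows():
        props = parse_properties(view['properties'])
        # Assumption: a dict payload may carry 'record_count' and 'version' keys;
        # a list payload is taken to be the view's records themselves.
        if isinstance(props, dict):
            record_count = props.get('record_count', 'unknown')
            version = props.get('version', 'unknown')
        else:
            record_count, version = len(props), 'unknown'
        print(f"  • {view['view_name']}: {record_count} records (v{version})")
    
    display(views_df)
else:
    print("❌ No materialized views found.")
    print("The graph may not be analytics-ready or may need to be regenerated.")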

## Part 3: Using Materialized Views

Let's explore each materialized view and understand what data it contains.
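
Each view is read from a single node whose `properties` value holds the view's records. The following is a minimal sketch of that lookup, assuming the node id follows the `view:<name>` pattern used in this notebook and that `properties` arrives either as a JSON string or as an already-decoded value. The helper name `load_view_records` is hypothetical, and the `$view_id` placeholder assumes the driver binds named parameters passed as the second argument to `cursor.execute`, as pymgclient does.

```python
import json


def load_view_records(cursor, view_name):
    """Return the records stored on a materialized-view node as a list.

    `cursor` is any DB-API style cursor, for example `mg.cursor()`.
    """
    # Bind the id as a parameter instead of formatting it into the query text.
    cursor.execute(
        "MATCH (v:Node {id: $view_id}) RETURN v.properties AS view_data",
        {"view_id": f"view:{view_name}"},
    )
    row = cursor.fetchone()
    if not row or row[0] is None:
        return []

    payload = row[0]
    if isinstance(payload, str):
        try:
            payload = json.loads(payload)
        except json.JSONDecodeError:
            return []

    # A single JSON object is treated as a one-record view.
    return payload if isinstance(payload, list) else [payload]
```

Passing the view name as a query parameter avoids quoting problems if a name ever contains a quote character, and returning a list in every case lets callers hand the result straight to `pd.DataFrame`.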

In [None]:
# Helper function to access materialized view data
def get_materialized_view_data(view_name, description=None):
    """Get data from a materialized view."""
    query = f"""MATCH (v:Node {{id: 'view:{view_name}'}})
                RETURN v.properties as view_data"""
    
    df, exec_time = execute_query(query, description or f"Accessing {view_name} materialized view", time_it=True)
    
    if not df.empty and df.iloc[0]['view_data']:
        try:
            raw = df.iloc[0]['view_data']
            # properties may arrive as a JSON string or as an already-decoded value
            view_data = json.loads(raw) if isinstance(raw, str) else raw
            # Count the decoded records rather than echoing the raw properties payload
            record_count = len(view_data)
            print(f"📊 Records in view: {record_count}")
            return view_data, exec_time
        except (json.JSONDecodeError, TypeError) as e:
            print(f"❌ Error parsing view data: {e}")
            return None, exec_time
    else:
        print("❌ View not found or empty")
        return None, exec_time

### 3.1 SQL Operations Catalog

In [None]:
# Explore SQL Operations Catalog
sql_ops_data, sql_ops_time = get_materialized_view_data('sql_operations_catalog', 
                                                        'SQL Operations Catalog - Pre-computed SQL transformations')

if sql_ops_data:
    print(f"\n🔍 Sample SQL Operations:")
    
    # Convert to DataFrame for easier analysis
    sql_ops_df = pd.DataFrame(sql_ops_data)
    
    if not sql_ops_df.empty:
        print(f"\nColumns available: {list(sql_ops_df.columns)}")
        
        # Show first few operations
        for i, op in sql_ops_df.head(3).iterrows():
            print(f"\n📝 Operation: {op.get('operation_name', 'Unknown')}")
            print(f"   Type: {op.get('sql_type', 'Unknown')}")
            print(f"   Tables: {len(op.get('affected_tables', []))}")
            if op.get('complexity_indicators'):
                complexity = op['complexity_indicators']
                print(f"   Complexity: {complexity.get('table_count', 0)} tables, "
                      f"Joins: {complexity.get('has_joins', False)}, "
                      f"Subqueries: {complexity.get('has_subqueries', False)}")
        
        display(sql_ops_df.head())
    else:
        print("No SQL operations found in the catalog.")
else:
    print("SQL Operations Catalog is empty or not available.")

### 3.2 Cross-Package Dependencies

In [None]:
# Explore Cross-Package Dependencies
deps_data, deps_time = get_materialized_view_data('cross_package_dependencies',
                                                 'Cross-Package Dependencies - Pre-computed package relationships')

if deps_data:
    deps_df = pd.DataFrame(deps_data)
    
    if not deps_df.empty:
        print(f"\n🔗 Package Dependencies:")
        for _, dep in deps_df.iterrows():
            print(f"  • {dep.get('source_package', 'Unknown')} depends on {dep.get('target_package', 'Unknown')}")
            print(f"    Risk Level: {dep.get('risk_level', 'Unknown')}")
            if dep.get('shared_resources'):
                print(f"    Shared Resources: {len(dep['shared_resources'])}")
        
        display(deps_df)
    else:
        print("No cross-package dependencies found.")
        print("This means packages are likely independent.")
else:
    print("Cross-Package Dependencies view is empty or not available.")

### 3.3 Shared Resources Analysis

In [None]:
# Explore Shared Resources Analysis
shared_data, shared_time = get_materialized_view_data('shared_resources_analysis',
                                                     'Shared Resources Analysis - Resources used by multiple packages')

if shared_data:
    shared_df = pd.DataFrame(shared_data)
    
    if not shared_df.empty:
        print(f"\n⚠️ Shared Resources (Potential Bottlenecks):")
        for _, resource in shared_df.iterrows():
            print(f"  • {resource.get('resource_name', 'Unknown')}")
            print(f"    Used by {resource.get('package_count', 0)} packages: {resource.get('sharing_packages', [])}")
            print(f"    Contention Risk: {resource.get('contention_risk', 'Unknown')}")
            print()
        
        # Create a visualization of shared resources
        if 'contention_risk' in shared_df.columns:
            risk_counts = shared_df['contention_risk'].value_counts()
            
            plt.figure(figsize=(10, 6))
            plt.subplot(1, 2, 1)
            risk_counts.plot(kind='pie', autopct='%1.1f%%', title='Shared Resources by Risk Level')
            
            plt.subplot(1, 2, 2)
            # Series.hist() does not accept a title argument; set it on the axes instead
            shared_df['package_count'].hist(bins=10)
            plt.title('Distribution of Package Usage')
            plt.xlabel('Number of Packages Using Resource')
            plt.ylabel('Number of Resources')
            
            plt.tight_layout()
            plt.show()
        
        display(shared_df)
    else:
        print("No shared resources found.")
        print("This means each resource is used by only one package.")
else:
    print("Shared Resources Analysis view is empty or not available.")

### 3.4 Data Lineage Catalog

In [None]:
# Explore Data Lineage Catalog
lineage_data, lineage_time = get_materialized_view_data('data_lineage_catalog',
                                                       'Data Lineage Catalog - Complete data flow mapping')

if lineage_data:
    lineage_df = pd.DataFrame(lineage_data)
    
    if not lineage_df.empty:
        print(f"\n🔄 Data Lineage Summary:")
        print(f"Total lineage relationships: {len(lineage_df)}")
        
        if 'lineage_direction' in lineage_df.columns:
            direction_counts = lineage_df['lineage_direction'].value_counts()
            print(f"\nLineage directions:")
            for direction, count in direction_counts.items():
                print(f"  • {direction}: {count} relationships")
        
        if 'relationship_type' in lineage_df.columns:
            rel_type_counts = lineage_df['relationship_type'].value_counts()
            print(f"\nRelationship types:")
            for rel_type, count in rel_type_counts.items():
                print(f"  • {rel_type}: {count} relationships")
        
        print(f"\n📝 Sample lineage relationships:")
        for _, lineage in lineage_df.head(5).iterrows():
            print(f"  • {lineage.get('source_name', 'Unknown')} --[{lineage.get('relationship_type', 'UNKNOWN')}]--> {lineage.get('target_name', 'Unknown')}")
        
        display(lineage_df.head(10))
    else:
        print("No data lineage relationships found.")
else:
    print("Data Lineage Catalog view is empty or not available.")

### 3.5 Business Rules Catalog

In [None]:
# Explore Business Rules Catalog
rules_data, rules_time = get_materialized_view_data('business_rules_catalog',
                                                   'Business Rules Catalog - Extracted business logic')

if rules_data:
    rules_df = pd.DataFrame(rules_data)
    
    if not rules_df.empty:
        print(f"\n📋 Business Rules Summary:")
        print(f"Operations with business rules: {len(rules_df)}")
        
        total_rules = sum([len(op.get('rules', [])) for _, op in rules_df.iterrows()])
        print(f"Total business rules: {total_rules}")
        
        print(f"\n📝 Sample business rules:")
        for _, operation in rules_df.head(3).iterrows():
            print(f"\n🔧 Operation: {operation.get('operation_name', 'Unknown')}")
            rules = operation.get('rules', [])
            print(f"   Rules count: {len(rules)}")
            
            for rule in rules[:2]:  # Show first 2 rules
                print(f"   • Type: {rule.get('rule_type', 'Unknown')}")
                print(f"     Description: {rule.get('description', 'No description')[:100]}...")
        
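        # Show only the summary columns that are actually present in the view
        summary_cols = [c for c in ('operation_name', 'rule_count') if c in rules_df.columns]
        display(rules_df[summary_cols].head() if summary_cols else rules_df.head())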
    else:
        print("No business rules found.")
        print("This could mean operations don't contain complex business logic or it wasn't extracted.")
else:
    print("Business Rules Catalog view is empty or not available.")

### 3.6 Graph Summary Statistics

In [None]:
# Explore Graph Summary Statistics
stats_data, stats_time = get_materialized_view_data('graph_summary_stats',
                                                   'Graph Summary Statistics - High-level system metrics')

if stats_data:
    stats_df = pd.DataFrame(stats_data)
    
    if not stats_df.empty:
        stats = stats_df.iloc[0]  # Should be just one summary record
        
        print(f"\n📊 Graph Summary Statistics:")
        print(f"Metric: {stats.get('metric_name', 'Unknown')}")
        
        if 'statistics' in stats:
            statistics = stats['statistics']
            print(f"\n📈 System Metrics:")
            for metric, value in statistics.items():
                print(f"  • {metric.replace('_', ' ').title()}: {value}")
        
        display(stats_df)
    else:
        print("No summary statistics found.")
else:
    print("Graph Summary Statistics view is empty or not available.")

### 3.7 Complexity Metrics

In [None]:
# Explore Complexity Metrics
complexity_data, complexity_time = get_materialized_view_data('complexity_metrics',
                                                            'Complexity Metrics - Migration complexity scoring')

if complexity_data:
    complexity_df = pd.DataFrame(complexity_data)
    
    if not complexity_df.empty:
        complexity = complexity_df.iloc[0]  # Should be one complexity record
        
        print(f"\n🎯 Migration Complexity Analysis:")
        print(f"Metric: {complexity.get('metric_name', 'Unknown')}")
        
        metrics = ['package_count', 'operation_count', 'cross_package_dependencies', 
                  'shared_resource_count', 'complexity_score']
        
        for metric in metrics:
            if metric in complexity:
                print(f"  • {metric.replace('_', ' ').title()}: {complexity[metric]}")
        
        # Create a simple complexity visualization
        if 'complexity_score' in complexity:
            score = complexity['complexity_score']
            
            plt.figure(figsize=(8, 4))
            plt.bar(['Complexity Score'], [score], color='skyblue')
            plt.ylabel('Score')
            plt.title(f'SSIS System Complexity Score: {score}')
            
            # Add interpretation
            if score < 5:
                interpretation = "Low Complexity - Easy to migrate"
                color = 'green'
            elif score < 15:
                interpretation = "Medium Complexity - Moderate migration effort"
                color = 'orange'
            else:
                interpretation = "High Complexity - Significant migration effort"
                color = 'red'
            
            plt.text(0, score + 0.1, interpretation, ha='center', color=color, weight='bold')
            plt.tight_layout()
            plt.show()
        
        display(complexity_df)
    else:
        print("No complexity metrics found.")
else:
    print("Complexity Metrics view is empty or not available.")

## Part 4: Performance Comparison

Let's compare the performance of using materialized views vs. computing the same data from scratch.
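
A single timed run is noisy: caching, connection warm-up, and other activity on the machine can all skew one measurement. As a minimal sketch of a steadier comparison, each method can be run several times and the medians compared. The helper names `median_runtime_ms` and `speedup` are hypothetical and not part of this notebook's tooling.

```python
import statistics
import time


def median_runtime_ms(func, repeats=5):
    """Call `func` `repeats` times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


def speedup(baseline_ms, candidate_ms):
    """Return how many times faster the candidate is than the baseline."""
    if candidate_ms <= 0:
        raise ValueError("candidate_ms must be positive")
    return baseline_ms / candidate_ms
```

To use it here, wrap each method in a zero-argument callable, for example `median_runtime_ms(lambda: get_materialized_view_data('sql_operations_catalog'))`, and pass the two medians to `speedup`.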

In [None]:
# Performance comparison: Materialized view vs raw computation
print("⏱️ Performance Comparison: Materialized Views vs Raw Queries")
print("=" * 60)

# Method 1: Using materialized view (fast)
print("\n🚀 Method 1: Using Materialized View")
# Use the query time reported by the helper so both methods measure query + fetch only
view_data, view_time = get_materialized_view_data('sql_operations_catalog')
view_count = len(view_data) if view_data else 0
print(f"Found {view_count} SQL operations in {view_time:.2f}ms")

# Method 2: Computing from scratch (slower)
print("\n🐌 Method 2: Computing from Raw Graph Data")
raw_query = """
    MATCH (n:Node {node_type: 'operation'})
    WHERE n.properties CONTAINS 'sql_transformation'
    RETURN n.id as operation_id,
           n.name as operation_name,
           n.properties as operation_details
"""

raw_df, raw_time = execute_query(raw_query, "Computing SQL operations from scratch", time_it=True)
raw_count = len(raw_df) if not raw_df.empty else 0

# Performance comparison
print(f"\n📊 Performance Results:")
print(f"  • Materialized View: {view_count} results in {view_time:.2f}ms")
print(f"  • Raw Query: {raw_count} results in {raw_time:.2f}ms" if raw_time else "  • Raw Query: Failed or no results")

if raw_time and view_time > 0:
    speedup = raw_time / view_time
    print(f"  • Speedup: {speedup:.1f}x faster with materialized views")
    
    # Visualization
    plt.figure(figsize=(10, 6))
    methods = ['Materialized View', 'Raw Query']
    times = [view_time, raw_time]
    colors = ['green', 'red']
    
    bars = plt.bar(methods, times, color=colors, alpha=0.7)
    plt.ylabel('Execution Time (ms)')
    plt.title('Performance Comparison: Materialized Views vs Raw Queries')
    
    # Add value labels on bars
    for bar, time_val in zip(bars, times):
        plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(times)*0.01, 
                f'{time_val:.1f}ms', ha='center', va='bottom', fontweight='bold')
    
    # With transform=transAxes, x and y are axes fractions in the range 0-1
    plt.text(0.5, 0.8, f'{speedup:.1f}x Faster', ha='center', 
            transform=plt.gca().transAxes, fontsize=16, fontweight='bold', 
            bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.8))
    
    plt.tight_layout()
    plt.show()
else:
    print("  • Could not compare performance - one method failed")

## Part 5: Building a Simple Migration Tool

Let's create a simple example of how a migration engineer would use these materialized views.
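
The assessment in this part sorts the complexity score into three bands, using the thresholds 5 and 15 that appear in this notebook's code. As a small sketch, that banding can be isolated in one function so the report and any plots share the same rule. The name `classify_complexity` and the `"unknown"` band for a missing score are assumptions added for illustration.

```python
def classify_complexity(score):
    """Map a complexity score to a migration-effort band.

    Thresholds follow this notebook: below 5 is low, below 15 is medium,
    anything else is high. A missing score is reported as unknown.
    """
    if score is None:
        return "unknown"
    if score < 5:
        return "low"
    if score < 15:
        return "medium"
    return "high"
```

Returning `"unknown"` for a missing score keeps an unavailable metric from being reported as a high-complexity system.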

In [None]:
# Example: Simple Migration Assessment Tool
class SSISMigrationAssessment:
    """Simple migration assessment using materialized views."""
    
    def __init__(self, mg_connection):
        self.mg = mg_connection
    
    def get_view_data(self, view_name):
        """Get data from a materialized view."""
        query = f"""MATCH (v:Node {{id: 'view:{view_name}'}})
                    RETURN v.properties as data"""
        cursor = self.mg.cursor()
        cursor.execute(query)
        result = cursor.fetchone()
        
        if result and result[0]:
            raw = result[0]
            # properties may already be decoded by the driver
            return json.loads(raw) if isinstance(raw, str) else raw
        return []
    
    def assess_migration_complexity(self):
        """Assess overall migration complexity."""
        print("🎯 SSIS Migration Assessment Report")
        print("=" * 40)
        
        # Get complexity metrics
        complexity_data = self.get_view_data('complexity_metrics')
        if complexity_data:
            complexity = complexity_data[0]
            print(f"\n📊 System Overview:")
            print(f"  • Total Packages: {complexity.get('package_count', 0)}")
            print(f"  • Total Operations: {complexity.get('operation_count', 0)}")
            print(f"  • Cross-Package Dependencies: {complexity.get('cross_package_dependencies', 0)}")
            print(f"  • Shared Resources: {complexity.get('shared_resource_count', 0)}")
            print(f"  • Complexity Score: {complexity.get('complexity_score', 0):.1f}")
        
        # Analyze SQL operations
        sql_ops = self.get_view_data('sql_operations_catalog')
        if sql_ops:
            print(f"\n🔍 SQL Operations Analysis:")
            print(f"  • Total SQL Operations: {len(sql_ops)}")
            
            # Analyze complexity
            complex_ops = [op for op in sql_ops if op.get('complexity_indicators', {}).get('has_joins', False)]
            subquery_ops = [op for op in sql_ops if op.get('complexity_indicators', {}).get('has_subqueries', False)]
            
            print(f"  • Operations with JOINs: {len(complex_ops)}")
            print(f"  • Operations with Subqueries: {len(subquery_ops)}")
        
        # Check shared resources
        shared_resources = self.get_view_data('shared_resources_analysis')
        if shared_resources:
            high_risk = [r for r in shared_resources if r.get('contention_risk') == 'HIGH']
            print(f"\n⚠️ Risk Assessment:")
            print(f"  • Total Shared Resources: {len(shared_resources)}")
            print(f"  • High-Risk Resources: {len(high_risk)}")
            
            if high_risk:
                print(f"  • High-Risk Resources List:")
                for resource in high_risk[:3]:  # Show top 3
                    print(f"    - {resource.get('resource_name', 'Unknown')}: {resource.get('package_count', 0)} packages")
        
        # Migration recommendations
        print(f"\n💡 Migration Recommendations:")
        # Do not report a missing score as high complexity
        score = complexity_data[0].get('complexity_score') if complexity_data else None
        if score is None:
            print(f"  • ❓ Complexity score unavailable - cannot rate migration effort")
        elif score < 5:
            print(f"  • ✅ Low complexity - Good candidate for automated migration")
        elif score < 15:
            print(f"  • ⚠️ Medium complexity - Plan for manual review of complex operations")
        else:
            print(f"  • 🚨 High complexity - Recommend phased migration approach")
        
        if shared_resources and len([r for r in shared_resources if r.get('contention_risk') == 'HIGH']) > 0:
            print(f"  • ⚠️ Address shared resource dependencies before migration")
        
        print(f"  • 📋 Review {len(sql_ops) if sql_ops else 0} SQL operations for cloud compatibility")

# Run the assessment
assessment = SSISMigrationAssessment(mg)
assessment.assess_migration_complexity()

## Part 6: Understanding the Analytics-Ready Advantage

Let's summarize what we've learned about the analytics-ready features.

In [None]:
# Summary of analytics-ready benefits
print("🎉 Analytics-Ready Features Summary")
print("=" * 40)

# Collect timing data from our tests
view_access_times = {
    'SQL Operations': sql_ops_time if 'sql_ops_time' in locals() else None,
    'Dependencies': deps_time if 'deps_time' in locals() else None,
    'Shared Resources': shared_time if 'shared_time' in locals() else None,
    'Data Lineage': lineage_time if 'lineage_time' in locals() else None,
    'Business Rules': rules_time if 'rules_time' in locals() else None,
    'Summary Stats': stats_time if 'stats_time' in locals() else None,
    'Complexity': complexity_time if 'complexity_time' in locals() else None
}

print(f"\n⚡ Performance Results:")
total_time = 0
successful_queries = 0

for view_name, exec_time in view_access_times.items():
    if exec_time is not None:
        print(f"  • {view_name}: {exec_time:.1f}ms")
        total_time += exec_time
        successful_queries += 1
    else:
        print(f"  • {view_name}: Not available")

if successful_queries > 0:
    avg_time = total_time / successful_queries
    print(f"\n📊 Performance Summary:")
    print(f"  • Average access time: {avg_time:.1f}ms per view")
    print(f"  • Total time for all views: {total_time:.1f}ms")
    print(f"  • Successful views accessed: {successful_queries}/{len(view_access_times)}")

print(f"\n🎯 Key Benefits of Analytics-Ready Optimization:")
print(f"  • ✅ Sub-second access to pre-computed analysis")
print(f"  • ✅ No need to rebuild indexes for each application")
print(f"  • ✅ Consistent data structure across tools")
print(f"  • ✅ Ready for migration, compliance, and governance applications")
print(f"  • ✅ Supports multiple downstream tools simultaneously")

print(f"\n📚 What You've Learned:")
print(f"  • How to verify if a graph is analytics-ready")
print(f"  • How to access and use materialized views")
print(f"  • Performance advantages of pre-computed data")
print(f"  • Building simple migration tools with the data")

print(f"\n🚀 Next Steps:")
print(f"  • Open 04_advanced_queries.ipynb for complex analysis patterns")
print(f"  • Open 05_migration_analysis.ipynb for practical migration scenarios")
print(f"  • Build your own migration tools using these materialized views!")

## Key Takeaways

### What Analytics-Ready Means:
1. **Pre-computed Views**: Data is processed once during analysis, not every time you query
2. **Instant Access**: Migration tools get results in milliseconds, not minutes
3. **Consistent Structure**: All applications use the same optimized data format
4. **No Rebuilding**: Unlike search indexes, materialized views persist between sessions

### Available Materialized Views:
- **SQL Operations Catalog**: Ready-to-use SQL transformation analysis
- **Cross-Package Dependencies**: Pre-computed dependency relationships  
- **Shared Resources Analysis**: Bottleneck and risk identification
- **Data Lineage Catalog**: Complete data flow mapping
- **Business Rules Catalog**: Extracted business logic
- **Graph Summary Statistics**: High-level system metrics
- **Complexity Metrics**: Migration effort estimation
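
For reference, the view names queried in this notebook can be collected in one place. This sketch assumes the node id pattern `view:<name>` used by the lookup queries above. The names `VIEW_NAMES` and `view_node_id` are hypothetical.

```python
# Display title -> view name, as used in this notebook's queries
VIEW_NAMES = {
    "SQL Operations Catalog": "sql_operations_catalog",
    "Cross-Package Dependencies": "cross_package_dependencies",
    "Shared Resources Analysis": "shared_resources_analysis",
    "Data Lineage Catalog": "data_lineage_catalog",
    "Business Rules Catalog": "business_rules_catalog",
    "Graph Summary Statistics": "graph_summary_stats",
    "Complexity Metrics": "complexity_metrics",
}


def view_node_id(title):
    """Return the node id for a view title, e.g. 'view:complexity_metrics'."""
    return f"view:{VIEW_NAMES[title]}"
```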

### For Migration Engineers:
This analytics-ready approach means you can build interactive migration tools that:
- Start instantly (no 3-minute wait times)
- Access rich, pre-processed data
- Focus on migration logic, not data processing
- Scale to enterprise-size SSIS environments

The "Week 1 vs Week 2" performance problem is solved - your tools work fast immediately!