# SSIS to Modern Platform Migration Analysis with SQL Semantics

This notebook demonstrates practical migration scenarios using **enhanced SQL semantics metadata** for automated code generation.

## 🎯 Migration Agent Focus
- **Real-world Migration Scenarios**: Product.dtsx with Categories table JOIN
- **Platform Code Generation**: Spark, dbt, Pandas migration examples
- **SQL Semantics Consumption**: How migration agents use JOIN relationships
- **Categories Table Success**: Validate the critical extraction fix

## Migration Platforms Covered
- **Apache Spark (PySpark)**: DataFrame operations with JOIN relationships
- **dbt (Data Build Tool)**: SQL model generation with proper semantics
- **Python/Pandas**: DataFrame merge operations with aliases
- **Migration Assessment**: Effort estimation and complexity analysis

## Prerequisites
- Enhanced SSIS Northwind data with SQL semantics in Memgraph
- Understanding of graph structure from previous notebooks
- Basic knowledge of target migration platforms

In [None]:
# Setup for Migration Analysis
import mgclient
import pandas as pd
import json
import time
from collections import defaultdict  # used by the dependency and lineage cells below
from datetime import datetime

# Connect to Memgraph
mg = mgclient.connect(host='localhost', port=7687, username='', password='')
print("✅ Connected to Memgraph for Migration Analysis")

def execute_query(query, description=None):
    """Execute a Cypher query for migration analysis."""
    if description:
        print(f"\n🔍 {description}")
        print(f"Query: {query}")
        print("-" * 50)
    
    cursor = mg.cursor()
    cursor.execute(query)
    results = cursor.fetchall()
    
    if results:
        columns = [desc.name for desc in cursor.description] if cursor.description else ['result']
        df = pd.DataFrame(results, columns=columns)
        print(f"Found {len(df)} results")
        return df
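    else:
        print("No results found.")
        return pd.DataFrame()

def execute_and_fetch(query, params=None):
    """Run a Cypher query and return the raw rows as a list of tuples."""
    cursor = mg.cursor()
    cursor.execute(query, params or {})
    return cursor.fetchall()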

def generate_migration_code(sql_semantics, platform="spark"):
    """Generate platform-specific migration code from SQL semantics."""
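    if not sql_semantics or not sql_semantics.get('joins'):
        return f"# No JOIN relationships found for {platform} migration"
    if len(sql_semantics.get('tables', [])) < 2:
        return f"# A JOIN needs two tables; fewer were found for {platform} migration"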
    
    join = sql_semantics['joins'][0]  # Use first join
    tables = sql_semantics.get('tables', [])
    columns = sql_semantics.get('columns', [])
    
    if platform == "spark":
        return f"""# Generated Spark Migration Code
# Original SQL: {sql_semantics.get('original_query', 'N/A')[:80]}...

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Load DataFrames
df_{tables[0]['name'].lower()} = spark.table('{tables[0]['name']}').alias('{tables[0]['alias']}')
df_{tables[1]['name'].lower()} = spark.table('{tables[1]['name']}').alias('{tables[1]['alias']}')

# Perform JOIN
result_df = df_{tables[0]['name'].lower()}.join(
    df_{tables[1]['name'].lower()},
    col('{join['condition'].split('=')[0].strip()}') == col('{join['condition'].split('=')[1].strip()}'),
    'inner'
)

# Select columns
result_df = result_df.select({', '.join([f"col('{col['expression']}')" + (f".alias('{col['alias']}')" if col.get('alias') else "") for col in columns[:5]])})

result_df.show()"""
    
    elif platform == "dbt":
        return f"""-- Generated dbt Model
-- Original SQL: {sql_semantics.get('original_query', 'N/A')[:80]}...

SELECT
{', '.join([f"    {col['expression']}" + (f" AS {col['alias']}" if col.get('alias') else "") for col in columns[:5]])}
FROM {{{{ ref('{tables[0]['name'].lower()}') }}}} {tables[0]['alias']}
{join['join_type']} {{{{ ref('{tables[1]['name'].lower()}') }}}} {tables[1]['alias']}
    ON {join['condition']}"""
    
    elif platform == "pandas":
        return f"""# Generated Pandas Migration Code
# Original SQL: {sql_semantics.get('original_query', 'N/A')[:80]}...

import pandas as pd

# Load DataFrames (replace with actual data loading)
df_{tables[0]['name'].lower()} = pd.read_sql("SELECT * FROM {tables[0]['name']}", connection)
df_{tables[1]['name'].lower()} = pd.read_sql("SELECT * FROM {tables[1]['name']}", connection)

# Perform JOIN
result_df = pd.merge(
    df_{tables[0]['name'].lower()},
    df_{tables[1]['name'].lower()},
    left_on='{join['condition'].split('=')[0].strip().split('.')[-1]}',
    right_on='{join['condition'].split('=')[1].strip().split('.')[-1]}',
    how='inner',
    suffixes=('_{tables[0]['alias']}', '_{tables[1]['alias']}')
)

print(result_df.head())"""
    
    return f"# Platform {platform} not implemented"

print("🚀 Migration analysis toolkit ready!")
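
The generators above derive their join keys by splitting `join['condition']` on `=`. A minimal sketch of that step in isolation, assuming a single equality predicate. The helper name `split_join_condition` and the `Products`/`Categories` payload are hypothetical; only the `tables`, `joins`, `join_type`, `condition`, `name`, and `alias` keys come from the code above.

```python
import re


def split_join_condition(condition):
    """Split 'p.CategoryID = c.CategoryID' into qualified and bare key names.

    Assumes one equality predicate. Compound (AND/OR) and non-equi
    conditions are rejected rather than guessed at.
    """
    left, sep, right = condition.partition("=")
    left, right = left.strip(), right.strip()
    is_compound = re.search(r"\b(and|or)\b", condition, re.IGNORECASE)
    if not sep or not left or not right or "=" in right or left[-1] in "<>!" or is_compound:
        raise ValueError(f"Unsupported join condition: {condition!r}")
    return {
        "left_qualified": left,
        "right_qualified": right,
        "left_column": left.split(".")[-1],
        "right_column": right.split(".")[-1],
    }


# Hypothetical payload shaped like the sql_semantics dict consumed above
sample_semantics = {
    "tables": [
        {"name": "Products", "alias": "p"},
        {"name": "Categories", "alias": "c"},
    ],
    "joins": [{"join_type": "INNER JOIN", "condition": "p.CategoryID = c.CategoryID"}],
}

keys = split_join_condition(sample_semantics["joins"][0]["condition"])
print(keys["left_column"], keys["right_column"])
```

Rejecting unsupported conditions up front keeps a generator from silently emitting a wrong merge key for a compound or non-equi join.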

## 1. Migration Complexity Assessment

Assess the overall complexity of migrating the SSIS system to a new platform.

In [None]:
# Get comprehensive complexity metrics using materialized view
complexity_query = """
MATCH (v:Node {id: 'view:complexity_metrics'})
RETURN JSON_EXTRACT(v.properties, '$.data') as complexity_data
"""

complexity_result = execute_and_fetch(complexity_query)

if complexity_result and complexity_result[0][0]:
    complexity_data = json.loads(complexity_result[0][0])
    
    print("📊 Migration Complexity Assessment")
    print("=" * 50)
    
    # Analyze complexity distribution
    complexity_levels = {'LOW': 0, 'MEDIUM': 0, 'HIGH': 0, 'CRITICAL': 0}
    total_effort_days = 0
    high_risk_packages = []
    
    for item in complexity_data:
        level = item.get('complexity_level', 'UNKNOWN')
        if level in complexity_levels:
            complexity_levels[level] += 1
        
        # Estimate effort in days
        effort_days = item.get('estimated_effort_days', 0)
        total_effort_days += effort_days
        
        if level in ['HIGH', 'CRITICAL']:
            high_risk_packages.append({
                'name': item.get('package_name', 'Unknown'),
                'level': level,
                'effort': effort_days,
                'reasons': item.get('complexity_reasons', [])
            })
    
    total_packages = sum(complexity_levels.values())
    
    print(f"📦 Total Packages: {total_packages}")
    print(f"⏱️  Estimated Total Effort: {total_effort_days} days ({total_effort_days/5:.1f} weeks)")
    print(f"💰 Estimated Cost (at $800/day): ${total_effort_days * 800:,}")
    
    print("\n📈 Complexity Distribution:")
    for level, count in complexity_levels.items():
        percentage = (count / total_packages * 100) if total_packages > 0 else 0
        print(f"  {level:8}: {count:2} packages ({percentage:4.1f}%)")
    
    # Risk assessment
    high_risk_ratio = (
        (complexity_levels['HIGH'] + complexity_levels['CRITICAL']) / total_packages
        if total_packages > 0 else 0  # avoid dividing by zero when no package has a known level
    )
    if high_risk_ratio > 0.3:
        risk_level = "🔴 HIGH RISK"
        recommendation = "Consider phased migration approach"
    elif high_risk_ratio > 0.15:
        risk_level = "🟡 MEDIUM RISK"
        recommendation = "Plan additional testing and validation"
    else:
        risk_level = "🟢 LOW RISK"
        recommendation = "Suitable for standard migration approach"
    
    print(f"\n🎯 Migration Risk Level: {risk_level}")
    print(f"💡 Recommendation: {recommendation}")
    
    # Show high-risk packages
    if high_risk_packages:
        print(f"\n⚠️ High-Risk Packages ({len(high_risk_packages)}):")
        for pkg in high_risk_packages[:5]:  # Show top 5
            print(f"  • {pkg['name']} ({pkg['level']}) - {pkg['effort']} days")
            if pkg['reasons']:
                print(f"    Reasons: {', '.join(pkg['reasons'][:2])}")
        
        if len(high_risk_packages) > 5:
            print(f"    ... and {len(high_risk_packages) - 5} more")

else:
    print("⚠️ No complexity metrics available. Run full analysis first.")
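
The risk thresholds in the cell above (more than 30% high-risk packages is HIGH, more than 15% is MEDIUM) can be isolated into a small function that is easy to test. This is a sketch under that reading of the cell; the function name `classify_migration_risk` and the sample counts are hypothetical.

```python
def classify_migration_risk(complexity_levels):
    """Return (label, ratio) from a {'LOW': n, 'MEDIUM': n, ...} count dict.

    Thresholds mirror the cell above: > 0.30 is HIGH, > 0.15 is MEDIUM.
    An empty distribution yields 'UNKNOWN' instead of dividing by zero.
    """
    total = sum(complexity_levels.values())
    if total == 0:
        return "UNKNOWN", 0.0

    high = complexity_levels.get("HIGH", 0) + complexity_levels.get("CRITICAL", 0)
    ratio = high / total

    if ratio > 0.3:
        label = "HIGH"
    elif ratio > 0.15:
        label = "MEDIUM"
    else:
        label = "LOW"
    return label, ratio


# Hypothetical distribution: 1 of 5 packages is HIGH, so the ratio is 0.2
label, ratio = classify_migration_risk({"LOW": 4, "MEDIUM": 0, "HIGH": 1, "CRITICAL": 0})
print(label, f"{ratio:.0%}")
```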

## 2. Execution Sequence Planning

Determine the optimal order for migrating packages based on dependencies.

In [None]:
# Get cross-package dependencies for execution planning
dependencies_query = """
MATCH (v:Node {id: 'view:cross_package_dependencies'})
RETURN JSON_EXTRACT(v.properties, '$.data') as dependencies_data
"""

deps_result = execute_and_fetch(dependencies_query)

if deps_result and deps_result[0][0]:
    dependencies_data = json.loads(deps_result[0][0])
    
    print("🔗 Migration Execution Sequence Planning")
    print("=" * 50)
    
    # Build dependency graph
    package_deps = defaultdict(set)  # packages that this package depends on
    package_dependents = defaultdict(set)  # packages that depend on this package
    all_packages = set()
    
    for dep in dependencies_data:
        source = dep.get('source_package', '')
        target = dep.get('target_package', '')
        
        if source and target:
            package_deps[source].add(target)  # source depends on target
            package_dependents[target].add(source)  # target is depended on by source
            all_packages.add(source)
            all_packages.add(target)
    
    # Get all packages (including those without dependencies)
    all_packages_query = """
    MATCH (p:Node {node_type: 'PIPELINE'})
    RETURN p.name as package_name
    """
    all_pkg_results = execute_and_fetch(all_packages_query)
    for pkg_result in all_pkg_results:
        all_packages.add(pkg_result[0])
    
    # Perform topological sort to determine execution order
    def topological_sort(packages, dependencies):
        """Returns packages in dependency order (independent first)"""
        result = []
        remaining = set(packages)
        
        while remaining:
            # Find packages with no unresolved dependencies
            ready = [pkg for pkg in remaining 
                    if not (dependencies.get(pkg, set()) & remaining)]
            
            if not ready:
                # Circular dependency - break it by taking package with fewest deps
                ready = [min(remaining, key=lambda p: len(dependencies.get(p, set())))]
                print(f"  ⚠️ Breaking circular dependency with {ready[0]}")
            
            result.extend(ready)
            remaining -= set(ready)
        
        return result
    
    execution_order = topological_sort(all_packages, package_deps)
    
    # Group into migration waves
    wave_size = max(3, len(execution_order) // 5)  # at least 3 packages per wave; about 5 waves for larger systems
    waves = []
    
    for i in range(0, len(execution_order), wave_size):
        wave = execution_order[i:i + wave_size]
        waves.append(wave)
    
    print(f"📋 Recommended Migration Sequence ({len(waves)} waves):")
    
    total_estimated_time = 0
    
    for wave_num, wave_packages in enumerate(waves, 1):
        print(f"\n  Wave {wave_num}: {len(wave_packages)} packages")
        
        wave_complexity = {'LOW': 0, 'MEDIUM': 0, 'HIGH': 0, 'CRITICAL': 0}
        wave_effort = 0
        
        for pkg in wave_packages:
            # Find complexity info for this package
            pkg_complexity = 'MEDIUM'  # Default
            pkg_effort = 2  # Default days
            
            if complexity_result and complexity_result[0][0]:
                for item in complexity_data:
                    if item.get('package_name') == pkg:
                        pkg_complexity = item.get('complexity_level', 'MEDIUM')
                        pkg_effort = item.get('estimated_effort_days', 2)
                        break
            
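            # Tolerate levels outside the four known buckets (e.g. 'UNKNOWN')
            wave_complexity[pkg_complexity] = wave_complexity.get(pkg_complexity, 0) + 1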
            wave_effort += pkg_effort
            
            # Show dependencies
            deps = package_deps.get(pkg, set())
            deps_str = f" (depends on: {', '.join(list(deps)[:2])})" if deps else ""
            
            print(f"    • {pkg} ({pkg_complexity}, {pkg_effort}d){deps_str}")
        
        total_estimated_time += wave_effort
        
        print(f"    Wave {wave_num} Total: {wave_effort} days")
        complexity_summary = ', '.join([f"{k}: {v}" for k, v in wave_complexity.items() if v > 0])
        print(f"    Complexity: {complexity_summary}")
    
    print(f"\n⏱️ Total Estimated Migration Time: {total_estimated_time} days ({total_estimated_time/5:.1f} weeks)")
    print(f"🔄 Recommended Approach: Execute waves sequentially with 1-2 day buffer between waves")

else:
    print("⚠️ No dependency data available. Run full analysis first.")
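
The cell above slices the sorted order into fixed-size waves. An alternative worth considering is to let the dependency levels define the waves, so that every package in a wave depends only on packages from earlier waves. A minimal sketch using the standard library's `graphlib` (Python 3.9+); the function name `plan_dependency_waves` and the package names are hypothetical, and unlike the cell above it raises on cycles instead of breaking them.

```python
from graphlib import CycleError, TopologicalSorter


def plan_dependency_waves(package_deps):
    """Group packages into waves by dependency level.

    package_deps maps each package to the set of packages it depends on,
    the same shape as `package_deps` in the cell above.
    """
    sorter = TopologicalSorter(package_deps)
    sorter.prepare()  # raises CycleError if the graph contains a cycle

    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical dependencies between four packages
deps = {
    "Orders.dtsx": {"Customers.dtsx", "Product.dtsx"},
    "Product.dtsx": {"Categories.dtsx"},
}

waves = plan_dependency_waves(deps)
for number, wave in enumerate(waves, 1):
    print(f"Wave {number}: {', '.join(wave)}")
```

Packages inside one wave have no dependencies on each other, so they could be migrated in parallel.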

## 3. Risk Analysis and Mitigation

Identify potential risks and suggest mitigation strategies.

In [None]:
# Comprehensive risk analysis
def analyze_migration_risks():
    risks = []
    
    # 1. Shared Resource Contention Risk
    shared_resources_query = """
    MATCH (v:Node {id: 'view:shared_resources_analysis'})
    RETURN JSON_EXTRACT(v.properties, '$.data') as shared_resources
    """
    
    shared_result = execute_and_fetch(shared_resources_query)
    if shared_result and shared_result[0][0]:
        shared_data = json.loads(shared_result[0][0])
        high_contention = [r for r in shared_data if r.get('contention_risk') == 'HIGH']
        
        if high_contention:
            risks.append({
                'type': 'Resource Contention',
                'severity': 'HIGH',
                'description': f'{len(high_contention)} shared resources with high contention risk',
                'impact': 'Migration failures, data inconsistency, performance issues',
                'mitigation': 'Migrate packages sharing resources in the same wave, implement resource pooling',
                'details': [r['resource_name'] for r in high_contention[:3]]
            })
    
    # 2. Complex SQL Operations Risk
    sql_complexity_query = """
    MATCH (v:Node {id: 'view:sql_operations_catalog'})
    RETURN JSON_EXTRACT(v.properties, '$.data') as sql_operations
    """
    
    sql_result = execute_and_fetch(sql_complexity_query)
    if sql_result and sql_result[0][0]:
        sql_data = json.loads(sql_result[0][0])
        complex_sql = [op for op in sql_data 
                      if op.get('complexity_indicators', {}).get('has_subqueries') 
                      or (op.get('complexity_indicators', {}).get('table_count', 0) > 5)]
        
        if complex_sql:
            risks.append({
                'type': 'SQL Complexity',
                'severity': 'MEDIUM',
                'description': f'{len(complex_sql)} operations with complex SQL patterns',
                'impact': 'Longer migration time, potential for logic errors in translation',
                'mitigation': 'Manual review of complex SQL, automated testing, gradual rollout',
                'details': [op['operation_name'] for op in complex_sql[:3]]
            })
    
    # 3. Circular Dependencies Risk
    if deps_result and deps_result[0][0]:
        # Collect each mutually dependent pair once, regardless of edge direction
        edges = {(d.get('source_package'), d.get('target_package')) for d in dependencies_data}
        mutual_pairs = {
            frozenset((source, target))
            for source, target in edges
            if source and target and (target, source) in edges
        }
        circular_deps = len(mutual_pairs)
        
        if circular_deps > 0:
            risks.append({
                'type': 'Circular Dependencies',
                'severity': 'HIGH',
                'description': f'{circular_deps} potential circular dependencies detected',
                'impact': 'Cannot determine safe migration order, rollback complications',
                'mitigation': 'Break circular dependencies by refactoring, implement staged rollouts',
                'details': ['Requires detailed dependency analysis']
            })
    
    # 4. Data Volume Risk (estimated)
    data_assets_query = """
    MATCH (da:Node {node_type: 'DATA_ASSET'})
    WHERE da.asset_type = 'Table'
    RETURN count(da) as table_count
    """
    
    data_result = execute_and_fetch(data_assets_query)
    if data_result and data_result[0][0] > 50:
        risks.append({
            'type': 'Data Volume',
            'severity': 'MEDIUM',
            'description': f'{data_result[0][0]} tables involved in migration',
            'impact': 'Extended migration time, storage requirements, data validation complexity',
            'mitigation': 'Implement incremental data migration, validate data integrity at each step',
            'details': ['Consider data archiving strategies']
        })
    
    return risks

risks = analyze_migration_risks()

print("⚠️ Migration Risk Analysis")
print("=" * 50)

if not risks:
    print("✅ No significant risks identified. Migration should proceed smoothly.")
else:
    print(f"🎯 Identified {len(risks)} risk categories:")
    
    # Sort by severity
    severity_order = {'CRITICAL': 4, 'HIGH': 3, 'MEDIUM': 2, 'LOW': 1}
    risks.sort(key=lambda r: severity_order.get(r['severity'], 0), reverse=True)
    
    for i, risk in enumerate(risks, 1):
        severity_icon = {'CRITICAL': '🔴', 'HIGH': '🟠', 'MEDIUM': '🟡', 'LOW': '🟢'}.get(risk['severity'], '⚪')
        
        print(f"\n{i}. {risk['type']} {severity_icon} {risk['severity']}")
        print(f"   Description: {risk['description']}")
        print(f"   Impact: {risk['impact']}")
        print(f"   Mitigation: {risk['mitigation']}")
        
        if risk['details']:
            print(f"   Examples: {', '.join(risk['details'])}")
    
    # Overall risk assessment
    high_risks = sum(1 for r in risks if r['severity'] in ['CRITICAL', 'HIGH'])
    if high_risks > 2:
        overall_risk = "🔴 HIGH RISK - Consider postponing until risks are mitigated"
    elif high_risks > 0:
        overall_risk = "🟡 MEDIUM RISK - Proceed with caution and additional planning"
    else:
        overall_risk = "🟢 LOW RISK - Safe to proceed with standard precautions"
    
    print(f"\n🎯 Overall Migration Risk: {overall_risk}")
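
The circular-dependency check above only looks for a direct reverse edge, so it finds two-package cycles but not longer ones such as A → B → C → A. A depth-first search covers the general case. The sketch below is one way to do it; the function name `find_cycle` and the sample graph are hypothetical, and the recursion assumes a dependency graph small enough to stay within Python's recursion limit.

```python
def find_cycle(package_deps):
    """Return one dependency cycle as a list of packages, or None if acyclic.

    package_deps maps each package to the set of packages it depends on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in sorted(package_deps.get(node, ())):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the path closes a cycle
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(package_deps):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical three-package cycle that a pairwise check would miss
cycle = find_cycle({"A": {"B"}, "B": {"C"}, "C": {"A"}})
print(" -> ".join(cycle) if cycle else "No cycle found")
```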

## 4. Target Platform Analysis

Analyze compatibility with different target platforms (Azure Data Factory, AWS Glue, etc.).

In [None]:
# Analyze SSIS features and their compatibility with target platforms
def analyze_platform_compatibility():
    # Get operation types from the graph
    operations_query = """
    MATCH (op:Node {node_type: 'OPERATION'})
    RETURN DISTINCT op.operation_type as operation_type, count(*) as count
    ORDER BY count DESC
    """
    
    operations = execute_and_fetch(operations_query)
    
    # Platform compatibility matrix
    compatibility_matrix = {
        'Azure Data Factory': {
            'Data Flow Task': {'compatible': True, 'effort': 'LOW', 'notes': 'Native equivalent available'},
            'Execute SQL Task': {'compatible': True, 'effort': 'LOW', 'notes': 'Direct mapping to SQL activities'},
            'Script Task': {'compatible': False, 'effort': 'HIGH', 'notes': 'Requires custom Azure Functions'},
            'OLEDB Source': {'compatible': True, 'effort': 'LOW', 'notes': 'Maps to dataset connectors'},
            'OLEDB Destination': {'compatible': True, 'effort': 'LOW', 'notes': 'Maps to dataset connectors'},
            'Lookup Transformation': {'compatible': True, 'effort': 'MEDIUM', 'notes': 'Requires data flow configuration'},
            'Conditional Split': {'compatible': True, 'effort': 'LOW', 'notes': 'Native conditional split available'},
            'File System Task': {'compatible': True, 'effort': 'MEDIUM', 'notes': 'Use Azure Storage activities'},
        },
        'AWS Glue': {
            'Data Flow Task': {'compatible': True, 'effort': 'MEDIUM', 'notes': 'Convert to Glue ETL jobs'},
            'Execute SQL Task': {'compatible': True, 'effort': 'LOW', 'notes': 'Use Glue connections to databases'},
            'Script Task': {'compatible': True, 'effort': 'MEDIUM', 'notes': 'Convert to Python/Scala in Glue'},
            'OLEDB Source': {'compatible': True, 'effort': 'MEDIUM', 'notes': 'Configure Glue connections'},
            'OLEDB Destination': {'compatible': True, 'effort': 'MEDIUM', 'notes': 'Configure Glue connections'},
            'Lookup Transformation': {'compatible': True, 'effort': 'HIGH', 'notes': 'Implement as joins in Spark'},
            'Conditional Split': {'compatible': True, 'effort': 'MEDIUM', 'notes': 'Implement with Spark conditions'},
            'File System Task': {'compatible': True, 'effort': 'LOW', 'notes': 'Native S3 operations'},
        },
        'Apache Airflow': {
            'Data Flow Task': {'compatible': True, 'effort': 'HIGH', 'notes': 'Requires custom operators'},
            'Execute SQL Task': {'compatible': True, 'effort': 'LOW', 'notes': 'Native SQL operators available'},
            'Script Task': {'compatible': True, 'effort': 'LOW', 'notes': 'Python/Bash operators available'},
            'OLEDB Source': {'compatible': True, 'effort': 'MEDIUM', 'notes': 'Use database hooks'},
            'OLEDB Destination': {'compatible': True, 'effort': 'MEDIUM', 'notes': 'Use database hooks'},
            'Lookup Transformation': {'compatible': True, 'effort': 'HIGH', 'notes': 'Custom implementation required'},
            'Conditional Split': {'compatible': True, 'effort': 'MEDIUM', 'notes': 'Use branching operators'},
            'File System Task': {'compatible': True, 'effort': 'LOW', 'notes': 'Native file system operators'},
        }
    }
    
    print("🎯 Target Platform Compatibility Analysis")
    print("=" * 50)
    
    print(f"📊 SSIS Operations Found in Northwind:")
    for op_type, count in operations:
        print(f"  • {op_type}: {count} instances")
    
    # Analyze each platform
    for platform_name, platform_compat in compatibility_matrix.items():
        print(f"\n🏢 {platform_name} Compatibility:")
        
        total_operations = sum(count for _, count in operations)
        compatible_operations = 0
        total_effort_score = 0
        effort_scores = {'LOW': 1, 'MEDIUM': 2, 'HIGH': 3}
        
        for op_type, count in operations:
            if op_type in platform_compat:
                compat_info = platform_compat[op_type]
                compatible = compat_info['compatible']
                effort = compat_info['effort']
                notes = compat_info['notes']
                
                status_icon = "✅" if compatible else "❌"
                effort_icon = {"LOW": "🟢", "MEDIUM": "🟡", "HIGH": "🔴"}.get(effort, "⚪")
                
                print(f"  {status_icon} {op_type} ({count}x) - {effort_icon} {effort}")
                print(f"     {notes}")
                
                if compatible:
                    compatible_operations += count
                    total_effort_score += count * effort_scores.get(effort, 2)
            else:
                print(f"  ❓ {op_type} ({count}x) - ⚪ UNKNOWN")
                print(f"     Requires investigation")
        
        # Calculate platform score
        compatibility_ratio = compatible_operations / total_operations if total_operations > 0 else 0
        avg_effort = total_effort_score / compatible_operations if compatible_operations > 0 else 3
        
        # Overall platform rating
        if compatibility_ratio > 0.9 and avg_effort < 1.5:
            rating = "🟢 EXCELLENT"
        elif compatibility_ratio > 0.8 and avg_effort < 2:
            rating = "🟡 GOOD"
        elif compatibility_ratio > 0.6:
            rating = "🟠 FAIR"
        else:
            rating = "🔴 CHALLENGING"
        
        print(f"  \n  📈 Platform Rating: {rating}")
        print(f"     Compatibility: {compatibility_ratio*100:.1f}% of operations")
        print(f"     Average Effort: {avg_effort:.1f}/3")
    
    return operations, compatibility_matrix

operations, compat_matrix = analyze_platform_compatibility()

# Generate recommendation
print("\n💡 Platform Recommendation:")
print("Based on the SSIS Northwind analysis:")
print("  1. Azure Data Factory: Best for organizations already on Azure")
print("  2. AWS Glue: Good for AWS-native organizations, handles most patterns")
print("  3. Apache Airflow: Most flexible but requires more custom development")
print("\nConsider hybrid approaches for complex transformations.")
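
The rating logic inside `analyze_platform_compatibility` can be separated from the printing so the scoring is testable on its own. A sketch that follows the same thresholds and effort weights as the cell above; the function name `score_platform` and the operation counts are hypothetical, and the three compatibility entries are copied from the Azure Data Factory matrix.

```python
EFFORT_SCORES = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}


def score_platform(operations, platform_compat):
    """Return (compatibility_ratio, avg_effort, rating) for one platform.

    operations is a list of (operation_type, count) pairs; platform_compat
    maps operation types to {'compatible': bool, 'effort': str, ...}.
    """
    total = sum(count for _, count in operations)
    compatible = 0
    effort_total = 0

    for op_type, count in operations:
        info = platform_compat.get(op_type)
        if info and info["compatible"]:
            compatible += count
            effort_total += count * EFFORT_SCORES.get(info["effort"], 2)

    ratio = compatible / total if total else 0.0
    avg_effort = effort_total / compatible if compatible else 3.0

    if ratio > 0.9 and avg_effort < 1.5:
        rating = "EXCELLENT"
    elif ratio > 0.8 and avg_effort < 2:
        rating = "GOOD"
    elif ratio > 0.6:
        rating = "FAIR"
    else:
        rating = "CHALLENGING"
    return ratio, avg_effort, rating


# Hypothetical operation counts scored against a subset of the ADF matrix
adf_subset = {
    "Data Flow Task": {"compatible": True, "effort": "LOW"},
    "Execute SQL Task": {"compatible": True, "effort": "LOW"},
    "Script Task": {"compatible": False, "effort": "HIGH"},
}
sample_operations = [("Data Flow Task", 6), ("Execute SQL Task", 3), ("Script Task", 1)]

ratio, avg_effort, rating = score_platform(sample_operations, adf_subset)
print(f"{ratio:.0%} compatible, average effort {avg_effort:.1f}/3, rating {rating}")
```

Note that the thresholds are strict inequalities, so a platform at exactly 90% compatibility falls into the second band rather than the first.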

## 5. Data Lineage Impact Assessment

Analyze how changes to one component affect the entire system.

In [None]:
# Data lineage impact analysis for migration planning
def analyze_lineage_impact(source_table=None):
    """Analyze the impact of migrating a specific data source"""
    
    if not source_table:
        # Find the most connected table as an example
        connectivity_query = """
        MATCH (da:Node {node_type: 'DATA_ASSET'})
        WHERE da.asset_type = 'Table'
        WITH da, 
             size([(da)<-[:READS_FROM]-() | 1]) as readers,
             size([(da)<-[:WRITES_TO]-() | 1]) as writers
        RETURN da.name as table_name, (readers + writers) as connections
        ORDER BY connections DESC
        LIMIT 1
        """
        
        result = execute_and_fetch(connectivity_query)
        if result:
            source_table = result[0][0]
        else:
            source_table = "Orders"  # Fallback
    
    print(f"🔍 Lineage Impact Analysis: {source_table}")
    print("=" * 50)
    
    # Find all downstream impacts
    downstream_query = """
    MATCH (source:Node {node_type: 'DATA_ASSET'})
    WHERE source.name CONTAINS $table_name
    MATCH path = (source)-[*1..4]->(downstream:Node)
    WHERE downstream.node_type IN ['DATA_ASSET', 'OPERATION', 'PIPELINE']
    RETURN DISTINCT
        downstream.node_type as node_type,
        downstream.name as name,
        length(path) as distance,
        [r in relationships(path) | type(r)] as relationship_path
    ORDER BY distance, node_type, name
    """
    
    downstream_results = execute_and_fetch(downstream_query, {"table_name": source_table})
    
    if not downstream_results:
        print(f"❌ No lineage found for table containing '{source_table}'")
        return
    
    # Group by distance (migration waves)
    impact_by_distance = defaultdict(list)
    for result in downstream_results:
        node_type, name, distance, rel_path = result
        impact_by_distance[distance].append({
            'type': node_type,
            'name': name,
            'path': rel_path
        })
    
    print(f"📊 Impact Analysis for '{source_table}':")
    print(f"   Total Affected Components: {len(downstream_results)}")
    
    total_operations = 0
    total_assets = 0
    total_packages = 0
    
    for distance in sorted(impact_by_distance.keys()):
        components = impact_by_distance[distance]
        
        operations = [c for c in components if c['type'] == 'OPERATION']
        assets = [c for c in components if c['type'] == 'DATA_ASSET']
        packages = [c for c in components if c['type'] == 'PIPELINE']
        
        total_operations += len(operations)
        total_assets += len(assets)
        total_packages += len(packages)
        
        print(f"\n  📍 Distance {distance} ({len(components)} components):")
        
        if packages:
            print(f"     📦 Packages ({len(packages)}): {', '.join([p['name'] for p in packages])}")
        
        if operations:
            print(f"     ⚙️  Operations ({len(operations)}): {', '.join([o['name'][:30] for o in operations[:3]])}{'...' if len(operations) > 3 else ''}")
        
        if assets:
            print(f"     🗃️  Data Assets ({len(assets)}): {', '.join([a['name'] for a in assets[:3]])}{'...' if len(assets) > 3 else ''}")
    
    # Risk assessment
    print(f"\n⚠️ Migration Impact Assessment:")
    
    if total_packages > 5:
        risk_level = "🔴 HIGH"
        recommendation = "Requires careful coordination across multiple packages"
    elif total_packages > 2:
        risk_level = "🟡 MEDIUM"
        recommendation = "Plan migration with affected packages"
    else:
        risk_level = "🟢 LOW"
        recommendation = "Can be migrated with minimal coordination"
    
    print(f"   Risk Level: {risk_level}")
    print(f"   Affected Packages: {total_packages}")
    print(f"   Affected Operations: {total_operations}")
    print(f"   Affected Data Assets: {total_assets}")
    print(f"   Recommendation: {recommendation}")
    
    return {
        'source_table': source_table,
        'total_impact': len(downstream_results),
        'packages': total_packages,
        'operations': total_operations,
        'assets': total_assets,
        'risk_level': risk_level,
        'max_distance': max(impact_by_distance.keys()) if impact_by_distance else 0
    }

# Analyze impact for a key table
impact_analysis = analyze_lineage_impact()

# Also analyze a few more tables to understand the overall system
print("\n" + "="*70)
print("📈 System-Wide Impact Summary")

key_tables = ['Customer', 'Product', 'Order']
impact_summary = []

for table in key_tables:
    try:
        impact = analyze_lineage_impact(table)
        if impact:
            impact_summary.append(impact)
        print("\n" + "-"*50)
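    except Exception as exc:
        # Report the failure instead of silently skipping the table
        print(f"⚠️ Skipping '{table}': {exc}")
        continue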

if impact_summary:
    print("\n🎯 Migration Planning Insights:")
    avg_impact = sum(i['total_impact'] for i in impact_summary) / len(impact_summary)
    max_depth = max(i['max_distance'] for i in impact_summary)
    
    print(f"   Average Impact per Table: {avg_impact:.1f} components")
    print(f"   Maximum Propagation Depth: {max_depth} steps")
    print(f"   High-Risk Tables: {sum(1 for i in impact_summary if 'HIGH' in i['risk_level'])}")
    
    if avg_impact > 20:
        print("   ⚠️ System has high interconnectedness - plan carefully")
    elif avg_impact > 10:
        print("   💡 Moderate interconnectedness - standard migration approach")
    else:
        print("   ✅ Low interconnectedness - migration should be straightforward")
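
The downstream query above walks up to four hops from the source table and groups the results by distance. The same idea can be sketched in memory with a breadth-first search, which is handy for unit-testing the grouping logic without a database. The function name `downstream_by_distance` and the edge list are hypothetical. One difference to keep in mind: this version records only the shortest distance to each node, whereas the Cypher query can return the same node at several path lengths.

```python
from collections import defaultdict, deque


def downstream_by_distance(edges, source, max_depth=4):
    """Group nodes reachable from `source` by shortest hop count (1..max_depth)."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)

    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distances[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in adjacency[node]:
            if nxt not in distances:
                distances[nxt] = distances[node] + 1
                queue.append(nxt)

    grouped = defaultdict(list)
    for node, dist in distances.items():
        if dist > 0:
            grouped[dist].append(node)
    return {dist: sorted(nodes) for dist, nodes in sorted(grouped.items())}


# Hypothetical lineage chain; the last node sits five hops away
sample_edges = [
    ("Products", "Load Products"),
    ("Load Products", "ProductsStaging"),
    ("ProductsStaging", "Build Product Dim"),
    ("Build Product Dim", "DimProduct"),
    ("DimProduct", "Sales Report"),
]

impact = downstream_by_distance(sample_edges, "Products")
for distance, names in impact.items():
    print(f"Distance {distance}: {', '.join(names)}")
```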

## 6. Migration Report Generation

Generate a comprehensive migration report combining all analyses.

In [None]:
# Generate comprehensive migration report
def generate_migration_report():
    report = {
        'generated_at': datetime.now().isoformat(),
        'system_name': 'SSIS Northwind',
        'analysis_type': 'Migration Readiness Assessment'
    }
    
    # Executive Summary
    print("📋 SSIS NORTHWIND MIGRATION REPORT")
    print("=" * 60)
    print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"System: SSIS Northwind")
    print(f"Analysis Type: Migration Readiness Assessment")
    
    # System Overview
    overview_query = """
    MATCH (p:Node {node_type: 'PIPELINE'}) 
    WITH count(p) as packages
    MATCH (op:Node {node_type: 'OPERATION'})
    WITH packages, count(op) as operations
    MATCH (da:Node {node_type: 'DATA_ASSET'})
    WITH packages, operations, count(da) as data_assets
    MATCH (conn:Node {node_type: 'CONNECTION'})
    RETURN packages, operations, data_assets, count(conn) as connections
    """
    
    overview = execute_and_fetch(overview_query)
    if overview:
        packages, operations, data_assets, connections = overview[0]
        
        print(f"\n📊 SYSTEM OVERVIEW")
        print(f"   SSIS Packages: {packages}")
        print(f"   Operations: {operations}")
        print(f"   Data Assets: {data_assets}")
        print(f"   Connections: {connections}")
        
        report['system_overview'] = {
            'packages': packages,
            'operations': operations,
            'data_assets': data_assets,
            'connections': connections
        }
    
    # Complexity Assessment
    if complexity_result and complexity_result[0][0]:
        complexity_data = json.loads(complexity_result[0][0])
        
        complexity_distribution = {'LOW': 0, 'MEDIUM': 0, 'HIGH': 0, 'CRITICAL': 0}
        total_effort = 0
        
        for item in complexity_data:
            level = item.get('complexity_level', 'MEDIUM')
            if level in complexity_distribution:
                complexity_distribution[level] += 1
            total_effort += item.get('estimated_effort_days', 2)
        
        print(f"\n📈 COMPLEXITY ASSESSMENT")
        for level, count in complexity_distribution.items():
            percentage = (count / len(complexity_data) * 100) if complexity_data else 0
            print(f"   {level}: {count} packages ({percentage:.1f}%)")
        
        # Cost uses a flat rate of $800 per effort day
        daily_rate = 800
        estimated_cost = total_effort * daily_rate
        
        print(f"   Total Estimated Effort: {total_effort} days")
        print(f"   Estimated Cost: ${estimated_cost:,}")
        
        report['complexity_assessment'] = {
            'distribution': complexity_distribution,
            'total_effort_days': total_effort,
            'estimated_cost': estimated_cost
        }
    
    # Risk Summary
    risks = analyze_migration_risks()
    high_risks = [r for r in risks if r['severity'] in ['CRITICAL', 'HIGH']]
    
    print(f"\n⚠️ RISK ASSESSMENT")
    print(f"   Total Risks Identified: {len(risks)}")
    print(f"   High/Critical Risks: {len(high_risks)}")
    
    if high_risks:
        print(f"   Key Risk Areas:")
        for risk in high_risks[:3]:
            print(f"     • {risk['type']} ({risk['severity']})")
    
    report['risk_assessment'] = {
        'total_risks': len(risks),
        'high_risks': len(high_risks),
        'risk_details': [{'type': r['type'], 'severity': r['severity'], 'description': r['description']} for r in risks]
    }
    
    # Platform Recommendations
    print(f"\n🎯 PLATFORM RECOMMENDATIONS")
    print(f"   Recommended Approach: Cloud-native ETL platform")
    print(f"   Top Choices:")
    print(f"     1. Azure Data Factory (if Azure-based)")
    print(f"     2. AWS Glue (if AWS-based)")
    print(f"     3. Apache Airflow (platform-agnostic)")
    
    # Timeline Estimate
    if 'complexity_assessment' in report:
        effort_days = report['complexity_assessment']['total_effort_days']
        weeks = effort_days / 5
        
        # Add buffer based on risk level
        risk_buffer = 1.2 if len(high_risks) > 0 else 1.1
        estimated_weeks = weeks * risk_buffer
        
        print(f"\n⏱️ TIMELINE ESTIMATE")
        print(f"   Base Effort: {effort_days} days ({weeks:.1f} weeks)")
        print(f"   With Risk Buffer: {estimated_weeks:.1f} weeks")
        print(f"   Recommended Timeline: {estimated_weeks + 2:.1f} weeks (including testing)")
        
        report['timeline_estimate'] = {
            'base_days': effort_days,
            'base_weeks': weeks,
            'with_buffer_weeks': estimated_weeks,
            'recommended_weeks': estimated_weeks + 2
        }
    
    # Success Criteria
    print(f"\n✅ SUCCESS CRITERIA")
    print(f"   □ All SSIS packages successfully migrated")
    print(f"   □ Data integrity validated across all tables")
    print(f"   □ Performance benchmarks met or exceeded")
    print(f"   □ Zero data loss during migration")
    print(f"   □ All business rules preserved")
    print(f"   □ User acceptance testing passed")
    print(f"   □ Rollback procedures tested and documented")
    
    # Next Steps
    print(f"\n🚀 RECOMMENDED NEXT STEPS")
    print(f"   1. Stakeholder review and approval of migration plan")
    print(f"   2. Detailed technical design for target platform")
    print(f"   3. Setup development and testing environments")
    print(f"   4. Begin with low-complexity packages for proof of concept")
    print(f"   5. Develop automated testing and validation procedures")
    print(f"   6. Create detailed rollback and recovery procedures")
    
    # Overall Assessment
    overall_score = 85  # Base score
    if len(high_risks) > 2:
        overall_score -= 20
    elif len(high_risks) > 0:
        overall_score -= 10
    
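    if 'complexity_assessment' in report:
        # Guard against a missing system overview or a zero package count before dividing
        package_count = report.get('system_overview', {}).get('packages', 0)
        high_count = report['complexity_assessment']['distribution'].get('HIGH', 0)
        high_complexity_ratio = high_count / package_count if package_count else 0
        if high_complexity_ratio > 0.3:
            overall_score -= 10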
    
    if overall_score >= 80:
        readiness = "🟢 READY FOR MIGRATION"
    elif overall_score >= 60:
        readiness = "🟡 MOSTLY READY - ADDRESS RISKS FIRST"
    else:
        readiness = "🔴 NOT READY - SIGNIFICANT WORK REQUIRED"
    
    print(f"\n🎯 OVERALL MIGRATION READINESS")
    print(f"   Score: {overall_score}/100")
    print(f"   Status: {readiness}")
    
    report['overall_assessment'] = {
        'score': overall_score,
        'status': readiness
    }
    
    print(f"\n" + "=" * 60)
    print(f"📄 Report completed. Overall readiness: {readiness}")
    
    return report

# Generate the comprehensive report
migration_report = generate_migration_report()

# The report is held in memory only; nothing is written to disk here
print(f"\n💾 Report data structure created for programmatic access.")
print(f"   Use 'migration_report' variable to access structured data.")

## 7. Custom Migration Scenarios

Test specific migration scenarios and their implications.
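
Each scenario selects packages by name fragments such as `'Product'` or `'Category'`. A minimal sketch of how such a subset might be turned into a parameterized Cypher filter instead of being interpolated into the query string. The helper `build_package_filter` and the `$pkg_N` parameter names are hypothetical and not part of this notebook's existing code.

```python
def build_package_filter(package_subset, alias="p"):
    """Return (where_clause, params) for a list of package-name fragments.

    Hypothetical helper: the clause uses $pkg_0, $pkg_1, ... placeholders so
    that fragments containing quotes cannot break the Cypher query.
    """
    if not package_subset:
        return "", {}
    params = {f"pkg_{i}": fragment for i, fragment in enumerate(package_subset)}
    conditions = [f"{alias}.name CONTAINS ${name}" for name in params]
    return "WHERE " + " OR ".join(conditions), params


where_clause, params = build_package_filter(["Product", "Category"])
print(where_clause)
print(params)
```

The clause and parameter dictionary would then be passed together to the database cursor, assuming the query helper in use forwards a parameter dictionary to the driver.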

In [None]:
# Test migration scenarios
def test_migration_scenario(scenario_name, package_subset=None):
    """Test a specific migration scenario"""
    
    print(f"🧪 Testing Migration Scenario: {scenario_name}")
    print("=" * 50)
    
    if package_subset:
        # Analyze specific packages
        packages_filter = "WHERE " + " OR ".join([f"p.name CONTAINS '{pkg}'" for pkg in package_subset])
    else:
        packages_filter = ""
    
    scenario_query = f"""
    MATCH (p:Node {{node_type: 'PIPELINE'}})
    {packages_filter}
    OPTIONAL MATCH (p)-[:CONTAINS]->(op:Node {{node_type: 'OPERATION'}})
    OPTIONAL MATCH (op)-[:READS_FROM|WRITES_TO]->(da:Node {{node_type: 'DATA_ASSET'}})
    RETURN 
        p.name as package_name,
        count(DISTINCT op) as operations,
        count(DISTINCT da) as data_assets,
        collect(DISTINCT op.operation_type) as operation_types
    ORDER BY operations DESC
    """
    
    results = execute_and_fetch(scenario_query)
    
    if not results:
        print("❌ No packages found for this scenario")
        return
    
    total_operations = sum(r[1] for r in results)
    total_assets = sum(r[2] for r in results)
    
    print(f"📊 Scenario Analysis:")
    print(f"   Packages in scope: {len(results)}")
    print(f"   Total operations: {total_operations}")
    print(f"   Total data assets: {total_assets}")
    
    # Estimate effort for this scenario
    base_effort_per_operation = 0.5  # days
    estimated_effort = total_operations * base_effort_per_operation
    
    print(f"   Estimated effort: {estimated_effort:.1f} days")
    
    # Show package details
    print(f"\n📦 Packages in Scenario:")
    for pkg_name, ops, assets, op_types in results:
        complexity = "HIGH" if ops > 10 else "MEDIUM" if ops > 5 else "LOW"
        print(f"   • {pkg_name}: {ops} ops, {assets} assets ({complexity} complexity)")
        if op_types:
            unique_types = [t for t in op_types if t]  # Remove None values
            print(f"     Types: {', '.join(unique_types[:3])}{'...' if len(unique_types) > 3 else ''}")
    
    # Check for dependencies outside the scenario
    external_deps = []  # stays empty when no dependency data is available
    if deps_result and deps_result[0][0]:
        dependencies_data = json.loads(deps_result[0][0])
        scenario_packages = {r[0] for r in results}
        
        for dep in dependencies_data:
            source = dep.get('source_package', '')
            target = dep.get('target_package', '')
            
            # Check if one package is in scenario but dependency is outside
            if source in scenario_packages and target not in scenario_packages:
                external_deps.append(f"{source} depends on {target} (external)")
            elif target in scenario_packages and source not in scenario_packages:
                external_deps.append(f"{source} (external) affects {target}")
        
        if external_deps:
            print(f"\n⚠️ External Dependencies ({len(external_deps)}):")
            for dep in external_deps[:5]:
                print(f"   • {dep}")
            if len(external_deps) > 5:
                print(f"   ... and {len(external_deps) - 5} more")
        else:
            print(f"\n✅ No external dependencies - scenario is self-contained")
    
    return {
        'scenario': scenario_name,
        'packages': len(results),
        'operations': total_operations,
        'assets': total_assets,
        'estimated_effort': estimated_effort,
        'external_dependencies': len(external_deps)
    }

# Test different migration scenarios
scenarios = [
    ("Customer Data Migration", ['Customer', 'Order']),
    ("Product Catalog Migration", ['Product', 'Category']),
    ("Reporting Systems Migration", ['Report', 'Summary']),
    ("Full System Migration", None)
]

scenario_results = []
for scenario_name, package_filter in scenarios:
    result = test_migration_scenario(scenario_name, package_filter)
    if result:
        scenario_results.append(result)
    print("\n" + "-"*50)

# Compare scenarios
if scenario_results:
    print("\n📊 Scenario Comparison:")
    print(f"{'Scenario':<30} {'Packages':<10} {'Operations':<12} {'Effort (days)':<15} {'Ext Deps':<10}")
    print("-" * 80)
    
    for result in scenario_results:
        print(f"{result['scenario']:<30} {result['packages']:<10} {result['operations']:<12} {result['estimated_effort']:<15.1f} {result['external_dependencies']:<10}")
    
    print("\n💡 Scenario Recommendations:")
    # Find the best starter scenario (low effort, few external deps)
    starter_scenarios = [r for r in scenario_results if r['external_dependencies'] == 0 and r['estimated_effort'] < 10]
    if starter_scenarios:
        best_starter = min(starter_scenarios, key=lambda x: x['estimated_effort'])
        print(f"   🚀 Best Starting Point: {best_starter['scenario']} ({best_starter['estimated_effort']:.1f} days)")
    
    # Find the most complex scenario
    most_complex = max(scenario_results, key=lambda x: x['estimated_effort'])
    print(f"   ⚠️ Most Complex: {most_complex['scenario']} ({most_complex['estimated_effort']:.1f} days)")
    
    print(f"\n   💡 Recommended approach: Start with self-contained scenarios and progress to more complex ones.")

## Summary and Next Steps

This comprehensive migration analysis notebook has covered:

### 🎯 Key Analysis Areas
1. **Migration Complexity Assessment** - Overall effort estimation and cost analysis
2. **Execution Sequence Planning** - Dependency-based migration waves
3. **Risk Analysis and Mitigation** - Comprehensive risk identification
4. **Target Platform Analysis** - Compatibility with Azure, AWS, and Airflow
5. **Data Lineage Impact Assessment** - Understanding change propagation
6. **Migration Report Generation** - Executive-level reporting
7. **Custom Migration Scenarios** - Testing specific migration approaches

### 📊 Key Insights for SSIS Northwind
- System complexity and migration readiness
- Optimal execution sequence based on dependencies
- Risk factors that need attention
- Platform compatibility analysis
- Cost and timeline estimates

### 🚀 How to Use This Analysis
1. **For Migration Teams**: Use the execution sequence and risk analysis to plan your migration waves
2. **For Architects**: Leverage platform compatibility analysis to choose the best target
3. **For Project Managers**: Use timeline and cost estimates for project planning
4. **For Stakeholders**: Review the executive summary and overall readiness assessment
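
The timeline and cost figures come from simple arithmetic in the report cell: effort days are divided by a 5-day week, multiplied by a risk buffer (1.2 if any high or critical risk exists, otherwise 1.1), and two weeks are added for testing; cost is effort days times a flat $800 daily rate. A minimal sketch of that calculation as a standalone function, where the name `estimate_timeline` is hypothetical and the inputs are illustrative rather than measured:

```python
def estimate_timeline(effort_days, high_risk_count, daily_rate=800):
    """Reproduce the report's timeline and cost arithmetic for given inputs."""
    base_weeks = effort_days / 5                      # 5 working days per week
    risk_buffer = 1.2 if high_risk_count > 0 else 1.1
    buffered_weeks = base_weeks * risk_buffer
    return {
        "base_weeks": base_weeks,
        "with_buffer_weeks": buffered_weeks,
        "recommended_weeks": buffered_weeks + 2,      # two weeks for testing
        "estimated_cost": effort_days * daily_rate,
    }


# Illustrative inputs only, not values measured from the Northwind graph
estimate = estimate_timeline(effort_days=20, high_risk_count=1)
for key, value in estimate.items():
    print(f"{key}: {value:.1f}")
```

Keeping the arithmetic in one function makes it easier to test different daily rates or risk counts without re-running the full report.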

### 🔄 Next Steps
1. **Customize Analysis**: Modify queries and scenarios for your specific requirements
2. **Integrate with Tools**: Use the structured report data in your migration tools
3. **Monitor Progress**: Adapt the analysis as migration progresses
4. **Scale Up**: Apply these techniques to larger SSIS environments
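
To hand the structured report to other tools, the report dictionary can be serialized to JSON. A minimal sketch, assuming a report shaped like the one returned by `generate_migration_report()`. The sample values, the helper names `save_report` and `load_report`, and the file name `migration_report.json` are placeholders, not real results or existing notebook code.

```python
import json
import tempfile
from pathlib import Path

# Placeholder report using two of the top-level keys of the generated report
sample_report = {
    "overall_assessment": {"score": 75, "status": "🟡 MOSTLY READY - ADDRESS RISKS FIRST"},
    "risk_assessment": {"total_risks": 3, "high_risks": 1},
}


def save_report(report, path):
    """Write the report as UTF-8 JSON, keeping emoji readable."""
    text = json.dumps(report, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")
    return text


def load_report(path):
    """Read a report previously written by save_report."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Write to the system temp directory to avoid cluttering the working directory
report_path = Path(tempfile.gettempdir()) / "migration_report.json"
save_report(sample_report, report_path)
restored = load_report(report_path)
print(restored["overall_assessment"]["status"])
```

Passing `ensure_ascii=False` keeps the status emoji as literal characters instead of `\u` escapes, which is why the file is written and read explicitly as UTF-8.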

### 💡 Advanced Applications
- **Automated Migration Planning**: Use the analysis to drive migration tool decisions
- **Change Impact Analysis**: Assess impacts of modifications during migration
- **Performance Optimization**: Identify bottlenecks and optimization opportunities
- **Compliance Reporting**: Generate reports for audit and governance requirements
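
As one illustration of automated migration planning, dependency records can be grouped into migration waves with a topological sort. A minimal sketch using the standard library's `graphlib` (Python 3.9+), assuming records with `source_package` and `target_package` keys where the source depends on the target. The helper `plan_waves` is hypothetical and the sample dependencies are invented for illustration, not taken from the Northwind graph.

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies):
    """Group packages into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter()
    for dep in dependencies:
        # source_package depends on target_package, so the target migrates first
        sorter.add(dep["source_package"], dep["target_package"])
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Invented sample data for illustration
sample_dependencies = [
    {"source_package": "Orders.dtsx", "target_package": "Customers.dtsx"},
    {"source_package": "Orders.dtsx", "target_package": "Product.dtsx"},
    {"source_package": "Product.dtsx", "target_package": "Categories.dtsx"},
]
for number, wave in enumerate(plan_waves(sample_dependencies), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

Packages within a wave have no dependencies on each other, so they could be migrated in parallel; a circular dependency surfaces as an exception rather than a silent misordering.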

Together, graph analysis, materialized views, and migration-specific queries give migration planning and execution a data-driven basis.