# 02 - Exploring SSIS Package Structure

This notebook demonstrates how to explore the structure of SSIS packages stored in Memgraph,
with a focus on the enhanced SQL semantics metadata used for migration analysis.

## Key Features Covered:
- Package hierarchy and relationships
- Operation types and their properties
- Data flow analysis with SQL semantics
- Connection and parameter mapping
- Migration readiness assessment

In [None]:
# Setup and imports
import json
from typing import Any, Dict, Optional

import matplotlib.pyplot as plt
import mgclient  # the PyPI package "pymgclient" installs the module "mgclient"
import pandas as pd
import seaborn as sns

# Connection configuration
HOST = "localhost"
PORT = 7687

def get_connection():
    """Create Memgraph connection."""
    return mgclient.connect(host=HOST, port=PORT)

def execute_query(query: str, params: Optional[Dict[str, Any]] = None) -> pd.DataFrame:
    """Execute query and return results as DataFrame."""
    conn = get_connection()
    try:
        cursor = conn.cursor()
        cursor.execute(query, params or {})

        # Each description entry exposes the result column name as `.name`
        columns = [col.name for col in cursor.description] if cursor.description else []
        rows = cursor.fetchall()

        return pd.DataFrame(rows, columns=columns)
    finally:
        # Close explicitly rather than relying on context-manager support
        conn.close()

## 1. Package Structure Overview

Let's start by examining the overall structure of SSIS packages in our graph database.
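
The queries in this section count direct children over a single `CONTAINS` hop and count data assets and connections over the variable-length pattern `[:CONTAINS*]`, which follows one or more hops. As a rough illustration of the difference, here is a minimal in-memory sketch. The package, operation, and asset names are hypothetical, and the assumption that data assets sit below operations in the containment tree is made only for this example.

```python
from collections import deque

# Hypothetical containment edges: parent -> children (mirrors [:CONTAINS]).
contains = {
    "LoadSales.dtsx": ["Extract Orders", "Load Fact"],
    "Extract Orders": ["dbo.Orders"],
    "Load Fact": ["dbo.FactSales", "dbo.Orders"],
}
# Hypothetical node_type values, using the type names from the queries.
node_type = {
    "LoadSales.dtsx": "pipeline",
    "Extract Orders": "operation",
    "Load Fact": "operation",
    "dbo.Orders": "data_asset",
    "dbo.FactSales": "data_asset",
}


def descendants(root):
    """Return every node reachable from root over one or more CONTAINS hops."""
    seen, queue = set(), deque(contains.get(root, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(contains.get(node, []))
    return seen


def count_by_type(root, wanted):
    """Count distinct descendants of a given node_type (like count(DISTINCT ...))."""
    return sum(1 for n in descendants(root) if node_type[n] == wanted)


# Single hop: only direct children of the package.
direct_ops = sum(
    1 for n in contains["LoadSales.dtsx"] if node_type[n] == "operation"
)
# Variable-length: assets at any depth, each counted once.
assets = count_by_type("LoadSales.dtsx", "data_asset")
print(direct_ops, assets)
```

`dbo.Orders` is reachable through two operations but is counted once, which is the role `DISTINCT` plays in the Cypher queries.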

In [None]:
# Get package overview
package_overview = execute_query("""
    MATCH (p:Node)
    WHERE p.node_type = 'pipeline'
    RETURN 
        p.name as package_name,
        p.properties.file_path as file_path,
        p.properties.package_type as package_type
    ORDER BY p.name
""")

print("📦 SSIS PACKAGES IN GRAPH:")
print("=" * 60)
display(package_overview)

In [None]:
# Get detailed package metrics
package_metrics = execute_query("""
    MATCH (pkg:Node)-[:CONTAINS]->(op:Node)
    WHERE pkg.node_type = 'pipeline' AND op.node_type = 'operation'
    WITH pkg, count(op) as operation_count
    OPTIONAL MATCH (pkg)-[:CONTAINS*]->(asset:Node)
    WHERE asset.node_type = 'data_asset'
    WITH pkg, operation_count, count(DISTINCT asset) as data_asset_count
    OPTIONAL MATCH (pkg)-[:CONTAINS*]->(conn:Node)
    WHERE conn.node_type = 'connection'
    RETURN 
        pkg.name as package_name,
        operation_count,
        data_asset_count,
        count(DISTINCT conn) as connection_count
    ORDER BY operation_count DESC
""")

print("📊 PACKAGE COMPLEXITY METRICS:")
print("=" * 60)
display(package_metrics)

# Visualize package complexity
if not package_metrics.empty:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Operations per package
    ax1.bar(package_metrics['package_name'], package_metrics['operation_count'])
    ax1.set_title('Operations per Package')
    ax1.set_xlabel('Package')
    ax1.set_ylabel('Number of Operations')
    ax1.tick_params(axis='x', rotation=45)
    
    # Data assets per package
    ax2.bar(package_metrics['package_name'], package_metrics['data_asset_count'])
    ax2.set_title('Data Assets per Package')
    ax2.set_xlabel('Package')
    ax2.set_ylabel('Number of Data Assets')
    ax2.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()

## 2. Operation Types and SQL Semantics Analysis

Now let's examine the different types of operations and their enhanced SQL semantics metadata.
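
The coverage percentage reported per operation type is the share of operations that carry SQL semantics, rounded to one decimal place. A minimal sketch of that arithmetic in plain Python follows. The operation type names and flags are hypothetical sample data, not values read from the graph.

```python
from collections import defaultdict

# Hypothetical operations: (operation_type, has_sql_semantics).
operations = [
    ("execute_sql", True),
    ("execute_sql", True),
    ("execute_sql", False),
    ("data_flow", True),
    ("script_task", False),
]


def coverage_by_type(ops):
    """Return {operation_type: percentage of operations with SQL semantics}."""
    totals = defaultdict(int)
    with_semantics = defaultdict(int)
    for op_type, has_semantics in ops:
        totals[op_type] += 1
        with_semantics[op_type] += int(has_semantics)
    return {
        op_type: round(100.0 * with_semantics[op_type] / total, 1)
        for op_type, total in totals.items()
    }


coverage = coverage_by_type(operations)
print(coverage)
```

Every type present has at least one operation, so the division is safe here. The per-package query in the readiness section guards against a zero denominator because a package may contain no operations.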

In [None]:
# Analyze operation types
operation_types = execute_query("""
    MATCH (op:Node)
    WHERE op.node_type = 'operation'
    WITH op.properties.operation_type as op_type, count(*) as count,
         sum(CASE WHEN op.properties.sql_semantics IS NOT NULL THEN 1 ELSE 0 END) as with_sql_semantics
    RETURN 
        op_type,
        count as total_operations,
        with_sql_semantics,
        round(1000.0 * with_sql_semantics / count) / 10.0 as sql_semantics_percentage
    ORDER BY count DESC
""")

print("🔧 OPERATION TYPES WITH SQL SEMANTICS:")
print("=" * 70)
display(operation_types)

# Visualize operation types
if not operation_types.empty:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Operation type distribution
    ax1.pie(operation_types['total_operations'], 
            labels=operation_types['op_type'],
            autopct='%1.1f%%',
            startangle=90)
    ax1.set_title('Distribution of Operation Types')
    
    # SQL semantics coverage
    x_pos = range(len(operation_types))
    ax2.bar(x_pos, operation_types['total_operations'], label='Total Operations', alpha=0.7)
    ax2.bar(x_pos, operation_types['with_sql_semantics'], label='With SQL Semantics', alpha=0.9)
    ax2.set_title('SQL Semantics Coverage by Operation Type')
    ax2.set_xlabel('Operation Type')
    ax2.set_ylabel('Number of Operations')
    ax2.set_xticks(x_pos)
    ax2.set_xticklabels(operation_types['op_type'], rotation=45)
    ax2.legend()
    
    plt.tight_layout()
    plt.show()

## 3. SQL Semantics Deep Dive

Let's examine the enhanced SQL semantics metadata that enables accurate migration.
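
The cell below reads a handful of keys from each `sql_semantics` value: `original_query`, `tables`, `joins`, and `migration_metadata`. As a rough guide to the shape it expects, here is a minimal sketch. The key names come from that cell, while the query text, table names, and the helpers `load_semantics` and `qualified_tables` are hypothetical.

```python
import json

# Hypothetical payload; only the key names are taken from the notebook cell.
raw = json.dumps({
    "original_query": (
        "SELECT o.id, c.name FROM sales.Orders o "
        "INNER JOIN sales.Customers c ON o.customer_id = c.id"
    ),
    "tables": [
        {"name": "Orders", "schema": "sales", "alias": "o"},
        {"name": "Customers", "schema": "sales", "alias": "c"},
    ],
    "joins": [
        {
            "join_type": "INNER JOIN",
            "left_table": {"name": "Orders"},
            "right_table": {"name": "Customers"},
            "condition": "o.customer_id = c.id",
        }
    ],
    "migration_metadata": {
        "table_count": 2,
        "join_count": 1,
        "column_count": 2,
        "has_aliases": True,
        "join_types": ["INNER JOIN"],
    },
})


def load_semantics(value):
    """Accept a JSON string or an already-decoded dict; return a dict."""
    if isinstance(value, str):
        value = json.loads(value)
    return value if isinstance(value, dict) else {}


def qualified_tables(semantics):
    """Render each table as schema.name, omitting a missing schema."""
    return [
        f"{t['schema']}.{t['name']}" if t.get("schema") else t["name"]
        for t in semantics.get("tables", [])
    ]


semantics = load_semantics(raw)
tables = qualified_tables(semantics)
join_count = semantics.get("migration_metadata", {}).get("join_count", 0)
print(tables, join_count)
```

Returning an empty dict for anything that is not a mapping keeps the later `.get(...)` calls safe when a property is missing or stored as `null`.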

In [None]:
# Get operations with rich SQL semantics
sql_operations = execute_query("""
    MATCH (op:Node)
    WHERE op.node_type = 'operation' AND op.properties.sql_semantics IS NOT NULL
    RETURN 
        op.name as operation_name,
        op.properties.operation_type as operation_type,
        op.properties.sql_semantics as sql_semantics_raw
    LIMIT 5
""")

print("🔍 OPERATIONS WITH SQL SEMANTICS:")
print("=" * 70)

for idx, row in sql_operations.iterrows():
    print(f"\n📋 Operation: {row['operation_name']}")
    print(f"   Type: {row['operation_type']}")
    
    # Parse SQL semantics JSON
    try:
        sql_semantics = json.loads(row['sql_semantics_raw']) if isinstance(row['sql_semantics_raw'], str) else row['sql_semantics_raw']
        
        print(f"   📊 Metadata:")
        migration_meta = sql_semantics.get('migration_metadata', {})
        print(f"      • Tables: {migration_meta.get('table_count', 0)}")
        print(f"      • Joins: {migration_meta.get('join_count', 0)}")
        print(f"      • Columns: {migration_meta.get('column_count', 0)}")
        print(f"      • Has Aliases: {migration_meta.get('has_aliases', False)}")
        print(f"      • Join Types: {migration_meta.get('join_types', [])}")
        
        # Show original query snippet
        if 'original_query' in sql_semantics:
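            original_query = sql_semantics['original_query']
            # Append an ellipsis only when the query was actually truncated
            query_snippet = original_query[:100] + ("..." if len(original_query) > 100 else "")
            print(f"   🔤 Query Preview: {query_snippet}")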
        
        # Show table references
        if sql_semantics.get('tables'):
            print(f"   📋 Tables Referenced:")
            for table in sql_semantics['tables'][:3]:  # Show first 3
                alias_info = f" (as {table['alias']})" if table.get('alias') else ""
                schema_info = f"{table['schema']}." if table.get('schema') else ""
                print(f"      • {schema_info}{table['name']}{alias_info}")
        
        # Show JOIN relationships
        if sql_semantics.get('joins'):
            print(f"   🔗 JOIN Relationships:")
            for join in sql_semantics['joins'][:2]:  # Show first 2
                left_table = join['left_table']['name']
                right_table = join['right_table']['name']
                join_type = join['join_type']
                condition = join['condition'][:50] + "..." if len(join['condition']) > 50 else join['condition']
                print(f"      • {left_table} {join_type} {right_table} ON {condition}")
        
    except (json.JSONDecodeError, TypeError, KeyError, AttributeError) as e:
        print(f"   ❌ Error parsing SQL semantics: {e}")
    
    print("-" * 70)

## 4. Data Flow Analysis

Analyze data flow patterns and table relationships across packages.
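
One way to read the `READS_FROM` and `WRITES_TO` rows is to group them per asset into readers and writers. Assets with both are points where one operation hands data to another. The sketch below assumes rows shaped like the query result in the next cell. The operation and asset names, and the helper `readers_and_writers`, are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (relationship_type, operation_name, asset_name).
flows = [
    ("READS_FROM", "Extract Orders", "dbo.Orders"),
    ("READS_FROM", "Load Fact", "dbo.Orders"),
    ("WRITES_TO", "Load Fact", "dbo.FactSales"),
    ("READS_FROM", "Build Report", "dbo.FactSales"),
]


def readers_and_writers(rows):
    """Group operations by the asset they touch and the direction of access."""
    usage = defaultdict(lambda: {"readers": set(), "writers": set()})
    for rel_type, operation, asset in rows:
        role = "readers" if rel_type == "READS_FROM" else "writers"
        usage[asset][role].add(operation)
    return dict(usage)


usage = readers_and_writers(flows)
# Assets that are both written and read connect two or more operations.
handoffs = sorted(a for a, u in usage.items() if u["readers"] and u["writers"])
print(handoffs)
```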

In [None]:
# Analyze data flow patterns
data_flow = execute_query("""
    MATCH (src:Node)-[r:READS_FROM|WRITES_TO]->(target:Node)
    WHERE src.node_type = 'operation' AND target.node_type = 'data_asset'
    RETURN 
        type(r) as relationship_type,
        src.name as operation_name,
        target.name as asset_name,
        src.properties.operation_type as operation_type
    ORDER BY relationship_type, asset_name
""")

print("📊 DATA FLOW PATTERNS:")
print("=" * 60)

if not data_flow.empty:
    # Group by relationship type
    flow_summary = data_flow.groupby(['relationship_type', 'asset_name']).size().reset_index(name='count')
    flow_summary = flow_summary.sort_values(['relationship_type', 'count'], ascending=[True, False])
    
    display(flow_summary.head(15))
    
    # Show most accessed data assets
    asset_access = data_flow.groupby('asset_name').agg({
        'relationship_type': 'count',
        'operation_name': 'nunique'
    }).reset_index()
    asset_access.columns = ['asset_name', 'total_accesses', 'unique_operations']
    asset_access = asset_access.sort_values('total_accesses', ascending=False)
    
    print("\n🎯 MOST ACCESSED DATA ASSETS:")
    display(asset_access.head(10))
else:
    print("No data flow relationships found.")

In [None]:
# Enhanced table reference analysis using SQL semantics
table_references = execute_query("""
    MATCH (op:Node)
    WHERE op.node_type = 'operation' AND op.properties.sql_semantics IS NOT NULL
    RETURN 
        op.name as operation_name,
        op.properties.sql_semantics as sql_semantics_raw
""")

print("📋 TABLE REFERENCES FROM SQL SEMANTICS:")
print("=" * 70)

table_usage = {}
join_patterns = {}

for idx, row in table_references.iterrows():
    try:
        sql_semantics = json.loads(row['sql_semantics_raw']) if isinstance(row['sql_semantics_raw'], str) else row['sql_semantics_raw']
        
        # Track table usage
        for table in sql_semantics.get('tables', []):
            table_name = table['name']
            if table_name not in table_usage:
                table_usage[table_name] = {
                    'operations': set(),
                    'with_alias': 0,
                    'schemas': set()
                }
            
            table_usage[table_name]['operations'].add(row['operation_name'])
            if table.get('alias'):
                table_usage[table_name]['with_alias'] += 1
            if table.get('schema'):
                table_usage[table_name]['schemas'].add(table['schema'])
        
        # Track JOIN patterns
        for join in sql_semantics.get('joins', []):
            join_type = join['join_type']
            if join_type not in join_patterns:
                join_patterns[join_type] = 0
            join_patterns[join_type] += 1
            
    except (json.JSONDecodeError, TypeError, KeyError, AttributeError):
        # Skip operations whose SQL semantics are malformed or incomplete
        continue

# Display table usage summary
if table_usage:
    table_df = pd.DataFrame([
        {
            'table_name': name,
            'operation_count': len(info['operations']),
            'operations': ', '.join(sorted(info['operations'])[:3]) + ('...' if len(info['operations']) > 3 else ''),
            'alias_usage': info['with_alias'],
            'schemas': ', '.join(sorted(info['schemas'])) if info['schemas'] else 'None'
        }
        for name, info in table_usage.items()
    ]).sort_values('operation_count', ascending=False)
    
    display(table_df)
    
    print(f"\n🔗 JOIN PATTERN DISTRIBUTION:")
    for join_type, count in sorted(join_patterns.items(), key=lambda x: x[1], reverse=True):
        print(f"   • {join_type}: {count} occurrences")
else:
    print("No table references found in SQL semantics.")

## 5. Connection and Parameter Analysis

Examine connection patterns and parameter usage for migration planning.

In [None]:
# Analyze connections
connections = execute_query("""
    MATCH (conn:Node)
    WHERE conn.node_type = 'connection'
    OPTIONAL MATCH (op:Node)-[:USES_CONNECTION]->(conn)
    WHERE op.node_type = 'operation'
    RETURN 
        conn.name as connection_name,
        conn.properties.connection_type as connection_type,
        conn.properties.server_name as server_name,
        conn.properties.database_name as database_name,
        count(op) as used_by_operations
    ORDER BY used_by_operations DESC
""")

print("🔌 CONNECTION ANALYSIS:")
print("=" * 60)
display(connections)

# Analyze parameters
parameters = execute_query("""
    MATCH (param:Node)
    WHERE param.node_type = 'parameter'
    OPTIONAL MATCH (op:Node)-[:USES_PARAMETER]->(param)
    WHERE op.node_type = 'operation'
    RETURN 
        param.name as parameter_name,
        param.properties.data_type as data_type,
        param.properties.default_value as default_value,
        count(op) as used_by_operations
    ORDER BY used_by_operations DESC
""")

print("\n📝 PARAMETER ANALYSIS:")
print("=" * 60)
display(parameters)

## 6. Migration Readiness Assessment

Assess how ready each package is for automated migration based on available metadata.
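
The scoring rule in the next cell combines four capped components: SQL semantics coverage (up to 40 points), a complexity term (up to 30), data assets (up to 20), and connections (up to 10). The standalone sketch below reproduces that arithmetic on plain numbers so the weights are easy to check by hand. The input values are hypothetical, and the function names differ from those in the notebook cell.

```python
def readiness_score(coverage_pct, total_operations, data_assets, connections):
    """Return a 0-100 readiness score from the four weighted components."""
    coverage_points = (coverage_pct / 100) * 40
    # Full 30 points up to 5 operations, minus 2 per extra operation, floor 0.
    complexity_points = min(30, max(0, 30 - (total_operations - 5) * 2))
    asset_points = min(20, data_assets * 4)
    connection_points = min(10, connections * 5)
    total = coverage_points + complexity_points + asset_points + connection_points
    return round(total, 1)


def readiness_category(score):
    """Map a score to the category thresholds used in the notebook."""
    if score >= 80:
        return "High"
    if score >= 60:
        return "Medium"
    if score >= 40:
        return "Low"
    return "Manual"


# Hypothetical package: 75% coverage, 8 operations, 3 data assets, 2 connections.
# 30 (coverage) + 24 (complexity) + 12 (assets) + 10 (connections) = 76.0
score = readiness_score(75, 8, 3, 2)
print(score, readiness_category(score))
```

Because the complexity term starts at its maximum, a package with few operations receives 30 points even when its coverage is zero, which is worth keeping in mind when reading the lower categories.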

In [None]:
# Migration readiness analysis
migration_readiness = execute_query("""
    MATCH (pkg:Node)
    WHERE pkg.node_type = 'pipeline'
    OPTIONAL MATCH (pkg)-[:CONTAINS]->(op:Node)
    WHERE op.node_type = 'operation'
    WITH pkg, 
         count(op) as total_operations,
         sum(CASE WHEN op.properties.sql_semantics IS NOT NULL THEN 1 ELSE 0 END) as operations_with_sql_semantics
    OPTIONAL MATCH (pkg)-[:CONTAINS*]->(asset:Node)
    WHERE asset.node_type = 'data_asset'
    WITH pkg, total_operations, operations_with_sql_semantics, count(DISTINCT asset) as data_assets
    OPTIONAL MATCH (pkg)-[:CONTAINS*]->(conn:Node)
    WHERE conn.node_type = 'connection'
    RETURN 
        pkg.name as package_name,
        total_operations,
        operations_with_sql_semantics,
        round(1000.0 * operations_with_sql_semantics / CASE WHEN total_operations = 0 THEN 1 ELSE total_operations END) / 10.0 as sql_semantics_coverage,
        data_assets,
        count(DISTINCT conn) as connections
    ORDER BY sql_semantics_coverage DESC, total_operations DESC
""")

print("🎯 MIGRATION READINESS ASSESSMENT:")
print("=" * 80)

# Add readiness score calculation
def calculate_readiness_score(row):
    """Calculate migration readiness score (0-100)."""
    score = 0
    
    # SQL semantics coverage (40% weight)
    score += (row['sql_semantics_coverage'] / 100) * 40
    
    # Package complexity (30% weight, inverse relationship): full 30 points up to
    # 5 operations, minus 2 points per additional operation, reaching 0 at 20
    complexity_score = min(30, max(0, 30 - (row['total_operations'] - 5) * 2))
    score += complexity_score
    
    # Data asset mapping (20% weight)
    asset_score = min(20, row['data_assets'] * 4)  # Up to 20 points
    score += asset_score
    
    # Connection clarity (10% weight)
    conn_score = min(10, row['connections'] * 5)  # Up to 10 points
    score += conn_score
    
    return round(score, 1)

if not migration_readiness.empty:
    migration_readiness['readiness_score'] = migration_readiness.apply(calculate_readiness_score, axis=1)
    
    # Add readiness category
    def get_readiness_category(score):
        if score >= 80: return "🟢 High"
        elif score >= 60: return "🟡 Medium"
        elif score >= 40: return "🟠 Low"
        else: return "🔴 Manual"
    
    migration_readiness['readiness_category'] = migration_readiness['readiness_score'].apply(get_readiness_category)
    
    # Sort by readiness score
    migration_readiness = migration_readiness.sort_values('readiness_score', ascending=False)
    
    display(migration_readiness)
    
    # Visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Readiness score distribution
    ax1.bar(migration_readiness['package_name'], migration_readiness['readiness_score'])
    ax1.set_title('Migration Readiness Scores')
    ax1.set_xlabel('Package')
    ax1.set_ylabel('Readiness Score (0-100)')
    ax1.tick_params(axis='x', rotation=45)
    ax1.axhline(y=80, color='green', linestyle='--', alpha=0.7, label='High Readiness')
    ax1.axhline(y=60, color='orange', linestyle='--', alpha=0.7, label='Medium Readiness')
    ax1.legend()
    
    # SQL semantics coverage vs operations
    ax2.scatter(migration_readiness['total_operations'], migration_readiness['sql_semantics_coverage'], 
               s=migration_readiness['readiness_score']*2, alpha=0.6)
    ax2.set_title('SQL Semantics Coverage vs Package Complexity')
    ax2.set_xlabel('Total Operations')
    ax2.set_ylabel('SQL Semantics Coverage (%)')
    
    # Add package labels
    for idx, row in migration_readiness.iterrows():
        ax2.annotate(row['package_name'], 
                    (row['total_operations'], row['sql_semantics_coverage']),
                    xytext=(5, 5), textcoords='offset points', fontsize=8)
    
    plt.tight_layout()
    plt.show()
    
    # Summary statistics
    print("\n📊 READINESS SUMMARY:")
    readiness_summary = migration_readiness['readiness_category'].value_counts()
    for category, count in readiness_summary.items():
        print(f"   {category}: {count} packages")
    
    avg_score = migration_readiness['readiness_score'].mean()
    print(f"\n   📈 Average Readiness Score: {avg_score:.1f}/100")
    
    high_ready = len(migration_readiness[migration_readiness['readiness_score'] >= 80])
    total_packages = len(migration_readiness)
    print(f"   🎯 High-Readiness Packages: {high_ready}/{total_packages} ({100*high_ready/total_packages:.1f}%)")
else:
    print("No packages found for migration readiness assessment.")

## 7. Migration Recommendations

Based on the analysis above, here are specific recommendations for each package.

In [None]:
print("💡 MIGRATION RECOMMENDATIONS:")
print("=" * 80)

if not migration_readiness.empty:
    for idx, row in migration_readiness.iterrows():
        package_name = row['package_name']
        score = row['readiness_score']
        coverage = row['sql_semantics_coverage']
        
        print(f"\n📦 {package_name}")
        print(f"   Readiness: {row['readiness_category']} ({score}/100)")
        
        # Specific recommendations
        recommendations = []
        
        if score >= 80:
            recommendations.append("✅ Ready for automated migration")
            if coverage == 100:
                recommendations.append("🚀 Perfect SQL semantics coverage - use code generators")
            else:
                recommendations.append(f"📊 {coverage}% SQL coverage - minor manual validation needed")
        elif score >= 60:
            recommendations.append("⚠️  Semi-automated migration possible")
            if coverage < 50:
                recommendations.append("🔍 Improve SQL semantics extraction first")
            recommendations.append("👥 Require migration expert review")
        else:
            recommendations.append("🔴 Manual migration required")
            if coverage == 0:
                recommendations.append("📝 No SQL semantics captured - start with parser enhancement")
            recommendations.append("🛠️  Consider package refactoring before migration")
        
        # Effort estimation
        if score >= 80:
            effort = "2-4 hours (mostly automated)"
        elif score >= 60:
            effort = "1-2 days (semi-automated + review)"
        elif score >= 40:
            effort = "3-5 days (significant manual work)"
        else:
            effort = "1-2 weeks (full manual migration)"
        
        recommendations.append(f"⏱️  Estimated effort: {effort}")
        
        for rec in recommendations:
            print(f"   {rec}")
        
        print("-" * 70)
    
    # Overall migration strategy: sum the midpoint of each effort range above,
    # assuming 8-hour days and 40-hour weeks
    total_estimated_hours = 0
    for idx, row in migration_readiness.iterrows():
        score = row['readiness_score']
        if score >= 80: total_estimated_hours += 3      # 2-4 hours
        elif score >= 60: total_estimated_hours += 12   # 1-2 days
        elif score >= 40: total_estimated_hours += 32   # 3-5 days
        else: total_estimated_hours += 60               # 1-2 weeks
    
    print(f"\n🎯 OVERALL MIGRATION STRATEGY:")
    print(f"   📊 Total estimated effort: {total_estimated_hours} hours ({total_estimated_hours/8:.1f} person-days)")
    print(f"   🚀 High-priority packages: {len(migration_readiness[migration_readiness['readiness_score'] >= 80])}")
    print(f"   ⚠️  Medium-priority packages: {len(migration_readiness[(migration_readiness['readiness_score'] >= 60) & (migration_readiness['readiness_score'] < 80)])}")
    print(f"   🔴 Manual migration packages: {len(migration_readiness[migration_readiness['readiness_score'] < 60])}")
    
    print(f"\n   💡 Recommended approach:")
    print(f"   1. Start with high-readiness packages for quick wins")
    print(f"   2. Use automated code generators where SQL semantics coverage > 80%")
    print(f"   3. Enhance parser for medium-readiness packages")
    print(f"   4. Plan manual migration workshops for complex packages")
else:
    print("No migration readiness data available.")

## Summary

This notebook analyzed the structure of SSIS packages stored in Memgraph, using the SQL semantics metadata attached to operations to assess migration readiness.

### Key Capabilities Demonstrated:
1. **Package Structure Analysis** - Operation, data asset, and connection counts per package
2. **SQL Semantics Integration** - Table, alias, and JOIN metadata parsed from each operation's `sql_semantics` property
3. **Data Flow Mapping** - `READS_FROM` and `WRITES_TO` relationships between operations and data assets
4. **Migration Readiness Assessment** - A 0-100 score per package, bucketed into High, Medium, Low, and Manual
5. **Strategic Recommendations** - Per-package recommendations and effort estimates derived from the readiness score

### Enhanced Features:
- **Direct Property Queries** - Bypassing outdated materialized views for real-time analysis
- **SQL Semantics Metadata** - JOIN relationships, table references, and column transformations
- **Migration Scoring** - Quantitative readiness assessment for prioritization
- **Code Generation Ready** - Metadata structured for automated migration code generation

### Next Steps:
- Use this analysis to prioritize migration efforts
- Leverage high-readiness packages for automated code generation
- Enhance SQL semantics coverage for medium-readiness packages
- Plan detailed migration workshops for complex packages requiring manual intervention