# Copy SalesLT Data from Shortcuts to Bronze Lakehouse

This notebook reads SalesLT tables from shortcuts in `Gaiye_Test_Lakehouse` and copies them to the bronze layer in `RDS_Fabric_Foundry_workspace_Gaiye_Retail_Solution_Test_IDM_LH_bronze`.

**Setup Required:**
1. **Source**: `Gaiye_Test_Lakehouse` attached as additional lakehouse (contains shortcuts)
2. **Target**: `RDS_Fabric_Foundry_workspace_Gaiye_Retail_Solution_Test_IDM_LH_bronze` attached as default lakehouse
3. **Operation**: Copy shortcut data → bronze layer Files/bronze/saleslt/
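
The attachment roles above affect how table names resolve in Spark SQL: tables in the default lakehouse can be referenced by bare name, while tables in an additional lakehouse generally need a `lakehouse.table` prefix. The sketch below builds such a reference with backtick quoting. The helper name `qualified_table_ref` is hypothetical, and it assumes the attached lakehouse is exposed as a Spark database under its own name, which is worth confirming with `SHOW DATABASES`.

```python
from typing import Optional


def qualified_table_ref(table: str, lakehouse: Optional[str] = None) -> str:
    """Return a backtick-quoted Spark SQL table reference (hypothetical helper)."""

    def quote(identifier: str) -> str:
        # Spark SQL escapes a backtick inside a quoted identifier by doubling it
        return "`" + identifier.replace("`", "``") + "`"

    if lakehouse is None:
        # Bare name: resolves against the default lakehouse
        return quote(table)
    return f"{quote(lakehouse)}.{quote(table)}"


source_ref = qualified_table_ref("customer", "Gaiye_Test_Lakehouse")
print(source_ref)
# In the notebook this could feed: spark.sql(f"SELECT * FROM {source_ref}")
```

Quoting both parts keeps the query valid even if a lakehouse or table name contains characters that Spark would otherwise reject in an unquoted identifier.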

**Expected Tables in Source:**
- Customer, Product, Address, CustomerAddress
- ProductCategory, ProductDescription, ProductModel
- ProductModelProductDescription, SalesOrderDetail, SalesOrderHeader
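
A quick way to see which of these tables are missing is a case-insensitive set comparison. The sketch below is plain Python: `missing_tables` is a hypothetical helper, and `found` stands in for whatever names `SHOW TABLES` returns in your environment.

```python
EXPECTED_TABLES = [
    "Customer", "Product", "Address", "CustomerAddress",
    "ProductCategory", "ProductDescription", "ProductModel",
    "ProductModelProductDescription", "SalesOrderDetail", "SalesOrderHeader",
]


def missing_tables(found, expected=EXPECTED_TABLES):
    """Return the expected names that have no case-insensitive match in `found`."""
    found_lower = {name.lower() for name in found}
    return [name for name in expected if name.lower() not in found_lower]


# Illustrative input only; real names come from the source lakehouse
found = ["customer", "product", "address"]
print(missing_tables(found))
```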

In [None]:
# Step 1: Import Libraries and Setup
import pandas as pd
from datetime import datetime
import os
from pyspark.sql.functions import lit

print("✅ Libraries imported successfully")
print(f"📅 Copy process started at: {datetime.now()}")
print("🔄 Source: Gaiye_Test_Lakehouse shortcuts → Target: Bronze lakehouse")

In [None]:
# Step 2: Environment and Lakehouse Check
print("🔍 ENVIRONMENT CHECK")
print("=" * 60)

# Check Spark session
try:
    print(f"✅ Spark session: {spark.version}")
except NameError:
    # `spark` is expected to be predefined by the notebook runtime
    print("❌ Spark session not available")
    raise RuntimeError("Spark session required")

# Check default lakehouse (should be bronze lakehouse)
try:
    default_tables = spark.sql("SHOW TABLES").toPandas()
    print(f"✅ Default lakehouse (Bronze): {len(default_tables)} tables visible")
    
    # Show sample tables from default lakehouse
    if len(default_tables) > 0:
        print("📊 Sample tables in default lakehouse:")
        for table in default_tables['tableName'][:5]:
            print(f"   📄 {table}")
        if len(default_tables) > 5:
            print(f"   ... and {len(default_tables)-5} more")
    else:
        print("📊 Default lakehouse is empty (expected for bronze target)")
        
except Exception as e:
    print(f"❌ Cannot access default lakehouse: {str(e)}")

print("=" * 60)

In [None]:
# Step 3: Connect to Source Lakehouse and Discover SalesLT Tables
print("🔍 CONNECTING TO SOURCE LAKEHOUSE (Gaiye_Test_Lakehouse)")
print("=" * 60)

# Define expected SalesLT table names (from shortcuts)
expected_saleslt_tables = {
    'address': 'Address Information',
    'customer': 'Customer Data', 
    'customeraddress': 'Customer Data',
    'product': 'Product Catalog',
    'productcategory': 'Product Catalog',
    'productdescription': 'Product Catalog',
    'productmodel': 'Product Catalog',
    'productmodelproductdescription': 'Product Catalog',
    'salesorderdetail': 'Sales Transactions',
    'salesorderheader': 'Sales Transactions'
}

# Try to access source lakehouse tables
source_tables_info = []

try:
    # Method 1: Try to access tables directly (if both lakehouses are attached)
    print("🔍 Method 1: Checking for SalesLT tables in accessible lakehouses...")
    
    # List tables in the current database (the default lakehouse);
    # SHOW TABLES without IN does not span other attached lakehouses
    all_tables = spark.sql("SHOW TABLES").toPandas()
    
    # Look for SalesLT tables
    found_source_tables = []
    
    for _, row in all_tables.iterrows():
        table_name = row['tableName']
        table_name_lower = table_name.lower()
        
        # Check if this matches a SalesLT table
        if table_name_lower in expected_saleslt_tables:
            category = expected_saleslt_tables[table_name_lower]
            found_source_tables.append(table_name)
            
            source_tables_info.append({
                'table_name': table_name,
                'category': category,
                'source_reference': table_name  # Direct table reference
            })
            
            print(f"   ✅ Found: {table_name} ({category})")
    
    if len(source_tables_info) == 0:
        print("⚠️ No SalesLT tables found in directly accessible tables")
        print("💡 You may need to:")
        print("   1. Attach Gaiye_Test_Lakehouse as additional lakehouse")
        print("   2. Or use database.table_name syntax")
        
        # Method 2: Try with database prefix
        print("\n🔍 Method 2: Trying with database prefix...")
        
        # Check available databases
        databases = spark.sql("SHOW DATABASES").toPandas()
        print(f"📁 Available databases: {databases.iloc[:, 0].tolist()}")
        
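        # Look for a Gaiye-related database. This loose match can also pick up
        # the bronze target, whose name contains both 'Gaiye' and 'Test'.
        for db_name in databases.iloc[:, 0]:
            if 'gaiye' in db_name.lower() or 'test' in db_name.lower():
                print(f"🔍 Checking database: {db_name}")
                try:
                    # Backtick-quote the name so unusual characters do not break the query
                    db_tables = spark.sql(f"SHOW TABLES IN `{db_name}`").toPandas()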
                    
                    for _, row in db_tables.iterrows():
                        table_name = row['tableName']
                        table_name_lower = table_name.lower()
                        
                        if table_name_lower in expected_saleslt_tables:
                            category = expected_saleslt_tables[table_name_lower]
                            
                            source_tables_info.append({
                                'table_name': table_name,
                                'category': category,
                                'source_reference': f"{db_name}.{table_name}"
                            })
                            
                            print(f"   ✅ Found: {db_name}.{table_name} ({category})")
                            
                except Exception as e:
                    print(f"   ❌ Could not access {db_name}: {str(e)[:50]}...")
    
    print(f"\n📊 DISCOVERY SUMMARY:")
    print(f"   🔍 Total tables found: {len(source_tables_info)}")
    print(f"   📋 Expected tables: {len(expected_saleslt_tables)}")
    
    if len(source_tables_info) == 0:
        print(f"\n❌ NO SALESLT TABLES FOUND")
        print(f"🔧 TROUBLESHOOTING:")
        print(f"   1. Ensure Gaiye_Test_Lakehouse is attached as additional lakehouse")
        print(f"   2. Check that shortcuts are visible in the source lakehouse")
        print(f"   3. Verify table names match expected format")
        raise Exception("No source tables found")
    
    elif len(source_tables_info) < len(expected_saleslt_tables):
        missing_count = len(expected_saleslt_tables) - len(source_tables_info)
        print(f"\n⚠️ PARTIAL DISCOVERY: {missing_count} tables missing")
        print(f"   💡 You can proceed with available tables")
    
    else:
        print(f"\n🎉 COMPLETE DISCOVERY: All expected tables found!")

except Exception as e:
    print(f"❌ Discovery failed: {str(e)}")
    raise

print("=" * 60)

In [None]:
# Step 4: Copy Function - Source to Bronze
def copy_table_to_bronze(table_info):
    """Copy a table from source lakehouse to bronze layer"""
    table_name = table_info['table_name']
    category = table_info['category']
    source_ref = table_info['source_reference']
    
    print(f"📦 Copying {table_name} ({category})")
    print(f"   🔗 Source: {source_ref}")
    
    try:
        # Read from source table
        df = spark.sql(f"SELECT * FROM {source_ref}")
        
        # Get basic info
        row_count = df.count()
        columns = df.columns
        
        print(f"   📊 Loaded: {row_count:,} rows, {len(columns)} columns")
        
        # Add bronze layer metadata (capture the load time once so both columns agree)
        load_time = datetime.now()
        df_with_metadata = df \
            .withColumn("_bronze_load_date", lit(load_time.strftime("%Y-%m-%d"))) \
            .withColumn("_bronze_load_timestamp", lit(load_time.strftime("%Y-%m-%d %H:%M:%S"))) \
            .withColumn("_source_system", lit("SalesLT")) \
            .withColumn("_source_table", lit(table_name)) \
            .withColumn("_source_lakehouse", lit("Gaiye_Test_Lakehouse")) \
            .withColumn("_load_method", lit("lakehouse_copy"))
        
        # Create bronze path
        bronze_path = f"Files/bronze/saleslt/{table_name.lower()}"
        
        print(f"   💾 Writing to bronze: {bronze_path}")
        
        # Write to bronze layer. Overwrite mode replaces the whole folder, so no
        # schema option is needed ("overwriteSchema" is a Delta option, not a Parquet one).
        df_with_metadata.write \
            .mode("overwrite") \
            .parquet(bronze_path)
        
        print(f"   ✅ Copy completed: {row_count:,} rows")
        
        return {
            "success": True,
            "table_name": table_name,
            "row_count": row_count,
            "column_count": len(columns),
            "bronze_path": bronze_path,
            "category": category
        }
        
    except Exception as e:
        error_msg = str(e)
        print(f"   ❌ Copy failed: {error_msg[:100]}...")
        
        return {
            "success": False,
            "table_name": table_name,
            "error": error_msg
        }

print("⚙️ COPY FUNCTION READY")
print("✅ Source: Gaiye_Test_Lakehouse shortcuts")
print("✅ Target: Bronze lakehouse Files/bronze/saleslt/")
print("✅ Metadata tracking included")

In [None]:
# Step 5: Execute Copy Process
print("🚀 STARTING COPY TO BRONZE LAYER")
print("=" * 60)
print(f"📅 Copy Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"📊 Tables to Copy: {len(source_tables_info)}")
print(f"🔗 Source: Gaiye_Test_Lakehouse")
print(f"🎯 Target: Bronze lakehouse")
print()

copy_results = []
successful_copies = 0
failed_copies = 0

import time
start_time = time.time()

# Copy each table
for i, table_info in enumerate(source_tables_info, 1):
    print(f"[{i}/{len(source_tables_info)}] {table_info['table_name']}")
    print("-" * 40)
    
    result = copy_table_to_bronze(table_info)
    copy_results.append(result)
    
    if result['success']:
        successful_copies += 1
    else:
        failed_copies += 1
    
    print()  # Add spacing between tables

end_time = time.time()
duration = end_time - start_time

# Copy Summary
print("🎉 COPY SUMMARY")
print("=" * 60)
print(f"⏱️ Total Duration: {duration:.1f} seconds")
print(f"✅ Successful Copies: {successful_copies}")
print(f"❌ Failed Copies: {failed_copies}")
print(f"📊 Total Tables: {len(source_tables_info)}")

if successful_copies > 0:
    print(f"\n📁 BRONZE LAYER STRUCTURE:")
    print(f"Files/bronze/saleslt/")
    
    # Group by category
    from collections import defaultdict
    by_category = defaultdict(list)
    
    for result in copy_results:
        if result['success']:
            table_info = next((t for t in source_tables_info if t['table_name'] == result['table_name']), None)
            if table_info:
                by_category[table_info['category']].append(result)
    
    for category, results in by_category.items():
        print(f"   📂 {category}:")
        for result in results:
            print(f"      📄 {result['table_name'].lower()}/ ({result['row_count']:,} rows)")
    
    print(f"\n🎯 NEXT STEPS:")
    print(f"   1. ✅ Data now available in bronze lakehouse")
    print(f"   2. 🔍 Verify data in Files/bronze/saleslt/ folders")
    print(f"   3. 📊 Create silver layer transformations") 
    print(f"   4. 🔗 Build analytics and reports")

if failed_copies > 0:
    print(f"\n⚠️ COPY FAILURES:")
    for result in copy_results:
        if not result['success']:
            print(f"   ❌ {result['table_name']}: {result.get('error', 'Unknown error')[:80]}...")

print(f"\n{'='*60}")
print(f"📋 Copy process completed at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
if successful_copies > 0:
    print(f"🎯 Bronze data now available in target lakehouse!")
else:
    print(f"⚠️ No tables were copied; review the failures above")
print(f"{'='*60}")

In [None]:
# Step 6: Validation - Check Bronze Layer in Target Lakehouse
print("🔍 BRONZE LAYER VALIDATION")
print("=" * 60)

try:
    # Check Files directory structure
    try:
        files_list = dbutils.fs.ls("Files/")
        print(f"📁 Files directory contains {len(files_list)} items:")
        for file_info in files_list:
            print(f"   📂 {file_info.name}")
        
        # Check bronze directory
        try:
            bronze_list = dbutils.fs.ls("Files/bronze/")
            print(f"\n📁 Bronze directory contains {len(bronze_list)} items:")
            for file_info in bronze_list:
                print(f"   📂 {file_info.name}")
            
            # Check saleslt directory
            try:
                saleslt_list = dbutils.fs.ls("Files/bronze/saleslt/")
                print(f"\n📁 SalesLT directory contains {len(saleslt_list)} tables:")
                
                total_files = 0
                for file_info in saleslt_list:
                    if file_info.isDir():
                        table_name = file_info.name.rstrip('/')
                        try:
                            table_files = dbutils.fs.ls(f"Files/bronze/saleslt/{table_name}/")
                            parquet_files = [f for f in table_files if f.name.endswith('.parquet')]
                            total_files += len(parquet_files)
                            print(f"   📄 {table_name}/ ({len(parquet_files)} parquet files)")
                        except Exception:
                            print(f"   📄 {table_name}/ (could not read contents)")
                
                print(f"\n📊 SUMMARY:")
                print(f"   📂 Table directories: {len(saleslt_list)}")
                print(f"   📄 Total parquet files: {total_files}")
                
            except Exception as e:
                print(f"\n❌ Could not access Files/bronze/saleslt/: {str(e)}")
                
        except Exception as e:
            print(f"\n❌ Could not access Files/bronze/: {str(e)}")
            
    except Exception as e:
        print(f"❌ Could not access Files/: {str(e)}")
        print("💡 Using alternative method...")
        
        # Alternative: Try mssparkutils
        try:
            from notebookutils import mssparkutils
            files_list = mssparkutils.fs.ls("Files/bronze/saleslt/")
            print(f"✅ Found {len(files_list)} items in bronze/saleslt/:")
            for file_info in files_list:
                print(f"   📂 {file_info.name}")
        except Exception as alt_e:
            print(f"❌ Alternative method failed: {str(alt_e)}")
    
    # Quick data validation
    if successful_copies > 0:
        print(f"\n🔍 QUICK DATA VALIDATION:")
        
        # Read back the first table that copied successfully
        sample_table = next(r['table_name'].lower() for r in copy_results if r['success'])
        
        try:
            sample_df = spark.read.parquet(f"Files/bronze/saleslt/{sample_table}")
            sample_count = sample_df.count()
            sample_columns = sample_df.columns
            
            print(f"   ✅ {sample_table}: {sample_count:,} rows, {len(sample_columns)} columns")
            print(f"   📋 Sample columns: {', '.join(sample_columns[:5])}...")
            
            # Show metadata columns
            metadata_cols = [col for col in sample_columns if col.startswith('_')]
            if metadata_cols:
                print(f"   🏷️ Metadata columns: {', '.join(metadata_cols)}")
            
        except Exception as e:
            print(f"   ⚠️ Could not read sample data: {str(e)[:100]}...")

except Exception as e:
    print(f"❌ Validation error: {str(e)}")

print(f"\n✅ Validation completed")
print(f"🎯 Check your bronze lakehouse Files section for the copied data!")
print("=" * 60)

## Success! 🎉

If the cells above finished without errors, your SalesLT data has been copied from the shortcuts in `Gaiye_Test_Lakehouse` to the bronze layer in your target lakehouse. Check the copy summary for any failed tables before moving on.

### What was accomplished:
- ✅ **Source detected**: Found SalesLT tables from Gaiye_Test_Lakehouse shortcuts
- ✅ **Data copied**: Each discovered table written to the bronze layer with metadata tracking
- ✅ **Bronze layer created**: Organized structure in `Files/bronze/saleslt/`
- ✅ **Metadata added**: Source tracking and load timestamps included

### Bronze Layer Structure:
```
Files/bronze/saleslt/
├── address/           # Address information with parquet files
├── customer/          # Customer master data
├── customeraddress/   # Customer-address relationships  
├── product/           # Product catalog
├── productcategory/   # Product categories
├── productdescription/ # Product descriptions
├── productmodel/      # Product models
├── productmodelproductdescription/ # Model-description links
├── salesorderdetail/  # Order line items
└── salesorderheader/  # Order headers
```
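
Each folder in the tree above is derived from the lowercased table name. A small sketch of that mapping, assuming the same `Files/bronze/saleslt/` root used by the copy function; `bronze_path` is a hypothetical helper, not part of the notebook.

```python
BRONZE_ROOT = "Files/bronze/saleslt"


def bronze_path(table_name: str, root: str = BRONZE_ROOT) -> str:
    """Map a source table name to its bronze folder (lowercased)."""
    return f"{root.rstrip('/')}/{table_name.lower()}"


for name in ["Customer", "SalesOrderHeader"]:
    print(bronze_path(name))

# In the notebook, a folder could then be read back with:
#   spark.read.parquet(bronze_path("Customer"))
```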

### Next Steps:
1. **Verify in Lakehouse**: Check Files → bronze → saleslt in your target lakehouse
2. **Data Quality Check**: Review row counts and data completeness
3. **Create Silver Layer**: Build cleaned and standardized datasets
4. **Build Analytics**: Connect to Power BI or create reports
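
For the data quality check in step 2, one simple approach is to compare the row counts recorded at copy time with counts read back from the bronze folders. The sketch below shows only the comparison logic. `count_mismatches` and the sample numbers are hypothetical; in the notebook the two dictionaries could be built from `copy_results` and from `spark.read.parquet(...).count()`.

```python
def count_mismatches(source_counts, bronze_counts):
    """Return {table: (source, bronze)} where counts differ or bronze is missing."""
    mismatches = {}
    for table, expected in source_counts.items():
        actual = bronze_counts.get(table)  # None if the table was never written
        if actual != expected:
            mismatches[table] = (expected, actual)
    return mismatches


# Made-up numbers for illustration only
source_counts = {"customer": 100, "product": 50, "address": 75}
bronze_counts = {"customer": 100, "product": 49}
print(count_mismatches(source_counts, bronze_counts))
```

An empty result means every table listed in `source_counts` has a matching count in bronze.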

### Key Features:
- **Metadata Tracking**: Each row includes `_bronze_load_date`, `_bronze_load_timestamp`, `_source_system`, `_source_table`, `_source_lakehouse`, and `_load_method`
- **Overwrite Mode**: Re-running replaces each table's folder with a fresh copy
- **Parquet Format**: Optimized for analytics and compression
- **Organized Structure**: Each table in its own folder for easy access
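
The metadata columns can be thought of as one mapping computed once per load, so the date and timestamp always agree. This is a sketch rather than the notebook's exact code: `bronze_metadata` is a hypothetical helper, and applying it to a DataFrame would be a loop of `withColumn(name, lit(value))` calls.

```python
from datetime import datetime


def bronze_metadata(table_name, loaded_at=None):
    """Build the bronze metadata values for a single table load."""
    loaded_at = loaded_at or datetime.now()  # one timestamp shared by both columns
    return {
        "_bronze_load_date": loaded_at.strftime("%Y-%m-%d"),
        "_bronze_load_timestamp": loaded_at.strftime("%Y-%m-%d %H:%M:%S"),
        "_source_system": "SalesLT",
        "_source_table": table_name,
        "_source_lakehouse": "Gaiye_Test_Lakehouse",
        "_load_method": "lakehouse_copy",
    }


for column, value in bronze_metadata("Customer").items():
    print(f"{column} = {value}")
```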

Your retail data model bronze layer is now ready for the next phase of your data pipeline! 🚀