# Spark Table Recovery from MinIO S3

This notebook provides functions to scan a Spark database directory in MinIO S3 and recreate table definitions that point to existing data. This is useful for scenarios where:

1. Your Spark metastore was lost or corrupted
2. You need to access existing data from a different Spark cluster
3. You need to recover from a Delta Lake table metadata loss

The tool will preserve all existing data and simply recreate the table definitions in the Spark catalog.

## Import Required Libraries

In [1]:
import boto3
from urllib.parse import urlparse
import os
import json
from pyspark.sql import SparkSession

## Create Spark Session

In [2]:


# Create a Spark session with Delta Lake and S3 support
def create_spark_session(app_name="Table Recovery", aws_access_key=None, aws_secret_key=None):
    """Create a Spark session configured for Delta Lake with S3 access."""
    aws_access_key = aws_access_key or "minioadmin"
    aws_secret_key = aws_secret_key or "minioadmin"
    
    # Stop any existing session
    try:
        SparkSession.builder.getOrCreate().stop()
        print("Stopped existing Spark session")
    except:
        print("No existing Spark session to stop")
    
    # Create new session
    spark = SparkSession.builder \
        .appName(app_name) \
        .master("local[*]") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
        .config("spark.hadoop.fs.s3a.access.key", aws_access_key) \
        .config("spark.hadoop.fs.s3a.secret.key", aws_secret_key) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
        .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
        .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:postgresql://localhost:5432/metastore_db") \
        .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver") \
        .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "admin") \
        .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "admin") \
        .config("spark.sql.warehouse.dir", "s3a://wba/warehouse") \
               .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:postgresql://localhost:5432/metastore_db") \
               .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver") \
               .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "admin") \
               .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "admin") \
               .config("spark.hadoop.hive.metastore.uris", "thrift://localhost:9083") \
               .config("spark.sql.warehouse.dir", "s3a://wba/warehouse") \
               .config("spark.jars.excludes", "org.slf4j:slf4j-log4j12,org.slf4j:slf4j-reload4j,org.slf4j:log4j-slf4j-impl") \
               .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
               .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
               .config("spark.hadoop.fs.s3a.access.key", aws_access_key or "minioadmin") \
               .config("spark.hadoop.fs.s3a.secret.key", aws_secret_key or "minioadmin") \
               .config("spark.hadoop.fs.s3a.path.style.access", "true") \
               .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
               .config("spark.hadoop.fs.s3a.fast.upload", "true") \
               .config("spark.hadoop.fs.s3a.multipart.size", "5242880") \
               .config("spark.hadoop.fs.s3a.block.size", "5242880") \
               .config("spark.hadoop.fs.s3a.multipart.threshold", "5242880") \
               .config("spark.hadoop.fs.s3a.threads.core", "10") \
               .config("spark.hadoop.fs.s3a.threads.max", "20") \
               .config("spark.hadoop.fs.s3a.max.total.tasks", "50") \
               .config("spark.hadoop.fs.s3a.connection.timeout", "60000") \
               .config("spark.hadoop.fs.s3a.connection.establish.timeout", "60000") \
               .config("spark.hadoop.fs.s3a.socket.timeout", "60000") \
               .config("spark.hadoop.fs.s3a.connection.maximum", "50") \
               .config("spark.hadoop.fs.s3a.fast.upload.buffer", "bytebuffer") \
               .config("spark.hadoop.fs.s3a.fast.upload.active.blocks", "2") \
               .config("spark.hadoop.fs.s3a.multipart.purge", "false") \
               .config("spark.hadoop.fs.s3a.multipart.purge.age", "86400000") \
               .config("spark.hadoop.fs.s3a.retry.limit", "10") \
               .config("spark.hadoop.fs.s3a.retry.interval", "1000") \
               .config("spark.hadoop.fs.s3a.attempts.maximum", "10") \
               .config("spark.hadoop.fs.s3a.connection.request.timeout", "60000") \
               .config("spark.hadoop.fs.s3a.threads.keepalivetime", "60000") \
                .config("spark.executor.cores","1") \
                .config("spark.executor.instances","1") \
                .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
                .config("spark.hadoop.hive.metastore.warehouse.dir", "s3a://wba/warehouse") \
                .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
                .config("spark.sql.hive.convertMetastoreParquet", "false") \
                .config("spark.sql.hive.metastorePartitionPruning", "true")  \
                .enableHiveSupport()  \
                .getOrCreate()
    
    return spark

# Create the session
spark = create_spark_session()

your 131072x1 screen size is bogus. expect trouble
25/04/06 15:21:42 WARN Utils: Your hostname, JBLAPTOPW11 resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/04/06 15:21:42 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/06 15:21:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/04/06 15:21:46 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


Stopped existing Spark session


25/04/06 15:21:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## Table Recovery Function

In [3]:
def recreate_tables_from_minio(spark, warehouse_dir, database_name, aws_access_key=None, aws_secret_key=None):
    """
    Scans a Spark database directory in MinIO S3 and generates SQL script
    to recreate table definitions pointing to the existing data.
    
    Args:
        spark: SparkSession object
        warehouse_dir: S3 path to Spark warehouse (e.g., "s3a://wba/warehouse")
        database_name: Database name to scan
        aws_access_key: MinIO access key (defaults to "minioadmin")
        aws_secret_key: MinIO secret key (defaults to "minioadmin")
    
    Returns:
        Tuple of (sql_script, results_dict)
        - sql_script: Complete SQL script as a string
        - results_dict: Dictionary mapping table names to status
    """
    aws_access_key = aws_access_key or "minioadmin"
    aws_secret_key = aws_secret_key or "minioadmin"
    
    print(f"Scanning database {database_name} in warehouse {warehouse_dir}")
    
    # Initialize sql script string
    sql_script = f"-- SQL Script to recreate tables from MinIO warehouse: {warehouse_dir}\n"
    sql_script += f"-- Database: {database_name}\n\n"
    
    # Add CREATE DATABASE statement
    sql_script += f"CREATE DATABASE IF NOT EXISTS {database_name};\n\n"
    
    # Parse the S3 URI to get bucket and prefix
    parsed_uri = urlparse(warehouse_dir)
    bucket_name = parsed_uri.netloc
    prefix = parsed_uri.path.lstrip('/')
    
    if prefix and not prefix.endswith('/'):
        prefix += '/'
    
    # Full path to the database directory
    db_prefix = f"{prefix}{database_name}.db/"
    
    # Initialize S3 client
    s3_client = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",  # adjust to your MinIO endpoint
        aws_access_key_id=aws_access_key,
        aws_secret_access_key=aws_secret_key
    )
    
    # Get existing tables in Spark catalog
    try:
        existing_tables = [table.name for table in spark.catalog.listTables(database_name)]
        print(f"Found {len(existing_tables)} existing tables in Spark catalog: {existing_tables}")
    except:
        # Database might not exist
        existing_tables = []
        print(f"No existing tables found in database {database_name}")
    
    # List all folders in the database directory (each folder is a table)
    try:
        print(f"Listing objects in bucket '{bucket_name}' with prefix '{db_prefix}'")
        paginator = s3_client.get_paginator('list_objects_v2')
        page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=db_prefix, Delimiter='/')
        
        table_folders = []
        for page in page_iterator:
            # Process common prefixes (directories)
            if 'CommonPrefixes' in page:
                for common_prefix in page['CommonPrefixes']:
                    prefix_path = common_prefix['Prefix']
                    # Get the table name from the path
                    table_name = os.path.basename(prefix_path.rstrip('/'))
                    table_folders.append((table_name, prefix_path))
        
        if not table_folders:
            print(f"No table directories found in {warehouse_dir}/{database_name}.db/")
            return sql_script, {}
        
        print(f"Found {len(table_folders)} table directories: {[t[0] for t in table_folders]}")
        
        # Process each table folder
        results = {}
        for table_name, folder_path in table_folders:
            print(f"\nProcessing table: {table_name}")
            
            # Check if the table already exists in Spark catalog
            if table_name in existing_tables:
                print(f"Table {table_name} already exists in Spark catalog. Will generate DROP statement.")
                sql_script += f"-- Table {table_name} already exists\n"
                results[table_name] = "already exists"
            
            # Path to the table in S3
            table_path = f"{warehouse_dir}/{database_name}.db/{table_name}"
            
            # Check if this is a Delta table by looking for _delta_log directory
            delta_log_prefix = f"{folder_path}_delta_log/"
            delta_log_response = s3_client.list_objects_v2(
                Bucket=bucket_name, 
                Prefix=delta_log_prefix,
                MaxKeys=1
            )
            
            is_delta = 'Contents' in delta_log_response and len(delta_log_response['Contents']) > 0
            
            if is_delta:
                print(f"Found Delta table at {table_path}")
                
                # Try to read the table schema and properties from existing Delta files
                try:
                    # Check for partition information
                    partition_cols = []
                    try:
                        # Look for partition metadata in Delta log
                        metadata_prefix = f"{folder_path}_delta_log/00000000000000000000.json"
                        response = s3_client.get_object(Bucket=bucket_name, Key=metadata_prefix)
                        metadata = json.loads(response['Body'].read().decode('utf-8'))
                        
                        if 'partitionColumns' in metadata:
                            partition_cols = metadata['partitionColumns']
                            print(f"Detected partition columns: {partition_cols}")
                    except Exception as e:
                        print(f"Error reading partition info: {str(e)}")
                    
                    # Generate DROP statement if table exists
                    sql_script += f"-- Drop table if it exists\n"
                    sql_script += f"DROP TABLE IF EXISTS {database_name}.{table_name};\n\n"
                    
                    # Generate CREATE TABLE statement
                    sql_script += f"-- Create Delta table pointing to existing data\n"
                    sql_script += f"CREATE TABLE {database_name}.{table_name}\n"
                    sql_script += f"USING DELTA\n"
                    sql_script += f"LOCATION '{table_path}';\n\n"
                    
                    results[table_name] = "success"
                    print(f"Generated SQL for table {database_name}.{table_name}")
                    
                except Exception as e:
                    error_msg = f"Error generating SQL for table {table_name}: {str(e)}"
                    print(error_msg)
                    sql_script += f"-- {error_msg}\n\n"
                    results[table_name] = f"error: {str(e)}"
            else:
                note = f"No Delta log found for {table_name}. This might not be a Delta table."
                print(note)
                sql_script += f"-- {note}\n\n"
                results[table_name] = "not a Delta table"
        
        return sql_script, results
    
    except Exception as e:
        error_msg = f"Error scanning database directory: {str(e)}"
        print(error_msg)
        sql_script += f"-- {error_msg}\n"
        return sql_script, {"error": str(e)}

## Run the Table Recovery Process

In [4]:

# Configure these parameters for your environment
warehouse_dir = "s3a://wba/warehouse"  # Change to your warehouse directory
aws_access_key = "minioadmin"          # Change if not using default
aws_secret_key = "minioadmin"          # Change if not using default


# database_name = "ehr"                  # Change to your database name

# # Check if database exists and drop if requested to recreate
# databases = [db.name for db in spark.catalog.listDatabases()]

# if database_name in databases:
#     print(f"Database '{database_name}' already exists.")
#     # Option to return early if you don't want to recreate an existing database
#     # return {"status": "Database already exists, no action taken"}

# # Create database if it doesn't exist
# db_location = f"{warehouse_dir}/{database_name}.db"
# spark.sql(f"CREATE DATABASE IF NOT EXISTS {database_name}")
# print(f"Database '{database_name}' created or confirmed at location '{db_location}'")

# # Use the database
# spark.sql(f"USE {database_name}")

# # Run the recovery process
# sql_script, results = recreate_tables_from_minio(
#     spark=spark,
#     warehouse_dir=warehouse_dir,
#     database_name=database_name,
#     aws_access_key=aws_access_key,
#     aws_secret_key=aws_secret_key
# )

# # Print results
# print("\nTable recreation results:")
# for table, status in results.items():
#     print(f"{table}: {status}")
    
# #  write to a file
# file_name = f'0_recreate_tables_{database_name}.sql'
# with open(file_name, 'w') as f:
#     f.write(sql_script)


database_name = "omop531"                  # Change to your database name
    
# Run the recovery process
sql_script, results = recreate_tables_from_minio(
    spark=spark,
    warehouse_dir=warehouse_dir,
    database_name=database_name,
    aws_access_key=aws_access_key,
    aws_secret_key=aws_secret_key
)

# Print results
print("\nTable recreation results:")
for table, status in results.items():
    print(f"{table}: {status}")
    
#  write to a file
file_name = f'0_recreate_tables_{database_name}.sql'
with open(file_name, 'w') as f:
    f.write(sql_script)




Scanning database omop531 in warehouse s3a://wba/warehouse


25/04/06 15:21:51 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
25/04/06 15:21:51 WARN VersionInfoUtils: The AWS SDK for Java 1.x entered maintenance mode starting July 31, 2024 and will reach end of support on December 31, 2025. For more information, see https://aws.amazon.com/blogs/developer/the-aws-sdk-for-java-1-x-is-in-maintenance-mode-effective-july-31-2024/
You can print where on the file system the AWS SDK for Java 1.x core runtime is located by setting the AWS_JAVA_V1_PRINT_LOCATION environment variable or aws.java.v1.printLocation system property to 'true'.
This message can be disabled by setting the AWS_JAVA_V1_DISABLE_DEPRECATION_ANNOUNCEMENT environment variable or aws.java.v1.disableDeprecationAnnouncement system property to 'true'.
The AWS SDK for Java 1.x is being used here:
at java.base/java.lang.Thread.getStackTrace(Thread.java:1619)
at com.amazonaws.util.VersionInfoUtils.printDeprecationAn

Found 1 existing tables in Spark catalog: ['concept']
Listing objects in bucket 'wba' with prefix 'warehouse/omop531.db/'
Found 39 table directories: ['attribute_definition', 'care_site', 'cdm_source', 'cohort', 'cohort_attribute', 'cohort_definition', 'concept', 'concept_ancestor', 'concept_class', 'concept_relationship', 'concept_synonym', 'condition_era', 'condition_occurrence', 'cost', 'death', 'device_exposure', 'domain', 'dose_era', 'drug_era', 'drug_exposure', 'drug_strength', 'fact_relationship', 'location', 'measurement', 'metadata', 'note', 'note_nlp', 'observation', 'observation_period', 'payer_plan_period', 'person', 'procedure_occurrence', 'provider', 'relationship', 'source_to_concept_map', 'specimen', 'visit_detail', 'visit_occurrence', 'vocabulary']

Processing table: attribute_definition
Found Delta table at s3a://wba/warehouse/omop531.db/attribute_definition
Error reading partition info: Extra data: line 2 column 1 (char 359)
Generated SQL for table omop531.attribute_

## List Tables After Recovery

In [5]:
# List all tables in the database after recovery
print(f"\nTables in database {database_name} after recovery:")
for table in spark.catalog.listTables(database_name):
    print(f"- {table.name} ({table.tableType})")


Tables in database omop531 after recovery:
- concept (EXTERNAL)


## Query a Recovered Table

Now you can test one of the recovered tables by querying it:

In [6]:
# Replace 'table_name' with one of your recovered tables
table_name = "patients"  # Change this to one of your actual tables

try:
    # Check if table exists
    if spark.catalog.tableExists(f"{database_name}.{table_name}"):
        # Read the table
        df = spark.table(f"{database_name}.{table_name}")
        
        # Show schema
        print(f"Schema for {database_name}.{table_name}:")
        df.printSchema()
        
        # Show sample data
        print(f"\nSample data from {database_name}.{table_name}:")
        df.show(5)
        
        # Show count
        count = df.count()
        print(f"\nTotal records: {count}")
    else:
        print(f"Table {database_name}.{table_name} does not exist.")
except Exception as e:
    print(f"Error querying table: {str(e)}")

Table omop531.patients does not exist.
