# TCGA Project Setup - Configuration Hub

This notebook is the **single source of truth** for project configuration.

## What this notebook does:
1. Collects configuration parameters from user input
2. Creates/updates the main `config.json` file
3. Creates Unity Catalog resources (catalog, schema, volume)
4. Validates the setup

All downstream components (ETL pipelines, jobs, analysis notebooks) will read from the generated `config.json`.

## Step 1: Configure Your Environment

Provide the following parameters for your TCGA project:

In [None]:
# Create widgets for user input
dbutils.widgets.text("catalog_name", "kermany", "Unity Catalog Name")
dbutils.widgets.text("schema_name", "tcga", "Schema Name")
dbutils.widgets.text("volume_name", "tcga_files", "Volume Name")

# Get values from widgets, trimming stray whitespace so blank input fails validation
catalog = dbutils.widgets.get("catalog_name").strip()
schema = dbutils.widgets.get("schema_name").strip()
volume = dbutils.widgets.get("volume_name").strip()

# Validate inputs
if not catalog or catalog == "<CHANGE TO YOUR CATALOG NAME>":
    raise ValueError("Please provide a valid catalog name")
if not schema:
    raise ValueError("Please provide a valid schema name")
if not volume:
    raise ValueError("Please provide a valid volume name")

# Display configuration
print("="*60)
print("TCGA Project Configuration")
print("="*60)
print(f"Catalog: {catalog}")
print(f"Schema: {schema}")
print(f"Volume: {volume}")
print(f"Volume Path: /Volumes/{catalog}/{schema}/{volume}")
print(f"Database: {catalog}.{schema}")
print("="*60)
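Because these names are later interpolated into SQL statements and into the `/Volumes/...` path, it can help to reject unexpected characters up front. The sketch below is one possible check and is not part of the original notebook: `validate_identifier` is a hypothetical helper, and the pattern (letters, digits, underscores) is a deliberately conservative assumption rather than the full set of names Unity Catalog accepts.

```python
import re

# Conservative assumption: allow only letters, digits, and underscores.
# Unity Catalog itself permits more than this; relax the pattern if needed.
_IDENTIFIER_RE = re.compile(r"[A-Za-z0-9_]+")


def validate_identifier(value: str, label: str) -> str:
    """Return the stripped value, or raise ValueError if it looks unsafe."""
    cleaned = value.strip()
    if not _IDENTIFIER_RE.fullmatch(cleaned):
        raise ValueError(f"Invalid {label}: {value!r}")
    return cleaned


# Example with the widget defaults used above
catalog_checked = validate_identifier("kermany", "catalog name")
schema_checked = validate_identifier("tcga", "schema name")
volume_checked = validate_identifier("tcga_files", "volume name")
print(catalog_checked, schema_checked, volume_checked)
```

In the notebook, the same function could wrap each `dbutils.widgets.get(...)` call in place of the separate emptiness checks.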

## Step 2: Generate Configuration File

Create the main `config.json` that will be used by all components.

In [None]:
import json
import os

# Define configuration structure
config = {
    "lakehouse": {
        "catalog": catalog,
        "schema": schema,
        "volume": volume
    },
    "api_paths": {
        "cases_endpt": "https://api.gdc.cancer.gov/cases",
        "files_endpt": "https://api.gdc.cancer.gov/files",
        "data_endpt": "https://api.gdc.cancer.gov/data/"
    },
    "pipeline": {
        "max_workers": 64,
        "max_records": 20000,
        "force_download": False,
        "retry_attempts": 3,
        "timeout_seconds": 300
    }
}

# Write configuration to file
config_path = os.path.abspath('./config.json')
with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)

print("✓ Configuration file created/updated at:", config_path)
print("\nConfiguration contents:")
print(json.dumps(config, indent=2))
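Downstream code can load this file and fail early if an expected section is missing. A minimal sketch, assuming the structure written above; `load_config`, `volume_path_from`, and `REQUIRED_SECTIONS` are hypothetical names and not part of the project:

```python
import json

# Top-level sections written by the cell above
REQUIRED_SECTIONS = ("lakehouse", "api_paths", "pipeline")


def load_config(path: str = "./config.json") -> dict:
    """Load config.json and check that the expected sections exist."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    missing = [name for name in REQUIRED_SECTIONS if name not in cfg]
    if missing:
        raise KeyError(f"config is missing sections: {missing}")
    return cfg


def volume_path_from(cfg: dict) -> str:
    """Build the Unity Catalog volume path from the lakehouse section."""
    lake = cfg["lakehouse"]
    return f"/Volumes/{lake['catalog']}/{lake['schema']}/{lake['volume']}"
```

A consumer would then call `cfg = load_config()` and read values such as `cfg["pipeline"]["max_workers"]` instead of hard-coding them.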

## Step 3: Create Unity Catalog Resources

Create the catalog, schema, and volume if they don't already exist.

In [None]:
# Note: Creating a catalog requires the CREATE CATALOG privilege on the metastore
# (metastore admins have it). Uncomment the line below if you have that permission.
# spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")

print("Creating Unity Catalog resources...\n")

# Create schema
try:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
    print(f"✓ Schema {catalog}.{schema} created or already exists")
except Exception as e:
    # The exception is re-raised, so report it as an error rather than a warning
    print(f"✗ Could not create schema: {e}")
    print(f"  Make sure catalog '{catalog}' exists and you have permissions.")
    raise

# Create volume  
try:
    spark.sql(f"CREATE VOLUME IF NOT EXISTS {catalog}.{schema}.{volume}")
    print(f"✓ Volume {catalog}.{schema}.{volume} created or already exists")
except Exception as e:
    # The exception is re-raised, so report it as an error rather than a warning
    print(f"✗ Could not create volume: {e}")
    raise

print("\n✓ Setup complete! Unity Catalog resources are ready.")
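The statements above interpolate the names directly, which works for plain identifiers. If a name might not be a plain identifier, one option is to backtick-quote each part. The sketch below only builds the SQL strings, so it can be inspected without a Spark session; `quote_ident` and `build_ddl` are hypothetical helpers, not part of the notebook.

```python
def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_ddl(catalog: str, schema: str, volume: str) -> list[str]:
    """Return the CREATE statements for the schema and the volume, in order."""
    schema_fqn = f"{quote_ident(catalog)}.{quote_ident(schema)}"
    volume_fqn = f"{schema_fqn}.{quote_ident(volume)}"
    return [
        f"CREATE SCHEMA IF NOT EXISTS {schema_fqn}",
        f"CREATE VOLUME IF NOT EXISTS {volume_fqn}",
    ]


for statement in build_ddl("kermany", "tcga", "tcga_files"):
    print(statement)  # in the notebook this would be spark.sql(statement)
```

Keeping statement construction separate from execution also makes it easy to log exactly what will run before anything is created.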

## Step 4: Verify Setup

Verify that the resources were created successfully.

In [None]:
# Verify schema exists
print("Verifying resources...\n")

try:
    schemas_df = spark.sql(f"SHOW SCHEMAS IN {catalog}")
    # Unity Catalog stores names in lowercase, so compare case-insensitively
    schema_exists = schemas_df.filter(f"lower(databaseName) = '{schema.lower()}'").count() > 0
    
    if schema_exists:
        print(f"✓ Schema {catalog}.{schema} verified")
    else:
        print(f"⚠ Schema {catalog}.{schema} not found")
except Exception as e:
    print(f"⚠ Could not verify schema: {str(e)}")

# Verify volume exists
try:
    volumes_df = spark.sql(f"SHOW VOLUMES IN {catalog}.{schema}")
    # Unity Catalog stores names in lowercase, so compare case-insensitively
    volume_exists = volumes_df.filter(f"lower(volume_name) = '{volume.lower()}'").count() > 0
    
    if volume_exists:
        print(f"✓ Volume {catalog}.{schema}.{volume} verified")
        
        # Get volume details
        volume_path = f'/Volumes/{catalog}/{schema}/{volume}'
        print(f"  Volume path: {volume_path}")
    else:
        print(f"⚠ Volume {catalog}.{schema}.{volume} not found")
except Exception as e:
    print(f"⚠ Could not verify volume: {str(e)}")
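As a complementary check, the volume can also be probed through its path. The sketch below assumes the compute exposes volumes under `/Volumes/...` as ordinary directories, which is an assumption about the environment. `volume_is_accessible` is a hypothetical helper, and its `root` parameter exists only so the function can be tried outside Databricks.

```python
import os


def volume_is_accessible(catalog: str, schema: str, volume: str,
                         root: str = "/Volumes") -> bool:
    """Return True if the volume path exists and is a writable directory."""
    path = os.path.join(root, catalog, schema, volume)
    return os.path.isdir(path) and os.access(path, os.W_OK)


print(volume_is_accessible("kermany", "tcga", "tcga_files"))
```

A `False` result here, after `SHOW VOLUMES` has listed the volume, would point to a permissions or compute-access issue rather than a missing resource.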

## Step 5: Export Configuration for Other Notebooks

Define the variables that other notebooks receive when they include this one via `%run`.

In [None]:
# Export variables for notebooks that use %run ./00-setup
volume_path = f'/Volumes/{catalog}/{schema}/{volume}'
database_name = f'{catalog}.{schema}'

# API endpoints
cases_endpt = config['api_paths']['cases_endpt']
files_endpt = config['api_paths']['files_endpt']
data_endpt = config['api_paths']['data_endpt']

print("\n" + "="*60)
print("Configuration variables exported:")
print("="*60)
print(f"catalog = '{catalog}'")
print(f"schema = '{schema}'")
print(f"volume = '{volume}'")
print(f"volume_path = '{volume_path}'")
print(f"database_name = '{database_name}'")
print(f"cases_endpt = '{cases_endpt}'")
print(f"files_endpt = '{files_endpt}'")
print(f"data_endpt = '{data_endpt}'")
print("="*60)

## ✅ Setup Complete!

### Next Steps:

1. **Download Data**: Run `01-data-download.py` to fetch TCGA data
2. **Deploy Pipeline**: Deploy and run the workflow with Databricks Asset Bundles
   ```bash
   databricks bundle deploy --target dev
   databricks bundle run tcga_data_workflow --target dev
   ```
3. **Run Analysis**: Execute `02-tcga-expression-clustering-optimized.py`

### Configuration Location:
- Main config: `./config.json`, relative to this notebook's working directory (Step 2 prints the absolute path)
- All pipelines and jobs read from this file