### Initialize Catalog

 - Connect to catalog (jdbc - postgres) and warehouse (s3 - data & metadata layer)



In [1]:
import logging
logging.basicConfig(level=logging.INFO)

from connection import connect_to_catalog
catalog = connect_to_catalog()
catalog

INFO:root:Connected to Iceberg catalog: `trinity`


trinity (<class 'pyiceberg.catalog.sql.SqlCatalog'>)

In [2]:
name_spaces = [ns[0] for ns in catalog.list_namespaces()]
print(f"Existing namespaces: {name_spaces}")

Existing namespaces: ['conformance']


### What did that do?

Connecting to the catalog via the `connect_to_catalog` function:

1. Establishes a connection with the jdbc catalog (i.e. postgres) and data/metadata warehouse (i.e. s3)

2. With the connection established, the following tables are automatically added to the `public` schema if they do not exist:

![](imgs/jdbc-cat.png)

Note that we are using the postgres database from the container `pgduckdb/pgduckdb:17-main`. This image comes with the [pg_duckdb](https://github.com/duckdb/pg_duckdb) extension and schema pre-installed.
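
For readers without access to `connection.py`, the two steps above can be approximated as follows. This is a minimal sketch, not the project's actual `connect_to_catalog`: the helper names `build_catalog_properties` and `open_catalog`, the connection string, and the region are assumptions for illustration. It relies on PyIceberg's `SqlCatalog`, which takes a SQLAlchemy `uri` and a `warehouse` location as properties.

```python
from typing import Dict, Optional


def build_catalog_properties(
    pg_uri: str, warehouse: str, s3_region: Optional[str] = None
) -> Dict[str, str]:
    """Collect the properties passed to PyIceberg's SqlCatalog."""
    props = {
        "uri": pg_uri,           # SQLAlchemy URL of the Postgres catalog database
        "warehouse": warehouse,  # root location for data and metadata files
    }
    if s3_region:
        props["s3.region"] = s3_region
    return props


def open_catalog(name: str, props: Dict[str, str]):
    """Open the catalog; imported lazily so the helper above works without pyiceberg."""
    from pyiceberg.catalog.sql import SqlCatalog

    return SqlCatalog(name, **props)


props = build_catalog_properties(
    pg_uri="postgresql+psycopg2://user:password@localhost:5432/postgres",  # placeholder credentials
    warehouse="s3://trinity-pilot/warehouse",
    s3_region="us-east-1",  # assumed region, not taken from the notebook
)
print(props)
# catalog = open_catalog("trinity", props)  # needs a reachable Postgres and S3 bucket
```

On first connection, `SqlCatalog` normally creates its own bookkeeping tables (`iceberg_tables` and `iceberg_namespace_properties`) in the target database, which is likely what the screenshot above shows.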

---

#### Now we can read in a schema to prepare for creating a table

To do that, we first need a table schema and, optionally, a partition specification. Then we can set the target location and create the table.


In [3]:
from table_utils import load_iceberg_schema_and_properties
import json

# Load / Preview schema
models_schema_json = "schemas/models.json"

MODEL_SCHEMA, MODEL_PROPS = load_iceberg_schema_and_properties(models_schema_json)

print(f"Schema: {MODEL_SCHEMA}")

# Identifier fields serve as primary keys
print(
    "Identifier fields:",
    {fid: MODEL_SCHEMA.find_field(fid).name for fid in MODEL_SCHEMA.identifier_field_ids},
)

Schema: table {
  1: donor_model_id: required string (Identifier for the donor model)
  2: donor_model_type: required string (Type of the donor model)
  3: donor_model_version: required string (Version of the donor model)
  4: donor_site_type: required string (Type of the donor site)
  5: donor_site_id: required string (Identifier for the donor site)
  6: receiver_model_id: required string (Identifier for the receiver model)
  7: receiver_model_type: required string (Type of the receiver model)
  8: receiver_model_version: required string (Version of the receiver model)
  9: receiver_site_type: required string (Type of the receiver site)
  10: receiver_site_id: required string (Identifier for the receiver site)
}
Identifier fields: {1: 'donor_model_id', 2: 'donor_model_type', 3: 'donor_model_version', 5: 'donor_site_id', 4: 'donor_site_type', 6: 'receiver_model_id', 7: 'receiver_model_type', 8: 'receiver_model_version', 10: 'receiver_site_id', 9: 'receiver_site_type'}


In [4]:
# This dataset will grow to many billions of rows, so partitioning is important.
# Note: We can use predicate pushdown later on unpartitioned columns
from table_utils import auto_partition_spec
MODEL_PARTITION_SPEC = auto_partition_spec(MODEL_SCHEMA, ["donor_model_id", "donor_model_version", "receiver_model_id", "receiver_model_version"])
print(f"Partition Spec: {MODEL_PARTITION_SPEC}")

Partition Spec: [
  100: donor_model_id: identity(1)
  101: donor_model_version: identity(3)
  102: receiver_model_id: identity(6)
  103: receiver_model_version: identity(8)
]
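
`auto_partition_spec` lives in the project's `table_utils` module and is not shown here. The sketch below is one way such a helper could derive the spec printed above: it maps each requested column to an identity transform on its source field ID and numbers the partition fields from 100. The function names and the starting ID are assumptions inferred from the output, not the project's code.

```python
from typing import Dict, List, Tuple


def plan_identity_partitions(
    field_ids: Dict[str, int], columns: List[str], first_partition_id: int = 100
) -> List[Tuple[int, str, int]]:
    """Return (partition_field_id, name, source_field_id) for each column."""
    missing = [c for c in columns if c not in field_ids]
    if missing:
        raise KeyError(f"Columns not in schema: {missing}")
    return [
        (first_partition_id + i, col, field_ids[col]) for i, col in enumerate(columns)
    ]


def to_partition_spec(plan: List[Tuple[int, str, int]]):
    """Turn the plan into a PyIceberg PartitionSpec (needs pyiceberg installed)."""
    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.transforms import IdentityTransform

    return PartitionSpec(
        *[
            PartitionField(
                source_id=src, field_id=pid, transform=IdentityTransform(), name=name
            )
            for pid, name, src in plan
        ]
    )


# Field IDs copied from the models schema printed earlier.
model_field_ids = {
    "donor_model_id": 1,
    "donor_model_version": 3,
    "receiver_model_id": 6,
    "receiver_model_version": 8,
}
plan = plan_identity_partitions(
    model_field_ids,
    ["donor_model_id", "donor_model_version", "receiver_model_id", "receiver_model_version"],
)
print(plan)
```

Identity partitioning on ID and version columns keeps all rows for one model version in the same partition, so filters on those columns can skip unrelated files entirely.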



##### Now we can add a `namespace` and create an Iceberg `table`.

In [5]:
# Set target: load the catalog root from iceberg/connection.py, then create the table namespace
from connection import CATALOG_ROOT
print(f"Catalog Root: {CATALOG_ROOT}")

# Assign table-specific variables
TABLE_NAME_SPACE = "conformance"

name_spaces = [ns[0] for ns in catalog.list_namespaces()]
if TABLE_NAME_SPACE not in name_spaces:
    catalog.create_namespace(TABLE_NAME_SPACE)
    print(f"Created namespace: {TABLE_NAME_SPACE}")
else:
    print(f"Namespace already exists: {TABLE_NAME_SPACE}")

Catalog Root: s3://trinity-pilot/warehouse
Namespace already exists: conformance


In [6]:
catalog

trinity (<class 'pyiceberg.catalog.sql.SqlCatalog'>)

In [7]:
MODEL_TABLE = "models"
PG_MODEL_TABLE_NAME = f"{TABLE_NAME_SPACE}.{MODEL_TABLE}"
S3_DATA_LOCATION = f"{CATALOG_ROOT}/{TABLE_NAME_SPACE}/{MODEL_TABLE}"

if catalog.table_exists(PG_MODEL_TABLE_NAME):
    print(f"Table `{PG_MODEL_TABLE_NAME}` already exists.")
else:
    # Create the Iceberg table
    catalog.create_table(
        identifier=PG_MODEL_TABLE_NAME,
        schema=MODEL_SCHEMA,
        location=S3_DATA_LOCATION,
        properties=MODEL_PROPS,
        partition_spec=MODEL_PARTITION_SPEC
    )
    print(f"Table `{PG_MODEL_TABLE_NAME}` has been created.")


INFO:pyiceberg.io:Loaded FileIO: pyiceberg.io.pyarrow.PyArrowFileIO


OSError: When initiating multiple part upload for key 'warehouse/conformance/models/metadata/00000-6f3092d4-e074-48da-a50b-6636215beb03.metadata.json' in bucket 'trinity-pilot': AWS Error ACCESS_DENIED during CreateMultipartUpload operation: User: arn:aws:iam::424342154607:user/kanawha-pilot-reader is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::trinity-pilot/warehouse/conformance/models/metadata/00000-6f3092d4-e074-48da-a50b-6636215beb03.metadata.json" because no identity-based policy allows the s3:PutObject action (Request ID: JAH58GVWB5AZW9AM)
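
The traceback above shows that table creation failed while writing the first metadata file: the IAM user named in the message has no identity-based policy that allows `s3:PutObject` under the warehouse prefix. The sketch below builds a policy document that would cover that write. Only `s3:PutObject` is named in the error; the other actions (`s3:GetObject`, `s3:DeleteObject`, `s3:ListBucket`) are assumptions about what an Iceberg writer typically needs, and the helper name `iceberg_writer_policy` is hypothetical.

```python
import json


def iceberg_writer_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document granting object read/write under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level access for data and metadata files.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # Bucket-level access so files can be listed.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


policy = iceberg_writer_policy("trinity-pilot", "warehouse")
print(json.dumps(policy, indent=2))
```

After attaching a policy like this to a user with write access (or switching to different credentials), re-running the cell should let `create_table` write the metadata file.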

### Connect the newly created catalog to nimtable

Nimtable is a handy utility for browsing Iceberg catalogs and viewing table data.

Open Nimtable at http://localhost:3000, click the `Create Catalog` button, and enter the connection information:

![](imgs/new-nimtable-cat.png)

Then view the table:

![](imgs/cat-page-1.png)

![](imgs/cat-page-2.png)

![](imgs/cat-page-3.png)

*NOTE* See schemas/hydrology for additional properties.



#### Updates achieved:

1. The postgres catalog now points to the newly created table in the conformance namespace:

![](imgs/pg-iceberg-table-1.png)

2. A new metadata file has been written to the table's s3 location under the catalog root and namespace:

![](imgs/bucket-view-1.png)

*NOTE* There are no data files yet. 

---

#### Finally, let's add a Hydraulics table



In [9]:
#### Hydraulics Table

# Load / Preview schema
ras_schema_json = "schemas/hydraulics.json"

HYDRAULICS_SCHEMA, HYDRAULICS_PROPS = load_iceberg_schema_and_properties(ras_schema_json)

print(f"Iceberg Table Schema: {HYDRAULICS_SCHEMA}")
print(f"Iceberg Table Properties: {json.dumps(HYDRAULICS_PROPS, indent=2)}")

# Identifier fields serve as primary keys
print(
    "Identifier fields:",
    {fid: HYDRAULICS_SCHEMA.find_field(fid).name for fid in HYDRAULICS_SCHEMA.identifier_field_ids},
)

HYDRAULICS_PARTITION_SPEC = auto_partition_spec(HYDRAULICS_SCHEMA, ["realization_id", "model_id", "run_version"])
print(f"Partition Spec: {HYDRAULICS_PARTITION_SPEC}")

HYDRAULICS_TABLE = "hydraulics"
PG_HYDRAULICS_TABLE_NAME = f"{TABLE_NAME_SPACE}.{HYDRAULICS_TABLE}"
S3_DATA_LOCATION = f"{CATALOG_ROOT}/{TABLE_NAME_SPACE}/{HYDRAULICS_TABLE}"



if catalog.table_exists(PG_HYDRAULICS_TABLE_NAME):
    print(f"Table `{PG_HYDRAULICS_TABLE_NAME}` already exists.")
else:
    # Create the Iceberg table
    catalog.create_table(
        identifier=PG_HYDRAULICS_TABLE_NAME,
        schema=HYDRAULICS_SCHEMA,
        location=S3_DATA_LOCATION,
        properties=HYDRAULICS_PROPS,
        partition_spec=HYDRAULICS_PARTITION_SPEC
    )
    print(f"Table `{PG_HYDRAULICS_TABLE_NAME}` has been created.")

INFO:pyiceberg.io:Loaded FileIO: pyiceberg.io.pyarrow.PyArrowFileIO


Iceberg Table Schema: table {
  1: sim_time: required timestamp (Simulation timestamp from HEC-RAS model [UTC])
  2: realization_id: required int (Unique identifier for each model realization)
  3: model_id: required string (Identifier for the HEC-RAS model)
  4: site_id: required string (Identifier for the measurement site)
  5: event_id: required int (Identifier for the simulated event)
  6: run_version: required string (Version of the model run)
  7: flow: optional double (Discharge at the site [cfs])
  8: stage: optional double (Stage at the site [ft])
}
Iceberg Table Properties: {
  "hydraulics.schema.version": "1.0.0",
  "hydraulics.description": "HEC-RAS simulation outputs",
  "hydraulics.units.convention": "English",
  "hydraulic.model.version": "6.6",
  "hydraulics.stac.catalog": "s3://trinity-pilot/stac/hydraulics/catalog.json",
  "write.hive-style-partitioning": "true",
  "write.target-file-size-bytes": "536870912",
  "field.sim_time.unit": "UTC",
  "field.sim_time.descripti

INFO:pyiceberg.io:Loaded FileIO: pyiceberg.io.pyarrow.PyArrowFileIO
INFO:pyiceberg.io:Loaded FileIO: pyiceberg.io.pyarrow.PyArrowFileIO


Table `conformance.hydraulics` has been created.


#### Both tables have now been created and are ready for data.

![](imgs/nimtable-cat-2.png)
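
Both table-creation cells follow the same pattern: build the identifier and S3 location, check `table_exists`, then call `create_table`. A small helper could remove the duplication. The sketch below is an illustration only (the name `ensure_table` is hypothetical); it uses just the `table_exists` and `create_table` calls already shown in this notebook and accepts any catalog object that exposes them.

```python
from typing import Any, Dict, Tuple


def ensure_table(
    catalog: Any,
    catalog_root: str,
    namespace: str,
    table: str,
    schema: Any,
    partition_spec: Any,
    properties: Dict[str, str],
) -> Tuple[str, bool]:
    """Create `namespace.table` if it is missing; return (identifier, created)."""
    identifier = f"{namespace}.{table}"
    location = f"{catalog_root}/{namespace}/{table}"

    if catalog.table_exists(identifier):
        print(f"Table `{identifier}` already exists.")
        return identifier, False

    catalog.create_table(
        identifier=identifier,
        schema=schema,
        location=location,
        properties=properties,
        partition_spec=partition_spec,
    )
    print(f"Table `{identifier}` has been created.")
    return identifier, True


# Usage with the objects defined earlier in this notebook:
# ensure_table(catalog, CATALOG_ROOT, TABLE_NAME_SPACE, "hydraulics",
#              HYDRAULICS_SCHEMA, HYDRAULICS_PARTITION_SPEC, HYDRAULICS_PROPS)
```

Returning the `created` flag lets a caller tell a fresh table from an existing one, which is useful when a later step should only run once (for example, an initial data load).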