### Initialize Catalog

 - Connect to catalog (postgres) and warehouse (s3 - data & metadata layer)
 - Create Namespace and Table
 - Load sample data


In [1]:
import logging
logging.basicConfig(level=logging.INFO)

from connection import connect_to_catalog
catalog = connect_to_catalog()

name_spaces = [ns[0] for ns in catalog.list_namespaces()]
print(f"Existing namespaces: {name_spaces}")

INFO:root:Connected to Iceberg catalog: `trinity`


Existing namespaces: []


### What did that do?

Connecting to the catalog via the `connect_to_catalog` function:

1. Establishes a connection with the JDBC catalog (i.e. postgres) and the data/metadata warehouse (i.e. s3)

2. With the connection established, the following tables are automatically added to the `public` schema if they do not exist:

![](imgs/jdbc-cat.png)

Note that we are using the postgres database from the container `pgduckdb/pgduckdb:17-main`. This image comes with the [pg_duckdb](https://github.com/duckdb/pg_duckdb) extension and schema pre-installed.
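
The body of `connect_to_catalog` lives in `connection.py` and is not shown here. As a rough sketch of what such a helper might do, assuming it is built on PyIceberg's SQL catalog (`type: sql`), the snippet below assembles the catalog properties and hands them to `load_catalog`. The environment variable names (`ICEBERG_PG_URI`, `ICEBERG_WAREHOUSE`, `ICEBERG_S3_ENDPOINT`) and both function names are hypothetical. `pyiceberg` is imported lazily so the property builder can be tried without a running database.

```python
from typing import Dict, Mapping


def build_catalog_properties(env: Mapping[str, str]) -> Dict[str, str]:
    """Assemble PyIceberg SQL-catalog properties from environment-style settings.

    The variable names read from `env` are hypothetical, not taken from connection.py.
    """
    props = {
        "type": "sql",                          # PyIceberg's SQL (JDBC-style) catalog
        "uri": env["ICEBERG_PG_URI"],           # e.g. a postgresql+psycopg2://... URI
        "warehouse": env["ICEBERG_WAREHOUSE"],  # root for data and metadata files
    }
    # An explicit endpoint is only needed for S3-compatible stores such as MinIO.
    endpoint = env.get("ICEBERG_S3_ENDPOINT")
    if endpoint:
        props["s3.endpoint"] = endpoint
    return props


def connect_to_catalog_sketch(name: str, env: Mapping[str, str]):
    """Open the catalog; requires pyiceberg and a reachable postgres instance."""
    from pyiceberg.catalog import load_catalog  # lazy import keeps the sketch light

    return load_catalog(name, **build_catalog_properties(env))


# Placeholder credentials and host, for illustration only.
demo_env = {
    "ICEBERG_PG_URI": "postgresql+psycopg2://user:password@localhost:5432/iceberg",
    "ICEBERG_WAREHOUSE": "s3://trinity-pilot/warehouse",
}
print(build_catalog_properties(demo_env))
```

If the helper is indeed built on PyIceberg's SQL catalog, the bookkeeping tables created on first connect are `iceberg_tables` and `iceberg_namespace_properties`.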

---

#### Now we can read in a schema to prepare for creating a table

To do that, we first need a table schema and an optional partition specification. Then we can set the target and create the table.


In [None]:

from table_utils import load_iceberg_schema_and_properties
import json

# Load / Preview schema
hms_schema_json = "schemas/hydrology.json"

HYDROLOGY_SCHEMA, HYDROLOGY_PROPS = load_iceberg_schema_and_properties(hms_schema_json)

print(f"Iceberg Table Schema: {HYDROLOGY_SCHEMA}")
print(f"Iceberg Table Properties: {json.dumps(HYDROLOGY_PROPS, indent=2)}")


Iceberg Table Schema: table {
  1: sim_time: required timestamp (Simulation timestamp from HEC-HMS model [UTC])
  2: realization_id: required int (Unique identifier for each model realization)
  3: model_id: required string (Identifier for the HEC-HMS model)
  4: site_id: required string (Identifier for the measurement site)
  5: event_id: required int (Identifier for the simulated event)
  6: run_version: required string (Version of the model run)
  7: flow: optional double (Discharge at the site [cfs])
  8: base_flow: optional double (Baseflow at the site [cfs])
}
Iceberg Table Properties: {
  "hydrology.schema.version": "1.0.0",
  "hydrology.description": "HEC-HMS simulation outputs",
  "hydrology.units.convention": "English",
  "hydrology.model.version": "4.13",
  "hydrology.time.step": "15min",
  "hydrology.time.timezone": "UTC",
  "hydrology.stac.catalog": "s3://trinity-pilot/stac/hydrology/catalog.json",
  "write.hive-style-partitioning": "true",
  "write.target-file-size-bytes"

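
`load_iceberg_schema_and_properties` is defined in `table_utils.py`, which is not reproduced here. The sketch below shows one way such a loader could work, assuming a hypothetical JSON layout with a `fields` list and a `properties` object (the real `schemas/hydrology.json` may be organized differently). The function names and the abbreviated sample document are illustrative only.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical, abbreviated schema document; the real file may differ.
SAMPLE_SCHEMA_JSON = """
{
  "fields": [
    {"id": 1, "name": "sim_time", "type": "timestamp", "required": true},
    {"id": 7, "name": "flow", "type": "double", "required": false}
  ],
  "properties": {"hydrology.schema.version": "1.0.0", "hydrology.time.step": "15min"}
}
"""


def split_schema_document(text: str) -> Tuple[List[Dict[str, Any]], Dict[str, str]]:
    """Parse the JSON text into (field definitions, table properties)."""
    doc = json.loads(text)
    # Iceberg table properties are string-to-string, so coerce values defensively.
    props = {str(k): str(v) for k, v in doc.get("properties", {}).items()}
    return doc["fields"], props


def to_iceberg_schema(fields: List[Dict[str, Any]]):
    """Convert field definitions to a pyiceberg Schema (requires pyiceberg)."""
    from pyiceberg.schema import Schema
    from pyiceberg.types import (
        DoubleType, IntegerType, NestedField, StringType, TimestampType,
    )

    type_map = {
        "timestamp": TimestampType(),
        "int": IntegerType(),
        "string": StringType(),
        "double": DoubleType(),
    }
    return Schema(
        *[
            NestedField(
                field_id=f["id"],
                name=f["name"],
                field_type=type_map[f["type"]],
                required=f["required"],
                doc=f.get("doc"),
            )
            for f in fields
        ]
    )


fields, props = split_schema_document(SAMPLE_SCHEMA_JSON)
print([f["name"] for f in fields])
print(json.dumps(props, indent=2))
```

Keeping the parsing step separate from the PyIceberg conversion makes the schema file easy to validate on its own before any catalog connection exists.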
In [10]:
# This dataset will grow to many billions of rows, so partitioning is important.
# Note: We can use predicate pushdown later on unpartitioned columns
from table_utils import auto_partition_spec
HYDROLOGY_PARTITION_SPEC = auto_partition_spec(HYDROLOGY_SCHEMA, ["realization_id", "model_id", "run_version"])
print(f"Partition Spec: {HYDROLOGY_PARTITION_SPEC}")

Partition Spec: [
  100: realization_id: identity(2)
  101: model_id: identity(3)
  102: run_version: identity(6)
]
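
The printed spec pairs each partition column with its schema field ID (2, 3, 6) and assigns partition field IDs starting at 100, all with the identity transform. `auto_partition_spec` itself is not shown, so the following is only a sketch of how that mapping could be derived. The function names are hypothetical, and the starting ID of 100 is taken from the output above rather than from the helper's source.

```python
from typing import Dict, List, Sequence, Tuple


def identity_partition_fields(
    column_ids: Dict[str, int],
    partition_columns: Sequence[str],
    first_field_id: int = 100,
) -> List[Tuple[int, str, int]]:
    """Return (partition_field_id, column_name, source_field_id) triples."""
    missing = [c for c in partition_columns if c not in column_ids]
    if missing:
        raise KeyError(f"Partition columns not in schema: {missing}")
    return [
        (first_field_id + offset, name, column_ids[name])
        for offset, name in enumerate(partition_columns)
    ]


def to_partition_spec(triples: List[Tuple[int, str, int]]):
    """Build a pyiceberg PartitionSpec from the triples (requires pyiceberg)."""
    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.transforms import IdentityTransform

    return PartitionSpec(
        *[
            PartitionField(
                source_id=source_id,
                field_id=field_id,
                transform=IdentityTransform(),
                name=name,
            )
            for field_id, name, source_id in triples
        ]
    )


# Field IDs copied from the hydrology schema printed earlier.
HYDROLOGY_COLUMN_IDS = {
    "sim_time": 1, "realization_id": 2, "model_id": 3, "site_id": 4,
    "event_id": 5, "run_version": 6, "flow": 7, "base_flow": 8,
}
print(identity_partition_fields(
    HYDROLOGY_COLUMN_IDS, ["realization_id", "model_id", "run_version"]
))
```

Failing fast on an unknown column name is a deliberate choice here: a typo in the partition list would otherwise only surface when the table is created.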



##### Now we can add a `namespace` and create an Iceberg `table`.

In [9]:
# Set target: load existing variables from iceberg/connection.py
from connection import CATALOG_ROOT
print(f"Catalog Root: {CATALOG_ROOT}")

# Assign table-specific variables
TABLE_NAME_SPACE = "conformance"

# Create the table namespace if it does not already exist
name_spaces = [ns[0] for ns in catalog.list_namespaces()]
if TABLE_NAME_SPACE not in name_spaces:
    catalog.create_namespace(TABLE_NAME_SPACE)
    print(f"Created namespace: {TABLE_NAME_SPACE}")
else:
    print(f"Namespace already exists: {TABLE_NAME_SPACE}")

Catalog Root: s3://trinity-pilot/warehouse
Namespace already exists: conformance


In [11]:
HYDROLOGY_TABLE = "hydrology"
PG_HYDROLOGY_TABLE_NAME = f"{TABLE_NAME_SPACE}.{HYDROLOGY_TABLE}"
S3_DATA_LOCATION = f"{CATALOG_ROOT}/{TABLE_NAME_SPACE}/{HYDROLOGY_TABLE}"



if catalog.table_exists(PG_HYDROLOGY_TABLE_NAME):
    print(f"Table `{PG_HYDROLOGY_TABLE_NAME}` already exists.")
else:
    # Create the Iceberg table
    catalog.create_table(
        identifier=PG_HYDROLOGY_TABLE_NAME,
        schema=HYDROLOGY_SCHEMA,
        location=S3_DATA_LOCATION,
        properties=HYDROLOGY_PROPS,
        partition_spec=HYDROLOGY_PARTITION_SPEC
    )
    print(f"Table `{PG_HYDROLOGY_TABLE_NAME}` has been created.")


INFO:pyiceberg.io:Loaded FileIO: pyiceberg.io.pyarrow.PyArrowFileIO
INFO:pyiceberg.io:Loaded FileIO: pyiceberg.io.pyarrow.PyArrowFileIO
INFO:pyiceberg.io:Loaded FileIO: pyiceberg.io.pyarrow.PyArrowFileIO


Table `conformance.hydrology` has been created.


### Connect the newly created catalog to Nimtable

Nimtable is a handy utility for browsing the catalog and viewing table data.

Open http://localhost:3000, click the `Create Catalog` button, and enter the catalog connection information:

![](imgs/new-nimtable-cat.png)

Then view the table:

![](imgs/cat-page-1.png)

![](imgs/cat-page-2.png)

![](imgs/cat-page-3.png)

*NOTE* See `schemas/hydrology.json` for additional properties.



#### Updates achieved:

1. The postgres catalog now points to the newly created table in the `conformance` namespace:

![](imgs/pg-iceberg-table-1.png)

2. A newly created metadata file has been written under the table's s3 location (catalog root + namespace + table name):

![](imgs/bucket-view-1.png)

*NOTE* There are no data files yet. 

---

#### Finally, let's add a Hydraulics table



In [12]:
#### Hydraulics Table

# Load / Preview schema
ras_schema_json = "schemas/hydraulics.json"

HYDRAULICS_SCHEMA, HYDRAULICS_PROPS = load_iceberg_schema_and_properties(ras_schema_json)

print(f"Iceberg Table Schema: {HYDRAULICS_SCHEMA}")
print(f"Iceberg Table Properties: {json.dumps(HYDRAULICS_PROPS, indent=2)}")

HYDRAULICS_PARTITION_SPEC = auto_partition_spec(HYDRAULICS_SCHEMA, ["realization_id", "model_id", "run_version"])
print(f"Partition Spec: {HYDRAULICS_PARTITION_SPEC}")

HYDRAULICS_TABLE = "hydraulics"
PG_HYDRAULICS_TABLE_NAME = f"{TABLE_NAME_SPACE}.{HYDRAULICS_TABLE}"
S3_DATA_LOCATION = f"{CATALOG_ROOT}/{TABLE_NAME_SPACE}/{HYDRAULICS_TABLE}"



if catalog.table_exists(PG_HYDRAULICS_TABLE_NAME):
    print(f"Table `{PG_HYDRAULICS_TABLE_NAME}` already exists.")
else:
    # Create the Iceberg table
    catalog.create_table(
        identifier=PG_HYDRAULICS_TABLE_NAME,
        schema=HYDRAULICS_SCHEMA,
        location=S3_DATA_LOCATION,
        properties=HYDRAULICS_PROPS,
        partition_spec=HYDRAULICS_PARTITION_SPEC
    )
    print(f"Table `{PG_HYDRAULICS_TABLE_NAME}` has been created.")

INFO:pyiceberg.io:Loaded FileIO: pyiceberg.io.pyarrow.PyArrowFileIO


Iceberg Table Schema: table {
  1: sim_time: required timestamp (Simulation timestamp from HEC-RAS model [UTC])
  2: realization_id: required int (Unique identifier for each model realization)
  3: model_id: required string (Identifier for the HEC-RAS model)
  4: site_id: required string (Identifier for the measurement site)
  5: event_id: required int (Identifier for the simulated event)
  6: run_version: required string (Version of the model run)
  7: flow: optional double (Discharge at the site [cfs])
  8: stage: optional double (Stage at the site [ft])
}
Iceberg Table Properties: {
  "hydraulics.schema.version": "1.0.0",
  "hydraulics.description": "HEC-RAS simulation outputs",
  "hydraulics.units.convention": "English",
  "hydraulic.model.version": "6.6",
  "hydraulics.stac.catalog": "s3://trinity-pilot/stac/hydraulics/catalog.json",
  "write.hive-style-partitioning": "true",
  "write.target-file-size-bytes": "536870912",
  "field.sim_time.unit": "UTC",
  "field.sim_time.descripti

INFO:pyiceberg.io:Loaded FileIO: pyiceberg.io.pyarrow.PyArrowFileIO
INFO:pyiceberg.io:Loaded FileIO: pyiceberg.io.pyarrow.PyArrowFileIO


Table `conformance.hydraulics` has been created.


#### Both tables have now been created and are ready for data.

![](imgs/nimtable-cat-2.png)