# Setup

This example demonstrates basic reading and writing of Iceberg tables in Bodo. We read TPC-H data from S3 and create Iceberg tables locally. Reading from S3 requires configured AWS credentials (for example, make sure you have run `aws configure`).

In [3]:
import pandas as pd
import numpy as np
import bodo
import time

@bodo.jit
def bodo_read_parquet(path):
    return pd.read_parquet(path)

bodo_df = bodo_read_parquet("s3://bodo-example-data/tpch/SF1/lineitem.pq")



In [4]:
bodo_df.shape

(6001215, 16)

Bodo supports several catalogs for interacting with Iceberg tables. For simplicity, we use the local filesystem here. See Bodo's [Iceberg documentation](https://docs.bodo.ai/2024.2/file_io/?h=iceberg#iceberg-section) for more details.
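
The local-filesystem connection string used in this example is the `iceberg+file://` prefix followed by a directory path and a trailing slash. As a minimal sketch, the same string can be built for any directory; `make_local_iceberg_conn` is a hypothetical helper written for illustration, not a Bodo function:

```python
import os


def make_local_iceberg_conn(warehouse_dir):
    """Build an iceberg+file:// connection string for a local directory.

    Hypothetical helper; it only mirrors the f-string used in this example.
    """
    abs_dir = os.path.abspath(warehouse_dir)
    return f"iceberg+file://{abs_dir}/"


# Using the current directory reproduces the string built in this example.
conn_example = make_local_iceberg_conn(".")
print(conn_example)
```

Using an absolute path keeps the string valid even if the working directory changes later in the session.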

In [5]:
import os
conn = f"iceberg+file://{os.getcwd()}/"

db_name = "TEST_DB"
table_name = "SF1_LINEITEM_PQ_A"

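@bodo.jit
def write_iceberg_table(df):
    # if_exists="fail" raises an error if the table already exists,
    # so re-running this cell requires a new table name.
    df.to_sql(table_name, conn, schema=db_name, if_exists="fail", index=False)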

write_iceberg_table(bodo_df)

starting write...


Launching JVM with Java executable: /Users/ehsan/dev/Bodo/.pixi/envs/default/lib/jvm/bin/java
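
After the write finishes, it can be useful to check what landed on disk. The sketch below counts files by extension under the table directory. It assumes the table is stored at `<working directory>/TEST_DB/SF1_LINEITEM_PQ_A`, which is a guess at the layout rather than something this example confirms, and `summarize_dir` is a hypothetical helper:

```python
import os
from collections import Counter


def summarize_dir(root):
    """Count files under root (recursively), grouped by file extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1] or "(no extension)"
            counts[ext] += 1
    return dict(counts)


# Assumed location: <warehouse>/<db_name>/<table_name>; adjust if your layout differs.
table_dir = os.path.join(os.getcwd(), "TEST_DB", "SF1_LINEITEM_PQ_A")
if os.path.isdir(table_dir):
    print(summarize_dir(table_dir))
else:
    print("Table directory not found at", table_dir)
```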


Now we can read the table:

In [11]:
@bodo.jit
def read_iceberg_table(conn, table_name, db_name):
    start_time = time.time()
    df = pd.read_sql_table(
        table_name=table_name,
        con=conn,
        schema=db_name,
    )
    print("Read time (s)", time.time() - start_time)
    return df

lineitem = read_iceberg_table(conn, table_name, db_name)

Read time (s) 0.9320029999980761


The resulting pandas DataFrame can be used for computation as usual:

In [12]:
@bodo.jit
def q01(lineitem):
    t1 = time.time()
    date = pd.Timestamp("1998-09-02")
    lineitem_filtered = lineitem.loc[
        :,
        [
            "L_QUANTITY",
            "L_EXTENDEDPRICE",
            "L_DISCOUNT",
            "L_TAX",
            "L_RETURNFLAG",
            "L_LINESTATUS",
            "L_SHIPDATE",
            "L_ORDERKEY",
        ],
    ]
    sel = lineitem_filtered.L_SHIPDATE <= date
    lineitem_filtered = lineitem_filtered[sel]
    lineitem_filtered["AVG_QTY"] = lineitem_filtered.L_QUANTITY
    lineitem_filtered["AVG_PRICE"] = lineitem_filtered.L_EXTENDEDPRICE
    lineitem_filtered["DISC_PRICE"] = lineitem_filtered.L_EXTENDEDPRICE * (
            1 - lineitem_filtered.L_DISCOUNT
    )
    lineitem_filtered["CHARGE"] = (
            lineitem_filtered.L_EXTENDEDPRICE
            * (1 - lineitem_filtered.L_DISCOUNT)
            * (1 + lineitem_filtered.L_TAX)
    )
    # Select the columns to aggregate with a list rather than a bare tuple of keys.
    gb = lineitem_filtered.groupby(["L_RETURNFLAG", "L_LINESTATUS"], as_index=False)[
        [
            "L_QUANTITY",
            "L_EXTENDEDPRICE",
            "DISC_PRICE",
            "CHARGE",
            "AVG_QTY",
            "AVG_PRICE",
            "L_DISCOUNT",
            "L_ORDERKEY",
        ]
    ]
    total = gb.agg(
        {
            "L_QUANTITY": "sum",
            "L_EXTENDEDPRICE": "sum",
            "DISC_PRICE": "sum",
            "CHARGE": "sum",
            "AVG_QTY": "mean",
            "AVG_PRICE": "mean",
            "L_DISCOUNT": "mean",
            "L_ORDERKEY": "count",
        }
    )
    total = total.sort_values(["L_RETURNFLAG", "L_LINESTATUS"])
    print(total.head())
    print("Q01 Execution time (s): ", time.time() - t1)

q01(lineitem)

  L_RETURNFLAG L_LINESTATUS  L_QUANTITY  ...     AVG_PRICE  L_DISCOUNT  L_ORDERKEY
2            A            F  37734107.0  ...  38273.129735    0.049985     1478493
0            N            F    991417.0  ...  38284.467761    0.050093       38854
1            N            O  74476040.0  ...  38249.117989    0.049997     2920374
3            R            F  37719753.0  ...  38250.854626    0.050009     1478870

[4 rows x 10 columns]
Q01 Execution time (s):  2.0284760000031383
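
To make the Q01 logic easier to follow, here is a minimal plain-pandas sketch of the same filter, derived-column, and group-by steps on a three-row synthetic frame. The data values and the output column names (`SUM_QTY`, `COUNT_ORDER`, and so on) are invented for illustration, and the code runs without Bodo; it is not the benchmark query itself.

```python
import pandas as pd

# Tiny synthetic stand-in for lineitem (made-up values, not TPC-H data).
toy = pd.DataFrame(
    {
        "L_RETURNFLAG": ["A", "A", "N"],
        "L_LINESTATUS": ["F", "F", "O"],
        "L_QUANTITY": [10.0, 20.0, 5.0],
        "L_EXTENDEDPRICE": [100.0, 200.0, 50.0],
        "L_DISCOUNT": [0.1, 0.0, 0.2],
        "L_TAX": [0.05, 0.1, 0.0],
        "L_SHIPDATE": pd.to_datetime(["1998-01-01", "1998-08-01", "1998-12-01"]),
    }
)

# Keep only rows shipped on or before the cutoff date, as in q01.
cutoff = pd.Timestamp("1998-09-02")
kept = toy[toy["L_SHIPDATE"] <= cutoff].copy()

# Derived columns: discounted price, then discounted price plus tax.
kept["DISC_PRICE"] = kept["L_EXTENDEDPRICE"] * (1 - kept["L_DISCOUNT"])
kept["CHARGE"] = kept["DISC_PRICE"] * (1 + kept["L_TAX"])

# One summary row per (return flag, line status) pair.
summary = kept.groupby(["L_RETURNFLAG", "L_LINESTATUS"], as_index=False).agg(
    SUM_QTY=("L_QUANTITY", "sum"),
    SUM_DISC_PRICE=("DISC_PRICE", "sum"),
    SUM_CHARGE=("CHARGE", "sum"),
    AVG_QTY=("L_QUANTITY", "mean"),
    COUNT_ORDER=("L_QUANTITY", "count"),
)
print(summary)
```

The third row ships after the cutoff and is filtered out, so the summary contains a single group for the two remaining `A`/`F` rows.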
