This notebook demonstrates how to use Bodo to read data stored in Snowflake and process it with Pandas. Both the data reads and the computations use familiar Pandas APIs, which Bodo automatically optimizes and parallelizes under the hood.

### Configuring credentials
To run the following code, set these environment variables to your Snowflake account credentials:
* `SF_USERNAME`
* `SF_PASSWORD`
* `SF_ACCOUNT`

This example uses TPC-H data. In your Snowflake account, ensure that you can access the [TPC-H sample database](https://docs.snowflake.com/en/user-guide/sample-data-tpch).

In [7]:
import os
username = os.environ["SF_USERNAME"]
password = os.environ["SF_PASSWORD"]
account = os.environ["SF_ACCOUNT"]
warehouse = "TEST_WH"
database = "SNOWFLAKE_SAMPLE_DATA"
schema = "TPCH_SF1"
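
The cells in this notebook assemble the same `snowflake://` connection string inline. A minimal sketch of building it once is shown here, assuming the user name or password may contain characters such as `@`, `:` or `/` that need percent-encoding inside a URL. `build_snowflake_url` is a hypothetical helper written for this illustration, not part of Bodo or of any Snowflake library, and the values passed to it are placeholders.

```python
from urllib.parse import quote_plus


def build_snowflake_url(username, password, account, database, schema, warehouse):
    """Assemble a snowflake:// URL (hypothetical helper, not a Bodo API).

    The user name and password are percent-encoded so that characters such
    as '@', ':' or '/' do not break the URL structure.
    """
    return (
        f"snowflake://{quote_plus(username)}:{quote_plus(password)}"
        f"@{account}/{database}/{schema}?warehouse={warehouse}"
    )


# Placeholder values for illustration only.
example_url = build_snowflake_url(
    "jane", "p@ss:word/1", "myaccount", "SNOWFLAKE_SAMPLE_DATA", "PUBLIC", "TEST_WH"
)
print(example_url)
```

A string built this way could be passed to a jitted function as an argument, which is how the BodoSQL cell in this notebook receives its `conn_str`.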

## Predicate Pushdown

Bodo optimizes Snowflake I/O automatically by applying I/O parallelization, predicate pushdown, and column pruning. In standard Python, the code below reads the entire table before filtering rows and selecting the relevant columns, which can result in slow I/O and potential out-of-memory errors. In contrast, Bodo uses all available CPU cores and reads only the rows that pass the filter and only the selected columns, which significantly accelerates I/O.

Run the cell below, then check the query history in Snowflake to see the filters applied to the queries sent to Snowflake.

In [3]:
import bodo
import pandas as pd

@bodo.jit(cache=True)
def load_lineitem(schema):
    date = pd.Timestamp("1998-09-02")
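    # Written as a full-table read; Bodo pushes the filter and the column
    # selection below into the query it sends to Snowflake.
    lineitem = pd.read_sql(
        f"select * from {schema}.LINEITEM",
        f"snowflake://{username}:{password}@{account}/{database}/PUBLIC?warehouse={warehouse}",
    )
    lineitem = lineitem[lineitem.l_shipdate <= date]
    lineitem = lineitem[["l_quantity", "l_shipdate"]]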
    return lineitem

lineitem = load_lineitem(schema)

In [4]:
lineitem.shape

(5916591, 2)
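
The effect of predicate pushdown and column pruning can be illustrated without a Snowflake account. The following is a minimal sketch that uses Python's built-in `sqlite3` module and a few made-up rows as a stand-in for `LINEITEM`. It does not show how Bodo implements the optimization or the exact SQL Bodo generates; it only demonstrates that filtering and projecting inside the query returns the same data as doing so client-side while transferring less.

```python
import sqlite3

# Tiny in-memory stand-in for LINEITEM (made-up rows, SQLite instead of Snowflake).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table lineitem (l_orderkey integer, l_quantity real, l_shipdate text)"
)
conn.executemany(
    "insert into lineitem values (?, ?, ?)",
    [
        (1, 17.0, "1998-08-30"),
        (2, 36.0, "1998-09-02"),
        (3, 8.0, "1998-09-15"),
        (4, 28.0, "1998-11-01"),
    ],
)

cutoff = "1998-09-02"

# Without pushdown: fetch every row and column, then filter and project client-side.
all_rows = conn.execute("select * from lineitem order by l_orderkey").fetchall()
client_side = [(qty, ship) for (_key, qty, ship) in all_rows if ship <= cutoff]

# With pushdown and column pruning: the database returns only what is needed.
pushed_down = conn.execute(
    "select l_quantity, l_shipdate from lineitem "
    "where l_shipdate <= ? order by l_orderkey",
    (cutoff,),
).fetchall()

print(len(all_rows), "rows transferred without pushdown")
print(len(pushed_down), "rows transferred with pushdown")
conn.close()
```

Both approaches produce the same two rows here, but the second one never moves the unneeded rows or the `l_orderkey` column out of the database.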

## High Performance Connector

Bodo's Snowflake connector loads data in parallel, entirely in Apache Arrow columnar format, which yields very high I/O performance and eliminates I/O bottlenecks for many programs. Here is another example that also includes some computation:

In [5]:
import pandas as pd
import bodo
import time

@bodo.jit(cache=True)
def tpch_q01_filter(schema):
    t1 = time.time()
    lineitem = pd.read_sql(
        f"select * from {schema}.LINEITEM",
        f"snowflake://{username}:{password}@{account}/{database}/PUBLIC?warehouse={warehouse}",
    )
    date = pd.Timestamp("1998-09-02")
    sel = lineitem.l_shipdate <= date
    lineitem_filtered = lineitem[
        ["l_quantity", "l_extendedprice", "l_discount", "l_tax",
         "l_returnflag", "l_linestatus", "l_shipdate", "l_orderkey"]
    ]
    lineitem_filtered = lineitem_filtered[sel]
    lineitem_filtered["avg_qty"] = lineitem_filtered.l_quantity
    lineitem_filtered["avg_price"] = lineitem_filtered.l_extendedprice
    lineitem_filtered["disc_price"] = lineitem_filtered.l_extendedprice * (1 - lineitem_filtered.l_discount)
    lineitem_filtered["charge"] = (
        lineitem_filtered.l_extendedprice * (1 - lineitem_filtered.l_discount) * (1 + lineitem_filtered.l_tax)
    )
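    # Select the aggregated columns with a list; Pandas 2.x rejects indexing
    # a groupby with a bare tuple of several column names.
    gb = lineitem_filtered.groupby(["l_returnflag", "l_linestatus"], as_index=False)[
        [
            "l_quantity",
            "l_extendedprice",
            "disc_price",
            "charge",
            "avg_qty",
            "avg_price",
            "l_discount",
            "l_orderkey",
        ]
    ]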
    total = gb.agg(
        {
            "l_quantity": "sum",
            "l_extendedprice": "sum",
            "disc_price": "sum",
            "charge": "sum",
            "avg_qty": "mean",
            "avg_price": "mean",
            "l_discount": "mean",
            "l_orderkey": "count",
        }
    )
    total = total.sort_values(["l_returnflag", "l_linestatus"])
    print(len(total))
    print("Q01 Execution time (s): ", time.time() - t1)
    return total

q1_result = tpch_q01_filter(schema)

4
Q01 Execution time (s):  9.224007999999912


In [6]:
q1_result

| | l_returnflag | l_linestatus | l_quantity | l_extendedprice | disc_price | charge | avg_qty | avg_price | l_discount | l_orderkey |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | A | F | 37734107.0 | 56586554400.729805 | 53758257134.870026 | 55909065222.82782 | 25.522006 | 38273.129735 | 0.049985 | 1478493 |
| 0 | N | F | 991417.0 | 1487504710.38 | 1413082168.0541 | 1469649223.194375 | 25.516472 | 38284.467761 | 0.050093 | 38854 |
| 1 | N | O | 74476040.0 | 111701729697.74014 | 106118230307.6051 | 110367043872.49712 | 25.502227 | 38249.117989 | 0.049997 | 2920374 |
| 2 | R | F | 37719753.0 | 56568041380.89975 | 53741292684.604256 | 55889619119.83151 | 25.505794 | 38250.854626 | 0.050009 | 1478870 |


BodoSQL can be used for reading from Snowflake as well:

In [8]:
import time
import bodo
import bodosql

@bodo.jit(cache=False)
def tpch_q01_sql(schema, conn_str):
    t1 = time.time()
    bc = bodosql.BodoSQLContext(
        {
            "LINEITEM": bodosql.TablePath(
                f"{schema}.lineitem", "sql", conn_str=conn_str, reorder_io=True
            ),
        })
    total = bc.sql(
        """select
                l_returnflag,
                l_linestatus,
                sum(l_quantity) as sum_qty,
                sum(l_extendedprice) as sum_base_price,
                sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
                sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
                avg(l_quantity) as avg_qty,
                avg(l_extendedprice) as avg_price,
                avg(l_discount) as avg_disc,
                count(*) as count_order
            from
                lineitem
            where
                l_shipdate <= date '1998-12-01' - interval '90' day
            group by
                l_returnflag,
                l_linestatus
            order by
                l_returnflag,
                l_linestatus"""
    )

    print("Q01 Execution time (s): ", time.time() - t1)
    return total

q1_result = tpch_q01_sql(
    schema,
    f"snowflake://{username}:{password}@{account}/{database}/PUBLIC?warehouse={warehouse}",
)

Q01 Execution time (s):  7.013795999999729
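
The `where` clause in the SQL query computes its cutoff as `date '1998-12-01' - interval '90' day`, while the Pandas cells use `pd.Timestamp("1998-09-02")` directly. A quick check with the standard library confirms that both refer to the same date, so the two versions filter the same rows:

```python
from datetime import date, timedelta

# Cutoff written in the SQL query: date '1998-12-01' - interval '90' day
sql_cutoff = date(1998, 12, 1) - timedelta(days=90)

# Cutoff used in the Pandas cells: pd.Timestamp("1998-09-02")
pandas_cutoff = date(1998, 9, 2)

print(sql_cutoff.isoformat(), sql_cutoff == pandas_cutoff)
# prints: 1998-09-02 True
```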


In [9]:
q1_result

| | L_RETURNFLAG | L_LINESTATUS | SUM_QTY | SUM_BASE_PRICE | SUM_DISC_PRICE | SUM_CHARGE | AVG_QTY | AVG_PRICE | AVG_DISC | COUNT_ORDER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | F | 37734107.0 | 56586554400.73 | 53758257134.87 | 55909065222.82769 | 25.522006 | 38273.129735 | 0.049985 | 1478493 |
| 1 | N | F | 991417.0 | 1487504710.38 | 1413082168.0541 | 1469649223.194376 | 25.516472 | 38284.467761 | 0.050093 | 38854 |
| 2 | N | O | 74476040.0 | 111701729697.74 | 106118230307.60556 | 110367043872.497 | 25.502227 | 38249.117989 | 0.049997 | 2920374 |
| 3 | R | F | 37719753.0 | 56568041380.9 | 53741292684.60398 | 55889619119.83194 | 25.505794 | 38250.854626 | 0.050009 | 1478870 |
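
The Pandas and BodoSQL results agree on the counts and averages shown, but the large sums differ in their last few digits. A plausible cause, stated here as an assumption, is that the floating-point additions are performed in a different order in the two runs. The sketch below compares the `charge` and `SUM_CHARGE` values copied from the two result tables above using a relative tolerance instead of exact equality; the tolerance of `1e-9` is an arbitrary choice for illustration.

```python
import math

# "charge" / SUM_CHARGE values copied from the two result tables above,
# keyed by (l_returnflag, l_linestatus).
pandas_charge = {
    ("A", "F"): 55909065222.82782,
    ("N", "F"): 1469649223.194375,
    ("N", "O"): 110367043872.49712,
    ("R", "F"): 55889619119.83151,
}
sql_charge = {
    ("A", "F"): 55909065222.82769,
    ("N", "F"): 1469649223.194376,
    ("N", "O"): 110367043872.497,
    ("R", "F"): 55889619119.83194,
}

# Report, per group, whether the two values match exactly and within tolerance.
for key in sorted(pandas_charge):
    a, b = pandas_charge[key], sql_charge[key]
    print(key, "exact:", a == b, "close:", math.isclose(a, b, rel_tol=1e-9))

all_close = all(
    math.isclose(pandas_charge[k], sql_charge[k], rel_tol=1e-9)
    for k in pandas_charge
)
print("all groups agree within tolerance:", all_close)
```

When comparing full DataFrames rather than hand-copied values, the same idea applies: match rows on the grouping keys and compare numeric columns with a tolerance.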
