#5982 but for Ibis #6296

Closed
wants to merge 3 commits into from

Contributor

@lostmygithubaccount lostmygithubaccount commented Nov 21, 2022

resolves nothing

Description

this is basically #5982 but using .ibis as the extension

running dbt Python models is a bit slow -- SQL cheats. what if we could cheat with Python too?

Ibis provides a full-featured replacement for SQL SELECT queries with Python code. this offers benefits like the flexibility of Python, reuse of expressions, type checking, and more

with some adaptations to the PRQL PR, I was able to get this working (at least in simple cases) for Ibis. this brings down the time to run a simple Python model from ~7-20s to ~1-2s on Snowflake

image

# test_snowpark.py
def model(dbt, session):

    # get upstream models
    stg_orders = dbt.ref("stg_orders")
    stg_payments = dbt.ref("stg_payments")

    # transform
    final = stg_orders.join(stg_payments, ["order_id"])

    return final

# test_ibis.ibis
# get upstream models
stg_orders = s.table("stg_orders")
stg_payments = s.table("stg_payments")

# transform
final = stg_orders.left_join(stg_payments, "order_id".upper())

# use the `model` variable
model = final

rough steps to reproduce:

gh repo clone dbt-labs/dbt-core -b cody/ibis
gh repo clone dbt-labs/dbt-snowflake
gh repo clone ibis

pip install snowflake-connector-python==2.8.1
pip install -r dbt-core/editable-requirements.txt
pip install -e dbt-snowflake/
pip install -e ibis/

you'd need to edit dbt-core/core/dbt/parser/_dbt_ibis.py to adjust the hardcoded stuff. the core/dbt/parser/_dbt_ibis.py code is unfathomably ugly:

"""
adapted from https://github.com/dbt-labs/dbt-core/pull/5982/files
"""
import ibis


def compile(code: str):
    # TODO: credentials from dbt
    import yaml

    # replace as needed
    PROFILE_PATH = "/Users/cody/.dbt/profiles.yml"
    PROFILE_NAME = "p-ibis"
    PROFILE_OUTPUT = "dev"

    # read in dbt profile
    with open(PROFILE_PATH, "r") as f:
        profiles = yaml.safe_load(f)
        profile = profiles[PROFILE_NAME]["outputs"][PROFILE_OUTPUT]

    # build connection parameters from profile
    conn_params = {
        "account": profile["account"],
        "user": profile["user"],
        "role": profile["role"],
        "warehouse": profile["warehouse"],
        "database": profile["database"],
        "schema": profile["schema"],
        "authenticator": profile["authenticator"],
    }

    s = ibis.connect(
        f"snowflake://{conn_params['user']}:_@{conn_params['account']}/{conn_params['database']}/{conn_params['schema']}?warehouse={conn_params['warehouse']}&role={conn_params['role']}&authenticator={conn_params['authenticator']}",
    )
    # TODO: replace above

    # the dirtiest code I've ever written?
    # run the ibis code and compile the `model` variable
    exec(code)
    compiled = str(eval("ibis.snowflake.compile(model)"))

    return compiled

it's hardcoded -- I couldn't figure out the best way to get the profile/adapter credentials into the parser here. I'll ask internally.
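As a side note on the long f-string passed to `ibis.connect` above: a minimal sketch of building the same URL shape with the standard library, so that special characters in the user name or query values get escaped. `build_snowflake_url` and the sample `target` values are hypothetical names used only for illustration; they are not part of dbt-core or Ibis.

```python
from urllib.parse import quote, urlencode


def build_snowflake_url(target: dict) -> str:
    """Assemble a snowflake:// URL in the same shape as the f-string above.

    Hypothetical helper for illustration only. `target` is assumed to carry
    the same keys that the PR reads from the dbt profile.
    """
    # urlencode escapes each query value and joins the pairs with "&"
    query = urlencode(
        {key: target[key] for key in ("warehouse", "role", "authenticator")}
    )
    # user names that are e-mail addresses contain "@", which must be escaped
    user = quote(target["user"], safe="")
    return (
        f"snowflake://{user}:_@{target['account']}"
        f"/{target['database']}/{target['schema']}?{query}"
    )


# made-up values, only to show the resulting shape
example_target = {
    "user": "cody@example.com",
    "account": "abc123",
    "database": "analytics",
    "schema": "dbt_cody",
    "warehouse": "transforming",
    "role": "transformer",
    "authenticator": "externalbrowser",
}
example_url = build_snowflake_url(example_target)
```

The resulting string could then be handed to `ibis.connect` in place of the inline f-string.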

beyond that, I could not figure out a good way to execute the ibis code. unfortunately, we do need to make a connection for ibis to read metadata from the tables and compile its expressions down into SQL. I used an exec into eval, which I'm not proud of
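One possible alternative to the exec-into-eval pattern, sketched under assumptions: run the model source in an explicit namespace dict and read `model` back out of it. The Python docs caution against relying on changes that `exec` makes to the default locals of a function, so an explicit dict is the more predictable route. `run_model_code` and `FakeConnection` are hypothetical names for illustration; in the PR the connection would be the object returned by `ibis.connect`.

```python
def run_model_code(code: str, connection, extra_names=None):
    """Execute model source and return whatever it bound to `model`.

    Hypothetical helper, not part of dbt-core or Ibis.
    """
    # the model files in this PR expect the connection under the name `s`
    namespace = {"s": connection}
    # e.g. {"ibis": ibis} for models that call ibis.window(...)
    namespace.update(extra_names or {})
    exec(code, namespace)
    if "model" not in namespace:
        raise ValueError("model code must assign a variable named `model`")
    return namespace["model"]


class FakeConnection:
    """Stand-in for an Ibis connection so the sketch runs without a warehouse."""

    def table(self, name):
        return f"table:{name}"


example_code = 'stg_orders = s.table("stg_orders")\nmodel = stg_orders\n'
result = run_model_code(example_code, FakeConnection())
```

The returned expression could then be passed straight to the backend's compile function, with no `eval` on a string.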

this doesn't work on some more complicated ibis expressions that generate "%(param_9)s" in the compiled strings -- not sure why but I expect it's due to the hackiness of the above, as the same code works fine normally in ibis
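For context, `%(param_9)s` is the "pyformat" placeholder style from the Python DB-API (PEP 249): the literal values travel separately from the SQL text and are bound when the statement is executed. A possible explanation, stated as an assumption rather than a confirmed diagnosis, is that stringifying the compiled object keeps the placeholders and drops the values. If that object is a SQLAlchemy statement, its `literal_binds` compile option would be the usual way to inline them. The sketch below only illustrates what the placeholders stand for, using a hypothetical `inline_params` helper that is fine for debugging output but should not be used to build queries from untrusted input.

```python
import re

# matches DB-API "pyformat" placeholders such as %(param_1)s
_PLACEHOLDER = re.compile(r"%\((\w+)\)s")


def render_literal(value) -> str:
    """Render a Python value as a SQL literal (debugging use only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # strings: wrap in single quotes and double any embedded quote
    return "'" + str(value).replace("'", "''") + "'"


def inline_params(sql: str, params: dict) -> str:
    """Replace each %(name)s placeholder with its literal value."""

    def substitute(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return render_literal(params[name])

    return _PLACEHOLDER.sub(substitute, sql)


fragment = (
    "CASE WHEN (t1.customer_order_index = %(param_1)s) "
    "THEN %(param_2)s ELSE %(param_3)s END"
)
rendered = inline_params(fragment, {"param_1": 1, "param_2": 1, "param_3": 0})
```

With the values from the `first_order` expression, the fragment becomes plain SQL with no `%` characters left for Snowflake to reject.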

an example of a model that results in that, based off the newer https://github.com/dbt-labs/dbt-demo-data, is:

# get upstream models
stg_orders = s.table("stg_orders")
stg_order_items = s.table("stg_order_items")
stg_products = s.table("stg_products")
stg_locations = s.table("stg_locations")
stg_supplies = s.table("stg_supplies")

# lowercase all column names
stg_orders = stg_orders.relabel(
    {col: str(col).lower() for col in stg_orders.columns}
)
stg_order_items = stg_order_items.relabel(
    {col: str(col).lower() for col in stg_order_items.columns}
)
stg_products = stg_products.relabel(
    {col: str(col).lower() for col in stg_products.columns}
)
stg_locations = stg_locations.relabel(
    {col: str(col).lower() for col in stg_locations.columns}
)
stg_supplies = stg_supplies.relabel(
    {col: str(col).lower() for col in stg_supplies.columns}
)

# reused expressions
count_food_items = stg_products.is_food_item.sum()
count_drink_items = stg_products.is_drink_item.sum()
count_items = stg_products.product_id.count()

subtotal_drink_items = stg_products.product_price.sum(
    where=stg_products.is_drink_item == 1
)
subtotal_food_items = stg_products.product_price.sum(
    where=stg_products.is_food_item == 1
)
subtotal = stg_products.product_price.sum()

# intermediate transforms
order_items_summary = (
    stg_order_items.left_join(stg_products, "product_id")
    .group_by("order_id")
    .aggregate(
        count_food_items=count_food_items,
        count_drink_items=count_drink_items,
        count_items=count_items,
        subtotal_drink_items=subtotal_drink_items,
        subtotal_food_items=subtotal_food_items,
        subtotal=subtotal,
    )
)

order_supplies_summary = (
    stg_order_items.left_join(stg_supplies, "product_id")
    .group_by("order_id")
    .aggregate(order_cost=stg_supplies.supply_cost.sum())
)

joined = (
    stg_orders.left_join(order_items_summary, "order_id")
    .relabel({"order_id_x": "order_id"})
    .drop("order_id_y")
    .left_join(order_supplies_summary, "order_id")
    .relabel({"order_id_x": "order_id"})
    .drop("order_id_y")
    .left_join(stg_locations, "location_id")
    .relabel({"location_id_x": "location_id"})
    .drop("location_id_y")
)

joined = joined.mutate(
    customer_order_index=ibis.row_number().over(
        ibis.window(group_by=stg_orders.customer_id, order_by=stg_orders.ordered_at)
    )
    + 1
)

final = joined.mutate(
    first_order=(joined.customer_order_index == 1).ifelse(1, 0),
    is_food_order=(joined.count_food_items > 0).ifelse(1, 0),
    is_drink_order=(joined.count_drink_items > 0).ifelse(1, 0),
)

# use the `model` variable
model = final

resulting in this traceback:

(venv) cody@dbtpro dbt-demo-data % dbt run -s ibis_test
17:42:54  Running with dbt=1.4.0-a1
17:43:00  Found 10 models, 20 tests, 0 snapshots, 0 analyses, 613 macros, 0 operations, 0 seed files, 6 sources, 0 exposures, 4 metrics
17:43:00  The selection criterion 'ibis_test' does not match any nodes
17:43:00  
17:43:01  Concurrency: 8 threads (target='dev')
17:43:01  
17:43:01  1 of 1 START sql view model dbt_cody_ibis.ibis_test ............................ [RUN]
17:43:01  1 of 1 ERROR creating sql view model dbt_cody_ibis.ibis_test ................... [ERROR in 0.48s]
17:43:02  
17:43:02  Finished running 1 view model in 0 hours 0 minutes and 1.39 seconds (1.39s).
17:43:02  
17:43:02  Completed with 1 error and 0 warnings:
17:43:02  
17:43:02  Database Error in model ibis_test (models/ibis_test.ibis)
17:43:02    001003 (42000): SQL compilation error:
17:43:02    syntax error line 7 at position 340 unexpected '%'.
17:43:02    syntax error line 7 at position 350 unexpected 's'.
17:43:02    syntax error line 7 at position 358 unexpected '%'.
17:43:02    syntax error line 7 at position 370 unexpected 'ELSE'.
17:43:02    syntax error line 8 at position 607 unexpected ')'.
17:43:02    compiled Code at target/run/demo_data/models/ibis_test.ibis
17:43:02  
17:43:02  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
(venv) cody@dbtpro dbt-demo-data % 

from this compiled code:

WITH t0 AS 
(SELECT t2."ORDER_ITEM_ID" AS order_item_id, t2."ORDER_ID" AS order_id, t2."PRODUCT_ID" AS product_id 
FROM stg_order_items AS t2)
 SELECT t1.order_id, t1.location_id, t1.customer_id, t1.order_total, t1.tax_paid, t1.ordered_at, t1.count_food_items, t1.count_drink_items, t1.count_items, t1.subtotal_drink_items, t1.subtotal_food_items, t1.subtotal, t1.order_cost, t1.location_name, t1.tax_rate, t1.opened_at, t1.customer_order_index, CASE WHEN (t1.customer_order_index = %(param_1)s) THEN %(param_2)s ELSE %(param_3)s END AS first_order, CASE WHEN (t1.count_food_items > %(param_4)s) THEN %(param_5)s ELSE %(param_6)s END AS is_food_order, CASE WHEN (t1.count_drink_items > %(param_7)s) THEN %(param_8)s ELSE %(param_9)s END AS is_drink_order 
FROM (SELECT t2.order_id AS order_id, t2.location_id AS location_id, t2.customer_id AS customer_id, t2.order_total AS order_total, t2.tax_paid AS tax_paid, t2.ordered_at AS ordered_at, t2.count_food_items AS count_food_items, t2.count_drink_items AS count_drink_items, t2.count_items AS count_items, t2.subtotal_drink_items AS subtotal_drink_items, t2.subtotal_food_items AS subtotal_food_items, t2.subtotal AS subtotal, t2.order_cost AS order_cost, t2.location_name AS location_name, t2.tax_rate AS tax_rate, t2.opened_at AS opened_at, (row_number() OVER (PARTITION BY t2.customer_id ORDER BY t2.ordered_at) - %(param_10)s) + %(param_11)s AS customer_order_index 
FROM (SELECT t3.order_id AS order_id, t3.location_id_x AS location_id, t3.customer_id AS customer_id, t3.order_total AS order_total, t3.tax_paid AS tax_paid, t3.ordered_at AS ordered_at, t3.count_food_items AS count_food_items, t3.count_drink_items AS count_drink_items, t3.count_items AS count_items, t3.subtotal_drink_items AS subtotal_drink_items, t3.subtotal_food_items AS subtotal_food_items, t3.subtotal AS subtotal, t3.order_cost AS order_cost, t3.location_id_y AS location_id_y, t3.location_name AS location_name, t3.tax_rate AS tax_rate, t3.opened_at AS opened_at 
FROM (SELECT t4.order_id AS order_id, t4.location_id AS location_id_x, t4.customer_id AS customer_id, t4.order_total AS order_total, t4.tax_paid AS tax_paid, t4.ordered_at AS ordered_at, t4.count_food_items AS count_food_items, t4.count_drink_items AS count_drink_items, t4.count_items AS count_items, t4.subtotal_drink_items AS subtotal_drink_items, t4.subtotal_food_items AS subtotal_food_items, t4.subtotal AS subtotal, t4.order_cost AS order_cost, t5.location_id AS location_id_y, t5.location_name AS location_name, t5.tax_rate AS tax_rate, t5.opened_at AS opened_at 
FROM (SELECT t6.order_id AS order_id, t6.location_id AS location_id, t6.customer_id AS customer_id, t6.order_total AS order_total, t6.tax_paid AS tax_paid, t6.ordered_at AS ordered_at, t6.count_food_items AS count_food_items, t6.count_drink_items AS count_drink_items, t6.count_items AS count_items, t6.subtotal_drink_items AS subtotal_drink_items, t6.subtotal_food_items AS subtotal_food_items, t6.subtotal AS subtotal, t6.order_cost AS order_cost 
FROM (SELECT t7.order_id_x AS order_id, t7.location_id AS location_id, t7.customer_id AS customer_id, t7.order_total AS order_total, t7.tax_paid AS tax_paid, t7.ordered_at AS ordered_at, t7.count_food_items AS count_food_items, t7.count_drink_items AS count_drink_items, t7.count_items AS count_items, t7.subtotal_drink_items AS subtotal_drink_items, t7.subtotal_food_items AS subtotal_food_items, t7.subtotal AS subtotal, t7.order_id_y AS order_id_y, t7.order_cost AS order_cost 
FROM (SELECT t8.order_id AS order_id_x, t8.location_id AS location_id, t8.customer_id AS customer_id, t8.order_total AS order_total, t8.tax_paid AS tax_paid, t8.ordered_at AS ordered_at, t8.count_food_items AS count_food_items, t8.count_drink_items AS count_drink_items, t8.count_items AS count_items, t8.subtotal_drink_items AS subtotal_drink_items, t8.subtotal_food_items AS subtotal_food_items, t8.subtotal AS subtotal, t9.order_id AS order_id_y, t9.order_cost AS order_cost 
FROM (SELECT t10.order_id AS order_id, t10.location_id AS location_id, t10.customer_id AS customer_id, t10.order_total AS order_total, t10.tax_paid AS tax_paid, t10.ordered_at AS ordered_at, t10.count_food_items AS count_food_items, t10.count_drink_items AS count_drink_items, t10.count_items AS count_items, t10.subtotal_drink_items AS subtotal_drink_items, t10.subtotal_food_items AS subtotal_food_items, t10.subtotal AS subtotal 
FROM (SELECT t11.order_id_x AS order_id, t11.location_id AS location_id, t11.customer_id AS customer_id, t11.order_total AS order_total, t11.tax_paid AS tax_paid, t11.ordered_at AS ordered_at, t11.order_id_y AS order_id_y, t11.count_food_items AS count_food_items, t11.count_drink_items AS count_drink_items, t11.count_items AS count_items, t11.subtotal_drink_items AS subtotal_drink_items, t11.subtotal_food_items AS subtotal_food_items, t11.subtotal AS subtotal 
FROM (SELECT t12.order_id AS order_id_x, t12.location_id AS location_id, t12.customer_id AS customer_id, t12.order_total AS order_total, t12.tax_paid AS tax_paid, t12.ordered_at AS ordered_at, t13.order_id AS order_id_y, t13.count_food_items AS count_food_items, t13.count_drink_items AS count_drink_items, t13.count_items AS count_items, t13.subtotal_drink_items AS subtotal_drink_items, t13.subtotal_food_items AS subtotal_food_items, t13.subtotal AS subtotal 
FROM (SELECT t14."ORDER_ID" AS order_id, t14."LOCATION_ID" AS location_id, t14."CUSTOMER_ID" AS customer_id, t14."ORDER_TOTAL" AS order_total, t14."TAX_PAID" AS tax_paid, t14."ORDERED_AT" AS ordered_at 
FROM stg_orders AS t14) AS t12 LEFT OUTER JOIN (SELECT t14.order_id AS order_id, sum(t14.is_food_item) AS count_food_items, sum(t14.is_drink_item) AS count_drink_items, count(t14.product_id_y) AS count_items, sum(CASE WHEN (t14.is_drink_item = %(param_12)s) THEN t14.product_price ELSE NULL END) AS subtotal_drink_items, sum(CASE WHEN (t14.is_food_item = %(param_13)s) THEN t14.product_price ELSE NULL END) AS subtotal_food_items, sum(t14.product_price) AS subtotal 
FROM (SELECT t0.order_item_id AS order_item_id, t0.order_id AS order_id, t0.product_id AS product_id_x, t16.product_id AS product_id_y, t16.product_name AS product_name, t16.product_type AS product_type, t16.product_description AS product_description, t16.product_price AS product_price, t16.is_food_item AS is_food_item, t16.is_drink_item AS is_drink_item 
FROM t0 LEFT OUTER JOIN (SELECT t17."PRODUCT_ID" AS product_id, t17."PRODUCT_NAME" AS product_name, t17."PRODUCT_TYPE" AS product_type, t17."PRODUCT_DESCRIPTION" AS product_description, t17."PRODUCT_PRICE" AS product_price, t17."IS_FOOD_ITEM" AS is_food_item, t17."IS_DRINK_ITEM" AS is_drink_item 
FROM stg_products AS t17) AS t16 ON t0.product_id = t16.product_id) AS t14 GROUP BY t14.order_id) AS t13 ON t12.order_id = t13.order_id) AS t11) AS t10) AS t8 LEFT OUTER JOIN (SELECT t10.order_id AS order_id, sum(t10.supply_cost) AS order_cost 
FROM (SELECT t0.order_item_id AS order_item_id, t0.order_id AS order_id, t0.product_id AS product_id_x, t12.supply_uuid AS supply_uuid, t12.supply_id AS supply_id, t12.product_id AS product_id_y, t12.supply_name AS supply_name, t12.supply_cost AS supply_cost, t12.is_perishable_supply AS is_perishable_supply 
FROM t0 LEFT OUTER JOIN (SELECT t13."SUPPLY_UUID" AS supply_uuid, t13."SUPPLY_ID" AS supply_id, t13."PRODUCT_ID" AS product_id, t13."SUPPLY_NAME" AS supply_name, t13."SUPPLY_COST" AS supply_cost, t13."IS_PERISHABLE_SUPPLY" AS is_perishable_supply 
FROM stg_supplies AS t13) AS t12 ON t0.product_id = t12.product_id) AS t10 GROUP BY t10.order_id) AS t9 ON t8.order_id = t9.order_id) AS t7) AS t6) AS t4 LEFT OUTER JOIN (SELECT t6."LOCATION_ID" AS location_id, t6."LOCATION_NAME" AS location_name, t6."TAX_RATE" AS tax_rate, t6."OPENED_AT" AS opened_at 
FROM stg_locations AS t6) AS t5 ON t4.location_id = t5.location_id) AS t3) AS t2) AS t1

Checklist

@cla-bot cla-bot bot added the cla:yes label Nov 21, 2022
@github-actions
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@lostmygithubaccount
Contributor Author

after discussions w/ @jtcohen6 and @ChenyuLInx, noting here for posterity:

  • a potential easy fix on the adapter piece is moving the code from /parser/models.py into /compilation.py, around the same place Python models are -- then can likely reuse the adapter
  • longer term, we'd probably want a dbt-ibis language plugin where:
    • dbt-core is doing the parsing and compilation -- of dbt-ified Ibis code into regular Ibis code, rather than down into SQL
    • for execution of the model, it's then calling off into the dbt-ibis adapter/plugin
      • this is making the required Ibis connection and executing the Python code
      • this allows us to isolate that adapter/plugin from dbt-core
      • again, instead of compiling down to SQL, as this current PR is doing, it'd simply run the Ibis Python code after the ref/source/config + DDL has been resolved by dbt-core
      • in this approach, I think the different Ibis backends would transparently work. the adapter-specific code would be in the connection string

@lostmygithubaccount
Contributor Author

158aa81 makes the updates per suggestions

the good

the code is so much cleaner! and no longer uses hardcoded re-reading of my local profile, instead using the context object -- could the same have been done with the config object previously? not sure

"""
adapted from https://github.com/dbt-labs/dbt-core/pull/5982/files
"""
import ibis


def compile(code: str, context):
    conn_params = {
        "account": context["target"]["account"],
        "user": context["target"]["user"],
        "role": context["target"]["role"],
        "warehouse": context["target"]["warehouse"],
        "database": context["target"]["database"],
        "schema": context["target"]["schema"],
        "authenticator": "externalbrowser",
    }

    s = ibis.connect(
        f"snowflake://{conn_params['user']}:_@{conn_params['account']}/{conn_params['database']}/{conn_params['schema']}?warehouse={conn_params['warehouse']}&role={conn_params['role']}&authenticator={conn_params['authenticator']}",
    )

    # the dirtiest code I've ever written?
    # run the ibis code and compile the `model` variable
    exec(code)
    compiled = str(eval(f"ibis.{context['target']['type']}.compile(model)"))

    return compiled

the bad

it's slow! very slow. a run is now taking closer to 5 seconds, up from 1-2. I don't know why. I made a lazy attempt to make it better
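One thing that might be worth ruling out (an assumption, not a diagnosis of the slowdown): whether a fresh connection is opened every time `compile` is called. A minimal sketch of caching by connection URL follows. `make_cached_connect` and `fake_connect` are hypothetical names; in real use the wrapped function would be something like `ibis.connect`.

```python
import functools


def make_cached_connect(connect):
    """Wrap a connect function so each distinct URL is opened only once.

    Hypothetical helper for illustration; not part of dbt-core or Ibis.
    """

    @functools.lru_cache(maxsize=None)
    def cached(url: str):
        return connect(url)

    return cached


# stand-in for a real connect call, recording how often it is invoked
calls = []


def fake_connect(url):
    calls.append(url)
    return object()


cached_connect = make_cached_connect(fake_connect)
first = cached_connect("snowflake://example")
second = cached_connect("snowflake://example")
```

Here `first` and `second` are the same object and `fake_connect` ran once. With several dbt threads, two threads could still both miss the cache at the same moment, so this reduces repeated connections rather than strictly guaranteeing a single one.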

image

@github-actions
Contributor

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label May 23, 2023
@github-actions
Contributor

Although we are closing this PR as stale, it can still be reopened to continue development. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this May 30, 2023
@binste

binste commented Sep 7, 2023

In case anyone lands here looking for an integration of dbt and ibis, you might be interested in dbt-ibis.
