# Dask-Snowflake Integration Demo

This notebook demonstrates the `dask-snowflake` integration package, which supports parallel reads from and writes to Snowflake in Python with Dask.

The notebook is structured as follows:
1. Set up remote Dask resources with Coiled
2. Connect to Snowflake
3. Write Data to Snowflake in Parallel
4. Read Data from Snowflake in Parallel
5. Use Dask to Train XGBoost on Snowflake Data

In [2]:
import os

In [3]:
# provide snowflake credentials here
os.environ["SNOWFLAKE_USER"] = ""
os.environ["SNOWFLAKE_PASSWORD"] = ""
os.environ["SNOWFLAKE_ACCOUNT"] = ""
os.environ["SNOWFLAKE_WAREHOUSE"] = ""
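
Empty credentials only fail later, when the first connection is attempted. A minimal sketch of an early check is shown here; `REQUIRED_VARS` and `missing_snowflake_vars` are hypothetical names introduced for illustration and are not part of `dask-snowflake` or the Snowflake connector.

```python
import os

# Variable names taken from the credentials cell above.
REQUIRED_VARS = (
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_WAREHOUSE",
)


def missing_snowflake_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper: `env` defaults to os.environ but accepts any
    mapping, which makes the check easy to try without real credentials.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_snowflake_vars()
if missing:
    print("Set these before continuing:", ", ".join(missing))
```

Running this right after the credentials cell lists whichever values were left blank.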

## 1. Set Up Dask Resources

We'll start by launching our remote Dask cluster resources using Coiled.

In [4]:
import coiled

In [5]:
# # optional, run once: create a Coiled software environment (Docker image) that will be distributed to all workers in our Dask cluster
# coiled.create_software_environment(
#     name="snowflake",
#     account="coiled-examples",
#     pip=[
#         "dask[distributed, dataframe, diagnostics]==2021.11.2",
#         "snowflake-connector-python",
#         "dask-snowflake",
#         "lz4",
#         "xgboost",
#     ],
# )

In [6]:
# spin up Coiled cluster
cluster = coiled.Cluster(
    name="coiled-snowflake",
    software="coiled-examples/snowflake",
    n_workers=20,
    shutdown_on_close=False,
    scheduler_options={'idle_timeout':'2 hours'},
    backend_options={'spot': True},
)

In [7]:
# connect cluster to Dask
from dask.distributed import Client
client = Client(cluster)
client


+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| pandas  | 1.3.4  | 1.3.5     | 1.3.5   |
+---------+--------+-----------+---------+


'http://34.205.89.212:8787'

## 2. Connect to Snowflake
Let's now connect our Python session to Snowflake using the Snowflake Python connector.

**NOTE:** For this section to work, the Snowflake Sample Data must be available in your account. Otherwise, change the query to one that fits your own data.

In [9]:
import os
import snowflake.connector

In [None]:
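# create Snowflake Python connector
ctx = snowflake.connector.connect(
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    # queries need an active warehouse unless the user has a default one
    warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
)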

Now run a sample query to test the connection:

In [None]:
# run sample code to test connection
cs = ctx.cursor()

schema = "TPCDS_SF100TCL"
table = "CALL_CENTER"

cs.execute("USE SNOWFLAKE_SAMPLE_DATA")
cs.execute("SELECT * FROM " + schema + "." + table)

one_row = str(cs.fetchone())

print(one_row)
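
The test query above builds its SQL by concatenating the schema and table names. If those names ever come from user input, it can help to validate them first. The following is a minimal sketch, assuming the names are plain unquoted identifiers (a letter or underscore followed by letters, digits, underscores, or `$`); `qualified_name` is a hypothetical helper, not a connector function, and the `LIMIT 1` is added here only because a single row is fetched.

```python
import re

# Plain unquoted identifier: a letter or underscore, then letters,
# digits, underscores, or dollar signs.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def qualified_name(*parts):
    """Join identifier parts with dots, rejecting anything unexpected."""
    for part in parts:
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"not a plain SQL identifier: {part!r}")
    return ".".join(parts)


# Same schema and table as the cell above.
query = f"SELECT * FROM {qualified_name('TPCDS_SF100TCL', 'CALL_CENTER')} LIMIT 1"
print(query)  # SELECT * FROM TPCDS_SF100TCL.CALL_CENTER LIMIT 1
```

The resulting string can be passed to `cs.execute(...)` in place of the concatenated one.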

## 3. Parallel Write to Snowflake

Now that we have launched our remote compute resources and tested our connection to Snowflake, let's generate some synthetic data with Dask and then write it to a Snowflake database in parallel.

In [None]:
import dask

In [21]:
# generate synthetic timeseries data
ddf = dask.datasets.timeseries(
    start="2021-01-01",
    end="2021-03-31",
)
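
To get a feel for how much data this generates, here is a back-of-the-envelope estimate. It assumes `dask.datasets.timeseries` is left at its defaults of one row per second (`freq="1s"`) and one partition per day (`partition_freq="1d"`); if your Dask version uses different defaults, the numbers change accordingly.

```python
from datetime import date

# Date range passed to dask.datasets.timeseries in the cell above.
start = date(2021, 1, 1)
end = date(2021, 3, 31)

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: one partition per day.
days = (end - start).days
# Assumption: one row per second.
rows = days * SECONDS_PER_DAY

print(days, rows)  # 89 7689600
```

So the write in this section moves on the order of several million rows, split across roughly 89 partitions that can be uploaded in parallel.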

In [22]:
# create warehouse and database
cs.execute("CREATE WAREHOUSE IF NOT EXISTS dask_snowflake_wh")
cs.execute("CREATE DATABASE IF NOT EXISTS dask_snowflake_db")
cs.execute("USE DATABASE dask_snowflake_db")

<snowflake.connector.cursor.SnowflakeCursor at 0x7f734f7ebc70>

In [23]:
from dask_snowflake import to_snowflake

In [24]:
connection_kwargs = {
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_PASSWORD"],
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
    "database": "dask_snowflake_db",
    "schema": "PUBLIC",
}
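
The same connection dictionary is needed again later with a different database and schema. One way to avoid repeating it is a small builder function. This is only a sketch: `build_connection_kwargs` is a hypothetical helper, and the dictionary keys are the ones used in the cell above.

```python
import os


def build_connection_kwargs(database, schema, env=None):
    """Build connection kwargs from environment-style settings.

    Hypothetical helper: `env` defaults to os.environ but accepts any
    mapping, so the function can be tried without real credentials.
    """
    env = os.environ if env is None else env
    return {
        "user": env["SNOWFLAKE_USER"],
        "password": env["SNOWFLAKE_PASSWORD"],
        "account": env["SNOWFLAKE_ACCOUNT"],
        "warehouse": env["SNOWFLAKE_WAREHOUSE"],
        "database": database,
        "schema": schema,
    }


# Placeholder values for illustration only.
demo_env = {
    "SNOWFLAKE_USER": "demo_user",
    "SNOWFLAKE_PASSWORD": "demo_password",
    "SNOWFLAKE_ACCOUNT": "demo_account",
    "SNOWFLAKE_WAREHOUSE": "demo_wh",
}
demo_kwargs = build_connection_kwargs("dask_snowflake_db", "PUBLIC", env=demo_env)
print(sorted(demo_kwargs))
```

In the notebook itself, calling `build_connection_kwargs("dask_snowflake_db", "PUBLIC")` without `env` would read the real credentials from `os.environ`.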

In [25]:
%%time
# write Dask dataframe to Snowflake in parallel
to_snowflake(
    ddf,
    name="dask_snowflake_table",
    connection_kwargs=connection_kwargs,
)


CPU times: user 936 ms, sys: 44.5 ms, total: 981 ms
Wall time: 1min 13s


## 4. Parallel Read from Snowflake
We can now read this data back into our Python session in parallel.

In [26]:
from dask_snowflake import read_snowflake

In [27]:
%%time
# read data from snowflake into a Dask dataframe
snowflake_data = read_snowflake(
    query="""
      SELECT *
      FROM dask_snowflake_table;
   """,
    connection_kwargs=connection_kwargs,
)

print(snowflake_data.head())

     ID      NAME         X         Y
0  1029   Norbert  0.652481 -0.937071
1   992     Laura  0.063575  0.909713
2  1002  Patricia  0.593139 -0.653950
3  1036       Dan -0.340827  0.678265
4  1042     Frank  0.052302  0.782666
CPU times: user 262 ms, sys: 7.89 ms, total: 270 ms
Wall time: 4.51 s


In [28]:
snowflake_data

                   ID    NAME        X        Y
npartitions=74
                int16  object  float64  float64
                  ...     ...      ...      ...
...               ...     ...      ...      ...
                  ...     ...      ...      ...
                  ...     ...      ...      ...


In [29]:
# perform computation over Snowflake data with Dask
result = snowflake_data.X.mean()
print(result.compute())

7.50774355061549e-05
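
A mean over a partitioned dataframe cannot simply average the per-partition means, because partitions may differ in size. Conceptually, each partition contributes a sum and a count, and those are combined at the end. The pure-Python sketch below illustrates that idea with made-up values; it is a simplification for intuition, not Dask's actual implementation.

```python
def partition_stats(values):
    """Per-partition step: reduce one partition to (sum, count)."""
    return sum(values), len(values)


def combine_mean(stats):
    """Final step: combine the per-partition (sum, count) pairs."""
    stats = list(stats)  # materialize so it can be traversed twice
    total = sum(s for s, _ in stats)
    count = sum(n for _, n in stats)
    return total / count


# Made-up partitions of unequal size.
partitions = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
mean = combine_mean(partition_stats(p) for p in partitions)
print(mean)  # 3.5
```

Only the small `(sum, count)` pairs need to travel between workers, which is why reductions like this scale well across many partitions.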


## 5. Machine Learning
After loading data into our Python session from Snowflake, we can use Python for what it's good at: free-form, iterative exploratory analysis and complex machine learning models.

Let's read in some data from Snowflake using the `dask-snowflake` connector and then train an XGBoost ML model on that data.

In [None]:
# define schema and query
SCHEMA = "SNOWFLAKE_SAMPLE_DATA.TPCH_SF100"

example_query = f"""
SELECT
    C_CUSTKEY,
    C_NAME,
    SUM(L_QUANTITY) AS sum_qty,
    SUM(PS_AVAILQTY) AS sum_avail_qty,
    MAX(P_RETAILPRICE) AS max_retail_price
FROM {SCHEMA}.CUSTOMER
JOIN {SCHEMA}.ORDERS
    ON C_CUSTKEY = O_CUSTKEY
JOIN {SCHEMA}.LINEITEM
    ON L_ORDERKEY = O_ORDERKEY
JOIN {SCHEMA}.PART
    ON P_PARTKEY = L_PARTKEY
JOIN {SCHEMA}.PARTSUPP
    ON P_PARTKEY = PS_PARTKEY
WHERE PS_SUPPLYCOST > 10
GROUP BY C_CUSTKEY, C_NAME
"""

In [None]:
# set connection parameters
connection_kwargs = {
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_PASSWORD"],
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
    "database": "SNOWFLAKE_SAMPLE_DATA",
    "schema": "TPCH_SF100",
}

In [None]:
%%time
# read in data from snowflake
ddf = read_snowflake(
    query=example_query,
    connection_kwargs=connection_kwargs,
)

In [None]:
import xgboost as xgb

In [None]:
# define predictor and target features
X = ddf[['SUM_AVAIL_QTY', 'MAX_RETAIL_PRICE']]
y = ddf.SUM_QTY

In [None]:
# create Dask DMatrix
dtrain = xgb.dask.DaskDMatrix(client, X, y)

In [None]:
%%time
# train XGBoost with Dask
output = xgb.dask.train(
    client,
    {
        'verbosity': 2,
        'tree_method': 'hist',
        'objective': 'reg:squarederror'
    },
    dtrain,
    num_boost_round=10,
    evals=[(dtrain, 'train')]
)

In [None]:
# make predictions
y_pred = xgb.dask.predict(client, output["booster"], X)
y_pred.compute()
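
To judge the predictions, a common choice for the `reg:squarederror` objective is root-mean-square error (RMSE), which is also the metric XGBoost reports for regression evaluation sets by default. The sketch below computes RMSE in plain Python on made-up numbers; `rmse` is a hypothetical helper, and wiring it to the computed values of `y` and `y_pred` is left out here.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up targets and predictions for illustration.
example_true = [3.0, 5.0, 7.0]
example_pred = [2.0, 5.0, 9.0]
print(rmse(example_true, example_pred))
```

Note that the model above is evaluated on the same data it was trained on, so an error computed this way says little about performance on unseen data.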

For more details on how to use distributed XGBoost training with Dask, see [this blog post](https://coiled.io/blog/dask-xgboost-python-example/).

## Summary

This notebook has demonstrated:
1. How to use the `dask-snowflake` connector for fast, parallel data transfer between Snowflake and Python.
2. How to use Dask to keep working with the Snowflake data in a Python session, for iterative EDA and/or machine learning tasks.

Join the [Dask Discourse](https://dask.discourse.group/) to continue the conversation!