# Using Delta Lake without Spark

This notebook shows how to use Delta Lake without Spark by using the [delta-rs](https://delta-io.github.io/delta-rs/) package.

You might want to use Delta Lake without Spark because:
- You don’t want to learn Spark
- Your team doesn’t use Spark
- You don’t want to use the Java Virtual Machine (JVM) 
- You are working with relatively small datasets

There are many ways to use Delta Lake without Spark. For clarity, let's group them into two categories:

1. Dedicated Delta connectors let you use Delta Lake from engines like Flink, Hive, Trino, PrestoDB, and many others.
2. The [delta-rs](https://delta-io.github.io/delta-rs/) package lets you use Delta Lake in Rust or Python, e.g. with pandas, polars, Dask, Daft, DuckDB, and many others.

This notebook focuses on **category (2): using Delta Lake with delta-rs**.

For more information on the dedicated Delta connectors, refer to the Delta Without Spark blog.


## Setup

Make sure to install all the necessary dependencies to run the code in this notebook.

For quick testing, uncomment and run the `pip install` cell below. It installs the latest available versions, which is usually fine for experimenting.

In [1]:
# # uncomment and run to pip install dependencies
# !pip install deltalake pandas polars getdaft dask-deltatable duckdb datafusion

For **reliable reproducibility**, create a virtual environment with pinned versions. For example, with `conda` you can store the following specs in a `.yml` file:

```
name: deltalake-no-spark
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - ipykernel
  - pandas==2.2.2
  - polars==0.20.30
  - jupyterlab
  - deltalake==0.17.4
  - dask-deltatable==0.3.1
  - duckdb==0.10.3
  - pip
  - pip:
      - getdaft
      - datafusion==37.1.0
```

See [creating a conda env from a .yml file](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file) for more information.
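
To confirm which versions actually ended up in your environment, a small check like the one below can help. This is a minimal sketch that uses only the standard library; the `installed_versions` helper and the `PACKAGES` list are illustrative names, not part of any of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used in the install commands above.
PACKAGES = [
    "deltalake",
    "pandas",
    "polars",
    "getdaft",
    "dask-deltatable",
    "duckdb",
    "datafusion",
]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:<16} {ver or 'not installed'}")
```

Comparing the printed versions against the pinned specs is a quick way to spot a mismatch before running the rest of the notebook.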

## Use Delta Lake with pandas

In [1]:
import pandas as pd
from deltalake import write_deltalake, DeltaTable

In [2]:
data = {'first_name': ['bob', 'li', 'leah'], 'age': [47, 23, 51]}
data_2 = {"first_name": ["suh", "anais"], "age": [33, 68]}

In [3]:
df = pd.DataFrame.from_dict(data)
write_deltalake("tmp/pandas-table", df)

In [4]:
print(DeltaTable("tmp/pandas-table/").to_pandas())

  first_name  age
0        bob   47
1         li   23
2       leah   51


In [5]:
df2 = pd.DataFrame(data_2)
write_deltalake("tmp/pandas-table", df2, mode="append")

In [6]:
print(DeltaTable("tmp/pandas-table/").to_pandas())

  first_name  age
0        suh   33
1      anais   68
2        bob   47
3         li   23
4       leah   51


In [7]:
print(DeltaTable("tmp/pandas-table/", version=0).to_pandas())

  first_name  age
0        bob   47
1         li   23
2       leah   51
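
Reading with `version=0` works because every write is recorded as a numbered commit file in the table's `_delta_log` directory. As a rough sketch, assuming the table lives on the local filesystem and its commit files have not been cleaned up, you can list the recorded versions with the standard library alone. The helper name `list_delta_versions` is hypothetical and not part of `deltalake`.

```python
import re
from pathlib import Path

# Commit files in the Delta transaction log are named with a zero-padded
# 20-digit version number, e.g. 00000000000000000000.json for version 0.
COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def list_delta_versions(table_path):
    """Return the sorted commit versions found in <table_path>/_delta_log."""
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return []
    versions = []
    for entry in log_dir.iterdir():
        match = COMMIT_FILE.match(entry.name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


print(list_delta_versions("tmp/pandas-table"))
```

For the table written above (one initial write plus one append), you would expect this to list versions 0 and 1. Any of the listed numbers can be passed as `version=` when loading the table.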


## Use Delta Lake with polars

In [8]:
import polars as pl

In [9]:
df = pl.DataFrame(data)
df.write_delta("tmp/polars_table")

In [10]:
print(pl.read_delta("tmp/polars_table"))

shape: (3, 2)
┌────────────┬─────┐
│ first_name ┆ age │
│ ---        ┆ --- │
│ str        ┆ i64 │
╞════════════╪═════╡
│ bob        ┆ 47  │
│ li         ┆ 23  │
│ leah       ┆ 51  │
└────────────┴─────┘


In [11]:
df = pl.DataFrame(data_2)
df.write_delta("tmp/polars_table", mode="append")

In [12]:
print(pl.read_delta("tmp/polars_table"))

shape: (5, 2)
┌────────────┬─────┐
│ first_name ┆ age │
│ ---        ┆ --- │
│ str        ┆ i64 │
╞════════════╪═════╡
│ suh        ┆ 33  │
│ anais      ┆ 68  │
│ bob        ┆ 47  │
│ li         ┆ 23  │
│ leah       ┆ 51  │
└────────────┴─────┘


In [13]:
print(pl.read_delta("tmp/polars_table", version=0))

shape: (3, 2)
┌────────────┬─────┐
│ first_name ┆ age │
│ ---        ┆ --- │
│ str        ┆ i64 │
╞════════════╪═════╡
│ bob        ┆ 47  │
│ li         ┆ 23  │
│ leah       ┆ 51  │
└────────────┴─────┘


## Use Delta Lake with Daft

In [1]:
import daft

In [5]:
# read existing delta table
df = daft.read_delta_lake("tmp/pandas-table")
print(df.collect())

                                                             

╭────────────┬───────╮
│ first_name ┆ age   │
│ ---        ┆ ---   │
│ Utf8       ┆ Int64 │
╞════════════╪═══════╡
│ bob        ┆ 47    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ li         ┆ 23    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ leah       ┆ 51    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ suh        ┆ 33    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ anais      ┆ 68    │
╰────────────┴───────╯

(Showing first 5 of 5 rows)




In [6]:
# query data
print(df.where(df["age"] > 40).collect())

                                                       

╭────────────┬───────╮
│ first_name ┆ age   │
│ ---        ┆ ---   │
│ Utf8       ┆ Int64 │
╞════════════╪═══════╡
│ bob        ┆ 47    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ leah       ┆ 51    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ anais      ┆ 68    │
╰────────────┴───────╯

(Showing first 3 of 3 rows)




In [None]:
# write to delta table
df.write_deltalake("tmp/daft-table", mode="overwrite")

## Use Delta Lake with Dask

❗️`dask-deltatable` (0.3.1) currently only works with `deltalake<=0.13.0`, so make sure to downgrade `deltalake` using the cell below.

You will need to restart your kernel for the changes to take effect.

In [20]:
%%capture
# dask-deltatable works with deltalake<=0.13.0
!pip install deltalake==0.13.0
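
After restarting the kernel, you may want to confirm that the downgrade took effect. The sketch below uses only the standard library; `version_tuple` is a hypothetical helper that handles plain `X.Y.Z` version strings and ignores any non-numeric suffix, so treat it as an approximation rather than a full version parser.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def version_tuple(version_string):
    """Turn a version such as '0.13.0' into (0, 13, 0), keeping the leading numeric parts."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


try:
    installed = version("deltalake")
except PackageNotFoundError:
    print("deltalake is not installed")
else:
    compatible = version_tuple(installed) <= (0, 13, 0)
    status = "compatible" if compatible else "too new for dask-deltatable 0.3.1"
    print(f"deltalake {installed}: {status}")
```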

In [1]:
import dask
dask.__version__

'2024.5.1'

❗️`dask-deltatable` (0.3.1) does not work with Dask's new query planner. You can disable the query planner as follows:

In [2]:
# run this cell if dask >= 2024.3.0
dask.config.set({'dataframe.query-planning': False})

<dask.config.set at 0x103d29150>

In [3]:
import dask_deltatable as ddt
import dask.dataframe as dd

In [4]:
data = {'first_name': ['bob', 'li', 'leah'], 'age': [47, 23, 51]}
data_2 = {"first_name": ["suh", "anais"], "age": [33, 68]}

In [5]:
ddf = dd.from_dict(data, npartitions=1)
ddf.compute()

  first_name  age
0        bob   47
1         li   23
2       leah   51


In [6]:
ddt.to_deltalake("tmp/dask-table", ddf)

In [7]:
# read delta table into Dask DataFrame
delta_path = "tmp/dask-table/"
ddf = ddt.read_deltalake(delta_path)

In [8]:
print(ddf.compute())

  first_name  age
0        bob   47
1         li   23
2       leah   51


In [9]:
ddf_2 = dd.from_dict(data_2, npartitions=1)
print(ddf_2.compute())

  first_name  age
0        suh   33
1      anais   68


In [10]:
ddt.to_deltalake("tmp/dask-table", ddf_2, mode="append")

In [11]:
# read delta table into Dask DataFrame
delta_path = "tmp/dask-table/"
ddf = ddt.read_deltalake(delta_path)

In [12]:
print(ddf.compute())

  first_name  age
0        bob   47
1         li   23
2       leah   51
0        suh   33
1      anais   68


In [13]:
# read delta table into Dask DataFrame
delta_path = "tmp/dask-table/"
ddf = ddt.read_deltalake(delta_path, version=0)
print(ddf.compute())

  first_name  age
0        bob   47
1         li   23
2       leah   51


## Use Delta Lake with DuckDB

In [14]:
import duckdb
from deltalake import write_deltalake, DeltaTable

In [15]:
# load in an existing delta table
dt = DeltaTable("tmp/pandas-table/")

In [16]:
# convert to Arrow dataset
arrow_data = dt.to_pyarrow_dataset()

# convert to a DuckDB relation
duck_data = duckdb.arrow(arrow_data)

In [17]:
query = """
select
  age
from duck_data
order by 1 desc
"""

duckdb.query(query)

┌───────┐
│  age  │
│ int64 │
├───────┤
│    68 │
│    51 │
│    47 │
│    33 │
│    23 │
└───────┘

In [18]:
# convert result to arrow
arrow_table = duckdb.query(query).to_arrow_table()

In [19]:
# write result to delta lake
write_deltalake(
    data=arrow_table, 
    table_or_uri="tmp/duckdb-table", 
)

In [20]:
dt = DeltaTable("tmp/duckdb-table/")
print(dt.to_pandas())

   age
0   68
1   51
2   47
3   33
4   23


In [21]:
query = """
select
  age
from duck_data
order by 1 desc
limit 3
"""

arrow_table = duckdb.query(query).to_arrow_table()

# write result to delta lake
write_deltalake(
    data=arrow_table, 
    table_or_uri="tmp/duckdb-table", 
    mode="overwrite",
)

In [22]:
dt = DeltaTable("tmp/duckdb-table/")
print(dt.to_pandas())

   age
0   68
1   51
2   47


In [23]:
dt = DeltaTable("tmp/duckdb-table/", version=0)
print(dt.to_pandas())

   age
0   68
1   51
2   47
3   33
4   23


## Use Delta Lake with DataFusion

In [28]:
from datafusion import SessionContext
from deltalake import write_deltalake, DeltaTable

In [29]:
ctx = SessionContext()

In [30]:
table = DeltaTable("tmp/pandas-table/")

In [31]:
arrow_data = table.to_pyarrow_dataset()
ctx.register_dataset("my_delta_table", arrow_data)

In [32]:
query = "select age from my_delta_table order by 1 desc"
ctx.sql(query)

DataFrame()
+-----+
| age |
+-----+
| 68  |
| 51  |
| 47  |
| 33  |
| 23  |
+-----+

In [33]:
arrow_table = ctx.sql(query).to_arrow_table()

write_deltalake(
    data=arrow_table, 
    table_or_uri="tmp/datafusion-table", 
)

In [34]:
dt = DeltaTable("tmp/datafusion-table/")
print(dt.to_pandas())

   age
0   68
1   51
2   47
3   33
4   23


In [35]:
query = "select age from my_delta_table order by 1 desc limit 3"

arrow_table = ctx.sql(query).to_arrow_table()

write_deltalake(
    data=arrow_table, 
    table_or_uri="tmp/datafusion-table", 
    mode="overwrite",
)

In [36]:
dt = DeltaTable("tmp/datafusion-table/")
print(dt.to_pandas())

   age
0   68
1   51
2   47


In [37]:
dt = DeltaTable("tmp/datafusion-table/", version=0)
print(dt.to_pandas())

   age
0   68
1   51
2   47
3   33
4   23
